Alibaba-NLP

68 models

gte-large-en-v1.5

English sentence-embedding model (tags: sentence-transformers, gte, mteb, transformers.js, sentence-similarity). Library: transformers. Trained on allenai/c4. License: apache-2.0. Sample MTEB result, AmazonCounterfactualClassification (en), test split: accuracy 73.01, AP 35.05.

3,963,021
228

gte-multilingual-base

Multilingual sentence-embedding model (tags: mteb, sentence-transformers, transformers, multilingual, sentence-similarity, text-embeddings-inference). License: apache-2.0. Supported languages include af, ar, az, be, bg, bn, ca, ceb, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ky, lo, lt, lv, mk, ml, mn, mr, ms, my, ne, nl, no, pa, pl, pt, qu, ro, ru, si, sk, sl, so, sq, sr, sv, sw, ta, te, th, tl, …

license:apache-2.0
1,412,178
329

gte-reranker-modernbert-base

English text-ranking model fine-tuned from answerdotai/ModernBERT-base (tags: sentence-transformers, transformers.js, text-embeddings-inference). Library: transformers. License: apache-2.0.

license:apache-2.0
598,665
77

gte-base-en-v1.5

English sentence-embedding model (tags: sentence-transformers, gte, mteb, transformers.js, sentence-similarity). Library: transformers. License: apache-2.0. Sample MTEB result, AmazonCounterfactualClassification (en), test split: accuracy 74.79, AP 37.05.

license:apache-2.0
335,754
69

gte-Qwen2-1.5B-instruct

gte-Qwen2-1.5B-instruct is the latest model in the gte (General Text Embedding) model family. The model is built on the Qwen2-1.5B LLM and uses the same training data and strategies as the gte-Qwen2-7B-instruct model.

- Integration of bidirectional attention mechanisms, enriching its contextual understanding.
- Instruction tuning, applied solely on the query side for streamlined efficiency.
- Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.

Model Information
- Model Size: 1.5B
- Embedding Dimension: 1536
- Max Input Tokens: 32k

See `config_sentence_transformers.json` for all pre-built prompt names. Otherwise, you can use `model.encode(queries, prompt="Instruct: ...\nQuery: ")` with a custom prompt of your choice. You can use `scripts/eval_mteb.py` to reproduce the following results of gte-Qwen2-1.5B-instruct on MTEB (English) / C-MTEB (Chinese):

| Model Name | MTEB(56) | C-MTEB(35) | MTEB-fr(26) | MTEB-pl(26) |
|:----:|:---------:|:----------:|:----------:|:----------:|
| bge-base-en-1.5 | 64.23 | - | - | - |
| bge-large-en-1.5 | 63.55 | - | - | - |
| gte-large-en-v1.5 | 65.39 | - | - | - |
| gte-base-en-v1.5 | 64.11 | - | - | - |
| mxbai-embed-large-v1 | 64.68 | - | - | - |
| acge_text_embedding | - | 69.07 | - | - |
| stella-mrl-large-zh-v3.5-1792d | - | 68.55 | - | - |
| gte-large-zh | - | 66.72 | - | - |
| multilingual-e5-base | 59.45 | 56.21 | - | - |
| multilingual-e5-large | 61.50 | 58.81 | - | - |
| e5-mistral-7b-instruct | 66.63 | 60.81 | - | - |
| gte-Qwen1.5-7B-instruct | 67.34 | 69.52 | - | - |
| NV-Embed-v1 | 69.32 | - | - | - |
| gte-Qwen2-7B-instruct | 70.24 | 72.05 | 68.25 | 67.86 |
| gte-Qwen2-1.5B-instruct | 67.16 | 67.65 | 66.60 | 64.04 |

The gte series has consistently released two types of models: encoder-only models (based on the BERT architecture) and decoder-only models (based on the LLM architecture).

| Models | Language | Max Sequence Length | Dimension | Model Size (Memory Usage, fp32) |
|:---:|:---:|:---:|:---:|:---:|
| GTE-large-zh | Chinese | 512 | 1024 | 1.25GB |
| GTE-base-zh | Chinese | 512 | 512 | 0.41GB |
| GTE-small-zh | Chinese | 512 | 512 | 0.12GB |
| GTE-large | English | 512 | 1024 | 1.25GB |
| GTE-base | English | 512 | 512 | 0.21GB |
| GTE-small | English | 512 | 384 | 0.10GB |
| GTE-large-en-v1.5 | English | 8192 | 1024 | 1.74GB |
| GTE-base-en-v1.5 | English | 8192 | 768 | 0.51GB |
| GTE-Qwen1.5-7B-instruct | Multilingual | 32000 | 4096 | 26.45GB |
| GTE-Qwen2-7B-instruct | Multilingual | 32000 | 3584 | 26.45GB |
| GTE-Qwen2-1.5B-instruct | Multilingual | 32000 | 1536 | 6.62GB |

In addition to the open-source GTE series models, GTE models are also available as commercial API services on Alibaba Cloud.
- Embedding Models: Three versions of the text embedding models are available: text-embedding-v1/v2/v3, with v3 being the latest API service.
- ReRank Models: The gte-rerank model service is available.

Note that the models behind the commercial APIs are not entirely identical to the open-source models. GTE models can be fine-tuned with the third-party framework SWIFT. If you find our paper or models helpful, please consider citing it.
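The card above passes a custom instruct prompt to `model.encode` and retrieves by embedding similarity. As a minimal sketch of the downstream step, the snippet below ranks documents by cosine similarity against a query embedding; the small vectors are stand-ins for what `SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct").encode(...)` would return, so the model call itself appears only in comments (an assumption about typical sentence-transformers usage, not run here):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_documents(query_vec, doc_vecs):
    # Sort document indices by descending cosine similarity to the query.
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# In real use (hypothetical sketch, not executed here):
#   model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
#   query_vec = model.encode(query, prompt="Instruct: ...\nQuery: ")
#   doc_vecs = model.encode(documents)  # no prompt on the document side
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[0.9, 0.1, 0.8],   # close to the query
            [0.0, 1.0, 0.0],   # nearly orthogonal
            [1.0, 0.0, 0.9]]   # closest
print(rank_documents(query_vec, doc_vecs))
```

Note that only queries get the instruction prefix; documents are encoded as-is, which is the asymmetry the "instruction tuning on the query side" bullet describes.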

license:apache-2.0
103,573
225

gte-Qwen2-7B-instruct

license:apache-2.0
94,139
471

gme-Qwen2-VL-2B-Instruct

We are excited to present the `GME-Qwen2VL` series of unified multimodal embedding models, built on the advanced Qwen2-VL multimodal large language models (MLLMs). The `GME` models support three types of input: text, image, and image-text pair, all of which produce universal vector representations with powerful retrieval performance.

- Unified Multimodal Representation: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation. This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image search.
- High Performance: Achieves state-of-the-art (SOTA) results on our universal multimodal retrieval benchmark (UMRB) and demonstrates strong evaluation scores on the Massive Text Embedding Benchmark (MTEB).
- Dynamic Image Resolution: Benefiting from `Qwen2-VL` and our training data, GME models support dynamic-resolution image input.
- Strong Visual Retrieval Performance: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots. This capability is particularly beneficial for complex document-understanding scenarios, such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.

Paper: GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Model List

| Models | Model Size | Max Seq. Length | Dimension | MTEB-en | MTEB-zh | UMRB |
|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| `gme-Qwen2-VL-2B` | 2.21B | 32768 | 1536 | 65.27 | 66.92 | 64.45 |
| `gme-Qwen2-VL-7B` | 8.29B | 32768 | 3584 | 67.48 | 69.73 | 67.44 |

The remote code has some issues with `transformers>=4.52.0`; please downgrade or use `sentence-transformers`. The `encode` function accepts a `str`, or a `dict` with key(s) in `{'text', 'image', 'prompt'}`. Do not pass `prompt` as an argument to `encode`; pass the input as a `dict` with a `prompt` key.

We validated performance on our universal multimodal retrieval benchmark (UMRB; see Release UMRB), among others. Tasks are grouped as single-modal (T→T, I→I), cross-modal (T→I, T→VD, I→T), and fused-modal (T→IT, IT→T, IT→I, IT→IT); the number of datasets per task is given in parentheses.

| Model | Size | T→T (16) | I→I (1) | T→I (4) | T→VD (10) | I→T (4) | T→IT (2) | IT→T (5) | IT→I (2) | IT→IT (3) | Avg. (47) |
|---|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| VISTA | 0.2B | 55.15 | 31.98 | 32.88 | 10.12 | 31.23 | 45.81 | 53.32 | 8.97 | 26.26 | 37.32 |
| CLIP-SF | 0.4B | 39.75 | 31.42 | 59.05 | 24.09 | 62.95 | 66.41 | 53.32 | 34.9 | 55.65 | 43.66 |
| One-Peace | 4B | 43.54 | 31.27 | 61.38 | 42.9 | 65.59 | 42.72 | 28.29 | 6.73 | 23.41 | 42.01 |
| DSE | 4.2B | 48.94 | 27.92 | 40.75 | 78.21 | 52.54 | 49.62 | 35.44 | 8.36 | 40.18 | 50.04 |
| E5-V | 8.4B | 52.41 | 27.36 | 46.56 | 41.22 | 47.95 | 54.13 | 32.9 | 23.17 | 7.23 | 42.52 |
| GME-Qwen2-VL-2B | 2.2B | 55.93 | 29.86 | 57.36 | 87.84 | 61.93 | 76.47 | 64.58 | 37.02 | 66.47 | 64.45 |
| GME-Qwen2-VL-7B | 8.3B | 58.19 | 31.89 | 61.35 | 89.92 | 65.83 | 80.94 | 66.18 | 42.56 | 73.62 | 67.44 |

The English tab of the MTEB Leaderboard shows the text-embedding performance of our model. More detailed experimental results can be found in the paper.

Limitations
- Single Image Input: In `Qwen2-VL`, an image can be converted into a very large number of visual tokens. We limit the number of visual tokens to 1024 to obtain good training efficiency. Due to the lack of relevant data, our models and evaluations use a single image.
- English-only Training: Our models are trained on English data only. Although the `Qwen2-VL` models are multilingual, multilingual-multimodal embedding performance is not guaranteed.

We will extend to multi-image input, image-text interleaved data, and multilingual data in a future version.

We encourage and value diverse applications of GME models and continuous enhancements to the models themselves.
- If you distribute or make GME models (or any derivative works) available, or if you create a product or service (including another AI model) that incorporates them, you must prominently display `Built with GME` on your website, user interface, blog post, About page, or product documentation.
- If you utilize GME models or their outputs to develop, train, fine-tune, or improve an AI model that is distributed or made available, you must prefix the name of any such AI model with `GME`.

In addition to the open-source GME series models, GME models are also available as commercial API services on Alibaba Cloud.
- MultiModal Embedding Models: The `multimodal-embedding-v1` model service is available.

Note that the models behind the commercial APIs are not entirely identical to the open-source models.

We have open positions for Research Interns and Full-Time Researchers to join our team at Tongyi Lab. We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, retrieval-augmented generation (RAG), and agent-based systems. Our team is located in Beijing and Hangzhou, offering a collaborative and dynamic work environment where you can contribute to cutting-edge advancements in artificial intelligence and machine learning. If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to [email protected].

Citation: If you find our paper or models helpful, please consider citing it.
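The card states that `encode` accepts either a plain string or a dict with keys in `{'text', 'image', 'prompt'}`, and that a prompt must travel inside the dict rather than as a separate argument. A minimal, illustrative sketch of assembling such inputs (pure Python; the helper name is mine, not part of the GME API, and the real `gme.encode(...)` call is only indicated in a comment):

```python
def make_gme_input(text=None, image=None, prompt=None):
    """Build an encode input: a bare string for text-only,
    otherwise a dict restricted to the documented keys."""
    item = {k: v for k, v in
            {"text": text, "image": image, "prompt": prompt}.items()
            if v is not None}
    if not item:
        raise ValueError("need at least one of text/image/prompt")
    # Text-only inputs may be passed as a plain string.
    if set(item) == {"text"}:
        return text
    return item

# Text-only query:
q = make_gme_input(text="find diagrams of transformer attention")
# Image plus instruction: the prompt lives inside the dict,
# never as a keyword argument to encode.
doc = make_gme_input(image="paper_page3.png",
                     prompt="Represent the document screenshot for retrieval.")
# Hypothetical real call: embeddings = gme.encode([q, doc])
print(q, doc)
```

This mirrors the card's warning: the only supported way to attach a prompt is as a `prompt` key inside the input dict.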

license:apache-2.0
64,400
114

gte-multilingual-reranker-base

The gte-multilingual-reranker-base model is the first reranker model in the GTE family of models, featuring several key attributes: - High Performance: Achieves state-of-the-art (SOTA) results in m...

license:apache-2.0
49,026
162

gte-modernbert-base

license:apache-2.0
43,086
183

Tongyi-DeepResearch-30B-A3B

We present Tongyi DeepResearch, an agentic large language model featuring 30 billion total parameters, with only 3 billion activated per token. Developed by Tongyi Lab, the model is specifically designed for long-horizon, deep information-seeking tasks. Tongyi-DeepResearch demonstrates state-of-the-art performance across a range of agentic search benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, GAIA, xbench-DeepSearch, and FRAMES.

- ⚙️ Fully automated synthetic data generation pipeline: We design a highly scalable data-synthesis pipeline that is fully automatic and empowers agentic pre-training, supervised fine-tuning, and reinforcement learning.
- 🔄 Large-scale continual pre-training on agentic data: Leveraging diverse, high-quality agentic interaction data to extend model capabilities, maintain freshness, and strengthen reasoning performance.
- 🔁 End-to-end reinforcement learning: We employ a strictly on-policy RL approach based on a customized Group Relative Policy Optimization framework, with token-level policy gradients, leave-one-out advantage estimation, and selective filtering of negative samples to stabilize training in a non-stationary environment.
- 🤖 Agent inference paradigm compatibility: At inference, Tongyi-DeepResearch is compatible with two paradigms: ReAct, for rigorously evaluating the model's core intrinsic abilities, and an IterResearch-based "Heavy" mode, which uses a test-time scaling strategy to unlock the model's maximum performance ceiling.

You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/DeepResearch.
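The RL bullet above mentions leave-one-out advantage estimation over a group of sampled rollouts. As an illustrative sketch (my formulation of the standard leave-one-out baseline, not the exact DeepResearch implementation): each rollout's advantage is its reward minus the mean reward of the other rollouts in the same group.

```python
def leave_one_out_advantages(rewards):
    # For each rollout i, baseline_i = mean of the other rewards in the
    # group; advantage_i = r_i - baseline_i. Advantages sum to zero.
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Binary task rewards for a group of 4 rollouts:
print(leave_one_out_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline for rollout i excludes r_i itself, the estimate stays unbiased while still being computed from the group alone, with no learned value function.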

license:apache-2.0
13,249
757

gme-Qwen2-VL-7B-Instruct

license:apache-2.0
3,445
61

Simulation_LLM_google_14B_V2

819
1

gte-en-mlm-large

license:apache-2.0
671
7

Simulation_LLM_google_7B_V2

481
1

gte-en-mlm-base

license:apache-2.0
367
7

gte-multilingual-mlm-base

license:apache-2.0
303
15

gte-Qwen1.5-7B-instruct

license:apache-2.0
212
107

E2Rank-0.6B

E2Rank: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker

🤖 Website | 📄 Arxiv Paper | 🤗 Huggingface Collection | 🚩 Citation

We introduce E2Rank, meaning Efficient Embedding-based Ranking (also Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking, achieving strong effectiveness with remarkable efficiency. Using cosine similarity between query and document embeddings as a unified ranking function, the listwise ranking prompt, constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance.

Empirically, E2Rank achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy, and highlight the potential of single embedding models to serve as unified retrieval-reranking engines: a practical, efficient, and accurate alternative to complex multi-stage ranking systems.

| Supported Task | Model Name | Size | Layers | Sequence Length | Embedding Dimension | Instruction Aware |
|---|---|---|---|---|---|---|
| Embedding + Reranking | Alibaba-NLP/E2Rank-0.6B | 0.6B | 28 | 32K | 1024 | Yes |
| Embedding + Reranking | Alibaba-NLP/E2Rank-4B | 4B | 36 | 32K | 2560 | Yes |
| Embedding + Reranking | Alibaba-NLP/E2Rank-8B | 8B | 36 | 32K | 4096 | Yes |
| Embedding Only | Alibaba-NLP/E2Rank-0.6B-Embedding-Only | 0.6B | 28 | 32K | 1024 | Yes |
| Embedding Only | Alibaba-NLP/E2Rank-4B-Embedding-Only | 4B | 36 | 32K | 2560 | Yes |
| Embedding Only | Alibaba-NLP/E2Rank-8B-Embedding-Only | 8B | 36 | 32K | 4096 | Yes |

> Note:
> - `Embedding Only` indicates that the model is trained only with contrastive learning and supports only embedding tasks, while `Embedding + Reranking` indicates the full E2Rank model trained with both embedding and reranking objectives (for more details, please refer to the [paper]()).
> - `Instruction Aware` notes whether the model supports customizing the input instruction for different tasks.
> - The `Listwise Reranking` baselines below are supervised fine-tuned from the Qwen3 models in the RankGPT paradigm and support only the reranking task.

The usage of E2Rank as an embedding model is similar to Qwen3-Embedding. The only difference is that Qwen3-Embedding automatically appends an EOS token, while E2Rank requires users to manually append the special token ` ` at the end of each input text. To use E2Rank as a reranker, you only need to perform additional processing on the query, adding (part of) the documents to be reranked to the listwise prompt; the rest is the same as using the embedding model. Since E2Rank extends a single text embedding model to perform both high-quality retrieval and listwise reranking, you can directly use it to build an end-to-end search system. By reusing the embeddings computed during the retrieval stage, E2Rank only needs to compute the pseudo query's embedding and can efficiently rerank the retrieved documents with minimal additional computational overhead.

BEIR reranking results:

| | Covid | NFCorpus | Touche | DBPedia | SciFact | Signal | News | Robust | Avg. |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BM25 | 59.47 | 30.75 | 44.22 | 31.80 | 67.89 | 33.05 | 39.52 | 40.70 | 43.43 |
| Zero-shot Listwise Reranker | | | | | | | | | |
| RankGPT-4o | 83.41 | 39.67 | 32.26 | 45.56 | 77.41 | 34.20 | 51.92 | 60.25 | 53.09 |
| RankGPT-4o-mini | 80.03 | 38.73 | 30.91 | 44.54 | 73.14 | 33.64 | 50.91 | 57.41 | 51.16 |
| RankQwen3-14B | 84.45 | 38.94 | 38.30 | 44.52 | 78.64 | 33.58 | 51.24 | 59.66 | 53.67 |
| RankQwen3-32B | 83.48 | 39.22 | 37.13 | 45.00 | 78.22 | 32.12 | 51.08 | 60.74 | 53.37 |
| Fine-tuned Listwise Reranker based on Qwen3 | | | | | | | | | |
| RankQwen3-0.6B | 78.35 | 36.41 | 37.54 | 39.19 | 71.01 | 30.96 | 44.43 | 46.31 | 48.03 |
| RankQwen3-4B | 83.91 | 39.88 | 32.66 | 43.91 | 76.37 | 32.15 | 50.81 | 59.36 | 52.38 |
| RankQwen3-8B | 85.37 | 40.05 | 31.73 | 45.44 | 78.96 | 32.48 | 52.36 | 60.72 | 53.39 |
| Ours | | | | | | | | | |
| E2Rank-0.6B | 79.17 | 38.60 | 41.91 | 41.96 | 73.43 | 35.26 | 52.75 | 53.67 | 52.09 |
| E2Rank-4B | 83.30 | 39.20 | 43.16 | 42.95 | 77.19 | 34.48 | 52.71 | 60.16 | 54.14 |
| E2Rank-8B | 84.09 | 39.08 | 42.06 | 43.44 | 77.49 | 34.01 | 54.25 | 60.34 | 54.35 |

MTEB results:

| Models | Retr. | Rerank. | Clust. | PairClass. | Class. | STS | Summ. | Avg. |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Instructor-xl | 49.26 | 57.29 | 44.74 | 86.62 | 73.12 | 83.06 | 32.32 | 61.79 |
| BGE-large-en-v1.5 | 54.29 | 60.03 | 46.08 | 87.12 | 75.97 | 83.11 | 31.61 | 64.23 |
| GritLM-7B | 53.10 | 61.30 | 48.90 | 86.90 | 77.00 | 82.80 | 29.40 | 64.70 |
| E5-Mistral-7b-v1 | 52.78 | 60.38 | 47.78 | 88.47 | 76.80 | 83.77 | 31.90 | 64.56 |
| Echo-Mistral-7b-v1 | 55.52 | 58.14 | 46.32 | 87.34 | 77.43 | 82.56 | 30.73 | 64.68 |
| LLM2Vec-Mistral-7B | 55.99 | 58.42 | 45.54 | 87.99 | 76.63 | 84.09 | 29.96 | 64.80 |
| LLM2Vec-Meta-LLaMA-3-8B | 56.63 | 59.68 | 46.45 | 87.80 | 75.92 | 83.58 | 30.94 | 65.01 |
| E2Rank-0.6B | 51.74 | 55.97 | 40.85 | 83.93 | 73.66 | 81.41 | 30.90 | 61.25 |
| E2Rank-4B | 55.33 | 59.10 | 44.27 | 87.14 | 77.08 | 84.03 | 30.06 | 64.47 |
| E2Rank-8B | 56.89 | 59.58 | 44.75 | 86.96 | 76.81 | 84.52 | 30.23 | 65.03 |

> Note: For baselines, we only compare with models trained on public datasets.

If you have any questions, feel free to contact us via qiliu6777[AT]gmail.com or create an issue.
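The reranking recipe described in the E2Rank card turns the query plus its top-K candidates into an enriched "pseudo query" whose embedding is compared against the cached document embeddings. A schematic sketch of that flow follows; the prompt template and the `<EMB>` token are my illustrative stand-ins (the card leaves the actual special token and prompt format to its usage examples):

```python
def build_listwise_prompt(query, candidates, k=3, embed_token="<EMB>"):
    # Pseudo query: the original query followed by the top-k candidate
    # texts, closed by the model's special embedding token (placeholder
    # here; the real token is defined by the E2Rank tokenizer).
    lines = [f"Query: {query}"]
    for i, doc in enumerate(candidates[:k], start=1):
        lines.append(f"Doc {i}: {doc}")
    return "\n".join(lines) + embed_token

def rerank(doc_embeddings, pseudo_query_embedding, similarity):
    # Reuse the retrieval-stage document embeddings: one new embedding
    # call (the pseudo query), then a similarity sort.
    return sorted(range(len(doc_embeddings)),
                  key=lambda i: similarity(pseudo_query_embedding,
                                           doc_embeddings[i]),
                  reverse=True)

prompt = build_listwise_prompt("what is PRF?",
                               ["doc a", "doc b", "doc c", "doc d"], k=2)
print(prompt)
```

Because only the pseudo query needs a fresh forward pass, the reranking cost is one embedding call plus a sort, which is the efficiency argument the card makes.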

license:apache-2.0
201
6

WebSailor-32B

license:apache-2.0
201
0

WebDancer-32B

This model was presented in the paper WebDancer: Towards Autonomous Information Seeking Agency. You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/WebAgent.

- A native agentic-search reasoning model using the ReAct framework, aimed at autonomous information-seeking agency and Deep Research-like capability.
- We introduce a four-stage training paradigm comprising browsing-data construction, trajectory sampling, supervised fine-tuning for an effective cold start, and reinforcement learning for improved generalization, enabling the agent to autonomously acquire search and reasoning skills.
- Our data-centric approach integrates trajectory-level supervised fine-tuning and reinforcement learning (DAPO) to develop a scalable pipeline for training agentic systems via SFT or RL.
- WebDancer achieves a Pass@3 score of 61.1% on GAIA and 54.6% on WebWalkerQA.

license:mit
60
57

E2Rank-4B

(Same E2Rank model card as E2Rank-0.6B above.)

license:apache-2.0
60
2

WebSailor-3B

You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/WebAgent.

- WebSailor is a complete post-training methodology designed to teach LLM agents sophisticated reasoning for complex web-navigation and information-seeking tasks. It addresses the challenge of extreme uncertainty in vast information landscapes, a capability where previous open-source models lagged behind proprietary systems.
- We classify information-seeking tasks into three difficulty levels, where Level 3 represents problems with both high uncertainty and a complex, non-linear path to a solution. To generate these challenging tasks, we introduce SailorFog-QA, a novel data-synthesis pipeline that constructs intricate knowledge graphs and then applies information obfuscation. This process creates questions with high initial uncertainty that demand creative exploration and transcend simple, structured reasoning patterns.
- Our training process begins by generating expert trajectories and then reconstructing the reasoning to create concise, action-oriented supervision signals, avoiding the stylistic and verbosity issues of teacher models. The agent is first given a "cold start" using rejection-sampling fine-tuning (RFT) on a small set of high-quality examples to establish a baseline capability. This is followed by an efficient agentic reinforcement-learning stage using our Duplicating Sampling Policy Optimization (DUPO) algorithm, which refines the agent's exploratory strategies.
- WebSailor establishes a new state-of-the-art for open-source agents, achieving outstanding results on difficult benchmarks like BrowseComp-en and BrowseComp-zh. Notably, our smaller models like WebSailor-7B outperform agents built on much larger backbones, highlighting the efficacy of our training paradigm. Ultimately, WebSailor closes the performance gap with proprietary systems, achieving results on par with agents like Doubao-Search.

license:apache-2.0
53
74

GVE-3B

> One Embedder for All Video Retrieval Scenarios
> Queries of text, image, video, or any combination of modalities: GVE understands them all for representations, zero-shot, without in-domain training.

GVE is the first video embedding model that generalizes across 9 abilities, including 3 diverse retrieval tasks and 6 domains, from coarse text-to-video to fine-grained spatial/temporal queries, composed (text+image) queries, and long-context retrieval, all evaluated on our new Universal Video Retrieval Benchmark (UVRB). Built on Qwen2.5-VL and trained only with LoRA on 13M collected and synthesized multimodal data, GVE achieves SOTA zero-shot performance over its competitors.

| Capability | Existing Works | GVE |
|---|---|---|
| Query Flexibility | Only text | ✅ Text, ✅ Image, ✅ Video, ✅ Text+Image, ✅ Text+Video |
| Fine-grained Understanding | Weak on spatial-temporal details | S: 0.821, T: 0.469 (SOTA) |
| Training Data | Uses in-domain test data (e.g., MSRVTT) | Synthesized data: true zero-shot |
| Performance | Unite-7B (8.3B): 55.9 | GVE-3B (3.8B): 0.571, better at half the size; GVE-7B: 0.600 |

- TXT: Textual Video Retrieval
- CMP: Composed Video Retrieval
- VIS: Visual Video Retrieval
- CG: Coarse-grained Video Retrieval
- FG: Fine-grained Video Retrieval
- LC: Long-Context Video Retrieval
- S: Spatial Video Retrieval
- T: Temporal Video Retrieval
- PR: Partially Relevant Video Retrieval

> For each column: the highest score is bolded, the second-highest is underlined.
| Model | AVG | TXT | CMP | VIS | CG | FG | LC | S | T | PR |
|-------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| CLIP4Clip | 0.416 | 0.401 | 0.178 | 0.714 | 0.380 | 0.360 | 0.463 | 0.559 | 0.285 | 0.236 |
| ViCLIP | 0.375 | 0.336 | 0.263 | 0.640 | 0.380 | 0.315 | 0.313 | 0.484 | 0.289 | 0.171 |
| VideoCLIP-XL | 0.510 | 0.550 | 0.227 | 0.632 | 0.558 | 0.493 | 0.600 | 0.787 | 0.381 | 0.310 |
| LanguageBind | 0.508 | 0.543 | 0.231 | 0.645 | 0.539 | 0.479 | 0.610 | 0.723 | 0.378 | 0.336 |
| InternVideo2-1B | 0.420 | 0.422 | 0.248 | 0.581 | 0.480 | 0.403 | 0.383 | 0.606 | 0.413 | 0.189 |
| InternVideo2-6B | 0.445 | 0.448 | 0.220 | 0.660 | 0.504 | 0.417 | 0.423 | 0.631 | 0.400 | 0.220 |
| GME-2B | 0.416 | 0.539 | 0.345 | 0.597 | 0.461 | 0.471 | 0.685 | 0.716 | 0.349 | 0.347 |
| Unite-2B | 0.507 | 0.536 | 0.242 | 0.654 | 0.455 | 0.471 | 0.681 | 0.725 | 0.347 | 0.341 |
| VLM2Vec-V2 | 0.538 | 0.587 | 0.263 | 0.613 | 0.498 | 0.502 | 0.762 | 0.809 | 0.348 | 0.348 |
| BGE-VL | 0.480 | 0.497 | 0.268 | 0.622 | 0.448 | 0.406 | 0.636 | 0.664 | 0.292 | 0.261 |
| UniME-7B | 0.542 | 0.561 | 0.308 | 0.702 | 0.500 | 0.518 | 0.664 | 0.785 | 0.396 | 0.373 |
| B3-7B | 0.538 | 0.570 | 0.270 | 0.678 | 0.482 | 0.505 | 0.722 | 0.797 | 0.364 | 0.355 |
| GME-7B | 0.562 | 0.604 | 0.341 | 0.615 | 0.518 | 0.507 | 0.788 | 0.749 | 0.373 | 0.398 |
| Unite-7B | 0.559 | 0.609 | 0.254 | 0.666 | 0.541 | 0.539 | 0.746 | 0.779 | 0.412 | 0.425 |
| GVE-3B | 0.571 | 0.619 | 0.304 | 0.647 | 0.552 | 0.541 | 0.764 | 0.816 | 0.430 | 0.377 |
| GVE-7B | 0.600 | 0.657 | 0.312 | 0.657 | 0.587 | 0.570 | 0.814 | 0.821 | 0.469 | 0.419 |

license:apache-2.0
46
11

E2Rank-0.6B-Embedding-Only

37
1

E2Rank-8B

license:apache-2.0
36
2

E2Rank-4B-Embedding-Only

30
1

E2Rank-8B-Embedding-Only

30
1

ERank-4B

license:apache-2.0
27
10

GVE-7B

license:apache-2.0
27
8

Simulation_LLM_wiki_7B_V2

17
1

ERank-32B

license:apache-2.0
15
3

WebSailor-7B

license:apache-2.0
14
11

Simulation_LLM_wiki_3B_V2

14
1

ERank-14B

ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking

We introduce ERank, a highly effective and efficient pointwise reranker built from a reasoning LLM, which excels across diverse relevance scenarios with low latency. Surprisingly, it also outperforms recent listwise rerankers on the most challenging reasoning-intensive tasks.

ERank is trained with a novel two-stage training pipeline: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, unlike traditional pointwise rerankers that train the LLM for binary relevance classification, we encourage the LLM to generatively output fine-grained integer scores. In the RL stage, we introduce a novel listwise-derived reward, which instills global ranking awareness into the efficient pointwise architecture.

We provide the trained reranking models in various sizes (4B, 14B, and 32B), all of which support customizing the input instruction for different tasks.

| Model | Size | Layers | Sequence Length | Instruction Aware |
|------------------------------------------|------|--------|-----------------|-------------------|
| ERank-4B | 4B | 36 | 32K | Yes |
| ERank-14B | 14B | 40 | 128K | Yes |
| ERank-32B | 32B | 64 | 128K | Yes |

We evaluate ERank on both reasoning-intensive benchmarks (BRIGHT and FollowIR) and traditional semantic relevance benchmarks (BEIR and TREC DL). All methods use the original queries without hybrid scores.
| Paradigm | Method | Average | BRIGHT | FollowIR | BEIR | TREC DL |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| - | First-stage retriever | 25.9 | 13.7 | 0 | 40.8 | 49.3 |
| Listwise | Rank-R1-7B | 34.6 | 15.7 | 3.6 | 49.0 | 70.0 |
| Listwise | Rearank-7B | 35.3 | 17.4 | 2.3 | 49.0 | 72.5 |
| Pointwise | JudgeRank-8B | 32.1 | 17.0 | 9.9 | 39.1 | 62.6 |
| Pointwise | Rank1-7B | 34.6 | 18.2 | 9.1 | 44.2 | 67.1 |
| Pointwise | ERank-4B (Ours) | 36.8 | 22.7 | 11.0 | 44.8 | 68.9 |
| Pointwise | ERank-14B (Ours) | 36.9 | 23.1 | 10.3 | 47.1 | 67.1 |
| Pointwise | ERank-32B (Ours) | 38.1 | 24.4 | 12.1 | 47.7 | 68.1 |

On the most challenging BRIGHT benchmark, with the top-100 documents retrieved by ReasonIR-8B using GPT-4 reason-queries, ERank with the BM25 hybrid achieves state-of-the-art nDCG@10.

| Method | nDCG@10 |
| :--- | :--- |
| ReasonIR-8B | 30.5 |
| Rank-R1-7B | 24.1 |
| Rank1-7B | 24.3 |
| Rearank-7B | 27.5 |
| JudgeRank-8B | 20.2 |
| + BM25 hybrid | 22.7 |
| Rank-R1-32B-v0.2 | 37.7 |
| + BM25 hybrid | 40.0 |
| ERank-4B (Ours) | 30.5 |
| + BM25 hybrid | 38.7 |
| ERank-14B (Ours) | 31.8 |
| + BM25 hybrid | 39.3 |
| ERank-32B (Ours) | 32.8 |
| + BM25 hybrid | 40.2 |

Since ERank is a pointwise reranker, it has low latency compared with listwise models. We have implemented inference code based on Transformers and vLLM, respectively. Please refer to the `examples` directory for details, which also provides the instructions used in the prompt during evaluation.

Citation

If you find our work helpful, please consider citing us.
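The "+ BM25 hybrid" rows fuse the reranker's score with a lexical BM25 score. The card does not spell out the fusion rule, so the sketch below uses a common approach: min-max normalize each score list per query, then interpolate. The weight `alpha` is an illustrative assumption, not ERank's actual setting.

```python
def min_max_norm(scores):
    """Rescale a list of scores to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(rerank_scores, bm25_scores, alpha=0.5):
    """Interpolate normalized reranker and BM25 scores for one query's candidates."""
    r = min_max_norm(rerank_scores)
    b = min_max_norm(bm25_scores)
    return [alpha * ri + (1 - alpha) * bi for ri, bi in zip(r, b)]

# A document that is mediocre lexically but strong for the reranker
# (and vice versa) ends up with a balanced hybrid score.
print(hybrid_scores([0.0, 5.0, 10.0], [10.0, 0.0, 5.0], alpha=0.5))
```

Normalizing before interpolation matters because raw BM25 scores and LLM-generated integer scores live on very different scales.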

license:apache-2.0
12
3

WebWatcher-7B

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

🥇 Introduction

In this paper, we introduce WebWatcher, a multimodal agent for deep research that possesses enhanced visual-language reasoning capabilities. Our work presents a unified framework that combines complex vision-language reasoning with multi-tool interaction.

- BrowseComp-VL Benchmark: We propose a new benchmark, BrowseComp-VL, to evaluate the capabilities of multimodal agents. This challenging dataset is designed for in-depth multimodal reasoning and strategic planning, mirroring the complexity of BrowseComp but extending it into the visual domain. It emphasizes tasks that require both visual perception and advanced information-gathering abilities.
- Automated Trajectory Generation: To provide robust tool-use capabilities, we developed an automated pipeline to generate high-quality, multi-step reasoning trajectories. These trajectories, which are grounded in actual tool-use behavior and reflect procedural decision-making, are used for efficient cold-start training and further optimization via reinforcement learning. The agent is equipped with several tools, including Web Image Search, Web Text Search, Webpage Visit, Code Interpreter, and an internal OCR tool.
- Superior Performance: WebWatcher significantly outperforms proprietary baselines, RAG workflows, and other open-source agents across four challenging VQA benchmarks: Humanity's Last Exam (HLE)-VL, BrowseComp-VL, LiveVQA, and MMSearch. The WebWatcher-32B model, in particular, achieves an average score of 18.2% on HLE, surpassing the GPT-4o-based OmniSearch baseline. It also achieves top-tier performance on LiveVQA (58.7%) and MMSearch (55.3%), demonstrating stable and superior results on demanding, real-world visual search benchmarks.

1. Complex Reasoning (HLE-VL): On Humanity's Last Exam (HLE-VL), a benchmark for multi-step complex reasoning, WebWatcher achieved a commanding lead with a Pass@1 score of 13.6%, substantially outperforming representative models including GPT-4o (9.8%), Gemini2.5-flash (9.2%), and Qwen2.5-VL-72B (8.6%).
2. Information Retrieval (MMSearch): In the MMSearch evaluation, WebWatcher demonstrated exceptional retrieval accuracy with a Pass@1 score of 55.3%, significantly surpassing Gemini2.5-flash (43.9%) and GPT-4o (24.1%), showcasing superior precision in retrieval tasks and robust information aggregation in complex scenarios.
3. Knowledge-Retrieval Integration (LiveVQA): On the LiveVQA benchmark, WebWatcher achieved a Pass@1 score of 58.7%, outperforming Gemini2.5-flash (41.3%), Qwen2.5-VL-72B (35.7%), and GPT-4o (34.0%).
4. Information Optimization and Aggregation (BrowseComp-VL): On BrowseComp-VL, the most comprehensively challenging benchmark, WebWatcher dominated with an average score of 27.0%, more than doubling the performance of mainstream models including GPT-4o (13.4%), Gemini2.5-flash (13.0%), and Claude-3.7 (11.2%).

You can download WebWatcher from 🤗 Hugging Face.

Before running inference, the test-set images need to be downloaded to the `infer/scriptseval/images` folder. This can be done by running `infer/scriptseval/downloadimage.py`. If you encounter issues downloading images from our provided OSS URLs, please obtain the images from the original dataset source and place them in the corresponding `infer/scriptseval/images` folder.

Run `infer/scriptseval/scripts/eval.sh` with the following required parameters:

- benchmark: Name of the dataset to test. Available options: `'hle'`, `'gaia'`, `'livevqa'`, `'mmsearch'`, `'simplevqa'`, `'bcvlv1'`, `'bcvlv2'`. These test sets should be pre-stored in `infer/vlsearchr1/evaldata` with a naming convention like `hle.jsonl`. We have provided format examples for some datasets in `infer/vlsearchr1/evaldata`. If extending to new datasets, please ensure consistent formatting.
- EXPERIMENTNAME: Name for this experiment (user-defined)
- MODELPATH: Path to the trained model
- DASHSCOPEAPIKEY: GPT API key
- IMGSEARCHKEY: Google SerpApi key for image search
- JINAAPIKEY: Jina API key
- SCRAPERAPIKEY: Scraper API key
- QWENSEARCHKEY: Google SerpApi key for text search

Note: For the image search tool, if you need to upload searched images to OSS, the following are also required:

- ALIBABACLOUDACCESSKEYID: Alibaba Cloud OSS access key ID
- ALIBABACLOUDACCESSKEYSECRET: Alibaba Cloud OSS access key secret

Run `infer/vlsearchr1/pass3.sh` to use LLM-as-judge for evaluating the Pass@3 and Pass@1 metrics. Parameters:

- DIRECTORY: Path to the folder containing the JSONL files generated from inference
- DASHSCOPEAPIKEY: GPT API key

```bibtex
@article{geng2025webwatcher,
  title={WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent},
  author={Geng, Xinyu and Xia, Peng and Zhang, Zhen and Wang, Xinyu and Wang, Qiuchen and Ding, Ruixue and Wang, Chenxi and Wu, Jialong and Zhao, Yida and Li, Kuan and others},
  journal={arXiv preprint arXiv:2508.05748},
  year={2025}
}
```
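The Pass@1 and Pass@3 numbers produced by the LLM-as-judge step can be computed from per-attempt verdicts as sketched below. This is a generic empirical Pass@k calculation under the assumption that the judge emits one boolean verdict per sampled attempt; it is not the repo's actual script or JSONL schema.

```python
def pass_at_k(judgments, k):
    """Empirical Pass@k.

    judgments: list of per-question lists of boolean judge verdicts,
    one verdict per sampled attempt, in sampling order.
    Returns the fraction of questions with at least one correct
    answer among the first k attempts.
    """
    hits = sum(1 for attempts in judgments if any(attempts[:k]))
    return hits / len(judgments)

# Three questions, three judged attempts each.
verdicts = [
    [False, True, False],   # passes within 3 attempts, not on the first
    [False, False, False],  # never passes
    [True, True, True],     # passes on the first attempt
]
print(pass_at_k(verdicts, 1))
print(pass_at_k(verdicts, 3))
```

Pass@3 is always at least Pass@1, since adding attempts can only turn a miss into a hit.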

11
10

Simulation_LLM_wiki_14B_V2

11
1

ZeroSearch_wiki_V2_Qwen2.5_7B_Instruct

11
0

Simulation_LLM_google_3B_V2

9
1

WebShaper-32B

license:mit
9
1

WebWatcher-32B

8
12

ZeroSearch_google_V1_Qwen2.5_7B

license:apache-2.0
8
2

Simulation_LLM_google_14B_V1

8
1

Simulation_LLM_google_7B_V1

8
1

Simulation_LLM_google_3B_V1

8
0

ZeroSearch_google_V2_Qwen2.5_7B

6
0

OmniSearch-Qwen-VL-Chat-en

license:apache-2.0
5
2

ZeroSearch_google_V2_Llama_3.2_3B_Instruct

llama
5
0

ZeroSearch_google_V2_Llama_3.2_3B

llama
4
0

ZeroSearch_wiki_V2_Qwen2.5_3B

4
0

ZeroSearch_wiki_V2_Qwen2.5_3B_Instruct

4
0

ZeroSearch_google_V2_Qwen2.5_3B_Instruct

3
0

ZeroSearch_google_V2_Qwen2.5_3B

3
0

ZeroSearch_wiki_V2_Llama_3.2_3B_Instruct

llama
3
0

ZeroSearch_google_V1_Qwen2.5_7B_Instruct

license:apache-2.0
2
10

ZeroSearch_google_V2_Qwen2.5_7B_Instruct

2
2

ZeroSearch_wiki_V2_Llama_3.2_3B

llama
2
0

ZeroSearch_google_v1_Qwen2.5_3B_Instruct

license:apache-2.0
1
3

ZeroSearch_google_v1_Llama_3.2_3B

llama
1
1

ZeroSearch_google_v1_Llama_3.2_3B_Instruct

llama
1
1

ZeroSearch_google_v1_Qwen2.5_3B

license:apache-2.0
1
0

ZeroSearch_wiki_V2_Qwen2.5_7B

1
0

WebSailor

More details are presented at https://github.com/Alibaba-NLP/WebAgent.

license:apache-2.0
0
22

new-impl

license:apache-2.0
0
18

LaSER-Qwen3-0.6B

license:mit
0
3

qwen2-impl

license:apache-2.0
0
3

LaSER-Qwen3-4B

license:mit
0
2

UVRB

license:apache-2.0
0
2

LaSER-Qwen3-8B

license:mit
0
1