BAAI

✓ Verified · Research Lab

Beijing Academy of Artificial Intelligence

149 models

bge-m3

For more details, please refer to our GitHub repo: https://github.com/FlagOpen/FlagEmbedding

In this project, we introduce BGE-M3, which is distinguished by its versatility in multi-functionality, multi-linguality, and multi-granularity:

- Multi-Functionality: it can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: it supports more than 100 working languages.
- Multi-Granularity: it can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

We recommend the following pipeline: hybrid retrieval + re-ranking.

- Hybrid retrieval leverages the strengths of various methods, offering higher accuracy and stronger generalization. A classic example combines embedding retrieval with the BM25 algorithm. BGE-M3 supports both embedding and sparse retrieval, so you can obtain token weights (similar to BM25) at no additional cost when generating dense embeddings. For hybrid retrieval, you can refer to Vespa and Milvus.
- As cross-encoders, re-rankers demonstrate higher accuracy than bi-encoder embedding models. Applying a re-ranking model (e.g., bge-reranker, bge-reranker-v2) after retrieval can further filter the selected text.

News:
- 2024/7/1: We updated the MIRACL evaluation results of BGE-M3. To reproduce the new results, refer to bge-m3_miracl_2cr; we have also updated our paper on arXiv. The previous results were lower because we mistakenly removed passages sharing the same id as the query from the search results. After correcting this mistake, the overall performance of BGE-M3 on MIRACL is higher than before, but the experimental conclusion is unchanged; no other results are affected. To reproduce the previous, lower results, add the `--remove-query` parameter when using `pyserini.search.faiss` or `pyserini.search.lucene` to search the passages.
- 2024/3/20: Thanks, Milvus team! You can now use BGE-M3 hybrid retrieval in Milvus: pymilvus/examples/hello_hybrid_sparse_dense.py.
- 2024/3/8: Thanks for the experimental results from @Yannael. In this benchmark, BGE-M3 achieves top performance in both English and other languages, surpassing models such as OpenAI's.
- 2024/3/2: Released the unified fine-tuning example and data.
- 2024/2/6: Released MLDR (a long-document retrieval dataset covering 13 languages) and its evaluation pipeline.
- 2024/2/1: Thanks for the excellent tool from Vespa.
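As a minimal sketch of the three retrieval modes described above, following the `BGEM3FlagModel` usage documented in the FlagEmbedding repo (output keys per that documentation):

```python
# pip install -U FlagEmbedding
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)  # fp16 speeds up encoding with minor quality loss

sentences = ["What is BGE M3?",
             "BGE M3 is an embedding model supporting dense, sparse, and multi-vector retrieval."]
output = model.encode(
    sentences,
    return_dense=True,         # one 1024-d vector per text
    return_sparse=True,        # token -> weight map (BM25-like lexical weights)
    return_colbert_vecs=True,  # one vector per token (ColBERT-style multi-vector)
)
print(output['dense_vecs'].shape)    # (2, 1024)
print(output['lexical_weights'][0])  # sparse token weights of the first text
```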
You can easily use multiple modes of BGE-M3 following this notebook.

| Model Name | Dimension | Sequence Length | Introduction |
|:----:|:---:|:---:|:---:|
| BAAI/bge-m3 | 1024 | 8192 | multilingual; unified fine-tuning (dense, sparse, and colbert) from bge-m3-unsupervised |
| BAAI/bge-m3-unsupervised | 1024 | 8192 | multilingual; contrastive learning from bge-m3-retromae |
| BAAI/bge-m3-retromae | -- | 8192 | multilingual; extends the max length of xlm-roberta to 8192 and further pretrained via RetroMAE |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | English model |
| BAAI/bge-base-en-v1.5 | 768 | 512 | English model |
| BAAI/bge-small-en-v1.5 | 384 | 512 | English model |

| Dataset | Introduction |
|:----:|:----:|
| MLDR | document retrieval dataset, covering 13 languages |
| bge-m3-data | fine-tuning data used by bge-m3 |

- Dense retrieval: map the text into a single embedding, e.g., DPR, BGE-v1.5.
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with most positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, uniCOIL, and SPLADE.
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., ColBERT.

For embedding retrieval, you can employ the BGE-M3 model using the same approach as BGE; the only difference is that BGE-M3 no longer requires adding instructions to the queries. For hybrid retrieval, you can use Vespa and Milvus. You can follow the command in this example to fine-tune the dense embedding. If you want to fine-tune all embedding functions of M3 (dense, sparse, and colbert), refer to the unified fine-tuning example. You can also use sentence-transformers and Hugging Face transformers to generate dense embeddings; refer to baai_general_embedding for details.

Compute score for text pairs: input a list of text pairs to get the scores computed by different methods (see the sketch after the training notes below). The BGE-M3 model emerged as the top performer on this benchmark (OAI is short for OpenAI). For more details, please refer to the article and GitHub repo.

Please note that MLDR is a document retrieval dataset we constructed via LLM, covering 13 languages and including test, validation, and training sets. We utilized the training set from MLDR to enhance the model's long-document retrieval capabilities, so comparing baselines with `Dense w.o.long` (fine-tuning without the long-document dataset) is more equitable. Additionally, this long-document retrieval dataset will be open-sourced to address the current lack of open-source multilingual long-text retrieval datasets; we believe it will help the open-source community train document retrieval models.

We utilized Pyserini to implement BM25, and the test results can be reproduced with this script. We tested BM25 with two different tokenizers: one using the Lucene analyzer and the other using the same tokenizer as M3 (i.e., the tokenizer of xlm-roberta). The results indicate that BM25 remains a competitive baseline, especially in long-document retrieval.

Training
- Self-knowledge distillation: combine multiple outputs from different retrieval modes as a reward signal to enhance the performance of a single mode (especially sparse retrieval and multi-vector (colbert) retrieval).
- Efficient batching: improve efficiency when fine-tuning on long text. The small-batch strategy is simple but effective and can also be used to fine-tune large embedding models.
- MCLS: a simple method to improve performance on long text without fine-tuning; useful if you lack the resources to fine-tune the model on long text.
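As mentioned under "Compute score for text pairs" above, here is a sketch of multi-mode scoring; per the FlagEmbedding README, `weights_for_different_modes` weights the dense, sparse, and colbert scores, in that order:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentence_pairs = [
    ["What is BGE M3?", "BGE M3 is a multilingual multi-functionality embedding model."],
    ["What is BGE M3?", "BM25 is a bag-of-words ranking function."],
]
# Combined scores are w[0]*dense + w[1]*sparse + w[2]*colbert.
scores = model.compute_score(sentence_pairs, weights_for_different_modes=[0.4, 0.2, 0.4])
print(scores['dense'])                 # dense-only scores
print(scores['sparse+dense'])          # hybrid lexical + dense
print(scores['colbert+sparse+dense'])  # all three modes combined
```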
Thanks to the authors of the open-sourced datasets, including MIRACL, MKQA, NarrativeQA, etc., and to open-sourced libraries such as Tevatron and Pyserini. If you find this repository useful, please consider giving it a star :star: and a citation.

7,097,927 downloads · 2,477 likes

bge-small-en-v1.5

Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License

For more details, please refer to our GitHub: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, try bge-m3.

FlagEmbedding focuses on retrieval-augmented LLMs and currently consists of the following projects:
- Long-Context LLM: Activation Beacon
- Fine-tuning of LM: LM-Cocktail
- Dense Retrieval: BGE-M3, LLM Embedder, BGE Embedding
- Reranker Model: BGE Reranker
- Benchmark: C-MTEB

News
- 1/30/2024: Release BGE-M3, a new member of the BGE model series! M3 stands for Multi-Linguality (100+ languages), Multi-Granularity (input length up to 8192), and Multi-Functionality (unification of dense, lexical, and multi-vec/colbert retrieval). It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multilingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire:
- 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLMs. Technical Report :fire:
- 12/24/2023: Release LLaRA, a LLaMA-7B-based dense retriever that achieves state-of-the-art performance on MS MARCO and BEIR. Model and code will be open-sourced; please stay tuned. Technical Report :fire:
- 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire:
- 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval-augmentation needs for LLMs. Technical Report
- 09/15/2023: The technical report of BGE has been released.
- 09/15/2023: The massive training data of BGE has been released.
- 09/12/2023: New models:
  - New reranker models: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
  - Updated embedding models: release `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
- 09/07/2023: Update fine-tune code: add a script to mine hard negatives and support adding instruction during fine-tuning.
- 08/09/2023: BGE models are integrated into Langchain; you can use them like this. The C-MTEB leaderboard is available.
- 08/05/2023: Release base-scale and small-scale models, the best performance among models of the same size 🤗
- 08/02/2023: Release `bge-large-*` (short for BAAI General Embedding) models, ranking 1st on the MTEB and C-MTEB benchmarks! :tada: :tada:
- 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.
| Model | Language | | Description | query instruction for retrieval [1] |
|:-------------------------------|:--------:|:--------:|:--------|:--------|
| BAAI/bge-m3 | Multilingual | Inference Fine-tune | Multi-Functionality (dense retrieval, sparse retrieval, multi-vector (colbert)), Multi-Linguality, and Multi-Granularity (8192 tokens) | |
| BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
| BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model with ability similar to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en | English | Inference Fine-tune | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model with ability similar to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

[1]: If you need to search for relevant passages for a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages.

[2]: Unlike an embedding model, a reranker takes the question and document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by other, simpler models. For example, use a bge embedding model to retrieve the top 100 relevant documents, then use a bge reranker to re-rank those 100 documents to get the final top-3 results (a sketch follows below).

All models have been uploaded to the Hugging Face Hub; you can find them at https://huggingface.co/BAAI. If you cannot access the Hugging Face Hub, you can also download the models at https://model.baai.ac.cn/models .
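Footnote [2] describes the retrieve-then-rerank pipeline in prose; below is a minimal sketch using the FlagEmbedding API (the corpus, query, and top-k size are illustrative):

```python
import numpy as np
from FlagEmbedding import FlagModel, FlagReranker

corpus = [
    "BGE is a family of embedding models from BAAI.",
    "Cross-encoders score query-document pairs directly.",
    "Paris is the capital of France.",
]
query = "Which models does BAAI release for text embedding?"

# Stage 1: bi-encoder retrieval - embed query and corpus, take top-k by inner product.
embedder = FlagModel(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)
q_emb = embedder.encode_queries([query])   # instruction is prepended automatically
p_emb = embedder.encode(corpus)            # passages need no instruction
top_k = np.argsort(-(q_emb @ p_emb.T)[0])[:2]

# Stage 2: cross-encoder re-ranking of the retrieved candidates.
reranker = FlagReranker('BAAI/bge-reranker-base')
rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_k])
for i, s in sorted(zip(top_k, rerank_scores), key=lambda t: -t[1]):
    print(s, corpus[i])
```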
Follow this example to prepare data and fine-tune your model. Some suggestions:
- Mine hard negatives following this example, which can improve retrieval performance.
- If you pre-train bge on your data, the pre-trained model cannot be used directly to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity.
- If the accuracy of the fine-tuned model is still not high enough, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.

2. The similarity score between two dissimilar sentences is higher than 0.5.

We suggest using bge v1.5, which alleviates the issue of the similarity distribution. Since we fine-tune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE models lies roughly in the interval [0.6, 1]. So a similarity score greater than 0.5 does not indicate that two sentences are similar. For downstream tasks such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not their absolute values. If you need to filter similar sentences based on a similarity threshold, select an appropriate threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9).

For `bge-*-v1.5`, we improved retrieval ability when no instruction is used; omitting the instruction causes only a slight degradation in retrieval performance compared with using it, so for convenience you can generate embeddings without instruction in all cases. For a retrieval task that uses short queries to find long related documents, it is recommended to add the instruction to these short queries. The best way to decide whether to add instructions to queries is to choose the setting that achieves better performance on your task. In all cases, documents/passages do not need the instruction.

Here are some examples of using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Hugging Face Transformers. If it doesn't work for you, see FlagEmbedding for more ways to install FlagEmbedding. For the value of the argument `query_instruction_for_retrieval`, see the Model List. By default, FlagModel uses all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs, or set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.

You can also use the `bge` models with sentence-transformers. For an s2p (short query to long passage) retrieval task, each short query should start with an instruction (see the Model List for instructions), but the instruction is not needed for passages.

With the transformers package, you can use the model like this: first pass your input through the transformer model, then select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

Unlike an embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. You can get a relevance score by feeding the query and passage to the reranker. The reranker is optimized with a cross-entropy loss, so the relevance score is not bounded to a specific range. Get relevance scores (higher scores indicate more relevance).

Usage via infinity: it's also possible to deploy the ONNX files with the infinity_emb pip package.
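A minimal deployment sketch for the infinity_emb route mentioned above; the `AsyncEmbeddingEngine`/`EngineArgs` names follow the infinity project's examples, so treat the exact signatures as an assumption for your installed version:

```python
# pip install infinity-emb[all]
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5",
               device="cpu", engine="optimum")  # ONNX path; use device="cuda", engine="torch" on GPU
)

async def main():
    async with engine:  # starts/stops the batching loop
        embeddings, usage = await engine.embed(sentences=sentences)
        print(len(embeddings), usage)

asyncio.run(main())
```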
We recommend `device="cuda", engine="torch"` with flash attention on GPU, and `device="cpu", engine="optimum"` for ONNX inference.

`baai-general-embedding` models achieve state-of-the-art performance on both the MTEB and C-MTEB leaderboards! For more details and evaluation tools, see our scripts.

| Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 |
| BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 |
| BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 | 51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 |
| bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 |
| bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 |
| gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 |
| instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |

- C-MTEB: We created the benchmark C-MTEB for Chinese text embedding, consisting of 31 datasets from 6 tasks. Please refer to C-MTEB for a detailed introduction.
| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
| BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
| bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
| multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
| multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
| multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
| text-embedding-ada-002 (OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
| text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |

| Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
| multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
| multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
| m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
| m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
| bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
| bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
| BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
| BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |

\*: T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.

We pre-train the models using RetroMAE and train them on large-scale pair data using contrastive learning. You can fine-tune the embedding model on your data following our examples; we also provide a pre-training example. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned. For more training details for bge, see baai_general_embedding.

The cross-encoder performs full attention over the input pair, which is more accurate than the embedding model (i.e., bi-encoder) but more time-consuming. Therefore, it can be used to re-rank the top-k documents returned by an embedding model. We train the cross-encoder on multilingual pair data; the data format is the same as for the embedding model, so you can fine-tune it easily following our example.
For more details, please refer to ./FlagEmbedding/reranker/README.md.

Contact
If you have any questions or suggestions about this project, feel free to open an issue or pull request. You can also email Shitao Xiao ([email protected]) and Zheng Liu ([email protected]). If you find this repository useful, please consider giving it a star :star: and a citation.

License
FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.

5,554,484 downloads · 380 likes

bge-large-en-v1.5

Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License

This card shares its README with bge-small-en-v1.5 above; see that entry for the model list, usage notes, MTEB/C-MTEB results, training details, contact, and license.

5,020,813 downloads · 594 likes

bge-base-en-v1.5

Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License

This card shares its README with bge-small-en-v1.5 above; see that entry for the model list, usage notes, MTEB/C-MTEB results, training details, contact, and license.

4,593,879 downloads · 372 likes

bge-reranker-v2-m3

license: apache-2.0 · pipeline_tag: text-classification · tags: transformers, sentence-transformers, text-embeddings-inference · language: multilingual
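The card above is metadata only; usage follows the same `FlagReranker` API described in the bge-small-en-v1.5 card. A short sketch (the `normalize=True` sigmoid mapping is per the FlagEmbedding README; example texts are illustrative):

```python
from FlagEmbedding import FlagReranker

# use_fp16 speeds up GPU inference with a minor quality trade-off
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

score = reranker.compute_score(
    ['what is panda?', 'The giant panda is a bear species endemic to China.'])
print(score)  # unbounded logit; higher means more relevant

# normalize=True maps the logit through a sigmoid into [0, 1]
scores = reranker.compute_score(
    [['what is panda?', 'The giant panda is a bear species endemic to China.'],
     ['what is panda?', 'Paris is the capital of France.']],
    normalize=True,
)
print(scores)
```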

3,437,437 downloads · 790 likes

bge-reranker-large

license: mit · language: en, zh · tags: mteb · model-index (name: bge-reranker-base): MTEB CMedQAv1 (test) map 81.27 / mrr 84.14; MTEB CMedQAv2 (test) map 84.10

1,091,669 downloads · 431 likes

bge-reranker-base

license: mit · language: en, zh · tags: mteb, text-embeddings-inference · model-index (name: bge-reranker-base): MTEB CMedQAv1 (test) map 81.27 / mrr 84.14; MTEB CMedQAv2 (test) map 84.10

1,076,159 downloads · 214 likes

bge-multilingual-gemma2

license: gemma · tags: feature-extraction, sentence-similarity, sentence-transformers, transformers, mteb · model-index (name: bge-multilingual-gemma2): MTEB NFCorpus (test) main_score 38.11 / ndcg@1 48.45 / ndcg@3 44.45 / ndcg@5 41.14

1,008,621 downloads · 191 likes

bge-large-zh-v1.5

license: mit · language: zh · tags: sentence-transformers, feature-extraction, sentence-similarity, transformers
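This card is metadata only; here is a hedged sketch of the sentence-transformers s2p usage described in the cards above (the Chinese query instruction is taken from the model list table; the example texts are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# For short-query-to-long-passage retrieval, prepend the instruction to queries only.
instruction = "为这个句子生成表示以用于检索相关文章:"
queries = ["如何提高睡眠质量"]
passages = ["保持规律的作息时间有助于改善睡眠质量。", "巴黎是法国的首都。"]

q_emb = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)  # passages take no instruction

scores = q_emb @ p_emb.T  # cosine similarity, since embeddings are normalized
print(scores)
```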

685,604 downloads · 596 likes

bge-small-en

---
tags:
  - mteb
  - sentence transformers
model-index:
  - name: bge-small-en
    results:
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_counterfactual
          name: MTEB AmazonCounterfactualClassification (en)
          config: en
          split: test
          revision: e8379541af4e31359cca9fbcf4b00f2671dba205
        metrics:
          - type: accuracy
            value: 74.34328358208955
          - type: ap
            value: 37.59947775195661
          - type: f1
            value: 68.548415491933
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_polarity
          name: MTEB AmazonPolarityClassific

license:mit
263,187 downloads · 80 likes

bge-base-en

---
tags:
  - mteb
model-index:
  - name: bge-base-en
    results:
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_counterfactual
          name: MTEB AmazonCounterfactualClassification (en)
          config: en
          split: test
          revision: e8379541af4e31359cca9fbcf4b00f2671dba205
        metrics:
          - type: accuracy
            value: 75.73134328358209
          - type: ap
            value: 38.97277232632892
          - type: f1
            value: 69.81740361139785
      - task:
          type: Classification
        dataset:
          type: mteb/amazon_polarity
          name: MTEB AmazonPolarityClassification
          config: default
          s

license:mit
217,459 downloads · 61 likes

bge-small-zh-v1.5

Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License

For more details, please refer to our Github: FlagEmbedding.

FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

🌟Updates🌟
- 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Paper :fire:
- 09/15/2023: The technical report of BGE has been released.
- 09/15/2023: The massive training data of BGE has been released.
- 09/12/2023: New models:
  - New reranker models: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
  - Updated embedding models: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without an instruction.
- 09/07/2023: Update fine-tuning code: add a script to mine hard negatives, and support adding an instruction during fine-tuning.
- 08/09/2023: BGE models are integrated into Langchain (you can use them like this); the C-MTEB leaderboard is available.
- 08/05/2023: Release base-scale and small-scale models; best performance among models of the same size 🤗
- 08/02/2023: Release the `bge-large-*` (short for BAAI General Embedding) models, ranking 1st on the MTEB and C-MTEB benchmarks! :tada: :tada:
- 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.

| Model | Language | | Description | query instruction for retrieval [1] |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|
| BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
| BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en | English | Inference Fine-tune | a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

[1]: If you need to search for relevant passages for a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to the passages.

[2]: Different from an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by simpler models. For example, use the bge embedding model to retrieve the top 100 relevant documents, then use the bge reranker to re-rank those 100 documents and get the final top 3 results.

All models have been uploaded to the Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .

1. How to fine-tune the bge embedding model?

Follow this example to prepare data and fine-tune your model. Some suggestions:
- Mine hard negatives following this example, which can improve retrieval performance.
- If you pre-train bge on your own data, the pre-trained model cannot be used directly to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity.
- If the accuracy of the fine-tuned model is still not high, it is recommended to use or fine-tune the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.

2. The similarity score between two dissimilar sentences is higher than 0.5

We suggest using bge v1.5, which alleviates the issue of the similarity distribution. Since we fine-tune the models with contrastive learning at a temperature of 0.01, the similarity distribution of the current BGE model is roughly in the interval [0.6, 1]. A similarity score greater than 0.5 therefore does not indicate that two sentences are similar. For downstream tasks such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not their absolute values. If you need to filter similar sentences by a similarity threshold, select an appropriate threshold based on the similarity distribution on your own data (such as 0.8, 0.85, or even 0.9).

3. When does the query instruction need to be used?

For the `bge-*-v1.5` models, we improved retrieval ability when no instruction is used; omitting the instruction causes only a slight degradation in retrieval performance compared with using it, so for convenience you can generate embeddings without an instruction in all cases. For a retrieval task that uses short queries to find long related documents, it is recommended to add the instruction to these short queries. The best way to decide whether to add instructions to queries is to choose the setting that achieves better performance on your task. In all cases, the documents/passages do not need the instruction.

Here are some examples of using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers.
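For instance, the FlagEmbedding route can be sketched roughly as follows (a minimal sketch assuming `pip install -U FlagEmbedding`; the instruction string is the one from the Model List above, and the sample texts are placeholders):

```python
from FlagEmbedding import FlagModel

sentences_1 = ['样例数据-1', '样例数据-2']
sentences_2 = ['样例数据-3', '样例数据-4']

model = FlagModel(
    'BAAI/bge-large-zh-v1.5',
    query_instruction_for_retrieval='为这个句子生成表示以用于检索相关文章:',
    use_fp16=True,  # fp16 speeds up encoding with a slight performance drop
)

# Symmetric tasks (e.g., STS): encode both sides without any instruction.
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T

# s2p retrieval: encode_queries() prepends the instruction to each query;
# passages are encoded without the instruction.
queries = ['query_1', 'query_2']
passages = ['样例文档-1', '样例文档-2']
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(similarity, scores)
```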
If it doesn't work for you, see FlagEmbedding for more ways to install FlagEmbedding. For the value of the argument `query_instruction_for_retrieval`, see the Model List. By default, FlagModel will use all available GPUs when encoding; set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs, or set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.

You can also use the `bge` models with sentence-transformers; a hedged sketch follows after the evaluation tables below. For an s2p (short query to long passage) retrieval task, each short query should start with an instruction (see the Model List for instructions), but the instruction is not needed for passages.

With the transformers package, you can use the model as follows (a sketch is also given after the tables): first, pass your input through the transformer model, then select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.

Different from an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. You can get a relevance score by feeding a query and a passage to the reranker. The reranker is optimized based on cross-entropy loss, so the relevance score is not bounded to a specific range; higher scores indicate more relevance.

`baai-general-embedding` models achieve state-of-the-art performance on both the MTEB and C-MTEB leaderboards! For more details and evaluation tools, see our scripts.

| Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) | Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) |
|:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 |
| BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 |
| BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 | 51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 |
| bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 |
| bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 |
| gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 |
| gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1024 | 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 |
| bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 |
| instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 |
| e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 |
| gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 |

- C-MTEB: We created the benchmark C-MTEB for Chinese text embedding, which consists of 31 datasets from 6 tasks. Please refer to C-MTEB for a detailed introduction.
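A minimal sketch of the sentence-transformers usage mentioned above, where the query instruction is manually prepended to queries only (model name, instruction, and texts are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

queries = ['query_1', 'query_2']
passages = ['样例文档-1', '样例文档-2']
instruction = '为这个句子生成表示以用于检索相关文章:'

# Prepend the instruction to queries only; passages are encoded as-is.
q_embeddings = model.encode([instruction + q for q in queries],
                            normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

# With normalized embeddings, the inner product equals cosine similarity.
scores = q_embeddings @ p_embeddings.T
print(scores)
```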
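And a sketch of the Huggingface Transformers path described above, selecting the last hidden state of the first token ([CLS]) and normalizing it (model name and sample sentences are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

sentences = ['样例数据-1', '样例数据-2']
encoded_input = tokenizer(sentences, padding=True, truncation=True,
                          return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)
    # [CLS] pooling: take the last hidden state of the first token.
    sentence_embeddings = model_output[0][:, 0]

# Normalize so that dot products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings)
```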

license:mit
114,620 downloads · 82 likes

bge-large-en

license:mit
102,837 downloads · 218 likes

bge-base-zh-v1.5


license:mit
83,377 downloads · 94 likes

bge-base-zh

We recommend switching to the newer BAAI/bge-base-zh-v1.5, which has a more reasonable similarity distribution and the same method of usage.

license:mit
81,030 downloads · 56 likes

bge-small-zh

We recommend switching to the newer BAAI/bge-small-zh-v1.5, which has a more reasonable similarity distribution and the same method of usage.
model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en | English | Inference Fine-tune |a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` | [1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. [2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Following this example to prepare data and fine-tune your model. Some suggestions: - Mine hard negatives following this example, which can improve the retrieval performance. - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). For the `bge--v1.5`, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. 
Here are some examples for using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. For the value of the argument `queryinstructionforretrieval`, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set `os.environ["CUDAVISIBLEDEVICES"]` to select specific GPUs. You also can set `os.environ["CUDAVISIBLEDEVICES"]=""` to make all GPUs unavailable. You can also use the `bge` models with sentence-transformers: For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding. Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Get relevance scores (higher scores indicate more relevance): `baai-general-embedding` models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. | Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) |Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) | |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 | | BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 | | BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 |51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 | | bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 | | bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 | | gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 | | gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 | | e5-large-v2 | 1024| 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 | | bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 | | instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 | | e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 | | gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 | | text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 | | e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 | | sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 | | all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 | | sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 | - C-MTEB: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. 
Please refer to CMTEB for a detailed introduction. | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 | | BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 | | BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 | | BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 | | bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 | | BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 | | multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 | | BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 | | m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 | | m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 | | multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 | | multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 | | text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 | | luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 | | text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 | | text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 | | Model | T2Reranking | T2RerankingZh2En\ | T2RerankingEn2Zh\ | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 | | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 | | multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 | | multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 | | m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 | | m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 | | bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 | | bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 | | BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 | | BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 | \ : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baaigeneralembedding. Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. 
We train the cross-encoder on multilingual pair data. The data format is the same as for the embedding model, so you can easily fine-tune it following our example. For more details, please refer to ./FlagEmbedding/reranker/README.md.

Contact

If you have any question or suggestion related to this project, feel free to open an issue or pull request. You can also email Shitao Xiao ([email protected]) and Zheng Liu ([email protected]). If you find this repository useful, please consider giving it a star :star: and a citation.

License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.

license:mit
49,988
26

Emu3-Chat-hf

license:apache-2.0
26,468
0

bge-reranker-v2-gemma

license:apache-2.0
24,164
77

bge-m3-unsupervised

license:mit
22,591
17

llm-embedder

license:mit
20,584
126

seggpt-vit-large

license:apache-2.0
17,183
4

bge-large-zh

We recommend switching to the newest BAAI/bge-large-zh-v1.5, which has a more reasonable similarity distribution and the same usage.

Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License

For more details, please refer to our GitHub: FlagEmbedding. FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, or semantic search. It can also be used in vector databases for LLMs.

🌟Updates🌟
- 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Paper :fire:
- 09/15/2023: The technical report of BGE has been released.
- 09/15/2023: The massive training data of BGE has been released.
- 09/12/2023: New models:
  - New reranker models: release the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using/fine-tuning them to re-rank the top-k documents returned by embedding models.
  - Updated embedding models: release the `bge-*-v1.5` embedding models to alleviate the issue of the similarity distribution and enhance retrieval ability without instruction.
- 09/07/2023: Update fine-tune code: add a script to mine hard negatives and support adding instructions during fine-tuning.
- 08/09/2023: BGE models are integrated into LangChain; you can use them like this. The C-MTEB leaderboard is available.
- 08/05/2023: Release base-scale and small-scale models, with the best performance among models of the same size 🤗
- 08/02/2023: Release the `bge-large-*` (short for BAAI General Embedding) models, ranking 1st on the MTEB and C-MTEB benchmarks! :tada: :tada:
- 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test datasets.

| Model | Language | | Description | query instruction for retrieval [1] |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|
| BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README |
| BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | |
| BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-small-en | English | Inference Fine-tune | a small-scale model with competitive performance | `Represent this sentence for searching relevant passages: ` |
| BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` |
| BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model with competitive performance | `为这个句子生成表示以用于检索相关文章:` |

[1]: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages.

[2]: Unlike an embedding model, a reranker uses the question and document as input and directly outputs similarity instead of an embedding. To balance accuracy and time cost, cross-encoders are widely used to re-rank the top-k documents retrieved by other, simpler models. For example, use the bge embedding model to retrieve the top 100 relevant documents, then use the bge reranker to re-rank those 100 documents to get the final top-3 results.

All models have been uploaded to the Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .

Follow this example to prepare data and fine-tune your model. Some suggestions:
- Mine hard negatives following this example, which can improve retrieval performance.
- If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity; it must be fine-tuned with contrastive learning before computing similarity.
- If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank the top-k results. Hard negatives are also needed to fine-tune the reranker.

2. The similarity score between two dissimilar sentences is higher than 0.5

We suggest using bge v1.5, which alleviates the issue of the similarity distribution. Since we fine-tune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is approximately in the interval [0.6, 1]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, select an appropriate threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9).

For `bge-*-v1.5`, we improved the retrieval ability when no instruction is used; dropping the instruction causes only a slight degradation in retrieval performance compared with using one, so for convenience you can generate embeddings without instructions in all cases. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions to these short queries. The best way to decide whether to add instructions to queries is to choose the setting that achieves better performance on your task. In all cases, the documents/passages do not need the instruction.
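The FlagEmbedding README above describes taking the last hidden state of the first token ([CLS]) as the sentence embedding; here is a minimal sketch of that recipe with Hugging Face Transformers, applied to bge-large-zh-v1.5 with the Chinese query instruction from the model list. Normalizing the embeddings is an assumption consistent with the cosine-style similarity scores discussed in this FAQ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-zh-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-zh-v1.5")
model.eval()

# For s2p retrieval, prepend the instruction to queries only; passages stay as-is.
instruction = "为这个句子生成表示以用于检索相关文章:"
sentences = [instruction + "样例查询", "样例文档"]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the last hidden state of the first token ([CLS]) as the sentence embedding,
# then L2-normalize so the dot product below is a cosine similarity.
embeddings = outputs.last_hidden_state[:, 0]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings[0] @ embeddings[1])  # compare against a threshold chosen for your data
```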
(The usage, evaluation, training, contact, and license sections for this model are identical to the FlagEmbedding README reproduced earlier on this page.)

license:mit
10,565
340

bge-code-v1

license:apache-2.0
8,353
42

AquilaChat-7B

8,202
49

AquilaChat2-7B

8,126
16

AltCLIP

4,529
31

CCI3-HQ-Classifier

license:apache-2.0
4,297
10

Emu3-Stage1

license:apache-2.0
2,789
26

Emu3-VisionTokenizer

license:apache-2.0
2,738
61

BGE-VL-base

license:mit
2,470
24

RoboBrain2.0-7B

license:apache-2.0
2,372
119

bge-en-icl

license:apache-2.0
2,286
135

bge-reranker-v2-minicpm-layerwise

license:apache-2.0
2,146
62

BGE-VL-large

license:mit
2,089
17

EVA-CLIP-8B

license:apache-2.0
2,039
50

Emu3-Gen

| Project Page | Paper | 🤗HF Models | github | Demo |

We introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction! By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 excels in both generation and perception.

Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6, and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.
- Emu3 is capable of generating high-quality images from text input simply by predicting the next vision token. The model naturally supports flexible resolutions and styles.
- Emu3 shows strong vision-language understanding capabilities, seeing the physical world and providing coherent text responses. Notably, this capability is achieved without depending on a CLIP encoder or a pretrained LLM.
- Emu3 generates a video causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. With a video in context, Emu3 can also naturally extend the video and predict what will happen next.

license:apache-2.0
2,030
221

BGE-VL-MLLM-S1

license:mit
1,675
21

Aquila2-34B

1,389
17

bge-reranker-v2.5-gemma2-lightweight

801
51

Bunny-v1_0-3B

license:apache-2.0
697
41

RoboBrain2.0-3B

license:apache-2.0
641
8

bge-reasoner-embed-qwen3-8b-0923

For more details, please refer to our GitHub: BGE-Reasoner. BGE-Reasoner-Embed-Qwen3-8B-0923 is an embedding model trained for reasoning-intensive retrieval tasks, based on Qwen/Qwen3-8B. It achieves an nDCG@10 of 37.1 on the BRIGHT benchmark with the original queries, demonstrating strong capability on reasoning-intensive retrieval tasks.

```
@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}
```

license:apache-2.0
595
21

Bunny-Llama-3-8B-V

bunny-llama
592
83

EVA-CLIP-18B

license:apache-2.0
544
13

RoboBrain2.5-8B-NV

license:apache-2.0
456
9

AquilaChat2-34B

414
46

Aquila-7B

408
18

Emu3-Chat

license:apache-2.0
378
74

Bunny-v1_0-2B-zh

license:apache-2.0
372
3

EVA-CLIP-8B-448

license:apache-2.0
370
14

Bunny-v1_0-4B

license:apache-2.0
351
10

Bunny-v1_1-4B

license:apache-2.0
340
26

nova-d48w1024-osp480

license:apache-2.0
327
6

AquilaDense-7B

license:apache-2.0
323
2

Emu3-Gen-hf

license:apache-2.0
297
1

Aquila-VL-2B-llava-qwen

license:apache-2.0
259
61

RoboBrain

license:apache-2.0
247
24

Emu3.5

license:apache-2.0
218
163

AquilaMed-RL

215
13

Video XL 2

Video-XL-2 [📰 Blog](https://unabletousegit.github.io/video-xl2.github.io/) [📂 GitHub](https://github.com/VectorSpaceLab/Video-XL) [📜 Tech Report (coming soon)]()

How to use the model

Video-XL-2 supplies two efficiency optimization strategies: chunk-based prefill and bi-level KVs decoding. You can flexibly choose them based on your needs.

TODO
- [X] Release model weights.
- [X] Release the inference code w/o efficiency optimization.
- [X] Release the inference code w/ chunk-based prefill.
- [ ] Release the inference code w/ chunk-based prefill & bi-level KVs decoding.

Tips: our inference code is still being updated; you can pass `--include '*.py'` to huggingface-cli to update only the inference code and avoid downloading the whole model.

---

2. Inference w/ Chunk-based Prefill

Chunk-based prefill significantly reduces memory demands and response latency by encoding the video input in a streaming manner. This advantage becomes particularly noticeable with longer videos. To enable this mode, set `enable_chunk_prefill` to `True` and configure the `prefill_config` parameters (see the configuration sketch after this section):
- `chunk_prefill_mode`: the mode of chunk-based prefill. Two modes are currently supported: `streaming` encodes video chunks in a streaming fashion, while `mask` achieves an equivalent effect using an attention mask. However, due to a lack of underlying optimized operators, the `mask` mode doesn't offer any efficiency improvement at this time; we recommend using the `streaming` mode.
- `chunk_size`: the size of each chunk processed in a single forward pass. The unit of `chunk_size` is 4 frames (e.g., `chunk_size = 4` means processing visual tokens from 4×4 = 16 frames at once). A larger `chunk_size` gradually approaches full attention, resulting in higher peak memory usage.
- `step_size`: the step size between chunks. A smaller `step_size` leads to more continuous information transfer between chunks but may slightly decrease inference speed.
- `offload`: whether to offload the key-value states (KVs) of each chunk to the CPU during forwarding. This reduces memory usage but also lowers inference speed.
- `chunk_size_for_vision_tower`: for longer video inputs, the vision tower can become a memory bottleneck during forwarding. To mitigate this, we also support a streaming mode for the vision tower, controlled by this parameter. The unit of `chunk_size_for_vision_tower` is 1 frame, and its value must be a multiple of 4.

Tip: currently, chunk-based prefill only supports the 'sdpa' attention implementation.

---

3. Inference w/ Chunk-based Prefill & Bi-level KVs Decoding

Coming soon.
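A hedged sketch of what enabling chunk-based prefill might look like. The parameter names follow the (reconstructed) snake_case spellings above; the `model.chat(...)` entry point and its exact signature are assumptions based on the project README, not a confirmed API.

```python
# Hypothetical usage sketch; the real entry point lives in the Video-XL-2 repo.
prefill_config = {
    "chunk_prefill_mode": "streaming",   # "streaming" is recommended over "mask"
    "chunk_size": 4,                     # unit: 4 frames, i.e. 4*4 = 16 frames per forward pass
    "step_size": 2,                      # smaller steps pass more context between chunks
    "offload": False,                    # offload per-chunk KVs to CPU to save memory (slower)
    "chunk_size_for_vision_tower": 16,   # unit: 1 frame; must be a multiple of 4
}

# enable_chunk_prefill=True switches on streaming encoding of the video input.
# Chunk-based prefill currently requires the 'sdpa' attention implementation.
# response = model.chat(video_path, question,
#                       enable_chunk_prefill=True,
#                       prefill_config=prefill_config)
```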

license:apache-2.0
214
54

OpenSeek-Small-v1-SFT

214
4

JudgeLM-7B-v1.0

llama
213
16

BGE-VL-MLLM-S2

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

News

🚀🚀 We have released the BGE-VL-MLLM models on Huggingface: BGE-VL-MLLM-S1 and BGE-VL-MLLM-S2. BGE-VL-MLLM-S1 is trained exclusively on our MegaPairs dataset, achieving outstanding performance in composed image retrieval, with an 8.1% improvement on the CIRCO benchmark (mAP@5) over the previous state-of-the-art. BGE-VL-MLLM-S2 builds on BGE-VL-MLLM-S1 with an additional epoch of fine-tuning on the MMEB benchmark training set, delivering enhanced performance across a broader range of multimodal embedding tasks.

🚀🚀 BGE-VL-CLIP models are released on Huggingface: BGE-VL-base and BGE-VL-large.

🎉🎉 Release of our paper: MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval.

Release Plan
- [x] Paper
- [x] BGE-VL-base and BGE-VL-large models
- [x] BGE-VL-MLLM model
- [ ] MegaPairs Dataset
- [ ] Evaluation code
- [ ] Fine-tuning code

Introduction

In this work, we introduce MegaPairs, a novel data synthesis method that leverages open-domain images to create heterogeneous KNN triplets for universal multimodal retrieval. Our MegaPairs dataset contains over 26 million triplets, and we have trained a series of multimodal retrieval models, BGE-VL, including BGE-VL-CLIP (base and large) and BGE-VL-MLLM. BGE-VL achieves state-of-the-art performance on four popular zero-shot composed image retrieval benchmarks and the massive multimodal embedding benchmark (MMEB). Extensive experiments demonstrate the efficiency, scalability, and generalization of MegaPairs. Please refer to our paper for more details.

1. BGE-VL-CLIP Models

You can easily use BGE-VL-CLIP models with Hugging Face Transformers. See the demo for a complete example of using BGE-VL for multimodal retrieval.

Model Performance

Zero-Shot Composed Image Retrieval

BGE-VL sets a new performance benchmark in zero-shot composed image retrieval tasks. On the CIRCO benchmark, our BGE-VL-base model, with only 149 million parameters, surpasses all previous models, including those with 50 times more parameters. Additionally, BGE-VL-MLLM achieves an 8.1% improvement over the previous state-of-the-art model.

BGE-VL-MLLM achieves state-of-the-art zero-shot performance on the Massive Multimodal Embedding Benchmark (MMEB), despite being trained only on the ImageText-to-Image paradigm. This demonstrates the excellent generalization capability of MegaPairs for multimodal embedding. After fine-tuning on downstream tasks, BGE-VL-MLLM maintains its leading performance. Notably, it surpasses the previous state-of-the-art by 7.1% on the MMEB out-of-distribution (OOD) set. These results demonstrate the robust generalization capability of BGE-VL-MLLM and highlight the potential of MegaPairs as foundational training data for universal multimodal embedding.

Performance Scaling

MegaPairs showcases scalability: BGE-VL-base improves as training data increases. It also demonstrates efficiency: with just 0.5M training samples, BGE-VL-base significantly outperforms MagicLens, which uses the same CLIP-base backbone and was trained on 36.7M samples.

License

The annotations for MegaPairs and the BGE-VL models are released under the MIT License. The images in MegaPairs originate from Recap-Datacomp, which is released under the CC BY 4.0 license.

Citation

If you find this repository useful, please consider giving a star ⭐ and citation.
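As a rough illustration of the BGE-VL-CLIP usage mentioned above, here is a sketch assuming the checkpoint ships remote code with `set_processor` and `encode` helpers as in the project demo; these helper names, their arguments, and the example file paths are assumptions, not a confirmed API.

```python
import torch
from transformers import AutoModel

# trust_remote_code pulls in the model's own encoding helpers (assumed per the demo).
model = AutoModel.from_pretrained("BAAI/BGE-VL-base", trust_remote_code=True)
model.set_processor("BAAI/BGE-VL-base")  # assumed helper from the remote code
model.eval()

with torch.no_grad():
    # Composed image retrieval: a reference image plus a modification text form the query.
    query = model.encode(images="./cir_query.png", text="Make the background dark")
    candidates = model.encode(images=["./candidate_1.png", "./candidate_2.png"])
    scores = query @ candidates.T  # cosine-style similarity if embeddings are normalized
print(scores)
```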

license:mit
207
15

Emu2-Chat

206
29

OpenSeek-Small-v1-Baseline

204
5

AquilaChat2-34B-16K

191
24

Aquila2-7B

190
6

bge-m3-retromae

license:mit
188
16

bunny-phi-2-siglip-lora

license:apache-2.0
179
48

JudgeLM-13B-v1.0

llama
176
7

BGE-VL-v1.5-mmeb

license:mit
172
10

SegVol

169
13

AquilaChat2-7B-16K

162
9

Aquila-33B

161
2

Bunny-Llama-3-8B-V-gguf

license:apache-2.0
158
15

Emu3.5-Image

license:apache-2.0
154
61

AltCLIP-m18

149
5

AquilaMoE-SFT

license:apache-2.0
134
6

Bunny-v1_0-4B-gguf

license:apache-2.0
132
7

AquilaMoE

license:apache-2.0
127
8

JudgeLM-33B-v1.0

llama
125
26

RoboBrain X0 Preview

license:apache-2.0
123
10

CapsFus-LLaMA

llama
119
2

LLARA-document

llama
113
0

LLARA-pretrain

llama
111
0

LLARA-beir

llama
109
0

Emu2

108
89

Bunny-v1_1-Llama-3-8B-V

bunny-llama
108
36

EVE-7B-HD-v2.0

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu📧, Xinlong Wang📧

Dalian University of Technology; Beijing Academy of Artificial Intelligence; Peking University; Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences; Chinese Academy of Sciences Institute of Automation

Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities; (ii) a well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study of developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability.

| Model name | Weight |
| ---------- | ------ |
| EVE-7B-HD-v2.0 | 🤗 HF link (28GB) |

✒️ Citation: If EVE is helpful for your research, please consider giving it a star ⭐ and a citation 📝.

license:apache-2.0
107
8

AquilaDense-16B

license:apache-2.0
107
0

AquilaSQL-7B

106
12

LLARA-passage

llama
104
1

Video-XL-2-Stage3

license:apache-2.0
103
1

Bunny-v1_0-3B-zh

license:apache-2.0
103
0

Emu3.5-VisionTokenizer

license:apache-2.0
102
21

URSA 1.7B FSQ320

Model Details
- Developed by: BAAI
- Model type: Text-to-Video Generation Model
- Model size: 1.7B
- Model precision: torch.float16 (FP16)
- Model resolution: 512x320
- Model paper: Uniform Discrete Diffusion with Metric Path for Video Generation
- Model family: BAAI-Vision-URSA
- Model Tokenizer: Cosmos-Tokenize1-DV4x8x8-360p
- Model Description: This is a model that can be used to generate and modify videos based on text prompts.

Use the 🤗 Diffusers library to run URSA in a simple and efficient manner.

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fingers and similar fine details may not, in general, be generated properly.
- The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent, and sexual content.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
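Since the card points to 🤗 Diffusers, here is a minimal text-to-video sketch. The repo id is inferred from this entry's title, and loading through the generic `DiffusionPipeline` with `trust_remote_code=True` is an assumption; consult the model card's own snippet for the confirmed entry point and output handling.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical repo id inferred from the entry title "URSA 1.7B FSQ320".
pipe = DiffusionPipeline.from_pretrained(
    "BAAI/URSA-1.7B-FSQ320",
    torch_dtype=torch.float16,  # the card states FP16 precision
    trust_remote_code=True,     # assumed: URSA ships a custom pipeline
).to("cuda")

# 512x320 is the native resolution stated in the model details;
# the call signature of the custom pipeline is assumed.
video = pipe(prompt="a panda eating bamboo in the rain", height=320, width=512)
```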

license:apache-2.0
101
7

IndustryCorpus2_DataRater

license:apache-2.0
99
6

Matroyshka-ReRanker-document

99
3

Matroyshka-ReRanker-beir

99
2

Matroyshka-ReRanker-passage

license:apache-2.0
98
1

bge-large-zh-noinstruct

license:mit
92
11

OPI-Galactica-6.7B

license:apache-2.0
89
5

AquilaCode-multi

84
3

RoboBrain2.5-8B-MT

license:apache-2.0
79
9

BGE-VL-v1.5-zs

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

News

🎉🎉 We have uploaded our MegaPairs dataset to 🤗 Hugging Face, which contains over 26 million multimodal retrieval instruction-tuning triplets. To reduce upload time and enhance data accessibility, we resized all images to a resolution of 512 × 512 instead of using their original size. This adjustment has minimal impact on performance, considering that most vision-language models (e.g., CLIP) use even smaller input image sizes. Dataset Card

🌟🌟 BGE-VL models are also available on WiseModel.

📰📰 Thank you to SyncedTech (机器之心), QbitAI (量子位), and AI Era (新智元) for reporting on our work!

🚀🚀 We have released the BGE-VL-MLLM models on Huggingface: BGE-VL-MLLM-S1 and BGE-VL-MLLM-S2. BGE-VL-MLLM-S1 is trained exclusively on our MegaPairs dataset, achieving outstanding performance in composed image retrieval, with an 8.1% improvement on the CIRCO benchmark (mAP@5) over the previous state-of-the-art. BGE-VL-MLLM-S2 builds on BGE-VL-MLLM-S1 with an additional epoch of fine-tuning on the MMEB benchmark training set, delivering enhanced performance across a broader range of multimodal embedding tasks.

🚀🚀 BGE-VL-CLIP models are released on Huggingface: BGE-VL-base and BGE-VL-large.

🎉🎉 Release of our paper: MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval.

Release Plan
- [x] Paper
- [x] BGE-VL-base and BGE-VL-large models
- [x] BGE-VL-MLLM model
- [x] MegaPairs Dataset
- [x] Evaluation code examples
- [ ] Fine-tuning code

Introduction

In this work, we introduce MegaPairs, a novel data synthesis method that leverages open-domain images to create heterogeneous KNN triplets for universal multimodal retrieval. Our MegaPairs dataset contains over 26 million triplets, and we have trained a series of multimodal retrieval models, BGE-VL, including BGE-VL-CLIP (base and large) and BGE-VL-MLLM. BGE-VL achieves state-of-the-art performance on four popular zero-shot composed image retrieval benchmarks and the massive multimodal embedding benchmark (MMEB). Extensive experiments demonstrate the efficiency, scalability, and generalization of MegaPairs. Please refer to our paper for more details.

1. BGE-VL-CLIP Models

You can easily use BGE-VL-CLIP models with Hugging Face Transformers. See the demo for a complete example of using BGE-VL for multimodal retrieval.

> Our code works well on transformers==4.45.2, and we recommend using this version.

We are excited to release the MegaPairs dataset on Hugging Face, which contains over 26 million training samples tailored for composed image retrieval and universal multimodal retrieval tasks. Each entry in the dataset consists of the following fields:
- qtext (`list`): a list of textual query statements related to the query image. During training, you can randomly select one statement from this list.
- timg (`str`): the file path to the target image, which serves as the positive example for the combination of `qimg` and `qtext`.
- hns (`list`): a list of file paths for hard negative sample images. These are challenging distractors that are visually or semantically similar to the query. It is recommended to include at least one hard negative sample during training, with `hns[0]` (the query image itself) being a mandatory choice. In our experiments, we used four hard negative samples per query.

The dataset is available for download and exploration on Hugging Face.
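A hedged sketch of loading the MegaPairs data with 🤗 Datasets; the field names mirror this card, while the dataset id and whether `load_dataset` works out of the box for this repo are assumptions to verify on the dataset page.

```python
from datasets import load_dataset

# Dataset id assumed from the card; check the Hugging Face dataset page for the exact name.
ds = load_dataset("BAAI/MegaPairs", split="train", streaming=True)  # 26M+ triplets

sample = next(iter(ds))
query_text = sample["qtext"][0]  # during training, pick one query statement at random
target_img = sample["timg"]      # path to the positive (target) image
hard_negs = sample["hns"]        # hard negative image paths; hns[0] is the query image itself
```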
We encourage researchers and practitioners to leverage this dataset to advance multimodal retrieval research and systems. (The model performance, performance scaling, license, and citation details for this card are identical to the BGE-VL-MLLM-S2 entry above.)

license:mit
79
7

AltDiffusion-m18

72
31

Aquila2-70B-Expr

67
10

AquilaChat2-70B-Expr

62
5

BGE-VL-Screenshot

license:mit
59
12

IndustryCorpus2_Classifier

license:apache-2.0
59
9

RoboBrain2.0-32B

license:apache-2.0
55
40

AltCLIP-m9

55
8

EVE-7B-Pretrain-v1.0

license:apache-2.0
55
3

bunny-pretrain-phi-2-siglip

license:apache-2.0
53
6

EVE-7B-v1.0

license:apache-2.0
51
5

EVE-7B-HD-v1.0

license:apache-2.0
50
6

AquilaCode-py

45
2

OmniGen-v1

license:mit
43
10

URSA-1.7B-IBQ1024

Model Details
- Developed by: BAAI
- Model type: Text-to-Image Generation Model
- Model size: 1.7B
- Model precision: torch.float16 (FP16)
- Model resolution: 1024x1024
- Model paper: Uniform Discrete Diffusion with Metric Path for Video Generation
- Model family: BAAI-Vision-URSA
- Model Tokenizer: Emu3.5-Vision-Tokenizer
- Model Description: This is a model that can be used to generate and modify images based on text prompts.

Use the 🤗 Diffusers library to run URSA in a simple and efficient manner.

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fingers and similar fine details may not, in general, be generated properly.
- The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent, and sexual content.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

license:apache-2.0
41
3

AltDiffusion-m9

27
70

MTVCraft

MTVCraft: An Open Veo3-style Audio-Video Generation Demo Pipeline

Installation | Models | Inference | Citation

MTVCraft is a framework for generating videos with synchronized audio from a single text prompt, exploring a potential pipeline for creating general audio-visual content. Specifically, the framework consists of a multi-stage pipeline. First, MTVCraft employs Qwen3 to interpret the user's initial prompt, deconstructing it into separate descriptions for three audio categories: human speech, sound effects, and background music. Subsequently, these descriptions are fed into ElevenLabs to synthesize the corresponding audio tracks. Finally, these generated audio tracks serve as conditions to guide the MTV framework in generating a video that is temporally synchronized with the sound. Notably, both Qwen3 and ElevenLabs can be replaced by available alternatives with similar capabilities.

For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install `torch`, `torchvision`, `torchaudio`, and `xformers`. You can easily get all pretrained models required for inference from our HuggingFace repo, or you can download them separately from their source repos:
- mtv: our checkpoints
- t5-v1_1-xxl: text encoder; you can download the text encoder and tokenizer
- vae: CogVideoX-5b pretrained 3D VAE
- wav2vec: wav-audio-to-vector model from Facebook

Finally, these pretrained models should be organized as follows:

API Setup (Required)

Before running the inference script, make sure to configure your API keys in the file `mtv/utils.py`. Edit the following section: Once the API keys are set, you can run inference using the provided script. This will read the input prompts from `./examples/samples.txt`, and the results will be saved at `./output`. If you find our work useful for your research, please consider citing the paper:

license:apache-2.0
26
36

nova-d48w768-sdxl1024

Model Details
- Developed by: BAAI
- Model type: Non-quantized Autoregressive Text-to-Image Generation Model
- Model size: 363M
- Model precision: torch.float16 (FP16)
- Model resolution: 1024x1024
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Non-quantized Video Autoregressive (NOVA) diffusion model that uses a pretrained text encoder (Phi-2) and a VAE image tokenizer (SDXL-VAE).
- Model License: Apache 2.0 License
- Resources for more information: GitHub Repository.

Use the 🤗 Diffusers library to run NOVA in a simple and efficient manner.

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fingers and similar fine details may not, in general, be generated properly.
- The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent, and sexual content.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

license:apache-2.0
21
2

AltDiffusion

16
57

Emu2-Gen

16
21

nova-d48w1024-sd512

license:apache-2.0
13
1

OmniGen2

News | Quick Start | Usage Tips | Limitations | Online Demos | Citation

🔥 News
- 2025-06-30: Training code is available; see fine-tuning for details.
- 2025-06-28: We release the OmniContext benchmark. The evaluation code can be found in omnicontext.
- 2025-06-24: Technical Report is available.
- 2025-06-23: We've updated our code and HF model: OmniGen2 now runs without `flash-attn`. Users can still install it for optimal performance.
- 2025-06-20: Updated resource requirements, adding CPU offload support for devices with limited VRAM.
- 2025-06-16: Gradio and Jupyter are available. Online Gradio Demo: Demo1; Chat-Demo1; see more demo links in the gradio section.
- 2025-06-16: We release OmniGen2, a multimodal generation model; model weights can be accessed on huggingface and modelscope.

Introduction

OmniGen2 is a powerful and efficient generative model. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for the text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. OmniGen2 has competitive performance across four primary capabilities:
- Visual Understanding: inherits the robust ability to interpret and analyze image content from its Qwen-VL-2.5 foundation.
- Text-to-Image Generation: creates high-fidelity and aesthetically pleasing images from textual prompts.
- Instruction-guided Image Editing: executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
- In-context Generation: a versatile capability to process and flexibly combine diverse inputs (humans, reference objects, and scenes) to produce novel and coherent visual outputs.

We will release the training code and dataset. Stay tuned!

Good demonstrations of OmniGen2's image editing capabilities. Good demonstrations of OmniGen2's in-context generation capabilities.

📌 TODO
- [x] Technical report.
- [x] Support CPU offload and improve inference efficiency.
- [x] In-context generation benchmark: OmniContext.
- [ ] Integration of diffusers.
- [ ] Training datasets.
- [ ] Training data construction pipeline.
- [ ] ComfyUI Demo (community support will be greatly appreciated!).

Online Demo: HF Spaces. Beyond Hugging Face Spaces, we are temporarily allocating additional GPU resources to ensure smooth access to the online demos. If you notice a long queue for a particular link, please try the other links.

💡 Usage Tips

To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case; a hedged call sketch follows at the end of this card.
- `text_guidance_scale`: controls how strictly the output adheres to the text prompt (classifier-free guidance).
- `image_guidance_scale`: controls how much the final image should resemble the input reference image.
  - The trade-off: a higher value makes the output more faithful to the reference image's structure and style, but it might ignore parts of your text prompt. A lower value (~1.5) gives the text prompt more influence.
  - Tip: for the image editing task, we recommend setting it between 1.2 and 2.0; for the in-context generation task, a higher `image_guidance_scale` will maintain more details from the input images, and we recommend setting it between 2.5 and 3.0.
- `max_pixels`: automatically resizes images when their total pixel count (width × height) exceeds this limit, while maintaining the aspect ratio. This helps manage performance and memory usage.
  - Tip: the default value is 1024*1024. You can reduce this value if you encounter memory issues.
- `max_input_image_side_length`: maximum side length for input images.
- `negative_prompt`: tell the model what you don't want to see in the image.
  - Example: blurry, low quality, text, watermark.
  - Tip: for the best results, try experimenting with different negative prompts. If you're not sure, just use the default negative prompt.
- `enable_model_cpu_offload`: reduces VRAM usage by nearly 50% with a negligible impact on speed. This is achieved by offloading the model weights to CPU RAM when they are not in use. See: Model Offloading.
- `enable_sequential_cpu_offload`: minimizes VRAM usage to less than 3GB, but at the cost of significantly slower performance. This works by offloading the model in submodules and loading them onto the GPU sequentially as needed. See: CPU Offloading.
- `cfg_range_start`, `cfg_range_end`: define the timestep range where CFG is applied. Per this paper, reducing `cfg_range_end` can significantly decrease inference time with a negligible impact on quality.
- `scheduler`: choose between `[euler, dpmsolver++]`. Default is `euler`. For potentially better performance with fewer steps, try `dpmsolver++`.
- `num_inference_step`: number of discretization steps for the ODE solver. Default is `50`.

Some suggestions for improving generation quality:
1. Use high-quality images: provide clear images, preferably with a resolution greater than 512×512 pixels. Small or blurry inputs will result in low-quality outputs.
2. Be specific with instructions: clearly describe both what to change and how you want it changed.
3. Prioritize English: the model currently performs best with English prompts.
4. Adjust instructions to enhance subject consistency. When the generated image does not align well with the input image, you can try the following methods:
   - Use larger images, as well as images in which people occupy a larger proportion of the frame.
   - Increase the image guidance scale, for example to 3.0. The trade-off may be slight overexposure or a greasy look in the image.
   - When using a single input image, try the following prompt template: "she/he ..., maintaining her/his facial features, hairstyle, and other attributes."
   - Increase the "Number of images per prompt" parameter to generate more outputs, giving you a better chance of finding one with stronger subject consistency and a more satisfactory result.
   - Longer prompts generally yield better results than shorter ones. More detailed descriptions of the scene and character interactions can provide additional benefits.
5. For in-context editing (editing based on multiple images), we recommend the following prompt format: "Edit the first image: add/replace (the [object] with) the [object] from the second image. [description of your target image]." For example: "Edit the first image: add the man from the second image. The man is talking with a woman in the kitchen." The description of your target image should be as detailed as possible.

❌ Limitations and Suggestions

The current model sometimes does not follow instructions. You can increase the "Number of images per prompt" setting to generate multiple images at once, so you can choose a result you are satisfied with, or try different prompts; in our experience, being as detailed as possible tends to work better. The current model cannot decide the output image size by itself; the default size is 1024×1024, so you need to set a specific size if you require a different one. When you input an image, we set the output size to match the input image (this works best for editing tasks). If you want to modify just one image out of several, you should also set the output size to match the image you want to edit; otherwise, it may lead to low-quality outputs.

The in-context generation capability sometimes produces objects that differ from the original ones. Some suggested improvements: increase `image_guidance_scale` (a value of 3 is recommended), which can help alleviate this issue; use high-resolution images, increase the size of the input image, and ensure that the object to be used occupies a larger proportion of the image; and modify the prompt. However, there is still a gap compared to GPT-4o. Compared to OmniGen 1.0, although OmniGen2 has made some improvements, many issues still remain, and it may take multiple attempts to achieve a satisfactory result.

💻 Resources Requirement

OmniGen2 natively requires an NVIDIA RTX 3090 or an equivalent GPU with approximately 17GB of VRAM. For devices with less VRAM, you can enable CPU offload to run the model. Performance tip: to improve inference speed, consider decreasing the `cfg_range_end` parameter; within a reasonable range, this has a negligible impact on output quality. The following table details the inference performance of OmniGen2 on an A800 GPU:

🤝 Community Efforts

We're honored and grateful for the support from the open-source community. Here are some unofficial implementations contributed by the community (we have not yet verified that they are bug-free; please prefer our official demo where possible):
- ComfyUI:
  - ComfyUI Official
  - https://github.com/Yuan-ManX/ComfyUI-OmniGen2
  - https://github.com/neverbiasu/ComfyUI-OmniGen2
- Quantization:
  - DFloat11, a lossless compression using 11 bits

❤️ Citing Us

If you find this repository or our work useful, please consider giving a star ⭐ and citation 🦖, which would be greatly appreciated:
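To tie the hyperparameters above together, here is a hedged call sketch. It assumes a pipeline object `pipe` built per the project README (OmniGen2 is not yet integrated into diffusers, per the TODO list); the keyword names follow the (reconstructed) snake_case spellings in the usage tips, while `prompt` and `input_images` are assumed argument names.

```python
# Hypothetical sketch: `pipe` is assumed to be constructed per the OmniGen2 README.
images = pipe(
    prompt="Edit the first image: add the man from the second image. "
           "The man is talking with a woman in the kitchen.",
    input_images=["person.png", "kitchen.png"],  # argument name assumed
    text_guidance_scale=5.0,        # adherence to the text prompt (CFG)
    image_guidance_scale=2.0,       # 1.2-2.0 for editing; 2.5-3.0 for in-context generation
    negative_prompt="blurry, low quality, text, watermark",
    max_input_image_side_length=1024,
    max_pixels=1024 * 1024,         # larger inputs are resized, keeping the aspect ratio
    cfg_range_end=0.8,              # lowering this cuts inference time with little quality loss
    scheduler="euler",              # or "dpmsolver++" for fewer steps
    num_inference_step=50,
)
```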

license:apache-2.0
10
6

nova-d48w1536-sdxl1024

license:apache-2.0
8
7

nova-d48w1024-sdxl1024

license:apache-2.0
5
2

OpenSeek-Small-v1

license:apache-2.0
3
17

RoboBrain2.0-7B-W8A16

license:apache-2.0
3
2

RoboBrain2.0-7B-FP8

license:apache-2.0
3
0

DreamBooth-AltDiffusion

1
9

bge-visualized

0
66

EVA

license:mit
0
30

Emu

0
22

SegGPT

license:mit
0
19

tokenize-anything

license:apache-2.0
0
18

Uni3D

0
10

DIVA

license:apache-2.0
0
8

Painter

license:mit
0
7

CCI4.0-ZH-HQ-Classifiers

0
6

Aquila-135M

license:apache-2.0
0
3

Aquila-135M-Instruct

license:apache-2.0
0
3

RoboBrain-LoRA-Affordance

license:apache-2.0
0
3

URSA-0.6B-IBQ1024

Model Details
- Developed by: BAAI
- Model type: Text-to-Image Generation Model
- Model size: 0.6B
- Model precision: torch.float16 (FP16)
- Model resolution: 1024x1024
- Model paper: Uniform Discrete Diffusion with Metric Path for Video Generation
- Model family: BAAI-Vision-URSA
- Model Tokenizer: Emu3.5-Vision-Tokenizer
- Model Description: This is a model that can be used to generate and modify images based on text prompts.

Use the 🤗 Diffusers library to run URSA in a simple and efficient manner.

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fingers and similar fine details may not, in general, be generated properly.
- The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent, and sexual content.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

license:apache-2.0
0
3

CCI3-HQ-Intermediate-Checkpoints

0
2

Aquila-VL-2B-Intermediate

license:apache-2.0
0
2

RoboBrain-LoRA-Trajectory

license:apache-2.0
0
2

URSA-0.6B-FSQ320

Model Details
- Developed by: BAAI
- Model type: Text-to-Video Generation Model
- Model size: 0.6B
- Model precision: torch.float16 (FP16)
- Model resolution: 512x320
- Model paper: Uniform Discrete Diffusion with Metric Path for Video Generation
- Model family: BAAI-Vision-URSA
- Model Tokenizer: Cosmos-Tokenize1-DV4x8x8-360p
- Model Description: This is a model that can be used to generate and modify videos based on text prompts.

Use the 🤗 Diffusers library to run URSA in a simple and efficient manner.

Direct Use

The model is intended for research purposes only. Possible research areas and tasks include:
- Research on generative models.
- Applications in educational or creative tools.
- Generation of artworks and use in design and other artistic processes.
- Probing and understanding the limitations and biases of generative models.
- Safe deployment of models which have the potential to generate harmful content.

Out-of-Scope Use

The model was not trained to produce factual or true representations of people or events; using the model to generate such content is therefore out of scope for its abilities.

Misuse and Malicious Use

Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
- Mis- and disinformation.
- Representations of egregious violence and gore.
- Impersonating individuals without their consent.
- Sexual content without consent of the people who might see it.
- Sharing of copyrighted or licensed material in violation of its terms of use.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.

Limitations
- The autoencoding part of the model is lossy.
- The model cannot render complex legible text.
- The model does not achieve perfect photorealism.
- Fingers and similar fine details may not, in general, be generated properly.
- The model was trained on a subset of the web datasets LAION-5B and COYO-700M, which contain adult, violent, and sexual content.

Bias

While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

license:apache-2.0
0
2