Sentiment Analysis
Analyze emotional tone in text - positive, negative, or neutral. Essential for customer feedback, social media monitoring, and brand reputation.
Common Applications
Top Models
bert-base-uncased
by google-bert
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English. Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team. BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally masks the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs. BERT has originally been released in base and large variations, for cased and uncased input text. The uncased models also strips out an accent markers. Chinese and multilingual uncased and cased versions followed shortly after. Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of two models. Other 24 smaller models are released afterward. The detailed release history can be found on the google-research/bert readme on github. | Model | #params | Language | |------------------------|--------------------------------|-------| | `bert-base-uncased` | 110M | English | | `bert-large-uncased` | 340M | English | sub | `bert-base-cased` | 110M | English | | `bert-large-cased` | 340M | English | | `bert-base-chinese` | 110M | Chinese | | `bert-base-multilingual-cased` | 110M | Multiple | | `bert-large-uncased-whole-word-masking` | 340M | English | | `bert-large-cased-whole-word-masking` | 340M | English | You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions of a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta{1} = 0.9\\) and \\(\beta{2} = 0.999\\), a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results: | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:| | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
distilbert-base-uncased
by distilbert
This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is uncased: it does not make a difference between english and English. DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives: - Distillation loss: the model was trained to return the same probabilities as the BERT base model. - Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Cosine embedding loss: the model was also trained to generate hidden states as close as possible as the BERT base model. This way, the model learns the same inner representation of the English language than its teacher model, while being faster for inference or downstream tasks. You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. It also inherits some of the bias of its teacher model. This bias will also affect all fine-tuned versions of this model. DistilBERT pretrained on the same data as BERT, which is BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. The model was trained on 8 16 GB V100 for 90 hours. See the training code for all hyperparameters details. When fine-tuned on downstream tasks, this model achieves the following results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |
roberta-large
by FacebookAI
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English. Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The RoBERTa model was pretrained on the reunion of five datasets: - BookCorpus, a dataset consisting of 11,038 unpublished books; - English Wikipedia (excluding lists, tables and headers) ; - CC-News, a dataset containing 63 millions English news articles crawled between September 2016 and February 2019. - OpenWebText, an opensource recreation of the WebText dataset used to train GPT-2, - Stories a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked with ` ` and the end of one by ` ` The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by ` `. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed). The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 4e-4, \\(\beta{1} = 0.9\\), \\(\beta{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 96.4 | 90.9 | 86.6 |
gpt2
by openai-community
Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page. Disclaimer: The team releasing GPT-2 also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias. GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences. More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt. This is the smallest version of GPT-2, with 124M parameters. You can use the raw model for text generation or fine-tune it to a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their model card: > Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases > that require the generated text to be true. > > Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do > not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a > study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, > and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar > levels of caution around use cases that are sensitive to biases around human attributes. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights 40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here. The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens. The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact details of training. The model achieves the following results without any fine-tuning (zero-shot): | Dataset | LAMBADA | LAMBADA | CBT-CN | CBT-NE | WikiText2 | PTB | enwiki8 | text8 | WikiText103 | 1BW | |:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:| | (metric) | (PPL) | (ACC) | (ACC) | (ACC) | (PPL) | (PPL) | (BPB) | (BPC) | (PPL) | (PPL) | | | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1,17 | 37.50 | 75.20 |
roberta-base
by FacebookAI
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English. Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The RoBERTa model was pretrained on the reunion of five datasets: - BookCorpus, a dataset consisting of 11,038 unpublished books; - English Wikipedia (excluding lists, tables and headers) ; - CC-News, a dataset containing 63 millions English news articles crawled between September 2016 and February 2019. - OpenWebText, an opensource recreation of the WebText dataset used to train GPT-2, - Stories a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with ` ` and the end of one by ` ` The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by ` `. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed). The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 6e-4, \\(\beta{1} = 0.9\\), \\(\beta{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results: | Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | |:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:| | | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 |
distilbert-base-uncased-finetuned-sst-2-english
by distilbert
Table of Contents - Model Details - How to Get Started With the Model - Uses - Risks, Limitations and Biases - Training Model Details Model Description: This model is a fine-tune checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. This model reaches an accuracy of 91.3 on the dev set (for comparison, Bert bert-base-uncased version reaches an accuracy of 92.7). - Developed by: Hugging Face - Model Type: Text Classification - Language(s): English - License: Apache-2.0 - Parent Model: For more details about DistilBERT, we encourage users to check out this model card. - Resources for more information: - Model Documentation - DistilBERT paper This model can be used for topic classification. You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Misuse and Out-of-scope Use The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. Based on a few experimentations, we observed that this model could produce biased predictions that target underrepresented populations. For instance, for sentences like `This film was filmed in COUNTRY`, this binary classification model will give radically different probabilities for the positive label depending on the country (0.89 if the country is France, but 0.08 if the country is Afghanistan) when nothing in the input indicates such a strong semantic shift. In this colab, Aurélien Géron made an interesting map plotting these probabilities for each country. We strongly advise users to thoroughly probe these aspects on their use-cases in order to evaluate the risks of this model. We recommend looking at the following bias evaluation datasets as a place to start: WinoBias, WinoGender, Stereoset. The authors use the following Stanford Sentiment Treebank(sst2) corpora for the model. - learningrate = 1e-5 - batchsize = 32 - warmup = 600 - maxseqlength = 128 - numtrainepochs = 3.0
ms-marco-MiniLM-L6-v2
by cross-encoder
This model was trained on the MS Marco Passage Ranking task. The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order. See SBERT.net Retrieve & Re-rank for more details. The training code is available here: SBERT.net Training MS Marco The usage is easy when you have SentenceTransformers installed. Then you can use the pre-trained models like this: Performance In the following table, we provide various pre-trained Cross-Encoders together with their performance on the TREC Deep Learning 2019 and the MS Marco Passage Reranking dataset. | Model-Name | NDCG@10 (TREC DL 19) | MRR@10 (MS Marco Dev) | Docs / Sec | | ------------- |:-------------| -----| --- | | Version 2 models | | | | cross-encoder/ms-marco-TinyBERT-L2-v2 | 69.84 | 32.56 | 9000 | cross-encoder/ms-marco-MiniLM-L2-v2 | 71.01 | 34.85 | 4100 | cross-encoder/ms-marco-MiniLM-L4-v2 | 73.04 | 37.70 | 2500 | cross-encoder/ms-marco-MiniLM-L6-v2 | 74.30 | 39.01 | 1800 | cross-encoder/ms-marco-MiniLM-L12-v2 | 74.31 | 39.02 | 960 | Version 1 models | | | | cross-encoder/ms-marco-TinyBERT-L2 | 67.43 | 30.15 | 9000 | cross-encoder/ms-marco-TinyBERT-L4 | 68.09 | 34.50 | 2900 | cross-encoder/ms-marco-TinyBERT-L6 | 69.57 | 36.13 | 680 | cross-encoder/ms-marco-electra-base | 71.99 | 36.41 | 340 | Other models | | | | nboost/pt-tinybert-msmarco | 63.63 | 28.80 | 2900 | nboost/pt-bert-base-uncased-msmarco | 70.94 | 34.75 | 340 | nboost/pt-bert-large-msmarco | 73.36 | 36.48 | 100 | Capreolus/electra-base-msmarco | 71.23 | 36.89 | 340 | amberoad/bert-multilingual-passage-reranking-msmarco | 68.40 | 35.54 | 330 | sebastian-hofstaetter/distilbert-cat-marginmse-T2-msmarco | 72.82 | 37.88 | 720
bge-small-en-v1.5
by BAAI
Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License More details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: - Long-Context LLM: Activation Beacon - Fine-tuning of LM : LM-Cocktail - Dense Retrieval: BGE-M3, LLM Embedder, BGE Embedding - Reranker Model: BGE Reranker - Benchmark: C-MTEB News - 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: - 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: - 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: - 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: - 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report - 09/15/2023: The technical report of BGE has been released - 09/15/2023: The massive training data of BGE has been released - 09/12/2023: New models: - New reranker model: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. - update embedding model: release `bge--v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. - 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. - 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. - 08/05/2023: Release base-scale and small-scale models, best performance among the models of the same size 🤗 - 08/02/2023: Release `bge-large-`(short for BAAI General Embedding) Models, rank 1st on MTEB and C-MTEB benchmark! :tada: :tada: - 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. | Model | Language | | Description | query instruction for retrieval [1] | |:-------------------------------|:--------:| :--------:| :--------:|:--------:| | BAAI/bge-m3 | Multilingual | Inference Fine-tune | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | | | BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README | | BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en | English | Inference Fine-tune |a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` | [1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. [2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Following this example to prepare data and fine-tune your model. Some suggestions: - Mine hard negatives following this example, which can improve the retrieval performance. - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). For the `bge--v1.5`, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Here are some examples for using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. For the value of the argument `queryinstructionforretrieval`, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set `os.environ["CUDAVISIBLEDEVICES"]` to select specific GPUs. You also can set `os.environ["CUDAVISIBLEDEVICES"]=""` to make all GPUs unavailable. You can also use the `bge` models with sentence-transformers: For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding. Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Get relevance scores (higher scores indicate more relevance): Usage via infinity Its also possible to deploy the onnx files with the infinityemb pip package. Recommended is `device="cuda", engine="torch"` with flash attention on gpu, and `device="cpu", engine="optimum"` for onnx inference. `baai-general-embedding` models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. | Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) |Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) | |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 | | BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 | | BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 |51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 | | bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 | | bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 | | gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 | | gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 | | e5-large-v2 | 1024| 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 | | bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 | | instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 | | e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 | | gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 | | text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 | | e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 | | sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 | | all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 | | sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 | - C-MTEB: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to CMTEB for a detailed introduction. | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 | | BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 | | BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 | | BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 | | bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 | | BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 | | multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 | | BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 | | m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 | | m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 | | multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 | | multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 | | text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 | | luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 | | text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 | | text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 | | Model | T2Reranking | T2RerankingZh2En\ | T2RerankingEn2Zh\ | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 | | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 | | multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 | | multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 | | m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 | | m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 | | bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 | | bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 | | BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 | | BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 | \ : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baaigeneralembedding. Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). If you find this repository useful, please consider giving a star :star: and citation License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
legal-bert-base-uncased
by nlpaueb
LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. Sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and/or general LEGAL-BERT perform better than using BERT out of the box for domain-specific tasks. A light-weight model (33% the size of BERT-BASE) pre-trained from scratch on legal data with competitive performance is also available. I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. "LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) (Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261) 116,062 documents of EU legislation, publicly available from EURLEX (http://eur-lex.europa.eu), the repository of EU Law running under the EU Publication Office. 61,826 documents of UK legislation, publicly available from the UK legislation portal (http://www.legislation.gov.uk). 19,867 cases from the European Court of Justice (ECJ), also available from EURLEX. 12,554 cases from HUDOC, the repository of the European Court of Human Rights (ECHR) (http://hudoc.echr.coe.int/eng). 164,141 cases from various courts across the USA, hosted in the Case Law Access Project portal (https://case.law). 76,366 US contracts from EDGAR, the database of US Securities and Exchange Commission (SECOM) (https://www.sec.gov/edgar.shtml). We trained BERT using the official code provided in Google BERT's GitHub repository (https://github.com/google-research/bert). We released a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters). We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 with an initial learning rate 1e-4. We were able to use a single Google Cloud TPU v3-8 provided for free from TensorFlow Research Cloud (TFRC), while also utilizing GCP research credits. Huge thanks to both Google programs for supporting us! Part of LEGAL-BERT is a light-weight model pre-trained from scratch on legal data, which achieves comparable performance to larger models, while being much more efficient (approximately 4 times faster) with a smaller environmental footprint. | Model name | Model Path | Training corpora | | ------------------- | ------------------------------------ | ------------------- | | CONTRACTS-BERT-BASE | `nlpaueb/bert-base-uncased-contracts` | US contracts | | EURLEX-BERT-BASE | `nlpaueb/bert-base-uncased-eurlex` | EU legislation | | ECHR-BERT-BASE | `nlpaueb/bert-base-uncased-echr` | ECHR cases | | LEGAL-BERT-BASE | `nlpaueb/legal-bert-base-uncased` | All | | LEGAL-BERT-SMALL | `nlpaueb/legal-bert-small-uncased` | All | \ LEGAL-BERT-BASE is the model referred to as LEGAL-BERT-SC in Chalkidis et al. (2020); a model trained from scratch in the legal corpora mentioned below using a newly created vocabulary by a sentence-piece tokenizer trained on the very same corpora. \\ As many of you expressed interest in the LEGAL-BERT-FP models (those relying on the original BERT-BASE checkpoint), they have been released in Archive.org (https://archive.org/details/legalbertfp), as these models are secondary and possibly only interesting for those who aim to dig deeper in the open questions of Chalkidis et al. (2020). | Corpus | Model | Masked token | Predictions | | --------------------------------- | ---------------------------------- | ------------ | ------------ | | | BERT-BASE-UNCASED | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('new', '0.09'), ('current', '0.04'), ('proposed', '0.03'), ('marketing', '0.03'), ('joint', '0.02') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.32'), ('rape', '0.22'), ('abuse', '0.14'), ('death', '0.04'), ('violence', '0.03') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('farm', '0.25'), ('livestock', '0.08'), ('draft', '0.06'), ('domestic', '0.05'), ('wild', '0.05') | | CONTRACTS-BERT-BASE | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('letter', '0.38'), ('dealer', '0.04'), ('employment', '0.03'), ('award', '0.03'), ('contribution', '0.02') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('death', '0.39'), ('imprisonment', '0.07'), ('contempt', '0.05'), ('being', '0.03'), ('crime', '0.02') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | (('domestic', '0.18'), ('laboratory', '0.07'), ('household', '0.06'), ('personal', '0.06'), ('the', '0.04') | | EURLEX-BERT-BASE | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('supply', '0.11'), ('cooperation', '0.08'), ('service', '0.07'), ('licence', '0.07'), ('distribution', '0.05') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.66'), ('death', '0.07'), ('imprisonment', '0.07'), ('murder', '0.04'), ('rape', '0.02') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('live', '0.43'), ('pet', '0.28'), ('certain', '0.05'), ('fur', '0.03'), ('the', '0.02') | | ECHR-BERT-BASE | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('second', '0.24'), ('latter', '0.10'), ('draft', '0.05'), ('bilateral', '0.05'), ('arbitration', '0.04') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.99'), ('death', '0.01'), ('inhuman', '0.00'), ('beating', '0.00'), ('rape', '0.00') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('pet', '0.17'), ('all', '0.12'), ('slaughtered', '0.10'), ('domestic', '0.07'), ('individual', '0.05') | | LEGAL-BERT-BASE | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('settlement', '0.26'), ('letter', '0.23'), ('dealer', '0.04'), ('master', '0.02'), ('supplemental', '0.02') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '1.00'), ('detention', '0.00'), ('arrest', '0.00'), ('rape', '0.00'), ('death', '0.00') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('live', '0.67'), ('beef', '0.17'), ('farm', '0.03'), ('pet', '0.02'), ('dairy', '0.01') | | LEGAL-BERT-SMALL | | (Contracts) | This [MASK] Agreement is between General Motors and John Murray . | employment | ('license', '0.09'), ('transition', '0.08'), ('settlement', '0.04'), ('consent', '0.03'), ('letter', '0.03') | (ECHR) | The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate | torture | ('torture', '0.59'), ('pain', '0.05'), ('ptsd', '0.05'), ('death', '0.02'), ('tuberculosis', '0.02') | (EURLEX) | Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products . | bovine | ('all', '0.08'), ('live', '0.07'), ('certain', '0.07'), ('the', '0.07'), ('farm', '0.05') Consider the experiments in the article "LEGAL-BERT: The Muppets straight out of Law School". Chalkidis et al., 2020, (https://aclanthology.org/2020.findings-emnlp.261) AUEB's Natural Language Processing Group develops algorithms, models, and systems that allow computers to process and generate natural language texts. The group's current research interests include: question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering, natural language generation from databases and ontologies, especially Semantic Web ontologies, text classification, including filtering spam and abusive content, information extraction and opinion mining, including legal text analytics and sentiment analysis, natural language processing tools for Greek, for example parsers and named-entity recognizers, machine learning in natural language processing, especially deep learning. The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business. Ilias Chalkidis on behalf of AUEB's Natural Language Processing Group | Github: @ilias.chalkidis | Twitter: @KiddoThe2B |
XTTS-v2
by coqui
ⓍTTS ⓍTTS is a Voice generation model that lets you clone voices into different languages by using just a quick 6-second audio clip. There is no need for an excessive amount of training data that spans countless hours. This is the same or similar model to what powers Coqui Studio and Coqui API. Features - Supports 17 languages. - Voice cloning with just a 6-second audio clip. - Emotion and style transfer by cloning. - Cross-language voice cloning. - Multi-lingual speech generation. - 24khz sampling rate. Updates over XTTS-v1 - 2 new languages; Hungarian and Korean - Architectural improvements for speaker conditioning. - Enables the use of multiple speaker references and interpolation between speakers. - Stability improvements. - Better prosody and audio quality across the board. Languages XTTS-v2 supports 17 languages: English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu), Korean (ko) Hindi (hi). Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out! Code The code-base supports inference and fine-tuning. Demo Spaces - XTTS Space : You can see how model performs on supported languages, and try with your own reference or microphone input - XTTS Voice Chat with Mistral or Zephyr : You can experience streaming voice chat with Mistral 7B Instruct or Zephyr 7B Beta | | | | ------------------------------- | --------------------------------------- | | 🐸💬 CoquiTTS | coqui/TTS on Github| | 💼 Documentation | ReadTheDocs | 👩💻 Questions | GitHub Discussions | | 🗯 Community | Discord | License This model is licensed under Coqui Public Model License. There's a lot that goes into a license for generative models, and you can read more of the origin story of CPML here. Contact Come and join in our 🐸Community. We're active on Discord and Twitter. You can also mail us at [email protected].
bge-large-en-v1.5
by BAAI
Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License For more details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: - Long-Context LLM: Activation Beacon - Fine-tuning of LM : LM-Cocktail - Dense Retrieval: BGE-M3, LLM Embedder, BGE Embedding - Reranker Model: BGE Reranker - Benchmark: C-MTEB News - 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model that supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: - 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: - 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: - 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: - 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report - 09/15/2023: The technical report and massive training data of BGE has been released - 09/12/2023: New models: - New reranker model: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. - update embedding model: release `bge--v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. - 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. - 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. - 08/05/2023: Release base-scale and small-scale models, best performance among the models of the same size 🤗 - 08/02/2023: Release `bge-large-`(short for BAAI General Embedding) Models, rank 1st on MTEB and C-MTEB benchmark! :tada: :tada: - 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. | Model | Language | | Description | query instruction for retrieval [1] | |:-------------------------------|:--------:| :--------:| :--------:|:--------:| | BAAI/bge-m3 | Multilingual | Inference Fine-tune | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | | | BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README | | BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en | English | Inference Fine-tune |a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` | [1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. [2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Following this example to prepare data and fine-tune your model. Some suggestions: - Mine hard negatives following this example, which can improve the retrieval performance. - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). For the `bge--v1.5`, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Here are some examples for using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. For the value of the argument `queryinstructionforretrieval`, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set `os.environ["CUDAVISIBLEDEVICES"]` to select specific GPUs. You also can set `os.environ["CUDAVISIBLEDEVICES"]=""` to make all GPUs unavailable. You can also use the `bge` models with sentence-transformers: For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding. Its also possible to deploy the onnx files with the infinityemb pip package. Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Get relevance scores (higher scores indicate more relevance): `baai-general-embedding` models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. | Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) |Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) | |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 | | BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 | | BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 |51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 | | bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 | | bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 | | gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 | | gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 | | e5-large-v2 | 1024| 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 | | bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 | | instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 | | e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 | | gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 | | text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 | | e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 | | sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 | | all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 | | sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 | - C-MTEB: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to CMTEB for a detailed introduction. | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 | | BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 | | BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 | | BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 | | bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 | | BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 | | multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 | | BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 | | m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 | | m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 | | multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 | | multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 | | text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 | | luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 | | text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 | | text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 | | Model | T2Reranking | T2RerankingZh2En\ | T2RerankingEn2Zh\ | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 | | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 | | multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 | | multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 | | m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 | | m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 | | bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 | | bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 | | BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 | | BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 | \ : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baaigeneralembedding. Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). If you find this repository useful, please consider giving a star :star: and citation License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
bert-base-multilingual-uncased
by google-bert
Pretrained model on the top 102 languages with the largest Wikipedia using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English. Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team. BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the languages in the training set that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The BERT model was pretrained on the 102 languages with the largest Wikipedias. You can find the complete list here. The texts are lowercased and tokenized using WordPiece and a shared vocabulary size of 110,000. The languages with a larger Wikipedia are under-sampled and the ones with lower resources are oversampled. For languages like Chinese, Japanese Kanji and Korean Hanja that don't have space, a CJK Unicode block is added around every character. With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is.
twitter-roberta-base-sentiment-latest
by cardiffnlp
Twitter-roBERTa-base for Sentiment Analysis - UPDATED (2022) This is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021, and finetuned for sentiment analysis with the TweetEval benchmark. The original Twitter-based RoBERTa model can be found here and the original reference paper is TweetEval. This model is suitable for English. - Reference Paper: TimeLMs paper. - Git Repo: TimeLMs official repository. Labels : 0 -> Negative; 1 -> Neutral; 2 -> Positive This sentiment analysis model has been integrated into TweetNLP. You can access the demo here.
bge-base-en-v1.5
by BAAI
Model List | FAQ | Usage | Evaluation | Train | Contact | Citation | License For more details please refer to our Github: FlagEmbedding. If you are looking for a model that supports more languages, longer texts, and other retrieval methods, you can try using bge-m3. FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following projects currently: - Long-Context LLM: Activation Beacon - Fine-tuning of LM : LM-Cocktail - Dense Retrieval: BGE-M3, LLM Embedder, BGE Embedding - Reranker Model: BGE Reranker - Benchmark: C-MTEB News - 1/30/2024: Release BGE-M3, a new member to BGE model series! M3 stands for Multi-linguality (100+ languages), Multi-granularities (input length up to 8192), Multi-Functionality (unification of dense, lexical, multi-vec/colbert retrieval). It is the first embedding model which supports all three retrieval methods, achieving new SOTA on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Technical Report and Code. :fire: - 1/9/2024: Release Activation-Beacon, an effective, efficient, compatible, and low-cost (training) method to extend the context length of LLM. Technical Report :fire: - 12/24/2023: Release LLaRA, a LLaMA-7B based dense retriever, leading to state-of-the-art performances on MS MARCO and BEIR. Model and code will be open-sourced. Please stay tuned. Technical Report :fire: - 11/23/2023: Release LM-Cocktail, a method to maintain general capabilities during fine-tuning by merging multiple language models. Technical Report :fire: - 10/12/2023: Release LLM-Embedder, a unified embedding model to support diverse retrieval augmentation needs for LLMs. Technical Report - 09/15/2023: The technical report and massive training data of BGE has been released - 09/12/2023: New models: - New reranker model: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than embedding model. We recommend to use/fine-tune them to re-rank top-k documents returned by embedding models. - update embedding model: release `bge--v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction. - 09/07/2023: Update fine-tune code: Add script to mine hard negatives and support adding instruction during fine-tuning. - 08/09/2023: BGE Models are integrated into Langchain, you can use it like this; C-MTEB leaderboard is available. - 08/05/2023: Release base-scale and small-scale models, best performance among the models of the same size 🤗 - 08/02/2023: Release `bge-large-`(short for BAAI General Embedding) Models, rank 1st on MTEB and C-MTEB benchmark! :tada: :tada: - 08/01/2023: We release the Chinese Massive Text Embedding Benchmark (C-MTEB), consisting of 31 test dataset. | Model | Language | | Description | query instruction for retrieval [1] | |:-------------------------------|:--------:| :--------:| :--------:|:--------:| | BAAI/bge-m3 | Multilingual | Inference Fine-tune | Multi-Functionality(dense retrieval, sparse retrieval, multi-vector(colbert)), Multi-Linguality, and Multi-Granularity(8192 tokens) | | | BAAI/llm-embedder | English | Inference Fine-tune | a unified embedding model to support diverse retrieval augmentation needs for LLMs | See README | | BAAI/bge-reranker-large | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-reranker-base | Chinese and English | Inference Fine-tune | a cross-encoder model which is more accurate but less efficient [2] | | | BAAI/bge-large-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en-v1.5 | English | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh-v1.5 | Chinese | Inference Fine-tune | version 1.5 with more reasonable similarity distribution | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-large-en | English | Inference Fine-tune | :trophy: rank 1st in MTEB leaderboard | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-base-en | English | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-en` | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-small-en | English | Inference Fine-tune |a small-scale model but with competitive performance | `Represent this sentence for searching relevant passages: ` | | BAAI/bge-large-zh | Chinese | Inference Fine-tune | :trophy: rank 1st in C-MTEB benchmark | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-base-zh | Chinese | Inference Fine-tune | a base-scale model but with similar ability to `bge-large-zh` | `为这个句子生成表示以用于检索相关文章:` | | BAAI/bge-small-zh | Chinese | Inference Fine-tune | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` | [1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, no instruction needs to be added to passages. [2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models. For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results. All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI. If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models . Following this example to prepare data and fine-tune your model. Some suggestions: - Mine hard negatives following this example, which can improve the retrieval performance. - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity. - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives also are needed to fine-tune reranker. 2. The similarity score between two dissimilar sentences is higher than 0.5 Suggest to use bge v1.5, which alleviates the issue of the similarity distribution. Since we finetune the models by contrastive learning with a temperature of 0.01, the similarity distribution of the current BGE model is about in the interval \[0.6, 1\]. So a similarity score greater than 0.5 does not indicate that the two sentences are similar. For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value. If you need to filter similar sentences based on a similarity threshold, please select an appropriate similarity threshold based on the similarity distribution on your data (such as 0.8, 0.85, or even 0.9). For the `bge--v1.5`, we improve its retrieval ability when not using instruction. No instruction only has a slight degradation in retrieval performance compared with using instruction. So you can generate embedding without instruction in all cases for convenience. For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries. The best method to decide whether to add instructions for queries is choosing the setting that achieves better performance on your task. In all cases, the documents/passages do not need to add the instruction. Here are some examples for using `bge` models with FlagEmbedding, Sentence-Transformers, Langchain, or Huggingface Transformers. If it doesn't work for you, you can see FlagEmbedding for more methods to install FlagEmbedding. For the value of the argument `queryinstructionforretrieval`, see Model List. By default, FlagModel will use all available GPUs when encoding. Please set `os.environ["CUDAVISIBLEDEVICES"]` to select specific GPUs. You also can set `os.environ["CUDAVISIBLEDEVICES"]=""` to make all GPUs unavailable. You can also use the `bge` models with sentence-transformers: For s2p(short query to long passage) retrieval task, each short query should start with an instruction (instructions see Model List). But the instruction is not needed for passages. With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding. Usage via infinity Its also possible to deploy the onnx files with the infinityemb pip package. Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. You can get a relevance score by inputting query and passage to the reranker. The reranker is optimized based cross-entropy loss, so the relevance score is not bounded to a specific range. Get relevance scores (higher scores indicate more relevance): `baai-general-embedding` models achieve state-of-the-art performance on both MTEB and C-MTEB leaderboard! For more details and evaluation tools see our scripts. | Model Name | Dimension | Sequence Length | Average (56) | Retrieval (15) |Clustering (11) | Pair Classification (3) | Reranking (4) | STS (10) | Summarization (1) | Classification (12) | |:----:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| | BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.23 | 54.29 | 46.08 | 87.12 | 60.03 | 83.11 | 31.61 | 75.97 | | BAAI/bge-base-en-v1.5 | 768 | 512 | 63.55 | 53.25 | 45.77 | 86.55 | 58.86 | 82.4 | 31.07 | 75.53 | | BAAI/bge-small-en-v1.5 | 384 | 512 | 62.17 |51.68 | 43.82 | 84.92 | 58.36 | 81.59 | 30.12 | 74.14 | | bge-large-en | 1024 | 512 | 63.98 | 53.9 | 46.98 | 85.8 | 59.48 | 81.56 | 32.06 | 76.21 | | bge-base-en | 768 | 512 | 63.36 | 53.0 | 46.32 | 85.86 | 58.7 | 81.84 | 29.27 | 75.27 | | gte-large | 1024 | 512 | 63.13 | 52.22 | 46.84 | 85.00 | 59.13 | 83.35 | 31.66 | 73.33 | | gte-base | 768 | 512 | 62.39 | 51.14 | 46.2 | 84.57 | 58.61 | 82.3 | 31.17 | 73.01 | | e5-large-v2 | 1024| 512 | 62.25 | 50.56 | 44.49 | 86.03 | 56.61 | 82.05 | 30.19 | 75.24 | | bge-small-en | 384 | 512 | 62.11 | 51.82 | 44.31 | 83.78 | 57.97 | 80.72 | 30.53 | 74.37 | | instructor-xl | 768 | 512 | 61.79 | 49.26 | 44.74 | 86.62 | 57.29 | 83.06 | 32.32 | 61.79 | | e5-base-v2 | 768 | 512 | 61.5 | 50.29 | 43.80 | 85.73 | 55.91 | 81.05 | 30.28 | 73.84 | | gte-small | 384 | 512 | 61.36 | 49.46 | 44.89 | 83.54 | 57.7 | 82.07 | 30.42 | 72.31 | | text-embedding-ada-002 | 1536 | 8192 | 60.99 | 49.25 | 45.9 | 84.89 | 56.32 | 80.97 | 30.8 | 70.93 | | e5-small-v2 | 384 | 512 | 59.93 | 49.04 | 39.92 | 84.67 | 54.32 | 80.39 | 31.16 | 72.94 | | sentence-t5-xxl | 768 | 512 | 59.51 | 42.24 | 43.72 | 85.06 | 56.42 | 82.63 | 30.08 | 73.42 | | all-mpnet-base-v2 | 768 | 514 | 57.78 | 43.81 | 43.69 | 83.04 | 59.36 | 80.28 | 27.49 | 65.07 | | sgpt-bloom-7b1-msmarco | 4096 | 2048 | 57.59 | 48.22 | 38.93 | 81.9 | 55.65 | 77.74 | 33.6 | 66.19 | - C-MTEB: We create the benchmark C-MTEB for Chinese text embedding which consists of 31 datasets from 6 tasks. Please refer to CMTEB for a detailed introduction. | Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 | | BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 | | BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 | | BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 | | bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 | | BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 | | multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 | | BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 | | m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 | | m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 | | multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 | | multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 | | text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 | | luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 | | text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 | | text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 | | Model | T2Reranking | T2RerankingZh2En\ | T2RerankingEn2Zh\ | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg | |:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:| | text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 | | multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 | | multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 | | multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 | | m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 | | m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 | | bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 | | bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 | | BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 | | BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 | \ : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks We pre-train the models using retromae and train them on large-scale pairs data using contrastive learning. You can fine-tune the embedding model on your data following our examples. We also provide a pre-train example. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned. More training details for bge see baaigeneralembedding. Cross-encoder will perform full-attention over the input pair, which is more accurate than embedding model (i.e., bi-encoder) but more time-consuming than embedding model. Therefore, it can be used to re-rank the top-k documents returned by embedding model. We train the cross-encoder on a multilingual pair data, The data format is the same as embedding model, so you can fine-tune it easily following our example. More details please refer to ./FlagEmbedding/reranker/README.md Contact If you have any question or suggestion related to this project, feel free to open an issue or pull request. You also can email Shitao Xiao([email protected]) and Zheng Liu([email protected]). If you find this repository useful, please consider giving a star :star: and citation License FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
opt-125m
by facebook
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. As such, it was pretrained using the self-supervised causal language modedling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use the top-k sampling, please set `dosample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral the model is strongly biased : > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. - Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contains offensive content as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected form internet, and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like Chapter One or This ebook by Project Gutenberg. The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly ~33 days of continuous training.
bert-base-cased
by google-bert
Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English. Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team. BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). The texts are tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta{1} = 0.9\\) and \\(\beta{2} = 0.999\\), a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results: | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average | |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:| | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
bge-reranker-v2-m3
by BAAI
--- license: apache-2.0 pipeline_tag: text-classification tags: - transformers - sentence-transformers - text-embeddings-inference language: - multilingual ---
bart-large-mnli
by facebook
This is the checkpoint for bart-large after being trained on the MultiNLI (MNLI) dataset. Additional information about this model: - The bart-large model page - BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Yin et al. proposed a method for using pre-trained NLI models as a ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and to construct a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct a hypothesis of `This text is about politics.`. The probabilities for entailment and contradiction are then converted to label probabilities. This method is surprisingly effective in many cases, particularly when used with larger pre-trained models like BART and Roberta. See this blog post for a more expansive introduction to this and other zero shot methods, and see the code snippets below for examples of using this model for zero-shot classification both with Hugging Face's built-in pipeline and with native Transformers/PyTorch code. The model can be loaded with the `zero-shot-classification` pipeline like so: You can then use this pipeline to classify sequences into any of the class names you specify. If more than one candidate label can be correct, pass `multilabel=True` to calculate each class independently:
vitpose-plus-base
by usyd-community
--- library_name: transformers license: apache-2.0 language: - en pipeline_tag: keypoint-detection ---
sd-turbo
by stabilityai
SD-Turbo is a fast generative text-to-image model that can synthesize photorealistic images from a text prompt in a single network evaluation. We release SD-Turbo as a research artifact, and to study small, distilled text-to-image models. For increased quality and prompt understanding, we recommend SDXL-Turbo. Please note: For commercial use, please refer to https://stability.ai/license. Model Description SD-Turbo is a distilled version of Stable Diffusion 2.1, trained for real-time synthesis. SD-Turbo is based on a novel training method called Adversarial Diffusion Distillation (ADD) (see the technical report), which allows sampling large-scale foundational image diffusion models in 1 to 4 steps at high image quality. This approach uses score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal and combines this with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. - Developed by: Stability AI - Funded by: Stability AI - Model type: Generative text-to-image model - Finetuned from model: Stable Diffusion 2.1 For research purposes, we recommend our `generative-models` Github repository (https://github.com/Stability-AI/generative-models), which implements the most popular diffusion frameworks (both training and inference). - Repository: https://github.com/Stability-AI/generative-models - Paper: https://stability.ai/research/adversarial-diffusion-distillation - Demo [for the bigger SDXL-Turbo]: http://clipdrop.co/stable-diffusion-turbo The charts above evaluate user preference for SD-Turbo over other single- and multi-step models. SD-Turbo evaluated at a single step is preferred by human voters in terms of image quality and prompt following over LCM-Lora XL and LCM-Lora 1.5. Note: For increased quality, we recommend the bigger version SDXL-Turbo. For details on the user study, we refer to the research paper. The model is intended for both non-commercial and commercial usage. Possible research areas and tasks include - Research on generative models. - Research on real-time applications of generative models. - Research on the impact of real-time generative models. - Safe deployment of models which have the potential to generate harmful content. - Probing and understanding the limitations and biases of generative models. - Generation of artworks and use in design and other artistic processes. - Applications in educational or creative tools. For commercial use, please refer to https://stability.ai/membership. SD-Turbo does not make use of `guidancescale` or `negativeprompt`, we disable it with `guidancescale=0.0`. Preferably, the model generates images of size 512x512 but higher image sizes work as well. A single step is enough to generate high quality images. When using SD-Turbo for image-to-image generation, make sure that `numinferencesteps` `strength` is larger or equal to 1. The image-to-image pipeline will run for `int(numinferencesteps strength)` steps, e.g. 0.5 2.0 = 1 step in our example below. The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model. The model should not be used in any way that violates Stability AI's Acceptable Use Policy. Limitations - The quality and prompt alignment is lower than that of SDXL-Turbo. - The generated images are of a fixed resolution (512x512 pix), and the model does not achieve perfect photorealism. - The model cannot render legible text. - Faces and people in general may not be generated properly. - The autoencoding part of the model is lossy. The model is intended for both non-commercial and commercial usage. Check out https://github.com/Stability-AI/generative-models