👤
biometrics

Age Group Detection

Detect age groups from facial images with a reported accuracy of about 59%. Useful for demographics analysis, age-restricted content, and personalization.

20
Models Found

Common Applications

Demographics analysis
Age-restricted content access
Personalized advertising
Market research
Security & verification

Top Models

20 models • Sorted by downloads
#1

all-MiniLM-L6-v2

by sentence-transformers

all-MiniLM-L6-v2 This is a sentence-transformers model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. Usage (Sentence-Transformers): using this model is straightforward once you have sentence-transformers installed. Usage (HuggingFace Transformers): without sentence-transformers, you pass your input through the transformer model and then apply the right pooling operation on top of the contextualized word embeddings. The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We used the pretrained `nreimers/MiniLM-L6-H384-uncased` model and fine-tuned it on a dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset. We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks. Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks. By default, input text longer than 256 word pieces is truncated. We use the pretrained `nreimers/MiniLM-L6-H384-uncased` model. Please refer to the model card for more detailed information about the pre-training procedure. We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between each possible sentence pair from the batch and then apply the cross-entropy loss by comparing with the true pairs. We trained our model on a TPU v3-8 for 100k steps using a batch size of 1024 (128 per TPU core), a learning rate warm-up over 500 steps, a sequence length limited to 128 tokens, and the AdamW optimizer with a 2e-5 learning rate. The full training script is accessible in this repository: `train_script.py`. We use a concatenation of multiple datasets to fine-tune our model; the total number of sentence pairs is above 1 billion. We sampled each dataset given a weighted probability, the configuration of which is detailed in the `data_config.json` file.
| Dataset | Paper | Number of training tuples |
|--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
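
A minimal usage sketch with the `sentence-transformers` package (assumed installed), encoding a few sentences into 384-dimensional vectors:

```python
from sentence_transformers import SentenceTransformer

# Load the checkpoint from the Hub and encode sentences into dense vectors.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)
```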

145.8M downloads
4.1K likes
PYTORCH

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset. For a detailed description and experimental results, please refer to our paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
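
A hedged sketch of the discriminator objective described above, scoring each token as original vs. replaced with Hugging Face `transformers`; the checkpoint name is an assumption, since the listing does not name the exact ELECTRA variant:

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Illustrative checkpoint; substitute the variant actually listed here.
model_name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

tokens = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    logits = model(**tokens).logits  # one score per token; > 0 suggests a "replaced" (fake) token
print((logits > 0).int().squeeze().tolist())
```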

74.5M downloads
67 likes
PYTORCH
#3

Model Card: Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification The Fine-Tuned Vision Transformer (ViT) is a variant of the transformer encoder architecture, similar to BERT, that has been adapted for image classification tasks. This specific model, named "google/vit-base-patch16-224-in21k," is pre-trained on a substantial collection of images in a supervised manner, leveraging the ImageNet-21k dataset. The images in the pre-training dataset are resized to a resolution of 224x224 pixels, making it suitable for a wide range of image recognition tasks. During the training phase, careful attention was given to hyperparameter settings to ensure optimal model performance. The model was fine-tuned with a batch size of 16, which balanced computational efficiency while still letting the model process and learn from a diverse array of images. A learning rate of 5e-5 was employed; the learning rate dictates the magnitude of adjustments made to the model's parameters during training, and this value was chosen to balance rapid convergence with steady optimization, so the model learns quickly while continuing to refine its capabilities throughout training. This training phase was executed using a proprietary dataset containing an extensive collection of 80,000 images, each characterized by a substantial degree of variability. The dataset was curated to include two distinct classes, namely "normal" and "nsfw." This diversity allowed the model to grasp nuanced visual patterns, equipping it with the competence to accurately differentiate between safe and explicit content. The overarching objective of this training process was to give the model a deep understanding of visual cues, ensuring its robustness and competence in tackling the specific task of NSFW image classification. The result is a model that stands ready to contribute significantly to content safety and moderation while maintaining high standards of accuracy and reliability. Intended Uses & Limitations Intended Uses - NSFW Image Classification: The primary intended use of this model is for the classification of NSFW (Not Safe for Work) images. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications. How to use: here is how to use this model to classify an image based on 1 of 2 classes (normal, nsfw). Evaluation results:
- `eval_loss`: 0.07463177293539047
- `eval_accuracy`: 0.980375
- `eval_runtime`: 304.9846
- `eval_samples_per_second`: 52.462
- `eval_steps_per_second`: 3.279

Note: It's essential to use this model responsibly and ethically, adhering to content guidelines and applicable regulations when implementing it in real-world applications, particularly those involving potentially sensitive content. For more details on model fine-tuning and usage, please refer to the model's documentation and the model hub. - Hugging Face Model Hub - Vision Transformer (ViT) Paper - ImageNet-21k Dataset Disclaimer: The model's performance may be influenced by the quality and representativeness of the data it was fine-tuned on. Users are encouraged to assess the model's suitability for their specific applications and datasets.
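
A minimal classification sketch with the `transformers` pipeline; the repo id is a placeholder, since the listing does not give the exact model id:

```python
from transformers import pipeline

# Placeholder repo id for the fine-tuned ViT NSFW classifier described above.
classifier = pipeline("image-classification", model="<nsfw-vit-repo-id>")

# Returns a list of label/score dicts, e.g. [{'label': 'normal', 'score': ...}, {'label': 'nsfw', 'score': ...}]
print(classifier("photo.jpg"))
```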

70.7M downloads
890 likes
PYTORCH
#4

bert-base-uncased

by google-bert

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English. Disclaimer: The team releasing BERT did not write a model card for this model, so this model card has been written by the Hugging Face team. BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: - Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs. BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers. Chinese and multilingual uncased and cased versions followed shortly after. Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of two models. Another 24 smaller models were released afterward. The detailed release history can be found on the google-research/bert readme on GitHub.

| Model | #params | Language |
|------------------------|--------------------------------|-------|
| `bert-base-uncased` | 110M | English |
| `bert-large-uncased` | 340M | English |
| `bert-base-cased` | 110M | English |
| `bert-large-cased` | 340M | English |
| `bert-base-chinese` | 110M | Chinese |
| `bert-base-multilingual-cased` | 110M | Multiple |
| `bert-large-uncased-whole-word-masking` | 340M | English |
| `bert-large-cased-whole-word-masking` | 340M | English |

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT2.
You can use this model directly with a pipeline for masked language modeling, or use it to get the features of a given text in PyTorch, as shown in the sketch below. Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions; this bias will also affect all fine-tuned versions of this model. The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and English Wikipedia (excluding lists, tables and headers). The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: with probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace). - In the remaining 10% of cases, the masked tokens are left as is. The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
| | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
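
A short sketch of the two usage paths mentioned above (fill-mask pipeline and feature extraction), assuming PyTorch and `transformers`:

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModel

# Masked language modeling via the fill-mask pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Hello I'm a [MASK] model."))

# Extracting features (hidden states) for a given text
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, 768)
```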

58.7M downloads
2.5K likes
PYTORCH

Detects the age group of a person from a facial image with about 59% accuracy. See https://www.kaggle.com/code/dima806/age-group-image-classification-vit for details.
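
A minimal sketch of how such an image-classification checkpoint is typically used with the `transformers` pipeline; the repo id is a placeholder, since the listing links only to the Kaggle notebook:

```python
from transformers import pipeline

# Placeholder repo id for the ViT age-group classifier described above.
age_classifier = pipeline("image-classification", model="<age-group-vit-repo-id>")

# Returns a list of age-group labels with confidence scores for the given face image.
print(age_classifier("face.jpg"))
```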

26.2M downloads
52 likes
OTHER

A MobileNet-V3 image classification model, trained on ImageNet-1k in `timm` using the recipe template described below. Recipe details: a LAMB-optimizer-based recipe similar to ResNet Strikes Back `A2`, but 50% longer, with EMA weight averaging and no CutMix; step (exponential decay w/ staircase) LR schedule with warmup. Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 2.5 - GMACs: 0.1 - Activations (M): 1.4 - Image size: 224 x 224 - Papers: - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244 - Dataset: ImageNet-1k - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results.
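
A hedged usage sketch with `timm`; the model tag is an assumption based on the stats above (a ~2.5M-parameter MobileNet-V3 trained with the LAMB recipe):

```python
import timm
import torch
from PIL import Image

# Illustrative timm tag; substitute the exact checkpoint this listing refers to.
model = timm.create_model("mobilenetv3_small_100.lamb_in1k", pretrained=True).eval()
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

img = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    probs = model(transform(img).unsqueeze(0)).softmax(dim=-1)
print(probs.topk(5))  # top-5 ImageNet-1k class probabilities and indices
```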

23.4M downloads
41 likes
PYTORCH

Disclaimer: The model card is taken and modified from the official CLIP repository, it can be found here. The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. 
However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision datasets such as OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets: - Food101 - CIFAR10 - CIFAR100 - Birdsnap - SUN397 - Stanford Cars - FGVC Aircraft - VOC2007 - DTD - Oxford-IIIT Pet dataset - Caltech101 - Flowers102 - MNIST - SVHN - IIIT5K - Hateful Memes - SST-2 - UCF101 - Kinetics700 - Country211 - CLEVR Counting - KITTI Distance - STL-10 - RareAct - Flickr30 - MSCOCO - ImageNet - ImageNet-A - ImageNet-R - ImageNet Sketch - ObjectNet (ImageNet Overlap) - Youtube-BB - ImageNet-Vid CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks. Where to send questions or comments about the model
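
A brief zero-shot classification sketch, assuming the `openai/clip-vit-base-patch32` checkpoint corresponds to the ViT-B/32 variant described above:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed id for the ViT-B/32 variant
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
print(logits.softmax(dim=-1))  # probability over the candidate labels
```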

20.3M downloads
804 likes
PYTORCH

Model card for CLAP: Contrastive Language-Audio Pretraining. > Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public. You can use this model for zero-shot audio classification or for extracting audio and/or textual features. You can also get the audio and text embeddings using `ClapModel`. If you are using this model for your work, please consider citing the original paper.
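
A minimal zero-shot audio classification sketch via the `transformers` pipeline; the checkpoint id is an assumption, as the listing does not name the exact CLAP repo:

```python
from transformers import pipeline

# Illustrative CLAP checkpoint; substitute the repo listed above.
classifier = pipeline("zero-shot-audio-classification", model="laion/clap-htsat-unfused")

# Pass a path to a local audio file and a set of candidate labels; results are ranked by audio-text similarity.
result = classifier("dog_bark.wav", candidate_labels=["a dog barking", "a vacuum cleaner humming"])
print(result)
```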

19.0M downloads
40 likes
PYTORCH
#9

all-mpnet-base-v2

by sentence-transformers

all-mpnet-base-v2 This is a sentence-transformers model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. Usage (Sentence-Transformers): using this model is straightforward once you have sentence-transformers installed. Usage (HuggingFace Transformers): without sentence-transformers, you pass your input through the transformer model and then apply the right pooling operation on top of the contextualized word embeddings. Text Embeddings Inference (TEI) is a blazing fast inference solution for text embedding models; send a request to `/v1/embeddings` to generate embeddings via the OpenAI Embeddings API, or check the Text Embeddings Inference API specification instead. The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. We used the pretrained `microsoft/mpnet-base` model and fine-tuned it on a dataset of 1B sentence pairs. We use a contrastive learning objective: given a sentence from the pair, the model should predict which out of a set of randomly sampled other sentences was actually paired with it in our dataset. We developed this model during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members about efficient deep learning frameworks. Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks. By default, input text longer than 384 word pieces is truncated. We use the pretrained `microsoft/mpnet-base` model. Please refer to the model card for more detailed information about the pre-training procedure. We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity between each possible sentence pair from the batch and then apply the cross-entropy loss by comparing with the true pairs. We trained our model on a TPU v3-8 for 100k steps using a batch size of 1024 (128 per TPU core), a learning rate warm-up over 500 steps, a sequence length limited to 128 tokens, and the AdamW optimizer with a 2e-5 learning rate. The full training script is accessible in this repository: `train_script.py`. We use a concatenation of multiple datasets to fine-tune our model; the total number of sentence pairs is above 1 billion. We sampled each dataset given a weighted probability, the configuration of which is detailed in the `data_config.json` file.
| Dataset | Paper | Number of training tuples |
|--------------------------------------------------------|:----------------------------------------:|:--------------------------:|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
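
A sketch of the "without sentence-transformers" path described above: run the encoder, then mean-pool the token embeddings over the attention mask (assumes PyTorch and `transformers`):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

sentences = ["This is an example sentence", "Each sentence is converted"]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)
embeddings = F.normalize(mean_pooling(output, encoded["attention_mask"]), p=2, dim=1)
print(embeddings.shape)  # (2, 768)
```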

18.0M downloads
1.2K likes
PYTORCH
#10

This model is a distilled version of the BERT base model. It was introduced in this paper. The code for the distillation process can be found here. This model is uncased: it does not make a difference between english and English. DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives: - Distillation loss: the model was trained to return the same probabilities as the BERT base model. - Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. - Cosine embedding loss: the model was also trained to generate hidden states as close as possible as the BERT base model. This way, the model learns the same inner representation of the English language than its teacher model, while being faster for inference or downstream tasks. You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions. It also inherits some of the bias of its teacher model. This bias will also affect all fine-tuned versions of this model. DistilBERT pretrained on the same data as BERT, which is BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form: With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens. The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by `[MASK]`. 
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace). - In the remaining 10% of cases, the masked tokens are left as is. The model was trained on 8 × 16 GB V100 GPUs for 90 hours. See the training code for all hyperparameter details. When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 82.2 | 88.5 | 89.2 | 91.3 | 51.3 | 85.8 | 87.5 | 59.9 |

15.4M downloads
787 likes
PYTORCH
#11

roberta-large

by FacebookAI

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English. Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The RoBERTa model was pretrained on the reunion of five datasets: - BookCorpus, a dataset consisting of 11,038 unpublished books; - English Wikipedia (excluding lists, tables and headers) ; - CC-News, a dataset containing 63 millions English news articles crawled between September 2016 and February 2019. - OpenWebText, an opensource recreation of the WebText dataset used to train GPT-2, - Stories a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of the model take pieces of 512 contiguous token that may span over documents. The beginning of a new document is marked with ` ` and the end of one by ` ` The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by ` `. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. 
Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed). The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 4e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 30,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 96.4 | 90.9 | 86.6 |

15.2M downloads
254 likes
PYTORCH
#12

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search. Using this model becomes easy when you have sentence-transformers installed: Usage (HuggingFace Transformers) Without sentence-transformers, you can use the model like this: First, you pass your input through the transformer model, then you have to apply the right pooling-operation on-top of the contextualized word embeddings. If you find this model helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

13.7M downloads
1.1K likes
PYTORCH

Disclaimer: The model card is taken and modified from the official CLIP repository, it can be found here. The CLIP model was developed by researchers at OpenAI to learn about what contributes to robustness in computer vision tasks. The model was also developed to test the ability of models to generalize to arbitrary image classification tasks in a zero-shot manner. It was not developed for general model deployment - to deploy models like CLIP, researchers will first need to carefully study their capabilities in relation to the specific context they’re being deployed within. The base model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The original implementation had two variants: one using a ResNet image encoder and the other using a Vision Transformer. This repository has the variant with the Vision Transformer. The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. The primary intended users of these models are AI researchers. We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models. Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task specific testing especially given the variability of CLIP’s performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of performance of the model. This is because the use of artificial intelligence for tasks such as these can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. The model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly-used pre-existing image datasets such as YFCC100M. A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet which tend to skew towards more developed nations, and younger, male users. Our goal with building this dataset was to test out robustness and generalizability in computer vision tasks. As a result, the focus was on gathering large quantities of data from different publicly-available internet data sources. The data was gathered in a mostly non-interventionist manner. 
However, we only crawled websites that had policies against excessively violent and adult images and allowed us to filter out such content. We do not intend for this dataset to be used as the basis for any commercial or deployed model and will not be releasing the dataset. We have evaluated the performance of CLIP on a wide range of benchmarks across a variety of computer vision datasets such as OCR to texture recognition to fine-grained classification. The paper describes model performance on the following datasets: - Food101 - CIFAR10 - CIFAR100 - Birdsnap - SUN397 - Stanford Cars - FGVC Aircraft - VOC2007 - DTD - Oxford-IIIT Pet dataset - Caltech101 - Flowers102 - MNIST - SVHN - IIIT5K - Hateful Memes - SST-2 - UCF101 - Kinetics700 - Country211 - CLEVR Counting - KITTI Distance - STL-10 - RareAct - Flickr30 - MSCOCO - ImageNet - ImageNet-A - ImageNet-R - ImageNet Sketch - ObjectNet (ImageNet Overlap) - Youtube-BB - ImageNet-Vid CLIP and our analysis of it have a number of limitations. CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section. Additionally, our approach to testing CLIP also has an important limitation- in many cases we have used linear probes to evaluate the performance of CLIP and there is evidence suggesting that linear probes can underestimate model performance. We find that the performance of CLIP - and the specific biases it exhibits - can depend significantly on class design and the choices one makes for categories to include and exclude. We tested the risk of certain kinds of denigration with CLIP by classifying images of people from Fairface into crime-related and non-human animal categories. We found significant disparities with respect to race and gender. Additionally, we found that these disparities could shift based on how the classes were constructed. (Details captured in the Broader Impacts Section in the paper). We also tested the performance of CLIP on gender, race and age classification using the Fairface dataset (We default to using race categories as they are constructed in the Fairface dataset.) in order to assess quality of performance across different demographics. We found accuracy >96% across all races for gender classification with ‘Middle Eastern’ having the highest accuracy (98.4%) and ‘White’ having the lowest (96.5%). Additionally, CLIP averaged ~93% for racial classification and ~63% for age classification. Our use of evaluations to test for gender, race and age classification as well as denigration harms is simply to evaluate performance of the model across people and surface potential risks and not to demonstrate an endorsement/enthusiasm for such tasks. Where to send questions or comments about the model

12.5M downloads
1.9K likes
PYTORCH
#14

gpt2

by openai-community

Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page. Disclaimer: The team releasing GPT-2 also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias. GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences. More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt. This is the smallest version of GPT-2, with 124M parameters. You can use the raw model for text generation or fine-tune it to a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the openAI team themselves point out in their model card: > Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases > that require the generated text to be true. > > Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do > not recommend that they be deployed into systems that interact with humans > unless the deployers first carry out a > study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, > and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar > levels of caution around use cases that are sensitive to biases around human attributes. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weights 40GB of texts but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here. 
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens. The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact details of training. The model achieves the following results without any fine-tuning (zero-shot):

| Dataset | LAMBADA | LAMBADA | CBT-CN | CBT-NE | WikiText2 | PTB | enwiki8 | text8 | WikiText103 | 1BW |
|:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:|
| (metric) | (PPL) | (ACC) | (ACC) | (ACC) | (PPL) | (PPL) | (BPB) | (BPC) | (PPL) | (PPL) |
| | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
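
A short text-generation sketch following the card's note about setting a seed for reproducibility:

```python
from transformers import pipeline, set_seed

# Generation is stochastic, so fix a seed to make the sampled continuations reproducible.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)
print(generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3))
```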

11.8M downloads
3.0K likes
PYTORCH
#15

roberta-base

by FacebookAI

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English. Disclaimer: The team releasing RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team. RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the BERT model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The RoBERTa model was pretrained on the reunion of five datasets: - BookCorpus, a dataset consisting of 11,038 unpublished books; - English Wikipedia (excluding lists, tables and headers) ; - CC-News, a dataset containing 63 millions English news articles crawled between September 2016 and February 2019. - OpenWebText, an opensource recreation of the WebText dataset used to train GPT-2, - Stories a dataset containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. The texts are tokenized using a byte version of Byte-Pair Encoding (BPE) and a vocabulary size of 50,000. The inputs of the model take pieces of 512 contiguous tokens that may span over documents. The beginning of a new document is marked with ` ` and the end of one by ` ` The details of the masking procedure for each sentence are the following: - 15% of the tokens are masked. - In 80% of the cases, the masked tokens are replaced by ` `. - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace. - In the 10% remaining cases, the masked tokens are left as is. 
Contrary to BERT, the masking is done dynamically during pretraining (e.g., it changes at each epoch and is not fixed). The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and \\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning rate after. When fine-tuned on downstream tasks, this model achieves the following results:

| Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
| | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 |

11.3M downloads
532 likes
PYTORCH
#16

ESM-2 is a state-of-the-art protein model trained on a masked language modelling objective. It is suitable for fine-tuning on a wide range of tasks that take protein sequences as input. For detailed information on the model architecture and training data, please refer to the accompanying paper. You may also be interested in some demo notebooks (PyTorch, TensorFlow) which demonstrate how to fine-tune ESM-2 models on your tasks of interest. Several ESM-2 checkpoints are available on the Hub with varying sizes. Larger sizes generally have somewhat better accuracy, but require much more memory and time to train:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|----|----------|
| esm2_t48_15B_UR50D | 48 | 15B |
| esm2_t36_3B_UR50D | 36 | 3B |
| esm2_t33_650M_UR50D | 33 | 650M |
| esm2_t30_150M_UR50D | 30 | 150M |
| esm2_t12_35M_UR50D | 12 | 35M |
| esm2_t6_8M_UR50D | 6 | 8M |
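
A minimal masked-language-modeling sketch with `transformers`; the smallest checkpoint is used here purely for illustration, and the protein sequence is an arbitrary example:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Smallest public ESM-2 checkpoint; swap in a larger one from the table above as needed.
model_id = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example protein sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # per-residue scores over the amino-acid vocabulary
print(logits.shape)
```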

11.0M downloads
57 likes
PYTORCH
#17

Tarsier2-Recap-7b

by omni-research

Tarsier Model Card Introduction Tarsier2-Recap-7b is built upon Qwen2-VL-7B-Instruct by distilling the video description capabilities of Tarsier2-7b. Specifically, we fine-tuned Qwen2-VL-7B-Instruct on Tarsier2-Recap-585K for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b shares a similar video captioning ability with Tarsier2-7b, reaching an overall F1 score of 40.7% on DREAM-1K, which is only behind Tarsier2-7b (42.0%) and surpasses GPT-4o's 39.2%. See the Tarsier2 technical report for more details. Note: Please use Tarsier2-7b if you need the full-blooded Tarsier2. Model details - Base Model: Qwen2-VL-7B-Instruct - Training Data: Tarsier2-Recap-585K Model date: Tarsier2-Recap-7b was trained in December 2024. Paper or resources for more information: - github repo: https://github.com/bytedance/tarsier/tree/tarsier2 - paper link: https://arxiv.org/abs/2501.07888 - leaderboard: https://tarsier-vlm.github.io/ Intended use Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description. Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. Model Performance Video Description We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos, assessing the model's ability to describe fine-grained actions and events. Note: The results of Tarsier2-Recap-7b are different from the results we reported in Table 11 of the Tarsier2 technical report, as Tarsier2-Recap-7b is more fully trained (2 epochs vs. 1 epoch). Video Question-Answering We evaluate Tarsier2-Recap-7b on TVBench, a novel multiple-choice question-answering benchmark which requires a high level of temporal understanding. As Tarsier2-Recap-7b is only trained with video caption data, it needs an additional prompt to induce it to perform multiple-choice question-answering tasks; see the TVBench samples as an example. Here is the evaluation result:

| Task | Tarsier2-Recap-7b | Tarsier2-7b |
| ------- | :--------: | :-------: |
| Action Antonym | 91.2 | 94.1 |
| Action Count | 43.1 | 40.5 |
| Action Localization | 42.5 | 37.5 |
| Action Sequence | 70.5 | 72.3 |
| Egocentric Sequence | 22.0 | 24.5 |
| Moving Direction | 37.1 | 33.2 |
| Object Count | 46.6 | 62.8 |
| Object Shuffle | 36.9 | 31.6 |
| Scene Transition | 85.9 | 88.1 |
| Unexpected Action | 28.0 | 41.5 |
| OVERALL | 54.0 | 54.7 |

How to Use: see https://github.com/bytedance/tarsier?tab=readme-ov-file#usage. Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues

10.3M downloads
20 likes
OTHER

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements over Qwen2: - Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. - Long-context support up to 128K tokens, with generation of up to 8K tokens. - Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. This repo contains the instruction-tuned 7B Qwen2.5 model, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias - Number of Parameters: 7.61B - Number of Parameters (Non-Embedding): 6.53B - Number of Layers: 28 - Number of Attention Heads (GQA): 28 for Q and 4 for KV - Context Length: Full 131,072 tokens and generation 8192 tokens - Please refer to this section for detailed instructions on how to deploy Qwen2.5 for handling long texts. For more details, please refer to our blog, GitHub, and Documentation. The code for Qwen2.5 is in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`; with `transformers<4.37.0`, you will encounter an error. Below is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and how to generate content. The current `config.json` is set for a context length of up to 32,768 tokens. To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts. For supported frameworks, you can add a `rope_scaling` section to `config.json` to enable YaRN. For deployment, we recommend using vLLM; please refer to our Documentation for usage if you are not familiar with vLLM. Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the `rope_scaling` configuration only when processing long contexts is required. Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see the results here. If you find our work helpful, feel free to give us a cite.
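
A minimal sketch of the `apply_chat_template` flow referenced above, assuming the `Qwen/Qwen2.5-7B-Instruct` repo id corresponds to this entry:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed id for the instruction-tuned 7B model described above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
# Render the chat into the model's prompt format, then generate a reply.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(generated[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```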

9.3M downloads
862 likes
OTHER
#19

xlm-roberta-base

by FacebookAI

XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al. and first released in this repository. Disclaimer: the team releasing XLM-RoBERTa did not write a model card for this model, so this model card has been written by the Hugging Face team.

XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the masked language modeling (MLM) objective: taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at models like GPT2. You can use this model directly with a pipeline for masked language modeling, or to extract the features of a given text in PyTorch (see the sketch after this card).
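The card's original snippets are not reproduced in this listing; the sketch below shows the standard fill-mask pipeline and feature-extraction usage with `transformers`. The repo id `FacebookAI/xlm-roberta-base` and the example sentences are assumptions for illustration.

```python
# Minimal sketch: masked language modeling and feature extraction with XLM-RoBERTa.
# ASSUMPTION: repo id "FacebookAI/xlm-roberta-base"; example text is illustrative.
from transformers import pipeline, AutoTokenizer, AutoModel

# Fill-mask pipeline (XLM-RoBERTa uses the <mask> token).
unmasker = pipeline("fill-mask", model="FacebookAI/xlm-roberta-base")
print(unmasker("Hello I'm a <mask> model."))

# Feature extraction in PyTorch.
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModel.from_pretrained("FacebookAI/xlm-roberta-base")
inputs = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state  # shape: (batch, tokens, hidden_size)
print(last_hidden_state.shape)
```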

8.6M downloads
757 likes
PYTORCH
#20

colbertv2.0

by colbert-ir

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (`MaxSim`) operators. These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora.

You can read more in our papers: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR'20); Relevance-guided Supervision for OpenQA with ColBERT (TACL'21); Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21); ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL'22); and PLAID: An Efficient Engine for Late Interaction Retrieval (CIKM'22).

(1/29/23) We have merged a new index updater feature and support for additional Hugging Face models! These are in beta, so please give us feedback as you try them out. (1/24/23) If you're looking for the DSP framework for composing ColBERTv2 and LLMs, it's at: https://github.com/stanfordnlp/dsp

The ColBERTv1 code from the SIGIR'20 paper is in the `colbertv1` branch. See below for more information on other branches.

ColBERT requires Python 3.7+ and PyTorch 1.9+ and uses the Hugging Face Transformers library. We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.) We have also included an environment file specifically for CPU-only environments (`conda_env_cpu.yml`), but note that if you are testing CPU execution on a machine that includes GPUs, you might need to specify `CUDA_VISIBLE_DEVICES=""` as part of your command. Note that a GPU is required for training and indexing. If you face any problems, please open a new issue and we'll help you promptly!

Using ColBERT on a dataset typically involves the following steps.

Step 0: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: one file (e.g., `collection.tsv`) contains all passages and another (e.g., `queries.tsv`) contains a set of queries for searching the collection.

Step 1: Download the pre-trained ColBERTv2 checkpoint. This checkpoint has been trained on the MS MARCO Passage Ranking task. You can also optionally train your own ColBERT model.

Step 2: Index your collection. Once you have a trained ColBERT model, you need to index your collection to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

Step 3: Search the collection with your queries. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task. NEW: We have an experimental notebook on Google Colab that you can use with free GPUs. Indexing 10,000 passages on the free Colab T4 GPU takes six minutes. The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.
It includes how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and how to download our new LoTTE benchmark.

This repository works directly with a simple tab-separated file format to store queries, passages, and top-k ranked lists. Queries: each line is `qid \t query text`. Collection: each line is `pid \t passage text`. Top-k ranking: each line is `qid \t pid \t rank`. This works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (`triples.train.small.tar.gz`), the official top-1000 ranked lists for the dev set queries (`top1000.dev`), and the dev set relevant passages (`qrels.dev.small.tsv`). For indexing the full collection, you will also need the list of passages (`collection.tar.gz`).

For fast retrieval, indexing precomputes the ColBERT representations of passages. We typically recommend that you use ColBERT for end-to-end retrieval, where it directly finds its top-k passages from the full collection (see the Python API sketch after this card). You can optionally specify the `ncells`, `centroid_score_threshold`, and `ndocs` search hyperparameters to trade off between speed and result quality; defaults for different values of `k` are listed in colbert/searcher.py. We can then evaluate the MS MARCO rankings with the evaluation utility provided in the repository.

We provide a pre-trained model checkpoint, but we also detail how to train from scratch here. Note that this example demonstrates the ColBERTv1 style of training, but the provided checkpoint was trained with ColBERTv2. Training requires a JSONL triples file with a `[qid, pid+, pid-]` list per line. The query IDs and passage IDs correspond to the specified `queries.tsv` and `collection.tsv` files respectively.

Running a lightweight ColBERTv2 server: we provide a script to run a lightweight server which serves the top-k (up to 100) results in ranked order for a given search query, in JSON format. This script can be used to power DSP programs. To run the server, update the environment variables `INDEX_ROOT` and `INDEX_NAME` in the `.env` file to point to the appropriate ColBERT index, then run the server script.

Branches:
- `main`: stable branch with ColBERTv2 + PLAID.
- `colbertv1`: legacy branch for ColBERTv1.

Deprecated branches:
- `new_api`: base ColBERTv2 implementation.
- `cpu_inference`: ColBERTv2 implementation with CPU search support.
- `fast_search`: ColBERTv2 implementation with PLAID.
- `binarization`: ColBERT with a baseline binarization-based compression strategy (as opposed to ColBERTv2's residual compression, which we found to be more robust).
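The indexing and search commands referenced above are not reproduced in this listing; the sketch below shows the general shape of the ColBERT Python API (Indexer/Searcher) following the pattern in docs/intro.ipynb. Paths, the experiment name, and the index name are placeholder assumptions; refer to the repository README for the exact supported invocation.

```python
# Minimal sketch of end-to-end indexing and retrieval with the ColBERT Python API.
# ASSUMPTIONS: paths, experiment name, and index name are placeholders; the
# Indexer/Searcher usage follows the repository's docs/intro.ipynb example.
from colbert import Indexer, Searcher
from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):
        # Step 2: encode the collection and build the search index.
        config = ColBERTConfig(nbits=2)  # 2-bit residual compression
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/collection.tsv")

        # Step 3: search the index with a set of queries.
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/queries.tsv")
        ranking = searcher.search_all(queries, k=100)  # top-100 passages per query
        ranking.save("msmarco.nbits=2.ranking.tsv")
```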

8.4M downloads
294 likes
PYTORCH