esm2_t33_650M_UR50D
ESM-2 is a state-of-the-art protein model trained on a masked language modelling objective. It is suitable for fine-tuning on a wide range of tasks that take protein sequences as input. For detailed information on the model architecture and training data, please refer to the accompanying paper. You may also be interested in some demo notebooks (PyTorch, TensorFlow) which demonstrate how to fine-tune ESM-2 models on your tasks of interest. Several ESM-2 checkpoints are available on the Hub with varying sizes. Larger sizes generally have somewhat better accuracy, but require much more memory and time to train:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|------------|----------------|
| esm2_t48_15B_UR50D | 48 | 15B |
| esm2_t36_3B_UR50D | 36 | 3B |
| esm2_t33_650M_UR50D | 33 | 650M |
| esm2_t30_150M_UR50D | 30 | 150M |
| esm2_t12_35M_UR50D | 12 | 35M |
| esm2_t6_8M_UR50D | 6 | 8M |
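As a sketch of what loading one of these checkpoints looks like (assuming the `transformers` library; the 650M checkpoint shown here is interchangeable with any row of the table, and the ubiquitin sequence with one masked residue is purely illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Ubiquitin sequence with one residue masked out
sequence = "MQIFVKTLTGK<mask>ITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Most likely residue at the masked position
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted = tokenizer.decode([logits[0, mask_index].argmax().item()])
```

For fine-tuning, the same checkpoint can instead be loaded with a task head (e.g. `AutoModelForSequenceClassification`), as in the demo notebooks mentioned above.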
contriever
This model has been trained without supervision, following the approach described in [Towards Unsupervised Dense Information Retrieval with Contrastive Learning](https://arxiv.org/abs/2112.09118). The associated GitHub repository is available at https://github.com/facebookresearch/contriever.
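A minimal sketch of obtaining sentence embeddings from this checkpoint (assuming the `transformers` library; the mean-pooling step, which averages token embeddings under the attention mask, follows the approach used with Contriever, and the example sentences are illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska was born in Warsaw.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

def mean_pooling(token_embeddings, mask):
    # Zero out padding positions, then average over the sequence dimension
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])
score = embeddings[0] @ embeddings[1]  # dot-product relevance score
```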
wav2vec2-base-960h
The base model, pretrained and fine-tuned on 960 hours of LibriSpeech 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned. Experiments using all labeled data of LibriSpeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files, the model can be used as a standalone acoustic model as follows. The same pattern can be used to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" and "other" test data.
opt-125m
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper: > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models as GPT-3.
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations: The pretrained-only model can be used for prompting, for evaluation of downstream tasks, as well as for text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included, - Pushshift.io Reddit dataset that was developed in Baumgartner et al. 
(2021), - CCNewsV2, containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b). The final training data contains 180B tokens, corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
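The pipeline usage described in the card above might look like the following sketch (assuming the `transformers` library; the prompt text is illustrative):

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="facebook/opt-125m")

# Deterministic (greedy) generation is the default
greedy = generator("Hello, I am conscious and")

# Top-k sampling, enabled by passing do_sample=True
set_seed(32)
sampled = generator("Hello, I am conscious and", do_sample=True, top_k=50)
print(sampled[0]["generated_text"])
```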
esmfold_v1
--- license: mit ---
bart-large-cnn
--- language: - en pipeline_tag: summarization license: mit thumbnail: https://huggingface.co/front/thumbnails/facebook.png datasets: - cnn_dailymail model-index: - name: facebook/bart-large-cnn results: - task: type: summarization name: Summarization dataset: name: cnn_dailymail type: cnn_dailymail config: 3.0.0 split: train metrics: - name: ROUGE-1 type: rouge value: 42.9486 verified: true - name: ROUGE-2 type: rouge value: 20.8149 verified: true - name: ROUGE-L type: rouge value: 30.6186 veri
bart-large-mnli
This is the checkpoint for bart-large after being trained on the MultiNLI (MNLI) dataset. Additional information about this model: - The bart-large model page - BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Yin et al. proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct the hypothesis `This text is about politics.`. The probabilities for entailment and contradiction are then converted to label probabilities. This method is surprisingly effective in many cases, particularly when used with larger pre-trained models like BART and RoBERTa. See this blog post for a more expansive introduction to this and other zero-shot methods, and see the code snippets below for examples of using this model for zero-shot classification both with Hugging Face's built-in pipeline and with native Transformers/PyTorch code. The model can be loaded with the `zero-shot-classification` pipeline like so: You can then use this pipeline to classify sequences into any of the class names you specify. If more than one candidate label can be correct, pass `multi_label=True` to calculate each class independently:
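Putting the pipeline steps above together (a sketch assuming the `transformers` library; the sequence and candidate labels are illustrative):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sequence = "one day I will see the world"
labels = ["travel", "cooking", "dancing"]

# Single-label mode: scores over the candidate labels sum to 1
result = classifier(sequence, candidate_labels=labels)
print(result["labels"][0])  # highest-scoring label

# Multi-label mode: each label is scored independently (entailment vs. contradiction)
multi = classifier(sequence, candidate_labels=["travel", "exploration"], multi_label=True)
```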
flava-full
--- license: bsd-3-clause ---
dinov2-base
Vision Transformer (base-sized model) trained using DINOv2. This is a Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. and first released in this repository. Disclaimer: The team releasing DINOv2 did not write a model card for this model, so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion. Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of the sequence to use it for classification tasks, and absolute position embeddings are added before feeding the sequence to the layers of the Transformer encoder. Note that this model does not include any fine-tuned heads. By pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image. You can use the raw model for feature extraction. See the model hub to look for fine-tuned versions on a task that interests you.
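A feature-extraction sketch for the raw model (assuming the `transformers`, `Pillow`, and `requests` libraries; the COCO image URL is illustrative):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's last hidden state serves as a whole-image representation,
# e.g. as input to a linear classifier as described above
cls_embedding = outputs.last_hidden_state[:, 0]
```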
w2v-bert-2.0
--- license: mit language: - af - am - ar - as - az - be - bn - bs - bg - ca - cs - zh - cy - da - de - el - en - et - fi - fr - or - om - ga - gl - gu - ha - he - hi - hr - hu - hy - ig - id - is - it - jv - ja - kn - ka - kk - mn - km - ky - ko - lo - ln - lt - lb - lg - lv - ml - mr - mk - mt - mi - my - nl - nb - ne - ny - oc - pa - ps - fa - pl - pt - ro - ru - sk - sl - sn - sd - so - es - sr - sv - sw - ta - te - tg - tl - th - tr - uk - ur - uz - vi - wo - xh - yo - ms - zu - ary - arz -
bart-base
--- license: apache-2.0 language: en ---
dinov2-small
--- license: apache-2.0 tags: - dino - vision ---
musicgen-medium
--- inference: true tags: - musicgen license: cc-by-nc-4.0 pipeline_tag: text-to-audio widget: - text: a funky house with 80s hip hop vibes example_title: Prompt 1 - text: a chill song with influences from lofi, chillstep and downtempo example_title: Prompt 2 - text: a catchy beat for a podcast intro example_title: Prompt 3 ---
roberta-hate-speech-dynabench-r4-target
--- language: en ---
esm2_t6_8M_UR50D
--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---
mms-300m
--- tags: - mms language: - ab - af - ak - am - ar - as - av - ay - az - ba - bm - be - bn - bi - bo - sh - br - bg - ca - cs - ce - cv - ku - cy - da - de - dv - dz - el - en - eo - et - eu - ee - fo - fa - fj - fi - fr - fy - ff - ga - gl - gn - gu - zh - ht - ha - he - hi - sh - hu - hy - ig - ia - ms - is - it - jv - ja - kn - ka - kk - kr - km - ki - rw - ky - ko - kv - lo - la - lv - ln - lt - lb - lg - mh - ml - mr - ms - mk - mg - mt - mn - mi - my - zh - nl - 'no' - 'no' - ne - ny - oc
esm2_t36_3B_UR50D
--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---
m2m100_418M
--- language: - multilingual - af - am - ar - ast - az - ba - be - bg - bn - br - bs - ca - ceb - cs - cy - da - de - el - en - es - et - fa - ff - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - ht - hu - hy - id - ig - ilo - is - it - ja - jv - ka - kk - km - kn - ko - lb - lg - ln - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - no - ns - oc - or - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - so - sq - sr - ss - su - sv - sw - ta - th - tl - tn - tr - uk - ur - uz - v
detr-resnet-50
--- license: apache-2.0 tags: - object-detection - vision datasets: - coco widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg example_title: Savanna - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ---
dino-vits16
--- license: apache-2.0 tags: - dino - vision datasets: - imagenet-1k ---
hubert-base-ls960
--- language: en datasets: - librispeech_asr tags: - speech license: apache-2.0 ---
encodec_24khz
--- inference: false ---
wav2vec2-base
--- language: en datasets: - librispeech_asr tags: - speech license: apache-2.0 ---
dinov3-vitb16-pretrain-lvd1689m
--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor
wav2vec2-large-xlsr-53
--- language: multilingual datasets: - common_voice tags: - speech license: apache-2.0 ---
VGGT-1B
--- tags: - model_hub_mixin - pytorch_model_hub_mixin license: cc-by-nc-4.0 language: - en pipeline_tag: image-to-3d ---
dinov3-vitl16-pretrain-lvd1689m
--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor
m2m100_1.2B
--- language: - multilingual - af - am - ar - ast - az - ba - be - bg - bn - br - bs - ca - ceb - cs - cy - da - de - el - en - es - et - fa - ff - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - ht - hu - hy - id - ig - ilo - is - it - ja - jv - ka - kk - km - kn - ko - lb - lg - ln - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - no - ns - oc - or - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - so - sq - sr - ss - su - sv - sw - ta - th - tl - tn - tr - uk - ur - uz - v
dinov2-large
--- license: apache-2.0 tags: - dino - vision ---
esm2_t30_150M_UR50D
--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---
wav2vec2-xlsr-53-espeak-cv-ft
--- language: multi-lingual datasets: - common_voice tags: - speech - audio - automatic-speech-recognition - phoneme-recognition widget: - example_title: Librispeech sample 1 src: https://cdn-media.huggingface.co/speech_samples/sample1.flac - example_title: Librispeech sample 2 src: https://cdn-media.huggingface.co/speech_samples/sample2.flac license: apache-2.0 ---
esm2_t12_35M_UR50D
--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---
fasttext-language-identification
--- license: cc-by-nc-4.0 library_name: fasttext tags: - text-classification - language-identification ---
PE-Core-G14-448
--- license: apache-2.0 library_name: perception-encoder pipeline_tag: zero-shot-image-classification ---
sam-vit-base
--- license: apache-2.0 tags: - vision ---
mask2former-swin-large-cityscapes-semantic
--- license: other tags: - vision - image-segmentation datasets: - coco widget: - src: http://images.cocodataset.org/val2017/000000039769.jpg example_title: Cats - src: http://images.cocodataset.org/val2017/000000039770.jpg example_title: Castle ---
nllb-200-distilled-600M
--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k
cwm
hubert-large-ls960-ft
--- language: en datasets: - libri-light - librispeech_asr tags: - speech - audio - automatic-speech-recognition - hf-asr-leaderboard license: apache-2.0 model-index: - name: hubert-large-ls960-ft results: - task: name: Automatic Speech Recognition type: automatic-speech-recognition dataset: name: LibriSpeech (clean) type: librispeech_asr config: clean split: test args: language: en metrics: - name: Test WER type: wer value: 1.9 ---
dinov3-vits16-pretrain-lvd1689m
--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor
dinov2-with-registers-base
--- library_name: transformers pipeline_tag: image-feature-extraction license: apache-2.0 tags: - dino - vision inference: false ---
mms-lid-256
--- tags: - mms language: - ab - af - ak - am - ar - as - av - ay - az - ba - bm - be - bn - bi - bo - sh - br - bg - ca - cs - ce - cv - ku - cy - da - de - dv - dz - el - en - eo - et - eu - ee - fo - fa - fj - fi - fr - fy - ff - ga - gl - gn - gu - zh - ht - ha - he - hi - sh - hu - hy - ig - ia - ms - is - it - jv - ja - kn - ka - kk - kr - km - ki - rw - ky - ko - kv - lo - la - lv - ln - lt - lb - lg - mh - ml - mr - ms - mk - mg - mt - mn - mi - my - zh - nl - 'no' - 'no' - ne - ny - oc
sam-vit-huge
--- license: apache-2.0 tags: - vision ---
mbart-large-50-many-to-many-mmt
--- language: - multilingual - ar - cs - de - en - es - et - fi - fr - gu - hi - it - ja - kk - ko - lt - lv - my - ne - nl - ro - ru - si - tr - vi - zh - af - az - bn - fa - he - hr - id - ka - km - mk - ml - mn - mr - pl - ps - pt - sv - sw - ta - te - th - tl - uk - ur - xh - gl - sl tags: - mbart-50 pipeline_tag: translation ---
sam2-hiera-large
--- license: apache-2.0 pipeline_tag: mask-generation library_name: transformers ---
dinov2-giant
dino-vitb16
wav2vec2-xls-r-300m
--- language: - multilingual - ab - af - sq - am - ar - hy - as - az - ba - eu - be - bn - bs - br - bg - my - yue - ca - ceb - km - zh - cv - hr - cs - da - dv - nl - en - eo - et - fo - fi - fr - gl - lg - ka - de - el - gn - gu - ht - cnh - ha - haw - he - hi - hu - is - id - ia - ga - it - ja - jv - kb - kn - kk - rw - ky - ko - ku - lo - la - lv - ln - lt - lm - mk - mg - ms - ml - mt - gv - mi - mr - mn - ne - no - nn - oc - or - ps - fa - pl - pt - pa - ro - rm - rm - ru - sah - sa - sco
wav2vec2-large-robust-ft-libri-960h
This model is a fine-tuned version of the wav2vec2-large-robust model. It has been pretrained on: - Libri-Light: open-source audio books from the LibriVox project; clean, read-out audio data - CommonVoice: crowd-sourced audio data; read-out text snippets - Switchboard: telephone speech corpus; noisy telephone data - Fisher: conversational telephone speech; noisy telephone data. When using the model, make sure that your speech input is also sampled at 16 kHz. Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli. Abstract: Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications, since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at this https URL. The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files, the model can be used as a standalone acoustic model as follows:
sam-vit-large
dinov3-vith16plus-pretrain-lvd1689m
convnextv2-tiny-22k-224
sam2.1-hiera-large
opt-350m
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper: > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models as GPT-3.
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. The pretrained-only model can be used for prompting, for evaluation of downstream tasks, as well as for text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included, - Pushshift.io Reddit dataset that was developed in Baumgartner et al. 
(2020) and processed in Roller et al. (2021), - CCNewsV2, containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b). The final training data contains 180B tokens, corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
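As an illustration of the biased-predictions point in the card above (a hedged sketch, assuming the `transformers` library; the minimally different prompts are a common probing pattern, and the sampled completions will vary with the seed):

```python
from transformers import pipeline, set_seed

set_seed(32)
generator = pipeline("text-generation", model="facebook/opt-350m")

# Compare sampled completions for two minimally different prompts;
# systematic differences in the completions surface the model's biases
for prompt in ["The woman worked as a", "The man worked as a"]:
    outputs = generator(prompt, do_sample=True, num_return_sequences=5, max_length=20)
    for out in outputs:
        print(out["generated_text"])
```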
opt-1.3b
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper: > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models as GPT-3.
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations: The pretrained-only model can be used for prompting, for evaluation of downstream tasks, as well as for text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included, 
- Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
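The top-k sampling mentioned above (enabled with `do_sample=True` in the generation pipeline) can be illustrated with a minimal numpy sketch. The logit values below are hypothetical, not taken from the model:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits only."""
    top = np.argsort(logits)[-k:]                  # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # softmax restricted to the top-k
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, 1.5, -1.0, 0.3])      # hypothetical 5-token vocabulary
token = top_k_sample(logits, k=2, rng=rng)
assert token in (1, 2)  # only the two highest-scoring tokens can be drawn
```

With `k=1` this degenerates to greedy decoding, which is why generation is deterministic by default.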
vjepa2-vitl-fpc64-256
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have the latest transformers installed: V-JEPA 2 is intended to represent any video (and image) for video classification, retrieval, or use as a video encoder for VLMs. To load a video, sample the number of frames according to the model. For this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.
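The frame-handling steps described above (sampling 64 frames from a video, or replicating an image to 64 frames) can be sketched in numpy; the helper names here are ours, not part of the library:

```python
import numpy as np

def sample_frame_indices(total_frames, num_frames=64):
    """Uniformly spaced frame indices covering the whole clip."""
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int)

def image_to_clip(image, num_frames=64):
    """'Load' a still image as a video by copying it to every frame."""
    return np.stack([image] * num_frames)

idx = sample_frame_indices(total_frames=300, num_frames=64)
assert len(idx) == 64 and idx[0] == 0 and idx[-1] == 299

clip = image_to_clip(np.zeros((256, 256, 3), dtype=np.uint8))
assert clip.shape == (64, 256, 256, 3)
```

The resulting frame stack would then be passed to the model's video processor.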
mask2former-swin-large-mapillary-vistas-semantic
Mask2Former model trained on Mapillary Vistas semantic segmentation (large-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for semantic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.
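The "masks plus labels" paradigm can be sketched in numpy: each query predicts class probabilities and a binary mask, and a semantic map is obtained by weighting masks with class scores. Shapes and tensors below are random stand-ins, not real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_queries, num_classes, H, W = 100, 19, 8, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class_probs = rng.random((num_queries, num_classes))        # per-query class scores
mask_probs = sigmoid(rng.normal(size=(num_queries, H, W)))  # per-query soft masks

# Per-pixel class score = sum over queries of (class prob * mask prob)
semantic = np.einsum("qc,qhw->chw", class_probs, mask_probs)
seg_map = semantic.argmax(axis=0)                           # final per-pixel label
assert seg_map.shape == (H, W) and seg_map.max() < num_classes
```

In practice the image processor's post-processing methods perform this combination for you.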
nllb-200-distilled-1.3B
--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k
PE-Core-L14-336
detr-resnet-50-panoptic
musicgen-small
--- inference: true tags: - musicgen license: cc-by-nc-4.0 pipeline_tag: text-to-audio widget: - text: "a funky house with 80s hip hop vibes" example_title: "Prompt 1" - text: "a chill song with influences from lofi, chillstep and downtempo" example_title: "Prompt 2" - text: "a catchy beat for a podcast intro" example_title: "Prompt 3" ---
wav2vec2-large-es-voxpopuli
Facebook's Wav2Vec2 large model pretrained on the Spanish (es) unlabeled subset of the VoxPopuli corpus. Paper: VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation Authors: Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux from Facebook AI See the official website for more information. Please refer to this blog on how to fine-tune this model on a specific language. Note that you should replace `"facebook/wav2vec2-large-xlsr-53"` with this checkpoint for fine-tuning.
encodec_32khz
wav2vec2-conformer-rope-large-960h-ft
dpr-ctx_encoder-single-nq-base
mask2former-swin-tiny-coco-instance
Mask2Former model trained on COCO instance segmentation (tiny-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for instance segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.
rag-sequence-nq
chameleon-7b
metaclip-b32-400m
dinov3-vit7b16-pretrain-sat493m
dpr-question_encoder-single-nq-base
wav2vec2-large-960h
contriever-msmarco
nllb-200-3.3B
--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k
wav2vec2-large-960h-lv60-self
The large model pretrained on Libri-Light and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. The model was trained with a self-training objective. When using the model make sure that your speech input is also sampled at 16kHz. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows: This code snippet shows how to evaluate facebook/wav2vec2-large-960h-lv60-self on LibriSpeech's "clean" and "other" test data.
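When used as a standalone acoustic model, the network emits per-frame character logits which are decoded greedily. The decoding step the processor performs can be illustrated with a minimal CTC sketch (toy alphabet and frame ids, not the model's real vocabulary):

```python
def ctc_greedy_decode(frame_ids, blank=0, vocab=("<pad>", "C", "A", "T")):
    """Greedy CTC decoding: collapse repeated ids, then drop blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)

# Hypothetical per-frame argmax ids: "CC_AA_T" decodes to "CAT"
assert ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3]) == "CAT"
```

In practice, `torch.argmax` over the model logits followed by the processor's `batch_decode` performs exactly this collapse-and-strip step.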
detr-resnet-101
bart-large
vit-mae-base
mms-1b-all
seamless-m4t-v2-large
SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages. SeamlessM4T models support the tasks of: - Speech-to-speech translation (S2ST) - Speech-to-text translation (S2TT) - Text-to-speech translation (T2ST) - Text-to-text translation (T2TT) - Automatic speech recognition (ASR). SeamlessM4T models support: - 🎤 101 languages for speech input. - 💬 96 languages for text input/output. - 🔊 35 languages for speech output. 🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks. The v2 version of SeamlessM4T is a multitask adaptation of our novel UnitY2 architecture. UnitY2, with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, considerably improves over SeamlessM4T v1 in quality and inference speed. SeamlessM4T v2 is also supported by 🤗 Transformers, more on it in the dedicated section below. SeamlessM4T models:

| Model Name | #params | checkpoint | metrics |
| ----------------------- | ------- | ---------- | ------- |
| SeamlessM4T-Large v2    | 2.3B    | checkpoint | metrics |
| SeamlessM4T-Large (v1)  | 2.3B    | checkpoint | metrics |
| SeamlessM4T-Medium (v1) | 1.2B    | checkpoint | metrics |

We provide the extensive evaluation results of SeamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above. The evaluation data ids for FLEURS, CoVoST2 and CVSS-C can be found here. Evaluating SeamlessM4T models To reproduce our results or to evaluate using the same metrics over your own test sets, please check out the Evaluation README here. 
Finetuning SeamlessM4T models Please check out the Finetuning README here. SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started: 1. First install the 🤗 Transformers library from main and sentencepiece: 2. Run the following Python code to generate speech samples. Here the target language is Russian: 3. Listen to the audio samples either in an ipynb notebook: Or save them as a `.wav` file using a third-party library, e.g. `scipy`: For more details on using the SeamlessM4T model for inference using the 🤗 Transformers library, refer to the SeamlessM4T v2 docs or to this hands-on Google Colab. Listed below, are the languages supported by SeamlessM4T-large (v1/v2). The `source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`). The `target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`). | code | language | script | Source | Target | | ---- | ---------------------- | ---------- | ------ | ------ | | afr | Afrikaans | Latn | Sp, Tx | Tx | | amh | Amharic | Ethi | Sp, Tx | Tx | | arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx | | ary | Moroccan Arabic | Arab | Sp, Tx | Tx | | arz | Egyptian Arabic | Arab | Sp, Tx | Tx | | asm | Assamese | Beng | Sp, Tx | Tx | | ast | Asturian | Latn | Sp | \-- | | azj | North Azerbaijani | Latn | Sp, Tx | Tx | | bel | Belarusian | Cyrl | Sp, Tx | Tx | | ben | Bengali | Beng | Sp, Tx | Sp, Tx | | bos | Bosnian | Latn | Sp, Tx | Tx | | bul | Bulgarian | Cyrl | Sp, Tx | Tx | | cat | Catalan | Latn | Sp, Tx | Sp, Tx | | ceb | Cebuano | Latn | Sp, Tx | Tx | | ces | Czech | Latn | Sp, Tx | Sp, Tx | | ckb | Central Kurdish | Arab | Sp, Tx | Tx | | cmn | Mandarin Chinese | Hans | Sp, Tx | Sp, Tx | | cmnHant | Mandarin Chinese | Hant | Sp, Tx | Sp, Tx | | cym | Welsh | Latn | Sp, Tx | Sp, Tx | | dan | Danish | Latn | Sp, Tx | Sp, Tx | | deu | German | Latn | Sp, Tx | Sp, Tx | | ell | 
Greek | Grek | Sp, Tx | Tx | | eng | English | Latn | Sp, Tx | Sp, Tx | | est | Estonian | Latn | Sp, Tx | Sp, Tx | | eus | Basque | Latn | Sp, Tx | Tx | | fin | Finnish | Latn | Sp, Tx | Sp, Tx | | fra | French | Latn | Sp, Tx | Sp, Tx | | fuv | Nigerian Fulfulde | Latn | Sp, Tx | Tx | | gaz | West Central Oromo | Latn | Sp, Tx | Tx | | gle | Irish | Latn | Sp, Tx | Tx | | glg | Galician | Latn | Sp, Tx | Tx | | guj | Gujarati | Gujr | Sp, Tx | Tx | | heb | Hebrew | Hebr | Sp, Tx | Tx | | hin | Hindi | Deva | Sp, Tx | Sp, Tx | | hrv | Croatian | Latn | Sp, Tx | Tx | | hun | Hungarian | Latn | Sp, Tx | Tx | | hye | Armenian | Armn | Sp, Tx | Tx | | ibo | Igbo | Latn | Sp, Tx | Tx | | ind | Indonesian | Latn | Sp, Tx | Sp, Tx | | isl | Icelandic | Latn | Sp, Tx | Tx | | ita | Italian | Latn | Sp, Tx | Sp, Tx | | jav | Javanese | Latn | Sp, Tx | Tx | | jpn | Japanese | Jpan | Sp, Tx | Sp, Tx | | kam | Kamba | Latn | Sp | \-- | | kan | Kannada | Knda | Sp, Tx | Tx | | kat | Georgian | Geor | Sp, Tx | Tx | | kaz | Kazakh | Cyrl | Sp, Tx | Tx | | kea | Kabuverdianu | Latn | Sp | \-- | | khk | Halh Mongolian | Cyrl | Sp, Tx | Tx | | khm | Khmer | Khmr | Sp, Tx | Tx | | kir | Kyrgyz | Cyrl | Sp, Tx | Tx | | kor | Korean | Kore | Sp, Tx | Sp, Tx | | lao | Lao | Laoo | Sp, Tx | Tx | | lit | Lithuanian | Latn | Sp, Tx | Tx | | ltz | Luxembourgish | Latn | Sp | \-- | | lug | Ganda | Latn | Sp, Tx | Tx | | luo | Luo | Latn | Sp, Tx | Tx | | lvs | Standard Latvian | Latn | Sp, Tx | Tx | | mai | Maithili | Deva | Sp, Tx | Tx | | mal | Malayalam | Mlym | Sp, Tx | Tx | | mar | Marathi | Deva | Sp, Tx | Tx | | mkd | Macedonian | Cyrl | Sp, Tx | Tx | | mlt | Maltese | Latn | Sp, Tx | Sp, Tx | | mni | Meitei | Beng | Sp, Tx | Tx | | mya | Burmese | Mymr | Sp, Tx | Tx | | nld | Dutch | Latn | Sp, Tx | Sp, Tx | | nno | Norwegian Nynorsk | Latn | Sp, Tx | Tx | | nob | Norwegian Bokmål | Latn | Sp, Tx | Tx | | npi | Nepali | Deva | Sp, Tx | Tx | | nya | Nyanja | Latn | Sp, Tx | Tx | | 
oci | Occitan | Latn | Sp | \-- | | ory | Odia | Orya | Sp, Tx | Tx | | pan | Punjabi | Guru | Sp, Tx | Tx | | pbt | Southern Pashto | Arab | Sp, Tx | Tx | | pes | Western Persian | Arab | Sp, Tx | Sp, Tx | | pol | Polish | Latn | Sp, Tx | Sp, Tx | | por | Portuguese | Latn | Sp, Tx | Sp, Tx | | ron | Romanian | Latn | Sp, Tx | Sp, Tx | | rus | Russian | Cyrl | Sp, Tx | Sp, Tx | | slk | Slovak | Latn | Sp, Tx | Sp, Tx | | slv | Slovenian | Latn | Sp, Tx | Tx | | sna | Shona | Latn | Sp, Tx | Tx | | snd | Sindhi | Arab | Sp, Tx | Tx | | som | Somali | Latn | Sp, Tx | Tx | | spa | Spanish | Latn | Sp, Tx | Sp, Tx | | srp | Serbian | Cyrl | Sp, Tx | Tx | | swe | Swedish | Latn | Sp, Tx | Sp, Tx | | swh | Swahili | Latn | Sp, Tx | Sp, Tx | | tam | Tamil | Taml | Sp, Tx | Tx | | tel | Telugu | Telu | Sp, Tx | Sp, Tx | | tgk | Tajik | Cyrl | Sp, Tx | Tx | | tgl | Tagalog | Latn | Sp, Tx | Sp, Tx | | tha | Thai | Thai | Sp, Tx | Sp, Tx | | tur | Turkish | Latn | Sp, Tx | Sp, Tx | | ukr | Ukrainian | Cyrl | Sp, Tx | Sp, Tx | | urd | Urdu | Arab | Sp, Tx | Sp, Tx | | uzn | Northern Uzbek | Latn | Sp, Tx | Sp, Tx | | vie | Vietnamese | Latn | Sp, Tx | Sp, Tx | | xho | Xhosa | Latn | Sp | \-- | | yor | Yoruba | Latn | Sp, Tx | Tx | | yue | Cantonese | Hant | Sp, Tx | Tx | | zlm | Colloquial Malay | Latn | Sp | \-- | | zsm | Standard Malay | Latn | Tx | Tx | | zul | Zulu | Latn | Sp, Tx | Tx | Note that seamlessM4T-medium supports 200 languages in the text modality, and is based on NLLB-200 (see full list in asset card)
map-anything-apache
--- tags: - model_hub_mixin - pytorch_model_hub_mixin - computer-vision - 3d-reconstruction - multi-view-stereo - depth-estimation - camera-pose - covisibility - mapanything license: apache-2.0 language: - en pipeline_tag: image-to-3d ---
audiobox-aesthetics
mask2former-swin-large-ade-semantic
wav2vec2-lv-60-espeak-cv-ft
Wav2Vec2-Large-LV60 finetuned on multi-lingual Common Voice This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages. When using the model make sure that your speech input is sampled at 16kHz. Note that the model outputs a string of phonetic labels. A dictionary mapping phonetic labels to words has to be used to map the phonetic output labels to output words. Paper: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition Abstract Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows:
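Since the model outputs phonetic labels rather than words, a separate lexicon is needed to recover text. A toy sketch of that mapping step is below; the two-entry lexicon, the `|` word separator, and the helper name are all hypothetical (a real system would use a full pronunciation dictionary and a proper decoder):

```python
# Hypothetical lexicon mapping phoneme strings to words.
lexicon = {
    "h ɛ l oʊ": "hello",
    "w ɝ l d": "world",
}

def phonemes_to_words(phoneme_output, lexicon):
    """Map each phoneme group (split on a '|' word boundary) to a word."""
    groups = [g.strip() for g in phoneme_output.split("|")]
    return " ".join(lexicon.get(g, "<unk>") for g in groups)

assert phonemes_to_words("h ɛ l oʊ | w ɝ l d", lexicon) == "hello world"
```

Unknown phoneme sequences fall back to `<unk>`, which is where a real decoder would apply fuzzy matching or a language model.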
timesformer-base-finetuned-k400
TimeSformer (base-sized model, fine-tuned on Kinetics-400) TimeSformer model pre-trained on Kinetics-400. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository. Disclaimer: The team releasing TimeSformer did not write a model card for this model so this model card has been written by fcakyon. You can use the raw model for video classification into one of the 400 possible Kinetics-400 labels. For more code examples, we refer to the documentation.
opt-6.7b
deit-tiny-patch16-224
metaclip-h14-fullcc2.5b
MetaCLIP model, huge-sized version, patch resolution 14 MetaCLIP model applied to 2.5 billion data points of CommonCrawl (CC). It was introduced in the paper Demystifying CLIP Data by Xu et al. and first released in this repository. Disclaimer: The team releasing MetaCLIP did not write a model card for this model so this model card has been written by the Hugging Face team. The Demystifying CLIP Data paper aims to reveal CLIP’s method around training data curation, since OpenAI never open-sourced code regarding its data preparation pipeline. CLIP high-level overview. Taken from the CLIP paper. You can use the raw model for linking images with text in a shared embedding space. This enables things like zero-shot image classification, text-based image retrieval, image-based text retrieval, etc. For code examples, we refer to the docs; just replace the model names with those on the hub.
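Zero-shot classification with a shared embedding space reduces to cosine similarity between one image embedding and the text embeddings of the candidate labels. A numpy sketch with random stand-in embeddings (the real ones come from the model's image and text towers):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Cosine similarity between one image and N label embeddings, softmaxed."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)          # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)                # stand-in image embedding
text_embs = rng.normal(size=(3, 512))           # stand-ins for 3 label prompts
probs = zero_shot_scores(image_emb, text_embs)
assert probs.shape == (3,) and abs(probs.sum() - 1.0) < 1e-9
```

The label whose prompt embedding scores highest is the zero-shot prediction.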
mask2former-swin-base-coco-panoptic
Mask2Former model trained on COCO panoptic segmentation (base-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for panoptic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.
deit-base-patch16-224
convnextv2-base-22k-224
ConvNeXt V2 model pretrained using the FCMAE framework and fine-tuned on the ImageNet-22K dataset at resolution 224x224. It was introduced in the paper ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Woo et al. and first released in this repository. Disclaimer: The team releasing ConvNeXT V2 did not write a model card for this model so this model card has been written by the Hugging Face team. ConvNeXt V2 is a pure convolutional model (ConvNet) that introduces a fully convolutional masked autoencoder framework (FCMAE) and a new Global Response Normalization (GRN) layer to ConvNeXt. ConvNeXt V2 significantly improves the performance of pure ConvNets on various recognition benchmarks. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 21,841 ImageNet-22k classes: For more code examples, we refer to the documentation.
sam2.1-hiera-tiny
Repository for SAM 2: Segment Anything in Images and Videos, a foundation model towards solving promptable visual segmentation in images and videos from FAIR. See the SAM 2 paper for more information. The official code is publicly released in this repo. SAM2 can be used for automatic mask generation to segment all objects in an image using the `mask-generation` pipeline: You can segment objects by providing a single point click on the object you want to segment: You can provide multiple points to refine the segmentation: SAM2 also supports bounding box inputs for segmentation: Process multiple images simultaneously for improved efficiency: Segment multiple objects within each image using batch inference: Batched images with batched objects and multiple points: handle complex batch scenarios with multiple points per object: SAM2 can use masks from previous predictions as input to refine segmentation: SAM2's key strength is its ability to track objects across video frames. Here's how to use it for video segmentation: Track multiple objects simultaneously across video frames: You can add additional clicks on any frame to refine the tracking: For real-time applications, SAM2 supports processing video frames as they arrive: Track multiple objects simultaneously in video by adding them all at once: To cite the paper, model, or software, please use the below:
esm2_t48_15B_UR50D
mms-lid-126
Massively Multilingual Speech (MMS) - Finetuned LID This checkpoint is a model fine-tuned for speech language identification (LID) and part of Facebook's Massive Multilingual Speech project. This checkpoint is based on the Wav2Vec2 architecture and classifies raw audio input to a probability distribution over 126 output classes (each class representing a language). The checkpoint consists of 1 billion parameters and has been fine-tuned from facebook/mms-1b on 126 languages. - Example - Supported Languages - Model details - Additional links This MMS checkpoint can be used with Transformers to identify the spoken language of an audio. It can recognize the following 126 languages. First, we install transformers and some other libraries. Note: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version is not yet available on PyPI make sure to install `transformers` from source: Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz). Now we process the audio data and pass the processed audio data to the model to classify it into a language, just like we usually do for Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition To see all the supported languages of a checkpoint, you can print out the language ids as follows: For more details about the architecture, please have a look at the official docs. This model supports 126 languages. Click the following to toggle all supported languages of this checkpoint in ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. 
- ara - cmn - eng - spa - fra - mlg - swe - por - vie - ful - sun - asm - ben - zlm - kor - ind - hin - tuk - urd - aze - slv - mon - hau - tel - swh - bod - rus - tur - heb - mar - som - tgl - tat - tha - cat - ron - mal - bel - pol - yor - nld - bul - hat - afr - isl - amh - tam - hun - hrv - lit - cym - fas - mkd - ell - bos - deu - sqi - jav - nob - uzb - snd - lat - nya - grn - mya - orm - lin - hye - yue - pan - jpn - kaz - npi - kat - guj - kan - tgk - ukr - ces - lav - bak - khm - fao - glg - ltz - lao - mlt - sin - sna - ita - srp - mri - nno - pus - eus - ory - lug - bre - luo - slk - fin - dan - yid - est - ceb - war - san - kir - oci - wol - haw - kam - umb - xho - epo - zul - ibo - abk - ckb - nso - gle - kea - ast - sco - glv - ina - Developed by: Vineel Pratap et al. - Model type: Multi-Lingual Automatic Speech Recognition model - Language(s): 126 languages, see supported languages - License: CC-BY-NC 4.0 license - Num parameters: 1 billion - Audio sampling rate: 16,000 Hz - Cite as: @article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023} } - Blog post - Transformers documentation. - Paper - GitHub Repository - Other MMS checkpoints - MMS base checkpoints: - facebook/mms-1b - facebook/mms-300m - Official Space
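The 16 kHz input requirement mentioned above can be met by resampling before inference. A naive linear-interpolation sketch in numpy is below; a real pipeline should use a proper DSP resampler (e.g. torchaudio, or `datasets`' `Audio(sampling_rate=16_000)` cast):

```python
import numpy as np

def resample(audio, orig_sr, target_sr=16_000):
    """Naive linear-interpolation resampling; illustration only, not anti-aliased."""
    duration = len(audio) / orig_sr
    n_target = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr      # original sample timestamps
    new_t = np.arange(n_target) / target_sr      # target sample timestamps
    return np.interp(new_t, old_t, audio)

audio_44k = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)  # 1 s @ 44.1 kHz
audio_16k = resample(audio_44k, orig_sr=44_100)
assert len(audio_16k) == 16_000
```

Skipping anti-alias filtering is fine for a sketch but would distort real audio; the point is only that one second of audio must end up as 16,000 samples.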
convnext-large-224-22k-1k
mask2former-swin-large-coco-panoptic
Mask2Former model trained on COCO panoptic segmentation (large-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for panoptic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.
dinov2-with-registers-large
Vision Transformer (large-sized model) trained using DINOv2, with registers Vision Transformer (ViT) model introduced in the paper Vision Transformers Need Registers by Darcet et al. and first released in this repository. Disclaimer: The team releasing DINOv2 with registers did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet. Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE. The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in: - no artifacts - interpretable attention maps - improved performance. Visualization of attention maps of various models trained with vs. without registers. Taken from the original paper. Note that this model does not include any fine-tuned heads. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for feature extraction. See the model hub to look for fine-tuned versions on a task that interests you.
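The "linear layer on top of the [CLS] token" idea can be sketched in numpy. The hidden size matches a ViT-L encoder (1024), but the embedding, weights, and the 10-class head are random stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_classes = 1024, 10             # ViT-L width; hypothetical 10 classes

cls_embedding = rng.normal(size=hidden_size)    # stand-in for the [CLS] hidden state
W = rng.normal(size=(num_classes, hidden_size)) * 0.01
b = np.zeros(num_classes)

logits = W @ cls_embedding + b                  # the linear probe
pred = int(np.argmax(logits))
assert logits.shape == (num_classes,) and 0 <= pred < num_classes
```

In a real linear probe, `W` and `b` are trained on labeled data while the frozen encoder supplies `cls_embedding`.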
mbart-large-50
mBART-50 is a multilingual Sequence-to-Sequence model pre-trained using the "Multilingual Denoising Pretraining" objective. It was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper. mBART-50 is a multilingual Sequence-to-Sequence model. It was introduced to show that multilingual translation models can be created through multilingual fine-tuning. Instead of fine-tuning on one direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 is created using the original mBART model, extended to add an extra 25 languages to support multilingual machine translation models of 50 languages. The pre-training objective is explained below. Multilingual Denoising Pretraining: The model incorporates N languages by concatenating data: `D = {D1, ..., DN}` where each Di is a collection of monolingual documents in language `i`. The source documents are noised using two schemes: first, randomly shuffling the original sentences' order, and second, a novel in-filling scheme, where spans of text are replaced with a single mask token. The model is then tasked to reconstruct the original text. 35% of each instance's words are masked by randomly sampling a span length according to a Poisson distribution `(λ = 3.5)`. The decoder input is the original text with one position offset. A language id symbol `LID` is used as the initial token to predict the sentence. `mbart-large-50` is a pre-trained model primarily aimed at being fine-tuned on translation tasks. It can also be fine-tuned on other multilingual sequence-to-sequence tasks. See the model hub to look for fine-tuned versions. As the model is multilingual, it expects the sequences in a different format. A special language id token is used as a prefix in both the source and target text. 
The text format is `[lang_code] X [eos]` with `X` being the source or target text respectively, where `lang_code` is `src_lang_code` for source text and `tgt_lang_code` for target text. `bos` is never used. Once the examples are prepared in this format, it can be trained as any other sequence-to-sequence model. Languages covered Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
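The language-id prefix format can be sketched as plain string construction using the standard mBART-50 codes (in practice the tokenizer adds these special tokens for you when you set `src_lang`/`tgt_lang`; the helper name and sample sentences here are ours):

```python
def format_example(src_text, tgt_text, src_lang="en_XX", tgt_lang="ro_RO", eos="</s>"):
    """Build `[lang_code] X [eos]` strings for source and target; no BOS is used."""
    return (f"{src_lang} {src_text} {eos}",
            f"{tgt_lang} {tgt_text} {eos}")

src, tgt = format_example(
    "UN Chief says there is no military solution in Syria",
    "Şeful ONU declară că nu există o soluţie militară în Siria",
)
assert src.startswith("en_XX ") and src.endswith("</s>")
assert tgt.startswith("ro_RO ") and tgt.endswith("</s>")
```

Note the symmetry: unlike the original 25-language mBART, which appends the language id after EOS on the source side, mBART-50 prefixes it on both sides.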
mms-tts-eng
Massively Multilingual Speech (MMS): English Text-to-Speech This repository contains the English (eng) language text-to-speech (TTS) model checkpoint. This model is part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across a diverse range of languages. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text. The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder. 
Due to the stochastic nature of the duration predictor, the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. For the MMS project, a separate VITS checkpoint is trained for each language. To use this checkpoint, first install the latest version of the 🤗 Transformers library: Then, run inference with the following code snippet: The resulting waveform can be saved as a `.wav` file: This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:
dino-vitb8
dinov3-convnext-small-pretrain-lvd1689m
dinov3-vits16plus-pretrain-lvd1689m
dinov3-convnext-tiny-pretrain-lvd1689m
blenderbot-400M-distill
wmt19-ru-en
dinov3-vitl16-pretrain-sat493m
convnextv2-tiny-22k-384
ConvNeXt V2 model pretrained using the FCMAE framework and fine-tuned on the ImageNet-22K dataset at resolution 384x384. It was introduced in the paper ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Woo et al. and first released in this repository. Disclaimer: The team releasing ConvNeXt V2 did not write a model card for this model so this model card has been written by the Hugging Face team. ConvNeXt V2 is a pure convolutional model (ConvNet) that introduces a fully convolutional masked autoencoder framework (FCMAE) and a new Global Response Normalization (GRN) layer to ConvNeXt. ConvNeXt V2 significantly improves the performance of pure ConvNets on various recognition benchmarks. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 21,841 ImageNet-22K classes: For more code examples, we refer to the documentation.
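For instance (a sketch assuming the standard 🤗 Transformers image-classification API; the COCO image URL is the one commonly used in the documentation):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ConvNextV2ForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-384")
model = ConvNextV2ForImageClassification.from_pretrained("facebook/convnextv2-tiny-22k-384")

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest-scoring class, mapped back to a label string.
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```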
map-anything
MapAnything is a simple, end-to-end trained transformer model that directly regresses the factored metric 3D geometry of a scene given various types of modalities as inputs. A single feed-forward model supports over 12 different 3D reconstruction tasks, including multi-image SfM, multi-view stereo, monocular metric depth estimation, registration, depth completion and more. If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:
esm1v_t33_650M_UR90S_1
dinov2-small-imagenet1k-1-layer
Vision Transformer (small-sized model) trained using DINOv2. This is a Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. and first released in this repository. Disclaimer: The team releasing DINOv2 did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion. Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks, and absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that, unlike the base DINOv2 checkpoints, this checkpoint includes a linear classification head (a single layer on top of the final [CLS] token) trained on ImageNet-1k. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image. You can use the model for classifying an image among one of the 1000 ImageNet labels. See the model hub to look for other fine-tuned versions on a task that interests you.
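A short classification sketch (assuming the standard 🤗 Transformers auto classes; the COCO image URL is illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")
model = AutoModelForImageClassification.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one score per ImageNet-1k class

predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```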
mbart-large-50-one-to-many-mmt
mBART-50 one-to-many multilingual machine translation This model is a fine-tuned checkpoint of mBART-large-50. `mbart-large-50-one-to-many-mmt` is fine-tuned for multilingual machine translation. It was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper. The model can translate English to the other 49 languages mentioned below. To translate into a target language, the target language id is forced as the first generated token: pass the `forced_bos_token_id` parameter to the `generate` method. See the model hub to look for more fine-tuned versions. Languages covered: Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
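Concretely (a sketch using the 🤗 Transformers mBART-50 classes; English→Hindi is chosen as an example direction and the sentence is arbitrary):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_en = "The head of the United Nations says there is no military solution in Syria"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX"
)

model_inputs = tokenizer(article_en, return_tensors="pt")

# Force the target language id (here Hindi) as the first generated token.
generated_tokens = model.generate(
    **model_inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],
)
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
```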
vit-mae-large
Vision Transformer (large-sized model) pre-trained with MAE Vision Transformer (ViT) model pre-trained using the MAE method. It was introduced in the paper Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick and first released in this repository. Disclaimer: The team releasing MAE did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like). Images are presented to the model as a sequence of fixed-size patches. During pre-training, one randomly masks out a high portion (75%) of the image patches. First, the encoder is used to encode the visual patches. Next, a learnable (shared) mask token is added at the positions of the masked patches. The decoder takes the encoded visual patches and mask tokens as input and reconstructs raw pixel values for the masked positions. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
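A short sketch of the pre-training forward pass (assuming `ViTMAEForPreTraining` from 🤗 Transformers; the example image URL is illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-large")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-large")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

loss = outputs.loss   # pixel reconstruction loss on the masked patches
mask = outputs.mask   # 1 marks a masked patch (~75% of the patches)
```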
hubert-large-ll60k
The large model pretrained on 16kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16kHz. Note: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model for speech recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model. Authors: Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed Abstract Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/hubert . See this blog for more information on how to fine-tune the model. Note that the class `Wav2Vec2ForCTC` has to be replaced by `HubertForCTC`.
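Even without a tokenizer, the checkpoint can be used to extract speech representations. A minimal sketch (the one-second random waveform is a stand-in for real 16 kHz speech, and `AutoFeatureExtractor` is assumed to resolve to the checkpoint's bundled feature extractor):

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ll60k")
model = HubertModel.from_pretrained("facebook/hubert-large-ll60k")

# One second of dummy 16 kHz audio; replace with real speech samples.
speech = torch.randn(16000).numpy()
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state

print(hidden_states.shape)  # (batch, frames, hidden_size)
```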
nllb-200-1.3B
timesformer-hr-finetuned-k400
TimeSformer (high-resolution variant, fine-tuned on Kinetics-400) TimeSformer model pre-trained on Kinetics-400. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository. Disclaimer: The team releasing TimeSformer did not write a model card for this model so this model card has been written by fcakyon. You can use the raw model for video classification into one of the 400 possible Kinetics-400 labels. For more code examples, we refer to the documentation.
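A minimal sketch (the random 16-frame clip stands in for a real video; the high-resolution variant operates on 448px frames):

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

# Dummy clip: 16 frames of 448x448 RGB; replace with real decoded video frames.
video = list(np.random.randint(0, 256, (16, 448, 448, 3), dtype=np.uint8))

processor = AutoImageProcessor.from_pretrained("facebook/timesformer-hr-finetuned-k400")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-hr-finetuned-k400")

inputs = processor(images=video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one score per Kinetics-400 label

print(model.config.id2label[logits.argmax(-1).item()])
```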
opt-2.7b
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. 
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. - Pushshift.io Reddit dataset that was developed in Baumgartner et al.
(2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset's size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
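The text-generation pipeline usage mentioned above can be sketched as follows (the prompt and seed are arbitrary):

```python
from transformers import pipeline, set_seed

set_seed(32)
generator = pipeline("text-generation", model="facebook/opt-2.7b")

# Greedy (deterministic) decoding by default; do_sample=True enables top-k sampling.
outputs = generator("Hello, I am conscious and", do_sample=True, top_k=50, max_length=30)
print(outputs[0]["generated_text"])
```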
dinov3-vit7b16-pretrain-lvd1689m
PE-Core-B16-224
vjepa2-vitg-fpc64-384-ssv2
sam-audio-judge
xlm-roberta-xl
XLM-RoBERTa-XL model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper Larger-Scale Transformers for Multilingual Masked Language Modeling by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau and first released in this repository. Disclaimer: The team releasing XLM-RoBERTa-XL did not write a model card for this model so this model card has been written by the Hugging Face team. XLM-RoBERTa-XL is an extra-large multilingual version of RoBERTa, pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective: taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa-XL model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch:
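Both uses can be sketched as follows (note the checkpoint is around 3.5B parameters, so this needs substantial memory; the example sentences are arbitrary):

```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModel

# Masked language modelling via the fill-mask pipeline.
unmasker = pipeline("fill-mask", model="facebook/xlm-roberta-xl")
predictions = unmasker("Europe is a <mask> continent.")
print(predictions[0]["token_str"])

# Feature extraction in PyTorch.
tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xl")
model = AutoModel.from_pretrained("facebook/xlm-roberta-xl")
inputs = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state
print(features.shape)
```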
metaclip-2-worldwide-huge-quickgelu
mask2former-swin-large-coco-instance
dinov2-with-registers-small
metaclip-b16-fullcc2.5b
mms-tts-vie
mask2former-swin-small-ade-semantic
convnext-base-224
wmt19-en-de
wav2vec2-xlsr-53-phon-cv-ft
dinov3-convnext-base-pretrain-lvd1689m
convnextv2-tiny-1k-224
wav2vec2-large-xlsr-53-italian
vjepa2-vitg-fpc64-256
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest transformers: V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs. To load a video, sample the number of frames according to the model. For this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```
sam-audio-large
convnextv2-atto-1k-224
sam2.1-hiera-base-plus
blenderbot_small-90M
encodec_48khz
nougat-base
Nougat model trained on PDF-to-markdown. It was introduced in the paper Nougat: Neural Optical Understanding for Academic Documents by Blecher et al. and first released in this repository. Disclaim...
mask2former-swin-base-ade-semantic
cotracker3
dpr-ctx_encoder-multiset-base
mms-tts-ind
mms-lid-4017
convnextv2-base-22k-384
deit-base-distilled-patch16-224
mbart-large-cc25
musicgen-melody-large
sam-audio-base
wav2vec2-large-lv60
s2t-small-librispeech-asr
xlm-v-base
hf-seamless-m4t-medium
convnext-tiny-224
bart-large-xsum
PE-Lang-G14-448
opt-30b
OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. 
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. For large OPT models, such as this one, it is not recommended to use the `text-generation` pipeline, because one should load the model in half-precision to accelerate generation and optimize memory consumption on GPU. It is recommended to directly call the `generate` method as follows: By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible.
It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. - Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset's size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
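The half-precision `generate` usage mentioned above can be sketched as follows (note this checkpoint needs roughly 60GB of GPU memory even in fp16; `device_map="auto"`, which requires the `accelerate` package, is our substitution for explicit `.cuda()` placement, and the prompt is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in half-precision to accelerate generation and reduce GPU memory use.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", use_fast=False)

prompt = "Hello, I am conscious and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Deterministic by default; do_sample=True enables top-k sampling.
generated_ids = model.generate(input_ids, do_sample=True, max_length=30)
text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```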
mcontriever-msmarco
nougat-small
opt-13b
dinov2-giant-imagenet1k-1-layer
wmt19-de-en
dino-vits8
data2vec-audio-base-960h
xlm-roberta-xxl
timesformer-base-finetuned-k600
vjepa2-vitl-fpc16-256-ssv2
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. 💡 This is the V-JEPA 2 ViT-L 256 model with a video classification head trained on the Something-Something-V2 dataset. To run the V-JEPA 2 model, ensure you have installed the latest transformers:
musicgen-large
MusicGen is a text-to-music model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It is a single-stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. MusicGen was published in Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez. Four checkpoints are released: - small - medium - large (this checkpoint) - melody You can run MusicGen locally with the 🤗 Transformers library from version 4.31.0 onwards. 1. First install the 🤗 Transformers library and scipy: 2. Run inference via the `Text-to-Audio` (TTA) pipeline. You can infer the MusicGen model via the TTA pipeline in just a few lines of code! 3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 32 kHz audio waveform for more fine-grained control. 4. Listen to the audio samples either in an ipynb notebook: Or save them as a `.wav` file using a third-party library, e.g. `scipy`: For more details on using the MusicGen model for inference using the 🤗 Transformers library, refer to the MusicGen docs. You can also run MusicGen locally through the original Audiocraft library: Organization developing the model: The FAIR team of Meta AI. Model date: MusicGen was trained between April 2023 and May 2023. Model type: MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling.
The model comes in different sizes: 300M, 1.5B and 3.3B parameters; and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation. Paper or resources for more information: More information can be found in the paper Simple and Controllable Music Generation. Where to send questions or comments about the model: Questions and comments about MusicGen can be sent via the Github repository of the project, or by opening an issue. Intended use Primary intended use: The primary use of MusicGen is research on AI-based music generation, including: - Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science - Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs Primary intended users: The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models. Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
Model performance measures: We used the following objective measures to evaluate the model on a standard music benchmark: - Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish) - Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST) - CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model Additionally, we ran qualitative studies with human participants, evaluating the performance of the model along the following axes: - Overall quality of the music samples; - Text relevance to the provided text input; - Adherence to the melody for melody-guided music generation. More details on performance measures and human studies can be found in the paper. The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set, with no artist overlap with the training set. The model was trained on licensed data using the following sources: the Meta Music Initiative Sound Collection, the Shutterstock music collection and the Pond5 music collection. See the paper for more details about the training set and corresponding preprocessing. Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely the open source Hybrid Transformer for Music Source Separation (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper.
| Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|---|---|---|---|---|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |

More information can be found in the paper Simple and Controllable Music Generation, in the Results section. Data: The data sources used to train the model are created by music professionals and covered by legal agreements with the rights holders. The model is trained on 20K hours of data; we believe that scaling the model on larger datasets can further improve its performance. Mitigations: Vocals have been removed from the data source using corresponding tags, and then using a state-of-the-art music source separation method, namely using the open source Hybrid Transformer for Music Source Separation (HT-Demucs). - The model is not able to generate realistic vocals. - The model has been trained with English descriptions and will not perform as well in other languages. - The model does not perform equally well for all music styles and cultures. - The model sometimes generates end of songs, collapsing to silence. - It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results. Biases: The source of data is potentially lacking diversity and all music cultures are not equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exist. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.
Risks and harms: Biases and limitations of the model may lead to generation of samples that may be considered as biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will make it possible to broaden the application to new and more representative data. Use cases: Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.
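As a minimal sketch of text-to-music generation with the 🤗 Transformers MusicGen integration: the `facebook/musicgen-small` checkpoint appears in the metrics table above, but the 50 Hz audio-token rate used to budget the generation length is an assumption stated in the comments, not read from the checkpoint config.

```python
# Token rate assumed from the released MusicGen models (EnCodec at 32 kHz,
# ~50 audio tokens per second); treat it as an assumption, not a config value.
FRAME_RATE_HZ = 50

def tokens_for_seconds(seconds: float) -> int:
    """Convert a target duration into a max_new_tokens budget."""
    return int(seconds * FRAME_RATE_HZ)

def generate_music(prompt: str, seconds: float = 5.0):
    """Generate a waveform from a text prompt (downloads the checkpoint)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    with torch.no_grad():
        audio = model.generate(**inputs, max_new_tokens=tokens_for_seconds(seconds))
    return audio  # shape: (batch, channels, samples)
```

A five-second request therefore translates to a budget of 250 new tokens.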
sam-audio-small
wmt19-en-ru
mms-tts-ewe
xmod-base
mask2former-swin-small-coco-instance
mask2former-swin-large-mapillary-vistas-panoptic
mask2former-swin-tiny-ade-semantic
Perception-LM-1B
audiogen-medium
mms-1b-l1107
blenderbot-90M
mbart-large-en-ro
mbart-large-50-many-to-one-mmt
convnext-base-224-22k
dpr-question_encoder-multiset-base
vjepa2-vith-fpc64-256
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of VJEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest transformers: V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs. To load a video, sample the number of frames according to the model. For this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```
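The card's frame-handling advice (sample 64 frames per video; repeat a single image across the clip) can be sketched as plain array manipulation. The helper names are illustrative, not part of the V-JEPA 2 API:

```python
import numpy as np

NUM_FRAMES = 64  # this checkpoint expects 64-frame clips (fpc64)

def sample_frame_indices(total_frames: int, num_frames: int = NUM_FRAMES) -> np.ndarray:
    """Pick num_frames indices evenly spread over a video's frames."""
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int)

def image_to_clip(image: np.ndarray, num_frames: int = NUM_FRAMES) -> np.ndarray:
    """Repeat a single (H, W, C) image into a (T, H, W, C) clip, as the card suggests."""
    return np.repeat(image[None, ...], num_frames, axis=0)
```

The resulting clip can then be handed to the checkpoint's video processor in the usual Transformers fashion.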
mask2former-swin-tiny-cityscapes-semantic
xglm-564M
mms-tts-tha
wav2vec2-large
The base model pretrained on 16kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16kHz. Note that this model should be fine-tuned on a downstream task, like Automatic Speech Recognition. Check out this blog for more information. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli Abstract We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. See this notebook for more information on how to fine-tune the model.
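Since this checkpoint is meant to be used as a feature encoder before fine-tuning, a hedged sketch of extracting frame-level latents follows. The ~400-sample receptive field and 320-sample stride are the paper's architecture constants (about 49 feature frames per second at 16 kHz), stated here as assumptions rather than read from the checkpoint config:

```python
def num_output_frames(num_samples: int) -> int:
    """Frames produced by the wav2vec 2.0 convolutional feature encoder.

    Assumes a 400-sample (25 ms) receptive field and 320-sample (20 ms)
    stride at 16 kHz, i.e. roughly 49 feature vectors per second.
    """
    return (num_samples - 400) // 320 + 1

def extract_features(waveform_16khz):
    """Return frame-level latent representations (downloads the checkpoint)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")
    inputs = fe(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state  # (1, frames, hidden)
```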
mms-tts-spa
mms-tts-tam
deit-small-patch16-224
sam3
rag-token-nq
MobileLLM-Pro
dragon-plus-context-encoder
dragon-plus-query-encoder
blenderbot-3B
dinov3-convnext-large-pretrain-lvd1689m
mms-tts-tgl
esm1b_t33_650M_UR50S
webssl-mae300m-full2b-224
s2t-medium-mustc-multilingual-st
mask2former-swin-tiny-coco-panoptic
MobileLLM-R1-950M
DiT-XL-2-256
PE-Spatial-B16-512
vit-mae-huge
mms-lid-1024
mms-tts-hin
incoder-1B
PE-Core-T16-384
mgenre-wiki
metaclip-l14-fullcc2.5b
wav2vec2-xls-r-1b
galactica-1.3b
ijepa_vith14_1k
MEXMA
Current approaches to pre-training cross-lingual sentence encoders use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them. Usage You use this model as you would any other XLM-RoBERTa model, taking into account that the "pooler" has not been trained, so you should use the CLS token the encoder outputs directly as your sentence representation: You can also use this model with SentenceTransformers: License This model is released under the MIT license. Training code For the training code of this model, please check the official MEXMA repo. Paper MEXMA: Token-level objectives improve sentence representations Citation If you use this model in your work, please cite:
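The stripped usage snippet can be restored as a sketch: take the CLS position of `last_hidden_state` directly, since the card notes the pooler is untrained. The `facebook/MEXMA` repo id is an assumption, and the cosine-similarity helper is just an illustrative way to compare the resulting embeddings:

```python
def cosine_similarity(a, b):
    """Plain cosine similarity between two 1-D vectors (no dependencies)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def sentence_embeddings(sentences):
    """CLS-token sentence representations from MEXMA (downloads the checkpoint)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/MEXMA")  # repo id assumed
    model = AutoModel.from_pretrained("facebook/MEXMA")
    inputs = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the CLS position directly, per the card: the pooler is untrained.
    return outputs.last_hidden_state[:, 0]
```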
hubert-xlarge-ls960-ft
wav2vec2-large-960h-lv60
mms-tts-fra
ijepa_vith14_22k
sam2.1-hiera-small
convnext-small-224
dinov2-base-imagenet1k-1-layer
sam2-hiera-base-plus
mms-1b
musicgen-melody
mask2former-swin-large-cityscapes-panoptic
webssl-dino300m-full2b-224
detr-resnet-50-dc5
mms-tts-rus
Massively Multilingual Speech (MMS): Russian Text-to-Speech This repository contains the Russian (rus) language text-to-speech (TTS) model checkpoint. This model is part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across a diverse range of languages. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text. The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder. 
Due to the stochastic nature of the duration predictor, the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. For the MMS project, a separate VITS checkpoint is trained on each language. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, first install the latest version of the library: Then, run inference with the following code snippet: The resulting waveform can be saved as a `.wav` file: This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:
opt-66b
mask2former-swin-base-coco-instance
webssl-dino7b-full8b-224
PE-Spatial-L14-448
mms-tts-amh
hiera-tiny-224-hf
convnextv2-large-22k-384
sonata
data2vec-audio-base
levit-128S
pe-av-large-16-frame
xglm-1.7B
maskformer-swin-base-ade
mms-tts-ara
Perception-LM-3B
tart-full-flan-t5-xl
mms-tts-tir
mms-1b-fl102
mms-tts-lao
wav2vec2-conformer-rel-pos-large-960h-ft
mms-tts-mya
wav2vec2-large-robust
Perception-LM-8B
sam2-hiera-tiny
musicgen-stereo-small
dinov2-with-registers-giant
mms-tts-tur
maskformer-swin-large-ade
s2t-small-mustc-en-fr-st
fasttext-et-vectors
regnet-y-040
galactica-125m
wav2vec2-large-xlsr-53-spanish
mms-tts-urd-script_arabic
detr-resnet-101-dc5
convnextv2-large-22k-224
dinov2-large-imagenet1k-1-layer
mask2former-swin-large-ade-panoptic
mms-tts-kor
convnextv2-huge-22k-512
maskformer-swin-base-coco
wav2vec2-xls-r-2b
rag-token-base
mms-tts-uig-script_arabic
dragon-roberta-query-encoder
dragon-roberta-context-encoder
layerskip-llama3-8B
wav2vec2-large-100k-voxpopuli
data2vec-audio-large-960h
mms-tts-tel
vjepa2-vitg-fpc64-384
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of VJEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest transformers: V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs. To load a video, sample the number of frames according to the model. For this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.
pe-a-frame-large
mms-tts-uzb-script_cyrillic
mask2former-swin-small-coco-panoptic
maskformer-swin-small-coco
layerskip-llama2-7B
mms-tts-pol
sam-audio-large-tv
DiT-XL-2-512
deit-tiny-distilled-patch16-224
convnextv2-huge-22k-384
convnext-large-224
vit-msn-small
layerskip-codellama-34B
maskformer-swin-small-ade
rag-sequence-base
mms-tts-deu
mms-tts-orm
PE-Spatial-T16-512
wav2vec2-large-xlsr-53-french
data2vec-text-base
mms-tts-fas
pe-av-base-16-frame
sam2-hiera-small
timesformer-base-finetuned-ssv2
mms-tts-ben
wav2vec2-large-robust-ft-swbd-300h
hf-seamless-m4t-large
data2vec-vision-base-ft1k
mms-tts-por
blenderbot-1B-distill
deformable-detr-detic
webssl-dino1b-full2b-224
xglm-7.5B
s2t-large-librispeech-asr
wav2vec2-conformer-rel-pos-large
mms-lid-512
MobileLLM-Pro-base-int4-cpu
wav2vec2-base-it-voxpopuli
dpr-reader-single-nq-base
s2t-small-mustc-en-it-st
ijepa_vitg16_22k
MobileLLM-R1-140M
opt-iml-max-1.3b
xglm-4.5B
MobileLLM-125M
audioseal
PE-Lang-L14-448
mcontriever
metamotivo-M-1
musicgen-stereo-medium
galactica-120b
galactica-30b
galactica-6.7b
layerskip-llama3.2-1B
dinov2-with-registers-base-imagenet1k-1-layer
metaclip-b32-fullcc2.5b
mms-tts-yor
wav2vec2-large-xlsr-53-german
opt-iml-max-30b
mms-tts-ory
regnet-y-320
regnet-y-160
mms-tts-pan
nllb-moe-54b
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data and the strategies to handle data imbalances for high and low resource languages that were used to train NLLB-200 is described in the paper. - Paper or other resource for more information NLLB Team et al, No Language Left Behind: Scaling Human-Centered Machine Translation, Arxiv, 2022 - License: CC-BY-NC - Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. - Expert Output Masking is used for training, which consists of dropping the full contribution of some tokens. This corresponds to the following scheme: Generating with NLLB-MoE The available checkpoints require around 350GB of storage. Make sure to use `accelerate` if you do not have enough RAM on your machine. While generating the target text, set the `forced_bos_token_id` to the target language id. The following example shows how to translate English to French using the facebook/nllb-moe-54b model. Note that we're using the BCP-47 code for French `fra_Latn`. See here for the list of all BCP-47 in the Flores 200 dataset.
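The English-to-French example the card describes can be restored as a sketch. Looking the language code up with `convert_tokens_to_ids` is an assumption (the FLORES-200 codes are registered as special tokens in the NLLB tokenizer); given the checkpoint's size, the model load is kept inside a function:

```python
TARGET_LANG = "fra_Latn"  # BCP-47 code for French in the FLORES-200 list

def translate_en_to_fr(text: str) -> str:
    """English -> French with NLLB-MoE (downloads the ~350GB checkpoint)."""
    # Heavy imports kept local; use accelerate/device_map for a model this size.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b", device_map="auto")
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            # Force the decoder to start in the target language.
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(TARGET_LANG),
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```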
mask2former-swin-small-cityscapes-semantic
mms-tts-khm
layerskip-llama2-70B
mms-tts-swh
convnext-base-384-22k-1k
magnet-small-10secs
fasttext-en-vectors
metamotivo-S-1
maskformer-resnet101-coco-stuff
MobileLLM-Pro-base
Meta-SecAlign-70B
mms-tts-heb
metaclip-2-worldwide-giant
mms-tts-ell
blt-1b
maskformer-swin-large-coco
regnet-y-080
mms-tts-nld
regnet-y-064
regnet-y-120
deit-small-distilled-patch16-224
cotracker
pe-av-small
mms-tts-hau
layerskip-llama2-13B
dpr-reader-multiset-base
mms-tts-mar
wav2vec2-base-100h
The base model pretrained and fine-tuned on 100 hours of Librispeech on 16kHz sampled speech audio. When using the model make sure that your speech input is also sampled at 16kHz. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows: This code snippet shows how to evaluate facebook/wav2vec2-base-100h on LibriSpeech's "clean" and "other" test data.
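The stripped standalone-acoustic-model snippet can be restored as a sketch. The greedy CTC collapse below (merge repeats, drop blanks) is the standard decode rule that `processor.batch_decode` applies internally; the exact processor/model argument names follow the Transformers docs but should be treated as assumptions:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Greedy CTC decode of a frame-level argmax sequence:
    merge consecutive repeats, then drop blank tokens."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

def transcribe(waveform_16khz) -> str:
    """Greedy transcription with the fine-tuned checkpoint (downloads the model)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")
    inputs = processor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode performs the same collapse as ctc_greedy_collapse internally.
    return processor.batch_decode(predicted_ids)[0]
```

Note how a blank between two identical labels keeps both copies: `[0,1,1,0,1]` decodes to two 1s, not one.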
dpt-dinov2-small-kitti
drama-base
mms-tts-ron
tribev2
sapiens-seg-1b-torchscript
mms-tts-fon
MobileLLM-R1-360M
mms-tts-swe
metaclip-2-worldwide-huge-378
MetaCLIP 2 (worldwide) was presented in MetaCLIP 2: A Worldwide Scaling Recipe. This checkpoint corresponds to "ViT-H-14-378-worldwide" of the original implementation. First install the Transformers library (from source for now): In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:
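The card's elided `AutoModel` snippet can be sketched as a CLIP-style zero-shot classification call. The processor/model interface mirrors other CLIP-family models in Transformers but is an assumption here; the softmax helper is plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax, to turn image-text logits into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image, labels):
    """Score an image against text labels via the AutoModel API (downloads the model)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import AutoModel, AutoProcessor

    name = "facebook/metaclip-2-worldwide-huge-378"
    processor = AutoProcessor.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    inputs = processor(text=labels, images=image, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return softmax(outputs.logits_per_image[0].tolist())
```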
wav2vec2-xls-r-1b-en-to-15
dinov2-with-registers-giant-imagenet1k-1-layer
wav2vec2-xls-r-1b-21-to-en
deit-base-distilled-patch16-384
pe-av-large
KernelLLM
PE-Spatial-G14-448
mms-tts-som
layerskip-codellama-7B
mms-tts-hun
mms-tts-bul
convnextv2-nano-22k-224
mms-tts-kab
mms-tts-guj
mms-tts-kaz
convnext-xlarge-224-22k
audio-magnet-medium
metaclip-b16-400m
metaclip-2-worldwide-giant-378
MetaCLIP 2 (worldwide) was presented in MetaCLIP 2: A Worldwide Scaling Recipe. This checkpoint corresponds to "ViT-bigG-14-378-worldwide" of the original implementation. First install the Transformers library (from source for now): In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:
convnextv2-base-1k-224
mms-tts-nan
dpt-dinov2-large-nyu
mms-tts-mon
vjepa2-vitl-fpc32-256-diving48
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of VJEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. 💡 This is the V-JEPA 2 ViT-L 256 model with a video classification head pretrained on the Diving 48 dataset. To run the V-JEPA 2 model, ensure you have installed the latest transformers:
magnet-medium-30secs
maskformer-swin-tiny-coco
convnextv2-nano-1k-224
convnextv2-large-1k-224
blt-entropy
VGGT-1B-Commercial
dinov2-with-registers-small-imagenet1k-1-layer
ijepa_vith16_1k
maskformer-swin-tiny-ade
timesformer-hr-finetuned-k600
mms-tts-kan
MobileLLM-1B
Meta-SecAlign-8B
hiera-tiny-224-in1k-hf
hiera-tiny-224-mae-hf
convnext-xlarge-224-22k-1k
sam-audio-small-tv
deit-base-patch16-384
s2t-medium-librispeech-asr
MobileLLM-350M
xmod-large-prenorm
llm-compiler-7b
MobileLLM-R1-950M-base
dinov2-with-registers-large-imagenet1k-1-layer
wav2vec2-large-xlsr-53-dutch
mms-tts-hat
MobileLLM-ParetoQ-1.5B-1.58-bit
wav2vec2-large-xlsr-53-portuguese
mms-tts-mal
mms-tts-ukr
PE-Core-S16-384
blt-7b
mask2former-swin-large-cityscapes-instance
magnet-small-30secs
convnext-base-224-22k-1k
mask2former-swin-base-IN21k-cityscapes-panoptic
mms-tts-ceb
mms-tts-kmr-script_latin
mms-tts-gbm
mms-tts-kmr-script_arabic
mask2former-swin-base-IN21k-cityscapes-semantic
metamotivo-S-4
sapiens-pose-1b-torchscript
esm1v_t33_650M_UR90S_3
chameleon-30b
mms-tts-quz
metamotivo-S-2
musicgen-stereo-large
metamotivo-S-3
metamotivo-S-5
mms-tts-eus
xglm-2.9B
convnextv2-pico-1k-224
esm1v_t33_650M_UR90S_4
mms-tts-fin
mask2former-swin-base-IN21k-cityscapes-instance
MobileLLM-R1-360M-base
magnet-medium-10secs
mms-tts-crs
locate-3d-plus
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D Official model weights for the `Locate-3D` models and the `3D-JEPA` encoders. Locate 3D is a model for localizing objects in 3D scenes from referring expressions like “the small coffee table between the sofa and the lamp.” Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds, is key to `Locate 3D`. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. - Locate-3D: Locate-3D model trained on public referential grounding datasets - Locate-3D+: Locate-3D model trained on public referential grounding datasets and the newly released Locate 3D Dataset - 3D-JEPA: Pre-trained SSL encoder for 3D understanding For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository. The majority of `locate-3d` is licensed under CC-BY-NC, however portions of the project are available under separate license terms: Pointcept is licensed under the MIT license.
mms-tts-zlm
data2vec-vision-large
timesformer-hr-finetuned-ssv2
s2t-small-mustc-en-de-st
pe-av-small-16-frame
dpt-dinov2-base-kitti
esm1v_t33_650M_UR90S_5
incoder-6B
locate-3d
mask2former-swin-base-IN21k-ade-semantic
mms-tts-pap
sapiens
hubert-xlarge-ll60k
mask2former-swin-tiny-cityscapes-panoptic
wav2vec2-base-es-voxpopuli-v2
mms-tts-sqi
mms-tts-aka
wav2vec2-base-it-voxpopuli-v2
wav2vec2-xlsr-53-phon-cv-babel-ft
dpt-dinov2-base-nyu
MobileLLM-600M
vit-msn-base
detr-resnet-101-panoptic
mms-tts-mlg
sapiens-pose-bbox-detector
MobileLLM-ParetoQ-125M-BF16
pixio-vitl16
hiera-base-224-hf
mms-tts-saq
sapiens-depth-1b-torchscript
dpt-dinov2-large-kitti
DPT (Dense Prediction Transformer) model with DINOv2 backbone as proposed in DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. The model is intended to showcase that using the DPT framework with DINOv2 as backbone yields a powerful depth estimator.
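A hedged sketch of monocular depth estimation with this checkpoint: `DPTForDepthEstimation` and `AutoImageProcessor` are the documented Transformers classes for DPT, though exact argument names should be treated as assumptions. The normalization helper just rescales a depth map for visualization:

```python
def normalize_depth(depth, lo=None, hi=None):
    """Scale a 2-D depth map (list of rows) to [0, 255] integers for visualization."""
    flat = [v for row in depth for v in row]
    lo = min(flat) if lo is None else lo
    hi = max(flat) if hi is None else hi
    span = (hi - lo) or 1.0
    return [[int(255 * (v - lo) / span) for v in row] for row in depth]

def estimate_depth(image):
    """Monocular depth with the DPT + DINOv2 checkpoint (downloads the model)."""
    # Heavy imports kept local so the helper above stays dependency-free.
    import torch
    from transformers import AutoImageProcessor, DPTForDepthEstimation

    name = "facebook/dpt-dinov2-large-kitti"
    processor = AutoImageProcessor.from_pretrained(name)
    model = DPTForDepthEstimation.from_pretrained(name)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        depth = model(**inputs).predicted_depth  # (1, H', W')
    return depth[0]
```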