facebook

500 models

esm2_t33_650M_UR50D

ESM-2 is a state-of-the-art protein model trained on a masked language modelling objective. It is suitable for fine-tuning on a wide range of tasks that take protein sequences as input. For detailed information on the model architecture and training data, please refer to the accompanying paper. You may also be interested in the demo notebooks (PyTorch, TensorFlow) which demonstrate how to fine-tune ESM-2 models on your tasks of interest. Several ESM-2 checkpoints are available on the Hub with varying sizes. Larger sizes generally have somewhat better accuracy, but require much more memory and time to train:

| Checkpoint name | Num layers | Num parameters |
|------------------------------|----|----------|
| esm2_t48_15B_UR50D | 48 | 15B |
| esm2_t36_3B_UR50D | 36 | 3B |
| esm2_t33_650M_UR50D | 33 | 650M |
| esm2_t30_150M_UR50D | 30 | 150M |
| esm2_t12_35M_UR50D | 12 | 35M |
| esm2_t6_8M_UR50D | 6 | 8M |
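The table's parameter counts translate directly into the memory cost mentioned above. As a rough back-of-the-envelope sketch (my own illustration, assuming 4 bytes per fp32 parameter and counting weights only — activations, gradients, and optimizer state add substantially more during fine-tuning):

```python
# Rough fp32 weight-memory estimate per ESM-2 checkpoint.
# Illustrative only: 4 bytes/parameter, weights alone, ignoring
# activations, optimizer state, and framework overhead.
CHECKPOINTS = {
    "esm2_t48_15B_UR50D": 15e9,
    "esm2_t36_3B_UR50D": 3e9,
    "esm2_t33_650M_UR50D": 650e6,
    "esm2_t30_150M_UR50D": 150e6,
    "esm2_t12_35M_UR50D": 35e6,
    "esm2_t6_8M_UR50D": 8e6,
}

def fp32_weight_gib(num_params: float) -> float:
    """Approximate weight memory in GiB at 4 bytes per parameter."""
    return num_params * 4 / 2**30

for name, n in CHECKPOINTS.items():
    print(f"{name}: ~{fp32_weight_gib(n):.2f} GiB")
```

By this estimate the 650M checkpoint needs roughly 2.4 GiB for weights alone, while the 15B checkpoint needs over 55 GiB, which is why the smaller checkpoints are the usual starting point for fine-tuning.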

11,021,509
57

contriever

This model has been trained without supervision following the approach described in [Towards Unsupervised Dense Information Retrieval with Contrastive Learning](https://arxiv.org/abs/2112.09118). The associated GitHub repository is available at https://github.com/facebookresearch/contriever.

7,793,446
71

wav2vec2-base-960h

The base model pretrained and fine-tuned on 960 hours of LibriSpeech on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned. Experiments using all labeled data of LibriSpeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files, the model can be used as a standalone acoustic model; the code snippet in the original model card shows how to evaluate facebook/wav2vec2-base-960h on LibriSpeech's "clean" and "other" test data.
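The WER figures quoted above (e.g. 1.8/3.3 on clean/other) are word error rates: word-level edit distance divided by the number of reference words. A minimal sketch of the metric, with my own function name (evaluation pipelines typically use a library such as jiwer instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion
```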

5,669,239
381

opt-125m

OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI.

Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team.

To quote the first two paragraphs of the official paper:

> Large language models trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning. While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers' ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

> We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models as GPT-3; as such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper.

Intended uses & limitations

The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`.

As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased:

> Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.

This bias will also affect all fine-tuned versions of this model.

The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents:

- BookCorpus, which consists of more than 10K unpublished books,
- CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas,
- The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included,
- the Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021),
- CCNewsV2, containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b).

The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset's size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg".

The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
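The card notes that generation is deterministic by default and that top-k sampling is enabled with `do_sample=True` (plus `top_k` in transformers' `generate`). Conceptually, top-k sampling restricts sampling to the k most likely next tokens; a minimal numpy sketch of that idea (my own function, not the transformers implementation):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token id from only the k highest-logit candidates."""
    top = np.argsort(logits)[-k:]                    # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, -1.0, 3.5, 0.7])
token = top_k_sample(logits, k=2, rng=rng)
# with k=2, the sampled id is always one of the two most likely tokens (ids 1 and 3)
```

Greedy decoding corresponds to the deterministic default (always picking the argmax), while raising k trades determinism for generation diversity.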

4,176,325
224

esmfold_v1

--- license: mit ---

3,754,685
43

bart-large-cnn

--- language: - en pipeline_tag: summarization license: mit thumbnail: https://huggingface.co/front/thumbnails/facebook.png datasets: - cnn_dailymail model-index: - name: facebook/bart-large-cnn results: - task: type: summarization name: Summarization dataset: name: cnn_dailymail type: cnn_dailymail config: 3.0.0 split: train metrics: - name: ROUGE-1 type: rouge value: 42.9486 verified: true - name: ROUGE-2 type: rouge value: 20.8149 verified: true - name: ROUGE-L type: rouge value: 30.6186 verified: true ---

3,432,857
1,496

bart-large-mnli

This is the checkpoint for bart-large after being trained on the MultiNLI (MNLI) dataset. Additional information about this model:

- The bart-large model page
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Yin et al. proposed a method for using pre-trained NLI models as ready-made zero-shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label. For example, if we want to evaluate whether a sequence belongs to the class "politics", we could construct the hypothesis `This text is about politics.`. The probabilities for entailment and contradiction are then converted to label probabilities. This method is surprisingly effective in many cases, particularly when used with larger pre-trained models like BART and RoBERTa. See this blog post for a more expansive introduction to this and other zero-shot methods, and see the code snippets below for examples of using this model for zero-shot classification both with Hugging Face's built-in pipeline and with native Transformers/PyTorch code. The model can be loaded with the `zero-shot-classification` pipeline. You can then use this pipeline to classify sequences into any of the class names you specify. If more than one candidate label can be correct, pass `multi_label=True` to calculate each class independently.
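The conversion from NLI scores to label probabilities described above can be sketched in plain numpy. This mirrors the logic of the two modes (single-label vs. `multi_label=True`) rather than reproducing transformers' internal code; the function name is mine:

```python
import numpy as np

def label_probs_from_nli(entail_logits, contra_logits, multi_label=False):
    """Turn per-label NLI (entailment, contradiction) logits into label probabilities.

    Single-label mode: softmax of entailment logits across all candidate labels,
    so probabilities sum to 1. Multi-label mode: each label is scored
    independently via softmax over its own (contradiction, entailment) pair.
    """
    e = np.asarray(entail_logits, dtype=float)
    c = np.asarray(contra_logits, dtype=float)
    if multi_label:
        return np.exp(e) / (np.exp(e) + np.exp(c))  # independent per-label sigmoid
    z = np.exp(e - e.max())                         # stable softmax across labels
    return z / z.sum()

# Three candidate labels; in multi-label mode the scores need not sum to 1.
print(label_probs_from_nli([1.0, 2.0, 0.5], [0.0, 0.0, 0.0]))
print(label_probs_from_nli([1.0, 2.0, 0.5], [0.0, 0.0, 0.0], multi_label=True))
```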

3,285,410
1,479

flava-full

--- license: bsd-3-clause ---

license:bsd-3-clause
2,731,883
42

dinov2-base

Vision Transformer (base-sized model) trained using the DINOv2 method. It was introduced in the paper DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. and first released in this repository.

Disclaimer: The team releasing DINOv2 did not write a model card for this model, so this model card has been written by the Hugging Face team.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion. Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. A [CLS] token is also added to the beginning of the sequence for use in classification tasks, and absolute position embeddings are added before feeding the sequence to the layers of the Transformer encoder. Note that this model does not include any fine-tuned heads.

By pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images, for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places the linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image. You can use the raw model for feature extraction. See the model hub to look for fine-tuned versions on a task that interests you.
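The "linear layer on top of the [CLS] token" setup can be sketched in numpy. Random arrays stand in for real DINOv2 embeddings here; the shapes (hidden size 768 for the base model) are the only assumption taken from the card's description:

```python
import numpy as np

# Linear-probe sketch: classify images from their [CLS] embeddings.
rng = np.random.default_rng(0)
hidden_size, n_images, n_classes = 768, 32, 3

# Stand-in for the encoder's last hidden state at the [CLS] position,
# one 768-dim vector per image.
cls_embeddings = rng.normal(size=(n_images, hidden_size))

# The trainable head: a single linear layer on top of the frozen encoder.
W = rng.normal(size=(hidden_size, n_classes)) * 0.02
b = np.zeros(n_classes)

logits = cls_embeddings @ W + b      # one score per class per image
predictions = logits.argmax(axis=1)  # predicted class index per image
```

In practice the embeddings would come from the frozen pretrained encoder and `W`, `b` would be trained with a standard classification loss; only the head's parameters are updated.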

license:apache-2.0
2,030,054
157

w2v-bert-2.0

--- license: mit language: - af - am - ar - as - az - be - bn - bs - bg - ca - cs - zh - cy - da - de - el - en - et - fi - fr - or - om - ga - gl - gu - ha - he - hi - hr - hu - hy - ig - id - is - it - jv - ja - kn - ka - kk - mn - km - ky - ko - lo - ln - lt - lb - lg - lv - ml - mr - mk - mt - mi - my - nl - nb - ne - ny - oc - pa - ps - fa - pl - pt - ro - ru - sk - sl - sn - sd - so - es - sr - sv - sw - ta - te - tg - tl - th - tr - uk - ur - uz - vi - wo - xh - yo - ms - zu - ary - arz -

license:mit
1,700,988
194

bart-base

--- license: apache-2.0 language: en ---

license:apache-2.0
1,538,740
200

dinov2-small

--- license: apache-2.0 tags: - dino - vision ---

license:apache-2.0
1,513,306
49

musicgen-medium

--- inference: true tags: - musicgen license: cc-by-nc-4.0 pipeline_tag: text-to-audio widget: - text: a funky house with 80s hip hop vibes example_title: Prompt 1 - text: a chill song with influences from lofi, chillstep and downtempo example_title: Prompt 2 - text: a catchy beat for a podcast intro example_title: Prompt 3 ---

license:cc-by-nc-4.0
1,397,570
141

roberta-hate-speech-dynabench-r4-target

--- language: en ---

1,289,386
92

esm2_t6_8M_UR50D

--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---

license:mit
1,254,801
23

mms-300m

--- tags: - mms language: - ab - af - ak - am - ar - as - av - ay - az - ba - bm - be - bn - bi - bo - sh - br - bg - ca - cs - ce - cv - ku - cy - da - de - dv - dz - el - en - eo - et - eu - ee - fo - fa - fj - fi - fr - fy - ff - ga - gl - gn - gu - zh - ht - ha - he - hi - sh - hu - hy - ig - ia - ms - is - it - jv - ja - kn - ka - kk - kr - km - ki - rw - ky - ko - kv - lo - la - lv - ln - lt - lb - lg - mh - ml - mr - ms - mk - mg - mt - mn - mi - my - zh - nl - 'no' - 'no' - ne - ny - oc

license:cc-by-nc-4.0
901,679
36

esm2_t36_3B_UR50D

--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---

license:mit
869,716
24

m2m100_418M

--- language: - multilingual - af - am - ar - ast - az - ba - be - bg - bn - br - bs - ca - ceb - cs - cy - da - de - el - en - es - et - fa - ff - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - ht - hu - hy - id - ig - ilo - is - it - ja - jv - ka - kk - km - kn - ko - lb - lg - ln - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - no - ns - oc - or - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - so - sq - sr - ss - su - sv - sw - ta - th - tl - tn - tr - uk - ur - uz - v

license:mit
842,115
320

detr-resnet-50

--- license: apache-2.0 tags: - object-detection - vision datasets: - coco widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg example_title: Savanna - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ---

license:apache-2.0
835,954
908

dino-vits16

--- license: apache-2.0 tags: - dino - vision datasets: - imagenet-1k ---

license:apache-2.0
813,359
16

hubert-base-ls960

--- language: en datasets: - librispeech_asr tags: - speech license: apache-2.0 ---

license:apache-2.0
755,825
64

encodec_24khz

--- inference: false ---

746,999
52

wav2vec2-base

--- language: en datasets: - librispeech_asr tags: - speech license: apache-2.0 ---

license:apache-2.0
736,210
111

dinov3-vitb16-pretrain-lvd1689m

--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor

543,859
75

wav2vec2-large-xlsr-53

--- language: multilingual datasets: - common_voice tags: - speech license: apache-2.0 ---

license:apache-2.0
520,494
149

VGGT-1B

--- tags: - model_hub_mixin - pytorch_model_hub_mixin license: cc-by-nc-4.0 language: - en pipeline_tag: image-to-3d ---

license:cc-by-nc-4.0
431,386
78

dinov3-vitl16-pretrain-lvd1689m

--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor

420,770
48

m2m100_1.2B

--- language: - multilingual - af - am - ar - ast - az - ba - be - bg - bn - br - bs - ca - ceb - cs - cy - da - de - el - en - es - et - fa - ff - fi - fr - fy - ga - gd - gl - gu - ha - he - hi - hr - ht - hu - hy - id - ig - ilo - is - it - ja - jv - ka - kk - km - kn - ko - lb - lg - ln - lo - lt - lv - mg - mk - ml - mn - mr - ms - my - ne - nl - no - ns - oc - or - pa - pl - ps - pt - ro - ru - sd - si - sk - sl - so - sq - sr - ss - su - sv - sw - ta - th - tl - tn - tr - uk - ur - uz - v

license:mit
419,149
197

dinov2-large

--- license: apache-2.0 tags: - dino - vision ---

license:apache-2.0
402,340
95

esm2_t30_150M_UR50D

--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---

license:mit
352,082
8

wav2vec2-xlsr-53-espeak-cv-ft

--- language: multi-lingual datasets: - common_voice tags: - speech - audio - automatic-speech-recognition - phoneme-recognition widget: - example_title: Librispeech sample 1 src: https://cdn-media.huggingface.co/speech_samples/sample1.flac - example_title: Librispeech sample 2 src: https://cdn-media.huggingface.co/speech_samples/sample2.flac license: apache-2.0 ---

license:apache-2.0
329,911
41

esm2_t12_35M_UR50D

--- license: mit widget: - text: "MQIFVKTLTGKTITLEVEPSTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG" ---

license:mit
315,302
18

fasttext-language-identification

--- license: cc-by-nc-4.0 library_name: fasttext tags: - text-classification - language-identification ---

license:cc-by-nc-4.0
314,038
246

PE-Core-G14-448

--- license: apache-2.0 library_name: perception-encoder pipeline_tag: zero-shot-image-classification ---

license:apache-2.0
298,767
18

sam-vit-base

--- license: apache-2.0 tags: - vision ---

license:apache-2.0
290,683
152

mask2former-swin-large-cityscapes-semantic

--- license: other tags: - vision - image-segmentation datasets: - coco widget: - src: http://images.cocodataset.org/val2017/000000039769.jpg example_title: Cats - src: http://images.cocodataset.org/val2017/000000039770.jpg example_title: Castle ---

277,240
32

nllb-200-distilled-600M

--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k

license:cc-by-nc-4.0
270,358
786

cwm

262,602
231

hubert-large-ls960-ft

--- language: en datasets: - libri-light - librispeech_asr tags: - speech - audio - automatic-speech-recognition - hf-asr-leaderboard license: apache-2.0 model-index: - name: hubert-large-ls960-ft results: - task: name: Automatic Speech Recognition type: automatic-speech-recognition dataset: name: LibriSpeech (clean) type: librispeech_asr config: clean split: test args: language: en metrics: - name: Test WER type: wer value: 1.9 ---

license:apache-2.0
260,208
75

dinov3-vits16-pretrain-lvd1689m

--- extra_gated_fields: First Name: text Last Name: text Date of birth: date_picker Country: country Affiliation: text Job title: type: select options: - Student - Research Graduate - AI researcher - AI developer/engineer - Reporter - Other geo: ip_location By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox extra_gated_description: >- The infor

257,400
43

dinov2-with-registers-base

--- library_name: transformers pipeline_tag: image-feature-extraction license: apache-2.0 tags: - dino - vision inference: false ---

license:apache-2.0
246,513
7

mms-lid-256

--- tags: - mms language: - ab - af - ak - am - ar - as - av - ay - az - ba - bm - be - bn - bi - bo - sh - br - bg - ca - cs - ce - cv - ku - cy - da - de - dv - dz - el - en - eo - et - eu - ee - fo - fa - fj - fi - fr - fy - ff - ga - gl - gn - gu - zh - ht - ha - he - hi - sh - hu - hy - ig - ia - ms - is - it - jv - ja - kn - ka - kk - kr - km - ki - rw - ky - ko - kv - lo - la - lv - ln - lt - lb - lg - mh - ml - mr - ms - mk - mg - mt - mn - mi - my - zh - nl - 'no' - 'no' - ne - ny - oc

license:cc-by-nc-4.0
230,885
14

sam-vit-huge

--- license: apache-2.0 tags: - vision ---

license:apache-2.0
218,072
179

mbart-large-50-many-to-many-mmt

--- language: - multilingual - ar - cs - de - en - es - et - fi - fr - gu - hi - it - ja - kk - ko - lt - lv - my - ne - nl - ro - ru - si - tr - vi - zh - af - az - bn - fa - he - hr - id - ka - km - mk - ml - mn - mr - pl - ps - pt - sv - sw - ta - te - th - tl - uk - ur - xh - gl - sl tags: - mbart-50 pipeline_tag: translation ---

216,046
396

sam2-hiera-large

--- license: apache-2.0 pipeline_tag: mask-generation library_name: transformers ---

license:apache-2.0
196,345
110

dinov2-giant

license:apache-2.0
172,906
52

dino-vitb16

license:apache-2.0
166,559
111

wav2vec2-xls-r-300m

--- language: - multilingual - ab - af - sq - am - ar - hy - as - az - ba - eu - be - bn - bs - br - bg - my - yue - ca - ceb - km - zh - cv - hr - cs - da - dv - nl - en - eo - et - fo - fi - fr - gl - lg - ka - de - el - gn - gu - ht - cnh - ha - haw - he - hi - hu - is - id - ia - ga - it - ja - jv - kb - kn - kk - rw - ky - ko - ku - lo - la - lv - ln - lt - lm - mk - mg - ms - ml - mt - gv - mi - mr - mn - ne - no - nn - oc - or - ps - fa - pl - pt - pa - ro - rm - rm - ru - sah - sa - sco

license:apache-2.0
142,955
107

wav2vec2-large-robust-ft-libri-960h

This model is a fine-tuned version of the wav2vec2-large-robust model. It has been pretrained on:

- Libri-Light: open-source audio books from the LibriVox project; clean, read-out audio data
- CommonVoice: crowd-sourced audio data; read-out text snippets
- Switchboard: telephone speech corpus; noisy telephone data
- Fisher: conversational telephone speech; noisy telephone data

When using the model, make sure that your speech input is also sampled at 16 kHz.

Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Abstract: Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at this https URL.

The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files, the model can be used as a standalone acoustic model.
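Since the model expects 16 kHz input, audio recorded at other rates (e.g. 44.1 kHz) must be resampled first. A naive linear-interpolation sketch of that step (in practice a proper resampler from torchaudio or librosa is preferable, as linear interpolation does not band-limit the signal):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16_000) -> np.ndarray:
    """Resample a 1-D signal by linear interpolation onto the new sample grid."""
    duration = len(audio) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.arange(len(audio)) / sr_in    # original sample times (seconds)
    t_out = np.arange(n_out) / sr_out       # target sample times (seconds)
    return np.interp(t_out, t_in, audio)

# One second of a 440 Hz tone at 44.1 kHz, brought to the 16 kHz the model expects.
tone = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
resampled = resample_linear(tone, 44_100)
```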

license:apache-2.0
136,364
15

sam-vit-large

license:apache-2.0
131,946
29

dinov3-vith16plus-pretrain-lvd1689m

129,599
29

convnextv2-tiny-22k-224

license:apache-2.0
127,872
2

sam2.1-hiera-large

license:apache-2.0
122,531
111

opt-350m

OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI.

Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team.

To quote the first two paragraphs of the official paper:

> Large language models trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning. While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers' ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

> We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models as GPT-3; as such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper.

The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`.

As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased:

> Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models.

Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model.

The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents:

- BookCorpus, which consists of more than 10K unpublished books,
- CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas,
- The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included,
- the Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021),
- CCNewsV2, containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b).

The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset's size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg".

The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.

112,793
148

opt-1.3b

OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. 
As such, it was pretrained using the self-supervised causal language modedling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use the top-k sampling, please set `dosample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral the model is strongly biased : > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. 
- Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content, as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like "Chapter One" or "This ebook by Project Gutenberg". The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly 33 days of continuous training.
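The generation behavior described above can be sketched with the `text-generation` pipeline. This is a minimal sketch, not from the card itself: the 350M checkpoint is chosen here only for speed, and any OPT size works the same way.

```python
from transformers import pipeline, set_seed

# Load an OPT checkpoint (350M chosen for illustration; swap in any OPT size).
generator = pipeline("text-generation", model="facebook/opt-350m")

# Default: deterministic greedy decoding.
print(generator("Hello, I am a", max_new_tokens=20)[0]["generated_text"])

# Top-k sampling is enabled by setting do_sample=True.
set_seed(32)
print(generator("Hello, I am a", do_sample=True, top_k=50, max_new_tokens=20)[0]["generated_text"])
```

Note that with sampling enabled, repeated calls produce different continuations unless the seed is fixed.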

108,406
180

vjepa2-vitl-fpc64-256

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have the latest transformers installed. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or to serve as a video encoder for VLMs. To load a video, sample the number of frames according to the model; for this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.

license:mit
106,308
160

mask2former-swin-large-mapillary-vistas-semantic

Mask2Former model trained on Mapillary Vistas semantic segmentation (large-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for semantic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.
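A hedged sketch of running this checkpoint and post-processing the mask predictions into a per-pixel segmentation map (the COCO sample image is only for illustration):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

model_id = "facebook/mask2former-swin-large-mapillary-vistas-semantic"
processor = AutoImageProcessor.from_pretrained(model_id)
model = Mask2FormerForUniversalSegmentation.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Combine the predicted masks and class logits into one label per pixel,
# upsampled back to the original image resolution (height, width).
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(semantic_map.shape)
```

Each entry of `semantic_map` is a class id that can be looked up in `model.config.id2label`.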

97,408
6

nllb-200-distilled-1.3B

--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k

license:cc-by-nc-4.0
95,811
134

PE-Core-L14-336

license:apache-2.0
86,600
47

detr-resnet-50-panoptic

license:apache-2.0
84,918
138

musicgen-small

---
inference: true
tags:
- musicgen
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
widget:
- text: "a funky house with 80s hip hop vibes"
  example_title: "Prompt 1"
- text: "a chill song with influences from lofi, chillstep and downtempo"
  example_title: "Prompt 2"
- text: "a catchy beat for a podcast intro"
  example_title: "Prompt 3"
---

license:cc-by-nc-4.0
83,955
462

wav2vec2-large-es-voxpopuli

Facebook's Wav2Vec2 large model pretrained on the es unlabeled subset of the VoxPopuli corpus. Paper: VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation Authors: Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, Emmanuel Dupoux from Facebook AI See the official website for more information here. Please refer to this blog on how to fine-tune this model on a specific language. Note that you should replace `"facebook/wav2vec2-large-xlsr-53"` with this checkpoint for fine-tuning.

license:cc-by-nc-4.0
78,717
1

encodec_32khz

77,551
19

wav2vec2-conformer-rope-large-960h-ft

license:apache-2.0
76,135
10

dpr-ctx_encoder-single-nq-base

license:cc-by-nc-4.0
68,152
25

mask2former-swin-tiny-coco-instance

Mask2Former model trained on COCO instance segmentation (tiny-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for instance segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.

67,819
11

rag-sequence-nq

license:apache-2.0
66,713
43

chameleon-7b

64,315
193

metaclip-b32-400m

license:cc-by-nc-4.0
64,002
46

dinov3-vit7b16-pretrain-sat493m

62,211
23

dpr-question_encoder-single-nq-base

license:cc-by-nc-4.0
61,513
34

wav2vec2-large-960h

license:apache-2.0
59,594
32

contriever-msmarco

58,128
31

nllb-200-3.3B

--- language: - ace - acm - acq - aeb - af - ajp - ak - als - am - apc - ar - ars - ary - arz - as - ast - awa - ayr - azb - azj - ba - bm - ban - be - bem - bn - bho - bjn - bo - bs - bug - bg - ca - ceb - cs - cjk - ckb - crh - cy - da - de - dik - dyu - dz - el - en - eo - et - eu - ee - fo - fj - fi - fon - fr - fur - fuv - gaz - gd - ga - gl - gn - gu - ht - ha - he - hi - hne - hr - hu - hy - ig - ilo - id - is - it - jv - ja - kab - kac - kam - kn - ks - ka - kk - kbp - kea - khk - km - k

license:cc-by-nc-4.0
57,020
384

wav2vec2-large-960h-lv60-self

The large model pretrained and fine-tuned on 960 hours of Libri-Light and Librispeech on 16kHz sampled speech audio. The model was trained with a self-training objective. When using the model make sure that your speech input is also sampled at 16kHz. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows: This code snippet shows how to evaluate facebook/wav2vec2-large-960h-lv60-self on LibriSpeech's "clean" and "other" test data.

license:apache-2.0
51,525
155

detr-resnet-101

license:apache-2.0
47,281
124

bart-large

license:apache-2.0
47,181
198

vit-mae-base

license:apache-2.0
44,303
38

mms-1b-all

license:cc-by-nc-4.0
42,688
162

seamless-m4t-v2-large

SeamlessM4T is our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages. SeamlessM4T models support the tasks of: - Speech-to-speech translation (S2ST) - Speech-to-text translation (S2TT) - Text-to-speech translation (T2ST) - Text-to-text translation (T2TT) - Automatic speech recognition (ASR). SeamlessM4T models support: - 🎤 101 languages for speech input. - 💬 96 languages for text input/output. - 🔊 35 languages for speech output. 🌟 We are releasing SeamlessM4T v2, an updated version with our novel UnitY2 architecture. This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks. The v2 version of SeamlessM4T is a multitask adaptation of our novel UnitY2 architecture. UnitY2, with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, considerably improves over SeamlessM4T v1 in quality and inference speed. SeamlessM4T v2 is also supported by 🤗 Transformers; more on it in the dedicated section below. SeamlessM4T models:

| Model Name | #params | checkpoint | metrics |
| --- | --- | --- | --- |
| SeamlessM4T-Large v2 | 2.3B | checkpoint | metrics |
| SeamlessM4T-Large (v1) | 2.3B | checkpoint | metrics |
| SeamlessM4T-Medium (v1) | 1.2B | checkpoint | metrics |

We provide the extensive evaluation results of SeamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above. The evaluation data ids for FLEURS, CoVoST2 and CVSS-C can be found here. Evaluating SeamlessM4T models: to reproduce our results or to evaluate using the same metrics over your own test sets, please check out the Evaluation README here. 
Finetuning SeamlessM4T models Please check out the Finetuning README here. SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started: 1. First install the 🤗 Transformers library from main and sentencepiece: 2. Run the following Python code to generate speech samples. Here the target language is Russian: 3. Listen to the audio samples either in an ipynb notebook: Or save them as a `.wav` file using a third-party library, e.g. `scipy`: For more details on using the SeamlessM4T model for inference using the 🤗 Transformers library, refer to the SeamlessM4T v2 docs or to this hands-on Google Colab. Listed below, are the languages supported by SeamlessM4T-large (v1/v2). The `source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`). The `target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`). | code | language | script | Source | Target | | ---- | ---------------------- | ---------- | ------ | ------ | | afr | Afrikaans | Latn | Sp, Tx | Tx | | amh | Amharic | Ethi | Sp, Tx | Tx | | arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx | | ary | Moroccan Arabic | Arab | Sp, Tx | Tx | | arz | Egyptian Arabic | Arab | Sp, Tx | Tx | | asm | Assamese | Beng | Sp, Tx | Tx | | ast | Asturian | Latn | Sp | \-- | | azj | North Azerbaijani | Latn | Sp, Tx | Tx | | bel | Belarusian | Cyrl | Sp, Tx | Tx | | ben | Bengali | Beng | Sp, Tx | Sp, Tx | | bos | Bosnian | Latn | Sp, Tx | Tx | | bul | Bulgarian | Cyrl | Sp, Tx | Tx | | cat | Catalan | Latn | Sp, Tx | Sp, Tx | | ceb | Cebuano | Latn | Sp, Tx | Tx | | ces | Czech | Latn | Sp, Tx | Sp, Tx | | ckb | Central Kurdish | Arab | Sp, Tx | Tx | | cmn | Mandarin Chinese | Hans | Sp, Tx | Sp, Tx | | cmnHant | Mandarin Chinese | Hant | Sp, Tx | Sp, Tx | | cym | Welsh | Latn | Sp, Tx | Sp, Tx | | dan | Danish | Latn | Sp, Tx | Sp, Tx | | deu | German | Latn | Sp, Tx | Sp, Tx | | ell | 
Greek | Grek | Sp, Tx | Tx | | eng | English | Latn | Sp, Tx | Sp, Tx | | est | Estonian | Latn | Sp, Tx | Sp, Tx | | eus | Basque | Latn | Sp, Tx | Tx | | fin | Finnish | Latn | Sp, Tx | Sp, Tx | | fra | French | Latn | Sp, Tx | Sp, Tx | | fuv | Nigerian Fulfulde | Latn | Sp, Tx | Tx | | gaz | West Central Oromo | Latn | Sp, Tx | Tx | | gle | Irish | Latn | Sp, Tx | Tx | | glg | Galician | Latn | Sp, Tx | Tx | | guj | Gujarati | Gujr | Sp, Tx | Tx | | heb | Hebrew | Hebr | Sp, Tx | Tx | | hin | Hindi | Deva | Sp, Tx | Sp, Tx | | hrv | Croatian | Latn | Sp, Tx | Tx | | hun | Hungarian | Latn | Sp, Tx | Tx | | hye | Armenian | Armn | Sp, Tx | Tx | | ibo | Igbo | Latn | Sp, Tx | Tx | | ind | Indonesian | Latn | Sp, Tx | Sp, Tx | | isl | Icelandic | Latn | Sp, Tx | Tx | | ita | Italian | Latn | Sp, Tx | Sp, Tx | | jav | Javanese | Latn | Sp, Tx | Tx | | jpn | Japanese | Jpan | Sp, Tx | Sp, Tx | | kam | Kamba | Latn | Sp | \-- | | kan | Kannada | Knda | Sp, Tx | Tx | | kat | Georgian | Geor | Sp, Tx | Tx | | kaz | Kazakh | Cyrl | Sp, Tx | Tx | | kea | Kabuverdianu | Latn | Sp | \-- | | khk | Halh Mongolian | Cyrl | Sp, Tx | Tx | | khm | Khmer | Khmr | Sp, Tx | Tx | | kir | Kyrgyz | Cyrl | Sp, Tx | Tx | | kor | Korean | Kore | Sp, Tx | Sp, Tx | | lao | Lao | Laoo | Sp, Tx | Tx | | lit | Lithuanian | Latn | Sp, Tx | Tx | | ltz | Luxembourgish | Latn | Sp | \-- | | lug | Ganda | Latn | Sp, Tx | Tx | | luo | Luo | Latn | Sp, Tx | Tx | | lvs | Standard Latvian | Latn | Sp, Tx | Tx | | mai | Maithili | Deva | Sp, Tx | Tx | | mal | Malayalam | Mlym | Sp, Tx | Tx | | mar | Marathi | Deva | Sp, Tx | Tx | | mkd | Macedonian | Cyrl | Sp, Tx | Tx | | mlt | Maltese | Latn | Sp, Tx | Sp, Tx | | mni | Meitei | Beng | Sp, Tx | Tx | | mya | Burmese | Mymr | Sp, Tx | Tx | | nld | Dutch | Latn | Sp, Tx | Sp, Tx | | nno | Norwegian Nynorsk | Latn | Sp, Tx | Tx | | nob | Norwegian Bokmål | Latn | Sp, Tx | Tx | | npi | Nepali | Deva | Sp, Tx | Tx | | nya | Nyanja | Latn | Sp, Tx | Tx | | 
oci | Occitan | Latn | Sp | \-- | | ory | Odia | Orya | Sp, Tx | Tx | | pan | Punjabi | Guru | Sp, Tx | Tx | | pbt | Southern Pashto | Arab | Sp, Tx | Tx | | pes | Western Persian | Arab | Sp, Tx | Sp, Tx | | pol | Polish | Latn | Sp, Tx | Sp, Tx | | por | Portuguese | Latn | Sp, Tx | Sp, Tx | | ron | Romanian | Latn | Sp, Tx | Sp, Tx | | rus | Russian | Cyrl | Sp, Tx | Sp, Tx | | slk | Slovak | Latn | Sp, Tx | Sp, Tx | | slv | Slovenian | Latn | Sp, Tx | Tx | | sna | Shona | Latn | Sp, Tx | Tx | | snd | Sindhi | Arab | Sp, Tx | Tx | | som | Somali | Latn | Sp, Tx | Tx | | spa | Spanish | Latn | Sp, Tx | Sp, Tx | | srp | Serbian | Cyrl | Sp, Tx | Tx | | swe | Swedish | Latn | Sp, Tx | Sp, Tx | | swh | Swahili | Latn | Sp, Tx | Sp, Tx | | tam | Tamil | Taml | Sp, Tx | Tx | | tel | Telugu | Telu | Sp, Tx | Sp, Tx | | tgk | Tajik | Cyrl | Sp, Tx | Tx | | tgl | Tagalog | Latn | Sp, Tx | Sp, Tx | | tha | Thai | Thai | Sp, Tx | Sp, Tx | | tur | Turkish | Latn | Sp, Tx | Sp, Tx | | ukr | Ukrainian | Cyrl | Sp, Tx | Sp, Tx | | urd | Urdu | Arab | Sp, Tx | Sp, Tx | | uzn | Northern Uzbek | Latn | Sp, Tx | Sp, Tx | | vie | Vietnamese | Latn | Sp, Tx | Sp, Tx | | xho | Xhosa | Latn | Sp | \-- | | yor | Yoruba | Latn | Sp, Tx | Tx | | yue | Cantonese | Hant | Sp, Tx | Tx | | zlm | Colloquial Malay | Latn | Sp | \-- | | zsm | Standard Malay | Latn | Tx | Tx | | zul | Zulu | Latn | Sp, Tx | Tx | Note that seamlessM4T-medium supports 200 languages in the text modality, and is based on NLLB-200 (see full list in asset card)

license:cc-by-nc-4.0
39,727
917

map-anything-apache

---
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- computer-vision
- 3d-reconstruction
- multi-view-stereo
- depth-estimation
- camera-pose
- covisibility
- mapanything
license: apache-2.0
language:
- en
pipeline_tag: image-to-3d
---

license:apache-2.0
37,781
31

audiobox-aesthetics

license:cc-by-4.0
37,368
38

mask2former-swin-large-ade-semantic

36,788
17

wav2vec2-lv-60-espeak-cv-ft

Wav2Vec2-Large-LV60 finetuned on multi-lingual Common Voice This checkpoint leverages the pretrained checkpoint wav2vec2-large-lv60 and is fine-tuned on CommonVoice to recognize phonetic labels in multiple languages. When using the model make sure that your speech input is sampled at 16kHz. Note that the model outputs a string of phonetic labels. A dictionary mapping phonetic labels to words has to be used to map the phonetic output labels to output words. Paper: Simple and Effective Zero-shot Cross-lingual Phoneme Recognition Abstract Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows:

license:apache-2.0
34,147
53

timesformer-base-finetuned-k400

TimeSformer (base-sized model, fine-tuned on Kinetics-400) TimeSformer model pre-trained on Kinetics-400. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Bertasius et al. and first released in this repository. Disclaimer: The team releasing TimeSformer did not write a model card for this model so this model card has been written by fcakyon. You can use the raw model for video classification into one of the 400 possible Kinetics-400 labels. For more code examples, we refer to the documentation.

license:cc-by-nc-4.0
33,323
42

opt-6.7b

31,997
118

deit-tiny-patch16-224

license:apache-2.0
27,883
11

metaclip-h14-fullcc2.5b

MetaCLIP model, huge-sized version, patch resolution 14. MetaCLIP model applied to 2.5 billion data points of CommonCrawl (CC). It was introduced in the paper Demystifying CLIP Data by Xu et al. and first released in this repository. Disclaimer: The team releasing MetaCLIP did not write a model card for this model so this model card has been written by the Hugging Face team. The Demystifying CLIP Data paper aims to reveal CLIP’s method around training data curation. OpenAI never open-sourced code regarding their data preparation pipeline. CLIP high-level overview, taken from the CLIP paper. You can use the raw model for linking images with text in a shared embedding space. This enables things like zero-shot image classification, text-based image retrieval, image-based text retrieval, etc. We refer to the docs. Just replace the names of the models on the hub.
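Since MetaCLIP checkpoints load through the standard CLIP classes, zero-shot classification can be sketched as below (the candidate labels and sample image are illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "facebook/metaclip-h14-fullcc2.5b"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against each candidate caption in the shared embedding space.
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image, return_tensors="pt", padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```

`logits_per_image` contains image-text similarity scores; the softmax turns them into zero-shot label probabilities.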

license:cc-by-nc-4.0
26,648
44

mask2former-swin-base-coco-panoptic

Mask2Former model trained on COCO panoptic segmentation (base-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer, both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for panoptic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.

26,505
16

deit-base-patch16-224

license:apache-2.0
26,377
14

convnextv2-base-22k-224

ConvNeXt V2 model pretrained using the FCMAE framework and fine-tuned on the ImageNet-22K dataset at resolution 224x224. It was introduced in the paper ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Woo et al. and first released in this repository. Disclaimer: The team releasing ConvNeXT V2 did not write a model card for this model so this model card has been written by the Hugging Face team. ConvNeXt V2 is a pure convolutional model (ConvNet) that introduces a fully convolutional masked autoencoder framework (FCMAE) and a new Global Response Normalization (GRN) layer to ConvNeXt. ConvNeXt V2 significantly improves the performance of pure ConvNets on various recognition benchmarks. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.

license:apache-2.0
25,462
5

sam2.1-hiera-tiny

Repository for SAM 2: Segment Anything in Images and Videos, a foundation model towards solving promptable visual segmentation in images and videos from FAIR. See the SAM 2 paper for more information. The official code is publicly released in this repo. SAM2 can be used for automatic mask generation to segment all objects in an image using the `mask-generation` pipeline: You can segment objects by providing a single point click on the object you want to segment: You can provide multiple points to refine the segmentation: SAM2 also supports bounding box inputs for segmentation: Process multiple images simultaneously for improved efficiency: Segment multiple objects within each image using batch inference: Batched Images with Batched Objects and Multiple Points Handle complex batch scenarios with multiple points per object: SAM2 can use masks from previous predictions as input to refine segmentation: SAM2's key strength is its ability to track objects across video frames. Here's how to use it for video segmentation: Track multiple objects simultaneously across video frames: You can add additional clicks on any frame to refine the tracking: For real-time applications, SAM2 supports processing video frames as they arrive: Track multiple objects simultaneously in video by adding them all at once: To cite the paper, model, or software, please use the below:
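The `mask-generation` pipeline the card mentions can be sketched as below; this assumes a transformers version with SAM2 support, and the sample image and `points_per_batch` value are illustrative:

```python
import requests
from PIL import Image
from transformers import pipeline

# Automatic mask generation: segment everything in the image.
generator = pipeline("mask-generation", model="facebook/sam2.1-hiera-tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# points_per_batch trades memory for speed when scoring the point grid.
outputs = generator(image, points_per_batch=64)
print(len(outputs["masks"]), "masks found")
```

Each entry in `outputs["masks"]` is a binary mask with a matching confidence in `outputs["scores"]`.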

license:apache-2.0
25,431
15

esm2_t48_15B_UR50D

license:mit
25,040
26

mms-lid-126

Massively Multilingual Speech (MMS) - Finetuned LID This checkpoint is a model fine-tuned for speech language identification (LID) and part of Facebook's Massive Multilingual Speech project. This checkpoint is based on the Wav2Vec2 architecture and classifies raw audio input to a probability distribution over 126 output classes (each class representing a language). The checkpoint consists of 1 billion parameters and has been fine-tuned from facebook/mms-1b on 126 languages. - Example - Supported Languages - Model details - Additional links This MMS checkpoint can be used with Transformers to identify the spoken language of an audio. It can recognize the following 126 languages. First, we install transformers and some other libraries. Note: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version is not yet available on PyPI make sure to install `transformers` from source: Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16 kHz. Now we process the audio data, pass the processed audio data to the model to classify it into a language, just like we usually do for Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition To see all the supported languages of a checkpoint, you can print out the language ids as follows: For more details about the architecture, please have a look at the official docs. This model supports 126 languages. Click the following to toggle all supported languages of this checkpoint in ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. 
- ara - cmn - eng - spa - fra - mlg - swe - por - vie - ful - sun - asm - ben - zlm - kor - ind - hin - tuk - urd - aze - slv - mon - hau - tel - swh - bod - rus - tur - heb - mar - som - tgl - tat - tha - cat - ron - mal - bel - pol - yor - nld - bul - hat - afr - isl - amh - tam - hun - hrv - lit - cym - fas - mkd - ell - bos - deu - sqi - jav - nob - uzb - snd - lat - nya - grn - mya - orm - lin - hye - yue - pan - jpn - kaz - npi - kat - guj - kan - tgk - ukr - ces - lav - bak - khm - fao - glg - ltz - lao - mlt - sin - sna - ita - srp - mri - nno - pus - eus - ory - lug - bre - luo - slk - fin - dan - yid - est - ceb - war - san - kir - oci - wol - haw - kam - umb - xho - epo - zul - ibo - abk - ckb - nso - gle - kea - ast - sco - glv - ina - Developed by: Vineel Pratap et al. - Model type: Multi-Lingual Automatic Speech Recognition model - Language(s): 126 languages, see supported languages - License: CC-BY-NC 4.0 license - Num parameters: 1 billion - Audio sampling rate: 16,000 kHz - Cite as: @article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023} } - Blog post - Transformers documentation. - Paper - GitHub Repository - Other MMS checkpoints - MMS base checkpoints: - facebook/mms-1b - facebook/mms-300m - Official Space

license:cc-by-nc-4.0
25,028
32

convnext-large-224-22k-1k

license:apache-2.0
24,667
3

mask2former-swin-large-coco-panoptic

Mask2Former model trained on COCO panoptic segmentation (large-sized version, Swin backbone). It was introduced in the paper Masked-attention Mask Transformer for Universal Image Segmentation and first released in this repository. Disclaimer: The team releasing Mask2Former did not write a model card for this model so this model card has been written by the Hugging Face team. Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA, MaskFormer both in terms of performance an efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance without without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks. You can use this particular checkpoint for panoptic segmentation. See the model hub to look for other fine-tuned versions on a task that interests you. For more code examples, we refer to the documentation.

24,398
32

dinov2-with-registers-large

Vision Transformer (large-sized model) trained using DINOv2, with registers Vision Transformer (ViT) model introduced in the paper Vision Transformers Need Registers by Darcet et al. and first released in this repository. Disclaimer: The team releasing DINOv2 with registers did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet. Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE. The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called "register" tokens), which you only use during pre-training (and throw away afterwards). This results in: - no artifacts - interpretable attention maps - and improved performances. Visualization of attention maps of various models trained with vs. without registers. Taken from the original paper . Note that this model does not include any fine-tuned heads. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for feature extraction. See the model hub to look for fine-tuned versions on a task that interests you.
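Feature extraction with the pre-trained encoder can be sketched as follows; as the card explains, the [CLS] token's last hidden state serves as a whole-image representation (the sample image is illustrative):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

model_id = "facebook/dinov2-with-registers-large"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token 0 is the [CLS] token; use it as a global image embedding,
# e.g. under a linear classification head.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)
```

The remaining tokens (register tokens aside) are per-patch features, useful for dense tasks like segmentation.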

license:apache-2.0
24,255
9

mbart-large-50

mBART-50 is a multilingual Sequence-to-Sequence model pre-trained using the "Multilingual Denoising Pretraining" objective. It was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper. mBART-50 is a multilingual Sequence-to-Sequence model. It was introduced to show that multilingual translation models can be created through multilingual fine-tuning. Instead of fine-tuning on one direction, a pre-trained model is fine-tuned on many directions simultaneously. mBART-50 is created using the original mBART model, extended with an extra 25 languages to support multilingual machine translation models for 50 languages. The pre-training objective is explained below. Multilingual Denoising Pretraining: The model incorporates N languages by concatenating data: `D = {D1, ..., DN }` where each Di is a collection of monolingual documents in language `i`. The source documents are noised using two schemes, first randomly shuffling the original sentences' order, and second a novel in-filling scheme, where spans of text are replaced with a single mask token. The model is then tasked to reconstruct the original text. 35% of each instance's words are masked by randomly sampling a span length according to a Poisson distribution `(λ = 3.5)`. The decoder input is the original text with one position offset. A language id symbol `LID` is used as the initial token to predict the sentence. `mbart-large-50` is a pre-trained model and primarily aimed at being fine-tuned on translation tasks. It can also be fine-tuned on other multilingual sequence-to-sequence tasks. See the model hub to look for fine-tuned versions. As the model is multilingual, it expects the sequences in a different format. A special language id token is used as a prefix in both the source and target text. 
The text format is `[lang_code] X [eos]`, where `X` is the source or target text and `lang_code` is `src_lang_code` for source text and `tgt_lang_code` for target text. `bos` is never used. Once the examples are prepared in this format, the model can be trained like any other sequence-to-sequence model. Languages covered: Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)

license:mit
24,177
162

mms-tts-eng

Massively Multilingual Speech (MMS): English Text-to-Speech This repository contains the English (eng) language text-to-speech (TTS) model checkpoint. This model is part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across a diverse range of languages. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text. The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder. 
Due to the stochastic nature of the duration predictor, the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. For the MMS project, a separate VITS checkpoint is trained on each language. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, first install the latest version of the library: Then, run inference with the following code snippet: The resulting waveform can be saved as a `.wav` file: This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:

license:cc-by-nc-4.0
24,017
160

dino-vitb8

license:apache-2.0
23,915
18

dinov3-convnext-small-pretrain-lvd1689m

23,872
20

dinov3-vits16plus-pretrain-lvd1689m

23,398
8

dinov3-convnext-tiny-pretrain-lvd1689m

21,898
24

blenderbot-400M-distill

license:apache-2.0
21,447
458

wmt19-ru-en

license:apache-2.0
21,258
20

dinov3-vitl16-pretrain-sat493m

20,829
22

convnextv2-tiny-22k-384

ConvNeXt V2 model pretrained using the FCMAE framework and fine-tuned on the ImageNet-22K dataset at resolution 384x384. It was introduced in the paper ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders by Woo et al. and first released in this repository. Disclaimer: The team releasing ConvNeXT V2 did not write a model card for this model so this model card has been written by the Hugging Face team. ConvNeXt V2 is a pure convolutional model (ConvNet) that introduces a fully convolutional masked autoencoder framework (FCMAE) and a new Global Response Normalization (GRN) layer to ConvNeXt. ConvNeXt V2 significantly improves the performance of pure ConvNets on various recognition benchmarks. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.
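The classification example referenced above might look like this (a sketch using the standard image-classification API; the COCO sample image is an arbitrary choice):

```python
from transformers import AutoImageProcessor, ConvNextV2ForImageClassification
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-384")
model = ConvNextV2ForImageClassification.from_pretrained("facebook/convnextv2-tiny-22k-384")

inputs = processor(image, return_tensors="pt")
logits = model(**inputs).logits
# map the highest-scoring logit back to its class name
predicted_label = model.config.id2label[logits.argmax(-1).item()]
print(predicted_label)
```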

license:apache-2.0
20,773
4

map-anything

MapAnything is a simple, end-to-end trained transformer model that directly regresses the factored metric 3D geometry of a scene given various types of modalities as inputs. A single feed-forward model supports over 12 different 3D reconstruction tasks including multi-image sfm, multi-view stereo, monocular metric depth estimation, registration, depth completion and more. If you find our repository useful, please consider giving it a star ⭐ and citing our paper in your work:

license:cc-by-nc-4.0
20,440
74

esm1v_t33_650M_UR90S_1

20,199
5

dinov2-small-imagenet1k-1-layer

Vision Transformer (small-sized model) trained using DINOv2 Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. and first released in this repository. Disclaimer: The team releasing DINOv2 did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion. Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Note that this model does not include any fine-tuned heads. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the model for classifying an image among one of the 1000 ImageNet labels. See the model hub to look for other fine-tuned versions on a task that interests you.
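A minimal classification sketch for this checkpoint (the COCO sample image is an arbitrary choice):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")
model = AutoModelForImageClassification.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
# one logit per ImageNet-1k class
print(model.config.id2label[logits.argmax(-1).item()])
```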

license:apache-2.0
20,196
3

mbart-large-50-one-to-many-mmt

mBART-50 one-to-many multilingual machine translation This model is a fine-tuned checkpoint of mBART-large-50. `mbart-large-50-one-to-many-mmt` is fine-tuned for multilingual machine translation. It was introduced in the Multilingual Translation with Extensible Multilingual Pretraining and Finetuning paper. The model can translate English to the other 49 languages mentioned below. To translate into a target language, the target language id is forced as the first generated token by passing the `forced_bos_token_id` parameter to the `generate` method. See the model hub to look for more fine-tuned versions. Languages covered: Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)
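A translation sketch forcing the target language id as the first generated token (English to Hindi here; the example sentence is arbitrary):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX"
)

article = "The head of the United Nations says there is no military solution in Syria."
inputs = tokenizer(article, return_tensors="pt")
# force the Hindi language id token to be generated first
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("hi_IN")
)
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(translation)
```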

20,178
39

vit-mae-large

Vision Transformer (large-sized model) pre-trained with MAE Vision Transformer (ViT) model pre-trained using the MAE method. It was introduced in the paper Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick and first released in this repository. Disclaimer: The team releasing MAE did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like). Images are presented to the model as a sequence of fixed-size patches. During pre-training, one randomly masks out a high portion (75%) of the image patches. First, the encoder is used to encode the visual patches. Next, a learnable (shared) mask token is added at the positions of the masked patches. The decoder takes the encoded visual patches and mask tokens as input and reconstructs raw pixel values for the masked positions. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you.
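A feature-extraction sketch for this checkpoint (the COCO sample image is an arbitrary choice; note that, by design, the encoder only sees the randomly selected unmasked patches):

```python
from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-large")
model = ViTMAEModel.from_pretrained("facebook/vit-mae-large")

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# hidden states cover only the unmasked patches (plus [CLS])
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state.shape)
```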

license:apache-2.0
19,576
10

hubert-large-ll60k

The large model pretrained on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz. Note: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model for speech recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for a more detailed explanation of how to fine-tune the model. Authors: Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed Abstract Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets. 
The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/hubert . See this blog for more information on how to fine-tune the model. Note that the class `Wav2Vec2ForCTC` has to be replaced by `HubertForCTC`.
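Since this checkpoint has no tokenizer, it can only be used as a feature extractor out of the box. A minimal sketch (the silent dummy waveform stands in for real 16 kHz speech):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ll60k")
model = HubertModel.from_pretrained("facebook/hubert-large-ll60k")

# one second of silence as a stand-in for real speech sampled at 16 kHz
speech = np.zeros(16000, dtype=np.float32)
inputs = feature_extractor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
print(hidden_states.shape)
```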

license:apache-2.0
19,328
30

nllb-200-1.3B

license:cc-by-nc-4.0
18,606
68

timesformer-hr-finetuned-k400

TimeSformer (high-resolution variant, fine-tuned on Kinetics-400) TimeSformer model pre-trained on Kinetics-400. It was introduced in the paper TimeSformer: Is Space-Time Attention All You Need for Video Understanding? by Tong et al. and first released in this repository. Disclaimer: The team releasing TimeSformer did not write a model card for this model so this model card has been written by fcakyon. You can use the raw model for video classification into one of the 400 possible Kinetics-400 labels. For more code examples, we refer to the documentation.

license:cc-by-nc-4.0
18,494
3

opt-2.7b

OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. 
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. You can use this model directly with a pipeline for text generation. By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. - Pushshift.io Reddit dataset that was developed in Baumgartner et al. 
(2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet, and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like Chapter One or This ebook by Project Gutenberg. The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly ~33 days of continuous training.
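The text-generation pipeline usage described above might look like this (the prompt is an arbitrary example; a seed is set so sampling is reproducible):

```python
from transformers import pipeline, set_seed

set_seed(32)
# top-k sampling is enabled via do_sample=True
generator = pipeline("text-generation", model="facebook/opt-2.7b", do_sample=True)
output = generator("Hello, I am conscious and")[0]["generated_text"]
print(output)
```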

18,327
86

dinov3-vit7b16-pretrain-lvd1689m

17,582
185

PE-Core-B16-224

license:apache-2.0
17,428
12

vjepa2-vitg-fpc64-384-ssv2

license:mit
16,968
2

sam-audio-judge

16,924
22

xlm-roberta-xl

XLM-RoBERTa-XL model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper Larger-Scale Transformers for Multilingual Masked Language Modeling by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau and first released in this repository. Disclaimer: The team releasing XLM-RoBERTa-XL did not write a model card for this model so this model card has been written by the Hugging Face team. XLM-RoBERTa-XL is an extra large multilingual version of RoBERTa, pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with the Masked language modeling (MLM) objective: taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence. This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa-XL model as inputs. You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. 
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2. You can use this model directly with a pipeline for masked language modeling: Here is how to use this model to get the features of a given text in PyTorch:
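The masked language modeling pipeline usage referenced above might look like this (the example sentence is arbitrary; XLM-RoBERTa uses `<mask>` as its mask token):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="facebook/xlm-roberta-xl")
predictions = unmasker("Europe is a <mask> continent.")
# each prediction carries the filled-in token and its score
print(predictions[0]["token_str"], predictions[0]["score"])
```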

license:mit
16,898
33

metaclip-2-worldwide-huge-quickgelu

license:cc-by-nc-4.0
16,825
8

mask2former-swin-large-coco-instance

16,233
6

dinov2-with-registers-small

license:apache-2.0
16,188
9

metaclip-b16-fullcc2.5b

license:cc-by-nc-4.0
14,929
10

mms-tts-vie

license:cc-by-nc-4.0
14,676
30

mask2former-swin-small-ade-semantic

14,159
8

convnext-base-224

license:apache-2.0
13,997
9

wmt19-en-de

license:apache-2.0
13,633
21

wav2vec2-xlsr-53-phon-cv-ft

13,467
3

dinov3-convnext-base-pretrain-lvd1689m

13,410
8

convnextv2-tiny-1k-224

license:apache-2.0
13,270
6

wav2vec2-large-xlsr-53-italian

license:apache-2.0
13,135
6

vjepa2-vitg-fpc64-256

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities by leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest transformers. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or serve as a video encoder for VLMs. To load a video, sample the number of frames according to the model; for this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```

license:apache-2.0
12,606
20

sam-audio-large

12,478
307

convnextv2-atto-1k-224

license:apache-2.0
11,802
3

sam2.1-hiera-base-plus

license:apache-2.0
11,373
16

blenderbot_small-90M

license:apache-2.0
10,999
53

encodec_48khz

license:mit
10,767
33

nougat-base

Nougat model trained on PDF-to-markdown. It was introduced in the paper Nougat: Neural Optical Understanding for Academic Documents by Blecher et al. and first released in this repository. Disclaim...

license:cc-by-nc-4.0
10,552
179

mask2former-swin-base-ade-semantic

10,398
0

cotracker3

license:cc-by-nc-4.0
10,160
26

dpr-ctx_encoder-multiset-base

license:cc-by-nc-4.0
10,004
5

mms-tts-ind

license:cc-by-nc-4.0
9,956
18

mms-lid-4017

license:cc-by-nc-4.0
9,387
11

convnextv2-base-22k-384

license:apache-2.0
9,133
0

deit-base-distilled-patch16-224

license:apache-2.0
9,111
31

mbart-large-cc25

8,983
69

musicgen-melody-large

license:cc-by-nc-4.0
8,974
32

sam-audio-base

8,972
32

wav2vec2-large-lv60

license:apache-2.0
8,848
12

s2t-small-librispeech-asr

license:mit
8,842
30

xlm-v-base

license:mit
8,684
39

hf-seamless-m4t-medium

license:cc-by-nc-4.0
8,669
31

convnext-tiny-224

license:apache-2.0
8,453
21

bart-large-xsum

license:mit
8,366
36

PE-Lang-G14-448

license:apache-2.0
8,185
13

opt-30b

OPT was first introduced in Open Pre-trained Transformer Language Models and first released in metaseq's repository on May 3rd 2022 by Meta AI. Disclaimer: The team releasing OPT wrote an official model card, which is available in Appendix D of the paper. Content from this model card has been written by the Hugging Face team. To quote the first two paragraphs of the official paper > Large language models trained on massive text collections have shown surprising emergent > capabilities to generate text and perform zero- and few-shot learning. While in some cases the public > can interact with these models through paid APIs, full model access is currently limited to only a > few highly resourced labs. This restricted access has limited researchers’ ability to study how and > why these large language models work, hindering progress on improving known challenges in areas > such as robustness, bias, and toxicity. > We present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M > to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match > the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data > collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and > to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the > collective research community as a whole, which is only possible when models are available for study. OPT was predominantly pretrained with English text, but a small amount of non-English data is still present within the training corpus via CommonCrawl. The model was pretrained using a causal language modeling (CLM) objective. OPT belongs to the same family of decoder-only models like GPT-3. 
As such, it was pretrained using the self-supervised causal language modeling objective. For evaluation, OPT follows GPT-3 by using their prompts and overall experimental setup. For more details, please read the official paper. Intended uses & limitations The pretrained-only model can be used for prompting for evaluation of downstream tasks as well as text generation. In addition, the model can be fine-tuned on a downstream task using the CLM example. For all other OPT checkpoints, please have a look at the model hub. For large OPT models, such as this one, it is not recommended to make use of the `text-generation` pipeline because one should load the model in half-precision to accelerate generation and optimize memory consumption on GPU. It is recommended to directly call the `generate` method as follows: By default, generation is deterministic. In order to use top-k sampling, please set `do_sample` to `True`. As mentioned in Meta AI's model card, given that the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, the model is strongly biased: > Like other large language models for which the diversity (or lack thereof) of training > data induces downstream impact on the quality of our model, OPT-175B has limitations in terms > of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and > hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern > large language models. Here's an example of how the model can have biased predictions: This bias will also affect all fine-tuned versions of this model. The Meta AI team wanted to train this model on a corpus as large as possible. 
It is composed of the union of the following 5 filtered datasets of textual documents: - BookCorpus, which consists of more than 10K unpublished books, - CC-Stories, which contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas, - The Pile, from which Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews were included. - Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021) - CCNewsV2 containing an updated version of the English portion of the CommonCrawl News dataset that was used in RoBERTa (Liu et al., 2019b) The final training data contains 180B tokens corresponding to 800GB of data. The validation split was made of 200MB of the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus. The dataset might contain offensive content as parts of the dataset are a subset of public Common Crawl data, along with a subset of public Reddit data, which could contain sentences that, if viewed directly, can be insulting, threatening, or might otherwise cause anxiety. The dataset was collected from the internet, and went through classic data processing algorithms and re-formatting practices, including removing repetitive/non-informative text like Chapter One or This ebook by Project Gutenberg. The texts are tokenized using the GPT2 byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50272. The inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. The training duration was roughly ~33 days of continuous training.
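The half-precision `generate` call described above might look like this (a sketch; `device_map="auto"` assumes the `accelerate` package is installed, and the prompt is arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load in float16 to halve memory; accelerate dispatches layers across devices
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b")

inputs = tokenizer("Hello, I am conscious and", return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=30)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```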

7,956
136

mcontriever-msmarco

7,945
9

nougat-small

license:cc-by-4.0
7,886
30

opt-13b

7,586
66

dinov2-giant-imagenet1k-1-layer

license:apache-2.0
7,351
3

wmt19-de-en

license:apache-2.0
7,213
20

dino-vits8

license:apache-2.0
7,089
16

data2vec-audio-base-960h

license:apache-2.0
7,089
12

xlm-roberta-xxl

license:mit
6,989
16

timesformer-base-finetuned-k600

license:cc-by-nc-4.0
6,860
12

vjepa2-vitl-fpc16-256-ssv2

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities by leveraging data and model sizes at scale. The code is released in this repository. 💡 This is the V-JEPA 2 ViT-L 256 model with a video classification head pretrained on the Something-Something-V2 dataset. To run the V-JEPA 2 model, ensure you have installed the latest transformers:
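The install step referenced above might look like this (assuming a pip-based environment):

```shell
pip install --upgrade transformers
```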

license:mit
6,848
6

musicgen-large

MusicGen is a text-to-music model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It is a single stage auto-regressive Transformer model trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike existing methods, like MusicLM, MusicGen doesn't require a self-supervised semantic representation, and it generates all 4 codebooks in one pass. By introducing a small delay between the codebooks, we show we can predict them in parallel, thus having only 50 auto-regressive steps per second of audio. MusicGen was published in Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez. Four checkpoints are released: - small - medium - large (this checkpoint) - melody You can run MusicGen locally with the 🤗 Transformers library from version 4.31.0 onwards. 1. First install the 🤗 Transformers library and scipy: 2. Run inference via the `Text-to-Audio` (TTA) pipeline. You can infer the MusicGen model via the TTA pipeline in just a few lines of code! 3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 32 kHz audio waveform for more fine-grained control. 4. Listen to the audio samples either in an ipynb notebook: Or save them as a `.wav` file using a third-party library, e.g. `scipy`: For more details on using the MusicGen model for inference using the 🤗 Transformers library, refer to the MusicGen docs. You can also run MusicGen locally through the original Audiocraft library: Organization developing the model: The FAIR team of Meta AI. Model date: MusicGen was trained between April 2023 and May 2023. Model type: MusicGen consists of an EnCodec model for audio tokenization, an auto-regressive language model based on the transformer architecture for music modeling. 
The model comes in different sizes: 300M, 1.5B and 3.3B parameters; and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation. Paper or resources for more information: More information can be found in the paper Simple and Controllable Music Generation. Where to send questions or comments about the model: Questions and comments about MusicGen can be sent via the Github repository of the project, or by opening an issue. Intended use Primary intended use: The primary use of MusicGen is research on AI-based music generation, including: - Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science - Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs Primary intended users: The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models. Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
Models performance measures: We used the following objective measure to evaluate the model on a standard music benchmark: - Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish) - Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST) - CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model Additionally, we run qualitative studies with human participants, evaluating the performance of the model with the following axes: - Overall quality of the music samples; - Text relevance to the provided text input; - Adherence to the melody for melody-guided music generation. More details on performance measures and human studies can be found in the paper. The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set, with no artist overlap with the training set. The model was trained on licensed data using the following sources: the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection. See the paper for more details about the training set and corresponding preprocessing. Below are the objective metrics obtained on MusicCaps with the released model. Note that for the publicly released models, we had all the datasets go through a state-of-the-art music source separation method, namely using the open source Hybrid Transformer for Music Source Separation (HT-Demucs), in order to keep only the instrumental part. This explains the difference in objective metrics with the models used in the paper. 
| Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|---|---|---|---|---|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |

More information can be found in the paper Simple and Controllable Music Generation, in the Results section. Data: The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data; we believe that scaling the model on larger datasets can further improve its performance. Mitigations: Vocals have been removed from the data source using corresponding tags, and then using a state-of-the-art music source separation method, namely the open-source Hybrid Transformer for Music Source Separation (HT-Demucs). - The model is not able to generate realistic vocals. - The model has been trained with English descriptions and will not perform as well in other languages. - The model does not perform equally well for all music styles and cultures. - The model sometimes generates the end of songs, collapsing to silence. - It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results. Biases: The source of data is potentially lacking diversity and all music cultures are not equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exist. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.
Risks and harms: Biases and limitations of the model may lead to generation of samples that may be considered biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will allow broadening the application to new and more representative data. Use cases: Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.

license:cc-by-nc-4.0
6,811
489

sam-audio-small

6,782
61

wmt19-en-ru

license:apache-2.0
6,779
24

mms-tts-ewe

license:cc-by-nc-4.0
6,722
3

xmod-base

license:mit
6,593
17

mask2former-swin-small-coco-instance

6,492
9

mask2former-swin-large-mapillary-vistas-panoptic

6,203
2

mask2former-swin-tiny-ade-semantic

6,129
2

Perception-LM-1B

6,118
39

audiogen-medium

license:cc-by-nc-4.0
5,870
136

mms-1b-l1107

license:cc-by-nc-4.0
5,838
11

blenderbot-90M

license:apache-2.0
5,736
3

mbart-large-en-ro

license:mit
5,706
2

mbart-large-50-many-to-one-mmt

5,539
67

convnext-base-224-22k

license:apache-2.0
5,403
9

dpr-question_encoder-multiset-base

license:cc-by-nc-4.0
5,350
4

vjepa2-vith-fpc64-256

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest version of transformers. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs. To load a video, sample the number of frames according to the model. For this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.

```
@techreport{assran2025vjepa2,
  title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mahmoud and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and Zholus, Artem and Arnaud, Sergio and Gejji, Abha and Martin, Ada and Robert Hogan, Francois and Dugas, Daniel and Bojanowski, Piotr and Khalidov, Vasil and Labatut, Patrick and Massa, Francisco and Szafraniec, Marc and Krishnakumar, Kapil and Li, Yong and Ma, Xiaodong and Chandar, Sarath and Meier, Franziska and LeCun, Yann and Rabbat, Michael and Ballas, Nicolas},
  institution={FAIR at Meta},
  year={2025}
}
```

license:mit
5,338
12

mask2former-swin-tiny-cityscapes-semantic

5,232
3

xglm-564M

license:mit
5,170
53

mms-tts-tha

license:cc-by-nc-4.0
5,136
10

wav2vec2-large

The base model pretrained on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task, like Automatic Speech Recognition. Check out this blog for more information. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli Abstract: We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. See this notebook for more information on how to fine-tune the model.
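Since this checkpoint has no CTC head, one common starting point before fine-tuning is extracting frame-level representations from the pretrained encoder. A hedged sketch (the availability of a preprocessor config on `facebook/wav2vec2-large` is an assumption):

```python
# Hedged sketch: frame-level features from the pretrained (not fine-tuned)
# wav2vec 2.0 encoder, e.g. as input to a downstream head.
import numpy as np


def peak_normalize(audio: np.ndarray) -> np.ndarray:
    """Scale a waveform to peak amplitude 1.0 (float32); a simple stand-in
    for proper preprocessing. Input is expected to be 16 kHz mono."""
    audio = audio.astype(np.float32)
    peak = np.abs(audio).max()
    return audio / peak if peak > 0 else audio


def main() -> None:
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")

    speech = peak_normalize(np.random.randn(16000))  # 1 s of placeholder audio
    inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_size)
    print(hidden.shape)


# main()  # uncomment to run (downloads the checkpoint)
```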

license:apache-2.0
5,131
9

mms-tts-spa

license:cc-by-nc-4.0
4,958
21

mms-tts-tam

license:cc-by-nc-4.0
4,950
13

deit-small-patch16-224

license:apache-2.0
4,942
9

sam3

4,858
297

rag-token-nq

license:apache-2.0
4,722
175

MobileLLM-Pro

llama4_text
4,712
143

dragon-plus-context-encoder

4,695
39

dragon-plus-query-encoder

4,614
20

blenderbot-3B

license:apache-2.0
4,612
155

dinov3-convnext-large-pretrain-lvd1689m

4,585
14

mms-tts-tgl

license:cc-by-nc-4.0
4,535
4

esm1b_t33_650M_UR50S

license:mit
4,464
21

webssl-mae300m-full2b-224

license:cc-by-nc-4.0
4,351
0

s2t-medium-mustc-multilingual-st

license:mit
4,288
7

mask2former-swin-tiny-coco-panoptic

4,062
9

MobileLLM-R1-950M

llama4_text
4,049
351

DiT-XL-2-256

license:cc-by-nc-4.0
3,984
25

PE-Spatial-B16-512

license:apache-2.0
3,933
0

vit-mae-huge

license:apache-2.0
3,886
6

mms-lid-1024

license:cc-by-nc-4.0
3,819
10

mms-tts-hin

license:cc-by-nc-4.0
3,681
20

incoder-1B

license:cc-by-nc-4.0
3,590
40

PE-Core-T16-384

license:apache-2.0
3,453
0

mgenre-wiki

3,421
29

metaclip-l14-fullcc2.5b

license:cc-by-nc-4.0
3,393
6

wav2vec2-xls-r-1b

license:apache-2.0
3,377
30

galactica-1.3b

license:cc-by-nc-4.0
3,290
74

ijepa_vith14_1k

license:cc-by-nc-4.0
3,213
15

MEXMA

Current approaches to pre-training cross-lingual sentence encoders use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and all tokens directly updating the encoder. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bi-text mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them. Usage You use this model as you would any other XLM-RoBERTa model, taking into account that the "pooler" has not been trained, so you should use the CLS token the encoder outputs directly as your sentence representation: You can also use this model with SentenceTransformers: License This model is released under the MIT license. Training code For the training code of this model, please check the official MEXMA repo. Paper MEXMA: Token-level objectives improve sentence representations Citation If you use this model in your work, please cite:
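The usage snippet mentioned in the card is missing here. A hedged sketch of the CLS-based approach it describes, assuming the repo id `facebook/MEXMA` and a standard XLM-RoBERTa interface:

```python
# Hedged sketch: sentence embeddings from MEXMA via the CLS token
# (the pooler is untrained, so we read the first hidden state directly).
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def main() -> None:
    import torch
    from transformers import AutoTokenizer, XLMRobertaModel

    tok = AutoTokenizer.from_pretrained("facebook/MEXMA")
    model = XLMRobertaModel.from_pretrained("facebook/MEXMA")

    batch = tok(["Hello world", "Bonjour le monde"],
                padding=True, return_tensors="pt")
    with torch.no_grad():
        # CLS token = first position of the last hidden state
        emb = model(**batch).last_hidden_state[:, 0]
    print(cosine(emb[0].tolist(), emb[1].tolist()))


# main()  # uncomment to run (downloads the checkpoint)
```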

license:mit
3,151
28

hubert-xlarge-ls960-ft

license:apache-2.0
3,116
14

wav2vec2-large-960h-lv60

license:apache-2.0
3,111
6

mms-tts-fra

license:cc-by-nc-4.0
3,082
14

ijepa_vith14_22k

license:cc-by-nc-4.0
3,049
0

sam2.1-hiera-small

license:apache-2.0
3,033
13

convnext-small-224

license:apache-2.0
2,893
5

dinov2-base-imagenet1k-1-layer

license:apache-2.0
2,811
6

sam2-hiera-base-plus

license:apache-2.0
2,765
12

mms-1b

license:cc-by-nc-4.0
2,746
52

musicgen-melody

license:cc-by-nc-4.0
2,727
247

mask2former-swin-large-cityscapes-panoptic

2,684
3

webssl-dino300m-full2b-224

license:cc-by-nc-4.0
2,593
10

detr-resnet-50-dc5

license:apache-2.0
2,571
6

mms-tts-rus

Massively Multilingual Speech (MMS): Russian Text-to-Speech This repository contains the Russian (rus) language text-to-speech (TTS) model checkpoint. This model is part of Facebook's Massively Multilingual Speech project, aiming to provide speech technology across a diverse range of languages. You can find more details about the supported languages and their ISO 639-3 codes in the MMS Language Coverage Overview, and see all MMS-TTS checkpoints on the Hugging Face Hub: facebook/mms-tts. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an end-to-end speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to synthesise speech with different rhythms from the same input text. The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the waveform using a cascade of the flow module and HiFi-GAN decoder. 
Due to the stochastic nature of the duration predictor, the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. For the MMS project, a separate VITS checkpoint is trained on each language. MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint, first install the latest version of the library: Then, run inference with the following code snippet: The resulting waveform can be saved as a `.wav` file: This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:
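The inference snippet the card refers to did not survive extraction. A minimal sketch using `VitsModel`, with a fixed seed since the duration predictor is stochastic:

```python
# Hedged sketch: synthesise Russian speech with facebook/mms-tts-rus
# and save the waveform as a .wav file.
def duration_seconds(n_samples: int, sampling_rate: int) -> float:
    """Length of a waveform in seconds."""
    return n_samples / float(sampling_rate)


def main() -> None:
    import scipy.io.wavfile
    import torch
    from transformers import AutoTokenizer, VitsModel, set_seed

    tok = AutoTokenizer.from_pretrained("facebook/mms-tts-rus")
    model = VitsModel.from_pretrained("facebook/mms-tts-rus")

    set_seed(555)  # reproducible output despite the stochastic duration predictor
    inputs = tok("Привет, мир!", return_tensors="pt")
    with torch.no_grad():
        wav = model(**inputs).waveform.squeeze().numpy()

    rate = model.config.sampling_rate
    print(f"{duration_seconds(len(wav), rate):.2f} s of audio")
    scipy.io.wavfile.write("mms_rus.wav", rate=rate, data=wav)


# main()  # uncomment to run (downloads the checkpoint)
```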

license:cc-by-nc-4.0
2,549
24

opt-66b

2,451
174

mask2former-swin-base-coco-instance

2,375
1

webssl-dino7b-full8b-224

license:cc-by-nc-4.0
2,331
3

PE-Spatial-L14-448

license:apache-2.0
2,323
0

mms-tts-amh

license:cc-by-nc-4.0
2,321
3

hiera-tiny-224-hf

license:cc-by-nc-4.0
2,280
0

convnextv2-large-22k-384

license:apache-2.0
2,256
2

sonata

2,244
26

data2vec-audio-base

license:apache-2.0
2,177
4

levit-128S

license:apache-2.0
2,172
4

pe-av-large-16-frame

license:apache-2.0
2,105
5

xglm-1.7B

license:mit
2,070
20

maskformer-swin-base-ade

2,021
13

mms-tts-ara

license:cc-by-nc-4.0
2,008
19

Perception-LM-3B

1,977
19

tart-full-flan-t5-xl

1,958
26

mms-tts-tir

license:cc-by-nc-4.0
1,939
1

mms-1b-fl102

license:cc-by-nc-4.0
1,925
27

mms-tts-lao

license:cc-by-nc-4.0
1,880
1

wav2vec2-conformer-rel-pos-large-960h-ft

license:apache-2.0
1,785
5

mms-tts-mya

license:cc-by-nc-4.0
1,752
6

wav2vec2-large-robust

license:apache-2.0
1,741
37

Perception-LM-8B

1,731
54

sam2-hiera-tiny

license:apache-2.0
1,721
25

musicgen-stereo-small

license:cc-by-nc-4.0
1,717
37

dinov2-with-registers-giant

license:apache-2.0
1,710
6

mms-tts-tur

license:cc-by-nc-4.0
1,704
24

maskformer-swin-large-ade

1,690
58

s2t-small-mustc-en-fr-st

license:mit
1,684
2

fasttext-et-vectors

license:cc-by-sa-3.0
1,682
1

regnet-y-040

license:apache-2.0
1,681
2

galactica-125m

license:cc-by-nc-4.0
1,664
39

wav2vec2-large-xlsr-53-spanish

license:apache-2.0
1,643
21

mms-tts-urd-script_arabic

license:cc-by-nc-4.0
1,587
8

detr-resnet-101-dc5

license:apache-2.0
1,579
19

convnextv2-large-22k-224

license:apache-2.0
1,559
2

dinov2-large-imagenet1k-1-layer

license:apache-2.0
1,547
1

mask2former-swin-large-ade-panoptic

1,544
4

mms-tts-kor

license:cc-by-nc-4.0
1,542
14

convnextv2-huge-22k-512

license:apache-2.0
1,536
7

maskformer-swin-base-coco

1,529
26

wav2vec2-xls-r-2b

license:apache-2.0
1,497
42

rag-token-base

license:apache-2.0
1,489
18

mms-tts-uig-script_arabic

license:cc-by-nc-4.0
1,483
12

dragon-roberta-query-encoder

1,470
4

dragon-roberta-context-encoder

1,456
2

layerskip-llama3-8B

llama
1,442
20

wav2vec2-large-100k-voxpopuli

license:cc-by-nc-4.0
1,422
4

data2vec-audio-large-960h

license:apache-2.0
1,329
7

mms-tts-tel

license:cc-by-nc-4.0
1,308
9

vjepa2-vitg-fpc64-384

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. To run the V-JEPA 2 model, ensure you have installed the latest version of transformers. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs. To load a video, sample the number of frames according to the model; for this model, we use 64. To load an image, simply copy the image to the desired number of frames. For more code examples, please refer to the V-JEPA 2 documentation.
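The card's image trick (copy one frame to the model's frame count, 64 here) can be sketched as below; the `AutoModel`/`AutoVideoProcessor` entry points are assumptions to verify against the V-JEPA 2 documentation.

```python
# Hedged sketch: use V-JEPA 2 as an image/video encoder by tiling a single
# frame into a 64-frame clip, as the card suggests for images.
import numpy as np


def image_to_clip(frame: np.ndarray, num_frames: int = 64) -> np.ndarray:
    """Tile one HxWx3 frame into a (num_frames, H, W, 3) clip."""
    return np.repeat(frame[None, ...], num_frames, axis=0)


def main() -> None:
    import torch
    from transformers import AutoModel, AutoVideoProcessor  # assumed API

    ckpt = "facebook/vjepa2-vitg-fpc64-384"
    processor = AutoVideoProcessor.from_pretrained(ckpt)
    model = AutoModel.from_pretrained(ckpt)

    clip = image_to_clip(np.zeros((384, 384, 3), dtype=np.uint8))
    inputs = processor(list(clip), return_tensors="pt")
    with torch.no_grad():
        features = model(**inputs).last_hidden_state  # patch-level embeddings
    print(features.mean(dim=1).shape)  # one pooled vector per clip


# main()  # uncomment to run (downloads the ~1B-parameter checkpoint)
```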

license:apache-2.0
1,248
32

pe-a-frame-large

license:apache-2.0
1,246
11

mms-tts-uzb-script_cyrillic

license:cc-by-nc-4.0
1,232
8

mask2former-swin-small-coco-panoptic

1,231
1

maskformer-swin-small-coco

1,209
4

layerskip-llama2-7B

llama
1,147
15

mms-tts-pol

license:cc-by-nc-4.0
1,142
6

sam-audio-large-tv

1,112
19

DiT-XL-2-512

license:cc-by-nc-4.0
1,106
15

deit-tiny-distilled-patch16-224

license:apache-2.0
1,102
7

convnextv2-huge-22k-384

license:apache-2.0
1,101
3

convnext-large-224

license:apache-2.0
1,083
28

vit-msn-small

license:apache-2.0
1,083
5

layerskip-codellama-34B

llama
1,069
4

maskformer-swin-small-ade

1,067
2

rag-sequence-base

license:apache-2.0
1,058
10

mms-tts-deu

license:cc-by-nc-4.0
1,040
14

mms-tts-orm

license:cc-by-nc-4.0
994
4

PE-Spatial-T16-512

license:apache-2.0
994
1

wav2vec2-large-xlsr-53-french

license:apache-2.0
978
13

data2vec-text-base

license:mit
960
12

mms-tts-fas

license:cc-by-nc-4.0
954
7

pe-av-base-16-frame

license:apache-2.0
946
2

sam2-hiera-small

license:apache-2.0
945
14

timesformer-base-finetuned-ssv2

license:cc-by-nc-4.0
945
3

mms-tts-ben

license:cc-by-nc-4.0
940
3

wav2vec2-large-robust-ft-swbd-300h

license:apache-2.0
919
20

hf-seamless-m4t-large

license:cc-by-nc-4.0
914
60

data2vec-vision-base-ft1k

license:apache-2.0
892
2

mms-tts-por

license:cc-by-nc-4.0
861
21

blenderbot-1B-distill

license:apache-2.0
852
37

deformable-detr-detic

license:apache-2.0
851
8

webssl-dino1b-full2b-224

license:cc-by-nc-4.0
848
3

xglm-7.5B

license:mit
847
59

s2t-large-librispeech-asr

license:mit
843
10

wav2vec2-conformer-rel-pos-large

license:apache-2.0
835
9

mms-lid-512

license:cc-by-nc-4.0
833
2

MobileLLM-Pro-base-int4-cpu

llama4_text
831
0

wav2vec2-base-it-voxpopuli

license:cc-by-nc-4.0
826
0

dpr-reader-single-nq-base

license:cc-by-nc-4.0
819
2

s2t-small-mustc-en-it-st

license:mit
807
1

ijepa_vitg16_22k

license:cc-by-nc-4.0
802
4

MobileLLM-R1-140M

llama4_text
796
33

opt-iml-max-1.3b

795
43

xglm-4.5B

license:mit
789
20

MobileLLM-125M

786
126

audioseal

license:mit
782
29

PE-Lang-L14-448

license:apache-2.0
772
7

mcontriever

768
4

metamotivo-M-1

license:cc-by-nc-4.0
758
8

musicgen-stereo-medium

license:cc-by-nc-4.0
756
33

galactica-120b

license:cc-by-nc-4.0
755
157

galactica-30b

license:cc-by-nc-4.0
737
40

galactica-6.7b

license:cc-by-nc-4.0
732
104

layerskip-llama3.2-1B

llama
732
24

dinov2-with-registers-base-imagenet1k-1-layer

license:apache-2.0
729
2

metaclip-b32-fullcc2.5b

license:cc-by-nc-4.0
721
8

mms-tts-yor

license:cc-by-nc-4.0
717
21

wav2vec2-large-xlsr-53-german

license:apache-2.0
714
4

opt-iml-max-30b

708
36

mms-tts-ory

license:cc-by-nc-4.0
702
4

regnet-y-320

license:apache-2.0
693
0

regnet-y-160

license:apache-2.0
680
0

mms-tts-pan

license:cc-by-nc-4.0
677
3

nllb-moe-54b

- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features. The exact training algorithm, data and the strategies to handle data imbalances for high- and low-resource languages that were used to train NLLB-200 are described in the paper. - Paper or other resource for more information: NLLB Team et al, No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022 - License: CC-BY-NC - Where to send questions or comments about the model: https://github.com/facebookresearch/fairseq/issues The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. - Expert Output Masking is used for training, which consists of dropping the full contribution for some tokens. This corresponds to the following scheme: Generating with NLLB-MoE The available checkpoints require around 350 GB of storage. Make sure to use `accelerate` if you do not have enough RAM on your machine. While generating the target text, set `forced_bos_token_id` to the target language id. The following example shows how to translate English to French using the facebook/nllb-moe-54b model. Note that we're using the BCP-47 code for French, `fra_Latn`. See here for the list of all BCP-47 codes in the Flores 200 dataset.
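The English-to-French example mentioned above is missing from this extract. A hedged sketch of the generation flow it describes (remember the checkpoint itself is ~350 GB, so this is illustrative rather than something to run casually):

```python
# Hedged sketch: English -> French with NLLB-MoE, using forced_bos_token_id
# to select the target language. Load with accelerate / device_map="auto".
import re


def is_flores_code(code: str) -> bool:
    """True for Flores-200 style BCP-47 codes such as 'fra_Latn'."""
    return re.fullmatch(r"[a-z]{3}_[A-Z][a-z]{3}", code) is not None


def main() -> None:
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = "facebook/nllb-moe-54b"
    tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(name, device_map="auto")

    batch = tok("Life is like a box of chocolates.", return_tensors="pt")
    out = model.generate(
        **batch,
        forced_bos_token_id=tok.convert_tokens_to_ids("fra_Latn"),
        max_new_tokens=50,
    )
    print(tok.batch_decode(out, skip_special_tokens=True)[0])


# main()  # uncomment to run (requires ~350 GB of weights)
```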

license:cc-by-nc-4.0
663
123

mask2former-swin-small-cityscapes-semantic

663
2

mms-tts-khm

license:cc-by-nc-4.0
661
9

layerskip-llama2-70B

llama
661
5

mms-tts-swh

license:cc-by-nc-4.0
657
11

convnext-base-384-22k-1k

license:apache-2.0
651
5

magnet-small-10secs

license:cc-by-nc-4.0
645
25

fasttext-en-vectors

license:cc-by-sa-3.0
644
17

metamotivo-S-1

license:cc-by-nc-4.0
636
9

maskformer-resnet101-coco-stuff

607
1

MobileLLM-Pro-base

llama4_text
598
7

Meta-SecAlign-70B

llama
593
8

mms-tts-heb

license:cc-by-nc-4.0
588
9

metaclip-2-worldwide-giant

license:cc-by-nc-4.0
586
3

mms-tts-ell

license:cc-by-nc-4.0
577
1

blt-1b

569
19

maskformer-swin-large-coco

568
27

regnet-y-080

license:apache-2.0
565
0

mms-tts-nld

license:cc-by-nc-4.0
553
3

regnet-y-064

license:apache-2.0
548
0

regnet-y-120

license:apache-2.0
540
0

deit-small-distilled-patch16-224

license:apache-2.0
536
6

cotracker

license:cc-by-nc-4.0
533
36

pe-av-small

license:apache-2.0
532
14

mms-tts-hau

license:cc-by-nc-4.0
521
4

layerskip-llama2-13B

llama
518
5

dpr-reader-multiset-base

license:cc-by-nc-4.0
518
0

mms-tts-mar

license:cc-by-nc-4.0
517
3

wav2vec2-base-100h

The base model pretrained and fine-tuned on 100 hours of Librispeech on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz. Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20. To transcribe audio files the model can be used as a standalone acoustic model as follows: This code snippet shows how to evaluate facebook/wav2vec2-base-100h on LibriSpeech's "clean" and "other" test data.
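The transcription snippet referenced above is missing from this extract. A hedged sketch of greedy CTC transcription with this checkpoint, including a small helper that mirrors what `batch_decode` does internally:

```python
# Hedged sketch: standalone acoustic-model transcription with
# facebook/wav2vec2-base-100h on 16 kHz mono audio.
def ctc_collapse(ids, blank=0):
    """Greedy CTC post-processing: merge repeated ids, then drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


def main() -> None:
    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-100h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")

    # `speech` is a 16 kHz mono float array, e.g. loaded with soundfile
    speech = np.zeros(16000, dtype=np.float32)  # placeholder second of silence
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(inputs.input_values).logits.argmax(dim=-1)
    # processor.batch_decode performs the same collapse as ctc_collapse above
    print(processor.batch_decode(pred_ids)[0])


# main()  # uncomment to run (downloads the checkpoint)
```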

license:apache-2.0
514
7

dpt-dinov2-small-kitti

license:apache-2.0
505
7

drama-base

llama
503
21

mms-tts-ron

license:cc-by-nc-4.0
503
5

tribev2

license:cc-by-nc-4.0
493
66

sapiens-seg-1b-torchscript

license:cc-by-nc-4.0
493
3

mms-tts-fon

license:cc-by-nc-4.0
481
7

MobileLLM-R1-360M

llama4_text
479
19

mms-tts-swe

license:cc-by-nc-4.0
477
0

metaclip-2-worldwide-huge-378

MetaCLIP 2 (worldwide) was presented in MetaCLIP 2: A Worldwide Scaling Recipe. This checkpoint corresponds to "ViT-H-14-378-worldwide" of the original implementation. First install the Transformers library (from source for now): In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:
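The install and `AutoModel` snippets the card mentions are missing here. A hedged sketch of zero-shot classification, assuming this checkpoint exposes a CLIP-style interface (`logits_per_image`), which should be verified against the MetaCLIP 2 documentation:

```python
# Hedged sketch: CLIP-style zero-shot image classification via AutoModel.
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def main() -> None:
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    name = "facebook/metaclip-2-worldwide-huge-378"
    processor = AutoProcessor.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    image = Image.new("RGB", (378, 378))  # placeholder image
    labels = ["a photo of a cat", "a photo of a dog"]
    inputs = processor(text=labels, images=image,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # assumed CLIP-style output
    print(dict(zip(labels, softmax(logits.tolist()))))


# main()  # uncomment to run (downloads the checkpoint)
```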

license:cc-by-nc-4.0
469
5

wav2vec2-xls-r-1b-en-to-15

license:apache-2.0
469
3

dinov2-with-registers-giant-imagenet1k-1-layer

license:apache-2.0
469
2

wav2vec2-xls-r-1b-21-to-en

license:apache-2.0
464
3

deit-base-distilled-patch16-384

license:apache-2.0
460
7

pe-av-large

license:apache-2.0
459
49

KernelLLM

llama
457
182

PE-Spatial-G14-448

license:apache-2.0
454
19

mms-tts-som

license:cc-by-nc-4.0
447
6

layerskip-codellama-7B

llama
443
6

mms-tts-hun

license:cc-by-nc-4.0
438
6

mms-tts-bul

license:cc-by-nc-4.0
431
1

convnextv2-nano-22k-224

license:apache-2.0
412
0

mms-tts-kab

license:cc-by-nc-4.0
401
1

mms-tts-guj

license:cc-by-nc-4.0
398
5

mms-tts-kaz

license:cc-by-nc-4.0
386
3

convnext-xlarge-224-22k

license:apache-2.0
383
1

audio-magnet-medium

license:cc-by-nc-4.0
378
34

metaclip-b16-400m

license:cc-by-nc-4.0
378
3

metaclip-2-worldwide-giant-378

MetaCLIP 2 (worldwide) was presented in MetaCLIP 2: A Worldwide Scaling Recipe. This checkpoint corresponds to "ViT-bigG-14-378-worldwide" of the original implementation. First install the Transformers library (from source for now): In case you want to perform pre- and postprocessing yourself, you can use the `AutoModel` API:

license:cc-by-nc-4.0
367
5

convnextv2-base-1k-224

license:apache-2.0
363
4

mms-tts-nan

license:cc-by-nc-4.0
362
6

dpt-dinov2-large-nyu

license:apache-2.0
362
1

mms-tts-mon

license:cc-by-nc-4.0
360
9

vjepa2-vitl-fpc32-256-diving48

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of V-JEPA, resulting in state-of-the-art video understanding capabilities, leveraging data and model sizes at scale. The code is released in this repository. 💡 This is the V-JEPA 2 ViT-L 256 model with a video classification head, pretrained on the Diving 48 dataset. To run the V-JEPA 2 model, ensure you have installed the latest version of transformers.

license:mit
357
2

magnet-medium-30secs

license:cc-by-nc-4.0
354
36

maskformer-swin-tiny-coco

349
6

convnextv2-nano-1k-224

license:apache-2.0
348
0

convnextv2-large-1k-224

license:apache-2.0
345
0

blt-entropy

342
8

VGGT-1B-Commercial

341
47

dinov2-with-registers-small-imagenet1k-1-layer

license:apache-2.0
340
2

ijepa_vith16_1k

license:cc-by-nc-4.0
340
0

maskformer-swin-tiny-ade

319
5

timesformer-hr-finetuned-k600

license:cc-by-nc-4.0
318
6

mms-tts-kan

license:cc-by-nc-4.0
315
3

MobileLLM-1B

310
121

Meta-SecAlign-8B

llama
307
11

hiera-tiny-224-in1k-hf

license:cc-by-nc-4.0
304
2

hiera-tiny-224-mae-hf

license:cc-by-nc-4.0
301
1

convnext-xlarge-224-22k-1k

license:apache-2.0
296
2

sam-audio-small-tv

292
9

deit-base-patch16-384

license:apache-2.0
287
3

s2t-medium-librispeech-asr

license:mit
281
8

MobileLLM-350M

280
35

xmod-large-prenorm

license:mit
273
0

llm-compiler-7b

llama
272
137

MobileLLM-R1-950M-base

llama4_text
272
16

dinov2-with-registers-large-imagenet1k-1-layer

license:apache-2.0
270
0

wav2vec2-large-xlsr-53-dutch

license:apache-2.0
269
3

mms-tts-hat

license:cc-by-nc-4.0
267
2

MobileLLM-ParetoQ-1.5B-1.58-bit

llama
267
0

wav2vec2-large-xlsr-53-portuguese

license:apache-2.0
259
7

mms-tts-mal

license:cc-by-nc-4.0
257
3

mms-tts-ukr

license:cc-by-nc-4.0
255
5

PE-Core-S16-384

license:apache-2.0
255
0

blt-7b

251
61

mask2former-swin-large-cityscapes-instance

240
2

magnet-small-30secs

license:cc-by-nc-4.0
235
9

convnext-base-224-22k-1k

license:apache-2.0
235
5

mask2former-swin-base-IN21k-cityscapes-panoptic

234
0

mms-tts-ceb

license:cc-by-nc-4.0
233
3

mms-tts-kmr-script_latin

license:cc-by-nc-4.0
227
2

mms-tts-gbm

license:cc-by-nc-4.0
222
0

mms-tts-kmr-script_arabic

license:cc-by-nc-4.0
221
1

mask2former-swin-base-IN21k-cityscapes-semantic

215
0

metamotivo-S-4

license:cc-by-nc-4.0
207
2

sapiens-pose-1b-torchscript

license:cc-by-nc-4.0
206
11

esm1v_t33_650M_UR90S_3

206
0

chameleon-30b

205
88

mms-tts-quz

license:cc-by-nc-4.0
204
1

metamotivo-S-2

license:cc-by-nc-4.0
202
2

musicgen-stereo-large

license:cc-by-nc-4.0
200
83

metamotivo-S-3

license:cc-by-nc-4.0
200
2

metamotivo-S-5

license:cc-by-nc-4.0
199
1

mms-tts-eus

license:cc-by-nc-4.0
197
2

xglm-2.9B

license:mit
195
10

convnextv2-pico-1k-224

license:apache-2.0
193
1

esm1v_t33_650M_UR90S_4

193
0

mms-tts-fin

license:cc-by-nc-4.0
186
1

mask2former-swin-base-IN21k-cityscapes-instance

181
0

MobileLLM-R1-360M-base

llama4_text
175
11

magnet-medium-10secs

license:cc-by-nc-4.0
175
9

mms-tts-crs

license:cc-by-nc-4.0
174
0

locate-3d-plus

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D Official model weights for the `Locate-3D` models and the `3D-JEPA` encoders. Locate 3D is a model for localizing objects in 3D scenes from referring expressions like “the small coffee table between the sofa and the lamp.” Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds, is key to `Locate 3D`. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. - Locate-3D: Locate-3D model trained on public referential grounding datasets - Locate-3D+: Locate-3D model trained on public referential grounding datasets and the newly released Locate 3D Dataset - 3D-JEPA: Pre-trained SSL encoder for 3D understanding For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository. The majority of `locate-3d` is licensed under CC-BY-NC; however, portions of the project are available under separate license terms: Pointcept is licensed under the MIT license.

license:cc-by-nc-4.0
171
7

mms-tts-zlm

license:cc-by-nc-4.0
165
4

data2vec-vision-large

license:apache-2.0
165
2

timesformer-hr-finetuned-ssv2

license:cc-by-nc-4.0
165
2

s2t-small-mustc-en-de-st

license:mit
165
0

pe-av-small-16-frame

license:apache-2.0
164
3

dpt-dinov2-base-kitti

license:apache-2.0
164
2

esm1v_t33_650M_UR90S_5

163
0

incoder-6B

license:cc-by-nc-4.0
162
80

locate-3d

license:cc-by-nc-4.0
162
9

mask2former-swin-base-IN21k-ade-semantic

161
3

mms-tts-pap

license:cc-by-nc-4.0
161
0

sapiens

license:cc-by-nc-4.0
158
242

hubert-xlarge-ll60k

license:apache-2.0
157
5

mask2former-swin-tiny-cityscapes-panoptic

155
0

wav2vec2-base-es-voxpopuli-v2

license:cc-by-nc-4.0
151
1

mms-tts-sqi

license:cc-by-nc-4.0
141
3

mms-tts-aka

license:cc-by-nc-4.0
139
3

wav2vec2-base-it-voxpopuli-v2

license:cc-by-nc-4.0
139
0

wav2vec2-xlsr-53-phon-cv-babel-ft

138
4

dpt-dinov2-base-nyu

license:apache-2.0
138
0

MobileLLM-600M

137
29

vit-msn-base

license:apache-2.0
136
0

detr-resnet-101-panoptic

license:apache-2.0
135
17

mms-tts-mlg

license:cc-by-nc-4.0
134
6

sapiens-pose-bbox-detector

license:apache-2.0
133
4

MobileLLM-ParetoQ-125M-BF16

llama
130
0

pixio-vitl16

128
7

hiera-base-224-hf

license:cc-by-nc-4.0
127
0

mms-tts-saq

license:cc-by-nc-4.0
125
0

sapiens-depth-1b-torchscript

license:cc-by-nc-4.0
123
1

dpt-dinov2-large-kitti

DPT (Dense Prediction Transformer) model with DINOv2 backbone as proposed in DINOv2: Learning Robust Visual Features without Supervision by Oquab et al. The model is intended to showcase that using the DPT framework with DINOv2 as backbone yields a powerful depth estimator.
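A hedged sketch of depth estimation with this checkpoint; the repo id `facebook/dpt-dinov2-large-kitti` is inferred from the card title:

```python
# Hedged sketch: monocular depth estimation with the DPT + DINOv2 checkpoint.
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max normalise a depth map to 0-255 for visualisation."""
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        return np.zeros_like(depth, dtype=np.uint8)
    return ((depth - lo) / (hi - lo) * 255.0).astype(np.uint8)


def main() -> None:
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, DPTForDepthEstimation

    name = "facebook/dpt-dinov2-large-kitti"  # assumed repo id
    processor = AutoImageProcessor.from_pretrained(name)
    model = DPTForDepthEstimation.from_pretrained(name)

    image = Image.new("RGB", (640, 480))  # placeholder image
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        depth = model(**inputs).predicted_depth  # (1, H', W')
    # Resize the prediction back to the input resolution
    depth = torch.nn.functional.interpolate(
        depth.unsqueeze(1), size=image.size[::-1],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()
    Image.fromarray(depth_to_uint8(depth)).save("depth.png")


# main()  # uncomment to run (downloads the checkpoint)
```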

license:apache-2.0
121
4

mms-tts-urd-script_devanagari

license:cc-by-nc-4.0
121
0