laion

160 models

clap-htsat-fused

Model card for CLAP: Contrastive Language-Audio Pretraining

14,098,060
44

CLIP-ViT-H-14-laion2B-s32B-b79K

---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
library_name: open_clip
pipeline_tag: zero-shot-image-classification
---

license:mit
2,200,370
417

CLIP-ViT-B-32-laion2B-s34B-b79K

---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
pipeline_tag: zero-shot-image-classification
---

license:mit
952,107
134

larger_clap_general

---
license: apache-2.0
---

license:apache-2.0
712,980
46

CLIP-convnext_base_w-laion2B-s13B-b82K-augreg

---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
tags:
  - clip
---

license:mit
467,500
7

CLIP-ViT-B-16-laion2B-s34B-b88K

---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
---

license:mit
208,293
38

clap-htsat-unfused

Model card for CLAP: Contrastive Language-Audio Pretraining

Contents: 0. TL;DR · 1. Model Details · 2. Usage · 3. Uses · 4. Citation

> Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and obtains performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.

You can use this model for zero-shot audio classification or for extracting audio and/or textual features. You can also get the audio and text embeddings using `ClapModel`. If you are using this model in your work, please consider citing the original paper:
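The zero-shot audio classification described above reduces to comparing L2-normalized audio and text embeddings with a scaled softmax. A minimal sketch with dummy embeddings standing in for the `ClapModel` outputs (the real vectors would come from `get_audio_features` / `get_text_features`; the logit scale here is illustrative):

```python
import numpy as np

def zero_shot_scores(audio_emb, text_embs, logit_scale=100.0):
    """Cosine-similarity scores between one audio clip and candidate
    label embeddings, turned into probabilities with a softmax."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (t @ a)      # one logit per candidate label
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

# Dummy embeddings for illustration only, not real CLAP features.
rng = np.random.default_rng(0)
audio = rng.normal(size=512)
labels = np.stack([audio + 0.1 * rng.normal(size=512),  # aligned label
                   rng.normal(size=512)])               # unrelated label
probs = zero_shot_scores(audio, labels)
print(probs.argmax())  # index 0, the aligned label
```

The same scoring applies to text-to-audio retrieval; only the direction of the comparison changes.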

license:apache-2.0
156,656
61

CLIP-ViT-L-14-laion2B-s32B-b82K

Contents: 1. Model Details · 2. Uses · 3. Training Details · 4. Evaluation · 5. Acknowledgements · 6. Citation · 7. How To Get Started With the Model

A CLIP ViT-L/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training ('babysitting') done by Ross Wightman on the JUWELS Booster supercomputer. See acknowledgements below.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model, because the use of artificial intelligence for such tasks can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Beyond the above notice, the LAION-5B dataset used in training of these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come along with training large-scale models, as well as of pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The model was trained on 384 A100 GPUs using 200M-sample 'virtual' epochs, where dataset shards were sampled with replacement. The model was trained for 160 virtual epochs for a total of 32B samples seen. The first 68 epochs were trained with float16 AMP, global batch size 79K (208 per GPU). Training initially ran to epoch 75, where the loss spiked and training failed with NaN. Romain Beaumont was training H/14 and g/14 models at the same time on the Stability cluster and hit similar instabilities. Collectively we tried restarts with:
- a different dataset shuffle seed
- a different LR
- gradient clipping
- modifications to the architecture:
  - norm modifications (stable norm for final, post-embed norm for the text transformer) as per https://github.com/mlfoundations/open_clip/pull/153, thanks to Phil Wang
  - extra attention block norms à la Normformer (https://arxiv.org/abs/2110.09456)
  - scaled cosine attention à la Swin-V2 (https://arxiv.org/abs/2111.09883)

None of the above ended up working. Most blew up within the same epoch as the original, with the exception of the architecture mods. The Normformer mods significantly altered the network such that resuming did not quickly converge to previous performance; this was abandoned but might be worth trying from the start. Scaled cosine attention initially looked promising and lasted until epoch 90 before the loss suddenly increased and appeared to remain 'stuck'. In the end, restarting at epoch 69 with `float32` precision solved all instabilities, and training continued from there with global batch size 86K (224 per GPU). On A100 GPUs, `float32` had a minimal impact on throughput once `tf32` matmuls were enabled in PyTorch: approximately 10% slower than `float16 AMP`. Romain similarly changed the precision but ended up using `bfloat16 AMP` to resolve the issues.

Evaluation done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 75.3% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
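The sample accounting in the training description above can be checked in a few lines. The figures are taken from the card; the derived step counts are back-of-envelope arithmetic, not from the source:

```python
# Check of the training schedule described in the card.
samples_per_virtual_epoch = 200_000_000
virtual_epochs = 160
total_samples = samples_per_virtual_epoch * virtual_epochs
assert total_samples == 32_000_000_000       # matches "32B samples seen"

gpus = 384
global_batch_fp16 = gpus * 208               # float16 AMP phase
global_batch_fp32 = gpus * 224               # float32 phase after epoch 69
print(global_batch_fp16)                     # 79,872 ("global batch size 79K")
print(global_batch_fp32)                     # 86,016 ("global batch size 86k")
```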

license:mit
144,724
60

CLIP-ViT-L-14-DataComp.XL-s13B-b90K

Contents: 1. Model Details · 2. Uses · 3. Training Details · 4. Evaluation · 5. Acknowledgements · 6. Citation · 7. How To Get Started With the Model

A CLIP ViT-L/14 model trained with DataComp-1B (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/open_clip).

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the DataComp paper (https://arxiv.org/abs/2304.14108) includes additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model, because the use of artificial intelligence for such tasks can be premature currently given the lack of testing norms and checks to ensure its fair use.

This model was trained with the 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108). IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come along with training large-scale models, as well as of pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation done on 38 datasets, using the DataComp repo and the LAION CLIP Benchmark. The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k. See our paper for more details and results (https://arxiv.org/abs/2304.14108).

Acknowledging stability.ai for the compute used to train this model.
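A zero-shot top-1 figure like the 79.2% above is simply the argmax-match rate over the image-to-class similarity logits. A toy illustration of the metric with dummy logits (this is not the DataComp evaluation code):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

# Toy logits for 4 images over 3 classes (illustration only).
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.2, 0.1],   # mis-ranked: true class is 2
                   [0.1, 0.2, 2.2]])
labels = np.array([0, 1, 2, 2])
print(top1_accuracy(logits, labels))  # 3 of 4 correct -> 0.75
```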

license:mit
123,640
122

CLIP-convnext_large_d.laion2B-s26B-b102K-augreg

license:mit
120,618
5

CLIP-ViT-bigG-14-laion2B-39B-b160k

Contents: 1. Model Details · 2. Uses · 3. Training Details · 4. Evaluation · 5. Acknowledgements · 6. Citation · 7. How To Get Started With the Model

A CLIP ViT-bigG/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Mitchell Wortsman on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model, because the use of artificial intelligence for such tasks can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Beyond the above notice, the LAION-5B dataset used in training of these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). Fine-tuning was also partially done on LAION-A, a 900M subset of LAION-2B filtered with aesthetic V2 4.5+ and phash-deduplicated. IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come along with training large-scale models, as well as of pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The training procedure will soon be discussed in a blog post on laion.ai.

Evaluation done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves an 80.1% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, and will soon be visible at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model.

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
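Zero-shot classification with any of these CLIP checkpoints starts by turning plain class names into text prompts for the text tower. A minimal sketch of that step; the template set here is illustrative, not the ensemble actually used in the OpenCLIP evaluation:

```python
# Build text prompts for zero-shot classification from plain class names.
# Templates are illustrative; real evaluations use a much larger ensemble.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]

def build_prompts(classnames):
    """Return {classname: [prompt, ...]} ready for the text encoder;
    per-class prompt embeddings are typically averaged into one vector."""
    return {name: [t.format(name) for t in TEMPLATES] for name in classnames}

prompts = build_prompts(["dog", "cat"])
print(prompts["dog"][0])   # a photo of a dog.
print(len(prompts["cat"])) # 3
```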

license:mit
111,235
295

CLIP-ViT-B-32-DataComp.XL-s13B-b90K

license:mit
84,558
4

CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

license:mit
70,967
21

CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k

Model Card for CLIP ViT-B/32 xlm-roberta-base - LAION-5B

Contents: 1. Model Details · 2. Uses · 3. Training Details · 4. Evaluation · 5. Acknowledgements · 6. Citation · 7. How To Get Started With the Model

A CLIP ViT-B/32 xlm-roberta-base model trained on LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

This model was trained with the full LAION-5B (https://laion.ai/blog/laion-5b/), with batch size 90k for 13B samples of LAION-5B; see https://wandb.ai/rom1504/open-clip/reports/xlm-roberta-base-B-32--VmlldzoyOTQ5OTE2 The model is B/32 on the visual side, with xlm-roberta-base initialized from pretrained weights on the text side.

Evaluation done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves:
- ImageNet-1k: 62.33% (vs 62.9% for the baseline)
- MSCOCO: 63.4% (vs 60.8% for the baseline)
- Flickr30k: 86.2% (vs 85.4% for the baseline)

A preliminary multilingual evaluation was run: 43% on Italian ImageNet-1k (vs 21% for the English B/32) and 37% on Japanese ImageNet-1k (vs 1% for the English B/32 and 50% for the B/16 CLIP Japanese). It shows the multilingual property is indeed there as expected. Larger models will get even better performance.

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:
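The COCO/Flickr retrieval numbers quoted in these cards are recall@k over an image-text similarity matrix. A toy sketch of the metric with a dummy similarity matrix (not the benchmark code itself):

```python
import numpy as np

def recall_at_k(sim, k=1):
    """sim[i, j]: similarity of query i to candidate j; ground truth is
    the diagonal (query i matches candidate i)."""
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = [i in topk[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Toy similarity matrix for 3 queries (illustration only).
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.5, 0.7],   # true match only ranked 2nd
                [0.0, 0.3, 0.8]])
print(recall_at_k(sim, k=1))  # 2 of 3 queries hit at k=1
print(recall_at_k(sim, k=2))  # all 3 hit within the top 2
```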

license:mit
50,404
14

CLIP-ViT-g-14-laion2B-s12B-b42K

Contents: 1. Model Details · 2. Uses · 3. Training Details · 4. Evaluation · 5. Acknowledgements · 6. Citation · 7. How To Get Started With the Model

A CLIP ViT-g/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model - whether commercial or not - is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model, because the use of artificial intelligence for such tasks can be premature currently given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English language use cases. Beyond the above notice, the LAION-5B dataset used in training of these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: The motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that the uncurated nature of the dataset means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a customized trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come along with training large-scale models, as well as of pitfalls and dangers that may stay unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we however do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 76.6% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets

license:mit
47,676
44

CLIP-ViT-B-16-DataComp.XL-s13B-b90K

license:mit
43,210
8

CLIP-ViT-B-32-256x256-DataComp-s34B-b86K

license:mit
38,898
8

CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup

license:mit
30,135
25

larger_clap_music_and_speech

license:apache-2.0
17,423
31

CLIP-ViT-g-14-laion2B-s34B-b88K

license:mit
11,165
27

CLIP-ViT-L-14-CommonPool.XL-s13B-b90K

license:mit
10,708
3

CLIP-convnext_base-laion400M-s13B-b51K

license:mit
7,555
0

CLIP-convnext_base_w-laion2B-s13B-b82K

license:mit
5,489
4

CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg

license:mit
4,226
4

mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k

license:mit
3,791
21

CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k

license:mit
3,433
23

CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg

license:mit
2,539
8

larger_clap_music

license:apache-2.0
2,217
35

CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K

license:mit
2,103
3

CLIP-ViT-B-16-DataComp.L-s1B-b8K

license:mit
1,526
1

CoCa-ViT-B-32-laion2B-s13B-b90k

license:mit
1,418
6

CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K

license:mit
1,321
1

CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k

license:mit
1,181
2

CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft

license:mit
1,151
3

CLIP-ViT-B-16-CommonPool.L-s1B-b8K

license:mit
986
0

BUD-E-Whisper

BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content. The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of multi-layer perceptrons (MLPs) designed to predict dimensional emotion scores.

This model is released under the CC-BY-4.0 license. Please give attribution to Maurice Kraus & Christoph Schuhmann, who made this model.

Colab: [Open in Colab](https://colab.research.google.com/drive/1VoAtmNhY1hI5Yzv1dppHTcYky82OCDK?usp=sharing)

BUD-E Whisper was trained on a combination of:
- the Laion's Got Talent (Enhanced Flash Annotations and Long Captions) dataset, and
- an internal dataset comprising approximately 5,000 hours of public vlogs and similar audio content.

A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:
1. Initial score generation: an iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (0-4 scale) and 15 additional dimensions such as age, arousal, valence, dominance, harshness, vocal bursts, ... for all audio snippets.
2. Templated captions: these scores were converted into templated string captions.
3. Paraphrasing for richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
4. Fine-tuning: various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally aware captions.

This multi-step caption refinement was crucial for performance: direct score regression and simple templated captions were both found to lead to suboptimal performance for emotional speech captioning with Whisper models.

Uses: generating emotionally nuanced captions for audio content, and providing rich embeddings for downstream emotion recognition tasks (e.g., with Empathic Insight - Voice).
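Step 2 of the refinement pipeline (scores to templated caption) can be sketched as below. The dimension names, score bands, and wording are illustrative assumptions, not the actual templates used for training:

```python
# Turn a dict of 0-4 emotion scores into a templated caption string.
# Dimension names and phrasing are illustrative, not the real templates.
BANDS = ["no", "slight", "moderate", "strong", "intense"]

def templated_caption(transcript, scores):
    """Render rounded 0-4 scores as a readable caption fragment."""
    parts = [f"{BANDS[round(v)]} {dim}" for dim, v in sorted(scores.items())]
    return f'"{transcript}" (spoken with {", ".join(parts)})'

caption = templated_caption(
    "I can't believe we won!",
    {"joy": 3.6, "surprise": 2.2, "anger": 0.1},
)
print(caption)
```

A paraphrasing model can then rewrite such templated strings into the diverse, semantically rich targets the card describes.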

license:cc-by-4.0
696
36

CoCa-ViT-L-14-laion2B-s13B-b90k

license:mit
551
18

CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K

license:mit
521
1

swesmith-nl2bash-stack-bugsseq

llama-factory
485
0

nl2bash-swesmith-stack-bugsseq

llama-factory
433
0

open-thoughts-4-code-qwen3-32b-annotated-32k

llama-factory
410
0

open-thoughts-4-code-qwen3-32b-annotated

llama-factory
406
0

mscoco_finetuned_CoCa-ViT-B-32-laion2B-s13B-b90k

license:mit
391
0

r2egym-stack-bugsseq

llama-factory
272
0

r2egym-nl2bash-stackseq

llama-factory
258
0

exp-uns-r2egym-33_6x_glm_4_7_traces_jupiter

llama-factory
257
0

CLIP-convnext_base_w-laion_aesthetic-s13B-b82K

license:mit
255
5

bugs-r2egym-stackseq

llama-factory
239
0

r2egym-nl2bash-bugsseq

llama-factory
225
0

r2egym-nl2bash-stack-bugsseq

llama-factory
220
0

CLIP-ViT-B-32-CommonPool.M.clip-s128M-b4K

license:mit
201
0

Qwen3-8B_exp_tas_temp_0.5_traces_save-strategy_steps

llama-factory
180
0

CLIP-ViT-B-32-CommonPool.S.text-s13M-b4K

license:mit
176
0

CLIP-ViT-B-32-CommonPool.S.laion-s13M-b4K

license:mit
171
0

r2egym-stackseq

llama-factory
169
0

r2egym-bugsseq

llama-factory
165
0

r2egym-nl2bashseq

llama-factory
159
0

open-thoughts-4-code-qwen3-32b-annotated-gbs256-4node

154
0

glm46-swesmith-maxeps-131k

llama-factory
152
0

CLIP-ViT-B-32-CommonPool.M-s128M-b4K

license:mit
145
0

exp_tas_optimal_combined_traces-Qwen3.5-9B

llama-factory
141
0

CLIP-ViT-B-16-CommonPool.L.laion-s1B-b8K

license:mit
140
0

minimax-m2-stack-overflow-32ep-131k-summtrc

llama-factory
137
0

glm-4_6-stack-overflow-32ep-131k-summtrc

llama-factory
137
0

gpt-oss-120B-stack-overflow-32ep-131k-summtrc-fixthink1

llama-factory
136
0

gpt-oss-120B-stack-overflow-32ep-131k-summtrc

llama-factory
136
0

stackexchange-tezos-sandboxes_glm_4_6_traces_together

llama-factory
133
0

CLIP-ViT-B-32-DataComp.M-s128M-b4K

license:mit
131
0

qwen3-coder-480B-stack-overflow-32ep-131k-summtrc

llama-factory
122
0

CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind

license:mit
117
2

CLIP-ViT-B-32-CommonPool.M.laion-s128M-b4K

license:mit
115
0

CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K

license:mit
109
0

CLIP-ViT-B-32-CommonPool.M.image-s128M-b4K

license:mit
105
0

CLIP-ViT-B-16-CommonPool.L.basic-s1B-b8K

license:mit
105
0

stackexchange-tezos-sandboxes_glm_4_6_traces_together_again

NaNK
llama-factory
104
0

kimi-k2t-freelancer-32ep-32k

NaNK
llama-factory
100
0

CLIP-ViT-B-32-CommonPool.M.basic-s128M-b4K

license:mit
97
0

glm-4_6-freelancer-32ep-131k-torch

NaNK
llama-factory
95
0

GLM-4_6-stackexchange-overflow-sandboxes-32eps-65k-reasoning

llama-factory
93
0

CLIP-ViT-B-32-CommonPool.S.clip-s13M-b4K

license:mit
89
0

sft_GLM-4-7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k_Qwen3-32B

NaNK
llama-factory
86
0

glm46-neulab-synatra-32ep-131k

NaNK
llama-factory
86
0

MiniMax-M2-freelancer-32ep-32k

llama-factory
82
0

CLIP-ViT-B-16-CommonPool.L.clip-s1B-b8K

NaNK
license:mit
82
0

claude-4-5-sonnet-thinking-stackexchange-overflow-32ep-32k-traces

NaNK
llama-factory
80
0

CLIP-ViT-B-32-DataComp.S-s13M-b4K

license:mit
75
0

CLIP-ViT-B-32-CommonPool.S.basic-s13M-b4K

license:mit
74
0

CLIP-ViT-B-32-CommonPool.S.image-s13M-b4K

license:mit
74
0

CLIP-ViT-B-32-CommonPool.S-s13M-b4K

license:mit
71
0

glm46-stackexchange-tezos-maxeps-131k

NaNK
llama-factory
67
0

r2egym-gpt5-codex-160ep-1M

NaNK
llama-factory
66
0

stackexchange-tezos-sandboxes_glm_4_7_traces_locetash

NaNK
llama-factory
62
0

stackexchange-tezos-sandboxes_glm_4_6_traces_locetash_again

NaNK
llama-factory
61
0

glm-4_6-stackexchange-tezos-32ep-131k

NaNK
llama-factory
61
0

timbre-whisper

license:cc-by-4.0
60
0

GLM-4_6-stackexchange-superuser-32ep-32k

NaNK
llama-factory
59
0

sft__Kimi-2-5-inferredbugs-sandboxes-maxeps-32k__Qwen3-8B

NaNK
llama-factory
55
0

exp_tas_parser_xml_traces

NaNK
llama-factory
49
0

exp_tas_min_p_0_1_traces

NaNK
llama-factory
48
0

exp-syh-r2egym-askllm-constrained_glm_4_7_traces_jupiter

NaNK
llama-factory
45
0

glm46-swegym-tasks-maxeps-131k

NaNK
llama-factory
44
0

exp_tas_baseline_traces

NaNK
llama-factory
44
0

exp_tas_full_thinking_traces

NaNK
llama-factory
43
0

exp_tas_linear_history_off_traces

NaNK
llama-factory
43
0

GLM-4_6-inferredbugs-32ep-65k-reasoning

llama-factory
43
0

exp_tas_low_diversity_traces

NaNK
llama-factory
42
0

exp_tas_frequency_penalty_0_5_traces

NaNK
llama-factory
40
0

exp_tas_frequency_penalty_0_25_traces

NaNK
llama-factory
40
0

glm46-swesmith-maxeps-131k-lc

NaNK
llama-factory
37
0

exp_tas_interleaved_thinking_on_traces

NaNK
llama-factory
37
0

glm-4_6-nemo-prism

NaNK
llama-factory
37
0

CLIP-ViT-B-16-CommonPool.L.image-s1B-b8K

NaNK
license:mit
35
0

alfworld-swesmith-r2egym-swegym-131k-lc

NaNK
llama-factory
34
0

CLIP-ViT-B-16-CommonPool.L.text-s1B-b8K

NaNK
license:mit
32
0

stackexchange-tezos-sandboxes_glm_4_6_traces_locetash

NaNK
llama-factory
28
0

rl__24GPU_shaped__swe_rebench_patched_oracle__r2egym-nl2bash-stack

NaNK
24
0

r2egym-nl2bash-stack-bugsseq-fixthink

NaNK
llama-factory
23
0

Qwen3-Coder-480B-codeforces-fixeps_Qwen3-8B

NaNK
llama-factory
22
0

r2egym-nl2bash-stack-bugsseq-fixthink-again

NaNK
llama-factory
20
0

glm46-glaive-code-assistant-sandboxes-maxeps-131k

NaNK
llama-factory
20
0

GLM-4_6-freelancer-32eps-131k

llama-factory
20
0

glm-4_6-all-puzzles-32ep-131k

NaNK
llama-factory
19
0

openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_DFN512x512-s293M-b32k

NaNK
license:apache-2.0
17
2

exp_tas_optimal_combined_traces

NaNK
llama-factory
16
0

glm46-defects4j-32ep-131k

NaNK
llama-factory
16
0

openMaMMUT-ViT-B-32-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k

NaNK
license:apache-2.0
15
1

voice-tagging-whisper

license:apache-2.0
15
0

sft_r2egym-nl2bash-stackoverflow-inferredbugs-32B_Qwen3-32B

NaNK
llama-factory
13
0

exp-gfi-staqc-askllm-filtered-10K_glm_4_7_traces_jupiter_cleaned

NaNK
llama-factory
11
0

openMaMMUT-ViT-B-16-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k

NaNK
license:apache-2.0
10
1

rl__r2egym_deepswe_fp8_terminus-2_32b

NaNK
10
0

GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_adam-beta1_0-97_Qwen3-32B

NaNK
llama-factory
9
0

openMaMMUT-ViT-L-14-DataComp-1.4B-s12.8B-b180K

NaNK
license:apache-2.0
7
5

anh-bloomz-7b1-mt-cross-lingual

NaNK
7
5

glm-4_6-r2egym-32ep-32k

llama-factory
7
0

exp-uns-r2egym-2_1x_glm_4_7_traces_jupiter

NaNK
llama-factory
6
0

exp-syh-tezos-stackoverflow-mixed_glm_4_7_traces_jupiter

NaNK
llama-factory
5
0

GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k

NaNK
llama-factory
4
0

exp-syh-tezos-askllm-constrained_glm_4_7_traces_jupiter

NaNK
llama-factory
3
0

openMaMMUT-ViT-B-16-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k

NaNK
license:apache-2.0
2
1

openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k

NaNK
license:apache-2.0
2
1

exp_tas_top_p_0_95_traces

NaNK
llama-factory
2
0

exp_tas_max_tokens_2048_traces

NaNK
llama-factory
2
0

distil-whisper-large-v3_openvino_int8

license:apache-2.0
2
0

Mantis-8B-siglip-llama3_openvino_int8

NaNK
license:apache-2.0
2
0

anh-xglm-7.5b-cross-lingual

NaNK
license:apache-2.0
1
5

LLaVA-Video-7B-Qwen2_openvino_int8

NaNK
license:apache-2.0
1
1

exp_tas_top_p_0_8_traces

NaNK
llama-factory
1
0

DALLE2-PyTorch

license:mit
0
68

Empathic-Insight-Voice-Small

Empathic-Insight-Voice-Small [Open in Colab](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGFtJ3YdFF8BkUA2)

Empathic-Insight-Voice-Small is a suite of 40+ emotion and attribute regression models trained on the large-scale, multilingual synthetic voice-acting dataset LAION'S GOT TALENT (~5,000 hours) and an "in the wild" dataset of voice snippets (also ~5,000 hours). Each model predicts the intensity of a specific fine-grained emotion or attribute from speech audio. The models leverage embeddings from a fine-tuned Whisper model (laion/BUD-E-Whisper) followed by dedicated MLP regression heads for each dimension.

This work is based on the research paper "EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection".

The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes such as valence, arousal, and gender). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set. The models were trained on a large dataset of synthetic and "in the wild" speech (~5,000 hours each).

These models are intended for research purposes in affective computing, speech emotion recognition (SER), human-AI interaction, and voice AI development. They can be used to:

- Analyze and predict fine-grained emotional states and vocal attributes from speech.
- Serve as a baseline for developing more advanced SER systems.
- Facilitate research into nuanced emotional understanding in voice AI.
- Explore multilingual and cross-cultural aspects of speech emotion (given the foundation dataset).

Out-of-scope use: These models are trained on synthetic speech, and their generalization to spontaneous real-world speech needs further evaluation.
They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes or infringe on privacy without due diligence and ethical review.

The primary way to use these models is through the provided Google Colab notebook, which handles dependencies, model loading, and audio processing, and provides examples for:

- Batch processing a folder of audio files.
- Generating a comprehensive HTML report with per-file emotion scores, waveforms, and audio players.
- Generating individual JSON files with all predicted scores for each audio file.

For a full, runnable conceptual Python example of single-audio-file inference, extracting all emotion and attribute scores, please refer to the Colab notebook.

The core 40 emotion categories (from EMONET-VOICE, Appendix A.1) are: Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph.

Additional vocal attributes (e.g., valence, arousal, gender, age, pitch characteristics) are also predicted by corresponding MLP models in the suite. The full list of predictable dimensions can be inferred from the `FILENAMEPARTTOTARGETKEYMAP` in the Colab notebook (Cell 2).
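As a rough illustration of the scoring step described above (Whisper embedding → one small MLP regression head per emotion/attribute), here is a minimal NumPy sketch. Everything here is a placeholder assumption: `EMBED_DIM`, the layer sizes, and the random weights are illustrative only, not the released checkpoints; real inference uses `laion/BUD-E-Whisper` embeddings and the published heads via the Colab notebook.

```python
import numpy as np

EMBED_DIM = 1280  # assumed Whisper embedding width; check the Colab for the real value


def mlp_head(embedding: np.ndarray, layers: list[tuple[np.ndarray, np.ndarray]]) -> float:
    """Run one regression head: a stack of (W, b) layers with ReLU between them."""
    x = embedding
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return float(x.squeeze())       # one continuous intensity score


# Toy random head just to show the call shape; one such head exists per dimension.
rng = np.random.default_rng(0)
head = [
    (rng.standard_normal((EMBED_DIM, 64)) * 0.01, np.zeros(64)),
    (rng.standard_normal((64, 1)) * 0.01, np.zeros(1)),
]
embedding = rng.standard_normal(EMBED_DIM)  # stand-in for a Whisper audio embedding
score = mlp_head(embedding, head)
```

In the released suite, this loop would be repeated over all 54+ heads to produce the full per-file score dictionary.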
The EMONET-VOICE suite was developed with ethical considerations as a priority:

- Privacy preservation: Synthetic voice generation circumvents the privacy concerns associated with collecting real human emotional expressions, especially for sensitive states.
- Responsible use: These models are released for research. Users are urged to consider the ethical implications of their applications and to avoid misuse, such as emotional manipulation, surveillance, or applications that could lead to unfair, biased, or harmful outcomes. The broader societal implications and mitigation of potential misuse of SER technology remain important ongoing considerations.

license:cc-by-4.0
0
18

Empathic-Insight-Face-Large

Empathic-Insight-Face-Large [Open in Colab](https://colab.research.google.com/drive/11oUMo2HX0OuD9dx5ZM4ltNvoYxbI65hu?usp=sharing)

Empathic-Insight-Face-Large is a set of 40 emotion regression models trained on the EMoNet-FACE benchmark suite. Each model predicts the intensity of a specific fine-grained emotion from facial expressions. The models are built on top of SigLIP2 image embeddings followed by MLP regression heads.

This work is based on the research paper "EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition". Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Maurice Kraus, Felix Friedrich, Huu Nguyen, Krishna Kalyan, Kourosh Nadi, Kristian Kersting, Sören Auer. (Please refer to the full paper for the complete list of authors and affiliations.) Paper link: (Insert ArXiv/Conference link here when available)

The models and datasets are released under the CC-BY-4.0 license.

The Empathic-Insight-Face-Large suite consists of 40 individual MLP models. Each model takes a 1152-dimensional SigLIP2 image embedding as input and outputs a continuous score (typically 0-7, optionally mean-subtracted) for one of the 40 emotion categories defined in the EMoNet-FACE taxonomy. The models were pre-trained on the EMoNet-FACE BIG dataset (over 203k synthetic images with generated labels) and fine-tuned on the EMoNet-FACE BINARY dataset (nearly 20k synthetic images with over 65k human expert binary annotations).

Key features:

- Fine-grained emotions: covers a novel 40-category emotion taxonomy.
- High performance: achieves human-expert-level performance on the EMoNet-FACE HQ benchmark.
- Synthetic data: trained on AI-generated, demographically balanced, full-face expressions.
- Open: publicly released models, datasets, and taxonomy.

These models are intended for research purposes in affective computing, human-AI interaction, and emotion recognition.
They can be used to:

- Analyze and predict fine-grained emotional expressions in facial images.
- Serve as a baseline for developing more advanced emotion recognition systems.
- Facilitate research into nuanced emotional understanding in AI.

Out-of-scope use: These models are trained on synthetic faces and may not generalize well to real-world, in-the-wild images without further adaptation. They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes.

These are individual `.pth` files, each corresponding to one emotion classifier. To use them, you will typically:

1. Obtain SigLIP2 embeddings: use a pre-trained SigLIP2 model (e.g., `google/siglip2-so400m-patch16-384`) to extract the 1152-dimensional image embedding for your target facial image.
2. Load an MLP model: each `.pth` file (e.g., `modelelationbest.pth`) is a PyTorch state dictionary for an MLP. The architecture used for "Empathic-Insight-Face-Large" (big models) is:
   - Input: 1152 features
   - Hidden layer 1: 1024 neurons, ReLU, dropout (0.2)
   - Hidden layer 2: 512 neurons, ReLU, dropout (0.2)
   - Hidden layer 3: 256 neurons, ReLU, dropout (0.2)
   - Output layer: 1 neuron (continuous score)
3. Perform inference: pass the SigLIP2 embedding through the loaded MLP model(s).
4. (Optional) Mean subtraction: the raw output scores can be adjusted by subtracting the model's mean score on neutral faces. The `neutralstatscache-human-binary-big-mlpsv8twostagehigherlrstage25200+` file in this repository contains these mean values for each emotion model.

```bibtex
@inproceedings{schuhmann2025emonetface,
  title={{EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition}},
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Kraus, Maurice and Friedrich, Felix and Nguyen, Huu and Kalyan, Krishna and Nadi, Kourosh and Kersting, Kristian and Auer, Sören},
  booktitle={NeurIPS},
  year={2025},
}
```
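For orientation, the documented head architecture (1152 → 1024 → 512 → 256 → 1, ReLU between layers, dropout inactive at inference) corresponds to the following pure-NumPy forward pass. This is a shape sketch with random placeholder weights, not the released PyTorch `.pth` checkpoints:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [1152, 1024, 512, 256, 1]  # layer widths from the model card
layers = [
    (rng.standard_normal((m, n)) * 0.01, np.zeros(n))
    for m, n in zip(sizes[:-1], sizes[1:])
]


def forward(z: np.ndarray) -> float:
    """Inference-time forward pass of one emotion head (dropout is a no-op here)."""
    x = z
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU after each hidden layer
    return float(x.squeeze())       # one continuous emotion score


siglip_embedding = rng.standard_normal(1152)  # placeholder for a real SigLIP2 embedding
raw_score = forward(siglip_embedding)
# Optional mean subtraction using the published neutral-face statistics would follow here.
```

With the real checkpoints, the random `layers` would instead be loaded from each `.pth` state dictionary.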

license:cc-by-4.0
0
10

erlich

license:mit
0
9

scaling-laws-openclip

0
7

Empathic-Insight-Voice-Large

license:cc-by-4.0
0
7

Empathic-Insight-Face-Small

license:cc-by-4.0
0
5

scaling-laws-for-comparison

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598) We provide the pre-trained models used in the paper Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598). Please refer to the official GitHub repository for more information on how to reproduce the results and how to download and use the models.

0
2

whisper-captioning-ensemble

license:cc-by-4.0
0
1

openMaMMUT-ViT-B-32-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k

NaNK
license:apache-2.0
0
1

ongo

0
1

SNAC-24khz-decoder-onnx

This repo provides ONNX decoders for the SNAC 24 kHz codec so you can decode SNAC tokens on-device, including in the browser with `onnxruntime-web`.

Why? If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio in the user’s browser/CPU (or WebGPU when available).

> In a Colab CPU test, we saw ~2.1× real-time decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

- `snac24int2wavstatic.onnx` — int → wav decoder.
  Inputs (int64): `codes0`: `[1, 12]`, `codes1`: `[1, 24]`, `codes2`: `[1, 48]`.
  Output: `audio`: `float32 [1, 1, 24576]` (24 kHz).
  Shapes correspond to a 48-frame window. Each frame is 512 samples, so one window = 24576 samples ≈ 1.024 s at 24 kHz. Token alignment: `L0*4 = L1*2 = L2*1 = shared_frames`.
- `snac24latent2wavstatic.onnx` — latent → wav decoder.
  Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`.
  Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).
- `snac24quantizers.json` — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.

Serve these files from a local server with cross-origin isolation for multithreaded WASM (e.g., COOP/COEP headers). If not isolated, WASM will typically run single-threaded.

```javascript
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];

  // Enable SIMD; threads only if crossOriginIsolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

  const session = await ort.InferenceSession.create('snac24int2wavstatic.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
  const T0 = 12, T1 = 24, T2 = 48;
  const feed = {
    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x => BigInt(x)), [1, T0]),
    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x => BigInt(x)), [1, T1]),
    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x => BigInt(x)), [1, T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array [1,1,24576]

  // Play it (24 kHz)
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  src.start();

  console.log({ usedEP: providers[0], inferms: (t1 - t0).toFixed(2), samples: audio.length });
})();
```

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams.

Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded. WebGPU can accelerate on desktop and mobile when kernels are supported; this model usually falls back to WASM if not.
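The frame/window arithmetic stated above can be sanity-checked in a few lines of Python (constants are copied from this card; this is a sketch, not part of the released code):

```python
# Shape arithmetic for one SNAC 24 kHz decoder window.
SAMPLE_RATE = 24_000      # Hz
SAMPLES_PER_FRAME = 512
FRAMES_PER_WINDOW = 48

samples_per_window = SAMPLES_PER_FRAME * FRAMES_PER_WINDOW  # 24576 samples
seconds_per_window = samples_per_window / SAMPLE_RATE       # ~1.024 s

# Token counts per window for the three code levels (coarse -> fine),
# matching the codes0/codes1/codes2 input shapes of the int -> wav decoder.
tokens = {"codes0": 12, "codes1": 24, "codes2": 48}
assert tokens["codes0"] * 4 == tokens["codes1"] * 2 == tokens["codes2"] == FRAMES_PER_WINDOW

print(samples_per_window, seconds_per_window)  # 24576 1.024
```

The same numbers determine how many tokens a streaming front-end must emit per ~1 s chunk of audio.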

0
1