laion
clap-htsat-fused
Model card for CLAP: Contrastive Language-Audio Pretraining
CLIP-ViT-H-14-laion2B-s32B-b79K
---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
library_name: open_clip
pipeline_tag: zero-shot-image-classification
---
CLIP-ViT-B-32-laion2B-s34B-b79K
---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
pipeline_tag: zero-shot-image-classification
---
larger_clap_general
---
license: apache-2.0
---
CLIP-convnext_base_w-laion2B-s13B-b82K-augreg
---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
tags:
  - clip
---
CLIP-ViT-B-16-laion2B-s34B-b88K
---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
---
clap-htsat-unfused
Model card for CLAP: Contrastive Language-Audio Pretraining

0. TL;DR
1. Model Details
2. Usage
3. Uses
4. Citation

> Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.

You can use this model for zero-shot audio classification or for extracting audio and/or textual features. You can also get the audio and text embeddings using `ClapModel`. If you use this model in your work, please consider citing the original paper:
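The `ClapModel` usage mentioned above can be sketched with the Hugging Face `transformers` API. A minimal sketch, assuming the `laion/clap-htsat-unfused` checkpoint and CLAP's 48 kHz input rate; the random clip is just a stand-in for real audio:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP expects 48 kHz mono audio; one second of noise as a stand-in clip
audio = np.random.default_rng(0).standard_normal(48000)
labels = ["Sound of a dog", "Sound of a vacuum cleaner"]

inputs = processor(text=labels, audios=[audio], sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One probability distribution over the candidate labels per audio clip
probs = outputs.logits_per_audio.softmax(dim=-1)
```

For feature extraction only, `model.get_audio_features(...)` and `model.get_text_features(...)` return the embeddings directly.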
CLIP-ViT-L-14-laion2B-s32B-b82K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-L/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training ('babysitting') done by Ross Wightman on the JUWELS Booster supercomputer. See acknowledgements below.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community.
While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The model was trained on 384 A100 GPUs using 200M-sample 'virtual' epochs, where dataset shards were sampled with replacement. The model was trained for 160 virtual epochs, for a total of 32B samples seen. The first 68 epochs were trained with float16 AMP, global batch size 79K (208 per GPU). Training initially ran to epoch 75, where the loss spiked and training failed with NaN. Romain Beaumont was training H/14 and g/14 models at the same time on the Stability cluster and hit similar instabilities. Collectively we tried restarts with:
- a different dataset shuffle seed
- a different LR
- gradient clipping
- modifications to the architecture:
  - norm modifications (stable norm for final, post-embed norm for text transformer) as per https://github.com/mlfoundations/open_clip/pull/153, thanks to Phil Wang
  - extra attention-block norms a la Normformer (https://arxiv.org/abs/2110.09456)
  - scaled cosine attention a la Swin-V2 (https://arxiv.org/abs/2111.09883)

None of the above ended up working. Most blew up within the same epoch as the original run, with the exception of the architecture mods. The Normformer mods significantly altered the network, such that resuming did not quickly converge to previous performance; this was abandoned but might be worth trying from the start. Scaled cosine attention initially looked promising and lasted until epoch 90, before the loss suddenly increased and appeared to remain 'stuck'. In the end, restarting at epoch 69 with `float32` precision solved all instabilities, and training continued from there with global batch size 86K (224 per GPU). On A100 GPUs, `float32` had minimal impact on throughput once `tf32` matmuls were enabled in PyTorch: approximately 10% slower than `float16 AMP`.
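The `tf32` matmul setting referred to above is a pair of global PyTorch switches; a minimal sketch:

```python
import torch

# float32 training regained most float16-AMP throughput on A100s once
# TensorFloat-32 was enabled; in PyTorch these are global flags:
torch.backends.cuda.matmul.allow_tf32 = True  # matmuls may use TF32 tensor cores
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions likewise
```

TF32 only takes effect on Ampere-or-newer GPUs (e.g. A100); on other hardware the flags are accepted but have no effect.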
Romain similarly changed the precision, but ended up using `bfloat16 AMP` to resolve his issues.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 75.3% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
CLIP-ViT-L-14-DataComp.XL-s13B-b90K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-L/14 model trained on DataComp-1B (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/open_clip).

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the DataComp paper (https://arxiv.org/abs/2304.14108) includes additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use.

This model was trained with the 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation was done on 38 datasets, using the DataComp repo and the LAION CLIP Benchmark.
The testing is performed on a suite of 38 datasets; see our paper (https://arxiv.org/abs/2304.14108) for more details. The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k; see the paper for full results. Acknowledging stability.ai for the compute used to train this model.
CLIP-convnext_large_d.laion2B-s26B-b102K-augreg
CLIP-ViT-bigG-14-laion2B-39B-b160k
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-bigG/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Mitchell Wortsman on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). Fine-tuning was also partially done on LAION-A, a 900M subset of LAION-2B filtered with aesthetic V2 4.5+ and phash-deduplicated. IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well.
We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The training procedure will soon be discussed in a blog post on laion.ai.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves an 80.1% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, and will soon be visible at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model.

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
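Pending the official snippets, a hedged sketch of text-embedding extraction with the Hugging Face `transformers` CLIP classes, assuming this repo ships transformers-format weights (note the checkpoint is several GB):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# The same pattern works for any laion CLIP repo with transformers weights.
name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)

# L2-normalize so dot products are cosine similarities
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
```

`model.get_image_features(...)` with `processor(images=...)` yields image embeddings in the same shared space.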
CLIP-ViT-B-32-DataComp.XL-s13B-b90K
CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup
CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
Model Card for CLIP ViT-B/32 xlm-roberta-base - LAION-5B

1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-B/32 xlm-roberta-base model trained on LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

This model was trained with the full LAION-5B (https://laion.ai/blog/laion-5b/). Training used batch size 90K for 13B samples of LAION-5B; see https://wandb.ai/rom1504/open-clip/reports/xlm-roberta-base-B-32--VmlldzoyOTQ5OTE2 The model is B/32 on the visual side, with an xlm-roberta-base initialized with pretrained weights on the text side.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves:
- ImageNet-1k: 62.33% (vs 62.9% for the baseline)
- MS-COCO: 63.4% (vs 60.8% for the baseline)
- Flickr30k: 86.2% (vs 85.4% for the baseline)

A preliminary multilingual evaluation was run: 43% on Italian ImageNet-1k (vs 21% for the English B/32) and 37% on Japanese ImageNet-1k (vs 1% for the English B/32 and 50% for the B/16 Japanese CLIP). It shows that the multilingual property is indeed there as expected. Larger models will get even better performance.

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:
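The COCO/Flickr figures above are retrieval recall scores. The scoring step behind such numbers can be sketched in pure PyTorch with random stand-in features (real use would substitute the model's image/text embeddings; the helper name is illustrative):

```python
import torch

torch.manual_seed(0)

def recall_at_k(img_feats: torch.Tensor, txt_feats: torch.Tensor, k: int) -> float:
    """Image-to-text recall@k, assuming caption i is the match for image i."""
    img = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    sims = img @ txt.T                    # cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices   # best k captions per image
    targets = torch.arange(len(img)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# With identical features, each image's own caption ranks first
feats = torch.randn(8, 512)
print(recall_at_k(feats, feats, k=1))  # → 1.0
```

Text-to-image recall is the same computation with the arguments swapped.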
CLIP-ViT-g-14-laion2B-s12B-b42K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-g/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community.
While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 76.6% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
CLIP-ViT-B-16-DataComp.XL-s13B-b90K
CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
larger_clap_music_and_speech
CLIP-ViT-g-14-laion2B-s34B-b88K
CLIP-ViT-L-14-CommonPool.XL-s13B-b90K
CLIP-convnext_base-laion400M-s13B-b51K
CLIP-convnext_base_w-laion2B-s13B-b82K
CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg
mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k
CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg
larger_clap_music
CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K
CLIP-ViT-B-16-DataComp.L-s1B-b8K
CoCa-ViT-B-32-laion2B-s13B-b90k
CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K
CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft
CLIP-ViT-B-16-CommonPool.L-s1B-b8K
BUD-E-Whisper
BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content. The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of multi-layer perceptrons (MLPs) designed to predict dimensional emotion scores.

This model is released under the CC-BY-4.0 license. Please give attribution to Maurice Kraus & Christoph Schuhmann, who created this model.

Colab: https://colab.research.google.com/drive/1VoAtmNhY1hI5Yzv1dppHTcYky82OCDK?usp=sharing

BUD-E Whisper was trained on a combination of:
- the LAION's Got Talent (Enhanced Flash Annotations and Long Captions) dataset
- an internal dataset comprising approximately 5,000 hours of public vlogs and similar audio content

A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:
1. Initial score generation: an iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (0-4 scale) and 15 additional dimensions such as age, arousal, valence, dominance, harshness, vocal bursts, ... for all audio snippets.
2. Templated captions: these scores were converted into templated string captions.
3. Paraphrasing for richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
4. Fine-tuning: various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally aware captions.

This multi-step caption refinement was crucial for performance.
Direct score regression and simple templated captions were both found to lead to suboptimal performance for emotional speech captioning with Whisper models.

Intended uses: generating emotionally nuanced captions for audio content, and providing rich embeddings for downstream emotion-recognition tasks (e.g., with Empathic Insight - Voice).
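The second refinement step above (scores to templated captions) can be sketched in plain Python. The dimension names, the 2.0 salience cutoff, and the template wording here are illustrative assumptions, not the exact templates used for BUD-E:

```python
def scores_to_caption(scores: dict) -> str:
    """Turn per-dimension emotion scores (0-4 scale, as described above)
    into a templated caption string. Names/wording are illustrative."""
    salient = {k: v for k, v in scores.items() if v >= 2.0}
    if not salient:
        return "The speaker sounds emotionally neutral."
    parts = [f"{name} ({value:.1f}/4)"
             for name, value in sorted(salient.items(), key=lambda kv: -kv[1])]
    return "The speaker expresses " + ", ".join(parts) + "."

print(scores_to_caption({"joy": 3.5, "surprise": 2.0, "anger": 0.5}))
# → The speaker expresses joy (3.5/4), surprise (2.0/4).
```

In the actual pipeline such templated strings were then paraphrased by Gemini Flash 2.0 before fine-tuning.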
CoCa-ViT-L-14-laion2B-s13B-b90k
CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K
swesmith-nl2bash-stack-bugsseq
nl2bash-swesmith-stack-bugsseq
open-thoughts-4-code-qwen3-32b-annotated-32k
open-thoughts-4-code-qwen3-32b-annotated
mscoco_finetuned_CoCa-ViT-B-32-laion2B-s13B-b90k
r2egym-stack-bugsseq
r2egym-nl2bash-stackseq
exp-uns-r2egym-33_6x_glm_4_7_traces_jupiter
CLIP-convnext_base_w-laion_aesthetic-s13B-b82K
bugs-r2egym-stackseq
r2egym-nl2bash-bugsseq
r2egym-nl2bash-stack-bugsseq
CLIP-ViT-B-32-CommonPool.M.clip-s128M-b4K
Qwen3-8B_exp_tas_temp_0.5_traces_save-strategy_steps
CLIP-ViT-B-32-CommonPool.S.text-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.laion-s13M-b4K
r2egym-stackseq
r2egym-bugsseq
r2egym-nl2bashseq
open-thoughts-4-code-qwen3-32b-annotated-gbs256-4node
glm46-swesmith-maxeps-131k
CLIP-ViT-B-32-CommonPool.M-s128M-b4K
exp_tas_optimal_combined_traces-Qwen3.5-9B
CLIP-ViT-B-16-CommonPool.L.laion-s1B-b8K
minimax-m2-stack-overflow-32ep-131k-summtrc
glm-4_6-stack-overflow-32ep-131k-summtrc
gpt-oss-120B-stack-overflow-32ep-131k-summtrc-fixthink1
gpt-oss-120B-stack-overflow-32ep-131k-summtrc
stackexchange-tezos-sandboxes_glm_4_6_traces_together
CLIP-ViT-B-32-DataComp.M-s128M-b4K
qwen3-coder-480B-stack-overflow-32ep-131k-summtrc
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind
CLIP-ViT-B-32-CommonPool.M.laion-s128M-b4K
CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K
CLIP-ViT-B-32-CommonPool.M.image-s128M-b4K
CLIP-ViT-B-16-CommonPool.L.basic-s1B-b8K
stackexchange-tezos-sandboxes_glm_4_6_traces_together_again
kimi-k2t-freelancer-32ep-32k
CLIP-ViT-B-32-CommonPool.M.basic-s128M-b4K
glm-4_6-freelancer-32ep-131k-torch
GLM-4_6-stackexchange-overflow-sandboxes-32eps-65k-reasoning
CLIP-ViT-B-32-CommonPool.S.clip-s13M-b4K
sft_GLM-4-7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k_Qwen3-32B
glm46-neulab-synatra-32ep-131k
MiniMax-M2-freelancer-32ep-32k
CLIP-ViT-B-16-CommonPool.L.clip-s1B-b8K
claude-4-5-sonnet-thinking-stackexchange-overflow-32ep-32k-traces
CLIP-ViT-B-32-DataComp.S-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.basic-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.image-s13M-b4K
CLIP-ViT-B-32-CommonPool.S-s13M-b4K
glm46-stackexchange-tezos-maxeps-131k
r2egym-gpt5-codex-160ep-1M
stackexchange-tezos-sandboxes_glm_4_7_traces_locetash
stackexchange-tezos-sandboxes_glm_4_6_traces_locetash_again
glm-4_6-stackexchange-tezos-32ep-131k
timbre-whisper
GLM-4_6-stackexchange-superuser-32ep-32k
sft__Kimi-2-5-inferredbugs-sandboxes-maxeps-32k__Qwen3-8B
exp_tas_parser_xml_traces
exp_tas_min_p_0_1_traces
exp-syh-r2egym-askllm-constrained_glm_4_7_traces_jupiter
glm46-swegym-tasks-maxeps-131k
exp_tas_baseline_traces
exp_tas_full_thinking_traces
exp_tas_linear_history_off_traces
GLM-4_6-inferredbugs-32ep-65k-reasoning
exp_tas_low_diversity_traces
exp_tas_frequency_penalty_0_5_traces
exp_tas_frequency_penalty_0_25_traces
glm46-swesmith-maxeps-131k-lc
exp_tas_interleaved_thinking_on_traces
glm-4_6-nemo-prism
CLIP-ViT-B-16-CommonPool.L.image-s1B-b8K
alfworld-swesmith-r2egym-swegym-131k-lc
CLIP-ViT-B-16-CommonPool.L.text-s1B-b8K
stackexchange-tezos-sandboxes_glm_4_6_traces_locetash
rl__24GPU_shaped__swe_rebench_patched_oracle__r2egym-nl2bash-stack
r2egym-nl2bash-stack-bugsseq-fixthink
Qwen3-Coder-480B-codeforces-fixeps_Qwen3-8B
r2egym-nl2bash-stack-bugsseq-fixthink-again
glm46-glaive-code-assistant-sandboxes-maxeps-131k
GLM-4_6-freelancer-32eps-131k
glm-4_6-all-puzzles-32ep-131k
openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_DFN512x512-s293M-b32k
exp_tas_optimal_combined_traces
glm46-defects4j-32ep-131k
openMaMMUT-ViT-B-32-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k
voice-tagging-whisper
sft_r2egym-nl2bash-stackoverflow-inferredbugs-32B_Qwen3-32B
exp-gfi-staqc-askllm-filtered-10K_glm_4_7_traces_jupiter_cleaned
openMaMMUT-ViT-B-16-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k
rl__r2egym_deepswe_fp8_terminus-2_32b
GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_adam-beta1_0-97_Qwen3-32B
openMaMMUT-ViT-L-14-DataComp-1.4B-s12.8B-b180K
anh-bloomz-7b1-mt-cross-lingual
glm-4_6-r2egym-32ep-32k
exp-uns-r2egym-2_1x_glm_4_7_traces_jupiter
exp-syh-tezos-stackoverflow-mixed_glm_4_7_traces_jupiter
GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k
exp-syh-tezos-askllm-constrained_glm_4_7_traces_jupiter
openMaMMUT-ViT-B-16-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
exp_tas_top_p_0_95_traces
exp_tas_max_tokens_2048_traces
distil-whisper-large-v3_openvino_int8
Mantis-8B-siglip-llama3_openvino_int8
anh-xglm-7.5b-cross-lingual
LLaVA-Video-7B-Qwen2_openvino_int8
exp_tas_top_p_0_8_traces
DALLE2-PyTorch
Empathic-Insight-Voice-Small
Empathic-Insight-Voice-Small [](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGFtJ3YdFF8BkUA2)

Empathic-Insight-Voice-Small is a suite of 40+ emotion and attribute regression models trained on the large-scale, multilingual synthetic voice-acting dataset LAION'S GOT TALENT (~5,000 hours) and an "in the wild" dataset of voice snippets (also ~5,000 hours). Each model predicts the intensity of a specific fine-grained emotion or attribute from speech audio. The models leverage embeddings from a fine-tuned Whisper model (laion/BUD-E-Whisper) followed by dedicated MLP regression heads for each dimension.

This work is based on the research paper "EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection".

The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes such as valence, arousal, and gender). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.

These models are intended for research purposes in affective computing, speech emotion recognition (SER), human-AI interaction, and voice AI development. They can be used to:

- Analyze and predict fine-grained emotional states and vocal attributes from speech.
- Serve as a baseline for developing more advanced SER systems.
- Facilitate research into nuanced emotional understanding in voice AI.
- Explore multilingual and cross-cultural aspects of speech emotion (given the multilingual foundation dataset).

Out-of-Scope Use: These models are trained largely on synthetic speech, and their generalization to spontaneous real-world speech needs further evaluation.
They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes or infringe on privacy without due diligence and ethical review.

The primary way to use these models is through the provided Google Colab notebook. The notebook handles dependencies, model loading, and audio processing, and provides examples for:

- Batch processing a folder of audio files.
- Generating a comprehensive HTML report with per-file emotion scores, waveforms, and audio players.
- Generating individual JSON files with all predicted scores for each audio file.

The core 40 emotion categories are (from EMONET-VOICE, Appendix A.1): Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph. Additional vocal attributes (e.g., Valence, Arousal, Gender, Age, pitch characteristics) are also predicted by corresponding MLP models in the suite; the full list of predictable dimensions can be inferred from the `FILENAMEPARTTOTARGETKEYMAP` in the Colab notebook (Cell 2).

Below is a conceptual example of how to perform inference on a single audio file, extracting emotion and attribute scores. For the full, runnable version, please refer to the Colab notebook.
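A minimal sketch of that idea in PyTorch, assuming a 1280-dimensional audio embedding (the size used by Whisper large; the actual dimension depends on the laion/BUD-E-Whisper checkpoint you load). The head architecture and the emotion names used here are illustrative; the released checkpoints and the Colab notebook define the real sizes and file names.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """One MLP regression head: Whisper audio embedding -> scalar intensity score.
    Hidden size is illustrative; the released checkpoints define the real shapes."""
    def __init__(self, embed_dim: int = 1280, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# One head per emotion/attribute; in practice each head would load its .pth
# state dict, e.g. head.load_state_dict(torch.load(path, map_location="cpu")).
heads = {name: EmotionHead().eval() for name in ["Amusement", "Anger", "Relief"]}

# Stand-in for the pooled Whisper embedding of a single audio file (batch of 1).
embedding = torch.randn(1, 1280)

with torch.no_grad():
    scores = {name: head(embedding).item() for name, head in heads.items()}
print(scores)
```

Running every head over the same cached embedding keeps per-file cost low: the expensive Whisper forward pass happens once, and each MLP adds only a few matrix multiplies.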
The EMONET-VOICE suite was developed with ethical considerations as a priority:

- Privacy Preservation: The use of synthetic voice generation fundamentally circumvents privacy concerns associated with collecting real human emotional expressions, especially for sensitive states.
- Responsible Use: These models are released for research. Users are urged to consider the ethical implications of their applications and to avoid misuse, such as emotional manipulation, surveillance, or uses that could lead to unfair, biased, or harmful outcomes.

The broader societal implications and mitigation of potential misuse of SER technology remain important ongoing considerations.
Empathic-Insight-Face-Large
Empathic-Insight-Face-Large [](https://colab.research.google.com/drive/11oUMo2HX0OuD9dx5ZM4ltNvoYxbI65hu?usp=sharing)

Empathic-Insight-Face-Large is a set of 40 emotion regression models trained on the EMoNet-FACE benchmark suite. Each model predicts the intensity of a specific fine-grained emotion from facial expressions. The models are built on top of SigLIP2 image embeddings followed by MLP regression heads.

This work is based on the research paper "EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition". Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Maurice Kraus, Felix Friedrich, Huu Nguyen, Krishna Kalyan, Kourosh Nadi, Kristian Kersting, Sören Auer. (Please refer to the full paper for the complete list of authors and affiliations.) Paper link: (Insert ArXiv/Conference link here when available)

The models and datasets are released under the CC-BY-4.0 license.

The Empathic-Insight-Face-Large suite consists of 40 individual MLP models. Each model takes a 1152-dimensional SigLIP2 image embedding as input and outputs a continuous score (typically 0-7, optionally mean-subtracted) for one of the 40 emotion categories defined in the EMoNet-FACE taxonomy. The models were pre-trained on the EMoNet-FACE BIG dataset (over 203k synthetic images with generated labels) and fine-tuned on the EMoNet-FACE BINARY dataset (nearly 20k synthetic images with over 65k human expert binary annotations).

Key Features:

- Fine-grained Emotions: Covers a novel 40-category emotion taxonomy.
- High Performance: Achieves human-expert-level performance on the EMoNet-FACE HQ benchmark.
- Synthetic Data: Trained on AI-generated, demographically balanced, full-face expressions.
- Open: Publicly released models, datasets, and taxonomy.

These models are intended for research purposes in affective computing, human-AI interaction, and emotion recognition.
They can be used to:

- Analyze and predict fine-grained emotional expressions in facial images.
- Serve as a baseline for developing more advanced emotion recognition systems.
- Facilitate research into nuanced emotional understanding in AI.

Out-of-Scope Use: These models are trained on synthetic faces and may not generalize well to real-world, in-the-wild images without further adaptation. They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes.

These are individual `.pth` files, each corresponding to one emotion classifier. To use them, you will typically:

1. Obtain SigLIP2 Embeddings: Use a pre-trained SigLIP2 model (e.g., `google/siglip2-so400m-patch16-384`) to extract the 1152-dimensional image embedding for your target facial image.
2. Load an MLP Model: Each `.pth` file (e.g., `modelelationbest.pth`) is a PyTorch state dictionary for an MLP. The MLP architecture used for Empathic-Insight-Face-Large (big models) is:
   - Input: 1152 features
   - Hidden Layer 1: 1024 neurons, ReLU, Dropout (0.2)
   - Hidden Layer 2: 512 neurons, ReLU, Dropout (0.2)
   - Hidden Layer 3: 256 neurons, ReLU, Dropout (0.2)
   - Output Layer: 1 neuron (continuous score)
3. Perform Inference: Pass the SigLIP2 embedding through the loaded MLP model(s).
4. (Optional) Mean Subtraction: The raw output scores can be adjusted by subtracting the model's mean score on neutral faces. The `neutralstatscache-human-binary-big-mlpsv8twostagehigherlrstage25200+` file in this repository contains these mean values for each emotion model.

```bibtex
@inproceedings{schuhmann2025emonetface,
  title={{EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition}},
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Kraus, Maurice and Friedrich, Felix and Nguyen, Huu and Kalyan, Krishna and Nadi, Kourosh and Kersting, Kristian and Auer, Sören},
  booktitle={NeurIPS},
  year={2025},
}
```
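The MLP architecture described above can be sketched directly in PyTorch. This is a minimal sketch: the checkpoint path is a placeholder, and whether the released state dicts use this exact `nn.Sequential` key layout is an assumption to verify against the files themselves.

```python
import torch
import torch.nn as nn

def make_emotion_mlp() -> nn.Sequential:
    """MLP head matching the stated architecture:
    1152 -> 1024 -> 512 -> 256 -> 1, with ReLU and Dropout(0.2) between layers."""
    return nn.Sequential(
        nn.Linear(1152, 1024), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(256, 1),
    )

mlp = make_emotion_mlp()
# In practice, load one released checkpoint (path is a placeholder):
# mlp.load_state_dict(torch.load("path/to/emotion.pth", map_location="cpu"))
mlp.eval()  # disables dropout for inference

# Stand-in for a SigLIP2 image embedding (1 image x 1152 features).
embedding = torch.randn(1, 1152)
with torch.no_grad():
    score = mlp(embedding).item()  # raw continuous score for this one emotion
```

Step 4 (mean subtraction) then amounts to `score - neutral_mean` using the per-emotion value from the neutral-stats cache file.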
erlich
scaling-laws-openclip
Empathic-Insight-Voice-Large
Empathic-Insight-Face-Small
scaling-laws-for-comparison
Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598)

We provide the pre-trained models used in the paper Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598). Please refer to the official GitHub repository for more information on how to reproduce the results and how to download and use the models.
whisper-captioning-ensemble
openMaMMUT-ViT-B-32-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
ongo
SNAC-24khz-decoder-onnx
This repo provides ONNX decoders for the SNAC 24 kHz codec so you can decode SNAC tokens on-device, including in the browser with `onnxruntime-web`.

Why? If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio in the user's browser/CPU (or WebGPU when available).

> In a Colab CPU test, we saw ~2.1× real-time decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

- `snac24int2wavstatic.onnx` — int → wav decoder. Inputs (int64): `codes0` `[1, 12]`, `codes1` `[1, 24]`, `codes2` `[1, 48]`. Output: `audio` `float32 [1, 1, 24576]` (24 kHz). Shapes correspond to a 48-frame window. Each frame is 512 samples, so one window = 24576 samples ≈ 1.024 s at 24 kHz. Token alignment: all three code levels cover the same shared frames, i.e. 12 × 4 = 24 × 2 = 48 × 1 = 48 frames.
- `snac24latent2wavstatic.onnx` — latent → wav decoder. Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`. Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).
- `snac24quantizers.json` — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.

Serve these files from a local server with cross-origin isolation for multithreaded WASM (e.g., COOP/COEP headers). If not isolated, WASM will typically run single-threaded.

```javascript
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];

  // Enable SIMD; threads only if crossOriginIsolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

  const session = await ort.InferenceSession.create('snac24int2wavstatic.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
  const T0 = 12, T1 = 24, T2 = 48;
  const feed = {
    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x => BigInt(x)), [1, T0]),
    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x => BigInt(x)), [1, T1]),
    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x => BigInt(x)), [1, T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array of 24576 samples ([1,1,24576] flattened)

  // Play it (24 kHz)
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  src.start();

  console.log({ usedEP: providers[0], inferms: (t1 - t0).toFixed(2), samples: audio.length });
})();
```

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams. Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded. WebGPU can accelerate on desktop and mobile when kernels are supported; this model usually falls back to WASM if not.
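The few-millisecond cross-fade between decoded windows can be done with a linear ramp over the overlap region. A minimal numpy sketch (ramp length and sample count are illustrative; 120 samples at 24 kHz is 5 ms):

```python
import numpy as np

def crossfade(prev_chunk: np.ndarray, next_chunk: np.ndarray,
              fade_samples: int = 120) -> np.ndarray:
    """Join two decoded audio chunks, linearly fading the tail of prev_chunk
    into the head of next_chunk over `fade_samples` samples to hide the seam."""
    ramp = np.linspace(0.0, 1.0, fade_samples, dtype=np.float32)
    overlap = (prev_chunk[-fade_samples:] * (1.0 - ramp)
               + next_chunk[:fade_samples] * ramp)
    return np.concatenate([prev_chunk[:-fade_samples],
                           overlap,
                           next_chunk[fade_samples:]])

# Two 24576-sample windows (one SNAC decode window each at 24 kHz).
a = np.ones(24576, dtype=np.float32)
b = np.zeros(24576, dtype=np.float32)
out = crossfade(a, b)
# Joined length: len(a) + len(b) - fade_samples, since the fade region overlaps.
```

The same arithmetic carries over directly to the Float32Array output of the browser decoder: keep the last `fade_samples` of the previous window in memory and blend it with the start of the next one before scheduling playback.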