laion
clap-htsat-fused
Model card for CLAP: Contrastive Language-Audio Pretraining
CLIP-ViT-H-14-laion2B-s32B-b79K
---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
library_name: open_clip
pipeline_tag: zero-shot-image-classification
---
CLIP-ViT-B-32-laion2B-s34B-b79K
---
license: mit
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
    candidate_labels: playing music, playing sports
    example_title: Cat & Dog
pipeline_tag: zero-shot-image-classification
---
larger_clap_general
---
license: apache-2.0
---
CLIP-convnext_base_w-laion2B-s13B-b82K-augreg
---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
tags:
  - clip
---
CLIP-ViT-B-16-laion2B-s34B-b88K
---
license: mit
pipeline_tag: zero-shot-image-classification
library_name: open_clip
---
clap-htsat-unfused
Model card for CLAP: Contrastive Language-Audio Pretraining

0. TL;DR
1. Model Details
2. Usage
3. Uses
4. Citation

> Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zero-shot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-630K and the proposed model are both available to the public.

You can use this model for zero-shot audio classification or for extracting audio and/or textual features. You can also get the audio and text embeddings using `ClapModel`. If you use this model in your work, please consider citing the original paper:
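The `ClapModel` usage mentioned above can be sketched with the Hugging Face `transformers` API. A minimal sketch, assuming the `laion/clap-htsat-unfused` checkpoint and CLAP's 48 kHz input rate; the random clip is just a stand-in for real audio:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP expects 48 kHz mono audio; one second of noise as a stand-in clip
audio = np.random.default_rng(0).standard_normal(48000)
labels = ["Sound of a dog", "Sound of a vacuum cleaner"]

inputs = processor(text=labels, audios=[audio], sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One probability distribution over the candidate labels per audio clip
probs = outputs.logits_per_audio.softmax(dim=-1)
```

For feature extraction only, `model.get_audio_features(...)` and `model.get_text_features(...)` return the embeddings directly.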
CLIP-ViT-L-14-laion2B-s32B-b82K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-L/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training ('babysitting') done by Ross Wightman on the JUWELS Booster supercomputer. See acknowledgements below.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community.
While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The model was trained on 384 A100 GPUs using 200M-sample 'virtual' epochs, where dataset shards were sampled with replacement. The model was trained for 160 virtual epochs, for a total of 32B samples seen. The first 68 epochs were trained with float16 AMP, global batch size 79K (208 per GPU). Training initially ran to epoch 75, where the loss spiked and training failed with NaN. Romain Beaumont was training H/14 and g/14 models at the same time on the Stability cluster and hit similar instabilities. Collectively we tried restarts with:
- a different dataset shuffle seed
- a different LR
- gradient clipping
- modifications to the architecture:
  - norm modifications (stable norm for final, post-embed norm for text transformer) as per https://github.com/mlfoundations/open_clip/pull/153, thanks to Phil Wang
  - extra attention-block norms a la Normformer (https://arxiv.org/abs/2110.09456)
  - scaled cosine attention a la Swin-V2 (https://arxiv.org/abs/2111.09883)

None of the above ended up working. Most blew up within the same epoch as the original run, with the exception of the architecture mods. The Normformer mods significantly altered the network, such that resuming did not quickly converge to previous performance; this was abandoned but might be worth trying from the start. Scaled cosine attention initially looked promising and lasted until epoch 90, before the loss suddenly increased and appeared to remain 'stuck'. In the end, restarting at epoch 69 with `float32` precision solved all instabilities, and training continued from there with global batch size 86K (224 per GPU). On A100 GPUs, `float32` had minimal impact on throughput once `tf32` matmuls were enabled in PyTorch: approximately 10% slower than `float16 AMP`.
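The `tf32` matmul setting referred to above is a pair of global PyTorch switches; a minimal sketch:

```python
import torch

# float32 training regained most float16-AMP throughput on A100s once
# TensorFloat-32 was enabled; in PyTorch these are global flags:
torch.backends.cuda.matmul.allow_tf32 = True  # matmuls may use TF32 tensor cores
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions likewise
```

TF32 only takes effect on Ampere-or-newer GPUs (e.g. A100); on other hardware the flags are accepted but have no effect.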
Romain similarly changed the precision, but ended up using `bfloat16 AMP` to resolve his issues.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 75.3% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
CLIP-ViT-L-14-DataComp.XL-s13B-b90K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-L/14 model trained on DataComp-1B (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/open_clip).

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the DataComp paper (https://arxiv.org/abs/2304.14108) includes additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use.

This model was trained with the 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation was done on 38 datasets, using the DataComp repo and the LAION CLIP Benchmark.
The testing is performed on a suite of 38 datasets; see our paper (https://arxiv.org/abs/2304.14108) for more details. The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k; see the paper for full results. Acknowledging stability.ai for the compute used to train this model.
CLIP-convnext_large_d.laion2B-s26B-b102K-augreg
CLIP-ViT-bigG-14-laion2B-39B-b160k
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-bigG/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Mitchell Wortsman on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). Fine-tuning was also partially done on LAION-A, a 900M subset of LAION-2B filtered with aesthetic V2 4.5+ and phash-deduplicated. IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well.
We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community. While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

The training procedure will soon be discussed in a blog post on laion.ai.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves an 80.1% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, and will soon be visible at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model.

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
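Pending the official snippets, a hedged sketch of text-embedding extraction with the Hugging Face `transformers` CLIP classes, assuming this repo ships transformers-format weights (note the checkpoint is several GB):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# The same pattern works for any laion CLIP repo with transformers weights.
name = "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

inputs = processor(text=["a photo of a cat", "a photo of a dog"],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)

# L2-normalize so dot products are cosine similarities
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
```

`model.get_image_features(...)` with `processor(images=...)` yields image embeddings in the same shared space.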
CLIP-ViT-B-32-DataComp.XL-s13B-b90K
CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup
CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k
Model Card for CLIP ViT-B/32 xlm-roberta-base - LAION-5B

1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-B/32 xlm-roberta-base model trained on LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

This model was trained with the full LAION-5B (https://laion.ai/blog/laion-5b/). Training used batch size 90K for 13B samples of LAION-5B; see https://wandb.ai/rom1504/open-clip/reports/xlm-roberta-base-B-32--VmlldzoyOTQ5OTE2 The model is B/32 on the visual side, with an xlm-roberta-base initialized with pretrained weights on the text side.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves:
- ImageNet-1k: 62.33% (vs 62.9% for the baseline)
- MS-COCO: 63.4% (vs 60.8% for the baseline)
- Flickr30k: 86.2% (vs 85.4% for the baseline)

A preliminary multilingual evaluation was run: 43% on Italian ImageNet-1k (vs 21% for the English B/32) and 37% on Japanese ImageNet-1k (vs 1% for the English B/32 and 50% for the B/16 Japanese CLIP). It shows that the multilingual property is indeed there as expected. Larger models will get even better performance.

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:
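The COCO/Flickr figures above are retrieval recall scores. The scoring step behind such numbers can be sketched in pure PyTorch with random stand-in features (real use would substitute the model's image/text embeddings; the helper name is illustrative):

```python
import torch

torch.manual_seed(0)

def recall_at_k(img_feats: torch.Tensor, txt_feats: torch.Tensor, k: int) -> float:
    """Image-to-text recall@k, assuming caption i is the match for image i."""
    img = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    sims = img @ txt.T                    # cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices   # best k captions per image
    targets = torch.arange(len(img)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# With identical features, each image's own caption ranks first
feats = torch.randn(8, 512)
print(recall_at_k(feats, feats, k=1))  # → 1.0
```

Text-to-image recall is the same computation with the arguments swapped.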
CLIP-ViT-g-14-laion2B-s12B-b42K
1. Model Details
2. Uses
3. Training Details
4. Evaluation
5. Acknowledgements
6. Citation
7. How To Get Started With the Model

A CLIP ViT-g/14 model trained on the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training done by Romain Beaumont on the stability.ai cluster.

As per the original OpenAI CLIP model card, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models. The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

Direct uses: zero-shot image classification, image and text retrieval, among others. Downstream uses: image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, among others.

Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases, such as image search in a constrained environment, are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy. This is because our safety assessment demonstrated a high need for task-specific testing, especially given the variability of CLIP's performance with different class taxonomies. This makes untested and unconstrained deployment of the model in any use case currently potentially harmful. Certain use cases which would fall under the domain of surveillance and facial recognition are always out of scope regardless of the performance of the model.
This is because the use of artificial intelligence for tasks such as these can be premature currently, given the lack of testing norms and checks to ensure its fair use. Since the model has not been purposefully trained in or evaluated on any languages other than English, its use should be limited to English-language use cases. Beyond the above notice, the LAION-5B dataset used in training these models has additional considerations; see below.

This model was trained with the 2 billion sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/). IMPORTANT NOTE: the motivation behind dataset creation is to democratize research and experimentation around large-scale multi-modal model training and the handling of uncurated, large-scale datasets crawled from the publicly available internet. Our recommendation is therefore to use the dataset for research purposes. Be aware that this large-scale dataset is uncurated. Keep in mind that its uncurated nature means that collected links may lead to strongly discomforting and disturbing content for a human viewer. Therefore, please use the demo links with caution and at your own risk. It is possible to extract a "safe" subset by filtering out samples based on the safety tags (using a custom-trained NSFW classifier that we built). While this strongly reduces the chance of encountering potentially harmful content when viewing, we cannot entirely exclude the possibility of harmful content still being present in safe mode, so the warning holds there as well. We think that providing the dataset openly to broad research and other interested communities will allow for transparent investigation of the benefits that come with training large-scale models, as well as of pitfalls and dangers that may go unreported or unnoticed when working with closed large datasets that remain restricted to a small community.
While providing our dataset openly, we do not recommend using it for creating ready-to-go industrial products, as the basic research about general properties and safety of such large-scale models, which we would like to encourage with this release, is still in progress.

Evaluation was done with code in the LAION CLIP Benchmark suite. The testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and COCO and Flickr for retrieval. The model achieves a 76.6% zero-shot top-1 accuracy on ImageNet-1k. An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

Acknowledging stability.ai for the compute used to train this model. In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:

TODO - Hugging Face transformers, OpenCLIP, and timm getting started snippets
CLIP-ViT-B-16-DataComp.XL-s13B-b90K
CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup
larger_clap_music_and_speech
CLIP-ViT-g-14-laion2B-s34B-b88K
CLIP-ViT-L-14-CommonPool.XL-s13B-b90K
CLIP-convnext_base-laion400M-s13B-b51K
CLIP-convnext_base_w-laion2B-s13B-b82K
CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K-augreg
mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k
CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg
larger_clap_music
CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K
CLIP-ViT-B-16-DataComp.L-s1B-b8K
CoCa-ViT-B-32-laion2B-s13B-b90k
CLIP-ViT-L-14-CommonPool.XL.clip-s13B-b90K
CLIP-ViT-B-32-roberta-base-laion2B-s12B-b32k
CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft
CLIP-ViT-B-16-CommonPool.L-s1B-b8K
BUD-E-Whisper
BUD-E Whisper is a suite of Whisper models fine-tuned for direct emotional speech captioning. The core models are built upon OpenAI's Whisper architecture, with the current primary variant being a fine-tune of OpenAI Whisper Small. These models are designed to generate text captions that not only transcribe speech but also inherently reflect its emotional content. The embeddings generated by BUD-E Whisper can also serve as input for Empathic Insight - Voice, a downstream ensemble of multi-layer perceptrons (MLPs) designed to predict dimensional emotion scores.

This model is released under the CC-BY-4.0 license. Please give attribution to Maurice Kraus & Christoph Schuhmann, who created this model.

Colab: https://colab.research.google.com/drive/1VoAtmNhY1hI5Yzv1dppHTcYky82OCDK?usp=sharing

BUD-E Whisper was trained on a combination of:
- the LAION's Got Talent (Enhanced Flash Annotations and Long Captions) dataset
- an internal dataset comprising approximately 5,000 hours of public vlogs and similar audio content

A key aspect of BUD-E Whisper's development was a multi-step caption refinement process to create rich training targets:
1. Initial score generation: an iterative process using Gemini Flash 2.0 generated initial 40-dimensional emotion scores (0-4 scale) and 15 additional dimensions such as age, arousal, valence, dominance, harshness, vocal bursts, ... for all audio snippets.
2. Templated captions: these scores were converted into templated string captions.
3. Paraphrasing for richness: Gemini Flash 2.0 was then used to paraphrase these templated captions, creating diverse and semantically rich training targets.
4. Fine-tuning: various Whisper model sizes (including the aforementioned fine-tune of OpenAI Whisper Small) were fine-tuned on these refined, emotionally aware captions.

This multi-step caption refinement was crucial for performance.
Direct score regression and simple templated captions were both found to lead to suboptimal performance for emotional speech captioning with Whisper models.

Intended uses: generating emotionally nuanced captions for audio content, and providing rich embeddings for downstream emotion-recognition tasks (e.g., with Empathic Insight - Voice).
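The second refinement step above (scores to templated captions) can be sketched in plain Python. The dimension names, the 2.0 salience cutoff, and the template wording here are illustrative assumptions, not the exact templates used for BUD-E:

```python
def scores_to_caption(scores: dict) -> str:
    """Turn per-dimension emotion scores (0-4 scale, as described above)
    into a templated caption string. Names/wording are illustrative."""
    salient = {k: v for k, v in scores.items() if v >= 2.0}
    if not salient:
        return "The speaker sounds emotionally neutral."
    parts = [f"{name} ({value:.1f}/4)"
             for name, value in sorted(salient.items(), key=lambda kv: -kv[1])]
    return "The speaker expresses " + ", ".join(parts) + "."

print(scores_to_caption({"joy": 3.5, "surprise": 2.0, "anger": 0.5}))
# → The speaker expresses joy (3.5/4), surprise (2.0/4).
```

In the actual pipeline such templated strings were then paraphrased by Gemini Flash 2.0 before fine-tuning.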
CoCa-ViT-L-14-laion2B-s13B-b90k
CLIP-ViT-L-14-CommonPool.XL.laion-s13B-b90K
swesmith-nl2bash-stack-bugsseq
nl2bash-swesmith-stack-bugsseq
open-thoughts-4-code-qwen3-32b-annotated-32k
open-thoughts-4-code-qwen3-32b-annotated
mscoco_finetuned_CoCa-ViT-B-32-laion2B-s13B-b90k
r2egym-stack-bugsseq
r2egym-nl2bash-stackseq
exp-uns-r2egym-33_6x_glm_4_7_traces_jupiter
CLIP-convnext_base_w-laion_aesthetic-s13B-b82K
bugs-r2egym-stackseq
r2egym-nl2bash-bugsseq
r2egym-nl2bash-stack-bugsseq
CLIP-ViT-B-32-CommonPool.M.clip-s128M-b4K
Qwen3-8B_exp_tas_temp_0.5_traces_save-strategy_steps
CLIP-ViT-B-32-CommonPool.S.text-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.laion-s13M-b4K
r2egym-stackseq
r2egym-bugsseq
r2egym-nl2bashseq
open-thoughts-4-code-qwen3-32b-annotated-gbs256-4node
glm46-swesmith-maxeps-131k
CLIP-ViT-B-32-CommonPool.M-s128M-b4K
exp_tas_optimal_combined_traces-Qwen3.5-9B
CLIP-ViT-B-16-CommonPool.L.laion-s1B-b8K
minimax-m2-stack-overflow-32ep-131k-summtrc
glm-4_6-stack-overflow-32ep-131k-summtrc
gpt-oss-120B-stack-overflow-32ep-131k-summtrc-fixthink1
gpt-oss-120B-stack-overflow-32ep-131k-summtrc
stackexchange-tezos-sandboxes_glm_4_6_traces_together
CLIP-ViT-B-32-DataComp.M-s128M-b4K
qwen3-coder-480B-stack-overflow-32ep-131k-summtrc
CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind
CLIP-ViT-B-32-CommonPool.M.laion-s128M-b4K
CLIP-ViT-B-32-CommonPool.M.text-s128M-b4K
CLIP-ViT-B-32-CommonPool.M.image-s128M-b4K
CLIP-ViT-B-16-CommonPool.L.basic-s1B-b8K
stackexchange-tezos-sandboxes_glm_4_6_traces_together_again
kimi-k2t-freelancer-32ep-32k
CLIP-ViT-B-32-CommonPool.M.basic-s128M-b4K
glm-4_6-freelancer-32ep-131k-torch
GLM-4_6-stackexchange-overflow-sandboxes-32eps-65k-reasoning
CLIP-ViT-B-32-CommonPool.S.clip-s13M-b4K
sft_GLM-4-7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k_Qwen3-32B
glm46-neulab-synatra-32ep-131k
MiniMax-M2-freelancer-32ep-32k
CLIP-ViT-B-16-CommonPool.L.clip-s1B-b8K
claude-4-5-sonnet-thinking-stackexchange-overflow-32ep-32k-traces
CLIP-ViT-B-32-DataComp.S-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.basic-s13M-b4K
CLIP-ViT-B-32-CommonPool.S.image-s13M-b4K
CLIP-ViT-B-32-CommonPool.S-s13M-b4K
glm46-stackexchange-tezos-maxeps-131k
r2egym-gpt5-codex-160ep-1M
stackexchange-tezos-sandboxes_glm_4_7_traces_locetash
stackexchange-tezos-sandboxes_glm_4_6_traces_locetash_again
glm-4_6-stackexchange-tezos-32ep-131k
timbre-whisper
GLM-4_6-stackexchange-superuser-32ep-32k
sft__Kimi-2-5-inferredbugs-sandboxes-maxeps-32k__Qwen3-8B
exp_tas_parser_xml_traces
exp_tas_min_p_0_1_traces
exp-syh-r2egym-askllm-constrained_glm_4_7_traces_jupiter
glm46-swegym-tasks-maxeps-131k
exp_tas_baseline_traces
exp_tas_full_thinking_traces
exp_tas_linear_history_off_traces
GLM-4_6-inferredbugs-32ep-65k-reasoning
exp_tas_low_diversity_traces
exp_tas_frequency_penalty_0_5_traces
exp_tas_frequency_penalty_0_25_traces
glm46-swesmith-maxeps-131k-lc
exp_tas_interleaved_thinking_on_traces
glm-4_6-nemo-prism
CLIP-ViT-B-16-CommonPool.L.image-s1B-b8K
alfworld-swesmith-r2egym-swegym-131k-lc
CLIP-ViT-B-16-CommonPool.L.text-s1B-b8K
stackexchange-tezos-sandboxes_glm_4_6_traces_locetash
rl__24GPU_shaped__swe_rebench_patched_oracle__r2egym-nl2bash-stack
r2egym-nl2bash-stack-bugsseq-fixthink
Qwen3-Coder-480B-codeforces-fixeps_Qwen3-8B
r2egym-nl2bash-stack-bugsseq-fixthink-again
glm46-glaive-code-assistant-sandboxes-maxeps-131k
GLM-4_6-freelancer-32eps-131k
glm-4_6-all-puzzles-32ep-131k
openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_DFN512x512-s293M-b32k
exp_tas_optimal_combined_traces
glm46-defects4j-32ep-131k
openMaMMUT-ViT-B-32-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k
voice-tagging-whisper
sft_r2egym-nl2bash-stackoverflow-inferredbugs-32B_Qwen3-32B
exp-gfi-staqc-askllm-filtered-10K_glm_4_7_traces_jupiter_cleaned
openMaMMUT-ViT-B-16-512x512-pt_DFN2B-ft_DFN512x512-s293M-b73k
rl__r2egym_deepswe_fp8_terminus-2_32b
GLM-4.6-stackexchange-overflow-sandboxes-32eps-65k-reasoning_adam-beta1_0-97_Qwen3-32B
openMaMMUT-ViT-L-14-DataComp-1.4B-s12.8B-b180K
anh-bloomz-7b1-mt-cross-lingual
glm-4_6-r2egym-32ep-32k
exp-uns-r2egym-2_1x_glm_4_7_traces_jupiter
exp-syh-tezos-stackoverflow-mixed_glm_4_7_traces_jupiter
GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k
exp-syh-tezos-askllm-constrained_glm_4_7_traces_jupiter
openMaMMUT-ViT-B-16-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
openMaMMUT-ViT-L-14-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
exp_tas_top_p_0_95_traces
exp_tas_max_tokens_2048_traces
distil-whisper-large-v3_openvino_int8
Mantis-8B-siglip-llama3_openvino_int8
anh-xglm-7.5b-cross-lingual
LLaVA-Video-7B-Qwen2_openvino_int8
exp_tas_top_p_0_8_traces
DALLE2-PyTorch
Empathic-Insight-Voice-Small
Empathic-Insight-Voice-Small [](https://colab.research.google.com/drive/1WR-B6j--Y5RdhIyRGFtJ3YdFF8BkUA2)

Empathic-Insight-Voice-Small is a suite of 40+ emotion and attribute regression models trained on the large-scale, multilingual synthetic voice-acting dataset LAION'S GOT TALENT (~5,000 hours) and an "in the wild" dataset of voice snippets (also ~5,000 hours). Each model predicts the intensity of a specific fine-grained emotion or attribute from speech audio. The models leverage embeddings from a fine-tuned Whisper model (laion/BUD-E-Whisper) followed by dedicated MLP regression heads for each dimension.

This work is based on the research paper "EMONET-VOICE: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection".

The Empathic-Insight-Voice-Small suite consists of over 54 individual MLP models (40 for primary emotions, plus others for attributes such as valence, arousal, and gender). Each model takes a Whisper audio embedding as input and outputs a continuous score for one of the emotion/attribute categories defined in the EMONET-VOICE taxonomy and extended attribute set.

These models are intended for research purposes in affective computing, speech emotion recognition (SER), human-AI interaction, and voice AI development. They can be used to:

- Analyze and predict fine-grained emotional states and vocal attributes from speech.
- Serve as a baseline for developing more advanced SER systems.
- Facilitate research into nuanced emotional understanding in voice AI.
- Explore multilingual and cross-cultural aspects of speech emotion (given the multilingual foundation dataset).

Out-of-Scope Use: These models are trained largely on synthetic speech, and their generalization to spontaneous real-world speech needs further evaluation.
They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes or infringe on privacy without due diligence and ethical review.

The primary way to use these models is through the provided Google Colab notebook. The notebook handles dependencies, model loading, and audio processing, and provides examples for:

- Batch processing a folder of audio files.
- Generating a comprehensive HTML report with per-file emotion scores, waveforms, and audio players.
- Generating individual JSON files with all predicted scores for each audio file.

The core 40 emotion categories are (from EMONET-VOICE, Appendix A.1): Affection, Amusement, Anger, Astonishment/Surprise, Awe, Bitterness, Concentration, Confusion, Contemplation, Contempt, Contentment, Disappointment, Disgust, Distress, Doubt, Elation, Embarrassment, Emotional Numbness, Fatigue/Exhaustion, Fear, Helplessness, Hope/Enthusiasm/Optimism, Impatience and Irritability, Infatuation, Interest, Intoxication/Altered States of Consciousness, Jealousy & Envy, Longing, Malevolence/Malice, Pain, Pleasure/Ecstasy, Pride, Relief, Sadness, Sexual Lust, Shame, Sourness, Teasing, Thankfulness/Gratitude, Triumph. Additional vocal attributes (e.g., Valence, Arousal, Gender, Age, pitch characteristics) are also predicted by corresponding MLP models in the suite; the full list of predictable dimensions can be inferred from the `FILENAMEPARTTOTARGETKEYMAP` in the Colab notebook (Cell 2).

Below is a conceptual example of how to perform inference on a single audio file, extracting emotion and attribute scores. For the full, runnable version, please refer to the Colab notebook.
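A minimal sketch of that idea in PyTorch, assuming a 1280-dimensional audio embedding (the size used by Whisper large; the actual dimension depends on the laion/BUD-E-Whisper checkpoint you load). The head architecture and the emotion names used here are illustrative; the released checkpoints and the Colab notebook define the real sizes and file names.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """One MLP regression head: Whisper audio embedding -> scalar intensity score.
    Hidden size is illustrative; the released checkpoints define the real shapes."""
    def __init__(self, embed_dim: int = 1280, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

# One head per emotion/attribute; in practice each head would load its .pth
# state dict, e.g. head.load_state_dict(torch.load(path, map_location="cpu")).
heads = {name: EmotionHead().eval() for name in ["Amusement", "Anger", "Relief"]}

# Stand-in for the pooled Whisper embedding of a single audio file (batch of 1).
embedding = torch.randn(1, 1280)

with torch.no_grad():
    scores = {name: head(embedding).item() for name, head in heads.items()}
print(scores)
```

Running every head over the same cached embedding keeps per-file cost low: the expensive Whisper forward pass happens once, and each MLP adds only a few matrix multiplies.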
The EMONET-VOICE suite was developed with ethical considerations as a priority:

- Privacy Preservation: The use of synthetic voice generation fundamentally circumvents privacy concerns associated with collecting real human emotional expressions, especially for sensitive states.
- Responsible Use: These models are released for research. Users are urged to consider the ethical implications of their applications and to avoid misuse, such as emotional manipulation, surveillance, or uses that could lead to unfair, biased, or harmful outcomes.

The broader societal implications and mitigation of potential misuse of SER technology remain important ongoing considerations.
Empathic-Insight-Face-Large
Empathic-Insight-Face-Large [](https://colab.research.google.com/drive/11oUMo2HX0OuD9dx5ZM4ltNvoYxbI65hu?usp=sharing)

Empathic-Insight-Face-Large is a set of 40 emotion regression models trained on the EMoNet-FACE benchmark suite. Each model predicts the intensity of a specific fine-grained emotion from facial expressions. The models are built on top of SigLIP2 image embeddings followed by MLP regression heads.

This work is based on the research paper "EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition". Authors: Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Maurice Kraus, Felix Friedrich, Huu Nguyen, Krishna Kalyan, Kourosh Nadi, Kristian Kersting, Sören Auer. (Please refer to the full paper for the complete list of authors and affiliations.) Paper link: (Insert ArXiv/Conference link here when available)

The models and datasets are released under the CC-BY-4.0 license.

The Empathic-Insight-Face-Large suite consists of 40 individual MLP models. Each model takes a 1152-dimensional SigLIP2 image embedding as input and outputs a continuous score (typically 0-7, optionally mean-subtracted) for one of the 40 emotion categories defined in the EMoNet-FACE taxonomy. The models were pre-trained on the EMoNet-FACE BIG dataset (over 203k synthetic images with generated labels) and fine-tuned on the EMoNet-FACE BINARY dataset (nearly 20k synthetic images with over 65k human expert binary annotations).

Key Features:

- Fine-grained Emotions: Covers a novel 40-category emotion taxonomy.
- High Performance: Achieves human-expert-level performance on the EMoNet-FACE HQ benchmark.
- Synthetic Data: Trained on AI-generated, demographically balanced, full-face expressions.
- Open: Publicly released models, datasets, and taxonomy.

These models are intended for research purposes in affective computing, human-AI interaction, and emotion recognition.
They can be used to:

- Analyze and predict fine-grained emotional expressions in facial images.
- Serve as a baseline for developing more advanced emotion recognition systems.
- Facilitate research into nuanced emotional understanding in AI.

Out-of-Scope Use: These models are trained on synthetic faces and may not generalize well to real-world, in-the-wild images without further adaptation. They should not be used for making critical decisions about individuals, for surveillance, or in any manner that could lead to discriminatory outcomes.

These are individual `.pth` files, each corresponding to one emotion classifier. To use them, you will typically:

1. Obtain SigLIP2 Embeddings: Use a pre-trained SigLIP2 model (e.g., `google/siglip2-so400m-patch16-384`) to extract the 1152-dimensional image embedding for your target facial image.
2. Load an MLP Model: Each `.pth` file (e.g., `modelelationbest.pth`) is a PyTorch state dictionary for an MLP. The MLP architecture used for Empathic-Insight-Face-Large (big models) is:
   - Input: 1152 features
   - Hidden Layer 1: 1024 neurons, ReLU, Dropout (0.2)
   - Hidden Layer 2: 512 neurons, ReLU, Dropout (0.2)
   - Hidden Layer 3: 256 neurons, ReLU, Dropout (0.2)
   - Output Layer: 1 neuron (continuous score)
3. Perform Inference: Pass the SigLIP2 embedding through the loaded MLP model(s).
4. (Optional) Mean Subtraction: The raw output scores can be adjusted by subtracting the model's mean score on neutral faces. The `neutralstatscache-human-binary-big-mlpsv8twostagehigherlrstage25200+` file in this repository contains these mean values for each emotion model.

```bibtex
@inproceedings{schuhmann2025emonetface,
  title={{EMONET-FACE: An Expert-Annotated Benchmark for Synthetic Emotion Recognition}},
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Kraus, Maurice and Friedrich, Felix and Nguyen, Huu and Kalyan, Krishna and Nadi, Kourosh and Kersting, Kristian and Auer, Sören},
  booktitle={NeurIPS},
  year={2025},
}
```
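The MLP architecture described above can be sketched directly in PyTorch. This is a minimal sketch: the checkpoint path is a placeholder, and whether the released state dicts use this exact `nn.Sequential` key layout is an assumption to verify against the files themselves.

```python
import torch
import torch.nn as nn

def make_emotion_mlp() -> nn.Sequential:
    """MLP head matching the stated architecture:
    1152 -> 1024 -> 512 -> 256 -> 1, with ReLU and Dropout(0.2) between layers."""
    return nn.Sequential(
        nn.Linear(1152, 1024), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(256, 1),
    )

mlp = make_emotion_mlp()
# In practice, load one released checkpoint (path is a placeholder):
# mlp.load_state_dict(torch.load("path/to/emotion.pth", map_location="cpu"))
mlp.eval()  # disables dropout for inference

# Stand-in for a SigLIP2 image embedding (1 image x 1152 features).
embedding = torch.randn(1, 1152)
with torch.no_grad():
    score = mlp(embedding).item()  # raw continuous score for this one emotion
```

Step 4 (mean subtraction) then amounts to `score - neutral_mean` using the per-emotion value from the neutral-stats cache file.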
erlich
scaling-laws-openclip
Empathic-Insight-Voice-Large
Empathic-Insight-Face-Small
scaling-laws-for-comparison
Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598)

We provide the pre-trained models used in the paper Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [[arXiv]](https://arxiv.org/abs/2506.04598). Please refer to the official GitHub repository for more information on how to reproduce the results and how to download and use the models.
whisper-captioning-ensemble
openMaMMUT-ViT-B-32-512x512-pt_datacomp1b-ft_datacomp512x512-s76M-b73k
ongo
SNAC-24khz-decoder-onnx
This repo provides ONNX decoders for the SNAC 24 kHz codec so you can decode SNAC tokens on-device, including in the browser with `onnxruntime-web`.

Why? If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio in the user's browser/CPU (or WebGPU when available).

> In a Colab CPU test, we saw ~2.1× real-time decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.

- `snac24int2wavstatic.onnx` — int → wav decoder. Inputs (int64): `codes0` `[1, 12]`, `codes1` `[1, 24]`, `codes2` `[1, 48]`. Output: `audio` `float32 [1, 1, 24576]` (24 kHz). Shapes correspond to a 48-frame window. Each frame is 512 samples, so one window = 24576 samples ≈ 1.024 s at 24 kHz. Token alignment: all three code levels cover the same shared frames, i.e. 12 × 4 = 24 × 2 = 48 × 1 = 48 frames.
- `snac24latent2wavstatic.onnx` — latent → wav decoder. Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`. Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).
- `snac24quantizers.json` — RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.

Serve these files from a local server with cross-origin isolation for multithreaded WASM (e.g., COOP/COEP headers). If not isolated, WASM will typically run single-threaded.

```javascript
(async () => {
  // Prefer WebGPU if available; else WASM
  const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu', 'wasm'] : ['wasm'];

  // Enable SIMD; threads only if crossOriginIsolated
  ort.env.wasm.simd = true;
  ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency || 4) : 1;

  const session = await ort.InferenceSession.create('snac24int2wavstatic.onnx', {
    executionProviders: providers,
    graphOptimizationLevel: 'all',
  });

  // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
  const T0 = 12, T1 = 24, T2 = 48;
  const feed = {
    codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x => BigInt(x)), [1, T0]),
    codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x => BigInt(x)), [1, T1]),
    codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x => BigInt(x)), [1, T2]),
  };

  const t0 = performance.now();
  const out = await session.run(feed);
  const t1 = performance.now();
  const audio = out.audio.data; // Float32Array of 24576 samples ([1,1,24576] flattened)

  // Play it (24 kHz)
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 24000 });
  const buf = ctx.createBuffer(1, audio.length, 24000);
  buf.copyToChannel(audio, 0);
  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(ctx.destination);
  src.start();

  console.log({ usedEP: providers[0], inferms: (t1 - t0).toFixed(2), samples: audio.length });
})();
```

SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms to hide seams. Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded. WebGPU can accelerate on desktop and mobile when kernels are supported; this model usually falls back to WASM if not.
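The few-millisecond cross-fade between decoded windows can be done with a linear ramp over the overlap region. A minimal numpy sketch (ramp length and sample count are illustrative; 120 samples at 24 kHz is 5 ms):

```python
import numpy as np

def crossfade(prev_chunk: np.ndarray, next_chunk: np.ndarray,
              fade_samples: int = 120) -> np.ndarray:
    """Join two decoded audio chunks, linearly fading the tail of prev_chunk
    into the head of next_chunk over `fade_samples` samples to hide the seam."""
    ramp = np.linspace(0.0, 1.0, fade_samples, dtype=np.float32)
    overlap = (prev_chunk[-fade_samples:] * (1.0 - ramp)
               + next_chunk[:fade_samples] * ramp)
    return np.concatenate([prev_chunk[:-fade_samples],
                           overlap,
                           next_chunk[fade_samples:]])

# Two 24576-sample windows (one SNAC decode window each at 24 kHz).
a = np.ones(24576, dtype=np.float32)
b = np.zeros(24576, dtype=np.float32)
out = crossfade(a, b)
# Joined length: len(a) + len(b) - fade_samples, since the fade region overlaps.
```

The same arithmetic carries over directly to the Float32Array output of the browser decoder: keep the last `fade_samples` of the previous window in memory and blend it with the start of the next one before scheduling playback.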