microsoft
✓ VerifiedEnterpriseMicrosoft AI, partners with OpenAI on GPT integration
TRELLIS-image-large
--- library_name: trellis pipeline_tag: image-to-3d license: mit language: - en ---
deberta-v3-base
--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---
deberta-xlarge-mnli
--- language: en tags: - deberta-v1 - deberta-mnli tasks: mnli thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit widget: - text: "[CLS] I love you. [SEP] I like you. [SEP]" ---
table-transformer-structure-recognition
--- license: mit widget: - src: https://documentation.tricentis.com/tosca/1420/en/content/tbox/images/table.png example_title: Table ---
deberta-large-mnli
--- language: en tags: - deberta-v1 - deberta-mnli tasks: mnli thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit widget: - text: "[CLS] I love you. [SEP] I like you. [SEP]" ---
table-transformer-detection
--- license: mit widget: - src: https://www.invoicesimple.com/wp-content/uploads/2018/06/Sample-Invoice-printable.png example_title: Invoice ---
Phi-3-mini-4k-instruct
--- license: mit license_link: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/resolve/main/LICENSE language: - en - fr pipeline_tag: text-generation tags: - nlp - code inference: parameters: temperature: 0 widget: - messages: - role: user content: Can you provide ways to eat combinations of bananas and dragonfruits? --- 🎉 **Phi-3.5**: [[mini-instruct]](https://huggingface.co/microsoft/Phi-3.5-mini-instruct); [[MoE-instruct]](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) ; [[vi
BiomedNLP-BiomedBERT-base-uncased-abstract
--- language: en tags: - exbert license: mit widget: - text: "[MASK] is a tyrosine kinase inhibitor." ---
tapex-base-finetuned-wikisql
--- language: en tags: - tapex - table-question-answering datasets: - wikisql license: mit ---
BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
--- language: en tags: - clip - biology - medical license: mit library_name: open_clip widget: - src: https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224/resolve/main/example_data/biomed_image_classification_example_data/squamous_cell_carcinoma_histopathology.jpeg candidate_labels: adenocarcinoma histopathology, squamous cell carcinoma histopathology example_title: squamous cell carcinoma histopathology - src: >- https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_
beit-base-patch16-224-pt22k-ft22k
--- license: apache-2.0 tags: - image-classification - vision datasets: - imagenet - imagenet-21k ---
Florence-2-large
--- license: mit license_link: https://huggingface.co/microsoft/Florence-2-large/resolve/main/LICENSE pipeline_tag: image-text-to-text tags: - vision ---
phi-2
--- license: mit license_link: https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE language: - en pipeline_tag: text-generation tags: - nlp - code ---
layoutlmv2-base-uncased
--- language: en license: cc-by-nc-sa-4.0
layoutlmv3-base
--- language: en license: cc-by-nc-sa-4.0
deberta-v3-large
--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---
codebert-base
Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).
deberta-base
--- language: en tags: - deberta-v1 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---
DialoGPT-medium
--- thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png tags: - conversational license: mit ---
BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
--- language: en tags: - exbert license: mit widget: - text: "[MASK] is a tumor suppressor gene." ---
mdeberta-v3-base
--- language: - multilingual - en - ar - bg - de - el - es - fr - hi - ru - sw - th - tr - ur - vi - zh tags: - deberta - deberta-v3 - mdeberta - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---
phi-4
| | | |-------------------------|-------------------------------------------------------------------------------| | Developers | Microsoft Research | | Description | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. `phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures | | Architecture | 14B parameters, dense decoder-only Transformer model | | Inputs | Text, best suited for prompts in the chat format | | Context length | 16K tokens | | GPUs | 1920 H100-80G | | Training time | 21 days | | Training data | 9.8T tokens | | Outputs | Generated text in response to input | | Dates | October 2024 – November 2024 | | Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data | | Release date | December 12, 2024 | | License | MIT | | | | |-------------------------------|-------------------------------------------------------------------------| | Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. | | Out-of-Scope Use Cases | Our models is not specifically designed or evaluated for all downstream purposes, thus: 1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. 2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. 3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. | Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from: 1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code. 2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.). 4. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge. We evaluated `phi-4` using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically: MMLU: Popular aggregated dataset for multitask language understanding. `phi-4` has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories. Prior to release, `phi-4` followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by `phi-4` in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks. Please refer to the technical report for more details on safety alignment. To understand the capabilities, we compare `phi-4` with a set of models over OpenAI’s SimpleEval benchmark. At the high-level overview of the model quality on representative benchmarks. For the table below, higher numbers indicate better performance: | Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o | |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------| | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 | | Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 | | Math | MGSM MATH | 80.6 80.4 | 53.5 44.6 | 79.6 75.6 | 86.5 73.0 | 89.1 66.3 | 87.3 80.0 | 90.4 74.6 | | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 | | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 | | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 | \ These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B. Given the nature of the training data, `phi-4` is best suited for prompts using the chat format as follows: Like other language models, `phi-4` can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. `phi-4` is not intended to support multilingual use. Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. Limited Scope for Code: Majority of `phi-4` training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include: Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. Data Summary: https://huggingface.co/microsoft/phi-4/blob/main/datasummarycard.md
Phi-4-multimodal-instruct
--- license: mit license_link: https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/LICENSE language: - multilingual - ar - zh - cs - da - nl - en - fi - fr - de - he - hu - it - ja - ko - no - pl - pt - ru - es - sv - th - tr - uk tags: - nlp - code - audio - automatic-speech-recognition - speech-summarization - speech-translation - visual-question-answering - phi-4-multimodal - phi - phi-4-mini widget: - example_title: Librispeech sample 1 src: https://cdn-media.huggingface.
wavlm-large
--- language: - en tags: - speech inference: false ---
swinv2-tiny-patch4-window16-256
--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg example_title: Tiger - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg example_title: Teapot - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg example_title: Palace ---
Phi-3.5-vision-instruct
--- license: mit license_link: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/resolve/main/LICENSE language: - multilingual pipeline_tag: image-text-to-text tags: - nlp - code - vision inference: parameters: temperature: 0.7 widget: - messages: - role: user content: Can you describe what you see in the image? library_name: transformers ---
trocr-large-printed
--- tags: - trocr - image-to-text widget: - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X00016469612_1.jpg example_title: Printed 1 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005255805_7.jpg example_title: Printed 2 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005745214_6.jpg example_title: Printed 3 ---
DialoGPT-small
--- thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png tags: - conversational license: mit ---
Phi-3.5-mini-instruct
--- license: mit license_link: https://huggingface.co/microsoft/Phi-3.5-mini-instruct/resolve/main/LICENSE language: - multilingual pipeline_tag: text-generation tags: - nlp - code widget: - messages: - role: user content: Can you provide ways to eat combinations of bananas and dragonfruits? library_name: transformers --- 🎉**Phi-4**: [[multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | [onnx](https://huggingface.co/microsoft/Phi-4-multimodal-instruct-onnx)]; [[mi
wavlm-base-plus
--- language: - en datasets: tags: - speech inference: false ---
Phi-3-mini-128k-instruct
--- license: mit license_link: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/main/LICENSE
graphcodebert-base
GraphCodeBERT is a graph-based pre-trained model based on the Transformer architecture for programming language, which also considers data-flow information along with code sequences. GraphCodeBERT consists of 12 layers, 768 dimensional hidden states, and 12 attention heads. The maximum sequence length for the model is 512. The model is trained on the CodeSearchNet dataset, which includes 2.3M functions with document pairs for six programming languages.
Phi-4-mini-instruct
--- language: - multilingual - ar - zh - cs - da - nl - en - fi - fr - de - he - hu - it - ja - ko - 'no' - pl - pt - ru - es - sv - th - tr - uk library_name: transformers license: mit license_link: https://huggingface.co/microsoft/Phi-4-mini-instruct/resolve/main/LICENSE pipeline_tag: text-generation tags: - nlp - code widget: - messages: - role: user content: Can you provide ways to eat combinations of bananas and dragonfruits? --- 🎉**Phi-4**: [[mini-reasoning](https://huggingface.co/microso
xclip-base-patch32
--- language: en license: mit tags: - vision - video-classification model-index: - name: nielsr/xclip-base-patch32 results: - task: type: video-classification dataset: name: Kinetics 400 type: kinetics-400 metrics: - type: top-1 accuracy value: 80.4 - type: top-5 accuracy value: 95.0 ---
trocr-base-printed
--- tags: - trocr - image-to-text widget: - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X00016469612_1.jpg example_title: Printed 1 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005255805_7.jpg example_title: Printed 2 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005745214_6.jpg example_title: Printed 3 ---
deberta-v3-small
--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---
resnet-50
--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k ---
wavlm-base-plus-sv
--- language: - en tags: - speech ---
Florence-2-base
--- license: mit license_link: https://huggingface.co/microsoft/Florence-2-base/resolve/main/LICENSE pipeline_tag: image-text-to-text tags: - vision ---
kosmos-2-patch14-224
--- pipeline_tag: image-to-text tags: - image-captioning languages: - en license: mit ---
table-transformer-structure-recognition-v1.1-all
--- license: mit ---
trocr-base-handwritten
--- tags: - trocr - image-to-text widget: - src: https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg example_title: Note 1 - src: >- https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSoolxi9yWGAT5SLZShv8vVd0bz47UWRzQC19fDTeE8GmGv_Rn-PCF1pP1rrUx8kOjA4gg&usqp=CAU example_title: Note 2 - src: >- https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRNYtTuSBpZPV_nkBYPMFwVVD9asZOPgHww4epu9EqWgDmXW--sE2o8og40ZfDGo87j5w&usqp=CAU example_title: Note 3 license: mit ---
VibeVoice-1.5B
--- language: - en - zh license: mit pipeline_tag: text-to-speech tags: - Podcast library_name: transformers ---
resnet-18
llmlingua-2-xlm-roberta-large-meetingbank
LLMLingua-2-Bert-base-Multilingual-Cased-MeetingBank This model was introduced in the paper LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression (Pan et al, 2024). It is a XLM-RoBERTa (large-sized model) finetuned to perform token classification for task agnostic prompt compression. The probability $p{preserve}$ of each token $xi$ is used as the metric for compression. This model is trained on the extractive text compression dataset constructed with the methodology proposed in the LLMLingua-2, using training examples from MeetingBank (Hu et al, 2023) as the seed data. You can evaluate the model on downstream tasks such as question answering (QA) and summarization over compressed meeting transcripts using this dataset. For more details, please check the home page of LLMLingua-2 and LLMLingua Series.
layoutlm-base-uncased
LayoutLM Multimodal (text + layout/format + image) pre-training for document AI LayoutLM is a simple but effective pre-training method of text and layout for document image understanding and information extraction tasks, such as form understanding and receipt understanding. LayoutLM archives the SOTA results on multiple datasets. For more details, please refer to our paper: LayoutLM: Pre-training of Text and Layout for Document Image Understanding Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, KDD 2020 We pre-train LayoutLM on IIT-CDIP Test Collection 1.0\ dataset with two settings. LayoutLM-Base, Uncased (11M documents, 2 epochs): 12-layer, 768-hidden, 12-heads, 113M parameters (This Model) LayoutLM-Large, Uncased (11M documents, 2 epochs): 24-layer, 1024-hidden, 16-heads, 343M parameters If you find LayoutLM useful in your research, please cite the following paper:
BiomedVLP-CXR-BERT-general
infoxlm-large
**InfoXLM** (NAACL 2021, [paper](https://arxiv.org/pdf/2007.07834.pdf), [repo](https://github.com/microsoft/unilm/tree/master/infoxlm), [model](https://huggingface.co/microsoft/infoxlm-base)) InfoX...
beit-base-patch16-384
Phi-3.5-MoE-instruct
Phi-3.5-MoE is a lightweight, state-of-the-art open model built upon datasets used for Phi-3 - synthetic data and filtered publicly available documents - with a focus on very high-quality, reasoning dense data. The model supports multilingual and comes with 128K context length (in tokens). The model underwent a rigorous enhancement process, incorporating supervised fine-tuning, proximal policy optimization, and direct preference optimization to ensure precise instruction adherence and robust safety measures. 🏡 Phi-3 Portal 📰 Phi-3 Microsoft Blog 📖 Phi-3 Technical Report 👩🍳 Phi-3 Cookbook 🖥️ Try It Phi-3.5: [[mini-instruct]](https://huggingface.co/microsoft/Phi-3.5-mini-instruct); [[MoE-instruct]](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) ; [[vision-instruct]](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) The model is intended for commercial and research use in multiple languages. The model provides uses for general purpose AI systems and applications which require: 1) Memory/compute constrained environments 2) Latency bound scenarios 3) Strong reasoning (especially code, math and logic) Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fariness before using within a specific downstream use case, particularly for high risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. Requirements Phi-3.5-MoE-instruct is integrated in the official version of `transformers` starting from 4.46.0. The current `transformers` version can be verified with: `pip list | grep transformers`. Phi-3.5-MoE-instruct is also available in Azure AI Studio Phi-3.5-MoE-Instruct supports a vocabulary size of up to `32064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size. Input Formats Given the nature of the training data, the Phi-3.5-MoE-instruct model is best suited for prompts using the chat format as follows: Loading the model locally After obtaining the Phi-3.5-MoE-instruct model checkpoints, users can use this sample code for inference. To understand the capabilities, we compare Phi-3.5-MoE with a set of models over a variety of benchmarks using our internal benchmark platform. At the high-level overview of the model quality on representative benchmarks: | Category | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) | |--|--|--|--|--|--|--|--| | Popular aggregated benchmark | Arena Hard | 37.9 | 39.4 | 25.7 | 42.0 | 55.2 | 75.0 | | | BigBench Hard CoT (0-shot) | 79.1 | 60.2 | 63.4 | 63.5 | 66.7 | 80.4 | | | MMLU (5-shot) | 78.9 | 67.2 | 68.1 | 71.3 | 78.7 | 77.2 | | | MMLU-Pro (0-shot, CoT) | 54.3 | 40.7 | 44.0 | 50.1 | 57.2 | 62.8 | | Reasoning | ARC Challenge (10-shot) | 91.0 | 84.8 | 83.1 | 89.8 | 92.8 | 93.5 | | | BoolQ (2-shot) | 84.6 | 82.5 | 82.8 | 85.7 | 85.8 | 88.7 | | | GPQA (0-shot, CoT) | 36.8 | 28.6 | 26.3 | 29.2 | 37.5 | 41.1 | | | HellaSwag (5-shot) | 83.8 | 76.7 | 73.5 | 80.9 | 67.5 | 87.1 | | | OpenBookQA (10-shot) | 89.6 | 84.4 | 84.8 | 89.6 | 89.0 | 90.0 | | | PIQA (5-shot) | 88.6 | 83.5 | 81.2 | 83.7 | 87.5 | 88.7 | | | Social IQA (5-shot) | 78.0 | 75.3 | 71.8 | 74.7 | 77.8 | 82.9 | | | TruthfulQA (MC2) (10-shot) | 77.5 | 68.1 | 69.2 | 76.6 | 76.6 | 78.2 | | | WinoGrande (5-shot) | 81.3 | 70.4 | 64.7 | 74.0 | 74.7 | 76.9 | | Multilingual | Multilingual MMLU (5-shot) | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 | | | MGSM (0-shot CoT) | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 | | Math | GSM8K (8-shot, CoT) | 88.7 | 84.2 | 82.4 | 84.9 | 82.4 | 91.3 | | | MATH (0-shot, CoT) | 59.5 | 31.2 | 47.6 | 50.9 | 38.0 | 70.2 | | Long context | Qasper | 40.0 | 30.7 | 37.2 | 13.9 | 43.5 | 39.8 | | | SQuALITY | 24.1 | 25.8 | 26.2 | 0.0 | 23.5 | 23.8 | | Code Generation | HumanEval (0-shot) | 70.7 | 63.4 | 66.5 | 61.0 | 74.4 | 86.6 | | | MBPP (3-shot) | 80.8 | 68.1 | 69.4 | 69.3 | 77.5 | 84.1 | | Average | | 69.2 | 61.3 | 61.0 | 63.3 | 68.5 | 74.9 | We take a closer look at different categories across 80 public benchmark datasets at the table below: | Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) | |--|--|--|--|--|--|--| | Popular aggregated benchmark | 62.6 | 51.9 | 50.3 | 56.7 | 64.5 | 73.9 | | Reasoning | 78.7 | 72.2 | 70.5 | 75.4 | 77.7 | 80.0 | | Language understanding | 71.8 | 67.0 | 62.9 | 72.8 | 66.6 | 76.8 | | Robustness | 75.6 | 65.2 | 59.8 | 64.7 | 68.9 | 77.5 | | Long context | 25.5 | 24.5 | 25.5 | 0.0 | 27.0 | 25.4 | | Math | 74.1 | 57.7 | 65.0 | 67.9 | 60.2 | 80.8 | | Code generation | 68.3 | 56.9 | 65.8 | 58.3 | 66.8 | 69.9 | | Multilingual | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 | Overall, Phi-3.5-MoE with only 6.6B active parameters achieves a similar level of language understanding and math as much larger models. Moreover, the model outperforms bigger models in reasoning capability and only behind GPT-4o-mini. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much factual knowledge, therefore, users may experience factual incorrectness. However, we believe such weakness can be resolved by augmenting Phi-3.5 with a search engine, particularly when using the model under RAG settings. The table below highlights multilingual capability of Phi-3.5-MoE on multilingual MMLU, MEGA, and multilingual MMLU-pro datasets. Overall, we observed that even with just 6.6B active parameters, the model is very competitive on multilingual tasks in comparison to other models with a much bigger active parameters. | Category | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) | |--|--|--|--|--|--|--| | Multilingual MMLU | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 | | Multilingual MMLU-Pro | 45.3 | 34.0 | 21.4 | 43.0 | 57.9 | 53.2 | | MGSM | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 | | MEGA MLQA | 65.3 | 61.2 | 45.2 | 54.4 | 61.6 | 70.0 | | MEGA TyDi QA | 67.1 | 63.7 | 54.5 | 65.6 | 63.6 | 81.8 | | MEGA UDPOS | 60.4 | 58.2 | 54.1 | 56.6 | 62.4 | 66.0 | | MEGA XCOPA | 76.6 | 10.8 | 21.1 | 31.2 | 95.0 | 90.3 | | MEGA XStoryCloze | 82.8 | 92.3 | 71.0 | 87.0 | 20.7 | 96.6 | | Average | 65.8 | 55.3 | 47.5 | 59.6 | 64.3 | 76.6 | Phi-3.5-MoE supports 128K context length, therefore the model is capable of several long context tasks including long document/meeting summarization, long document QA, multilingual context retrieval. We see that Phi-3.5 is clearly better than Gemma-2 family which only supports 8K context length. Phi-3.5-MoE-instruct is very competitive with other much larger open-weight models such as Llama-3.1-8B-instruct, and Mistral-Nemo-12B-instruct-2407. | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) | |--|--|--|--|--|--| | GovReport | 26.4 | 25.6 | 25.1 | 27.8 | 24.8 | | QMSum | 19.9 | 22.1 | 21.6 | 24.0 | 21.7 | | Qasper | 40.0 | 30.7 | 37.2 | 43.5 | 39.8 | | SQuALITY | 24.1 | 25.8 | 26.2 | 23.5 | 23.8 | | SummScreenFD | 16.9 | 18.2 | 17.6 | 16.3 | 17.0 | | Average | 25.5 | 24.5 | 25.5 | 27.0 | 25.4 | RULER: a retrieval-based benchmark for long context understanding | Model | 4K | 8K | 16K | 32K | 64K | 128K | Average | |--|--|--|--|--|--|--|--| | Phi-3.5-MoE-instruct | 94.8 | 93 | 93.2 | 91.6 | 85.7 | 64.2 | 87.1 | | Llama-3.1-8B-instruct | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 | | Mistral-Nemo-12B-instruct-2407 | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 | RepoQA: a benchmark for long context code understanding | Model | Python | C++ | Rust | Java | TypeScript | Average | |--|--|--|--|--|--|--| | Phi-3.5-MoE-instruct | 89 | 74 | 81 | 88 | 95 | 85 | | Llama-3.1-8B-instruct | 80 | 65 | 73 | 76 | 63 | 71 | | Mistral-7B-instruct-v0.3 | 61 | 57 | 51 | 61 | 80 | 62 | Architecture: Phi-3.5-MoE has 16x3.8B parameters with 6.6B active parameters when using 2 experts. The model is a mixture-of-expert decoder-only Transformer model using the tokenizer with vocabulary size of 32,064. Inputs: Text. It is best suited for prompts using chat format. Context length: 128K tokens GPUs: 512 H100-80G Training time: 23 days Training data: 4.9T tokens Outputs: Generated text in response to the input Dates: Trained between April and August 2024 Status: This is a static model trained on an offline dataset with cutoff date October 2023 for publicly available data. Future versions of the tuned models may be released as we improve models. Supported languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian Release date: August 2024 Training Datasets Our training data includes a wide variety of sources, totaling 4.9 trillion tokens (including 10% multilingual), and is a combination of 1) publicly available documents filtered rigorously for quality, selected high-quality educational data, and code; 2) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.); 3) high quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge. As an example, the result of a game in premier league in a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for reasoning for the small size models. More details about data can be found in the Phi-3 Technical Report. Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English. English language varieties with less representation in the training data might experience worse performance than standard American English. Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 3 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. Limited Scope for Code: Majority of Phi-3 training data is based in Python and use common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Long Conversation: Phi-3 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi-3 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. We leveraged various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets to evaluate Phi-3.5 models' propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi-3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Note, however, while comprehensive red team evaluations were conducted across all models in the prior release of Phi models, red teaming was largely focused on Phi-3.5 MOE across multiple languages and risk categories for this release as it is the largest and more capable model of the three models. Details on prior red team evaluations across Phi models can be found in the Phi-3 Safety Post-Training paper. For this release, insights from red teaming indicate that the models may refuse to generate undesirable outputs in English, even when the request for undesirable output is in another language. Models may also be more susceptible to longer multi-turn jailbreak techniques across both English and non-English languages. These findings highlight the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low resource languages, and risk areas that account for cultural nuances where those languages are spoken. Hardware Note that by default, the Phi-3.5-MoE-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types: NVIDIA A100 NVIDIA A6000 NVIDIA H100 License The model is licensed under the MIT license. Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies. The prompt is the same as the CLIcK paper prompt. The experimental results below were given with maxtokens=512 (zero-shot), maxtokens=1024 (5-shot), temperature=0.01. No system prompt used. - GPT-4o: 2024-05-13 version - GPT-4o-mini: 2024-07-18 version - GPT-4-turbo: 2024-04-09 version - GPT-3.5-turbo: 2023-06-13 version Overall, the Phi-3.5 MoE model with just 6.6B active params outperforms GPT-3.5-Turbo. | Benchmarks | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:-------------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | CLIcK | 56.44 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 | | HAERAE 1.0 | 61.83 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 | | KMMLU (0-shot, CoT) | 47.43 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 | | KMMLU (5-shot) | 47.92 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 | | KMMLU-HARD (0-shot, CoT) | 25.34 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 | | KMMLU-HARD (5-shot) | 25.66 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 | | Average | 45.82 | 29.99 | 29.29 | 62.54 | 50.08 | 56.74 | 39.61 | CLIcK (Cultural and Linguistic Intelligence in Korean) Accuracy by supercategory | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Culture | 58.44 | 29.74 | 51.15 | 81.89 | 70.95 | 73.61 | 53.38 | | Language | 52.31 | 27.85 | 40.92 | 77.54 | 63.54 | 71.23 | 46 | | Overall | 56.44 | 29.12 | 47.82 | 80.46 | 68.5 | 72.82 | 50.98 | Accuracy by category | supercategory | category | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|:------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Culture | Economy | 77.97 | 28.81 | 66.1 | 94.92 | 83.05 | 89.83 | 64.41 | | Culture | Geography | 60.31 | 29.01 | 54.2 | 80.15 | 77.86 | 82.44 | 53.44 | | Culture | History | 33.93 | 30 | 29.64 | 66.92 | 48.4 | 46.4 | 31.79 | | Culture | Law | 52.51 | 22.83 | 44.29 | 70.78 | 57.53 | 61.19 | 41.55 | | Culture | Politics | 70.24 | 33.33 | 59.52 | 88.1 | 83.33 | 89.29 | 65.48 | | Culture | Pop Culture | 80.49 | 34.15 | 60.98 | 97.56 | 85.37 | 92.68 | 75.61 | | Culture | Society | 74.43 | 31.72 | 65.05 | 92.88 | 85.44 | 86.73 | 71.2 | | Culture | Tradition | 58.11 | 31.98 | 54.95 | 87.39 | 74.77 | 79.28 | 55.86 | | Language | Functional | 48 | 24 | 32.8 | 84.8 | 64.8 | 80 | 40 | | Language | Grammar | 29.58 | 23.33 | 22.92 | 57.08 | 42.5 | 47.5 | 30 | | Language | Textual | 73.33 | 33.33 | 59.65 | 91.58 | 80.7 | 87.37 | 62.11 | | category | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | General Knowledge | 39.77 | 28.41 | 34.66 | 77.27 | 53.41 | 66.48 | 40.91 | | History | 60.64 | 22.34 | 44.15 | 92.02 | 84.57 | 78.72 | 30.32 | | Loan Words | 70.41 | 35.5 | 63.31 | 79.88 | 76.33 | 78.11 | 59.17 | | Rare Words | 63.95 | 42.96 | 63.21 | 87.9 | 81.98 | 79.01 | 61.23 | | Reading Comprehension | 64.43 | 41.16 | 51.9 | 85.46 | 77.18 | 80.09 | 56.15 | | Standard Nomenclature | 66.01 | 32.68 | 58.82 | 88.89 | 75.82 | 79.08 | 53.59 | | Overall | 61.83 | 36.41 | 53.9 | 85.7 | 76.4 | 77.76 | 52.67 | | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Applied Science | 45.15 | 31.68 | 37.03 | 61.52 | 49.29 | 55.98 | 38.47 | | HUMSS | 49.75 | 26.47 | 37.29 | 69.45 | 56.59 | 63 | 40.9 | | Other | 47.24 | 31.01 | 39.15 | 63.79 | 52.35 | 57.53 | 40.19 | | STEM | 49.08 | 31.9 | 40.42 | 65.16 | 54.74 | 60.84 | 42.24 | | Overall | 47.43 | 30.82 | 38.54 | 64.26 | 52.63 | 58.75 | 40.3 | | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Applied Science | 45.9 | 29.98 | 19.24 | 61.47 | 48.66 | 56.85 | 40.22 | | HUMSS | 49.18 | 27.27 | 22.5 | 68.79 | 55.95 | 63.68 | 43.35 | | Other | 48.43 | 30.76 | 20.95 | 64.21 | 51.1 | 57.85 | 41.92 | | STEM | 49.21 | 30.73 | 19.55 | 65.28 | 53.29 | 61.08 | 44.43 | | Overall | 47.92 | 29.98 | 20.21 | 64.28 | 51.62 | 59.29 | 42.28 | | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024)| Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Applied Science | 25.83 | 26.17 | 26.25 | 37.12 | 22.25 | 29.17 | 21.07 | | HUMSS | 21.52 | 24.38 | 20.21 | 41.97 | 23.31 | 31.51 | 19.44 | | Other | 24.82 | 24.82 | 23.88 | 40.39 | 26.48 | 29.59 | 22.22 | | STEM | 28.18 | 26.91 | 24.64 | 39.82 | 26.36 | 32.18 | 20.91 | | Overall | 25.34 | 25.68 | 24.03 | 39.62 | 24.56 | 30.56 | 20.97 | | supercategory | Phi-3.5-MoE-Instruct | Phi-3.0-Mini-128k-Instruct (June2024) | Llama-3.1-8B-Instruct | GPT-4o | GPT-4o-mini | GPT-4-turbo | GPT-3.5-turbo | |:----------------|-----------------------:|--------------------------------:|------------------------:|---------:|--------------:|--------------:|----------------:| | Applied Science | 21 | 29 | 12 | 31 | 21 | 25 | 20 | | HUMSS | 22.88 | 19.92 | 14 | 43.98 | 23.47 | 33.53 | 19.53 | | Other | 25.13 | 27.27 | 12.83 | 39.84 | 28.34 | 29.68 | 23.22 | | STEM | 21.75 | 25.25 | 12.75 | 40.25 | 23.25 | 27.25 | 19.75 | | Overall | 25.66 | 25.73 | 15.81 | 40.94 | 24.63 | 31.12 | 21.19 |
phi-1_5
The language model Phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic...
swin-large-patch4-window7-224
Swin Transformer model trained on ImageNet-1k at resolution 224x224. It was introduced in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. and first released in this repository. Disclaimer: The team releasing Swin Transformer did not write a model card for this model so this model card has been written by the Hugging Face team. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.
speecht5_hifigan
unixcoder-base
deberta-base-mnli
git-base-coco
GIT (GenerativeImage2Text), base-sized, fine-tuned on COCO GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on COCO. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. Disclaimer: The team releasing GIT did not write a model card for this model so this model card has been written by the Hugging Face team. GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. The model has full access to (i.e. a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e. a causal attention mask is used for the text tokens) when predicting the next text token. - image and video captioning - visual question answering (VQA) on images and videos - even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text). You can use the raw model for image captioning. See the model hub to look for fine-tuned versions on a task that interests you. > We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a). => however this is for the model referred to as "GIT" in the paper, which is not open-sourced. This checkpoint is "GIT-base", which is a smaller variant of GIT trained on 10 million image-text pairs. We refer to the original repo regarding details for preprocessing during training. During validation, one resizes the shorter edge of each image, after which center cropping is performed to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. For evaluation results, we refer readers to the paper.
speecht5_tts
Multilingual-MiniLM-L12-H384
infoxlm-base
swin-tiny-patch4-window7-224
Swin Transformer model trained on ImageNet-1k at resolution 224x224. It was introduced in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. and first released in this repository. Disclaimer: The team releasing Swin Transformer did not write a model card for this model so this model card has been written by the Hugging Face team. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.
MiniLM-L12-H384-uncased
llmlingua-2-bert-base-multilingual-cased-meetingbank
Florence-2-large-ft
wavlm-base-sv
trocr-large-handwritten
deberta-v2-xlarge-mnli
deberta-v2-xlarge
wavlm-base-plus-sd
Phi-tiny-MoE-instruct
trocr-small-printed
swin-base-patch4-window7-224
biogpt
rad-dino
Phi-mini-MoE-instruct
Phi-4-reasoning
| | | |-------------------------|-------------------------------------------------------------------------------| | Developers | Microsoft Research | | Description | Phi-4-reasoning is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills as well as alignment data for safety and Responsible AI. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. | | Architecture | Base model same as previously released Phi-4, 14B parameters, dense decoder-only Transformer model | | Inputs | Text, best suited for prompts in the chat format | | Context length | 32k tokens | | GPUs | 32 H100-80G | | Training time | 2.5 days | | Training data | 16B tokens, ~8.3B unique tokens | | Outputs | Generated text in response to the input. Model responses have two sections, namely, a reasoning chain-of-thought block followed by a summarization block | | Dates | January 2025 – April 2025 | | Status | Static model trained on an offline dataset with cutoff dates of March 2025 and earlier for publicly available data | | Release date | April 30, 2025 | | License | MIT | | | | |-------------------------------|-------------------------------------------------------------------------| | Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. | | Out-of-Scope Use Cases | This model is designed and tested for math reasoning only. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. Review the Responsible AI Considerations section below for further guidance when choosing a use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. | > [!IMPORTANT] > To fully take advantage of the model's capabilities, inference must use `temperature=0.8`, `topk=50`, `topp=0.95`, and `dosample=True`. For more complex queries, set `maxnewtokens=32768` to allow for longer chain-of-thought (CoT). Given the nature of the training data, always use ChatML template with the following system prompt for inference: Phi-4-reasoning is also supported out-of-the-box by Ollama, llama.cpp, and any Phi-4 compatible framework. Our training data is a mixture of Q&A, chat format data in math, science, and coding. The chat prompts are sourced from filtered high-quality web data and optionally rewritten and processed through a synthetic data generation pipeline. We further include data to improve truthfulness and safety. We evaluated Phi-4-reasoning using the open-source Eureka evaluation suite and our own internal benchmarks to understand the model's capabilities. More specifically, we evaluate our model on: AIME 2025, 2024, 2023, and 2022: Math olympiad questions. GPQA-Diamond: Complex, graduate-level science questions. OmniMath: Collection of over 4000 olympiad-level math problems with human annotation. LiveCodeBench: Code generation benchmark gathered from competitive coding contests. 3SAT (3-literal Satisfiability Problem) and TSP (Traveling Salesman Problem): Algorithmic problem solving. FlenQA: Impact of prompt length on model performance. MMLU-Pro: Popular aggregated dataset for multitask language understanding. Phi-4-reasoning has adopted a robust safety post-training approach via supervised fine-tuning (SFT). This approach leverages a variety of both open-source and in-house generated synthetic prompts, with LLM-generated responses that adhere to rigorous Microsoft safety guidelines, e.g., User Understanding and Clarity, Security and Ethical Guidelines, Limitations, Disclaimers and Knowledge Scope, Handling Complex and Sensitive Topics, Safety and Respectful Engagement, Confidentiality of Guidelines and Confidentiality of Chain-of-Thoughts. Prior to release, Phi-4-reasoning followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by Phi-4-reasoning in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model's safety training including grounded-ness, jailbreaks, harmful content like hate and unfairness, violence, sexual content, or self-harm, and copyright violations for protected material. We further evaluate models on Toxigen, a benchmark designed to measure bias and toxicity targeted towards minority groups. Please refer to the technical report for more details on safety alignment. At the high-level overview of the model quality on representative benchmarks. For the tables below, higher numbers indicate better performance: | | AIME 24 | AIME 25 | OmniMath | GPQA-D | LiveCodeBench (8/1/24–2/1/25) | |-----------------------------|-------------|-------------|-------------|------------|-------------------------------| | Phi-4-reasoning | 75.3 | 62.9 | 76.6 | 65.8 | 53.8 | | Phi-4-reasoning-plus | 81.3 | 78.0 | 81.9 | 68.9 | 53.1 | | OpenThinker2-32B | 58.0 | 58.0 | — | 64.1 | — | | QwQ 32B | 79.5 | 65.8 | — | 59.5 | 63.4 | | EXAONE-Deep-32B | 72.1 | 65.8 | — | 66.1 | 59.5 | | DeepSeek-R1-Distill-70B | 69.3 | 51.5 | 63.4 | 66.2 | 57.5 | | DeepSeek-R1 | 78.7 | 70.4 | 85.0 | 73.0 | 62.8 | | o1-mini | 63.6 | 54.8 | — | 60.0 | 53.8 | | o1 | 74.6 | 75.3 | 67.5 | 76.7 | 71.0 | | o3-mini | 88.0 | 78.0 | 74.6 | 77.7 | 69.5 | | Claude-3.7-Sonnet | 55.3 | 58.7 | 54.6 | 76.8 | — | | Gemini-2.5-Pro | 92.0 | 86.7 | 61.1 | 84.0 | 69.2 | | | Phi-4 | Phi-4-reasoning | Phi-4-reasoning-plus | o3-mini | GPT-4o | |----------------------------------------|-------|------------------|-------------------|---------|--------| | FlenQA [3K-token subset] | 82.0 | 97.7 | 97.9 | 96.8 | 90.8 | | IFEval Strict | 62.3 | 83.4 | 84.9 | 91.5 | 81.8 | | ArenaHard | 68.1 | 73.3 | 79.0 | 81.9 | 75.6 | | HumanEvalPlus | 83.5 | 92.9 | 92.3 | 94.0| 88.0 | | MMLUPro | 71.5 | 74.3 | 76.0 | 79.4 | 73.0 | | Kitab No Context - Precision With Context - Precision No Context - Recall With Context - Recall | 19.3 88.5 8.2 68.1 | 23.2 91.5 4.9 74.8 | 27.6 93.6 6.3 75.4 | 37.9 94.0 4.2 76.1 | 53.7 84.7 20.3 69.2 | | Toxigen Discriminative Toxic category Neutral category | 72.6 90.0 | 86.7 84.7 | 77.3 90.5 | 85.4 88.7 | 87.6 85.1 | | PhiBench 2.21 | 58.2 | 70.6 | 74.2 | 78.0| 72.4 | Overall, Phi-4-reasoning, with only 14B parameters, performs well across a wide range of reasoning tasks, outperforming significantly larger open-weight models such as DeepSeek-R1 distilled 70B model and approaching the performance levels of full DeepSeek R1 model. We also test the models on multiple new reasoning benchmarks for algorithmic problem solving and planning, including 3SAT, TSP, and BA-Calendar. These new tasks are nominally out-of-domain for the models as the training process did not intentionally target these skills, but the models still show strong generalization to these tasks. Furthermore, when evaluating performance against standard general abilities benchmarks such as instruction following or non-reasoning tasks, we find that our new models improve significantly from Phi-4, despite the post-training being focused on reasoning skills in specific domains. Like other language models, Phi-4-reasoning can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. Phi-4-reasoning is not intended to support multilingual use. Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. Election Information Reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region. Limited Scope for Code: Majority of Phi-4-reasoning training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include: Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. Data Summary: https://huggingface.co/microsoft/Phi-4-reasoning/blob/main/datasummarycard.md
OmniParser-v2.0
📢 [GitHub Repo] [OmniParser V2 Blog Post] Huggingface demo Model Summary OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent. Training Datasets include: 1) an interactable icon detection dataset, which was curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset, designed to associate each UI element with its corresponding function. This model hub includes a finetuned version of YOLOv8 and a finetuned Florence-2 base model on the above dataset respectively. For more details of the models used and finetuning, please refer to the paper. What's new in V2? - Larger and cleaner set of icon caption + grounding dataset - 60% improvement in latency compared to V1. Avg latency: 0.6s/frame on A100, 0.8s on single 4090. - Strong performance: 39.6 average accuracy on ScreenSpot Pro - Your agent only need one tool: OmniTool. Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports out of the box the following large language models - OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. Check out our github repo for details. Responsible AI Considerations Intended Use - OmniParser is designed to be able to convert unstructured screenshot image into structured list of elements including interactable regions location and captions of icons on its potential functionality. - OmniParser is intended to be used in settings where users are already trained on responsible analytic approaches and critical reasoning is expected. OmniParser is capable of providing extracted information from the screenshot, however human judgement is needed for the output of OmniParser. - OmniParser is intended to be used on various screenshots, which includes both PC and Phone, and also on various applications. limitations - OmniParser is designed to faithfully convert screenshot image into structured elements of interactable regions and semantics of the screen, while it does not detect harmful content in its input (like users have freedom to decide the input of any LLMs), users are expected to provide input to the OmniParser that is not harmful. - While OmniParser only converts screenshot image into texts, it can be used to construct an GUI agent based on LLMs that is actionable. When developing and operating the agent using OmniParser, the developers need to be responsible and follow common safety standard. License Please note that icondetect model is under AGPL license, and iconcaption is under MIT license. Please refer to the LICENSE file in the folder of each model.
Phi-3-mini-4k-instruct-gguf
Phi-3-vision-128k-instruct
🎉 Phi-3.5: [[mini-instruct]](https://huggingface.co/microsoft/Phi-3.5-mini-instruct); [[MoE-instruct]](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) ; [[vision-instruct]](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. + Phi-3 Microsoft Blog + Phi-3 Technical Report + Phi-3 on Azure AI Studio + Phi-3 Cookbook | | Short Context | Long Context | | ------- | ------------- | ------------ | | Mini | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) ; [[GGUF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx)| | Small | 8K [[HF]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda)| | Medium | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda)| | Vision | | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda)| The model is intended for broad commercial and research use in English. The model provides uses for general purpose AI systems and applications with visual and text input capabilities which require 1) memory/compute constrained environments; 2) latency bound scenarios; 3) general image understanding; 4) OCR; 5) chart and table understanding. Our model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. Phi-3-Vision-128K-Instruct has been integrated in the development version (4.40.2) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following: When loading the model, ensure that `trustremotecode=True` is passed as an argument of the `frompretrained()` function. Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source. The current `transformers` version can be verified with: `pip list | grep transformers`. Phi-3-Vision-128K-Instruct is also available in Azure AI Studio. Given the nature of the training data, the Phi-3-Vision-128K-Instruct model is best suited for a single image input wih prompts using the chat format as follows. You can provide the prompt as a single image with a generic template as follow: where the model generates the text after ` ` . In case of multi-turn conversation, the prompt can be formatted as follows: This code snippets show how to get quickly started with running the model on a GPU: How to finetune? We recommend user to take a look at the Phi-3 CookBook finetuning recipe for Vision Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: + Quality of Service: The Phi models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Limited Scope for Code: Majority of Phi-3 training data is based in Python and use common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. + High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. + Identification of individuals: models with vision capabilities may have the potential to uniquely identify individuals in images. Safety post-training steers the model to refuse such requests, but developers should consider and implement, as appropriate, additional mitigations or user consent flows as required in their respective jurisdiction, (e.g., building measures to blur faces in image inputs before processing. Architecture: Phi-3-Vision-128K-Instruct has 4.2B parameters and contains image encoder, connector, projector, and Phi-3 Mini language model. Inputs: Text and Image. It’s best suited for prompts using the chat format. Context length: 128K tokens GPUs: 512 H100-80G Training time: 1.5 days Training data: 500B vision and text tokens Outputs: Generated text in response to the input Dates: Our models were trained between February and April 2024 Status: This is a static model trained on an offline text dataset with cutoff date Mar 15, 2024. Future versions of the tuned models may be released as we improve models. Release Type: Open weight release Release dates: The model weight is released on May 21, 2024. Our training data includes a wide variety of sources, and is a combination of 1) publicly available documents filtered rigorously for quality, selected high-quality educational data and code; 2) selected high-quality image-text interleave; 3) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.), newly created image data, e.g., chart/table/diagram/slides; 4) high quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data. More details can be found in the Phi-3 Technical Report. To understand the capabilities, we compare Phi-3-Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform. |Benchmark|Phi-3 Vision-128K-In|LlaVA-1.6 Vicuna-7B|QWEN-VL Chat|Llama3-Llava-Next-8B|Claude-3 Haiku|Gemini 1.0 Pro V|GPT-4V-Turbo| |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------| |MMMU|40.4|34.2|39.0|36.4|40.7|42.0|55.5| |MMBench|80.5|76.3|75.8|79.4|62.4|80.0|86.1| |ScienceQA|90.8|70.6|67.2|73.7|72.0|79.7|75.7| |MathVista|44.5|31.5|29.4|34.8|33.2|35.0|47.5| |InterGPS|38.1|20.5|22.3|24.6|32.1|28.6|41.0| |AI2D|76.7|63.1|59.8|66.9|60.3|62.8|74.7| |ChartQA|81.4|55.0|50.9|65.8|59.3|58.0|62.3| |TextVQA|70.9|64.6|59.4|55.7|62.7|64.7|68.1| |POPE|85.8|87.2|82.6|87.0|74.4|84.2|83.7| Hardware Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types: NVIDIA A100 NVIDIA A6000 NVIDIA H100 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.
Phi-3-medium-128k-instruct
wavlm-base
The base model pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. Note: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model speech recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for more in-detail explanation of how to fine-tune the model. Paper: WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei Abstract Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The original model can be found under https://github.com/microsoft/unilm/tree/master/wavlm. This is an English pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be used in inference. The model was pre-trained in English and should therefore perform well only in English. The model has been shown to work well on the SUPERB benchmark. Note: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence of phonemes before fine-tuning. To fine-tune the model for speech recognition, see the official speech recognition example. To fine-tune the model for speech classification, see the official audio classification example. The model was contributed by cywang and patrickvonplaten.
xclip-base-patch16
X-CLIP model (base-sized, patch resolution of 16) trained fully-supervised on Kinetics-400. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in this repository. This model was trained using 8 frames per video, at a resolution of 224x224. Disclaimer: The team releasing X-CLIP did not write a model card for this model so this model card has been written by the Hugging Face team. X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval. You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you. The exact details of preprocessing during training can be found here. The exact details of preprocessing during validation can be found here. During validation, one resizes the shorter edge of each frame, after which center cropping is performed to a fixed-size resolution (like 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. This model achieves a top-1 accuracy of 83.8% and a top-5 accuracy of 95.7%.
mpnet-base
deberta-v3-xsmall
GRIN-MoE
dit-base
BioGPT-Large
Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms. If you find BioGPT useful in your research, please cite the following paper:
trocr-small-handwritten
trocr-large-stage1
Florence-2-base-ft
Phi-3-medium-4k-instruct
resnet-152
BiomedVLP-CXR-BERT-specialized
layoutlmv3-large
layoutlm-base-cased
swinv2-tiny-patch4-window8-256
Swin Transformer v2 model pre-trained on ImageNet-1k at resolution 256x256. It was introduced in the paper Swin Transformer V2: Scaling Up Capacity and Resolution by Liu et al. and first released in this repository. Disclaimer: The team releasing Swin Transformer v2 did not write a model card for this model so this model card has been written by the Hugging Face team. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally. Swin Transformer v2 adds 3 main improvements: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.
udop-large
TRELLIS-text-xlarge
trocr-base-stage1
git-base
Phi-4-mini-reasoning
Phi-4-mini-reasoning is a lightweight open model built upon synthetic data with a focus on high-quality, reasoning dense data further finetuned for more advanced math reasoning capabilities. The model belongs to the Phi-4 model family and supports 128K token context length. 📰 Phi-4-mini-reasoning Blog, and Developer Article 📖 Phi-4-mini-reasoning Technical Report | HF paper 👩🍳 Phi Cookbook 🏡 Phi Portal 🖥️ Try It Azure 🎉Phi-4 models: [Phi-4-reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx] Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks under memory/compute constrained environments and latency bound scenarios. Some of the use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios. These models excel at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking. This model is designed and tested for math reasoning only. It is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models, as well as performance difference across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. This release of Phi-4-mini-reasoning addresses user feedback and market demand for a compact reasoning model. It is a compact transformer-based language model optimized for mathematical reasoning, built to deliver high-quality, step-by-step problem solving in environments where computing or latency is constrained. The model is fine-tuned with synthetic math data from a more capable model (much larger, smarter, more accurate, and better at following instructions), which has resulted in enhanced reasoning performance. Phi-4-mini-reasoning balances reasoning ability with efficiency, making it potentially suitable for educational applications, embedded tutoring, and lightweight deployment on edge or mobile systems. If a critical issue is identified with Phi-4-mini-reasoning, it should be promptly reported through the MSRC Researcher Portal or [email protected] To understand the capabilities, the 3.8B parameters Phi-4-mini-reasoning model was compared with a set of models over a variety of reasoning benchmarks. A high-level overview of the model quality is as follows: | Model | AIME | MATH-500 | GPQA Diamond | |------------------------------------|-------|----------|--------------| | o1-mini | 63.6 | 90.0 | 60.0 | | DeepSeek-R1-Distill-Qwen-7B | 53.3 | 91.4 | 49.5 | | DeepSeek-R1-Distill-Llama-8B | 43.3 | 86.9 | 47.3 | | Bespoke-Stratos-7B | 20.0 | 82.0 | 37.8 | | OpenThinker-7B | 31.3 | 83.0 | 42.4 | | Llama-3.2-3B-Instruct | 6.7 | 44.4 | 25.3 | | Phi-4-Mini (base model, 3.8B) | 10.0 | 71.8 | 36.9 | |Phi-4-mini-reasoning (3.8B) | 57.5 | 94.6 | 52.0 | Overall, the model with only 3.8B-param achieves a similar level of multilingual language understanding and reasoning ability as much larger models. However, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much factual knowledge, therefore, users may experience factual incorrectness. However, it may be possible to resolve such weakness by augmenting Phi-4 with a search engine, particularly when using the model under RAG settings. Phi-4-mini-reasoning supports a vocabulary size of up to `200064` tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size. Given the nature of the training data, the Phi-4-mini-instruct model is best suited for prompts using specific formats. Below are the two primary formats: This format is used for general conversation and instructions: Phi-4-mini-reasoning has been integrated in the `4.51.3` version of `transformers`. The current `transformers` version can be verified with: `pip list | grep transformers`. Python 3.8 and 3.10 will work best. List of required packages: Phi-4-mini-reasoning is also available in Azure AI Studio After obtaining the Phi-4-mini-instruct model checkpoints, users can use this sample code for inference. + Architecture: Phi-4-mini-reasoning shares the same architecture as Phi-4-Mini, which has 3.8B parameters and is a dense decoder-only Transformer model. When compared with Phi-3.5-Mini, the major changes with Phi-4-Mini are 200K vocabulary, grouped-query attention, and shared input and output embedding. + Inputs: Text. It is best suited for prompts using the chat format. + Context length: 128K tokens + GPUs: 128 H100-80G + Training time: 2 days + Training data: 150B tokens + Outputs: Generated text + Dates: Trained in February 2024 + Status: This is a static model trained on offline datasets with the cutoff date of February 2025 for publicly available data. + Supported languages: English + Release date: April 2025 The training data for Phi-4-mini-reasoning consists exclusively of synthetic mathematical content generated by a stronger and more advanced reasoning model, Deepseek-R1. The objective is to distill knowledge from this model. This synthetic dataset comprises over one million diverse math problems spanning multiple levels of difficulty (from middle school to Ph.D. level). For each problem in the synthetic dataset, eight distinct solutions (rollouts) were sampled, and only those verified as correct were retained, resulting in approximately 30 billion tokens of math content. The dataset integrates three primary components: 1) a curated selection of high-quality, publicly available math questions and a part of the SFT(Supervised Fine-Tuning) data that was used to train the base Phi-4-Mini model; 2) an extensive collection of synthetic math data generated by the Deepseek-R1 model, designed specifically for high-quality supervised fine-tuning and model distillation; and 3) a balanced set of correct and incorrect answers used to construct preference data aimed at enhancing Phi-4-mini-reasoning's reasoning capabilities by learning more effective reasoning trajectories Hardware Note that by default, the Phi-4-mini-reasoning model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types: NVIDIA A100 NVIDIA H100 If you want to run the model on: NVIDIA V100 or earlier generation GPUs: call AutoModelForCausalLM.frompretrained() with attnimplementation="eager" The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed to do the safety alignment is a combination of SFT, DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. Phi-4-Mini-Reasoning was developed in accordance with Microsoft's responsible AI principles. Potential safety risks in the model’s responses were assessed using the Azure AI Foundry’s Risk and Safety Evaluation framework, focusing on harmful content, direct jailbreak, and model groundedness. The Phi-4-Mini-Reasoning Model Card contains additional information about our approach to safety and responsible AI considerations that developers should be aware of when using this model. Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: + Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance as well as performance disparities across non-English. English language varieties with less representation in the training data might experience worse performance than standard American English. + Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case. + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Election Information Reliability : The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region. + Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses. + Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift. Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. + High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. License The model is licensed under the MIT license. Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies. We include a brief word on methodology here - and in particular, how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks to ensure it is always an apples-to-apples comparison when comparing different models. Indeed, this is our default approach, and is the case in the vast majority of models we have run to date. For all benchmarks, we consider using the same generation configuration such as max sequence length (32768), the same temperature for the fair comparison. Benchmark datasets We evaluate the model with three of the most popular math benchmarks where the strongest reasoning models are competing together. Specifically: - Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving. - AIME 2024: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning. - GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark focuses on evaluating the model's ability to understand and solve a wide range of mathematical questions, including both straightforward calculations and more intricate problem-solving tasks.
beit-base-patch16-224
llava-med-v1.5-mistral-7b
dit-base-finetuned-rvlcdip
phi-4-gguf
| | | |-------------------------|-------------------------------------------------------------------------------| | Developers | Microsoft Research | | Description | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. `phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures | | Architecture | 14B parameters, dense decoder-only Transformer model | | Inputs | Text, best suited for prompts in the chat format | | Context length | 16K tokens | | GPUs | 1920 H100-80G | | Training time | 21 days | | Training data | 9.8T tokens | | Outputs | Generated text in response to input | | Dates | October 2024 – November 2024 | | Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data | | Release date | December 12, 2024 | | License | MIT | | | | |-------------------------------|-------------------------------------------------------------------------| | Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. | | Out-of-Scope Use Cases | Our models is not specifically designed or evaluated for all downstream purposes, thus: 1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. 2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. 3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. | Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from: 1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code. 2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.). 4. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge. We evaluated `phi-4` using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically: MMLU: Popular aggregated dataset for multitask language understanding. `phi-4` has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories. Prior to release, `phi-4` followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by `phi-4` in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks. Please refer to the technical report for more details on safety alignment. To understand the capabilities, we compare `phi-4` with a set of models over OpenAI’s SimpleEval benchmark. At the high-level overview of the model quality on representative benchmarks. For the table below, higher numbers indicate better performance: | Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o | |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------| | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 | | Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 | | Math | MGSM MATH | 80.6 80.4 | 53.5 44.6 | 79.6 75.6 | 86.5 73.0 | 89.1 66.3 | 87.3 80.0 | 90.4 74.6 | | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 | | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 | | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 | \ These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B. Given the nature of the training data, `phi-4` is best suited for prompts using the chat format as follows: Install `llama.cpp` according to their documentation and use the following code snippet to interact with `phi-4` (4-bit quantized): Like other language models, `phi-4` can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. `phi-4` is not intended to support multilingual use. Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. Limited Scope for Code: Majority of `phi-4` training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include: Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. Data Summary: https://huggingface.co/microsoft/phi-4-gguf/blob/main/datasummarycard.md
bitnet-b1.58-2B-4T
This repository contains the weights for BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale, developed by Microsoft Research. Trained...
Phi-4-reasoning-plus
layoutlmv2-large-uncased
LayoutLMv2 Multimodal (text + layout/format + image) pre-training for document AI Introduction LayoutLMv2 is an improved version of LayoutLM with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. It outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including , including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou, ACL 2021
DialoGPT-large
codebert-base-mlm
unixcoder-base-nine
xclip-large-patch14
resnet-101
ResNet model pre-trained on ImageNet-1k at resolution 224x224. It was introduced in the paper Deep Residual Learning for Image Recognition by He et al. Disclaimer: The team releasing ResNet did not write a model card for this model so this model card has been written by the Hugging Face team. ResNet (Residual Network) is a convolutional neural network that democratized the concepts of residual learning and skip connections. This enables to train much deeper models. This is ResNet v1.5, which differs from the original model: in the bottleneck blocks which require downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. This difference makes ResNet50 v1.5 slightly more accurate (\~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec) according to Nvidia. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.
cvt-13
codereviewer
resnet-34
phi-1
swin-base-patch4-window12-384-in22k
conditional-detr-resnet-50
kosmos-2.5
MediPhi-Instruct
Phi-3-small-128k-instruct
renderformer-v1.1-swin-large
layoutxlm-base
tapex-large
deberta-v2-xxlarge
swin-base-patch4-window7-224-in22k
Magma-8B
trocr-small-stage1
rad-dino-maira-2
Fara-7B
deberta-large
Orca-2-13b
Orca 2 is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning. 1. This is a research model, intended to show that we can use capable models and complex workflows (advanced prompts, multiple calls) to create synthetic data that can teach Small Language Models (SLMs) new capabilities. We chose reasoning because it is a widely useful capability that SLMs lack. 2. The model is not optimized for chat and has not been trained with RLHF or DPO. It is best used after being finetuned for chat or for a specific task. 3. Beyond reasoning, the model inherits capabilities and limitations of its base (LLAMA-2 base). We have already seen that the benefits of the Orca training can be applied to other base model too. We make Orca 2's weights publicly available to support further research on the development, evaluation, and alignment of SLMs. + Orca 2 is built for research purposes only. + The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models. + Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations. Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper. Please refer to LLaMA-2 technical report for details on the model architecture. Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models or limitation caused by its training process, including: Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair. Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses. Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information. Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction. Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether small models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller sizes and hence reduced memorization capacities. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic. Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content. Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning. System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions. Zero-Shot Settings: Orca 2 was trained on data that mostly simulate zero-shot settings. While the model demonstrate very strong performance in zero-shot settings, it does not show the same gains of using few-shot learning compared to other, specially larger, models. Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks. This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application. The usage of Azure AI Content Safety on top of model prediction is strongly encouraged and can help prevent content harms. Azure AI Content Safety is a content moderation platform that uses AI to keep your content safe. By integrating Orca 2 with Azure AI Content Safety, we can moderate the model output by scanning it for sexual content, violence, hate, and self-harm with multiple severity levels and multi-lingual detection.
UserLM-8b
Unlike typical LLMs that are trained to play the role of the "assistant" in conversation, we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a...
git-large-coco
BiomedParse
layoutlm-large-uncased
maira-2
bitnet-b1.58-2B-4T-gguf
This repository contains the weights for BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale, developed by Microsoft Research. Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency). ➡️ Technical Report: BitNet b1.58 2B4T Technical Report ➡️ Official Inference Code: microsoft/BitNet (bitnet.cpp) Several versions of the model weights are available on Hugging Face: `microsoft/bitnet-b1.58-2B-4T`: Contains the packed 1.58-bit weights optimized for efficient inference. Use this for deployment. `microsoft/bitnet-b1.58-2B-4T-bf16`: Contains the master weights in BF16 format. Use this only for training or fine-tuning purposes. `microsoft/bitnet-b1.58-2B-4T-gguf` (This repository): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference. Architecture: Transformer-based, modified with `BitLinear` layers (BitNet framework). Uses Rotary Position Embeddings (RoPE). Uses squared ReLU (ReLU²) activation in FFN layers. Employs `subln` normalization. No bias terms in linear or normalization layers. Quantization: Native 1.58-bit weights and 8-bit activations (W1.58A8). Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass. Activations are quantized to 8-bit integers using absmax quantization (per-token). Crucially, the model was trained from scratch with this quantization scheme, not post-training quantized. Parameters: ~2 Billion Training Tokens: 4 Trillion Context Length: Maximum sequence length of 4096 tokens. Recommendation: For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage. Training Stages: 1. Pre-training: Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule. 2. Supervised Fine-tuning (SFT): Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning. 3. Direct Preference Optimization (DPO): Aligned with human preferences using preference pairs. Tokenizer: LLaMA 3 Tokenizer (vocab size: 128,256). > Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library, even with the required fork. > > The current execution paths within transformers do not contain the specialized, highly optimized computational kernels required to leverage the advantages of the BitNet architecture. Running the model via transformers will likely result in inference speeds and energy usage comparable to, or potentially worse than, standard full-precision models within this framework on both CPU and GPU. > > While you might observe reduced memory usage due to the quantized weights, the primary computational efficiency benefits are not accessible through this standard transformers usage path. > > For achieving the efficiency benefits demonstrated in the technical paper, you MUST use the dedicated C++ implementation: bitnet.cpp. Please refer to the bitnet.cpp GitHub repository for detailed compilation steps, usage examples, and command-line options. BitNet b1.58 2B4T was evaluated against leading open-weight full-precision LLMs of similar size. Below are the key results (all models are instruction-tuned versions): | Benchmark | LLaMA 3.2 1B | Gemma-3 1B | Qwen2.5 1.5B | SmolLM2 1.7B | MiniCPM 2B | BitNet b1.58 2B | |--------------------------------|--------------|------------|--------------|--------------|------------|---------------------| | Memory (Non-emb) | 2GB | 1.4GB | 2.6GB | 3.2GB | 4.8GB | 0.4GB | | Latency (CPU Decoding) | 48ms | 41ms | 65ms | 67ms | 124ms | 29ms | | Energy (Estimated) | 0.258J | 0.186J | 0.347J | 0.425J | 0.649J | 0.028J | | Training Tokens (Pre-train)| 9T | 2T | 18T | 11T | 1.1T | 4T | | ARC-Challenge | 37.80 | 38.40 | 46.67 | 43.52 | 44.80 | 49.91 | | ARC-Easy | 63.17 | 63.13 | 76.01 | 62.92 | 72.14 | 74.79 | | OpenbookQA | 34.80 | 38.80 | 40.80 | 46.00 | 40.20 | 41.60 | | BoolQ | 64.65 | 74.22 | 78.04 | 75.78 | 80.67 | 80.18 | | HellaSwag | 60.80 | 57.69 | 68.28 | 71.71 | 70.81 | 68.44 | | PIQA | 74.21 | 71.93 | 76.12 | 76.12 | 76.66 | 77.09 | | WinoGrande | 59.51 | 58.48 | 62.83 | 68.98 | 61.80 | 71.90 | | CommonsenseQA | 58.48 | 42.10 | 76.41 | 63.55 | 71.74 | 71.58 | | TruthfulQA | 43.80 | 38.66 | 46.67 | 39.90 | 41.41 | 45.31 | | TriviaQA | 37.60 | 23.49 | 38.37 | 45.97 | 34.13 | 33.57 | | MMLU | 45.58 | 39.91 | 60.25 | 49.24 | 51.82 | 53.17 | | HumanEval+ | 31.10 | 37.20 | 50.60 | 28.00 | 43.90 | 38.40 | | GSM8K | 38.21 | 31.16 | 56.79 | 45.11 | 4.40 | 58.38 | | MATH-500 | 23.00 | 42.00 | 53.00 | 17.60 | 14.80 | 43.40 | | IFEval | 62.71 | 66.67 | 50.12 | 57.91 | 36.81 | 53.48 | | MT-bench | 5.43 | 6.40 | 6.12 | 5.50 | 6.57 | 5.85 | | Average | 44.90 | 43.74 | 55.23 | 48.70 | 42.05 | 54.19 | License The model weights and code are released under the MIT License. Disclaimer This model is intended for research and development purposes. While efforts have been made to align it using SFT and DPO, it may still produce outputs that are unexpected, biased, or inaccurate. Please use responsibly.
Phi-3-small-8k-instruct
LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned
swinv2-large-patch4-window12-192-22k
xclip-base-patch16-zero-shot
X-CLIP model (base-sized, patch resolution of 16) trained on Kinetics-400. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in this repository. This model was trained using 32 frames per video, at a resolution of 224x224. Disclaimer: The team releasing X-CLIP did not write a model card for this model so this model card has been written by the Hugging Face team. X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval. You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you. The exact details of preprocessing during training can be found here. The exact details of preprocessing during validation can be found here. During validation, one resizes the shorter edge of each frame, after which center cropping is performed to a fixed-size resolution (like 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. This model achieves a zero-shot top-1 accuracy of 44.6% on HMDB-51, 72.0% on UCF-101 and 65.2% on Kinetics-600.
swin-large-patch4-window12-384
xclip-base-patch16-16-frames
CodeGPT-small-py
swin-base-patch4-window12-384
markuplm-base
beit-large-finetuned-ade-640-640
Phi-4-mini-flash-reasoning
focalnet-tiny
bitnet-b1.58-2B-4T-bf16
MediPhi-Clinical
swin-small-patch4-window7-224
swinv2-base-patch4-window12-192-22k
swinv2-base-patch4-window8-256
cvt-w24-384-22k
BiomedNLP-BiomedELECTRA-base-uncased-abstract
xclip-base-patch32-16-frames
beit-large-patch16-224-pt22k-ft22k
speecht5_vc
beit-base-patch16-224-pt22k
BiomedNLP-BiomedBERT-large-uncased-abstract
renderformer-v1-base
layoutlmv3-base-chinese
BiomedVLP-BioViL-T
xtremedistil-l6-h256-uncased
speecht5_asr
TRELLIS-text-large
swinv2-large-patch4-window12to24-192to384-22kto1k-ft
GUI-Actor-7B-Qwen2.5-VL
Orca-2-7b
Orca 2 is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning. 1. This is a research model, intended to show that we can use capable models and complex workflows (advanced prompts, multiple calls) to create synthetic data that can teach Small Language Models (SLMs) new capabilities. We chose reasoning because it is a widely useful capability that SLMs lack. 2. The model is not optimized for chat and has not been trained with RLHF or DPO. It is best used after being finetuned for chat or for a specific task. 3. Beyond reasoning, the model inherits capabilities and limitations of its base (LLAMA-2 base). We have already seen that the benefits of the Orca training can be applied to other base model too. We make Orca 2's weights publicly available to support further research on the development, evaluation, and alignment of SLMs. + Orca 2 is built for research purposes only. + The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models. + Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations. Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper. Please refer to LLaMA-2 technical report for details on the model architecture. Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models or limitation caused by its training process, including: Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair. Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses. Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information. Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction. Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether small models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller sizes and hence reduced memorization capacities. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic. Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content. Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning. System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions. Zero-Shot Settings: Orca 2 was trained on data that mostly simulate zero-shot settings. While the model demonstrate very strong performance in zero-shot settings, it does not show the same gains of using few-shot learning compared to other, specially larger, models. Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks. This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application. The usage of Azure AI Content Safety on top of model prediction is strongly encouraged and can help preventing some of content harms. Azure AI Content Safety is a content moderation platform that uses AI to moderate content. By having Azure AI Content Safety on the output of Orca 2, the model output can be moderated by scanning it for different harm categories including sexual content, violence, hate, and self-harm with multiple severity levels and multi-lingual detection.
beit-large-patch16-224
prophetnet-large-uncased
LLM2CLIP-Openai-L-14-336
TRELLIS-text-base
beit-base-finetuned-ade-640-640
BEiT model pre-trained in a self-supervised fashion on ImageNet-21k (14 million images, 21,841 classes) at resolution 224x224, and fine-tuned on ADE20k (an important benchmark for semantic segmentation of images) at resolution 640x640. It was introduced in the paper BEIT: BERT Pre-Training of Image Transformers by Hangbo Bao, Li Dong and Furu Wei and first released in this repository. Disclaimer: The team releasing BEiT did not write a model card for this model so this model card has been written by the Hugging Face team. The BEiT model is a Vision Transformer (ViT), which is a transformer encoder model (BERT-like). In contrast to the original ViT model, BEiT is pretrained on a large collection of images in a self-supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. The pre-training objective for the model is to predict visual tokens from the encoder of OpenAI's DALL-E's VQ-VAE, based on masked patches. Next, the model was fine-tuned in a supervised fashion on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Contrary to the original ViT models, BEiT models do use relative position embeddings (similar to T5) instead of absolute position embeddings, and perform classification of images by mean-pooling the final hidden states of the patches, instead of placing a linear layer on top of the final hidden state of the [CLS] token. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: for semantic segmentation, one can just add one of the decode heads available in the mmseg library for example, and fine-tune the model in a supervised fashion on annotated images. This is what the authors did: they fine-tuned BEiT with an UperHead segmentation decode head, allowing it to obtain SOTA results on important benchmarks such as ADE20k and CityScapes. You can use the raw model for semantic segmentation of images. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model for semantic segmentation: Currently, both the feature extractor and model support PyTorch. This BEiT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ADE20k, a dataset consisting of thousands of annotated images and 150 classes. The exact details of preprocessing of images during training/validation can be found here. Images are cropped and padded to the same resolution (640x640) and normalized across the RGB channels with the ImageNet mean and standard deviation. For all pre-training related hyperparameters, we refer to page 15 of the original paper. For evaluation results on several image classification benchmarks, we refer to tables 1 and 2 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Of course, increasing the model size will result in better performance.
xtremedistil-l6-h384-uncased
unispeech-sat-base-plus-sv
unispeech-sat-large
tapex-large-finetuned-wtq
Dayhoff-170m-UR90
table-transformer-structure-recognition-v1.1-pub
swinv2-base-patch4-window16-256
MAI-DS-R1-FP8
swin-large-patch4-window12-384-in22k
BioGPT-Large-PubMedQA
swinv2-small-patch4-window8-256
GUI-Actor-3B-Qwen2.5-VL
git-large-textcaps
unispeech-sat-base-sv
llava-rad
resnet-26
phi-2-pytdml
swinv2-large-patch4-window12to16-192to256-22kto1k-ft
xtremedistil-l12-h384-uncased
BiomedNLP-KRISSBERT-PubMed-UMLS-EL
unispeech-sat-base-plus
deberta-v2-xxlarge-mnli
unispeech-large-1500h-cv
swinv2-small-patch4-window16-256
git-large-vqav2
git-base-textvqa
LLM2CLIP-Openai-L-14-224
Phi-3-mini-4k-instruct-onnx
cvt-21
git-base-vqav2
NextCoder-32B
MediPhi
markuplm-large
unispeech-sat-base
beit-large-patch16-512
swinv2-base-patch4-window12to16-192to256-22kto1k-ft
Promptist
Promptist: reinforcement learning for automatic prompt optimization News - [Demo Release] Dec, 2022: Demo at HuggingFace Space - [Model Release] Dec, 2022: link - [Paper Release] Dec, 2022: Optimizing Prompts for Text-to-Image Generation > - Language models serve as a prompt interface that optimizes user input into model-preferred prompts. > - Learn a language model for automatic prompt optimization via reinforcement learning. You can try the online demo at https://huggingface.co/spaces/microsoft/Promptist. `[Note]` the online demo at HuggingFace Space is using CPU, so slow generation speed would be expected. Please load the model locally with GPUs for faster generation.
SportsBERT
prophetnet-large-uncased-cnndm
unispeech-sat-large-sv
prophetnet-large-uncased-squad-qg
OmniParser
Model Summary OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent. Training Datasets include: 1) an interactable icon detection dataset, which was curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset, designed to associate each UI element with its corresponding function. This model hub includes a finetuned version of YOLOv8 and a finetuned BLIP-2 model on the above dataset respectively. For more details of the models used and finetuning, please refer to the paper. Responsible AI Considerations Intended Use - OmniParser is designed to be able to convert unstructured screenshot image into structured list of elements including interactable regions location and captions of icons on its potential functionality. - OmniParser is intended to be used in settings where users are already trained on responsible analytic approaches and critical reasoning is expected. OmniParser is capable of providing extracted information from the screenshot, however human judgement is needed for the output of OmniParser. - OmniParser is intended to be used on various screenshots, which includes both PC and Phone, and also on various applications. limitations - OmniParser is designed to faithfully convert screenshot image into structured elements of interactable regions and semantics of the screen, while it does not detect harmful content in its input (like users have freedom to decide the input of any LLMs), users are expected to provide input to the OmniParser that is not harmful. - While OmniParser only converts screenshot image into texts, it can be used to construct an GUI agent based on LLMs that is actionable. When developing and operating the agent using OmniParser, the developers need to be responsible and follow common safety standard. - For OmniPaser-BLIP2, it may incorrectly infer the gender or other sensitive attribute (e.g., race, religion etc.) of individuals in icon images. Inference of sensitive attributes may rely upon stereotypes and generalizations rather than information about specific individuals and are more likely to be incorrect for marginalized people. Incorrect inferences may result in significant physical or psychological injury or restrict, infringe upon or undermine the ability to realize an individual’s human rights. We do not recommend use of OmniParser in any workplace-like use case scenario. License Please note that icondetect model is under AGPL license, and iconcaptionblip2 & iconcaptionflorence is under MIT license. Please refer to the LICENSE file in the folder of each model.
table-transformer-structure-recognition-v1.1-fin
GODEL-v1_1-large-seq2seq
unispeech-sat-base-plus-sd
trocr-large-str
NextCoder-14B
xclip-base-patch16-ucf-8-shot
git-large
MediPhi-MedCode
swinv2-base-patch4-window12to24-192to384-22kto1k-ft
deberta-xlarge
LLM2CLIP-Openai-B-16
swin-large-patch4-window7-224-in22k
MediPhi-PubMed
Dayhoff-3b-UR90
kosmos-2.5-chat
GUI-Actor-2B-Qwen2-VL
trocr-base-str
Phi-Ground
DialogRPT Updown
Please try this ➤➤➤ Colab Notebook Demo (click me!) | Context | Response | `updown` score | | :------ | :------- | :------------: | | I love NLP! | Here’s a free textbook (URL) in case anyone needs it. | 0.613 | | I love NLP! | Me too! | 0.111 | The `updown` score predicts how likely the response is getting upvoted. > How likely a dialog response is upvoted 👍 and/or gets replied 💬? This is what DialogRPT is learned to predict. It is a set of dialog response ranking models proposed by Microsoft Research NLP Group trained on 100 + millions of human feedback data. It can be used to improve existing dialog generation model (e.g., DialoGPT) by re-ranking the generated response candidates. Quick Links: EMNLP'20 Paper Dataset, training, and evaluation Colab Notebook Demo We considered the following tasks and provided corresponding pretrained models. This page is for the `updown` task, and other model cards can be found in table below. |Task | Description | Pretrained model | | :------------- | :----------- | :-----------: | | Human feedback | given a context and its two human responses, predict...| | `updown` | ... which gets more upvotes? | this model | | `width`| ... which gets more direct replies? | model card | | `depth`| ... which gets longer follow-up thread? | model card | | Human-like (human vs fake) | given a context and one human response, distinguish it with... | | `humanvsrand`| ... a random human response | model card | | `humanvsmachine`| ... a machine generated response | model card |
MAI-DS-R1
GUI-Actor-7B-Qwen2-VL
Dayhoff-170m-GR
Dayhoff-3b-GR-HM-c
markuplm-base-finetuned-websrc
CodeGPT-small-java-adaptedGPT2
DialogRPT-human-vs-rand
tapex-base
xclip-large-patch14-kinetics-600
CodeGPT-small-java
NatureLM-8x7B-Inst
focalnet-tiny-lrf
git-base-textcaps
Phi-4-mini-instruct-onnx
dit-large
git-base-vatex
Phi 3 Mini 128k Instruct Onnx
This repository hosts the optimized versions of Phi-3-mini-128k-instruct to accelerate inference with ONNX Runtime. Phi-3 Mini is a lightweight, state-of-the-art open model built upon datasets used for Phi-2 - synthetic data and filtered websites - with a focus on very high-quality, reasoning dense data. The model belongs to the Phi-3 model family, and the mini version comes in two variants: 4K and 128K which is the context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. Optimized Phi-3 Mini models are published here in ONNX format to run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets. DirectML support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross platform support for Phi-3 Mini across a range of devices for CPU, GPU, and mobile. To easily get started with Phi-3, you can use our newly introduced ONNX Runtime Generate() API. See here for instructions on how to run it. Here are some of the optimized configurations we have added: 1. ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ. 2. ONNX model for fp16 CUDA: ONNX model you can use to run for your NVIDIA GPUs. 3. ONNX model for int4 CUDA: ONNX model for NVIDIA GPUs using int4 quantization via RTN. 4. ONNX model for int4 CPU and Mobile: ONNX model for your CPU and Mobile, using int4 quantization via RTN. There are two versions uploaded to balance latency vs. accuracy. Acc=1 is targeted at improved accuracy, while Acc=4 is for improved perf. For mobile devices, we recommend using the model with acc-level-4. More updates on AMD, and additional optimizations on CPU and Mobile will be added with the official ORT 1.18 release in early May. Stay tuned! The models are tested on: - GPU SKU: RTX 4090 (DirectML) - GPU SKU: 1 A100 80GB GPU, SKU: StandardND96amsrA100v4 (CUDA) - CPU SKU: Standard F64s v2 (64 vcpus, 128 GiB memory) - Mobile SKU: Samsung Galaxy S21 Minimum Configuration Required: - Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM - CUDA: NVIDIA GPU with Compute Capability >= 7.0 - Developed by: Microsoft - Model type: ONNX - Language(s) (NLP): Python, C, C++ - License: MIT - Model Description: This is a conversion of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference. Additional Details - ONNX Runtime Optimizations Blog Link - Phi-3 Model Blog Link - Phi-3 Model Card - Phi-3 Technical Report How to Get Started with the Model To make running of the Phi-3 models across a range of devices and platforms across various execution provider backends possible, we introduce a new API to wrap several aspects of generative AI inferencing. This API make it easy to drag and drop LLMs straight into your app. For running the early version of these models with ONNX Runtime, follow the steps here. Phi-3 Mini-128K-Instruct performs better in ONNX Runtime than PyTorch for all batch size, prompt length combinations. For FP16 CUDA, ORT performs up to 5X faster than PyTorch, while with INT4 CUDA it's up to 9X faster than PyTorch. The table below shows the average throughput of the first 256 tokens generated (tps) for FP16 and INT4 precisions on CUDA as measured on 1 A100 80GB GPU, SKU: StandardND96amsrA100v4. | Batch Size, Prompt Length | ORT FP16 CUDA | PyTorch Eager FP16 CUDA | FP16 CUDA Speed Up (ORT/PyTorch) | |---------------------------|---------------|-------------------------|----------------------------------| | 1, 16 | 134.46 | 25.35 | 5.30 | | 1, 64 | 132.21 | 25.69 | 5.15 | | 1, 256 | 124.51 | 25.77 | 4.83 | | 1, 1024 | 110.03 | 25.73 | 4.28 | | 1, 2048 | 96.93 | 25.72 | 3.77 | | 1, 4096 | 62.12 | 25.66 | 2.42 | | 4, 16 | 521.10 | 101.31 | 5.14 | | 4, 64 | 507.03 | 101.66 | 4.99 | | 4, 256 | 459.47 | 101.15 | 4.54 | | 4, 1024 | 343.60 | 101.09 | 3.40 | | 4, 2048 | 264.81 | 100.78 | 2.63 | | 4, 4096 | 158.00 | 77.98 | 2.03 | | 16, 16 | 1689.08 | 394.19 | 4.28 | | 16, 64 | 1567.13 | 394.29 | 3.97 | | 16, 256 | 1232.10 | 405.30 | 3.04 | | 16, 1024 | 680.61 | 294.79 | 2.31 | | 16, 2048 | 350.77 | 203.02 | 1.73 | | 16, 4096 | 192.36 | OOM | | | Batch Size, Prompt Length | PyTorch Eager INT4 CUDA | INT4 CUDA Speed Up (ORT/PyTorch) | |---------------------------|-------------------------|----------------------------------| | 1, 16 | 25.35 | 8.89 | | 1, 64 | 25.69 | 8.58 | | 1, 256 | 25.77 | 7.69 | | 1, 1024 | 25.73 | 6.34 | | 1, 2048 | 25.72 | 5.24 | | 1, 4096 | 25.66 | 2.97 | | 4, 16 | 101.31 | 2.82 | | 4, 64 | 101.66 | 2.77 | | 4, 256 | 101.15 | 2.64 | | 4, 1024 | 101.09 | 2.20 | | 4, 2048 | 100.78 | 1.84 | | 4, 4096 | 77.98 | 1.62 | | 16, 16 | 394.19 | 2.52 | | 16, 64 | 394.29 | 2.41 | | 16, 256 | 405.30 | 2.00 | | 16, 1024 | 294.79 | 1.79 | | 16, 2048 | 203.02 | 1.81 | | 16, 4096 | OOM | | Note: PyTorch compile and Llama.cpp currently do not support the Phi-3 Mini-128K-Instruct model. | Pip package name | Version | |----------------------------|----------| | torch | 2.2.0 | | triton | 2.2.0 | | onnxruntime-gpu | 1.18.0 | | onnxruntime-genai | 0.2.0 | | onnxruntime-genai-cuda | 0.2.0 | | onnxruntime-genai-directml | 0.2.0 | | transformers | 4.39.0 | | bitsandbytes | 0.42.0 | AWQ works by identifying the top 1% most salient weights that are most important for maintaining accuracy and quantizing the remaining 99% of weights. This leads to less accuracy loss from quantization compared to many other quantization techniques. For more on AWQ, see here. Contributors Kunal Vaishnavi, Sunghoon Choi, Yufeng Li, Akshay Sonawane, Sheetal Arun Kadam, Rui Ren, Edward Chen, Scott McKay, Ryan Hill, Emma Ning, Natalie Kershaw, Parinita Rahi, Patrice Vignola, Chai Chaoweeraprasit, Logan Iyer, Vicente Rivera, Jacques Van Rhyn
beit-large-patch16-224-pt22k
beit-large-patch16-384
Phi-3-mini-4k-instruct-onnx-web
xlm-align-base
LLM2CLIP-EVA02-L-14-336
git-large-r-coco
wavlm-base-sd
llava-med-7b-delta
GODEL-v1_1-base-seq2seq
DialogRPT-human-vs-machine
NextCoder-7B
Phi-3.5-mini-instruct-onnx
rho-math-1b-interpreter-v0.1
Tapex Base Finetuned Wtq
TAPEX was proposed in TAPEX: Table Pre-training via Learning a Neural SQL Executor by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. The original repo can be found here. TAPEX (Table Pre-training via Execution) is a conceptually simple and empirically powerful pre-training approach to empower existing models with table reasoning skills. TAPEX realizes table pre-training by learning a neural SQL executor over a synthetic corpus, which is obtained by automatically synthesizing executable SQL queries. TAPEX is based on the BART architecture, the transformer encoder-encoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. This model is the `tapex-base` model fine-tuned on the WikiTableQuestions dataset. You can use the model for table question answering on complex questions. Some solveable questions are shown below (corresponding tables now shown): | Question | Answer | |:---: |:---:| | according to the table, what is the last title that spicy horse produced? | Akaneiro: Demon Hunters | | what is the difference in runners-up from coleraine academical institution and royal school dungannon? | 20 | | what were the first and last movies greenstreet acted in? | The Maltese Falcon, Malaya | | in which olympic games did arasay thondike not finish in the top 20? | 2012 | | which broadcaster hosted 3 titles but they had only 1 episode? | Channel 4 |
Dayhoff-170m-UR50
Llama2-7b-WhoIsHarryPotter
dit-large-finetuned-rvlcdip
wham
git-large-textvqa
xclip-large-patch14-16-frames
cvt-13-384-22k
git-large-r-textcaps
tapex-large-finetuned-tabfact
Phi-3-medium-128k-instruct-onnx-cpu
git-large-vatex
markuplm-large-finetuned-websrc
Dayhoff-170m-UR50-BRn
GUI-Actor-Verifier-2B
Dayhoff-170m-UR50-BRq
rho-math-1b-v0.1
amos
NatureLM-8x7B
udop-large-512
git-base-msrvtt-qa
xprophetnet-large-wiki100-cased
Wavecoder Ultra 6.7b
🌊 WaveCoder: Widespread And Versatile Enhanced Code LLM [🐦 Twitter] • [💬 Reddit] • [🍀 Unofficial Blog] Quick Start • --> Citation --> Repo for " WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation " - [2024/04/10] 🔥🔥🔥 WaveCoder repo, models released at 🤗 HuggingFace! - [2023/12/26] WaveCoder paper released. WaveCoder 🌊 is a series of large language models (LLMs) for the coding domain, designed to solve relevant problems in the field of code through instruction-following learning. Its training dataset was generated from a subset of code-search-net data using a generator-discriminator framework based on LLMs that we proposed, covering four general code-related tasks: code generation, code summary, code translation, and code repair. | Model | HumanEval | MBPP(500) | HumanEval Fix(Avg.) | HumanEval Explain(Avg.) | | -------------------------------------------------------------------------------- | --------- | --------- | ---------------------- | -------------------------- | | GPT-4 | 85.4 | - | 47.8 | 52.1 | | 🌊 WaveCoder-DS-6.7B | 65.8 | 63.0 | 49.5 | 40.8 | | 🌊 WaveCoder-Pro-6.7B | 74.4 | 63.4 | 52.1 | 43.0 | | 🌊 WaveCoder-Ultra-6.7B | 79.9 | 64.6 | 52.3 | 45.7 | Please refer to WaveCoder's GitHub repo for inference, evaluation, and training code. This code repository is licensed under the MIT License. The use of DeepSeek Coder models is subject to the its License. If you find this repository helpful, please consider citing our paper: WaveCoder models are trained on the synthetic data generated by OpenAI models. Please pay attention to OpenAI's terms of use when using the models and the datasets.
unispeech-sat-base-100h-libri-ft
Phi-3.5-vision-instruct-onnx
Dayhoff-170m-UR50-BRu
tapex-large-finetuned-wikisql
xclip-base-patch16-kinetics-600
git-large-msrvtt-qa
MediPhi-Guidelines
git-large-r
Dayhoff-3b-GR-HM
rho-math-7b-interpreter-v0.1
LLaMA-2-7b-GTL-Delta
LLaMA-2-13b-GTL-Delta
MediPhi-MedWiki
unixcoder-base-unimodal
rho-math-7b-v0.1
wavecoder-pro-6.7b
wavecoder-ds-6.7b
udop-large-512-300k
dolly-v2-7b-olive-optimized
Phi-3-medium-4k-instruct-onnx-cpu
Phi-3-medium-4k-instruct-onnx-directml
swin-base-simmim-window6-192
Phi-3-medium-4k-instruct-onnx-cuda
Phi-4-multimodal-instruct-onnx
This is an ONNX version of the Phi-4 multimodal model that is quantized to int4 precision to accelerate inference with ONNX Runtime. Model Run For CPU: stay tuned or follow this tutorial to generate your own ONNX models for CPU! You will be prompted to provide any images, audios, and a prompt. The performance of the text component is similar to the Phi-4 mini ONNX models - Developed by: Microsoft - Model type: ONNX - License: MIT - Model Description: This is a conversion of Phi4 multimodal model for ONNX Runtime inference. Disclaimer: Model is only an optimization of the base model, any risk associated with the model is the responsibility of the user of the model. Please verify and test for you scenarios. There may be a slight difference in output from the base model with the optimizations applied. Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning, and direct preference optimization to support precise instruction adherence and safety measures.
longcoder-base
chatbench-distilgpt2
Phi-3-medium-128k-instruct-onnx-cuda
vq-diffusion-ithq
LLM2CLIP-EVA02-B-16
Phi-3-medium-128k-instruct-onnx-directml
LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
focalnet-small
swin-large-simmim-window12-192
Phi-3-vision-128k-instruct-onnx
Phi-3-vision-128k-instruct-onnx-cpu
cvt-21-384
deberta-xlarge-v2
xprophetnet-large-wiki100-cased-xglue-ntg
cvt-21-384-22k
unispeech-1350-en-17h-ky-ft-1h
focalnet-base
Phi-3-vision-128k-instruct-onnx-cuda
deberta-xxlarge-v2
phi-4-onnx
DialogRPT-depth
xclip-base-patch16-kinetics-600-16-frames
Phi-3-vision-128k-instruct-onnx-directml
unilm-base-cased
CodeGPT-small-py-adaptedGPT2
ssr-base
tapex-base-finetuned-tabfact
BiomedNLP-BiomedELECTRA-large-uncased-abstract
Phi-4-reasoning-plus-onnx
deberta-xxlarge-v2-mnli
xclip-base-patch16-hmdb-16-shot
elem2design
cvt-13-384
xclip-base-patch16-hmdb-4-shot
xclip-base-patch16-ucf-2-shot
xdoc-base-squad2.0
focalnet-base-lrf
tapex-large-sql-execution
mistral-7b-instruct-v0.2-ONNX
chatbench-mistral-7b
xclip-base-patch16-hmdb-8-shot
deberta-xlarge-v2-mnli
xclip-base-patch16-hmdb-2-shot
unilm-large-cased
xclip-base-patch16-ucf-4-shot
bloom-deepspeed-inference-fp16
Phi-3-small-8k-instruct-onnx-cuda
xclip-base-patch16-ucf-16-shot
unispeech-large-multi-lingual-1500h-cv
unispeech-1350-en-90-it-ft-1h
Phi-4-reasoning-onnx
Reducio VAE
This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a video by a factor of \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), enabling 4096x downsampling. It is part of the Reducio-DiT, which is a video generation method. Codebase available here. The model is typically used for supporting training a video diffusion model. After using this model to convert the data to the latent space, you can train your own diffusion model on the extremely compressed latent space. |Method|Downsample Factor|\|z\||PSNR |SSIM |LPIPS |rFVD (Pexels)|rFVD (UCF-101)| |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------| |SD2.1-VAE|1\8\8|4|29.23|0.82|0.09|25.96|21.00| |SDXL-VAE|1\8\8|16|30.54|0.85|0.08|19.87|23.68| |OmniTokenizer|4\8\8|8|27.11|0.89|0.07|23.88|30.52| |OpenSora-1.2|4\8\8|16|30.72|0.85|0.11|60.88|67.52| |Cosmos Tokenizer|8\8\8|16|30.84|0.74|0.12|29.44|22.06| |Cosmos Tokenizer|8\16\16|16|28.14|0.65|0.18|77.87|119.37| |Reducio-VAE|4\32\32|16|35.88|0.94|0.05|17.88|65.17|
unispeech-1350-en-168-es-ft-1h
focalnet-small-lrf
Mri Autoencoder V0.1
codeexecutor
cocolm-large
unihanlm-base
Phi-4-mini-reasoning-onnx
xdoc-base
xdoc-base-funsd
chatbench-llama3-8b
unispeech-sat-large-sd
DialogRPT-width
xprophetnet-large-wiki100-cased-xglue-qg
xdoc-base-squad1.1
bloom-deepspeed-inference-int8
cocolm-base
reacc-py-retriever
unispeech-1350-en-353-fr-ft-1h
xdoc-base-websrc
unispeech-sat-base-sd
aurora
mattergen
VidTok
msclap
CADFusion
This model takes a textual description as input, which describes the appearance, components and functionality of the desired CAD model, and generates a CAD model as output, which is represented as a structured text with CAD modeling operations and dimensions. - Developed by: Ruiyu Wang, Yu Yuan, Shizhao Sun, Jiang Bian - Model type: Large Language Models - Language(s): English, CAD-structured texts - License: MIT - Finetuned from model: LLaMA-3-8B - Repository: https://github.com/microsoft/CADFusion - Paper: https://arxiv.org/abs/2501.19054 Taking natural language instruction reflecting the design intent from the user, the model generate a CAD model represented as structured texts that reflects the intent. CADFusion is an open-source model shared with the research community to facilitate the reproduction of our results and foster research in text-to-CAD generation. It is intended to be used by experts in the CAD domain who are independently capable of evaluating the quality of outputs before acting on them. CADFusion is being released for research purposes. We do not recommend using CADFusion in commercial or real-world deployments without extra testing and development. Follow local laws and regulations when using the model. Any use that violates applicable laws and regulations is considered out-of-scope in the designation of this model. CADFusion is built upon Meta-Llama-3-8B. Like all large language models, it may inherit biases, errors, or omissions from its base model. We recommend developers carefully select the appropriate LLM backbone for their specific use case. You can learn more about the capabilities and limitations of the Llama model here: https://huggingface.co/meta-llama/Meta-Llama-3-8B. While CAD-Editor has been fine-tuned on CAD-specific data to minimize irrelevant details, it may still generate harmful or undesirable CAD models under certain prompts. For example, when given an instruction like 'create a CAD model for a ghost gun', it could produce potentially dangerous content. Therefore, it is essential for users to implement their own content-filtering strategies to prevent the generation of harmful or undesirable CAD models. Please note that CADFusion is currently for research and experimental purposes only. Generated CAD models may not always be technically accurate, and users are responsible for assessing the quality and suitability of the content it produces. Extensive testing and validation are required before any commercial or real-world deployment. Please only use the textual description for the appearance, components and functionality of the desired CAD model as input. Please replace with the path of the downloaded CADFusion model on your machine. For more information, please visit our GitHub repo: https://github.com/microsoft/CADFusion. The training data contains two parts, the CAD model and its corresponding textual description. For the CAD model, it is originally from a open-source dataset DeepCAD. We use its pre-processed version Skexgen and transform it into a format that are suitable for our model. For the textual description, we invite human annotators to describe the CAD model in DeepCAD by natural language. The model is trained by alternating two stages, the sequential learning stage and the visual feedback stage. For the sequence learning stage, we use paired textual descriptions and CAD models to fine-tune the LLM backbone. For the visual feedback stage, we use paired textual descriptions, rejected CAD models and preferred CAD models to fine-tune the LLM backbone, where the rejected CAD models and preferred CAD models are produced by labeling the rendered CAD images with a large vision-language model (https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov). See the methodology section in our paper (https://arxiv.org/pdf/2501.19054) for more information. Sequential Learning (SL) Stage: - LoRA rank and alpha: 32, 32 - Optimizer: AdamW - Learning rate: 1e-4 - Epochs: 40 Visual Feedback (VF) Stage: - LoRA rank and alpha: 32, 32 - Optimizer: AdamW - Learning rate: 1e-4 - Epochs: 3 VF epochs and 1 SL epoch The testing data contains two parts, the CAD model and its corresponding textual description. For the CAD model, it is originally from a open-source dataset DeepCAD. We use its pre-processed version Skexgen and transform it into a format that are suitable for our model. For the textual description, we invite human annotators to describe the CAD model in DeepCAD by natural language. - Generation diversity and quality on the generated CAD models in comparison to the test set, including Coverage (COV), Minimum Matching Distance (MMD) and Jensen-Shannon Divergence (JSD). - Invalidity Ratio (IR). - Visual quality, including human ranking and VLM scoring. CADFusion achieves better performance quantitatively and qualitatively compared with baselines such as GPT-4o and Text2CAD (https://arxiv.org/abs/2409.17106). For example, it improves the visual quality in a great margin: the VLM scoring for GPT-4o and Text2CAD is 5.13 and 2.01 respectively and CADFusion improves it to 8.96. See Table 1 in our paper (https://arxiv.org/pdf/2501.19054) for the complete evaluation. Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact Shizhao Sun at [email protected]. If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.