microsoft

✓ VerifiedEnterprise

Microsoft AI, partners with OpenAI on GPT integration

421 models • 65 total models in database

Sort by:

TRELLIS-image-large

--- library_name: trellis pipeline_tag: image-to-3d license: mit language: - en ---

license:mit

2,548,659

589

deberta-v3-base

--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---

license:mit

2,418,393

374

deberta-xlarge-mnli

--- language: en tags: - deberta-v1 - deberta-mnli tasks: mnli thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit widget: - text: "[CLS] I love you. [SEP] I like you. [SEP]" ---

license:mit

1,825,672

table-transformer-structure-recognition

--- license: mit widget: - src: https://documentation.tricentis.com/tosca/1420/en/content/tbox/images/table.png example_title: Table ---

license:mit

1,815,142

--- language: en tags: - exbert license: mit widget: - text: "[MASK] is a tyrosine kinase inhibitor." ---

license:mit

1,250,841

tapex-base-finetuned-wikisql

--- language: en tags: - tapex - table-question-answering datasets: - wikisql license: mit ---

license:mit

1,030,887

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

--- language: en tags: - clip - biology - medical license: mit library_name: open_clip widget: - src: https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224/resolve/main/example_data/biomed_image_classification_example_data/squamous_cell_carcinoma_histopathology.jpeg candidate_labels: adenocarcinoma histopathology, squamous cell carcinoma histopathology example_title: squamous cell carcinoma histopathology - src: >- https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_

license:mit

1,020,147

368

beit-base-patch16-224-pt22k-ft22k

--- license: apache-2.0 tags: - image-classification - vision datasets: - imagenet - imagenet-21k ---

license:apache-2.0

946,504

Florence-2-large

--- license: mit license_link: https://huggingface.co/microsoft/Florence-2-large/resolve/main/LICENSE pipeline_tag: image-text-to-text tags: - vision ---

license:mit

931,229

1,700

phi-2

--- license: mit license_link: https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE language: - en pipeline_tag: text-generation tags: - nlp - code ---

license:mit

775,700

3,408

layoutlmv2-base-uncased

--- language: en license: cc-by-nc-sa-4.0

license:cc-by-nc-sa-4.0

775,606

layoutlmv3-base

--- language: en license: cc-by-nc-sa-4.0

license:cc-by-nc-sa-4.0

752,336

457

deberta-v3-large

--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---

license:mit

705,571

254

codebert-base

Pretrained weights for [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155).

—

653,255

275

--- language: - en tags: - speech inference: false ---

—

395,797

swinv2-tiny-patch4-window16-256

--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg example_title: Tiger - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg example_title: Teapot - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg example_title: Palace ---

license:apache-2.0

389,909

Phi-3.5-vision-instruct

--- license: mit license_link: https://huggingface.co/microsoft/Phi-3.5-vision-instruct/resolve/main/LICENSE language: - multilingual pipeline_tag: image-text-to-text tags: - nlp - code - vision inference: parameters: temperature: 0.7 widget: - messages: - role: user content: Can you describe what you see in the image? library_name: transformers ---

license:mit

384,187

711

trocr-large-printed

--- tags: - trocr - image-to-text widget: - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X00016469612_1.jpg example_title: Printed 1 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005255805_7.jpg example_title: Printed 2 - src: https://layoutlm.blob.core.windows.net/trocr/dataset/SROIE2019Task2Crop/train/X51005745214_6.jpg example_title: Printed 3 ---

—

362,503

172

DialoGPT-small

--- thumbnail: https://huggingface.co/front/thumbnails/dialogpt.png tags: - conversational license: mit ---

--- language: en license: mit tags: - vision - video-classification model-index: - name: nielsr/xclip-base-patch32 results: - task: type: video-classification dataset: name: Kinetics 400 type: kinetics-400 metrics: - type: top-1 accuracy value: 80.4 - type: top-5 accuracy value: 95.0 ---

license:mit

249,824

104

trocr-base-printed

—

247,873

194

deberta-v3-small

--- language: en tags: - deberta - deberta-v3 - fill-mask thumbnail: https://huggingface.co/front/thumbnails/microsoft.png license: mit ---

license:mit

230,249

resnet-50

--- license: apache-2.0 tags: - vision - image-classification datasets: - imagenet-1k ---

license:apache-2.0

228,445

455

wavlm-base-plus-sv

--- language: - en tags: - speech ---

—

227,227

Florence-2-base

--- license: mit license_link: https://huggingface.co/microsoft/Florence-2-base/resolve/main/LICENSE pipeline_tag: image-text-to-text tags: - vision ---

license:mit

213,991

310

kosmos-2-patch14-224

phi-1_5

The language model Phi-1.5 is a Transformer with 1.3 billion parameters. It was trained using the same data sources as phi-1, augmented with a new data source that consists of various NLP synthetic...

license:mit

102,891

1,351

swin-large-patch4-window7-224

Swin Transformer model trained on ImageNet-1k at resolution 224x224. It was introduced in the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. and first released in this repository. Disclaimer: The team releasing Swin Transformer did not write a model card for this model so this model card has been written by the Hugging Face team. The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: For more code examples, we refer to the documentation.

license:apache-2.0

94,212

git-base-coco

GIT (GenerativeImage2Text), base-sized, fine-tuned on COCO GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on COCO. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. Disclaimer: The team releasing GIT did not write a model card for this model so this model card has been written by the Hugging Face team. GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a lot of (image, text) pairs. The goal for the model is simply to predict the next text token, giving the image tokens and previous text tokens. The model has full access to (i.e. a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e. a causal attention mask is used for the text tokens) when predicting the next text token. - image and video captioning - visual question answering (VQA) on images and videos - even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text). You can use the raw model for image captioning. See the model hub to look for fine-tuned versions on a task that interests you. > We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a). => however this is for the model referred to as "GIT" in the paper, which is not open-sourced. This checkpoint is "GIT-base", which is a smaller variant of GIT trained on 10 million image-text pairs. We refer to the original repo regarding details for preprocessing during training. During validation, one resizes the shorter edge of each image, after which center cropping is performed to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. For evaluation results, we refer readers to the paper.

license:mit

79,681

speecht5_tts

license:mit

76,931

813

Multilingual-MiniLM-L12-H384

license:mit

72,037

infoxlm-base

—

68,841

swin-tiny-patch4-window7-224

license:apache-2.0

61,204

MiniLM-L12-H384-uncased

license:mit

58,207

102

llmlingua-2-bert-base-multilingual-cased-meetingbank

license:apache-2.0

57,228

Florence-2-large-ft

license:mit

55,666

373

wavlm-base-sv

—

54,304

trocr-large-handwritten

—

rad-dino

Phi-3-vision-128k-instruct

🎉 Phi-3.5: [[mini-instruct]](https://huggingface.co/microsoft/Phi-3.5-mini-instruct); [[MoE-instruct]](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) ; [[vision-instruct]](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision. The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures. + Phi-3 Microsoft Blog + Phi-3 Technical Report + Phi-3 on Azure AI Studio + Phi-3 Cookbook | | Short Context | Long Context | | ------- | ------------- | ------------ | | Mini | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) ; [[GGUF]](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx)| | Small | 8K [[HF]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-8k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-small-128k-instruct-onnx-cuda)| | Medium | 4K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-cuda) | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda)| | Vision | | 128K [[HF]](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) ; [[ONNX]](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda)| The model is intended for broad commercial and research use in English. The model provides uses for general purpose AI systems and applications with visual and text input capabilities which require 1) memory/compute constrained environments; 2) latency bound scenarios; 3) general image understanding; 4) OCR; 5) chart and table understanding. Our model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features. Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. Phi-3-Vision-128K-Instruct has been integrated in the development version (4.40.2) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following: When loading the model, ensure that `trustremotecode=True` is passed as an argument of the `frompretrained()` function. Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from the source. The current `transformers` version can be verified with: `pip list | grep transformers`. Phi-3-Vision-128K-Instruct is also available in Azure AI Studio. Given the nature of the training data, the Phi-3-Vision-128K-Instruct model is best suited for a single image input wih prompts using the chat format as follows. You can provide the prompt as a single image with a generic template as follow: where the model generates the text after ` ` . In case of multi-turn conversation, the prompt can be formatted as follows: This code snippets show how to get quickly started with running the model on a GPU: How to finetune? We recommend user to take a look at the Phi-3 CookBook finetuning recipe for Vision Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: + Quality of Service: The Phi models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. + Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. + Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. + Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. + Limited Scope for Code: Majority of Phi-3 training data is based in Python and use common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include: + Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. + High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. + Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). + Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. + Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. + Identification of individuals: models with vision capabilities may have the potential to uniquely identify individuals in images. Safety post-training steers the model to refuse such requests, but developers should consider and implement, as appropriate, additional mitigations or user consent flows as required in their respective jurisdiction, (e.g., building measures to blur faces in image inputs before processing. Architecture: Phi-3-Vision-128K-Instruct has 4.2B parameters and contains image encoder, connector, projector, and Phi-3 Mini language model. Inputs: Text and Image. It’s best suited for prompts using the chat format. Context length: 128K tokens GPUs: 512 H100-80G Training time: 1.5 days Training data: 500B vision and text tokens Outputs: Generated text in response to the input Dates: Our models were trained between February and April 2024 Status: This is a static model trained on an offline text dataset with cutoff date Mar 15, 2024. Future versions of the tuned models may be released as we improve models. Release Type: Open weight release Release dates: The model weight is released on May 21, 2024. Our training data includes a wide variety of sources, and is a combination of 1) publicly available documents filtered rigorously for quality, selected high-quality educational data and code; 2) selected high-quality image-text interleave; 3) newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.), newly created image data, e.g., chart/table/diagram/slides; 4) high quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data. More details can be found in the Phi-3 Technical Report. To understand the capabilities, we compare Phi-3-Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform. |Benchmark|Phi-3 Vision-128K-In|LlaVA-1.6 Vicuna-7B|QWEN-VL Chat|Llama3-Llava-Next-8B|Claude-3 Haiku|Gemini 1.0 Pro V|GPT-4V-Turbo| |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------| |MMMU|40.4|34.2|39.0|36.4|40.7|42.0|55.5| |MMBench|80.5|76.3|75.8|79.4|62.4|80.0|86.1| |ScienceQA|90.8|70.6|67.2|73.7|72.0|79.7|75.7| |MathVista|44.5|31.5|29.4|34.8|33.2|35.0|47.5| |InterGPS|38.1|20.5|22.3|24.6|32.1|28.6|41.0| |AI2D|76.7|63.1|59.8|66.9|60.3|62.8|74.7| |ChartQA|81.4|55.0|50.9|65.8|59.3|58.0|62.3| |TextVQA|70.9|64.6|59.4|55.7|62.7|64.7|68.1| |POPE|85.8|87.2|82.6|87.0|74.4|84.2|83.7| Hardware Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types: NVIDIA A100 NVIDIA A6000 NVIDIA H100 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

license:mit

31,158

968

Phi-3-medium-128k-instruct

license:mit

29,320

384

wavlm-base

The base model pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. Note: This model does not have a tokenizer as it was pretrained on audio alone. In order to use this model speech recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data. Check out this blog for more in-detail explanation of how to fine-tune the model. Paper: WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei Abstract Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The original model can be found under https://github.com/microsoft/unilm/tree/master/wavlm. This is an English pre-trained speech model that has to be fine-tuned on a downstream task like speech recognition or audio classification before it can be used in inference. The model was pre-trained in English and should therefore perform well only in English. The model has been shown to work well on the SUPERB benchmark. Note: The model was pre-trained on phonemes rather than characters. This means that one should make sure that the input text is converted to a sequence of phonemes before fine-tuning. To fine-tune the model for speech recognition, see the official speech recognition example. To fine-tune the model for speech classification, see the official audio classification example. The model was contributed by cywang and patrickvonplaten.

—

28,991

xclip-base-patch16

X-CLIP model (base-sized, patch resolution of 16) trained fully-supervised on Kinetics-400. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in this repository. This model was trained using 8 frames per video, at a resolution of 224x224. Disclaimer: The team releasing X-CLIP did not write a model card for this model so this model card has been written by the Hugging Face team. X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval. You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you. The exact details of preprocessing during training can be found here. The exact details of preprocessing during validation can be found here. During validation, one resizes the shorter edge of each frame, after which center cropping is performed to a fixed-size resolution (like 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. This model achieves a top-1 accuracy of 83.8% and a top-5 accuracy of 95.7%.

license:mit

27,800

mpnet-base

—

27,714

deberta-v3-xsmall

license:mit

27,334

GRIN-MoE

license:mit

27,091

195

dit-base

—

25,485

BioGPT-Large

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms. If you find BioGPT useful in your research, please cite the following paper:

license:mit

25,001

209

trocr-small-handwritten

—

21,500

trocr-large-stage1

—

21,243

Florence-2-base-ft

license:mit

21,054

132

Phi-3-medium-4k-instruct

license:mit

20,200

222

resnet-152

license:apache-2.0

20,010

BiomedVLP-CXR-BERT-specialized

license:mit

17,638

layoutlmv3-large

211

beit-base-patch16-224

license:apache-2.0

11,642

llava-med-v1.5-mistral-7b

dit-base-finetuned-rvlcdip

—

11,127

phi-4-gguf

| | | |-------------------------|-------------------------------------------------------------------------------| | Developers | Microsoft Research | | Description | `phi-4` is a state-of-the-art open model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. `phi-4` underwent a rigorous enhancement and alignment process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures | | Architecture | 14B parameters, dense decoder-only Transformer model | | Inputs | Text, best suited for prompts in the chat format | | Context length | 16K tokens | | GPUs | 1920 H100-80G | | Training time | 21 days | | Training data | 9.8T tokens | | Outputs | Generated text in response to input | | Dates | October 2024 – November 2024 | | Status | Static model trained on an offline dataset with cutoff dates of June 2024 and earlier for publicly available data | | Release date | December 12, 2024 | | License | MIT | | | | |-------------------------------|-------------------------------------------------------------------------| | Primary Use Cases | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require: 1. Memory/compute constrained environments. 2. Latency bound scenarios. 3. Reasoning and logic. | | Out-of-Scope Use Cases | Our models is not specifically designed or evaluated for all downstream purposes, thus: 1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. 2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English. 3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. | Our training data is an extension of the data used for Phi-3 and includes a wide variety of sources from: 1. Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code. 2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (science, daily activities, theory of mind, etc.). 4. High quality chat format supervised data covering various topics to reflect human preferences on different aspects such as instruct-following, truthfulness, honesty and helpfulness. Multilingual data constitutes about 8% of our overall data. We are focusing on the quality of data that could potentially improve the reasoning ability for the model, and we filter the publicly available documents to contain the correct level of knowledge. We evaluated `phi-4` using OpenAI’s SimpleEval and our own internal benchmarks to understand the model’s capabilities, more specifically: MMLU: Popular aggregated dataset for multitask language understanding. `phi-4` has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated synthetic datasets. The overall technique employed to do the safety alignment is a combination of SFT (Supervised Fine-Tuning) and iterative DPO (Direct Preference Optimization), including publicly available datasets focusing on helpfulness and harmlessness as well as various questions and answers targeted to multiple safety categories. Prior to release, `phi-4` followed a multi-faceted evaluation approach. Quantitative evaluation was conducted with multiple open-source safety benchmarks and in-house tools utilizing adversarial conversation simulation. For qualitative safety evaluation, we collaborated with the independent AI Red Team (AIRT) at Microsoft to assess safety risks posed by `phi-4` in both average and adversarial user scenarios. In the average user scenario, AIRT emulated typical single-turn and multi-turn interactions to identify potentially risky behaviors. The adversarial user scenario tested a wide range of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, encoding-based attacks, multi-turn attacks, and adversarial suffix attacks. Please refer to the technical report for more details on safety alignment. To understand the capabilities, we compare `phi-4` with a set of models over OpenAI’s SimpleEval benchmark. At the high-level overview of the model quality on representative benchmarks. For the table below, higher numbers indicate better performance: | Category | Benchmark | phi-4 (14B) | phi-3 (14B) | Qwen 2.5 (14B instruct) | GPT-4o-mini | Llama-3.3 (70B instruct) | Qwen 2.5 (72B instruct) | GPT-4o | |------------------------------|---------------|-----------|-----------------|----------------------|----------------------|--------------------|-------------------|-----------------| | Popular Aggregated Benchmark | MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 | | Science | GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 | | Math | MGSM MATH | 80.6 80.4 | 53.5 44.6 | 79.6 75.6 | 86.5 73.0 | 89.1 66.3 | 87.3 80.0 | 90.4 74.6 | | Code Generation | HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 | | Factual Knowledge | SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 | | Reasoning | DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 | \ These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3-70B. Given the nature of the training data, `phi-4` is best suited for prompts using the chat format as follows: Install `llama.cpp` according to their documentation and use the following code snippet to interact with `phi-4` (4-bit quantized): Like other language models, `phi-4` can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include: Quality of Service: The model is trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English. `phi-4` is not intended to support multilingual use. Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases. Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated. Limited Scope for Code: Majority of `phi-4` training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses. Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Using safety services like Azure AI Content Safety that have advanced guardrails is highly recommended. Important areas for consideration include: Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques. High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context. Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG). Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations. Data Summary: https://huggingface.co/microsoft/phi-4-gguf/blob/main/datasummarycard.md

license:mit

10,785

164

bitnet-b1.58-2B-4T

Phi-4-reasoning-plus

license:mit

10,011

328

layoutlmv2-large-uncased

LayoutLMv2 Multimodal (text + layout/format + image) pre-training for document AI Introduction LayoutLMv2 is an improved version of LayoutLM with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. It outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including , including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou, ACL 2021

license:cc-by-nc-sa-4.0

8,713

DialoGPT-large

license:mit

8,643

286

codebert-base-mlm

kosmos-2.5

license:mit

5,530

269

MediPhi-Instruct

license:mit

5,313

Phi-3-small-128k-instruct

trocr-small-stage1

—

4,289

rad-dino-maira-2

—

3,884

Fara-7B

deberta-large

license:mit

3,871

Orca-2-13b

Orca 2 is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning. 1. This is a research model, intended to show that we can use capable models and complex workflows (advanced prompts, multiple calls) to create synthetic data that can teach Small Language Models (SLMs) new capabilities. We chose reasoning because it is a widely useful capability that SLMs lack. 2. The model is not optimized for chat and has not been trained with RLHF or DPO. It is best used after being finetuned for chat or for a specific task. 3. Beyond reasoning, the model inherits capabilities and limitations of its base (LLAMA-2 base). We have already seen that the benefits of the Orca training can be applied to other base model too. We make Orca 2's weights publicly available to support further research on the development, evaluation, and alignment of SLMs. + Orca 2 is built for research purposes only. + The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models. + Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations. Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper. Please refer to LLaMA-2 technical report for details on the model architecture. Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models or limitation caused by its training process, including: Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair. Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses. Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information. Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction. Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether small models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller sizes and hence reduced memorization capacities. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic. Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content. Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning. System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions. Zero-Shot Settings: Orca 2 was trained on data that mostly simulate zero-shot settings. While the model demonstrate very strong performance in zero-shot settings, it does not show the same gains of using few-shot learning compared to other, specially larger, models. Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks. This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application. The usage of Azure AI Content Safety on top of model prediction is strongly encouraged and can help prevent content harms. Azure AI Content Safety is a content moderation platform that uses AI to keep your content safe. By integrating Orca 2 with Azure AI Content Safety, we can moderate the model output by scanning it for sexual content, violence, hate, and self-harm with multiple severity levels and multi-lingual detection.

UserLM-8b

Unlike typical LLMs that are trained to play the role of the "assistant" in conversation, we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a...

git-large-coco

license:mit

3,548

104

BiomedParse

license:cc-by-nc-sa-4.0

3,504

layoutlm-large-uncased

—

3,380

maira-2

—

3,364

bitnet-b1.58-2B-4T-gguf

This repository contains the weights for BitNet b1.58 2B4T, the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale, developed by Microsoft Research. Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency). ➡️ Technical Report: BitNet b1.58 2B4T Technical Report ➡️ Official Inference Code: microsoft/BitNet (bitnet.cpp) Several versions of the model weights are available on Hugging Face: `microsoft/bitnet-b1.58-2B-4T`: Contains the packed 1.58-bit weights optimized for efficient inference. Use this for deployment. `microsoft/bitnet-b1.58-2B-4T-bf16`: Contains the master weights in BF16 format. Use this only for training or fine-tuning purposes. `microsoft/bitnet-b1.58-2B-4T-gguf` (This repository): Contains the model weights in GGUF format, compatible with the `bitnet.cpp` library for CPU inference. Architecture: Transformer-based, modified with `BitLinear` layers (BitNet framework). Uses Rotary Position Embeddings (RoPE). Uses squared ReLU (ReLU²) activation in FFN layers. Employs `subln` normalization. No bias terms in linear or normalization layers. Quantization: Native 1.58-bit weights and 8-bit activations (W1.58A8). Weights are quantized to ternary values {-1, 0, +1} using absmean quantization during the forward pass. Activations are quantized to 8-bit integers using absmax quantization (per-token). Crucially, the model was trained from scratch with this quantization scheme, not post-training quantized. Parameters: ~2 Billion Training Tokens: 4 Trillion Context Length: Maximum sequence length of 4096 tokens. Recommendation: For optimal performance on tasks requiring very long contexts (beyond the pre-training length or for specialized long-reasoning tasks), we recommend performing intermediate long-sequence adaptation/training before the final fine-tuning stage. Training Stages: 1. Pre-training: Large-scale training on public text/code and synthetic math data using a two-stage learning rate and weight decay schedule. 2. Supervised Fine-tuning (SFT): Fine-tuned on instruction-following and conversational datasets using sum loss aggregation and specific hyperparameter tuning. 3. Direct Preference Optimization (DPO): Aligned with human preferences using preference pairs. Tokenizer: LLaMA 3 Tokenizer (vocab size: 128,256). > Please do NOT expect performance efficiency gains (in terms of speed, latency, or energy consumption) when using this model with the standard transformers library, even with the required fork. > > The current execution paths within transformers do not contain the specialized, highly optimized computational kernels required to leverage the advantages of the BitNet architecture. Running the model via transformers will likely result in inference speeds and energy usage comparable to, or potentially worse than, standard full-precision models within this framework on both CPU and GPU. > > While you might observe reduced memory usage due to the quantized weights, the primary computational efficiency benefits are not accessible through this standard transformers usage path. > > For achieving the efficiency benefits demonstrated in the technical paper, you MUST use the dedicated C++ implementation: bitnet.cpp. Please refer to the bitnet.cpp GitHub repository for detailed compilation steps, usage examples, and command-line options. BitNet b1.58 2B4T was evaluated against leading open-weight full-precision LLMs of similar size. Below are the key results (all models are instruction-tuned versions): | Benchmark | LLaMA 3.2 1B | Gemma-3 1B | Qwen2.5 1.5B | SmolLM2 1.7B | MiniCPM 2B | BitNet b1.58 2B | |--------------------------------|--------------|------------|--------------|--------------|------------|---------------------| | Memory (Non-emb) | 2GB | 1.4GB | 2.6GB | 3.2GB | 4.8GB | 0.4GB | | Latency (CPU Decoding) | 48ms | 41ms | 65ms | 67ms | 124ms | 29ms | | Energy (Estimated) | 0.258J | 0.186J | 0.347J | 0.425J | 0.649J | 0.028J | | Training Tokens (Pre-train)| 9T | 2T | 18T | 11T | 1.1T | 4T | | ARC-Challenge | 37.80 | 38.40 | 46.67 | 43.52 | 44.80 | 49.91 | | ARC-Easy | 63.17 | 63.13 | 76.01 | 62.92 | 72.14 | 74.79 | | OpenbookQA | 34.80 | 38.80 | 40.80 | 46.00 | 40.20 | 41.60 | | BoolQ | 64.65 | 74.22 | 78.04 | 75.78 | 80.67 | 80.18 | | HellaSwag | 60.80 | 57.69 | 68.28 | 71.71 | 70.81 | 68.44 | | PIQA | 74.21 | 71.93 | 76.12 | 76.12 | 76.66 | 77.09 | | WinoGrande | 59.51 | 58.48 | 62.83 | 68.98 | 61.80 | 71.90 | | CommonsenseQA | 58.48 | 42.10 | 76.41 | 63.55 | 71.74 | 71.58 | | TruthfulQA | 43.80 | 38.66 | 46.67 | 39.90 | 41.41 | 45.31 | | TriviaQA | 37.60 | 23.49 | 38.37 | 45.97 | 34.13 | 33.57 | | MMLU | 45.58 | 39.91 | 60.25 | 49.24 | 51.82 | 53.17 | | HumanEval+ | 31.10 | 37.20 | 50.60 | 28.00 | 43.90 | 38.40 | | GSM8K | 38.21 | 31.16 | 56.79 | 45.11 | 4.40 | 58.38 | | MATH-500 | 23.00 | 42.00 | 53.00 | 17.60 | 14.80 | 43.40 | | IFEval | 62.71 | 66.67 | 50.12 | 57.91 | 36.81 | 53.48 | | MT-bench | 5.43 | 6.40 | 6.12 | 5.50 | 6.57 | 5.85 | | Average | 44.90 | 43.74 | 55.23 | 48.70 | 42.05 | 54.19 | License The model weights and code are released under the MIT License. Disclaimer This model is intended for research and development purposes. While efforts have been made to align it using SFT and DPO, it may still produce outputs that are unexpected, biased, or inaccurate. Please use responsibly.

Phi-3-small-8k-instruct

license:mit

3,334

170

LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned

NaNK

llama

3,323

swinv2-large-patch4-window12-192-22k

license:apache-2.0

3,189

xclip-base-patch16-zero-shot

X-CLIP model (base-sized, patch resolution of 16) trained on Kinetics-400. It was introduced in the paper Expanding Language-Image Pretrained Models for General Video Recognition by Ni et al. and first released in this repository. This model was trained using 32 frames per video, at a resolution of 224x224. Disclaimer: The team releasing X-CLIP did not write a model card for this model so this model card has been written by the Hugging Face team. X-CLIP is a minimal extension of CLIP for general video-language understanding. The model is trained in a contrastive way on (video, text) pairs. This allows the model to be used for tasks like zero-shot, few-shot or fully supervised video classification and video-text retrieval. You can use the raw model for determining how well text goes with a given video. See the model hub to look for fine-tuned versions on a task that interests you. The exact details of preprocessing during training can be found here. The exact details of preprocessing during validation can be found here. During validation, one resizes the shorter edge of each frame, after which center cropping is performed to a fixed-size resolution (like 224x224). Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation. This model achieves a zero-shot top-1 accuracy of 44.6% on HMDB-51, 72.0% on UCF-101 and 65.2% on Kinetics-600.

license:mit

3,134

swin-large-patch4-window12-384

license:apache-2.0

2,948

xclip-base-patch16-16-frames

license:mit

2,907

CodeGPT-small-py

—

2,661

swin-base-patch4-window12-384

license:apache-2.0

2,486

1,683

Orca-2-7b

Orca 2 is built for research purposes only and provides a single turn response in tasks such as reasoning over user given data, reading comprehension, math problem solving and text summarization. The model is designed to excel particularly in reasoning. 1. This is a research model, intended to show that we can use capable models and complex workflows (advanced prompts, multiple calls) to create synthetic data that can teach Small Language Models (SLMs) new capabilities. We chose reasoning because it is a widely useful capability that SLMs lack. 2. The model is not optimized for chat and has not been trained with RLHF or DPO. It is best used after being finetuned for chat or for a specific task. 3. Beyond reasoning, the model inherits capabilities and limitations of its base (LLAMA-2 base). We have already seen that the benefits of the Orca training can be applied to other base model too. We make Orca 2's weights publicly available to support further research on the development, evaluation, and alignment of SLMs. + Orca 2 is built for research purposes only. + The main purpose is to allow the research community to assess its abilities and to provide a foundation for building better frontier models. + Orca 2 has been evaluated on a large number of tasks ranging from reasoning to grounding and safety. Please refer to Section 6 and Appendix in the Orca 2 paper for details on evaluations. Orca 2 is a finetuned version of LLAMA-2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper. Please refer to LLaMA-2 technical report for details on the model architecture. Orca 2 is licensed under the Microsoft Research License. Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Orca 2, built upon the LLaMA 2 model family, retains many of its limitations, as well as the common limitations of other large language models or limitation caused by its training process, including: Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair. Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses. Lack of Transparency: Due to the complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We recommend reviewing transparency notes from Azure for more information. Content Harms: There are various types of content harms that large language models can cause. It is important to be aware of them when using these models, and to take actions to prevent them. It is recommended to leverage various content moderation services provided by different companies and institutions. On an important note, we hope for better regulations and standards from government and technology leaders around content harms for AI technologies in future. We value and acknowledge the important role that research and open source community can play in this direction. Hallucination: It is important to be aware and cautious not to entirely rely on a given language model for critical decisions or information that might have deep impact as it is not obvious how to prevent these models from fabricating content. Moreover, it is not clear whether small models may be more susceptible to hallucination in ungrounded generation use cases due to their smaller sizes and hence reduced memorization capacities. This is an active research topic and we hope there will be more rigorous measurement, understanding and mitigations around this topic. Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content. Data Distribution: Orca 2’s performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset such as math, coding, and reasoning. System messages: Orca 2 demonstrates variance in performance depending on the system instructions. Additionally, the stochasticity introduced by the model size may lead to generation of non-deterministic responses to different system instructions. Zero-Shot Settings: Orca 2 was trained on data that mostly simulate zero-shot settings. While the model demonstrate very strong performance in zero-shot settings, it does not show the same gains of using few-shot learning compared to other, specially larger, models. Synthetic data: As Orca 2 is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation. We posit that Orca 2 benefits from the safety measures incorporated during training and safety guardrails (e.g., content filter) within the Azure OpenAI API. However, detailed studies are required for better quantification of such risks. This model is solely designed for research settings, and its testing has only been carried out in such environments. It should not be used in downstream applications, as additional analysis is needed to assess potential harm or bias in the proposed application. The usage of Azure AI Content Safety on top of model prediction is strongly encouraged and can help preventing some of content harms. Azure AI Content Safety is a content moderation platform that uses AI to moderate content. By having Azure AI Content Safety on the output of Orca 2, the model output can be moderated by scanning it for different harm categories including sexual content, violence, hate, and self-harm with multiple severity levels and multi-lingual detection.

NaNK

llama

1,017

swin-large-patch4-window12-384-in22k

license:apache-2.0

984

BioGPT-Large-PubMedQA

license:mit

949

117

swinv2-small-patch4-window8-256

license:apache-2.0

942

GUI-Actor-3B-Qwen2.5-VL

SportsBERT

—

387

prophetnet-large-uncased-cnndm

—

380

unispeech-sat-large-sv

—

367

prophetnet-large-uncased-squad-qg

—

344

OmniParser

Model Summary OmniParser is a general screen parsing tool, which interprets/converts UI screenshot to structured format, to improve existing LLM based UI agent. Training Datasets include: 1) an interactable icon detection dataset, which was curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset, designed to associate each UI element with its corresponding function. This model hub includes a finetuned version of YOLOv8 and a finetuned BLIP-2 model on the above dataset respectively. For more details of the models used and finetuning, please refer to the paper. Responsible AI Considerations Intended Use - OmniParser is designed to be able to convert unstructured screenshot image into structured list of elements including interactable regions location and captions of icons on its potential functionality. - OmniParser is intended to be used in settings where users are already trained on responsible analytic approaches and critical reasoning is expected. OmniParser is capable of providing extracted information from the screenshot, however human judgement is needed for the output of OmniParser. - OmniParser is intended to be used on various screenshots, which includes both PC and Phone, and also on various applications. limitations - OmniParser is designed to faithfully convert screenshot image into structured elements of interactable regions and semantics of the screen, while it does not detect harmful content in its input (like users have freedom to decide the input of any LLMs), users are expected to provide input to the OmniParser that is not harmful. - While OmniParser only converts screenshot image into texts, it can be used to construct an GUI agent based on LLMs that is actionable. When developing and operating the agent using OmniParser, the developers need to be responsible and follow common safety standard. - For OmniPaser-BLIP2, it may incorrectly infer the gender or other sensitive attribute (e.g., race, religion etc.) of individuals in icon images. Inference of sensitive attributes may rely upon stereotypes and generalizations rather than information about specific individuals and are more likely to be incorrect for marginalized people. Incorrect inferences may result in significant physical or psychological injury or restrict, infringe upon or undermine the ability to realize an individual’s human rights. We do not recommend use of OmniParser in any workplace-like use case scenario. License Please note that icondetect model is under AGPL license, and iconcaptionblip2 & iconcaptionflorence is under MIT license. Please refer to the LICENSE file in the folder of each model.

license:mit

339

1,694

table-transformer-structure-recognition-v1.1-fin

license:mit

318

MAI-DS-R1

GUI-Actor-7B-Qwen2-VL

NaNK

license:mit

225

Dayhoff-170m-GR

—

225

Dayhoff-3b-GR-HM-c

NaNK

GODEL-v1_1-base-seq2seq

license:mit

150

DialogRPT-human-vs-machine

—

133

NextCoder-7B

NaNK

dataset:bigcode/commitpackft

131

Phi-3.5-mini-instruct-onnx

license:mit

129

rho-math-1b-interpreter-v0.1

NaNK

—

101

Dayhoff-170m-UR50-BRn

—

GUI-Actor-Verifier-2B

NaNK

license:mit

Dayhoff-170m-UR50-BRq

—

rho-math-1b-v0.1

NaNK

llama

amos

license:mit

NatureLM-8x7B

rho-math-7b-v0.1

NaNK

license:mit

wavecoder-pro-6.7b

NaNK

llama

wavecoder-ds-6.7b

NaNK

llama

udop-large-512-300k

license:mit

dolly-v2-7b-olive-optimized

NaNK

longcoder-base

—

chatbench-distilgpt2

NaNK

mistral-7b-instruct-v0.2-ONNX

NaNK

license:apache-2.0

chatbench-mistral-7b

NaNK

This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a video by a factor of \$\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\$, enabling 4096x downsampling. It is part of the Reducio-DiT, which is a video generation method. Codebase available here. The model is typically used for supporting training a video diffusion model. After using this model to convert the data to the latent space, you can train your own diffusion model on the extremely compressed latent space. |Method|Downsample Factor|\|z\||PSNR |SSIM |LPIPS |rFVD (Pexels)|rFVD (UCF-101)| |---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------| |SD2.1-VAE|1\8\8|4|29.23|0.82|0.09|25.96|21.00| |SDXL-VAE|1\8\8|16|30.54|0.85|0.08|19.87|23.68| |OmniTokenizer|4\8\8|8|27.11|0.89|0.07|23.88|30.52| |OpenSora-1.2|4\8\8|16|30.72|0.85|0.11|60.88|67.52| |Cosmos Tokenizer|8\8\8|16|30.84|0.74|0.12|29.44|22.06| |Cosmos Tokenizer|8\16\16|16|28.14|0.65|0.18|77.87|119.37| |Reducio-VAE|4\32\32|16|35.88|0.94|0.05|17.88|65.17|

radedit

—

—

dml-ai-hub-models

—

LLM2CLIP-Llama3.1-8B-siglip2-so400m-patch14-224

NaNK

license:apache-2.0

FlexCAD

license:mit

villa-x

license:mit

Graphormer

license:mit

VibeVoice-7B-hf

NaNK

NaNK

license:mit

whisper-base-webnn

license:apache-2.0

LLM2CLIP-Llama3.2-1B-EVA02-L-14-224

NaNK

SatCLIP-ResNet18-L40

license:mit

microsoft

TRELLIS-image-large

deberta-v3-base

deberta-xlarge-mnli

table-transformer-structure-recognition

deberta-large-mnli

table-transformer-detection

Phi-3-mini-4k-instruct

BiomedNLP-BiomedBERT-base-uncased-abstract

tapex-base-finetuned-wikisql

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

beit-base-patch16-224-pt22k-ft22k

Florence-2-large

phi-2

layoutlmv2-base-uncased

layoutlmv3-base

deberta-v3-large

codebert-base

deberta-base

DialoGPT-medium

BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext

mdeberta-v3-base

phi-4

Phi-4-multimodal-instruct

wavlm-large

swinv2-tiny-patch4-window16-256

Phi-3.5-vision-instruct

trocr-large-printed

DialoGPT-small

Phi-3.5-mini-instruct

wavlm-base-plus

Phi-3-mini-128k-instruct

graphcodebert-base

Phi-4-mini-instruct

xclip-base-patch32

trocr-base-printed

deberta-v3-small

resnet-50

wavlm-base-plus-sv

Florence-2-base

kosmos-2-patch14-224

table-transformer-structure-recognition-v1.1-all

trocr-base-handwritten

VibeVoice-1.5B

resnet-18

llmlingua-2-xlm-roberta-large-meetingbank

layoutlm-base-uncased

BiomedVLP-CXR-BERT-general

infoxlm-large

beit-base-patch16-384

Phi-3.5-MoE-instruct

phi-1_5

swin-large-patch4-window7-224

speecht5_hifigan

unixcoder-base

deberta-base-mnli

git-base-coco

speecht5_tts

Multilingual-MiniLM-L12-H384

infoxlm-base

swin-tiny-patch4-window7-224

MiniLM-L12-H384-uncased

llmlingua-2-bert-base-multilingual-cased-meetingbank

Florence-2-large-ft

wavlm-base-sv

trocr-large-handwritten

deberta-v2-xlarge-mnli

deberta-v2-xlarge

wavlm-base-plus-sd

Phi-tiny-MoE-instruct

trocr-small-printed

swin-base-patch4-window7-224

biogpt

rad-dino

Phi-mini-MoE-instruct

Phi-4-reasoning

OmniParser-v2.0

Phi-3-mini-4k-instruct-gguf

Phi-3-vision-128k-instruct

Phi-3-medium-128k-instruct