safety

Deepfake Detection

Verify authenticity of images and videos. Detect AI-generated faces and manipulated media. Critical for journalism, legal tech, and content verification.

20
Models Found

Common Applications

Journalism fact-checking
Legal evidence verification
Social media platform safety
Identity verification
Misinformation prevention

Top Models

20 models • Sorted by downloads

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset. For a detailed description and experimental results, please refer to our paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
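The real-versus-fake objective described above can be illustrated with a toy example. The sketch below is not ELECTRA's training code: the sentence, the single replaced token, and the helper name `replaced_token_labels` are made up for illustration. It only shows how per-token labels for the discriminator could be derived by comparing the original tokens with a corrupted copy.

```python
def replaced_token_labels(original, corrupted):
    """Return 1 for each position whose token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
# Hypothetical generator output: one token swapped for a plausible alternative.
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

A position where the generator happens to reproduce the original token gets label 0 under this comparison, so it counts as "real".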

74.5M downloads
67 likes
PYTORCH
#2

Model Card: Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification The Fine-Tuned Vision Transformer (ViT) is a variant of the transformer encoder architecture, similar to BERT, that has been adapted for image classification tasks. This specific model, named "google/vit-base-patch16-224-in21k," is pre-trained on a substantial collection of images in a supervised manner, leveraging the ImageNet-21k dataset. The images in the pre-training dataset are resized to a resolution of 224x224 pixels, making it suitable for a wide range of image recognition tasks. During the training phase, meticulous attention was given to hyperparameter settings to ensure optimal model performance. The model was fine-tuned with a judiciously chosen batch size of 16. This choice not only balanced computational efficiency but also allowed for the model to effectively process and learn from a diverse array of images. To facilitate this fine-tuning process, a learning rate of 5e-5 was employed. The learning rate serves as a critical tuning parameter that dictates the magnitude of adjustments made to the model's parameters during training. In this case, a learning rate of 5e-5 was selected to strike a harmonious balance between rapid convergence and steady optimization, resulting in a model that not only learns swiftly but also steadily refines its capabilities throughout the training process. This training phase was executed using a proprietary dataset containing an extensive collection of 80,000 images, each characterized by a substantial degree of variability. The dataset was thoughtfully curated to include two distinct classes, namely "normal" and "nsfw." This diversity allowed the model to grasp nuanced visual patterns, equipping it with the competence to accurately differentiate between safe and explicit content. 
The overarching objective of this meticulous training process was to impart the model with a deep understanding of visual cues, ensuring its robustness and competence in tackling the specific task of NSFW image classification. The result is a model that stands ready to contribute significantly to content safety and moderation, all while maintaining the highest standards of accuracy and reliability.

Intended Uses & Limitations

Intended Uses

- NSFW Image Classification: The primary intended use of this model is for the classification of NSFW (Not Safe for Work) images. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.

How to use

Here is how to use this model to classify an image into one of two classes (normal, nsfw):

Evaluation statistics:

- 'eval_loss': 0.07463177293539047
- 'eval_accuracy': 0.980375
- 'eval_runtime': 304.9846
- 'eval_samples_per_second': 52.462
- 'eval_steps_per_second': 3.279

Note: It's essential to use this model responsibly and ethically, adhering to content guidelines and applicable regulations when implementing it in real-world applications, particularly those involving potentially sensitive content. For more details on model fine-tuning and usage, please refer to the model's documentation and the model hub.

- Hugging Face Model Hub
- Vision Transformer (ViT) Paper
- ImageNet-21k Dataset

Disclaimer: The model's performance may be influenced by the quality and representativeness of the data it was fine-tuned on. Users are encouraged to assess the model's suitability for their specific applications and datasets.
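The evaluation statistics reported in this card can be cross-checked with a little arithmetic. The sketch below multiplies the reported runtime by the two reported throughput figures to estimate how many images and batches the evaluation covered. The resulting counts are inferred here, not stated in the model card, and the variable names are ours.

```python
# Figures copied from the card's evaluation statistics.
eval_runtime_s = 304.9846
eval_samples_per_second = 52.462
eval_steps_per_second = 3.279
eval_accuracy = 0.980375

# runtime x throughput gives an estimate of the totals processed.
n_samples = round(eval_runtime_s * eval_samples_per_second)
n_steps = round(eval_runtime_s * eval_steps_per_second)
samples_per_step = n_samples / n_steps
n_correct = round(eval_accuracy * n_samples)

print(n_samples, n_steps, samples_per_step, n_correct)
```

Under these assumptions the evaluation ran over roughly 16,000 images in about 1,000 batches of 16, which matches the batch size of 16 quoted for fine-tuning and would correspond to 20% of the 80,000-image dataset.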

70.7M downloads
890 likes
PYTORCH

Detects age group with about 59% accuracy based on an image. See https://www.kaggle.com/code/dima806/age-group-image-classification-vit for details.

26.2M downloads
52 likes
OTHER

A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using recipe template described below.

Recipe details:
- A LAMB optimizer based recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.5
  - GMACs: 0.1
  - Activations (M): 1.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison

Explore the dataset and runtime metrics of this model in timm model results.
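To illustrate what a step (staircase exponential decay) learning-rate schedule with warmup looks like, here is a minimal sketch. It is not `timm`'s scheduler, and every number in it (base rate, warmup length, decay factor, decay interval) is a hypothetical placeholder, since the card does not list the recipe's actual values.

```python
def staircase_lr(epoch, base_lr=0.1, warmup_epochs=5, decay_rate=0.5, decay_every=10):
    """Learning rate for a 0-based epoch; all default values are hypothetical."""
    if epoch < warmup_epochs:
        # Linear warmup: ramp from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Staircase: the exponent only increases once every `decay_every` epochs,
    # so the rate stays flat between drops instead of decaying continuously.
    n_drops = (epoch - warmup_epochs) // decay_every
    return base_lr * decay_rate ** n_drops


for epoch in (0, 4, 5, 14, 15, 25):
    print(epoch, staircase_lr(epoch))
```

The integer division is what makes the decay a staircase; replacing it with true division would give a smooth exponential curve instead.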

23.4M downloads
41 likes
PYTORCH
#5

adetailer

by Bingsu

- coco2017 (only person)
- AniSeg
- skytnt/anime-segmentation

| id | label |
| --- | --------------------- |
| 0 | short_sleeved_shirt |
| 1 | long_sleeved_shirt |
| 2 | short_sleeved_outwear |
| 3 | long_sleeved_outwear |
| 4 | vest |
| 5 | sling |
| 6 | shorts |
| 7 | trousers |
| 8 | skirt |
| 9 | short_sleeved_dress |
| 10 | long_sleeved_dress |
| 11 | vest_dress |
| 12 | sling_dress |

| Model | Target | mAP 50 | mAP 50-95 |
| --------------------------- | --------------------- | ----------------------------- | ----------------------------- |
| face_yolov8n.pt | 2D / realistic face | 0.660 | 0.366 |
| face_yolov8n_v2.pt | 2D / realistic face | 0.669 | 0.372 |
| face_yolov8s.pt | 2D / realistic face | 0.713 | 0.404 |
| face_yolov8m.pt | 2D / realistic face | 0.737 | 0.424 |
| face_yolov9c.pt | 2D / realistic face | 0.748 | 0.433 |
| hand_yolov8n.pt | 2D / realistic hand | 0.767 | 0.505 |
| hand_yolov8s.pt | 2D / realistic hand | 0.794 | 0.527 |
| hand_yolov9c.pt | 2D / realistic hand | 0.810 | 0.550 |
| person_yolov8n-seg.pt | 2D / realistic person | 0.782 (bbox) 0.761 (mask) | 0.555 (bbox) 0.460 (mask) |
| person_yolov8s-seg.pt | 2D / realistic person | 0.824 (bbox) 0.809 (mask) | 0.605 (bbox) 0.508 (mask) |
| person_yolov8m-seg.pt | 2D / realistic person | 0.849 (bbox) 0.831 (mask) | 0.636 (bbox) 0.533 (mask) |
| deepfashion2_yolov8s-seg.pt | realistic clothes | 0.849 (bbox) 0.840 (mask) | 0.763 (bbox) 0.675 (mask) |

Since `getattr` is classified as a dangerous pickle function, any segmentation model that uses it is classified as unsafe. All models were created and saved using the official ultralytics library, so it's okay to use files downloaded from a trusted source.

See also: https://huggingface.co/docs/hub/security-pickle
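To make the `getattr` remark concrete, the sketch below lists the globals a pickle stream would import when loaded, using only the standard library's `pickletools`. It is a simplified illustration of how such a check can work, not the scanner the Hub actually runs, and the helper name `imported_globals` is ours. It walks the opcodes without unpickling anything.

```python
import pickle
import pickletools


def imported_globals(data):
    """List (module, name) pairs that a pickle stream references as globals."""
    found = []
    strings = []  # string arguments seen so far, needed for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Older protocols carry "module name" as a single argument.
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+ pushes module and name as two strings first.
            # Simplification: strings fetched back from the memo are not tracked.
            found.append((strings[-2], strings[-1]))
        elif isinstance(arg, str):
            strings.append(arg)
    return found


payload = pickle.dumps(getattr, protocol=4)
print(imported_globals(payload))  # [('builtins', 'getattr')]
```

A scanner built on this idea would compare the returned pairs against a list of names it treats as risky, which is why a file that references `getattr` gets flagged regardless of how it uses it.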

14.1M downloads
625 likes
PYTORCH
#6

gpt2

by openai-community

Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page. Disclaimer: The team releasing GPT-2 also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias. GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences. More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token `i` only uses the inputs from `1` to `i` but not the future tokens. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt. This is the smallest version of GPT-2, with 124M parameters. You can use the raw model for text generation or fine-tune it to a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility: Here is how to use this model to get the features of a given text in PyTorch: The training data used for this model has not been released as a dataset one can browse. 
We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

Here's an example of how the model can have biased predictions:

This bias will also affect all fine-tuned versions of this model.

The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.

The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens.

The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact details of training.
The model achieves the following results without any fine-tuning (zero-shot):

| Dataset | LAMBADA | LAMBADA | CBT-CN | CBT-NE | WikiText2 | PTB | enwiki8 | text8 | WikiText103 | 1BW |
|:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:|
| (metric) | (PPL) | (ACC) | (ACC) | (ACC) | (PPL) | (PPL) | (BPB) | (BPC) | (PPL) | (PPL) |
| | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
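The training setup described in this card (targets equal to the inputs shifted by one token, plus a mask that hides future positions) can be sketched in a few lines. The token IDs below are arbitrary placeholders rather than real GPT-2 vocabulary entries, the helper names are ours, and positions are 0-based.

```python
def next_token_pairs(token_ids):
    """Split one sequence into (inputs, targets) for next-token prediction."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


token_ids = [11, 22, 33, 44, 55]  # placeholder IDs, not real vocabulary entries
inputs, targets = next_token_pairs(token_ids)
print(inputs)   # [11, 22, 33, 44]
print(targets)  # [22, 33, 44, 55]

# Each row shows which positions one token can see: "x" visible, "." hidden.
for row in causal_mask(len(inputs)):
    print("".join("x" if allowed else "." for allowed in row))
```

Each target is simply the token that follows the corresponding input, and the lower-triangular mask ensures the prediction at a position never depends on later tokens.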

11.8M downloads
3.0K likes
PYTORCH
#7

Qwen3-0.6B

by Qwen

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

> [!TIP]
> If you encounter significant endless repetitions, please refer to the Best Practices section for optimal sampling parameters, and set the `presence_penalty` to 1.5.
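The grouped-query attention figures listed for this model (16 query heads, 8 key/value heads) imply that query heads share key/value heads in groups. The sketch below shows one common assignment, in which consecutive query heads map to the same key/value head. This layout is an assumption for illustration and is not taken from the Qwen3 source code; the helper name is ours.

```python
def kv_head_for_query_head(q_head, n_q_heads=16, n_kv_heads=8):
    """Index of the key/value head shared by a given query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into key/value heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("query head index out of range")
    group_size = n_q_heads // n_kv_heads  # query heads per key/value head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(16)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
```

With 16 query heads and 8 key/value heads, each key/value head serves two query heads, which halves the key/value cache compared with giving every query head its own pair.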
The code for Qwen3 is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.

With `transformers` […]

```python
# `output_ids` (the generated token IDs) and `tokenizer` are defined in the
# part of this snippet that is missing from the source.
try:
    # Find the position just after the last occurrence of token ID 151668
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # Keep only the newly generated tokens by slicing off the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',  # assumed: the opening of this dict is missing from the source

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: When the response content is `<think>this is the thought</think>this is the answer;
    #     # Do not add: When the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
            'time': {
                'command': 'uvx',
                'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"]
            }
        }
    },
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```

7.3M downloads
777 likes
OTHER

---
license: cc-by-nc-4.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: distilbert_finetuned_ai4privacy_v2
  results: []
datasets:
- ai4privacy/pii-masking-200k
- Isotonic/pii-masking-200k
pipeline_tag: token-classification
language:
- en
metrics:
- seqeval
---

5.9M downloads
21 likes
ONNX

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework

Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

- Developed by: Kyutai
- Model type: Multimodal speech-text foundation model
- Language(s) (NLP): English
- License: CC-BY

The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks and cannot access tools, but rather focuses on natural, low-latency interactions.

Some components of the model can be used independently or repurposed relatively easily. For instance the Mimi codec is a state-of-the-art audio neural codec that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Moshi architecture, other downstream use cases would require some finetuning / domain adaptation.

The model is not intended to be used to impersonate other people or any malicious use of any kind. This model is for research only and we do not recommend it for providing advice or for performing any professional duty.
The model has been trained with a few safeguards to try to limit potential toxic usages, however our toxicity analysis shows that it behaves in the middle of existing models with respect to textual generation. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far and it is trained to produce only one voice to avoid impersonation. Still, more time and perspective are needed to establish its sociotechnical limitations.

- Textual data: The underlying Helium model is trained on a mix of data, more precisely:
  - 12.5% is high-quality data from the following curated sources: Wikipedia, Wikibooks, Wikisource, Wikinews, StackExchange and the collection of scientific articles pes2o. For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
  - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
- Unsupervised audio dataset: used for pre-training, this is a collection of 7 million hours of readily available audio content, which consists mostly of English speech. This training set is transcribed with Whisper (large v3 model).
- The Fisher dataset: used to enable multi-stream. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using AudioSR.
- Supervised multi-stream dataset: A dataset of 170 hours of natural and scripted conversation between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system used to create synthetic data.
- Synthetic data: 20,000 hours of synthetic data generated by our TTS system, and simulating a dialogue between Moshi and a user.

The different stages of the training procedure are detailed in the paper along with the hyper-parameters.

The training was performed on 127 DGX nodes provided by Scaleway, accounting for 1016 H100 Nvidia GPUs.
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour

5.7M downloads
187 likes
OTHER

Fine-tuned XLSR-53 large model for speech recognition in Portuguese

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.

This model has been fine-tuned thanks to the GPU credits generously given by the OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

The model can be used directly (without a language model) as follows...

| Reference | Prediction |
| ------------- | ------------- |
| NEM O RADAR NEM OS OUTROS INSTRUMENTOS DETECTARAM O BOMBARDEIRO STEALTH. | NEMHUM VADAN OS OLTWES INSTRUMENTOS DE TTÉÃN UM BOMBERDEIRO OSTER |
| PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA | E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA |
| OITO | OITO |
| TRANCÁ-LOS | TRANCAUVOS |
| REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA | REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA |
| O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS. | YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS |
| MENINA E MENINO BEIJANDO NAS SOMBRAS | MENINA E MENINO BEIJANDO NAS SOMBRAS |
| EU SOU O SENHOR | EU SOU O SENHOR |
| DUAS MULHERES QUE SENTAM-SE PARA BAIXO LENDO JORNAIS. | DUAS MIERES QUE SENTAM-SE PARA BAICLANE JODNÓI |
| EU ORIGINALMENTE ESPERAVA | EU ORIGINALMENTE ESPERAVA |

1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
2. To evaluate on `speech-recognition-community-v2/dev_data`

Citation

If you want to cite this model you can use this:
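As a side note on the reference/prediction pairs listed in this card: one common way to score such pairs is word error rate (WER), the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration with no text normalization; it is not the evaluation script the card refers to, and the helper name is ours.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


pairs = [
    ("EU SOU O SENHOR", "EU SOU O SENHOR"),
    ("TRANCÁ-LOS", "TRANCAUVOS"),
    ("PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA",
     "E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA"),
]
for reference, hypothesis in pairs:
    print(f"{word_error_rate(reference, hypothesis):.3f}  {reference}")
```

Because insertions count as errors, WER can exceed 1.0 when the prediction contains many more words than the reference.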

4.7M downloads
35 likes
PYTORCH

Curated and trained by Eric Hartford, Lucas Atkins, and Fernando Fernandes, and Cognitive Computations This is our most spectacular outcome ever. FFT, all parameters, 16bit. 77.4 MMLU on 34b. And it talks like a dream. Although the max positional embeddings is 4k, we used rope theta of 1000000.0 and we trained with sequence length 8k. We plan to train on the upcoming 32k version as well. Website: https://dphn.ai Twitter: https://x.com/dphnAI Web Chat: https://chat.dphn.ai Telegram bot: https://t.me/DolphinAIbot Our appreciation for the sponsors of Dolphin 2.9.1: - Crusoe Cloud - provided excellent on-demand 8xH100 node - OnDemand - provided inference sponsorship This model is based on Yi-1.5-34b, and is governed by apache 2.0 license. The base model has 4k context, but we used rope theta of 1000000.0 and the full-weight fine-tuning was with 8k sequence length. Dolphin-2.9.1 has a variety of instruction, conversational, and coding skills. It also has initial agentic abilities and supports function calling. Dolphin is uncensored. We have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models. https://erichartford.com/uncensored-models You are responsible for any content you create using this model. Enjoy responsibly. Dolphin is licensed according to apache 2.0 license. We grant permission for any use, including commercial. Dolphin was trained on data generated from GPT4, among other models. This model is a fine-tuned version of 01-ai/Yi-1.5-34B on the None dataset. 
It achieves the following results on the evaluation set:
- Loss: 0.4425

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.6265 | 0.0 | 1 | 0.6035 |
| 0.4674 | 0.25 | 327 | 0.4344 |
| 0.4337 | 0.5 | 654 | 0.4250 |
| 0.4346 | 0.75 | 981 | 0.4179 |
| 0.3985 | 1.0 | 1308 | 0.4118 |
| 0.3128 | 1.23 | 1635 | 0.4201 |
| 0.3261 | 1.48 | 1962 | 0.4157 |
| 0.3259 | 1.73 | 2289 | 0.4122 |
| 0.3126 | 1.98 | 2616 | 0.4079 |
| 0.2265 | 2.21 | 2943 | 0.4441 |
| 0.2297 | 2.46 | 3270 | 0.4427 |
| 0.2424 | 2.71 | 3597 | 0.4425 |

- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0
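The total batch sizes in the hyperparameter list follow from the per-device values, which the short calculation below makes explicit. The sequences-per-epoch figure at the end is our own inference; it assumes that each logged step is one optimizer update, which the card does not state.

```python
# Values copied from the hyperparameter list.
train_batch_size = 1              # per device
eval_batch_size = 1               # per device
num_devices = 8
gradient_accumulation_steps = 8

# One optimizer update sees every device's batch, accumulated over several passes.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
# Evaluation does not accumulate gradients, so only the device count matters.
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 64 8

# Assumption: the step logged at epoch 1.0 counts optimizer updates.
steps_per_epoch = 1308
sequences_per_epoch = steps_per_epoch * total_train_batch_size
print(sequences_per_epoch)  # 83712
```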

4.6M downloads
48 likes
OTHER
#12

Kokoro-82M

by hexgrad

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

> [!NOTE]
> As of April 2025, the market rate of Kokoro served over API is under $1 per million characters of text input, or under $0.06 per hour of audio output. (On average, 1000 characters of input is about 1 minute of output.) Sources: ArtificialAnalysis/Replicate at 65 cents per M chars and DeepInfra at 80 cents per M chars.
>
> This is an Apache-licensed model, and Kokoro has been deployed in numerous projects and commercial APIs. We welcome the deployment of the model in real use cases.

> [!CAUTION]
> Fake websites like kokorottsaicom (snapshot: https://archive.ph/nRRnk) and kokorottsnet (snapshot: https://archive.ph/60opa) are likely scams masquerading under the banner of a popular model.
>
> Any website containing "kokoro" in its root domain (e.g. kokorottsaicom, kokorottsnet) is NOT owned by and NOT affiliated with this model page or its author, and attempts to imply otherwise are red flags.

- Releases
- Usage
- EVAL.md ↗️
- SAMPLES.md ↗️
- VOICES.md ↗️
- Model Facts
- Training Details
- Creative Commons Attribution
- Acknowledgements

| Model | Published | Training Data | Langs & Voices | SHA256 |
| ----- | --------- | ------------- | -------------- | ------ |
| v1.0 | 2025 Jan 27 | Few hundred hrs | 8 & 54 | `496dba11` |
| v0.19 | 2024 Dec 25 | … |

```python
# …=0.9.2 soundfile   (the start of this line is truncated in the source)
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

pipeline = KPipeline(lang_code='a')
text = '''
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
# Each item yields the text segment, its phonemes, and the audio at 24 kHz
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)
```

Under the hood, `kokoro` uses `misaki`, a G2P library at https://github.com/hexgrad/misaki

Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Model SHA256 Hash: `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio [1] generated by closed [2] TTS models from large providers

[1] https://copyright.gov/ai/aipolicyguidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"

Total Training Cost: About $1000 for 1000 hours of A100 80GB vRAM

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| ---------- | ------------- | ------- | --------------------------- |
| Koniwa `tnc` |
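The two price figures quoted in the note for this model (dollars per million characters and dollars per hour of audio) are linked by the stated average of about 1,000 characters per minute of output. The sketch below works through that conversion for the rates mentioned in the note; the function and constant names are ours.

```python
CHARS_PER_AUDIO_MINUTE = 1000  # "1000 characters of input is about 1 minute of output"


def cost_per_audio_hour(usd_per_million_chars):
    """Convert a per-character price into a price per hour of generated audio."""
    chars_per_hour = CHARS_PER_AUDIO_MINUTE * 60
    return usd_per_million_chars * chars_per_hour / 1_000_000


rates = [
    ("$1 per M chars (quoted ceiling)", 1.00),
    ("ArtificialAnalysis/Replicate", 0.65),
    ("DeepInfra", 0.80),
]
for label, rate in rates:
    print(f"{label}: ${cost_per_audio_hour(rate):.3f} per hour of audio")
```

At the $1 ceiling this gives $0.06 per hour, which matches the figure in the note; the two cited providers come out lower.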

4.4M downloads
5.3K likes
OTHER
#13

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

1. The spectrogram input uses 128 Mel frequency bins instead of 80
2. A new language token for Cantonese

The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.

The large-v3 model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper large-v2. For more details on the different checkpoints available, refer to the section Model details.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time:

The model can be used with the `pipeline` class to transcribe audio of arbitrary length:

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:

Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter:

Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens.
These heuristics are enabled through generation arguments passed to the pipeline.

Whisper predicts the language of the source audio automatically. If the source audio language is known a priori, it can be passed as an argument to the pipeline. By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to `"translate"`. Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument.

The above arguments can be used in isolation or in combination. For example, transcribing French source audio with sentence-level timestamps combines the language, task, and `return_timestamps` arguments. For more control over the generation parameters, use the model + processor API directly.

You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM requirements.

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument `batch_size`.

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups. Note: `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️

We recommend using Flash Attention 2 if your GPU supports it and you are not using `torch.compile`. To do so, first install Flash Attention, then pass `attn_implementation="flash_attention_2"` to `from_pretrained`.

If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater. The card checks for a compatible PyTorch version with a short Python snippet: if it returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it returns `False`, you need to upgrade your PyTorch version according to the official instructions. SDPA can also be set explicitly by specifying `attn_implementation="sdpa"`. For more information about how to use SDPA, refer to the Transformers SDPA documentation.

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio.
For speech translation, the model predicts transcriptions to a different language to the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |

The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.

The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages.
They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.

In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; use of the model for classification is not only not evaluated but also not appropriate, particularly to infer human attributes.

The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. As discussed in the accompanying paper, we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language.

Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, technical language, as well as zero-shot translation from multiple languages into English; and that accuracy on speech recognition and translation is near the state-of-the-art level.

However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.

Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release.

In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis of these limitations is provided in the paper. It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.

We anticipate that Whisper models’ transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

There are also potential dual use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.
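
The chunked long-form algorithm described in this card can be illustrated without loading any model. The following is a minimal sketch under stated assumptions, not the Transformers implementation: the function name `chunk_spans` and the 5-second overlap are hypothetical choices for illustration, and only the 30-second window comes from the card.

```python
from typing import List, Tuple


def chunk_spans(total_s: float, chunk_s: float = 30.0,
                overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Split [0, total_s] into windows of at most chunk_s seconds.

    Consecutive windows share overlap_s seconds, so that transcripts can be
    stitched at the boundaries. The overlap value is an illustrative
    assumption, not a value taken from the card or the library.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    total_s = float(total_s)
    step = chunk_s - overlap_s  # how far each new window advances
    spans = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:  # the last window reaches the end of the audio
            break
        start += step
    return spans


if __name__ == "__main__":
    # A 70-second clip is covered by three overlapping windows.
    for start, end in chunk_spans(70):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each span would be transcribed independently and the overlapping regions reconciled afterwards; the sequential algorithm, by contrast, moves a single window forward and conditions each slice on the previous one.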

4.1M downloads
5.1K likes
PYTORCH

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation. You can find more details about it in this GitHub discussion.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. The card's examples also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time.

The model can be used with the `pipeline` class to transcribe audio of arbitrary length. To transcribe a local audio file, pass the path to your audio file when you call the pipeline. Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter.

Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens. These heuristics are enabled through generation arguments passed to the pipeline.

Whisper predicts the language of the source audio automatically. If the source audio language is known a priori, it can be passed as an argument to the pipeline. By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language.
To perform speech translation, where the target text is in English, set the task to `"translate"`. Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument.

The above arguments can be used in isolation or in combination. For example, transcribing French source audio with sentence-level timestamps combines the language, task, and `return_timestamps` arguments. For more control over the generation parameters, use the model + processor API directly.

You can apply additional speed and memory improvements to Whisper to further reduce inference time and VRAM requirements.

Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument `batch_size`.

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups.
Note: `torch.compile` is currently not compatible with the Chunked long-form algorithm or Flash Attention 2 ⚠️

We recommend using Flash Attention 2 if your GPU supports it and you are not using `torch.compile`. To do so, first install Flash Attention, then pass `attn_implementation="flash_attention_2"` to `from_pretrained`.

If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater. The card checks for a compatible PyTorch version with a short Python snippet: if it returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it returns `False`, you need to upgrade your PyTorch version according to the official instructions. SDPA can also be set explicitly by specifying `attn_implementation="sdpa"`. For more information about how to use SDPA, refer to the Transformers SDPA documentation.

Whisper is a Transformer based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions to a different language to the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face Hub.
The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |
| large-v3-turbo | 809 M | x | ✓ |

The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.

The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.
In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; use of the model for classification is not only not evaluated but also not appropriate, particularly to infer human attributes.

Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, technical language, as well as zero-shot translation from multiple languages into English; and that accuracy on speech recognition and translation is near the state-of-the-art level.

However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself.

Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release.

In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis of these limitations is provided in the paper. It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.

We anticipate that Whisper models’ transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

There are also potential dual use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.
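
The card quotes accuracy differences in terms of word error rate (WER). As a rough illustration of how that metric is usually computed, the sketch below counts word-level substitutions, insertions, and deletions with a standard edit-distance recurrence and divides by the reference length. The function name `word_error_rate` is hypothetical, and plain whitespace tokenisation is a simplifying assumption; real evaluations typically normalise text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, which is one way hallucinated text shows up in this metric.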

4.0M downloads
2.7K likes
OTHER

This model features:
- ReLU activations
- single layer 7x7 convolution with pooling
- 1x1 convolution shortcut downsample

Trained on ImageNet-1k in `timm` using recipe template described below.

Recipe details:
- ResNet Strikes Back `A1` recipe
- LAMB optimizer with BCE loss
- Cosine LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 25.6
  - GMACs: 4.1
  - Activations (M): 11.1
  - Image size: train = 224 x 224, test = 288 x 288
- Papers:
  - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476
  - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison
Explore the dataset and runtime metrics of this model in timm model results.

|model |img_size|top1 |top5 |param_count|gmacs|macts|img/sec|
|------------------------------------------|--------|-----|-----|-----------|-----|-----|-------|
|seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 |
|seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 |
|seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 |
|seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 |
|resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 |
|resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 |
|ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 |
|ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 |
|resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 |
|seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 |
|resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 |
|resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 |
|seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 |
|seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 |
|resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 |
|resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 |
|resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 |
|seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 |
|resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 |
|resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 |
|resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 |
|resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 |
|seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 |
|resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 |
|seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 |
|seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 |
|resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 |
|resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 |
|seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 |
|resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 |
|resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 |
|resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 |
|resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 |
|resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 |
|resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 |
|resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 |
|resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 |
|ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 |
|resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 |
|resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 |
|resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 |
|resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 |
|resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 |
|resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 |
|resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 |
|resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 |
|resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 |
|resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 |
|wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 |
|resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 |
|resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 |
|ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 |
|resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 |
|resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 |
|resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 |
|wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 |
|resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 |
|resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 |
|seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 |
|ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 |
|resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 |
|ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 |
|resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 |
|ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 |
|resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 |
|resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 |
|resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 |
|resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 |
|resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 |
|resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 |
|resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 |
|resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 |
|resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 |
|ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 |
|ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 |
|resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 |
|wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 |
|resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 |
|resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 |
|resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 |
|gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 |
|wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 |
|resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 |
|ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 |
|ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 |
|resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 |
|resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 |
|seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 |
|resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 |
|seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 |
|ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 |
|gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 |
|senet154.gluonin1k|224 |81.23|95.35|115.1 |20.8 |38.7 |545 |
|resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 |18.4 |2089 |
|resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 |
|resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 |
|resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 |
|resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 |
|resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 |
|seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 |
|seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 |
|resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 |
|resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 |
|gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 |
|resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 |
|resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 |
|seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 |
|seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 |
|resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 |
|resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 |
|ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 |
|ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 |
|seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 |
|ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 |
|resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 |
|resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 |
|resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 |
|resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 |
|resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 |
|ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 |
|resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 |
|resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 |
|resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 |
|ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 |
|resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 |
|gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 |
|gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 |
|resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 |
|resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 |
|resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 |
|wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 |
|resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 |
|resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 |
|ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 |
|resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 |
|wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 |
|resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 |
|resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 |
|seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 |
|resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 |
|resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 |
|resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 |
|seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 |
|resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 |
|resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 |
|resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 |
|seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 |
|ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 |
|gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 |11.7 |3216 |
|resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 |
|seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 |
|resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 |
|resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 |
|seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 |
|resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 |
|resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 |
|resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 |
|resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 |
|ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 |
|resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 |
|resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 |
|resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 |
|resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 |
|resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 |
|ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 |
|resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 |
|resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 |
|resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 |
|resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 |
|ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 |
|resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 |
|resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 |
|resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 |
|resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 |
|resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 |
|resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 |
|resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 |
|resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 |
|resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 |
|resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 |
|resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 |
|resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 |
|resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 |
|resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 |
|resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 |
|resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 |
|resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 |
|resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 |
|ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 |
|resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 |
|wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 |
|seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 |
|resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 |
|seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 |
|resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 |
|resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 |
|wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 |
|resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 |
|resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 |
|gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 |
|resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 |
|resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 |
|seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 |
|batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 |
|resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 |
|resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 |
|ecaresnext26ts.chin1k|288 |78.0 |93.91|10.3 |3.1 |13.3 |3297 |
|seresnext26t32x4d.btin1k|224 |77.98|93.75|16.8 |2.7 |10.1 |3841 |
|resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 |
|resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 |
|resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 |
|seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 |
|resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 |
|gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 |
|ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 |
|resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 |
|resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 |
|seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 |
|resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 |
|ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 |
|resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 |
|resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 |
|resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 |
|resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 |
|resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 |
|resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 |
|seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 |
|resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 |
|resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 |
|resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 |
|resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 |
|resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 |
|resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 |
|resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 |
|resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 |
|resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 |
|resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 |
|seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 |
|resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 |
|resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 |
|resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 |
|resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 |
|resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 |
|resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 |
|resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 |
|resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 |
|resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 |
|resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 |
|resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 |
|resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 |
|resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 |
|resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 |
|resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 |
|resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 |
|resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 |
|resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 |
|resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 |
|resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
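As a rough illustration of how rows like the ones above can be used, the sketch below parses a few of them and picks the highest-throughput entry that clears a top-1 accuracy threshold. The column meanings (model, image size, top-1, top-5, parameters in millions, GMACs, activations in millions, images per second) are an assumption based on the numeric pattern, since the table header is not part of this excerpt; `Row`, `parse_row` and `fastest_above` are hypothetical helper names.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    # Field meanings are assumed; the table header is not shown in this excerpt.
    model: str
    img_size: int
    top1: float
    top5: float
    params_m: float
    gmacs: float
    macts_m: float
    img_per_sec: float


def parse_row(line: str) -> Row:
    """Split one pipe-delimited row into typed fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return Row(cells[0], int(cells[1]), *(float(c) for c in cells[2:8]))


def fastest_above(rows: List[Row], min_top1: float) -> Optional[Row]:
    """Return the row with the highest throughput among those meeting min_top1."""
    ok = [r for r in rows if r.top1 >= min_top1]
    return max(ok, key=lambda r: r.img_per_sec) if ok else None


# Four rows copied from the table above.
SAMPLE = """\
|resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 |
|resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 |
|resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 |
|resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 |
"""

rows = [parse_row(line) for line in SAMPLE.splitlines()]
best = fastest_above(rows, 70.0)
print(best.model, best.img_per_sec)
```

Filtering on accuracy first and then maximizing throughput is only one possible selection rule; a deployment with a memory limit might filter on the parameter column instead.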

3.6M downloads
39 likes
PYTORCH

The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. Training started on 2023-09-01. We adopted exactly the same architecture and tokenizer as Llama 2, so TinyLlama can be plugged into many open-source projects built upon Llama. TinyLlama is also compact, with only 1.1B parameters, which suits applications with a restricted computation and memory footprint.

This Model: This is the chat model fine-tuned on top of TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T. We follow HF's Zephyr training recipe. The model was "initially fine-tuned on a variant of the `UltraChat` dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's `DPOTrainer` on the openbmb/UltraFeedback dataset, which contain 64k prompts and model completions that are ranked by GPT-4."

How to use: You will need transformers>=4.34. Check the TinyLlama GitHub page for more information.
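To put the stated training budget in perspective, the arithmetic below works out the sustained throughput implied by 3 trillion tokens in 90 days on 16 GPUs. This is a back-of-the-envelope sketch that assumes uninterrupted training with no restarts or evaluation pauses; the variable names are illustrative and not taken from the TinyLlama code base.

```python
# Figures quoted in the project description.
TOKENS = 3_000_000_000_000  # 3 trillion training tokens
DAYS = 90
GPUS = 16

# Assumption: the GPUs train continuously for the whole period.
seconds = DAYS * 24 * 60 * 60
total_tps = TOKENS / seconds      # tokens per second across the cluster
per_gpu_tps = total_tps / GPUS    # tokens per second per GPU

print(f"{total_tps:,.0f} tokens/s overall, {per_gpu_tps:,.0f} tokens/s per GPU")
```

Under these assumptions each GPU has to sustain roughly 24 thousand tokens per second, which is why the description stresses "proper optimization".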

3.3M downloads
1.4K likes
OTHER
#17

vitpose-plus-base

by usyd-community

---
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: keypoint-detection
---

3.1M downloads
23 likes
OTHER

---
license: mit
base_model: flax-community/indonesian-roberta-base
tags:
- generated_from_trainer
datasets:
- indonlu
language:
- ind
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: indonesian-roberta-base-posp-tagger
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: indonlu
      type: indonlu
      config: posp
      split: test
      args: posp
    metrics:
    - name: Precision
      type: precision
      value: 0.9625100240577386
    - name: Recall
      type: recall
      value: 0.962510024057

3.1M downloads
9 likes
PYTORCH

The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.

Model Architecture: Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

| | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Llama 3.2 (text only) | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code | 128k | Yes | Yes | Up to 9T tokens | December 2023 |
| | | 3B (3.21B) | Multilingual Text | Multilingual Text and code | | | | | |
| Llama 3.2 Quantized (text only) | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code | 8k | Yes | Yes | Up to 9T tokens | December 2023 |
| | | 3B (3.21B) | Multilingual Text | Multilingual Text and code | | | | | |

Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy.
Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly. Llama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. Status: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here. Intended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources. Out of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card. This repository contains two versions of Llama-3.2-3B-Instruct, for use with `transformers` and with the original `llama` codebase. Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function. Make sure to update your transformers installation via `pip install --upgrade transformers`. 
Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at `huggingface-llama-recipes`.

To download Original checkpoints, see the example command below leveraging `huggingface-cli`:

Training Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.

Training Energy Use: Training utilized a cumulative 916k GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency.

Training Greenhouse Gas Emissions: Estimated total location-based greenhouse gas emissions were 240 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

| | Training Time (GPU hours) | Logit Generation Time (GPU Hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
| :---- | :---: | ----- | :---: | :---: | :---: |
| Llama 3.2 1B | 370k | - | 700 | 107 | 0 |
| Llama 3.2 3B | 460k | - | 700 | 133 | 0 |
| Llama 3.2 1B SpinQuant | 1.7 | 0 | 700 | Negligible\*\* | 0 |
| Llama 3.2 3B SpinQuant | 2.4 | 0 | 700 | Negligible\*\* | 0 |
| Llama 3.2 1B QLora | 1.3k | 0 | 700 | 0.381 | 0 |
| Llama 3.2 3B QLora | 1.6k | 0 | 700 | 0.461 | 0 |
| Total | 833k | 86k | | 240 | 0 |

\*\* The location-based CO2e emissions of Llama 3.2 1B SpinQuant and Llama 3.2 3B SpinQuant are less than 0.001 metric tonnes each.
This is due to the minimal training GPU hours that are required. The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Overview: Llama 3.2 was pretrained on up to 9 trillion tokens of data from publicly available sources. For the 1B and 3B Llama 3.2 models, we incorporated logits from the Llama 3.1 8B and 70B models into the pretraining stage of the model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance. In post-training we used a similar recipe as Llama 3.1 and produced final chat models by doing several rounds of alignment on top of the pre-trained model. Each round involved Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO).

Data Freshness: The pretraining data has a cutoff of December 2023.

We designed the current quantization scheme with PyTorch’s ExecuTorch inference framework and Arm CPU backend in mind, taking into account metrics including model quality, prefill/decoding speed, and memory footprint. Our quantization scheme involves three parts:

- All linear layers in all transformer blocks are quantized to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
- The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations.
- As with the classification layer, 8-bit per-channel quantization is used for the embedding layer.

The quantization-aware training (QAT) with low-rank adaptation (LoRA) models went through only post-training stages, using the same data as the full precision models.
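To make the 4-bit groupwise weight scheme (group size 32) described above more concrete, here is a minimal pure-Python sketch. It assumes symmetric round-to-nearest quantization with one floating-point scale per group; the model card does not say whether the actual scheme is symmetric or how scales are stored, and the 8-bit dynamic activation quantization is not shown. The function names are hypothetical.

```python
import random


def quantize_groupwise(weights, group_size=32, bits=4):
    """Symmetric per-group quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to qmax; avoid a zero scale.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=32):
    """Reconstruct approximate weights from integer codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # toy weight vector
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(len(scales), min(codes), max(codes), max_err)
```

Using a separate scale for every 32 weights keeps a single outlier from inflating the quantization step of the whole tensor, at the cost of storing one extra scale per group.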
To initialize QAT, we utilize BF16 Llama 3.2 model checkpoints obtained after supervised fine-tuning (SFT) and perform an additional full round of SFT training with QAT. We then freeze the backbone of the QAT model and perform another round of SFT with LoRA adaptors applied to all layers within the transformer block. Meanwhile, the LoRA adaptors' weights and activations are maintained in BF16. Because our approach is similar to QLoRA of Dettmers et al. (2023) (i.e., quantization followed by LoRA adapters), we refer to this method as QLoRA. Finally, we fine-tune the resulting model (both backbone and LoRA adaptors) using direct preference optimization (DPO).

SpinQuant was applied, together with generative post-training quantization (GPTQ). For the SpinQuant rotation matrix fine-tuning, we optimized for 100 iterations, using 800 samples with sequence-length 2048 from the WikiText 2 dataset. For GPTQ, we used 128 samples from the same dataset with the same sequence-length.

In this section, we report the results for Llama 3.2 models on standard automatic benchmarks. For all these evaluations, we used our internal evaluations library.
| Category | Benchmark | # Shots | Metric | Llama 3.2 1B | Llama 3.2 3B | Llama 3.1 8B |
| ----- | ----- | :---: | :---: | :---: | :---: | :---: |
| General | MMLU | 5 | macro_avg/acc_char | 32.2 | 58 | 66.7 |
| | AGIEval English | 3-5 | average/acc_char | 23.3 | 39.2 | 47.8 |
| | ARC-Challenge | 25 | acc_char | 32.8 | 69.1 | 79.7 |
| Reading comprehension | SQuAD | 1 | em | 49.2 | 67.7 | 77 |
| | QuAC (F1) | 1 | f1 | 37.9 | 42.9 | 44.9 |
| | DROP (F1) | 3 | f1 | 28.0 | 45.2 | 59.5 |
| Long Context | Needle in Haystack | 0 | em | 96.8 | 1 | 1 |

| Capability | | Benchmark | # Shots | Metric | Llama 3.2 1B bf16 | Llama 3.2 1B Vanilla PTQ\*\* | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B bf16 | Llama 3.2 3B Vanilla PTQ\*\* | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
| :---: | ----- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| General | | MMLU | 5 | macro_avg/acc | 49.3 | 43.3 | 47.3 | 49.0 | 63.4 | 60.5 | 62 | 62.4 | 69.4 |
| Re-writing | | Open-rewrite eval | 0 | micro_avg/rougeL | 41.6 | 39.2 | 40.9 | 41.2 | 40.1 | 40.3 | 40.8 | 40.7 | 40.9 |
| Summarization | | TLDR9+ (test) | 1 | rougeL | 16.8 | 14.9 | 16.7 | 16.8 | 19.0 | 19.1 | 19.2 | 19.1 | 17.2 |
| Instruction following | | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 51.5 | 58.4 | 55.6 | 77.4 | 73.9 | 73.5 | 75.9 | 80.4 |
| Math | | GSM8K (CoT) | 8 | em_maj1@1 | 44.4 | 33.1 | 40.6 | 46.5 | 77.7 | 72.9 | 75.7 | 77.9 | 84.5 |
| | | MATH (CoT) | 0 | final_em | 30.6 | 20.5 | 25.3 | 31.0 | 48.0 | 44.2 | 45.3 | 49.2 | 51.9 |
| Reasoning | | ARC-C | 0 | acc | 59.4 | 54.3 | 57 | 60.7 | 78.6 | 75.6 | 77.6 | 77.6 | 83.4 |
| | | GPQA | 0 | acc | 27.2 | 25.9 | 26.3 | 25.9 | 32.8 | 32.8 | 31.7 | 33.9 | 32.8 |
| | | Hellaswag | 0 | acc | 41.2 | 38.1 | 41.3 | 41.5 | 69.8 | 66.3 | 68 | 66.3 | 78.7 |
| Tool Use | | BFCL V2 | 0 | acc | 25.7 | 14.3 | 15.9 | 23.7 | 67.0 | 53.4 | 60.1 | 63.5 | 67.1 |
| | | Nexus | 0 | macro_avg/acc | 13.5 | 5.2 | 9.6 | 12.5 | 34.3 | 32.4 | 31.5 | 30.1 | 38.5 |
| Long Context | | InfiniteBench/En.QA | 0 | longbook_qa/f1 | 20.3 | N/A | N/A | N/A | 19.8 | N/A | N/A | N/A | 27.3 |
| | | InfiniteBench/En.MC | 0 | longbook_choice/acc | 38.0 | N/A | N/A | N/A | 63.3 | N/A | N/A | N/A | 72.2 |
| | | NIH/Multi-needle | 0 | recall | 75.0 | N/A | N/A | N/A | 84.7 | N/A | N/A | N/A | 98.8 |
| Multilingual | | MGSM (CoT) | 0 | em | 24.5 | 13.7 | 18.2 | 24.4 | 58.2 | 48.9 | 54.3 | 56.8 | 68.9 |

\*\*for comparison purposes only. Model not released.

| Category | Benchmark | Language | Llama 3.2 1B | Llama 3.2 1B Vanilla PTQ\*\* | Llama 3.2 1B Spin Quant | Llama 3.2 1B QLoRA | Llama 3.2 3B | Llama 3.2 3B Vanilla PTQ\*\* | Llama 3.2 3B Spin Quant | Llama 3.2 3B QLoRA | Llama 3.1 8B |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| General | MMLU (5-shot, macro_avg/acc) | Portuguese | 39.8 | 34.9 | 38.9 | 40.2 | 54.5 | 50.9 | 53.3 | 53.4 | 62.1 |
| | | Spanish | 41.5 | 36.0 | 39.8 | 41.8 | 55.1 | 51.9 | 53.6 | 53.6 | 62.5 |
| | | Italian | 39.8 | 34.9 | 38.1 | 40.6 | 53.8 | 49.9 | 52.1 | 51.7 | 61.6 |
| | | German | 39.2 | 34.9 | 37.5 | 39.6 | 53.3 | 50.0 | 52.2 | 51.3 | 60.6 |
| | | French | 40.5 | 34.8 | 39.2 | 40.8 | 54.6 | 51.2 | 53.3 | 53.3 | 62.3 |
| | | Hindi | 33.5 | 30.0 | 32.1 | 34.0 | 43.3 | 40.4 | 42.0 | 42.1 | 50.9 |
| | | Thai | 34.7 | 31.2 | 32.4 | 34.9 | 44.5 | 41.3 | 44.0 | 42.2 | 50.3 |

\*\*for comparison purposes only. Model not released.

In the below table, we compare the performance metrics of different quantization methods (SpinQuant and QAT + LoRA) with the BF16 baseline. The evaluation was done using the ExecuTorch framework as the inference engine, with the ARM CPU as a backend using Android OnePlus 12 device.
| Category | Decode (tokens/sec) | Time-to-first-token (sec) | Prefill (tokens/sec) | Model size (PTE file size in MB) | Memory size (RSS in MB) |
| :---- | ----- | ----- | ----- | ----- | ----- |
| 1B BF16 (baseline) | 19.2 | 1.0 | 60.3 | 2358 | 3,185 |
| 1B SpinQuant | 50.2 (2.6x) | 0.3 (-76.9%) | 260.5 (4.3x) | 1083 (-54.1%) | 1,921 (-39.7%) |
| 1B QLoRA | 45.8 (2.4x) | 0.3 (-76.0%) | 252.0 (4.2x) | 1127 (-52.2%) | 2,255 (-29.2%) |
| 3B BF16 (baseline) | 7.6 | 3.0 | 21.2 | 6129 | 7,419 |
| 3B SpinQuant | 19.7 (2.6x) | 0.7 (-76.4%) | 89.7 (4.2x) | 2435 (-60.3%) | 3,726 (-49.8%) |
| 3B QLoRA | 18.5 (2.4x) | 0.7 (-76.1%) | 88.8 (4.2x) | 2529 (-58.7%) | 4,060 (-45.3%) |

(\*) The performance measurement is done using an adb binary-based approach.
(\*\*) It is measured on an Android OnePlus 12 device.
(\*\*\*) Time-to-first-token (TTFT) is measured with prompt length=64.

- Decode (tokens/second) is for how quickly it keeps generating. Higher is better.
- Time-to-first-token (TTFT for shorthand) is for how fast it generates the first token for a given prompt. Lower is better.
- Prefill is the inverse of TTFT (aka 1/TTFT) in tokens/second. Higher is better.
- Model size: how big the model is, measured by the PTE file, a binary file format for ExecuTorch.
- RSS size: memory usage in resident set size (RSS).

As part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks:

1. Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama
2. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm
3. Provide protections for the community to help prevent the misuse of our models

Approach: Llama is a foundational technology designed to be used in a variety of use cases. Examples on how Meta’s Llama models have been responsibly deployed can be found in our Community Stories webpage.
Our approach is to build the most helpful models, enabling the world to benefit from the technology power, by aligning our model safety for generic use cases and addressing a standard set of harms. Developers are then in the driver’s seat to tailor safety for their use cases, defining their own policies and deploying the models with the necessary safeguards in their Llama systems. Llama 3.2 was developed following the best practices outlined in our Responsible Use Guide. Objective: Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. We implemented the same set of safety mitigations as in Llama 3, and you can learn more about these in the Llama 3 paper. Fine-Tuning Data: We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We’ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. Refusals and Tone: Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. Safety as a System: Large language models, including Llama 3.2, are not designed to be deployed in isolation but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. 
Safeguards are key to achieving the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with safeguards that developers should deploy with Llama models or other LLMs, including Llama Guard, Prompt Guard and Code Shield. All our reference implementation demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box.

Technological Advancement: Llama releases usually introduce new capabilities that require specific considerations in addition to the best practices that generally apply across all Generative AI use cases. For prior release capabilities also supported by Llama 3.2, see Llama 3.1 Model Card, as the same considerations apply here as well.

Constrained Environments: Llama 3.2 1B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version.

Scaled Evaluations: We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Purple Llama safeguards to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building a dedicated evaluation dataset for your use case.

Red Teaming: We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets.
We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets.

In addition to our safety work above, we took extra care on measuring and/or mitigating the following critical risk areas:

1. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive Weapons): Llama 3.2 1B and 3B models are smaller and less capable derivatives of Llama 3.1. For Llama 3.1 70B and 405B, to assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons and have determined that such testing also applies to the smaller 1B and 3B models.

2. Child Safety: Child Safety risk assessments were conducted using a team of experts, to assess the model’s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on.
We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences.

3. Cyber Attacks: For Llama 3.1 405B, our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention. Because Llama 3.2’s 1B and 3B models are smaller and less capable models than Llama 3.1 405B, we broadly believe that the testing conducted for the 405B model also applies to Llama 3.2 models.

Industry Partnerships: Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our Github repository.

Grants: We also set up the Llama Impact Grants program to identify and support the most compelling applications of Meta’s Llama model for societal benefit across three categories: education, climate and open innovation.
The 20 finalists from the hundreds of applications can be found here.

Reporting: Finally, we put in place a set of resources including an output reporting mechanism and bug bounty program to continuously improve the Llama technology with the help of the community.

Values: The core values of Llama 3.2 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.2 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress.

Testing: Llama 3.2 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.2’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.2 models, developers should perform safety testing and tuning tailored to their specific applications of the model. Please refer to available resources including our Responsible Use Guide, Trust and Safety solutions, and other resources to learn more about responsible development.

3.0M downloads
7 likes
PYTORCH

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him. Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team. The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224. Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image. You can use the raw model for image classification. See the model hub to look for fine-tuned versions on a task that interests you. 
The model can be used to classify an image, for example one from the COCO 2017 dataset, into one of the 1,000 ImageNet classes. For more code examples, we refer to the documentation.

The ViT model was pretrained on ImageNet-21k, a dataset consisting of 14 million images and 21k classes, and fine-tuned on ImageNet, a dataset consisting of 1 million images and 1k classes. The exact details of preprocessing of images during training/validation can be found here. Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

The model was trained on TPUv3 hardware (8 cores). All model variants are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet, the authors found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.

For evaluation results on several image classification benchmarks, we refer to tables 2 and 5 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution (384x384). Larger model variants also generally perform better.
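The patching and normalization figures above can be checked with a few lines of arithmetic. The sketch below counts the 16x16 patches in a 224x224 image, adds one position for the [CLS] token, and applies the stated mean and standard deviation of 0.5 to a single channel value. It assumes pixel values are first rescaled from 0..255 to 0..1, which the card implies ("resized/rescaled") but does not spell out; `num_patches` and `normalize_pixel` are illustrative names, not part of any library.

```python
def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def normalize_pixel(value_0_255: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale a channel value to 0..1 (assumed), then normalize with mean and std."""
    return (value_0_255 / 255.0 - mean) / std


seq_len = num_patches() + 1  # patches plus the [CLS] token
print(seq_len)
print(normalize_pixel(0), normalize_pixel(255))
```

With these settings the encoder sees 197 positions per image, and normalized channel values fall in the range -1 to 1. At the 384x384 fine-tuning resolution mentioned above, the same patch size gives 576 patches.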

3.0M downloads
890 likes
PYTORCH