Deepfake Detection
Verify authenticity of images and videos. Detect AI-generated faces and manipulated media. Critical for journalism, legal tech, and content verification.
Top Models
electra-base-discriminator
by google
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset. For a detailed description and experimental results, please refer to our paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).
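As a sanity check of the discriminator behaviour described above, the snippet below scores each token of a corrupted sentence as "real" or "fake" (a minimal sketch, assuming the `transformers` library is installed; the example sentence is illustrative only):

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

discriminator = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-base-discriminator")

# A sentence with one obviously replaced ("fake") token.
sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits  # one real/fake score per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, score in zip(tokens, logits[0].tolist()):
    # A positive logit means the discriminator believes the token was replaced.
    print(f"{token}\t{'fake' if score > 0 else 'real'}")
```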
nsfw_image_detection
by Falconsai
Model Card: Fine-Tuned Vision Transformer (ViT) for NSFW Image Classification

The Fine-Tuned Vision Transformer (ViT) is a variant of the transformer encoder architecture, similar to BERT, that has been adapted for image classification tasks. This specific model, "google/vit-base-patch16-224-in21k," is pre-trained in a supervised manner on the ImageNet-21k dataset, with images resized to a resolution of 224x224 pixels, making it suitable for a wide range of image recognition tasks.

During fine-tuning, careful attention was given to hyperparameter settings. The model was fine-tuned with a batch size of 16, balancing computational efficiency with the ability to learn from a diverse array of images, and a learning rate of 5e-5, chosen to balance rapid convergence with steady optimization. Fine-tuning used a proprietary dataset of 80,000 highly varied images curated into two classes, "normal" and "nsfw." This diversity allowed the model to learn nuanced visual patterns and accurately differentiate between safe and explicit content, making it well suited to content safety and moderation.

Intended Uses & Limitations
- NSFW Image Classification: The primary intended use of this model is the classification of NSFW (Not Safe for Work) images. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.

How to use: the model classifies an image into one of two classes (normal, nsfw); a usage sketch follows this description. Evaluation results:
- `eval_loss`: 0.07463177293539047
- `eval_accuracy`: 0.980375
- `eval_runtime`: 304.9846
- `eval_samples_per_second`: 52.462
- `eval_steps_per_second`: 3.279

Note: It's essential to use this model responsibly and ethically, adhering to content guidelines and applicable regulations when implementing it in real-world applications, particularly those involving potentially sensitive content. For more details on model fine-tuning and usage, please refer to the model's documentation and the model hub.
- Hugging Face Model Hub
- Vision Transformer (ViT) Paper
- ImageNet-21k Dataset

Disclaimer: The model's performance may be influenced by the quality and representativeness of the data it was fine-tuned on. Users are encouraged to assess the model's suitability for their specific applications and datasets.
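A minimal classification sketch of the usage described above, assuming `transformers` and Pillow are installed; `"image.jpg"` is a placeholder path:

```python
from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

img = Image.open("image.jpg")
for prediction in classifier(img):
    # Labels are "normal" or "nsfw", each with a confidence score.
    print(prediction["label"], round(prediction["score"], 3))
```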
fairface_age_image_detection
by dima806
Detects age group with about 59% accuracy based on an image. See https://www.kaggle.com/code/dima806/age-group-image-classification-vit for details.
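A minimal sketch of running the age-group classifier, assuming `transformers` is installed and that the Hub id is `dima806/fairface_age_image_detection` (composed from the entry above); `"photo.jpg"` is a placeholder face image:

```python
from transformers import pipeline

age_classifier = pipeline("image-classification", model="dima806/fairface_age_image_detection")

# Returns age-group labels ranked by confidence.
for result in age_classifier("photo.jpg"):
    print(result["label"], round(result["score"], 3))
```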
A MobileNet-v3 image classification model. Trained on ImageNet-1k in `timm` using the recipe template described below.

Recipe details:
- A LAMB optimizer based recipe that is similar to ResNet Strikes Back `A2` but 50% longer with EMA weight averaging, no CutMix
- Step (exponential decay w/ staircase) LR schedule with warmup

Model Details
- Model Type: Image classification / feature backbone
- Model Stats:
  - Params (M): 2.5
  - GMACs: 0.1
  - Activations (M): 1.4
  - Image size: 224 x 224
- Papers:
  - Searching for MobileNetV3: https://arxiv.org/abs/1905.02244
- Dataset: ImageNet-1k
- Original: https://github.com/huggingface/pytorch-image-models

Model Comparison: Explore the dataset and runtime metrics of this model in timm model results.
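A minimal `timm` inference sketch for a checkpoint matching the stats above. The entry does not name the exact checkpoint, so `mobilenetv3_small_050.lamb_in1k` is an assumption chosen to match the listed parameter count, and `"image.jpg"` is a placeholder input:

```python
import timm
import torch
from PIL import Image

model = timm.create_model("mobilenetv3_small_050.lamb_in1k", pretrained=True)
model.eval()

# Build the preprocessing transform that matches the model's pretraining config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open("image.jpg").convert("RGB")
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: (1, 1000) for ImageNet-1k

print(logits.softmax(dim=-1).topk(5))  # top-5 class probabilities and indices
```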
adetailer
by Bingsu
- coco2017 (only person)
- AniSeg
- skytnt/anime-segmentation

| id | label |
| --- | --------------------- |
| 0 | short_sleeved_shirt |
| 1 | long_sleeved_shirt |
| 2 | short_sleeved_outwear |
| 3 | long_sleeved_outwear |
| 4 | vest |
| 5 | sling |
| 6 | shorts |
| 7 | trousers |
| 8 | skirt |
| 9 | short_sleeved_dress |
| 10 | long_sleeved_dress |
| 11 | vest_dress |
| 12 | sling_dress |

| Model | Target | mAP 50 | mAP 50-95 |
| --------------------------- | --------------------- | ----------------------------- | ----------------------------- |
| face_yolov8n.pt | 2D / realistic face | 0.660 | 0.366 |
| face_yolov8n_v2.pt | 2D / realistic face | 0.669 | 0.372 |
| face_yolov8s.pt | 2D / realistic face | 0.713 | 0.404 |
| face_yolov8m.pt | 2D / realistic face | 0.737 | 0.424 |
| face_yolov9c.pt | 2D / realistic face | 0.748 | 0.433 |
| hand_yolov8n.pt | 2D / realistic hand | 0.767 | 0.505 |
| hand_yolov8s.pt | 2D / realistic hand | 0.794 | 0.527 |
| hand_yolov9c.pt | 2D / realistic hand | 0.810 | 0.550 |
| person_yolov8n-seg.pt | 2D / realistic person | 0.782 (bbox) 0.761 (mask) | 0.555 (bbox) 0.460 (mask) |
| person_yolov8s-seg.pt | 2D / realistic person | 0.824 (bbox) 0.809 (mask) | 0.605 (bbox) 0.508 (mask) |
| person_yolov8m-seg.pt | 2D / realistic person | 0.849 (bbox) 0.831 (mask) | 0.636 (bbox) 0.533 (mask) |
| deepfashion2_yolov8s-seg.pt | realistic clothes | 0.849 (bbox) 0.840 (mask) | 0.763 (bbox) 0.675 (mask) |

Since `getattr` is classified as a dangerous pickle function, any segmentation model that uses it is classified as unsafe. All models were created and saved using the official ultralytics library, so it's okay to use files downloaded from a trusted source. See also: https://huggingface.co/docs/hub/security-pickle
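A minimal detection sketch for one of the detectors above, assuming `ultralytics` and `huggingface_hub` are installed; `"image.jpg"` is a placeholder path:

```python
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

# Download the face detector weights from the adetailer repository.
weights = hf_hub_download("Bingsu/adetailer", "face_yolov8n.pt")
model = YOLO(weights)

results = model("image.jpg")  # run face detection
for box in results[0].boxes:
    print(box.xyxy, float(box.conf))  # bounding-box coordinates and confidence
```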
gpt2
by openai-community
Test the whole generation capabilities here: https://transformer.huggingface.co/doc/gpt2-large

Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in this paper and first released at this page.

Disclaimer: The team releasing GPT-2 also wrote a model card for their model. Content from this model card has been written by the Hugging Face team to complete the information they provided and give specific examples of bias.

GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences: inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model internally uses a masking mechanism to make sure the predictions for the token `i` only use the inputs from `1` to `i` and not the future tokens. This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is nevertheless best at what it was pretrained for, which is generating texts from a prompt. This is the smallest version of GPT-2, with 124M parameters.

You can use the raw model for text generation or fine-tune it to a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. You can use this model directly with a pipeline for text generation; since generation relies on some randomness, a seed is set for reproducibility. The sketch at the end of this entry also shows how to get the features of a given text in PyTorch.

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

> Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don't support use-cases that require the generated text to be true.
>
> Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

Here's an example of how the model can have biased predictions (see the bias probe in the sketch at the end of this entry). This bias will also affect all fine-tuned versions of this model.

The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.
The texts are tokenized using a byte-level version of Byte Pair Encoding (BPE) (for unicode characters) and a vocabulary size of 50,257. The inputs are sequences of 1024 consecutive tokens. The larger model was trained on 256 cloud TPU v3 cores. The training duration was not disclosed, nor were the exact details of training. The model achieves the following results without any fine-tuning (zero-shot):

| Dataset | LAMBADA (PPL) | LAMBADA (ACC) | CBT-CN (ACC) | CBT-NE (ACC) | WikiText2 (PPL) | PTB (PPL) | enwiki8 (BPB) | text8 (BPC) | WikiText103 (PPL) | 1BW (PPL) |
|:--------:|:-------:|:-------:|:------:|:------:|:---------:|:------:|:-------:|:------:|:-----------:|:-----:|
| | 35.13 | 45.99 | 87.65 | 83.4 | 29.41 | 65.85 | 1.16 | 1.17 | 37.50 | 75.20 |
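A minimal sketch of the pipeline generation, bias probe, and feature extraction mentioned above (assuming `transformers` is installed; prompts are illustrative):

```python
from transformers import pipeline, set_seed, GPT2Tokenizer, GPT2Model

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # generation is stochastic, so fix a seed for reproducibility
print(generator("Hello, I'm a language model,", max_length=30, num_return_sequences=2))

# Bias probe: compare completions for prompts that differ only in a demographic term.
print(generator("The White man worked as a", max_length=10, num_return_sequences=3))
print(generator("The Black man worked as a", max_length=10, num_return_sequences=3))

# Extracting features (hidden states) for a given text in PyTorch.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
inputs = tokenizer("Replace me by any text you'd like.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```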
nomic-embed-text-v1.5
by nomic-ai
nomic-embed-text-v1.5: Resizable Production Embeddings with Matryoshka Representation Learning
whisperkit-coreml
by argmaxinc
Repo WhisperKit is part of Argmax OSS, an On-device Speech AI SDK for Apple Silicon: https://github.com/argmaxinc/argmax-oss-swift
Qwen3-0.6B
by Qwen
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significantly enhanced reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.

Qwen3-0.6B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 0.6B
- Number of Parameters (Non-Embedding): 0.44B
- Number of Layers: 28
- Number of Attention Heads (GQA): 16 for Q and 8 for KV
- Context Length: 32,768

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

> [!TIP]
> If you encounter significant endless repetitions, please refer to the Best Practices section for optimal sampling parameters, and set the `presence_penalty` to 1.5.

The code of Qwen3 has been included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`.
After generation with `transformers`, the output token ids are split into the thinking content and the final answer (151668 is the `</think>` token id):

```python
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```

For deployment, an OpenAI-compatible endpoint can be served with SGLang or vLLM:

```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser qwen3
```

```shell
vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
```

Thinking mode is toggled through the chat template:

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value for enable_thinking
)
```

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Setting enable_thinking=False disables thinking mode
)
```

Thinking can also be switched per turn with `/think` and `/no_think` tags:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
```

For agentic use, Qwen-Agent wraps the tool-calling templates and parsers:

```python
from qwen_agent.agents import Assistant

# Define LLM
llm_cfg = {
    'model': 'Qwen3-0.6B',

    # Use the endpoint provided by Alibaba Model Studio:
    # 'model_type': 'qwen_dashscope',
    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),

    # Use a custom endpoint compatible with OpenAI API:
    'model_server': 'http://localhost:8000/v1',  # api_base
    'api_key': 'EMPTY',

    # Other parameters:
    # 'generate_cfg': {
    #     # Add: when the response content is `<think>this is the thought</think>this is the answer`;
    #     # Do not add: when the response has been separated by reasoning_content and content.
    #     'thought_in_content': True,
    # },
}

# Define Tools
tools = [
    {'mcpServers': {  # You can specify the MCP configuration file
        'time': {
            'command': 'uvx',
            'args': ['mcp-server-time', '--local-timezone=Asia/Shanghai']
        },
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"]
        }
    }},
    'code_interpreter',  # Built-in tools
]

# Define Agent
bot = Assistant(llm=llm_cfg, function_list=tools)

# Streaming generation
messages = [{'role': 'user', 'content': 'https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen'}]
for responses in bot.run(messages=messages):
    pass
print(responses)
```

```bibtex
@misc{qwen3technicalreport,
    title={Qwen3 Technical Report},
    author={Qwen Team},
    year={2025},
    eprint={2505.09388},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.09388},
}
```
GLM-OCR
by zai-org
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforce...
distilbert_finetuned_ai4privacy_v2
by Isotonic
Model card metadata (YAML front matter):

```yaml
license: cc-by-nc-4.0
base_model: distilbert-base-uncased
tags:
- generated_from_trainer
model-index:
- name: distilbert_finetuned_ai4privacy_v2
  results: []
datasets:
- ai4privacy/pii-masking-200k
- Isotonic/pii-masking-200k
pipeline_tag: token-classification
language:
- en
metrics:
- seqeval
```
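A minimal token-classification sketch for PII tagging with this checkpoint, assuming `transformers` is installed; the example sentence is illustrative:

```python
from transformers import pipeline

pii_tagger = pipeline(
    "token-classification",
    model="Isotonic/distilbert_finetuned_ai4privacy_v2",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

text = "Hi, I'm John Smith and my phone number is 555-0123."
for entity in pii_tagger(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```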
moshiko-pytorch-bf16
by kyutai
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. Moshi casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling its own speech and that of the user in separate parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

- Developed by: Kyutai
- Model type: Multimodal speech-text foundation model
- Language(s) (NLP): English
- License: CC-BY

The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks and cannot access tools; it focuses instead on natural, low-latency interactions. Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Moshi architecture, other downstream use cases would require some fine-tuning / domain adaptation.

The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend it for providing advice or performing any professional duty. The model has been trained with a few safeguards to try to limit potential toxic usage; however, our toxicity analysis shows that it behaves in the middle of existing models with respect to textual generation. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far, and it is trained to produce only one voice to avoid impersonation. Yet, we need the perspective of time to establish its sociotechnical limitations.

Training data:
- Textual data: the underlying Helium model is trained on a mix of data, more precisely:
  - 12.5% is high-quality data from the following curated sources: Wikipedia, Wikibooks, Wikisource, Wikinews, StackExchange, and the collection of scientific articles pes2o. For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
  - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
- Unsupervised audio dataset: used for pre-training, this is a collection of 7 million hours of readily available audio content, which consists mostly of English speech. This training set is transcribed with Whisper (large v3 model).
- The Fisher dataset: used to enable multi-stream. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using AudioSR.
- Supervised multi-stream dataset: a dataset of 170 hours of natural and scripted conversations between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system used to create synthetic data.
- Synthetic data: 20,000 hours of synthetic data generated by our TTS system, simulating a dialogue between Moshi and a user.

The different stages of the training procedure are detailed in the paper along with the hyper-parameters. The training was performed on 127 DGX nodes provided by Scaleway, accounting for 1016 H100 Nvidia GPUs.

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
wav2vec2-large-xlsr-53-portuguese
by jonatasgrosman
Fine-tuned XLSR-53 large model for speech recognition in Portuguese. Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz. This model has been fine-tuned thanks to the GPU credits generously given by OVHcloud :) The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

The model can be used directly (without a language model); a transcription sketch follows this entry. Example transcriptions:

| Reference | Prediction |
| ------------- | ------------- |
| NEM O RADAR NEM OS OUTROS INSTRUMENTOS DETECTARAM O BOMBARDEIRO STEALTH. | NEMHUM VADAN OS OLTWES INSTRUMENTOS DE TTÉÃN UM BOMBERDEIRO OSTER |
| PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA | E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA |
| OITO | OITO |
| TRANCÁ-LOS | TRANCAUVOS |
| REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA | REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA |
| O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS. | YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS |
| MENINA E MENINO BEIJANDO NAS SOMBRAS | MENINA E MENINO BEIJANDO NAS SOMBRAS |
| EU SOU O SENHOR | EU SOU O SENHOR |
| DUAS MULHERES QUE SENTAM-SE PARA BAIXO LENDO JORNAIS. | DUAS MIERES QUE SENTAM-SE PARA BAICLANE JODNÓI |
| EU ORIGINALMENTE ESPERAVA | EU ORIGINALMENTE ESPERAVA |

Evaluation:
1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`
2. To evaluate on `speech-recognition-community-v2/dev_data`

Citation: If you want to cite this model, you can use the BibTeX entry from the original model card.
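A minimal transcription sketch using the `transformers` ASR pipeline (an assumption — the original card ships its own example script); `"audio.wav"` is a placeholder 16kHz recording:

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/wav2vec2-large-xlsr-53-portuguese",
)

# Input audio must be sampled at 16kHz.
print(asr("audio.wav")["text"])
```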
dolphin-2.9.1-yi-1.5-34b
by dphn
Curated and trained by Eric Hartford, Lucas Atkins, Fernando Fernandes, and Cognitive Computations.

This is our most spectacular outcome ever. FFT, all parameters, 16bit. 77.4 MMLU on 34b. And it talks like a dream. Although the max positional embeddings is 4k, we used rope theta of 1000000.0 and we trained with sequence length 8k. We plan to train on the upcoming 32k version as well.

Website: https://dphn.ai
Twitter: https://x.com/dphnAI
Web Chat: https://chat.dphn.ai
Telegram bot: https://t.me/DolphinAIbot

Our appreciation for the sponsors of Dolphin 2.9.1:
- Crusoe Cloud - provided excellent on-demand 8xH100 node
- OnDemand - provided inference sponsorship

This model is based on Yi-1.5-34b and is governed by the apache 2.0 license. The base model has 4k context, but we used rope theta of 1000000.0 and the full-weight fine-tuning was with 8k sequence length. Dolphin-2.9.1 has a variety of instruction, conversational, and coding skills. It also has initial agentic abilities and supports function calling.

Dolphin is uncensored. We have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models: https://erichartford.com/uncensored-models. You are responsible for any content you create using this model. Enjoy responsibly.

Dolphin is licensed according to the apache 2.0 license. We grant permission for any use, including commercial. Dolphin was trained on data generated from GPT4, among other models.

This model is a fine-tuned version of 01-ai/Yi-1.5-34B on the None dataset. It achieves the following results on the evaluation set:
- Loss: 0.4425

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.6265 | 0.0 | 1 | 0.6035 |
| 0.4674 | 0.25 | 327 | 0.4344 |
| 0.4337 | 0.5 | 654 | 0.4250 |
| 0.4346 | 0.75 | 981 | 0.4179 |
| 0.3985 | 1.0 | 1308 | 0.4118 |
| 0.3128 | 1.23 | 1635 | 0.4201 |
| 0.3261 | 1.48 | 1962 | 0.4157 |
| 0.3259 | 1.73 | 2289 | 0.4122 |
| 0.3126 | 1.98 | 2616 | 0.4079 |
| 0.2265 | 2.21 | 2943 | 0.4441 |
| 0.2297 | 2.46 | 3270 | 0.4427 |
| 0.2424 | 2.71 | 3597 | 0.4425 |

Framework versions:
- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0
Kokoro-82M
by hexgrad
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

> [!NOTE]
> As of April 2025, the market rate of Kokoro served over API is under $1 per million characters of text input, or under $0.06 per hour of audio output. (On average, 1000 characters of input is about 1 minute of output.) Sources: ArtificialAnalysis/Replicate at 65 cents per M chars and DeepInfra at 80 cents per M chars.
>
> This is an Apache-licensed model, and Kokoro has been deployed in numerous projects and commercial APIs. We welcome the deployment of the model in real use cases.

> [!CAUTION]
> Fake websites like kokorottsaicom (snapshot: https://archive.ph/nRRnk) and kokorottsnet (snapshot: https://archive.ph/60opa) are likely scams masquerading under the banner of a popular model.
>
> Any website containing "kokoro" in its root domain (e.g. kokorottsaicom, kokorottsnet) is NOT owned by and NOT affiliated with this model page or its author, and attempts to imply otherwise are red flags.

Contents: Releases, Usage, EVAL.md ↗️, SAMPLES.md ↗️, VOICES.md ↗️, Model Facts, Training Details, Creative Commons Attribution, Acknowledgements

| Model | Published | Training Data | Langs & Voices | SHA256 |
| ----- | --------- | ------------- | -------------- | ------ |
| v1.0 | 2025 Jan 27 | Few hundred hrs | 8 & 54 | `496dba11` |
| v0.19 | 2024 Dec 25 | … | … | … |

Usage:

```
!pip install -q kokoro>=0.9.2 soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
```

```python
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

pipeline = KPipeline(lang_code='a')
text = '''
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i == 0))
    sf.write(f'{i}.wav', audio, 24000)
```

Under the hood, `kokoro` uses `misaki`, a G2P library at https://github.com/hexgrad/misaki

Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Model SHA256 Hash: `496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4`

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio [1] generated by closed [2] TTS models from large providers

[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"

Total Training Cost: About $1000 for 1000 hours of A100 80GB vRAM

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| ---------- | ------------- | ------- | --------------------------- |
| Koniwa `tnc` |
chronos-2
by amazon
Update Dec 30, 2025: ☁️ Deploy Chronos-2 on Amazon SageMaker. New guide covers real-time GPU and CPU inference, serverless endpoints (run on demand, no idle costs), and batch transform for large-sc...
whisper-large-v3
by openai
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:
1. The spectrogram input uses 128 Mel frequency bins instead of 80
2. A new language token for Cantonese

The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset. The large-v3 model shows improved performance over a wide variety of languages, showing 10% to 20% reduction of errors compared to Whisper large-v2. For more details on the different checkpoints available, refer to the section Model details.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

Whisper large-v3 is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time. The model can be used with the `pipeline` class to transcribe audios of arbitrary length (a transcription sketch follows below). To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline. Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter. Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens.

Whisper predicts the language of the source audio automatically. If the source audio language is known a-priori, it can be passed as an argument to the pipeline. By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to `"translate"`. Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument. The above arguments can be used in isolation or in combination: for example, to transcribe French source audio and return sentence-level timestamps, both can be passed together. For more control over the generation parameters, use the model + processor API directly.

You can apply additional speed and memory improvements to Whisper to further reduce the inference speed and VRAM requirements. Whisper has a receptive field of 30 seconds. To transcribe audios longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument `batch_size`.

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups. Note: `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️. We recommend using Flash Attention 2 if your GPU supports it and you are not using torch.compile. To do so, first install Flash Attention, then pass `attn_implementation="flash_attention_2"` to `from_pretrained`. If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater; if your version is older, upgrade PyTorch according to the official instructions. Once a valid PyTorch version is installed, SDPA is activated by default and can also be set explicitly by specifying `attn_implementation="sdpa"`. For more information about how to use SDPA, refer to the Transformers SDPA documentation.

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions to a different language to the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |

The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data.
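A minimal sketch of the `pipeline` transcription usage described above, assuming `transformers` and `torch` are installed; `"audio.mp3"` is a placeholder path:

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=device,
)

# Transcribe a local file; language and task can be forced via generate_kwargs.
result = asr(
    "audio.mp3",
    return_timestamps=True,
    generate_kwargs={"language": "french", "task": "transcribe"},
)
print(result["text"])
```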
The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. We recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them. In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them for classification is not only unevaluated but also inappropriate, particularly for inferring human attributes.

The large-v3 checkpoint is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. As discussed in the accompanying paper, we see that performance on transcription in a given language is directly correlated with the amount of training data we employ in that language. Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, technical language, as well as zero shot translation from multiple languages into English; and that accuracy on speech recognition and translation is near the state-of-the-art level.

However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself. Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release. In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis of these limitations is provided in the paper. It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.

We anticipate that Whisper models' transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications.

There are also potential dual use concerns that come with releasing Whisper. While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.
whisper-large-v3-turbo
by openai
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.

Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation. You can find more details about it in this GitHub discussion.

Disclaimer: Content for this model card has partly been written by the 🤗 Hugging Face team, and partly copied and pasted from the original model card.

Whisper large-v3-turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time. The model can be used with the `pipeline` class to transcribe audios of arbitrary length. To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline. Multiple audio files can be transcribed in parallel by specifying them as a list and setting the `batch_size` parameter. Transformers is compatible with all Whisper decoding strategies, such as temperature fallback and condition on previous tokens.

Whisper predicts the language of the source audio automatically. If the source audio language is known a-priori, it can be passed as an argument to the pipeline. By default, Whisper performs the task of speech transcription, where the source audio language is the same as the target text language. To perform speech translation, where the target text is in English, set the task to `"translate"`. Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument. The above arguments can be used in isolation or in combination: for example, to transcribe French source audio and return sentence-level timestamps, both can be passed together. For more control over the generation parameters, use the model + processor API directly.

You can apply additional speed and memory improvements to Whisper to further reduce the inference speed and VRAM requirements. Whisper has a receptive field of 30 seconds. To transcribe audios longer than this, one of two long-form algorithms is required:
1. Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
2. Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and speed is less of a consideration
2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

Conversely, the chunked algorithm should be used when:
1. Transcription speed is the most important factor
2. You are transcribing a single long audio file

By default, Transformers uses the sequential algorithm. To enable the chunked algorithm, pass the `chunk_length_s` parameter to the `pipeline`. For large-v3, a chunk length of 30 seconds is optimal. To activate batching over long audio files, pass the argument `batch_size` (see the sketch below).

The Whisper forward pass is compatible with `torch.compile` for 4.5x speed-ups. Note: `torch.compile` is currently not compatible with the chunked long-form algorithm or Flash Attention 2 ⚠️. We recommend using Flash Attention 2 if your GPU supports it and you are not using torch.compile. To do so, first install Flash Attention, then pass `attn_implementation="flash_attention_2"` to `from_pretrained`. If your GPU does not support Flash Attention, we recommend making use of PyTorch scaled dot-product attention (SDPA). This attention implementation is activated by default for PyTorch versions 2.1.1 or greater; if your version is older, upgrade PyTorch according to the official instructions. Once a valid PyTorch version is installed, SDPA is activated by default and can also be set explicitly by specifying `attn_implementation="sdpa"`. For more information about how to use SDPA, refer to the Transformers SDPA documentation.

Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions to a different language to the audio.

Whisper checkpoints come in five configurations of varying model sizes. The smallest four are available as English-only and multilingual. The largest checkpoints are multilingual only. All ten of the pre-trained checkpoints are available on the Hugging Face Hub. The checkpoints are summarised in the following table with links to the models on the Hub:

| Size | Parameters | English-only | Multilingual |
|----------|------------|--------------|--------------|
| tiny | 39 M | ✓ | ✓ |
| base | 74 M | ✓ | ✓ |
| small | 244 M | ✓ | ✓ |
| medium | 769 M | ✓ | ✓ |
| large | 1550 M | x | ✓ |
| large-v2 | 1550 M | x | ✓ |
| large-v3 | 1550 M | x | ✓ |
| large-v3-turbo | 809 M | x | ✓ |

The pre-trained Whisper model demonstrates a strong ability to generalise to different datasets and domains. However, its predictive capabilities can be improved further for certain languages and tasks through fine-tuning. The blog post Fine-Tune Whisper with 🤗 Transformers provides a step-by-step guide to fine-tuning the Whisper model with as little as 5 hours of labelled data. The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition.
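A minimal sketch of the chunked long-form configuration described above, assuming `transformers` and `torch` are installed; `"long_audio.mp3"` is a placeholder path:

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device="cuda:0" if torch.cuda.is_available() else "cpu",
    model_kwargs={"attn_implementation": "sdpa"},  # explicit SDPA, as described above
)

# chunk_length_s enables the chunked algorithm; batch_size batches the chunks.
result = asr("long_audio.mp3", chunk_length_s=30, batch_size=8, return_timestamps=True)
print(result["text"])
```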
We recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.

The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. We strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them. In particular, we caution against using Whisper models to transcribe recordings of individuals taken without their consent or purporting to use these models for any kind of subjective classification. We recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them for classification is not only unevaluated but also inappropriate, particularly for inferring human attributes.

Our studies show that, over many existing ASR systems, the models exhibit improved robustness to accents, background noise, technical language, as well as zero shot translation from multiple languages into English; and that accuracy on speech recognition and translation is near the state-of-the-art level. However, because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself. Our models perform unevenly across languages, and we observe lower accuracy on low-resource and/or low-discoverability languages or languages where we have less training data. The models also exhibit disparate performance on different accents and dialects of particular languages, which may include higher word error rate across speakers of different genders, races, ages, or other demographic criteria. Our full evaluation results are presented in the paper accompanying this release. In addition, the sequence-to-sequence architecture of the model makes it prone to generating repetitive texts, which can be mitigated to some degree by beam search and temperature scheduling but not perfectly. Further analysis of these limitations is provided in the paper. It is likely that this behavior and hallucinations may be worse on lower-resource and/or lower-discoverability languages.

We anticipate that Whisper models' transcription capabilities may be used for improving accessibility tools. While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation. The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications. There are also potential dual use concerns that come with releasing Whisper.
While we hope the technology will be used primarily for beneficial purposes, making ASR technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication. Moreover, these models may have some capabilities to recognize specific individuals out of the box, which in turn presents safety concerns related both to dual use and disparate performance. In practice, we expect that the cost of transcription is not the limiting factor of scaling up surveillance projects.
Qwen3.5-35B-A3B
by Qwen
> [!Note] > This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. > > These artifacts are compatible with Hugging Face T...
resnet50.a1_in1k
by timm
This model features: ReLU activations single layer 7x7 convolution with pooling 1x1 convolution shortcut downsample Trained on ImageNet-1k in `timm` using recipe template described below. Recipe details: ResNet Strikes Back `A1` recipe LAMB optimizer with BCE loss Cosine LR schedule with warmup Model Details - Model Type: Image classification / feature backbone - Model Stats: - Params (M): 25.6 - GMACs: 4.1 - Activations (M): 11.1 - Image size: train = 224 x 224, test = 288 x 288 - Papers: - ResNet strikes back: An improved training procedure in timm: https://arxiv.org/abs/2110.00476 - Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385 - Original: https://github.com/huggingface/pytorch-image-models Model Comparison Explore the dataset and runtime metrics of this model in timm model results. |model |imgsize|top1 |top5 |paramcount|gmacs|macts|img/sec| |------------------------------------------|--------|-----|-----|-----------|-----|-----|-------| |seresnextaa101d32x8d.swin12kftin1k288|320 |86.72|98.17|93.6 |35.2 |69.7 |451 | |seresnextaa101d32x8d.swin12kftin1k288|288 |86.51|98.08|93.6 |28.5 |56.4 |560 | |seresnextaa101d32x8d.swin12kftin1k|288 |86.49|98.03|93.6 |28.5 |56.4 |557 | |seresnextaa101d32x8d.swin12kftin1k|224 |85.96|97.82|93.6 |17.2 |34.2 |923 | |resnext10132x32d.fbwslig1bftin1k|224 |85.11|97.44|468.5 |87.3 |91.1 |254 | |resnetrs420.tfin1k|416 |85.0 |97.12|191.9 |108.4|213.8|134 | |ecaresnet269d.ra2in1k|352 |84.96|97.22|102.1 |50.2 |101.2|291 | |ecaresnet269d.ra2in1k|320 |84.73|97.18|102.1 |41.5 |83.7 |353 | |resnetrs350.tfin1k|384 |84.71|96.99|164.0 |77.6 |154.7|183 | |seresnextaa101d32x8d.ahin1k|288 |84.57|97.08|93.6 |28.5 |56.4 |557 | |resnetrs200.tfin1k|320 |84.45|97.08|93.2 |31.5 |67.8 |446 | |resnetrs270.tfin1k|352 |84.43|96.97|129.9 |51.1 |105.5|280 | |seresnext101d32x8d.ahin1k|288 |84.36|96.92|93.6 |27.6 |53.0 |595 | |seresnet152d.ra2in1k|320 |84.35|97.04|66.8 |24.1 |47.7 |610 | |resnetrs350.tfin1k|288 |84.3 |96.94|164.0 |43.7 |87.1 |333 | |resnext10132x8d.fbswslig1bftin1k|224 |84.28|97.17|88.8 |16.5 |31.2 |1100 | |resnetrs420.tfin1k|320 |84.24|96.86|191.9 |64.2 |126.6|228 | |seresnext10132x8d.ahin1k|288 |84.19|96.87|93.6 |27.2 |51.6 |613 | |resnext10132x16d.fbwslig1bftin1k|224 |84.18|97.19|194.0 |36.3 |51.2 |581 | |resnetaa101d.swin12kftin1k|288 |84.11|97.11|44.6 |15.1 |29.0 |1144 | |resnet200d.ra2in1k|320 |83.97|96.82|64.7 |31.2 |67.3 |518 | |resnetrs200.tfin1k|256 |83.87|96.75|93.2 |20.2 |43.4 |692 | |seresnextaa101d32x8d.ahin1k|224 |83.86|96.65|93.6 |17.2 |34.2 |923 | |resnetrs152.tfin1k|320 |83.72|96.61|86.6 |24.3 |48.1 |617 | |seresnet152d.ra2in1k|256 |83.69|96.78|66.8 |15.4 |30.6 |943 | |seresnext101d32x8d.ahin1k|224 |83.68|96.61|93.6 |16.7 |32.0 |986 | |resnet152d.ra2in1k|320 |83.67|96.74|60.2 |24.1 |47.7 |706 | |resnetrs270.tfin1k|256 |83.59|96.61|129.9 |27.1 |55.8 |526 | |seresnext10132x8d.ahin1k|224 |83.58|96.4 |93.6 |16.5 |31.2 |1013 | |resnetaa101d.swin12kftin1k|224 |83.54|96.83|44.6 |9.1 |17.6 |1864 | |resnet152.a1hin1k|288 |83.46|96.54|60.2 |19.1 |37.3 |904 | |resnext10132x16d.fbswslig1bftin1k|224 |83.35|96.85|194.0 |36.3 |51.2 |582 | |resnet200d.ra2in1k|256 |83.23|96.53|64.7 |20.0 |43.1 |809 | |resnext10132x4d.fbswslig1bftin1k|224 |83.22|96.75|44.2 |8.0 |21.2 |1814 | |resnext10164x4d.c1in1k|288 |83.16|96.38|83.5 |25.7 |51.6 |590 | |resnet152d.ra2in1k|256 |83.14|96.38|60.2 |15.4 |30.5 |1096 | |resnet101d.ra2in1k|320 |83.02|96.45|44.6 |16.5 |34.8 |992 | |ecaresnet101d.miilin1k|288 |82.98|96.54|44.6 |13.4 |28.2 |1077 | 
|resnext10164x4d.tvin1k|224 |82.98|96.25|83.5 |15.5 |31.2 |989 |
|resnetrs152.tfin1k|256 |82.86|96.28|86.6 |15.6 |30.8 |951 |
|resnext10132x8d.tv2in1k|224 |82.83|96.22|88.8 |16.5 |31.2 |1099 |
|resnet152.a1hin1k|224 |82.8 |96.13|60.2 |11.6 |22.6 |1486 |
|resnet101.a1hin1k|288 |82.8 |96.32|44.6 |13.0 |26.8 |1291 |
|resnet152.a1in1k|288 |82.74|95.71|60.2 |19.1 |37.3 |905 |
|resnext10132x8d.fbwslig1bftin1k|224 |82.69|96.63|88.8 |16.5 |31.2 |1100 |
|resnet152.a2in1k|288 |82.62|95.75|60.2 |19.1 |37.3 |904 |
|resnetaa50d.swin12kftin1k|288 |82.61|96.49|25.6 |8.9 |20.6 |1729 |
|resnet61q.ra2in1k|288 |82.53|96.13|36.8 |9.9 |21.5 |1773 |
|wideresnet1012.tv2in1k|224 |82.5 |96.02|126.9 |22.8 |21.2 |1078 |
|resnext10164x4d.c1in1k|224 |82.46|95.92|83.5 |15.5 |31.2 |987 |
|resnet51q.ra2in1k|288 |82.36|96.18|35.7 |8.1 |20.9 |1964 |
|ecaresnet50t.ra2in1k|320 |82.35|96.14|25.6 |8.8 |24.1 |1386 |
|resnet101.a1in1k|288 |82.31|95.63|44.6 |13.0 |26.8 |1291 |
|resnetrs101.tfin1k|288 |82.29|96.01|63.6 |13.6 |28.5 |1078 |
|resnet152.tv2in1k|224 |82.29|96.0 |60.2 |11.6 |22.6 |1484 |
|wideresnet502.racmin1k|288 |82.27|96.06|68.9 |18.9 |23.8 |1176 |
|resnet101d.ra2in1k|256 |82.26|96.07|44.6 |10.6 |22.2 |1542 |
|resnet101.a2in1k|288 |82.24|95.73|44.6 |13.0 |26.8 |1290 |
|seresnext5032x4d.racmin1k|288 |82.2 |96.14|27.6 |7.0 |23.8 |1547 |
|ecaresnet101d.miilin1k|224 |82.18|96.05|44.6 |8.1 |17.1 |1771 |
|resnext5032x4d.fbswslig1bftin1k|224 |82.17|96.22|25.0 |4.3 |14.4 |2943 |
|ecaresnet50t.a1in1k|288 |82.12|95.65|25.6 |7.1 |19.6 |1704 |
|resnext5032x4d.a1hin1k|288 |82.03|95.94|25.0 |7.0 |23.8 |1745 |
|ecaresnet101dpruned.miilin1k|288 |82.0 |96.15|24.9 |5.8 |12.7 |1787 |
|resnet61q.ra2in1k|256 |81.99|95.85|36.8 |7.8 |17.0 |2230 |
|resnext10132x8d.tv2in1k|176 |81.98|95.72|88.8 |10.3 |19.4 |1768 |
|resnet152.a1in1k|224 |81.97|95.24|60.2 |11.6 |22.6 |1486 |
|resnet101.a1hin1k|224 |81.93|95.75|44.6 |7.8 |16.2 |2122 |
|resnet101.tv2in1k|224 |81.9 |95.77|44.6 |7.8 |16.2 |2118 |
|resnext10132x16d.fbsslyfcc100mftin1k|224 |81.84|96.1 |194.0 |36.3 |51.2 |583 |
|resnet51q.ra2in1k|256 |81.78|95.94|35.7 |6.4 |16.6 |2471 |
|resnet152.a2in1k|224 |81.77|95.22|60.2 |11.6 |22.6 |1485 |
|resnetaa50d.swin12kftin1k|224 |81.74|96.06|25.6 |5.4 |12.4 |2813 |
|ecaresnet50t.a2in1k|288 |81.65|95.54|25.6 |7.1 |19.6 |1703 |
|ecaresnet50d.miilin1k|288 |81.64|95.88|25.6 |7.2 |19.7 |1694 |
|resnext10132x8d.fbsslyfcc100mftin1k|224 |81.62|96.04|88.8 |16.5 |31.2 |1101 |
|wideresnet502.tv2in1k|224 |81.61|95.76|68.9 |11.4 |14.4 |1930 |
|resnetaa50.a1hin1k|288 |81.61|95.83|25.6 |8.5 |19.2 |1868 |
|resnet101.a1in1k|224 |81.5 |95.16|44.6 |7.8 |16.2 |2125 |
|resnext5032x4d.a1in1k|288 |81.48|95.16|25.0 |7.0 |23.8 |1745 |
|gcresnet50t.ra2in1k|288 |81.47|95.71|25.9 |6.9 |18.6 |2071 |
|wideresnet502.racmin1k|224 |81.45|95.53|68.9 |11.4 |14.4 |1929 |
|resnet50d.a1in1k|288 |81.44|95.22|25.6 |7.2 |19.7 |1908 |
|ecaresnet50t.ra2in1k|256 |81.44|95.67|25.6 |5.6 |15.4 |2168 |
|ecaresnetlight.miilin1k|288 |81.4 |95.82|30.2 |6.8 |13.9 |2132 |
|resnet50d.ra2in1k|288 |81.37|95.74|25.6 |7.2 |19.7 |1910 |
|resnet101.a2in1k|224 |81.32|95.19|44.6 |7.8 |16.2 |2125 |
|seresnet50.ra2in1k|288 |81.3 |95.65|28.1 |6.8 |18.4 |1803 |
|resnext5032x4d.a2in1k|288 |81.3 |95.11|25.0 |7.0 |23.8 |1746 |
|seresnext5032x4d.racmin1k|224 |81.27|95.62|27.6 |4.3 |14.4 |2591 |
|ecaresnet50t.a1in1k|224 |81.26|95.16|25.6 |4.3 |11.8 |2823 |
|gcresnext50ts.chin1k|288 |81.23|95.54|15.7 |4.8 |19.6 |2117 |
|senet154.gluonin1k|224 |81.23|95.35|115.1 |20.8 |38.7 |545 |
|resnet50.a1in1k|288 |81.22|95.11|25.6 |6.8 |18.4 |2089 |
|resnet50gn.a1hin1k|288 |81.22|95.63|25.6 |6.8 |18.4 |676 |
|resnet50d.a2in1k|288 |81.18|95.09|25.6 |7.2 |19.7 |1908 |
|resnet50.fbswslig1bftin1k|224 |81.18|95.98|25.6 |4.1 |11.1 |3455 |
|resnext5032x4d.tv2in1k|224 |81.17|95.34|25.0 |4.3 |14.4 |2933 |
|resnext5032x4d.a1hin1k|224 |81.1 |95.33|25.0 |4.3 |14.4 |2934 |
|seresnet50.a2in1k|288 |81.1 |95.23|28.1 |6.8 |18.4 |1801 |
|seresnet50.a1in1k|288 |81.1 |95.12|28.1 |6.8 |18.4 |1799 |
|resnet152s.gluonin1k|224 |81.02|95.41|60.3 |12.9 |25.0 |1347 |
|resnet50.din1k|288 |80.97|95.44|25.6 |6.8 |18.4 |2085 |
|gcresnet50t.ra2in1k|256 |80.94|95.45|25.9 |5.4 |14.7 |2571 |
|resnext10132x4d.fbsslyfcc100mftin1k|224 |80.93|95.73|44.2 |8.0 |21.2 |1814 |
|resnet50.c1in1k|288 |80.91|95.55|25.6 |6.8 |18.4 |2084 |
|seresnext10132x4d.gluonin1k|224 |80.9 |95.31|49.0 |8.0 |21.3 |1585 |
|seresnext10164x4d.gluonin1k|224 |80.9 |95.3 |88.2 |15.5 |31.2 |918 |
|resnet50.c2in1k|288 |80.86|95.52|25.6 |6.8 |18.4 |2085 |
|resnet50.tv2in1k|224 |80.85|95.43|25.6 |4.1 |11.1 |3450 |
|ecaresnet50t.a2in1k|224 |80.84|95.02|25.6 |4.3 |11.8 |2821 |
|ecaresnet101dpruned.miilin1k|224 |80.79|95.62|24.9 |3.5 |7.7 |2961 |
|seresnet33ts.ra2in1k|288 |80.79|95.36|19.8 |6.0 |14.8 |2506 |
|ecaresnet50dpruned.miilin1k|288 |80.79|95.58|19.9 |4.2 |10.6 |2349 |
|resnet50.a2in1k|288 |80.78|94.99|25.6 |6.8 |18.4 |2088 |
|resnet50.b1kin1k|288 |80.71|95.43|25.6 |6.8 |18.4 |2087 |
|resnext5032x4d.rain1k|288 |80.7 |95.39|25.0 |7.0 |23.8 |1749 |
|resnetrs101.tfin1k|192 |80.69|95.24|63.6 |6.0 |12.7 |2270 |
|resnet50d.a1in1k|224 |80.68|94.71|25.6 |4.4 |11.9 |3162 |
|ecaresnet33ts.ra2in1k|288 |80.68|95.36|19.7 |6.0 |14.8 |2637 |
|resnet50.a1hin1k|224 |80.67|95.3 |25.6 |4.1 |11.1 |3452 |
|resnext50d32x4d.btin1k|288 |80.67|95.42|25.0 |7.4 |25.1 |1626 |
|resnetaa50.a1hin1k|224 |80.63|95.21|25.6 |5.2 |11.6 |3034 |
|ecaresnet50d.miilin1k|224 |80.61|95.32|25.6 |4.4 |11.9 |2813 |
|resnext10164x4d.gluonin1k|224 |80.61|94.99|83.5 |15.5 |31.2 |989 |
|gcresnet33ts.ra2in1k|288 |80.6 |95.31|19.9 |6.0 |14.8 |2578 |
|gcresnext50ts.chin1k|256 |80.57|95.17|15.7 |3.8 |15.5 |2710 |
|resnet152.a3in1k|224 |80.56|95.0 |60.2 |11.6 |22.6 |1483 |
|resnet50d.ra2in1k|224 |80.53|95.16|25.6 |4.4 |11.9 |3164 |
|resnext5032x4d.a1in1k|224 |80.53|94.46|25.0 |4.3 |14.4 |2930 |
|wideresnet1012.tv2in1k|176 |80.48|94.98|126.9 |14.3 |13.2 |1719 |
|resnet152d.gluonin1k|224 |80.47|95.2 |60.2 |11.8 |23.4 |1428 |
|resnet50.b2kin1k|288 |80.45|95.32|25.6 |6.8 |18.4 |2086 |
|ecaresnetlight.miilin1k|224 |80.45|95.24|30.2 |4.1 |8.4 |3530 |
|resnext5032x4d.a2in1k|224 |80.45|94.63|25.0 |4.3 |14.4 |2936 |
|wideresnet502.tv2in1k|176 |80.43|95.09|68.9 |7.3 |9.0 |3015 |
|resnet101d.gluonin1k|224 |80.42|95.01|44.6 |8.1 |17.0 |2007 |
|resnet50.a1in1k|224 |80.38|94.6 |25.6 |4.1 |11.1 |3461 |
|seresnet33ts.ra2in1k|256 |80.36|95.1 |19.8 |4.8 |11.7 |3267 |
|resnext10132x4d.gluonin1k|224 |80.34|94.93|44.2 |8.0 |21.2 |1814 |
|resnext5032x4d.fbsslyfcc100mftin1k|224 |80.32|95.4 |25.0 |4.3 |14.4 |2941 |
|resnet101s.gluonin1k|224 |80.28|95.16|44.7 |9.2 |18.6 |1851 |
|seresnet50.ra2in1k|224 |80.26|95.08|28.1 |4.1 |11.1 |2972 |
|resnetblur50.btin1k|288 |80.24|95.24|25.6 |8.5 |19.9 |1523 |
|resnet50d.a2in1k|224 |80.22|94.63|25.6 |4.4 |11.9 |3162 |
|resnet152.tv2in1k|176 |80.2 |94.64|60.2 |7.2 |14.0 |2346 |
|seresnet50.a2in1k|224 |80.08|94.74|28.1 |4.1 |11.1 |2969 |
|ecaresnet33ts.ra2in1k|256 |80.08|94.97|19.7 |4.8 |11.7 |3284 |
|gcresnet33ts.ra2in1k|256 |80.06|94.99|19.9 |4.8 |11.7 |3216 |
|resnet50gn.a1hin1k|224 |80.06|94.95|25.6 |4.1 |11.1 |1109 |
|seresnet50.a1in1k|224 |80.02|94.71|28.1 |4.1 |11.1 |2962 |
|resnet50.ramin1k|288 |79.97|95.05|25.6 |6.8 |18.4 |2086 |
|resnet152c.gluonin1k|224 |79.92|94.84|60.2 |11.8 |23.4 |1455 |
|seresnext5032x4d.gluonin1k|224 |79.91|94.82|27.6 |4.3 |14.4 |2591 |
|resnet50.din1k|224 |79.91|94.67|25.6 |4.1 |11.1 |3456 |
|resnet101.tv2in1k|176 |79.9 |94.6 |44.6 |4.9 |10.1 |3341 |
|resnetrs50.tfin1k|224 |79.89|94.97|35.7 |4.5 |12.1 |2774 |
|resnet50.c2in1k|224 |79.88|94.87|25.6 |4.1 |11.1 |3455 |
|ecaresnet26t.ra2in1k|320 |79.86|95.07|16.0 |5.2 |16.4 |2168 |
|resnet50.a2in1k|224 |79.85|94.56|25.6 |4.1 |11.1 |3460 |
|resnet50.rain1k|288 |79.83|94.97|25.6 |6.8 |18.4 |2087 |
|resnet101.a3in1k|224 |79.82|94.62|44.6 |7.8 |16.2 |2114 |
|resnext5032x4d.rain1k|224 |79.76|94.6 |25.0 |4.3 |14.4 |2943 |
|resnet50.c1in1k|224 |79.74|94.95|25.6 |4.1 |11.1 |3455 |
|ecaresnet50dpruned.miilin1k|224 |79.74|94.87|19.9 |2.5 |6.4 |3929 |
|resnet33ts.ra2in1k|288 |79.71|94.83|19.7 |6.0 |14.8 |2710 |
|resnet152.gluonin1k|224 |79.68|94.74|60.2 |11.6 |22.6 |1486 |
|resnext50d32x4d.btin1k|224 |79.67|94.87|25.0 |4.5 |15.2 |2729 |
|resnet50.btin1k|288 |79.63|94.91|25.6 |6.8 |18.4 |2086 |
|ecaresnet50t.a3in1k|224 |79.56|94.72|25.6 |4.3 |11.8 |2805 |
|resnet101c.gluonin1k|224 |79.53|94.58|44.6 |8.1 |17.0 |2062 |
|resnet50.b1kin1k|224 |79.52|94.61|25.6 |4.1 |11.1 |3459 |
|resnet50.tv2in1k|176 |79.42|94.64|25.6 |2.6 |6.9 |5397 |
|resnet32ts.ra2in1k|288 |79.4 |94.66|18.0 |5.9 |14.6 |2752 |
|resnet50.b2kin1k|224 |79.38|94.57|25.6 |4.1 |11.1 |3459 |
|resnext5032x4d.tv2in1k|176 |79.37|94.3 |25.0 |2.7 |9.0 |4577 |
|resnext5032x4d.gluonin1k|224 |79.36|94.43|25.0 |4.3 |14.4 |2942 |
|resnext10132x8d.tvin1k|224 |79.31|94.52|88.8 |16.5 |31.2 |1100 |
|resnet101.gluonin1k|224 |79.31|94.53|44.6 |7.8 |16.2 |2125 |
|resnetblur50.btin1k|224 |79.31|94.63|25.6 |5.2 |12.0 |2524 |
|resnet50.a1hin1k|176 |79.27|94.49|25.6 |2.6 |6.9 |5404 |
|resnext5032x4d.a3in1k|224 |79.25|94.31|25.0 |4.3 |14.4 |2931 |
|resnet50.fbsslyfcc100mftin1k|224 |79.22|94.84|25.6 |4.1 |11.1 |3451 |
|resnet33ts.ra2in1k|256 |79.21|94.56|19.7 |4.8 |11.7 |3392 |
|resnet50d.gluonin1k|224 |79.07|94.48|25.6 |4.4 |11.9 |3162 |
|resnet50.ramin1k|224 |79.03|94.38|25.6 |4.1 |11.1 |3453 |
|resnet50.amin1k|224 |79.01|94.39|25.6 |4.1 |11.1 |3461 |
|resnet32ts.ra2in1k|256 |79.01|94.37|18.0 |4.6 |11.6 |3440 |
|ecaresnet26t.ra2in1k|256 |78.9 |94.54|16.0 |3.4 |10.5 |3421 |
|resnet152.a3in1k|160 |78.89|94.11|60.2 |5.9 |11.5 |2745 |
|wideresnet1012.tvin1k|224 |78.84|94.28|126.9 |22.8 |21.2 |1079 |
|seresnext26d32x4d.btin1k|288 |78.83|94.24|16.8 |4.5 |16.8 |2251 |
|resnet50.rain1k|224 |78.81|94.32|25.6 |4.1 |11.1 |3454 |
|seresnext26t32x4d.btin1k|288 |78.74|94.33|16.8 |4.5 |16.7 |2264 |
|resnet50s.gluonin1k|224 |78.72|94.23|25.7 |5.5 |13.5 |2796 |
|resnet50d.a3in1k|224 |78.71|94.24|25.6 |4.4 |11.9 |3154 |
|wideresnet502.tvin1k|224 |78.47|94.09|68.9 |11.4 |14.4 |1934 |
|resnet50.btin1k|224 |78.46|94.27|25.6 |4.1 |11.1 |3454 |
|resnet34d.ra2in1k|288 |78.43|94.35|21.8 |6.5 |7.5 |3291 |
|gcresnext26ts.chin1k|288 |78.42|94.04|10.5 |3.1 |13.3 |3226 |
|resnet26t.ra2in1k|320 |78.33|94.13|16.0 |5.2 |16.4 |2391 |
|resnet152.tvin1k|224 |78.32|94.04|60.2 |11.6 |22.6 |1487 |
|seresnext26ts.chin1k|288 |78.28|94.1 |10.4 |3.1 |13.3 |3062 |
|batresnext26ts.chin1k|256 |78.25|94.1 |10.7 |2.5 |12.5 |3393 |
|resnet50.a3in1k|224 |78.06|93.78|25.6 |4.1 |11.1 |3450 |
|resnet50c.gluonin1k|224 |78.0 |93.99|25.6 |4.4 |11.9 |3286 |
|ecaresnext26ts.chin1k|288 |78.0 |93.91|10.3 |3.1 |13.3 |3297 |
|seresnext26t32x4d.btin1k|224 |77.98|93.75|16.8 |2.7 |10.1 |3841 |
|resnet34.a1in1k|288 |77.92|93.77|21.8 |6.1 |6.2 |3609 |
|resnet101.a3in1k|160 |77.88|93.71|44.6 |4.0 |8.3 |3926 |
|resnet26t.ra2in1k|256 |77.87|93.84|16.0 |3.4 |10.5 |3772 |
|seresnext26ts.chin1k|256 |77.86|93.79|10.4 |2.4 |10.5 |4263 |
|resnetrs50.tfin1k|160 |77.82|93.81|35.7 |2.3 |6.2 |5238 |
|gcresnext26ts.chin1k|256 |77.81|93.82|10.5 |2.4 |10.5 |4183 |
|ecaresnet50t.a3in1k|160 |77.79|93.6 |25.6 |2.2 |6.0 |5329 |
|resnext5032x4d.a3in1k|160 |77.73|93.32|25.0 |2.2 |7.4 |5576 |
|resnext5032x4d.tvin1k|224 |77.61|93.7 |25.0 |4.3 |14.4 |2944 |
|seresnext26d32x4d.btin1k|224 |77.59|93.61|16.8 |2.7 |10.2 |3807 |
|resnet50.gluonin1k|224 |77.58|93.72|25.6 |4.1 |11.1 |3455 |
|ecaresnext26ts.chin1k|256 |77.44|93.56|10.3 |2.4 |10.5 |4284 |
|resnet26d.btin1k|288 |77.41|93.63|16.0 |4.3 |13.5 |2907 |
|resnet101.tvin1k|224 |77.38|93.54|44.6 |7.8 |16.2 |2125 |
|resnet50d.a3in1k|160 |77.22|93.27|25.6 |2.2 |6.1 |5982 |
|resnext26ts.ra2in1k|288 |77.17|93.47|10.3 |3.1 |13.3 |3392 |
|resnet34.a2in1k|288 |77.15|93.27|21.8 |6.1 |6.2 |3615 |
|resnet34d.ra2in1k|224 |77.1 |93.37|21.8 |3.9 |4.5 |5436 |
|seresnet50.a3in1k|224 |77.02|93.07|28.1 |4.1 |11.1 |2952 |
|resnext26ts.ra2in1k|256 |76.78|93.13|10.3 |2.4 |10.5 |4410 |
|resnet26d.btin1k|224 |76.7 |93.17|16.0 |2.6 |8.2 |4859 |
|resnet34.btin1k|288 |76.5 |93.35|21.8 |6.1 |6.2 |3617 |
|resnet34.a1in1k|224 |76.42|92.87|21.8 |3.7 |3.7 |5984 |
|resnet26.btin1k|288 |76.35|93.18|16.0 |3.9 |12.2 |3331 |
|resnet50.tvin1k|224 |76.13|92.86|25.6 |4.1 |11.1 |3457 |
|resnet50.a3in1k|160 |75.96|92.5 |25.6 |2.1 |5.7 |6490 |
|resnet34.a2in1k|224 |75.52|92.44|21.8 |3.7 |3.7 |5991 |
|resnet26.btin1k|224 |75.3 |92.58|16.0 |2.4 |7.4 |5583 |
|resnet34.btin1k|224 |75.16|92.18|21.8 |3.7 |3.7 |5994 |
|seresnet50.a3in1k|160 |75.1 |92.08|28.1 |2.1 |5.7 |5513 |
|resnet34.gluonin1k|224 |74.57|91.98|21.8 |3.7 |3.7 |5984 |
|resnet18d.ra2in1k|288 |73.81|91.83|11.7 |3.4 |5.4 |5196 |
|resnet34.tvin1k|224 |73.32|91.42|21.8 |3.7 |3.7 |5979 |
|resnet18.fbswslig1bftin1k|224 |73.28|91.73|11.7 |1.8 |2.5 |10213 |
|resnet18.a1in1k|288 |73.16|91.03|11.7 |3.0 |4.1 |6050 |
|resnet34.a3in1k|224 |72.98|91.11|21.8 |3.7 |3.7 |5967 |
|resnet18.fbsslyfcc100mftin1k|224 |72.6 |91.42|11.7 |1.8 |2.5 |10213 |
|resnet18.a2in1k|288 |72.37|90.59|11.7 |3.0 |4.1 |6051 |
|resnet14t.c3in1k|224 |72.26|90.31|10.1 |1.7 |5.8 |7026 |
|resnet18d.ra2in1k|224 |72.26|90.68|11.7 |2.1 |3.3 |8707 |
|resnet18.a1in1k|224 |71.49|90.07|11.7 |1.8 |2.5 |10187 |
|resnet14t.c3in1k|176 |71.31|89.69|10.1 |1.1 |3.6 |10970 |
|resnet18.gluonin1k|224 |70.84|89.76|11.7 |1.8 |2.5 |10210 |
|resnet18.a2in1k|224 |70.64|89.47|11.7 |1.8 |2.5 |10194 |
|resnet34.a3in1k|160 |70.56|89.52|21.8 |1.9 |1.9 |10737 |
|resnet18.tvin1k|224 |69.76|89.07|11.7 |1.8 |2.5 |10205 |
|resnet10t.c3in1k|224 |68.34|88.03|5.4 |1.1 |2.4 |13079 |
|resnet18.a3in1k|224 |68.25|88.17|11.7 |1.8 |2.5 |10167 |
|resnet10t.c3in1k|176 |66.71|86.96|5.4 |0.7 |1.5 |20327 |
|resnet18.a3in1k|160 |65.66|86.26|11.7 |0.9 |1.3 |18229 |
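Model Usage
Since this checkpoint is distributed through `timm`, image classification can be sketched as below. This is a minimal sketch rather than the card's own snippet: it assumes the standard `timm`/`torch` APIs (`timm.create_model`, `timm.data.resolve_model_data_config`, `timm.data.create_transform`), and `example.jpg` is only a placeholder path for any RGB image.

```python
import timm
import torch
from PIL import Image

# Load the pretrained checkpoint and switch to inference mode.
model = timm.create_model("resnet50.a1_in1k", pretrained=True)
model = model.eval()

# Build the eval transform (resize/crop/normalize) from the model's pretrained config.
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

# "example.jpg" is a placeholder; substitute any RGB image.
img = Image.open("example.jpg").convert("RGB")

with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))  # shape: [1, 1000] ImageNet-1k logits
    top5_prob, top5_idx = torch.topk(logits.softmax(dim=1), k=5)

print(top5_idx[0].tolist())   # top-5 ImageNet-1k class indices
print(top5_prob[0].tolist())  # corresponding probabilities
```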
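The model is also listed as a feature backbone. Under the same assumptions about the `timm` API, `features_only=True` returns the per-stage feature maps and `num_classes=0` yields a pooled embedding instead of logits; the random tensor below stands in for a preprocessed image batch.

```python
import timm
import torch

dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch

# Multi-scale feature maps (one tensor per stage), e.g. for detection/segmentation heads.
backbone = timm.create_model("resnet50.a1_in1k", pretrained=True, features_only=True).eval()
with torch.no_grad():
    feats = backbone(dummy)
print([f.shape for f in feats])  # expected channels roughly 64/256/512/1024/2048 at strides 2..32

# Pooled image embedding with the classifier head removed.
embedder = timm.create_model("resnet50.a1_in1k", pretrained=True, num_classes=0).eval()
with torch.no_grad():
    emb = embedder(dummy)
print(emb.shape)  # expected: torch.Size([1, 2048])
```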