Efficient-Large-Model
gemma-2-2b-it
NVILA-Lite-8B
Model type: NVILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling multi-image understanding. Visual language models (VLMs) have made significant advances in accuracy in recent years, but their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5×, fine-tuning memory usage by 3.4×, pre-filling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×. We will soon make our code and models available to facilitate reproducibility.

Paper or resources for more information: https://github.com/NVLabs/VILA

License
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
  - Terms of Use of the data generated by OpenAI
  - Dataset Licenses for each dataset used during training.

Where to send questions or comments about the model: https://github.com/NVLabs/VILA/issues

Intended use
- Primary intended uses: The primary use of VILA is research on large multimodal models and chatbots.
- Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Input
- Input Type: Image, Video, Text
- Input Format: RGB image; MP4 video; string
- Input Parameters: 2D, 3D

Supported Hardware Microarchitecture Compatibility: Ampere, Jetson, Hopper, Lovelace

Training dataset: See Dataset Preparation for more details.
- Data Collection Method by dataset: Hybrid (Automated, Human)
- Labeling Method by dataset: Hybrid (Automated, Human)

Inference engines: PyTorch, TensorRT-LLM, TinyChat

Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
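The "scale-then-compress" recipe above can be illustrated with a toy sketch. This is our own illustration, not NVILA's actual implementation: after scaling up the input resolution, the grid of visual tokens is spatially pooled so the language model has to process far fewer of them.

```python
# Toy illustration of "scale-then-compress" (NOT NVILA's actual code):
# scale up spatial resolution, then average-pool visual tokens 2x2 to
# cut the token count the LLM must process by 4x.

def compress_tokens(tokens, grid, pool=2):
    """Average-pool a flattened grid of visual token vectors.

    tokens: list of grid*grid vectors (lists of floats), row-major.
    Returns (grid // pool)**2 pooled vectors.
    """
    g = grid // pool
    out = []
    for r in range(g):
        for c in range(g):
            # gather the pool x pool patch of neighboring tokens
            patch = [tokens[(r * pool + i) * grid + (c * pool + j)]
                     for i in range(pool) for j in range(pool)]
            dim = len(patch[0])
            out.append([sum(v[d] for v in patch) / len(patch) for d in range(dim)])
    return out

# a 4x4 grid of 1-D tokens becomes a 2x2 grid after pooling
tokens = [[float(i)] for i in range(16)]
pooled = compress_tokens(tokens, grid=4)
print(len(tokens), "->", len(pooled))  # 16 -> 4
```

Scaling resolution first preserves fine detail; the pooling step then restores a manageable sequence length, which is where the reported latency savings come from.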
Fast_dLLM_v2_1.5B
Fast-dLLM v2 (1.5B) — Efficient Block-Diffusion LLM Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. We present Fast-dLLM v2 — a carefully designed block diffusion language model (dLLM) that efficiently adapts a pretrained AR model (Qwen2.5-1.5B-Instruct) into a diffusion-style decoder for parallel text generation. Our approach introduces a novel decoding recipe incorporating a complementary attention mask and block diffusion mechanism, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations and a sub-block level cache that supports efficient parallel decoding within partially generated blocks. ✨ Key Innovations - Block Diffusion Mechanism + Complementary Attention Mask Enables blockwise bidirectional context modeling without sacrificing AR objectives. - Hierarchical Caching - Block-level cache: Stores historical context representations across blocks. - Sub-block cache: Parallel decoding within partially generated blocks. - Token Shift Mechanism Retains autoregressive characteristics while supporting bidirectional context within blocks. - Parallel Decoding Pipeline Achieves up to 2.5× speedup over standard AR decoding without compromising quality. > 🚀 Fast-dLLM v2 uses only ~1B tokens for fine-tuning — a 500× reduction vs. full-attention diffusion LLMs (Dream: 580B tokens) — while matching or surpassing AR baselines in accuracy. 
🛠 Model Overview
- Type: Block Diffusion Language Model (dLLM)
- Base Model: `Qwen/Qwen2.5-1.5B-Instruct`
- Architecture: Transformer w/ RoPE, SwiGLU, RMSNorm, attention QKV bias, tied embeddings
- Params: 1.54B (non-embedding: 1.31B)
- Layers: 28
- Attention Heads: 12 (Q), 2 (KV, GQA)
- Key Feature: Parallel block-wise decoding + hierarchical caching

📦 Installation
You will need `transformers`, `torch`, and our custom generation function.

▶ Real-time Throughput
Fast-dLLM v2 offers up to 2.54× higher throughput than Qwen2.5-7B-Instruct, without loss in quality.

🏆 Benchmark Results
We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks: HumanEval and MBPP (code), GSM8K and MATH (reasoning), IFEval (instruction following), MMLU and GPQA (knowledge QA).
- 1B group: Fast-dLLM v2 (1.5B) achieves the best average score: 45.0.
- 7B group: Fast-dLLM v2 (7B) achieves the best average score: 60.3, surpassing the LLaDA and Dream models.

If you use Fast-dLLM v2 in your research or products, please cite our paper.

📄 License
Released under Apache 2.0, following the base Qwen2.5 license.

🔗 Resources
- 📄 Paper
- 💻 Code
- 🤗 HuggingFace Model
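The blockwise attention pattern described above — bidirectional within a block, causal across blocks — can be sketched as a mask-construction rule. This is an illustrative sketch under our own naming, not the released Fast-dLLM v2 code:

```python
# Illustrative block-diffusion attention mask (a sketch, not the
# released Fast-dLLM v2 implementation): tokens attend bidirectionally
# to every token inside their own block, and causally to all earlier
# blocks, which is what enables parallel decoding within a block.

def block_diffusion_mask(seq_len, block_size):
    """Return a seq_len x seq_len boolean mask; True = may attend."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        q_block = q // block_size
        for k in range(seq_len):
            k_block = k // block_size
            # earlier blocks: visible (causal); same block: fully visible
            mask[q][k] = k_block <= q_block
    return mask

mask = block_diffusion_mask(seq_len=6, block_size=3)
print(mask[0][2])  # True  -- token 0 sees token 2 (same block, "future")
print(mask[2][3])  # False -- token 2 cannot see the next block
```

Because positions in earlier blocks are only ever attended causally, their representations can be frozen into the block-level cache, while the bidirectional visibility inside the current block is what the sub-block cache exploits for parallel decoding.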
NVILA-8B
paligemma-siglip-so400m-patch14-448
VILA1.5-3b
LongVILA-R1-7B
NVILA-Lite-2B
NVILA-Lite-15B
Fast_dLLM_v2_7B
NVILA-Lite-2B-hf-preview
NVILA-15B
Sana_600M_512px
VILA-7b
Qwen2-VL-7B-Instruct
Llama-3-VILA1.5-8B
NVILA-8B-Video
NVILA-Lite-2B-hf
Sana_1600M_1024px
NVILA-Lite-2B-Verifier
SANA1.5_1.6B_1024px
NVILA-Lite-2B-hf-0626
VILA1.5-13b
NVILA-8B-hf
qwen2-7b-longvila-256f
Sana_600M_1024px_ControlNet_HED
Sana_600M_1024px
VILA1.5-3b-s2
qwen2-7b-longvila-1M
Sana_1600M_1024px_BF16
Sana_1600M_1024px_MultiLing
VILA-2.7b
Sana_1600M_4Kpx_BF16
Sana_Sprint_1.6B_1024px
Llama-3-VILA1.5-8B-Fix-AWQ
Sana_1600M_512px
NVILA-Lite-15B-Video
SANA1.5_4.8B_1024px
VILA1.5-3b-AWQ
SANA Video 2B 480p
SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720×1280.

(1) Linear DiT: Leverages linear attention as the core operation, offering significantly more efficiency than vanilla attention when processing the massive number of tokens required for video generation.

(2) Constant-Memory KV Cache for Block Linear Attention: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV-cache bottleneck and enabling efficient, minute-long video synthesis.

SANA-Video achieves exceptional efficiency and cost savings: its training cost is only 1% of MovieGen's (12 days on 64 H100 GPUs). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReels-V2), SANA-Video maintains competitive performance while being 16× faster in measured latency. SANA-Video is deployable on RTX 5090 GPUs, accelerating the inference speed for a 5-second 720p video from 71s down to 29s (a 2.4× speedup), setting a new standard for low-cost, high-quality video generation. Source code is available at https://github.com/NVlabs/Sana. Refer to: https://github.com/NVlabs/Sana/blob/main/asset/docs/sanavideo.md#1-inference-with-txt-file

- Developed by: NVIDIA, Sana
- Model type: Efficient Video Generation with Block Linear Diffusion Transformer
- Model size: 2B parameters
- Model precision: torch.bfloat16 (BF16)
- Model resolution: This model is developed to generate 480p videos of 81 frames (5s) with multi-scale height and width.
- Model Description: This is a model that can be used to generate and modify videos based on text prompts. It is a Linear Diffusion Transformer that uses an 8x Wan-VAE or a 32x spatial-compressed latent feature encoder (DC-AE-V).
- Resources for more information: Check out our GitHub Repository and the SANA-Video report on arXiv.
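The constant-memory property of linear attention that SANA-Video exploits can be shown in a few lines. This is a toy sketch, not the model's actual kernels: with a feature map applied to queries and keys (identity here, with non-negative keys for simplicity), the running sums S = Σᵢ kᵢᵀvᵢ and z = Σᵢ kᵢ have fixed size no matter how many tokens have been processed, so no per-token KV cache is needed.

```python
# Toy sketch of linear attention's constant-memory state (not
# SANA-Video's actual kernels). Output for query q_t is
# (q_t @ S) / (q_t @ z), where S and z are running sums of FIXED size.

def outer(a, b):
    return [[x * y for y in b] for x in a]

def matvec_left(v, m):  # v (1 x d) times m (d x d_v)
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def linear_attention_stream(keys, values, queries):
    d, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d)]  # d x d_v state, constant size
    z = [0.0] * d                        # normalizer state, constant size
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # fold the new token into the constant-size state
        kv = outer(k, v)
        S = [[S[i][j] + kv[i][j] for j in range(d_v)] for i in range(d)]
        z = [z[i] + k[i] for i in range(d)]
        num = matvec_left(q, S)
        den = sum(q[i] * z[i] for i in range(d))
        outputs.append([x / den for x in num])
    return outputs

ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[2.0], [4.0]]
qs = [[1.0, 0.0], [0.5, 0.5]]
print(linear_attention_stream(ks, vs, qs))  # [[2.0], [3.0]]
```

The block-wise autoregressive design in the card amounts to carrying S and z across blocks: global context is preserved at a memory cost that is constant in video length.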
For research purposes, we recommend our `generative-models` GitHub repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference.

- Repository: https://github.com/NVlabs/Sana
- Guidance: https://github.com/NVlabs/Sana/asset/docs/sanavideo.md

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events, and therefore using the model to generate such content is out of scope for its abilities.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, hands, etc. may in general not be generated properly.
- The autoencoding part of the model is lossy.

Bias: While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.
Sana_1600M_1024px_BF16_ControlNet_HED
Sana_1600M_512px_MultiLing
Llama-3-VILA1.5-8b-AWQ
VILA1.5-40b
Sana_Sprint_1.6B_1024px_teacher
qwen2_5vl-7b-wolfv2-tuned
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the mllm_demo, identity, and alpaca_en_demo datasets.

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 64
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0

Framework versions:
- Transformers 4.56.1
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1
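The effective batch sizes in the hyperparameter list follow directly from the per-device settings; a quick sanity check:

```python
# Sanity-check the effective batch sizes implied by the settings above.
train_batch_size = 1           # per device
eval_batch_size = 8            # per device
num_devices = 8
gradient_accumulation_steps = 2

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no grad accumulation at eval

print(total_train_batch_size)  # 16
print(total_eval_batch_size)   # 64
```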
NVILA-Lite-8B-stage2
NVILA-15B-hf
VILA1.5-13b-AWQ
VILA1.5-7b
Sana_1600M_2Kpx_BF16
VILA-13b
vila-ewm-qwen2-1.5b
Sana_Sprint_0.6B_1024px
VILA-7b-4bit-awq
VILA1.5-40b-AWQ
NVILA-Lite-8B-hf-preview
Sana_Sprint_0.6B_1024px_teacher
VILA1.5-3b-s2-AWQ
Qwen2-OV-72B
Llama-3-VILA1.5-8B-Fix
NVILA-Lite-8B-hf-0626
NVILA-Lite-15B-Video-hf-0626
NVILA-Lite-8B-hf
nvila_15b_video-wolfv2_tuned
Meta-Llama-Guard-2-8B
nvila_15b_video-plm_tuned
VILA-13b-4bit-awq
Qwen2-OV-SI-72B
qwen2-vl-7b-instruct-pretrain
VILA15-3b-hf-preview
nvila_8b_video-wolfv2_tuned
nvila_8b_video-plm_tuned
Meta-Llama-3.1-70B
Meta-Llama-3.1-70B-Instruct
NVILA-Lite-15B-stage2
NVILA-Lite-15B-hf-0626
VILA15_3b
Llama-3-VILA15-8B-hf-preview
VILA15-13b-hf-preview
NVILA-Lite-15B-hf
NVILA-Lite-15B-WolfV2
qwen2-1.5b-longvila-256f
nvila-internal-33b-video-v1
LongLive 1.3B
🎬 LongLive: Real-time Interactive Long Video Generation

[Paper](https://arxiv.org/abs/2509.22622) [Code](https://github.com/NVlabs/LongLive) [Model](https://huggingface.co/Efficient-Large-Model/LongLive-1.3B) [Video](https://www.youtube.com/watch?v=CO1QC7BNvig) [Website](https://nvlabs.github.io/LongLive)

💡 TLDR: Turn interactive prompts into long videos, instantly, as you type!

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen

We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates: a KV-recache mechanism that refreshes cached states with the new prompt for smooth, adherent switches; streaming long tuning to enable long-video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days.
At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports up to 240-second videos on a single H100 GPU. With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.

News
- [x] [2025.9.25] We release the paper, this GitHub repo (LongLive) with all training and inference code, the model weight LongLive-1.3B, and the demo page (Website).

Highlights
1. Long Video Generation: LongLive supports up to 240s video generation with visual consistency.
2. Real-time Inference: LongLive reaches 20.7 FPS on a single H100 GPU, and 24.8 FPS with FP8 quantization with marginal quality loss.
3. Efficient Fine-tuning: LongLive extends a short-clip model to minute-long generation in 32 H100 GPU-days.

LongLive accepts sequential user prompts and generates corresponding videos in real time, enabling user-guided long video generation.

The framework of LongLive: (Left) frame sink + short window attention; (Right) KV-recache. The streaming long tuning pipeline trains on long sequences by reusing the historical KV cache each iteration to generate the next 5s clip, then supervising it with the teacher. The effectiveness of KV-recache: consistent transitions with new-prompt compliance. Interactive 60s videos with 6 prompts. See our demo Website for video examples.

We tested this repo on the following setup:
- NVIDIA GPU with at least 40 GB memory (A100 and H100 are tested).
- Linux operating system.
- 64 GB RAM.
Other hardware setups may also work but have not been tested.

Create a conda environment and install dependencies.

How to contribute
- Make sure to have git installed.
- Create your own fork of the project.
- Clone the repository to your local machine using `git clone` and the URL of this project.
- Read both the `Requirements` and `Installation and Quick Guide` sections.
- Commit and push your changes.
- Make a pull request when finished modifying the project.

Citation
Please consider citing our paper and this framework if they are helpful in your research.

License
- The LongLive-1.3B model weight is under the CC-BY-NC 4.0 license.

Acknowledgement
- Self-Forcing: the codebase and algorithm we built upon. Thanks for their wonderful work.
- Wan: the base model we built upon. Thanks for their wonderful work.
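The short-window-attention-plus-frame-sink design described above can be sketched as a KV-selection rule. This is a toy illustration under our own naming, not LongLive's actual code: each new frame attends to a few fixed "sink" frames at the start of the video plus a short window of recent frames, so per-frame attention cost stays constant regardless of video length.

```python
# Toy sketch of short window attention with a frame-level attention
# sink (illustrative only, not LongLive's implementation): frame t
# attends to the first `sink` frames plus the last `window` frames.

def attended_frames(t, sink=1, window=3):
    """Return the sorted frame indices frame t may attend to (0-based)."""
    recent = range(max(0, t - window + 1), t + 1)  # short local window, incl. t
    sinks = range(min(sink, t + 1))                # global anchor frames
    return sorted(set(sinks) | set(recent))

# Early frames still see everything so far; late frames see only
# sink + window, keeping attention cost flat over minute-long videos.
print(attended_frames(2))   # [0, 1, 2]
print(attended_frames(10))  # [0, 8, 9, 10]
```

The sink frames act as a global anchor that preserves long-range consistency, while the short window is what makes real-time FPS achievable on long sequences.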
Sana_Sprint_1.6B_1024px_diffusers
Sana_1600M_1024px_diffusers
SANA1.5_4.8B_1024px_diffusers
Sana_Sprint_0.6B_1024px_diffusers
Sana_600M_512px_diffusers
Sana_1600M_4Kpx_BF16_diffusers
SANA1.5_1.6B_1024px_diffusers
Sana_1600M_1024px_BF16_diffusers
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Source code is available at https://github.com/NVlabs/Sana.

Note
- Weakness in Complex Scene Creation: Due to limitations of data, our model has limited capabilities in generating complex scenes, text, and human hands.
- Enhancing Capabilities: The model's performance can be improved by increasing the complexity and length of prompts. Below are some examples of prompts and samples.

- Developed by: NVIDIA, Sana
- Model type: Linear-Diffusion-Transformer-based text-to-image generative model
- Model size: 1648M parameters
- Model resolution: This model is developed to generate 1024px-based images with multi-scale height and width.
- License: NSCL v2-custom. Governing Terms: NVIDIA License. Additional Information: Gemma Terms of Use | Google AI for Developers for Gemma-2-2B-IT, Gemma Prohibited Use Policy | Google AI for Developers.
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Linear Diffusion Transformer that uses one fixed, pretrained text encoder (Gemma2-2B-IT) and one 32x spatial-compressed latent feature encoder (DC-AE).
- Special: This model is fine-tuned from the base model Efficient-Large-Model/Sana_1600M_1024px_BF16, and it supports Emoji, Chinese, English, and mixed prompts.
- Resources for more information: Check out our GitHub Repository and the Sana report on arXiv.

For research purposes, we recommend our `generative-models` GitHub repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference, and into which advanced diffusion samplers like Flow-DPM-Solver are integrated. MIT Han-Lab provides free Sana inference.
- Repository: https://github.com/NVlabs/Sana

> \[!IMPORTANT\]
> Make sure to set `pipe.transformer` to the default `torch_dtype` and `variant` given in the Model Card.
>
> Set `pipe.text_encoder` to BF16 and `pipe.vae` to FP32 or BF16. For more info, docs are here.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events, and therefore using the model to generate such content is out of scope for its abilities.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, hands, etc. may in general not be generated properly.
- The autoencoding part of the model is lossy.

Bias: While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
Sana_1600M_2Kpx_BF16_diffusers
Sana_600M_1024px_diffusers
Sana_1600M_1024px_MultiLing_diffusers
NVILA-AWQ
SANA_Sprint_1.6B_1024px_teacher_diffusers
Sana 1600M 512px Diffusers
We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU. Source code is available at https://github.com/NVlabs/Sana.

Note
- Weakness in Complex Scene Creation: Due to limitations of data, our model has limited capabilities in generating complex scenes, text, and human hands.
- Enhancing Capabilities: The model's performance can be improved by increasing the complexity and length of prompts. Below are some examples of prompts and samples.

- Developed by: NVIDIA, Sana
- Model type: Linear-Diffusion-Transformer-based text-to-image generative model
- Model size: 1648M parameters
- Model resolution: This model is developed to generate 512px-based images with multi-scale height and width.
- License: NSCL v2-custom. Governing Terms: NVIDIA License. Additional Information: Gemma Terms of Use | Google AI for Developers for Gemma-2-2B-IT, Gemma Prohibited Use Policy | Google AI for Developers.
- Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Linear Diffusion Transformer that uses one fixed, pretrained text encoder (Gemma2-2B-IT) and one 32x spatial-compressed latent feature encoder (DC-AE).
- Resources for more information: Check out our GitHub Repository and the Sana report on arXiv.

For research purposes, we recommend our `generative-models` GitHub repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference, and into which advanced diffusion samplers like Flow-DPM-Solver are integrated. MIT Han-Lab provides free Sana inference.

- Repository: https://github.com/NVlabs/Sana

> \[!IMPORTANT\]
> Make sure to set `pipe.transformer` to the default `torch_dtype` and `variant` given in the Model Card.
>
> Set `pipe.text_encoder` to BF16 and `pipe.vae` to FP32 or BF16.
For more info, docs are here.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events, and therefore using the model to generate such content is out of scope for its abilities.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, hands, etc. may in general not be generated properly.
- The autoencoding part of the model is lossy.

Bias: While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
SANA-Video_2B_480p_LongLive_diffusers
Sana_1600M_512px_MultiLing_diffusers