Efficient-Large-Model

101 models

- gemma-2-2b-it: 73,851 downloads · 3 likes

NVILA-Lite-8B

Model type: NVILA is a visual language model (VLM) pretrained at scale on interleaved image-text data, enabling multi-image understanding. Visual language models (VLMs) have made significant advances in accuracy in recent years, but their efficiency has received much less attention. NVILA is a family of open VLMs designed to optimize both efficiency and accuracy. Building on VILA, it improves the model architecture by first scaling up the spatial and temporal resolutions and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. A systematic investigation further enhances NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks, while reducing training costs by 4.5x, fine-tuning memory usage by 3.4x, pre-filling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x. Code and models will be released to facilitate reproducibility.

Paper or resources for more information: https://github.com/NVLabs/VILA

License:
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
- The service is a research preview intended for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI and the dataset licenses for each dataset used during training.

Where to send questions or comments about the model: https://github.com/NVLabs/VILA/issues

Intended use:
- Primary intended uses: research on large multimodal models and chatbots.
- Primary intended users: researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Input:
- Types: image, video, text
- Formats: RGB image; MP4 video; string
- Parameters: 2D, 3D

Supported hardware microarchitectures: Ampere, Jetson, Hopper, Lovelace

Training dataset: see Dataset Preparation for more details. Data collection method: hybrid (automated, human). Labeling method: hybrid (automated, human).

Inference engines: PyTorch, TensorRT-LLM, TinyChat

Ethical considerations: NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with the terms of service, developers should work with their internal model team to ensure the model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Stats: llava_llama · 7,458 downloads · 4 likes

Fast_dLLM_v2_1.5B

Fast-dLLM v2 (1.5B): Efficient Block-Diffusion LLM

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. Fast-dLLM v2 is a carefully designed block diffusion language model (dLLM) that efficiently adapts a pretrained AR model (Qwen2.5-1.5B-Instruct) into a diffusion-style decoder for parallel text generation. The approach introduces a novel decoding recipe incorporating a complementary attention mask and a block diffusion mechanism, which together enable blockwise bidirectional context modeling while preserving the original AR training objectives and performance. To further enhance inference speed, a hierarchical caching mechanism is used: a block-level cache stores historical context representations, and a sub-block-level cache supports efficient parallel decoding within partially generated blocks.

✨ Key Innovations
- Block diffusion mechanism + complementary attention mask: enables blockwise bidirectional context modeling without sacrificing AR objectives.
- Hierarchical caching: a block-level cache stores historical context representations across blocks; a sub-block cache supports parallel decoding within partially generated blocks.
- Token shift mechanism: retains autoregressive characteristics while supporting bidirectional context within blocks.
- Parallel decoding pipeline: achieves up to 2.5x speedup over standard AR decoding without compromising quality.

> 🚀 Fast-dLLM v2 uses only ~1B tokens for fine-tuning, a 500x reduction vs. full-attention diffusion LLMs (Dream: 580B tokens), while matching or surpassing AR baselines in accuracy.

🛠 Model Overview
- Type: Block Diffusion Language Model (dLLM)
- Base model: `Qwen/Qwen2.5-1.5B-Instruct`
- Architecture: Transformer with RoPE, SwiGLU, RMSNorm, attention QKV bias, tied embeddings
- Params: 1.54B (non-embedding: 1.31B)
- Layers: 28
- Attention heads: 12 (Q), 2 (KV, GQA)
- Key feature: parallel block-wise decoding + hierarchical caching

📦 Installation
You will need `transformers`, `torch`, and the custom generation function from the repository.

▶ Real-time Throughput
Fast-dLLM v2 offers up to 2.54x higher throughput than Qwen2.5-7B-Instruct, without loss in quality.

🏆 Benchmark Results
Fast-dLLM v2 is compared against AR baselines and previous diffusion LLMs on diverse tasks: HumanEval and MBPP (code), GSM8K and Math (reasoning), IFEval (instruction following), MMLU and GPQA (knowledge QA).
- 1B group: Fast-dLLM v2 (1.5B) achieves the best average score: 45.0.
- 7B group: Fast-dLLM v2 (7B) achieves the best average score: 60.3, surpassing the LLaDA and Dream models.

If you use Fast-dLLM v2 in your research or products, please cite the paper.

📄 License
Released under Apache 2.0, following the base Qwen2.5 license.

🔗 Resources
- 📄 Paper
- 💻 Code
- 🤗 Hugging Face model
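The blockwise parallel decoding with a block-level cache described above can be sketched in miniature. This is an illustration only, under loose assumptions: the real Fast-dLLM v2 decoder refines transformer hidden states under a diffusion-style masking schedule, whereas `model` here is a stand-in callable, not the actual API.

```python
# Toy sketch of blockwise parallel decoding with a block-level cache.
# Illustration only: `model` and `decode_blockwise` are hypothetical
# names, not the Fast-dLLM v2 API.

def decode_blockwise(model, prompt, num_blocks=3, block_size=4):
    """Generate `num_blocks` blocks of `block_size` tokens each.

    `model(context, block)` proposes a token for every position of the
    current block in one parallel pass (masked positions are `None`).
    """
    context = list(prompt)  # block-level cache: finished blocks are
                            # appended here and reused, never recomputed
    for _ in range(num_blocks):
        block = [None] * block_size            # all positions start masked
        while None in block:
            proposals = model(context, block)  # one parallel refinement pass
            for i, tok in enumerate(block):
                if tok is None:
                    block[i] = proposals[i]
        context.extend(block)
    return context


def dummy_model(context, block):
    # Stand-in "model": proposes consecutive integers after the context.
    base = len(context)
    return [base + i for i in range(len(block))]


print(decode_blockwise(dummy_model, prompt=[0, 1]))  # [0, 1, 2, ..., 13]
```

The key point the sketch tries to convey: finished blocks enter a cache that is reused rather than recomputed, and all masked positions in the active block are proposed in a single parallel pass rather than one token at a time.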

Stats: license:apache-2.0 · 5,678 downloads · 7 likes

- NVILA-8B (llava_llama): 4,323 downloads · 6 likes
- paligemma-siglip-so400m-patch14-448 (license:apache-2.0): 3,778 downloads · 1 like
- VILA1.5-3b (llava_llama): 3,183 downloads · 30 likes
- LongVILA-R1-7B: 3,028 downloads · 10 likes
- NVILA-Lite-2B (llava_llama): 3,019 downloads · 5 likes
- NVILA-Lite-15B (llava_llama): 2,658 downloads · 4 likes
- Fast_dLLM_v2_7B (license:apache-2.0): 2,632 downloads · 17 likes
- NVILA-Lite-2B-hf-preview (license:cc-by-nc-4.0): 2,460 downloads · 1 like

NVILA-15B

Model type: NVILA is a visual language model (VLM) pretrained at scale on interleaved image-text data, enabling multi-image understanding. (The remainder of this model card is identical to the NVILA-Lite-8B card above.)

Stats: llava_llama · 2,414 downloads · 23 likes

- Sana_600M_512px: 2,302 downloads · 12 likes
- VILA-7b (llava_llama): 2,080 downloads · 27 likes
- Qwen2-VL-7B-Instruct (license:apache-2.0): 1,554 downloads · 0 likes
- Llama-3-VILA1.5-8B (llava_llama): 1,238 downloads · 37 likes
- NVILA-8B-Video (llava_llama): 740 downloads · 7 likes
- NVILA-Lite-2B-hf: 671 downloads · 0 likes
- Sana_1600M_1024px: 647 downloads · 214 likes
- NVILA-Lite-2B-Verifier (license:cc-by-nc-4.0): 535 downloads · 7 likes
- SANA1.5_1.6B_1024px: 535 downloads · 2 likes
- NVILA-Lite-2B-hf-0626: 471 downloads · 0 likes
- VILA1.5-13b (llava_llama): 461 downloads · 5 likes
- NVILA-8B-hf: 457 downloads · 0 likes
- qwen2-7b-longvila-256f (llava_llama): 450 downloads · 0 likes
- Sana_600M_1024px_ControlNet_HED: 151 downloads · 0 likes
- Sana_600M_1024px: 144 downloads · 19 likes
- VILA1.5-3b-s2 (llava_llama): 128 downloads · 1 like
- qwen2-7b-longvila-1M (llava_llama): 122 downloads · 2 likes
- Sana_1600M_1024px_BF16: 118 downloads · 13 likes
- Sana_1600M_1024px_MultiLing: 108 downloads · 25 likes
- VILA-2.7b (llava_llama): 95 downloads · 15 likes
- Sana_1600M_4Kpx_BF16: 82 downloads · 32 likes
- Sana_Sprint_1.6B_1024px: 66 downloads · 15 likes
- Llama-3-VILA1.5-8B-Fix-AWQ (llava_llama): 65 downloads · 0 likes
- Sana_1600M_512px: 58 downloads · 39 likes
- NVILA-Lite-15B-Video (llava_llama): 48 downloads · 0 likes
- SANA1.5_4.8B_1024px: 33 downloads · 23 likes
- VILA1.5-3b-AWQ (llava_llama): 31 downloads · 5 likes

SANA Video 2B 480p

SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720x1280. (1) Linear DiT: leverages linear attention as the core operation, offering significantly more efficiency than vanilla attention when processing the massive number of tokens required for video generation. (2) Constant-memory KV cache for block linear attention: implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV-cache bottleneck and enabling efficient, minute-long video synthesis.

SANA-Video achieves exceptional efficiency and cost savings: its training cost is only 1% of MovieGen's (12 days on 64 H100 GPUs). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReel-V2), SANA-Video maintains competitive performance while being 16x faster in measured latency. SANA-Video is deployable on RTX 5090 GPUs, accelerating inference for a 5-second 720p video from 71 s down to 29 s (a 2.4x speedup), setting a new standard for low-cost, high-quality video generation. Source code is available at https://github.com/NVlabs/Sana. Refer to: https://github.com/NVlabs/Sana/blob/main/asset/docs/sanavideo.md#1-inference-with-txt-file

- Developed by: NVIDIA, Sana
- Model type: efficient video generation with a Block Linear Diffusion Transformer
- Model size: 2B parameters
- Model precision: torch.bfloat16 (BF16)
- Model resolution: this model generates 480p videos of 81 frames (5 s) with multi-scale height and width.
- Model description: a model that can be used to generate and modify videos based on text prompts. It is a Linear Diffusion Transformer that pairs an 8x Wan-VAE with a 32x spatial-compressed latent feature encoder (DC-AE-V).
- Resources for more information: check out the GitHub Repository and the SANA-Video report on arXiv.

For research purposes, the `generative-models` GitHub repository (https://github.com/NVlabs/Sana) is recommended, as it is more suitable for both training and inference.
- Repository: https://github.com/NVlabs/Sana
- Guidance: https://github.com/NVlabs/Sana/asset/docs/sanavideo.md

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events; using it to generate such content is out of scope.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, etc. in general may not be generated properly.
- The autoencoding part of the model is lossy.

Bias: while the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases.

Stats: 25 downloads · 6 likes

- Sana_1600M_1024px_BF16_ControlNet_HED: 24 downloads · 0 likes
- Sana_1600M_512px_MultiLing: 21 downloads · 16 likes
- Llama-3-VILA1.5-8b-AWQ (llava_llama): 21 downloads · 7 likes
- VILA1.5-40b (llava_llama): 19 downloads · 17 likes
- Sana_Sprint_1.6B_1024px_teacher: 19 downloads · 1 like

qwen2_5vl-7b-wolfv2-tuned

This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the mllm_demo, identity, and alpaca_en_demo datasets.

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- total_eval_batch_size: 64
- optimizer: adamw_torch_fused with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0

Framework versions:
- Transformers 4.56.1
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1
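As a sanity check, the reported total training batch size follows from the per-device batch size, the device count, and the gradient accumulation steps listed above:

```python
# Effective (total) training batch size =
#   per-device batch x number of devices x gradient accumulation steps.
per_device_batch = 1
num_devices = 8
grad_accum_steps = 2

total_train_batch = per_device_batch * num_devices * grad_accum_steps
print(total_train_batch)  # 16, matching the reported total_train_batch_size
```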

Stats: llama-factory · 17 downloads · 0 likes
0

- NVILA-Lite-8B-stage2 (llava_llama): 16 downloads · 1 like
- NVILA-15B-hf: 16 downloads · 0 likes
- VILA1.5-13b-AWQ (llava_llama): 15 downloads · 3 likes
- VILA1.5-7b (llava_llama): 14 downloads · 0 likes
- Sana_1600M_2Kpx_BF16: 13 downloads · 11 likes
- VILA-13b (llava_llama): 12 downloads · 20 likes
- vila-ewm-qwen2-1.5b (llava_llama): 12 downloads · 1 like
- Sana_Sprint_0.6B_1024px: 10 downloads · 5 likes
- VILA-7b-4bit-awq (llava_llama): 10 downloads · 2 likes
- VILA1.5-40b-AWQ (llava_llama): 9 downloads · 3 likes
- NVILA-Lite-8B-hf-preview (license:cc-by-nc-4.0): 8 downloads · 0 likes
- Sana_Sprint_0.6B_1024px_teacher: 8 downloads · 0 likes
- VILA1.5-3b-s2-AWQ (llava_llama): 7 downloads · 1 like
- Qwen2-OV-72B: 6 downloads · 0 likes
- Llama-3-VILA1.5-8B-Fix (llava_llama): 6 downloads · 0 likes
- NVILA-Lite-8B-hf-0626: 6 downloads · 0 likes
- NVILA-Lite-15B-Video-hf-0626: 6 downloads · 0 likes
- NVILA-Lite-8B-hf: 6 downloads · 0 likes
- nvila_15b_video-wolfv2_tuned (llava_llama): 6 downloads · 0 likes
- Meta-Llama-Guard-2-8B (llama): 5 downloads · 0 likes
- nvila_15b_video-plm_tuned (llava_llama): 5 downloads · 0 likes
- VILA-13b-4bit-awq (llava_llama): 4 downloads · 2 likes
- Qwen2-OV-SI-72B: 4 downloads · 0 likes
- qwen2-vl-7b-instruct-pretrain (llava_llama): 4 downloads · 0 likes
- VILA15-3b-hf-preview (license:cc-by-nc-4.0): 4 downloads · 0 likes
- nvila_8b_video-wolfv2_tuned (llava_llama): 4 downloads · 0 likes
- nvila_8b_video-plm_tuned (llava_llama): 4 downloads · 0 likes
- Meta-Llama-3.1-70B (llama): 3 downloads · 0 likes
- Meta-Llama-3.1-70B-Instruct (llama): 3 downloads · 0 likes
- NVILA-Lite-15B-stage2 (llava_llama): 3 downloads · 0 likes
- NVILA-Lite-15B-hf-0626: 3 downloads · 0 likes
- VILA15_3b (llava_llama): 2 downloads · 0 likes
- Llama-3-VILA15-8B-hf-preview (license:cc-by-nc-4.0): 2 downloads · 0 likes
- VILA15-13b-hf-preview (license:cc-by-nc-4.0): 2 downloads · 0 likes
- NVILA-Lite-15B-hf: 2 downloads · 0 likes
- NVILA-Lite-15B-WolfV2 (llava_llama): 2 downloads · 0 likes
- qwen2-1.5b-longvila-256f (llava_llama): 1 download · 0 likes
- nvila-internal-33b-video-v1 (llava_llama): 1 download · 0 likes

LongLive 1.3B

🎬 LongLive: Real-time Interactive Long Video Generation

Links: Paper (https://arxiv.org/abs/2509.22622) · Code (https://github.com/NVlabs/LongLive) · Model (https://huggingface.co/Efficient-Large-Model/LongLive-1.3B) · Video (https://www.youtube.com/watch?v=CO1QC7BNvig) · Website (https://nvlabs.github.io/LongLive)

💡 TL;DR: Turn interactive prompts into long videos, instantly, as you type!

Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen

LongLive is a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and diffusion-forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates: a KV-recache mechanism that refreshes cached states with the new prompt for smooth, adherent switches; streaming long tuning to enable long-video training and to align training with inference (train-long, test-long); and short-window attention paired with a frame-level attention sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days.

At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench in both short- and long-video settings. LongLive supports up to 240-second videos on a single H100 GPU. With FP8 quantization, LongLive boosts inference to 24.8 FPS with marginal quality loss.

News
- [2025.9.25] Release of the paper, the LongLive GitHub repo with all training and inference code, the LongLive-1.3B model weights, and the demo website.

Highlights
1. Long video generation: LongLive supports up to 240 s of video generation with visual consistency.
2. Real-time inference: LongLive sustains 20.7 FPS on a single H100 GPU, and 24.8 FPS with FP8 quantization at marginal quality loss.
3. Efficient fine-tuning: LongLive extends a short-clip model to minute-long generation in 32 H100 GPU-days.

LongLive accepts sequential user prompts and generates corresponding videos in real time, enabling user-guided long video generation. The framework combines (left) a frame sink with short-window attention and (right) KV-recache. The streaming long tuning pipeline trains on long sequences by reusing the historical KV cache each iteration to generate the next 5 s clip, then supervising it with the teacher. KV re-cache yields consistent transitions with new-prompt compliance, demonstrated on interactive 60 s videos with 6 prompts. See the demo website for video examples.

Requirements: an NVIDIA GPU with at least 40 GB of memory (A100 and H100 are tested), a Linux operating system, and 64 GB of RAM. Other hardware setups may also work but have not been tested. Create a conda environment and install dependencies.

How to contribute
- Make sure you have git installed.
- Create your own fork of the project.
- Clone the repository to your local machine using `git clone` and the URL of this project.
- Read both the `Requirements` and `Installation and Quick Guide` sections.
- Commit and push your changes.
- Make a pull request when you have finished modifying the project.

Citation: please consider citing the paper and this framework if they are helpful in your research.

License: the LongLive-1.3B model weights are under the CC-BY-NC 4.0 license.

Acknowledgements
- Self-Forcing: the codebase and algorithm LongLive builds upon. Thanks for their wonderful work.
- Wan: the base model LongLive builds upon. Thanks for their wonderful work.
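The "clone, create a conda environment, and install dependencies" steps above might look like the following. This is a sketch only: the environment name, Python version, and requirements file name are assumptions, so consult the LongLive repository's README for the authoritative instructions.

```shell
# Illustrative setup only: env name, Python version, and the
# requirements file name are assumptions, not taken from the repo.
git clone https://github.com/NVlabs/LongLive.git
cd LongLive
conda create -n longlive python=3.10 -y
conda activate longlive
pip install -r requirements.txt
```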

Stats: license:cc-by-nc-sa-4.0 · 0 downloads · 38 likes

- Sana_Sprint_1.6B_1024px_diffusers: 0 downloads · 19 likes
- Sana_1600M_1024px_diffusers: 0 downloads · 18 likes
- SANA1.5_4.8B_1024px_diffusers: 0 downloads · 14 likes
- Sana_Sprint_0.6B_1024px_diffusers: 0 downloads · 12 likes
- Sana_600M_512px_diffusers: 0 downloads · 8 likes
- Sana_1600M_4Kpx_BF16_diffusers: 0 downloads · 8 likes
- SANA1.5_1.6B_1024px_diffusers: 0 downloads · 8 likes

Sana_1600M_1024px_BF16_diffusers

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Source code is available at https://github.com/NVlabs/Sana.

Note
- Weakness in complex scene creation: due to data limitations, the model has limited capability to generate complex scenes, text, and human hands.
- Enhancing capabilities: the model's performance can be improved by increasing the complexity and length of prompts.

- Developed by: NVIDIA, Sana
- Model type: Linear-Diffusion-Transformer-based text-to-image generative model
- Model size: 1648M parameters
- Model resolution: this model generates 1024px-based images with multi-scale height and width.
- License: NSCL v2-custom. Governing terms: NVIDIA License. Additional information: Gemma Terms of Use and Gemma Prohibited Use Policy (Google AI for Developers) for Gemma-2-2B-IT.
- Model description: a model that can be used to generate and modify images based on text prompts. It is a Linear Diffusion Transformer that uses one fixed, pretrained text encoder (Gemma2-2B-IT) and a 32x spatial-compressed latent feature encoder (DC-AE).
- Special: this model is fine-tuned from the base model Efficient-Large-Model/Sana_1600M_1024px_BF16 and supports emoji, Chinese, English, and mixed prompts.
- Resources for more information: check out the GitHub Repository and the Sana report on arXiv.

For research purposes, the `generative-models` GitHub repository (https://github.com/NVlabs/Sana) is recommended; it is more suitable for both training and inference, and advanced diffusion samplers like Flow-DPM-Solver are integrated. MIT Han-Lab provides free Sana inference.

- Repository: https://github.com/NVlabs/Sana

> \[!IMPORTANT\]
> Make sure to set `pipe.transformer` to the default `torch_dtype` and `variant` according to the model card.
>
> Set `pipe.text_encoder` to BF16 and `pipe.vae` to FP32 or BF16. For more info, see the docs.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events; using it to generate such content is out of scope.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, etc. in general may not be generated properly.
- The autoencoding part of the model is lossy.

Bias: while the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
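The dtype advice above (BF16 transformer and text encoder, FP32 or BF16 VAE) can be sketched with the `diffusers` `SanaPipeline`. Treat this as a sketch: the exact `variant` string and pipeline class for this checkpoint should be confirmed against the model card, and running it requires a CUDA GPU and a download of the weights.

```python
import torch
from diffusers import SanaPipeline

# Load the BF16 variant; keep the VAE in FP32 as the note above advises.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",           # assumption: confirm against the model card
    torch_dtype=torch.bfloat16,
)
pipe.text_encoder.to(torch.bfloat16)
pipe.vae.to(torch.float32)
pipe.to("cuda")

image = pipe(prompt="a cyberpunk cat with a neon sign").images[0]
image.save("sana.png")
```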

Stats: 0 downloads · 7 likes

- Sana_1600M_2Kpx_BF16_diffusers: 0 downloads · 6 likes
- Sana_600M_1024px_diffusers: 0 downloads · 5 likes
- Sana_1600M_1024px_MultiLing_diffusers: 0 downloads · 3 likes
- NVILA-AWQ (license:apache-2.0): 0 downloads · 2 likes
- SANA_Sprint_1.6B_1024px_teacher_diffusers: 0 downloads · 2 likes

Sana 1600M 512px Diffusers

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana synthesizes high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Source code is available at https://github.com/NVlabs/Sana.

Note
- Weakness in complex scene creation: due to data limitations, the model has limited capability to generate complex scenes, text, and human hands.
- Enhancing capabilities: the model's performance can be improved by increasing the complexity and length of prompts.

- Developed by: NVIDIA, Sana
- Model type: Linear-Diffusion-Transformer-based text-to-image generative model
- Model size: 1648M parameters
- Model resolution: this model generates 512px-based images with multi-scale height and width.
- License: NSCL v2-custom. Governing terms: NVIDIA License. Additional information: Gemma Terms of Use and Gemma Prohibited Use Policy (Google AI for Developers) for Gemma-2-2B-IT.
- Model description: a model that can be used to generate and modify images based on text prompts. It is a Linear Diffusion Transformer that uses one fixed, pretrained text encoder (Gemma2-2B-IT) and a 32x spatial-compressed latent feature encoder (DC-AE).
- Resources for more information: check out the GitHub Repository and the Sana report on arXiv.

For research purposes, the `generative-models` GitHub repository (https://github.com/NVlabs/Sana) is recommended; it is more suitable for both training and inference, and advanced diffusion samplers like Flow-DPM-Solver are integrated. MIT Han-Lab provides free Sana inference.

- Repository: https://github.com/NVlabs/Sana

> \[!IMPORTANT\]
> Make sure to set `pipe.transformer` to the default `torch_dtype` and `variant` according to the model card.
>
> Set `pipe.text_encoder` to BF16 and `pipe.vae` to FP32 or BF16. For more info, see the docs.

The model is intended for research purposes only. Possible research areas and tasks include:
- Generation of artworks and use in design and other artistic processes.
- Applications in educational or creative tools.
- Research on generative models.
- Safe deployment of models which have the potential to generate harmful content.
- Probing and understanding the limitations and biases of generative models.

The model was not trained to be a factual or true representation of people or events; using it to generate such content is out of scope.
- The model does not achieve perfect photorealism.
- The model cannot render complex legible text.
- Fingers, etc. in general may not be generated properly.
- The autoencoding part of the model is lossy.

Bias: while the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.

Stats: 0 downloads · 2 likes

- SANA-Video_2B_480p_LongLive_diffusers: 0 downloads · 1 like

Sana_1600M_512px_MultiLing_diffusers

We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. (The remainder of this model card is identical to the Sana 1600M 512px Diffusers card above.)

Stats: 0 downloads · 1 like