DFloat11

41 models

FLUX.1-Kontext-dev-DF11 · 3,715 downloads · 13 likes
FLUX.1-dev-DF11 · 2,066 downloads · 21 likes
FLUX.1-schnell-DF11 · 1,774 downloads · 6 likes
Wan2.1-T2V-14B-Diffusers-DF11 · 1,207 downloads · 3 likes
FLUX.1-Fill-dev-DF11 · 886 downloads · 4 likes
FLUX.1-Canny-dev-DF11 · 869 downloads · 1 like

FLUX.1-Krea-dev-DF11

DFloat11 Compressed Model: `black-forest-labs/FLUX.1-Krea-dev`

This is a DFloat11 losslessly compressed version of the original `black-forest-labs/FLUX.1-Krea-dev` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, FLUX.1-Krea-dev can now run on a single 24GB GPU, or on a 12GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1024×1024 image generation) | Generation Time (A100 GPU) |
|------------------------------------------------|------------|----------------------------------------------|----------------------------|
| FLUX.1-Krea-dev (BFloat16) | 23.80 GB | 24.28 GB | 56 seconds |
| FLUX.1-Krea-dev (DFloat11) | 16.33 GB | 17.54 GB | 58 seconds |
| FLUX.1-Krea-dev (DFloat11 + CPU Offloading) | 16.33 GB | 9.76 GB | 78 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `krea.py`:
4. To run without CPU offloading (18GB VRAM required):

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
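The "8 bits carry only ~2.6 bits" claim can be sanity-checked with a small stdlib-only sketch. This is not the DFloat11 implementation; it uses synthetic Gaussian values as a stand-in for real model tensors, measures the entropy of their bfloat16 exponent fields, and confirms a Huffman code gets within one bit of that entropy:

```python
import heapq
import math
import random
import struct
from collections import Counter

random.seed(0)

def bf16_exponent(x: float) -> int:
    # bfloat16 is the top 16 bits of float32, so the 8-bit exponent
    # field is identical in both formats.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

def huffman_lengths(freqs: Counter) -> dict:
    # Standard heap-based Huffman construction, tracking code lengths only.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, nid, {s: l + 1 for s, l in {**a, **b}.items()}))
        nid += 1
    return heap[0][2]

# Gaussian-initialized "weights" as a stand-in for real model tensors.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
counts = Counter(bf16_exponent(w) for w in weights)
n = sum(counts.values())

# Shannon entropy of the exponent distribution, in bits per exponent.
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
lengths = huffman_lengths(counts)
avg_len = sum((counts[s] / n) * l for s, l in lengths.items())
print(f"entropy {entropy:.2f} bits, Huffman avg {avg_len:.2f} bits (vs. 8 stored)")
```

Real model weights concentrate their exponents even more tightly than this synthetic draw, which is what drives the exponent field down to roughly 2.6 bits of information.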

776 downloads · 5 likes

OmniGen2-transformer-DF11 · 745 downloads · 3 likes
Wan2.2-I2V-A14B-2-DF11 · 701 downloads · 5 likes
Wan2.2-T2V-A14B-2-DF11 · 618 downloads · 4 likes
Wan2.2-I2V-A14B-DF11 · 577 downloads · 6 likes
Wan2.2-T2V-A14B-DF11 · 458 downloads · 3 likes
HiDream-I1-Full-DF11 · 360 downloads · 0 likes
Chroma-DF11 · 315 downloads · 5 likes

Qwen-Image-DF11

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on a single 32GB GPU, or on a single 16GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1328×1328 image generation) | Generation Time (A100 GPU) |
|-------------------------------------------|------------|----------------------------------------------|----------------------------|
| Qwen-Image (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image (DFloat11) | 28.42 GB | 29.74 GB | 100 seconds |
| Qwen-Image (DFloat11 + CPU Offloading) | 28.42 GB | 16.68 GB | 260 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimage.py`:
4. To run without CPU offloading (32GB VRAM required):

If you are getting out-of-CPU-memory errors, try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
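As a quick consistency check (my arithmetic, not part of the model card), the table's sizes match the stated reduction, and the offloading row shows the VRAM-for-time trade:

```python
# Numbers taken from the table above; the BF16 size is approximate ("~41 GB").
bf16_gb = 41.0
df11_gb = 28.42
reduction = 1.0 - df11_gb / bf16_gb
print(f"size reduction: {reduction:.1%}")  # close to the stated ~32%

# CPU offloading trades peak VRAM for generation time (also from the table):
vram_saved_gb = 29.74 - 16.68
slowdown = 260 / 100
print(f"offloading saves {vram_saved_gb:.2f} GB VRAM at {slowdown:.1f}x the time")
```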

304 downloads · 63 likes

stable-diffusion-3.5-large-DF11

DFloat11 Compressed Model: `stabilityai/stable-diffusion-3.5-large`

This is a losslessly compressed version of `stabilityai/stable-diffusion-3.5-large` using our custom DFloat11 format.

✅ Bit-for-bit identical outputs to the original BFloat16 model
📉 ~30% reduction in model size (from 16GB → 11.3GB)
🧠 Lower memory requirements: now runs on 16GB GPUs
⚡ Minimal performance overhead: barely any slower than the full model

DFloat11 compresses the model weights while preserving full numerical precision. This allows you to run `stabilityai/stable-diffusion-3.5-large` on more accessible hardware, with no compromise in output quality.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

Advantages:
- Fully GPU-based: no CPU decompression or host-device data transfer. DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
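The bit-for-bit guarantee is simply the prefix-code property of Huffman coding: decoding inverts encoding exactly. A stdlib-only sketch with a skewed synthetic symbol stream (standing in for exponent bytes; not the actual DFloat11 kernel):

```python
import heapq
import random
from collections import Counter

random.seed(0)

def huffman_codes(freqs: Counter) -> dict:
    """Heap-based Huffman construction returning symbol -> bit-string codes."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (f1 + f2, nid, merged))
        nid += 1
    return heap[0][2]

# Skewed 8-bit "exponent" stream, mimicking how bfloat16 exponents cluster.
stream = [random.choice([118] * 8 + [119] * 4 + [120] * 2 + [121])
          for _ in range(10_000)]
codes = huffman_codes(Counter(stream))
bits = "".join(codes[s] for s in stream)

# Decode greedily; a prefix code makes this unambiguous, hence lossless.
inv = {c: s for s, c in codes.items()}
decoded, buf = [], ""
for b in bits:
    buf += b
    if buf in inv:
        decoded.append(inv[buf])
        buf = ""

assert decoded == stream  # bit-for-bit identical round trip
print(f"compressed to {len(bits) / (8 * len(stream)):.2f} of the stored exponent bits")
```

DFloat11 does this decode on the GPU with a custom CUDA kernel rather than a Python loop, but the losslessness argument is the same.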

279 downloads · 4 likes

Qwen-Image-Edit-2509-DF11

DFloat11 Compressed Model: `Qwen/Qwen-Image-Edit-2509`

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image-Edit-2509` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image-Edit-2509 can now run on a single 32GB GPU, or on a single 24GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1024×1024 image generation) | Image Editing Time (A100 GPU) |
|-----------------------------------------------------|------------|----------------------------------------------|-------------------------------|
| Qwen-Image-Edit-2509 (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image-Edit-2509 (DFloat11) | 28.43 GB | 30.20 GB | 102 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimageedit.py`:
4. To run without CPU offloading (32GB VRAM required):

If you are getting out-of-CPU-memory errors, try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

265 downloads · 10 likes

FLUX.1-Depth-dev-DF11 · 232 downloads · 0 likes
BAGEL-7B-MoT-DF11 · 141 downloads · 24 likes

Qwen-Image-Edit-DF11

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image-Edit` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image-Edit can now run on a single 32GB GPU, or on a single 24GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory | Generation Time (A100 GPU) |
|------------------------------------------------|------------|-----------------|----------------------------|
| Qwen-Image-Edit (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image-Edit (DFloat11) | 28.43 GB | 30.11 GB | 280 seconds |
| Qwen-Image-Edit (DFloat11 + CPU Offloading) | 28.43 GB | 22.71 GB | 570 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimageedit.py`:
4. To run without CPU offloading (32GB VRAM required):

To run with CPU offloading (24GB VRAM and 50GB CPU RAM required):

If you are getting out-of-memory errors (CPU or GPU), try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

97 downloads · 12 likes

Llama-3.1-8B-Instruct-DF11

DFloat11 Compressed Model: `meta-llama/Llama-3.1-8B-Instruct`

This is a losslessly compressed version of `meta-llama/Llama-3.1-8B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
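Back-of-envelope on the ~30% figure (my rough accounting, consistent with the "11" in the DFloat11 name; the 3-bit exponent allowance below is an assumption covering the ~2.6 bits of entropy plus Huffman-table overhead):

```python
# Each BFloat16 weight stores 1 sign + 8 exponent + 7 mantissa = 16 bits.
bf16_bits = 16
sign_bits, mantissa_bits = 1, 7   # kept verbatim by DFloat11
exponent_bits = 3                 # ~2.6 bits of entropy + coding overhead (rough)
df11_bits = sign_bits + mantissa_bits + exponent_bits

reduction = 1.0 - df11_bits / bf16_bits
print(f"~{df11_bits} bits/weight -> ~{reduction:.0%} smaller")
```

Eleven effective bits out of sixteen gives a ~31% reduction, matching both the "approximately 30%" claim here and the ~32% figures quoted on the diffusion-model cards.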

19 downloads · 2 likes

Qwen3-8B-DF11 · 10 downloads · 2 likes

Mistral-Nemo-Instruct-2407-DF11

DFloat11 Compressed Model: `mistralai/Mistral-Nemo-Instruct-2407`

This is a losslessly compressed version of `mistralai/Mistral-Nemo-Instruct-2407` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

8 downloads · 0 likes

Qwen3-4B-DF11 · 6 downloads · 3 likes
OmniGen2-mllm-DF11 · 6 downloads · 1 like
gemma-3-4b-it-DF11 · 6 downloads · 0 likes
Qwen3-14B-DF11 · 6 downloads · 0 likes

QwQ-32B-DF11

This is a losslessly compressed version of `Qwen/QwQ-32B` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

5 downloads · 0 likes

gemma-3-12b-it-DF11 · 4 downloads · 0 likes

DeepSeek-R1-Distill-Llama-8B-DF11

DFloat11 Compressed Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`

This is a losslessly compressed version of `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

4 downloads · 0 likes

DeepSeek-R1-Distill-Qwen-32B-DF11 · 3 downloads · 0 likes
DeepSeek-R1-Distill-Qwen-14B-DF11 · 3 downloads · 0 likes
gemma-3-27b-it-DF11 · 2 downloads · 1 like

Qwen2.5-14B-Instruct-DF11

DFloat11 Compressed Model: `Qwen/Qwen2.5-14B-Instruct`

This is a losslessly compressed version of `Qwen/Qwen2.5-14B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 1 like

Llama-3.3-70B-Instruct-DF11

DFloat11 Compressed Model: `meta-llama/Llama-3.3-70B-Instruct`

This is a losslessly compressed version of `meta-llama/Llama-3.3-70B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 1 like

DeepSeek-R1-0528-Qwen3-8B-DF11 · 1 download · 1 like
Llama-3.1-405B-Instruct-DF11 · 1 download · 0 likes
Phi-4-reasoning-plus-DF11 · 1 download · 0 likes

Mistral-Small-24B-Instruct-2501-DF11

DFloat11 Compressed Model: `mistralai/Mistral-Small-24B-Instruct-2501`

This is a losslessly compressed version of `mistralai/Mistral-Small-24B-Instruct-2501` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 0 likes

Qwen3-32B-DF11 · 0 downloads · 1 like
FLUX.1-Krea-dev-DF11-ComfyUI · 0 downloads · 1 like