DFloat11

41 models

FLUX.1-Kontext-dev-DF11 · 3,715 downloads · 13 likes
FLUX.1-dev-DF11 · 2,066 downloads · 21 likes
FLUX.1-schnell-DF11 · 1,774 downloads · 6 likes
Wan2.1-T2V-14B-Diffusers-DF11 · 1,207 downloads · 3 likes
FLUX.1-Fill-dev-DF11 · 886 downloads · 4 likes
FLUX.1-Canny-dev-DF11 · 869 downloads · 1 like

FLUX.1-Krea-dev-DF11

DFloat11 Compressed Model: `black-forest-labs/FLUX.1-Krea-dev`

This is a DFloat11 losslessly compressed version of the original `black-forest-labs/FLUX.1-Krea-dev` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, FLUX.1-Krea-dev can now run on a single 24GB GPU, or on a 12GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1024×1024 image generation) | Generation Time (A100 GPU) |
|------------------------------------------------|------------|----------------------------------------------|----------------------------|
| FLUX.1-Krea-dev (BFloat16) | 23.80 GB | 24.28 GB | 56 seconds |
| FLUX.1-Krea-dev (DFloat11) | 16.33 GB | 17.54 GB | 58 seconds |
| FLUX.1-Krea-dev (DFloat11 + CPU Offloading) | 16.33 GB | 9.76 GB | 78 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `krea.py`:
4. To run without CPU offloading (18GB VRAM required):

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
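The "8 bits carry only ~2.6 bits" claim can be sanity-checked with a small stdlib-only sketch. This is not the DFloat11 implementation; it uses synthetic Gaussian values as a stand-in for real model tensors, measures the entropy of their bfloat16 exponent fields, and confirms a Huffman code gets within one bit of that entropy:

```python
import heapq
import math
import random
import struct
from collections import Counter

random.seed(0)

def bf16_exponent(x: float) -> int:
    # bfloat16 is the top 16 bits of float32, so the 8-bit exponent
    # field is identical in both formats.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

def huffman_lengths(freqs: Counter) -> dict:
    # Standard heap-based Huffman construction, tracking code lengths only.
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, nid, {s: l + 1 for s, l in {**a, **b}.items()}))
        nid += 1
    return heap[0][2]

# Gaussian-initialized "weights" as a stand-in for real model tensors.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
counts = Counter(bf16_exponent(w) for w in weights)
n = sum(counts.values())

# Shannon entropy of the exponent distribution, in bits per exponent.
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
lengths = huffman_lengths(counts)
avg_len = sum((counts[s] / n) * l for s, l in lengths.items())
print(f"entropy {entropy:.2f} bits, Huffman avg {avg_len:.2f} bits (vs. 8 stored)")
```

Real model weights concentrate their exponents even more tightly than this synthetic draw, which is what drives the exponent field down to roughly 2.6 bits of information.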

776 downloads · 5 likes

OmniGen2-transformer-DF11 · 745 downloads · 3 likes
Wan2.2-I2V-A14B-2-DF11 · 701 downloads · 5 likes
Wan2.2-T2V-A14B-2-DF11 · 618 downloads · 4 likes
Wan2.2-I2V-A14B-DF11 · 577 downloads · 6 likes
Wan2.2-T2V-A14B-DF11 · 458 downloads · 3 likes
HiDream-I1-Full-DF11 · 360 downloads · 0 likes
Chroma-DF11 · 315 downloads · 5 likes

Qwen-Image-DF11

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image can now run on a single 32GB GPU, or on a single 16GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1328×1328 image generation) | Generation Time (A100 GPU) |
|-------------------------------------------|------------|----------------------------------------------|----------------------------|
| Qwen-Image (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image (DFloat11) | 28.42 GB | 29.74 GB | 100 seconds |
| Qwen-Image (DFloat11 + CPU Offloading) | 28.42 GB | 16.68 GB | 260 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimage.py`:
4. To run without CPU offloading (32GB VRAM required):

If you are getting out-of-CPU-memory errors, try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
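As a quick consistency check (my arithmetic, not part of the model card), the table's sizes match the stated reduction, and the offloading row shows the VRAM-for-time trade:

```python
# Numbers taken from the table above; the BF16 size is approximate ("~41 GB").
bf16_gb = 41.0
df11_gb = 28.42
reduction = 1.0 - df11_gb / bf16_gb
print(f"size reduction: {reduction:.1%}")  # close to the stated ~32%

# CPU offloading trades peak VRAM for generation time (also from the table):
vram_saved_gb = 29.74 - 16.68
slowdown = 260 / 100
print(f"offloading saves {vram_saved_gb:.2f} GB VRAM at {slowdown:.1f}x the time")
```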

304 downloads · 63 likes

stable-diffusion-3.5-large-DF11

DFloat11 Compressed Model: `stabilityai/stable-diffusion-3.5-large`

This is a losslessly compressed version of `stabilityai/stable-diffusion-3.5-large` using our custom DFloat11 format.

✅ Bit-for-bit identical outputs to the original BFloat16 model
📉 ~30% reduction in model size (from 16GB → 11.3GB)
🧠 Lower memory requirements: now runs on 16GB GPUs
⚡ Minimal performance overhead: barely any slower than the full model

DFloat11 compresses the model weights while preserving full numerical precision. This allows you to run `stabilityai/stable-diffusion-3.5-large` on more accessible hardware, with no compromise in output quality.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

Advantages:
- Fully GPU-based: no CPU decompression or host-device data transfer. DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
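The bit-for-bit guarantee is simply the prefix-code property of Huffman coding: decoding inverts encoding exactly. A stdlib-only sketch with a skewed synthetic symbol stream (standing in for exponent bytes; not the actual DFloat11 kernel):

```python
import heapq
import random
from collections import Counter

random.seed(0)

def huffman_codes(freqs: Counter) -> dict:
    """Heap-based Huffman construction returning symbol -> bit-string codes."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    nid = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (f1 + f2, nid, merged))
        nid += 1
    return heap[0][2]

# Skewed 8-bit "exponent" stream, mimicking how bfloat16 exponents cluster.
stream = [random.choice([118] * 8 + [119] * 4 + [120] * 2 + [121])
          for _ in range(10_000)]
codes = huffman_codes(Counter(stream))
bits = "".join(codes[s] for s in stream)

# Decode greedily; a prefix code makes this unambiguous, hence lossless.
inv = {c: s for s, c in codes.items()}
decoded, buf = [], ""
for b in bits:
    buf += b
    if buf in inv:
        decoded.append(inv[buf])
        buf = ""

assert decoded == stream  # bit-for-bit identical round trip
print(f"compressed to {len(bits) / (8 * len(stream)):.2f} of the stored exponent bits")
```

DFloat11 does this decode on the GPU with a custom CUDA kernel rather than a Python loop, but the losslessness argument is the same.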

279 downloads · 4 likes

Qwen-Image-Edit-2509-DF11

DFloat11 Compressed Model: `Qwen/Qwen-Image-Edit-2509`

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image-Edit-2509` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image-Edit-2509 can now run on a single 32GB GPU, or on a single 24GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory (1024×1024 image generation) | Image Editing Time (A100 GPU) |
|-----------------------------------------------------|------------|----------------------------------------------|-------------------------------|
| Qwen-Image-Edit-2509 (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image-Edit-2509 (DFloat11) | 28.43 GB | 30.20 GB | 102 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimageedit.py`:
4. To run without CPU offloading (32GB VRAM required):

If you are getting out-of-CPU-memory errors, try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

265 downloads · 10 likes

FLUX.1-Depth-dev-DF11 · 232 downloads · 0 likes
BAGEL-7B-MoT-DF11 · 141 downloads · 24 likes

Qwen-Image-Edit-DF11

This is a DFloat11 losslessly compressed version of the original `Qwen/Qwen-Image-Edit` model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, Qwen-Image-Edit can now run on a single 32GB GPU, or on a single 24GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

| Model | Model Size | Peak GPU Memory | Generation Time (A100 GPU) |
|------------------------------------------------|------------|-----------------|----------------------------|
| Qwen-Image-Edit (BFloat16) | ~41 GB | OOM | - |
| Qwen-Image-Edit (DFloat11) | 28.43 GB | 30.11 GB | 280 seconds |
| Qwen-Image-Edit (DFloat11 + CPU Offloading) | 28.43 GB | 22.71 GB | 570 seconds |

1. Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
3. Save the following code to a Python file `qwenimageedit.py`:
4. To run without CPU offloading (32GB VRAM required):

To run with CPU offloading (24GB VRAM and 50GB CPU RAM required):

If you are getting out-of-memory errors (CPU or GPU), try limiting the number of offloaded blocks or disabling memory-pinning.

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU. The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

97 downloads · 12 likes

Llama-3.1-8B-Instruct-DF11

DFloat11 Compressed Model: `meta-llama/Llama-3.1-8B-Instruct`

This is a losslessly compressed version of `meta-llama/Llama-3.1-8B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11
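Back-of-envelope on the ~30% figure (my rough accounting, consistent with the "11" in the DFloat11 name; the 3-bit exponent allowance below is an assumption covering the ~2.6 bits of entropy plus Huffman-table overhead):

```python
# Each BFloat16 weight stores 1 sign + 8 exponent + 7 mantissa = 16 bits.
bf16_bits = 16
sign_bits, mantissa_bits = 1, 7   # kept verbatim by DFloat11
exponent_bits = 3                 # ~2.6 bits of entropy + coding overhead (rough)
df11_bits = sign_bits + mantissa_bits + exponent_bits

reduction = 1.0 - df11_bits / bf16_bits
print(f"~{df11_bits} bits/weight -> ~{reduction:.0%} smaller")
```

Eleven effective bits out of sixteen gives a ~31% reduction, matching both the "approximately 30%" claim here and the ~32% figures quoted on the diffusion-model cards.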

19 downloads · 2 likes

Qwen3-8B-DF11 · 10 downloads · 2 likes

Mistral-Nemo-Instruct-2407-DF11

DFloat11 Compressed Model: `mistralai/Mistral-Nemo-Instruct-2407`

This is a losslessly compressed version of `mistralai/Mistral-Nemo-Instruct-2407` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

8 downloads · 0 likes

Qwen3-4B-DF11 · 6 downloads · 3 likes
OmniGen2-mllm-DF11 · 6 downloads · 1 like
gemma-3-4b-it-DF11 · 6 downloads · 0 likes
Qwen3-14B-DF11 · 6 downloads · 0 likes

QwQ-32B-DF11

This is a losslessly compressed version of `Qwen/QwQ-32B` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

5 downloads · 0 likes

gemma-3-12b-it-DF11 · 4 downloads · 0 likes

DeepSeek-R1-Distill-Llama-8B-DF11

DFloat11 Compressed Model: `deepseek-ai/DeepSeek-R1-Distill-Llama-8B`

This is a losslessly compressed version of `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

4 downloads · 0 likes

DeepSeek-R1-Distill-Qwen-32B-DF11 · 3 downloads · 0 likes
DeepSeek-R1-Distill-Qwen-14B-DF11 · 3 downloads · 0 likes
gemma-3-27b-it-DF11 · 2 downloads · 1 like

Qwen2.5-14B-Instruct-DF11

DFloat11 Compressed Model: `Qwen/Qwen2.5-14B-Instruct`

This is a losslessly compressed version of `Qwen/Qwen2.5-14B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 1 like

Llama-3.3-70B-Instruct-DF11

DFloat11 Compressed Model: `meta-llama/Llama-3.3-70B-Instruct`

This is a losslessly compressed version of `meta-llama/Llama-3.3-70B-Instruct` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 1 like

DeepSeek-R1-0528-Qwen3-8B-DF11 · 1 download · 1 like
Llama-3.1-405B-Instruct-DF11 · 1 download · 0 likes
Phi-4-reasoning-plus-DF11 · 1 download · 0 likes

Mistral-Small-24B-Instruct-2501-DF11

DFloat11 Compressed Model: `mistralai/Mistral-Small-24B-Instruct-2501`

This is a losslessly compressed version of `mistralai/Mistral-Small-24B-Instruct-2501` using our custom DFloat11 format. The outputs of this compressed model are bit-for-bit identical to the original BFloat16 model, while reducing GPU memory consumption by approximately 30%.

DFloat11 compresses model weights using Huffman coding of BFloat16 exponent bits, combined with hardware-aware algorithmic designs that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are decompressed just before matrix multiplications, then immediately discarded after use to minimize memory footprint.

- No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU.
- Decompression overhead is constant per forward pass and independent of batch size, making DFloat11 increasingly efficient at larger batch sizes.
- DFloat11 is much faster than CPU-offloading approaches, enabling practical deployment in memory-constrained environments.
- At batch size = 1, inference is approximately 2× slower than the original BF16 model, but the performance gap narrows significantly with larger batches.
- The compression is fully lossless, guaranteeing that the model's outputs are bit-for-bit identical to those of the original model.

1. Install the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
2. To use the DFloat11 model, run the following example code in Python:

Paper: 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float
GitHub: https://github.com/LeanModels/DFloat11
HuggingFace: https://huggingface.co/DFloat11

1 download · 0 likes

Qwen3-32B-DF11 · 0 downloads · 1 like
FLUX.1-Krea-dev-DF11-ComfyUI · 0 downloads · 1 like