# DeepSeek-OCR-bnb-4bit-NF4
🌟 GitHub | 📥 Model Download | 📄 Paper Link | 📄 arXiv Paper Link
This is a 4-bit NF4 quantized version of `deepseek-ai/DeepSeek-OCR`, created using `bitsandbytes`. It significantly reduces VRAM usage (to roughly 8 GB) while maintaining high accuracy, making it well suited to consumer GPUs.
For optimal compatibility, we strongly recommend creating a virtual environment using `uv` with Python 3.12.9. This matches the test environment of the original `deepseek-ai/DeepSeek-OCR` model.
Prerequisites: You must have the NVIDIA CUDA Toolkit installed on your system (e.g., CUDA 11.8, matching your PyTorch build) in order to compile `flash-attn`.
Below are the recommended library versions, based on the original model's requirements, plus the libraries needed for 4-bit loading (`bitsandbytes`, `accelerate`) and PyTorch compatibility (`torchvision`).
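As a sketch of the setup steps described above (package names are taken from this card; pin versions to match the original model's requirements where they are given), the environment might be created like this:

```shell
# Create a Python 3.12.9 virtual environment with uv, matching the
# original DeepSeek-OCR test environment, and activate it.
uv venv --python 3.12.9
source .venv/bin/activate

# Libraries needed for 4-bit loading and PyTorch compatibility
# (unpinned here; prefer the pinned versions from the requirements list).
uv pip install bitsandbytes accelerate torchvision

# flash-attn compiles against the locally installed CUDA toolkit,
# so build isolation is disabled.
uv pip install flash-attn --no-build-isolation
```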
## Usage (Inference Code)

Once your environment is set up, you can use one of the two code blocks provided below.
The main difference is the `attn_implementation` parameter, which depends on your GPU architecture:
1. `attn_implementation='flash_attention_2'`: Recommended for NVIDIA Ampere (RTX 30xx, A100) or newer GPUs.
2. `attn_implementation='eager'`: Required for NVIDIA Turing (RTX 20xx) GPUs and older, or as a general fallback if Flash Attention 2 fails.
This 4-bit model was successfully tested on an RTX 2080 Ti (Turing architecture) with CUDA 13.0 drivers, which requires the `'eager'` implementation. Both code blocks are provided for your convenience.
I would like to thank DeepSeek and their entire team for making this incredible range of models available. I also want to thank the Hugging Face community for their amazing platform and everyone who contributes to making it such a welcoming place.
## Citation

```bibtex
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```