Qwen3-VL-8B-Thinking-FP8
Context length: 262K (long context)
Parameters: 8.0B
license:apache-2.0
by Qwen
Image Model
OTHER
8B params
New
5K downloads
Early-stage
Edge AI: Mobile, Laptop, Server
18GB+ RAM
Quick Summary
> This repository contains an FP8 quantized version of the Qwen3-VL-8B-Thinking model.
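As a rough illustration of what FP8 quantization means for memory, the sketch below estimates the size of the weights alone at one byte per parameter (FP8) versus two bytes (BF16), using the 8.0B parameter count listed on this page. It is a back-of-the-envelope estimate under stated assumptions: it treats every parameter as stored in the named format (in practice some tensors may stay in higher precision and quantization scales add overhead), and it ignores the KV cache, activations, and framework overhead. The helper name `weight_memory_gib` is made up for this example.

```python
# Bytes used to store one parameter in each format.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate weight storage in GiB, assuming every parameter uses `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    params = 8.0e9  # parameter count as listed on this page
    for dtype in ("bf16", "fp8"):
        print(f"{dtype}: ~{weight_memory_gib(params, dtype):.1f} GiB")
    # bf16: ~14.9 GiB
    # fp8: ~7.5 GiB
```

Actual memory use at inference time will be higher than these figures, since long contexts in particular can make the KV cache a large share of the total.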
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 8GB+ RAM
Code Examples
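Because this is a Thinking variant, the generated text usually contains the model's reasoning followed by the final answer. The helper below is a minimal post-processing sketch, assuming the reasoning ends with a `</think>` tag as in other Qwen3 thinking models; that delimiter is an assumption here and should be checked against real output. The function name `split_thinking` is hypothetical and not part of any library.

```python
def split_thinking(generated: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Split generated text into (reasoning, answer).

    Assumes the reasoning section, if present, ends at the first `end_tag`.
    If the tag is missing, the whole text is treated as the answer.
    """
    head, sep, tail = generated.partition(end_tag)
    if not sep:
        return "", generated.strip()
    # Some chat templates emit the opening tag in the prompt, others in the
    # output; drop it if it appears at the start of the reasoning.
    reasoning = head.strip().removeprefix("<think>").strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "The receipt lists two items.\n</think>\n\nTotal: 12.50"
    reasoning, answer = split_thinking(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

The function is pure string handling, so it can be applied to the text returned by any inference backend.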
SGLang Inference (python, transformers)
import time

import torch
from sglang import Engine
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor

if __name__ == "__main__":
    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-8B-Thinking-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)

    # A single user turn: one image URL plus a text instruction.
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    # Render the chat template to a prompt string (tokenization is left to the engine).
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load the image(s) referenced in the messages.
    image_inputs, _ = process_vision_info(messages, image_patch_size=processor.image_processor.patch_size)

    # Shard the model across all visible GPUs.
    llm = Engine(
        model_path=checkpoint_path,
        enable_multimodal=True,
        mem_fraction_static=0.8,
        tp_size=torch.cuda.device_count(),
        attention_backend="fa3"
    )

    start = time.time()
    sampling_params = {"max_new_tokens": 1024}
    response = llm.generate(prompt=text, image_data=image_inputs, sampling_params=sampling_params)
    print(f"Response costs: {time.time() - start:.2f}s")
    print(f"Generated text: {response['text']}")

Deploy This Model
Production-ready deployment in minutes
Together.ai
Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Replicate
One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.