Qwen3-VL-30B-A3B-Thinking-FP8
by Qwen

- Type: Image model
- Parameters: 30.0B
- Context length: 262K tokens (long context)
- License: apache-2.0
- Downloads: 100.8K
- Likes: 45
- Status: Production-ready
Edge AI: Mobile · Laptop · Server (68GB+ RAM)
Quick Summary

Qwen3-VL-30B-A3B-Thinking-FP8 is an FP8-quantized vision-language "thinking" model from Qwen with 30B total parameters and a 262K-token context window, released under the Apache-2.0 license.
Device Compatibility

- Mobile: 4-6GB RAM
- Laptop: 16GB RAM
- Server: GPU
- Minimum recommended: 28GB+ RAM
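The 28GB+ recommendation tracks the raw weight footprint: 30B parameters at FP8 (one byte each) occupy roughly 30GB before KV cache and activation overhead. A back-of-envelope sketch, weights only:

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_param / 8

# 30B parameters at various precisions (GB, weights only)
for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(30.0, bits):.0f} GB")
# FP16: 60 GB / FP8: 30 GB / INT4: 15 GB
```

Runtime memory on top of this depends on context length, batch size, and the inference engine's allocator settings.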
Code Examples
SGLang Inference (Python, transformers)

```python
import time

import torch
from sglang import Engine
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor

if __name__ == "__main__":
    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-30B-A3B-Thinking-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    # Render the chat template to a prompt string (not token IDs)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    # Extract and preprocess the image inputs referenced in the messages
    image_inputs, _ = process_vision_info(
        messages, image_patch_size=processor.image_processor.patch_size
    )

    llm = Engine(
        model_path=checkpoint_path,
        enable_multimodal=True,
        mem_fraction_static=0.8,
        tp_size=torch.cuda.device_count(),
        attention_backend="fa3",
    )

    start = time.time()
    sampling_params = {"max_new_tokens": 1024}
    response = llm.generate(
        prompt=text, image_data=image_inputs, sampling_params=sampling_params
    )
    print(f"Response took: {time.time() - start:.2f}s")
    print(f"Generated text: {response['text']}")
```

Deploy This Model
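At the full 262K context, KV-cache memory adds substantially on top of the FP8 weights. A rough sketch of the standard KV-cache size formula; the layer count, KV-head count, and head dimension below are illustrative placeholders, not this model's actual architecture:

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB for one sequence: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len / 1e9

# Hypothetical dimensions, for illustration only
print(f"{kv_cache_gb(262_144, n_layers=48, n_kv_heads=4, head_dim=128):.1f} GB")
# ≈ 25.8 GB for a single 262K-token sequence at FP16 cache precision
```

This is why `mem_fraction_static` matters in the SGLang example above: it reserves a share of GPU memory for weights plus the KV-cache pool.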
Production-ready deployment in minutes:

- Together.ai — instant API access to this model. Production-ready inference API; start free, scale to millions.
- Replicate — one-click model deployment. Run models in the cloud with a simple API; no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.