Qwen3-Next-80B-A3B-Instruct-4bit-SINQ

Name: Qwen3-Next-80B-A3B-Instruct-4bit-SINQ
Author: huawei-csl

license:apache-2.0

huawei-csl

Language Model

OTHER

80B params

New

6 downloads

Early-stage

Try on Hugging Face Add to Compare

Edge AI:

Mobile

Laptop

Server

179GB+ RAM

Mobile

Laptop

Server

Quick Summary

AI model with specialized capabilities.

Device Compatibility

Mobile

4-6GB RAM

Laptop

16GB RAM

Server

GPU

Minimum Recommended

75GB+ RAM

Code Examples

Usage examplepythontransformers

import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Qwen3-Next-80B-A3B-Instruct-4bit-SINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device="cuda:0",
    compute_dtype=torch.bfloat16
)

# prepare the model input
prompt = "Explain neural network quantization in one sentence."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(sinq_model.device)
    
# conduct text completion
generated_ids = sinq_model.generate(
    **model_inputs,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    min_p=0.0,
    max_new_tokens=16384,
)

output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Load base modelpythontransformers

from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
base_model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,            # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",   # tiling strategy
    method="sinq"       # quantization method ("asinq" for the calibrated version)
)

sinq_model = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0"
)

bibtex

@misc{muller2025sinq,
      title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights}, 
      author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
      year={2025},
      eprint={2509.22944},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2509.22944}
}

Deploy This Model

Production-ready deployment in minutes

Together.ai

Instant API access to this model

Fastest API

Production-ready inference API. Start free, scale to millions.

Try Free API

Replicate

One-click model deployment

Easiest Setup

Run models in the cloud with simple API. No DevOps required.

Deploy Now

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.