Apertus-8B-2509-4bit-ASINQ
by huawei-csl
Language Model, 8B params, 2 languages, license: apache-2.0
Quick Summary
A 4-bit quantized release of swiss-ai/Apertus-8B-2509, produced with the calibrated A-SINQ variant of SINQ (Sinkhorn-Normalized Quantization) for memory-efficient inference.
Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 8GB+ RAM
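
As a rough cross-check of those figures, the packed 4-bit weights of an 8B model occupy about 4 GB before runtime overhead. The sketch below is back-of-the-envelope arithmetic only; the per-group metadata layout is an assumption, not SINQ's exact storage format:

```python
# Rough memory estimate for a 4-bit 8B checkpoint (a sketch; the
# per-group metadata assumption may differ from SINQ's actual layout).
params = 8.0e9        # parameter count
bits = 4              # quantization bit-width
group_size = 64       # matches the quantization config shown below

weights_gb = params * bits / 8 / 1e9             # ~4.0 GB packed weights
metadata_gb = params / group_size * 2 * 2 / 1e9  # assumed 2 fp16 values per group
print(f"~{weights_gb:.1f} GB weights + ~{metadata_gb:.1f} GB metadata")
# Activations, KV cache, and framework buffers come on top, which is
# why the mobile figure above is 4-6GB rather than a flat 4GB.
```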
Code Examples
Usage example

```python
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
model_name = "huawei-csl/Apertus-8B-2509-4bit-ASINQ"
device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device=device,
    compute_dtype=torch.bfloat16,
)
# OPTIONAL: uncomment the two lines below to further increase inference speed
# via torch.compile
# sinq_model.forward(torch.tensor([[0]]).to(device))
# sinq_model.forward = torch.compile(sinq_model.forward, dynamic=True, fullgraph=False, backend='inductor', mode='reduce-overhead')
template = """{% for m in messages -%}
{{ m['role'] }}: {{ m['content'] }}
{% endfor -%}
{% if add_generation_prompt %}assistant: {% endif %}"""
tokenizer.chat_template = template # set once per tokenizer
# prepare the model input
prompt = "Give me a brief explanation of gravity in simple terms."
messages_think = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
# Generate the output
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)
# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
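
To see what the quantized checkpoint actually costs on your hardware, a quick probe can be appended to the snippet above. This is a rough sketch that reuses the `sinq_model` and `model_inputs` defined there and relies only on standard torch.cuda utilities; throughput will vary with GPU, sequence length, and whether torch.compile is enabled:

```python
import time

# Measure peak VRAM and decode throughput for one generation call.
torch.cuda.reset_peak_memory_stats(device)
start = time.perf_counter()
generated_ids = sinq_model.generate(**model_inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, as in the decoding step above.
new_tokens = generated_ids.shape[1] - model_inputs.input_ids.shape[1]
peak_gb = torch.cuda.max_memory_allocated(device) / 1e9
print(f"{new_tokens / elapsed:.1f} tok/s, peak VRAM ~{peak_gb:.2f} GB")
```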
Load base model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig
# Load base model
base_model_name = "swiss-ai/Apertus-8B-2509"
model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="asinq",    # quantization method ("asinq" for the calibrated version)
)
qmodel = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)
```
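
After quantization, `qmodel` should behave like the pre-quantized checkpoint loaded in the first example, which calls `.generate()` on the SINQ model directly; this assumes `quantize_model` returns a generate-capable model resident on `cuda:0`, as its `device` argument suggests. A minimal smoke test with an illustrative prompt:

```python
# Minimal smoke test of the freshly quantized model (prompt is illustrative).
inputs = tokenizer(["Give me a brief explanation of gravity in simple terms."],
                   return_tensors="pt").to("cuda:0")
out = qmodel.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:],
                       skip_special_tokens=True))
```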
Citation

```bibtex
@misc{muller2025sinq,
title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
year={2025},
eprint={2509.22944},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={http://arxiv.org/abs/2509.22944}
}
```