Kimi-Linear-48B-A3B-Instruct-4bit-SINQ
license:apache-2.0
by huawei-csl
Language Model
OTHER
48B params
New
26 downloads
Early-stage
Edge AI:
Mobile
Laptop
Server
108GB+ RAM
Quick Summary
A 4-bit SINQ-quantized build of moonshotai/Kimi-Linear-48B-A3B-Instruct, a 48B-parameter language model, published by huawei-csl.
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 45GB+ RAM
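To put the RAM figures above in context, here is a back-of-the-envelope sketch of weight storage for a 48B-parameter model at 4 bits. The helper name `quantized_weight_gib` and the per-group metadata size (one 16-bit scale plus one 16-bit offset per group of 64 weights) are assumptions made for illustration, not the documented SINQ storage layout. The estimate also ignores layers that stay unquantized, activations, and any cache memory, so real usage will be higher.

```python
def quantized_weight_gib(n_params, nbits=4, group_size=64, meta_bits_per_group=32):
    """Rough size of group-quantized weights in GiB.

    meta_bits_per_group is an assumption: one 16-bit scale plus one 16-bit
    offset per group. The actual SINQ format may store different metadata.
    """
    weight_bits = n_params * nbits
    meta_bits = (n_params / group_size) * meta_bits_per_group
    return (weight_bits + meta_bits) / 8 / 2**30


n_params = 48e9  # nominal parameter count from the model name

bf16_gib = n_params * 16 / 8 / 2**30
int4_gib = quantized_weight_gib(n_params)

print(f"bf16 weights      : {bf16_gib:.1f} GiB")
print(f"4-bit, group of 64: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a bit more than a quarter of the bf16 footprint, since the per-group metadata adds about half a bit per weight.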
Code Examples
Usage example (Python, transformers)
import torch
from transformers import AutoTokenizer
from sinq.patch_model import AutoSINQHFModel

model_name = "huawei-csl/Kimi-Linear-48B-A3B-Instruct-4bit-SINQ"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
device = "cuda:0"

# Load the pre-quantized 4-bit weights from safetensors
sinq_model = AutoSINQHFModel.from_quantized_safetensors(
    model_name,
    device=device,
    attn_implementation="kernels-community/flash-attn2",
    compute_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Test the quantized model
messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
    {"role": "user", "content": "Is 7 a prime?"},
]

# Render the chat template to a plain string; tokenization happens in the next step
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a string rather than token ids
    add_generation_prompt=True,  # end the prompt with the assistant turn
)
inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = {k: v.to(sinq_model.device) for k, v in inputs.items()}

# Fall back to the EOS token if the tokenizer defines no pad token
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

with torch.no_grad():
    generated_ids = sinq_model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs.get("attention_mask", None),
        max_new_tokens=200,
    )

# Keep only the newly generated tokens (drop the prompt)
new_tokens = generated_ids[0, inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)

Load base model (Python, transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sinq.patch_model import AutoSINQHFModel
from sinq.sinqlinear import BaseQuantizeConfig

# Load base model
model_name = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="kernels-community/flash-attn2",
)

# Apply 4-bit SINQ quantization
quant_cfg = BaseQuantizeConfig(
    nbits=4,           # quantization bit-width
    group_size=64,     # group size
    tiling_mode="1D",  # tiling strategy
    method="sinq",     # quantization method ("asinq" for the calibrated version)
)
sinq_model = AutoSINQHFModel.quantize_model(
    model,
    tokenizer=tokenizer,
    quant_config=quant_cfg,
    compute_dtype=torch.bfloat16,
    device="cuda:0",
)

Citation (BibTeX)
@misc{muller2025sinq,
  title={SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights},
  author={Lorenz K. Muller and Philippe Bich and Jiawei Zhuang and Ahmet Celik and Luca Benfenati and Lukas Cavigelli},
  year={2025},
  eprint={2509.22944},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={http://arxiv.org/abs/2509.22944}
}
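
For readers unfamiliar with the `nbits` and `group_size` settings used in the quantization example, the sketch below shows plain group-wise round-to-nearest quantization in pure Python. It is not the SINQ algorithm (the cited paper describes Sinkhorn-based normalization, which is omitted here). It only illustrates what storing 4-bit codes with one scale and offset per group of 64 weights means. All function names are hypothetical.

```python
import math


def quantize_group(values, nbits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2**nbits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant groups
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def quantize_row(row, nbits=4, group_size=64):
    """Split a weight row into groups and quantize each one independently."""
    return [
        quantize_group(row[i:i + group_size], nbits)
        for i in range(0, len(row), group_size)
    ]


# Toy "weight row" of 128 values, i.e. two groups of 64
row = [math.sin(0.1 * i) for i in range(128)]
groups = quantize_row(row)
restored = [x for codes, scale, lo in groups
            for x in dequantize_group(codes, scale, lo)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.4f}")
```

Because each group carries its own scale and offset, the reconstruction error within a group is bounded by half of that group's scale. Smaller groups therefore tend to reduce error, at the cost of more metadata.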