Devstral-Small-2-24B-Instruct-SINQ-4bit

by maxence-bouvier
Language Model
24B params
license: apache-2.0
66 downloads
Quick Summary

A 4-bit SINQ-quantized build of Devstral-Small-2-24B-Instruct, a 24B-parameter instruction-tuned language model aimed at coding tasks. The 4-bit weights cut the memory footprint to roughly a quarter of the fp16 original, making the model practical on a single consumer GPU.
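For intuition, 4-bit weight quantization stores each weight as one of 16 integer levels plus a shared scale per group. The sketch below is plain round-to-nearest quantization, not the actual SINQ algorithm (which additionally normalizes scales across rows and columns of the weight matrix); it only illustrates the storage idea:

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Symmetric round-to-nearest 4-bit quantization of one weight group."""
    scale = np.abs(w).max() / 7  # int4 codes span [-8, 7]; use symmetric +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, float(scale)

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from codes and the group scale."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -0.08, 0.95, -0.47], dtype=np.float32)
q, s = quantize_4bit(w)
print(q)                      # integer codes, 4 bits of information each
print(dequantize_4bit(q, s))  # reconstruction, close to the original weights
```

Each stored weight costs 4 bits instead of 16, which is where the roughly 4x memory saving over fp16 comes from.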

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Recommended minimum: 23GB+ RAM
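As a rough sanity check on these numbers, the raw weight footprint is simply parameter count times bits per weight; the gap up to the 23GB+ recommendation covers activations, KV cache, and quantization metadata (the exact overhead is an assumption here, not a published figure):

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# 24B parameters at 4 bits -> 12 GB of raw weights
print(weight_footprint_gb(24e9, 4))   # 12.0
# The same weights at fp16 would be 48 GB
print(weight_footprint_gb(24e9, 16))  # 48.0
```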

Code Examples

Load processor (handles tokenization and chat templates)

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Repo id assumed from the model name on this page; adjust if your local path
# differs. Note: SINQ-quantized checkpoints may require the SINQ loader rather
# than plain transformers Auto classes; the lines below are an assumption.
model_id = "maxence-bouvier/Devstral-Small-2-24B-Instruct-SINQ-4bit"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda"
)
def _build_unicode_to_bytes_map() -> dict[str, int]:
    """Build inverse of GPT-2's bytes_to_unicode mapping."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}

_UNICODE_TO_BYTE = _build_unicode_to_bytes_map()

def fix_byte_encoding(text: str) -> str:
    """Fix byte-level BPE encoding for proper emoji/unicode display.

    Example: "ðŁ¤Ĺ" -> "🤗"
    """
    try:
        byte_values = bytes([_UNICODE_TO_BYTE.get(c, ord(c)) for c in text])
        return byte_values.decode("utf-8")
    except (UnicodeDecodeError, KeyError, ValueError):
        return text
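
# Quick sanity check (pure Python, safe to run anywhere): the inverse map
# should cover all 256 byte values, "ðŁ¤Ĺ" is the byte-level rendering of
# the 🤗 emoji under GPT-2's byte-to-unicode map, and plain ASCII text
# should pass through unchanged.
assert len(_UNICODE_TO_BYTE) == 256
assert fix_byte_encoding("ðŁ¤Ĺ") == "🤗"
assert fix_byte_encoding("plain ascii text") == "plain ascii text"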

messages = [
    {"role": "user", "content": "Write a Python function to check if a number is prime."}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
response = fix_byte_encoding(response)  # Fix emoji/unicode display
print(response)
Quick Test
# Minimal test to verify the model works
messages = [{"role": "user", "content": "Say hello"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50, do_sample=False,
                         pad_token_id=processor.tokenizer.eos_token_id)
response = fix_byte_encoding(processor.decode(out[0], skip_special_tokens=True))
print(response)

Deploy This Model

Production-ready deployment in minutes.

Together.ai (Fastest API): Instant API access to this model. Production-ready inference API; start free, scale to millions.

Replicate (Easiest Setup): One-click model deployment. Run models in the cloud with a simple API, no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.