SmolLM3-3B-ONNX
by HuggingFaceTB

380 downloads · 19 likes · 3.0B params · 8 languages · FP16 · license: apache-2.0
Language Model · Other · New · Early-stage
Edge AI: Mobile · Laptop · Server (7GB+ RAM)
Quick Summary
SmolLM3-3B is a 3B-parameter multilingual instruction-tuned language model from Hugging Face. This repository packages it as ONNX exports for on-device and in-browser inference, runnable with Transformers.js (WebGPU) or ONNX Runtime in Python, with an optional extended-thinking mode toggled by the /think and /no_think system flags.
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 3GB+ RAM
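To map these tiers to files in this repo, here is a rough sketch that picks an ONNX variant by available memory. The model_fp16.onnx filename and the RAM thresholds are assumptions for illustration (check the repo's onnx/ folder for the variants it actually ships); only onnx/model_q4.onnx is confirmed by the Python example below.

# Sketch: choose an ONNX variant based on available RAM.
# Assumptions: the repo ships onnx/model_fp16.onnx alongside onnx/model_q4.onnx,
# and ~16GB RAM comfortably fits fp16 weights (~6GB for a 3B model).
import psutil  # pip install psutil

def pick_variant() -> str:
    ram_gb = psutil.virtual_memory().total / 1e9
    if ram_gb >= 16:
        return "onnx/model_fp16.onnx"  # laptop/server tier (assumed filename)
    return "onnx/model_q4.onnx"        # 4-bit weights for RAM-constrained devices

print(pick_variant())  # pass the result as filename= to hf_hub_download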
Code Examples
How to use (JavaScript, ONNX)
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM3-3B-ONNX",
  { dtype: "q4f16", device: "webgpu" },
);
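// Note: "q4f16" selects 4-bit weights with fp16 activations; Transformers.js
// also accepts dtypes such as "fp32", "fp16", "q8", and "q4" when the repo
// ships those ONNX variants. device: "webgpu" runs in the browser via WebGPU;
// omitting it falls back to the default WASM backend.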
// Define the model inputs
const thinking = true; // Whether the model should think before answering
const messages = [
  {
    role: "system",
    content: "You are SmolLM, a language model created by Hugging Face."
      + (thinking ? " /think" : " /no_think"),
  },
  { role: "user", content: "Solve the equation x^2 - 3x + 2 = 0" },
];
// Generate a response
const output = await generator(messages, {
  max_new_tokens: 1024,
  streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true }),
});
console.log(output[0].generated_text.at(-1).content);

ONNX Runtime (Python, transformers)
from transformers import AutoConfig, AutoTokenizer
import onnxruntime
import numpy as np
from huggingface_hub import hf_hub_download
# 1. Load config, tokenizer, and model
model_id = "HuggingFaceTB/SmolLM3-3B-ONNX"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_path = hf_hub_download(repo_id=model_id, filename="onnx/model_q4.onnx") # Download the graph
hf_hub_download(repo_id=model_id, filename="onnx/model_q4.onnx_data") # Download the model weights
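## Note: the quantized graph stores its weights in the separate .onnx_data file;
## ONNX Runtime resolves it relative to the model file, and hf_hub_download
## places both in the same snapshot directory.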
decoder_session = onnxruntime.InferenceSession(model_path)
## Set config values
num_key_value_heads = config.num_key_value_heads
head_dim = config.hidden_size // config.num_attention_heads
num_hidden_layers = config.num_hidden_layers
eos_token_id = config.eos_token_id
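## SmolLM3 uses grouped-query attention, so the KV cache is laid out per
## key/value head (num_key_value_heads), fewer than num_attention_heads.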
# 2. Prepare inputs
messages = [
{ "role": "system", "content": "/no_think" },
{ "role": "user", "content": "What is the capital of France?" },
]
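## The "/no_think" system flag disables SmolLM3's extended thinking mode;
## swap in "/think" to let the model reason before answering.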
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="np")
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
batch_size = input_ids.shape[0]
past_key_values = {
f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
for layer in range(num_hidden_layers)
for kv in ('key', 'value')
}
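## Each cache tensor has shape [batch, num_kv_heads, past_seq_len, head_dim];
## past_seq_len starts at 0 here and grows by one token per decode step.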
position_ids = np.tile(np.arange(0, input_ids.shape[-1]), (batch_size, 1))
# 3. Generation loop
max_new_tokens = 1024
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
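    ## One decode step: run the graph on the current tokens plus cache,
    ## then greedily take the argmax of the last position's logits.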
    logits, *present_key_values = decoder_session.run(None, dict(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        **past_key_values,
    ))

    ## Update values for next generation loop
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.concatenate([attention_mask, np.ones_like(input_ids, dtype=np.int64)], axis=-1)
    position_ids = position_ids[:, -1:] + 1
    for j, key in enumerate(past_key_values):
        past_key_values[key] = present_key_values[j]
    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if (input_ids == eos_token_id).all():
        break

    ## (Optional) Streaming
    print(tokenizer.decode(input_ids[0]), end='', flush=True)
print()
# 4. Output result
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])

Deploy This Model
Production-ready deployment in minutes
Together.ai: Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Try Free API

Replicate: One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Deploy Now

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.