Llama-3.2-1B-Instruct-quantized.w8a8
10.1K
7
1.0B
8 languages
llama
by
RedHatAI
Language Model
OTHER
1B params
Fair
10K downloads
Community-tested
Edge AI:
Mobile
Laptop
Server
3GB+ RAM
Quick Summary
Model Overview
- Model Architecture: Llama-3
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: INT8
  - Weight quantization: INT8
  - I...
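The summary above lists INT8 for both weights and activations. As a rough illustration only, the sketch below shows symmetric per-tensor INT8 quantization in plain Python. The helper names `quantize_int8` and `dequantize_int8` are hypothetical, and the actual checkpoint may use a different granularity (for example per-channel or per-token scales).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Round-to-nearest keeps the reconstruction error within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored in one byte plus a shared scale, which is where the memory saving over 16-bit weights comes from.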
Device Compatibility
Mobile
4-6GB RAM
Laptop
16GB RAM
Server
GPU
Minimum Recommended
1GB+ RAM
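The RAM figures above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes the rounded 1.0B parameter count listed on this page and ignores any layers left unquantized, the KV cache, and runtime overhead, so the results are rough lower bounds rather than measurements. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.0e9  # rounded figure from this page (assumption)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int8_gb = weight_memory_gb(NUM_PARAMS, 8)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"INT8 weights:   ~{int8_gb:.1f} GB")
```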
Training Data Analysis
🟡 Average (4.8/10)
Researched training datasets used by Llama-3.2-1B-Instruct-quantized.w8a8 with quality assessment
Specialized For
general
science
multilingual
reasoning
Training Datasets (4)
common crawl
🔴 2.5/10
general
science
Key Strengths
- Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
- Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
- Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...
Considerations
- Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
- Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...
c4
🔵 6/10
general
multilingual
Key Strengths
- Scale and Accessibility: 750GB of publicly available, filtered text
- Systematic Filtering: Documented heuristics enable reproducibility
- Language Diversity: Although English-only, it captures diverse writing styles
Considerations
- English-Only: Limits multilingual applications
- Filtering Limitations: Offensive content and low-quality text remain despite filtering
wikipedia
🟡 5/10
science
multilingual
Key Strengths
- High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
- Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
- Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...
Considerations
- Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
- Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...
arxiv
🟡 5.5/10
science
reasoning
Key Strengths
- Scientific Authority: Scholarly preprints from an established repository (moderated, but not formally peer-reviewed)
- Domain-Specific: Specialized vocabulary and concepts
- Mathematical Content: Includes complex equations and notation
Considerations
- Specialized: Primarily technical and mathematical content
- English-Heavy: Predominantly English-language papers
Code Examples
Deployment (python, transformers)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
number_gpus = 1
max_model_len = 8192

# Sampling settings: moderate temperature, nucleus sampling, up to 256 new tokens
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# The tokenizer supplies the chat template used to format the conversation
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the messages into one prompt string that ends with the assistant turn header
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the quantized model in vLLM and generate a completion
llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
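The `SamplingParams` in the deployment example set `temperature=0.6` and `top_p=0.9`. As a conceptual sketch only (this is not vLLM's implementation, and the helper names and toy logits are hypothetical), temperature rescales the logits before the softmax, and top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches the threshold.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Return the token indices kept by top-p (nucleus) filtering."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # toy values for a four-token vocabulary
probs = softmax_with_temperature(logits, temperature=0.6)
kept = nucleus_filter(probs, top_p=0.9)
print(probs, kept)
```

A sampler would then renormalize the probabilities of the kept tokens and draw the next token from that reduced set.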
Creation (python, transformers)
# Import paths as in the original example; newer llmcompressor releases may expose these differently
from transformers import AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier, SmoothQuantModifier

model_id = "meta-llama/Llama-3.2-1B-Instruct"
num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(example):
    # Render each calibration conversation with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

# Build the calibration set: shuffle, keep num_samples examples, apply the template
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle().select(range(num_samples))
ds = ds.map(preprocess_fn)

# Recipe: SmoothQuant first, then GPTQ to quantize Linear layers to W8A8
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.7,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        sequential=True,
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],  # keep the output head unquantized
        dampening_frac=0.01,
    ),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

# Apply the recipe in one shot using the calibration data
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

model.save_pretrained("Llama-3.2-1B-Instruct-quantized.w8a8")
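The `SmoothQuantModifier` step in the recipe above shifts quantization difficulty from activations to weights before GPTQ runs. The following is a minimal sketch of that idea, assuming the commonly cited SmoothQuant scaling `s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)` with `alpha` playing the role of `smoothing_strength`. The helper names and toy matrices are illustrative, and llmcompressor's internals may differ in detail.

```python
def smoothing_scales(activations, weights, alpha):
    """Per-input-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).

    activations: list of rows, shape [tokens][in_channels]
    weights:     list of rows, shape [in_channels][out_channels]
    """
    scales = []
    for j in range(len(weights)):
        act_max = max(abs(row[j]) for row in activations)
        weight_max = max(abs(w) for w in weights[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def matmul(a, b):
    """Plain-Python matrix product, enough for this toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Channel 1 carries an activation outlier (50.0) that would dominate an INT8 range.
X = [[1.0, 50.0], [-2.0, 40.0]]
W = [[0.5, -0.25], [0.1, 0.2]]

s = smoothing_scales(X, W, alpha=0.7)
X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]
W_smooth = [[w * s[j] for w in W[j]] for j in range(len(W))]

# The layer output is mathematically unchanged; only the value ranges shift.
print(matmul(X, W))
print(matmul(X_smooth, W_smooth))
```

Dividing the activations and multiplying the matching weight rows by the same scale leaves the product intact while shrinking the activation outlier, which makes the activations easier to represent in 8 bits.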
Reproduction (text, vllm)
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",dtype=auto,add_bos_token=True,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama_3.1_instruct \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
Deploy This Model
Production-ready deployment in minutes
Together.ai
Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Replicate
One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.