Llama-4-Scout-17B-16E-Instruct-NVFP4
8 languages
llama4
by RedHatAI
Language Model
OTHER
17B params
New
389 downloads
Early-stage
Edge AI: Mobile, Laptop, Server
38GB+ RAM
Quick Summary
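Llama-4-Scout-17B-16E-Instruct-NVFP4 is a llama4-architecture language model published by RedHatAI. The NVFP4 suffix refers to the quantization scheme applied to its Linear layers in the code example on this page.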
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 16GB+ RAM
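The RAM figures above can be sanity-checked with a back-of-the-envelope estimate of weight storage. The sketch below is only illustrative: `weight_memory_gb` is a hypothetical helper, the 17B figure is the parameter count listed on this page, and the 4.5 bits per weight for NVFP4 is an assumption (4-bit values plus one 8-bit scale per block of 16 values). It ignores activations, the KV cache, and the layers that the quantization recipe leaves unquantized, and if the listed 17B counts only the parameters active per token in a mixture-of-experts model, the stored total would be larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 17e9  # "17B params" as listed on this page

# bfloat16 stores 16 bits per weight.
bf16_gb = weight_memory_gb(PARAMS, 16)

# Assumption: NVFP4 costs about 4.5 bits per weight
# (4-bit value + one 8-bit scale shared by 16 values).
fp4_gb = weight_memory_gb(PARAMS, 4.5)

print(f"bf16: {bf16_gb:.1f} GB, NVFP4 (assumed 4.5 bits): {fp4_gb:.1f} GB")
```

Treat the result as a lower bound on weight storage for the listed parameter count, not as a device requirement.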
Training Data Analysis
🟡 Average (4.8/10)
Researched training datasets used by Llama-4-Scout-17B-16E-Instruct-NVFP4 with quality assessment
Specialized For
general
science
multilingual
reasoning
Training Datasets (4)
common crawl
🔴 2.5/10
general
science
Key Strengths
- Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
- Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
- Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...
Considerations
- Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
- Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...
c4
🔵 6/10
general
multilingual
Key Strengths
- Scale and Accessibility: 750GB of publicly available, filtered text
- Systematic Filtering: Documented heuristics enable reproducibility
- Language Diversity: Although the corpus is English-only, it captures diverse writing styles
Considerations
- English-Only: Limits multilingual applications
- Filtering Limitations: Offensive content and low-quality text remain despite filtering
wikipedia
🟡 5/10
science
multilingual
Key Strengths
- High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
- Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
- Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...
Considerations
- Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
- Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...
arxiv
🟡 5.5/10
science
reasoning
Key Strengths
- Scientific Authority: Research preprints from an established repository (submissions are moderated, not peer-reviewed)
- Domain-Specific: Specialized vocabulary and concepts
- Mathematical Content: Includes complex equations and notation
Considerations
- Specialized: Primarily technical and mathematical content
- English-Heavy: Predominantly English-language papers
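The overall 4.8/10 rating is consistent with a plain average of the four dataset scores listed above. The check below is a minimal sketch; that the page computes its overall score as an unweighted mean is an assumption, not something the page states.

```python
# Per-dataset scores as listed on this page.
scores = {
    "common crawl": 2.5,
    "c4": 6.0,
    "wikipedia": 5.0,
    "arxiv": 5.5,
}

# Assumption: the overall rating is the unweighted mean of the dataset scores.
mean_score = sum(scores.values()) / len(scores)

print(f"mean of {len(scores)} scores: {mean_score:.2f}")
```

The mean is 4.75, which matches the displayed 4.8 when rounded to one decimal place.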
Code Examples
import gc

import torch
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils.dev import skip_weights_initialize
from transformers import Llama4ForConditionalGeneration, Llama4Processor
from transformers.models.llama4.modeling_llama4 import Llama4TextMLP
from transformers.quantizers.quantizers_utils import get_module_from_name

def convert_model_for_quantization(model):
    to_delete = []
    for name, module in model.named_modules():
        module_class_name = module.__class__.__name__
        if module_class_name == "Llama4TextMoe":
            parent_module, module_name = get_module_from_name(model, name)
            parent_module._modules[module_name] = SequentialLlama4TextMoe(
                model.config.get_text_config(),
                module,
            )
            to_delete.append(module)
            print(f"Patched {name} with SequentialLlama4TextMoe", flush=True)

    for module in to_delete:
        del module
    gc.collect()
    torch.cuda.empty_cache()

class SequentialLlama4TextMoe(torch.nn.Module):
    def __init__(self, config, original_moe):
        super().__init__()
        self.top_k = config.num_experts_per_tok
        self.hidden_dim = config.hidden_size
        self.num_experts = config.num_local_experts
        self.experts = SequentialLlama4TextExperts(config, original_moe.experts)
        self.router = original_moe.router
        self.shared_expert = original_moe.shared_expert

    def forward(self, hidden_states):
        hidden_states = hidden_states.reshape(-1, self.hidden_dim)
        router_logits = self.router(hidden_states)

        # Keep the top-k logits per token; all other experts get -inf,
        # which the sigmoid below maps to a weight of 0.
        router_top_value, router_indices = torch.topk(router_logits, self.top_k, dim=1)
        router_scores = (
            torch.full_like(router_logits, float("-inf"))
            .scatter_(1, router_indices, router_top_value)
            .transpose(0, 1)
        )
        router_scores = torch.sigmoid(router_scores.float()).to(hidden_states.dtype)

        # Every expert runs on every token; unselected experts contribute zero.
        out = self.shared_expert(hidden_states)
        for i in range(self.num_experts):
            out += self.experts[i](hidden_states) * router_scores[i].reshape(-1, 1)

        return out, router_scores

class SequentialLlama4TextExperts(torch.nn.ModuleList):
    def __init__(self, config, original_experts):
        self.num_experts = original_experts.gate_up_proj.shape[0]
        with skip_weights_initialize():
            super().__init__([Llama4TextMLP(config) for _ in range(self.num_experts)])

        intermediate_size = original_experts.down_proj.shape[1]

        # Split each expert's fused gate/up weights and copy them into
        # a separate Llama4TextMLP (transposed into that module's weight layout).
        for i in range(self.num_experts):
            gate_up = original_experts.gate_up_proj[i]
            down = original_experts.down_proj[i]

            gate_proj = gate_up[:, :intermediate_size]
            up_proj = gate_up[:, intermediate_size:]

            self[i].gate_proj.weight.data = gate_proj.t().clone().contiguous()
            self[i].up_proj.weight.data = up_proj.t().clone().contiguous()
            self[i].down_proj.weight.data = down.t().clone().contiguous()

        # Release the original fused tensors.
        original_experts.gate_up_proj = None
        original_experts.down_proj = None
        gc.collect()
        torch.cuda.empty_cache()

model_id = "meta-llama/Llama-4-Scout-17B-16E"
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16  # load on cpu
)
processor = Llama4Processor.from_pretrained(model_id)
convert_model_for_quantization(model)
# Oneshot arguments
DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192
ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess_function(example):
    # Convert each message into the structured content format
    # expected by the processor's chat template.
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    ).to("cuda:0")


ds = ds.map(
    preprocess_function,
    batched=False,
    remove_columns=ds.column_names,
)

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {
        key: torch.tensor(value)
        if key != "pixel_values"
        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        for key, value in batch[0].items()
    }

# Recipe
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    # Modules left unquantized.
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model",
        "re:.*multi_modal_projector",
        "Llama4TextAttention",
    ],
    sequential_targets=["Llama4TextMLP"],
)
SAVE_DIR = f"{model_id.split('/')[1]}-{recipe.scheme}"
# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
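
The routing step in SequentialLlama4TextMoe.forward in the script above can be easier to follow on a toy example. The sketch below reimplements the same idea in plain Python, without torch: keep the top-k router logits per token, set the rest to -inf so the sigmoid turns them into a weight of exactly 0, then add each expert's weighted output to the shared expert's output. The helper names (`sigmoid`, `route_tokens`, `moe_output`), the scalar "experts", and all numbers are hypothetical and chosen only for illustration. Unlike the script, weights are kept in [token][expert] order rather than transposed.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function; maps -inf to exactly 0.0."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def route_tokens(router_logits, top_k):
    """Per-token expert weights: sigmoid(logit) for the top_k experts, 0.0 elsewhere."""
    all_weights = []
    for logits in router_logits:
        # Indices of the top_k largest logits for this token.
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
        # Unselected experts stay at -inf, which the sigmoid maps to 0.
        masked = [float("-inf")] * len(logits)
        for i in top:
            masked[i] = logits[i]
        all_weights.append([sigmoid(v) for v in masked])
    return all_weights


def moe_output(x, shared_expert, experts, weights):
    """Shared expert output plus every expert's output scaled by its weight."""
    out = shared_expert(x)
    for expert, w in zip(experts, weights):
        # Every expert is evaluated; a weight of 0 removes its contribution.
        out += expert(x) * w
    return out


# Toy setup: one token, four experts, top_k = 1 (all values are made up).
logits = [[2.0, -1.0, 0.0, 0.5]]
weights = route_tokens(logits, top_k=1)

shared = lambda x: x
experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
result = moe_output(1.0, shared, experts, weights[0])

print(weights[0], result)
```

Only the first expert receives a non-zero weight here, so the result is the shared output plus that expert's output scaled by sigmoid(2.0). Evaluating every expert on every token is wasteful for inference, but it mirrors what the script's loop over `self.num_experts` does.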