Llama-4-Maverick-17B-128E-Instruct-NVFP4
17.0B params · 8 languages · llama4 · by RedHatAI
Language Model · OTHER · New · 3K downloads · Early-stage
Edge AI: Mobile · Laptop · Server · 38GB+ RAM
Quick Summary
An NVFP4 (4-bit floating point) quantized version of meta-llama/Llama-4-Maverick-17B-128E-Instruct, published by RedHatAI for memory-efficient inference with vLLM.
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 16GB+ RAM
Training Data Analysis
🟡 Average (4.8/10)
Quality assessment of the training datasets reported for Llama-4-Maverick-17B-128E-Instruct-NVFP4.
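The 4.8/10 overall rating appears to be the unweighted mean of the four per-dataset scores listed under Training Datasets below; the averaging method is an assumption, not stated on this page:

# Assumed derivation of the overall rating: unweighted mean of the per-dataset scores.
scores = {"Common Crawl": 2.5, "C4": 6.0, "Wikipedia": 5.0, "arXiv": 5.5}
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 4.8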
Specialized For
general
science
multilingual
reasoning
Training Datasets (4)
Common Crawl
🔴 2.5/10
general
science
Key Strengths
- Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
- Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
- Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...
Considerations
- Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
- Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...
C4
🔵 6/10
general
multilingual
Key Strengths
- Scale and Accessibility: 750GB of publicly available, filtered text
- Systematic Filtering: Documented heuristics enable reproducibility
- Language Diversity: Despite being English-only, captures diverse writing styles
Considerations
- English-Only: Limits multilingual applications
- Filtering Limitations: Offensive content and low-quality text remain despite filtering
Wikipedia
🟡 5/10
science
multilingual
Key Strengths
- High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
- Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
- Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...
Considerations
- Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
- Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...
arXiv
🟡 5.5/10
science
reasoning
Key Strengths
- Scientific Authority: Research papers from an established preprint repository
- Domain-Specific: Specialized vocabulary and concepts
- Mathematical Content: Includes complex equations and notation
Considerations
- Specialized: Primarily technical and mathematical content
- English-Heavy: Predominantly English-language papers
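To inspect samples from these corpora directly, here is a minimal sketch using the Hugging Face datasets library in streaming mode; the dataset IDs below (allenai/c4, wikimedia/wikipedia) are assumed public mirrors, not necessarily the exact snapshots used for training:

from datasets import load_dataset

# Stream a few C4 documents without downloading the full corpus
# ("allenai/c4" with the "en" config is an assumed public mirror).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(c4):
    print(doc["text"][:200])
    if i >= 2:
        break

# Same idea for Wikipedia ("wikimedia/wikipedia" with a dated English dump is assumed).
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
print(next(iter(wiki))["title"])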
Explore our comprehensive training dataset analysis
View All Datasets
Code Examples
NVFP4 quantization with llmcompressor (python · transformers)
import torch
from datasets import load_dataset
from transformers import Llama4ForConditionalGeneration, Llama4Processor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Select model and load it.
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = Llama4Processor.from_pretrained(model_id)
# MoE calibration is now handled automatically by the pipeline.
# The `SequentialLlama4TextMoe` modules (from `llmcompressor.modeling.llama4`)
# will be applied during calibration to enable
# proper expert calibration and vLLM compatibility.
# These replace the original `Llama4TextMoe` class from
# `transformers.models.llama4.modeling_llama4`.
DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 8192
ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")
def preprocess_function(example):
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )
    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )
ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
def data_collator(batch):
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != "pixel_values"
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }
# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)
# Apply algorithms.
# due to the large size of Llama4, we specify sequential targets such that
# only one MLP is loaded into GPU memory at a time
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["Llama4TextMLP"],
    data_collator=data_collator,
)
# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
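The comments above note that the MoE modules are replaced during calibration for vLLM compatibility; once saved, the compressed checkpoint can be served with vLLM. Below is a minimal inference sketch, assuming a vLLM build that supports this NVFP4 checkpoint; the local path and tensor_parallel_size are illustrative assumptions:

from vllm import LLM, SamplingParams

# Load the locally saved NVFP4 checkpoint produced by the script above.
# Adjust the path and tensor_parallel_size for your hardware.
llm = LLM(model="./Llama-4-Maverick-17B-128E-Instruct-NVFP4", tensor_parallel_size=8)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize what NVFP4 quantization does."], sampling)
print(outputs[0].outputs[0].text)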
Deploy This Model
Production-ready deployment in minutes
Together.ai
Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Try Free API
Replicate
One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Deploy Now
Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.