Llama-4-Maverick-17B-128E-Instruct-NVFP4

Name: Llama-4-Maverick-17B-128E-Instruct-NVFP4
Author: RedHatAI

Downloads: 2.8K
Parameters: 17.0B
Languages: 8
Architecture: llama4
Type: Language Model
License: OTHER
Status: New, Early-stage


Edge AI:

Mobile

Laptop

Server

38GB+ RAM

Quick Summary

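NVFP4-quantized variant of meta-llama/Llama-4-Maverick-17B-128E-Instruct, published by RedHatAI. Per the quantization script on this page, Linear layers are quantized with llmcompressor using the NVFP4 scheme, while the lm_head, self-attention, router, vision model, and multimodal projector modules are left unquantized.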

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 16GB+ RAM

Training Data Analysis

🟡 Average (4.8/10)

Training datasets researched for Llama-4-Maverick-17B-128E-Instruct-NVFP4, each with a quality assessment

Specialized For

general

science

multilingual

reasoning

Training Datasets (4)

common crawl

🔴 2.5/10

general

science

Key Strengths

•Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
•Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
•Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...

Considerations

•Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
•Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...

🔵 6/10

general

multilingual

Key Strengths

•Scale and Accessibility: 750GB of publicly available, filtered text
•Systematic Filtering: Documented heuristics enable reproducibility
•Language Diversity: Although English-only, captures diverse writing styles

Considerations

•English-Only: Limits multilingual applications
•Filtering Limitations: Offensive content and low-quality text remain despite filtering

wikipedia

🟡 5/10

science

multilingual

Key Strengths

•High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
•Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
•Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...

Considerations

•Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
•Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...

arxiv

🟡 5.5/10

science

reasoning

Key Strengths

•Scientific Authority: Preprints from an established, moderated repository (not necessarily peer-reviewed)
•Domain-Specific: Specialized vocabulary and concepts
•Mathematical Content: Includes complex equations and notation

Considerations

•Specialized: Primarily technical and mathematical content
•English-Heavy: Predominantly English-language papers

Code Examples

Quantization script (Python; transformers and llmcompressor):

import torch
from datasets import load_dataset
from transformers import Llama4ForConditionalGeneration, Llama4Processor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
model = Llama4ForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
processor = Llama4Processor.from_pretrained(model_id)
# MoE calibration is now handled automatically by the pipeline.
# The `SequentialLlama4TextMoe` modules (from `llmcompressor.modeling.llama4`)
# will be applied during calibration to enable
# proper expert calibration and vLLM compatibility.
# These replace the original `Llama4TextMoe` class from
# `transformers.models.llama4.modeling_llama4`.

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 20
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name="LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")


def preprocess_function(example):
    messages = []
    for message in example["messages"]:
        messages.append(
            {
                "role": message["role"],
                "content": [{"type": "text", "text": message["content"]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != "pixel_values"
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }


# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "re:.*lm_head",
        "re:.*self_attn",
        "re:.*router",
        "re:.*vision_model.*",
        "re:.*multi_modal_projector.*",
        "Llama4TextAttention",
    ],
)

# Apply algorithms.
# Because Llama 4 is so large, we specify sequential targets so that
# only one MLP is loaded into GPU memory at a time.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    sequential_targets=["Llama4TextMLP"],
    data_collator=data_collator,
)


# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
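
The ignore list in the recipe above mixes two kinds of entries: strings prefixed with `re:` and a bare name (`Llama4TextAttention`). The sketch below illustrates one plausible reading of that list. It assumes that `re:` entries are regular expressions matched from the start of a module's dotted name and that bare entries match either the exact module name or the module's class name. The helper `is_ignored` and the module names in the usage lines are hypothetical and are not part of llmcompressor; they only show which layers such a list would plausibly skip.

```python
import re

# Same entries as the `ignore` argument in the recipe above.
IGNORE = [
    "re:.*lm_head",
    "re:.*self_attn",
    "re:.*router",
    "re:.*vision_model.*",
    "re:.*multi_modal_projector.*",
    "Llama4TextAttention",
]


def is_ignored(module_name, class_name, ignore=IGNORE):
    """Return True if a module would be skipped under the assumed matching rules.

    Assumption: "re:" entries are regexes applied with re.match to the dotted
    module name; other entries must equal the module name or the class name.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry in (module_name, class_name):
            return True
    return False


# Hypothetical module names, for illustration only.
candidates = [
    ("lm_head", "Linear"),
    ("language_model.model.layers.0.self_attn.q_proj", "Linear"),
    ("language_model.model.layers.0.feed_forward.router", "Linear"),
    ("language_model.model.layers.0.feed_forward.experts.3.down_proj", "Linear"),
]
for name, cls in candidates:
    print(f"{name}: {'ignored' if is_ignored(name, cls) else 'quantized'}")
```

Under these assumptions, only the expert projection in the last line would be quantized; the attention, router, and output head layers stay in their original precision.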

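The recipe above selects `scheme="NVFP4"` without showing what that rounding does to a weight. As a rough illustration only, the sketch below simulates block-wise 4-bit floating-point rounding in plain Python. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a shared scale per block of 16 values. It deliberately ignores details of the real format, such as storing each block scale in FP8 and applying a per-tensor scale, so it is not the llmcompressor implementation. The names `fake_quantize_block` and `fake_quantize` are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of weights sharing one scale


def fake_quantize_block(block):
    """Snap one block to the FP4 grid with a shared scale, then dequantize."""
    max_abs = max((abs(x) for x in block), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in block]
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / FP4_E2M1_GRID[-1]
    out = []
    for x in block:
        magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(magnitude * scale, x))
    return out


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Apply fake_quantize_block to consecutive blocks of `values`."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize_block(values[start:start + block_size]))
    return out


weights = [0.1, 12.0, -3.3, 0.9]
print(fake_quantize(weights))
```

Because every value in a block shares one scale, a single large weight (12.0 here) coarsens the grid for its neighbours, which is why small values in the same block can round to zero.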