Ministral-3-14B-Instruct-2512-FP8-dynamic
license:apache-2.0
by inference-optimization
Language Model
OTHER
14B params
New
121 downloads
Early-stage
Edge AI: Mobile, Laptop, Server
32GB+ RAM
Quick Summary
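FP8-quantized build of mistralai/Ministral-3-14B-Instruct-2512-BF16, produced with llmcompressor and deployable with vLLM. Linear-layer weights are quantized to 8-bit float per channel (static), input activations are quantized to 8-bit float per token (dynamic), and the lm_head, vision tower, and multimodal projector are left unquantized.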
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 14GB+ RAM
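The memory figures above can be sanity-checked with simple arithmetic. The sketch below is only a rough estimate under stated assumptions: it takes the listing's "14B params" as exactly 14e9 parameters, counts weight storage only, and ignores the KV cache, activations, runtime overhead, and the layers that stay unquantized. The helper name `weight_memory_gb` is hypothetical and used here for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "14B params" from the listing, rounded; the exact count may differ.
NUM_PARAMS = 14e9

fp8_gb = weight_memory_gb(NUM_PARAMS, 8)    # 8-bit float weights
bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit source checkpoint

print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```

Under these assumptions the FP8 weights alone come to roughly 14 GB, about half of the BF16 checkpoint, so real deployments should budget additional memory beyond that figure.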
Code Examples
Deployment (python, transformers)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.15, top_p=1.0, top_k=20, min_p=0, max_tokens=65536)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Build a single-turn chat; the chat template below renders it into a prompt string.
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
Creation (python, transformers)
from datasets import load_dataset
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
MODEL_ID = "mistralai/Ministral-3-14B-Instruct-2512-BF16"
model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = MistralCommonBackend.from_pretrained(MODEL_ID)
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]
      config_groups:
        group_0:
          targets: [Linear]
          weights:
            num_bits: 8
            type: float
            strategy: channel
            symmetric: true
            dynamic: false
            observer: mse
          input_activations:
            num_bits: 8
            type: float
            strategy: token
            symmetric: true
            dynamic: true
            observer: minmax
"""
# Apply quantization.
oneshot(model=model, recipe=recipe)
# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")
# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-DYNAMIC-OBSERVER"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
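To make the recipe's two strategies more concrete, here is a minimal pure-Python sketch of how symmetric scales could be computed per output channel (weights, static) and per token (activations, dynamic). It is an illustration under assumptions, not llmcompressor's implementation: it assumes the 8-bit float type is FP8 E4M3 with a maximum finite value of 448, it uses a plain absolute-maximum rule for both tensors (the recipe's weight observer is mse, which searches for a scale rather than taking the maximum), and it omits the actual rounding to FP8. The helper names `absmax_scale` and `per_row_scales` are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # assumed format: largest finite FP8 E4M3 value


def absmax_scale(values):
    """Symmetric scale that maps the largest magnitude onto the FP8 range."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def per_row_scales(matrix):
    """One scale per row: an output channel for weights, a token for activations."""
    return [absmax_scale(row) for row in matrix]


# Toy weight matrix, rows = output channels. Scales are computed once, offline.
weights = [[0.5, -2.0, 1.0],
           [0.1, 0.05, -0.2]]
weight_scales = per_row_scales(weights)

# Toy activations, rows = tokens. Scales are recomputed for each input at runtime.
activations = [[4.0, -8.0, 2.0],
               [0.5, 0.25, -1.0]]
activation_scales = per_row_scales(activations)

# Dividing by the scale brings each row into [-448, 448] before the FP8 cast.
scaled_weights = [[w / s for w in row] for row, s in zip(weights, weight_scales)]

print("weight scales:", weight_scales)
print("activation scales:", activation_scales)
```

Because the activation scales depend on the input, they cannot be stored in the checkpoint, which is why the recipe marks them as dynamic and why oneshot is called without a calibration dataset.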