Ministral-3-14B-Instruct-2512-FP8-dynamic

by RedHatAI
License: Apache-2.0
Language Model · inference-optimization · 14B params
New · Early-stage · 121 downloads
Edge AI targets: Mobile · Laptop · Server · 32GB+ RAM
Quick Summary

An FP8 quantization of mistralai/Ministral-3-14B-Instruct-2512-BF16, produced with llm-compressor. Weights are quantized to FP8 with static per-channel scales; activations are quantized to FP8 dynamically, with a fresh scale computed per token at runtime. This roughly halves memory use versus BF16 and targets efficient inference with vLLM.
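
For intuition about the scheme above, here is a minimal PyTorch sketch of per-channel weight scales versus per-token activation scales. It is illustrative only: the shapes are toy values, and real vLLM kernels run fused FP8 GEMMs rather than this dequantize-then-matmul path.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# Weights: static, symmetric, one scale per output channel (strategy: channel).
w = torch.randn(4, 8)                               # [out_features, in_features]
w_scale = w.abs().amax(dim=1, keepdim=True) / FP8_MAX
w_fp8 = (w / w_scale).to(torch.float8_e4m3fn)       # quantized once, offline

# Activations: dynamic, one scale per token (strategy: token, dynamic: true).
x = torch.randn(3, 8)                               # [tokens, in_features]
x_scale = x.abs().amax(dim=1, keepdim=True) / FP8_MAX
x_fp8 = (x / x_scale).to(torch.float8_e4m3fn)

# Dequantize-then-matmul approximates the full-precision result.
y = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T
print((y - x @ w.T).abs().max())                    # small quantization error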

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 14GB+ RAM
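
The 14GB+ minimum tracks simple arithmetic: at FP8, each of the ~14B parameters takes one byte, so the weights alone are about 13 GiB before KV cache and runtime overhead. A quick check:

params = 14e9                      # 14B parameters
bytes_per_param = 1                # FP8 = 8 bits per weight
weights_gib = params * bytes_per_param / 1024**3
print(f"weights ≈ {weights_gib:.1f} GiB")  # ~13.0 GiB, hence the 14GB+ floor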

Code Examples

Deployment (Python · vLLM)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic"
number_gpus = 1
sampling_params = SamplingParams(temperature=0.15, top_p=1.0, top_k=20, min_p=0, max_tokens=65536)

# Load the tokenizer to apply the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the FP8 checkpoint; tensor_parallel_size shards it across GPUs.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
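
For online serving instead of offline batch generation, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server. A sketch, assuming the default host and port and that you have started the server with "vllm serve RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic" in another shell:

from openai import OpenAI

# vLLM's server speaks the OpenAI API; the key is unused for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="RedHatAI/Ministral-3-14B-Instruct-2512-FP8-dynamic",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.15,
)
print(resp.choices[0].message.content)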
Creation (Python · llm-compressor)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

MODEL_ID = "mistralai/Ministral-3-14B-Instruct-2512-BF16"

model = Mistral3ForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")
tokenizer = MistralCommonBackend.from_pretrained(MODEL_ID)

# FP8 recipe: static per-channel weight scales, dynamic per-token activation
# scales. The lm_head and vision components are left unquantized.
recipe = """
    quant_stage:
      quant_modifiers:
        QuantizationModifier:
          ignore: ["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"]
          config_groups:
            group_0:
              targets: [Linear]
              weights:
                num_bits: 8
                type: float
                strategy: channel
                symmetric: true
                dynamic: false
                observer: mse
              input_activations:
                num_bits: 8
                type: float
                strategy: token
                symmetric: true
                dynamic: true
                observer: minmax
"""

# Apply quantization.
oneshot(model=model, recipe=recipe)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-DYNAMIC-OBSERVER"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
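
As a final check, the exported directory loads straight back into vLLM. A sketch, assuming the save step above completed and the script runs from the same working directory:

from vllm import LLM, SamplingParams

# SAVE_DIR as produced by the creation script above.
llm = LLM(model="Ministral-3-14B-Instruct-2512-BF16-FP8-DYNAMIC-OBSERVER")
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=20))
print(outputs[0].outputs[0].text)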

Deploy This Model

Production-ready deployment in minutes

Together.ai (Fastest API)

Instant API access to this model. Production-ready inference API; start free, scale to millions.

Replicate (Easiest Setup)

One-click model deployment. Run models in the cloud with a simple API; no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.