phi-4-quantized.w4a16

846 downloads · 3 languages · License: MIT
by RedHatAI
Language Model · Early-stage
Edge AI: Mobile, Laptop, Server
Quick Summary

phi-4-quantized.w4a16 is a W4A16 (INT4 weight, 16-bit activation) GPTQ-quantized build of Microsoft's phi-4, published by Red Hat AI (Neural Magic) for efficient inference with vLLM.
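For a rough sense of what the W4A16 scheme buys, here is a back-of-the-envelope weight-memory estimate. The ~14B parameter count for phi-4 and the per-parameter byte sizes are assumptions about weights only; real footprints also include activations, KV cache, and framework overhead.

# Back-of-the-envelope weight-memory estimate for W4A16 quantization
# (weights only; assumes phi-4 has ~14B parameters)
params = 14e9

fp16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter
w4_gb = params * 0.5 / 1e9   # 4-bit weights: 0.5 bytes per parameter

print(f"FP16 weights: ~{fp16_gb:.0f} GB, W4A16 weights: ~{w4_gb:.0f} GB "
      f"(~{fp16_gb / w4_gb:.0f}x smaller)")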

Training Data Analysis

🟡 Average (5.2/10)

An analysis of the training datasets reported for phi-4-quantized.w4a16, with per-dataset quality scores.

Specialized For

code
general
science
multilingual

Training Datasets (3)

The Pile
🟢 8/10
code
general
science
multilingual
Key Strengths
  • Deliberate Diversity: Explicitly curated to include diverse content types (academia, code, Q&A, book...
  • Documented Quality: Each component dataset is thoroughly documented with rationale for inclusion, en...
  • Epoch Weighting: Component datasets receive different training epochs based on perceived quality, al...
Common Crawl
🔴 2.5/10
general
science
Key Strengths
  • Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
  • Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
  • Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...
Considerations
  • Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
  • Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...
Wikipedia
🟡 5/10
science
multilingual
Key Strengths
  • High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
  • Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
  • Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...
Considerations
  • Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
  • Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...


Code Examples

Deployment (Python, vLLM + transformers)
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/phi-4-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Render the chat template to a prompt string, appending the assistant turn
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
Deployment (bash, vLLM via Podman)
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/phi-4-quantized.w4a16
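Once the container is up, the server exposes vLLM's OpenAI-compatible API on the mapped port 8000. A minimal query sketch (assumes the port mapping above and the openai Python client installed; vLLM ignores the API key unless one is configured):

# Query the vLLM server started above via its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/phi-4-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)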
Download model from Red Hat Registry via docker (bash, ilab)
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4-quantized-w4a16:1.5
Serve and chat via InstructLab (bash)
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/phi-4-quantized-w4a16
  
# Chat with model
ilab model chat --model ~/.cache/instructlab/models/phi-4-quantized-w4a16
Attach the model to a vLLM server using a KServe InferenceService (YAML, NVIDIA template)
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: phi-4-quantized.w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: phi-4-quantized.w4a16         # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2'			# this is model specific
          memory: 8Gi		# this is model specific
          nvidia.com/gpu: '1'	# this is accelerator specific
        requests:			# same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime	# must match the name in vllm-servingruntime.yaml (applied below)
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4-quantized-w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
Apply both resources to deploy the model (bash, oc)
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>
# apply both resources to run model
# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml
# Apply the InferenceService
oc apply -f inferenceservice.yaml
Query the inference service (bash, curl)
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.
# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
    "model": "phi-4-quantized.w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'
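The same request can be issued from Python, consuming the stream the curl example enables. A sketch using the openai client (the URL placeholders are the same as in the curl example and must be replaced before running):

# Python equivalent of the curl call above, consuming the stream
from openai import OpenAI

client = OpenAI(
    base_url="https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="phi-4-quantized.w4a16",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)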
Creation (Python, transformers + llm-compressor)
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "microsoft/phi-4"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    sequential_targets=["Phi3DecoderLayer"],
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
Evaluation (bash, lm_eval with vLLM)
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/phi-4-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.6,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto
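The same evaluation can also be driven from Python via the harness's simple_evaluate entry point. A sketch; the argument names follow lm-evaluation-harness, but verify them against your installed version:

# Python equivalent of the lm_eval CLI invocation above (sketch)
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/phi-4-quantized.w4a16,dtype=auto,"
        "gpu_memory_utilization=0.6,max_model_len=4096,"
        "enable_chunked_prefill=True,tensor_parallel_size=1"
    ),
    tasks=["openllm"],
    batch_size="auto",
)
print(results["results"])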

Deploy This Model

Production-ready deployment in minutes:

Together.ai (fastest API): instant API access to this model. A production-ready inference API; start free, scale to millions.
Replicate (easiest setup): one-click model deployment. Run models in the cloud with a simple API; no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.