NVILA-8B-HD-Video
license: cc-by-nc-4.0
by nvidia
Vision-Language Model (Video)
8B params
73 downloads
Quick Summary
NVILA-8B-HD-Video is an 8B-parameter vision-language model from NVIDIA for high-resolution video understanding, including video question answering as shown in the Quick Start below.
Device Compatibility
- Mobile: 4-6GB RAM
- Laptop: 16GB RAM
- Server: GPU
Minimum Recommended: 8GB+ RAM
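As a quick sanity check before downloading the weights, you can compare local memory against the recommended minimum from the table above. A minimal sketch, assuming a POSIX system that exposes the `SC_PAGE_SIZE` and `SC_PHYS_PAGES` sysconf keys (Linux does; the 8 GiB threshold comes from the table, not from NVIDIA documentation):

```python
import os

def total_ram_gb():
    """Return total physical RAM in GiB via POSIX sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
    return page_size * num_pages / (1024 ** 3)

MIN_RECOMMENDED_GB = 8  # "Minimum Recommended" from the compatibility table

ram = total_ram_gb()
status = "meets" if ram >= MIN_RECOMMENDED_GB else "is below"
print(f"Detected {ram:.1f} GiB RAM; {status} the recommended minimum")
```

Note this only checks system RAM; actual requirements also depend on GPU memory and the batch-size settings shown in the code example below.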
Code Examples
Quick Start (Python, transformers):
import torch
from transformers import AutoModel, AutoProcessor
model_path = "nvidia/NVILA-8B-HD-Video"
video_path = "https://huggingface.co/datasets/bfshi/HLVid/resolve/main/example/clip_av_video_5_001.mp4"
prompt = "Question: What does the white text on the green road sign say?\n \
A. Hampden St\n \
B. Hampden Ave\n \
C. Hampden Blvd\n \
D. Hampden Rd\n \
Please answer directly with the letter of the correct answer."
# ----- Video processing args -----
num_video_frames = 128 # Total sampled frames for tiles
num_video_frames_thumbnail = 64 # Total sampled frames for thumbnails
max_tiles_video = 48 # Max spatial tiles per video (one tile is 392x392)
# ----- AutoGaze args (tiles) -----
gazing_ratio_tile = [0.2] + [0.06] * 15 # Per-frame max gazing ratios (single float or list). Videos with higher resolution/FPS usually need lower gazing ratio.
task_loss_requirement_tile = 0.6 # AutoGaze stops gazing at each frame when the estimated reconstruction loss of that frame is lower than this threshold.
# ----- AutoGaze args (thumbnails) -----
gazing_ratio_thumbnail = 1 # Set gazing ratio to 1 and task loss requirement to None to skip gazing on thumbnails
task_loss_requirement_thumbnail = None
# ----- Batching -----
max_batch_size_autogaze = 16 # Set AutoGaze and SigLIP to use smaller mini-batch size if GPU memory is limited
max_batch_size_siglip = 32
# Load processor and model
processor = AutoProcessor.from_pretrained(
model_path,
num_video_frames=num_video_frames,
num_video_frames_thumbnail=num_video_frames_thumbnail,
max_tiles_video=max_tiles_video,
gazing_ratio_tile=gazing_ratio_tile,
gazing_ratio_thumbnail=gazing_ratio_thumbnail,
task_loss_requirement_tile=task_loss_requirement_tile,
task_loss_requirement_thumbnail=task_loss_requirement_thumbnail,
max_batch_size_autogaze=max_batch_size_autogaze,
trust_remote_code=True,
)
model = AutoModel.from_pretrained(
model_path,
trust_remote_code=True,
device_map="auto",
max_batch_size_siglip=max_batch_size_siglip,
)
model.eval()
# Run inference
video_token = processor.tokenizer.video_token
inputs = processor(text=f"{video_token}\n\n{prompt}", videos=video_path, return_tensors="pt")
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
outputs = model.generate(**inputs)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip()
print(response)
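The comments in the Quick Start note that `max_batch_size_autogaze` and `max_batch_size_siglip` can be lowered when GPU memory is limited, and that higher-resolution videos typically want a lower gazing ratio. A hedged sketch of an alternative low-memory configuration, reusing only the parameters shown above (the specific values are illustrative guesses, not benchmarked settings):

```python
# Illustrative low-memory overrides: halve frame counts, tiles, and
# mini-batch sizes relative to the Quick Start defaults. These values
# are assumptions for demonstration, not tuned recommendations.
low_memory_processor_kwargs = dict(
    num_video_frames=64,            # vs. 128 in the Quick Start
    num_video_frames_thumbnail=32,  # vs. 64
    max_tiles_video=24,             # vs. 48
    max_batch_size_autogaze=4,      # vs. 16
)
low_memory_siglip_batch = 8         # vs. 32; passed to AutoModel.from_pretrained

# These slot into the same calls shown in the Quick Start, e.g.:
# processor = AutoProcessor.from_pretrained(
#     model_path, trust_remote_code=True, **low_memory_processor_kwargs)
# model = AutoModel.from_pretrained(
#     model_path, trust_remote_code=True, device_map="auto",
#     max_batch_size_siglip=low_memory_siglip_batch)
print(low_memory_processor_kwargs)
```

Smaller mini-batches trade throughput for peak-memory headroom; the frame and tile counts additionally reduce the number of visual tokens the model must process per video.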