Qwen3-Nemotron-8B-BRRM

by nvidia · Language Model · 8.0B params · 1 language · 717 downloads
Edge AI: Mobile, Laptop, Server (18GB+ RAM)
Quick Summary

Qwen3-Nemotron-8B-BRRM is NVIDIA's 8B-parameter, Qwen3-based response-quality evaluation (reward) model. Given a context and two candidate responses, it judges them in two turns: an adaptive-branching turn that selects the most relevant quality dimensions and flags critical issues in each response, followed by a branch-conditioned rethinking turn that applies an evaluation hierarchy and returns a ranking score.

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 8GB+ RAM
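
The lower figures above only make sense with quantized weights: the bf16 checkpoint of an 8B model is roughly 16GB on its own. A minimal 4-bit loading sketch, assuming the optional bitsandbytes package is installed (this is not part of NVIDIA's published example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "nvidia/Qwen3-Nemotron-8B-BRRM"

# Assumption: bitsandbytes is available; NF4 4-bit keeps the 8B weights near ~5GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)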

Code Examples

Quick Start (Python, transformers)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "nvidia/Qwen3-Nemotron-8B-BRRM"  # or nvidia/Qwen3-Nemotron-14B-BRRM
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example usage
context = "What is 2+2?"
response1 = "2+2=4"
response2 = "2+2=5"

# Format Turn 1: Adaptive Branching
turn1_prompt = f"""You are a response quality evaluator. Given the context and two responses, select the most important cognitive abilities and analyze critical issues.

**Context:** 
{context}

**Responses:**
[The Begin of Response 1]
{response1}
[The End of Response 1]

[The Begin of Response 2]
{response2}
[The End of Response 2]

**Output Format:**
[Quality Assessment Focus]
Choose 1-3 abilities: Information Accuracy, Computational Precision, Logical Reasoning, Implementation Capability, Safety Awareness, Response Completeness, Instruction Adherence, Communication Clarity.
[End of Quality Assessment Focus]

[Quality Analysis for Response 1]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 1]

[Quality Analysis for Response 2]
- Critical Issues: [List specific issues or "None identified"]
[End of Quality Analysis for Response 2]"""

# Generate Turn 1
messages = [{"role": "user", "content": turn1_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, 
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    input_ids, 
    max_new_tokens=8192,      
    temperature=1.0,
    top_p=0.95,               
    top_k=20,                 
    do_sample=True,           
    pad_token_id=tokenizer.eos_token_id
)
turn1_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)


# Format Turn 2: Branch-Conditioned Rethinking
turn2_prompt = f"""You are making final comparative judgments using established evaluation priorities.

**Evaluation Hierarchies:**
- **Accuracy-Critical**: Correctness > Process > Presentation 
- **Creative/Open-Ended**: User Intent > Content Quality > Creativity 
- **Instruction-Following**: Adherence > Content > Clarity

[The Begin of Analysis on Response 1]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 1]

[The Begin of Analysis on Response 2]
[Apply appropriate evaluation hierarchy]
[The End of Analysis on Response 2]

[The Begin of Ranking Score]
\\boxed{{1 or 2}}
[The End of Ranking Score]"""

# Generate Turn 2
messages.append({"role": "assistant", "content": turn1_response})
messages.append({"role": "user", "content": turn2_prompt})
input_ids = tokenizer.apply_chat_template(
    messages, 
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    input_ids, 
    max_new_tokens=8192,     
    temperature=1.0,
    top_p=0.95,              
    top_k=20,                
    do_sample=True,          
    pad_token_id=tokenizer.eos_token_id
)
final_response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=False)
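
The final judgment appears inside \boxed{...} in the Turn 2 output. A minimal, unofficial sketch for extracting it with a regular expression (the helper below is an assumption for illustration, not part of NVIDIA's example):

import re

# Hypothetical helper: pull the 1-or-2 ranking out of \boxed{...} in the Turn 2 text
def extract_ranking(text: str):
    match = re.search(r"\\boxed\{\s*([12])\s*\}", text)
    return int(match.group(1)) if match else None

preferred = extract_ranking(final_response)
print(f"Preferred response: {preferred}")  # here, 1 means "2+2=4" was ranked higher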

Deploy This Model

Together.ai: instant API access to this model via a production-ready inference API. Start free, scale to millions.

Replicate: one-click model deployment in the cloud with a simple API. No DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.