gemma-4-31B-it-NVFP4A16-GPTQ

by YCWTG · license: apache-2.0 · 31B params · 118 downloads
Image Model · OTHER · New · Early-stage
Edge AI: Mobile · Laptop · Server · 70GB+ RAM
Quick Summary

A 31B-parameter instruction-tuned model quantized with GPTQ; the NVFP4A16 suffix indicates NVFP4 (4-bit floating-point) weights with 16-bit activations. The examples below serve it locally through vLLM's OpenAI-compatible API.

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum Recommended: 29GB+ RAM
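
As a rough sanity check on the 29GB+ figure, a back-of-envelope estimate (the overhead multiplier below is an assumption for illustration; actual usage depends on context length and serving stack):

params = 31e9                    # 31B parameters
weight_gb = params * 0.5 / 1e9   # NVFP4: 4 bits = 0.5 bytes per weight -> ~15.5 GB
overhead = 1.8                   # assumed multiplier for KV cache, activations, runtime buffers
print(f"~{weight_gb * overhead:.0f} GB")  # ~28 GB, in line with the 29GB+ recommendation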

Training Data Analysis

🟡 Average (4.3/10)

Quality assessment of the training datasets reportedly used by gemma-4-31B-it-NVFP4A16-GPTQ.

Specialized For

general · science · multilingual · reasoning

Training Datasets (3)

Common Crawl
🔴 2.5/10 · general · science
Key Strengths
  • Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training data.
  • Diversity: The dataset captures billions of web pages across multiple domains and content types.
  • Comprehensive Coverage: Despite its limitations, Common Crawl attempts to represent the broader web.
Considerations
  • Biased Coverage: The crawling process prioritizes frequently linked domains, so weakly linked content is underrepresented.
  • Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, and violent content.
Wikipedia
🟡 5/10 · science · multilingual
Key Strengths
  • High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citation requirements.
  • Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and generate many languages.
  • Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to learn document structure.
Considerations
  • Language Inequality: Low-resource language editions have significantly lower quality and fewer articles.
  • Biased Coverage: Reflects biases in contributor demographics; topics tied to Western culture are covered in disproportionate depth.
arXiv
🟡 5.5/10 · science · reasoning
Key Strengths
  • Scientific Authority: Peer-reviewed content from established repository
  • Domain-Specific: Specialized vocabulary and concepts
  • Mathematical Content: Includes complex equations and notation
Considerations
  • Specialized: Primarily technical and mathematical content
  • English-Heavy: Predominantly English-language papers
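
(For reference: the headline 4.3/10 matches the unweighted mean of the three dataset ratings, (2.5 + 5.0 + 5.5) / 3 ≈ 4.3.)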


Code Examples

Quickstart (Python, vLLM)
import argparse
import atexit
import json
import os
import shutil
import subprocess
import sys
import time
import urllib.error
import urllib.request


# ---------------------------
# User-facing configuration
# ---------------------------
DEFAULTS = {
    "model": "YCWTG/gemma-4-31B-it-NVFP4A16-GPTQ",
    "served_model_name": "YCWTG/gemma-4-31B-it-NVFP4A16-GPTQ",
    "host": "localhost",
    "port": 8000,
    "max_model_len": 66464,
    "enable_auto_tool_choice": True,
    "async_scheduling": True,
    "tool_call_parser": "gemma4",
    "max_num_seqs": 1,
    "reasoning_parser": "gemma4",
    "default_chat_template_kwargs": '{"enable_thinking": true}',
    "allowed_local_media_path": "/home/ycwtg/图片/截图",
}

RUNTIME = {
    "gpu_memory_utilization": 0.97,
    "startup_timeout_sec": 180,
    "healthcheck_timeout_sec": 3,
    "healthcheck_interval_sec": 1,
    "chat_timeout_sec": 600,
}

SERVE_VALUE_OPTIONS = (
    ("--served-model-name", "served_model_name"),
    ("--host", "host"),
    ("--port", "port"),
    ("--max-model-len", "max_model_len"),
    ("--tool-call-parser", "tool_call_parser"),
    ("--max_num_seqs", "max_num_seqs"),
    ("--reasoning-parser", "reasoning_parser"),
    ("--default-chat-template-kwargs", "default_chat_template_kwargs"),
)

CLIENT_VALUE_OPTIONS = (
    ("--model", "model"),
    *SERVE_VALUE_OPTIONS,
)

SERVE_BOOL_OPTIONS = (
    ("--enable-auto-tool-choice", "enable_auto_tool_choice"),
    ("--async-scheduling", "async_scheduling"),
)

CLIENT_BOOL_OPTIONS = (
    ("--enable-auto-tool-choice", "--no-enable-auto-tool-choice", "enable_auto_tool_choice"),
    ("--async-scheduling", "--no-async-scheduling", "async_scheduling"),
)


def append_value_options(cmd, args, options):
    for flag, attr in options:
        cmd.extend([flag, str(getattr(args, attr))])


def append_true_bool_options(cmd, args, options):
    for flag, attr in options:
        if getattr(args, attr):
            cmd.append(flag)


def append_boolean_optional_options(cmd, args, options):
    for positive_flag, negative_flag, attr in options:
        cmd.append(positive_flag if getattr(args, attr) else negative_flag)


def append_optional_value_option(cmd, args, flag, attr):
    value = getattr(args, attr)
    if value is None:
        return
    if isinstance(value, str) and not value.strip():
        return
    cmd.extend([flag, str(value)])


def multiline_input():
    print('User (type "END" on a single line to send, "exit" to quit):')
    lines = []
    while True:
        line = input()
        text = line.strip()
        if text.lower() in {"exit", "quit"}:
            return None
        if text == "END":
            break
        lines.append(line)
    return "\n".join(lines)


def resolve_client_host(host):
    # Wildcard bind addresses cannot be dialed directly; connect via loopback instead.
    return "127.0.0.1" if host in {"0.0.0.0", "::"} else host


def launch_vllm(args):
    cmd = ["vllm", "serve", args.model]
    append_value_options(cmd, args, SERVE_VALUE_OPTIONS)
    append_optional_value_option(cmd, args, "--allowed-local-media-path", "allowed_local_media_path")
    cmd.extend(
        [
            "--gpu-memory-utilization",
            str(RUNTIME["gpu_memory_utilization"]),
        ]
    )
    append_true_bool_options(cmd, args, SERVE_BOOL_OPTIONS)

    print("Launching vLLM:")
    print(" ".join(cmd))
    try:
        return subprocess.Popen(cmd)
    except FileNotFoundError as e:
        raise RuntimeError("vllm command not found. Activate an environment that has vllm installed.") from e


def stop_vllm(proc):
    if proc and proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()


def wait_vllm_ready(base_url, timeout_sec=RUNTIME["startup_timeout_sec"]):
    # Poll the /v1/models endpoint until the server answers or the deadline passes.
    deadline = time.time() + timeout_sec
    url = f"{base_url}/v1/models"
    req = urllib.request.Request(url=url)
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(req, timeout=RUNTIME["healthcheck_timeout_sec"]) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass
        time.sleep(RUNTIME["healthcheck_interval_sec"])
    return False


def chat_once(base_url, model_name, messages):
    # One non-streaming request against vLLM's OpenAI-compatible chat endpoint.
    payload = {"model": model_name, "messages": messages, "skip_special_tokens": False}
    req = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=RUNTIME["chat_timeout_sec"]) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["choices"][0]["message"]


def chat_loop(base_url, model_name):
    print("\n===== Chat Started =====\n")
    messages = []

    while True:
        user_text = multiline_input()
        if user_text is None:
            break

        messages.append({"role": "user", "content": user_text})
        try:
            assistant_msg = chat_once(base_url, model_name, messages)
        except Exception as e:
            print(f"\nRequest failed: {e}\n")
            messages.pop()
            continue

        content = assistant_msg.get("content")
        tool_calls = assistant_msg.get("tool_calls")

        if content:
            print(f"\nAssistant:\n{content}\n")
        elif tool_calls:
            print("\nAssistant(tool_calls):")
            print(json.dumps(tool_calls, ensure_ascii=False, indent=2))
            print()
        else:
            print("\nAssistant:\n(empty response)\n")

        normalized_msg = {"role": "assistant", "content": content or ""}
        if tool_calls:
            normalized_msg["tool_calls"] = tool_calls
        messages.append(normalized_msg)


def build_client_command(args):
    cmd = [sys.executable, os.path.abspath(__file__), "--_client"]
    append_value_options(cmd, args, CLIENT_VALUE_OPTIONS)
    append_boolean_optional_options(cmd, args, CLIENT_BOOL_OPTIONS)
    return cmd


def spawn_chat_terminal(args):
    client_cmd = build_client_command(args)

    terminal_cmd = None
    if os.name == "nt":
        # Open a new cmd window on Windows and keep it alive for interactive chat.
        terminal_cmd = [
            "cmd",
            "/c",
            "start",
            "",
            "cmd",
            "/k",
            subprocess.list2cmdline(client_cmd),
        ]
    elif shutil.which("gnome-terminal"):
        terminal_cmd = ["gnome-terminal", "--", *client_cmd]
    elif shutil.which("x-terminal-emulator"):
        terminal_cmd = ["x-terminal-emulator", "-e", *client_cmd]

    if not terminal_cmd:
        return False

    try:
        subprocess.Popen(terminal_cmd)
        return True
    except Exception as e:
        print(f"Failed to open a new terminal automatically: {e}")
        return False


def parse_args():
    parser = argparse.ArgumentParser(description="Minimal local vLLM chat script")
    parser.add_argument("--_client", action="store_true", help=argparse.SUPPRESS)
    parser.add_argument("--model", default=DEFAULTS["model"])
    parser.add_argument(
        "--served-model-name",
        default=DEFAULTS["served_model_name"],
    )
    parser.add_argument("--host", default=DEFAULTS["host"])
    parser.add_argument("--port", type=int, default=DEFAULTS["port"])
    parser.add_argument("--max-model-len", type=int, default=DEFAULTS["max_model_len"])
    parser.add_argument(
        "--max-num-seqs",
        "--max_num_seqs",
        dest="max_num_seqs",
        type=int,
        default=DEFAULTS["max_num_seqs"],
    )
    parser.add_argument(
        "--enable-auto-tool-choice",
        action=argparse.BooleanOptionalAction,
        default=DEFAULTS["enable_auto_tool_choice"],
    )
    parser.add_argument(
        "--async-scheduling",
        action=argparse.BooleanOptionalAction,
        default=DEFAULTS["async_scheduling"],
    )
    parser.add_argument(
        "--allowed-local-media-path",
        default=DEFAULTS["allowed_local_media_path"],
        help="Optional local media path. Leave empty to disable.",
    )
    parser.add_argument("--tool-call-parser", default=DEFAULTS["tool_call_parser"])
    parser.add_argument("--reasoning-parser", default=DEFAULTS["reasoning_parser"])
    parser.add_argument(
        "--default-chat-template-kwargs",
        default=DEFAULTS["default_chat_template_kwargs"],
    )
    return parser.parse_args()


def main():
    args = parse_args()
    base_url = f"http://{resolve_client_host(args.host)}:{args.port}"
    if args._client:
        chat_loop(base_url, args.served_model_name)
        return

    proc = launch_vllm(args)
    atexit.register(stop_vllm, proc)

    print(f"Waiting for service to become ready: {base_url}")
    if not wait_vllm_ready(base_url):
        print("vLLM startup timed out. Check server logs above.")
        stop_vllm(proc)
        sys.exit(1)

    if spawn_chat_terminal(args):
        print("Model is ready. Opened a new terminal for chat; this terminal keeps server logs.")
        print("Press Ctrl+C here to stop vLLM.")
        try:
            proc.wait()
        except KeyboardInterrupt:
            print("\nInterrupted. Stopping vLLM...")
    else:
        print("No supported terminal found. Falling back to chat in this terminal.")
        chat_loop(base_url, args.served_model_name)


if __name__ == "__main__":
    main()
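
To try the script, save it under any name (e.g. quickstart.py; the filename is assumed here) and run it with Python in an environment where vllm is installed. Every DEFAULTS entry can be overridden on the command line, for example: python quickstart.py --port 8001 --no-async-scheduling. The script launches the server, waits for the health check, then opens a separate chat terminal, falling back to chatting in the same terminal if no supported terminal emulator is found.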
Instruct Mode (bash, vLLM)
vllm serve YCWTG/gemma-4-31B-it-NVFP4A16-GPTQ \
  --served-model-name YCWTG/gemma-4-31B-it-NVFP4A16-GPTQ \
  --host localhost --port 8000 \
  --async-scheduling \
  --max-model-len 66464 \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 1 \
  --allowed-local-media-path /home/ycwtg/image
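
Once the server is up, any OpenAI-compatible client can query it. A minimal stdlib-only sketch mirroring the chat_once helper above (assumes the server started by the command above is listening on localhost:8000):

import json
import urllib.request

payload = {
    "model": "YCWTG/gemma-4-31B-it-NVFP4A16-GPTQ",
    "messages": [{"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}],
}
req = urllib.request.Request(
    url="http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=600) as resp:
    reply = json.loads(resp.read().decode("utf-8"))
print(reply["choices"][0]["message"]["content"])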

Deploy This Model

Production-ready deployment in minutes:

Together.ai (fastest API): instant API access to this model. Production-ready inference API; start free, scale to millions.

Replicate (easiest setup): one-click model deployment. Run models in the cloud with a simple API; no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.