# zacks917/AutoDeco-Llama-Nemotron-8B
## AutoDeco

Official implementation of "The End of Manual Decoding: Towards Truly End-to-End Language Models".

AutoDeco is a framework that adds token-level adaptive decoding-parameter prediction to Large Language Models (LLMs). By adding lightweight prediction heads on top of a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p for each token during decoding.

### Key Features

- **Token-Level Decoding Parameter Prediction**: dynamically predicts the decoding parameters (temperature and top-p) for each generated token
- **Lightweight Design**: adds only two small MLP prediction heads (~5MB), without modifying the base model
- **Universal Architecture**: supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.)
- **End-to-End Training**: trained end to end, with gradients backpropagated implicitly through the cross-entropy loss alone
- **Flexible Training**: supports training the temperature head alone, the top-p head alone, or both jointly
- **Efficient Deployment**: saves only the AutoDeco prediction-head weights during training; they are merged with the base model for decoding

### Architecture

The AutoDeco framework consists of two core components: the temperature head and the top-p head. During training, the base LLM parameters are frozen and only these two prediction heads are trained. AutoDeco supports all current autoregressive LLMs, unified under a single `AutoDecoModelForCausalLM` interface.
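To make concrete what the two predicted parameters control, here is a minimal sketch of temperature scaling followed by nucleus (top-p) truncation on a single token's logits. `adaptive_filter` is a hypothetical helper for illustration, not part of the AutoDeco codebase; in AutoDeco, `temperature` and `top_p` would come from the prediction heads at each decoding step rather than being fixed by hand.

```python
import math

def adaptive_filter(logits, temperature, top_p):
    """Apply temperature scaling, then nucleus (top-p) truncation.

    Returns a renormalized probability list with tokens outside the
    nucleus zeroed out. Sampling from this distribution is the final
    decoding step.
    """
    # Temperature scaling: softmax(logits / T), computed stably.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Nucleus truncation: keep the smallest set of highest-probability
    # tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Zero out tokens outside the nucleus and renormalize.
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]
```

A lower temperature sharpens the distribution and a lower top-p shrinks the candidate set, so predicting both per token lets the model be conservative on "easy" tokens and exploratory on ambiguous ones.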
### Released Models

| Base Model | #Base Params | #AutoDeco Params | Download |
| :------------: | :------------: | :------------: | :------------: |
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 1.84M | 🤗 HuggingFace |
| Qwen3-30B-A3B-Instruct-2507 | 30B | 1.05M | 🤗 HuggingFace |
| OpenAI-GPT-OSS-20B | 20B | 1.48M | 🤗 HuggingFace |
| OpenAI-GPT-OSS-120B | 120B | 1.48M | 🤗 HuggingFace |
| Qwen3-235B-A22B-Thinking | 235B | 2.1M | 🤗 HuggingFace |
| DeepSeek-V3.1-Terminus | 671B | - | Coming Soon |

### Requirements

- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 12.0 (recommended for training)

### Data Format

Training data should be in JSONL format, with one sample per line. AutoDeco supports the standard conversation format.

### Evaluation

Evaluation results are saved in the `generation_log/` directory, including:

- Pass@K metrics
- Average accuracy
- Detailed generation results for each sample

### Deployment

Training generates a lightweight checkpoint (~5MB) containing:

- `config.json`: the AutoDeco configuration (including `base_model_name_or_path`)
- `autodeco_heads.safetensors`: the prediction-head weights

#### 2. Merge AutoDeco Heads into the Base Model (for vLLM Deployment)

Use this step when you need a complete model file, with the heads merged in, for inference engines such as vLLM.
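The JSONL conversation data format mentioned above can be sketched as follows. The field names (`messages`, `role`, `content`) are an assumption based on the common chat-message schema, not taken from the AutoDeco repository; check the repository's data examples for the exact keys.

```python
import json

# Hypothetical training sample in a standard conversation format;
# field names are an assumption, not confirmed by the AutoDeco repo.
sample = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}

# JSONL: one JSON object per line, one sample per line.
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Reading it back, line by line.
with open("train.jsonl", encoding="utf-8") as f:
    for raw in f:
        record = json.loads(raw)
```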