yuhuili

16 models

EAGLE-LLaMA3.1-Instruct-8B · llama · 164,613 downloads · 2 likes

EAGLE3-LLaMA3.1-Instruct-8B · llama · 68,467 downloads · 6 likes

EAGLE-LLaMA3-Instruct-8B · llama · 32,790 downloads · 5 likes

EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) with provable performance maintenance. It extrapolates the second-top-layer contextual feature vectors of the LLM, yielding a significant boost in generation efficiency.

EAGLE is:
- certified by third-party evaluation as the fastest speculative method so far.
- able to achieve a 2x speedup on gpt-fast.
- 3x faster than vanilla decoding (13B).
- 2x faster than Lookahead (13B).
- 1.6x faster than Medusa (13B).
- provably consistent with vanilla decoding in the distribution of generated texts (an illustrative sketch of the underlying draft-and-verify rule follows the Reference section below).
- trainable (within 1-2 days) and testable on 8x RTX 3090 GPUs, so even the GPU-poor can afford it.
- combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.

EAGLE-2 uses the confidence scores of the draft model to approximate acceptance rates and dynamically adjusts the draft tree structure, further improving performance (see the tree-expansion sketch after the Reference section). EAGLE-2 is:
- 4x faster than vanilla decoding (13B).
- 1.4x faster than EAGLE-1 (13B).

EAGLE-3 removes EAGLE's feature-prediction constraint and instead simulates this process during training using training-time testing. Because top-layer features are specialized for next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features (sketched after the Reference section). EAGLE-3 further improves generation speed while remaining lossless. EAGLE-3 is:
- 5.6x faster than vanilla decoding (13B).
- 1.8x faster than EAGLE-1 (13B).

Inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model.

## Support

EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order):
- AMD ROCm
- AngelSlim
- AWS NeuronX Distributed Core
- CPM.cu
- Intel® Extension for Transformers
- Intel® LLM Library for PyTorch
- MLC-LLM
- NVIDIA NeMo Framework
- NVIDIA TensorRT-LLM
- NVIDIA TensorRT Model Optimizer
- PaddleNLP
- SGLang
- SpecForge
- vLLM (a usage sketch follows the Reference section)

## Reference

For technical details and full experimental results, please see the EAGLE, EAGLE-2, and EAGLE-3 papers.
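The losslessness claim above rests on the standard speculative-sampling verification rule that draft-and-verify methods like EAGLE share: a drafted token x is accepted with probability min(1, p_target(x)/p_draft(x)), otherwise a replacement is drawn from the normalized residual. The snippet below is a minimal toy sketch of that rule, not EAGLE's code; the distributions and function name are made up for illustration.

```python
# Toy illustration (not EAGLE's implementation) of the standard
# speculative-sampling verification rule that makes draft-and-verify
# methods lossless: accept drafted token x with probability
# min(1, p_target(x) / p_draft(x)); on rejection, resample from the
# normalized residual max(p_target - p_draft, 0).
import numpy as np

rng = np.random.default_rng(0)

def verify_draft(p_target: np.ndarray, p_draft: np.ndarray, drafted: int) -> int:
    """Return the token emitted at this position: either the drafted
    token (accepted) or a resample from the residual distribution."""
    accept_prob = min(1.0, p_target[drafted] / p_draft[drafted])
    if rng.random() < accept_prob:
        return drafted
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual))

# Toy vocabulary of 4 tokens; the draft model is slightly miscalibrated.
p_target = np.array([0.50, 0.30, 0.15, 0.05])
p_draft  = np.array([0.40, 0.40, 0.10, 0.10])
drafted = int(rng.choice(4, p=p_draft))          # draft model proposes
print(verify_draft(p_target, p_draft, drafted))  # target model verifies
```

In aggregate the emitted tokens follow p_target exactly, which is what lets EAGLE keep the output distribution of vanilla decoding.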
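EAGLE-2's dynamic draft tree can be pictured as a best-first expansion where a path's score is the product of draft-model confidences along it, used as a cheap proxy for acceptance probability. The sketch below is a hypothetical toy: the Node class, the greedy policy, and the toy confidences are all assumptions, not the EAGLE-2 implementation.

```python
# Hypothetical sketch (not the EAGLE-2 code) of confidence-guided draft-tree
# expansion: each node carries the product of draft-model confidences along
# its path, and the tree is grown greedily at the highest-scoring frontier
# nodes, so deep chains are only drafted where acceptance looks likely.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                     # heapq is a min-heap, so store -score
    path: tuple = field(compare=False)   # token ids from the root

def expand_draft_tree(draft_topk, budget: int, k: int = 2):
    """draft_topk(path) -> list of (token_id, confidence) continuations;
    returns the drafted paths, at most `budget` nodes in total."""
    frontier = [Node(-1.0, ())]          # root: empty path, score 1.0
    drafted = []
    while frontier and len(drafted) < budget:
        node = heapq.heappop(frontier)
        drafted.append(node.path)
        for token, conf in draft_topk(node.path)[:k]:
            # Path score = product of confidences, a cheap stand-in for the
            # probability that verification accepts the whole path.
            heapq.heappush(frontier, Node(node.neg_score * conf, node.path + (token,)))
    return [p for p in drafted if p]     # drop the empty root path

# Toy draft model: token 0 is slightly more confident than token 1, so the
# tree both branches and goes deeper along the more promising path.
paths = expand_draft_tree(lambda path: [(0, 0.6), (1, 0.5)], budget=8)
print(paths)
```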
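The feature fusion that EAGLE-3 describes can be sketched as concatenating hidden states from three depths of the target model and projecting back to the model width before the draft head consumes them. Everything in this snippet (module name, layer choices, dimensions) is illustrative, not the released architecture.

```python
# Hypothetical sketch of EAGLE-3-style multi-level feature fusion:
# low-, mid-, and high-level hidden states are concatenated and
# projected back to the hidden size. Dimensions are illustrative.
import torch
import torch.nn as nn

class FusedFeatures(nn.Module):
    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.proj = nn.Linear(3 * hidden, hidden)  # fuse three levels into one

    def forward(self, low, mid, high):
        # low/mid/high: (batch, seq, hidden) states from three layer depths
        return self.proj(torch.cat([low, mid, high], dim=-1))

fuse = FusedFeatures(hidden=8)
low, mid, high = (torch.randn(1, 4, 8) for _ in range(3))
print(fuse(low, mid, high).shape)  # torch.Size([1, 4, 8])
```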
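As a concrete example of the serving-framework support listed above, here is a sketch of pointing vLLM's speculative decoding at one of these checkpoints. vLLM's configuration surface has changed across releases, so treat the `speculative_config` keys below as assumptions to verify against your installed version.

```python
# Sketch of serving an EAGLE checkpoint with vLLM speculative decoding.
# The exact configuration keys vary across vLLM releases; the dict below
# follows the style of recent versions and should be checked locally.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # target (base) model
    speculative_config={                               # assumed recent-vLLM style
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # draft head from this page
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```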


EAGLE3-LLaMA3.3-Instruct-70B · llama · 3,840 downloads · 7 likes

EAGLE-llama2-chat-7B · llama · 749 downloads · 5 likes

EAGLE-Vicuna-7B-v1.3 · llama · 602 downloads · 2 likes

EAGLE3-DeepSeek-R1-Distill-LLaMA-8B · llama · 375 downloads · 3 likes

EAGLE-LLaMA3-Instruct-70B · llama · 304 downloads · 5 likes

EAGLE3-Vicuna1.3-13B · llama · 185 downloads · 1 like

EAGLE-Qwen2-7B-Instruct · license:apache-2.0 · 119 downloads · 1 like

EAGLE-llama2-chat-13B · llama · 52 downloads · 0 likes

EAGLE-Vicuna-13B-v1.3 · llama · 48 downloads · 0 likes

EAGLE-mixtral-instruct-8x7B · license:apache-2.0 · 41 downloads · 0 likes

EAGLE-llama2-chat-70B · llama · 37 downloads · 1 like

EAGLE-Qwen2-72B-Instruct · license:apache-2.0 · 26 downloads · 1 like

EAGLE-Vicuna-33B-v1.3 · llama · 11 downloads · 0 likes