# maomaocun/dLLM-Var
This model is a fine-tuned version of the LLaDA 8B Base model, obtained through a specialized Supervised Fine-Tuning (SFT) process. Unlike typical block-diffusion setups, it discards the complex attention-mask design and keeps standard full attention. This lets the model run block diffusion-style inference efficiently: it reuses the KV cache for streamlined generation and emits an EOS token when the response is complete, cleanly exiting the generation loop.

Key innovations:

- Full Attention Preservation: maintains standard full attention without the overhead of intricate masking.
- Block Diffusion Inference: enables iterative block-wise generation via KV cache management, ensuring coherent and controlled outputs.
- EOS Handling: trained to naturally emit EOS tokens at response boundaries.

This approach balances computational efficiency with high-quality generation, making it suitable for tasks requiring structured, multi-step reasoning.

## Benchmarks

The following table compares performance across key evaluation benchmarks. Results are reported as accuracy percentages where applicable.

| Model | GSM8K | GPQA | BBH | MATH | HumanEval | MBPP | MMLU-Generate |
|---|---|---|---|---|---|---|---|
| LLaDA 8B Base (pure diffusion) | 69.06 | 31.91 | 44.77 | 30.84 | 32.92 | 40.80 | 65.9 |
| LLaDA 8B Instruct (semi-autoregressive diffusion) | 77.48 | 29.01 | 51.49 | 22.32 | 38.71 | 39.20 | 65.5 |
| dLLM-Var (block diffusion) | 77.40 | 33.03 | 48.74 | 31.94 | 40.24 | 42.00 | 65.53 |

These results demonstrate competitive performance, particularly on code generation (HumanEval, MBPP) and reasoning (BBH, MATH), with gains over the Instruct variant in several areas.

## Usage

To load and use this model with Hugging Face Transformers, see the sketch below. For block diffusion-style inference, customize the generation loop to manage the KV cache and block outputs as needed; a minimal decoding sketch also follows.
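A minimal loading sketch. The repo id `maomaocun/dLLM-Var` and the need for `trust_remote_code=True` are assumptions carried over from the upstream LLaDA 8B checkpoints; check the files shipped with this checkpoint.

```python
# Minimal loading sketch. The repo id and the need for trust_remote_code=True
# are assumptions carried over from the upstream LLaDA 8B checkpoints.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "maomaocun/dLLM-Var"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()

prompt = "Lily runs 12 kilometers per hour for 4 hours. How far does she run?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
```

And a hedged sketch of a block diffusion-style decoding loop: append a fully masked block, iteratively reveal the most confident predictions, and stop once the model emits EOS. The mask id, block size, and unmasking schedule are illustrative assumptions, not the authors' sampler; for brevity it re-encodes the whole sequence every step instead of caching the prefix K/V.

```python
# Hedged sketch of block diffusion-style decoding: append a fully masked block,
# iteratively reveal the most confident predictions, and stop once EOS appears.
# The mask id, block size, and unmasking schedule are illustrative assumptions.
# For brevity the whole sequence is re-encoded every step; because the model
# keeps standard full attention, a real implementation can instead cache the
# K/V of the already-decoded prefix and only recompute the active block.
import torch

MASK_ID = tokenizer.mask_token_id or 126336  # assumption: LLaDA's released code uses 126336
EOS_ID = tokenizer.eos_token_id
BLOCK_SIZE, DENOISE_STEPS, MAX_BLOCKS = 32, 32, 8

seq = inputs["input_ids"]
with torch.inference_mode():
    for _ in range(MAX_BLOCKS):
        # Append a fully masked block after the tokens decoded so far.
        block = torch.full((1, BLOCK_SIZE), MASK_ID, dtype=seq.dtype, device=seq.device)
        seq = torch.cat([seq, block], dim=-1)

        for step in range(DENOISE_STEPS):
            still_masked = seq[:, -BLOCK_SIZE:] == MASK_ID
            if not still_masked.any():
                break
            logits = model(seq).logits[:, -BLOCK_SIZE:, :]  # mask-predictor logits for the block
            conf, pred = logits.softmax(dim=-1).max(dim=-1)
            # Reveal a fraction of the remaining masked positions, highest confidence first.
            k = max(1, int(still_masked.sum()) // (DENOISE_STEPS - step))
            conf = conf.masked_fill(~still_masked, float("-inf"))
            idx = conf.topk(k, dim=-1).indices
            seq[:, -BLOCK_SIZE:] = seq[:, -BLOCK_SIZE:].scatter(1, idx, pred.gather(1, idx))

        # The card states the model is trained to emit EOS at the response boundary.
        if (seq[0, -BLOCK_SIZE:] == EOS_ID).any():
            break

print(tokenizer.decode(seq[0], skip_special_tokens=True))
```

The confidence-based unmasking schedule above is a common heuristic for masked-diffusion samplers; the authors' actual sampler and its KV-cache management may differ, so prefer any generation utilities shipped with the checkpoint.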
# maomaocun/dLLM-Var-no-template
Like dLLM-Var above, this model is a fine-tuned version of the LLaDA 8B Base model, obtained through the same specialized Supervised Fine-Tuning (SFT) process: it drops the complex attention-mask design usually associated with block diffusion, keeps standard full attention, runs block diffusion-style inference with KV cache reuse, and emits an EOS token at the response boundary to exit generation cleanly. It likewise balances computational efficiency with high-quality generation for structured, multi-step reasoning tasks.

## Benchmarks

The following table compares performance across key evaluation benchmarks. Results are reported as accuracy percentages where applicable.

| Model | GSM8K | GPQA | BBH | MATH | HumanEval | MBPP | MMLU-Pro | MMLU-Generate |
|---|---|---|---|---|---|---|---|---|
| LLaDA 8B Base (pure diffusion) | 69.06 | 31.91 | 44.77 | 30.84 | 32.92 | 40.8 | 24.26 | 65.9 |
| LLaDA 8B Instruct (pure diffusion) | 77.48 | 29.01 | 51.49 | 22.32 | 38.71 | 39.2 | 36.41 | 65.5 |
| LLaDA-Prometheus (block diffusion) | 77.4 | 33.03 | 48.74 | 31.94 | 40.24 | 42 | 33.45 | 65.53 |

These results demonstrate competitive performance, particularly on code generation (HumanEval, MBPP) and reasoning (BBH, MATH), with gains over the Instruct variant in several areas.

## Usage

To load and use this model with Hugging Face Transformers, see the sketch below. For block diffusion-style inference, customize the generation loop to manage the KV cache and block outputs as needed; the decoding sketch shown for dLLM-Var above applies unchanged.
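Loading follows the same pattern as for dLLM-Var; only the (assumed) repo id changes.

```python
# Minimal loading sketch; the repo id and trust_remote_code requirement are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "maomaocun/dLLM-Var-no-template"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).to("cuda").eval()
```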