kuleshov-group
mdlm-owt
PlantCaduceus_l32
PlantCAD2-Small-l24-d0768
caduceus-ps_seqlen-131k_d_model-256_n_layer-16
PlantCaduceus_l20
PlantCAD2-Large-l48-d1536
proseco-llada-sft
bd3lm-owt-block_size4
PlantCaduceus_l24
PlantCaduceus_l28
caduceus-ph_seqlen-131k_d_model-256_n_layer-16
e2d2-gsm8k-finetune-Qwen3-2B
Quick start guide

To use this model, follow the snippet below.

Model details
- Fine-tuned from `Qwen/Qwen3-1.7B-Base` on `openai/gsm8k`
- Qwen3 tokenizer: `Qwen/Qwen3-1.7B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
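The card's own snippet is not reproduced here, so below is a minimal loading sketch. It assumes the repo id matches the listing (`kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B`) and that the checkpoint ships custom modeling code requiring `trust_remote_code=True`, as diffusion LMs from this organization typically do; the helper name is ours.

```python
def load_e2d2_gsm8k(repo_id: str = "kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B"):
    """Load the E2D2 GSM8K checkpoint and its Qwen3 tokenizer.

    transformers is imported lazily so this sketch can be read and
    tested without the package installed.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # The card states the model uses the Qwen3 tokenizer from Qwen/Qwen3-1.7B-Base.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")
    # trust_remote_code is an assumption: block-diffusion checkpoints usually
    # register a custom architecture rather than a stock one.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Consult the project site linked above for the authoritative usage instructions.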
e2d2-owt
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `Skylion007/openwebtext`
- `gpt2` tokenizer
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
mdlm-no_flashattn-fp32-owt
e2d2-cnndm
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `abisee/cnndailymail`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 8

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
bd3lm-owt-block_size8
bd3lm-owt-block_size1024-pretrain
e2d2-wmt
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `wmt/wmt14`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
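As with the other E2D2 cards, the referenced snippet is not reproduced here; this is a hedged sketch under the same assumptions (repo id from the listing, custom modeling code behind `trust_remote_code=True`, helper name ours).

```python
def load_e2d2_wmt(repo_id: str = "kuleshov-group/e2d2-wmt"):
    """Load the E2D2 translation model trained from scratch on wmt/wmt14.

    transformers is imported lazily so the sketch can be inspected offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # The card states the model reuses the Qwen3 tokenizer from Qwen/Qwen3-0.6B-Base.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
    # trust_remote_code is an assumption, not confirmed by the card text.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```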
PlantCAD2-Medium-l48-d1024
bd3lm-owt-block_size16
udlm-qm9
To use this pre-trained model with the HuggingFace APIs, use the following snippet:

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 32 tokens and 92M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `yairschiff/qm9-tokenizer` tokenizer, a custom tokenizer for parsing SMILES strings. We trained for 25k gradient update steps with a batch size of 2,048. We used a linear warm-up over 1,000 steps to a learning rate of 3e-4, then applied cosine decay down to a minimum learning rate of 3e-6.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
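The snippet the card refers to is not reproduced in this listing; the sketch below is an assumption-laden stand-in. It presumes the repo id `kuleshov-group/udlm-qm9` from the listing, loads the `yairschiff/qm9-tokenizer` named on the card, and guesses `AutoModelForMaskedLM` with `trust_remote_code=True` as the entry point; the helper name is ours.

```python
def load_udlm_qm9(repo_id: str = "kuleshov-group/udlm-qm9"):
    """Load the UDLM QM9 checkpoint and its custom SMILES tokenizer.

    transformers is imported lazily so this sketch is inspectable offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Custom SMILES tokenizer named on the card; it may also require
    # trust_remote_code since it is a custom implementation.
    tokenizer = AutoTokenizer.from_pretrained(
        "yairschiff/qm9-tokenizer", trust_remote_code=True
    )
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Note the 32-token context size stated above when batching SMILES strings.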
udlm-lm1b
To use this pre-trained model with the HuggingFace APIs, use the following snippet:

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 128 tokens and 139M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `bert-base-uncased` tokenizer. We trained for 1M gradient update steps with a batch size of 512. We used a linear warm-up over 2,500 steps to a constant learning rate of 3e-4.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
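Again, the card's snippet is absent from this listing, so here is a hedged sketch: repo id from the listing, the `bert-base-uncased` tokenizer named on the card, and an assumed `AutoModelForMaskedLM` / `trust_remote_code=True` loading path; the helper name is ours.

```python
def load_udlm_lm1b(repo_id: str = "kuleshov-group/udlm-lm1b"):
    """Load the UDLM LM1B checkpoint with its BERT tokenizer.

    transformers is imported lazily so this sketch is inspectable offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Standard tokenizer named on the card.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # trust_remote_code is an assumption: UDLM is a custom diffusion
    # architecture, not a stock transformers model class.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Inputs should respect the 128-token context size stated above.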
mdlm-owt-noeos
caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
caduceus-ps_seqlen-1k_d_model-118_n_layer-4_lr-8e-3
e2d2-gsm8k-finetune-Qwen3-2B-TEST
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
caduceus-ph_seqlen-1k_d_model-118_n_layer-4_lr-8e-3
proseco-owt
sedd-noeos-owt
Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[](https://arxiv.org/abs/2503.09573) [](https://github.com/kuleshov-group/bd3lms) [](https://m-arriola.com/bd3lms/) [](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained baseline model from SEDD. Unlike Austin et al., we train our SEDD baseline on OpenWebText without injecting BOS/EOS tokens at the beginning/end of the training context. This allows us to analyze the lengths of generated samples at inference without the artificial BOS/EOS injection confounding the length statistics.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.
ar-noeos-owt
Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[](https://arxiv.org/abs/2503.09573) [](https://github.com/kuleshov-group/bd3lms) [](https://m-arriola.com/bd3lms/) [](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained AR baseline model from MDLM. Unlike Sahoo et al., we train our MDLM baseline on OpenWebText without injecting BOS/EOS tokens at the beginning/end of the training context. This allows us to generate sequences longer than 1024 tokens at inference.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.