kuleshov-group
mdlm-owt
PlantCaduceus_l32
PlantCAD2-Small-l24-d0768
caduceus-ps_seqlen-131k_d_model-256_n_layer-16
PlantCaduceus_l20
PlantCAD2-Large-l48-d1536
proseco-llada-sft
bd3lm-owt-block_size4
PlantCaduceus_l24
PlantCaduceus_l28
caduceus-ph_seqlen-131k_d_model-256_n_layer-16
e2d2-gsm8k-finetune-Qwen3-2B
Quick start guide

To use this model, follow the snippet below.

Model details
- Fine-tuned from `Qwen/Qwen3-1.7B-Base` on `openai/gsm8k`
- Qwen3 tokenizer: `Qwen/Qwen3-1.7B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
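The card's own snippet is not reproduced here, so below is a minimal loading sketch. It assumes the repo id matches the listing (`kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B`) and that the checkpoint ships custom modeling code requiring `trust_remote_code=True`, as diffusion LMs from this organization typically do; the helper name is ours.

```python
def load_e2d2_gsm8k(repo_id: str = "kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B"):
    """Load the E2D2 GSM8K checkpoint and its Qwen3 tokenizer.

    transformers is imported lazily so this sketch can be read and
    tested without the package installed.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # The card states the model uses the Qwen3 tokenizer from Qwen/Qwen3-1.7B-Base.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")
    # trust_remote_code is an assumption: block-diffusion checkpoints usually
    # register a custom architecture rather than a stock one.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Consult the project site linked above for the authoritative usage instructions.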
e2d2-owt
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `Skylion007/openwebtext`
- `gpt2` tokenizer
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
mdlm-no_flashattn-fp32-owt
e2d2-cnndm
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `abisee/cnndailymail`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 8

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
bd3lm-owt-block_size8
bd3lm-owt-block_size1024-pretrain
e2d2-wmt
Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `wmt/wmt14`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
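As with the other E2D2 cards, the referenced snippet is not reproduced here; this is a hedged sketch under the same assumptions (repo id from the listing, custom modeling code behind `trust_remote_code=True`, helper name ours).

```python
def load_e2d2_wmt(repo_id: str = "kuleshov-group/e2d2-wmt"):
    """Load the E2D2 translation model trained from scratch on wmt/wmt14.

    transformers is imported lazily so the sketch can be inspected offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # The card states the model reuses the Qwen3 tokenizer from Qwen/Qwen3-0.6B-Base.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
    # trust_remote_code is an assumption, not confirmed by the card text.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```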
PlantCAD2-Medium-l48-d1024
bd3lm-owt-block_size16
udlm-qm9
To use this pre-trained model with the HuggingFace APIs, use the following snippet:

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 32 tokens and 92M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `yairschiff/qm9-tokenizer` tokenizer, a custom tokenizer for parsing SMILES strings. We trained for 25k gradient update steps with a batch size of 2,048. We used a linear warm-up over 1,000 steps to a learning rate of 3e-4, then applied cosine decay down to a minimum learning rate of 3e-6.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
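The snippet the card refers to is not reproduced in this listing; the sketch below is an assumption-laden stand-in. It presumes the repo id `kuleshov-group/udlm-qm9` from the listing, loads the `yairschiff/qm9-tokenizer` named on the card, and guesses `AutoModelForMaskedLM` with `trust_remote_code=True` as the entry point; the helper name is ours.

```python
def load_udlm_qm9(repo_id: str = "kuleshov-group/udlm-qm9"):
    """Load the UDLM QM9 checkpoint and its custom SMILES tokenizer.

    transformers is imported lazily so this sketch is inspectable offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Custom SMILES tokenizer named on the card; it may also require
    # trust_remote_code since it is a custom implementation.
    tokenizer = AutoTokenizer.from_pretrained(
        "yairschiff/qm9-tokenizer", trust_remote_code=True
    )
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Note the 32-token context size stated above when batching SMILES strings.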
udlm-lm1b
To use this pre-trained model with the HuggingFace APIs, use the following snippet:

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 128 tokens and 139M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `bert-base-uncased` tokenizer. We trained for 1M gradient update steps with a batch size of 512. We used a linear warm-up over 2,500 steps to a constant learning rate of 3e-4.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
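Again, the card's snippet is absent from this listing, so here is a hedged sketch: repo id from the listing, the `bert-base-uncased` tokenizer named on the card, and an assumed `AutoModelForMaskedLM` / `trust_remote_code=True` loading path; the helper name is ours.

```python
def load_udlm_lm1b(repo_id: str = "kuleshov-group/udlm-lm1b"):
    """Load the UDLM LM1B checkpoint with its BERT tokenizer.

    transformers is imported lazily so this sketch is inspectable offline.
    """
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # Standard tokenizer named on the card.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # trust_remote_code is an assumption: UDLM is a custom diffusion
    # architecture, not a stock transformers model class.
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    return tokenizer, model
```

Inputs should respect the 128-token context size stated above.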
mdlm-owt-noeos
caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3
caduceus-ps_seqlen-1k_d_model-118_n_layer-4_lr-8e-3
e2d2-gsm8k-finetune-Qwen3-2B-TEST
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
caduceus-ph_seqlen-1k_d_model-118_n_layer-4_lr-8e-3
proseco-owt
sedd-noeos-owt
Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[](https://arxiv.org/abs/2503.09573) [](https://github.com/kuleshov-group/bd3lms) [](https://m-arriola.com/bd3lms/) [](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained baseline model from SEDD. Unlike Austin et al., we train our SEDD baseline on OpenWebText without injecting BOS/EOS tokens at the beginning/end of the training context. This allows us to analyze the lengths of generated samples at inference without the artificial BOS/EOS injection confounding the length statistics.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.
ar-noeos-owt
Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[](https://arxiv.org/abs/2503.09573) [](https://github.com/kuleshov-group/bd3lms) [](https://m-arriola.com/bd3lms/) [](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained AR baseline model from MDLM. Unlike Sahoo et al., we train our MDLM baseline on OpenWebText without injecting BOS/EOS tokens at the beginning/end of the training context. This allows us to generate sequences longer than 1024 tokens at inference.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.