kuleshov-group

31 models

mdlm-owt

license:apache-2.0
30,656
22

PlantCaduceus_l32

license:apache-2.0
3,120
9

PlantCAD2-Small-l24-d0768

license:apache-2.0
1,830
0

caduceus-ps_seqlen-131k_d_model-256_n_layer-16

license:apache-2.0
1,541
14

PlantCaduceus_l20

license:apache-2.0
1,464
1

PlantCAD2-Large-l48-d1536

license:apache-2.0
1,439
0

proseco-llada-sft

license:apache-2.0
838
1

bd3lm-owt-block_size4

license:apache-2.0
835
3

PlantCaduceus_l24

license:apache-2.0
679
0

PlantCaduceus_l28

license:apache-2.0
676
1

caduceus-ph_seqlen-131k_d_model-256_n_layer-16

license:apache-2.0
674
6

e2d2-gsm8k-finetune-Qwen3-2B

Quick start guide

To use this model, follow the snippet below.

Model details
- Fine-tuned from `Qwen/Qwen3-1.7B-Base` on `openai/gsm8k`
- Qwen3 tokenizer: `Qwen/Qwen3-1.7B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/
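The quick-start snippet itself was not captured on this page. Below is a minimal loading sketch, assuming the checkpoint ships custom modeling code (hence `trust_remote_code=True`) and the repo id `kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B` (inferred from the listing, not confirmed). The `split_into_blocks` helper only illustrates the block-size-4 decomposition that block diffusion denoises over:

```python
def split_into_blocks(ids, block_size=4):
    """Partition a token sequence into contiguous blocks; block diffusion
    denoises tokens within each block while conditioning on earlier blocks."""
    return [ids[i:i + block_size] for i in range(0, len(ids), block_size)]


def load_e2d2_gsm8k():
    """Load tokenizer and model from the Hub (repo id is an assumption)."""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")
    model = AutoModel.from_pretrained(
        "kuleshov-group/e2d2-gsm8k-finetune-Qwen3-2B",
        trust_remote_code=True,
    )
    return tokenizer, model
```

Consult the project site linked above for the authoritative usage code.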

license:apache-2.0
463
0

e2d2-owt

Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `Skylion007/openwebtext`
- `gpt2` tokenizer
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/

license:apache-2.0
378
0

caduceus-ph_seqlen-1k_d_model-256_n_layer-4_lr-8e-3

299
1

mdlm-no_flashattn-fp32-owt

license:apache-2.0
271
5

e2d2-cnndm

Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `abisee/cnn_dailymail`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 8

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/

license:apache-2.0
266
0

bd3lm-owt-block_size8

license:apache-2.0
213
1

bd3lm-owt-block_size1024-pretrain

license:apache-2.0
180
1

e2d2-wmt

Quick start guide

To use this model, follow the snippet below.

Model details
- Trained from scratch on `wmt/wmt14`
- Qwen3 tokenizer: `Qwen/Qwen3-0.6B-Base`
- Block diffusion parameterization, with block size 4

See the project site for more details and links to the paper and code: https://m-arriola.com/e2d2/

license:apache-2.0
117
0

PlantCAD2-Medium-l48-d1024

license:apache-2.0
97
0

bd3lm-owt-block_size16

license:apache-2.0
69
17

udlm-qm9

To use this pre-trained model with the HuggingFace APIs, use the following snippet.

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 32 tokens and 92M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- a hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `yairschiff/qm9-tokenizer` tokenizer, a custom tokenizer for parsing SMILES strings. We trained for 25k gradient update steps with a batch size of 2,048. We used a linear warm-up of 1,000 steps to a learning rate of 3e-4, then applied cosine decay down to a minimum learning rate of 3e-6.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
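The learning-rate schedule described above (linear warm-up to 3e-4 over 1,000 steps, then cosine decay to 3e-6 over the 25k-step run) can be sketched as a pure function; this is an illustration of the stated schedule, not the training code itself:

```python
import math


def udlm_qm9_lr(step, warmup=1_000, total=25_000, peak=3e-4, floor=3e-6):
    """Linear warm-up to the peak LR, then cosine decay to the floor,
    matching the schedule described for udlm-qm9."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at end of warm-up, 1 at end
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at step 1,000 it reaches the 3e-4 peak, and at step 25,000 it has decayed to the 3e-6 floor.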

license:apache-2.0
62
0

udlm-lm1b

To use this pre-trained model with the HuggingFace APIs, use the following snippet.

UDLM stands for Uniform Diffusion Language Models. This model was trained using the refined uniform-noise discrete diffusion continuous-time ELBO introduced here. The model has a context size of 128 tokens and 139M parameters.

The model architecture is based on the Diffusion Transformer architecture and consists of:
- 12 multi-head attention blocks (with 12 attention heads),
- a hidden dimension of 768,
- `adaLN` for conditioning on the time-step (i.e., during diffusion training / generation).

The model was trained using the `bert-base-uncased` tokenizer. We trained for 1M gradient update steps with a batch size of 512, using a linear warm-up of 2,500 steps to a constant learning rate of 3e-4.

For more details, please refer to our work: Simple Guidance Mechanisms for Discrete Diffusion Models.

Citation

Please cite our work using the bibtex below:
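The referenced snippet was not captured on this page. A minimal loading sketch under stated assumptions: the repo id `kuleshov-group/udlm-lm1b` is inferred from the listing, and the custom diffusion architecture is assumed to load through `trust_remote_code=True`. The small helper just encodes the 128-token context limit mentioned above:

```python
def clip_to_context(ids, context=128):
    """udlm-lm1b has a 128-token context; truncate longer inputs."""
    return ids[:context]


def load_udlm_lm1b():
    """Load tokenizer and model from the Hub (repo id is an assumption)."""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained(
        "kuleshov-group/udlm-lm1b", trust_remote_code=True
    )
    return tokenizer, model
```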

license:apache-2.0
38
0

mdlm-owt-noeos

license:apache-2.0
28
0

caduceus-ps_seqlen-1k_d_model-256_n_layer-4_lr-8e-3

14
2

caduceus-ps_seqlen-1k_d_model-118_n_layer-4_lr-8e-3

14
1

e2d2-gsm8k-finetune-Qwen3-2B-TEST

This is the model card of a 🤗 transformers model that has been pushed on the Hub. The card was automatically generated and provides no further details: developers, license, model type, training data, hardware, and carbon emissions are all listed as "More Information Needed".

14
0

caduceus-ph_seqlen-1k_d_model-118_n_layer-4_lr-8e-3

12
1

proseco-owt

license:apache-2.0
11
1

sedd-noeos-owt

Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[Paper](https://arxiv.org/abs/2503.09573) [Code](https://github.com/kuleshov-group/bd3lms) [Project site](https://m-arriola.com/bd3lms/) [Model collection](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained baseline model from SEDD. Unlike Austin et al., we train our SEDD baseline on OpenWebText without injecting BOS/EOS at the beginning/end of the training context. This allows us to analyze the lengths of generated samples at inference without the artificial BOS/EOS injection confounding the length statistics.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.

license:apache-2.0
7
0

ar-noeos-owt

Block Diffusion Interpolates Between Autoregressive and Diffusion Language Models (ICLR 2025 Oral)

By Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

[Paper](https://arxiv.org/abs/2503.09573) [Code](https://github.com/kuleshov-group/bd3lms) [Project site](https://m-arriola.com/bd3lms/) [Model collection](https://huggingface.co/collections/kuleshov-group/bd3-lms-67be95f81b96b15fec50d53f)

We introduce BD3-LMs, a family of Block Discrete Denoising Diffusion Language Models that achieve SOTA likelihoods among diffusion models and enable generation of arbitrary-length sequences. BD3-LMs combine the strengths of autoregressive and diffusion language models by decomposing a token sequence into blocks and performing discrete diffusion within each block. By tuning the block size, we interpolate between autoregressive and diffusion models, which introduces a trade-off between quality and sample efficiency. We propose a recipe for building effective BD3-LMs that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize that variance.

Model Description

This is a retrained AR baseline model from MDLM. Unlike Sahoo et al., we train our MDLM baseline on OpenWebText without injecting BOS/EOS at the beginning/end of the training context. This allows us to generate sequences longer than 1024 tokens at inference.

How to use

See our GitHub README, where we provide sample scripts for training, likelihood evaluation, and generation.
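The actual preprocessing lives in the GitHub repository linked above; as a rough sketch of what "no BOS/EOS injection" means for data packing (function and variable names are hypothetical), contexts are simply chunked out of a flat token stream with no special tokens inserted at the boundaries:

```python
def pack_contexts(token_stream, context_len=1024):
    """Pack a flat token stream into fixed-length training contexts
    WITHOUT inserting BOS/EOS at context boundaries, as described for
    the no-EOS baselines. A trailing partial context is dropped."""
    n_full = len(token_stream) // context_len
    return [
        token_stream[i * context_len:(i + 1) * context_len]
        for i in range(n_full)
    ]
```

Because no EOS marks the end of a context, samples drawn at inference are not biased toward terminating at the 1024-token boundary, which is what enables the length analyses and longer-than-1024 generation mentioned above.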

license:apache-2.0
7
0