# plant-dnamodernbert-BPE
## Model description

The plant DNA large language models (LLMs) are a series of foundation models built on different architectures and pre-trained on various plant reference genomes. All models have a comparable size of 90 MB to 150 MB; a BPE tokenizer is used for tokenization, with a vocabulary of 8,000 tokens.

- Repository: Plant DNA LLMs
- Manuscript: PDLLMs: A group of tailored DNA large language models for analyzing plant genomes

This model is based on the ModernBERT architecture with a tokenizer modified for DNA sequences.

Here is a simple code example for inference (note that the Mamba models require an NVIDIA GPU for inference):

## Training data

We pre-trained the model with the masked language modeling (MaskedLM) objective; tokenized sequences have a maximum length of 1024 tokens. The detailed training procedure can be found in our manuscript. Training used FlashAttention-2 to accelerate the process.

## Hardware

The model was pre-trained on an NVIDIA RTX 4090 GPU (24 GB).
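The inference snippet referenced above did not survive extraction. Below is a minimal sketch using the Hugging Face `transformers` masked-LM API; the repository path `zhangtaolab/plant-dnamodernbert-BPE` and the use of `trust_remote_code` are assumptions based on this card, not confirmed details.

```python
# Minimal masked-base inference sketch for a plant DNA LLM.
# NOTE: the repository path below is an assumption based on this card's
# model name; verify it against the actual Hugging Face repository.

def mask_base(sequence: str, position: int, mask_token: str = "[MASK]") -> str:
    """Replace a single base in a DNA sequence with the mask token."""
    if not 0 <= position < len(sequence):
        raise IndexError("position is outside the sequence")
    return sequence[:position] + mask_token + sequence[position + 1:]

def predict_masked_base(sequence: str, position: int,
                        model_id: str = "zhangtaolab/plant-dnamodernbert-BPE") -> str:
    """Predict the most likely token at a masked position (downloads weights)."""
    import torch  # heavy dependencies imported lazily
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
    model.eval()

    masked = mask_base(sequence, position, tokenizer.mask_token)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the mask position in the tokenized input and decode the top prediction.
    mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    return tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))

# Example call (downloads model weights on first use):
# predict_masked_base("ATCGGCTAAGCTTGGATCCA", 10)
```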
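The MaskedLM objective used for pre-training can be illustrated with a simplified BERT-style masking function. This is a pedagogical sketch, not the authors' actual pipeline: the 8,000-token vocabulary size comes from this card, while the 15% masking rate and the 80/10/10 replacement split are the standard BERT defaults, assumed here.

```python
import random

def mlm_mask(token_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=None):
    """BERT-style masking for MaskedLM pre-training (simplified sketch).

    Of the positions selected for prediction, 80% become the mask token,
    10% become a random vocabulary token, and 10% stay unchanged.
    Labels are -100 (ignored by the loss) at all unselected positions.
    """
    rng = random.Random(seed)
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok  # this position contributes to the MLM loss
            r = rng.random()
            if r < 0.8:
                input_ids[i] = mask_token_id
            elif r < 0.9:
                input_ids[i] = rng.randrange(vocab_size)
            # else: keep the original token unchanged
    return input_ids, labels
```

In practice this logic is typically delegated to `transformers.DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)`, which applies the same scheme batch-wise during training.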