InstaDeepAI

39 models

nucleotide-transformer-500m-human-ref

---
license: cc-by-nc-sa-4.0
widget:
  - text: ACCTGATTCTGAGTC
tags:
  - DNA
  - biology
  - genomics
datasets:
  - InstaDeepAI/human_reference_genome
  - InstaDeepAI/nucleotide_transformer_downstream_tasks
---

license: cc-by-nc-sa-4.0 · 968,694 downloads · 14 likes

nucleotide-transformer-v2-50m-multi-species

The Nucleotide Transformers are a collection of foundation language models pre-trained on DNA sequences from whole genomes. Unlike approaches built on a single reference genome, these models leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.

Part of this collection is nucleotide-transformer-v2-50m-multi-species, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.

- Repository: Nucleotide Transformer
- Paper: The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Until its next release, the `transformers` library needs to be installed from source in order to use the models; the install command and a small snippet of code that retrieves both logits and embeddings from a dummy DNA sequence are given at the end of this card.

The nucleotide-transformer-v2-50m-multi-species model was pre-trained on a total of 850 genomes downloaded from NCBI. Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were picked for inclusion in the collection, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as a HuggingFace dataset.

The DNA sequences are tokenized using the Nucleotide Transformer tokenizer, which tokenizes sequences as 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the Tokenization section of the associated repository. This tokenizer has a vocabulary size of 4,105. The inputs of the model are then of the form illustrated in the tokenization sketch at the end of this card, and tokenized sequences have a maximum length of 1,000 tokens.

The masking procedure is the standard one for BERT-style training (a sketch is also given below):
- 15% of the tokens are masked.
- In 80% of cases, a masked token is replaced by `[MASK]`.
- In 10% of cases, a masked token is replaced by a random token different from the one it replaces.
- In the remaining 10% of cases, a masked token is left as is.

The model was trained on 8 A100 80GB GPUs for 300B tokens, with an effective batch size of 1M tokens and a sequence length of 1,000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999, ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, then decreased following a square-root decay until the end of training.

The model belongs to the second generation of Nucleotide Transformers; the architectural changes consist of the use of rotary positional embeddings instead of learned ones and the introduction of Gated Linear Units.
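As noted above, the models require a source install of `transformers`, and the card's original install command and code snippet were not captured in this extract. The following is a minimal sketch under stated assumptions: the standard `transformers` source-install command, the stock masked-LM API, and `trust_remote_code=True` for the custom v2 architecture; the model id comes from this listing.

```python
# Assumed source-install command (the card's exact command was not captured):
#   pip install --upgrade git+https://github.com/huggingface/transformers.git
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "InstaDeepAI/nucleotide-transformer-v2-50m-multi-species"

# trust_remote_code=True is an assumption: the v2 models use a custom
# architecture (rotary positional embeddings, Gated Linear Units).
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# Dummy DNA sequence.
tokens = tokenizer(["ATTCCGATTCCGATTCCG"], return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens, output_hidden_states=True)

logits = outputs.logits                 # (batch, seq_len, vocab_size)
embeddings = outputs.hidden_states[-1]  # last-layer token embeddings
print(logits.shape, embeddings.shape)
```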
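To make the 6-mer tokenization rule concrete, here is a toy re-implementation of the behavior described above. It is an illustration only, not the real tokenizer, and the `<CLS>` prefix shown in the output comment is an assumption based on the usual encoder setup.

```python
def six_mer_tokenize(seq: str) -> list[str]:
    """Cut a DNA sequence into 6-mers left to right; leftover bases
    become single-nucleotide tokens, as the card describes."""
    tokens, i = [], 0
    while i + 6 <= len(seq):
        tokens.append(seq[i:i + 6])
        i += 6
    tokens.extend(seq[i:])  # one token per remaining nucleotide
    return tokens

print(six_mer_tokenize("ACCTGATTCTGAGTC"))
# ['ACCTGA', 'TTCTGA', 'G', 'T', 'C']
# Model inputs would then take the (assumed) form:
#   <CLS> ACCTGA TTCTGA G T C
```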
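The masking procedure listed above is standard BERT masking. A sketch of what it amounts to follows; `bert_style_mask` is a hypothetical helper written for illustration, not a function from the InstaDeepAI codebase.

```python
import torch

def bert_style_mask(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int):
    """Apply 15% masking with the 80/10/10 split described above.
    Unselected positions get label -100 so the loss ignores them."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < 0.15       # 15% of tokens
    labels[~selected] = -100

    masked = input_ids.clone()
    roll = torch.rand(input_ids.shape)
    masked[selected & (roll < 0.8)] = mask_token_id     # 80%: [MASK]
    replace = selected & (roll >= 0.8) & (roll < 0.9)   # 10%: random token
    masked[replace] = torch.randint(vocab_size, input_ids.shape)[replace]
    # Remaining 10%: left as is. (The card also requires the random token to
    # differ from the original; this sketch skips that check.)
    return masked, labels
```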
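The warmup-then-decay schedule from the training paragraph can be written out as a small function. The exact decay form is an assumption; this sketch takes "square-root decay" to mean decay proportional to the inverse square root of the step count.

```python
def learning_rate(step: int, warmup: int = 16_000,
                  lr_start: float = 5e-5, lr_peak: float = 1e-4) -> float:
    """Linear warmup from 5e-5 to 1e-4 over 16k steps, then square-root
    decay from the peak (assumed form of the decay)."""
    if step < warmup:
        return lr_start + (lr_peak - lr_start) * step / warmup
    return lr_peak * (warmup / step) ** 0.5
```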

license: cc-by-nc-sa-4.0 · 20,847 downloads · 5 likes

nucleotide-transformer-v2-500m-multi-species

license: cc-by-nc-sa-4.0 · 20,208 downloads · 27 likes

NTv3_650M_post

12,819 downloads · 2 likes

agro-nucleotide-transformer-1b

license: cc-by-nc-sa-4.0 · 7,624 downloads · 18 likes

nucleotide-transformer-v2-250m-multi-species

license: cc-by-nc-sa-4.0 · 7,118 downloads · 3 likes

nucleotide-transformer-2.5b-multi-species

license: cc-by-nc-sa-4.0 · 7,057 downloads · 41 likes

NTv3_8M_pre

6,820 downloads · 2 likes

NTv3_100M_post

3,125 downloads · 2 likes

NTv3_100M_pre

3,054 downloads · 1 like

nucleotide-transformer-v2-100m-multi-species

license: cc-by-nc-sa-4.0 · 2,917 downloads · 1 like

NTv3_650M_pre

2,720 downloads · 4 likes

nucleotide-transformer-500m-1000g

license: cc-by-nc-sa-4.0 · 1,858 downloads · 7 likes

ChatNT

1,124 downloads · 13 likes

nucleotide-transformer-v2-50m-3mer-multi-species

license: cc-by-nc-sa-4.0 · 678 downloads · 3 likes

BulkRNABert

BulkRNABert is a transformer-based, encoder-only language model pre-trained on bulk RNA-seq profiles from the TCGA dataset using self-supervised masked language modeling, following the original BERT framework. The model is trained to reconstruct randomly masked gene-expression values from their genomic context, enabling it to learn biologically meaningful representations of transcriptomic profiles. Once pre-trained, BulkRNABert can be fine-tuned for various cancer-related downstream tasks, such as cancer type classification or survival analysis, by extracting embeddings from the model.

- Repository
- Paper: BulkRNABert: Cancer prognosis from bulk RNA-seq based language models

Until its next release, the `transformers` library needs to be installed from source in order to use the model; PyTorch should also be installed. The install command and a small snippet of code to run inference on bulk RNA-seq samples from the TCGA dataset are given below.

Other notes: we also provide the params for the BulkRNABert JAX model in `jaxparams`.
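Neither the install command nor the inference snippet survived in this extract, so the following is a minimal sketch under stated assumptions: the standard `transformers` source-install command, the auto classes with `trust_remote_code=True`, and a stubbed input in place of the real TCGA preprocessing (gene ordering and expression binning, which the repository defines).

```python
# Assumed source-install command, plus PyTorch as the card requires:
#   pip install --upgrade git+https://github.com/huggingface/transformers.git
#   pip install torch
import torch
from transformers import AutoModel

# trust_remote_code=True is an assumption for the custom BulkRNABert code.
model = AutoModel.from_pretrained("InstaDeepAI/BulkRNABert", trust_remote_code=True)
model.eval()

# Stub input: one sample of tokenized gene-expression values. Sequence length
# and vocabulary here are placeholders; the real pipeline is in the repository.
dummy_input_ids = torch.randint(0, 64, (1, 512))

with torch.no_grad():
    outputs = model(input_ids=dummy_input_ids)

# Per the card, embeddings extracted from the model feed downstream tasks
# such as cancer type classification or survival analysis.
print(outputs.last_hidden_state.shape)
```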

454 downloads · 1 like

nucleotide-transformer-2.5b-1000g

license: cc-by-nc-sa-4.0 · 199 downloads · 8 likes

NTv3_8M_pre_8kb

119 downloads · 0 likes

segment_nt

license: cc-by-nc-sa-4.0 · 98 downloads · 9 likes

NTv3_100M_post_131kb

88 downloads · 0 likes

instanovo-v1.0.0

license: cc-by-nc-sa-4.0 · 87 downloads · 0 likes

MOJO

83 downloads · 0 likes

segment_nt_multi_species

license: cc-by-nc-sa-4.0 · 76 downloads · 1 like

instanovo-phospho-v1.0.0

license: cc-by-nc-sa-4.0 · 71 downloads · 0 likes

instanovo-v1.1.0

license: cc-by-nc-sa-4.0 · 66 downloads · 0 likes

instanovoplus-v1.1.0

license: cc-by-nc-sa-4.0 · 32 downloads · 0 likes

NTv3_650M_post_131kb

31 downloads · 0 likes

NTv3_650M_pre_8kb

25 downloads · 0 likes

isoformer

21 downloads · 6 likes

NTv3_100M_pre_8kb

19 downloads · 0 likes

AbBFN2

license: apache-2.0 · 13 downloads · 1 like

IDP-ESM2-8M

11 downloads · 0 likes

sCellTransformer

8 downloads · 0 likes

segment_enformer

7 downloads · 1 like

IDP-ESM2-150M

5 downloads · 0 likes

segment_borzoi

2 downloads · 1 like

protein-sequence-bfn

license: cc-by-4.0 · 0 downloads · 8 likes

protein-structure-tokenizer

0 downloads · 2 likes

jumanji-benchmark-a2c-CVRP-v1

license: apache-2.0 · 0 downloads · 1 like