# C2S-Scale-Gemma-2-2B
## Links

- C2S-Scale Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- HuggingFace C2S Collection: C2S-Scale Models
- GitHub Repository: vandijklab/cell2sentence (for code, tutorials, and discussions)
- Google Research Blog Post: Teaching machines the language of biology

Authors: van Dijk Lab (Yale), Google Research, Google DeepMind

## Model Description

This section describes the C2S-Scale model and how to use it. C2S-Scale-Gemma-2-2B is a state-of-the-art, open language model built upon the Gemma-2 2B architecture and fine-tuned for single-cell biology. Developed through the Cell2Sentence (C2S) framework, the model processes and understands single-cell RNA sequencing (scRNA-seq) data by treating it as a language: it converts high-dimensional scRNA-seq expression data into "cell sentences", ordered sequences of gene names, enabling a wide range of biological analyses.

This work is the result of a collaboration between Yale University, Google Research, and Google DeepMind to scale up C2S models. The C2S-Scale models were trained on Google's TPU v5 chips, which allowed for a significant increase in model size and capability. These models excel at tasks such as cell type prediction, tissue classification, and generating biologically meaningful cell representations.

## Key Features

- Versatility: Demonstrates strong performance across a diverse set of single-cell and multi-cell tasks.
- Scalability: Trained on a dataset of over 57 million cells, showcasing the power of scaling LLMs for biological data.
- Generative power: Capable of generating realistic single-cell gene expression profiles.
- Foundation for fine-tuning: Can serve as a powerful pretrained foundation for specialized, domain-specific single-cell analysis tasks.

## Use Cases

C2S-Scale can be a valuable tool for researchers in the following areas:

- In silico experiments: Generate cells under specific conditions or predict perturbational changes to form and test new biological hypotheses.
- Cell atlas annotation: Streamline the annotation of large-scale single-cell datasets by predicting cell types and tissues.
- Biomarker discovery: Analyze gene patterns within cell sentences to identify potential markers for specific cell states or diseases.

## Getting Started

The model can be run locally on a GPU and used for the various tasks described in the C2S-Scale paper. To perform cell type prediction, the model expects a prompt containing the cell sentence followed by a query, combined in the prompt format the model was trained with for this task.

To get started quickly with tasks like cell type prediction and generation, see the Colab notebooks in our GitHub repository for examples of how to use C2S-Scale models: C2S Tutorials

## Model Architecture and Training

C2S-Scale is based on the Gemma-2 family of lightweight, state-of-the-art open LLMs, which use a decoder-only transformer architecture.

- Base model: Gemma-2 2B
- Model type: Decoder-only transformer (based on Gemma-2)
- Fine-tuning data: A collection of over 800 datasets from CellxGene and the Human Cell Atlas, totaling over 57 million human and mouse cells.
- Training approach: Instruction fine-tuning using the Cell2Sentence framework, which converts scRNA-seq expression data into sequences of gene tokens.
- Key publication: Scaling Large Language Models for Next-Generation Single-Cell Analysis

## Evaluation

The performance of C2S-Scale models was validated on a wide range of single-cell and multi-cell tasks, including advanced downstream tasks such as cluster captioning, question answering, and perturbation prediction. C2S-Scale models demonstrated significant improvements over other open- and closed-source models, establishing new state-of-the-art benchmarks for LLMs in single-cell biology. Please see our preprint for a full breakdown of performance metrics.

## Inputs and Outputs

Input: text.
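As a concrete illustration of what such a text input looks like, here is a minimal sketch of converting one cell's expression vector into a cell sentence and wrapping it in a cell type prediction prompt. The gene names, counts, helper names, and prompt template below are illustrative assumptions only; the exact template used during training is defined in the vandijklab/cell2sentence repository and tutorials.

```python
def to_cell_sentence(expression, gene_names, top_k=100):
    """Order gene names by descending expression, dropping zero-count genes.

    This mirrors the Cell2Sentence idea of representing a cell as a
    rank-ordered list of gene names; `top_k` caps the sentence length.
    """
    order = sorted(range(len(expression)), key=lambda i: -expression[i])
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked[:top_k])

def cell_type_prompt(cell_sentence):
    # Illustrative template only -- not the exact wording used in training.
    return (
        "The following is a cell sentence listing genes by descending "
        f"expression: {cell_sentence}. What cell type is this?"
    )

# Toy example with made-up genes and counts:
genes = ["CD3D", "MALAT1", "CD8A", "GNLY", "ACTB"]
counts = [5.0, 9.0, 3.0, 0.0, 7.0]
sentence = to_cell_sentence(counts, genes)
print(sentence)                    # MALAT1 ACTB CD3D CD8A
print(cell_type_prompt(sentence))
```

The zero-expression gene (GNLY here) is dropped, and ties would keep their original order because Python's sort is stable.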
For best performance, prompts should be structured according to the specific task (e.g., cell type prediction or conditioned generation). Inputs are "cell sentences": ordered, space-separated lists of gene names.

Output: text. The model generates a text response, which can be a predicted label (such as a cell type or tissue), a full cell sentence, or a natural-language abstract.

## Training Dataset

CellxGene and the Human Cell Atlas: the model was trained on a curated collection of over 800 public scRNA-seq datasets, encompassing more than 57 million cells. This data covers a broad range of tissues, cell types, and experimental conditions from both human and mouse, ensuring the model learns a robust and generalizable representation of cellular states.

## Evaluation Data

Evaluation was performed using held-out datasets and standardized benchmarks designed to test the model's capabilities on the tasks listed above. All evaluation methodologies followed established best practices for splitting data to ensure robust and unbiased assessment.

## License

The model weights shared on Hugging Face are released under CC-BY-4.0.

## Software

The model was trained using JAX, leveraging Google's TPU v5 hardware for efficient, large-scale training.

## Intended Usage

- Research in single-cell genomics and computational biology.
- As a foundational model for fine-tuning on specific biological domains or datasets.
- To aid in the annotation and interpretation of large-scale scRNA-seq experiments.

## Benefits and Limitations

C2S-Scale provides a powerful, versatile, and scalable tool for single-cell analysis. It offers:

- State-of-the-art performance on a wide range of scRNA-seq tasks.
- A unified framework for handling diverse single-cell analysis challenges.
- A foundation for building more specialized models from private or proprietary data.
- The ability to perform in silico generation of cellular data to explore biological hypotheses.

The model is trained on public data, and its knowledge is limited to the genes, cell types, and conditions present in that data.
Performance on out-of-distribution data (e.g., completely novel cell types or technologies) is not guaranteed and requires validation. Performance on input prompts that deviate greatly from the training prompt formats is likewise not guaranteed.

## C2S-Scale Links

- Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- Google Research Blog Post: Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase is Apache-2.0 licensed; the weights shared on HuggingFace are CC-BY-4.0)

## Gemma-2 Links

- HuggingFace: https://huggingface.co/google/gemma-2-2b
- Gemma-2 Blog Post: Gemma explained: What's new in Gemma 2
- Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
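For completeness, here is a minimal inference sketch using the Hugging Face transformers API. The checkpoint name, the example prompt, and the `generate_prediction` helper are assumptions for illustration; the Colab tutorials linked above show the canonical usage and prompt formats.

```python
def generate_prediction(prompt, model_name="vandijklab/C2S-Scale-Gemma-2-2B",
                        max_new_tokens=32):
    """Greedy generation; returns only the newly generated text."""
    # transformers/torch imported inside the function so this file can be
    # imported on machines without them installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False)
    # Slice off the prompt tokens, keeping only the model's answer.
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Example (requires a GPU and network access to download the weights):
# print(generate_prediction(
#     "The following is a cell sentence listing genes by descending "
#     "expression: MALAT1 ACTB CD3D CD8A. What cell type is this?"))
```

Greedy decoding (`do_sample=False`) is a reasonable default for label-style outputs such as cell type prediction; sampling may be preferable for generative tasks.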
# C2S-Scale-Pythia-1b-pt
## Overview

This is the C2S-Scale-1B pretrained model, based on the Pythia-1b architecture developed by EleutherAI and fine-tuned using the Cell2Sentence (C2S) framework on a wide array of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. Cell2Sentence is a method that adapts large language models (LLMs) to single-cell biology by converting scRNA-seq data into "cell sentences": ordered sequences of gene names based on expression levels. This model has been trained to perform a broad range of single- and multi-cell tasks, making it a versatile tool for single-cell and multi-cell analyses.

## Training Data

This model was trained on over 57 million human and mouse cells gathered from over 800 single-cell RNA sequencing datasets from CellxGene and the Human Cell Atlas, covering a broad range of cell types and conditions from multiple tissues in both species. The model was trained with a variable number of genes per cell sentence and a maximum context length of 8192 tokens; the context length of the default Pythia model was extended using rotary positional embeddings prior to C2S training.

- Cells: For multi-cell samples, each training sample contained between 5 and 20 cells, with the same number of genes for each cell in the sample.
- Genes: For single-cell samples, each cell sentence contained between 100 and 2048 genes. For multi-cell samples, each cell's sentence contained between 100 and 400 genes.

## Tasks

This model is designed for the following tasks:

### Single-Cell Tasks

- Unconditional single-cell generation: Generate single-cell sentences unconditionally.
- Cell type prediction: Predict the cell type of a given single cell.
- Cell type-conditioned generation: Generate a single-cell sentence conditioned on a specific cell type.

### Multi-Cell Tasks

- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
- Tissue prediction: Predict the tissue of origin for a group of cells.
- Cell type prediction: Predict the cell type of each cell in a group of cells.
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each individual cell.
- Multi-cells to abstract: Generate a research-paper abstract based on the provided multi-cell sentences.
- Abstract to multi-cells: Generate multiple cell sentences based on a given research-paper abstract.

### Gene Set Tasks

- Gene set name to genes: Generate an alphabetical list of genes given a gene set name.
- Genes to gene set name: Generate the name of a gene set given an alphabetical list of genes.

## Cell2Sentence Links

- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase is Apache-2.0 licensed; the weights shared on HuggingFace are CC-BY-4.0)
- Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3

## Pythia Links

- Paper: https://arxiv.org/pdf/2304.01373
- Hugging Face: https://huggingface.co/EleutherAI/pythia-1b
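The generation tasks above produce cell sentences as plain text. The Cell2Sentence paper recovers approximate expression values from a generated sentence via a linear fit between log expression and log rank; the sketch below illustrates that inversion, but the `slope` and `intercept` values are placeholders (in practice they are fit on the training data), and the function name is hypothetical.

```python
import math

def sentence_to_expression(cell_sentence, slope=-1.0, intercept=5.0):
    """Map a cell sentence back to approximate expression values.

    Uses a rank-based inversion in log space:
        log(expr) ~= intercept + slope * log(rank)
    with rank 1 for the first (most highly expressed) gene.
    """
    genes = cell_sentence.split()
    return {
        gene: math.exp(intercept + slope * math.log(rank))
        for rank, gene in enumerate(genes, start=1)
    }

expr = sentence_to_expression("MALAT1 ACTB CD3D CD8A")
for gene, value in expr.items():
    print(gene, round(value, 1))
# MALAT1 148.4
# ACTB 74.2
# CD3D 49.5
# CD8A 37.1
```

Because the sentence only encodes rank order, any inversion is approximate: the monotonic decrease with rank is preserved, but absolute magnitudes depend entirely on the fitted parameters.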