jhu-clsp

49 models • 5 total models in database

mmBERT-base

[MIT License](https://opensource.org/licenses/MIT) · [Paper](https://arxiv.org/abs/2509.06888) · [Model](https://huggingface.co/jhu-clsp/mmBERT-base) · Collection: https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multiling...

license: mit · 43,731 · 159

ettin-encoder-150m

license: mit · 13,089 · 8

ettin-encoder-1b

Ettin: an Open Suite of Paired Encoders and Decoders

[MIT License](https://opensource.org/licenses/MIT) · [Paper](https://arxiv.org/abs/2507.11412) · [Models](https://huggingface.co/jhu-clsp) · [Datasets](https://huggingface.co/datasets/jhu-clsp) · [GitHub](https://github.com/jhu-clsp/ettin-encoder-vs-decoder)

> 🎯 TL;DR: State-of-the-art paired encoder and decoder models (17M–1B params) trained identically on open data for fair comparison. The encoders beat ModernBERT; the decoders beat Llama 3.2 and SmolLM2.

This model is part of the Ettin suite, the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance among open-data models in each size category.

**Table of Contents:** Performance Highlights · Quick Start · Model Description · Training Data · Model Family (Encoder Models · Decoder Models · Cross-Objective Models) · Accessing Training Checkpoints · Research Applications · Training Details · Model Architecture · Usage Examples · Fine-tuning Examples · Citation

**Performance Highlights**

Encoder tasks (vs. ModernBERT):
- GLUE average: 88.9 vs. 88.4 (Base); 90.8 vs. 90.4 (Large)
- MTEB v2 English retrieval: 45.7 vs. 43.9 (Base); 48.4 vs. 47.0 (Large)
- Code search and long context: superior performance on CodeSearchNet and MLDR

Decoder tasks (vs. SmolLM2 and Llama 3.2):
- Average score: 46.2 vs. 45.2 (SmolLM2-135M)
- 1B model: 59.0 vs. 56.6 (Llama 3.2-1B)
- Generative tasks: competitive across all model sizes

Key finding: architecture-specific advantages persist. A 400M encoder outperforms a 1B decoder on classification tasks, while a 400M decoder outperforms a 1B encoder on generation tasks.

**Model Description**

Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons, which were confounded by differing training data, architectures, and recipes, Ettin models use:

1. Identical training data: the same high-quality mixture across all models
2. Open training data: available now, with batch-level training data for each of the 250+ checkpoints
3. Matched architectures: differing only in attention pattern (bidirectional vs. causal) and training objective (MLM vs. CLM)
4. Consistent training recipe: three-phase training over 2T tokens
5. Multiple scales: from 17M to 1B parameters

This approach allows true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

**Training Data**

The training data is publicly available and split by phase:
- Pre-training data: jhu-clsp/ettin-pretraining-data (1.7T tokens of a diverse data mixture)
- Mid-training/extension data: jhu-clsp/ettin-extension-data (250B tokens of higher-quality filtered data)
- Decay-phase data: jhu-clsp/ettin-decay-data (100B tokens of premium data sources)
- Training data order: jhu-clsp/ettin-data-order (batch-level training order; columns: inputids, step)

**Encoder Models**

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-encoder-17m | 17M | Mobile/edge devices | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-17m) |
| XS | ettin-encoder-32m | 32M | Fast inference | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-32m) |
| Small | ettin-encoder-68m | 68M | Balanced performance | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-68m) |
| Base | ettin-encoder-150m | 150M | Standard use cases | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-150m) |
| Large | ettin-encoder-400m | 400M | High accuracy needs | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-400m) |
| XL | ettin-encoder-1b | 1B | Best performance | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-1b) |

**Decoder Models**

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-decoder-17m | 17M | Lightweight generation | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-17m) |
| XS | ettin-decoder-32m | 32M | Quick prototyping | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-32m) |
| Small | ettin-decoder-68m | 68M | Efficient generation | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
| Base | ettin-decoder-150m | 150M | Standard generation | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
| Large | ettin-decoder-400m | 400M | Quality generation | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
| XL | ettin-decoder-1b | 1B | Best generation | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |

**Cross-Objective Models**

These models demonstrate what happens when you continue training encoders as decoders (and vice versa). Important: load these models using the architecture they were converted to, not their original architecture.

Encoders trained from decoders (decoder → MLM). Load as encoders using `AutoModel` or `AutoModelForMaskedLM`:

| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-17m) |
| XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-32m) |
| Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-68m) |
| Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-150m) |
| Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-400m) |
| XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-1b) |

Decoders trained from encoders (encoder → CLM). Load as decoders using `AutoModelForCausalLM`:

| Size | Model | Parameters | Description | Download |
|:-----|:------|:-----------|:------------|:---------|
| XXS | ettin-decoder-from-encoder-17m | 17M | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) |
| XS | ettin-decoder-from-encoder-32m | 32M | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) |
| Small | ettin-decoder-from-encoder-68m | 68M | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) |
| Base | ettin-decoder-from-encoder-150m | 150M | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) |
| Large | ettin-decoder-from-encoder-400m | 400M | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) |
| XL | ettin-decoder-from-encoder-1b | 1B | Encoder → CLM continued training | [HF](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) |

**Accessing Training Checkpoints**

Beyond the final models listed above, intermediate training checkpoints are available for research and analysis, allowing you to study model behavior and performance throughout training. Checkpoints come either in HF format or raw for continued pre-training (e.g., Composer format).

Raw checkpoints: all raw training checkpoints are available in the jhu-clsp/ettin-checkpoints dataset.
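The load-by-target-architecture rule for the cross-objective models above can be sketched as a tiny helper. The function name and its string return values are illustrative only; the underlying rule (encoder-named repos load as MLM encoders, decoder-named repos as CLM decoders) follows the card:

```python
# Hypothetical helper: map an Ettin repo name to the transformers Auto class
# it should be loaded with. Per the card, cross-objective models load as the
# architecture they were converted TO, not the one they started from.
def pick_auto_class(model_name: str) -> str:
    repo = model_name.split("/")[-1]
    if repo.startswith("ettin-encoder"):
        # includes ettin-encoder-from-decoder-*: load as an encoder (MLM)
        return "AutoModelForMaskedLM"
    if repo.startswith("ettin-decoder"):
        # includes ettin-decoder-from-encoder-*: load as a decoder (CLM)
        return "AutoModelForCausalLM"
    raise ValueError(f"not an Ettin model: {model_name}")

print(pick_auto_class("jhu-clsp/ettin-encoder-from-decoder-150m"))  # AutoModelForMaskedLM
print(pick_auto_class("jhu-clsp/ettin-decoder-from-encoder-1b"))    # AutoModelForCausalLM
```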
HuggingFace-format checkpoints: each model repository contains multiple tagged versions representing different training stages:
- `step{number}`: pre-training phase checkpoints (e.g., `step599525`, `step596528`)
- `ext{number}`: extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- `decay{number}`: decay phase checkpoints (e.g., `decay100`, `decay500`)

This checkpoint availability enables detailed analysis of training dynamics, loss curves, and capability emergence across the complete 2T-token training process.

**Research Applications**

Ettin provides the first controlled comparison of encoder vs. decoder architectures:
- Identical training data: same 2T-token mixture across all models
- Matched architectures: only attention patterns and objectives differ
- Open everything: training data, model weights, and batch-level training order
- Multiple scales: fair comparison from 17M to 1B parameters
- 250+ checkpoints: complete training-trajectory analysis

Example research directions:
- Architecture studies: compare encoder vs. decoder capabilities fairly
- Training dynamics: analyze 250+ checkpoints with batch-level data ordering
- Scaling laws: study how architectural advantages change with scale
- Transfer learning: investigate cross-objective training effectiveness
- Replication studies: first open replication of the ModernBERT training recipe

All training artifacts are publicly available:
- Training data with exact batch ordering
- Model checkpoints every 8.5B tokens
- Complete hyperparameter configurations
- Training code and evaluation scripts

**Training Details**

Data: high-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources, totaling 2T+ tokens.

Architecture: transformer with RoPE, GLU activations, and prenorm layers.

Training phases:
- Pre-training: 1.7T tokens with a diverse data mixture
- Mid-training: 250B tokens of higher-quality filtered data, with context extension to 8K
- Decay phase: 100B tokens of premium data sources

Key features:
- Context length: up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles

**Model Architecture**

| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

**Fine-tuning Examples (Encoders)**

- Click to see how to finetune this into a dense embedding model using Sentence Transformers
- Click to see how to finetune this into a multi-vector embedding model with PyLate
- Click to see how to finetune this into a sparse retrieval model using Sentence Transformers
- Click to see how to finetune this into a reranker model using Sentence Transformers

**Citation**

If you use Ettin models in your research, please cite our work:
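The checkpoint tag scheme above (`step{number}`, `ext{number}`, `decay{number}`) implies a chronological order across the three phases. A small, hypothetical helper (not part of the Ettin repos) for sorting tags accordingly:

```python
import re

# Phase order from the card: pre-training steps, then extension, then decay.
PHASE_ORDER = {"step": 0, "ext": 1, "decay": 2}

def checkpoint_sort_key(tag: str):
    """Parse a checkpoint tag like 'step599525' into a sortable (phase, number) pair."""
    m = re.fullmatch(r"(step|ext|decay)(\d+)", tag)
    if not m:
        raise ValueError(f"unrecognized checkpoint tag: {tag}")
    return (PHASE_ORDER[m.group(1)], int(m.group(2)))

tags = ["decay100", "step596528", "ext2000", "step599525", "ext1000"]
print(sorted(tags, key=checkpoint_sort_key))
# ['step596528', 'step599525', 'ext1000', 'ext2000', 'decay100']
```

In transformers, a tag such as `step599525` can then be passed as the `revision` argument to `from_pretrained` to fetch that specific checkpoint from the model repository.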

license: mit · 10,500 · 21

mmBERT-small

[MIT License](https://opensource.org/licenses/MIT) · [Paper](https://arxiv.org/abs/2509.06888) · [Model](https://huggingface.co/jhu-clsp/mmBERT-base) · [Collection](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4) · [GitHub](https://github.com/jhu-clsp/mmBERT)

> TL;DR: A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase.

mmBERT is a modern multilingual encoder that significantly outperforms previous-generation models such as XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with novel multilingual training innovations, mmBERT demonstrates that low-resource languages can be learned effectively during the decay phase of training. It is also significantly faster than any previous multilingual encoder.

**Table of Contents:** Highlights · Quick Start · Model Description · Novel Training Innovations · Model Family · Training Data · Usage Examples · Fine-tuning Examples · Model Architecture · Citation

**Highlights**

mmBERT represents the first significant advance over XLM-R for massively multilingual encoder models. Key features:

1. Massive language coverage: trained on over 1800 languages with a progressive inclusion strategy
2. Modern architecture: built on the ModernBERT foundation with Flash Attention 2 and unpadding techniques
3. Novel training recipe: introduces inverse mask scheduling and inverse temperature sampling
4. Open training data: the complete 3T+ token dataset is publicly available
5. Decay-phase innovation: demonstrates effective learning of low-resource languages in the final training phase

The model uses bidirectional attention with a masked language modeling objective, optimized for multilingual understanding and cross-lingual transfer.

**Novel Training Innovations**

Progressive language addition: start with 60 high-resource languages, expand to 110 mid-resource languages, then include all 1833 languages in the decay phase.
Inverse mask schedule: reduce the mask ratio from 30% → 15% → 5% across training phases for progressively refined learning.

Inverse temperature sampling: adjust multilingual sampling from a high-resource bias (τ = 0.7) toward uniform sampling (τ = 0.3).

Model merging: combine English-focused, high-resource, and all-language decay variants using TIES merging.

**Model Family**

| Model | Total Params | Non-embed Params | Languages | Download |
|:------|:-------------|:-----------------|:----------|:---------|
| mmBERT-small | 140M | 42M | 1800+ | [HF](https://huggingface.co/jhu-clsp/mmBERT-small) |
| mmBERT-base | 307M | 110M | 1800+ | [HF](https://huggingface.co/jhu-clsp/mmBERT-base) |

**Training Data**

mmBERT training data is publicly available across phases:

| Phase | Dataset | Tokens | Description |
|:------|:--------|:-------|:------------|
| Pre-training P1 | mmbert-pretrain-p1 | 2.3T | 60 languages, foundational training |
| Pre-training P2 | mmbert-pretrain-p2 | - | Extension data for the pre-training phase |
| Pre-training P3 | mmbert-pretrain-p3 | - | Final pre-training data |
| Mid-training | mmbert-midtraining | 600B | 110 languages, context extension to 8K |
| Decay phase | mmbert-decay | 100B | 1833 languages, premium quality |

Data sources: filtered DCLM (English), FineWeb2 (multilingual), FineWeb2-HQ (20 high-resource languages), Wikipedia (MegaWika), code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).
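The inverse temperature sampling step can be illustrated numerically. The exponent form `p_i ∝ f_i**tau` is the standard multilingual sampling scheme and an assumption here (the card states only τ = 0.7 → 0.3), and the corpus sizes below are made up:

```python
def sample_probs(token_counts, tau):
    """Language sampling probabilities p_i proportional to f_i**tau.

    tau = 1 reproduces the raw corpus proportions; smaller tau flattens the
    distribution toward uniform, boosting low-resource languages.
    """
    weights = [c ** tau for c in token_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical corpus sizes: one high-, one mid-, one low-resource language.
counts = [1_000_000, 10_000, 100]
early = sample_probs(counts, 0.7)  # high-resource bias, early training
late = sample_probs(counts, 0.3)   # closer to uniform, decay phase
print([round(p, 3) for p in early])
print([round(p, 3) for p in late])
```

Moving from τ = 0.7 to τ = 0.3 visibly shrinks the high-resource language's share and boosts the low-resource one, which is the intended effect of the schedule.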
**Model Architecture**

| Parameter | mmBERT-small | mmBERT-base |
|:----------|:-------------|:------------|
| Layers | 22 | 22 |
| Hidden Size | 384 | 768 |
| Intermediate Size | 1152 | 1152 |
| Attention Heads | 6 | 12 |
| Total Parameters | 140M | 307M |
| Non-embedding Parameters | 42M | 110M |
| Max Sequence Length | 8192 | 8192 |
| Vocabulary Size | 256,000 | 256,000 |
| Tokenizer | Gemma 2 | Gemma 2 |

**Fine-tuning Examples**

- Click to expand dense retrieval fine-tuning example
- Click to expand multilingual classification fine-tuning example
- Click to expand multilingual reranking fine-tuning example

**Training Data (Details)**

mmBERT was trained on a carefully curated 3T+ token multilingual dataset:

| Phase | Tokens | Description |
|:------|:-------|:------------|
| Pre-training P1 | 2.3T | 60 languages, diverse data mixture |
| Pre-training P2 | - | Extension data for pre-training |
| Pre-training P3 | - | Final pre-training data |
| Mid-training | 600B | 110 languages, context extension |
| Decay phase | 100B | 1833 languages, premium quality |

Primary sources:
- Filtered DCLM: high-quality English content
- FineWeb2: broad multilingual web coverage (1800+ languages)
- FineWeb2-HQ: filtered subset of 20 high-resource languages
- Code: StarCoder and ProLong repositories
- Academic: ArXiv papers and PeS2o scientific content
- Reference: Wikipedia (MegaWika) and textbooks
- Community: StackExchange discussions

**Citation**

If you use mmBERT in your research, please cite our work:
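The inverse mask schedule (30% → 15% → 5%) amounts to a per-phase masking budget. A minimal sketch; the phase names and the helper function are illustrative, not from the mmBERT codebase:

```python
# Mask ratios per training phase, as described in the card.
MASK_RATIO = {
    "pretraining": 0.30,  # 30% masking early on
    "midtraining": 0.15,  # 15% during context extension
    "decay": 0.05,        # 5% in the final decay phase
}

def masked_token_budget(seq_len: int, phase: str) -> int:
    """Number of tokens to mask in a sequence during a given phase."""
    return round(seq_len * MASK_RATIO[phase])

print(masked_token_budget(8192, "pretraining"))  # 2458
print(masked_token_budget(8192, "decay"))        # 410
```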

license: mit · 5,313 · 53

ettin-encoder-68m

Ettin: an Open Suite of Paired Encoders and Decoders. Part of the Ettin suite; see the full model card under ettin-encoder-1b above.

license: mit · 1,542 · 3

ettin-encoder-400m

Ettin: an Open Suite of Paired Encoders and Decoders [](https://opensource.org/licenses/MIT) [](https://arxiv.org/abs/2507.11412) [](https://huggingface.co/jhu-clsp) [](https://huggingface.co/datasets/jhu-clsp) [](https://github.com/jhu-clsp/ettin-encoder-vs-decoder) > 🎯 TL;DR: State-of-the-art paired encoder and decoder models (17M-1B params) trained identically for fair comparison with open data. Encoders beat ModernBERT. Decoders beat Llama 3.2/SmolLM2. This model is part of the Ettin suite - the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories. Table of Contents - Performance Highlights - Quick Start - Model Description - Training Data - Model Family - Encoder Models - Decoder Models - Cross-Objective Models - Accessing Training Checkpoints - Research Applications - Training Details - Model Architecture - Usage Examples - Fine-tuning Examples - Citation Encoder Tasks (vs. ModernBERT) - GLUE Average: 88.9 vs 88.4 (Base), 90.8 vs 90.4 (Large) - MTEB v2 English Retrieval: 45.7 vs 43.9 (Base), 48.4 vs 47.0 (Large) - Code Search and Long Context: Superior performance on CodeSearchNet and MLDR Decoder Tasks (vs. SmolLM2 & Llama 3.2) - Average Score: 46.2 vs 45.2 (SmolLM2-135M) - 1B Model: 59.0 vs 56.6 (Llama 3.2-1B) - Generative Tasks: Competitive across all model sizes Key Finding Architecture-specific advantages persist: A 400M encoder outperforms a 1B decoder on classification tasks, while a 400M decoder outperforms a 1B encoder on generation tasks. Ettin models are designed to provide a foundation for comparing encoder-only and decoder-only architectures. Unlike previous comparisons that were limited by different training data, architectures, and recipes, Ettin models use: 1. 
Identical training data - Same high-quality mixture across all models 2. Open Training Data - Data is available now with batch-level training data for each of the 250+ checkpoints 3. Matched architectures - Only differing in attention patterns (bidirectional vs causal) and training objectives (MLM vs CLM) 4. Consistent training recipe - Three-phase training with 2T tokens 5. Multiple scales - From 17M to 1B parameters This approach allows for true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture. The training data is publicly available and split across different phases: - Pre-training Data: jhu-clsp/ettin-pretraining-data - 1.7T tokens of diverse data mixture - Mid-training/Extension Data: jhu-clsp/ettin-extension-data - 250B tokens of higher-quality filtered data - Decay Phase Data: jhu-clsp/ettin-decay-data - 100B tokens of premium data sources - Training Data Order: jhu-clsp/ettin-data-order - Batch-level training order (columns: inputids, step) | Size | Model | Parameters | Best For | Download | |:-----|:------|:-----------|:---------|:---------| | XXS | ettin-encoder-17m | 17M | Mobile/Edge devices | [](https://huggingface.co/jhu-clsp/ettin-encoder-17m) | | XS | ettin-encoder-32m | 32M | Fast inference | [](https://huggingface.co/jhu-clsp/ettin-encoder-32m) | | Small | ettin-encoder-68m | 68M | Balanced performance | [](https://huggingface.co/jhu-clsp/ettin-encoder-68m) | | Base | ettin-encoder-150m | 150M | Standard use cases | [](https://huggingface.co/jhu-clsp/ettin-encoder-150m) | | Large | ettin-encoder-400m | 400M | High accuracy needs | [](https://huggingface.co/jhu-clsp/ettin-encoder-400m) | | XL | ettin-encoder-1b | 1B | Best performance | [](https://huggingface.co/jhu-clsp/ettin-encoder-1b) | | Size | Model | Parameters | Best For | Download | |:-----|:------|:-----------|:---------|:---------| | XXS | ettin-decoder-17m | 17M | Lightweight generation | 
[](https://huggingface.co/jhu-clsp/ettin-decoder-17m) | | XS | ettin-decoder-32m | 32M | Quick prototyping | [](https://huggingface.co/jhu-clsp/ettin-decoder-32m) | | Small | ettin-decoder-68m | 68M | Efficient generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-68m) | | Base | ettin-decoder-150m | 150M | Standard generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-150m) | | Large | ettin-decoder-400m | 400M | Quality generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-400m) | | XL | ettin-decoder-1b | 1B | Best generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-1b) | These models demonstrate what happens when you continue training encoders as decoders (and vice versa). Important: Load these models using the architecture they were converted to, not their original architecture. Encoders Trained from Decoders (Decoder → MLM) Load as encoders using `AutoModel` or `AutoModelForMaskedLM`: | Size | Model | Parameters | Description | Download | |:-----|:------|:-----------|:------------|:---------| | XXS | ettin-encoder-from-decoder-17m | 17M | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-17m) | | XS | ettin-encoder-from-decoder-32m | 32M | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-32m) | | Small | ettin-encoder-from-decoder-68m | 68M | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-68m) | | Base | ettin-encoder-from-decoder-150m | 150M | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-150m) | | Large | ettin-encoder-from-decoder-400m | 400M | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-400m) | | XL | ettin-encoder-from-decoder-1b | 1B | Decoder → MLM continued training | [](https://huggingface.co/jhu-clsp/ettin-encoder-from-decoder-1b) | Decoders Trained from Encoders (Encoder 
→ CLM) Load as decoders using `AutoModelForCausalLM`: | Size | Model | Parameters | Description | Download | |:-----|:------|:-----------|:------------|:---------| | XXS | ettin-decoder-from-encoder-17m | 17M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-17m) | | XS | ettin-decoder-from-encoder-32m | 32M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-32m) | | Small | ettin-decoder-from-encoder-68m | 68M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-68m) | | Base | ettin-decoder-from-encoder-150m | 150M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-150m) | | Large | ettin-decoder-from-encoder-400m | 400M | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-400m) | | XL | ettin-decoder-from-encoder-1b | 1B | Encoder → CLM continued training | [](https://huggingface.co/jhu-clsp/ettin-decoder-from-encoder-1b) | Beyond the final models listed above, we provide access to intermediate training checkpoints for research and analysis purposes. These checkpoints allow you to study model behavior and performance throughout the training process. You can get the checkpoints either in HF format or raw for continued pre-training (e.g. Composer format). Raw Checkpoints All raw training checkpoints are available in the jhu-clsp/ettin-checkpoints dataset. 
HuggingFace Format Checkpoints

Each model repository contains multiple tagged versions representing different training stages:

- `step{number}` - Pretraining phase checkpoints (e.g., `step599525`, `step596528`)
- `ext{number}` - Extension/mid-training phase checkpoints (e.g., `ext1000`, `ext2000`)
- `decay{number}` - Decay phase checkpoints (e.g., `decay100`, `decay500`)

This checkpoint availability enables detailed analysis of training dynamics, loss curves, and capability emergence across the complete 2T-token training process.

Ettin provides the first controlled comparison of encoder vs. decoder architectures:

- Identical Training Data: the same 2T-token mixture across all models
- Matched Architectures: only attention patterns and objectives differ
- Open Everything: training data, model weights, and batch-level training order
- Multiple Scales: fair comparison from 17M to 1B parameters
- 250+ Checkpoints: complete training-trajectory analysis

Research applications:

- Architecture Studies: compare encoder vs. decoder capabilities fairly
- Training Dynamics: analyze 250+ checkpoints with batch-level data ordering
- Scaling Laws: study how architectural advantages change with scale
- Transfer Learning: investigate cross-objective training effectiveness
- Replication Studies: first open replication of the ModernBERT training recipe

All training artifacts are publicly available:

- Training data with exact batch ordering
- Model checkpoints every 8.5B tokens
- Complete hyperparameter configurations
- Training code and evaluation scripts

Data: high-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources, totaling 2T+ tokens

Architecture: Transformer with RoPE, GLU activations, and prenorm layers

Training Phases:

- Pre-training: 1.7T tokens with a diverse data mixture
- Mid-training: 250B tokens with higher-quality filtered data and context extension to 8K
- Decay phase: 100B tokens with premium data sources

Key Features:

- Context length: up to 8K tokens
- Vocabulary:
50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles

| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |

Encoders

- Click to see how to finetune this into a dense embedding model using Sentence Transformers
- Click to see how to finetune this into a multi-vector embedding model with PyLate
- Click to see how to finetune this into a sparse retrieval model using Sentence Transformers
- Click to see how to finetune this into a reranker model using Sentence Transformers

If you use Ettin models in your research, please cite our work:

license:mit
1,029
8

ettin-encoder-17m

Identical training data - the same high-quality mixture across all models
2. Open training data - available now, with batch-level training data for each of the 250+ checkpoints
3. Matched architectures - differing only in attention patterns (bidirectional vs. causal) and training objectives (MLM vs. CLM)
4. Consistent training recipe - three-phase training with 2T tokens
5. Multiple scales - from 17M to 1B parameters

This approach allows true apples-to-apples comparisons between encoder and decoder models, revealing the inherent strengths of each architecture.

The training data is publicly available and split across the training phases:

- Pre-training data: jhu-clsp/ettin-pretraining-data - 1.7T tokens of diverse data mixture
- Mid-training/extension data: jhu-clsp/ettin-extension-data - 250B tokens of higher-quality filtered data
- Decay-phase data: jhu-clsp/ettin-decay-data - 100B tokens of premium data sources
- Training data order: jhu-clsp/ettin-data-order - batch-level training order (columns: input_ids, step)

Encoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-encoder-17m | 17M | Mobile/Edge devices | [](https://huggingface.co/jhu-clsp/ettin-encoder-17m) |
| XS | ettin-encoder-32m | 32M | Fast inference | [](https://huggingface.co/jhu-clsp/ettin-encoder-32m) |
| Small | ettin-encoder-68m | 68M | Balanced performance | [](https://huggingface.co/jhu-clsp/ettin-encoder-68m) |
| Base | ettin-encoder-150m | 150M | Standard use cases | [](https://huggingface.co/jhu-clsp/ettin-encoder-150m) |
| Large | ettin-encoder-400m | 400M | High accuracy needs | [](https://huggingface.co/jhu-clsp/ettin-encoder-400m) |
| XL | ettin-encoder-1b | 1B | Best performance | [](https://huggingface.co/jhu-clsp/ettin-encoder-1b) |

Decoder Models

| Size | Model | Parameters | Best For | Download |
|:-----|:------|:-----------|:---------|:---------|
| XXS | ettin-decoder-17m | 17M | Lightweight generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-17m) |
| XS | ettin-decoder-32m | 32M | Quick prototyping | [](https://huggingface.co/jhu-clsp/ettin-decoder-32m) |
| Small | ettin-decoder-68m | 68M | Efficient generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-68m) |
| Base | ettin-decoder-150m | 150M | Standard generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-150m) |
| Large | ettin-decoder-400m | 400M | Quality generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-400m) |
| XL | ettin-decoder-1b | 1B | Best generation | [](https://huggingface.co/jhu-clsp/ettin-decoder-1b) |

license:mit
960
9

ettin-decoder-1b

license:mit
628
4

ettin-decoder-150m

license:mit
616
4

ettin-decoder-17m

license:mit
202
1

ettin-encoder-32m

license:mit
180
6

kreyol-mt-pubtrain

license:mit
105
1

rank1-7b

license:mit
100
3

rank1-3b

license:mit
92
0

ettin-enc-from-dec-1b

license:mit
84
0

rank1-32b

license:mit
73
1

LegalBERT-DPR-CLERC-ft

55
0

rank1-14b

license:mit
49
0

ettin-decoder-400m

license:mit
47
2

ettin-dec-from-enc-1b

license:mit
46
0

ettin-enc-from-dec-400m

license:mit
45
0

ettin-dec-from-enc-17m

license:mit
45
0

ettin-dec-from-enc-68m

license:mit
45
0

ettin-dec-from-enc-32m

license:mit
42
0

ettin-dec-from-enc-400m

license:mit
40
0

ettin-enc-from-dec-17m

40
0

kreyol-mt

license:mit
39
3

ettin-decoder-68m

license:mit
36
0

ettin-enc-from-dec-68m

license:mit
36
0

ettin-decoder-32m

license:mit
36
0

ettin-enc-from-dec-150m

license:mit
33
0

ettin-dec-from-enc-150m

license:mit
29
0

rank1-32b-awq

license:mit
27
0

BERT-DPR-CLERC-ft

26
0

ettin-enc-from-dec-32m

license:mit
26
0

LegalBert

license:mit
22
3

kreyol-mt-scratch

license:mit
22
1

bernice

license:mit
13
5

rank1-7b-awq

license:mit
10
0

FollowIR-7B

license:apache-2.0
4
15

bibert-ende

4
7

kreyol-mt-scratch-pubtrain

license:mit
2
0

rank1-1.5b

license:mit
2
0

rank1-mistral-2501-24b

license:mit
1
2

rank1-0.5b

license:mit
1
1

rank1-llama3-8b-awq

llama
1
0

rank1-mistral-2501-24b-awq

license:mit
1
0

roberta-large-eng-ara-128k

license:mit
0
5

mmBERT-checkpoints

license:mit
0
2