nomic-ai

36 models • 7 total models in database

Sort by:

nomic-embed-text-v1.5

nomic-embed-text-v1.5: Resizable Production Embeddings with Matryoshka Representation Learning

nomic-embed-text-v1

nomic-embed-text-v1: A Reproducible Long Context (8192) Text Embedder

nomic-embed-text-v2-moe

--- base_model: - nomic-ai/nomic-embed-text-v2-moe-unsupervised library_name: sentence-transformers pipeline_tag: sentence-similarity tags: - sentence-transformers - sentence-similarity - feature-extraction license: apache-2.0 language: - en - es - fr - de - it - pt - pl - nl - tr - ja - vi - ru - id - ar - cs - ro - sv - el - uk - zh - hu - da - 'no' - hi - fi - bg - ko - sk - th - he - ca - lt - fa - ms - sl - lv - mr - bn - sq - cy - be - ml - kn - mk - ur - fy - te - eu - sw - so - sd - uz -

license:apache-2.0

nomic-embed-vision-v1.5

--- library_name: transformers language: - en pipeline_tag: image-feature-extraction license: apache-2.0 inference: false ---

license:apache-2.0

CodeRankEmbed

`CodeRankEmbed` is a 137M bi-encoder supporting 8192 context length for code retrieval. It significantly outperforms various open-source and proprietary code embedding models on various code retrieval tasks. Check out our blog post and paper for more details! Combine `CodeRankEmbed` with our re-ranker `CodeRankLLM` for even higher quality code retrieval. | Name | Parameters | CSN (MRR) | CoIR (NDCG@10) | | :-------------------------------:| :----- | :-------- | :------: | | CodeRankEmbed | 137M | 77.9 |60.1 | | Arctic-Embed-M-Long | 137M | 53.4 | 43.0 | | CodeSage-Small | 130M | 64.9 | 54.4 | | CodeSage-Base | 356M | 68.7 | 57.5 | | CodeSage-Large | 1.3B | 71.2 | 59.4 | | Jina-Code-v2 | 161M | 67.2 | 58.4 | | CodeT5+ | 110M | 74.2 | 45.9 | | OpenAI-Ada-002 | 110M | 71.3 | 45.6 | | Voyage-Code-002 | Unknown | 68.5 | 56.3 | We release the scripts to evaluate our model's performance here. Important: the query prompt must include the following task instruction prefix: "Represent this query for searching relevant code" Training We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a 21 million example high-quality dataset we curated called CoRNStack. Our encoder is initialized with Arctic-Embed-M-Long, a 137M parameter text encoder supporting an extended context length of 8,192 tokens. If you find the model, dataset, or training code useful, please cite our work:

modernbert-embed-base

license:apache-2.0

nomic-embed-code

Nomic Embed Code: A State-of-the-Art Code Retriever Blog | Technical Report | AWS SageMaker | Atlas Embedding and Unstructured Data Analytics Platform `nomic-embed-code` is a state-of-the-art code embedding model that excels at code retrieval tasks: - High Performance: Outperforms Voyage Code 3 and OpenAI Embed 3 Large on CodeSearchNet - Multilingual Code Support: Trained for multiple programming languages (Python, Java, Ruby, PHP, JavaScript, Go) - Advanced Architecture: 7B parameter code embedding model - Fully Open-Source: Model weights, training data, and evaluation code released | Model | Python | Java | Ruby | PHP | JavaScript | Go | |-------|--------|------|------|-----|------------|-----| | Nomic Embed Code | 81.7 | 80.5 | 81.8 | 72.3 | 77.1 | 93.8 | | Voyage Code 3 | 80.8 | 80.5 | 84.6 | 71.7 | 79.2 | 93.2 | | OpenAI Embed 3 Large | 70.8 | 72.9 | 75.3 | 59.6 | 68.1 | 87.6 | | Nomic CodeRankEmbed-137M | 78.4 | 76.9 | 79.3 | 68.8 | 71.4 | 92.7 | | CodeSage Large v2 (1B) | 74.2 | 72.3 | 76.7 | 65.2 | 72.5 | 84.6 | | CodeSage Large (1B) | 70.8 | 70.2 | 71.9 | 61.3 | 69.5 | 83.7 | | Qodo Embed 1 7B | 59.9 | 61.6 | 68.4 | 48.5 | 57.0 | 81.4 | - Total Parameters: 7B - Training Approach: Trained on the CoRNStack dataset with dual-consistency filtering and progressive hard negative mining - Supported Languages: Python, Java, Ruby, PHP, JavaScript, and Go Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and respective code. We filtered out low-quality pairs where the docstring wasn't English, too short, or that contained URLs, HTML tags, or invalid characters. We additionally kept docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies. After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove pairs from the dataset if the corresponding code example is not found in the top-2 most similar examples for a given docstring. During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives with increasing difficulty over time. - Nomic Embed Ecosystem: https://www.nomic.ai/embed - Website: https://nomic.ai - Twitter: https://twitter.com/nomicai - Discord: https://discord.gg/myY5YDR8z8 If you find the model, dataset, or training code useful, please cite our work:

license:apache-2.0

nomic-embed-multimodal-7b

license:apache-2.0

colnomic-embed-multimodal-7b

dataset:llamaindex/vdr-multilingual-train

nomic-embed-text-v1.5-GGUF

license:apache-2.0

nomic-embed-vision-v1

license:apache-2.0

nomic-bert-2048

license:apache-2.0

nomic-embed-text-v2-moe-GGUF

license:apache-2.0

nomic-embed-multimodal-3b

colnomic-embed-multimodal-3b

dataset:llamaindex/vdr-multilingual-train

gpt4all-j

license:apache-2.0

nomic-embed-text-v1-GGUF

license:apache-2.0

nomic-embed-code-GGUF

Llama.cpp Quantizations of Nomic Embed Code: A State-of-the-Art Code Retriever Blog | Technical Report | AWS SageMaker | Atlas Embedding and Unstructured Data Analytics Platform This model can be used with the llama.cpp server and other software that supports llama.cpp embedding models. Queries embedded with `nomic-embed-code` must begin with the following prefix: For example, the code below shows how to use the prefix to embed user questions, e.g. in a RAG application. | Filename | Quant Type | File Size | Description | | -------- | ---------- | --------: | ----------- | | nomic-embed-code.f32.gguf | f32 | 26.35GiB | Full FP32 weights. | | nomic-embed-code.f16.gguf | f16 | 13.18GiB | Full FP16 weights. | | nomic-embed-code.bf16.gguf | bf16 | 13.18GiB | Full BF16 weights. | | nomic-embed-code.Q8\0.gguf | Q8\0 | 7.00GiB | Extremely high quality, generally unneeded but max available quant. | | nomic-embed-code.Q6\K.gguf | Q6\K | 5.41GiB | Very high quality, near perfect, recommended. | | nomic-embed-code.Q5\K\M.gguf | Q5\K\M | 4.72GiB | High quality, recommended. | | nomic-embed-code.Q5\K\S.gguf | Q5\K\S | 4.60GiB | High quality, recommended. | | nomic-embed-code.Q4\1.gguf | Q4\1 | 4.22GiB | Legacy format, similar performance to Q4\K\S but with improved tokens/watt on Apple silicon. | | nomic-embed-code.Q4\K\M.gguf | Q4\K\M | 4.08GiB | Good quality, default size for most use cases, recommended. | | nomic-embed-code.Q4\K\S.gguf | Q4\K\S | 3.87GiB | Slightly lower quality with more space savings, recommended. | | nomic-embed-code.Q4\0.gguf | Q4\0 | 3.84GiB | Legacy format, offers online repacking for ARM and AVX CPU inference. | | nomic-embed-code.Q3\K\L.gguf | Q3\K\L | 3.59GiB | Lower quality but usable, good for low RAM availability. | | nomic-embed-code.Q3\K\M.gguf | Q3\K\M | 3.33GiB | Low quality. | | nomic-embed-code.Q3\K\S.gguf | Q3\K\S | 3.03GiB | Low quality, not recommended. | | nomic-embed-code.Q2\K.gguf | Q2\K | 2.64GiB | Very low quality but surprisingly usable. | Model Overview `nomic-embed-code` is a state-of-the-art code embedding model that excels at code retrieval tasks: - High Performance: Outperforms Voyage Code 3 and OpenAI Embed 3 Large on CodeSearchNet - Multilingual Code Support: Trained for multiple programming languages (Python, Java, Ruby, PHP, JavaScript, Go) - Advanced Architecture: 7B parameter code embedding model - Fully Open-Source: Model weights, training data, and evaluation code released | Model | Python | Java | Ruby | PHP | JavaScript | Go | |-------|--------|------|------|-----|------------|-----| | Nomic Embed Code | 81.7 | 80.5 | 81.8 | 72.3 | 77.1 | 93.8 | | Voyage Code 3 | 80.8 | 80.5 | 84.6 | 71.7 | 79.2 | 93.2 | | OpenAI Embed 3 Large | 70.8 | 72.9 | 75.3 | 59.6 | 68.1 | 87.6 | | Nomic CodeRankEmbed-137M | 78.4 | 76.9 | 79.3 | 68.8 | 71.4 | 92.7 | | CodeSage Large v2 (1B) | 74.2 | 72.3 | 76.7 | 65.2 | 72.5 | 84.6 | | CodeSage Large (1B) | 70.8 | 70.2 | 71.9 | 61.3 | 69.5 | 83.7 | | Qodo Embed 1 7B | 59.9 | 61.6 | 68.4 | 48.5 | 57.0 | 81.4 | - Total Parameters: 7B - Training Approach: Trained on the CoRNStack dataset with dual-consistency filtering and progressive hard negative mining - Supported Languages: Python, Java, Ruby, PHP, JavaScript, and Go Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and respective code. We filtered out low-quality pairs where the docstring wasn't English, too short, or that contained URLs, HTML tags, or invalid characters. We additionally kept docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies. After the initial filtering, we used dual-consistency filtering to remove potentially noisy examples. We embed each docstring and code pair and compute the similarity between each docstring and every code example. We remove pairs from the dataset if the corresponding code example is not found in the top-2 most similar examples for a given docstring. During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples. We use a softmax-based sampling strategy to progressively sample hard negatives with increasing difficulty over time. - Nomic Embed Ecosystem: https://www.nomic.ai/embed - Website: https://nomic.ai - Twitter: https://twitter.com/nomic\ai - Discord: https://discord.gg/myY5YDR8z8 If you find the model, dataset, or training code useful, please cite our work:

license:apache-2.0

gpt4all-13b-snoozy

nomic-xlm-2048

nomic-embed-text-v1-unsupervised

license:apache-2.0

CodeRankLLM

gpt4all-falcon

license:apache-2.0

nomic-embed-text-v1-ablated

modernbert-embed-base-unsupervised

license:apache-2.0

vit_eva02_base_patch16_224.mim_in22k

license:apache-2.0

nomic-embed-text-v2-moe-unsupervised

eurobert-210m-2e4-128sl-full-ft

gpt4all-mpt

license:apache-2.0

gpt4all-lora

license:gpl-3.0

gpt4all-lora-epoch-3

license:gpl-3.0

gpt4all-falcon-ggml

gpt4all-j-lora

license:apache-2.0

ggml-replit-code-v1-3b

license:cc-by-sa-4.0

qwen2.5-coder-7B-instruct-bd

colqwen2.5-7B-base

ColQwen2.5: Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy ColQwen is a model based on a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents from their visual features. It is a Qwen2.5-VL-3B extension that generates ColBERT- style multi-vector representations of text and images. It was introduced in the paper ColPali: Efficient Document Retrieval with Vision Language Models and first released in this repository This version is the untrained base version to guarantee deterministic projection layer initialization. > [!WARNING] > This version should not be used: it is solely the base version useful for deterministic LoRA initialization. - Manuel Faysse: [email protected] - Hugues Sibille: [email protected] - Tony Wu: [email protected] If you use any datasets or models from this organization in your research, please cite the original dataset as follows:

license:apache-2.0