jina-embeddings-v5-text-nano-retrieval

by jinaai · Embedding Model · llama.cpp · 39K downloads · Community-tested

Edge AI: Mobile, Laptop, Server
Quick Summary

Text embedding model for retrieval from Jina AI. It produces 768-dimensional embeddings (truncatable via truncate_dim), uses separate query and document prompts with last-token pooling, and ships ONNX weights for CPU or GPU inference.

Code Examples

Sentence Transformers (PyTorch)
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-nano-retrieval",
    trust_remote_code=True,
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"_attn_implementation": "flash_attention_2"},  # Recommended but optional
)
# Optional: set truncate_dim in encode() to control embedding size

query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]

# Encode query and documents
query_embeddings = model.encode(sentences=query, prompt_name="query")
document_embeddings = model.encode(sentences=documents, prompt_name="document")
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.5013, 0.7914, 0.6133, 0.5736]])
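As noted in the comment above, encode() accepts a truncate_dim argument for Matryoshka-style truncation of the 768-dimensional output. A minimal sketch of what truncation does, using NumPy on random stand-in vectors (the 768 and 256 sizes are illustrative; renormalizing after truncation keeps cosine similarities meaningful, and whether the library renormalizes for you depends on normalize_embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))                     # stand-in for document_embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)  # unit-normalize

# Matryoshka truncation: keep the first 256 dimensions, then renormalize
truncated = full[:, :256]
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)

print(truncated.shape)                    # (4, 256)
print(np.linalg.norm(truncated, axis=1))  # all ~1.0
```

Smaller truncation sizes trade retrieval quality for index size and search speed.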
ONNX Runtime (Transformers + Optimum)
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "jinaai/jina-embeddings-v5-text-nano-retrieval"

# 1. Load tokenizer and ONNX model
# We specify the subfolder 'onnx' where the weights are located
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",  # Or "CUDAExecutionProvider" for GPU
    trust_remote_code=True,
)

# 2. Prepare input
texts = ["Query: How do I use Jina ONNX models?", "Document: Information about semantic matching."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")


# 3. Inference
with torch.no_grad():
    outputs = model(**inputs)

# 4. Pooling (crucial for Jina-v5)
# Jina-v5 uses LAST-TOKEN pooling.
# We take the hidden state of the last non-padding token.
last_hidden_state = outputs.last_hidden_state
# Find the indices of the last token (usually the end of the sequence)
sequence_lengths = inputs.attention_mask.sum(dim=1) - 1
embeddings = last_hidden_state[torch.arange(last_hidden_state.size(0)), sequence_lengths]

print('embeddings shape:', embeddings.shape)
print('embeddings:', embeddings)
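The last-token indexing above is easy to get wrong with right-padded batches. A standalone sketch with dummy NumPy arrays shows how the attention mask selects the last non-padding position per sequence, and how to cosine-compare the pooled embeddings (the shapes are illustrative, not the model's real hidden size):

```python
import numpy as np

# Dummy batch: 2 sequences, max length 4, hidden size 3
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
# Sequence 0 has 4 real tokens; sequence 1 has 2 and is right-padded
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

# Index of the last non-padding token in each sequence
seq_lengths = attention_mask.sum(axis=1) - 1          # [3, 1]
pooled = hidden[np.arange(hidden.shape[0]), seq_lengths]
print(pooled)  # row 0 = hidden[0, 3], row 1 = hidden[1, 1]

# Cosine similarity between the two pooled embeddings
normed = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(normed @ normed.T)
```

This mirrors the torch indexing in the ONNX example; note the model's tokenizer pads on the right by default, which is what the mask-sum trick assumes.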

Deploy This Model

Production-ready deployment in minutes:

Together.ai — instant API access to this model. Production-ready inference API; start free, scale to millions.
Replicate — one-click model deployment. Run models in the cloud with a simple API; no DevOps required.

Disclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.