GME-VARCO-VISION-Embedding
7.0B
1 language
license:cc-by-nc-4.0
by NCSOFT
Embedding Model
OTHER
New
552 downloads
Early-stage
Edge AI: Mobile, Laptop, Server (16GB+ RAM)
Quick Summary
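Multimodal embedding model from NCSOFT with 7.0B parameters. It loads as a Qwen2-VL model and produces embeddings for text, images, and video that can be compared by cosine similarity for retrieval.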
Device Compatibility
Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 7GB+ RAM
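The RAM figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below assumes exactly 7.0e9 parameters (taken from the "7.0B" listing) and ignores activations, image and video tokens, and framework overhead, so real usage will be higher. Which precision each device tier corresponds to is a guess; the page does not say.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
PARAMS = 7.0e9  # "7.0B" from the listing; the exact count is an assumption

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB for weights alone")
```

At bfloat16, the precision used when loading the model on this page, the weights alone come to about 14 GB, which is consistent with the 16GB+ RAM badge.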
Code Examples
import torch
import torch.nn.functional as F
import requests
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NCSOFT/GME-VARCO-VISION-Embedding"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer
device = model.device

# Text query: build the chat-formatted prompt and append the EOS token,
# whose hidden state is used as the embedding.
qry_msg = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Find a photo of a cat."},
        ],
    },
]

qry_txt = processor.apply_chat_template(
    qry_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

qry_input = processor(
    text=[qry_txt],
    padding=True,
    return_tensors="pt",
).to(device)

# Image prompt template. The "image" value is a placeholder: it is only used
# to build the prompt text, and the real images are passed to the processor below.
img_msg = [
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "image"
        }]
    }
]

img_txt = processor.apply_chat_template(
    img_msg, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

candidate_imgs = [
    # Photo of two cats
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://images.cocodataset.org/val2017/000000039769.jpg"}]
    },
    # Photo of two dogs
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "https://farm1.staticflickr.com/116/290755713_a5de6c1079_z.jpg"}]
    },
    # Photo of two people playing baseball
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm3.staticflickr.com/2418/2193688811_d9f5e23bbd_z.jpg"}]
    },
    # Photo of a large crowd in a busy city street
    {
        "role": "user",
        "content": [{
            "type": "image",
            "image": "http://farm7.staticflickr.com/6049/6329686751_997c68fff9_z.jpg"}]
    },
]

candidate_images, _ = process_vision_info(candidate_imgs)

image_inputs = processor(
    text=[img_txt] * len(candidate_images),
    images=candidate_images,
    # videos=,
    padding=True,
    return_tensors="pt",
).to(device)

# Take the last-layer hidden state at the final position as the embedding.
with torch.inference_mode():
    qry_emb = model(
        **qry_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

    img_emb = model(
        **image_inputs, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

# L2-normalize so the dot product below is cosine similarity.
qry_emb = F.normalize(qry_emb, dim=-1)
img_emb = F.normalize(img_emb, dim=-1)

score = qry_emb @ img_emb.t()
# tensor([[0.3066, 0.1108, 0.1226, 0.1245]], device='cuda:0', dtype=torch.bfloat16)
# corresponding to the scores of the photos (cat, dog, baseball, crowd)
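To turn the similarity row into a retrieval result, sort the candidates by score. The sketch below is plain Python and reuses the scores printed above; the `labels` list and the `rank_candidates` helper are illustrative names, not part of the model's API. In the real pipeline the list of scores would come from something like `score[0].float().tolist()`.

```python
def rank_candidates(scores, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Scores copied from the example output above (query: "Find a photo of a cat.").
scores = [0.3066, 0.1108, 0.1226, 0.1245]
# Hypothetical short names for the four candidate images, in input order.
labels = ["cat", "dog", "baseball", "crowd"]

ranking = rank_candidates(scores, labels)
for label, value in ranking:
    print(f"{label}: {value:.4f}")
```

Because the embeddings are L2-normalized, each score is a cosine similarity, and the cat photo ranks first by a clear margin over the other three.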
# Video embedding. Reuses model, processor, tokenizer, device, and F from the example above.
# video_path is not defined here: set it to the video you want to embed before running.
vid_message = [
    {
        "role": "user",
        "content": [{
            "type": "video",
            "video": video_path,
            "max_pixels": 262144,
            "fps": 2.0,}]
    }
]

video_text = processor.apply_chat_template(
    vid_message, tokenize=False, add_generation_prompt=True
) + tokenizer.eos_token

image_input, video_input = process_vision_info(vid_message)

video_input = processor(
    text=[video_text],
    images=image_input,
    videos=video_input,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    video_emb = model(
        **video_input, output_hidden_states=True, return_dict=True
    ).hidden_states[-1][:, -1, :]

video_emb = F.normalize(video_emb, dim=-1)