nyu-visionx

26 models

cambrian-phi3-3b · license:apache-2.0 · 276 downloads · 11 likes

cambrian-8b · cambrian_llama · 252 downloads · 63 likes

siglip2_decoder · license:mit · 177 downloads · 0 likes

Scale-RAE-Qwen1.5B_DiT2.4B-WebSSL · license:mit · 61 downloads · 0 likes

moco-v3-vit-b · 59 downloads · 1 like

cambrian-13b · cambrian_llama · 50 downloads · 19 likes

RAE-dinov2-wReg-small-ViTXL-n08 · 46 downloads · 0 likes

RAE-siglip2-base-p16-i256-ViTXL-n08 · 40 downloads · 0 likes

RAE-dinov2-wReg-base-ViTXL-n08 · 39 downloads · 0 likes

webssl300m_decoder · license:mit · 34 downloads · 0 likes

RAE-dinov2-wReg-large-ViTXL-n08 · 29 downloads · 0 likes

RAE-dinov2-wReg-base-ViTXL-n08-i512 · 29 downloads · 0 likes

RAE-mae-base-p16-ViTXL-n08 · 28 downloads · 0 likes

cambrian-34b · cambrian_llama · 22 downloads · 27 likes

Cambrian-S-7B-LFP · 13 downloads · 0 likes

Cambrian-S-7B · license:apache-2.0 · 13 downloads · 0 likes

Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.

Cambrian-S-7B is a spatially grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while maintaining competitive performance on general video understanding tasks.

- Architecture: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 7B
- Vision Encoder: SigLIP-384 (SigLIP)
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
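The "2-layer MLP adapter" in the architecture above is the small projector that maps vision-encoder features into the LLM's embedding space. A minimal pure-Python sketch of that shape follows; the dimensions and weights here are illustrative toy values, not the real SigLIP2/Qwen2.5 sizes:

```python
import math

def gelu(x):
    # tanh approximation of GELU, as commonly used in transformer MLPs
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weights, bias):
    # vec: list[in_dim]; weights: list[out_dim][in_dim]; bias: list[out_dim]
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def mlp_adapter(vision_feat, w1, b1, w2, b2):
    # 2-layer MLP (Linear -> GELU -> Linear): projects one vision-token
    # feature of size d_v into the LLM hidden size d_llm
    hidden = [gelu(h) for h in linear(vision_feat, w1, b1)]
    return linear(hidden, w2, b2)

# toy dimensions: d_v=4 vision feature -> d_h=5 hidden -> d_llm=3 token embedding
d_v, d_h, d_llm = 4, 5, 3
w1 = [[0.1] * d_v for _ in range(d_h)]; b1 = [0.0] * d_h
w2 = [[0.2] * d_h for _ in range(d_llm)]; b2 = [0.0] * d_llm
token = mlp_adapter([1.0, 2.0, 3.0, 4.0], w1, b1, w2, b2)
assert len(token) == d_llm  # one LLM-space embedding per vision token
```

In the real model this projection is applied to every vision token, so a frame's patch features become a sequence of pseudo-token embeddings the LLM can attend over.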

Cambrian-S-0.5B · license:apache-2.0 · 10 downloads · 0 likes

Same model family and training recipe as Cambrian-S-7B: a spatially grounded multimodal LLM built on Qwen2.5-0.5B-Instruct (0.5B parameters) with the SigLIP2-SO400M vision encoder, 2-layer MLP adapter, 4-stage training pipeline, and VSI-590K + general video instruction data.

Cambrian-S-3B · license:apache-2.0 · 8 downloads · 0 likes

Same model family and training recipe as Cambrian-S-7B: a spatially grounded multimodal LLM built on Qwen2.5-3B-Instruct (3B parameters) with the SigLIP2-SO400M vision encoder, 2-layer MLP adapter, 4-stage training pipeline, and VSI-590K + general video instruction data.

Cambrian-S-1.5B · license:apache-2.0 · 8 downloads · 0 likes

Same model family and training recipe as Cambrian-S-7B: a spatially grounded multimodal LLM built on Qwen2.5-1.5B-Instruct (1.5B parameters) with the SigLIP2-SO400M vision encoder, 2-layer MLP adapter, 4-stage training pipeline, and VSI-590K + general video instruction data.

moco-v3-vit-l · 6 downloads · 1 like

RAE-collections · license:mit · 0 downloads · 46 likes

RAE: Diffusion Transformers with Representation Autoencoders. This repository contains the official PyTorch checkpoints for Representation Autoencoders (RAE), a class of autoencoders that use pretrained, frozen representation encoders such as DINOv2 and SigLIP2 together with trained ViT decoders. RAE can be used in a two-stage training pipeline for high-fidelity image synthesis, where a Stage 2 diffusion model is trained on the latent space of a pretrained RAE to generate images.
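The two-stage recipe described above (frozen representation encoder, trained ViT decoder, diffusion in latent space) can be sketched conceptually. Everything below is a toy stand-in for illustration only; the functions are placeholders, not the actual RAE code or its API:

```python
import random

random.seed(0)

def frozen_encoder(image):
    # stand-in for a pretrained, frozen encoder (e.g. DINOv2 or SigLIP2):
    # maps an image to a lower-dimensional latent; its weights never update
    return [sum(image[i::4]) / len(image[i::4]) for i in range(4)]

def decoder(latent, scale):
    # stand-in for the trained ViT decoder: maps latents back to pixels;
    # `scale` plays the role of the learnable decoder parameters
    return [scale * z for z in latent for _ in range(4)]

# Stage 1: fit the decoder to reconstruct images from frozen-encoder latents
image = [random.random() for _ in range(16)]
latent = frozen_encoder(image)          # encoder stays frozen throughout
recon = decoder(latent, scale=1.0)      # decoder params fit via reconstruction loss

# Stage 2: a diffusion model is trained purely in this latent space;
# at sampling time, generated latents are decoded back to images
noisy_latent = [z + random.gauss(0, 0.1) for z in latent]  # one forward-noising step
generated = decoder(noisy_latent, scale=1.0)
assert len(generated) == len(image)
```

The key design point is that the generative model never sees pixels: it learns and samples in the encoder's representation space, and the decoder handles the mapping back to images.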

solaris · license:apache-2.0 · 0 downloads · 5 likes

FreeFlow · license:mit · 0 downloads · 1 like

cambrian-13b-pretrain · 0 downloads · 1 like

cambrian-8b-pretrain · 0 downloads · 1 like

cambrian-34b-pretrain · 0 downloads · 1 like