nyu-visionx
cambrian-phi3-3b
cambrian-8b
siglip2_decoder
Scale-RAE-Qwen1.5B_DiT2.4B-WebSSL
moco-v3-vit-b
cambrian-13b
RAE-dinov2-wReg-small-ViTXL-n08
RAE-siglip2-base-p16-i256-ViTXL-n08
RAE-dinov2-wReg-base-ViTXL-n08
webssl300m_decoder
RAE-dinov2-wReg-large-ViTXL-n08
RAE-dinov2-wReg-base-ViTXL-n08-i512
RAE-mae-base-p16-ViTXL-n08
cambrian-34b
Cambrian-S-7B-LFP
Cambrian-S-7B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-7B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 7B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-0.5B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-0.5B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-0.5B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 0.5B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-3B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-3B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-3B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 3B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-1.5B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-1.5B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-1.5B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 1.5B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
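The "2-layer MLP adapter" in the Cambrian-S architecture lines above can be sketched as a small projection module that maps vision-encoder patch embeddings into the LLM's hidden space. This is an illustrative sketch, not the repository's implementation; the dimensions assume SigLIP2-SO400M's 1152-d features and Qwen2.5-7B's 3584-d hidden size, and the patch count is illustrative.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Illustrative 2-layer MLP projector: vision features -> LLM hidden space."""

    def __init__(self, vision_dim=1152, llm_dim=3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds):
        # patch_embeds: (batch, num_patches, vision_dim)
        return self.proj(patch_embeds)  # (batch, num_patches, llm_dim)

adapter = MLPAdapter()
visual_tokens = adapter(torch.randn(1, 576, 1152))  # illustrative patch count
```

The projected visual tokens would then be interleaved with text token embeddings before being fed to the language model.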
moco-v3-vit-l
RAE-collections
RAE: Diffusion Transformers with Representation Autoencoders
This repository contains the official PyTorch checkpoints for Representation Autoencoders (RAE): a class of autoencoders that pair pretrained, frozen representation encoders such as DINOv2 and SigLIP2 with trained ViT decoders. RAE supports a two-stage training pipeline for high-fidelity image synthesis, in which a Stage 2 diffusion model is trained on the latent space of a pretrained Stage 1 RAE to generate images.
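The frozen-encoder / trained-decoder structure described above can be sketched with a toy module. This is a minimal illustration of the RAE idea only: the "encoder" here is a stand-in linear patch embedder, not DINOv2 or SigLIP2, and all dimensions are invented for the example.

```python
import torch
import torch.nn as nn

class ToyRAE(nn.Module):
    """Toy Representation Autoencoder: frozen encoder, trainable decoder."""

    def __init__(self, patch=16, dim=192, img=64):
        super().__init__()
        self.patch, self.img = patch, img
        # Stand-in for a pretrained representation encoder (kept frozen).
        self.encoder = nn.Linear(3 * patch * patch, dim)
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Trainable decoder mapping latents back to pixel patches.
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 3 * patch * patch),
        )

    def patchify(self, x):
        p, b = self.patch, x.shape[0]
        x = x.unfold(2, p, p).unfold(3, p, p)         # (B, 3, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * p * p)

    def forward(self, x):
        latents = self.encoder(self.patchify(x))       # Stage 1 latent space
        return self.decoder(latents)                   # pixel-patch reconstruction

model = ToyRAE()
recon = model(torch.randn(2, 3, 64, 64))  # (2, 16 patches, 768 pixel values)
```

In the actual two-stage pipeline, a diffusion transformer (Stage 2) would be trained to generate samples in the latent space produced by the frozen encoder, with the trained decoder mapping those latents back to images.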