nyu-visionx
cambrian-phi3-3b
cambrian-8b
siglip2_decoder
Scale-RAE-Qwen1.5B_DiT2.4B-WebSSL
moco-v3-vit-b
cambrian-13b
RAE-dinov2-wReg-small-ViTXL-n08
RAE-siglip2-base-p16-i256-ViTXL-n08
RAE-dinov2-wReg-base-ViTXL-n08
webssl300m_decoder
RAE-dinov2-wReg-large-ViTXL-n08
RAE-dinov2-wReg-base-ViTXL-n08-i512
RAE-mae-base-p16-ViTXL-n08
cambrian-34b
Cambrian-S-7B-LFP
Cambrian-S-7B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-7B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 7B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-0.5B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-0.5B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-0.5B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 0.5B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-3B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-3B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-3B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 3B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
Cambrian-S-1.5B
Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-1.5B is a spatially grounded multimodal large language model focused on spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
- Architecture: Qwen2.5-1.5B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- Parameters: 1.5B
- Vision Encoder: SigLIP2-SO400M, 384 input resolution
- Training: 4-stage pipeline (image alignment → image IT → video IT → spatial IT)
- Training Data: VSI-590K (spatial reasoning) + general video instruction data
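The "2-layer MLP adapter" in the Cambrian-S architecture lines above can be sketched as a small projection module that maps vision-encoder patch embeddings into the LLM's hidden space. This is an illustrative sketch, not the repository's implementation; the dimensions assume SigLIP2-SO400M's 1152-d features and Qwen2.5-7B's 3584-d hidden size, and the patch count is illustrative.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Illustrative 2-layer MLP projector: vision features -> LLM hidden space."""

    def __init__(self, vision_dim=1152, llm_dim=3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds):
        # patch_embeds: (batch, num_patches, vision_dim)
        return self.proj(patch_embeds)  # (batch, num_patches, llm_dim)

adapter = MLPAdapter()
visual_tokens = adapter(torch.randn(1, 576, 1152))  # illustrative patch count
```

The projected visual tokens would then be interleaved with text token embeddings before being fed to the language model.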
moco-v3-vit-l
RAE-collections
RAE: Diffusion Transformers with Representation Autoencoders
This repository contains the official PyTorch checkpoints for Representation Autoencoders (RAE): a class of autoencoders that pair pretrained, frozen representation encoders such as DINOv2 and SigLIP2 with trained ViT decoders. RAE supports a two-stage training pipeline for high-fidelity image synthesis, in which a Stage 2 diffusion model is trained on the latent space of a pretrained Stage 1 RAE to generate images.
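The frozen-encoder / trained-decoder structure described above can be sketched with a toy module. This is a minimal illustration of the RAE idea only: the "encoder" here is a stand-in linear patch embedder, not DINOv2 or SigLIP2, and all dimensions are invented for the example.

```python
import torch
import torch.nn as nn

class ToyRAE(nn.Module):
    """Toy Representation Autoencoder: frozen encoder, trainable decoder."""

    def __init__(self, patch=16, dim=192, img=64):
        super().__init__()
        self.patch, self.img = patch, img
        # Stand-in for a pretrained representation encoder (kept frozen).
        self.encoder = nn.Linear(3 * patch * patch, dim)
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Trainable decoder mapping latents back to pixel patches.
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 3 * patch * patch),
        )

    def patchify(self, x):
        p, b = self.patch, x.shape[0]
        x = x.unfold(2, p, p).unfold(3, p, p)         # (B, 3, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * p * p)

    def forward(self, x):
        latents = self.encoder(self.patchify(x))       # Stage 1 latent space
        return self.decoder(latents)                   # pixel-patch reconstruction

model = ToyRAE()
recon = model(torch.randn(2, 3, 64, 64))  # (2, 16 patches, 768 pixel values)
```

In the actual two-stage pipeline, a diffusion transformer (Stage 2) would be trained to generate samples in the latent space produced by the frozen encoder, with the trained decoder mapping those latents back to images.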