# InternRobotics
## InternVLA-N1: An Open Dual-System Navigation Foundation Model with Learned Latent Plans

Project page: https://internrobotics.github.io/internvla-n1.github.io/
Technical report: https://internrobotics.github.io/internvla-n1.github.io/static/pdfs/InternVLAN1.pdf
Data: https://huggingface.co/datasets/InternRobotics/InternData-N1

This repository hosts the official release of InternVLA-N1. The previous InternVLA-N1 model has been renamed to InternVLA-N1-Preview. If you are looking for the earlier preview version, please check InternVLA-N1-Preview. We recommend using this official release for research and deployment, as it contains the most stable and up-to-date improvements.

### Key Differences: Preview vs. Official

| Feature | InternVLA-N1-Preview | InternVLA-N1 (official) |
| ------------- | ----------------------------------------- | ------------------------------------------------------------------------ |
| System Design | Dual-system (synchronous) | Dual-system (asynchronous) |
| Training | System 1 trained only at System 2 inference steps | System 1 trained at denser steps (~25 cm), using the latest System 2 hidden state |
| Inference | Systems 1 and 2 inferred at the same frequency (~2 Hz) | Systems 1 and 2 inferred asynchronously, allowing dynamic obstacle avoidance |
| Performance | Solid baseline in simulation & benchmarks | Improved smoothness, efficiency, and real-world zero-shot generalization |
| Status | Historical preview | Stable official release (recommended) |

InternVLA-N1 is the first navigation foundation model that achieves joint tuning and asynchronous inference of System-2 reasoning and System-1 action, resulting in smooth and efficient execution during instruction-following navigation. The full model, as well as each system individually, achieves state-of-the-art performance on both mainstream and our newly established challenging benchmarks, including VLN-CE R2R & RxR, GRScenes-100, VLN-PE, etc.
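The asynchronous dual-system scheme above can be sketched with two threads: a slow System-2 loop that periodically refreshes a shared latent plan, and a fast System-1 loop that always acts on the freshest latent available. This is an illustrative sketch only, not the released API; the class and function names, rates, and string placeholders for latents and actions are all assumptions.

```python
import threading
import time

class LatentPlanBuffer:
    """Thread-safe holder for the most recent System-2 latent plan (illustrative)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._latent = None

    def write(self, latent):
        with self._lock:
            self._latent = latent

    def read(self):
        with self._lock:
            return self._latent

def system2_loop(buffer, steps, period_s):
    # Slow reasoning loop (~2 Hz in the report): publish a new latent plan
    # each cycle. The string stands in for a VLM hidden state.
    for step in range(steps):
        buffer.write(f"plan-{step}")
        time.sleep(period_s)

def system1_loop(buffer, steps, period_s, actions):
    # Fast action loop: decode an action from whatever latent is newest,
    # without waiting for System 2 to finish its next reasoning step.
    for _ in range(steps):
        latent = buffer.read()
        if latent is not None:
            actions.append(latent)  # placeholder for action decoding
        time.sleep(period_s)

buffer = LatentPlanBuffer()
actions = []
t2 = threading.Thread(target=system2_loop, args=(buffer, 4, 0.05))
t1 = threading.Thread(target=system1_loop, args=(buffer, 20, 0.01, actions))
t2.start()
t1.start()
t2.join()
t1.join()
```

Because the two loops are decoupled, System 1 can react (e.g., to a moving obstacle) at its own control rate even while System 2 is still mid-reasoning, which is the benefit the table above attributes to the asynchronous design.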
The training is based on simulation data (InternData-N1) only, with diverse scenes, embodiments, and other randomization, while achieving strong zero-shot generalization in the real world. Please refer to InternNav for inference, evaluation, and a Gradio demo.

If you find our work helpful, please consider starring this repo ⭐ and citing our technical report.

### License

This work is under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

### Acknowledgements

This repository is based on Qwen2.5-VL.
## VLAC: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning

[[paper]](https://github.com/InternRobotics/VLAC/blob/main/data/VLACEAI.pdf) [[code]](https://github.com/InternRobotics/VLAC) [[model]](https://huggingface.co/InternRobotics/VLAC)

> An online demo is now available on the Homepage. Try it as you like!

VLAC is a general-purpose pair-wise critic and manipulation model designed for real-world robot reinforcement learning and data refinement. It provides robust evaluation capabilities for task-progress prediction and task-completion verification based on images and a task description. VLAC is trained on 3,000+ hours of human egocentric data, 1,200+ hours of comprehensive public robotic manipulation data, and 15+ hours of self-collected manipulation data. VLAC-8B is coming soon! The 8B model can already be used on the Homepage.

- **Pair-wise comparison mechanism**: improves the accuracy of the dense progress critic, better recognizes state changes, and lets any step serve as the start of a trajectory.
- **Multi-modal capabilities**: supports process tracking, task-completion judgment, task-description estimation, visual question answering, and even embodied action output, i.e., full VLA capabilities.
- **Flexible zero-shot and one-shot in-context capabilities**: maintains excellent performance across embodiments, scenarios, and tasks.
- **Human-task synesthesia**: trained on the Ego4D human dataset, the model understands common tasks and builds synesthesia between real-world human tasks and embodied tasks.
- **Trajectory quality screening**: VLAC evaluates collected trajectories, filters out low-scoring trajectories based on the VOC value, and masks actions with negative pair-wise scores, i.e., data with low fluency and quality, improving the effectiveness and efficiency of imitation learning.
The VLAC model is trained on a combination of comprehensive public robotic manipulation datasets, human demonstration data, self-collected manipulation data, and various image-understanding datasets. Video data is processed into pair-wise samples to learn the task-progress difference between any two frames, supplemented with task descriptions and task-completion evaluation to enable task-progress understanding and action generation, as illustrated in the bottom-left corner. As shown in the diagram on the right, the model demonstrates strong generalization to new robots, scenarios, and tasks not covered in the training dataset. It can predict task progress and distinguish failed actions or trajectories, providing dense reward feedback for real-world reinforcement learning and guidance for data refinement. Additionally, the model can directly perform manipulation tasks, exhibiting zero-shot capabilities across different scenarios. Details about the model's performance and evaluation metrics can be found on the Homepage.

### Requirements

|              | Range    | Recommended | Notes                                        |
| ------------ | -------- | ----------- | -------------------------------------------- |
| python       | >=3.9    | 3.10        |                                              |
| cuda         |          | cuda12      | No need to install if using CPU, NPU, or MPS |
| torch        | >=2.0    |             |                                              |
| transformers | >=4.51   | 4.51.3      |                                              |
| peft         | >=0.15.2 |             |                                              |
| ms-swift     |          | 3.3         |                                              |

- Pair-wise image-input critic: please check this example.
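The pair-wise sample construction and score-based screening described above can be sketched as follows. This is a hedged illustration under simplifying assumptions: the function names (`make_pairs`, `filter_trajectories`) and the use of the mean pair-wise progress delta as a stand-in for the VOC score are invented for the example and are not the VLAC API.

```python
import itertools

def make_pairs(progress):
    """Turn a per-frame progress sequence into pair-wise samples.

    Each sample compares two frames i < j; the label is the signed progress
    change between them, so any frame can serve as a trajectory start.
    """
    return [(i, j, progress[j] - progress[i])
            for i, j in itertools.combinations(range(len(progress)), 2)]

def filter_trajectories(trajectories, threshold=0.3):
    """Keep trajectories whose mean pair-wise progress delta (a VOC-like
    score here, an assumption of this sketch) clears the threshold."""
    kept = []
    for progress in trajectories:
        pairs = make_pairs(progress)
        score = sum(delta for _, _, delta in pairs) / len(pairs)
        if score >= threshold:
            kept.append(progress)
    return kept

# A smoothly progressing trajectory is kept; a stalled one is screened out.
trajectories = [[0.0, 0.5, 1.0], [0.0, 0.2, 0.1]]
kept = filter_trajectories(trajectories)
```

Masking individual actions with negative pair-wise deltas (the second half of the screening step) would follow the same pattern, keying on each sample's sign rather than the trajectory-level mean.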