# OpenCUA: Open Foundations for Computer-Use Agents
OpenCUA models (OpenCUA-7B, OpenCUA-32B, and OpenCUA-72B) are end-to-end computer-use foundation models that can produce executable actions in computer environments. They are initialized from the weights of Qwen2.5-VL-7B-Instruct, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct, and they demonstrate superior performance across CUA benchmarks. In particular, OpenCUA-72B achieves an average success rate of 45.0% on OSWorld-Verified, establishing a new state of the art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). It also has strong grounding performance, achieving 60.8 on ScreenSpot-Pro and 37.3 (SOTA) on UI-Vision.

- **Superior Computer-Use Capability:** Able to execute multi-step computer-use actions with effective planning and reasoning
- **Multi-OS Support:** Trained on demonstrations across Ubuntu, Windows, and macOS
- **Visual Grounding:** Strong GUI element recognition and spatial reasoning capabilities
- **Multi-Image Context:** Processes up to 3 screenshots of history for better context understanding
- **Reflective Reasoning:** Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning

## Online Agent Evaluation

OpenCUA models achieve strong performance on OSWorld-Verified. OpenCUA-72B achieves the best performance among all open-source models, with an average success rate of 45.0% at 100 steps, outperforming prior open-source baselines by large margins and surpassing the proprietary Claude models.

| Model                         | 15 Steps | 50 Steps | 100 Steps |
|-------------------------------|:--------:|:--------:|:---------:|
| **Proprietary**               |          |          |           |
| OpenAI CUA                    | 26.0     | 31.3     | 31.4      |
| Seed 1.5-VL                   | 27.9     | —        | 34.1      |
| Claude 3.7 Sonnet             | 27.1     | 35.8     | 35.9      |
| Claude 4 Sonnet               | 31.2     | 43.9     | 41.5      |
| **Open-Source**               |          |          |           |
| Qwen 2.5-VL-32B-Instruct      | 3.0      | —        | 3.9       |
| Qwen 2.5-VL-72B-Instruct      | 4.4      | —        | 5.0       |
| Kimi-VL-A3B                   | 9.7      | —        | 10.3      |
| UI-TARS-72B-DPO               | 24.0     | 25.8     | 27.1      |
| UI-TARS-1.5-7B                | 24.5     | 27.3     | 27.4      |
| OpenCUA-7B (Ours)             | 24.3     | 27.9     | 26.6      |
| OpenCUA-32B (Ours)            | 29.7     | 34.1     | 34.8      |
| OpenCUA-72B (Ours)            | 39.0     | 44.9     | 45.0      |

## GUI Grounding Performance

| Model | OSWorld-G | ScreenSpot-V2 | ScreenSpot-Pro | UI-Vision |
|-------|-----------|---------------|----------------|-----------|
| Qwen2.5-VL-7B  | 31.4 | 88.8 | 27.6 | 0.85 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 | -    |
| UI-TARS-72B    | 57.1 | 90.3 | 38.1 | 25.5 |
| OpenCUA-7B     | 55.3 | 92.3 | 50.0 | 29.7 |
| OpenCUA-32B    | 59.6 | 93.4 | 55.3 | 33.3 |
| OpenCUA-72B    | 59.2 | 92.9 | 60.8 | 37.3 |

## Offline Agent Evaluation (AgentNetBench)

| Model | Coordinate Actions | Content Actions | Function Actions | Average |
|-------|--------------------|-----------------|------------------|---------|
| Qwen2.5-VL-7B  | 50.7 | 40.8 | 3.1  | 48.0 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA     | 71.7 | 57.3 | 80.0 | 73.1 |
| OpenCUA-7B     | 79.0 | 62.0 | 44.3 | 75.2 |
| OpenCUA-32B    | 81.9 | 66.1 | 55.7 | 79.1 |

> ⚠️ **Important for Qwen-based models (OpenCUA-7B, OpenCUA-32B, OpenCUA-72B):**
>
> To align with our training infrastructure, we have modified the model in two places:
>
> 1. The Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.
> 2. The models use the same tokenizer and chat template as Kimi-VL.
>
> Do not use the default transformers and vLLM classes to load the model. The tokenizer and chat template must be kept aligned when training the models.
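Because of these modifications, the checkpoints ship their own model code. Below is a minimal loading sketch, assuming the standard Hugging Face `trust_remote_code=True` path; the exact classes are provided by the checkpoint's remote code, and the `AutoImageProcessor` line is an assumption. See the repository's inference examples for the canonical version.

```python
# Minimal loading sketch. Assumption: the checkpoint registers custom
# model/tokenizer code, so everything routes through trust_remote_code=True
# instead of the stock Qwen2.5-VL classes (which must NOT be used here).
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

MODEL_PATH = "xlangai/OpenCUA-7B"  # or OpenCUA-32B / OpenCUA-72B

model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Assumption: the checkpoint also ships an image processor for screenshots.
image_processor = AutoImageProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
```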
First, install the required transformers dependencies; the sketch above shows the general loading pattern. For GUI grounding tasks, you can run the five grounding examples in `OpenCUA/model/inference/huggingface_inference.py`.

## 🖥️ Computer Use Agent

OpenCUAAgent is developed in the OSWorld environment on top of the OpenCUA models. It iteratively perceives the environment via screenshots, produces reflective long CoT as an inner monologue, and predicts the next action to execute. By default, OpenCUAAgent uses 3 screenshots of history and the L2 CoT format. Run OpenCUA-7B and OpenCUA-32B in OSWorld with the corresponding agent command. Currently we support only Hugging Face inference; vLLM support for the OpenCUA models is being implemented. Please stay tuned.

## AgentNet Dataset - Large-Scale Computer-Use Dataset

AgentNet is the first large-scale desktop computer-use agent trajectory dataset, containing 22.6K human-annotated computer-use tasks across Windows, macOS, and Ubuntu systems. Collecting computer-use agent training data requires three steps:

1. Demonstrate a human computer-use task via the AgentNetTool;
2. Preprocess the demonstration using Action Reduction & State-Action Matching;
3. Synthesize a reflective long CoT for each step.

### 1. AgentNetTool

Our AgentNetTool is a cross-platform GUI recorder that runs unobtrusively on annotators' machines. It captures synchronized screen video, mouse/keyboard events, and accessibility trees, then provides an in-browser UI for reviewing, trimming, and submitting demonstrations. AgentNetTool is available on Windows, macOS, and Ubuntu.

### 2. DataProcessor – Action Reduction & State–Action Matching

Raw demonstrations can contain thousands of low-level events that are too dense for model training. The DataProcessor module (`./data/data-process/`) performs two key steps (a toy sketch of the reduction step follows the TODO section below):

1. **Action Reduction:** merges granular signals into concise, semantically meaningful PyAutoGUI actions (e.g., collapsing mouse moves into a click, coalescing scrolls, grouping key-press sequences into text or hotkeys).
2. **State–Action Matching:** aligns every reduced action with the last visually distinct frame before the action begins, avoiding future-information leakage and yielding compact state–action pairs.

These processed trajectories underlie all downstream training and evaluation.

### 3. CoTGenerator – Synthesizing Reflective Long Chain-of-Thought Inner Monologue

To boost robustness and interpretability, we augment each trajectory with reflective long Chain-of-Thought (CoT) reasoning. The CoTGenerator pipeline (`./data/cot-generator/`) synthesizes step-level reflections that reflect on the previous action, explain why an action is chosen given the current observation and history, note potential alternative actions, and forecast the expected next state. Empirically, models trained with these rich CoTs scale better with data and generalize across unseen applications.

## AgentNetBench

AgentNetBench (`./AgentNetBench/`) provides a realistic offline evaluator for OS agent trajectories. It compares model-predicted low-level actions (click, moveTo, write, press, scroll, terminate, etc.) against ground-truth human actions and reports detailed metrics. 👉 See `AgentNetBench/README.md` for usage instructions.

## TODO

**vLLM Support.** We are actively working with the vLLM team to add support for OpenCUA models. Workaround: for now, please use the standard transformers library as shown in the examples above. We will update this section once vLLM support becomes available.

**Training Code.** OpenCUA models are developed on the training infrastructure of the Kimi Team. We are also developing a training pipeline based on open-source infrastructure.
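To make the DataProcessor's action-reduction step concrete, here is a toy, self-contained sketch. It is not the actual `./data/data-process/` implementation; the event schema and the reduction rule are illustrative assumptions only.

```python
# Toy action reduction: collapse a raw mouse-event stream into a single
# PyAutoGUI-style click. The real DataProcessor also coalesces scrolls and
# groups key presses into text/hotkeys; this shows only the core idea.
from dataclasses import dataclass

@dataclass
class RawEvent:
    kind: str   # "move", "press", or "release" (hypothetical schema)
    x: int
    y: int

def reduce_to_actions(events: list[RawEvent]) -> list[str]:
    actions: list[str] = []
    pending: RawEvent | None = None
    for e in events:
        if e.kind == "press":
            pending = e                          # button went down here
        elif e.kind == "release" and pending is not None:
            # Intermediate moves are dropped; the press/release pair
            # becomes one semantic click action.
            actions.append(f"pyautogui.click(x={pending.x}, y={pending.y})")
            pending = None
        # bare "move" events carry no intent on their own and are discarded
    return actions

stream = [
    RawEvent("move", 100, 200),
    RawEvent("move", 180, 220),
    RawEvent("press", 184, 223),
    RawEvent("release", 184, 223),
]
print(reduce_to_actions(stream))   # ['pyautogui.click(x=184, y=223)']
```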
## Acknowledgements

We thank Su Yu, Caiming Xiong, Binyuan Hui, and the anonymous reviewers for their insightful discussions and valuable feedback. We are grateful to Moonshot AI for providing training infrastructure and annotated data. We also sincerely appreciate Calvin, Ziwei Chen, Jin Zhang, Ze Li, Zhengtao Wang, Yanxu Chen, and Qizheng Gu from the Kimi Team for their strong infrastructure support and helpful guidance. The development of our tool is based on the open-source projects DuckTrack and OpenAdapt, and we are very grateful for their commitment to the open-source community. Finally, we extend our deepest thanks to all annotators for their tremendous effort and contributions to this project.

## License

This project is licensed under the MIT License; see the LICENSE file in the root folder for details.

## Research Use and Disclaimer

OpenCUA models are intended for research and educational purposes only.

**Prohibited Uses:**
- The model may not be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction.
- Use for illegal, unethical, or harmful activities is strictly prohibited.

**Disclaimer:**
- The authors, contributors, and copyright holders are not responsible for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use.
- Use of the "OpenCUA" name, logo, or trademarks does not imply any endorsement or affiliation unless separate written permission is obtained.
- Users are solely responsible for ensuring their use complies with applicable laws and regulations.

## Coordinate Systems

| Model | Coordinate system |
|-------|-------------------|
| OpenCUA/OpenCUA-A3B | Relative coordinates (not supported in this code) |
| OpenCUA/OpenCUA-Qwen2-7B | Relative coordinates |
| OpenCUA/OpenCUA-7B | Absolute coordinates |
| OpenCUA/OpenCUA-32B | Absolute coordinates |
| OpenCUA/OpenCUA-72B | Absolute coordinates |

OpenCUA models use different coordinate systems depending on the base model:

- **OpenCUA-Qwen2-7B:** outputs relative coordinates (0.0 to 1.0 range)
- **OpenCUA-7B, OpenCUA-32B, OpenCUA-72B (Qwen2.5-based):** output absolute coordinates after smart resize

**Understanding smart resize for Qwen2.5-based models:** The Qwen2.5-VL models use a "smart resize" preprocessing step that maintains the aspect ratio while fitting the image within pixel constraints. To convert predicted coordinates back to the original screen resolution, you need the smart resize function from the Qwen2.5-VL image preprocessing code.
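Below is a minimal sketch of that conversion. The `factor`, `min_pixels`, and `max_pixels` defaults are assumptions matching common Qwen2.5-VL configurations; check your image processor's config for the exact values.

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 3136, max_pixels: int = 12845056):
    """Qwen2.5-VL-style smart resize: round each side to a multiple of
    `factor` while keeping the total pixel count within budget.
    Simplified sketch; the defaults are assumptions."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def model_to_screen(x: float, y: float, screen_w: int, screen_h: int):
    """Map absolute coordinates predicted on the smart-resized image
    (OpenCUA-7B/32B/72B outputs) back to the original screen."""
    resized_h, resized_w = smart_resize(screen_h, screen_w)
    return x * screen_w / resized_w, y * screen_h / resized_h

# Example: a predicted click at (500, 300) on a 1920x1080 screenshot
print(model_to_screen(500, 300, 1920, 1080))
```

For the relative-coordinate models (e.g., OpenCUA-Qwen2-7B), instead multiply the predicted (x, y) in the [0, 1] range by the original width and height.

## Citation

If you use OpenCUA models in your research, please cite our work: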