IDEA-Research

12 models • 2 total models in database

Sort by:

grounding-dino-base

--- license: apache-2.0 tags: - vision inference: false pipeline_tag: zero-shot-object-detection ---

license:apache-2.0

1,085,790

140

grounding-dino-tiny

--- license: apache-2.0 tags: - vision inference: false pipeline_tag: zero-shot-object-detection ---

license:apache-2.0

640,350

Rex-Omni

This model is Rex-Omni, a 3B-parameter Multimodal Large Language Model (MLLM) presented in the paper "Detect Anything via Next Point Prediction". It is compatible with the Hugging Face `transformers` library and is licensed under the IDEA License 1.0. > Rex-Omni is a 3B-parameter Multimodal Large Language Model (MLLM) that redefines object detection and a wide range of other visual perception tasks as a simple next-token prediction problem. We provide a series of tutorials to help you get started with Rex-Omni. - Detection Example - Pointing Example - OCR Example - Keypointing Example - Visual Prompting Example - Batch Inference Example Rex-Omni is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. This model is based on Qwen, which is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved. For questions and feedback, please contact us at: - Email: [email protected] - GitHub Issues: IDEA-Research/Rex-Omni 7. Citation Rex-Omni comes from a series of prior works. If you’re interested, you can take a look. - RexThinker - RexSeek - ChatRex - DINO-X - Grounidng DINO 1.5 - T-Rex2 - T-Rex

—

71,853

Rex-Thinker-GRPO-7B

🦖🧠 Rex-Thinker: Grounded Object Refering via Chain-of-Thought Reasoning 🦖🧠 > We propose Rex-Thinker, a Chain-of-Thought (CoT) reasoning model for object referring that addresses two key challenges: lack of interpretability and inability to reject unmatched expressions. Instead of directly predicting bounding boxes, Rex-Thinker reasons step-by-step over candidate objects to determine which, if any, match a given expression. Rex-Thinker is trained in two stages: supervised fine-tuning to learn structured CoT reasoning, followed by reinforcement learning with GRPO to enhance accuracy, faithfulness, and generalization. Our approach improves both prediction precision and interpretability, while enabling the model to abstain when no suitable object is found. Below is an example of the model's reasoning process: Rex-Thinker reformulates object referring as a Chain-of-Thought (CoT) reasoning task to improve both interpretability and reliability. The model follows a structured three-stage reasoning paradigm: 1. Planning: Decompose the referring expression into interpretable subgoals. 2. Action: Evaluate each candidate object (obtained via an open-vocabulary detector) against these subgoals using step-by-step reasoning. 3. Summarization: Aggregate the intermediate results to output the final prediction — or abstain when no object matches. Each reasoning step is grounded in a specific candidate object region through Box Hints, making the process transparent and verifiable. Rex-Thinker is implemented on top of QwenVL-2.5, and trained in two stages: - Supervised Fine-Tuning (SFT) Cold-start training using GPT-4o-generated CoT traces as supervision. - GRPO-based Reinforcement Learning Further optimizes reasoning accuracy, generalization, and rejection ability via Group Relative Policy Optimization. This CoT-based framework enables Rex-Thinker to make faithful, interpretable predictions while generalizing well to out-of-domain referring scenarios. 1.1 Download Pre-trained Model We provide the pre-trained model weights of Rex-Thinker-GRPO, which is trained on HumanRef-CoT through SFT and GRPO. You can download the model weights from Hugging Face. Or you can also using the following command to download the pre-trained models: 2. Inference 🚀 We provide a simple inference script to test the model. In this script, we use Grouning DINO to get the candidate boxes. You can run the following command to test the model: 3. Gradio Demo 🤗 We provide a Gradio demo for you to test the model. You can run the following command to start the Gradio demo: Then you can open your browser and visit `http://localhost:7860` to see the Gradio demo. You can input the image path, category name, and referring expression to test the model.

NaNK

—

176