hyf015

3 models • 1 total models in database

seine_weights

88
0

EgoThinker V1

⚡EgoThinker-v1 [\[📂 GitHub\]](https://github.com/InternRobotics/EgoThinker) [\[📜 Tech Report\]](https://arxiv.org/abs/2510.23569)

⭐️ We will release EgoThinker-v2 later, which supports real-world embodied intelligence and spatial understanding. Stay tuned!

Egocentric video reasoning focuses on the unseen, egocentric agent who shapes the scene, demanding inference of hidden intentions and fine-grained interactions, areas where current MLLMs struggle. We present EgoThinker, a framework that equips MLLMs with strong egocentric reasoning via spatio-temporal chain-of-thought supervision and a two-stage curriculum. We build EgoRe-5M, a large-scale QA dataset derived from 13M egocentric clips, featuring multi-minute segments with detailed rationales and dense hand–object grounding. Trained with SFT on EgoRe-5M and refined with RFT for better spatio-temporal localization, EgoThinker outperforms prior methods on multiple egocentric benchmarks and yields substantial gains in fine-grained localization tasks.

Our EgoThinker-v1 is built from Qwen2-VL-7B-Instruct.

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with pip. The chat model is used with `transformers` and `qwen_vl_utils`.

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

The model supports a wide range of input resolutions. By default, it uses the native resolution of the input, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels, for example to match a token count range of 256-1280, to balance speed and memory usage.

We also provide two methods for fine-grained control over the image size input to the model:

1. Define `min_pixels` and `max_pixels`: images are resized, preserving their aspect ratio, so that their pixel count falls within the range from `min_pixels` to `max_pixels`.
2. Specify exact dimensions: directly set `resized_height` and `resized_width`. These values are rounded to the nearest multiple of 28.
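
The two sizing rules above can be sketched in plain Python. This is an illustration only, not the toolkit's actual implementation: the helper names `pixel_budget`, `round_to_patch`, and `fit_to_budget` are hypothetical, the underscore spellings `min_pixels` and `max_pixels` are assumed, and the sketch assumes that one visual token covers one 28x28 pixel area, which is how a 256-1280 token range would translate into a pixel range.

```python
import math

PATCH = 28  # dimensions are rounded to multiples of 28, as stated in the text


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels).

    Assumption: one visual token corresponds to one 28x28 pixel area.
    """
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


def round_to_patch(value: int) -> int:
    """Round a dimension to the nearest multiple of 28, never below 28."""
    return max(PATCH, round(value / PATCH) * PATCH)


def fit_to_budget(height: int, width: int,
                  min_pixels: int, max_pixels: int) -> tuple[int, int]:
    """Hypothetical resize rule: keep the aspect ratio (approximately) while
    bringing height * width into the range [min_pixels, max_pixels]."""
    h, w = round_to_patch(height), round_to_patch(width)
    if h * w > max_pixels:
        # Shrink both sides by the same factor, rounding down to stay in budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Enlarge both sides by the same factor, rounding up to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


# Method 1: derive a pixel range from a token range, then fit an image to it.
min_pixels, max_pixels = pixel_budget(256, 1280)
fitted = fit_to_budget(1080, 1920, min_pixels, max_pixels)

# Method 2: request exact dimensions; each is rounded to a multiple of 28.
exact = (round_to_patch(300), round_to_patch(420))
```

The sketch ignores edge cases such as extreme aspect ratios, where rounding one side up to 28 pixels can push the result outside the requested range.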

64
2

Vinci-8B-base

license:mit
1
2