Tongyi-MiA


UI-Ins-7B

Welcome to Tongyi-MiA!

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

📑 Paper | Code | 🤗 UI-Ins-7B | 🤗 UI-Ins-32B | 🤖 UI-Ins-7B | 🤖 UI-Ins-32B

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76% relative performance improvement.

In this paper, we introduce the Instruction-as-Reasoning paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives and enables the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition.

Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights, such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework.
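To make the claim about inference-time exploitation of instruction diversity more concrete, the sketch below shows one plausible way such a gain could be measured. It is not the authors' evaluation code. It assumes the common grounding criterion that a predicted click point is correct when it falls inside the target element's bounding box, and it counts a sample as solved if any instruction variant yields a correct point. The names `point_in_box`, `single_accuracy`, `oracle_accuracy`, and `relative_improvement`, as well as the toy predictions, are hypothetical.

```python
from typing import Sequence

Point = tuple[float, float]
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def point_in_box(point: Point, box: Box) -> bool:
    """True if the predicted click point lies inside the target box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def single_accuracy(preds: Sequence[Sequence[Point]], boxes: Sequence[Box], variant: int) -> float:
    """Accuracy when only one fixed instruction variant is used."""
    hits = sum(point_in_box(p[variant], b) for p, b in zip(preds, boxes))
    return hits / len(boxes)


def oracle_accuracy(preds: Sequence[Sequence[Point]], boxes: Sequence[Box]) -> float:
    """Accuracy when the best instruction variant is picked per sample."""
    hits = sum(any(point_in_box(pt, b) for pt in p) for p, b in zip(preds, boxes))
    return hits / len(boxes)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Toy data, made up for illustration: 4 samples, 2 instruction variants each.
boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50), (60, 60, 70, 70)]
preds = [
    [(5, 5), (5, 6)],    # both variants hit
    [(0, 0), (25, 25)],  # only variant 1 hits
    [(45, 45), (0, 0)],  # only variant 0 hits
    [(0, 0), (1, 1)],    # both variants miss
]

base = single_accuracy(preds, boxes, variant=0)
best = oracle_accuracy(preds, boxes)
gain = relative_improvement(base, best)
print(f"baseline={base:.2f} oracle={best:.2f} relative gain={gain:.0%}")
```

On this toy data the fixed variant solves 2 of 4 samples and the per-sample oracle solves 3 of 4, a 50% relative gain. The numbers only illustrate the arithmetic behind a "relative improvement" figure and say nothing about the reported results.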
You can run inference with UI-Ins using the following script:

    import re

    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    # Checkpoint to load, as given in the original script; point this at the
    # UI-Ins checkpoint you want to run.
    MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
    IMAGE_PATH = "path/to/your/image.jpg"
    INSTRUCTION = "Click the 'Search' button"


    def parse_coordinates(raw_string: str) -> tuple[int, int]:
        """Return the first [x, y] pair found in the response, or (-1, -1)."""
        matches = re.findall(r'\[(\d+),\s*(\d+)\]', raw_string)
        if matches:
            return tuple(map(int, matches[0]))
        return -1, -1


    print("Loading model...")
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    ).eval()
    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    image = Image.open(IMAGE_PATH).convert("RGB")
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant."},
                {
                    "type": "text",
                    "text": """You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n\n## Output Format\nReturn a json object with a reasoning process in tags, a function name and arguments within XML tags:\n\n represents the following item of the action space:\n## Action Space{"action": "click", "coordinate": [x, y]}\nYour task is to accurately locate a UI element based on the instruction. You should first analyze instruction in tags and finally output the function in tags.\n""",
                },
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": INSTRUCTION},
            ],
        },
    ]

    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

    print("Running inference...")
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    # Keep only the newly generated tokens and drop the prompt tokens.
    response_ids = generated_ids[0, len(inputs["input_ids"][0]):]
    raw_response = processor.decode(response_ids, skip_special_tokens=True)

    point_x, point_y = parse_coordinates(raw_response)

    print("\n" + "=" * 20 + " RESULT " + "=" * 20)
    print(f"Instruction: {INSTRUCTION}")
    print(f"Raw Response: {raw_response}")

    if point_x != -1:
        # The Qwen2.5-VL processor returns flattened patches in `pixel_values`,
        # so the resized image size is recovered from the patch grid instead.
        _, grid_h, grid_w = inputs["image_grid_thw"][0].tolist()
        patch_size = processor.image_processor.patch_size
        resized_height = grid_h * patch_size
        resized_width = grid_w * patch_size
        norm_x = point_x / resized_width
        norm_y = point_y / resized_height
        print(f"✅ Parsed Point (on resized image): ({point_x}, {point_y})")
        print(f"✅ Normalized Point (0.0 to 1.0): ({norm_x:.4f}, {norm_y:.4f})")
    else:
        print("❌ Could not parse coordinates from the response.")

    print("=" * 48)

Feel free to contact `[email protected]` if you have any questions.

This repo is released under the CC-BY-NC-SA 4.0 license. Please use it for non-commercial purposes ONLY.

Citation

If you use this repository or find it helpful in your research, please cite the UI-Ins paper.
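Returning to the inference script above: it reports the predicted point on the resized image and as a normalized fraction. An agent that actually clicks usually needs the position in the original screenshot's pixel space, so the sketch below shows one way to convert. The helper name `to_original_pixels` and the example sizes are hypothetical, and the conversion assumes the processor applies a plain rescale with no padding or cropping.

```python
def to_original_pixels(
    point: tuple[int, int],
    resized_size: tuple[int, int],
    original_size: tuple[int, int],
) -> tuple[int, int]:
    """Map (x, y) on the resized image to pixel coordinates on the original.

    Sizes are (width, height), matching PIL's `Image.size` convention.
    """
    x, y = point
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    # Scale each axis on its own, since the two axes may be rescaled by
    # slightly different factors.
    orig_x = round(x / resized_w * original_w)
    orig_y = round(y / resized_h * original_h)
    # Clamp so a point on the far edge still lands inside the image.
    orig_x = min(max(orig_x, 0), original_w - 1)
    orig_y = min(max(orig_y, 0), original_h - 1)
    return orig_x, orig_y


# Hypothetical example: a 1920x1080 screenshot resized to 1288x728.
click = to_original_pixels((644, 364), (1288, 728), (1920, 1080))
print(click)  # (960, 540)
```

In the script, the resized width and height computed before normalization would serve as `resized_size`, and `image.size` from PIL would serve as `original_size`.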

license:cc-by-nc-sa-4.0
283
2

UI-Ins-32B


license:cc-by-nc-sa-4.0
16
3