# gemma-3n-E2B-it-Android-Control-84k

# gemma-3n-E4B-it-Android-Control-84k
# Gemma-3n-E4B-it Android Control LoRA Fine-tuned Model

## Model Overview
This model is a fine-tuned version of Google's `gemma-3n-E4B-it` base model with LoRA adaptation for Android UI control tasks.

## Training Data
- Dataset: OfficerChul/Android-Control-84k
- Data Format: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)

## Training Method
LoRA fine-tuning performed using the LLaMA-Factory framework.

### 1. Training Configuration (`gemma3n-e4b-it.yaml`)
- Base Model: `google/gemma-3n-E4B-it`
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 32
  - Target modules: `q_proj, k_proj, v_proj, o_proj`
- Training Parameters:
  - Batch size: 4 (gradient accumulation: 48)
  - Learning rate: 2e-5
  - Epochs: 5
  - LR scheduler: Cosine
  - Optimizer: AdamW (fused)
  - Precision: bf16
- Additional Settings:
  - Gradient checkpointing enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-2 utilized

### 2. Model Merging (`gemma3n-e4b-it_lora_sft_merge.yaml`)
Merged the trained LoRA adapter with the base model:
- Base Model: `google/gemma-3n-E4B-it`

## Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2226 | 2.4288 | 500 | 0.2229 |
| 0.1658 | 4.8577 | 1000 | 0.2125 |

## Supported Action Types
- `click`: Click on specific coordinates
- `long_press`: Long press action
- `scroll`: Scroll (up/down)
- `input_text`: Text input
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to the home screen
- `open_app`: Open an application
- `wait`: Wait action

## Usage
The merged model can be loaded directly with the Hugging Face Transformers library.

## Performance
Performance comparison on the Android UI control benchmark:

| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match | Avg. Episode Accuracy | Malformed JSON | Execution Time (s) | Inference Time (s) |
|-------|---------------------|-------------------|------------------|----------------------|---------------------|----------------|-------------------|-------------------|
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) | 0.9063 | 110 (5.5%) | 1101.43 | 485.12 |
| Qwen/Qwen2.5-VL-7B-Instruct | 0.6125 | 59.89 (n=544) | 0.8197 (n=61) | 0.3243 (n=111) | 0.6163 | 499 (24.9%) | 720.88 | 580.92 |
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) | 0.6615 | 440 (22.0%) | 676.76 | 536.27 |
| OfficerChul/Qwen2.5-VL-7B-Instruct-Android-Control-5a | 0.9970 | 427.30 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9974 | 0 (0.0%) | 1086.97 | 581.82 |
| OfficerChul/Qwen2.5-VL-3B-Instruct-Android-Control-5a | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) | 0.9976 | 1 (0.1%) | 672.88 | 530.95 |
| OfficerChul/InfiGUI-G1-7B-Android-Control-5a | 0.9970 | 466.24 (n=1466) | 0.9434 (n=159) | 0.9775 (n=267) | 0.9968 | 1 (0.1%) | 897.58 | 552.23 |
| OfficerChul/InfiGUI-G1-3B-Android-Control-5a | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) | 0.9983 | 0 (0.0%) | 722.63 | 529.57 |
| InfiX-ai/InfiGUI-G1-7B | 0.6715 | 82.21 (n=821) | 0.8000 (n=70) | 0.2268 (n=194) | 0.6763 | 457 (22.9%) | 698.77 | 557.50 |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) | 0.8910 | 78 (3.9%) | 702.93 | 559.65 |
| OfficerChul/gemma-3n-E2B-it-Android-Control-84k | 0.5819 | 985.82 (n=123) | 0.8596 (n=114) | 0.2159 (n=88) | 0.5781 | 0 (0.0%) | 322.95 | 159.23 |
| OfficerChul/gemma-3n-E4B-it-Android-Control-84k | 0.5088 | 878.66 (n=124) | 0.8763 (n=97) | 0.3689 (n=103) | 0.5121 | 0 (0.0%) | 363.23 | 177.11 |

## License
Follows the license terms of the Google Gemma model.
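As the Usage section notes, the merged checkpoint loads like any other Gemma 3n model from the Transformers library. A minimal sketch of the chat-style payload (a screenshot paired with an instruction, per the data format above); the actual inference calls are shown as comments because they require downloading the full checkpoint:

```python
# Model ID taken from the card; the message layout below follows the usual
# Transformers chat-template convention for image+text models.
MODEL_ID = "OfficerChul/gemma-3n-E4B-it-Android-Control-84k"

def build_messages(instruction: str, screenshot_path: str) -> list:
    """Pair a mobile UI screenshot with a user instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": screenshot_path},
            {"type": "text", "text": instruction},
        ],
    }]

messages = build_messages("Open the Settings app", "screen.png")

# Actual inference (requires the multi-GB checkpoint; for reference only):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained(MODEL_ID)
# model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="bfloat16")
# inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
#                                        tokenize=True, return_dict=True,
#                                        return_tensors="pt")
# out = model.generate(**inputs, max_new_tokens=64)
# print(processor.decode(out[0], skip_special_tokens=True))
```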
## Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement.
- Proper validation is required before use in production environments.

## Framework Versions
- PEFT 0.17.1
- Transformers 4.57.0
- PyTorch 2.8.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.1
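The benchmark above reports a "Malformed JSON" column, so downstream code needs to validate model outputs before executing them. A minimal validator sketch; the field names (`action_type`, and any coordinate or text fields) are assumptions about the output schema, not taken from the card:

```python
import json

# Action names from the "Supported Action Types" list above; the
# "action_type" key is a hypothetical schema choice for illustration.
VALID_ACTIONS = {"click", "long_press", "scroll", "input_text",
                 "navigate_back", "navigate_home", "open_app", "wait"}

def parse_action(raw: str):
    """Return a validated action dict, or None if the output is malformed."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if action.get("action_type") not in VALID_ACTIONS:
        return None
    return action

print(parse_action('{"action_type": "click", "x": 540, "y": 1200}'))
print(parse_action("not json"))  # None -> counted as malformed
```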
# Qwen2.5-VL-7B-Instruct-Android-Control-5a
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the andctrlskt dataset. It achieves the following results on the evaluation set:
- Loss: 0.1610

## Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 48
- total_train_batch_size: 768
- total_eval_batch_size: 4
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 5.0

## Training Results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.2389 | 1.1120 | 100 | 0.2237 |
| 0.153 | 2.2239 | 200 | 0.1694 |
| 0.0898 | 3.3359 | 300 | 0.1497 |
| 0.0425 | 4.4479 | 400 | 0.1605 |

## Framework Versions
- Transformers 4.56.1
- PyTorch 2.5.0a0+b465a5843b.nv24.09
- Datasets 3.0.1
- Tokenizers 0.22.1
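The reported total train batch size follows directly from the per-device batch size, gradient accumulation steps, and device count listed above:

```python
# Effective (total) train batch size from the hyperparameters above.
train_batch_size = 4            # per device
gradient_accumulation_steps = 48
num_devices = 4

total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)   # 768, matching the reported value
```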
# InfiGUI-G1-7B-Android-Control-5a

# InfiGUI-G1-3B-Android-Control-5a

# Qwen2.5-VL-3B-Instruct-Android-Control-5a

# Qwen3-VL-30B-A3B-Instruct-Android-Control-84k
# Qwen3-VL-30B-A3B-Instruct Android Control LoRA Fine-tuned Model

## Model Overview
This model is a fine-tuned version of Qwen's `Qwen3-VL-30B-A3B-Instruct` base model with LoRA adaptation for Android UI control tasks. It demonstrates strong performance on GUI grounding tasks, particularly in coordinate-prediction accuracy for click actions.

Strong GUI grounding performance:
- Click L2 Distance: 87.04 pixels, competitive with the other models in the benchmark
- Strong coordinate-prediction capability for GUI interaction tasks
- Solid performance on other action types (input text match: 0.8455, scroll direction match: 0.8689)

The model's spatial understanding of GUI elements makes it suitable for automated UI testing and accessibility applications.

## Training Data
- Dataset: OfficerChul/Android-Control-84k
- Data Format: Mobile UI screenshots paired with user instructions to perform appropriate actions (click, scroll, input, etc.)
- Training Samples: 84,000+ UI interaction examples

## Training Method
LoRA fine-tuning performed using the LLaMA-Factory framework.

### Training Configuration (`qwen3vl30b.yaml`)
- Base Model: `Qwen/Qwen3-VL-30B-A3B-Instruct`
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank: 8
  - Target modules: all
  - Image max pixels: 128,000
- Training Parameters:
  - Batch size: 4 (gradient accumulation: 48, effective batch size: 192)
  - Learning rate: 1e-4
  - Epochs: 5
  - LR scheduler: Cosine
  - Warmup ratio: 0.1
  - Optimizer: AdamW (fused)
  - Precision: bf16
  - Weight decay: 0.01
  - Cutoff length: 2048 tokens
- Additional Settings:
  - Gradient checkpointing enabled
  - Flash Attention 2 enabled
  - Vision tower, multi-modal projector, and language model all trainable
  - DeepSpeed ZeRO-3 utilized
  - Validation size: 5%
  - Evaluation steps: 100

## Training Results
- Total Steps: 2,055
- Final Training Loss: 0.2086
- Final Evaluation Loss: 0.1190
- Training Runtime: ~104 hours
- Samples per Second: 1.049

## Supported Action Types
- `click`: Click on specific coordinates (x, y)
- `long_press`: Long press action
- `scroll`: Scroll (up/down/left/right)
- `input_text`: Text input
- `navigate_back`: Navigate back
- `navigate_home`: Navigate to the home screen
- `open_app`: Open an application
- `wait`: Wait action

## Usage
The merged model can be loaded directly with the Hugging Face Transformers library.
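The cosine schedule with warmup used in the training configuration above (peak learning rate 1e-4, warmup ratio 0.1, 2,055 total steps) can be sketched as follows; the linear-warmup-then-cosine-decay shape is the standard convention, shown here for illustration:

```python
import math

# Values from the training configuration and results above.
PEAK_LR = 1e-4
TOTAL_STEPS = 2055
WARMUP = int(0.1 * TOTAL_STEPS)  # warmup ratio 0.1 -> 205 steps

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step."""
    if step < WARMUP:  # linear warmup from 0 to the peak LR
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(0), lr_at(WARMUP), lr_at(TOTAL_STEPS))
```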
## Performance
| Model | Action Type Accuracy | Click L2 Distance | Input Text Match | Scroll Direction Match |
|:--|:--:|:--:|:--:|:--:|
| Qwen/Qwen2.5-VL-3B-Instruct | 0.6645 | 88.21 (n=165) | 0.7889 (n=90) | 0.3519 (n=108) |
| OfficerChul/Qwen2.5-VL-3B-Instruct | 0.9965 | 446.54 (n=1467) | 0.9363 (n=157) | 0.9738 (n=267) |
| InfiX-ai/InfiGUI-G1-3B | 0.8745 | 102.39 (n=1020) | 0.7700 (n=100) | 0.2299 (n=174) |
| OfficerChul/InfiGUI-G1-3B | 0.9980 | 449.73 (n=1467) | 0.9625 (n=160) | 0.9625 (n=267) |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 0.9090 | 705.72 (n=812) | 0.8264 (n=121) | 0.3226 (n=248) |
| OfficerChul/Qwen3-VL-30B-A3B-Instruct_lora_sft | 0.5907 | 87.04 | 0.8455 | 0.8689 |
| Qwen/Qwen2.5-VL-72B-Instruct | 0.6594 | 64.98 (n=125) | 0.8879 (n=107) | 0.2925 (n=106) |
| OfficerChul/Qwen2.5-VL-72B-Instruct | 0.8838 | 529.23 | 0.9032 | 0.9512 |
| google/gemma-3n-E4B-it | 0.5398 | 824.09 | 0.7521 | 0.5217 |
| OfficerChul/gemma-3n-E4B-it | 0.5088 | 878.66 (n=124) | 0.8763 (n=97) | 0.3689 (n=103) |

## License
This model follows the Apache 2.0 license of the Qwen3-VL base model.

## Acknowledgements
- Base model: Qwen3-VL-30B-A3B-Instruct by the Qwen team
- Training framework: LLaMA-Factory
- Dataset: Android-Control-84k

## Notes
- This model was developed for research purposes in mobile UI automation and accessibility enhancement.
- Its strong GUI grounding performance makes it suitable for applications requiring precise coordinate prediction.
- Proper validation is required before use in production environments.
- For best results, ensure input images are clear and at an appropriate resolution.

Generated with LLaMA-Factory. For questions or issues, please open an issue on the model repository.
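The Click L2 Distance metric reported throughout these tables is the Euclidean (L2) distance in pixels between the predicted and ground-truth click coordinates, averaged over click actions; a minimal sketch of the per-sample computation:

```python
import math

def click_l2_distance(pred: tuple, gold: tuple) -> float:
    """Euclidean pixel distance between predicted and ground-truth clicks."""
    return math.hypot(pred[0] - gold[0], pred[1] - gold[1])

print(click_l2_distance((540, 1200), (520, 1180)))  # ~28.28 px
```

Lower is better: a model that clicks 20 px off in each axis scores ~28 px, while the benchmark entries range from roughly 60 px to nearly 1000 px.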