QwenAmann-4B-dse
A multimodal vision-language model specialized for multilingual technical document retrieval.

QwenAmann-4B-dse is a 4B-parameter vision-language model designed for efficient retrieval of technical documentation. It directly encodes document screenshots into embeddings, preserving all information in the document, including text, images, and layout, without requiring separate content extraction.

ENERGY Benchmark (racineai/Open-VLM-Retrieval-Leaderboard):
- Competitive performance: Achieves performance comparable to Jina Embeddings v4 while being fully open source under the Apache 2.0 license (Jina Embeddings v4 is governed by the Qwen Research License because it derives from Qwen-2.5-VL-3B)
- Strong multilingual performance: Stable scores across the 5 tested languages
- Multi-domain training: Trained on 1.44M examples across 15+ technical domains

Key Features:
- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
- No Preprocessing Required: Works directly with document screenshots

Use Cases:
- Multilingual Technical Document Retrieval: Find relevant documents across multiple languages
- International Technical Support Systems: Match user questions to relevant documentation regardless of language
- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports
- Multi-Domain Search: Retrieve documents across military, energy, quantum computing, nuclear, geotechnical, and other technical domains

Training:
QwenAmann-4B-dse was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents. The model was fine-tuned on the OGCMEGA2 dataset, comprising 1.44M examples across 35+ languages, with a primary focus on 5 major European languages (English, French, German, Spanish, Italian).
The dataset spans 15+ technical domains, including military, energy, quantum computing, nuclear, and geotechnical engineering.

Authors:
- Paul Lemaistre - GD at Racine.ai, Adjunct Professor at École Centrale d'Électronique

Dataset Curators: Léo Appourchaux, Paul Lemaistre, Yumeng Ye, Mattéo KHAN, André-Louis Rochet

This model is released under the Apache 2.0 license.
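A usage sketch: once the model has produced query and document-screenshot embeddings, retrieval is a cosine-similarity search over the document set. The vectors below are toy placeholders, not real model outputs:

```python
import numpy as np

def top_k_documents(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3):
    """Rank document-screenshot embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity per document
    order = np.argsort(-scores)[:k]   # indices of the k best matches
    return order, scores[order]

# Toy 4-dimensional embeddings standing in for real model outputs
docs = np.array([[0.9, 0.1, 0.0, 0.1],
                 [0.1, 0.8, 0.2, 0.0],
                 [0.0, 0.1, 0.9, 0.2]])
query = np.array([0.85, 0.15, 0.05, 0.1])
idx, scores = top_k_documents(query, docs, k=2)
print(idx)  # document 0 ranks first
```

In production the same ranking step is typically delegated to a vector index; the point here is only that no text extraction sits between the screenshot and the search.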
UI-DETR-1 (Computer Use Agent v1) is a fine-tuned implementation of RF-DETR-M specifically optimized for autonomous computer interaction. This model serves as the visual perception backbone for our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.

[Demo](https://huggingface.co/spaces/racineai/UI-DETR-1)

Key Features:
- 70.8% accuracy on the WebClick benchmark (vs. 58.8% for OmniParser)
- Multi-action task support beyond single-click benchmarks
- Optimized training pipeline with merged COCO-format datasets
- Class-agnostic detection supporting diverse UI elements

Important: This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both UI-DETR-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for UI-DETR-1, 0.05 for OmniParser V2, taken from official sources) and refined prompts for improved task-instruction clarity. Both sets of results are presented for transparency; the optimized evaluation better represents real-world deployment scenarios, where parameters and prompts are tuned for specific use cases.

Training Data:
The model is trained on a merged COCO-format dataset combining multiple UI detection sources, ensuring broad coverage across platforms, applications, and visual styles. The class-agnostic approach enables detection of any clickable element without predefined categories.
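The confidence thresholds (0.35 for UI-DETR-1, 0.05 for OmniParser V2) act as a post-processing filter on raw detections. A minimal sketch of that step, with illustrative boxes and scores:

```python
def filter_detections(boxes, scores, threshold=0.35):
    """Keep only detections whose confidence meets the model-specific threshold.

    boxes:  list of (x1, y1, x2, y2) pixel rectangles
    scores: parallel list of detector confidences in [0, 1]
    """
    return [(b, s) for b, s in zip(boxes, scores) if s >= threshold]

# Illustrative raw detector output
raw_boxes = [(10, 10, 50, 30), (60, 10, 90, 30), (5, 40, 25, 60)]
raw_scores = [0.92, 0.40, 0.12]
kept = filter_detections(raw_boxes, raw_scores, threshold=0.35)
print(len(kept))  # 2 detections survive the 0.35 threshold
```

A lower threshold (such as OmniParser's 0.05) keeps far more candidate boxes, which is why tuning this parameter changes benchmark results so strongly.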
| Dataset | Train | Valid | Test | Total |
|--------------------------------|-------|-------|------|-------|
| months.v1i.coco | 173 | 25 | 12 | 210 |
| all-item-merged.v1-100-60.coco | 334 | 35 | 14 | 383 |
| Web.v3i.coco | 493 | 264 | 90 | 847 |
| Website elements.v3i.coco | 133 | 10 | 3 | 146 |
| Website Elements.v16i.coco | 679 | 55 | 21 | 755 |
| Website.v1i.coco | 844 | 242 | 123 | 1,209 |
| TOTAL | 2,656 | 631 | 263 | 3,550 |

Training Configuration:
- Training images: 2,656 annotated UI screenshots
- Epochs: 30
- Batch size: 8
- Learning rate: 5e-4

Benchmark Methodology:
The WebClick benchmark evaluates whether models can correctly identify clickable elements at specified target coordinates. Each sample returns a binary result (success/failure), and the final accuracy is the average across all samples. Evaluation was performed on 1,639 samples across three categories, using Gemini 2.5 Pro as the decision-making LLM.

Detection Configuration:
- UI-DETR-1:
  - Confidence threshold: 0.35
  - Model: RF-DETR-Medium
- OmniParser:
  - Confidence threshold: 0.05
  - IOU threshold: 0.1
  - Model: YOLOv8-based icon detection

Annotation System:
Both models use numbered bounding-box annotations: each detected UI element is assigned a unique ID displayed above its bounding box. The LLM (Gemini 2.5 Pro) analyzes the annotated screenshot and selects elements by their ID numbers to perform click actions. Each bounding box is drawn with a thin border for visibility, and the ID number is displayed in a black label box with white text positioned above the element.

LLM Decision Process:
The benchmark evaluates the agent in a constrained scenario where it must select a single element to click. The LLM receives:
1. A task instruction (e.g., "Click on March 19th in the calendar")
2. An annotated screenshot showing all detected elements with their IDs

The LLM is prompted to analyze the image and respond with a tool call. Note that the full UI-DETR-1 agent supports multiple actions (`click`, `type`, `scroll`, `press`, `rightclick`, `doubleclick`, etc.), but for benchmark consistency only the `click` action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.

Figure 1: BBC News website showing numbered annotations for all interactive elements, including navigation items, article links, and media controls.

Figure 2: Airbnb search interface with numbered annotations on calendar dates, property listings, filters, and interactive controls.

| Metric | UI-DETR-1 (RF-DETR-M) | OmniParser | Improvement |
|--------|-----------------------|------------|-------------|
| Overall Accuracy | 70.8% | 58.8% | +20% |
| Agent Browse | 66% | 58% | +14% |
| Calendars | 64% | 46% | +39% |
| Human Browse | 83% | 73% | +14% |

Table 1: Performance comparison between UI-DETR-1 and OmniParser across WebClick benchmark categories (optimized parameters)

The initial evaluation used default YOLO detection parameters, yielding an OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1, from official deployment configurations) and refined prompts, OmniParser improved to 58.8%. UI-DETR-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its 0.35 threshold throughout both evaluations.
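The per-sample scoring can be sketched as follows: the LLM's tool call names an element ID, the click lands at that box's centre, and the sample succeeds if the click falls inside the ground-truth region. The tool-call JSON shape and all coordinates below are illustrative assumptions, not the benchmark's actual schema:

```python
import json

def evaluate_click(tool_call_json: str, detections: dict, target_box: tuple) -> bool:
    """detections: element id -> (x1, y1, x2, y2); target_box: ground-truth clickable region."""
    call = json.loads(tool_call_json)
    x1, y1, x2, y2 = detections[call["element_id"]]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2         # click at the box centre
    tx1, ty1, tx2, ty2 = target_box
    return tx1 <= cx <= tx2 and ty1 <= cy <= ty2  # binary success/failure

detections = {3: (100, 200, 140, 230), 7: (300, 200, 340, 230)}
target = (95, 195, 145, 235)  # ground truth surrounds element 3
results = [
    evaluate_click('{"action": "click", "element_id": 3}', detections, target),
    evaluate_click('{"action": "click", "element_id": 7}', detections, target),
]
accuracy = sum(results) / len(results)
print(accuracy)  # 0.5
```

Final benchmark accuracy is exactly this average taken over all 1,639 samples.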
Figure 3: Comparison showing the impact of parameter optimization on OmniParser performance (40.7% → 58.8%)

Figure 4: Category-level results demonstrating performance gains from optimized detection parameters and improved prompts

Category Breakdown:
- Agent Browse: Automated navigation tasks requiring identification of typical web elements such as buttons, links, and form fields
- Calendars: Date-selection interfaces with dense grid layouts of small, similar-looking elements
- Human Browse: Real-world web-browsing scenarios with diverse UI patterns and complex page structures

UI-DETR-1 shows particularly strong performance on Calendar tasks (+39% improvement), demonstrating a superior ability to distinguish between densely packed, visually similar UI elements, a critical capability for autonomous agents.

Detection Statistics:
- Average elements detected per image: UI-DETR-1 detects 82.3 elements vs. OmniParser's 50.6
- Processing time: UI-DETR-1 averages 0.82s per image vs. OmniParser's 0.77s

Examples showing UI-DETR-1 (blue boxes) vs. OmniParser (orange boxes) detection capabilities:

Figure 5: Calendar date-selection interface with dual-month view (April-May 2026). UI-DETR-1 detects 103 interactive elements, including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser identifies only 47 elements, missing numerous calendar dates and form controls.

Figure 6: Spotify music streaming platform showing search results for the artist "Gojira". UI-DETR-1 identifies 98 elements, including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections, which miss several interactive elements and granular controls.
WebClick benchmark click-decision examples with Gemini 2.5 Pro (green box: ground truth, blue: UI-DETR-1 selection, orange: OmniParser selection):

Figure 7: Travel booking website with flight search and calendar date picker (April-May 2025). Query: click task on the calendar interface. UI-DETR-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.

Figure 8: Booking.com accommodation search with a stay-duration selector. Query: select a stay-duration option. UI-DETR-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration-selector interface.

The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), UI-DETR-1 is designed for multi-action sequences beyond the single-click paradigm:
- Sequential Actions: Screenshots before/after each action for context awareness
- Complex Workflows: Navigate through multi-step processes autonomously
- Error Recovery: Adaptive behavior based on UI state changes

Video 1: Example of the UI-DETR-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.

UI-DETR-1 serves as the visual perception foundation for an autonomous computer use agent, with multiple detection modes, capable of complex multi-step interactions across any graphical interface.
While WebClick evaluates single-click accuracy, UI-DETR-1 excels at complex multi-action sequences. The agent maintains state across actions through before/after screenshots, enabling:
- Error detection and recovery
- Verification of action success
- Handling of semi-dynamic content

The agent integrates seamlessly with vision-language models (Gemini, GPT-4V, Claude) for intelligent decision-making based on visual context and user intent.

Usage:
The `model.pth` file contains the model weights. Install the required dependencies before loading them.

Current Limitations:
- 70.8% single-click accuracy leaves room for improvement
- Performance degrades on very small UI elements (<20px)
- Limited to static screenshots (no video/animation support yet)

Authors:
- Léo Appourchaux - Lead Developer at TW3 Partners
- Noé Brandolini - R&D at TW3 Partners, Student at École Centrale d'Électronique
- David Soeiro-Vuong - R&D at Racine.ai, Student at École Centrale d'Électronique
- Matis Despujols - R&D at TW3 Partners
- Paul Lemaistre - GD at Racine.ai, Adjunct Professor at École Centrale d'Électronique

About École Centrale d'Électronique:
ECE, a multi-program, multi-campus, and multi-sector engineering school specializing in digital engineering, trains engineers and technology experts for the 21st century, capable of meeting the challenges of the dual digital and sustainable-development revolutions.
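The multi-action loop described above (screenshot, detect, decide, act, re-screenshot) can be sketched as below. `take_screenshot`, `detect_elements`, `ask_llm`, and `execute` are hypothetical stand-ins for the real agent components, stubbed out here so the loop runs end to end:

```python
# Minimal stubbed environment; the real agent wires these to the OS and models.
state = {"screen": 0}

def take_screenshot():
    return state["screen"]

def detect_elements(screenshot):
    return {1: (10, 10, 50, 30)}  # UI-DETR-1 numbered detections: id -> bounding box

def ask_llm(task, screenshot, elements):
    # Stub policy: click once, then report completion.
    return {"action": "done"} if state["screen"] >= 1 else {"action": "click", "element_id": 1}

def execute(decision, elements):
    state["screen"] += 1          # clicking changes the UI state

def run_agent(task: str, max_steps: int = 10) -> bool:
    """Iterate screenshot -> detect -> LLM decision -> act until the task is done."""
    for _ in range(max_steps):
        before = take_screenshot()
        elements = detect_elements(before)          # numbered bounding boxes
        decision = ask_llm(task, before, elements)  # vision-language model picks an action
        if decision["action"] == "done":
            return True
        execute(decision, elements)
        after = take_screenshot()                   # before/after pair verifies the action
        if after == before:
            continue                                # no visible change: retry or recover
    return False

print(run_agent("open settings"))  # True
```

Comparing the before/after screenshots is what gives the agent its error-detection and verification behaviour.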
Flantier2-SmolVLM-2B-dse
A lightweight multimodal vision-language model specialized for multilingual technical document retrieval.

Flantier2-SmolVLM-2B-dse is the second generation of the Document Screenshot Embedding model, a 2B-parameter vision-language model designed for efficient retrieval of technical documentation. It directly encodes document screenshots into embeddings, preserving all information, including text, images, and layout, without requiring separate content extraction.

ENERGY Benchmark (racineai/Open-VLM-Retrieval-Leaderboard):
- Significant multilingual improvement: +3.25% relative improvement, achieving a 0.8197 average (vs. 0.7939 for Flantier1)
- Excellent performance across all European languages:
  - Strong performance in English (0.8777)
  - Major improvement in Italian (+6.85%) and German (+5.73%)
  - Better balance across all tested languages
- Efficient training: Trained on approximately 2% of the OGCMEGA dataset

Flantier2 was fine-tuned on a carefully curated subset of the OGCMEGA document retrieval dataset, representing approximately 2% of the full dataset. The training strategy leveraged hard negatives and the dataset's linguistic diversity to maximize data efficiency. This approach enabled the model to learn robust cross-lingual features, successfully rebalancing language representation. The strategic rebalancing produced a slight -1.03% decrease in English (0.8868 → 0.8777) but substantial improvements in underperforming languages: Italian (+6.85%), German (+5.73%), French (+3.88%), and Spanish (+1.67%), yielding a +3.25% overall relative improvement (0.8197 vs. 0.7939 average).
Flantier2 demonstrates significantly better multilingual balance than Flantier1, as measured by statistical dispersion metrics across the five supported languages:
- Variance: 42.4% reduction (0.0036 → 0.0021), showing more consistent performance across languages
- Standard Deviation: 24.1% reduction (0.0602 → 0.0457), indicating tighter clustering of language scores
- Coefficient of Variation: 26.5% reduction (7.59% → 5.58%), demonstrating improved relative consistency
- Score Range: 27.9% reduction (0.1782 → 0.1285), reflecting a narrower gap between the best and worst performing languages

These metrics confirm that Flantier2 achieves approximately 24-42% better balance across all five European languages, reducing the performance disparities observed in the first generation while maintaining high absolute scores.

Key Distinction: Flantier2 is specifically optimized for balanced European multilingual performance (EN, FR, DE, ES, IT). While achieving an average score of 0.8197, the model prioritizes uniform language coverage and training efficiency (≈2% of the dataset) over peak performance. This strategic trade-off accepted a slight decrease in English (-1.03%) in exchange for substantial improvements across the other European languages (+1.67% to +6.85%), demonstrating better cross-lingual consistency.
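The dispersion metrics above are all computable from the per-language scores with the standard library. A sketch with an illustrative score vector (not the published Flantier2 numbers):

```python
import statistics

def dispersion(scores: list[float]) -> dict:
    """Spread metrics used to compare multilingual balance between model generations."""
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores)        # population variance
    std = var ** 0.5                          # standard deviation
    return {
        "variance": var,
        "std_dev": std,
        "coeff_of_variation": std / mean,     # relative consistency
        "range": max(scores) - min(scores),   # best-worst language gap
    }

# Illustrative per-language scores (EN, FR, DE, ES, IT), not the published values
scores = [0.88, 0.82, 0.80, 0.81, 0.79]
m = dispersion(scores)
print(round(m["range"], 2))  # 0.09
```

Lower values on all four metrics for a new generation, at comparable mean score, is exactly the "better balance" claim being made above.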
Key Features:
- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
- Lightweight Architecture: Only 2B parameters, runs on consumer GPUs
- Enhanced Multilingual Support: Optimized performance on English, French, German, Spanish, and Italian
- No Preprocessing Required: Works directly with document screenshots

Use Cases:
- Multilingual Technical Document Retrieval: Find relevant documents across multiple European languages
- International Technical Support Systems: Match user questions to relevant documentation regardless of language
- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports
- European Technical Documentation: Optimized support for documents in EN, FR, DE, ES, and IT

Training:
Flantier2 was trained using the improved Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format, eliminating content extraction preprocessing while preserving all visual and textual information in documents. This second generation was developed with remarkable efficiency:
- OGCMEGA Dataset: Trained on approximately 2% of the complete dataset, demonstrating excellent learning efficiency
- Enriched Multilingual Data: Strategic selection of examples covering 5 European languages
- Cross-lingual Optimization: Better balance of performance across languages
- Refined Architecture: Improvements for multimodal and multilingual understanding

Architecture:
SmolVLM-Instruct serves as the base model, with LoRA adaptation for efficient fine-tuning. The system uses BFloat16 precision, Flash Attention 2, and last-token pooling for representation extraction.

Dataset:
OGCMEGA comprises 1.09M examples across five languages. Each example contains a query, one positive document image, and up to 16 hard negatives. Critically, 25% of examples include hard negatives while 75% do not.
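Training pairs each query with its positive and a pool of negatives under a contrastive objective. A plain-numpy sketch of the InfoNCE loss over an in-batch similarity matrix, where row i's correct "class" is passage i (the embeddings are random placeholders, and the temperature value is illustrative):

```python
import numpy as np

def info_nce_loss(query_embs: np.ndarray, passage_embs: np.ndarray,
                  temperature: float = 0.05) -> float:
    """Treat the query-passage similarity matrix as a classification problem
    where the correct class for query i is passage i (its positive)."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature             # off-diagonal = in-batch negatives
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())   # cross-entropy on the diagonal

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
positives = queries + 0.1 * rng.normal(size=(4, 8))  # positives close to their queries
loss = info_nce_loss(queries, positives)
print(loss)
```

Hard negatives and distributed negatives simply add extra columns to the logits matrix; the diagonal-as-target structure stays the same.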
Negative Sampling:
Three complementary sources provide negatives: dataset hard negatives (semantically similar but incorrect documents), in-batch negatives (other queries' positives), and distributed negatives (embeddings gathered from all GPUs in multi-GPU training).

Adaptive Batching:
Two batch types address dataset heterogeneity: standard-sized batches for examples with hard negatives, and enlarged batches for examples without hard negatives, which maintain a constant passage volume through additional in-batch negatives.

Training Process:
Data follows the SmolVLM chat format. The InfoNCE loss treats similarity matrices as classification tasks, maximizing query-positive similarity while minimizing query-negative similarity. DeepSpeed ZeRO Stage 3 enables distributed training with AdamW optimization, linear warmup (10% of steps), and gradient checkpointing.

Outcome:
This architecture achieves efficient multilingual visual document retrieval through parametric adaptation, multi-source negative exploitation, adaptive batching, and distributed optimization on accessible hardware.

Results:
- +3.25% relative improvement on the ENERGY benchmark (0.8197 vs. 0.7939 average)
- Major progress on Italian (+6.85%) and German (+5.73%)
- Better overall multilingual balance (24-42% reduction in dispersion metrics)
- Strategic rebalancing: slight English decrease (-1.03%) for substantial gains in other European languages
- Ultra-efficient training on ≈2% of the OGCMEGA dataset

Authors:
- Noé Brandolini - R&D at TW3 Partners, Student at École Centrale d'Électronique
- Paul Lemaistre - GD at Racine.ai, Adjunct Professor at École Centrale d'Électronique
Flantier-SmolVLM-2B-dse
A lightweight multimodal vision-language model specialized for technical document retrieval.

Flantier-SmolVLM-2B-dse (Document Screenshot Embedding) is a 2B-parameter vision-language model designed for efficient retrieval of technical documentation. It directly encodes document screenshots into embeddings, preserving all information, including text, images, and layout, without requiring separate content extraction.

Key Features:
- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
- Lightweight Architecture: Only 2B parameters, runs on consumer GPUs
- No Preprocessing Required: Works directly with document screenshots

Use Cases:
- Technical Document Retrieval: Find relevant documents based on technical queries
- Technical Support Systems: Match user questions to relevant documentation
- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports

Training:
This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.

This model is released under the Apache 2.0 license.
Flantier-Nuclear-Reglementation-1
A specialized vision-language model optimized for nuclear regulatory document analysis and retrieval.

Flantier-Nuclear-Reglementation-1 is a fine-tuned version of HuggingFaceTB/SmolVLM-Instruct, specifically optimized for nuclear regulatory document retrieval. The model demonstrates exceptional performance in analyzing technical documents, diagrams, and regulatory content in both English and French, achieving state-of-the-art results in nuclear-domain applications.

Key Features:
- Nuclear Domain Expertise: Fine-tuned on 40,000 nuclear regulatory examples from the IAEA, NEA/OECD, WENRA, EU directives, and French nuclear authorities
- Multimodal Analysis: Simultaneously processes regulatory text, technical diagrams, safety flowcharts, and parameter tables
- Bilingual Performance: Optimized for both English and French nuclear documentation
- High-Precision Retrieval: Achieves 74% accuracy (EN) and 61% accuracy (FR) on nuclear regulatory document retrieval tasks
- European Sovereignty: Built on a European open-source architecture for strategic autonomy in critical sectors

Performance on nuclear regulatory document retrieval (NDCG@1):

| Model | English | French |
|-------|---------|--------|
| HuggingFaceTB/SmolVLM-Instruct | 0.17 | 0.04 |
| llamaindex/vdr-2b-multi-v1 | 0.66 | 0.48 |
| racineai/Flantier-SmolVLM-2B-dse | 0.69 | 0.57 |
| Flantier-Nuclear-Reglementation-1 | 0.74 | 0.61 |

Use Cases:
- Nuclear Regulatory Compliance: Retrieve relevant safety standards and regulatory requirements
- Technical Documentation Analysis: Process nuclear technical diagrams, safety flowcharts, and parameter tables
- Multilingual Regulatory Search: Handle international nuclear documentation in English and French
- Safety Assessment Support: Assist in nuclear safety evaluations and compliance verification

Training:
This model was fine-tuned using LoRA (Low-Rank Adaptation) on our specialized OGC Nuclear Dataset, which includes:
- Regulatory documents from the IAEA, NEA/OECD, and WENRA
- European Union nuclear safety directives
- The French nuclear regulatory framework (ASN orders, IRSN guides)
- Technical documentation from nuclear operators

The training focused on nuclear-specific terminology, including criticality, containment, radiation protection, and regulatory compliance requirements. Training used the Organized Grouped Cleaned (OGC) Nuclear Dataset, available as an open-source resource for nuclear AI research and development.

Acknowledgements:
This work was developed in collaboration with the Intelligence Lab of École Centrale d'Électronique and built upon Hugging Face's foundational SmolVLM architecture. We thank the nuclear regulatory organizations whose public documentation enabled this research.

This model is released under the Apache 2.0 license.

Authors:
- Yumeng Ye: R&D at Racine.ai (Project Lead)
- Léo Appourchaux: AI Developer at TW3 Partners
- Noé Brandolini: R&D at TW3 Partners
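With a single relevant document per query, NDCG@1 reduces to checking whether the top-ranked document is the correct one. A minimal sketch of how the metric in the table above is computed (the rankings and labels below are illustrative):

```python
def ndcg_at_1(rankings: list[list[int]], relevant: list[int]) -> float:
    """rankings[i] lists document ids for query i, best first; relevant[i] is the
    id of the single correct document. With binary, single-positive relevance,
    NDCG@1 equals the fraction of queries whose top result is correct."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if ranked[0] == rel)
    return hits / len(relevant)

# Illustrative retrieval results for four queries
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [2, 1, 0]]
relevant = [2, 1, 1, 2]               # correct document id per query
print(ndcg_at_1(rankings, relevant))  # 0.75
```

Under this reading, the 0.74 (EN) and 0.61 (FR) scores in the table mean the correct document is ranked first for roughly three quarters and three fifths of queries, respectively.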