Ugiat


IberLang

The IberLang classifier is a fine-tuned version of the Whisper Medium model, developed specifically for language identification across the Iberian linguistic spectrum. Trained to accurately identify Spanish, Catalan, Galician, Euskera (Basque), and Occitan, this model enhances Whisper's multilingual capabilities for regional language identification tasks. The pre-trained base used for fine-tuning was `openai/whisper-medium`.

We evaluated the fine-tuned IberLang classifier against Whisper Large V3 on a reserved subset of our custom VoxLingua107 IberLang dataset containing 1,200 audio samples. The results show substantial performance gains, particularly in the recognition of minority Iberian languages.

| Model | Catalan | Basque | Galician | Occitan | Spanish |
|------------------|---------|--------|----------|---------|---------|
| IberLang | 0.902 | 0.96 | 0.915 | 0.655 | 1.0 |
| Whisper-Large-V3 | 0.902 | 0.68 | 0.188 | 0.0 | 0.978 |

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:

- Data Splitting: The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- Training Setup:
  - Batch size: 4
  - Gradient accumulation steps: 8
  - Epochs: 3
  - Learning rate: 1e-5
  - Scheduler: Linear
  - Evaluation frequency: Every 300 steps
  - Checkpointing: Every 300 steps

IberLang is a fine-tuned version of Whisper Medium by OpenAI, licensed under the Apache License 2.0. Fine-tuning and additional modifications were performed by Ugiat Technologies to improve multilingual language identification for Catalan, Galician, Basque, Spanish, and Occitan. The resulting model and associated documentation are released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). When using this model, please cite both the original Whisper project and this fine-tuned version as appropriate.
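The arithmetic behind the training setup above can be sketched in a few lines: with a per-device batch size of 4 and 8 gradient-accumulation steps, an optimizer step is taken once every 32 examples, and the linear scheduler decays the learning rate from 1e-5 toward zero over training. This is an illustrative sketch, not the actual IberLang training code; function names and the total step count are assumptions.

```python
# Sketch of the effective batch size and linear LR schedule implied by
# the setup above. Not the real training script; for illustration only.

BATCH_SIZE = 4        # per-device batch size
GRAD_ACCUM_STEPS = 8  # micro-batches accumulated per optimizer step
BASE_LR = 1e-5

def effective_batch_size(batch_size: int, accum_steps: int) -> int:
    """Gradients are accumulated, so one optimizer step sees
    batch_size * accum_steps examples."""
    return batch_size * accum_steps

def linear_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Linear decay from base_lr to 0, matching the 'Linear' scheduler.
    `total_steps` is a placeholder; the real value depends on dataset size."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(effective_batch_size(BATCH_SIZE, GRAD_ACCUM_STEPS))  # effective batch of 32
print(linear_lr(0, 1000))     # base_lr at the start of training
print(linear_lr(500, 1000))   # half of base_lr midway through
```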

license:cc-by-4.0

NERCat

The NERCat classifier is a fine-tuned version of the GLiNER Knowledgator model, designed specifically for Named Entity Recognition (NER) in Catalan. By leveraging a manually annotated dataset of Catalan-language television transcriptions, this classifier significantly improves the recognition of named entities across diverse categories, addressing the challenges posed by the scarcity of high-quality training data for Catalan. The pre-trained version used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.

We evaluated the fine-tuned NERCat classifier against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. The results demonstrate significant performance improvements across all named entity categories:

| Entity Type | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1 |
|--------------|------|------|------|------|------|------|-------|-------|-------|
| Person | 1.00 | 1.00 | 1.00 | 0.92 | 0.80 | 0.86 | +0.08 | +0.20 | +0.14 |
| Facility | 0.89 | 1.00 | 0.94 | 0.67 | 0.25 | 0.36 | +0.22 | +0.75 | +0.58 |
| Organization | 1.00 | 1.00 | 1.00 | 0.72 | 0.62 | 0.67 | +0.28 | +0.38 | +0.33 |
| Location | 1.00 | 0.97 | 0.99 | 0.83 | 0.54 | 0.66 | +0.17 | +0.43 | +0.33 |
| Product | 0.96 | 1.00 | 0.98 | 0.63 | 0.21 | 0.31 | +0.34 | +0.79 | +0.67 |
| Event | 0.88 | 0.88 | 0.88 | 0.60 | 0.38 | 0.46 | +0.28 | +0.50 | +0.41 |
| Date | 0.88 | 1.00 | 0.93 | 1.00 | 0.07 | 0.13 | -0.13 | +0.93 | +0.80 |
| Law | 0.67 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | +0.67 | +1.00 | +0.80 |

The fine-tuning process followed a structured approach, including dataset preparation, model training, and optimization:

- Data Splitting: The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- Training Setup:
  - Batch size: 8
  - Steps: 500
  - Loss function: Focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: Linear with a warmup ratio of 0.1
  - Evaluation frequency: Every 100 steps
  - Checkpointing: Every 1000 steps

The dataset included 13,732 named entity instances across the eight categories listed above (Person, Facility, Organization, Location, Product, Event, Date, and Law).
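The focal loss named in the training setup down-weights easy, well-classified examples so that training concentrates on rare, hard entity classes (such as Law or Event). The sketch below shows the standard per-example formulation with the α = 0.75, γ = 2 values quoted above; it is a minimal illustration, not the actual GLiNER training code, which applies the loss over batched span logits.

```python
import math

def focal_loss(p: float, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Focal loss for a positive example predicted with probability p:

        FL(p) = -alpha * (1 - p)**gamma * log(p)

    With gamma = 2, a confident correct prediction (p near 1) is
    down-weighted by (1 - p)**2, so scarce, hard classes dominate
    the gradient instead of the abundant easy ones.
    """
    return -alpha * (1.0 - p) ** gamma * math.log(p)

# A confident correct prediction contributes almost nothing to the loss,
# while an uncertain one contributes far more:
print(focal_loss(0.95))
print(focal_loss(0.30))
```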

license:apache-2.0