MWirelabs
khasibert
KhasiBERT is a foundational language model for the Khasi language, trained on 3.6 million sentences using the RoBERTa architecture. It serves as the foundation for downstream Khasi NLP tasks including text classification, sentiment analysis, question answering, and language generation.

### Model Details

| Attribute | Value |
| --- | --- |
| Model Name | KhasiBERT |
| Version | 1.0.0 |
| Architecture | RoBERTa-base |
| Parameters | 110,652,416 |
| Model Size | 421 MB |
| Language | Khasi (kha) |
| Language Family | Austroasiatic |
| Training Data | 3,621,116 sentences |
| Vocabulary Size | 32,000 tokens |
| Max Sequence Length | 512 tokens |
| Training Time | ~4 hours |
| GPU Used | NVIDIA RTX A6000 (48GB) |

### Architecture

KhasiBERT follows the RoBERTa-base architecture with the following specifications:

| Component | Configuration |
| --- | --- |
| Transformer Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Intermediate Size | 3,072 |
| Activation Function | GELU |
| Dropout | 0.1 |
| Layer Norm Epsilon | 1e-12 |
| Max Position Embeddings | 514 |

### Training Objective

KhasiBERT was trained using Masked Language Modeling (MLM): 15% of input tokens are randomly masked, and the model learns to predict these masked tokens from bidirectional context.

| Hyperparameter | Value |
| --- | --- |
| Training Objective | Masked Language Modeling |
| Masking Probability | 15% |
| Optimizer | AdamW |
| Learning Rate | 5e-5 |
| Learning Rate Schedule | Linear with warmup |
| Warmup Steps | 5,000 |
| Weight Decay | 0.01 |
| Batch Size | 24 |
| Gradient Accumulation | 1 |
| Training Epochs | 1 |
| Total Training Steps | 150,880 |
| Mixed Precision | FP16 |
| Hardware | NVIDIA RTX A6000 (48GB) |

### Tokenization

A custom byte-level BPE tokenizer was trained specifically on the Khasi corpus with:

- Vocabulary size: 32,000 tokens
- Special tokens: `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>` (the standard RoBERTa set)
- Trained on the complete Khasi dataset for optimal language coverage

### Training Data

| Statistic | Value |
| --- | --- |
| Total Sentences | 3,621,116 |
| Average Sentence Length | 83 characters |
| Estimated Total Tokens | ~50-70 million |
| Data Quality | High-quality, deduplicated |
| Language Coverage | Comprehensive Khasi text |

### Data Preprocessing

- Exact duplicate removal
- Near-duplicate removal (80% similarity threshold)
- Length filtering (10-500 characters)
- Text normalization and cleaning
- Quality validation

### Qualitative Evaluation

KhasiBERT demonstrates strong contextual understanding in Khasi (masked positions rendered as `<mask>`):

| Test Case | Input | Top Prediction | Confidence |
| --- | --- | --- | --- |
| Question Context | Phi lah bam `<mask>`? | bha | 6.8% |
| Location Context | Ka shnong jongngi ka don ha pdeng ki `<mask>` | khlaw | 7.1% |
| Place Reference | Ngan sa leit kai sha `<mask>` lashai. | Delhi | 10.6% |
| Action Context | Ngi donkam ban leit sha iew ban thied `<mask>`. | jingthied | 25.3% |
| Gratitude Expression | Khublei shibun na ka bynta ka jingiarap jong `<mask>`. | phi | 44.7% |
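These probes can be reproduced with the Transformers fill-mask pipeline. A minimal sketch follows (model id as linked in the citation section below; exact scores may vary slightly across library versions):

```python
from transformers import pipeline

# Load KhasiBERT as a fill-mask model (repo id from this card).
fill_mask = pipeline("fill-mask", model="MWirelabs/khasibert")

# Gratitude expression from the table above; per the card, the top
# prediction should be "phi" at roughly 44.7%.
for pred in fill_mask("Khublei shibun na ka bynta ka jingiarap jong <mask>."):
    print(f"{pred['token_str']!r}  {pred['score']:.1%}")
```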
### Key Strengths

- Authentic Khasi understanding with contextually appropriate predictions
- Pronoun recognition: correctly predicts "phi" (you) in gratitude expressions (44.7%)
- Semantic relationships: predicts "jingthied", related to "thied" (25.3%)
- Place name recognition: identifies proper nouns such as "Delhi" (10.6%)
- Grammatical structure awareness across question, location, and action contexts

### Applications

KhasiBERT serves as the foundation for various Khasi NLP applications (a fine-tuning sketch follows at the end of this card).

Supported Tasks:

- Text Classification: document categorization, topic modeling
- Sentiment Analysis: opinion mining in Khasi text
- Named Entity Recognition: person, place, and organization extraction
- Question Answering: Khasi reading comprehension systems
- Text Generation: coherent Khasi text creation
- Language Understanding: chatbots and virtual assistants
- Machine Translation: English-Khasi translation systems

Use Cases:

- Educational technology for Khasi language learning
- Government services in Meghalaya
- Cultural preservation and digital humanities
- Social media monitoring and analysis
- Content recommendation systems

### Impact

For Meghalaya State:

- Digital Government Services: enables Khasi language interfaces for e-governance
- Educational Technology: powers AI-driven Khasi language learning platforms
- Cultural Preservation: digitally preserves and promotes Khasi linguistic heritage
- Economic Development: creates a foundation for a local language tech industry

For Northeast India:

- Linguistic Diversity: supports AI development for Northeast India's rich language landscape
- Digital Inclusion: ensures indigenous communities are not left behind in the AI revolution
- Research Hub: positions Meghalaya as a center for indigenous language AI research

### System Requirements

| Task | RAM | GPU Memory | GPU |
| --- | --- | --- | --- |
| Inference (CPU) | 4GB | - | - |
| Inference (GPU) | 8GB | 2GB | Any CUDA GPU |
| Fine-tuning | 16GB | 8GB | RTX 3080+ |
| Full Training | 32GB | 24GB+ | RTX 4090/A6000+ |

Software Requirements:

- Python 3.8+
- PyTorch 1.9+
- Transformers 4.20+
- CUDA 11.0+ (for GPU)

### Significance

KhasiBERT represents a significant advancement in low-resource NLP:

- A foundational model for the Khasi language
- Enables NLP research for 1.4+ million Khasi speakers
- Preserves linguistic heritage through AI technology
- Demonstrates an efficient training methodology for resource-constrained scenarios

### Limitations

- Trained for 1 epoch (future versions may benefit from additional training)
- Performance may vary on highly domain-specific text
- Requires task-specific fine-tuning for optimal performance
- May not capture all dialectal variations of Khasi

### Citation

If you use KhasiBERT in your research or applications, please cite:

- Organization: MWirelabs
- Model Repository: https://huggingface.co/MWirelabs/khasibert
- Issues: please report issues through the Hugging Face model page

### Acknowledgments

- Training conducted on an NVIDIA RTX A6000 GPU
- Built using the Transformers library by Hugging Face
- Inspired by the success of foundational models for major languages
- Dedicated to the preservation and advancement of the Khasi language

### License

This model is released under the Creative Commons BY-NC 4.0 License. You are free to:

- Use it for non-commercial research and education
- Modify and distribute it for non-commercial purposes
- Create derivative works for research

Commercial Use: contact MWirelabs for commercial licensing agreements.

Attribution Required: please provide appropriate credit to MWirelabs when using this model.
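As a starting point for the supported tasks listed above, the sketch below fine-tunes KhasiBERT for binary text classification. It is a minimal illustration, not an official recipe: the dataset, label set, and single optimization step are hypothetical placeholders.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/khasibert")
# A fresh classification head is initialized on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/khasibert", num_labels=2)  # e.g. positive / negative

# Hypothetical labelled data; replace with a real Khasi dataset.
texts = ["<khasi sentence 1>", "<khasi sentence 2>"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One illustrative training step; in practice, loop over mini-batches
# for several epochs and evaluate on a held-out split.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```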
KhasiBERT: Bridging traditional Khasi language with modern AI technology.
kokborok-mt
nagamesebert
meitei-roberta
neodac-mini
mizo-roberta
Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language

[Model](https://huggingface.co/MWireLabs/mizo-roberta) · [Dataset](https://huggingface.co/datasets/MWireLabs/mizo-language-corpus-4M) · [License](https://www.apache.org/licenses/LICENSE-2.0)

Mizo-RoBERTa is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people, primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful KhasiBERT model.

### Highlights

- Architecture: RoBERTa-base (110M parameters)
- Training Scale: 5.94M sentences, 138.7M tokens
- Open Data: 4M sentences publicly available at mizo-language-corpus-4M
- Custom Tokenizer: trained specifically for Mizo (30K BPE vocabulary)
- Efficient: trained in roughly 4-6 hours on a single A40 GPU
- Open Source: model, tokenizer, and training code publicly available

### Architecture

| Component | Specification |
| --- | --- |
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |

### Training Configuration

| Setting | Value |
| --- | --- |
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |

### Training Data

Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens), with an average of 23.3 tokens per sentence. The corpus includes:

- News articles from major Mizo publications
- Literature and written content
- Social media text
- Government documents and official communications
- Web content from Mizo language websites

Public Dataset: 4 million sentences are openly available at MWireLabs/mizo-language-corpus-4M for research and development purposes.

### Preprocessing

- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions
- Custom sentence segmentation for Mizo punctuation

### Data Splits

- Training: 5,350,122 sentences (90%)
- Validation: 297,229 sentences (5%)
- Test: 297,230 sentences (5%)

### Evaluation

| Metric | Value |
| --- | --- |
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |

The model demonstrates strong understanding of Mizo linguistic patterns and context. While we have not performed a direct evaluation against multilingual models on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have shown 45-50× improvements in perplexity over multilingual baselines like mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to demonstrate comparable advantages for Mizo language tasks.
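As a quick sanity check on the evaluation table above: for a language model evaluated with cross-entropy loss, perplexity is the exponential of the mean loss, so the two reported numbers should agree.

```python
import math

# Perplexity = exp(cross-entropy loss). With the rounded loss reported
# above, exp(2.76) ≈ 15.80, consistent with the published perplexity of
# 15.85 (the small gap comes from rounding the loss to two decimals).
print(math.exp(2.76))  # ≈ 15.80
```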
### Downstream Tasks

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks (a minimal sentence-embedding sketch appears at the end of this card):

- Text Classification (sentiment analysis, topic classification, news categorization)
- Named Entity Recognition (NER for Mizo entities)
- Question Answering (extractive QA systems)
- Semantic Similarity (sentence/document similarity)
- Information Retrieval (semantic search in Mizo content)
- Language Understanding (natural language inference, textual entailment)

### Limitations

- Dialectal Coverage: the model may not comprehensively represent all Mizo dialects
- Domain Balance: formal written text may be overrepresented compared to conversational Mizo
- Pretraining Objective: trained only with Masked Language Modeling (MLM); may benefit from additional objectives
- Context Length: limited to 512 tokens; longer documents require chunking
- Low-resource Constraints: while large for Mizo, the training corpus is still smaller than high-resource language datasets

### Ethical Considerations

- Representation: the model reflects the content and potential biases present in the training corpus
- Intended Use: designed for research and applications that benefit Mizo language speakers
- Misuse Potential: should not be used for generating misleading information or harmful content
- Data Privacy: training data was collected from publicly available sources; no private information was used
- Cultural Sensitivity: users should be aware of cultural context when deploying for Mizo-speaking communities

### Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

### Related Resources

- Public Training Data: mizo-language-corpus-4M
- Sister Model: KhasiBERT, a RoBERTa model for the Khasi language
- Organization: MWireLabs on HuggingFace

### Contact

For questions, issues, or collaboration opportunities:

- Organization: MWireLabs
- Email: contact through HuggingFace
- Issues: report on the model's HuggingFace page

### License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

### Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.
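A hedged sketch for the semantic-similarity and retrieval tasks listed above: mean-pool the encoder's last hidden state into fixed-size sentence vectors. This is a common recipe rather than an official MWireLabs API, task-specific fine-tuning usually improves it, and the example sentences are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "MWireLabs/mizo-roberta"  # repo id as linked in this card
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo).eval()

def embed(sentences):
    """Mean-pool the last hidden state over non-padding tokens."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Placeholder inputs; substitute real Mizo sentences.
a, b = embed(["<mizo sentence 1>", "<mizo sentence 2>"])
print(torch.cosine_similarity(a, b, dim=0).item())
```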
ne-bert
kren-vision
Kren-M
kren-v1
Kren v1 is a generative language model for Khasi produced through a publicly documented encoder-to-decoder conversion: the weights of MWirelabs/khasibert (a RoBERTa-style encoder) were transferred into a GPT-2 style causal decoder and the architecture adapted accordingly, followed by progressive causal LM fine-tuning (a weight-transfer sketch appears after the examples below).

### Model Details

- Model Name: Kren v1 (formerly kren-v0.3)
- Language: Khasi (kha)
- Architecture: GPT-2 style causal language model
- Parameters: 110M
- Training Data: 1M lines (optimal training point identified through research)
- Base Model: MWirelabs/khasibert

### Capabilities

✅ Environmental and sustainability discussions
✅ Cultural and geographical questions about Meghalaya
✅ Abstract reasoning and concept exploration
✅ Multi-clause sophisticated responses
✅ Educational content generation

### Training

- Training Method: progressive fine-tuning with encoder-to-decoder conversion
- Optimal Training Point: 1M lines (validated through research)
- Training Loss: 2.960
- Perplexity: 19.3
- Architecture Conversion: RoBERTa encoder → GPT-2 decoder with systematic weight transfer

### Progressive Training Research

This model represents the optimal point identified through comprehensive progressive training research:

- v0.1 (300K lines): training loss 3.149, basic generation
- v0.2 (800K lines): training loss 2.995, dialogue capabilities
- v0.3/v1 (1M lines): training loss 2.960, abstract reasoning breakthrough
- v0.4 (2M lines): training loss 2.903, but quality regression

Key Finding: training beyond 1M lines causes capability degradation despite lower loss values.

### Example Generations

Environmental Discussion

- Input: "Kumno ban pyniaid ia ka phang ha ka pyrthei?" (How to protect the environment?)
- Output: generates substantive responses about environmental responsibility and conservation practices.

Cultural Questions

- Input: "Kiei ki wah ki shnong ba don ha Meghalaya?" (What villages are in Meghalaya?)
- Output: provides detailed responses about Meghalayan communities and geography.

Kren v1 may produce hallucinations and biased or culturally sensitive content, and should not be used for medical, legal, or high-stakes decisions without human oversight. Users are responsible for verifying outputs in critical contexts.
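The exact conversion script is not published in this card; the sketch below shows one plausible version of the recipe it describes, mapping RoBERTa encoder weights onto a GPT-2 decoder. Note that RoBERTa blocks are post-LayerNorm while GPT-2 blocks are pre-LayerNorm, so any such transfer is approximate and the subsequent causal fine-tuning stage is essential.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel, RobertaModel

# Load the encoder and build a decoder with matching dimensions.
enc = RobertaModel.from_pretrained("MWirelabs/khasibert")
cfg = enc.config
dec = GPT2LMHeadModel(GPT2Config(
    vocab_size=cfg.vocab_size,
    n_positions=cfg.max_position_embeddings,  # 514, as in the specs below
    n_embd=cfg.hidden_size,
    n_layer=cfg.num_hidden_layers,
    n_head=cfg.num_attention_heads,
    activation_function="gelu",               # match the encoder
))

with torch.no_grad():
    # Token and position embeddings transfer directly
    # (GPT-2 ties the LM head to the token embeddings).
    dec.transformer.wte.weight.copy_(enc.embeddings.word_embeddings.weight)
    dec.transformer.wpe.weight.copy_(enc.embeddings.position_embeddings.weight)

    for src, dst in zip(enc.encoder.layer, dec.transformer.h):
        attn = src.attention
        # GPT-2 stores Q, K, V fused in one Conv1D (weights transposed
        # relative to nn.Linear), so concatenate and transpose.
        qkv = torch.cat([attn.self.query.weight, attn.self.key.weight,
                         attn.self.value.weight], dim=0)
        dst.attn.c_attn.weight.copy_(qkv.t())
        dst.attn.c_attn.bias.copy_(torch.cat(
            [attn.self.query.bias, attn.self.key.bias, attn.self.value.bias]))
        dst.attn.c_proj.weight.copy_(attn.output.dense.weight.t())
        dst.attn.c_proj.bias.copy_(attn.output.dense.bias)
        # Feed-forward block.
        dst.mlp.c_fc.weight.copy_(src.intermediate.dense.weight.t())
        dst.mlp.c_fc.bias.copy_(src.intermediate.dense.bias)
        dst.mlp.c_proj.weight.copy_(src.output.dense.weight.t())
        dst.mlp.c_proj.bias.copy_(src.output.dense.bias)
        # RoBERTa is post-LN, GPT-2 is pre-LN: copying LayerNorms is
        # only a rough initialisation; fine-tuning corrects the rest.
        dst.ln_1.weight.copy_(attn.output.LayerNorm.weight)
        dst.ln_1.bias.copy_(attn.output.LayerNorm.bias)
        dst.ln_2.weight.copy_(src.output.LayerNorm.weight)
        dst.ln_2.bias.copy_(src.output.LayerNorm.bias)

dec.save_pretrained("kren-decoder-init")  # then run causal-LM fine-tuning
```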
### Limitations

- Context Window: 514 tokens limits very long-form generation
- Domain Coverage: optimized for general Khasi; specialized domains may need fine-tuning
- Cultural Nuances: may require additional culturally specific training for certain applications
- Scale: 110M parameters provide a good balance, but larger models might offer enhanced capabilities

### Risks

- Hallucinations: may generate plausible-sounding but factually incorrect information
- Bias: may reflect biases present in the training data
- Cultural Sensitivity: generated content should be reviewed by Khasi speakers for cultural appropriateness

### Intended Use

✅ Appropriate Uses:

- Educational content generation (with human review)
- Creative writing assistance
- Language learning tools
- Cultural preservation projects
- Research and experimentation

❌ Not Recommended:

- Medical advice or diagnosis
- Legal consultation
- Financial advice
- High-stakes decision making without human oversight
- Official translations without verification

### Technical Specifications

- Context Length: 514 tokens
- Vocabulary: 32,000 Khasi-specific tokens
- Precision: BF16/FP16 compatible
- Memory Requirements: ~450MB storage, 2GB+ RAM for inference
- Hardware: optimized for consumer GPUs (4GB+ VRAM recommended)

### Applications

- Educational Technology: Khasi language learning platforms
- Content Generation: cultural and educational material creation
- Language Preservation: AI-assisted documentation of Khasi expressions
- Research: foundation for further Khasi NLP development

### Research Contributions

- Training Efficiency: 6.0% loss improvement with optimal data usage
- Quality Validation: comprehensive evaluation across multiple domains
- Capability Range: environmental topics, cultural discussions, educational content
- Reliability: consistent generation quality across diverse prompts
- Process: encoder-to-decoder conversion methodology for Indian languages
- Methodology: validates the progressive training approach for low-resource languages
- Findings: demonstrates optimal training data volumes for indigenous language models
- Impact: establishes a foundation for Northeast Indian language AI development

### Contact and License

Developed by MWire Labs, Shillong, Meghalaya. For questions about Kren v1 or Khasi language AI research, please refer to the research paper or contact our research team.

This model is released under the CC BY 4.0 license, allowing broad use with attribution.

Note: This model represents the culmination of progressive training research and is recommended for production applications requiring Khasi text generation, with appropriate human oversight for safety-critical uses.
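A minimal generation sketch under the specifications above. The repo id `MWirelabs/kren-v1` is assumed from the model name in this listing; check the organization page for the published id.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "MWirelabs/kren-v1"  # assumed repo id (see note above)
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)

# Prompt taken from the example generations above.
prompt = "Kumno ban pyniaid ia ka phang ha ka pyrthei?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100, do_sample=True,
                        top_p=0.9, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```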
ne-ocr
khasi-english-semantic-search
NortheastNER
NortheastNER is a Named Entity Recognition (NER) model fine-tuned by MWirelabs to recognize entities specific to Northeast India. It is based on xlm-roberta-base and trained on a mix of gazetteers, curated news, and domain-specific data (tribes, villages, flora, fauna, festivals, tourist places).

### Entity Classes

- PLACES → states, districts, villages, regions (e.g., Shillong, Tura, Ri-Bhoi)
- TRIBES → indigenous tribes and sub-tribes (e.g., Khasi, Nyishi, Wancho)
- FESTIVALS → local festivals (e.g., Wangala, Losar, Nyokum Yullo)
- TOURIST → landmarks and tourist spots (e.g., Tawang Monastery, Umiam Lake)
- FLORA → plants and crops of the Himalayan / NE region
- FAUNA → animals, birds, and wildlife of the NE region

### Evaluation

| Entity | Precision | Recall | F1 |
| --- | --- | --- | --- |
| PLACES | 0.963 | 0.969 | 0.966 |
| TRIBES | 0.927 | 0.927 | 0.927 |
| FESTIVALS | (coming soon, fewer examples) | - | - |
| TOURIST | 0.167 | 0.125 | 0.143 |
| FLORA | 1.000 | 0.800 | 0.889 |
| FAUNA | 0.000 | 0.000 | 0.000 |
| Overall | 0.962 | 0.967 | 0.964 |

⚠️ Low scores for TOURIST / FAUNA are due to very few training examples; performance will improve with more labeled data.

Note: the current evaluation set does not include enough examples of NAMES, so that category is not reported in the table. Training data did include a small gazetteer of Khasi and regional names (~81 entries), but more labeled examples are needed for meaningful evaluation.

### Training Configuration

- Base model: `xlm-roberta-base`
- Max sequence length: 256
- Batch size: 16
- Learning rate: 3e-5
- Epochs: 3
- Weight decay: 0.01
- Optimizer: AdamW
- Framework: HuggingFace Transformers Trainer API

### 📦 Dataset Size

- Train set: ~20,000 sentences
- Dev set: ~5,000 sentences
- Sources: gazetteers (districts, tribes, flora/fauna, festivals, tourist sites, names), news articles, tourism/cultural descriptions

### Environment

- Transformers: 4.44.2
- Datasets: 2.20.0
- Evaluate: 0.4.2
- PyTorch: 2.3.0+cu121
- Python: 3.11
- Hardware: single NVIDIA A4500 GPU (20 GB VRAM), 62 GB RAM, 12 vCPU

### License

This model is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You are free to use, share, and adapt the model for non-commercial purposes with attribution.

### 🗂 Data Licenses

- Gazetteers of villages and tribes: compiled by MWirelabs (open reference use).
- Festivals, tourist sites, and names: curated by the MWirelabs team.

Please ensure attribution when reusing any derived dataset.

### Citation

If you use this model in your research, please cite:

### ⚠️ Limitations

- Low support for TOURIST and FAUNA classes (few examples).
- NAMES entity class trained but not evaluated due to lack of dev set coverage.
- Possible confusion between TRIBES and PLACES where names overlap (e.g., Garo).
- Model optimized for Northeast India texts; performance outside this domain may degrade.

### 🔮 Future Work

- Add more gold-labeled examples for underrepresented classes (Names, Fauna, Tourist).
- Explore active learning to identify low-confidence predictions for manual annotation.
- Expand coverage of festivals and indigenous knowledge domains.

This model is developed by MWirelabs, pioneering AI solutions for the rich cultural and linguistic diversity of Northeast India.

Contact: MWirelabs
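A minimal inference sketch for the entity classes above. The repo id `MWirelabs/NortheastNER` is assumed from this listing; `aggregation_strategy="simple"` merges word-piece tokens into whole entity spans.

```python
from transformers import pipeline

# Token-classification pipeline; repo id assumed from this listing.
ner = pipeline("token-classification",
               model="MWirelabs/NortheastNER",
               aggregation_strategy="simple")

text = "Umiam Lake near Shillong draws crowds during the Wangala festival."
for ent in ner(text):
    print(f"{ent['entity_group']:>10}  {ent['word']!r}  score={ent['score']:.2f}")
# Given the label set above, plausible groups here are TOURIST
# (Umiam Lake), PLACES (Shillong), and FESTIVALS (Wangala); actual
# predictions depend on the trained model.
```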