yashpwr

2 models • 1 total models in database

resume-ner-bert-v2

Resume NER BERT v2 - Advanced Resume Information Extraction

A state-of-the-art Named Entity Recognition (NER) model specifically designed for extracting structured information from resumes and CVs. This model achieves 90.87% F1 score and is trained on a comprehensive dataset of 22,542 resume samples from multiple sources.

This model has been extensively fine-tuned to extract key information from resumes including personal details, contact information, work experience, education, skills, and more. The model uses a BERT-based architecture with token classification capabilities, making it highly effective for resume parsing tasks.

Key Features:
- High Accuracy: 90.87% F1 score on comprehensive resume parsing
- Comprehensive Coverage: 25 entity types covering all major resume sections
- Large Training Dataset: 22,542 samples from multiple sources
- Production Ready: Tested and optimized for real-world applications
- Memory Efficient: CPU-optimized with reasonable model size (431MB)

| Metric | Score | Status |
|--------|-------|--------|
| F1 Score | 90.87% | ✅ Excellent |
| Precision | 91.44% | ✅ High |
| Recall | 90.81% | ✅ High |
| Training Loss | 0.2604 | ✅ Low |

The model recognizes 25 entity types using BIO (Beginning-Inside-Outside) tagging:

Core Personal Information:
- Name: Person's full name (e.g., "John Smith", "Sarah Johnson")
- Email Address: Email contact information (e.g., "[email protected]")
- Phone: Phone number (e.g., "(555) 123-4567", "+1-555-123-4567")
- Location: Geographic location (e.g., "San Francisco, CA", "New York")

Professional Information:
- Companies worked at: Previous employers and organizations (e.g., "Google", "Microsoft", "Amazon")
- Designation: Job titles and roles (e.g., "Senior Software Engineer", "Data Scientist", "Marketing Manager")
- Skills: Technical and soft skills (e.g., "Python", "JavaScript", "Machine Learning", "Leadership")
- Years of Experience: Work experience duration (e.g., "5 years", "10+ years")

Educational Information:
- Degree: Educational qualifications (e.g., "Bachelor's in Computer Science", "Master's in Data Science")
- College Name: Educational institutions (e.g., "Stanford University", "MIT", "Harvard")
- Graduation Year: Year of degree completion (e.g., "2020", "2018")

Additional:
- UNKNOWN: Unclassified entities that don't fit other categories

BIO Tags:
- `B-` (Beginning): Start of an entity
- `I-` (Inside): Continuation of an entity
- `O` (Outside): Non-entity tokens

Dataset Composition
- Total Samples: 22,542
- Sources:
  - Resume-Corpus Dataset: 349 samples (structured resume data)
  - DataTurks Resume NER: 420 samples (manually annotated resumes)
  - Custom Training Data: 21,773 samples (rule-based extraction from conversation data)
  - Mehyaar Skills Dataset: Integrated skills-focused data

Training Configuration
- Base Model: `yashpwr/resume-ner-bert`
- Learning Rate: 3e-5
- Batch Size: 4 (effective: 32 with gradient accumulation)
- Max Sequence Length: 128 tokens
- Epochs: 1.0 (early stopping applied)
- Device: CPU (optimized for memory efficiency)
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss

Training Process Improvements

The model was trained using a comprehensive pipeline that addressed several key challenges:
1. ✅ Tokenization Consistency: Used `bert-base-cased` throughout the pipeline to ensure consistent tokenization between training and inference
2. ✅ Entity Extraction Enhancement: Implemented proper character-to-token alignment using `return_offsets_mapping=True` for accurate text reconstruction
3. ✅ Label Mapping: Unified diverse label schemas into the DataTurks format (25 labels) for consistency
4. ✅ Performance Optimization: CPU-optimized training with memory efficiency and gradient accumulation
5. ✅ Dataset Integration: Successfully integrated 21,773 additional samples from conversation data using rule-based extraction

Primary Applications
- Recruitment Platforms: Automated resume parsing and candidate screening
- Resume Parsing Engines: Extract structured data from unstructured resumes
- Talent Analytics Tools: Analyze candidate skills and experience patterns
- Document Processing Pipelines: Integrate with HR systems and ATS
- ATS (Applicant Tracking Systems): Automated candidate data extraction and categorization

Industry Sectors
- Human Resources & Recruitment: Streamline hiring processes
- Technology & Software Development: Technical skill assessment
- Finance & Banking: Compliance and background verification
- Healthcare: Medical credential verification
- Education: Academic credential processing
- Government & Public Sector: Public service recruitment

Specific Use Cases
1. Automated Resume Screening: Filter candidates based on skills and experience
2. Data Migration: Convert legacy resume databases to structured formats
3. Compliance Checking: Verify educational and professional credentials
4. Skill Gap Analysis: Identify missing skills in candidate pools
5. Market Research: Analyze job market trends and skill demands

1. Language: Currently optimized for English resumes only
2. Format: Works best with text-based resumes (PDF conversion may be required)
3. Domain: Primarily trained on technology and business resumes
4. Length: Optimal for resumes under 512 tokens (truncation applied for longer texts)
5. Accuracy: 90.87% F1 score - may miss some entities in complex or non-standard formats
6. Context: Limited to resume-specific entities (not general NER)

System Requirements
- Python: 3.8 or higher
- PyTorch: 1.9 or higher
- Transformers: 4.20 or higher
- Memory: 2GB+ RAM recommended
- Storage: 431MB model size
- CPU: Multi-core recommended for inference

This model is licensed under the Apache 2.0 License. This means you can:
- Use the model for commercial purposes
- Modify and distribute the model
- Use it in proprietary software
- Distribute modified versions

We welcome contributions from the community! Here's how you can contribute:
1. Report Issues: Create an issue for bugs or feature requests
2. Submit Improvements: Fork the repository and submit pull requests
3. Share Datasets: Contribute additional training data
4. Documentation: Help improve documentation and examples
5. Testing: Test the model on different resume formats

- GitHub Issues: Create an issue on the model repository
- Hugging Face Discussions: Use the discussion tab on the model page
- Email: Contact through the Hugging Face profile
- Documentation: Check the model card and examples

- Base Model: `yashpwr/resume-ner-bert` for the foundation architecture
- Datasets: Resume-Corpus, DataTurks, and custom training data contributors
- Hugging Face: For the transformers library and platform
- Open Source Community: For contributions and feedback
- Research Community: For advancing NER and information extraction techniques

Version History
- v1: Initial release with basic resume parsing
- v2: Comprehensive model with 22,542 samples and 90.87% F1 score

Future Improvements
- Multi-language support
- Enhanced entity types
- Better handling of complex resume formats
- Integration with document processing pipelines
- Real-time inference optimization

Last Updated: August 7, 2025
Version: v2
Status: Production Ready ✅
License: Apache 2.0
Repository: https://huggingface.co/yashpwr/resume-ner-bert-v2
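
The card describes BIO tagging but does not show how per-token tags become entities. The following is a minimal sketch of that grouping step in plain Python, not the model's own post-processing. The function name `decode_bio` and the exact label strings (such as `B-Name` or `B-Companies worked at`) are assumptions made for illustration; the real label strings are defined in the model's configuration and may differ.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Hypothetical helper: assumes one tag per whitespace-level token.
    """
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close whatever is open and skip this token.
            flush()
    flush()
    return entities


# Illustrative input; the label strings are assumed, not taken from the model.
tokens = ["John", "Smith", "is", "a", "Senior", "Software", "Engineer", "at", "Google"]
tags = [
    "B-Name", "I-Name", "O", "O",
    "B-Designation", "I-Designation", "I-Designation",
    "O", "B-Companies worked at",
]

print(decode_bio(tokens, tags))
```

In this sketch a stray `I-` tag with no matching `B-` before it is dropped rather than promoted to a new entity; that is one possible design choice, and a more lenient decoder could treat it as the start of an entity instead. When running the model through the Transformers `pipeline("token-classification", ...)` API, passing `aggregation_strategy="simple"` performs a comparable grouping on subword tokens.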

license:apache-2.0

3,452

resume-ner-bert

license:apache-2.0