# opendatalab
## MinerU2.5-2509-1.2B
---
license: agpl-3.0
language:
- zh
- en
pipeline_tag: image-text-to-text
library_name: transformers
---
Other models in this collection:

- MinerU2.0-2505-0.9B
- MinerU2.5-Pro-2604-1.2B
- MinerU-HTML
- meta-rater-professionalism-rating
- meta-rater-readability-rating
- meta-rater-cleanliness-rating
- meta-rater-reasoning-rating
- ChartVerse-2B
- ChartVerse-Coder
- ChartVerse-8B
- ChartVerse-4B
- TRivia-3B
## meta-rater-1b-25raters
**Meta-rater Language Model – All 25 Quality Scores (1.3B Parameters, 30B Tokens)**

This repository contains the model described in the paper *Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models*. It is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Meta-rater framework with all 25 quality scores. It is the flagship model of the Meta-rater research, combining natural-language quality signals, data-importance scores, and model-based ratings through learned optimal weightings.

### Model Specifications

- Architecture: Transformer decoder-only
- Parameters: 1.345B (1,345,423,360 parameters)
- Training Tokens: 30B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Meta-rater with all 25 quality scores
- Optimization: Optimal weightings learned through 256 proxy models

### Architecture Details

- Hidden Dimension: 2,048
- Number of Layers: 24
- Attention Heads: 16
- Key-Value Heads: 16
- MLP Ratio: 8/3
- Position Encoding: RoPE (base = 10,000)

### Data Selection Framework

The training data was selected using the complete Meta-rater framework, integrating:

**Natural Language Quality Signals (11)**

- RedPajama rule-based measures (word count, entropy, unique words, etc.)
- Text naturalness and linguistic-integrity indicators

**Data Importance Scores (3)**

- DSIR similarity to Books, Wikipedia, and AutoMathText
- Domain-specific quality assessment

**Model-based Ratings (11)**

- PRRC (4): Professionalism, Readability, Reasoning, Cleanliness
- QuRating (4): Required Expertise, Writing Style, Facts & Trivia, Educational Value
- FineWeb-Edu (1): Educational-value assessment
- WanjuanCC (2): Advertisement detection, fluency evaluation

### Optimal Weighting

The top contributing quality scores with their learned weights (a sketch of how such weighted scores can drive selection follows this list):

1. Educational Value (5.64%)
2. doc_frac_no_alph_words (4.93%)
3. FineWeb-Edu (4.93%)
4. lines_uppercase_letter_fraction (4.88%)
5. Facts and Trivia (4.77%)
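As a concrete illustration of the weighted combination above, the sketch below normalizes per-document quality scores, mixes them with a learned weight vector, and keeps the top-rated documents. All scores, weights, and sizes are invented for the example; this is a minimal sketch of the idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_raters = 10_000, 25

scores = rng.random((n_docs, n_raters))        # per-document quality scores (hypothetical)
weights = rng.dirichlet(np.ones(n_raters))     # learned weights over raters (hypothetical)

# Standardize each rater's scores so no single scale dominates the mixture.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
combined = z @ weights                         # one composite rating per document

budget = 3_000                                 # stand-in for "enough documents for 30B tokens"
selected = np.argsort(combined)[::-1][:budget] # top-k documents by composite rating
```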
### Training Configuration

- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~14 hours
- Meta-rater Construction: 256 proxy models for optimal weight learning (see the sketch below)
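The proxy-model step can be pictured as a regression from candidate weightings to proxy validation loss: train small models on differently weighted selections, fit a predictor, and pick the weighting with the lowest predicted loss. The sketch below is a loose, hypothetical rendering of that idea; the regressor choice, the placeholder loss function, and all numbers are assumptions, not the paper's recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_proxies, n_raters = 256, 25

# One candidate weighting over the 25 quality scores per proxy model.
candidate_weights = rng.dirichlet(np.ones(n_raters), size=n_proxies)

# Placeholder for "train a proxy model on data selected with weights w and
# return its validation loss"; here a synthetic linear response plus noise.
hidden_response = rng.random(n_raters)
def proxy_validation_loss(w: np.ndarray) -> float:
    return float(2.0 - w @ hidden_response + rng.normal(scale=0.01))

losses = np.array([proxy_validation_loss(w) for w in candidate_weights])

# Fit weights -> loss, then search a larger pool for the lowest predicted loss.
regressor = RandomForestRegressor(n_estimators=200, random_state=0)
regressor.fit(candidate_weights, losses)

pool = rng.dirichlet(np.ones(n_raters), size=50_000)
best_weights = pool[np.argmin(regressor.predict(pool))]
```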
### Evaluation Results

- General Knowledge: 58.90% (+6.11% vs Random)
  - ARC-Easy: 58.25%
  - ARC-Challenge: 29.86%
  - SciQ: 88.60%
- Commonsense Reasoning: 45.41% (+1.47% vs Random)
  - HellaSwag: 39.81%
  - SIQA: 42.68%
  - WinoGrande: 53.75%
- Reading Comprehension: 31.55% (+1.53% vs Random)
  - RACE: 31.10%
  - OpenbookQA: 32.00%

### Highlights

- State-of-the-Art: Best performance among all baseline methods
- Convergence Speed: 2x faster convergence compared to random selection
- Token Efficiency: Matches Random-60B performance using only 30B tokens
- Holistic Quality: Balanced improvements across all task categories
- Multi-dimensional: Successfully integrates 25 complementary quality metrics

### Use Cases

This model is well-suited for:

- General-purpose language modeling with high quality standards
- Research requiring state-of-the-art baseline performance
- Educational applications across multiple domains
- Content generation with balanced quality across dimensions
- Multi-domain tasks requiring diverse capabilities
- Production systems needing reliable, high-quality text generation

### Strengths

- Holistic Quality: Balanced performance across all evaluation dimensions
- Training Efficiency: Superior token efficiency compared to random selection
- Robust Performance: Consistent improvements across diverse task types
- Multi-dimensional: Benefits from comprehensive quality assessment
- Research Validated: Empirically optimized through systematic methodology
- Scalable: The framework scales to larger models (validated up to 7.2B)

### Key Findings

This model demonstrates several key findings:

- Multi-dimensional beats single-dimension: 47.01% overall vs 46.16% for the best single rater
- Quality-integration superiority: Outperforms simple combination methods
- Efficiency gains: Achieves 2x faster convergence
- Scalability: Benefits persist at larger model scales
- Comprehensive approach: The 25 quality scores provide complementary information

### Comparisons

- vs Random Baseline: +3.23% overall improvement
- vs Best Single Rater (QuRating Educational Value): +0.85% improvement
- vs Simple Mean Combination: +2.36% improvement over uniform weighting
- vs Previous SOTA: Establishes a new state of the art for data selection methods

### Limitations

- Higher computational cost for quality-score rating during data selection
- Optimized for SlimPajama-style web-crawled data
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Requires proxy-model training for weight optimization

### Methodological Contributions

The Meta-rater framework introduces:

- Systematic quality integration: Moving beyond single-dimensional selection
- Learned optimal weightings: Data-driven rather than heuristic combinations
- Proxy-model methodology: Efficient exploration of the weight space
- Multi-dimensional assessment: Comprehensive quality evaluation (PRRC)
- Scalable paradigm: A framework applicable to diverse quality metrics

### Citation

If you use this model in your research, please cite the Meta-rater paper.

### Related Resources

- PRRC Rating Models: Individual ModernBERT models for quality assessment
- Annotated SlimPajama-627B: Fully labeled dataset with all 25 quality scores
- Meta-rater Scripts: Implementation and training code
- Proxy Models: Smaller models used for weight optimization

### License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data-licensing requirements.

### Contact

For questions or issues, please contact the authors or open an issue in the repository.
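### Quick Start

A minimal generation sketch, assuming the weights are published under `opendatalab/meta-rater-1b-25raters` in a transformers-compatible format; adjust the repo id to the actual published weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "opendatalab/meta-rater-1b-25raters"  # assumption based on this card

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

# This is a base LM without instruction tuning, so plain continuation works best.
inputs = tokenizer("Data selection for pre-training aims to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```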
## Meta Rater 1b Reasoning
**PRRC-Reasoning Language Model (1.3B Parameters, 30B Tokens)**

This repository contains the model described in the paper *Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models*. It is a 1.3B-parameter, decoder-only transformer language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Reasoning dimension of the PRRC framework. The training data was curated by selecting text with high reasoning complexity, focusing on content that requires multi-step logical analysis and critical thinking.

### Model Specifications

- Architecture: Transformer decoder-only
- Parameters: 1.345B (1,345,423,360 parameters)
- Training Tokens: 30B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Top-k selection based on Reasoning scores
- Rating Model: ModernBERT-base fine-tuned for Reasoning assessment

### Architecture Details

- Hidden Dimension: 2,048
- Number of Layers: 24
- Attention Heads: 16
- Key-Value Heads: 16
- MLP Ratio: 8/3
- Position Encoding: RoPE (base = 10,000)

### Data Selection

The training data was selected using the Reasoning rating model, which evaluates:

- Logical Structure: Multi-step reasoning and argument chains
- Analytical Depth: Complex analysis and critical evaluation
- Causal Relationships: Identification and exploration of cause-effect patterns
- Problem Solving: Strategic thinking and solution development
- Evidence Integration: Synthesis of multiple information sources

Selected texts typically include:

- Analytical essays and research papers
- Problem-solving discussions and case studies
- Philosophical and scientific arguments
- Strategic planning documents
- Complex technical analyses
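To make the selection pipeline concrete, here is a hypothetical sketch of scoring documents with a ModernBERT-style rater and keeping the top-k. The checkpoint id below is a placeholder (the base model, not the fine-tuned Reasoning rater), and a single scalar regression head (`num_labels=1`) is an assumption.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: swap in the released fine-tuned Reasoning rater.
rater_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(rater_id)
rater = AutoModelForSequenceClassification.from_pretrained(rater_id, num_labels=1)
rater.eval()

docs = [
    "If every premise holds, the conclusion follows because ...",
    "buy cheap watches now!!!",
]

with torch.no_grad():
    batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
    scores = rater(**batch).logits.squeeze(-1)  # one Reasoning score per document

k = 1  # keep the k highest-scoring documents (top-k selection)
selected = [docs[i] for i in torch.topk(scores, k).indices]
```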
### Training Configuration

- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~14 hours

### Evaluation Results

- General Knowledge: 55.57% (+2.78% vs Random)
  - ARC-Easy: 55.35%
  - ARC-Challenge: 27.05%
  - SciQ: 84.30%
- Commonsense Reasoning: 44.86% (+0.92% vs Random)
  - HellaSwag: 41.34%
  - SIQA: 40.36%
  - WinoGrande: 52.87%
- Reading Comprehension: 30.48% (+0.46% vs Random)
  - RACE: 30.95%
  - OpenbookQA: 30.00%

### Performance Highlights

- Reasoning Enhancement: Improved logical thinking and analysis capabilities
- Problem Solving: Enhanced ability to work through complex problems
- Knowledge Application: Better at applying knowledge to new situations
- Analytical Skills: Stronger performance on tasks requiring multi-step reasoning

### Use Cases

This model is particularly well-suited for:

- Analytical writing and problem-solving tasks
- Educational content focused on critical thinking
- Research assistance and hypothesis development
- Strategic planning and decision-making support
- Complex reasoning tasks and logic puzzles
- Academic writing requiring argumentation
- Case study analysis and evaluation

### Strengths

- Enhanced logical reasoning and analytical capabilities
- Improved problem-solving approach and methodology
- Better at handling complex, multi-step arguments
- Strong performance on knowledge-intensive reasoning tasks
- Effective at synthesizing information from multiple sources
- Good at identifying causal relationships and patterns

### Limitations

- May generate overly complex reasoning for simple questions
- Could prioritize analytical depth over accessibility
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- May struggle with creative or intuitive tasks

### Reasoning Capabilities

This model demonstrates enhanced abilities in:

- Deductive Reasoning: Drawing logical conclusions from premises
- Inductive Reasoning: Identifying patterns and generalizations
- Causal Analysis: Understanding cause-and-effect relationships
- Problem Decomposition: Breaking complex problems into manageable parts
- Evidence Evaluation: Assessing the strength and relevance of information
- Hypothesis Formation: Developing testable explanations

### Comparisons

- vs Random Baseline: +1.50% overall, with consistent improvements across categories
- vs Other PRRC Dimensions: Competitive performance, with a focus on analytical tasks
- vs Meta-rater All (25): Shows specialized improvement in reasoning-heavy applications

### Citation

If you use this model in your research, please cite the Meta-rater paper.

### License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data-licensing requirements.

### Contact

For questions or issues, please contact the authors or open an issue in the repository.
## meta-rater-3b-25raters
**Meta-rater Language Model (3.3B Parameters, 100B Tokens)**

This repository contains the model described in the paper *Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models*. It is a 3.3B-parameter, decoder-only transformer language model trained from scratch on 100B tokens selected from the SlimPajama dataset using the Meta-rater framework with all 25 quality scores. This model demonstrates that the benefits of Meta-rater's data selection scale to larger model sizes and training datasets.

### Model Specifications

- Architecture: Transformer decoder-only
- Parameters: 3.3B (3,335,989,760 parameters)
- Training Tokens: 100B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Meta-rater with all 25 quality scores
- Optimization: Learned optimal weightings from the 1.3B experiments

### Architecture Details

- Hidden Dimension: 2,560
- Number of Layers: 40
- Attention Heads: 20
- Key-Value Heads: 20
- MLP Ratio: 8/3
- Position Encoding: RoPE (base = 10,000)
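The hyperparameters listed above map onto a LLaMA-style config as sketched below. This is an illustration, not the released configuration file; in particular, `intermediate_size` is not stated on the card, and 6,912 (roughly 8/3 × 2,560, rounded) is an inference that reproduces the quoted 3,335,989,760 parameters under the standard LLaMA layout with untied embeddings.

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_560,
    num_hidden_layers=40,
    num_attention_heads=20,
    num_key_value_heads=20,
    intermediate_size=6_912,       # assumption, see note above
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,
)
```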
### Data Selection

The training data was selected using the same Meta-rater framework as the 1.3B models, integrating 25 quality scores in total:

- Natural Language Quality Signals (11): RedPajama rule-based measures
- Data Importance Scores (3): DSIR similarity to Books, Wikipedia, and AutoMathText
- Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC

The same learned weights from the 1.3B proxy-model experiments were applied, ensuring consistent data selection criteria across scales.

### Training Configuration

- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~129 hours

### Evaluation Results

- General Knowledge: 67.51% (+3.29% vs Random 3.3B)
  - ARC-Easy: 72.10%
  - ARC-Challenge: 37.54%
  - SciQ: 92.90%
- Commonsense Reasoning: 54.35% (+0.80% vs Random 3.3B)
  - HellaSwag: 58.99%
  - SIQA: 43.91%
  - WinoGrande: 60.14%
- Reading Comprehension: 36.06% (+0.78% vs Random 3.3B)
  - RACE: 35.12%
  - OpenbookQA: 37.00%
- Knowledge-Intensive Tasks
  - MMLU: 26.21% (+0.73% vs Random 3.3B)
  - NaturalQuestions: 6.87% (+0.59% vs Random 3.3B)

### Scaling Analysis

Benefits persist at scale. Compared to the 1.3B Meta-rater results:

- Consistent Improvements: Similar relative gains are maintained at the larger scale
- Absolute Performance: Substantial improvements in all categories
- Efficiency: Data selection remains valuable even with more parameters

Cross-scale comparison:

- 1.3B Meta-rater: 47.01% overall
- 3.3B Meta-rater: 54.71% overall (+7.70% from scaling)
- Scale Efficiency: ~2.5x parameters yield significant performance gains

### Use Cases

This model is well-suited for:

- Production applications requiring high-quality text generation
- Research needing stronger baseline performance
- Educational platforms with diverse content requirements
- Content creation at scale with quality assurance
- Multi-domain applications benefiting from balanced capabilities
- Scaling studies for data selection methodologies

### Research Value

- Scalability Validation: Confirms that Meta-rater benefits persist at larger scales
- Improved Baselines: Establishes stronger performance benchmarks
- Efficiency Demonstration: Better results with the same computational budget
- Quality Consistency: Maintains data selection advantages across scales

This model provides crucial evidence that:

- Scaling Laws: Data-quality benefits do not diminish with model size
- Efficiency: Quality data selection remains valuable at any scale
- Methodology Robustness: The Meta-rater framework generalizes across sizes
- Cost-Effectiveness: Better performance without additional training costs

### Strengths

- Enhanced performance across all evaluation categories
- Scalable data selection methodology
- Improved knowledge retention and reasoning
- Consistent quality improvements over random selection
- Validated framework transferability

### Limitations

- Higher computational requirements for training
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Requires quality-score preprocessing
- Same data selection overhead as smaller models

### Comparisons

vs Random 3.3B baseline:

- Overall: +1.73% improvement (54.71% vs 52.98%)
- General Knowledge: +3.29% improvement (strongest category)
- All Categories: Consistent improvements across all task types

vs 1.3B Meta-rater:

- Scale Benefits: +7.70% improvement from increased parameters
- Framework Consistency: The same data selection principles apply effectively
- Efficiency: Larger models can better utilize high-quality data

### Citation

If you use this model in your research, please cite the Meta-rater paper.

### Related Resources

- 1.3B Meta-rater Models: Smaller-scale versions with detailed analysis
- PRRC Rating Models: Quality assessment models used for data selection
- Annotated SlimPajama: Complete dataset with quality scores
- Random Baselines: Corresponding baseline models for comparison
- Project Page: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
- GitHub: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

### License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data-licensing requirements.

### Contact

For questions or issues, please contact the authors or open an issue in the repository.
## meta-rater-7b-random
**Random Baseline Language Model (7.2B Parameters, 150B Tokens)**

This repository contains the 7.2B-parameter random baseline language model used in the paper *Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models*. It is a 7.2B-parameter, decoder-only transformer language model trained from scratch on 150B tokens randomly sampled from the SlimPajama dataset. It is the largest baseline model in the Meta-rater research, demonstrating performance at scale under random data selection.

### Model Specifications

- Architecture: Transformer decoder-only
- Parameters: 7.2B (7,241,732,096 parameters)
- Training Tokens: 150B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Training Data: Randomly sampled from the SlimPajama dataset
- Domain Distribution: Fixed proportions across all domains (CommonCrawl: 52.2%, C4: 26.7%, GitHub: 5.2%, Books: 4.2%, ArXiv: 4.6%, Wikipedia: 3.8%, StackExchange: 3.3%)

### Architecture Details

- Hidden Dimension: 4,096
- Number of Layers: 32
- Attention Heads: 32
- Key-Value Heads: 8 (Grouped Query Attention)
- MLP Ratio: 8/3
- Position Encoding: RoPE (base = 10,000)
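Unlike the smaller models on these cards, this one uses Grouped Query Attention: 32 query heads share 8 key/value heads, so every 4 query heads attend through one KV head and the KV cache shrinks 4x. A LLaMA-style config sketch of the listed hyperparameters follows; `intermediate_size` is not stated on the card, and 14,336 is an inference that reproduces the quoted 7,241,732,096 parameters under the standard LLaMA layout with untied embeddings, not a documented value.

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=4_096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,         # GQA: 32 / 8 = 4 query heads per KV head
    intermediate_size=14_336,      # assumption, see note above
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,
)
```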
### Training Configuration

- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~284 hours

### Evaluation Results

- General Knowledge: 65.10%
  - ARC-Easy: 67.77%
  - ARC-Challenge: 36.43%
  - SciQ: 91.10%
- Commonsense Reasoning: 52.01%
  - HellaSwag: 53.02%
  - SIQA: 42.73%
  - WinoGrande: 60.29%
- Reading Comprehension: 35.87%
  - RACE: 34.73%
  - OpenbookQA: 37.00%
- Knowledge-Intensive Tasks
  - MMLU: 26.21%
  - NaturalQuestions: 10.89%

### Performance Progression Across Scales

- 1.3B Random: 43.78% overall
- 3.3B Random: 52.98% overall (+9.20%)
- 7.2B Random: 52.12% overall (-0.86%)

Scale observations:

- Plateau Effect: Performance plateaus or slightly decreases at the 7.2B scale
- Knowledge Tasks: NaturalQuestions shows continued improvement with scale
- Efficiency: Diminishing returns from parameter scaling with random data
- Data Quality Impact: Highlights the importance of curation at larger scales

### Research Insights

This model provides crucial insights for the Meta-rater research.

Scaling-law implications:

- Data Quality Importance: Random selection shows diminishing returns at scale
- Ceiling Effects: Parameter scaling alone is insufficient for continued improvement
- Meta-rater Value: Quality data selection becomes more valuable at larger scales

Key research findings:

- Plateau Phenomenon: Random data selection hits a performance plateau
- Efficiency Questions: Massive parameter increases yield minimal gains
- Quality Selection Necessity: Demonstrates the need for systematic data curation

### Use Cases

This model can be used for:

- Scaling research and understanding parameter efficiency
- Baseline establishment for large-scale language modeling
- Research on diminishing returns in parameter scaling
- Data-quality impact studies at scale
- Computational efficiency analysis

### Strengths

- Large parameter capacity for complex pattern learning
- Extensive training on diverse content
- Strong knowledge-retention capabilities
- Valuable baseline for scaling studies

### Limitations

- Performance plateau despite increased parameters
- Trained on randomly selected data without quality filtering
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- High computational requirements with modest performance gains
- Demonstrates the inefficiency of random data selection at scale

### Diminishing Returns Pattern

- 3.3B to 7.2B: ~2.2x parameters, -0.86% performance
- Training Cost: 284 hours vs 129 hours (+120% training time)
- Efficiency: Negative return on the additional computational investment

### Data Quality Imperative

This model demonstrates why data curation becomes crucial at scale:

- Random selection fails to utilize the increased model capacity
- Quality data selection (Meta-rater) shows continued benefits
- Parameter scaling alone is insufficient for performance gains

The corresponding 7.2B Meta-rater model achieves:

- Overall Performance: 55.24% vs 52.12% (+3.12% improvement)
- Efficiency: Same training cost, significantly better results
- Scalability: Meta-rater benefits increase at larger scales

### Citation

If you use this model in your research, please cite the Meta-rater paper.

### License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data-licensing requirements.

### Contact

For questions or issues, please contact the authors or open an issue in the repository.

⭐ Star us on GitHub if you find Meta-rater useful! ⭐
## meta-rater-1b-professionalism
## meta-rater-1b-random
## PDF-Extract-Kit-1.0
This is the model repository corresponding to version 1.0 of PDF-Extract-Kit. For usage, please refer to the PDF-Extract-Kit documentation.

### Git Download

Alternatively, you can use Git to clone the model repository from ModelScope.
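As a rough alternative to a direct git clone, the ModelScope Python SDK can fetch the repository; the repository id `opendatalab/PDF-Extract-Kit-1.0` is an assumption based on this card's title, so adjust it if the ModelScope namespace differs.

```python
from modelscope import snapshot_download

# Download the full model repository to the local ModelScope cache.
model_dir = snapshot_download("opendatalab/PDF-Extract-Kit-1.0")  # assumed repo id
print(model_dir)  # local directory containing the downloaded model files
```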