ronantakizawa
phi-4-reasoning-awq
This is a 4-bit AWQ quantized version of microsoft/Phi-4-reasoning.

- Base Model: Phi-4-reasoning (14B parameters)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Precision: 4-bit
- Group Size: 128
- Original Size: ~28 GB (FP16)
- Quantized Size: ~7 GB
- Memory Reduction: ~75%

Phi-4-reasoning is Microsoft's specialized reasoning model that excels at:

- ✅ Step-by-step mathematical reasoning
- ✅ Logical deduction and inference
- ✅ Code understanding and debugging
- ✅ Complex problem solving
- ✅ Chain-of-thought reasoning

Released in January 2025, this model builds on the Phi-4 architecture with enhanced reasoning capabilities.

Key Findings:

- ⚡ 6.9x faster inference with AWQ quantization
- ✅ Maintains quality with minimal perplexity increase
- 🎯 Best performance on code reasoning (56.7% accuracy)
- 💾 ~75% memory reduction (28 GB → 7 GB)

Requirements:

- GPU Memory: ~8-10 GB VRAM (runs on RTX 3090, RTX 4090, A100, etc.)
- CUDA: required for AWQ
- Python: 3.8+

Performance summary:

- Memory Usage: ~75% reduction vs FP16
- Inference Speed: 6.9x faster than the FP16 baseline
- Quality: 111.7% score retention - matches or exceeds baseline quality
- Use Cases: well suited to reasoning tasks on consumer GPUs

Tested on 11 reasoning tasks across 4 categories:

- Mathematical Reasoning (3 tests): area/perimeter, percentages, word problems
- Logical Reasoning (3 tests): syllogisms, logical fallacies, deductive reasoning
- Code Reasoning (3 tests): bug detection, code comprehension, efficiency analysis
- Chain of Thought (2 tests): multi-step problem solving, angle calculations

Evaluation metrics:

- Accuracy: keyword-based scoring against expected outputs
- Latency: time per inference (deterministic generation)
- Score Retention: (Quantized Score / Baseline Score) × 100%

Limitations:

- Requires a CUDA GPU (no CPU support for AWQ)
- Some complex chain-of-thought prompts may need optimization
- Calibration-dependent (quality depends on the calibration data)
- Performance varies across specific reasoning tasks (see benchmarks)

Please refer to the original model card for the base model citation.

Acknowledgments:

- Microsoft for the Phi-4-reasoning model
- MIT HAN Lab for the AWQ quantization method
- Casper Hansen and the AutoAWQ team

Repository: github.com/ronantakizawa/phi4-reasoning-awq
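The 4-bit, group-size-128 scheme above stores one scale per 128-weight group. A toy numpy sketch of group-wise 4-bit quantization (symmetric round-to-nearest only; real AWQ additionally rescales salient channels using activation statistics, which is omitted here):

```python
import numpy as np

def quantize_groupwise_4bit(w: np.ndarray, group_size: int = 128):
    """Quantize a flat weight vector to 4-bit integers, one scale per group."""
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    # Symmetric 4-bit range: integer codes in [-8, 7]
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 weights from codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_groupwise_4bit(w)
w_hat = dequantize(q, s)
max_err = np.max(np.abs(w - w_hat))  # bounded by half a quantization step
```

Each group of 128 weights shares one scale, so the storage cost is 4 bits per weight plus a small per-group overhead, which is where the ~75% reduction over FP16 comes from.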
molmo-72b-awq
This is a 4-bit AWQ quantized version of allenai/Molmo-72B-0924 using LLM Compressor.

- ✅ Qwen2-72B text decoder quantized (4-bit AWQ) - 72% size reduction
- ✅ OpenAI CLIP vision encoder preserved (FP16) - maintains visual quality
- ✅ State-of-the-art VLM performance - among the best open VLMs
- ✅ Smart quantization - only LLM layers quantized, vision parts untouched
- ✅ vLLM compatible - fast inference with vLLM
- ✅ Trained on PixMo - 1M curated image-text pairs

Model details:

- Base Model: allenai/Molmo-72B-0924 (73B parameters)
- Architecture: Molmo (Qwen2-72B decoder + OpenAI CLIP vision encoder)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: Flickr30k (512 samples)

| Metric | Value |
|--------|-------|
| Original (FP16) | ~145.0 GB |
| Quantized (W4A16) | ~37.78 GB |
| Reduction | ~73.9% |
| Memory Saved | ~107.2 GB |

Quantized (4-bit):

- Qwen2-72B decoder layers (text/language model)
- Text processing linear layers in the decoder

Preserved (FP16):

- OpenAI CLIP vision encoder (maintains visual understanding quality)
- Vision-text connectors
- Embeddings
- Language model head

This selective quantization keeps vision understanding quality nearly identical to the original model while significantly reducing size.
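The selective scheme above (quantize decoder linears, keep vision and connector modules in FP16) amounts to a predicate over module names. A sketch with illustrative name patterns, not the exact Molmo module paths:

```python
# Name fragments to keep in FP16 (illustrative patterns, not exact Molmo paths)
PRESERVE_PATTERNS = ("vision", "connector", "embed", "lm_head")

def should_quantize(module_name: str) -> bool:
    """Quantize only decoder linear layers; keep vision, connector,
    embedding, and head modules at full precision."""
    name = module_name.lower()
    return not any(p in name for p in PRESERVE_PATTERNS)

decoder = should_quantize("model.layers.10.mlp.down_proj")   # True: decoder linear
vision = should_quantize("vision_tower.blocks.3.attn.qkv")   # False: vision encoder
head = should_quantize("lm_head")                            # False: output head
```

Quantization tools typically express the same idea as an `ignore`/exclude list in the recipe rather than a predicate, but the effect is identical.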
Molmo-72B is one of the most powerful open vision-language models:

- Text Decoder: Qwen2-72B (state-of-the-art 72B LLM)
- Vision Encoder: OpenAI CLIP (proven vision backbone)
- Training Data: PixMo - 1 million highly curated image-text pairs
- Performance: competitive with GPT-4V on many benchmarks

Quantization details:

- Method: AWQ (Activation-aware Weight Quantization)
- Independent Pipeline: used with BasicPipeline for layer-by-layer quantization
- Calibration: 512 Flickr30k image-text pairs
- Max Sequence Length: 2048 tokens
- Why AWQ: activation-aware quantization preserves the most important weights

Limitations:

- May show slight quality degradation in complex text generation compared to FP16
- Vision encoder is NOT quantized (intentional, for quality)
- Requires vLLM or transformers with AWQ support

Transparent images: ensure images are in RGB format.

- Base model by Allen Institute for AI
- Quantization using LLM Compressor
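For the RGB requirement above, a minimal Pillow sketch (assumes Pillow is installed; `to_rgb` is an illustrative helper, not part of the model's processor):

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    """Convert any image (RGBA, palette, grayscale) to RGB before preprocessing."""
    return image.convert("RGB") if image.mode != "RGB" else image

# Example: an in-memory RGBA image becomes RGB
rgba = Image.new("RGBA", (8, 8), (255, 0, 0, 128))
rgb = to_rgb(rgba)
```

Note that a plain `convert("RGB")` drops the alpha channel against a black background; see the Molmo-7B-D card below for compositing onto white instead.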
molmo-7b-d-awq
This is a 4-bit AWQ quantized version of allenai/Molmo-7B-D-0924 using LLM Compressor.

- ✅ Qwen2 text decoder quantized (4-bit AWQ) - 63% size reduction
- ✅ OpenAI CLIP vision encoder preserved (FP16) - maintains image quality
- ✅ Performance between GPT-4V and GPT-4o on academic benchmarks
- ✅ Smart quantization - only LLM layers quantized, vision parts untouched
- ✅ vLLM compatible - fast inference with vLLM
- ✅ Powers the molmo.allenai.org demo

Model details:

- Base Model: allenai/Molmo-7B-D-0924 (7B parameters)
- Architecture: Molmo (Qwen2-7B decoder + OpenAI CLIP vision encoder)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: Flickr30k (128 samples)

| Metric | Value |
|--------|-------|
| Original (FP16) | ~14.0 GB |
| Quantized (W4A16) | ~5.18 GB |
| Reduction | ~63.0% |
| Memory Saved | ~8.8 GB |

Quantized (4-bit):

- Qwen2DecoderLayer (Qwen2-7B text/language model)
- Text processing linear layers in the decoder

Preserved (FP16):

- OpenAI CLIP vision encoder (maintains image understanding quality)
- Vision-text connectors
- Embeddings
- Language model head

This selective quantization keeps vision understanding quality nearly identical to the original model while significantly reducing size.
Academic benchmarks:

- Average Score: 77.3 across 11 benchmarks
- Human Preference Elo: 1056
- Position: between GPT-4V (71.1) and GPT-4o (78.5)

Evaluated on: AI2D, ChartQA, VQA v2.0, DocQA, InfographicVQA, TextVQA, RealWorldQA, MMMU, MathVista, CountBenchQA, Flickr Count.

Performance summary:

- Memory Usage: ~5-7 GB GPU VRAM (vs ~14 GB for FP16)
- Inference Speed: similar to FP16 on compatible hardware
- Quality: vision understanding ~100% preserved, text generation ~95-98% preserved
- Recommended GPU: 16GB+ VRAM for optimal performance

Molmo-7B-D is part of the Molmo family of open vision-language models developed by the Allen Institute for AI:

- Training Data: PixMo dataset (1 million highly curated image-text pairs)
- Text Decoder: Qwen2-7B (state-of-the-art open LLM)
- Vision Encoder: OpenAI CLIP (proven vision backbone)
- Performance: between GPT-4V and GPT-4o
- Use Case: powers the official demo at molmo.allenai.org

Quantization details:

- Method: AWQ (Activation-aware Weight Quantization)
- Independent Pipeline: used with BasicPipeline for layer-by-layer quantization
- Calibration: 128 Flickr30k image-text pairs
- Max Sequence Length: 2048 tokens
- Why AWQ: activation-aware quantization preserves the most important weights

Limitations:

- May show slight quality degradation in complex text generation compared to FP16
- Vision encoder is NOT quantized (intentional, for quality)
- Requires vLLM or transformers with AWQ support
- Use vLLM version <=0.7.2 until the preprocessing bug is fixed

Transparent images: if using transparent images, add a white or dark background first for best results.

- Base model by Allen Institute for AI
- Quantization using LLM Compressor
- Meta tensor fix by @ronantakizawa
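The transparent-image note above can be handled by compositing onto an opaque background before preprocessing. A Pillow sketch (`flatten_alpha` is an illustrative helper, not part of the model's processor):

```python
from PIL import Image

def flatten_alpha(image: Image.Image, background=(255, 255, 255)) -> Image.Image:
    """Composite a transparent image onto a solid (white by default) background."""
    if image.mode in ("RGBA", "LA") or (image.mode == "P" and "transparency" in image.info):
        rgba = image.convert("RGBA")
        canvas = Image.new("RGBA", rgba.size, background + (255,))
        return Image.alpha_composite(canvas, rgba).convert("RGB")
    return image.convert("RGB")

# Fully transparent pixels become the background color
img = Image.new("RGBA", (4, 4), (0, 0, 255, 0))
flat = flatten_alpha(img)
```

Pass `background=(0, 0, 0)` for a dark background; either avoids the black-fringe artifacts that a bare `convert("RGB")` can produce on transparent regions.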
SmolVLM-Instruct-awq
SmolVLM-Instruct-gptq
This is a 4-bit GPTQ quantized version of HuggingFaceTB/SmolVLM-Instruct, a 2.2B parameter vision-language model.

- Base Model: HuggingFaceTB/SmolVLM-Instruct
- Quantization Method: GPTQ W4A16 (4-bit weights, 16-bit activations)
- Quantization Tool: llm-compressor
- Model Size: 1.97 GB (55% reduction from 4.4 GB)
- Architecture: Idefics3 (vision encoder + Llama-3.2 text decoder)

✅ Quantized to 4-bit:

- Text decoder (24 LlamaDecoderLayer blocks)
- All attention projections (q_proj, k_proj, v_proj, o_proj)
- All MLP layers (gate_proj, up_proj, down_proj)
- Total: 168 linear layers

❌ Preserved at full precision:

- Vision encoder/tower (SigLIP)
- Vision-text connector
- Language model head
- All layer norms and biases

Training data:

- Calibration Dataset: lmms-lab/flickr30k
- Calibration Samples: 256 images
- Sequence Length: 2048 tokens

Sequential targets:

- Target layers: `LlamaDecoderLayer`
- Pipeline: Sequential (layer-by-layer calibration)

| Metric | Value |
|--------|-------|
| Original Size | 4.4 GB |
| Quantized Size | 1.97 GB |
| Compression Ratio | 2.23x (55% reduction) |
| GPU Memory (inference) | ~2-3 GB |
| Vision Quality | Preserved (no degradation) |
| Text Quality | Under 1% quality degradation on DocVQA |

Inference speed:

- Similar to or slightly faster than FP16 due to reduced memory bandwidth
- Ideal for deployment on consumer GPUs (RTX 3090, 4090, etc.)

Limitations:

1. Slight quality degradation: 4-bit quantization introduces minor quality loss in text generation
2. GPTQ-specific: requires GPTQ-compatible inference engines (vLLM, transformers)
3. Vision tower not quantized: the vision encoder remains at full precision to preserve image understanding

This model was quantized using custom patches to llm-compressor to support the Idefics3 architecture:

- Fixed meta tensor materialization issues in the sequential pipeline
- Enabled GPTQ quantization for vision-language models
- Patches available at: ronantakizawa/llm-compressor

If you use this model, please cite the original SmolVLM work. This model inherits the Apache 2.0 license from the base model.

- Base model: HuggingFaceTB/SmolVLM-Instruct
- Quantization: llm-compressor
- Calibration data: lmms-lab/flickr30k
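The sequential (layer-by-layer) calibration pipeline above quantizes one decoder layer at a time while propagating the calibration activations through the already-quantized prefix. A toy numpy sketch of that control flow (round-to-nearest stands in for GPTQ's Hessian-based error-compensating update; shapes are illustrative):

```python
import numpy as np

def quantize_weights(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round-to-nearest uniform quantization (stand-in for GPTQ's update rule)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def sequential_quantize(layers, calib_x):
    """Quantize layers one by one; each layer is calibrated on activations
    produced by the already-quantized layers before it."""
    x = calib_x
    quantized = []
    for w in layers:
        wq = quantize_weights(w)       # GPTQ would use statistics of x here
        quantized.append(wq)
        x = np.maximum(x @ wq, 0.0)    # propagate through the quantized layer (toy ReLU MLP)
    return quantized

rng = np.random.default_rng(1)
layers = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(3)]
calib = rng.normal(size=(8, 16))
qlayers = sequential_quantize(layers, calib)
```

Feeding each layer the quantized prefix's activations, rather than the FP16 model's, is what lets the pipeline compensate for error accumulated in earlier layers.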
higgs-llama-3-70b-awq
This is a 4-bit AWQ quantized version of bosonai/Higgs-Llama-3-70B, optimized for efficient deployment with minimal quality degradation.

Basic information:

- Base Model: bosonai/Higgs-Llama-3-70B (70B parameters)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Precision: 4-bit
- Group Size: 128
- Quantization Version: GEMM

Model size:

- Original Size: ~140 GB (FP16)
- Quantized Size: 37.05 GB (AWQ 4-bit)
- Compression Ratio: 3.78x
- Memory Reduction: 73.5% (saves ~103 GB)

Calibration:

- Dataset: C4 (allenai/c4)
- Samples: 512 calibration samples
- Text Length: 200-1000 characters per sample

GPU memory usage:

- Model Loading: 37.04 GB VRAM
- vs Original: saves ~103 GB (73.5% reduction)
- Minimum GPU: 40GB+ VRAM (A100 40GB, RTX 6000 Ada, etc.)
- Recommended GPU: 80GB VRAM (A100 80GB, H100, H200)

Inference performance:

- Throughput: 6.03 tokens/second
- Average Latency: 52.66s per generation (200 tokens)
- Hardware: NVIDIA B200 192GB

Generation quality tests, evaluated across multiple task categories:

| Category | Accuracy | Avg Latency |
|----------|----------|-------------|
| General Knowledge | 100% | 51.74s |
| Reasoning | 100% | 55.86s |
| Code Generation | 100% | 51.52s |
| Creative Writing | 50% | 51.17s |
| Mathematics | 50% | 51.85s |
| Overall | 83% | 52.66s |

Perplexity:

- Score: 6.1876 (WikiText-2)
- Quality Rating: ⭐ EXCELLENT (< 10)
- Interpretation: minimal quality degradation from quantization

Key findings:

✅ Strengths:

- Excellent performance on factual/reasoning tasks (100% accuracy)
- Outstanding perplexity score (6.19) indicates minimal quality loss
- Perfect accuracy on code generation tasks
- Strong general knowledge retention

⚠️ Limitations:

- Lower accuracy on creative writing (50%)
- Lower accuracy on mathematical reasoning (50%)
- May require fine-tuning for domain-specific creative tasks

Minimum requirements:

- GPU: 40GB+ VRAM (A100 40GB, RTX 6000 Ada 48GB)
- RAM: 32GB system memory
- Storage: 50GB free space
- CUDA: 11.8 or later
- Python: 3.8+

Recommended requirements:

- GPU: 80GB VRAM (A100 80GB, H100, H200)
- RAM: 64GB+ system memory
- Storage: 100GB+ NVMe SSD
- CUDA: 12.1 or later

Tested configurations ✅ working:

- NVIDIA B200 192GB (6.03 tokens/sec)
- NVIDIA H100 80GB
- NVIDIA A100 80GB

Optimal use cases:

- 📚 Knowledge-intensive Q&A - 100% accuracy on general knowledge
- 🧠 Logical reasoning tasks - 100% accuracy on reasoning benchmarks
- 💻 Code generation - 100% accuracy on programming tasks
- 📊 Data analysis and explanation
- 🔬 Scientific and technical writing

Limited use cases:

- 🎨 Creative writing (50% accuracy - consider fine-tuning)
- 🧮 Complex mathematical reasoning (50% accuracy)

Technical limitations:

- CUDA only: requires an NVIDIA GPU (no CPU/AMD support via AutoAWQ)
- Quantization loss: ~17% accuracy drop on creative/math tasks
- Inference speed: 6 tokens/sec (slower than smaller models)

Quality limitations:

- May produce less creative outputs compared to the FP16 version
- Occasional mathematical errors (50% accuracy on math tests)
- Requires prompt engineering for optimal results on creative tasks

Ethical limitations:

- Subject to Llama 3 license terms and restrictions
- May reproduce biases from training data
- Not suitable for medical, legal, or financial advice without human review

Quantization process:

- Method: AWQ (Activation-aware Weight Quantization)
- Calibration Dataset: C4 (512 samples, 200-1000 chars each)
- Quantization Time: ~1.5 hours on an NVIDIA B200
- Framework: AutoAWQ 0.2.9
- Transformers Version: 4.50.0

Hardware used:

- GPU: NVIDIA B200 192GB SXM6
- CPU: 36 vCPUs
- RAM: 283 GB
- Storage: 300 GB volume

Please refer to the Higgs-Llama-3-70B model card for the base model citation and additional details. This model inherits the Llama 3 Community License from the base model. For questions or issues with this quantized model, please open an issue on the model repository.

Model Version: 1.0
Quantization Method: AWQ 4-bit (GEMM)
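The perplexity figure above (6.1876 on WikiText-2) is the exponential of the mean per-token negative log-likelihood. A minimal reference implementation of the metric itself (the reported score comes from running the full eval set through the model):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood over tokens)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Sanity check: a model assigning uniform probability 1/4 to every token
uniform = [math.log(0.25)] * 100
ppl = perplexity(uniform)  # → 4.0
```

Lower is better; a score near 6 on WikiText-2 is in the range typically reported for strong 70B-class models, which is why it is read here as minimal quantization damage.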
sarashina2-7b-jreadability
Japanese Text Generation with Difficulty Control

Fine-tuned for difficulty-aware Japanese text generation with balanced learning and zero Simple-text degradation.

This model is a fine-tuned version of sbintuitions/sarashina2-7b designed to generate Japanese text at specified difficulty levels while maintaining high quality across both simple and complex text generation.

Key achievement: this model shows zero degradation on Simple text generation while simultaneously improving Complex text performance by +46.7 points.

| Metric | Baseline | Fine-tuned | Improvement |
|--------|----------|------------|-------------|
| Overall Accuracy | 50.0% | 76.7% | +26.7% |
| Simple Text | 93.3% | 100.0% | +6.7% ✅ |
| Complex Text | 6.7% | 53.3% | +46.7% ✅ |

- ✅ Zero unlearning - Simple text generation is fully preserved (improving from 93.3% to 100.0%)
- ✅ Strong Complex gains - Complex text generation improves from 6.7% to 53.3%
- ✅ Balanced learning - both difficulty levels perform well without trade-offs

The key innovation is 2x class weighting for Simple text: this prevents the catastrophic "unlearning" problem where models forget how to generate simple text while learning complex patterns.

Training setup:

- Base Model: sbintuitions/sarashina2-7b (7B parameters)
- Method: LoRA (Low-Rank Adaptation) fine-tuning
- Dataset: ronantakizawa/japanese-text-difficulty-2level
- Training Examples: 1,275 texts (638 Simple, 637 Complex - nearly perfectly balanced)
- Validation Examples: 159 texts
- Difficulty Metric: jReadability-based scoring (0-1 scale)

Design choices:

1. Balanced 50/50 dataset - equal representation prevents bias
2. No artificial length constraints - the model learns natural linguistic complexity
3. Class weighting - Simple text receives 2x gradient influence so the model does not unlearn simple-text patterns while learning complex ones
4. LoRA fine-tuning - efficient training with only 40M trainable parameters (0.54%)

Simple - target audience: beginner and elementary Japanese learners. Characteristics:

- Basic vocabulary (基本的な語彙)
- Simple grammar structures (簡単な文法)
- Common kanji or hiragana-heavy text
- Shorter sentences
- Everyday topics

Complex - target audience: advanced and expert Japanese learners. Characteristics:

- Advanced vocabulary (複雑な語彙)
- Complex grammar patterns (高度な文法)
- Advanced kanji usage
- Longer, more intricate sentences
- Abstract or technical topics

Use cases:

Language learning

- Generate reading materials at appropriate difficulty levels
- Create personalized practice texts for learners
- Develop adaptive learning curricula

Educational technology

- Automated content generation for Japanese learning apps
- Difficulty-graded reading comprehension exercises
- Personalized study materials

Content creation

- Generate Japanese text for specific audiences
- Create accessibility-focused content (simple versions)
- Produce materials for different proficiency levels

Research

- Study Japanese text complexity and readability
- Analyze linguistic features across difficulty levels
- Develop difficulty assessment tools

The model is evaluated using jReadability, a research-backed Japanese readability metric based on Lee & Hasebe's work.

Normalization: `difficulty_score = (6.5 - jreadability_score) / 6.0`

A generated text is considered "accurate" if its normalized difficulty score falls within the target range:

- Simple: 0.0 - 0.5
- Complex: 0.5 - 1.0

This model achieves the best balance: the highest overall accuracy while fully preserving Simple text generation.

The model was trained on ronantakizawa/japanese-text-difficulty-2level, which contains:

- 1,594 Japanese texts from Aozora Bunko (青空文庫) and Kyoto University's Basic Japanese Dataset
- Perfectly balanced: 797 Simple, 797 Complex
- jReadability-enhanced: each text scored using research-backed metrics
- Authentic literature: real Japanese texts, not synthetic data
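The normalization formula and accuracy rule above can be written directly. A small sketch (function names are illustrative; the treatment of the 0.5 boundary is an assumption, since the source gives overlapping range endpoints):

```python
def difficulty_score(jreadability_score: float) -> float:
    """Map a jReadability score (higher = easier) onto a 0-1 difficulty
    scale (higher = harder), per the card's normalization formula."""
    return (6.5 - jreadability_score) / 6.0

def is_accurate(jreadability_score: float, target: str) -> bool:
    """A generation counts as accurate if its difficulty lands in the target band."""
    d = difficulty_score(jreadability_score)
    if target == "simple":
        return 0.0 <= d <= 0.5
    if target == "complex":
        return 0.5 < d <= 1.0
    raise ValueError(f"unknown target: {target}")

# jReadability 5.5 (easy text) -> difficulty ~0.167 -> counts as "simple"
easy_ok = is_accurate(5.5, "simple")
# jReadability 1.5 (hard text) -> difficulty ~0.833 -> counts as "complex"
hard_ok = is_accurate(1.5, "complex")
```

Note the formula maps jReadability 6.5 to difficulty 0.0 and 0.5 to 1.0, which matches the 0-1 difficulty scale described above.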
sarashina2-7b-abliterated
This is an abliterated (refusal-removed) version of sbintuitions/sarashina2-7b.

Abliteration is a technique that removes the "refusal direction" from a language model's weights, making it more likely to comply with requests it would normally refuse. This is done through weight orthogonalization based on the research: Refusal in LLMs is mediated by a single direction.

- Base Model: sbintuitions/sarashina2-7b
- Method: weight orthogonalization
- Refusal Direction Layer: 25/31 (78.1% through the model)
- Separation Score: 40.6445
- Training Samples: 128 harmful + 128 harmless prompts

The refusal direction was computed by testing 6 candidate layers and ranking them by separation score:

| Rank | Layer | Separation Score | Harmful Proj | Harmless Proj |
|------|-------|------------------|--------------|---------------|
| 1 | 25 | 40.6445 | 47.6250 | 6.9805 |
| 2 | 12 | -6.7148 | 3.3555 | 10.0703 |
| 3 | 22 | -4.6953 | 12.6016 | 17.2969 |
| 4 | 9 | -3.3867 | 2.7461 | 6.1328 |
| 5 | 16 | 2.6875 | 8.5391 | 5.8516 |
| 6 | 19 | -0.1641 | 9.6484 | 9.8125 |

Selected: layer 25, with a separation score of 40.6445. A high positive separation score indicates a strong distinction between harmful and harmless activations, making the layer an ideal candidate for abliteration.

- Harmful Projection: 47.6250
- Harmless Projection: 6.9805
- Separation: 40.6445

The baseline model (before abliteration) showed:

- Refusal Rate: 0/4 (0.0%) on test harmful prompts
- The base model already had minimal refusal behavior
- Abliteration further reduces any remaining safety guardrails

This model uses a simple instruction-response format.

⚠️ Warning: this model has had its safety features removed and may generate harmful, unethical, or illegal content.
Intended use:

- Research on AI safety and alignment
- Understanding refusal mechanisms in LLMs
- Red-teaming and adversarial testing
- Educational purposes

Not intended for:

- Production deployments without additional safety measures
- Generating harmful content for malicious purposes
- Bypassing content policies

Method:

1. Data collection: collected activations from 128 harmful and 128 harmless Japanese prompts
2. Direction computation: calculated the mean difference between harmful and harmless activations across 6 layers (30%, 40%, 50%, 60%, 70%, 80% depth)
3. Candidate ranking: ranked layers by separation score (harmful_projection - harmless_projection)
4. Weight orthogonalization: applied an orthogonal projection to the embedding and transformer layer weights to remove the refusal direction

Modified weights:

- Embedding layer (`model.embed_tokens.weight`)
- Attention output projections (`layer.self_attn.o_proj.weight`)
- MLP output projections (`layer.mlp.down_proj.weight`)

The original architecture and all other weights remain unchanged.

Limitations:

- Safety fine-tuning has been removed
- May generate biased, harmful, or incorrect content
- No guarantees on output quality or safety
- Japanese language model - primarily trained on Japanese text

If you use this model, please cite the original abliteration research.

- Base model: sbintuitions/sarashina2-7b by SB Intuitions
- Abliteration technique: FailSpy and the original researchers
- Implementation inspired by: Maxime Labonne's work
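The orthogonalization step removes the unit refusal direction r from every weight matrix W that writes into the residual stream, i.e. W ← W − r rᵀ W. A toy numpy sketch of the direction computation, the separation score, and the projection (dimensions are illustrative, not the model's):

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector along the mean activation difference (harmful - harmless)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(w: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of a weight matrix that writes
    into the residual stream: W <- W - r r^T W (r assumed unit-norm)."""
    return w - np.outer(direction, direction) @ w

rng = np.random.default_rng(2)
harmful = rng.normal(loc=1.0, size=(128, 64))    # toy activations, 128 prompts x 64 dims
harmless = rng.normal(loc=0.0, size=(128, 64))
r = refusal_direction(harmful, harmless)

# Separation score: mean harmful projection minus mean harmless projection
separation = harmful.mean(axis=0) @ r - harmless.mean(axis=0) @ r

w = rng.normal(size=(64, 64))
w_ablit = orthogonalize(w, r)
# After orthogonalization, W's output has (numerically) zero component along r
residual = np.linalg.norm(r @ w_ablit)
```

Because r is unit-norm, r @ w_ablit is exactly zero in exact arithmetic, which is why the modified layers can no longer write the refusal direction into the residual stream.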
molmoact-7b-d-awq
This is a 4-bit AWQ quantized version of allenai/MolmoAct-7B-D-0812 using LLM Compressor.

- ✅ Qwen2.5 text decoder quantized (4-bit AWQ) - ~56% size reduction
- ✅ SigLip2 vision encoder preserved (FP16) - maintains visual quality
- ✅ Robotic manipulation action reasoning - trained on 10k robot trajectories
- ✅ Smart quantization - only LLM layers quantized, vision parts untouched
- ✅ 93 unique manipulation tasks supported

Model details:

- Base Model: allenai/MolmoAct-7B-D-0812 (7B parameters)
- Architecture: MolmoAct (Qwen2.5-7B decoder + SigLip2 vision encoder)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: Flickr30k (512 samples)

| Metric | Value |
|--------|-------|
| Original (FP16) | ~14.0 GB |
| Quantized (W4A16) | ~6.12 GB |
| Reduction | ~56.3% |
| Memory Saved | ~7.9 GB |

Quantized (4-bit):

- Qwen2.5 decoder layers (text/language model)
- Text processing linear layers in the decoder

Preserved (FP16):

- SigLip2 vision encoder (maintains visual understanding quality)
- Vision-text connectors
- Embeddings
- Language model head

This selective quantization keeps vision understanding quality nearly identical to the original model while significantly reducing size.
MolmoAct-7B-D is an open-source action reasoning model for robotic manipulation developed by the Allen Institute for AI:

- Training Data: 10k high-quality trajectories of a single-arm Franka robot
- Text Decoder: Qwen2.5-7B (state-of-the-art open LLM)
- Vision Encoder: SigLip2 (proven vision backbone)
- Capabilities: 93 unique manipulation tasks
- Use Case: robotic manipulation and action reasoning

Quantization details:

- Method: AWQ (Activation-aware Weight Quantization)
- Independent Pipeline: used with BasicPipeline for layer-by-layer quantization
- Calibration: 512 Flickr30k image-text pairs
- Max Sequence Length: 2048 tokens
- Why AWQ: activation-aware quantization preserves the most important weights

Limitations:

- May show slight quality degradation in complex action reasoning compared to FP16
- Vision encoder is NOT quantized (intentional, for quality)
- Requires transformers with AWQ support
- Designed for robotic manipulation tasks, not general conversation

Image requirements: ensure images are in RGB format.

- Base model by Allen Institute for AI
- Quantization using LLM Compressor
Qwen2.5-VL-7B-WebUI-LoRA
idefics3-8b-llama3-awq
sarashina2-7b-4bit-bnb
olmo2-32b-instruct-awq
This is a 4-bit AWQ quantized version of allenai/OLMo-2-0325-32B-Instruct using LLM Compressor.

- ✅ 32B parameters quantized to 4-bit - ~74% size reduction
- ✅ Fully open model - code, data, and training details all public
- ✅ Post-trained on Tülu 3 - SFT → DPO → RLVR pipeline
- ✅ Strong performance - competitive with Llama 3.1 70B on many tasks
- ✅ State-of-the-art on specific tasks - MATH, GSM8K, IFEval

Model details:

- Base Model: allenai/OLMo-2-0325-32B-Instruct (32B parameters)
- Architecture: OLMo 2 (fully open language model)
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Scheme: W4A16 (4-bit weights, 16-bit activations)
- Calibration Dataset: OpenOrca (128 samples)

| Metric | Value |
|--------|-------|
| Original (BF16) | ~64.0 GB |
| Quantized (W4A16) | ~16.91 GB |
| Reduction | ~73.6% |
| Memory Saved | ~47.1 GB |

OLMo 2 is a series of fully open language models by the Allen Institute for AI:

- Training: trained on the Dolma dataset
- Post-training: supervised fine-tuning, DPO, and RLVR on Tülu 3
- Performance: competitive with much larger models
- Openness: all code, data, and training details released

Benchmarks:

- Average Score: 68.8 across diverse benchmarks
- GSM8K: 87.6 (math reasoning)
- IFEval: 85.6 (instruction following)
- MATH: 49.7 (mathematical problem solving)
- MMLU: 77.3 (general knowledge)

In Ai2 demos a default system prompt is used; however, the model has not been trained with a specific system prompt requirement.
Quantization details:

- Method: AWQ (Activation-aware Weight Quantization)
- Independent Pipeline: used with BasicPipeline for layer-by-layer quantization
- Calibration: 128 OpenOrca samples
- Max Sequence Length: 512 tokens
- Why AWQ: preserves important weights based on activation patterns

Requirements:

- Transformers (install from the main branch for OLMo 2 support)
- PyTorch with AWQ/GPTQ support
- 20GB+ GPU VRAM for inference

Limitations:

- Quantization may cause slight quality degradation compared to BF16
- Limited safety training (not production-ready without additional filtering)
- Primarily English language support

- Base model by Allen Institute for AI
- Quantization using LLM Compressor
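The VRAM figure above can be sanity-checked from the bit width: quantized weights alone occupy roughly params × bits / 8 bytes, before FP16 embeddings/head, quantization scales, and runtime overhead (KV cache, activations). A rough back-of-the-envelope sketch (using 1 GB = 1e9 bytes):

```python
def weight_gigabytes(n_params: float, bits: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

# 32B parameters at 4-bit: ~16 GB of weights, close to the ~16.91 GB checkpoint
# once FP16 embeddings/head, per-group scales, and metadata are added.
w4 = weight_gigabytes(32e9, 4)    # → 16.0
bf16 = weight_gigabytes(32e9, 16)  # → 64.0, matching the ~64.0 GB BF16 size
```

The gap between the 16 GB weight estimate and the 20GB+ VRAM requirement is the runtime overhead (KV cache and activations), which grows with batch size and context length.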
sarashina2-7b-4bit-awq
sarashina2-7b-8bit-bnb
battery-placement
A model trained to make the SO-101 robot arm place AA batteries into its power source.