MinerU-HTML
by opendatalab · License: apache-2.0 · Language Model · 0.6B params · 532 downloads
Quick Summary
MinerU-HTML is a 0.6B-parameter language model from opendatalab for extracting the main content of HTML pages, with a CLI, a Python SDK, and RAG-framework integrations for converting web pages to Markdown.
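For a sense of the task, the repo's eval_baselines/ compares against rule-based extractors that simply strip boilerplate tags and keep the remaining text. A stdlib sketch of such a baseline (illustrative only — not the model, and far below it in quality):

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Naive baseline: collect text outside script/style/nav/header/footer."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = "<html><nav>Home | About</nav><article><p>Real content.</p></article><script>x=1</script></html>"
print(extract_text(html))  # -> "Real content."
```

Baselines like this miss in-article boilerplate and mangle structure, which is the gap a learned extractor targets.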
Device Compatibility
- Mobile: 4-6GB RAM
- Laptop: 16GB RAM
- Server: GPU
- Minimum recommended: 1GB+ RAM
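The RAM tiers track weight precision: a model's weight footprint is roughly parameters × bytes per parameter, with activations and runtime overhead on top. A back-of-envelope check for a 0.6B-parameter model:

```python
PARAMS = 0.6e9  # 0.6B parameters

def weight_gb(bytes_per_param: float) -> float:
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("fp32", 4), ("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{weight_gb(bpp):.1f} GB")
# fp32: ~2.4 GB, fp16: ~1.2 GB, int8: ~0.6 GB, 4-bit: ~0.3 GB
```

This roughly lines up with the table: quantized weights fit the 1GB+ minimum, and fp16 sits comfortably inside the mobile tier.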
Code Examples
Basic Installation (Core Functionality)

```bash
# Clone the repository
git clone https://github.com/opendatalab/MinerU-HTML
cd MinerU-HTML
# Install the package with core dependencies only
# Dependencies from requirements.txt are automatically installed
pip install .
```

Installation with Baseline Extractors (for Evaluation)

```bash
# Install with baseline extractor dependencies
pip install -e .[baselines]
```

Command Line API

```bash
# Windows (PowerShell)
irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex
# macOS / Linux
curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh
# Precision extract — token required
mineru-open-api auth
mineru-open-api extract webpage.html -o ./output/ # local file
mineru-open-api crawl https://mineru.net/apiManage/docs -o ./output/ # crawl from URL
```
Python SDK

```python
# pip install mineru-open-sdk
from mineru import MinerU
# Precision mode — tables, formulas, large files
client = MinerU("your-token") # https://mineru.net/apiManage/token
result = client.extract("webpage.html") # local file
result = client.crawl("https://mineru.net/apiManage/docs") # crawl from URL
print(result.markdown)
```
RAG — LangChain

```python
# pip install langchain-mineru
from langchain_mineru import MinerULoader
# Precision mode — full RAG pipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
docs = MinerULoader(source="article.html", mode="precision", token="your-token",
                    formula=True, table=True).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vectorstore.similarity_search("key requirements", k=3)
```
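To see what chunk_size=1200 and chunk_overlap=200 control above, here is a simplified character-window splitter in plain Python (LangChain's RecursiveCharacterTextSplitter is additionally separator-aware; this shows only the windowing):

```python
def split_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size character windows; consecutive chunks share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_text("x" * 3000, chunk_size=1200, chunk_overlap=200)
print([len(c) for c in chunks])  # -> [1200, 1200, 1000]
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.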
RAG — LlamaIndex

```python
# pip install llama-index-readers-mineru
from llama_index.readers.mineru import MinerUReader
# Precision mode — OCR, formula, table
docs = MinerUReader(mode="precision", token="your-token",
                    ocr=True, formula=True, table=True).load_data("complex_article.html")
# Full RAG pipeline
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(docs)
response = index.as_query_engine().query("Summarize the key content of this page")
print(response)
```
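Both retrieval calls above (FAISS similarity_search and the LlamaIndex query engine) bottom out in nearest-neighbor search over embedding vectors; a dependency-free sketch of top-k cosine retrieval with toy 2-dimensional embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query: list[float], docs: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs most similar to the query embedding."""
    scored = sorted(((cosine(query, vec), text) for text, vec in docs), reverse=True)
    return scored[:k]

docs = [("chunk about RAM",  [1.0, 0.0]),
        ("chunk about GPUs", [0.0, 1.0]),
        ("mixed chunk",      [0.7, 0.7])]
print(top_k([1.0, 0.1], docs, k=2))  # most similar chunks first
```

Real stores replace the linear scan with an approximate index (e.g. FAISS), but the scoring is the same idea.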
Project Structure

```text
Dripper/
├── dripper/ # Main package
│ ├── api.py # Dripper API class
│ ├── server.py # FastAPI server
│ ├── base.py # Core data structures
│ ├── exceptions.py # Custom exceptions
│ ├── inference/ # LLM inference modules
│ │ ├── inference.py # Generation functions
│ │ ├── prompt.py # Prompt generation
│ │ ├── logits.py # Response parsing
│ │ └── logtis_processor/ # State machine logits processors
│ ├── process/ # HTML processing
│ │ ├── simplify_html.py
│ │ ├── map_to_main.py
│ │ └── html_utils.py
│ ├── eval/ # Evaluation modules
│ │ ├── metric.py # ROUGE and item-level metrics
│ │ ├── eval.py # Evaluation functions
│ │ ├── process.py # Processing utilities
│ │ └── benckmark.py # Benchmark data structures
│ └── eval_baselines/ # Baseline extractors
│ ├── base.py # Evaluation framework
│ └── baselines/ # Extractor implementations
├── app/ # Application scripts
│ ├── eval_baseline.py # Baseline evaluation script
│ ├── eval_with_answer.py # Two-step evaluation
│ ├── run_inference.py # Inference script
│ └── process_res.py # Result processing
├── requirements.txt # Core Python dependencies (auto-installed)
├── baselines.txt # Optional dependencies for baseline extractors
├── LICENCE # Apache License 2.0
├── NOTICE # Copyright and attribution notices
└── setup.py # Package setup (handles dependency installation)
```
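eval/metric.py reports ROUGE; for reference, ROUGE-1 F1 is unigram overlap between extracted and gold text. A sketch independent of the repo's implementation:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the main content", "the main page content"))  # -> 6/7 ≈ 0.857
```

High ROUGE against the gold main text means the extractor kept the content and dropped the boilerplate.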
Development

Install the pre-commit hooks before contributing:

```bash
pre-commit install
```

Troubleshooting

If vLLM fails with the V1 engine, fall back to the legacy engine:

```bash
export VLLM_USE_V1=0
```

If dependencies are missing, reinstall the package:

```bash
# Reinstall the package (this will automatically install dependencies from requirements.txt)
pip install -e .
# If you need baseline extractors for evaluation:
pip install -e .[baselines]
```