MinerU-HTML
by opendatalab · Language Model · 0.6B params · license: apache-2.0 · 532 downloads
Quick Summary

MinerU-HTML is a 0.6B-parameter language model from opendatalab for extracting the main content of web pages: it converts raw HTML into clean Markdown, which makes it a natural front end for RAG pipelines (see the LangChain and LlamaIndex examples below).

Device Compatibility

Mobile: 4-6GB RAM
Laptop: 16GB RAM
Server: GPU
Minimum recommended: 1GB+ RAM

Code Examples

Basic Installation (Core Functionality)

```bash
# Clone the repository
git clone https://github.com/opendatalab/MinerU-HTML
cd MinerU-HTML

# Install the package with core dependencies only
# (dependencies from requirements.txt are installed automatically)
pip install .
```

Installation with Baseline Extractors (for Evaluation)

```bash
# Install in editable mode with baseline extractor dependencies
pip install -e .[baselines]
```
Command Line API

```bash
# Windows (PowerShell)
irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex

# macOS / Linux
curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh

# Precision extract — token required
mineru-open-api auth
mineru-open-api extract webpage.html -o ./output/                     # local file
mineru-open-api crawl https://mineru.net/apiManage/docs -o ./output/  # crawl from URL
```
Python SDK

```python
# pip install mineru-open-sdk
from mineru import MinerU

# Precision mode — tables, formulas, large files
client = MinerU("your-token")  # https://mineru.net/apiManage/token
result = client.extract("webpage.html")                     # local file
result = client.crawl("https://mineru.net/apiManage/docs")  # crawl from URL
print(result.markdown)
```
RAG — LangChain

```python
# pip install langchain-mineru
from langchain_mineru import MinerULoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Precision mode — full RAG pipeline
docs = MinerULoader(source="article.html", mode="precision", token="your-token",
                    formula=True, table=True).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=200).split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
results = vectorstore.similarity_search("key requirements", k=3)
```
RAG — LlamaIndex

```python
# pip install llama-index-readers-mineru
from llama_index.readers.mineru import MinerUReader
from llama_index.core import VectorStoreIndex

# Precision mode — OCR, formula, table
docs = MinerUReader(mode="precision", token="your-token",
                    ocr=True, formula=True, table=True).load_data("complex_article.html")

# Full RAG pipeline
index = VectorStoreIndex.from_documents(docs)
response = index.as_query_engine().query("Summarize the key content of this page")
print(response)
```
Project Structure

```text
Dripper/
├── dripper/                   # Main package
│   ├── api.py                 # Dripper API class
│   ├── server.py              # FastAPI server
│   ├── base.py                # Core data structures
│   ├── exceptions.py          # Custom exceptions
│   ├── inference/             # LLM inference modules
│   │   ├── inference.py       # Generation functions
│   │   ├── prompt.py          # Prompt generation
│   │   ├── logits.py          # Response parsing
│   │   └── logtis_processor/  # State machine logits processors
│   ├── process/               # HTML processing
│   │   ├── simplify_html.py
│   │   ├── map_to_main.py
│   │   └── html_utils.py
│   ├── eval/                  # Evaluation modules
│   │   ├── metric.py          # ROUGE and item-level metrics
│   │   ├── eval.py            # Evaluation functions
│   │   ├── process.py         # Processing utilities
│   │   └── benckmark.py       # Benchmark data structures
│   └── eval_baselines/        # Baseline extractors
│       ├── base.py            # Evaluation framework
│       └── baselines/         # Extractor implementations
├── app/                       # Application scripts
│   ├── eval_baseline.py       # Baseline evaluation script
│   ├── eval_with_answer.py    # Two-step evaluation
│   ├── run_inference.py       # Inference script
│   └── process_res.py         # Result processing
├── requirements.txt           # Core Python dependencies (auto-installed)
├── baselines.txt              # Optional dependencies for baseline extractors
├── LICENCE                    # Apache License 2.0
├── NOTICE                     # Copyright and attribution notices
└── setup.py                   # Package setup (handles dependency installation)
```
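The process/simplify_html.py module above prepares raw HTML before it reaches the model. Its actual implementation is not reproduced here, but the core idea of stripping non-content markup can be sketched with Python's standard-library html.parser (the SimplifyHTML class and simplify function are illustrative names, not the project's API):

```python
from html.parser import HTMLParser


class SimplifyHTML(HTMLParser):
    """Collect visible text, skipping anything inside non-content tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped regions
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def simplify(html: str) -> str:
    parser = SimplifyHTML()
    parser.feed(html)
    return "\n".join(parser.chunks)


html = ('<html><head><style>p{}</style></head>'
        '<body><script>alert(1)</script><p>Hello, world</p></body></html>')
print(simplify(html))
```

A production simplifier would also handle comments, attributes, navigation, and boilerplate blocks; this sketch only shows the tag-filtering idea.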
Development

```bash
# Install pre-commit hooks
pre-commit install
```
Troubleshooting

If vLLM fails to start or raises engine errors, forcing the legacy V0 engine can help:

```bash
export VLLM_USE_V1=0
```
Reinstall the package

```bash
# Reinstall the package (dependencies from requirements.txt are installed automatically)
pip install -e .

# If you need baseline extractors for evaluation:
pip install -e .[baselines]
```
