hkust-nlp

54 models

WebExplorer 8B

[Paper](https://arxiv.org/abs/2509.06501) | [License](LICENSE) | [GitHub](https://github.com/hkust-nlp/WebExplorer)

A state-of-the-art 8B-parameter web agent model designed for complex information-seeking tasks and long-horizon reasoning.

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data-generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. Leveraging our curated high-quality dataset, we develop the advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports a 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B effectively searches over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also generalizes strongly to the HLE benchmark even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.

- 🌐 Long-horizon Reasoning: supports up to 128K context length and 100 tool-calling turns
- 🛠️ Tool Utilization: masters search and browse functionalities
- 🏆 State-of-the-art Performance: achieves best-in-class results among models under 10B parameters

Built on the Qwen3-8B base model and trained through a two-phase approach:

1. Supervised Fine-tuning (SFT): cold-start initialization with high-quality trajectories
2. Reinforcement Learning (RL): enhanced using the GRPO algorithm with progressive context expansion

WebExplorer-8B achieves state-of-the-art performance across multiple information-seeking benchmarks at its scale:

| Model | BC-en | BC-zh | GAIA | WebWalkerQA | FRAMES | Xbench-DS | HLE |
|-------|-------|-------|------|-------------|--------|-----------|-----|
| OpenAI-o3† | 50.9 | 58.1 | 70.5† | 71.7 | 84.0 | 66.7 | 20.2 |
| Claude-4-Sonnet† | 12.2 | 29.1 | 68.3† | 61.7 | 80.7 | 64.6 | 20.3 |
| GLM-4.5 | 26.4 | 37.5 | 66.0† | 65.6† | 78.9† | 70.0† | 21.2† |
| DeepSeek-V3.1 | 30.0 | 49.2 | 63.1† | 61.2† | 83.7 | 71.2 | 29.8 |
| Kimi-K2† | 14.1 | 28.8 | 57.7 | 63.0 | 72.0 | 50.0 | 18.1 |
|====|====|====|====|====|====|====|====|
| WebShaper-72B | - | - | 60.0 | 52.2 | - | - | - |
| WebShaper-32B (QwQ) | - | - | 53.3 | 49.7 | - | - | - |
| WebShaper-32B | - | - | 52.4 | 51.4 | - | - | - |
| WebSailor-72B | 12.0 | 30.1 | 55.4 | - | - | 55.0 | - |
| WebSailor-32B | 10.5 | 25.5 | 53.2 | - | - | 53.3 | - |
| WebSailor-7B | 6.7 | 14.2 | 33.0 | - | - | 34.3 | - |
| ASearcher-Web-QwQ | 5.2 | 15.6 | 52.8 | 34.3 | 70.9 | 42.1 | 12.5 |
| WebThinker-32B | 2.8 | - | 48.5 | 46.5 | - | - | 15.8 |
| MiroThinker-32B-DPO-v0.1 | 13.0 | 17.0 | 57.3 | 49.3 | 71.7 | - | 11.8 |
| MiroThinker-8B-DPO-v0.1 | 8.7 | 13.6 | 46.6 | 45.7 | 64.4 | - | - |
| WebExplorer-8B (SFT) | 7.9 | 21.3 | 43.7 | 59.8 | 72.6 | 47.5 | 16.0 |
| WebExplorer-8B (RL) | 15.7 | 32.0 | 50.0 | 62.7 | 75.7 | 53.7 | 17.3 |

Accuracy (%) of web agents on information-seeking benchmarks. BC-en and BC-zh denote BrowseComp-en and BrowseComp-zh, respectively; XBench-DS refers to XBench-DeepSearch. Bold indicates the best performance among open-source models; underlined values represent the best performance among models under 10B parameters. All WebExplorer-8B scores are computed as Avg@4 using LLM-as-Judge. Entries marked with a dagger (†) were reproduced by us under our scaffold: a dagger on a model name applies to the entire row; on a single number, to that entry only.

WebExplorer-8B supports two tools for web interaction: search and browse.

If you find our work useful, please consider citing:
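The search-and-browse, multi-turn tool calling described in this card can be pictured as a simple agent loop. The sketch below is illustrative only: the `llm`, `search`, and `browse` helpers, the JSON message format, and the stopping convention are assumptions, not the released WebExplorer scaffold.

```python
# Minimal sketch of a search/browse tool-calling loop.
# `llm`, `search`, and `browse` are hypothetical helpers, not the released scaffold.
import json

MAX_TURNS = 100  # the card states support for up to 100 tool-calling turns

def run_agent(question: str, llm, search, browse) -> str:
    """Call tools until the model emits a final answer or the turn budget runs out."""
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_TURNS):
        reply = llm(messages)  # assumed to return {"answer": ...} or {"tool": ..., "args": ...}
        if "answer" in reply:
            return reply["answer"]
        if reply["tool"] == "search":          # one of the two tools named in the card
            observation = search(reply["args"]["query"])
        else:                                  # "browse"
            observation = browse(reply["args"]["url"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": observation})
    return "No final answer within the turn budget."
```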

license:apache-2.0
942
12

Qwen-2.5-1.5B-SimpleRL-Zoo

license:apache-2.0
241
1

deita-quality-scorer

llama
205
18

Llama-3.1-8B-SimpleRL-Zoo

llama
146
0

Laser-DE-L4096-7B

—
146
0

Qwen-2.5-7B-SimpleRL-Zoo

license:apache-2.0
139
0

Qwen-2.5-Math-7B-SimpleRL-Zoo

license:apache-2.0
114
0

preselect-fasttext-classifier

—
101
8

deita-complexity-scorer

llama
92
14

dart-math-llama3-8b-prop2diff

llama
87
1

Qwen-2.5-0.5B-SimpleRL-Zoo

license:apache-2.0
77
2

Qwen-2.5-14B-SimpleRL-Zoo

—
56
0

Qwen-2.5-32B-SimpleRL-Zoo

license:apache-2.0
45
0

dart-math-llama3-8b-uniform

llama
40
2

Qwen-2.5-Math-7B-SimpleRL-Zero

license:apache-2.0
39
3

dart-math-mistral-7b-prop2diff

license:apache-2.0
37
1

Laser-DE-L4096-1.5B

—
30
0

drkernel-8b-coldstart

—
9
0

dart-math-dsmath-7b-prop2diff

llama
8
3

drkernel-14b

—
8
0

drkernel-14b-coldstart

—
7
0

Mistral-7B-v0.1-SimpleRL-Zoo

license:apache-2.0
7
0

qwen2.5-7b-coder_codeio_pp

—
6
5

llama3.1-8b_codeio_pp

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction. 📑 Paper | 🌐 Project Page | 💾 Released Resources | 📦 Repo. This is the resource page of the CodeI/O collection on Hugging Face; your current position is highlighted with a blue block. Citation: If you find these resources helpful, please kindly cite as:

llama
6
1

drkernel-8b

—
6
0

SynCSE-partial-RoBERTa-large

—
6
0

dart-math-mistral-7b-uniform

license:apache-2.0
6
0

dsv2-lite-coder_codeio

—
6
0

Laser-L2048-1.5B

—
5
0

Qwen-2.5-Math-7B-SimpleRL

This is the model checkpoint from Project SimpleRL. Qwen-2.5-Math-7B-SimpleRL is trained with simple RL from the base model, preceded by an initial warm-up stage. Please generate content using the following template; alternatively, you can use our evaluation code and specify the prompt type as "o1cot". If you find this blog or our code useful, we would appreciate it if you could cite our work:

license:apache-2.0
4
4

dart-math-dsmath-7b-uniform

llama
4
1

SynCSE-partial-RoBERTa-base

—
4
0

dart-math-llama3-70b-uniform

llama
3
1

Qwen-2.5-7B-Verifier-R1-Qwen-1.5B

This is the model checkpoint associated with the paper "Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning." The model is RL-trained from the Qwen-2.5-7B base on the DeepScaleR dataset. Training employed a hybrid verification strategy combining the Hugging Face Math Verifier and the open-source DeepSeek-R1-Distill-Qwen-1.5B.
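As a rough sketch of what such a hybrid strategy can look like, the snippet below tries a rule-based check first and falls back to a model-based verifier only when the rule-based check rejects. The `rule_verify` and `model_verify` helpers are hypothetical stand-ins (for, e.g., a symbolic math checker and an LLM judge), not the paper's released code.

```python
# Hedged sketch of hybrid rule-/model-based verification.
# `rule_verify` and `model_verify` are hypothetical helpers, not the paper's code.

def hybrid_verify(question: str, gold: str, prediction: str,
                  rule_verify, model_verify) -> bool:
    """Accept if the rule-based checker matches; otherwise ask the model verifier."""
    if rule_verify(gold, prediction):  # fast, precise symbolic equivalence check
        return True
    # Rule-based checkers can miss equivalent but differently formatted answers,
    # so fall back to a more permissive model-based judgment.
    return model_verify(question, gold, prediction)
```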

license:apache-2.0
3
0

R1-Distill-Verifier-1.5B

This is the model checkpoint associated with the paper "Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning." It is a custom model-based verifier, fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B using the Rejection Fine-tuning method on the DeepScaleR dataset.
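Rejection fine-tuning generally means sampling multiple candidate outputs per input, keeping only those a correctness check accepts, and fine-tuning on the survivors. The sketch below shows that data-construction step with hypothetical `generate` and `accepts` helpers; for a verifier like this one, the sampled outputs would be verification judgments checked against ground-truth labels. This is an illustration, not the paper's code.

```python
# Generic sketch of rejection fine-tuning data construction.
# `generate` and `accepts` are hypothetical helpers, not the paper's code.

def build_rft_dataset(examples, generate, accepts, n_samples: int = 8):
    """Sample candidates per input and keep only the accepted ones."""
    kept = []
    for prompt, label in examples:
        for _ in range(n_samples):
            candidate = generate(prompt)   # e.g. a sampled verification judgment
            if accepts(candidate, label):  # keep only candidates matching the label
                kept.append({"prompt": prompt, "completion": candidate})
    return kept  # fine-tune the model on these accepted pairs
```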

license:apache-2.0
2
1

deita-llama1-13b-v1.0-sft

llama
2
0

Mistral-Small-24B-SimpleRL-Zoo

license:apache-2.0
2
0

Laser-DE-L1024-1.5B

—
2
0

Laser-DE-L2048-1.5B

—
2
0

deita-7b-v1.0-sft

license:apache-2.0
1
2

mstar-8b-v1.0

base_model:openbmb/MiniCPM-Llama3-V-2_5
1
2

dsv2-lite-coder_codeio_pp

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction. 📑 Paper | 🌐 Project Page | 💾 Released Resources | 📦 Repo. This is the resource page of the CodeI/O collection on Hugging Face; your current position is highlighted with a blue block. Citation: If you find these resources helpful, please kindly cite as:

—
1
2

Qwen-2.5-7B-Verifier-R1-Verifier-1.5B

This is the model checkpoint associated with the paper "Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning." The model is RL-trained from the Qwen-2.5-7B base on the DeepScaleR dataset. Training employed a hybrid verification strategy combining the Hugging Face Math Verifier and our custom model-based verifier, R1-Distill-Verifier-1.5B.

license:apache-2.0
1
1

dart-math-llama3-70b-prop2diff

llama
1
0

dsv2-lite-coder_codeio_stage1

CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction. 📑 Paper | 🌐 Project Page | 💾 Released Resources | 📦 Repo. This is the resource page of the CodeI/O collection on Hugging Face; your current position is highlighted with a blue block. Citation: If you find these resources helpful, please kindly cite as:

—
1
0

DeepSeek-Math-7B-SimpleRL-Zoo

llama
1
0

Laser-L8192-1.5B

—
1
0

Laser-L4096-1.5B

—
1
0

Laser-D-L1024-1.5B

—
1
0

Laser-D-L4096-1.5B

—
1
0

Qwen-2.5-7B-Verifier-general-verifier

license:apache-2.0
1
0

deita-7b-v1.0

license:apache-2.0
0
11

mstar-prm-8b-v1.0

base_model:openbmb/MiniCPM-Llama3-V-2_5
0
2

SynCSE-scratch-RoBERTa-large

—
0
1