Kwaipilot

12 models

KAT-Dev

🔥 We’re thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥 You can now try our strongest proprietary coder model KAT-Coder directly on the StreamLake platform for free.

Highlights

KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves 62.4% resolved, ranking 5th among open-source models across all scales. KAT-Dev-32B is optimized via several stages of training: a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training

We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction-following at this stage may not yield large gains on current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model’s capacity to handle more complex tasks.

2. SFT & RFT

We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model’s generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate “teacher trajectories” annotated by human engineers as guidance during training, much like a learner driver being assisted by an experienced co-driver before driving solo after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training.

3. Agentic RL Scaling

Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure. We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow[1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL.

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to switch flexibly between different backend APIs. On the DashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in DashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` is ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it with the powerful coding capabilities of KAT-Dev-32B. Happy coding!

Here’s the QR code for our WeChat group; feel free to join and chat with us!
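The entropy-based trajectory pruning mentioned above could look roughly like the sketch below. The entropy estimate, the `keep_ratio` knob, and the function names are illustrative assumptions, not the released implementation: the idea is simply that rollouts where the policy is already near-deterministic carry little exploration signal and can be dropped before training.

```python
import math

def token_entropy(probs):
    """Shannon entropy of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prune_trajectories(trajectories, keep_ratio=0.75):
    """Keep only the highest-mean-entropy trajectories.

    Each trajectory is a list of per-token probability distributions.
    Low mean entropy suggests the policy is already confident on that
    rollout, so it contributes little exploration signal to RL.
    """
    scored = [
        (sum(token_entropy(step) for step in traj) / len(traj), traj)
        for traj in trajectories
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [traj for _, traj in scored[:k]]

# Toy example: one confident (low-entropy) and one uncertain rollout.
confident = [[0.97, 0.01, 0.01, 0.01]] * 4
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 4
kept = prune_trajectories([confident, uncertain], keep_ratio=0.5)
```

With `keep_ratio=0.5`, only the uniform (maximally uncertain) rollout survives; a production system would compute entropies from the policy's logits rather than explicit probability lists.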

license:apache-2.0
2,056
183

KAT-Dev-72B-Exp

🔥 We’re thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥 You can now try our strongest proprietary coder model KAT-Coder directly on the StreamLake p...

license:apache-2.0
1,871
148

KAT V1 40B

861
112

HiPO 8B

HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

This work is a companion to our earlier KAT-V1 report, where we first introduced the AutoThink paradigm for controllable reasoning. While KAT-V1 outlined the overall framework of SFT + RL for adaptive reasoning, this paper provides the detailed algorithmic design of that training recipe.

We introduce HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs), a novel RL framework that enables models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.

- Hybrid Data Pipeline: Collects both think-on and think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
- Hybrid Reward System: Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align decisions with performance gains.

Ablation results:
- Think-on Only (Overthinking): Training only on Think-on data makes the model reason on all problems, causing inefficiency.
- GRPO: Improves accuracy by +3.1%, but increases token length on simple tasks.
- Think-on/Think-off Mix: Yields higher accuracy (+4.0%) while reducing token length (–10.8%) and thinking rate (–22%).
- HiPO Advantage: Achieves the best results: +6.2% accuracy, –30% token length, –39% thinking rate, outperforming existing methods in both efficiency and accuracy.

HiPO produces responses in a structured template that makes the reasoning path explicit and machine-parsable. Two modes are supported:
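A machine-parsable think-on/think-off template can be consumed with a small parser like the sketch below. The tag names (`<judge>`, `<think>`, `<answer>`) are hypothetical stand-ins for illustration; the actual tokens in HiPO's template may differ.

```python
import re

# Hypothetical think-on/think-off template; the real tags used by HiPO
# may differ -- this only illustrates the machine-parsable structure.
RESPONSE_RE = re.compile(
    r"<judge>(?P<mode>think-on|think-off)</judge>\s*"
    r"(?:<think>(?P<reasoning>.*?)</think>\s*)?"
    r"<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_response(text):
    """Split a HiPO-style response into (mode, optional reasoning, answer)."""
    m = RESPONSE_RE.search(text)
    if m is None:
        raise ValueError("response does not match the expected template")
    return m.group("mode"), m.group("reasoning"), m.group("answer")

on_mode = "<judge>think-on</judge><think>2+2=4</think><answer>4</answer>"
off_mode = "<judge>think-off</judge><answer>Paris</answer>"
```

In think-off mode the reasoning field comes back as `None`, so downstream code can measure the thinking rate simply by counting responses whose reasoning is present.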

license:apache-2.0
311
17

OASIS-code-1.3B

llama
307
13

OASIS-code-embedding-1.5B

license:mit
303
9

KAT-Dev-72B-Exp-FP8

> This repository contains an FP8 quantized version of the Kwaipilot/KAT-Dev-72B-Exp model. The FP8 version achieves 68.5% on SWE-Bench Verified.

🔥 We’re thrilled to announce the release of KAT-Dev-72B-Exp, our latest and most powerful model yet! 🔥 You can now try our strongest proprietary coder model KAT-Coder directly on the StreamLake platform for free.

Highlights

KAT-Dev-72B-Exp is an open-source 72B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-72B-Exp achieves 74.6% accuracy ⚡ when evaluated strictly with the SWE-agent scaffold.

KAT-Dev-72B-Exp is the experimental reinforcement-learning version of the KAT-Coder model. Through this open-source release, we aim to reveal the technical innovations behind KAT-Coder’s large-scale RL to developers and researchers. We rewrote the attention kernel and redesigned the training engine for shared prefix trajectories to achieve highly efficient RL training, especially for scaffolds that leverage context management. Furthermore, to prevent the exploration collapse observed in RL training, we reshaped the advantage distribution based on pass rates: amplifying the advantage scale of highly exploratory groups while reducing that of low-exploration ones.
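The pass-rate-based advantage reshaping described above might be sketched as follows. The specific scale function (`4·p·(1−p)`, which peaks for mixed groups and vanishes for fully solved or fully failed ones) and the unnormalized group-mean baseline are illustrative assumptions, not the released training code.

```python
def reshape_advantages(rewards, scale_fn=None):
    """Group-relative advantages rescaled by the group's pass rate.

    Groups whose pass rate sits near 0 or 1 explore little (rollouts
    mostly agree), so their advantage scale is damped; mixed groups
    near 0.5 are the most exploratory and keep a strong signal. The
    default scale function is an illustrative choice, not KAT's.
    """
    n = len(rewards)
    pass_rate = sum(1 for r in rewards if r > 0) / n
    mean = sum(rewards) / n
    # 4 * p * (1 - p) peaks at 1.0 for p = 0.5, vanishes at p in {0, 1}.
    scale = scale_fn(pass_rate) if scale_fn else 4 * pass_rate * (1 - pass_rate)
    return [scale * (r - mean) for r in rewards]

# A mixed group (half the rollouts pass) keeps a strong signal...
mixed = reshape_advantages([1.0, 1.0, 0.0, 0.0])
# ...while a fully solved group is damped to zero, so the policy
# stops pushing on problems it already answers reliably.
easy = reshape_advantages([1.0, 1.0, 1.0, 1.0])
```

Passing a custom `scale_fn` lets you experiment with other exploration-weighting schemes without touching the baseline computation.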

license:apache-2.0
183
6

KAT-Dev-FP8

license:apache-2.0
91
3

KwaiCoder-23B-A4B-v1

license:mit
31
16

HiPO-1.7B

(Model card identical to HiPO 8B above.)

license:apache-2.0
19
1

SRPO-Qwen-32B

license:mit
4
15

KwaiCoder-DS-V2-Lite-Base

license:mit
4
6