Researchers from the MBZUAI Institute of Foundation Models and G42 have released K2 Think, a 32-billion-parameter open-source AI reasoning system designed to match or outperform models roughly 20 times its size on advanced reasoning tasks. K2 Think integrates long chain-of-thought supervised fine-tuning with reinforcement learning from verifiable rewards (RLVR), agentic planning, test-time scaling, and inference optimizations, including speculative decoding on wafer-scale hardware. The result is a system that achieves frontier-level performance in mathematics, code, and science while maintaining a transparent, fully open release of weights, data, and code.
- System Overview
- Pillar 1: Long CoT SFT
- Pillar 2: RL with Verifiable Rewards
- Pillars 3 – 4: Agentic Planning and Test-time Scaling
- Pillars 5 – 6: Speculative Decoding and Wafer-Scale Inference
- Evaluation Protocol
- Results
- Key Numbers at a Glance
System Overview
K2 Think is developed by post-training an open-weight Qwen2.5-32B base model, enhanced with a lightweight test-time compute scaffold. The system is designed for parameter efficiency, allowing rapid iteration and deployment. The core framework consists of six pillars: (1) Long chain-of-thought (CoT) supervised fine-tuning, (2) Reinforcement Learning with Verifiable Rewards (RLVR), (3) agentic planning, (4) test-time scaling via best-of-N selection with verifiers, (5) speculative decoding, and (6) inference on a wafer-scale engine.
The primary goals are to improve pass@1 scores on competition-grade math benchmarks, maintain robust performance in code and science, and optimize response length and latency through plan-before-you-think prompting and hardware-aware inference.
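The best-of-N selection step at the heart of the test-time scaffold can be sketched as follows. This is a minimal illustration: the `make_verifier` function here is a hypothetical stand-in that scores a candidate 1.0 when its final line matches a reference answer, whereas K2 Think uses actual verifiers to judge candidate solutions.

```python
def best_of_n(candidates, verify):
    """Return the candidate the verifier scores highest (best-of-N selection)."""
    return max(candidates, key=verify)

def make_verifier(reference):
    """Hypothetical verifier: 1.0 if a candidate's final line equals the
    reference answer, else 0.0 (a toy stand-in for K2 Think's verifiers)."""
    def verify(candidate):
        final_line = candidate.strip().splitlines()[-1]
        return 1.0 if final_line == reference else 0.0
    return verify

# Three sampled candidates for "What is half of 84?"; only one is correct.
candidates = [
    "plan: halve 84, then halve again\nanswer: 21",
    "plan: factor 84 as 2 * 42\nanswer: 42",
    "plan: guess\nanswer: 40",
]
best = best_of_n(candidates, make_verifier("answer: 42"))
print(best.splitlines()[-1])  # answer: 42
```

In production the verifier is itself a model call, so the score is a real-valued judgment rather than an exact string match, but the selection logic is the same `max` over candidates.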
Pillar 1: Long CoT SFT
Phase-1 Supervised Fine-Tuning (SFT) employs curated long chain-of-thought traces and instruction/response pairs across math, code, science, and general chat. This trains the base model to externalize intermediate reasoning and adopt a structured output format. Significant pass@1 improvements appear early in training, with AIME’24 stabilizing around 79% and AIME’25 around 72% before any reinforcement learning, indicating that the SFT stage largely converges on its own.
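The "structured output format" can be sketched as a training-target template that separates visible reasoning from the delimited final answer. The `</answer>` stop marker is taken from the evaluation protocol later in this article; the surrounding tag layout here is an illustrative assumption, not the confirmed K2 Think template.

```python
def format_sft_target(reasoning: str, answer: str) -> str:
    """Assemble one long-CoT SFT training target that externalizes
    intermediate reasoning before a delimited final answer.
    Only the </answer> stop marker is attested in the article's
    evaluation protocol; the <think> tag is an illustrative assumption."""
    return f"<think>\n{reasoning}\n</think>\n<answer>\n{answer}\n</answer>"

target = format_sft_target("84 = 2 * 42, so half of 84 is 42.", "42")
print(target.endswith("</answer>"))  # True
```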
Pillar 2: RL with Verifiable Rewards
K2 Think is further trained using RLVR on a dataset called Guru, which includes approximately 92,000 prompts across six domains: Math, Code, Science, Logic, Simulation, and Tabular data. Rewards are assigned by programmatically verifying end-to-end answer correctness rather than relying on a learned reward model. Notably, starting RL from a strong SFT checkpoint yields only modest additional gains, while applying RL directly to the base model shows substantial improvements, suggesting a trade-off between SFT strength and remaining RL headroom.
An ablation study indicates that reducing the initial context window in multi-stage RL (e.g., from 32k down to 16k) underperforms, failing to recover even the SFT baseline, highlighting the importance of maintaining sequence length for learned reasoning patterns.
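The core idea of a verifiable reward is that a short program, not a learned reward model, decides whether a rollout earns credit. A minimal sketch for the math domain is below; the `\boxed{...}` answer convention is a common one in math RL pipelines but is an assumption here, not necessarily K2 Think's exact format.

```python
import re

def verifiable_math_reward(response: str, gold: str) -> float:
    """Binary verifiable reward for RLVR: 1.0 iff the model's boxed final
    answer matches the gold answer, else 0.0. The \\boxed{...} convention
    is an illustrative assumption about the answer format."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer earns no reward
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0

print(verifiable_math_reward(r"Half of 84 is \boxed{42}", "42"))  # 1.0
print(verifiable_math_reward(r"I think it is \boxed{41}", "42"))  # 0.0
```

Code and simulation domains follow the same pattern with different checkers (unit tests, simulators), which is what makes the reward signal scale across Guru's six domains without human labeling.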
Pillars 3 – 4: Agentic Planning and Test-time Scaling
During inference, K2 Think first generates a compact plan before producing a full solution, followed by best-of-N sampling with verifiers to select the most likely correct answer. This approach results in consistent quality gains and shorter final responses, with reductions in token count of up to 11.7% on benchmarks like Omni-HARD, which is crucial for reducing latency and cost.
Analysis shows K2 Think’s response lengths are shorter than those of Qwen3-235B-A22B and comparable to GPT-OSS-120B on math tasks. After incorporating plan-before-you-think and verifiers, K2 Think’s average tokens decrease compared to its post-training checkpoint.
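The two-stage scaffold can be sketched as a pair of chained model calls: a cheap planning call whose output is fed into the solving call. The prompt wording below is illustrative, and `llm` is any callable from prompt to text, not a specific K2 Think API.

```python
def plan_then_solve(problem: str, llm):
    """'Plan-before-you-think' scaffold: elicit a compact plan first, then
    condition the full solution on that plan. `llm` is a stand-in callable;
    the prompt phrasing is an illustrative assumption."""
    plan = llm(f"Outline a short plan for solving:\n{problem}")
    solution = llm(f"Problem:\n{problem}\nPlan:\n{plan}\nSolve it, following the plan.")
    return plan, solution

# Stub model that just reports which stage invoked it.
stub = lambda prompt: "PLAN" if prompt.startswith("Outline") else "SOLUTION"
plan, solution = plan_then_solve("What is half of 84?", stub)
print(plan, solution)  # PLAN SOLUTION
```

In the full system this scaffold is combined with the best-of-N verifier step, so several planned solutions are sampled and the verifier picks among them.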
Pillars 5 – 6: Speculative Decoding and Wafer-Scale Inference
K2 Think utilizes the Cerebras Wafer-Scale Engine for inference with speculative decoding, achieving per-request throughput of over 2,000 tokens per second and making the test-time scaffold practical for both production and research applications. This hardware-aware inference aligns with the system’s “small-but-fast” philosophy.
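The speedup from speculative decoding comes from a draft-and-verify loop: a cheap draft model proposes several tokens, and the target model validates them in one pass, accepting a prefix and emitting one token of its own. A greedy, toy-model sketch of a single step (not the Cerebras implementation, and simplified to deterministic decoding):

```python
def speculative_step(draft, target, prefix, k=4):
    """One greedy draft-and-verify step of speculative decoding.
    `draft` and `target` map a token list to the next token (toy stand-ins
    for the draft and target models). The target accepts the longest prefix
    of the k drafted tokens matching its own greedy choices, then emits one
    token of its own, so each step yields between 1 and k + 1 tokens."""
    ctx = list(prefix)
    proposed = []
    for _ in range(k):          # draft model proposes k tokens cheaply
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposed:        # target verifies the drafted tokens
        if target(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # target always contributes one token
    return accepted

# Toy models following fixed strings; they agree on "abc", then diverge.
target_model = lambda ctx: "abcdef"[len(ctx)]
draft_model = lambda ctx: "abcxyz"[len(ctx)]
print(speculative_step(draft_model, target_model, [], k=4))  # ['a', 'b', 'c', 'd']
```

Because accepted tokens cost one target-model pass regardless of how many there are, high draft-acceptance rates translate directly into the multi-x per-request throughput described above.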
Evaluation Protocol
Benchmarking includes competition-level math (AIME’24, AIME’25, HMMT’25, Omni-MATH-HARD), code (LiveCodeBench v5; SciCode sub/main), and science knowledge/reasoning (GPQA-Diamond; HLE). The evaluation setup includes a maximum generation length of 64k tokens, temperature of 1.0, top-p of 0.95, and stop marker </answer>. Each score is averaged over 16 independent pass@1 evaluations to minimize variance.
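The 16-run averaging is straightforward but worth making explicit, since single-run pass@1 on small benchmarks like AIME (30 problems) is noisy. A minimal sketch with a hypothetical outcome vector:

```python
def averaged_pass_at_1(run_outcomes) -> float:
    """Average pass@1 (as a percentage) over repeated independent runs,
    as in the 16-run protocol above. Each outcome is 1 (first attempt
    correct) or 0; the example outcomes below are hypothetical."""
    return 100.0 * sum(run_outcomes) / len(run_outcomes)

# Hypothetical: 12 correct first attempts out of 16 independent runs.
print(averaged_pass_at_1([1] * 12 + [0] * 4))  # 75.0
```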
Results
In math, K2 Think achieves a micro-average score of 67.99, leading the open-weight cohort and performing favorably against larger systems. It scores 90.83 on AIME’24, 81.24 on AIME’25, 73.75 on HMMT’25, and 60.73 on Omni-HARD. These results highlight K2 Think’s parameter efficiency compared to models like DeepSeek V3.1 (671B) and GPT-OSS-120B (120B).
In code, K2 Think scores 63.97 on LiveCodeBench v5, surpassing similarly sized and larger open models. On SciCode, it scores 39.2/12.0 (sub/main), closely tracking the best open systems on sub-problem accuracy.
In science, the model achieves 71.08 on GPQA-Diamond and 9.95 on HLE, demonstrating its versatility across knowledge-intensive tasks.
Key Numbers at a Glance
- Backbone: Qwen2.5-32B (open weight), post-trained with long CoT SFT + RLVR (GRPO via verl).
- RL data: Guru (~92k prompts) across Math/Code/Science/Logic/Simulation/Tabular.
- Inference scaffold: Plan-before-you-think + BoN with verifiers; shorter outputs (e.g., −11.7% tokens on Omni-HARD) at higher accuracy.
- Throughput target: ~2,000 tok/s on Cerebras WSE with speculative decoding.
- Math micro-avg: 67.99 (AIME’24 90.83, AIME’25 81.24, HMMT’25 73.75, Omni-HARD 60.73).
- Code/Science: LCBv5 63.97; SciCode 39.2/12.0; GPQA-D 71.08; HLE 9.95.
- Safety-4 macro: 0.75 (Refusal 0.83, Conv. Robustness 0.89, Cybersecurity 0.56, Jailbreak 0.72).
K2 Think exemplifies how integrative post-training, test-time compute, and hardware-aware inference can bridge the gap to larger, proprietary reasoning systems. At 32B parameters, it is practical to fine-tune and deploy; plan-before-you-think prompting and best-of-N with verifiers keep token budgets in check; and speculative decoding on wafer-scale hardware delivers roughly 2,000 tokens per second per request. K2 Think is fully open, offering weights, training data, deployment code, and test-time optimization code.
Frequently Asked Questions
What is K2 Think and who developed it?
K2 Think is a 32-billion-parameter open-source AI reasoning system developed by researchers from the MBZUAI Institute of Foundation Models and G42. It is designed to excel at advanced reasoning tasks while matching or outperforming models roughly 20 times its size.
What are the key components of K2 Think’s framework?
The core framework of K2 Think consists of six pillars: Long chain-of-thought supervised fine-tuning, Reinforcement Learning with Verifiable Rewards, agentic planning, test-time scaling via best-of-N selection with verifiers, speculative decoding, and inference on a wafer-scale engine.
How does K2 Think perform in math, code, and science tasks?
K2 Think achieves frontier-level performance in mathematics, code, and science. It scores a micro-average of 67.99 in math, surpassing larger systems, and performs favorably in code and science tasks, demonstrating its versatility across knowledge-intensive tasks.
What is the significance of K2 Think’s open-source nature?
K2 Think maintains a transparent, fully open release of weights, data, and code, which allows for greater accessibility and collaboration in the AI research community. This openness supports further innovation and development in AI reasoning systems.
What hardware does K2 Think utilize for inference?
K2 Think utilizes the Cerebras Wafer-Scale Engine for inference with speculative decoding, achieving per-request throughput of over 2,000 tokens per second. This hardware-aware inference aligns with the system’s “small-but-fast” philosophy.