Kimi K2 Thinking is built on a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion active parameters per token. The model employs 384 experts, dynamically selecting 8 per token plus 1 shared expert, enabling high specialization while maintaining computational efficiency. Its architecture comprises 61 layers and incorporates Multi-Head Latent Attention (MLA), which reduces KV-cache overhead and thereby speeds up long-context processing. A key distinction from conventional MoE models is the use of the MuonClip optimizer during pretraining, which kept training stable even at the trillion-parameter scale. Another distinctive feature of K2 Thinking is its native INT4 quantization, implemented via Quantization-Aware Training (QAT) during the fine-tuning phase. Quantization is applied exclusively to the MoE components, doubling generation speed with no perceptible loss in quality.
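The routing scheme described above (384 experts, 8 selected per token, plus one always-on shared expert) can be illustrated with a short PyTorch sketch. This is a minimal, illustrative implementation under assumed dimensions; the class, parameter names, and layer sizes are hypothetical and do not reflect Kimi K2's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE block: route each token to top_k experts plus one shared expert."""

    def __init__(self, d_model=256, d_ff=512, n_experts=384, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )
        # The shared expert processes every token regardless of the routing decision.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)       # routing probabilities over all experts
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep the 8 best experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        outputs = []
        for t in range(x.size(0)):                       # naive per-token dispatch (clarity over speed)
            tok = x[t]
            y = self.shared_expert(tok)                  # shared expert always contributes
            for w, e in zip(weights[t], idx[t]):
                y = y + w * self.experts[int(e)](tok)    # only 8 of 384 experts run for this token
            outputs.append(y)
        return torch.stack(outputs)

# Example: 4 tokens flow through the layer with small illustrative dimensions.
layer = SparseMoELayer(d_model=64, d_ff=128)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

The key property the sketch demonstrates is that only a small, token-dependent subset of expert weights participates in each forward pass, which is what keeps the active parameter count at 32B despite 1T total parameters.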
K2 Thinking is trained to interleave chain-of-thought reasoning with function calls, enabling it to autonomously execute research, coding, and writing workflows spanning hundreds of steps. The model sustains goal-directed behavior over 200–300 consecutive tool calls—far surpassing earlier versions, which degraded after just 30–50 steps. It implements a “think-before-act” approach: formulating hypotheses, selecting tools, executing actions, verifying outcomes, and iteratively refining its plan. Additionally, the model employs reasoning budgets, allowing fine-grained control over the trade-off between response accuracy and speed.
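Below is a hedged sketch of this interleaved "reason, call a tool, observe, continue" loop, assuming the model is served behind an OpenAI-compatible chat-completions endpoint. The base URL, API key, model identifier, and the `web_search` tool are illustrative assumptions, not a documented configuration.

```python
import json
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model name; substitute your own deployment.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool exposed to the model
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Stub standing in for a real search backend.
    return f"(search results for: {query})"

messages = [{"role": "user", "content": "Research topic X and summarize it with sources."}]

for step in range(300):  # the model sustains on the order of 200-300 consecutive tool calls
    resp = client.chat.completions.create(
        model="kimi-k2-thinking",  # assumed model identifier
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:          # no further tool use: the model has produced its answer
        print(msg.content)
        break
    for call in msg.tool_calls:     # execute each requested tool call and feed the result back
        args = json.loads(call.function.arguments)
        result = web_search(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
```

The loop itself is generic; what the model contributes is deciding at every step whether to think further, which tool to call, and how to revise its plan based on the returned observation.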
On Humanity's Last Exam (HLE)—the most challenging benchmark for analytical reasoning—K2 Thinking scores 44.9% with tool use, outperforming GPT-5 High (41.7%) and Claude Sonnet 4.5 Thinking (32.0%). On BrowseComp, a benchmark for agentic web search, it achieves 60.2%, substantially ahead of GPT-5 (54.9%) and DeepSeek-V3.2 (40.1%). In mathematical competitions, it scores 99.1% on AIME25 (with Python) and 95.1% on HMMT25, demonstrating exceptional multi-step computational reasoning. On SWE-bench Verified (real-world GitHub pull requests), it reaches 71.3%, and on SWE-bench Multilingual, it achieves 61.1%, showcasing strong programming competence with tool integration.
The model is ideally suited for autonomous research agents capable of conducting deep information analysis from web sources, performing multi-step fact verification, and synthesizing insights from dozens of documents. In software development, K2 Thinking can handle complex tasks such as refactoring, debugging, and multi-file codebase modifications using Bash, code editors, and interpreters. It demonstrates expert-level performance in solving Olympiad problems, running numerical experiments with result validation, and generating formal proofs—making it a powerful tool for scientific research and mathematical modeling. For content creation, the model excels at long-horizon writing workflows that involve research queries, claim verification, and source-cited document generation. Thanks to its stability over hundreds of tool calls, K2 Thinking is particularly well-suited for financial research requiring multi-stage analysis of financial statements and market data.
| Model Name | Context | Type | GPU | TPS | Status | Link |
|---|---|---|---|---|---|---|
There are no public endpoints for this model yet.
Rent your own physically dedicated instance with hourly or long-term monthly billing.
We recommend deploying private instances in the following scenarios:
| Name | vCPU | RAM, MB | Disk, GB | GPU | Price | |
|---|---|---|---|---|---|---|
| 262,144 pipeline | 52 | 917504 | 960 | 6 | $28.39 | Launch |
| 262,144 tensor | 52 | 1048576 | 960 | 8 | $37.37 | Launch |
| Name | vCPU | RAM, MB | Disk, GB | GPU | Price | |
|---|---|---|---|---|---|---|
| 262,144 tensor | 52 | 1048576 | 1280 | 8 | $37.41 | Launch |
There are no configurations for this combination of model, context, and quantization yet.
Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.