Gemma 4 12B is a dense multimodal model with a unified, encoder-free architecture. It sits between the compact E4B for mobile devices and the more powerful 26B A4B MoE, filling a mid-range niche optimized for consumer laptops with 16 GB of video memory.
The key distinction of Gemma 4 12B from the rest of the family is its completely encoder-free architecture: instead of separate vision and audio encoders, the model uses linear projections to feed raw image patches and audio waveforms directly into a single decoder. This is the first mid-size model in the Gemma family with native audio input, making it a unique solution for local multimodal AI. All modalities flow through a single decoder-only transformer, which reduces latency and allows the entire model to be fine-tuned in a single pass, because there are no separate frozen encoders to align.
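To make the idea of "linear projections instead of encoders" concrete, here is a minimal toy sketch in plain Python. It cuts a small grayscale image into patches, flattens each patch, and maps it to a token embedding with a single linear layer. The function names, patch size, embedding width, and weights are all hypothetical and chosen for illustration; the actual patch size, channel handling, and positional information used by Gemma 4 12B are not described here. An audio waveform could be handled the same way by slicing it into fixed-length frames.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch vectors.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def linear_project(vectors, weight, bias):
    """Apply y = W x + b to every vector; weight has shape [d_model][len(x)]."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Toy 4 x 4 image with pixel values 0..15 (illustrative data only).
image = [[4 * r + c for c in range(4)] for r in range(4)]

# Hypothetical projection: 4 pixels per patch -> 2-dimensional "embedding".
weight = [[1, 0, 0, 0], [0, 0, 0, 1]]
bias = [0, 1]

tokens = linear_project(patchify(image, patch=2), weight, bias)
print(len(tokens), "patch tokens of width", len(tokens[0]))
```

In this sketch each patch becomes one input token for the decoder, so no separate vision model runs before the transformer.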
As with other models in the family, the decoder of Gemma 4 12B is built on a hybrid attention mechanism that alternates layers with local sliding-window attention (1024 tokens) and layers with full global attention. These layers use so-called heteromorphic heads, whose sizes vary within a single model. The local layers provide speed and low memory usage, since each token only sees its neighbors within the window, while the global layers cover the entire context and capture long-range dependencies.
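The difference between the two layer types can be sketched with attention masks. The plain-Python example below builds a causal mask with and without a sliding window and counts how many query-key pairs each variant has to score. It assumes the window includes the current token and looks only backwards; the exact window semantics and the ratio of local to global layers are not stated here, so treat this as an illustration rather than the model's implementation.

```python
from typing import List, Optional


def causal_mask(n: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives full (global) causal attention; otherwise each query
    sees at most `window` most recent positions, itself included.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(n)]
        for i in range(n)
    ]


def attended_pairs(n: int, window: Optional[int] = None) -> int:
    """Number of (query, key) pairs scored for a sequence of n tokens."""
    if window is None or window >= n:
        return n * (n + 1) // 2
    # The first `window` rows see 1..window keys, every later row sees `window`.
    return window * (window + 1) // 2 + (n - window) * window


# Small visual check: 6 tokens, window of 3.
for row in causal_mask(6, window=3):
    print("".join("#" if visible else "." for visible in row))

# Work per layer at the full context length quoted on this page.
n_tokens = 262_144
ratio = attended_pairs(n_tokens) / attended_pairs(n_tokens, window=1024)
print(f"global / local attention pairs: {ratio:.1f}x")
```

The pair count grows quadratically for global layers but only linearly for windowed ones, which is why the local layers stay cheap even at very long context lengths.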
The model supports text, image, video, and audio processing. A built-in thinking mode allows the model to reason step-by-step before producing an answer, which is critical for complex tasks. The model also supports function calling for agentic scenarios, variable image resolution, and multilingualism (140+ languages during pre-training, 35+ languages out of the box). Multi-Token Prediction (MTP) is supported to accelerate inference, significantly reducing generation latency without quality loss. The vocabulary comprises 262K tokens, and the context window reaches 256K tokens.
On key benchmarks, Gemma 4 12B delivers results close to those of the substantially larger 26B A4B MoE. On AIME 2026 (advanced mathematical reasoning) the model scores 77.5%, about 3.7 times the result of Gemma 3 27B (20.8%). On GPQA Diamond (PhD-level expert questions in physics, chemistry, and biology) it reaches 78.8%, an outstanding result for a 12B model that surpasses many larger models. Strong programming abilities are confirmed by 72.0% on LiveCodeBench v6 (real-world code generation) and a Codeforces ELO of 1659. On multimodal and multilingual tests the model scores 69.1% on MMMU Pro (universal image understanding), 79.7% on MATH-Vision (mathematics on images), and 83.4% on MMMLU (multilingual knowledge). On the CoVoST benchmark (audio translation) the model achieves the best result among all Gemma models (38.5%).
The model’s use cases are defined by three key factors: compactness, multimodality with native audio, and agentic capabilities. Gemma 4 12B is well suited for local agentic systems, from autonomous coding assistants to multimodal AI assistants with voice input. The model is effective for speech recognition and translation, video fragment analysis, intelligent document processing, and for building embedded AI solutions on desktops. For more details on use cases, see the developer guide: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/
| Model Name | Context | Type | GPU | Status | Link |
|---|---|---|---|---|---|
There are no public endpoints for this model yet.
Rent your own physically dedicated instance with hourly or long-term monthly billing.
We recommend deploying private instances in the following scenarios:
| Context (tokens) | Parallelism | GPUs | Price | Max Concurrency |
|---|---|---|---|---|
| 262,144 | - | 1 | $0.53 | 1.268 |
| 262,144 | tensor | 2 | $0.54 | 1.795 |
| 262,144 | tensor | 2 | $0.57 | 1.806 |
| 262,144 | - | 1 | $0.83 | 1.412 |
| 262,144 | pipeline | 3 | $0.84 | 1.774 |
| 262,144 | - | 1 | $1.02 | 1.406 |
| 262,144 | tensor | 4 | $1.12 | 2.833 |
| 262,144 | tensor | 2 | $1.23 | 3.940 |
| 262,144 | pipeline | 3 | $1.43 | 1.389 |
| 262,144 | - | 1 | $1.59 | 2.464 |
| 262,144 | tensor | 4 | $1.82 | 2.320 |
| 262,144 | - | 1 | $2.37 | 8.933 |
| 262,144 | - | 1 | $3.83 | 8.923 |
| 262,144 | - | 1 | $4.11 | 10.802 |
| 262,144 | tensor | 2 | $4.61 | 19.268 |
| 262,144 | - | 1 | $4.74 | 17.110 |
| 262,144 | tensor | 2 | $9.40 | 35.623 |

| Context (tokens) | Parallelism | GPUs | Price | Max Concurrency |
|---|---|---|---|---|
| 262,144 | tensor | 2 | $0.54 | 1.143 |
| 262,144 | tensor | 2 | $0.57 | 1.154 |
| 262,144 | pipeline | 3 | $0.84 | 1.122 |
| 262,144 | tensor | 2 | $0.93 | 3.288 |
| 262,144 | tensor | 4 | $1.12 | 2.181 |
| 262,144 | tensor | 2 | $1.23 | 3.288 |
| 262,144 | tensor | 2 | $1.56 | 3.574 |
| 262,144 | - | 1 | $1.59 | 1.811 |
| 262,144 | tensor | 4 | $1.82 | 1.668 |
| 262,144 | tensor | 2 | $1.92 | 3.563 |
| 262,144 | - | 1 | $2.37 | 8.280 |
| 262,144 | - | 1 | $3.83 | 8.271 |
| 262,144 | - | 1 | $4.11 | 10.149 |
| 262,144 | tensor | 2 | $4.61 | 18.616 |
| 262,144 | - | 1 | $4.74 | 16.457 |
| 262,144 | tensor | 2 | $9.40 | 34.971 |

| Context (tokens) | Parallelism | GPUs | Price | Max Concurrency |
|---|---|---|---|---|
| 262,144 | pipeline | 3 | $0.88 | 1.528 |
| 262,144 | tensor | 2 | $0.93 | 2.073 |
| 262,144 | tensor | 4 | $0.96 | 3.127 |
| 262,144 | pipeline | 3 | $1.06 | 1.544 |
| 262,144 | tensor | 2 | $1.23 | 2.073 |
| 262,144 | tensor | 4 | $1.26 | 3.149 |
| 262,144 | tensor | 2 | $1.56 | 2.360 |
| 262,144 | tensor | 2 | $1.92 | 2.349 |
| 262,144 | - | 1 | $2.37 | 7.066 |
| 262,144 | tensor | 2 | $2.93 | 4.464 |
| 262,144 | - | 1 | $3.83 | 7.056 |
| 262,144 | - | 1 | $4.11 | 8.935 |
| 262,144 | tensor | 2 | $4.61 | 17.402 |
| 262,144 | - | 1 | $4.74 | 15.243 |
| 262,144 | tensor | 2 | $9.40 | 33.756 |
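
One rough way to compare the configurations listed above is to divide the price by the max concurrency figure. The sketch below does this for three rows copied from the first list. It assumes that "Max Concurrency" means the number of full 262,144-token requests that fit in memory at once, which is an interpretation and not something stated on this page; the billing period of the price is also not stated here, so the result is only a relative measure. The function names are hypothetical.

```python
CONTEXT_TOKENS = 262_144

# (GPU count, price in USD, max concurrency) copied from three rows of the first list.
configs = [
    (1, 0.53, 1.268),
    (1, 2.37, 8.933),
    (2, 9.40, 35.623),
]


def price_per_slot(price: float, max_concurrency: float) -> float:
    """Price divided by the number of concurrent full-context requests (assumed meaning)."""
    return price / max_concurrency


def approx_cache_tokens(max_concurrency: float, context_tokens: int = CONTEXT_TOKENS) -> int:
    """Approximate total tokens held at once, assuming concurrency counts full contexts."""
    return round(max_concurrency * context_tokens)


for gpus, price, concurrency in configs:
    print(
        f"{gpus} GPU(s) at ${price:.2f}: "
        f"${price_per_slot(price, concurrency):.3f} per full-context slot, "
        f"about {approx_cache_tokens(concurrency):,} tokens in flight"
    )

# Configuration with the lowest price per concurrent full-context request.
best = min(configs, key=lambda cfg: price_per_slot(cfg[1], cfg[2]))
print("lowest price per slot:", best)
```

Under these assumptions the cheapest instance is not the cheapest per concurrent request, so the right choice depends on how many long-context sessions need to run at the same time.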
Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.