gemma-4-12B-it

reasoning
multimodal

Gemma 4 12B is a dense multimodal model with a unified, encoder-free architecture. It sits between the compact E4B for mobile devices and the more powerful 26B A4B MoE, filling a mid-range niche optimized for consumer laptops with 16 GB of video memory.

What sets Gemma 4 12B apart from the rest of the family is its completely encoder-free architecture: instead of separate vision and audio encoders, the model uses linear projections to feed raw image patches and audio waveforms directly into a single decoder. It is the first mid-size model in the Gemma family with native audio input, which makes it well suited to local multimodal AI. All modalities flow through a single decoder-only transformer, which reduces latency and allows the entire model to be fine-tuned in a single pass, with no separate frozen encoders to align.
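The idea of projecting raw patches straight into the decoder can be illustrated with a minimal NumPy sketch. It is not Gemma's actual implementation: the patch size, the model width, the random weights, and the helper names `patchify` and `project` are all assumptions chosen for illustration. It only shows that a single linear layer is enough to turn flattened image patches into vectors that can be concatenated with text embeddings.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def project(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """One linear layer: map each flattened patch to a decoder-width vector."""
    return patches @ weight + bias

rng = np.random.default_rng(0)
PATCH, D_MODEL = 16, 64  # illustrative values, not Gemma's real sizes

image = rng.random((64, 48, 3))               # toy RGB image
patches = patchify(image, PATCH)              # 4 x 3 = 12 patches
weight = rng.normal(0.0, 0.02, (PATCH * PATCH * 3, D_MODEL))
bias = np.zeros(D_MODEL)
image_tokens = project(patches, weight, bias)

# Stand-in for embedded text tokens; both modalities share one sequence.
text_tokens = rng.normal(size=(5, D_MODEL))
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(patches.shape, image_tokens.shape, sequence.shape)
```

In a trained model the projection weights are learned together with the decoder, which is what makes single-pass fine-tuning possible.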

As with other models in the family, the decoder of Gemma 4 12B is built on a hybrid attention mechanism that alternates layers with local sliding-window attention (1024 tokens) and layers with full global attention. These layers use so-called heteromorphic heads, whose sizes vary within a single model. The local layers are fast and memory-efficient because each token sees only its neighbors within the window, while the global layers cover the entire context and capture long-range dependencies.
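The difference between the two layer types comes down to the attention mask. The sketch below builds both masks for a toy sequence. It assumes causal attention and a window that includes the current token; the exact window convention in Gemma is not stated here, and the sizes are deliberately tiny (the model's window is 1024 tokens).

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """Global layer: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Local layer: token i attends only to the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]

def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

SEQ_LEN, WINDOW = 8, 3  # toy sizes for readability

local_mask = sliding_window_mask(SEQ_LEN, WINDOW)
global_mask = causal_mask(SEQ_LEN)

for row in local_mask:
    print("".join("x" if allowed else "." for allowed in row))

print("local pairs:", attended_pairs(local_mask))
print("global pairs:", attended_pairs(global_mask))
```

The local mask grows linearly with sequence length (at most `window` keys per query), while the global mask grows quadratically, which is why only a minority of layers use it.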

The model supports text, image, video, and audio processing. A built-in thinking mode lets the model reason step by step before producing an answer, which matters for complex tasks. The model also supports function calling for agentic scenarios, variable image resolution, and multiple languages (140+ languages during pre-training, 35+ languages out of the box). Multi-Token Prediction (MTP) is supported to accelerate inference, significantly reducing generation latency without quality loss. The vocabulary comprises 262K tokens, and the context window reaches 256K tokens.

On key benchmarks, Gemma 4 12B delivers results close to the substantially larger 26B A4B MoE. On AIME 2026 (advanced mathematical reasoning) the model scores 77.5%, roughly 3.7 times the result of Gemma 3 27B (20.8%). On GPQA Diamond (PhD-level expert questions in physics, chemistry, and biology) it reaches 78.8%, a strong result for a 12B model that surpasses many larger models. In programming, it scores 72.0% on LiveCodeBench v6 (real-world code generation) and reaches a Codeforces ELO of 1659. In multimodal and multilingual tests, it scores 69.1% on MMMU Pro (universal image understanding), 79.7% on MATH-Vision (mathematics on images), and 83.4% on MMMLU (multilingual knowledge). On the CoVoST benchmark (audio translation) the model achieves the best result among all Gemma models (38.5%).

The model’s use cases are defined by three key factors: compactness, multimodality with native audio, and agentic capabilities. Gemma 4 12B is well suited for local agentic systems, from autonomous coding assistants to multimodal AI assistants with voice input. The model is effective for speech recognition and translation, video fragment analysis, intelligent document processing, and building embedded AI solutions on desktops. For more details on use cases, see the developer guide: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/


Announce Date: 03.06.2026
Parameters: 12B
Context: 256K (262,144 tokens)
Layers: 48, using full attention: 8
Attention Type: Sliding Window Attention
Developer: Google DeepMind
Transformers Version: 5.10.0.dev0
License: Apache 2.0
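The layer split listed above (48 layers, 8 of them with full attention) has a direct effect on KV-cache size at long context. The following back-of-the-envelope sketch compares the hybrid layout with a hypothetical all-global one. The layer counts, window, and context length come from this page; the number of KV heads, the head dimension, and the 16-bit cache precision are assumptions made purely for illustration, and the sketch ignores the fact that head sizes vary between layers. The results are therefore not measurements of the real model.

```python
def kv_cache_bytes(tokens_cached: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Size of the key/value cache for one sequence.

    The factor 2 accounts for one key and one value vector
    per token, per layer, per KV head.
    """
    return 2 * tokens_cached * layers * kv_heads * head_dim * bytes_per_value

# Values stated on this page.
CONTEXT = 256 * 1024            # 262,144 tokens
TOTAL_LAYERS = 48
GLOBAL_LAYERS = 8
LOCAL_LAYERS = TOTAL_LAYERS - GLOBAL_LAYERS
WINDOW = 1024

# Hypothetical values, NOT taken from the model configuration.
KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 8, 128, 2

# Global layers cache the full context; local layers only need the window.
hybrid = (
    kv_cache_bytes(CONTEXT, GLOBAL_LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
    + kv_cache_bytes(min(WINDOW, CONTEXT), LOCAL_LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
)
all_global = kv_cache_bytes(CONTEXT, TOTAL_LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)

GIB = 1024 ** 3
print(f"hybrid layout:     {hybrid / GIB:.2f} GiB per full-context sequence")
print(f"all-global layout: {all_global / GIB:.2f} GiB per full-context sequence")
print(f"reduction:         {all_global / hybrid:.1f}x")
```

Under these assumptions almost all of the cache belongs to the 8 global layers, which is why the share of full-attention layers largely determines how many full-context requests fit on a given GPU.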

Public endpoint

Use our pre-built public endpoints for free to test inference and explore gemma-4-12B-it capabilities. You can obtain an API access token on the token management page after registration and verification.
There are no public endpoints for this model yet.

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying private instances in the following scenarios:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.

Recommended server configurations for hosting gemma-4-12B-it

Prices:
Name                          Context  Parallelism  GPU  Price, hour  TPS  Max Concurrency
teslaa10-1.16.32.160          262,144  -            1    $0.53        -    1.490
teslat4-2.16.32.160           262,144  tensor       2    $0.54        -    1.900
teslaa2-2.16.32.160           262,144  tensor       2    $0.57        -    1.911
rtx3090-1.16.24.160           262,144  -            1    $0.83        -    1.633
rtx2080ti-3.12.24.120         262,144  pipeline     3    $0.84        -    1.761
rtx4090-1.16.32.160           262,144  -            1    $1.02        -    1.628
rtx2080ti-4.16.32.160         262,144  tensor       4    $1.12        -    2.703
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2    $1.23        -    4.044
rtx3080-3.16.64.160           262,144  pipeline     3    $1.43        -    1.376
rtx5090-1.16.64.160           262,144  -            1    $1.59        -    2.685
rtx3080-4.16.64.160           262,144  tensor       4    $1.82        -    2.190
teslaa100-1.16.64.160         262,144  -            1    $2.37        -    9.154
h100-1.16.64.160              262,144  -            1    $3.83        -    9.145
h100nvl-1.16.96.160           262,144  -            1    $4.11        -    11.024
teslaa100-2.24.96.160.nvlink  262,144  tensor       2    $4.61        -    19.373
h200-1.16.128.160             262,144  -            1    $4.74        -    17.332
h200-2.24.256.160.nvlink      262,144  tensor       2    $9.40        -    35.728
Prices:
Name                          Context  Parallelism  GPU  Price, hour  TPS  Max Concurrency
teslat4-2.16.32.160           262,144  tensor       2    $0.54        -    1.184
teslaa2-2.16.32.160           262,144  tensor       2    $0.57        -    1.194
rtx2080ti-3.12.24.120         262,144  pipeline     3    $0.84        -    1.045
teslaa10-2.16.64.160          262,144  tensor       2    $0.93        -    3.328
rtx2080ti-4.16.32.160         262,144  tensor       4    $1.12        -    1.987
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2    $1.23        -    3.328
rtx3090-2.16.64.160           262,144  tensor       2    $1.56        -    3.614
rtx5090-1.16.64.160           262,144  -            1    $1.59        -    1.969
rtx3080-4.16.64.160           262,144  tensor       4    $1.82        -    1.473
rtx4090-2.16.64.160           262,144  tensor       2    $1.92        -    3.604
teslaa100-1.16.64.160         262,144  -            1    $2.37        -    8.438
h100-1.16.64.160              262,144  -            1    $3.83        -    8.429
h100nvl-1.16.96.160           262,144  -            1    $4.11        -    10.307
teslaa100-2.24.96.160.nvlink  262,144  tensor       2    $4.61        -    18.657
h200-1.16.128.160             262,144  -            1    $4.74        -    16.615
h200-2.24.256.160.nvlink      262,144  tensor       2    $9.40        -    35.011
Prices:
Name                          Context  Parallelism  GPU  Price, hour  TPS  Max Concurrency
teslat4-3.32.64.160           262,144  pipeline     3    $0.88        -    1.176
teslaa10-2.16.64.160          262,144  tensor       2    $0.93        -    1.838
teslat4-4.16.64.160           262,144  tensor       4    $0.96        -    2.658
teslaa2-3.32.128.160          262,144  pipeline     3    $1.06        -    1.192
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2    $1.23        -    1.838
teslaa2-4.32.128.160          262,144  tensor       4    $1.26        -    2.679
rtx3090-2.16.64.160           262,144  tensor       2    $1.56        -    2.125
rtx4090-2.16.64.160           262,144  tensor       2    $1.92        -    2.114
teslaa100-1.16.64.160         262,144  -            1    $2.37        -    6.948
rtx5090-2.16.64.160           262,144  tensor       2    $2.93        -    4.229
h100-1.16.64.160              262,144  -            1    $3.83        -    6.939
h100nvl-1.16.96.160           262,144  -            1    $4.11        -    8.817
teslaa100-2.24.96.160.nvlink  262,144  tensor       2    $4.61        -    17.167
h200-1.16.128.160             262,144  -            1    $4.74        -    15.126
h200-2.24.256.160.nvlink      262,144  tensor       2    $9.40        -    33.522
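
One way to compare these configurations is to divide the hourly price by the listed concurrency figure, giving a rough cost per concurrent full-context request. The sketch below does this for four rows copied from the last price table. It assumes the trailing number in each row is the "Max Concurrency" value, meaning how many 262,144-token requests fit at once; the `Config` type and the `cost_per_slot` helper are illustrative names, not part of any provider API.

```python
from typing import NamedTuple

class Config(NamedTuple):
    name: str
    gpus: int
    price_per_hour: float   # USD
    max_concurrency: float  # assumed: full-context requests that fit at once

# Rows copied from the last price table on this page.
configs = [
    Config("teslaa10-2.16.64.160", 2, 0.93, 1.838),
    Config("teslat4-4.16.64.160", 4, 0.96, 2.658),
    Config("teslaa100-1.16.64.160", 1, 2.37, 6.948),
    Config("h200-2.24.256.160.nvlink", 2, 9.40, 33.522),
]

def cost_per_slot(config: Config) -> float:
    """Hourly price divided by the number of concurrent full-context requests."""
    return config.price_per_hour / config.max_concurrency

ranked = sorted(configs, key=cost_per_slot)
for config in ranked:
    print(f"{config.name:<28} ${cost_per_slot(config):.3f} per slot-hour")
```

This metric favors large configurations that are kept fully loaded; for a single user or short prompts, the cheapest server whose concurrency is at least 1 may be the more sensible choice.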

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.