gemma-4-12B-it

reasoning
multimodal

Gemma 4 12B is a dense multimodal model with a unified, encoder-free architecture. It sits between the compact E4B for mobile devices and the more powerful 26B A4B MoE, filling a mid-range niche optimized for consumer laptops with 16 GB of video memory.

What sets Gemma 4 12B apart from the rest of the family is its completely encoder-free architecture: instead of separate vision and audio encoders, the model uses linear projections to feed raw image patches and audio waveforms directly into a single decoder. It is the first mid-size model in the Gemma family with native audio input, which makes it well suited to local multimodal AI. All modalities flow through a single decoder-only transformer, which reduces latency and allows the entire model to be fine-tuned in a single pass, because there are no separate frozen encoders to align.
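The idea of feeding raw image patches through a linear projection can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: the patch size, the embedding width, the weight values, and the helper names `patchify` and `project` are hypothetical and are not taken from the Gemma implementation.

```python
def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch vectors (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


def project(vectors, weight):
    """Apply one linear map without bias: weight[j] holds the weights of output dimension j."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weight]
            for vec in vectors]


# Hypothetical toy sizes: a 4x4 single-channel image, 2x2 patches, 3-dim embeddings.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
weight = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]

patches = patchify(image, 2)        # 4 patches, each with 4 raw pixel values
tokens = project(patches, weight)   # 4 "tokens" ready to be fed to a decoder
print(len(tokens), len(tokens[0]))
```

In this sketch the projection is the only modality-specific component; the resulting vectors would join the text embeddings in the same decoder input sequence.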

As with other models in the family, the decoder of Gemma 4 12B is built on a hybrid attention mechanism that alternates layers with local sliding-window attention (1024 tokens) and layers with full global attention. These layers use so-called heteromorphic heads, meaning that head sizes vary within a single model. The local layers provide speed and low memory usage, since each token only sees its neighbors within the window, while the global layers cover the entire context and capture long-range dependencies.
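The memory effect of mixing local and global layers can be estimated with a short sketch. The window size (1024), the layer counts (48 layers, 8 with full attention), and the context length (262,144) come from this page; the even interleaving of global layers, the convention that the window includes the current token, and the simplification that every layer caches the same amount per position are assumptions, so the ratio is only a rough indication.

```python
WINDOW = 1024        # local sliding-window size quoted in the text
NUM_LAYERS = 48      # total layers, from the spec list
NUM_GLOBAL = 8       # layers with full attention, from the spec list
CONTEXT = 262_144    # 256K-token context window


def visible_keys(query_pos, window=None):
    """Key positions a causal query may attend to; window=None means global attention."""
    start = 0 if window is None else max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def layer_schedule(num_layers=NUM_LAYERS, num_global=NUM_GLOBAL):
    """Assumed schedule: global layers spread evenly, one at the end of each period."""
    period = num_layers // num_global
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]


def cached_positions(context, schedule, window=WINDOW):
    """Token positions kept in the KV cache, summed over all layers."""
    return sum(context if kind == "global" else min(context, window)
               for kind in schedule)


schedule = layer_schedule()
hybrid = cached_positions(CONTEXT, schedule)
all_global = CONTEXT * NUM_LAYERS

print(len(visible_keys(5000, WINDOW)), len(visible_keys(5000)))  # 1024 5001
print(f"{hybrid / all_global:.3f}")                              # 0.170
```

Under these assumptions, a full-length sequence keeps roughly 17% of the cached positions that an all-global stack of the same depth would need.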

The model processes text, images, video, and audio. A built-in thinking mode lets it reason step by step before producing an answer, which matters for complex tasks. It also supports function calling for agentic scenarios, variable image resolution, and multilingual use (140+ languages seen during pre-training, 35+ supported out of the box). Multi-Token Prediction (MTP) is supported to accelerate inference, significantly reducing generation latency without quality loss. The vocabulary comprises 262K tokens, and the context window reaches 256K tokens (262,144).
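One common way multi-token prediction speeds up generation without changing the output is a draft-and-verify loop: extra prediction heads propose several tokens at once, and the main model keeps only the prefix it would have produced itself. The sketch below shows that acceptance rule under greedy decoding. Whether Gemma 4 uses MTP in exactly this way is an assumption, and `accept_draft` and `toy_model` are hypothetical names; a real implementation would verify all draft tokens in one batched forward pass rather than one call per token.

```python
def accept_draft(prefix, draft_tokens, next_token):
    """Keep draft tokens while they match the main model's greedy choice.

    next_token(context) stands in for the main model and returns the token it
    would emit after `context`. On the first mismatch, the main model's own
    token is emitted instead and the remaining draft tokens are discarded.
    """
    context = list(prefix)
    accepted = []
    for token in draft_tokens:
        expected = next_token(context)
        if token != expected:
            accepted.append(expected)
            break
        accepted.append(token)
        context.append(token)
    return accepted


def toy_model(context):
    """Hypothetical stand-in model: always emits the current context length."""
    return len(context)


# The draft gets two tokens right, then diverges at the third position.
result = accept_draft([0], [1, 2, 9, 4], toy_model)
print(result)  # [1, 2, 3]
```

Because every emitted token equals the main model's own greedy choice, the output is identical to ordinary decoding; the gain comes from emitting several tokens per verification step when the draft is accurate.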

On key benchmarks, Gemma 4 12B delivers results close to those of the substantially larger 26B A4B MoE. On AIME 2026 (advanced mathematical reasoning) the model scores 77.5%, roughly 3.7 times the result of Gemma 3 27B (20.8%). On GPQA Diamond (PhD-level expert questions in physics, chemistry, and biology) it reaches 78.8%, a strong result for a 12B model that surpasses many larger models. In programming it scores 72.0% on LiveCodeBench v6 (real-world code generation) and reaches a Codeforces ELO of 1659. On multimodal and multilingual tests it scores 69.1% on MMMU Pro (universal image understanding), 79.7% on MATH-Vision (mathematics on images), and 83.4% on MMMLU (multilingual knowledge). On the CoVoST benchmark (audio translation) the model achieves the best result among all Gemma models (38.5%).

The model’s use cases are defined by three key factors: compactness, multimodality with native audio, and agentic capabilities. Gemma 4 12B is ideally suited for local agentic systems — from autonomous coding assistants to multimodal AI assistants with voice input. The model is effective for speech recognition and translation, video fragment analysis, intelligent document processing, and for building embedded AI solutions on desktops. For more details on use cases, check out the developer guide: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/


Announce Date: 03.06.2026
Parameters: 12B
Context: 256K (262,144 tokens)
Layers: 48 (8 with full attention)
Attention Type: Sliding Window Attention
Developer: Google DeepMind
Transformers Version: 5.10.0.dev0
License: Apache 2.0

Public endpoint

Use our pre-built public endpoints for free to test inference and explore gemma-4-12B-it capabilities. You can obtain an API access token on the token management page after registration and verification.
There are no public endpoints for this model yet.

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.

Recommended server configurations for hosting gemma-4-12B-it

Prices:
Name                          Context  Parallelism  GPUs  Price, hour TPS  Max Concurrency
teslaa10-1.16.32.160          262,144  -            1     $0.53       -    1.268
teslat4-2.16.32.160           262,144  tensor       2     $0.54       -    1.795
teslaa2-2.16.32.160           262,144  tensor       2     $0.57       -    1.806
rtx3090-1.16.24.160           262,144  -            1     $0.83       -    1.412
rtx2080ti-3.12.24.120         262,144  pipeline     3     $0.84       -    1.774
rtx4090-1.16.32.160           262,144  -            1     $1.02       -    1.406
rtx2080ti-4.16.32.160         262,144  tensor       4     $1.12       -    2.833
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2     $1.23       -    3.940
rtx3080-3.16.64.160           262,144  pipeline     3     $1.43       -    1.389
rtx5090-1.16.64.160           262,144  -            1     $1.59       -    2.464
rtx3080-4.16.64.160           262,144  tensor       4     $1.82       -    2.320
teslaa100-1.16.64.160         262,144  -            1     $2.37       -    8.933
h100-1.16.64.160              262,144  -            1     $3.83       -    8.923
h100nvl-1.16.96.160           262,144  -            1     $4.11       -    10.802
teslaa100-2.24.96.160.nvlink  262,144  tensor       2     $4.61       -    19.268
h200-1.16.128.160             262,144  -            1     $4.74       -    17.110
h200-2.24.256.160.nvlink      262,144  tensor       2     $9.40       -    35.623
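One way to compare these configurations is to divide the hourly price by the final numeric value in each row. The sketch below reads that value as Max Concurrency, that is, the number of full-context (262,144-token) requests that fit at once; this reading is an assumption, since the page does not label the value explicitly. The rows are a hand-copied subset of the first price table, and `price_per_slot` is a hypothetical helper name.

```python
# (name, price per hour in USD, final numeric value from the row)
ROWS = [
    ("teslaa10-1.16.32.160", 0.53, 1.268),
    ("teslat4-2.16.32.160", 0.54, 1.795),
    ("rtxa5000-2.16.64.160.nvlink", 1.23, 3.940),
    ("teslaa100-1.16.64.160", 2.37, 8.933),
    ("teslaa100-2.24.96.160.nvlink", 4.61, 19.268),
    ("h200-2.24.256.160.nvlink", 9.40, 35.623),
]


def price_per_slot(price, concurrency):
    """Hourly cost of one concurrent full-context request slot."""
    return price / concurrency


# Cheapest cost per slot first.
ranked = sorted(ROWS, key=lambda row: price_per_slot(row[1], row[2]))

for name, price, concurrency in ranked:
    print(f"{name:30s} ${price_per_slot(price, concurrency):.3f} per slot-hour")
```

On this subset the dual A100 configuration comes out cheapest per slot, while the single A10 is cheapest in absolute terms but most expensive per slot. The metric ignores throughput, so it is only a starting point for sizing.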
Prices:
Name                          Context  Parallelism  GPUs  Price, hour TPS  Max Concurrency
teslat4-2.16.32.160           262,144  tensor       2     $0.54       -    1.143
teslaa2-2.16.32.160           262,144  tensor       2     $0.57       -    1.154
rtx2080ti-3.12.24.120         262,144  pipeline     3     $0.84       -    1.122
teslaa10-2.16.64.160          262,144  tensor       2     $0.93       -    3.288
rtx2080ti-4.16.32.160         262,144  tensor       4     $1.12       -    2.181
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2     $1.23       -    3.288
rtx3090-2.16.64.160           262,144  tensor       2     $1.56       -    3.574
rtx5090-1.16.64.160           262,144  -            1     $1.59       -    1.811
rtx3080-4.16.64.160           262,144  tensor       4     $1.82       -    1.668
rtx4090-2.16.64.160           262,144  tensor       2     $1.92       -    3.563
teslaa100-1.16.64.160         262,144  -            1     $2.37       -    8.280
h100-1.16.64.160              262,144  -            1     $3.83       -    8.271
h100nvl-1.16.96.160           262,144  -            1     $4.11       -    10.149
teslaa100-2.24.96.160.nvlink  262,144  tensor       2     $4.61       -    18.616
h200-1.16.128.160             262,144  -            1     $4.74       -    16.457
h200-2.24.256.160.nvlink      262,144  tensor       2     $9.40       -    34.971
Prices:
Name                          Context  Parallelism  GPUs  Price, hour TPS  Max Concurrency
teslat4-3.32.64.160           262,144  pipeline     3     $0.88       -    1.528
teslaa10-2.16.64.160          262,144  tensor       2     $0.93       -    2.073
teslat4-4.16.64.160           262,144  tensor       4     $0.96       -    3.127
teslaa2-3.32.128.160          262,144  pipeline     3     $1.06       -    1.544
rtxa5000-2.16.64.160.nvlink   262,144  tensor       2     $1.23       -    2.073
teslaa2-4.32.128.160          262,144  tensor       4     $1.26       -    3.149
rtx3090-2.16.64.160           262,144  tensor       2     $1.56       -    2.360
rtx4090-2.16.64.160           262,144  tensor       2     $1.92       -    2.349
teslaa100-1.16.64.160         262,144  -            1     $2.37       -    7.066
rtx5090-2.16.64.160           262,144  tensor       2     $2.93       -    4.464
h100-1.16.64.160              262,144  -            1     $3.83       -    7.056
h100nvl-1.16.96.160           262,144  -            1     $4.11       -    8.935
teslaa100-2.24.96.160.nvlink  262,144  tensor       2     $4.61       -    17.402
h200-1.16.128.160             262,144  -            1     $4.74       -    15.243
h200-2.24.256.160.nvlink      262,144  tensor       2     $9.40       -    33.756

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.