Qwen3-235B-A22B-Instruct-2507

Qwen3-235B-A22B-Instruct-2507 is an updated version of the flagship MoE model in the Qwen3 series. Of its 235 billion total parameters, only about 22 billion are active at each inference step: the architecture comprises 94 transformer layers, with 128 experts per MoE layer of which only 8 are activated per token.
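
To make the sparse activation concrete, here is a minimal top-k routing sketch in the spirit of an MoE layer that selects 8 of 128 experts per token; the tensor shapes and normalization details are illustrative assumptions, not the exact Qwen3 implementation.

import torch

# Illustrative MoE routing: each token picks 8 of 128 experts (shapes are assumptions).
num_experts, top_k, hidden = 128, 8, 4096
tokens = torch.randn(4, hidden)                      # a small batch of token activations
router = torch.nn.Linear(hidden, num_experts, bias=False)

logits = router(tokens)                              # (4, 128) routing scores
weights, expert_ids = torch.topk(logits, top_k, dim=-1)
weights = torch.softmax(weights, dim=-1)             # normalize over the 8 selected experts

# Each token is processed only by its 8 selected experts, and their outputs are
# combined using these weights; the remaining 120 experts stay idle, which is why
# only ~22B of the 235B parameters are active per token.
print(expert_ids[0].tolist(), weights[0].tolist())

Because only the selected experts run, per-token compute scales with the 22B activated parameters rather than the full 235B.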

Unlike previous Qwen releases, the 2507 model abandons the hybrid thinking mode entirely in favor of a highly optimized non-thinking mode. The decision followed user feedback favoring faster responses without generated <think> blocks, and response latency has dropped accordingly.

Output quality has also improved substantially. On mathematical benchmarks the model posts remarkable gains over its predecessor: 70.3 vs. 24.7 on AIME25 and 55.4 vs. 10.0 on HMMT25. Its ZebraLogic score of 95.0 demonstrates near-perfect accuracy on logical reasoning tasks. In programming, it again clearly outperforms the previous version, reaching state-of-the-art results on LiveCodeBench and MultiPL-E. Across a wide range of benchmarks, the model surpasses leading competitors such as GPT-4o, DeepSeek-V3, and Kimi K2.

Additionally, the developers have released Qwen3-235B-A22B-Instruct-2507-FP8, an FP8-quantized version of the model. FP8 quantization roughly halves memory requirements relative to BF16 while preserving nearly all of the original model's quality. Compared with traditional INT8 approaches, FP8 retains a wider dynamic range, which generally gives a better balance between accuracy and efficiency for models of this scale.
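
As a rough back-of-the-envelope check (weights only; KV cache and runtime overhead are excluded, so real deployments need additional headroom):

# Rough weight-memory estimate for 235B parameters (weights only).
params = 235e9

bf16_gb = params * 2 / 1024**3   # 2 bytes per parameter
fp8_gb  = params * 1 / 1024**3   # 1 byte per parameter

print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")   # ~438 GB vs ~219 GB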

Another key technological advancement of Qwen3-235B-A22B-Instruct-2507 is native support for a context length of 262,144 tokens. This enables entirely new use cases, from analyzing lengthy documents and codebases to conducting long multi-turn conversations, while maintaining contextual understanding and high response accuracy even with a fully filled context window. Taken together, these capabilities position Qwen3-235B-A22B-Instruct-2507 as a leading open-source option for a broad range of enterprise applications.
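
Before sending very long inputs, it is worth checking how much of the 262,144-token window a document will occupy. A minimal sketch using the model's tokenizer (the Hugging Face repo id and file name are assumptions):

from transformers import AutoTokenizer

# Count tokens in a long document before sending it to the endpoint
# (repo id and file name are placeholders; leave headroom for the reply).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")

with open("long_report.txt", encoding="utf-8") as f:
    text = f.read()

n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} of 262,144 context tokens used")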


Announce Date: 21.07.2025
Parameters: 235B
Experts: 128
Activated: 22B
Context: 262K
Attention Type: Full or Sliding Window Attention
VRAM requirements: 154.3 GB with 4-bit quantization
Developer: Alibaba
Transformers Version: 4.51.0
License: Apache 2.0

Public endpoint

Use our pre-built public endpoints to test inference and explore Qwen3-235B-A22B-Instruct-2507 capabilities.
Model Name: chriswritescode/Qwen3-235B-A22B-Instruct-2507-INT4-W4A16
Context: 125,600 tokens
Type: Public
GPU: 2×Tesla H100
TPS: 60.24
Status: AVAILABLE

API access to Qwen3-235B-A22B-Instruct-2507 endpoints

curl https://chat.immers.cloud/v1/endpoints/Qwen3-235B-A22B-Instruct-2507-optimized/generate/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer USER_API_KEY" \
-d '{"model": "Qwen-3-235B-A22B-Instruct", "messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Say this is a test"}
], "temperature": 0, "max_tokens": 150}'
$response = Invoke-WebRequest https://chat.immers.cloud/v1/endpoints/Qwen3-235B-A22B-Instruct-2507-optimized/generate/chat/completions `
  -Method POST `
  -Headers @{
    "Authorization" = "Bearer USER_API_KEY"
    "Content-Type"  = "application/json"
  } `
  -Body (@{
    model    = "Qwen-3-235B-A22B-Instruct"
    messages = @(
      @{ role = "system"; content = "You are a helpful assistant." },
      @{ role = "user"; content = "Say this is a test" }
    )
  } | ConvertTo-Json)
($response.Content | ConvertFrom-Json).choices[0].message.content
#!pip install openai --upgrade

from openai import OpenAI

# Point the OpenAI-compatible client at the public endpoint
client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/Qwen3-235B-A22B-Instruct-2507-optimized/generate/",
)

chat_response = client.chat.completions.create(
    model="Qwen-3-235B-A22B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(chat_response.choices[0].message.content)

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying private instances in the following scenarios:

  • to maximize endpoint performance,
  • to enable the full context length for long sequences (see the deployment sketch after this list),
  • to ensure top-tier security by processing data in an isolated, dedicated environment,
  • to use custom weights, such as fine-tuned models or LoRA adapters.
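
Below is a minimal offline deployment sketch with vLLM on a dedicated multi-GPU instance, enabling the full context window; the repo id, GPU count, and prompt are assumptions to adapt to your own configuration (fine-tuned weights can be used by pointing model at a local path).

from vllm import LLM, SamplingParams

# Offline vLLM deployment sketch (repo id, tensor_parallel_size, and prompt are assumptions):
# shard the model across 8 GPUs and enable the full 262,144-token context window.
llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    tensor_parallel_size=8,
    max_model_len=262144,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the attached report in three bullet points."], params)
print(outputs[0].outputs[0].text)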

Recommended configurations for hosting Qwen3-235B-A22B-Instruct-2507

Prices:
Name                     vCPU   RAM, MB   Disk, GB   GPU   Price, hour
teslaa100-3.32.384.240   32     393216    240        3     $8.00
rtx4090-8.44.256.240     44     262144    240        8     $8.59
rtx5090-6.44.256.240     44     262144    240        6     $8.86
teslah100-3.32.384.240   32     393216    240        3     $15.58

Prices:
Name                     vCPU   RAM, MB   Disk, GB   GPU   Price, hour
teslaa100-4.44.512.320   44     524288    320        4     $10.68
teslah100-4.44.512.320   44     524288    320        4     $20.77

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.