A research project developed by ByteDance. It is a unified multimodal model for studying image and video understanding, generation, and editing within a relatively small model and a limited compute budget.
A cutting-edge MoE model with 1.6 trillion total parameters (49 billion active) that efficiently processes up to 1 million tokens of context thanks to a hybrid attention architecture, CSA + HCA. The model leads in mathematics, programming, and agentic tasks, supports three configurable reasoning modes (“non-think”, “think high”, “think max”), and uses nearly 10 times less KV-cache memory than previous DeepSeek flagships.
An open MoE model from the DeepSeek V4 family, with 284 billion total parameters and 13 billion active per token, supporting a context of up to 1 million tokens. Thanks to its hybrid attention (CSA + HCA), it is highly efficient on ultra-long sequences. The model delivers results close to the Pro version in reasoning, programming, and agentic tasks, while being far less demanding on infrastructure.
Qwen/Qwen3.6-27B is an open dense multimodal 27B-parameter model with a strong focus on agentic programming, large-repository work, and reasoning tasks. It supports text, images, and video, features a native context of 262K tokens, thinking/non-thinking modes, and outperforms not only Qwen3.5-27B but also the larger MoE model Qwen3.5-397B-A17B on a range of key benchmarks.
Qwen/Qwen3.6-35B-A3B is an open multimodal Mixture-of-Experts model with 35B parameters, of which only about 3B are activated per token, reducing computational overhead. Its architecture, built on Gated DeltaNet and Gated Attention, delivers high efficiency and memory savings. The model handles text, images, and video, supports thinking and non-thinking modes, offers a 262K-token context window (expandable to 1M), and is especially well-suited for agentic programming, repository-level work, and visual-textual tasks.
An open multimodal model from Moonshot AI built with an agent‑centric philosophy. It uses a Mixture‑of‑Experts architecture with 1 trillion total parameters (32 billion active per token), a 256K‑token context window, and native INT4 quantization. The model is optimized for long‑horizon software problem solving, autonomous operation, and “agent swarm” orchestration, competing with the best closed models in these areas. K2.6 can carry out complex engineering tasks for hours, turn visual mock‑ups into production‑ready web applications, and decompose and coordinate up to 300 parallel sub‑agents within a single session, making it one of the finest open solutions for research tasks and an effective intelligent core for a wide range of high‑tech products.
An open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is based on the Diffusion Transformer (DiT) architecture and incorporates additional components to enhance text processing and the handling of structured tasks.
MiniMax-M2.7 is the first model to have participated in its own evolution: during the development process, it built its own skills and optimized its own training. The architecture, based on a 230B MoE (10B active parameters) with full attention, ensures consistently high quality in complex agentic and office tasks. On benchmarks, the model demonstrates results on par with the best closed-source solutions. It is ideally suited for developing autonomous agents, working with office documents, and comprehensive automation of complex professional tasks, acting as an "omniscient and empathetic AI colleague."
An open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is based on the Diffusion Transformer (DiT) architecture and incorporates additional components to enhance text processing and the handling of structured tasks.
GLM‑5.1 is a flagship MoE model (744B total / 40B active parameters) featuring DSA sparse attention, built for sustained autonomous operation. At the time of its release, it held the top position on SWE‑Bench Pro and CyberGym, outperforming all existing models (including closed-source ones), and it consistently ranks among the leaders on other significant benchmarks. Crucially, it keeps making progress across hundreds of iterations and thousands of tool calls: where many models lose effectiveness and settle for a quick answer, GLM‑5.1 continues to search for the optimal solution.
The flagship instruct model of the GigaChat family, built on a Mixture‑of‑Experts (MoE) architecture with 702 billion total and 36 billion active parameters. Combining Multi‑head Latent Attention (MLA), Multi‑Token Prediction (MTP) and native FP8 training delivers record‑breaking performance on long contexts while drastically reducing memory consumption. The model outperforms open‑source peers such as DeepSeek‑V3‑0324 and Qwen3‑235B‑A22B on a number of benchmarks and is released under the MIT license, making it suitable for commercial use.
GigaChat 3.1 Lightning is a compact Mixture-of-Experts model with 1.8 billion active parameters out of 10 billion total. It is built on Multi-head Latent Attention (MLA) and supports Multi-Token Prediction (MTP), which, combined with native FP8 training, delivers excellent speed and quality. The model holds leading positions in its class and is well suited to fast conversational AI assistants and to simple yet reliable agent systems with tool calling.
A highly efficient mixture‑of‑experts model that, activating only 3.8B parameters, delivers 97% of the quality of the flagship 31B model. The optimal choice for complex agentic and analytical tasks with moderate computational requirements.
The flagship dense model of the Gemma‑4 family. With 31B parameters, it only slightly trails the largest proprietary and open‑source alternatives. It offers native multimodality, multilingual support, a 256K-token context window, and a hybrid sliding-window attention mechanism that reduces memory requirements, making it an ideal choice for tasks that demand high‑quality reasoning and in‑depth analysis.
The NVIDIA Nemotron 3 Super 120B (12B active) is a hybrid model that combines a sparse Latent Mixture-of-Experts (MoE) with a Mamba-2 architecture, optimized for building complex agentic systems and handling contexts of up to 1 million tokens. Because it activates only 12 billion parameters per token and uses Multi-Token Prediction (MTP), the model delivers high inference efficiency, pairing response quality with computational savings on long sequences.
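Several of the MoE entries above quote both total and active parameter counts. The short Python sketch below turns those quoted figures into an active-parameter fraction. It is only an illustration: the `MODELS` table and the helper names are hypothetical, the numbers are copied from the descriptions, and the ratio is a rough proxy for per-token compute, not for memory, since all experts still have to be loaded.

```python
# Total and active parameter counts in billions, copied from the catalog
# descriptions. The table and helper names are illustrative assumptions.
MODELS = {
    "Qwen3.6-35B-A3B": (35.0, 3.0),        # "about 3B" active, so approximate
    "MiniMax-M2.7": (230.0, 10.0),
    "GLM-5.1": (744.0, 40.0),
    "GigaChat 3.1 Lightning": (10.0, 1.8),
    "Nemotron 3 Super": (120.0, 12.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token (0..1)."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


def report(models: dict[str, tuple[float, float]]) -> list[str]:
    """Format one line per model with its active-parameter share."""
    return [
        f"{name}: {active_fraction(total, active):.1%} of parameters active per token"
        for name, (total, active) in models.items()
    ]


if __name__ == "__main__":
    for line in report(MODELS):
        print(line)
```

A lower fraction means fewer parameters are exercised per token for a given total size. It says nothing about quality or about the hardware needed to hold the full set of weights.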
This is an updated version of the LTX-2 model, developed by Lightricks for synchronized video and audio generation within a single model. It is based on the DiT architecture and integrates key components of modern video generation systems. The model delivers improved audio and visual quality, as well as increased text prompt accuracy.
The most compact model in the Gemma-4 lineup, with an effective size of 2.3B parameters and full support for text, images, and audio. An ideal solution for agentic workflows on local and edge devices.
A model with the innovative Per‑Layer Embeddings technique that, with an effective size of just 4.5B parameters, performs better than models two to three times larger. At the same time, the model retains reasoning capabilities and supports full multimodality (text, images, audio), making it an ideal choice for complex tasks on local devices.
An upgraded open-source general-purpose image editing foundation model, built upon the capabilities outlined in the FireRed-Image-Edit-1.0 Technical Report. This version significantly enhances identity consistency, multi-image conditioning, and domain-specialized editing performance, aligning more closely with real-world creative production needs.
An ultra-compact multimodal model with 0.8 billion parameters, featuring a hybrid architecture of Gated DeltaNet and Gated Attention. It offers a context length of 262,144 tokens, record-breaking for its size, supports 201 languages, and provides two operating modes, standard and reasoning (thinking), making it an ideal solution for prototyping, research, and fine-tuning for specific tasks.