A DeepSeek-AI model with advanced reasoning capabilities and agent functions, combining high computational efficiency with GPT-5-level performance. Thanks to DeepSeek Sparse Attention (DSA) and its unique "in-call tool reasoning" mechanism, the model is well suited to building autonomous agents, balancing speed, resource costs, and the complexity of the tasks it handles.
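Below is a minimal sketch of how such a model is typically driven as an agent through an OpenAI-compatible chat API, with a single tool defined. The endpoint URL, model identifier, the get_weather tool, and the reasoning_content field are illustrative assumptions, not confirmed parts of the DeepSeek API.

```python
# Hedged sketch: one request to an OpenAI-compatible endpoint with a tool defined.
# The model may return a reasoning trace together with a structured tool call.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # placeholder endpoint

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                     # hypothetical tool for illustration
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v3.2",                         # placeholder model id
    messages=[{"role": "user", "content": "Do I need an umbrella in Berlin today?"}],
    tools=tools,
)

message = response.choices[0].message
# Reasoning trace, if the server exposes one under this (assumed) field name.
print(getattr(message, "reasoning_content", None))
if message.tool_calls:                             # structured call for an agent loop to execute
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```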
A specialized version of DeepSeek-V3.2 for deep reasoning, reaching the level of GPT-5 and Gemini-3.0-Pro on complex problems in Olympiad mathematics and competitive programming. The model does not support tool calling but has unlimited "thinking" depth, which lets it achieve phenomenal results in these narrow domains. DeepSeek-V3.2-Speciale is the first open model to win gold medals at the largest international mathematics and informatics Olympiads.
The flagship and largest Russian-language instruct model at the time of its release, built on a Mixture-of-Experts (MoE) architecture with 702B total and 36B active parameters. The model integrates Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP) for high inference throughput and is optimized for fp8 operation. GigaChat 3 Ultra Preview works with a 128K-token context, performs strongly in text generation, programming, and mathematics, and provides the deepest understanding of the Russian language and culture.
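For a rough sense of what these figures mean in practice, the back-of-the-envelope arithmetic below relates the quoted parameter counts to the per-token active fraction and the fp8 weight footprint; it deliberately ignores KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope arithmetic for the MoE figures quoted above
# (702B total / 36B active, fp8 weights). Illustration only.
total_params = 702e9
active_params = 36e9

active_fraction = active_params / total_params
fp8_weight_bytes = total_params * 1            # fp8 stores one byte per parameter

print(f"active fraction per token: {active_fraction:.1%}")              # ~5.1%
print(f"approx. weight footprint:  {fp8_weight_bytes / 1e9:.0f} GB")    # ~702 GB
```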
A compact, dialogue-oriented MoE model from the GigaChat family (commonly referred to as GigaChat 3 Lightning), with 10 billion total and 1.8 billion active parameters, optimized for high-speed inference and deployment in local or high-load production environments. In Russian-language understanding it surpasses popular 3-4B models while running significantly faster.
HunyuanVideo-1.5 is a lightweight text-to-video and image-to-video generation model developed by Tencent, featuring 8.3 billion parameters while maintaining state-of-the-art visual quality and motion coherence. It is designed to run efficiently on consumer-grade GPUs, making advanced video creation accessible to developers and creators.
A 32-billion-parameter rectified flow transformer for generating, editing, and combining images based on text instructions. It supports open-ended tasks such as text-to-image generation, single-reference editing, and multi-reference editing without additional fine-tuning. Trained with guidance distillation for efficiency, the model is optimized for research and creative applications under a non-commercial license.
A compact multimodal model from Baidu, built on an innovative heterogeneous Mixture-of-Experts (MoE) architecture that separates parameters for textual and visual experts. During inference, only 3 billion of its 28 billion total parameters are activated. The model is an upgraded version of the base ERNIE-4.5-VL-28B-A3B, specifically optimized for multimodal reasoning through a "Thinking Mode." It supports images, videos, visual grounding, and tool invocation, offers a native maximum context length of 131K tokens, and stands out for its moderate computational requirements.
The largest open-source reasoning model from Moonshot AI at the time of its release, featuring a Mixture-of-Experts architecture (1 trillion parameters total, 32 billion active), capable of executing 200–300 consecutive tool calls without quality degradation while seamlessly interleaving function calls with reasoning chains. The model supports a 256K-token context window, incorporates native INT4 quantization for significantly accelerated inference with virtually no loss in accuracy, and employs Multi-Head Latent Attention (MLA) for highly efficient processing of long sequences. Kimi K2 Thinking sets new records among open-source models and outperforms leading commercial systems—including GPT-5 and Claude Sonnet 4.5—on a broad range of benchmarks.
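The sketch below shows the generic agent loop that such long tool-calling chains imply: the model is called repeatedly, every requested tool is executed, and the results are appended to the conversation until the model returns a final answer. The endpoint, model identifier, and the web_search tool are placeholders for illustration, not a confirmed Kimi K2 API surface.

```python
# Minimal agent-loop sketch for a model that interleaves reasoning with
# consecutive tool calls. All identifiers below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # placeholder endpoint

TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",                      # hypothetical tool
        "description": "Search the web and return a short snippet",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    """Stub executor standing in for a real search backend."""
    return f"(search results for: {query})"

messages = [{"role": "user", "content": "Research topic X and summarize the findings."}]

while True:
    reply = client.chat.completions.create(
        model="kimi-k2-thinking",                  # placeholder model id
        messages=messages,
        tools=TOOLS,
    ).choices[0].message

    if not reply.tool_calls:                       # no more tool calls: final answer
        print(reply.content)
        break

    # Append the assistant turn, execute each requested tool, and feed the
    # results back so the model can keep reasoning over them.
    messages.append(reply)
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),
        })
```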
LongCat-Video is a 13.6B-parameter foundational video generation model built to excel at Text-to-Video, Image-to-Video, and Video-Continuation tasks. It supports efficient, high-quality generation of minutes-long videos without color drift or quality degradation, marking an initial step toward world models.
A reasoning version of the flagship 32-billion-parameter dense model from the Qwen3-VL family, optimized for multi-step thinking and for solving highly complex multimodal tasks that require deep analysis and logical inference over visual information. It supports a native 256K context (extendable to 1M) and achieves state-of-the-art results among multimodal reasoning models of similar size.
With only 2 billion parameters, a 256K context window, and support for edge inference, this is one of the smallest visual reasoning models, specialized in multi-step reasoning over images and videos; it is, almost literally, capable of "thinking while looking at images." Unlike the Instruct version, the model generates detailed chains of thought before producing the final answer, which improves accuracy at the cost of processing speed.
The most compact model in the Qwen3-VL multimodal family. With 2 billion parameters and a dense architecture, it is optimized for fast conversational systems and deployment on edge devices, while retaining all the advanced capabilities of the series: high-quality comprehension of images, videos, and text, OCR in 32 languages, object positioning, timestamp localization, and a native context length of 256K tokens.
A powerful multimodal model with 32 billion parameters and native support for a 256K context window, delivering state-of-the-art quality in multimodal understanding. It outperforms the previous-generation 72B model on most benchmarks, as well as comparable solutions from other developers, such as GPT-5 and Claude 4.
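As an illustration of how such a VLM is commonly queried when served behind an OpenAI-compatible gateway (for example vLLM), the sketch below sends an image and a text question in a single chat request; the endpoint and model identifier are placeholders.

```python
# Hedged sketch of a multimodal request in the OpenAI-compatible chat format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder local server

response = client.chat.completions.create(
    model="Qwen3-VL-32B-Instruct",                 # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }],
)

print(response.choices[0].message.content)
```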
An innovative VLM for text recognition and document parsing, developed by DeepSeek-AI as part of research into representing information through the visual modality. The model takes a unique approach: instead of traditional text tokens, it encodes document content with visual tokens, achieving 10-20x text compression while reaching an OCR accuracy of 97%.
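The arithmetic behind the compression claim is straightforward: the page is rendered as an image and encoded into far fewer visual tokens than its raw text would occupy. The numbers in the sketch below are invented purely to show the calculation, not measured values.

```python
# Illustration of the visual-token compression arithmetic described above.
text_tokens = 2000     # tokens the raw page text would occupy (made-up example)
vision_tokens = 150    # visual tokens used to encode the rendered page (made-up example)

compression = text_tokens / vision_tokens
print(f"compression ratio: ~{compression:.0f}x")   # falls in the quoted 10-20x range
```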
The Krea Realtime 14B model is a distilled version of the Wan 2.1 14B model (developed by Wan-AI) for text-to-video generation tasks. It was transformed into an autoregressive model using the Self-Forcing method, achieving an inference speed of 11 frames per second with 4 inference steps on a single NVIDIA B200 GPU.
A compact 4-billion-parameter model that retains the full functionality of the Qwen3-VL series: fast responses and high-quality multimodal understanding with spatial and temporal grounding. At the same time, it significantly reduces hardware requirements – when using half of the natively supported 256K context, the model runs stably on a single 24GB GPU.
A compact dense model with 8 billion parameters and enhanced step-by-step reasoning capabilities, specializing in complex multimodal tasks that require in-depth analysis and superior visual content understanding. It natively supports a 256K-token context and outperforms well-known models such as Gemini 2.5 Flash Lite and GPT-5 Nano (high) across nearly all key benchmarks.
A multimodal dense model with 8 billion parameters, optimized for dialogue and instruction following, capable of understanding images, videos, and text. It natively supports a context length of 256K tokens, features enhanced OCR for 32 languages, and possesses visual agent capabilities. The model demonstrates competitive performance against larger models on key benchmarks.
A reasoning-optimized 4B version of the Qwen3-VL model series with a 256K context window (expandable to 1M). It always generates reasoning chains before producing a response, which lets it tackle complex multimodal tasks at the cost of throughput. Its performance is only slightly behind Qwen3-VL-8B despite significantly lower hardware requirements.
The flagship Mixture-of-Experts (MoE) model from IBM's Granite-4.0 family, featuring 32 billion total parameters (with 9 billion active). It combines Mamba-2 and Transformer architectures to deliver performance on par with large-scale models while reducing memory requirements by 70% and doubling inference speed. It is optimized for enterprise-grade tasks such as RAG and agent workflows.