An enhanced image-to-image generation model that succeeds Qwen-Image-Edit-2509.
A model from NVIDIA with 31.6B total parameters (3.6B active), specifically optimized for high-performance agentic systems. It uses a hybrid Mamba-Transformer MoE architecture, delivering memory efficiency, high throughput, and strong reasoning accuracy on contexts of up to 1M tokens.
A multimodal model with 106B parameters, using a Mixture-of-Experts (MoE) architecture and a 128K-token context. Its key features are native tool-calling support and the ability to work with images as both input and output, making it a strong platform for building complex AI agents for document analysis, visual search, and front-end development automation.
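A minimal sketch of how such a model might be driven through an OpenAI-compatible endpoint, combining an image input with a caller-defined tool. The base_url, model id, and the file_bug_report tool are illustrative assumptions, not part of any documented API:

```python
# Hypothetical example: image input plus tool calling via an
# OpenAI-compatible endpoint. Endpoint, model id, and tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A caller-defined tool the model may decide to invoke.
tools = [{
    "type": "function",
    "function": {
        "name": "file_bug_report",  # hypothetical helper, defined by the caller
        "description": "File a bug report for a UI defect found in a screenshot.",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
}]

response = client.chat.completions.create(
    model="placeholder-multimodal-agent",  # replace with the served model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Inspect this screenshot and file a bug for any layout defect."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/ui.png"}},
        ],
    }],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```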
A compact 9-billion-parameter multimodal model with a 128K-token context length and native support for visual function calling. It achieves state-of-the-art results on the MMBench, MathVista, and OCRBench benchmarks among models of comparable size and is optimized for local deployment and agentic scenarios.
The image-editing variant of LongCat-Image, supporting bilingual (Chinese-English) editing tasks with state-of-the-art performance among open-source models. It excels at instruction following and visual consistency while maintaining high image quality.
An open-source, bilingual (Chinese-English) foundation model for text-to-image generation that addresses key challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. With only 6 billion parameters, it outperforms larger open-source models across benchmarks, showcasing efficient architecture design.
A DeepSeek-AI model with advanced reasoning capabilities and agentic functions, combining high computational efficiency with GPT-5-level performance. Thanks to DeepSeek Sparse Attention (DSA) and a distinctive "in-call tool reasoning" mechanism, the model is well suited to building autonomous agents, balancing speed, resource cost, and task complexity.
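For intuition only, a toy top-k sparse attention in PyTorch: each query keeps its k highest-scoring keys and masks out the rest. This shows the generic idea behind sparse attention, not DeepSeek's actual DSA kernel (which, unlike this sketch, avoids materializing the full score matrix):

```python
# Didactic top-k sparse attention: each query attends only to its k
# highest-scoring keys. Generic illustration, NOT DeepSeek's DSA.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k=4):
    # q, k, v: (seq_len, dim). A real sparse kernel would never build the
    # dense score matrix; this toy does, for clarity.
    scores = (q @ k.T) / k.shape[-1] ** 0.5
    # k-th largest score per row; everything below it is masked out.
    kth = scores.topk(top_k, dim=-1).values[:, -1:]
    sparse = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(sparse, dim=-1) @ v

q, k, v = (torch.randn(8, 16) for _ in range(3))
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([8, 16])
```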
A specialized version of DeepSeek-V3.2 for deep reasoning, reaching GPT-5 and Gemini-3.0-Pro level on complex problems in Olympiad mathematics and programming. The model does not support tool calling but allows unbounded "thinking" depth, which lets it achieve outstanding results in these narrowly specialized domains. DeepSeek-V3.2-Speciale is the first open model to win gold medals at the largest international mathematics and informatics Olympiads.
A 7-billion-parameter model developed for generating high-quality images with a focus on accurate text rendering, optimized to run on limited computational resources. Building on its predecessor, Ovis-U1, it is designed to operate efficiently on a single high-performance GPU.
The flagship and largest Russian-language instruct model at the time of its release, based on a Mixture-of-Experts (MoE) architecture with 702B total and 36B active parameters. The model integrates Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP) for high inference throughput and is optimized for fp8 operation. GigaChat 3 Ultra Preview works with a 128K-token context, demonstrates strong performance in text generation, programming, and mathematics, and offers the deepest understanding of the Russian language and culture.
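Some back-of-the-envelope arithmetic on why the total/active split matters, using only the figures quoted above:

```python
# Rough arithmetic on the MoE total/active split quoted above.
total_params = 702e9   # all experts combined
active_params = 36e9   # parameters actually used per token

print(f"active fraction: {active_params / total_params:.1%}")      # ~5.1%
# Per-token compute scales with ACTIVE parameters (~2 FLOPs per param):
print(f"~{2 * active_params / 1e9:.0f} GFLOPs per generated token")  # ~72
# Weight memory scales with TOTAL parameters; fp8 stores 1 byte each:
print(f"~{total_params / 2**30:.0f} GiB of weights at fp8")          # ~654
```

Per-token compute tracks the 36B active parameters, while weight memory tracks all 702B, which is exactly the trade-off that fp8 storage is meant to ease.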
Kandinsky-5.0-T2I-Lite-sft-Diffusers is a 6-billion-parameter text-to-image (T2I) model that generates images from text prompts. It belongs to the Kandinsky 5.0 family, which includes models for both video and image generation.
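Since the checkpoint ships in Diffusers format, loading it presumably follows the standard DiffusionPipeline pattern sketched below; the repository id and generation arguments are placeholders, so consult the model card for the exact pipeline class and parameters (the same pattern should apply to the I2I variant in the next entry):

```python
# Minimal Diffusers-style sketch, assuming the standard pipeline API.
import torch
from diffusers import DiffusionPipeline

# Placeholder repo id; check the model card for the real one and for any
# model-specific pipeline class or arguments.
pipe = DiffusionPipeline.from_pretrained(
    "your-org/Kandinsky-5.0-T2I-Lite-sft-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(prompt="a red cat in a snowy forest, soft studio light").images[0]
image.save("cat.png")
```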
Kandinsky-5.0-I2I-Lite-sft-Diffusers is a 6-billion-parameter image-to-image (I2I) model that modifies images according to text prompts. It belongs to the Kandinsky 5.0 family, which includes models for both video and image generation.
A compact, dialogue-oriented MoE model from the GigaChat family (commonly referred to as GigaChat 3 Lightning), with 10 billion total and 1.8 billion active parameters, optimized for high-speed inference and deployment in local or high-load production environments. In Russian-language understanding it surpasses popular 3-4B-scale models while running significantly faster.
HunyuanVideo-1.5 is a lightweight text-to-video and image-to-video generation model developed by Tencent, featuring 8.3 billion parameters while maintaining state-of-the-art visual quality and motion coherence. It is designed to run efficiently on consumer-grade GPUs, making advanced video creation accessible to developers and creators.
A 32 billion parameter rectified flow transformer designed for image generation, editing, and combination based on text instructions. It supports open-ended tasks such as text-to-image generation, single-reference editing, and multi-reference editing without requiring additional finetuning. Trained using guidance distillation to enhance efficiency, the model is optimized for research and creative applications under a non-commercial license.
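For reference, the standard rectified-flow objective such models build on (the generic textbook formulation, not FLUX-specific training details): the network v_θ regresses onto the constant velocity of the straight path between noise x_0 and data x_1.

```latex
% Rectified flow in brief: interpolate linearly between noise x_0 and data
% x_1, and regress the network's output onto the straight-line velocity.
\[
  x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1],
\]
\[
  \mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2 .
\]
```

At sampling time, integrating dx_t/dt = v_θ(x_t, t) from t = 0 to t = 1 carries a noise sample to an image; guidance distillation, mentioned above, reduces the number of network evaluations this integration requires.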
An image-to-video (I2V) model with 19 billion parameters, delivering high-quality generation in HD format. The model belongs to the Kandinsky 5.0 family, which includes models for video and image generation.
A text-to-video (T2V) model with 19 billion parameters, delivering high-quality generation in HD format. The model belongs to the Kandinsky 5.0 family, which includes models for video and image generation.
A compact multimodal model from Baidu, built on a heterogeneous Mixture-of-Experts (MoE) architecture that separates textual and visual experts. During inference, only 3 billion of its 28 billion total parameters are activated. The model is an upgraded version of the base ERNIE-4.5-VL-28B-A3B, specifically optimized for multimodal reasoning via a "Thinking Mode." It supports images, video, visual grounding, and tool invocation, offers a native maximum context length of 131K tokens, and stands out for its moderate computational requirements.
The largest open-source reasoning model from Moonshot AI at the time of its release, featuring a Mixture-of-Experts architecture (1 trillion total parameters, 32 billion active), capable of executing 200-300 consecutive tool calls without quality degradation while seamlessly interleaving function calls with reasoning chains. The model supports a 256K-token context window, ships with native INT4 quantization for significantly faster inference at virtually no loss in accuracy, and employs Multi-head Latent Attention (MLA) for efficient processing of long sequences. Kimi K2 Thinking sets new records among open-source models and outperforms leading commercial systems, including GPT-5 and Claude Sonnet 4.5, on a broad range of benchmarks.
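A condensed sketch of the agent loop that such long-horizon tool use implies: call the model, execute any tool calls it emits, append the results, and stop once it replies in plain text. The endpoint, model id, and the toy search tool are assumptions for illustration:

```python
# Hypothetical agent loop against an OpenAI-compatible endpoint.
# base_url, model id, and the TOOLS registry are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TOOLS = {"search": lambda query: f"stub results for {query!r}"}  # toy registry
TOOL_SCHEMA = [{
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the web for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Find recent MoE papers and summarize."}]
while True:
    reply = client.chat.completions.create(
        model="placeholder-thinking-model",
        messages=messages,
        tools=TOOL_SCHEMA,
    ).choices[0].message
    messages.append(reply.model_dump(exclude_none=True))
    if not reply.tool_calls:            # plain-text answer: the loop is done
        print(reply.content)
        break
    for call in reply.tool_calls:       # execute each requested tool call
        args = json.loads(call.function.arguments)
        result = TOOLS[call.function.name](**args)
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": result})
```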
Ministral-3-3B-Instruct is the most compact model in the Ministral 3 family. With 3 billion parameters, multimodal support, a 256K context window, and agentic functions, it is ideal for local deployment and prototyping.