It is a native multimodal autoregressive model designed for image generation, supporting both text-to-image (T2I) and image-to-image (I2I) tasks. It features a unified architecture for multimodal understanding and generation, achieving performance comparable to leading closed-source models. The model includes two main variants: HunyuanImage-3.0 (text-to-image) and HunyuanImage-3.0-Instruct (enhanced with reasoning capabilities for intelligent prompt improvement and creative editing).
This model is designed for image-to-video generation and falls into the "world model" category. The project is licensed under Apache-2.0, providing open access to both the code and the models.
A 30-billion parameter MoE model with only ~3.6B parameters activated per token, delivering record-breaking performance in its class with minimal resource requirements (~24 GB VRAM). The model leads in agent-based tasks and programming, supports a 200K context, and is optimized for easy local deployment.
It is a 4 billion parameter rectified flow transformer model designed for fast image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, enabling end-to-end inference in under a second. Optimized for real-time applications without compromising quality, it runs on consumer-grade GPUs such as NVIDIA RTX 3090/4070 with approximately 13GB VRAM.
It is a 9 billion parameter rectified flow transformer model designed for high-speed image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, achieving state-of-the-art quality with end-to-end inference in under half a second. The model leverages an 8 billion parameter Qwen3 text embedder and is step-distilled to 4 inference steps, enabling real-time performance while matching or exceeding the quality of models five times its size.
It is a text-to-image and image-to-image generation model employing a hybrid architecture combining an autoregressive generator and a diffusion decoder. It excels in generating high-fidelity images with precise text rendering and semantic understanding, particularly in complex, information-dense scenarios.
It is the December 2025 update to Qwen-Image, a text-to-image foundation model. It is designed to generate high-quality images from textual prompts, with enhanced realism, detail rendering, and text integration.
An advanced MoE model with agentic capabilities, built as an intelligent partner for programming. Its distinguishing feature is a multi-level "thinking" system that provides strong stability and control when tackling complex tasks, making it an ideal choice for development, automation, and programmatic visual content creation.
It is an enhanced image-to-image generation model, succeeding Qwen-Image-Edit-2509.
A model from NVIDIA with 31.6B parameters (3.6B active), specifically optimized for high-performance agentic systems. It uses a hybrid Mamba-Transformer MoE architecture, delivering memory efficiency, high throughput, and strong reasoning accuracy on contexts of up to 1M tokens.
A multimodal model with 106B parameters, using a Mixture-of-Experts (MoE) architecture and a 128K-token context. Its key feature is native tool-calling support, allowing it to work directly with images as both input and output, which makes it a strong platform for building complex AI agents for document analysis, visual search, and front-end development automation.
A compact 9-billion-parameter multimodal model with a 128K-token context length and native support for visual function calling. It achieves state-of-the-art results on the MMBench, MathVista, and OCRBench benchmarks among models of comparable size, and is optimized for local deployment and agent-based scenarios.
It is an open-source, bilingual (Chinese-English) foundation model designed for text-to-image generation. It addresses key challenges in multilingual text rendering, photorealism, deployment efficiency, and developer accessibility. With only 6 billion parameters, it outperforms larger open-source models across benchmarks, showcasing efficient architecture design.
A DeepSeek-AI model with advanced reasoning capabilities and agent functions, combining high computational efficiency with GPT-5-level performance. Thanks to DeepSeek Sparse Attention (DSA) and its "in-call tool reasoning" mechanism, the model is well suited to building autonomous agents, balancing speed, resource cost, and task complexity.
A specialized version of DeepSeek-V3.2 for deep reasoning, reaching the level of GPT-5 and Gemini-3.0-Pro on complex problems in Olympiad mathematics and programming. The model does not support tool calling but has no limit on its "thinking" depth, which allows it to achieve exceptional results in these specialized domains. DeepSeek-V3.2-Speciale became the first open model to win gold medals at the largest international mathematics and informatics Olympiads.
The flagship and largest Russian-language instruct model at the time of its release, based on a Mixture-of-Experts (MoE) architecture with 702B total and 36B active parameters. The model integrates Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP) for high inference throughput and is optimized for FP8 operation. GigaChat 3 Ultra Preview works with a 128K-token context, demonstrates strong performance in text generation, programming, and mathematics, and offers the deepest understanding of the Russian language and culture.
Kandinsky-5.0-T2I-Lite-sft-Diffusers is a 6-billion-parameter text-to-image (T2I) model developed for generating images from text prompts. It belongs to the Kandinsky 5.0 family, which includes models for both image and video generation.
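Because the "-Diffusers" suffix indicates a checkpoint packaged for the Hugging Face diffusers library, generation would presumably follow the standard pipeline workflow. The sketch below is a minimal illustration only: the repo id and the use of `AutoPipelineForText2Image` to resolve this particular checkpoint are assumptions, not confirmed details.

```python
# Minimal text-to-image sketch with diffusers (repo id and pipeline resolution are assumptions).
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2I-Lite-sft-Diffusers",  # hypothetical repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(prompt="a cozy wooden cabin in a snowy forest at dusk").images[0]
image.save("cabin.png")
```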
Kandinsky-5.0-I2I-Lite-sft-Diffusers is a 6-billion-parameter image-to-image (I2I) model developed for modifying images based on text prompts. Like its T2I sibling, it belongs to the Kandinsky 5.0 family, which includes models for both image and video generation.
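By analogy, the I2I variant would be driven through the image-to-image counterpart of the same workflow. Again a hedged sketch: the repo id, the use of `AutoPipelineForImage2Image`, and the exact call signature for this checkpoint are assumptions.

```python
# Minimal image-to-image sketch with diffusers (repo id and pipeline resolution are assumptions).
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "ai-forever/Kandinsky-5.0-I2I-Lite-sft-Diffusers",  # hypothetical repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

init_image = load_image("input.png")  # local source image to modify
image = pipe(prompt="repaint the scene as a watercolor", image=init_image).images[0]
image.save("edited.png")
```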
A compact, dialogue-oriented MoE model from the GigaChat family (commonly referred to as GigaChat 3 Lightning), with 10 billion total and 1.8 billion active parameters, optimized for high-speed inference and deployment in local or high-load production environments. In Russian-language understanding it surpasses popular 3-4B-scale models while running significantly faster.
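Since the entry emphasizes local deployment, a plain transformers chat loop is the likely usage pattern. The sketch below assumes a standard causal-LM checkpoint with a chat template; the repo id is a placeholder guess and should be replaced with the actual published name.

```python
# Minimal local chat sketch with transformers (repo id is hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-sage/GigaChat3-10B-A1.8B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize in one paragraph what a Mixture-of-Experts model is."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```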