Apr 8, 2026
Nano Banana
Nano Banana: Nano Banana image generation model. Generates images from text prompts with support for 10 aspect ratios, delivering fast and cost-effective results.
Nano Banana Pro: Nano Banana Pro image generation model with higher quality output. Supports aspect ratio selection and output resolutions up to 4K.
Nano Banana 2: Nano Banana 2 multimodal image model supporting both text-to-image and image-to-image workflows, with 14 aspect ratios and resolutions from 512 to 4K.
Nano Banana Edit: Nano Banana image editing model. Transforms existing images based on prompt instructions with support for 10 aspect ratios.
Nano Banana Pro Edit: Nano Banana Pro image editing model with superior detail preservation and enhanced prompt adherence. Supports output resolutions up to 4K.
Nano Banana 2 Edit: Nano Banana 2 multimodal model in image-to-image editing mode. Requires a base64 data URI image input, with support for multiple aspect ratios and resolutions from 512 to 4K.
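Nano Banana 2 Edit expects the input image as a base64 data URI rather than a raw file upload. The platform's exact request schema isn't shown here, but a data URI itself has a standard shape; the following is a minimal sketch of building one from a local file (the function name `to_data_uri` is illustrative, not part of any official SDK):

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI
    (the input format Nano Banana 2 Edit requires)."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Fall back to a generic type when the extension is unrecognized.
        mime = "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string (e.g. `data:image/png;base64,iVBORw0...`) can then be placed in whichever request field the API designates for the base image.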
Imagen 4.0
Imagen 4.0: Google Imagen 4.0 standard text-to-image model delivering high-quality photorealistic images. Supports batch generation (up to 4 images), person generation control, and output resolutions up to 2K.
Imagen 4.0 Ultra: Google Imagen 4.0 Ultra text-to-image model with the highest quality output. Optimized for maximum detail and photorealism with batch generation and up to 2K resolution.
Imagen 4.0 Fast: Google Imagen 4.0 Fast text-to-image model optimized for speed. Supports batch generation (up to 4 images), multiple aspect ratios, and person generation control.
Veo 3.1
Veo 3.1 T2V: Google’s flagship text-to-video model supporting resolutions up to 4K and optional reference images (up to 3) for style or character consistency across generations. Duration: 4/6/8 seconds.
Veo 3.1 Fast T2V: A faster variant of Veo 3.1 with the same capabilities, including 4K resolution and reference image support. Duration: 4/6/8 seconds.
Veo 3.1 I2V: Google’s flagship image-to-video model supporting resolutions up to 4K with person generation control. Duration: 4/6/8 seconds.
Veo 3.1 Fast I2V: A faster variant of Veo 3.1 I2V with the same capabilities, including 4K resolution support. Duration: 4/6/8 seconds.
Veo 3
Veo 3 T2V: Google Veo 3.0 stable text-to-video model with resolutions up to 1080p and person generation control. Duration: 4/6/8 seconds.
Veo 3 Fast T2V: A faster variant of Veo 3 with the same parameter set, supporting resolutions up to 1080p. Duration: 4/6/8 seconds.
Veo 3 I2V: Google Veo 3.0 stable image-to-video model with resolutions up to 1080p and person generation control. Duration: 4/6/8 seconds.
Veo 3 Fast I2V: A faster variant of Veo 3 I2V with the same parameter set, supporting resolutions up to 1080p. Duration: 4/6/8 seconds.
Veo 2
Veo 2 T2V: Google Veo 2.0 classic text-to-video model with flexible person generation policies (allow all, adult only, or disallow). Duration: 5/6/8 seconds.
Veo 2 I2V: Google Veo 2.0 classic image-to-video model with flexible person generation policies (allow adult or disallow). Duration: 5/6/8 seconds.
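Note that the allowed clip durations differ across the Veo families above: Veo 3.1 and Veo 3 accept 4/6/8 seconds, while Veo 2 accepts 5/6/8. A small client-side guard can catch a mismatch before submitting a job; the model identifiers below are illustrative shorthands, not official API names:

```python
# Allowed clip durations (seconds) per the catalog entries above.
# Keys are illustrative family names, not official model IDs.
VEO_DURATIONS = {
    "veo-3.1": {4, 6, 8},
    "veo-3": {4, 6, 8},
    "veo-2": {5, 6, 8},
}

def validate_duration(family: str, seconds: int) -> int:
    """Return `seconds` unchanged, or raise ValueError if the
    requested duration is not offered by that Veo family."""
    allowed = VEO_DURATIONS.get(family)
    if allowed is None:
        raise ValueError(f"unknown model family: {family}")
    if seconds not in allowed:
        raise ValueError(
            f"{family} supports {sorted(allowed)} second clips, not {seconds}"
        )
    return seconds
```

For example, `validate_duration("veo-2", 4)` raises, since Veo 2's shortest clip is 5 seconds.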
Feb 25, 2026
Kling
Kling V1
Kling V1 T2I: Kuaishou’s foundational text-to-image model offering fast, cost-effective 1K image generation with strong prompt adherence and multiple aspect ratios.
Kling V1 I2I: Kuaishou’s first-generation AI image model using a diffusion transformer architecture, capable of generating 1K-resolution images with strong prompt adherence and realistic detail.
Kling V1 T2V: Kuaishou’s first-generation text-to-video model generating 5s or 10s clips with camera motion presets (pan, tilt, zoom) and adjustable prompt relevance.
Kling V1 I2V: Kuaishou’s first-generation image-to-video model that animates static images into 5s or 10s videos with motion brush support and adjustable prompt relevance.
Kling V1.5
Kling V1.5 T2I: An enhanced text-to-image model with improved realism and subject/face reference support for generating consistent character images at 1K resolution.
Kling V1.5 I2I: An upgraded image-to-image model with improved realism, better prompt interpretation, and subject/face reference modes for precise character control.
Kling V1.5 I2V: The most feature-complete V1.x image-to-video model, adding simple camera motion control alongside motion brush and cfg_scale for precise video generation.
Kling V1.6
Kling V1.6 T2V: An improved text-to-video model with significantly better prompt adherence and visual quality over V1.5, supporting dual standard/professional generation modes.
Kling V1.6 I2V: An improved image-to-video model with significantly better prompt adherence and visual quality over V1.5, supporting first-and-last frame control for smooth transitions.
Kling V1.6 MI2V: Transforms up to 4 reference images into a cohesive video sequence with multi-element fusion, enabling character interaction and complex visual narratives.
Kling V2
Kling V2 T2I: A next-generation text-to-image model with significantly improved detail and visual fidelity, supporting both 1K and 2K resolutions for professional output.
Kling V2 New T2I: A refined variant of V2 with updated model weights for sharper details, better consistency, and improved prompt-to-image alignment at up to 2K resolution.
Kling V2 I2I: A major generational leap in image quality and creativity, featuring enhanced style diversity and significantly improved visual fidelity over V1.5.
Kling V2 New I2I: A refined variant of Kling V2 with updated model weights for improved consistency, sharper details, and better prompt-to-image alignment.
Kling V2 MI2I: Combines up to 4 subject images with optional scene and style references into a single cohesive output, supporting subject fusion, scene replacement, and style transfer.
Kling V2 Master T2V: The V2-generation base text-to-video model producing cinematic-quality clips with superior motion realism and temporal coherence.
Kling V2 Master I2V: The V2-generation base image-to-video model delivering cinematic-quality animations with superior temporal coherence and smoother motion transitions.
Kling V2.1
Kling V2.1 T2I: The latest and highest-quality text-to-image model in the Kling family, delivering state-of-the-art results at up to 2K resolution.
Kling V2.1 I2I: The latest cost-efficient image-to-image model offering studio-grade quality with faster rendering and excellent prompt adherence.
Kling V2.1 MI2I: The latest and highest-quality multi-image composition model, delivering superior subject fusion, scene replacement, and style transfer with up to 4 subject images.
Kling V2.1 Master T2V: The V2.1-generation text-to-video model with enhanced rendering quality, improved frame consistency, and studio-grade 1080p output.
Kling V2.1 I2V: A cost-efficient image-to-video model with advanced frame control and up to 1080p output, suitable for professional content creation.
Kling V2.1 Master I2V: The recommended high-quality image-to-video model in the V2.1 series, producing studio-grade 1080p videos with precise start and end frame control.
Kling V2.5
Kling V2.5 Turbo T2V: A speed-optimized text-to-video model delivering cinematic 1080p videos with physics-accurate motion at ~30% lower cost than previous versions.
Kling V2.5 Turbo I2V: A speed-optimized image-to-video model delivering cinematic 1080p videos with physics-accurate motion at ~30% lower cost than previous versions.
Kling V2.6
Kling V2.6 T2V: The first Kling text-to-video model to natively generate synchronized audio and video in one pass, including dialogue, ambient sounds, and lip-synced speech.
Kling V2.6 I2V: The first Kling image-to-video model to natively generate synchronized audio and video in a single pass, supporting dialogue, sound effects, and lip-synced speech alongside motion brush.
Kling Avatar & Effects
Kling Avatar: Generates realistic talking-head videos from a reference image and audio input, with precise lip synchronization, expressive gestures, and support for multiple languages.
Kling Video Effects: Applies 212 preset creative video effects — including dance, transformation, interaction, and animation styles — to one or two person images for instant viral content.
Kling Image Utilities
Kling Image Expansion: Intelligently extends images in any direction (up, down, left, right) with prompt-guided content generation, ideal for panorama creation, background extension, and canvas expansion.
Kling Image Recognize: Detects and segments image content into 4 categories — object, head (with hair), face (without hair), and clothing — returning segmentation masks synchronously.
Kling Image O1: A multimodal image generation model that accepts text, up to 10 reference images, and element inputs to produce 1K/2K images with precise style control and multi-reference feature extraction.
Kolors Virtual Try-On
Kolors Virtual Try-On V1: AI-powered virtual clothing try-on built on the Kolors diffusion model, generating realistic fitting results from a person photo and a single garment image (tops, bottoms, or dresses).
Kolors Virtual Try-On V1-5: Enhanced virtual try-on model that supports both single garments and top+bottom outfit combinations, delivering higher-quality results with automatic clothing type detection.
Feb 4, 2026
MiniMax
MiniMax Image-01
MiniMax Image-01 T2I: MiniMax’s multimodal vision model that blends text-to-image generation with visual reasoning for seamless cross-modal tasks.
MiniMax Image-01 I2I: MiniMax’s multimodal vision model that blends image generation with visual reasoning for seamless cross-modal tasks, here in image-to-image mode.
MiniMax Image-01-Live I2I: MiniMax’s multimodal vision model that blends image generation with visual reasoning for seamless cross-modal tasks, here in image-to-image mode.
Hailuo
Hailuo 2.3 T2V: Hailuo 2.3 not only generates high-quality videos from text or images with exceptional instruction following, but also redefines realism through its state-of-the-art mastery of extreme physics.
Hailuo 2.3 I2V: Hailuo 2.3 not only generates high-quality videos from text or images with exceptional instruction following, but also redefines realism through its state-of-the-art mastery of extreme physics.
Hailuo 2.3 Fast I2V: Hailuo 2.3 Fast efficiently transforms images into dynamic videos with extreme physics mastery. It delivers exceptional value by generating high-quality, realistic motion at a reduced computational cost.
Hailuo 02 T2V: Hailuo 02 masters both text-to-video and image-to-video generation with exceptional instruction following, while setting a new standard in visual realism through its extreme physics simulation.
Hailuo 02 I2V: Hailuo 02 masters both text-to-video and image-to-video generation with exceptional instruction following, while setting a new standard in visual realism through its extreme physics simulation.
Hailuo 02 FL2V: Hailuo 02’s FL2V function provides unprecedented creative control by generating dynamic videos between a user-defined start and end frame. This feature not only masters extreme physics and complex transitions but also enables the novel capability to deduce a story leading up to a specified final image.
MiniMax T2V-01
MiniMax T2V-01: MiniMax T2V-01 is a text-to-video model that uniquely delivers professional-level camera movement control, transforming written prompts into cinematic video clips with dynamic shots.
MiniMax T2V-01-Director: T2V-01-Director is a text-to-video AI model that offers precise camera control, allowing users to create professional-looking video clips with cinematic movements through a variety of lens instructions.
MiniMax I2V-01
MiniMax I2V-01: MiniMax I2V-01 is a foundational image-to-video model that converts static pictures into high-quality video sequences, delivering smooth animation especially optimized for illustrations and anime styles.
MiniMax I2V-01-Director: I2V-01-Director is an image-to-video AI model that offers precise camera control, allowing users to animate static images into professional-looking video clips with cinematic movements through a variety of lens instructions.
MiniMax I2V-01-Live: I2V-01-Live is an image-to-video model specifically optimized for animating 2D illustrations and cartoon styles, enhancing smoothness and vivid motion to bring static art to life with fluid character movements and natural expressions.
MiniMax S2V-01
MiniMax S2V-01: The MiniMax S2V-01 is a specialized subject reference video model designed to solve the industry challenge of character consistency. It can generate dynamic videos where the main character’s identity stays highly consistent across every frame, using just a single photo as a reference and at a computational cost significantly lower than traditional solutions.
Jan 31, 2026
Alibaba
Wan
Wan 2.6 T2V: The Wan text-to-video model can generate videos from a single sentence, presenting rich artistic styles and cinematic quality. Wan 2.6 introduces multi-shot narrative capabilities and supports both automatic dubbing and uploading custom audio files.
Wan 2.6 I2V Flash: The Wan image-to-video model can generate videos using prompts and image references, featuring rich artistic styles and cinematic quality. Wan 2.6 introduces multi-shot narrative capabilities and supports both automatic dubbing and uploading custom audio files.
Wan 2.6 I2V: The Wan image-to-video model can generate videos using prompts and image references, featuring rich artistic styles and cinematic quality. Wan 2.6 introduces multi-shot narrative capabilities and supports both automatic dubbing and uploading custom audio files.
Wan 2.5 T2V Preview: The Wan text-to-video model can generate videos from a single sentence, presenting rich artistic styles and cinematic quality. Wan 2.5 supports automatic dubbing and uploading custom audio files.
Wan 2.5 I2V Preview: The Wan image-to-video model can generate videos using prompts and image references, featuring rich artistic styles and cinematic quality. Wan 2.5 supports automatic dubbing and uploading custom audio files.
Wan 2.2 T2V Plus: The Wan text-to-video model can generate videos from a single sentence, presenting rich artistic styles and cinematic quality. Wan 2.2 features more accurate instruction understanding, stable and smooth motion generation, and richer details.
Wan 2.2 I2V Flash: The Wan image-to-video model can generate videos using prompts and image references, presenting rich artistic styles and cinematic-quality visuals. Wan 2.2 Flash features ultimate generation speed, with more accurate instruction understanding and camera control, consistent visual elements, and comprehensively improved stability and success rates.
Wan 2.2 I2V Plus: The Wan image-to-video model can generate videos using prompts and image references, presenting rich artistic styles and cinematic-quality visuals. Wan 2.2 Plus features more accurate instruction understanding, controllable camera movements, consistent visual elements, and comprehensively improved stability and success rates, delivering richer generated content.
Wan 2.2 KF2V Flash: The Wan first-and-last-frame video generation model: simply provide the first and last frame images, and it can generate a smooth, fluid dynamic video based on the prompt.
Wanx
Wanx 2.1 T2V Turbo: The Wan text-to-video model can generate videos from a single sentence, featuring rich artistic styles and cinematic quality. Wanx 2.1 Turbo offers high cost-effectiveness.
Wanx 2.1 T2V Plus: The Wan text-to-video model can generate videos from a single sentence, featuring rich artistic styles and cinematic quality. Wanx 2.1 Plus offers even more refined visuals.
Wanx 2.1 I2V Plus: The Wan image-to-video model can generate videos using prompts and image references, presenting rich artistic styles and cinematic-quality visuals. Wanx 2.1 Plus offers even more refined image quality.
Wanx 2.1 I2V Turbo: The Wan image-to-video model can generate videos using prompts and image references, featuring rich artistic styles and cinematic-quality visuals. Wanx 2.1 Turbo offers high cost-effectiveness.
Wanx 2.1 KF2V Plus: The Wan first-and-last-frame video generation model: simply provide the first and last frame images, and it can generate a smooth, fluid dynamic video based on the prompt.
ByteDance
Seedream
Seedream 3.0 T2I: Seedream 3.0 is a Chinese-English bilingual image generation foundation model that supports native high resolution. Its overall capabilities are comparable to GPT-4o, ranking it among the world’s top tier. Faster response speed; more accurate small text generation and enhanced text typesetting; strong instruction-following ability, improved aesthetics and structure, and good fidelity and detail performance.
Seedream 4.0 T2I: A SOTA-level multimodal image creation model based on a leading architecture. It breaks the creative boundaries of traditional text-to-image models and natively supports text, single-image, and multi-image inputs. Users can freely fuse text and images, and within the same model realize diverse applications like multi-image fusion creation based on subject consistency, image editing, and group image generation.
Seedream 4.0 I2I: A SOTA-level multimodal image creation model based on a leading architecture. It breaks the creative boundaries of traditional text-to-image models and natively supports text, single-image, and multi-image inputs. Users can freely fuse text and images, and within the same model realize diverse applications like multi-image fusion creation based on subject consistency, image editing, and group image generation.
Seedream 4.5 T2I: Seedream 4.5 is the latest in-house image generation model developed by ByteDance. Compared with Seedream 4.0, it delivers comprehensive improvements—especially in editing consistency, including better preservation of subject details, lighting, and color tone. It also enhances portrait refinement and small-text rendering. The model’s multi-image composition capabilities have been significantly strengthened.
Seedream 4.5 I2I: Seedream 4.5 is the latest in-house image generation model developed by ByteDance. Compared with Seedream 4.0, it delivers comprehensive improvements—especially in editing consistency, including better preservation of subject details, lighting, and color tone. It also enhances portrait refinement and small-text rendering. The model’s multi-image composition capabilities have been significantly strengthened.
Seededit
Seededit 3.0 I2I: SeedEdit 3.0 is an image editing model that supports editing images via text instructions. It is trained on the text-to-image model Seedream 3.0, integrated with diverse data fusion methods and specific reward models. Its ability to preserve image subjects, backgrounds, and details has been further improved, especially in scenarios such as portrait editing, background modification, and perspective and lighting conversion.
Seedance
Seedance 1.0 Lite T2V: ByteDance’s small-parameter version of the video generation model achieves excellent video generation quality while significantly increasing generation speed, balancing both effect and efficiency.
Seedance 1.0 Lite I2V: ByteDance’s small-parameter version of the video generation model achieves excellent video generation quality while significantly increasing generation speed, balancing both effect and efficiency.
Seedance 1.0 Pro Fast T2V: Seedance 1.0 Pro Fast, inheriting the core advantages of the Seedance 1.0 Pro model, has a 3x faster generation speed and a 72% lower price. It is a video generation model that achieves an excellent balance among quality, speed, and cost.
Seedance 1.0 Pro Fast I2V: Seedance 1.0 Pro Fast, inheriting the core advantages of the Seedance 1.0 Pro model, has a 3x faster generation speed and a 72% lower price. It is a video generation model that achieves an excellent balance among quality, speed, and cost.
Seedance 1.0 Pro T2V: Seedance 1.0 is a video generation foundation model launched by ByteDance. As the large-parameter version of this model series, Seedance 1.0 Pro has unique multi-shot narrative capabilities and performs excellently across all dimensions. It has made breakthroughs in semantic understanding and instruction-following capabilities, and can generate 1080P high-definition videos that are smooth in motion, rich in details, diverse in style, and have cinematic-level aesthetics.
Seedance 1.0 Pro I2V: Seedance 1.0 is a video generation foundation model launched by ByteDance. As the large-parameter version of this model series, Seedance 1.0 Pro has unique multi-shot narrative capabilities and performs excellently across all dimensions. It has made breakthroughs in semantic understanding and instruction-following capabilities, and can generate 1080P high-definition videos that are smooth in motion, rich in details, diverse in style, and have cinematic-level aesthetics.
Seedance 1.5 Pro T2V: Seedance 1.5 Pro is ByteDance’s new professional-grade audio-visual co-generation model. It builds on multi-shot narrative and HD generation capabilities, supporting integrated audio and video output for a unified creation experience (visuals, human voice, music, and sound effects). The model includes a start/end frame feature, allowing creators to lock the video’s style, composition, and characters by setting the first and last frames.
Seedance 1.5 Pro I2V: Seedance 1.5 Pro is ByteDance’s new professional-grade audio-visual co-generation model. It builds on multi-shot narrative and HD generation capabilities, supporting integrated audio and video output for a unified creation experience (visuals, human voice, music, and sound effects). The model includes a start/end frame feature, allowing creators to lock the video’s style, composition, and characters by setting the first and last frames.
Jan 26, 2026
Alibaba
Wan
Wanx
Wanx 2.1 T2I Turbo: The Wan text-to-image model generates beautiful images from text. Supports multiple styles and generates quickly.
Wanx 2.1 T2I Plus: The Wan text-to-image model generates beautiful images from text. Supports multiple styles and generates images with rich details.
Wanx 2.1 Image Edit: Achieves diverse image editing through simple instructions, suitable for scenarios such as image expansion, watermark removal, style transfer, image restoration, and image enhancement.
Wanx 2.0 T2I Turbo: The Wan text-to-image model excels in textured portraits and creative design, offering great value for money.
Wanx Style Repaint V1: Performs various stylized redraws on input portrait images, allowing the newly generated images to maintain the original facial features while presenting different artistic painting effects.
Wanx Sketch to Image Lite: Generates exquisite doodle artworks based on input hand-drawn sketches and text descriptions.
Wanx Background Generation V2: Expands and generates background content based on input foreground image materials, achieving natural light and shadow fusion as well as delicate, realistic image generation.
Qwen
Qwen Image Plus: The qwen-image model excels in text rendering, particularly for Chinese text. Currently more cost-effective than qwen-image.
Qwen Image: The qwen-image model excels in text rendering, particularly for Chinese text.
Qwen Image Edit Plus: Supports precise bilingual Chinese-English text editing, color adjustment, detail enhancement, style transfer, object addition and removal, and other operations, enabling complex image and text editing.
Qwen Image Edit Plus 2025-12-15: Supports precise bilingual Chinese-English text editing, color adjustment, detail enhancement, style transfer, object addition and removal, and other operations, enabling complex image and text editing.
Qwen Image Edit Plus 2025-10-30: Supports precise bilingual Chinese-English text editing, color adjustment, detail enhancement, style transfer, object addition and removal, and other operations, enabling complex image and text editing.
Qwen Image Edit: Supports precise bilingual Chinese-English text editing, color adjustment, detail enhancement, style transfer, object addition and removal, and other operations, enabling complex image and text editing.
Qwen MT Image: Supports translating text in images from 11 languages into Chinese or English, accurately preserving the original layout and content, and provides custom features such as terminology definitions, sensitive word filtering, and image subject detection.
WordArt
WordArt Semantic: Creatively deforms the edge contours of input text based on prompt content, enabling more inventive uses of a font, and returns a black-background white mask image containing the text.
WordArt Texture: Performs creative design on input text content or text images, adding materials and textures to the text based on prompt content to achieve effects such as 3D prominence or scene integration.
AI Try-On
AI Try-On: A virtual try-on image generation model that generates try-on images based on portrait photos and clothing images.
AI Try-On Plus: Compared to AI Try-On, it improves image clarity, clothing texture details, and logo restoration, but generation takes longer.
AI Try-On Parsing V1: Supports segmentation of model images and clothing images, and can be used for pre-processing and post-processing of AI fitting room images.
AI Try-On Refiner: Performs secondary generation on images created by AI virtual try-on, outputting finely polished virtual try-on images with higher fidelity.
Image Utilities
Image Outpainting: Allows free image extension, supporting image rotation and expansion specified either by expansion coefficient or by pixel count.
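The two expansion methods above can be reasoned about before submitting a job. The API's exact parameter names aren't documented here; as a hedged sketch, assume a uniform scale factor for the coefficient method and per-side pixel counts for the pixel method, and compute the resulting canvas size:

```python
def expanded_size(width, height, coefficient=None, pixels=None):
    """Compute the output canvas size for outpainting.

    Exactly one of the two (assumed) specification methods is used:
    - coefficient: scale both dimensions by a factor
      (e.g. 1.5 -> a canvas 50% larger in each dimension)
    - pixels: extra pixels added per side, as (left, right, top, bottom)
    Returns the (width, height) of the expanded canvas.
    """
    if (coefficient is None) == (pixels is None):
        raise ValueError("specify exactly one of coefficient or pixels")
    if coefficient is not None:
        return round(width * coefficient), round(height * coefficient)
    left, right, top, bottom = pixels
    return width + left + right, height + top + bottom
```

For example, expanding a 1000×800 image with a coefficient of 1.5 yields a 1500×1200 canvas, while adding 100 px to the left and right yields 1200×800; which parameter names the service actually accepts should be confirmed against its API reference.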