Generates realistic talking-head videos from a reference image and audio input, with precise lip synchronization, expressive gestures, and support for multiple languages.
API Key authentication. Format: Bearer YOUR_API_KEY.
Reference image for the digital human avatar. Accepts a URL or a Base64-encoded string. Supported formats: JPG/JPEG/PNG. Maximum file size: 10MB. Minimum resolution: 300x300px. Aspect ratio must be between 1:2.5 and 2.5:1. For Base64 input, omit the data-URI prefix (e.g., 'data:image/png;base64,').
Example: "https://example.com/avatar.jpg"
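The image constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper names `strip_data_uri_prefix` and `image_dimensions_ok` are hypothetical, and the caller is assumed to already know the image's pixel dimensions.

```python
import re

# Matches a data-URI header such as "data:image/png;base64," at the start of a string.
_DATA_URI_PREFIX = re.compile(r"^data:[\w.+-]+/[\w.+-]+;base64,", re.IGNORECASE)


def strip_data_uri_prefix(value: str) -> str:
    """Return the bare Base64 payload; URLs and bare payloads pass through unchanged."""
    return _DATA_URI_PREFIX.sub("", value, count=1)


def image_dimensions_ok(width: int, height: int) -> bool:
    """Check the documented limits: at least 300x300 px, aspect ratio from 1:2.5 to 2.5:1."""
    if width < 300 or height < 300:
        return False
    # Integer form of 0.4 <= width / height <= 2.5, which avoids rounding at the boundaries.
    return 2 * width <= 5 * height and 2 * height <= 5 * width
```

File size and file format are not checked here; they would need the raw bytes or the response headers of the URL.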
Audio ID generated by the TTS API. Mutually exclusive with sound_file: exactly one of the two must be provided. Only audio generated within the last 30 days is supported. Audio duration must be between 2 and 300 seconds.
Example: "audio-abc123"
Audio file as a URL or a Base64-encoded string. Mutually exclusive with audio_id: exactly one of the two must be provided. Supported formats: mp3/wav/m4a/aac. Maximum file size: 5MB. Audio duration must be between 2 and 300 seconds.
Example: "https://example.com/audio.mp3"
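The "exactly one of audio_id or sound_file" rule is easy to get wrong when both values are optional in calling code. A minimal sketch of a client-side check follows; the function names are hypothetical, and the server's actual error behavior is not described in this reference.

```python
from typing import Optional


def select_audio_source(audio_id: Optional[str] = None,
                        sound_file: Optional[str] = None) -> dict:
    """Return the single audio field to send, enforcing mutual exclusivity."""
    # True when both are missing or both are present; either case is invalid.
    if (audio_id is None) == (sound_file is None):
        raise ValueError("provide exactly one of audio_id or sound_file")
    if audio_id is not None:
        return {"audio_id": audio_id}
    return {"sound_file": sound_file}


def duration_ok(seconds: float) -> bool:
    """Check the documented audio length limit of 2 to 300 seconds."""
    return 2 <= seconds <= 300
```

For example, `select_audio_source(audio_id="audio-abc123")` yields a dictionary containing only the `audio_id` field, while passing both arguments or neither raises `ValueError`.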
Optional prompt describing the avatar's actions, emotions, and camera movements (1-2500 characters).
Length: 1-2500 characters. Example: "Speaking with a smile, camera slowly zooms in"
Generation mode: std (standard, cost-effective) or pro (professional, higher quality but longer processing time).
Allowed values: std, pro. Example: "std"
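Putting the parameters together, the sketch below assembles request headers and a JSON body without sending anything. Several names are assumptions: only `audio_id` and `sound_file` are named in this reference, so the field names `image`, `prompt`, and `mode`, the use of the `Authorization` header, and the JSON content type are illustrative guesses. The endpoint URL is not given here and is left out.

```python
from typing import Optional

ALLOWED_MODES = {"std", "pro"}


def build_headers(api_key: str) -> dict:
    # Assumption: the bearer token travels in the standard Authorization header.
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}


def build_body(image: str, *, audio_id: Optional[str] = None,
               sound_file: Optional[str] = None, prompt: Optional[str] = None,
               mode: Optional[str] = None) -> dict:
    """Assemble the request body; 'image', 'prompt', 'mode' are hypothetical field names."""
    if (audio_id is None) == (sound_file is None):
        raise ValueError("provide exactly one of audio_id or sound_file")
    body = {"image": image}
    if audio_id is not None:
        body["audio_id"] = audio_id
    else:
        body["sound_file"] = sound_file
    if prompt is not None:
        # The prompt is optional, but when present it must be 1-2500 characters.
        if not 1 <= len(prompt) <= 2500:
            raise ValueError("prompt must be 1-2500 characters")
        body["prompt"] = prompt
    if mode is not None:
        if mode not in ALLOWED_MODES:
            raise ValueError("mode must be 'std' or 'pro'")
        body["mode"] = mode
    return body


body = build_body(
    "https://example.com/avatar.jpg",
    audio_id="audio-abc123",
    prompt="Speaking with a smile, camera slowly zooms in",
    mode="std",
)
headers = build_headers("YOUR_API_KEY")
```

Omitted optional fields are left out of the body rather than sent as null, so any server-side defaults still apply.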