Translation with lip synchronization
This guide lists the constraints for providing the best possible content intended to be translated using an audio + lip-sync generative AI model. To get the best results from this complex process, follow the advice below:
- Ideal Speaker Count: The system performs best when there are up to two speakers visible on screen.
- Speaker Orientation: To ensure accurate voice capture and translation, speakers should face the camera directly, oriented no more than 45 degrees away from straight on. This positioning helps the product accurately capture audio and lip movement for effective video quality.
- Proximity to Camera: For best results, speakers should be within 10 feet of the camera. This distance allows the product to capture clear audio and facial expressions, enhancing translation accuracy and lip-sync quality.
- Camera Framing: Close-up shots of the speakers are preferred. Close-ups capture detailed visual cues, which are essential for high-quality lip sync.
- No Dynamic Shot Cuts: The model does not perform well in scenarios with frequent, dynamic camera cuts, which can disrupt the continuous capture of the audio and visual cues needed for lip sync. To ensure optimal performance, maintain a steady shot focused on the speakers.
- Multiple Speakers Talking Over Each Other: Accuracy diminishes when multiple speakers talk simultaneously. For the most effective performance, ensure that only one speaker talks at a time, allowing the product to lip sync accurately.
- Background Noise: Minimize background noise so that the captured audio is as clear as possible. Excessive noise can interfere with speech recognition and lip-sync accuracy.
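The constraints above can be folded into a simple pre-flight check run on footage before submission. The sketch below is purely illustrative: the `ClipMetadata` fields and the `preflight_issues` helper are hypothetical (not part of any real API), the speaker-count, orientation, and distance limits mirror the guidelines, and the shot-cut and noise thresholds are assumptions standing in for "frequent cuts" and "excessive noise".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-clip metadata; a real pipeline would extract these
# values with face-detection, scene-detection, and audio-analysis tooling.
@dataclass
class ClipMetadata:
    speaker_count: int           # speakers visible on screen
    max_face_yaw_deg: float      # largest deviation from facing the camera
    max_distance_ft: float       # farthest speaker from the camera
    shot_cuts_per_minute: float  # scene-change frequency
    overlapping_speech: bool     # multiple speakers talking at once
    noise_floor_db: float        # background noise level in dBFS (negative)

def preflight_issues(clip: ClipMetadata) -> List[str]:
    """Return a list of guideline violations; an empty list means the clip looks OK."""
    issues = []
    if clip.speaker_count > 2:
        issues.append("more than two speakers on screen")
    if clip.max_face_yaw_deg > 45:
        issues.append("speaker oriented more than 45 degrees off camera")
    if clip.max_distance_ft > 10:
        issues.append("speaker farther than 10 feet from the camera")
    if clip.shot_cuts_per_minute > 6:   # assumed threshold for "frequent cuts"
        issues.append("frequent dynamic shot cuts")
    if clip.overlapping_speech:
        issues.append("speakers talking over each other")
    if clip.noise_floor_db > -40:       # assumed threshold for "excessive noise"
        issues.append("excessive background noise")
    return issues

good = ClipMetadata(2, 30.0, 8.0, 2.0, False, -55.0)
bad = ClipMetadata(3, 60.0, 15.0, 10.0, True, -20.0)
print(preflight_issues(good))  # []
print(preflight_issues(bad))
```

A clip that passes every check is not guaranteed to translate perfectly, but flagging these conditions up front avoids the most common quality problems described above.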