
SadTalker: Making a Still Photo Talk and Move Naturally Using AI


Have you ever seen those apps that can make a still photo of a person appear to talk and move their head? This technology is called talking head synthesis, and it has many applications in fields such as video conferencing, digital avatars, and animation.

In this post, I will explain a new AI technique called SadTalker that can generate realistic talking head videos from just a single photo and an audio clip. Published in a recent research paper, SadTalker achieves new state-of-the-art results in making animated characters look like real humans.

The Challenge of Talking Head Synthesis

The idea sounds simple: take a photo, feed in an audio clip, and make the person in the photo lip-sync and move naturally, as if they were actually saying those words. But doing this realistically is extremely challenging.

The lips need to match the sounds and words being spoken. The facial expressions and head motion also need to align with the tone and rhythm of the speech. All of this should look natural rather than staged or robotic; failing on any of these points results in eerie-looking characters.

Previous methods tried to predict the talking motions directly from audio, but this led to weird artifacts, distortions, and loss of identity. The key insight behind SadTalker is to break talking head generation down into simpler parts that can each be modeled better individually.

The Core Ideas Behind SadTalker

SadTalker uses AI to first predict realistic facial motions like expressions and head movements from the input audio. It then renders the final video by translating these motions onto the still photo. Here are the main ideas:

  1. 3D Morphable Face Model

SadTalker represents motions using a 3D Morphable Face Model (3DMM). This mathematically models the key aspects of facial geometry, such as identity shape, expressions, and head pose, in a disentangled manner.
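To make this concrete, here is a minimal illustrative sketch of how a 3DMM reconstructs a face mesh from disentangled parameters. The basis sizes, names, and random data below are toy placeholders, not SadTalker's actual model:

import numpy as np

# Toy 3DMM: a face mesh is the mean shape plus identity and expression
# offsets, placed in the scene with a rigid head pose.
N_VERTS = 5000                                 # number of mesh vertices (illustrative)
mean_shape = np.zeros((N_VERTS, 3))            # average face geometry
id_basis = np.random.randn(N_VERTS, 3, 80)     # identity basis (e.g. 80 coefficients)
exp_basis = np.random.randn(N_VERTS, 3, 64)    # expression basis (e.g. 64 coefficients)

def reconstruct_face(id_coeff, exp_coeff, rotation, translation):
    """Rebuild 3D vertices from disentangled 3DMM parameters."""
    shape = (mean_shape
             + id_basis @ id_coeff             # who the person is
             + exp_basis @ exp_coeff)          # what the face is doing
    return shape @ rotation.T + translation    # rigid head pose

verts = reconstruct_face(np.random.randn(80), np.random.randn(64),
                         np.eye(3), np.zeros(3))

Because identity, expression, and pose live in separate coefficient groups, SadTalker can predict expression and pose from audio while leaving the person's identity untouched.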

  2. Realistic Expression Synthesis

Much of speech is conveyed through the lips. SadTalker uses a network called ExpNet to accurately predict lip shapes and expressions from the input audio. It distills lip motion from existing lip-sync methods and adds perceptual losses based on 3D face rendering. This produces natural-looking expressions tuned to the specific person's face.
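As a rough mental model (not SadTalker's actual architecture), an audio-to-expression network can be sketched as a small sequence model that regresses per-frame 3DMM expression coefficients. All module names and dimensions below are hypothetical:

import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Hypothetical stand-in for ExpNet: maps a sequence of audio features
    (e.g. mel-spectrogram frames) to per-frame expression coefficients."""
    def __init__(self, audio_dim=80, exp_dim=64, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, exp_dim)

    def forward(self, audio_feats):            # (batch, frames, audio_dim)
        h, _ = self.encoder(audio_feats)
        return self.head(h)                    # (batch, frames, exp_dim)

model = AudioToExpression()
exp_coeffs = model(torch.randn(1, 25, 80))     # roughly one second at 25 fps

In training, such a network would be supervised both by coefficients distilled from an existing lip-sync model and by perceptual losses computed on rendered 3D faces, which is the combination described above.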

  3. Diverse Head Motion Synthesis

Apart from expressions, talking also involves head movements and gestures. SadTalker uses a generative network called PoseVAE to predict realistic and personalized head motion from audio. The rhythm and style are controlled by conditioning it on the speaker identity and audio features.
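Conceptually, the decoder side of such a conditional VAE can be sketched as below; sampling different latent vectors yields diverse but plausible motion. The class, dimensions, and pose parameterization are illustrative assumptions, not the real PoseVAE:

import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Toy conditional-VAE decoder: decodes a sampled latent 'style' vector,
    conditioned on audio features and an identity embedding, into per-frame
    head poses (3 rotation + 3 translation values)."""
    def __init__(self, latent_dim=16, audio_dim=80, id_dim=32, pose_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + id_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, z, audio_feats, id_embed):
        # z: (batch, frames, latent), audio_feats: (batch, frames, audio_dim),
        # id_embed: identity embedding broadcast over time
        return self.net(torch.cat([z, audio_feats, id_embed], dim=-1))

decoder = PoseDecoder()
frames = 25
z = torch.randn(1, frames, 16)                 # a new sample gives a new head-motion style
poses = decoder(z, torch.randn(1, frames, 80), torch.zeros(1, frames, 32))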

  4. 3D-aware Neural Rendering

Finally, a novel rendering network maps the 3DMM motion parameters to a shared 3D space used in recent face reenactment techniques. This allows translating the motions to the image domain for photorealistic synthesis of the talking character!
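At image level, this final step amounts to warping the still photo with a predicted motion field. The snippet below shows only that last warping operation, with a zero (do-nothing) flow standing in for the field that a mapping network would actually predict from the 3DMM coefficients:

import torch
import torch.nn.functional as F

src = torch.rand(1, 3, 256, 256)               # the still photo as a tensor
flow = torch.zeros(1, 256, 256, 2)             # predicted per-pixel offsets (placeholder)

# Identity sampling grid in normalized [-1, 1] coordinates, plus the offsets.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 256),
                        torch.linspace(-1, 1, 256), indexing="ij")
identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, H, W, 2)

# Each output pixel samples the source photo at its warped location.
frame = F.grid_sample(src, identity_grid + flow, align_corners=True)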

Generating a Talking Head Video, Step by Step

Now let's go through the complete pipeline of SadTalker to create a talking head video from an input photo:

  1. The 3D shape, expression and pose parameters of the person's face are estimated from the input photo using advanced face reconstruction techniques.
  2. The input audio clip is analyzed to extract features that capture its rhythm, tone, and content.
  3. ExpNet takes the audio features and predicts a sequence of facial expression parameters tailored to the specific face identity.
  4. Similarly, PoseVAE generates plausible head motion parameters from the audio and speaker identity.
  5. The 3DMM motion parameters are fed to the neural rendering network which translates them to warping fields in the image space.
  6. Warping the input photo with these fields produces photorealistic talking head frames one by one, resulting in a lifelike video.

All the components are trained separately on large datasets of talking head videos. This divide-and-conquer approach allows specialized modeling of each facial aspect while keeping the overall motion looking organic.
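Putting the steps above together, the whole pipeline can be sketched as a single function. The component interfaces here are hypothetical and only meant to show how the pieces feed into one another; the real SadTalker code is organized differently:

import torch

def animate(photo, audio_feats, face_recon, exp_net, pose_vae, renderer):
    # Steps 1-2: estimate 3DMM parameters from the photo; audio features are given.
    id_coeff, init_exp, init_pose = face_recon(photo)
    # Step 3: audio -> per-frame expression coefficients for this identity.
    exp_seq = exp_net(audio_feats, id_coeff)
    # Step 4: audio + identity -> a plausible head-pose sequence.
    pose_seq = pose_vae(audio_feats, id_coeff)
    # Steps 5-6: map motion parameters to warping fields and warp the photo frame by frame.
    frames = [renderer(photo, id_coeff, exp, pose)
              for exp, pose in zip(exp_seq, pose_seq)]
    return torch.stack(frames)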

Results and Applications

The researchers thoroughly evaluated SadTalker against previous state-of-the-art methods. They found significant improvements in the audio-visual synchronization, motion quality, identity preservation and sharpness of the generated talking heads.

Some interesting use cases this enables are:

  • Video conferencing avatars that capture your unique personality!
  • Dubbing anime/cartoon characters by just using a single reference image.
  • Creating interactive AI assistants with your own virtual doppelganger!
  • Making still photos of historic figures come alive and give speeches.

Limitations and Future Work

While SadTalker pushes the envelope on realism, like any technology it still has some limitations:

  • It focuses only on speech-related motions and does not handle emotions, gaze changes, or other non-speech behaviors.
  • The lack of a detailed mouth interior model can result in teeth artifacts.
  • Training requires large datasets of talking head videos, which are not available for all use cases.

How to Deploy It

The requirements are:

  • Windows or Linux computer with an NVIDIA GPU
  • CUDA enabled and Anaconda installed

This example is run on a Windows 11 computer.

  1. Create conda environment
conda create -n sadtalker python=3.8
conda activate sadtalker
  2. Install PyTorch
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
  3. Clone the repository and install dependencies
git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker 
conda install ffmpeg
pip install -r requirements.txt
  4. Download the pretrained models
mkdir checkpoints

curl -o ./checkpoints/mapping_00109-model.pth.tar https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2-rc/mapping_00109-model.pth.tar
curl -o ./checkpoints/mapping_00229-model.pth.tar https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2-rc/mapping_00229-model.pth.tar
curl -o ./checkpoints/SadTalker_V0.0.2_256.safetensors https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2-rc/SadTalker_V0.0.2_256.safetensors
curl -o ./checkpoints/SadTalker_V0.0.2_512.safetensors https://github.com/OpenTalker/SadTalker/releases/download/v0.0.2-rc/SadTalker_V0.0.2_512.safetensors

mkdir -p ./gfpgan/weights

curl -o ./gfpgan/weights/alignment_WFLW_4HG.pth https://github.com/xinntao/facexlib/releases/download/v0.1.0/alignment_WFLW_4HG.pth
curl -o ./gfpgan/weights/detection_Resnet50_Final.pth https://github.com/xinntao/facexlib/releases/download/v0.1.0/detection_Resnet50_Final.pth
curl -o ./gfpgan/weights/GFPGANv1.4.pth https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.4.pth
curl -o ./gfpgan/weights/parsing_parsenet.pth https://github.com/xinntao/facexlib/releases/download/v0.2.2/parsing_parsenet.pth
  5. Launch
python app_sadtalker.py