Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach that generates identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by maintaining memory states that store information from a longer past context and guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross-attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
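To make the memory-guided temporal module more concrete, below is a minimal PyTorch sketch of linear attention with a running memory state that summarizes previously processed chunks. This is a generic illustration of the technique, not MEMO's actual implementation: the function name `chunk_linear_attention` and the state variables `S` and `z` are assumptions made for exposition.

```python
# Minimal sketch: linear attention that reads from a memory state accumulated
# over past chunks, so each new chunk attends to a longer past context in
# O(seq_len) time. Names and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F


def chunk_linear_attention(q, k, v, state=None, eps=1e-6):
    """Attend over the current chunk plus a memory state from past chunks.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    state:   optional (S, z) pair summarizing all previously seen chunks.
    Returns the attention output and the updated memory state.
    """
    # Positive feature map so phi(q)^T phi(k) acts as an attention kernel.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1

    b, h, _, d = q.shape
    e = v.shape[-1]
    if state is None:
        S = q.new_zeros(b, h, d, e)  # running sum of key-value outer products
        z = q.new_zeros(b, h, d)     # running sum of keys (normalizer)
    else:
        S, z = state

    # Fold the current chunk into the memory state.
    S = S + torch.einsum("bhnd,bhne->bhde", phi_k, v)
    z = z + phi_k.sum(dim=2)

    # Each query reads from the full (past + current) context.
    num = torch.einsum("bhnd,bhde->bhne", phi_q, S)
    den = torch.einsum("bhnd,bhd->bhn", phi_q, z).unsqueeze(-1) + eps
    return num / den, (S, z)


if __name__ == "__main__":
    state = None
    for _ in range(3):  # three consecutive video chunks
        q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
        out, state = chunk_linear_attention(q, k, v, state)
    print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Because the memory state has a fixed size regardless of how many chunks have been processed, this style of attention can carry identity and motion cues across long videos without the quadratic cost of attending to all past frames.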
Our work is made possible thanks to high-quality open-source talking video datasets (including HDTF, VFHQ, CelebV-HQ, MultiTalk, and MEAD) and some pioneering works (such as EMO and Hallo).
We acknowledge the potential of AI in generating talking videos, with applications spanning education, virtual assistants, and entertainment. However, we are equally aware of the ethical, legal, and societal challenges that misuse of this technology could pose. To reduce potential risks, we have released only a preview model, and solely for research purposes. Demos on our website use publicly available materials. If you have copyright concerns, please contact us and we will address them promptly. Users are required to ensure that their actions align with legal regulations, cultural norms, and ethical standards. It is strictly prohibited to use the model to create malicious, misleading, defamatory, or privacy-infringing content, such as deepfake videos for political misinformation, impersonation, harassment, or fraud. We strongly encourage users to review generated content carefully to ensure it meets ethical guidelines and respects the rights of all parties involved. Users must also ensure that their inputs (e.g., audio and reference images) and outputs are used with proper authorization. Unauthorized use of third-party intellectual property is strictly forbidden. While users may claim ownership of content generated by the model, they must ensure compliance with copyright laws, particularly when the content involves public figures' likeness, voice, or other aspects protected under personality rights.
@article{zheng2024memo,
  title={MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation},
  author={Longtao Zheng and Yifan Zhang and Hanzhong Guo and Jiachun Pan and Zhenxiong Tan and Jiahao Lu and Chuanxin Tang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2412.04448},
  year={2024}
}