Synthesize Speech & Talking Videos with Unprecedented Realism: Ada-TTA Unveiled! This is DeepFake+++ #241
FurkanGozukara
announced in
Tutorials
Full tutorial: https://www.youtube.com/watch?v=ZuR_hxYIXF0
Given only a few-minute-long video of a person speaking with the audio track as the training data and arbitrary texts as the driving input, the authors aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet due to two challenges: (1) It is challenging to mimic the timbre from out-of-domain audio for a traditional multi-speaker Text-to-Speech system. (2) It is hard to render high-fidelity and lip-synchronized talking avatars with limited training data. In this paper, the authors introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles the text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking face video generation. With these designs, the authors' method overcomes the aforementioned two challenges and successfully generates identity-preserving speech and realistic talking person videos. Experiments demonstrate that the authors' method can synthesize realistic, identity-preserving, and audio-visual synchronized talking avatar videos.
Paper and video source link⤵️
https://arxiv.org/abs/2306.03504
Our Discord server⤵️
https://bit.ly/SECoursesDiscord
If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰⤵️
https://www.patreon.com/SECourses
Technology & Science: News, Tips, Tutorials, Tricks, Best Applications, Guides, Reviews⤵️
https://www.youtube.com/playlist?list=PL_pbwdIyffsnkay6X91BWb9rrfLATUMr3
Playlist of StableDiffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img⤵️
https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
Technology and AI enthusiasts have recently been intrigued by the rise of generative artificial intelligence across different sectors. For example, Adamopoulou (2020) highlighted the use of large language model (LLM) based chatbots that can produce high-quality, natural, and realistic dialogues. Advances in text-to-speech (TTS) systems, meanwhile, have enabled the synthesis of personalized speech from reference audio and plain text.
In addition, strides in neural rendering techniques have given us the ability to generate realistic and high-fidelity talking face videos, often called Talking Face Generation (TFG). With a few training samples, researchers have accomplished significant progress. Combining these advancements in TTS and TFG models opens up possibilities for creating talking videos from text inputs alone. This combined system presents tremendous potential in applications like news broadcasting, virtual lectures, and talking chatbots, particularly given the recent progress of ChatGPT.
However, earlier TTS and TFG models required a large volume of identity-specific data to produce satisfactory personalized results, which is impractical in real-world scenarios where only a few minutes of target-person video is typically available. Motivated by this limitation, researchers have been exploring a new area of study: low-resource text-to-talking avatar (TTA) synthesis, which aims to create identity-preserving, audio-lip-synchronized talking portrait videos from minimal input data.
Among the challenges associated with TTS and TFG, the foremost concern on the TTS side is how to effectively preserve the timbre identity of the reference audio. While solutions have been proposed, none have been fully satisfactory, suffering from issues such as information loss, poor identity preservation, and weak lip synchronization.
To overcome these hurdles, researchers have introduced Ada-TTA, a joint system of TTS and TFG that employs the latest advancements in each domain. To enhance the identity-preserving capability of the TTS model, they have devised a unique zero-shot multi-speaker TTS model that leverages a massive 20,000-hour-long TTS dataset. It can synthesize high-quality personalized speech from a single short recording of an unseen speaker.
For high-fidelity and lip-synchronized talking face generation, the GeneFace++ system is integrated into Ada-TTA. This TFG system boosts lip-synchronization and system efficiency while maintaining high fidelity. With the combination of these innovative systems, Ada-TTA is able to produce high-quality text-to-talking avatar synthesis, even with limited resources.
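To make the two-stage design above concrete, here is a toy Python sketch of the pipeline shape the summary describes: a timbre embedding is pooled from a short reference clip, a zero-shot TTS step conditions on that embedding, and an audio-driven rendering step turns the speech into video frames. This is purely illustrative — all function names, data shapes, and computations are hypothetical stand-ins, not the authors' code (which has not been released).

```python
# Hypothetical sketch of a two-stage text-to-talking-avatar pipeline.
# Every name here is illustrative; none of this is the Ada-TTA implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class TimbreEmbedding:
    vector: List[float]  # stand-in for a learned speaker representation


def extract_timbre(reference_audio: List[float], dim: int = 4) -> TimbreEmbedding:
    """Toy 'encoder': pools fixed-size statistics from the reference waveform."""
    chunk = max(1, len(reference_audio) // dim)
    vec = [sum(reference_audio[i * chunk:(i + 1) * chunk]) / chunk
           for i in range(dim)]
    return TimbreEmbedding(vec)


def zero_shot_tts(text: str, timbre: TimbreEmbedding) -> List[float]:
    """Toy 'decoder': emits one sample per character, scaled by the timbre
    embedding -- a stand-in for content/timbre/prosody disentanglement."""
    gain = sum(timbre.vector) / len(timbre.vector)
    return [gain * (ord(c) % 7) for c in text]


def render_talking_face(audio: List[float], frames_per_sample: int = 1) -> int:
    """Toy 'renderer': returns a frame count, standing in for a neural
    rendering stage such as GeneFace++ driven by the synthesized audio."""
    return len(audio) * frames_per_sample


# Driving flow: text + short reference clip -> speech -> talking-face frames.
ref_clip = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
timbre = extract_timbre(ref_clip)
speech = zero_shot_tts("hello", timbre)
n_frames = render_talking_face(speech)
```

The point of the sketch is only the data flow: the rendering stage never sees the text, it is driven entirely by the synthesized audio, which is why the two stages can be trained and swapped independently.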
Tests of Ada-TTA have demonstrated positive outcomes in the synthesis of speech and video. Ada-TTA not only holds up well under both objective and subjective metrics but also outperforms baseline measurements. This novel approach marks a promising step towards more realistic and accessible talking avatars.
Video Transcription
00:00:00 In this paper we propose Ada TTA, given only a few minute long talking person video with
00:00:06 transcribed audio track as the training data, Ada-TTA could synthesize identity-preserving and
00:00:12 audio lip synchronized talking portrait videos given the driving input text.
00:00:17 We will speak with you about the battle we're waging against an oil spill that is assaulting
00:00:22 our shores and our citizens. Good afternoon everyone
00:00:26 and together we are super excited to introduce you all to introduction to deep learning
00:00:31 the course of Carnegie Mellon University. In the first part of the course we will
00:00:36 talk about the generative deep learning models that are used to generate data that never existed in reality.
00:00:42 Good afternoon everyone and together we are super excited to introduce you all
00:00:47 to introduction to deep learning the course of Carnegie Mellon University.
00:00:50 In the first part of the course we will talk about
00:00:52 the generative deep learning models that are used to generate data that never existed in reality.
00:00:58 The video we just watched was released on this day, June 6, from Ada-TTA: Towards Adaptive High
00:01:08 Quality Text-to-Talking Avatar Synthesis and this is the best avatar generation
00:01:14 paper that I have seen. It both generates the
00:01:18 very high quality video and also the audio. I have been searching for an audio generation
00:01:24 machine learning model and I haven't seen anything like this quality along with the video generation.
00:01:31 Just from a short video training they extract audio and generate this text-to-speech and also
00:01:39 from video frames they generate this talking face video and it is amazing.
00:01:44 The quality was just amazing. Unfortunately they didn't release any source
00:01:50 code, any model so far so we only have this demo. It is extremely promising. I hope that they also
00:01:58 release the source code and also the model to the public so we can also test it, use it and
00:02:04 see whether it is this good or not, or whether it is just a cherry-picked showcase.
00:02:09 The link for this paper will be in the description of the video don't forget to check it out.
00:02:15 Hopefully see you in another amazing artificial intelligence news video.