Microsoft Research Asia is forging ahead with a new transhumanist program called VASA that creates “lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip.”

The artificial intelligence (AI) division of Microsoft in Asia has been building the program by compiling single static images of real people, real audio clips, and, in many cases, various control signals such as the movements of people’s faces as they talk.

Using all this data, Microsoft Research Asia is generating moving images of fake people who could someday replace actual newscasters and podcasters—at least those with so little personality and soul that robots could basically do their jobs.


“Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness,” the research team wrote in a paper about these latest developments.

“The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively.”
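To make that quoted description more concrete, here is a minimal, purely illustrative Python sketch of how an audio-driven pipeline built around a disentangled face latent space can be organized: a single portrait is encoded once into an appearance latent, an audio-conditioned generator produces a per-frame sequence of facial-dynamics and head-motion latents, and a decoder renders each frame. Every class, function name, dimension, and placeholder implementation below is an assumption for illustration only; this is not Microsoft’s VASA-1 code.

```python
# Illustrative sketch of an audio-driven "face latent space" pipeline.
# All names, shapes, and stub implementations are assumptions, not VASA-1.
import numpy as np

APPEARANCE_DIM = 256   # identity/appearance latent size (assumed)
DYNAMICS_DIM = 64      # facial-dynamics + head-pose latent size (assumed)
FRAME_SIZE = 512       # output resolution, per the reported 512x512 videos


def encode_appearance(image: np.ndarray) -> np.ndarray:
    """Stand-in for an encoder that maps a single portrait to an appearance latent."""
    rng = np.random.default_rng(abs(hash(image.tobytes())) % (2**32))
    return rng.standard_normal(APPEARANCE_DIM)


def generate_dynamics(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in for an audio-conditioned generator that emits one
    facial-dynamics/head-motion latent per video frame."""
    rng = np.random.default_rng(int(audio.sum() * 1000) % (2**32))
    return rng.standard_normal((n_frames, DYNAMICS_DIM))


def decode_frame(appearance: np.ndarray, dynamics: np.ndarray) -> np.ndarray:
    """Stand-in for a decoder that combines the fixed appearance latent with a
    per-frame dynamics latent to render one 512x512 RGB frame."""
    mix = appearance[:3, None, None] + dynamics[:3, None, None]
    return np.tile(mix, (1, FRAME_SIZE, FRAME_SIZE)).transpose(1, 2, 0)


def synthesize(image: np.ndarray, audio: np.ndarray, fps: int = 25) -> list[np.ndarray]:
    """Disentangled pipeline: appearance is encoded once; only the compact
    dynamics latents change over time, driven by the audio."""
    appearance = encode_appearance(image)
    n_frames = int(len(audio) / 16000 * fps)   # 16 kHz audio assumed
    return [decode_frame(appearance, z) for z in generate_dynamics(audio, n_frames)]


if __name__ == "__main__":
    portrait = np.zeros((512, 512, 3), dtype=np.uint8)   # placeholder image
    speech = np.zeros(16000 * 2, dtype=np.float32)       # 2 s of silent audio
    video = synthesize(portrait, speech)
    print(f"Generated {len(video)} frames of shape {video[0].shape}")
```

The design point the sketch tries to capture is the “disentangled” part: the appearance latent is computed once from the single photo, while only the small per-frame dynamics latents vary over time, which is what lets one still image drive an arbitrarily long talking-head video.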

Microsoft Research Asia’s methods for developing these human-like deepfakes produce high-quality video coupled with realistic facial and head dynamics. Such video can be generated online at a resolution of 512×512 pixels, at up to 40 frames per second (FPS), and with negligible starting latency.
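As a rough illustration of what those figures imply for real-time use, the short Python snippet below works out the per-frame time budget at 40 FPS and the approximate startup delay if frames are emitted in small chunks as the audio streams in. The chunk size and the assumption that a chunk is generated at roughly playback speed are for the example only, not figures from the paper.

```python
# Back-of-the-envelope check of "online at 512x512, up to 40 FPS, negligible
# starting latency". Chunk size and generation-speed assumption are illustrative.
TARGET_FPS = 40
FRAME_BUDGET_MS = 1000 / TARGET_FPS   # 25 ms available per frame at 40 FPS
CHUNK_FRAMES = 8                      # hypothetical streaming generation window

# If each chunk is generated at roughly the playback rate, the first frames
# appear after about one chunk's worth of time rather than after the whole clip.
startup_latency_ms = CHUNK_FRAMES * FRAME_BUDGET_MS

print(f"Per-frame budget: {FRAME_BUDGET_MS:.1f} ms")
print(f"First frames available after roughly {startup_latency_ms:.0f} ms "
      f"when generating {CHUNK_FRAMES}-frame chunks")
```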

In layman’s terms, the technology is so believable that many people would probably fall for it and think these are real people on their screens. Only the most discerning viewers can tell that something is not quite right with what they see.

“It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors,” Microsoft Research Asia proudly claims.

If you are interested in seeing a few examples of these creepy AI moving and speaking images, you can visit the VASA-1 project page on Microsoft.com.

“Our method is capable of not only producing precise lip-audio synchronization, but also generating a large spectrum of expressive facial nuances and natural head motions,” the company says.

“It can handle arbitrary-length [sic] audio and stably output seamless talking face videos.”