
Synthesizing Obama: Learning Lip Sync from Audio

University of Washington

[Figure 1 panel label: Output Obama Video]

Fig. 1. Given input Obama audio and a reference video, we synthesize photorealistic, lip-synced video of Obama speaking those words.

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

CCS Concepts: Computing methodologies: Image manipulation; Image-based rendering; Animation; Shape modeling;

Additional Key Words and Phrases: Audio, Face Synthesis, LSTM, RNN, Big data, Videos, Audiovisual Speech, Uncanny Valley, Lip Sync

ACM Reference format:
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: Learning Lip Sync from Audio. ACM Trans. Graph. 36, 4, Article 95 (July 2017), 13 pages.

1 INTRODUCTION

How much can you infer about someone's persona from their video footage? Imagine learning how to replicate the sound and cadence of a person's voice, how they speak, what they say, how they converse and interact, and how they appear and express themselves.

With tools like Skype, FaceTime, and other video conferencing solutions, we are increasingly capturing video footage of ourselves. [...] available online, in the form of interviews, speeches, newscasts, etc. Analyzing this video is quite challenging, however, as the faces are [...] interview to the next (also, most of this video is proprietary).

In this paper, we do a case study on President Barack Obama, and [...] from his voice and stock footage. Barack Obama is ideally suited as an initial test subject for a number of reasons. First, there exists an abundance of video footage from his weekly presidential addresses: 17 hours, and nearly two million frames, spanning a period of eight years. Importantly, the video is online and public domain, and hence well suited for academic research and publication. Furthermore, the quality is high (HD), with the face region occupying a relatively large part of the frame. And, while lighting and composition vary [...] the shots are relatively controlled with the subject in the center and facing the camera. Finally, Obama's persona in this footage is consistent: it is the President addressing the nation directly, and adopting a serious and direct tone.

Despite the availability of such promising data, the problem [...] part to the technical challenge of mapping from a one-dimensional signal to a (3D) time-varying image, but also due to the fact that [...] results that look uncanny. In addition to generating realistic results, [...] video speech problem by analyzing a large corpus of existing video data of a single person. As such, it opens the door to modeling [...] e.g., [...]).

Audio to video, aside from being interesting purely from a sci[...] [...]ing/transmission (which makes up a large percentage of current internet bandwidth). For hearing-impaired people, video synthesis could enable lip-reading from over-the-phone audio. And digital [...]

2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM Transactions on Graphics, Vol. 36, No. 4, Article 95. Publication date: July 2017.
