Blog
The latest from Google Research
Real-Time AR Self-Expression with Machine Learning
viernes, 8 de marzo de 2019
Posted by Artsiom Ablavatski and Ivan Grishchenko, Research Engineers, Google AI
Augmented reality
(AR) helps you do more with what you see by overlaying digital content and information on top of the physical world. For example,
AR features coming to Google Maps
will let you find your way with directions overlaid on top of your real world. With
Playground - a creative mode in the Pixel camera
-- you can use AR to see the world differently. And with the latest release of
YouTube Stories
and
ARCore
's new
Augmented Faces API
you can add objects like animated masks, glasses, 3D hats and more to your own selfies!
One of the key challenges in making these AR features possible is proper anchoring of the virtual content to the real world; a process that requires a unique set of perceptive technologies able to track the highly dynamic surface geometry across every smile, frown or smirk.
Our 3D mesh and some of the effects it enables
To make all this possible, we employ machine learning (ML) to infer approximate 3D surface geometry to enable visual effects, requiring only a single camera input without the need for a dedicated depth sensor. This approach provides the use of AR effects at realtime speeds, using
TensorFlow Lite
for mobile CPU inference or
its new mobile GPU functionality
where available. This technology is the same as what powers
YouTube Stories'
new creator effects, and is also available to the broader developer community via the
latest ARCore SDK
release and the
ML Kit Face Contour Detection API
.
An ML Pipeline for Selfie AR
Our ML pipeline consists of two real-time deep neural network models that work together: A detector that operates on the full image and computes face locations, and a generic 3D mesh model that operates on those locations and predicts the approximate surface geometry via regression. Having the face accurately cropped drastically reduces the need for common data augmentations like affine transformations consisting of rotations, translation and scale changes. Instead it allows the network to dedicate most of its capacity towards coordinate prediction accuracy, which is critical to achieve proper anchoring of the virtual content.
Once the location of interest is cropped, the mesh network is only applied to a single frame at a time, using a windowed smoothing in order to reduce noise when the face is static while avoiding lagging during significant movement.
Our 3D mesh in action
For our 3D mesh we employed
transfer learning
and trained a network with several objectives: the network simultaneously predicts 3D mesh coordinates on synthetic, rendered data and 2D semantic contours on annotated, real world data similar to those
MLKit
provides. The resulting network provided us with
reasonable
3D mesh predictions not just on synthetic but also on real world data. All models are trained on data sourced from a geographically diverse dataset and subsequently tested on a balanced, diverse testset for qualitative and quantitative performance.
The 3D mesh network receives as input a cropped video frame. It doesn't rely on additional depth input, so it can also be applied to pre-recorded videos. The model outputs the positions of the 3D points, as well as the probability of a face being present and reasonably aligned in the input. A common alternative approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth prediction and has high computational costs for so many points.
We further improve the accuracy and robustness of our model by iteratively bootstrapping and refining predictions. That way we can grow our dataset to increasingly challenging cases, such as grimaces, oblique angle and occlusions. Dataset augmentation techniques also expanded the available ground truth data, developing model resilience to artifacts like camera imperfections or extreme lighting conditions.
Dataset expansion and improvement pipeline
Hardware-tailored Inference
We use
TensorFlow Lite
for on-device neural network inference. The
newly introduced GPU back-end
acceleration boosts performance where available, and significantly lowers the power consumption. Furthermore, to cover a wide range of consumer hardware, we designed a variety of model architectures with different performance and efficiency characteristics. The most important differences of the lighter networks are the residual block layout and the accepted input resolution (128x128 pixels in the lightest model vs. 256x256 in the most complex). We also vary the number of layers and the subsampling rate (how fast the input resolution decreases with network depth).
Inference time per frame: CPU vs. GPU
The result of these optimizations is a substantial speedup from using lighter models, with minimal degradation in AR effect quality.
Comparison of the most complex (left) and the lightest models (right). Temporal consistency as well as lip and eye tracking is slightly degraded on light models.
The end result of these efforts empowers a user experience with convincing, realistic selfie AR effects in YouTube, ARCore, and other clients by:
Simulating light reflections via environmental mapping for realistic rendering of glasses
Natural lighting by casting virtual object shadows onto the face mesh
Modelling face occlusions to hide virtual object parts behind a face, e.g. virtual glasses, as shown below.
YouTube Stories includes Creator Effects like realistic virtual glasses, based on our 3D mesh
In addition, we achieve highly realistic makeup effects by:
Modelling Specular reflections applied on lips and
Face painting by using luminance-aware material
Case study comparing real make-up against our AR make-up on 5 subjects under different lighting conditions.
We are excited to share this new technology with creators, users and developers alike, who can use this new technology immediately by downloading the
latest ARCore SDK
. In the future we plan to broaden this technology to more Google products.
Acknowledgements
We would like to thank Yury Kartynnik, Valentin Bazarevsky, Andrey Vakunov, Siargey Pisarchyk, Andrei Tkachenka, and Matthias Grundmann for collaboration on developing the current mesh technology; Nick Dufour, Avneesh Sud and Chris Bregler for an earlier version of the technology based on parametric models; Kanstantsin Sokal, Matsvei Zhdanovich, Gregory Karpiak, Alexander Kanaukou, Suril Shah, Buck Bourdon, Camillo Lugaresi, Siarhei Kazakou and Igor Kibalchich for building the ML pipeline to drive impressive effects; Aleksandra Volf and the annotation team for their diligence and dedication to perfection; Andrei Kulik, Juhyun Lee, Raman Sarokin, Ekaterina Ignasheva, Nikolay Chirkov, and Yury Pisarchyk for careful benchmarking and insights on mobile GPU-centric network architecture optimizations.
Etiquetas
accessibility
ACL
ACM
Acoustic Modeling
Adaptive Data Analysis
ads
adsense
adwords
Africa
AI
AI for Social Good
Algorithms
Android
Android Wear
API
App Engine
App Inventor
April Fools
Art
Audio
Augmented Reality
Australia
Automatic Speech Recognition
AutoML
Awards
BigQuery
Cantonese
Chemistry
China
Chrome
Cloud Computing
Collaboration
Compression
Computational Imaging
Computational Photography
Computer Science
Computer Vision
conference
conferences
Conservation
correlate
Course Builder
crowd-sourcing
CVPR
Data Center
Data Discovery
data science
datasets
Deep Learning
DeepDream
DeepMind
distributed systems
Diversity
Earth Engine
economics
Education
Electronic Commerce and Algorithms
electronics
EMEA
EMNLP
Encryption
entities
Entity Salience
Environment
Europe
Exacycle
Expander
Faculty Institute
Faculty Summit
Flu Trends
Fusion Tables
gamification
Gboard
Gmail
Google Accelerated Science
Google Books
Google Brain
Google Cloud Platform
Google Docs
Google Drive
Google Genomics
Google Maps
Google Photos
Google Play Apps
Google Science Fair
Google Sheets
Google Translate
Google Trips
Google Voice Search
Google+
Government
grants
Graph
Graph Mining
Hardware
HCI
Health
High Dynamic Range Imaging
ICCV
ICLR
ICML
ICSE
Image Annotation
Image Classification
Image Processing
Inbox
India
Information Retrieval
internationalization
Internet of Things
Interspeech
IPython
Journalism
jsm
jsm2011
K-12
Kaggle
KDD
Keyboard Input
Klingon
Korean
Labs
Linear Optimization
localization
Low-Light Photography
Machine Hearing
Machine Intelligence
Machine Learning
Machine Perception
Machine Translation
Magenta
MapReduce
market algorithms
Market Research
materials science
Mixed Reality
ML
ML Fairness
MOOC
Moore's Law
Multimodal Learning
NAACL
Natural Language Processing
Natural Language Understanding
Network Management
Networks
Neural Networks
NeurIPS
Nexus
Ngram
NIPS
NLP
On-device Learning
open source
operating systems
Optical Character Recognition
optimization
osdi
osdi10
patents
Peer Review
ph.d. fellowship
PhD Fellowship
PhotoScan
Physics
PiLab
Pixel
Policy
Professional Development
Proposals
Public Data Explorer
publication
Publications
Quantum AI
Quantum Computing
Recommender Systems
Reinforcement Learning
renewable energy
Research
Research Awards
resource optimization
Responsible AI
Robotics
schema.org
Search
search ads
Security and Privacy
Self-Supervised Learning
Semantic Models
Semi-supervised Learning
SIGCOMM
SIGMOD
Site Reliability Engineering
Social Networks
Software
Sound Search
Speech
Speech Recognition
statistics
Structured Data
Style Transfer
Supervised Learning
Systems
TensorBoard
TensorFlow
TPU
Translate
trends
TTS
TV
UI
University Relations
UNIX
Unsupervised Learning
User Experience
video
Video Analysis
Virtual Reality
Vision Research
Visiting Faculty
Visualization
VLDB
Voice Search
Wiki
wikipedia
WWW
Year in Review
YouTube
Archive
2022
may
abr
mar
feb
ene
2021
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2020
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2019
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2018
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2017
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2016
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2015
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2014
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2013
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2012
dic
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2011
dic
nov
sep
ago
jul
jun
may
abr
mar
feb
ene
2010
dic
nov
oct
sep
ago
jul
jun
may
abr
mar
feb
ene
2009
dic
nov
ago
jul
jun
may
abr
mar
feb
ene
2008
dic
nov
oct
sep
jul
may
abr
mar
feb
2007
oct
sep
ago
jul
jun
feb
2006
dic
nov
sep
ago
jul
jun
abr
mar
feb
Feed
Follow @googleai
Give us feedback in our
Product Forums
.