uDepth: Real-time 3D Depth Sensing on the Pixel 4
Friday, April 10, 2020
Posted by Michael Schoenberg, uDepth Software Lead and Adarsh Kowdle, uDepth Hardware/Systems Lead, Google Research
The ability to determine 3D information about the scene, called depth sensing, is a valuable tool for developers and users alike. Depth sensing is a very active area of computer vision research with recent innovations ranging from applications like portrait mode and AR to fundamental sensing innovations such as transparent object detection. Typical RGB-based stereo depth sensing techniques can be computationally expensive, suffer in regions with low texture, and fail completely in extreme low light conditions.
Because the face unlock feature on Pixel 4 must work at high speed and in darkness, it called for a different approach. To this end, the front of the Pixel 4 contains a real-time infrared (IR) active stereo depth sensor, called uDepth. A key computer vision capability on the Pixel 4, this technology helps the authentication system identify the user while also protecting against spoof attacks. It also supports a number of novel capabilities, such as after-the-fact photo retouching, depth-based segmentation of a scene, background blur, portrait effects and 3D photos.
Recently, we provided access to uDepth as an API on Camera2, using the Pixel Neural Core, two IR cameras, and an IR pattern projector to provide time-synchronized depth frames (in DEPTH16) at 30Hz. The Google Camera App uses this API to bring improved depth capabilities to selfies taken on the Pixel 4. In this post, we explain broadly how uDepth works, elaborate on the underlying algorithms, and discuss applications with example results for the Pixel 4.
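If you want to consume these frames directly from the API, the sketch below shows how a DEPTH16 sample can be unpacked according to the Android ImageFormat.DEPTH16 documentation: the low 13 bits of each 16-bit sample hold the range in millimeters, and the high 3 bits hold a confidence code. The decode_depth16 helper and the use of NumPy are our own illustrative choices, not part of the uDepth API.

```python
import numpy as np

def decode_depth16(samples_u16):
    """Unpack DEPTH16 samples: low 13 bits are range in millimeters, high
    3 bits are a confidence code (0 represents 100% confidence, codes 1-7
    map to (code - 1) / 7), following the Android ImageFormat docs."""
    depth_mm = samples_u16 & 0x1FFF
    conf_code = ((samples_u16 >> 13) & 0x7).astype(np.int32)
    confidence = np.where(conf_code == 0, 1.0, (conf_code - 1) / 7.0)
    return depth_mm, confidence

# Example: one synthetic sample at 1500 mm with confidence code 7.
sample = np.array([(7 << 13) | 1500], dtype=np.uint16)
print(decode_depth16(sample))  # (array([1500], ...), array([0.857...]))
```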
Overview of Stereo Depth Sensing
All stereo camera systems reconstruct depth using parallax. To observe this effect, look at an object, close one eye, then switch which eye is closed. The apparent position of the object will shift, with closer objects appearing to move more. uDepth is part of the family of dense local stereo matching techniques, which estimate parallax computationally for each pixel. These techniques evaluate a region surrounding each pixel in the image formed by one camera, and try to find a similar region in the corresponding image from the second camera. When calibrated properly, the reconstructions generated are metric, meaning that they express real physical distances.
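As a concrete illustration of what "metric" means here, for a calibrated, rectified stereo pair the measured parallax (disparity, in pixels) converts to physical distance through the focal length and the baseline between the two cameras. The sketch below uses the standard pinhole stereo relationship; the numbers are made up for illustration and are not the Pixel 4's actual calibration values.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Pinhole stereo relationship for a rectified camera pair:
    depth = focal_length * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")  # zero parallax corresponds to a point at infinity
    return focal_length_px * baseline_m / disparity_px

# Hypothetical numbers, not the Pixel 4's calibration: a 1000 px focal length,
# a 12 mm baseline, and 20 px of measured parallax give a depth of 0.6 m.
print(disparity_to_depth(20.0, 1000.0, 0.012))
```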
Pixel 4 front sensor setup, an example of an active stereo system.
To deal with textureless regions and cope with low-light conditions, we make use of an “active stereo” setup, which projects an IR pattern into the scene that is detected by stereo IR cameras. This approach makes low-texture regions easier to identify, improving results and reducing the computational requirements of the system.
What Makes uDepth Distinct?
Stereo sensing systems can be extremely computationally intensive, and it is critical that a sensor running at 30Hz consumes little power while still producing high-quality depth. uDepth leverages a number of key insights to accomplish this.
One such insight is that given a pair of regions that are similar to each other, most corresponding subsets of those regions are also similar. For example, given two 8x8 patches of pixels that are similar, it is very likely that the top-left 4x4 sub-region of each patch is also similar. This informs the uDepth pipeline’s initialization procedure, which builds a pyramid of depth proposals by comparing non-overlapping tiles in each image and selecting those that are most similar. This process starts with 1x1 tiles, and accumulates support hierarchically until an initial low-resolution depth map is generated.
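A minimal sketch of this kind of hierarchical initialization is shown below: per-pixel (1x1 tile) matching costs are summed into progressively larger tiles, so the support accumulated for a coarse tile is built from the similarity of its sub-tiles. The absolute-difference cost, the tile sizes, and the NumPy implementation are simplifications for illustration, not the actual uDepth pipeline.

```python
import numpy as np

def per_pixel_costs(left, right, max_disp):
    """Absolute-difference matching cost for every pixel (a 1x1 "tile")
    at each candidate integer disparity."""
    h, w = left.shape
    left, right = left.astype(np.float32), right.astype(np.float32)
    costs = np.full((h, w, max_disp), np.inf, dtype=np.float32)
    for d in range(max_disp):
        costs[:, d:, d] = np.abs(left[:, d:] - right[:, :w - d])
    return costs

def hierarchical_init(left, right, max_disp=16, levels=3):
    """Sum per-pixel costs into 2x2, 4x4, ... tiles, accumulating support
    hierarchically, then pick the best disparity for each coarse tile."""
    costs = per_pixel_costs(left, right, max_disp)
    for _ in range(levels):
        h, w, _ = costs.shape
        costs = costs[:h - h % 2, :w - w % 2]
        costs = costs.reshape(h // 2, 2, w // 2, 2, max_disp).sum(axis=(1, 3))
    return np.argmin(costs, axis=-1)  # low-resolution disparity proposals

# Synthetic check: a constant 4-pixel shift should yield proposals near 4.
left = np.random.randint(0, 255, (64, 96)).astype(np.uint8)
right = np.roll(left, -4, axis=1)
print(np.median(hierarchical_init(left, right)))
```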
After initialization, we apply a novel technique for neural depth refinement to support the regular grid pattern illuminator on the Pixel 4. Typical active stereo systems project a pseudo-random grid pattern to help disambiguate matches in the scene, but uDepth is capable of supporting repeating grid patterns as well. Repeating structure in such patterns produces regions that look similar across stereo pairs, which can lead to incorrect matches. We mitigate this issue using a lightweight (75k parameter) convolutional architecture that uses IR brightness and neighbor information to adjust incorrect matches — in less than 1.5ms per frame.
Neural depth refinement architecture.
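To give a feel for just how small a network of this size is, here is a hypothetical PyTorch stand-in of comparable scale (roughly 70k parameters) that takes the IR brightness image and an initial depth map and predicts a per-pixel correction. The layer count, channel widths, and residual formulation are our assumptions; the actual uDepth refinement architecture is the one shown in the figure above.

```python
import torch
import torch.nn as nn

class TinyDepthRefiner(nn.Module):
    """Hypothetical stand-in for a lightweight depth-refinement network.
    Inputs: IR brightness and initial depth (one channel each)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 48, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 48, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 48, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, ir, depth):
        x = torch.cat([ir, depth], dim=1)
        return depth + self.net(x)  # residual correction of the initial match

model = TinyDepthRefiner()
print(sum(p.numel() for p in model.parameters()))  # roughly 70k parameters
```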
Following neural depth refinement, good depth estimates are iteratively propagated from neighboring tiles. This and following pipeline steps leverage another insight key to the success of uDepth — natural scenes are typically locally planar with only small nonplanar deviations. This permits us to find planar tiles that cover the scene, and only later refine individual depths for each pixel in a tile, greatly reducing computational load.
Finally, the best match from among neighboring plane hypotheses is selected, with subpixel refinement and invalidation if no good match could be found.
Simplified depth architecture. Green components run on the GPU, yellow on the CPU, and blue on the Pixel Neural Core.
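Subpixel refinement of a winning match is commonly done by fitting a parabola through the matching costs at the best disparity and its two neighbors; the sketch below shows that standard technique, together with a simple invalidation path when the cost curve is too flat to trust. This is a generic illustration of subpixel refinement and invalidation, not the exact formulation used in uDepth.

```python
def refine_subpixel(costs, d_best):
    """Parabolic subpixel refinement: fit a parabola through the costs at
    d_best - 1, d_best, d_best + 1 and return the refined disparity, or
    None (invalidated) if no well-defined minimum exists."""
    if d_best <= 0 or d_best >= len(costs) - 1:
        return None  # cannot fit a parabola at the edge of the search range
    c0, c1, c2 = costs[d_best - 1], costs[d_best], costs[d_best + 1]
    denom = c0 - 2.0 * c1 + c2
    if denom <= 1e-6:  # flat or ambiguous cost curve, no good match
        return None
    return d_best + 0.5 * (c0 - c2) / denom

print(refine_subpixel([5.0, 2.0, 1.0, 2.5, 6.0], 2))  # ~1.9
```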
When a phone experiences a severe drop, it can result in the factory calibration of the stereo cameras diverging from the actual position of the cameras. To ensure high-quality results during real-world use, the uDepth system is self-calibrating. A scoring routine evaluates every depth image for signs of miscalibration, and builds up confidence in the state of the device. If miscalibration is detected, calibration parameters are regenerated from the current scene. This follows a pipeline consisting of feature detection and correspondence, subpixel refinement (taking advantage of the dot profile), and bundle adjustment.
Left: Stereo depth with inaccurate calibration. Right: After autocalibration.
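The details of the scoring routine are not spelled out here, but the general pattern is easy to sketch: each depth frame contributes evidence for or against the current calibration, and recalibration is triggered only once confidence has eroded. Everything in the snippet below (the per-frame score, the thresholds, the decay) is a made-up illustration of that pattern, not uDepth's actual logic.

```python
def update_calibration_state(confidence, frame_score,
                             decay=0.9, good_score=0.6, recal_below=0.2):
    """Hypothetical confidence tracker: frames scoring well (for example, a
    high fraction of valid, consistent depth pixels) restore confidence in
    the factory calibration; poorly scoring frames erode it. When confidence
    falls below a threshold, recalibration from the current scene is run."""
    if frame_score >= good_score:
        confidence = min(1.0, confidence + 0.05)
    else:
        confidence *= decay
    return confidence, confidence < recal_below

# Example: a run of bad frames gradually triggers recalibration.
conf, recalibrate = 1.0, False
for score in [0.7, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
              0.1, 0.1, 0.1, 0.1, 0.1]:
    conf, recalibrate = update_calibration_state(conf, score)
print(conf, recalibrate)
```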
For more details, please refer to Slanted O(1) Stereo, upon which uDepth is based.
Depth for Computational Photography
The raw data from the uDepth sensor is designed to be accurate and metric, which is a fundamental requirement for face unlock. Computational photography applications such as portrait mode and 3D photos have very different needs. In these use cases, it is not critical to achieve video frame rates, but the depth should be smooth, edge-aligned and complete in the whole field-of-view of the color camera.
Left to right: raw depth sensing result, predicted depth, 3D photo. Notice the smooth rotation of the wall, demonstrating a continuous depth gradient rather than a single focal plane.
To achieve this, we trained an end-to-end deep learning architecture that enhances the raw uDepth data, inferring a complete, dense 3D depth map. We use a combination of RGB images, people segmentation, and raw depth, with a dropout scheme forcing the network to make use of each of the inputs.
Architecture for computational photography depth enhancement.
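One way to implement such a dropout scheme is to randomly blank out entire input modalities during training so that the network cannot become dependent on any single one. The snippet below is a minimal PyTorch sketch of that idea; the drop probability, the zero-filling, and the guarantee that at least one input survives are our assumptions rather than the published training recipe.

```python
import torch

def modality_dropout(rgb, segmentation, raw_depth, p=0.3):
    """Randomly zero out whole input modalities (with probability p each)
    before concatenating them along the channel dimension."""
    inputs = [rgb, segmentation, raw_depth]
    keep = [bool(torch.rand(()) > p) for _ in inputs]
    if not any(keep):  # always keep at least one modality
        keep[int(torch.randint(len(inputs), (1,)))] = True
    dropped = [x if k else torch.zeros_like(x) for x, k in zip(inputs, keep)]
    return torch.cat(dropped, dim=1)

# Example batch: 3-channel RGB, 1-channel segmentation, 1-channel raw depth.
rgb = torch.randn(2, 3, 96, 96)
seg = torch.randn(2, 1, 96, 96)
depth = torch.randn(2, 1, 96, 96)
print(modality_dropout(rgb, seg, depth).shape)  # torch.Size([2, 5, 96, 96])
```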
To acquire ground truth, we leveraged a volumetric capture system that can produce near-photorealistic models of people using a geodesic sphere outfitted with 331 custom color LED lights, an array of high-resolution cameras, and a set of custom high-resolution depth sensors. We added Pixel 4 phones to the setup and synchronized them with the rest of the hardware (lights and cameras). The generated training data consists of a combination of real images as well as synthetic renderings from the Pixel 4 camera viewpoint.
Data acquisition overview.
Putting It All Together
With all of these components in place, uDepth produces both a depth stream at 30Hz (exposed via Camera2) and smooth, post-processed depth maps for photography (exposed via the Google Camera App when you take a depth-enabled selfie). The smooth, dense, per-pixel depth that our system produces is available on every Pixel 4 selfie with Social Media Depth features enabled, and can be used for post-capture effects such as bokeh and 3D photos for social media.
Example applications. Notice the multiple focal planes in the 3D photo on the right.
Finally, we are happy to provide a demo application for you to play with that visualizes a real-time point cloud from uDepth — download it here (this app is for demonstration and research purposes only and not intended for commercial use; Google will not provide any support or updates). This demo app visualizes 3D point clouds from your Pixel 4 device. Because the depth maps are time-synchronized and in the same coordinate system as the RGB images, a textured view of the 3D scene can be shown, as in the example visualization below:
Example single-frame, RGB point cloud from uDepth on the Pixel 4.
Acknowledgements
This work would not have been possible without the contributions of many, many people, including but not limited to Peter Barnum, Cheng Wang, Matthias Kramm, Jack Arendt, Scott Chung, Vaibhav Gupta, Clayton Kimber, Jeremy Swerdlow, Vladimir Tankovich, Christian Haene, Yinda Zhang, Sergio Orts Escolano, Sean Ryan Fanello, Anton Mikhailov, Philippe Bouchilloux, Mirko Schmidt, Ruofei Du, Karen Zhu, Charlie Wang, Jonathan Taylor, Katrina Passarella, Eric Meisner, Vitalii Dziuba, Ed Chang, Phil Davidson, Rohit Pandey, Pavel Podlipensky, David Kim, Jay Busch, Cynthia Socorro Herrera, Matt Whalen, Peter Lincoln, Geoff Harvey, Christoph Rhemann, Zhijie Deng, Daniel Finchelstein, Jing Pu, Chih-Chung Chang, Eddy Hsu, Tian-yi Lin, Sam Chang, Isaac Christensen, Donghui Han, Speth Chang, Zhijun He, Gabriel Nava, Jana Ehmann, Yichang Shih, Chia-Kai Liang, Isaac Reynolds, Dillon Sharlet, Steven Johnson, Zalman Stern, Jiawen Chen, Ricardo Martin Brualla, Supreeth Achar, Mike Mehlman, Brandon Barbello, Chris Breithaupt, Michael Rosenfield, Gopal Parupudi, Steve Goldberg, Tim Knight, Raj Singh, Shahram Izadi, as well as many other colleagues across Devices and Services, Google Research, Android and X.