People can recover visual information about 3D environments from a picture quite easily: we identify objects, judge instance sizes, and reconstruct the 3D scene layout, all from the limited signals contained in 2D images. The underlying challenge is commonly known as the inverse projection problem, which refers to resolving the ambiguous mapping from retinal images back to the sources of retinal stimulation. Real-world computer vision applications, such as autonomous driving, rely heavily on these capabilities to localize and identify 3D objects, which requires vision models to infer the spatial location, semantic class, and instance label for each 3D point projected onto the 2D images. The ability to reconstruct the 3D world from images can be decomposed into two computer vision tasks: monocular depth estimation (predicting depth from a single image) and video panoptic segmentation (the unification of instance segmentation and semantic segmentation, in the video domain). However, research has generally considered each task separately. Tackling these tasks jointly with a unified computer vision model could result in easier deployment and greater efficiency by sharing computation among multiple tasks.
Driven by the potential value of a model that predicts depth and video panoptic segmentation at the same time, we present “ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation”, accepted to CVPR 2021. In this work, we propose a new task, depth-aware video panoptic segmentation, that aims to simultaneously tackle monocular depth estimation and video panoptic segmentation. For the new task, we present two derived datasets accompanied by a new evaluation metric called depth-aware video panoptic quality (DVPQ). This new metric includes the metrics for depth estimation and video panoptic segmentation, requiring a vision model to simultaneously tackle the two sub-tasks. To this end, we extend Panoptic-DeepLab by adding network branches for depth and video predictions to create ViP-DeepLab, a unified model that jointly performs video panoptic segmentation and monocular depth estimation for each pixel on the image plane, and achieves state-of-the-art performance on several academic datasets for the sub-tasks. This video demonstrates the new task and shows the results of ViP-DeepLab.
Overview

While Panoptic-DeepLab outputs semantic segmentation, center prediction, and center regression for a single frame, it cannot estimate depth or predict temporally consistent instance IDs across multiple frames. ViP-DeepLab adds both capabilities by making additional predictions from two consecutive input frames. The first additional output is depth estimation for the first frame, which assigns an estimated depth to each pixel. ViP-DeepLab also performs center regression for both frames, but only with respect to the object centers that appear in the first frame. This process is called center offset prediction, and it allows ViP-DeepLab to group the pixels of an object in both frames to the same instance from the first frame. Pixels that are not grouped with a previously detected instance form new instances. This process repeats for every pair of consecutive frames (with one overlapping frame) in a video sequence, stitching panoptic predictions together to form predictions with temporally consistent instance IDs. That is, it stitches together where objects are and how they move through a video scene over time.
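To make the stitching step more concrete, here is a minimal sketch in plain Python. It assumes that stitching is done by comparing, on the overlapping frame, the instance IDs carried over from the previous frame pair with the IDs freshly predicted by the next pair, and matching them by mask IoU. The function name `match_ids_by_iou`, the flattened-list mask format, the use of 0 for "no instance", and the `min_iou` threshold are all illustrative assumptions, not details taken from the ViP-DeepLab implementation.

```python
from collections import Counter

def match_ids_by_iou(propagated, fresh, next_free_id, min_iou=0.5):
    """Map each instance ID in `fresh` to a temporally consistent ID.

    propagated: per-pixel IDs for the overlapping frame, carried over from
                the previous frame pair (0 means "no instance").
    fresh:      per-pixel IDs for the same frame, as predicted by the next
                frame pair (0 means "no instance").
    next_free_id: first ID that has not been used anywhere in the video yet.
    """
    area_p = Counter(p for p in propagated if p != 0)
    area_f = Counter(f for f in fresh if f != 0)
    inter = Counter(
        (p, f) for p, f in zip(propagated, fresh) if p != 0 and f != 0
    )

    # Score every overlapping (propagated, fresh) pair by mask IoU.
    scored = sorted(
        ((n / (area_p[p] + area_f[f] - n), p, f) for (p, f), n in inter.items()),
        reverse=True,
    )

    # Greedy one-to-one matching, best IoU first.
    mapping, used = {}, set()
    for iou, p, f in scored:
        if iou >= min_iou and f not in mapping and p not in used:
            mapping[f] = p
            used.add(p)

    # Anything left unmatched is treated as a newly emerged instance.
    for f in sorted(area_f):
        if f not in mapping:
            mapping[f] = next_free_id
            next_free_id += 1
    return mapping

# Toy overlapping frame with 8 pixels.
propagated = [7, 7, 7, 0, 0, 9, 9, 0]
fresh = [1, 1, 1, 1, 0, 2, 2, 3]
mapping = match_ids_by_iou(propagated, fresh, next_free_id=10)
stitched = [mapping.get(f, 0) for f in fresh]
print(mapping)
print(stitched)
```

In this toy example, fresh instances 1 and 2 inherit the IDs 7 and 9 from the previous pair, while instance 3 has no counterpart and receives a new ID. A real system would run this on 2D masks and would likely also require matched regions to share a semantic class.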
Neural Network Design

Building on top of Panoptic-DeepLab, ViP-DeepLab contains two additional prediction branches: (1) a depth prediction branch, and (2) a next-frame instance branch. The depth prediction head is a simple design that regresses a depth value for every pixel, while the next-frame instance branch predicts, for each pixel in the second frame, the offset to its object center in the first frame.
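The role of the next-frame instance branch can be illustrated with a small sketch of how predicted center offsets might be turned into instance labels. It assumes each second-frame pixel votes for a location (its own position plus its predicted offset) and is assigned to the nearest first-frame center. The function `group_pixels_to_centers` and its data layout are hypothetical, and the sketch ignores the semantic mask and the handling of newly appearing instances.

```python
def group_pixels_to_centers(pixels, offsets, centers):
    """Assign each second-frame pixel to a first-frame instance center.

    pixels:  list of (y, x) pixel coordinates in the second frame.
    offsets: list of predicted (dy, dx) offsets, one per pixel, pointing
             toward the pixel's object center in the first frame.
    centers: dict mapping instance ID -> (cy, cx) center in the first frame.
    """
    labels = []
    for (y, x), (dy, dx) in zip(pixels, offsets):
        vote_y, vote_x = y + dy, x + dx  # location this pixel votes for

        def sq_dist(instance_id):
            cy, cx = centers[instance_id]
            return (cy - vote_y) ** 2 + (cx - vote_x) ** 2

        # Pick the center closest to the voted location.
        labels.append(min(centers, key=sq_dist))
    return labels

centers = {1: (2.0, 2.0), 2: (10.0, 10.0)}
pixels = [(3, 3), (9, 11), (5, 5)]
offsets = [(-1.0, -1.0), (1.0, -1.0), (4.5, 4.5)]
labels = group_pixels_to_centers(pixels, offsets, centers)
print(labels)
```

Because the second-frame pixels are regressed to first-frame centers, an object keeps the same instance ID in both frames without a separate tracking step.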
Results

We have tested ViP-DeepLab on multiple popular benchmarks, including Cityscapes-VPS, KITTI Depth Prediction, and KITTI Multi-Object Tracking and Segmentation (MOTS).
Specifically, ViP-DeepLab achieves state-of-the-art (SOTA) results, significantly outperforming previous methods by 5.1% video panoptic quality (VPQ) on the Cityscapes-VPS test set.
ViP-DeepLab ranks 1st on the KITTI depth prediction benchmark, improving over previous methods by 0.65 SILog (the smaller the better).
ViP-DeepLab also ranked 1st on KITTI MOTS pedestrians and 3rd on KITTI MOTS cars under the sMOTSA metric, and currently ranks 3rd for both pedestrians and cars under the newer HOTA metric.
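For readers unfamiliar with SILog, the metric used in the KITTI depth ranking above, the sketch below computes the scale-invariant logarithmic error in its commonly used form: the standard deviation of the per-pixel log-depth differences. The multiplication by 100 follows what we understand to be the KITTI reporting convention and should be treated as an assumption, as should the function name `silog` and the flat-list input format.

```python
import math

def silog(pred, gt):
    """Scale-invariant logarithmic error between predicted and true depths.

    pred, gt: equal-length sequences of positive depths. Pixels whose
    ground-truth depth is not positive are treated as missing and skipped.
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    mean = sum(d) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(mean_sq - mean * mean, 0.0)
    return math.sqrt(variance) * 100.0  # x100: assumed KITTI convention

# A prediction that is off by a constant scale factor incurs (almost) no error.
scaled = silog([2.0, 4.0, 8.0], [1.0, 2.0, 4.0])
# A prediction whose log-error varies between pixels is penalized.
uneven = silog([1.0, math.e], [1.0, 1.0])
print(scaled, uneven)
```

The first call shows why the metric is called scale-invariant: doubling every predicted depth leaves the error at essentially zero, so only inconsistencies in relative depth are penalized.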
Finally, we also present two new datasets for the new task, depth-aware video panoptic segmentation, and test ViP-DeepLab on them. We hope our ViP-DeepLab results on these two new datasets will serve as a strong baseline for the community to compare against. The results are shown below.
Conclusion

With a simple architecture, ViP-DeepLab achieves state-of-the-art performance on video panoptic segmentation, monocular depth estimation, and multi-object tracking and segmentation. We hope that along with MaX-DeepLab, which proposes an efficient dual-path transformer module that allows for end-to-end image panoptic segmentation, ViP-DeepLab is useful to the community and furthers research into a more holistic understanding of scenes in the real world.
Acknowledgements

We would like to thank Yukun Zhu, Hartwig Adam, and Alan Yuille (co-authors of ViP-DeepLab), as well as Maxwell Collins and the Mobile Vision team, for their support and valuable discussions.