Frame interpolation is the process of synthesizing in-between images from a given set of images. The technique is often used for temporal up-sampling to increase the refresh rate of videos or to create slow motion effects. Nowadays, with digital cameras and smartphones, we often take several photos within a few seconds to capture the best picture. Interpolating between these “near-duplicate” photos can lead to engaging videos that reveal scene motion, often delivering an even more pleasing sense of the moment than the original photos.
Frame interpolation between consecutive video frames, which often have small motion, has been studied extensively. Unlike videos, however, the temporal spacing between near-duplicate photos can be several seconds, with commensurately large in-between motion, which is a major failing point of existing frame interpolation methods. Recent methods attempt to handle large motion by training on datasets with extreme motion, albeit with limited effectiveness on smaller motions.
In “FILM: Frame Interpolation for Large Motion”, published at ECCV 2022, we present a method to create high quality slow-motion videos from near-duplicate photos. FILM is a new neural network architecture that achieves state-of-the-art results in large motion, while also handling smaller motions well.
FILM Model Overview The FILM model takes two images as input and outputs a middle image. At inference time, we recursively invoke the model to output in-between images. FILM has three components: (1) A feature extractor that summarizes each input image with deep multi-scale (pyramid) features; (2) a bi-directional motion estimator that computes pixel-wise motion (i.e., flows) at each pyramid level; and (3) a fusion module that outputs the final interpolated image. We train FILM on regular video frame triplets, with the middle frame serving as the ground-truth for supervision.
Scale-Agnostic Feature Extraction Large motion is typically handled with hierarchical motion estimation using multi-resolution feature pyramids (shown above). However, this method struggles with small and fast-moving objects because they can disappear at the deepest pyramid levels. In addition, there are far fewer available pixels to derive supervision at the deepest level.
To overcome these limitations, we adopt a feature extractor that shares weights across scales to create a “scale-agnostic” feature pyramid. This feature extractor (1) allows the use of a shared motion estimator across pyramid levels (next section) by equating large motion at shallow levels with small motion at deeper levels, and (2) creates a compact network with fewer weights.
Specifically, given two input images, we first create an image pyramid by successively downsampling each image. Next, we use a shared U-Net convolutional encoder to extract a smaller feature pyramid from each image pyramid level (columns in the figure below). As the third and final step, we construct a scale-agnostic feature pyramid by horizontally concatenating features from different convolution layers that have the same spatial dimensions. Note that from the third level onwards, the feature stack is constructed with the same set of shared convolution weights (shown in the same color). This ensures that all features are similar, which allows us to continue to share weights in the subsequent motion estimator. The figure below depicts this process using four pyramid levels, but in practice, we use seven.
Bi-directional Flow Estimation After feature extraction, FILM performs pyramid-based residual flow estimation to compute the flows from the yet-to-be-predicted middle image to the two inputs. The flow estimation is done once for each input, starting from the deepest level, using a stack of convolutions. We estimate the flow at a given level by adding a residual correction to the upsampled estimate from the next deeper level. This approach takes the following as its input: (1) the features from the first input at that level, and (2) the features of the second input after it is warped with the upsampled estimate. The same convolution weights are shared across all levels, except for the two finest levels.
Shared weights allow the interpretation of small motions at deeper levels to be the same as large motions at shallow levels, boosting the number of pixels available for large motion supervision. Additionally, shared weights not only enable the training of powerful models that may reach a higher peak signal-to-noise ratio (PSNR), but are also needed to enable models to fit into GPU memory for practical applications.
Fusion and Frame Generation Once the bi-directional flows are estimated, we warp the two feature pyramids into alignment. We obtain a concatenated feature pyramid by stacking, at each pyramid level, the two aligned feature maps, the bi-directional flows and the input images. Finally, a U-Net decoder synthesizes the interpolated output image from the aligned and stacked feature pyramid.
Loss Functions During training, we supervise FILM by combining three losses. First, we use the absolute L1 difference between the predicted and ground-truth frames to capture the motion between input images. However, this produces blurry images when used alone. Second, we use perceptual loss to improve image fidelity. This minimizes the L1 difference between the ImageNet pre-trained VGG-19 features extracted from the predicted and ground truth frames. Third, we use Style loss to minimize the L2 difference between the Gram matrix of the ImageNet pre-trained VGG-19 features. The Style loss enables the network to produce sharp images and realistic inpaintings of large pre-occluded regions. Finally, the losses are combined with weights empirically selected such that each loss contributes equally to the total loss.
Shown below, the combined loss greatly improves sharpness and image fidelity when compared to training FILM with L1 loss and VGG losses. The combined loss maintains the sharpness of the tree leaves.
Image and Video Results We evaluate FILM on an internal near-duplicate photos dataset that exhibits large scene motion. Additionally, we compare FILM to recent frame interpolation methods: SoftSplat and ABME. FILM performs favorably when interpolating across large motion. Even in the presence of motion as large as 100 pixels, FILM generates sharp images consistent with the inputs.
Conclusion We introduce FILM, a large motion frame interpolation neural network. At its core, FILM adopts a scale-agnostic feature pyramid that shares weights across scales, which allows us to build a “scale-agnostic” bi-directional motion estimator that learns from frames with normal motion and generalizes well to frames with large motion. To handle wide disocclusions caused by large scene motion, we supervise FILM by matching the Gram matrix of ImageNet pre-trained VGG-19 features, which results in realistic inpainting and crisp images. FILM performs favorably on large motion, while also handling small and medium motions well, and generates temporally smooth high quality videos.
Try It Out Yourself You can try out FILM on your photos using the source code, which is now publicly available.
Acknowledgements We would like to thank Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless for their contributions. We thank Marc Comino Trinidad for his contributions on the scale-agnostic feature extractor, Orly Liba and Charles Herrmann for feedback on the text, Jamie Aspinall for the imagery in the paper, Dominik Kaeser, Yael Pritch, Michael Nechyba, William T. Freeman, David Salesin, Catherine Wah, and Ira Kemelmacher-Shlizerman for support. Thanks to Tom Small for creating the animated diagram in this post.
Deep reinforcement learning (RL) continues to make great strides in solving real-world sequential decision-making problems such as balloon navigation, nuclear physics, robotics, and games. Despite its promise, one of its limiting factors is long training times. While the current approach to speed up RL training on complex and difficult tasks leverages distributed training scaling up to hundreds or even thousands of computing nodes, it still requires the use of significant hardware resources which makes RL training expensive, while increasing its environmental impact. However, recent work [1, 2] indicates that performance optimizations on existing hardware can reduce the carbon footprint (i.e., total greenhouse gas emissions) of training and inference.
RL can also benefit from similar system optimization techniques that can reduce training time, improve hardware utilization and reduce carbon dioxide (CO2) emissions. One such technique is quantization, a process that converts full-precision floating point (FP32) numbers to lower precision (int8) numbers and then performs computation using the lower precision numbers. Quantization can save memory storage cost and bandwidth for faster and more energy-efficient computation. Quantization has been successfully applied to supervised learning to enable edge deployments of machine learning (ML) models and achieve faster training. However, there remains an opportunity to apply quantization to RL training.
To that end, we present “QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning”, published in the Transactions of Machine Learning Research journal, which introduces a new paradigm called ActorQ that applies quantization to speed up RL training by 1.5-5.4x while maintaining performance. Additionally, we demonstrate that compared to training in full-precision, the carbon footprint is also significantly reduced by a factor of 1.9-3.8x.
Applying Quantization to RL Training In traditional RL training, a learner policy is applied to an actor, which uses the policy to explore the environment and collect data samples. The samples collected by the actor are then used by the learner to continuously refine the initial policy. Periodically, the policy trained on the learner side is used to update the actor's policy. To apply quantization to RL training, we develop the ActorQ paradigm. ActorQ performs the same sequence described above, with one key difference being that the policy update from learner to actors is quantized, and the actor explores the environment using the int8 quantized policy to collect samples.
Applying quantization to RL training in this fashion has two key benefits. First, it reduces the memory footprint of the policy. For the same peak bandwidth, less data is transferred between learners and actors, which reduces the communication cost for policy updates from learners to actors. Second, the actors perform inference on the quantized policy to generate actions for a given environment state. The quantized inference process is much faster when compared to performing inference in full precision.
In ActorQ, we use the ACME distributed RL framework. The quantizer block performs uniform quantization that converts the FP32 policy to int8. The actor performs inference using optimized int8 computations. Though we use uniform quantization when designing the quantizer block, we believe that other quantization techniques can replace uniform quantization and produce similar results. The samples collected by the actors are used by the learner to train a neural network policy. Periodically the learned policy is quantized by the quantizer block and broadcasted to the actors.
Quantization Improves RL Training Time and Performance We evaluate ActorQ in a range of environments, including the Deepmind Control Suite and the OpenAI Gym. We demonstrate the speed-up and improved performance of D4PG and DQN. We chose D4PG as it was the best learning algorithm in ACME for Deepmind Control Suite tasks, and DQN is a widely used and standard RL algorithm.
We observe a significant speedup (between 1.5x and 5.41x) in training RL policies. More importantly, performance is maintained even when actors perform int8 quantized inference. The figures below demonstrate this for the D4PG and DQN agents for Deepmind Control Suite and OpenAI Gym tasks.
Quantization Reduces Carbon Emission Applying quantization in RL using ActorQ improves training time without affecting performance. The direct consequence of using the hardware more efficiently is a smaller carbon footprint. We measure the carbon footprint improvement by taking the ratio of carbon emission when using the FP32 policy during training over the carbon emission when using the int8 policy during training.
In order to measure the carbon emission for the RL training experiment, we use the experiment-impact-tracker proposed in prior work. We instrument the ActorQ system with carbon monitor APIs to measure the energy and carbon emissions for each training experiment.
Compared to the carbon emission when running in full precision (FP32), we observe that the quantization of policies reduces the carbon emissions anywhere from 1.9x to 3.76x, depending on the task. As RL systems are scaled to run on thousands of distributed hardware cores and accelerators, we believe that the absolute carbon reduction (measured in kilograms of CO2) can be quite significant.
Conclusion and Future Directions We introduce ActorQ, a novel paradigm that applies quantization to RL training and achieves speed-up improvements of 1.5-5.4x while maintaining performance. Additionally, we demonstrate that ActorQ can reduce RL training’s carbon footprint by a factor of 1.9-3.8x compared to training in full-precision without quantization.
ActorQ demonstrates that quantization can be effectively applied to many aspects of RL, from obtaining high-quality and efficient quantized policies to reducing training times and carbon emissions. As RL continues to make great strides in solving real-world problems, we believe that making RL training sustainable will be critical for adoption. As we scale RL training to thousands of cores and GPUs, even a 50% improvement (as we have experimentally demonstrated) will generate significant savings in absolute dollar cost, energy, and carbon emissions. Our work is the first step toward applying quantization to RL training to achieve efficient and environmentally sustainable training.
While our design of the quantizer in ActorQ relied on simple uniform quantization, we believe that other forms of quantization, compression and sparsity can be applied (e.g., distillation, sparsification, etc.). We hope that future work will consider applying more aggressive quantization and compression methods, which may yield additional benefits to the performance and accuracy tradeoff obtained by the trained RL policies.
Acknowledgments We would like to thank our co-authors Max Lam, Sharad Chitlangia, Zishen Wan, and Vijay Janapa Reddi (Harvard University), and Gabriel Barth-Maron (DeepMind), for their contribution to this work. We also thank the Google Cloud team for providing research credits to seed this work.
Many exciting contemporary applications of computer science and machine learning (ML) manipulate multidimensional datasets that span a single large coordinate system, for example, weather modeling from atmospheric measurements over a spatial grid or medical imaging predictions from multi-channel image intensity values in a 2d or 3d scan. In these settings, even a single dataset may require terabytes or petabytes of data storage. Such datasets are also challenging to work with as users may read and write data at irregular intervals and varying scales, and are often interested in performing analyses using numerous machines working in parallel.
Today we are introducing TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data that:
TensorStore has already been used to solve key engineering challenges in scientific computing (e.g., management and processing of large datasets in neuroscience, such as peta-scale 3d electron microscopy data and “4d” videos of neuronal activity). TensorStore has also been used in the creation of large-scale machine learning models such as PaLM by addressing the problem of managing model parameters (checkpoints) during distributed training.
Familiar API for Data Access and Manipulation TensorStore provides a simple Python API for loading and manipulating large array data. In the following example, we create a TensorStore object that represents a 56 trillion voxel 3d image of a fly brain and access a small 100x100 patch of the data as a NumPy array:
>>> import tensorstore as ts >>> import numpy as np # Create a TensorStore object to work with fly brain data. >>> dataset = ts.open({ ... 'driver': ... 'neuroglancer_precomputed', ... 'kvstore': ... 'gs://neuroglancer-janelia-flyem-hemibrain/' + ... 'v1.1/segmentation/', ... }).result() # Create a 3-d view (remove singleton 'channel' dimension): >>> dataset_3d = dataset[ts.d['channel'][0]] >>> dataset_3d.domain { "x": [0, 34432), "y": [0, 39552), "z": [0, 41408) } # Convert a 100x100x1 slice of the data to a numpy ndarray >>> slice = np.array(dataset_3d[15000:15100, 15000:15100, 20000])
Crucially, no actual data is accessed or stored in memory until the specific 100x100 slice is requested; hence arbitrarily large underlying datasets can be loaded and manipulated without having to store the entire dataset in memory, using indexing and manipulation syntax largely identical to standard NumPy operations. TensorStore also provides extensive support for advanced indexing features, including transforms, alignment, broadcasting, and virtual views (data type conversion, downsampling, lazily on-the-fly generated arrays).
The following example demonstrates how TensorStore can be used to create a zarr array, and how its asynchronous API enables higher throughput:
>>> import tensorstore as ts >>> import numpy as np >>> # Create a zarr array on the local filesystem >>> dataset = ts.open({ ... 'driver': 'zarr', ... 'kvstore': 'file:///tmp/my_dataset/', ... }, ... dtype=ts.uint32, ... chunk_layout=ts.ChunkLayout(chunk_shape=[256, 256, 1]), ... create=True, ... shape=[5000, 6000, 7000]).result() >>> # Create two numpy arrays with example data to write. >>> a = np.arange(100*200*300, dtype=np.uint32).reshape((100, 200, 300)) >>> b = np.arange(200*300*400, dtype=np.uint32).reshape((200, 300, 400)) >>> # Initiate two asynchronous writes, to be performed concurrently. >>> future_a = dataset[1000:1100, 2000:2200, 3000:3300].write(a) >>> future_b = dataset[3000:3200, 4000:4300, 5000:5400].write(b) >>> # Wait for the asynchronous writes to complete >>> future_a.result() >>> future_b.result()
Safe and Performant Scaling Processing and analyzing large numerical datasets requires significant computational resources. This is typically achieved through parallelization across numerous CPU or accelerator cores spread across many machines. Therefore a fundamental goal of TensorStore has been to enable parallel processing of individual datasets that is both safe (i.e., avoids corruption or inconsistencies arising from parallel access patterns) and high performance (i.e., reading and writing to TensorStore is not a bottleneck during computation). In fact, in a test within Google’s datacenters, we found nearly linear scaling of read and write performance as the number of CPUs was increased:
Performance is achieved by implementing core operations in C++, extensive use of multithreading for operations such as encoding/decoding and network I/O, and partitioning large datasets into much smaller units through chunking to enable efficiently reading and writing subsets of the entire dataset. TensorStore also provides configurable in-memory caching (which reduces slower storage system interactions for frequently accessed data) and an asynchronous API that enables a read or write operation to continue in the background while a program completes other work.
Safety of parallel operations when many machines are accessing the same dataset is achieved through the use of optimistic concurrency, which maintains compatibility with diverse underlying storage layers (including Cloud storage platforms, such as GCS, as well as local filesystems) without significantly impacting performance. TensorStore also provides strong ACID guarantees for all individual operations executing within a single runtime.
To make distributed computing with TensorStore compatible with many existing data processing workflows, we have also integrated TensorStore with parallel computing libraries such as Apache Beam (example code) and Dask (example code).
Use Case: Language Models An exciting recent development in ML is the emergence of more advanced language models such as PaLM. These neural networks contain hundreds of billions of parameters and exhibit some surprising capabilities in natural language understanding and generation. These models also push the limits of computational infrastructure; in particular, training a language model such as PaLM requires thousands of TPUs working in parallel.
One challenge that arises during this training process is efficiently reading and writing the model parameters. Training is distributed across many separate machines, but parameters must be regularly saved to a single object (“checkpoint”) on a permanent storage system without slowing down the overall training process. Individual training jobs must also be able to read just the specific set of parameters they are concerned with in order to avoid the overhead that would be required to load the entire set of model parameters (which could be hundreds of gigabytes).
TensorStore has already been used to address these challenges. It has been applied to manage checkpoints associated with large-scale (“multipod”) models trained with JAX (code example) and has been integrated with frameworks such as T5X (code example) and Pathways. Model parallelism is used to partition the full set of parameters, which can occupy more than a terabyte of memory, over hundreds of TPUs. Checkpoints are stored in zarr format using TensorStore, with a chunk structure chosen to allow the partition for each TPU to be read and written independently in parallel.
Use Case: 3D Brain Mapping The field of synapse-resolution connectomics aims to map the wiring of animal and human brains at the detailed level of individual synaptic connections. This requires imaging the brain at extremely high resolution (nanometers) over fields of view of up to millimeters or more, which yields datasets that can span petabytes in size. In the future these datasets may extend to exabytes as scientists contemplate mapping entire mouse or primate brains. However, even current datasets pose significant challenges related to storage, manipulation, and processing; in particular, even a single brain sample may require millions of gigabytes with a coordinate system (pixel space) of hundreds of thousands pixels in each dimension.
We have used TensorStore to solve computational challenges associated with large-scale connectomic datasets. Specifically, TensorStore has managed some of the largest and most widely accessed connectomic datasets, with Google Cloud Storage as the underlying object storage system. For example, it has been applied to the human cortex “h01” dataset, which is a 3d nanometer-resolution image of human brain tissue. The raw imaging data is 1.4 petabytes (roughly 500,000 * 350,000 * 5,000 pixels large, and is further associated with additional content such as 3d segmentations and annotations that reside in the same coordinate system. The raw data is subdivided into individual chunks 128x128x16 pixels large and stored in the “Neuroglancer precomputed” format, which is optimized for web-based interactive viewing and can be easily manipulated from TensorStore.
Getting Started To get started using the TensorStore Python API, you can install the tensorstore PyPI package using:
pip install tensorstore
Refer to the tutorials and API documentation for usage details. For other installation options and for using the C++ API, refer to installation instructions.
Acknowledgements Thanks to Tim Blakely, Viren Jain, Yash Katariya, Jan-Matthis Luckmann, Michał Januszewski, Peter Li, Adam Roberts, Brain Williams, and Hector Yee from Google Research, and Davis Bennet, Stuart Berg, Eric Perlman, Stephen Plaza, and Juan Nunez-Iglesias from the broader scientific community for valuable feedback on the design, early testing and debugging.
A long-standing problem in the intersection of computer vision and computer graphics, view synthesis is the task of creating new views of a scene from multiple pictures of that scene. This has received increased attention [1, 2, 3] since the introduction of neural radiance fields (NeRF). The problem is challenging because to accurately synthesize new views of a scene, a model needs to capture many types of information — its detailed 3D structure, materials, and illumination — from a small set of reference images.
In this post, we present recently published deep learning models for view synthesis. In “Light Field Neural Rendering” (LFNR), presented at CVPR 2022, we address the challenge of accurately reproducing view-dependent effects by using transformers that learn to combine reference pixel colors. Then in “Generalizable Patch-Based Neural Rendering” (GPNR), to be presented at ECCV 2022, we address the challenge of generalizing to unseen scenes by using a sequence of transformers with canonicalized positional encoding that can be trained on a set of scenes to synthesize views of new scenes. These models have some unique features. They perform image-based rendering, combining colors and features from the reference images to render novel views. They are purely transformer-based, operating on sets of image patches, and they leverage a 4D light field representation for positional encoding, which helps to model view-dependent effects.
Overview The input to the models consists of a set of reference images and their camera parameters (focal length, position, and orientation in space), along with the coordinates of the target ray whose color we want to determine. To produce a new image, we start from the camera parameters of the input images, obtain the coordinates of the target rays (each corresponding to a pixel), and query the model for each.
Instead of processing each reference image completely, we look only at the regions that are likely to influence the target pixel. These regions are determined via epipolar geometry, which maps each target pixel to a line on each reference frame. For robustness, we take small regions around a number of points on the epipolar line, resulting in the set of patches that will actually be processed by the model. The transformers then act on this set of patches to obtain the color of the target pixel.
Transformers are especially useful in this setting since their self-attention mechanism naturally takes sets as inputs, and the attention weights themselves can be used to combine reference view colors and features to predict the output pixel colors. These transformers follow the architecture introduced in ViT.
Light Field Neural Rendering In Light Field Neural Rendering (LFNR), we use a sequence of two transformers to map the set of patches to the target pixel color. The first transformer aggregates information along each epipolar line, and the second along each reference image. We can interpret the first transformer as finding potential correspondences of the target pixel on each reference frame, and the second as reasoning about occlusion and view-dependent effects, which are common challenges of image-based rendering.
LFNR improved the state-of-the-art on the most popular view synthesis benchmarks (Blender and Real Forward-Facing scenes from NeRF and Shiny from NeX) with margins as large as 5dB peak signal-to-noise ratio (PSNR). This corresponds to a reduction of the pixel-wise error by a factor of 1.8x. We show qualitative results on challenging scenes from the Shiny dataset below:
Generalizing to New Scenes One limitation of LFNR is that the first transformer collapses the information along each epipolar line independently for each reference image. This means that it decides which information to preserve based only on the output ray coordinates and patches from each reference image, which works well when training on a single scene (as most neural rendering methods do), but it does not generalize across scenes. Generalizable methods are important because they can be applied to new scenes without needing to retrain.
We overcome this limitation of LFNR in Generalizable Patch-Based Neural Rendering (GPNR). We add a transformer that runs before the other two and exchanges information between points at the same depth over all reference images. For example, this first transformer looks at the columns of the patches from the park bench shown above and can use cues like the flower that appears at corresponding depths in two views, which indicates a potential match. Another key idea of this work is to canonicalize the positional encoding based on the target ray, because to generalize across scenes, it is necessary to represent quantities in relative and not absolute frames of reference. The animation below shows an overview of the model.
To evaluate the generalization performance, we train GPNR on a set of scenes and test it on new scenes. GPNR improved the state-of-the-art on several benchmarks (following IBRNet and MVSNeRF protocols) by 0.5–1.0 dB on average. On the IBRNet benchmark, GPNR outperforms the baselines while using only 11% of the training scenes. The results below show new views of unseen scenes rendered with no fine-tuning.
Future Work One limitation of most neural rendering methods, including ours, is that they require camera poses for each input image. Poses are not easy to obtain and typically come from offline optimization methods that can be slow, limiting possible applications, such as those on mobile devices. Research on jointly learning view synthesis and input poses is a promising future direction. Another limitation of our models is that they are computationally expensive to train. There is an active line of research on faster transformers which might help improve our models’ efficiency. For the papers, more results, and open-source code, you can check out the projects pages for "Light Field Neural Rendering" and "Generalizable Patch-Based Neural Rendering".
Potential Misuse In our research, we aim to accurately reproduce an existing scene using images from that scene, so there is little room to generate fake or non-existing scenes. Our models assume static scenes, so synthesizing moving objects, such as people, will not work.
Acknowledgments All the hard work was done by our amazing intern – Mohammed Suhail – a PhD student at UBC, in collaboration with Carlos Esteves and Ameesh Makadia from Google Research, and Leonid Sigal from UBC. We are thankful to Corinna Cortes for supporting and encouraging this project.
Our work is inspired by NeRF, which sparked the recent interest in view synthesis, and IBRNet, which first considered generalization to new scenes. Our light ray positional encoding is inspired by the seminal paper Light Field Rendering and our use of transformers follow ViT.
Video results are from scenes from LLFF, Shiny, and IBRNet collected datasets.
Natural language enables flexible descriptive queries about images. The interaction between text queries and images grounds linguistic meaning in the visual world, facilitating a better understanding of object relationships, human intentions towards objects, and interactions with the environment. The research community has studied object-level visual grounding through a range of tasks, including referring expression comprehension, text-based localization, and more broadly object detection, each of which require different skills in a model. For example, object detection seeks to find all objects from a predefined set of classes, which requires accurate localization and classification, while referring expression comprehension localizes an object from a referring text and often requires complex reasoning on prominent objects. At the intersection of the two is text-based localization, in which a simple category-based text query prompts the model to detect the objects of interest.
Due to their dissimilar task properties, referring expression comprehension, detection, and text-based localization are mostly studied through separate benchmarks with most models only dedicated to one task. As a result, existing models have not adequately synthesized information from the three tasks to achieve a more holistic visual and linguistic understanding. Referring expression comprehension models, for instance, are trained to predict one object per image, and often struggle to localize multiple objects, reject negative queries, or detect novel categories. In addition, detection models are unable to process text inputs, and text-based localization models often struggle to process complex queries that refer to one object instance, such as “Left half sandwich.” Lastly, none of the models can generalize sufficiently well beyond their training data and categories.
To address these limitations, we are presenting “FindIt: Generalized Localization with Natural Language Queries” at ECCV 2022. Here we propose a unified, general-purpose and multitask visual grounding model, called FindIt, that can flexibly answer different types of grounding and detection queries. Key to this architecture is a multi-level cross-modality fusion module that can perform complex reasoning for referring expression comprehension and simultaneously recognize small and challenging objects for text-based localization and detection. In addition, we discover that a standard object detector and detection losses are sufficient and surprisingly effective for all three tasks without the need for task-specific design and losses common in existing works. FindIt is simple, efficient, and outperforms alternative state-of-the-art models on the referring expression comprehension and text-based localization benchmarks, while being competitive on the detection benchmark.
Multi-level Image-Text Fusion Different localization tasks are created with different semantic understanding objectives. For example, because the referring expression task primarily references prominent objects in the image rather than small, occluded or faraway objects, low resolution images generally suffice. In contrast, the detection task aims to detect objects with various sizes and occlusion levels in higher resolution images. Apart from these benchmarks, the general visual grounding problem is inherently multiscale, as natural queries can refer to objects of any size. This motivates the need for a multi-level image-text fusion model for efficient processing of higher resolution images over different localization tasks.
The premise of FindIt is to fuse the higher level semantic features using more expressive transformer layers, which can capture all-pair interactions between image and text. For the lower-level and higher-resolution features, we use a cheaper dot-product fusion to save computation and memory cost. We attach a detector head (e.g., Faster R-CNN) on top of the fused feature maps to predict the boxes and their classes.
Multitask Learning Apart from the multi-level fusion described above, we adapt the text-based localization and detection tasks to take the same inputs as the referring expression comprehension task. For the text-based localization task, we generate a set of queries over the categories present in the image. For any present category, the text query takes the form “Find the [object],” where [object] is the category name. The objects corresponding to that category are labeled as foreground and the other objects as background. Instead of using the aforementioned prompt, we use a static prompt for the detection task, such as “Find all the objects.”. We found that the specific choice of prompts is not important for text-based localization and detection tasks.
object
After adaptation, all tasks in consideration share the same inputs and outputs — an image input, a text query, and a set of output bounding boxes and classes. We then combine the datasets and train on the mixture. Finally, we use the standard object detection losses for all tasks, which we found to be surprisingly simple and effective.
Evaluation We apply FindIt to the popular RefCOCO benchmark for referring expression comprehension tasks. When only the COCO and RefCOCO dataset is available, FindIt outperforms the state-of-the-art-model on all tasks. In the settings where external datasets are allowed, FindIt sets a new state of the art by using COCO and all RefCOCO splits together (no other datasets). On the challenging Google and UMD splits, FindIt outperforms the state of the art by a 10% margin, which, taken together, demonstrate the benefits of multitask learning.
On the text-based localization benchmark, FindIt achieves 79.7%, higher than the GPV (73.0%), and Faster R-CNN baselines (75.2%). Please refer to the paper for more quantitative evaluation.
We further observe that FindIt generalizes better to novel categories and super-categories in the text-based localization task compared to competitive single-task baselines on the popular COCO and Objects365 datasets, shown in the figure below.
Efficiency We also benchmark the inference times on the referring expression comprehension task (see Table below). FindIt is efficient and comparable with existing one-stage approaches while achieving higher accuracy. For fair comparison, all running times are measured on one GTX 1080Ti GPU.
Conclusion We present Findit, which unifies referring expression comprehension, text-based localization, and object detection tasks. We propose multi-scale cross-attention to unify the diverse localization requirements of these tasks. Without any task-specific design, FindIt surpasses the state of the art on referring expression and text-based localization, shows competitive performance on detection, and generalizes better to out-of-distribution data and novel classes. All of these are accomplished in a single, unified, and efficient model.
Acknowledgements This work is conducted by Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. We would like to thank Ashish Vaswani, Prajit Ramachandran, Niki Parmar, David Luan, Tsung-Yi Lin, and other colleagues at Google Research for their advice and helpful discussions. We would like to thank Tom Small for preparing the animation.
This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.
We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).
Organizing Committee
Industry Liaisons include: Bhuvana Ramabahdran
Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn
ISCA Fellows
Include: Tara Sainath, Heiga Zen
Publications
Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays
Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka
Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar
UserLibri: A Dataset for ASR Personalization Using Only Text Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey
SNRi Target Training for Joint Speech Enhancement and Recognition Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani
Turn-Taking Prediction for Natural Conversational Speech Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He
Streaming Intended Query Detection Using E2E Modeling for Continued Conversation Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman
Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee
XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
Extracting Targeted Training Data from ASR Models, and How to Mitigate It Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays
Detecting Unintended Memorization in Language-Model-Fused ASR W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews
AVATAR: Unconstrained Audiovisual Speech Recognition Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module Richard Rose, Olivier Siohan
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video Dmitriy Serdyuk, Otavio Braga, Olivier Siohan
Unsupervised Data Selection via Discrete Speech Representation for ASR Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani
Non-parallel Voice Conversion for ASR Augmentation Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno
Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani
Improving Deliberation by Text-Only and Semi-supervised Training Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang
E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu
CycleGAN-Based Unpaired Speech Dereverberation Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post) Joel Shor, Subhashini Venugopalan
Learning Neural Audio Features Without Supervision Sarthak Yadav, Neil Zeghidour
SpeechPainter: Text-Conditioned Speech Inpainting Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi
SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
Distance-Based Sound Separation Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey
Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno
Improving Rare Word Recognition with LM-Aware MWER Training Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach
MAESTRO: Matched Speech Text Representations Through Modality Matching Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen
Pseudo Label is Better Than Human Label Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman
On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran
Streaming Align-Refine for Non-autoregressive Deliberation Wang Weiran, Ke Hu, Tara Sainath
Federated Pruning: Improving Neural Network Efficiency with Federated Learning Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays
A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman
4-Bit Conformer with Native Quantization Aware Training for Speech Recognition Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov
Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein
Reducing Domain Mismatch in Self-Supervised Speech Pre-training Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano
On-the-Fly ASR Corrections with Audio Exemplars Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim
A Language Agnostic Multilingual Streaming On-Device ASR System Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani
XTREME-S: Evaluating Cross-Lingual Speech Representations Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Towards Disentangled Speech Representations Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O'Malley, Ian McGraw
A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation Tom O’Malley, Arun Narayanan, Quan Wang
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark
A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno
Text-Driven Separation of Arbitrary Sounds Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi
Workshops, Tutorials & Special Sessions
The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22) Organizers include: Arsha Nagrani
Self-Supervised Representation Learning for Speech Processing Organizers include: Tara Sainath
Learning from Weak Labels Organizers include: Ankit Shah
RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey
Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla
Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays
Trustworthy Speech Processing Organizers include: Shrikanth Narayanan