
Vid2Seq: a pretrained visual language model for describing multi-event videos

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and from standard video captioning, which consist of describing a single image or a short video with a single sentence.
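To make the distinction concrete, here is a minimal sketch (in Python, with hypothetical data) of how a dense video captioning output differs from a standard single-sentence caption: each event carries its own start and end timestamps alongside its description. The class name, field names, and example timings below are illustrative assumptions, not Vid2Seq's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One temporally localized event in a dense video captioning output (illustrative)."""
    start_s: float   # event start time, in seconds
    end_s: float     # event end time, in seconds
    caption: str     # natural-language description of the event

# Standard video captioning: one sentence for a whole short clip.
standard_caption = "A musher hitches up dogs and they race away on a sled."

# Dense video captioning: every event in a minutes-long video is localized
# and described, including events at very different time scales.
dense_captions = [
    Event(start_s=0.0,  end_s=25.0,  caption="A musher hitches the dogs to the sled."),
    Event(start_s=25.0, end_s=180.0, caption="The dogs pull the sled along the trail."),
]

for event in dense_captions:
    print(f"[{event.start_s:6.1f}s - {event.end_s:6.1f}s] {event.caption}")
```

The point of the structure is that a short event (the hitching) and a long event (the pulling) coexist in the same output, each with its own temporal extent, rather than being collapsed into a single sentence.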
