The latest research from Google

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

Automatic speech recognition (ASR) is a well-established technology that is widely adopted for applications such as conference calls, streamed video transcription, and voice commands. The main challenges for this technology center on noisy audio inputs, and the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems. This approach is called audiovisual ASR (AV-ASR).

Retrieval-augmented visual-language pre-training

Large sequence models for software development activities

Foundation models for reasoning on charts

Barkour: Benchmarking animal-level agility with quadruped robots

Differentially private clustering for large-scale datasets

Google Research at I/O 2023

Resolving code review comments with ML

Making ML models differentially private: Best practices and open challenges

Sparse video tubes for joint video and image vision transformers

Responsible AI at Google Research: PAIR