Research
My research focuses on self-supervised and multi-modal machine learning techniques for video recognition, including the use of sound and text to learn better visual representations.
Recently, I have also become interested in computer vision for wildlife conservation. For a full list of publications, please see Google Scholar.
|
|
VidChapters-7M: Video Chapters at Scale
Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
NeurIPS (Datasets and Benchmarks), 2023  
arXiv / code, models, data, project page
New dataset and video-language tasks for video chapterisation in long web videos.
|
|
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen et al.
arXiv, 2023  
arXiv
Scaling up vision and language models gives SOTA on 25+ VL benchmarks.
|
|
UnLoc: A Unified Framework for Video Localization Tasks
Shen Yan*, Xuehan Xiong*, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
ICCV, 2023  
arXiv
Image-text models like CLIP can be used for Moment Retrieval, Temporal Localization, and Action Segmentation.
|
|
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
ICCV, 2023  
PDF
Describing visual content in movies (automatic audio descriptions!), with a focus on character recognition. This is Part II; see AutoAD I (CVPR 2023) below.
|
|
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid
ICCV, 2023  
arXiv / code
Using an LLM to generate better negatives for contrastive training produces a verb-focused video-text model.
|
|
LanSER: Language-Model Supported Speech Emotion Recognition
Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou
Interspeech, 2023  
PDF
LLMs are used to infer weak emotion labels from speech transcripts, providing training data for speech emotion recognition.
|
|
Modular Visual Question Answering via Code Generation
Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
ACL, 2023  
arXiv / code / Google AI blog
LLMs are used to automatically create modular executable code, which is used to solve visual QA. CodeVQA gets SOTA on the COVR and GQA datasets.
|
|
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
CVPR, 2023  
arXiv
Zero-shot, state-of-the-art audiovisual ASR is possible using lightweight adapters on top of large frozen unimodal models.
|
|
AutoAD: Movie Description in Context
Tengda Han*, Max Bain*, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
CVPR, 2023   (Highlight)
PDF / project page / code
Describing visual content in movies (automatic audio descriptions!).
|
|
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
CVPR, 2023  
arXiv / project page / code / Google AI blog
A single-stage, dense event captioning model pretrained on narrated videos at scale achieves SOTA for dense video captioning.
|
|
TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency
Medhini Narasimhan, Arsha Nagrani, Chen Sun, Miki Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
ECCV, 2022  
arXiv / project page / code, WikiHow Summaries dataset
Creating short video summaries of instructional videos using simple heuristics. A new test set (WikiHow Summaries) is released.
|
|
AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur*, Paul Hongsuck Seo*, Arsha Nagrani*, Chen Sun, Karteek Alahari, Cordelia Schmid
Interspeech, 2022  
arXiv / project page, visSpeech dataset
Visual context (objects, actions) improves ASR performance under challenging audio conditions.
|
|
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
ECCV, 2022  
arXiv / VideoCC dataset
Mining audiovisual clips for text captions by leveraging image similarity, giving SOTA video retrieval and captioning.
|
|
End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
CVPR, 2022  
arXiv
New unsupervised pretraining framework for multimodal video captioning that leverages future utterances in unlabelled videos.
|
|
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
WACV, 2022  
PDF
Video encoder pretraining using appearance, sound, and transcribed speech, by masking out an entire modality and predicting it using the others.
|
|
Automated audiovisual behavior recognition in wild primates
Max Bain, Arsha Nagrani, Daniel Schofield, Sophie Berdugo, Joana Bessa, Jake Owens, Kimberley J. Hockings, Tetsuro Matsuzawa, Misato Hayashi, Dora Biro, Susana Carvalho, Andrew Zisserman
Science Advances, 2021  
PDF
Fully automated, audio-visual pipeline to detect and track two audio-visually distinctive actions in wild chimpanzees: buttress-drumming and nut-cracking.
|
|
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
NeurIPS, 2021  
arXiv / project page / code / Google AI blog
A fully transformer-based multimodal fusion model gets SOTA on video classification. Attention bottlenecks at multiple layers force cross-modal information to be condensed, improving performance at lower computational cost.
|
|
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
ICCV, 2021  
arXiv / code, models / WebVid dataset
An end-to-end encoder for visual retrieval that uses only self-attention blocks, allowing flexible joint training on variable-length videos and images.
|
|
Composable Augmentation Encoding for Video Representation Learning
Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
ICCV, 2021  
arXiv / project page / models
Encoding augmentations along with data views gives SOTA on video self-supervised learning benchmarks.
|
|
Audio-Visual Synchronisation in the Wild
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
BMVC, 2021  
arXiv / VGGSound-sync data, project page
A transformer model for audio-visual synchronisation works well on non-speech classes in the wild.
|
|
Localizing Visual Sounds the Hard Way
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
CVPR, 2021  
arXiv / VGG-SS dataset, project page
Localizes sounding objects without any supervision using hard negative mining from within the image. Gives SOTA on Flickr SoundNet and a new VGG-SS dataset.
|
|
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
BMVC, 2021  
arXiv / code, models / project page
Uses a language model to learn a sequence of actions as temporal context for egocentric action recognition.
|
|
Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
CVPR, 2021  
arXiv / project page
Predicting future utterances in a video based on previous dialogue and video frames without manual labels gives SOTA on standard QA datasets.
|
|
Slow-Fast Auditory Streams For Audio Recognition
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
ICASSP, 2021   (Outstanding Paper Award)
arXiv / code, models / project page
A two-stream audio recognition model that gets SOTA on VGG-Sound and EPIC-Kitchens-100.
|
|
Playing a Part: Speaker Verification at the Movies
Andrew Brown*, Jaesung Huh*, Arsha Nagrani*, Joon Son Chung, Andrew Zisserman
ICASSP, 2021  
arXiv / VoxMovies dataset, project page
We investigate the performance of speaker recognition in movies, where actors often intentionally disguise their voices to play a character.
|
|
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
ACCV, 2020   (Oral Presentation)
project page, CMD dataset / challenge
A large-scale story understanding dataset that contains the key scenes from movies with semantic captions. Basis of the CMD Challenge at ICCV 2021.
|
|
Spot the conversation: speaker diarisation in the wild
Joon Son Chung*, Jaesung Huh*, Arsha Nagrani*, Triantafyllos Afouras, Andrew Zisserman
INTERSPEECH, 2020
project page, VoxConverse dataset / challenge
Breaking up multi-speaker videos into "who spoke when". Based on this work, we are hosting a new speaker diarisation track at the VoxCeleb Speaker Recognition Challenge.
|
|
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
ECCV, 2020
arXiv
Action localisation in movies using video-level labels only.
|
|
Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
CVPR, 2020
project page, data / slides
Action recognition in movies using speech alone.
|
|
Disentangled Speech Embeddings using Cross-modal Self-supervision
Arsha Nagrani*, Joon Son Chung*, Samuel Albanie*, Andrew Zisserman
ICASSP, 2020
project page / some code
Disentanglement of speech embeddings into content and identity, with only the accompanying face track as supervision. Based on this work, we are hosting a new self-supervised track at the VoxCeleb Speaker Recognition Challenge.
|
|
VoxCeleb: Large-scale speaker verification in the wild
Arsha Nagrani, Joon Son Chung, Weidi Xie, Andrew Zisserman
Computer Speech and Language, 2020
project page, data / code & models / challenge
Overview of the VoxCeleb1 and VoxCeleb2 datasets including various updates and splits, and new models for speaker recognition.
|
|
Count, Crop and Recognise: Fine-Grained Recognition in the Wild
Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman
ICCV Workshops, 2019   (Oral Presentation)
project page / slides
Recognition of wild chimpanzees using full-body and full-frame CNN methods. We also release an 'in the wild' video chimpanzee recognition dataset.
|
|
Chimpanzee face recognition from videos in the wild using deep learning
Daniel Schofield*, Arsha Nagrani*, Andrew Zisserman, Misato Hayashi, Tetsuro Matsuzawa, Dora Biro, Susana Carvalho
Science Advances, 2019
project page
Press: New Scientist, MIT Tech Review, TechXplore, Verdict, Digital Trends, Oxford News
Face detection, tracking, and recognition of wild chimpanzees from long-term video records using deep CNNs. We also show a brief application for social network analysis.
|
|
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
ICCV, 2019
project page / video / code and models
We propose a novel architecture for combining modalities in videos for action recognition, using a temporal binding window that allows a range of temporal offsets between modalities.
|
|
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
Yang Liu*, Samuel Albanie*, Arsha Nagrani*, Andrew Zisserman
BMVC, 2019
project page / code & models / challenge
We fuse information from different embedding experts for the task of video retrieval, achieving SOTA results on 5 different datasets. This work is also the basis for the Video Pentathlon at CVPR 2020.
|
|
Utterance-level Aggregation For Speaker Recognition In The Wild
Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
ICASSP, 2019   (Oral Presentation)
project page / code & models
A NetVLAD layer in a deep CNN works well for speaker recognition on long, noisy speech utterances.
|
|
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
Samuel Albanie*, Arsha Nagrani*, Andrea Vedaldi, Andrew Zisserman
ACM Multimedia, 2018
project page / code
We use the redundant (common) signal in both audio (speech) and vision (faces) to learn speech representations for emotion recognition without manual supervision.
|
|
VoxCeleb2: Deep Speaker Recognition
Joon Son Chung*, Arsha Nagrani*, Andrew Zisserman
INTERSPEECH, 2018
data
Speaker recognition in the wild using deep CNNs. The VoxCeleb datasets are also an integral part of the VoxCeleb Speaker Recognition Challenge.
|
|
Learnable PINs: Cross-Modal Embeddings for Person Identity
Arsha Nagrani*, Samuel Albanie*, Andrew Zisserman
ECCV, 2018
project page
We learn a joint embedding of faces and voices using cross-modal self-supervision from YouTube videos.
|
|
Seeing Voices and Hearing Faces: Cross-modal biometric matching
Arsha Nagrani, Samuel Albanie, Andrew Zisserman
CVPR, 2018   (Spotlight)
project page / video / blog post
Can you recognise someone’s face if you have only heard their voice? Or recognise their voice if you have only seen their face?
|
|
From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script
Arsha Nagrani, Andrew Zisserman
BMVC, 2017   (Oral Presentation)
project page
|
|
VoxCeleb: a large-scale speaker identification dataset
Arsha Nagrani*, Joon Son Chung*, Andrew Zisserman
INTERSPEECH, 2017   (Oral Presentation, Best Student Paper Award)
data / challenge
We use face recognition and active speaker detection to automatically create a large scale speaker identification dataset from YouTube videos.
|
|
|