Research
My research focuses on self-supervised and multi-modal machine learning techniques for video recognition, including the use of sound and text to learn better visual representations.
Recently, I have also become interested in computer vision for wildlife conservation. For a full list of publications, please see Google Scholar.
|
|
VidChapters-7M: Video Chapters at Scale
Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid
NeurIPS (Datasets and Benchmarks), 2023  
arXiv / code, models, data, project page
New dataset and video-language tasks for video chapterisation in long web videos.
|
|
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen et al.
arXiv, 2023  
arXiv
Scaling up vision and language models gives SOTA on 25+ VL benchmarks.
|
|
UnLoc: A Unified Framework for Video Localization Tasks
Shen Yan*, Xuehan Xiong*, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, Cordelia Schmid
ICCV, 2023  
arXiv
Image-text models like CLIP can be used for Moment Retrieval, Temporal Localization, and Action Segmentation.
|
|
AutoAD II: The Sequel – Who, When, and What in Movie Audio Description
Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
ICCV, 2023  
PDF
Describing visual content in movies (automatic audio descriptions!), with a focus on character recognition. This is Part II; see AutoAD I (CVPR 2023) below.
|
|
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid
ICCV, 2023  
arXiv / code
Using an LLM to generate better negatives for contrastive training produces a verb-focused video-text model.
|
|
LanSER: Language-Model Supported Speech Emotion Recognition
Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou
Interspeech, 2023  
PDF
LLMs are used to infer weak emotion labels from speech transcripts, providing training data for speech emotion recognition.
|
|
Modular Visual Question Answering via Code Generation
Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
ACL, 2023  
arXiv / code / Google AI blog
LLMs are used to automatically create modular executable code, which is used to solve visual QA. CodeVQA gets SOTA on the COVR and GQA datasets.
|
|
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
CVPR, 2023  
arXiv
Zero-shot, state-of-the-art audiovisual ASR is possible using lightweight adapters on top of large frozen unimodal models.
|
|
AutoAD: Movie Description in Context
Tengda Han*, Max Bain*, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
CVPR, 2023   (Highlight)
PDF / project page / code
Describing visual content in movies (automatic audio descriptions!).
|
|
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
CVPR, 2023  
arXiv / project page / code / Google AI blog
A single-stage, dense event captioning model pretrained on narrated videos at scale achieves SOTA for dense video captioning.
|
|
TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency
Medhini Narasimhan, Arsha Nagrani, Chen Sun, Miki Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
ECCV, 2022  
arXiv / project page / code, WikiHow Summaries dataset
Creating short video summaries of instructional videos using simple heuristics. A new test set (WikiHow Summaries) is released.
|
|
AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur*, Paul Hongsuck Seo*, Arsha Nagrani*, Chen Sun, Karteek Alahari, Cordelia Schmid
Interspeech, 2022  
arXiv / project page, visSpeech dataset
Visual context (objects, actions) improves ASR performance under challenging audio conditions.
|
|
Learning Audio-Video Modalities from Image Captions
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
ECCV, 2022  
arXiv / VideoCC dataset
Mining audiovisual clips for text captions by leveraging image similarity, giving SOTA video retrieval and captioning.
|
|
End-to-end Generative Pretraining for Multimodal Video Captioning
Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab, Cordelia Schmid
CVPR, 2022  
arXiv
New unsupervised pretraining framework for multimodal video captioning that leverages future utterances in unlabelled videos.
|
|
Masking Modalities for Cross-modal Video Retrieval
Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
WACV, 2022  
PDF
Video encoder pretraining using appearance, sound, and transcribed speech, by masking out an entire modality and predicting it using the others.
|
|
Automated audiovisual behavior recognition in wild primates
Max Bain, Arsha Nagrani, Daniel Schofield, Sophie Berdugo, Joana Bessa, Jake Owens, Kimberley J. Hockings, Tetsuro Matsuzawa, Misato Hayashi, Dora Biro, Susana Carvalho, Andrew Zisserman
Science Advances, 2021  
PDF
Fully automated, audio-visual pipeline to detect and track two audio-visually distinctive actions in wild chimpanzees: buttress-drumming and nut-cracking.
|
|
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
NeurIPS, 2021  
arXiv / project page / code / Google AI blog
A fully transformer-based multimodal fusion model gets SOTA on video classification. Attention bottlenecks at multiple layers force cross-modal information to be condensed, improving performance at lower computational cost.
|
|
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
ICCV, 2021  
arXiv / code, models / WebVid dataset
An end-to-end encoder for visual retrieval that uses only self-attention blocks, allowing flexible joint training on variable-length videos and images.
|
|
Composable Augmentation Encoding for Video Representation Learning
Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
ICCV, 2021  
arXiv / project page / models
Encoding augmentations along with data views gives SOTA on video self-supervised learning benchmarks.
|
|
Audio-Visual Synchronisation in the Wild
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
BMVC, 2021  
arXiv / VGGSound-sync data, project page
A transformer model for audio-visual synchronisation works well on non-speech classes in the wild.
|
|
Localizing Visual Sounds the Hard Way
Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman
CVPR, 2021  
arXiv / VGG-SS dataset, project page
Localizes sounding objects without any supervision using hard negative mining from within the image. Gives SOTA on Flickr SoundNet and a new VGG-SS dataset.
|
|
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima Damen
BMVC, 2021  
arXiv / code, models / project page
Uses a language model to learn a sequence of actions as temporal context for egocentric action recognition.
|
|
Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
CVPR, 2021  
arXiv / project page
Predicting future utterances in a video based on previous dialogue and video frames without manual labels gives SOTA on standard QA datasets.
|
|
Slow-Fast Auditory Streams For Audio Recognition
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
ICASSP, 2021   (Outstanding Paper Award)
arXiv / code, models / project page
A two-stream audio recognition model that gets SOTA on VGG-Sound and EPIC-Kitchens-100.
|
|
Playing a Part: Speaker Verification at the Movies
Andrew Brown*, Jaesung Huh*, Arsha Nagrani*, Joon Son Chung, Andrew Zisserman
ICASSP, 2021  
arXiv / VoxMovies dataset, project page
We investigate the performance of speaker recognition in movies, where actors often intentionally disguise their voices to play a character.
|
|
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman
ACCV, 2020   (Oral Presentation)
project page, CMD dataset / challenge
A large-scale story understanding dataset that contains the key scenes from movies with semantic captions. Basis of the CMD Challenge at ICCV 2021.
|
|
Spot the conversation: speaker diarisation in the wild
Joon Son Chung*, Jaesung Huh*, Arsha Nagrani*, Triantafyllos Afouras, Andrew Zisserman
INTERSPEECH, 2020
project page, VoxConverse dataset / challenge
Breaking up multi-speaker videos into "who spoke when". Based on this work, we are hosting a new speaker diarisation track at the VoxCeleb Speaker Recognition Challenge.
|
|
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
ECCV, 2020
arXiv
Action localisation in movies using video-level labels only.
|
|
Speech2Action: Cross-modal Supervision for Action Recognition
Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
CVPR, 2020
project page, data / slides
Action recognition in movies using speech alone.
|
|
Disentangled Speech Embeddings using Cross-modal Self-supervision
Arsha Nagrani*, Joon Son Chung*, Samuel Albanie*, Andrew Zisserman
ICASSP, 2020
project page / some code
Disentanglement of speech embeddings into content and identity, with only the accompanying face track as supervision. Based on this work, we are hosting a new self-supervised track at the VoxCeleb Speaker Recognition Challenge.
|
|
VoxCeleb: Large-scale speaker verification in the wild
Arsha Nagrani, Joon Son Chung, Weidi Xie, Andrew Zisserman
Computer Speech and Language, 2020
project page, data / code & models / challenge
Overview of the VoxCeleb1 and VoxCeleb2 datasets including various updates and splits, and new models for speaker recognition.
|
|
Count, Crop and Recognise: Fine-Grained Recognition in the Wild
Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman
ICCV Workshops, 2019   (Oral Presentation)
project page / slides
Recognition of wild chimpanzees using full-body and full-frame CNN methods. We also release an 'in the wild' video chimpanzee recognition dataset.
|
|
Chimpanzee face recognition from videos in the wild using deep learning
Daniel Schofield*, Arsha Nagrani*, Andrew Zisserman, Misato Hayashi, Tetsuro Matsuzawa, Dora Biro, Susana Carvalho
Science Advances, 2019
project page
Press: New Scientist, MIT Tech Review, TechXplore, Verdict, Digital Trends, Oxford News
Face detection, tracking, and recognition of wild chimpanzees from long-term video records using deep CNNs. We also show a brief application for social network analysis.
|
|
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
ICCV, 2019
project page / video / code and models
We propose a novel architecture for combining modalities in videos for action recognition, using a temporal binding window that allows a range of temporal offsets between modalities.
|
|
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
Yang Liu*, Samuel Albanie*, Arsha Nagrani*, Andrew Zisserman
BMVC, 2019
project page / code & models / challenge
We fuse information from different embedding experts for the task of video retrieval, achieving SOTA results on 5 different datasets. This work is also the basis for the Video Pentathlon at CVPR 2020.
|
|
Utterance-level Aggregation For Speaker Recognition In The Wild
Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman
ICASSP, 2019   (Oral Presentation)
project page / code & models
A NetVLAD layer in a deep CNN works well for speaker recognition on long, noisy speech utterances.
|
|
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
Samuel Albanie*, Arsha Nagrani*, Andrea Vedaldi, Andrew Zisserman
ACM Multimedia, 2018
project page / code
We use the redundant (common) signal in both audio (speech) and vision (faces) to learn speech representations for emotion recognition without manual supervision.
|
|
VoxCeleb2: Deep Speaker Recognition
Joon Son Chung*, Arsha Nagrani*, Andrew Zisserman
INTERSPEECH, 2018
data
Speaker recognition in the wild using deep CNNs. The VoxCeleb datasets are also an integral part of the VoxCeleb Speaker Recognition Challenge.
|
|
Learnable PINs: Cross-Modal Embeddings for Person Identity
Arsha Nagrani*, Samuel Albanie*, Andrew Zisserman
ECCV, 2018
project page
We learn a joint embedding of faces and voices using cross-modal self-supervision from YouTube videos.
|
|
Seeing Voices and Hearing Faces: Cross-modal biometric matching
Arsha Nagrani, Samuel Albanie, Andrew Zisserman
CVPR, 2018   (Spotlight)
project page / video / blog post
Can you recognise someone’s face if you have only heard their voice? Or recognise their voice if you have only seen their face?
|
|
From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script
Arsha Nagrani, Andrew Zisserman
BMVC, 2017   (Oral Presentation)
project page
|
|
VoxCeleb: a large-scale speaker identification dataset
Arsha Nagrani*, Joon Son Chung*, Andrew Zisserman
INTERSPEECH, 2017   (Oral Presentation, Best Student Paper Award)
data / challenge
We use face recognition and active speaker detection to automatically create a large scale speaker identification dataset from YouTube videos.
|
|
|