Vision augmented hearing
Theo Murphy meeting organised by Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee.
Acoustic information is not the sole determinant of how the everyday world sounds: our brains rely on vision to rescue hearing when the acoustic signal is degraded or ambiguous. Perception continuously and seamlessly binds information across the senses, yet how it does so remains mysterious. This meeting will gather diverse experts to unify the latest research and chart a path towards better virtual- and augmented-reality technology.
Programme
The programme, including speaker biographies and abstracts, is available below; please note that it may be subject to change.
Poster session
There will be a poster session from 5pm on Tuesday 3 March 2026. Registered attendees will be invited to submit a proposed poster title and abstract (up to 200 words). Acceptances may be made on a rolling basis so we recommend submitting as soon as possible in case the session becomes full. Submissions made within one month of the meeting may not be included in the programme booklet.
Attending the event
This event is intended for researchers in relevant fields.
- Free to attend and in-person only
- When requesting an invitation, please briefly state your expertise and reasons for attending
- Requests are reviewed by the meeting organisers on a rolling basis. You will receive a link to register if your request has been successful
- Catering options will be available to purchase upon registering. Participants are responsible for booking their own accommodation. Please do not book accommodation until you have been invited to attend the meeting by the meeting organisers
Enquiries: Contact the Scientific Programmes team.
Organisers
Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee
Schedule
Day 1: Tuesday 3 March 2026

09:00-09:05  Welcome by the Royal Society

09:05-09:30  Talk title tbc
Dr Jennifer Bizley, University College London, UK
Dr Jennifer Bizley obtained her D.Phil. from the University of Oxford, where she was also a post-doctoral fellow. She is currently a Reader and holder of a Royal Society / Wellcome Trust Sir Henry Dale Fellowship at the Ear Institute, University College London, where her research group is based. Her work explores the brain basis of listening and, in particular, how auditory and non-auditory factors influence the processing of sound. Her research combines behavioural methods with techniques to measure and manipulate neural activity, as well as anatomical and computational approaches.

Professor Adrian KC Lee, University of Washington, Seattle, USA
Adrian KC Lee is a Professor in the Department of Speech & Hearing Sciences and at the Institute for Learning and Brain Sciences at the University of Washington, Seattle, USA. He obtained his bachelor's degree in electrical engineering at the University of New South Wales and his doctorate at the Harvard-MIT Division of Health Sciences and Technology. His research focuses on developing multimodal imaging techniques to investigate the cortical network involved in auditory scene analysis and attention, especially through designing novel behavioral paradigms that bridge the gap between psychoacoustics, multisensory and neuroimaging research.

09:30-09:45  Discussion

09:45-10:15  Talk title tbc
Professor Jennifer Groh, Duke University

10:15-10:30  Discussion

10:30-11:00  Break

11:00-11:30  Tracing the effect of visual stimuli on speech encoding along the human auditory pathway
In noisy settings, seeing a talker makes them much easier to understand. Several studies have demonstrated cortical effects of audio-visual integration in humans and animal models. Subcortically, some work in animals has shown effects of visual stimuli in auditory areas, but there is very little human work to back that up. In this study, we presented 23 listeners with audio-visual speech under two conditions: coherent, in which the acoustic and visual speech matched, and incoherent, in which the visual speech was replaced with a different recording of the same talker. The target speech was presented alongside two acoustic masker talkers, and listeners were asked to report keywords. We recorded EEG and computed the brainstem temporal response function, from which we derived a waveform for each condition resembling the auditory brainstem response (ABR). Behavioral results confirmed the perceptual benefit of the coherent condition over the incoherent one: all subjects showed better performance, with a mean improvement of 10% correct. ABR waveforms to target speech did not differ between the two audio-visual conditions, and responses to masker speech were similarly unaffected by the visual stimulus. It is clear from our behavioral results and countless prior studies that coherent visual speech improves understanding in the presence of background noise. Audio-visual integration of speech signals has been shown in humans in later cortical waves, but was not seen subcortically in the present study. This is consistent with recent work from our lab showing that selective attention affects cortical but not subcortical EEG responses in human listeners.
Dr Ross Maddox, University of Michigan, USA
Ross Maddox earned his PhD and MS in Biomedical Engineering from Boston University, and his BS in Sound Engineering from the University of Michigan. Following his PhD, he completed a postdoctoral appointment at the University of Washington Institute for Learning & Brain Sciences (I-LABS). He was on the faculty of the Departments of Biomedical Engineering and Neuroscience at the University of Rochester before moving to the Kresge Hearing Research Institute at the University of Michigan in 2024.

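For readers unfamiliar with the brainstem temporal response function mentioned in the abstract above, the sketch below illustrates one common way such an ABR-like waveform can be obtained: regularised frequency-domain deconvolution of the EEG against a stimulus regressor. It is a minimal illustration, not the study's pipeline; the choice of regressor (a half-wave-rectified audio signal), the regularisation value and the synthetic data are all assumptions made for the example.

```python
# Minimal sketch, assuming a half-wave-rectified audio regressor and simple
# regularisation; not the analysis pipeline used in the study described above.
import numpy as np

def brainstem_trf(regressor, eeg, fs, t_min=-0.01, t_max=0.03, reg=1e-2):
    """Derive an ABR-like waveform by deconvolving EEG with a stimulus regressor."""
    n = len(regressor)
    X = np.fft.rfft(regressor, n)
    Y = np.fft.rfft(eeg, n)
    # Regularised spectral division: W = conj(X) * Y / (|X|^2 + lambda)
    W = np.conj(X) * Y / (np.abs(X) ** 2 + reg * np.mean(np.abs(X) ** 2))
    w = np.fft.irfft(W, n)
    # Keep lags in the ABR latency range; negative lags wrap to the end of the array
    lags = np.arange(int(t_min * fs), int(t_max * fs))
    return lags / fs, w[lags % n]

# Toy example with synthetic data: 10 s of "audio" and "EEG" sampled at 10 kHz
fs = 10_000
rng = np.random.default_rng(0)
audio = rng.standard_normal(10 * fs)
regressor = np.maximum(audio, 0.0)                   # half-wave rectification
kernel = np.exp(-500 * np.arange(0, 0.01, 1 / fs))   # fake 10 ms neural response
eeg = np.convolve(regressor, kernel)[: len(regressor)]
eeg = eeg + 0.5 * rng.standard_normal(len(eeg))      # measurement noise
times, waveform = brainstem_trf(regressor, eeg, fs)  # recovers the kernel's latencies
```

In the kind of comparison described in the abstract, such a waveform would be computed separately for the coherent and incoherent trials and the two morphologies compared.
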
11:30-11:45  Discussion

11:45-12:15  Talk title tbc
Dr Rebecca Norris, University College London, UK

12:15-12:30  Discussion

13:30-14:00  The function of top-down processes in segmenting and selecting objects in the visual scene
Accurate segmentation of the visual scene allows us to select and manipulate objects in our environment. Top-down connections in sensory systems are thought to modulate activity in primary sensory areas to enhance object-related activity while suppressing background activity. Here I will discuss recent work in mice, monkeys and humans showing how connectivity between cells in higher visual areas tuned for border-ownership and cells in V1 leads to precise scene segmentation. I will show how interaction with local circuitry in V1 allows top-down connections to drive activity, even in the absence of bottom-up input from the retina. Finally, I will discuss how segmentation processes evolve over time, from an early phase where local contextual effects determine activity to a later phase where the global scene organisation is represented in primary visual cortex.
Dr Matthew Self, University of Glasgow, UK
Dr Matthew Self is a Senior Lecturer at the School of Psychology and Neuroscience at the University of Glasgow. His work follows two broad research themes. In the visual system he studies how feedforward and feedback circuits are used to segment the visual scene into objects and backgrounds, the neural circuits that mediate contextual and predictive effects, and the role of top-down processes in visual perception. He also collaborates with neurosurgeons, in the Netherlands and internationally, to record single-cell activity in the human hippocampus during cognitive behaviours. He studies how hippocampal activity can be controlled and used to learn spatiotemporal information.

14:00-14:15  Discussion

14:15-14:45  Natural audiovisual speech encoding in the early stages of the human cortical hierarchy
Seeing a speaker's face in a noisy environment can greatly improve one's ability to understand what they are saying, a process that is attributed to the multisensory integration of audio and visual speech. In this talk, I will present a model of such multisensory integration that is based on the notion that visual speech can influence auditory speech processing at multiple stages of processing, including an early stage based on the correlated dynamics of visual and auditory speech and later stages where the form of the visual articulators helps with linguistic categorization. This model relies on the hypothesis that visual cortex represents both low-level visual features and higher-level linguistic cues, and that these representations can differentially and flexibly influence the processing of audio speech. I will present evidence for this model across a series of studies that involved modeling EEG responses obtained from adult participants while they were presented with naturalistic audio-visual speech stimuli.
Professor Edmund Lalor, University of Rochester, USA, and Trinity College Dublin, Ireland
Edmund Lalor is a Professor of Biomedical Engineering and Neuroscience at the University of Rochester. His lab takes a quantitative modelling approach to the analysis of sensory electrophysiology in humans, with a view to understanding the sensory, perceptual and cognitive processes that underpin everyday human function. While much of this work focuses on neurotypical, healthy adults, the team is also interested in how sensory and perceptual processing is affected in certain populations, including people with schizophrenia and those with a diagnosis of autism.

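EEG modelling of the kind described above is often implemented as a lagged linear encoding model (a temporal response function) fitted with ridge regression. The sketch below is a minimal illustration under assumed features (an acoustic envelope and a lip-aperture trace), an assumed lag window and an assumed ridge parameter; it is not the speaker's actual analysis.

```python
# Minimal sketch of a lagged ridge-regression encoding model ("TRF");
# features, lag window and regularisation are illustrative assumptions.
import numpy as np

def lagged_design_matrix(features, min_lag, max_lag):
    """Stack time-lagged copies of each feature column (lags in samples)."""
    n_times, n_feats = features.shape
    lags = np.arange(min_lag, max_lag + 1)
    X = np.zeros((n_times, n_feats * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(features, lag, axis=0)
        if lag > 0:
            shifted[:lag] = 0.0          # zero the wrapped-around samples
        elif lag < 0:
            shifted[lag:] = 0.0
        X[:, i * n_feats:(i + 1) * n_feats] = shifted
    return X

def fit_trf(features, eeg, min_lag=0, max_lag=40, ridge=1.0):
    """Ridge regression mapping lagged stimulus features to one EEG channel."""
    X = lagged_design_matrix(features, min_lag, max_lag)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)

# Toy example: an acoustic envelope plus a visual feature (e.g. lip aperture),
# both resampled to the EEG rate (here 100 Hz, 60 s of data).
fs, dur = 100, 60
rng = np.random.default_rng(1)
envelope = np.abs(rng.standard_normal(fs * dur))
lip_aperture = np.abs(rng.standard_normal(fs * dur))
features = np.column_stack([envelope, lip_aperture])
eeg = 0.3 * np.roll(envelope, 15) + 0.1 * np.roll(lip_aperture, 20)  # fake responses
eeg = eeg + 0.5 * rng.standard_normal(fs * dur)                      # noise
weights = fit_trf(features, eeg)      # one weight per feature per lag
```

Comparing the predictive accuracy of audio-only versus audio-plus-visual feature sets is one standard way such models are used to test for early visual influences on auditory speech encoding.
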
14:45-15:00  Discussion

15:00-15:30  Break

15:30-16:00  See what you hear: Making sense of the senses
Adaptive behaviour in a complex, dynamic and multisensory world raises some of the most fundamental questions for neural processing, notably perceptual inference, decision making, learning, binding, attention and probabilistic computations. In this talk, I will present our recent behavioural, computational and neural research investigating how the brain tackles these challenges. First, I will focus on how the brain solves the causal inference, or binding, problem: deciding whether signals come from common causes and should hence be integrated, or else be processed independently. Combining psychophysics, Bayesian modelling and neuroimaging (fMRI, EEG), we show that the brain arbitrates between sensory integration and segregation consistent with the principles of Bayesian Causal Inference by dynamically encoding multiple perceptual estimates across the cortical hierarchy. Next, I will explore how prior expectations and attentional mechanisms can modulate sensory integration. Finally, I will present research on how the brain solves the causal inference problem in more complex environments with multiple signals and sources.
Professor Uta Noppeney, Donders Institute, Radboud University, the Netherlands
Uta Noppeney's research investigates the neural mechanisms of perceptual inference, learning, decision making, attention and probabilistic computations through a multisensory lens, combining psychophysics, computational modelling (Bayesian, neural network) and neuroimaging (fMRI, M/EEG) in humans. She is a Professor at the Neurophysics department and a Principal Investigator at the Donders Institute for Brain, Cognition and Behaviour, Radboud University (Netherlands). Previously, she was the director of the Computational Neuroscience and Cognitive Robotics Centre at the University of Birmingham (UK) and an independent research group leader at the Max Planck Institute for Biological Cybernetics, Tübingen (Germany). She is the recipient of a Young Investigator Award of the Cognitive Neuroscience Society (2013), a Turing Fellowship (2018) and two ERC grants (2013, 2023). She is a member of the Academia Europaea and an academic editor of PLOS Biology and Multisensory Research.

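As background for the Bayesian Causal Inference framework referred to above, the sketch below computes, for a single audio-visual trial, the posterior probability of a common cause and the resulting model-averaged auditory location estimate, following the standard formulation (e.g. Körding et al., 2007). All parameter values are illustrative assumptions, not values from the speaker's studies.

```python
# Minimal sketch of Bayesian Causal Inference for one audio-visual trial;
# all parameter values are illustrative assumptions.
import numpy as np

def bci_auditory_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p=10.0, p_common=0.5):
    """Return P(common cause | cues) and the model-averaged auditory estimate.

    x_a, x_v: noisy auditory and visual location cues (degrees)
    sigma_a, sigma_v: sensory noise; sigma_p: width of a zero-centred spatial prior
    """
    var_a, var_v, var_p = sigma_a**2, sigma_v**2, sigma_p**2
    # Likelihood of both cues under a single common source (source integrated out)
    denom = var_a * var_v + var_a * var_p + var_v * var_p
    like_c1 = np.exp(-0.5 * ((x_a - x_v)**2 * var_p + x_a**2 * var_v + x_v**2 * var_a)
                     / denom) / (2 * np.pi * np.sqrt(denom))
    # Likelihood under two independent sources
    like_c2 = (np.exp(-0.5 * x_a**2 / (var_a + var_p)) / np.sqrt(2 * np.pi * (var_a + var_p))
               * np.exp(-0.5 * x_v**2 / (var_v + var_p)) / np.sqrt(2 * np.pi * (var_v + var_p)))
    # Posterior probability that the cues share a common cause
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
    # Optimal auditory location estimate under each causal structure
    s_c1 = (x_a / var_a + x_v / var_v) / (1 / var_a + 1 / var_v + 1 / var_p)
    s_c2 = (x_a / var_a) / (1 / var_a + 1 / var_p)
    # Model averaging: weight the two estimates by the posterior over structures
    return post_c1, post_c1 * s_c1 + (1 - post_c1) * s_c2

p_common_cause, s_hat = bci_auditory_estimate(x_a=5.0, x_v=8.0, sigma_a=4.0, sigma_v=1.0)
```

With reliable vision (small sigma_v) and a small audio-visual discrepancy, the posterior favours a common cause and the auditory estimate is pulled towards the visual cue; as the discrepancy grows, integration gives way to segregation.
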
16:00-16:15  Discussion

16:15-16:45  Talk title tbc
Professor Aleena Garner, Harvard University, US

16:45-17:00  Discussion

17:00-18:30  Drinks reception and poster session

18:30  Close

Day 2: Wednesday 4 March 2026

09:00-09:30  Multisensory Speech Perception: Models and Mechanisms
The most natural form of human interaction is face-to-face, integrating auditory information from the voice of the talker with visual information from the face of the talker. A dramatic illustration of the influence of visual information on speech perception is provided by the McGurk effect. In this illusion, pairing an auditory "ba" with an incongruent visual "ga" sometimes results in the percept of a different syllable. We show that an artificial neural network known as AVHuBERT also perceives the McGurk effect. Both humans and AVHuBERT report a mixture of percepts to McGurk stimuli, including the auditory component of the stimulus, the fusion percept of "da" reported in the original description of the illusion, and other syllables, including "fa" and "ah". The similar responses of humans and AVHuBERT to McGurk stimuli suggest that artificial neural networks may provide a useful model for human audiovisual speech perception. The neural basis of audiovisual speech perception was examined using stereoelectroencephalographic (SEEG) recordings from neurosurgical patients. These recordings demonstrate an anatomical boundary between the superior temporal gyrus, which responds similarly to auditory-only and audiovisual speech, and the superior temporal sulcus, which shows larger and faster responses to audiovisual compared with auditory-only speech. Consistent with behavioural studies, audiovisual enhancement was more pronounced in the presence of auditory noise.

09:30-09:45  Discussion

09:45-10:15  Audiovisual scene dynamics and their influence on loudness perception
Understanding speech in noisy, multisensory environments is a fundamental challenge for both listeners and researchers. While most studies focus on speech intelligibility, less is known about how the perceptual construct of loudness is influenced by the structure of the audiovisual (AV) scene. This work examines how systematically manipulating the temporal synchrony between auditory and visual signals, as well as the linguistic content of the speech, modulates perceived loudness. Our results show that increased AV asynchrony leads to a significant drop in perceived loudness, with effects emerging beyond natural synchrony ranges. Surprisingly, linguistic complexity also alters loudness ratings, even when physical intensity is held constant. By transforming subjective ratings into an objective measure of perceived target-to-masker ratio (TMR), we demonstrate that extreme AV asynchrony results in a 2–3 dB drop in perceived TMR, independent of linguistic content. These findings highlight the value of loudness perception as a sensitive index of AV scene analysis. This approach offers new avenues for studying multisensory processing in both typical and neurodiverse populations, and for developing accessible protocols that decouple scene analysis from linguistic ability.
Dr Liesbeth Gijbels, University of Washington, US
Liesbeth Gijbels is a research scientist with a rich academic background in Speech and Hearing Sciences (PhD, Master, Bachelor), Psychology, and Education. The focus of her work is to bridge clinical practice and academic research in Speech and Hearing Sciences. Liesbeth has more than 10 years of hands-on clinical experience supporting individuals with communication, learning, and hearing challenges. After moving to the US in 2018, she completed her PhD at the University of Washington, focusing on audiovisual speech perception and the cognitive processes underlying human communication. Building on her clinical and academic foundations, Liesbeth's current work as a research scientist at Meta Reality Labs centers on developing AI-driven tools and technologies to enhance hearing and communication.

10:15-10:30  Discussion

10:30-11:00  Break

11:00-11:30  A model incorporating the influence of gaze on apparent sound source direction
The mistakes that listeners make in determining the direction of a sound source are not solely the consequence of sampling error from a distribution centered around the source's true position. Burgeoning evidence suggests that the distributions themselves have characteristic shifts or biases that depend on adaptive processes with different time courses, the spatial statistics and recent history of the auditory scene itself, the relative angles of the head and torso, and eye gaze angle, to name a few contributing factors. These effects indicate that perceived acoustic space is not fixed, but that its biases are continually remapped in response to behavioural and sensory context. This talk will review the history of what we know about such systematic changes in spatial acoustic perception, and in particular their interaction with gaze. We will then present a neurophysiologically inspired model of how head angle and eye gaze influence sound localization, as well as related phenomena such as sound source segregation and spatial release from masking. We conclude by comparing model predictions with key results from the literature, demonstrating that the model captures a common structure underpinning the diverse ways in which gaze alters spatial auditory perception.
Dr Owen Brimijoin, Facebook Reality Labs

11:30-11:45  Discussion

11:45-12:15  Talk title tbc
Dr Mark Wallace, Vanderbilt University

12:15-12:30  Discussion

13:30-14:00  Learnings from the audio-visual speech enhancement challenge: from stimuli design to evaluation
Speech enhancement technologies have developed rapidly in the last decade. Multi-modal speech perception has inspired the next generation of speech enhancement technologies to explore multi-modal approaches that leverage visual information to overcome scenarios that can be challenging for audio-only speech enhancement models (e.g. overlapping speakers). The Audio-Visual Speech Enhancement Challenge (AVSEC) provided the first benchmark to assess algorithms that use lip-reading information to augment their speech enhancement capabilities. Throughout its four editions, we provided carefully designed datasets that enabled us to explore AV-SE performance in a range of listening conditions different from those commonly used in laboratory settings. In this talk, I will present our proposed approach to scalable stimuli design from in-the-wild data and introduce the AVSEC protocol for human listening evaluation of AV-SE systems. I will present an overview of the listening test results across the different editions of the challenge and discuss them in relation to the characteristics of the designed stimuli.
Dr Lorena Aldana, University of Edinburgh, UK
Lorena Aldana is a Research Associate at the University of Edinburgh. She has a background in sound engineering and computer science. She was a DAAD scholar at Bielefeld University and finished her PhD in 2021. Her research interests lie at the intersection of multi-modal speech and hearing technologies, audio signal processing and machine learning. Lorena has been a technical lead of the four successful editions of the International Audio-Visual Speech Enhancement Challenge (AVSEC). Her current research focuses on advancing evaluation methods for speech and hearing technologies, addressing ecological validity and integrating individual differences in hearing beyond current standard clinical methods.

14:00-14:15  Discussion

14:15-14:45  Talk title tbc
Professor Lorenzo Picinali, Imperial College London, UK

14:45-15:00  Discussion

15:00-15:30  Break

15:30-16:00  Talk title tbc
Professor Chris Sumner, Nottingham Trent University, UK

16:00-16:15  Discussion

16:15-17:00  Panel discussion/overview

17:00  Close