Vision augmented hearing

03 - 04 March 2026, 09:00 - 17:00, Holiday Inn Manchester - City Centre. Free to attend.
Request an invitation

Theo Murphy meeting organised by Professor Jennifer Bizley, Professor Michael Akeroyd and Professor Adrian KC Lee.

Acoustic information is not the sole determinant of how the everyday world sounds: our brains rely on vision to rescue hearing when audition is degraded or fails altogether. Perception continuously and seamlessly binds information across the senses, but how it does so remains mysterious. This meeting will gather diverse experts to unify the latest research and chart a path towards better virtual- and augmented-reality technology.

Programme

The programme, including speaker biographies and abstracts, is available below, but please note that it may be subject to change.

Poster session

There will be a poster session from 5pm on Tuesday 3 March 2026. Registered attendees will be invited to submit a proposed poster title and abstract (up to 200 words). Acceptances may be made on a rolling basis, so we recommend submitting as soon as possible in case the session becomes full. Submissions made within one month of the meeting may not be included in the programme booklet.

Attending the event

This event is intended for researchers in relevant fields.

  • Free to attend and in-person only
  • When requesting an invitation, please briefly state your expertise and reasons for attending
  • Requests are reviewed by the meeting organisers on a rolling basis. You will receive a link to register if your request has been successful
  • Catering options will be available to purchase upon registering. Participants are responsible for booking their own accommodation. Please do not book accommodation until you have been invited to attend the meeting by the meeting organisers

Enquiries: Contact the Scientific Programmes team.

Organisers

  • Dr Jennifer Bizley, University College London, UK

    Dr Jennifer Bizley obtained her D.Phil. from the University of Oxford, where she was also a post-doctoral fellow. She is currently a Reader and holder of a Royal Society/Wellcome Trust Sir Henry Dale Fellowship at the Ear Institute, University College London, where her research group is based. Her work explores the brain basis of listening and, in particular, how auditory and non-auditory factors influence the processing of sound. Her research combines behavioural methods with techniques to measure and manipulate neural activity, as well as anatomical and computational approaches.

  • Professor Michael Akeroyd, University of Nottingham, UK

  • Professor Adrian KC Lee, University of Washington, USA

    Adrian KC Lee is a Professor in the Department of Speech & Hearing Sciences and at the Institute for Learning and Brain Sciences at the University of Washington, Seattle, USA. He obtained his bachelor’s degree in electrical engineering at the University of New South Wales and his doctorate at the Harvard-MIT Division of Health Sciences and Technology. His research focuses on developing multimodal imaging techniques to investigate the cortical network involved in auditory scene analysis and attention, especially through designing novel behavioral paradigms that bridge the gap between psychoacoustics, multisensory and neuroimaging research.

Schedule

Tuesday 3 March 2026

09:00-09:05 Welcome by the Royal Society
09:05-09:30 Talk title tbc
Dr Jennifer Bizley, University College London, UK

Professor Adrian KC Lee, University of Washington, USA

09:30-09:45 Discussion
09:45-10:15 Talk title tbc
Professor Jennifer Groh, Duke University, USA

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 Tracing the effect of visual stimuli on speech encoding along the human auditory pathway

In noisy settings, seeing a talker makes their speech much easier to understand. Several studies have demonstrated cortical effects of audio-visual integration in humans and animal models. Subcortically, some animal work has shown effects of visual stimuli in auditory areas, but there is very little human work to back that up.

In this study, we presented 23 listeners with audio-visual speech under two conditions: coherent, in which the acoustic and visual speech matched, and incoherent, in which the visual speech was replaced with a different recording of the same talker. The target speech was presented alongside two acoustic masker talkers. Listeners were asked to report keywords. We recorded EEG and computed the brainstem temporal response function, from which we derived a waveform for each condition resembling the auditory brainstem response (ABR).

Behavioral results confirmed the perceptual benefit of the coherent condition over the incoherent: all subjects showed better performance, with a mean improvement of 10% correct. ABR waveforms to target speech did not differ between the two audio-visual conditions. Responses to masker speech were similarly unaffected by the visual stimulus.

It is clear from our behavioral results and countless prior studies that congruent visual speech improves understanding in the presence of background noise. Audio-visual integration of speech signals has been shown in humans in later cortical waves, but was not seen subcortically in our present study. This is consistent with recent work from our lab showing that selective attention impacts cortical but not subcortical EEG responses in human listeners.
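
As context for the analysis described above, the sketch below shows one common way a temporal response function can be estimated from continuous EEG by ridge-regularized deconvolution against a stimulus regressor. The half-wave-rectified audio regressor, lag window, regularization strength and variable names are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of a temporal response function (TRF) estimated by ridge-
# regularized deconvolution of continuous EEG against a stimulus regressor.
# The rectified-audio regressor, lag window and parameters are illustrative.
import numpy as np

def estimate_trf(regressor, eeg, fs, t_min=-0.005, t_max=0.015, lam=1e2):
    """Ridge-regression TRF: eeg[t] is modelled as sum_k w[k] * regressor[t - lag_k]."""
    lags = np.arange(int(round(t_min * fs)), int(round(t_max * fs)))
    X = np.column_stack([np.roll(regressor, k) for k in lags])  # lagged copies of the regressor
    w = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)
    return lags / fs, w

# Synthetic demo: an "EEG" trace that follows a half-wave-rectified broadband stimulus
fs = 10_000
rng = np.random.default_rng(0)
audio = rng.standard_normal(5 * fs)
regressor = np.maximum(audio, 0.0)                      # stand-in for rectified speech audio
kernel = np.exp(-np.arange(60) / 15) * np.sin(2 * np.pi * np.arange(60) / 20)
eeg = np.convolve(regressor, kernel)[: len(regressor)] + 5 * rng.standard_normal(len(regressor))
times, trf = estimate_trf(regressor, eeg, fs)           # trf is the ABR-like waveform
```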

Dr Ross Maddox, University of Michigan, USA

11:30-11:45 Discussion
11:45-12:15 Talk title tbc
Dr Rebecca Norris, University College London, UK

12:15-12:30 Discussion

13:30-14:00 The function of top-down processes in segmenting and selecting objects in the visual scene

Accurate segmentation of the visual scene allows us to select and manipulate objects in our environment. Top-down connections in sensory systems are thought to modulate activity in primary sensory areas to enhance object-related activity while suppressing background activity. Here I will discuss recent work in mice, monkeys and humans showing how connectivity between cells in higher visual areas tuned for border-ownership and cells in V1 leads to precise scene segmentation. I will show how interaction with local circuitry in V1 allows top-down connections to drive activity, even in the absence of bottom-up input from the retina. Finally, I will discuss how segmentation processes evolve over time, from an early phase where local contextual effects determine activity to a later phase where the global scene organisation is represented in primary visual cortex.

Dr Matthew Self, University of Glasgow, UK

14:00-14:15 Discussion
14:15-14:45 Natural audiovisual speech encoding in the early stages of the human cortical hierarchy

Seeing a speaker’s face in a noisy environment can greatly improve one’s ability to understand what they are saying, a process that is attributed to the multisensory integration of audio and visual speech. In this talk, I will present a model of such multisensory integration that is based on the notion that visual speech can influence auditory speech processing at multiple stages of processing – including an early stage based on the correlated dynamics of visual and auditory speech and later stages where the form of visual articulators helps with linguistic categorization. This model relies on the hypothesis that visual cortex represents both low-level visual features and higher-level linguistic cues and that these representations can differentially and flexibly influence the processing of audio speech. I will present evidence for this model across a series of studies that involved modeling EEG responses obtained from adult participants while they were presented with naturalistic audio-visual speech stimuli.

Professor Edmund Lalor, University of Rochester, USA and Trinity College Dublin, Ireland

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 See what you hear: Making sense of the senses

Adaptive behaviour in a complex, dynamic, and multisensory world raises some of the most fundamental questions for neural processing, notably perceptual inference, decision making, learning, binding, attention and probabilistic computations. In this talk, I will present our recent behavioural, computational and neural research that investigates how the brain tackles these challenges. First, I will focus on how the brain solves the causal inference or binding problem, deciding whether signals come from common causes and should hence be integrated or else be processed independently. Combining psychophysics, Bayesian modelling and neuroimaging (fMRI, EEG) we show that the brain arbitrates between sensory integration and segregation consistent with the principles of Bayesian Causal Inference by dynamically encoding multiple perceptual estimates across the cortical hierarchy. Next, I will explore how prior expectations and attentional mechanisms can modulate sensory integration. Finally, I will show research into how the brain solves the causal inference problem in more complex environments with multiple signals and sources.
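
For readers unfamiliar with Bayesian Causal Inference, the sketch below implements the standard two-hypothesis formulation for a single audio-visual trial (after Körding et al., 2007): the observer computes the posterior probability that the two sensory samples share a common cause and averages the fused and unisensory location estimates accordingly. The noise and prior parameters are illustrative; this is not the speaker's specific implementation.

```python
# Sketch of the standard two-hypothesis Bayesian Causal Inference model for
# one audio-visual trial (cf. Kording et al., 2007). Parameter values are
# illustrative only.
import numpy as np

def bci_estimate(x_a, x_v, sig_a=4.0, sig_v=1.5, sig_p=15.0, mu_p=0.0, p_common=0.5):
    """Return p(common cause | x_a, x_v) and the model-averaged auditory location estimate."""
    va, vv, vp = sig_a**2, sig_v**2, sig_p**2
    # Likelihood of the two sensory samples under a single common source
    d1 = va * vv + va * vp + vv * vp
    like_c1 = np.exp(-0.5 * ((x_a - x_v)**2 * vp + (x_a - mu_p)**2 * vv + (x_v - mu_p)**2 * va) / d1) \
              / (2 * np.pi * np.sqrt(d1))
    # Likelihood under two independent sources
    like_c2 = np.exp(-0.5 * ((x_a - mu_p)**2 / (va + vp) + (x_v - mu_p)**2 / (vv + vp))) \
              / (2 * np.pi * np.sqrt((va + vp) * (vv + vp)))
    post_c1 = like_c1 * p_common / (like_c1 * p_common + like_c2 * (1 - p_common))
    # Optimal location estimates under each causal structure
    s_c1 = (x_a/va + x_v/vv + mu_p/vp) / (1/va + 1/vv + 1/vp)   # fuse audio and vision
    s_c2 = (x_a/va + mu_p/vp) / (1/va + 1/vp)                   # audio alone
    return post_c1, post_c1 * s_c1 + (1 - post_c1) * s_c2        # model averaging

# 6-degree audio-visual conflict: posterior favours a common cause,
# so the auditory estimate is pulled part-way toward the visual sample.
p_c1, s_hat = bci_estimate(x_a=8.0, x_v=2.0)
```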

Professor Uta Noppeney, Donders Institute, Radboud University, the Netherlands

16:00-16:15 Discussion
16:15-16:45 Talk title tbc
Professor Aleena Garner, Harvard University, USA

16:45-17:00 Discussion
17:00-18:30 Drinks reception and poster session
18:30 Close

Wednesday 4 March 2026

09:00-09:30 Multisensory Speech Perception: Models and Mechanisms

The most natural form of human interaction is face-to-face, integrating auditory information from the voice of the talker with visual information from the face of the talker. A dramatic illustration of the influence of visual information on speech perception is provided by the McGurk effect. In this illusion, pairing an auditory "ba" with an incongruent visual "ga" sometimes results in the percept of a different syllable. We show that an artificial neural network known as AV-HuBERT also perceives the McGurk effect. Both humans and AV-HuBERT report a mixture of percepts to McGurk stimuli, including the auditory component of the stimulus, the fusion percept of "da" reported in the original description of the illusion, and other syllables such as "fa" and "ah". The similar responses of humans and AV-HuBERT to McGurk stimuli suggest that artificial neural networks may provide a useful model for human audiovisual speech perception. The neural basis of audiovisual speech perception was examined using stereoelectroencephalographic (SEEG) recordings from neurosurgical patients. These recordings demonstrate an anatomical boundary between the superior temporal gyrus, which responds similarly to auditory-only and audiovisual speech, and the superior temporal sulcus, which shows larger and faster responses to audiovisual compared with auditory-only speech. Consistent with behavioural studies, audiovisual enhancement was more pronounced in the presence of auditory noise.

09:30-09:45 Discussion
09:45-10:15 Audiovisual scene dynamics and their influence on loudness perception

Understanding speech in noisy, multisensory environments is a fundamental challenge for both listeners and researchers. While most studies focus on speech intelligibility, less is known about how the perceptual construct of loudness is influenced by the structure of the audiovisual (AV) scene. This work examines how systematically manipulating the temporal synchrony between auditory and visual signals, as well as the linguistic content of the speech, modulates perceived loudness. Our results show that increased AV asynchrony leads to a significant drop in perceived loudness, with effects emerging beyond natural synchrony ranges. Surprisingly, linguistic complexity also alters loudness ratings, even when physical intensity is held constant. By transforming subjective ratings into an objective measure of perceived target-to-masker ratio (TMR), we demonstrate that extreme AV asynchrony results in a 2–3 dB drop in perceived TMR, independent of linguistic content.

These findings highlight the value of loudness perception as a sensitive index of AV scene analysis. This approach offers new avenues for studying multisensory processing in both typical and neurodiverse populations, and for developing accessible protocols that decouple scene analysis from linguistic ability.
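
As a point of reference for the TMR measure mentioned above, the snippet below shows the generic definition of target-to-masker ratio in dB and what a 2–3 dB drop corresponds to in amplitude terms; the mapping from subjective ratings to perceived TMR used in this work is not reproduced here.

```python
# Illustration of target-to-masker ratio (TMR) in dB and what a 2-3 dB drop
# means in amplitude terms. Generic definition only, not the study's
# rating-to-TMR transformation.
import numpy as np

def tmr_db(target, masker):
    """TMR in dB from the RMS levels of the target and masker waveforms."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    return 20.0 * np.log10(rms(target) / rms(masker))

rng = np.random.default_rng(1)
target = rng.standard_normal(48_000)      # 1 s of noise as a stand-in target
masker = rng.standard_normal(48_000)
print(tmr_db(target, masker))             # ~0 dB for equal-level signals
print(10 ** (-2.5 / 20))                  # a 2.5 dB TMR drop ~ 25% lower target amplitude
```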

Dr Liesbeth Gijbels, University of Washington, USA

10:15-10:30 Discussion
10:30-11:00 Break
11:00-11:30 A model incorporating the influence of gaze on apparent sound source direction

The mistakes that listeners make in determining the direction of a sound source are not solely the consequence of sampling error from a distribution centered around the source’s true position. Burgeoning evidence suggests that the distributions themselves have characteristic shifts or biases that depend on adaptive processes of different time courses, the spatial statistics and recent history of the auditory scene itself, the relative angle of the head and of the torso, and on eye gaze angle, to name a few contributing factors. These effects indicate that perceived acoustic space is not fixed but undergoes continual bias remapping in response to behavioural and sensory context. It is the intent of this talk to review the history of what we know about such systematic changes in spatial acoustic perception and in particular the interaction with gaze. We will then present a neurophysiologically inspired model of how head angle and eye gaze influence sound localization, as well as related phenomena such as sound source segregation and spatial release from masking. We conclude by comparing model predictions with key results from the literature, demonstrating that the model captures a common structure underpinning the diverse ways in which gaze alters spatial auditory perception.
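
To make the idea of gaze-dependent bias remapping concrete, the toy simulation below draws localization reports whose mean is shifted a fraction of the way toward the current eye position. The linear form and the gain value are purely illustrative assumptions, not the neurophysiologically inspired model described in the talk.

```python
# Toy illustration of a gaze-dependent localization bias: reported azimuths
# come from a distribution whose mean is shifted a fraction of the way toward
# the current eye position. The form and gain are illustrative only.
import numpy as np

def simulate_reports(source_az, gaze_az, gaze_gain=0.1, noise_sd=4.0, n=1000, seed=0):
    """Simulated localization reports (degrees) for one source and one gaze direction."""
    rng = np.random.default_rng(seed)
    bias = gaze_gain * (gaze_az - source_az)      # partial shift toward gaze
    return source_az + bias + noise_sd * rng.standard_normal(n)

left_gaze = simulate_reports(source_az=0.0, gaze_az=-20.0)
right_gaze = simulate_reports(source_az=0.0, gaze_az=+20.0)
print(left_gaze.mean(), right_gaze.mean())        # the mean report shifts with gaze, not just its spread
```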

Dr Owen Brimijoin, Facebook Reality Labs

11:30-11:45 Discussion
11:45-12:15 Talk title tbc
Dr Mark Wallace, Vanderbilt University, USA

12:15-12:30 Discussion

13:30-14:00 Learnings from the audio-visual speech enhancement challenge: from stimuli design to evaluation

Speech enhancement technologies have developed rapidly in the last decade. Human multi-modal speech perception has inspired the next generation of these technologies to explore approaches that leverage visual information to overcome scenarios that are challenging for audio-only speech enhancement models (e.g., overlapping speakers).

The Audio-Visual Speech Enhancement Challenge (AVSEC) provided the first benchmark to assess algorithms that use lip-reading information to augment their speech enhancement capabilities. Throughout its four editions, we provided carefully designed datasets that enabled us to explore AV-SE performance in a range of listening conditions different from those commonly used in laboratory settings. In this talk, I will present our proposed approach to scalable stimuli design from in-the-wild data and introduce the AVSEC protocol for human listening evaluation of AV-SE systems. I will present an overview of the listening test results throughout different editions of the challenge and discuss them in relation to characteristics of the designed stimuli.
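
As background on stimulus construction for such challenges, the sketch below shows a generic way of mixing a target utterance with an interferer at a prescribed SNR, a typical step when building speech-enhancement datasets; it is not the actual AVSEC data-generation recipe.

```python
# Generic illustration of mixing a target utterance with an interferer at a
# prescribed SNR, a common step when constructing speech-enhancement stimuli.
# This is not the AVSEC data-generation recipe.
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target-to-interferer ratio equals snr_db, then sum."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)) + 1e-12)
    interferer = interferer[: len(target)]                     # crude length matching
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20.0))
    mixture = target + gain * interferer
    return mixture / max(1.0, np.max(np.abs(mixture)))         # avoid clipping

rng = np.random.default_rng(2)
clean = rng.standard_normal(16_000)        # stand-ins for 1 s of speech at 16 kHz
noise = rng.standard_normal(16_000)
noisy = mix_at_snr(clean, noise, snr_db=-5.0)                  # a challenging low-SNR condition
```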

Dr Lorena Aldana, University of Edinburgh, UK

14:00-14:15 Discussion
14:15-14:45 Talk title tbc
Professor Lorenzo Picinali, Imperial College London, UK

14:45-15:00 Discussion
15:00-15:30 Break
15:30-16:00 Talk title tbc
Professor Chris Sumner, Nottingham Trent University, UK

16:00-16:15 Discussion
16:15-17:00 Panel discussion/overview
17:00 Close