Talk title: Opportunities in Egocentric Vision
Abstract: Forecasting the rise of wearable devices equipped with audio-visual feeds, this talk will present opportunities for research in egocentric video understanding. The talk argues for new ways to perceive egocentric videos as partial observations of a dynamic 3D world, where objects are out of sight but not out of mind. I’ll review a new data collection and annotation effort, HD-EPIC (https://hd-epic.github.io/), which merges video understanding with 3D modelling, showcasing current failures of VLMs in understanding the perspective outside the camera’s field of view, a task that is trivial for humans. All project details are at: https://dimadamen.github.io/index.html#Projects
Bio: Dima Damen is a Professor of Computer Vision at the University of Bristol and Senior Research Scientist at Google DeepMind. Dima is currently an EPSRC Fellow (2020-2025), focusing her research on the automatic understanding of object interactions, actions and activities using wearable visual (and depth) sensors. She is best known for her leading work in Egocentric Vision, and has also contributed to novel research questions including mono-to-3D, video object segmentation, assessing action completion, domain adaptation, skill/expertise determination from video sequences, discovering task-relevant objects, dual-domain and dual-time learning, as well as multi-modal fusion using vision, audio and language. She is the project lead for EPIC-KITCHENS, the seminal dataset in egocentric vision, with accompanying open challenges and follow-up works: EPIC-Sounds, VISOR and EPIC Fields. She is part of the large-scale consortium efforts Ego4D and Ego-Exo4D. She is an ELLIS Fellow, Associate Editor-in-Chief (AEIC) of IEEE TPAMI and Associate Editor (AE) of IJCV, and was a program chair for ICCV 2021. She is frequently an Area Chair at major conferences and was selected as an Outstanding Reviewer at CVPR 2021, CVPR 2020, ICCV 2017, CVPR 2013 and CVPR 2012. Dima received her PhD from the University of Leeds (2009), joined the University of Bristol as a Postdoctoral Researcher (2010-2012), became Assistant Professor (2013-2018) and Associate Professor (2018-2021), and was appointed to a chair in August 2021. She supervises 11 PhD students, 2 visiting PhD students and 2 postdoctoral researchers. At the University of Bristol, Dima leads the Machine Learning and Computer Vision (MaVi) lab, and is the university chair of the Research Data Storage Management Executive Board. At Google DeepMind, Dima is part of the Vision team, led by Andrew Zisserman, focusing on video understanding research. Her latest contribution is to the Perception Test project on measuring perception in AI models.
Talk title: Machine learning applications in distributed rendering and path tracing
Abstract: In the first study, distributed rendering is investigated. In cloud-based gaming and virtual reality (G&VR), scene content is rendered on a cloud server and streamed as low-latency encoded video to the client device. Distributed rendering aims to offload parts of the rendering to the client. An adaptive approach is proposed, which dynamically assigns assets to client-side vs. server-side rendering according to varying rendering-time and bitrate targets. This is achieved by streaming perceptually optimized scene control weights to the client, which are compressed with a composable autoencoder in conjunction with select video segments. This creates an adaptive render-video (REVI) streaming framework, which allows for substantial trade-offs between client rendering time and the bitrate required to stream visually lossless video from the server to the client. A key result is that, when the client provides 50% of the rendering time needed to render the whole scene, an average bitrate saving of up to 60% is achieved versus streaming the entire scene to the client as video.
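To make the adaptive assignment concrete, the sketch below shows one simple way a client/server split could be decided: greedily offload to the client the assets that save the most streaming bitrate per millisecond of client rendering time, until a client render-time budget is spent. All names, numbers and the greedy heuristic are illustrative assumptions, not the REVI implementation, which drives the split with perceptually optimized scene control weights.

# Minimal sketch (assumed names and heuristic; not the REVI implementation).
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    client_render_ms: float    # estimated client-side rendering cost for this asset
    video_bitrate_kbps: float  # estimated bitrate to stream this asset as video instead

def assign_assets(assets, client_budget_ms):
    """Split assets into client-rendered and server-rendered sets under a
    client rendering-time budget, prioritising bitrate saved per millisecond."""
    ranked = sorted(assets,
                    key=lambda a: a.video_bitrate_kbps / max(a.client_render_ms, 1e-6),
                    reverse=True)
    client, server, used_ms = [], [], 0.0
    for asset in ranked:
        if used_ms + asset.client_render_ms <= client_budget_ms:
            client.append(asset)   # rendered locally on the client device
            used_ms += asset.client_render_ms
        else:
            server.append(asset)   # rendered in the cloud and streamed as video
    return client, server

# Example: with a 10 ms client budget, the scene is split between client and server.
scene = [Asset("terrain", 6.0, 900.0),
         Asset("characters", 5.0, 1200.0),
         Asset("particles", 3.0, 400.0)]
local_assets, streamed_assets = assign_assets(scene, client_budget_ms=10.0)

Varying the budget traces out the rendering-time versus bitrate trade-off described above; the actual framework additionally compresses the scene control weights with a composable autoencoder alongside the video segments.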
In the second study, we introduce a compact learning-based representation that encodes the full Monte Carlo sampling distribution of an image during path tracing. Our format enables rendering at an arbitrary number of samples per pixel (SPP) specified at inference time, without the need for expensive path tracing operations. We achieve this by fitting parametric distributions to the per-pixel radiance values and demonstrating how these can be fitted, stored and sampled efficiently. By encoding a richer representation of scene lighting, our method unifies diverse SPP requirements into one format, cutting storage costs while maintaining flexibility and fidelity across the entire SPP range.
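As a minimal illustration of the fit/store/sample idea, the sketch below fits a simple per-pixel parametric distribution to Monte Carlo radiance samples, stores only its parameters, and later draws an arbitrary number of samples per pixel without re-running the path tracer. The per-pixel Gaussian and all function names are assumptions chosen for brevity, not the learned representation used in the study.

# Minimal sketch under simplifying assumptions (per-pixel Gaussian, hypothetical names).
import numpy as np

def fit_pixel_distributions(samples):
    """samples: (H, W, S) radiance samples from a path tracer run at S spp.
    Returns an (H, W, 2) array of per-pixel (mean, std) parameters."""
    return np.stack([samples.mean(axis=-1), samples.std(axis=-1)], axis=-1)

def render_at_spp(params, spp, rng=None):
    """Draw `spp` fresh samples per pixel from the stored distribution and
    average them, emulating a render at an arbitrary sample count."""
    rng = rng or np.random.default_rng(0)
    mean, std = params[..., 0], params[..., 1]
    draws = rng.normal(mean[..., None], std[..., None], size=(*mean.shape, spp))
    return np.clip(draws, 0.0, None).mean(axis=-1)   # radiance is non-negative

# Example: fit on 64-spp data, then "re-render" the same image at 4 and 256 spp.
rng = np.random.default_rng(1)
raw_samples = rng.gamma(shape=2.0, scale=0.5, size=(4, 4, 64))  # stand-in radiance
params = fit_pixel_distributions(raw_samples)                   # 2 floats per pixel
preview = render_at_spp(params, spp=4)
high_quality = render_at_spp(params, spp=256)

The Gaussian here is only a stand-in to show the pipeline; the study fits richer parametric distributions within a compact learned format to retain fidelity across the full SPP range.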
Link to first paper: Adaptive Render-Video Streaming for Virtual Environments
Bio: As a student, Matthias Treder obtained degrees in Mathematics, Computer Science, and Cognitive Psychology. In 2009, he completed a PhD on human vision at Radboud University Nijmegen. He then worked on brain-computer interfaces at TU Berlin (2009-13), and on the application of machine learning to neuroimaging data at the University of Cambridge (2013-15) and the University of Birmingham (2016-17). He joined Cardiff University in 2017 and became a Lecturer in Computer Science in 2018. In 2021, he joined the Computer Vision startup iSIZE, which was acquired by Sony Interactive Entertainment in 2023. His current role is Senior Manager in Computer Vision. Together with his team, he develops ML-based solutions for enhanced computer graphics targeting the PlayStation ecosystem. Matthias’s research interests are primarily in ML applications for graphics and rendering, including neural approaches for real-time Path Tracing and related techniques for global illumination estimation.
Talk title: Multimodal AI for Real Time Communication
Abstract: Modern AI is transforming the way we live and work. It impacts our day-to-day jobs in every sector, including science, teaching, engineering, healthcare and public service, and will continue to transform our lives.
From a science and research perspective, the landscape is also changing. No longer can we easily distinguish between fields such as computer vision, graphics and audio processing. Instead, we live in a multimodal world, where solving important problems involves learning from and utilising multiple signals. However, major challenges remain, from learning approaches and architectures to creating efficient real-time models that are low-cost and run on consumer devices.
In this talk I will discuss some of these trends, drawing on my experience first as an academic leading a large multidisciplinary research centre (CAMERA), and then leading real-time AI efforts at Microsoft for Mixed Reality devices and software, and for the world’s largest communications platform, Microsoft Teams.
Bio: Darren Cosker leads a team at Microsoft focusing on real-time multimodal AI. His team has developed, built and shipped AI technologies for Microsoft’s Mixed Reality platform Microsoft Mesh, shipped AI on devices such as Meta Quest, developed real-time AI systems for Microsoft HoloLens, and now works on future communication technologies for Teams, a platform with billions of users.
He also holds a professorship at the University of Bath, where he was previously founding Director (2015-2021) of the £20m research centre CAMERA, the Centre for Analysis of Motion, Entertainment Research, and Applications.
Talk title: Towards a new generation of standards for AI-powered imaging
Abstract: The landscape of imaging science and technology is undergoing a seismic shift, propelled by the relentless march of Artificial Intelligence. From the subtle nuances of facial recognition to the uncanny realism of deepfakes, AI is not merely enhancing our visual world—it's fundamentally reshaping it.
We've witnessed the rise of end-to-end autoencoders achieving unprecedented efficiency in image compression, and of deep learning resurrecting details lost to noise and low resolution. But this is just the beginning.
We start by dissecting a profound transformation. For decades, our visual technologies have catered to the human eye, optimizing for our perception. Now, as machines increasingly consume and interpret visual data autonomously, we're witnessing a paradigm shift that demands new approaches.
Furthermore, the line between real and synthetic content blurs, challenging our notions of authenticity. We'll explore the implications of AI-generated media and the urgent need for robust methods to cope with them.
Finally, we'll spotlight the emerging JPEG standards, like JPEG AI and JPEG Trust, which are poised to harness AI's potential while safeguarding the integrity of our visual world.
Bio: Touradj Ebrahimi is a professor of image processing at Ecole Polytechnique Fédérale de Lausanne (EPFL), active in teaching and research in multimedia signal processing. He is the founder and the head of the Multimedia Signal Processing Group at EPFL. Since 2014, he has been the Convener of the JPEG standardization Committee which has produced a family of standards that have revolutionized the imaging world. He represents Switzerland as the head of its delegation to JTC 1 (in charge of standardization of information technology in ISO and IEC), and SC 29 (the body overseeing MPEG and JPEG standardization). He is a member of ITU representing EPFL and contributes to its SG12 and SG16 activities. Prof. Ebrahimi is a consultant, evaluator, and expert for the European Commission and other governmental funding agencies in Europe, North America, and Asia. He advises several Venture Capital companies in Switzerland in their scientific and technical audits. He has founded several startup and spinoff companies in the past two decades, including the most recent RayShaper SA, a startup based in Switzerland involved in AI-powered multimedia. His areas of interest include image and video compression, media security, quality of experience in multimedia, and AI-based image and video processing and analysis. Prof. Ebrahimi is a Fellow of the IEEE, SPIE, EURASIP, and AAIA and has been the recipient of several awards and distinctions, including an IEEE Star Innovator Award in Multimedia, an Emmy Award on behalf of JPEG, and the SMPTE Progress medal.