Listen "EP03: Perspectives of Computer Vision Researchers on the Challenges and Opportunities for Advancing Visual Interpretation Technologies"
Episode Synopsis
Marcus Rohrbach, Andrew Howard, and James Coughlan share their experiences developing state-of-the-art research in visual interpretation algorithms and systems.
Questions asked in the episode
[04:30] For the sake of the audience, could you share how you got into your current line of work as a computer vision researcher, and what problems you work on?
[17:46] What kinds of methods are you using to tackle the problems you are working on, and what errors and limitations do you encounter with these methods?
[38:03] What do you see as the key turning points in the computer vision community over the years or decades that have shifted which problems our community is able to address, and what do these changes mean for end users?
[48:11] What does the next decade look like? What do you think is going to be the next big thing?
Guest bios
Marcus Rohrbach is a Research Scientist at Meta AI Research, with a PhD from the Max Planck Institute for Informatics. Marcus is best known for his work at the intersection of computer vision and natural language processing. Over his career, he has driven key progress in visual question answering, language grounding, and generating descriptions of images and videos, in particular movies, all of which is highly relevant for the blind/low-vision community.
Andrew Howard is a Senior Staff Software Engineer at Google Research, with a PhD in Computer Science from Columbia University. Andrew is best known for his work on mobile-friendly deep learning models. Starting with MobileNets, then MobileNetV2, MobileNetV3, and MnasNet, his work has been broadly adopted in deep learning frameworks like PyTorch and TensorFlow, as well as across a host of mobile phone platforms and apps.
James Coughlan is a Senior Scientist at the Smith-Kettlewell Eye Research Institute, with a PhD in Physics from Harvard University. James has been at Smith-Kettlewell since 1998 and over this time has developed a wide array of impactful technologies for the blind and low-vision community.
Daniela Massiceti is a machine learning researcher at Microsoft Research. Her research focuses on the intersection of ML and human-computer interaction. She is primarily interested in ML systems that learn and evolve with human input, so-called “teachable” systems, which give users the power to completely customise their AI experiences, from personalised assistive tools for people who are blind/low-vision to personalised avatars in the metaverse.
Links to resources mentioned
https://vizwiz.org/workshops/2022-workshop/