With the LDV Vision Summit fast approaching, we wanted to catch up with some of the computer vision scientists and researchers who work deep inside the internet giants and who will be speaking at the event. Their research may not be as well known as the features it powers, but they deserve as much credit as the CEOs who had the insight to hire them. Andrea Frome of Research at Google took time out of her busy schedule to answer some of our questions:
A little bit about yourself: what is your background, and how did you end up at Research at Google?
I’ve taken an unusual path compared to many of my peers. I got my B.S. in Environmental Science at the University of Mary Washington in Fredericksburg, VA, and after working in environmental consulting for a couple of years doing modeling and database application work, I decided I wanted to go to graduate school for Computer Science. I was accepted to a program that existed at UC Berkeley at the time called the Computer Science Reentry Program, which was for women and underrepresented minorities who held a non-CS bachelor’s degree but were looking to go to grad school for CS. I spent four semesters building up a transcript of Computer Science and EE undergraduate courses. I applied for graduate school, chose to attend Berkeley, and ultimately finished my Ph.D. at UC Berkeley in Computer Vision and Machine Learning with Dr. Jitendra Malik in 2007. In the last two years of my Ph.D., I did two internships at Google, first with Image Search for six months and then in Research with Dr. Yoram Singer for about ten months; these led to papers co-authored with Yoram and ultimately to my thesis. I joined Google as a Software Engineer after graduation, and while I’m a Software Engineer by title, I’m associated with Research at Google through the Deep Learning Research group, and my projects on both Street View and my current team have had large research components, leading to several papers.
What drives your research? How do you decide what to research?
A sense of what currently feels like the “right” direction to go, which I think comes from understanding how current popular techniques and research directions fall short of the larger goals of the Computer Vision field, while also taking into account which directions are promising in the near or medium term and aren’t where people are currently pushing. I credit my advisor, Jitendra Malik, with developing this sense in his students, and I have had the great fortune of getting to know many talented researchers from our lab and the larger community who are particularly good at this, like Dr. Alyosha Efros. Another aspect is that, in general, I tend not to be satisfied with working on a small increment from what is currently being done; I like to take things in directions that may shake things up a little and challenge the dominant paradigm.
Is your ultimate goal to properly classify all images taken? Is that even possible?
As an ultimate goal, classifying images is too limited. Many in the community, myself included, see the grand goal as a system that can fully understand visual input in the way that humans are able to. Humans don’t look at a room and think “desk”, “chair”, “coffee mug”. Instead, humans understand why things are in their particular places and how a person might interact with the objects or the space. We recognize people’s actions and their intentions, and we predict what will happen next. For example, you understand from visual input and your interactions with the world what will happen if you tip the table your coffee mug is resting on, or if you try to place your coffee mug on your keyboard. Not only do I believe that we can build systems that learn these things, I believe we will be able to build systems that do this processing on video in real time, and that those systems will learn from large amounts of video without human labeling. This is still a ways out, but the pace of progress is increasing, our toolbox and knowledge are getting stronger by the month, and more than I’ve seen previously in my career, researchers are pushing toward these big goals.
Which of your research are you most proud of? Why?
I would guess many driven people have complicated relationships with their work product. Often by the time I’ve published a paper on a topic, and certainly after talking about it for a few months, I have examined it from many angles, see primarily the shortcomings, and am ready to move on to the next thing. For example, I’m proud of the work my team did on Street View face and license plate blurring because we addressed a real need on a difficult problem on a tight timeline, and it allowed Street View to become a global product. However, if I were to solve the same problem again today, I would take a different algorithmic approach. I am happy that two of my papers at the end of my Ph.D. were the introduction for some in the Computer Vision community to the use of image triplets for learning image similarity, but I think it’s a mistake to hold too tightly to credit for things like this. I hope that I will be most proud of my current work in the area of “visual attention”, which I believe will be a critical advancement toward some of the grand goals of Computer Vision.
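For readers unfamiliar with the idea, the image-triplet approach mentioned above trains a similarity model from relative comparisons: an anchor image should end up closer to a "positive" (similar) image than to a "negative" (dissimilar) one. The sketch below is a generic hinge-style triplet loss on toy embedding vectors, purely for illustration — the exact formulation in Frome's papers may differ.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors: penalize the model
    unless the anchor is closer to the positive than to the negative by
    at least `margin` (squared Euclidean distance)."""
    d_pos = np.sum((anchor - positive) ** 2)  # anchor-to-positive distance
    d_neg = np.sum((anchor - negative) ** 2)  # anchor-to-negative distance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: anchor near the positive, far from the negative.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 — the ordering constraint is already satisfied
```

Training on many such triplets pushes an embedding space toward respecting human judgments of "this image is more like that one than the other."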
What was your hardest research to date, and why?
My current research in mechanisms for “attention”, which is a shortcut phrase for algorithms that process some input and, based on that, decide what additional information to gather and incorporate to achieve a goal, e.g., object detection and tracking or machine translation. For vision, it’s also referred to as “learning where to look”. Many in the field are working on it, with a big increase in just the last year, and there are many algorithmic challenges to making it accurate and efficient. Good solutions will need to bring together many tools from the machine learning and/or neural network toolboxes in new ways, and will likely require the development of new techniques.
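To make the idea concrete for readers: one common way to implement "learning where to look" is soft attention, where the model scores candidate regions against a query and averages their features with softmax weights, concentrating on high-scoring regions. This minimal sketch is a generic textbook formulation, not a description of the specific approach Frome's team is pursuing.

```python
import numpy as np

def soft_attention(regions, query):
    """Score each region's feature vector against a query vector, then
    return softmax weights and the attention-weighted average feature —
    the model 'looks' mostly at regions that match the query."""
    scores = regions @ query                 # one relevance score per region
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    return weights, weights @ regions        # attended feature vector

# Three toy region features; the query resembles the first region most.
regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
query = np.array([1.0, 0.0])
weights, attended = soft_attention(regions, query)
print(weights)  # largest weight on the first region
```

Because the weighted average is differentiable, such a mechanism can be trained end-to-end with the rest of a neural network; the efficiency challenge mentioned above is that scoring every region can be expensive for large inputs or video.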
What are some of the places at Google where we can see practical applications of your research?
I’m limited in how much I can say, but one example I can give is Street View, where our research was behind the automated face and license plate blurring. Putting those privacy controls into place was critical for launching Street View around the world. Another example is from my work with Image Search as an intern. The work that I did in 2005 was the seed for what became the “Related Images” and “Search By Image” features, though lots of research and work was put into those features by others after my involvement ended.
Looking around the photo:tech space, what do you see that excites you?
I am excited about systems that assist the blind in interacting with the visual world, which is an area of interest for Serge Belongie as well. It presents difficult challenges both in terms of the quality of our visual algorithms and in designing how we communicate visual information to a user, while at the same time being grounded in a real use case where we can gather feedback and data. If we can design systems that are compelling and usable for that market, I am certain that along the way we will also discover other compelling use cases that we hadn’t thought of before.
With the LDV Summit coming up, what do you hope to get from it?
I’m involved with organization of CVPR 2016—the big annual U.S. Computer Vision conference—as the Industry Relations Chair, and we’re expanding the conference expo to support a much larger industry presence. We want to make it a place where players from different parts of the field can come together to network and learn from each other, and I am looking to the LDV Summit as an opportunity to connect with startups and investors, understand better what they gain by engaging with the research community, and get them excited and involved in CVPR 2016. I’m also just excited to learn what is happening in the startup world and how it’s different from academia and work at big companies like Google.
What would you like to see happening that technology cannot yet deliver?
I am particularly interested in technology that has a benefit to society in terms of social, economic, health, or political good. The hard part about problems in these areas is that tech solutions alone often aren’t enough; they often also require understanding and negotiating the difficult and messy worlds of human organizations and political structures, and sometimes challenging them. I’m excited by technologists focusing on the good they can effect in the world and bringing to bear the broad set of skills necessary to make great things happen.
[Editor’s note: Kaptur is a proud media partner of the LDV Vision Summit and will be bringing you privileged coverage of the event and its guests. Stay tuned…]
Photo by See-ming Lee 李思明 SML
Author: Paul Melcher
Paul Melcher is the founder of Kaptur and Managing Director of Melcher System, a consultancy for visual technology firms. He is an entrepreneur, advisor, and consultant with a rich background in visual tech, content licensing, business strategy, and technology, with more than 20 years’ experience developing world-renowned photo-based companies, including two successful exits.