
10 Questions for a visual researcher: Yahoo Labs

Data crunching for the discovery of underlying trends might seem restricted to obviously quantifiable results. However, Alejandro Jaimes, Director of Research/Video product at Yahoo Labs, has recently been applying his research to better understand aesthetic and cultural impact, both very subjective, in our interaction with visual content. Before speaking and judging at the upcoming LDV Vision Summit, he sat down with us to discuss his path, his research, and his vision of the visual:tech space.

A little bit about yourself: what is your background, and how did you end up at Yahoo Labs?

Alejandro Jaimes, Director of Research/Video product at Yahoo

I started programming when I was 12 (learned Assembly Language at 14), took on entrepreneurship as an undergraduate, studied photography alongside MFA students (and have since had many exhibitions), did five internships as a graduate student, and obtained a Ph.D. in Electrical Engineering after getting an M.S. in Computer Science, from Columbia University. During my internships, I discovered that if I really wanted to, I could live and work anywhere—and so I have. After graduating from Columbia in New York and working at IBM, I spent a few months in South America, then a few years in Tokyo, followed by Lausanne, Madrid, Barcelona, and Daejeon, South Korea. When I was living in Madrid I was offered the opportunity to join Yahoo, and I did, in Barcelona. Now I’m based in New York.

Over the years I’ve worked on a wide range of applied problems focused on machine learning—from social network analysis to recommender systems to computer vision.

What drives your research? How do you decide what to research?

At Yahoo Labs, our mission is to power Yahoo’s most critical products with innovative science. In that spirit, I like to focus my research on interesting problems that have real impact on people through our products. I am drawn to finding solutions to difficult problems for which there are no current solutions; that almost invariably, by itself, makes those problems interesting. However, I am only drawn to those problems if their solutions will have significant impact in real applications used by millions of people.

Furthermore, I have a strong interest in business and how people use technology. Therefore, when I look at projects I’m very focused on understanding the behavior of users in order to contribute to improving their experiences; at the same time, I consider how our research makes business sense. It’s all about balance—we try to make sure our research is technically interesting, impactful, user-focused, and makes business sense.

I believe strongly that research striving to improve user experience is at the core of innovation. Thus, over the years I’ve been developing a “Human-Centered Innovation” framework in which the starting points for projects are data analysis, hypotheses, and interaction. When I choose projects, I always start with one of those three elements and iterate cyclically several times over the other two, focusing on users within a specific application context. By following that process, I believe innovation–which has positive impact on users and is the core of business–flows naturally.

From The Beauty of Capturing Faces: Rating the Quality of Digital Portraits research paper
Some of your research focuses on creativity and aesthetics. Isn't that the hardest nut to crack: defining and quantifying something that is very subjective and cultural?

All of my work is “human-centered.” By definition, that means considering social, cultural, and “human” issues. This implies that in practically all of my work, there’s an intersection of qualitative and quantitative methods, and that often means going as deep as possible in quantifying subjectivity and understanding user needs. As a scientist and engineer, I feel extremely privileged to be able to work on things that make a difference—and I think in order to succeed we really need to keep a strong focus on users. Even what might at first glance be considered simple is hard and complex when we consider human factors.

The field of image retrieval (in computer science), for example, started with very basic algorithms to find similar images based on colors and textures. As it turned out, psychologists and artists had been studying colors and textures for a very long time, and librarians had created processes to classify and index images. Psychologists and linguists had also studied how people name and classify objects (a lot of which is culture dependent), and in spite of computational advances and exciting progress in recent years, the field is still in its infancy. So yes, it is very hard, but if we want to make things that matter, we cannot ignore the humanity of our users. That makes the work extremely exciting.
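Those very basic early algorithms can be sketched in a few lines. The following is a generic toy illustration (not any particular system's code): each image is reduced to a normalized color histogram, and candidates are ranked against a query by histogram intersection. The tiny grids of quantized color indices stand in for real images.

```python
# Toy color-based image retrieval: represent each image by a normalized
# color histogram and rank candidates by histogram intersection.

from collections import Counter

def color_histogram(image, n_bins=8):
    """Normalized histogram of quantized color values for a 2-D pixel grid."""
    counts = Counter(pixel % n_bins for row in image for pixel in row)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(n_bins)]

def histogram_intersection(h1, h2):
    """Classic similarity measure: sum of bin-wise minima (1.0 = identical)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

query = [[0, 0, 1], [1, 2, 2]]
candidates = {
    "near_dup": [[0, 1, 1], [1, 2, 2]],   # almost the same colors as the query
    "different": [[5, 6, 7], [7, 7, 6]],  # entirely different palette
}
q_hist = color_histogram(query)
ranked = sorted(
    candidates,
    key=lambda name: histogram_intersection(q_hist, color_histogram(candidates[name])),
    reverse=True,
)
print(ranked)  # the near-duplicate ranks first
```

The hard part of the field is everything this sketch ignores: texture, objects, semantics, and the culture-dependent ways people name and classify what they see.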

Speaking of culture, do you see a lot of variance in how people interact with photography depending on their country?

Culture is as much a part of who we are as memory, and even though technology is often deemed to be culturally neutral, the way we use it is not. Part of the problem is the definition of culture itself—there are so many possible sub-divisions, based on language, socioeconomic status, geography, country, etc. It’s definitely an area that has hardly been studied, and one of the biggest opportunities for Big Data. In general, I think, the biggest differences are in how cameras are used and perceived in different societies. In Korea, for example, I learned that most cars have cameras that continually record. Although the technology exists, it is not as prevalent in countries like the U.S.

In terms of consumer photography, I don’t think there has been a large case study, but without a doubt things like preferences for certain colors could be discerned, not just based on culture, but also on context. Cities, and sometimes countries, have certain “colors” (India might be a good example), and those would be more prevalent in photos at those locations. But I guess to really answer this question, one has to look at cultural visual traditions and do an in-depth analysis. Aspects would include not just the photos themselves, but how people share them. Based on some work I did on social media, it is apparent, for example, that people do tend to connect according to cultural norms. Some Asian societies are more collective in nature (as opposed to cultures in countries like the U.S. which tend to be more individualistic), and this translates to things like people being more reciprocal in following those that connect to them.

What was your hardest research project to date, and why?

The hardest projects are those in which there is no pre-determined ground truth, and where we have to come up with definitions that determine what that ground truth will look like. By ground truth, I mean simply an ideally large set of labeled examples that can be used by a machine learning algorithm.

Examples of this include our recent work at Yahoo on automatically creating short video summaries of a longer video or a collection of videos, and building classifiers to distinguish creative versus non-creative videos. In both cases, one of the biggest challenges is the set of questions that the problem leads us to deal with: if the goal is to create a 15 second video summary of a video, what constitutes a good summary? Arguably, it will depend on the goal: do you want to show only the highlights, or be as comprehensive as possible? What may seem like a simple problem (generate a 15 second summary) is composed of a series of complicated questions. And in addressing those questions, one has to ask human subjects to label data (in this case videos) to evaluate the performance of the algorithms developed. So, a lot of thinking has to go into how to define what a “good summary” is and whether a particular video is creative or not, including the goal and any instructions given to human annotators. With these kinds of problems, we are trying to create algorithms that mimic high-level human judgments, which is something not even humans are very good at!
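To make the ground-truth idea above concrete, here is a minimal sketch (an illustration only, not Yahoo's actual annotation pipeline) in which several human judgments per video are aggregated by majority vote, with ties left unlabeled until more judgments arrive:

```python
# Aggregating noisy human judgments into a ground-truth label set:
# several annotators rate each video as "creative" or "not_creative",
# and a majority vote fixes the label. Ties are treated as ambiguous.

from collections import Counter

def majority_label(annotations):
    """Return the most common label; ties yield None (no consensus)."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no consensus: the item needs more judgments
    return counts[0][0]

# Hypothetical annotator judgments for three videos.
judgments = {
    "vid_a": ["creative", "creative", "not_creative"],
    "vid_b": ["not_creative", "not_creative", "not_creative"],
    "vid_c": ["creative", "not_creative"],
}
ground_truth = {vid: majority_label(labels) for vid, labels in judgments.items()}
print(ground_truth)  # vid_c stays unlabeled until annotators agree
```

Even this tiny example surfaces the questions discussed above: the labels only mean something once "creative" has been defined for the annotators, and disagreement is itself a signal about how subjective the task is.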

Screenshot of Yahoo video home page
A lot of the very successful photo apps (Instagram, Snapchat, Pinterest) are about photo discovery rather than image search. Are we seeing an evolution in how people experience photography?

I think there’s been a long standing misunderstanding with regard to what image search is, in the sense that the main assumption with image search has often been that it is just like Web search. In many cases, when we perform a Web search, we are looking for a particular item (e.g., a document, a website, a product, etc.), and that same assumption has been the basis of most, if not all, image search applications.

We’ve been able to study user behavior in image search and found that a lot of image search is in fact about discovery, where people use different strategies depending on what they are looking for. Also, mostly, people are not looking for a single image, but are instead looking at many images within a category—the query is just a starting point for discovery and the query itself is the “category.”

The main difference, I think, can be made between “explicit” and “implicit” search. In some apps, there is no explicit search; there is no query in the traditional sense. This has really taken off, I think, due to the proliferation of high quality cameras in phones, and in general to the explosion in social media. With so many people creating and sharing, we are inundated by multimedia. This has democratized not only creation, but also access, creating very interesting technical challenges in automatically surfacing the most creative, highest quality content. That may be our biggest challenge: locating and exposing interesting, creative, high quality content.

Shatford-Panofsky Framework. The approach characterizes images/queries based on four facets (who, what, where, when) and three aspects (specific, generic, abstract). source
What are some of the places on Yahoo where we can see practical applications of your research?

There are many. As I mentioned, the mission of our team is to power some of Yahoo’s most popular products with innovative science. Our work on video, for example, impacts every single Yahoo product where videos are shown. This includes end-of-play recommendations, celebrity recognition (images and video), classification of images and videos (tagging), automatic thumbnail selection, near duplicate detection, automatic video previews, automatic image quality assessment, and several others.
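As one hedged illustration of what a component like near-duplicate detection can look like in its simplest textbook form (an "average hash"; this is not Yahoo's production system, which would use far more robust descriptors):

```python
# Toy "average hash" near-duplicate detection: threshold each pixel against
# the image mean, then compare images by the Hamming distance of the bits.

def average_hash(image):
    """image: 2-D grid of grayscale values -> tuple of 0/1 bits."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

original = [[10, 200], [220, 30]]
reencoded_copy = [[12, 198], [221, 35]]  # small pixel noise, e.g. recompression
unrelated = [[200, 10], [30, 220]]

print(hamming(average_hash(original), average_hash(reencoded_copy)))  # 0: near duplicate
print(hamming(average_hash(original), average_hash(unrelated)))      # 4: different
```

In practice the image is first downscaled to a small fixed grid (commonly 8×8) so that hashes are comparable across resolutions and robust to minor edits.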

In addition to computer vision, I’ve also worked on a lot of projects around Big Data and improving user experience. I led the creation of Yahoo Clues, which was a product for search insights, and worked with my team to build internal dashboards and insights reports, run a lot of A/B testing, and instrument several products to measure and improve user engagement and inform product design decisions (which features to include, etc.).

Yahoo Clues gave you insight into the types of people searching for specific keyword phrases and showed related terms based on those searches and searchers. No longer active.
Looking around the photo:tech space, what do you see that excites you?

I think we’re just beginning to see a revolution in photography. Cameras will continue to improve, but the real opportunities are in what we do with images and video. Video is still hard for most people, mainly because it’s difficult to create compelling videos, and I think we will see a shift there, and perhaps a bigger shift in new creativity tools based on computer vision. Some of these are likely to work “live” and others might take our huge collections of images and videos and help us create even more compelling content. But the other area that excites me is the communication opportunities afforded by cameras everywhere.

With the LDV summit coming up, what do you hope to get from it?

The LDV summit looks like a really exciting event; what I like most is the diversity in the roster of speakers, many of whom I know personally. I’m sure I’ll get very different perspectives on the future of photography, and I also hope to find new, exciting applications and connect with a lot of people doing interesting stuff around images. I like that the summit includes entrepreneurs, VCs, and academics. I’m excited to connect with a lot of startups, share insights into the innovation process I advocate, and learn more about what others are doing in this space.

Tell us what is interesting and unique about the LDV Vision Summit Entrepreneurial Computer Vision Challenges? Who should compete and why?

I’m a strong believer in competitions like the LDV Challenge. First, the Challenge helps participants focus on a particular task, whether it is presenting their startup or trying to solve a problem with a clear goal. Second, it encourages advances in the community as having several groups work on the same problem and sharing the results leads to innovation. Third, it’s a lot of fun to compete in such challenges, and to see how one’s approach may differ from what others do. One clearly unique aspect of the LDV Challenge is the mix of people and that the Challenge is focused on computer vision. I think it’s very exciting and every startup should compete. Challenges like this one, if taken seriously, require serious effort. However, the benefits far outweigh the investment.

What would you like to see happening that technology cannot yet deliver?

Integration of context. Over the years I’ve made around 200,000 photos, but the process of organizing them is still largely manual. A portion of those are geo-tagged, some are family photos, some are associated with events that I attended, etc. A lot of information is already in the collection, but none of it is used. I’d like to see all of that integrated so that I could more easily experience my collection. It’s not necessarily that technology cannot deliver this–all of the components are largely there. It just hasn’t been built. But I’m confident that it will be.

I think ultimately, however, what I’d really like to see is technology where my brain would be directly connected, wirelessly, to something that would record and also show me photographs and video based on my memory, and that would include both photos explicitly taken and images recorded by my own visual system. I could be sitting at a cafe in Paris and the smell of a baguette could trigger a memory of my first trip to Italy. I’d then be able to see images of that experience, and photos I had made with a camera (I think, and hope, the act of explicitly photographing will not disappear). And I could also see photos or videos of historical events. A similar scenario could occur as I walk down a street in any city. I imagine being able to see images of what that street was like 50, 100, or more years ago. In many ways, photographs are about memory and memories cannot easily be described because they store aspects of experiences. So linking them directly to our thoughts would allow us to recall a larger number of images and revisit memories of those experiences. Right now, a lot of that gets lost in the digital shoe-boxes in which we keep most of our own collections…

You can read some of Alejandro Jaimes’ research at Yahoo Labs here

[Editor’s note: Kaptur is a proud media partner of the LDV Vision Summit and will be bringing you privileged coverage of the event and its guests. Stay tuned…]

Photo by Joel Bedford

Author: Paul Melcher

Paul Melcher is a highly influential and visionary leader in visual tech, with 20+ years of experience in licensing, tech innovation, and entrepreneurship. He is the Managing Director of MelcherSystem and has held executive roles at Corbis, Stipple, and more. Melcher received a Digital Media Licensing Association Award, is a board member of Plus Coalition, Clippn, and Anthology, and has been named among the “100 most influential individuals in American photography.”
