LDV Vision Summit 17: 10 questions for a visual search expert at Pinterest

Paul Melcher

7 years ago

[It’s spring and the LDV Vision Summit is coming back. Like every year since its inception, we are proud to feature some of the prestigious speakers and judges who will take the stage during the two-day event. Here is the first of the series.]

With the appetite for visual content showing no signs of slowing down and users wanting to do more than just look at pictures, platforms like Facebook, Instagram, Snapchat, and others have been in a close race to outdo each other with new features which, more often than not, look identical. Not so, it seems, for Pinterest, which is carving its own path, deliberately taking full advantage of its unique dataset and user base. To learn more, we caught up with Pinterest’s visual search lead software engineer, Andrew Zhai, before his keynote at the LDV Vision Summit in May.

– A little bit about you: what is your background?

Andrew Zhai, Software Engineer / Tech Lead at Pinterest

I went to the University of California, Berkeley for my bachelor’s degree, where I studied Electrical Engineering and Computer Science. I’m currently part-time at Stanford working towards a CS Master’s specialization in AI. I became very interested in scalable computer vision problems at UC Berkeley, where I worked for two years in the Video and Image Processing Lab on scalable image geolocalization (i.e. given a street view image, determine the latitude/longitude of where the image was taken). Continuing this interest, I joined Pinterest in 2014, where I built the first productionized visual search system, retrieving results from billions of images in a fraction of a second. This technology has led to products such as visual search and Lens and improved existing products such as Related Pins. Currently, I am the technical lead of the visual search team at Pinterest.

– What is unique to Pinterest’s image dataset that cannot be found elsewhere? How is Pinterest’s data different from others’?

Pinterest is a visual discovery engine with 100 billion ideas saved by 175 million people around the world. What makes the Pinterest image dataset so powerful is the human-curated associations we have amongst billions of images.

The core object on Pinterest is the Pin, containing an image, descriptions, link, and other metadata. Our users organize these Pins into collections called boards, and, as such, each board contains a set of images related to each other. We call this relationship the taste graph and use it to directly power billions of recommendations every day through products like Related Pins as well as for training data for our visual search technology. Though individual associations generated by our users may be noisy, the aggregated signal from 100+ billion Pins organized into 2+ billion boards is robust.
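To make the idea of an aggregated signal concrete, here is a minimal sketch, not Pinterest's actual system. It assumes boards are simple lists of Pin names and treats a pair of Pins as related only when they appear together on several boards. The function names `co_save_counts` and `related_pins`, the `min_boards` threshold, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_save_counts(boards):
    """Count, for every unordered pair of Pins, how many boards contain both."""
    counts = Counter()
    for pins in boards.values():
        # Deduplicate and sort so each pair is counted once per board,
        # always in the same (alphabetical) order.
        for a, b in combinations(sorted(set(pins)), 2):
            counts[(a, b)] += 1
    return counts


def related_pins(pin, counts, min_boards=2):
    """Return Pins co-saved with `pin` on at least `min_boards` boards, strongest first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if n < min_boards:
            continue  # a single co-occurrence is treated as noise
        if a == pin:
            scores[b] = n
        elif b == pin:
            scores[a] = n
    return [p for p, _ in scores.most_common()]


# Hypothetical toy data: four boards curated by different users.
boards = {
    "living_room": ["chair", "sofa", "lamp"],
    "mid_century": ["chair", "sofa", "clock"],
    "reading_nook": ["chair", "lamp", "sofa"],
    "misc": ["chair", "cupcake"],
}
counts = co_save_counts(boards)
print(related_pins("chair", counts))
```

In this toy example, the one-off pairing of a chair with a cupcake is filtered out, while pairings repeated across boards survive, which is the sense in which noisy individual associations can add up to a robust signal.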

Additionally, the majority of Pinterest content is high-quality product photography, which is more appealing for visual retrieval products.

Pinterest : A visual discovery engine with 100 billion ideas saved by 175 million people around the world

– What are the visual challenges you face at Pinterest (for example, low-quality images, repetition, spam)?

Since Pinterest is one of the most image-heavy services today, you can expect many visual challenges. For example, some of our initial use cases for computer vision were to automatically classify spammy images and detect near-duplicate images. Near-dupe removal is especially needed for popular images that can have hundreds of thousands of variants (crops, watermarks, resizes, etc.).
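The interview does not say how Pinterest detects near duplicates. As one illustration of the general idea, the sketch below uses a simple average hash: each pixel becomes one bit depending on whether it is brighter than the image mean, and two images count as near duplicates when their bit patterns differ in only a few positions. It assumes images have already been converted to grayscale and downsampled to the same small grid. The function names, the `max_distance` threshold, and the toy pixel grids are hypothetical, and a hash this simple would not catch variants such as crops.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming(h1, h2):
    """Number of positions where two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def is_near_duplicate(p1, p2, max_distance=2):
    """Treat two images as near duplicates if their hashes differ in few bits."""
    return hamming(average_hash(p1), average_hash(p2)) <= max_distance


# Hypothetical 4x4 grayscale grids (0 = black, 255 = white).
original = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]
# Same picture, uniformly brightened by 20.
brightened = [[value + 20 for value in row] for row in original]
# A different picture: bright left half, dark right half.
different = [[200, 200, 10, 10] for _ in range(4)]

print(is_near_duplicate(original, brightened))
print(is_near_duplicate(original, different))
```

Because the hash compares each pixel to the image's own mean, a uniform brightness change leaves it untouched, while a genuinely different layout flips many bits.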

One area we are investing in is improving the retrieval precision of our visual search system. This entails many challenges: how to train fine-grained image embeddings, how to deal with domain adaptation from low quality camera images to our Pinterest image corpus more skewed to higher quality professional photos, how to scale object/objectness detection to billions of images, and other potential problems such as training on-device image classification models for real-time visual annotations.
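As a rough illustration of what retrieval precision means for an embedding-based search, the sketch below ranks a tiny index by cosine similarity to a query embedding and computes precision at k, the fraction of the top k results that are relevant. This is a brute-force toy, not Pinterest's system: a corpus of billions of images would need approximate nearest-neighbor search instead. The three-dimensional embeddings, the names, and the relevance labels are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, index, k):
    """Return the names of the k indexed images most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


def precision_at_k(results, relevant):
    """Fraction of returned results that are in the relevant set."""
    return sum(1 for name in results if name in relevant) / len(results)


# Hypothetical embeddings; a real model would produce hundreds of dimensions.
index = {
    "armchair": [0.9, 0.1, 0.0],
    "stool": [0.8, 0.3, 0.1],
    "sofa": [0.6, 0.6, 0.1],
    "cupcake": [0.0, 0.1, 0.9],
}
query = [1.0, 0.1, 0.0]          # embedding of a camera photo of a chair
relevant = {"armchair", "stool"}  # hand-labeled relevant results

top2 = search(query, index, k=2)
top3 = search(query, index, k=3)
print(top2, precision_at_k(top2, relevant))
print(top3, precision_at_k(top3, relevant))
```

Finer-grained embeddings would, in this framing, push relevant items higher in the ranking so that precision holds up as k grows.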

– When working on visual search solutions, what is your primary target (engagement levels, engagement time, click-through, increased Pins, increased shopping)?

When engineers work on mature products such as Related Pins at Pinterest, the primary goal is to improve engagement rate. We measure engagement rate mainly through looking at click-through and save metrics.

For newer products such as visual search and Lens, although we do consider engagement rate, we focus more on product adoption and user retention. We continue to evolve these products both through visual search quality and the frontend experience as we learn more about how people use these products.

– Between Related Pins, visual search, and Lens, which one has the most traction with your users, and why?

Currently, Related Pins is the highest engaged product, accounting for nearly a third of all engagement on Pinterest. Related Pins is optimized for content within the Pinterest image corpus by heavily utilizing the pin-board associations through collaborative filtering to generate recommendations. Because the input to Related Pins is a Pin, we have a rich set of metadata curated from our users along with our visual signals, allowing us to generate high-quality recommendations.

In contrast, Lens and visual search are products focused on new images (camera snaps, parts of existing images in Pinterest, images on third party websites). Because the image is not in the Pinterest ecosystem, we don’t know anything about the given image beyond our visual signals, making it technically challenging to return high-quality recommendations in real-time.

– Are there any visual search innovations you see at other companies, or in academia, that inspire you? If so, which ones and why?

There’s a lot of great work being done in other companies. One company that stands out is Aipoly Vision, which uses computer vision to build an app that helps the blind or visually impaired understand their surroundings. What is interesting about the app is that the computer vision is done on-device, including the deep learning models, instead of on a backend server. It’s really cool and inspiring to think about the augmented reality possibilities from moving more computer vision computation on-device and creating really low-latency feedback between the real world and the product.

– At last year’s LDV Vision Summit, a lot of the conversations were about visual recognition going beyond understanding content to understanding context. Is that an approach you are taking at Pinterest?

Definitely. One example of this is object search in Lens. When a Lens user takes a picture of a chair, we not only return visually similar objects and annotations related to the chair, we also show entire rooms containing the chair to give our users decoration ideas.

– What do you expect from this year’s LDV Vision Summit and why?

I expect a greater emphasis on on-device deep learning. With progress from academic research and new open source tooling to shrink deep learning models, the barrier to on-device models is falling. With the advantages of scale and low latency, on-device models seem game-changing and I’m excited to see how others think about and use the technology.

– What would you like to see Pinterest create that technology cannot yet deliver?

I would like to see our visual search products like Lens become a core component of people’s everyday lives. When I look at past attempts at visual search, I think the main barrier to the adoption of such technology is really the quality of the results. When seeing how visual search has progressed over the last four years, I am very encouraged. Visual recognition technology is improving year by year and visual search results are getting better and better.

I believe Pinterest is one of the best-positioned companies to tackle this problem since beyond accurate visual recognition is the problem of content — what to show a user after understanding the image. With 100+ billion Pins, Pinterest is one of a few companies that has such compelling content as well as context and technology to help people discover ideas they can use every day.

=> To learn more, please join Andrew as he takes the stage for his keynote during the upcoming 2017 LDV Vision Summit.

[Kaptur is a proud media sponsor of the LDV Vision Summit.]

Photo by mkhmarketing

Author: Paul Melcher

Paul Melcher is a highly influential and visionary leader in visual tech, with 20+ years of experience in licensing, tech innovation, and entrepreneurship. He is the Managing Director of MelcherSystem and has held executive roles at Corbis, Stipple, and more. Melcher received a Digital Media Licensing Association Award, is a board member of Plus Coalition, Clippn, and Anthology, and has been named among the “100 most influential individuals in American photography.”
