After a day listening to some of the leading figures of the photo/video tech world, here are 3 points that quickly emerged:

New York, New York June 4, 2014. LDV Vision Summit - Image ©Dan Taylor/Heisenberg Media -
LDV Vision Summit in New York, New York June 4, 2014. ©Dan Taylor/Heisenberg Media

Computer vision :

Or more specifically, object recognition, is making huge progress. Already we’ve seen facial recognition going mainstream in many apps, Facebook, included. But the harder task of recognizing everything in a photo is getting more efficient rapidly.

Companies like Clarifai or CamFind have impressive algorithms that have brought the success rate at impressive numbers. Unfortunately the last mile might be the hardest to break. Where it gets tricky is when the images are not so clean, have lots of noise and when object are very similar to each other.

Humans have no problem recognising objects because, on top of doing a very fast elimination and matching process, they approach every images with a high level of expectancy. They already know, thanks to context and environment, what they expect to see. For example, if the image is in article about the war in Afghanistan, there is a very good chance objects in the image will be war related ( soldiers, tanks, explosion, destructions, dust..) and not gardening related ( wheelbarrow, green grass, deck, ..). Humans know that and are able to process and filter out option very quickly. Computers, for now, approach each image the same way, from scratch with no pre filtering. One of the reason is that it has to do a complete data analysis of every group of pixel in the image anyway and match it to existing information. A ground up approach. Having pre filtering based might help in the matching process but need much more processing power.

As well, lighting and similarities are complex issues to solve. Lighting, because, for a computer, it can completely change the aspect of an object and similarities, because well, that is even hard for humans. While we can recognise a pair of jeans immediately, we have a much harder time telling if it’s a Levis or Wrangler. And that brings us to our second point.

Clarifai example of auto tagging
Clarifai examples of auto tagging


Marching to the beat of the brands

Brands will ( or should we say are) the driving force being the computer vision evolution and progress. While they might know it yet, all this massive effort to teach computers to see and understand photos as well as we do is driven by the massive revenue potential sitting untouched in the billions of images uploaded every day. For now, we are barely able to sell products and services around images, like Facebook, Twitter, Google , Yahoo , Instagram and Pinterest currently do. The egg to crack is actually helping people buy the products that they see in the photos being shared. A task that all the companies already mentioned above, plus retailers like Amazon, Target, Walmart are working hard to get resolved. Thus, when you leave Academia, every single effort in computer vision and recognition is made to satisfy brands’ needs.

A new file format

Photos are too short and videos are too long. Photos say too little and videos too much. In an age where we can link data so quickly and easily, it is amazing to see that photos are still the online equivalent of their printed version : A flat 2d representation of the offline world with a little editorial text around it. Photos should not only be able to carry and grab all related data automatically ( GPS location, identities, social media, Wikipedia entries, etc) when published, they should be able to reveal emotions, energy, sounds. While companies have been working on trying to replace the jpg compression format, none have tackle the issue of a brand new format that would allow the conversion and display of much more than light. Companies like Lytro with light field or Seene with 3 D are already offering new ways to capture photographs but have to rely on  players to be displayed. Cinemagraphs, Gifs, photos with sound clips also offer new ways to display photos but demand more time to be assembled.  Finally Thinglink or Kiosked allow for the seamless integration of interactive metadata directly in the photos. None yet, offer the possibility to have a photo, once uploaded, to combine all these extraordinary features as well as automatically fetch all associated metadata and display it in a non intrusive manner. An industry wide effort to redefine the next generation of photographs would be welcomed.

A Cinemagraph by Flixel 

We will be talking more about the LDV Vision Summit in the next coming days as we crunch the huge amount of captivating information we heard . We didn’t want to leave you without some key  thoughts for your  Week End.


Author: Paul Melcher

Paul Melcher is a highly influential and visionary leader in visual tech, with 20+ years of experience in licensing, tech innovation, and entrepreneurship. He is the Managing Director of MelcherSystem and has held executive roles at Corbis, Stipple, and more. Melcher received a Digital Media Licensing Association Award and is a board member of Plus Coalition, Clippn, and Anthology, and has been named among the “100 most influential individuals in American photography”


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.