After a day listening to some of the leading figures of the photo/video tech world, here are 3 points that quickly emerged:
Computer vision :
Or more specifically, object recognition, is making huge progress. Already we’ve seen facial recognition going mainstream in many apps, Facebook, included. But the harder task of recognizing everything in a photo is getting more efficient rapidly.
Companies like Clarifai or CamFind have impressive algorithms that have brought the success rate at impressive numbers. Unfortunately the last mile might be the hardest to break. Where it gets tricky is when the images are not so clean, have lots of noise and when object are very similar to each other.
Humans have no problem recognising objects because, on top of doing a very fast elimination and matching process, they approach every images with a high level of expectancy. They already know, thanks to context and environment, what they expect to see. For example, if the image is in article about the war in Afghanistan, there is a very good chance objects in the image will be war related ( soldiers, tanks, explosion, destructions, dust..) and not gardening related ( wheelbarrow, green grass, deck, ..). Humans know that and are able to process and filter out option very quickly. Computers, for now, approach each image the same way, from scratch with no pre filtering. One of the reason is that it has to do a complete data analysis of every group of pixel in the image anyway and match it to existing information. A ground up approach. Having pre filtering based might help in the matching process but need much more processing power.
As well, lighting and similarities are complex issues to solve. Lighting, because, for a computer, it can completely change the aspect of an object and similarities, because well, that is even hard for humans. While we can recognise a pair of jeans immediately, we have a much harder time telling if it’s a Levis or Wrangler. And that brings us to our second point.
Marching to the beat of the brands
Brands will ( or should we say are) the driving force being the computer vision evolution and progress. While they might know it yet, all this massive effort to teach computers to see and understand photos as well as we do is driven by the massive revenue potential sitting untouched in the billions of images uploaded every day. For now, we are barely able to sell products and services around images, like Facebook, Twitter, Google , Yahoo , Instagram and Pinterest currently do. The egg to crack is actually helping people buy the products that they see in the photos being shared. A task that all the companies already mentioned above, plus retailers like Amazon, Target, Walmart are working hard to get resolved. Thus, when you leave Academia, every single effort in computer vision and recognition is made to satisfy brands’ needs.
A new file format
Photos are too short and videos are too long. Photos say too little and videos too much. In an age where we can link data so quickly and easily, it is amazing to see that photos are still the online equivalent of their printed version : A flat 2d representation of the offline world with a little editorial text around it. Photos should not only be able to carry and grab all related data automatically ( GPS location, identities, social media, Wikipedia entries, etc) when published, they should be able to reveal emotions, energy, sounds. While companies have been working on trying to replace the jpg compression format, none have tackle the issue of a brand new format that would allow the conversion and display of much more than light. Companies like Lytro with light field or Seene with 3 D are already offering new ways to capture photographs but have to rely on players to be displayed. Cinemagraphs, Gifs, photos with sound clips also offer new ways to display photos but demand more time to be assembled. Finally Thinglink or Kiosked allow for the seamless integration of interactive metadata directly in the photos. None yet, offer the possibility to have a photo, once uploaded, to combine all these extraordinary features as well as automatically fetch all associated metadata and display it in a non intrusive manner. An industry wide effort to redefine the next generation of photographs would be welcomed.
A Cinemagraph by Flixel
We will be talking more about the LDV Vision Summit in the next coming days as we crunch the huge amount of captivating information we heard . We didn’t want to leave you without some key thoughts for your Week End.
Author: Paul Melcher
Paul Melcher is the founder of Kaptur and Managing Director of Melcher System, a consultancy for visual technology firms. He is an entrepreneur, advisor, and consultant with a rich background in visual tech, content licensing, business strategy, and technology with more than 20 years experience in developing world-renowned photo-based companies with already two successful exits.