The fundamental goal of visual tech should be to reduce friction in how we interact with the world. Success will be measured by how easily we can move from one function to another with minimal active input.
A few years ago, while I was working on the development of a SaaS product, the perennial issue of international compliance came up. With hundreds of possible input lines, all necessary for customization, the question was who would translate them and how many languages we should support. While English is understood in most countries that actively use the web, it is still seen as foreign in any country outside Canada, the USA, the UK, New Zealand and Australia. Internal discussions raged on. This is where visuals come in: at any Olympic event, instead of translating every sign into thousands of different languages, the organizers achieve the same result with a set of pictograms. To indicate directions to the various parts of the Olympic Village, every signpost features a design understandable by every culture. If it can be done in real life, why not on the web?
A security barrier
Visual tech can carry this idea much further. We have already seen password schemes replaced by a series of gestures on photographs, which are much harder to guess because they are not limited to the combinations of a 4-digit or 4-letter code. At the same time, they are much easier to remember because they are visual (and personalized). Doorbells based on facial recognition offer keyless entry, erasing the physical contact that used to define the age-old relationship between user and interface. Identities, whether verified by full-face or iris recognition, are also much better protected by visual tech than by any combination of social security numbers, PINs or “secret questions”. And this is just the beginning.
Losing the key
Since the invention of written language, the relationship between content and its users has been text-based. Just as you need a key to open a door, you need text to access the information you want. Storing, finding and retrieving information has always meant translating an idea, concept or thought into text and then looking for a match. This translation step, from thought to text, has forced us to alter, if not diminish, our thoughts to make them fit the highly regimented structure of text. That is about to change. As we analyze visual content and transform it into data, we are getting closer to being able to search and retrieve directly from visuals to visuals. In fact, image-matching technology already does this: once an image has been fingerprinted, the technology can scour the internet for similar fingerprints. The next leap will come when it no longer matches pixel for pixel but content for content. For example, feed the search engine an image of a ball, a tree and a beach, and the system retrieves all images that contain a tree, a ball and a beach, regardless of their position in the frame. Not far enough? Add semantics, and the search engine will analyze the image for its meaning and find other images with the same meaning, regardless of their content.
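To make the two kinds of matching a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular search engine works. The fingerprint is a simple "average hash" (one bit per pixel, set when the pixel is brighter than the image's mean), chosen here as an assumption because it is easy to follow. The tiny 2x2 images, the file names and the label sets are all hypothetical, and the labels are assumed to come from some object detector that is not shown.

```python
from typing import Dict, List, Set

Grid = List[List[int]]  # tiny grayscale image, pixel values 0-255


def average_hash(pixels: Grid) -> int:
    """Fingerprint an image: one bit per pixel, set when brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def find_by_content(query: Set[str], index: Dict[str, Set[str]]) -> List[str]:
    """Return images whose labels include every query label, wherever they sit in the frame."""
    return sorted(name for name, labels in index.items() if query <= labels)


# Hypothetical images: the second is a slightly brightened copy of the first.
original = [[10, 200], [220, 30]]
brightened = [[20, 210], [230, 40]]
different = [[200, 10], [30, 220]]

# Hypothetical label index, standing in for the output of an object detector.
label_index = {
    "holiday.jpg": {"ball", "tree", "beach", "dog"},
    "garden.jpg": {"tree", "ball"},
    "shore.jpg": {"beach", "tree", "ball"},
}

near_duplicate = hamming_distance(average_hash(original), average_hash(brightened))
unrelated = hamming_distance(average_hash(original), average_hash(different))
content_matches = find_by_content({"ball", "tree", "beach"}, label_index)
```

The fingerprint comparison tolerates the brightness change but knows nothing about what is in the picture. The label comparison is the opposite: it ignores pixels and position entirely and only asks whether the same things are present.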
Because of the complexity of human understanding, and because we will be able to escape the limitations of text, we will be able to formulate, with data alone, meanings that we could never define with words. This will open a new world of understanding, relationships and thus communication that we can only dream about today.
A wordless world
What does this mean in practice? E-commerce, for example, would change: instead of typing a search for a long red skirt and getting a bunch of matches across numerous retailer sites, a user will upload a photo of the dress they saw another person wearing. Not only will the engine quickly identify the color and length, but also the fabric, the flow, the light reflection, the emotion delivered and the intended impact, and, as long as it has past data on you, it will understand why you like that dress in that picture. The result will be an exact match for what you are looking for (not what you were looking at), which could even be a different color and length. Because with fashion, it is not only how you look but how you feel and what you want to express about yourself that matters. A semantic visual search will be able to deliver exactly that. No more disappointments.
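One common way to approximate this kind of search today is to reduce each image to a vector of numeric features and rank the catalog by similarity to the query photo. The Python sketch below assumes that approach and is only an illustration: the three feature dimensions (flow, sheen, formality), the item names and every number are hypothetical, and a real system would learn its features from data rather than use hand-picked values.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Similarity of two feature vectors, close to 1.0 when they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_catalog(
    query: List[float], catalog: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Sort catalog items from most to least similar to the query vector."""
    scored = [(name, cosine_similarity(query, vector)) for name, vector in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical features extracted from the uploaded photo: (flow, sheen, formality).
query_photo = [0.9, 0.8, 0.1]

# Hypothetical catalog. Color and length are deliberately not part of the vector.
catalog = {
    "blue_midi_dress": [0.85, 0.75, 0.15],
    "stiff_long_red_skirt": [0.1, 0.2, 0.9],
}

ranking = rank_catalog(query_photo, catalog)
best_match = ranking[0][0]
```

In this toy data the best match is the blue dress, not the red skirt that a text search for the visible attributes would have returned: the ranking follows how the garment moves and feels rather than its color and length.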
The consequences for our everyday lives will be huge. Rather than relying on textual descriptions alone, we will be able to rely on emotional responses as well. From picking travel destinations and restaurants to dating (imagine comparing two sets of personal pictures to match people based on what and how they photograph their lives) to house hunting, breaking down the walls of textual description will lead to a universe of more accurate, deeper connections. And the more we photograph our world, the better visual search will become at providing us with perfectly accurate results.
For visual tech to deliver on its promise, there is a long way to go. First, much as we did with text-based knowledge, we will have to index every single visual ever taken. We will also need to create a new kind of language that allows photos to be connected to other photos without any text description. We will need to map the relationships between every piece of visual content and understand the reasons behind them. It is a huge undertaking that not even Google can handle today. However, it is not too early to take the first steps.
Author: Paul Melcher
Paul Melcher is the founder of Kaptur and Managing Director of Melcher System, a consultancy for visual technology firms. He is an entrepreneur, advisor, and consultant with a rich background in visual tech, content licensing, business strategy, and technology, with more than 20 years of experience in developing world-renowned photo-based companies and two successful exits.