The uncomfortable relationship between words and images

Paul Melcher

4 months ago

The evolution from cave paintings to written text took millennia. The return journey might be just around the corner.

Human communication began with images. Long before the first written word, our ancestors painted scenes on cave walls, not as decoration, but as visual communication systems that conveyed complex ideas about hunting, social structures, and spiritual beliefs. These early images were direct, experiential, and universally comprehensible within their communities.

As civilizations developed, the limitations of pure visual communication became apparent. How do you paint justice? How do you draw the concept of tomorrow, or the feeling of regret? The invisible world, abstract concepts, temporal relationships, and internal states demanded a new tool. Written language emerged as humanity’s solution, allowing us to discuss the unseen and the conceptual with unprecedented precision.

This evolutionary leap came with a trade-off. While written language unlocked abstract thinking and complex reasoning, it pushed visual communication into a supporting role. Images became icons, then hieroglyphs, then characters, until they took a secondary role as illustrations to text. Over centuries, we developed increasingly sophisticated textual vocabularies while our visual literacy remained relatively primitive.

Lascaux cave paintings. By EU – Own work, Public Domain

The Stock Photography Trap

Today’s generative AI reflects this historical limitation. When we ask current AI to visualize abstract concepts like “innovation” or “sustainability,” we get predictable stock photography clichés: lightbulbs for ideas, handshakes for partnership, green leaves for environmental consciousness. These artificial symbols, created for commercial purposes rather than genuine communication, represent the shallow end of human visual expression.

This isn’t a technical failing of AI; it’s a training data problem. Current generative AI has learned from the visual detritus of commercial culture rather than from humanity’s rich tradition of natural visual communication. It replicates the visual shorthand of advertising and stock photography rather than tapping into the sophisticated nonverbal communication system that humans use every day.

Consider this: by one often-cited estimate, 65% of human-to-human communication is nonverbal. Every conversation contains layers of gestural information, spatial relationships, facial expressions, and environmental cues that convey meaning with remarkable precision. We naturally arrange objects to explain concepts, use our hands to describe relationships, and modify our facial expressions to communicate nuanced emotional states. This represents a vast, sophisticated visual language that we’ve barely begun to systematize.

In our relationship with the world, however, this changes dramatically: the world “speaks to us visually,” allowing us to make decisions based on what we see. We constantly receive nonverbal cues about our environment, which we process, interpret, and read like a language in order to understand our surroundings. Some of these cues are captured in photography, which we encounter in travel or news magazines and which extends that visual reading beyond our immediate reach.

A New Training Paradigm

The solution isn’t better algorithms; it’s better data. Instead of training AI on stock photography and commercial imagery, what if we trained it on how humans actually communicate visually? What if we captured the rich dataset of everyday human visual expression: the gestures teachers use to explain complex concepts, the spatial arrangements people create when telling stories, the environmental modifications humans make to convey abstract ideas? What about how we cross a street, or how we plant and manage a garden?

This approach would fundamentally shift AI from reproducing artificial visual symbols to understanding authentic human visual communication patterns. Rather than generating clichéd metaphors, AI could learn to recognize and replicate the subtle visual cues that humans naturally use to understand and communicate abstract concepts. This is much the way we are teaching cars to understand their environment using only the visual cues captured by lidar.

The implications extend far beyond better stock imagery. We’re potentially looking at the development of a new visual language, one sophisticated enough to handle the abstract concepts that initially drove us from images to text, but grounded in humanity’s innate visual communication abilities.

We train cars to learn from what they see. Lidar image courtesy of Velodyne

The Return to Visual-First Communication

This evolution could enable a historic reversal: a return to image-first communication, but at a level of sophistication that rivals or surpasses textual communication. Instead of reading lengthy explanations of complex concepts, we might communicate through generated images that convey meaning with the immediacy of pointing to a mammoth on a cave wall, but with the conceptual depth that took millennia to develop through written language.

Imagine generated visuals that can communicate “the tension between individual freedom and collective responsibility” or “the emotional experience of technological displacement” with the clarity and speed that images provide, but without sacrificing the nuance that such concepts require. This isn’t about replacing text, but about expanding our communicative repertoire to include visual expression of abstract concepts. It would be a step beyond what emojis do today.

The potential applications span industries: educational content that makes complex concepts immediately graspable, cross-cultural communication that transcends linguistic barriers, therapeutic tools that help people express difficult emotional states, and business communication that conveys strategic concepts with unprecedented clarity.

Four examples generated by an AI from the prompt “The tension between individual freedom and collective responsibility”. We can certainly do better.

The Cultural Challenge

The primary obstacle isn’t technical; it’s cultural. Developing this sophisticated visual language requires moving beyond the visual clichés of commercial culture toward genuine visual literacy. This means training AI on authentic human visual communication patterns while simultaneously developing new frameworks for visual meaning-making that can work across cultural boundaries.

The challenge resembles the early development of written language: communities must develop a shared understanding of visual meaning, create systems for teaching visual literacy, and establish conventions for visual grammar and syntax. Unlike the millennia it took for written language to develop, however, we now have the tools to accelerate this process dramatically.

Implications for Creative Industries

For photographers, graphic designers, and visual communicators, this represents both disruption and unprecedented opportunity. The current model of visual communication, built on commercial symbolism and cultural clichés, will likely become obsolete. In its place, we’ll need visual professionals who understand the deeper patterns of human visual communication and can work with AI systems to develop new forms of visual expression.

The creative challenge shifts from producing visually appealing imagery to developing visually meaningful communication. Success will depend not on mastering existing visual conventions, but on helping to create new ones that leverage both human visual intuition and AI’s generative capabilities.

Are emojis a first step? Photo by Planet Volumes on Unsplash

A Communication Revolution

We stand at a unique historical moment: possessing the technical capability to return to visual-first communication while maintaining the conceptual sophistication that written language enabled. The question isn’t whether this transformation will occur, but how quickly we can develop the visual literacy and cultural frameworks necessary to make it work.

The implications extend beyond communication technology to fundamental questions about human expression, cultural transmission, and the future of literacy itself. As we move toward this visual future, we’re not just developing better AI; we’re potentially unlocking a more intuitive, immediate, and universal form of human communication that could reshape how we form ideas, share them, build understanding, and connect.

The cave painters of Lascaux could never have imagined that their visual communication system would eventually give way to written text. We’re now approaching the moment when their medium, enhanced by artificial intelligence and grounded in authentic human visual expression, returns to reclaim its place at the center of human communication.

Main photo by Murat Onder on Unsplash

Author: Paul Melcher

Paul Melcher is a highly influential and visionary leader in visual tech, with 20+ years of experience in licensing, tech innovation, and entrepreneurship. He is the Managing Director of MelcherSystem and has held executive roles at Corbis, Gamma Press, Stipple, and more. Melcher received a Digital Media Licensing Association Award and has been named among the “100 most influential individuals in American photography.”
