Cloudflare acquired Human Native on January 15, 2026. The London startup had built a marketplace connecting content owners with AI companies seeking training data. The deal completes an 18-month infrastructure build, but the strategic question remains: is Cloudflare operating a toll booth that charges for passage, a warehouse that aggregates content, or attempting both?
Two Models, Different Economics
The dataset licensing market is split into two distinct approaches based on what AI companies actually need.
The warehouse model aggregates content for training. AI companies license bulk access to structured, vetted datasets. The warehouse handles rights verification, content normalization, metadata enrichment, and quality vetting. Content rights holders contribute their archives to the warehouse and receive royalties when AI companies license that aggregated collection for model training. Companies like Protege (healthcare AI training data) and Troveo (video datasets with 5,000+ licensors and 3 million hours of footage) have gained ground in these categories, licensing for millions of dollars annually. The warehouse model is most useful for AI companies that need large volumes of content for foundation model training, and for those that want to fine-tune existing models to correct biases or fill data gaps.
Dow Jones Factiva operates the clearest publisher-warehouse model, aggregating 8,000+ publishers into a unified database. Enterprise clients license access to the entire warehouse rather than negotiating individual publisher contracts. Typical of the warehouse model, Factiva handles complexity, and publishers receive compensation when their content gets used.
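The royalty mechanics can be sketched in a few lines. This is a minimal illustration only: the pro-rata rule (payout proportional to assets contributed), the 30% warehouse cut, the license fee, and the contributor names are all hypothetical assumptions, not terms reported for Factiva, Protege, Troveo, or any other marketplace.

```python
from decimal import Decimal, ROUND_DOWN


def split_royalties(license_fee, contributions, warehouse_cut=Decimal("0.30")):
    """Split one license fee among contributors in proportion to assets supplied.

    Hypothetical model: the pro-rata rule and the default 30% warehouse cut
    are illustrative assumptions, not real marketplace terms.
    """
    total_assets = sum(contributions.values())
    if total_assets <= 0:
        raise ValueError("at least one contributed asset is required")
    # What is left for rights holders after the warehouse takes its share.
    pool = Decimal(license_fee) * (Decimal(1) - warehouse_cut)
    cent = Decimal("0.01")
    return {
        owner: (pool * count / total_assets).quantize(cent, rounding=ROUND_DOWN)
        for owner, count in contributions.items()
    }


# Hypothetical deal: one $100,000 license over a 1,000-asset collection.
payouts = split_royalties(
    Decimal("100000"),
    {"photo_agency": 600, "regional_news": 300, "indie_creator": 100},
)
for owner, amount in payouts.items():
    print(owner, amount)
```

Under these assumed numbers, the smallest contributor still collects a share of a deal it could not have closed alone, which is the aggregation argument in miniature.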

The toll booth model charges for passage during inference and RAG (Retrieval-Augmented Generation) applications. AI systems access content in real time for citations, grounding, and query responses. The toll booth meters those transactions and processes payments, but it doesn’t aggregate or store content, nor does it manage complexity. Publishers maintain their own servers and content. When an AI company queries content, the request routes through the toll booth infrastructure, gets metered, and revenue flows to the content owner.
TollBit operates a pure toll booth with 1,400+ publishers, including TIME and Penske Media, that use the platform for transaction infrastructure without content aggregation. When an AI assistant cites a TIME article in response to a user query, TollBit meters that access and TIME receives payment. No training, no bulk licensing, just per-query transactions.
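To make the per-query flow concrete, here is a minimal sketch of a metering ledger. The class, the publisher and client names, and the prices are hypothetical and chosen only for illustration; this is not TollBit's implementation or pricing. Note that the toll booth stores counts and amounts, never the content itself.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class TollBooth:
    # Hypothetical per-request prices set by each publisher, in dollars.
    prices: dict
    # Revenue owed to each publisher; Decimal() starts every balance at 0.
    ledger: dict = field(default_factory=lambda: defaultdict(Decimal))
    # Access counts per (AI client, publisher) pair.
    hits: dict = field(default_factory=lambda: defaultdict(int))

    def meter(self, publisher, ai_client):
        """Record one metered access and return the amount charged."""
        price = self.prices[publisher]
        self.ledger[publisher] += price
        self.hits[(ai_client, publisher)] += 1
        return price


booth = TollBooth(prices={
    "publisher-a.example": Decimal("0.002"),
    "publisher-b.example": Decimal("0.010"),
})
for _ in range(3):
    booth.meter("publisher-a.example", "assistant-a")
booth.meter("publisher-b.example", "assistant-a")

for publisher, owed in booth.ledger.items():
    print(publisher, owed)
```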
The distinction matters because training is a one-time event while RAG creates recurring access. Training revenue concentrates at the warehouse. Inference revenue flows through the toll booth.
Cloudflare’s Hybrid Bet
Cloudflare built the toll booth infrastructure first. AI Crawl Control showed which bots accessed sites. Pay Per Crawl put the HTTP 402 Payment Required status code to work, charging crawlers per request as they pass through the CDN. AI Index created structured discovery and pub/sub subscriptions for real-time content updates.
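The 402 exchange is simple enough to sketch. The handler below is a hedged illustration, not Cloudflare's implementation: the header names, the price, and the function itself are hypothetical, and a real deployment would follow whatever the payment platform documents. The only standard piece is the status code, 402 Payment Required.

```python
from decimal import Decimal
from http import HTTPStatus

# Hypothetical header names chosen for this sketch only.
PRICE_HEADER = "x-crawl-price"
MAX_PRICE_HEADER = "x-crawl-max-price"
CHARGED_HEADER = "x-crawl-charged"


def respond_to_crawler(request_headers, price):
    """Return (status, response_headers) for one crawler request."""
    offered = request_headers.get(MAX_PRICE_HEADER)
    if offered is not None and Decimal(offered) >= price:
        # The crawler agreed to pay at least the asking price: serve and bill.
        return HTTPStatus.OK, {CHARGED_HEADER: str(price)}
    # No offer, or an offer below the asking price: quote the price instead.
    return HTTPStatus.PAYMENT_REQUIRED, {PRICE_HEADER: str(price)}


price = Decimal("0.01")
quote_status, quote_headers = respond_to_crawler({}, price)
paid_status, paid_headers = respond_to_crawler({MAX_PRICE_HEADER: "0.05"}, price)
print(int(quote_status), quote_headers)
print(int(paid_status), paid_headers)
```

In this sketch the first request carries no payment offer and receives a price quote with a 402; the second declares a maximum it is willing to pay, so the content is served and the charge is recorded.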
Since Cloudflare already handles approximately 20% of global internet traffic through 41 million websites, it has turned into a significant gateway for AI companies and content owners. When AI systems access those sites, they pass through Cloudflare infrastructure. Pay Per Crawl charges for passage. Content stays on publisher servers. Cloudflare monetizes the transaction.
Cloudflare must have realized that most of its 41 million websites do not host frequently updated content and thus offer little appeal for RAG. They do, however, host content that is valuable for foundation model training and fine-tuning. Enter Human Native, which adds warehouse capability. The startup built infrastructure for transforming raw media into AI-ready datasets: structuring, metadata normalization, rights verification, and quality documentation.
Why the Distinction Matters
Warehouses require trust. Enterprises licensing content need confidence that rights have been verified, data have been structured correctly, and quality standards have been maintained. Factiva’s credibility comes from its owner, Dow Jones, which is itself a publisher and owns the Wall Street Journal. It won’t offer publishers deals it wouldn’t sign itself. The same goes for Protege, which has vetted high-quality medical imagery, and Troveo, which, like Human Native, has brought high-quality metadata to otherwise chaotic content.
Cloudflare operates infrastructure; it does not create content. It lacks the editorial credibility that validates warehouse operations. Human Native brought expertise and a network of trust relationships: a team from DeepMind, Google, Figma, and Bloomberg that understands how to structure datasets for the needs of AI applications.
Toll booths scale through network position. Cloudflare already handles roughly 20% of global internet traffic. Adding transaction infrastructure to the existing flow is straightforward. Warehouses scale through aggregation and trust. That requires a reputation built over time and deep industry relationships.
The Data Quality Argument
The case for licensed dataset content hinges on quality driving performance. Algorithmic improvements are slowing, and training ever-larger models with more parameters shows diminishing returns. When architectural innovation stalls, data quality becomes the differentiator.
Garbage in, garbage out. AI trained on scraped open web content inherits the web’s inaccuracies, biases, and contradictions. Models trained on vetted datasets from trusted sources produce more reliable outputs, and better data sometimes improves output quality more than any change to the code. That reliability justifies paying premiums for toll booth-protected transactions and curated warehouse access.
Autonomous agents will only intensify this. As AI systems make independent decisions on humans’ behalf, they will need to be more reliable, and thus need training data that won’t lead them to hallucinate. An autonomous procurement agent negotiating contracts can’t rely on models trained on unverified internet content.
What This Means for Visual Content
The toll booth model requires that owners tightly control their distribution. Content must live on secured domains that they operate and that can be configured for payment infrastructure. But for work scattered across Instagram, portfolio platforms, client websites, and publisher archives, toll booth infrastructure provides little to nothing. The irony is that the more successful content creators become, the more widely their content appears, and the less control they have over it.
The warehouse model aggregates content from owners who may not control distribution. A photo agency can license its archive into a marketplace even if individual images are distributed across dozens of publications. The warehouse handles rights verification and content structuring, and the agency receives royalties when its work gets licensed from the warehouse. This model particularly benefits smaller content owners. Aggregation creates advantages they can’t achieve alone: their content becomes part of larger licensing packages that AI companies actually want to buy, and they become discoverable through the warehouse’s marketplace infrastructure rather than getting lost in the noise.
The Unanswered Question
Cloudflare has positioned itself to capture transaction fees whether the market develops around warehouse aggregation or toll booth infrastructure.
But Human Native sold after 18 months rather than raising additional funding to scale warehouse operations. That timeline suggests either the warehouse market wasn’t materializing fast enough, or the founders recognized standalone warehouse operations couldn’t compete without infrastructure integration.
If enterprises demand ring-fenced warehouses of vetted content for training, Human Native’s dataset structuring becomes valuable. If the market develops around real-time RAG applications paying per query, Cloudflare’s CDN infrastructure and Pay Per Crawl become valuable. To cover both outcomes, Cloudflare built both.
But there’s a timeline problem. Gartner projects that synthetic data will completely overshadow real data in AI models by 2030, with synthetic data constituting more than 95% of data used for training AI models in images and videos. If that happens, warehouse operations aggregating human-created content for training lose their market. The data quality argument collapses when synthetic alternatives become good enough.
Toll booth infrastructure might survive longer: RAG applications need current information that can’t be pre-generated. But training revenue, which justified most current licensing deals, might evaporate.
The toll booth exists. The warehouse is under construction. Whether either generates meaningful revenue depends on proving that data quality justifies the premium, which is not something Cloudflare controls. The bet is that, within the sheer volume of sites Cloudflare represents, there is enough quality content to appeal to the marketplace. As in similar models, the split will likely be 80/20, with a select few capturing most of the revenue, and all of it must happen before synthetic alternatives make the question irrelevant. For now, Cloudflare is building both and waiting to see which model the market validates, and how long that market lasts.
Author: Paul Melcher
Paul Melcher is a highly influential and visionary leader in visual tech, with 20+ years of experience in licensing, tech innovation, and entrepreneurship. He is the Managing Director of MelcherSystem and has held executive roles at Corbis, Gamma Press, Stipple, and more. Melcher received a Digital Media Licensing Association Award and has been named among the “100 most influential individuals in American photography.”

