The data fueling AI is vanishing rapidly
The web data that underpins AI training is dwindling fast, posing challenges for the technology's future development and its applications across industries.
For years, developers of powerful artificial intelligence systems have relied on vast amounts of text, images, and videos scraped from the internet to train their models. Recently, however, many of the web sources most important for AI training have begun limiting access to their data.

A study by the Data Provenance Initiative, an M.I.T.-led research group, analyzed 14,000 web domains across three prominent AI training datasets (C4, RefinedWeb, and Dolma) and found a significant, fast-growing pattern of data-use restrictions. The researchers describe an "emerging crisis in consent," with publishers and online platforms increasingly blocking data harvesting: roughly 5% of all data in these datasets, and 25% of data from the highest-quality sources, is now restricted. These restrictions are typically implemented through the Robots Exclusion Protocol, a decades-old convention that lets website owners ask automated bots to stay off their pages via a robots.txt file.
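To make the mechanism concrete, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt body is illustrative: it blocks OpenAI's crawler, which identifies itself as "GPTBot", while allowing all other bots; the domain and path are invented for the example.

```python
# A sketch of the Robots Exclusion Protocol in action, using Python's
# standard-library parser. The robots.txt text below is illustrative:
# it blocks OpenAI's "GPTBot" crawler while allowing everyone else.

from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler runs this check before fetching a page. Nothing
# technically stops a bot that skips the check, which is why the
# protocol amounts to a request rather than an enforcement mechanism.
for agent in ("GPTBot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The protocol's weakness is visible in the code: the check runs entirely on the crawler's side, so compliance depends on the bot choosing to run it.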
The study also revealed that up to 45% of the data in one dataset, C4, had been restricted due to websites' terms of service.
"We're witnessing a rapid decrease in consent for data use across the internet, impacting not just AI firms but also researchers, academics, and non-commercial entities," said Shayne Longpre, the study's lead author, in an interview.
Data forms the core of modern generative AI systems, which rely on billions of examples of text, images, and videos. Much of this data is collected from public websites by researchers and compiled into extensive datasets, which are either freely available for use or supplemented with additional sources.
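As a concrete example of how researchers work with such compiled datasets, the sketch below streams a few records from C4, one of the three corpora in the study, via the Hugging Face `datasets` library. It assumes the public "allenai/c4" English release remains available; the field names match C4's published schema but vary across datasets.

```python
# Streaming a handful of records from the public C4 corpus with the
# Hugging Face `datasets` library. Streaming mode avoids downloading
# the full multi-terabyte dataset. Assumes network access and that
# the "allenai/c4" release is still hosted.

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    # Each C4 record pairs the scraped text with the URL it came from.
    print(record["url"])
    print(record["text"][:80], "...")
    if i == 2:
        break
```

That per-record source URL is also what allows audits like the Data Provenance Initiative's to map dataset contents back to individual domains and their current access policies.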
Training on such data enables generative AI tools like OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude to create text, code, images, and videos. The quality of their outputs typically improves with higher-quality data inputs.
In the past, AI developers could gather data relatively easily. However, the recent surge in generative AI has strained relations with data owners, many of whom are hesitant about their content being used for AI training without compensation or acknowledgment.
Amid increasing resistance, certain publishers have implemented paywalls or revised terms of service to restrict AI training data usage. Some have also blocked automated web crawlers used by companies like OpenAI, Anthropic, and Google.
Platforms such as Reddit and Stack Overflow now charge AI firms for data access, while legal actions, like The New York Times' lawsuit against OpenAI and Microsoft alleging copyright infringement, highlight growing tensions over data use.
To enhance their systems, companies like OpenAI, Google, and Meta have resorted to measures like transcribing YouTube videos and adjusting data policies. Recently, some AI firms have secured agreements with publishers such as The Associated Press and News Corp (owner of The Wall Street Journal) for ongoing data access. Nevertheless, widespread data restrictions pose a significant challenge to AI firms reliant on continuous access to high-quality data to maintain the efficacy of their models.
Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organization, expressed similar concerns.
"Major tech companies already possess vast amounts of data," she remarked. "Altering data licenses doesn't retroactively revoke their permissions, primarily affecting smaller startups and researchers."
AI firms argue that their use of public web data falls under fair use, but acquiring new data has become more challenging. Some AI executives fear encountering a "data wall" where all accessible training data from the public internet is exhausted, blocked by robots.txt files, or tied up in exclusive agreements.
To address this, some companies are exploring synthetic data—generated by AI systems themselves—to train models. However, doubts persist among researchers about whether current AI systems can produce sufficient high-quality synthetic data to replace human-generated data effectively.
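As a rough illustration of what such a pipeline involves, the sketch below wires together a placeholder generation function (hypothetical, standing in for whatever model API a given lab uses) and a deliberately trivial quality filter; real pipelines replace both with a trained model and far more elaborate filtering and deduplication.

```python
# Minimal sketch of a synthetic-data loop: a model generates candidate
# examples, a quality filter keeps the best, and survivors join the
# next training corpus. `sample_from_model` is a hypothetical stand-in
# for a real model call, so the sketch runs end to end without one.

import random

def sample_from_model(prompt: str) -> str:
    # Placeholder for a real generation call (e.g., a hosted LLM).
    return f"{prompt} ... synthetic continuation {random.random():.3f}"

def quality_filter(text: str) -> bool:
    # Real pipelines use classifiers, deduplication, and heuristics;
    # this sketch keeps only examples above a trivial length threshold.
    return len(text.split()) >= 5

def build_synthetic_corpus(prompts: list[str], per_prompt: int = 3) -> list[str]:
    corpus = []
    for prompt in prompts:
        for _ in range(per_prompt):
            candidate = sample_from_model(prompt)
            if quality_filter(candidate):
                corpus.append(candidate)
    return corpus

if __name__ == "__main__":
    seeds = ["Explain photosynthesis simply.", "Summarize the water cycle."]
    print(len(build_synthetic_corpus(seeds)), "synthetic examples kept")
```

The open question the researchers raise is whether the quality filter can be made good enough: if the generator's flaws pass through it, models trained on their own outputs can degrade rather than improve.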
Additionally, while publishers can try to restrict AI data scraping with robots.txt files, those requests are not legally enforceable: compliance is voluntary, making robots.txt something like a "no trespassing" sign for data.