YouTube creators shocked to discover Apple and others used their videos for AI training

YouTube creators are surprised to learn that Apple and other companies have used their videos to train AI models without their knowledge.

Jul 17, 2024 - 10:40

AI models developed by Apple, Salesforce, Anthropic, and other leading tech companies were trained on tens of thousands of YouTube videos without the creators' consent, potentially breaching YouTube's terms of service, according to a recent report from Proof News and Wired. These companies used "The Pile," a dataset created by the nonprofit EleutherAI. It was designed to help smaller entities compete with Big Tech, though larger firms have also adopted it.

The Pile includes a wide range of content, such as books and Wikipedia articles, and incorporates YouTube captions gathered via the YouTube captions API from 173,536 videos across more than 48,000 channels. The collection includes videos from popular creators like MrBeast, PewDiePie, and tech commentator Marques Brownlee. On X, Brownlee addressed Apple’s use of this dataset, noting the difficulty of assigning blame since Apple did not collect the data itself. He stated:

"Apple has sourced data for their AI from various companies. One of them scraped a significant amount of data/transcripts from YouTube videos, including mine. Technically, Apple avoids 'fault' here because they're not the ones doing the scraping, but this issue is going to persist for a long time."

The dataset includes channels from various mainstream and online media brands, featuring content created by Ars Technica and other Condé Nast publications like Wired and The New Yorker. Notably, one video in the dataset is a short film by Ars Technica that humorously claims to have been written by AI. Proof News' article notes that the dataset also contains videos featuring a parrot, so AI models are in effect imitating a parrot that imitates human speech, as well as other AIs that imitate humans.

As AI-generated content becomes more widespread online, it will be increasingly difficult to assemble datasets for training AI without including content already created by AI.

Much of this information isn’t new. The Pile has been widely referenced in AI discussions and has previously been used by tech companies for training purposes. It has also been cited in several lawsuits involving intellectual property claims against AI companies. The defendants, including OpenAI, argue that this type of data scraping falls under fair use, although the lawsuits remain unresolved. What is new is that Proof News developed a tool that allows users to search The Pile for specific videos or channels.

The report highlights how extensive the data collection is and how little control intellectual property owners have over the use of content they make available on the open web. However, this data was not necessarily used to train models that produce competing content for end users. For instance, Apple might have used the dataset for research purposes or to enhance text autocomplete features on its devices.

Reactions from creators

Proof News contacted several creators, as well as the companies that used the dataset, for comment. Most creators were taken aback that their content had been used in this manner, and those who responded were critical of EleutherAI and the companies involved. For instance, David Pakman from The David Pakman Show stated:

“No one approached me to ask for permission to use this... This is my livelihood, and I invest significant time, resources, and staff into creating this content. There’s certainly no lack of work.”

Julia Walsh, CEO of Complexly, which produces SciShow and other educational content by Hank and John Green, expressed:

“We are frustrated to discover that our carefully crafted educational content has been used without our consent.”

Additionally, there are concerns about whether scraping this content violates YouTube's terms of service, which ban accessing videos through "automated means." EleutherAI founder Sid Black noted that he used a script to download the captions via YouTube's API, similar to how a web browser operates.

Anthropic is among the companies that have trained models using the dataset and maintains that there are no violations involved. Spokesperson Jennifer Martinez stated:

"The Pile contains only a tiny subset of YouTube subtitles... YouTube’s terms apply to direct use of its platform, which is different from using The Pile dataset. For questions about potential violations of YouTube's terms, we would refer you to The Pile's authors."

A Google spokesperson told Proof News that the company has "taken action over the years to prevent abusive, unauthorized scraping," but did not provide further details. This is not the first such controversy: AI and tech companies have previously faced criticism for training models on YouTube videos without permission. Notably, OpenAI, the creator of ChatGPT and the video tool Sora, is suspected of using YouTube data for model training, although not all claims have been confirmed.

In an interview with The Verge’s Nilay Patel, Google CEO Sundar Pichai indicated that using YouTube videos to train OpenAI's Sora would likely violate YouTube's terms, although this is different from scraping captions through the API.