YouTube has warned tech companies that some AI practices violate the platform’s terms of service, but that caution hasn’t stopped the unauthorized use of videos for AI training. Top creators like Marques Brownlee have spoken out after their content was included in a widely circulated dataset dubbed “the Pile.”
According to an investigation from Proof News, tech companies like Apple, Nvidia, and Anthropic have employed the Pile for training purposes. YouTube content in the Pile includes subtitles from 173,536 YouTube videos, which originate from more than 48,000 channels.
Some of that YouTube content comes from sources — like the educational hub Khan Academy — that make sense as training material for generative AI models. In other cases, users of the Pile encountered content from top creators like MrBeast, Brownlee, and Jacksepticeye, even though those YouTube power players did not approve the use of their videos for AI training purposes.
Proof News has put together a tool that allows users to search through the material in the Pile.
The nonprofit EleutherAI put together the Pile to provide smaller AI operations with low-cost, readily available access to training material. Though the dataset was not compiled for big tech firms like Apple and Nvidia, those companies have used it regardless.
“Apple technically avoids ‘fault’ here because they’re not the ones scraping,” Brownlee wrote on X. “But this is going to be an evolving problem for a long time.”
(Update 7/18: Apple has issued a statement saying its OpenELM model, which was trained on YouTube videos, isn’t used to power any of its AI/machine learning tools, including Apple Intelligence.)
AI developers, media companies, and individual creators all seem to have different ideas about the materials that can or cannot be repurposed for training. Those squabbles have led to ongoing lawsuits, several of which have targeted innovators like OpenAI. In response, the Microsoft-backed firm is building tools that give content owners more power over the ways their IP is used. But while creators wait for those controls to be put in place, they are left with few means to protect their videos against unauthorized reuse.

Some of the companies that are benefitting from the Pile have challenged the authority of YouTube’s terms of service. “The Pile includes a very small subset of YouTube subtitles,” an Anthropic spokesperson told WIRED. “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset.”
The owners of content included in the dataset have different ideas. Dave Wiskus, the CEO of creator-led streamer Nebula, described the training practices of Pile users as “theft.” Julia Walsh, the CEO of Vlogbrothers-affiliated media company Complexly, expressed similar ideas. “We are frustrated to learn that our thoughtfully produced educational content has been used in this way without our consent,” she said.
Our opinion here at Tubefilter is that U.S.-based creators who unwittingly become AI training dummies deserve the same protections that are afforded to content owners in other regions. The E.U. recently passed a sweeping law that lays out specific regulations for the datasets that are fed to AIs. A similar law in the U.S. would clear up a lot of the confusion about whom — if anyone — is responsible for the rights of the creators found in the Pile.