Apple, Anthropic, and Other Companies Used YouTube Videos to Train AI
18/7/24
By: Bharti B. Hariyani
YouTube has said that using creators’ content to train AI systems would violate its terms of service. So what happens if companies did exactly that?
More than 170,000 YouTube videos are part of a massive dataset used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News that was co-published with Wired. Apple, Anthropic, Nvidia, and Salesforce are among the tech giants that used the "YouTube Subtitles" data, which was scraped from the video platform without permission. The dataset comprises subtitles from videos belonging to more than 48,000 channels; it does not include any imagery from the videos.
Popular Creators Affected
The dataset includes videos from well-known creators such as MrBeast and Marques Brownlee, along with clips from major news outlets like ABC News, the BBC, and The New York Times. More than 100 videos from The Verge are part of the dataset, as well as numerous other videos from Vox.
“Apple has sourced data for their AI from several companies,” Brownlee, known by his handle MKBHD, wrote in a post on X. “One of them scraped tons of data/transcripts from YouTube videos, including mine.” He added, “This is going to be an evolving problem for a long time.”
YouTube did not immediately respond to The Verge’s request for comment.
Legal and Ethical Implications
The discovery raises significant questions about the legal and ethical implications of using such data without permission. As part of its investigation, Proof News released an interactive lookup tool that allows users to see if their content — or their favorite YouTuber’s — appears in the dataset.
The subtitles dataset is part of a larger collection from the nonprofit EleutherAI called The Pile, an open-source collection that also includes datasets of books, Wikipedia articles, and more. Last year, an analysis of one dataset, Books3, revealed which authors’ work had been used to train AI systems, leading to lawsuits from authors against the companies involved.
Transparency Issues
AI companies are often not transparent about the data used to train their systems. How YouTube content is being used has been a pressing question in recent months. In March, when OpenAI unveiled its powerful video generation tool, Sora, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.
“I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. When pressed about YouTube content specifically, Murati said she “wasn’t sure about that.”
YouTube's Stance
YouTube CEO Neal Mohan has previously stated that using video content, including transcripts, to train AI would violate the platform’s terms. In May, on an episode of Decoder, Google CEO Sundar Pichai echoed this sentiment, noting that if OpenAI had indeed trained Sora on YouTube content, it would have breached YouTube’s terms.
“We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product, so that’s how I felt about it,” Pichai said.
Moving Forward
The use of YouTube content to train AI systems without permission presents a challenge for both tech companies and content creators. As AI continues to advance, the need for clear guidelines and ethical standards becomes increasingly important. For now, the affected creators and platforms must navigate this evolving landscape to protect their content and ensure fair usage practices.
All images used in articles published by Kushal Bharat Tech News are the property of The Verge. We use these images under proper authorization and with full respect for the original copyright holders. Unauthorized use or reproduction of these images is strictly prohibited. For any inquiries or permissions related to the images, please contact The Verge directly.