27.07.2024 08:09

Apple, Nvidia and Anthropic used YouTube videos to train AI

Creators accuse megacorporations of using their YouTube videos to train AI models without their knowledge.

Tech companies are using controversial tactics to feed their data-hungry artificial intelligence (AI) models. Data from books, websites, photos and social media posts are often used without the creators' knowledge.

AI companies are very secretive about data

However, companies that train AI models are generally very secretive about their data sources. We've pointed this out many times before, but now an investigation by the non-profit news organization Proof News has revealed that some of the world's major players in artificial intelligence have been using material from thousands of videos posted on YouTube for training. The companies did so despite YouTube's rules prohibiting the collection of material from the platform without permission.

Several Silicon Valley companies are said to have used a dataset of subtitles (YouTube Subtitles) from more than 173,000 videos, drawn from more than 48,000 channels on the platform. The companies reportedly include Anthropic, Nvidia, Apple and Salesforce.

Models may also have learned from conspiracy theories

The dataset, called YouTube Subtitles, contains transcripts of videos from educational channels such as Khan Academy, MIT, and Harvard. It also draws on news outlets such as The Wall Street Journal, NPR, and the BBC. The Late Show With Stephen Colbert, Last Week Tonight With John Oliver and Jimmy Kimmel Live were among the shows from which data was allegedly extracted. Material was also found from YouTube megastars such as MrBeast, Marques Brownlee, Jacksepticeye and PewDiePie, creators with more than 10 million subscribers each; MrBeast alone has over 300 million.

Part of the problem is that the AI also draws data from videos promoting conspiracy theories, such as claims that the Earth is flat.

"No one has come to me and asked, 'Can we use this?'" said David Pakman, host of The David Pakman Show, a channel with more than 2 million subscribers and more than 2 billion views. Almost 160 of his videos have been used for AI training. His show is a full-time operation with four employees who produce podcasts and videos, which are also published on TikTok and other platforms. If AI companies profit from this, Pakman said, creators should be compensated for the use of their data. He pointed out that some media companies have recently struck agreements to be paid for the use of their work in AI training. "This is what I do for a living, I invest time, resources, money and the time of my employees into creating content," he said.

"It's theft," says Dave Wiskus, director of the streaming service Nebula. He said it is disrespectful to use the work of creators without their consent, especially since studios may one day be able to use "generative artificial intelligence to replace the videos of today's creators." "Will they be able to use this learning to exploit and harm artists? Definitely," Wiskus is convinced.

Where did it all begin?

The dataset is said to be part of a compilation called the Pile, released by the non-profit organization EleutherAI. It includes not only material from YouTube but also from the European Parliament, the English Wikipedia, and a trove of Enron employee emails that were released as part of the federal investigation.

Most of the Pile's datasets are available on the Internet, open to anyone with enough storage and computing power to use them. Academics and other developers outside of "Big Tech" have used the dataset, but they weren't the only ones.

Companies such as Apple, Nvidia and Salesforce state in their announcements that they used the Pile to train AI. The documents indicate that Apple also used the Pile to train OpenELM, a high-profile model released in April, weeks before the company revealed it would add new AI capabilities to iPhones and MacBooks.

So did Anthropic, a leading AI developer in which Amazon has invested $4 billion and which promotes its focus on "AI safety."

The concerns, however, go beyond the aforementioned conspiracy theories. The Pile also contains numerous profanities and is said to exhibit bias against certain genders, religious groups and races.

Representatives for EleutherAI, the creators of the YouTube dataset, have yet to respond to requests for comment on Proof News' findings. The organization's website states that its overall goal is to lower barriers to AI development for those outside "Big Tech."

YouTube Subtitles does not include video clips; it consists of the plain text of video subtitles, often accompanied by translations into languages including Japanese, German, and Arabic.

YouTube is a goldmine of data

Companies developing AI are competing with each other to see which one has the better artificial intelligence model. Earlier this year, The New York Times reported that Google, which owns YouTube, was training its model on videos. A Google spokesperson said the footage was used in accordance with contracts with creators who publish on the platform.

In the same investigation, the paper reported that videos were allegedly used without authorization by OpenAI, which neither confirmed nor denied this. According to some reports, the data was used to train its model Sora, which generates videos from text prompts.

YouTube Subtitles and similar datasets are a goldmine, as they can greatly help in training models to imitate human speech and conversation. And of course, AI can learn the most from the largest collection of videos in one place: YouTube.

Proof News sought reactions from the channel owners featured in this story. Those it managed to reach were unaware that their data had been used to train AI. Among those surprised were the producers of Crash Course and SciShow, pillars of the video education empire of brothers Hank and John Green. "We are disappointed to learn that our thoughtfully crafted educational content has been used in this way without our consent," Julie Walsh Smith, CEO of production company Complexly, said in a statement.

YouTube Subtitles is just one in a series of datasets whose use in AI training has caused problems for the creative industries. Something similar happened with Books3, a collection of over 180,000 books used to train AI, which is also part of the Pile. Many authors have since sued AI companies for unauthorized use of their works and alleged copyright infringement.


We can expect more similar disputes in the future

Most of the litigation is still in its early stages, so questions about permissions and potential penalties remain open. The Pile has since been removed from its official download site, but it is still available on file-sharing services.

Companies developing artificial intelligence generally defend the practice as fair use. Creators disagree and expect compensation for the use of their work, especially given the prospect that AI may eventually take away part of their business.

Because of all this, creators face considerable uncertainty. YouTubers for whom the platform is a full-time job add copyright notices to their videos, but they worry it is only a matter of time before AI can create content that closely resembles what they produce themselves, or even a perfect imitation.

Pakman, the creator of The David Pakman Show, recently got a taste of the power of artificial intelligence while browsing TikTok. He came across a video labeled as a recording of American political commentator Tucker Carlson, but when he watched it, he was left speechless. It sounded like Carlson, yet the words were ones Pakman himself had spoken on his YouTube show: the clip was a voice clone of Carlson reading Pakman's script. He was even more concerned to find that only one comment under the clip recognized it as a fake.

This will be a big problem, Pakman believes, because the same can be done with anyone's voice.
