Search tool exposes music in AI training datasets

The Atlantic has made public a searchable database of songs appearing in four music datasets that reporter Alex Reisner identified as AI training material. The tool gives artists, labels and listeners a way to check collections that, according to Reisner, include millions of tracks and have already circulated widely online.

Reisner reported that two of the datasets are far larger than the others, with 12 million and 9 million tracks, while the other two each contain more than 100,000 songs.

The database arrives as musicians and technology companies continue to fight over what audio can be used to build generative AI models. Reisner said the datasets have been downloaded thousands of times, though he also said it is not possible to know every person or company that has used them.

Research papers name some users

According to Reisner, Google and Stability have each acknowledged in research papers that they used some of the datasets. The reporting does not establish a complete list of users, and Reisner framed the papers as confirmed examples rather than a full accounting.

Some of the music comes from sources that listeners can access only under restrictions. Reisner cited the Free Music Archive as one example: tracks there can be streamed for personal use, but commercial applications require licensing.

The Verge reported that well-known names appear in the searchable material, including Lady Gaga, Fred Again.., Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen and experimental composer Hainbach. Their presence in a dataset does not, by itself, show which AI products may have used a particular track.

Links, downloads and platform rules

Reisner reported that three of the datasets are distributed largely as lists of links to songs hosted on YouTube or Spotify, rather than as ready-made audio archives. Developers who want the audio may then use automated tools to download it from those platforms, according to his reporting.

Some of those tools can avoid logins, ads and systems that generate revenue or subscribers for creators, Reisner wrote. He said that using such tools violates the terms of service of the platforms involved.

The searchable database makes a technical supply chain easier to inspect. Instead of requiring people to locate and parse large training datasets themselves, The Atlantic’s tool lets the public search for artists and tracks that may be included in those collections.

The reporting does not say that every track in the datasets was used to train a commercial AI product. It does show that large music collections are available to AI developers and researchers, and that at least some major technology organizations have used them in published research.

This story draws on original reporting from The Verge.