Over 20 AI Copyright Lawsuits Highlight Data Sourcing and Compensation Imperative

Davis Blalock, a prominent research scientist formerly with Databricks and MosaicML, has reignited discussion of a contentious question: how artificial intelligence (AI) companies use public data for training. Blalock's recent tweet, "While I briefly have no employer, let me tell you what's really happening with AI companies training on public data: [1/n]," promises an insider's perspective on the industry's practices. His comments come amid a growing legal battle over the legality and ethics of training large language models (LLMs) on vast datasets, often scraped from the internet.

The debate centers on whether the extensive use of copyrighted material in AI training datasets constitutes "fair use" under intellectual property law. AI developers often argue that their models "transform" the data by learning patterns and generating new content rather than reproducing the original works. This transformative-use argument is a key defense in the numerous lawsuits currently facing major AI firms.

Recent court rulings have sent mixed signals, underscoring the complexity of the issue. In June 2025, a U.S. District Judge ruled that Anthropic's training of its Claude LLMs on copyrighted books was "fair use" because of its transformative nature; the same ruling, however, found that Anthropic's acquisition and storage of those books from pirated sources infringed copyright, and ordered a trial on damages. In a separate case, another judge dismissed a lawsuit against Meta over its Llama LLMs, while emphasizing that copying copyrighted works for training could indeed be illegal and that companies should find ways to compensate copyright holders.

The U.S. Copyright Office, in its May 2025 report on generative AI training, clarified that "publicly available" data is not synonymous with "authorized" data. The report indicated that knowingly using datasets derived from pirated or illegally accessed material should weigh against a fair use defense. It also highlighted concerns about AI models "memorizing" and reproducing copyrighted content, which would constitute infringement.

With at least 21 lawsuits pending in the U.S. alone, the legal landscape for AI training data is evolving rapidly. Collectively, these cases are pressing courts to draw clearer boundaries around data sourcing and compensation for the creators whose works underpin these powerful AI systems. The industry faces mounting pressure to balance innovation with ethical data practices and respect for intellectual property rights.