Aspire Market Guides



The advent of generative AI brings to the forefront many novel and complex legal questions related to fair use and copyright infringement. Historically, whether a particular use qualifies as fair use has been analyzed through an established economic framework. Applying the same methodology to copyright matters involving generative AI, however, presents unique challenges, primarily because of the distinct nature of AI-generated content and the processes involved.

Two central questions have emerged in the burgeoning legal battles surrounding content-generating AI models, including Large Language Models (LLMs) and Generative Image Models: (1) whether the training process infringes existing copyrights, and (2) whether AI-generated content itself constitutes infringement. These questions cannot be answered independently from one another, as determining whether the AI-generated output is infringing partly depends on whether the use of copyrighted content for AI-model training qualifies as fair use.

Economic studies dating back at least to the early 1990s emphasize the concept of substitution when determining whether an unauthorized use of copyrighted content qualifies as fair use. The economic substitution principle stipulates that a secondary use might be considered fair if it does not serve as an economic substitute for the original work. When applying this principle in practice, courts have typically assessed fair use based on the following four factors:

  1. Purpose and Character of the Use—whether the use is transformative, adding new expression or meaning;
  2. Nature of the Copyrighted Work—whether the original work is factual or creative;
  3. Amount and Substantiality of the Portion Used—how much of the copyrighted content was used and the significance of that portion; and
  4. Effect on the Market or Value of the Copyrighted Work—whether the new use adversely affects the market for or value of the original work.

The four factors, guided by the underlying principle of economic substitution, have become increasingly significant as copyright owners initiate lawsuits against generative AI companies. For example, in Authors Guild v. OpenAI, the plaintiffs claim that their literary works were used without authorization to train ChatGPT. In turn, OpenAI argues that the content was used for learning purposes, and that ChatGPT output is not an economic substitute for the original because it is transformative (a summary of a particular book, for example) and therefore constitutes fair use.

In The New York Times v. OpenAI and Microsoft, however, plaintiffs claim that ChatGPT-generated output can be substantially similar to New York Times articles, sometimes even verbatim. Plaintiffs rely on this finding to argue that the output is not transformative and instead directly competes with the original works. Similarly, while plaintiffs in Andersen v. Stability AI claim that Stability AI improperly uses copyrighted works of art in training an AI service that generates images, the defendants argue that generated images are transformative and not substantially similar to the original works.

I.A. Purpose and Character of the Use (Factor 1)

Courts have determined that unauthorized use of copyrighted work can be considered fair use if it is “transformative,” i.e., if it adds new expression or meaning to the original work. A classic example is a book review. As the review serves a different purpose (to summarize and evaluate the contents of the book) than the book itself, the use of excerpts from the book is considered fair use.

Although the secondary use need not benefit the original work (for example, a book review may be unfavorable) to qualify as fair use, evidence of positive impact on the original work suggests that the secondary use can be a complement to, instead of a substitute for the original work. To this end, existing economic studies and case-specific data (when available) can inform whether a particular use—e.g., reliance on copyrighted material in AI model training—benefited the original work.

One illustrative economic study in this context is Erickson et al. (2013), which examined the impact of YouTube parody music videos on copyright holders. The researchers found no evidence that these videos caused economic harm to copyright holders through substitution. Further, their analysis revealed that the presence of parody not only correlates with a larger audience for the original music videos but can also serve as a predictor for increased viewership.

Conceptually, the question of whether the use of copyrighted material in AI training benefits the original work is complicated by a number of factors. The output of generative AI is not limited to a particular use case and may instead serve a wide variety of user needs, including summarizing, paraphrasing, expanding upon, remixing, or even replicating copyrighted content. These varied use cases would require either assessing each instance individually to determine whether it is transformative or a substitute, or assessing the overall net effect in aggregate. For example, although an AI-generated summary of a novel may be transformative in purpose, any output that closely paraphrases or reconstructs key elements of a protected text may function more like a substitute. Similarly, image generators may produce entirely new artistic interpretations or, alternatively, outputs that replicate the style or composition of the original. Determining whether AI-generated output is transformative would then require a granular, context-specific assessment of how the AI content is used and perceived, and whether it fulfills the same market demand as the original.

Empirically analyzing these issues also presents a number of challenges. Analyses evaluating whether a secondary use complements or substitutes for a copyrighted work often seek to identify causal effects using variation across time or space. For example, one common approach is to use temporal variation—comparing sales or consumption of the original work before and after the allegedly infringing work becomes available. Alternatively, economists may use spatial variation—comparing outcomes across geographic areas or media platforms with differing levels of exposure to the allegedly infringing content. Both approaches share the same goal: to isolate the effect of the secondary use from other confounding factors that might influence market outcomes.
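The temporal and cross-sectional comparisons described above are often combined in a difference-in-differences calculation. The Python sketch below is illustrative only: the sales figures and the function name `diff_in_diff` are hypothetical, and the calculation assumes that the exposed and unexposed groups would have followed parallel trends had the secondary use never appeared.

```python
from statistics import mean


def diff_in_diff(exposed_before, exposed_after, control_before, control_after):
    """Difference-in-differences estimate of the change in an outcome.

    Each argument is a list of outcomes (for example, weekly unit sales of
    the original work) for one group in one period.
    """
    exposed_change = mean(exposed_after) - mean(exposed_before)
    control_change = mean(control_after) - mean(control_before)
    # Subtracting the control group's change removes shocks common to both groups.
    return exposed_change - control_change


# Hypothetical weekly sales of an original work (not real data).
exposed_before = [100, 110, 105]  # market exposed to the secondary use, pre-period
exposed_after = [90, 95, 85]      # same market, post-period
control_before = [80, 85, 75]     # unexposed market, pre-period
control_after = [78, 82, 80]      # unexposed market, post-period

estimate = diff_in_diff(exposed_before, exposed_after, control_before, control_after)
print(f"Estimated effect on weekly sales: {estimate}")
```

With these made-up numbers the estimate is a decline of 15 units per week, which would be consistent with substitution. A positive estimate would instead point toward a complementary effect. A real analysis would also need standard errors and a check of the parallel-trends assumption.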

Applying these empirical frameworks to generative AI presents multiple complications. First, AI-generated content may not be tied to a specific location or release date in the way traditional works are, thus limiting the usefulness of temporal or geographic variation. Second, if AI-generated output is largely generated on demand and used privately by individuals, our ability to observe and measure exposure may be limited. Third, modern generative AI platforms can disseminate outputs globally in real time, making it difficult to construct valid control groups. Finally, proprietary constraints may prevent researchers from obtaining the data needed to measure exposure or use accurately.

I.B. Nature of the Copyrighted Work—Factual v. Creative (Factor 2)

Another important factor in a fair use claim relates to whether the original work is considered predominantly factual or creative. Factual works contain objective, verifiable information and are primarily informative rather than expressive. Examples include a dictionary, which provides definitions and explanations of words; the U.S. Census, a decennial record of population statistics; and the Bureau of Labor Statistics’ monthly reports on inflation and wages. Creative works, by contrast, are characterized by originality and personal expression, and are often intended to entertain, to inspire, or to convey unique perspectives. Novels, paintings, songs, and films are all examples of creative works, as they reflect the author’s or artist’s imagination, creativity, and stylistic choices.

From an economic perspective, courts have generally found that factual works are less susceptible to market harm from unauthorized use and thus more suitable for a fair use defense. In other words, when the original work is factual in nature, its unauthorized secondary use is more likely to be considered fair use. However, when a work blends factual and creative elements—as is often the case with hybrid works like historical biographies—courts may consider apportioning the relative value derived from each component.

To illustrate, consider a best-selling historical biography that combines detailed factual accounts of events with narrative storytelling. One way to estimate the value of the creative element may be to compare its market performance with a purely factual accounting of the same events, but without any of the creative elements (assuming such an alternative exists); this approach bears some resemblance to “non-infringing alternative” analysis often used in patent infringement matters. If the historical biography significantly outsells the factual version, this may suggest that the creative components—such as storytelling, character development, and narrative structure—add substantial value in this case.

Consumer surveys can provide additional insights into and help quantify the relative importance of the creative and factual aspects of the product. For instance, a survey could ask readers which factors from a list, such as price, recommendations, historical accuracy, or narrative style, were most important in their decision to purchase the book. Analyzing how consumers weigh these attributes can shed light on the value added by creative expression.
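As a rough illustration of how such survey responses might be tabulated, the Python sketch below computes the share of respondents who named each attribute as most important. The responses and the function name `attribute_shares` are hypothetical. An actual survey analysis would also need to address sampling, question design, and statistical uncertainty.

```python
from collections import Counter


def attribute_shares(responses):
    """Return each attribute's share of 'most important' responses."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return {attribute: count / total for attribute, count in counts.items()}


# Hypothetical survey: each entry is the factor one respondent named as
# most important in deciding to buy the biography (not real data).
responses = [
    "narrative style", "historical accuracy", "narrative style", "price",
    "recommendations", "narrative style", "historical accuracy",
    "narrative style", "price", "historical accuracy",
]

shares = attribute_shares(responses)
# Print attributes from most to least frequently cited.
for attribute, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{attribute}: {share:.0%}")
```

In this made-up sample, narrative style is cited most often (4 of 10 respondents), which would suggest that creative expression carries meaningful weight in the purchase decision.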

However, making a distinction between factual and creative in the generative AI context comes with both conceptual and empirical challenges. AI models are trained on vast and often uncurated collections of text and media, which may include an unknown mix of factual, creative, and hybrid works. The sheer scale of these inputs—many of which may come from aggregated sources such as web scrapes, where content is unlabeled or inconsistently structured—poses a significant challenge. Separating or apportioning the factual and creative elements of even a single hybrid work can be difficult; doing so across millions or billions of inputs is exponentially more complex.

Estimating the distinct contribution of creative versus factual components in training data is further complicated by the limited visibility into model training pipelines and the proprietary nature of AI datasets. Generative AI systems involve iterative tuning and multi-stage pretraining, making it nearly impossible to trace how specific inputs—factual or creative—contribute to particular outputs. Because developers typically do not publish detailed records that track exactly which pieces of content were used during the training process, when they were used, and how they influenced model development over time, it can be extremely difficult to determine how much influence any particular input had on a model’s behavior. Similarly, tools that would allow one to trace a specific output back to the particular training data used to generate it are generally lacking in current generative AI systems. Without such tools, any attempts to apportion value across content types may be speculative.

I.C. Amount and Substantiality of the Portion Used (Factor 3)

Both the quantity and quality of the copyrighted material used are important when determining whether secondary use constitutes fair use. If the unauthorized secondary use involves either (1) a large portion of the copyrighted work or (2) a smaller but key part of the copyrighted work, it is less likely to be deemed fair use. Conversely, if the unauthorized secondary use involves a less important or “de minimis” portion of the original work, courts might allow it without even conducting a fair use analysis.

Analyses assessing the importance (or salience) of specific content can be informative in determining fair use. For example, a movie scene prominently featured in trailers and widely distributed through memes, social media posts, and online discussions may be deemed a particularly significant part of the movie. Similarly, a specific portion of a book or article that is frequently quoted and discussed in media, online reviews, and social media posts might be viewed as especially important relative to the other parts of the same work.

To assess such salience, relevant economic analyses could, for example, examine data from Google Trends or social media platforms related to the content at issue. For instance, one such analysis might involve quantification and comparison of the total number of Twitter posts, retweets, or comments referencing or related to the specific excerpt in question, relative to those referencing or related to the entire original work.
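The comparison described above reduces to a simple ratio. The Python sketch below is a minimal illustration that assumes the mention counts have already been collected. The counts and the function name `salience_share` are hypothetical, and no particular platform API is implied.

```python
def salience_share(excerpt_mentions, work_mentions):
    """Share of all mentions of a work that reference a specific excerpt.

    Assumes excerpt mentions are a subset of the mentions of the whole work.
    """
    if work_mentions <= 0:
        raise ValueError("work_mentions must be positive")
    if not 0 <= excerpt_mentions <= work_mentions:
        raise ValueError("excerpt_mentions must be between 0 and work_mentions")
    return excerpt_mentions / work_mentions


# Hypothetical counts of posts referencing the excerpt and the full work.
share = salience_share(excerpt_mentions=1200, work_mentions=8000)
print(f"Excerpt accounts for {share:.1%} of mentions of the work")
```

A high share for a short excerpt would suggest that the excerpt is disproportionately salient relative to its length. What counts as "high" depends on the work and on the comparison set chosen.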

Evaluating the quantity and quality of the copyrighted material used in the context of generative AI introduces challenges similar to those discussed under Factor 2. Given (1) the scale of the input data and (2) that AI systems typically do not disclose which parts of specific works were accessed, replicated, or emphasized in generating outputs, even identifying whether a salient or substantial portion of a work was used can be exceedingly difficult—let alone determining the significance of that portion of a work in the AI model’s output.

I.D. Effect on the Market or Value of the Copyrighted Work (Factor 4)

Unauthorized secondary use can have notable economic effects on the copyright owner—it may deprive them of income, reduce the original work’s value, or harm an existing or potential market for the work. Fair use under this factor can be assessed by examining the presence or absence of such effects.

Economic models can be used to estimate the impact of the new use (i.e., AI-generated content) on the sales and/or market share of the original content to assess the presence (or lack) of economic substitution. Such analyses are similar to those previously described for Factor 1 (the purpose or character of use), except that for this factor, quantifying any positive or negative impact is highly relevant. For example, in the Authors Guild v. HathiTrust case, plaintiffs failed to convince the court that scanning copyrighted books and so-called orphan works deprived the copyright owners of commercial licensing opportunities.

As with the other factors, the diversity of use cases, the scale of input data, and the limited visibility into the AI training process complicate economic analyses here as well. In addition to these challenges, other unique considerations arise when evaluating market effects in the context of generative AI. For instance, even if the original work were previously offered for free (i.e., at zero price), and thus no immediate market harm appears to exist, the potential impact of the AI-generated content on the value of the original work might need to be considered in a broader business context. For example, a work offered for free might serve as a gateway to the creator’s other paid works by helping the author build reputation and/or brand name. Similarly, products offered at zero price can be integral to “freemium” business models, where the free product draws users to premium offerings. If such a “free” original work is used to train an AI model that generates related content, then it may be necessary to examine the potential economic impact on both the original work and related products—and in some cases, an entire business.

Furthermore, when the copyrighted material is used to train an AI model, the licensing terms under which the original works were made available can also play a crucial role in the economic analysis. Business models involving “ShareAlike” licenses allow free access to copyrighted works with the expectation that any derivative works will also be freely available. If an AI model is trained on such works, but generates content that is not subject to reciprocal sharing, the value proposition for both creators and consumers may be undermined.

Concluding Remarks

Applying the fair use doctrine to generative AI presents novel and complex challenges that stretch the boundaries of traditional copyright analysis. While the established four-factor framework—rooted in economic principles such as substitution and market harm—remains a useful tool, the nature of AI-generated content and the opacity of the training process demand careful, context-specific analysis.

Courts will need to grapple with conceptual and empirical uncertainties, particularly around the scale of data usage, the diversity of outputs, and the difficulty of tracing inputs to outputs. Economic analysis will play a critical role in evaluating whether and when AI uses are transformative, and whether they harm or complement the markets for original works. In doing so, an evidence-based application of fair use will remain essential to balancing innovation with the rights of original creators.

Disclosure:  Steven Herscovici served as a consultant to the HathiTrust in the Authors Guild v. HathiTrust case. 


