SSP Forum: Mark Lemley on Fair Learning
Neural network and machine learning artificial intelligences (AIs) need comprehensive data sets to train on. Those data sets will often be composed of images, videos, audio, or text. All those things are copyrighted. Copyright law thus stands as an enormous potential obstacle to training AIs. Not only might the aggregate data sets themselves be copyrighted, but each individual image, video, and text in the data set is likely to be copyrighted too.
It’s not clear that the use of these databases of copyrighted works to build self-driving cars, or to learn natural languages by analyzing the content in them, will be treated as a fair use under current law. Fair use doctrine in the last quarter century has focused on the transformation of the copyrighted work. AIs aren’t transforming the databases they train on; they are using the entire database, and for a commercial purpose at that. Courts may view that as a kind of free riding they should prohibit.
We argue that AIs should generally be able to use databases for training whether or not the contents of that database are copyrighted. There are good policy reasons to do so. And because training data sets are likely to contain millions of different works with thousands of different owners, there is no plausible option simply to license all the underlying photographs or texts for the new use. So allowing a copyright claim is tantamount to saying, not that copyright owners will get paid, but that no one will get the benefit of this new use.
There is another, deeper reason to permit such uses, one that has implications far beyond training AIs. Understanding why the use of copyrighted works by AIs should be fair actually reveals a significant issue at the heart of copyright law. Sometimes people (or machines) copy expression but they are only interested in learning the ideas conveyed by that expression. That’s what is going on with training data in most cases. The AI wants photos of stop signs so it can learn to recognize stop signs, not because of whatever artistic choices you made in lighting or composing your photo. Similarly, it wants to see what you wrote to learn how words are sequenced in ordinary conversation, not because your prose is particularly expressive.
AIs are not alone in wanting just the facts. The issue arises in lots of other contexts. In American Geophysical Union v. Texaco, for example, the defendants were interested only in the ideas in scientific journal articles; photocopying the article was simply the most convenient way of gaining access to those ideas. Other examples include copyright disputes over software interoperability, as in Google v. Oracle; current disputes over copyright in state statutes and rules adopted into law; and perhaps even disputes over Bikram yoga poses and the tangled morass of cases around copyright protection for the artistic aspects of utilitarian works like clothing and bike racks. In all of these cases, copyright law is being used to target defendants who actually want something the law is not supposed to protect – the underlying ideas, facts, or functions of the work.
Copyright law should permit copying of works for non-expressive purposes. When the defendant copies a work for reasons other than to have access to the protectable expression in that work, courts applying fair use should consider, under both factors one and two, whether the purpose of the defendant’s copying was to appropriate the plaintiff’s expression or just the ideas. We don’t want to allow the copyright on the creative pieces to end up controlling the unprotectable elements.