A Morton Mandel Conversation | 2140th Stated Meeting | November 10, 2025 | Chevron Auditorium, International House at UC Berkeley
The Academy’s Berkeley Committee hosted a panel discussion on generative AI (GenAI) that offered a technical overview of the technology and explored the legal and economic issues raised by the growing number of lawsuits challenging the legality of GenAI. The panel included Jennifer Chayes, Dean of the UC Berkeley College of Computing, Data Science, and Society; Pamela Samuelson, Professor of Law at UC Berkeley School of Law; and Abhishek Nagaraj, economist and Associate Professor at the Berkeley Haas School of Business. Goodwin Liu, Chair of the Academy’s Board of Directors, delivered welcome remarks. An edited transcript of the panelists’ presentations and discussion follows.
Jennifer Chayes
Jennifer Chayes is Dean of the UC Berkeley College of Computing, Data Science, and Society; and Professor of Electrical Engineering and Computer Sciences, Information, Mathematics, and Statistics at the University of California, Berkeley. She was elected to the American Academy of Arts and Sciences in 2014.
It’s a pleasure to be here. I’ve been asked to talk about some of the technical aspects of generative AI technologies and to highlight some of the opportunities of these technologies.
GenAI is a game-changing technology. We saw a real breakthrough with what we call the transformer model, which is basically a model that allows us to look at words in a sentence to understand their relationships and context. Unlike older models that processed words one by one, this model processes sequences of words all at once. The transformer model has an encoder-decoder architecture: the encoder takes “tokens” (words) and converts them into vectors. And the decoder takes vectors and converts them into tokens (words).
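To make this concrete, here is a minimal sketch, in Python, of the attention step at the heart of a transformer. It is illustrative only, not anything presented at the panel; the vocabulary size, dimensions, and token IDs are arbitrary.

```python
# A minimal sketch of the attention step in a transformer: every token's
# vector is updated using a weighted mix of all the other tokens' vectors,
# so the whole sequence is processed at once rather than word by word.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays of query/key/value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # each output mixes full context

# Toy "encoder" step: embed a 4-token sentence into vectors, then let every
# token attend to every other token to pick up relationships and context.
rng = np.random.default_rng(0)
vocab_size, d_model, tokens = 100, 8, [5, 17, 42, 9]
embedding = rng.normal(size=(vocab_size, d_model))   # token -> vector table
X = embedding[tokens]                                # (4, d_model)
contextualized = scaled_dot_product_attention(X, X, X)
print(contextualized.shape)   # (4, 8): one context-aware vector per token
```

Every output row mixes information from every position in the sentence, which is what lets the model capture relationships and context all at once.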
I think somewhere between GPT-2 and GPT-3 there was a phase transition, analogous to a change of phase in a physical system like water freezing to ice, beyond which these large language models (LLMs) could extract information from distant parts of the input. It seemed like magic because the LLMs could connect remote pieces of information. We later moved beyond simple LLMs to multimodal models that include images, audio, and video.
Another breakthrough that helped to prevent some of the hallucinations was the use of reinforcement learning. Even as a user of the technology, you can mitigate hallucinations by carefully prompting these models. With ChatGPT, in areas in which it doesn’t have that much knowledge, if you’re very careful, you can lead it through prompts the way you would lead a child learning how to add or multiply. And then it does a lot better. Some improvements are the result of inference-time compute, the extra computational power used after an AI model is trained, which allows the model to draw inferences and improve its problem-solving. And that has made these models a lot more reliable in many technical domains.
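As one hedged illustration of that kind of careful prompting: in the sketch below, ask is a hypothetical placeholder for whatever chat-model API you happen to use (it is not any vendor’s SDK), and the decomposition mirrors walking a child through multiplication.

```python
# Illustrative only: ask() is a hypothetical stand-in for a chat-model API;
# the point is the prompting pattern, not a particular vendor's SDK.
def ask(prompt: str) -> str:
    raise NotImplementedError("wire this to the chat model of your choice")

# A bare question: in a domain where the model is weak, it may guess.
direct_prompt = "What is 137 * 49?"

# A guided prompt: lead the model through steps, as you would a child
# learning to multiply, and make it check its own work.
guided_prompt = (
    "Compute 137 * 49 step by step.\n"
    "1. First compute 137 * 50.\n"
    "2. Then subtract 137 from that result.\n"
    "3. State the final answer, then verify it a second way.\n"
)

# answer = ask(guided_prompt)  # spends more inference-time compute per query,
#                              # trading extra tokens for reliability
```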
In pretraining LLMs, the system follows scaling laws: performance improves predictably, as a power law, as we increase the model and dataset size as well as the amount of compute used for training. But the gains from pretraining are slowing down; we are seeing diminishing returns. When pretraining an already robust model, you need to feed in more and more information, and you get smaller and smaller improvements. You need much more data and many more parameters, and that takes a lot of compute time. Basically, you’re using more power to get incremental improvements.
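More precisely, the pretraining scaling laws are power laws. A representative form, written here with illustrative constants in the style of published scaling-law papers rather than any specific fitted values, shows why the returns diminish:

```latex
% Loss L falls as a power law in model size N (parameters) and
% dataset size D (tokens); N_c, D_c, and the exponents are fitted
% constants, and L_infinity is an irreducible floor.
\[
  L(N, D) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N}
  + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_{\infty}
\]
```

Because published fits put the exponents well below 1 (roughly 0.05 to 0.35 across studies), halving the model-size term requires multiplying the parameter count by $2^{1/\alpha_N}$, an enormous factor; that is the quantitative content of “more power for incremental improvements.”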
In post-training LLMs, you feed extra information into the model. The more specific you are and the more you teach it, the better the answers will be. Reinforcement learning was also used to make these models safer. We are post-training LLMs with real human feedback.
Now the goal of diffusion models is to generate something that doesn’t exist. For example, let’s say I want a painting of my cat in the style of a van Gogh. Many online apps can do this for me. How do they achieve this? They follow three steps. First, they add random noise (what we call a diffusion process) to an image of a van Gogh painting, progressively corrupting its pixels. Second, they reverse the process (what we call backward diffusion or denoising) to undo that random addition of noise and recover the original image of the van Gogh painting. They do this by training a neural network that can denoise the image. And third, they take a random image and “denoise” it by using the backward diffusion process in step 2 to generate something entirely new. Obviously, for all of this we need to be aware of the copyright implications, which Pam will talk about later.
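Those three steps can be sketched in a few lines of Python on a toy numpy “image,” with a placeholder standing in for the trained denoising network; this is an illustration of the idea, not any product’s actual code.

```python
# A minimal sketch of the three diffusion steps on toy numpy "images".
# Real systems train a neural network as the denoiser; a placeholder
# stands in here so the control flow is visible end to end.
import numpy as np

rng = np.random.default_rng(0)
T = 100                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)       # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def add_noise(x0, t):
    """Step 1 (forward diffusion): corrupt image x0 up to step t."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

def denoiser(x_t, t):
    """Step 2 trains a neural network to predict the added noise; this
    placeholder predicts zero so the loop below stays runnable."""
    return np.zeros_like(x_t)

def generate(shape):
    """Step 3 (backward diffusion): start from pure noise and denoise."""
    x = rng.normal(size=shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t)
        x = (x - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(1 - betas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=shape)  # ancestral noise
    return x

noisy, true_eps = add_noise(np.ones((8, 8)), t=50)  # step 1 on a toy image
sample = generate((8, 8))                           # step 3: a brand-new sample
print(noisy.shape, sample.shape)
```

With a real trained denoiser in place of the placeholder, that same backward loop is what turns random noise into a cat painted in the style of van Gogh.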
A few years ago, I met Omar Yaghi, who recently won the 2025 Nobel Prize in Chemistry. He told me that it took between two and three years to synthesize metal-organic frameworks (the discovery that earned him the Nobel Prize). I thought we should try to use AI to synthesize these materials. We received a gift to do this work at the Bakar Institute of Digital Materials for the Planet. The great thing about these metal-organic frameworks is that they can capture and store toxic gases, and even harvest water from desert air. We started working together, and our group includes Christian Borgs and many postdocs and graduate students. We developed seven specialized LLM agents that took the synthesis time down from two to three years to less than two weeks. This work blew us away. And then we added a diffusion model to the LLMs, but instead of painting van Goghs, it imagined new molecules and materials and helped us to identify certain properties in these molecules and materials so that we were able to start generating them. This work was a very close collaboration between domain experts and experts in AI.
Let me end with this thought. I believe deeply that GenAI is the most empowering technology of our lives. But we must mitigate its harms and find ways that it can be used for good in leveling the playing field among people with different resources. I am working with a group at Berkeley and UCLA to create a curriculum for California’s community college system, which serves 2.1 million students. It includes a module that teaches students how to interrogate the output of GenAI and how to train an AI agent, so that every student can effectively have their own research assistant, independent of their financial resources. In this way we both mitigate the ill effects and enable the democratizing effects of AI. My collaborator at UCLA is Safiya Noble. She’s a MacArthur Fellow, and she studies the ill effects of social media.
It’s clear that we need collective expertise across disciplines. You cannot do anything without disciplinary expertise, and it’s wishful thinking to claim that this work can be done without deep collaboration with the experts. So what should we do? We need new levels of engagement among industry, academia, government, and civil society. We also need open source models, open weight models, testbeds, and frameworks that support fairness, accountability, and innovation. Much of this was the subject of a report that we prepared for Governor Newsom on how we can mitigate some of the harms of GenAI while not forgetting about the potential upside of this technology. Thank you.
Pamela Samuelson
Pamela Samuelson is the Richard M. Sherman Distinguished Professor of Law and Information at the University of California, Berkeley; Professor at UC Berkeley’s School of Information; and Codirector of the Berkeley Center for Law & Technology. She was elected to the American Academy of Arts and Sciences in 2013.
I’m going to talk a little bit about the lawsuits challenging GenAI systems. Generative AI has been adopted faster than any other computer- or internet-related technology. As of late 2024, nearly 40 percent of the U.S. adult population was using generative AI for work or for personal reasons, a faster rate of uptake than PCs or the internet saw. GenAI has become part of the infrastructure of our lives.
There are billions and billions of works on the internet that are being used as training data, as well as other forms of data that are being used to train these models. But many people who voluntarily put information up on the internet didn’t expect that somebody was going to come along and scrape all this information and then use it for the purpose of training models for these generative AI systems. Many of the lawsuits are actually what are called class-action lawsuits, in which a small number of individual authors or copyright owners say, “I’m going to bring this on behalf of all the people who are similarly situated because we were all injured by the defendants when they used our works as training data.”
The lawsuits include some other claims, but the training data claims in these cases are what I call the big kahuna. There are fifty-nine lawsuits so far in the United States. As Jennifer mentioned, GenAI systems are not only using text. They include images, recorded music, and video. All manner of things can be used as long as they are in digital form. And because the developers are using copyrighted works when training the models in GenAI systems, there are a lot of lawsuits. Getty Images has a lawsuit about its stock photography, and there are new lawsuits about news articles, movie characters, music lyrics, and recorded music.
What’s the motivation behind these lawsuits? If you’re a lawyer for a class-action lawsuit, then you see the potential for big money awards, as much as one-quarter of the take. Some of you may have heard about the recent $1.5 billion settlement in one of the Anthropic cases. The lawyers who were representing the plaintiffs in that case could get $375 million. That’s a pretty big take. But it’s also the case that establishing a new precedent for a cutting-edge field is enormously exciting and will attract more clients.
Many book authors, visual artists, songwriters, and even programmers object to the use of their works as training data without their permission, and without receiving any compensation or credit. Their stance is that the only reason why the GenAI systems are able to generate such high-quality outputs is because of the quality of the inputs that they used to build their models, and those inputs are the works created by authors, artists, and programmers. GenAI products are competing in the marketplace with human-authored works, and the human authors are losing income. The GenAI developers are huge corporations profiting from their use of these human-authored works.
Let’s talk about copyright for a minute. The way that copyright works is that copyright protection attaches to all works of authorship that have been fixed in a tangible medium, and that includes everything from the Reddit post that you made to the photograph that you took and uploaded to Facebook. Copyright is not just for novels and movies. It’s for everything that is an original work. And the rights vest in authors who can then sell or license those rights in whole or part to other people.
So what does copyright give you? It gives you the right to control reproductions of the work, the distribution of copies, and the creation of derivative works, and those rights last for a long time: for the life of the author plus seventy years. Those rights are limited by a concept called fair use, which GenAI developers are relying on quite heavily.
Before I talk about fair use, let me give you a sense of the high-level principles of copyright because they have a bearing on how we should think about resolving these lawsuits. The constitutional purpose of copyright is to promote the progress of science—that is, knowledge—and the useful arts for the public good. Granting these exclusive rights to authors is an incentive for them to create and disseminate their works of authorship. But fair use is a limit on those exclusive rights because it provides breathing room for next-generation creations. Fair use has also become an important way for copyright to adapt in a time of rapid technological change. The Supreme Court recently said that fair use is a “context-based check that can help keep a copyright monopoly within its lawful bounds.”1
So what is fair use? Every time you forward somebody’s email or share something on a social media site, it’s fair use. Fair use is basically a defense to charges of infringement, and courts consider four factors when deciding if a use is fair: 1) What was the purpose of the challenged use? Was it for criticism, comment, news reporting, teaching, or scholarship? Was it for a commercial or noncommercial purpose? 2) What is the nature of the copyrighted work? 3) What is the amount and the substantiality of the taking? 4) What kind of effect does the challenged use have on the market for or value of the work?
Let me give you an example of fair use before I talk about the GenAI cases. Google digitized millions of books from research library collections, and its purpose was to index the contents and then allow users to get snippets in response to their search queries. The Court of Appeals said that this use was highly transformative because Google’s purpose in indexing the contents and providing snippets is a very different purpose from the authors’. And snippets provide public access to information, and that’s a good thing. The nature of the work didn’t cut one way or the other in this particular case because Google used all kinds of books. The court said that if you want to index the contents of books, then you have to copy the entire content. And so it was reasonable in light of the purpose. In terms of the market effects, according to the Court of Appeals, the authors do not have the rights to license the indexing of the contents of their works. And the snippets are too random and scattered to undermine the incentives for authors to write their books. Weighing all these factors together, the court concluded that Google made fair use of the books.
Of course, there can be different views about fair use. In the GenAI cases, from the authors’ standpoint, the developers used their works for a commercial purpose. They copied the entire contents many times and used pirated books to train the models. This shows bad faith. The developers used the authors’ creative expression to produce outputs that compete with the authors’ works, causing them to lose sales and licensing revenues.
The developers, for their part, say that the purpose is highly transformative because the works are being used for such different purposes. The developers are interested in the works only as data, not for their expression. They are interested in the words in relation to each other and how the sentences are constructed. They are not using the works to consume the expression that you would get if you were listening to or reading something. They claim that the amount being used is reasonable in light of the developers’ purposes. The outputs don’t infringe or supplant demand for the original works. And, finally, licensing markets at this scale are impossible.
There have been three decisions on these fair use questions so far. One is Thomson Reuters v. Ross Intelligence. Ross was making copies of Westlaw headnotes to train an AI model. The judge said the use was commercial and competes with Westlaw. Ross is appealing. A second decision is Kadrey v. Meta. This case was about whether it was fair use to use copyrighted books to train an AI model. The judge basically said, yes, it is fair use to use copyrighted books to train models. The results might have been different, however, if the authors had been able to show that AI books will flood the markets and that, in turn, will harm the authors. And the third case is Bartz v. Anthropic. The judge in this case agreed with the judge in the Kadrey case that it’s fair use to use books to train models, but when Anthropic’s engineers downloaded 7 million pirated books, that was no longer fair use. There’s a $1.5 billion settlement pending.
If we compare Bartz and Kadrey, we see that the judges agree on most of the issues: The developers’ use of the copyrighted works had a highly transformative purpose, and both judges rejected the claim that authors are entitled to control licensing markets for the training data. But the judges disagreed on some issues. One is the significance of using pirated data. The judge in the Kadrey case said that using the pirated books does not undermine a fair use defense. The judge in the Bartz case said that using the pirated books was definitely not fair use. The other issue on which the judges disagreed concerns market dilution. The judge in the Bartz case, the one involving Anthropic, thought that the market dilution theory (that AI products are going to flood the market and nobody is ever going to buy human-authored stuff again) was science fiction. But the judge in the Kadrey case said that this is a big deal: The GenAI outputs are indirect substitutes for the authors’ works, so it doesn’t matter that the outputs are not substantially similar. So we’ll see what happens.
Now, if you are a nonprofit researcher, you may be wondering how all of this will affect you. For nonprofit research, especially for scientific and scholarly purposes, using copyrighted works to train GenAI systems is a favored use, and such uses are less likely to harm the markets for those works. But we don’t know how the judges will come out on these issues. They could issue a broad ruling against fair use or a broad ruling in favor of fair use, and that is going to have spillover effects for entities other than the big tech companies. I have encouraged some nonprofit researchers to think about filing a brief to tell the judges that nonprofit research should be protected.
But what if you are using these pirated works? Are you an infringer? In the 1980s when the Sony Betamax case was before the U.S. Supreme Court, five million American households owned video tape recording machines. From Universal’s standpoint, every person who used a video tape recording machine to make unauthorized copies of television programs was an infringer. Universal alleged that Sony was contributorily liable for infringement because it provided the tool that materially contributed to the infringement, and that Sony knew people were going to make copies without permission. As it turned out, the Supreme Court said that private noncommercial copies were fair use, and so Sony was off the hook.
Now if generative AI developers are infringers, does that mean that their users are also infringers when they use those GenAI technologies to generate outputs? There’s a great deal at stake here. If you take your favorite Disney cartoon character and put them in a different setting, are you an infringer when you do that? Well, Disney thinks so. One of the remedies that copyright law gives owners is the destruction of infringing articles, and some of the complaints have called for the destruction of models that have been trained on infringing material.
We really don’t know what’s going to happen. It will likely be five or maybe even ten years before the lawsuits in these fair use cases are resolved. That’s a long time to wait to learn whether courts perceive GenAI technologies as beneficial to society or as predatory to copyright owners. Other countries, such as Japan, Singapore, and Israel, have laws that provide broad exemptions, giving AI developers a green light. So there is a risk that AI development in the United States could move offshore if the courts reject the fair use defenses. Thank you.
Abhishek Nagaraj
Abhishek Nagaraj is a Research Associate at the National Bureau of Economic Research, an Associate Professor in the Haas School of Business at the University of California, Berkeley, and Director of the Data Innovation and AI Lab.
What I hope I can add to this discussion are some of the core economic issues and questions that will reshape our understanding of intellectual property in the digital age. Two years ago, I served on a panel of ten economists examining copyright issues related to AI. We realized very quickly that the economics of copyright in the age of generative AI centers on outputs and inputs. Should the outputs of AI systems be copyrightable? Think about a young developer or a young artist who uses an AI system to produce an original song or a movie. Should the developer or the artist own the copyright to that work? Is it creative or original? And on the input side, which has been the focus of a lot of our discussion today, is the training of these models on copyrighted content without explicit licenses legal or fair use?
What we learned is that the output question, although quite important, has a lot of precedent. Many of the economic questions here are not specific to generative AI and there are many other examples of people using technology to create new content. Every time we take a picture on an iPhone, the iPhone has its own algorithms that decide precisely how that picture should look. So even though you think the photo looks wonderful because of your input, the algorithms involved also can take some credit for that creative output. Frameworks exist to adjudicate sufficient human involvement, which is critical for this question. Copyright law has traditionally protected human-produced content. Copyright protection for AI-generated content has received less attention, and will require examination on a case-by-case basis.
What is particularly interesting from an economic point of view are the questions on the input side: whether the use of copyrighted material in training large language models constitutes fair use. As Pam mentioned, there are fifty-nine active lawsuits, so it is clear that generative AI and copyright is a topic of intense interest.
Before discussing the issues specific to fair use and copyright, let me first outline existing work on the economics of AI. One of the questions that economists are well qualified to answer relates to measuring the productivity impact of AI in real-world industries and settings. For example, there are some good case studies that show that when AI is used in call centers, writing, or coding, individual productivity can increase substantially. We also know that AI is designed to be a general purpose technology. The basic transformer architecture that Jen described can be modified with very little additional work to produce images or answer call center questions. The core technology is very flexible, and the productivity impacts can be extremely broad.
When AI is applied in specific domains, its effects can vary widely. The impact, in particular, depends on the skill level of the human using the system, so outcomes can differ widely by context. In more routine work, such as call centers, AI can help new employees get up to speed much faster, sometimes matching or exceeding the performance of workers with ten years of experience. We see that the gap between the best and the rest can decrease dramatically when somebody is using AI. You might think that’s a good thing. But in scientific research, or in domains that demand much more creativity, what we see is that people who know how to use AI can pull far ahead of others who don’t know how to use the technology. In other words, the impact of AI by skill level depends on the nature of the task itself. In some cases, it can widen the gap between the experts and the novices, and in others it can close it.
One thing that’s clear from all these studies is that this technology will have a transformational impact on the economy. The question that economists and computer scientists are interested in is: to what extent will AI contribute to growth? One of the striking facts of economic growth is that whatever historical technology you consider, whether it’s electricity, computing, or something else, U.S. growth for the past 150 years has been surprisingly stable at 2 percent per year, and none of these technologies has really changed that trajectory. Some economists will say that AI is transformative, but they contend that it’s a “normal technology” in the sense that it’ll help us continue to grow along that trajectory. Computer scientists and AI proponents, on the other hand, argue that the old models don’t really work for assessing economic impact, and that with AI we’re going to see 10 percent productivity growth. The answers to these debates will become clearer in the next decade, but at this point it’s a debate about the magnitude of economic impact rather than its possibility.
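One way to make the stakes concrete is to compound the two contested figures over a decade, reading both as annual growth rates:

```latex
% Ten years of compounding at the two contested rates:
\[
  1.02^{10} \approx 1.22 \qquad \text{(the 2\% path: the economy grows about 22\%)}
\]
\[
  1.10^{10} \approx 2.59 \qquad \text{(the 10\% path: the economy grows about 159\%)}
\]
```

A single decade is enough for the two paths to diverge by more than a factor of two, which is why the measurement question matters so much.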
What’s missing in these debates are the economics of legal questions around fair use and fair compensation. To what extent does an AI system, training on data that weren’t explicitly licensed, harm the market for that information? And when you break that question down, we find that there are two very different markets that we need to consider. The first is the market for the training data. The argument isn’t really that if I produce an image of Mickey Mouse in a new context, that will reduce demand for the original movie. Rather, the concern is that Disney has an established and well-developed market for licensing its works. To the extent that such a market exists, or could exist, the claim is that AI companies harm it by using that content without paying for it.
The other market is what I call the “home market”: the market for the original creative works. The argument goes that if people use an AI system to understand the plot of a J.K. Rowling book or movie, they may be less motivated to read or watch the original, which directly harms the underlying work. The type of media as well as the type of technology may influence these debates about the derivative works. And that is why I think we’re going to see some variation across these cases, not just in terms of how the judges interpret the law, but around the economics of how these questions play out in a particular context and at a particular point in time.
In the market for training data, one of the real challenges is that transaction costs matter. The question is, to what extent is training data a commodity for these models, and to what extent can we establish a price for it and manage the transaction costs?
We know that there are millions of dispersed copyright holders, and there are no established mechanisms to negotiate with all of the parties. What the AI companies will tell us is that even if they wanted to pay the licensing fees for these works, they don’t know how to figure out who the copyright owners are, and so determining fair compensation is tricky.
The other challenge concerns pricing. When you buy a book in the market, there’s a specific price for that book. There’s no reason to think that that’s the price an AI model would be willing to pay, or that it’s even the value of that book to a particular model. The models don’t care how well a book “reads” to a human in the traditional sense. The real consideration for these models is how many high-quality tokens the work contains. What counts as quality in this context is completely different.
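A back-of-the-envelope sketch makes that disconnect visible. Every number below is an assumption chosen for illustration, not a figure from the panel or from any actual licensing deal.

```python
# Back-of-the-envelope sketch of why a book's retail price and its value
# as training data are different quantities. All numbers are illustrative
# assumptions, not figures from the panel or any real license.
WORDS_PER_BOOK = 90_000        # typical trade book length (assumption)
TOKENS_PER_WORD = 1.3          # rough English tokenization ratio (assumption)
RETAIL_PRICE = 18.00           # what a reader pays for a copy (assumption)
LICENSE_PER_M_TOKENS = 10.00   # hypothetical per-million-token data rate

tokens = WORDS_PER_BOOK * TOKENS_PER_WORD             # ~117,000 tokens
data_value = tokens / 1_000_000 * LICENSE_PER_M_TOKENS

print(f"tokens per book:   {tokens:,.0f}")
print(f"retail price:      ${RETAIL_PRICE:.2f}")
print(f"token-based value: ${data_value:.2f}")        # ~$1.17 under these assumptions
```

Under these assumptions, a book priced purely as tokens is worth pennies on the dollar relative to its cover price, which is one reason there is no consensus on valuation.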
Beyond these questions, one must also consider the heterogeneous incentives of the creators and developers. I’ve talked with many academic authors who are creating content so that the AI systems can use that content for free. That’s a very different incentive model from the one in which authors are looking to monetize their work, and their livelihood depends on it. Overall, these are thorny questions and there seems to be no consensus on valuation methodologies. So it’s difficult to figure out the transaction costs.
On the other side is the “home market”: what about the market impact of the end products, with the outputs competing with the inputs? There are two versions of this argument. One is the strict cannibalization version, in which the output acts as a substitute or a copy of the original work. To what extent are users using these models to get around paying for an underlying copy? To what extent do AI outputs reduce the demand for original works? In the Google Books case, some of my work has shown that Google’s distribution of these works actually increased demand for the physical books rather than substituting for them. The model might actually make us more interested in the underlying work.
The second version is the “dilution” argument. Here the contention is that original products created with generative AI will compete with the content in the training data even if they do not mimic or copy it closely. This is a relatively novel argument; the economic evidence for it is scant at the moment, and the legal and economic frameworks have yet to be worked out as well.
The fair use rulings could fundamentally reshape the nature of competition for commercial and noncommercial models. One of the things that economics research has shown is that having a diversity of voices not only gets us more open source and cheaper models, but actually gets us to the next generation of a technology much sooner. Insofar as these fair use rulings are broad, they will affect not only how many providers or how many open source models we get, but the types of AI that we may have in the future. For example, in cancer treatment, intellectual property laws affect the ability to remix or combine different kinds of drugs in ways that can have a meaningful effect on a person’s lifespan. Intellectual property law determines to what extent follow-on companies can make specific combinations and remixes of the drugs. We can imagine some of those questions coming to fruition in the next generation of AI technology. There will also be geopolitical ramifications insofar as geographic variations in copyright law may create competitive advantages and disadvantages in different jurisdictions.
Thank you for your attention. While there are many questions in this area, I hope that, in the near future, some of the answers will become clear both from an economic as well as legal perspective.
Discussion
PAMELA SAMUELSON: We have some questions that our audience has submitted. The first question is for Abhishek. In cases like Anthropic, a $1.5 billion settlement might seem like a lot. But if the data are really good, then it’s a high return on investment for Anthropic to settle if it drives model improvement. What’s your take on that?
ABHISHEK NAGARAJ: It’s very hard to comment on the specifics of that case because there’s a lot that went on behind the scenes that we don’t know much about. For instance, what were the different strategic considerations? Having said that, one of the things that I learned from you, Pam, is that copyright law is actually quite indifferent to how much an infringing use benefited the infringer. The value of a potential settlement is not how much these data helped my model, but what my liabilities are and how much I have to lose.
SAMUELSON: One of my hobby horses for much of my career has been statutory damages in copyright law. Statutory damages are available for works that have been registered with the U.S. Copyright Office before infringement occurs, and those works include books, movies, sound recordings, and so on. The statutory damages start at $750 per infringed work and can go up to $150,000 per infringed work as the court deems just. When you multiply $750 by a billion works, it’s a big number. The incentive to settle these cases when you’re facing liability in the billions or trillions of dollars is pretty high.
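To see why the exposure runs to billions or trillions, take the seven million pirated books at issue in Bartz and assume, purely for illustration, that every one of them qualified for statutory damages:

```latex
% Statutory-damage exposure across 7 million infringed works:
\[
  7{,}000{,}000 \times \$750 = \$5.25\ \text{billion (statutory floor)}
\]
\[
  7{,}000{,}000 \times \$150{,}000 = \$1.05\ \text{trillion (statutory ceiling)}
\]
```

Even at the statutory floor, and before any finding of willfulness, the numbers dwarf a $1.5 billion settlement.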
One way to look at the Anthropic settlement is that it builds a moat, because all of the big developers have used some pirated books, even if they haven’t used seven million of them. What, then, is the biggest weakness in the tech companies’ arguments? If you talk about pirated data, some judges will say that it’s bad to use pirated data. But from the standpoint of the engineers, they found the data on the internet and they’re using it only for statistical analysis, and that purpose, they say, does not affect the market. Other judges will say that using seven million books is just too many. The judge in the Kadrey case basically said that the amount doesn’t move the needle one way or the other. In the Google Books case, Google digitized the books for the purpose of making the index of the contents and the snippets available to the public.
NAGARAJ: I have a six-year-old child, and when she reads a book, what she is learning from the book is way beyond the literal content of that book. She’s improving her vocabulary, she’s learning about how sentence structure works, she’s learning about other cultures. When a model reads a book, does it also have these multiple levels of learning? It seems to learn language, structure, and, of course, the exact content of the book. But the value of the output may be different, depending on the model. So I think it’s important to understand the context in which the data affect the model’s performance.
JENNIFER CHAYES: Our next questions are about China. Many people claim that China will win the AI race. Will the differences between copyright law in the United States and the laws in China help determine the winner? Does U.S. law negatively impact U.S. competitiveness in AI?
China has less severe IP and copyright restrictions than the United States. The big tech companies here are investing a fortune in these models and paying their people very high salaries while keeping the models secret. These companies are expecting to make huge profits from the models.
China has a very different attitude. It is building open weight models so people can go in and modify them. In the United States, we have open source software. Berkeley is the central player in open source software, and some of the most valuable companies that have spun out of Berkeley have released their open source software to the world. Then they put a proprietary layer on top of it, and that’s how they make money. There’s a real question about which is the right way to go: a more open ecosystem like China’s, or the one we have here, with silos of different companies and a few very highly paid people contributing to them. These proprietary silos do not lead to more innovation. Given the copyright and IP elements as well as China’s decision to build open weight models, I think China has a bigger chance to be the strongest contributor to generative AI.
SAMUELSON: Our next question concerns using AI systems as assistants or collaborators. Will these AI assistants or collaborators lessen the human skills needed to produce high-quality original works? Will the human creators earn less money? What are the economic consequences?
NAGARAJ: I’ve certainly heard arguments about the effects of using AI to train the next generation, whether in writing or music. One of our students is at UCSF, studying how the use of AI affects the training of the next generation of radiologists. The argument that I’ve heard some people make is that the use of these systems might harm the potential of the next generation to create new content. Having said that, there’s a debate within economics on augmentation versus automation. Will the AI systems make all the boring work go away? A good counterexample: a lot of animation was computerized in the late 1980s and 1990s, which meant that someone didn’t have to draw the same image by hand twenty times. They could move on to more creative tasks. I don’t think any of us would say that the field of animation has become less creative because not as many people are learning to draw by hand. What is not so clear is whether the use of AI in the creative arts will harm the production of original and new creative works. And if that’s true, then obviously that affects the potential inputs for the training data of the future.
CHAYES: Let me add to Abhishek’s comments. We graduate about 2,200 students a year from our college of computing and data science, and we have the two largest majors on campus. At homecoming weekend, essentially every parent who comes up to me asks the same question: “Is my kid going to be okay?” It’s a reasonable question. The genie is out of the bottle. We cannot prevent AI from being out there. Over the years, I’ve shifted my thinking. When I was a professor of mathematics in 1987, I wouldn’t let people use a calculator in a second-year calculus class, and I remember the slide rule wars. But now I believe it’s our responsibility to train our students to work in teams with AI collaborators. I think that’s the world in which they will work and in which they will have their careers. We need to train our students in a way that prepares them for the very rapidly evolving world that they’re entering.
SAMUELSON: Our next questions focus on the legal issues. When developers are training models with very large datasets, how does a copyright owner know, learn, or identify that their works were used to train a model?
This is a question that policymakers all over the world are trying to figure out. In the European Union and in the California legislature, there are bills and laws about disclosure of training datasets. Many of the large tech companies consider their datasets to be proprietary. One of the things that we need to understand is that when these companies do a big scrape, license some data, and then put it all into a big database, there is a stage in the training process in which the data are curated. That means removing duplicates, removing child sexual abuse material, and removing hate speech. This process gets rid of things that will not be good contributors to the model. And so there will be resistance from these big developers to providing work-by-work information. But if you put your work on the internet, it was probably used as training data.
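As a small, hypothetical illustration of this curation stage, exact-duplicate removal can be as simple as hashing normalized text; real pipelines use far more sophisticated near-duplicate detection and content filtering.

```python
# Illustrative only: the simplest form of the curation step described above,
# dropping exact duplicate documents by hashing their normalized text.
import hashlib

def dedupe(docs):
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()   # collapse whitespace, case
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(dedupe(["Hello  world", "hello world", "Something else"]))
# ['Hello  world', 'Something else']
```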
Another interesting thing to note is that these companies need data. But the authors of these books, who are claiming that they have rights to license these works as training data, don’t actually have the data themselves. The publishers have the data. And so the deals that you’ve seen talked about in the press are ones in which the data are from the publisher.
Our next question concerns Harvard University Press. Some authors or partners are licensing an AI company to use the press’s material for a one-year trial as long as the citations to the material are guaranteed. Is that a good thing?
I don’t think we are there yet. Jennifer mentioned reinforcement learning, in which you ask an AI chatbot, “What about this?” and what you get back is actually a hallucination. So you retrain the model: “What you gave me just now was not the answer that I’m looking for.” There are ways to fine-tune some of these models to make them more accurate. However, the idea that hallucinations are going to go away any time soon is itself a hallucination.
NAGARAJ: I think there’s a sense that some people have that if we get direct citations, then we have some direct measurement. The problem is that there’s no one-to-one association between the training data and the outputs. When my six-year-old grows up and does really well on a particular exam, how much of that credit should be attributed to what she read when she was six? It’s a ludicrous question to ask. In some ways, we’re asking that question when we ask for citations for AI-generated material. There are millions of different works, and the value of each is in combination with the others. A work may be really valuable on its own, but in combination with others, perhaps it’s not that valuable. Citations are helpful, but I think they are really a band-aid on the deeper philosophical issues that we need to talk about.
CHAYES: From a technical point of view, I agree with that. But if a model takes in the works in different orders, that is going to have a very different effect on quality.
SAMUELSON: Our next question asks if we can dream of the many ways in which AI can revolutionize the social good. Economic incentives allowed us to build products. What can we do to incentivize AI for social good?
CHAYES: There are grants for AI to tackle social good challenges. And that might lead NGOs rather than for-profit companies to do work in this area. I hope that the philanthropic community and the federal government will stand up and support this work, because the incentives are already there.
SAMUELSON: Universities and university researchers are working on things that advance the social good. I think the example that Jennifer gave of the Nobel Prize winner who is using AI to be more productive and faster shows that this is happening. It’s a really good idea to give more attention to the use of AI for social good. In the health care area, there’s an enormous amount of work going on. Jennifer mentioned the summit with the governor’s office, which included AI experts, on the ways in which California could become more productive in managing traffic, water resources, environmental pollution, and the like. We are having these conversations in California with a governor who’s open to the idea of making social good AI a meaningful thing.
That is all the time we have. I want to thank Jennifer and Abhishek for their presentations and thoughtful comments, and I thank everyone who joined us today.
© 2026 by Jennifer Chayes, Pamela Samuelson, and Abhishek Nagaraj, respectively
To view or listen to the presentation, visit the Academy’s website.