ChatGPT Stole Your Work. So What Are You Going to Do?
Media companies, whose work is quite important to large language models (LLMs), may also want to consider some of these ideas to restrict generative AI systems from accessing their own content, as these systems are currently getting their crown jewels for free (including, likely, this very op-ed). For instance, Ezra Klein mentioned in a recent podcast that ChatGPT is great at imitating him, probably because it downloaded a whole lot of his articles without asking him or his employer.
Critically, time is also on the side of data creators: As new events occur in the world, art goes out of style, facts change, and new restaurants open, new data flows are necessary to support up-to-date systems. Without these flows, these systems will likely fail for many key applications. By refusing to make new data available without compensation, data creators could also put pressure on companies to pay for access to it.
On the regulatory side, lawmakers need to act, and quickly, to protect against what might be the largest theft of labor in history. One of the best ways to do this is to clarify that “fair use” under copyright law does not allow for training a model on content without the content owner’s consent, at least for commercial purposes. Lawmakers around the world should also work on “anti-data-laundering” laws that make it clear that models trained on data without consent have to be retrained within a reasonable amount of time without the offending content. Much of this can build on existing frameworks in places like Europe and California, as well as the regulatory work being done to make sure news organizations get a share of the revenue they generate for social media platforms. There is also growing momentum for “data dividend” laws, which would redistribute the wealth generated by intelligent technologies. These can also help, assuming they avoid some key pitfalls.
In addition, policymakers could help individual creators and data contributors come together to make demands. Specifically, supporting initiatives such as data cooperatives—organizations that make it easy for data contributors to coordinate and pool their power—could facilitate large-scale data strikes among creators and bring AI-using firms to the negotiating table.
The courts also present ways for people to take back control of their content. While the courts work on clarifying interpretations of copyright law, there are many other options. LinkedIn has successfully used its Terms of Use and contract law to stop people who scrape its website from continuing to do so. Labor law may also provide an angle to empower data contributors. Historically, companies’ reliance on “volunteers” to operate their businesses has raised important questions about whether these companies violated the Fair Labor Standards Act, and these fights could serve as a blueprint. In the past, some volunteers have even reached legal settlements with companies that benefited from their work.
There is also a critical role for the market here. If enough governments, institutions, and individuals demand “full-consent LLMs”—which pay creators for the content they use—companies will respond. This demand could be bolstered by successful lawsuits against organizations that use generative AI (in contrast to organizations that build the systems) without paying users. If applications built on top of AI models face lawsuits, there will be greater demand for AI systems that aren’t playing in the legal Wild West.
Our lab’s research (and that of colleagues) also suggests something that surprised us: Many of the above actions should actually help generative AI companies. Without healthy content ecosystems, the content that generative AI technologies rely on to learn about the world will disappear. If no one goes to Reddit because they get answers from ChatGPT, how will ChatGPT learn from Reddit content? That will create significant challenges for these companies, but those challenges can be headed off before they appear by supporting some of the above efforts.