The Great AI Backlash Has Claimed a New Victim—and You’ve Probably Never Heard of It
Although he was anointed the ur-tech bro of the week, Smith doesn’t have much VC slickness. He’s a walking Portlandia stereotype, with piercings and bird tattoos and stubble; he talks effusively about the art of storytelling, like he’s auditioning for the role of a superfan of The Moth. A self-described theater kid, Smith dabbled in playwriting before getting his first tech gig at a computational linguistics company.
The idea for Prosecraft, he says, came from his habit of counting the words in books he admired while he was working on a memoir about surviving the 2012 Costa Concordia shipwreck. (“Eat Pray Love is 110,000 words,” he says.) He thought other authors might find this type of analysis helpful, and he developed some algorithms using his computational linguistics training. He created a submissions process so writers could add their own work to his database; he hoped it would someday make up the bulk of his library. (All in all, around a hundred authors submitted to Prosecraft over the years.) It did not occur to Smith that Prosecraft would end up enraging many of the very people he wanted to impress.
Prosecraft did not rely on any large language models. It was not a generative AI product at all, but something much simpler. More than anything else, it resembled the kind of tool an especially devoted and slightly corny computational linguistics graduate student might whip up as an A+ final project. But it appears to share something crucial with most of the AI projects making headlines these days: It trained on a massive set of data scraped from the internet without regard to possible copyright infringement issues.
Smith saw this as a grimy means to a justifiable end. He doesn’t defend his behavior now—“I understand why everyone is upset”—but wants to explain how he defended it to himself at the time. “What I believed would happen in the long run is that, if I could show people this thing, that people would say, ‘Wow, that's so cool and it's never been done before. And it's so fun and useful and interesting.’ And then people would submit their manuscripts willfully and generously, and publishers would want to have their books on Prosecraft,” he says. “But there was no way to convey what this thing could be without building it first. So I went about getting the data the only way that I knew how—which was, it's all there on the internet.”
Smith didn’t buy the books he analyzed. He got most of them from book-pirating websites. It’s something he alluded to in the apology note he posted when he took Prosecraft down, and it’s something he’ll admit if you ask, although he seems bewildered by how mad people are about it. (“Would people be less angry with me if I bought a copy of each of these books?” Smith wonders out loud as we talk over Zoom. “Yes,” I say.) The practice of using shadow libraries to conduct scholarly work has been debated for years, with projects like Sci-Hub and Libgen disseminating academic papers and books to the applause of many researchers who believe, as the old adage goes, that information wants to be free.