Jun 21, 2024 1:22 PM

Perplexity Plagiarized Our Story About How Perplexity Is a Bullshit Machine

Experts aren’t unanimous about whether the AI-powered search startup’s practices could expose it to legal claims ranging from infringement to defamation—but some say plaintiffs would have strong cases.

Earlier this week, WIRED published a story about the AI-powered search startup Perplexity, which Forbes has accused of plagiarism. In it, my colleague Dhruv Mehrotra and I reported that the company was surreptitiously scraping, using crawlers to visit and download parts of websites from which developers had tried to block it, in violation of its own publicly stated policy of honoring the Robots Exclusion Protocol.
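
For readers unfamiliar with the Robots Exclusion Protocol, here is a minimal Python sketch of how a compliant crawler might consult a site's robots.txt rules before fetching a page. The bot name `ExampleBot`, the rules, and the URL are hypothetical illustrations, not anything drawn from Perplexity's systems or WIRED's testing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that honors the protocol would skip the page when this is False.
blocked = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/story")
permitted = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/story")
```

The protocol is purely advisory: nothing in the file stops a crawler that chooses not to run a check like this one.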

Our findings, as well as those of the developer Robb Knight, identified a specific IP address almost certainly linked to Perplexity and not listed in its public IP range, which we observed scraping test sites in apparent response to prompts given to the company’s public-facing chatbot. According to server logs, that same IP visited properties belonging to Condé Nast, the media company that owns WIRED, at least 822 times in the past three months—likely a significant undercount, because the company retains only a small portion of its records.
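
A count like this one can be produced by tallying requests per client address in a web server's access log. The sketch below assumes a log format in which the client IP is the first whitespace-separated field; the sample lines and addresses are invented (taken from ranges reserved for documentation) and are not WIRED's or Condé Nast's data.

```python
from collections import Counter

# Invented sample entries; real logs vary in format by server and configuration.
SAMPLE_LOG = [
    '203.0.113.7 - - [18/Jun/2024:10:00:00 +0000] "GET /story HTTP/1.1" 200 512',
    '198.51.100.4 - - [18/Jun/2024:10:00:05 +0000] "GET /about HTTP/1.1" 200 128',
    '203.0.113.7 - - [18/Jun/2024:10:01:00 +0000] "GET /other HTTP/1.1" 404 0',
]


def count_hits_by_ip(log_lines):
    """Tally requests per client IP, taking the IP as the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


hits = count_hits_by_ip(SAMPLE_LOG)
```

Any such tally covers only the log lines that were retained, which is why a count from partial records is a floor rather than a total.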

We also reported that the chatbot was bullshitting, in the technical sense. In one experiment, it generated text about a girl following a trail of mushrooms when asked to summarize the content of a website that its agent did not, according to server logs, attempt to access.

Perplexity and its CEO, Aravind Srinivas, did not substantively dispute the specifics of WIRED’s reporting. “The questions from WIRED reflect a deep and fundamental misunderstanding of how Perplexity and the Internet work,” Srinivas said in a statement. Backed by Jeff Bezos’ family office and by Nvidia, among others, Perplexity has said it is worth a billion dollars based on its most recent fundraising round, and The Information reported last month that it was in talks for a new round that would value it at $3 billion. (Bezos did not reply to an email; Nvidia declined to comment.)

After we published the story, I prompted three leading chatbots to tell me about the story. OpenAI’s ChatGPT and Anthropic’s Claude generated text offering hypotheses about the story’s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them. (According to WIRED's server logs, the same bot observed in our and Knight’s findings, which is almost certainly linked to Perplexity but is not in its publicly listed IP range, attempted to access the article the day it was published, but was met with a 404 response. The company doesn't retain all its traffic logs, so this is not necessarily a complete picture of the bot's activity, or that of other Perplexity agents.) The original story is linked at the top of the generated text, and a small gray circle links out to the original following each of the last five paragraphs. The last third of the fifth paragraph exactly reproduces a sentence from the original: “Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods.”

This struck me and my colleagues as plagiarism. It certainly appears to satisfy the criteria set out by the Poynter Institute, including, perhaps most stringently, the seven-to-10 word test, which proposes that it’s “hard to incidentally replicate seven consecutive words that appear in another author’s work.” (Kelly McBride, a Poynter SVP who has described this test as being useful in identifying plagiarism, did not reply to an email.)
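
The seven-consecutive-word idea can be approximated mechanically. The following Python sketch is one simple interpretation, not Poynter's own procedure: it lowercases both texts, strips punctuation, and reports any run of seven words that appears in both. The function names and the tokenizing rule are assumptions made for illustration.

```python
import re


def word_ngrams(text: str, n: int = 7) -> set:
    """Return the set of n-word runs in text, ignoring case and punctuation."""
    normalized = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    words = re.findall(r"[a-z0-9']+", normalized)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_runs(original: str, candidate: str, n: int = 7) -> set:
    """Return every n-word run that appears in both texts."""
    return word_ngrams(original, n) & word_ngrams(candidate, n)


ORIGINAL = (
    "Instead, it invented a story about a young girl named Amelia who follows "
    "a trail of glowing mushrooms in a magical forest called Whisper Woods."
)
# Hypothetical summary that reuses the sentence verbatim after a short preface.
COPIED = "The report says that " + ORIGINAL[0].lower() + ORIGINAL[1:]
UNRELATED = "The chatbot made up a fairy tale rather than describing the test page."

overlap = shared_runs(ORIGINAL, COPIED)
no_overlap = shared_runs(ORIGINAL, UNRELATED)
```

A match flags text worth a closer look; it does not settle whether the reuse was quoted, credited, or coincidental.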

“If one of my students turned in a story like this, I would take them before the academic dishonesty committee for plagiarism,” said John Schwartz, professor of practice at the University of Texas at Austin’s journalism school, after reading the original story and the summary. “I find this just too close. When I was reading the Perplexity version, I just thought, there’s an echo in here.”

Perplexity and Srinivas, the company’s CEO, did not respond to a detailed request for comment in which they were presented with the criticisms experts made of the company for this story.

Bill Grueskin, professor of professional practice at Columbia Journalism School, wrote in an email that the summary looked to be “pretty much ok” for a chatbot identified as such, but that it was hard to say because he hadn’t had time to read the original WIRED story. “Quoting a sentence verbatim without quote marks is bad, of course,” he wrote. “I'd be pretty mortified if a news org ran an AI summary like this without disclosing the source—or worse, pretending it came from a human.” (Perplexity, of course, isn’t claiming this material came from a human.)

Perhaps luckily for Perplexity and its backers, this is a literal academic debate. Plagiarism is a concept pertaining to professional ethics. It matters in contexts like journalism and academia, where being able to identify the source of information is of fundamental importance, but it has no legal significance in itself. If a rival studio released a film containing a reasonable chunk of footage from Inside Out 2, Disney would sue not for plagiarism but for copyright infringement; similarly, a letter Forbes reportedly sent Perplexity threatening legal action is said to mention “willful infringement” of Forbes’ copyrights. Here, legal experts say, Perplexity is on somewhat safer ground, probably.

“In terms of the copyright, this is a tough call,” says James Grimmelmann, professor of digital and information law at Cornell University. On one hand, he argues, the summary is reporting facts, which cannot be copyrighted; but on the other, it does partially duplicate the original and summarize the details found in it. “It’s not a slam dunk copyright case, but it’s not trivial, either. It’s not frivolous.”

Grimmelmann sees a host of potential issues for Perplexity, among them consumer protection, unfair advertising, or deceptive trade practices claims he believes could be made against a company that says it respects the Robots Exclusion Protocol but doesn’t follow it. (The standard is voluntary but widely adhered to.) He also thinks it could be vulnerable to a claim of misappropriation of hot news, in which a publisher argues that a competitor summarizing its material before it’s had a chance to commercially benefit from it, or in a way that undermines its value to paying subscribers, is infringing on its copyright. Perplexity’s evident ability to circumvent paywalls “is a bad fact for them,” he says, as is the fact that its system is automated.

Grimmelmann also says that Perplexity may be forfeiting the protection of Section 230 of the Communications Decency Act. This is the law that, among other things, protects search engines like Google from liability for defamation when they link to defamatory content because they are services passing on information from other content providers; as he sees it, Perplexity is similarly shielded as long as it accurately summarizes material. (Whether AI-generated material enjoys 230 protection at all is a matter of debate.)

“They’d only get in trouble if they summarized the story incorrectly and made it defamatory when it wasn’t before. That’s something that they actually would be at legal risk for, especially if they don’t credit the original source clearly enough and people can’t easily go to that source to check,” he says. “If Perplexity’s edits are what make the story defamatory, 230 doesn’t cover that, under a bunch of case law interpreting it.”

In one case WIRED observed, Perplexity’s chatbot did falsely claim, albeit while prominently linking to the original source, that WIRED had reported that a specific police officer in California had committed a crime. (“We have been very upfront that answers will not be accurate 100% of the time and may hallucinate,” Srinivas said in response to questions for the story we ran earlier this week, “but a core aspect of our mission is to continue improving on accuracy and the user experience.”)

“If you want to be formal,” says Grimmelmann, “I think this is a set of claims that would get past a motion to dismiss on a bunch of theories. Not saying it will win in the end, but if the facts bear out what Forbes and WIRED, the police officer—a bunch of possible plaintiffs—allege, they are the kinds of things that, if proven and other facts were bad for Perplexity, could lead to liability.”

Not all experts agree with Grimmelmann. Pam Samuelson, professor of law and information at UC Berkeley, writes in an email that copyright infringement is “about use of another’s expression in a way that undercuts the author’s ability to get appropriate remuneration for the value of the unauthorized use. One sentence verbatim is probably not infringement.”

Bhamati Viswanathan, a faculty fellow at New England Law, says she’s skeptical the summary passes a threshold of substantial similarity usually necessary for a successful infringement claim, though she doesn’t think that’s the end of the matter. “It certainly should not pass the sniff test,” she wrote in an email. “I would argue that it should be enough to get your case past the motion to dismiss threshold—particularly given all the signs you had of actual stuff being copied.”

In all, though, she argues that focusing on the narrow technical merits of such claims may not be the right way to think about things, as tech companies can adjust their practices to honor the letter of dated copyright laws while still grossly violating their purpose. She believes an entirely new legal framework may be necessary to correct for market distortions and promote the underlying aims of US intellectual property law, among them to allow people to financially benefit from original creative work like journalism so that they’ll be incentivized to produce it—with, in theory, benefits to society.

“There are, in my opinion, strong arguments to support the intuition that generative AI is predicated upon large scale copyright infringement,” she writes. “The opening ante question is, where do we go from there? And the greater question in the long run is, how do we ensure that creators and creative economies survive? Ironically, AI is teaching us that creativity is more valuable and in demand than ever. But even as we recognize this, we see the potential for undermining, and ultimately eviscerating, the ecosystems that enable creators to make a living from their work. That’s the conundrum we need to solve—not eventually, but now.”
