Perplexity in Hot Water: AI Chatbot Ethics and the Challenge of Crawling Content

WIRED's investigation into Perplexity AI reveals ethical concerns over data scraping practices and potential plagiarism by its chatbot.

Ethical Concerns for Perplexity AI
Perplexity AI's methods of content acquisition and chatbot summarization are under scrutiny, prompting discussions on plagiarism and AI's impact on copyright law.


A recent investigation by WIRED has cast a spotlight on AI startup Perplexity, raising ethical concerns about its data collection practices and the potential for plagiarism by its AI-powered chatbot. The story has ignited a debate on copyright infringement in the age of AI and the need for new legal frameworks to govern the rapidly evolving landscape of information access.

Perplexity's "answer engine" functions by crawling and indexing vast swathes of the internet, allowing users to pose questions and receive narrative answers with citations and links to the source content. However, websites can utilize a Robots Exclusion Protocol (REP) to prevent unwanted crawling by bots. WIRED, along with an independent researcher, alleges that Perplexity has been disregarding these protocols, essentially scraping content from websites that explicitly request to be left out.

Perplexity CEO Aravind Srinivas denies any intentional violation of the REP. He says the company relies on a combination of its own crawlers and third-party services for content acquisition, but he declined to name the third-party provider, citing a non-disclosure agreement. That lack of transparency raises further questions about Perplexity's data collection practices and its accountability.

Srinivas contends that the REP, established in 1994, might not be entirely suitable for the modern AI environment. He suggests the need for a new set of guidelines that define a more collaborative relationship between content creators (publishers) and platforms like Perplexity.


Fabrication and Misinformation

Beyond the crawling controversy, WIRED also reported troubling instances in which Perplexity's chatbot generated summaries closely resembling WIRED articles even when it had not accessed the underlying content. In essence, the chatbot was "bullsh*tting," as the article put it, fabricating details and summaries from the prompt alone.

One particularly concerning example involved a WIRED article about Perplexity's own practices. When prompted with the article's content, the chatbot produced a detailed summary that closely mirrored the original piece, including one sentence lifted verbatim. This raises serious concerns about plagiarism and the potential for AI-powered tools to spread misinformation.

Experts disagree about whether Perplexity's chatbot summaries constitute plagiarism. The close mimicry of source material would challenge journalistic and academic standards of originality, but Perplexity argues that it is simply summarizing facts. The dispute highlights a gray area in applying plagiarism norms to AI-generated content.


Copyright Infringement: A Murky Legal Landscape

The legal implications of Perplexity's practices are also unclear. Copyright law protects the original expression of ideas, not the underlying facts themselves. Verbatim copying and close summarization, however, raise questions about potential infringement. Legal experts suggest a definitive ruling could be hard to obtain, since the case might not meet the "substantial similarity" threshold required for a copyright claim.


Even if copyright infringement is not a slam dunk, Perplexity might face other legal challenges:

Consumer Protection, Unfair Advertising, and Deceptive Trade Practices: If Perplexity misrepresents its data collection practices or the capabilities of its chatbot, it could face consumer protection lawsuits.

Hot News Misappropriation: This doctrine protects a publisher's right to be the first to benefit from the news it gathers. If Perplexity summarizes a story before the publisher can monetize it, that could amount to misappropriation.

Paywall Bypass: Perplexity's ability to potentially bypass paywalls on news websites could further strain its relationship with content creators.

Losing Section 230 Protections: Platforms like Google enjoy liability protection under Section 230 of the Communications Decency Act for content posted by users. Because Perplexity's chatbot generates content itself rather than merely hosting it, that shield may not apply if its output proves defamatory.


Expert Opinions Diverge: A Call for New Frameworks

The WIRED investigation has sparked a lively debate among legal scholars and technology experts. Some argue that the verbatim copying and lack of clear attribution in Perplexity's summaries could lead to legal trouble if the information is defamatory. Others believe the summaries, while derivative, don't meet the threshold of substantial similarity required for a copyright claim.

Beyond the legal specifics, the bigger concern is the potential disruption AI poses to existing copyright and intellectual property law. Large-scale scraping and content summarization by AI could undermine the ability of creators to profit from their work. This calls for a re-evaluation of copyright law in the digital age and the development of new legal frameworks that address ethical concerns and incentives.

CNET also reported that The New York Times sued OpenAI and its partner, Microsoft, late last year, alleging that the companies used its articles to "train chatbots that now compete with it." OpenAI responded that the suit is "without merit" and that the company believes its systems are protected under "fair use" provisions in copyright law. OpenAI told UK lawmakers earlier this year that because copyright law covers blog posts, photographs, forum posts, code, and documents, "it would be impossible to train today's leading AI models without using copyrighted materials."


More Suits Against AI

Getty Images similarly filed suit against Stability AI last year, alleging that the image-generation startup had copied more than 12 million photographs from its collection without permission, "as part of its efforts to build a competing business." Stability AI reportedly acknowledged that images from Getty's collection were used to train its AI, a process involving "temporary copying" to create its service, but said the service generates "new and original synthetic images."

The tech industry's aggressive push into AI has already strained core services across the internet. A dramatic example is Google's new AI Overviews feature, which summarizes the search results that billions of people rely on for facts. Within days of launch, the feature was found to be spreading racist conspiracy theories and dangerous health advice, including suggestions to add glue to pizza and to eat rocks as part of a healthy diet. Google has since pumped the brakes on its AI summaries, though some users still report egregious errors.

Adobe, meanwhile, faced backlash from users who worried that a new version of its terms of service entitled the company to use their work to train its AI models without explicit permission. Adobe has since released updated terms promising that it will not use customers' work unless it is submitted to the Adobe Stock marketplace.
