A startling revelation has emerged in the world of artificial intelligence: a new analysis by plagiarism detection platform Copyleaks claims that DeepSeek-R1, a popular AI model developed by Chinese company DeepSeek, may have been trained on data generated by OpenAI’s ChatGPT. The findings, if confirmed, could ignite debates over intellectual property rights, AI ethics, and the transparency of training practices in the rapidly evolving AI industry.
The report, titled “AI Authorship Exposed: Tracing ChatGPT’s Footprints in DeepSeek,” alleges that DeepSeek’s outputs contain striking similarities to ChatGPT’s writing patterns, including idiosyncratic phrasings, structural overlaps, and even replicated errors. Researchers at Copyleaks used proprietary algorithms to compare outputs from both models, concluding that there is a “high statistical probability” that DeepSeek ingested ChatGPT-generated text during its training.
“This isn’t just about mimicry—it’s about potential data laundering,” said a Copyleaks spokesperson. “If AI models are trained on outputs from other AI systems, it creates a hall-of-mirrors effect, compounding biases and inaccuracies while obscuring the original sources.”
The full report is available on the Copyleaks website.
The Controversy Explained
DeepSeek, founded in 2023, quickly gained traction for its multilingual capabilities and cost-effective API, positioning itself as a competitor to OpenAI and Anthropic. However, questions about its training data have lingered. Unlike OpenAI, which has described ChatGPT's training data as a mix of licensed, human-created, and publicly available text with a 2021 cutoff, DeepSeek has never fully detailed its data sources, citing proprietary concerns.
The Copyleaks study adds fuel to these suspicions. By analyzing thousands of responses from both models, researchers identified recurring anomalies. For example, when asked to summarize the same obscure 19th-century poem, both ChatGPT and DeepSeek produced nearly identical misinterpretations of a line that human experts consistently interpret differently. Similarly, both models generated the same rare numerical error when solving a niche algebra problem.
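Copyleaks has not published the details of its attribution algorithms, which it describes as proprietary. Purely as an illustration of the general idea, the sketch below shows one naive way to flag suspiciously similar responses from two models, using character-level and word-trigram overlap from the Python standard library; the example texts are hypothetical and not taken from the report.

```python
# Illustrative sketch only: Copyleaks' actual methods are proprietary.
# This compares two model outputs with two cheap similarity metrics.
from difflib import SequenceMatcher


def ngram_set(text: str, n: int = 3) -> set:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a: str, b: str) -> dict:
    """Score how closely two model outputs track each other."""
    char_ratio = SequenceMatcher(None, a, b).ratio()  # character-level match
    grams_a, grams_b = ngram_set(a), ngram_set(b)
    jaccard = len(grams_a & grams_b) / max(len(grams_a | grams_b), 1)
    return {"char_ratio": char_ratio, "trigram_jaccard": jaccard}


# Hypothetical responses to the same prompt; a real study would compare thousands.
reply_model_a = "The poem mourns the loss of rural life to industrial progress."
reply_model_b = "The poem mourns the loss of rural life to industrial change."

print(similarity(reply_model_a, reply_model_b))
```

Surface-level metrics like these can only establish that outputs overlap, not why; attributing that overlap to shared training data is the harder inference the Copyleaks researchers claim to have made statistically.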
“The overlaps go beyond coincidence,” said Dr. Elena Torres, a computational linguist unaffiliated with the study. “This suggests either direct training on ChatGPT’s outputs or access to a corpus heavily influenced by them.”
DeepSeek’s Response
DeepSeek has vehemently denied the allegations. In a statement released earlier today, the company called the report “methodologically flawed” and emphasized its commitment to “ethical AI development.”
“Our training data consists exclusively of legally sourced, publicly available texts, including books, journals, and licensed web content,” the statement read. “Any similarities between DeepSeek-R1 and other models reflect the universality of human language, not unauthorized data usage.”
Notably, the company did not address whether its training datasets included AI-generated content from any source—a loophole critics say needs urgent regulatory attention.
Broader Implications for AI
The controversy highlights growing concerns about the “black box” nature of AI training. While companies like Google and Meta face scrutiny over copyrighted material in their datasets, the rise of AI-generated content adds a new layer of complexity. If AI models train on outputs from other AIs, it could lead to irreversible model collapse—a phenomenon where repeated recycling of synthetic data degrades performance over time.
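The mechanism behind model collapse is easy to see in a toy simulation. The sketch below is illustrative only and is not drawn from the Copyleaks report: each "generation" of a toy word-frequency model is re-estimated from a finite sample of the previous generation's outputs, and rare words that fail to appear in a sample drop out permanently, shrinking the vocabulary over time.

```python
# Toy illustration of model collapse: recursive training on synthetic samples
# steadily erodes the long tail of a distribution.
import random
from collections import Counter

random.seed(42)

# Generation 0: a toy "language model" as a long-tailed word-frequency table.
vocab = [f"word_{i}" for i in range(200)]
weights = [1.0 / (rank + 1) for rank in range(200)]

for generation in range(8):
    surviving = sum(w > 0 for w in weights)
    print(f"generation {generation}: {surviving} distinct words remain")
    # "Train" the next generation only on 500 synthetic samples from this one.
    sample = random.choices(vocab, weights=weights, k=500)
    counts = Counter(sample)
    weights = [counts.get(word, 0) for word in vocab]
```

Once a word's probability hits zero it can never return, which is why repeated recycling of synthetic data loses diversity even when each individual generation looks reasonable.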
Legal experts warn that the issue could also spark lawsuits. “If DeepSeek used ChatGPT’s outputs without permission, OpenAI might have a case for copyright or terms-of-service violations,” said IP attorney Mark Nguyen. “But the law here is untested. We’re in uncharted territory.”
Calls for Transparency
Advocates are urging stricter transparency standards for AI training practices. “Users deserve to know whether the tools they rely on are built with original data or recycled outputs,” said Sarah Lin, director of the AI Ethics Collective. “This isn’t just academic—it affects trust, accountability, and innovation.”
As the debate unfolds, the Copyleaks report underscores a pressing need for industry-wide norms. For now, the question remains: How many AI models are quietly learning from each other, and what does that mean for the future of artificial intelligence?