Large websites block AI crawlers

According to new data from Originality.AI, an AI content detection company, 20% of the world's top 1,000 websites have blocked crawlers that collect web data for AI services.

Data from Originality.AI shows that websites large and small are taking matters into their own hands because there are no clear legal or regulatory rules governing AI use of copyrighted material. Among the top 1,000 websites, the rate at which GPTBot is blocked is rising by about 5% every week.

In early August, OpenAI launched a crawler called GPTBot. The company announced that the data collected could be used to improve future models, promised to exclude paywalled content, and gave websites instructions on how to block the crawler.

Many media outlets, including the New York Times, Reuters, and CNN, quickly began blocking GPTBot, and many other sites followed suit.

Among the 1,000 most visited websites worldwide, the share blocking OpenAI's crawler rose from 9.1% on August 22 to 12% on August 29, according to Originality.AI.

The data suggests that larger websites are more likely to block AI crawlers, with Amazon and Quora among the top websites blocking GPTBot.

Common Crawl, another crawler that regularly collects web data for AI services, is blocked by 6.77% of the top 1,000 websites in the world.

Bots work similarly to web browsers: they collect data from every available web page and store it in a database rather than displaying it to the user. This is how search engines like Google collect information.

Website owners have always been able to instruct crawlers to stop collecting data. However, cooperation is voluntary and malicious actors can ignore these instructions.
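As a rough illustration of how these instructions work, the sketch below uses Python's standard `urllib.robotparser` module to check a `robots.txt` file the way a cooperating crawler would. The file contents and the `example.com` URL are hypothetical, and the `is_allowed` helper is a name made up for this example. `GPTBot` and `CCBot` are the user-agent tokens used by OpenAI's and Common Crawl's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: block two AI-related crawlers,
# allow everyone else (for example, search engine crawlers).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved crawler runs a check like this before each request.
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/article")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Nothing in this check is enforced by the website itself. A crawler that never runs it, or ignores the result, can still request the page, which is why cooperation is described as voluntary.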

Google and other web companies consider crawling to be fair use, but many publishers and intellectual property owners have long opposed this view, and Google has faced numerous lawsuits over the practice.

The emergence of large language models and generative AI has brought the problem back into the spotlight, with AI companies sending out crawlers to collect data to train their models and “equip their chatbots.”

Because Google and other search sites direct users to publishers' ad-supported sites, publishers have seen value in allowing search crawlers to access their sites. They are blocking AI crawlers, by contrast, because there is currently no advantage in handing their data over to AI companies.

Many media companies are in discussions with AI companies about being paid to license their data, but these discussions are still at an early stage.

Meanwhile, some website and intellectual property owners are taking legal action against AI companies that use their data without permission.

The media, which felt misled by Google over the past two decades, now views AI services with suspicion and struggles to find the right balance between embracing and resisting new technologies.

These organizations are looking for innovative ways to improve profit margins in labor-intensive operations, but introducing new technologies into newsroom workflows raises thorny ethical questions at a time when trust in media companies is at an all-time low.

