OpenAI just admitted it has a bot that crawls the web to collect AI training data.


I hate spiders. When I traveled around the world in 2003, the thought of chunky, hairy arachnids creeping beneath my mosquito net kept me awake on many a tropical night.

Unbeknownst to most people, there are digital spiders crawling all over the websites you read and create. The most active one is probably Googlebot, which automatically collects web information so Google can later rank and serve it up in Search results.

Right now, there are several of these spiderbots crawling all over these words I wrote here, which is kinda creepy.

Some of these digital crawlers have also been incredibly helpful. Take the book I wrote about my travels in 2003. When Google's bot crawls my book webpage, I'm happy because when people later search for travel books they might be sent to my book. Maybe they'll buy it and read it.


AI is undermining the grand web bargain

Now the rise of generative AI and large language models is undermining this deal. OpenAI recently admitted that it has one of these spiders crawling around the web. It's called GPTBot, and it's being used to scrape and collect online content for AI model training. The next big model, GPT-5, will likely be trained on the data scooped up by this bot.

GPT-4, ChatGPT, and other powerful models cleverly answer questions immediately, so there's less need to send users to the sources of the original information. This may be a great user experience, but the incentives to share high-quality free information online begin to break down pretty quickly.

Why would any producer of free online content let OpenAI scrape their material when that data will be used to train future LLMs that later compete with those creators by pulling users away from their sites? You can already see this in action as fewer people visit Stack Overflow to get software coding help.


Self-sabotage

It's self-sabotage to let OpenAI's GPTBot crawl your website. This realization is spreading swiftly among online communities. The Verge, a digital publication that competes with Insider, appears to have already taken steps to block GPTBot.

It's unclear how long OpenAI's spiderbot has been lurking around online. The company recently announced a way to block GPTBot using a long-standing web convention called robots.txt. Some creators have already implemented this, though many wonder whether OpenAI had a bot secretly scooping up everyone's online data for months or years before disclosing it.
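Per OpenAI's published guidance, the block is a two-line addition to the robots.txt file that sits at the root of a website. The directive names the crawler and disallows it from every path on the site:

```
User-agent: GPTBot
Disallow: /
```

Any other crawler, such as Googlebot, is unaffected unless it's named in its own entry, so a site can opt out of AI training data collection while remaining visible in search.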

"Finally, after soaking up all your copyrighted content to build their proprietary product, OpenAI gives you a way to prevent your content from being used to further improve their product," Prasad Dhumal, a search engine optimization consultant, wrote on Twitter this week.

"We are now blocking another one of OpenAI's scraping bots. You can too. (I don't know if this is the secret one we couldn't block before or if that one is still in use.)" wrote Neil Clarke, editor of Clarkesworld, a science fiction and fantasy magazine.


Trust is evaporating

I asked Clarke about his decision, and his responses reveal how quickly trust has evaporated between online content creators and AI companies.

"OpenAI and other 'AI' creators have demonstrated repeatedly that they have no respect for the rights of authors, artists, and other creative professionals. Their products are largely based on the copyrighted works of others, taken without authorization or compensation," Clarke wrote in an email. "They repeatedly defend the use of these practices and have only recently identified this bot. It's not entirely clear that opting out of this bot (and CCBot) will be sufficient to avoid having content harvested by OpenAI. Their track record on transparency leaves much to be desired."

CCBot is another digital spider that crawls the web collecting content. It's run by Common Crawl, a nonprofit that is a major supplier of training data for AI models. Common Crawl archives this information regularly, so even if you block its bot now, your data has probably already been taken.

"I'm unaware of anyone that has managed to get Common Crawl to remove data," Clarke said. "I've tried, but have had no response."


'Opt-in' not 'opt-out'

Clarke and others are now calling for these AI spiderbots to be "opt-in" rather than "opt-out." Right now, OpenAI scrapes everyone's data by default, and creators must actively opt out by blocking the bot. An "opt-in" approach would require OpenAI to ask permission first.
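For the technically curious, the opt-out mechanics can be seen with Python's standard-library robots.txt parser, which models how a well-behaved crawler decides whether a page is off-limits (the example.com URL here is just a placeholder):

```python
import urllib.robotparser

# A robots.txt that opts out of OpenAI's crawler while leaving others alone.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# A compliant GPTBot must skip every page on the site...
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
# ...while a crawler not named in the file is still permitted.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it only works if the crawler chooses to honor it, which is exactly the trust problem Clarke describes.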

"Data collection methods for these models must become strictly opt-in. Many people will not find out how to protect their work until it has already been taken, yet again," Clarke wrote. "Since we are presently unable to have our content removed from existing models and scraped data sets, opt-out is not enough. It is not our responsibility to provide data for these companies, nor should they be allowed to simply take it without consent, regardless of the benefits they imagine coming from it."

I asked OpenAI about all this on Tuesday morning. The company didn't respond.

Paying for AI training data

OpenAI has made an effort to respect some online data. GPTBot is now designed to filter out paywalled sources and to exclude sources known to gather personally identifiable information.

The company also recently announced a deal with the Associated Press where OpenAI will pay to license AP content for AI training data.

If the company paid for this data, why doesn't it pay for everyone else's information, too? I asked the company and it didn't respond.

'Block it'

OpenAI hasn't contacted Neil Clarke at Clarkesworld about paying for his online content. "We have not been approached to license works we published, nor would we be open to it. I'm unable to think of anything they could say or do that would change my mind," he told Insider.

So what is Clarke's advice for other online content creators when it comes to GPTBot?

"In short I'd say 'block it' and suggest they reach out to lawmakers to express their concern regarding past, present, and future data collection methodologies," he said.

When Googlebot crawls a website and scrapes content, that process ends up sending users back to the site that created the information. That reward is the essential bargain at the heart of the web. What incentive does OpenAI offer content creators to let GPTBot crawl and scrape their sites?
