Meta recently unleashed new bots that crawl the web and suck up data for its AI models and related products.
These bots have features that make it harder for website owners to block their content from being scraped and collected.
The company says the Meta-ExternalAgent bot is “for use cases such as training AI models or improving products by indexing content directly.”
A second one, called Meta-ExternalFetcher, is related to the company’s AI-assistant offerings and collects links to support specific product functions.
These bots first appeared in July, according to archived Meta webpages analyzed by Originality.ai, a startup that specializes in spotting AI content.
Robots.txt under fire
Startups and tech giants are racing to build the most powerful AI models. A key ingredient is high-quality training data. One of the main ways to amass this is to send bots out to the web to crawl and scrape online content. Google, OpenAI, Anthropic, and several other AI companies have these bots.
If content owners want to block such bots, they use robots.txt, a plain text file placed at the root of a website that tells crawlers which pages they may visit. The convention has been in use since the mid-1990s and is widely accepted as one of the unofficial rules supporting the web, though compliance by crawlers is voluntary.
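To illustrate the mechanism, a minimal robots.txt file looks something like the lines below. ExampleBot is a placeholder name, not a real crawler; the exact token a site needs to list depends on how each bot identifies itself.

User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /

The first pair of lines asks the crawler calling itself ExampleBot to stay off the entire site, while the second pair leaves the rest of the site open to all other crawlers.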
The thirst for AI training data has undermined this system, though. In June, OpenAI and Anthropic were found to be either ignoring or circumventing robots.txt.
Meta’s bot bypass
Meta may also be trying to skirt the robots.txt rule in subtle ways.
The company warns that one of its new bots, Meta-ExternalFetcher, “may bypass robots.txt rules.”
Meanwhile, the Meta-ExternalAgent bot performs two functions, which is unusual. One is collecting AI training data, while the other is indexing content.
Website owners may wish to stop Meta from collecting their data for AI-model training, but still want the tech giant to index their sites so more human visitors find them.
Combining both functions in a single bot makes it harder to block. According to Originality.ai, only 1.5% of the top websites are blocking the new Meta-ExternalAgent bot.
An earlier Meta crawler called FacebookBot, which has been scraping online data for years to train Meta’s large language models and AI speech-recognition technology, is blocked by almost 10% of the top websites, including X and Yahoo, according to Originality.ai.
It says the other new Meta bot, Meta-ExternalFetcher, is being blocked by less than 1% of the top websites.
“Companies should provide the ability for websites to block their sites’ data from being used for training while not reducing the visibility of the websites’ content in its products,” said Jon Gillham, the CEO of Originality.ai.
Not respecting previous blocking decisions
Gillham raised another issue: Meta is not respecting the blocking decisions website owners have already made against its older bots.
Any website that previously blocked FacebookBot now also needs to block the new Meta-ExternalAgent crawler to ensure its data is not used to train Meta's AI models.
“If a website had opted out of its data being used to train ‘Language Models for our Speech Recognition Technology’ (the FacebookBot description), then they would presumably also want to opt out of ‘training AI models’ (Meta-ExternalAgent’s description),” Gillham explained in an email to BI.
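In concrete terms, and assuming the user-agent tokens match the bot names Meta has published, a site that had already opted out of FacebookBot would now need a second entry in its robots.txt along these lines:

User-agent: FacebookBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

The trade-off described above still applies: because Meta-ExternalAgent handles both AI training and content indexing, blocking it also opts the site out of that indexing.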
Meta comments
A Meta spokesperson said the company is trying “to make it easier for publishers to indicate their preferences.”
“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson also wrote in an email to Business Insider. “We recognize that some publishers and web domain owners want options when it comes to their websites and generative AI.”
Meta, the spokesperson added, has several web-crawling bots to avoid “bundling all use cases under a single agent, providing more flexibility for web publishers.”
Meta publishes guidance for website owners on how to block its bots.