New Meta tools stealthily harvest web data for AI training

Meta quietly uses new web crawlers to scrape publicly available data for AI training, despite previous anti-scraping policies and limited disclosure.

Aug 22, 2024 - 15:36

Meta has recently deployed two "new" custom web crawlers designed to collect data from across the internet for training its AI models. Despite the significance of these tools, the company has not made an effort to inform the general public about their use. Instead, Meta quietly acknowledged the existence of these crawlers through a subtle update to a webpage intended for developers in late July.

These crawlers, known as Meta-ExternalAgent and Meta-ExternalFetcher, have been identified by AI monitoring firms such as Originality.ai and Dark Visitors. According to these firms, Meta-ExternalAgent supports AI model training and AI-driven products by "indexing content directly" from the web. Meta-ExternalFetcher, by contrast, is tied to Meta's AI assistant tools and retrieves web links so that those tools can better respond to user queries.

Although Originality.ai and Dark Visitors detected the crawlers in July, Meta made no public announcement beyond the update to its developer webpage listing its web crawlers. When approached by Fortune, Meta confirmed that it has been using these new crawlers, explaining that they are successors to an older Meta crawler, known as FacebookExternalHit. This previous crawler has been collecting data from apps and websites shared on Meta's platforms, including Facebook, Instagram, and Messenger, for several years.

A Meta spokesperson recently acknowledged that, like many other companies, Meta trains its generative AI models using content that is publicly accessible on the internet. The spokesperson also mentioned that Meta has recently updated its guidelines to inform publishers on how to prevent their domains from being crawled by Meta’s AI-related tools.
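Opt-outs of this kind are conventionally expressed in a site's robots.txt file. The sketch below assumes Meta's guidance follows that convention and that the crawlers match on the user-agent tokens `meta-externalagent` and `meta-externalfetcher`, which are inferred from the crawler names above rather than confirmed here. The robots.txt content, the `is_allowed` helper, and the example.com URL are hypothetical. The code uses Python's standard `urllib.robotparser` to check what such rules would permit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The user-agent tokens are assumptions based on
# the crawler names reported in this article.
ROBOTS_TXT = """\
User-agent: meta-externalagent
Disallow: /

User-agent: meta-externalfetcher
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-article"
    for agent in ("Meta-ExternalAgent", "Meta-ExternalFetcher", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, page))
```

A check like this only shows what the rules say. Whether a given crawler honors them is up to the crawler's operator.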

However, this guidance may offer little reassurance to those aware of Meta's previously stated stance on web scraping, which was made public in April 2021 and has apparently not been revised since. The company's policy clearly states, "Using automation to get data from Facebook without our permission is a violation of our terms." It goes on to explain that while the data itself might be readily available for public use, scrapers are prohibited from accessing or collecting data from Meta's products through automated means without prior authorization.

Given the scale and nature of Meta's operations, it seems unlikely that the company has obtained explicit permission from every website it scrapes. If it had, the recent guidance on how to opt out of being crawled would largely be irrelevant. This approach mirrors Meta's strategy last year with its AI image generator, which was trained on images from Instagram and Facebook. Rather than seeking widespread permission, Meta appears to operate under an "ask for forgiveness, not permission" philosophy, regardless of whether website owners or the broader internet community agree with these practices.