What are Crawlers?
Web crawlers, also known as spiders or bots, are automated programs designed to browse and catalogue the World Wide Web. Googlebot, which powers Google Search, is the best-known example and is sometimes treated as synonymous with web crawling.
Web crawlers operate continuously, visiting billions of pages across the internet. They're essential for keeping search engine results up to date. However, their activity can strain web servers, and they sometimes stumble upon content they're not meant to see. The robots.txt file is one way website owners can manage crawlers.
AI Crawlers
The explosion of new AI companies has also led to a significant increase in the number of bots crawling the internet. For brands managing a website or blog, this brings both risk and opportunity.
The Risk
You're exposing your intellectual property to LLMs. This probably isn't a major concern for most brands (we generally want LLMs to read and regurgitate our content), but it does carry some risk. For example, if an LLM can freely crawl your blog, it would be quite simple for a competitor to use a tool like ChatGPT to take your articles and have them rewritten into their tone of voice for their own blog.
The Opportunity
Your content will have a chance to be summarized and referenced by various answer engines. The benefit of sharing your data is that you have a chance of showing up when people ask chatbots and generative search engines for information on a topic you have content for. This is a similar trade-off to the one we all make with Google when we let it crawl our content so that we can show up in search results.
For most brands the opportunity will far outweigh the risk; managing a robots.txt file is only really necessary if there are specific pieces of content you want to shield from crawlers.
What is Robots.txt?
A robots.txt file is used by websites to communicate with web crawlers. It tells crawlers which pages or sections of the website they are allowed to access and which they should avoid. The file is typically located in the root directory of a website (for example at www.aibrandrank.com/robots.txt).
By managing which content web crawlers can access, you can prevent certain sections of a site from being indexed and shown in Google Search, or stop AI companies from using your data to train their models. However, it's important to note that robots.txt is a directive, not a mandate: well-behaved bots will follow it, but some will ignore it.
How Does Robots.txt Work?
Simply by making a robots.txt file publicly available, well-behaved crawlers will automatically look for the file before crawling, read the rules laid out in it, and then follow them. There is a common format that needs to be followed, though, to make sure crawlers can read the file.
User-agent: *
Disallow: /hidden/
Allow: /
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
This example tells all bots (*) not to access the /hidden/ directory, allows access to everything else, requests a 10-second delay between crawls, and points to the sitemap.
Remember, while robots.txt can control access for well-behaved bots, it's not a security measure for sensitive information, as the file and its directives are publicly visible.
User Agent
Different web crawlers identify themselves with different names – these are referred to as user agents. By specifying user agents in a robots.txt file, websites can provide specific rules for specific crawlers.
User-agent: *
Disallow: /hidden/
In the previous example we used an asterisk (*). This indicates that the following rules (in this case Disallow: /hidden/) should be followed by all crawlers.
User-agent: *
Disallow: /hidden/

User-agent: Googlebot
Allow: /hidden/
Web crawlers will always follow the most specific instructions offered to them. By adding specific rules for Googlebot, it will now crawl the /hidden/ section of the site even though that section is disallowed for all other crawlers, because we have explicitly allowed it.
Disallow
Disallow is always added directly under a user agent and indicates a directory or path that a crawler should not access.
User-agent: *
Disallow: /hidden/
In this example we are telling crawlers not to access anything in the hidden directory. It's important to note that this covers any page whose path starts with /hidden/, such as '/hidden/draft-article'.
Allow
As you can imagine, this is the opposite of Disallow. Every page a crawler can access is allowed by default, so Allow is only necessary when you want to set a rule that overrides a Disallow.
User-agent: *
Disallow: /hidden/
Allow: /hidden/public/
In this example we want to disallow all crawlers from accessing the hidden directory; however, within that directory we have a public subdirectory that we want to carve out an exception for and make visible to web crawlers.
User-agent: *
Disallow: /hidden/

User-agent: Googlebot
Allow: /hidden/
As we saw in the earlier example, exceptions can also be made for specific crawlers. In this case the hidden directory is disallowed for all crawlers except Googlebot, which has been explicitly allowed.
Crawl Delay
Every time a web crawler visits a page on your site, it puts some load on your server, the same way a visiting user would. Most of the time this isn't a problem; web crawlers are respectful of this and don't overload your website by visiting many pages all at once. Sometimes, however, you might find that a web crawler is putting a lot of strain on your website, and setting a crawl delay can help with this.
User-agent: Googlebot
Disallow: /hidden/
Crawl-delay: 5
Setting a crawl delay of five tells Googlebot to wait five seconds between each request when it crawls your site. Normally a crawler will visit one page, then immediately visit all of the pages linked from that page – for large sites that behaviour can stack up to thousands of page visits very quickly, so a crawl delay can help slow things down and reduce the strain on your server.
Most crawlers have sensible defaults, and won't overload your site, but it's worth keeping an eye out for them and taking action if you need to.
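For instance, if your server logs showed one particular crawler (Bytespider here is purely illustrative) hammering your site, a sketch like the following would apply a delay to just that crawler and leave everyone else untouched, assuming the crawler in question respects robots.txt:

User-agent: Bytespider
Crawl-delay: 10

Crawlers with no matching rules simply continue at their own pace.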
Sitemap
The final (and optional) rule, Sitemap, is not related to any specific user agent, and typically has an empty row above it to keep it separate from the user agent groups.
User-agent: *
Disallow: /hidden/

Sitemap: https://www.example.com/sitemap.xml
The Sitemap rule should always use the full URL, including the domain, of the sitemap for the website the robots.txt belongs to. It's optional, but it's a good shortcut for web crawlers to find links to every page on your site. If you want Google to crawl your site so it can show up in search results, a sitemap link makes that job easier.
The Most Important Crawlers to Manage
Google
Google runs the most widespread network of web crawlers, which fuel Google's search engine. The primary crawler is 'Googlebot'. Google also runs a range of other crawlers that focus on images, video, and other specific types of content, but most websites don't need to set specific rules for these.
More recently, Google has created a 'Google-Extended' crawler, separate from the existing Googlebot. This crawler's purpose is to crawl the internet to build a database of training data for Google's AI models like Gemini. If there are things you want included in Google Search but don't want Google to crawl for its AI models, you can set specific rules for the 'Google-Extended' crawler.
User Agent: Googlebot
Purpose: Traditional Search Index
Summary: Builds the index which powers Google Search. Anything indexed by Googlebot may be shown in search results.
User Agent: Google-Extended
Purpose: AI Training and Augmentation
Summary: Crawls data that may be used for AI training, or in search assisted AI responses from Google Gemini and Vertex AI.
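As a minimal sketch, a robots.txt that keeps everything available to Google Search but opts the whole site out of Google-Extended could look like this:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

Disallow: / blocks the entire site for that user agent, while Googlebot, which has its own explicit rules, carries on crawling as normal.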
Apple
Often overlooked compared to Google and even Bing, Apple runs its own search engine to power Spotlight recommendations and Siri (although Siri falls back on Google when Apple's search is insufficient).
Similar to Google, Apple has two crawlers, 'Applebot' and 'Applebot-Extended', to allow websites to have separate controls for the crawler powering Apple's search engine and the crawler building a database for the AI models Apple is developing.
User Agent: Applebot
Purpose: Traditional Search Index
Summary: Builds the index which powers Apple Search. Typically this includes Spotlight search results, and Siri.
User Agent: Applebot-Extended
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Apple when developing AI models.
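Following the same pattern, a sketch that keeps a site in Apple's search index while opting out of Apple's AI training gives each crawler its own group:

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /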
Microsoft
Microsoft of course runs a crawler to power Bing, but unlike the other providers it has not published information about a separate AI crawler.
It may be that Microsoft shares training data with OpenAI as part of their partnership, but it's safer to assume that anything crawled by Bingbot could also be used for training AI models.
User Agent: Bingbot
Purpose: Traditional Search Index
Summary: Builds the index which powers Bing Search. Anything indexed by Bingbot may be shown in search results.
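Because there is no separate training crawler to target, blocking Bing's potential AI usage means blocking Bing Search too. A middle-ground sketch (the /hidden/ path is purely illustrative) shields just the content you're most concerned about:

User-agent: Bingbot
Disallow: /hidden/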
Meta
Meta has a few crawlers that do things like fetch webpage information to generate link previews; in general these don't need special rules. FacebookBot is the only Meta crawler that really needs to be configured in robots.txt files. If you don't set up anything specific for FacebookBot, it will default to the rules you set for Googlebot.
User Agent: FacebookBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Meta when developing AI models.
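Since FacebookBot falls back to your Googlebot rules, opting out of Meta's AI training without touching your Google setup just means giving it a group of its own. A minimal sketch:

User-agent: FacebookBot
Disallow: /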
OpenAI
Compared to the rest of the new AI companies that have cropped up, OpenAI is the most transparent about its web crawlers, and has set them up in a way that gives websites a good level of control over what they want to share, and for what purposes.
User Agent: OAI-SearchBot
Purpose: AI Search
Summary: Identifies websites to be surfaced in Search GPT.
User Agent: ChatGPT-User
Purpose: AI Chat Search
Summary: Enables ChatGPT and custom GPT tools to crawl your website and surface information (generally with a reference).
User Agent: GPTBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by OpenAI when developing AI models.
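As a sketch, a robots.txt that lets OpenAI surface and reference your content while opting out of model training could combine the three user agents like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /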
Anthropic
While Anthropic claims that its crawler is only used to power generative search features within its chatbots, it would be safer to assume that this crawler is also building a database of information for Anthropic's AI models to be trained on, because Anthropic hasn't shared information about any other crawlers it uses.
User Agent: ClaudeBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Anthropic when developing AI models.
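If you would rather err on the side of caution, a sketch that blocks ClaudeBot from the whole site is a single group:

User-agent: ClaudeBot
Disallow: /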
Common Crawl
An often overlooked crawler – Common Crawl is a not-for-profit organisation that indexes and archives as much of the internet as possible. The Common Crawl archive is made available for free to businesses, academia, and governments.
It's important to bear in mind that because the index is made freely available, AI companies use Common Crawl's data to train their models. Even if you block an AI company's own crawler, if you don't also block Common Crawl it's likely the AI company could still train its models on that data through Common Crawl's index.
User Agent: CCBot
Purpose: Web Catalogue
Summary: A not-for-profit that crawls the internet with the intention of cataloguing information for use by organizations, academia, and non-profits.
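Putting that into practice, a sketch that opts out of OpenAI's training crawler and also closes the Common Crawl loophole might look like this:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /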
Full List of AI Web Crawlers
| Owner | User Agent | Purpose | Summary |
|---|---|---|---|
| Google | Googlebot | Traditional Search Index | Builds the index which powers Google Search. Anything indexed by Googlebot may be shown in search results. |
| Google | Google-Extended | AI Training and Augmentation | Crawls data that may be used for AI training, or in search assisted AI responses from Google Gemini and Vertex AI. |
| Meta | FacebookBot | AI Model Training | Crawls websites for information to be included in training data sets by Meta when developing AI models. |
| Apple | Applebot | Traditional Search Index | Builds the index which powers Apple Search. Typically this includes Spotlight search results and Siri. |
| Apple | Applebot-Extended | AI Model Training | Crawls websites for information to be included in training data sets by Apple when developing AI models. |
| Microsoft | Bingbot | Traditional Search Index | Builds the index which powers Bing Search. Anything indexed by Bingbot may be shown in search results. |
| OpenAI | OAI-SearchBot | AI Search | Identifies websites to be surfaced in SearchGPT. |
| OpenAI | ChatGPT-User | AI Chat Search | Enables ChatGPT and custom GPT tools to crawl your website and surface information (generally with a reference). |
| OpenAI | GPTBot | AI Model Training | Crawls websites for information to be included in training data sets by OpenAI when developing AI models. |
| Anthropic | ClaudeBot | AI Model Training | Crawls websites for information to be included in training data sets by Anthropic when developing AI models. |
| Perplexity | PerplexityBot | AI Generative Search | Crawls websites to power AI-generated search summaries and results. Unclear if Perplexity also trains models with this data. |
| Cohere | cohere-ai | Unconfirmed | Unconfirmed; likely used by Cohere to train its foundation models, which compete with companies like Anthropic and OpenAI. |
| ByteDance (TikTok) | Bytespider | Unconfirmed | Unconfirmed; likely used by ByteDance to train AI models like its ChatGPT competitor Doubao. |
| Common Crawl | CCBot | Web Catalogue | A not-for-profit that crawls the internet with the intention of cataloguing information for use by organizations, academia, and non-profits. |