What are Crawlers?
Web crawlers, also known as spiders or bots, are automated programs designed to browse and catalogue the World Wide Web. Googlebot, which powers Google Search, is the best-known example and is sometimes treated as synonymous with web crawling.
Web crawlers operate continuously, visiting billions of pages across the internet. They're essential for keeping search engine results up to date. However, their activity can strain web servers, and they sometimes stumble upon content they're not meant to see. The robots.txt file is one way website owners can manage crawlers.
AI Crawlers
The explosion of new AI companies has also led to a significant increase in the number of bots crawling the internet. For brands managing a website or blog, this brings both risk and opportunity.
The Risk
You're exposing your intellectual property to LLMs. This probably isn't a major concern for most brands (we generally want LLMs to read and regurgitate our content), but it does carry some risk. For example, if an LLM can freely crawl your blog, it would be quite simple for a competitor to use a tool like ChatGPT to take your articles and have them rewritten into their tone of voice for their own blog.
The Opportunity
Your content will have a chance to be summarized and referenced by various answer engines. The benefit of sharing your data is that you have a chance of showing up when people ask chatbots and generative search engines for information on a topic you have content for. This is a similar trade-off to the one we all make with Google when we let it crawl our content so that we can show up in search results.
For most brands the opportunity will far outweigh the risk; managing a robots.txt file is only really necessary if there are specific pieces of content you want to shield from crawlers.
What is Robots.txt?
A robots.txt file is used by websites to communicate with web crawlers. It tells crawlers which pages or sections of the website they are allowed to access and which they should avoid. The file is typically located in the root directory of a website (for example at www.aibrandrank.com/robots.txt).
By managing which content web crawlers can access, you can prevent certain sections of a site from being indexed and shown in Google Search, or stop AI companies from using your data to train their models. However, it's important to note that robots.txt is a directive, not a mandate: well-behaved bots will follow it, but some will ignore it.
How Does Robots.txt Work?
Simply by making a robots.txt file publicly available, well-behaved crawlers will automatically look for the file before crawling, read the rules laid out in it, and then follow them. There is a common format that needs to be followed, though, to make sure crawlers can read the file.
User-agent: *
Disallow: /hidden/
Allow: /
Crawl-delay: 10
Sitemap: https://www.example.com/sitemap.xml
This example tells all bots (*) not to access the /hidden/ directory, allows access to everything else, requests a 10-second delay between crawls, and points to the sitemap.
Remember, while robots.txt can control access for well-behaved bots, it's not a security measure for sensitive information, as the file and its directives are publicly visible.
User Agent
Different web crawlers identify themselves with different names – these are referred to as user agents. By specifying user agents in a robots.txt file, websites can provide specific rules for specific crawlers.
User-agent: *
Disallow: /hidden/
In the previous example we used an asterisk (*). This indicates that the following rules (in this case Disallow: /hidden/) should be followed by all crawlers.
User-agent: *
Disallow: /hidden/

User-agent: Googlebot
Allow: /hidden/
Web crawlers will always follow the most specific instructions offered to them. By adding specific rules for Googlebot, it will now crawl the /hidden/ section of the site even though that section is disallowed for all other crawlers, because we have explicitly allowed it.
Disallow
Disallow is always added directly under a user agent and indicates a directory or path that a crawler should not access.
User-agent: *
Disallow: /hidden/
In this example we are telling crawlers not to access anything in the hidden directory. It's important to note that this covers any page whose path starts with /hidden/, such as '/hidden/draft-article'.
Allow
As you can imagine, this is the opposite of Disallow. Every page a crawler can access is allowed by default, so Allow is only necessary when you want to set a rule that overrides a Disallow.
User-agent: *
Disallow: /hidden/
Allow: /hidden/public/
In this example we want to disallow all crawlers from accessing the hidden directory; however, within that directory we have a public subdirectory that we want to carve out an exception for and make visible to web crawlers.
User-agent: *
Disallow: /hidden/

User-agent: Googlebot
Allow: /hidden/
As we saw in the earlier example, exceptions can also be made for specific crawlers. In this case the hidden directory is disallowed for all crawlers except Googlebot, which has been explicitly allowed.
Crawl Delay
Every time a web crawler visits a page on your site, it puts some load on your server, the same way a visiting user would. Most of the time this isn't a problem; web crawlers are respectful of this and don't overload your website by visiting many pages all at once. Sometimes, however, you might find that a web crawler is putting a lot of strain on your website, and setting a crawl delay can help with this.
User-agent: Googlebot
Disallow: /hidden/
Crawl-delay: 5
Setting a crawl delay of five tells Googlebot to wait five seconds between each request when it crawls your site. Normally a crawler will visit one page, then immediately visit all of the pages linked from that page – for large sites that behaviour can stack up to thousands of page visits very quickly, so a crawl delay can help slow things down and reduce the strain on your server.
Most crawlers have sensible defaults, and won't overload your site, but it's worth keeping an eye out for them and taking action if you need to.
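For instance, if your server logs showed one particular crawler (Bytespider here is purely illustrative) hammering your site, a sketch like the following would apply a delay to just that crawler and leave everyone else untouched, assuming the crawler in question respects robots.txt:

User-agent: Bytespider
Crawl-delay: 10

Crawlers with no matching rules simply continue at their own pace.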
Sitemap
The final (and optional) rule, Sitemap, is not related to any specific user agent, and typically has an empty row above it to keep it separate from the user agent groups.
User-agent: *
Disallow: /hidden/

Sitemap: https://www.example.com/sitemap.xml
The Sitemap rule should always use the full URL, including the domain, of the sitemap for the website the robots.txt belongs to. It's optional, but it's a good shortcut for web crawlers to find links to every page on your site. If you want Google to crawl your site so it can show up in search results, a sitemap link makes that job easier.
The Most Important Crawlers to Manage
Google
Google runs the most widespread network of web crawlers, which fuel Google's search engine. The primary crawler is 'Googlebot'. Google also runs a range of other crawlers that focus on images, video, and other specific types of content, but most websites don't need to set specific rules for these.
More recently, Google has created a 'Google-Extended' crawler, separate from the existing Googlebot. This crawler's purpose is to crawl the internet to build a database of training data for Google's AI models like Gemini. If there are things you want included in Google Search but don't want Google to crawl for its AI models, you can set specific rules for the 'Google-Extended' crawler.
User Agent: Googlebot
Purpose: Traditional Search Index
Summary: Builds the index which powers Google Search. Anything indexed by Googlebot may be shown in search results.
User Agent: Google-Extended
Purpose: AI Training and Augmentation
Summary: Crawls data that may be used for AI training, or in search assisted AI responses from Google Gemini and Vertex AI.
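As a minimal sketch, a robots.txt that keeps everything available to Google Search but opts the whole site out of Google-Extended could look like this:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

Disallow: / blocks the entire site for that user agent, while Googlebot, which has its own explicit rules, carries on crawling as normal.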
Apple
Often overlooked compared to Google and even Bing, Apple runs its own search engine to power Spotlight recommendations and Siri (although Siri falls back on Google when Apple's search is insufficient).
Similar to Google, Apple has two crawlers, 'Applebot' and 'Applebot-Extended', to allow websites to have separate controls for the crawler powering Apple's search engine and the crawler building a database for the AI models Apple is developing.
User Agent: Applebot
Purpose: Traditional Search Index
Summary: Builds the index which powers Apple Search. Typically this includes Spotlight search results, and Siri.
User Agent: Applebot-Extended
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Apple when developing AI models.
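Following the same pattern, a sketch that keeps a site in Apple's search index while opting out of Apple's AI training gives each crawler its own group:

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /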
Microsoft
Microsoft of course runs a crawler to power Bing, but unlike the other providers it has not published information about a separate AI crawler.
It may be that Microsoft shares training data with OpenAI as part of their partnership, but it's safer to assume that anything crawled by Bingbot could also be used for training AI models.
User Agent: Bingbot
Purpose: Traditional Search Index
Summary: Builds the index which powers Bing Search. Anything indexed by Bingbot may be shown in search results.
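Because there is no separate training crawler to target, blocking Bing's potential AI usage means blocking Bing Search too. A middle-ground sketch (the /hidden/ path is purely illustrative) shields just the content you're most concerned about:

User-agent: Bingbot
Disallow: /hidden/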
Meta
Meta has a few crawlers that do things like fetch webpage information to generate link previews; in general these don't need special rules. FacebookBot is the only Meta crawler that really needs to be configured in robots.txt files. If you don't set up anything specific for FacebookBot, it will default to the rules you set for Googlebot.
User Agent: FacebookBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Meta when developing AI models.
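Since FacebookBot falls back to your Googlebot rules, opting out of Meta's AI training without touching your Google setup just means giving it a group of its own. A minimal sketch:

User-agent: FacebookBot
Disallow: /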
OpenAI
Compared to the rest of the new AI companies that have cropped up, OpenAI is the most transparent about its web crawlers, and has set them up in a way that gives websites a good level of control over what they want to share, and for what purposes.
User Agent: OAI-SearchBot
Purpose: AI Search
Summary: Identifies websites to be surfaced in Search GPT.
User Agent: ChatGPT-User
Purpose: AI Chat Search
Summary: Enables ChatGPT and custom GPT tools to crawl your website and surface information (generally with a reference).
User Agent: GPTBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by OpenAI when developing AI models.
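As a sketch, a robots.txt that lets OpenAI surface and reference your content while opting out of model training could combine the three user agents like this:

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /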
Anthropic
While Anthropic claims that its crawler is only used to power generative search features within its chatbots, it would be safer to assume that this crawler is also building a database of information for Anthropic's AI models to be trained on, because Anthropic hasn't shared information about any other crawlers it uses.
User Agent: ClaudeBot
Purpose: AI Model Training
Summary: Crawls websites for information to be included in training data sets by Anthropic when developing AI models.
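If you would rather err on the side of caution, a sketch that blocks ClaudeBot from the whole site is a single group:

User-agent: ClaudeBot
Disallow: /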
Common Crawl
An often overlooked crawler – Common Crawl is a not-for-profit organisation that indexes and archives as much of the internet as possible. The Common Crawl archive is made available for free to businesses, academia, and governments.
It's important to bear in mind that because the index is made freely available, AI companies use Common Crawl's data to train their models. Even if you block an AI company's own crawler, if you don't also block Common Crawl it's likely the AI company could still train its models on that data through Common Crawl's index.
User Agent: CCBot
Purpose: Web Catalogue
Summary: A not-for-profit that crawls the internet with the intention of cataloguing information for use by organizations, academia, and non-profits.
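Putting that into practice, a sketch that opts out of OpenAI's training crawler and also closes the Common Crawl loophole might look like this:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /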
Full List of AI Web Crawlers
| Owner | User Agent | Purpose | Summary |
|---|---|---|---|
| Google | Googlebot | Traditional Search Index | Builds the index which powers Google Search. Anything indexed by Googlebot may be shown in search results. |
| Google | Google-Extended | AI Training and Augmentation | Crawls data that may be used for AI training, or in search assisted AI responses from Google Gemini and Vertex AI. |
| Meta | FacebookBot | AI Model Training | Crawls websites for information to be included in training data sets by Meta when developing AI models. |
| Apple | Applebot | Traditional Search Index | Builds the index which powers Apple Search. Typically this includes Spotlight search results and Siri. |
| Apple | Applebot-Extended | AI Model Training | Crawls websites for information to be included in training data sets by Apple when developing AI models. |
| Microsoft | Bingbot | Traditional Search Index | Builds the index which powers Bing Search. Anything indexed by Bingbot may be shown in search results. |
| OpenAI | OAI-SearchBot | AI Search | Identifies websites to be surfaced in SearchGPT. |
| OpenAI | ChatGPT-User | AI Chat Search | Enables ChatGPT and custom GPT tools to crawl your website and surface information (generally with a reference). |
| OpenAI | GPTBot | AI Model Training | Crawls websites for information to be included in training data sets by OpenAI when developing AI models. |
| Anthropic | ClaudeBot | AI Model Training | Crawls websites for information to be included in training data sets by Anthropic when developing AI models. |
| Perplexity | PerplexityBot | AI Generative Search | Crawls websites to power AI-generated search summaries and results. Unclear if Perplexity also trains models with this data. |
| Cohere | cohere-ai | Unconfirmed | Unconfirmed; likely used by Cohere to train its foundation models, which compete with companies like Anthropic and OpenAI. |
| ByteDance (TikTok) | Bytespider | Unconfirmed | Unconfirmed; likely used by ByteDance to train AI models like its ChatGPT competitor Doubao. |
| Common Crawl | CCBot | Web Catalogue | A not-for-profit that crawls the internet with the intention of cataloguing information for use by organizations, academia, and non-profits. |