November 22, 2023

What is robots.txt?

If you’re new to SEO, some of the more technical aspects of it can seem a bit confusing and overwhelming, especially if you’re not from a technical or web development background to begin with. Robots.txt is one such aspect of Technical SEO that causes more than a few head-scratches, so we got one of our Senior SEO Account Managers to take you through it. Let’s start with the basics: what exactly is robots.txt?

Robots.txt is a text file on your website that instructs web robots and crawlers on which pages they should or should not access. When it comes to search engine agents, the robots.txt file allows us to stop Googlebot, Bingbot, and other crawlers from accessing certain areas of your site, and better manage the crawl budget.

The robots.txt file is part of a number of tools that website owners and developers can use to implement the Robots Exclusion Protocol, alongside X-robots-tags, robots meta tags, and rel attributes.

Read on to find out more about how and why we use robots.txt files. 

How does robots.txt work?

Robots.txt is a simple text file, without any HTML markup. It is hosted on the web server, located at the root of your domain, and it is publicly accessible. If a website has a robots.txt file, you will be able to find it by typing in the domain followed by /robots.txt (for example, https://wildcatdigital.co.uk/robots.txt).

Robots.txt is the first file that search crawlers read after reaching a domain. This file provides bots with information on how to crawl the website and what pages, resources, or folders they should not crawl. If the bots do not find a robots.txt or if the file does not contain any disallow directives, it is implied that they can crawl all the links found on the domain. 

The file contains lines of text. Each line specifies a rule for one or more crawlers, allowing or disallowing their access to specific file paths on the domain. 

Wildcat’s own robots.txt file, for example, indicates that all crawlers can access all URLs on the site, and it also points to the location of the XML sitemap.
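In practice, the file looks along these lines (this mirrors the sitemap example shown later in this article):

User-agent: *
Disallow:

Sitemap: https://wildcatdigital.co.uk/sitemap_index.xml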

What is robots.txt used for?

The main goal of the robots.txt file is to manage good* bot traffic and activity, so that the crawl budget is used effectively and servers do not become overloaded. The most common uses of robots.txt include allowing and disallowing specific agents, directories, or files, and specifying the location of your sitemap.

*more detail on this point under Limitations

How to create a robots.txt file

Many website builders will create a robots.txt file by default. Here is how you can create your robots.txt file if your website does not already have one, and how you can optimise your existing file. 

Syntax

The robots.txt file is structured as a series of lines, where each line contains a single field specifying a user agent, an allow directive, a disallow directive, or a sitemap location. The order of these fields affects how the file is interpreted. Below, we will outline the most important rules to follow when writing your robots.txt file.
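Put together, a minimal file might look along these lines (the paths and sitemap URL are taken from the examples used later in this article):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://wildcatdigital.co.uk/sitemap_index.xml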

User-agent

Defines the web crawler or user agent that the rule applies to. It can be a specific agent or a wildcard (*) for all agents.
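For example, the first line below addresses Google’s crawler specifically, while the second addresses all crawlers (each would normally be followed by its own allow or disallow rules):

User-agent: Googlebot

User-agent: *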

Disallow

The disallow command is the most commonly used directive in robots.txt. It tells crawlers to omit certain areas of the site. It can be used to block all crawlers from the entire site:

User-agent: *
Disallow: /

To block a specific file:

User-agent: *
Disallow: /wp-login.php

To block an entire directory:

User-agent: *
Disallow: /wp-admin/

To block a specific page:

User-agent: *
Disallow: /my-account/secret-info

Or to block parameterised URLs, such as internal search results:

User-agent: *
Disallow: /shop/?query=*

Allow

The allow command does just that: it allows bots to access certain pages or directories. Because bots will always follow the most specific command in the file, the allow directive can be used, for example, to grant access to a specific page within an otherwise disallowed directory, or to allow one crawler access while disallowing all others.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
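The second use, allowing a single crawler while blocking all others, might look along these lines (Googlebot is used here purely as an illustration):

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /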

Sitemaps

Adding a link to the XML sitemap in your robots.txt file helps crawlers find all your pages and understand what you deem to be the most important links on your site.

User-agent: *
Disallow:

Sitemap: https://wildcatdigital.co.uk/sitemap_index.xml

Crawl delay*

The crawl delay directive can be used to tell a user agent to wait a specified number of seconds between crawl requests. This helps avoid overtaxing the server.

*While Bing and Yandex still recognise this directive, Google no longer does. However, the crawl frequency for Googlebot can be set through Google Search Console.
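A sketch of such a rule, asking Bing’s crawler to wait ten seconds between requests (the value is only an example):

User-agent: Bingbot
Crawl-delay: 10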

Field order and grouping in robots.txt

Understanding the logic robots use to read your file can help you write effective rules.

Directives are read in groups: every directive listed below a user-agent field applies to that user agent, until the next user-agent field starts a new group. In Semrush’s robots.txt file, for example, the directives below a wildcard user-agent field apply to all user agents.
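Grouping also lets a file treat different crawlers differently. For example, a file along these lines (an illustrative sketch; Slurp is Yahoo’s crawler token, and the exact rules of the original example are not reproduced here):

# Group 1: all other crawlers, nothing disallowed
User-agent: *
Disallow:

# Group 2: Yahoo (Slurp) and Yandex, blocked from everything
User-agent: Slurp
User-agent: Yandex
Disallow: /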

In a configuration like this, all other bots will follow the first group of rules, where no disallows are in place, while Yahoo and Yandex will follow the second group and will not crawl any of the pages on the domain.
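The same grouping logic applies when a wildcard group sits alongside a group that names a crawler directly; for instance (the paths here are assumed for illustration):

User-agent: *
Disallow: /drafts/

User-agent: Googlebot
Disallow: /drafts/archive/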

Because both sets of rules could apply to Googlebot (the wildcard group applies to all bots), Googlebot will follow the second, more specific group: the one that names it directly.

Google’s documentation gives a similar example of conflicting allow and disallow rules, and specifies that in such a case the Google bots will follow the allow directive, because it is the more specific match.
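One such case can be sketched like this (a reconstruction based on Google’s published precedence examples):

User-agent: Googlebot
Disallow: /
Allow: /p

For a URL such as https://example.com/page, Allow: /p is the longer, more specific match, so Googlebot crawls the page despite the site-wide disallow.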

Pattern matching can also be used in disallow rules. The example below disallows crawling for all dynamic shop search URLs.
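A pattern along these lines (the /shop/?query= path is reused from the earlier disallow example for illustration):

User-agent: *
Disallow: /shop/?query=*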

The next example disallows all URLs ending in .php.
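A rule of this kind combines the * wildcard with the $ end-of-URL anchor (the pattern below is an illustrative sketch):

User-agent: *
Disallow: /*.php$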

Finally, the $ anchor can also be used to disallow the root URL on its own, without disallowing lower-level URLs like /root/file.
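A sketch of such a rule (assumed here, as the original example is not reproduced):

User-agent: *
Disallow: /$

Because the pattern ends with $, it matches the root URL (“/”) exactly, so deeper paths such as /root/file remain crawlable.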

Robots.txt Limitations

For all its useful implementations, the robots.txt file has certain limitations that are important to know about before making any changes. 

Robots.txt does not enforce directives

It is important to note that the commands contained in the robots.txt file are directives, not rules. This means that malicious bots and crawlers can choose to ignore these directives. While you can rely on Google, Bing and most good bots to follow these directives, you must employ alternative methods to truly protect sensitive content on your website, like password-protecting files. 

Disallowed pages can be indexed

The disallow directives on the robots.txt file stop search engine crawlers from reading the content of the disallowed pages. However, when these pages are linked to from other crawlable pages, they may still be indexed and appear in search results.

Noindex directives in the robots.txt file are not supported by Google, and robots.txt directives should not be used to manipulate search results.

To reliably prevent certain pages from appearing in search results, we can use noindex robots directives (a robots meta tag or X-Robots-Tag header) on the necessary pages.
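For example, a page-level noindex can be added in the HTML head using a standard robots meta tag:

<meta name="robots" content="noindex">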

Need Help With Your robots.txt File?

Our team of technical SEO specialists at Wildcat Digital have a wealth of experience setting up websites for success. Checking that your robots.txt is set up correctly and following best practices is a key step in our technical audits and campaign planning. If you need help with your robots.txt or have any concerns about the indexing and crawling of your website, get in touch today. 

Post by

Miruna Hadu

