
What Is A "Robots.txt File" & How To Set It Up For An Ecommerce Site


“Robots.txt file” — As complicated and mysterious as it might sound, it’s a simple tool that tells search engine crawlers how to index the content on your website.

In fact, it is especially important for ecommerce sites, as it allows them to control which pages are visible and indexed by search engines.

In this blog post, we’ll break down everything that goes into a robots.txt file and why it’s necessary for ecommerce success. We’ll also cover how and when to use this critical file, along with tips on optimizing it for the best results!

Read on to learn more about what a robots.txt file is and why you should care!

What Is A Robots.txt file?

All live websites are visited by virtual robots called “spiders,” “crawlers,” or “bots”. These bots are sent by all major search engines, such as Google, Yahoo, and Bing, so they can crawl your entire site. These bots analyze and understand your content, index it, and make it appear in the search results.

However, there might be times when you don’t want search engine bots to crawl and index your website content. That’s when you’d want to use a robots.txt file.

It’s a simple-yet-powerful tool used to allow or prevent the crawling and indexing of pages and folders on any website.

Where did it come from?

The robots.txt file resulted from a consensus among web developers back in the early days of the internet. They created a set of web standards that regulate how bots crawl a website, access content, index pages, and serve all of this to users. These standards were named the Robots Exclusion Protocol (REP), and the robots.txt file is part of it.

Now, website owners can add commands (or directives) to a robots.txt file and direct the bots to crawl only the pages of their choice.

However, you can’t control every aspect of crawling. For instance, you can’t dictate how often a crawler visits your website. Google, in particular, disregards any crawl-frequency guidelines included in a robots.txt file.

Why Does A Robots.txt File Matter?

A robots.txt file is not a mandatory part of a website. Websites can go live and get indexed properly even without it.

However, it’s extremely important for ecommerce websites.

Three of the most logical reasons why it matters are:

  • Stay Within The Crawl Budget

“Crawl budget” means the number of web pages Google can crawl on a particular website in a certain duration. This budget or “number” varies depending on your website’s loading speed, internal linking, size, and overall health.

But why is the crawl budget important?

This number is critical because the total number of your website pages could exceed your crawl budget. This means your website might be left with some uncrawled pages that won’t be indexed.

So, when you use robots.txt to block unnecessary pages, you allow Google to spend the crawl budget where it’s important.

Some prime examples of unnecessary pages include login pages, pages on staging websites, and internal search results pages.

  • Avoid Duplicate Content Issues

Duplicate content is a common SEO problem that occurs when you have the same or nearly identical content on multiple pages of your ecommerce website. It is often caused by duplicate page titles or meta descriptions, or by pages with very similar body content.

This generally leads to search engines indexing only one of the pages while leaving out all duplicates. This could have a detrimental effect on your website rankings as search engine algorithms prioritize unique content.

You can prevent this issue by using robots.txt to block duplicate pages from being crawled.

  • Keep Data Private

Finally, you can use robots.txt to hide private user information or prevent private data from appearing in search engine results.

For example, if you have sensitive information stored on your website, you can add directives in robots.txt to keep compliant crawlers away from it.

Some of the common ecommerce pages with sensitive user information are:

  • www.example.com/cart
  • www.example.com/account
  • www.example.com/login
  • www.example.com/checkout

Or, you might have private business information on PDFs, images, and videos meant for your internal matters.

A robots.txt file can keep all of these out of search results. Keep in mind, though, that the file itself is publicly readable, so it discourages crawling rather than securing data; truly sensitive information needs proper access controls.
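As a sketch, the private pages listed above could be blocked in a single block. The snippet below (the domain and paths are hypothetical) uses Python’s standard-library urllib.robotparser to show how a compliant crawler interprets such rules:

```python
from urllib import robotparser

# Hypothetical ecommerce rules blocking the private pages listed above
rules = """\
User-agent: *
Disallow: /cart
Disallow: /account
Disallow: /login
Disallow: /checkout
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://www.example.com/cart"))      # False: blocked
print(rp.can_fetch("*", "https://www.example.com/products"))  # True: crawlable
```

Only well-behaved bots honor these rules, which is exactly why they are no substitute for real access control.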

Now that you know what a robots.txt file is and why it matters for ecommerce SEO, let’s see how it works.

How Does A Robots.txt File Work?

The robots.txt file is a plain text file with no HTML markup, which is why it has a .txt extension. The file resides on your web server like all other website files and can be viewed by visitors too.

However, it can’t be accessed by browsing your website through the homepage. Instead, a user will need to add robots.txt at the end of your website’s URL, like https://www.example.com/robots.txt.

This is why users don’t accidentally stumble upon any website’s robots.txt file.

You can also access the robots.txt file through the File Manager in your cPanel dashboard; it sits in your site’s root folder (usually public_html).

In any case, regular website visitors are unlikely to come across your robots.txt file by accident.

Web crawlers or bots, on the other hand, can easily spot it in your website directory. In fact, good bots like Google’s crawlers will first look for a robots.txt file before crawling your website.

Spam bots or bad web robots, however, will completely ignore a robots.txt file and crawl whatever they can find on your website.

That is how robots.txt files work.

Now, let’s get to understanding their syntax and see what should be in a robots.txt file.

Robots.txt Basics

Syntax

The syntax of a robots.txt file is surprisingly simple and straightforward. It’s just a bunch of instructions, more commonly known as directives, written in blocks of plain text.

This is how these blocks look:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

As you can see, each block begins by addressing a user agent and then stating what action you want that user agent to perform.
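You can verify how a compliant crawler reads these exact blocks with Python’s standard-library urllib.robotparser (the example.com URLs are placeholders):

```python
from urllib import robotparser

# The three blocks from the example above, parsed with Python's stdlib parser
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://www.example.com/page"))        # True
print(rp.can_fetch("bingbot", "https://www.example.com/not-for-bing/")) # False
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/page"))     # False
```

Notice that a bot matches its own named block first and only falls back to the * block when no specific block applies.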

User Agents

A user agent is simply a web browser or web crawler, such as Googlebot.

The first line of the block will address user agents or bots and then give them directives to follow. You can do it in two ways:

  • Address a specific user agent, such as Googlebot.
  • Use the wildcard (*) to address them all.

Remember that there are hundreds of user agents, and a single search engine may even run several different crawlers. To target a specific crawler, address it by its exact user agent name. Some of the most popular ones are:

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
User-agent: Googlebot-News
User-agent: Bingbot
User-agent: Baiduspider
User-agent: msnbot
User-agent: Slurp (Yahoo)

User Agent Directives

The second part of a typical text or block in the robots.txt file is the user agent directive.

User agent directives are the instructions you give to a web crawler or bot. You can use any of the following directives:

  • Allow Directive
  • Disallow Directive
  • Crawl-Delay Directive
  • Sitemap Directive
  • Noindex Directive

Let’s study them all in detail.

User Agent Directives And Their Uses

  • Allow Directive

The ‘Allow’ directive explicitly tells bots which pages to crawl and index, and is most useful for carving out exceptions within a disallowed section. For example, if you want Googlebot to crawl your contact page and display it in the search results, you can add an Allow directive to your robots.txt file:

User-agent: Googlebot
Allow: /contact/

Similarly, if you want the bot to crawl the Photos directory of your website, you may write:

User-agent: Googlebot
Allow: /Photos/

  • Disallow Directive

The ‘Disallow’ directive tells bots which pages not to crawl and index.

Stop Bots From Crawling Certain Pages

For example, if you want Googlebot to not crawl your ‘About’ page, you may add the following code snippet to your robots.txt:

User-agent: Googlebot
Disallow: /about/

If you want to do the same for all bots, you can add the following line of code:

User-agent: *
Disallow: /about/

This will stop all search engine bots from crawling your website’s about page.

Important: A page can sometimes get indexed even when it is disallowed in robots.txt. This happens when external links point to that page, letting Google index the URL without crawling it.

Therefore, adding a noindex meta tag is the most efficient way to ensure your page is not indexed.

Stop Bots From Crawling Your Entire Website

If you want all bots to stay away from crawling any part of your site, you may write:

User-agent: *
Disallow: /

However, if you leave the Disallow directive empty, this would change things. For example:

User-agent: *
Disallow:

This directive will allow all search engines to crawl your whole website.

Dropping just one character would change everything. So, make sure you’re careful with it.

Stop Bots From Crawling Files

You can also use the Disallow directive to block bots from crawling certain file types. For example, if you want to prevent all search engine bots from crawling any images, you can write:

User-agent: *
Disallow: *.png$

This will stop all bots from crawling any .png files from your website.

Or, if you wish for your PDF files to stay private, you may write:

User-agent: *
Disallow: *.pdf$

The “$” sign in the above directives instructs bots not to crawl any URL that ends with a .png or .pdf extension, respectively.
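Note that the “*” wildcard and trailing “$” are pattern-matching extensions documented by Google rather than part of the original REP; Python’s stdlib urllib.robotparser, for instance, treats paths literally. The helper below is a minimal, illustration-only sketch of Google-style matching:

```python
import re

def matches(pattern, path):
    """Match a robots.txt path pattern the way Google documents it:
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which becomes the regex '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, giving prefix matching by default
    return re.match(regex, path) is not None

print(matches("*.png$", "/images/logo.png"))      # True
print(matches("*.png$", "/images/logo.png?v=2"))  # False
print(matches("/admin", "/admin/settings"))       # True (prefix match)
```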

  • Crawl-Delay Directive

The ‘Crawl-Delay’ directive regulates how frequently a search engine bot may visit your website. You can specify the number of seconds that must pass before the next visit.

The main idea behind using this directive is to keep your website from slowing down due to excessive load.

Unfortunately, Google has discontinued its support for this directive. You’ll now have to adjust Google’s crawling speed through their tools inside Google Search Console.

However, you can still use this directive to tell Bing and Yandex bots to delay their crawling activities on your website.

Say you want crawlers to wait 20 seconds between visits. You can set a 20-second delay using the following directive:

User-agent: *
Crawl-delay: 20
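Parsers that support the directive, including Python’s standard-library urllib.robotparser, will report this value:

```python
from urllib import robotparser

# The crawl-delay block from above, parsed with the stdlib parser
rules = """\
User-agent: *
Crawl-delay: 20
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("*"))  # 20
```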

  • Sitemap Directive

This is not a commonly used directive; however, you can still use it to inform all major search engines where to find your website’s XML sitemap.

The sitemap directive is simple and looks like this:

Sitemap: https://www.example.com/my-sitemap.xml

However, it is recommended that you submit your sitemap to all search engines through their webmaster tools. Why?

Because it does the job just as efficiently and allows you to track and resolve errors and get other vital information related to your website.

  • Noindex Directive

This has always been a controversial directive, as there have been questions about its functionality.

Still, people used to add a noindex directive to the robots.txt file to block some URLs from getting indexed. In practice, though, support was unreliable, and blocked URLs could still appear in search results.

However, Google publicly announced that, as of September 1, 2019, the noindex directive is no longer supported. Users instead have to use noindex meta tags (as we mentioned earlier, too).

So, these are the directives you can use with a robots.txt file to:

  • Allow/disallow all search engine spiders from crawling your website
  • Allow/disallow a specific web crawler from crawling your website
  • Delay the crawling activities of most user agents
  • Submit your XML sitemap for a given website

Wondering how to create a robots.txt file?

Let’s go through a simple 4-step process to create one for your ecommerce store.

How To Create A Robots.txt File?

Creating a robots.txt file is a fairly easy process. You can either leverage an automatic robots.txt generator or create one for yourself.

Here’s how to do it yourself:

  1. Create a file in any text editor of your choice and save it with the name “robots.txt.”
  2. Add directives to the robots.txt file as explained above. (Make sure you properly understand each directive and know what result it will have on your website.)
  3. Once your robots.txt file is ready, upload it to your website. Remember, the procedure will vary based on your CMS and website structure. (Wear your researcher hat, and find your answers on Google.)
  4. Test your robots.txt file in Search Console and make sure it is publicly accessible.

That’s it — you’ve created and uploaded a robots.txt file for your ecommerce website.
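The four steps above can be sketched programmatically. Here, a minimal (hypothetical) ecommerce robots.txt is written to disk and then sanity-checked locally with Python’s stdlib parser before the Search Console test in step 4:

```python
import tempfile
from pathlib import Path
from urllib import robotparser

# Steps 1-2: write a minimal, hypothetical ecommerce robots.txt
content = """\
User-agent: *
Disallow: /cart
Disallow: /checkout

Sitemap: https://www.example.com/my-sitemap.xml
"""
path = Path(tempfile.mkdtemp()) / "robots.txt"
path.write_text(content)

# Step 4: sanity-check the saved file locally
rp = robotparser.RobotFileParser()
rp.parse(path.read_text().splitlines())
print(rp.can_fetch("*", "https://www.example.com/cart"))  # False
print(rp.site_maps())  # ['https://www.example.com/my-sitemap.xml']
```

A local check like this catches obvious mistakes, but the file still needs to be uploaded to your site root and verified in Search Console.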

Bonus: Robots.txt Recommended Practices

  • Take Care Of Case Sensitivity

When writing directives in a robots.txt file, remember that directive names like “Allow” or “Disallow” aren’t case-sensitive, so “allow” and “ALLOW” work just as well.

But the values added in front of the directives are case-sensitive. So, if your folder containing photos is named “Photos,” you must write it as /Photos/ and not /photos/.

  • Write Directives In New Lines

When writing multiple directives in a robots.txt file, avoid cramming them all in the same line.

Write each directive on its own line. This will help you maintain better readability and ensure that all directives work as expected.

Look at the following example:

Recommended:

User-agent: *
Disallow: /admin/
Disallow: /directory/

Not recommended:

User-agent: * Disallow: /admin/
Disallow: /directory/

  • Don’t Repeat User Agent Names

If you’re writing multiple commands and are targeting the same user-agent, avoid repeating the User-agent name for each command.

Instead, write the user agent name once and then list down all your directives.

Recommended:

User-agent: Googlebot
Disallow: /items-page
Disallow: /items-page-2

Not recommended:

User-agent: Googlebot
Disallow: /items-page

User-agent: Googlebot
Disallow: /items-page-2

Google’s crawler will follow the instructions in both cases, but using the crawler’s name once keeps things neat and well-organized.

  • Always Test & Validate Your Robots.txt File

After you write the directives and save your robots.txt file, always test it to make sure that everything works as expected.

A robots.txt file is prone to syntax errors and typos, so it’s good practice to validate it using the robots.txt Tester inside Google Search Console. The tester lets you check your file and confirm that each directive works as expected.
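As an illustration of what such validation catches, here is a toy syntax check in Python. It flags unknown directive names and missing colons; it is in no way a substitute for the Search Console tester:

```python
def lint_robots(lines):
    """Toy syntax check: flag lines that aren't blank, comments,
    or a known 'Field: value' directive."""
    known = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap"}
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() not in known:
            problems.append((number, raw.strip()))
    return problems

good = ["User-agent: *", "Disallow: /admin/"]
bad = ["User-agent: *", "Disalow /admin/"]  # typo and missing colon
print(lint_robots(good))  # []
print(lint_robots(bad))   # [(2, 'Disalow /admin/')]
```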

Finally, make sure to keep your robots.txt file up-to-date and constantly monitor it for any changes.

It’s important because outdated or incorrect directives can lead to unwanted indexing and crawling problems that might degrade your website’s performance and ranking.

Have Control Of Your Ecommerce SEO

So, this was our detailed guide on Robots.txt file and how to create one for your ecommerce store. We highly recommend creating and customizing your ecommerce website’s robots.txt file and leveraging it to maintain better control over search engine crawlers.

However, make sure not to make any mistakes while writing directives. And always test and validate your robots.txt file in Search Console whenever you change it.

Remember, crawling and indexing ecommerce websites require a comprehensive SEO strategy. So, make sure to create and maintain a robots.txt file as part of your overall SEO plan.
