Setting Up a Robots.txt File

The service file robots.txt contains instructions for search robots. Using this file, you can prohibit or restrict the access of search robots to certain pages or to the entire site, and you can set different restrictions for different types of robots. Reducing the permissible frequency of robots' requests lowers the load on the site and the server; on the other hand, blocking robots completely can lower the site's position in search results. It is therefore important to configure robots.txt correctly. The file is located in the root directory of the site: /public_html/robots.txt.

Robots.txt format

  • Each rule in the robots.txt file is written on a new line as follows:
    directive_name: value
  • A new block of rules in the robots.txt file starts with the User-agent directive. Blank lines are not allowed inside a block; a blank line separates one block from the next (a complete example is shown after this list).
  • Comments start with the # character.
  • The file name must be in lower case: ROBOTS.TXT and Robots.txt are incorrect names.
  • For some robots, directives must be specified separately. For example, YandexDirect takes into account only the rules created specifically for it, while GoogleBot ignores Host and Crawl-delay.
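
As an illustration of this format, here is a minimal sketch of a robots.txt file; the bot name and the /admin path are placeholders, not a recommendation for any particular site:

User-agent: YandexBot # block of rules for the Yandex index bot
Disallow: /admin # deny access to the /admin section

User-agent: * # block of rules for all other robots
Disallow: # an empty directive, the entire site is available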

It is recommended to check the robots.txt file with special services, for example, the robots.txt analysis tools in Yandex.Webmaster and Google Search Console.

Directives used

User-agent

Each block starts with the User-agent directive, which indicates the robot the block's rules apply to.

For example, to set a rule for the Yandex index bot, enter:

User-agent: YandexBot

To apply the same rules to both the Yandex and Google index bots, enter:

User-agent: YandexBot
User-agent: Googlebot

To apply the rule to all robots:

User-agent: *

Disallow and Allow

The Disallow and Allow directives deny or allow access to sections of the site.

For example, to deny access to the entire site, enter:

Disallow: /

To allow access to the site's /catalog1 directory at the same time, enter:

Allow: /catalog1

If you want to disable the indexing of the /catalog1/* pages, but allow the /catalog1/catalog12 pages, enter:

User-agent: * # or bot_name
Disallow: /catalog1
Allow: /catalog1/catalog12

Each rule is written on a separate line and can refer to only one directory; a separate rule must be specified for each directory, as shown in the sketch below.
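
For example, a sketch that closes two hypothetical directories, /tmp and /private, each with its own Disallow rule:

User-agent: *
Disallow: /tmp # first directory, its own rule
Disallow: /private # second directory, a separate rule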

It is recommended to restrict certain bots' access to the site; this reduces the load on it. For example, the majestic.com service uses the MJ12bot crawler, and ahrefs.com uses AhrefsBot. To deny access to several bots at once, enter:

User-agent: MJ12bot # rule applies to MJ12bot
User-agent: AhrefsBot # rule applies to AhrefsBot
User-agent: DotBot # rule applies to DotBot
User-agent: SemrushBot # rule applies to SemrushBot
Disallow: / # deny access to the entire site

  • Disallow: - an empty directive; it prohibits nothing.
  • Allow: / - the directive allows everything.
  • $ - denotes an exact match. For example, the directive Disallow: /catalog$ denies access only to /catalog; access to /catalog1 or /catalog-best remains allowed (see the sketch after this list).
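
For illustration, a sketch with hypothetical directory names showing the difference between an exact match and a prefix match:

User-agent: *
Disallow: /catalog$ # blocks only /catalog itself
Disallow: /archive # blocks /archive, /archive1, /archive/page.html and so on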

Sitemap

If the sitemap.xml file is used to describe the site structure, you can specify the path to it:

User-agent: *
Disallow:
Sitemap: https://domain.com/path-to-file/sitemap.xml

Host

This directive tells Yandex robots the location of the site mirror.

For example, if the site is also located on the https://domain2.com domain, enter:

User-agent: YandexBot
Disallow:
Host: https://domain2.com

The robot accepts only the first Host directive specified in the file, the rest are ignored.

If http is used, the mirror can be specified without the protocol - domain2.com. If https is used, the protocol must be specified - https://domain2.com.

The Host directive is specified after Disallow and Allow.
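
Putting this together, a sketch of a block in which Host follows Disallow and Allow; the paths and the mirror address are placeholders:

User-agent: YandexBot
Disallow: /admin # deny access to a section
Allow: /catalog1 # allow access to a directory
Host: https://domain2.com # the main mirror, specified after Disallow and Allow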

Crawl-delay

The Crawl-delay directive sets the minimum interval with which robots can access the site. This reduces the load on the site.

The value is specified in seconds (a period is used as the decimal separator).

User-agent: YandexBot
Disallow:
Crawl-delay: 0.6

The Crawl-delay directive is specified after Disallow and Allow.

For the Google bot, the crawl rate is set in Search Console instead.

Clean-param

This directive is intended for Yandex and allows you to exclude pages with dynamic parameters in their URLs from indexing. The robot will not repeatedly index the same content, which avoids additional load.

For example, the site has pages:

www.domain1.com/news.html?&parm1=1&parm2=2
www.domain1.com/news.html?&parm2=2&parm3=3

In fact, these are two copies of the same page with different dynamic parameters. To keep Yandex from indexing every copy of this page, add the directive:

User-agent: YandexBot
Disallow:
Clean-param: parm1&parm2&parm3 /news.html

Parameters that the robot should ignore are listed first, separated by &, followed by the path of the pages to which the directive applies.
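
As a further sketch, assuming a hypothetical sessionid parameter that is appended to pages under /catalog/, the directive tells the robot to ignore sessionid on those pages:

User-agent: YandexBot
Disallow:
Clean-param: sessionid /catalog/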

If you have any questions, please create a ticket to technical support.