Understanding Robots.txt

Search engines send their spiderbots crawling across the internet to visit every live website. They do this so that your site can be indexed and included in the Search Engine Results Pages (SERPs). The first thing spiderbots look for when they visit a site is the robots.txt file.

What is a Robots.txt file?

Robots.txt is a plain text file containing commands or rules that tell search engines to ignore certain pages or directories on your site. Here are some of the reasons web developers use a robots.txt file:

  • If you have pages or posts on your website with almost identical content to other posts or pages, you may want to block them from being crawled by spiderbots; duplicate content is viewed as a spamming practice and gets red-flagged by search engines.
  • If you don’t have a robots.txt file, every crawler request for it will return a 404 error
  • You can save a lot of bandwidth by not letting search engines crawl directories that are not essential to the growth of your site

How do you create a robots.txt file?

Create a plain text file and name it ‘robots.txt’. The entries or rules you put in robots.txt should follow this format:

<field>:<value>

A simple robots file uses the following basic fields:

  • User-agent: indicates the web robot the rules/commands apply to
  • Disallow: the URL path or directory you don’t want spiderbots to access

Example 1: This command tells spiderbots to ignore your entire website:

User-agent: *
Disallow: /

This means that all files inside the root directory should be bypassed by crawlers (* matches all crawlers, and / points to the root directory).

Example 2: This allows all spiderbots to access all directories on your site:

User-agent: *
Disallow:

Example 3: This command stops Googlebot-Image, Google’s image crawler, from crawling your images so that they don’t get displayed in image search results:

User-agent: Googlebot-Image
Disallow: /

Example 4: The commands below allow only Googlebot to access your site; all other crawlers are blocked:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Where do you put the robots.txt file?

After specifying the rules you want, upload your robots.txt file to the root folder of your website (the www or public_html directory) through FTP (File Transfer Protocol) so that it sits directly under your domain.
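For example, assuming your site is served from public_html and your domain is www.example.com (both placeholders here), the layout would look like this:

public_html/
    index.html
    robots.txt

Once uploaded, the file should load at http://www.example.com/robots.txt. If that URL returns a 404, crawlers will simply behave as if no rules were set.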

Other parameters you can include in your robots.txt

  • Request-rate: defines how fast pages may be requested. A rate of 1/25 means that 1 page should be crawled every 25 seconds
  • Visit-time: defines a specific window when you want your pages crawled. A visit time of 0200-0430 means that your pages will get crawled between 02:00 AM and 04:30 AM.
  • Sitemap: tells the spiderbots where your sitemap is located. You must include the full sitemap URL here (a combined example follows this list).
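Below is a minimal sketch combining these parameters. The domain, the /tmp/ directory and the sitemap path are placeholders, and keep in mind that Request-rate and Visit-time are non-standard extensions that many major crawlers ignore, whereas Sitemap is widely supported.

User-agent: *
Disallow: /tmp/
Request-rate: 1/25
Visit-time: 0200-0430

Sitemap: http://www.example.com/sitemap.xml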

The robots.txt file is used to tell spiderbots what they should NOT visit, not what they should visit. This means that if a directory of 100 pages you don’t want crawled also contains one page you do want crawled, you have to place that single page elsewhere, as in the sketch below.
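As a rough sketch, assuming a hypothetical /private/ directory that holds the 100 pages you want hidden:

User-agent: *
Disallow: /private/

With this rule in place everything under /private/ is off-limits, so the one page you still want crawled has to be moved out of that directory, for example to the site root.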


Article By: Sanjay Modasia

Sanjay is our lead developer with over 6 years of experience, leading a strong team of developers with varied industry expertise to deliver projects on deadline.

