What do you need to keep in mind when making a good robots.txt file?

Question

What do you need to keep in mind when making a good robots.txt file?

1 Answer

Answer 1

The most important thing to remember when building your robots.txt file is to keep it simple. The file should be human-readable, so if you have trouble reading it yourself, it has probably become too complicated.

The fundamental purpose of the file is to tell web spiders (or bots, as they're commonly known) which URLs you don't want them to crawl. There may be particular bots that you don't want crawling your site at all, so you can give specific user-agents their own rules. However, be careful when you start creating different sets of rules for different user-agents, as a bot may well interpret the file differently from the way you intended.
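As a rough illustration of per-user-agent rules, here is a minimal sketch that checks a small robots.txt with Python's standard urllib.robotparser module. The file contents, the paths and the bot name "ExampleBot" are made up for this example, and this parser is only one interpretation of the rules; real crawlers may read the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the paths and the bot name are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: ExampleBot
Disallow: /
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A generic crawler falls under the "*" rules.
print(parser.can_fetch("SomeCrawler", "https://example.com/articles/1"))   # True
print(parser.can_fetch("SomeCrawler", "https://example.com/admin/users"))  # False

# ExampleBot has its own group, which disallows everything.
print(parser.can_fetch("ExampleBot", "https://example.com/articles/1"))    # False
```

Running a check like this against your own file is a cheap way to confirm that it still says what you think it says.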

It is web etiquette for bots to read and conform to the robots.txt file, but they may not necessarily do so, either accidentally or deliberately. You should make sure that your web server still gracefully handles bot requests to pages that you don't want them to visit. After all, robots.txt is publicly readable, so you can also think of it as a starting point for malicious people who will try to attack your site by making requests to exactly those URLs that you don't want in the search index.
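To make that concrete, here is a minimal sketch of a server-side check that does not depend on a bot having read robots.txt. The path prefixes and the status_for helper are hypothetical; a real site would hook the same idea into whatever authentication its web framework provides.

```python
from http import HTTPStatus

# Hypothetical prefixes that robots.txt disallows. They still need real protection,
# because nothing forces a client to honour robots.txt.
PROTECTED_PREFIXES = ("/admin/", "/internal/")

def status_for(path: str, is_authenticated: bool) -> HTTPStatus:
    """Pick a response status for a request, regardless of what robots.txt says."""
    if path.startswith(PROTECTED_PREFIXES) and not is_authenticated:
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK

print(status_for("/admin/users", is_authenticated=False).value)  # 403
print(status_for("/articles/1", is_authenticated=False).value)   # 200
```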

You may also find some of your bot-disallowed URLs appearing in search indexes if those URLs have been linked to from elsewhere on the web. Listing these URLs in your robots.txt helps, but it doesn't guarantee that they won't get crawled or indexed.

Another good piece of information to put in robots.txt is the location of your sitemap (or your sitemap index, if you need to have lots of sitemaps).
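For example, a Sitemap line can sit alongside the crawl rules. The sketch below, which assumes Python 3.8 or later for the site_maps() method, parses a made-up robots.txt and reads the sitemap location back out; the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a sitemap index; the URL is a placeholder.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap_index.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns a list of the Sitemap URLs, or None if there are none.
print(sitemap_parser.site_maps())  # ['https://example.com/sitemap_index.xml']
```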

Google, one of the most important search companies and one with extensive crawler experience, offers a lot of help for webmasters. As a starting point for how it uses robots.txt, have a read of this: Block or remove pages using a robots.txt file