Speak to the web robots

Last update: July 2, 2013
Web robots (also called crawlers, wanderers or spiders) are programs that traverse many pages on the World Wide Web by recursively retrieving linked pages. For various reasons, web robots are not always welcome to access certain web pages.

A simple method used to exclude web robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL “/robots.txt”. The file is made up of records built from two kinds of lines: User-agent, which names the robot the record applies to, and Disallow, which gives a URL path prefix that this robot should not retrieve.
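As an illustration of how a robot might apply such a file, here is a minimal sketch in Python using the standard library module urllib.robotparser. The robots.txt content, the robot name "BadBot" and the www.example.com URLs are assumptions made up for this example; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the robot calling itself "BadBot" is
# excluded from the whole site, every other robot only from /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved robot asks before each request.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))      # True
print(parser.can_fetch("Googlebot", "http://www.example.com/private/a.html"))  # False
print(parser.can_fetch("BadBot", "http://www.example.com/index.html"))         # False
```

The check is entirely voluntary: the server does not block anything, the robot simply decides not to send the request.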

This convention is not an official standard backed by a standards body, nor is it owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future web robots will use it. Consider it a common facility the majority of web robot authors offer the WWW community to protect web servers against unwanted accesses by their robots.

The latest version of the robots.txt document can be found at http://www.robotstxt.org/orig.html.

Another way to tell web robots what to do is to use the robots meta tag with the values index and follow (or noindex and nofollow). Information about these meta tags is available at the metatags.info website.

Examples:
index the current page and follow its links
<meta name="robots" content="index, follow" />
index the current page, but do not follow its links
<meta name="robots" content="index, nofollow" />
do not index the current page, but follow its links to other web pages
<meta name="robots" content="noindex, follow" />
do not index the current page and do not follow its links
<meta name="robots" content="noindex, nofollow" />
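To show how a robot could read these tags, here is a minimal sketch in Python based on the standard library module html.parser. The class name RobotsMetaParser and the function robots_policy are hypothetical names chosen for this example, and the sketch assumes that a page without a robots meta tag may be indexed and followed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_policy(html: str) -> tuple:
    """Return (may_index, may_follow) for one page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex, follow" /></head></html>'
print(robots_policy(page))  # (False, True)
```

The result applies to the page that carries the tag only; every other page of the site is evaluated on its own.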

There are more robots meta tag values. Sometimes search engines use descriptions from the ODP (Open Directory Project) as the title and snippet for a web result. The tag noodp lets you opt out of the ODP title and description. The tag noydir does the same for the Yahoo directory. The tag noarchive prevents search engines from showing the cached link for a page. The tag nosnippet prevents a snippet from being shown in the search results. The tag noimageindex lets you specify that you do not want your page to appear as the referring page for an image that appears in Google search results.
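Several of these values can be combined in one tag by separating them with commas. The following Python sketch builds such a tag from a list of values; the helper name robots_meta_tag is hypothetical, and the list of accepted values is limited to the ones mentioned on this page.

```python
# Values mentioned on this page; search engines may support others.
KNOWN_DIRECTIVES = (
    "index", "noindex", "follow", "nofollow",
    "noodp", "noydir", "noarchive", "nosnippet", "noimageindex",
)


def robots_meta_tag(*directives: str) -> str:
    """Build a robots meta tag from the given values (hypothetical helper)."""
    unknown = [d for d in directives if d not in KNOWN_DIRECTIVES]
    if unknown:
        raise ValueError("unknown robots value(s): " + ", ".join(unknown))
    return '<meta name="robots" content="{}" />'.format(", ".join(directives))


print(robots_meta_tag("noodp", "noarchive"))
# <meta name="robots" content="noodp, noarchive" />
```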