Robots Exclusion Protocol: One Standard Applicable to All Search Engines

Google, Yahoo!, and Microsoft worked together to publish details of how each of them interprets the various robots directives, consolidating their efforts into one standard that applies to all three. Webmasters and web publishers can therefore manage how search engines access and display their content by applying a single set of directives instead of three.

Web robots (also known as crawlers or spiders) are programs that traverse the Web by following links and gathering data from web pages. Search engines like Google, for instance, use them to index web content.

The Robots Exclusion Protocol (REP) is the means by which site publishers communicate with search engines. This protocol lets them indicate which pages of their sites they want exposed to the public and which pages they prefer to keep off-limits to robots.

The REP dates back to the early 1990s. Over the years it has evolved to support not only "exclusion" directives but also directives controlling what content is included, how that content is displayed, and how frequently it is crawled. The REP has three main advantages: it has evolved alongside the web, it is universal and can therefore be implemented across all search engines and robots, and it works for publishers of any scale.

For robots.txt, all three search engines support the following:

Previously the only directive was "Disallow", which indicates which pages or paths of a site should be ignored by search engine robots. Publishers can now also use the "Allow" directive, which tells a crawler the specific pages on a site that it may crawl and index. Allow can be combined with disallowed URL paths, wildcards can be used in those paths (such as /*.pdf to match PDF files), and a "Sitemap" directive can tell a crawler where to find the site's Sitemap XML file, as shown in the example below.
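
For example, a minimal robots.txt combining these directives might look like the following sketch (the paths and Sitemap URL are purely illustrative placeholders):

# Illustrative sketch only; the paths and Sitemap URL are placeholders
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Disallow: /*.pdf$
Sitemap: http://www.example.com/sitemap.xml

Here the crawler is asked to skip the /private/ directory and all PDF files, to make an exception for the one explicitly allowed page, and to look for the site's Sitemap at the stated URL.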

For robots meta tags, all three search engines support the following:

HTML META directives can be placed either in the HTML of a page or, for non-HTML content such as PDF and video files, in the HTTP header using an X-Robots-Tag. The X-Robots-Tag, added to the HTTP header, gives webmasters the flexibility to apply these exclusions to PDF, Word, PowerPoint, video, and other file types, as well as to HTML files.
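
For instance, a server response delivering a PDF might carry the directive in its headers like this (the status line and Content-Type are shown only for context; the example is a sketch):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive

This keeps the document out of the index and prevents a cached copy from being offered, without modifying the file itself.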

The NOINDEX META tag and the NOFOLLOW META tag control, respectively, whether the robot may use the page's text for the search index and whether it may follow the links to other content on a given page.
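
In a page's HTML, these directives go in a single robots META tag, for example (an illustrative sketch):

<!-- Illustrative sketch: keep this page out of the index and do not follow its links -->
<meta name="robots" content="noindex, nofollow">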

The NOSNIPPET META tag tells search engines not to display a snippet of matching text for the page on the search results page, and the NOARCHIVE META tag tells search engines not to offer a cached copy of the page; this is especially useful for pages that change frequently, such as headline listings. The NOODP META tag tells search engines not to use the title and description from the Open Directory Project for this page in search results.
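
These directives can likewise be combined in one robots META tag, for instance (again an illustrative sketch):

<!-- Illustrative sketch: no snippet, no cached copy, no ODP title or description -->
<meta name="robots" content="nosnippet, noarchive, noodp">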

All of these directives provide website publishers with much more control over their sites’ search results.

According to Search Engine Land, this move by the three major search engines "may be an effort to show a consolidated front in light of the ongoing publisher attempts to create new search engine access standards with ACAP. This direction reflects the ongoing direction of the messaging the search engines have had about ACAP." For instance, Rob Jonas, Google's head of media and publishing partnerships in Europe, said in March that "the general view is that the robots.txt protocol provides everything that most publishers need to do."

For more information, see each of the search engines’ blog posts:

http://www.ysearchblog.com/archives/000587.html

http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html

http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx