
PromotionWorld.com Editor's Desk

Robots Exclusion Protocol: One Standard Applicable to All Search Engines


June 16, 2008


Google, Yahoo!, and Microsoft have worked together to post details of how each of them interprets the different tags, consolidating their efforts into one standard applicable to all three. Webmasters and web publishers can now manage how search engines access and display their content by applying a single set of directives instead of three.

Web robots (also known as crawlers or spiders) are programs that traverse the Web, following links and gathering data from web pages. Search engines such as Google, for instance, use them to index web content.

The Robots Exclusion Protocol (REP) is the means by which site publishers communicate with search engines. The protocol lets them indicate which pages of their sites they want exposed to the public and which pages they prefer to keep hidden from robots.

The REP dates back to the early 1990s. Over the years it has evolved to support not only "exclusion" directives but also directives controlling which content is included, how that content is displayed, and how frequently it is crawled. The REP has three main advantages: it has evolved alongside the web, it is universal and can therefore be implemented across all search engines and robots, and it works for publishers of any scale.

For robots.txt, all search engines support:

Previously the only directive was "Disallow", which indicates which pages or paths of a site should be ignored by search engine robots. Publishers can now also set the "Allow" directive, which tells a crawler the specific pages of a site that should be indexed. This can be combined with disallowed URL paths and wildcards in those paths (such as *.pdf), and publishers can also tell a crawler where to find the site's XML sitemap.
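As a rough sketch (the paths and the example.com sitemap URL below are hypothetical), a robots.txt file combining these directives could look like this:

    User-agent: *
    Disallow: /private/
    Allow: /private/press-release.html
    Disallow: /*.pdf$
    Sitemap: http://www.example.com/sitemap.xml

Here "Disallow" blocks an entire path, "Allow" re-opens one page within it, the wildcard pattern excludes PDF files, and "Sitemap" points crawlers to the site's XML sitemap.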

For robots meta tags, all search engines support:

Meta directives can be placed either in the HTML of a page or, for non-HTML content such as PDFs and video, in the HTTP header using an X-Robots-Tag. The X-Robots-Tag directive, added to the HTTP header, gives webmasters the flexibility to apply exclusions to PDF, Word, PowerPoint, video, and other file types, including HTML files.
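For illustration (the file and directive values are hypothetical), a server's response for a PDF might carry the directive in its HTTP headers like this:

    HTTP/1.1 200 OK
    Content-Type: application/pdf
    X-Robots-Tag: noindex, noarchive

The crawler never sees a META tag inside the PDF itself; the exclusion travels in the header instead.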

The NOINDEX and NOFOLLOW META tags control, respectively, whether the robot should use the page text for the search index and whether it should follow the links to other content on a given page.
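A minimal sketch of these two directives placed in a page's HTML head:

    <meta name="robots" content="noindex, nofollow">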

The NOSNIPPET META tag tells search engines not to display a text snippet (the matching words in context) on the search results page, and the NOARCHIVE META tag tells search engines not to keep a cached copy of the page. This is especially useful for pages that change frequently, such as headline listings. The NOODP META tag tells search engines not to use the page's Open Directory Project description in search results.
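These directives can likewise be combined in a single META tag, for example:

    <meta name="robots" content="nosnippet, noarchive, noodp">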

All of these directives provide website publishers with much more control over their sites’ search results.

According to Search Engine Land, this move by the three major search engines "may be an effort to show a consolidated front in light of the ongoing publisher attempts to create new search engine access standards with ACAP." It also reflects the search engines' ongoing messaging about ACAP. For instance, Rob Jonas, Google's head of media and publishing partnerships in Europe, said in March that "the general view is that the robots.txt protocol provides everything that most publishers need to do."

For more information, see each of the search engines’ blog posts:

http://www.ysearchblog.com/archives/000587.html

http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html

http://blogs.msdn.com/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx


                





Comments (1)

Title: ACAP and REP

June 16, 2008
Comment by Heidi Lambert

The ACAP team welcomes what Microsoft, Google and Yahoo! have done regarding REP. This is a very useful first step towards a solution that will meet ACAP's objective of giving publishers better control over the reuse of their content online, while avoiding the imposition of unnecessary cost or complexity. Greater transparency and improved consistency of interpretation of REP will make it much easier for the ACAP technical team to work with the search engines to bridge the gaps and make existing protocols compatible with the current requirements of copyright owners. Anyone who wants to know more about ACAP and understand how it differs from the Robots Exclusion Protocol should visit our website at www.the-acap.org.

ACAP Project Director, Mark Bide
