Good Google - Writing for the most powerful robot in the world

Google "...is big. Really Big. You just won't believe how vastly hugely mind-bogglingly big it is." (excerpt from The Hitchhiker's Guide to the Galaxy)

Google is the most powerful information resource humans have ever constructed. The power of any major search tool boggles the mind, but contemplating the vastness of Google's complex simplicity can truly hurt one's brain. With over 8 billion references in its rapidly growing, organically generated index, Google sets the standards other search engines follow. Benefiting from a three-year reign as the undisputed leader of search, Google had a very good 2004 and looks poised to make 2005 even better.

In 2004, Google introduced more new and improved applications for its users than any other tech company, posted one of the most successful IPOs in business history using an unorthodox Dutch-auction format, and met or exceeded every challenge its rivals threw at it.

Google is no longer just a search engine; it is an advertising machine. Drawing roughly 90% of its revenues from paid advertising and contextual ad delivery, Google has had two major focuses this quarter. The first is increasing the number of places paid advertising might appear. The second is developing new products and features that will retain current users' loyalty and win new users from other search firms. Both initiatives rely heavily on Google's reputation for delivering fast, free, and relevant search results. Google has the world's largest database of indexed websites, and it acquires site information through its spider, GoogleBot.

GoogleBot is probably the best-known spider working the web today. It is also likely among the most analyzed applications ever written. On one level, GoogleBot is quite simple and can be depended on to act in a very specific manner. GoogleBot lives to follow links. It will chase down a link path until it can work no deeper into a site, and it will crawl any site linked to from any other site. Google finds the majority of new sites in its index by following links from established sites. If a link exists, Google will (A) find it, (B) follow it, (C) record every bit of information it can possibly record, and (D) weigh that information against a fairly rigid algorithm to determine the perceived topic or theme of a site for future reference. If a site in Google's index changes, Google will re-spider it as quickly as it can.
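To make the "find a link, follow it, record what you see" behaviour concrete, here is a minimal, illustrative link-following crawler written in Python. This is emphatically not GoogleBot; it is a toy sketch using only the standard library, and the seed URL and page limit are placeholders.

```python
# A toy breadth-first crawler: fetch a page, record it, queue every link found.
# Illustrative only -- real crawlers respect robots.txt, throttle requests, etc.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):
    """Follow links breadth-first, keeping a snapshot of each page fetched."""
    seen = set()
    queue = deque([seed_url])
    snapshot = {}                          # url -> raw HTML

    while queue and len(snapshot) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                       # dead or unreachable links are skipped
        snapshot[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)
    return snapshot

if __name__ == "__main__":
    pages = crawl("https://example.com")   # placeholder seed URL
    print(f"Fetched {len(pages)} page(s)")
```

Even this simple sketch shows why links matter so much: a page no one links to never enters the queue, so it never enters the snapshot.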

GoogleBot's mission is to create a snapshot of the World Wide Web and store it across Google's network of data centers around the world. When you retrieve information from Google, the results you see reflect Google's most recent snapshot of the web. Parts of that snapshot might be hours or even weeks old, but overall the index updates itself every minute of every day, 24/7. The fastest way to see exactly what Google considers the most recent version of your site is to click the "Cached" link, generally found below the main link Google displays for your site.

How GoogleBot behaves as it acquires sites is one thing; what Google does with the information its bot gathers is another. Google's method of ranking websites is extremely (and increasingly) complex. To understand how Google works today, a brief (and oversimplified) explanation of the principle of PageRank is in order.

Google was originally developed as a means of finding information in research documents at Stanford University, where its inventors, Larry Page and Sergey Brin, met as grad students. PageRank was developed as the basic sorting algorithm for their search tool (then known as BackRub) and was based on a very simple concept: trust.

Page and Brin understood that documents on the Internet could be linked together. They speculated that if someone took the time to code a link (by hand in those days) to another document, there was likely a relevance between the two documents. Why else would one researcher link to another researcher's work? Simply put, the more incoming links a particular document has, the better it ranks when sorted by PageRank. Given the academic environment in which it was developed, Google proved to be the perfect tool for intelligent users. Transferring that simplicity from a dorm room at Stanford to practically every living room and office space on Earth has been a great challenge for Google's engineers. While it is still somewhat based on the original, "democratic" nature of PageRank, Google's sorting algorithm has become infinitely more complicated.
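For readers who like to see the idea in motion, here is a minimal sketch of the classic PageRank calculation as described in Page and Brin's original work. The tiny four-page "web" and the damping factor of 0.85 are illustrative assumptions; Google's production ranking weighs far more than this.

```python
# Classic PageRank by power iteration over a toy link graph.
# The graph, damping factor, and iteration count are illustrative only.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}    # start with an even split

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:                     # dangling page: spread evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:                                # each outlink passes on a share
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# A four-page toy web: the page with the most incoming links floats to the top.
toy_web = {
    "home":     ["about", "products"],
    "about":    ["home"],
    "products": ["home", "about"],
    "blog":     ["home"],
}
for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
    print(f"{page:9s} {score:.3f}")
```

Running the sketch shows "home" (linked from every other page) earning the highest score and "blog" (linked from nowhere) the lowest, which is the "democratic" voting idea in miniature.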

Google continues to weigh the number of links directed towards a site as a positive indicator that relevant information is to be found there. Since links are the veins and arteries of the web, they remain the most important factor influencing Google's perception of a website's relevance. Because the Google index has grown so rapidly over the past six years, and because search engine marketers have learned how to use Google's behaviours to influence rankings, Google now weighs several other factors when considering the relevance of a site, but the core of the algorithm remains rooted in PageRank.