Linking Mistakes To Avoid (Part 2): Removing Orphaned URLs

Right now, as you read this, you probably have some orphaned URLs you don't know about, collecting dust in the forgotten pile at the bottom of the search engine indexes.

It happens to the best of us. Even I, the self-proclaimed Link Mensch, was humbled recently to discover several old URLs in AltaVista's database that no longer exist on my web server. Some expert I am.

During the life span of any web site, you create, update, rename and delete URLs on a regular or semi-regular basis. New files go up; old ones come down, or get renamed and archived. Sometimes entire web sites with thousands of pages get re-hosted on new servers using new content management tools. I've even seen cases where every URL on a site changed at once.

What you must remember is that while you have been diligently running your web site, adding, deleting, moving and archiving files and URLs, the search engine crawlers have spent years cruising the Web, and on occasion your server, on a hit-and-run basis. Maybe a crawler came across one of your URLs as it crawled a newsgroup post at Deja News a couple of years ago. Maybe a newsletter wrote about your site, and just as they archived that issue a crawler wandered by and stumbled onto your URL. There are countless ways a crawler could have found your URLs without ever going near your server. In fact, most of the URLs in any search engine's database were found and followed from sources other than your own site.

The question that matters most

Of all the URLs your site has ever had in its lifetime, how many of them are still in the database of any given search engine?

Search engines do not know whether the URLs they have recorded and indexed still exist at any given moment. So you may have updated your web site and removed URLs that the search engines still think exist. A search result is nothing but a placeholder for the actual page on its server. Search results are a list of links.

Every URL from your site that no longer exists but which a search engine thinks does still exist is like a lump of coal to be turned into a diamond. With search engines charging for indexing of URLs, it becomes even more important to revive those dead links before the engines find out they are dead and purge them. A purged URL is forever lost.

Nearly every marketer tries to get their site fully indexed by the search engines. Most site owners wish they could get more of their sites' pages indexed. If you have old links showing up in search results, count yourself lucky. And get busy making those dead links live again.

Finding them and fixing them

Here's one way to find out how many URLs from your site a search engine has indexed. Go to AltaVista, and in the search box type

host:your domain

(replacing your domain with whatever your domain is, for example host:pbs.org)

Look at the results. What you see is every file that AltaVista has in its index and thus thinks is active. Peruse the list. Put your mouse cursor over each clickable link, but don't click. Look at the status bar at the bottom of your browser to see the actual filename of the URL you're studying. Are all the filenames you see still in existence? Probably not. If some of them no longer exist on your site, create a new page with EXACTLY the same filename as the old one AltaVista thinks is still around, and get it on your server ASAP.
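If the list of results is long, checking filenames by eye gets tedious. Here is a minimal Python sketch of the same check, assuming you have pasted the indexed URLs into a list and that your site is served as plain files from a single folder. The function name, the index.html default and the folder layout are my assumptions for illustration, not anything a search engine provides.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def find_orphaned_urls(indexed_urls, doc_root):
    """Return the indexed URLs whose file no longer exists under doc_root."""
    root = Path(doc_root)
    orphaned = []
    for url in indexed_urls:
        # Keep only the path part of the URL, e.g. "site-map.html".
        path = unquote(urlparse(url).path).lstrip("/")
        # Assumption: a URL ending in "/" (or the bare host) maps to index.html.
        if path == "" or path.endswith("/"):
            path += "index.html"
        if not (root / path).is_file():
            orphaned.append(url)
    return orphaned
```

Call it with the URLs you copied from the search results and the local folder that mirrors your server, and it returns the ones that need a page put back. Sites that build pages dynamically would need a different check, such as requesting each URL and looking at the response.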

For example, let's say you used to have a sitemap page named site-map.html, and you see that file among the search results. Now let's say that six months ago you renamed that file to map.html and removed site-map.html from your server. The search engine has no idea you removed the URL, and still has a record of that page and what was on it. Put a page back at site-map.html and that listing leads somewhere again.
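One way to put the old filename back is simply to copy the current page to the old name. The sketch below does that for a whole list of renames at once. It assumes a static site stored in one folder; the function name and the old-name-to-new-name mapping are hypothetical, and you would fill in the mapping yourself.

```python
import shutil
from pathlib import Path


def restore_old_filenames(doc_root, renames):
    """Recreate each missing old file as a copy of its current replacement.

    renames maps an old filename to the file that replaced it,
    e.g. {"site-map.html": "map.html"}.
    """
    root = Path(doc_root)
    restored = []
    for old_name, new_name in renames.items():
        old_path = root / old_name
        new_path = root / new_name
        # Never overwrite a live file, and skip entries whose replacement is missing.
        if old_path.exists() or not new_path.is_file():
            continue
        old_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(new_path, old_path)
        restored.append(old_name)
    return restored
```

Run against a local copy of the site, restore_old_filenames(folder, {"site-map.html": "map.html"}) creates site-map.html with the same content as map.html and returns the names it restored. You then upload the restored files to your server.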

You can also examine your own server logs to find every page request that resulted in a 404 (file not found) response. This even works if you use custom 404 pages. This is how I discovered that one missing file on my site had been returning 404 errors about 30 times a day, or almost 1,000 times a month. I created a file with the same name and content as the one that no longer existed, and bingo, I recaptured every bit of that lost traffic. You can do the same thing. Start with your server logs and then try some test searches.
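If your log files are too big to read by hand, a short script can tally the 404s for you. This is a rough sketch that assumes your server writes the widely used Common Log Format, where the request appears in quotes followed by the status code. The function name is my own, and logs in another format would need a different pattern.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it, e.g.
#   "GET /site-map.html HTTP/1.0" 404
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+)[^"]*" (\d{3})\b')


def count_404s(log_lines):
    """Count how many times each requested path returned a 404."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "404":
            counts[match.group(1)] += 1
    return counts
```

To use it, open your log file (the name access.log is only an example) and pass the file object to count_404s, then call most_common(10) on the result to see the ten most requested missing files. Those are the filenames worth restoring first.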

If you want to find out what URLs the engines have indexed from your site, Danny Sullivan's Search Engine Watch site has a section just for this at http://searchenginewatch.com/webmasters/checkurl.html