System Downtime

Downtime can disrupt your business, inconvenience your customers, and damage your company's reputation. But how do you prevent or minimize downtime? Can a server monitoring service help? To answer these questions, we first need to understand the causes of downtime. Downtime can be broken down into three distinct categories.

Category      Description
Planned       Normal downtime that is planned and scheduled in advance.
Semi-Planned  Software or hardware upgrades that are scheduled, but not entirely by your company. For example, a vendor releases security patches that must be applied quickly to avoid a vulnerability. Your company sets the schedule, but the schedule is largely driven by others.
Unplanned     Events that force immediate downtime, such as hardware or software failures, operator error by the administrator, malicious acts, disasters, and ISP maintenance windows.

There is really nothing that can be done to eliminate planned downtime short of establishing fully redundant failover systems. Without redundant systems, the impact can be minimized by choosing "maintenance windows" that cause the least interruption for your business and customers. Analyzing core business hours, server logs, and purchase patterns will help identify those periods. Generally speaking, web sites experience their lowest volume on weekends between 4:00 AM and 5:00 AM. Most customers understand that some level of maintenance is required and grudgingly accept maintenance windows. Home Depot, for example, updates its web site very early in the morning. While doing so, it displays a friendly message stating that the system is being updated to better serve customers. The message addresses the issue and presents the downtime gracefully, so users are not greeted with a 404 or "server not found" error.
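As a rough illustration of the server log analysis mentioned above, the sketch below counts requests for each weekday and hour and reports the quietest slot. It assumes the web server writes Common Log Format timestamps, such as [12/Mar/2023:04:15:02 +0000]. The function names are made up for this example, and the pattern would need adjusting for other log formats.

```python
import re
from datetime import datetime

# Timestamp as written in Common Log Format, e.g. [12/Mar/2023:04:15:02 +0000].
# Assumption: your server logs in this format; adjust the pattern otherwise.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def requests_per_slot(log_lines):
    """Count requests for every (weekday, hour) slot in the week."""
    # Start all 168 slots at zero so hours with no traffic are still counted.
    counts = {(day, hour): 0 for day in range(7) for hour in range(24)}
    for line in log_lines:
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[(stamp.weekday(), stamp.hour)] += 1
    return counts


def quietest_slot(log_lines):
    """Return (weekday name, hour, request count) for the least busy slot."""
    counts = requests_per_slot(log_lines)
    # On a tie, the earliest slot in the week wins.
    (day, hour), hits = min(counts.items(), key=lambda item: item[1])
    return DAYS[day], hour, hits
```

Passing the lines of an access log to quietest_slot returns the weekday, the hour, and the number of requests seen in that slot. Several weeks of logs give a more trustworthy answer than a single day, and the result should still be weighed against core business hours and purchase patterns.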

Semi-planned maintenance periods are periods that you schedule; however, the timetable is normally driven by others. An example is vendor-released security patches that require a reboot or a service restart. While immediate action is not required, it is in your best interest to install them quickly so the associated vulnerability can be avoided. As with planned downtime, nothing can be done to eliminate this type of downtime other than the use of redundant systems. The impact of semi-planned downtime can be minimized if maintenance windows are used and user-friendly messages are displayed.

Of the three types of downtime, unplanned downtime has the potential to cause the most disruption, loss of revenue, and loss of customer confidence. While most customers understand that some degree of downtime may be required, few are understanding when systems simply do not respond or display cryptic error messages. Fortunately, several steps can be taken to lessen the impact of unplanned downtime. The first step is knowing that the system is down. Without some sort of automated monitoring in place, a downed system is all too often reported by a coworker or, even worse, a customer. A web search will reveal a wide range of companies offering monitoring services to meet just about any budget and business need.
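To make the idea of automated monitoring concrete, here is a minimal sketch of a single availability check using only Python's standard library. It is not a substitute for a monitoring service: a real monitor would run on a schedule, from a machine outside your own network, and would send alerts. The function names and the opener parameter are inventions of this example.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Make one HTTP request and return (is_up, detail).

    The opener parameter exists only so the check can be tried out
    without a network; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500 or 503.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # No usable answer: DNS failure, refused connection, timeout, etc.
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"


def find_outages(urls, check=check_site):
    """Check each URL once and return {url: detail} for those that are down."""
    outages = {}
    for url in urls:
        is_up, detail = check(url)
        if not is_up:
            outages[url] = detail
    return outages
```

Running find_outages every few minutes and emailing or paging someone whenever the result is not empty covers the basic goal: IT staff learn about the outage before a coworker or customer has to report it.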

Next, get a user-friendly message displayed if the affected system is a web site or, in the case of a file/print server, notify your users of the outage. While redundant systems offer the best protection, many low-cost options exist. An older machine can be built, hosted, and set up to display a friendly "web site down" message. After a quick change to a DNS entry, the machine can be live. Telephone systems, email, intercom systems, or a nicely printed message posted on office doors or by the elevators can all be used to alert internal users. Remember, the goal is to greet people with a friendly message that lets them know you are aware of the issue and are working to resolve it.
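The standby machine described above does not need much software. As a sketch only, the following uses Python's built-in http.server module to answer every request with a friendly page and an HTTP 503 (Service Unavailable) status, which signals that the outage is temporary. The site name, port number, and retry interval are placeholder assumptions, not recommendations.

```python
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 1800  # assumed guess at the outage length; adjust as needed


def maintenance_body(site_name):
    """Build the HTML shown to visitors while the real site is down."""
    page = (
        "<!DOCTYPE html><html><head><title>We'll be right back</title></head>"
        f"<body><h1>{escape(site_name)} is temporarily unavailable</h1>"
        "<p>We are aware of the issue and are working to resolve it. "
        "Please check back shortly.</p></body></html>"
    )
    return page.encode("utf-8")


class MaintenanceHandler(BaseHTTPRequestHandler):
    site_name = "Our web site"  # placeholder; replace with your own name

    def do_GET(self):
        body = maintenance_body(self.site_name)
        # 503 marks the outage as temporary for browsers and search engines.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.end_headers()
        self.wfile.write(body)

    do_POST = do_GET  # answer form submissions with the same message


def serve(port=8080):
    """Serve the maintenance page on the given port until interrupted."""
    HTTPServer(("", port), MaintenanceHandler).serve_forever()
```

Calling serve() on the standby machine and pointing the DNS entry at it means visitors see the friendly message rather than a browser error. Any web server configured with a static page would do the same job; this version simply shows how little is required.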

The final step is to resolve the unplanned issue. I am always surprised at the wide range of attitudes IT staff display concerning outages. Some people are in no rush regardless of the severity, while others will move heaven and earth. It is important to set IT staff expectations. I convey that the only thing that takes priority over a virus or system outage is loss of life. Simply put, if you are giving mouth-to-mouth or pulling someone out of a burning building, you are excused. Otherwise, stop whatever you are doing and get the issue resolved. This may sound drastic, but it clearly sets staff expectations to quickly resolve outages.

Clearly, redundant systems that are monitored offer the best protection against downtime. With shrinking IT budgets over the past several years, implementing redundant systems may not be possible or practical. However, monitoring your systems, determining maintenance windows, greeting users with friendly messages, and timely IT staff reaction can all assist in lessening the impact.