How do you define high availability and disaster recovery?

A while back I was on a call with someone who asked me the difference between high availability (HA) and disaster recovery (DR), saying that there are so many different solutions out there and that a lot of people seem to use the terminology but are unable to explain anything more about these two descriptions. So, here’s an attempt to demystify things.

First of all, let’s take a look at the individual terms:

High Availability

According to Wikipedia, you can define availability in the following ways:

The degree to which a system, subsystem, or equipment is operable and in a committable state at the start of a mission, when the mission is called for at an unknown, i.e., a random, time. Simply put, availability is the proportion of time a system is in a functioning condition.

The ratio of (a) the total time a functional unit is capable of being used during a given interval to (b) the length of the interval.

And most online dictionaries seem to have a similar definition of availability. When we are talking about HA, we imply that we want the functioning condition of your system to be increased.

Going by the above you will also notice that there is no fixed definition of the availability. Simply put, it would mean that you need to put your own definition in place when talking about HA. You need to define what HA means in your environment. I’ve had customers that needed HA and defined this as the system having a certain amount of uptime, which is one way to measure it.

On the other hand you would be hard pressed if you were able to work with your system, but the data that you were working with was corrupted because one of your power users made an error during a copy job and wrote an older data set in the wrong spot. This would mean that your system is in itself available. You can log on to it, you can work with it, but the output you are going to get will be wrong.

To me, such a scenario would mean that your system isn’t available. After all, it’s not about everything being online. It’s about using a system in the way you would expect it to work. But when you ask most people in IT about availability, the first thing you will likely hear is something related to uptime or downtime. So, my tip to you is once again:

Define what “available” means to you and your organization/customer!

Disaster Recovery

Natural disasterLet’s do the same thing as before and let’s turn to some general definitions. Wikipedia defines disaster the following way:

disaster is a perceived tragedy, being either a natural calamity or man-made catastrophe. It is a hazard which has comes to fruition. A hazard, in turn, is a situation which poses a level of threat to life, health, property, or that may deleteriously affect society or an environment.

And recovery is defined the following way (when it comes to health):

Healing, or Cure, the process of recovering from an injury or illness.

So, in a nutshell this is about bouncing back to your feet once a disaster strikes. Now again, it’s important to define what you would call a disaster, but at least there seems to be some sort of common understanding that anything that would get you back up and running after an entire site goes down, usually falls under the label of a DR solution.

It all boils down to definitions!

When you talk to other companies or vendors about HA and/or DR, you will soon notice that most have a different understanding of what HA and DR are. Your main focus should be to have a clear definition for yourself. Try to find out the importance and value of your solution and base your requirements on that. Ask yourself simple questions like for example:

  • What is the maximum downtime I can cope with before I need to start working again? 8 hours per year? 1 hour per year? 4 hours per month? What is my RPO and RTO
  • How do I handle planned maintenance? Can I bring everything down or do I need to distribute my maintenance across independent entities?
  • Can I afford the loss of any data at all? Can I afford the partial loss of data?
  • What if we see a city-wide power outage? Do I need a failover site, or are all my users in the same spot and won’t be able to work anyway?

Questions like these will help you realize that not everything you have running has the same value. Your development system with 6000 people working on it worldwide might need better protection than your productive system that is only being used by 500 people spread through the Baltic region.

Or in short

Knowing what kind of protection you need is key. Fact is that both HA and DR solutions never come cheap. If you need the certainty that your solution is available and able to recover from a disaster, you will notice that the price tag will quickly skyrocket. Which is another reason to make sure that you know exactly what kind of protection you need, and creating that definition is the most important starting point. Once you have your own definition, make sure that you communicate those definitions and requirements so that all parties are on the same page. It should make your life a little easier in the end.

About the author



Leave a Comment