disaster recovery | Trent Richardson's Blog

If your IT shop is less than perfect, you are already an expert in disaster recovery at many levels. The reason is the faulty way applications have traditionally been implemented. This means spending more and more time on recovery efforts while still rarely getting the results the business demands. Disaster recovery plans are a lot like elephant repellent in New York City; their effectiveness remains unknown until it is too late. Fortunately, disaster recovery will no longer be needed after the end of 2011.

The doubters will be quick to point out high availability is now a standard business expectation. The systems have to stay up. The perceived demands on IT continue to inflate. But what is the business really demanding? Do the customers really want to pay for it? Or is the overhead of disaster recovery just to avoid employee inconvenience or to provide executives with plausible deniability? The challenge is IT staffs have fallen into a tradition of separating application design, implementation, and recovery. The failure of this approach is catastrophic:

Budget-minded small to midsized businesses (SMBs) once viewed business continuity (BC) planning as an expensive luxury. Not anymore. Upgrading disaster recovery (DR) capabilities is a major priority for 56% of IT decision makers in the U.S. and Europe, according to Forrester Research Inc.
While companies think they’re immune to any long-term outage, more than one-fourth of companies have experienced a disruption in the last 5 years, averaging eight hours, or one business day. Source: Comdisco Vulnerability index
The U.S. Department of Homeland Security says one in four businesses won’t reopen following a disaster.

But what is a Disaster Recovery Plan? If you ask IT staff, it is rebuilding boxes. If you ask the communications manager, it is re-establishing connectivity. If you ask the PM, it is a large collection of nicely formatted documents. So which of the following is a disaster recovery plan?

Backup Tapes
Document
Software
Contracts
Recovery site
Server Virtualization

Perhaps we find that it is really none of the above. The individual items lead us to an overly narrow view of the effort, and away from the business. We will find that an ounce of prevention is worth a terabyte of cure.

To better understand the needs of recovery, we first need to look at the types of disasters and the likelihood of each occurring. We are all familiar with the dangers of a hurricane – obviously bad for business. But it is equally damaging when a business user posts the same address to all participants or drops a billing table. Which is more likely to occur on our watch? Should the recovery effort really be any different? The new view of disaster recovery will address both situations with the same solution.

To address these needs, there are various types of recovery strategies. In fact, ALL companies have a fully functioning disaster recovery plan; the only difference is the result the plan achieves. The following are common types of plans. In order to protect the guilty, the businesses using each have been omitted.

Denial
Bunker
Copy
Nuclear
Active (right answer)

Which type of plan do you believe you have? Do all members of your company have the same impression? Incredibly, more than 44% of the companies with a workable disaster recovery plan have NOT informed anybody about the plan. Why? Is it because they don’t believe it will work or don’t want to take responsibility for it? It is easy for the CEO to buy into the idea that they have done their ‘due diligence’ by spending a ton on a nuclear scheme. But is spending a real benchmark for recovery? In fact, some of the best recovery plans can be done quite affordably. The key is to have an active resilience plan that is utilized on a daily basis.

It is the business…stupid. Too often, IT staffs have discussed IT disaster recovery in terms of recovering servers rather than business value. Which of the following should we be focusing on?

Backups
Disaster Recovery
Business Continuity (right answer)

The most powerful metric: “Are you trying to avoid employee inconvenience with your requested service levels?” IT staff are all too often the guy with the hammer looking at all problems as nails. We forget that business was dutifully conducted before faxes, emails, ETL, and mobile phones. What is really critical to the business? And how can it be done in a pinch with a manual solution? All too often the first question is how do we replicate the databases all over the planet, when the question we should be asking is: how can we call the customers? The real need is to establish business continuity.

We can distill the needs by looking at common terms in the recovery business. But we cannot accept the representations of the primary users as gospel. All users will say their functions must be 100% available at all times with no possibility of any data loss: really? This is seldom the truth from a core business perspective.

Recovery Time Objective – Time required to recover critical systems to a functional state, often assumed to be “back to normal” for those systems designated as mission critical.
Recovery Point Objective – Point in time to which the information has been restored when the RTO has elapsed and is dependent upon what is available from an offsite data storage location.
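
To make the two terms concrete, here is a minimal Python sketch that scores a single recovery (real or test) against both objectives. The function name, the timestamps, and the 8-hour and 4-hour targets are all made up for illustration; plug in whatever the business actually agreed to.

```python
from datetime import datetime, timedelta

def evaluate_recovery(outage_start, service_restored, last_good_copy,
                      rto_target, rpo_target):
    """Compare one recovery against its RTO and RPO targets (hypothetical helper)."""
    # RTO side: how long the business was without the system.
    downtime = service_restored - outage_start
    # RPO side: how much work was lost between the last usable copy and the outage.
    data_loss_window = outage_start - last_good_copy
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto_target,
        "rpo_met": data_loss_window <= rpo_target,
    }

# Illustrative numbers only: outage at 9:00, service back at 15:30,
# last offsite copy taken at 1:00 that morning.
result = evaluate_recovery(
    outage_start=datetime(2011, 3, 1, 9, 0),
    service_restored=datetime(2011, 3, 1, 15, 30),
    last_good_copy=datetime(2011, 3, 1, 1, 0),
    rto_target=timedelta(hours=8),
    rpo_target=timedelta(hours=4),
)
print(result)
```

In this made-up case the systems came back inside the time objective, but eight hours of data were gone against a four-hour point objective. A plan can hit one number and miss the other, which is why both need to be discussed with the business.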

The Test

A great test is to ask the staff if they are willing to take a 10% pay decrease to build out a nuclear infrastructure. When the decision becomes personal, it is amazing what clever workarounds people come up with. This forces the conversation away from how to build bigger IT plants and toward how to achieve business continuity.

Another overlooked test is to ask the customer. But you have to ask the customer the right way. If you simply ask if they want everything all the time, then the answer will be yes. But say, if you gave your bank customer the following options:

Be guaranteed they can access their account 24×7, but have fees of $200 a month (the fees are there whether advertised or not).
Have a strong availability but accept if the access is down from time to time, but they will get a credit of $400 a month in their account (yes the swing is 2x).

Some recovery experts suggest categorizing applications. They are usually project managers or consultants looking for work. Or we can break out a crystal ball and try to prioritize the impact. This usually results in a massive cost (think infinity). This approach is widely used by hardware sales agents trying to sell a nuclear gizmo. However, this thinking is flawed because more and more portal data is interconnected. A portal may consume both high priority and low priority data. But has your portal been tested to function WITHOUT the low priority data? A silo view is no longer practical because virtually everything is interconnected.
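
One way to picture a portal that survives the loss of its low priority data is the small Python sketch below. Everything in it is hypothetical (the function names, the "accounts" and "promotions" feeds); it only shows the shape of the idea: the high priority call is allowed to fail the page, while the low priority call is wrapped so the page still renders without it.

```python
def build_portal_page(fetch_accounts, fetch_promotions):
    """Render a page that requires account data but tolerates missing promotions."""
    # High priority: if this fails, the exception propagates and the page fails.
    page = {"accounts": fetch_accounts()}
    try:
        # Low priority: nice to have, never allowed to take the page down.
        page["promotions"] = fetch_promotions()
    except Exception:
        page["promotions"] = []
        page["degraded"] = True
    return page

def promotions_down():
    # Stand-in for a low priority source that is offline.
    raise ConnectionError("promotions feed unavailable")

page = build_portal_page(lambda: ["checking", "savings"], promotions_down)
print(page)
```

Running the portal this way in a routine test, with the low priority source deliberately switched off, answers the question above before a disaster asks it for you.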

So then, should we simply replicate everything? Well, although more and more shops technically have all of their data replicated to a DR location, it is not readily usable by applications because it is not in sync. As a result, database administrators and application specialists need to spend additional hours, sometimes days, reconciling data and rolling databases back to bring the various data components into alignment. By the time this effort is complete, the desired recovery window has long since been exceeded.
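
The sync problem can be sketched in a few lines of Python. The component names and timestamps below are invented, and real reconciliation is far messier, but the arithmetic is the point: the only moment all the replicas agree on is the oldest one, so every other component has to be rolled back to it.

```python
from datetime import datetime

def consistent_recovery_point(last_replicated):
    """Return the latest time ALL components can be restored to,
    plus how far each component must be rolled back to reach it."""
    common_point = min(last_replicated.values())
    rollback = {name: ts - common_point for name, ts in last_replicated.items()}
    return common_point, rollback

# Hypothetical replication timestamps at the DR location.
replicas = {
    "billing_db": datetime(2011, 3, 1, 8, 55),
    "crm_db": datetime(2011, 3, 1, 8, 10),
    "document_store": datetime(2011, 3, 1, 6, 0),
}
point, rollback = consistent_recovery_point(replicas)
for name, delta in rollback.items():
    print(f"{name}: roll back {delta} to reach {point}")
```

In this example the billing data is nearly current, yet the usable recovery point is dragged back almost three hours by the slowest component.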

The hard part is not rebuilding the box. The most common mistake businesses make when determining service-level requirements is trying to keep the business running as if nothing happened. The point is not that cool new technology like clouds and SANs is not useful, but rather that its use needs to be designed into the application deployment. If it is designed as an afterthought and assigned to another department, the costs will rise and the effectiveness will drop.

You have to make sure your disaster recovery plan will work with or without the internal key people who developed it. If the director in charge of financial ERP applications wrote the plan, for example, ask the business intelligence manager to test the recovery. The biggest bottleneck to any recovery is not the applications or the data, but rather the key people who know how the proprietary tools were configured for your shop. If a hurricane hits, your staff needs to be focused on their families, not your CRM systems.

The secrets to success

Build resiliency into the design – Keep the architecture simple.
Build before planning
Reverse the offsite co-location so that the primary location is remote and the recovery is local.
Include key vendors in the plan so that they can provide assistance.
Use offshore resources to validate secondary sites and bring them current on a daily basis – routine failovers.
Make high availability the responsibility of everyone – business and IT.

The secrets to failure

Depend upon familiar local resources
Plan before building
Use complex technology that inserts more moving parts into the daily operations
Prepare thick complex manuals.
Designate a special recovery team
View the recovery in terms of hardware
Test the process annually over a 1-2 day period.
Forget unique needs of legacy applications.
Assume each application is an independent silo

Where should you spend money? Too often, IT staffs have discussed IT disaster recovery in terms of recovering servers rather than business value. We are all familiar with the extreme costs of moving from .99 of uptime to .9999. But let’s say it a different way: it is easy to overspend trying to eliminate short downtimes. In reality, their business impact is fairly low. And we probably do not spend enough making darn sure we can avoid long-term downtimes. Ironically, many of the nuclear solutions insert so many moving parts to make us instantly available that when they fail, we are usually down for days or weeks. It is easy to underspend on protection from the big impacts.
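
To put numbers on those uptime figures, the short Python sketch below converts an availability level into hours of allowed downtime per year. It assumes a plain 365-day year and treats all downtime as equal, which is a simplification.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def annual_downtime_hours(availability):
    """Hours of downtime per year implied by an availability fraction such as 0.99."""
    return (1.0 - availability) * HOURS_PER_YEAR

for availability in (0.99, 0.999, 0.9999):
    hours = annual_downtime_hours(availability)
    print(f"{availability:.2%} uptime allows about {hours:.1f} hours of downtime per year")
```

At .99 the allowance is about 87.6 hours a year; at .9999 it is under one hour. The expensive jump therefore buys back roughly 87 hours of scattered short outages, while a single three-day failure of a complicated solution costs 72 hours in one hit.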

Where to spend too much money?

Overprotecting data that is not critical to the business daily needs
Fail to maintain disaster recovery plans
Test disaster recovery plans too often
Overlook the benefits of server virtualization
Reluctance to renegotiate with disaster recovery service providers
Rely on technology as a silver bullet
Engage a consultancy to do a detailed plan

How to save money?

Identify all of the costs
Determine the assumptions
Review the cost allocation
Build the recovery cost into the implementation

Clearly, the costs for distinct disaster recovery spending are trending upward. They are going up because that spending cannot deliver the results. When we have the recovery effort assigned to a separate team or department, the right people are not bearing the costs of the availability, and thus we cannot get unbiased feedback on the real needs. As the costs for availability become baked into implementations, the costs as a separate line item evaporate. And the overall spend is actually reduced, because it is cheaper to build it in once than to design and implement it twice.

Thus, new implementations will bake in the appropriate resilience, making disaster recovery obsolete. This will be the final step in the evolution of recovery:

Can recovery be a disaster? Whether in a test or an actual recovery, the plan itself can be a substantial security risk. During the process, the protected data is outside of its normal zone and subject to unexpected events as well as organized threats. Companies go to great lengths to protect the PII (personally identifiable information) within their data centers, but overlook the issues during a recovery effort. Some are flat out unavoidable!

How to get data to facility?
How to recover licenses?
How to recover keys?
Where are passwords?
What happens to data after the test?
Were any data transmissions logged?

So as an executive, how can you take quick stock without hiring an expensive consultant? Here is a handy executive checklist:

What constitutes a disaster?
Do all senior managers understand their role in the event of a disaster?
How will the interim business be managed?
How will public relations be managed? How will staff communications be managed?
How will customers react? Do they really want to pay for .9999?
What are the core business deliveries? What can be performed through alternative manual means?
How much will downtime affect the share price and market confidence?
How will the recovery effort be staffed?
What is the resiliency of the solutions purchased?
What is the PII exposure during a recovery effort?

A checklist to see what you learned

1) Organizations should lay out a five-year plan with a recovery time objective that is ________
a. Less than two hours
b. Going to improve over time
c. The same as what you have now

2) Of the 50% to 70% of organizations that develop IT disaster recovery plans, fewer than ____ actually test those plans.
a. One quarter
b. One third
c. One half

3) 44% of disaster recovery planners polled haven’t told anybody that a DR plan exists in their organization.
True
False

4) How do current budget constraints change IT disaster recovery discussions with other parts of the business?
a. They don’t; IT should proceed as it has before.
b. It makes it more important to involve other business departments.
c. It makes it less important to involve other business departments.

5) The test of an IT disaster recovery plan came fast and furiously last year for a gas and electric company, when flood waters swept over its Cedar Rapids, Iowa, territory. What technology, not touted as a big piece of the IT disaster recovery plan, came to the rescue?
a. Voice over Internet Protocol
b. Desktop virtualization
c. Duplication services

6) The recession is putting a squeeze on budgets for outsourcing disaster recovery services. As such, CIOs are turning to _________ to reduce floor space at their leased recovery sites, according to providers of IT disaster recovery services.
a. Server virtualization
b. Cloud computing
c. Contract renegotiation

7) How are companies using cloud computing for IT disaster recovery outsourcing?
a. They’re increasing the number of licensees with access to DR applications.
b. They’re moving mission-critical applications to a cloud environment.
c. They’re creating carbon copies of applications.


Disaster Recovery is dead
