Friday, December 21, 2012

issues with disaster recovery choices


Recently, I was involved in delivering a disaster recovery / business continuity solution for a world-renowned arts and entertainment organisation.

Some time ago, I adopted an IT decision philosophy diagram which goes like this:

Fig.1 The IT Decision Triangle

Actual business needs ought to reign supreme.  However, these are sometimes unclear, and CV-driven or shiny-toy technology decisions often overrule them.  Next, political muscles flex to leave a mark, which is sometimes a scar.  And ultimately, every decision has a budget constraint, which often comes out of left field.

As we proceeded through the DR/BCP project, we faced a number of trade-off decision points that epitomised these ideas and effects.

My first task was to create a detailed service catalog, relating IT services to the IT infrastructure which underpinned those services.  This step helped me to get an overview of what we needed to protect and is a step I strongly recommend to anyone taking on DR/BCP. 

My next technical task was to investigate and compare as many DR/BCP solutions as I could discover, knowing what I did of the technology at stake.  A future post will analyze the options I considered and how I was able to compare an apple with a pear with a pineapple.

Simultaneously, we undertook a detailed analysis of what the business actually expected. When I say detailed, I really mean exhaustive and exhausting.  Via a questionnaire and a series of meetings, we approached each and every business unit and asked the Unit Leaders (Managers) to detail their IT requirements and business continuity wants. 

RTOs and RPOs
In any DR project, the question of what RTO (recovery time objective: how quickly a service must be restored) or RPO (recovery point objective: how much recent data you can afford to lose) you hope to achieve kicks off your discussions.  The technology team we had at our disposal had some definite ideas and proposed that we offer the world to the business.  However, it soon became apparent from the business analysis that the business didn't really need or desire the speed of recovery or range of recovery initially suggested by the technology team.

And here is an important point: with any DR/BCP project, don't over-promise or even over-deliver.  Find out what the business really expects!  Short RTOs and RPOs (measured in seconds) will cause you grief and cost you an arm and a leg.

Even the tier 1 money-earning application (as far as the business was concerned) was, at the outset of the project, tolerating a 1-day RPO and a similar RTO for the most serious DR events, involving total corruption of the database or loss of a site.  I realised that we could make strong gains by offering 60-minute RPOs and similar RTOs for the most critical applications, and reduce our budget requirement dramatically compared with a design which offered RTOs/RPOs in seconds.

The business analysis had also thrown up the fact that a large swathe of business applications and IT services would tolerate a much greater RTO/RPO combination, like a day.  Once we identified these, we could plan our different tiers of service accordingly and save money.
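
To make the tiering idea concrete, here is a minimal sketch of how a service catalog entry might carry its RTO/RPO and be sorted into a tier.  Everything in it is an assumption for illustration: the service names, the infrastructure labels, the field names and the two-tier cut-off at 60 minutes are hypothetical, not the catalog we actually built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    """One row of a (hypothetical) IT service catalog."""
    name: str
    infrastructure: List[str]  # systems that underpin the service
    rpo_minutes: int           # maximum tolerable data loss
    rto_minutes: int           # maximum tolerable downtime


def assign_tier(service: Service) -> int:
    """Tier 1 if the tighter objective is 60 minutes or less, else tier 2.

    The 60-minute threshold is an assumption taken from the figures
    discussed in the post; a real project would agree it with the business.
    """
    tightest = min(service.rpo_minutes, service.rto_minutes)
    return 1 if tightest <= 60 else 2


# Hypothetical entries: names and infrastructure are invented examples.
catalog = [
    Service("ticketing", ["db-cluster", "web-farm"], rpo_minutes=60, rto_minutes=60),
    Service("intranet wiki", ["app-server"], rpo_minutes=1440, rto_minutes=1440),
]

tiers = {s.name: assign_tier(s) for s in catalog}
```

Grouping services this way makes it easier to see which infrastructure really needs the expensive protection and which can live with a cheaper, slower recovery.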

We put this information into the IT service catalog, of course, and circulated it for comment.  No one important read it, naturally ... yet.


NEXT ...... How I chose a solution .....