Design For Disaster

Everyone has a plan until they get punched in the face.
Mike Tyson

There are many considerations when looking at Disaster Recovery (DR), and in many ways it is a specialist field of technical architecture. It is nevertheless important to build it in as an intrinsic part of any solution, because it affects every part: application, data, integration and infrastructure. The following gives some guidance on how to approach this.

Know your acronym:
DR is not the same as Business Continuity (BC) or High Availability (HA), but it should be an intrinsic part of BC and can be the same as HA in specific circumstances. Confused yet? As a technical architect you will need to look at both High Availability (how to design a system that is highly available as a single instance) and Disaster Recovery (how to design a system that can recover from a site-wide event). If you are lucky enough to be deploying into data centres that are close enough together and have good connectivity, then your HA and DR solutions may be the same thing. Business Continuity is the overall framework for managing disasters that may impact an organisation. Your DR solution will contribute to the continuity plans but is not the whole of the BC solution. The overall BC framework will include aspects such as dealing with loss of people and public relations management, and there will usually be a specialist team managing this.

Clients may want DR, may not want DR or simply not know…
While there may be a super-efficient BC team that dictates the requirements to you in precisely articulated terms for each part of the solution, there may equally be no requirements defined at all. As with all technical architecture (TA) deliverables, the DR solution should be driven by the Non-Functional Requirements (NFRs), usually expressed as:

RTO (Recovery Time Objective): The RTO is a measurement, in units of time (typically hours), of how long the business can survive following a disaster before operations are reinstated. If the RTO is 12 hours, the business can sustain operations for 12 hours without the system being available. If the system is not recovered within 12 hours, the business could suffer severe reputational or financial damage.
RPO (Recovery Point Objective): The RPO is a measurement, in units of time (typically hours), of the maximum acceptable amount of data that can be lost following a disaster. It measures how much time can elapse between the last data backup and a disaster without causing serious harm to the business. RPO is useful for determining how often to perform data backups. If the RPO is 1 hour, it is acceptable for the business to lose no more than 1 hour of data, so data must be backed up (or replicated) at least once an hour.
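
The two definitions above can be turned into a simple check. The following is a minimal sketch, not a definitive method: the names `RecoveryTargets`, `worst_case_data_loss`, `meets_rpo` and `meets_rto` are hypothetical, and it assumes that a backup is complete and usable the moment it is taken (in practice, backup duration and offsite transfer time widen the data-loss window).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical container for the two NFRs discussed above."""
    rto: timedelta  # longest the business can be without the system
    rpo: timedelta  # largest window of data the business can afford to lose


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: backups are instantaneous. A disaster striking just
    # before the next backup then loses almost one full interval of data.
    return backup_interval


def meets_rpo(targets: RecoveryTargets, backup_interval: timedelta) -> bool:
    return worst_case_data_loss(backup_interval) <= targets.rpo


def meets_rto(targets: RecoveryTargets, estimated_recovery: timedelta) -> bool:
    # estimated_recovery is your own estimate of how long a restore takes.
    return estimated_recovery <= targets.rto


# The figures used in the text: RTO of 12 hours, RPO of 1 hour.
targets = RecoveryTargets(rto=timedelta(hours=12), rpo=timedelta(hours=1))
hourly_ok = meets_rpo(targets, timedelta(hours=1))      # hourly backups
four_hourly_ok = meets_rpo(targets, timedelta(hours=4))  # backups every 4 hours
```

With these inputs, hourly backups satisfy the 1-hour RPO while four-hourly backups do not, which matches the reasoning in the RPO definition.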

When analysing the requirements, think about and challenge the different elements of the solution: does the training system with relatively static content really require the same RTO and RPO as the online banking systems holding real-time customer financial data? Also consider legacy integration, and how your system may rely on legacy systems that do not have DR implemented.

Finally, clearly define your requirements. They should be technical recovery targets, so they cover only the time taken to recover a system once DR has been invoked. The time elapsed since the disaster actually happened may be very different, since it can take many hours or even days from a disaster event before a company actually decides to invoke DR!
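
To make the distinction concrete, here is a small worked example. It is only a sketch: the function names and the timestamps are invented for illustration, and it assumes three recorded moments (disaster, decision to invoke DR, service restored).

```python
from datetime import datetime, timedelta


def technical_recovery_time(invoked_at: datetime, restored_at: datetime) -> timedelta:
    # What a technical recovery target measures: invocation to restoration.
    return restored_at - invoked_at


def business_outage(disaster_at: datetime, restored_at: datetime) -> timedelta:
    # What the business actually experiences: disaster to restoration.
    return restored_at - disaster_at


# Hypothetical timeline, purely illustrative.
disaster_at = datetime(2024, 1, 1, 2, 0)   # site lost at 02:00
invoked_at = datetime(2024, 1, 1, 9, 0)    # DR invoked at 09:00 after assessment
restored_at = datetime(2024, 1, 1, 17, 0)  # service restored at 17:00

recovery = technical_recovery_time(invoked_at, restored_at)  # 8 hours
outage = business_outage(disaster_at, restored_at)           # 15 hours
decision_delay = outage - recovery                           # 7 hours
```

In this made-up timeline a 12-hour technical recovery target is met (8 hours), even though the business was without the system for 15 hours. Stating which of the two clocks a requirement refers to avoids arguments later.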

Decide where and when to build DR into your solution
On any big SI project your DBAs will push you towards using database replication and your Infrastructure team will order you to specify SAN replication. There is no right answer, but as technology architects we need to come to a decision about which layers to build our resilience and data replication solutions into. When designing the execution architecture, consider all the options and decide which ones provide the best balance between meeting the NFRs and cost. If there is no specified RTO, then tape backup with offsite storage may be sufficient.
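
One way to structure that decision is to list each candidate option with the RTO and RPO it can realistically achieve, discard those that miss the NFRs, and pick the cheapest of what remains. The sketch below assumes this approach; `DrOption`, `cheapest_compliant` and every figure in `CANDIDATES` are hypothetical placeholders, not vendor or benchmark data, and should be replaced with numbers from your own environment.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrOption:
    name: str
    achievable_rto: timedelta
    achievable_rpo: timedelta
    relative_cost: int  # unitless ranking, lower is cheaper (placeholder)


def cheapest_compliant(
    options: Sequence[DrOption],
    rto: Optional[timedelta],
    rpo: Optional[timedelta],
) -> Optional[DrOption]:
    """Return the cheapest option meeting the NFRs, or None if none does.

    A target of None means the requirement was not specified, so it is
    not used to filter options.
    """
    compliant = [
        o for o in options
        if (rto is None or o.achievable_rto <= rto)
        and (rpo is None or o.achievable_rpo <= rpo)
    ]
    return min(compliant, key=lambda o: o.relative_cost, default=None)


# Placeholder figures for illustration only.
CANDIDATES = [
    DrOption("tape backup with offsite storage",
             timedelta(hours=72), timedelta(hours=24), relative_cost=1),
    DrOption("database replication",
             timedelta(hours=1), timedelta(minutes=5), relative_cost=2),
    DrOption("SAN replication",
             timedelta(hours=4), timedelta(minutes=15), relative_cost=3),
]

no_rto = cheapest_compliant(CANDIDATES, rto=None, rpo=timedelta(hours=24))
strict = cheapest_compliant(CANDIDATES, rto=timedelta(hours=12), rpo=timedelta(hours=1))
impossible = cheapest_compliant(CANDIDATES, rto=timedelta(minutes=1), rpo=timedelta(0))
```

With these placeholder numbers, an unspecified RTO selects tape backup, a 12-hour RTO with a 1-hour RPO selects a replication option, and a target no candidate can meet returns `None`, which is the signal to go back to the stakeholders rather than to keep designing.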

Accept that the impossible is impossible
DR is impacted by more external factors than any other area of TA: geography, network links, existing legacy systems, existing Infrastructure standards and existing processes/procedures. If the NFRs are not achievable, communicate this as early as possible and make sure all parties are aware.
