Security+ Course – 5.3 Business Continuity and Disaster Recovery

1. Business Continuity Planning

Business continuity efforts are a collection of activities designed to keep a business running in the face of adversity. This adversity may come in the form of a small-scale incident, such as a single system failure, or a catastrophic incident, such as an earthquake or tornado. Business continuity plans may also be activated by man-made disasters such as a terrorist attack or hacker intrusion.

Defining BCP Scope

When an organisation begins a business continuity effort, it’s easy to become overwhelmed by the many possible scenarios and controls a project might consider. For this reason, the team developing a business continuity plan should take time up front to carefully define its scope:

What business activities will be covered by the plan?
What types of systems will it cover?
What types of controls will it consider?

The answers to these questions will help make critical prioritisation decisions down the road.

Business Impact Assessment (BIA)

This identifies and prioritises risks. The BIA is a risk assessment that follows one of the quantitative or qualitative processes that we discussed earlier in this course. The BIA begins by identifying the mission-essential functions that a business depends upon, and then traces them backwards to identify the critical systems that support those functions. Once planners have identified the affected systems, they can then identify the potential risks to those systems, and conduct a risk assessment.

This risk assessment is based upon a variety of factors:

impact on life and safety
impact on property and finances
impact on the organisation’s reputation

The risk assessment should cover all of the threats that might face an organisation.

BIA Results:

2. Business Continuity Controls

Business continuity professionals have a variety of tools at their disposal to help remediate potential availability issues.

Single Point of Failure Analysis

One of the critical ways that IT professionals protect the availability of systems is by ensuring that they are redundant. That simply means that systems are designed in such a way that the failure of a single component doesn’t bring the entire system down.

Consider, as an example, a web server protected by a firewall and connected to the internet.

The single points of failure are:

Single web server
Single firewall
Single network connection on each link (internal and to the internet)

To resolve these issues we put in:

Cluster of web servers
High availability firewall pair
2 separate network connections per link

Single point of failure analysis continues, identifying and remediating issues, until either the team stops finding new issues or the cost of addressing them outweighs the potential benefit.
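
The analysis above can be sketched in a few lines of Python. This is only an illustration: the `Tier` class, the tier names, and the instance counts (for example, three servers in the cluster) are assumptions made for the example, not part of the course material. The idea is simply to flag any layer of the architecture that is served by a single component.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str       # hypothetical label for one layer of the architecture
    instances: int  # number of redundant components serving that layer


def single_points_of_failure(tiers: list[Tier]) -> list[str]:
    """Return the names of tiers that are served by fewer than two components."""
    return [tier.name for tier in tiers if tier.instances < 2]


# The original design: one of everything.
before = [
    Tier("web server", 1),
    Tier("firewall", 1),
    Tier("internal network link", 1),
    Tier("internet link", 1),
]

# The remediated design (cluster size of 3 is an assumed figure).
after = [
    Tier("web server cluster", 3),
    Tier("high availability firewall pair", 2),
    Tier("internal network link", 2),
    Tier("internet link", 2),
]

print("Before:", single_points_of_failure(before))
print("After:", single_points_of_failure(after))
```

In practice the team would repeat this kind of check each time the design changes, stopping when the list is empty or the remaining items are too expensive to fix.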

Succession Planning for Staff

One final component of business continuity planning that is often overlooked is personnel succession planning. Information technology depends upon highly skilled team members who develop, configure, and maintain systems and processes. IT leadership should work with human resources to identify those team members who are essential to continued operations, and to identify potential successors for those positions. That way, when someone leaves the organisation, management has already thought through potential replacements and, hopefully, provided those successors with the professional development opportunities they need to step into the departing employee’s shoes.

3. High Availability and Fault Tolerance

There are two key technical concepts that improve the availability of systems.

1. High availability (HA)

This uses multiple systems to protect against failures.
NOTE: Load balancing is a related but different concept. Load balancing spreads the burden of providing a service across multiple systems, providing a scalable computing environment. While they use similar technologies, load balancing and high availability have different goals.

2. Fault tolerance (FT)

This helps protect a single system from failing in the first place by making it resilient in the face of technical failures. The two most common points of failure within a computer system are:

1. Power supplies
They have moving parts and fail often. It is best to have redundant power supplies and connect them to different power sources where possible.

2. Storage devices
The second priority of many fault tolerance efforts is protecting against the failure of a single storage device. This is achieved through a technology known as RAID, a redundant array of inexpensive disks. RAID comes in many different forms, but each of them is designed to provide redundancy by having more disks than needed to meet business needs.

RAID Technologies
Mirroring (RAID 1): In this approach the server contains two disks with identical contents. When the system writes any data to one disk, it automatically makes the same change to the other disk, keeping the second disk as a synchronised copy, or mirror, of the primary disk.

Striping with parity (RAID 5): In this approach the system contains three or more disks and writes data across all of those disks, but also includes additional elements known as parity blocks spread across the disks. If one of the disks fails, the system can regenerate that disk’s contents by using the parity information.
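
To make the idea of parity concrete, here is a minimal sketch of how one stripe could be rebuilt after a disk failure. It assumes the parity block is the byte-wise XOR of the data blocks, which is the usual approach for RAID 5. The block contents and the `xor_blocks` helper are invented for illustration, and a real array also rotates the parity block across the disks, which this sketch does not model.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-disk array: two data blocks plus one parity block.
disk_a = b"BACKUP01"
disk_b = b"RECOVERY"
parity = xor_blocks([disk_a, disk_b])

# Simulate losing disk_b: its contents can be regenerated
# from the surviving data block and the parity block.
rebuilt_b = xor_blocks([disk_a, parity])
print(rebuilt_b == disk_b)
```

The same calculation cannot recover two lost blocks at once, which is why this form of RAID protects against a single disk failure only.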

RAID is a fault tolerance strategy designed to protect against a single disk failure. It is not a backup strategy; you should still perform regular data backups to protect your organisation’s information in the event of a more catastrophic failure, such as the physical destruction of an entire server.

EXAM TIP: You won’t need to know detailed information about RAID, but do know that mirroring requires two disks and striping with parity requires at least three.

4. Disaster Recovery

Business continuity programs are designed to keep a business up and running in the face of a disaster. Sometimes continuity controls fail, or the sheer magnitude of a disaster overwhelms the organisation’s capacity to continue operations. That’s where disaster recovery begins.

Disaster recovery is a subset of business continuity activities, designed to restore a business to normal operations as quickly as possible following a disruption.

Initial Response

Contain the damage caused by the disaster
Recover whatever capabilities can be immediately restored
Possibly activate an alternate processing facility
Possibly call in contractors

Employees’ responsibilities change dramatically during disaster recovery. Flexibility is key. Organisations should plan disaster recovery responsibilities and train staff accordingly.

Disaster Recovery Communication

Initial activation of the disaster recovery team
Regular status updates for both employees in the field and leadership
Ad hoc communications capabilities to meet tactical needs

Assessment Mode

After the immediate danger to the organisation clears, the disaster recovery team shifts from immediate response mode into assessment mode. The goals of this phase are simple:

Triage the damage to the organisation
Develop a plan to recover operations on a permanent basis

Order of Responsibility: the recovery plan should prioritise systems by criticality, restoring the most critical first.

Disaster Recovery Metrics

1. Recovery Time Objective (RTO): Maximum amount of time it should take to recover a service after a disaster.
2. Recovery Point Objective (RPO): Maximum time period from which data may be lost in the wake of a disaster. For example, can you afford to lose one day’s data? 15 minutes’ data?
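
As a rough illustration of how these two metrics are used, the sketch below checks a recovery design against them. It assumes that the worst-case data loss is approximately the time between backups, and that someone has already estimated how long a restore takes. The function names and the example figures are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is roughly the gap between two backups."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The service must be recoverable within the recovery time objective."""
    return estimated_restore_time <= rto


# Nightly backups against two different recovery point objectives.
nightly = timedelta(days=1)
print(meets_rpo(nightly, rpo=timedelta(days=1)))
print(meets_rpo(nightly, rpo=timedelta(minutes=15)))

# An assumed six-hour restore against a four-hour recovery time objective.
print(meets_rto(timedelta(hours=6), rto=timedelta(hours=4)))
```

A design that fails either check needs more frequent backups, a faster recovery method, or a relaxed objective agreed with the business.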

EXAM TIP: Disaster recovery efforts end only when the business is operating normally in its primary facility.

5. Backups

Backups are perhaps the most important component of any disaster recovery plan. Backups provide organisations with a fail-safe way to recover their data in the event of a technology failure, human error, natural disaster, or other circumstances that result in data’s accidental or intentional deletion or modification.

Backup Media

Tape backup: this is still a very common practice today; however, tapes are unwieldy to manage.
Disk to disk backups: write data from the primary disk to special disks set aside for backup purposes. Those backup disks may be in a separate facility, where it would be unlikely that the same physical disaster would affect both the primary and backup site.
Cloud backups: write backups directly to storage provided by cloud computing vendors, such as Amazon Web Services, Microsoft Azure, or their competitors. This provides great geographic diversity, as the backup data is stored in separately-managed facilities. Cloud providers also usually perform their own backups of their systems, providing an added layer of protection for customer data.

3 Primary Backup Types

Full Backups: include a complete copy of the data.
Differential Backups: include all data modified since the last full backup.
Incremental Backups: include all data modified since the last full or incremental backup. These use less space but require greater recovery time.

Snapshots

Snapshots are a special case of full backups. They leverage technology designed to capture an image of a system as it stood at a specific point in time. A snapshot may then be used at a later time to restore the system to the state it was in when the snapshot was taken. Snapshot technology is often provided by virtualisation platforms and cloud computing providers.

EXAM TIP: The exam commonly includes questions about backup types, and it’s very important that you understand not only what data is included in each backup, but also what backups would need to be restored in the event of a disaster.

Sample Question:
“Joe is the storage administrator for his company, and he performs a full backup of his systems every Sunday afternoon. He then performs differential backups every weekday evening. If the system fails on Friday morning, what backups would he need to restore?”
Answer:
1. Sunday’s full backup
2. Thursday’s differential backup

If he had been performing incremental backups instead, he would have to restore:
1. Sunday’s full backup
2. Monday’s, Tuesday’s, Wednesday’s and Thursday’s incremental backups
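
The reasoning behind both answers can be written down as a short sketch. It assumes the schedule uses either differentials or incrementals after each full backup (not a mixture), and that the history contains at least one full backup. The function name and the backup labels are made up for the example.

```python
def backups_to_restore(history: list[tuple[str, str]]) -> list[str]:
    """Given (label, kind) pairs in chronological order, list what to restore.

    kind is "full", "differential" or "incremental". Assumes at least one
    full backup exists and that the two partial types are not mixed.
    """
    # Everything starts from the most recent full backup.
    last_full = max(i for i, (_, kind) in enumerate(history) if kind == "full")
    restore = [history[last_full][0]]
    later = history[last_full + 1:]

    if later and later[-1][1] == "differential":
        # A differential holds every change since the full backup,
        # so only the most recent one is needed.
        restore.append(later[-1][0])
    else:
        # Each incremental holds only the changes since the previous
        # backup, so all of them are needed, in order.
        restore.extend(label for label, kind in later if kind == "incremental")
    return restore


days = ["Mon", "Tue", "Wed", "Thu"]
differential_week = [("Sun full", "full")] + [(f"{d} differential", "differential") for d in days]
incremental_week = [("Sun full", "full")] + [(f"{d} incremental", "incremental") for d in days]

print(backups_to_restore(differential_week))
print(backups_to_restore(incremental_week))
```

The trade-off is visible in the two results: differentials grow larger through the week but keep the restore to two steps, while incrementals stay small but lengthen the restore chain.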

Media Rotation Strategies

This allows the reuse of backup media. Most restoration requests are for recent backups.

Grandfather-Father-Son Rotation: in this method you have four of each set of media (son, father, grandfather)

Son set of media is for weekdays
Father set is for Fridays
Grandfather set is for the last day of the month
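
A small sketch of how a backup job might pick the media set for a given date under this rotation follows. Two points are assumptions rather than part of the notes: the grandfather set takes precedence when the last day of the month falls on a Friday, and any other day (including weekends) falls through to the son set. The function name is hypothetical.

```python
import calendar
from datetime import date


def media_set_for(day: date) -> str:
    """Choose the grandfather, father or son media set for a backup date."""
    last_day_of_month = calendar.monthrange(day.year, day.month)[1]
    if day.day == last_day_of_month:
        return "grandfather"  # assumed to win over a Friday
    if day.weekday() == 4:    # Monday is 0, so 4 is Friday
        return "father"
    return "son"


for example in (date(2024, 1, 24), date(2024, 1, 26), date(2024, 1, 31)):
    print(example.isoformat(), media_set_for(example))
```

Because son media are overwritten most often and grandfather media least often, recent backups stay readily available while older monthly copies are retained for longer.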

6. Disaster Recovery Sites

During a disaster organisations may need to shift their computing functions from their primary data centre, to an alternate facility designed to carry the load when their primary site is unavailable, or non-functioning. Disaster recovery sites are alternate processing facilities, specifically designed for this purpose.

3 Types of Disaster Recovery Sites

Hot sites:
– Fully operational data centres stocked with equipment and data
– Available at a moment’s notice
– In many cases the hot site can activate itself if the primary site fails
– Very expensive
Cold sites:
– Empty data centres stocked with the core equipment, network and environmental controls, but without the servers and data
– Activating them may take weeks or months of effort
– Much cheaper option
Warm sites:
– Stocked with all necessary equipment and data, but not maintained in parallel fashion
– Similar in expense to hot sites
– Available in hours or days
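
One way to connect the three site types to a recovery time objective is sketched below. The activation times are illustrative assumptions chosen to match the rough descriptions above (minutes, days, months); real figures would come from the organisation’s own planning. The sketch walks the options from cheapest to most expensive and returns the first one that can be activated in time.

```python
from datetime import timedelta
from typing import Optional

# Ordered cheapest first. The durations are assumed worst-case
# activation times, not figures from the course material.
SITES_CHEAPEST_FIRST = [
    ("cold", timedelta(weeks=12)),
    ("warm", timedelta(days=3)),
    ("hot", timedelta(minutes=5)),
]


def cheapest_site_for(rto: timedelta) -> Optional[str]:
    """Return the cheapest site type whose activation time fits the RTO."""
    for name, activation_time in SITES_CHEAPEST_FIRST:
        if activation_time <= rto:
            return name
    return None  # no site type is fast enough; rely on continuity controls


print(cheapest_site_for(timedelta(hours=1)))
print(cheapest_site_for(timedelta(weeks=1)))
print(cheapest_site_for(timedelta(weeks=26)))
```

A result of `None` suggests that disaster recovery sites alone cannot meet the objective, and that high availability controls at the primary site need to carry more of the load.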

Alternate Business Processes

In addition to alternate processing facilities, organisations may incorporate alternate business processes as a component of their disaster recovery plans. For example, the organisation might move to a paper-based ordering process if an electronic order management system will remain down for an extended period of time.

7. Geographic Disaster Recovery Considerations

Offsite Backups

Backups should be stored in a facility that is sufficiently far away from the primary data centre that they will not be impacted by the same disaster. There’s no clear-cut answer to the question of how far is far enough. Rather, organisations should select locations in a manner that is informed by their risk assessments. Generally speaking, it’s a wise practice to store backups more than 100 miles away from the primary facility, and ideally even farther away.
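
The 100-mile rule of thumb can be checked with a great-circle distance calculation. The sketch below uses the haversine formula with an approximate mean Earth radius. The coordinates (roughly London and Manchester) and the function names are example assumptions, and straight-line distance is only a starting point: two sites can be far apart and still share a flood plain, power grid or network provider.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius


def great_circle_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def far_enough(distance_miles: float, minimum_miles: float = 100.0) -> bool:
    """Apply the rule of thumb; the threshold should follow the risk assessment."""
    return distance_miles > minimum_miles


# Approximate coordinates for a primary site and an offsite backup location.
primary = (51.5074, -0.1278)
offsite = (53.4808, -2.2426)

distance = great_circle_miles(*primary, *offsite)
print(round(distance), far_enough(distance))
```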

Redundant Data Centres

If you’re going to have two data centres operating in an active-active or active-passive configuration, it only makes sense to place them far enough apart from each other that the same disaster won’t impact both facilities. Cloud computing facilitates the use of geographically distant backups. Infrastructure as a service providers operate in many different regions and availability zones around the world, simplifying the process of shipping backups between sites and building geographic resiliency.

Legal Implications

In addition to considering the resiliency components of location selection, organisations should also take legal implications into account. When you operate a data centre in a distant jurisdiction, the data within that facility may become subject to local laws that differ from those in your primary region. For example, if you store data within the European Union, it may then fall under the provisions of the EU’s General Data Protection Regulation (GDPR).

8. Testing Disaster Recovery Plans

DR Testing Goals

Validate that the plan functions correctly
Identify necessary plan updates

DR Test Types

Read through: disaster recovery staff distribute copies of the current plan to all personnel involved in disaster recovery efforts and ask them to review their procedures. Team members then provide feedback about any updates needed to keep the plan current.
Walk through: this involves getting everyone together around the same table to review the plan. For this reason, a walk-through is also known as a tabletop exercise. Walk-throughs achieve the same result as read-throughs, but are generally more effective, because they give the team the opportunity to discuss the plan together.
Simulation: the team meets and discusses how they would respond to a specific scenario.
Parallel test: this activates the disaster recovery facility (hot or warm site) but does not switch operations to it.
Full interruption test: this switches primary operations to the alternate facility. This is the most effective test but also the most disruptive.

9. After Action Reports

After every activation of the disaster recovery or business continuity plans, organisations should conduct a formal review of the event and document it in an after action report. The purpose of the after action report is to create a formal record of the incident that documents the circumstances surrounding the event and identifies opportunities for future improvement.

Writing a report

An after action report should contain several major sections.

Brief executive summary: this allows a casual reader to capture the basics of the event and the major findings in a few paragraphs. When you write this summary, imagine that you are writing it for an audience who will read that section and nothing more, because that’s probably the case.

Background Information: this allows the reader to analyse the events and circumstances leading up to the incident. For example, you might include details about the state of the operating environment, external factors that contributed to the situation, and other relevant facts.

Detailed Summary: Explain what happened, being careful to cover as many of the key questions as possible. Who was involved in the event? What factors contributed to the success and/or failure of the effort? When did the event take place? Why was the disaster recovery or business continuity plan activated? Where did the incident occur, and how did the event occur?

Lessons Learned: In what ways did the organisation perform well? What areas were deficient? How could successful processes be even further improved, and how can deficiencies be corrected?

Conclusion: This should clearly outline next steps that the organisation should take based upon the lessons learned. This section should assign clear responsibility for implementing changes and timelines for completion. The next steps section should be very specific so that it may be used to hold the organisation accountable for implementing the recommended changes.
