There are many different approaches to BCP and DRP. Some companies address these processes separately, whereas others focus on a continuous process that interweaves the plans. The National Institute of Standards and Technology (NIST) (http://www.csrc.nist.gov) offers a good example of the contingency process in Special Publication 800-34: Continuity Planning Guide for Information Technology Systems (http://tinyurl.com/yb3lcw). In NIST SP 800-34, the BCP/DRP process is defined as
Before we go further, let's define the terms disaster and business continuity. A disaster is any sudden, unplanned calamitous event that brings about great damage or loss. Entire communities have concerns following a disaster; however, businesses face special challenges because they have responsibilities to protect the lives and livelihoods of their employees, and to guard company assets on behalf of shareholders. In the business realm, a disaster can be seen as any event that prevents the continuance of critical business functions for a predetermined period of time. In other words, the estimated outage might force the declaration of a disaster. Business continuity is the process of sustaining operation of critical systems. The goal of business continuity is to reduce or prevent outage time and optimize operations. The Business Continuity Institute (http://www.thebci.org), a professional body for business continuity management, defines business continuity management in the following terms:
Although there are competing methodologies that can be used to complete the BCP/DRP process, this chapter will follow steps that most closely align with reference documentation recommended by ISC2. Figure 7.1 illustrates an overview of the process, the steps for which are as follows:
Figure 7.1 BCP/DRP process.
We will discuss each of these steps individually.
Project Management and Initiation
Before the BCP process can begin, it is essential to have the support of management. You might need to educate management about the need for a BCP. One way to accomplish this is to prepare and present a seminar for management that gives an overview of the risks the organization faces, identifies basic threats, and documents the costs of potential outages. This is a good time to remind management that, ultimately, they are legally responsible. Customers, shareholders, or anyone else could bring civil suits against senior management if they feel the company has not practiced due care. Without management support, you will not have funds to successfully complete the project, and resulting efforts will be marginally successful, if at all. Management is responsible for
Management must choose a team leader. This individual must have enough credibility with senior management to influence them in regard to BCP results and recommendations. After the team leader is appointed, an action plan can be established and the team can be assembled. Members of the team should include representatives from management, legal staff, recovery team leaders, the information security department, various business units, networking, and physical security. It is important to include asset owners and the individuals who would be responsible for executing the plan. Next, determine the scope. A properly defined scope is of tremendous help in maximizing the effectiveness of the BCP plan. Be sensitive to interoffice politics, which, if out of control, can derail the planning process. Another problem to avoid is project creep, which occurs when more and more items that were not part of the original project plan are added to the plan. This can delay completion of the project or cause it to run over budget. The BCP benefits from adherence to traditional project plan phases. Issues such as resources (personnel, financial), time schedules, budget estimates, and any critical success factors must be managed. Schedule an initial meeting to kick off the process. Finally, the team is ready to get to work. The team can expect a host of duties and responsibilities:
It's important for everyone on the team to realize that the BCP is the most important corrective control the organization will have, and to use the planning period as an opportunity to shape it. The BCP is more than just corrective controls; the BCP is also about preventive and detective controls. These three elements are described here:
Business Impact Analysis
The next task is to create the BIA, the role of which is to measure the impact each type of disaster could have on critical business functions. The BIA is an important step in the process because it considers all threats and the implications of those threats. As an example, the city of Galveston, Texas, is on an island known to be prone to hurricanes. Although it might be winter in Galveston and the possibility of a hurricane is extremely low, it doesn't mean that planning can't take place to reduce the potential negative impact if and when a hurricane arrives. The steps for accomplishing this require trying to think through all possible disasters, assess the risk of those disasters, quantify the impact, determine the loss, and identify and prioritize operations that would require disaster recovery planning in the event of those disasters. The BIA is tasked with answering three vital questions:
The development of multiple scenarios should provide a clear picture of what is needed to continue operations in the event of a disaster. The team creating the BIA will need to look at the organization from many different angles and use information from a variety of sources. Different tools can be used to help gather data. Strohl Systems BIA Professional and SunGard's Paragon software can automate portions of the data input and collection process. Although the CISSP exam will not require that you know the names of various tools, it is important to understand how the BIA process works, and it helps to know tools that are available. Whether the BIA process is completed manually or with the assistance of tools, its completion will take some time. Anytime individuals are studying processes, techniques, and procedures they are not familiar with, a learning curve will be involved. As you might be starting to realize, creation of a BIA is no easy task. It requires not only the knowledge of business processes but also a thorough understanding of the organization itself, including IT resources, individual business units, and the interrelationships of each. This task will require the support of senior management and the cooperation of IT personnel, business unit managers, and end users. The general steps within the BIA include
Assessing Potential Loss
There are different approaches to assessing potential loss. One of the most popular methods is the use of a questionnaire. This approach requires the development of a questionnaire distributed to senior management and end users. The objective of the questionnaire is to maximize the identification of real loss from the people completing business processes jeopardized by the disaster. This questionnaire might be distributed and independently completed or filled out during an interactive interview process. Figure 7.2 shows a sample questionnaire.
Figure 7.2 BIA questionnaire.
The questionnaire can also be completed in a round table setting. In fact, this sort of group completion can add synergy to the process, provided the dynamics of the group allow for open communication and the required key individuals can all schedule and meet to discuss what impact specific types of disruptions would have on the organization. The importance of the inclusion of all key individuals must be emphasized because management might not be aware of critical key tasks for which they do not have direct oversight. A questionnaire is a qualitative technique for assessing risk. Qualitative assessments are scenario-driven and do not attempt to assign dollar values to anticipated loss. A qualitative assessment ranks the seriousness of an impact using grades or classes, such as low, medium, high, or critical. This sort of grading process enables quicker progress in the identification of risks, and provides a means of classifying processes that might not easily equate to a dollar value. As an example:
The BIA can also be undertaken using a quantitative approach. This method of analysis attempts to assign a monetary value to all assets, exposures, and processes identified during the risk assessment. These values are used to calculate the material impact of a potential disaster, including both loss of income and expenses. A quantitative approach requires
The process of performing a quantitative assessment is covered in much more detail in Chapter 10. It is important that a quantitative study include all associated costs resulting from a disaster, such as
Both quantitative and qualitative assessment techniques require the BIA team to examine how the loss of service or data would affect the company. Each method is seeking to reduce risk and plan for contingencies, as shown in Figure 7.3.
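The quantitative approach described above can be illustrated with a small sketch. It assumes the BIA team has already collected, for each business function, an estimated daily loss of income, an estimated daily increase in expenses, and an expected outage length. The class name, field names, and figures below are hypothetical and are not taken from any BIA tool; they only show how loss of income and expenses combine into a single monetary impact that can be used to rank functions.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Hypothetical per-function figures a BIA team might collect."""
    function: str
    lost_income_per_day: float    # assumed revenue lost per day of outage
    extra_expense_per_day: float  # assumed added cost per day (overtime, rentals)
    outage_days: float            # assumed length of the outage


def material_impact(estimate: OutageEstimate) -> float:
    """Monetary impact: loss of income plus expenses over the whole outage."""
    daily = estimate.lost_income_per_day + estimate.extra_expense_per_day
    return daily * estimate.outage_days


def rank_by_impact(estimates):
    """Order functions from highest to lowest monetary impact."""
    return sorted(estimates, key=material_impact, reverse=True)


if __name__ == "__main__":
    sample = [
        OutageEstimate("payroll", 0.0, 1_500.0, 5),
        OutageEstimate("order entry", 10_000.0, 2_500.0, 3),
    ]
    for item in rank_by_impact(sample):
        print(f"{item.function}: ${material_impact(item):,.0f}")
```

A ranking like this is only as good as the estimates behind it, which is one reason many teams pair it with the qualitative grades (low, medium, high, critical) for processes that do not map cleanly to a dollar value.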
Figure 7.3 Risk reduction process.
The severity of an outage is generally measured by considering the maximum tolerable downtime (MTD) for which the organization can survive without that function or service. Will there be a loss of revenue or operational capital, or will the organization be held liable? Although the team might be focused on what the immediate effect of an outage would be, cost can be immediate or delayed. Many organizations are under regulatory requirements. The result of an outage could be a legal penalty or fine. The organization's reputation could even be tarnished.
Recovery Strategy
Recovery strategies are the predefined actions that management has approved in the event that normal operations are interrupted. To judge the best strategy to recover from a given interruption, the team must evaluate and complete:
This information is used to determine the best course of action based on the analysis of data from the BIA. With so much to consider, it is helpful to divide the organization's recovery into specific areas, functions, or categories:
Business Process Recovery
Business processes can be interrupted due to the loss of personnel, critical equipment, supplies, or office space; or from labor unrest, such as strikes. As an example, in 2005 after Katrina, New Orleans had a huge influx of workers in the city rebuilding homes, offices, and damaged buildings. Fast food restaurants were eager to meet the demand these workers had for burgers, fries, tacos, and fried chicken. However, there was insufficient low-cost housing for the fast food industry's employees. The resulting shortage forced fast food restaurants to pay bonuses of up to $6,000 to entice potential employees to the area. It is worth noting that even if the facility is intact after a disaster, people are still required and are an important part of the business process recovery. Workflow diagrams and documents can assist business process recovery by mapping relationships between critical functions. Let's process an order for a widget to illustrate a sample flow:
A more detailed listing would be appropriate for industrial use, but you get the idea. Building these types of flowcharts allows organizations to examine what resources are required for each step and what functions are critical for continued business operations.
Facility and Supply Recovery
Facility and supply interruptions can be caused by fire, loss of inventory, transportation problems, telecommunications, or heating, ventilating, and air conditioning (HVAC) problems. It is too late to start discussions on alternative sites when a disaster is striking your facility. Redundant services enable rapid recovery from these interruptions. Many options are available, from a dedicated offsite facility, to agreements with other organizations for shared space, to the option of building a prefab building and leaving it empty as a type of cold backup site. The following sections examine some of these options.
Subscription Services
Organizations might opt to contract their facility needs to a subscription service. The CISSP exam considers hot, warm, and cold sites to be subscription services. Data-processing facilities are expensive. The organization might decide to dedicate the funds for a hot, warm, or cold site. A hot site facility is ready to be brought online quickly. A hot site is fully configured and is equipped with the same system as the production network. It can be made operational within just a few hours. A hot site will need staff, data files, and procedural documentation. Hot sites are a high-cost recovery option, but can be justified when a short recovery time is required. Because hot sites are typically a subscription service, a range of associated fees exist, including monthly cost, subscription fees, testing costs, and usage or activation fees. Contracts for hot sites need to be closely examined because some charge extremely high activation fees to prevent users from utilizing the facility for anything less than a true disaster.
To get an idea of the types of costs involved, http://www.drj.com reports that subscriptions for hot sites average 52 months in length and costs can be as high as $120,000 per month. Compare this to cold sites, which can also be 5 to 6 years in length and can average anywhere between $500 and $2,000 per month. Regardless of what fees are involved, the hot site needs to be periodically tested. These tests should evaluate processing abilities as well as security. The physical security of the hot site should be at the same level as or greater than that of the primary site. Finally, it is important to remember that the hot site is intended for short-term usage only. As a subscriber-based service, there might be others competing for the same resource. The organization should have a plan to recover primary services quickly or move to a secondary location. For those companies lacking the funds to spend on a hot site, or in situations where a short-term outage is acceptable, a warm site might be the right choice. A warm site has data equipment and cables, and is partially configured. It could be made operational in anywhere from a few hours to a few days. The assumption with a warm site is that computer equipment and software can be procured as required due to a disaster. Although the warm site might have some computer equipment installed, it is typically of lower processing power than the primary site. The costs associated with a warm site are similar to those of a hot site but slightly lower. The warm site is a popular subscription alternative. In situations where even longer outages are acceptable, a cold site might be the right choice. A cold site is basically an empty room with only rudimentary electrical power and computing capability. Although it might have a raised floor and some racks, it is nowhere near ready for use. It might take several weeks to a month to get the site operational.
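The trade-off among hot, warm, and cold sites described above can be sketched as a simple selection rule: pick the cheapest site type that can still be operational within the outage the organization can accept. The hour figures below are assumptions chosen to match the rough ranges in the text (a few hours, a few days, up to a month), and the cost ranks express only relative ordering, not real prices. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteOption:
    name: str
    worst_case_hours: float  # assumed upper bound on time to become operational
    cost_rank: int           # 1 = cheapest; relative ordering only


# Assumed figures for illustration, loosely based on the ranges in the text.
SITES = [
    SiteOption("cold", 30 * 24, 1),  # "several weeks to a month"
    SiteOption("warm", 3 * 24, 2),   # "a few hours to a few days"
    SiteOption("hot", 4, 3),         # "within just a few hours"
]


def cheapest_site(acceptable_outage_hours: float):
    """Return the cheapest site whose worst case fits the acceptable outage.

    Returns None when even a hot site is too slow, which suggests that a
    redundant site or another option needs to be evaluated instead.
    """
    candidates = [s for s in SITES if s.worst_case_hours <= acceptable_outage_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)


if __name__ == "__main__":
    for hours in (1, 8, 100, 45 * 24):
        choice = cheapest_site(hours)
        print(hours, "->", choice.name if choice else "no subscription site fits")
```

In practice the acceptable outage would come from the BIA, and contract terms such as activation fees would also weigh on the decision.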
Cold sites offer the least preparedness when compared to the hot and warm subscription services discussed.
Redundant Sites
The CISSP exam considers redundant sites to be sites owned by the company. Although these might be either partially or totally configured, the CISSP exam does not typically expect you to know that level of detail. A redundant site is capable of handling all operations if another site fails. Although there is an increased cost, it offers the company fault tolerance. If the redundant sites are geographically dispersed, the possibility of more than one being damaged is reduced. For low to medium priority services, a distance of 10 to 20 miles from the primary site is considered acceptable. If the loss of services, for even a very short time, could cost the organization millions of dollars, the redundant site should be farther away. Therefore, redundant sites that are to support highly critical services should not be in the same geographical region or subject to the same types of natural disasters as the primary site. For organizations that have multiple sites dispersed in different regions of the world, multiple processing centers might be an option. Multiple processing centers allow a branch in one area to act as backup for a branch in another area. Table 7.1 shows some sample functions and their recovery times.
Table 7.1. Organization Functions and Example Recovery Times
Mobile Sites
Mobile sites are another processing alternative. Mobile sites are usually tractor-trailer rigs that have been converted into data-processing centers. These sites contain all the necessary equipment and are mobile, permitting transport to any business location quickly. Rigs can also be chained together to provide space for data processing and provide communication capabilities. Mobile units are a good choice for areas where no recovery facilities exist and are commonly used by the military, large insurance agencies, and others. Whatever recovery method is chosen, regular testing is important to verify that the redundant site meets the organization's needs, and that the plan can handle the workload to meet minimum processing requirements.
Reciprocal Agreement
The reciprocal agreement option requires two organizations to pledge assistance to one another in case of disaster. The support requires sharing space, computer facilities, and technology resources. On paper, this appears to be a cost-effective approach, but it has its drawbacks. The parties to this agreement must place their trust in the other organization to provide aid in case of a disaster. However, a nonvictim might become hesitant to follow through when a disaster actually occurs. Also, confidentiality requires special consideration. This is because the damaged organization is placed in a vulnerable position while needing to trust the sponsoring party housing the victim's confidential information. Legal liability can also be a concern. For example, one company might agree to host the other organization while it is down and be hacked as a result. Finally, if the parties to the agreement are in physical proximity, there is always the danger that a disaster could strike both parties, thereby rendering the agreement useless.
User Recovery
User recovery is primarily about what employees must have to accomplish their jobs. Requirements include
At issue here is the fact that a company might be able to get employees to a backup facility after a disaster, but if there are no phones, desks, or computers, the employees' ability to work will be severely limited. User recovery can even include food. As an example, my brother-in-law works for a large chemical company on the Texas Gulf Coast. During storms, hurricanes, or other disasters, he is required to stay at work as part of the emergency operations team. His job is to stay at the facility regardless of time; the disaster might last two days or two weeks. During a simulation test several years ago, it was discovered that someone had forgotten to order food for the facility where the employees were to remain for the duration of the drill. Luckily, the 40 or so hungry employees were not really in a disaster, and were able to order pizza and have it delivered. Had it been a real disaster, no takeout would have been available.
Operations Recovery
Operations recovery addresses interruptions caused by the loss of capability due to equipment failure. Redundancy addresses this potential loss of availability through measures such as redundant equipment, Redundant Array of Inexpensive Disks (RAID), backup power supplies (BPS), and other redundant services. Hardware failures are among the most common disruptions that can occur. Preventing the disruptions is critical to operations. The best time to start planning redundancy is when equipment is purchased. At purchase time, there are two important numbers that the buyer must investigate:
A formula for calculating availability is
MTBF / (MTBF + MTTR) = Availability
To maximize availability of critical equipment, an organization can consider obtaining a service level agreement (SLA). There are all kinds of SLAs. In this situation the SLA is a contract between a company and a hardware vendor, in which the vendor promises to provide a certain level of protection and support. For a fee, the vendor agrees to repair or replace the covered equipment within the contracted time.
Fault tolerance can be used at the server or drive level. For servers, there is clustering, which is technology that allows you to group several servers together, where those servers are viewed logically as a single server. Users see the cluster as one unit. The advantage is that if one server in the cluster fails, the remaining active servers pick up the load and continue operation. Fault tolerance on the drive level is achieved primarily with RAID, which provides hardware fault tolerance and/or performance improvements. This is achieved by breaking up the data and writing it to multiple disks. To applications and other devices, RAID appears as a single drive. Most RAID systems have hot-swappable disks. This means that faulty drives can be removed and replaced without shutting down the entire computer system. If the RAID system uses parity and is fault tolerant, the parity data can be used to reconstruct the newly replaced drive. The technique for writing the data across multiple drives is called striping. Although write performance remains almost constant, read performance is drastically increased. RAID has humble beginnings that date back to the 1980s at the University of California. RAID is discussed in depth in Chapter 11, "Operations Security."
Although operations can be disrupted because of the failure of equipment, the loss of communications can also disrupt critical processes.
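Returning to the availability formula given earlier in this section, MTBF / (MTBF + MTTR), a short worked sketch may help. MTBF is the mean time between failures and MTTR is the mean time to repair; the function names and the sample figures below are hypothetical, and the yearly downtime estimate simply assumes 8,760 hours in a year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; ignores leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Rough yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical device: fails on average every 990 hours, 10 hours to repair.
    a = availability(990, 10)
    print(f"availability: {a:.2%}")
    print(f"downtime per year: {expected_downtime_hours_per_year(990, 10):.1f} hours")
```

The sketch makes the purchasing trade-off visible: availability improves either by buying equipment with a higher MTBF or by lowering MTTR, for example through an SLA that guarantees repair or replacement within a contracted time.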
Protecting communication with fault tolerance can be achieved through redundant WAN links, diverse routing, and alternate routing. Whatever method is chosen, the organization should verify capacity requirements and acceptable outage times. The primary methods for network protection include the following:
Networks are susceptible to the same types of outages as equipment. If operational recovery concerns are not addressed, these outages can be a real problem for companies that rely heavily on networks to deliver data when needed.
Data and Information Recovery
The focus here is on recovering the data. Solutions to data interruptions include backups, offsite storage, and/or remote journaling. Because data processing is essential to most organizations, the data and information recovery plan is critical. The objective of the plan is to back up critical software and data in a way that permits quick restores with minimum loss of content. Policy should dictate when backups are performed, where the media is stored, who has access to the media, and what the reuse or rotation policy will be. Types of backup media include tape reels, tape cartridges, removable hard drives, disks, and cassettes. Tape and optical systems still have the majority of market share for backup systems. Common types of media include
Another technology worth mentioning is MAID (Massive Array of Inactive Disks). MAID offers a distributed hardware option for storing data and applications. It was designed to reduce the operational costs and improve the long-term reliability of disk-based archives and backups. MAID is similar to RAID except that it provides power management and advanced disk monitoring. MAID might or might not stripe data and/or supply redundancy. The MAID system powers down inactive drives, which reduces heat output and electrical consumption and increases the drives' life expectancy. In addition to defining the media type, the organization must determine how often backups should be performed and what type of backup should be performed. Answers will vary depending on the cost of the media, the speed of the restoration needed, and the time allocated for backups. Backup methods include
Backup and Restoration
Backups need to be stored somewhere, and backups are needed quickly when it's time to restore. Where the backup media is stored can have a real impact on how quickly data can be restored and brought back online. The media should be stored in more than one physical location so that the possibility of loss is reduced. These remote sites should be managed by a tape librarian. It is this individual's job to maintain the site, control access, rotate media, and protect this valuable asset. Unauthorized access to the media is a huge risk because it could impact the organization's capability to provide uninterrupted service. Transportation to and from the remote site is also an important concern. Important backup and restoration considerations include
It is recommended that companies contract their offsite storage needs with a known firm that demonstrates control of its facility and is responsible for its maintenance. Physical and environmental controls at offsite storage locations should be equal to or better than those at the organization's own facility. A letter of agreement should specify who has access to the media and who is authorized to drop off or pick up media. There should also be agreement on response times that will be met in times of disaster. Onsite storage should maintain copies of recent backups to ensure the capability to recover critical files quickly. Backup media should be securely maintained in an environmentally controlled facility with physical control appropriate for critical assets. The area should be fireproof, and anyone depositing or removing media should have a record of their access logged. Software itself can be vulnerable, even when good backup policies are followed, because sometimes software vendors go out of business or no longer support needed applications. In these instances, escrow agreements can help.
Tape-Rotation Strategies
Although most backup media is rather robust, no backup media can last forever; it will fail over time. This means that tape rotation is another important part of backup and restoration. Additionally, backup media needs to be periodically tested. Backups will be of little use if you find out during a disaster that they have malfunctioned and no longer work. Tape-rotation strategies can range from simple to complex.
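At the simple end of that range, a rotation can be sketched as a fixed pool of tapes used in order, with each tape retired once it has been written an assumed maximum number of times. This is only an illustration of the idea that media wears out and must be cycled; the class name, the labels, and the use limit are hypothetical, and real schemes are usually more elaborate.

```python
class TapeRotation:
    """Round-robin rotation over a fixed pool of tapes with a wear limit."""

    def __init__(self, labels, max_uses):
        if not labels:
            raise ValueError("at least one tape label is required")
        self.labels = list(labels)
        self.max_uses = max_uses  # assumed retirement threshold per tape
        self.uses = {label: 0 for label in self.labels}
        self._position = 0

    def retired(self):
        """Labels of tapes that have reached the assumed use limit."""
        return [label for label in self.labels if self.uses[label] >= self.max_uses]

    def next_tape(self):
        """Return the next usable tape in rotation and record the use."""
        for _ in range(len(self.labels)):
            label = self.labels[self._position]
            self._position = (self._position + 1) % len(self.labels)
            if self.uses[label] < self.max_uses:
                self.uses[label] += 1
                return label
        raise RuntimeError("all tapes have reached their assumed use limit")


if __name__ == "__main__":
    rotation = TapeRotation(["Mon", "Tue", "Wed"], max_uses=2)
    print([rotation.next_tape() for _ in range(6)])
    print("retire:", rotation.retired())
```

Tracking use counts does not replace periodic restore tests; it only gives the tape librarian a prompt for when media is due for replacement.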
Other Data Backup Methods
Other alternatives for further enhancing a company's resiliency and redundancy are listed next. Some organizations use these techniques by themselves; others combine them with other backup methods.
Choosing the Right Backup Method
It is not easy to choose the right backup method. To start the process, the team must consider how long an outage the organization can endure and how current the restored information must be. These two recovery requirements are technically called
Figure 7.4 RPO and RTO.
What you should realize about both RPO and RTO is that the shorter the time requirements are, the higher the maintenance cost will be to provide the faster restoration capability. For example, most banks have a very small RPO because they cannot afford to lose any processed information.
Plan Design and Development
The BCP process is now ready for its next phase: plan design and development. In this phase, the team designs and develops a detailed plan for the recovery of critical business systems. The plan should be directed toward major catastrophes. Worst-case scenarios, in which the entire facility has been destroyed, are planned for because if the organization can handle these types of events, less severe events, such as disasters that render the facility unusable only for a time, can be dealt with easily. The plan should be a guide for implementation. The plan should include information on both long-term and short-term goals and objectives:
The plan should also detail how the organization will contact and mobilize employees, provide for ongoing communication between employees, interface with external groups and the media, and provide employee services. Each of these items is discussed next.
Personnel Mobilization
The process for contacting employees in case of an emergency needs to be worked out before a disaster. The process chosen depends on the nature and frequency of the emergency. Call trees and outbound dialing systems are widely used. An outbound dialing system stores the numbers to be called in an emergency. These systems can provide various services such as
A call tree is a communication system in which the person in charge of the tree calls a lead person on every branch, who in turn calls all the leaves on that branch. If call trees are used, the team will want to verify that there is a feedback mechanism built in. As an example, the last person on any branch of the tree calls and confirms that he or she got the message. This can help ensure that everyone has been contacted. Call trees can be automated with VoIP, public switched telephone networks (PSTNs), and online services. Personnel mobilization can also be triggered by emails to PDAs, BlackBerrys, and so on. Such systems require the email server to be functioning.
Interface with External Groups
Deciding how to interface with external groups is another important aspect of business continuity. Damaging rumors can easily start, and it is important to have protocols in place for dealing with incidents, accidents, and catastrophes. The organization must decide how to deal with response teams, the fire department, the police department, ambulance, and other emergency response personnel. Someone should be identified to deal with the media. Negative public opinion can be costly. It is important to have a properly trained spokesperson to speak and represent the organization. The media spokesperson must be in the communication path to have the facts before speaking or meeting with the press. The appointed spokesperson should interface with senior management and legal counsel prior to making any public statement. Meeting with the media during a crisis is not something that should be done without preparation. The corporate plan should include generic communiqués that address each possible incident. The spokesperson will also need to know how to handle tough questions. Liability should never be assumed; the spokesperson should simply state that an investigation has begun.
Tackling these tough issues up front will allow the company to have a preapproved framework to work with should a real disaster occur.
Employee Services
Companies have an inherent responsibility to employees and to their families. This means that paychecks must continue and that employees need to be taken care of. Employees must be trained on what to do in case of emergencies and on what they can expect from the company. Insurance and other necessary services must continue. During a disaster, employees must know what is expected of them and who is in charge. Someone must have the authority to allocate emergency funding as needed. As an example, after Hurricane Katrina, the U.S. Congress passed 48 C.F.R. § 13.201(b) (2005), which increased the limit on FEMA-issued credit cards to $250,000. The idea was to allow government employees to acquire needed items quickly and without delay. Although funding is important, controls must also be in place to ensure that funds are not misappropriated.
Insurance
Insurance is one option that companies can consider to remove a portion of the risk the team has uncovered during the BIA. Just as protection insurance can be purchased by individuals for a host of reasons, companies can purchase protection insurance for each of the following items:
Insurance is not without its drawbacks, such as high premiums, delayed claim payout, denied claims, and problems proving real financial loss. Also, most insurance policies pay for only a percentage of any actual loss and do not pay for lost income, increased operating expenses, or consequential loss.
Implementation
The BCP team is now nearing the end of the plan's development process, and is ready to submit a completed plan for implementation. The plan is the result of all information gathered during the project initiation, the BIA, and the recovery strategies phase. A final checklist for completeness ensures the plan addresses all relevant factors, such as
The completed plan should be presented to senior management for approval. References for the plan should be cited in all related documents so that the plan is maintained and updated whenever there is a change or update to the infrastructure. When management approves the plan, it must be released and disseminated to employees. Awareness training will help make sure that everyone understands what their tasks and responsibilities are when an emergency occurs.
Awareness and Training
The goal of awareness and training is to make sure all employees know what to do in case of an emergency. If employees are untrained, they might simply stop what they're doing and run for the door anytime there's an emergency. Even worse, they might not leave when an alarm has sounded, even though the plan required they leave because of possible danger. Instructions should be written in easy-to-understand language that uses common terminology. The organization should design and develop training programs to make sure each employee knows what to do and how to do it. Employees assigned to specific tasks should be trained to carry out needed procedures. If possible, plan for cross-training of teams so that team members are familiar with a variety of recovery roles and responsibilities.
Testing
This final phase of the process is to test and maintain the BCP. Training and awareness programs are also developed during this phase. The test of the disaster-recovery plan is critical. Without performing a test, there is no way to know whether the plan will work. Testing transforms theoretical plans into reality. Testing should be repeated at least once a year. Tests should start with the easiest parts of the plan and then build to more complex items. The initial tests should focus on items that support core processing, and they should be scheduled during a time that causes minimal disruption to normal business operations.
As a CISSP candidate, you should be aware of the five different types of BCP tests:
The final step of the BCP process is to combine all this information into the BCP plan and inter-reference it with the organization's other emergency plans. Although the organization will want to keep a copy of the plan onsite, there should be another copy offsite. If a disaster occurs, rapid access to the plan will be critical.
Monitoring and Maintenance
When the testing process is complete, a few additional items still need to be considered. This is important because some might falsely believe that the plan is completed once tested. That's not true. All the hard work that has gone into developing the plan can be lost if controls are not put into place to maintain the current level of business continuity and disaster recovery. Life is not static and neither should the organization's BCP plans be. The BCP should be a living document, subject to constant change. To ensure the plan is maintained, first build in responsibility for the plan. This can be done by
Also, disaster recovery implications for monitoring, maintaining, and recovery should be made a part of any discussions for procuring new equipment, modifying current equipment, or making changes to the infrastructure. The best method to accomplish this is to add BCP review into all change management procedures. If changes are required to the approved plans, they must also be documented and structured using change management. A centralized command and control structure eases this burden. Table 7.2 lists the individuals responsible for specific parts of the BCP process.
Table 7.2. BCP Process Responsibilities