The loss of information technology to today's organizations can have a severe impact on the organization's ongoing services. In the worst cases, the organization's survival may be at stake. Any organization not prepared for disruption of its processing base will in some way be hurt. With the efficiency of automation comes the responsibility of protecting the ongoing activity if the automation is lost.
The assessment of needs and preparedness should be the first step in determining a course of action. Defining the business of the organization and characterizing the continuity of that business will help determine the critical processing. It will also begin to fix the potential cost of disaster, and hence the value in protecting from disruption. Such risk assessment takes many forms, but a starting point is the people impacted by the business.
While specifying a key group of people to protect depends on the charter of the specific organization, it will usually be drawn from among customers, shareholders, staff, suppliers, and regulators. The latter may more reasonably be described as persons from whom to be protected!
Profitable enterprises usually depend upon their customers, or more specifically the revenue from billing customers for products and services provided. Sometimes those services are provided to one group, and billed to a third party. Taxing agencies bill in advance and pass on revenues for spending, presumably to provide services. Whatever the sequence, the disruption of processing could certainly disrupt the generation of revenue, and hence profits.
To the extent that profits are impacted, it is the shareholders or owner who will hold the management organization accountable for the continuing business operation, in spite of any system failures. In many ways, lenders fall into this category, as well, since a business dependent upon credit will strive to protect the confidence of these people by protecting the flow of cash critical to business operation or survival.
The staff is the group most obviously concerned about business survival, and more immediately about their own situations. Payroll production is a concern, but so is the existence of the employment itself. Also important is the ability or inability of the staff to perform their jobs without the systems with which they normally interact. In many organizations, the skills to do some tasks without the computer are simply no longer present, and where they are still present, there may not be nearly enough hours available to get the job done. An additional benefit of recovery evaluations can be greater understanding of the systems in place, and job improvements leading to greater employee satisfaction.
Suppliers as a group are sometimes maligned in recovery planning. The activities of paying the staff and collecting revenue seem to overshadow the need to pay suppliers in a timely fashion, ensuring the continued flow of materials and services necessary to produce the product. While it is true that a supplier who has been paid routinely will frequently ease its requirements in the aftermath of a disaster, its ability to continue to provide raw materials will be reduced if it is not paid promptly.
Finally, the controls imposed by regulations and contractual obligations cannot be ignored. For those in banking, regulations clearly spell out backup requirements. For food, drug, and some other products, recall and control regulations define some system requirements. Contractual requirements for delivery, with just-in-time only the most obvious of these, may drive certain requirements. Labor or industry requirements can define expenses that must be handled promptly in order to avoid legal retaliation.
Business continuity is defined in the context of the products and services, and the people impacted by business disruption. Typically, the management and shareholders want to see growth continue unabated. Much management effort goes into enabling the employees to get the job, whatever it is, done more effectively and profitably. While some view the objective as maintaining the status quo or providing continued employment, others strive to enhance the current position, attempting to ensure company growth, dominance, and stature.
The unforeseen interruption of the business activity can disrupt the best of plans. Disaster in the information industry is frequently defined as unexpected denial of access to information or processing. At the time disaster strikes it is not particularly important why the information is impacted, but rather it is crucial to recover the data and reestablish a processing environment. The approaches taken in setting up systems, in backing them up, and in choosing among recovery options become the keys to success, and in each of those areas it is foresight and planning which are on our side, and time which is the enemy.
The processing environment is increasingly defined by the software application choices. There was a time when mainframe and proprietary operating systems generally supported many activities on one machine, but open systems and client/server environments have generally defined the hardware in the context of the software. It is common to see UNIX systems supporting a single application per machine, with several machines providing the array of application needs. While this can add some independence among applications linked together, the interdependencies within data can cause all applications to fail if one system goes down.
Moreover, the application choices may make the hardware environment more complex, as particular benefits derived from software selection dictate or encourage specific hardware choices. Multiple platform shops are common today, with machines taking maximum advantage of software, and networks providing the common thread to bring information together and to the user. The added complexity spills over into recovery strategies, and the selection of appropriate recovery facilities based upon staffing and communication issues.
Network connections can ease or complicate recovery strategies, with frame relay currently providing a straightforward recovery path, and point-to-point connections complicating recovery. There are many right ways to communicate both from operational and recovery perspectives. The intersection of the strategies can provide an optimal solution, or conscious evaluation of the options can define the tradeoffs among solutions.
Reemergence of the glass house is another recent trend. Long the domain of the mainframe, the centralized data center is returning to house centrally controlled smaller systems, and their numbers are increasing. This approach can drive up the cost of disaster recovery, or could be designed in a way to moderate the additional costs over several separate sites.
Staff savings are frequently cited as the reason for consolidation of resources. With skilled UNIX people in continuing short supply, consolidation can enable broader expansion of computerized services. At the same time, it has become common to rely on recovery provider staff to augment in house expertise. Recoveries without subscriber staff at the recovery center are common, as is remote testing, and even out-sourced testing.
Application design has also led to larger hardware requirements, including such arrangements as database and application servers, client/server, and decision support systems separate from production. A shorter data backup window in the midst of busy production schedules has resulted in a variety of creative disk and tape configurations to maximize available system time. Each of these developments affects the recovery approach to be taken, and the cost of the approach. While it is easy to approach recovery with a replication of the production environment, costs sometimes discourage this luxury, and the experience and creativity of the recovery planner and system developers can save substantial cost.
The business components of people, software, hardware and infrastructure all contribute to the complexity or simplicity of the operation, and its recovery. As procedures are developed, staffing established, and software and equipment decisions made, an eye toward recovery is worthwhile. The infrastructure of power, network, and other facility details can also impact the difficulty in recovery.
Staff allocation and development can be a key to smooth recovery. While staff head count will probably always be a challenge, careful assignment of responsibilities, cross training, and priority setting can ease the burden of recovery. For the database administrator, for example, it may not be enough to understand the structure of the production databases; consideration should also be given to the configuration of a backup system unlikely to be identical to the production system. Where requirements dictate replication of hardware, it is helpful to know this in advance so that contracts or purchasing can properly handle it, leaving testing to uncover more subtle pitfalls. For the operations staff, the ability to configure key peripherals such as tape autoloaders or backup software becomes important, particularly if these devices are relatively specific to the company situation. Surprisingly, configurations of hardware and software do vary among sites, and the more knowledge retained locally rather than with hardware vendors, the better.
Admittedly, various routine priorities frequently conflict with the long term benefits of recovery preparation. While this is understandable, it is also true that consideration of recovery requirements can provide a view sufficiently different from the routine, that benefits can be realized. The experience of recovery providers with specific backup software and devices can provide additional insight prior to purchase. Since the size of the recovery system typically influences the service cost, it is worth evaluating minimum requirements. In one such evaluation a key database design and size revision resulted in significant response time improvement, delaying a hardware expansion and increasing productivity each day.
There is a balance between the skills of the production and development staff and the requirements for disaster recovery. That balance can be tipped in favor of the staff by using the expertise of commercial recovery services. While it is appealing to present a solution to several vendors and ask for bids, the objectivity gained is not worth the creativity potentially lost. In the early stages of deliberation, disaster recovery is typically not among the skill sets of most production staffs. It is worthwhile to draw on vendors for a review of the production environment, and their suggestions can easily and appropriately become part of a later evaluation process.
It is also important to consider the full demands placed upon the production and development staff, as well as the users, in the event of disaster. Where recovery will require staff to travel to another city, it may be difficult to find employees, particularly parents or spouses, who are willing to leave loved ones in the midst of a disaster as likely to affect them personally as professionally. Recovery solutions exist which bring processing to the disaster or alternate site, and which provide able staff to assist with the recovery. Testing is a key component of developing both the resource and the confidence in using it. Where external assistance is required, as in support software or backup devices, be certain that the help is available at any time, and test that availability along with the systems and recovery providers.
Software considerations include operating system, database, applications, operations, system management, and backup/restoration packages. The operating system is increasingly tied to the hardware as hardware capability increases, speeds increase, and internal data structures change. Recovery options should consider not only today's requirements, but also those of the foreseeable one to two years. Likewise, the database is a key building block, and its relationship to operating system levels has made recovery considerations more challenging just as it has complicated upgrading of systems. If discussions are underway to change any of the components of the operating environment, these should be made available to those working to develop recovery strategies.
Applications software is dependent, as well, on both the operating system and the database, so this too must be part of the development of recovery plans. Not only is production important; in many cases test and development machines come into play as the steps to implementing change have become complex. If possible, testing and development should be curtailed during a disaster recovery. In some cases they cannot be, and provisions have to be made to cover these activities. It is also true that operations and system management tasks increasingly utilize a separate management system, and it is worth evaluating the inclusion of such a system in the recovery mix if it is a mainstay in daily operation.
A critical application is the software used to back up the system, and the device or devices it supports in production and can support in recovery. Jukeboxes are particularly significant in that drivers must be available, as well as sufficient numbers of drives, and even tape slots. Some software will not start unless all tapes in the restoration are accessible to the software. In other cases, where equivalent units are provided, it is key to know they are equivalent in the eyes of the software, not just on the specification sheets. The data itself can be evaluated, and there are likely multiple scenarios to be developed. Machine access is different if the production environment is destroyed along with the data center, and different again if the distribution inventory is also destroyed. Access to information for insurance purposes need not be as quick as it would be for a production environment. The recovery solution should also assess whether the information already in the system is key, as in accounts receivable, or whether ongoing entry is also important, as in payroll or decision support. Decision support itself can be evaluated as to near-term (tactical) or longer-term (strategic) importance.
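Because some backup packages will not start a restoration until every required tape is accessible, a simple pre-restore inventory check can save hours at the recovery site. The sketch below is illustrative only; tape labels and the helper name are assumptions, not features of any particular backup product.

```python
# Hypothetical pre-restore check: verify that every tape in the restoration
# set is already loaded in the jukebox before starting the backup software.

def missing_tapes(restore_set, loaded_tapes):
    """Return the tape labels required by the restore but not yet loaded."""
    return sorted(set(restore_set) - set(loaded_tapes))

# Illustrative tape labels for a full dump plus two incrementals.
restore_set = ["FULL-0412", "INCR-0413", "INCR-0414"]
loaded = ["FULL-0412", "INCR-0414"]

gaps = missing_tapes(restore_set, loaded)
if gaps:
    print("Load before starting restore:", gaps)
```

A check like this belongs in the documented test procedure, so each successive test confirms it against the actual jukebox inventory.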
In evaluating the hardware requirements for recovery, the function of a particular system is more important than its mere existence. As an example, many installations of SAP software have a database server with a failover system, and a significant set of application servers. From a cost perspective, few SAP users recover the failover system, and in many cases only a subset of the application servers is included, but the setup and modification of the base tables must be understood in order to have this flexibility. Likewise, in client/server environments, only a few servers might be covered for testing purposes, and others made available as a time-of-disaster quick-ship option.
Decision support is a developing information use, and companies vary widely in their use of this interesting resource. As planners become more skilled in using the information, long term direction setting use has given way to short term decision-making. It is worth routinely evaluating the way such systems impact daily activities, and then to determine if they should be covered, and how. These systems sometimes have little backup capability, on the assumption the data is available from other systems. Where backup capability exists, it may take days to restore from many incremental tapes, contradicting the apparent need for quick availability of information. The capacity of these systems presents specific challenges.
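The multi-day restore problem for decision support systems can be seen with back-of-envelope arithmetic. All numbers below are assumptions for illustration, not measurements of any real system.

```python
# Rough estimate of restore time for a full dump plus a month of
# incremental tapes on a single drive. All figures are assumed.

full_gb = 200                 # size of the last full dump
incremental_gb = 10           # average size of one incremental
incrementals = 30             # roughly one per day for a month
restore_rate_gb_per_hour = 5  # single tape drive, assumed throughput

hours = (full_gb + incrementals * incremental_gb) / restore_rate_gb_per_hour
print(f"estimated restore time: {hours:.0f} hours ({hours / 24:.1f} days)")
```

Even under these generous assumptions the restore runs into days, which is the contradiction noted above when the business expects quick availability of the information.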
From the facility perspective, power, cooling, access, and similar things need to be considered, but if commercial recovery is chosen, these should be a standard part of the offering. The network infrastructure, however, should be considered in some detail as it provides interaction among systems and user access to the recovered information. Local and wide area network designs must be reviewed to determine survivability, and to determine how best to link them to recovery sites. Here, as well, many options exist, and the best use of time would be to join provider network experts with internal staff to discuss the production environment and recovery options.
With the many preceding considerations evaluated, the next step is critical in terms of going forward. If, as each area was evaluated, the materials were retained and assembled in a cohesive fashion, much of the structure of a recovery plan could be at hand. What is lacking is approval for expanding the evaluation into a project leading to a solution, perhaps contracts, and most certainly changes in the organization. Commitment at high levels is needed before proceeding. With the amount of detail assembled, a reasonable assessment of risk and a general statement of solution should be possible. Where service providers have been part of the information gathering, some cost perspective is likely to be available. Where an internal alternate-site solution is proposed, its costs should be calculated. Local culture now dictates the next steps to be taken: a formal request to the highest officers of the company, an informational document requesting approval to move forward with details, or informal discussion with the corporate officers, whatever is necessary in the organizational structure.
The request should include the possibility of a more formal assessment of risk, of plan development to include staff assignments to the process and as a result of the structure and procedures developed, and of implementation to include providing coverage for the critical processing identified. Depending on staff priorities and availability, this effort could proceed as an internal project, or as a consulting engagement. The plan development will be a process of information acquisition and strategy distribution. A key part of the structure will be the management of the project cycle, and establishment of a recycle annually through the system developed.
Preparation for recovery begins with selection of the critical processes from among the many in production. Criteria will need to be established for selection, along with guidelines for those participating in the process. The length of the recovery window should be chosen for each process. If there are alternative ways to provide access to the information, those should be considered as well.
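The selection exercise above amounts to building a ranked inventory of processes with their recovery windows and any alternatives. A minimal sketch, with process names, windows, and alternatives all invented for illustration:

```python
# Illustrative inventory of production processes. Each entry records the
# recovery window (hours until the process must be running again) and any
# alternative way to provide the information. All values are assumptions.

processes = [
    {"name": "order entry",      "recovery_window_hours": 8,   "alternative": None},
    {"name": "payroll",          "recovery_window_hours": 72,  "alternative": "service bureau"},
    {"name": "decision support", "recovery_window_hours": 168, "alternative": "defer"},
]

# Sort so the shortest-window (most critical) processes lead the plan.
critical = sorted(processes, key=lambda p: p["recovery_window_hours"])
for p in critical:
    print(p["name"], p["recovery_window_hours"], p["alternative"] or "none")
```

Even a simple list like this makes the later hardware and staffing decisions concrete, since each entry can be traced to a recovery resource.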
The probability of each specific disaster occurring is not really as important as the possibility of the disaster. But identification of risks is appropriate, since the reactions to individual threats can differ even though any disruption would trigger a response. Probability is less important primarily because any attempt to calculate expected cost is frustrated by probabilities that look small on paper but losses that are extremely significant if the disaster occurs.
Some of the selection criteria for critical processes will be the staff requirements to accomplish their mission, the hardware required for the recovery, the sites impacted by the application, and the organization's ability to recover the systems. Sometimes outside influences, such as food and drug or other regulatory requirements, will sway the decision.
At this point a decision should be made on how recovery is to be accomplished. One option is to do nothing, and take chances on the severity of a disruption or the time to recover; this is essentially betting the business against the disaster. Internal recovery centers are sometimes developed. They are very expensive, but they can be extremely well controlled and there is no competition for their use. All activity within them will likely be developed internally, or with the assistance of consultants. Commercial recovery centers are an increasingly popular and effective approach to balancing the costs with the needs. Options available are numerous, including the traditional hotsite, mobile services, and equipment available for shipment to a site of the subscriber's choice. I should also mention that reciprocal arrangements have become rare, due primarily to the difficulty in management, and the disruption to both the disaster and healthy sites.
The implementation cannot be characterized as easy or quick, but the material gathered to this point is a tremendous start on the details necessary for putting together the tactics for success. Much as in coding a program, the development process already accomplished is a major part of the task.
The specific development of a plan is an important accomplishment, but its development works well as a part of the initial testing of the recovery strategy. For the balance of this discussion, I will assume the commercial option was selected above. If an internal center is being developed, the construction and installation management will need to be added.
The plan develops at two levels. The strategy is built from the documentation and assembly of the materials used to confirm that recovery is necessary. Certainly this includes people and overriding procedures, and some detail on the operational aspects which will help the operators each day, as well. The tactical portion is the instructions and scripts needed to load the system, bring up the applications, and so on. Each of these should be developed to a comfort level, but the tactical portion in particular should remain fluid into the first test, as the test itself will enlighten.
For that first test, objectives should be set carefully. This is the first of several phases to be accomplished, and the collection of needed items and their use in bringing the system up successfully and loading the application is a sufficient challenge. Operating system and backup software tapes should be identified and pulled. Data backup tapes, hopefully verified, should be gathered, along with a spare copy from the previous similar full dump. Procedures, scripts, passwords, and schedules should be identified and gathered. Estimates of when certain events can take place should be documented, simply for comparison to actuals as the test proceeds. These milestones and metrics will develop into a measurement of progress on each successive test.
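The estimated-versus-actual comparison can be kept as a simple table that grows into the progress measurement across tests. A minimal sketch, assuming times are recorded in hours from the start of the test; the milestone names and numbers are invented for illustration:

```python
# Illustrative milestone log for one recovery test. Each entry holds
# (estimated, actual) hours from test start; slip shows where the
# procedure needs refinement before the next test.

milestones = {
    "system booted":              (2.0, 2.5),
    "backup software installed":  (3.0, 3.0),
    "data restored":              (8.0, 10.5),
    "application up":             (10.0, 12.0),
}

for name, (estimated, actual) in milestones.items():
    slip = actual - estimated
    print(f"{name}: estimated {estimated}h, actual {actual}h, slip {slip:+.1f}h")
```

Carrying the same milestone names from test to test is what turns these numbers into a trend rather than a one-off observation.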
The actual test process will start before you arrive, when the system is set up and configured from a hardware perspective and checked using a vanilla copy of the operating system. The production site tapes will be used to update the operating system, and patches may be applied for specific hardware or software requirements. Backup software is installed as needed. Typically the system is rebooted to take advantage of the work to this point.
Next, applications and data are restored to the system, verifying that all tapes are good; these steps are documented and the procedure refined and updated. It is wise to remove passwords on key users so that the next steps do not lock them out. Once restored, the system is again booted, bringing processes on line or initiating specific activities. At this point passwords are reactivated, even though further steps become somewhat more difficult. The milestones are checked and updated.
Depending on which test in a series, testing takes on different personalities. Initially, getting the system up and the applications restored and working is an excellent goal. A next test could include the network and limited testing to confirm it is solid. Giving the recovery system to local or remote users might be the third test, and everyone should be encouraged to write down any concerns or changes to be made.
With the conclusion of any test, evaluations should come from everyone as to points of difficulty, or approaches to improve. Procedures should be updated and any follow-up items logged and assigned. With completion of the remaining items, preparation can begin for the next test in the cycle.
The nature of plan management is a continuing cycle: preparation, test, follow-up, and moving forward to the next test. The plans will be updated for changes in staff, applications, and equipment. It is important to include a review of recovery handling in any change management approach, be it for applications, purchased software, or operational procedures. Sometime after the first few tests the activity will begin to feel stable; change management will become the primary activity, and the status quo will be recognized as one of the benefits of good planning.