Managing a data center on the HP 3000 has changed dramatically in the last decade. Until five years ago, all business data resided on a single small system that was easily managed from a dedicated console, and recovery of a full system from a backup tape provided acceptable recovery times. Today's data center environment is radically different. Business data is dispersed over a number of systems, often in remote locations. Cost pressures and data availability goals emphasize solutions that support centralized and operatorless environments, and management by exception has emerged as an important criterion for system management. At the same time, customers require data access 24 hours a day, 7 days a week, so backup solutions emphasize on-line backups or backups that do not require downtime. Likewise, to meet data availability goals, unplanned downtime has become unacceptable. This paper explores these changes in our customers' data centers and the solutions available on the HP 3000 to address them.
Changes in Today's Computing Environment
Until five years ago, recovery of a full system from a backup tape provided acceptable recovery times: all business data resided on a single small system, and a full system recovery took only one to two hours. Things have changed tremendously over the last five years, and we need to consider recovery in light of these changes. We need to fully assess the time needed to get our business-critical applications back up and operational. Restoring from a backup tape may no longer be sufficient, and alternative methods may be required to minimize business losses when recovery is needed.
In today's environment, businesses demand solutions that minimize downtime and keep critical data accessible. These businesses look for high availability features in nearly every solution, from networks to recovery. Storage management software must provide much greater data availability and reliability amidst a much more complex environment. Today, there is a much greater need to recover more data more quickly.
The key challenge for data center managers is managing an operating environment that has grown from the single-vendor, stand-alone environments of the past to today's multi-vendor, network-centric, client/server environments. Issues such as configuration, capacity planning, problem resolution, and performance management, as well as reliability and availability management, have all grown in complexity along with it. While data grows and disperses, the data center trends toward centralization of information management: data is distributed across multiple systems, often in remote locations, but management of that data depends on centralized data center resources.
For today's IT organizations, server-based storage requirements are exploding at 50% or even 100% per year. In the last few years, many new trends and technologies within commercial/business computing have driven this dramatic change in data storage capacity requirements and the need to manage them better. Systems have grown at dramatic rates in size and storage capacity. Cost-reduction directions such as system consolidation and distributed computing environments (in which server systems centralize data storage for many clients) have changed the unit of data storage capacity from megabytes to gigabytes. Processing power has increased dramatically, allowing applications to grow in complexity and size. The incorporation of storage-intensive data types such as imaging, voice, and video into many commercial/business applications over the next few years will only increase data storage capacities further. Disk storage is more affordable and provides greater capacity per device. This growth in storage capacity requirements creates pressure to find system management solutions for large volumes of data.
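To put these growth rates in perspective, here is a small worked example; the 100GB starting capacity is hypothetical, while the 50% and 100% annual rates are the figures quoted above.

    # Compounding storage growth at the quoted rates; the 100GB starting
    # point is a hypothetical example.
    base_gb = 100.0
    for annual_growth in (0.5, 1.0):  # 50% and 100% per year
        gb = base_gb * (1 + annual_growth) ** 3
        print(f"{annual_growth:.0%}/yr: {base_gb:.0f}GB becomes {gb:.1f}GB in 3 years")
    # 50%/yr yields 337.5GB; 100%/yr yields 800GB, an eight-fold increase.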
Given the dramatic changes in the data center, four key emerging management issues concern IT executives today:
Fault Avoidance and Rapid Recovery
According to International Data Corporation, "A system is considered to be highly available if, when failure occurs, data is not lost and the system can recover in a reasonable amount of time." But just what constitutes "reasonable"? The specifics differ with each business's requirements and its tolerance for outages. There are two basic techniques to prevent unscheduled outages: fault avoidance and rapid recovery.
To meet a customer's specific level of outage tolerance, HP has identified a hierarchy of three levels of data availability for the HP 3000: Basic System Availability, High Availability System, and Disaster Tolerance System.
The HP 3000 has built-in fault avoidance with its Basic System Availability. Fault avoidance combines reliable technology and support: reliable hardware and operating system components, preventive features, and automated operations that keep faults from ever occurring.
Rapid recovery can minimize (or eliminate) downtime when a fault does occur. This can be achieved through a well-executed recovery plan and redundant components that maintain operation in the event of a failure. The HP 3000 has virtually eliminated the impact caused by a fault with its High Availability and Disaster Tolerance Systems.
The first step in determining an appropriate level of data availability is to determine an application's cost of downtime. Every application has an associated cost of downtime. This dollar figure exposes which application sets are important to your business processes and drives investment decisions affecting high availability solutions, including recovery and backup. The figure accounts for the cost of unplanned versus planned downtime, unavailability during peak usage hours versus off-hours, and how the application affects business profits when it is unavailable. From a total cost of downtime it becomes much easier to determine which level of availability (Basic System, High Availability System, or Disaster Tolerance System) your business requires.
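As an illustration, the sketch below estimates downtime cost from a few inputs; every rate and duration here is hypothetical, chosen only to show how peak and off-hours outages compare.

    # Hypothetical cost-of-downtime estimate; all dollar figures are invented.
    PEAK_REVENUE_PER_HR = 50_000.0   # revenue lost per peak-hour outage
    OFF_REVENUE_PER_HR = 5_000.0     # revenue lost per off-hours outage
    LABOR_PER_HR = 1_200.0           # idle staff plus recovery labor

    def downtime_cost(peak_hours, off_hours):
        """Dollar cost of an outage split across peak and off-hours."""
        return (peak_hours * (PEAK_REVENUE_PER_HR + LABOR_PER_HR)
                + off_hours * (OFF_REVENUE_PER_HR + LABOR_PER_HR))

    print(f"4-hour unplanned peak outage:    ${downtime_cost(4, 0):,.0f}")
    print(f"4-hour planned off-hours window: ${downtime_cost(0, 4):,.0f}")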
Basic System Availability: Built-in Fault Avoidance
Basic System Availability refers to the inherent robustness of HP hardware and the MPE operating system and their ability to tolerate faults. You do not need to buy additional hardware or software to attain this level, and yet you can be assured that the system can withstand exception conditions without failing. Because the basic system is the foundation of your computer system (everything you do runs on top of it), it must be very stable and robust. Recovery at this level depends largely on restoring from a backup tape. You can significantly decrease recovery time just by using user volume sets: without them, the whole system lives on a single volume set, and a disk fault requires recovery of all data, system and application alike (a full system reload), from a backup tape.
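A quick, hypothetical calculation shows why user volume sets shrink recovery time: only the failed set must be restored, not the whole system. The sizes and the 5GB/hr restore rate below are assumptions for illustration.

    # Illustrative restore-time arithmetic; sizes and rate are hypothetical.
    RESTORE_RATE_GB_PER_HR = 5.0
    system_gb = 4.0                    # system volume set
    user_sets_gb = [12.0, 8.0, 6.0]    # three user volume sets

    full_reload = (system_gb + sum(user_sets_gb)) / RESTORE_RATE_GB_PER_HR
    failed_set_only = user_sets_gb[0] / RESTORE_RATE_GB_PER_HR

    print(f"Single volume set (full reload): {full_reload:.1f} hours")
    print(f"Failed user volume set only:     {failed_set_only:.1f} hours")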
High Availability System: Rapid Recovery
With the basic system, we have made the foundation strong and solid. We recognize, however, that faults do occur, and rapid recovery must be addressed to achieve a higher level of system availability. Businesses that require higher availability than the basic system provides typically associate a higher cost with planned or unplanned downtime. For these businesses, the cost of purchasing products to achieve greater availability is far less than the cost of downtime due to a fault.
One of the most fault-prone areas of any system is the path that moves data to and from disk. Eliminating your vulnerability to a fault in this area is critical for a high availability solution. This can be achieved through hardware solutions such as the Model 20 disk array or EMC Symmetrix products, or through software solutions such as MirrorDisk/iX. With the introduction of High Availability Array Failover, the HP 3000 can recover from any fault that occurs along the I/O path. Both the Model 20 and EMC arrays have two controllers that can act as alternates for each other. Recognizing that one path is unavailable, MPE takes a detour (via the alternate card, cable, and disk controller) to the data within the disk array. With this feature, the arrays provide the full protection required for a high availability solution.
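The sketch below models the detour conceptually; it is not MPE's actual algorithm, just an illustration of retrying I/O over an alternate card/cable/controller path when the primary path faults.

    # Conceptual model of alternate-path failover (not MPE's implementation).
    class IOPath:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def transfer(self, block):
            if not self.healthy:
                raise IOError(f"path {self.name} unavailable")
            return f"block {block} via {self.name}"

    def read_block(block, primary, alternate):
        """Try the primary path; detour through the alternate on a fault."""
        try:
            return primary.transfer(block)
        except IOError:
            return alternate.transfer(block)

    primary = IOPath("card0/cable0/controllerA", healthy=False)  # simulated fault
    alternate = IOPath("card1/cable1/controllerB")
    print(read_block(42, primary, alternate))  # served via the alternate path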
Disaster Tolerance System: Rapid Recovery in the Face of Disaster
The High Availability System, although providing rapid recovery, does not protect against a disaster at your data center or a system-level fault. The Disaster Tolerance System, combined with the High Availability System, addresses the highest level of availability need. SharePlex/iX-NetBase and EMC's newly introduced SRDF are two products that provide solutions at this level of the hierarchy.
Both SharePlex/iX-NetBase and SRDF shadow data to secondary hosts that can be accessed in the event of a disaster. However, applications and users can access data shadowed by SharePlex/iX-NetBase during replication, whereas SRDF requires replication to halt and the volumes to be mounted before allowing access to the data. In this sense, SRDF is strictly a disaster tolerance solution, while SharePlex/iX-NetBase provides disaster tolerance as well as clustering and load balancing for the HP 3000.
Recovery Planning
A Recovery Plan is probably the most important deliverable you can produce to minimize the cost associated with an interruption in service. It is also a deliverable entirely within your control and completely customized to your environment.
One of the most important needs in enterprise-wide storage is recovery. IT managers must establish a recovery policy that provides the appropriate level of data integrity and availability and ensures that critical data can be recovered completely and quickly, even in the event of a disaster. Your Recovery Plan document should include specific, detailed recovery strategies and procedures for every scenario your staff may face, including but not limited to: system aborts, application aborts, disk faults, power outages, network interruptions, system component faults, user and operator errors, and disasters. The first priority should be to get business-critical applications available with minimal business loss. An application's cost of downtime drives this priority, as does the amount of downtime you can tolerate. Ask yourself: do I have the appropriate recovery strategy for each application? If not, look at ways of reducing risk and minimizing recovery time (e.g., MirrorDisk/iX, Fast/Wide arrays, SRDF, SharePlex), reducing downtime due to backup (TurboSTORE/iX 7x24 True-Online), using faster recovery/backup devices (e.g., DDS-3 or DLT), and optimizing backup/recovery configurations (e.g., user volume application sets, massive parallel store/restores, interleaving).
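One way to make such a plan concrete is a simple scenario-to-procedure table; the entries below are hypothetical placeholders, not recommended values.

    # Hypothetical recovery-plan skeleton: each scenario maps to a documented
    # procedure and a tolerable-downtime figure driven by cost of downtime.
    recovery_plan = {
        "disk fault":     ("switch to mirror or array alternate path", 0.0),
        "system abort":   ("restart system, verify transaction logs", 1.0),
        "site disaster":  ("fail users over to SharePlex/iX-NetBase shadow", 4.0),
        "operator error": ("restore affected fileset from last backup", 8.0),
    }

    def procedure_for(scenario):
        if scenario not in recovery_plan:
            raise KeyError(f"no documented procedure for '{scenario}' - update the plan")
        procedure, max_hours = recovery_plan[scenario]
        return f"{procedure} (tolerable downtime: {max_hours} hours)"

    print(procedure_for("disk fault"))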
One key component of your recovery plan is an appropriate, detailed procedure for each fault. You don't want to go through the process of switching users over to an alternate server in a SharePlex/iX-NetBase environment if another recovery procedure is less expensive and less disruptive; knowing when to initiate a recovery plan is critical to any successful disaster plan. Likewise, protecting your disks with a disaster tolerance solution is not the most cost-effective means of attaining high availability: protect your business from disk faults with MirrorDisk/iX or disk arrays rather than with SharePlex/iX-NetBase.
A recovery plan should be tested with the operations staff who will implement it in a real disruption. Perform dry runs of the recovery plan with different failure scenarios, and test, review, and update the plan regularly. A good Recovery Plan is only good as long as nothing changes.
Data Management
Data Management is the activity of organizing, protecting, archiving, retrieving, and storing data, and it is the cornerstone of the enterprise data center. Issues such as disk space, media management, file retention and aging, and backup and recovery are key concerns of the data center system manager. Businesses cannot continue normal operations without access to current, accurate information; the loss of a database with no backup could spell the demise of the business.
Backup Trends
Strategies for backing up data range from small shops able to back up data at night to enterprise-wide backups of heterogeneous clients and servers. In a small shop there is little data to back up; a job can be scheduled at night after the close of the business day, very little goes wrong, and the jobs complete easily by the start of the next business day. Such a company's availability requirements are easily met with the basic STORE product included with the standard MPE/iX operating system. As companies grow, with more data to back up, it becomes more and more difficult to complete backups within the allotted time. This leads many companies to adopt night shifts to change tapes or to look for other ways to complete backups within the window. One typical approach is to break up the store process by running massive parallel stores on separate user volume sets. Additionally, faster, larger-capacity devices can be used; some companies have had great success with autochangers or libraries, which can also help implement a lights-out environment. To eliminate downtime due to backup, some customers have gone to on-line backup, which allows users and jobs to continue modifying databases and files while the backup is occurring.
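The arithmetic behind the shrinking window is straightforward; in the hypothetical example below, splitting the store across user volume sets on parallel devices is what brings a 120GB backup inside an 8-hour window (the dataset size and 5GB/hr device rate are assumptions).

    # Illustrative backup-window arithmetic; all figures are hypothetical.
    DATA_GB = 120.0
    DRIVE_RATE_GB_PER_HR = 5.0   # assumed throughput of one tape device

    def backup_hours(total_gb, drives):
        """Elapsed hours if data splits evenly across parallel stores."""
        return total_gb / (DRIVE_RATE_GB_PER_HR * drives)

    for drives in (1, 2, 4):
        print(f"{drives} parallel store(s): {backup_hours(DATA_GB, drives):.1f} hours")
    # 1 drive: 24 hours (misses an 8-hour window); 4 drives: 6 hours (fits).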
STORE/iX and TurboSTORE/iX
STORE is an excellent choice for small shops that do not require online backup or face tight backup windows. STORE is included with the OS and offers basic functionality.
TurboSTORE/iX products for the HP 3000 provide high performance backup solutions designed to meet today’s backup requirements. TurboSTORE/iX products offer powerful parallel backup and recovery, data interleaving, data compression, and online backup capabilities.
TurboSTORE/iX 7x24 True-Online
As businesses move closer to continuous operations, IT managers face a growing need for solutions that can meet the demands of a 7-days-per-week, 24-hours-per-day environment. TurboSTORE/iX 7x24 True-Online Backup was designed specifically for 7x24 environments, providing backup of selected data without requiring application downtime or user logoff. In addition, True-Online provides the same powerful backup capabilities as previous versions of TurboSTORE/iX.
Legato Solutions
Legato offers a true client/server solution for network backup in heterogeneous computing environments, tuned to handle data transfers across the network very efficiently. Couple Legato's NetWorker server software with TurboSTORE/iX 7x24 True-Online and you have a lights-out solution that covers your entire environment. At the simplest level, there is the Legato NetWorker Client for MPE/iX. It offers high-speed backups to libraries attached to either a UNIX or NT server from clients such as Windows NT desktops, HP-UX, AIX, Solaris, MPE/iX, SCO UNIX, Macintosh, NetWare, and NT servers. Legato Storage Nodes are a unique approach to enterprise storage management that enables centralized management in a distributed environment, and they are a key component of Legato's Enterprise Storage Management Architecture strategy.
NetWorker ClientPak for MPE/iX is a client/server application that provides advanced storage management capabilities to a wide variety of servers and desktop computers. It is an excellent example of a product well established in the UNIX world that provides HP 3000 customers with better integration in a multi-vendor environment.
Legato Storage Node for MPE/iX
The main barrier to large backups over a network is the available network bandwidth. With 100VG, and later with Fibre Channel, network backups of large amounts of data become more feasible. In many mission-critical environments, however, customers need to move backup data off their systems faster than network backups allow. That is when a Storage Node becomes a better solution than a network backup: a Storage Node still allows the user to manage policies and procedures from a central location, but backups are local to the host HP 3000.
The Storage Node is a unique architecture that increases system availability, manageability, and performance. A Storage Node is any HP 3000 on the network that has a locally attached storage device and acts as a remote management agent. Within a storage node, data backup takes place locally, but the metadata (file index and control information) is still maintained on the central NetWorker server. In the Enterprise Storage Management Architecture, Storage Nodes, along with a NetWorker server and its networked clients, comprise a data zone; enterprise networks may contain one or multiple data zones. There are many advantages to deploying storage nodes in the enterprise:
Increased Performance: Storage Nodes provide the capability to locate storage devices closer to the physical location of the data that needs protection. The data itself is backed up locally, and only the control information traverses the network, so backup/restore performance over the network is optimized while central management control is maintained at the NetWorker server. This is especially beneficial for very large databases where enormous amounts of data need protection.
Fail-over capabilities: Clients in a network contain a list of storage nodes that can back up their data. Should the first Storage Node and/or its attached devices in that list be unavailable for any reason, the backup proceeds to the second Storage Node on the list, and so forth (see the sketch after this list).
Data Protection at a Local Level: Remote data never needs to leave the site, providing a local level of data protection.
Centralized Management/Administration: While local data does not need to leave the remote site, management of that data can remain at a centralized location, minimizing the need for additional personnel to support remote locations.
Cross-Platform Support: Storage Nodes provide for a very flexible network. NetWorker servers and storage nodes can run on any supported platform; for example, an HP-UX server can be deployed with MPE/iX, Windows NT, and HP-UX storage nodes.
Scalability: Storage Nodes allow the flexibility to grow enterprise networks as data needs expand, technology changes, and organizations evolve.
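The fail-over behavior above can be sketched as a simple ordered search; the node names and the health check are hypothetical stand-ins, not Legato's actual API.

    # Sketch of client-side storage node fail-over (hypothetical names/API).
    STORAGE_NODES = ["node-hq", "node-remote1", "node-remote2"]  # preference order

    def node_available(node):
        """Placeholder health check; a real client would probe the node."""
        return node != "node-hq"   # simulate the first node being down

    def select_storage_node(nodes):
        for node in nodes:
            if node_available(node):
                return node
        raise RuntimeError("no storage node available; backup cannot proceed")

    print(f"Backing up via: {select_storage_node(STORAGE_NODES)}")  # node-remote1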
Selecting a Hardware Device
HP solutions for the HP 3000 are chosen based on customer requirements for the amount of data to be stored and the time in which to store it. For customers with small datasets and longer backup windows, DDS may be a very appropriate solution. At the midrange and high end, DLT7000 mechanisms and libraries provide backup for customers with large amounts of data and limited backup windows. Based on throughput and capacity, multiple DLT7000 mechanisms and libraries can meet the needs of customers with large volumes of data and aggressive backup windows.
HP DAT Products
For backup on the HP 3000, HP offers the latest DDS-3 tape drives. The DDS-3 format has a native capacity of 12GB; with data compression, customers can typically store 24GB on a single tape. DDS DAT is a highly reliable, industry-standard, high-capacity device, and its compact media is easily stored in a fireproof safe.
Digital Linear Tape
The HP 3000 servers support automated digital linear tape (DLT) mechanisms. The DLT4000 provides greater native cartridge capacity (20GB) and the DLT7000 provides 35GB native, enabling fast, unattended backup of large quantities of data within the brief windows available in today's high-end, mission-critical environments. The DLT4000's native transfer rate (5.0GB/hr) is nearly 40% faster than DDS-3's (3.6GB/hr). Besides larger capacity, DLT boasts superior drive-head longevity.
The DLT7000 is a fast-wide device with a native transfer rate of 5.5MB/sec. The DLT7000's native tape capacity is 35GB, or 70GB compressed. STORE and TurboSTORE/iX support the standalone DLT7000.
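Using the transfer rates quoted above, a quick comparison shows how device choice changes backup time for a given dataset; the 100GB figure is hypothetical.

    # Time to back up a hypothetical 100GB dataset at the quoted native rates.
    rates_gb_per_hr = {
        "DDS-3":   3.6,
        "DLT4000": 5.0,
        "DLT7000": 5.5 * 3.6,   # 5.5MB/s * 3600s / 1000MB-per-GB = 19.8GB/hr
    }

    DATA_GB = 100.0
    for device, rate in rates_gb_per_hr.items():
        print(f"{device}: {DATA_GB / rate:.1f} hours for {DATA_GB:.0f}GB")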
Libraries
The DLT7000 libraries allow not only unattended backup but also very high-speed backups. In addition, many of our customers need to access archive data faster than a traditional tape vault allows. Libraries also provide greater data security by removing the need for operator intervention, and in network environments the library itself can be located in a secure environment. DLT7000 libraries are supported on the HP 3000 only with the Legato Storage Node software.
Backup Schedules and Plans
Backup is the #1 cause of planned application downtime today, accounting for 83% of total planned downtime. As with recovery planning, ascertain the length of your backup window. How much planned downtime can your business afford for backup? If you cannot tolerate any planned downtime and have chosen an on-line backup solution, when do the windows of low application/system usage occur, and how long are they? Your backup window will drive other decisions, such as the speed and number of your backup devices and configuration policies such as the number of parallel stores and the backup schedule. If application availability allows, the standard full/partial backup schedule is a good choice. An alternative is to rotate full backups and partials of major applications: each major application's full backup falls on a different day, combined with partials of the other applications.
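One simple way to express such a rotation is sketched below; the application names and the five-day cycle are hypothetical.

    # Hypothetical full/partial rotation: each application's full backup falls
    # on a different day, with partials of the others on that day.
    APPLICATIONS = ["orders", "billing", "inventory"]
    DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

    for i, day in enumerate(DAYS):
        full = APPLICATIONS[i % len(APPLICATIONS)]
        partials = [a for a in APPLICATIONS if a != full]
        print(f"{day}: FULL {full}; partial {', '.join(partials)}")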
Policies should be in place to minimize the amount of data in each backup stream. Your current backup probably includes reference or archival data; by removing it from the active data backup, backup times can be reduced significantly. Put your backup on a diet: don't back up data that doesn't need to be recovered or that will be recovered by other means. For example, system data is recovered from the SLT and FOS tapes, STDLIST spoolfiles are unnecessary, and non-production utilities can often be recovered from other systems. Continually monitor for files that are no longer needed and can be archived on less expensive media. Significant amounts of current on-line storage capacity can be freed by automated data management activities such as file compression, trimming, or purging.
Capacity and Performance Management
Capacity Management is the management activity that measures, plans for, tracks, and implements solutions to ensure the optimum capacity required to maintain the business. Historically, long lead times to acquire more capacity placed a great deal of emphasis on this function. Insufficient capacity or unacceptable performance during peak processing cycles can create major hardships for the business: invoices are not issued on time, orders cannot be processed, year-end closings cannot be completed. These problems are all very costly to the organization. HP 3000 solutions for capacity and performance management include HP's Glance software, which provides immediate performance information about the customer's computer system.
Resource Sharing
In multi-system environments you want to take advantage of all the system resources available to you. SharePlex/iX-NetBase can make resources on any system on the network available to all users. Files, databases, printers, and programs can be shared among users on the network, regardless of geographic location.
With a cluster approach, the SharePlex/iX-NetBase bundle gives midrange to high-end customers virtually unlimited growth potential. Many existing customers have reached operating system limits such as table, locking, and concurrent end-user limits. AutoRPM and Network File Access (NFA), two products within the SharePlex/iX-NetBase suite, give customers the ability to transcend these limits by implementing a clustering strategy.
With NFA, required data need not reside on a single server; it can be dispersed throughout a cluster. NetBase maintains a centralized directory of all files and databases available to network users, and the directory automatically directs applications to the appropriate location of each file.
AutoRPM allows users transparent access to programs located on a remote machine. Users gain instant access to virtually any software on the network. AutoRPM transports the user to the appropriate server within the cluster.
Problem and Change Management
Problem Management is a set of processes, products and personnel dedicated to reporting, diagnosing, tracking and resolving problems in the data center environment. Change management is the management and control over the introduction of change into the Information Technology environment including system software, hardware, applications, data, configuration and facilities. The objective is to create an environment where changes do not disrupt the availability of business applications.
HP OpenView IT/Operations (formerly OperationsCenter) is a data center solution that helps the system manager identify and diagnose problems, track and report them, and control the work flow.
Service & Support
HP offers a broad continuum of services that enable customers to scale and manage their specific availability needs based upon their computing environment. These services begin with Standard Support, move up to Critical System Support and extend to Business Continuity Support.
Standard Support
HP's Standard Support includes software and network support, licenses for updates, software media and documentation. It ranges from eight hours a day, five days a week to 24 hours a day, 365 days a year. Beyond Standard Support, HP also offers two special high availability services: Critical System Support and Business Continuity Support.
Critical System Support
HP's Critical System Support (CSS) is a suite of services for businesses running enterprise-class computing environments that require very high systems availability. It is a modular, flexible service designed for companies that want a proactive response and immediate reactive support beyond HP's standard services.
CSS provides hardware, software and network support for critical systems by reducing the frequency of systems failures, and helping to recover systems when problems occur. HP assigns a CSS support manager who leads a team of experts in maintaining high-availability computing environments. In the event of an outage, the CSS system recovery team responds immediately with access to HP's critical-parts network and immediate dispatch of personnel to repair system hardware within six hours. Phone-in software assistance is available 24 hours a day, 365 days a year.
Business Continuity Support
HP's Business Continuity Support (BCS) is the industry's most comprehensive support program. It guarantees a four-hour call-to-restoration time and features maximum on-site presence and an outage prevention program tailored to each customer's specific needs. The service is designed to identify and prevent problems before they affect operations; if a problem does occur, it responds quickly to restore the system.
BCS features an assigned account team experienced in high-availability hardware, software, and network support. These specialists include an account support manager, trained specialists from HP's phone-in response center, R&D engineers, support delivery specialists, and HP senior management.
Disaster Recovery services
A solid contingency plan may be the single most critical factor in a company's ability to survive a disaster. Hewlett-Packard offers flexible business recovery services to keep a company operating in the event of such a catastrophe. Two services, HP Backup-Quickship and HP Backup Service, offer a range of options from do-it-yourself PC-based methodology to expert consulting to comprehensive integrated planning.
Future Requirements for the Data Center
As HP designs and evolves its Data Center Management solutions to meet our customers' growing requirements, we can't lose sight of some important trends in their data center environments. Faster network capabilities will make centralized network backup even more feasible, and as HP 3000 systems co-exist with heterogeneous systems, customers will want to manage multiple clients and servers from a single centralized point. FibreChannel will become the dominant storage interface over the next three to five years, bringing faster interconnects along with faster storage devices. Data capacity requirements will continue to grow, requiring solutions that recover and back up terabytes of data. Cost pressures and availability requirements continue to emphasize solutions that support operatorless environments, with system management driven by exception notification and centralized command center support. More and more customers will want storage management functionality that allows on-line backups and simple yet sophisticated media management. But the driving force is the customer's demand for instantaneous access to data, no matter what time, no matter where.