HPWorld 98 & ERP 98 Proceedings

Service Management Support with
HP OpenView OmniBack II

Harald Burose

HP OpenView Service Management Solutions
February 1998

Introduction

Data Backup and Recovery are critical elements in IT Service Delivery and Management. Unfortunately, users may perceive data backup as an irritation that can deny them access to services while it is being conducted. However, without this necessary evil the continued availability of services within agreed times can be compromised and placed at significant risk in the event of data loss due to software failures, hardware failures or operational error. As IT Departments evolve to become internal IT Service Providers, the proactive management and monitoring of the quality of services delivered becomes a vital business activity. Service Level measures and reports act as one of the key tools that IT management can use to demonstrate the value that is delivered to the organisation and to also maintain competitive cost structures in the face of the increased trend toward selective outsourcing.

Enhancements to the HP OmniBack II software in version 3.0 provide IT Service Managers with key data to enable the proactive monitoring and planning of backup and data recovery operations. This information can be leveraged into service availability and recovery planning activities that are key if Service Level Agreements are to be adhered to. In addition, the data provided by the enhanced version of OmniBack II can be leveraged to implement cost management and chargeback models for true IT Financial Management. This White Paper provides an overview of the Service Management enhancements in version 3.0 of OmniBack II and discusses their use in availability, cost management and data assurance planning and operations. Three sample scenarios are also presented to illustrate the practical use of the enhancements.

 

Service Level Agreements

A number of published works provide comprehensive discussion of the value of, and process for establishing, Service Level Agreements (SLAs), and the reader is encouraged to reference these documents for more detailed information as required. For the purposes of this document we will consider three aspects that are typical of many service level agreements: Availability, Time to Recover and Cost Recovery (or Chargeback).

Availability is typically defined as the percentage of time when the service will be accessible and usable by users within agreed time constraints. A typical availability clause in an SLA might be: "The service will be available for use for 98% of the time between the hours of 8:00am and 6:00pm, Monday through Friday, excepting public holidays". There may be more detail within the agreement to describe what is meant by "98% of the time" and how this is calculated, but this is beyond the scope of this document.

Time to Recover defines how quickly a service will be restored once an outage has occurred. For example, we might include a clause within our SLA that states "the maximum duration of any single service outage instance will be two hours from the time that the outage is identified". The SLA will typically also define the maximum number of outages that are acceptable within an agreed timeframe, for example "no more than 2 outages in any one month period".
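The two example clauses above can be checked mechanically once outage start and end times are recorded. The following is a minimal sketch only: the function name, the data layout and the reading of "any one month period" as a calendar month are assumptions made for illustration, and the outage times are invented.

```python
from datetime import datetime, timedelta

# Limits taken from the example clauses in the text.
MAX_OUTAGE = timedelta(hours=2)
MAX_OUTAGES_PER_MONTH = 2

def check_time_to_recover(outages):
    """Return violation messages for a list of (identified, restored) pairs."""
    violations = []
    per_month = {}
    for identified, restored in outages:
        duration = restored - identified
        if duration > MAX_OUTAGE:
            violations.append(f"{identified:%Y-%m-%d}: outage lasted {duration}")
        # Assumption: "one month period" is treated as a calendar month.
        key = (identified.year, identified.month)
        per_month[key] = per_month.get(key, 0) + 1
    for (year, month), count in sorted(per_month.items()):
        if count > MAX_OUTAGES_PER_MONTH:
            violations.append(f"{year}-{month:02d}: {count} outages")
    return violations

# Invented sample outages, for illustration only.
outages = [
    (datetime(1998, 3, 2, 9, 0), datetime(1998, 3, 2, 10, 30)),
    (datetime(1998, 3, 10, 14, 0), datetime(1998, 3, 10, 16, 45)),
    (datetime(1998, 3, 20, 8, 15), datetime(1998, 3, 20, 8, 50)),
]
for message in check_time_to_recover(outages):
    print(message)
```

A real agreement would define precisely when an outage counts as "identified" and how the measurement period rolls over; those rules would replace the assumptions made here.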

Cost Recovery is an optional activity that is a natural extension of IT’s position as a Service Provider. The cost of delivering the service is recovered directly from the user based upon a model that reflects the user’s consumption of the service. In the case of Backup operations we might choose to bill the user based upon the number of Gigabytes of data backed up during each operation. This is a much more egalitarian approach than simply distributing the cost of the OmniBack infrastructure equally across all users, although the latter approach also has its merits (primarily that of simplicity).
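The difference between the two cost recovery models can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the user groups, volumes and total cost are invented, and the function names are hypothetical.

```python
def bill_per_gb(gb_by_user, rate_per_gb):
    """Charge each user in proportion to the volume backed up."""
    return {user: round(gb * rate_per_gb, 2) for user, gb in gb_by_user.items()}

def bill_equal_share(gb_by_user, total_cost):
    """Split the total cost equally, ignoring consumption."""
    share = round(total_cost / len(gb_by_user), 2)
    return {user: share for user in gb_by_user}

# Invented figures, for illustration only.
gb_backed_up = {"finance": 120.0, "sales": 30.0, "hr": 10.0}
total_cost = 1600.0
rate = total_cost / sum(gb_backed_up.values())  # cost per GB that recovers total_cost

print(bill_per_gb(gb_backed_up, rate))
print(bill_equal_share(gb_backed_up, total_cost))
```

Under the usage-based model the heaviest consumer carries most of the cost, whereas the equal split charges the smallest consumer the same amount as the largest.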

Clearly, backup and recovery play a significant part in supporting these Service Level Objectives (SLOs). We need to ensure that any curfew that is enforced during data backup is short enough that it does not impact upon our ability to make the service available at the agreed times, for both Interactive and Batch processing activities. We also need to ensure that any data recovery operations that may be necessary in the event of a service outage can be completed within the mandated time to recover, in addition to any other remedial activity (such as replacing a defective hardware component).

In addition to ongoing monitoring and management of our backup service, it is also highly desirable (not to mention sensible) to perform proactive analysis of trends in backup performance, for example the duration and amount of data processed. This enables potential issues to be identified ahead of time and remedial activities to be planned and implemented before the backup service has a negative impact upon availability of the application services.

 

Backup and Recovery Service Monitoring

In order to support our Availability and Time to Recover Service Level Objectives, we first of all need to be able to monitor how long the backup and data recovery operations take (along with other key measures such as the volume of data processed), and maintain this data in a repository to support reporting and analysis activities. HP OmniBack II version 3.0 has been enhanced to track the elapsed times of key operations and to register this data as well as volume data using the Application Response Measurement Version 2.0 API (ARM 2.0 API).

The ARM API is an emerging standard for measuring end to end response times of transactions in distributed environments. Application programs that utilise the ARM API act as sources of response time information (and also user supplied information that may be relevant to a particular transaction) for ARM compliant system management and monitoring tools such as HP MeasureWare. HP MeasureWare will log ARM transaction information in its repository for subsequent analysis and reporting. It also has the capability to raise real time alerts (or "alarms") when the elapsed time of a specific transaction, such as a backup operation, exceeds a predefined threshold. When a real time alert is raised a number of actions are possible including, but not limited to, informing a central operations console such as HP OpenView IT/Operations, paging a system operator or taking automated remedial action to resolve the problem.
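The general pattern described above (mark the start of a transaction, mark its end, attach user-supplied data, and raise an alarm when a threshold is exceeded) can be sketched as follows. This is not the ARM API and not how MeasureWare is implemented; the class, its parameters and the alarm callback are hypothetical stand-ins used purely to illustrate the idea, and a fake clock is injected so the example runs instantly.

```python
import time

class TimedTransaction:
    """Hypothetical stand-in for a start/stop style response-time measurement."""

    def __init__(self, name, threshold_seconds, on_alarm, clock=time.monotonic):
        self.name = name
        self.threshold_seconds = threshold_seconds
        self.on_alarm = on_alarm
        self.clock = clock
        self.elapsed = None
        self.data = {}  # user-supplied data, e.g. GB processed

    def __enter__(self):
        self._start = self.clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = self.clock() - self._start
        if self.elapsed > self.threshold_seconds:
            self.on_alarm(self.name, self.elapsed, self.data)
        return False  # never suppress exceptions from the measured work

alarms = []

def record_alarm(name, elapsed, data):
    """Stand-in for notifying an operations console or paging an operator."""
    alarms.append((name, elapsed, data))

# Fake clock: the "backup" appears to take 5400 seconds against a 3600 second threshold.
ticks = iter([0.0, 5400.0])
with TimedTransaction("datalist_backup", 3600, record_alarm,
                      clock=lambda: next(ticks)) as txn:
    txn.data["gb_processed"] = 42.5

print(alarms)
```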

The following transactions have been instrumented to provide data to the ARM API in HP OmniBack II:

Transaction Description | Additional Data Logged to ARM | Usage
Datalist backup session duration | GB of data processed | Availability and Recovery Planning, Chargeback
Object backup session duration | GB of data processed | Availability and Recovery Planning, Chargeback
Restore session duration | GB of data recovered | Availability and Recovery Planning
OmniBack DB purge duration | OmniBack Database size after purge | OmniBack architecture management
OmniBack DB Check duration | No. of errors, No. of client systems | OmniBack architecture management, Chargeback

 

Every event of the types listed above will have its elapsed time, and any additional data noted in the table, logged to ARM. On systems with HP MeasureWare installed, all of these events can then have alerts raised against them in real time. The elapsed time and subsidiary data are made available for subsequent Service Level Reporting and Analysis, either using the HP PerfView Analyser or by using the standard export functionality to provide data to spreadsheets or other analysis tools such as SAS/CPE. The first three items are of particular interest in managing Service Levels as perceived by end users and relate directly to the Availability, Time to Recover and Cost Recovery sections of the SLA. The last two items are not of direct interest to Line of Business users (with the exception of the number of client systems supported), but are highly relevant to IT staff seeking to proactively manage the OmniBack operational environment.

 

 

Service Level Reporting and Analysis with OmniBack II ARM data

Once the data has been logged to HP MeasureWare it can either be exported for use in analysis tools such as Lotus 123 or Microsoft Excel, or it can be directly accessed using the HP PerfView Analyser and Planner tools for graphing and trending purposes. The types of activity that might be conducted using the data include:

    • Real time alerting of backup or restore sessions that overrun their allotted time window
    • Graphing of the elapsed time of particular backup or recovery sessions over time to detect trends in increased elapsed time. Allied with the additional data which details the volume of information that was backed up we can determine if the elapsed time is increasing due to additional data volumes or due to some bottleneck within the infrastructure such as network bandwidth or backup server resource overutilisation. HP MeasureWare provides a comprehensive set of system and network level statistics to support detailed analysis of this type of scenario.
    • Forecasting of the projected future duration of backup and restore operations to determine when the allotted time window will be exceeded. This enables proactive decisions to be made regarding the scheduling of operations and tuning or augmentation of the infrastructure (faster networks, upgraded backup servers, additional backup devices etc.) to ensure that Service Levels are maintained.
    • Trending and forecasting of the time taken to perform periodic maintenance on the OmniBack database (purge and check operations) to determine when the elapsed time of these operations will exceed the allowable time window and take suitable proactive steps to manage this.
    • Monitoring and reporting on the volumes of data backed up and recovered for backup device planning purposes and input into any chargeback activities.
    • Input into high level Service Quality review data for periodic reviews by the joint IT/User Service review board.
    • Trending and forecasting of the size of the OmniBack software’s database for backup server disk space management and planning.
    • Trending and forecasting of the number of OmniBack clients being served for software license planning, SLA validation and cost accounting/chargeback purposes.
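The second activity in the list, separating growth in data volume from an infrastructure bottleneck, comes down to comparing throughput between sessions. The following is a minimal sketch; the function name, the 10% tolerance and the sample figures are assumptions chosen for illustration and are not part of any HP tool.

```python
def diagnose(sessions, tolerance=0.10):
    """Compare the oldest and newest sessions.

    sessions: list of (elapsed_minutes, gb_processed), oldest first.
    """
    first_min, first_gb = sessions[0]
    last_min, last_gb = sessions[-1]
    if last_min <= first_min:
        return "no increase in elapsed time"
    first_rate = first_gb / first_min  # GB per minute
    last_rate = last_gb / last_min
    if last_rate < first_rate * (1 - tolerance):
        return "throughput dropped: suspect an infrastructure bottleneck"
    return "throughput steady: growth in data volume explains the increase"

# Invented examples: same throughput with more data, then same data with less throughput.
print(diagnose([(300, 60.0), (400, 80.0)]))
print(diagnose([(300, 60.0), (450, 60.0)]))
```

In practice a trend across many sessions would be more robust than a comparison of two end points, and the system and network statistics mentioned above would be needed to locate the bottleneck itself.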

 

These activities have direct benefit to the business in terms of increased quality resulting from proactive management of the availability of services, structured and accurate planning of budgets for backup infrastructure enhancement and fair and appropriate cost recovery models.

Service Management Reporting scenario 1

In this scenario we look at the use of OmniBack II Service Management data in conjunction with data from the backup infrastructure to analyse the cause of a backup duration SLA violation. Figure 1 shows historical data logged by HP MeasureWare for relevant resources and activities.

The metrics presented in the graph are described below (metrics marked * are from OmniBack II):

Metric Graphed | Description
SAPSVR1:SAP_DB:Backup_Time_Minutes (X10)* | The minutes taken to back up our SAP database (in units of 10 to ease graph scaling)
SAPSVR1:SAP_DB:Backup_Target_Minutes (X10) | Our agreed Service Level Objective for completing this backup
SAPSVR1:SAP_DB:GB_Stored* | The number of Gigabytes stored during the operation
SAPSVR1:GBL_Net_Packet_Rate (X1000) | The rate of network packets through the SAP system’s LAN card
SAPSVR1:GBL_CPU_Total_Util | The average Global CPU Utilisation on the SAP system during the backup
OBSVR1:GBL_Net_Packet_Rate (X1000) | The rate of network packets through the OmniBack server’s LAN card
OBSVR1:GBL_CPU_Total_Util | The average Global CPU Utilisation on the OmniBack server during the backup
OBSVR1:NMX:LAN_Utilisation:Segment1 | The utilisation of the network segment between the SAP system and the OmniBack server during the backup

 

In this case, by correlating the appropriate metrics from HP MeasureWare (with data automatically integrated from the MeasureWare collector, OmniBack’s ARM 2.0 extensions and HP NetMetrix) it is immediately obvious that an increase in the usage of the network segment between the SAP database and OmniBack servers caused the problem. This reduced the network throughput for the backup and resulted in an unacceptable increase in the backup time. Further analysis using HP PerfView and NetMetrix would allow the root cause to be determined and rectified.
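The visual correlation described in this scenario can also be approximated numerically on exported data. The sketch below ranks candidate metrics by the strength of their Pearson correlation with backup time. The metric names are taken from the table in this scenario, but every sample value is invented for illustration, and the approach is an assumption about how such an analysis might be done rather than a description of what PerfView does.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series cannot explain any variation
    return cov / sqrt(var_x * var_y)

# Invented sample values: the last two backups overran.
backup_minutes = [300, 305, 298, 302, 420, 450]
candidates = {
    "SAPSVR1:SAP_DB:GB_Stored": [60, 61, 60, 61, 60, 61],
    "SAPSVR1:GBL_CPU_Total_Util": [35, 40, 33, 38, 36, 34],
    "OBSVR1:NMX:LAN_Utilisation:Segment1": [20, 22, 19, 21, 70, 80],
}

ranked = sorted(
    candidates,
    key=lambda name: abs(pearson(candidates[name], backup_minutes)),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {pearson(candidates[name], backup_minutes):+.2f}")
```

A strong correlation only points to where further analysis should start; it does not by itself establish the root cause.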

In addition, at the time of the SLA violation, HP MeasureWare could, if desired, raise an alarm to notify the appropriate operations staff (for example, by radio paging) that intervention was required, enabling corrective measures to be taken before the business was impacted.

Service Management Reporting scenario 2

Next, we will look at the use of OmniBack II Service Management data for the proactive prediction of when Service Level Objectives and Backup Infrastructure limits will be reached. Figure 2 shows historical data logged by HP MeasureWare for relevant resources and activities, in addition to extrapolations of key metrics based upon trend analysis.

We are considering two key metrics (also presented and described in scenario 1):

Metric Graphed | Description
SAPSVR1:SAP_DB:Backup_Time_Minutes (X10) | The minutes taken to back up our SAP database (in units of 10 to ease graph scaling)
SAPSVR1:SAP_DB:GB_Stored | The number of Gigabytes stored during the operation

Also graphed are the forecasted values for these two measures for the next six months, the Service Level Objective for the backup (10 hours) and the physical capacity of the backup device currently being used (a 120GB Optical Juke Box).

The analysis provides us with two valuable pieces of information. Based upon historical trend analysis we expect the time taken to perform the backup to exceed our agreed SLO at the beginning of September, in six months’ time. In addition, and more urgently, we can see that the volume of data backed up will exceed the capacity of the Optical Juke Box in four months’ time. Clearly, we need to make some changes to the backup infrastructure to rectify this potential problem before it occurs. An upgrade to a higher capacity backup device will address the storage capacity problem. Before investing in this upgrade we should also perform further analysis to determine the limiting factor in the speed of our backup, so that if the backup device itself is the throughput bottleneck we can address it at the time of the upgrade. Detailed analysis of the data captured by HP MeasureWare will enable this to be conducted quickly.
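A forecast of this kind can be approximated with a straight-line trend fitted to the monthly history. The sketch below is one simple way to do it and is not a description of the PerfView Planner method. The 600 minute (10 hour) SLO and the 120GB capacity come from this scenario; the function names and the monthly history values are invented so that the result mirrors the six and four month figures discussed above.

```python
def linear_fit(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def months_until_limit(values, limit):
    """Whole months after the last sample until the trend line exceeds limit.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    x = last_x
    while intercept + slope * x <= limit:
        x += 1
    return x - last_x

# Invented monthly history, oldest first.
backup_minutes = [462, 478, 494, 510]
gb_stored = [88, 93, 98, 103]

print(months_until_limit(backup_minutes, 600))  # SLO: 10 hours
print(months_until_limit(gb_stored, 120))       # device capacity: 120GB
```

A straight line is the simplest possible model; seasonal workloads or step changes in data volume would call for a longer history and a more careful method.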

We might also perform a similar analysis using the time taken to restore data in order to ensure that we can continue to meet our Time to Recover SLOs.

Service Management Reporting scenario 3

In our final scenario, we will consider the use of OmniBack II data in Cost Accounting and Chargeback activities. The Service Management extensions for version 3 of OmniBack II include the logging of the volumes of data processed by each backup operation. The ARM compliant collection software, for example HP MeasureWare, logs this data. Once logged, the information can be exported to the Cost Management tool that the IT department has selected. HP MeasureWare provides straightforward export of data into spreadsheets and also direct import into packaged Cost Management solutions.

A sample excerpt from a cost management billing report is shown below.

Note that the costs shown in the sample are fictitious and are purely for illustrative purposes. The actual cost rates used would typically include items such as software and hardware depreciation, media and labour costs, and fixed infrastructure charges (such as computer suite floorspace and power), and might be broken out separately. Fixed tariffs for the monthly provision of each service (in addition to the per usage costs) might also be levied, depending upon the terms of the Service Level Agreement.

The sample Statement of Costs illustrates an aspect of Service Level Agreements that may be considered to be extreme: the imposition of penalties on the IT Service Provider in the event of SLOs not being achieved. In this case we failed to meet our Objective for backing up the SAP database on March 1st and therefore do not charge the customer for this particular instance. Penalties for under performance are not mandatory; however, they do signify a Service Provider’s level of maturity and control in the delivery of IT services, and help to maintain constant focus in ensuring that the quality of service delivered meets the needs of the users and the Business.
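The waiver logic described above is straightforward to express once session volumes and durations have been exported. The following is a hypothetical sketch: the rate, the session figures and the function name are invented, and only the rule itself (no charge for a session that missed its Objective) comes from the text.

```python
def statement(sessions, rate_per_gb):
    """Build statement lines and a total, waiving sessions that missed their SLO.

    sessions: list of dicts with keys date, gb, minutes and slo_minutes.
    """
    lines = []
    total = 0.0
    for s in sessions:
        met = s["minutes"] <= s["slo_minutes"]
        charge = round(s["gb"] * rate_per_gb, 2) if met else 0.0
        total += charge
        lines.append((s["date"], s["gb"], met, charge))
    return lines, round(total, 2)

# Invented sessions: the March 1st backup overran its 600 minute Objective.
sessions = [
    {"date": "1998-03-01", "gb": 60.0, "minutes": 650, "slo_minutes": 600},
    {"date": "1998-03-02", "gb": 61.0, "minutes": 580, "slo_minutes": 600},
    {"date": "1998-03-03", "gb": 60.0, "minutes": 590, "slo_minutes": 600},
]
lines, total = statement(sessions, rate_per_gb=2.50)
for date, gb, met, charge in lines:
    print(date, gb, "SLO met" if met else "SLO missed, waived", charge)
print("Total:", total)
```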

 

Summary

The Service Management enhancements to HP OmniBack II version 3.0 provide powerful yet easy to exploit sources of data that can be leveraged into a myriad of Service Management and Operations and Availability planning activities. The data is also highly valuable in the identification and isolation of potential or ongoing issues in the delivery of Data Backup and Recovery services.

As IT service providers continue to evolve to better meet the requirements of their customers and Business, the ability to proactively measure and manage service levels becomes increasingly important. HP OmniBack II version 3.0 in combination with other Service Management tools, such as its companion OpenView product family members, is able to provide robust support of Service Level Monitoring and Management activities, freeing IT staff to focus on enhancing existing services and developing and deploying new applications.

©Copyright 1998 Interex. All rights reserved.