HPWorld 98 & ERP 98 Proceedings

High Availability Management
Windows NT Server

Ric O. Stewart, Sr. Consultant


Hewlett Packard Company
11575 Great Oaks Way
Suite 100, MS 217
Alpharetta, GA 30022
Telephone: 404.648.6656
FAX: 404.648.1599
Email: ric_stewart@hp.com




Introduction

In the world of the Internet and corporate intranets, managing the availability of services across an enterprise to an acceptably high level becomes more complex every day. A factor that increases the complexity substantially is that services are provided on top of infrastructure networked server components, which are typically distributed (networked) throughout a corporate enterprise, yet need to be managed from a central point. As we move into the 21st century, business and information technology (IT) services will need to be delivered with an Information Utility (IU) mindset. One simply needs to review the history of the telecommunications industry to get a good view of the high availability requirements of a utility service provider. It is expected today that no matter the time of day or night, a person will have access to a telephone virtually anyplace in the world (service is pervasive), a dial tone will be heard when the phone is off hook (service is persistent), and the Quality of Service (QoS) of the line will be adequate to carry on an intelligible conversation (service performance). The same expectations are quickly becoming the norm for consumers of business/IT services. The providers will be compelled to respond as an IU.

Definitions

To set the context of high availability management of business/IT services provided on an infrastructure of networked servers (corporate enterprise), a common understanding of several terms must be established:

    • Server

In its purest form, a server is a process that performs service tasks for requestor processes (clients). A basic unit of work for a service task is a transaction. In a more practical sense, a server can be either an entire physical computer system or a computer process running on a physical computer system, both concurrently performing service tasks for many requestors.

    • Transaction

A transaction is a unit of work submitted to a server component for processing by the methods (algorithms) of the component. It is common for a transaction to be submitted by a client component, but just as typically, server components submit transactions to other server components (effectively becoming clients as well as servers). For most practical purposes, a transaction is composed of many steps bracketed by BEGIN TRANSACTION and END TRANSACTION tags. As in the context of relational database management servers, a general transaction in the context of high availability (HA) has the same ACID properties (atomicity, consistency, isolation, durability) required of database server transactions. This, of course, implies that a transaction must be finished in its entirety to be considered complete. Ultimately, HA is determined solely by randomly submitted transactions being successfully completed, with ACID properties intact, during an accepted time interval.
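The all-or-nothing completion requirement described above can be sketched with any transactional store. The example below uses SQLite purely as an illustration (the paper names no product): a transaction that fails mid-flight leaves no partial work behind.

```python
# Minimal sketch of transactional atomicity: an incomplete transaction
# never counts as complete, so its partial work is rolled back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, qty INTEGER)")

try:
    with conn:  # implicit BEGIN TRANSACTION ... END TRANSACTION bracket
        conn.execute("INSERT INTO orders (qty) VALUES (5)")
        raise RuntimeError("failure mid-transaction")  # e.g. server crash
except RuntimeError:
    pass

# The INSERT was rolled back; the table is still empty.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # → 0
```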


    • Server Component

A server component is a compilation of objects working in concert that have identifiers, properties, and methods (algorithms) by which they perform the service tasks requested by clients. Examples of server components are a print spooler queue handler, a SQL database server process, and a retail POS application server process.

    • Networked Server Component

Client requests are connected to server components that are either co-located in the same physical computer system or reside in disparate computer systems, forming a network of client/server components with links between them.

    • Infrastructure

Infrastructure is the networked arrangement of a group of related components (servers and links) into a whole whose parts interact to perform a fundamental task (a service). Therefore, the networked server components defined previously inherently compose an infrastructure.


    • High Availability (HA)

HA is the competency/capacity of a networked server component to complete a requestor’s service task in an acceptable time interval beginning at the random instant it is submitted. High Availability of the component outside this time interval is moot, since the absence of useful work (service tasks) eliminates the necessity for a high availability state condition. In other words, if you measure HA during an interval of time when there is no queue of useful work to be completed, who cares if a networked server component IS or IS NOT available? Of course, for all practical purposes, a computer system has multiple server processes serving multiple clients, and therefore, high availability must be measured continuously.

Discrete points on the continuum of HA are commonly identified as (1) highly resilient, (2) highly available, (3) near-continuous, and (4) continuously available.

Another common way of representing HA levels today is percentage of uptime: (1) 99.9% uptime, or 3-nines, about 8.8 hours of downtime per year; (2) 99.99%, or 4-nines, about 53 minutes of downtime per year; and (3) 99.999%, or 5-nines, about 5.3 minutes of downtime per year. The price of each HA level varies in direct proportion to the degree of high availability achieved. For that reason, most cost-effective HA solutions today consist of a mix of the four levels spread across an enterprise.
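The downtime budget behind each "nines" level follows directly from the number of minutes in a year; a quick sketch:

```python
# Yearly downtime budget implied by each "nines" availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(availability: float) -> float:
    """Maximum minutes of downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)

for label, pct in [("3-nines", 0.999), ("4-nines", 0.9999), ("5-nines", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(pct):.1f} minutes/year")
# → 3-nines: 525.6 minutes/year (about 8.8 hours)
# → 4-nines: 52.6 minutes/year
# → 5-nines: 5.3 minutes/year
```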

    • HA Properties

Critical business/IT services should be mapped to Service Level Objectives (SLO) which will be conceptually represented in this paper as objects of type (class) HA. Each service itself is also an object of type HA and is the parent object of the SLO objects that impact the high availability aspects of the service. The properties of an HA object type are:

    1. Persistence (availability)
    2. Performance (competence and capacity)
    3. Pervasiveness
    4. Service Availability Thread (discussed below).

Persistence, performance, and pervasiveness are considered to be the primary properties of an HA object type. Persistence is the property of a component that ultimately indicates whether or not it is available. However, if the ability to perform (competence) at an acceptable level of service is not maintained, then for all practical purposes, the service is not available. Also, if the component doesn’t have the dynamic capacity to perform a requestor’s task, effectively the component is rendered unavailable. The pervasiveness property is an indicator of whether or not the pre-defined access paths from the client (consumer) to the server component (provider) are available.

The true/false state of these primary properties is what must be measured and controlled in order to maintain any level of high availability. The ultimate goal is that the state of these primary properties be TRUE all of the time (continuously available). An out-of-service condition exists if any of these primary properties are FALSE.

    • Service Availability Thread (SAT)

The one secondary property of an HA object is defined in this paper as the Service Availability Thread (SAT). A SAT is a collection of objects which are measure points at the application, infrastructure, and primitive levels of software, network fabric, and hardware. Basically, the SAT links together all of the measure points that influence the state of the primary properties of the HA object. For a services HA object, the SAT is a collection of all of the SLO objects that must be in a TRUE state for the service to be considered highly available. For each SLO HA object, the SAT is composed of objects from the application, infrastructure, and primitive layers that determine whether or not the SLO is being achieved.

A child object of the SAT property of an SLO object may have a FALSE state and represent a potential service outage rather than an actual one. A component could be operating in a fail-over state, and thus the related SLO and service still show a TRUE state for all of their primary HA properties.
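The HA object model described in the last few sections can be sketched as a small data model. This is an illustrative rendering of the paper's concepts, not code from any product; all class and field names are assumptions:

```python
# Sketch of an HA object: three primary boolean properties plus a SAT of
# child measure points at the application, infrastructure, and primitive
# layers. An out-of-service condition exists if ANY primary property is
# FALSE; a FALSE SAT child is only a POTENTIAL outage.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MeasurePoint:
    name: str           # e.g. an application, infrastructure, or primitive check
    state: bool = True  # FALSE marks a potential (not necessarily actual) outage

@dataclass
class HAObject:
    name: str
    persistence: bool = True    # component is actually available
    performance: bool = True    # competence and capacity are acceptable
    pervasiveness: bool = True  # client-to-server access paths are up
    sat: List[MeasurePoint] = field(default_factory=list)

    def out_of_service(self) -> bool:
        # Any FALSE primary property means an actual service outage.
        return not (self.persistence and self.performance and self.pervasiveness)

    def potential_outage(self) -> bool:
        # A FALSE SAT child (e.g. a failed-over node) flags risk even while
        # all primary properties remain TRUE.
        return any(not mp.state for mp in self.sat)
```

A node operating in a fail-over state would set its SAT measure point FALSE, flagging a potential outage even though the service's primary properties all remain TRUE.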


    • High Availability Management (HAM)

HAM is essentially how a corporation effectuates the measure and control of high availability for its enterprise business/IT services from a centralized management point. There is a long and successful legacy upon which HAM can be built. Over the past few decades, SCADA (System Control and Data Acquisition) and NSM (Network and System Management) have been well developed and provide very capable building blocks for HAM. In the manufacturing sector, SCADA has been very successfully used to measure and control the high availability of instruments and processes in facilities such as petrochemical refineries. For over a decade in more general purpose computing system environments which span local area and wide area networks, corporations have been using NSM to very successfully measure and control server and network components. However, at the services level within the corporate intranet or Internet environments, the correlation of all the events that occur at the infrastructure and primitive levels with the impact on the consumer of a business/IT service is still in its early developmental stage. The high availability management of business/IT services is in need of advanced development over the next several years. To accomplish true services HAM, there must be a thread that ties a service together with the networked server components that influence the availability of the service. Of course, this is the Service Availability Thread discussed in the previous paragraph.

With the explosion of E-business and the information revolution in general, the world’s need for business/IT services availability management has never been greater than it is today. The world has grown very dependent on infrastructure networked server components dispersed across LANs/WANs, and consumers’ expectations for high availability of services are at their peak – i.e., randomly submitted transactions must be acceptably completed any time of day or night, 24x7x365. Therefore, the management of the availability of business/IT services is becoming virtually as important as the management of the networked server components themselves.


Design Considerations


In order to design an effective high availability management system, the following elements must be considered:

    1. Alignment of business/IT services with critical networked server components that influence their availability. Inherent in this process is the mapping of business requirements to IT services (functional requirements) and then to SLOs with their associated levels of availability. This alignment is imperative to produce cost-effective total HA solutions for the enterprise.
    2. Instrumentation of the measure and control of the state of the primary and secondary properties of a service HA object. For example, the Windows NT Server perflib object System: Processor Queue Length is an instantiation of the System object class with a property of Processor Queue Length. This object could be instrumented as a child object to an SLO HA object (e.g., process 30 orders per hour) with the assumption that the SLO is at risk if the Processor Queue Length exceeds "2 x number of processors" for a period of 10 elapsed minutes. By instrumenting an associated HA object method (algorithm) that is triggered when the state of the performance property is FALSE, a scan could be performed that would locate known ill-behaved processes that are not business critical and terminate them. In this manner, HAM would protect the mission critical order processing application by measuring and controlling the state of SLO properties.
    3. Thorough investigation into the instrumentation of the measure and control of resource scalability – static and dynamic. For a networked server component to be highly available, it must have the capacity property to acceptably complete randomly submitted transactions, no matter what the instantaneous demand. This, of course, implies dynamic scalability of components, which can be influenced by HAM and NSM to the extent of the capabilities of the HA platform.
    4. Instrumentation of the measure and control of the restoration of critical components to a normal operational state. This presumes a component has failed, and HAM’s role is to measure and control the support systems available for restoring the component back to its normal operational state (e.g., file is corrupted, where is the backup tape located, what support entity mounts the tape, and what is the total elapsed time to recover the file to its normal state).
    5. Instrumentation of periodic HAM reports showing availability at the services level. This presumes that data is being collected at each of the SAT measure points over time and at regular intervals. Too much and too fine a granularity of measurement data is virtually worthless. HAM’s role is to analyze HA property measurements over time intervals and then to reduce the data for meaningful capacity planning and performance trending graphs and reports. Exception reporting is the ideal approach to providing a meaningful look at the big picture – i.e., red/yellow/green warning indicators can be shown for each SLO and a drill-down capability can be provided for viewing of application, infrastructure, and primitives of the SAT associated with the SLO. With this approach, a red indicator at the SLO level would show that a service is in jeopardy, and the drill-down will provide insight into the infrastructure networked component that is really putting the service at risk.
    6. Instrumentation of on-line access to a historical measure/event store for troubleshooting and ad-hoc reporting purposes. To facilitate day-to-day high availability management of business/IT services, there will often be a need to review events and availability data before the standard periodic HAM reports arrive. Therefore, it is imperative that this capability be instrumented as part of the overall HAM solution. An example of this need might be a consumer calling the customer service department to complain of a service outage when the provider had no prior indication that an outage had occurred. On-line access to the event/message store database would facilitate the efficient handling and resolution of such an instance.
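Design element 2 above amounts to a threshold-over-window test. A minimal sketch follows, with counter sampling and process termination left to the caller; the constants mirror the example ("2 x number of processors" for 10 minutes), and everything else is an assumption:

```python
# Flag the SLO performance property FALSE when the Processor Queue Length
# stays above 2 x number of processors for 10 elapsed minutes. A real
# implementation would read the NT perflib counter each interval and, on a
# breach, scan for and terminate known ill-behaved non-critical processes.
NUM_PROCESSORS = 4                # assumed machine size
THRESHOLD = 2 * NUM_PROCESSORS    # "2 x number of processors"
WINDOW_SECONDS = 10 * 60          # 10 elapsed minutes

def performance_property_false(samples, sample_interval=60):
    """samples: queue-depth readings taken sample_interval seconds apart.
    Returns True once the queue has stayed above THRESHOLD for the full
    window -- the trigger for the corrective HA object method."""
    breach_seconds = 0
    for depth in samples:
        breach_seconds = breach_seconds + sample_interval if depth > THRESHOLD else 0
        if breach_seconds >= WINDOW_SECONDS:
            return True
    return False
```

Ten consecutive one-minute samples above the threshold trip the trigger; any recovered sample resets the clock.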


Methodology

Now that the overall contextual framework has been established for HAM, a look at a methodological approach to designing and implementing such a strategy is in order. Inherent to the methodology is one primary element, which is that the approach taken must first and foremost produce a services focused solution. A services focused solution ultimately is one that assumes a service is available if and only if the consumer confirms it. Therefore, the perspective of a services focused solution is from the top down – i.e., from where the service touches the consumer down to the infrastructure networked server components that are managed by the provider. The danger of using a bottom-up approach, which is from the provider components up to where the service touches the consumer, is that individual primitive components may be available and yet, due to other factors, a related service is not.

Listed are the basic phases of the recommended methodology:

  1. Assessment
  2. Design and Build
  3. Service Level Agreement
  4. Deploy
  5. Operate

Only the first three (3) phases will be discussed in detail in this paper. They have general application across all implementations of this solution.


Assessment

During the assessment phase, several tasks must be completed:


    1. Analyze business requirements. Of course, there must be active involvement from all the affected business departments of the corporation in order to determine a true picture of what services must be provided to the consumer at what HA levels. Also, this phase should include a Risk Analysis to determine if the return on the high availability investment (ROI) is adequate for the business model under which the corporation is operating.
    2. Align business requirements with service descriptions. Once the business services are determined for the corporation, then further effort must be taken by the IT department (provider) to produce a clear and concise written description of each service with associated levels of high availability. All affected business departments of the corporation must then approve this written description.
    3. For each of the business services offered, define high availability service level objectives (SLO) associated with the service. Of course, it is very likely that multiple SLOs will be associated with each of the services provided. The SLOs should clearly establish what actually determines whether or not a service is available. For instance, is it sufficient that attempts to access a consumer’s email mailbox are successful 99% of the time, or must it also be true that the response time when reading email must be 2 seconds or less?
    4. Once business services and associated SLOs are defined, determine the IT functional requirements necessary to achieve such service levels. This presumes that a Gap Analysis between the current IT infrastructure and the one necessary to achieve the desired highly available services will be performed.
    5. The beginnings of a high level HAM design will take shape as this assessment phase is carried out. This will be input to the next phase.
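The email question in assessment step 3 – is a 99% access success rate enough, or must response time also be bounded? – describes a compound SLO in which both targets must hold. A sketch, with all names and thresholds illustrative:

```python
# A compound SLO is met only when BOTH the availability target and the
# response-time target hold; either one failing makes the service "not
# available" in SLO terms.
def email_slo_met(attempts, successes, response_times_sec,
                  min_success_rate=0.99, max_response_sec=2.0):
    """TRUE only if at least 99% of mailbox accesses succeed AND every
    sampled email read completes in 2 seconds or less."""
    success_rate = successes / attempts
    within_time = all(t <= max_response_sec for t in response_times_sec)
    return success_rate >= min_success_rate and within_time
```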

Design and Build

Tasks associated with this phase are:

    1. During this phase the high level HAM design evolves into and is documented in an Architecture Detail Design Document. This document is the result of investigating the SLOs against the industry HAM technology products that can be selected to perform the required HA property measure and control functions. Once the overall solution is architected and the technology products are selected, a total cost of ownership (TCO) should be produced. The TCO is used in conjunction with the Risk Analysis from the assessment phase to produce the ROI report.
    2. One of the most critical aspects of the technical detail design is that a Service Availability Thread (SAT) be defined for each of the SLOs. This consists of searching for measure points at each layer of the service model – i.e., application, infrastructure (middleware), primitives (OS), hardware, and network. For example, if an SLO is that "the order processing application server component be available 99.9% of the time," then one measure point would be that the application server component must be running. If the application accesses a database server on another system in the network, then a primitive layer measure point would be that IP is responsive on the network interface. All of the measure points become properties of the SLO SAT.

      Once the SAT measure points are identified, then it is imperative that the HAM technology products selected be capable of providing intelligent scripting features necessary to measure and control them. Much of the design and build time for HAM solutions should be spent in developing this intelligence. This will assure that correlated measures and resultant actions will be effective in how they influence the availability of the services being provided to the consumer.

    3. Once the detailed design with TCO is available, a prototype of the HAM solution should be built. It is imperative that the prototype solution directly mirrors all aspects of the production environment for which the solution is being built.
    4. Once the prototype is built, thorough validation tests must be devised to test that the HAM solution is able to measure and control each of the SLO HA objects defined for the business/IT services being offered to the consumer. It is important that the tests include induced faults at the various measure points of the SATs so that the "intelligence" of the solution can be validated – i.e., measures are being taken and corrective actions are occurring.
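The SAT construction in task 2 can be sketched as a set of named check functions whose conjunction yields the SLO state. The TCP-connect probe stands in for a real reachability check, and every host, port, and check name here is an assumption:

```python
# Sketch of evaluating a Service Availability Thread: each measure point is
# a named check; the SLO is TRUE only when every measure point is TRUE.
import socket

def ip_responsive(host, port, timeout=2.0):
    """Primitive-layer measure point: is the remote host accepting TCP
    connections? (A stand-in for an ICMP ping, which needs raw sockets.)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def evaluate_sat(measure_points):
    """Run every measure point in an SLO's SAT and report per-point state
    plus the overall SLO state."""
    states = {name: check() for name, check in measure_points.items()}
    states["SLO"] = all(states.values())
    return states

# Example wiring for the order-processing SLO (stubs and names illustrative):
order_sat = {
    "app server process running": lambda: True,   # stub: process-table check
    "database host IP responsive": lambda: ip_responsive("db-host.example.com", 1433),
}
```

A drill-down report falls out naturally: the per-point states show which application, infrastructure, or primitive measure point put the SLO at risk.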


Service Level Agreement

The general tasks of this phase are:

    1. Revisit the SLOs and make sure there is a clear understanding between the provider and consumer concerning the details of the high availability SLOs. Make sure this understanding is clearly articulated in a Service Level Agreement (SLA).
    2. As with any agreement between any two parties, there are roles and responsibilities for each. The consumer’s role is to create the market demand and give feedback to the provider indicating whether or not service level expectations are being met. Of course, the provider has the responsibilities of (1) making sure the services offered are aligned with the market demand, (2) making sure the IT infrastructure provided is capable of providing the services in a highly available fashion, and (3) managing the infrastructure to the availability level agreed to in the SLA. Inherently, the provider also assumes the major portion of the risk, but also has the prospects of high reward.
    3. Establish guidelines for accountability and consequences/remedies. The ultimate goal of this task is to develop an accountability framework that will (1) allow the consumer to be accountable for providing timely feedback to the provider if service levels are not being met, and (2) allow the provider time to react to the feedback by offering corrective remedies prior to reaching a "terminate with cause" point. Of course, if remedies are not provided in a timely fashion, then clearly defined consequences must be defined in the SLA.
    4. Release the solution to deployment. Never release the solution to deployment without an SLA in place. This is such an obvious tenet, yet one that is often ignored.


Summary

Just as a ham hock in the culinary world tremendously enhances the flavor of a pot of black-eyed peas on New Year’s Day, so does HAM enhance the "flavor" (predominant quality) of highly available services being provided to consumers in a corporate enterprise. Investing money to make a mission critical application highly available in the enterprise is money well spent only if the investment also includes the management of the HA properties of the services being provided by the application to the consumer. HAM ensures that primary HA properties are, in fact, maintained in a TRUE state by continuously measuring and controlling that state in real time. For example, in a two-node cluster a mission critical application is running in Single Point of Failure (SPOF) mode whenever the cluster node on which it is executing fails over. If the fail-over event goes undetected, the application may not be returned to redundancy mode before a second failure causes a catastrophic service outage. At the services level, HAM will determine whether the single event correlates to a service outage and will initiate corrective action through either dynamic control or console alerts. This is what HAM is all about – measuring and controlling events both in real time and over time (trends) that affect the availability of business/IT services.


©Copyright 1998 Interex. All rights reserved.