Introduction
In the world of the Internet and corporate intranets, managing the availability of services across an enterprise to an acceptably high level grows more complex every day. A major contributor to this complexity is that services are provided on top of infrastructure networked server components, which are typically distributed (networked) throughout a corporate enterprise yet need to be managed from a central point. As we move into the 21st century, business and information technology (IT) services will need to be delivered with an Information Utility (IU) mindset. One need only review the history of the telecommunications industry to appreciate the high availability requirements of a utility service provider. It is expected today that, no matter the time of day or night, a person will have access to a telephone virtually anyplace in the world (service is pervasive), a dial tone will be heard when the phone is off hook (service is persistent), and the Quality of Service (QoS) of the line will be adequate to carry on an intelligible conversation (service performance). The same expectations are quickly becoming the norm for consumers of business/IT services, and the providers will be compelled to respond as an IU.
Definitions
To set the context of high availability management of business/IT services provided on an infrastructure of networked servers (corporate enterprise), a common understanding of several terms must be established:
In its purest form, a server is a process that performs service tasks for requestor processes (clients). A basic unit of work for a service task is a transaction. In a more practical sense, a server can be either an entire physical computer system or a computer process running on a physical computer system, in both cases concurrently performing service tasks for many requestors.
A transaction is a unit of work submitted to a server component for processing by the methods (algorithms) of the component. It is common for a transaction to be submitted by a client component, but just as typically, server components submit transactions to other server components (effectively becoming a client as well as a server). For most practical purposes, a transaction is composed of many steps bracketed by BEGIN TRANSACTION and END TRANSACTION tags. As in the context of relational database management servers, a general transaction in the context of high availability (HA) will have the same ACID (atomicity, consistency, isolation, and durability) properties required by database server transactions. This, of course, implies that a transaction must be finished in its entirety to be considered complete. Ultimately, HA is solely determined by randomly submitted transactions being successfully completed with ACID properties intact during an accepted time interval.
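The all-or-nothing character of a transaction can be illustrated with a minimal sketch using Python's built-in sqlite3 module; the table and account values here are hypothetical illustrations, not drawn from the text above.

```python
import sqlite3

# Set up a hypothetical in-memory database with two account rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # The "with conn" block brackets the steps like BEGIN/END TRANSACTION:
    # it commits on success and rolls back on any exception (atomicity).
    with conn:
        conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The partial update was rolled back; the transaction never "completed",
# so the database state is unchanged.
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'a': 100, 'b': 0}
```

The incomplete transaction leaves no trace, which is exactly why only transactions completed in their entirety count toward availability.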
A server component is a compilation of objects working in concert that have identifiers, properties, and methods (algorithms) by which they perform the service tasks requested by clients. Examples of server components are a print spooler queue handler, a SQL database server process, and a retail POS application server process.
Client requests are connected to server components either co-located in the same physical computer system or residing in disparate computer systems, thereby forming a network composed of client/server components with links between them.
Infrastructure is the arrangement (networked) of a group of related components (servers and links) to form a whole with parts interacting to perform a fundamental task (service). Therefore, the networked server components defined previously inherently compose an infrastructure.
HA is the competency/capacity of a networked server component to complete a requestor’s service task within an acceptable time interval beginning at the random instant it is submitted. High availability of the component outside this time interval is moot, since the absence of useful work (service tasks) eliminates the necessity for a high availability state condition. In other words, if HA is measured during an interval of time when there is no queue of useful work to be completed, who cares whether a networked server component IS or IS NOT available? Of course, for all practical purposes, a computer system has multiple server processes serving multiple clients, and therefore, high availability must be measured continuously.
Discrete points on the continuum of HA are commonly identified as (1) highly resilient, (2) highly available, (3) near-continuous, and (4) continuously available.
Another common way of representing HA levels today is percentage of uptime: (1) 99.9% uptime, or 3-nines, roughly 8.8 hours of downtime per year, (2) 99.99%, or 4-nines, roughly 53 minutes of downtime per year, and (3) 99.999%, or 5-nines, roughly 5.3 minutes of downtime per year. The price of each of the HA levels varies in direct proportion to the degree of high availability achieved. For that reason, most cost-effective HA solutions today consist of a mix of the four availability levels identified above spread across an enterprise.
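The downtime budget implied by each "nines" level is simple arithmetic; a quick sketch, assuming a 365-day (8,760-hour) year:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_per_year(uptime_pct: float) -> float:
    """Maximum hours of downtime per year at the given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}% uptime -> {hours:.2f} h/yr ({hours * 60:.1f} min/yr)")
```

This yields about 8.76 hours per year for 3-nines, 52.6 minutes for 4-nines, and 5.26 minutes for 5-nines.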
Critical business/IT services should be mapped to Service Level Objectives (SLOs), which will be conceptually represented in this paper as objects of type (class) HA. Each service itself is also an object of type HA and is the parent object of the SLO objects that impact the high availability aspects of the service. The properties of an HA object type are described below.
Persistence, performance, and pervasiveness are considered to be the primary properties of an HA object type. Persistence is the property of a component that ultimately indicates whether or not it is available. However, if the ability to perform (competence) at an acceptable level of service is not maintained, then for all practical purposes, the service is not available. Likewise, if the component lacks the dynamic capacity to perform a requestor’s task, it is effectively rendered unavailable. The pervasiveness property is an indicator of whether or not the pre-defined access paths from the client (consumer) to the server component (provider) are available.
The true/false state of these primary properties is what must be measured and controlled in order to maintain any level of high availability. The ultimate goal is that the state of these primary properties be TRUE all of the time (continuously available). An out-of-service condition exists if any of these primary properties are FALSE.
The one secondary property of an HA object is defined in this paper as the Service Availability Thread (SAT). A SAT is a collection of objects that serve as measure points at the application, infrastructure, and primitive levels of software, network fabric, and hardware. In essence, the SAT links together all of the measure points that influence the state of the primary properties of the HA object. For a service’s HA object, the SAT is the collection of all the SLO objects that must be in a TRUE state for the service to be considered highly available. For each SLO HA object, the SAT is composed of objects from the application, infrastructure, and primitive layers that determine whether or not the SLO is being achieved.
A child object of the SAT property of an SLO object may have a FALSE state and represent a potential service outage rather than an actual one. A component could be operating in a fail-over state, and thus the related SLO and service still show a TRUE state for all of their primary HA properties.
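The HA object model described above can be sketched as a small Python class; the class, attribute, and object names here are hypothetical illustrations of the paper's concepts, not a prescribed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class HAObject:
    """An HA object: three primary boolean properties plus a SAT of children."""
    name: str
    persistence: bool = True    # is the component available at all?
    performance: bool = True    # is it performing at an acceptable level?
    pervasiveness: bool = True  # are the client-to-server access paths up?
    sat: list["HAObject"] = field(default_factory=list)  # child measure points

    def in_service(self) -> bool:
        # An out-of-service condition exists if ANY primary property is FALSE.
        return self.persistence and self.performance and self.pervasiveness

# Fail-over scenario: one SAT child is FALSE (a potential outage), yet the
# SLO's own primary properties remain TRUE (no actual service outage).
node_a = HAObject("node-a", persistence=False)  # failed cluster node
node_b = HAObject("node-b")                     # surviving node carrying the load
slo = HAObject("order-entry SLO", sat=[node_a, node_b])

print(slo.in_service())                          # True: service still up
print(any(not c.in_service() for c in slo.sat))  # True: potential outage in SAT
```

The distinction the model captures is exactly the one above: a FALSE SAT child signals a potential outage that warrants corrective action, even while the service itself remains available.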
High Availability Management (HAM) is essentially how a corporation effects the measurement and control of high availability for its enterprise business/IT services from a centralized management point. There is a long and successful legacy upon which HAM can be built. Over the past few decades, SCADA (Supervisory Control and Data Acquisition) and NSM (Network and System Management) have been well developed and provide very capable building blocks for HAM. In the manufacturing sector, SCADA has been used very successfully to measure and control the high availability of instruments and processes in facilities such as petrochemical refineries. For over a decade, in more general-purpose computing environments spanning local and wide area networks, corporations have used NSM to measure and control server and network components very successfully. At the services level, however, within corporate intranet or Internet environments, correlating all the events that occur at the infrastructure and primitive levels with their impact on the consumer of a business/IT service is still in an early developmental stage. The high availability management of business/IT services will need advanced development over the next several years. To accomplish true services HAM, there must be a thread that ties a service together with the networked server components that influence the availability of the service. That thread is, of course, the Service Availability Thread discussed in the previous paragraph.
With the explosion of E-business and the information revolution in general, the world’s need for business/IT services availability management has never been greater than it is today. The world has grown very dependent on infrastructure networked server components dispersed across LANs and WANs, and consumers’ expectations for high availability of services have peaked – i.e., randomly submitted transactions must be acceptably completed any time of day or night, 24x7x365. Therefore, the management of the availability of business/IT services is becoming virtually as important as the management of the networked server components themselves.
Design Considerations
In order to design an effective high availability management system, the following elements must be considered:
Methodology
Now that the overall contextual framework has been established for HAM, a look at a methodological approach to designing and implementing such a strategy is in order. Inherent to the methodology is one primary element: the approach taken must first and foremost produce a services-focused solution. A services-focused solution is ultimately one that assumes a service is available if and only if the consumer confirms it. Therefore, the perspective of a services-focused solution is top-down – i.e., from where the service touches the consumer down to the infrastructure networked server components that are managed by the provider. The danger of a bottom-up approach, which works from the provider components up to where the service touches the consumer, is that individual primitive components may be available and yet, due to other factors, a related service is not.
Listed are the basic phases of the recommended methodology:
Only the first three phases will be discussed in detail in this paper; they have general application across all implementations of this solution.
Assessment
During the assessment phase, several tasks must be completed:
Design and Build
Tasks associated with this phase are:
One of the most critical aspects of the technical detail design is that a Service Availability Thread (SAT) be defined for each of the SLOs. This consists of searching for measure points at each layer of the service model – i.e., application, infrastructure (middleware), primitives (OS), hardware, and network. For example, if an SLO is that "the order processing application server component be available 99.9% of the time," then one measure point would be that the application server component must be running. If the application accesses a database server on another system in the network, then a primitive layer measure point would be that IP is responsive on the network interface. All of the measure points become properties of the SLO SAT.
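The two measure points named in the example above can be sketched as simple probes; the process name, database host, and port below are illustrative assumptions, not details from the original.

```python
import socket
import subprocess

def app_server_running(process_name: str) -> bool:
    """Application-layer measure point: is the server process running?"""
    # pgrep exits 0 if at least one matching process exists.
    result = subprocess.run(["pgrep", "-f", process_name], capture_output=True)
    return result.returncode == 0

def ip_responsive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Primitive-layer measure point: is IP responsive on the remote interface
    (approximated here by attempting a TCP connection)?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The SLO's SAT is then simply the collection of its measure points.
sat_measure_points = {
    "order_app_running": lambda: app_server_running("order_server"),
    "db_host_reachable": lambda: ip_responsive("db01.example.com", 1433),
}
```

Each probe returns the TRUE/FALSE state of one measure point; evaluating the whole collection yields the state of the SLO's SAT.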
Once the SAT measure points are identified, it is imperative that the HAM technology products selected provide the intelligent scripting features necessary to measure and control them. Much of the design and build time for a HAM solution should be spent developing this intelligence. This will ensure that correlated measurements and resultant actions are effective in how they influence the availability of the services being provided to the consumer.
Service Level Agreement
The general tasks of this phase are:
Summary
Just as a ham hock in the culinary world tremendously enhances the flavor of a pot of black-eyed peas on New Year’s Day, so does HAM enhance the "flavor" (predominant quality) of highly available services being provided to consumers in a corporate enterprise. Investing money to make a mission-critical application highly available in the enterprise is money well spent only if the investment also includes the management of the HA properties of the services being provided by the application to the consumer. HAM ensures that the primary HA properties are, in fact, maintained in a TRUE state by continuously measuring and controlling that state in real time. For example, in a two-node cluster, a mission-critical application runs in Single Point of Failure (SPOF) mode whenever the cluster node on which it is executing fails over. If the fail-over event goes undetected, the application may not be returned to redundancy mode before a second failure causes a catastrophic service outage. At the services level, HAM will determine whether the single event correlates to a service outage and will initiate corrective action through either dynamic control or console alerts. This is what HAM is all about: measuring and controlling events, both in real time and over time (trends), that affect the availability of business/IT services.