HP World 98 Presentation

Choosing the Right Availability Solution

Introduction

For companies that require the highest levels of availability for their critical applications, there are two architectural choices offered in the market: HA clusters and fault-tolerant systems. Each customer and each application has a unique set of technical characteristics, unique availability requirements and a unique operational environment that will favor one solution over another. In large, enterprise-scale environments a combination of HA clusters and fault-tolerant systems can provide additional benefits that cannot be achieved with a single architecture alone.

Who Needs Availability?

In the past, only a few applications were considered critical enough to require high availability. These were usually found in large companies in selected industries such as financial services and telecommunications. More recently, as the cost of computing has come down and as new technologies such as the Internet have emerged, businesses have introduced many new applications that would not have been possible even a year or two ago. These new applications have improved productivity, provided better information to decision-makers and increased the speed of business transactions. But they have also made companies much more dependent on their computer systems. More businesses than ever before cannot operate effectively if key applications are down. Fortunately, the choices for high-availability systems are increasing and the costs for these systems are coming down.

Defining Availability

Availability is typically stated as a percentage of uptime. Based on continuous system operation of 24 hours per day, 7 days per week, 365 days a year, the following chart relates availability percentages to the amount of downtime per year.

Availability	Downtime per Year
99%	87 hours, 36 minutes
99.5%	43 hours, 48 minutes
99.95%	4 hours, 23 minutes
99.99%	53 minutes
99.999%	5 minutes
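
The figures in the table follow directly from the stated assumption of continuous operation over a 365-day year. As a minimal sketch of that arithmetic (the function name and the rounding to the nearest minute are choices made here for illustration, not part of the original presentation):

```python
# Minutes in a 365-day year of continuous (24x7) operation.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_per_year(availability_percent):
    """Return (hours, minutes) of downtime per year, rounded to the nearest minute."""
    down_fraction = 1 - availability_percent / 100
    total_minutes = round(down_fraction * MINUTES_PER_YEAR)
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    for pct in (99, 99.5, 99.95, 99.99, 99.999):
        hours, minutes = downtime_per_year(pct)
        print(f"{pct}%: {hours} hours, {minutes} minutes")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the step from 99.99% to 99.999% leaves only about five minutes per year for all failures and maintenance combined.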

The need for very high levels of availability is not limited to 24x7 environments. Many applications must be available during normal business hours or for a critical time period within the day. Any system failure within these critical periods is unacceptable for many users.

Fault-Tolerant Systems

Stratus and Tandem are the primary vendors of fault-tolerant computer systems. Fault-tolerant computers provide a fully-replicated hardware design that allows uninterrupted operation in the event of a component failure. With Stratus systems, there is no recovery time or performance loss when a failure occurs. All memory contents and all disk-based information are preserved. Stratus’ self-checking hardware guarantees that even transient errors will be detected before any data integrity problems can occur. Built-in diagnostics, isolation of errors to a failed component, remote service capabilities and user-replaceable components allow Stratus to be used in remote, lights-out environments, including sites that are very distant from local service offices.

Stratus Continuum systems are based on the HP PA-RISC microprocessor family and run a completely standard version of the HP-UX operating system. Stratus HP-UX systems are fully ABI and API compatible with HP-9000 servers and can run both HP and third-party shrink-wrapped software without modification. Standard HP-9000 system administration procedures and tools are also used to administer Stratus fault-tolerant systems.

High-Availability Clusters

The list of vendors offering high-availability clusters includes Hewlett-Packard, IBM, Sun, Digital and Microsoft. High-availability clusters, such as MC/ServiceGuard, allow multiple servers, in conjunction with shared disk storage units, to quickly recover from failures. Application processing and access to disk-based data are typically restored within several minutes, although recovery times will vary depending upon specific characteristics of the application and system configuration. Software upgrades and other hardware or software maintenance operations can be done with minimal disruption by migrating applications from one node in a cluster to another. MC/ServiceGuard clusters support a broad range of hardware platforms from the entry-level D-class to the high-end V-class.

Comparing Clusters and Fault-Tolerant Systems

HA clusters and fault-tolerant systems each have characteristics that make them more suitable for some applications and less suitable for others. The following table describes the most important characteristics of the two architectures.

	HA Cluster	FT System
Recovery Time	Seconds to minutes	Milliseconds
Recovery Model	Disk/transaction recovery	Flexible – maintains disk, comm connections and memory state
Critical Data Protected	Disk only	Memory and disk
Implementation	Script development and testing	No work required
Operations	Multi-system cluster	Single system
Product Range	Low-end to high-end	Mid-range

Application Tiers and Availability

Most applications today use a multi-tier architecture based on some type of client-server model. For availability purposes, it is helpful to view the application in three layers: database services, application services and communication services. The availability issues for each of these layers are very different. At a large central site, application layers are often spread over several servers. In these cases, it is possible to provide a distinct availability solution for each application layer. For smaller installations, including remote sites, it is necessary to pick a single solution that best fits the needs of all three layers. Figure 1 below shows a typical, multi-tier architecture in an enterprise setting that includes both central and remote sites.

Figure 1 – Multi-Tier Enterprise Architecture

Database Services

The database/transaction services layer usually requires a large, central database that dictates a high-end, scalable server system to support high transaction activity and large amounts of storage capacity. Capacity and scaling requirements will usually favor a K-, T- or V-class HP server solution. Issues around recovery are simplest for this layer, since the persistence, data integrity and recoverability of application data are standard features of all modern database and transaction software. MC/ServiceGuard, combined with appropriate database and transaction software, provides a strong availability solution for the database/transaction services layer.

Application Services

The application services layer functions include user interface processing, transaction capture, transaction sequencing, computational processing, local searching and sorting, statistics generation, logging – or anything else not covered by the database or communication services layers. Specific functions will vary widely from one application to another, so the best availability solution for this layer depends on the specific application requirements. An HP cluster or multiple, standard HP servers are a good choice for applications with no critical, memory-based state information or in-memory database. Stratus fault-tolerant servers are a good choice when it is important to protect application memory state or an in-memory database. Stratus-based application services can also provide stand-in processing where applications must provide continuous client service even during periods when the back-end database service is unavailable due to failover/recovery delays.

Communications Services

The communications services layer includes control of the physical LAN and WAN connections along with higher-level communications and messaging functions such as message routing, message store-and-forward and protocol/data conversion.

Communications services and related application functions depend upon a large amount of memory-resident data. This data includes the status of all communication connections, messages received but not yet forwarded, intermediate results of multi-host transactions, and the state of multi-step user dialogs. Stratus’ ability to preserve memory across failures protects this critical data and thus provides a greatly simplified availability solution for communications and messaging services.

Providing equivalent protection in an HA cluster architecture is possible, but requires much more time and effort, and often results in design changes that greatly impact performance. There are no generic solutions for reliable communication services equivalent, for example, to the reliability that database management software offers for database transaction services. Additional challenges arise from today’s complex, heterogeneous communications environments that combine open and legacy protocols, servers with multiple operating systems, and a range of standard and specialized devices. In this environment, nobody can control all elements of the communication environment, making integrated end-to-end solutions impossible. Without total control of all the end-points in a communication network, a reliable control point must be established. The Stratus fault-tolerant architecture is a natural fit for this reliable point of communication control.

A Stratus fault-tolerant communications server can also provide continuous availability for higher-level communications services including message routing, message store-and-forward, and protocol conversion. These services allow a front-end to route messages among multiple back-end systems or networks, to store and later submit transactions if the back-end system is temporarily unavailable, or to translate messages between different legacy or standard protocols while introducing minimal delay. Running these services on a fault-tolerant server guarantees service availability as well as reliable message delivery combined with very high performance using a memory-based design. A design based on logging each message to disk can provide high availability, but introduces a performance cost and message delay that will be unacceptable in many situations.
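
The store-and-forward behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, the use of ConnectionError to signal an unavailable back-end, and the in-process queue are all hypothetical and do not correspond to any Stratus or HP product interface.

```python
from collections import deque


class StoreAndForwardFrontEnd:
    """Hypothetical front-end that queues messages in memory while the back-end is down."""

    def __init__(self, send_to_backend):
        # send_to_backend is assumed to raise ConnectionError when the back-end is unavailable.
        self._send = send_to_backend
        # Messages received but not yet forwarded live only in memory.
        self._pending = deque()

    def submit(self, message):
        """Accept a message from a client and try to forward everything queued so far."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Forward queued messages in arrival order; stop at the first failure.

        Returns the number of messages forwarded by this call.
        """
        forwarded = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # back-end still unavailable; keep the message queued
            self._pending.popleft()
            forwarded += 1
        return forwarded

    def pending_count(self):
        return len(self._pending)


class FakeBackEnd:
    """Stand-in for a back-end host, used only to exercise the sketch."""

    def __init__(self):
        self.available = True
        self.received = []

    def receive(self, message):
        if not self.available:
            raise ConnectionError("back-end unavailable")
        self.received.append(message)
```

Because the pending queue exists only in process memory, this design is fast but loses queued messages if the server's memory is lost. That is the trade-off the paragraph above describes: a memory-preserving platform lets the queue stay in memory, whereas a disk-logging variant would write each message to stable storage before acknowledging it, at the cost of added delay.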

Combining HA Clusters with Fault-Tolerant Servers

An architecture that combines HP and Stratus servers in a front-end/back-end configuration provides customers with the best combination of performance, flexibility, scalability and availability. HP servers, combined with MC/ServiceGuard software, provide a robust, scalable back-end database service. Stratus fault-tolerant servers provide a continuously available front-end communication service. Application services can run on the back-end, on the front-end, or be split between them, depending on the specific application requirements. In some cases, the application services layer may warrant a separate set of server systems that could be either HP or Stratus, again depending upon the particular application environment and availability needs. A key feature of Stratus and HP systems is that they can be "mixed and matched"; their common HP-UX operating environment supports seamless interoperability of middleware and applications in a combined architecture.

This front-end/back-end architecture is actually very similar to the traditional mainframe architecture that has supported enterprise applications for over 30 years. Mainframes have used an intelligent communications controller to offload communications processing from the host (much of this communication is IBM SNA) and to allow routing of transactions among multiple hosts, providing both higher availability and load sharing. The combination of HP and Stratus systems brings the benefits of this traditional architecture to the world of open systems. Stratus systems, which support both open and legacy communications, can provide a reliable bridge from the proprietary world to the open systems world.

Departmental Applications and Remote Sites

Departmental applications and remote sites typically run on a single server or small cluster. Often these applications are replicated at many different sites within a single company. Deciding between an HA cluster and a fault-tolerant system depends on how the particular characteristics of the application match the relative benefits of the two solutions.

An HA cluster is a good choice if:

Database services are the predominant part of the application
The recovery model is transaction-based
Scalability beyond a mid-range system is needed
Application can accommodate seconds to minutes of recovery time
Operations staff is available for cluster management

A fault-tolerant system is a good choice if:

Communications services or in-memory data are the predominant part of the application
The recovery model relies on in-memory data or application state
Remote site requires lights-out operation
Application requires sub-second recovery time
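
The two checklists above can be treated as a simple site profile. The sketch below only records which criteria from each list apply to a given application; the field names are hypothetical labels for the list items, and an unweighted tally is an assumption made here for illustration, not a selection rule stated in the presentation.

```python
from dataclasses import dataclass

# Field names mirroring the "HA cluster" checklist items.
CLUSTER_CRITERIA = (
    "database_services_predominant",
    "transaction_based_recovery",
    "needs_scaling_beyond_mid_range",
    "tolerates_seconds_to_minutes_recovery",
    "cluster_operations_staff_available",
)

# Field names mirroring the "fault-tolerant system" checklist items.
FT_CRITERIA = (
    "communications_or_in_memory_predominant",
    "recovery_relies_on_memory_state",
    "remote_lights_out_site",
    "needs_sub_second_recovery",
)


@dataclass
class ApplicationProfile:
    database_services_predominant: bool = False
    transaction_based_recovery: bool = False
    needs_scaling_beyond_mid_range: bool = False
    tolerates_seconds_to_minutes_recovery: bool = False
    cluster_operations_staff_available: bool = False
    communications_or_in_memory_predominant: bool = False
    recovery_relies_on_memory_state: bool = False
    remote_lights_out_site: bool = False
    needs_sub_second_recovery: bool = False


def matched_criteria(profile):
    """Return (cluster_matches, ft_matches): the criterion names that apply from each list."""
    cluster = [name for name in CLUSTER_CRITERIA if getattr(profile, name)]
    ft = [name for name in FT_CRITERIA if getattr(profile, name)]
    return cluster, ft


if __name__ == "__main__":
    # Hypothetical unattended branch site running a messaging application.
    branch = ApplicationProfile(
        communications_or_in_memory_predominant=True,
        remote_lights_out_site=True,
        needs_sub_second_recovery=True,
    )
    cluster, ft = matched_criteria(branch)
    print(f"cluster criteria matched: {len(cluster)}, fault-tolerant criteria matched: {len(ft)}")
```

Listing the matched criteria, rather than returning a single verdict, keeps the judgment with the reader: in practice one criterion, such as a hard requirement for sub-second recovery, may outweigh several others.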

Application Examples

Automated Teller Machine (ATM) and Point of Sale (POS) card authorization systems must support transactions on a global, 24x7 basis. Most ATM/POS architectures consist of a continuously available communication front-end that services the ATM and POS devices and provides basic authorization, logging and routing services. Communications have been extended to support switching of transactions initiated by cards issued by other financial institutions, with all the complexities of backing out and reconciling related transactions on two or more systems in case of device, network or system failure. The back-end consists of one or more mainframe-class systems handling the large customer account databases and providing inter-bank settlement. By providing basic authorization and routing, including transaction store and forward, in the fault-tolerant communication front-end, service can continue during periods where the back-end host is unavailable.

Another example of an enterprise application is a call center combined with a customer service or customer order application. The call center front-end often requires 24x7 availability and can provide stand-in processing or transaction capture if the back-end database is unavailable. The front-end can also route transactions to multiple back-end applications if required. The database and transaction services at the back-end handle the full customer account, service and product databases and associated application processing.

In the telecommunications industry, the implementation of the Intelligent Network (IN) has created the need for a reliable computer system, called a Service Control Point (SCP), which provides application and database services to the voice switches. Communication between the SCP and the switch is done using the SS7 protocol running over wide-area links. SCP applications require very high reliability and rely on large in-memory databases in order to meet the sub-second response times required. IN applications have proved to be an excellent fit for Stratus fault-tolerance.

Network management applications, both in the telecommunications and commercial sectors, also exhibit characteristics that benefit from the Stratus architecture. Network management systems typically keep large amounts of network status information in memory. Since Stratus systems preserve memory in the event of a hardware failure, this status information is continuously available. Without a fault-tolerant hardware system, this data can be recreated only by polling the entire network. Polling takes a significant amount of time and also generates a large amount of network traffic that could adversely affect critical applications.

Summary

HA clusters and fault-tolerant systems both provide effective availability solutions. Each has specific advantages that best fit a certain set of problems. In some cases, a combination of both types of systems is the best choice. For Unix environments, HP and Stratus offer a broad range of compatible systems and software that cover the total spectrum of availability from high-availability clusters to fault-tolerant hardware systems. The total compatibility and interoperability of these products, combined with the ability of HP to offer a single-vendor source for products, services and support, provides customers with the complete solution to their enterprise availability requirements.

©Copyright 1998 Interex. All rights reserved.