Product Availability and Coverage
High Availability for applications and users
High Availability with Fault Tolerant Systems
Increased application and/or user population scaling
Scaling with Symmetric Multi-Processing (SMP) Systems
Less than sum-of-the-parts system administration capabilities
Investment protection for hardware and software
Overview Summary
Compaq's Enhancements to Microsoft Cluster Server
Shared-Disk Clustering
The Cluster File System
The Distributed Lock Manager
Messaging Performance of the Distributed Lock Manager
A wide range of clustering products exist in the market today, based on several Unix and proprietary operating systems. These products have been available since the mid-1980s, each providing different capabilities from implementations that are technically simple to those that are highly sophisticated.
Clustering for Windows NT has been available from several vendors since the mid-1990s. At this time the implementations are still relatively simple, offering appropriately limited capabilities. The market for sophisticated clustering implementations based on Windows NT that deliver extensive functionality is growing rapidly.
This paper describes enhancements to the Microsoft Cluster Server (also known as Wolfpack) clustering environment. These enhancements are being developed by Compaq's Windows NT Enterprise Software Group.
The paper is divided into four distinct sections:
Product Availability and Coverage
The enhancements described in this paper are currently under development, and are due for initial product availability during the latter part of 1998 and throughout 1999.
Two important features of the capabilities include:
This section provides a broad overview of clustering technology, with no emphasis on any specific product. There are many clustering white papers available on the Internet, from a wide range of vendors. Readers are encouraged to research these papers.
Clustering definitions are as numerous as the number of cluster products on the market. However, it is commonly accepted that clustering is a method of combining several computers together so that the resulting configuration provides one or more of the following capabilities:
The list above provides vendors a broad palette with which to create a product that can be marketed and sold under the umbrella term of "clustering". The result has been that the variation of capabilities across clustering products is very wide, much wider than, for example, the variation across computing capabilities such as SMP, networking, and RAID. In general, clusters designed for the commercial application market concentrate primarily on delivering high availability, with the other capabilities being of lesser importance. However, it should be understood that the specific mix of capabilities varies from product to product.
The following paragraphs examine the capabilities on the list above individually:
High Availability for applications and users
Providing high availability is generally considered the most important capability for any clustering product. (Note that it is not a mandatory capability. For example, "workstation farm" clustering products make no attempt to enhance availability, but concentrate purely on scaling application performance.)
Delivering high availability is primarily achieved by duplicating hardware, so that, in the event of a hardware failure, there is enough spare hardware to enable continued operation. Of course, software is also required to perform the necessary control and switching of hardware components. This will be discussed in greater detail later.
Duplicating hardware is achieved by two methods:
Cluster products attempt to provide availability for computers (systems) so that applications can continue to run, as opposed to providing availability for storage (data). Storage availability is provided by using RAID techniques; indeed, it is common for a complete configuration to include both clustering and RAID. Considering how storage should be deployed is a critical element in the design of any cluster configuration. If one server is to be capable of assuming the workload of another, it must first be able to access the failed server's storage. In practice this means that nearly all cluster products rely on shared access to storage (at the hardware connection level). Products that do not rely on shared access to storage use data replication techniques (mirroring) to duplicate data across all the systems.
High availability clusters
As described earlier, cluster products provide high availability by connecting several systems together in such a way that, when one system fails, a remaining system can assume the workload of the failed system. Because a cluster is composed of commodity systems, its capabilities are provided by a mixture of software and, depending on the specific product, special-purpose "interconnect" hardware. The term "interconnect" refers to the hardware that is used to join the clustered systems together; in many cases this will be an off-the-shelf network, such as Ethernet, but it can also be an optimized high-performance communications network (Compaq's ServerNet and Memory Channel are two examples).
The figure shows a very typical cluster configuration. Note that the cluster comprises all the components within the gray oval.
Self-evidently, at the time of a failure the applications running on the failed system abruptly cease to execute. Clustering products vary greatly as to how recovery from this situation is achieved. The most common, and technically simplest, scenario is that the applications are restarted on a remaining node, an operation that can take several seconds (up to a minute or so). This process is called "failover", or, more accurately, "application failover". More sophisticated clusters permit an application to execute on multiple systems simultaneously, so that when one system fails the others simply continue, picking up the load of the failed system. This avoids the requirement to perform application failover.
Availability can be steadily increased in a cluster by adding further systems to the configuration. While having two systems in a cluster enables operation to continue if either system fails or shuts down, the remaining system must assume the entire load of the first system. In clusters that contain more than two systems the load of a failed system can be amortized over the remaining systems. Experience shows that most environments will benefit from up to three or four systems, but that the incremental availability benefit starts to diminish once there are more than about five or six systems (for example, adding a seventh system to a cluster is unlikely to improve overall availability by much).
High Availability with Fault Tolerant Systems
No paper on clustering can ignore the design and capabilities of fault tolerant systems. Because these systems are designed from the outset to be "fault tolerant" they provide the highest levels of availability. In general, any component part of the system can fail without impacting application and user operation in any way. The failure may be handled completely within the hardware, such that even the operating system is unaware of the fault (power supply failure is often handled entirely within the hardware, except for any signal that the operating system may require for event logging purposes). Other failures may require operating system action to complete recovery, but, in any event, user applications continue to operate unaware of any underlying failure (memory failure generally requires operating system assistance to complete recovery). The ability to mask all failures from users and applications is the distinguishing feature between fault tolerant systems and clusters. As will be seen, a feature of clusters is that users and applications are often aware, if only briefly, that a failure has occurred.
The figure shows a single, hypothetical, fault tolerant system. The system has fully duplicated hardware, two CPUs, two memory systems, and duplicated I/O adapters, each configured into two "zones". The two zones execute in a fully synchronized manner; both memories contain the same instructions and data, and both CPUs execute the same instructions. A failure of any component is contained within its zone, and the other zone can continue without interruption.
While fault tolerant systems provide the highest levels of availability, suitable for extremely demanding applications, they have a few weaknesses that can render them inappropriate for many uses:
Increased application and/or user population scaling
Increasing overall application throughput and the number of users/clients that can be handled are important clustering capabilities. Ideally, all the systems in the cluster configuration can be used to handle a portion of the total application load. Whether this is possible in practice is highly dependent on the underlying capabilities of the cluster and the application(s).
It is common for people to confuse the concept of scaling with that of performance. As systems are added to a cluster the total compute power available for applications grows, so the bandwidth, or scalability, of an application increases. If each system is capable of handling 100 transactions/second then a cluster of five systems is capable of handling 500 transactions/second (assuming, for simplicity, perfect scaling). However, the transaction time remains constant, at 1/100th of a second. Increased performance can only be achieved by increasing the power of an individual system; replacing a system capable of 100 transactions/second with one five times more powerful results in being able to perform 500 transactions/second (once again, assuming perfect scaling). However, in this case the transaction time is reduced to 1/500th of a second.
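The distinction can be stated as a trivial calculation. The fragment below is a minimal sketch in C using the purely illustrative figures from the text; it is not drawn from any product measurement.

#include <stdio.h>

int main(void)
{
    const double per_system_tps = 100.0;  /* transactions/second per system */
    const int    systems        = 5;      /* systems in the cluster         */

    /* Scaling: aggregate throughput grows with the number of systems
     * (assuming, for simplicity, perfect scaling). */
    double cluster_tps = per_system_tps * systems;      /* 500 tps */

    /* Performance: the time for one transaction is set by the power of
     * the individual system that executes it, so it does not change. */
    double transaction_time = 1.0 / per_system_tps;     /* 0.01 second */

    printf("Cluster throughput: %.0f transactions/second\n", cluster_tps);
    printf("Transaction time:   %.4f seconds\n", transaction_time);
    return 0;
}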
In the simplest clustering products, running the same application on multiple systems at the same time is either not possible or not straightforward. With these products it is common practice to allocate different applications to different systems: each system executes its specific application. When a system shuts down, application failover moves the application to a remaining node. If there is only one application the cluster becomes a "hot standby" configuration: one system executes the application while the other system(s) wait, idle, until it shuts down or fails. This is clearly an inefficient use of resources, providing no application scaling, so it is not a popular type of clustering.
In single-application environments where hot standby operation is inappropriate it may be possible to modify how the application is deployed, so that it can be made suitable for running on multiple systems simultaneously. A common way of achieving this is to partition the application database into two or more sections, and then to assign the application on each system responsibility for an individual section.
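A minimal sketch of this kind of partitioning is shown below. The function name and the hash are invented for illustration only; real products typically partition on key ranges or sections chosen by the database administrator.

/* Assign each record key to one of the database sections, so that the
 * application instance on each system works only on the section it has
 * been given responsibility for. */
unsigned int partition_for_key(const char *key, unsigned int num_partitions)
{
    unsigned int hash = 5381u;                /* simple string hash */
    while (*key)
        hash = hash * 33u + (unsigned char)*key++;
    return hash % num_partitions;             /* section index 0 .. num_partitions-1 */
}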
More sophisticated clustering products provide the ability to run the same application on multiple systems at the same time, without the need to partition application databases. However, it is often necessary to modify the application before this can be done.
The level of scaling that a cluster can achieve will vary with the specific product and the application. In many cases it is possible to continue to add systems until some limit is reached in the clustering software or interconnect. Clusters containing over 100 systems have been created; this is especially common in the workstation farm environment. In commercial environments, where individual systems tend to be large and growth more gradual, it is common for cluster sizes to reach an upper limit based on system power. As new, more powerful systems are added to the cluster, old, less powerful systems are retired.
Scaling with Symmetric Multi-Processing (SMP) Systems
When considering the application scaling capabilities of a clustering product, a comparison with SMP application scaling capabilities can be useful. In general, an SMP system will provide better application scaling than is possible with a cluster. An SMP system is purpose-designed to permit several CPUs to coexist in the same computer. The figure shows the general arrangement of an SMP system: multiple CPUs connected to a single memory and I/O subsystem by a high-performance system bus. The high-performance system bus allows excellent CPU-to-CPU communication performance, and allows applications running on multiple CPUs to use shared memory for common data. However, while SMP systems can provide excellent application performance, they have several limitations that clusters overcome:
Of course, it is important to note that these limitations are mostly overcome by the simple expedient of clustering several SMP systems. Clustering and SMP technologies are highly compatible, and are often used together to provide the highest levels of performance and availability.
Less than sum-of-the-parts system administration capabilities
A cluster consists of multiple systems, all of which have to be managed. For the simplest clustering products the amount of administration required is a direct multiple of the number of systems in the cluster, often with additional management required for the cluster capabilities themselves (such as defining application failover scenarios).
In general, system administration complexity is often a weakness of clustering products, requiring skilled personnel and repetitious operations. Since system administration is usually a major component of system operational costs this should not be overlooked when choosing a clustering solution.
Sophisticated cluster products will provide cluster-aware administration utilities that simplify the management of multiple systems. These will often remove the need to perform the same administration activities on every system in the cluster; for example, a user account created on one system automatically becomes valid on all systems.
Investment protection for hardware and software
Low-end cluster products usually restrict the number and type of configurations supported by requiring that all systems are identical. These configurations benefit the vendor because they are simpler to develop and test (and leverage larger sales volumes). However, for obvious reasons they can increase end-user costs.
An attractive feature of the more sophisticated cluster products is that older systems can be clustered with newer systems. This allows systems to be added incrementally, as application and user needs demand, and permits older systems to be fully depreciated. In many cases older and slower systems can be relegated to less critical tasks, such as printing and backups.
Additionally, cluster products that permit multiple versions of the operating system and application(s) to coexist in the same configuration can reduce costs, and simplify administration and upgrades.
These four basic cluster capabilities (availability, scalability, manageability, and investment protection) should all be present to a greater or lesser extent in any clustering product. The better products will provide all four of these capabilities; others, only one or two. When evaluating cluster products it is important to understand which features are provided.
Microsoft's clustering solution for Windows NT, Microsoft Cluster Server, started shipping during 1997. This product, widely known by its development code-name Wolfpack, concentrates on providing high availability capabilities for the Windows NT environment. Microsoft Cluster Server (MSCS) is delivered as part of Windows NT Server, Enterprise Edition.
Comprehensive information regarding MSCS can be found on the Microsoft Internet site at: http://www.eu.microsoft.com/nTServerEnterprise/Basics/Features/Clustering/default.asp
MSCS clusters support configurations with up to two systems (nodes) connected to a common LAN and storage interconnect. The common LAN is used for system-to-system and system-to-client communication. It is possible to configure the cluster with two LANs, one for system-to-system communication, the other for system-to-client communication. This generally leads to higher availability because intra-cluster communication remains possible in the event that the general-purpose client LAN fails. The common storage interconnect is either parallel SCSI or Fibre Channel. Disks connected to the common storage interconnect are accessible by either system, thereby providing the foundation for high availability serving.
MSCS provides the concept of cluster "resources". A wide range of resources are defined, essentially any entity for which high availability is required. Typical examples are applications, disks and TCP/IP network addresses. The MSCS administrator defines "groups" of resources that are required to provide a complete service for clients. Failover is performed at the group level so, when a server shuts down or fails, all the resources in a group are moved to the other server. This ensures, for example, that when an application is restarted on a surviving node it has access to the correct disks and is reachable by clients using its normal TCP/IP address. This technique provides a simple and powerful infrastructure with which to create high availability server configurations.
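As an illustration of the group and resource model, the following sketch builds a group with the MSCS Cluster API (clusapi.h). It is a hedged example: the group and resource names are invented, error handling is omitted, and resource-specific private properties (such as the actual TCP/IP address) would still need to be set before the group could come online; most administrators would perform these steps with the graphical Cluster Administrator instead.

#include <windows.h>
#include <clusapi.h>

void build_group(void)
{
    HCLUSTER  hCluster = OpenCluster(NULL);                 /* the local cluster */
    HGROUP    hGroup   = CreateClusterGroup(hCluster, L"Payroll Group");

    /* The resources that together provide a complete service to clients. */
    HRESOURCE hDisk = CreateClusterResource(hGroup, L"Payroll Disk",
                                            L"Physical Disk", 0);
    HRESOURCE hAddr = CreateClusterResource(hGroup, L"Payroll IP Address",
                                            L"IP Address", 0);
    HRESOURCE hApp  = CreateClusterResource(hGroup, L"Payroll Application",
                                            L"Generic Application", 0);

    /* The application must not start until its disk and address are available. */
    AddClusterResourceDependency(hApp, hDisk);
    AddClusterResourceDependency(hApp, hAddr);

    /* Bring the group online; at failover MSCS moves it as a single unit. */
    OnlineClusterGroup(hGroup, NULL);

    CloseClusterResource(hApp);
    CloseClusterResource(hAddr);
    CloseClusterResource(hDisk);
    CloseClusterGroup(hGroup);
    CloseCluster(hCluster);
}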
Disks configured on MSCS shared storage interconnects, while physically connected to both servers, are only accessible by a single server at any point in time. This disk access technique, usually called "shared nothing", greatly simplifies the complexity of the cluster software, by avoiding the synchronization that is necessary when multiple systems access the same disk storage at the same time. In many cases shared-nothing clustering will provide excellent high availability, with applications being moved seamlessly from one server to another as servers shut down or fail.
The first figure shows an application initially executing on one server (the left server, in this example). The right server is unable to access the disk that the application on the left server is using.
The second figure shows what happens when the left server is shut down, or fails. The application is "failed over" to the right server. More accurately, the resource group is moved to the right server, which (1) gains access to the disk, (2) assumes the TCP/IP address of the left server, and (3) restarts the application.
Shared-nothing clusters provide a level of application scaling by allowing each server to process a different application. This is essentially the same as having two non-clustered (standalone) servers, but with a failover capability to permit continued application operation in the event of server shutdown or failure. This capability is particularly useful in "file and print" environments, and those with multiple applications. In these cases it is relatively straightforward to segment user and application files on to separate disks, thereby allowing each server to perform useful work. In environments that run a single application it is necessary to either:
Replicating or splitting databases and files is usually practical when the total amount of data is relatively limited (a few gigabytes, say) but becomes less so as the amount of data grows. Of course, the majority of application databases are not suitable for replication, because update activity would quickly cause them to diverge and become unsynchronized. Replication is suitable for read-only databases, such as those used by some Web Servers.
In the future, as Microsoft Cluster Server supports more systems, it will become necessary to partition databases/files into smaller sections (quarters for a four-node cluster, sixths for a six-node cluster, and so on) as the cluster grows. This is likely to become increasingly burdensome for system and application administrators.
The diagrams below show:
As can be seen from the above comments, Microsoft Cluster Server provides a simple and powerful infrastructure for building highly available server environments. It can be expected that these capabilities will be suitable for the majority of environments, where the number of nodes and quantity of data is limited. In these cases excellent availability can be achieved simply and cheaply, without the need for excessively burdensome system and application administration.
Microsoft "Q and A" on Microsoft Cluster Server
The following section is copied directly from the Microsoft Internet site noted earlier. The questions and answers provide an excellent commentary on several critical capabilities of MSCS.
Q: When a cluster is recovering from a server failure, how does the surviving server get access to the failed server's disk data?
A: There are basically three techniques that clusters use to make disk data available to more than one server:
Q: How does MSCS provide high availability?
A: MSCS uses software "heartbeats" to detect failed applications or servers. In the event a server failure is detected, MSCS first confirms the failure using a sophisticated "quorum" algorithm. It then employs a "shared nothing" clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this (detection, confirmation, and restart) can typically take under a minute. If an individual application fails (but the server does not), MSCS will typically try to restart the application on the same server; if that fails, it moves the application's resources and restarts it on the other server. The cluster administrator can use a graphical console to set various recovery policies such as dependencies between applications, whether or not to restart an application on the same server, and whether or not to automatically "failback" (rebalance) workloads when a failed server comes back online.
Q: Should cluster-aware applications developed for MSCS use a shared-disk or shared-nothing architecture for greatest scalability?
A: Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources. In theory, MSCS can support either type of application. However, Microsoft has no plans at this time to include a DLM in the MSCS cluster services, so vendors would have to develop or license a DLM to implement a shared-disk application on MSCS. Microsoft has chosen to use the shared-nothing architecture for future versions of Microsoft BackOffice® family applications because of that architecture's greater potential for cluster-enabled scalability.
Q: Will MSCS ever have a Distributed Lock Manager (DLM)?
A: Microsoft did not include a distributed lock manager in the first release of MSCS. Enhancements in future releases will be determined based on customer requirements.
The description, pictures, and questions & answers in the preceding pages provide a brief description of the Microsoft Cluster Server product and its capabilities. The following section describes how Compaq will provide enhancements to the Microsoft Cluster Server environment permitting it to deliver even higher availability, superior application scaling, and simplified system administration.
Compaq's Enhancements to Microsoft Cluster Server
The capabilities of Microsoft Cluster Server provide an excellent basis on which to deliver additional clustering features. It is important to realize that Compaq's enhancements to Microsoft Cluster Server permit customers to take advantage of all standard MSCS features in conjunction with the enhancements.
The primary thrust of these enhancements is to deliver additional application availability and scaling, across a wider range of customer environments than is possible with the standard MSCS product, while at the same time simplifying system and application administration. The central technical feature on which these capabilities will be based is the provision of a "shared disk" clustering paradigm. As alluded to previously, the combination of MSCS's basic shared-nothing clustering and Compaq's shared-disk clustering will provide customers and applications with complete flexibility in how computing solutions are deployed: the two schemes can be combined within a single cluster, allowing customers to use the most appropriate paradigm on a per-application basis.
Providing shared-disk clustering is technically non-trivial, and relies on two fundamental capabilities: a cluster file system and a distributed lock manager, both described in the sections that follow.
That styles of clustering should be described in technical terms such as "shared-disk" and "shared-nothing" may seem strange to the casual observer, but the difference between these two styles is highly significant to the operation and usefulness of a cluster. Shared-nothing, discussed earlier in this paper, is the simplest form of clustering, and provides high availability for applications in simple environments. Shared-disk clustering, while technically complex to implement, provides significant benefits in terms of availability, scalability and manageability for larger, more complex application environments.
The figure shows the operation of a disk in a shared-disk cluster environment. Note that both (or all) servers in the cluster can directly access the same disk.
In the question and answer section quoted from the Microsoft Internet Web pages earlier, the following comments are made regarding shared-disk clustering:
The earliest server clusters permitted every server to access every disk. This originally required expensive cabling and switches, plus specialized software and applications. (The specialized software that mediates access to shared disks is generally called a Distributed Lock Manager, or DLM.) Today, standards like SCSI have eliminated the requirement for expensive cabling and switches.
Both shared-nothing and shared-disk clusters use the node-to-node communication interconnect to provide support for low-level activities such as polling heartbeats. Also, both clustering styles impose additional loads on the interconnect. For example, in shared-nothing environments it is common to use Windows NT Shares to enable servers in the cluster to access disks that are controlled by another server in the cluster. Data transfers using these shares will occur across the interconnect, imposing a load on it.
Similarly, shared-disk clusters use the interconnect for distributed lock manager messaging, also imposing a load on it. However, as mentioned, today's commodity interconnects, such as 100 Mb/s Ethernet and Gigabit Ethernet, provide exceptional performance. Proprietary interconnects, such as ServerNet and Memory Channel, provide even higher levels of performance, and also benefit from commodity pricing due to their implementation on industry-standard system I/O buses, such as PCI.
However, shared-disk clustering still requires specially modified applications. This means it is not broadly useful for the wide variety of applications deployed on the millions of servers sold each year.
The requirement to modify applications before they can be deployed in shared-disk environments varies, as might be expected, with the application. For the most common of all computing environments, file and print, no modifications are necessary at all. File and print applications are essentially private with each client creating and accessing files privately. As a result, modification of file and print applications is unnecessary. This is also generally true for application environments such as Web Serving and Mail.
The figure shows the level of flexibility that shared-disk clusters provide for File and Print environments. It can be seen that a client can access any disk through any server. In this case standard Windows shares are defined on all servers in the cluster, and client load balancing algorithms (such as DNS round-robin) can be used to ensure that all servers are used efficiently. The system administration effort required to control this type of cluster (especially for large configurations) is significantly less than for shared-nothing clusters: careful placement of files and shares is not necessary, and definitions of resource groups and how they fail over are not required.
For applications that manipulate shared files, the specific method of sharing used dictates whether application modification is necessary. Applications that use byte-stream files, which are typically accessed using a simple open-exclusive, serial-access, close-file pattern, do not require modification before being deployed in a shared-disk environment. In these cases all necessary synchronization is performed automatically for the application by the cluster file system. Also, because the cluster file system takes advantage of the distributed lock manager's byte-range locking capability, any application that uses byte-range locking for synchronization will also work in a shared-disk environment without modification. These styles of access are common in many PC and Unix applications.
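The two access styles described above can be illustrated with standard Win32 calls. Applications already written this way need no modification, because the cluster file system (using the distributed lock manager) arbitrates the same open and byte-range lock requests cluster-wide. The file name, offsets, and record size below are illustrative only, and error handling is omitted.

#include <windows.h>

void access_styles(void)
{
    /* Style 1: open the file exclusively, access it serially, close it. */
    HANDLE h = CreateFileW(L"E:\\shared\\journal.dat",
                           GENERIC_READ | GENERIC_WRITE,
                           0,                          /* no sharing: exclusive open */
                           NULL, OPEN_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    /* ... serial reads and writes ... */
    CloseHandle(h);

    /* Style 2: open the file shared, and synchronize with byte-range locks. */
    HANDLE s = CreateFileW(L"E:\\shared\\journal.dat",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);

    OVERLAPPED ov = {0};
    ov.Offset = 4096;                                  /* lock one 512-byte record */
    LockFileEx(s, LOCKFILE_EXCLUSIVE_LOCK, 0, 512, 0, &ov);
    /* ... read, modify, and rewrite the locked record ... */
    UnlockFileEx(s, 0, 512, 0, &ov);
    CloseHandle(s);
}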
However, as the answer above correctly states, the class of applications that permit multiple users to share access to common database files generally requires modification before it can be deployed in clusters using shared-disk capabilities. In many cases these modifications will be provided by the application vendor (such as Oracle with their Parallel Server product). In other cases it is necessary to weigh the benefits of a shared-disk application against the effort required to modify it (discussed below). Of course, once modifications have been completed, the updated application can be deployed on non-clustered systems and on any type of clustered system.
Even when applications cannot be run on multiple nodes simultaneously without modification there are benefits to running them in shared-disk environments. For example, load balancing of applications across servers is greatly simplified: applications can be moved instantly between servers without the need to copy application and data files between disks. Application administration operations (such as backup) can be performed from one server in the cluster while the application continues to run on another server. Put simply, a shared-disk environment makes it straightforward to load balance shared-nothing applications.
Shared-disk clustering also has inherent limits on scalability since DLM contention grows geometrically as you add servers to the cluster. Examples of shared-disk clustering solutions include Digital VAX Clusters, and Oracle Parallel Server.
Compaq's DLMs are carefully designed to ensure that contention does not grow geometrically as servers are added to the cluster. The number of messages for any given DLM operation remains constant, regardless of whether the cluster comprises two or 100 nodes. As the answer implies, this is critical to ensuring excellent performance and scalability. Nevertheless, the answer touches on an important topic in shared-disk clusters: that of DLM performance. The following tradeoffs need to be considered:
In many cases this is a simple tradeoff: by using the DLM to grant applications running on multiple nodes in a cluster peer access to data, significant performance enhancements can be achieved.
Of course, any application can be written poorly, so that the benefits of compute resources (SMP, caches, I/O subsystems, RAID, network infrastructures) are squandered. A Distributed Lock Manager is no different in this regard. However, after Compaq's 14 years of experience with DLMs, it is safe to say that, properly used, a DLM will provide excellent performance for shared-disk applications.
Compaq's Cluster File System (CFS) for Windows NT provides all the functionality of NTFS, but for shared-disk environments. The CFS is fully compliant with NTFS; that is, it provides the same interface to applications and the Windows NT I/O subsystem.
Windows NT is provided today with three file systems: FAT, NTFS and CDFS. The initial release of NT included a fourth file system, HPFS, for OS/2 compatibility. As a result, the ability of NT to include multiple coexisting file systems is well developed and understood. However, all these file systems will only coordinate disk activity from a single system. The CFS extends this capability to operate cluster-wide, using the distributed lock manager to coordinate its activities.
The figure shows how two (or more) servers in a cluster can perform file operations, in this case a file creation, on the same disk at the same time. The technical complexity of this operation is caused by having to ensure that the two file creations do not result in allocating the same free space on the disk. If this operation were attempted using NTFS or FAT (or any other non-cluster-wide file system) there is a strong possibility that files would overlay each other, rapidly resulting in disk and file corruption.
Confusion as to the exact function of a file system is common. File systems are responsible for all actions relating to the management of files on a disk. These actions include file creation, directory management, opening (finding) and closing a file, renaming and deleting files, imposing security mechanisms (access rights) on files, and so forth. Additionally, a file system must manage free space on the disk, control retirement of bad blocks, and perform other hardware-related functions. However, a file system will not provide any control over what data is placed inside any given file, or how that data is organized. This is the responsibility of the application that accesses the file. So, while a file system can locate a PowerPoint file in a directory, it has no ability to find a specific slide within the file. Only the PowerPoint application can do this. The result of this segregation of responsibilities is that the cluster file system, using the distributed lock manager for synchronization, can coherently manage file operations on a disk, cluster-wide. However, if cluster-wide coordination of data within a file is required, this must be done by the application, which will typically be modified to use the distributed lock manager to achieve it.
As described earlier, the CFS is NTFS compatible: it provides full emulation of NTFS at the FSD (file system driver) level. The common capabilities between CFS and NTFS include support for Quotas, High-water Marking, OpLocks (Opportunistic Locking), and multiple Named Streams per file. Performance of the CFS is expected to be approximately equivalent to NTFS, with some operations being somewhat faster and some slightly slower (performance figures are not yet available). To ensure optimal integration with Windows NT, the CFS uses the standard NT data cache for file data and propagates byte-range locks across the cluster; optimal performance of file system operations is achieved by means of a separate cluster-coherent write-behind metadata cache.
The CFS is designed to be extremely robust: it will tolerate metadata block corruption by maintaining multiple copies of metadata (metadata is the data required to maintain the on-disk structures, such as directories). High-speed recovery from cluster node failure is ensured by means of per-node recovery logs. Additionally, the CFS will permit online volume growth, so if a disk becomes full and the underlying storage is capable of growth (such as with volumes created with logical volume management products) the CFS can grow the volume. Unlike NTFS, this online volume growth can be performed while there are open files on the volume, so it can be done without the need to shut down applications or disturb clients.
With respect to volume size, the CFS is designed to support the largest foreseeable volumes, with designed-in scaling to multiple terabytes. Leadership technologies (such as secondary-level bitmaps) ensure that, as volume sizes grow, performance will not degrade.
The figure shows how the Cluster File System is integrated into the Windows NT kernel, as a peer of the other NT file systems. Note how the CFS uses the same system service interface, and the same storage driver interface as the other file systems. This ensures the highest level of application and I/O subsystem compatibility.
Initially the CFS will be supplied with special-purpose administration utilities for formatting, control (such as volume validation), analysis and repair. It is hoped that these capabilities will be integrated into the base Windows NT operating system over time. Since the CFS is fully compliant with NTFS it will be able to use industry-standard storage management products, such as backup utilities and performance monitors, that are provided by several vendors. However, the on-disk structure of the CFS is not the same as that of NTFS, so applications that assume knowledge of the on-disk structure must be modified before they can be used with the CFS. Such applications are not common, but include defragmenters.
The CFS has a carefully designed on-disk structure that is readily extensible. This will allow Compaq to enhance the CFS to track any changes that Microsoft may make to NTFS, and also to provide Compaq-specific enhancements (including, for example, additional features requested by customers).
Compaq's Distributed Lock Manager (DLM) for Windows NT is based closely on the DLMs that Digital previously implemented on its Unix and OpenVMS operating systems (prior to acquisition by Compaq). The industry's original DLM, for OpenVMS, was first shipped in 1984, so Compaq has the most comprehensive experience of building and optimizing these complex subsystems. Consequently, the Windows NT implementation benefits considerably from the two prior implementations: it is functionally comprehensive, mature, and algorithmically stable.
The Windows NT operating system provides a rich selection of synchronization primitives for application developers. These include critical sections, mutexes, semaphores, events, and the interlocked operations.
While these primitives provide a wide range of synchronization capabilities, they are restricted in scope: they only perform synchronization within a single system (either multiple threads within a process, or multiple processes within a system). As a result it is impossible to write an application that synchronizes activities across the nodes in a cluster.
The DLM provides the only programming primitive that enables cluster-wide synchronization of application access to resources. This capability is critical to the creation of any application that must scale across multiple servers within a cluster: an application running on one node of a cluster can coordinate access to any resource with an application running on another node in the same cluster.
Additionally, the DLM is available for non-clustered systems. In this case, since there is only a single system, the lock manager operates in a non-distributed manner: it becomes just another synchronization primitive for applications to use within a single system. There are two reasons that an application developer might choose to use the lock manager as a primary synchronization primitive within an application:
The DLM operates with two basic programming structures: resources and locks. A resource is defined by an application program (typically using a simple text string) and is used to represent the object that is to be locked. Once a resource has been defined by an application, locks can be acquired on it. The application acquiring the lock specifies the level of access required, for example exclusive access or shared-read access. Multiple locks can be queued against a resource; compatible locks (for example, multiple shared-read locks) will be granted simultaneously, while non-compatible locks wait for currently granted locks to be released. Note that the DLM imposes no explicit control over locked objects, but merely provides an infrastructure with which cooperating applications synchronize access, using resource names to represent the actual locked objects. This separation of resource names from the underlying locked objects makes the DLM extremely flexible: it can be used to synchronize access to anything. A good rule of thumb is that if something can be named, it can be locked by the DLM. This includes files, items within a file down to the bit level, in-memory cache buffers, any I/O device, whole applications, individual application routines, and so on.
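To make the resource-and-lock model concrete, the following fragment is a purely illustrative sketch. The function names dlm_enqueue and dlm_dequeue, and their signatures, are hypothetical stand-ins (loosely modeled on the style of the OpenVMS lock services); they are not the actual programming interface of the Windows NT DLM described here.

/* Hypothetical lock-mode values matching the six DLM access modes. */
typedef enum { LKM_NL, LKM_CR, LKM_CW, LKM_PR, LKM_PW, LKM_EX } lkmode_t;
typedef unsigned long lkid_t;

/* Hypothetical services: queue a lock request on a named resource
 * (waiting until it can be granted) and release a granted lock. */
extern int dlm_enqueue(const char *resource_name, lkmode_t mode, lkid_t *lock_id);
extern int dlm_dequeue(lkid_t lock_id);

void update_customer_record(void)
{
    lkid_t lock;

    /* The resource name is simply an agreed-upon string; here it stands
     * for a single record in a shared file. The DLM never touches the
     * record itself; it only arbitrates access to the name. */
    dlm_enqueue("PAYROLL.DAT:REC:00042", LKM_EX, &lock);

    /* ... read, modify, and rewrite the record under exclusive access ... */

    dlm_dequeue(lock);
}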
A full description of the DLM is not practical in this paper; however, the following list provides a brief overview of the comprehensive range of features provided:
The following table shows the compatibility matrix of the various access modes: Null (NL), Concurrent Read (CR), Concurrent Write (CW), Protected Read (PR), Protected Write (PW), and Exclusive (EX). Using the appropriate access mode for a lock on a resource allows an application to define its willingness to share access to the resource; for example, Exclusive mode permits only one accessor to the resource, while Protected Read permits multiple readers, but no writers.
                        Mode of Requested Lock
Mode of Existing Lock | NL  | CR  | CW  | PR  | PW  | EX
----------------------|-----|-----|-----|-----|-----|-----
NL                    | Yes | Yes | Yes | Yes | Yes | Yes
CR                    | Yes | Yes | Yes | Yes | Yes | No
CW                    | Yes | Yes | Yes | No  | No  | No
PR                    | Yes | Yes | No  | Yes | No  | No
PW                    | Yes | Yes | No  | No  | No  | No
EX                    | Yes | No  | No  | No  | No  | No
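For illustration only, the compatibility matrix above can be encoded directly as a small lookup table: a lock request is grantable only if its mode is compatible with every lock already granted on the resource. The code below is a minimal sketch, not part of the DLM programming interface.

/* The DLM lock compatibility matrix, encoded as a lookup table.
 * compatible[existing][requested] is 1 when the requested mode can be
 * granted alongside a lock already held in the existing mode. */
typedef enum { MODE_NL, MODE_CR, MODE_CW, MODE_PR, MODE_PW, MODE_EX } dlm_mode_t;

static const int compatible[6][6] = {
    /* requested:   NL CR CW PR PW EX */
    /* NL */      {  1, 1, 1, 1, 1, 1 },
    /* CR */      {  1, 1, 1, 1, 1, 0 },
    /* CW */      {  1, 1, 1, 0, 0, 0 },
    /* PR */      {  1, 1, 0, 1, 0, 0 },
    /* PW */      {  1, 1, 0, 0, 0, 0 },
    /* EX */      {  1, 0, 0, 0, 0, 0 },
};

int mode_is_compatible(dlm_mode_t existing, dlm_mode_t requested)
{
    return compatible[existing][requested];
}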
A full description of DLM capabilities is provided in the Programmer's Manual available on the Internet at http://www.windowsnt.digital.com.
Messaging Performance of the Distributed Lock Manager
Referring once again to Microsofts question and answer section earlier in this paper, the following answer is provided to one of the questions:
Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources.
The answer bases the potential for greater scalability on the requirement to partition or replicate data on multiple servers. From a performance viewpoint this leads to the same scaling that would be achieved with multiple non-clustered systems. As described in detail earlier in this paper, the practical implications of database partitioning or replication render the technique unsuitable for large, mission-critical application environments. Compaq's extensive experience with lock managers clearly shows that concerns regarding the potential for a DLM to become a bottleneck are almost without exception overstated. Typically, competitive pressures between vendors that do or do not have DLMs provide plenty of scope for exaggeration!
An understanding of the messaging techniques used by the DLM ensures that applications can achieve excellent performance. The following diagrams show the messaging handshakes used by DLM operations. For the majority of locking operations (those where an application locks a resource in which no other node has an interest) no node-to-node messages are necessary. In these cases the performance of the lock manager is essentially the same as any other Windows NT synchronization primitive. In cases where node-to-node messaging is required, the overwhelming majority of operations require a simple command-response protocol of two messages. Note that in cases where a lock request cannot be granted synchronously, and so is stalled by the DLM, an additional "granted" message is required when the resource is released by a prior holder of the lock. Under these circumstances the time to acquire the lock is dominated by the event that causes the stall, so the messaging performance of the DLM becomes irrelevant.
The first diagram shows how an initial lock is acquired. A single message is sent to a Directory node, followed by a response to the lock requester. The response notifies the requesting node that no other node has an interest in the resource, so that it can become the Master of the resource. Determining the Directory node for a given resource is done by hashing the resource name; in this way the resource directory database is distributed across the nodes in the cluster. Note that for a two-node cluster there is a 50% chance that the requesting node will also be the directory node, eliminating the need for any messages.
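The directory lookup described above depends only on every node hashing the resource name in the same way. The fragment below is a minimal sketch of the idea; the particular hash function is an assumption for illustration, not the algorithm used by the product.

/* Select the directory node for a resource by hashing its name.
 * Every node computes the same answer for the same name, so the
 * resource directory is spread evenly across the cluster without
 * any central coordinator. */
unsigned int directory_node(const char *resource_name, unsigned int num_nodes)
{
    unsigned int hash = 0;
    while (*resource_name)
        hash = hash * 31u + (unsigned char)*resource_name++;
    return hash % num_nodes;   /* node index in the range 0 .. num_nodes-1 */
}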
The second figure shows the most common of all locking operations: creation of additional locks and sublocks on a resource tree by the application that initially defined the resource. In this case the node already knows that it is the master of the resource, so no messages to the directory node are required. The same process is used for conversion requests on existing locks.
The third figure shows the most message-intensive operation performed by the DLM. In this figure a third node acquires its first lock on a resource already mastered on another node. In this case a message pair is sent to the directory node in order to identify which node is mastering the resource, and then another message pair is sent to the resource master node.
The fourth figure shows the messages required when a node acquires new sublocks on a resource that it is not mastering. In this case the requesting node already knows which node is the resource master, so a message pair to the directory node is avoided. The same process is used for lock conversions.
The last figure shows the messages required to release (or dequeue) a lock. These events require a single message to the Resource Master, and, when necessary, another to the Resource Directory.
These drawings provide a good overview of the message traffic imposed by a distributed lock manager. Note that, for the Windows NT DLM, lock messages are 64 bytes or smaller. Also, it can readily be seen that the algorithms used do not result in the number of messages increasing with the number of nodes in the cluster. In fact, message rates remain constant, ensuring that as cluster sizes grow the performance of the DLM will remain consistent. For a thorough treatment of this topic, readers are referred to "The VAX/VMS Distributed Lock Manager" by William E. Snaman and David W. Thiel, published in the Digital Technical Journal, Number 5, September 1987, ISBN 1-55558-004-1.
Compaq's Windows NT Enterprise Software Group is actively developing additional Windows NT capabilities that will take advantage of the shared-disk clustering paradigm. These capabilities are currently focused in two areas:
Providing suitable disaster tolerant capabilities for any computing environment is highly dependent on specific customer requirements and application capabilities. Disaster tolerance (the ability of a computing environment to survive disasters such as flood, earthquake, and fire) can be achieved in many ways:
Compaq's Windows NT Enterprise Software Group is currently investigating the feasibility of implementing the last of these disaster tolerance technologies (other groups are working on the other technologies; refer to the appropriate white papers for additional information). The implementation would include a fully distributed RAID-1 (mirroring) capability that would support RAID-1 disk members located on any server in an MSCS cluster. By placing servers in multiple geographic locations, disaster tolerant configurations can be created. By deploying the distributed RAID-1 capability in conjunction with the Cluster File System, users and applications could use any server to access their data, regardless of geographic location.
As mentioned earlier in this paper, providing system administrators with cluster-aware and easy-to-use tools and utilities greatly simplifies the management of a cluster, and reduces costs. To ensure that the capabilities described in this paper are easily manageable Compaq will provide an accompanying cluster administration utility. The administration utility includes the following capabilities:
Hopefully, this paper has provided the reader with an appreciation of the differences between, and intricacies of, two important clustering implementation styles. By now it will be clear that this is a complex topic, and that there are many variations between clustering products. When looking for a clustering solution the prospective purchaser needs an understanding of the capabilities of the available products, knowledge of what their actual requirements are (availability, application scaling, cost-effective management, and so on), and the ability to see through the marketing hype that surrounds too many clustering products.
On the one hand, shared-nothing clusters are simple to implement and provide excellent availability characteristics. Importantly, end-users can benefit from their capabilities immediately: no application modifications are necessary. It is clear that shared-nothing clusters, such as Microsoft Cluster Server, will dominate the market for small, high availability solutions. However, as Windows NT cluster configurations grow, and are deployed into the high-end commercial marketplace, the inherently simplistic nature of shared-nothing clusters will start to become more of a hindrance than a benefit. This will be especially apparent in environments with large databases or file populations, where data partitioning and replication are impractical.
On the other hand, shared-disk clusters are technically complex for a software vendor to implement, and, in some cases, require application modifications before their full benefits can be realized. However, their inherent elegance, unbeatable availability, seamless application scaling, and simple management make them the most suitable clustering style for all large and complex computing configurations.
Compaq's cluster enhancements to Microsoft Cluster Server offer the best of both worlds. MSCS's shared-nothing capabilities can be used at the same time as Compaq's shared-disk capabilities, with system administrators choosing the most appropriate style on a per-disk basis. This ability to mix the two capabilities gives customers ultimate flexibility. When applications can be deployed on shared-disk storage it is probable that most customers will be quick to take advantage of the feature. Meanwhile, applications that are restricted to shared-nothing storage can still be deployed, using the standard MSCS environment.