Product Availability and Coverage
High Availability for applications and users
High Availability with Fault Tolerant Systems
Increased application and/or user population scaling
Scaling with Symmetric Multi-Processing (SMP) Systems
Less than sum-of-the-parts system administration capabilities
Investment protection for hardware and software
Overview Summary
Compaq's Enhancements to Microsoft Cluster Server
Shared-Disk Clustering
The Cluster File System
The Distributed Lock Manager
Messaging Performance of the Distributed Lock Manager
A wide range of clustering products exist in the market today, based on several Unix and proprietary operating systems. These products have been available since the mid-1980s, each providing different capabilities from implementations that are technically simple to those that are highly sophisticated.
Clustering for Windows NT has been available from several vendors since the mid-1990s. At this time the implementations are still relatively simple, offering appropriately limited capabilities. The market for sophisticated clustering implementations based on Windows NT that deliver extensive functionality is growing rapidly.
This paper describes enhancements to the Microsoft Cluster Server (also known as Wolfpack) clustering environment. These enhancements are being developed by Compaq's Windows NT Enterprise Software Group.
The paper is divided into four distinct sections:
Product Availability and Coverage
The enhancements described in this paper are currently under development, and are due for initial product availability during the latter part of 1998 and throughout 1999.
Two important features of the capabilities include:
This section provides a broad overview of clustering technology, with no emphasis on any specific product. There are many clustering white papers available on the Internet, from a wide range of vendors. Readers are encouraged to research these papers.
Clustering definitions are as numerous as the number of cluster products on the market. However, it is commonly accepted that clustering is a method of combining several computers together so that the resulting configuration provides one or more of the following capabilities:
The list above provides vendors a broad palette with which to create a product that can be marketed and sold under the umbrella term of "clustering". The result has been that the variation of capabilities across clustering products is very wide, much wider than, for example, the variation across computing capabilities such as SMP, networking, and RAID. In general, clusters designed for the commercial application market concentrate primarily on delivering high availability, with the other capabilities being of lesser importance. However, it should be understood that the specific mix of capabilities varies from product to product.
The following paragraphs examine the capabilities on the list above individually:
High Availability for applications and users
Providing high availability is generally considered the most important capability for any clustering product. (Note that it is not a mandatory capability. For example, "workstation farm" clustering products make no attempt to enhance availability, but concentrate purely on scaling application performance.)
Delivering high availability is primarily achieved by duplicating hardware, so that, in the event of a hardware failure, there is enough spare hardware to enable continued operation. Of course, software is also required to perform the necessary control and switching of hardware components. This will be discussed in greater detail later.
Duplicating hardware is achieved by two methods:
Cluster products attempt to provide availability for computers (systems) so that applications can continue to run, as opposed to providing availability for storage (data). Storage availability is provided by using RAID techniques; indeed, it is common for a complete configuration to include both clustering and RAID. Considering how storage should be deployed is a critical element in the design of any cluster configuration. If one server is to be capable of assuming the workload of another, it must first be able to access the failed server's storage. In practice this means that nearly all cluster products rely on shared access to storage (at the hardware connection level). Products that do not rely on shared access to storage use data replication techniques (mirroring) to duplicate data across all the systems.
High availability clusters
As described earlier, cluster products provide high availability by connecting several systems together in such a way that, when one system fails, a remaining system can assume the workload of the failed system. Because a cluster is composed of commodity systems, its capabilities are provided by a mixture of software and, depending on the specific product, special-purpose "interconnect" hardware. The term "interconnect" refers to the hardware that is used to join the clustered systems together; in many cases this will be an off-the-shelf network, such as Ethernet, but it can also be an optimized high-performance communications network (Compaq's ServerNet and Memory Channel are two examples).
The figure shows a very typical cluster configuration. Note that the cluster comprises all the components within the gray oval.
Self-evidently, at the time of a failure the applications running on the failed system abruptly cease to execute. Clustering products vary greatly as to how recovery from this situation is achieved. The most common, and technically simplest, scenario is that the applications are restarted on a remaining node, an operation that can take several seconds (up to a minute or so). This process is called "failover", or, more accurately, "application failover". More sophisticated clusters permit an application to execute on multiple systems simultaneously, so that when one system fails the others simply continue, picking up the load of the failed system. This avoids the requirement to perform application failover.
Availability can be steadily increased in a cluster by adding further systems to the configuration. While having two systems in a cluster enables operation to continue if either system fails or shuts down, the remaining system must assume the entire load of the first system. In clusters that contain more than two systems the load of a failed system can be amortized over the remaining systems. Experience shows that most environments will benefit from up to three or four systems, but that the incremental availability benefit starts to diminish once there are more than about five or six systems (for example, adding a seventh system to a cluster is unlikely to improve overall availability by much).
High Availability with Fault Tolerant Systems
No paper on clustering can ignore the design and capabilities of fault tolerant systems. Because these systems are designed from the outset to be "fault tolerant" they provide the highest levels of availability. In general, any component part of the system can fail without impacting application and user operation in any way. The failure may be handled completely within the hardware, such that even the operating system is unaware of the fault (power supply failure is often handled entirely within the hardware, except for any signal that the operating system may require for event logging purposes). Other failures may require operating system action to complete recovery, but, in any event, user applications continue to operate unaware of any underlying failure (memory failure generally requires operating system assistance to complete recovery). The ability to mask all failures from users and applications is the distinguishing feature between fault tolerant systems and clusters. As will be seen, a feature of clusters is that users and applications are often aware, if only briefly, that a failure has occurred.
The figure shows a single, hypothetical, fault tolerant system. The system has fully duplicated hardware, two CPUs, two memory systems, and duplicated I/O adapters, each configured into two "zones". The two zones execute in a fully synchronized manner; both memories contain the same instructions and data, and both CPUs execute the same instructions. A failure of any component is contained within its zone, and the other zone can continue without interruption.
While fault tolerant systems provide the highest levels of availability, suitable for extremely demanding applications, they have a few weaknesses that can render them inappropriate for many uses:
Increased application and/or user population scaling
Increasing overall application throughput and the number of users/clients that can be handled are important clustering capabilities. Ideally, all the systems in the cluster configuration can be used to handle a portion of the total application load. Whether this is possible in practice is highly dependent on the underlying capabilities of the cluster and the application(s).
It is common for people to confuse the concept of scaling with that of performance. As systems are added to a cluster the total compute power available for applications grows, so the bandwidth, or scalability, of an application increases. If each system is capable of handling 100 transactions/second then a cluster of five systems is capable of handling 500 transactions/second (assuming, for simplicity, perfect scaling). However, the transaction time remains constant, at 1/100th of a second. Increased performance can only be achieved by increasing the power of an individual system; replacing a system capable of 100 transactions/second with one five times more powerful results in being able to perform 500 transactions/second (once again, assuming perfect scaling). However, in this case the transaction time is reduced to 1/500th of a second.
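The distinction can be stated as a trivial calculation. The fragment below is a minimal sketch in C using the purely illustrative figures from the text; it is not drawn from any product measurement.

#include <stdio.h>

int main(void)
{
    const double per_system_tps = 100.0;  /* transactions/second per system */
    const int    systems        = 5;      /* systems in the cluster         */

    /* Scaling: aggregate throughput grows with the number of systems
     * (assuming, for simplicity, perfect scaling). */
    double cluster_tps = per_system_tps * systems;      /* 500 tps */

    /* Performance: the time for one transaction is set by the power of
     * the individual system that executes it, so it does not change. */
    double transaction_time = 1.0 / per_system_tps;     /* 0.01 second */

    printf("Cluster throughput: %.0f transactions/second\n", cluster_tps);
    printf("Transaction time:   %.4f seconds\n", transaction_time);
    return 0;
}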
In the simplest clustering products, running the same application on multiple systems at the same time is either not possible or not straightforward. With these products it is common practice to allocate different applications to different systems: each system executes its specific application. When a system shuts down, application failover moves the application to a remaining node. If there is only one application the cluster becomes a "hot standby" configuration: one system executes the application while the other system(s) wait, idle, until it shuts down or fails. This is clearly an inefficient use of resources, providing no application scaling, so it is not a popular type of clustering.
In single-application environments where hot standby operation is inappropriate it may be possible to modify how the application is deployed, so that it can be made suitable for running on multiple systems simultaneously. A common way of achieving this is to partition the application database into two or more sections, and then to assign the application on each system responsibility for an individual section.
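A minimal sketch of this kind of partitioning is shown below. The function name and the hash are invented for illustration only; real products typically partition on key ranges or sections chosen by the database administrator.

/* Assign each record key to one of the database sections, so that the
 * application instance on each system works only on the section it has
 * been given responsibility for. */
unsigned int partition_for_key(const char *key, unsigned int num_partitions)
{
    unsigned int hash = 5381u;                /* simple string hash */
    while (*key)
        hash = hash * 33u + (unsigned char)*key++;
    return hash % num_partitions;             /* section index 0 .. num_partitions-1 */
}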
More sophisticated clustering products provide the ability to run the same application on multiple systems at the same time, without the need to partition application databases. However, it is often necessary to modify the application before this can be done.
The level of scaling that a cluster can achieve will vary with the specific product and the application. In many cases it is possible to continue to add systems until some limit is reached in the clustering software or interconnect. Clusters containing over 100 systems have been created; this is especially common in the workstation farm environment. In commercial environments, where individual systems tend to be large and growth more gradual, it is common for cluster sizes to reach an upper limit based on system power. As new, more powerful systems are added to the cluster, old, less powerful systems are retired.
Scaling with Symmetric Multi-Processing (SMP) Systems
When considering the application scaling capabilities of a clustering product, a comparison with SMP application scaling capabilities can be useful. In general, an SMP system will provide better application scaling than is possible with a cluster. An SMP system is purpose-designed to permit several CPUs to coexist in the same computer. The figure shows the general arrangement of an SMP system: multiple CPUs connected to a single memory and I/O subsystem by a high-performance system bus. The high-performance system bus allows excellent CPU-to-CPU communication performance, and allows applications running on multiple CPUs to use shared memory for common data. However, while SMP systems can provide excellent application performance, they have several limitations that clusters overcome:
Of course, it is important to note that these limitations are mostly overcome by the simple expedient of clustering several SMP systems. Clustering and SMP technologies are highly compatible, and are often used together to provide the highest levels of performance and availability.
Less than sum-of-the-parts system administration capabilities
A cluster consists of multiple systems, all of which have to be managed. For the simplest clustering products the amount of administration required is a direct multiple of the number of systems in the cluster, often with additional management required for the cluster capabilities themselves (such as defining application failover scenarios).
In general, system administration complexity is often a weakness of clustering products, requiring skilled personnel and repetitious operations. Since system administration is usually a major component of system operational costs this should not be overlooked when choosing a clustering solution.
Sophisticated cluster products will provide cluster-aware administration utilities that simplify the management of multiple systems. These will often remove the need to perform the same administration activities on every system in the cluster; for example, a user account created on one system automatically becomes valid on all systems.
Investment protection for hardware and software
Low-end cluster products usually restrict the number and type of configurations supported by requiring that all systems are identical. These configurations benefit the vendor because they are simpler to develop and test (and leverage larger sales volumes). However, for obvious reasons they can increase end-user costs.
An attractive feature of the more sophisticated cluster products is that older systems can be clustered with newer systems. This allows systems to be added incrementally, as application and user needs demand, and permits older systems to be fully depreciated. In many cases older and slower systems can be relegated to less critical tasks, such as printing and backups.
Additionally, cluster products that permit multiple versions of the operating system and application(s) to coexist in the same configuration can reduce costs, and simplify administration and upgrades.
These four basic cluster capabilities (availability, scalability, manageability, and investment protection) should all be present to a greater or lesser extent in any clustering product. The better products will provide all four of these capabilities; others, only one or two. When evaluating cluster products it is important to understand which features are provided.
Microsoft's clustering solution for Windows NT, Microsoft Cluster Server, started shipping during 1997. This product, widely known by its development code-name Wolfpack, concentrates on providing high availability capabilities for the Windows NT environment. Microsoft Cluster Server (MSCS) is delivered as part of Windows NT Server, Enterprise Edition.
Comprehensive information regarding MSCS can be found on the Microsoft Internet site at: http://www.eu.microsoft.com/nTServerEnterprise/Basics/Features/Clustering/default.asp
MSCS clusters support configurations with up to two systems (nodes) connected to a common LAN and storage interconnect. The common LAN is used for system-to-system and system-to-client communication. It is possible to configure the cluster with two LANs, one for system-to-system communication, the other for system-to-client communication. This generally leads to higher availability because intra-cluster communication remains possible in the event that the general-purpose client LAN fails. The common storage interconnect is either parallel SCSI or Fibre Channel. Disks connected to the common storage interconnect are accessible by either system, thereby providing the foundation for high availability serving.
MSCS provides the concept of cluster "resources". A wide range of resources are defined, essentially any entity for which high availability is required. Typical examples are applications, disks and TCP/IP network addresses. The MSCS administrator defines "groups" of resources that are required to provide a complete service for clients. Failover is performed at the group level so, when a server shuts down or fails, all the resources in a group are moved to the other server. This ensures, for example, that when an application is restarted on a surviving node it has access to the correct disks and is reachable by clients using its normal TCP/IP address. This technique provides a simple and powerful infrastructure with which to create high availability server configurations.
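As an illustration of the group and resource model, the following sketch builds a group with the MSCS Cluster API (clusapi.h). It is a hedged example: the group and resource names are invented, error handling is omitted, and resource-specific private properties (such as the actual TCP/IP address) would still need to be set before the group could come online; most administrators would perform these steps with the graphical Cluster Administrator instead.

#include <windows.h>
#include <clusapi.h>

void build_group(void)
{
    HCLUSTER  hCluster = OpenCluster(NULL);                 /* the local cluster */
    HGROUP    hGroup   = CreateClusterGroup(hCluster, L"Payroll Group");

    /* The resources that together provide a complete service to clients. */
    HRESOURCE hDisk = CreateClusterResource(hGroup, L"Payroll Disk",
                                            L"Physical Disk", 0);
    HRESOURCE hAddr = CreateClusterResource(hGroup, L"Payroll IP Address",
                                            L"IP Address", 0);
    HRESOURCE hApp  = CreateClusterResource(hGroup, L"Payroll Application",
                                            L"Generic Application", 0);

    /* The application must not start until its disk and address are available. */
    AddClusterResourceDependency(hApp, hDisk);
    AddClusterResourceDependency(hApp, hAddr);

    /* Bring the group online; at failover MSCS moves it as a single unit. */
    OnlineClusterGroup(hGroup, NULL);

    CloseClusterResource(hApp);
    CloseClusterResource(hAddr);
    CloseClusterResource(hDisk);
    CloseClusterGroup(hGroup);
    CloseCluster(hCluster);
}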
Disks configured on MSCS shared storage interconnects, while physically connected to both servers, are only accessible by a single server at any point in time. This disk access technique, usually called "shared nothing", greatly simplifies the complexity of the cluster software, by avoiding the synchronization that is necessary when multiple systems access the same disk storage at the same time. In many cases shared-nothing clustering will provide excellent high availability, with applications being moved seamlessly from one server to another as servers shut down or fail.
The first figure shows an application initially executing on one server (the left server, in this example). The right server is unable to access the disk that the application on the left server is using.
The second figure shows what happens when the left server is shut down, or fails. The application is "failed over" to the right server. More accurately, the resource group is moved to the right server, which (1) gains access to the disk, (2) assumes the TCP/IP address of the left server, and (3) restarts the application.
Shared-nothing clusters provide a level of application scaling by allowing each server to process a different application. This is essentially the same as having two non-clustered (standalone) servers, but with a failover capability to permit continued application operation in the event of server shutdown or failure. This capability is particularly useful in "file and print" environments, and those with multiple applications. In these cases it is relatively straightforward to segment user and application files on to separate disks, thereby allowing each server to perform useful work. In environments that run a single application it is necessary to either:
Replicating or splitting databases and files is usually practical when the total amount of data is relatively limited (a few gigabytes, say) but becomes less so as the amount of data grows. Of course, the majority of application databases are not suitable for replication, because update activity would quickly cause them to diverge and become unsynchronized. Replication is suitable for read-only databases, such as those used by some Web Servers.
In the future, as Microsoft Cluster Server supports more systems, it will become necessary to partition databases/files into smaller sections (quarters for a four-node cluster, sixths for a six-node cluster, and so on) as the cluster grows. This is likely to become increasingly burdensome for system and application administrators.
The diagrams below show:
As can be seen from the above comments, Microsoft Cluster Server provides a simple and powerful infrastructure for building highly available server environments. It can be expected that these capabilities will be suitable for the majority of environments, where the number of nodes and quantity of data is limited. In these cases excellent availability can be achieved simply and cheaply, without the need for excessively burdensome system and application administration.
Microsoft "Q and A" on Microsoft Cluster Server
The following section is copied directly from the Microsoft Internet site noted earlier. The questions and answers provide an excellent commentary on several critical capabilities of MSCS.
Q: When a cluster is recovering from a server failure, how does the surviving server get access to the failed server's disk data?
A: There are basically three techniques that clusters use to make disk data available to more than one server:
Q: How does MSCS provide high availability?
A: MSCS uses software "heartbeats" to detect failed applications or servers. In the event a server failure is detected, MSCS first confirms the failure using a sophisticated "quorum" algorithm. It then employs a "shared nothing" clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this (detection, confirmation, and restart) can typically take under a minute. If an individual application fails (but the server does not), MSCS will typically try to restart the application on the same server; if that fails, it moves the application's resources and restarts it on the other server. The cluster administrator can use a graphical console to set various recovery policies such as dependencies between applications, whether or not to restart an application on the same server, and whether or not to automatically "failback" (rebalance) workloads when a failed server comes back online.
Q: Should cluster-aware applications developed for MSCS use a shared-disk or shared-nothing architecture for greatest scalability?
A: Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources. In theory, MSCS can support either type of application. However, Microsoft has no plans at this time to include a DLM in the MSCS cluster services, so vendors would have to develop or license a DLM to implement a shared-disk application on MSCS. Microsoft has chosen to use the shared-nothing architecture for future versions of Microsoft BackOffice® family applications because of that architecture's greater potential for cluster-enabled scalability.
Q: Will MSCS ever have a Distributed Lock Manager (DLM)?
A: Microsoft did not include a distributed lock manager in the first release of MSCS. Enhancements in future releases will be determined based on customer requirements.
The description, pictures, and questions & answers in the preceding pages provide a brief description of the Microsoft Cluster Server product and its capabilities. The following section describes how Compaq will provide enhancements to the Microsoft Cluster Server environment permitting it to deliver even higher availability, superior application scaling, and simplified system administration.
Compaq's Enhancements to Microsoft Cluster Server
The capabilities of Microsoft Cluster Server provide an excellent basis on which to deliver additional clustering features. It is important to realize that Compaq's enhancements to Microsoft Cluster Server permit customers to take advantage of all standard MSCS features in conjunction with the enhancements.
The primary thrust of these enhancements is to deliver additional application availability and scaling, across a wider range of customer environments than is possible with the standard MSCS product, while at the same time simplifying system and application administration. The central technical feature on which these capabilities will be based is the provision of a "shared disk" clustering paradigm. As alluded to previously, the combination of MSCS's basic shared-nothing clustering and Compaq's shared-disk clustering will provide customers and applications with complete flexibility in how computing solutions are deployed: the two schemes can be combined within a single cluster, allowing customers to use the most appropriate paradigm on a per-application basis.
Providing shared-disk clustering is technically non-trivial, and relies on two fundamental capabilities: a cluster file system and a distributed lock manager, both described in the sections that follow.
That styles of clustering should be described in technical terms such as "shared-disk" and "shared-nothing" may seem strange to the casual observer, but the difference between these two styles is highly significant to the operation and usefulness of a cluster. Shared-nothing, discussed earlier in this paper, is the simplest form of clustering, and provides high availability for applications in simple environments. Shared-disk clustering, while technically complex to implement, provides significant benefits in terms of availability, scalability and manageability for larger, more complex application environments.
The figure shows the operation of a disk in a shared-disk cluster environment. Note that both (or all) servers in the cluster can directly access the same disk.
In the question and answer section quoted from the Microsoft Internet Web pages earlier, the following comments are made regarding shared-disk clustering:
The earliest server clusters permitted every server to access every disk. This originally required expensive cabling and switches, plus specialized software and applications. (The specialized software that mediates access to shared disks is generally called a Distributed Lock Manager, or DLM.) Today, standards like SCSI have eliminated the requirement for expensive cabling and switches.
Both shared-nothing and shared-disk clusters use the node-to-node communication interconnect to provide support for low-level activities such as polling heartbeats. Also, both clustering styles impose additional loads on the interconnect. For example, in shared-nothing environments it is common to use Windows NT Shares to enable servers in the cluster to access disks that are controlled by another server in the cluster. Data transfers using these shares will occur across the interconnect, imposing a load on it.
Similarly, shared-disk clusters use the interconnect for distributed lock manager messaging, also imposing a load on it. However, as mentioned, today's commodity interconnects, such as 100 Mb/s Ethernet and Gigabit Ethernet, provide exceptional performance. Proprietary interconnects, such as ServerNet and Memory Channel, provide even higher levels of performance, and also benefit from commodity pricing due to their implementation on industry-standard system I/O buses, such as PCI.
However, shared-disk clustering still requires specially modified applications. This means it is not broadly useful for the wide variety of applications deployed on the millions of servers sold each year.
The requirement to modify applications before they can be deployed in shared-disk environments varies, as might be expected, with the application. For the most common of all computing environments, file and print, no modifications are necessary at all. File and print applications are essentially private with each client creating and accessing files privately. As a result, modification of file and print applications is unnecessary. This is also generally true for application environments such as Web Serving and Mail.
The figure shows the level of flexibility that shared-disk clusters provide for File and Print environments. It can be seen that a client can access any disk through any server. In this case standard Windows shares are defined on all servers in the cluster, and client load balancing algorithms (such as DNS round-robin) can be used to ensure that all servers are used efficiently. The system administration effort required to control this type of cluster (especially for large configurations) is significantly less than for shared-nothing clusters: careful placement of files and shares is not necessary, and definitions of resource groups and how they fail over are not required.
For applications that manipulate shared files, the specific method of sharing used dictates whether application modification is necessary. Applications that use byte-stream files, which are typically accessed using a simple open-exclusive, serial-access, close-file pattern, do not require modification before being deployed in a shared-disk environment. In these cases all necessary synchronization is performed automatically for the application by the cluster file system. Also, because the cluster file system takes advantage of the distributed lock manager's byte-range locking capability, any application that uses byte-range locking for synchronization will also work in a shared-disk environment without modification. These styles of access are common in many PC and Unix applications.
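The two access styles described above can be illustrated with standard Win32 calls. Applications already written this way need no modification, because the cluster file system (using the distributed lock manager) arbitrates the same open and byte-range lock requests cluster-wide. The file name, offsets, and record size below are illustrative only, and error handling is omitted.

#include <windows.h>

void access_styles(void)
{
    /* Style 1: open the file exclusively, access it serially, close it. */
    HANDLE h = CreateFileW(L"E:\\shared\\journal.dat",
                           GENERIC_READ | GENERIC_WRITE,
                           0,                          /* no sharing: exclusive open */
                           NULL, OPEN_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    /* ... serial reads and writes ... */
    CloseHandle(h);

    /* Style 2: open the file shared, and synchronize with byte-range locks. */
    HANDLE s = CreateFileW(L"E:\\shared\\journal.dat",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL, NULL);

    OVERLAPPED ov = {0};
    ov.Offset = 4096;                                  /* lock one 512-byte record */
    LockFileEx(s, LOCKFILE_EXCLUSIVE_LOCK, 0, 512, 0, &ov);
    /* ... read, modify, and rewrite the locked record ... */
    UnlockFileEx(s, 0, 512, 0, &ov);
    CloseHandle(s);
}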
However, as the answer above correctly states, the class of applications that permit multiple users to share access to common database files generally requires modification before it can be deployed in clusters using shared-disk capabilities. In many cases these modifications will be provided by the application vendor (such as Oracle with their Parallel Server product). In other cases it is necessary to weigh the benefits of a shared-disk application against the effort required to modify it (discussed below). Of course, once modifications have been completed, the updated application can be deployed on non-clustered systems and on any type of clustered system.
Even when applications cannot be run on multiple nodes simultaneously without modification there are benefits to running them in shared-disk environments. For example, load balancing of applications across servers is greatly simplified: applications can be moved instantly between servers without the need to copy application and data files between disks. Application administration operations (such as backup) can be performed from one server in the cluster while the application continues to run on another server. Put simply, a shared-disk environment makes it straightforward to load balance shared-nothing applications.
Shared-disk clustering also has inherent limits on scalability since DLM contention grows geometrically as you add servers to the cluster. Examples of shared-disk clustering solutions include Digital VAX Clusters, and Oracle Parallel Server.
Compaq's DLMs are carefully designed to ensure that contention does not grow geometrically as servers are added to the cluster. The number of messages for any given DLM operation remains constant, regardless of whether the cluster comprises two or 100 nodes. As the answer implies, this is critical to ensuring excellent performance and scalability. Nevertheless, the answer touches on an important topic in shared-disk clusters: that of DLM performance. The following tradeoffs need to be considered:
In many cases this is a simple tradeoff: by using the DLM to grant applications running on multiple nodes in a cluster peer access to data, significant performance enhancements can be achieved.
Of course, any application can be written poorly, so that the benefits of compute resources (SMP, caches, I/O subsystems, RAID, network infrastructures) are squandered. A Distributed Lock Manager is no different in this regard. However, after Compaq's 14 years of experience with DLMs, it is safe to say that, properly used, a DLM will provide excellent performance for shared-disk applications.
Compaq's Cluster File System (CFS) for Windows NT provides all the functionality of NTFS, but for shared-disk environments. The CFS is fully compliant with NTFS; that is, it provides the same interface to applications and the Windows NT I/O subsystem.
Windows NT is provided today with three file systems: FAT, NTFS and CDFS. The initial release of NT included a fourth file system, HPFS, for OS/2 compatibility. As a result, the ability of NT to include multiple coexisting file systems is well developed and understood. However, all these file systems will only coordinate disk activity from a single system. The CFS extends this capability to operate cluster-wide, using the distributed lock manager to coordinate its activities.
The figure shows how two (or more) servers in a cluster can perform file operations, in this case a file creation, on the same disk at the same time. The technical complexity of this operation is caused by having to ensure that the two file creations do not result in allocating the same free space on the disk. If this operation were attempted using NTFS or FAT (or any other non-cluster-wide file system) there is a strong possibility that files would overlay each other, rapidly resulting in disk and file corruption.
Confusion as to the exact function of a file system is common. File systems are responsible for all actions relating to the management of files on a disk. These actions include file creation, directory management, opening (finding) and closing a file, renaming and deleting files, imposing security mechanisms (access rights) on files, and so forth. Additionally, a file system must manage free space on the disk, control retirement of bad blocks, and perform other hardware-related functions. However, a file system will not provide any control over what data is placed inside any given file, or how that data is organized. This is the responsibility of the application that accesses the file. So, while a file system can locate a PowerPoint file in a directory, it has no ability to find a specific slide within the file. Only the PowerPoint application can do this. The result of this segregation of responsibilities is that the cluster file system, using the distributed lock manager for synchronization, can coherently manage file operations on a disk, cluster-wide. However, if cluster-wide coordination of data within a file is required, this must be done by the application, which will typically be modified to use the distributed lock manager to achieve it.
As described earlier, the CFS is NTFS compatible: it provides full emulation of NTFS at the FSD (file system driver) level. The common capabilities between CFS and NTFS include support for Quotas, High-water Marking, OpLocks (Opportunistic Locking), and multiple Named Streams per file. Performance of the CFS is expected to be approximately equivalent to NTFS, with some operations being somewhat faster and some slightly slower (performance figures are not yet available). To ensure optimal integration with Windows NT, the CFS uses the standard NT data cache for file data and propagates byte-range locks across the cluster; optimal performance of file system operations is achieved by means of a separate cluster-coherent write-behind metadata cache.
The CFS is designed to be extremely robust: it will tolerate metadata block corruption by maintaining multiple copies of metadata (metadata is the data required to maintain the on-disk structures, such as directories). High-speed recovery from cluster node failure is ensured by means of per-node recovery logs. Additionally, the CFS will permit online volume growth, so if a disk becomes full and the underlying storage is capable of growth (such as with volumes created with logical volume management products) the CFS can grow the volume. Unlike NTFS, this online volume growth can be performed while there are open files on the volume, so it can be done without the need to shut down applications or disturb clients.
With respect to volume size, the CFS is designed to support the largest foreseeable volumes, with designed-in scaling to multiple terabytes. Leadership technologies (such as secondary-level bitmaps) ensure that, as volume sizes grow, performance will not degrade.
The figure shows how the Cluster File System is integrated into the Windows NT kernel, as a peer of the other NT file systems. Note how the CFS uses the same system service interface, and the same storage driver interface as the other file systems. This ensures the highest level of application and I/O subsystem compatibility.
Initially the CFS will be supplied with special-purpose administration utilities for formatting, control (such as volume validation), analysis and repair. It is hoped that these capabilities will be integrated into the base Windows NT operating system over time. Since the CFS is fully compliant with NTFS it will be able to use industry-standard storage management products, such as backup utilities and performance monitors, that are provided by several vendors. However, the on-disk structure of the CFS is not the same as that of NTFS, so applications that assume knowledge of the on-disk structure must be modified before they can be used with the CFS. Such applications are not common, but include defragmenters.
The CFS has a carefully designed on-disk structure that is readily extensible. This will allow Compaq to enhance the CFS to track any changes that Microsoft may make to NTFS, and also to provide Compaq-specific enhancements (including, for example, additional features requested by customers).
Compaq's Distributed Lock Manager (DLM) for Windows NT is based closely on the DLMs that Digital previously implemented on its Unix and OpenVMS operating systems (prior to acquisition by Compaq). The industry's original DLM, for OpenVMS, was first shipped in 1984, so Compaq has the most comprehensive experience of building and optimizing these complex subsystems. Consequently, the Windows NT implementation benefits considerably from the two prior implementations: it is functionally comprehensive, mature, and algorithmically stable.
The Windows NT operating system provides a rich selection of synchronization primitives for application developers. These include critical sections, mutexes, semaphores, events, and the interlocked operations.
While these primitives provide a wide range of synchronization capabilities, they are restricted in scope: they only perform synchronization within a single system (either multiple threads within a process, or multiple processes within a system). As a result it is impossible to write an application that synchronizes activities across the nodes in a cluster.
The DLM provides the only programming primitive that enables cluster-wide synchronization of application access to resources. This capability is critical to the creation of any application that must scale across multiple servers within a cluster: an application running on one node of a cluster can coordinate access to any resource with an application running on another node in the same cluster.
Additionally, the DLM is available for non-clustered systems. In this case, since there is only a single system, the lock manager operates in a non-distributed manner: it becomes just another synchronization primitive for applications to use within a single system. There are two reasons that an application developer might choose to use the lock manager as a primary synchronization primitive within an application:
The DLM operates with two basic programming structures: resources and locks. A resource is defined by an application program (typically using a simple text string) and is used to represent the object that is to be locked. Once a resource has been defined by an application, locks can be acquired on it. The application acquiring the lock specifies the level of access required, for example exclusive access or shared-read access. Multiple locks can be queued against a resource; compatible locks (for example, multiple shared-read locks) will be granted simultaneously, while non-compatible locks wait for currently granted locks to be released. Note that the DLM imposes no explicit control over locked objects, but merely provides an infrastructure with which cooperating applications synchronize access, using resource names to represent the actual locked objects. This separation of resource names from the underlying locked objects makes the DLM extremely flexible: it can be used to synchronize access to anything. A good rule of thumb is that if something can be named, it can be locked by the DLM. This includes files, items within a file down to the bit level, in-memory cache buffers, any I/O device, whole applications, individual application routines, and so on.
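To make the resource-and-lock model concrete, the following fragment is a purely illustrative sketch. The function names dlm_enqueue and dlm_dequeue, and their signatures, are hypothetical stand-ins (loosely modeled on the style of the OpenVMS lock services); they are not the actual programming interface of the Windows NT DLM described here.

/* Hypothetical lock-mode values matching the six DLM access modes. */
typedef enum { LKM_NL, LKM_CR, LKM_CW, LKM_PR, LKM_PW, LKM_EX } lkmode_t;
typedef unsigned long lkid_t;

/* Hypothetical services: queue a lock request on a named resource
 * (waiting until it can be granted) and release a granted lock. */
extern int dlm_enqueue(const char *resource_name, lkmode_t mode, lkid_t *lock_id);
extern int dlm_dequeue(lkid_t lock_id);

void update_customer_record(void)
{
    lkid_t lock;

    /* The resource name is simply an agreed-upon string; here it stands
     * for a single record in a shared file. The DLM never touches the
     * record itself; it only arbitrates access to the name. */
    dlm_enqueue("PAYROLL.DAT:REC:00042", LKM_EX, &lock);

    /* ... read, modify, and rewrite the record under exclusive access ... */

    dlm_dequeue(lock);
}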
A full description of the DLM is not practical in this paper; however, the following list provides a brief overview of the comprehensive range of features provided:
The following table shows the compatibility matrix of the various access modes: Null (NL), Concurrent Read (CR), Concurrent Write (CW), Protected Read (PR), Protected Write (PW), and Exclusive (EX). Using the appropriate access mode for a lock on a resource allows an application to define its willingness to share access to the resource; for example, Exclusive mode permits only one accessor to the resource, while Protected Read permits multiple readers, but no writers.
                        Mode of Requested Lock
Mode of Existing Lock | NL  | CR  | CW  | PR  | PW  | EX
----------------------|-----|-----|-----|-----|-----|-----
NL                    | Yes | Yes | Yes | Yes | Yes | Yes
CR                    | Yes | Yes | Yes | Yes | Yes | No
CW                    | Yes | Yes | Yes | No  | No  | No
PR                    | Yes | Yes | No  | Yes | No  | No
PW                    | Yes | Yes | No  | No  | No  | No
EX                    | Yes | No  | No  | No  | No  | No
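For illustration only, the compatibility matrix above can be encoded directly as a small lookup table: a lock request is grantable only if its mode is compatible with every lock already granted on the resource. The code below is a minimal sketch, not part of the DLM programming interface.

/* The DLM lock compatibility matrix, encoded as a lookup table.
 * compatible[existing][requested] is 1 when the requested mode can be
 * granted alongside a lock already held in the existing mode. */
typedef enum { MODE_NL, MODE_CR, MODE_CW, MODE_PR, MODE_PW, MODE_EX } dlm_mode_t;

static const int compatible[6][6] = {
    /* requested:   NL CR CW PR PW EX */
    /* NL */      {  1, 1, 1, 1, 1, 1 },
    /* CR */      {  1, 1, 1, 1, 1, 0 },
    /* CW */      {  1, 1, 1, 0, 0, 0 },
    /* PR */      {  1, 1, 0, 1, 0, 0 },
    /* PW */      {  1, 1, 0, 0, 0, 0 },
    /* EX */      {  1, 0, 0, 0, 0, 0 },
};

int mode_is_compatible(dlm_mode_t existing, dlm_mode_t requested)
{
    return compatible[existing][requested];
}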
A full description of DLM capabilities is provided in the Programmer's Manual available on the Internet at http://www.windowsnt.digital.com.
Messaging Performance of the Distributed Lock Manager
Referring once again to Microsofts question and answer section earlier in this paper, the following answer is provided to one of the questions:
Microsoft recommends a shared-nothing architecture for cluster-aware applications because of its greater scalability potential. With shared-disk applications, copies of the application running on two or more servers in the cluster share concurrent read/write access to a single set of disk files, mediating ownership of the files using a "distributed lock manager" (DLM). A shared-nothing application, on the other hand, avoids the potential bottleneck of shared resources and a DLM by partitioning or replicating the data so that each server in the cluster works primarily with its own data and disk resources.
The answer bases the potential for greater scalability on the requirement to partition or replicate data on multiple servers. From a performance viewpoint this leads to the same scaling that would be achieved with multiple non-clustered systems. As described in detail earlier in this paper, the practical implications of database partitioning or replication render the technique unsuitable for large, mission-critical application environments. Compaq's extensive experience with lock managers clearly shows that concerns regarding the potential for a DLM to become a bottleneck are almost without exception overstated. Typically, competitive pressures between vendors that do or do not have DLMs provide plenty of scope for exaggeration!
An understanding of the messaging techniques used by the DLM ensures that applications can achieve excellent performance. The following diagrams show the messaging handshakes used by DLM operations. For the majority of locking operations (those where an application locks a resource in which no other node has an interest) no node-to-node messages are necessary. In these cases the performance of the lock manager is essentially the same as any other Windows NT synchronization primitive. In cases where node-to-node messaging is required, the overwhelming majority of operations require a simple command-response protocol of two messages. Note that in cases where a lock request cannot be granted synchronously, and so is stalled by the DLM, an additional "granted" message is required when the resource is released by a prior holder of the lock. Under these circumstances the time to acquire the lock is dominated by the event that causes the stall, so the messaging performance of the DLM becomes irrelevant.
The first diagram shows how an initial lock is acquired. A single message is sent to a Directory node, followed by a response to the lock requester. The response notifies the requesting node that no other node has an interest in the resource, so that it can become the Master of the resource. Determining the Directory node for a given resource is done by hashing the resource name; in this way the resource directory database is distributed across the nodes in the cluster. Note that for a two-node cluster there is a 50% chance that the requesting node will also be the directory node, eliminating the need for any messages.
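The directory lookup described above depends only on every node hashing the resource name in the same way. The fragment below is a minimal sketch of the idea; the particular hash function is an assumption for illustration, not the algorithm used by the product.

/* Select the directory node for a resource by hashing its name.
 * Every node computes the same answer for the same name, so the
 * resource directory is spread evenly across the cluster without
 * any central coordinator. */
unsigned int directory_node(const char *resource_name, unsigned int num_nodes)
{
    unsigned int hash = 0;
    while (*resource_name)
        hash = hash * 31u + (unsigned char)*resource_name++;
    return hash % num_nodes;   /* node index in the range 0 .. num_nodes-1 */
}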
The second figure shows the most common of all locking operations: creation of additional locks and sublocks on a resource tree by the application that initially defined the resource. In this case the node already knows that it is the master of the resource, so no messages to the directory node are required. The same process is used for conversion requests on existing locks.
The third figure shows the most message-intensive operation performed by the DLM. In this figure a third node acquires its first lock on a resource already mastered on another node. In this case a message pair is sent to the directory node in order to identify which node is mastering the resource, and then another message pair is sent to the resource master node.
The fourth figure shows the messages required when a node acquires new sublocks on a resource that it is not mastering. In this case the requesting node already knows which node is the resource master, so a message pair to the directory node is avoided. The same process is used for lock conversions.
The last figure shows the messages required to release (or dequeue) a lock. These events require a single message to the Resource Master, and, when necessary, another to the Resource Directory.
These drawings provide a good overview of the message traffic imposed by a distributed lock manager. Note that, for the Windows NT DLM, lock messages are 64 bytes or smaller. Also, it can readily be seen that the algorithms used do not result in the number of messages increasing with the number of nodes in the cluster. In fact, message rates remain constant, ensuring that as cluster sizes grow the performance of the DLM will remain consistent. For a thorough treatment of this topic, readers are referred to "The VAX/VMS Distributed Lock Manager" by William E. Snaman and David W. Thiel, published in the Digital Technical Journal, Number 5, September 1987, ISBN 1-55558-004-1.
Compaq's Windows NT Enterprise Software Group is actively developing additional Windows NT capabilities that will take advantage of the shared-disk clustering paradigm. These capabilities are currently focused in two areas:
Providing suitable disaster tolerant capabilities for any computing environment is highly dependent on specific customer requirements and application capabilities. Disaster tolerance (the ability of a computing environment to survive disasters such as flood, earthquake, and fire) can be achieved in many ways:
Compaq's Windows NT Enterprise Software Group is currently investigating the feasibility of implementing the last of these disaster tolerance technologies (other groups are working on the other technologies; refer to the appropriate white papers for additional information). The implementation would include a fully distributed RAID-1 (mirroring) capability that would support RAID-1 disk members located on any server in an MSCS cluster. By placing servers in multiple geographic locations, disaster tolerant configurations can be created. By deploying the distributed RAID-1 capability in conjunction with the Cluster File System, users and applications could use any server to access their data, regardless of geographic location.
As mentioned earlier in this paper, providing system administrators with cluster-aware and easy-to-use tools and utilities greatly simplifies the management of a cluster, and reduces costs. To ensure that the capabilities described in this paper are easily manageable Compaq will provide an accompanying cluster administration utility. The administration utility includes the following capabilities:
Hopefully, this paper has provided the reader with an appreciation of the differences between, and intricacies of, two important clustering implementation styles. By now it will be clear that this is a complex topic, and that there are many variations between clustering products. When looking for a clustering solution the prospective purchaser needs an understanding of the capabilities of the available products, knowledge of what their actual requirements are (availability, application scaling, cost-effective management, and so on), and the ability to see through the marketing hype that surrounds too many clustering products.
On the one hand, shared-nothing clusters are simple to implement and provide excellent availability characteristics. Importantly, end-users can benefit from their capabilities immediately: no application modifications are necessary. It is clear that shared-nothing clusters, such as Microsoft Cluster Server, will dominate the market for small, high availability solutions. However, as Windows NT cluster configurations grow, and are deployed into the high-end commercial marketplace, the inherently simplistic nature of shared-nothing clusters will start to become more of a hindrance than a benefit. This will be especially apparent in environments with large databases or file populations, where data partitioning and replication are impractical.
On the other hand, shared-disk clusters are technically complex for a software vendor to implement, and, in some cases, require application modifications before their full benefits can be realized. However, their inherent elegance, unbeatable availability, seamless application scaling, and simple management make them the most suitable clustering style for all large and complex computing configurations.
Compaq's cluster enhancements to Microsoft Cluster Server offer the best of both worlds. MSCS's shared-nothing capabilities can be used at the same time as Compaq's shared-disk capabilities, with system administrators choosing the most appropriate style on a per-disk basis. This ability to mix the two capabilities gives customers ultimate flexibility. When applications can be deployed on shared-disk storage it is probable that most customers will be quick to take advantage of the feature. Meanwhile, applications that are restricted to shared-nothing storage can still be deployed, using the standard MSCS environment.