HP Consulting
In early data warehousing implementations for many companies, high availability was not a critical requirement. Since data warehouses support strategic and tactical, rather than operational, decision-making, the unavailability of the data warehouse, even for a day or two, was not business-threatening for most companies.
However, as data warehousing becomes more entrenched, data warehouses are becoming business critical for many companies. For example, in the financial services sector, data warehouses are being used in the risk management process. The data warehouse receives feeds from various operational systems and consolidates the information to give an overall view of the current risk factors. Updates are usually daily, and it is critical that the information be pulled into the warehouse and made available in a timely way.
As the business requirement for warehouse availability grows more stringent, so does the complexity of data warehousing architectures. A typical data warehouse query may now pass through several layers of processing. With the growing prevalence of server-based query tools and web interfaces to the warehouse, extra layers are added with a corresponding increase in the number of single points of failure which can cause a query to abort.
Therefore the warehouse architecture must include high availability features appropriate to the business requirement for availability. This paper describes the various components of the warehouse architecture that can impact availability, and outlines techniques for ensuring the needed level of availability.
We will define availability as the capability of the data warehouse to provide service to the end user during the agreed hours of operation. The availability measure is the number of hours that the service was actually available divided by the number of hours that the service should have been operational.
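As a simple illustration with hypothetical figures: if the warehouse is scheduled to be available for 50 hours in a given week and an unplanned outage removes 2 of those hours, availability for that week is 48 / 50 = 96 percent.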
The availability percentage that is agreed upon usually will not take into account unpredictable disasters such as fires, floods and earthquakes. We will consider disaster recovery contingency planning to be outside the scope of this paper.
Measures of availability are often quoted for hardware only, but increasingly technology users are demanding that end-to-end availability be the measure that is used. In the data warehouse environment, this means that all of the links from the user workstation to the data source must be available. In this paper, we will consider end-to-end availability.
Determining the Business Requirement For Availability
In general, the higher the level of availability to be targeted, the higher the cost. Therefore in determining an appropriate availability target, the cost of providing higher availability usually must be weighed against the cost of unscheduled downtime.
Questions like the following can be used to provide a preliminary basis for designing the availability architecture:
The answers to these questions will serve as guidelines for constructing a high availability design.
As the design progresses, understanding will emerge as to the cost factors involved in achieving the desired level of availability. It is important that the end users be fully informed of the tradeoffs so that they can ensure that the expense of creating a highly available system is cost-justified.
A good way to ensure that the tradeoffs are understood is to present at least two alternative designs, with their associated costs.
HA Challenges Particular to Data Warehouses
There are several aspects of the typical data warehouse environment that pose particular challenges for managing availability. Each of these is discussed below. Throughout the paper, these issues will be referred to as we discuss the various technologies and processes that can be applied to provide high availability.
The most common method used for updating data warehouses is to periodically extract data from one or more source operational systems, transform the data as needed, load it into the data warehouse, and build the indices and summaries that support efficient querying. It may be necessary to back the data up after loading as well.
In some cases the time to complete the load process exceeds the window of time during which the data warehouse can be offline. The architecture of the warehouse must allow for updating to occur without affecting the availability of the warehouse to support user querying.
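A minimal sketch of this basic load cycle is shown below, assuming the simplest case in which the warehouse is taken offline for the duration of the load (alternatives are discussed later in this paper). The step functions are hypothetical placeholders for site-specific tooling; the point is the ordering of the steps and the fact that query access is reopened only after the load has completed.

    # Sketch of a periodic warehouse load cycle. The step functions are supplied
    # by site-specific tooling; only the ordering of the steps and the gating of
    # query access are illustrated here.
    def run_load_cycle(close_queries, extract, transform, bulk_load,
                       build_indices, build_summaries, backup, open_queries):
        close_queries()         # begin the scheduled load window
        data = extract()        # pull feeds from the operational source systems
        data = transform(data)  # cleanse, integrate and derive fields as required
        bulk_load(data)         # populate the detailed warehouse tables
        build_indices()         # build the indices that support efficient querying
        build_summaries()       # refresh the summary tables
        backup()                # protect the newly loaded data
        open_queries()          # reopen to users only after the load has completed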
It is not uncommon for the mature data warehouse to be accessed in multiple ways. For example, some power users may construct their own ad hoc queries using a point-and-click query tool. Other analysts may submit large pre-built queries which populate a multidimensional database, while other users may submit canned queries through a web browser. Each of these access methods may follow a different path, involving different servers, network components and application software.
Having defined availability as end-to-end, the data warehouse architect must consider each of these access paths to ensure adequate resilience, redundancy and recoverability.
The Variability of Query Load on the Warehouse
Another area where the data warehouse poses a challenge is that it is often very difficult to predict the load on the warehouse from one day to the next. Unlike the operational system, where transaction volume can be predicted quite accurately based on business volumes, the warehouse can experience wide fluctuations in demand. The potential always exists for a few "killer queries" to be submitted that can affect the response time experienced by all users.
Since an acceptable response time is implied by the term "availability", this possibility must be considered. As we will discuss in this paper, techniques are being developed to safeguard against degradation in response time due to demand fluctuations.
Understanding the Availability Requirements
Before an architecture can be developed to support the availability requirements, it is important to have an understanding of the following data warehouse characteristics:
As mentioned above, a single data warehouse may be accessed in several different ways for different purposes and possibly by different types of users. Let’s look at an example where the data warehouse is accessed in the following four ways:
Each type of access uses different technology components, and will have a different set of SPOFs (single points of failure). In example 4 above, the access method requires network connections, the web server, the application server, the data server, the disk drives, the DBMS, the web application and the query application.
Each type of access may also have different business requirements for level of availability. These requirements must be determined by discussion with the end users of the warehouse prior to beginning the design of the high availability architecture.
It is important to understand how the data warehouse will be loaded in order to ensure that loads can be carried out in a timely way, and that the load process will not interfere with the availability of the warehouse for querying.
Loads usually occur periodically rather than continually. Frequency is often monthly, but in some cases it is weekly, daily, or even several times a day. Major components of the load process include populating the detailed tables, creating indices and creating summaries.
Another function that must be performed prior to the load is data preparation. This includes such functions as:
In the large data warehouse environment, the data preparation activities are usually performed on a separate server. This "staging machine" is often larger than the actual data warehouse data server. It has its own availability needs that have to be considered in the overall design.
Loading During Scheduled Downtime
Typically, data warehouses have been loaded during scheduled downtime. For example, a warehouse which is updated monthly might be loaded overnight or on the weekend. This is the simplest scenario, since access to the warehouse is all-or-nothing, and there is no need for data redundancy or for limiting access to data which is not yet fully loaded.
This scenario assumes as well that the scheduled downtime provides enough time to complete the Quality Assurance of the data. If that is not the case, the alternative methods described below must be considered. It should also be noted that a risk associated with this method is that if there are problems with the load, there may not be sufficient time to resolve the problems and complete the load. In this case, either the warehouse would have to be kept offline until the load completes, or the loading of the new data would have to wait until the next available batch window.
It is not uncommon for the load process duration to exceed the window during which the data warehouse is down. If this is the case, the design must allow for the loads to occur while the data warehouse is up.
Loading While the Warehouse is Up
Most database management systems will support adding records while the database is available to users. However, there are two basic reasons why it is not usually advisable to add records to a data warehouse table that is available for querying. The first is that the load functions for adding records are less efficient than the load utilities that can be used when the table is offline. This difference can be very significant when relatively large amounts of data are being added at one time, which is usually the case with data warehouse loads.
The second reason for not adding records to a table which is available for query is that very often the reports being generated cover a specific period. The user will assume that the complete set of data for that period is available. For example, if a warehouse is loaded monthly, the user will assume that if any sales data is available for a specific month, all of the sales data for that month is available; if the load is still in progress, this assumption will lead to inaccurate reporting.
Often the load process consists of appending the new data for a period to an existing table, for example adding the last month’s sales to the sales table. In a warehouse modeled as a classic star schema, this would be the method used to update the fact tables. If this is the case, the use of partitioned tables can allow the warehouse to be loaded while it is available for querying.
As an example, if the sales fact table is partitioned, the previous month’s sales would be loaded into a new partition in the database, thereby not impacting users who are querying the sales table. Once the monthly loads are complete, the new partition is made available to users of the warehouse. This method works best if the DBMS or warehouse middleware supports linking of partitioned tables into what the user sees as a single table.
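A sketch of this technique is given below, issuing illustrative SQL from Python through a generic DB-API connection. The table names are hypothetical and the DDL shown is loosely PostgreSQL-style; the exact syntax is DBMS-specific, and some products achieve the same effect with partition-exchange or table-linking mechanisms rather than ATTACH PARTITION.

    # Sketch: load a new month of sales into a staging table that users cannot see,
    # then attach it to the partitioned fact table once the load has been validated.
    # SQL syntax is illustrative and varies by DBMS.
    def load_monthly_sales_partition(conn):
        cur = conn.cursor()
        # 1. Create and bulk load a staging table, invisible to warehouse users.
        cur.execute("CREATE TABLE sales_stage (LIKE sales_fact INCLUDING ALL)")
        cur.execute("COPY sales_stage FROM '/staging/sales_1999_06.dat'")
        # 2. Validate the newly loaded data before exposing it.
        cur.execute("SELECT COUNT(*) FROM sales_stage")
        if cur.fetchone()[0] == 0:
            raise RuntimeError("load produced no rows; partition not attached")
        # 3. Make the new data visible as the partition for the new month.
        cur.execute("ALTER TABLE sales_fact ATTACH PARTITION sales_stage "
                    "FOR VALUES FROM ('1999-06-01') TO ('1999-07-01')")
        conn.commit()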
In summary, it is possible to load a database that is available for querying; however, there must be mechanisms in place to ensure that data is not accessed by users until the load is complete and the data is validated.
Software mirroring is the maintenance of a separate copy of the database, which is controlled by specialized software on the data server or the disk mechanism. There are a number of benefits to software mirroring, including
If one or more of the following factors applies for a data warehouse, the need for data mirroring to support the load process is often indicated.
To use disk mirroring to enable data loading, a sequence along the following lines would take place:
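One plausible version of that sequence is sketched below. This is an illustration only: the step functions are hypothetical, and the exact commands and the direction of resynchronization depend on the mirroring software in use.

    # One plausible split-mirror load sequence (a sketch; step functions are
    # hypothetical and the details depend on the mirroring software in use).
    def load_via_split_mirror(split_mirror, load_copy, validate,
                              switch_users_to, resync_mirror):
        split_mirror()                  # briefly quiesce the database and split the mirror
        # Users continue to query one (static, read-only) copy of the data.
        load_copy("offline_copy")       # bulk load the new data into the other copy
        validate("offline_copy")        # quality-assure the newly loaded data
        switch_users_to("offline_copy") # briefly quiesce and redirect queries to the loaded copy
        resync_mirror()                 # re-establish the mirror from the loaded copy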
This technique assumes that the warehouse database is read-only except for loading. It requires specialized software such as HP MirrorDisk/UX on the database server, or functionality within the disk mechanisms such as is available with EMC disks.
The section on Disk Technology in this paper will discuss more fully the options available as well as the other benefits of mirroring.
Data warehouse databases can be very large, in fact they are often much larger than the operational databases because they contain history. This in itself poses challenges for backup and recovery. In defining a strategy for backup, it is important to take into account the unique characteristics of data warehouse databases.
One very important factor influencing the backup strategy development is that data warehouses are usually read-only databases. Updates are planned in advance and performed only during the pre-determined load periods. Therefore it may not be necessary to do frequent complete backups of the data warehouse databases.
Another factor to consider is that an alternative to restoring data from backup tapes is often available in the data warehouse environment. If the flat files which were used to load the warehouse initially are kept available (either online or archived), the option exists of reloading the data warehouse if needed. This method could be used either as a long term backup strategy, or as an interim approach which will provide for data protection until there is an opportunity to take a backup of the newly-loaded data.
One common scenario is where the data warehouse is modeled as a star schema, and new partitioned fact tables are created each period. The new fact table partitions need to be backed up after they are loaded, but it is not necessary to back up fact tables from previous periods if they are not reloaded. Dimension tables are commonly reloaded each period, and so they need to be backed up again. Indices and summaries which are related to the new data should also be backed up, as it will generally be quicker to restore them from backups than to rebuild them.
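The resulting selection logic is small enough to sketch; the table and object names below are hypothetical and serve only to illustrate which objects are included in the backup set each period.

    # Sketch: the backup set after a periodic load of a star-schema warehouse.
    def backup_set_after_load(new_fact_partitions, dimension_tables,
                              new_indices, new_summaries):
        # Fact partitions from previous periods are unchanged and are not re-backed up.
        return (list(new_fact_partitions)   # newly loaded fact partitions
                + list(dimension_tables)    # dimensions are reloaded each period
                + list(new_indices)         # quicker to restore than to rebuild
                + list(new_summaries))

    # Example with hypothetical object names:
    print(backup_set_after_load(["sales_fact_1999_06"],
                                ["customer_dim", "product_dim"],
                                ["sales_fact_1999_06_idx"],
                                ["sales_summary_1999_06"]))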
Some data warehouses are updated as a trickle process, where updates are collected as they occur in the operational system and passed to the warehouse. (This method is more frequently used for Operational Data Stores.) In this case, backup techniques will be similar to those used for operational systems, where it must be possible to restore the database to a particular point in time and then apply the updates that occurred since the last backup.
Once it has been determined what needs to be backed up and when, the options for performing the backups are similar to the options for loading the database. Both are processes that must be carried out without impacting the uptime of the warehouse.
Backing Up During Scheduled Downtime
A commonly used method, where the warehouse has periods of regularly scheduled downtime, is to perform backups during this downtime. This downtime is most commonly overnight or on weekends. For warehouses that are loaded in periodic batches, the backups should follow immediately after the load processes.
To determine whether sufficient offline time will be available to perform backups, the following process should be followed:
The above process will determine whether a sufficient batch window exists to carry out both the load and backup processes. Consideration should also be given to the possibility of problems with either the load or backup processes which could result in the backups not completing on time. It may be an acceptable risk to defer the backup to the next available window of downtime, if the option of reloading the data in the event of disk failure provides an adequate substitute in the short term for a backup copy of the data.
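The batch-window check itself is simple arithmetic, sketched below with hypothetical durations and a contingency allowance for reruns of failed steps.

    # Sketch: does the scheduled downtime window accommodate both load and backup?
    def window_is_sufficient(load_hours, backup_hours, window_hours, contingency=0.25):
        # Allow some contingency for reruns of failed load or backup steps.
        return (load_hours + backup_hours) * (1 + contingency) <= window_hours

    # Example with hypothetical figures: a 5-hour load and a 3-hour backup
    # fit within a 12-hour overnight window, even with 25% contingency.
    print(window_is_sufficient(load_hours=5, backup_hours=3, window_hours=12))  # True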
Backing Up While The Warehouse is Online
Some database management systems offer the option of performing online backups, either full or incremental. This can be considered if the scheduled downtime does not provide a long enough window to complete backups.
The disadvantage of online backups is that they use system resources, such as CPU and I/O, on the data server; it is not possible to run the backup processes on a separate server.
Software mirroring provides the option of backing up the warehouse data from the mirror copy, without impacting the production warehouse. The database must be taken offline for a short period while the mirror is split, and again for a short period while the mirror is restored.
The splitting of the mirror allows the data copy to be accessed by a separate backup server, if that is required to remove contention on the database server.
Efforts to achieve high availability have often focussed primarily on improvements to servers, disk mechanisms and network components to increase their uptime, allow redundancy, or to allow them to be repaired with minimum disruption to service. These advances are critical to achieving highly available systems, and this section will focus on these technology aspects of the solutions.
As discussed above in the section on access methods, end-to-end access to the data warehouse can require the availability of several servers, possibly including a mix of UNIX and NT servers in the open systems environment. The potential causes of failure will be similar for Unix and NT servers, but the HA solutions may be somewhat different.
For both UNIX and NT servers the possible failure points are similar. A sampling of the ways that servers can fail includes the following:
Both Unix and NT servers are becoming increasingly reliable. However the availability requirement may exceed that which can be achieved with a single server.
Going beyond single server availability and considering end-to-end availability, the challenge is even greater. Consider an access path that requires availability of four servers. If each of the servers has an availability rate of 99%, the overall server availability to be expected would be
0.99 × 0.99 × 0.99 × 0.99 ≈ 0.96
99% availability would result in 26 hours of unscheduled downtime a year if the scheduled uptime was 5 days a week, 10 hours a day (2,600 hours a year). A combined availability of 96% means roughly 100 hours of unscheduled downtime.
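The arithmetic generalizes to any number of components in the access path, as the short sketch below shows. The 99% figure and the 2,600-hour schedule are simply the illustrative values used above.

    # End-to-end availability of components in series, and the resulting downtime.
    def end_to_end_availability(component_availabilities):
        result = 1.0
        for a in component_availabilities:
            result *= a
        return result

    scheduled_hours = 5 * 10 * 52                    # 5 days/week, 10 hours/day
    avail = end_to_end_availability([0.99] * 4)      # four servers at 99% each
    print(round(avail, 3))                           # ~0.961
    print(round((1 - avail) * scheduled_hours))      # ~102 hours of unscheduled downtime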
Thus it is often necessary to take steps to provide for higher levels of server availability. The usual solution where higher availability is needed is to implement high availability server clusters. In an HA cluster, if a server fails, its applications will immediately cut over to be run on another server, with little or no interruption in service. The server which takes over the processing of the application could have been an idle server, or it could be running other applications, as long as it has the available capacity to provide an adequate response time for the applications it covers.
Thus clustering allows time for the repair process to take place on the failed server without impacting availability of the applications it supports.
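Conceptually, the failover decision can be sketched as below. This is not a representation of any particular clustering product; the object and function names are hypothetical, and real cluster software handles many additional concerns such as shared storage, quorum and client reconnection.

    import time

    # Conceptual sketch of cluster failover monitoring: if the primary server stops
    # responding to heartbeats, its application packages are started on a standby
    # server that has spare capacity, and repairs proceed on the failed server.
    def monitor_cluster(primary, standbys, heartbeat, start_packages, interval=5):
        while True:
            if not heartbeat(primary):                      # primary has failed
                target = next(s for s in standbys if s.has_capacity())
                start_packages(target, primary.packages)    # fail the applications over
                return target
            time.sleep(interval)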
In order to implement clustering, the first step is to determine which servers must be protected. This will depend on the business criticality for each possible access path to the warehouse. For example, if users who access the warehouse through web browsers require high availability, it may be necessary to protect the web server through clustering. Clustering methods and tools for Unix and NT servers are discussed in the following sections.
Hewlett Packard UNIX servers can be combined into clusters with the assistance of specialized software, which controls the switchover of applications following server failure, and the reconnection to the original server after the repairs are completed.
HP’s MC ServiceGuard is one software package that facilitates clustering. It provides continual monitoring on each server in the cluster, and will automatically move an application to an alternate server if a required resource is lost on the primary server. There are many options which are pre-configured at installation time.
Another HP software product that facilitates clustering is MC LockManager. This software works specifically with Oracle Parallel Server (OPS) to allow applications to run on multiple servers, accessing a common database. It provides the same features as ServiceGuard, while also ensuring that data changes are not overwritten.
Frequently the data warehouse infrastructure will contain a mixture of Unix and NT servers. If it is determined that NT servers need to be protected with clustering, software is available to facilitate this. The considerations are similar to those for Unix server clustering. Functionality of the available clustering software is somewhat more limited at this time.
Clustering is an excellent short-term solution for providing continuity in the event of a server outage. However it is still important to be able to quickly restore the server to minimize the likelihood of an additional outage which could jeopardize service.
The following are the main techniques usually used to prevent outages, or to enable quick restoration of service after an outage:
Much attention has been focussed on developing disk technologies to prevent both downtime and loss of data in the event of disk failure. Determining the best solution requires analyzing many factors. Availability objectives must usually be considered along with performance and cost, as tradeoffs may have to be made.
As discussed previously in the "Understanding the Requirements" section, the requirements for data loading and backup may influence the type of disk storage that is required. For example, a data warehouse that requires reliable 24 by 7 access, and which is heavily accessed, will probably require data mirroring to enable the load and backup processes to take place.
The following is a description of the main types of disk technology that might be utilized to support high availability for data.
RAID is an acronym for "Redundant Array of Inexpensive Disks." A RAID array consists of a collection of disk drives in a common enclosure. There are several levels of RAID, most of which provide data protection.
For example, in RAID 1, disks are mirrored. In RAID 5, data is striped block-wise, and the controller spreads parity information across all disks so that lost data from any disk can be recovered. AutoRAID combines RAID 1 and RAID 5 in the same array, and automatically migrates data between RAID modes according to access patterns.
All of the above RAID levels will provide data protection in the event of a disk drive failure. None of the RAID levels will allow mirrors to be broken for loading or backup operations.
JBOD refers to "Just a Bunch of Disks", or simple, non-array type disks. Mirroring can be done on these disks through software. In HP-UX, software mirroring is accomplished using Logical Volume Manager and the separate MirrorDisk/UX sub-system.
Use of JBOD without mirroring means that in the event of disk failure, the data warehouse will be down until a new disk drive is obtained and installed and the data recovered from backups. JBOD with mirroring provides for high availability: in the event of a disk drive failure, the other copy of the mirror can be accessed until the new disk drive is installed, and the data can then be restored by the mirroring software without having to be reloaded from backups.
JBOD with mirroring provides the option of splitting the mirrors for backups. It can lead to enhanced performance in accessing the database as well, because queries can be split between the two copies of the database. A factor to keep in mind is that mirroring will always be at a higher cost, as the number of disk drives required is doubled.
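The cost difference can be seen from a simple disk-count comparison, sketched here with illustrative figures; drive sizes and array group widths are assumptions and vary by product.

    import math

    # Sketch: raw drives needed for a given usable capacity under mirroring (RAID 1)
    # versus single-parity striping (RAID 5 with one parity drive per group).
    def drives_needed(usable_gb, drive_gb, raid_level, group_size=6):
        data_drives = math.ceil(usable_gb / drive_gb)
        if raid_level == 1:
            return data_drives * 2                       # every drive is duplicated
        if raid_level == 5:
            groups = math.ceil(data_drives / (group_size - 1))
            return data_drives + groups                  # one parity drive per group
        raise ValueError("unsupported RAID level")

    # Example: 500 GB usable on 18 GB drives.
    print(drives_needed(500, 18, raid_level=1))   # 56 drives (28 data + 28 mirror)
    print(drives_needed(500, 18, raid_level=5))   # 34 drives (28 data + 6 parity)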
Where data volumes are large (e.g. greater than 500 GB), disk arrays such as the EMC Symmetrix may be the most practical solution. These disk arrays have built-in functionality to support data mirroring in a number of ways, providing a great deal of flexibility in architecting for data availability and performance.
Like JBODs, EMC offers either RAID 1 (mirroring) or RAID-S (a parity data protection scheme proprietary to EMC Corporation and similar to RAID 5). The mirroring option offers the possibility of splitting the mirror to support loading or backup without impacting the availability of the data to the production system. A second copy can even be maintained to avoid any exposure during the load operation.
As described below, EMC also offers two other mirroring products, which can be utilized to address specific challenges for large data warehouses.
SRDF (Symmetrix Remote Data Facility) is software offered by EMC which allows the maintenance of mirrored copies of data on physically separated Symmetrix systems. This solution is most often used to facilitate disaster recovery. If this solution is implemented for disaster recovery, it should also be considered within the high availability architecture; for example, it may be feasible to run backups off the mirror copy.
EMC TimeFinder is another product that may be a useful component of a high availability architecture. TimeFinder allows the creation of data volume copies, which are snapshots of the data volume at a point in time. These copies, called Business Continuance Volumes, or BCVs, can be accessed from a different server. So, for example, a BCV can be created and then accessed by a separate backup server to allow the creation of backups without impacting the data warehouse data server.
BCVs cannot be used for offline loading of data, because changes made to them cannot be moved back to the production data warehouse database. However, BCVs can be relevant to the sourcing of the data for the warehouse. If the data for the operational system which sources the data is on EMC disk, then the data extract could be taken from a BCV of the source data to avoid impacting the source system.
Other High Availability Considerations for Disk Technology
Regardless of the type of disk storage which is chosen, there are a number of other aspects that must be considered to ensure high availability. The following are some of the features of disk arrays that may be required to ensure high availability.
The network is a critical component of the data warehouse architecture. The key to providing network high availability is to eliminate single points of failure through redundancy of network components.
There are two main types of systems within the data warehouse that it is useful to consider as "application systems" for the purpose of high availability planning. These are:
High Availability Considerations for the Data Sourcing Application
The data transformation processes are very important to the high availability of the data warehouse. If the loads do not complete accurately and on time, the data warehouse may not be available for querying. If the data warehouse is brought up without the loads being completed, the data will not be as current as the end users require. For this reason, in the high availability data warehousing environment, the load process must be treated as an application, similar to the production operational systems, which is subject to rigorous IT control. The operational production environment can serve as a guide to the appropriate controls. These will include disciplines such as the following:
In addition, if the transform and load applications are to be protected from hardware failure through a server cluster, the application must be evaluated and then tested to ensure its ability to successfully fail over to another processor.
High Availability Considerations for the Data Access Applications
For a single data warehouse, there is a potential for a number of different applications to be invoked as users query the warehouse. Some of these may be custom applications that have been developed in-house. Others may be licensed software tools such as ad-hoc query tools, Online Analytical Processing (OLAP) tools, data mining tools or web browsers. A query middleware application may also be used.
The data access applications require some of the same disciplines mentioned above for the data sourcing application. Batch reporting streams should be automated with error detection mechanisms. The canned reports, and metadata such as data views, should be stored in protected directories with limited access. New versions should be tested before being put into production.
Reporting applications to be run on servers which are clustered must have the capability to fail over to the backup server, and this capability should be tested at regular intervals.
Service Management Processes For High Availability
In the previous section we discussed the server, disk, network and software components that can lay the foundation for a highly available data warehouse. But there is much more to high availability than hardware. Numerous studies have shown that a significant percentage of system outages are caused by process failure and human error.
In this section we will outline how the system architecture must be extended to include the processes that can help to ensure that availability goals are met. We will also mention specific types of software, such as system management tools, that can assist.
The following are some examples of incidents that cause downtime, and that could be prevented by effective service management processes:
Hewlett Packard has developed the Information Technology Service Management Reference Model to provide a framework for understanding the processes that are required to deliver quality IT services. Many of the processes described therein are pertinent to preventing incidents such as the examples mentioned above. This paper will provide a brief introduction to those concepts. For a more detailed description of the HP ITSM Reference Model, see the HP White Paper entitled "An HP IT Service Management Reference Model."
The ITSM Reference Model shows the interrelationships between disciplines such as the following:
Lack of attention to any of the above areas can threaten IT’s ability to deliver a highly available data warehouse.
The following diagram provides a high level overview of the ITSM Reference Model.
The ITSM Reference Model
In the model, the upper left-hand quadrant relates to the alignment of IT to the business. The lower left-hand quadrant shows the planning processes that must take place to ensure effective service management. The lower right quadrant refers to the implementation processes for services and applications. The upper right quadrant contains the ongoing operational processes involved in providing service management. This paper will briefly discuss some aspects of the latter three quadrants as they relate to data warehouse high availability.
Planning the Service Management Processes
The service management planning process addresses questions such as
Each of these questions is addressed by one of the process groups in the model. For example, the establishment of service level agreements falls within the Service Level Management group. Planning for the monitoring of load, access and backup processes would fall within the Availability and Continuity Management process group.
Another process group in the planning quadrant that is very important for high availability is the Capacity Management group. The processes in this group relate to monitoring of performance levels and planning to ensure adequate capacity to meet business demands.
The following sections discuss the above process groups in a little more detail.
A service level agreement is an agreement between the business users of a system and the group providing the service (usually Information Technology) as to the levels of service to be provided. For a data warehouse, the service level agreement will normally set standards for the hours of availability of the warehouse, and may also include standards for query response time.
It is important that service level agreements only cover service aspects which can be measured. The users’ expectations for availability and performance will, of course, be essential inputs to the design of the warehouse environment, but in the absence of measurement tools and processes, service level agreements will not serve a useful purpose.
Measuring Availability and Performance
Techniques are emerging to allow the measurement of end-to-end availability and performance of the data warehouse.
For availability, the goal is to automatically collect statistics on end-to-end availability of the warehouse. One technique that can be used is what Hewlett Packard refers to as ‘BOT technology. ‘BOT is short for robot, and refers to the concept of an agent running on a workstation, periodically submitting a very simple query, and maintaining data on the percentage of time that the system is able to complete the query.
The advantage of the above technique is that it measures availability across all server types (Unix and NT), the network, the application, the DBMS and any other component that is required to complete the query. This technique can also be used to capture performance data.
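A minimal sketch of such a probe agent is shown below. The query, probe interval and connection details are illustrative assumptions; a production agent would typically forward its measurements to a central repository rather than keep them in memory.

    import time

    # Sketch of a 'BOT-style availability probe: periodically run a trivial query
    # end to end and record whether it succeeded and how long it took.
    def run_probe(execute_probe_query, interval_seconds=300, max_samples=12):
        results = []                                   # (success, elapsed_seconds)
        for _ in range(max_samples):
            start = time.time()
            try:
                execute_probe_query()                  # a very simple canned query
                results.append((True, time.time() - start))
            except Exception:
                results.append((False, time.time() - start))
            time.sleep(interval_seconds)
        successes = sum(1 for ok, _ in results if ok)
        availability = successes / len(results)        # fraction of probes that completed
        return availability, results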
Another emerging technique is the use of Application Response Measurement (ARM) agents in decision support applications. The ARM API (Application Programming Interface) is an emerging industry standard for performance measurement. These agents can capture end-to-end response times for actual user queries and send them to a central performance data repository.
Availability and Continuity Management
The Availability and Continuity Management group encompasses a number of processes which are critical to high availability.
This includes planning how monitoring and proactive problem prevention will take place in the environment. There are a large number of software tools that can assist, such as HP’s OpenView products. However, the first stage in the planning process is to determine what conditions need to be flagged and who will be responsible for monitoring and taking corrective action.
For the data warehouse, some of the key aspects to be monitored on a real time basis are
Once it has been determined what monitoring will take place, an assessment can be made as to whether adequate system management tools are available, or new tools must be acquired. Detailed design will be required prior to implementing the service management process (a lower right quadrant function).
Availability and Continuity Management also encompasses the establishment and ongoing maintenance of support contracts with component suppliers. Adequate support levels are essential to ensuring that failed components are fixed or replaced in a timely way so as not to jeopardize service continuity.
As mentioned earlier in this paper, the load on a data warehouse is often very variable. This can make performance and capacity management a challenging task. However, it is critical to achieving a highly available data warehouse.
If a query which normally would take two minutes to run is taking ten to fifteen minutes, the warehouse users may well give up trying, resort to counter-productive measures such as resubmitting the same query, or at the least be delayed in carrying out their tasks. In a case like this, although the data warehouse could be said to be available because all of the components are operational, the service to the users is downgraded to the point that they may perceive lack of availability.
Before the data warehouse goes into production, benchmarking should take place to ensure that adequate performance is experienced for the anticipated query volume. When the warehouse has been in production for a period of time, load and performance can be reviewed periodically to determine whether there are predictable peaks. With an understanding of the nature of the peaks and when they fall, it is often possible to take steps to manage the load.
Particularly important for high availability of the data warehouse is to forecast the need for upgrades to server, network and disk capacity, and plan to ensure that these upgrades are completed in a timely way.
Managing the Operation of the Data Warehouse
We have discussed the planning processes that must take place to design an environment that can support high availability. Equally important are the day-to-day processes of monitoring operations and responding to problems and incidents.
The main functions which carry out the operation of the data warehouse are the Incident Management/Service Desk (Help Desk) and the Operations function. The Help Desk must ensure that the correct procedures are followed in the event of service interruptions to ensure that service is restored as quickly as possible.
The Operations function must monitor the system operations to ensure that system interruptions are detected quickly, and that preventive measures are taken when problems are indicated.