HPWorld 98 & ERP 98 Proceedings

Data Warehousing On HP3000 Using IMAGE/SQL – A New Alternative!

Jennie Hou

Hewlett-Packard
(408)447-5971

Cailean Sherman

Taurus Software
(650) 961-1323

Terry O’Brien

DISC
(303) 444-4000

INTRODUCTION

Many HP3000-based applications use IMAGE/SQL databases to store their operational data. Over the past decade, many of these applications have been modified or expanded to include additional components (e.g., client/server front ends, relational database modules, gateways, ODBC access, etc.) to meet increasing business needs. For many, however, the main data source is still stored in IMAGE/SQL databases. With advances in information technology and an increasingly competitive business environment, it has become essential for many companies to turn their operational data rapidly into useful information so they can make better, more accurate decisions. This is commonly achieved using data warehousing technology. For those customers that use IMAGE/SQL predominantly in their business environments, the question arises of whether it is appropriate to build a data warehouse or data mart using IMAGE/SQL.

In general, a data warehouse is built on a relational database rather than other database types, for faster ad-hoc query performance. In this scheme, it is common for HP3000 customers to host the target data warehouse on a non-HP3000 platform (e.g., an HP9000 or HP NetServer). Although this can be a viable and even trouble-free solution for some customers, especially those that already have heterogeneous computing environments, it can also be a significant burden for customers that must learn to operate on a new computing platform (operating system, user commands, databases, tools, etc.).

As a network database, IMAGE/SQL is superb for on-line transaction processing but is less efficient for ad-hoc query processing. For this reason, IMAGE/SQL has not been considered an ideal data warehouse database in the past. Very recently, however, this limitation has been addressed. Through the joint partnership of Hewlett-Packard and four key channel partners (Taurus Software, Quest Software, DISC, and Brio Technology), it is now possible and advantageous for HP3000 IMAGE/SQL users to create a native HP3000 data warehouse solution using IMAGE/SQL, with performance comparable to a solution built on a relational database. This is indeed an exciting alternative for HP3000 IMAGE/SQL users.

This paper briefly covers generic data warehouse concepts and various implementation alternatives for HP3000 users. It then discusses the native HP3000 data warehouse solution using IMAGE/SQL in more detail: how these third-party products work together in an integrated HP3000 data warehouse environment, the advantages, performance information, and some potential data warehouse modeling templates targeted at certain vertical market industries.

WHAT IS A DATA WAREHOUSE?

A data warehouse is a decision support database that is maintained separately from an organization's operational databases and usually resides on a dedicated server. The database is designed around the kind of information a company is seeking (e.g., sales and marketing, healthcare membership and providers) and adopts a STAR (or SNOWFLAKE) schema design for maximum query performance. The database is populated by extracting appropriate data from the existing operational database(s), cleansing or scrubbing the data, denormalizing it, and then loading it into the warehouse. (This data population process is also known as the data transformation process.) This database is then the "place" for top executives, managers, analysts, and other end-users to mine a rich source of company information. They can ask compelling business questions and find answers in their data, making key and timely business decisions from their desktops using GUI On-Line Analytical Processing (OLAP) tools.

Attributes Of A Data Warehouse

According to W.H. Inmon, who is considered the father of data warehousing, "A Data Warehouse is a subject-oriented, integrated, time variant, nonvolatile collection of data in support of management’s decision-making process." These fundamental attributes of a data warehouse are further explained below:

Subject Oriented

Operational data, such as order processing and manufacturing databases, are organized around business activities or functional areas, and are typically optimized to serve a single, static application. The functional separation of applications causes companies to store identical information in multiple locations, and the format and currency of the duplicated information are usually inconsistent. For example, in a delivery database, the customer list will have very detailed information on customer addresses and is typically indexed by customer number concatenated with a zip code. The same customer list in the invoicing system will contain a potentially different billing address and be indexed by an accounting "Customer Account Number." In both instances the customer name is the same, but it is identified and stored differently. Deriving any correlation between data extracted from those two databases presents a challenge. In contrast, a data warehouse is organized around subjects. Subject orientation presents the data in a format that is consistent and much clearer for end-users to understand. For example, subjects could be "Product," "Customers," or "Orders," as opposed to "Purchasing" or "Payroll."

Integrated

Integration of data within a warehouse is accomplished by dictating consistency in format, naming, and so on. Operational databases, for historical reasons, often have major inconsistencies in data representation. For example, a set of operational databases may represent "male" and "female" by "m" and "f," by "1" and "2," or by "x" and "y." Frequently the inconsistencies are more complex and subtle. By definition, data is always maintained in a consistent fashion in a data warehouse.
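As a minimal sketch of such a rule, the load step might unify the inconsistent gender encodings above with a CASE expression. This is generic SQL; the operational table op_members and its columns are hypothetical:

-- Hypothetical cleansing rule applied during the load:
-- unify three different gender encodings into one standard.
SELECT member_id,
       CASE WHEN gender_code IN ('m', '1', 'x') THEN 'M'
            WHEN gender_code IN ('f', '2', 'y') THEN 'F'
            ELSE '?'    -- flag unknown codes for later scrubbing
       END AS gender
FROM   op_members;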

Time variant

Data warehouses are time variant in the sense that they maintain both historical and (nearly) current data. Operational databases, in contrast, contain only the most current, up-to-date data values. Furthermore, they generally maintain this information for no more than a year (and often much less). By comparison, data warehouses contain data that is generally loaded from the operational databases daily, weekly, or monthly and then typically maintained for a period of 3 to 5 years. This aspect marks a major difference between the two types of environments. Historical information is of high importance to decision-makers. They often want to understand trends and relationships between data. For example, the product manager for a soft drink maker may want to see the relationship between coupon promotions and sales. This type of information is typically impossible to determine with an operational database that contains only current data.

Nonvolatile

Nonvolatility, another primary aspect of data warehouses, means that after the informational data is loaded into the warehouse, changes, inserts, or deletes are rarely performed. The loaded data is transformed data that originated in the operational databases. The data warehouse is subsequently reloaded or, more likely, appended on a periodic basis with new, transformed or summarized data from the operational databases. Apart from this loading process, the information contained in the data warehouse generally remains static. The property of nonvolatility permits a data warehouse to be heavily optimized for query processing.

Built From Scratch

Because each company has its own business needs and business queries, a data warehouse database is normally built from scratch using the available data warehousing tools. Determining what questions or queries end-users need is the first, though time-consuming, step. Data modeling for such a "customized" data warehouse database can then be developed. Identifying what data is needed from the operational database(s) and then populating the data warehouse are the subsequent steps. The entire process can be repeated as additional refinement is needed over time.

From the attributes described above, it is apparent that the purpose and usage of an operational database and a data warehouse vary greatly. The chart below summarizes these differences:

Category   | Operational Database                              | Data Warehouse
-----------|---------------------------------------------------|----------------------------------------------------------
Function   | Data processing, support of business operations   | Decision support
Data       | Process oriented, current values, highly detailed | Subject oriented, current and historical values, summarized and sometimes detailed
Usage      | Structured, repetitive                            | Ad-hoc, some repetitive reports and structured applications
Processing | Data entry, batch, OLTP                           | End-user initiated queries

Figure 1: Operational Databases vs. Data Warehouses

Deviation From The Traditional Data Warehouse Attributes

As data warehousing becomes a mainstream technology, some traditional attributes are being deviated from in order to meet users' increasing demands. The most noticeable are time variant, nonvolatile, and built from scratch.

Deviation from time variant & nonvolatile

As data warehouses grow larger and larger (e.g., into terabytes), reloading or appending data can become very tedious and time consuming. Furthermore, users are demanding that more up-to-date data be included in the data warehouse. Instead of adhering to the traditional data warehouse attributes of keeping the data nonvolatile and time variant, new data is being added to the data warehouse on a daily, if not real-time, basis. New approaches are therefore being developed to tackle this task. Two possible methods, both discussed later in this paper, are incremental loads driven by change detection and near-real-time propagation of delta changes.

Deviation from built from scratch

For customers in a "horizontal" industry, meaning their applications are unique to their own businesses, it is essential to build a data warehouse from scratch. However, for customers in a "vertical" industry, meaning their applications either come from the same vendor or are similar in functionality across vendors, it is possible to leverage an "off-the-shelf," pre-packaged MART. The MART is a data-modeling template designed with a certain set of queries in mind for that specific vertical industry. Instead of designing data models from scratch, leveraging these MARTs can reduce development time and cost. Frederick Rook, a Senior VP of Platinum Technology Inc., predicts that approximately 80% of the data warehouses or data marts for the vertical industries will be "pre-packaged" in the next two years or so. This approach definitely deviates from the traditional one.

What Is A Star/Snowflake Schema?

As mentioned earlier, the data warehouse database adopts a star or snowflake schema to maximize performance. A star or snowflake schema design is very different from that of an operational database schema design. In an operational database design, the data is highly normalized to support consistent updates and to maintain referential integrity. In a data warehouse design, the data is highly denormalized to provide instant access without having to perform a large number of joins. A star or snowflake schema design represents data as an array in which each dimension is a subject around which analysis is performed.

As the name implies, the star schema is a modeling paradigm that has a single object in the middle radially connected to other surrounding objects like a star. The star schema mirrors the end user’s view of a business query such as a sales fact that is qualified by one or more dimensions (e.g., product, store, time, region, etc.). The object in the center of the star is called the fact table. This fact table contains the basic business measurements and can consist of millions of rows. The objects surrounding the fact table (which appear as the points of the star) are called the dimension tables. These dimension tables contain business attributes that can be used as SQL search criteria, and they are relatively small. The star schema itself can be simple or complex. A simple star schema consists of one fact table and several dimension tables. A complex star schema can have more than one fact table and hundreds of dimension tables. Figure 2 below depicts a simple star schema.

Figure 2: Star Schema
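To make the fact/dimension distinction concrete, here is a minimal SQL sketch of a star-join against such a schema. The table and column names follow the sample IMAGE/SQL star schema presented later in this paper:

-- A typical star-join: the Sales fact table qualified and grouped
-- by attributes from two of its dimension tables.
SELECT   st.Region, p.Product_Class, SUM(s.Revenue) AS Total_Revenue
FROM     Sales s
         JOIN Store   st ON s.Store_Key   = st.Store_Key
         JOIN Product p  ON s.Product_Key = p.Product_Key
GROUP BY st.Region, p.Product_Class;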

The snowflake schema is an extension of the star schema in which each point of the star explodes into more points; the dimension tables are further normalized. The advantages of the snowflake schema are reduced disk storage for the data and improved performance when joining the smaller normalized tables rather than large denormalized ones. The snowflake schema also increases the flexibility of the application, because the normalization lowers the granularity of the dimensions. However, since the snowflake schema has more tables, it also increases the complexity of some of the queries that must be mapped onto it. Figure 3 below depicts a snowflake schema.

Figure 3: Snowflake Schema
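As an illustrative sketch, snowflaking the Product dimension from the earlier example might split the product class out into its own normalized table. The Product_Class table and Class_Key column are hypothetical extensions of the paper's sample schema:

-- Star:      Product(Product_Key, Product_Class, Product_Name, Product_Desc)
-- Snowflake: the class attribute moves into its own table, and the
--            Product dimension references it by key.
CREATE TABLE Product_Class (
    Class_Key   INTEGER PRIMARY KEY,
    Class_Name  CHAR(10)
);
CREATE TABLE Product (
    Product_Key  INTEGER PRIMARY KEY,
    Class_Key    INTEGER REFERENCES Product_Class(Class_Key),
    Product_Name CHAR(10),
    Product_Desc CHAR(40)
);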

What Are The Steps In Building A Data Warehouse?

Building a data warehouse is an involved process that requires several steps:

  1. Analyzing business needs
  2. Extracting operational data
  3. Transforming and loading data
  4. Query optimization
  5. Presentation of data
  6. Continuing refinement of the data warehouse

Analyzing business needs

When building a data warehouse, the first challenge is to define business needs. This is normally done by working with end-users, who can define business questions that they cannot easily answer through the operational databases. The information obtained can then be used in data modeling so that an appropriate star schema database can be defined. Spending time with end-users to define the business queries they need is essential to the success of the project.

Extracting operational data, transforming and loading data

Moving data from the operational databases to the data warehouse is done via extraction tools, which map operational data to the target data warehouse. As part of the data movement, data transformation is performed as specified by the metadata rules developed during the data modeling stage.
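As a minimal sketch of such a mapping, a transformation rule might denormalize and cleanse operational rows directly into a dimension table. The operational table op_stores and lookup table region_map are hypothetical; the Store dimension follows the sample schema used later in this paper:

-- Populate the Store dimension from an operational source, applying a
-- cleansing rule (uppercase city names) and a region-code lookup.
INSERT INTO Store (Store_Key, City, State, Region)
SELECT o.store_no,
       UPPER(o.city),
       o.state,
       r.region_code
FROM   op_stores o
       JOIN region_map r ON o.state = r.state;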

Query optimization

Performance in data retrieval can be greatly enhanced through the use of multidimensional and aggregation indexes in a star or snowflake environment. Over 90% of data warehousing queries are multidimensional in nature, applying multiple criteria against multiple columns. End-users rarely want to access data by only one column or dimension, such as finding the number of customers in the state of CA. They more commonly ask complex questions, such as how many customers in the state of CA purchased products B and C in the last year, and how that compares to the year before. To optimize such a query, an index can be put on each column in the dimension tables that end-users want to query. When an end-user issues a query, a qualifying count based on index access alone can be returned without touching the actual data. According to Bill Inmon, it is much more efficient to service a query request by simply looking in an index or indexes instead of going to the primary source of data. In addition to multidimensional queries, end-users often want to see the data aggregated. A data aggregation is usually a COUNT or SUM, but can be an AVERAGE, MINIMUM, or MAXIMUM, such as number of customers, total sales dollars, or average quantity. An aggregation is typically organized or grouped by another column, such as sum of sales by region, or average quantity of product line B sold by sales rep. By placing an index on aggregated values, performance can be enhanced further.
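As a hedged illustration of these two query styles, here are generic SQL sketches against the star schema used throughout this paper (the year value is arbitrary):

-- Multidimensional query: a qualifying count across several dimensions,
-- resolvable from indexes alone when each criterion column is indexed.
SELECT COUNT(*)
FROM   Sales s
       JOIN Store st ON s.Store_Key = st.Store_Key
       JOIN Time  t  ON s.Time_Key  = t.Time_Key
WHERE  st.State = 'CA' AND t.Year = 1997;

-- Aggregation grouped by another column: sum of sales by region.
SELECT   st.Region, SUM(s.Revenue) AS Total_Sales
FROM     Sales s
         JOIN Store st ON s.Store_Key = st.Store_Key
GROUP BY st.Region;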

Presentation of data

Once the data is retrieved from the data warehouse, how the data is presented to end-users is important. Graphical presentation with drill-down features and reporting capabilities is the norm, and a multitude of vendors provide such OLAP tools.

Continuing refinement of the data warehouse

As more end-users use the data warehouse to extract nuggets of information from mountains of data, the data warehouse needs to be refined to accommodate new and different types of business queries. This involves the continual cycle of extracting data, transforming it into information, and making and executing decisions in hopes of producing a significant return on investment. This is a never-ending challenge.

WHAT ARE THE SOLUTIONS ON THE HP3000?

For HP3000 customers, there are two categories of data warehouse solutions available: a non-native HP3000 data warehouse solution and a native HP3000 data warehouse solution.

Non-native HP3000 Data Warehouse Solution

A non-native HP3000 data warehouse solution adopts the model of interoperating with databases on multiple platforms. For example, a company may have multiple operational databases on the HP3000. Instead of putting the target data warehouse database on the HP3000, the company opts for another HP platform such as an HP9000 or an HP NetServer. After the initial loading of data into the data warehouse database, subsequent incremental loads (especially small loads issued frequently) can be done either via data movement tools or via database gateways that work with those specific data sources. Figures 4a and 4b illustrate some of the many options HP3000 customers have in implementing such a non-native HP3000 data warehouse solution.

Figure 4a: Some Non-Native HP3000 Data Warehouse Solution Options

 

Channel Partners           | Data Extraction Tools | Middleware/Data Access Tools | Reporting/OLAP Tools | Data Warehouse Database | Data Warehouse Consulting
---------------------------|-----------------------|------------------------------|----------------------|-------------------------|--------------------------
Arbor Software             |                       |                              | X                    |                         |
Brio Technology            |                       |                              | X                    |                         |
Client Systems             | X                     | X                            | X                    | X                       | X
Cognos                     |                       |                              | X                    |                         |
DISC                       |                       | X                            |                      |                         |
DCI Conf.                  |                       |                              |                      |                         | X
ETI                        | X                     |                              |                      |                         |
Information Builder Inc.   |                       | X                            | X                    |                         |
Hewlett-Packard            | X                     | X                            | X                    | X                       | X
Informix                   |                       |                              |                      | X                       |
Integration Alliance       | X                     | X                            | X                    | X                       | X
Mitchell Humphrey          |                       |                              | X                    | X                       | X
Oracle                     | X                     |                              | X                    | X                       | X
Platinum Technology        | X                     | X                            | X                    | X                       | X
Rankin Tech Group          | X                     |                              |                      |                         |
Red Brick System           |                       |                              | X                    | X                       | X
SAS                        |                       |                              | X                    |                         |
Speedware                  |                       |                              | X                    |                         |
Sybase                     |                       |                              |                      | X                       |
Taurus & Quest             | X                     |                              |                      |                         |

Figure 4b: Example Of Some HP Channel Partners That Offer A Wide Range Of Data Warehousing/Data Marts Products & Services

Native HP3000 Data Warehouse Solution

A native data warehouse solution adopts the model of working with databases on the HP3000 platform. In this case, there are three databases to choose from: IMAGE/SQL, ALLBASE, and ORACLE. Until recently, it has been common to see IMAGE/SQL databases serving as the operational databases, with ORACLE or ALLBASE databases as the target data warehouse databases. IMAGE/SQL databases are renowned for their excellent On-Line Transaction Processing (OLTP) performance; popular wisdom, however, has held that the road to data warehousing for HP3000 customers requires a relational database (ALLBASE or ORACLE) for efficient ad-hoc query capabilities, which are not native to non-relational databases.

Recently, through the partnership of Hewlett-Packard, DISC, Taurus Software, Quest Software, and Brio Technology, a total data warehouse solution, code-named Intelli-Vault, has become available on the HP3000, using an IMAGE/SQL database as the target data warehouse database and presenting the data through graphical end-user tools. Figure 5 illustrates some of the available native HP3000 data warehousing options.

Figure 5: Some Native HP3000 Data Warehouse Solution Options

The remainder of this paper focuses on this new Intelli-Vault technology bundle.

INTELLI-VAULT: A NATIVE HP3000 DATA WAREHOUSING TECHNOLOGY BUNDLE

Intelli-Vault is a data warehousing technology bundle specifically designed for HP3000 customers that want to build a high-performance, cost-competitive data warehouse in an environment they are familiar with: IMAGE/SQL on the HP3000. Instead of learning to operate on a new computing platform (new database, tools, operating system, user commands, system maintenance, etc.), these customers can comfortably implement a total data warehousing solution using the database and tools that they already know, in a reliable and familiar environment. This alternative is indeed very attractive.

In this technology bundle, each partner brings unique strengths that contribute to the composition of the total solution. The key contributions are listed below:

Hewlett-Packard: HP3000 & IMAGE/SQL

Taurus Software & Quest Software: BridgeWare

DISC: OMNIDEX

Brio Technology: BrioQuery™

Figure 6 provides a pictorial view of the Intelli-Vault technology bundle.

 

Figure 6: Native HP3000 Data Warehousing Technology Bundle

Together, these products deliver a comprehensive data warehousing solution using IMAGE/SQL databases on the HP3000. Of the 6 steps of building a data warehouse, they handle all but steps 1 and 6, the needs analysis and continual refinement.

  1. Analyzing business needs
  2. Extracting operational data
  3. Transforming and loading data
  4. Query optimization
  5. Presentation of data
  6. Continuing refinement of the data warehouse

Taurus and Quest handle steps 2 and 3, extracting operational data and then transforming and loading the data into the data warehouse. DISC handles step 4, optimizing the queries. BrioQuery™ handles step 5, presenting the data to end-users in an intuitive fashion.

BridgeWare

Overview

Moving data from the operational store to the data warehouse is a two-step process. The data warehouse must first be loaded with data from the operational store(s). Often this data comes from many different databases and files, all needing to be consolidated and merged into a single data warehouse target. Once loaded, the data warehouse must be kept up to date at a frequency acceptable to the users, most often a daily refresh.

Trying to re-populate multiple gigabytes of data on a nightly basis quickly becomes impractical, and figuring out which transactions are "just today's" is often impossible.
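To see why, consider the naive approach: an incremental extract keyed on a timestamp column, sketched below in generic SQL with hypothetical names (op_orders, last_modified). It works only when every operational table carries a reliable last-modified timestamp, and even then it misses deletes:

-- Naive incremental extract: select only rows touched since the last load.
-- :last_load_time is a host variable remembered from the previous run.
SELECT *
FROM   op_orders
WHERE  last_modified >= :last_load_time;
-- Rows deleted since the last load, and tables without a timestamp
-- column, are invisible to this query; hence the appeal of the
-- log-based change detection described below.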

BridgeWare is a jointly developed product from Taurus and Quest designed both to perform the initial load of the data warehouse and to "know" what has changed in the operational systems since the last data warehouse update, performing an incremental load.

BridgeWare performs the essential data transformation, data scrubbing, data denormalization, and loading of data into a star schema design required in the initial data warehouse load. BridgeWare also provides standard batch updates and can perform cross-platform data updating in real time.

Data Extraction – Initial Data Warehouse Load

BridgeWare can extract data located in any of the leading databases and files and move it to any set of target databases and files, even on a different platform. Data movement can be bi-directional. Data type conversions are handled automatically or can be user defined. Complex calculations, data manipulation, and mapping can be defined in a "script" or through a graphical user interface running on Windows NT or Windows 95. Supported databases are Oracle, IMAGE/SQL, Allbase, DBMS, Informix, SQL Server, and assorted flat file types. Supported platforms are MPE, UNIX, VMS, and NT.

Graphic Interface

Initial loading of the star schema may be accomplished quickly through the use of BridgeWare's Graphic User Interface, Forklift, residing on Windows NT or Windows 95. The GUI generates scripts which can be executed, or modified further and then run at a later time. Figure 7 below illustrates the Forklift GUI tool.

Figure 7: Bridgeware’s GUI – Forklift

Sample Scripting Language

Scripts generated by the Graphic Interface or developed interactively are easily viewed to make certain that each step of your data migration is handled appropriately. These scripts may be executed interactively or in batch.

As an example, a script created for the mapping illustrated by the prior screen follows. This example shows moving data from one IMAGE/SQL dataset to another.

* Forklift 1.00.0000 generated script.
* Mon May 18, 1998 02:33 PM
*
open CUSTDB REMOTE bert user=css/.sales/ IMAGE cust PASS=; MODE=5
define CUSTDB_COMPANY-INFO_REC : using CUSTDB.COMPANY-INFO
read COMPANY-DETAIL_FLOW_R = CUSTDB.COMPANY-DETAIL
   setvar CUSTDB_COMPANY-INFO_REC.IDKEY = IDKEY
   setvar CUSTDB_COMPANY-INFO_REC.MAJ-CPY-CODE = CPY-CODE
   setvar CUSTDB_COMPANY-INFO_REC.SEQUENCE = SEQUENCE
   setvar CUSTDB_COMPANY-INFO_REC.DP-EXTN = DP-CONTACT
   setvar CUSTDB_COMPANY-INFO_REC.LINE1 = ADDRESS-1
   copy CUSTDB_COMPANY-INFO_REC to CUSTDB.COMPANY-INFO
endread

Change Detection – Incremental Data Warehouse Load

Once the data warehouse has been populated, BridgeWare keeps it updated with changes coming from the transactional environments. To do this, BridgeWare relies on the industry-proven technology of Quest's NetBase (SharePlex/iX) to detect when a change has happened. BridgeWare detects changes in IMAGE/SQL, KSAM, and MPE files. Entire databases may be selected for change detection, or individual datasets may be specified. The change detection happens automatically, without any program changes.

As transactions are posted to the transactional databases or files, changes to the configured databases are intercepted, converted, and transported directly into the targeted data warehouse by way of a consistent stream of delta-change transactions (minimizing impact to the mission-critical legacy system, as well as to the network).

Database updating with BridgeWare can be configured in several ways: to take place immediately, to take place at pre-selected intervals, or to take place during off-hours in batch mode, thereby minimizing network traffic.

BridgeWare uses its own networking software. A BridgeWare process runs on each system in the network, and the processes communicate with each other to ensure proper receipt and posting of data as well as interactive error handling.

IMAGE/SQL Star Schema

The IMAGE/SQL database schema for the star schema example shown earlier in Figure 2 could be as follows. Please note that there are no paths between the dimension datasets, which are manual masters, and the fact dataset. This is because OMNIDEX can ensure referential integrity without them, and they would add unnecessary overhead when the database is initially populated and periodically updated.

$CONTROL BLOCKMAX=1024
BEGIN DATA BASE HPDW;

PASSWORDS:
   1 READER ;  << 051105 040504 042522 020040 >>
   2 WRITER ;  << 053522 044524 042522 020040 >>

ITEMS:
   Store-Key     , I2  ;
   City          , X20 ;
   State         , X02 ;
   Region        , X02 ;
   Promo-Key     , I2  ;
   Promo-Name    , X40 ;
   Promo-Date    , X08 ;
   Product-Key   , I02 ;
   Product-Class , X10 ;
   Product-Name  , X10 ;
   Product-Desc  , X40 ;
   Time-Key      , I02 ;
   Day-Of-Week   , X10 ;
   Month         , I01 ;
   Year          , I01 ;
   Revenue       , I02 ;
   Quantity      , I02 ;
   Cost          , I02 ;
   Discount      , I02 ;

SETS:
NAME: Product , MANUAL (1,2/2);
ENTRY:
   Product-Key (0),
   Product-Class ,
   Product-Name ,
   Product-Desc ;
CAPACITY: 10;  << TABLE1 >>

NAME: Store , MANUAL (1,2/2);
ENTRY:
   Store-Key (0),
   City ,
   State ,
   Region ;
CAPACITY: 10;

NAME: Promotion , MANUAL (1,2/2);
ENTRY:
   Promo-Key (0),
   Promo-Name ,
   Promo-Date ;
CAPACITY: 10;

NAME: Time , MANUAL (1,2/2);
ENTRY:
   Time-Key (0),
   Day-Of-Week ,
   Month ,
   Year ;
CAPACITY: 10;

NAME: Sales , DETAIL (1,2/2);
ENTRY:
   Store-Key ,
   Promo-Key ,
   Product-Key ,
   Time-Key ,
   Revenue ,
   Quantity ,
   Cost ,
   Discount ;
CAPACITY: 10;

END.

OMNIDEX

Once data has been loaded into the IMAGE/SQL data warehouse using BridgeWare, it is ready to be queried, analyzed, and retrieved by end-users. OMNIDEX, a leading database search engine for the HP3000, has been specifically enhanced for fast and flexible access to data warehouse databases that use star and snowflake schemas. Through the use of powerful, state-of-the-art multidimensional and aggregation indexes, OMNIDEX solves and optimizes even complex queries by accessing only its indexes. By performing counts and aggregations at the index level, without accessing the underlying database, it minimizes I/O. The database is accessed only when end-users want to view the detail data, using fast indexed access. By changing the fundamental data access method, dramatic query performance gains can be realized. Users can analyze the data warehouse effectively and efficiently, asking any question and obtaining the answer instantly.

OMNIDEX integrates transparently with popular ODBC analysis/reporting tools, such as BrioQuery™, the powerful end-user software that is included with Intelli-Vault. To understand how OMNIDEX delivers breakthrough performance, let's take a look at how its indexes handle two of the fundamental types of data access that end-users need for analysis of product profitability, promotional effectiveness, and customer characteristics in a data warehousing environment:

Multidimensional Queries

A multidimensional query accesses data by more than one dimension, i.e., by more than one column or criterion, as described earlier in the Query Optimization section. OMNIDEX multidimensional indexes are specially designed for unlimited multidimensional access, allowing complex queries using any number of criteria or dimensions for the most flexible data exploration and optimal multidimensional query performance.

For the star schema example shown earlier, one possible multidimensional query would be to find the number of sales for stores in CA, for holiday promotions, for products with the word "book" in the description, for the month of December. As shown in Figure 8 below, OMNIDEX can perform instantaneous multidimensional queries by any number of criteria, including keyword lookups on free-format fields such as description as well as commonly coded fields such as State.

Figure 8: Multidimensional Query Example
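Expressed as the SQL an ODBC tool might submit, the Figure 8 query could look like the sketch below. The LIKE predicates stand in for OMNIDEX keyword lookups, and promotion names are assumed to contain the word "holiday":

-- Count sales for CA stores, holiday promotions, products with "book"
-- in the description, in December: resolvable at the index level.
SELECT COUNT(*)
FROM   Sales s
       JOIN Store     st ON s.Store_Key   = st.Store_Key
       JOIN Promotion pr ON s.Promo_Key   = pr.Promo_Key
       JOIN Product   p  ON s.Product_Key = p.Product_Key
       JOIN Time      t  ON s.Time_Key    = t.Time_Key
WHERE  st.State = 'CA'
  AND  pr.Promo_Name LIKE '%HOLIDAY%'
  AND  p.Product_Desc LIKE '%BOOK%'
  AND  t.Month = 12;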

Because the end-users can quickly see the qualifying count each time they perform a search, they can decide whether to add criteria to further qualify the selection, aggregate, drill down into the detail data, or start over. Only when the end-users want to view the detail data is I/O performed against the database.

With OMNIDEX, end-users can query repeatedly against a very large database (VLDB) and perform counts, aggregates, and drill-downs in seconds or minutes instead of hours or days. This allows them to interact with and explore the data, in unpredictable ways, at remarkable speeds. For example, Figure 9 below shows likely access times for a star schema with one large fact table and four dimensions. The right column is the time OMNIDEX takes to access the index for each of the criteria and produce a count that qualifies 5% of the data.

Size of Fact Table (# of rows) | Query Time (to obtain count)
-------------------------------|-----------------------------
2,000,000                      | .2 second
10,000,000                     | 1 second
50,000,000                     | 5 seconds
100,000,000                    | 10 seconds
250,000,000                    | 25 seconds

Figure 9: OMNIDEX Query Speeds

The query speeds above are from a moderately fast single processor with 128 MB of memory. CPU size and speed do affect OMNIDEX lookup speeds, but only minimally, because the work is completed at the index level.

OMNIDEX uses index-only processing whenever possible. When a multidimensional query is requested based on criteria from multiple columns or dimensions, the required values are instantly accessed within the indexes, and the results joined at the index level.

Because OMNIDEX can solve complex queries by accessing only its indexes, the size of a database has almost no impact on the response time of OMNIDEX queries as compared to standard database retrievals. What matters to OMNIDEX is how many rows qualify when it accesses the index, so retrieval speeds are similar against a 1-million-row table and a 20-million-row table if the counts are the same.

OMNIDEX multidimensional indexes are optimized for the cross table retrievals and multiple joins inherent within a star or snowflake design. They are built at high speeds in only one pass through a table, and compressed for efficient disk space usage. OMNIDEX supports any type of data, including high-cardinality text columns. All its indexes work in conjunction with each other, no matter what the data type: low-cardinality and high-cardinality character data including BLOB text, binary data including quantities and amounts, and dates.

Data Aggregation

As mentioned earlier in the Query Optimization section, end-users often want to see data summarized (e.g., COUNT, SUM, AVERAGE, MINIMUM, MAXIMUM), such as average sales dollars, total number of customers, etc. An OMNIDEX aggregation index can create any aggregate, group-by, or order-by (sort) dynamically online through its indexes, without the I/O required to scan or fetch the data rows or perform a sort operation. OMNIDEX aggregation indexes summarize quantities or amounts dynamically at the time of the query at remarkable speeds, up to 1 million values per second. Figure 10 below illustrates aggregation speeds using OMNIDEX. The right column lists aggregation times that are worst-case for a fast CPU and best-case for slower CPUs.

Size of Fact Table (# of rows) | Aggregation Time (in minutes)
-------------------------------|------------------------------
2,000,000                      | .07
10,000,000                     | .34
50,000,000                     | 1.67
100,000,000                    | 3.34
250,000,000                    | 8.34

Figure 10: OMNIDEX Aggregation Speeds

Once end-users have utilized the OMNIDEX multidimensional indexes to select data and identify the rows of interest from the fact table, OMNIDEX passes a list of the qualifying row identifiers to its aggregation indexes to perform calculations on the quantities or amounts. In a previous example, we discussed how OMNIDEX could instantly perform a multidimensional query on stores in CA for holiday promotions for products with the word "book" in the description for the month of December (refer back to Figure 8). After the multidimensional query, OMNIDEX could then instantly aggregate amounts to calculate the sum or average or both with an aggregation key that includes the revenue, quantity, and/or cost.
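Continuing the Figure 8 example, the follow-on aggregation might be expressed as the sketch below; the WHERE clause repeats the same four-dimension qualification:

-- Aggregate the qualified sales rows: totals and averages computed
-- from the aggregation indexes, without fetching the data rows.
SELECT SUM(s.Revenue)  AS Total_Revenue,
       AVG(s.Revenue)  AS Avg_Revenue,
       SUM(s.Quantity) AS Total_Quantity,
       SUM(s.Cost)     AS Total_Cost
FROM   Sales s
       JOIN Store     st ON s.Store_Key   = st.Store_Key
       JOIN Promotion pr ON s.Promo_Key   = pr.Promo_Key
       JOIN Product   p  ON s.Product_Key = p.Product_Key
       JOIN Time      t  ON s.Time_Key    = t.Time_Key
WHERE  st.State = 'CA'
  AND  pr.Promo_Name LIKE '%HOLIDAY%'
  AND  p.Product_Desc LIKE '%BOOK%'
  AND  t.Month = 12;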

Because OMNIDEX couples all the speed and flexibility of selecting by its multidimensional indexes with the high-speed data summary capabilities of its aggregation indexes, users can qualify by any combination of multidimensional criteria, and then instantly summarize the desired amounts or quantities.

With the combination of OMNIDEX multidimensional and aggregation indexes, end-users are not limited to a small number of predefined selection criteria or dimensions. In addition, there is no need to create numerous summary tables, as is often the case with data warehousing applications that use a RDBMS alone for ROLAP.
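For contrast, a ROLAP-only design typically pre-computes one summary table per anticipated grouping, as sketched below (the table name is hypothetical). Every new grouping that users ask for means another summary table to build and maintain:

-- Pre-computed summary table for one fixed grouping (region by month).
-- Dynamic index-level aggregation makes tables like this unnecessary.
CREATE TABLE Sales_By_Region_Month AS
SELECT   st.Region, t.Month, SUM(s.Revenue) AS Revenue
FROM     Sales s
         JOIN Store st ON s.Store_Key = st.Store_Key
         JOIN Time  t  ON s.Time_Key  = t.Time_Key
GROUP BY st.Region, t.Month;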

The key to breakthrough performance is that all the OMNIDEX aggregation functions work in conjunction with all the multidimensional retrieval functionality, and OMNIDEX can fully resolve most queries within the search engine itself. Because the qualifying row counts and data to be aggregated are contained within the indexes, OMNIDEX can perform most of the work at the index level, without reading the actual data record, and I/O is eliminated or minimized.

OMNIDEX aggregation indexes also contain pointers back to the actual data in the fact table, so the underlying detail data can be instantly accessed and displayed to the user at any time. This means data can be kept at any level of granularity, then quickly aggregated and displayed, a capability that is lost when using only summary tables.

Seamless Integration

OMNIDEX is a database search engine that supplements the existing database, augmenting the native database indexes with indexes specifically designed for high-performance multidimensional access and dynamic data aggregation. OMNIDEX provides additional access that is not possible with the native IMAGE/SQL or Oracle databases alone. OMNIDEX indexes reside outside of the existing database, sitting between the data and ODBC-compliant PC data access tools such as BrioQuery™.

When a user issues a request for information through a PC tool, ODBC passes the query through the OMNIDEX search engine, which accesses its indexes and passes the qualifying count or data aggregations back to the user's tool. If the user requests the actual detail data, the pointers for the selected data rows (or rowids) are used to retrieve only those rows of interest directly from the fact table. The OMNIDEX SQL optimizer ensures that all queries against a star or snowflake schema are completed in the fastest manner, and never cause slow, I/O-intensive scans of the data.

Furthermore, the OMNIDEX database search engine enhances applications across the enterprise. Its multidimensional indexes allow instant keyword lookups by free-format text columns such as name or description and are very useful for OLTP applications.

Sample Indexing Scripts

To place indexes on the star schema shown earlier in Figure 2, you simply set up an "environment catalog" that defines the database layout, and then use a DISC utility called Dbinstal to specify the tables (datasets) and columns (fields) on which you want to query.

The environment catalog is as follows. Please note that it defines "PRIMARY" and "FOREIGN" keys to logically link the sales fact table with the dimension tables, without the need for IMAGE/SQL paths.

ENVIRONMENT "HPDW_ENV"

DATABASE "HPDW"
TYPE TURBOIMAGEXL
PHYSICAL "HPDW"

TABLE "PRODUCT"
TYPE MASTER
PRIMARY KEY "PRODUCT_KEY"

COLUMN "PRODUCT_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "PRODUCT-KEY"
COLUMN "PRODUCT_CLASS" DATATYPE CHAR LENGTH 10
PHYSICAL "PRODUCT-CLASS"
COLUMN "PRODUCT_NAME" DATATYPE CHAR LENGTH 10
PHYSICAL "PRODUCT-NAME"
COLUMN "PRODUCT_DESC" DATATYPE CHAR LENGTH 40
PHYSICAL "PRODUCT-DESC"

TABLE "STORE"
TYPE MASTER
PRIMARY KEY "STORE_KEY"

COLUMN "STORE_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "STORE-KEY"
COLUMN "CITY" DATATYPE CHAR LENGTH 20
COLUMN "STATE" DATATYPE CHAR LENGTH 2
COLUMN "REGION" DATATYPE CHAR LENGTH 2

TABLE "PROMOTION"
TYPE MASTER
PRIMARY KEY "PROMO_KEY"

COLUMN "PROMO_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "PROMO-KEY"
COLUMN "PROMO_NAME" DATATYPE CHAR LENGTH 40
PHYSICAL "PROMO-NAME"
COLUMN "PROMO_DATE" DATATYPE CHAR LENGTH 8
PHYSICAL "PROMO-DATE"

TABLE "TIME"
TYPE MASTER
PRIMARY KEY "TIME_KEY"

COLUMN "TIME_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "TIME-KEY"
COLUMN "DAY_OF_WEEK" DATATYPE CHAR LENGTH 10
PHYSICAL "DAY-OF-WEEK"
COLUMN "MONTH" DATATYPE INTEGER LENGTH 2
COLUMN "YEAR" DATATYPE INTEGER LENGTH 2

TABLE "SALES"
TYPE DETAIL

FOREIGN KEY "STORE_KEY" REFERENCES "STORE"("STORE_KEY")
FOREIGN KEY "PROMO_KEY" REFERENCES "PROMOTION"("PROMO_KEY")
FOREIGN KEY "PRODUCT_KEY" REFERENCES "PRODUCT"("PRODUCT_KEY")
FOREIGN KEY "TIME_KEY" REFERENCES "TIME"("TIME_KEY")

COLUMN "STORE_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "STORE-KEY"
COLUMN "PROMO_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "PROMO-KEY"
COLUMN "PRODUCT_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "PRODUCT-KEY"
COLUMN "TIME_KEY" DATATYPE INTEGER LENGTH 4
PHYSICAL "TIME-KEY"
COLUMN "REVENUE" DATATYPE INTEGER LENGTH 4
COLUMN "QUANTITY" DATATYPE INTEGER LENGTH 4
COLUMN "COST" DATATYPE INTEGER LENGTH 4
COLUMN "DISCOUNT" DATATYPE INTEGER LENGTH 4

INDEX "PRODUCT_PRODUCT_KEY"
TYPE HASHED
AS "PRODUCT.PRODUCT_KEY"

INDEX "STORE_STORE_KEY"
TYPE HASHED
AS "STORE.STORE_KEY"

INDEX "PROMOTION_PROMO_KEY"
TYPE HASHED
AS "PROMOTION.PROMO_KEY"

INDEX "TIME_TIME_KEY"
TYPE HASHED
AS "TIME.TIME_KEY"


The installation job for the star schema is as follows:

!JOB hpdwecin,MGR.TDODEV,HPDW;OUTCLASS=LP,1,1
!COMMENT Job to Install and Build OMNIDEX indexes.
!COMMENT File hpdwecin created Tue May 19 00:40:28 1998
!RUN DBINSTAL.PUB.DISC
hpdwec << Environment name >>
I << INSTALL command >>
PRODUCT << Multidimensional Keyword Installation >>
PRODUCT_CLASS
PRODUCT_NAME
PRODUCT_DESC
/ << end of PRODUCT keyword columns >>
STORE
CITY
STATE
REGION
/ << end of STORE keyword columns >>
PROMOTION
PROMO_NAME
PROMO_DATE
/ << end of PROMOTION keyword columns >>
TIME
DAY_OF_WEEK
MONTH;NP
YEAR;NP
/ << end of TIME keyword columns >>
SALES
STORE_KEY
PROMO_KEY
PRODUCT_KEY
TIME_KEY
/ << end of links >>
REVENUE;NP
QUANTITY;NP
COST;NP
DISCOUNT;NP
/ << End of Multidimensional Tables >>
SALES << Aggregation installation>>
STORE_PROD_PROMO_TIME_AGG! << aggregation index>>
STORE_KEY
PRODUCT_KEY
PROMO_KEY
TIME_KEY
REVENUE
QUANTITY
COST
DISCOUNT
/ << end of aggregation key>>
/ << end of aggregation installation >>
B; << Build indexes >>
PRODUCT
B;
STORE
B;
PROMOTION
B;
TIME
B STORE_PROD_PROMO_TIME_AGG << build aggregation index>>
QUIT
!EOJ

BrioQuery™

BrioQuery™ is designed for end-users. It provides an intuitive query, analysis, and reporting interface to the data warehouse that requires minimal training, so users are productive almost immediately.

Brio Technology also offers a pre-defined data mart for sales analysis called SMART. SMART includes a pre-designed star-schema-based data mart for sales analysis, a pre-designed online data store, and pre-defined reports that any user can display. The data mart may be housed in IMAGE/SQL or in relational databases. Users can access their queries locally or over the web.

Some sample screens and analyses from SMART are illustrated in Figures 11a-d below.

Figure 11a: BrioQuery™ Presentation Of Top 20 Customers

Figure 11b: BrioQuery™ Presentation Of Top 20 Customers

Figure 11c: BrioQuery™ Presentation Of Product Revenue By Type

Figure 11d: BrioQuery™ Presentation Of Commissionable Revenue By Year

PERFORMANCE INFORMATION

In the Intelli-Vault bundle, performance is maximized by each component of the solution. HP3000 benchmark data specific to each component is presented below:

BridgeWare

A healthcare database environment was benchmarked by a third party to determine the impact of BridgeWare in a real-time, cross-platform update scenario. Quotations from this benchmark follow:

"There was no statistically significant impact, on a transaction-by-transaction basis, of any of the tested scenarios. In other words, implementation of the BridgeWare product in this test environment introduced no significant transaction-level overhead over and above the overhead spent to perform the shadowing.

"There was an increase of approximately 4% CPU utilization associated with the NetBase Export process when it is bridging the change-data within the source database...

"(When processing batched records) There is a cost of approximately 12-13% CPU utilization associated with the Warehouse background process when it is preprocessing the bridged change-data and piping it (real-time) over to the HP/9000 from the 939 system...

"The increased CPU utilization associated with the Warehouse background job operated at a higher priority than the (standard production) job. This resulted in the (standard production) job running longer due to being preempted by the higher priority Warehouse processing. This did not, however, affect the amount of CPU time spent on the execution of the (standard production) job."

OMNIDEX

An "operational" healthcare IMAGE/SQL database containing 9 million records on an HP3000/967 system was tested with and without OMNIDEX in a batch reporting scenario where complex queries were executed:

Complex Query | Without OMNIDEX | With OMNIDEX
--------------|-----------------|-------------
A             | 30 minutes      | 21 seconds
B             | 180 minutes     | 138 seconds
C             | 33 minutes      | 28 seconds
D             | 154 minutes     | 70 seconds
E             | 52 minutes      | 50 seconds

The above data is from an operational database in which the data is normalized. If a star schema database were used, performance would be expected to be even better.

VERTICAL INDUSTRY DATA MART MODELING

As mentioned earlier, "pre-packaged" MARTs for vertical industries will become the norm in the near future. Figure 12 below shows the vertical industries that are building data warehouses, along with their average data warehousing project budgets; these industries can be ideal candidates for "pre-packaged" MARTs.

Figure 12: Data Warehouse Budgets Across Vertical Industries
(Statistics From The Meta Group)

Looking at our current HP3000 vertical industry applications, the following are ideal candidates for "pre-packaged" MARTs utilizing the Intelli-Vault data warehousing technology bundle: health care, mail catalog, manufacturing, and airlines.

Application vendors can develop these "pre-packaged" MARTs either in-house or through partnership with data warehousing vendors. Intelli-Vault is a viable data warehousing technology bundle that can be used in the development of these potential MARTs. As mentioned earlier, Intelli-Vault cannot be used to analyze business needs; that must be achieved by working with end-users in the respective vertical industries. However, Intelli-Vault can be used to construct a specific vertical industry MART once the business needs and business queries are defined. Assuming these MARTs are developed, end-users could seek answers to sample queries such as those below:

Health Care

  1. Users want to be able to design and change their own reports to satisfy regulatory requirements.
  2. Which group member submitted the most claims, sorted by service type?
  3. What is the cost distribution incurred by the members within this group, sorted by service type? How does this compare to the last 5 years?
  4. What is the claim distribution, sorted by age group, over the past 5 years?
  5. How many types of services are being offered by providers in each state? How does this compare to the last five years?
  6. Which provider incurs the least cost but has the highest effectiveness rate for service type X?
  7. Which providers are performing services that are outside the norm for a specific age group?

Mail Catalog

    1. To whom should I mail my catalog?
    2. Which catalogs should I mail to which types of customers?
    3. Can I pick specific segments of the population to mail to that will result in increased sales?
    4. Which catalog in the past 2 years has generated the highest number of customer orders? (See the sketch after this list.)
    5. Who are my most valuable customers?
    6. Which products do my customers purchase together?
    7. Was my last promotion successful? How successful? Why?
    8. Which customers am I in danger of losing?
    9. What products should I put in my catalog?
    10. What inventory should I carry, based upon catalog shipment dates and buying habits?
    11. Which product has been highest in demand, sorted by catalog, over the past 3 years?
    12. What are my customers' purchasing habits: when do they buy, and what do they buy?
    13. How can my catalog contents, mailings, and mailing volumes best take advantage of these habits?
    14. When was the catalog released versus when did purchases begin?
    15. Why did this region outperform all others last quarter?
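As a hedged sketch of how a question like number 4 might map onto a catalog-sales star schema, an ODBC tool could issue SQL like the following. The Orders fact table and Catalog dimension are hypothetical, and the two-year window is written against the paper's 1998 timeframe:

-- Rank catalogs by the number of orders they generated over two years.
SELECT   c.Catalog_Name, COUNT(*) AS Order_Count
FROM     Orders o
         JOIN Catalog c ON o.Catalog_Key = c.Catalog_Key
         JOIN Time    t ON o.Time_Key    = t.Time_Key
WHERE    t.Year IN (1997, 1998)
GROUP BY c.Catalog_Name
ORDER BY Order_Count DESC;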

Manufacturing

Airlines

Ideally, a vertical industry MART can meet 80% of end-users' query needs. Through additional fine-tuning and adjustment, customized queries can then be incorporated into these "pre-packaged" MARTs, providing end-users with additional competitive information.

SUMMARY

Data is the building block of useful information. With access to accurate and timely information, appropriate business decisions can be made to maximize profit and gain an advantage over competitors. Most companies today have no shortage of data; however, the data exists in a form that is difficult for humans to access or interpret. The challenge lies in transforming the data into useful information, and data warehousing technology provides the means of achieving this.

Today, a multitude of vendors can assist in the process of building such data warehouses. HP3000 customers can choose either a non-native or a native HP3000 data warehouse solution. A non-native data warehouse solution requires operating in another computing environment (hardware platform, operating system, databases, tools, etc.). This may be a non-issue for companies that already have heterogeneous computing environments. For companies with an HP3000-only environment, the native data warehouse solution, Intelli-Vault, can be very attractive and appropriate because it offers a "familiar" and comfortable environment for deploying their data warehouses. The Intelli-Vault data warehousing bundle combines the HP3000 and IMAGE/SQL with BridgeWare for data extraction and loading, OMNIDEX for query optimization, and BrioQuery™ for presentation.

With the emergence of vertical industry MARTs, implementing data warehouses can be made even simpler and faster for vertical industry customers. Regardless of which data warehousing solution is implemented, HP3000 customers can unlock the decision-making power hidden within their mountains of data.
