Phones: 800-442-6861, 630-620-5000
Fax: 630-691-0718
Table of Contents
Introduction
Why Do We Need to Manage Metadata?
Data from Everywhere
Data as a Potential Resource
Metadata Status in an Organization
Metadata Management: Historical Approaches
Data Administration: Data Dictionaries
The Havoc of Distributed Systems
Early Metadata
Early Data Warehousing
Metadata Management Trends
Growth of Data Warehousing
The Resurgent Repository
An Enterprise View of Metadata Management
Managing Metadata Within and Across Warehousing Efforts
A Repository as a Metadata Integration Platform
Lifecycle Issues
What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better
Non-proprietary Relational Database Management System
Fully Extensible Meta Model
Application Programming Interface (API) Access
Central Point of Metadata Control
Impact Analysis Capability
Naming Standards Flexibility
Versioning Capabilities
Robust Query and Reporting
Data Warehousing Support
Conclusion
Glossary
Introduction
As a recent article in the Wall Street Journal pointed out, data is becoming an abundant
commodity: we can get it anywhere and everywhere. However, just as with any other item that
becomes a commodity, data by itself is losing value, in part simply because there is so much of it.
Not too long ago, the situation was exactly the opposite: raw information was extremely difficult to
acquire and therefore highly prized. For example, fifty years ago, investors interested in trading
futures on a commodity such as coffee beans would ask: "How is the coffee bean harvest doing in
South America?" To get their answer, they hired agents to investigate coffee production as
well as other important commodities. The reason they were willing to go to these lengths is that the
answers to these questions were critical, and could make an investor an overnight millionaire in the
futures market. Today, you can get higher quality raw data, including satellite pictures if you wish,
over the Internet, for less than ten dollars a month. More data can be obtained in 15 minutes, from a
broader and richer set of sources than any investor 50 years ago could get with months of effort.
The availability of data, in such staggering quantities, makes it difficult to deal with. When data
was scarce, the amount available could be consumed by users with fairly
primitive tools. Now that data is plentiful, we are faced with the problem of separating the significant
facts from the rest. There are many analogies which can be used to describe the situation. Some say
it is like trying to take a drink from a fire hose; others say it is like trying to find one specific grain of
sand on a stretch of beach. In all cases, the problem being described is singular, pervasive, and
compelling, because one thing that hasn't changed is the fact that the ability to use data to get
answers to business questions is still the key to making money and achieving success in business
endeavors. Therefore, like the Sorcerer's Apprentice, we have too much of a good thing and we
don't know how to control it.
We do know that in order to make good business decisions we need good data. The process which
has been generally accepted as good business practice has been described as follows:
first, acquire quality raw data;
second, combine and integrate the data to make useful information; and
third, analyze the information and make high quality decisions.
The acute issue is knowing which data to use to create useful information. The end goal is to make
better and higher quality decisions than your predecessors could. The torrent of raw data has added
more choice, and therefore complexity, to the process. What we need to do is put the data in
context, give the data meaning, relevance, and purpose, and make it complete and accurate. Data
which is viewed in this light is called information, because we can use it for deductive and inductive
insights which lead us to quality decisions.
This paper discusses the issues confronting management today as they grapple with the floods of
raw data and the pressing need to know what they have as data assets and how to achieve the goal
of better decision making.
Why Do We Need to Manage Metadata?
As mentioned in the introduction, raw data is proliferating at a rapid rate. Data is flowing into the
company from suppliers and customers. And, the internal systems of the corporation are adding
their share.
Data From Everywhere
Corporations are coupling together in webs of suppliers, partners, and customers, exchanging a
myriad of information through a spectrum of technologies such as Electronic Data Interchange
(EDI) systems, Electronic Funds Transfer (EFT) systems, email, and a host of other data acquisition
and networking applications.
At the same time, existing legacy systems within enterprises continue to generate data on orders,
sales, revenues, employee information, manufacturing schedules, inventory, fleet status, and every
other parameter imaginable. As computers become more and more affordable, as storage costs
continue to plummet, as user sophistication increases in the use of information technology, the
proliferation of the technology adds to the exponential growth of the data it generates.
What do we know about the data being generated by these systems? First of all, we know that it is
by and large dispersed across the enterprise. Each department, division, group, branch, section or
any other subdivision is today capable of generating its own unique caches of data. The information
technology advances of the last 20 years have added significantly to the amount and depth of data
produced, managed, and stored. The waves of management interest in achieving operating
efficiencies by centralizing, decentralizing, and reengineering, along with the technology whiplash
from mainframe systems to old two-tier client/server architectures to new three-tier client/server
architectures, have created the opportunity for the data in one group to have a different meaning in
another group in the same organization. This data disparity is exacerbated by readily available
CASE tools, rapid application development tools, application and code generators, underutilized
data models and definitions, database products, spreadsheets, and other client-friendly products,
and by a lack of leadership in management.
Secondly, we know that along with the dispersion there exists a view that the data generated by
each group belongs only to itself, and is intended only for its own uses.
Finally, we know that because of the first two observations, the potential for integrating these
disparate data elements across various departments is poor without some significant work.
Recently, management has begun to recognize the value of using data as a corporate asset. The idea
of using all of the organization's data to get a complete picture of the enterprise is today's ideal. At
the very least, management is recognizing the need to view data from multiple departments to get
some kind of combined view of operations. The concept of a data warehouse has emerged as a
technology by which management can get a single comprehensive view of the state of the
organization. Data is extracted at regular intervals from existing systems and placed in the
warehouse, summarized to allow management to look at trends, but also available in detail for
drill-down data access and analysis.
However, in many companies, the same data element may be used by different divisions to mean
different things. Manufacturing may exclude work in process from an inventory analysis, while
purchasing does not, for example. Different groups may have different standards and approaches to
defining prospects.
Data as a Potential Resource
Faced with these dilemmas, management has realized that data is a resource but only if all its
important attributes are known and understood. Data must be set in context, have meaning to its
users, be relevant, and have purpose. It is not enough to know that the inventory levels have ranged
between two values over time. One must also know what the definition is of inventory levels. It is
not enough to know that the value of a certain case of French wine has increased over time. One
must also know what has happened to the relative value of the French Franc to the Dollar, and
whether that value has been adjusted for fluctuations in the currency exchange.
Data must also be complete and accurate. If there are multiple sources for a particular data element,
which one is being used in the data warehouse, and why? What are the business rules which impact
how we view data? If we have calculated some data elements, such as profitability, what equations
and formulas have been used to derive those results? Only when these are known, understood, and
applied can data be fully utilized, and only then can we begin the reliable building of information from
the data which ultimately leads to quality decision making.
The need to understand the data leads to a need for managing the data. This need is particularly
acute in systems such as data warehouses whose primary purpose is to provide answers and supply
a fertile ground for exploration and insight.
Having thus established a requirement to understand and manage the properties of the data, the
question then becomes, what is the best mechanism for achieving this? What we are really talking
about then is a store of attributes about the data, or data about data. Semanticists have termed this
concept "metadata," from the Greek "meta," which means a later stage, transcending, or situated
behind. Literally, then, we are talking about data that sits behind the operational data, and that
describes its origin, meaning, derivation, etc. (What is gross sales? — Dollars or French Francs,
quarterly or annualized, what system does it come from, when is it extracted, etc.?) Metadata can
range from a conceptual overview of the real world to detailed physical specifications for a
particular database management system.
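Metadata of this sort can be made concrete with a small sketch. The record below is purely illustrative; the field names and values are assumptions, not drawn from any particular repository product, but they show the kinds of attributes (origin, meaning, derivation, timing) just described:

```python
from dataclasses import dataclass

@dataclass
class MetadataEntry:
    """Illustrative metadata record for one warehouse data element."""
    element_name: str   # business name of the element
    source_system: str  # operational system it is extracted from
    definition: str     # business meaning, in plain language
    unit: str           # e.g. currency and reporting period
    derivation: str     # formula or rule used to compute the value
    extracted_at: str   # when the data is pulled from the source

# "Gross sales" described by its metadata rather than by its values:
gross_sales = MetadataEntry(
    element_name="gross_sales",
    source_system="ORDER_ENTRY_LEGACY",
    definition="Total invoiced sales before returns and discounts",
    unit="USD, quarterly",
    derivation="SUM(invoice_line.amount) per fiscal quarter",
    extracted_at="nightly batch, 02:00",
)

print(gross_sales.unit)
```

A user who asks "what is gross sales?" consults this record rather than guessing at currency, period, or source.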
A data resource becomes useless without readily available high quality metadata. Its primary
objective is to provide a comprehensive guide to the data resource.
Metadata Status in an Organization
If organizations today are having problems managing data, what can we say about their ability to
manage metadata? Most companies suffer from the "ready, fire, aim" syndrome: they are in such a
rush to implement systems that the planning aspects of most projects are the first to suffer.
Pressure from management and users to gain the information or functions they need to do their work
leads inevitably to a rushed implementation where there is little thought given to coordinating data
elements with other groups who may use the same concept and few if any resources dedicated to a
careful documentation of the properties of the attributes, the business rules used in their derivation,
and so on.
In short, the problem continues. In most companies, the metadata situation is worse than the data
situation. Along with the disparate data arriving from multiple sources from within and outside the
corporation, there are multiple tools creating metadata in a variety of formats.
It is typical that companies rush to implement data warehousing systems and then find
themselves in a metadata dilemma. That is, they have a critical need for readily available high quality
metadata to leverage their data resource, yet the organization has no system in place for maintaining
adequate metadata. As a result, many data warehousing projects slow down as an organization
grapples with these issues brought about by disparate data and poor quality metadata. In any given
project, there is a need to include business experts, domain experts, and data experts so that the
metadata that is formed is relevant and useful as applied to the project's purpose.
Metadata Management: Historical Approaches
Managing the data assets of an organization effectively has been the goal of Information Technology
since its inception. As systems have increasingly become more diverse, distributed, and complex,
the management of the data assets has become increasingly difficult but nevertheless critical to the
corporate entity.
Data Administration: Data Dictionaries
Early in the days of Information Technology, all data was defined and maintained within the
computer program itself. These were total and complete packages of logic and purpose (64K
Assembler core programs). There was no need to share data between systems and programs
because one couldn't, without great effort, transfer data between different physical computer
systems. While the need and demand was in place to get multiple programs to work together
sequentially to get a given job done, the state of the art simply did not allow it.
In the late 1960s and early 1970s, the technology improved to the point where multiple programs
could run sequentially against a given data set to solve business problems. For example, a batch run
could use a data set such as a collection of checking account transactions and calculate a new
balance. This required some coordination within the set of programs as they used the system holding
the transactions and then later as they accessed account balances, deposits, etc. In this evolution of
the technology, the overhead of having each program be its own little environment was not a tenable
solution, and so early versions of data coordinators were developed which were simple data
"dictionaries" that were shared by programs looking for data to use in their logic processes. Each
program would load the data definitions and locations it needed for its run from a common data
dictionary. These dictionaries were most likely managed by an IS organization that was centralized
and tightly controlled.
As developers became more sophisticated over time, data dictionaries evolved to provide more
than just data attribute descriptions. They were also able to track which applications accessed which
pieces of the database. This means that managers who took advantage of the capabilities of the data
dictionary and did a good job of designing and populating the data dictionary found themselves in
the enviable position of being able to maintain their systems more easily than their counterparts who
did not. For example, suppose a user wants to change a SKU number definition from five digits to seven.
How many programs need to be changed to effect this enhancement? If a manager has done the job
properly, this question can be answered by a simple query into the data dictionary. Such a centrally
designed and maintained system, which holds the data definitions as well as CASE information
about which applications use which pieces of the database is sometimes called a repository. This
concept will be expanded in later sections of the paper.
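The SKU example above can be sketched as a query against the dictionary's cross-reference tables. The schema, table names, and program names below are hypothetical, and an in-memory database stands in for the dictionary itself:

```python
import sqlite3

# In-memory stand-in for a data dictionary's cross-reference tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE elements (element_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE program_usage (program TEXT, element_id INTEGER);
""")
db.execute("INSERT INTO elements VALUES (1, 'SKU_NUMBER')")
db.executemany("INSERT INTO program_usage VALUES (?, ?)",
               [("ORDER_ENTRY", 1), ("INVENTORY_RPT", 1), ("PICKLIST_GEN", 1)])

# Impact analysis: which programs must change if SKU_NUMBER grows
# from five digits to seven?
affected = sorted(row[0] for row in db.execute("""
    SELECT program FROM program_usage
    JOIN elements USING (element_id)
    WHERE elements.name = 'SKU_NUMBER'
"""))
print(affected)
```

The value of the dictionary lies in the `program_usage` cross-reference: without it, answering the same question means searching source code by hand.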
The Havoc of Distributed Systems
As demand for more IS technology blossomed and the technology advanced with lower cost
midrange and distributed systems, these systems found themselves as islands of automation
ensconced in individual departments and working on specific business problems which only related
to the specific department. Data was defined in a decentralized manner, by the business unit, with no
central arbiter, if it was defined at all. Worse yet, as new and different CASE tools came on the
market, and as new and different architectures came into vogue, such as object oriented databases
and client/server architectures, different tools were used to define the data for different applications.
In some cases, the same element was defined multiple times with slight variations, as new systems
were used to create applications to help users solve various business problems.
Exchanging data between systems became risky, highly structured, and infrequent. Importing data
from external systems and environments became a labor intensive endeavor and was avoided if
possible. However, in many cases it became necessary. For example, a call management system for
customer support needs to get all of the customer information from the legacy databases. In order to
do this successfully, elaborate programs needed to be written to "scrub" the data clean. With
inadequate metadata, data incompatibilities could cause programs to fail, keeping programmers up
nights debugging operational systems, as well as taking down the call management system, for
example.
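A "scrub" step of the kind described can be sketched as a small normalization pass. The legacy field names and formats below are hypothetical, chosen only to illustrate the idea:

```python
def scrub_customer_record(raw: dict) -> dict:
    """Normalize one legacy customer record before import.

    A minimal illustration of data scrubbing; real cleansing programs
    also handle duplicates, missing values, and referential checks.
    """
    return {
        # pad variable-width legacy ids to a fixed width
        "customer_id": str(raw["CUST-ID"]).strip().zfill(8),
        # collapse runs of whitespace and normalize capitalization
        "name": " ".join(str(raw["CUST-NAME"]).split()).title(),
        # keep digits only, dropping punctuation styles that vary by source
        "phone": "".join(ch for ch in str(raw["PHONE"]) if ch.isdigit()),
    }

legacy = {"CUST-ID": " 4521 ", "CUST-NAME": "ACME   tool &  die",
          "PHONE": "(630) 555-0100"}
clean = scrub_customer_record(legacy)
print(clean)
```

Without metadata describing the source formats, every rule in such a program is a guess, which is exactly why these efforts were labor intensive and fragile.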
Early Metadata
Just as any other kind of warehouse needs to keep an inventory of its holdings, early implementers
of data warehouses found that they needed to keep track of what data the warehouse was currently
holding along with the "pedigree" of that data. To do this, the idea of a metadata repository was
created, similar to a data dictionary, to give users and technicians information about the data, such
as where the data came from, what rules were used in creating the data, what the data elements
meant, how recent the data was, and so on.
In early implementations, and even up to the recent past, many systems divided the business or
end-user related information from the technical or development directory, so that technical
information about the data, which would be of limited use to an end-user, and could arguably make
the end user's task more difficult, was kept in a separate store from that which was required by the
user.
It is now widely accepted that the metadata component must be designed so that everyone
understands the data in the warehouse. Robert Typanski, Data Manager at Bayer was recently
quoted in Datamation: "Unless people can identify the data that's in the warehouse, they're not
going to be able to access it any better than if it were buried in some legacy operational system."
Early Data Warehousing
Based on the seminal work by Bill Inmon, early adopters of the concept took Inmon literally in
defining a data warehouse as having an enterprise wide scope. The early incarnations of this concept
were gargantuan structures that encompassed summaries of data from all aspects of the business. In
order to tie together all of this data, a tremendous amount of work had to be done to find the legacy
data (data archaeology), build a common data model which was appropriate for an enterprise view
of the business, and then extract the data from the legacy systems. During the extraction, data had to
be not only consolidated but normalized and rationalized so that the resultant picture did not contain
any duplications, contradictions, or other anomalies which could interfere with the accurate and
timely analysis of the consolidated data. These projects often took longer than expected, and
estimated costs were in the millions of dollars, much of which was spent in the data understanding
and data preparation phases of the project.
Large systems with broad scopes, such as the ones described above, are sometimes not as
responsive as users would like: not only from the standpoint of system response time, although
that can be a problem in an enterprise-wide data warehouse, but also from the standpoint
of being able to modify data structures and rules to support specific analyses which are
particular to a single department.
This led to the next evolution of the data warehouse, a special purpose warehouse, or data mart,
which is an application specific implementation, with data derived from the warehouse itself. The
objective of these individual department specific implementations was and continues to be the
generation of better support for management decision making by supplying data in a more relevant
form and in a more responsive manner.
Metadata Management Trends
The awareness of the need to manage metadata has been an offshoot of the growth of data
warehousing. As the number and diversity of data warehousing implementations began to grow, IT
managers and end users began to realize that the data warehouse was only as useful as the quality,
accuracy, and ease of use of its data.
Growth of Data Warehousing
There is no doubt that data warehousing has grown as a technology and that it has firmly established
itself as a mainstream tool in competitive businesses today. However, along with the success of the
concept came growth in a number of other areas.
Growth across platforms. Initially, the global data warehouses were the domain of a few
platforms and a few vendors. As their popularity grew, data warehousing technologies such
as parallel computing, parallel databases, and OLAP/ROLAP were extended into smaller
and more pervasive platforms, so that now the range of platforms which claim data
warehousing capabilities spans from Microsoft Windows NT, through the UNIX domain, and on up
to giants such as NCR, DEC, and IBM.
Growth across tools. Success has many parents, and failure is an orphan. This old saw is
certainly true of data warehousing. As the bandwagon got rolling, vendors sprang from all
directions to jump on, each touting special features and functions. From legacy data
extraction tools to maintenance and scheduling tools, and yes, even tools that purported to
handle metadata, vendors and products proliferated.
Growth across departments. Success also breeds demand, which is exactly what happens
when department A gets a new data mart and starts showing colleagues in department B how
easily they can now access consistent data. Soon, department B has its own data mart, and
departments C, D, and E are not far behind.
Throughout all of this growth, however, the subject of metadata management remained fractured
and dispersed. Extraction tools, loading tools, cleansing tools and analysis tools, all claimed to have
a piece of the metadata problem solved. In many ways, this was akin to the six blind men of
Hindustan who, in describing an elephant, each described the entire animal by the one part
he was touching: the trunk, the tail, the ear, the body, the tusk, or the leg. In fact, until
recently, there has been little progress in terms of a solution to the integrated metadata issue.
However, enterprises today are clamoring for such a solution, and for good reason. Users need to
know what they are looking at if they are to make intelligent decisions and take informed actions
based on the data they have received from the data warehouse. A system cannot leave it to the
user to guess the business rules embedded in a calculated data element, because different
assumptions will lead to different courses of action that will inevitably conflict with each other. Yet,
in most cases today, metadata is spread across different components of the warehouse, from the
scheduler to the data extraction/cleansing tools which claim to build metadata as they are extracting
and cleansing, to the loading tools, to the OLAP tools, which need to present metadata to users in
order to navigate. Business rules are separated from technical metadata, as they should be, but are
kept by different systems in different formats with different user interfaces. In the case of multiple
data marts spread across the enterprise, this situation is multiplied by the number of marts. And, if
there is a need for a user in department A to use a data mart created by department B, then in most
cases, that user has to relearn the metadata navigation for that system. Clearly, users would like to
go to one place and be able to see either the business metadata or the technical metadata in a single
system with a single user interface and single screen metaphor for any and all data residing in the
enterprise.
The Resurgent Repository
The scenarios depicted above are the primary drivers behind the resurgence of the concept of a
single metadata repository. A repository is the vehicle for managing metadata. Simply put, a repository is
where information (metadata) about an organization's information systems components (objects,
table definitions, fields, business rules and so on) is held. A repository also contains tools to facilitate
the manipulation and query of the metadata.
A repository has a number of potential applications within an enterprise schema that deliver value
beyond that exclusively in the domain of a data warehouse. For example a repository can:
aid in the integration of the views of disparate systems by helping understand how the data
used by those systems are related;
support rapid change and assistance in building systems quickly by impact analysis and
provision of standardized data;
facilitate reuse by using object concepts and central accessibility;
assist in implementation for data warehousing (A central repository can be built in advance of
the warehouse, purely for data and application integration purposes, and then be ready to
support a warehouse implementation. Alternatively, if the repository is built in support of the
initial warehousing effort, it can be of enormous value in deploying subsequent efforts.);
support software development teams.
One of the primary benefits of a repository is that it provides consistency of key data structures and
business rules, which makes it easier to tie warehousing efforts (data marts) together across the
enterprise. Indeed, one of the major criticisms leveled at the proponents of independent data
marts is that deploying data marts without a unifying infrastructure simply perpetuates the "islands of
automation" problems we have with our legacy systems.
The repository also leverages an organization's investment in existing legacy systems by documenting
program information for future application development.
An Enterprise View of Metadata Management
Metadata is therefore a key resource to the warehouse during all phases of its life cycle, from the
warehouse construction, through the user access, and into the maintenance and update of the data it
holds.
During the past few years there has been a tremendous level of activity in the vendor metadata
repository field, largely due to the rapid growth in the data warehouse and data mart markets.
Business has come to clearly see the issues surrounding the disparate data as they attempt to
leverage these data assets across their organizations, and vendors are responding by building
enterprise level strategies. As an example of the state of metadata today, below is an excerpt from a
recent Datamation article:
"Syncing metadata between two products — different functions, different metadata
stores, different vendors — is a huge challenge, too. To do it, you'd have to get the
right piece of metadata at the right level of detail from one product and map it to the
right piece of metadata at the right level of detail in the other product, then straighten
out any differences in meaning or in coding between them. And then do it again for
each of the hundreds of other points in metadata space that the two products share in
common. And then figure out what to do when the metadata changes in one of the
products. And if the metadata structure (yes, that would be the meta-metadata) of a
product changes, you get to do it all again.

"Syncing the metadata between two products is tough. Syncing metadata among each
of the half-a-dozen tools it could take to build, run, and access a data warehouse is an
almost unthinkable task. But for a smooth, robust, efficient data warehouse operation,
it's sync or sink.

"What you really need is a single, comprehensive metadata source that is accessible to
all of the tools you buy — the tools you buy for the data warehouse, certainly, but also
the tools you buy for virtually every other IS function, as well. One metadata source,
no syncing."
Vendors are beginning to respond to these kinds of pressures and are trying to solve the enterprise
metadata conundrum. However, the answers are not simple. Metadata is collected and/or generated
in a variety of places in the Data Warehousing architecture, from data extraction, from data
manipulation and application specifics, and from query engines such as OLAP. Today, each of these
areas has a number of vendors who offer products, and each vendor has a slightly different
approach or paradigm to their metadata solution. There are several possible approaches, and some
vendors are propagating the concept of a single enterprise level metadata repository to integrate the
enterprise's disparate metadata Tower of Babel. The simplest approach is to have all vendors utilize
the same semantics, paradigm, etc. and collect all of the metadata in a single format in a single
location.
Previous efforts to create a single repository to meet all needs have failed as they tried to be all
things to all users. Even standards councils have had a difficult time creating information models that
work for multiple vendors in heterogeneous environments.
The announcement by Microsoft Corp. and PLATINUM technology, inc. highlights a practical
approach to providing enterprise-wide metadata management in a heterogeneous environment.
Working with leading application development and data warehousing vendors, Microsoft will
publish an Open Information Model for use by independent software vendors to share metadata
using a single model.
In addition, over time, PLATINUM technology will port the Microsoft Repository to MVS and
major UNIX platforms, embedding the core Microsoft Technology in its repositories, so users can
implement a single repository solution across all leading platforms and databases.
By teaming together, the two leaders in repository technology will deliver a staged solution to cross
platform metadata management.
A second alternative is to partner with other vendors in various domains within the data warehouse
architecture (extraction, loading, OLAP, etc.), and build translation schema from their metadata to a
canonical metadata. This in effect gives users the option of a suite of products whose metadata
can all be translated into a single metadata "Esperanto." Vendors such as PLATINUM technology,
which have an extensive spectrum of products in the data warehouse world, can then ensure that all
of their own products speak this "metadata Esperanto" and, via teaming arrangements, include a complete
suite of warehousing components. A single metadata repository then could be built at an enterprise
level, and a corporation could attain some measure of consistency and manageability by
implementing this concept.
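The translation approach can be sketched in miniature. The tool names, field layouts, and the canonical schema below are all hypothetical; the point is only that each tool's native metadata maps onto one shared form:

```python
CANONICAL_FIELDS = ("name", "type", "source")

# Hypothetical per-tool field layouts mapped onto the canonical schema.
FIELD_MAPPINGS = {
    "extract_tool": {"name": "FIELD-NAME", "type": "PIC", "source": "DATASET"},
    "olap_tool": {"name": "dimension", "type": "datatype", "source": "fact_table"},
}

def to_canonical(tool: str, record: dict) -> dict:
    """Translate one tool's metadata record into the shared canonical form."""
    mapping = FIELD_MAPPINGS[tool]
    return {canon: record[mapping[canon]] for canon in CANONICAL_FIELDS}

a = to_canonical("extract_tool",
                 {"FIELD-NAME": "REGION", "PIC": "X(10)", "DATASET": "SALES.MASTER"})
b = to_canonical("olap_tool",
                 {"dimension": "region", "datatype": "string", "fact_table": "sales"})
print(a)
print(b)
```

Once every product's records arrive in the canonical form, a single enterprise repository can store and compare them regardless of which vendor produced them.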
This area will be of intense interest to corporations as they build their corporate architectures and
approaches to "enterprise" metadata. This is a very active area for vendors at this time with many
levels of interactivity. PLATINUM and Viasoft are integrating the products of their acquisitions into
their mainstream tool sets. Other major vendors are arranging alliances and developing bridging
software. Needless to say, in an area this active, there will be multiple degrees of integration,
features, and functions, all dynamic and changing with every release. In the following sections, we
will review important characteristics in an enterprise metadata product.
Managing Metadata Within and Across Warehousing Efforts
Most large organizations today have had some experience with data warehousing implementations.
Today, these typically take the form of data mart style implementations in various departmental
focus areas such as financial analysis or customer focused systems assisting business units. Many
organizations have multiple warehousing initiatives underway simultaneously and these systems will
most likely be based on products from multiple data warehousing vendors, in the typical
decentralized approach of most corporations. This approach has worked to date in that it has
allowed reasonably rapid implementation of these systems and demonstrated to the organization the
benefit and potential of data warehousing as a business tool at a fraction of the cost of the enterprise
data warehouse model.
However, as pointed out earlier, this is the typical "ready, fire, aim" approach which got us to the
legacy data Tower of Babel we have today, and in keeping with that, some areas of the business are
beginning to show signs of stress as a result of this approach to implementing data warehousing.
Data and metadata are spread across multiple data warehousing systems, and system managers are
wondering how best to coordinate and manage the dispersed metadata mess they have today. How
do we maintain consistency when business rules change as a result of corporate reorganizations,
regulatory changes, or other changes in business practices? What happens when an application
team needs to change a technical definition? How many places are impacted by each of these potential
changes? These issues among others are forcing businesses to take a larger view — an enterprise
view — of metadata management systems. Coordinating metadata across multiple data warehouses
is one significant step in the right direction, and a repository is just the tool to do that.
A Repository as a Metadata Integration Platform
Ideally, a corporation should adopt a repository as a metadata integration platform, making
metadata available across the organization. This would serve to manage key metadata across all of
the data warehouse and data mart implementations within an organization. This would allow all of
the participants to share common data structures, business rule definitions, and data definitions from
system to system across the enterprise.
The platform would accept and manage information from multiple sources. These would include
systems from major vendor technology databases (e.g. IBM, Informix, Oracle, Microsoft, Sybase,
etc.) and across a broad spectrum of tools, from extraction tools to analysis tools. On the output
side, the system should provide open access by multiple tools as well as APIs for custom needs.
The metadata repository also facilitates consistency and maintainability. It provides a common
understanding across warehouse efforts promoting sharing and reuse. If a new data element
definition is required for a data mart implementation, the platform should permit versioning to
support the need. With a shared metadata repository, the exchange of key information between
business decision makers (facilitated by solid end-user access tools) becomes more feasible.
And, when multiple data marts and data warehouses are involved, a central metadata platform will
simplify and reduce the effort required to maintain them when viewed as a whole.
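To make the idea of a shared metadata integration platform concrete, the following sketch shows multiple data marts resolving a data element against one central store. This is an illustration only — the class and method names are invented, and no vendor product's API is implied.

```python
# A minimal sketch (not any vendor's API) of a shared metadata store
# that multiple data marts consult for common definitions.

class MetadataRepository:
    """Central store of business and data definitions shared across marts."""

    def __init__(self):
        self._elements = {}  # element name -> definition record

    def register(self, name, definition, business_rule=None):
        self._elements[name] = {
            "definition": definition,
            "business_rule": business_rule,
        }

    def lookup(self, name):
        # Every mart resolves the element the same way, so definitions
        # stay consistent from system to system.
        return self._elements[name]

repo = MetadataRepository()
repo.register(
    "customer_id",
    definition="Unique surrogate key for a customer",
    business_rule="Assigned at first contact; never reused",
)

# Both a finance mart and a sales mart share one definition:
assert repo.lookup("customer_id")["definition"].startswith("Unique")
```

Because every mart calls the same `lookup`, a business-rule change made once in the repository is seen everywhere, which is precisely the consistency benefit described above.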
Lifecycle Issues
Repository systems need to contribute to and integrate with the existing legacy system environment
and play an active role throughout the lifecycle of data warehousing systems to be truly considered
enterprise metadata repositories.
Documenting database and legacy information are important capabilities in metadata repositories.
Legacy models provide the information sourcing, data inventorying, and design that are key to
developing an effective data warehouse. The metadata surrounding the acquisition, access, and
distribution of warehouse data is the key to providing the business user with a complete map of the
data warehouse.
The repository should play an active role in the entire life cycle of the data warehouse and in
everything that delivers system and business value, including existing legacy systems as sources,
third-party tools, and so on. This leverages the repository's role so that it contributes not only in the
development phases but also where the bulk of the cost of all IS systems lies: downstream support
and maintenance. The relevant tools and components include the systems management, database
management, business intelligence, and application development categories listed below.
Systems management tools that can be used to manage jobs, improve performance, and
automate operations, not only in operational systems but also in data warehouse systems.
Database management tools that can help create and maintain the database management
systems for operational systems, data warehouses, and data marts.
Data movement tools that transform and integrate disparate data types and move data reliably
to the warehouse.
Business intelligence tools that provide end-user access and analysis for making business
decisions.
Business applications that provide packaged warehouse solutions for specific markets.
Data warehouse consulting that uses a methodology based on the experiences of hundreds of
other companies, thereby reducing the risk associated with making uninformed business
decisions.
Application development solutions that help you build, test, deploy, and manage operational
and warehouse applications throughout the enterprise.
CASE tool support that provides consistency and maintainability immediately by establishing
consistent terminology and structures.
Repository-to-CASE interfaces that enable an organization to manage multiple CASE
workstations from the repository. These tools are designed to allow an organization to better
utilize the data maintained in their CASE workstations by providing a central point of control
and storage.
Sophisticated version control, collision management, and bi-directional interfaces, enabling
the sharing and reuse of metadata among programmers and analysts working independently.
What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better
Some areas to focus on in reviewing repository functionality are discussed in the following sections.
Nonproprietary Relational Database Management System
A repository should ideally use an industry standard DBMS which provides significant advantages
over vendor-developed DBMSs. These advantages include advanced tools and utilities for database
management (such as backups and performance tuning) as well as dramatically enhanced reporting
capabilities. Furthermore, maintainability and accessibility are enhanced by an "open" system.
Using a standard database also allows the repository vendor to focus on the quality of the
repository, not the features of the database management system. In addition, it allows the vendor to
take advantage of new features made available by the DBMS vendor.
Fully Extensible Meta Model
A repository should be fully self-defining and extensible, built on a common
entity/relationship model. By using a model that reflects industry standards, it can provide users
with the ability to easily customize the meta model to meet their specific needs. The repository
should support the following meta model extensions:
adding or modifying an entity type,
adding or modifying a linkage between entity types (associations or relationships),
adding user views (with different screen layouts or validations) to entities or relationships,
adding, deleting, or modifying attributes of relationships or entities,
modifying the list of allowable values for an attribute type,
adding or modifying commands or user exits,
adding custom command macros, and
adding or modifying help and informational messages.
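The extensions above all follow from one design principle: the meta model is itself data, so extending it means adding rows rather than changing code. The sketch below illustrates that principle with invented, illustrative names; it is not the structure of any particular repository product.

```python
# Sketch of an extensible meta model: entity types, relationship types,
# and their attributes are ordinary data, so users can add new ones
# without modifying the repository's code. All names are illustrative.

meta_model = {
    "entity_types": {
        "Table": {"attributes": ["name", "owner"]},
        "Column": {"attributes": ["name", "datatype"]},
    },
    "relationship_types": {
        "contains": ("Table", "Column"),
    },
}

def add_entity_type(model, name, attributes):
    model["entity_types"][name] = {"attributes": list(attributes)}

def add_relationship_type(model, name, from_type, to_type):
    model["relationship_types"][name] = (from_type, to_type)

# A user extension: model data marts and which tables feed them.
add_entity_type(meta_model, "DataMart", ["name", "business_unit"])
add_relationship_type(meta_model, "feeds", "Table", "DataMart")

assert "DataMart" in meta_model["entity_types"]
assert meta_model["relationship_types"]["feeds"] == ("Table", "DataMart")
```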
The vendor should also support the Microsoft Open Information Model, which will allow
information to be shared across multiple vendor products. Ideally, the vendor will be part of the
Open Information Model design team.
Application Programming Interface (API) Access
API access to the repository can provide an organization with the flexibility needed to create a
metadata management system that suits its unique needs. Such an architecture makes the repository
more powerful by allowing users to create custom applications and programs.
In addition, the API separates the metadata from the tools that access and manipulate it. Because
the tools manipulate metadata only through the API, they gain transparent access to the data: if the
underlying data structures change, the tools do not need to change. This allows for greater
efficiency and flexibility in an organization's application development.
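The insulation an API provides can be illustrated with a short sketch. The function and field names are hypothetical; the point is only that tools written against a stable call keep working when the storage layout beneath it changes.

```python
# Sketch of API-mediated access: tools call get_element() rather than
# reading the underlying structures directly, so the storage layout can
# change without the tools changing. All names are illustrative.

_storage_v1 = {"order_total": ("DECIMAL(10,2)", "Sum of line amounts")}

def get_element(name):
    """Stable API call: returns a uniform record regardless of how the
    repository stores its metadata internally."""
    datatype, description = _storage_v1[name]
    return {"name": name, "datatype": datatype, "description": description}

# A reporting tool written against the API keeps working even if the
# repository later moves _storage_v1 into relational tables.
elem = get_element("order_total")
assert elem["datatype"] == "DECIMAL(10,2)"
```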
Central Point of Metadata Control
The repository serves as a central point of control for data, providing a single place of record about
information assets across the enterprise. It documents where the data is located, who created and
maintains the data, what application processes it drives, what relationship it has with other data, and
how it should be translated and transformed. This provides users with the ability to locate and utilize
data that was previously inaccessible. Furthermore, a central location for the control of metadata
ensures consistency and accuracy of information, providing users with repeatable, reliable results
and organizations with a competitive advantage.
Impact Analysis Capability
If the repository has an impact analysis facility, it can provide virtually unlimited navigation of the
repository definitions to reveal the total impact of any change. Using impact analysis views, users
can easily determine where any entity is used and what it relates to.
Such a facility answers the real questions of the analysis phase without forcing a user to
sift through large quantities of unfocused information. Furthermore, sophisticated impact analysis
capabilities allow better time estimates for system maintenance tasks. They also reduce the amount
of rework resulting from faulty impact analysis (e.g., a program not being changed as a result of a
change to a table that it queries).
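At its core, impact analysis is a traversal of the repository's "used by" relationships. The sketch below shows one plausible form of that traversal; the object names and the graph contents are invented for illustration.

```python
# Sketch of impact analysis as a traversal of repository relationships
# ("X is used by Y"). The graph contents are illustrative.

uses = {
    "CUSTOMER_TABLE": ["LOAD_JOB_01", "CREDIT_CHECK_PGM"],
    "CREDIT_CHECK_PGM": ["MONTHLY_RISK_REPORT"],
    "LOAD_JOB_01": [],
    "MONTHLY_RISK_REPORT": [],
}

def impact_of(obj, graph):
    """Return every object directly or indirectly affected by a change."""
    impacted, stack = set(), [obj]
    while stack:
        for dependent in graph.get(stack.pop(), []):
            if dependent not in impacted:
                impacted.add(dependent)
                stack.append(dependent)
    return impacted

# Changing CUSTOMER_TABLE touches the load job, the program that
# queries it, and the report built from that program's output:
assert impact_of("CUSTOMER_TABLE", uses) == {
    "LOAD_JOB_01", "CREDIT_CHECK_PGM", "MONTHLY_RISK_REPORT"
}
```

This is exactly the query that prevents the rework scenario above: before changing the table, the analyst sees every program and report downstream of it.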
Naming Standards Flexibility
A repository should provide a detailed map of data definitions and elements, thereby allowing an
organization to evaluate redundant definitions and elements and decide which ones should be
eliminated, translated, or converted. By enforcing naming standards, the repository assists in
reducing data redundancies and increasing data sharing, making the application development
process more efficient and therefore less costly. In addition, an easily enforceable standard
encourages organizations to define and use consistent data definitions, thereby increasing the reuse
of standard definitions across disparate tools.
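An "easily enforceable standard" implies a mechanical check. The sketch below validates element names against one common style of rule (upper snake case with an approved class-word suffix); the specific rule and word list are illustrative assumptions, not a standard the paper prescribes.

```python
# Sketch of an automated naming-standard check. The rule shown
# (UPPER_SNAKE_CASE with an approved class-word suffix) is one common
# convention, used here only as an illustration.

import re

APPROVED_CLASSWORDS = {"ID", "CD", "AMT", "DT", "NM"}

def conforms(element_name):
    """True if the name is UPPER_SNAKE_CASE and ends in a class word."""
    if not re.fullmatch(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)+", element_name):
        return False
    return element_name.rsplit("_", 1)[1] in APPROVED_CLASSWORDS

assert conforms("CUSTOMER_ID")
assert conforms("ORDER_TOTAL_AMT")
assert not conforms("custId")        # wrong case, no class word
assert not conforms("CUSTOMER_NUM")  # NUM not in the approved list
```

Run at check-in time, such a rule catches redundant or non-standard names before they proliferate across marts.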
Versioning Capabilities
In repository discussions, "versioning" can have many different meanings. For example, some
version control capabilities are:
version control as in test vs. production (lifecycle phasing);
versions as unique occurrences;
versioning by department or business unit; and
version by aggregate or workstation ID.
The repository's versioning capabilities facilitate the application lifecycle development process by
allowing developers to work with the same object concurrently. Developers should be able to
modify objects to meet their requirements without affecting other developers.
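The first of these meanings — lifecycle phasing, where a test version coexists with a production version of the same object — can be sketched as follows. The class and statuses are illustrative, not a description of any product's versioning scheme.

```python
# Sketch of object versioning by lifecycle status: each change creates a
# new version, so one developer's draft never disturbs the production copy.

class VersionedObject:
    def __init__(self, name):
        self.name = name
        self.versions = []  # list of (version_no, status, definition)

    def new_version(self, definition, status="test"):
        self.versions.append((len(self.versions) + 1, status, definition))

    def current(self, status="production"):
        # Latest version carrying the requested lifecycle status.
        for no, st, definition in reversed(self.versions):
            if st == status:
                return no, definition
        return None

obj = VersionedObject("REVENUE_AMT")
obj.new_version("Gross revenue", status="production")
obj.new_version("Gross revenue net of returns")  # a developer's draft

# Production readers are unaffected by the concurrent draft:
assert obj.current() == (1, "Gross revenue")
assert obj.current(status="test") == (2, "Gross revenue net of returns")
```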
Robust Query and Reporting
The repository should provide business users with a vehicle for robust query and report generation.
The repository should seamlessly pass queries to its own end-user tool or to third-party products
for automatic query generation and execution. Furthermore, business users should be able to create
detailed reports from these tools, increasing the amount of valuable decision support information
they are able to receive from the repository.
Data Warehousing Support
The repository provides information about the location and nature of operational data which is
critical in the construction of a data warehouse. It acts as a guide to the warehouse data, storing
information necessary to define the migration environment, mappings of sources to targets,
translation requirements, business rules, and selection criteria to build the warehouse.
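The source-to-target mappings mentioned above are themselves metadata records. A minimal sketch of what the repository might hold, with invented system and field names, looks like this:

```python
# Sketch of source-to-target mapping metadata a repository might hold
# to drive warehouse construction. All names are illustrative.

mappings = [
    {
        "source": "LEGACY.CUST_MASTER.CUST_NO",
        "target": "DW.CUSTOMER_DIM.CUSTOMER_ID",
        "rule": "cast to integer; drop records failing the cast",
    },
    {
        "source": "LEGACY.ORDERS.ORD_AMT",
        "target": "DW.SALES_FACT.ORDER_AMT",
        "rule": "convert from local currency to USD at booking-date rate",
    },
]

def sources_for(target_table):
    """Answer the business user's map question: where does this come from?"""
    return [m["source"] for m in mappings
            if m["target"].startswith(target_table)]

assert sources_for("DW.SALES_FACT") == ["LEGACY.ORDERS.ORD_AMT"]
```

With mappings stored this way, the same records that drive extraction also serve as the business user's map of the warehouse.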
Conclusion
Organizations are becoming increasingly aware of the limitations of their own systems and internal
data. The attempts to liberate and leverage data across the organization's stovepipes have been
replete with frustration and too many examples of failure. These experiences, coupled with drivers
demanding flexibility in business processes, are hastening the day that businesses will implement an
enterprise level view of metadata. Activity to supply this enterprise level capability is being
aggressively pursued by all major vendors. It is critical that corporations understand the issues at
hand as they adopt enterprise strategies and that they be in a position to evaluate what set of vendor
products are appropriate to their situation.
Glossary
24x7 Lights Out Operations — The use of Systems Management tools to ensure the reliable
movement and update of data from operational systems to analytical systems.
Analytical Data Store — Useful in making strategic decisions, this data storage area maintains
summarized or historical data. This stored data is time-variant, unlike operational systems, which
contain real-time data. Information contained in this data store is determined and collected based on
the corporate business rules.
Application Lifecycle — Includes the following three stages: process and change management,
analysis and design, and construction and testing.
Architecture — A definition and preliminary design which describes the components of a solution
and their interactions. An architecture is the blueprint by which implementers construct a solution
which meets the users' needs.
Availability — A measure of the percentage of time that a computer system is capable of
supporting a user request. A system may be considered unavailable as a result of events such as
system failures or unplanned application outages.
Business-Driven — An approach to identifying the data needed to support business activities,
acquiring or capturing those data, and maintaining them in a data resource that is readily available.
Business-Driven Approach — The process of identifying the data needed to support business
activities, acquiring or capturing those data, and maintaining them in the data resource.
Business Information Demand — An organization's continuously increasing, constantly changing
need for current, accurate information, often on short notice, to support its business activities.
Business Rules — The statements and stipulations that a corporation has set as "standard" in
order to run the enterprise more consistently and smoothly.
Capacity Planning — The process of considering the effects of a warehouse on other system
resources such as response time, DASD requirements, etc.
CASE — Computer Aided Software Engineering.
CASE Management — The management of information between multiple CASE "encyclopedias,"
whether the same or different CASE tools.
Centralized Data Warehouse — A Data Warehouse implementation in which a single warehouse
serves the need of several business units simultaneously with a single data model which spans the
needs of the multiple business divisions.
Change Propagation — The process of generating only the updates from the source databases to
the target database (usually the data warehouse).
Chargeback — The process that data warehouse managers use to ensure appropriate costs are
correctly distributed to the corresponding business units and users so that they can meet financial
reporting requirements.
Client/Server — A distributed technology approach where the processing is divided by function.
The server performs shared functions — managing communications, providing database services,
etc. The client performs individual user functions — providing customized interfaces, performing
screen to screen navigation, offering help functions, etc.
Client/Server Processing — A form of cooperative processing in which the end-user interaction
is through a programmable workstation (desktop) that must execute some part of the application
logic over and above display formatting and terminal emulation.
Consistent Data Quality — The state of a data resource where the quality of existing data is
thoroughly understood and the desired quality of the data resource is known. It is a state where
disparate data quality is known, and the existing data quality is being adjusted to the level desired to
meet the current and future business information demand.
Data — Items representing facts, text, graphics, bit-mapped images, sound, or analog or digital
live-video segments. Data is the raw material of a system supplied by data producers and is used by
information consumers to create information.
Data Access — The process of entering a database to store or retrieve data.
Data Access Tools — End-user oriented tools that allow users to build SQL queries by pointing
and clicking on a list of tables and fields in the data warehouse.
Data Accuracy — The component of data integrity that deals with how well data stored in the data
resource represent the real world. It includes a definition of the current data accuracy and the
adjustment in data accuracy to meet the business needs.
Data Administration — The processes and procedures by which the integrity and currency of the
data in the warehouse are maintained.
Data Analysis and Presentation Tools — Software that provides a logical view of data in a
warehouse. Some create simple aliases for table and column names; others create data that identify
the contents and location of data in the warehouse.
Data Consistency — The result of using a repository to capture and manage data as it changes so
that decision support systems can be continually updated.
Data Definition — Enabling the structure and instances of a database to be defined in a
human- and machine-readable form.
Data Distribution — The placement and maintenance of replicated data at one or more data sites
on a mainframe computer or across a telecommunications network. This is part of developing and
maintaining an integrated data resource that ensures data are properly managed when distributed
across many different data sites. Data distribution is one type of data deployment, which is the
transfer of data to data sites.
Data Mart — A subset of the data resource, usually oriented to a specific purpose or major data
subject, that may be distributed to support business needs. The concept of a data mart can apply to
any data whether they are operational data, evaluational data, spatial data, or metadata.
Data Model — A logical map that represents the inherent properties of the data independent of
software, hardware, or machine performance considerations. The model shows data elements
grouped into records, as well as the associations among those records.
Data Movement — The transportation of data from disparate sources ranging from various
mainframes, client/server machines, and network file servers to a central location, the data
warehouse, in order to create a reliable source of information, usable for strategic decision making.
Data Quality — Indicates how well data in the data resource meet the business information
demand. Data quality includes data integrity, data accuracy, and data completeness.
Data Store — A place where data is stored; data at rest. A generic term that includes databases
and flat files.
Data Transformation — (1) The formal process of transforming data in the data resource within a
common data architecture. It includes transforming disparate data to an integrated data resource,
transforming data within the integrated data resource, and transforming disparate data. It includes
transforming operational, historical, and evaluational data within a common data architecture. (2)
Creating "information" from data. This includes decoding production data and merging of records
from multiple DBMS formats. It is also known as data scrubbing or data cleansing.
Data Warehouse — (1) A subject oriented, integrated, time-variant, non-volatile collection of data
in support of management's decision making process. A repository of consistent historical data
that can be easily accessed and manipulated for decision support. (2) An implementation of an
informational database used to store sharable data sourced from an operational
database-of-record. It is typically a subject database that allows users to tap into a company's vast store of
operational data to track and respond to business trends and facilitate forecasting and planning
efforts.
Database — A collection of data which are logically related.
DBA — Database Administrator.
Decision Support — A set of software applications intended to allow users to search vast stores
of information for specific reports which are critical for making management decisions.
Disparate Data — Data that are essentially not alike, or are distinctly different in kind, quality, or
character. They are unequal and cannot be readily integrated to adequately meet the business
information demand. Disparate data are heterogeneous data.
Distributed Database — A collection of multiple, logically related databases that is provided to
data sites.
End User Data — Data formatted for end-user query processing; data created by end users; data
provided by a data warehouse.
Enterprise — A complete business consisting of functions, divisions, or other components used to
accomplish specific objectives and defined goals.
Enterprise Data Warehouse — An Enterprise Data Warehouse is a Centralized Warehouse
which services the entire enterprise.
FTP (File Transfer Protocol) — A client-server protocol which allows a user on one computer to
transfer files to and from another computer over a TCP/IP network.
Fragmentation — The process in which a packet is broken into smaller pieces, fragments, to fit the
requirements of a physical network over which the packet must pass.
Global Enterprise — A corporate environment not limited by geographic location.
Heterogeneous Data — See disparate data.
Heterogeneous Databases — See disparate databases.
Incremental Refresh — A technique which loads only data which has changed since the last load
into a Data Warehouse or Data Mart.
Information — (1) A collection of data that is relevant to one or more recipients at a point in time.
It must be meaningful and useful to the recipient at a specific time for a specific purpose. Information
is data in context, data that have meaning, relevance, and purpose. (2) Data that has been
processed in such a way that it can increase the knowledge of the person who receives it.
Information is the output, or "finished goods," of information systems. Information is also what
individuals start with before it is fed into a Data Capture transaction processing system.
Information Technology Infrastructure — An infrastructure for the information technology
discipline that provides the resources necessary for an organization to meet its current and future
business information demand. It consists of the data resource, the platform resource, business
activities, and information systems.
Job Management Tools — Tools which include job scheduling, job optimization, charge back,
and output management tools that help operations managers and system administrators manage,
monitor, and coordinate the execution of enterprise-wide IT jobs.
Legacy Data — Another term for disparate data because they support legacy systems.
Metadata — (1) Traditionally, metadata were data about the data. In the common data
architecture, metadata are all data describing the foredata, including meta-praedata and the
meta-paradata. They are data that come after or behind the foredata and support the foredata. (2)
Metadata is data about data. Examples of metadata include data element descriptions, data type
descriptions, attribute/property descriptions, range/domain descriptions, and process/method
descriptions. The repository environment encompasses all corporate metadata resources: database
catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length,
valid values, and description of a data element. Metadata is stored in a data dictionary and
repository. It insulates the data warehouse from changes in the schema of operational systems.
Methodology — A system of principles, practices, and procedures applied to a specific branch of
knowledge.
OLTP — On-Line Transaction Processing.
On-Line Transaction Processing — Processing that supports the daily business operations. Also
known as operational processing and OLTP.
Operational Data Store — Contains timely, current, and integrated information. The data is
typically very granular. These systems are subject oriented, not application oriented, and are
optimized for looking up one or two records at a time for decision making.
Operational Systems — See Legacy Data.
Performance Management Tools — Tools that help warehouse managers monitor, maintain, and
manage warehouse performance in distributed, heterogeneous environments.
Problem Resolution Software — Tools that provide automated problem report management for
help desks, technical support departments, or customer service operations. These products can be
used by support staff, as they assist customers or end users, or as part of an automated call-in
self-help system.
Query — A (usually) complex SELECT statement for decision support. See Ad-Hoc Query or
Ad-Hoc Query Software.
RDBMS — Relational Database Management System.
Reference Data — Business data that has a consistent meaning and definition and is used for
reference and validation (Process, Person, Vendor, and Customer, for example). Reference data is
fundamental to the operation of the business. The data is used for transaction validation by the data
capture environment, decision support systems, and for representation of business rules. Its source
for distribution and use is a data warehouse.
Refresh Technology — A process of taking a snapshot from one environment and moving it to
another environment overlaying old data with the new data each time.
Relational Database Management System — A Database Management System which uses the
concept of two dimensional "tables" to define "relationships" among the different elements of the
database.
Repository — A location, physical or logical, where databases supporting similar classes of
applications are stored.
Repository Environment — The Repository environment contains the complete set of a
business's metadata. It is globally accessible. As compared to a data dictionary, the repository
environment not only contains an expanded set of metadata, but can be implemented across multiple
hardware platforms and database management systems (DBMS).
Reusability — Using code developed for one application program in another application.
Scalability — (1) The ability to scale to support larger or smaller volumes of data and more or
fewer users. The ability to increase or decrease size or capability in cost-effective increments with minimal
impact on the unit cost of business and the procurement of additional services. (2) The ability of a
system to accommodate increases in demand by upgrading and/or expanding existing components,
as opposed to meeting those increased demands by implementing a new system.
Securability — The ability to provide differing access to individuals according to the classification
of the data and the user's business function.
SQL (Structured Query Language) — A standard query language for accessing relational database
systems, as well as ODBC-, DRDA-, and other compliant non-relational database systems.
Stovepipe Decision Support Systems — Independent, departmental data marts incapable of
supporting accurate decisions across the enterprise because they have no way to consistently define
data.
Target Database — The database in which data will be loaded or inserted.
Warehouse Application Vitality — A solution to enable business needs to drive the technology
that reaches the end-user's desktop by limiting the negative effects of application change.
© 1995, 1998 PLATINUM technology, inc. All rights reserved.