
Table of Contents
               Introduction
               Why Do We Need to Manage Metadata?
    Data from Everywhere
    Data as a Potential Resource
    Metadata Status in an Organization
               Metadata Management: Historical Approaches
    Data Administration: Data Dictionaries 
    The Havoc of Distributed Systems 
    Early Metadata 
    Early Data Warehousing 
               Metadata Management Trends
    Growth of Data Warehousing 
    The Resurgent Repository
               An Enterprise View of Metadata Management
    Managing Metadata Within and Across Warehousing Efforts 
         A Repository as a Metadata Integration Platform 
         Lifecycle Issues
    What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better 
         Non-proprietary Relational Database Management System 
         Fully Extensible Meta Model 
         Application Programming Interface (API) Access 
         Central Point of Metadata Control 
         Impact Analysis Capability 
         Naming Standards Flexibility 
         Versioning Capabilities 
         Robust Query and Reporting 
         Data Warehousing Support
               Conclusion
               Glossary
             Introduction
           As a recent article in the Wall Street Journal pointed out, data is becoming an abundant
           commodity: we can get it anywhere and everywhere. However, just as with any other item that
           becomes a commodity, data by itself is losing value, in part simply because there is so much of it.
          Not too long ago, the situation was exactly the opposite: raw information was extremely difficult to
          acquire and therefore highly prized. For example, fifty years ago, investors interested in trading in a
          futures market such as coffee beans would ask the question: "how is the coffee bean harvest doing in
          South America?" In order to get their answer, they hired agents to investigate coffee production as
          well as other important commodities. The reason they were willing to go to these lengths is that the
          answers to these questions were critical, and could make an investor an overnight millionaire in the
          futures market. Today, you can get higher quality raw data, including satellite pictures if you wish,
          over the Internet, for less than ten dollars a month. More data can be obtained in 15 minutes, from a
          broader and richer set of sources than any investor 50 years ago could get with months of effort. 
           The sheer quantity of data now available makes it difficult to deal with. When data was scarce,
           the amount available could be consumed by users with fairly primitive tools. Now that data is
           plentiful, we are faced with the problem of separating the significant
          facts from the rest. There are many analogies which can be used to describe the situation. Some say
          it is like trying to take a drink from a fire hose; others say it is like trying to find one specific grain of
          sand on a stretch of beach. In all cases, the problem being described is singular, pervasive, and
          compelling, because one thing that hasn't changed is the fact that the ability to use data to get
          answers to business questions is still the key to making money and achieving success in business
          endeavors. Therefore, like the Sorcerer's Apprentice, we have too much of a good thing and we
          don't know how to control it. 
          We do know that in order to make good business decisions we need good data. The process which
          has been generally accepted as good business practice has been described as follows: 
                First, acquire quality raw data; 
                second, combine and integrate the data to make useful information; and 
                third, analyze the information and make high quality decisions. 
          The acute issue is knowing which data to use to create useful information. The end goal is to make
          better and higher quality decisions than your predecessors could. The torrent of raw data has added
          more choice, and therefore complexity, to the process. What we need to do is put the data in
          context, give the data meaning, relevance, and purpose, and make it complete and accurate. Data
          which is viewed in this light is called information, because we can use it for deductive and inductive
          insights which lead us to quality decisions. 
          This paper discusses the issues confronting management today as they grapple with the floods of
          raw data and the pressing need to know what they have as data assets and how to achieve the goal
          of better decision making. 
Why Do We Need to Manage Metadata?
          As mentioned in the introduction, raw data is proliferating at a rapid rate. Data is flowing into the
          company from suppliers and customers. And, the internal systems of the corporation are adding
          their share. 
          Data From Everywhere
           Corporations are coupling together in webs of suppliers, partners, and customers, exchanging
           myriad kinds of information through a spectrum of technologies such as Electronic Data Interchange
           (EDI) systems, Electronic Funds Transfer (EFT) systems, email, and a host of other data acquisition
           and networking applications.
          At the same time, existing legacy systems within enterprises continue to generate data on orders,
          sales, revenues, employee information, manufacturing schedules, inventory, fleet status, and every
          other parameter imaginable. As computers become more and more affordable, as storage costs
           continue to plummet, and as users grow more sophisticated in the use of information technology, the
          proliferation of the technology adds to the exponential growth of the data it generates. 
          What do we know about the data being generated by these systems? First of all, we know that it is
          by and large dispersed across the enterprise. Each department, division, group, branch, section or
          any other subdivision is today capable of generating its own unique caches of data. The information
          technology advances of the last 20 years have added significantly to the amount and depth of data
           produced, managed, and stored. The waves of management interest in achieving operating
           efficiencies by centralizing, decentralizing, and reengineering, along with the technology whiplash
           from mainframes to old two-tier client/server architectures to new three-tier client/server
           architectures, have created the opportunity for the data in one group to have a different meaning in
           another group in the same organization. This data disparity is exacerbated by readily available
           CASE tools, rapid application development tools, application and code generators, underutilized
           data models and definitions, database products, spreadsheets, and other client-friendly products,
           and a lack of leadership in management. 
          Secondly, we know that along with the dispersion there exists a view that the data generated by
          each group belongs only to itself, and is intended only for its own uses. 
          Finally, we know that because of the first two observations, the potential for integrating these
           disparate data elements across various departments is poor without some significant work. 
          Recently, management has begun to recognize the value of using data as a corporate asset. The idea
          of using all of the organization's data to get a complete picture of the enterprise is today's ideal. At
          the very least, management is recognizing the need to view data from multiple departments to get
          some kind of combined view of operations. The concept of a data warehouse has emerged as a
          technology by which management can get a single comprehensive view of the state of the
          organization. Data is extracted at regular intervals from existing systems and placed in the
          warehouse, summarized to allow management to look at trends, but also available in detail for drill
          down data access and analysis. 
          However, in many companies, the same data element may be used by different divisions to mean
          different things. Manufacturing may exclude work in process from an inventory analysis, while
          purchasing does not, for example. Different groups may have different standards and approaches to
          defining prospects.
          Data as a Potential Resource
          Faced with these dilemmas, management has realized that data is a resource but only if all its
          important attributes are known and understood. Data must be set in context, have meaning to its
          users, be relevant, and have purpose. It is not enough to know that the inventory levels have ranged
           between two values over time. One must also know how inventory levels are defined. It is
          not enough to know that the value of a certain case of French wine has increased over time. One
          must also know what has happened to the relative value of the French Franc to the Dollar, and
          whether that value has been adjusted for fluctuations in the currency exchange. 
          Data must also be complete and accurate. If there are multiple sources for a particular data element,
          which one is being used in the data warehouse, and why? What are the business rules which impact
          how we view data? If we have calculated some data elements, such as profitability, what equations
          and formulas have been used to derive those results? Only when these are known, understood, and
          applied can data be fully utilized, and only then can we begin the reliable building of information from
          the data which ultimately leads to quality decision making. 
          The need to understand the data leads to a need for managing the data. This need is particularly
          acute in systems such as data warehouses whose primary purpose is to provide answers and supply
          a fertile ground for exploration and insight. 
           Having thus established a requirement to understand and manage the properties of the data, the
          question then becomes, what is the best mechanism for achieving this? What we are really talking
          about then is a store of attributes about the data, or data about data. Semanticists have termed this
          concept "metadata," from the Greek "meta," which means a later stage, transcending, or situated
          behind. Literally, then, we are talking about data that sits behind the operational data, and that
          describes its origin, meaning, derivation, etc. (What is gross sales? — Dollars or French Francs,
          quarterly or annualized, what system does it come from, when is it extracted, etc.?) Metadata can
          range from a conceptual overview of the real world to detailed physical specifications for a
          particular database management system. 
           A data resource becomes useless without readily available, high-quality metadata, whose primary
           objective is to provide a comprehensive guide to the data resource.
          Metadata Status in an Organization
          If organizations today are having problems managing data, what can we say about their ability to
          manage metadata? Most companies suffer from the "ready, fire, aim" syndrome, in that they are so
           rushed to implement systems that the planning aspects of most projects are the first to suffer.
          Pressure from management and users to gain the information or functions they need to do their work
          leads inevitably to a rushed implementation where there is little thought given to coordinating data
          elements with other groups who may use the same concept and few if any resources dedicated to a
          careful documentation of the properties of the attributes, the business rules used in their derivation,
          and so on. 
          In short, the problem continues. In most companies, the metadata situation is worse than the data
          situation. Along with the disparate data arriving from multiple sources from within and outside the
          corporation, there are multiple tools creating metadata in a variety of formats. 
          It is typical that companies want to rapidly implement data warehousing systems and then discover
          themselves in a metadata dilemma. That is, they have a critical need for readily available high quality
          metadata to leverage their data resource, yet the organization has no system in place for maintaining
          adequate metadata. As a result, many data warehousing projects slow down as an organization
          grapples with these issues brought about by disparate data and poor quality metadata. In any given
          project, there is a need to include business experts, domain experts, and data experts so that the
           metadata that is formed is relevant and useful as applied to the project's purpose. 
Metadata Management: Historical Approaches
          Managing the data assets of an organization effectively has been the goal of Information Technology
          since its inception. As systems have increasingly become more diverse, distributed, and complex,
          the management of the data assets has become increasingly difficult but nevertheless critical to the
          corporate entity. 
          Data Administration: Data Dictionaries 
          Early in the days of Information Technology, all data was defined and maintained within the
          computer program itself. These were total and complete packages of logic and purpose (64K
          Assembler core programs). There was no need to share data between systems and programs
          because one couldn't, without great effort, transfer data between different physical computer
           systems. While the need and demand were in place to get multiple programs to work together
          sequentially to get a given job done, the state of the art simply did not allow it. 
          In the late 1960's and early '70s, the technology improved to the point where multiple programs
          could run sequentially against a given data set to solve business problems. For example, a batch run
          could use a data set such as a collection of checking account transactions and calculate a new
          balance. This required some coordination within the set of programs as they used the system holding
          the transactions and then later as they accessed account balances, deposits, etc. In this evolution of
          the technology, the overhead of having each program be its own little environment was not a tenable
           solution, and so early versions of data coordinators were developed which were simple data
           "dictionaries" shared by programs looking for data to use in their logic processes. Each
           program would load the data definitions and locations it needed for its run from a common data
          dictionary. These dictionaries were most likely managed by an IS organization that was centralized
          and tightly controlled. 
          As developers became more sophisticated over time, data dictionaries evolved to provide more
          than just data attribute descriptions. They were also able to track which applications accessed which
           pieces of the database. This meant that managers who took advantage of these capabilities and did
           a good job of designing and populating the data dictionary found themselves in the enviable position
           of being able to maintain their systems more easily than their counterparts who did not. For example,
           suppose a user wants to change a SKU number definition from five digits to seven. How many
           programs need to be changed to effect this enhancement? If a manager has done the job
           properly, this question can be answered by a simple query into the data dictionary. Such a centrally
          designed and maintained system, which holds the data definitions as well as CASE information
          about which applications use which pieces of the database is sometimes called a repository. This
          concept will be expanded in later sections of the paper. 
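           As a minimal sketch of such a query (the dictionary tables and names below are hypothetical, not
           drawn from any particular product), a data dictionary that records which programs reference which
           elements can answer the SKU question directly:

                import sqlite3

                # Build a tiny, illustrative data dictionary: data elements and the programs
                # that reference them (hypothetical schema, for demonstration only).
                con = sqlite3.connect(":memory:")
                con.executescript("""
                    CREATE TABLE data_element (name TEXT PRIMARY KEY, datatype TEXT, length INTEGER);
                    CREATE TABLE program_usage (program TEXT, element_name TEXT REFERENCES data_element(name));
                """)
                con.executemany("INSERT INTO data_element VALUES (?, ?, ?)",
                                [("SKU_NUMBER", "CHAR", 5), ("EMP_ID", "CHAR", 9)])
                con.executemany("INSERT INTO program_usage VALUES (?, ?)",
                                [("ORDER_ENTRY", "SKU_NUMBER"),
                                 ("INVENTORY_RPT", "SKU_NUMBER"),
                                 ("PAYROLL", "EMP_ID")])

                # "Which programs must change if SKU_NUMBER grows from five to seven digits?"
                rows = con.execute(
                    "SELECT program FROM program_usage WHERE element_name = 'SKU_NUMBER'").fetchall()
                print([r[0] for r in rows])   # ['ORDER_ENTRY', 'INVENTORY_RPT']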
          The Havoc of Distributed Systems 
          As demand for more IS technology blossomed and the technology advanced with lower cost
           midrange and distributed systems, these systems became islands of automation ensconced in
           individual departments, working on specific business problems that related only to their own
           departments. Data was defined in a decentralized manner, by the business unit, with no
          central arbiter, if it was defined at all. Worse yet, as new and different CASE tools came on the
          market, and as new and different architectures came into vogue, such as object oriented databases
          and client/ server architectures, different tools were used to define the data for different applications.
          In some cases, the same element was defined multiple times with slight variations, as new systems
          were used to create applications to help users solve various business problems. 
          Exchanging data between systems became risky, highly structured, and infrequent. Importing data
          from external systems and environments became a labor intensive endeavor and was avoided if
          possible. However, in many cases it became necessary. For example, a call management system for
          customer support needs to get all of the customer information from the legacy databases. In order to
          do this successfully, elaborate programs needed to be written to "scrub" the data clean. With
          inadequate metadata, data incompatibilities could cause programs to fail, keeping programmers up
          nights debugging operational systems, as well as taking down the call management system, for
          example.
          Early Metadata 
          Just as any other kind of warehouse needs to keep an inventory of its holdings, early implementers
          of data warehouses found that they needed to keep track of what data the warehouse was currently
          holding along with the "pedigree" of that data. To do this, the idea of a metadata repository was
          created, similar to a data dictionary, to give users and technicians information about the data, such
          as where the data came from, what rules were used in creating the data, what the data elements
          meant, how recent the data was, and so on. 
          In early implementations, and even up to the recent past, many systems divided the business or
          end-user related information from the technical or development directory, so that technical
          information about the data, which would be of limited use to an end-user, and could arguably make
          the end user's task more difficult, was kept in a separate store from that which was required by the
          user. 
          It is now widely accepted that the metadata component must be designed so that everyone
           understands the data in the warehouse. Robert Typanski, Data Manager at Bayer, was recently
          quoted in Datamation: "Unless people can identify the data that's in the warehouse, they're not
          going to be able to access it any better than if it were buried in some legacy operational system." 
          Early Data Warehousing 
          Based on the seminal work by Bill Inmon, early adopters of the concept took Inmon literally in
          defining a data warehouse as having an enterprise wide scope. The early incarnations of this concept
          were gargantuan structures that encompassed summaries of data from all aspects of the business. In
          order to tie together all of this data, a tremendous amount of work had to be done to find the legacy
          data (data archaeology), build a common data model which was appropriate for an enterprise view
          of the business, and then extract the data from the legacy systems. During the extraction, data had to
          be not only consolidated but normalized and rationalized so that the resultant picture did not contain
          any duplications, contradictions, or other anomalies which could interfere with the accurate and
          timely analysis of the consolidated data. These projects often took longer than expected, and
           estimated costs were in the millions of dollars, much of which was spent in the data understanding
          and data preparation phases of the project. 
           Large systems with broad scopes, such as the ones described above, are sometimes not as
           responsive as users would like: not necessarily from the standpoint of system response time,
           although that can be a problem in an enterprise-wide data warehouse, but rather from the standpoint
           of being able to modify data structures and rules to support specific analyses that are particular to a
           single department. 
          This led to the next evolution of the data warehouse, a special purpose warehouse, or data mart,
          which is an application specific implementation, with data derived from the warehouse itself. The
          objective of these individual department specific implementations was and continues to be the
          generation of better support for management decision making by supplying data in a more relevant
          form and in a more responsive manner. 
Metadata Management Trends
          The awareness of the need to manage metadata has been an offshoot of the growth of data
          warehousing. As the number and diversity of data warehousing implementations began to grow, IT
          managers and end users began to realize that the data warehouse was only as useful as the quality,
          accuracy, and ease of use of its data.
          Growth of Data Warehousing 
          There is no doubt that data warehousing has grown as a technology and that it has firmly established
          itself as a mainstream tool in competitive businesses today. However, along with the success of the
          concept came growth in a number of other areas. 
               Growth across platforms. Initially, the global data warehouses were the domain of a few
               platforms and a few vendors. As their popularity grew, data warehousing technologies such
               as parallel computing, parallel databases, and OLAP/ROLAP were extended into smaller
               and more pervasive platforms, so that now the range of platforms which claim data
               warehousing capabilities spans from Microsoft NT, through the UNIX domain, and on up
               into the giants such as NCR, DEC, and IBM.
                
               Growth across tools. Success has many parents, and failure is an orphan. This old saw is
               certainly true of data warehousing. As the bandwagon got rolling, vendors sprang from all
               directions to jump on, each touting special features and functions. From legacy data
               extraction tools to maintenance and scheduling tools, and yes, even tools that purported to
               handle metadata, vendors and products proliferated.
                
               Growth across departments. Success also breeds demand, which is exactly what happens
               when department A gets a new data mart and starts showing colleagues in department B how
               easily they can now access consistent data. Soon, department B has its own data mart, and
               departments C, D, and E are not far behind.
          Throughout all of this growth, however, the subject of metadata management remained fractured
          and dispersed. Extraction tools, loading tools, cleansing tools and analysis tools, all claimed to have
          a piece of the metadata problem solved. In many ways, this was akin to the six wise men of
          Hindustan who, in describing an elephant, all described the entirety of the elephant by the one piece
          of the animal they were feeling: the trunk, the tail, the ear, the body, the tusk and the leg. In fact, until
          recently, there has been little progress in terms of a solution to the integrated metadata issue. 
          However, enterprises today are clamoring for such a solution, and for good reason. Users need to
          know what they are looking at if they are to make intelligent decisions and take informed actions
          based on the data they have received from the data warehouse. A system cannot leave it up to the
          user to assume the business rules embedded in a calculated data element because different
          assumptions will lead to different courses of action that will inevitably conflict with each other. Yet,
          in most cases today, metadata is spread across different components of the warehouse, from the
          scheduler to the data extraction/cleansing tools which claim to build metadata as they are extracting
          and cleansing, to the loading tools, to the OLAP tools, which need to present metadata to users in
          order to navigate. Business rules are separated from technical metadata, as they should be, but are
          kept by different systems in different formats with different user interfaces. In the case of multiple
          data marts spread across the enterprise, this situation is multiplied by the number of marts. And, if
          there is a need for a user in department A to use a data mart created by department B, then in most
          cases, that user has to relearn the metadata navigation for that system. Clearly, users would like to
          go to one place and be able to see either the business metadata or the technical metadata in a single
          system with a single user interface and single screen metaphor for any and all data residing in the
          enterprise.
          The Resurgent Repository 
          The scenarios depicted above are the primary drivers behind the resurgence of the concept of a
           single metadata repository. A repository is the vehicle for metadata. Simply put, a repository is
          where information (metadata) about an organization's information systems components (objects,
          table definitions, fields, business rules and so on) is held. A repository also contains tools to facilitate
          the manipulation and query of the metadata.
           A repository has a number of potential applications within an enterprise that deliver value beyond
           the domain of the data warehouse alone. For example, a repository can: 
                aid in the integration of the views of disparate systems by helping users understand how the
                data used by those systems are related;
                
               support rapid change and assistance in building systems quickly by impact analysis and
               provision of standardized data;
                
               facilitate reuse by using object concepts and central accessibility;
                
               assist in implementation for data warehousing (A central repository can be built in advance of
               the warehouse, purely for data and application integration purposes, and then be ready to
               support a warehouse implementation. Alternatively, if the repository is built in support of the
               initial warehousing effort, it can be of enormous value in deploying subsequent efforts.);
                
               support software development teams.
          One of the primary benefits of a repository is that it provides consistency of key data structures and
          business rules, which makes it easier to tie warehousing efforts together (data marts) across the
           enterprise. This has been one of the major criticisms leveled at proponents of independent data
           marts: deploying data marts without a unifying infrastructure simply perpetuates the "islands of
           automation" problems we have with our legacy systems.
          The repository also leverages an organization's investment in existing legacy systems by documenting
          program information for future application development. 
An Enterprise View of Metadata Management
          Metadata is therefore a key resource to the warehouse during all phases of its life cycle, from the
          warehouse construction, through the user access, and into the maintenance and update of the data it
          holds. 
          During the past few years there has been a tremendous level of activity in the vendor metadata
          repository field, largely due to the rapid growth in the data warehouse and data mart markets.
           Businesses have come to see clearly the issues surrounding disparate data as they attempt to
           leverage these data assets across their organizations, and vendors are responding by building
          enterprise level strategies. As an example of the state of metadata today, below is an excerpt from a
          recent Datamation article: 
               " Syncing metadata between two products — different functions, different metadata
               stores, different vendors — is a huge challenge, too. To do it, you'd have to get the
               right piece of metadata at the right level of detail from one product and map it to the
               right piece of metadata at the right level of detail in the other product, then straighten
               out any differences in meaning or in coding between them. And then do it again for
               each of the hundreds of other points in metadata space that the two products share in
               common. And then figure out what to do when the metadata changes in one of the
               products. And if the metadata structure (yes, that would be the meta-metadata) of a
                product changes, you get to do it all again.
                 
                " Syncing the metadata between two products is tough. Syncing metadata among each
                of the half-a-dozen tools it could take to build, run, and access a data warehouse is an
                almost unthinkable task. But for a smooth, robust, efficient data warehouse operation,
                it's sync or sink.
                 
                " What you really need is a single, comprehensive metadata source that is accessible
                to all of the tools you buy — the tools you buy for the data warehouse, certainly, but
                also the tools you buy for virtually every other IS function, as well. One metadata
                source, no syncing."
          Vendors are beginning to respond to these kinds of pressures and are trying to solve the enterprise
          metadata conundrum. However, the answers are not simple. Metadata is collected and/or generated
          in a variety of places in the Data Warehousing architecture, from data extraction, from data
          manipulation and application specifics, and from query engines such as OLAP. Today, each of these
          areas has a number of vendors who offer products, and each vendor has a slightly different
          approach or paradigm to their metadata solution. There are several possible approaches, and some
           vendors are promoting the concept of a single enterprise-level metadata repository to integrate the
          enterprise's disparate metadata Tower of Babel. The simplest approach is to have all vendors utilize
          the same semantics, paradigm, etc. and collect all of the metadata in a single format in a single
          location. 
          Previous efforts to create a single repository to meet all needs have failed as they tried to be all
          things to all users. Even standards councils have had a difficult time creating information models that
          work for multiple vendors in heterogeneous environments. 
          The announcement by Microsoft Corp. and PLATINUM technology, inc. highlights a practical
          approach to providing enterprise-wide metadata management in a heterogeneous environment.
          Working with leading application development and data warehousing vendors, Microsoft will
          publish an Open Information Model for use by independent software vendors to share metadata
          using a single model.
          In addition, over time, PLATINUM technology will port the Microsoft Repository to MVS and
          major UNIX platforms, embedding the core Microsoft Technology in its repositories, so users can
          implement a single repository solution across all leading platforms and databases. 
          By teaming together, the two leaders in repository technology will deliver a staged solution to cross
          platform metadata management. 
          A second alternative is to partner with other vendors in various domains within the data warehouse
           architecture (extraction, loading, OLAP, etc.), and build translation schemas from their metadata to a
           canonical metadata format. This in effect gives users the option of a suite of products whose metadata
           can all be translated into a single metadata "esperanto". Vendors such as PLATINUM technology,
           which have an extensive spectrum of products in the data warehouse world, can then ensure all of their
          own products speak this "metadata esperanto," and via teaming arrangements include a complete
          suite of warehousing components. A single metadata repository then could be built at an enterprise
          level, and a corporation could attain some measure of consistency and manageability by
          implementing this concept. 
          This area will be of intense interest to corporations as they build their corporate architectures and
          approaches to "enterprise" metadata. This is a very active area for vendors at this time with many
           levels of interactivity. PLATINUM and Viasoft are integrating the products of their acquisitions into
          their mainstream tool sets. Other major vendors are arranging alliances and developing bridging
          software. Needless to say, in an area this active, there will be multiple degrees of integration,
          features, and functions, all dynamic and changing with every release. In the following sections, we
          will review important characteristics in an enterprise metadata product.
           
          Managing Metadata Within and Across Warehousing Efforts 
          Most large organizations today have had some experience with data warehousing implementations.
          Today, these typically take the form of data mart style implementations in various departmental
          focus areas such as financial analysis or customer focused systems assisting business units. Many
          organizations have multiple warehousing initiatives underway simultaneously and these systems will
          most likely be based on products from multiple data warehousing vendors, in the typical
          decentralized approach of most corporations. This approach has worked to date in that it has
          allowed reasonably rapid implementation of these systems and demonstrated to the organization the
          benefit and potential of data warehousing as a business tool at a fraction of the cost of the enterprise
          data warehouse model. 
          However, as pointed out earlier, this is the typical "ready, fire, aim" approach which got us to the
          legacy data Tower of Babel we have today, and in keeping with that, some areas of the business are
          beginning to show signs of stress as a result of this approach to implementing data warehousing.
          Data and metadata are spread across multiple data warehousing systems, and system managers are
          wondering how best to coordinate and manage the dispersed metadata mess they have today. How
          do we maintain consistency when business rules change as a result of corporate reorganizations,
          regulatory changes, or other changes in business practices? What happens when an application
          wants to change the technical definition? How many places are impacted for each of these potential
          changes? These issues among others are forcing businesses to take a larger view — an enterprise
          view — of metadata management systems. Coordinating metadata across multiple data warehouses
          is one significant step in the right direction, and a repository is just the tool to do that. 
          A Repository as a Metadata Integration Platform 
          Ideally, a corporation should adopt a repository as a metadata integration platform, making
          metadata available across the organization. This would serve to manage key metadata across all of
          the data warehouse and data mart implementations within an organization. This would allow all of
          the participants to share common data structures, business rule definitions, and data definitions from
          system to system across the enterprise. 
           The platform would accept and manage information from multiple sources. These would include
           databases from the major technology vendors (e.g., IBM, Informix, Oracle, Microsoft, Sybase,
           etc.) and a broad spectrum of tools, from extraction tools to analysis tools. On the output side, the
           system should provide open access by multiple tools as well as APIs for custom needs. 
          The metadata repository also facilitates consistency and maintainability. It provides a common
           understanding across warehouse efforts, promoting sharing and reuse. If a new data element
          definition is required for a data mart implementation, the platform should permit versioning to
          support the need. With a shared metadata repository the exchange of key information between
          business decision managers (facilitated by good solid end user access tools) becomes more feasible.
          And, when multiple data marts and data warehouses are involved, a central metadata platform will
          simplify and reduce the effort required to maintain them when viewed as a whole.
          Lifecycle Issues
          Repository systems need to contribute to and integrate with the existing legacy system environment
          and play an active role throughout the lifecycle of data warehousing systems to be truly considered
          enterprise metadata repositories. 
          Documenting database and legacy information are important capabilities in metadata repositories.
          Legacy models provide the information sourcing, data inventorying, and design that are key to
          developing an effective data warehouse. The metadata surrounding the acquisition, access, and
          distribution of warehouse data is the key to providing the business user with a complete map of the
          data warehouse. 
           The repository should play an active role in the entire life cycle of the data warehouse and in all the
           outputs of system and business value. This includes existing legacy systems as sources, third-party
           tools, etc. This leverages the repository's role so that it contributes not only in the development
           phases but also in the bulk of the cost of all IS systems: the downstream support and maintenance
           costs. The relevant systems management, database management, business intelligence, and
           application development tools and components are listed below. 
               Systems management tools that can be used to manage jobs, improve performance, and
               automate operations, not only in operational systems but also in data warehouse systems.
                
               Database management tools that can help create and maintain the database management
               systems for operational systems, data warehouses, and data marts.
                
               Data movement tools that transform and integrate disparate data types and move data reliably
               to the warehouse.
                
               Business intelligence tools that provide end-user access and analysis for making business
               decisions.
                
               Business applications that provide packaged warehouse solutions for specific markets.
                
               Data warehouse consulting that uses a methodology based on the experiences of hundreds of
               other companies, thereby reducing the risk associated with making uninformed business
               decisions.
                
               Application development solutions that help you build, test, deploy, and manage operational
               and warehouse applications throughout the enterprise.
                
                CASE tool support that provides consistency and maintainability immediately by developing
                consistent terminology and structures.
                
               Repository-to-CASE interfaces that enable an organization to manage multiple CASE
               workstations from the repository. These tools are designed to allow an organization to better
               utilize the data maintained in their CASE workstations by providing a central point of control
               and storage.
                
               Sophisticated version control, collision management, and bi-directional interfaces, enabling
               the sharing and reuse of metadata among programmers and analysts working independently.
          What Needs to Be In an Enterprise Repository to Make the Warehouse Work Better
          Some areas to focus on in reviewing repository functionality are discussed in the following sections.
          Nonproprietary Relational Database Management System 
          A repository should ideally use an industry standard DBMS which provides significant advantages
          over vendor-developed DBMSs. These advantages include advanced tools and utilities for database
          management (such as backups and performance tuning) as well as dramatically enhanced reporting
           capabilities. Furthermore, maintainability and accessibility are enhanced by an "open" system. 
          Using a standard database also allows the repository vendor to focus on the quality of the
          repository, not the features of the database management system. In addition, it allows the vendor to
          take advantage of new features made available by the DBMS vendor. 
          Fully Extensible Meta Model 
           A repository should be fully self-defining and extensible, based on a common entity/relationship
           model. By using a model that reflects industry standards, it can provide users with the ability to
           easily customize the meta model to meet their specific needs. The repository should support the
           following meta model extensions (a brief sketch follows the list): 
               adding or modifying an entity type,
                
               adding or modifying a linkage between entity types (associations or relationships),
                
               adding user views (with different screen layouts or validations) to entities or relationships,
                
               adding, deleting, or modifying attributes of relationships or entities,
                
               modifying the list of allowable values for an attribute type,
                
               adding or modifying commands or user exits,
                
               adding custom command macros, and
                
               adding or modifying help and informational messages.
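           As a brief illustration of what an extensible, self-defining meta model can mean in practice, the
           sketch below keeps entity types, attributes, and relationship types as ordinary data, so an extension
           is simply another entry. The structures and names are hypothetical, not those of any particular
           repository product.

                # Hypothetical, minimal meta model held as data: entity types, their attributes,
                # and the relationships allowed between entity types.
                meta_model = {
                    "entity_types": {
                        "TABLE":  {"attributes": {"name": "string", "owner": "string"}},
                        "COLUMN": {"attributes": {"name": "string", "datatype": "string"}},
                    },
                    "relationship_types": {
                        "TABLE_HAS_COLUMN": ("TABLE", "COLUMN"),
                    },
                }

                def add_entity_type(model, type_name, attributes):
                    """Add or modify an entity type without touching any repository code."""
                    model["entity_types"][type_name] = {"attributes": dict(attributes)}

                def add_relationship_type(model, rel_name, from_type, to_type):
                    """Add a new linkage (association) between two existing entity types."""
                    model["relationship_types"][rel_name] = (from_type, to_type)

                # A user-defined extension: track business rules and tie them to columns.
                add_entity_type(meta_model, "BUSINESS_RULE", {"name": "string", "expression": "string"})
                add_relationship_type(meta_model, "RULE_GOVERNS_COLUMN", "BUSINESS_RULE", "COLUMN")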
          The vendor should also support the Microsoft Open Information Model, which will allow
          information to be shared across multiple vendor products. Ideally, the vendor will be part of the
          Open Information Model design team.
          Application Programming Interface (API) Access 
           API access to the repository can provide an organization with the flexibility needed to create a
           metadata management system that suits its unique needs. Such an architecture makes the repository
           more powerful by allowing users to create custom applications and programs. 
           In addition, the API separates the metadata from the tools that access and manipulate it, which adds
           flexibility. The tools manipulate metadata through the API, giving them transparent access to the
           data: if the underlying data structures change, the tools do not need to be changed. This allows for
           greater efficiency and flexibility in an organization's application development. 
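           To illustrate this insulation (using an entirely hypothetical client interface; no particular vendor's
           API is implied), the sketch below shows a reporting tool retrieving an element's definition without
           any knowledge of how the repository stores its metadata:

                class RepositoryClient:
                    """Hypothetical repository API: callers never see the underlying storage."""

                    def __init__(self):
                        # Internal storage format; it could change (tables, files, objects)
                        # without affecting any caller that uses the method below.
                        self._elements = {
                            "GROSS_SALES": {
                                "definition": "Invoiced sales before returns, in US dollars",
                                "source_system": "ORDER_ENTRY",
                                "refresh": "nightly extract",
                            }
                        }

                    def get_element(self, name):
                        return dict(self._elements[name])

                # A reporting tool depends only on the API, not on the repository's structures.
                repo = RepositoryClient()
                meta = repo.get_element("GROSS_SALES")
                print(f"GROSS_SALES: {meta['definition']} (source: {meta['source_system']})")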
          Central Point of Metadata Control
          The repository serves as a central point of control for data, providing a single place of record about
          information assets across the enterprise. It documents where the data is located, who created and
          maintains the data, what application processes it drives, what relationship it has with other data, and
          how it should be translated and transformed. This provides users with the ability to locate and utilize
          data that was previously inaccessible. Furthermore, a central location for the control of metadata
          ensures consistency and accuracy of information, providing users with repeatable, reliable results
          and organizations with a competitive advantage.
          Impact Analysis Capability
           If the repository has an impact analysis facility, it can provide virtually unlimited navigation of the
           repository definitions to determine the total impact of any change. Using impact analysis views,
           users can easily determine where any entity is used and what it relates to. 
           An impact analysis facility answers the real questions in the analysis phases without forcing a user to
          sift through large quantities of unfocused information. Furthermore, sophisticated impact analysis
          capabilities allow better time estimates for system maintenance tasks. They also reduce the amount
          of rework resulting from faulty impact analysis (e.g., a program not being changed as a result of a
          change to a table that it queries).
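           A toy sketch of the idea, with an invented dependency graph: starting from a changed object, the
           repository's relationships are followed transitively to report the total impact of the change.

                from collections import deque

                # Hypothetical repository relationships: "X is used by Y".
                used_by = {
                    "CUSTOMER_TABLE": ["LOAD_CUSTOMER_JOB", "CUSTOMER_VIEW"],
                    "CUSTOMER_VIEW": ["SALES_REPORT", "CHURN_MODEL"],
                }

                def impact_of(changed_object):
                    """Return every object reachable from the change, i.e. its total impact."""
                    impacted, queue = set(), deque([changed_object])
                    while queue:
                        current = queue.popleft()
                        for dependent in used_by.get(current, []):
                            if dependent not in impacted:
                                impacted.add(dependent)
                                queue.append(dependent)
                    return sorted(impacted)

                print(impact_of("CUSTOMER_TABLE"))
                # ['CHURN_MODEL', 'CUSTOMER_VIEW', 'LOAD_CUSTOMER_JOB', 'SALES_REPORT']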
          Naming Standards Flexibility
          A repository should provide a detailed map of data definitions and elements, thereby allowing an
          organization to evaluate redundant definitions and elements and decide which ones should be
          eliminated, translated, or converted. By enforcing naming standards, the repository assists in
          reducing data redundancies and increasing data sharing, making the application development
          process more efficient and therefore less costly. In addition, an easily enforceable standard
          encourages organizations to define and use consistent data definitions, thereby increasing the reuse
          of standard definitions across disparate tools.
          Versioning Capabilities 
           In repository discussions, "versioning" can have many different definitions. For example, some
           version control capabilities are: 
               version control as in test vs. production (lifecycle phasing);
                
               versions as unique occurrences;
                
               versioning by department or business unit; and
                
               version by aggregate or workstation ID.
          The repository's versioning capabilities facilitate the application lifecycle development process by
          allowing developers to work with the same object concurrently. Developers should be able to
          modify or change objects to meet their requirements without affecting other developers. 
          Robust Query and Reporting
          The repository should provide business users with a vehicle for robust query and report generation.
           It should seamlessly pass queries to its own end-user tool or to third-party products for
          automatic query generation and execution. Furthermore, business users should be able to create
          detailed reports from these tools, increasing the amount of valuable decision support information
          they are able to receive from the repository.
          Data Warehousing Support
          The repository provides information about the location and nature of operational data which is
          critical in the construction of a data warehouse. It acts as a guide to the warehouse data, storing
          information necessary to define the migration environment, mappings of sources to targets,
          translation requirements, business rules, and selection criteria to build the warehouse. 
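           As a small, purely illustrative example of the kind of source-to-target mapping record such a
           repository might hold (the field names and values are invented, not taken from any product):

                # Illustrative record describing how one warehouse column is built from its source.
                mapping = {
                    "target": "WAREHOUSE.SALES_FACT.GROSS_SALES_USD",
                    "source": "LEGACY.ORDERS.AMOUNT",
                    "translation": "convert local currency to USD at month-end exchange rate",
                    "business_rule": "exclude cancelled orders (STATUS <> 'X')",
                    "selection_criteria": "orders shipped in the load period",
                    "refresh_schedule": "nightly batch extract",
                }

                for field, value in mapping.items():
                    print(f"{field:20s} {value}")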
Conclusion
          Organizations are becoming increasingly aware of the limitations of their own systems and internal
          data. The attempts to liberate and leverage data across the organization's stovepipes have been
          replete with frustration and too many examples of failure. These experiences, coupled with drivers
          demanding flexibility in business processes, are hastening the day that businesses will implement an
          enterprise level view of metadata. Activity to supply this enterprise level capability is being
          aggressively pursued by all major vendors. It is critical that corporations understand the issues at
          hand as they adopt enterprise strategies and that they be in a position to evaluate what set of vendor
          products are appropriate to their situation. Business Information Demand — An organization's
          continuously increasing, constantly changing need for current, accurate information, often on short
          notice, to support its business activities. 
Glossary
          24x7 Lights Out Operations — The use of Systems Management tools to ensure the reliable
          movement and update of data from operational systems to analytical systems. 
          Analytical Data Store — Useful in making strategic decisions, this data storage area maintains
          summarized or historical data. This stored data is time variant, unlike operational systems which
          contain real-time data. Information contained in this data store is determined and collected based on
          the corporate business rules.
           Application Lifecycle — Includes the following three stages: process and change management,
           analysis and design, construction and testing.
          Architecture — A definition and preliminary design which describes the components of a solution
          and their interactions. An architecture is the blueprint by which implementers construct a solution
          which meets the users' needs.
          Availability — A measure of the percentage of time that a computer system is capable of
          supporting a user request. A system may be considered unavailable as a result of events such as
          system failures or unplanned application outages.
          Business-Driven — An approach to identifying the data needed to support business activities,
          acquiring or capturing those data, and maintaining them in a data resource that is readily available.
          Business-Driven Approach — The process of identifying the data needed to support business
           activities, acquiring or capturing those data, and maintaining them in the data resource.
           Business Information Demand — An organization's continuously increasing, constantly changing
           need for current, accurate information, often on short notice, to support its business activities.
           Business Rules — The statements and stipulations that a corporation has set as "standard" in
          order to run the enterprise more consistently and smoothly.
          Capacity Planning — The process of considering the effects of a warehouse on other system
          resources such as response time, DASD requirements, etc.
CASE — Computer Aided Software Engineering.
          CASE Management — The management of information between multiple CASE "encyclopedias,"
          whether the same or different CASE tools.
           Centralized Data Warehouse — A Data Warehouse implementation in which a single warehouse
           serves the needs of several business units simultaneously, with a single data model that spans the
           needs of the multiple business divisions.
          Change Propagation — The process of generating only the updates from the source databases to
          the target database (usually the data warehouse).
          Chargeback — The process that data warehouse managers use to ensure appropriate costs are
          correctly distributed to the corresponding business units and users so that they can meet financial
          reporting requirements.
          Client/Server — A distributed technology approach where the processing is divided by function.
          The server performs shared functions — managing communications, providing database services,
          etc. The client performs individual user functions — providing customized interfaces, performing
          screen to screen navigation, offering help functions, etc.
          Client/Server Processing — A form of cooperative processing in which the end-user interaction
          is through a programmable workstation (desktop) that must execute some part of the application
          logic over and above display formatting and terminal emulation.
          Consistent Data Quality — The state of a data resource where the quality of existing data is
          thoroughly understood and the desired quality of the data resource is known. It is a state where
          disparate data quality is known, and the existing data quality is being adjusted to the level desired to
          meet the current and future business information demand.
           Data — Items representing facts, text, graphics, bit-mapped images, sound, and analog or digital
           live-video segments. Data is the raw material of a system, supplied by data producers and used by
          information consumers to create information.
Data Access — The process of entering a database to store or retrieve data.
           Data Access Tools — End-user oriented tools that allow users to build SQL queries by
           pointing and clicking on a list of tables and fields in the data warehouse.
          Data Accuracy — The component of data integrity that deals with how well data stored in the data
          resource represent the real world. It includes a definition of the current data accuracy and the
          adjustment in data accuracy to meet the business needs.
          Data Administration — The processes and procedures by which the integrity and currency of the
          data in the warehouse are maintained.
          Data Analysis and Presentation Tools — Software that provides a logical view of data in a
          warehouse. Some create simple aliases for table and column names; others create data that identify
          the contents and location of data in the warehouse.
          Data Consistency — The result of using a repository to capture and manage data as it changes so
          that decision support systems can be continually updated.
          Data Definition — Enabling the structure and instances of a database to be defined in a
           human- and machine-readable form.
          Data Distribution — The placement and maintenance of replicated data at one or more data sites
           on a mainframe computer or across a telecommunications network. It is part of developing and
          maintaining an integrated data resource that ensures data are properly managed when distributed
          across many different data sites. Data distribution is one type of data deployment, which is the
          transfer of data to data sites.
          Data Mart — A subset of the data resource, usually oriented to a specific purpose or major data
          subject, that may be distributed to support business needs. The concept of a data mart can apply to
          any data whether they are operational data, evaluational data, spatial data, or metadata.
Data Model — A logical map that represents the inherent properties of the data independent of software, hardware, or machine performance considerations. The model shows data elements grouped into records, as well as the associations among those records.
Data Movement — The transportation of data from disparate sources (various mainframes, client/server machines, and network file servers) to a central location, the data warehouse, in order to create a reliable source of information usable for strategic decision making.
          Data Quality — Indicates how well data in the data resource meet the business information
          demand. Data quality includes data integrity, data accuracy, and data completeness. 
Distributed Database — A collection of multiple, logically related databases that are distributed across two or more data sites.
          Data Store — A place where data is stored; data at rest. A generic term that includes databases
          and flat files.
Data Transformation — (1) The formal process of transforming data in the data resource within a common data architecture. It includes transforming disparate data into an integrated data resource and transforming data within the integrated data resource. It also includes transforming operational, historical, and evaluational data within a common data architecture. (2) Creating "information" from data. This includes decoding production data and merging records from multiple DBMS formats. It is also known as data scrubbing or data cleansing.
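A small sketch of sense (2): decoding a cryptic production code into a meaningful value while moving data toward the warehouse. The codes, tables, and column names here are assumptions made for the example.
    -- Hypothetical transformation: decode the operational status code during the load
    INSERT INTO warehouse.orders (order_id, order_status, order_amount)
    SELECT o.order_id,
           CASE o.status_cd                 -- cryptic operational codes ...
                WHEN 'O' THEN 'Open'
                WHEN 'S' THEN 'Shipped'
                WHEN 'C' THEN 'Cancelled'
                ELSE 'Unknown'
           END,                             -- ... become readable values
           o.order_amt
    FROM   staging.orders o;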
Data Warehouse — (1) A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision making process; a repository of consistent historical data that can be easily accessed and manipulated for decision support. (2) An implementation of an informational database used to store sharable data sourced from an operational database-of-record. It is typically a subject database that allows users to tap into a company's vast store of operational data to track and respond to business trends and facilitate forecasting and planning efforts.
Database — A collection of data which are logically related.
DBA — Database Administrator.
          Decision Support — A set of software applications intended to allow users to search vast stores
          of information for specific reports which are critical for making management decisions.
          Disparate Data — Data that are essentially not alike, or are distinctly different in kind, quality, or
          character. They are unequal and cannot be readily integrated to adequately meet the business
          information demand. Disparate data are heterogeneous data.
          End User Data — Data formatted for end-user query processing; data created by end users; data
          provided by a data warehouse.
          Enterprise — A complete business consisting of functions, divisions, or other components used to
          accomplish specific objectives and defined goals.
Enterprise Data Warehouse — A centralized warehouse that services the entire enterprise.
          FTP (File Transfer Protocol) — A client-server protocol which allows a user on one computer to
          transfer files to and from another computer over a TCP/IP network.
          Fragmentation — The process in which a packet is broken into smaller pieces, fragments, to fit the
          requirements of a physical network over which the packet must pass.
Global Enterprise — A corporate environment not limited by geographic location.
Heterogeneous Data — See disparate data.
Heterogeneous Databases — See disparate databases.
Incremental Refresh — A technique that loads into a data warehouse or data mart only the data that has changed since the last load.
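As an illustration, an incremental refresh is often driven by a change timestamp on the source rows; the table and column names below are assumed for the sketch, and :last_load_time stands for the timestamp recorded at the previous load.
    -- Load only the rows that changed since the previous load
    INSERT INTO warehouse.customer_history (customer_id, customer_name, region_code, load_time)
    SELECT customer_id, customer_name, region_code, CURRENT_TIMESTAMP
    FROM   staging.customer
    WHERE  last_update_time > :last_load_time;   -- :last_load_time is a placeholder host variable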
          Information — (1) A collection of data that is relevant to one or more recipients at a point in time.
          It must be meaningful and useful to the recipient at a specific time for a specific purpose. Information
          is data in context, data that have meaning, relevance, and purpose. (2) Data that has been
          processed in such a way that it can increase the knowledge of the person who receives it.
          Information is the output, or "finished goods," of information systems. Information is also what
          individuals start with before it is fed into a Data Capture transaction processing system.
          Information Technology Infrastructure — An infrastructure for the information technology
          discipline that provides the resources necessary for an organization to meet its current and future
          business information demand. It consists of the data resource, the platform resource, business
          activities, and information systems.
          Job Management Tools — Tools which include job scheduling, job optimization, charge back,
          and output management tools that help operations managers and system administrators manage,
          monitor, and coordinate the execution of enterprise-wide IT jobs.
Legacy Data — Another term for disparate data because they support legacy systems.
Metadata — (1) Traditionally, metadata were data about the data. In the common data architecture, metadata are all data describing the foredata, including the meta-praedata and the meta-paradata. They are data that come after or behind the foredata and support the foredata. (2) Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
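For example, the kind of metadata listed above (name, length, valid values, description) might be held in a repository table along the following lines; the table and its columns are purely illustrative.
    -- Hypothetical repository table describing warehouse data elements
    CREATE TABLE repository.data_element (
        element_name    VARCHAR(30)   NOT NULL PRIMARY KEY,
        data_type       VARCHAR(20)   NOT NULL,
        element_length  INTEGER,
        valid_values    VARCHAR(200),
        description     VARCHAR(500)
    );

    INSERT INTO repository.data_element
    VALUES ('region_code', 'CHAR', 3, 'EAS, WST, CEN, SOU',
            'Sales region in which the customer is serviced');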
          Methodology — A system of principles, practices, and procedures applied to a specific branch of
          knowledge.
OLTP — On-Line Transaction Processing.
On-Line Transaction Processing — Processing that supports the daily business operations. Also known as operational processing and OLTP.
          Operational Data Store — Contains timely, current, and integrated information. The data is
          typically very granular. These systems are subject oriented, not application oriented, and are
          optimized for looking up one or two records at a time for decision making.
Operational Systems — See Legacy Data.
          Performance Management Tools — Tools that help warehouse managers monitor, maintain, and
          manage warehouse performance in distributed, heterogeneous environments.
          Problem Resolution Software — Tools that provide automated problem report management for
          help desks, technical support departments, or customer service operations. These products can be
          used by support staff, as they assist customers or end users, or as part of an automated call-in
          self-help system.
          Query — A (usually) complex SELECT statement for decision support. See Ad-Hoc Query or
          Ad-Hoc Query Software.
RDBMS — Relational Database Management System.
          Reference Data — Business data that has a consistent meaning and definition and is used for
          reference and validation (Process, Person, Vendor, and Customer, for example). Reference data is
          fundamental to the operation of the business. The data is used for transaction validation by the data
          capture environment, decision support systems, and for representation of business rules. Its source
          for distribution and use is a data warehouse.
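As a sketch of how reference data supports transaction validation, an incoming record might be checked against a reference table like the one below; the names and the :incoming_vendor_code placeholder are assumptions made for the example.
    -- Hypothetical validation of a vendor code carried on an incoming transaction
    SELECT COUNT(*)
    FROM   reference.vendor
    WHERE  vendor_code = :incoming_vendor_code;  -- placeholder for the code on the transaction
    -- A count of zero means the transaction carries an unknown vendor and should be rejected.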
Refresh Technology — A process of taking a snapshot from one environment and moving it to another environment, overlaying the old data with the new data each time.
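In contrast to an incremental refresh, a full refresh simply overlays the target with a new snapshot; a minimal sketch, with illustrative table names, follows.
    -- Hypothetical full refresh: remove the old snapshot, then load the new one
    DELETE FROM mart.product_dim;
    INSERT INTO mart.product_dim (product_id, product_name, category)
    SELECT product_id, product_name, category
    FROM   warehouse.product;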
Relational Database Management System — A Database Management System which uses the concept of two-dimensional "tables" to define "relationships" among the different elements of the database.
          Repository — A location, physical or logical, where databases supporting similar classes of
          applications are stored.
          Repository Environment — The Repository environment contains the complete set of a
          business's metadata. It is globally accessible. As compared to a data dictionary, the repository
          environment not only contains an expanded set of metadata, but can be implemented across multiple
          hardware platforms and database management systems (DBMS).
Reusability — Using code developed for one application program in another application.
Scalability — (1) The ability to scale to support larger or smaller volumes of data and more or fewer users; the ability to increase or decrease size or capability in cost-effective increments with minimal impact on the unit cost of business and the procurement of additional services. (2) The ability of a system to accommodate increases in demand by upgrading and/or expanding existing components, as opposed to meeting those increased demands by implementing a new system.
Securability — The ability to provide differing access to individuals according to the classification of the data and the user's business function.
SQL (Structured Query Language) — A standardized query language for accessing relational database systems; through interfaces such as ODBC and DRDA it can also be used to reach non-relational data sources.
Stovepipe Decision Support Systems — Independent, departmental data marts that cannot support accurate decisions across the enterprise because they have no way to define data consistently.
Target Database — The database into which data will be loaded or inserted.
          Warehouse Application Vitality — A solution to enable business needs to drive the technology
          that reaches the end-user's desktop by limiting the negative effects of application change. 
          © 1995, 1998 PLATINUM technology, inc. All rights reserved.