Implementing Metadata Repositories - Processing and Managing Complex Data for Decision Support

Given the metadata implementation challenges and the limited capabilities offered by commercial products, one approach that organizations may consider is building a customized, enterprise-wide metadata repository (Sachdeva, 1998; White, 1999). The metadata repository captures and integrates all the metadata components used in organizational applications and creates a single, unified source for metadata (Marco, 2000). Underlying the repository is a physical database (commonly, but not necessarily, an RDBMS) that implements a comprehensive metadata model and captures the actual values of all metadata characteristics and instances. The metadata repository supports every phase of IT development and operation. Organizational members involved in application design and implementation use the repository to store and retrieve requirements and the resulting data and system design specifications. Once applications are implemented, they communicate with the repository to retrieve the metadata elements they require. Organizations that pursue an enterprise repository face several important decisions with respect to design paradigm, architectural approach and implementation alternatives.
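As a minimal sketch of the idea of a physical database sitting underneath the repository, the following Python example uses the standard-library sqlite3 module. The two-table model (element and characteristic), the function names and the sample values are assumptions made for illustration only; they are not the metadata model proposed in this chapter.

```python
import sqlite3

# Hypothetical two-table metadata model: one row per metadata element,
# one row per characteristic (attribute/value pair) of that element.
SCHEMA = """
CREATE TABLE element (
    element_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    category   TEXT NOT NULL
);
CREATE TABLE characteristic (
    element_id INTEGER NOT NULL REFERENCES element(element_id),
    attribute  TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (element_id, attribute)
);
"""


def open_repository(path=":memory:"):
    """Create the repository database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_element(conn, name, category, characteristics):
    """Design-time use: record an element and its characteristics."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO element (name, category) VALUES (?, ?)",
            (name, category),
        )
        conn.executemany(
            "INSERT INTO characteristic VALUES (?, ?, ?)",
            [(cur.lastrowid, attr, val) for attr, val in characteristics.items()],
        )


def retrieve_element(conn, name):
    """Run-time use: an application looks up the metadata it needs."""
    rows = conn.execute(
        "SELECT c.attribute, c.value FROM characteristic c "
        "JOIN element e ON e.element_id = c.element_id WHERE e.name = ?",
        (name,),
    ).fetchall()
    return dict(rows)
```

A designer would call store_element while specifying a data element, and a deployed application would later call retrieve_element to obtain the same specification from the single shared source.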

An important decision when implementing a metadata repository is the design paradigm: top-down, bottom-up or a hybrid strategy. A top-down approach looks at the entire organizational information system schema and tries to capture an overall set of metadata requirements. A bottom-up approach, on the other hand, starts from the finer granularity of subsystems and brings their metadata specifications together into one unified schema. While a top-down paradigm is more likely to ensure standardization and integration among subsystems, it might be infeasible where information systems with local metadata repositories are already in place. Moreover, capturing metadata requirements for an entire organization is a complex and tedious task that might not be completed within a reasonable time. The bottom-up paradigm, focusing on specific systems first, is more likely to achieve short-term results, but might fail to satisfy the larger integration needs. The “middle-out” approach, as a hybrid design alternative, treats each functional type of metadata as a module or component of the larger repository. It identifies one or two key modules of metadata and builds a repository consisting of these modules. With this approach, it should be recognized that the metadata repository will not be comprehensive or exhaustive to start with. After the “core” implementation, the initial modules may be expanded and others added incrementally to grow the repository. The advantages of the middle-out approach from a business standpoint are:

(a) it requires minimal investment initially and as the value is recognized, additional investments may be made,

(b) it is custom-developed to meet the specific, complex requirements of the organization and its offerings, and

(c) it can be initially built on existing hardware and software platforms and later ported to larger, more sophisticated ones should the need arise. This approach will further ensure that the metadata and its repository remain extensible and not dependent on a single set of applications.
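The incremental growth that the middle-out approach describes can be sketched in a few lines of Python. The class, its methods and the sample entries are hypothetical and exist only to illustrate starting with one or two core modules and adding others later; the module names are borrowed from the metadata types discussed in this chapter.

```python
class ModularRepository:
    """Repository grown module by module (all names here are illustrative)."""

    def __init__(self, core_modules):
        # Start with only the one or two key modules chosen as the core.
        self._modules = {name: {} for name in core_modules}

    def add_module(self, name):
        """Grow the repository by one module after the core is in place."""
        if name in self._modules:
            raise ValueError(f"module {name!r} already exists")
        self._modules[name] = {}

    def put(self, module, key, value):
        # Raises KeyError if the module has not been added yet.
        self._modules[module][key] = value

    def get(self, module, key):
        return self._modules[module][key]

    def modules(self):
        return sorted(self._modules)


# Core implementation: two modules only.
repo = ModularRepository(["data dictionary", "data quality"])
repo.put("data dictionary", "customer_id", {"type": "INTEGER"})

# Later increment: a new module is added without touching the existing ones.
repo.add_module("process")
repo.put("process", "nightly_load", {"target": "customer_id"})
```

Because each module is an independent namespace in this sketch, adding one does not require redesigning those already in use, which is the property the middle-out approach relies on.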

Another important design choice is the metadata repository architecture (Blumstein, 2003). A centralized architecture, which corresponds to a “top-down” paradigm, locates the organizational metadata repository on one centralized server that becomes the only metadata source for all the front-end and back-end utilities. Alternatively, a distributed architecture, which corresponds to a “bottom-up” design paradigm, allows systems to maintain their own customized metadata. The hybrid architecture allows metadata to reside with applications, but keeps the control and the key components in a centralized repository. The pros and cons of these architecture alternatives are summarized in Table 8.

The chosen design paradigm and architectural approach are likely to be influenced by the organizational structure and the complexity of the information systems. It is unlikely that a large organization with sophisticated information needs would adopt a top-down design for metadata and implement it using a centralized server. Such organizations are likely to have many information systems already in place, hence are more likely to apply a decentralized architecture, or a hybrid, using the “middle-out” design paradigm. Smaller organizations, with less complex information demands, can afford the “luxury” of a top-down approach, attempting to capture the entire set of metadata requirements and implementing a centralized architecture.

Table 8. Summary of metadata repository architectures

Architecture: Centralized (Passive Repository)
Implementation: Single, centralized location for metadata. All back-end (data storage, ETL) and front-end (business intelligence) tools should post their metadata into the repository, which becomes the only source for pulling and using it.
Pros:
• Efficient access: no need to search for metadata in multiple locations
• Better performance: no need to communicate with multiple tools
• Independence from tools being activated or not
• Easier to standardize and integrate
• Easier to capture additional metadata not related to a specific tool
Cons:
• Complex and time-consuming implementation
• Data redundancy, with a larger chance for quality hazards
• Synchronization issues
• Increased maintenance efforts

Architecture: Distributed (Active Repository)
Implementation: Metadata are kept at the back-end and front-end tools, where they are accessed. The users still access a single repository, which does not maintain copies but retrieves the metadata in real time, as needed.
Pros:
• Access is still efficient: one centralized location with lightweight data requirements
• Faster application development due to a higher level of independence
• No data redundancy: metadata are kept at their source
• Reduced system maintenance
Cons:
• Dependency on the end-systems being active
• Harder to standardize and integrate
• Harder to capture and integrate additional metadata not supported by the end-tools

Architecture: Hybrid
Implementation: Pieces of metadata provided by back-end and front-end tools are kept at the tools and accessed in real time, while additional homegrown pieces are maintained at the repository.
Pros:
• Efficient access
• Application independence is kept
• No data redundancy
• Ability to integrate between third-party and homegrown metadata
Cons:
• Sophisticated to implement
• Integration might not be achievable
• Dependency on the end-systems being active
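The behavioral differences among the three architectures in Table 8 can be contrasted with a small Python sketch. Everything here is an assumption made for illustration: Tool is a stand-in for any back-end or front-end product, and the three repository classes only model where the metadata lives and when the tools are contacted.

```python
class Tool:
    """Hypothetical stand-in for a back-end or front-end tool."""

    def __init__(self, name, metadata):
        self.name = name
        self.metadata = dict(metadata)
        self.active = True

    def read(self, key):
        if not self.active:
            raise ConnectionError(f"{self.name} is not running")
        return self.metadata[key]


class CentralizedRepository:
    """Passive: tools post copies; reads never touch the tools."""

    def __init__(self):
        self._store = {}

    def post(self, tool):
        # Copy the tool's metadata into the repository (creates redundancy).
        for key, value in tool.metadata.items():
            self._store[(tool.name, key)] = value

    def get(self, tool_name, key):
        return self._store[(tool_name, key)]


class DistributedRepository:
    """Active: keeps no copies; each read is forwarded to the owning tool."""

    def __init__(self, tools):
        self._tools = {tool.name: tool for tool in tools}

    def get(self, tool_name, key):
        return self._tools[tool_name].read(key)


class HybridRepository(DistributedRepository):
    """Tool metadata stays at the tools; homegrown metadata is stored here."""

    def __init__(self, tools):
        super().__init__(tools)
        self._homegrown = {}

    def put_homegrown(self, key, value):
        self._homegrown[key] = value

    def get_homegrown(self, key):
        return self._homegrown[key]
```

In this sketch a centralized repository keeps answering when a tool is shut down but can return a stale copy until the tool posts again, whereas the distributed and hybrid repositories always return the current value and fail when the owning tool is inactive. These are the synchronization and dependency trade-offs listed in the table.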

The actual implementation of the repository introduces another important choice: developing custom software for managing metadata or using commercial software products. In practice, most organizations are likely to choose a combination of the two. Developing customized software can be drawn out and expensive because metadata requirements are complex and change over time. Commercial software tools, on the other hand, might not meet the repository needs entirely (see Metadata Management in Commercial Data Warehousing Products for a discussion of these tools and their shortfalls). In recent years, software vendors have made some attempts to address this gap. Microsoft’s Meta Data Services (MDS), which is part of the MS-SQL Server RDBMS, is an attempt to create a unified metadata infrastructure. MDS is well integrated with other Microsoft product offerings but still has some drawbacks: it emphasizes technical metadata, pays little attention to business metadata and relies on an RDBMS with no support for other storage alternatives. Software tools that specialize in metadata management have also emerged, such as Ascential, MetaCenter by the Data Advantage Group and MetaBase by MetaMatrix. These tools address the metadata repository implementation needs by supporting both technical and business metadata. They claim to be vendor and technology independent, providing interfaces to most of the leading data warehousing products and supporting both the OIM and the CWM metadata exchange models (see Metadata Management in Commercial Data Warehousing Products for a description of these models). At the time of writing this chapter, software products for metadata management are still emerging and have not gained significant market share. However, if the demand for metadata repository solutions continues to grow, the demand for tools that support centralized metadata management is expected to grow as well.

Figure 1 presents a conceptual layered architecture of a homegrown repository, illustrating the different modules within it. This architecture is aimed at total data quality management in a data warehouse and therefore includes the process and data quality metadata. Reporting and application-related metadata are linked with the warehouse data elements for effective communication and delivery of the data to the decision-makers. Together with the formatting and vocabulary preferences of the users and personalized metaphors for data visualization, these constitute the data delivery metadata in the warehouse. The mapping component of the data dictionary metadata, consisting of the mapping between data elements, is distributed across the conceptual layers (shown by the arrows in Figure 1). The data dictionary metadata elements used in data integration, such as the dependencies and constraints between source data elements, are captured in the middle and lower layers of the conceptual architecture.

Data quality metadata are integrated with process metadata and are represented by the IPMAP in the conceptual architecture. The IPMAP stages (processes) are mapped to the extraction and transformation rules (another piece of process metadata) captured in the middle layer of the architecture. Administration and infrastructure metadata, which span the entire warehouse, are shown on either side of the conceptual architecture in Figure 1. A detailed list of the metadata elements corresponding to the architecture in Figure 1 is given in Table 9.

In document Processing and Managing Complex Data for Decision Support (Page 181-185)