8. Infrastructural Challenges
8.1 Data and Data Service/Tool Findability
8.1.3 Data Discovery
A major challenge faced by researchers working in a networked multidisciplinary environment is pinpointing the location of relevant data.
Determining where data sets are located, what they contain, and who can access them is a critical and necessary step toward accessing all the data relevant to a researcher's activities, which may be stored in several data collections distributed across a science ecosystem.
By Data Discovery we mean the capability to quickly and accurately identify and find data that supports research requirements.
The process of discovering data that exist within a data set is supported by search and query capabilities that exploit the metadata descriptions contained in data categorization/classification schemes, data dictionaries, data inventories, and metadata registries.
8.1.3.1 Data Classification
Data classification is the categorization of data for its most effective and efficient use. In a basic approach to storing computer data, data can be classified according to its critical value or how often it needs to be accessed, with the most critical or most frequently used data stored on the fastest media while other data is stored on slower (and less expensive) media. This kind of classification tends to optimize the use of data storage for multiple purposes: technical, administrative, legal, and economic. Data can be classified according to any criteria.
A well-planned data classification system makes essential data easy to find, which can be of particular importance in data discovery. In the field of data management, data classification, as part of the Information Lifecycle Management (ILM) process, can be defined as a tool for categorizing data that helps researchers effectively answer the following questions:
What data types are available?
Where are certain data located?
What access levels are implemented?
What protection level is implemented and does it adhere to compliance regulations?
Data Classification Tools: Data classification is typically a manual process; however, many tools from different vendors can help gather information about the data. They help categorize data, primarily for the purpose of tiered storage, and focus on finding unstructured data on a variety of file shares. This data can be categorized by content, file type, usage, and many other variables.
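The basic storage-oriented classification described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataSetProfile` fields, the `assign_storage_tier` function, the tier names, and the threshold of 100 accesses per month are all hypothetical and do not come from any particular classification tool.

```python
from dataclasses import dataclass


@dataclass
class DataSetProfile:
    """Hypothetical summary of one data set, as a classification tool might gather it."""
    name: str
    critical: bool            # assumed flag: is the data considered critical?
    accesses_per_month: int   # assumed usage measure


def assign_storage_tier(profile: DataSetProfile, hot_threshold: int = 100) -> str:
    """Return 'fast' for critical or frequently accessed data, otherwise 'slow'.

    The threshold is an arbitrary illustrative value.
    """
    if profile.critical or profile.accesses_per_month >= hot_threshold:
        return "fast"
    return "slow"


profiles = [
    DataSetProfile("sensor_raw", critical=False, accesses_per_month=3),
    DataSetProfile("calibration", critical=True, accesses_per_month=10),
    DataSetProfile("daily_summary", critical=False, accesses_per_month=450),
]

# Map each data set name to the storage tier chosen for it.
tiers = {p.name: assign_storage_tier(p) for p in profiles}
```

A real scheme would likely use more criteria (content, file type, legal constraints), but the same pattern of mapping data set properties to a category applies.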
8.1.3.2 Data Dictionary
Data dictionaries contain information about the data held in large data collections. Each data element is defined by its data type, the location where it can be found, and the location it came from. When a field is derived, the data dictionary often includes the derivation logic. This may be business logic or research logic, but it must be defined.
The data dictionary also includes the physical location, such as a server DNS (domain name system) name or IP address. The data collection name, the instance, the table, and the field name are particularly important for researchers seeking relevant data. This information is even more important if a researcher must cross multiple systems to gather the necessary pieces of information for their research.
The administrator of a data collection should be responsible for keeping this information accurate and up to date. Typically each data collection has its own data dictionary. It is good practice for each data dictionary to have a single owner.
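As a rough illustration, the information a data dictionary records for each data element could be modeled as follows. The class name `DictionaryEntry`, its field names, the `find_fields` helper, and all sample values (including the host name) are assumptions made for this sketch, not part of any specific data dictionary product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One data element as described in a (hypothetical) data dictionary."""
    collection: str                    # data collection name
    instance: str                      # database instance
    table: str
    field_name: str
    data_type: str
    host: str                          # server DNS name or IP address
    source: str                        # where the data came from
    derivation: Optional[str] = None   # logic used when the field is derived

    def qualified_name(self) -> str:
        """Full path a researcher would need to locate the field."""
        return f"{self.host}/{self.collection}/{self.instance}/{self.table}.{self.field_name}"


def find_fields(entries: List[DictionaryEntry], keyword: str) -> List[DictionaryEntry]:
    """Return entries whose field name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [e for e in entries if needle in e.field_name.lower()]


entries = [
    DictionaryEntry("ocean_obs", "prod", "readings", "water_temp_c", "float",
                    host="db1.example.org", source="buoy telemetry"),
    DictionaryEntry("ocean_obs", "prod", "readings", "water_temp_k", "float",
                    host="db1.example.org", source="derived",
                    derivation="water_temp_c + 273.15"),
    DictionaryEntry("ocean_obs", "prod", "stations", "station_id", "string",
                    host="db1.example.org", source="station registry"),
]

matches = find_fields(entries, "temp")
```

Searching a set of such entries across several collections is one simple way a researcher could locate a field without knowing in advance which system holds it.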
8.1.3.3 Metadata Registry
Metadata registries are used whenever data must be used consistently within a research community or in a multidisciplinary context. Examples of these situations include:
Communities that transmit data using structures such as XML, Web Services or EDI
Communities that need consistent definitions of data across time, between databases, between communities or between processes, for example when a community builds a large data collection
Communities that are attempting to break down "silos" of information captured within applications or proprietary file formats
Central to the charter of any metadata management program is the process of creating trusting relationships with stakeholders and ensuring that definitions and structures have been reviewed and approved by the appropriate parties.
A metadata registry typically has the following characteristics:
Protected environment where only authorized individuals may make changes
Stores data elements that include both semantics and representations
Semantic areas of a metadata registry contain the meaning of a data element with precise definitions
Representational areas of a metadata registry define how the data is represented in a specific format, such as in a database or a structured file format (e.g., XML)
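The characteristics listed above can be illustrated with a minimal in-memory sketch. It assumes a simple set of authorized user names as the protection mechanism; the names `MetadataRegistry`, `DataElement`, `register`, and `lookup`, as well as the sample element, are hypothetical and are not taken from any registry standard or product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DataElement:
    name: str
    definition: str                   # semantic area: precise meaning
    representations: Dict[str, str]   # representational area: format -> encoding


class MetadataRegistry:
    """Toy registry in which only authorized users may make changes."""

    def __init__(self, authorized_users: Iterable[str]) -> None:
        self._authorized = set(authorized_users)
        self._elements: Dict[str, DataElement] = {}

    def register(self, user: str, name: str, definition: str,
                 representations: Dict[str, str]) -> None:
        # Protected environment: reject changes from unauthorized users.
        if user not in self._authorized:
            raise PermissionError(f"{user} may not modify the registry")
        self._elements[name] = DataElement(name, definition, dict(representations))

    def lookup(self, name: str) -> DataElement:
        # Reading is open to everyone in this sketch.
        return self._elements[name]


registry = MetadataRegistry(authorized_users={"curator"})
registry.register(
    "curator",
    "sea_surface_temperature",
    "Temperature of the water at the ocean surface.",
    {"sql": "DOUBLE PRECISION", "xml": "xs:double"},
)
element = registry.lookup("sea_surface_temperature")
```

Keeping the definition separate from the per-format representations mirrors the split between semantic and representational areas: the meaning is stated once, while each format records how the value is encoded.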
8.1.3.4 Data Inventory
Research communities will need to develop their own “data inventory” focused on identifying and describing all the data elements contained across their different data collections.
The goal of a data inventory is to catalogue the data researchers actually need. Inventorying the data that moves between systems, data collections, and scientific communities accomplishes two things: it identifies the most valuable data elements in use, and it helps identify data that is not high-value because it is not being shared or used. This approach also provides a way to tackle initial data quality efforts by identifying the most “active” data used by a research community. It ultimately helps the data management team understand where to focus its efforts and prioritize accordingly.
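One simple way to identify the most active data elements is to count how often each element appears in transfers between systems. The sketch below assumes a transfer log of (source, destination, element) tuples; that log format, the function name `rank_elements_by_activity`, and the sample system and element names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Transfer = Tuple[str, str, str]  # (source system, destination system, data element)


def rank_elements_by_activity(
    known_elements: Iterable[str],
    transfers: Iterable[Transfer],
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return (active elements with transfer counts, elements never transferred).

    Active elements are sorted by descending count, then by name for stable output.
    """
    counts = Counter(element for _src, _dst, element in transfers)
    active = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    unused = sorted(set(known_elements) - set(counts))
    return active, unused


known = ["temperature", "salinity", "station_id", "operator_note"]
log = [
    ("buoy_db", "archive", "temperature"),
    ("buoy_db", "archive", "salinity"),
    ("archive", "portal", "temperature"),
    ("archive", "portal", "station_id"),
    ("buoy_db", "portal", "temperature"),
]

active, unused = rank_elements_by_activity(known, log)
```

Here the elements at the top of `active` are candidates for initial data quality work, while those in `unused` are candidates for lower priority.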
Legacy data inventory and profiling is a structured and comprehensive way of learning about the corporate data asset. This activity is centered on a professional data analyst who gathers information, runs a variety of reports and ad hoc queries to assess the existing data, and creates or updates documentation about the existence, scope, meaning, and quality of the data asset.
The resulting documentation (including high-level summaries and easily navigated detail data behavior documentation) should provide answers to the following key questions:
What data do we have?
Where can it be found?
What constraints limit read-only access?
Where did each unit of data come from?
What is its age distribution?
What is its scope?
What is its quality?
Does it include test data?
What is the business meaning of the data?
Are there ambiguities in usage, meaning and expectations?
In addition to the technical issues of where the data is located, the lead data analyst should also assess and document the political issues surrounding the data, such as who “owns” any portion of the data or who seeks to limit read-only access to it.
A Research Data Infrastructure must efficiently support a Data Discovery Environment composed of query and search capabilities as well as data discovery tools, including one or more of the following: data categorization/classification schemes, data dictionaries, data inventories, and metadata registries.