8. Infrastructural Challenges
8.2 Data Federation
Data federation is an umbrella term for a wide range of decentralized data practices.
At one extreme of this range we have data integration. It is the process of combining data residing at different sources, and providing the user with a unified view of these data. Data integration has two broad goals: increasing the completeness and increasing the conciseness of data that is available to users and applications.
A slightly different concept is data harmonization; it refers to the process of comparing similar conceptual and logical data models to determine the common data elements, similar data elements and dissimilar data elements in order to produce a resulting unified data model that can be used consistently across organizational units.
At the other extreme of the range we have the concept of data linking. It refers to the process of publishing data on a scientific data space in such a way that its meaning is explicitly defined, it is linked to other external data sets and can in turn be linked to from external data sets.
A data linking capability does not act as a data integration system as this requires semantic integration before any service can be provided; instead it follows a co-existence approach, i.e., to provide base functionality over all data sets, regardless of how integrated they are.
An aggregation/federation capability should be of paramount importance for the next generation of global research data infrastructures.
8.2.1 Data Integration
Data Integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. Data integration has two broad goals [53]: increasing the completeness and increasing the conciseness of data that is available to users and applications. An increase in completeness is achieved by adding more data sources (more objects, more attributes describing objects) to the system. An increase in conciseness is achieved by removing redundant data, by fusing duplicate entries and merging common attributes into one.
The problem of designing data integration systems is important in current scientific applications and is characterized by a number of issues that are interesting from a theoretical point of view [54]. Data integration systems are characterized by an architecture based on a global schema and a set of sources. The sources contain the real data, while the global schema provides a reconciled, integrated, and virtual view of the underlying sources.
Data integration is a three-step process: data transformation, duplicate detection, and data fusion [53].
8.2.1.1 Data Transformation
This step is concerned with the transformation of the data present in the sources into a common representation (renaming, restructuring). Data from the data sources must be transformed to conform to the global schema of an integrated information system.
Modelling the relation between the sources and the global schema is therefore a crucial aspect. Two basic approaches have been proposed for this purpose. The first approach, called global-as-view (or schema integration), requires that the global schema be expressed in terms of the data sources. In essence, this approach considers the individual schemata and tries to generate a new schema that is complete and correct with respect to the source schemata, and that is minimal and understandable. The second approach, called local-as-view (or schema mapping), requires the global schema to be specified independently from the sources; the relationships between the global schema and the sources are established by defining every source as a view over the global schema. This approach is driven by the need to include a set of sources in a given integrated information system. A set of correspondences between elements of a source schema and elements of the global schema is generated to specify how data is to be transformed.
The goal of both approaches is the same: transform data of the sources so that it conforms to a common global schema.
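As an illustration only, the use of correspondences to transform source data into a common global schema can be sketched in Python as follows. The global schema, the two sources (source_a and source_b), their attribute names and the unit conversion are all hypothetical; they are assumptions made for the example and are not taken from any of the cited systems.

```python
from typing import Any, Callable

Record = dict[str, Any]

# Hypothetical global schema: every integrated record exposes these attributes.
GLOBAL_SCHEMA = ("name", "mass_kg", "observed_on")

# Hypothetical correspondences, one set per source:
# global attribute -> (source attribute, conversion function).
CORRESPONDENCES: dict[str, dict[str, tuple[str, Callable[[Any], Any]]]] = {
    "source_a": {
        "name": ("sample_name", str.strip),             # renaming
        "mass_kg": ("weight_g", lambda g: g / 1000.0),  # renaming + unit change
        "observed_on": ("date", str),
    },
    "source_b": {
        "name": ("label", str.strip),
        "mass_kg": ("mass_kg", float),
        # source_b records no observation date, so no correspondence exists.
    },
}


def to_global(source: str, record: Record) -> Record:
    """Rename and restructure one source record so it conforms to GLOBAL_SCHEMA."""
    mapping = CORRESPONDENCES[source]
    result: Record = {}
    for attribute in GLOBAL_SCHEMA:
        if attribute in mapping:
            source_attribute, convert = mapping[attribute]
            result[attribute] = convert(record[source_attribute])
        else:
            result[attribute] = None  # attribute not provided by this source
    return result


rows = [
    to_global("source_a", {"sample_name": " basalt-7 ", "weight_g": 1250, "date": "2011-03-02"}),
    to_global("source_b", {"label": "basalt-7", "mass_kg": "1.25"}),
]
```

After the transformation both records share the same attribute names and units, so that they can be compared directly in the subsequent steps. Attributes that a source does not provide are left empty rather than guessed.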
8.2.1.2 Duplicate Detection
This step concerns the identification of multiple, possibly inconsistent representations of the same real-world objects, and it provides the basic input to data fusion. The result of the duplicate detection step is the assignment of an object-ID to each representation. Two representations with the same object-ID are duplicates. Note that more than two representations can share the same object-ID, thus forming duplication clusters. It is the goal of data fusion to fuse these multiple representations into a single one.
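The assignment of object-IDs and the formation of duplication clusters can be sketched as follows. This is a minimal illustration, not a method prescribed by the cited work: the matching rule (equal normalized title, or equal identifier) and the sample records with their made-up identifiers are assumptions. Pairwise matches are merged with a union-find structure so that representations matched only indirectly still end up in the same cluster.

```python
from itertools import combinations


def normalize(text: str) -> str:
    """Lower-case the text and drop every non-alphanumeric character."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def same_object(a: dict, b: dict) -> bool:
    """Hypothetical matching rule: equal non-empty DOI, or equal normalized title."""
    if a.get("doi") and a.get("doi") == b.get("doi"):
        return True
    return normalize(a["title"]) == normalize(b["title"])


def assign_object_ids(representations: list[dict]) -> list[int]:
    """Return one object-ID per representation; duplicates share the same ID."""
    parent = list(range(len(representations)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(representations)), 2):
        if same_object(representations[i], representations[j]):
            root_i, root_j = find(i), find(j)
            # The smallest index in a cluster is used as its object-ID.
            parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(len(representations))]


# Made-up records; the DOI values are placeholders, not real identifiers.
records = [
    {"title": "Sea Surface Temperature 2010", "doi": None},
    {"title": "sea-surface temperature 2010", "doi": "10.0000/example.1"},
    {"title": "SST 2010 (v2)", "doi": "10.0000/example.1"},
    {"title": "Arctic Ice Extent", "doi": None},
]
object_ids = assign_object_ids(records)
```

Here the first and third records do not match each other directly, but both match the second one, so all three form one duplication cluster. The comparison of all pairs is quadratic in the number of representations and is only suitable for small inputs.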
8.2.1.3 Data Fusion
This step combines and fuses the duplicate representations into a single representation while inconsistencies in the data are resolved. A number of data fusion strategies have been defined. Conflict-ignoring strategies do not make a decision as to what to do with conflicting data. They escalate conflicts to the user or application, or create all possible value combinations.
Conflict-avoiding strategies acknowledge the existence of possible conflicts in general, but do not detect and resolve single existing conflicts.
Conflict resolution strategies regard all data and metadata before deciding on how to resolve a conflict. They can further be subdivided into deciding and mediating strategies, depending on whether they choose a value from all the already present values (deciding) or choose a value that does not necessarily exist among conflicting values (mediating).
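The difference between these strategy families can be made concrete with a short sketch. The three functions below are hypothetical examples of a conflict-ignoring, a deciding and a mediating strategy; majority voting and averaging are chosen here only for illustration, and the sample cluster is invented.

```python
from collections import Counter
from statistics import mean


def ignore_conflict(values):
    """Conflict-ignoring: keep every distinct value and leave the decision to the caller."""
    return sorted(set(values))


def decide_by_vote(values):
    """Deciding: choose the most frequent of the values that are already present."""
    return Counter(values).most_common(1)[0][0]


def mediate_by_mean(values):
    """Mediating: compute a value that need not occur among the conflicting values."""
    return mean(values)


def fuse(cluster, strategies):
    """Fuse one duplication cluster into a single record, attribute by attribute."""
    fused = {}
    for attribute, strategy in strategies.items():
        values = [r[attribute] for r in cluster if r.get(attribute) is not None]
        fused[attribute] = strategy(values) if values else None
    return fused


# Invented cluster of three representations of the same real-world object.
cluster = [
    {"name": "basalt-7", "mass_kg": 1.25},
    {"name": "basalt-7", "mass_kg": 1.75},
    {"name": "Basalt 7", "mass_kg": None},
]
fused = fuse(cluster, {"name": decide_by_vote, "mass_kg": mediate_by_mean})
escalated = ignore_conflict([r["name"] for r in cluster])
```

The fused mass (1.5) does not appear in any source record, which is what distinguishes a mediating strategy from a deciding one. The conflict-ignoring function simply returns both spellings of the name for the user or application to handle.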
When handling the data integration problem, special attention must be devoted to the following aspects: modelling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
8.2.2 Data Harmonization
By and large, most information is stored in databases on mainframe computers. Most of these systems are not accessible via network connections due to security concerns. This has hindered the development of “real time” data aggregation efforts. Additionally, most of the data models for these systems are not harmonized, since most have evolved separately from different sets of requirements. To further complicate matters, many different vendors have implemented most existing systems using different products and information models [55].
Data Harmonization is the process of comparing similar conceptual and logical data models to determine the common data elements, similar data elements and dissimilar data elements in order to produce a resulting unified data model that can be used consistently across organizational units, business systems, and Data Warehouses.
Data Harmonization covers both the data and the underlying business definitions.
There is a need for efficient harmonization tools to support the harmonization process. Some possible features of a harmonization tool are:
The ability to import data models into the tool from various representation formats.
The ability to linguistically and semantically analyze the components of multiple data models to determine equivalence, similarity and dissimilarity.
The ability to construct multi-modal views of the data models to assist in comparison analysis.
The ability to harmonize and visualize models with multiple users simultaneously and to jointly create the resultant data model.
The automated ability to extract common data elements across models to produce a new resultant skeleton model that can be further refined by a manual process.
The ability to save the resultant data model in multiple representations (the same set of representations available to import).
Legal requirements: in essence they concern political commitments.
Technical aspects: modalities applied in the data harmonization.
Operational aspects: adoption of international standards related to data length, format, attributes and semantic interoperability.
A real time, event-driven and harmonized view of all data is desired. To facilitate this development, a methodology and practice must be used to ensure any future work has a strong foundation of data harmonization to be built upon.
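The comparison at the heart of data harmonization, i.e., sorting the elements of two data models into common, similar and dissimilar ones, can be sketched as follows. This is a minimal illustration under strong assumptions: a data model is reduced to a set of element names, similarity is measured by the overlap of name tokens (Jaccard index), and the threshold of 0.5 is arbitrary. A real harmonization tool would also compare types, definitions and semantics.

```python
def tokens(name: str) -> frozenset[str]:
    """Split an element name such as 'collection_date' into lower-case tokens."""
    return frozenset(name.lower().replace("-", "_").split("_"))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of tokens that two names have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)


def compare_models(model_a: set[str], model_b: set[str], threshold: float = 0.5):
    """Classify elements as common, similar (paired) or dissimilar."""
    common = model_a & model_b
    rest_a, rest_b = model_a - common, model_b - common

    similar = []
    for a in sorted(rest_a):
        # Best candidate in the other model; None if that model has nothing left.
        best = max(sorted(rest_b), key=lambda b: jaccard(tokens(a), tokens(b)), default=None)
        if best is not None and jaccard(tokens(a), tokens(best)) >= threshold:
            similar.append((a, best))

    matched_a = {a for a, _ in similar}
    matched_b = {b for _, b in similar}
    dissimilar = (rest_a - matched_a) | (rest_b - matched_b)
    return common, similar, dissimilar


# Two invented data models from different organizational units.
model_a = {"sample_id", "collection_date", "latitude"}
model_b = {"sample_id", "date_of_collection", "depth_m"}
common, similar, dissimilar = compare_models(model_a, model_b)
```

The common elements form the skeleton of the unified model, the similar pairs are candidates for manual reconciliation, and the dissimilar elements must be either added to the unified model or left out by agreement.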
8.2.3 Data Linking
In the context of a science ecosystem, linking data becomes imperative, as it allows users to benefit from the use of multiple datasets created by different research communities and organizations, including the connection of publications with the subject data.
A scientific data infrastructure must lower the barrier to publishing and accessing data, thus leading to the creation of scientific data spaces that connect data sets from diverse domains, disciplines, regions and nations. The concept of a scientific data space is an answer to the rapidly increasing demands of researchers for data-everywhere.
Linking data refers to the capability, supported by a scientific data infrastructure, of publishing data on a scientific data space in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets and can in turn be linked to from external data sets. A data infrastructure supporting a data space does not act as a data integration system as this requires semantic integration before any service can be provided; instead it follows a co-existence approach, i.e., to provide base functionality over all data sets, regardless of how integrated they are [57].
A scientific data infrastructure should offer researchers the possibility to start browsing in one data set and then navigate along links into related data sets; it can also support data search engines that crawl the data space by following links between data sets and provide expressive query capabilities over aggregated data. To achieve this, the data infrastructure must support the creation of typed links between data from different sources.
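Typed links and their use for navigation and crawling can be sketched as follows. The identifiers, the link types (derivedFrom, describedBy, sameRegionAs) and the in-memory list of links are all made up for the example; in practice the links would be published by the data providers and retrieved through the infrastructure.

```python
from collections import deque
from typing import NamedTuple, Optional


class Link(NamedTuple):
    source: str     # identifier of the data set that holds the link
    link_type: str  # what the link means
    target: str     # identifier of the linked data set or publication


# Made-up data space with four links.
LINKS = [
    Link("ds:ocean-temp", "derivedFrom", "ds:buoy-raw"),
    Link("ds:ocean-temp", "describedBy", "pub:paper-42"),
    Link("ds:buoy-raw", "sameRegionAs", "ds:sea-ice"),
    Link("ds:unrelated", "describedBy", "pub:paper-99"),
]


def neighbours(dataset: str, link_type: Optional[str] = None) -> list[str]:
    """Targets linked from a data set, optionally restricted to one link type."""
    return [
        link.target
        for link in LINKS
        if link.source == dataset and (link_type is None or link.link_type == link_type)
    ]


def crawl(start: str) -> list[str]:
    """Breadth-first crawl: everything reachable from start by following links."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in neighbours(current):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


reachable = crawl("ds:ocean-temp")
sources_of_ocean_temp = neighbours("ds:ocean-temp", "derivedFrom")
```

Because every link carries a type, a researcher or a search engine can restrict navigation to a particular relationship, for instance following only derivedFrom links to find the raw data behind a derived data set.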
Data providers willing to add their data to a scientific data space which allows data to be discovered and used by various applications must publish them according to some principles. These provide a basic recipe for publishing and connecting data using the scientific data infrastructure architecture, services, and standards. The principles should include [58]:
the assignment of permanent universal data identifiers, i.e., strings or tokens that are unique within a data space;
setting links to other data sources so that users can navigate the data space as a whole by following the links; and
the provision of metadata so that users can assess the quality of published data.
A scientific data space enabled by a data infrastructure enjoys the following properties [58]:
it contains data specific to the scientific disciplines supported by the data infrastructure;
any scientific community belonging to the disciplines supported by the data infrastructure can publish on the scientific data space;
data providers are not constrained in choice of vocabularies with which to represent data;
data is connected by links, supported by the data infrastructure, creating a global data graph that spans data sets and enables the discovery of new data sets.
From an application development perspective the scientific data space should have the following characteristics [58]:
data is strictly separated from formatting and presentational aspects;
data is self-describing;
the scientific data space is open, meaning that applications do not have to be implemented against a fixed set of data sets, but can discover new data sets at run time by following the data links.
From a system perspective, a data infrastructure should provide:
a registry service whose purpose is to manage a collection of identifiers and make it actionable and interoperable, where that collection can include identifiers from many other controlled collections;
a name resolution system, which resolves the data identifiers into the information necessary to locate, and access them;
automated or semi-automated generation of links.
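The interplay of the registry service and the name resolution system can be sketched as follows. The class name, its methods, the identifier syntax and the example location are hypothetical and serve only to illustrate the idea of resolving an identifier into the information needed to locate and access the data.

```python
class IdentifierRegistry:
    """Toy registry: manages identifiers and resolves them to access information."""

    def __init__(self) -> None:
        self._entries: dict[str, dict[str, str]] = {}

    def register(self, identifier: str, location: str, media_type: str) -> None:
        """Add a new identifier; identifiers must be unique within the data space."""
        if identifier in self._entries:
            raise ValueError(f"identifier already registered: {identifier}")
        self._entries[identifier] = {"location": location, "media_type": media_type}

    def resolve(self, identifier: str) -> dict[str, str]:
        """Return the information needed to locate and access the identified data."""
        try:
            return dict(self._entries[identifier])  # copy, so callers cannot alter the registry
        except KeyError:
            raise LookupError(f"unknown identifier: {identifier}") from None


registry = IdentifierRegistry()
# Placeholder identifier and location (example.org is a reserved example domain).
registry.register("ds:ocean-temp", "https://data.example.org/ocean-temp.csv", "text/csv")
access_info = registry.resolve("ds:ocean-temp")
```

Keeping the identifier separate from the location means that a data set can be moved without breaking the links that point to it; only the registry entry has to be updated.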
8.3 Data Sharing
Definition
Openness in the sharing of research results is one of the norms of modern science. The assumption behind this openness is that progress in science demands the sharing of results within the scientific community as early as possible in the discovery process.
Data Sharing is the use, by one or more consumers, of information that is produced by a source other than the consumer.
Why Share Research Data [59]
Research data are a valuable resource, usually requiring much time and money to be produced. Many data have a significant value beyond usage for the original research.
Sharing research data is important for several reasons:
Encourages scientific enquiry and debate
Promotes innovation and potential new data uses
Maximizes transparency and accountability
Enables scrutiny of research findings
Encourages the improvement and validation of research methods
Reduces the cost of duplicating data collection
Increases the impact and visibility of research
Promotes the research that created the data and its outcomes
Can provide a direct credit to the researcher as a research output in its own right
Provides important resources for education and training.
How to Share Data [59]
There are various ways to share research data, including:
Depositing them with a specialist data center, data archive or data bank
Submitting them to a journal to support a publication
Depositing them in an institutional repository
Making them available online via a project or institutional website
Making them available informally between researchers on a peer-to-peer basis
Approaches to data sharing may vary according to research environments and disciplines, due to the varying nature of data types and their characteristics.
Data Documentation [59]
A crucial part of making data user-friendly, shareable and with long-lasting usability is to ensure they can be understood and interpreted by any user. This requires clear and detailed data description, annotation and contextual information.
Data documentation explains how data were created or digitized, what data mean, what the content and structure are and any data manipulations that may have taken place. Documenting data should be considered best practice when creating, organizing and managing data and is important for data preservation. Whenever data are used sufficient contextual information is required to make sense of that data.
Good data documentation includes information on:
The context of data collection: project history, aim, objectives and hypotheses
Data collection methods: sampling, data collection processes, instruments used, hardware and software used, scale and resolution, temporal and geographic coverage and secondary data sources used
Dataset structure of data files, study cases, relationships between files
Data validation, checking, profiling, cleaning and quality assurance procedures carried out
Changes made to data over time since their original creation and identification of different versions of data files
Information on access and use conditions or data confidentiality
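The list above can be turned into a simple completeness check that is run before a data set is deposited. The sketch below is only an illustration: the section names are a hypothetical encoding of the six items listed above, and a real repository would define its own documentation schema.

```python
# Hypothetical section names, one per documentation item listed above.
REQUIRED_SECTIONS = (
    "context",
    "collection_methods",
    "structure",
    "quality_assurance",
    "changes",
    "access_conditions",
)


def missing_sections(documentation: dict[str, str]) -> list[str]:
    """Return the required sections that are absent or contain only whitespace."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not documentation.get(section, "").strip()
    ]


# Invented, deliberately incomplete documentation record.
draft = {
    "context": "Pilot study on coastal erosion; aims and hypotheses in the project plan.",
    "structure": "   ",
}
todo = missing_sections(draft)
```

Such a check can only confirm that each section has been filled in; whether the content is sufficient for a secondary user still has to be judged by a person.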
Despite its importance, however, sharing data is not easy. There are three categories of problems hindering data sharing: (i) willingness to share, (ii) locating shared data, and (iii) using shared data. First, there is a strong sense in which the scientist’s ability to profit from data collection depends on maintaining exclusive control over the data – economists would say that the data are a source of “monopoly rents” for the scientist. In this case, however, the profit, or rent, accrues largely in the form of scientific reputation and its accompanying benefits, such as publications, grants, and students [61]. The point here is that the competition (and its associated benefits) in science is intense, and there may be a strong reluctance on the part of scientists to share data, as such sharing may amount to a sacrifice of future rents that could be extracted from the data were they not shared.
Second, researchers must become aware of who has the data they need or where the data are located, which can be a nontrivial problem [62]. After finding appropriate data, they often must negotiate with the owner or develop trusting relationships to gain access [63].
Third, once in possession of a data set, understanding it requires knowledge of the context of its creation [64]. How was each datum collected and analyzed? What format are the data in? If the data are in electronic form, is there a key or metadata available to indicate what the various fields in the database mean? Researchers also need to know something about the quality of the data they are receiving, and if the original purpose of the data set is compatible with the proposed use. Answering these questions requires a large amount of effort on the part of the data creator, but the benefit of such effort goes largely to the secondary user. This renders it unlikely that adequate documentation will be produced.
Even if documentation is provided, however, it is often the case that much of the knowledge needed to make sense of data sets is tacit. Scientists are not necessarily able to explicate all of the information that is required to understand someone else’s work.
Approaches to Data Sharing
Shared data is only useful if sufficient context is provided about the data such that collaborators may comprehend and effectively apply it. Typically, data context is passed among collaborating scientists via one-to-one discussion (face-to-face interactions, over the phone, and through emails).
More recently, data sharing environments have developed a number of approaches to the above issues. For example, standardized reporting formats and metadata protocols can allow the same data to be read across different hardware or software. Password and security systems can give a certain degree of control over who does and does not have access to data sets. Metadata may provide context about who collected data and how they were processed.
While these approaches deal effectively with the explicit technological problems inherent in data sharing, it is not clear that they adequately deal with many of the tacit and social issues outlined above. Indeed, a metadata model can only provide so much contextual information, leading to a potentially recursive situation in which metadata models require “meta-metadata” in order to be effectively understood.
Recent calls for data sharing suggest that funding agencies believe that groundbreaking scientific research requires more data sharing among scientists. Even if we provide the technical means to move data from one lab to another, however, there may be social barriers to effectively using this data in practice. To design technologies that truly support the conduct of science and not just the sharing of a data set, the designer must understand both the scientific role that data play in producing knowledge, and the social role that data play in the conduct of scientific work.
Data features and properties
Among the most prominent data features and properties which enable the sharing of data we include [65]:
General data set properties (Basic data set properties such as owner, creation date, size, format, etc.)
Experimental properties (Conditions and properties of the scientific experiment that generated or is to be applied to the data)
Data provenance (Relationship of data to previous versions and other data sources)
Integration (Relationship of data subsets within a full data set)
Analysis and interpretation (Notes, experiences, interpretations, and knowledge generated from analysis of data)
Physical organization (Mapping of data sets to physical storage structure such as a file system, database, or some other data repository)
Project organization (Mapping of data sets to project hierarchy or organization)
Task (Research task(s) that generated or applies data set)
Experimental process (Relationship of data and tasks to overall experimental process)
User community (Application of data sets to different organizations of users)
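A few of these properties can be captured in a small context record attached to each shared data set. The sketch below is an assumption-laden illustration: the class name, the selection of fields and the sample catalogue are invented, and only general properties, task, analysis notes and provenance are modelled.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataSetContext:
    """Hypothetical context record for one shared data set."""
    identifier: str
    owner: str                  # general data set property
    created: date               # general data set property
    data_format: str            # general data set property
    task: str = ""              # research task that generated the data set
    derived_from: list[str] = field(default_factory=list)  # data provenance
    notes: list[str] = field(default_factory=list)         # analysis and interpretation


def lineage(identifier: str, catalogue: dict[str, DataSetContext]) -> list[str]:
    """Return the data set followed by its ancestors, in depth-first order."""
    order: list[str] = []
    stack = [identifier]
    while stack:
        current = stack.pop()
        if current in order:
            continue  # already visited through another path
        order.append(current)
        stack.extend(reversed(catalogue[current].derived_from))
    return order


# Invented catalogue: raw data, a cleaned version, and a derived product.
catalogue = {
    "raw": DataSetContext("raw", "Lab A", date(2011, 1, 10), "csv", task="field survey"),
    "cleaned": DataSetContext("cleaned", "Lab A", date(2011, 2, 1), "csv", derived_from=["raw"]),
    "averaged": DataSetContext("averaged", "Lab B", date(2011, 3, 5), "netcdf", derived_from=["cleaned"]),
}
history = lineage("averaged", catalogue)
```

Recording provenance as explicit references to earlier data sets lets a secondary user trace a derived product back to the original observations, which is part of the context needed to judge whether the data fit a new purpose.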
Data Sharing Environments
A data sharing environment is composed of a number of capabilities and tools that support the contexts for shared data use. Here, we list some of these data sharing capabilities/tools:
Data Browsing tools
Data Viewing tools
Data Translation capabilities and tools
Capabilities for accessing and applying external databases
Database Connection tools
Subscription capabilities
Notification capabilities
Annotation tools
As one important objective of the research data infrastructures is the creation of scientific collaborative environments, they have to support efficient and effective data sharing environments which constitute a key component of the collaborative environments.