
Tenth U.S. National Conference on Earthquake Engineering Frontiers of Earthquake Engineering

July 21-25, 2014 Anchorage, Alaska 10NCEE

A DATA CURATION AND DISSEMINATION SYSTEM FOR EARTHQUAKE ENGINEERING

T. J. Hacker1, S. Pejša2, and J. Browning3

ABSTRACT

Over the past three years, the NEES project has developed a cyberinfrastructure that supports the collection, archiving, and dissemination of experimental and simulation data. During the development of this system, we have identified and solved several problems that are both unique to earthquake engineering and common to data oriented science and engineering projects. In this paper, we describe the data cyberinfrastructure we have developed for NEES along with lessons learned that we believe would be useful to other groups seeking to deploy data curation and dissemination systems to support large-scale science engineering projects.

1Associate Professor, Computer & Information Technology, Purdue University, West Lafayette, IN 47907

2Data Curator, Discovery Park NEEScomm, Purdue University, West Lafayette, IN 47907

3Associate Dean, Professor Civil Engineering, University of Kansas, Lawrence, KS 66045

Hacker TJ, Pejša S, Browning J. A Data Curation and Dissemination System for Earthquake Engineering. Proceedings of the 10th National Conference in Earthquake Engineering, Earthquake Engineering Research Institute, Anchorage, AK, 2014.


Introduction

The George E. Brown Network for Earthquake Engineering Simulation (NEES) is comprised of a network of 14 laboratories across the United States that provide experimental and cyber resources to support the research activities of the earthquake engineering community focused on improving the resilience of the built environment. Large-scale experimental equipment operated by NEES includes shake tables, a tsunami wave basin, large-scale testing facilities, geotechnical centrifuges, field and mobile facilities, and a large-displacement facility. In addition to the core laboratories, NEES has international partnership agreements with peer laboratory consortiums in Canada, Europe, Japan, New Zealand, Taiwan, and China.

NEEScomm, the headquarters of NEES at Purdue University, operates a comprehensive and production-quality cyberinfrastructure that supports the research and education activities of the NEES network and earthquake engineering researchers [1]. Part of our responsibility is to operate a repository for the data generated from NEES research experiments conducted at the NEES sites as well as high value community data sets.

The user community served by the NEEShub (the main portal), the Project Warehouse (the repository for NEES research data), and Databases is global. As of summer 2013, users from over 200 countries have accessed the NEES cyberinfrastructure since the launch of the NEEShub in 2010. For the 12-month period beginning in October 2012, 86,433 users accessed information and ran simulations in the NEEShub, with use originating from Asia (30% of users), Europe (21%), and the United States. Users who provide data are mostly researchers conducting projects at NEES sites. Over the past 12 months, 180 users contributed 604,582 files totaling 6,844.5 GB to the Project Warehouse, from which the most active users downloaded 3,840,646 files totaling 10,663 GB of data.

Technical and social challenges in developing a distributed data repository

To develop, deploy, and operate the NEES data repository, we needed to overcome technical, social, and educational challenges. On the technology front, one of the first problems we faced was the need for a “desktop-to-repository” software infrastructure that would allow users to quickly organize the thousands of files associated with a project and maintain a local cache of a project from the Project Warehouse on their desktop computer. Users needed files to be uploaded and downloaded automatically as new files were added to the project, either on their desktop or in the Project Warehouse. Another challenge was to develop an easy-to-use and logical interface for navigating projects, browsing and searching data, and transferring the thousands of files in a project. This required automatically managing downloads at the trial, experiment, or individual file level. Another set of challenges was integrating file storage with descriptive metadata stored in a database into the NEEShub. Moreover, once data are in the repository, users need to easily use software tools within the NEEShub with these data. Cybersecurity was another critical area of need. Project principal investigators needed user access control mechanisms to manage which users can access a project and the type of access (read/write) permitted for each user. We also realized that newly uploaded data and the entire data repository had to be regularly scanned for computer viruses and malware.

Another area of need for the data repository was to ensure the long-term integrity and viability of data contained within files uploaded by the community. This had two components: 1) ensuring that data contained within a file are not modified or corrupted; and 2) making sure that the file format is known and can be definitively associated with an application that can interpret and extract information from the file.

A critical dimension of the challenges we faced was social. The highest priority was to collect user needs and requirements with a priority ranking determined by earthquake engineering researchers. With limited development resources, we focused on meeting the most pressing needs of the earthquake engineering community. Moreover, the technology and cyberinfrastructure developed by the team needed to be easily usable and “make sense” for civil engineering users. One critical area was to seek some degree of consensus on metadata structures and schemas to represent and organize data that was “just right”: not too restrictive or prescriptive, yet not so vague or ill-defined as to be useless. A final area of social challenges was in how individuals worked as a disciplinary group. A method was needed to encourage researchers to quickly complete the process of archiving and releasing their data.

Data repository requirements

Close communication and continuous feedback from the community of practice is a key component of any successful infrastructure. The Requirement Analysis and Assessment Committee (RAAS), comprised of a group of earthquake engineering researchers chaired by Dr. JoAnn Browning, identified specific user needs and requirements, and prioritized these requirements for the NEES data cyberinfrastructure. Requirements were collected from user submissions to a wish list in the NEEShub and from needs collected by the RAAS. Internal requirements are driven by the needs of NEEScomm as the steward of the data repository. The majority of internal requirements relevant to long-term access to and preservation of the research data were identified through the Trusted Repositories Audit and Certification (TRAC) self-audit [2]. The self-audit covered a checklist of over 80 criteria that characterize a trustworthy digital repository.

Developing the NEES Data Repository

We developed and operate a data repository to meet the needs of the earthquake engineering community that is fully integrated with the NEEShub. The goals of our effort (based in part on community requirements from the NSF [3]) are to: 1) provide an authoritative repository that contains data that can be easily searched, browsed, reused, and cited; 2) operate a highly reliable cyberinfrastructure featuring data, tools, and collaboration spaces for the earthquake engineering community; 3) provide a high level of data curation services and infrastructure to ensure the long-term integrity, accessibility, and veracity of data; and 4) maintain accessibility through a science gateway, web services, or an upload/download management tool (PEN).

The NEEShub (http://nees.org) (Figure 1) is a HUBzero-based science gateway that delivers research resources to the NEES community [4] [5]. These resources, accessible from the front page of the NEEShub, include: Tools & Resources; Learning & Outreach through the NEESacademy; the Project Warehouse, the main data repository for NEES; Simulation capabilities; Sites, providing information on NEES laboratories; Collaboration, to facilitate remote project collaboration; and Databases, to disseminate community databases. The NEEShub also provides facilities to run Windows and Linux software tools in the hub with direct file system access to data in the Project Warehouse. We utilize virtualization technology to provide software as a service to NEEShub users. We also facilitate access to supercomputers through the NSF XSEDE project, the Open Science Grid, and Purdue Community Clusters.

Figure 1. The NEEShub.

The NEES Project Warehouse is the main data repository for data generated by the NEES community. The Project Warehouse provides three project listings: publicly accessible projects; my projects, which lists a specific user’s projects; and enhanced projects, which are fully curated projects that include at least one visualization file. For each project, the Project Warehouse displays a project metadata panel (Figure 2) that summarizes information about the project such as PIs, facility used for the project, project dates, a brief description of the project, project keywords, and the overall curation status of the project.

There are three types of projects in the Project Warehouse: Experiment, containing information related to physical experiments conducted at NEES sites; Simulation, for computational simulations; and Hybrid Simulation, which combines both previous types. Each project in the Project Warehouse features a group of tabs on the project summary page for navigating into greater levels of detail on the experiments and individual files in a project. The Experiments tab lists the experiments in a project. The Team Members tab lists the people involved in the project. The Reviews tab shows reviews of the project, which can be submitted by any NEEShub user. Finally, the File Browser tab provides a hierarchical view of the directory structure. Within the file browser, we provide an overview of the number of files by type, such as Drawings, Publications, and Photos. The Project Warehouse also maintains metrics on the number of times a project is accessed and the number of file downloads. Moreover, we provide detailed statistics on the inherent types of the files in a project, and on the number of downloads of project content by country, as shown in Figure 3.

Figure 2. Project summary page of a project in the Project Warehouse.

Figure 3. Experiment level type identification and download activity.

The architecture of the Project Warehouse is based on an NFS file server paired with an Oracle database server. The NFS server maintains the individual files for each project organized by project, experiment, and trial. The file structure used is visible through the “File Browser” tab for a project in the Project Warehouse. The database maintains descriptive metadata for each file and directory in the Project Warehouse, as well as information about the people involved in projects, information on sensors and materials, and other information such as checksums and file create and modification dates. By maintaining this descriptive metadata separately from the file structure, we can ensure the long-term integrity and correctness of these data over time and perform complex database queries on the metadata.

The NFS server and Oracle DBMS are accessible from the NEEShub server, which translates user requests through the NEEShub along with Web Services calls into file and database access operations.
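To illustrate the split between file storage and descriptive metadata, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in for the Oracle database. The table name, column names, paths, and sample rows are hypothetical and do not reflect the actual NEES schema.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file_metadata (
        path      TEXT PRIMARY KEY,  -- location of the file on the file server
        project   TEXT NOT NULL,
        category  TEXT NOT NULL,     -- e.g. Drawings, Publications, Photos
        sha256    TEXT NOT NULL,     -- checksum recorded at upload time
        modified  TEXT NOT NULL      -- ISO 8601 modification date
    )
    """
)

# Hypothetical rows following a project/experiment/trial layout.
rows = [
    ("/proj-1/exp-1/trial-1/setup.dwg", "proj-1", "Drawings", "ab12", "2013-05-01"),
    ("/proj-1/exp-1/trial-1/rig.jpg", "proj-1", "Photos", "cd34", "2013-05-02"),
    ("/proj-1/exp-1/trial-2/rig.jpg", "proj-1", "Photos", "ef56", "2013-05-09"),
    ("/proj-2/exp-1/trial-1/paper.pdf", "proj-2", "Publications", "0a1b", "2013-06-01"),
]
conn.executemany("INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?)", rows)


def files_by_category(connection, project):
    """Count the files of one project per category using only the metadata."""
    cursor = connection.execute(
        "SELECT category, COUNT(*) FROM file_metadata "
        "WHERE project = ? GROUP BY category ORDER BY category",
        (project,),
    )
    return dict(cursor.fetchall())


counts = files_by_category(conn, "proj-1")
print(counts)
```

Because the query touches only the metadata table, a summary such as the per-type file counts can be produced without reading any file on the file server.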

The Project Explorer for NEES (PEN) is a data upload/download management tool designed to be run on the user’s local computing environment or within the NEEShub. PEN simplifies the process of downloading the hundreds to thousands of files contained within the project, and aids users in uploading project files as they are generated during the course of a project and experiment.

Using PEN, users can import files locally on their desktop computer. PEN will automatically compute the necessary checksums and collect file metadata. When the user performs a synchronization operation for a file, a directory, or even the entire project, PEN will ensure that the necessary file upload and database update operations are completed, and will report the status of each upload/update operation through a color-based status scheme in the PEN file browser running on the user’s desktop computer.
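The import step can be sketched roughly as follows. This is not PEN's code: the paper does not name the checksum algorithm or the metadata fields PEN collects, so SHA-256 and the field names below are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path):
    """Collect basic metadata plus a checksum for one imported file."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
        "sha256": sha256_of(path),
    }


# Usage with a small hypothetical data file.
sample_bytes = b"time,accel\n0.0,0.01\n"
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "trial-1.csv")
    with open(sample_path, "wb") as handle:
        handle.write(sample_bytes)
    record = describe_file(sample_path)

print(record["name"], record["size_bytes"])
```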

Users may wish to download an entire existing project from the Project Warehouse to their local computer to use their own software to analyze or sift through project data. Many projects contain thousands of files, or may have files updated or added throughout the course of a project. Manually managing the download and re-download of updated files would be unwieldy and cumbersome for users. PEN provides facilities to automatically detect updated and new files, and to download these files during a synchronization operation. PEN also validates the checksum of each file to ensure that the file was downloaded without damage.
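One way to picture the download side of synchronization is as a comparison of two manifests that map file paths to checksums. The sketch below is an illustration under that assumption; the manifest structure, function names, and the use of SHA-256 are hypothetical and are not taken from PEN.

```python
import hashlib


def plan_download(remote_manifest, local_manifest):
    """Return (new, updated) paths by comparing path -> checksum manifests."""
    new_files = sorted(p for p in remote_manifest if p not in local_manifest)
    updated_files = sorted(
        p
        for p in remote_manifest
        if p in local_manifest and local_manifest[p] != remote_manifest[p]
    )
    return new_files, updated_files


def verify_download(data, expected_sha256):
    """Check that downloaded bytes match the checksum stored in the repository."""
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Hypothetical manifests for a project with three files in the repository.
payload = b"sensor data v2"
remote = {
    "exp-1/trial-1/data.csv": hashlib.sha256(payload).hexdigest(),
    "exp-1/trial-1/notes.txt": "aaa",
    "exp-1/trial-2/data.csv": "bbb",
}
local = {
    "exp-1/trial-1/data.csv": hashlib.sha256(b"sensor data v1").hexdigest(),
    "exp-1/trial-1/notes.txt": "aaa",
}

new_files, updated_files = plan_download(remote, local)
print(new_files, updated_files)
```

Only the files listed in the plan need to be transferred, and each transferred file can then be passed through `verify_download` before it replaces the local copy.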

When opening a new project, PEN can be used to establish the project directory structure on a user’s local computer. As new files are added or updated during the course of a project, PEN can automatically detect those files and upload them to Project Warehouse as well as register the file within the database as a curatable object with certifiable provenance. This greatly aids researcher efforts to manage the hundreds to thousands of files generated during the course of a project.

Establishing citable data practices

A critical element of the Project Warehouse is the use of Digital Object Identifiers (DOIs). Each experiment within a project is assigned a DOI to facilitate citation of the experiment by other projects that reference or reuse its data. An example of an experiment-level citation is:

Hartanto Wibowo, Danielle Sanford, Ian Buckle, David Sanders (2013). "Truck Characterization Experiment Using Shake Table - Part I: Empty Truck with Tires", Network for Earthquake Engineering Simulation (distributor), Dataset, DOI:10.4231/D3CN6Z02X

A simple search for “DOI:10.4231/D3CN6Z02X” in Google will uniquely identify the dataset and provide a link directly to the experiment in the NEEShub. NEEScomm promotes DOIs as a means of facilitating reuse of research data and incentivizing researchers to contribute data to the repository. NEEScomm provides a recommended format for citing the dataset, and researchers are encouraged to use it in the reference section of papers and articles that describe or are derived from the dataset. Researchers are also instructed not only to cite the datasets in the NEEShub as they (re-)use data, but also to attribute the repository.
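The citation pattern in the example can be reproduced mechanically from a handful of metadata fields. The following sketch infers the pattern from that single example; the function and parameter names are hypothetical, and the authoritative format is the one published by NEEScomm.

```python
def format_dataset_citation(
    authors,
    year,
    title,
    doi,
    distributor="Network for Earthquake Engineering Simulation",
):
    """Build a citation string following the example experiment-level citation."""
    author_list = ", ".join(authors)
    return (
        f'{author_list} ({year}). "{title}", '
        f"{distributor} (distributor), Dataset, DOI:{doi}"
    )


def doi_url(doi):
    """Return the resolver link for a DOI via the standard doi.org proxy."""
    return f"https://doi.org/{doi}"


citation = format_dataset_citation(
    authors=["Hartanto Wibowo", "Danielle Sanford", "Ian Buckle", "David Sanders"],
    year=2013,
    title=(
        "Truck Characterization Experiment Using Shake Table - "
        "Part I: Empty Truck with Tires"
    ),
    doi="10.4231/D3CN6Z02X",
)
print(citation)
print(doi_url("10.4231/D3CN6Z02X"))
```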

The improved metadata are particularly important because NEEScomm exposes the metadata of research datasets to third-party data aggregators, so that the datasets can be discovered by members of the earthquake engineering community globally. NEEShub metadata harvested by Google can be found through Google search and Google Scholar. The metadata are also submitted to the search engine of DataCite (http://search.datacite.org/ui), an organization that works to make research data easier to access, and most recently NEEScomm contributed metadata to the Data Citation Index℠ from Thomson Reuters.

NEES curation process

Proper data archiving requires familiarity with the curation documentation as well as planning. To assist researchers with data archiving, NEEScomm offers curation services to researchers uploading data to the Project Warehouse.

The NEEScomm curation team ensures that all data uploaded to the NEES repository are stored in the correct location, use an acceptable format, are properly documented, and are equipped with the necessary metadata, so that the experiment can be understood and the data can be correctly interpreted and effectively reused.

Curation at NEES, shown in Figure 4, is an iterative and interactive process that is aligned with the dates set in the NEEScomm Data Sharing and Archiving Policies (DSAP) [6]. Unprocessed data coming directly from the data acquisition system are expected to be uploaded within a month after test completion. Metadata and documentation for individual tests should be uploaded within six months after completion of the test, so that the experiment can be fully curated and made public 12 months after completion. Once the documentation of the data set is considered complete, the experiment is marked as curated, made public, and issued a DOI. Research teams are able to upload additional analysis or drafts of publications and papers to their projects even after curation.
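The timeline above can be turned into concrete due dates for a given test. The sketch below assumes that "a month" means a calendar month and clamps to the last day of shorter months; the DSAP itself is the authority on how the dates are counted, and the milestone labels are paraphrases rather than official names.

```python
import calendar
from datetime import date

# Months after test completion for each milestone described in the text.
MILESTONES = {
    "unprocessed data uploaded": 1,
    "metadata and documentation uploaded": 6,
    "experiment curated and public": 12,
}


def add_months(start, months):
    """Add calendar months to a date, clamping the day to the month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def curation_deadlines(test_completed):
    """Map each milestone to its due date for one test completion date."""
    return {
        name: add_months(test_completed, months)
        for name, months in MILESTONES.items()
    }


# Hypothetical test completed on 31 August 2013.
deadlines = curation_deadlines(date(2013, 8, 31))
for name, due in deadlines.items():
    print(f"{name}: {due.isoformat()}")
```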

Figure 4. Curation workflow at NEEScomm.

In addition to personal interaction with the curation team, research teams can also use the curation documents listed at the NEEScomm User Guide page (http://nees.org/warehouse/userguide). The most practical of these are the NEES Curation wiki, NEEScomm Requirements for Curation and Archiving of Research Data (http://nees.org/resources/4759), and NEEScomm Guidelines for Data Upload (http://nees.org/resources/4757). The handout Summary of Curation Requirements for Research Data is available at all NEES sites and can be downloaded at http://nees.org/resources/5492.

The curation status of individual projects is tracked on the Site Activities page of the NEEShub (http://nees.org/sitesactivities/equipmentavailability), shown in Figure 5, which provides a quick overview of the curation status of projects. Projects are listed both by site and by individual principal investigator.

The ability to visualize data or launch visualization and analytical tools directly in the Project Warehouse provides a convenient way to quickly review and assess data without downloading either the data or the analytical tool. Currently, the NEES repository provides direct access to data through two viewers: inDEED (http://nees.org/resources/indeed), for quick processing, visualization, and analysis of earthquake engineering data in the NEEShub; and 3DDV, the 3D Data Viewer (http://nees.org/resources/3DDV), developed by CEES at RPI in collaboration with Oregon State University, which displays 3D models and 2D plots side by side and organizes data from various sources that can be represented in 3D space.

Beyond the highly structured and curated datasets generated by the NEES research community, there is a much broader and loosely structured class of community datasets collected and generated by the earthquake engineering research community. To provide a data repository for this use beyond NEES experiments, NEES created Databases and DataStore. Databases, at http://nees.org/databases, provides a simple interface that allows users to quickly discover and browse community datasets. DataStore is a new feature released in October 2013 that provides the tools needed for users to convert an Excel spreadsheet containing named and tagged columnar data into a new community database. Each new community database can be assigned a Digital Object Identifier, which gives researchers who reuse the data a permanent citable reference. This is invaluable both for researchers who produce these data and for those seeking to reuse them. To capture and disseminate the digital knowledge created by the community, the NEEShub also hosts data from the American Society of Civil Engineers (ASCE) Journal of Earthquake Engineering and the American Concrete Institute. Additionally, the Earthquake Engineering Research Institute (EERI) is now soliciting Earthquake Data Papers, peer-reviewed written articles describing datasets. All of the datasets described in these papers must include a DOI, and can refer to datasets in the NEES Project Warehouse.

Figure 5. Curation status in the Project Warehouse: status of the whole repository.

Experiences and lessons learned

Over the past four years, we have found that it is as important to address the social challenges as it is to work on the technical challenges. As with any project, there are limits on resources available for software development and operations. To have a good chance of being successful, we decided that we needed a strong focus on meeting the highest priority needs of users based on their ranking of needs. As described by Charette [7], the most common factors for software project failure include poor communication, unrealistic project goals, and management failures. These factors are social, not technological. Thus, we deliberately put significant emphasis on social elements and procedures by carefully considering the following factors at the start of the project:

1) Recruiting professional staff with commercial software engineering experience. We created an organization and culture that from the start was built on a foundation of professional practice and accountability common in the software development industry.

2) Ensuring that software engineers were comfortable with project management practices, such as schedule estimation, working to a project schedule, and the need for frequent and honest communication with management, peers, and users.

3) Avoiding the use of unproven technology and software unless the payoff for users far outweighed the potential loss of services and risks from using the technology.

4) Leveraging existing open source software and commercial off-the-shelf solutions as much as possible to allow software engineers to focus efforts on adding value for the users rather than reinventing existing software (the “buy versus build” tradeoff).

5) Embedding earthquake engineering researchers into every aspect of the software engineering process and operations. We employ an iterative process (shown in Figure 6) to collect user requirements, produce a software development plan and schedule, develop, deploy, and collect feedback on all aspects of the NEES cyberinfrastructure. This “shoulder-to-shoulder” approach with earthquake engineers ensured that developers would not need to generate their own requirements when there was a gap in the documented requirements, and it provided quick feedback during development. This greatly reduced development time and improved the relevance of the developed systems for the earthquake engineering community.

6) Using practices common in industry, such as staged software releases, project management, and separate production, test, and development environments. This allowed us to promote new releases into production with very little interruption in service. As a result, over this year the NEEShub achieved a remarkable 99.96% uptime.
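To put the 99.96% uptime figure in perspective, it can be converted into total downtime. The short calculation below assumes the figure applies to a full 365-day year; the paper does not state the exact measurement window.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Convert an uptime percentage into hours of downtime over a period."""
    return (1 - uptime_percent / 100) * period_hours


hours_down = downtime_hours(99.96)
print(f"{hours_down:.2f} hours, or about {hours_down * 60:.0f} minutes")
```

Under that assumption, 99.96% uptime corresponds to roughly three and a half hours of downtime over the year.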

Figure 6. NEEShub release workflow.

Data curation and archiving was an especially productive area for new experiences and lessons learned, since this area of cyberinfrastructure practice is fairly new. Although NEEScomm is constantly evaluating and improving the user interface of the NEEShub and revising the curation documentation, data archiving is still a challenge for researchers. The community as a whole has not developed a mature culture of standard data management and data recording practices that would organically lead to archiving in the repository. NEES tries to lead and set standards for documentation of earthquake engineering research. The role of curation as a service provided to researchers, therefore, should not be underestimated. Early reminders were sent to project principal investigators soon after the completion of their tests, and this improved the level of documentation of datasets considerably. Thanks to regular monitoring of the archiving efforts and continuous feedback on uploaded documentation, the number of curated experiments is constantly growing.

The collaboration with the NEES site personnel was also instrumental in minimizing the number of experiment datasets that did not comply with the DSAP deadlines.

For each project, once the unprocessed data was secured and researchers became familiar with the archiving timeline, communication with research teams improved, as did their responsiveness and the quality of the metadata and documentation they provided. This led to faster and more effective archiving. Data archiving is a significant investment of time and thought; therefore, NEEScomm is continually exploring new ways to benefit researchers and reward them for their effort. Based on expressed needs of the community and feedback from researchers, we added curation status and metric reporting tools for each project. Researchers can now see how often their experiments are viewed, the number of downloads of their data, and the geographical location of those who viewed their experiments. We expect that DOIs will also increase the impact of research projects within the NEEShub and aid efforts to more closely track the use of project data.

We found that the TRAC self-audit resulted in improved curation documentation and motivated our work on developing tools for automating some aspects of the preservation process. This included strengthening the preservation functions in the NEES repository, particularly in terms of identification of formats, format obsolescence awareness, and preservation planning. As a result, NEEScomm developed a new preservation pipeline workflow. All files incoming to the repository are scanned for viruses, and basic descriptive and administrative metadata are collected automatically. Daily, several micro-services execute format validation and identification calls on all files uploaded to the repository that day and extract additional technical and administrative metadata that facilitate data preservation and more effective management of the repository.
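Format identification of the kind described above is commonly based on the leading "magic" bytes of a file rather than on its extension. The sketch below illustrates that general technique with a few well-known signatures; it is not the NEES micro-service, whose tools are not named in this paper, and the function name is hypothetical.

```python
# Well-known leading byte signatures mapped to media types.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx and .xlsx
]


def identify_format(header):
    """Return the media type matching the first bytes of a file, or None."""
    for magic, media_type in SIGNATURES:
        if header.startswith(magic):
            return media_type
    return None


# Usage with hypothetical file headers.
print(identify_format(b"%PDF-1.4\n..."))
print(identify_format(b"time,accel\n0.0,0.01\n"))
```

Files for which no signature matches (plain-text sensor logs, for example) would need another identification strategy, which is one reason a preservation workflow typically records format information as explicit metadata.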

In terms of utilization by the community, we have observed significant growth in the overall number of users accessing NEEShub as well as the amount of contributed content. Figure 7 shows the monthly count of the number of NEEShub users over a prior 12-month period. The number of users is currently doubling about every 12 months.

Figure 7. Annual number of NEEShub users.
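A doubling time of about 12 months implies a compound monthly growth rate of roughly 6%. The sketch below works this out under the assumption of steady exponential growth; the starting count of 1,000 users is a hypothetical value, not a figure from this paper.

```python
def projected_users(current_users, months_ahead, doubling_months=12):
    """Project a user count forward assuming steady exponential growth."""
    return current_users * 2 ** (months_ahead / doubling_months)


# Monthly growth rate implied by a 12-month doubling time.
monthly_growth = 2 ** (1 / 12) - 1

print(f"implied monthly growth: {monthly_growth:.2%}")
print(projected_users(1000, 12), projected_users(1000, 24))
```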


Conclusions

In conclusion, we have developed a successful data cyberinfrastructure that is meeting the needs of the earthquake engineering community. We have met many of the challenges described earlier, and new challenges have emerged. Although we can measure metrics such as data downloads, it is more difficult to quantitatively measure how these data are reused. We anticipate that DOIs will provide a mechanism to discover data reuse through references to data in the repository, but this relies on the research community adopting the practice of referencing data as they reference publications. Another emerging challenge is the need to archive software along with the data. Because non-ASCII data are encoded in some kind of binary format, the software that can read and interpret this format will be needed in the future to read and display these data. Finally, we are working to understand how users progress through the process of learning about, trying, and finally deciding to adopt a cyberinfrastructure through the Rogers Diffusion of Innovations model [8].

Overall, from our experience it is clear that the cyberinfrastructure and the data repository must be developed in close partnership with domain science experts and driven by documented user requirements as well as by community standards from the library community. We have found that there are few “one size fits all” data cyberinfrastructure solutions that can be universally used across many science disciplines – the best fit is a cyberinfrastructure that is based on the needs of the domain science community.

Acknowledgement

The work described by this paper is supported by the NSF NEES project (CMMI-0927178).

References

1. Hacker, T., Eigenmann, R., and Rathje, E. (2013). "Advancing Earthquake Engineering Research through Cyberinfrastructure." Journal of Structural Engineering, 139, Special Issue: NEES 1: Advances in Earthquake Engineering, 1099–1111.

2. CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist, 2007. Retrieved from http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf.

3. National Science Foundation Request for Proposals NSF 08-574, 2008, "George E. Brown, Jr. Network for Earthquake Engineering Simulation Operation (NEESops)".

4. Hacker, Thomas J., Rudi Eigenmann, Saurabh Bagchi, Ayhan Irfanoglu, Santiago Pujol, Ann Catlin, and Ellen Rathje. "The NEEShub cyberinfrastructure for earthquake engineering." Computing in Science & Engineering 13, no. 4 (2011): 67-78.

5. Hacker, Thomas, Rudolf Eigenmann, and Ellen Rathje. "Advancing Earthquake Engineering Research through Cyberinfrastructure." Journal of Structural Engineering 139, no. 7 (2013): 1099-1111.

6. NEEScomm. Data Sharing and Archiving Policies, 2011. Retrieved from https://nees.org/resources/2811.

7. Charette, R. N. Why software fails. IEEE Spectrum, 42(9), 36, 2005.

8. Hacker, Thomas J., and Alejandra J. Magana. A framework for measuring the impact and effectiveness of the NEES cyberinfrastructure for earthquake engineering. Technical Report. https://nees.org/resources/3963, 2011.
