6j Data Analysis

Data analysis of this research study focused on two components:

1. Qualitative data analysis: Using a grounded-theory-influenced analysis for both the demographic and the process-oriented questionnaires.

2. Quantitative rubric: Analyzing the metadata submitted into Dryad.

Specific areas of each component were examined in detail. The data analysis techniques used to examine each component are described below.

1. Qualitative data analysis

The questionnaire answers were analyzed using grounded theory data analysis techniques. The data collected from the questionnaires took the form of short answers; narrative, paragraph-like descriptions; and procedural lists. Examples of these responses appear in Section 7. Results. To begin the inductive analysis process, the answers from the questionnaires were reviewed and saved as RTF files. These files were opened in the qualitative data analysis software Atlas.ti7. After an initial review, descriptive codes were created to highlight salient information organization processes and decisions. Codes were also created to highlight similarities and differences related to metadata creation and subject term application. Twenty-two codes were created during this process. These codes are shared in the Results section.

In consultation with P. Mihas (personal communication, December 17, 2010; personal communication, October 17, 2011), the Odum Institute’s Coordinator of Education and Qualitative Research Consultant, a reviewer was selected to independently verify the codes within Atlas.ti. The reviewer verified the codes and returned them. The original and verified codes were then compared using the Coding Analysis Toolkit (CAT)8. All coding was also reviewed by G. Liu (personal communication, December 15, 2011), Empirical Research Associate at Duke Law School. Multiple reviewers of the coding were used in an attempt to eliminate bias. Coding results are discussed in more detail in Section 7. Results.
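The source does not say which agreement statistic was used to compare the two sets of codes. As an illustration only, the sketch below computes Cohen's kappa, a common chance-corrected agreement measure for two coders. The code labels and the `primary`/`reviewer` lists are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of code labels.

    Each position is one coded segment; the value is the code a coder applied.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(coder_a)
    # Observed agreement: share of segments that received the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical code assignments for four questionnaire segments.
primary = ["metadata", "subject_term", "metadata", "file_naming"]
reviewer = ["metadata", "subject_term", "file_naming", "file_naming"]
kappa = cohen_kappa(primary, reviewer)
```

A value near 1 indicates strong agreement beyond chance, while a value near 0 indicates agreement no better than chance.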

2. Quantitative data analysis

Quantitative analysis was performed using traditional indexing methods of counting and mapping. The purpose of using these basic methods was twofold:

• To characterize the information organization behavior of both scientists and information professionals in a controlled system.

• To learn which metadata fields each group uses frequently and to compare the level of subject analysis ability associated with each group.

To perform this data analysis, counting and mapping methods were used to show specific connections between the descriptive metadata elements used, the preferred subject terms, and established controlled vocabularies. The Quantitative Analysis Rubric, presented in Figure 6, shows the specific data that was analyzed.

Figure 6: Quantitative analysis rubric

--Counting--

Metadata

-- What Dryad metadata fields were used?

-- What fields were not used?

-- Which fields were most frequently used?

Subject terms

-- Number of subject terms total

-- Average # of terms chosen by each individual

-- Average # of terms chosen by each group

-- Create a tag cloud to visualize the subject terms used and their frequency

-- Out of the 4 subject descriptors available (general subject, scientific names, temporal, and spatial), which area is used most frequently?

--Mapping--

Subject terms

-- Map each subject term used to 4 of the vocabularies, some of which are cited in the Dryad vocabulary study from 2007. Vocabularies included are: Library of Congress Subject Headings (LCSH); Medical Subject Headings (MeSH); National Biological Information Infrastructure Thesaurus (NBII); and Integrated Taxonomic Information System (ITIS). All of these vocabularies were available using the Helping Interdisciplinary Vocabulary Engineering (HIVE)1 demonstration project.

-- Frequency of specific vocabulary terms mapped to which population

-- Compare terms used to original terms created by the data set author; which group has more overlap of terms.
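The counting tasks in the rubric can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the record layout, the field names in `ALL_FIELDS`, the participant identifiers, and the sample terms are all hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical deposits: one record per Dryad submission.
deposits = [
    {"participant": "IP1", "group": "information professional",
     "fields": {"title", "author", "description", "spatial"},
     "subject_terms": ["migration", "Aves", "North Carolina"]},
    {"participant": "S1", "group": "scientist",
     "fields": {"title", "author"},
     "subject_terms": ["migration"]},
    {"participant": "S2", "group": "scientist",
     "fields": {"title", "author", "description"},
     "subject_terms": ["migration", "Aves"]},
]

# Hypothetical list of fields offered by the deposit form.
ALL_FIELDS = {"title", "author", "description", "spatial", "temporal"}


def field_usage(records):
    """Count how many deposits filled in each metadata field."""
    counts = Counter()
    for record in records:
        counts.update(record["fields"])
    return counts


def unused_fields(records, all_fields):
    """Fields that no deposit used."""
    return all_fields - set(field_usage(records))


def term_frequencies(records):
    """Case-insensitive subject term counts (the input to a tag cloud)."""
    return Counter(t.lower() for r in records for t in r["subject_terms"])


def mean_terms_by(records, key):
    """Average number of subject terms per deposit, grouped by `key`
    ("participant" for individuals, "group" for populations)."""
    grouped = {}
    for record in records:
        grouped.setdefault(record[key], []).append(len(record["subject_terms"]))
    return {name: mean(counts) for name, counts in grouped.items()}
```

Calling `field_usage(deposits).most_common()` ranks fields by use, and `mean_terms_by(deposits, "group")` gives the per-group averages named in the rubric.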

This rubric highlights questions and tasks that relate to the research questions presented in Section 6c. Based on the analysis performed with this rubric, the following outcomes will be presented in the Discussion section.

For Question 1: Propose metadata for data repository use.

For Question 2: Report on current metadata used by information professionals.

For Question 3: Recommend controlled vocabularies for data repository use.

For Question 4: Report similarities in subject term use.

For Question 5: Report differences in subject term use.

As shown above, each question relates to a specific outcome presented in Section 8. Discussion. These outcomes are based on results presented in Section 7. Results.
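The mapping tasks in the rubric in Figure 6, which feed the vocabulary recommendations and the similarity and difference outcomes listed above, can be sketched as follows. This is a hedged illustration: the vocabulary contents shown are tiny hypothetical excerpts, not real extracts from LCSH, MeSH, NBII, or ITIS, and exact matching after normalization is an assumption; the study may have used a different matching rule.

```python
def normalize(term):
    """Lowercase a term and collapse internal whitespace before matching."""
    return " ".join(term.lower().split())


# Hypothetical vocabulary excerpts, keyed by vocabulary name.
VOCABULARIES = {
    "LCSH": {"birds", "animal migration"},
    "MeSH": {"birds", "animal migration"},
    "NBII": {"migration"},
    "ITIS": {"aves"},
}


def map_term(term, vocabularies):
    """Return the sorted names of the vocabularies containing the term
    (exact match after normalization, which is an assumption)."""
    key = normalize(term)
    return sorted(name for name, terms in vocabularies.items() if key in terms)


def overlap_with_author(participant_terms, author_terms):
    """Share of the data set author's terms that a participant also used."""
    participant = {normalize(t) for t in participant_terms}
    author = {normalize(t) for t in author_terms}
    if not author:
        return 0.0
    return len(participant & author) / len(author)
```

Tallying the output of `map_term` per participant group gives the frequency of vocabulary matches by population, and averaging `overlap_with_author` per group indicates which group's terms overlap more with the author's original terms.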

7. Results

Using the methodology discussed in the previous section, data about the information organization approaches of information professionals and scientists was collected. These results relating to information organization and scientific data sets are reported in detail. The results section begins by reporting on participant breakdowns. Findings from both the PIM task scenario and the Dryad task scenario portions of this study are reported next. The PIM task scenario portion of this study simulates what individuals would do when integrating data sets into their own collections. The Dryad task scenario portion of this study simulates what individuals would do when integrating data sets into a shared public system. For each portion, results describing the information professional population are presented first, and results describing the scientist population are presented second.

The results for the PIM task scenario present descriptive statistics from two questionnaires. The first questionnaire collected demographic data. Results from the first questionnaire report the type of position, the years each participant had been in his/her current position, areas of expertise, education, and experience using data. The second questionnaire collected each participant’s reaction after performing the PIM-influenced task scenario. Results from the second questionnaire include the type of changes each participant made to the data set, the type of metadata used to create read me files or records, the type of keywords created to describe the data sets, and the type of guidelines used for creating metadata. Participants also had the opportunity to return both the modified data sets and all additional documentation created during the task scenario. Results from these materials are reported as well.

The results for the Dryad task scenario present descriptive statistics from the metadata and subject terms deposited in the Dryad system. Participants were asked to deposit two data sets in the MRC Test instance of the Dryad repository. During deposition, participants had to apply both descriptive metadata and subject terms to describe each data set. The descriptive metadata applied included fields such as title, author, and description. The subject terms applied included spatial, temporal, topical, and scientific terms. A discussion comparing the two participant groups is included in Section 8: Discussion.
