Data Organization &
Creating Reproducible Analyses
Nick Jackson
University of Southern California Department of Psychology
Overview
Raw Data Organization
◦ Vocabulary
◦ Entering/Storing Data in Excel
◦ Data Checks/Reducing Errors
◦ Data Backup/Security
Creating Reproducible Analyses
◦ Directory Structures
◦ Workflow for Projects (cleaning->archiving)
◦ Workflow for Analyses (cleaning->analysis->present/archive)
Vocabulary
Database
◦ “An organized collection of data”-Wikipedia
◦ Most of us don’t actually use “real” databases
SQL, Oracle, dBASE, MS Access
More Specifically: where the links between different
data files are automatic (ie. no merging)
◦ We mostly store data in Excel or SPSS and
Vocabulary
Wide File: One row of data per subject
◦ Pseudonym: Flat File
Long File: more than one row of data per
subject
Wide File: Primary Index Variables Cases An Observation Observations Long File: Cases Variable Primary Index Secondary Index
Developing an Excel Database
System – Basic Rules
No “Mississippi” tables
◦ Excel tables that contain ALL of your data at once can cause problems
Enter Wide Data in wide format, Long
Data in long format.
One Excel File per survey/measure (not
Developing an Excel Database
System-Variable Name Rules
The primary (and secondary) index should have
the EXACT same name in every file.
Only the primary (and secondary) index should
have the same names within and between files.
Be consistent with variable names, keep them
simple and free of unusual characters or spaces
◦ Use ”_” character to join two words together instead_of_a_blank_space.
Never leave a blank column between variable
ESS:
PSQI:
Primary Indices Don’t Match:
id≠Id
Inconsistent Capitalization: These should start lower case, OR all other variables should be capitalized
Improper Spaces: Variable names shouldn’t have spaces in it.
Inconsistent Naming: Underscore “_” should be removed OR other variables should have underscores.
Blank Column:
There should be no empty columns between variables
Types of Data-Considerations
Default formatting in excel is “General”
which is fine for everything but dates.
Collect discrete variables as continuous
when possible.
DOB vs Age vs age groups.
NEVER enter data as already reverse
coded.
Only use strings when you cannot assign
Types of Data-Considerations
Be consistent with entering string data
◦ Medications:
“Aspirin” vs “ASPIRIN” vs “Apsirin”
Use Alphanumeric codes when possible
(ATC\DDD)
Time and Date data
◦ Use the correct formatting
12:01 PM 09/20/2013
Handling Missing Data
Have indicators of missing data
◦ Use numbers as the indicator
NOT a possible real value in any excel sheet Typically -9 or -99 or -999
Handling Unknown/Don’t
Know/NA Responses
Same rules as missing data indicator
The Data Dictionary
Key Elements
◦ Variable Name –the name exactly as it appears in the data
◦ Variable Label- a lengthier description of the variable
◦ Variable Values-tells you what discrete values mean (ie. 2=Male, -99=Missing)
Reducing Errors in Data Entry
Data Validation tools in Excel
Step 1: Highlight Variables/Columns. Select Data Tab, Click on Data Validation
A) Highlight Variables
B) Select Data Tab C) Click on Data Validation
Using Excel Data Validation Tools to Restrict Variable Values
Step 2 (For Discrete Data): Select to Allow: List.
Input the potential values for the variable (including the missing data indicator) in the Source: box D) Select to Allow: List E) Input Potential Values
Step 2 (For Interval/Ratio Data): Select to Allow: (Whole Number OR Decimal). Determine the criteria for the Data Box (Between or Not Between) Input Minimum and Maximum values.
D) Select to Allow: Whole Number OR Decimal F) Select Appropriate Minimum and Maximum E) Choose Between OR Not Between from Data: box
Reducing Errors in Data Entry
Use Data validation tools in Excel
Enter with a number pad
◦ 1-0.01% Error Rate
Use split windows to keep case numbers
Click on this, and drag the window down
Reducing Errors in Data Entry
Use Data validation tools in Excel
Enter with a number pad
Use split windows to keep case numbers
and variable names visible
Use Column Borders to denote separate
physical pages of your survey within your excel file.
Applying a border in Excel
Checking for Errors
Use the excel data filters
Re-enter data because
◦ No hand-entry method protects against row
shifts
◦ Acceptable, but incorrect values can be present
Backing up Your Data
Should exist in at least 2 physically
separate locations.
Regular backups
◦ Easy to reconcile versions
Servers, Clouds, RAID Drives?
◦ Dropbox, Skydrive, Google Drive: Not HIPPA
or FERPA compliant
Back-up of your data every time you
Data Security
Protecting the Identity of your subjects
◦ Assign ID’s that have NO significance to any identifying characteristic of the subject
Identifying characteristics: names (initials), dates (of
birth, death), years, addresses, phone number, email, zip codes, geo-codes, SSN, MRN, VIN, DL #, License Plate, IP address, health plan/insurance numbers
◦ Never store Identifying characteristics in files with your other data.
Data Security
Keeping people out of your data
◦ Physical Threats
Use anti-theft devices (cables)
BIOS Password on the hard drive + Logon
Password
Remote wipes
◦ Cyber Threats
2-step verification for any cloud/email services Traffic encryption, VPNs
Servers should have restricted access
Data Security
1) Click the Office button.
2) Scroll over the Prepare button.
3) Click on Encrypt Document.
Creating Reproducible Analyses
Without replication we cease to have science
◦ can you replicate your own findings?
◦ Is everyone reinventing the wheel with your rawdata?
Journals are beginning to require submission of rawdata and syntax
“The NIH recently declared that the sharing of data is essential for translating research into knowledge and products that improve health. As such, all investigators seeking more than $500,000 in grant support per
year are now required to include a plan for data sharing”
Creating Reproducible Analyses
The difficultly of replication
◦ Can you find it?
Issues of version control (multiple versions) Everything is labeled “final”
◦ Undocumented steps
Don’t end up on here:
Creating Reproducible Analyses
Data Workflow (Long, 2009)
◦ Clean -> Analyze-> Present-> Archive Replication is our guiding principle
◦ Simple
Complex systems don’t get used (usability) High efficiency
◦ Standardized
Create a template Be consistent
◦ Scalable
Setting up a directory structure
Standardized directory structures
(templates)
The naming of data files in your
directories
2-4 letter acronym_nameYYYYMMDD
Ensures proper sorting and version
Data Workflow
Project Level (Data Manager)
◦ Clean - > Archive
Analysis Level (Data Analyst/Researcher)
Data Workflow: Project Level (Cleaning) Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_wasi20090616.sav Programs
isac_master_wideDATE.sav
isac_master_longDATE.sav
setup.sps Merge: cesd, bdi, wasi
Data Workflow: Project Level (Cleaning)
Programs:
◦ Setup
Data Workflow: Project Level (Archive)
Rawdata isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_wasi20090616.sav Archive isac_cesd20130920.xlsxData Workflow
Analysis Level
Data Workflow: Analysis Level
A reminder of the directory structures for the Analysis
Data Workflow: Analysis Level (Cleaning) Method 1 Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_master_wide20130920.sav isac_master_long20130920.sav cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntax
Data Workflow: Analysis Level (Cleaning) Method 2 Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_master_wide20130920.sav isac_master_long20130920.sav cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntax
isac_analysisDATE.sav
I use this folder only for rawdata files that are unique to the specific analysis
Data Workflow: Analysis Level
(Analyze)
cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntaxisac_analysisDATE.sav
Data Workflow: Analysis Level
Syntax:
◦ Cleaning.sps
Cleans and aggregates the files
◦ Analysis.sps
Conducts the analysis
Makes NO changes to the data DOES NOT write out any files
Data Workflow: Analysis Level (Present/Archive)
You have analyzed your data and put
together a paper, abstract, poster, etc.
ARCHIVE your syntax on the date of
submission
◦ Put dates at the end of the syntax file
◦ DO NOT TOUCH THE SYNTAX FILE
AGAIN (I usually comment out the commands)
Data Workflow: Analysis Level
(Present/Archive)
cleaning20131001.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Syntax isac_analysis20130920.sav analysis20131001.sps _AbstractAdd dates to Syntax
apsabstract_20131001.docx
abstract20131001.sps Create abstract Syntax file cleaning.sps
analysis.sps Probably still analyzing stuff for
creating abstract into a poster, so I keeping working versions of my needed syntax files
I generally do not archive my analysis
folders, but rather focus on archiving the syntax when major events happen in an analysis (submissions, revisions,
acceptance).
Data Workflow: Analysis Level
(Present/Archive)
The important things
Never overwrite a rawdata file.
◦ Every change to your rawdata should be a new file with a new date.
Files are only opened or saved with syntax All changes to files are done with syntax Cleaning/Setup Syntax:
◦ We read files in and write separate files out Analysis Syntax:
The important things
When important things happen, we
archive the SYNTAX by commenting out all of the commands and placing dates on the end.