• No results found

Data Analysis in Excel and SPSS

N/A
N/A
Protected

Academic year: 2021

Share "Data Analysis in Excel and SPSS"

Copied!
52
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Organization &

Creating Reproducible Analyses

Nick Jackson

University of Southern California Department of Psychology

(2)

Overview

 Raw Data Organization

◦ Vocabulary

◦ Entering/Storing Data in Excel

◦ Data Checks/Reducing Errors

◦ Data Backup/Security

 Creating Reproducible Analyses

◦ Directory Structures

◦ Workflow for Projects (cleaning->archiving)

◦ Workflow for Analyses (cleaning->analysis->present/archive)

(3)

Vocabulary

 Database

◦ “An organized collection of data”-Wikipedia

◦ Most of us don’t actually use “real” databases

 SQL, Oracle, dBASE, MS Access

 More Specifically: where the links between different

data files are automatic (ie. no merging)

◦ We mostly store data in Excel or SPSS and

(4)

Vocabulary

 Wide File: One row of data per subject

◦ Pseudonym: Flat File

 Long File: more than one row of data per

subject

(5)

Wide File: Primary Index Variables Cases An Observation Observations Long File: Cases Variable Primary Index Secondary Index

(6)

Developing an Excel Database

System – Basic Rules

 No “Mississippi” tables

◦ Excel tables that contain ALL of your data at once can cause problems

 Enter Wide Data in wide format, Long

Data in long format.

 One Excel File per survey/measure (not

(7)

Developing an Excel Database

System-Variable Name Rules

 The primary (and secondary) index should have

the EXACT same name in every file.

 Only the primary (and secondary) index should

have the same names within and between files.

 Be consistent with variable names, keep them

simple and free of unusual characters or spaces

◦ Use ”_” character to join two words together instead_of_a_blank_space.

 Never leave a blank column between variable

(8)

ESS:

PSQI:

Primary Indices Don’t Match:

id≠Id

Inconsistent Capitalization: These should start lower case, OR all other variables should be capitalized

Improper Spaces: Variable names shouldn’t have spaces in it.

Inconsistent Naming: Underscore “_” should be removed OR other variables should have underscores.

Blank Column:

There should be no empty columns between variables

(9)
(10)

Types of Data-Considerations

 Default formatting in excel is “General”

which is fine for everything but dates.

 Collect discrete variables as continuous

when possible.

 DOB vs Age vs age groups.

 NEVER enter data as already reverse

coded.

 Only use strings when you cannot assign

(11)

Types of Data-Considerations

 Be consistent with entering string data

◦ Medications:

 “Aspirin” vs “ASPIRIN” vs “Apsirin”

 Use Alphanumeric codes when possible

(ATC\DDD)

 Time and Date data

◦ Use the correct formatting

 12:01 PM  09/20/2013

(12)

Handling Missing Data

 Have indicators of missing data

◦ Use numbers as the indicator

 NOT a possible real value in any excel sheet  Typically -9 or -99 or -999

(13)

Handling Unknown/Don’t

Know/NA Responses

 Same rules as missing data indicator

(14)

The Data Dictionary

 Key Elements

Variable Name –the name exactly as it appears in the data

Variable Label- a lengthier description of the variable

Variable Values-tells you what discrete values mean (ie. 2=Male, -99=Missing)

(15)
(16)

Reducing Errors in Data Entry

 Data Validation tools in Excel

(17)

Step 1: Highlight Variables/Columns. Select Data Tab, Click on Data Validation

A) Highlight Variables

B) Select Data Tab C) Click on Data Validation

Using Excel Data Validation Tools to Restrict Variable Values

Step 2 (For Discrete Data): Select to Allow: List.

Input the potential values for the variable (including the missing data indicator) in the Source: box D) Select to Allow: List E) Input Potential Values

(18)

Step 2 (For Interval/Ratio Data): Select to Allow: (Whole Number OR Decimal). Determine the criteria for the Data Box (Between or Not Between) Input Minimum and Maximum values.

D) Select to Allow: Whole Number OR Decimal F) Select Appropriate Minimum and Maximum E) Choose Between OR Not Between from Data: box

(19)

Reducing Errors in Data Entry

 Use Data validation tools in Excel

 Enter with a number pad

◦ 1-0.01% Error Rate

 Use split windows to keep case numbers

(20)

Click on this, and drag the window down

(21)
(22)

Reducing Errors in Data Entry

 Use Data validation tools in Excel

 Enter with a number pad

 Use split windows to keep case numbers

and variable names visible

 Use Column Borders to denote separate

physical pages of your survey within your excel file.

(23)

Applying a border in Excel

(24)

Checking for Errors

 Use the excel data filters

 Re-enter data because

◦ No hand-entry method protects against row

shifts

◦ Acceptable, but incorrect values can be present

(25)
(26)

Backing up Your Data

 Should exist in at least 2 physically

separate locations.

 Regular backups

◦ Easy to reconcile versions

 Servers, Clouds, RAID Drives?

◦ Dropbox, Skydrive, Google Drive: Not HIPPA

or FERPA compliant

 Back-up of your data every time you

(27)

Data Security

 Protecting the Identity of your subjects

◦ Assign ID’s that have NO significance to any identifying characteristic of the subject

 Identifying characteristics: names (initials), dates (of

birth, death), years, addresses, phone number, email, zip codes, geo-codes, SSN, MRN, VIN, DL #, License Plate, IP address, health plan/insurance numbers

◦ Never store Identifying characteristics in files with your other data.

(28)

Data Security

 Keeping people out of your data

◦ Physical Threats

 Use anti-theft devices (cables)

 BIOS Password on the hard drive + Logon

Password

 Remote wipes

◦ Cyber Threats

 2-step verification for any cloud/email services  Traffic encryption, VPNs

 Servers should have restricted access

(29)

Data Security

1) Click the Office button.

2) Scroll over the Prepare button.

3) Click on Encrypt Document.

(30)

Creating Reproducible Analyses

 Without replication we cease to have science

◦ can you replicate your own findings?

◦ Is everyone reinventing the wheel with your rawdata?

 Journals are beginning to require submission of rawdata and syntax

 “The NIH recently declared that the sharing of data is essential for translating research into knowledge and products that improve health. As such, all investigators seeking more than $500,000 in grant support per

year are now required to include a plan for data sharing”

(31)

Creating Reproducible Analyses

 The difficultly of replication

◦ Can you find it?

 Issues of version control (multiple versions)  Everything is labeled “final”

◦ Undocumented steps

 Don’t end up on here:

(32)

Creating Reproducible Analyses

 Data Workflow (Long, 2009)

◦ Clean -> Analyze-> Present-> Archive  Replication is our guiding principle

◦ Simple

 Complex systems don’t get used (usability)  High efficiency

◦ Standardized

 Create a template  Be consistent

◦ Scalable

(33)

Setting up a directory structure

 Standardized directory structures

(templates)

(34)
(35)

The naming of data files in your

directories

2-4 letter acronym_nameYYYYMMDD

 Ensures proper sorting and version

(36)

Data Workflow

 Project Level (Data Manager)

◦ Clean - > Archive

 Analysis Level (Data Analyst/Researcher)

(37)

Data Workflow: Project Level (Cleaning) Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_wasi20090616.sav Programs

isac_master_wideDATE.sav

isac_master_longDATE.sav

setup.sps Merge: cesd, bdi, wasi

(38)

Data Workflow: Project Level (Cleaning)

 Programs:

◦ Setup

(39)
(40)

Data Workflow: Project Level (Archive)

Rawdata isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_wasi20090616.sav Archive isac_cesd20130920.xlsx

(41)

Data Workflow

 Analysis Level

(42)

Data Workflow: Analysis Level

A reminder of the directory structures for the Analysis

(43)

Data Workflow: Analysis Level (Cleaning) Method 1 Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_master_wide20130920.sav isac_master_long20130920.sav cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntax

(44)

Data Workflow: Analysis Level (Cleaning) Method 2 Rawdata Clean Data isac_cesd20130915.xlsx isac_bdi20071124.xlsx isac_master_wide20130920.sav isac_master_long20130920.sav cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntax

isac_analysisDATE.sav

I use this folder only for rawdata files that are unique to the specific analysis

(45)

Data Workflow: Analysis Level

(Analyze)

cleaning.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Analysis Rawdata Syntax

isac_analysisDATE.sav

(46)

Data Workflow: Analysis Level

 Syntax:

◦ Cleaning.sps

 Cleans and aggregates the files

◦ Analysis.sps

 Conducts the analysis

 Makes NO changes to the data  DOES NOT write out any files

(47)
(48)

Data Workflow: Analysis Level (Present/Archive)

 You have analyzed your data and put

together a paper, abstract, poster, etc.

 ARCHIVE your syntax on the date of

submission

◦ Put dates at the end of the syntax file

◦ DO NOT TOUCH THE SYNTAX FILE

AGAIN (I usually comment out the commands)

(49)

Data Workflow: Analysis Level

(Present/Archive)

cleaning20131001.sps Analysis 201309 Example For Presentation (Jackson) Analysis Data Syntax isac_analysis20130920.sav analysis20131001.sps _Abstract

Add dates to Syntax

apsabstract_20131001.docx

abstract20131001.sps Create abstract Syntax file cleaning.sps

analysis.sps Probably still analyzing stuff for

creating abstract into a poster, so I keeping working versions of my needed syntax files

(50)

 I generally do not archive my analysis

folders, but rather focus on archiving the syntax when major events happen in an analysis (submissions, revisions,

acceptance).

Data Workflow: Analysis Level

(Present/Archive)

(51)

The important things

 Never overwrite a rawdata file.

◦ Every change to your rawdata should be a new file with a new date.

 Files are only opened or saved with syntax  All changes to files are done with syntax  Cleaning/Setup Syntax:

◦ We read files in and write separate files out  Analysis Syntax:

(52)

The important things

 When important things happen, we

archive the SYNTAX by commenting out all of the commands and placing dates on the end.

References

Related documents