Creating a Database
Classic Data Management Flow
for Clinical Research
Scientific Hypotheses
Identify Specific Data Elements Required to Test Hypotheses
Data Acquisition Instruments (optional) (paper forms)
Computer Program for Data Collection and Organization (e.g. Access)
Output to Analytical Software (e.g. SPSS)
Define data set to acquire, create codes (e.g. F=0, M=1) Paper Data forms
Enter data into computer
Transfer to statistical and presentation software Do statistical analysis Prepare graphs
and tables
Data Management Process
Data quality control (check for missing, out of range and illogical responses)
Differences between a clinical and
research database
•
Clinical database (e.g. PACS®)
– Report oriented so data is displayed for clinical decision making (e.g. abdominal X-ray report)
– Emphasis on displaying or reporting of individual data rather than accumulating multiple records
– Not a lot of variables recorded
•
Research database
– Table oriented so that data is accumulated for eventual export to a statistical package for data analysis and
reporting
– Less emphasis on individual records
– Create Forms to represent tables that are storing information
Databases versus Excel
•
Excel has some capabilities to sort data but its primary
function is to create financial spreadsheets
–
Can be used for small and limited research data sets
and simple lists
•
Databases (e.g. ACCESS) are designed to collect, sort,
and manipulate data
–
Can create Data Quality Control features that ensure
valid data is entered and missing data is eliminated
–
A relational database allows for linking of an
unlimited number of tables and therefore an
unlimited number of variables
Id Name Age 10 Smith 50 11 Jones 55 12 Doe 60 ID Weight (lb) Weight (kg) 10 230 104.5 11 212 96.4 12 199 90.4 ID KCAL KCAL/kg 10 2400 23.1 11 2652 27.5 12 2350 25.9
Relational Database Example
ID V02 V02/kg 10 2.8 26.7
11 3.2 33.1
12 2.1 23.2
Subject Info Anthropometrics
Relationships Between Tables
Databases versus Excel
•
Databases are more suitable for importing
data from multiple sources
• Imports of different data types (e.g. SAS files and Dbase files) into different tables can be linked via common
identifiers such as subject ID
• Merging multiple data sources into Excel can be a challenge
How is a database organized?
• Multiple linked tables
• Tables store records (rows in the database)
• Tables have a collection of fields (these are the columns)
– Patient identifiers
Table = Records and Fields
Re
co
rd
s
Fields
Advantages of a Database
•
Collection of data in a centralized location
•
Data stored so as to appear to users in one
location
Sharing and Exchanging Data
•
Multiple users can access the same database
via a network
–
Can be local or over the internet
–
Access via a client application or web interface
–
Server allows remote access over the internet
from anywhere
Database Design Considerations
•
What to collect
–
What questions are to be answered?
–
Think of the data tables in your future publications
• Focus on the key data elements rather than collect as much as possible
•
What statistical package will be used?
–
May need to format the data to accommodate
program it will be exported to
– For example, gender can be recorded in the database as M or F but statistical package may require 0 and 1
What needs to be in the research database?
•
Research variables directly related to the
hypotheses being tested-
YES
•
Clinical measures used for screening-
MAYBE
– Blood work, ECG, medical history
•
Administrative data-
NO
– Contact information
Examples of Data Collection Forms
• Paper – then into Excel; tedious, mistakes common
• Scantron - Must scan in; many opportunities for mistakes
• Directly into Excel -tedious
• Direct computer entry via forms
• Allow real-time validity/completeness checks, prompts
Designing the Questions
• Try to collect continuous data – convert to categorical during analysis period. (e.g. age)
• Use validated scales/instruments
• Don’t build your own unless unavoidable
• Consider asking in more than one way a question concerning a critical variable under study. (allows for validity assessment)
• Avoid measurements that cluster at one position or one end of a scale
• E.g. measuring body temperature on healthy outpatients
What is Data Coding?
•
A group of letters, numbers, or symbols and
the rules that form a link to a specific
terminology (e.g. male =1, female =2)
•
Coded references should be incorporated into
a data dictionary
•
Dictionaries should be based on standardized
terminology
–
e.g. if doing a study on cirrhotics categorize
Why code?
• Forces data into analyzable structure, format
• Speeds data
input/transcription
• Vastly simplifies data analysis/reporting
Rules for Data Entry
• Decide which variables require a number or string code
– e.g. code subject’s name or MRN as a string variable
• Continuous values are entered directly
• Missing values must be different values from a real response
– Common formats are “99” or bullets “·” Don’t use “99” if the variable is a continuous data field!
– Don’t know is a response—do not leave blank. Have a code for “don’t know”
– Don’t use “0” for missing, just leave blank
Avoid open-ended questions in
subject/patient surveys
What is your gender?___________________
-correct responses could be man, woman, male, female
What is your level of education?__________
-the answers 9
thgrade, did not finish high school, and
no college education all are correct.
-provide specific choices for reply.
Be Careful With Scales
Data in Spreadsheet
Finding Errors
Subject ID Gender Age 1001 Male 52 1002 Male 54 103 Mael 65 1004 Female 54 5 Female 52 1006 Female 52 1007 Femele 75 1008 Male 48 1009 M 37 1010 Female 73 11 F 54
Exploratory Data Analysis
•
Explore why there is missing data
–
Is this a software or coding glitch?
–
Did you delete by accident? (always back-up)
•
Examine for unusual consistency /inconsistency
issues
–
Is every patient measured one month 6’10”
–
Good time to use histograms and calculate skewness
•
Find inadmissible ranges and codes
Attributes of Successful Data
Management
•
Attention to detail
•
Explicit structure and process
•
Robust designs
– Anticipate failures, lapses and mistakes