Database preparation and
Abstract/Manuscript 101
Jennifer F. Tseng, MD, MPH
Chief, Division of Surgical Oncology, BIDMC BWH Residents’ Conference, Boston, MA
EXCELlent Colleagues
Bruce A. Barton, PhD Aimee Knoll, MS
Heena Santry, MD, FACS
Plan
• 10,000 foot view of starting your own database
• Pitfalls
• 10 step program for Excel (after Barton)
Conveyor Belt of Research
The Zen of Prospective Databases,
after Bruce Barton
• A prospective database is like a slow-moving conveyor belt
• At the beginning, you have the design of the product (hypothesis & study design)
• Once approved by Management (a.k.a., IRB, funding), then the conveyor belt starts up
• Every time you collect a piece of data, you place it on the conveyor belt
More Zen
• As the data move down the conveyor belt, it is slowly cleaned and polished
• The data are not ready for the “final step” (analysis) until the very end of the belt
Non-Zen
from Murray Brennan
• “It will take you 3 to 5 years before a
prospective database is of any use to you.”
– (At a Meet the Professors breakfast, circa 2005)
Basics of databases
• Why?
• How?
Basics of databases: Why?
• Hypotheses determine data points
– Write dummy methods, results, table shells • Research vs. Quality
• Individual vs. part of a joint or departmental effort
• Actively seek examples of similar databases to determine which fields are important
Basics of databases: How?
• Which program?
– Excel
– Access
– SPSS
– Filemaker Pro, Oracle, Sequel etc.
• Prospective vs. retrospective vs. hybrid
Basics of databases: How?
• Nitty-gritty
– Who will support ($, time)
– How will patients be captured?
• Point of entry
– Default maintenance
• Timing of updates
– The simpler the strategy, the more likely to succeed
• Test system for flaws prior to committing major time and resources
• Database will never be perfect
Basics of databases:
personnel
• Single most important individual: you, the PI
• It takes a village:
– Dedicated data entry staff
– Biostatistician
– Outcomes / Health services researchers
– Computer scientists
– Research residents, fellows
10 Step Program for Excel
• Excel: easy, available, readily export to SAS, STATA, SPSS
• Plenty of opportunities to get in trouble
1 3
Step 1: Column = variable
• Single row for column headings
– Column headings = variable names
– “Please enter your gender (check all that apply)” is not a good variable name
• Keep it short and sweet
– Sex
– Age
1 4
Columns: concise, unique variable names
ID Age_Time1 Age_Time2 1 64 64 2 63 65 3 57 58 4 71 74 ID Age in Months at First Timepoint Age in Months at Second Timepoint 1 64 64 2 63 65 3 57 58 4 71 74Datasets are not tables:
Transposed data confuses statistical programs!
ID 123 456 789 234 567 890 Age 54 56 50 51 59 53 Gender M M M F F M Weight 174 172 197 141 135 160 Height 69 72 70 64 62 70 Score1 5 6 3 9 8 5 Score2 7 8 2 7 6 7 C1 Frequency % Age 1 14.29 Gender 1 14.29 Height 1 14.29 ID 1 14.29 Score1 1 14.29 Score2 1 14.29 Weight 1 14.29 Excel Dataset SAS Dataset
ID Age Gender Weight Height Score1 Score2 123 54 M 174 69 5 7 456 56 M 172 72 6 8 789 50 M 197 70 3 2 234 51 F 141 64 9 7 567 59 F 135 62 8 6 890 53 M 160 70 5 7 ID Frequency % 123 1 16.67 234 1 16.67 456 1 16.67 567 1 16.67 789 1 16.67 890 1 16.67 Gender Frequency % F 2 33.33 M 4 66.67 Excel Dataset SAS Dataset
Step 2: One item per cell
/ column
• Put only one item of information in a cell/column
• Examples
– For blood pressure, put 120 not 120 mmHg or 120/80 or 120? or 120 8AM
– Make another variable for additional information (diastolic; BP_time)
1 8
More variables make for simpler spreadsheet
ID Age and Gender
1 64F 2 63M 3 57F 4 71F 5 75M 6 65M ID Age Gender 1 64 F 2 63 M 3 57 F 4 71 F 5 75 M 6 65 M
Avoid putting too much information in one field ID Symptoms
1
Cough, fever (100.3), sore throat, fatigue
2 None
3
Cough, Sore throat, Sore muscles, Vomiting 4 Sore throat, fever (101.8)
5 None
6
Sore throat, weakness, fatigue
ID Symptom1 Symptom2 Symptom3
1 Cough Fever Sore throat
2 None . .
3 Cough Sore throat Sore muscles 4 Sore throat Fever .
5 None . .
Another example…
ID Pathology
1 SB Resection: Meckel's diverticulum with ectopic gastric mucosa. Focal fibrosis, regenerated
mucosa at suture site- possibly a healed ulcer. 2 SB Resection: Angiodysplasia; Lymph Nodes with
reactive hyperplasia; jejunal diverticulitis 3 SB Resection: Ulceration, acute and chronic
inflammation; Thickened submucosa
Researcher’s Conclusions: Normal: 0%; Superficial Ulcer/inflammation (not beyond submucosa): 43%; Deep Ulcer/inflammation (not beyond
muscularis propria): 14%; Transmural ulcer/inflammation: 14%; Meckel's diverticulum 14%; Angiodysplasia: 14%
Step 2A: Think numeric
• To a computer, there are two types of data: numeric and character
• Any non-numeric character, including a space, makes computer read as text not #
• Use numbers whenever possible (e.g. 1/0 rather than y, Y, Yes YES; n, N, No, NO)
2 2
Keep units of measurement consistent ID Temperature Height 1 38.8 181 2 37.1 175 3 98.6 72 4 39.0 170 5 36.9 152 6 37.0 161
If units cannot be consistent, create a variable to clarify
ID Temperature Temperature_Units Height Height_Units
1 38.8 Celsius 181 Centimeters 2 37.1 Celsius 175 Centimeters 3 98.6 Fahrenheit 72 Inches 4 39.0 Celsius 170 Centimeters 5 36.9 Celsius 152 Centimeters 6 37.0 Celsius 161 Centimeters
Maintain consistency with free text: Check capitalization and spelling!
(preferred: assign a number in codebook) ID Symptoms 1 Cough 2 None 3 COUGH 4 Sore throat 5 None 6 Sore throt Symptoms Frequency Cough 1 COUGH 1 None 2 Sore throat 1 Sore throt 1
Step 3: Define “missing”
• Indicators of missing or not available
– Use a number to indicate missing – like 999 – something that is not a possible measurement (like 99 for BP)
– Do not leave blank
– Do not use zero
– Do not use NA or UNK – see Step 2
2 6
Step 4: Format cells:
numbers
• Format cells with numbers in numeric format, not a “general” or “text” format (default).
• If leading zeros necessary (like an ID number of 01-001), then needs to be text (explain in code book; involve data analyst)
2 7
Step 4A: Format dates/times
• Dates and times – fraught with pitfalls for transfer
• Format in Excel to a valid date or date:time or time format
• Dates are stored as the number of days from some arbitrary date and time as the number of second from midnight
3 0
Step 5: Don’t use text formatting
• Computers can’t read colors or fonts
• Make a new column to modify or group
– Example: instead of highlighting all “done”
entries, make a column labeled “done” and enter 1 / 0
3 1
What not to do:
Statistical programs are colorblind
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8
3/25/10 286-2 Left 5.5M 0.181 1139.0637 286-2 Right 5.5M 0.181 1122.2231 6/24/10 18.2-4 Left 3.0M 22 0.184 0.836% 1173.1429 5332%
Step 6: Data only
• Keep data area clean with just data
– Summary statistics go elsewhere—in another sheet or a clearly marked temporary copy
• SAS will try to read any statistics in the same columns as data
3 3
Step 7: Compare apples to apples
• Include only related items on a row, e.g. from one patient, or from one admission
• Multiple measurements may require
additional variables (columns); or potentially, additional rows.
3 4
Step 8: Study ID
• Always include ID information on every row for each patient
– Never include PHI in your spreadsheet
– Assign a “study ID” and use that on file
– If multiple rows of data for a patient, make sure that each row has an ID, date, and any other
information that will uniquely identifies that observation
3 5
Step 9: Codebook
• Create a codebook or data dictionary including each column in your data set
• Complete name for each column
– Example: TC = “Total Cholesterol (mg/dl)” • Include the meaning of each code
– Example: 1 = Low, 2=Normal, 3=High…
• Critical for data entry and data analysis
3 6
Create a codebook!
ID Pathology1 Pathology2 Pathology3
1 2 3 4 2 3 4 . 3 3 . . Variable Values Pathology 1= Normal 2= Superficial Ulcer 3= Deep Ulcer 4= Inflammation
Step 10: Name game
• Give spreadsheet a descriptive name and include your name and date (and a brief description) in Properties
– Example: Curren_AAS_092912.XLSX
– Not data.xlsx
3 8
Warnings and Pitfalls
• When in doubt, get IRB approval, amendment or waiver, even for “quality” projects
• Save your data on locked folders on secure servers with regular backups. Never on laptops. Never on portable media. Never on local drives.
• Access should be limited to IRB-approved coinvestigators.
• Do not email or otherwise share PHI in a non-secure fashion, even with coinvestigators.
Final thoughts
• When in doubt, consult a biostatistician
• Remember your hypotheses:
– Type 3 error: the right answer to the wrong question
– George Reed, PhD, UMass
• Never, never, never give up!
4 0
Part 2:
Abstract: Plan of attack
• Hypothesis first (& IRB and CITI training)
• Literature search (Pubmed; meetings; Google)
• Plan methods second
– Experiments; data sources
– Pitfalls
– Statistical tests
• Results
– Empty table shells
Abstracts for dummies
• Write dummy abstract
– Data-free
• Review with PI, colleagues, data analysts, biostatisticians
• Troubleshoot prior to experiments / programming
• Rewrite, re-review, re-circulate
Abstracts: good, bad, & ugly
• Good
– P.R. for resident, attending
– Meetings are fun
– Deadlines help productivity • Bad & ugly
– Abstract but no paper = wasted effort
– Scooped!
Abstracts: good, bad, & ugly
• Good
– P.R. for resident, attending
– Meetings are fun
– Deadlines help productivity • Bad & ugly
– Abstract but no paper = wasted effort
– Scooped!
Manuscript 101
Manuscript 101
• Never start with the introduction
“As we approach the millenium . . . .”
Manuscript 101
• Know your audience
– Journal; readership; state of the field
• Start with methods and results
– Methods (steps and numbers, not prose)
• Statistics
– Results
• Empty table shells first
• Limited prose – Outcomes
– Univariate
– Multivariate
Manuscript 101
• Introduction
– Brief is better
– 3 paragraphs max (can always expand later) 1. The problem
2. Current state of the field
Manuscript 101
Discussion: the SOAR Seven Paragraphs
1. What we’ve found
2. What others have found previously
3. How our results extend/better/change #2 4. Limitations of our work
5. BUT . . . Still important because 6. Summary
Support
– ASCO (PanCAN/Samuel Stroum) – American Surgical Association – Howard Hughes Medical Institute – American Cancer Society
– NIH (UMass CTSA) – BIDMC
– Philanthropy
• Pancreatic Cancer Alliance
• www.pancreaticalliance.org
• Verville, Maney family foundations • Grateful patients