Data Management
on this subject alone, and every version of any software seems to include a new set of problems). Don’t assume that quality control was exercised over the data entry process or that anyone else has examined the data for out-of-range or otherwise impossible values. Don’t assume that the person who gave you the project is aware that a key variable is missing for 50% of the cases... you get the idea. Data collection and data entry are activities performed by human beings, who don’t always know their jobs perfectly, and make mistakes now and then. A large part of the data management process is discovering where those mistakes were made and either correcting them or thinking of ways to work around them so the data may be analyzed as intended.
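As a minimal sketch of that kind of first inspection, the Python function below counts missing values and flags out-of-range values for a single variable. Everything in it is an assumption made for illustration: the function name `screen_variable`, the record layout (a list of dictionaries), the variable `age`, and the valid range of 0 to 120 are all hypothetical, and a real project would take its valid ranges from the codebook.

```python
from typing import Any, Dict, List, Optional, Tuple


def screen_variable(
    records: List[Dict[str, Any]],
    name: str,
    valid_range: Optional[Tuple[float, float]] = None,
) -> Dict[str, Any]:
    """Summarize missing and out-of-range values for one variable."""
    n = len(records)
    missing = 0
    out_of_range = []  # (record index, offending value) pairs
    for i, rec in enumerate(records):
        value = rec.get(name)
        # Treat an absent key, None, and the empty string as missing.
        if value is None or value == "":
            missing += 1
            continue
        if valid_range is not None:
            low, high = valid_range
            if not (low <= value <= high):
                out_of_range.append((i, value))
    return {
        "variable": name,
        "n": n,
        "missing": missing,
        "missing_fraction": missing / n if n else 0.0,
        "out_of_range": out_of_range,
    }


# Hypothetical records: one missing age, one absent key, one impossible value.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 212},
    {"id": 4},
]
report = screen_variable(records, "age", valid_range=(0, 120))
print(report)
```

A report like this does not decide anything by itself; it only tells you which cases need a decision from whoever is authorized to make one.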
The Chain of Command
Without carrying the military metaphor too far, efficient data management for a large project requires establishing a structure or hierarchy of people who are responsible for different aspects of the process. Equally important, everyone involved in the project should know who is authorized to make what decisions, so that when a problem arises it can be resolved quickly and reasonably. This is common sense, but not always exercised in practice. If the data entry clerk notices that data is coming in with lots of variables missing, for instance, he should know exactly who to report the problem to so it can be corrected while the project is still in the data collection phase. If an analyst finds out-of-range values during initial inspection of the data file, she should know who can make the decision about what to do with those values, so they can be corrected or recoded before the main analysis takes place. Make it difficult for such issues to be resolved, and the staff is likely to impose their own ad hoc solutions or give up trying to deal with them, leaving you with a data set of uncertain quality.
Codebooks
The codebook is a classic tool of social science research, but the principle of the codebook applies to any project that involves collecting and analyzing data. Sometimes the codebook is an actual book, generally either a spiral notebook or a three-ring binder, which is used to collect and organize important information about a project. I have also worked on many projects where there was no actual code “book” in the sense of a physical object of paper and ink; instead, all information was stored electronically, in the data and syntax files themselves and ancillary electronic documents. Some projects use a hybrid system, in which most of the codebook information is stored electronically, but also printed and kept in a binder. The bottom line is that it doesn’t matter what method you choose, as long as the vital information about the project and the data set is reliably recorded in some location for future reference.
On the whole, I would say that companies whose data consists of the records of their day-to-day business operations do a better job of documentation than academics and people working on small projects. That is probably a combination of two factors. When data reflects the main business of a company, the information technology department has a real incentive to get it right, and when the data collection and storage processes are ongoing and standardized, it is easier to establish a set of procedures and follow them. In addition, companies generally assign people to carry out the procedures of data management, and ensure that they are appropriately trained. The polar opposite is often found in academia, where numerous small projects, each with their own quirks, may be conducted simultaneously. In such circumstances, data management may be relegated to undergraduates with minimal experience or training, or to Ph.D.s or M.D.s who are subject matter experts but unfamiliar with (and possibly uninterested in) the day-to-day issues of data management.
The main reason you need a codebook or its equivalent is to create a repository of information about the project and its data, so that people who join the project later or analyze the data long after the collection process has ceased know what it is and how to interpret it. It’s also helpful for people who have been involved from the start, because no one’s memory is perfect and it’s easy to forget what decisions were made six months or two years ago. Having codebook information easily accessible is also a great timesaver when it’s time to write up your results or when you need to explain the project to a new analyst.
At a minimum, the codebook needs to include information in the following categories:
• The project itself and data collection procedures used
• Data entry procedures
• Decisions made about the data
• Coding procedures
Details about the project that should be recorded include the original purpose, timeline, funding, original personnel and any changes, and who is in charge of what. Data collection procedures should include when the data was collected, what procedures were used, and who actually did the data collection. If a form like a questionnaire was used, a copy should be included in the codebook, as should any instructions given to the data collectors.
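For a project that keeps its codebook electronically, the four categories can be held in one structured record that is saved alongside the data. The sketch below is only one possible layout, not a standard: the class names `Codebook` and `VariableEntry`, every field name, and all of the example values are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class VariableEntry:
    """Coding information for one variable (hypothetical layout)."""
    name: str
    label: str
    codes: Dict[str, str] = field(default_factory=dict)       # stored code -> meaning
    missing_codes: List[str] = field(default_factory=list)    # codes that mean "missing"


@dataclass
class Codebook:
    """One record covering the four minimum categories."""
    purpose: str
    timeline: str
    funding: str
    personnel: Dict[str, str]          # person or role -> responsibility
    collection_procedures: str
    entry_procedures: str
    decisions: List[str] = field(default_factory=list)
    variables: List[VariableEntry] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses to plain dictionaries.
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; a real codebook would hold the project's own details.
codebook = Codebook(
    purpose="(original purpose of the project)",
    timeline="(start and end dates of data collection)",
    funding="(funding source)",
    personnel={"(name)": "data entry", "(name 2)": "coding decisions"},
    collection_procedures="(who collected the data, when, and how)",
    entry_procedures="(how paper forms were entered and checked)",
    decisions=["(date): (decision made about out-of-range values)"],
    variables=[
        VariableEntry(
            name="sex",
            label="Sex of respondent",
            codes={"1": "female", "2": "male"},
            missing_codes=["9"],
        )
    ],
)
print(codebook.to_json())
```

Writing the record out as JSON keeps it readable by people and by whatever software a later analyst happens to use, which matters more than the particular field names chosen here.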
Information about data entry procedures is particularly important when data is collected in one medium, for instance, on paper questionnaires, and analyzed in another, usually as an electronic file. However, even if a CATI (computer assisted telephone interviewing) system or other method of electronic data collection was used, the codebook should explain how the individual files were collected and transferred. Usually electronic file transfer works smoothly, but not always. Every time a file is transferred there is an opportunity for a data file to become corrupted, in which case it may be necessary to trace back the process in order to correct the error. Information about the training of data entry personnel and any quality control methods (such as double entry of a percentage of the data) should also be recorded.
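Both checks mentioned above can be automated in a few lines. The sketch below shows one way to do it in Python, under stated assumptions: a SHA-256 checksum (from the standard `hashlib` module) is used to confirm that a transferred file is byte-for-byte identical to the original, and a simple field-by-field comparison stands in for a double-entry check. The helper names `file_sha256` and `compare_double_entry`, the file names, and the sample records are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile
from typing import Any, Dict, List, Tuple


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_double_entry(
    first: Dict[str, Dict[str, Any]],
    second: Dict[str, Dict[str, Any]],
) -> List[Tuple[str, str, Any, Any]]:
    """List (record ID, field, first value, second value) for every mismatch."""
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field_name in sorted(set(a) | set(b)):
            if a.get(field_name) != b.get(field_name):
                discrepancies.append(
                    (record_id, field_name, a.get(field_name), b.get(field_name))
                )
    return discrepancies


# Transfer check: copy a small hypothetical file and compare checksums.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "survey_original.csv")
    copy = os.path.join(workdir, "survey_transferred.csv")
    with open(source, "w", newline="") as handle:
        handle.write("id,age\n001,34\n002,51\n")
    shutil.copyfile(source, copy)
    transfer_ok = file_sha256(source) == file_sha256(copy)

# Double-entry check: the second clerk transposed the digits of one age.
entry_one = {"001": {"age": 34, "sex": "F"}, "002": {"age": 51, "sex": "M"}}
entry_two = {"001": {"age": 34, "sex": "F"}, "002": {"age": 15, "sex": "M"}}
discrepancies = compare_double_entry(entry_one, entry_two)

print(transfer_ok)
print(discrepancies)
```

Recording the checksum of each file in the codebook at the time of transfer makes it possible to tell later, without guessing, which copy was changed and at which step.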
Seldom is data ready to be analyzed exactly as it has been collected: someone needs to examine it and make decisions about such things as out-of-range values and missing data before the file is ready for analysis. All these decisions need to be recorded, as well as the location of each version of the file. An archived version of the original data file should be stored somewhere it can’t be changed, in case you want to reverse a coding decision later, or the edited file becomes corrupt and has