Triangle Census Research Data Center
Notes from information sessions
About
TCRDC Administrator Bert Grider visited Appalachian State University on February 28, 2011 to tell researchers about the resources available at the TCRDC and the procedures for gaining access to the data. For questions about the TCRDC, contact Gale Boyd ([email protected]) or Bert Grider ([email protected]). For questions about this document or about Appalachian State University’s access to TCRDC, contact Amy Love ([email protected]).
Contact information
• Gale Boyd – Director
Department of Economics, Duke University
Phone: (919) 660-6892
E-mail: [email protected]
• Bert Grider – Administrator
Center for Economic Studies, U.S. Census Bureau
Phone: (919) 660-6893/919-597-5123
E-mail: [email protected]
• TCRDC website http://econ.duke.edu/tcrdc/
• CES website www.ces.census.gov
General information about TCRDC
• There are ten Census Research Data Centers (RDCs) in the country. The Triangle center is by far the closest to Boone.
• There are actually two TCRDC labs in Durham (with the same data), one at Duke and one at the Research Triangle Institute (RTI).
• The TCRDC labs connect securely but remotely to Census Bureau computers located in Maryland. The data is not actually stored in Durham.
• Since Appalachian State University is a member of TCRDC, researchers can use the Research Data Center without paying fees.
• Researchers must go through a 3-9 month approval process to gain access to the data.
• Research and data analysis must be conducted on site in Durham.
• The labs are open 24 hours a day, 7 days a week. National Center for Health Statistics (NCHS) data is an exception; it must be accessed during normal business hours.
• Since gaining access to the data requires a significant time investment, think about a series of papers or projects you would like to produce, rather than going through the whole application process for just one paper.
• The maximum project length commonly approved is five years.
• Graduate students can gain access through the same approval process as faculty PIs. The access process takes long enough that it is not usually a practical option for a master’s thesis, but it can be an option for a doctoral dissertation.
Applying for access to TCRDC
• Researchers will work with the TCRDC administrators to submit a proposal to the Census Bureau and gain approval for their project and security clearance to enter the TCRDC.
• There are different requirements for project approval depending on what type of data you’re using.
• In general, expect it to take 6-9 months to gain approval and access data sets at TCRDC.
o Business data usually takes the longest (closer to 9 months) because it also requires IRS approval.
o Demographic data usually takes at least six months.
o Health data collected by other federal agencies (the Agency for Healthcare Research and Quality (AHRQ) and the National Center for Health Statistics (NCHS)) should take less time to access, perhaps 2-3 months, because researchers do not have to go through the Census proposal process; they only have to go through the NCHS or AHRQ approval process and security clearance.
o The faculty PI gets approval for the project, not for herself as an individual. All individuals working on the project must go through a background check, but this is a relatively short process (4-6 weeks). If, for example, you get a new graduate assistant after you’ve gained approval for your proposal, that’s fine; he just needs a background check and does not have to go through the 6-12 month approval process independently.
• You may also bring your own data (e.g., results from your own survey or data from another agency) to link to Census data as long as the collecting agency approves.
• TCRDC administrators will help you develop a proposal for the Census Bureau. Proposals must meet several requirements:
o Justify your need to use non-public data
o Maintain confidentiality
o Demonstrate feasibility
o Demonstrate scientific merit
o MUST describe benefits to the Census Bureau’s own research. This is not as hard as it sounds; the benefit could be as simple as identifying clearly erroneous observations in the data you are using and suggesting ways to fix them (the kinds of things you typically do when cleaning your own data). Statistical analysis that assesses how well certain groups were covered in certain geographic areas, in order to identify gaps or errors, can also be a benefit to the Bureau.
o Specify which data sets you will be using; you will only have access to those data sets.
Confidentiality and Disclosure
• Since TCRDC has access to microdata—such as household and firm-level data, albeit without identifying information about the residents or businesses—there are very stringent security protections to access the data.
• Title 13 protects confidentiality of Census data. To access data at TCRDC, you need a Census badge, which requires a background check from the FBI and takes at least 4-6 weeks. Depending on which data you are accessing, you will also have to get approval from the agency or agencies that own the data, including the Census Bureau, the Internal Revenue Service (IRS), or other federal agencies for health data.
• Confidentiality protections from federal code require a secure research environment (the lab is a locked room with a badge required for entry), and access and output controls. You can’t print data or put it on a flash drive and take it out. Every piece of output that you want has to meet certain guidelines. The Bureau, for example, has rules about how many people must be in a sample. If you conduct a regression analysis with only 100 people in the sample, then add dummy variables to subdivide the sample by race, gender, income level, etc., you could create samples so small that confidentiality of respondents is at risk. The Bureau would not release that data.
• You must request that the results of your analysis be released and document how you created your samples. TCRDC and a disclosure officer in D.C. will review your documentation and send the output to you electronically. This process can take two days to three weeks (less if you document well; more if your documentation is sparse or your samples are too small).
• There are significant financial and legal penalties for illegal disclosure of Census data (e.g., identity of a respondent or a business).
• “The Bureau discourages intermediate output. They want the output to be a table, for example, that you’re going to put in your paper.”
• Model output is easier to get through the disclosure process than means.
• Common disclosure problems: sample sizes that are too small; means reported alongside regression output with dummy variables that create sub-samples.
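The small-sample concern described above can be checked before you request release. The following Python sketch counts how many observations fall into each cell created by a set of dummy variables and flags the cells that fall below a threshold. The threshold (MIN_CELL_SIZE = 10), the field names, and the toy records are all assumptions made for illustration; the Bureau's actual disclosure rules are not stated in these notes.

```python
from collections import Counter

# Hypothetical threshold; the actual disclosure rule is not given in these notes.
MIN_CELL_SIZE = 10

def small_cells(records, keys, min_size=MIN_CELL_SIZE):
    """Return {cell: count} for every observed cell with fewer than min_size records.

    records: list of dicts, one per respondent
    keys: names of the categorical variables used to subdivide the sample
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: n for cell, n in counts.items() if n < min_size}

# Toy sample of 100 respondents (made-up values, not Census data).
sample = (
    [{"gender": "F", "income": "low"}] * 48
    + [{"gender": "M", "income": "low"}] * 45
    + [{"gender": "F", "income": "high"}] * 4
    + [{"gender": "M", "income": "high"}] * 3
)

flagged = small_cells(sample, ["gender", "income"])
for cell, n in sorted(flagged.items()):
    print(cell, n)
```

In this toy sample, splitting by gender alone leaves no small cells, but adding the income dummy produces two cells with fewer than ten people each. Documenting counts like these alongside a release request is one way to show how the samples were created.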
Data analysis
• Wide variety of statistical analysis software available: SAS, R, etc. Any software available for Linux is probably either available at TCRDC or can be made available.
• Each researcher has a dedicated space on the Linux server for storing their work, so you can visit multiple times over a period of weeks or months, do analysis, and release results when it is most convenient or appropriate.
• The data is raw, so researchers should expect to devote some time to clean up before they begin analysis. Widely-used data sets should be in better shape than more obscure data sets.
• No GIS capability at present, though they are in the process of making Giota (check spelling) available. However, you’re not completely limited to what is available in the lab—you may be able to bring things in, access them on another server, etc.
Demographic Data
• It generally takes less time to gain access to demographic data than business data because demographic data does not require IRS approval for access. It does require Census Bureau approval.
• The best way to find out what is available in each data set is to look at the actual survey form (such as the American Community Survey form or Census long form). Most forms are available on the Census Bureau’s website. TCRDC has the raw data from responses to these forms.
• Sampling of the data sources available (all nationwide):
o 1970, 1980, 1990, and 2000 Decennial Long Form
o March CPS Earnings Supplements (more detailed questions about health)
o Survey of Income and Program Participation (addresses what kind of programs and services people use)
o American Housing Survey
o American Community Survey (ACS)
• Decennial Long Form:
o Household form (without address) is available, but you will know the respondent’s Census block.
o In 2010, there was no long form, and there will probably never be a long form again. It has been replaced by the American Community Survey.
o American Community Survey is designed to replace the long form and give more frequent (five-year) estimates. The Bureau just released its first five-year estimates. There is more sampling error for ACS than for the long form data, but it is more recent/frequent. The ACS is conducted on a rolling basis on a five-year cycle, so one-fifth of the data will be only a year old, one-fifth will be two years old, etc.
• CPS – They have some supplements, but not all. CPS is jointly produced by the Bureau of Labor Statistics and the Census Bureau. It is not just conducted in March; it is done on a rolling, monthly basis. Main questions are about employment and earnings, but there are supplements on a number of topics (e.g., voting habits). Right now, they don’t have access to all supplements, because the Bureau of Labor Statistics has not granted access to everything. If you are interested in a particular supplement, tell TCRDC and ask them whether they have access to it.
Health Data
• Researchers can also access Agency for Healthcare Research and Quality (AHRQ) and National Center for Health Statistics (NCHS) data at TCRDC, but the access procedure is different since the data is owned by a different agency.
o Census Bureau proposal and approval not required; security clearance is required.
o You must request access from the owning agency (see the AHRQ or NCHS website for more information). Tell them in your request that you want to access the data at the TCRDC.
o Approval usually takes about one month from the time of application submission. It may take longer to get security clearance and a Census badge than agency approval.
o NCHS will only allow you to access data during normal business hours, even though TCRDC is open 24 hours a day.
o There may be a fee from the agency that owns the data (AHRQ or NCHS, not TCRDC) for data preparation (~$500) for faculty, but not for graduate students.
Economic Data
• Examples of microdata available on firms:
o Collected at plant level
o Linked data linking firms and households. (If you work for a business that pays unemployment insurance on you, you’re in LEHD data.)
o TCRDC does NOT have access to the Census of Agriculture yet, because it is conducted by another agency that has not granted TCRDC access.
o Annual Survey of Manufactures
o Survey of Business Owners (which oversamples minority- and women-owned businesses)
o Commodity Flow Survey
o Auxiliary Establishment Survey
o Manufacturing Energy Consumption Survey (just got the 2007)
o Medical Expenditure Panel Survey, Insurance Component
o Quarterly Financial Reports
o (more on slides)
• At TCRDC, you get raw data from the surveys sent to plants. The economic data does not get cleaned up as much as the demographic data (especially the less-used, smaller-sample surveys), so expect to invest more time in cleaning data.
• The best way to find out what kind of business data is available is to look at the actual forms that the plants fill out. Forms are available on the Census website, but they’re not all in one place; you may need to search for the name of the particular survey in which you are interested.
• Be prepared for heterogeneity in survey responses. For example, in some firms, each plant may fill out a survey; for some firms, they may do one form for multiple plants. Some firms provide more accurate and complete information than others.
• Long forms vary by industry (e.g., a paper plant is not going to be asked the same questions as a poultry processing plant). There are hundreds of different forms, with very specific questions relevant to each industry.
• You will have a unique establishment identifier (for the plant) that you can link across time. If firms change ownership, they assume the identifier of the purchasing parent firm.
• There are no public versions of many of these data sets, just publications by industry, for things like the Census of Construction or Census of Real Estate.
• Since the business data contains tax information, there are additional federal regulations about its use and there is an added approval layer from the IRS for release of the data to a researcher.
• It can be hard to know what data is available without actually gaining access to the TCRDC, so send TCRDC a short (one- or two-page) description of your project, and/or call them, and they can look it over and help you decide what data sets would be most useful and relevant for you.
• For LEHD data, you don’t actually know the name of the firm behind the data. You can try to do a probabilistic match. TCRDC also has a geocoded address list (GAL) with the latitude and longitude of the establishment.
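Linking plants across time with the establishment identifier, as described above, can be sketched in Python as follows. The field names (estab_id, firm_id, year) and the toy records are hypothetical, not the actual variable names in any Census file. The sketch also assumes one reading of the ownership note: the plant keeps its establishment identifier while its firm identifier changes to that of the purchasing parent.

```python
from collections import defaultdict

def build_panel(records):
    """Group plant-year records by establishment identifier, sorted by year."""
    panel = defaultdict(list)
    for rec in records:
        panel[rec["estab_id"]].append(rec)
    for history in panel.values():
        history.sort(key=lambda r: r["year"])
    return dict(panel)

def ownership_changes(history):
    """Return (year, old_firm, new_firm) for each change in the firm identifier."""
    changes = []
    for prev, curr in zip(history, history[1:]):
        if prev["firm_id"] != curr["firm_id"]:
            changes.append((curr["year"], prev["firm_id"], curr["firm_id"]))
    return changes

# Toy records (made-up identifiers, not Census data), deliberately out of order.
records = [
    {"estab_id": "E1", "firm_id": "F10", "year": 2007},
    {"estab_id": "E1", "firm_id": "F10", "year": 2002},
    {"estab_id": "E1", "firm_id": "F22", "year": 2012},
    {"estab_id": "E2", "firm_id": "F22", "year": 2007},
]

panel = build_panel(records)
print(ownership_changes(panel["E1"]))
```

Here plant E1 appears in three survey years and changes parent firm between 2007 and 2012, while E2 has a single record and no changes.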
Linked Data
• Longitudinal Employer-Household Dynamics, also called LEHD or linked data, was produced by matching unemployment insurance data with other Census data. You don’t necessarily have the most detailed demographic data on these employees, but you can track them over time through different workplaces.
• For an example of linked data use, see OnTheMap, an application on the Census Bureau website that uses LEHD data to show how many people come to your county every day to work, how many people leave your county to work, etc.
• You have quarterly observations. If a person is in the data, you know that they have worked at least one hour that quarter for the business that they’re tied to. You don’t know how many hours, but you can sometimes make inferences about that.
• Quarterly data starts in 1990; most goes up through 2008.
• If someone is actually unemployed, they are not present in the data. You will have to pick them back up if and when they come back into the workforce.
• Detailed manual explaining what is in LEHD:
• Coverage for age and gender is good. Place of residence is fairly good, but only available in certain years for legal reasons.
• Education data is very spotty, but they are working to improve it. The manual linked above explains the imputation process for missing education data.
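Because people drop out of the linked data when they have no covered job and reappear when they return to work, it can help to locate the gaps in a person's quarterly history. The Python sketch below does this for a list of (year, quarter) pairs. The input layout and function names are assumptions for illustration, not the actual LEHD file structure. A gap only means the person was absent from the data for those quarters; it does not by itself show why.

```python
def quarter_index(year, quarter):
    """Map (year, quarter) to a running integer so consecutive quarters differ by 1."""
    return year * 4 + (quarter - 1)

def find_gaps(observed):
    """Return (last_seen, next_seen, missing_quarters) for each gap in the history.

    observed: iterable of (year, quarter) pairs in which the person appears.
    """
    quarters = sorted(set(observed), key=lambda yq: quarter_index(*yq))
    gaps = []
    for prev, curr in zip(quarters, quarters[1:]):
        missing = quarter_index(*curr) - quarter_index(*prev) - 1
        if missing > 0:
            gaps.append((prev, curr, missing))
    return gaps

# Toy history (made-up): present for three quarters, absent for two, then back.
history = [(2001, 1), (2001, 2), (2001, 3), (2002, 2), (2002, 3)]
print(find_gaps(history))
```

For this toy history the function reports one gap of two quarters, between the third quarter of 2001 and the second quarter of 2002.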