The Method in Brief - Extending the Utility of Public Use Microdata

The core function of this method is to move data from the Public Use Microdata Area (PUMA) to the desired geography. This should be possible for any subpopulation that has sufficient observations in the dataset. PUMAs are statistical areas designated by officials in each state’s State Data Center (SDC) according to guidelines provided by the Census Bureau. The guidelines require that every PUMA contain a minimum of 100,000 people, and PUMAs created for data releases since 2012 must be constructed from contiguous census tracts, with emphasis on keeping counties whole when possible (U.S. Census Bureau 2011). Because they impose a large minimum population, these guidelines produce areas that do a good job of preserving respondent confidentiality. This property is also a drawback: the population threshold often makes PUMAs large and unwieldy, especially in rural areas, where a single PUMA likely groups several counties together. Because PUMAs are built from census tracts to meet population thresholds, they usually do not conform to any recognizable areas, with the exception of counties and cities whose populations exceed 100,000. This creates problems when trying to answer demographic questions below the state level and makes it impossible to isolate all but highly populated areas, which is a particular obstacle when making estimates for small areas or areas with low populations. The revised method will not solve all the problems experienced by individuals working in low-population or geographically diverse areas, but it will provide assistance and another tool in the demographic data user’s toolbox.

The basic steps needed to produce an interpolated set of estimates for alternate geographies and/or subpopulations are as follows:

1. Review the geographic requirements to determine whether a census geography can be used to create or approximate the geography required for the project. The request from the MDE called for the data to be presented by Intermediate School District (ISD). The ISD is not a geographic unit for which the Census Bureau publishes data, but every ISD is an aggregation of Local Education Agencies (LEA), which are geographies tabulated by the Census Bureau. This step is very important because a census geography is necessary to build the weights that will be used to distribute the values from the PUMS data. The estimates produced for New Jersey (discussed later) will be at the school district level, and the South Dakota estimates (discussed later) will be for counties. New Jersey and South Dakota will use county subdivisions and census tracts, respectively, as intermediate geographies in the first stage of the weighting process. A valuable lesson learned through revising this method is how much effort is saved by using as many shapefiles from the same source as possible. For the estimates, all shapefiles were obtained from the Census Bureau’s shapefile collection (U.S. Census Bureau 2016).

2. Determine whether your project requires an intermediate geography from which you will aggregate to your final geography. As mentioned above, all estimates produced for testing will use the county subdivision layer as an intermediate level for the initial stage of weighting.

3. Determine an appropriate weighting variable to move data from the PUMS level to the geographic level you have selected from the previous steps.

4. Obtain or produce shapefiles for all the geographic levels needed, and prepare them for use.

5. Add the weighting variable values to the geographic area you will use to move the PUMS data to the project-specific geography.

6. Combine the PUMA shapefile with the shapefile that will have the weighting variable data added. This should provide an exhaustive accounting of the study area with polygons that can be added to either the PUMA areas or the weighting variable areas.

7. Once the data are attached and the shapefiles are combined, the deduplication process will be performed.

8. Weights are created that will be applied to the PUMS data. These weights should sum exactly to the number of PUMAs in the state.

9. With the final weighting variable in place, the project is ready to make the interpolations of the needed or intermediate geography. The nature of the unioned dataset allows for the weights to move the data from the PUMAs to the required geography.

10. Review the estimates for face validity and perform any other validity checks that are possible based on the local-level data that are available, or compare them to gold standard data where such data exist. It is unlikely that any real gold standard data will be available, as such data would make this process unnecessary. This project does have gold standard data, as the group moved to purchase data from the Census Bureau to fulfill the programmatic requirements of the project. I will use those data to test the results of the method.
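
The weighting and interpolation steps above (steps 8 and 9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the estimates: it assumes each weight is a union piece's share of its PUMA's weighting-variable total, which is one scheme consistent with the weights summing to the number of PUMAs. The piece records, the identifiers (P1, P2, A, B, C), the population values, and the function names build_weights and interpolate are all hypothetical.

```python
from collections import defaultdict

# Hypothetical pieces of the unioned layer: (PUMA id, target-geography id,
# weighting-variable value falling in that piece, e.g. population).
pieces = [
    ("P1", "A", 60_000),
    ("P1", "B", 40_000),
    ("P2", "B", 30_000),
    ("P2", "C", 90_000),
]

# Hypothetical PUMA-level estimates tabulated from the PUMS records.
puma_estimates = {"P1": 5_000.0, "P2": 2_400.0}


def build_weights(pieces):
    """Return {(puma, target): share of the PUMA's weighting variable}.

    Assumes every PUMA has a nonzero weighting-variable total.
    """
    totals = defaultdict(float)
    for puma, _target, value in pieces:
        totals[puma] += value
    weights = defaultdict(float)
    for puma, target, value in pieces:
        weights[(puma, target)] += value / totals[puma]
    return dict(weights)


def interpolate(weights, puma_estimates):
    """Distribute each PUMA estimate to target areas using the weights."""
    out = defaultdict(float)
    for (puma, target), weight in weights.items():
        out[target] += weight * puma_estimates[puma]
    return dict(out)


weights = build_weights(pieces)
target_estimates = interpolate(weights, puma_estimates)

# Step 8 check: the weights within each PUMA sum to 1, so the grand
# total should equal the number of PUMAs (up to rounding error).
assert abs(sum(weights.values()) - len(puma_estimates)) < 1e-9
```

Under this scheme the interpolated values for the target areas add back up to the sum of the PUMA-level estimates, which gives a simple consistency check to run before the validity review in step 10.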
