The result of these efforts was an Excel spreadsheet of booking data with 88,768 records. Each record is an individual booking that occurred over the period 01/01/2003 to 01/31/2015. The 88,768 booking records in this spreadsheet are unique with no duplicate booking numbers. For the follow data procedures that were performed in Excel, please see the appendix for the Excel formulae that were used in the creation of new variables, or Excel columns.
The second stage of processing the booking data in Excel had many aspects. The first things that were done were to re-order and re-name some of the variables. Then, two variables were added. In terms of variables, or Excel columns, the data now consisted of Booking_Number, Date_Confined, Time_Confined, Date_Released, Time_Released, Gender, MOC, and Statute_Description. Note that from the original Excel spreadsheets, the
following variables, or columns, were named: Booking_#, Sex, and Offense were re-named Booking_Number, Gender, and MOC. MOC is the commonly used acronym for Minnesota Offense Code.
Recall how the Date_Confined and Date_Released variables consisted of two portions: the date and the time. In creating the new variables, Time_Confined and
Time_Released, no new information was generated. New variables (or Excel columns) were created, the data values were copied from the Date_Confined and Date_Released columns, and the Excel formats for the four columns were chosen to reflect the information that was desired for that particular column. For the Date_Confined and Date_Released variables, the
mm/dd/yyyy format was chosen. For the Time_Confined and Time_Released variables, the military format (hh:mm) was chosen to represent the data. Therefore, for the
Date_Confined and Time_Confined variables, the data was the same, but was simply
displayed selectively. The same is true for the Date_Released and Time_Released variables.
However, a problem arose when SAS appeared to be unable to understand the time portion of the Excel data. SAS read the date correctly and was able to present the
Date_Confined and Date_Released variables accurately, but had trouble with Time_Confined and Time_Released variables. It became clear SAS was having trouble because Excel was storing both the date and the time data in the columns. It was reading and presented the date information, but was having trouble with the time data. Perhaps it was because the time data was stored second after the date information.
Both the Time_Confined and Time_Release variables, as columns in Excel, had to be recreated to contain only the time value. This was different than the original method where the date and time information were both contained in the column, but showing only the time information. Once this was done, SAS had no problem presenting the time variables accurately.
In its refined form of 88,768 booking records, there was no missing data for the variables Booking_Number, Date_Confined, Time_Confined, and Gender. For both the Date_Released and Time_Released variables, there are 134 missing data values, for each variable. For the Offense variable, there were 288 missing data values, and for the
Statute_Description variable there were only two blank records. It is important to point out
that when missing bookings were added from other Excel or from JMP data, these records were always missing the Offense variable, so such additions resulted in a blank data for that variable.
From this point, much time and effort was spent on creating new variables with an eye towards diagnosing data problems, and eventually attempting survival analysis, with the booking data. The booking number is the most fundamental information in the data. In fact, the first action taken was to move the Booking_Number variable to the left, to be first in the order of Excel columns.
From Booking_Number, two more variables were created: Booking_7 and
Booking_3. Booking_7 represented the first seven figures of the full booking number while Booking_3 represented the last three figures from the booking number. Recall that the first seven figures from the booking number are unique to the individual and represent one person. The last three figures from the booking number represent the number of bookings that individual has undergone. The two highest number of bookings seen in this data, as revealed by the Booking_3 variable, are the 63rd and 72nd bookings of two individuals.
The data does not have all 135 bookings for these two individuals. For the person who has been through 63 bookings, the first booking is missing. For the person with 72 bookings, the first four are missing. Both booking numbers begin with 02, meaning that the very first booking was in 2002 for each person, so those particular bookings are missing most likely because the bookings occurred prior to the time period covered by the data. Here is
an example of how this segment of the data now appears, along with the Date_Confined and Time_Confined variables:
Booking_Number Booking_7 Booking_3 Date_Confined Time_Confined 00J0068-005 00J0068 005 4/25/2005 14:00
Figure 7: Example of Corrected Excel Records
The booking number was broken up into its component parts by using the Replace() formula in Excel. The green color for one record indicated it was added from another Excel spreadsheet and was therefore missing from the two main Excel spreadsheets obtained from Stearns County. There is no obvious reason why it was missing from the main Excel spreadsheets obtained from Stearns County. The booking occurred in 2009, so it fell into the time period where it should have been captured by the main Excel spreadsheets. Since it was obtained from another Excel data source, the Offense variable has a blank data value for this booking. None of the other Excel spreadsheets had this variable other than the two Excel spreadsheets that formed the fundamental source of the data used here.
During these two stages of cleaning, correcting, and culling the data, then re-naming, re-ordering, and refining the variables, as well as creating some new variables, some
columns were copied and kept to the right in the Excel spreadsheet. Another column of
booking numbers, called Booking_Number_2 was kept. This consisted of the original booking numbers from the main Excel spreadsheets. Occasionally, booking numbers were recognized as having an error. The Booking_Number variable received this correction, but Booking_Number_2 did not. The intention was to keep its values in original form. This was done in case it was ever useful to see a booking number in the original form. This was also done to make sure the records were kept in sequence and guard against unintentional alterations.
This was particularly important with the Date and Time released information, given the attention that the variable received, as well as data additions. To that end,
Date_Release_2 was created and had the date and time released information copied.
Therefore, Booking_Number_2 and Date_Released_2 were created and shifted off the right, when viewing the Excel columns, and were away from the main focus of work on the
variables that was taking place.
At this point, the Excel data consisted of twelve variables in the following order:
Booking_Number, Booking_2, Booking_7, Booking_3, Date_Confined, Time_Confined, Date_Released, Time_Released, Booking_Number_2, Gender, Date_Released_2, MOC, and Statute_Description.