One of the most important things to do when analyzing data is getting familiar with it. This statement applies for all types of data, be it big data, structured data, unstructured data, and so on. In general, data is a live entity that continually evolves. The contents are always changing as well as its format. This is the case for both real time and historic data. No matter the case, make sure you get really familiar with the data you are going to analyze. Particular care has to be placed on thoroughly understanding the meaning of the fields that are available. It has been our
Taking advantage of the fact that the download page presents all the fields available with a brief explanation, organized in groups, to gain familiarity with the data. The groups are as follows:
• Time Period. This group consists mostly of expected fields: year, quarter, month, day of the month, day of the week, and the actual flight date. This last field is one of the most important ones for defining the timestamp in Splunk.
• Airline. This is group that has to be studied carefully, as some issues can arise if the differences between these fields are not clearly understood. The differences between the UniqueCarrier, AirlineID, and Carrier fields are subtle, but they can have a big impact in the results of our analysis. We will discuss these fields in more detail in the next chapter. The other two fields in the airline group are the tail number of the airplane, which is the registration (the equivalent of the license plate for a car), and the flight number.
Note
■
notice to the right of the Carrier field that there is a link to get a lookup table. Clicking on it downloads a table
with the names of the airlines by unique code. You can see that many fields throughout the download page have such a link.
• Origin Points. Here we also find four fields that can be confusing: OriginAirportID,
OriginAirportSeqID, OriginCityMarketID, and Origin. We will not go into the details of each field until the next chapter, but we want to reinforce the fact that you need to be acquainted with the data that you are analyzing.
• Destination points. These are similar to the origin point fields but indicate the corresponding destination.
• Departure Performance. These fields are mostly self-explanatory, except that we need to clarify the acronym CRS, which stands for Computer Reservation System. The field
CRSDepTime is the official scheduled departure time of a flight. This field along with the flight date will be used to create the timestamp in Splunk, which will be explained later in this chapter.
• Arrival Performance. These fields are similar to those of departure performance but relate to the arrival performance.
• Cancellations and Diversions. Present if a flight was cancelled or diverted and gives the reason for cancellation.
• Flight Summaries. Information here includes things such as elapsed times and distances.
• Causes for Delays. Describes reasons for a flight’s being delayed. This and the next two groups are interesting, because they illustrate how data can change in format over time. This group was added in June 2003.
• Gate Return Information. Contains information regarding the amount of time since the aircraft left the gate and returned. This is a newer group of fields added in October 2008.
• Diverted Airport Information. Describes detailed information for as many as five flight diversions. This group was also added in October 2008.
The fact that the number of fields increases over time is cause for concern. The issue at hand is that we potentially have to deal with three different types of CSV files: one with 56 fields, another one with 61, and a final one with 109 fields. We must examine carefully the CSV files we download.
Chapter 9 ■ GettinG the FliGht Data into Splunk
We start by reviewing the first month available, October 1987. The easiest way to do this is to use a spreadsheet. We first look at the title line and notice that it has all the field names for the 109 fields. This is good because it means that we might not have to worry about dealing with three different CSV files. Now we look at the first line where we can see that there is no data starting in column 57 (CarrierDelay) to column 109 (Div5TailNum). A quick glance at all the data lines shows that pattern.
Although using a spreadsheet is the easiest way of examining a CSV file, it is not the most thorough. By definition, a CSV file contains Comma Separated Values. Even though the spreadsheet shows that there is no content in fields 57 through 109 in the data lines, we need to make sure that those fields are in the file and contain nothing. Each line should have a total of 108 commas, as the last field does not need one. Starting with field 57 there should only be commas with nothing between them or a pair of double quotes, one after another (“”), which represents an empty string.
Although this is not exactly a fun endeavor, we counted the commas and found that the couple of random lines we chose have all the commas necessary to delimit 109 fields. This is really good news, because now we know that the October 1987 file has all 109 fields, even thought many of them are not populated. This means that there is only one type of CSV file to deal with, instead of three.
The next step is to make sure that the format is consistent throughout all the flight data. For this we choose to examine the months of June 2003 and October 2008, which are the ones where new fields are added. We reviewed these files and found the layout to be consistent. Now we are sure that all the files that contain the flight data have the same layout or format containing 109 fields, and we will not have to do any special processing for this reason.
When using a spreadsheet to examine the data, you also have to be aware that the spreadsheet will use default representations for data fields. For example, the FlightDate field is formatted as yyyy-mm-dd in the CSV file; however, the spreadsheet presents it as mm/dd/yyyy. Another example is the CRSDepTime field, which in the CSV file is a string with the following format: “hhmm”. The string “0900” is presented in the spreadsheet as 900. You have to be aware of how data is presented in the spreadsheet as in many cases it will vary from the actual data.
Be thorough when you are getting familiar with the data you are going to analyze. Using a spreadsheet is a quick way to verify certain things, but it is not the best solution. You will always have to examine the raw data using a text editor.