Preprocessing the Flight Data

One of the advantages of working with structured data is that you always know its format; thus, moving the columns around is relatively easy to do. We have chosen to write an AWK program to do this. The program is based on counting commas, which is the field separator. All we have to do is read a line and write it with the columns rearranged. In AWK, the essence of the program looks like this:

BEGIN { FS = ","; OFS = "," }

Chapter 9 ■ GettinG the FliGht Data into Splunk

The field separator is defined as a comma with the FS variable for the input and OFS for the output. AWK will take each incoming line, separate fields and make them available as variables called $N, where N is the number of the field or column. Note that there is no need for a read instruction in AWK. By writing column 6 and 30 first, then the rest of the columns, we achieve our objective. The program, which we called column_mover.awk, is executed in Unix as follows:

column_mover.awk < original_data_file > modified_data_file

One of the issues that we bumped into using AWK is that it counts all commas, even those within a string, that is, within a pair of double quotes. In higher level languages, commas within a string are considered part of the string. This is relevant because two fields before column 30 are strings that always contain a comma. Column 16—OriginCityName

and column 25—DestCityName always contain the city name and the state abbreviation separated by a comma, for example, “New York, NY.” We solve this problem by accounting for those two extra commas. Now OriginCityName

is made up of columns 16 and 17, shifting all the columns by one, and DestCityName now starts at column 26 and is composed of columns 26 and 27, further shifting all the following columns by one for a total of two columns from this point on. The new AWK command looks like this:

print $6, $32, $1, $2, $3, ..., $109, $110, $111

For this script to work correctly we need to verify that there are no other fields after column 30 that are strings that include a comma. After reviewing the data description, this is confirmed. This exercise of reviewing the data structure and its contents is a reminder of the fact that we need to be intimately familiar with the data we are going to analyze. After executing this program over the selected flight data, we are ready to load it into Splunk. We have included the AWK program in the download package of the book under the name column_mover.awk.

To define the desired timestamp combination, we will need to create a new source type. For this example we will be interacting with the user interface, but this can also be done by directly modifying the props.conf configuration file. Following the dialog for defining a new data input file, we select to preview the data before indexing and choose the file we want to load, as can be seen in Figure 9-4. As we are testing something new we will just use one month’s worth of data. Once we are sure that all we do works properly, we can use all the data available. In this case, we will use the most recent data, which at the time of this writing is September 2012.

Once we click on Continue, the next dialog box gives us three options related to setting the source type. We choose to define a new source type, because we are going to have a special treatment of the fields. The next screen, presented in Figure 9-5, shows the first interpretation that Splunk has of the data. As you can see, it shows all the flight records as a single event. The default is to break the lines on the timestamp, but because Splunk cannot find it, it presents one single event that contains all the lines of the CSV file.

As the data looks incorrect, we click on “adjust timestamp and event break settings.” This takes us to the next screen, where we have various options to define the event breaks. We select “Every line is one event” and click on the “Apply” button. After a little while processing the events, it displays them with the correct event breaks, as can be seen in Figure 9-6.

Figure 9-5. Initial flight data view

Chapter 9 ■GettinG the FliGht Data into Splunk

As can be seen in Figure 9-6, there is a warning triangle at the beginning of the line. When you hover the mouse over it, it states that it failed to parse a timestamp. In this case, following the rules of precedence, the timestamp it presents is the last modification timestamp of the CSV file.

The next step is to work with the timestamps. For this we select the corresponding tab. As you can see in Figure 9-7, there are three options under the Location title. Because the timestamp is located at the beginning of the event we do not have to specify a pattern that precedes it. This option is very useful if your timestamp is further into the event and has something like a string before it, for example:

Printed on 04/17/2012 Page 01

You would define the pattern as “Printed on”. You can also use regular expressions to define the pattern. Because this is not our case, we select the last option where we define the maximum amount of characters into the event where the timestamp processor can find the timestamp. This number is 0 as the fields that are used to define the timestamp are always at the beginning of the event. The closer the timestamp fields are to the beginning of the event, the better for Splunk and the faster it will process the timestamp when it indexes the data.

Figure 9-7. Timestamp definition

Under the Format title we can specify the timestamp format and the time zone. We already decided we will not be using the latter, so we can move on to define the former. This is done using strptime() expressions, which are really simple and well documented for multiple programming languages. In our case the expression is %Y-%m-%d,“%H%M,”

which means the year using four digits, a dash sign, the month as a two-digit number, a dash sign, the day of the month, a comma, a double quote, the hour using a 24-hour clock (00 to 23), and the minutes as a two-digit number.

Note

■

Splunk does not currently recognize non-english month names (%B, %b) or weekday names (%a, %a). an

alternative is to use numeric months (%m) and weekdays (%u).

Once we apply these definitions the data preview screen in Figure 9-7 presents us with a list of the events with the desired timestamp and it also highlights the section that matches the expression used for the timestamp. The only other thing to notice is that the header line still presents a warning related to the fact that it was not able to parse a

The next dialog box asks for the name of the new source type. We called it FD_Source1 and saved it. From there it takes you to index data, where it automatically starts indexing once you have done your final selections on that screen. At this point you can move to the summary page of the search app and see how the count of indexed events increases until it stops. In our case September 2012 contains 490,200 events. Note that this count includes the title line.

Now we can verify if the new source type works correctly. We start this process by reviewing a few lines by typing the following search command in the search screen of the user interface * | head. As we have not specified any sorting, by default Splunk will sort using the timestamp in decreasing order, that is, from most recent to oldest. As expected, the header line is the first event to show up, followed by events with a timestamp of September 30, 2012 11:59:00.000 PM and continues with decreasing timestamps. The events contain the information we were expecting. We can also see that each of them correctly displays the name of the host, the source type and the source file.

Reviewing the left bar we notice that under Interesting Fields none of the field names of the header line are showing up. This is not good, as we will not be able to use the field names on the searches when we analyze the data. The problem is that we had incorrectly assumed that Splunk would automatically associate the field names in the header line with the data fields. This would normally happen if we use CSV as the source type, but in our case we defined a new source type. The solution is to set the CHECK_FOR_HEADER attribute to true in the FD_Source1 stanza of the props.conf file.

Unfortunately, the CHECK_FOR_HEADER attribute works only at index time, so we will have to remove the data we have indexed and reindex it. After adding the CHECK_FOR_HEADER attribute in the props.conf file we stop Splunk. All changes to configuration files require a restart to be activated. The reason we stopped Splunk is that the command we will use to remove the events we already indexed only works when Splunk is not running.

The clean command, which is issued from the Command Line Interface (CLI) at the Unix prompt, deletes the data in one or all the indexes depending on whether you provide an index name as an argument. Because our instance of Splunk only contains the flight data on the main index, we will remove all the data by issuing the following command:

splunk clean eventdata

After we have deleted the data in the indexes we start Splunk again and we can proceed to reindex the data. We do it pretty much the same way we did before; we preview the data, but this time we select to use an existing source type and choose FD_Source1 from the pull down menu. Splunk presents us with the data, which looks correct, and we proceed to index it.

Once the indexing is ready, we verify once again if the source type is working correctly. As we search for the first 10 events, we can see that the field names are on the left bar. Things are looking good. However, we also notice that the source type on each event is now FD_Source1-2. As suspicious as it looks, this turns out to be a side effect when CHECK_FOR_HEADER is true, as it causes the indexed source type to have an appended numeral.

Caution

■

When CheCk_For_heaDer is set to true, the field names are stored in the server where the source type

was defined. Because of this, this feature will not work in most environments where the data is forwarded.

Now we can try a few search commands to see if the field names are properly lined up with the contents. The first quick check is searching for the top 10 origin airports by number of scheduled flights by typing the following search command in the search screen of the user interface:

* | top Origin

This search goes over all the indexed events, which is one of the things we want as part of these quick

verifications. The Origin field contains the code of the airport. The result shows the top 10 airports and contains the usual suspects: Atlanta, Chicago O’Hare, Dallas/Ft. Worth, Denver, and so on.

Chapter 9 ■ GettinG the FliGht Data into Splunk

In the next check we will try to verify that if an event has the last field Div5TailNum, then it also contains an actual tail number. The search we use is:

* | where isnotnull(Div5TailNum)

The result of this search only includes the header line, so there are no flights with five diversions in this month. We try the search with Div4TailNum and Div3TailNum unsuccessfully, until we get a result with Div2TailNum. This shows only one event for Alaska Airlines on September 1 at 7:40 AM. We go to the left bar and click over Div2TailNum. Figure 9-8 presents us with a popup window that contains statistical information and a breakdown of the values of the field. There are two values here, Div2TailNum, which comes from the header line, and N768AS, which is a proper tail number and happens to be the same as the one contained in the TailNum field.

We do one last check with a field in the middle of the event. The chosen field is Cancelled. A value of 1 means the flight was cancelled, whereas a value of 0 means it was not. Execute the following search command:

* | chart count by Cancelled

This command goes over all the events and the field popup presents three values, 0, 1 and Cancelled. The last value comes from the header line as expected. Because there are no other values and the percentages of noncancelled flights (99.193%) and cancelled flights (0.807%) seem reasonable, this result and the ones of the previous searches make us feel comfortable that the data was correctly indexed into Splunk.

In document r bdaus pdf (Page 149-154)