As all the searches we have done so far are simple, in this section we will start increasing the level of complexity. We start with a simple search, average flight time by airline and as usual so far we test the searches over the most recent month. The search, shown in Figure 10-14, is based on the CRSElapsedTime field, which contains the scheduled flight time in minutes, not the actual flight time:
The output of this search is a table that is ordered by airline and the average flight times, calculated as the arithmetic mean, have six decimal places. Before we determine the best way to present these results we want to polish the output a bit by rounding the averages and sorting them. We do this as follows:
* | stats avg(CRSElapsedTime) as AverageFlightTime by UniqueCarrier
| eval AverageFlightTime=round(AverageFlightTime) | sort –AverageFlightTime
Chapter 10 ■ analyzing airlines, airports, Flights, and delays
Every time we manipulate data to obtain partial results and pass them on to the next section of a search using pipes we like to define new fields, even if new fields are not explicitly necessary. In this case we create a new field that contains the averages in the second clause of the search. The output is rounded in the third clause of the search, which makes it more readable, and finally it is sorted in decreasing order. The round function defaults to an integer. If you want decimals, just specify the desired precision after the field separated by a comma.
Determining the best representation for a given data set can be difficult. The tabular form presented in Figure 10-15 provides a quick way to process the information if we are interested in the actual number of minutes, whereas a bar chart as seen in Figure 10-16 provides it without the exact numbers.
It is interesting to note that all the regional airlines average around one hour and a half, confirming their regional status. Oddly, Hawaiian Airlines shows up in this group, but this can be attributed to the fact that the majority of their flights are short hops within the Hawaiian Islands, which skew the averages to the lower end as their flights to the continent are comparatively not as many.
The next search is looking for the longest flights, but that in itself would be very easy. We want to know which are the longest flights by airline and also from where to where. Presenting the origin and destination increases the complexity of formulating the search, which does not necessarily mean that the resulting search is more complex. We cannot think of a way to do this search with only one command, so we will have to break it down in clauses using pipes. The first thought is to use the stats command to find the maximum scheduled flight time. As expected this will produce the longest flight by time for each airline. The next step is to understand what happens if we add the origin and destination to that particular search, as seen in Figure 10-17.
Figure 10-16. Average flight (time—bar chart)
Chapter 10 ■ analyzing airlines, airports, Flights, and delays
The output of this search is a table with the maximum flight time for every origin-destination pair for each airline. This table contains 6,096 entries for the month of September 2012, which implies that there are as many unique city pairs for all the airlines. Just out of curiosity, we review the first table entry, American Airlines from Albuquerque to Dallas/Fort Worth, by doing a search to count all those entries, which totals 214 matching events.
UniqueCarrier=AA Origin=ABQ Dest=DFW
Clicking on the CRSElapsedTime field on the left side bar shows that there are two values, 100 and 105 minutes, as can be seen in Figure 10-18. This means that the airline itself has two different scheduled flight times. When clicking on the departure time field there are eight values and the scheduled arrival time has nine. Obviously, there is a wide assortment of choices for this city pair.
Now that we have a table with all the city pairs, we can try to sort them by the maximum flight time:
* | stats max(CRSElapsedTime) as MaxFlightTime by UniqueCarrier, Origin, Dest | sort -MaxFlightTime
Because we are passing partial results to the next clause using a pipe, we created a new field, MaxFlightTime. The output of this search gives us a list of the longest flights independently of the airline, which is not exactly what we want. We want the longest flight for each individual airline. So let us try putting UniqueCarrier as an additional field of the sort command. This does not change the results because the only effect it has is to sort by airline for those entries that have the same flight time. The next logical step is to deduplicate the entries by airline. As the table has all the city pairs ordered from longest to shortest by airline the result should be what we want:
* | stats max(CRSElapsedTime) as MaxFlightTime by UniqueCarrier, Origin, Dest | sort -MaxFlightTime
| dedup UniqueCarrier
As we verify that this search produces the results we want, which can be seen in Figure 10-19, we notice that it can be optimized from four to three clauses, as the dedup command has a sort option:
* | stats max(CRSElapsedTime) as MaxFlightTime by UniqueCarrier, Origin, Dest | dedup UniqueCarrier
sortby –MaxFlightTime
So far we have worked with the maximum scheduled flight time, but we all know the reality is quite different, so let us throw in the actual flight time. We can do this by just adding it as part of the stats command and it should carry on throughout the rest of the sections:
* | stats max(CRSElapsedTime) as MaxSchedTime max(ActualElapsedTime) as MaxActualTime by UniqueCarrier, Origin, Dest
| dedup UniqueCarrier
sortby –MaxActualTime
Note that we changed the names of the new fields to be more representative of their contents and that we sorted based on the maximum actual flight time. One of the issues of working with results that contain multiple fields is that they do not lend themselves to being presented as colorful charts. We tried the various chart types offered by Splunk, but whereas both flight times showed very nicely the names of the city pairs were not to be found. Figure 10-20
contains the results of this search in tabular form, which we feel is the best choice to present these results.
Interestingly enough, the longest flight by both scheduled and actual time for September 2012 is from New York to Honolulu with Hawaiian Airlines. This is the flight at the top of the output. It confirms what we saw in the previous search, that even though Hawaiian Airlines has the longest flight, the majority of the flights are short, thus skewing the average flight time. One thing we need to bring to your attention with this search is that if any of the actual flight times would have been the same or less than the scheduled time the order of the results would have changed and this could affect the way the audience perceives the results.
So that we cover all the cases, we will also look at the shortest flights, both by scheduled and actual flight time. Some readers might think that the easiest and fastest way of doing this is by reversing the sort of the longest flight search we just did, but this will not work as you will obtain the shortest of the longest flights because you are sorting
Figure 10-19. Longest flight by airline by city pair
Chapter 10 ■ analyzing airlines, airports, Flights, and delays
by the maximum actual time, not the maximum scheduled time. If you reverse the sort by eliminating the dash (‘-‘) and changing the field name to MaxSchedTime it will work correctly; however, it will have the wrong column titles on the table, as they should be Minimum Scheduled Time and Minimum Actual Time. The modified search follows along with a partial output in Figure 10-21.
* | stats max(CRSElapsedTime) as MaxSchedTime max(ActualElapsedTime) as MaxActualTime by UniqueCarrier, Origin, Dest
| dedup UniqueCarrier sortby MaxSchedTime
The shortest scheduled flight is from San Jose to Monterey in California with 18 minutes, but it was scheduled only once during September 2012 and it was cancelled, so the actual time is null. As all the other flights do have actual flight times in the results we wonder about the effect of cancelled flights on the results. Reviewing other of the flights in the shortest group we find out that both minimum and maximum of either scheduled or actual flight times always result in a value. The San Jose to Monterey flight is a very special case, as there is no actual flight time and because there is only one flight the result of minimum or maximum actual flight time is null.