Accelerating Statistics

As we saw at the beginning of this chapter, running a simple search over all the flight data set can take as much three hours. So far we have used the strategy of testing our searches on the most recent month of flight data. If the search spanned over a month we tested with a couple of months and if the search spanned over years we tested with a couple years. Inevitably, the moment will come that your searches are ready to be used in prime time and some of them will cover all of the data; in our example, that is about 26 years of flight data. Having to wait three hours to get the results of a single search might be bearable, but when you have a dozen of searches and you have to run them on a regular basis, this is totally unacceptable.

Whereas report acceleration focuses on creating data summaries for specific searches or reports, Splunk has a mechanism referred to as tsidx that allows for the creation and use of general summaries of indexed data. Using tsidx is a two-step process. The first one is creating a general summary of the data using the tscollect command. Then you can use the tstats command to perform the statistical queries of your interest with an incredibly fast performance.

The tscollect command is very simple; you just have to define the name of the summary file where Splunk will store the general summary:

Chapter 10 ■ analyzing airlines, airports, Flights, and delays

As easy as it is, creating the general summary for all the 147 million events of the flight data ran for over 8 hours and it created a file about half the size of the flight data. Be aware that you cannot use the tscollect command to replace information within a summary file; however, you can append information to it. For example, once we get the flight data for October 2012 loaded into the main index, we can append it to summary_ fd like this:

earliest=10/1/2012:0:0:0 latest=11/1/2012:0:0:0 | tscollect namespace=summary_fd

The tstats command has a different syntax than the stats command. Let us illustrate how it works with an example using one of the searches we defined earlier in this chapter, the average flight time per airline, which using the stats

command is:

* | stats avg(CRSElapsedTime) as AverageFlightTime by UniqueCarrier

| eval AverageFlightTime=round(AverageFlightTime) | sort –AverageFlightTime

We will now replace the stats command with the equivalent tstats command:

| tstats avg(CRSElapsedTime) as AverageFlightTime from summary_fd

groupby UniqueCarrier

As you can see, we can still do calculations on fields using functions and create a new field to contain the results. Next we have to specify the name of the file that contains the general summary and finally the by argument is now

groupby. Notice that we no longer have a star (*) at the beginning of the search. This is because you are not doing a search and passing the events on to the tstats command. We still have the pipe before the command because we have to tell Splunk that this is an actual command and not an attribute of the search command. The rest of the search, the

eval and sort, remains the same.

How fast is it? When we run this search on the month of September 2012 using good old stats it takes about 15 seconds. Using tstats over all the 26 years of flight data, it takes about 30 seconds. It seems that investing more than eight hours in creating the general summary and using the additional 8.6GB of disk space are well worth it.

One of the down sides of tstats is that not all statistical functions are available. At the moment of this writing the functions are limited to count, sum, sumsq, distinct, avg, and stdev. However, Splunk offers a way to bridge the tstats

command with the stats command to gain access to those functions that are not available. It is a bit quirky, but it works. Let us use the search we did to count the airlines by year:

* | stats dc(UniqueCarrier) by Year

The equivalent search using tstats is the following:

| tstats prestats=t dc(UniqueCarrier) from summary_fd groupby Year | stats dc(UniqueCarrier) by Year

Just to avoid any confusion, the distinct function available in tstats lists the distinct or unique values of a field just like the values function, which is different than the distinct_count, or dc function that calculates a count of the distinct

said, it’s a bit bizarre, but at least we do have a mechanism to overcome these limitations of tstats. The performance improvement for this search is mind-blowing; the original search over the 147,122,177 events using stats took close to three hours. With tstats it took 15 seconds!

The groupby clause in tstats can handle multiple fields. We will simplify a search we used earlier to find the longest flights by airline to illustrate this. The original search was:

* | stats max(CRSElapsedTime) as MaxSchedTime max(ActualElapsedTime) as MaxActualTime by UniqueCarrier, Origin, Dest

| dedup UniqueCarrier

sortby –MaxActualTime

Changing it to use tstats is pretty straightforward, very much like the other examples:

| tstats prestats=t

max(CRSElapsedTime), max(ActualElapsedTime) from summary_fd

groupby UniqueCarrier, Origin, Dest | stats max(CRSElapsedTime) as MaxSchedTime, max(ActualElapsedTime) as MaxActualTime by UniqueCarrier, Origin, Dest

| dedup UniqueCarrier sortby –MaxSchedTime

Because the max function is not a supported by tstats, we bridge it over to stats by setting the prestats argument to true. As stats is the command that will actually be calculating the maximum flight times it’s there where we assign the results to a new field. We could have typed it in the tstats command, but it would not have any effect. For didactical purposes, we have changed the sort field to the maximum scheduled flight time. Running the normal search on the flight data of September 2012 takes about three seconds. The accelerated search over all the 26 years takes a little over two minutes. In Figure 10-33 you can see results of the search with the modified sort.

Chapter 10 ■analyzing airlines, airports, Flights, and delays

The reason we did this is to show you the strange numbers we get on almost all the rows under the maximum scheduled time column. These are obviously wrong as 1,865 minutes—about 31 hours—is totally unrealistic for a commercial flight. This is confirmed by the fact that the maximum actual flight time for that same flight over the last 26 years is 477 minutes, or about 8 hours, which is much more realistic number for a flight between Guam and Honolulu. We need to find more information about this. First we want to know if this is a regular error or it only happens once. For that we do the following simple search, which produces the output found in Figure 10-34.

UniqueCarrier=UA CRSElapsedTime=1865

This search returns only one event, but we still do not know if it is an isolated error. For this we search for any flight in that route that has a scheduled time bigger than the biggest actual flight time over the 26 years:

UniqueCarrier=UA Origin=GUM Dest=HNL | where CRSElapsedTime>477

This search only returns the same event as the previous one, so this is an isolated incident. We suspect there was some data entry problem for United Flight 200 on January 25, 2012. This last search allows us to explain some performance tips in searches. Splunk is quite slow when it has to read a large amount of events from raw data, therefore the reason for tsidx to exist. However, Splunk is extremely fast finding specific items in events, even more so when they are key value pairs. The above search took about one second for all the 147 million events. The first section of the search zooms in directly into the required events by looking directly into the index table Splunk keeps internally instead of reading every event. The second clause applies the conditional, which tends to be a more expensive operation but only on those events that resulted from the first search. Alternatively, we could have formulated the search as:

* | where UniqueCarrier=UA AND Origin=GUM AND Dest=HNL AND CRSElapsedTime>477

This would have evaluated the conditional for every single raw event and it probably would have taken longer than three hours. The difference is matching 336 events that contain United, Guam and Honolulu, and then evaluate if the scheduled time is greater than 477 minutes against evaluating if the airline is United and the origin Guam and the destination Honolulu and the scheduled time greater than 477 for every single one of the 147+ million events.

For completeness, we did a quick check with the anomaly for American Airlines and found two events that have 1,613 and 1,517 minutes for scheduled flight time, one in 1993 and another one in 1991. Obviously, these are isolated incidents, but they could potentially affect our analysis by incorrectly skewing results.

One way to solve this problem is by erasing those events using the delete command. This command does not actually delete events; it just marks them so that they are not used in searches. Of course, this will not affect a tsidx summary that was created before the events were deleted. An example of one way that you could delete specific events presenting the anomalous scheduled flight times is:

You must be extremely careful using this command, as far as we know there is no way to undelete events. Splunk is quite sensitive about this command and by default no user, not even the admin has the capability to delete events. They suggest you create a special user for when you will delete events.

The final example we have shows how to deal with multiple functions within a single stats command that assigns the results to new fields, and how to handle conditionals. For this, we will use the Arrival Delays by Airline search we built earlier in this chapter:

* | stats count(ArrDel15) as Total,

count(eval(ArrDel15=0)) as OnTime, count(eval(ArrDel15=1)) as Delayed by UniqueCarrier | eval PCTOnTime=round(OnTime/Total*100,2) | eval PCTDelayed=round(Delayed/Total*100,2) | sort - PCTDelayed

The first issue that we see in this search is that it uses the eval function within the count function. Whereas count

is supported by the tstats command, eval is not. The way around this is using the where clause of the tstats command, but the problem here is that there can be only one clause, so the logic in the stats command will have to be broken down in many parts to make it work with tstats. We start by obtaining the count of delayed flights:

| tstats count as Delayed from summary_fd where ArrDel15=1 groupby UniqueCarrier

The where argument is used in an equivalent manner to the way the eval function and the above search ends up counting all the delayed flights by airline. Be aware that you must follow the order of the from, where and groupby

clauses for the tstats command. You will get very misleading messages when you do not do so and will spent a lot of time trying to understand what is happening. Next we need another search to find either the total number of flights by airline or the on-time flights:

| tstats count as Total from summary_fd groupby UniqueCarrier

This search is quite simple and all we have left is joining both tstat commands, calculate the percentages and sort appropriately. The final search looks like this:

| tstats count as Delayed from summary_fd where ArrDel15=1 groupby UniqueCarrier | join

[

| tstats count as Total from summary_fd groupby UniqueCarrier ] | eval PCTDelayed=round(Delayed/Total*100,2) | eval PCTOnTime=round((Total-Delayed)/Total*100,2) | sort – PCTDelayed

Chapter 10 ■ analyzing airlines, airports, Flights, and delays

Essentially we have completely reformulated the search to work around the limitations of the tstats command and achieved the same results. Whereas in the original search we counted in one single command the total, on-time and delayed flights, in the new search we broke the counts of the total and delayed flights in two separate tstats

commands, which were then combined using the join command, as we did in the Comparison of Delays search earlier in this chapter. The on-time percentage is based on the subtraction of the total and delayed flights counts. The only functional difference with the original search is that we eliminated the count results using the fields command and left only the percentages. Using a 100% stacked bar chart provides a good visualization of the aggregated arrival delays over the last 26 years, which can be seen in Figure 10-35.

As expected with the tsidx mechanism, the performance difference is huge. The original search using only the month of September 2012 ran in 10 seconds, whereas using the reformulated search it took 15 seconds to go over the 147+ million events.

Summary

In this chapter, we have gone through a typical data analysis session using the Airline On-Timer Performance data, during which we learned how to use some of the commands available in Splunk. We also discussed various visualization options, which are dependent on the results of the searches and the information that is being conveyed to the consumers of the analysis. Finally, we went over some performance tips on searches and reports, which can make amazingly big differences by dramatically cutting the response times of searches.

Analyzing a Specific Flight Over

In document r bdaus pdf (Page 189-195)