The second option for mapping the desired fields of the flight data into a timestamp is modifying Splunk’s timestamp processor. Splunk automatically recognizes and extracts most of the obvious timestamps. It does this with a set of predefined regular expressions that can be found in a file called datetime.xml in Splunk’s etc directory.
Using this option we will have to write a new regular expression that allows us to map columns 6 and 30 as the new timestamp. There are many different regular expressions that can be used to do this mapping. In this case we will do it in a similar fashion as we did with the AWK script that moved the columns around: we will count commas. As before, we will have to consider that column 16 (OriginCityName) and column 25 (DestCityName) always contain the city name and the state abbreviation separated by a comma.
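The idea is to skip the first five columns, capture the date, then skip the next 26 commas and capture the time. Those 26 commas are the one that ends column 6, one for each of columns 7 through 29, and the two extra commas inside the city names. The regular expression we used is: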
(?:[^,]*,){5}(\d+)-(\d+)-(\d+)(?:[^,]*,){26}"(\d\d)(\d\d)"
Without making this a regular expression tutorial, we will explain what this strange combination of characters means. You can find various regular expression tutorials on the Internet by doing a search with your favorite search engine. In the first part, (?:[^,]*,), the enclosing parentheses mean that we are grouping this part of the regular expression. Groups are remembered for future reference, but this is quite expensive in processing costs. Because we are only using groups in this part of the regexp to skip fields, we use the ?: characters, which are a special directive telling the regular expression processor not to remember this group, which will speed up the processing.
Square brackets ([]) are used to define character sets, which tell the regular expression engine to match only one out of all the characters within that set. When a caret symbol (^) is used as the first character within a set, it means that the set is negated. In this case [^,] matches any character except for a comma. Please note that outside a character set the caret symbol means the beginning of the line. The star (*) that follows it is a repetition directive that means zero or more times. Finally, the comma is a literal comma, which has to be matched. Thus (?:[^,]*,) means: match any characters except for a comma, zero or more times, followed by a comma, and do not remember this group for future reference. Groups can be repeated by enclosing the number of times in curly brackets ({}). In this case we repeat this first group five times.
In the next part, (\d+)-(\d+)-(\d+) we capture the date by defining these three groups. \d is a shorthand for the [0-9] character set, which describes all the digits. The plus sign (+) is a repetition directive that means one or more times. The dashes between groups have to be matched. Notice that these groups do not include the ?: directive; thus, they will be remembered so we can reference them later to extract the date.
The fourth part of the regexp skips the next 26 fields, and then we capture the time as two distinct groups, hour and minutes, enclosed by double quotes (").
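To see why the second repetition count is 26 rather than the number of columns in between, the following Python sketch applies the same regular expression to a made-up record. The sample line, the placeholder values c1 through c30, and the helper name make_sample_line are assumptions for illustration only; they are not real flight data, and Python's re module stands in for Splunk's own regular expression engine.

```python
import csv
import re

# The regular expression from the text, unchanged.
TIMESTAMP_RE = re.compile(
    r'(?:[^,]*,){5}(\d+)-(\d+)-(\d+)(?:[^,]*,){26}"(\d\d)(\d\d)"'
)

def make_sample_line():
    """Build a made-up 30-column record (hypothetical values, not real data)."""
    cols = [f"c{i}" for i in range(1, 31)]
    cols[5] = "2012-01-15"            # column 6: the flight date
    cols[15] = '"New York, NY"'       # column 16: city name with an embedded comma
    cols[24] = '"Los Angeles, CA"'    # column 25: city name with an embedded comma
    cols[29] = '"0930"'               # column 30: the time, in double quotes
    return ",".join(cols)

line = make_sample_line()
real_columns = next(csv.reader([line]))   # honors the double quotes
naive_pieces = line.split(",")            # counts every comma, as the regexp does
match = TIMESTAMP_RE.search(line)

print(len(real_columns), len(naive_pieces))  # 30 32
print(match.groups())                        # ('2012', '01', '15', '09', '30')
```

The two extra pieces in the naive split come from the commas inside the quoted city names, which is exactly what the comma-counting regular expression has to allow for.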
Using this regular expression, we create a new timestamp processor:
<datetime>
  <define name="flightdata_csv_timestamp" extract="year, month, day, hour, minute">
    <text><![CDATA[(?:[^,]*,){5}(\d+)-(\d+)-(\d+)(?:[^,]*,){26}"(\d\d)(\d\d)"]]></text>
  </define>
  <timePatterns>
    <use name="flightdata_csv_timestamp"/>
  </timePatterns>
  <datePatterns>
    <use name="flightdata_csv_timestamp"/>
  </datePatterns>
</datetime>
Here we define a timestamp processor called flightdata_csv_timestamp, which extracts the year, month, day, hour, and minutes from the regexp groups we specified it should remember, in that specific order. The statements that follow specify that this processor will be used to process both time and date patterns. Following best practices, instead of adding this XML code to the datetime.xml file, we create a separate file we call datetime_flightdata.xml, which can be found in the download package of the book.
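The pairing of the extract attribute with the remembered groups can be illustrated with a short Python sketch. It reads the definition with the standard xml.etree module, matches a made-up event, and assigns each captured group to the name in the same position. This is only an approximation of the idea under stated assumptions: it is not how Splunk implements its timestamp processor, and the function name extract_timestamp and the sample event are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# The definition from the text, kept as a string for this illustration.
DATETIME_XML = r'''<datetime>
<define name="flightdata_csv_timestamp" extract="year, month, day, hour, minute">
<text><![CDATA[(?:[^,]*,){5}(\d+)-(\d+)-(\d+)(?:[^,]*,){26}"(\d\d)(\d\d)"]]></text>
</define>
<timePatterns><use name="flightdata_csv_timestamp"/></timePatterns>
<datePatterns><use name="flightdata_csv_timestamp"/></datePatterns>
</datetime>'''

def extract_timestamp(xml_text, event):
    """Pair each captured group with the name at the same position in 'extract'."""
    define = ET.fromstring(xml_text).find("define")
    names = [name.strip() for name in define.get("extract").split(",")]
    pattern = re.compile(define.find("text").text)
    m = pattern.search(event)
    if m is None:
        return None
    parts = dict(zip(names, (int(g) for g in m.groups())))
    return datetime(parts["year"], parts["month"], parts["day"],
                    parts["hour"], parts["minute"])

# Made-up 30-column event (placeholder values, not real flight data).
cols = [f"c{i}" for i in range(1, 31)]
cols[5] = "2012-01-15"
cols[15] = '"New York, NY"'
cols[24] = '"Los Angeles, CA"'
cols[29] = '"0930"'
event = ",".join(cols)

timestamp = extract_timestamp(DATETIME_XML, event)
print(timestamp)  # 2012-01-15 09:30:00
```

Because the names are assigned purely by position, reordering the groups in the regular expression without reordering the extract attribute would silently produce a wrong timestamp.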
Now that we have defined a new timestamp processor, the next step is to associate it with a source type. This time, instead of using the user interface, we will directly work with the configuration file. In Splunk’s etc/system/local directory we modify the existing props.conf file, and add the following stanza for a source type called FD_Source2:
[FD_Source2]
DATETIME_CONFIG = /etc/datetime_flightdata.xml
MAX_TIMESTAMP_LOOKAHEAD = 220
SHOULD_LINEMERGE = false
CHECK_FOR_HEADER = true
The first attribute specifies the file that contains the timestamp processor. Note that the filename is relative to the directory where Splunk is installed. The second attribute specifies how many characters into an event Splunk should look for the timestamp: in our case, how far into the event column 30 can be. After reviewing the flight data we estimated that 150 characters would cover all the cases, but we decided to increase the limit to 220 just to be sure. Please note that if this number is too small, Splunk can miss the timestamp altogether.
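One way to arrive at such an estimate is to measure where the timestamp ends in each record and take the largest value. The Python sketch below does this for two made-up lines. It assumes the limit is counted from the start of the event to the end of the matched time field; the function name required_lookahead and the sample records are hypothetical, and in practice the measurement would run over the actual CSV files.

```python
import re

# The regular expression from the timestamp processor.
TIMESTAMP_RE = re.compile(
    r'(?:[^,]*,){5}(\d+)-(\d+)-(\d+)(?:[^,]*,){26}"(\d\d)(\d\d)"'
)

def required_lookahead(lines):
    """Return the largest offset at which a timestamp match ends (0 if none match)."""
    longest = 0
    for line in lines:
        m = TIMESTAMP_RE.search(line)
        if m:
            longest = max(longest, m.end())
    return longest

def make_sample_line(origin, dest):
    """Build a made-up 35-column record (placeholder values, not real data)."""
    cols = [f"c{i}" for i in range(1, 36)]
    cols[5] = "2012-01-15"       # column 6: the flight date
    cols[15] = f'"{origin}"'     # column 16: origin city, with embedded comma
    cols[24] = f'"{dest}"'       # column 25: destination city, with embedded comma
    cols[29] = '"0930"'          # column 30: the time
    return ",".join(cols)

samples = [
    make_sample_line("Reno, NV", "Boise, ID"),
    make_sample_line("Dallas/Fort Worth, TX", "Minneapolis, MN"),
]
needed = required_lookahead(samples)
print(needed, needed <= 220)
```

Longer city names push the time field further into the line, which is why a margin above the measured maximum is a sensible choice.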
The line merge attribute is related to the way Splunk breaks the lines. As mentioned earlier, the default is to break at the timestamp. If there is no timestamp, it will create a single event that contains all the events, or flight records in our case. By setting this attribute to false, the behavior is that Splunk will create one event for every single line, that is, it will break an event where there is a line break. The final attribute specifies that Splunk should get the field names from the header line of the CSV files.
Now that we have defined the source type that uses the new timestamp processor, we can test it out. First, we delete all the events we indexed before using the CLI clean command. Then, with the user interface we add a new file using the preview option. Here we specify to use our new source type FD_Source2, which presents a preview of the events with the correct timestamp except for the header line.
After indexing the data, we ran the same set of quick tests we did for the preprocessing option and verified that this method works correctly.