One point that we stress throughout this book is the need to understand the data set before you process and analyze it; that is, to get intimately acquainted with the data you will work with. In this chapter we take the first step by explaining the combined access log format that we used to generate the sample data in Chapter 2.
Log files are generated by almost all kinds of applications and servers, whether they are end-user applications, web servers, or complex middleware platforms that serve as infrastructure for the applications used by consumers or business users. Operating systems and firmware also write huge amounts of raw data to log files. The challenge lies in understanding, analyzing, and mining that raw data and making sense of it.
Combined access logs generated by web servers such as Apache or Microsoft IIS provide information about activity, performance, and problems that are happening, whether intermittently or continuously. These logs contain information about all of the requests processed by the server. Both Apache and IIS allow customization of the combined access log format, which is commonly used and well understood by many log analysis applications that interpret and process the entries in these log files. The log entries produced in the combined access log format look like this:
127.0.0.1 - JohnDoe [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.google.com" "Opera/9.20 (Windows NT 6.0; U; en)"
Table 3-1. Description of fields in combined access log

| Field | Description |
| --- | --- |
| 127.0.0.1 | The IP address of the client (the machine, host, or proxy server) that made the HTTP request to access a web application or an individual web page. The value in this field may also be a hostname. |
| - | This field is used to identify the client making the HTTP request. Because the contents of this field are highly unreliable, a hyphen is typically used, which indicates that the information is not available. |
| JohnDoe | The user id of the user who is requesting the web page or application. |
| [10/Oct/2000:13:55:36 -0700] | The timestamp of when the server finished processing the request. The format can be controlled using web server settings. |
| "GET /apache_pb.gif HTTP/1.0" | The request line received from the client. It shows the method (in this example GET), the resource that the client requested (in this case /apache_pb.gif), and the protocol used (in this case HTTP/1.0). |
| 200 | The status code that the server sends back to the client. Status codes are very important because they tell whether the request from the client was fulfilled successfully or failed, in which case some action needs to be taken. Here, 200 indicates that the request was successful. |
| 2326 | The size of the data returned to the client. In this case 2326 bytes were sent back. If no content was returned to the client, this value is a hyphen ("-"). |
| "http://www.google.com" | The referrer field, which shows where the request was referred from. You may see web site URLs such as http://www.google.com, http://www.yahoo.com, or http://www.bing.com as values. Referrer information helps web sites and online applications see how users arrive at the site, and it can be used to decide where online advertising dollars should be spent. Note that the HTTP header itself is spelled Referer, with one "r" fewer than the English word; the misspelling originated in the original proposal for the HTTP specification and has been retained. In browsers such as Chrome, where users can use incognito mode or have referrers disabled, the values in this field will not be accurate. In HTML5, the user agent can be instructed not to send referrer information. |
| "Opera/9.20 (Windows NT 6.0; U; en)" | The user-agent field, which holds the information that the client browser reports about itself. A value such as "Opera/9.20 (Windows NT 6.0; U; en)" means that the request came from an Opera browser running on Windows NT 6.0 (that is, Windows Vista or Windows Server 2008). User-agent information helps to optimize web sites and web applications and to cater for requests coming from smaller form factor devices such as the iPad and mobile phones. |
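The field layout described in Table 3-1 maps naturally onto a regular expression with one named group per field. The following Python sketch is only an illustration and not part of the tooling used in this book; the names COMBINED_RE and parse_combined are hypothetical, and the timestamp handling assumes the default layout shown in the example entry.

```python
import re
from datetime import datetime

# Hypothetical pattern: one named group per field of the combined format.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for one combined-format entry, or None."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A hyphen means no content was returned; treat it as zero bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # Assumes the default layout: day/month/year:time followed by a UTC offset.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    # Split the request line into method, resource, and protocol when possible.
    parts = fields["request"].split(" ", 2)
    if len(parts) == 3:
        fields["method"], fields["resource"], fields["protocol"] = parts
    return fields


line = ('127.0.0.1 - JohnDoe [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.google.com" "Opera/9.20 (Windows NT 6.0; U; en)"')
entry = parse_combined(line)
print(entry["host"], entry["user"], entry["status"], entry["size"])
print(entry["method"], entry["resource"], entry["protocol"])
print(entry["time"].isoformat())
```

Because web servers allow the log format to be customized, a pattern like this should be checked against a few real entries before it is relied on.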
Chapter 3 ■ Processing and Analyzing the Data
Now let us look at some of the sample log entries that we generated in Chapter 2 for MyGizmoStore.com. Here are sample entries from the /opt/log/BigDBBook-www1/access.log file. Notice that the entries contain different status codes as well as different user agents (browsers).
196.65.184.6 - - [28/Dec/2012:06:54:46] "GET /product.screen?productId=CA-NY-99&JSESSIONID=SD5SL8FF8ADFF4974 HTTP 1.1" 200 992 "http://www.bing.com" "Opera/9.20 (Windows NT 6.0; U; en)" 597
92.189.220.86 - - [29/Dec/2012:02:58:28] "GET /cart.do?action=purchase&itemId=HYD-2&JSESSIONID=SD2SL1FF4ADFF5176 HTTP 1.1" 500 1058 "http://www.MyGizmoStore.com/oldlink?itemId=HYD-2" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 604
189.228.151.119 - - [30/Dec/2012:18:18:50] "GET /product.screen?productId=8675309&JSESSIONID=SD6SL9FF2ADFF6808 HTTP 1.1" 404 3577 "http://www.MyGizmoStore.com/product.screen?productId=CA-NY-99" "Opera/9.01 (Windows NT 5.1; U; en)" 916
218.123.191.148 - - [31/Dec/2012:04:28:45] "GET /category.screen?categoryId=BLUE GIZMOS&JSESSIONID=SD0SL1FF1ADFF7226 HTTP 1.1" 500 2992 "http://www.MyGizmoStore.com/category.screen?categoryId=BLUE GIZMOS" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.38 Safari/533.4" 928
78.65.68.244 - - [31/Dec/2012:02:22:40] "GET /category.screen?categoryId=ORANGE WATCHMACALLITS&JSESSIONID=SD1SL5FF9ADFF7146 HTTP 1.1" 200 2120 "http://www.bing.com" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" 338
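To make the variety of status codes and user agents in these sample entries concrete, the short Python sketch below tallies them. It is only an illustration under stated assumptions: the name ENTRY_RE is hypothetical, the pattern matches from the request line through the user agent, and the trailing number at the end of each entry, which is not described here, is simply ignored.

```python
import re
from collections import Counter

# Hypothetical pattern: request line, status, size, referrer, and user agent.
# Leading fields and the undocumented trailing number are not captured.
ENTRY_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

sample_entries = [
    '196.65.184.6 - - [28/Dec/2012:06:54:46] '
    '"GET /product.screen?productId=CA-NY-99&JSESSIONID=SD5SL8FF8ADFF4974 HTTP 1.1" '
    '200 992 "http://www.bing.com" "Opera/9.20 (Windows NT 6.0; U; en)" 597',
    '92.189.220.86 - - [29/Dec/2012:02:58:28] '
    '"GET /cart.do?action=purchase&itemId=HYD-2&JSESSIONID=SD2SL1FF4ADFF5176 HTTP 1.1" '
    '500 1058 "http://www.MyGizmoStore.com/oldlink?itemId=HYD-2" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 604',
    '189.228.151.119 - - [30/Dec/2012:18:18:50] '
    '"GET /product.screen?productId=8675309&JSESSIONID=SD6SL9FF2ADFF6808 HTTP 1.1" '
    '404 3577 "http://www.MyGizmoStore.com/product.screen?productId=CA-NY-99" '
    '"Opera/9.01 (Windows NT 5.1; U; en)" 916',
    '218.123.191.148 - - [31/Dec/2012:04:28:45] '
    '"GET /category.screen?categoryId=BLUE GIZMOS&JSESSIONID=SD0SL1FF1ADFF7226 HTTP 1.1" '
    '500 2992 "http://www.MyGizmoStore.com/category.screen?categoryId=BLUE GIZMOS" '
    '"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-US) AppleWebKit/533.4 '
    '(KHTML, like Gecko) Chrome/5.0.375.38 Safari/533.4" 928',
    '78.65.68.244 - - [31/Dec/2012:02:22:40] '
    '"GET /category.screen?categoryId=ORANGE WATCHMACALLITS'
    '&JSESSIONID=SD1SL5FF9ADFF7146 HTTP 1.1" '
    '200 2120 "http://www.bing.com" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" 338',
]

status_counts = Counter()
user_agents = set()
for line in sample_entries:
    match = ENTRY_RE.search(line)
    if match is None:
        continue  # skip lines that do not fit the expected layout
    status_counts[match["status"]] += 1
    user_agents.add(match["user_agent"])

# Print each status code with the number of entries that returned it.
for status, count in sorted(status_counts.items()):
    print(status, count)
print("distinct user agents:", len(user_agents))
```

For the five entries shown, this reports two requests with status 200, one with 404, and two with 500, across five distinct user-agent strings.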