HTTP-log Documentation
Release 0.1
deterralba
Jun 25, 2017
Contents
1 Forewords
2 Let’s go to the point!
3 Contents:
3.1 Overview
3.2 The software architecture
3.3 My choices and what could be improved
3.4 The different modules
4 Indices and tables
Python Module Index
CHAPTER 1
Forewords
Dear reader,
This is the cherished documentation of my httplog program. I hope you will like it; I tried to make it as smooth as possible.
Here is a link to my GitHub repository, from where you can download the program.
CHAPTER 2
Let’s go to the point!
If you are in a hurry, you just need:
1. to have Python 2.7 installed,
2. to go to the source folder in your favorite terminal and type httplog ../data/sim_config -s
3. enjoy!
What is happening: the command launches a simulation of the program. A log file, log/simulated_log, is actively written by a dedicated part of the program and read at the same time by the monitoring program. Every 10s you will see a summary of the observed traffic, with alerts when it goes up, and down, and up, and down...
CTRL + C will (cleanly) end the program.
Go to the Overview page if you want to know how to use the program!
CHAPTER 3
Contents:
You will find here:
• A simple user manual with a quick Overview
• A description of The software architecture and the reasons behind My choices and what could be improved
• A short description of The different modules, with links towards:
• The usual doc describing all the functions etc.
Overview
What is this program?
The goal of HTTP-log is to monitor an actively written HTTP access log, typically on a LAMP server. The following block is an example of HTTP access log lines:
123.12.45.78 - - [22/Dec/2015:03:31:01 +0000] "OPTIONS /page2/blog1/ HTTP/1.1" 200 170
123.12.45.78 - - [22/Dec/2015:03:31:01 +0000] "TRACE /page2/index.html HTTP/1.1" 200 293
123.12.45.78 - - [22/Dec/2015:03:31:01 +0000] "GET /blog3/picture.png HTTP/1.1" 200 417
123.12.45.78 - - [22/Dec/2015:03:31:01 +0000] "GET /page1/page4/picture.png HTTP/1.1" 200 778
123.12.45.78 - - [22/Dec/2015:03:31:01 +0000] "TRACE /archive4/index.html HTTP/1.1" 200 965
You can find a description of what I had to do here: subject.
This is a typical output:
================================================================================
Welcome! HTTP-log Displayer is now running
LogSimulator is started - LogReader is started - Statistician is started
================================================================================
16:35:47 - Most visited section in the past 10s is '/page1' with 837 hits.
Total hits: 9700,
Total sent bytes: 4835489, i.e. 4.611 MB.
16:35:57 - Most visited section in the past 10s is '/page2' with 870 hits.
Total hits: 10000,
Total sent bytes: 4992571, i.e. 4.761 MB.
///////////////// ALERT \\\\\\\\\\\\\\\\\
High traffic detected: 12/22/15 16:36:02: outflow 0.957MB/s
16:36:07 - Most visited section in the past 10s is '/page2' with 1717 hits.
Total hits: 20000,
Total sent bytes: 9949869, i.e. 9.489 MB.
\\\\\\\\\\\\\\\ ALERT OVER ///////////////
End of alert: 12/22/15 16:36:07: outflow 0.213MB/s
Terminating the program, waiting for the threads to end...
Program correctly ended!
101 User Manual
You can use the httplog command in a terminal opened in the source folder. Type httplog -h to see the help.
• If you do have an HTTP access log being written on your server, type httplog path (where path is its absolute path, or its path relative to httplog) to start monitoring it.
• If you don’t have an HTTP access log being written right now, type httplog path -s to start a simulation.
path should then be the path to a simulation config file. You can find two of them in the data folder: sim_config and short_config. sim_config gently explains what it contains.
Arguments
The path argument is mandatory. Use -l DEBUG to set the log level to DEBUG; the other levels are INFO, WARNING, ERROR and CRITICAL.
Use -s to indicate that you run a simulation, if you do.
See the --help for more information.
The alerts
The alert monitoring system uses moving averages, and raises an alert if a short moving average is above a long one by a certain threshold coefficient. This implies that if the traffic does not increase or decrease sharply, no alert (or end-alert) will be raised.
Note: The alert system is based on the real outflow (in bytes), not the number of requests received.
Note: You can change the alert monitoring parameters in httplog (for a command-line execution) and in playground.
Change the line AlertParam(short_median=12, long_median=120, threshold=1.5, time_resolution=10). The base unit of short_median and long_median is time_resolution: for example, long_median=2 with time_resolution=10 computes an outflow average over the last 2 * 10 = 20 seconds. An alert will be raised if short_outflow_average > long_outflow_average * threshold.
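The alert rule can be illustrated with a small sketch. This is not the program's actual code: the namedtuple stands in for the real AlertParam class, and the helpers moving_average and should_alert, as well as the list of per-period byte counts, are assumptions made for the example. Only the field names and the comparison come from the text.

```python
from collections import namedtuple

# Stand-in for the real AlertParam class, with the same four fields.
AlertParam = namedtuple(
    "AlertParam", "short_median long_median threshold time_resolution")


def moving_average(samples, window):
    """Average of the last `window` samples (fewer if history is short)."""
    recent = samples[-window:]
    return sum(recent) / float(len(recent)) if recent else 0.0


def should_alert(bytes_per_period, param):
    """Hypothetical helper: True if short average > long average * threshold.

    `bytes_per_period` holds one outflow sample per `time_resolution` seconds.
    """
    short_avg = moving_average(bytes_per_period, param.short_median)
    long_avg = moving_average(bytes_per_period, param.long_median)
    return short_avg > long_avg * param.threshold


param = AlertParam(short_median=2, long_median=6, threshold=1.5,
                   time_resolution=10)
steady = [100, 100, 100, 100, 100, 100]   # constant outflow
spike = [100, 100, 100, 100, 400, 400]    # sudden increase
print(should_alert(steady, param))  # False: 100 is not above 100 * 1.5
print(should_alert(spike, param))   # True: 400 is above 200 * 1.5
```

With these toy values the windows cover 2 * 10 = 20 seconds and 6 * 10 = 60 seconds.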
What are all those folders for?
There should be several folders in your HTTP-log folder. Let’s have a quick look:
• source: obviously where the source code is. Go there to play with the program.
• data: used to store some useful test or configuration files.
• log: the place where you will find the files generated by the program.
• documentation: where you are now. It contains all the documentation web pages and config files (for Sphinx).
I want to change your code
I recommend creating a virtual environment with Python 2.7 so that your own Python installation doesn’t conflict with the program requirements. First install virtualenv, then create your virtual environment, activate it, and finally install the dependencies listed in data/requirements.txt.
pip install virtualenv
virtualenv python27 -p /usr/bin/python2.7
source python27/bin/activate # or simply 'python27\Scripts\activate' if you are on Windows
pip install -r data/requirements.txt
We now have the same dev environment. You can use playground.py if you want to launch the program from your IDE, without command line arguments. Each module has an if __name__ == '__main__': block that allows you to start it and play with it.
And now what?
You could read my page about The software architecture to understand it!
The software architecture
Introduction: a simple solution
To implement what is asked in the subject, you need to carry out 3 main tasks:
1. Read continuously the HTTP log file
2. Keep some statistics on what you read
3. Print your stats and alerts in the console
So you could write a program that (1) reads a line of the log, (2) parses it, keeps a counter for each hit section and checks whether an alert is starting or ending, and finally (3) prints an alert or the stats if it is time.
Then you start again. And again, and again:
Thread 1: Task 1 -> Task 2 -> Task 3 -> Task 1 -> Task 2 -> Task 3 -> Task 1 -> etc...
Why a multi-threads structure
While you are calculating your statistics, you are not reading the log, nor printing them. If you need advanced stats, printing, or even parsing, you may be interested in doing something like this:
Thread 1    Thread 2    Thread 3
Task 1      Task 2      Task 3
Task 1      Task 2      Task 3
etc...      etc...      etc...
If you want low latency.
If you want to write a program that will be able to do more advanced things in the future.
If you want to use more than one core of your high-end 32-core processor.
If you think that maybe one day you will deal with several types of input or output, or several types of clients that will each ask for different statistics...
What you need is a program with several threads. It is a bit more complicated, but it is also much more challenging and interesting.
The threads awaken
Did I say there were 3 main tasks in the program? Because there are 4 threads in my program ;)
1. A reader.LogReader, that reads the input log file,
2. A statistician.Statistician, that calculates the statistics,
3. A display.Displayer, that displays the stats and the alerts, or writes them in output logs,
4. The main thread (usually in httplog) that manages the others.
If you use the simulation mode of the program, two other kinds of threads are created:
5. log_writer.LogWriter: writes the log lines in the log file at the speed you want, for as long as you want,
6. log_writer.LogSimulator: reads the simulation configuration file and manages the succession of log_writer.LogWriter threads it needs.
The different roles of the threads should be clearer now. Do not hesitate to contact me through GitHub if you have any suggestions!
Now let’s see how I chose to implement all these tasks and threads: My choices and what could be improved
My choices and what could be improved
“Python is love, python is life.”– Dalai Lama
Python
Python has several advantages:
• It allows you to quickly and easily write clean, not-so-slow programs
• It is already installed on every Linux server (and OSX, but is this relevant?)
• It is reliable
• I know it
I could have used faster languages like C++ or Go. That would also have been worth considering if I had needed several external libraries in the program, but I managed to use only the standard library, hence the installation is really dead simple. Libraries like py.test are used for the development but are not necessary for the deployment.
Python 2.7
“I used python 2.7, because it’s fun to learn deprecated technologies”
Ok, python 2.7 is not really deprecated, bugs are still corrected and it is still a great tool. Python 2.7 is used in professional environments, and it used to be more widely adopted than python 3, that is why I used it.
If you have an old Linux VM somewhere, you should still be able to use httplog on it, thanks to 2.7.
Libraries and OS features
No external library is necessary to use the program. You get my point: simpler installation, higher compatibility, fewer user tears.
I did not use any Unix-only features, even if it could have been more elegant to do so (who said polling?). Windows users, you are welcome: httplog will run for you too.
Implementation choices
Several threads
To summarise The software architecture page, several threads mean:
• faster
• better
• stronger
• more complicated code
• subtle problems with delicate solutions
For instance:
• When a critical error is raised in a child thread and the program must exit (e.g. an incorrect log path was given, hence no input), I use thread.interrupt_main to stop the main program, because the raised exception is not propagated to the parent thread. The parent thread then automatically asks the other threads to end, thanks to the atexit mechanism. Between two tasks, the child threads check whether their should_run attribute is still True; if it is not, they close the opened files and exit gracefully. This allows you to stop them simply with should_run = False and a little patience. Yes, it is more complicated than a single-threaded process, but it works.
• When you share data between threads, thread-safe mechanisms are needed. You don’t want a variable to be overwritten by another thread while you are using it! That is why the LogReader sends the lines it reads to the Statistician through a Queue, and why the Statistician uses a Lock when accessing its stats (to send them to the Displayer or to update them).
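The combination of a Queue, a Lock and a should_run flag can be sketched as follows. This is only an illustration of the pattern, not the program's classes: Worker, get_count and demo are hypothetical names, and the import fallback is there because the program targets Python 2.7 while the sketch also runs on Python 3.

```python
import threading

try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2.7


class Worker(threading.Thread):
    """Hypothetical consumer thread: counts the lines it receives."""

    def __init__(self, input_queue):
        threading.Thread.__init__(self)
        self.input_queue = input_queue
        self.should_run = True          # set to False to ask the thread to stop
        self.lock = threading.Lock()    # protects self.count
        self.count = 0

    def run(self):
        while self.should_run:
            try:
                # A short timeout lets the loop re-check should_run regularly.
                self.input_queue.get(timeout=0.05)
            except queue.Empty:
                continue
            with self.lock:
                self.count += 1
            self.input_queue.task_done()

    def get_count(self):
        with self.lock:
            return self.count


def demo(n_lines=100):
    """Send n_lines through the queue, then stop the worker cleanly."""
    q = queue.Queue()
    worker = Worker(q)
    worker.start()
    for i in range(n_lines):
        q.put("line%d" % i)
    q.join()                    # wait until every line has been processed
    worker.should_run = False   # the thread notices it between two tasks
    worker.join()
    return worker.get_count()
```

The short get() timeout is what keeps the "little patience" bounded: the worker never blocks on an empty queue for longer than that before checking should_run again.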
Reading the log
The reader thread uses a timed loop to read the log file: it reads up to the EOF using log.readline(), then waits for a certain time (defined by reader.sleeping_time) and starts again. A simple, cross-platform solution. No polling, no signal, just import io.
Note: When the log grows too fast, i.e. too many HTTP access lines are written each second for the reader to keep up, a mechanism makes the reader aware of it (it checks the last time it reached the EOF).
reader.LogReader sends a WARNING to the display.Displayer when this happens.
This mechanism could also be implemented for statistician.Statistician. It is already implemented for log_writer.LogWriter: it sends a WARNING to the displayer when not enough lines are written.
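A minimal sketch of such a read-sleep loop, with the "last time at EOF" check, could look like this. The function follow and its parameters on_line, on_warning, should_run and max_lag are hypothetical; only sleeping_time and the general idea come from the description above. The sketch ignores the case of a line that is only partially written when it is read.

```python
import io
import time


def follow(path, on_line, on_warning, should_run,
           sleeping_time=0.1, max_lag=1.0):
    """Call on_line for each line appended to `path` until should_run() is False.

    on_warning is called when the EOF has not been reached for more than
    `max_lag` seconds, i.e. when the reader cannot keep up with the writer.
    """
    with io.open(path, "r") as log:
        log.seek(0, io.SEEK_END)          # skip what is already in the file
        last_eof = time.time()
        while should_run():
            line = log.readline()
            if line:
                on_line(line.rstrip("\n"))
                lag = time.time() - last_eof
                if lag > max_lag:
                    on_warning("reader is late: no EOF for %.1fs" % lag)
                    last_eof = time.time()   # do not warn again for every line
            else:
                # Empty string: EOF for now, wait before trying again.
                last_eof = time.time()
                time.sleep(sleeping_time)
```

A caller would typically pass a function that reads a should_run attribute, so that another thread can stop the loop.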
Parsing the lines:
To parse the read lines, a regex is used. The input is a raw string, the output is a dictionary with the keys remotehost, remotelogname, authuser, date, request, status, bytes.
status and bytes are converted to int. If needed, the date can be converted to a datetime object (this is rather slow, hence disabled by default). Exceptions are raised and handled when an incorrect line is read, and a WARNING is sent to the displayer.
Note: Commented and empty lines are ignored.
Warning: Only valid W3C HTTP access lines can be read:
# remotehost remotelogname authuser [date] "request" status bytes
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
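As an illustration, a regex-based parser for this line format might be sketched as below. The pattern LINE_RE is an assumption written for this example, not the regex used by the program; the names parse_line and HTTPFormatError are borrowed from the reader module, but the bodies here are only a sketch. Lines where bytes is "-" are rejected by this simplified pattern.

```python
import re

# Assumed pattern for:
# remotehost remotelogname authuser [date] "request" status bytes
LINE_RE = re.compile(
    r'^(?P<remotehost>\S+) (?P<remotelogname>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)$')


class HTTPFormatError(Exception):
    """Raised when a line does not match the expected format."""


def parse_line(line):
    """Return a dict of the fields, with status and bytes converted to int."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise HTTPFormatError("unrecognised line: %r" % line)
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    return fields


example = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(parse_line(example)["request"])  # GET /apache_pb.gif HTTP/1.0
```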
Parsing the lines: optimisation
Who should parse the line: the Statistician or the LogReader? To answer that question, I implemented both alternatives and kept the faster one, the one that best divides the work between the threads. The LogReader now parses the lines.
Simple alerts
The alert monitoring system is working, but its parameters are not necessarily well tuned for your use.
Possible improvements
Optimisation
It is always possible to optimise the program, or to rewrite it in Go or C++. Note that most of the loop conditions are already optimised: they were tested and chosen to maximise the speed of the program.
Functions are defined before the loops to avoid if/else statements inside them (e.g. timeout_check in log_writer.LogWriter.run()), lists are generated once and for all (e.g. in log_writer.uniform_random_local_URL_maker()), etc.
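The "define the function before the loop" idea can be sketched like this. The name timeout_check and the convention that -1 means "no timeout" come from the documentation; make_timeout_check, run_loop and the injectable clock are hypothetical and exist only to make the example small and testable.

```python
import time


def make_timeout_check(timeout, start, clock=time.time):
    """Choose the loop condition once, before the loop starts."""
    if timeout == -1:
        return lambda: True                      # no time limit
    return lambda: clock() - start < timeout     # stop after `timeout` seconds


def run_loop(timeout, max_iterations, clock=time.time):
    """Count iterations until the time limit or `max_iterations` is reached."""
    timeout_check = make_timeout_check(timeout, clock(), clock)
    done = 0
    # No `if timeout == -1` inside the loop: the branch was resolved above.
    while done < max_iterations and timeout_check():
        done += 1
    return done
```

The loop body stays free of configuration tests; whether this is measurably faster depends on the workload and should be checked by timing it.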
Better stats and alerts
Alerts are not perfect; more efficient detection mechanisms could be written. Several types of alerts could be raised: for instance, we may want to distinguish “outflow alerts” (scale: sent bytes per second) from “request alerts” (scale: number of requests per second).
Printed stats are not incredibly thrilling either.
You might want to override statistician.Statistics.emergency(), statistician.Statistics.update_long_term(), and the corresponding methods in display.Displayer.
Displayer and UI/UX
The current UX is quite simple: a command line to start the program, a keyboard interrupt to end it, a few log files written, and that’s it.
Curses could be used to create a real interface (and break the Windows compatibility...). The process could be forked to become a background daemon (like httpd).
A user interface could be written with Qt or Tkinter (not so useful on a console, but you could code a remote displayer, or ... a monitoring web interface, start a startup, settle down in NYC etc.).
Unit Tests
The code coverage of my tests is a sensitive subject: I should write unit tests instead of if __name__ == '__main__' tests.
Clean the code
Not every block of code is beautiful... For instance, the global displayer in display works without obvious side effects, but it is more a trick than an elegant solution. The problem was “how to redirect all the ‘print’ calls to a special object method without having to explicitly give the object reference to everyone?”. If you want to talk about it, tell me!
A singleton pattern might be more appropriate.
Service optimisation
You have data. This is nice, let’s use it!
You could analyse your log file to detect which web pages generate the most outgoing traffic, and try to minimise their impact. You could also try to detect strange behaviours; there are several interesting uses one could think of for your logs.
Unixification
The program structure could be adapted to the Unix spirit of “small single-task tools”. For instance, the LogSimulator could be a separate program whose output is piped into the standard input of httplog. The output could be redirected with the > syntax. The program could also easily be run in the background with nohup httplog path &.
Log rotate
What happens when logrotate rotates the log? Nobody knows, this should be handled.
Bug fix
There are no known bugs, tell me if you find one!
What’s next?
Read The different modules description to see what is where, then take a look at playground to change the default parameters of the simulation.
The different modules
The core
• httplog is the “main thread module”, called when you type httplog path in the terminal. It parses the arguments, creates the child threads and manages them.
• reader contains the reader.LogReader thread that continuously reads the log file.
• statistician contains the statistician.Statistician thread that calculates the stats with a statistician.Statistics object.
• display contains the display.Displayer thread that displays stats and alerts on the console or in log files. display.Displayer contains a thread-safe logging and printing system through which every printing command goes.
The simulation module
• log_writer contains log_writer.LogSimulator and log_writer.LogWriter, two threads used to generate actively written HTTP access logs from simulation config files.
The other modules
• The test modules: they start with test_ and are used by py.test to test the source code. They need to be completed!
• The misc module, a.k.a. the miscellaneous module, used as a storage facility. You can ignore it.
Links
source
display module
class Displayer(statistician=None, display_period=10, console_print_program_log=True, write_program_log_file=True, log_level=30, debug=False)
Bases: threading.Thread
This thread object is responsible for the logging and the printing of the stats, the alerts and the log messages.
The log messages are messages generated by the different threads to inform the user of a peculiar event.
This thread can keep up to two output log files: one for the traffic alerts, and one for the log messages. The alert log is always used; the program log is optional and activated by default.
The log messages can also be printed on the console.
Note: Once it has been initiated (for instance with picasso = Displayer(debug=True)), its log() method can simply be called from any module with the following code:
import display as d
d.log(self, d.LogLevel.INFO, "This is my log message !")
This is possible because in __init__(), it defines a global displayer variable.
Variables
• statistician(Statistician) – A reference to the statistician that will be contacted to print the stats.
• display_period (int) – The stats will be printed with a period = display_period
• log_level (LogLevel) – The limit under which received log messages will simply be ignored in the log method
• console_print_program_log (bool) – If true, the received log messages will be printed in the console, if their log_level is above or equal to the log_level
• write_program_log_file (bool) – If true, the received log messages will be written in the program log file, if their log_level is above or equal to the log_level
• should_run (bool) – If False, the thread will shortly stop its operation. Used to cleanly end the program.
• registered_object (list of Thread) – The list of other threads, used for INFO log messages
• console_lock (Lock) – Lock used to make the print function thread-safe
• display_width (int) – The typical width of the terminal
• name (string) –
lock_print(*args, **kwargs)
Thread-safe print function.
log(sender, level, message)
Receives messages from all the threads and “thread-safely” prints them depending on their level. Writes them to the log file if needed.
Parameters
• sender (Object) – A reference to the sender
• level (LogLevel) – The log level of the message
• message (string) – The message
print_end_alert(alert_param, long_average, short_average)
Called when an alert is shut down, prints and/or logs it.
Note: Could be merged with print_new_alert
print_new_alert(alert_param, long_average, short_average)
Called when an alert is raised, prints and/or logs it.
Parameters
• alert_param (AlertParam) –
• long_average (float) –
• short_average (float) –
run()
Regularly prints the stats, and ends when should_run = False (normally after 0.1s max). Called when the thread is started.
stat_of_registered_object()
Returns a string describing the registered threads.
stat_string()
This is called when stats need to be printed.
Returns The formatted message
Return type string
class LogLevel
Stores the possible values of the log printing parameters log_level.
This is a substitute for an Enum object (not available in the Python 2 standard library).
level_name is used to get the log_level name from its value.
CRITICAL = 50
DEBUG = 10
ERROR = 40
INFO = 20
WARNING = 30
level_name = {40: u'ERROR', 10: u'DEBUG', 20: u'INFO', 50: u'CRITICAL', 30: u'WARNING'}
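A plain class with integer attributes is enough to play the role of an enum, and makes level filtering a simple comparison. The sketch below reuses the documented values; the helper format_if_enabled is hypothetical and only illustrates how a message below the configured log_level can be dropped.

```python
class LogLevel(object):
    """Plain class used as a stand-in for an enum (values as documented)."""
    DEBUG = 10
    INFO = 20
    WARNING = 30
    ERROR = 40
    CRITICAL = 50
    level_name = {10: "DEBUG", 20: "INFO", 30: "WARNING",
                  40: "ERROR", 50: "CRITICAL"}


def format_if_enabled(level, message, log_level=LogLevel.WARNING):
    """Hypothetical helper: formatted message, or None if it is filtered out."""
    if level < log_level:
        return None
    return "[%s] %s" % (LogLevel.level_name[level], message)


print(format_if_enabled(LogLevel.ERROR, "disk full"))   # [ERROR] disk full
print(format_if_enabled(LogLevel.INFO, "started"))      # None
```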
httplog module
READ_THE_SOURCE_CODE_FOR_THE_DOC_PLZ()
log_writer module
class LogSimulator(config_path)
Bases: threading.Thread
Simulates a real actively written HTTP access log: writes lines in a given log file at given speeds during given times. All these parameters are read from a given config file.
Note: See data/sim_config to understand how a config file should be written.
Variables
• config_path (string) – The path to the config file where the parameters and the command for the simulation are stored.
• log_w (LogWriter) – A reference to the present LogWriter
• name (string) –
• total_nb_of_lines_previously_written (int) –
• param_dict (dictionary) –
• should_run (bool) – If False, the thread will shortly stop its operation. Used to cleanly end the program
get_parameters()
Reads the config file and extracts the parameters, i.e. the lines with ‘=’. Sets LogSimulator.param_dict.
Returns A dictionary with the parameters as keys
Return type dictionary
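Extracting the "lines with ‘=’" can be sketched as below. The exact config syntax is described in data/sim_config and is not reproduced here, so everything in this example is an assumption: the function name parse_parameters, the treatment of ‘#’ lines as comments, and the sample keys and lines.

```python
def parse_parameters(lines):
    """Hypothetical helper: build a dict from the lines that contain '='."""
    params = {}
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comments are skipped.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


sample = ["# a comment", "pace = 300", "", "some command", "line_type=HTTP_fast"]
print(sorted(parse_parameters(sample).items()))
# [('line_type', 'HTTP_fast'), ('pace', '300')]
```

Values are kept as strings here; converting them to int or bool would be a separate step.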
run()
Reads the config file given, and executes the commands read with the read parameters.
started_first_writing()
Returns True if the LogWriter has started to write, else False
Return type bool
state()
Returns Describes the present thread state
Return type string
exception LogSimulatorConfigFileError
Bases: exceptions.Exception
Raised when the path or the format of the config file is incorrect
class LogWriter(log_path, line_type=u'HTTP_slow', pace=3000, timeout=-1, erase_first=True, is_simulated=False)
Bases: threading.Thread
Writes a given type of lines in the file given by the log_path at a given speed during a given time.
Warning: Time is the priority: if the I/O cannot keep up with the requested writing speed, the program will not write as many lines as asked, but it will end on time. A WARNING log message will be sent.
Variables
• log_path (string) – The path to the log file, may already exist and will be erased or may not exist and will be created (see erase_first)
• line_type (string) – The type of line that should be written in the log, can be:
– 'line': lineX will be written, where X is the number of the line written (fastest)
– 'HTTP_fast': an HTTP access line will be written, but only the HTTP request and the bytes will be random (slower)
– 'HTTP_slow': an HTTP access line will be written, everything will be random (slowest)
• pace (int) – The number of lines that should be written every second. If the IO stream cannot follow the pace, a log message is sent to the Displayer.
• timeout (int) – The number of seconds during which the log should be written, never stops if timeout == -1, minimum timeout is 1.
• erase_first (bool) – If True, the log_path is erased before being written (log file is opened in 'at' mode).
• should_run (bool) – Used to get out of the writing while loop when the user wants to stop the program.
• is_simulated (LogSimulator or None) – A reference to the LogSimulator if this is a simulation, or None. Used to update LogSimulator.total_number...
• nb_of_line_written (int) – The number of lines written by the LogWriter. Only updated every 100 lines!
• started_writing (bool) – Becomes True after the first flush(), i.e. after a first line has been written. NB: Could be deduced from nb_of_line_written...
• name (string) –
run()
Opens the output log file and writes n = pace log lines in it every second, as long as execution time < timeout.
Called when the thread is started.
generate_all_URL_possible(factor=2, max_depth=2)
Generates a list of all the possible local URLs with the given parameters
Note: number of sections = len(section) * factor
max_depth is the highest possible number of nested sections (/section/subsection/subsubsection/etc.): /archive2/blog0/archive2/page1/ -> depth = 4
Examples
• / <- min_depth without URL_end
• /index.html <- min_depth with URL_end
• /page0/
• /archive0/blog1/
• /blog0/picture.png
• /page0/archive1/archive2/index.html
• /archive2/archive0/archive2/archive2/picture.png <- max_depth = 4, with URL_end
random_HTTP_request(URL)
Parameters URL (string) – The URL that will be written in the HTTP request
Returns A random W3C valid HTTP request
Return type string
Examples
• "HEAD /index.html HTTP/1.1"
• "GET /page0/ HTTP/1.1"
• "DELETE /archive0/blog1/ HTTP/1.1"
random_log_line_maker(line_type, **kwargs)
Returns The right log line writer function
Return type function
Parameters line_type (string) – The type of line that should be generated, can be
• line: the string returned will be lineX\n where X is line_count given in keyword argument
• HTTP_fast: the string returned will be a static HTTP access log line with a random HTTP request, an accurate date, and random bytes
• HTTP_slow: the string returned will be a fully random HTTP access line
Keyword Arguments
• factor (int) – Parameter used in the uniform_random_local_URL_maker() function
• max_depth (int) – Parameter used in the uniform_random_local_URL_maker() function
uniform_random_local_URL_maker(factor=2, max_depth=2)
Returns a uniform_random_local_URL function that returns equiprobable local URLs. This is useful (I assure you) to avoid always getting the same most hit section in a simulation.
Note: number of sections = len(section) * factor
max_depth is the highest possible number of nested sections (/section/subsection/subsubsection/etc.): /archive2/blog0/archive2/page1/ -> depth = 4
Examples
• / <- min_depth without URL_end
• /index.html <- min_depth with URL_end
• /page0/
• /archive0/blog1/
• /blog0/picture.png
• /page0/archive1/archive2/index.html
• /archive2/archive0/archive2/archive2/picture.png <- max_depth = 4, with URL_end
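One way to obtain equiprobable URLs is to enumerate every possible URL once, then pick uniformly from that list. The sketch below follows this idea; it is not the module's code. The lists SECTIONS and URL_ENDS and the function names generate_all_urls and uniform_random_url_maker are assumptions, chosen to resemble the examples above.

```python
import itertools
import random

SECTIONS = ["page", "blog", "archive"]          # assumed base section names
URL_ENDS = ["", "index.html", "picture.png"]    # assumed endings ('' = none)


def generate_all_urls(factor=2, max_depth=2):
    """Enumerate every local URL with at most `max_depth` nested sections."""
    # number of sections = len(SECTIONS) * factor, e.g. page0, page1, blog0...
    names = ["%s%d" % (s, i) for s in SECTIONS for i in range(factor)]
    urls = []
    for depth in range(max_depth + 1):
        for path in itertools.product(names, repeat=depth):
            prefix = "/" + "".join(part + "/" for part in path)
            urls.extend(prefix + end for end in URL_ENDS)
    return urls


def uniform_random_url_maker(factor=2, max_depth=2, rng=random):
    """Return a function that picks uniformly among all possible URLs."""
    all_urls = generate_all_urls(factor, max_depth)   # generated only once
    return lambda: rng.choice(all_urls)


make_url = uniform_random_url_maker()
print(make_url())   # e.g. /blog0/picture.png
```

Building the list once keeps each call cheap, and a uniform choice over the full list avoids favouring shallow URLs the way a depth-by-depth random walk would.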
misc module
This is storage for currently unused pieces of code that used to be, or could one day be, useful.
EOF_reader(path)
Read a file until the EOF.
Warning: This is far too slow because tell() is bugged in Python 2.x!
class InputThread
Bases: threading.Thread
A thread that always waits for input and prints what it gets.
run()
class Monitor(input_queue)
Bases: threading.Thread
A thread used to demonstrate how a queue object works.
run()
random_local_URL(factor=2, max_depth=2)
Returns A random local URL
Return type string
Warning: The distribution of the returned URLs is not uniform! URLs with 0-depth (i.e. ‘/’, with or without a file name) are returned more often than longer URLs. Use uniform_random_local_URL_maker() to get a uniform_random_local_URL function.
Notes
number of sections = len(section) * factor
max_depth is the highest possible number of nested sections (/section/subsection/subsubsection/etc.): /archive2/blog0/archive2/page1/ -> depth = 4
Examples
• / <- min_depth without URL_end
• /index.html <- min_depth with URL_end
• /page0/
• /archive0/blog1/
• /blog0/picture.png
• /page0/archive1/archive2/index.html
• /archive2/archive0/archive2/archive2/form.php <- max_depth = 4, with URL_end
read_log(log_name)
A “two-liner” that reads the given log file. Fun.
Returns A list of all the non-empty lines that do not start with ‘#’
Return type list of strings
Note: The returned lines are .strip()-ed
playground module
READ_THE_SOURCE_CODE_FOR_THE_DOC_PLZ()
reader module
exception HTTPFormatError
Bases: exceptions.Exception
Raised when the HTTP access log line is not recognized
class LogReader(log_path, sleeping_time=0.1, parse=True)
Bases: threading.Thread
This thread object reads the given log file, and by default parses its lines. Then it sends them in a queue that is read by the Statistician.
To read the file, the LogReader reads the lines until the EOF, then waits for a given time: sleeping_time.
Variables
• log_path (string) – The path to the log file. The program is terminated if the log file cannot be opened.
• sleeping_time (float) – The time in seconds during which the program will sleep after the EOF.
• parse (bool) – If True, the LogReader will parse each read line with the parse_line() function before putting it in the Queue. If False, the Statistician will have to parse it itself.
• total_nb_of_line_read (int) – Counts the number of lines that have been read since the beginning, including the empty and commented lines.
• should_run (bool) – If False, the thread will shortly stop its operation. Used to cleanly end the program.
• output_queue (Queue) – The queue where the read lines will be put.
• name (string) – The name of the thread: ‘log reader thread’
run()
Opens the input log file at log_path, goes to the EOF, then tries to read new lines. If new lines are detected, sends them to the output_queue for the Statistician (parsed or not, depending on self.parse).
When EOF, waits for sleeping_time and starts again.
Note: There are two printing systems: sys1 and sys2, used to send log messages
• sys1 is used to print WARNING log messages when the LogReader is too slow: last_EOF is big
• sys2 is used to print DEBUG log messages with the number of lines read every second
state()
Returns Describes the present thread state
Return type string
get_section(request)
Return the section name from an HTTP request, or None if it is not a proper HTTP request
Examples
• GET /test/index/ HTTP => /test
• GET /te.st/index/ HTTP => /te.st
• GET /test/index.html HTTP => /test
• GET /test HTTP => /test
• GET /test.html HTTP => /
• GET / HTTP => /
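A function that reproduces these examples can be sketched as follows. It is written from the examples alone, so the rules it applies (three whitespace-separated parts, and a single path component containing a dot treated as a file at the root) are assumptions rather than the module's actual logic.

```python
def get_section(request):
    """Sketch: return the first path component of the requested URL.

    Returns None if the request is not of the form 'METHOD /path PROTOCOL'.
    """
    parts = request.split()
    if len(parts) != 3 or not parts[1].startswith("/"):
        return None
    segments = parts[1].split("/")[1:]   # '/test/index/' -> ['test', 'index', '']
    first = segments[0]
    if len(segments) == 1 and (first == "" or "." in first):
        return "/"                       # '/' itself, or a file at the root
    return "/" + first


print(get_section("GET /test/index.html HTTP"))   # /test
print(get_section("GET /test.html HTTP"))         # /
```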
parse_line(line, parse_date=False)
Parse an HTTP W3C-formatted line and return a dictionary with the following keys: 'remote_host', 'remote_log_name', 'auth_user', 'date', 'request', 'status', 'bytes'
Note:
• status and bytes are converted to int
• date can be converted to a datetime object, in UTC time, but by default the conversion is disabled (it is slow)
Raises HTTPFormatError
statistician module
class AlertParam(short_median=12, long_median=120, threshold=1.5, time_resolution=10)
Stores the parameters used in the alert detection process.
short_median and long_median are the sizes of the windows for the moving averages.
An alert will be raised if short_outflow_average > long_outflow_average * threshold.
Warning: The base unit of short_median and long_median is time_resolution: for example, long_median=2 with time_resolution=10 computes an outflow average over the last 2 * 10 = 20 seconds.
class QueueWriter(output_queue, parse=True, pace10=1, factor=2)
Bases: threading.Thread
Used to fill the Statistician queue, to simulate a fast reading and compare the reading speed with or without parsing.
Note: pace10 is the pace for 100ms, i.e. 10*pace10 entries are put in the queue every second.
run()
Puts n=pace10 lines every 10th of a second in the output_queue
class Statistician(input_queue, sleeping_time=0.1, parse=False, alert_param=<statistician.AlertParam instance>)
Bases: threading.Thread
This thread object is responsible for maintaining the statistics. It has a Statistics object to store them and raise the alerts.
It possesses the alert parameters.
Read lines are ‘thread-safely’ received thanks to an input_queue. They should be parsed by default, but this can be changed with the parse parameter.
Variables
• should_run (bool) – If False, the thread will shortly stop its operation. Used to cleanly end the program.
Note: Alerts are checked every AlertParam.time_resolution.
run()
Checks if an alert should be raised, checks the input queue, updates the stats if necessary and starts again.
state()
Returns: a description of the present thread state
Return type: string
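The loop pattern described for this thread (poll a queue, sleep, stop when should_run becomes False) can be sketched like this. QueueConsumer and its handled counter are hypothetical, and the periodic alert check is left out to keep the example short.

```python
import threading
import time

try:
    import queue  # Python 3
except ImportError:
    import Queue as queue  # Python 2.7


class QueueConsumer(threading.Thread):
    """Drains input_queue until should_run is set to False."""

    def __init__(self, input_queue, sleeping_time=0.1):
        threading.Thread.__init__(self)
        self.input_queue = input_queue
        self.sleeping_time = sleeping_time
        self.should_run = True
        self.handled = 0  # hypothetical counter standing in for the stats

    def run(self):
        while self.should_run:
            try:
                # Empty the queue without blocking.
                while True:
                    self.input_queue.get_nowait()
                    self.handled += 1
            except queue.Empty:
                pass
            time.sleep(self.sleeping_time)

    def state(self):
        return "running" if self.is_alive() else "stopped"
```

Setting should_run to False from another thread lets the loop finish its current pass and return, which is what allows a clean shutdown on CTRL + C.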
class Statistics
Object used by the statistician as its “notebook”. This is where the stats are saved.
It calls the Displayer if an alert should be raised or shut down. It is called when the stats should be printed.
Variables
• section (dictionary) – The keys are the hit sections, values are the number of hits for each section.
• total_bytes (int) – Sum of the bytes sent.
• total_hits (int) – Total number of hits.
• should_run (bool) – If False, the thread will shortly stop its operation. Used to cleanly end the program.
• long_term_bytes_buffer (int) – Stores the sum of bytes sent during a certain time, then is appended to the long_term_bytes list and reset.
• long_term_bytes (list of int) – List that stores the evolution of the number of sent bytes, used to compute moving average.
• alert_raised (bool) – True if an alert has been raised and not shut down.
Note: The use of statistics.lock makes this object thread-safe.
Warning: number_of_hits, total_bytes and total_hits are used for the printed stats, ‘total’
is in fact ‘total since the last display’.
emergency(alert_param)
Returns a dictionary with the alert parameters if there is one. Called by update_long_term
get_last_stats()
Returns a stats dict, used for the regular stats printing
reset_short_stat()
Called by the displayer after get_last_stats to reset the ‘printing stats’
upadate_stat(HTTP_dict)
Update the stats with the given parsed line (i.e. the HTTP_dict)
update_long_term(alert_param)
Update the long_term_bytes list. Checks if an alert should be raised (or shut down), and raises it if necessary.
Called by the Statistician every AlertParam.time_resolution.
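As an illustration of how such a lock-protected “notebook” can work, here is a reduced sketch. StatsNotebook and its method names are hypothetical, the alert logic is omitted, and take_last_stats merges the read and the reset into one locked step, which differs from the two separate methods documented above.

```python
import threading


class StatsNotebook(object):
    """Reduced, hypothetical version of a thread-safe stats store."""

    def __init__(self):
        self.lock = threading.Lock()
        self.section = {}
        self.total_hits = 0
        self.total_bytes = 0
        self.long_term_bytes_buffer = 0
        self.long_term_bytes = []

    def update(self, http_dict, section):
        """Record one parsed line; `section` is the section it hit."""
        with self.lock:
            self.section[section] = self.section.get(section, 0) + 1
            self.total_hits += 1
            self.total_bytes += http_dict["bytes"]
            self.long_term_bytes_buffer += http_dict["bytes"]

    def roll_long_term(self):
        """Move the buffered byte count to the long-term list and reset it."""
        with self.lock:
            self.long_term_bytes.append(self.long_term_bytes_buffer)
            self.long_term_bytes_buffer = 0

    def take_last_stats(self):
        """Return the stats since the last display and reset them."""
        with self.lock:
            stats = {"section": dict(self.section),
                     "total_hits": self.total_hits,
                     "total_bytes": self.total_bytes}
            self.section = {}
            self.total_hits = 0
            self.total_bytes = 0
            return stats
```

Every method takes the lock before touching the counters, so a reader thread and a display thread can call them concurrently.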
test_log_writer module
check_local_URL(URL)
Return True if URL starts with ‘/’ and doesn’t contain any whitespace character
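The check described here fits in one expression. The sketch below uses a hypothetical name, is_local_url, and is not the test module's own code.

```python
import re


def is_local_url(url):
    """True if url starts with '/' and contains no whitespace character."""
    return url.startswith("/") and re.search(r"\s", url) is None
```

For example, is_local_url("/page2/index.html") is True, whereas is_local_url("/page 2") is False.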
test_reader module
CHAPTER 4
Indices and tables
• genindex
• modindex
• search
Python Module Index
d
display, 13
h
httplog, 15
l
log_writer, 15
m
misc, 18
p
playground, 19
r
reader, 19
s
statistician, 21
t
test_log_writer, 22
Index
A
AlertParam (class in statistician), 21
C
check_local_URL() (in module test_log_writer), 22
CRITICAL (LogLevel attribute), 14
D
DEBUG (LogLevel attribute), 14
display (module), 13
Displayer (class in display), 13
E
emergency() (Statistics method), 22
EOF_reader() (in module misc), 18
ERROR (LogLevel attribute), 14
G
generate_all_URL_possible() (in module log_writer), 16
get_last_stats() (Statistics method), 22
get_parameters() (LogSimulator method), 15
get_section() (in module reader), 20
H
HTTPFormatError, 19
httplog (module), 15
I
INFO (LogLevel attribute), 14
InputThread (class in misc), 18
L
level_name (LogLevel attribute), 14
lock_print() (Displayer method), 14
log() (Displayer method), 14
log_writer (module), 15
LogLevel (class in display), 14
LogReader (class in reader), 19
LogSimulator (class in log_writer), 15
LogSimulatorConfigFileError, 15
LogWriter (class in log_writer), 15
M
misc (module), 18
Monitor (class in misc), 18
P
parse_line() (in module reader), 20
playground (module), 19
print_end_alert() (Displayer method), 14
print_new_alert() (Displayer method), 14
Q
QueueWriter (class in statistician), 21
R
random_HTTP_request() (in module log_writer), 17
random_local_URL() (in module misc), 18
random_log_line_maker() (in module log_writer), 17
read_log() (in module misc), 19
READ_THE_SOURCE_CODE_FOR_THE_DOC_PLZ() (in module httplog), 15
READ_THE_SOURCE_CODE_FOR_THE_DOC_PLZ() (in module playground), 19
reader (module), 19
reset_short_stat() (Statistics method), 22
run() (Displayer method), 14
run() (InputThread method), 18
run() (LogReader method), 20
run() (LogSimulator method), 15
run() (LogWriter method), 16
run() (Monitor method), 18
run() (QueueWriter method), 21
run() (Statistician method), 21
S
started_first_writing() (LogSimulator method), 15
stat_of_registered_object() (Displayer method), 14
stat_string() (Displayer method), 14
state() (LogReader method), 20
state() (LogSimulator method), 15
state() (Statistician method), 21
Statistician (class in statistician), 21
statistician (module), 21
Statistics (class in statistician), 21
T
test_log_writer (module), 22
U
uniform_random_local_URL_maker() (in module log_writer), 17
upadate_stat() (Statistics method), 22
update_long_term() (Statistics method), 22
W
WARNING (LogLevel attribute), 14