Application Layer DDoS Attacks Detection
using Classification Techniques & Data Mining
Fawad Khan1, Chen Zhanfang*2, Ijaz Ahmed3, Danish Javeed4
School of Computer Science and Technology of Changchun University of Science and Technology, Changchun, Jilin, China.
Abstract-- The rapid growth in internet usage attracts cyber intruders who exploit the network to carry out malicious activities. Forensic analysis of an attack is carried out by reconstructing the series of activities performed by the attacker. Digital forensic analysis can be accomplished by collecting hard drive images, RAM images, access log files, etc. It is hard to trace an attack from the activities recorded on the network, since the intruder removes as many traces as possible. Therefore, the only solution for identifying the intrusion is to use the access log events captured on the web server. Classification (k-NN) plays a vital role in recognizing suspicious sequences of actions in the incoming network traffic. In this paper, two classification techniques, k-NN and logistic regression, are applied to incoming traffic and their performance is compared in identifying the source of application layer DDoS attacks. The models are evaluated on a firewall server access log, and the results demonstrate that the k-NN based method achieves a higher detection rate than logistic regression with fewer false positives.
Keywords: User behavior analysis, DDoS attacks, Patterns Mining, Classification Techniques.
I. Introduction
Nowadays, people rely heavily on internet services to accomplish their routine activities. The rapid growth of internet technologies and this inevitable reliance on the internet can lead to new threats and malicious activities that compromise the confidentiality, reliability and integrity of the available network utilities [1]. An attacker who commits a crime through different browsing activities on a particular network and leaves no evidence is a prime subject of digital forensic investigation. System security is a prime concern in web based applications, and detecting intrusions such as security attacks is very difficult. Once evidence has been retrieved from different components or servers, such as log files, images, hard disks, cache, cookies, and the time and frequency of user page visits, the investigator starts the analysis of the victim. The rapid growth in the use of network tools and scripts helps attackers to execute various attacks in the network. According to a Kaspersky survey report, companies lost $444,000 in revenue to application layer DDoS attacks alone in 2014 [2]. Such attacks cause high resource consumption and also damage the targeted companies economically by generating heavy bills. For instance, online gaming networks, telecoms, etc. are prone to DDoS attacks [3]. In order to detect such crimes, investigators need to carry out the necessary actions involved in forensic investigation.
Various techniques have been used to detect malicious activities. An Intrusion Detection System (IDS) is a method that stops network connections on which unauthorized activities occur. Methods used in existing work on intrusion detection include statistical analysis [5] and machine learning techniques [6, 7], among others.
Forensic detection is the process of identifying, collecting, examining, and analyzing evidence while preserving the integrity of the data [8]. The forensic analyst accumulates events by reconstructing the sequence of activities performed by an intruder. In a forensic investigation, the victim machine is detached once it has been identified, and its data is retrieved so that the attack can be investigated from log files, virtual hard disks, RAM images of the VM, and so on, with the help of live or dead analysis. Dead forensic analysis is crucial for identifying evidence when the data is at rest [9]-[10]. Live forensic analysis is carried out to identify evidence through regular checks of the sources on the private network, since the data changes over time [9]. Log files, in other words evidence collection, play a major role in identifying the attack sources for forensic investigation. Evidence is collected from the attacked machine and analyzed using various validation techniques and statistical analysis of the log files [10].
The forensic auditor concentrates on determining details such as where, when, why, who, what and how the event happened. Big data analytics and machine learning techniques are used to classify the DDoS attacks captured in the access log of the web server. New attack patterns can be determined using supervised classification techniques such as logistic regression, random forest, and K-nearest neighbors. These supervised learning techniques are well suited to multiclass classification. It is hard to differentiate authorized from unauthorized traces, as the incoming traffic patterns and traits of an attack are similar to those of benign traces [11]. Also, new data patterns may not be identified by an Intrusion Detection System (IDS) because of the huge amount of data generated, which causes large processing overheads [11, 12, 13]. Data mining and machine learning models, viz., classification, decision trees, etc., play a major role in forensic analysis by identifying new attacks that have not been encountered before. Classification algorithms are used to train the model and to predict the target variable, which helps to enhance the accuracy and performance of the system.
In this work, the performance of classification techniques is evaluated and compared using the best-suited features extracted from a firewall server access log. These features are processed with big data statistical analysis and machine learning techniques, namely logistic regression and K-nearest neighbors, to identify application layer attacks that have occurred on the network. A multiclass classifier model is used because the detection of DDoS attacks is a multiclass classification problem and such a model offers high accuracy and performance. The classification model depends on the training and test datasets. The dataset is split into training and test sets using a split method with a user-defined range, and these sets are passed as input to the models [14]. The multiclass logistic regression classifier learns from the feature matrix and target vector, and compares newly generated access log entries from inbound traffic with the predefined user behaviors to identify whether a request is legitimate or suspicious (the target class). We also applied the K-nearest neighbors classifier (k-NN), which performs better than logistic regression: it achieves a higher detection rate, reduces false positives, and identifies unknown attacks more effectively.
II. Related Work
information by storing chronologically in the log file of the web server. The forensic auditor investigates the attack by retrieving events from log files, virtual hard disks, physical memory, etc., either online or offline. Application layer attacks have a great impact on web based applications. Kruegel et al. introduced web based intrusion detection that implicitly derives profiles, such as length and composition, from web server logs [15]. The derived profiles can be matched against inbound network traffic to classify attacks, but this results in more false positives. Lee et al. proposed a technique that separates benign from attack traces using clustering analysis on every phase of the attack [16]. This technique considers only a few inputs in the feature matrix, which results in a low detection rate for attacks.
Yatagai et al. introduced DDoS attack detection that uses the browsing sequence of the content and finds its association with the page content size [15]. Large access log files were not analyzed to detect new attacks, which results in a high number of false positives. Oh et al. built a model for identifying application layer DDoS attacks by clustering traffic patterns with a self-organizing map (SOM), where labelling is performed using the correlation of the features [16]. The labelling of each map unit reduces the true positive detection rate, and the method leads to a large number of false positives.
Konar et al. combined the idea considered in [16] with fuzzy logic to obtain a high attack detection rate. The SOM algorithm is used to identify suspicious newly generated patterns and to classify them from every neighboring map unit. When a new suspicious activity happens, only the newly generated rules that conform to the map units are updated, instead of the whole fuzzy rule base. Zolotukhin et al. proposed a technique to identify benign or malicious requests using n-gram statistical analysis methods [17]. This method has a high computational time since the feature size is large. Bhuyan et al. proposed a method to differentiate low-rate and high-rate malicious traffic from benign traffic using information theory with low processing overhead [17].
Maggi et al. introduced a technique to differentiate malicious and benign patterns in web based applications. HTTP traffic log files are examined to determine the historically modelled parameters [18]. This technique needs a voluminous, clearly labelled dataset for the initial training of the model in order to detect unauthorized behavior. Chwalinski et al. presented a technique for detecting HTTP-GET flood attacks that employs clustering of categorical vector variables and information theoretic measures. The proposed technique differentiates authorized and unauthorized patterns by analyzing the web request patterns of the incoming traffic [19]. Prior observation is not necessary for detecting attack behavior. However, the total number of clusters distributed over the different entropy ranges is difficult to determine, since most request sequences are uniformly distributed.
The methods used in existing work have not solved the challenge of mining unknown attacks preserved in the web server. The existing methods have very high computational time and resource usage, and they produce a large number of false positives. Classification techniques play a major role in identifying the behavior of unknown attacks. In this work, the performance of two classification algorithms, logistic regression and K-nearest neighbors, is evaluated and compared in order to detect the attacks effectively.
Fig.1. Architecture of HTTP flood attack detection.
The architecture of the system model is presented in Fig. 1. The system model consists of several steps, viz., log collection, log preprocessing, correlation and feature selection, and model training and prediction.
Log Collection: This process collects the event logs from information sources such as network modems, routers, hubs, switches, servers and hosts that are under forensic analysis.
Log Preprocessing: The log file is passed as input and analyzed to represent the logged events as a feature matrix.
Correlation and Features Selection: This process identifies correlated features that help model learning in terms of the accuracy and performance of the system.
Model Training: The feature matrix and label vector of the access log are passed as input to the classification model, which compares them with newly generated log entries from inbound traffic to classify the target class.
a. Log Files Collection
Event logs are accumulated from various network sources such as routers, modems, hubs, switches, web servers, and hosts, and from underlying components, namely hard drives, RAM files, physical memory, etc., that are under forensic analysis. The access logs retrieved from the network server play a major role in event accumulation. Application layer DDoS attacks are preserved in various logs, namely the system log, network log, server logs, etc., located on the web server. These log files are analyzed through forensic examination to detect application layer attacks. The attack information preserved in the different event logs is given below.
System log – finds out whether an attacker attempted or carried out a buffer overflow.
Debugging log – finds out the processes of application layer DDoS attacks.
Firewall log – audits the firewall using application-specific direct methods.
Authentication log – examines attacks on credentials and determines the malicious access
inclusion, host file inclusion, and DDoS attacks )
Error log – significant for addressing attacks which are web based.
Database log – significant for auditing attacks which are database based.
As application layer DDoS attacks are preserved in the access log file, this log is taken for auditing analysis. As discussed in [25], each entry in the access log file of a web server contains the following attributes: Timestamp, Source IP, Destination IP, Source Port, Destination Port, HTTP request method, URL and HTTP version, Flags, and HTTP status code.
b. Log Preprocessing
The logs collected from the web server are converted to a common log format by discarding the unwanted and noisy attributes. Only the essential attributes that affect the results are retained for determining whether a user is authorized or unauthorized, namely the remote host, request time, HTTP request, and referral URL [25]. The remote host (IP address), timestamp, TCP service, and flag are converted to numeric values using a hash function instead of being kept in full.
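The hashing step can be illustrated with a minimal Python sketch. The field names (`timestamp`, `src_ip`, `service`, `flag`), the timestamp format, and the bucket count are assumptions made for illustration only; the paper does not state which hash function or log layout was used. The sketch also converts the timestamp to epoch seconds rather than hashing it, which is one possible design choice and not necessarily the authors'.

```python
import hashlib
from datetime import datetime, timezone


def stable_hash(value: str, buckets: int = 2**20) -> int:
    """Map a string to a reproducible integer in [0, buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def preprocess(entry: dict) -> list:
    """Turn one parsed log entry (hypothetical field names) into a numeric row."""
    ts = datetime.strptime(entry["timestamp"], "%Y-%m-%d %H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc)  # assumption: log times are in UTC
    return [
        stable_hash(entry["src_ip"]),   # remote host
        int(ts.timestamp()),            # request time as epoch seconds
        stable_hash(entry["service"]),  # TCP service
        stable_hash(entry["flag"]),     # flag
    ]


row = preprocess({
    "timestamp": "2020-01-01 00:00:00",
    "src_ip": "192.0.2.10",
    "service": "http",
    "flag": "SYN",
})
print(row)
```

A cryptographic digest is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would make the feature values irreproducible between runs.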
c. Correlative Features Selection
In this stage, statistical analysis of the preprocessed features is carried out to select the best-performing features and the label class.
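One simple way to carry out such an analysis is to rank the features by the absolute Pearson correlation between each feature column and the label. The sketch below assumes a binary label (0 for benign, 1 for attack) and uses hypothetical feature names; the paper does not specify which correlation measure was applied, so this is only an illustration of the idea.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_features(rows, labels, names):
    """Return (name, |correlation with label|) pairs, strongest first."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((name, abs(pearson(column, labels))))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: three hypothetical features per request, binary label.
rows = [[1, 5, 7], [2, 5, 3], [3, 5, 8], [4, 5, 1]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels, ["req_rate", "constant", "noise"])
print(ranking)
```

Features with a score near zero, such as the constant column in this toy data, carry little linear information about the label and are candidates for removal.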
d. Model Training & Evaluation
The preprocessed feature matrix and target vector are passed as input to the classification models, viz., logistic regression and K-nearest neighbors, to identify suspicious behavior on the network. Since the target class is a categorical variable, each request is classified into an individual class. Classification is a supervised machine learning technique in which the program learns from the input data and then uses what it has learned to classify new observations [24]. Logistic regression works with the one vs all strategy, whereas K-nearest neighbors works with the k most similar (nearest) data points, which determine the label class. Once the models have been trained on the features, they compare newly produced access log entries from inbound network traffic with the predefined user behavior on the network to identify whether a new request is legitimate or suspicious.
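Before either model is trained, the feature matrix and target vector have to be divided into training and test sets. The following is a minimal sketch of a shuffled split; the function name, the 30% test ratio, and the fixed seed are assumptions for illustration, not the settings used in the experiments.

```python
import random


def train_test_split(rows, labels, test_ratio=0.3, seed=0):
    """Shuffle indices reproducibly and split rows/labels into train and test."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(indices) * test_ratio)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: the label is the parity of the first feature.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(rows, labels)
print(len(x_train), len(x_test))
```

Shuffling before splitting matters for access logs in particular, because attack traffic tends to be clustered in time and an unshuffled split could leave entire attack classes out of the training set.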
classes. The classes are then predicted by majority vote. When dealing with a very large number of classes, e.g. hundreds of classes, the one versus one strategy is too costly because it needs to train many thousands of binary classifiers, so the one versus all strategy is popular in this case. Our multiclass logistic regression algorithm therefore also uses the one versus all strategy.
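The cost argument and the one versus all strategy can be sketched in plain Python as follows. This is an illustrative implementation trained with batch gradient descent on toy data; the learning rate, epoch count, class names, and data points are all assumptions, and it is not the authors' implementation.

```python
import math


def classifier_counts(n_classes):
    """Number of binary classifiers needed: (one versus one, one versus all)."""
    return n_classes * (n_classes - 1) // 2, n_classes


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(rows, targets, lr=0.1, epochs=2000):
    """Fit one binary logistic regression with batch gradient descent."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, targets):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def train_one_vs_all(rows, labels):
    """Train one 'this class versus the rest' model per class."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1.0 if y == cls else 0.0 for y in labels]
        models[cls] = train_binary(rows, targets)
    return models


def predict(models, x):
    """Pick the class whose binary model gives the highest score."""
    def score(cls):
        w, b = models[cls]
        return b + sum(wi * xi for wi, xi in zip(w, x))
    return max(models, key=score)


# Toy, well-separated data with hypothetical class names.
rows = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (0, 6), (1, 5)]
labels = ["benign"] * 3 + ["attack_a"] * 3 + ["attack_b"] * 3
models = train_one_vs_all(rows, labels)
print(classifier_counts(100), predict(models, (5.5, 0.5)))
```

For 100 classes, one versus one would require 4950 binary classifiers while one versus all requires only 100, which is the trade-off described above.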
ii. K-nearest Neighbors Classification Algorithm
KNN is among the oldest non-parametric classification algorithms. In order to classify an unknown instance, the distance from that instance to every training instance is measured. The K smallest distances are determined, and the class most represented among these k nearest neighbors is taken as the output label. The value of K is normally chosen using a validation set or cross validation. The k-nearest neighbors rule is one of the best-known pattern recognition rules: it classifies each new instance by the majority label of its k nearest neighbors in the training dataset. In spite of its simplicity, the KNN rule often produces competitive results in specific domains. When carefully combined with prior knowledge, it has significantly advanced the state of the art (Belongie et al., 2002; Simard et al., 1993). Because of the nature of its decision rule, the performance of the KNN classification algorithm depends crucially on the way that distances are measured between different instances. When no prior knowledge is available, most implementations of KNN compute simple Euclidean distances. Unfortunately, Euclidean distances ignore any statistical regularities that could be estimated from a large training set of labeled instances. Studies have therefore shown that the KNN classification algorithm can be greatly improved by learning an appropriate distance metric from labeled instances (Chopra et al., 2005; Goldberger et al., 2005; Shalev-Shwartz et al., 2004; Shental et al., 2002). By default, the KNN algorithm employs the Euclidean distance, which can be calculated as given in the following equation [25].
$$ D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2} $$
where p and q are the two instances being compared, each described by n features. There are also other ways to calculate distance, such as the Manhattan distance. Another parameter is k, which decides how many neighbors are considered by the KNN algorithm. The choice of k has a significant impact on the performance of the KNN algorithm.
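The two distance measures mentioned above can be written directly from their definitions. The function names below are illustrative.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences between p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute differences between p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of features")
    return sum(abs(a - b) for a, b in zip(p, q))


print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Both measures are sensitive to feature scale, so hashed or epoch-valued features with large magnitudes would dominate the distance unless the features are normalized first.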
IV. Experimental results
This section describes the experiments conducted for access log collection, access log preprocessing, and model training, and presents the experimental results of the classification algorithms.
The normal traffic was collected by performing different browsing activities on various systems using legitimate browsing agents, HTTP service header parameters, and HTTP request methods, all of which are captured in the access log file of the web server. The real, voluminous incoming and outgoing network traffic of the web server was recorded continuously in the web server log file over a period of one week or more. The attack was then carried out during the planned time frame. The HTTP GET flood attacks were executed using purpose-built bots or different attacking tools, viz., HULK [26], HTTP DoS [27], and HOIC [28]. During peak hours the flow of network traffic increased significantly, and it subsided gradually in the afternoon. The DDoS attacks are preserved in the access log file of the web server.
b. Result Analysis
The log files are pre-processed and formatted to obtain the transmitted traffic. The pre-processed access log files are passed as input to the classification algorithms to identify the traffic patterns effectively. In K-nearest neighbors, K means that for each test data point we look at the k nearest observed data points, take the most frequently occurring class among them, and assign that class to the test data point. K therefore represents the number of training data points lying in proximity to the test data point that are used to identify the target class. In this study, four different types of DDoS attacks were encountered and captured in the access log (firewall server). The models, K-nearest neighbors and logistic regression, were therefore built as multiclass classifiers. The trained model compares the incoming traffic with the previously observed features to classify each request as legitimate or as one of the four illegitimate categories (the target classes), as shown in Fig. 2.
Figure 2. Different DDoS attacks encountered in our data.
The two models, logistic regression and K-nearest neighbors, were evaluated, and K-nearest neighbors performed far better than logistic regression. The pseudocode for K-nearest neighbors is as follows.
1. Load the training and test data
2. Choose the value of k
3. For each point in the test data:
- Find the Euclidean distance to all training data points
- Store the Euclidean distance in a list and sort it.
- Choose the first k points
- Assign a class to the test point based on the majority of classes present in the chosen points.
4. End.
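The pseudocode above can be turned into a short runnable Python sketch. The toy data points and the class names `benign` and `flood` are made up for illustration; the real features come from the preprocessed access log.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, point, k=3):
    """Classify one test point by majority vote among its k nearest neighbors."""
    if not 1 <= k <= len(train_rows):
        raise ValueError("k must be between 1 and the number of training rows")
    # Euclidean distance from the test point to every training point, sorted.
    distances = sorted(
        (math.dist(row, point), label)
        for row, label in zip(train_rows, train_labels)
    )
    # Keep the labels of the first k points and take the most common one.
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


train_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["benign", "benign", "benign", "flood", "flood", "flood"]

test_points = [(0.5, 0.5), (5.5, 5.5)]
predictions = [knn_predict(train_rows, train_labels, p, k=3) for p in test_points]
print(predictions)
```

Because every test point is compared with every training point, prediction time grows linearly with the size of the training log, which is worth keeping in mind for voluminous access logs.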
4.2.1. Generalized Confusion Matrix for Multiple Classes

                     Predicted Number
Actual       Class 1    Class 2    ...    Class n
Class 1      x_11       x_12       ...    x_1n
Class 2      x_21       x_22       ...    x_2n
...          ...        ...        ...    ...
Class n      x_n1       x_n2       ...    x_nn

The total numbers of false positives (TFP), false negatives (TFN), and true negatives (TTN) for each class i will be calculated based on the Generalized Equations 1, 2, and 3, respectively. The total number of true positives in the system will be obtained through Equation 4.
$$ TFP_i = \sum_{j=1,\, j \neq i}^{n} x_{ji} \qquad [1] $$

$$ TFN_i = \sum_{j=1,\, j \neq i}^{n} x_{ij} \qquad [2] $$

$$ TTN_i = \sum_{j=1,\, j \neq i}^{n} \; \sum_{k=1,\, k \neq i}^{n} x_{jk} \qquad [3] $$

$$ TTP_{all} = \sum_{j=1}^{n} x_{jj} \qquad [4] $$

After experimenting with the access log (firewall server log files) over various test cases, it is inferred that the accuracy of k-NN is higher than that of logistic regression. k-NN outperforms the other classifier because the parameters were tuned and an optimal value for K was identified. The results are shown in Fig. 3 and Fig. 4.
Figure 3. Confusion matrix outcomes of the proposed method.
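The counts described by Equations 1 to 4 can be computed from a confusion matrix with a few sums. The sketch below assumes that rows hold the actual class and columns hold the predicted class; the 3x3 matrix is made-up illustrative data, not the result reported in Fig. 3.

```python
def confusion_counts(matrix):
    """Return (TTP_all, per-class list of TFP/TFN/TTN) for a square matrix.

    Assumes matrix[j][k] counts samples of actual class j predicted as class k.
    """
    n = len(matrix)
    ttp_all = sum(matrix[j][j] for j in range(n))
    per_class = []
    for i in range(n):
        # Column i without the diagonal: other classes predicted as class i.
        tfp = sum(matrix[j][i] for j in range(n) if j != i)
        # Row i without the diagonal: class i predicted as something else.
        tfn = sum(matrix[i][j] for j in range(n) if j != i)
        # Everything outside row i and column i.
        ttn = sum(
            matrix[j][k]
            for j in range(n)
            for k in range(n)
            if j != i and k != i
        )
        per_class.append({"TFP": tfp, "TFN": tfn, "TTN": ttn})
    return ttp_all, per_class


matrix = [
    [50, 2, 1],
    [3, 40, 4],
    [0, 5, 45],
]
ttp_all, per_class = confusion_counts(matrix)
print(ttp_all)
print(per_class)
```

As a sanity check, for any class the diagonal entry plus its TFP, TFN, and TTN should add up to the total number of test samples (150 in this toy matrix).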
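To compute the generalized precision (P), recall (R), and specificity (S) for each class i, Generalized Equations 5, 6, and 7 will be used.

$$ P_i = \frac{TTP_{all}}{TTP_{all} + TFP_i} \qquad [5] $$

$$ R_i = \frac{TTP_{all}}{TTP_{all} + TFN_i} \qquad [6] $$

$$ S_i = \frac{TTN_{all}}{TTN_{all} + TFP_i} \qquad [7] $$

$$ \text{Overall Accuracy} = \frac{TTP_{all}}{\text{Total number of testing entries}} \qquad [8] $$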
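For comparison, the conventional per-class forms of these metrics can be sketched as follows. Note that this sketch uses each class's own true-positive count, which is the textbook definition and may differ from the generalized equations of this section; the F1 score is included because it is reported alongside precision and recall. The matrix is made-up illustrative data, with rows assumed to be actual classes and columns predicted classes.

```python
def per_class_metrics(matrix):
    """Conventional precision, recall, specificity and F1 per class, plus accuracy."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    results = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[j][i] for j in range(n)) - tp  # predicted i, actually other
        fn = sum(matrix[i]) - tp                       # actually i, predicted other
        tn = total - tp - fp - fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append({
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
        })
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    return results, accuracy


matrix = [
    [50, 2, 1],
    [3, 40, 4],
    [0, 5, 45],
]
metrics, accuracy = per_class_metrics(matrix)
print(accuracy)
print(metrics[0])
```

Reporting the metrics per class rather than only overall accuracy makes it visible when one attack category is systematically confused with another.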
Although the proposed method returns satisfactory outcomes in terms of precision, recall, and F1-measure, some errors still occur. This is because our dataset lacks attributes that are useful for a multiclass classifier. Two different classification models, logistic regression and K-nearest neighbors, were trained, and k-NN outperforms logistic regression. Various tests were carried out using the log files generated by the firewall server, and combinations of various attack instances were tested using big data analytics and machine learning techniques. Across these experiments, k-NN with tuned parameters greatly improved the performance of the proposed method. The misclassification rate of k-NN is much lower than that of the other model, and an optimal value for k was identified, which plays an effective role in model learning. k-NN therefore gives better outcomes than logistic regression.
Conclusion
In this paper, the classification techniques K-nearest neighbors and logistic regression are employed to detect application layer DDoS attacks and to enhance the performance of forensic analysis. The normal traffic is collected through normal browsing activities, and the attacks are carried out using different attacking tools, scripts, and bots. These events are captured in the access log of a firewall web server. The access log events are pre-processed to extract the relevant feature matrix from the web server log file. This preprocessed feature matrix is then passed as input to the classification algorithms, viz., K-nearest neighbors and logistic regression, which help to identify the attacks through pattern analysis of the incoming traffic. The experimental results show that the k-nearest neighbors based classification technique achieves a higher detection rate, produces fewer false positives, and identifies unknown attacks better than the logistic regression algorithm.
Acknowledgement
This research work was supported by the Jilin Scientific and Technological
Development Program (Project no. 20190201267JC).
References
[1] Scarfone, K., Mell, P.: Guide to intrusion detection and prevention systems (IDPS) NIST Special Publications 800-94,1–127 (2007).
[2] Kaspersky Labs, Global IT security risks survey 2014: Distributed Denial of Service (DDoS) attacks, 2014, http://media.kaspersky.com/en/B2B-International-2014-survey-DDoS-Summary-report.pdf
[3] DDoS attack, http://www.digitaltrends.com/computing/ddos-attacks-hit-record-numbers-in-q2-2015/ (Accessed on 25/11/2015).
[4] W. Lee, S. J. Stolfo, "Data mining approaches for intrusion detection," Columbia University, New York dept. of computer science, 2000.
[5] Zhang, Z., Li, J., Manikopoulos, C., Jorgenson, J., Ucles, J.: HIDE: a Hierarchical Network Intrusion Detection System using statistical preprocessing and Neural Network classification, In: Proceedings of IEEE Workshop on Information Assurance and Security, pp. 85–90, (2001).
[6] Govindarajan, M., Chandrasekaran, R.: Intrusion Detection using neural based hybrid classification methods, J. Comput. Netw., vol. 55, 1662–1671, (2011).
[7] Hu, W., Liao, Y., Vemuri, V. R.: Robust anomaly detection using Support Vector Machines, In: Proceedings of International Conference on Machine Learning, pp. 592–597, (2003).
[8] Adrian T.N. Palmer, Computer Forensics, The six steps, US-CERT, (2008).
[9] Liao, N., Tian, S., Wang, T.: Network forensics based on fuzzy logic and expert system, J. Computer Communications, vol. 32, 1881–1892, (2009).
[11] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, "An empirical evaluation of information metrics for low-rate and high-rate DDoS attack detection," Pattern Recognition Letters, vol. 51, pp. 1-7, 2015.
[12] T. Yatagai, T. Isohara and I. Sasase, "Detection of HTTP-GET flood attack based on analysis of page access behavior," In Communications, Computers and Signal Processing, IEEE Pacific Rim Conference, pp. 232-235, 2007.
[13] K. Lee, J. Kim, K. H. Kwon, Y. Han and S. Kim, "DDoS attack detection method using cluster analysis," Expert Systems with Applications, vol. 34, No. 3, pp. 1659-1665, 2008.
[14] H. Oh and K. Chae, "Real-Time Intrusion Detection System Based on Self-Organized Maps and Feature Correlations," In Convergence and Hybrid Information Technology, 3rd IEEE International Conference on ICCIT'08, vol. 2, pp. 1154-1158, 2008.
[15] A. Konar and R. C. Joshi, "An Efficient Intrusion Detection System Using Clustering Combined with Fuzzy Logic," Contemporary Computing, Springer Berlin Heidelberg, pp. 218-228, 2010.
[16] Sree TR, Bhanu SM. Identifying HTTP DDoS Attacks Using Self Organizing Map and Fuzzy Logic in Internet Based Environments. In Proceedings of 3rd International Conference on Advanced Computing, Networking and Informatics 2016 (pp. 259-269). Springer, India.
[17] Sidana, M. (2017, February 28). Types of classification algorithms in Machine Learning. Retrieved from Medium website: https://medium.com/@Mandysidana/machine-learning-types-of-classification-9497bd4f2e14.
[18] Kruegel, C., Vigna, G.: Anomaly detection of web based attacks. In: Proceedings of the 10th ACM conference on communications security, pp. 251–261, ACM, (2003).
[19] M. Zolotukhin and T. Hamalainen, "Detection of anomalous http requests based on advanced n-gram model and clustering techniques," Internet of Things, Smart Spaces, and Next Generation Networking, Springer Berlin Heidelberg, 371-382, 2013.
[20] Bhuyan MH, Bhattacharyya DK, Kalita JK. An empirical evaluation of information metrics for low-rate and high-rate DDoS attack detection. Pattern Recognition Letters. 2015 Jan 1;51:1-7.
[21] Maggi, F., Robertson, W., Kruegel, C., Vigna, G.: Protecting a moving target: Addressing web application concept drift. In: Kirda, E., Jha, S., Balzarotti, D., (eds.), Recent Advances in Intrusion Detection 2009. LNCS, vol. 5758, pp. 21–40. Springer, Berlin Heidelberg (2009).
[22] Chwalinski P, Belavkin R, Cheng X. Detection of HTTP-GET attack with clustering and information theoretic measurements. In: Foundations and Practice of Security. Springer; 2013. p. 45-61.
[23] Do, T.-N., & Poulet, F. (2015). Parallel Multiclass Logistic Regression for Classifying Large Scale Image Datasets. Advanced Computational Methods for Knowledge Engineering, 255–266. https://doi.org/10.1007/978-3-319-17996-4_23.
[24] Download Limit Exceeded. (n.d.). Retrieved May 4, 2020, from citeseerx.ist.psu.edu website: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.175.107&rep=rep1&type=pdf.
[25] Weinberger, K., & Saul, L. (2009). Distance Metric Learning for Large Margin nearest Neighbors Classification. Journal of Machine Learning Research, 10, 207–244. Retrieved from http://www.jmlr.org/papers/volume10/weinberger09a/weinberger09a.pdf
[26] HULK attack, http://github.com/grafov/hulk