Part of the Databases and Information Systems Commons, Information Security Commons, OS and Networks Commons, and the Systems Architecture Commons

(1)

University of Arkansas, Fayetteville University of Arkansas, Fayetteville

ScholarWorks@UARK

Theses and Dissertations 5-2021

Network-Based Detection and Prevention System against

DNS-Based Attacks

Based Attacks

Yasir Faraj Mohammed

University of Arkansas, Fayetteville

Follow this and additional works at: https://scholarworks.uark.edu/etd

Part of the Databases and Information Systems Commons, Information Security Commons, OS and Networks Commons, and the Systems Architecture Commons

Citation Citation

Mohammed, Y. F. (2021). Network-Based Detection and Prevention System against DNS-Based Attacks. Theses and Dissertations Retrieved from https://scholarworks.uark.edu/etd/3970

(2)

Network-based Detection and Prevention System against DNS-based Attacks

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Engineering with a concentration in Computer Science

by

Yasir Faraj Mohammed AL-Hadbaa University

Bachelor of Science in Computer Science, 2004 Glamorgan Uninversity

Master of Science in Computer System Security, 2009

May 2021 University of Arkansas

This dissertation is approved for recommendation to the Graduate Council

(3)

ABSTRACT

(4)

ACKNOWLEDGMENTS

(5)

DEDICATION

(6)

TABLE OF CONTENTS

1 Introduction . . . 1

1.1 Background . . . 1

1.1.1 DNS Protocol . . . 1

1.1.2 DNS Namespace Hierarchy . . . 2

1.1.3 Data Exchange over DNS . . . 4

1.1.4 DNS Attacks . . . 5

1.1.5 DNS Tunneling Attack . . . 6

1.1.6 Communication Patterns . . . 6

1.1.7 Visualization . . . 7

1.1.8 Machine Learning Techniques . . . 7

1.2 Statement of Problem . . . 8

1.3 Goals of this work . . . 9

1.4 Contributions . . . 10

1.5 Dissertation Organization . . . 11

2 Related Work . . . 12

2.1 Detection of DNS attacks . . . 13

2.2 Detection DNS Tunneling attacks . . . 14

2.3 Detection IDS/IPS systems . . . 15

2.4 Detection using Machine Learning . . . 17

2.5 Detection using Visualization . . . 18

2.6 DNSSec . . . 19

3 Visualization of DNS Tunneling Attacks . . . 21

3.1 Introduction . . . 21

3.2 Visualization System Design . . . 22

3.3 Graphical patterns for DNS tunneling attacks . . . 23

3.4 Evaluation . . . 26

3.4.1 Experimental setup . . . 26

3.4.2 Results . . . 27

3.4.3 Issues . . . 29

3.5 Conclusion . . . 30

4 DNS Tunneling Attack Detection using Machine Learning . . . 31

4.1 Methodology . . . 31

4.1.1 System Design . . . 31

4.1.2 The Detection System . . . 33

4.2 Results and Discussion . . . 35

4.2.1 Evaluation . . . 35

4.2.2 Experimental Setup . . . 35

4.2.3 The Dataset . . . 38

(7)

5 System Methodology . . . 46

5.1 Background . . . 46

5.2 DNS Proxy . . . 46

5.2.1 DNS Proxy as a Detection System . . . 47

5.3 System Design . . . 48

5.3.1 Parsing DNS Data . . . 48

5.3.2 Feature Extraction . . . 49

5.3.3 The Payload Analysis Module . . . 50

5.3.4 The Traffic Analysis Module . . . 53

5.3.5 System Features . . . 56

6 Results and Discussion . . . 58

6.1 Evaluation . . . 59

6.1.1 Experimental Setup . . . 59

6.1.2 The Collected DNS Data . . . 60

6.2 Threshold Value . . . 63 6.3 Results . . . 65 6.3.1 Results of Experiment 1 . . . 67 6.3.2 Results of Experiment 2 . . . 69 6.4 Results . . . 69 6.4.1 Snort . . . 69 6.4.2 Snort Setup . . . 69 6.4.3 Snort Results . . . 71 6.4.4 DNS Proxy vs Snort . . . 71

7 Conclusion and Future Work . . . 75

7.1 Conclusion . . . 75

7.2 Discussion and Limitations . . . 77

7.3 Future Work . . . 78

(8)

LIST OF FIGURES

Figure 1.1: The Domain Name System Hierarchy . . . 2

Figure 1.2: How DNS works . . . 3

Figure 1.3: Data exchange over the DNS protocol . . . 5

Figure 1.4: The DNS tunneling attack . . . 6

Figure 3.1: System Architecture Design . . . 23

Figure 3.2: Graphical Patterns for Traditional DNS Tunneling Attack. . . 25

Figure 3.3: Graphical Patterns for Heyoka DNS Tunneling Attack. . . 26

Figure 3.4: Top Five Domain Name for an Entire Week without the Attack. . . 27

Figure 3.5: Top Five Domain Names during the Attack over a Two-hour Period. . . 28

Figure 3.6: Visualizer Module Shows only the Attack. . . 29

Figure 3.7: Visualize the Entire DNS Captured Data. . . 29

Figure 4.1: The DNS Tunneling Attack Patterns . . . 32

Figure 4.2: DNS Tunneling Tools . . . 36

Figure 4.3: Top 25 Domain Names including DNS Tunneling Attack Tools . . . 37

Figure 4.4: The Normal and the Malicious DNS requests in the dataset . . . 39

Figure 4.5: The Total Number of the Normal and the Malicious DNS requests in the dataset . . . 40

Figure 4.6: The Total Number of the TTL and Data Length in the Normal and the Malicious DNS requests in the dataset . . . 41

Figure 4.7: The Confusion Matrix . . . 41

Figure 5.1: DNS Proxy . . . 47

Figure 5.2: The Architecture Design of DNS Proxy Detection System. . . 49

Figure 5.3: The Real-time Payload Analysis Module. . . 53

Figure 5.4: The Overview of Traffic Analysis Module. . . 54

Figure 5.5: Example of Parsed DNS Data . . . 54

Figure 6.1: An Overview of the DNS proxy Setup. . . 60

Figure 6.2: Internet Data Transfer during the Four Weeks. . . 61

Figure 6.3: DNS Queries over Monthly Time Period. . . 62

Figure 6.4: Statistics of Queries Types and Answered by. . . 63

Figure 6.5: Environmental Setup of Snort. . . 70

(9)

LIST OF TABLES

Table 1.1: Well-known malware that uses DNS protocol for their communications . . 9

Table 4.1: DNS Tunneling Visualization Signatures . . . 33

Table 4.2: Frequency of Top 10 Domain Names . . . 38

Table 4.3: The confusion matrix values for the machine learning classifiers. . . 44

Table 4.4: Compared Results Among Machine Learning Classifiers . . . 45

Table 5.1: DNS Tunneling Visualization Signatures . . . 55

Table 6.1: Frequency of Top 10 Domain Names . . . 62

Table 6.2: Exfiltrating Various Binary files Over the DNS . . . 65

Table 6.3: Tools to Simulate DNS-Based Attacks . . . 66

Table 6.4: Frequency of Top 10 Blocked Domain Names . . . 67

Table 6.5: Top 5 Domain Names of False Positives . . . 68

Table 6.6: Compared Results between DNS Proxy and Snort . . . 71

Table 6.7: Comparison of Three Experiments Results . . . 72

Table 6.8: The Major Difference between the Detection System and Snort IDS . . . 73

(10)

1 Introduction

The Internet has become a significant part of people’s daily life and an essential method for communications today. Individuals and companies rely on the Internet and use it as a primary tool for doing their personal or private business. Thus, personal computers and computer networks have been a primary target of data theft, malware software, and other types of attacks such as accessing, changing, or destroying sensitive information that interrupt normal business processes. This requires cybersecurity practices to protect systems, networks, and applications from various cyber attacks. Cybersecurity provides multiple layers of protection deployed among different components of network infrastructures to block the Internet-based attacks and includes firewalls and intrusion detection systems (IDSs). However, sometimes adversaries can bypass these defensive mechanisms by using different techniques. For example, the Domain Name System (DNS) is the critical backbone of the Internet providing the mapping of human-readable names to Internet Protocol (IP) addresses used for communication. Therefore, restricting the DNS protocol in a firewall or security appliances is challenging, and it may raise usability issues since almost every communication session on the Internet relies on the DNS protocol. Thus, the DNS protocol is one of the most laborious protocols that must be maintained for reliable communication inside a network environment.

1.1 Background

In this section, background information for this work is presented. The background includes a description of the DNS protocol, its usage, and DNS attacks. In addition, data exfiltration techniques and DNS tunneling attacks are introduced.

1.1.1 DNS Protocol

(11)

protocol is a distributed hierarchical database which stores information about a group of registered computers on the Internet. There are different types of information that the DNS protocol can store such as IP addresses including IP version 4 (IPv4) and IP version 6 (IPv6), hostnames, and mail routing information.

1.1.2 DNS Namespace Hierarchy

The DNS has a hierarchical tree structure called the DNS namespace. Each dot in a domain name indicates the separation between levels in the tree structure. The top layer of the DNS tree structure represents the root level which starts with a dot. Under the root level, there are top-level domains (TLD) which are children domains of the root such as .com, .net, .edu and others. Next, TLD also has children domains that reference the second level of domains or authoritative domain name servers. Finally, the fully qualified domain name (FQDN) locates the hostnames or subdomains within the DNS hierarchy [1]. For example, the process of looking up the following domain name csce.uark.edu. is always started with the dot which is the root server in the DNS tree structure. Then, the root server(s) forwards the query to the correspondence TLD which in this case is .edu. Next, the TLD sends the query to the secondary level domains. As soon the query is received by the authoritative domain name server, it looks up the subdomain equivalent and returns its IP address. Figure 1.1 illustrates the DNS hierarchy.

(12)

DNS translates and locates easy-to-remember domain names into corresponding IP addresses and vice-versa. As seen in Figure 1.2, it is assumed that a user types a domain name example.org in a browser on a computer. First, the browser attempts to resolve the IP address of the domain name by checking the local cache of the computer. If there is no information related to that domain name and depending on the network configuration, it often forwards a query as a question seeking to match the domain name with its corresponding IP address to the local DNS server or the Internet Service Provider’s (ISP) DNS server. Next, if the information is still not found, the request is forwarded to root nameservers. The root servers are located all over the world that point to the appropriate downstream name servers. Then, it gets forwarded to the relevant TLD server which stores the address information for second level domains example.org within the top-level domain .org. As soon as the information is gathered from authoritative DNS servers, it resolves the domain name and responds to the initial sender. Finally, the computer communicates with a web server to display the content of the website.

(13)

1.1.3 Data Exchange over DNS

The DNS is a stateless protocol, meant to receive and answer requests from clients. The DNS protocol was not designed for data transfer. It was designed to exchange very short and specific types of information [2]. According to [3], the length of any label of the domain name is limited to between 1 and 63 bytes, and a full domain name is limited to 255 bytes if letters, digits, hyphens, and separators are included. However, the DNS protocol is considered an excellent covert channel from a security perspective. A covert channel is a channel used to transfer information between processes in a secretive manner, bypassing the security policy of a system. Thus, adversaries may abuse and leverage the DNS protocol to transfer data to their authoritative name server. Instead of sending legitimate subdo-mains, adversaries take advantage of the limitation of the domain length and encode some data in the domain name. For example, if the passw0rd.example.org domain is queried, the authoritative name servers of example.org receives the passw0rd as a string and interpret it as incoming data instead. Also, adversaries may use some other encoding techniques in order to transfer text or binary data to their servers. For example, they can use Base32, Base64, and Based128 to encode their data. In this case, the query will be as the follow-ing:YWRtaW5AOnBhc3N3MHJkCg.example.org where YWRtaW5AOnBhc3N3MHJkCg is a base64 form of the sensitive text file and this is called a DNS data exfiltration technique. Note that if the required data exceeds the 255 bytes, then it has to be split into more than one DNS query in order to be transferred via the DNS protocol. Data exchange over the DNS protocol is shown in Figure 1.3

(14)

pub-Figure 1.3: Data exchange over the DNS protocol

lic hotspot authentication system for mobile devices that leveraged the DNS protocol for their communications where mobile devices exchange information and media access control (MAC) addresses through the DNS protocol to communicate with a credential server [4]. In addition, the malicious use of data exfiltration over the DNS occurs from other software and malware, which are discussed later in this chapter.

1.1.4 DNS Attacks

(15)

traffic except the DNS protocol and DNS activities are not monitored. Adversaries can still transfer stolen or sensitive information through a firewall using DNS. They take advantage of this open avenue and manipulate the use of the DNS service to establish undetected covert channels.

1.1.5 DNS Tunneling Attack

The DNS tunneling attack is a method that encodes and encapsulates the data of other applications or protocols in DNS queries and responses and is shown in Figure 1.4. The original concept of DNS tunneling attack was designed to bypass the captive portals for paid Wi-Fi service, especially at hotels and cafes. However, a data payload can be added to attack internal networks that are used to control remote servers and applications, or to perform DNS data exfiltration which is a technique being used to transfer unauthorized data between two computers through the DNS protocol [9]. To successfully execute this attack, an adversary requires a compromised machine within the internal organization network that has access to the internal DNS server which has external access via the DNS protocol. In this way, the adversary can use the compromised machine to establish a connection to an outside computer using the DNS protocol. Adversaries must also register a domain name and set up a remote server that can run as a DNS authoritative server.

Figure 1.4: The DNS tunneling attack

1.1.6 Communication Patterns

(16)

of DNS tunneling attack, which can be performed using software such as Iodine, DNScat, DNS2TCP, and others, is more reliable than the DNS data exfiltration techniques. The DNS tunneling attack communication has many keep-alive messages with bi-directional and interactive patterns. Due to the DNS limitation of 255-byte messages, a high volume of DNS queries is required. Therefore, a system administrator could quickly notice the attack if the DNS protocol is monitored. The common usage of DNS tunneling attack is web browsing over the DNS and remote desktop protocols. On the other hand, DNS data exfiltration is less aggressive than DNS tunneling. The communication channel of DNS data exfiltration is built on uni-directional patterns which does not require DNS replies from an authorita-tive DNS server. DNS data exfiltration is usually performed by malware and command and control (C&C) channels to transfer sensitive data, such as stolen credit cards, outside the network through the firewall. In this technique, adversaries use common DNS records such as A and AAAA, which makes the detection more difficult than the others.

1.1.7 Visualization

Visualization is not a new concept for monitoring network traffic. There are several papers that used the parallel coordinates technique to monitor and visualize network traffic [10, 11]. There are various benefits of using a parallel coordinates technique. First, it is scalable. It helps to represent a considerable amount of data in an efficient manner which makes analyzing big data more simple and powerful. Second, there is no limit on the number of values for representing an attack using the parallel coordinates technique that converts multidimensional data into two dimensions. Therefore, using the parallel coordinates technique gives the ability to represent more than three different values within a two-dimensional space. In this way, the relationship between the results can be reviewed quickly showing prominent trends, correlations, and divergence from the raw data. Finally, the visualized data using the parallel coordinates technique does not include any bias for any column within the space [12, 13]. In other words, every visualized feature has the same weight.

1.1.8 Machine Learning Techniques

(17)

and improving from experience without human interaction. Machine learning algorithms use computational approaches to learn information directly from data without relying on a predetermined equation as a model.

Machine learning consists of three primary classifications: supervised learning, unsu-pervised learning, and semi-suunsu-pervised learning. Suunsu-pervised learning includes labeling infor-mation on training data using classification algorithms to predict new data labels. On the other hand, unsupervised machine learning uses unlabeled information and builds a classifier based on learning and information characteristics analysis. It tends to automatically cluster data. Finally, semi-supervised learning is a classifier using both labeled and unlabeled data to train the algorithm [14]. Nowadays, machine learning is actively being used for various topics, including detection systems against malicious activity. In Chapter 2, we list several proposed systems that use machine learning for detecting abnormal activities including DNS tunneling, data exfiltration, and botnets that use DNS protocol for communications.

1.2 Statement of Problem

Detecting and preventing DNS-based attacks is a challenging task. Adversaries use DNS-based attacks to transfer sensitive data, such as stolen private organization data, or establish covert communication channels to hide their malicious activities. DNS-based at-tacks happen often and can result in a loss of revenue. The 2020 Global DNS Threat Report revealed that 79% of organizations had experienced DNS-based attacks, including DNS phishing, DNS-based malware, and DNS tunneling attacks [15]. Moreover, DNS-based attacks were the main cause for 82% of application downtime in 2020, which caused ap-proximately more than 1 million dollars of damage to these organizations. This work could mitigate these attacks.

(18)

of the modern firewalls and IDS appliances do not block a DNS tunneling attack or raise the alarm by default. For that reason, the DNS tunneling attack is attractive to adversaries or people who design malicious applications.

The data exfiltration over the DNS protocol and DNS tunneling attacks are being used in many real-world scenarios such as malware, botnets, trojans, and others. According to [2], there are various types of well-known malware software that targets victims using the DNS protocol for creating covert communication channels. Table 1.1 shows more information about malware software from 2011 to 2017 which were explicitly designed to avoid detection systems.

Year Malware Targets

2011 Morto RDP/RAT 2011 FeederBot Botnet 2014 Plugx RDP/RAT 2014 FrameworkROS POS 2015 Wekby Targeted 2015 BernhardPOS POS 2015 JAKU Botnet 2016 MULTIGRAIN POS 2017 DNSMessenger Targeted

Table 1.1: Well-known malware that uses DNS protocol for their communications

1.3 Goals of this work

The objectives of this work are listed below:

(19)

based on the Payload Analysis module. This module parses, extracts, and analysis cap-tured fully qualified domain name (FQDN) queries from clients to determine whether malicious or not.

3. Design and implement an offline detection mode to distinguish abnormal activity within the DNS traffic. This module works based on two modules: the visualization and the machine learning-based detection modules. The offline detection mode feeds the system with DNS traffic based on a specific window period, every x minutes, to determine whether the DNS traffic contains an attack.

4. Implement and design a network-based solution, DNS proxy server, that combines the two detection modes in order to detect and prevent DNS-based attacks. The network-based solution in this work aims to provide DNS services as well as the detection and prevention in high accuracy, scalability, and efficient manner.

5. Compare and evaluate a set of performance metrics with different detection systems or previous research work.

1.4 Contributions

(20)

(6) Designing a system to detect DNS tunneling attacks that achieves a high accuracy of 99.8% and no false negatives with a low false-positives rate of less than 0.02%. The system performs better than the commonly used Snort Intrusion Detection System (IDS), and the detection system in this work to determine which system obtained better performance and results against various DNS-based attacks.

1.5 Dissertation Organization

(21)

2 Related Work

Sensitive data theft is now one of the most severe risks for organizations and service providers involving man-in-the-middle attacks [16] or malware that leaks data overt a cover channel [17]. Although many organizations have deployed traditional security systems such as firewalls, intrusion detection systems, and proxies, there are still high-profile data breaches that occur. Marriott International announced on November 2018 that it had been the victim of a data breach in which the information for approximately 500 million clients had been stolen [18]. The breach occurred on Starwood brand support systems starting in 2014. The breach was not discovered until September 2018 [18]. In 2016, Yahoo announced that it was the victim of the greatest data breach in history. The attack compromised approximately one billion user accounts, real names, email addresses, birth dates, passwords, and telephone numbers [19].

(22)

2.1 Detection of DNS attacks

(23)

against rogue DNS servers. Further experimental results showed that it is efficient and can be easily fit correctly into any network domain. The proposed system in [4] works as passive DNS that monitors all DNS requests, including queries and responses, and detects payload distribution channels that are established within DNS messages using a DNS zone analysis profile module. Their detection module works based on all TXT records that have been captured, divided and aggregated based on one day period. In parallel, their system stores a copy of the captured data using the passive DNS channel in a database for future use. Three datasets have been used to evaluate their work: one month of passive DNS traffic, a passive DNS database, and a one-year malware database. Their experimental results show that the proposed system was able to find different malware that leverage DNS for their data transfer. However, the proposed system has several limitations. For example, the author decided to captured only DNS traffic that has TXT recoded to detect the payload distribution channels. In [27], the authors presented a comparative performance evaluation of DNS tunneling tools, and they showed that different tools are able to use not only TXT but also different types of DNS records in order to perform the DNS tunneling attacks. Botmasters or malware creators can bypass the proposed system by using their name servers or other open DNS resolvers which are not captured by the passive DNS sensors. As they mentioned in the paper, the proposed system could be configured to run in real-time. However, the system captured and aggregated DNS traffic in a time-based window, of one-day. Therefore, by that time, malware can completely transfer sensitive data outside the organization.

2.2 Detection DNS Tunneling attacks

(24)

such as HTTP or FTP within the DNS requests. Compared to data exfiltration, this attack requires a high frequency and volume of DNS traffic that produces easily identifiable traffic pattern. The DNS tunneling attack is often referred to as high throughput tunneling [2]. Finally, the command & control, C&C, communication channel provides the attacker the ability to take full control of a victim computer by sending commands and receiving the cor-responding result over the DNS protocol. C&C channels provide an attacker with the means to communicate with their malware after it has infected a victim [28]. Many open-source DNS tunneling tools were designed to perform the attack such as iodine, dns2tcp, DeNiSe, dnscapy, heyoka, fraud-bridge, dnscat, and dnscat2 [9]. Moreover, several types of malware were discovered that use DNS tunneling for establishing difficult to detect C&C channels for bot communications over the DNS protocol. The malware described in [29] and [30] uses DNS queries and responses to execute commands and control compromised machines over the DNS protocol. Therefore, securing them is essential.

2.3 Detection IDS/IPS systems

(25)

(26)

virtualization framework, which is a network emulator that creates a network of virtual hosts, switches, controllers, and links [35]. The authors in [36] demonstrated a method to deploy Splunk, which is a commercial software that is used to index a large dataset and provides keyword searching capabilities, dashboarding, reporting, and statistical analysis, within the network and walked through steps to install external modules for enabling the detection of DNS tunneling attacks.

2.4 Detection using Machine Learning

(27)

accuracy of 99.9%. However, the system was trained to detect only well-known tools which are based on expected behavior, and an attacker can bypass the system by manipulating these features. The authors in [39] developed a machine learning method of detecting DNS tunneling attacks including information exfiltration from compromised machines and DNS tunneling that was used to establish C&C servers. Their detection method relied on detect-ing DNS tunneldetect-ing attacks that use TXT records only. However, adversaries can bypass the detection of this method by performing the attack using different techniques and DNS records [27]. In [9], the authors discussed DNS tunneling attacks and its tools that use common and uncommon types of DNS records such as A, AAAA, CNAME, NS, TXT, MX, SRV and NULL [27]. Nadler et al., 2019 proposed a novel approach for detecting and deny-ing both DNS tunneldeny-ing attacks and low throughput data exfiltration over the DNS usdeny-ing machine learning techniques [2]. There are three main phases for performing their approach. The initial phase starts by collecting and grouping the DNS data log per domain name for an extended period of time. The next phase is extracting features based on querying be-havior of each domain name. Finally, a classification phase, which is conducted using the isolation forest algorithm, is used to determine whether the domain name was used for data exfiltration. Based on the classification model results, DNS queries and responses from or to these malicious domains will be blocked. They performed the approach on large-scale DNS traffic, DNS tunneling tools traffic, well-known malware traffic, and sensitive data that was injected into the dataset to evaluate their work. As a result, the system detects the attacks with a false positive rate of less than 0.002% [2]. However, the proposed method does not work in real-time, requiring time for collecting data to detect the low throughput malware traffic. During the collection phases, an attacker can use the DNS exfiltration technique and transfer sensitive data over the DNS.

2.5 Detection using Visualization

(28)

focused on visualizing network attacks such as botnets that cause various types of attacks such as DDoS, spam, and others. The visualization system was developed to provide visual information which helps in detecting botnet attacks using DNS traffic. The parallel coor-dinates technique, a common way of visualizing high-dimensional geometry and analyzing big and multi-dimensional data, was used to represent three different DNS parameters in real-time. Four graphical patterns were defined in that visualization system [10, 11]. In [40], a fast and scalable visualization system for detecting and identifying DNS tunneling attacks is proposed based on the parallel coordinates technique. The visualization system was developed to provide visual information which helps in detecting botnet attacks using DNS traffic. One week of data was analyzed to study the normal behavior of DNS, and it was compared to the behavior during a tunneling attack. Six DNS parameters for the parallel coordinates technique were identified and used to visualize the DNS traffic using the parallel coordinates technique in order to distinguish between regular DNS traffic and malicious traffic containing DNS tunneling attacks. Four different DNS tunneling patterns were identified, each representing different types of DNS tunneling attacks using the parallel coordinates technique [40]. In [41], Born and Gustafson (2010) introduced the NgViz tool that combines a visualization technique with N-gram analysis to examine and show DNS traffic from any suspicious activities. N-gram is used in text mining and natural language for processing tasks such as predicting the next item in a sequence. Their method compared a given input file of domain names with fingerprints of legitimate traffic. It is important to note that in this research, the authors supplied the comparison input of domain names and fingerprint of legitimate files manually.

2.6 DNSSec

(29)

DNS requests and verifying the non-existence of a domain, which makes DNSSEC effective against types of cache-poisoning attacks [43]. However, some of its limitations have been discussed in [44]. This solution is not sufficient for every network because DNSSec increases the amount of network traffic. It also comes with other issues such as broken validations which may cause a denial of service.

Even though some research has been conducted regarding approaches for detecting and identifying the DNS tunneling attack, some issues still deserve further attention. IDS and IPS detection techniques are based on pre-defined signatures by observing events and identifying patterns which can be applied on the signatures of known tools and attacks. One of the presumptions of the machine learning methods used for detecting malicious activities in traffic is to compare the characteristics of traffic with DNS tunneling with the normal DNS traffic even though some of the well-known tools are generating tunneled traffic similar as possible to the regular DNS queries [38].

(30)

3 Visualization of DNS Tunneling Attacks

The Domain Name System (DNS) is considered one of the most critical protocols on the Internet. The DNS translates readable domain names into Internet Protocol (IP) addresses and vice-versa. The DNS tunneling attack uses DNS to create a covert channel for bypassing the firewall and performing command and control functions from within a com-promised network or to transfer data to and from the network. There is work for detecting attacks that use DNS but little work focusing on the DNS tunneling attack. In this chapter, we introduce a fast and scalable approach, using the parallel coordinates technique, visualiz-ing a malicious DNS tunnelvisualiz-ing attack within the large amount of network traffic. The DNS tunneling attack was performed in order to study the differences between the normal and the malicious traffic. Based on different scenarios, four different DNS tunneling graphical patterns were defined for distinguishing between normal DNS traffic and malicious traffic containing DNS tunneling attacks. Finally, the visualization system was able to visualize the DNS tunneling attack efficiently for the future work of creating an efficient detection system.

3.1 Introduction

Visualization is not a new concept for monitoring network traffic. There are several papers that used the parallel coordinates technique to monitor and visualize network traffic. There are various benefits of using a parallel coordinates technique. First, it is scalable. It helps to represent a considerable amount of data in an efficient manner which makes analyz-ing big data more simple and powerful. Second, there is no limit on the number of values for representing an attack using the parallel coordinates technique that converts multidimen-sional data into two dimensions. Therefore, using the parallel coordinates technique gives the ability to represent more than three different values within a two-dimensional space. In this way, the relationship between the results can be reviewed quickly showing prominent trends, correlations, and divergence from the raw data. Finally, the visualized data using the parallel coordinates technique does not include any bias for any column within the space [12, 13]. In other words, every visualized feature has the same weight.

(31)

is proposed for recognizing DNS tunneling attacks using the parallel coordinates technique. The proposed system is able to distinguish between normal and malicious DNS traffic. (2) A real-world DNS tunneling attack is conducted within our network to study the behavior of the attack traffic. Four different graphical patterns are defined in this work by analyzing the DNS tunneling traffic based on different scenarios.

3.2 Visualization System Design

In this section, we describe the system architecture design of a visualization system for visualizing and identifying the Domain Name System (DNS) tunneling attacks using attack patterns as shown in Figure 3.1. The visualization system consists of three main components: Parser, Analyzer, and Visualizer. DNS data can be passively captured at the network edge or it could be read from existing PCAP files. The Parsing was coded in the Go programming language that interfaces with the TCPDUMP library for collecting the data quickly and efficiently. The captured packets are filtered to provide all User Datagram Protocol (UDP) packets on port 53. Note that DNS can also use TCP port 53 for things such as zone transfers but it was decided to only capture the more common DNS traffic that uses UDP. Performance is one of the essential factors necessary for this system. Therefore, the focus is on the DNS response data sent by the DNS server that confirms the communications between the local computer and the local or global DNS servers. Another reason why DNS response packets were considered, is that it decreases the number of captured packets, resulting in faster computations. The DNS response packets have a significant amount of information. Therefore, it was found that a subset of information could be used to visualize and identify DNS tunneling attacks and the parameters include the ID, domain and subdomain names, TTL (Time To Live), source and destination IP, date and time. This DNS data is stored in a database or file as input into the Analyzer.

(32)

Figure 3.1: System Architecture Design

then it considers it as a thread. The Analyzer module sends the suspicious domain names to the Visualizer.

Finally, the Visualizer displays the analyzed data using the parallel coordinates tech-nique. This technique uses each coordinate to represent the six different parameters: source and destination IP address, domain and subdomain names, TTL, and the time. For ex-ample, assume that a workstation with 192.168.0.1 IP address sends a DNS request to the global google DNS server that has 8.8.8.8 IP address to resolve the following domain name www.attacker.com. Then, the Visualizer module represents the source coordinate with the local workstation IP address 192.168.0.1, the destination coordinate as 8.8.8.8, attacker as the domain name value, and www as a subdomain name value. If we have thousands of requests, the visualization system displays a funnel-like pattern.

3.3 Graphical patterns for DNS tunneling attacks

(33)

Therefore, the DNS tunneling attack requires sending a considerable number of queries in order to transfer the data and this leads to the ability to identify these types of attacks.

In the parallel coordinates technique, each coordinate represents a particular value. The first coordinate is the source IP address which has the value of the computer that receives the DNS query while the second coordinate is the destination IP address that forwards the DNS request. The third coordinate represents the domain names, and the fourth coordinate contains subdomain names. Then, the fifth coordinate is the TTL value that is set by the Authority server which allows a DNS client to cache the answer for a particular time, while the time is the last coordinate. With these values, our visualization system shows a graphical pattern that can be matched to a template of the predefined graphical patterns of unusual DNS traffic.

Suppose that an attacker compromised a computer within a network and he needs to send stolen data out secretly without being detected by the firewall. The attacker can use a DNS tunneling attack to hide his movements or to transfer data outside the network. In this case, the attacker requires a pre-configured fake DNS server outside the network. From inside the compromised network, he can connect to his fake server via the DNS protocol that is allowed by the firewall in most networks. During the DNS tunneling attack, it sends a tremendous number of requests to the DNS server within a short amount of time actually transferring data instead of just sending DNS requests. In this work, we ran the DNS tunneling attack inside our network and we transferred a 12 MB text file. During one DNS tunneling attack, it generated a considerable number of DNS requests (52 thousand queries), which is significantly more than normal. This kind of attack splits the file into small pieces and encodes each piece with a specific encoding technique [9]. Then, it encapsulates the pieces into the DNS packets to transfer it to the preconfigured fake DNS server that has multiple preconfigured subdomain names.

(34)

DNS requests within a short time where it has a constant TTL value. The previous example shows all 52 thousand requests have a TTL value of zero. Figure 3.2.B or malware scenarios using the traditional DNS attack tunneling technique to hide their C&C communications inside the network. It shows that we have many internal computers sending and receiving many DNS requests within a short time. In [45], they presented the Heyoka tool which is one

Figure 3.2: Graphical Patterns for Traditional DNS Tunneling Attack.

(35)

Figure 3.3: Graphical Patterns for Heyoka DNS Tunneling Attack.

Heyoka tool when it has been used by botnet clients or malware situations from multiple computers.

3.4 Evaluation

3.4.1 Experimental setup

In this section, we show our experimental results of the experiments we performed to evaluate the visualization system. The evaluation method consists of two main parts. The first part is performing the DNS tunneling attack. The second part is importing the captured DNS traffic data to the visualization system.

(36)

Figure 3.4: Top Five Domain Name for an Entire Week without the Attack.

external server with a static IP address. Debian Linux server was installed and set up in the cloud for this purpose. The DNS tunneling attack was performed using Kali Linux in our internal network to encapsulate and upload a 12 MB text file to an external FTP server during a weekday. Moreover, we also sniffed and monitored normal DNS traffic during regular daily use for an entire week. There are two significant reasons for sniffing and capturing a regular DNS traffic in our experiment. The first reason is to compare and study the normal behavior of DNS traffic against the DNS traffic that has the DNS tunneling attack while the second reason is to determine and set the value of threshold based on the normal DNS traffic level.

3.4.2 Results

After performing the attack, the PCAP file was stored for the later analysis. Next, using the Analyzer module, the captured data was analyzed excluding the DNS tunneling attack as shown in Figure 3.4. We found that Microsoft.com is the value that occurs most frequently in the data set, having 8836 DNS requests for an entire week and 170 requests per an hour as a maximum value.

(37)

and 40993 requests per an hour as a maximum value. It is imperative to note that the attack has run for only two hours. This experiment was used to empirically determine the threshold. In this environment, the sum of the number of requests of all domain names was 489 per hour. Therefore, the value of the threshold has been set to 500 DNS requests in our experiment. It is important to note that the threshold has to be set appropriately for each network depending upon the typical traffic levels.

Figure 3.5: Top Five Domain Names during the Attack over a Two-hour Period.

(38)

Figure 3.6: Visualizer Module Shows only the Attack.

Figure 3.7: Visualize the Entire DNS Captured Data.

3.4.3 Issues

(39)

but not a large number. As soon as our data set became larger, the system was not able to process the data as efficiently resulting in slower turnaround times. The Parser module was not able to process big PCAP files and a large number of DNS packets, and the Visualizer module was taking more than 16 hours of processing time to analyze and display the data on the parallel coordinates. Therefore, the system was rewritten in the Go Programming Language [47] and the R Language [48]. After rewriting the code, 5GB PCAP files with more than 52000 DNS packets are parsed in less than 12 seconds, while R processes and presents the parsed data in approximately 5 seconds.

3.5 Conclusion

(40)

4 DNS Tunneling Attack Detection using Machine Learning

Individuals and organizations rely on the Internet as an essential environment for personal or business transactions. However, individuals and organizations have been primary targets for cyber-security attacks that steal sensitive data. Adversaries can use different approaches to hide their activities inside the compromised network and communicate covertly between the malicious servers and the victims. The domain name system (DNS) protocol is one of these approaches that adversaries use to transfer stolen data outside the organization’s network using various forms of DNS tunneling attacks. DNS protocol is considered one of the most important internet protocol because it is available in almost every network, ignored, and rarely monitored. In this work, our research focuses on implementing a reliable and robust detection system against DNS tunneling attacks using visualization and machine learning techniques. The detection system is able to detect DNS tunneling attacks, which are either embedded in malware or be used as stand-alone attacking tools. The system consists of three main modules: the Parser, the Analyzer, and the Detector. The visualization module’s output will be input for the developed detection module that will use machine learning algorithms to decide whether a DNS tunneling attack occurs. The system is trained, tested, and evaluated using metrics derived from a confusion matrix that is commonly used to measure intrusion detection systems’ performance. It is shown that the system is capable of detecting the DNS tunneling attack with a low false rate and high accuracy. The results show that all of the machine learning algorithms tested, including Decision Tree, Logistic Regression, Na¨ıve Bayes, K-Nearest Neighbors, Random Forest, and Support Vector Machine classifiers, used in the experiment achieved 99% accuracy to identify the attack.

4.1 Methodology 4.1.1 System Design

(41)

visualiza-tion system was developed to provide visual informavisualiza-tion for detecting botnet attacks using DNS traffic for communication. Six DNS features are used as parameters for the parallel coordinates technique to identify and visualize the DNS traffic to distinguish between regu-lar DNS traffic and malicious traffic containing DNS tunneling attacks. Four different DNS tunneling patterns were identified, each representing various types of DNS tunneling attacks [40]. The DNS tunneling attack signatures were derived from the work in [40] to implement the detection system to label and prepare the machine language algorithms’ dataset. Also, we evaluate the detection system on a dataset that was generated within the lab. The evalu-ation method consists of two main parts. The first part performs the DNS tunneling attack to create a dataset, and the second part imports the captured DNS traffic dataset to the detection system. We employ the four types of attacks in our detection system and generate relationship signatures for each one to detect and flag the attacks. Figure 1.4 shows the original visualizations and Table 5.1 shows the attack relationship signatures. During the analysis process, we used these signatures to label the data for training the machine learning algorithms to detect the DNS tunneling attack.

Figure 4.1: The DNS Tunneling Attack Patterns

(42)

names, time-to-live (TTL) values, and time values. There is an easily identifiable pattern that is converted to a signature represented as the tuple (1, 1, 1, m, 1, 1), which means there is 1 source IP address, 1 destination IP address, 1 domain name, multiple (m) subdo-mains, a single TTL, and a single or small window of time. The traditional DNS tunneling attack, Botnet DNS tunneling attack, Heyoka DNS tunneling attack, and Heyoka/Botnet DNS tunneling attack all have unique visualizations that were converted to attack relation-ship signatures.

Table 4.1: DNS Tunneling Visualization Signatures

Attack Signature Representation Normal\Malicious

Traditional D. T. 1,1,1,m,1,1 Attack Type1 True

Botnet D. T. m,1,1,m,1,1 Attack Type2 True

Heyoka D. T. 1,m,1,m,1,1 Attack Type3 True

Heyoka /Botnet D. T. m,m,1,m,1,1 Attack Type4 True

No Attack Not Above 0 False

4.1.2 The Detection System

(43)

and subdomain names, DNS record type, TTL (Time-to-Live), source and destination IP, SRV’s IP (Service IP), data length, and finally the date and time. These DNS features are stored in a database or file as input for the next module which is the Analyzer module.

The next module is the Analyzer module, which cleans and prepares the DNS captured data and then passes it to the next module. The Analyzer retrieves the captured DNS data from the database, and it validates the data by checking the captured data to verify that it is a proper DNS packet to avoid further issues. Then, the Analyzer module filters the data to eliminate duplicate records based on the ID number and it applies analytical and mathematical functions on the captured data. Examples of analytical and mathematical function are calculating the number of requests for each domain and its unique subdomain names, the number of source IPs per domain, the number of destinations of IP per domain, the number of SRV of IPs per domain, the average number of subdomain lengths per domain, the average number of data lengths of each DNS request per domain, and the average number of TTLs per domain. The Analyzer module also generates the attack signature for each domain and compares it with the malicious attack signatures to label the domain name as malicious or not. Finally, the Analyzer module prepares the analyzed data and stores this information into a file to pass to the machine learning algorithms to determine if there any malicious domain names. Also, the Alexa and umbrella top 1000 domains [49] are used to identify and label the captured domain name as legitimate.

(44)

4.2 Results and Discussion 4.2.1 Evaluation

In this section, we discuss the evaluation method for the DNS tunneling detection system. To evaluate the performance of the methodology, we created the normal DNS dataset. Three months of DNS traffic were captured within the network. The network environment is a home-based lab, consisting of more than 20 devices, including IoTs, servers, mobiles, PCs, and laptops. There are two significant reasons for capturing regular DNS traffic in the experiment. The first reason is to compare and study the normal behavior of DNS traffic against the DNS traffic that has the DNS tunneling attack while the second reason is to determine and set the value of the threshold based on the normal DNS traffic level.

Due to the importance of the DNS protocol, DNS tunneling tools were used to simu-late the tunneling attacks using various methods. For example, the dns2tcp DNS tunneling tool uses the TXT record types to perform the tunneling, whereas Iodine DNS tunneling tool uses NULL records [50]. However, due to the lack of available DNS tunneling datasets, we implemented and created a DNS dataset that has malicious DNS traffic. The DNS tun-neling dataset includes various DNS traffic generated from DNS tuntun-neling tools, including Iodine, DNScat2, dns2tcp, and fraud-bridge. Finally, normal and malicious DNS datasets are combined and then imported to the detection system and evaluated.

4.2.2 Experimental Setup

This work is a network-based solution which has the ability to distinguish and identify between normal and malicious DNS traffic. In order to detect the malicious DNS traffic, we require that the system intercepts the network DNS traffic. To guarantee that the system captures all transmission DNS traffic, the plan is to place the system on the network edge or to run it as a DNS proxy, which looks up DNS queries from a local cache. A DNS proxy makes the process of resolving the domain name more efficient by performing the lookup on behalf of the client and caching common queries. In addition, the detection system will have the ability to read DNS traffic from an existing PCAP file, which helps to identify and detect the attack offline from previously captured network traffic.

(45)

in detecting botnet attacks that are using DNS traffic for communication. The evaluation process relies on two parts including: create the dataset that has the DNS tunneling attacks and import the generated dataset into the detection system.

To perform a DNS tunneling attack, we register a domain name and set up an external server with a static IP address. A Debian Linux server is deployed in the cloud for this purpose. The DNS tunneling attack is performed using the Kali Linux distribution that has many DNS tunneling attack tools built-in on the internal network to encapsulate and upload a 12 MB text file to an external FTP server. Also, we configured a couple of local domain nameservers to perform other DNS tunneling attacks to generate more data for our evaluation. During the attacks, we captured DNS traffic as PCAP files. As previously mentioned, we captured the normal DNS traffic on a three-month window to generate our dataset for the evaluation process.

Figure 4.2: DNS Tunneling Tools

(46)

three-month window period. On the other hand, the number of requests to the dns4phd.me domain name is performed a couple of hours during the attack to upload a text file via an FTP server over the Internet while the other tools performed within a short time so that they generated fewer DNS requests within the home-lab environment.

Figure 4.3: Top 25 Domain Names including DNS Tunneling Attack Tools

(47)

method.

4.2.3 The Dataset

After performing the attacks, the PCAP files were stored for the later analysis. Next, using the Analyzer module, the captured data was analyzed, excluding the DNS tunneling attack. The frequency of the top 10 domain names is shown in Table 6.1. We found that encipher.io domain name is the value that occurs most frequently in the data set, having 344776 DNS requests for the three-month window period while the youtube.com domain name has the lowest frequency number in that list, having 68956 DNS requests.

Table 4.2: Frequency of Top 10 Domain Names

Frequency Domain Name

344776 encipher.io 339618 antsle.com 168649 us-east-1.amazonaws.com 148326 google.com 143410 wd2go.com 132706 app-measurement.com 85932 googleapis.com 80354 Microsoft.com 76220 facebook.com 68956 youtube.com

We also analyzed the complete dataset to determine some dataset relationships be-tween normal and malicious DNS requests. The Analyzer module marks the dataset with two types of labels: No attack and DNS ATTACK, which helps the machine learning algo-rithms train on this dataset with the labels and then tests them while excluding the labels. Fig 4.4 represents the malicious DNS requests against the normal one. Our dataset has 3646 regular DNS requests; on the other hand, the dataset has 33 DNS malicious requests that been generated using various DNS tunneling tools, as we mentioned earlier.

(48)

Figure 4.4: The Normal and the Malicious DNS requests in the dataset

Thus, the number of DNS requests for two hours of performing DNS tunneling attack will be higher than the three-month window period of capturing DNS requests.

The DNS Tunneling attack encapsulates the TCP protocol packets and sends them over the domain name system. Due to the DNS protocol limitations, the TCP packets need to be split into pieces to be sent with the DNS protocol. As a result, a successful attack sends many DNS requests per minute in order to perform a certain task. Thus, to confirm that we examine and study the DNS features of the normal and the malicious DNS requests within our lab. Fig 4.5 shows the distribution of the overall total requests for the normal and the malicious DNS requests for the dataset. This shows that the total number of DNS requests with the DNS ATTACK label is much larger compared to DNS requests labeled No attack and also have very different distributions. The different distributions demonstrate graphically why the machine learning algorithms are able to distinguish between normal DNS requests and malicious DNS requests.

(49)

Figure 4.5: The Total Number of the Normal and the Malicious DNS requests in the dataset

versus the DNS are very different.

4.2.4 Evaluation Metrics

In this section, the evaluation metrics for evaluating the detection system are defined. Accuracy and effectiveness are essential factors in the evaluation process, which describes the ability of the detection system to differentiate between intrusive and non-intrusive activities.

Confusion Matrix

The confusion matrix is a tool for determining the performance of the detection system module. It contains information about actual and predicted classifications. Most researchers primarily focus on evaluating and measuring the intrusion detection systems (IDSs) based on the accuracy and effectiveness in terms of false alarm rate and the percentage of attacks that are successfully detected [51]. Similarly, this work will evaluate and measure the proposed system using the confusion matrix as defined below:

1. True positives (TP) - the number of intrusive activities that are correctly detected by the detection system.

(50)

Figure 4.6: The Total Number of the TTL and Data Length in the Normal and the Mali-cious DNS requests in the dataset

3. True negatives (TN) - the number of normal or non-intrusive activities that are cor-rectly identified as normal by the detection system.

4. False negatives (FN) - the number of intrusive activities that are incorrectly marked as normal or non-intrusive by the detection system.

Figure 4.7 provides information about the confusion matrix of TP, TN, FP, and FN, which represents true and false classification results.

(51)

Metrics from confusion matrix

In this section, different classification measures that are derived from the confusion matrix values are described. The confusion matrix metrics produce numeric values that are simple to compute and compare [51], including true-positive rate (TPR), false-positive rate (FPR), precision (PR), and accuracy, using the following formulas [14]:

1. True-positive rate (TPR) is a ratio between the number of true positives and the sum of the true positives and false negatives. It is used to measure the proportion of actual positives cases that are correctly identified as a DNS tunneling attack. Also, recall and sensitivity are other machine learning terms that refer to a true-positive rate.

T P R = T P

T P + F N (4.1)

2. False-positive rate (FPR) defines as the proportion of negative cases that are incorrectly identified as positive cases in the DNS traffic.

F P R = F P

F P + T N (4.2)

3. Precision (PR) is the ratio of correctly predicted positive attacks to the total predicted positive attacks. For example, consider an email spam detection system. A false positive means that a non-spam email (actual negative) has been identified as spam (predicted spam). The email user might lose important emails if the precision is not high for the spam detection model.

P R = T P

T P + F P (4.3)

4. Accuracy is the ratio of correctly predicted attacks, true positives plus true negatives, to the sum of true positives, false positives, true negative, and false negatives. If the system performance has a high accuracy value, then the system is considered as better than a system with lower accuracy.

Accuracy = T P + T N

T P + +F P + T N + F N (4.4)

5. F1-Score defines as the harmonic mean of precision and recall models, and it is a method of combining and weighted average of both precision and recall.

F 1 = 2 × Recall × P recision

(52)

4.2.5 Results

In this section, the results of the work are discussed. The aim of this research designs a detection method that identifies and detects the DNS tunneling attacks from network traffic. Our method starts by capturing the DNS traffic of the home-base network for a three-month window period using the Parser module. The dataset has more than 3,500,000 DNS requests before sending it to the Analyzer module. Then, the Analyzer module processed the DNS requests and preprocessed the requests to filter the dataset to be ready for the machine learning algorithms. The dataset is being used during the experiment. It contains normal DNS requests and malicious DNS requests, with 3646 normal DNS requests and 33 malicious DNS requests that were generated using various DNS tunneling tools. Finally, the Detector module splits the dataset into the training and testing dataset and then applies machine learning algorithms to identify and detect any malicious activities within the DNS traffic. The results shows that all of the machine learning algorithms that were used in the experiment achieve 99% accuracy to identify the attack. Table 4.4 compares the machine learning classifiers, including Decision Tree, Logistic Regression, Na¨ıve Bayes, K-Nearest Neighbors, Random Forest, and Support Vector Machine classifiers based on the precision, recall, F1-score, and accuracy.

(53)

Table 4.3: The confusion matrix values for the machine learning classifiers. ML. Classifier (TP) (FN) (FP) (TN) NDecision Tree 17 0 3 1820 Logistic Regression 17 0 3 1820 K-Nearest Neighbors 17 0 3 1820 Random Forest 17 0 3 1820 Na¨ıve Bayes 16 1 5 1818

Support Vector Machine 14 3 0 1823

cases out 17. In addition, the results show that the classifier incorrectly labeled one case as DNS Attack. In the case of false-positive, the classifier was incorrectly labeled five cases as a No Attack. Finally, the Support Vector Machine classifier was able to identify 14 cases of the DNS attacks out 17, which means it three two cases. The Support Vector Machine classifier had false-negatives, incorrectly labeling three cases as DNS Attack. It had no false positive cases, and the true-negative values of shows the classifier identify 1823 cases as normal.

(54)

malicious DNS requests, while the Naive Bayes missed five cases of malicious DNS requests. The Support Vector Machine classifier did not miss any malicious cases, but instead, it marked 9 cases as malicious DNS requests where were not malicious.

Table 4.4: Compared Results Among Machine Learning Classifiers ML. Classifier Precision Recall F1-score Overall Accuracy

Decision Tree 0.85/1.00 1.00/1.00 0.92/1.00 99.836

Logistic Regression 0.85/1.00 1.00/1.00 0.92/1.00 99.782

Na¨ıve Bayes 0.76/1.00 0.94/1.00 0.84/1.00 99.728

K-Nearest Neighbors 0.85/1.00 1.00/1.00 0.92/1.00 99.673

Random Forest 0.85/1.00 1.00/1.00 0.92/1.00 99.836

Support Vector Machine 1.00/1.00 0.82/1.00 0.90/1.00 99.836

Conclusion

(55)

5 System Methodology

In this chapter, the detection system is introduced and discussed the system architec-ture design and its main components. Our approach is a network-based detection system that can identify and detect malicious DNS requests, including DNS tunneling, data exfiltration, and command and control channels communications over the DNS protocol.

5.1 Background

The Domain Name System (DNS) considers as one of the most significant elements in the Internet infrastructure. Many protocols and applications such as HTTP, email, and FTP applications rely on DNS to resolve human-readable domain names to the corresponding IP address. Due to the DNS’s importance, adversaries misuse and exploit the DNS protocol in many various ways to achieve their malicious goals. Also, DNS-based attacks are one of the most serious cyber-threats used to steal an online identity. According to a global 2019 DNS Security survey, 91% of malware uses DNS services to establish their cyberat-tacks, and a DNS attack had targeted 82% of survey respondents in 2018 [15]. The 2020 global DNS threat report shows that the number of victimized and attacked organizations via DNS-based attacks increased by 82%. Moreover, the report exhibits the top DNS-based attacks, including DNS phishing, DNS-based malware, DNS amplification, and DNS tun-neling attacks. DNS attacks and business consequences are explicitly interlinked, and there are direct strong measurable business impacts. The report shows that such attacks targeted various sectors such as healthcare, education, government, transportation, and utilities and cost more than one million dollars as impacts on their businesses. This section introduces the detection system, which is a network-based solution against malicious DNS requests.

5.2 DNS Proxy

(56)

advantages is that it has full control over the incoming and outcoming requests using either whitelist or blacklist domain names for blocking certain domain names. It may also have the ability to cache DNS records for speeding up the domain name lookup process. DNS Proxy server features are available in almost every enterprise firewall appliance devices. In this work, we used the DNS proxy concept to fully control the DNS queries and requests as well as block the attacks as soon as identified and detected. The current state of the detection system is to identify, detect, and block DNS attacks.

5.2.1 DNS Proxy as a Detection System

In this work, we used the term “DNS proxy” to implement the detection system. Our Golang code is hosted on a dedicated server within the network to run as a DNS proxy server to have a full control over the DNS traffic, including the incoming and outcoming DNS queries. Therefore, in order to implement our methodology, there is a particular setup that needs to be set. First, the network router needs to configure the DNS IP address section to point to our DNS proxy server. The DNS proxy is placed within the home network to intercept clients’ DNS queries. As soon as the DNS proxy receives the DNS queries, it applies several analytical functions and then forwards it to the actual DNS server, the Internet provider DNS server, or a public DNS server. After the internet provider DNS server receives an answer, it replays it to the DNS proxy to send it back to the client, as shown in 5.1.

(57)

5.3 System Design

The detection system’s main purpose is to distinguish the DNS traffic’s abnormal behavior and block malicious domain name requests. The detection system needs to be built with multiple modules using different approaches and communication patterns in order to detect both the DNS tunneling attacks and the DNS exfiltration technique, including malicious software and Malware Bots. For example, in some cases, the DNS exfiltration attack is used as the low-throughput DNS technique. The attacker sends a DNS request every x minutes to exfiltrate sensitive data outside the organization’s network through security appliances without been detected. In contrast, the DNS tunneling attacks method must use the high-throughput DNS technique due to the communication patterns. For example, it has to send and receive an enormous amount of data over the DNS protocol to establish a VPN connection.

This section describes the network-based detection system’s system architecture de-sign, consisting of two main analysis modules: The Payload Analysis (PA) and the Traffic Analysis (TA) modules. As soon as the DNS proxy receives a DNS query from a client, the first module, the payload analysis, intercepts and operates in real-time. The second mod-ule, the traffic analysis, works every x minutes to parse, analyze, and detect the incoming DNS requests, where x is configured based on the daily incoming traffic in the network. We implemented two main analysis modules to detect most malicious DNS requests, including low and high throughput DNS tunneling and exfiltration techniques, whereas identifying and detecting low throughput techniques via payload analysis module in real-time and high throughput techniques via traffic analysis module every x minutes. Fig 5.2 shows the DNS proxy detection system’s architecture design, including the payload and traffic analysis mod-ules.

5.3.1 Parsing DNS Data

(58)

Figure 5.2: The Architecture Design of DNS Proxy Detection System.

and then sends it back to the client. Meanwhile, it also passively captures the DNS requests and stores them into a PCAP file to supply the traffic analysis module with DNS data for further and more extensive analysis.

5.3.2 Feature Extraction

The DNS protocol’s nature design is to exchange specific information related to do-main names [2]. It was not designed to transfer data over the DNS protocol. The DNS protocol has various limitations, such as the length of any label of the domain name is re-stricted to between 1 and 63 bytes as well as a full domain name is limited to 255 bytes of letters, digits, hyphens, and separators are included [3]. Even though these limitations exist, adversaries may abuse and leverage the DNS protocol to transfer data to their authoritative name server and establish a covert channel for their activities. The attacker can include as much information in the DNS queries to pass the data. Instead of sending legitimate subdomains, adversaries take advantage of the domain length limitation and encode some data in the domain name.

(59)

DNS traffic for other features such as the volume of DNS traffic per IP address, the frequency of DNS traffic per domain, visualization of the traffic, the number of hostname per domain, and other analysis features. Therefore, a set of features needs to be considered.

5.3.3 The Payload Analysis Module

The Payload Analysis (PA) technique focuses on identifying and analyzing critical features of the DNS queries. It intends to focus on monitoring and observing the DNS traffic that uses the particular patterns and indicators. Previous research and studies show that malicious domain names can be found and represented in different forms and encoded characters. This section describes the method of the payload analysis module approach. Implementing the payload analysis module is to identify and detect the low-throughput data exfiltration or DNS tunneling. One technique to bypass the firewall and in-place security appliances is to use low-throughput techniques. The adversary sends a malicious request every minute to avoid being detected. Adversaries use various data encoding techniques and communication patterns in the DNS protocol to carries binary files, sensitive or stolen data outside the organization’s network. Examples of data encoding techniques include MD4, MD5, Base32, Base64, hex, and others. Thus, this section exhibits the approach used to analyze and identify the DNS protocol for these techniques.

The Payload analysis module relies on DNS features, focusing on collecting and ana-lyzing DNS query details, including statistical analysis and the information entropy. Thus, we need to perform DNS feature extraction and examine them to detect if any DNS malicious occurs. In the payload analysis module, the feature extraction phase works in real-time on the network edge side, using a DNS proxy. Every time the DNS proxy server receives a DNS request from a client, it extracts the DNS features for the analysis process. It then passes it to the analysis module, which applies statistical and analytical functions to captured DNS requests for analyzing to determine malicious activities. DNS features extraction includes the Fully Qualified Domain Name (FQDN), query record type, a subdomain of FQDN, and the subdomain’s query length, encoding the query data’s subdomain, blacklisted and whitelisted domain names, and information entropy.

(60)

about the domain name. Thus, the domain name system is limited to a certain number of bytes of letters and symbols. However, that depends on the type of DNS query that is being used. The DNS text (TXT) type record lets system administrators describe the service of the domain name by entering human-readable notes. Therefore, many DNS tunneling tools leverage the DNS TXT type record to transfer as much data as possible. In this work, the detection system monitor uncommonly used record types such as TXT and NULL.

2. Data Encoding: Adversaries take advantage of the domain name system to pass bi-nary files, sensitive information, or stolen data over the DNS protocol. They most likely include encoded or encrypted data in the domain name before passing it to make it look like a part of the domain name request. For example, if the thepassw0rd.example.org domain is queried, the adversary’s authoritative name server, for example.org, receives the passw0rd as a string of hostname but interpret it as incoming data instead. Also, adversaries may use some other encoding techniques to transfer text or binary data to their servers. Another section that may have encoded or encrypted data is the description section in the TXT requests. Therefore, the detection system monitors the domain name system and looks for encoded or encrypted data, including Base32, Base64, Based128, hex, md5, md4, and other Non-English characters.

3. Length of DNS query: The DNS tunneling tools utilize the domain names to transfer as much data as possible over the DNS protocol. Therefore, the domain names’ and hostnames’ lengths are considered abnormal behavior if they exceed a certain length value, DNS requests’ lengths, including the length of characters and labels, the number of dots. In a 2012 presentation at the RSA conference, Ed Skoudis identified a new DNS-based Command and Control of malware that uses the DNS exfiltration technique to transfer data over the DNS protocol [52, 53]. Based on that finding, they recommend most of the Fully Qualified Domain Names (FQDN) longer than 52 characters and most hostnames that are longer than 27 unique characters should be considered suspicious [52, 53].