Scope of study

This thesis shows the effectiveness of information-based intrusion detection techniques that involve string matching and compression on Honeypot network data. The proposed system is designed to aid existing signature-based intrusion detection systems, with the additional capability to detect variations of network intrusions in the form of malicious streams of trojans, viruses, worms, shellcode and other active threats. To enable intelligent learning for automated classification, exemplar-based learning techniques are investigated to detect variants of known attacks while they are being transferred over a network. Similarity between known malicious stream samples and the incoming traffic is calculated based on shared information or shared symbols. These similarities are quantified to yield a similarity score with a certain level of confidence. The score is used to classify examples of known attacks and to add new classes of attack where there is no known exemplar, so that the system can engage in lifelong learning and continue to extend itself. The system is integrated with our existing Honeynet setup, and we demonstrate our approach using data from real Honeypots.

In this thesis, the research scope is restricted to intrusion detection schemes that can perform packet payload analysis. This involves:

• Collection of malicious network streams for ground truth.

• Detection of malicious network streams and their variants (Supervised Learning).

• Detection of new or novel malicious network streams (Unsupervised Learning).

• Classification of all these streams.

[17] File Transfer Protocol

[18] Hypertext Transfer Protocol

[19] Simple Mail Transfer Protocol

1.4.1 Collection of malicious network streams for ground truth

In order to evaluate an IDS, it is necessary to have or create a dataset to serve as ground truth. This dataset should consist of known malicious streams and their variants. Variants can be synthesized or captured from live traffic. Details of known malicious streams are studied and collected from sources such as:

Existing documented sources

It is important to start with known samples of malicious streams that can be labelled and verified to train and test our system. Details of known malicious streams are studied and collected from existing knowledge sources such as CVE[20], CERT[21], Snort Signatures[22] and existing datasets collected by researchers and organizations (discussed in Chapter 2). These knowledge sources document the attack or exploit in detail, but often do not include live attack samples. They can nevertheless be used to craft or synthesize attack traffic and its variants.

Live sensors

Most network security tools are passive in nature; for example, firewalls and IDS. They operate on available rules and signatures in their database. Anomaly detection is limited only to these sets of available rules. Any activity not in alignment with those rules goes undetected. In order to achieve a better insight into the attacks and attacker tactics, there is a need to set up a vulnerable environment that lures an attacker, to study their behaviour. To this end, apparently benign computer systems designed to keep detailed logs of system activity are widely deployed today by security researchers. These systems, known as ‘Honeypots’, are designed to record a hacker’s activities to gain an insight into the methods used. The logs typically include intruder keystrokes, processes, and system-wide and network data. Over the years, researchers have successfully isolated and identified worms and exploits using Honeypots placed in specialized architectures called Honeynets. Honeynets are capable of logging information far more effectively than any other available security tool, providing the capability to study hackers “under a microscope”. We have deployed multiple Honeypot sensors emulating various services such as FTP, HTTP, SSH, DNS[23], SMB[24], RIP[25], DHCP[26], NTP[27], POP[28], TELNET, SOCKS, SNMP[29], IRC[30], RPC[31], RDP[32], VNC[33], XMPP[34] and others to collect malicious network streams.

[20] Common Vulnerabilities and Exposures: http://cve.mitre.org/

[21] Computer Emergency Response Team: http://www.us-cert.gov/ncas/alerts/

[22] Snort Rules: http://www.snort.org/snort-rules/

These live sensors provide three main advantages over other techniques:

• They provide an advantageous position for attack analysis, as the researchers can control both the endpoints (Honeypots) and the intermediary network (Honeynet).

• Since they have no production value, any connection attempt to them is considered suspicious or malicious. This avoids the intensive pre-filtering that may be required in production networks.

• They serve as an excellent source to collect live samples of malicious network streams and malware.

1.4.2 Detection of malicious network streams and their variants: Supervised Learning

Once a ground truth has been created in the form of a labelled dataset of malicious network streams and their variants, the next challenge is to investigate methods to detect similarity between known malicious streams and their variants. The key research questions addressed are:

Q: Is it possible to measure similarity between samples of malicious network streams?

Q: Is it possible to measure similarity between variants of malicious network streams with statistical confidence?

Q: How should similar samples be grouped?

Q: How should appropriate exemplars be selected from known malicious stream samples for classification (the instance selection problem)?

[23] Domain Name Service

[24] Server Message Block

[25] Routing Information Protocol

[26] Dynamic Host Configuration Protocol

[27] Network Time Protocol

[28] Post Office Protocol

[29] Simple Network Management Protocol

[30] Internet Relay Chat

[31] Remote Procedure Call

[32] Remote Desktop Protocol

[33] Virtual Network Computing
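As a minimal sketch of the instance selection question, one simple option is to keep the medoid of each labelled group as its exemplar, that is, the sample with the smallest total distance to the other samples in the group. This is an illustrative assumption rather than the selection method used in this thesis; the names `string_distance` and `select_medoid` are hypothetical, and the distance is based on Python's `difflib` matching heuristic.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def string_distance(x: bytes, y: bytes) -> float:
    """Distance in [0, 1]: 0.0 for identical payloads, 1.0 when nothing matches."""
    return 1.0 - SequenceMatcher(None, x, y, autojunk=False).ratio()


def select_medoid(
    samples: Sequence[bytes],
    distance: Callable[[bytes, bytes], float] = string_distance,
) -> bytes:
    """Return the sample with the smallest summed distance to all samples (assumed exemplar rule)."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(samples, key=lambda s: sum(distance(s, other) for other in samples))


# Hypothetical toy group: the mixed payload sits between the two extremes.
group = [b"AAAAAAAAAA", b"AAAAABBBBB", b"BBBBBBBBBB"]
exemplar = select_medoid(group)
```

Any other pairwise distance can be passed through the `distance` parameter, so the same selection rule can be compared across metrics.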

1.4.2.1 Similarity measurement between malicious network streams: The use of information-theoretic and string metrics based measures

One of the key requirements for classification is to measure similarity between malicious network streams. String metrics, information-theoretic measures and other methods that can be applied to the payload collected from malicious network streams are investigated for similarity measurement. The key questions addressed include:

Q: How should similarity between similar types of malicious network traffic or streams be measured?

Q: What metrics are available?

Q: How do these metrics compare?

Q: What metrics provide the most appropriate methods for making similarity comparisons?
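To make the two families of measures concrete, the sketch below compares payloads with one compression-based measure and one string-based measure. It assumes the normalized compression distance (NCD) computed with `zlib` as the information-theoretic measure and `difflib.SequenceMatcher` as the string metric; the thesis may use different metrics. The three payloads are invented for illustration and are not samples from the Honeynet.

```python
import zlib
from difflib import SequenceMatcher


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed payload, a rough proxy for its information content."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar payloads, close to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def string_similarity(x: bytes, y: bytes) -> float:
    """Similarity ratio in [0, 1] from difflib's matching-block heuristic (1.0 means identical)."""
    return SequenceMatcher(None, x, y, autojunk=False).ratio()


# Hypothetical payloads, made up for this example.
KNOWN = b"GET /cgi-bin/status.cgi?host=127.0.0.1;wget+http://198.51.100.7/bot.sh HTTP/1.1\r\nHost: honeypot\r\n\r\n"
VARIANT = b"GET /cgi-bin/status.cgi?host=127.0.0.1;curl+http://198.51.100.9/b0t.sh HTTP/1.1\r\nHost: honeypot\r\n\r\n"
UNRELATED = b"USER root\r\nPASS toor\r\nSYST\r\nPWD\r\nTYPE I\r\nPASV\r\nRETR passwd\r\n"

for name, other in (("variant", VARIANT), ("unrelated", UNRELATED)):
    print(name, round(ncd(KNOWN, other), 3), round(string_similarity(KNOWN, other), 3))
```

Note that NCD is a distance while the `difflib` ratio is a similarity, so a variant of a known stream should score low on the first and high on the second. On very short payloads the fixed overhead of the compressor distorts NCD, which is one reason the metrics need to be compared empirically.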

1.4.3 Detection of new or novel malicious network streams: Unsupervised Learning

Techniques have been studied and adopted for detecting and handling new or novel streams, which may appear as a result of classification. The key questions addressed here include:

Q: How should unknown or novel samples, which were not classified using prior knowledge extracted from the training set (TR), be identified?

Q: How should appropriate exemplars be selected from these unknown or novel samples for classification?

Q: How should groups and sub-groups that may exist in these novel samples be identified to carry out classification?
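One way to picture how novel samples could be identified and absorbed is a nearest-exemplar classifier with a distance threshold: a stream close enough to a stored exemplar takes that exemplar's label, and anything else opens a new class. The sketch below is an assumption made for illustration, not the thesis implementation. The class `ExemplarClassifier`, the `novel-N` labels and the threshold value of 0.5 are hypothetical, and the distance is a `zlib`-based normalized compression distance.

```python
import zlib
from typing import Dict, List, Optional, Tuple


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib (assumed distance measure)."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)


class ExemplarClassifier:
    """Nearest-exemplar classifier that adds a new class when nothing is close enough."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold  # hypothetical cut-off; would need tuning on real data
        self.exemplars: Dict[str, List[bytes]] = {}
        self._novel_count = 0

    def add_exemplar(self, label: str, payload: bytes) -> None:
        self.exemplars.setdefault(label, []).append(payload)

    def classify(self, payload: bytes) -> Tuple[str, float, bool]:
        """Return (label, distance to nearest exemplar, whether a new class was created)."""
        best_label: Optional[str] = None
        best_dist = float("inf")
        for label, samples in self.exemplars.items():
            for sample in samples:
                dist = ncd(payload, sample)
                if dist < best_dist:
                    best_label, best_dist = label, dist
        if best_label is not None and best_dist <= self.threshold:
            return best_label, best_dist, False
        # No exemplar is close enough: store the payload as the first exemplar of a new class.
        self._novel_count += 1
        new_label = f"novel-{self._novel_count}"
        self.exemplars[new_label] = [payload]
        return new_label, best_dist, True
```

In use, known classes are seeded with `add_exemplar`, and each incoming stream is passed to `classify`. A stream that opens a new class becomes that class's first exemplar, so later variants of it are matched rather than flagged as novel again.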

1.4.4 Stream Classification

Clustering and classification algorithms have been studied, proposed and evaluated to classify same, similar and novel streams from any given dataset. The accuracy of these algorithms is evaluated using ROC curve analysis. The best-performing algorithms, those with the highest true positive rate (TPR) and the lowest false positive rate (FPR), are considered.
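The ROC analysis mentioned above can be sketched as follows: for each candidate threshold on a classifier's score, count true and false positives against the ground truth and record the resulting (FPR, TPR) pair. The function names `roc_points` and `auc` and the toy scores are assumptions made for illustration; they are not results from this thesis.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) per distinct score, predicting malicious when score >= threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one benign sample")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float, float]]) -> float:
    """Trapezoidal area under the ROC curve, with the (0, 0) and (1, 1) end points added."""
    curve = [(0.0, 0.0)] + [(fpr, tpr) for _, fpr, tpr in points] + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(curve, curve[1:]))


# Toy example: similarity scores for four streams and whether each is truly malicious.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [True, False, True, False]
toy_curve = roc_points(toy_scores, toy_labels)
```

An algorithm whose curve passes closer to the top-left corner (high TPR at low FPR) is preferred, and the area under the curve summarizes this in a single number.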

In document Detection and classification of malicious network streams in honeynets : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand (Page 32-36)