
This thesis shows the effectiveness of information-based intrusion detection techniques that apply string matching and compression to Honeypot network data. The proposed system is designed to aid existing signature-based intrusion detection systems, with the additional capability to detect variations of network intrusions in the form of malicious streams of trojans, viruses, worms, shellcode and other active threats. To enable intelligent learning for automated classification, exemplar-based learning techniques are investigated to detect variants of known attacks while they are being transferred over a network. Similarity between known malicious stream samples and the incoming traffic is calculated based on shared information or shared symbols. These similarities are quantified to yield a similarity score with a certain level of confidence. The score is used to classify examples of known attacks and to add new classes of attack where there is no known exemplar, so that the system can engage in lifelong learning and continue to extend itself. The system is integrated with our existing Honeynet setup, and we demonstrate our approach using data from real Honeypots. In this thesis, the research scope is restricted to intrusion detection schemes that can perform packet payload analysis. This involves:

Collection of malicious network streams for ground truth.

Detection of malicious network streams and their variants (Supervised Learning).

Detection of new or novel malicious network streams (Unsupervised Learning).

Classification of all these streams.

17 File Transfer Protocol
18 Hypertext Transfer Protocol
19 Simple Mail Transfer Protocol

1.4.1 Collection of malicious network streams for ground truth

In order to evaluate an IDS, it is necessary to have or create a dataset to serve as a ground truth. This dataset should comprise known malicious streams and their variants. Variants can be synthesized or captured from live traffic. Details of known malicious streams are studied and collected from sources such as:

Existing documented sources

It is important to start with known samples of malicious streams that can be labelled and verified in order to train and test our system. Details of known malicious streams are studied and collected from existing knowledge sources such as CVE²⁰, CERT²¹, Snort signatures²² and existing datasets collected by researchers and organizations (discussed in Chapter 2). These knowledge sources document the attack or exploit in detail, but often do not include live attack samples. They can be used to craft or synthesize attack traffic and its variants.

Live sensors

Most network security tools, such as firewalls and IDSs, are passive in nature. They operate on the rules and signatures available in their database. Anomaly detection is limited to these sets of available rules, and any activity that does not match them goes undetected. To gain better insight into attacks and attacker tactics, there is a need to set up a vulnerable environment that lures an attacker so that their behaviour can be studied. To this end, apparently benign computer systems designed to keep detailed logs of system activity are widely deployed today by security researchers. These systems, known as ‘Honeypots’, are designed to record a hacker’s activities to gain an insight into the methods used. The logs typically include intruder keystrokes, processes, and system-wide and network data. Over the years, researchers have successfully isolated and identified worms and exploits using Honeypots placed in specialized architectures called Honeynets. Honeynets are capable of logging information far more effectively than any other available security tool, providing the capability to study hackers “under a microscope”. We have deployed multiple Honeypot sensors emulating various services, such as FTP, HTTP, SSH, DNS²³, SMB²⁴, RIP²⁵, DHCP²⁶, NTP²⁷, POP²⁸, TELNET, SOCKS, SNMP²⁹, IRC³⁰, RPC³¹, RDP³², VNC³³, XMPP³⁴ and others, to collect malicious network streams.

20 Common Vulnerabilities and Exposures: http://cve.mitre.org/
21 Computer Emergency Response Team: http://www.us-cert.gov/ncas/alerts/
22 Snort Rules: http://www.snort.org/snort-rules/

These live sensors provide three main advantages over other techniques:

They provide an advantageous position for attack analysis, as the researchers can control both the endpoints (Honeypots) and the intermediary network (Honeynet).

Since they have no production value, any connection attempt to them is considered suspicious or malicious. This avoids the intensive pre-filtering that may be required in production networks.

They serve as an excellent source to collect live samples of malicious network streams and malware.

1.4.2 Detection of malicious network streams and their variants: Supervised Learning

Once a ground truth has been created, in the form of a labelled dataset of malicious network streams and their variants, the next challenge is to investigate methods to detect similarity between known malicious streams and their variants. The key research questions addressed are:

Q: Is it possible to measure similarity between samples of malicious network streams?

Q: Is it possible to measure similarity between variants of malicious network streams with statistical confidence?

Q: How should similar samples be grouped?

23 Domain Name Service
24 Server Message Block
25 Routing Information Protocol
26 Dynamic Host Configuration Protocol
27 Network Time Protocol
28 Post Office Protocol
29 Simple Network Management Protocol
30 Internet Relay Chat
31 Remote Procedure Call
32 Remote Desktop Protocol
33 Virtual Network Computing

Q: How should appropriate exemplars be selected from known malicious stream samples for classification (the instance selection problem)?
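To make the instance selection question concrete, the following minimal Python sketch picks a single exemplar for a group of labelled samples by choosing its medoid, that is, the sample with the smallest total distance to all members of the group. Medoid selection is only one possible strategy and is assumed here for illustration; it is not claimed to be the method adopted in this thesis. The names `string_distance` and `select_medoid` and the toy samples are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def string_distance(x: bytes, y: bytes) -> float:
    """Distance in [0, 1] derived from difflib's matching ratio (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, x, y, autojunk=False).ratio()


def select_medoid(
    samples: Sequence[bytes],
    distance: Callable[[bytes, bytes], float] = string_distance,
) -> bytes:
    """Return the sample with the smallest summed distance to every sample in the group."""
    if not samples:
        raise ValueError("samples must not be empty")
    return min(samples, key=lambda s: sum(distance(s, t) for t in samples))


# Toy group of three payload fragments (synthetic, for illustration only).
group = [b"abcd", b"abcx", b"abyx"]
exemplar = select_medoid(group)
```

A single medoid keeps the exemplar set small; a group with several distinct sub-groups would need more than one exemplar, which this sketch does not attempt.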

1.4.2.1 Similarity measurement between malicious network streams: The use of information-theoretic and string-metric-based measures

One of the key requirements for classification is to measure similarity between malicious network streams. String metrics, information-theoretic measures and other methods that can be applied to the payload collected from malicious network streams are investigated for similarity measurement. The key questions addressed include:

Q: How should similarity between similar types of malicious network traffic or streams be measured?

Q: What metrics are available?

Q: How do these metrics compare?

Q: What metrics provide the most appropriate methods for making similarity comparisons?
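As an illustration of the two families of measures named above, the following minimal Python sketch compares payloads with one information-theoretic measure and one string metric. The choice of the normalized compression distance (computed here with zlib) and of difflib's matching ratio is an assumption made for illustration, and is not necessarily the set of metrics evaluated in this thesis. The function names and the sample payloads are hypothetical and synthetic.

```python
import zlib
from difflib import SequenceMatcher


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for similar inputs, close to 1 for unrelated ones."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def string_similarity(x: bytes, y: bytes) -> float:
    """Similarity in [0, 1] from difflib's matching blocks (1.0 means identical)."""
    return SequenceMatcher(None, x, y, autojunk=False).ratio()


# Synthetic payloads for illustration only (not taken from the thesis dataset).
KNOWN = (
    b"GET /cgi-bin/status.cgi?cmd=uname+-a HTTP/1.0\r\n"
    b"Host: 192.0.2.10\r\nUser-Agent: probe/1.0\r\nAccept: */*\r\n\r\n"
)
VARIANT = KNOWN.replace(b"uname+-a", b"uname+-r").replace(b"192.0.2.10", b"192.0.2.77")
UNRELATED = bytes((i * 37 + 11) % 256 for i in range(300))

for name, other in (("variant", VARIANT), ("unrelated", UNRELATED)):
    print(name, round(ncd(KNOWN, other), 3), round(string_similarity(KNOWN, other), 3))
```

The two measures behave differently on short payloads: compression-based distances carry a fixed overhead from the compressor, while string metrics are sensitive to reordering, so a comparison across metrics needs payloads of realistic length.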

1.4.3 Detection of new or novel malicious network streams: Unsupervised Learning

Techniques for detecting and handling new or novel streams, which may appear as a result of classification, have been studied and adopted. The key questions addressed here include:

Q: How should unknown or novel samples that did not get classified by prior knowledge extracted from the training set (TR) be identified?

Q: How should appropriate exemplars be selected from these unknown or novel samples for classification?

Q: How should groups and sub-groups that may exist in these novel samples be identified to carry out classification?
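One simple way to picture the handling of novel samples is a nearest-exemplar classifier with a rejection threshold: a stream that is not close enough to any known exemplar is treated as novel and becomes the exemplar of a new class. The Python sketch below illustrates this idea under stated assumptions. The distance (a zlib-based normalized compression distance), the threshold value of 0.5, the single exemplar per class, and the names `ExemplarClassifier` and `novel-N` are all hypothetical choices, not the design evaluated in this thesis.

```python
import zlib
from typing import Dict, Optional, Tuple


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance based on zlib (assumed distance measure)."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


class ExemplarClassifier:
    """Nearest-exemplar classifier that opens a new class for unmatched payloads."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold            # assumed cut-off, would need tuning
        self.exemplars: Dict[str, bytes] = {}  # one exemplar per class label
        self.novel_count = 0

    def classify(self, payload: bytes) -> Tuple[str, float]:
        """Return (label, distance to nearest exemplar), adding a class if none is close."""
        best_label: Optional[str] = None
        best_distance = float("inf")
        for label, exemplar in self.exemplars.items():
            distance = ncd(payload, exemplar)
            if distance < best_distance:
                best_label, best_distance = label, distance
        if best_label is None or best_distance > self.threshold:
            # No known exemplar is close enough: treat the payload as a new class.
            self.novel_count += 1
            best_label = f"novel-{self.novel_count}"
            self.exemplars[best_label] = payload
        return best_label, best_distance
```

Because every rejected payload immediately becomes an exemplar, later streams that resemble it are assigned to the same new class, which is the sense in which such a system keeps extending itself. Identifying sub-groups within the novel samples would require a further clustering step that this sketch omits.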

1.4.4 Stream Classification

Clustering and classification algorithms have been studied, proposed and evaluated to classify identical, similar and novel streams from any given dataset. The accuracy of these algorithms is evaluated using receiver operating characteristic (ROC) curve analysis. The best-performing algorithms, those that achieve the highest true positive rate (TPR) together with the lowest false positive rate (FPR), are considered.
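The ROC analysis mentioned above can be sketched in a few lines of Python. For each candidate threshold on a similarity score, a stream is predicted malicious when its score is at or above the threshold, and the resulting TPR and FPR give one point on the curve. The function names, the example scores and labels, and the use of the area under the curve (AUC) as a single summary figure are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each distinct score, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float, float]]) -> float:
    """Trapezoidal area under the ROC curve, starting from the origin (0, 0)."""
    xs = [0.0] + [fpr for _, fpr, _ in points]
    ys = [0.0] + [tpr for _, _, tpr in points]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs)))


# Made-up similarity scores; label 1 = malicious, 0 = benign.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

A classifier whose curve passes closer to the top-left corner (high TPR at low FPR) is preferable, which matches the selection criterion stated above.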