Combining the areas of DLP and approximate matching, in this work we present a novel technique that uses approximate matching to detect files in network traffic. Compared to existing techniques, our proposed approach is straightforward and does not need comprehensive configuration. It can be easily deployed and maintained, since only fingerprints (a.k.a. similarity digests) are required. Our approach requires neither machine learning nor rule generation. The main contribution is to demonstrate that it is possible to apply approximate matching to network traffic by changing the algorithms slightly, although these algorithms were never designed to handle such small pieces of data. To the best of our knowledge, this is the first paper describing a technique for file identification in network traffic using approximate matching.
The problem of data loss has become a pressing one, and a robust solution is the need of the hour. Possible routes of data loss have become numerous and complicated, making countermeasures difficult to develop and deploy. The increased incidence of insider involvement in data leakage has raised serious questions about the confidentiality of an organisation's internal information, such as intellectual property. In this work, the problem of identifying files in network traffic is considered. The shortcomings of the existing technology are highlighted, and the need for open-source tools and techniques to solve this problem is emphasised. To this end, bitwise content analysis of data in motion using approximate matching is proposed: each packet is analysed to determine whether it contains a 'known file'. It is successfully established that files can be detected using this approach. To validate the technique and implementation, several scenarios are considered and tested. In a first step, random data is used to explore feasibility and establish a benchmark for what to expect from such a methodology. Tests with real-world data showed promising results as well. Both binary and text-based files can be easily detected using this approach. However, with real-world data, the problem of 'common substrings' persists, for which a simple extension based on stream analysis is proposed.
Sdhash stands for similarity digest hash; it was developed by Vassil Roussev in 2010. It is an algorithm that allows two arbitrary blobs of data to be compared for similarity based on common strings of binary data. Sdhash's approach is to identify statistically-improbable features, i.e., features that are least likely to occur in other data objects by chance, and to use them to generate similarity digests. Each feature is hashed using the cryptographic hash function SHA-1 and the resulting hashes are put into a series of Bloom filters, which are a space-efficient set representation. To compare two digital artifacts, their digests can be compared. Sdhash applications include identification of embedded objects, identification of code versions, identification of related documents, and correlation of network fragments.
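As a rough illustration of the digest-and-compare idea, the following sketch hashes overlapping windows into a single Bloom filter and scores the overlap of two such filters. It is only a toy stand-in for sdhash: the window hashing replaces sdhash's statistically-improbable feature selection, and the parameters (window length, filter size, five hash positions) are illustrative assumptions, not sdhash's actual ones.

```python
import hashlib

BLOOM_BITS = 2048  # toy size; real sdhash uses a sequence of small fixed-size filters

def bloom_positions(feature: bytes, k: int = 5):
    """Derive k bit positions from the SHA-1 digest of a feature."""
    digest = hashlib.sha1(feature).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
            for i in range(k)]

def make_digest(data: bytes, feature_len: int = 64, step: int = 32):
    """Toy 'similarity digest': hash overlapping windows into one Bloom
    filter, represented here simply as a set of bit positions."""
    bloom = set()
    for off in range(0, max(1, len(data) - feature_len + 1), step):
        bloom.update(bloom_positions(data[off:off + feature_len]))
    return bloom

def compare(b1, b2):
    """Score 0..100 from the overlap of two filters (a crude stand-in
    for sdhash's Bloom-filter comparison)."""
    if not b1 or not b2:
        return 0
    return round(100 * len(b1 & b2) / min(len(b1), len(b2)))
```

Blobs sharing long runs of bytes set many of the same bits and score high, while unrelated blobs overlap only by chance and score low.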
 is a static technique which looks for risky APIs and risky keywords within the Java code of an application. The authors collected malware samples, created a database of risky APIs found in malware, and then searched for the presence of such APIs in public-sector apps such as banking and flight-booking apps.  is another such mechanism, which performs a two-order risk analysis of applications collected from the official Play Store: the first-order analysis examines the permissions in the manifest file, i.e., which dangerous permissions are present in the application, while the second-order analysis applies heuristics-based filtering, in which heuristics such as run-time download of a component by the application are considered malicious.  found a set of permissions which can distinguish between normal apps and malicious apps. They used a hierarchical bi-clustering technique to cluster the permissions into two groups, normal and malware, and then filtered out those permission sets that are clearly distinguishable, i.e., present in malicious apps but missing in normal apps.  evaluated the potential risks hidden within the ad libraries of applications. They extracted the ad libraries within an application and looked for dangerous permissions present within them; dangerous permissions could lead to leakage of the user's private information.  is a tool available in the market for performing static analysis.
found at text position j if C_{m,j} <= k, where k is the maximum number of errors allowed. Figure 2.2 shows the matrix when searching for the pattern “halfway” in the text “hallways”. Italic entries are the positions where a match with fewer than 2 errors was found. Over the years, various improvements to this algorithm have been developed; most of them exploit properties of the dynamic programming matrix. Another approach to the approximate pattern matching problem uses automata, where each possible configuration of a column represents a separate state of the automaton. In this case, a state transition occurs on every character of the text, i.e., whenever a new column is computed. All these improvements achieve increased performance by trading memory for runtime. Navarro [Nav01] summarizes these improvements.
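The column-by-column computation described above can be sketched as follows; this is the classic O(mn) dynamic programming search (Sellers' variant), keeping only the current and previous columns of the matrix.

```python
def approx_search(pattern, text, k):
    """Dynamic-programming approximate search (Sellers' variant).
    C[0][j] = 0 for every j, so a match may start anywhere in the text;
    the pattern is reported at position j whenever C[m][j] <= k."""
    m = len(pattern)
    col = list(range(m + 1))              # first column: C[i][0] = i
    ends = []
    for j, ch in enumerate(text, 1):
        prev = col[:]
        col[0] = 0                        # a match may begin at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev[i - 1] + cost,  # match / substitution
                         prev[i] + 1,         # insertion
                         col[i - 1] + 1)      # deletion
        if col[m] <= k:
            ends.append(j)                    # 1-based end position of a match
    return ends
```

For the example above, `approx_search("halfway", "hallways", 2)` reports the end positions 6, 7 and 8.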
There are two working procedures in our anomaly detection scheme: deployment and measurement. First, the scheme must be deployed properly, so that it receives NetFlow records from the available measurement network. Internal NetFlow sources that handle traffic between corporate hosts and the Internet, such as routers, switches and firewalls, should be configured to export NetFlow records to the processing-engine server. For best results and greater visibility, these sources should see clear, non-NATed traffic. Second, we assume that the training traffic is devoid of any attack, so that the characterization of its traffic features acts as a normal profile. The normal profile is used to calculate the predefined thresholds, after which the scheme enters fully operational mode. In this mode, the thresholds are constantly compared with the current entropy values of the degree distributions derived from incoming NetFlow records, and alarms are generated if the entropy values deviate beyond the allowed tolerances. Note that the associated thresholds are self-adjusting: they are calculated from the processed NetFlow data itself over a particular time span and updated periodically, without requiring a dedicated periodic training interval.
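A minimal sketch of the measurement side might look as follows; the flow representation (source/destination pairs), the degree definition and the tolerance value are illustrative assumptions, not the scheme's actual parameters.

```python
import math
from collections import defaultdict

def degree_entropy(flows):
    """Shannon entropy (bits) of the source out-degree distribution:
    P(d) = fraction of source hosts contacting exactly d distinct destinations."""
    peers = defaultdict(set)
    for src, dst in flows:          # flows as (source IP, destination IP) pairs
        peers[src].add(dst)
    hist = defaultdict(int)
    for dsts in peers.values():
        hist[len(dsts)] += 1
    n = len(peers)
    return -sum(c / n * math.log2(c / n) for c in hist.values())

def check(flows, baseline_entropy, tolerance=0.1):
    """Alarm when the current entropy deviates from the profile beyond tolerance."""
    return abs(degree_entropy(flows) - baseline_entropy) > tolerance
```

A scanning host that suddenly contacts many destinations skews the degree distribution, moving the entropy away from the training-derived baseline and triggering the alarm.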
In a first step of our analysis we detect whether the given network has a special structure, such as a grid, radial or star network, which allows the algorithms in the subsequent network simplification to be adjusted for better performance. To detect the network structure, pattern matching algorithms search for grid and radial patterns. The network is then simplified to its essential nodes. Essential nodes include on-ramp start points, off-ramp end points, merging nodes and diverging nodes. To determine the essential nodes of a network, we follow the links of the network from all entry points until arriving at a node that has more than one exit link or another entry link. Once such a node is reached, we delete all visited links and replace them with an equivalent link from the starting position to the current position. We then repeat the process until all nodes of the network have been visited. To give an indication of the node and link reduction: for the network description of the Tokyo Metropolitan Expressway we could eliminate about 48% of the nodes. All further structural analysis is based on the simplified network, and includes:
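The essential-node reduction described above can be sketched as follows; the link representation and function names are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def simplify(links, entries):
    """Collapse every chain of pass-through nodes (exactly one entry link
    and one exit link) into a single equivalent link, walking the network
    from all entry points."""
    out = defaultdict(list)
    indeg = defaultdict(int)
    for a, b in links:
        out[a].append(b)
        indeg[b] += 1

    def essential(n):
        # entry points, off-ramp ends (no exit), merging or diverging nodes
        return n in entries or len(out[n]) != 1 or indeg[n] > 1

    simplified, stack, seen = [], list(entries), set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        for nxt in out[n]:
            cur = nxt
            while not essential(cur):      # skip pass-through nodes
                cur = out[cur][0]
            simplified.append((n, cur))    # equivalent link: start -> end
            if out[cur]:
                stack.append(cur)
    return simplified
```

For a chain A-B-C-D where D diverges into E and F, the intermediate nodes B and C are eliminated and replaced by the single equivalent link A-D.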
Zargari and Voorhis  examine significant features in anomaly detection systems with the aim of applying them to data mining techniques. They identify current challenges in obtaining a comprehensive feature set, and build a system that eliminates redundant and recurring data from the KDD 99 dataset while keeping the feature set to a minimal size. Rough set theory dependency was used to identify the most discriminating features of each class. Features 21 and 22 in the KDD dataset (hot login and FTP session) were found to have no significance for intrusion detection, and a further five features were identified as having only small significance; these include su attempted, number of file creation operations, is guest login, and dst host rerror rate. The Corrected KDD dataset was used in order to discover the features and characteristics of the intrusions, and to establish, from a statistical point of view, whether anomaly detection can be improved by using this dataset. It is important to mention that, unlike other studies, the Corrected KDD dataset was analysed here instead of the KDD dataset; it contains more attacks, and its distribution of attacks differs from that of the KDD dataset. A subset of features was later proposed to reduce the dimensionality of KDD, and was compared to feature subsets obtained through data mining techniques. The proposed features were then tested on NSL-KDD and demonstrated higher detection rates. The work may require live analysis before we can be sure that it would function correctly.
1.1. Intrusion Detection System (IDS)
An intrusion detection system (IDS) is a tool or application used to detect an attack launched against a system or network by an anomalous user outside the network in order to compromise or break it. This is done by keeping track of all suspicious patterns and activities in both incoming and outgoing traffic within the network. Generally, an IDS maintains the details of all events examined on the system and later generates reports, which are sent to the management station for further action. After the details of a malicious user are obtained from these records, actions such as blocking the user are performed. It is important to note that an IDS also includes a feature for monitoring suspicious users within the network.
In the second place, the most commonly used API network groups are groups 1 and 3, with a total of 10 malware samples each. The 10 malware samples using API Network Group 1 have activities that access a URL or IP address and then provide file transfer services between client and server, so that, without the user realizing it, the server can send and retrieve data on the client computer. The 10 malware samples using API Network Group 3 have activities that send and receive data through a predetermined socket; API Network Group 3 therefore affects the throughput of network traffic, because sending and receiving of data takes place.
Figure 2 shows the architecture of the NIDS. In the first step, the KDD-99 dataset, which contains attack and normal traffic, is given to the C4.5 algorithm through Weka. C4.5 outputs a decision tree. The tree is built from attribute values: at each node an attribute is tested, by which the tree is further split, and at each leaf node the actual attack class is given. Each branch is assigned a weight according to the classification attribute. In the second step, this tree and the dataset are given to the AdaBoost algorithm, which has four phases: labeling, data mining, training and testing. In labeling, normal packets are given the value -1 and attack packets +1. In data mining, features are extracted. The training phase is performed by taking different field combinations and varying the folds. The created NIDS is then tested for its accuracy. The AdaBoost algorithm classifies the traffic into four types of attacks (DoS, U2R, R2L, probe) plus normal packets, and the detection rate and false alarm rate are determined.
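The labeling and boosting steps can be illustrated with a generic AdaBoost over decision stumps on -1/+1 labels. This is a self-contained sketch, not the paper's actual Weka/C4.5 pipeline; the single numeric feature stands in for the extracted traffic fields.

```python
import math

def best_stump(X, y, w):
    """Weighted best one-feature threshold classifier (decision stump)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                pred = [pol if x[f] <= thr else -pol for x in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost(X, y, rounds=5):
    """AdaBoost with -1/+1 labels (normal = -1, attack = +1)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, pol = best_stump(X, y, w)
        err = max(err, 1e-10)                  # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        pred = [pol if x[f] <= thr else -pol for x in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        z = sum(w)
        w = [wi / z for wi in w]               # re-normalize the weights
    return model

def predict(model, x):
    """Weighted vote of all stumps, thresholded at zero."""
    s = sum(a * (p if x[f] <= t else -p) for a, f, t, p in model)
    return 1 if s >= 0 else -1
```

Each round re-weights the training packets so that the next stump focuses on the ones still misclassified, which is what lets boosting turn weak per-field tests into an accurate combined classifier.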
administrator privileges, and they take CPU and hard disk resources. The above-mentioned methods try to detect ransomware while it is encrypting files on the user's computer. However, in most enterprise productivity deployments user documents are located on central network shared volumes (Eurostat Statistics Explained). These can be documents shared by groups of users, or even the whole set of a user's documents, to allow mobility among hosts. The centralization offers better storage utilization with higher-quality disks, group sharing capabilities, easier maintenance and simpler periodic backups. In fact, most enterprises hit by ransomware recover their documents thanks to nightly backups (Osterman Research, Inc., 2016). However, the same centralization and sharing opens the door to a single infected computer encrypting lots of documents, with effects on many company departments. Locally installed malware detectors could prevent ransomware from encrypting network shared volumes; however, they require installation and updates on the whole set of company computers. As far as we know, no previous work has tried to detect ransomware action based on the traffic to a NAS system. In this paper we show how a single network probe can detect and stop any ransomware by analysing the traffic to a network file server. Tens of gigabits per second of sustained traffic are supported, with file recovery capabilities added in order to reduce the impact of ransomware to a minimum.

3. Network scenario
1.1 Main Contributions and Road-map
In this paper, we provide a methodology to identify P2P traffic. The methodology is based on the following steps: analysis of the protocol of interest; identification of patterns specific to the P2P protocol that can be revealed by an IP packet-level analysis; coding of these patterns into rules that can be fed to an IDS; and network monitoring of the identified patterns with an effective IDS fed with the devised rules. Note that following the IDS-like approach does not introduce any delay in the network, while requiring only little overhead on the checking point where it is installed. Further, the proposed methodology is shown to be extensible to the analysis of P2P protocols that encrypt their generated traffic, and to efficiently leverage characteristics introduced by decentralized P2P file sharing applications. Our P2P traffic detection tool has been successfully deployed and is currently running in a corporate LAN.
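As an illustration of coding such a pattern into a rule, the sketch below scans packet payloads for the well-known BitTorrent handshake prefix (the length byte 0x13 followed by the literal string "BitTorrent protocol"). The rule table and function names are illustrative, not the rules actually devised in the paper.

```python
# Signature table: each rule is a byte pattern searched in every packet payload,
# mimicking the content-matching rules fed to an IDS.
SIGNATURES = {
    # length-prefix 0x13 + protocol string, as sent in the BitTorrent handshake
    "bittorrent-handshake": b"\x13BitTorrent protocol",
}

def match_payload(payload: bytes):
    """Return the names of all signatures found in a packet payload."""
    return [name for name, pat in SIGNATURES.items() if pat in payload]
```

A real deployment would express the same pattern as an IDS content rule and let the IDS engine perform the per-packet scan, which is what keeps the approach out of the forwarding path.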
The Time Sliding Window Three Conformance Level meter (TSWTCL) meters a traffic stream and determines the conformance level of its packets. Packets are deemed to belong to one of three levels, Red, Yellow or Green, depending on the committed and peak rates. The meter provides an estimate of the running average bandwidth. It takes burstiness into account and smoothes out its estimate to approximate the longer-term measured sending rate of the traffic stream over a specific window (time interval), using a time-based estimator. When a packet arrives for a class, TSWTCL recomputes the average rate using the rate in the last window and the size of the arriving packet. The window is then slid to start at the current time (the packet arrival time). If the computed rate is less than the committed rate parameter, the packet is deemed Green; otherwise, if the rate is less than the peak rate, it is Yellow; otherwise it is Red. To avoid dropping multiple packets within a TCP window, TSWTCL probabilistically assigns one of the three conformance levels to the packet. The basic working principle of NTM is represented pictorially below:
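The meter described above can be sketched as follows, in the spirit of the time sliding window estimator of RFC 2859. The deterministic Green/Yellow/Red decision at the end stands in for the meter's probabilistic assignment, and the parameter values are illustrative.

```python
class TSWTCL:
    """Sketch of a Time Sliding Window Three Conformance Level meter:
    a time-based running average of the sending rate, compared against
    the committed and peak rates."""

    def __init__(self, committed_bps, peak_bps, win=1.0):
        self.ctr = committed_bps   # committed rate, bits/s
        self.ptr = peak_bps        # peak rate, bits/s
        self.win = win             # averaging window, seconds
        self.avg = 0.0             # running average rate, bits/s
        self.front = 0.0           # time of the window front

    def meter(self, now, pkt_bits):
        # re-compute the average using the rate in the last window
        bits_in_win = self.avg * self.win + pkt_bits
        self.avg = bits_in_win / (now - self.front + self.win)
        self.front = now           # slide the window to the packet arrival time
        if self.avg <= self.ctr:
            return "green"
        if self.avg <= self.ptr:
            return "yellow"        # the real meter marks probabilistically here
        return "red"
```

Because the estimator divides the windowed byte count by the elapsed time plus the window length, a short burst raises the average gradually rather than instantly, which is the smoothing behaviour the text describes.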
The traffic analysis process starts with a tcpdump data file, which is used for extracting flow data and for gathering per-packet information. Two types of tcpdump records were used throughout this thesis work. The first is the 1999 DARPA Intrusion Detection Evaluation Data Set, which was used throughout the development phase. The DARPA Intrusion Detection Evaluation Data Set (DARPA 2009) includes weekly prerecorded tcpdump files for evaluation. Since the data set was designed for intrusion detection, its clean traffic is kept separate. Attack-free (clean) traffic was important during the development period for determining the metric values of each flow attribute, so the first week of the dataset was used for development purposes. The second record type includes manually produced anomalies: these records contain both abnormal traffic and a manually scheduled attack, and are used for checking the accuracy of the work.
b) Controlling file access: Generally, the functions of controlling file access are delegated to specialized systems, such as Secret Net, which are intended specifically for protecting network information from unauthorized access. However, some critically important files, such as database files and password files, cannot be protected by such systems. Moreover, such systems are mainly developed for the Windows and NetWare platforms, so they fail in the UNIX environments that are used for network applications in many organizations. In such cases, a network intrusion detection system comes to the rescue of network administrators. Host-based network intrusion detection systems are mainly used here, based both on log-file analysis (Real Secure Server Sensor) and on IDSs analyzing system calls (Cisco IDS Host Server).
In this paper we present and evaluate a concept to extend existing approximate matching algorithms which reduces the lookup complexity from O(x) to O(1). Instead of using multiple small Bloom filters (which is the common procedure), we demonstrate that a single, huge Bloom filter has far better performance. Our evaluation demonstrates that current approximate matching algorithms are too slow (e.g., over 21 min to compare 4457 digests of a common file corpus against each other), while the improved version solves this challenge within seconds. A study of the precision and recall rates shows that our approach works as reliably as the original implementations. This benefit comes at a cost in accuracy: the comparison is now a file-against-set comparison, and thus it is not possible to see which file in the database is matched.
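The idea can be illustrated with a sketch: all features of the database files go into one large Bloom filter, so each feature lookup is a constant-time bit test instead of a pass over every digest. The filter size, hash count and fixed-window feature extraction are illustrative assumptions, not the parameters of the evaluated implementation.

```python
import hashlib

class BigBloom:
    """One large Bloom filter holding the features of the whole database,
    giving O(1) lookup per feature instead of comparing against every digest."""

    def __init__(self, bits=1 << 20, k=4):
        self.bits, self.k = bits, k
        self.filter = bytearray(bits // 8)

    def _positions(self, feature):
        d = hashlib.sha1(feature).digest()
        return [int.from_bytes(d[4 * i:4 * i + 4], "big") % self.bits
                for i in range(self.k)]

    def add(self, feature):
        for p in self._positions(feature):
            self.filter[p >> 3] |= 1 << (p & 7)

    def __contains__(self, feature):
        return all(self.filter[p >> 3] & (1 << (p & 7))
                   for p in self._positions(feature))

def features(data, n=64):
    """Illustrative fixed-window feature extraction."""
    return [data[i:i + n] for i in range(0, len(data) - n + 1, n)]

def score(bloom, data):
    """File-against-set score: percentage of query features found in the set."""
    fs = features(data)
    return 100 * sum(f in bloom for f in fs) // max(1, len(fs))
```

Note the accuracy trade-off described above: a high score only says the query resembles something in the set; the single filter cannot reveal which database file matched.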
Basically, approximate matching consists of two separate functions. First, tools run a feature extraction function that extracts features or attributes from the input, allowing a compressed representation of the original object (the exact procedure depends on the implementation itself). Second, to compare two similarity digests, a similarity function is used that normally outputs a score s scaled to 0 <= s <= 100. Despite its range, this value is not necessarily an estimate of the percentage of commonality between the compared objects, but rather a level of confidence; it is meant to serve as a means to sort and filter the results. ssdeep and the F2S2 software
Abstract: The Spatial Pyramid Matching approach has become very popular for modelling images as sets of local bag-of-words. The image comparison is then done region-by-region with an intersection kernel. Despite its success, this model has some limitations: the grid partitioning is predefined and identical for all images, and the matching is sensitive to intra- and inter-class variations. In this paper, we propose a novel approach based on approximate string matching to overcome these limitations and improve the results. First, we introduce a new image representation as strings of ordered bag-of-words. Second, we present a new edit distance specifically adapted to strings of histograms in the context of image comparison. This distance identifies local alignments between subregions and makes it possible to remove sequences of similar subregions to better match two images. Experiments on 15 Scenes and Caltech 101 show that the proposed approach outperforms the classical spatial pyramid representation and most existing concurrent methods for classification presented in recent years.
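The idea of an edit distance over strings of histograms can be sketched as follows, with a substitution cost derived from normalized histogram intersection so that substituting two similar subregions is cheap. The exact cost model of the paper differs; this is only an illustration of the principle.

```python
def hist_sim(h1, h2):
    """Histogram intersection, normalized to [0, 1]."""
    inter = sum(min(a, b) for a, b in zip(h1, h2))
    total = max(sum(h1), sum(h2)) or 1
    return inter / total

def hist_edit_distance(s1, s2, gap=1.0):
    """Edit distance between two strings of histograms: substitution costs
    1 - intersection (cheap for similar subregions), insert/delete cost gap."""
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1.0 - hist_sim(s1[i - 1], s2[j - 1])
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitute one subregion
                          d[i - 1][j] + gap,       # delete a subregion
                          d[i][j - 1] + gap)       # insert a subregion
    return d[n][m]
```

Because deletions are allowed, a run of near-identical subregions in one image can be skipped at a bounded cost, which is the "remove sequences of similar subregions" behaviour the abstract describes.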
Bit-parallelism is a technique which takes advantage of the intrinsic parallelism of the bit operations inside a computer word, allowing the number of operations that an algorithm performs to be cut down by a factor of up to the number of bits in the computer word. Bit-parallelism is indeed particularly suitable for the efficient simulation of non-deterministic automata. In other words, bit-parallelism is the technique of packing several values into a single computer word and updating them all in a single operation. This technique has yielded the fastest approximate string-matching algorithms, if we exclude filtering algorithms.
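A classic instance of bit-parallelism is the Shift-And algorithm for exact matching, which keeps every prefix state of the nondeterministic automaton in one machine word and updates them all with a shift, an OR and an AND per text character:

```python
def shift_and(pattern, text):
    """Bit-parallel Shift-And exact matching: bit i of the state word d is 1
    iff pattern[0..i] is a suffix of the text read so far, so all automaton
    states are updated in a single word operation per character."""
    m = len(pattern)
    mask = {}                         # per-character bit masks of the pattern
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    d = 0
    hits = []
    for j, c in enumerate(text):
        d = ((d << 1) | 1) & mask.get(c, 0)
        if d & (1 << (m - 1)):        # final state reached: full match
            hits.append(j - m + 1)    # start position of the match
    return hits
```

Approximate variants such as Wu and Manber's agrep algorithm extend this scheme with one such state word per allowed error.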