CONFERENCE PAPER

National level conference on "Advances in Networking, Embedded System and Telecommunication 2015 (ANEC-2015)", 6-8 Jan 2015, organized by G.H. Raisoni College of Engg. & Management, Wagholi, Pune, Maharashtra, India.

Clustering of Documents for Forensic Analysis

Asst. Prof. Mrs. Mugdha Kirkire#1, Stanley George#2, Rana Yogeeta#3, Vivek Shukla#4, Kumari Pinky#5

#1 GHRCEM, Wagholi, Pune, 9975101287. #2 GHRCEM, Wagholi, Pune, 9763023582. #3 GHRCEM, Wagholi, Pune, 9923696296. #4 GHRCEM, Wagholi, Pune, 9503642390. #5 GHRCEM, Wagholi, Pune, 8087716802.

ABSTRACT

Forensic analysis refers to the analysis of unstructured material. Forensic examiners have to deal with thousands of files during the inspection of various crimes. These files are unlabeled and heterogeneous, which makes them very hard for computer examiners to analyze, and analyzing such large volumes of data is a difficult task. In today's world there is a great need for computer forensics, and time plays a significant role in digital forensic analysis. To deal with high volumes of data under time constraints, clustering techniques are used. Clustering documents helps in managing files by dividing them into structured groups. Documents can be prepared for analysis in an automated manner using clustering algorithms such as K-means. Digital forensic analysis with the help of clustering thus allows documents to be analyzed rapidly.

Keywords: Forensics, Clustering, K-means.

Corresponding Author: Mrs. Mugdha Kirkire

INTRODUCTION

Forensic analysis is a branch of digital forensics that deals with collecting, examining and preserving digital information. The basic problems analysts face during forensic analysis are time consumption and the sheer bulk of the data. During an investigation, data is recovered from storage devices and analyzed to form clusters using clustering algorithms.

Clustering is a technique in which similar objects are grouped together. This method is used to transform documents from an unstructured form into a fully structured one, making the analysts' job simpler and more efficient. Structured groups can be formed according to the domain keywords in a particular file. It includes two steps: 1) extraction of textual data, and 2) analysis of the extracted data using clustering algorithms. Clustering can be classified as exclusive, overlapping, fuzzy, partial and complete clustering. The clustering performed here is exclusive clustering, in which each data object can belong to only one cluster. This automatically enriches the performance of the expert examiner through analysis.

Clustering can be performed effectively using clustering algorithms, which can be divided into three types: partitioning-based clustering, hierarchical clustering and density-based clustering. Partitioning-based algorithms determine clusters by grouping the data at once (e.g. K-means, K-medoids). Hierarchical clustering algorithms determine clusters using previously defined groups (e.g. divisive and agglomerative clustering). Density-based clustering algorithms form clusters where data objects have a density exceeding some given threshold (e.g. DBSCAN, OPTICS).

RELATED WORK

Comparative studies of clustering algorithms for forensics are not frequent. The most widely used algorithms are bisecting K-means, Fuzzy C-means and Expectation Maximization (EM). Various algorithms have been applied to improve computer inspection by obtaining the data in the form of clusters or groups. Studies have observed that algorithms such as average link and complete link show good results in the field of forensic computing; even though they are costly, they present the documents to be inspected in a summarized form. An email forensic analysis tool has been introduced to investigate email fraud and harassment-related issues [1]. Ontology defines the knowledge needed to process and aggregate a domain-specific set of texts [2]. A Java-based implementation provides flexibility by forming groups using centroids, and SOM-based algorithms were used to form clusters according to date/time extensions [3]. Fuzzy C-means describes a method to derive easily understandable expert-system rules from forensic data [4]. The main focus is on lexical and syntactical features, so as to learn about an individual's choice of words and use of isolated characters [5].

PROPOSED SYSTEM

The proposed system for clustering of documents first tokenizes the input documents collected during the investigation. The documents are then filtered using a filter database: unnecessary words are removed and the stems of words are obtained, as shown in Figure 1. The final tokens are sorted by calculating their term frequency, and the distance between these words is then measured. The clustering algorithm is applied after a correct estimate of the number of clusters. All these steps help to form document clusters in a structured format, allowing the investigation process to be carried out faster and more efficiently.
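As a rough sketch, the preprocessing stages (tokenize, filter, stem, term frequency) can be strung together as below. The stop-word list, whitespace tokenizer and one-rule stemmer are illustrative stand-ins, not the actual filter database or stemmer of the proposed system:

```python
from collections import Counter

STOP_WORDS = {"is", "was", "were", "are", "the", "a", "an"}  # toy filter list

def preprocess(text):
    """Tokenize, filter stop words, crudely stem, and compute
    relative term frequencies (illustrative sketch only)."""
    tokens = text.lower().split()                        # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter
    stems = [t[:-3] if t.endswith("ing") else t for t in tokens]  # crude stem
    counts = Counter(stems)
    total = len(stems)
    return {term: n / total for term, n in counts.items()}

tf = preprocess("the examiner is working on the working files")
# "working" is stemmed to "work", which occurs 2 times out of 5 kept tokens
```

The resulting term-frequency vectors are what the later distance and clustering stages operate on.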


Figure 1: Proposed Architecture for Document Clustering

Work Flow

The work can be done by following the steps explained below:

A. Tokenize - Tokenizing means dividing the text into tokens (parts). In this step, a meaningful sentence is converted into non-sensitive data.

B. Filter - In the filtering process, insignificant words are removed, e.g. is, was, were, are.

C. Stemmer - The stemming process produces the stem of each remaining word, e.g. working becomes work, image becomes imag.

D. Term Frequency - It checks the frequency of words in a particular document, e.g. if the word work occurs 10 times in a document of 1000 words, then its term frequency is 10/1000 = 0.01.
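The term-frequency computation in step D is a simple ratio; a minimal sketch (the function name and toy document are illustrative):

```python
def term_frequency(term, tokens):
    """Relative term frequency: occurrences of `term` divided by
    the total number of tokens in the document."""
    return tokens.count(term) / len(tokens)

doc = ["work"] * 10 + ["other"] * 990   # a 1000-token document
print(term_frequency("work", doc))      # -> 0.01
```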


E. Distance Measure - Distance is measured using the Jaccard similarity, which compares the similarity of the sample sets and computes the distance between them.

F. Clustering - Clustering is done with the K-means algorithm, which finds the best division of n entities into k groups by reducing the distance between each centroid and its group members.

Algorithms

There are four algorithms used during the clustering process:

A. Porter Stemmer: The Porter stemming algorithm is used to remove morphological endings from English words. It is mainly used for term normalization when setting up information retrieval systems, and yields the core meaning of words.
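The full Porter algorithm applies five ordered phases of suffix rules with measure conditions; the sketch below is only a heavily simplified stand-in that strips a few common suffixes, to illustrate the idea on the paper's own examples:

```python
def simple_stem(word):
    """Very simplified stemmer: strips a few common English suffixes
    when enough of the word remains. NOT the full Porter algorithm,
    which uses five ordered rule phases with measure conditions."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("working"))  # -> work
print(simple_stem("image"))    # -> imag
```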

B. Jaccard similarity coefficient: Sets of samples can be compared using the Jaccard coefficient, which is the ratio of the size of the intersection to the size of the union of the sample sets.

Formula: J(A, B) = |A ∩ B| / |A ∪ B|
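The formula translates directly to Python set operations; the helper names and the example token sets are illustrative:

```python
def jaccard_similarity(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| over token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def jaccard_distance(a, b):
    """Distance used for clustering: 1 - similarity."""
    return 1.0 - jaccard_similarity(a, b)

s = jaccard_similarity({"the", "brown", "cow"}, {"the", "brown", "fox"})
print(s)  # 2 shared terms / 4 total terms -> 0.5
```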

C. K-means: K-means is a clustering algorithm that deals with unstructured data. Its aim is to partition n observations into k clusters, assigning each observation to the cluster with the nearest centroid.
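A minimal K-means loop, assuming 2-D points and Euclidean distance (a sketch of the general algorithm, not the paper's implementation, which clusters term-frequency vectors):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # assignment step
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            groups[i].append(p)
        for i, g in enumerate(groups):  # update step
            if g:
                centroids[i] = (sum(p[0] for p in g) / len(g),
                                sum(p[1] for p in g) / len(g))
    return centroids, groups

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, groups = kmeans(pts, k=2)  # separates the two point clouds
```

In the proposed system the "points" would be the term-frequency representations of the documents, with a suitable distance in place of the Euclidean one.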

D. Term Frequency: It counts the frequency of a particular term in a specific document, giving a numerical statistic that reflects the importance of the word.

Example: "the brown cow".

- Here, frequency is not the rate of something; in IR, frequency is a count.
- The log-frequency weight of t in d is: w(t,d) = 1 + log10 tf(t,d) if tf(t,d) > 0, and 0 otherwise.
- Example: tf 0 -> 0, 1 -> 1, 2 -> 1.3, 100 -> 3, 10000 -> 5.

CONCLUSION AND FUTURE WORK

This application is used for the forensic analysis of computers, laptops and hard disk drives acquired from criminals during police investigations. Many drawbacks can be overcome with the help of the clustering technique, and it has many applications in day-to-day life. Computer forensics not only improves performance but also makes this an efficient tool for investigating crimes. Forensic experts working in a computing department can obtain several practical results from their work, which will be extremely useful in delivering justice sooner.

REFERENCES

[1] K. Pallavi, S. Nagarjuna Reddy, Dr. S. Sai Satyanarayana Reddy, "Clustering of Documents in Forensic Analysis for Improving Computer Inspection", July 2014.

[2] Juan Ramos, "Using TF-IDF to Determine Word Relevance in Document Queries", Piscataway, NJ, 2002.

[3] Luís Nassif and Hruschka, "Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection", January 2013.

[4] A. Sudarsana and C. Rajendra, "Quick and Secure Cluster Labeling for Digital Forensic Analysis", A.P., India, July 2014.
