Can Data Leakage Prevention Prevent Data Leakage?


Bachelor Thesis

Matthias Luft

First Examiner: Professor Dr. Felix C. Freiling

Second Examiner: Dr. Thorsten Holz

Advisor: Dr. Thorsten Holz

University of Mannheim


Statutory Declaration


I hereby declare that I have written this thesis without the help of third parties and using only the stated sources and aids. All passages taken from the sources have been marked as such. This thesis has not been submitted in the same or a similar form to any other examination authority.

Mannheim, May 27, 2009



Abstract

Data Leakage Prevention is the general term for a new approach to avoiding data breaches. To achieve this aim, all currently available implementations analyze intercepted data. This analysis is based on defined policies which describe valuable data. There are different ways both to define these content policies and to intercept data to make the analysis possible.

This thesis examines exemplary DLP software to reveal security vulnerabilities in these solutions. These vulnerabilities can have different impacts: data breaches can still happen, or new leakage vectors can even appear. This review process is an essential step in the life cycle of every new piece of software or concept. There should be a continuous cycle of test phases and examinations before a solution can be regarded as dependable.

The results of the performed tests allow general conclusions about the maturity of current products and show basic problems of the concept of Data Leakage Prevention.





Contents

1 Introduction
1.1 Tasks
1.2 Outline
1.3 Results
1.4 Acknowledgments

2 Definitions And Concepts
2.1 Data Leakage
2.2 Safeguarding confidentiality
2.2.1 Multilevel Security
2.2.2 Access Control Lists And Capabilities
2.2.3 Information Flow
2.2.4 Hippocratic Databases
2.2.5 Email Leak Prevention/Bayesian Filter
2.3 Summary

3 Current DLP Approaches
3.1 Outline
3.2 Identify: How To Find Valuable Content
3.3 Monitor: Controlling Every Channel
3.4 React: Handling Policy Breaches
3.5 Current DLP Products
3.6 Summary

4 Evaluation
4.1 Testcases
4.2 McAfee Host Data Loss Prevention
4.2.1 Environment And Specifications
4.2.2 Testcases
4.2.3 Results
4.3 Websense Data Security Suite
4.3.1 Environment And Specifications
4.3.2 Testcases
4.3.3 Results
4.4 Summary

5 Abstraction To General Problems And Conclusions
5.1 General Problems
5.2 Alternative Solution Statements

A Security Goals
B Security Software Vulnerabilities
C Technical documentation
C.1 Filesystem Forensics
C.2 Portscanner Results
C.3 Attacks on SSL
C.4 Fuzzing


List of Figures

1 Architecture of McAfee Host Data Loss Prevention
2 Structure of the test environment
3 Text pattern for matching the string secret
4 The complete reaction rule using the text pattern
5 Copying the secret file to a USB stick
6 The secret file is correctly blocked
7 The blocked file is stored at the ePO repository
8 PDF files with removed leading line are not recognized
9 NTFS alternate data stream is monitored correctly
10 Unmonitored partition
11 Monitored partition
12 Information contained in file names is not monitored
13 Files containing sensitive data are transmitted via SMB in plain text
14 Architecture of the Websense Data Security Suite
15 The file containing the filtered string is blocked.
16 A PDF document without its first line is not recognized.
17 EXIF comments in images are not recognized.
18 There are zero lines containing anything other than zero.
19 Additional search for the string SECRET on the empty disc
20 Successful search after the valuable file was deleted by DLP
21 The management server supported insufficient key sizes


List of Tables

1 DLP testcases
2 Testcases for McAfee DLP
3 Performed testcases for McAfee Host Data Leakage Prevention and Websense Data Security Suite
4 Vulnerabilities in security software




Introduction

Data Leakage Prevention (DLP) is the general term for a new technology which focuses on the major problem of information leakage. Many of the problems which emerge from security holes concern unauthorized access to data. The impact of such incidents ranges from identity theft to lawsuits based on data breach legislation. Data Leakage Prevention provides an approach which is intended to prevent data leakage. There are already several products on the market which fulfill the requirements to be called a DLP suite [QP08]. The market for these tools is very new and not yet mature. Every new technology should be reviewed and tested several times before it can be regarded as a dependable solution for the addressed problem.



Tasks

This thesis covers basic parts of this review process by examining DLP solutions with regard to three main questions:

• Is accidental leakage still possible?

• Is it possible to subvert the leakage prevention?

• Are there any vulnerabilities in the software?

To answer these questions, a set of tests must be developed. These tests examine which forms of data leakage can be prevented by deploying a DLP suite. Possible tools for this test suite are encryption, obfuscation, and embedding of the information in different data types. To assess the vulnerabilities of the software, it is examined whether standardized technologies are used and, where possible, how the application reacts to modern fuzzing techniques.
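As a rough illustration of the kind of test this implies, the following Python sketch applies a simple keyword rule to a plain text and to three obfuscated variants of it. The rule, the sample text, and the name `rule_matches` are assumptions made for this illustration only; they are not taken from either of the examined products.

```python
import base64
import codecs
import re

# Hypothetical content rule: flag any data containing the word "secret".
RULE = re.compile(r"secret", re.IGNORECASE)


def rule_matches(data: bytes) -> bool:
    """Return True if the keyword rule fires on the raw bytes."""
    return RULE.search(data.decode("latin-1")) is not None


plain = b"This document is SECRET."

# The same content, once unchanged and three times trivially obfuscated.
variants = {
    "plain": plain,
    "base64": base64.b64encode(plain),
    "rot13": codecs.encode(plain.decode("ascii"), "rot_13").encode("ascii"),
    "reversed": plain[::-1],
}

results = {name: rule_matches(data) for name, data in variants.items()}
```

In this sketch only the unchanged text triggers the rule. A test suite along these lines would record, per channel and per encoding, whether the product under test still recognizes the content.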

The two exemplary DLP solutions examined are the very recent McAfee Host Data Loss Prevention and the Websense Data Security Suite, which is one of the leading products.

As the examination shows, both solutions contain several flaws which make them fail in important areas. These results allow abstraction to general problems of the DLP concept and an interpretation of its capabilities.



Outline

To go through the review process in a structured way, Section 2.1 defines the term leakage and motivates the need for protection against it. To distinguish DLP from other technologies which deal with access control and related problems, similar approaches are listed and illustrated in Section 2.2. Since access control is a very broad field, the range of covered developments is limited to those that can be alternatives to DLP.

This theoretical groundwork allows the comparison of several key characteristics of DLP. These characteristics appear in different stages of the prevention of data leakage and form the framework for most of the DLP solutions on the market. It is therefore necessary to understand these principles to determine whether the evaluated solutions follow these best practices. Section 3 provides an overview of these concepts.

The evaluation of the two DLP solutions is based on a set of test cases and requirements which are explained in Section 4.1. There is also a structure provided for specifying the test results of the McAfee Host Data Loss Prevention (Section 4.2) and the Websense Data Security Suite (Section 4.3).

These results allow the answering of the initial questions about the capabilities and limitations (Section 1.1) of the DLP concept. Section 5 sums up the gathered results and draws general conclusions on what DLP can be, what its problems are and what can be done instead.



Results

As the title suggests, the main topic of this thesis is the question of whether DLP can prevent data breaches. The results of the examination of the DLP solutions allow general conclusions on this question. Both solutions contain severe flaws which either restrict the ability to prevent data loss or even create new possibilities for attackers to retrieve confidential data. For example, the McAfee DLP solution did not properly delete confidential files that were copied to USB devices, and the encryption of the Websense Data Security Suite was not sufficient.



Acknowledgments

First of all, I would like to thank Professor Dr. Felix C. Freiling and Dr. Thorsten Holz, who gave me the possibility to write this thesis. Special thanks go to Dr. Thorsten Holz since he was willing to be both my advisor and second examiner.

For the initial idea for this thesis and many helpful contacts, I would like to thank Enno Rey and ERNW GmbH. Thanks also go to Michael Thumann from ERNW, who provided the evaluation version of the McAfee DLP solution. For the evaluation version of the Websense Data Security Suite, I want to thank Jörg Kortmann from Websense and Marcel Sohn and Rainer Kunze from ConMedias. They provided the evaluation version very quickly and unbureaucratically.

Again, thanks go to Dr. Thorsten Holz, Enno Rey and Phoebe Huber for providing lots of feedback for improving the structure, expression, and composition of my thesis.



Definitions And Concepts

Data leakage is a very general term which can be used in a variety of meanings. In the context of DLP, it means a certain loss of data or, more precisely, the loss of confidentiality for data. To examine solutions which prevent data leakage, it is necessary to define the term data leakage to derive from this understanding which threats exist and must be controlled.

Section 2.1 provides a definition of leakage which is used in the remainder of the document. This definition involves several characteristics of data leakage and is derived from examples of well-known data breach incidents which occurred within the last year. Even though this selection is very limited, it represents different kinds of data leakage. The characteristics cover several typical attributes of data leakage and are summed up in a single definition.

This basic groundwork is necessary to understand which threats arise from different types of data leakage and – even more important – which controls can be applied to mitigate them. As DLP is only one of these controls, Section 2.2 lists different approaches which also concern the area of access control and can be alternatives to DLP.


Data Leakage

There are many well-known examples that represent different kinds of data leakage. One of the most prominent incidents in 2008 was the sale of a camera on eBay [Gua08]. This camera contained pictures of terror suspects as well as internal classified documents of MI6, the British intelligence service. This kind of leakage means an inadvertent loss of trust in the abilities and trustworthiness of an organization, but does not affect anyone other than the organization.

Another British organization, the General Teaching Council of England, lost a CD containing data on more than 11,000 teachers [BBC08]. The CD was sent to another office via a courier service but never arrived. Fortunately, all information was encrypted so that nobody can use the lost data. At this point, it is necessary to distinguish between data and information. The remainder of this thesis refers to the term data when the pure content of any kind of medium or communication channel is meant. In contrast, data becomes information when it can be interpreted to convey a message. Thus the loss of an encrypted CD means the loss of data, but the encrypted data cannot be interpreted to obtain the information stored on the CD.

Other examples are even worse because personal data is disclosed. This kind of incident can result in identity theft for thousands of people. One of the biggest leakages of this kind in 2008 affected the German T-Mobile: 17 million customer data sets were stolen through the exploitation of security vulnerabilities in different systems and databases [Spi08].


happen every day. The website [Cle] focuses on data breaches which affect individual persons. It provides a database of leakage incidents which have disclosed more than 251 million data records of U.S. residents since January 2005. Seventeen privacy incidents occurred in December 2008 alone, and the website covers only incidents inside the USA. Reports like the annual Data Breach Investigations Report [BHV08] sum up the most prominent incidents and provide statistics about them.

Based on this exemplary selection, some of the characteristics of data leakage can be pointed out. These characteristics are independent of the kind of leakage and of how it occurred. The following items describe what characterizes data leakage and what it affects:

unintentional It is obvious that data leakage is never in the interest of the affected organization or person. Nevertheless, data leakage can result from misuse, accidental mistakes, or malicious activities.

independent from leakage vector

Information can leak on every channel over which it can be transmitted, so leakage is not restricted to a purely technical meaning. Incidental data loss – like the camera sold on eBay – is mostly based on inconsiderate user behavior which leads to a physical leakage.

independent from leaked information

Since we distinguished the terms data and information in the introduction, the term data leakage is more general than information leakage. As the case of the lost CD shows, even the loss of encrypted data, where no valuable information gets disclosed, affects the reputation of an organization.

impact Data leakage always affects the confidentiality of data and thus leads to a loss of confidentiality. As the previous item shows, even a loss of data which cannot be interpreted can lead to a loss of reputation for the company.

inadvertent When data is disclosed, there is no possibility to undo this breach [JD07]. Nobody can determine how many persons read the data or even distributed it. This restricts the response process to controls like changing passwords, informing affected users, and closing security holes.

Most of these characteristics mention the confidentiality of data. To classify controls and threats which concern data, three classical, primary security goals are used. The following definitions of these goals are given by the ISO/IEC 13335 standard and are also listed in Appendix A.



Confidentiality The property that information is not made available or disclosed to unauthorized individuals, entities, or processes

Integrity The property of safeguarding the accuracy and completeness of assets

Availability The property of being accessible and usable upon demand by an authorized entity

Based on these presumptions and examples, the following, very general, definition of Data Leakage can be derived:

Data Leakage is – from the owner’s point of view – unintentional loss of confidentiality for any kind of data.

Given this definition, it is necessary to define the term unintentional in more detail. As the definition states, data leakage is unintentional from the owner’s point of view. From the perspective of an industrial spy, however, data leakage can of course be intentional. The following items therefore define different ways in which leakage can occur. Still, the characteristics explained above apply to every leakage incident, no matter which of the following scenarios happened:

unconscious There are many situations in which people do not recognize that their behavior can lead to data leakage. For example, most people do not know that files on any kind of data storage are easily recoverable when they are just deleted using standard commands. A file which is accidentally copied to a USB stick can thus lead to leakage even when the user recognizes the mistake and deletes the file using the wrong tools. The person is not aware of the possible leakage but rather believes to be acting correctly and avoiding it.

unintentional Unintentional leakage is quite similar to unconscious leakage. The main difference is that the leakage is recognized, but there is no possibility to stop or revoke the action, for example when an email was sent to an unintended recipient due to a typo.

malicious Attackers and malware are still two of the biggest problems in the digital world. Since most pieces of malware are spread to gather more or less defined pieces of information – e.g., private data like online banking accounts [HEF08] – it is obvious that any kind of malicious activity can lead to data leakage.

third party Data leakage can also happen without any mistake of the information owner. The example of the lost CD shows that even the mistakes of business partners or service providers can lead to a loss of data and thus of reputation. In such cases, the criticism would include points like a careless choice of partners.

careless Even if policies are deployed which describe the correct handling of sensitive information, users will disregard these rules. For example, in one third of the incidents in the 2009 Data Breach Investigations Report which were caused by internal sources, careless end users are the culprits [BHV09]. If there is a policy which specifies that valuable information must be encrypted when sent to an external business partner, users will send this email unencrypted if they are in a hurry. Another point is a lack of awareness, because data leakage has never happened to them and “for sure nothing will occur”.

The remainder of this thesis uses the defined characteristics as well as the different ways in which data leakage can occur. They also influence the development of appropriate testcases for DLP solutions in Section 4.1.


Safeguarding confidentiality

The loss of confidentiality is a common problem in information security. There are many approaches which follow a variety of ways to safeguard the confidentiality of data. Since they all use different technical controls and basic ideas, an overview is needed to differentiate them from the concept of DLP.

2.2.1 Multilevel Security

Multilevel Security (MLS) systems were first developed in the seventies to map military access controls to digital systems. The oldest concepts are the Bell-LaPadula model [BL76] and the Biba model [Bib77], which protect confidentiality and integrity, respectively. MLS systems must be distinguished from other approaches which also deal with access control. For example, the more general concept of Mandatory Access Control (MAC) provides a framework for enforcing access rights at the operating system level without users being able to change any permissions, so MLS can be implemented on top of MAC. Another related concept is the Need To Know principle, which restricts access in a horizontal way: it defines in which area any access is possible at all. A simple example is a company with several organizational units. Employees may have the security clearance to access files from other departments, but they simply do not need to know them to get their work done. The principle was also developed to fulfill military needs for shared knowledge: every single unit should only know the part of the overall strategy which it really needs to perform its particular function.

MLS systems assign security labels to each subject and object. These security labels are used to prevent unauthorized access to data and allow the implementation of more granular access controls. The labels define the security clearance of subjects and objects and set MLS apart from other approaches.

The two classical variants safeguard either confidentiality – the Bell-LaPadula model – or integrity – the Biba model. By now, there are several other approaches which ensure both security goals like the Lattice model [Den76] or the Chinese Wall Model [BN89].

Since DLP addresses the protection of confidentiality, the Bell-LaPadula model is an adequate example to explain the basic principles of MLS. Like every MLS system, it assigns security labels to each subject and object. These labels usually represent the security clearances unclassified, confidential, secret, and top secret. Based on these security clearances, every access is controlled and the MLS system decides whether it is granted or not. To enforce such a preventive approach, which ensures that data is only accessed if the check of the security levels succeeded, an integrated data classification is necessary. Based on these requirements, the Bell-LaPadula model enforces two basic rules:

No Read Up A subject can read an object if its security clearance is higher than or equal to that of the object

No Write Down A subject can write to an object if its security clearance is lower than or equal to that of the object

While the first rule is obvious, the following example explains why the second rule assures the confidentiality of data: if a user with security level top secret composes a document, he must not be allowed to save it and assign it the security level secret. If he could do so, anyone with security level secret would be able to read the document, which would violate the confidentiality requirements.

In contrast to DLP, this system depends on the constant use of a security classification scheme. It grants access based on meta information while DLP needs to analyze the content.
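The two Bell-LaPadula rules can be written down as a pair of comparisons. The following Python sketch is only an illustration: the names `Level`, `may_read`, and `may_write` are chosen here and do not come from the model's original description.

```python
from enum import IntEnum


class Level(IntEnum):
    """Security clearances, ordered from lowest to highest."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


def may_read(subject: Level, obj: Level) -> bool:
    """No Read Up: the subject's clearance must be at least the object's."""
    return subject >= obj


def may_write(subject: Level, obj: Level) -> bool:
    """No Write Down: the subject's clearance must be at most the object's."""
    return subject <= obj
```

With these definitions, `may_write(Level.TOP_SECRET, Level.SECRET)` is false, which corresponds to the example above: a top secret user cannot store a document at the lower level secret.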

2.2.2 Access Control Lists And Capabilities

Access Control Lists (ACLs) and capabilities are two more concepts of access control. Neither approach uses access levels or security clearances; instead, both define exactly what a subject or object is allowed to do. An ACL can be considered a table [And01] which is assigned to an object. This table contains every user and that user’s access rights for this particular object. In contrast, capabilities define exactly what a subject is allowed to do, e.g., accessing the file file.txt for reading and writing. These approaches allow very fine-grained access rules since every interaction between a subject and an object can be controlled. Obviously, this is not practicable, so ACLs which contain groups of users are more common. Another problem is the enumeration and modification of access rights and capabilities: with ACLs, it is hard to find all the files to which a user has access and also hard to remove those access rights if a user is removed from the system. Conversely, capabilities complicate the enumeration of all users which have access to a certain object. This is even more complicated since users or processes can share capabilities and hand them over to other processes. It is therefore very hard to revoke a capability once it has been given.
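The enumeration problem becomes visible when both concepts are modeled as simple lookup tables. The following Python sketch uses made-up users, files, and function names purely for illustration.

```python
# ACLs: one table per object, mapping each user to its rights.
acls = {
    "file.txt": {"alice": {"read", "write"}, "bob": {"read"}},
    "plan.doc": {"alice": {"read"}},
}

# Capabilities: one list per subject, naming objects and rights.
capabilities = {
    "alice": {"file.txt": {"read", "write"}, "plan.doc": {"read"}},
    "bob": {"file.txt": {"read"}},
}


def objects_of_user_acl(user):
    """With ACLs, finding a user's objects means scanning every object's table."""
    return sorted(obj for obj, table in acls.items() if user in table)


def users_of_object_cap(obj):
    """With capabilities, finding an object's users means scanning every subject."""
    return sorted(subj for subj, caps in capabilities.items() if obj in caps)
```

Each representation answers one question with a single lookup (who may access `file.txt`, or what may `alice` access) and the opposite question only by a full scan, which is the asymmetry described above.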

2.2.3 Information Flow

One of the problems of MLS systems is the focus on the access of subjects to objects. If an access to an object is regarded as valid, it is, for example, not further controlled whether the data is passed to a process which can be read by other users. Information Flow models [Den76] address this gap by monitoring the complete path of data between transitive processes or objects. For example, the Bell-LaPadula model would allow write access of user A to process P. There would be no further control of whether user B has read access to P, so there would be a transitive information flow

A → P → B

and so

A → B.

Therefore, Information Flow models also assign a security level to process P which is derived from the information it handles and its clearance level. Based on this additional label, it is possible to hand over the initial security clearance of the information for further protection. This additional labeling must be supported by all programs which process the data. Additional data mark mechanisms combined with updating instructions inserted by the compiler can assure the proper handling of the labeling information.
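A minimal Python sketch of this idea follows. It assumes that the label of a process is simply the highest label of the data written to it so far; the class `Process` and its methods are hypothetical and only illustrate how the transitive flow from A via P to B can be blocked.

```python
LEVELS = {"unclassified": 0, "confidential": 1, "secret": 2, "top secret": 3}


class Process:
    """A process whose label follows the information it handles."""

    def __init__(self, name):
        self.name = name
        self.label = "unclassified"  # assumed initial label

    def write(self, data_label):
        """Data flows into the process; keep the highest label seen so far."""
        if LEVELS[data_label] > LEVELS[self.label]:
            self.label = data_label

    def read(self, reader_clearance):
        """Allow a flow out of the process only to sufficiently cleared readers."""
        return LEVELS[reader_clearance] >= LEVELS[self.label]


p = Process("P")
p.write("secret")                      # user A writes secret data to P
b_can_read = p.read("confidential")    # user B holds only confidential clearance
```

Here `b_can_read` is false: because P inherited the label secret from the data written by A, the indirect flow to the lower-cleared user B is refused.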

Since there are some problems in applying Information Flow models to real-world applications, there are similar approaches which enhance the concept. A common problem of pure Information Flow models is the fact that two subjects with different security clearances cannot communicate with each other; only the flow from the lower security level to the higher one is permitted. To resolve this problem, there are Access Flow models [Sto81] which use two kinds of labels: labels for general access and labels for potential access depending on the further use. The labels for general access implement traditional MLS security levels and control the access of subjects to objects. Which labels for potential access are assigned depends on the function that derives them from the general access labels. This function f, which computes the potential access label p from the general access labels g1, g2, ..., gn, can be implemented according to the environment and specific requirements. Basically, it determines what can be done with the contained information, e.g., copying to documents with other security levels.


Based on these flow models, a system for taint analysis was implemented by Yin et al. [YSE+07]. The system monitored data flow through an operating system to determine whether sensitive information was accessed by processes other than the intended ones. This approach was used to detect malware, but it shows that information flow models are capable of detecting data leakage, too.

2.2.4 Hippocratic Databases

Hippocratic Databases aim at transferring the principles of the Hippocratic Oath to database systems [AKSX02]. In particular, this means that more granular access controls exist to ensure that only the owner or an authorized user obtains information from the database system. They extend the principle of statistical databases, which provide statistical information without compromising sensitive information, for example by restricting queries that cover only a small number of rows and would thus allow conclusions about the contained personal information.

The ten characteristic principles of Hippocratic Databases [AKSX02] should be achieved by attaching so-called attributes, e.g., authorized-users or retention-period, to all stored information. These attributes allow fine-grained access control. Another important requirement is the absence of side channels. It must be ensured that executed queries do not reveal additional information beyond the statistics themselves, e.g., through statistical data which is based on only a small number of data sets and thus allows further interpretation.
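The following Python sketch illustrates how such attributes could gate both row access and aggregate queries. The record layout, the threshold `MIN_ROWS`, and the function names are assumptions made for this example and are not part of the design described in [AKSX02].

```python
from datetime import date

MIN_ROWS = 3  # assumed threshold below which aggregates are refused

# Hypothetical records carrying the two attributes named in the text.
records = [
    {"salary": 3000, "authorized_users": {"hr", "audit"}, "retention_period": date(2030, 1, 1)},
    {"salary": 4000, "authorized_users": {"hr", "audit"}, "retention_period": date(2030, 1, 1)},
    {"salary": 5000, "authorized_users": {"hr"}, "retention_period": date(2030, 1, 1)},
    {"salary": 9000, "authorized_users": {"hr"}, "retention_period": date(2000, 1, 1)},
]


def visible_rows(user, today):
    """Rows the user is authorized for and whose retention period has not expired."""
    return [
        r for r in records
        if user in r["authorized_users"] and today <= r["retention_period"]
    ]


def average_salary(user, today):
    """Return an aggregate only if enough rows contribute to it."""
    rows = visible_rows(user, today)
    if len(rows) < MIN_ROWS:
        return None  # too few rows: the result could reveal individual values
    return sum(r["salary"] for r in rows) / len(rows)
```

In this sketch a user with only two visible rows receives no average at all, since such a small result set would allow conclusions about individual entries.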

2.2.5 Email Leak Prevention/Bayesian Filter

Email has become one of the most important communication media. The consequence of the resulting mass of sent messages is the emerging threat of email leakage. Emails containing confidential data sent to wrong recipients – e.g., due to misspelling or wrong use of the auto-completion feature of modern mail agents, which completes email addresses after the first letters – are widely known and obvious examples. There are approaches which apply a kind of inverse spam filter to face this threat. One of these uses machine learning methods to determine whether a recipient was an intended one for a certain content or not. There are different stages [CC07] in applying learning strategies to email leakage. In a first step, recipient-message pairs are built and indexed using basic textual content analysis. These pairs are compared recipient-wise to discover which pair is most different from the other ones. In addition to this baseline method, a kind of social network analysis is used, e.g., statistical analysis of how often two recipients are addressed in the same email. Based on a real-world training set of emails [Coh], this approach was able to detect real email leakage with a success rate of up to 90%. Given the limited set of training data, it seems possible to reach a consistent recognition rate of 90% if this technique were trained in a mail client with a higher, real-world email volume.
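The baseline step can be sketched in Python as follows. This is a strongly simplified illustration and not the method of [CC07]: it uses plain word counts and cosine similarity, and the message history, addresses, and function names are invented for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Very basic textual content analysis: lower-cased word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical history of messages previously sent to each recipient.
history = {
    "finance@example.com": ["quarterly budget report attached", "budget forecast for next quarter"],
    "friend@example.org": ["lunch on friday", "football match on friday evening"],
}


def profile(messages):
    """Sum the word counts of all messages sent to one recipient."""
    total = Counter()
    for message in messages:
        total.update(vectorize(message))
    return total


def least_likely_recipient(message, recipients):
    """Score every recipient-message pair and return the least similar one."""
    msg = vectorize(message)
    scores = {r: cosine(msg, profile(history[r])) for r in recipients}
    return min(scores, key=scores.get), scores


suspect, scores = least_likely_recipient(
    "updated budget report for this quarter",
    ["finance@example.com", "friend@example.org"],
)
```

For the sample message, `suspect` is the private address, because none of its words occur in the messages previously sent there. A mail client could use such a score to ask for confirmation before sending.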




Summary

The developed definition of data leakage provides an understanding of basic leakage threats. Additionally, several approaches for mitigating some of these threats were presented. Based on this groundwork, it is possible to show what DLP actually is and for what purposes it can be used.



Current DLP Approaches

This chapter explains current approaches that are implemented in DLP solutions. The following definition of DLP solutions [Mog07] is a good source to understand all important aspects:

Products that, based on central policies, identify, monitor, and protect data at rest, in motion, and in use, through deep content analysis

Based on this definition, it is possible to derive the three main capabilities of DLP solutions:

• Identify

• Monitor

• React

Each of these steps in leakage prevention has to deal with the mentioned requirements to handle data at rest, in motion, and in use. There are several products which address single requirements, such as scanning for certain content. Complete DLP solutions must cover all of these tasks, which can also be fulfilled by single programs. This approach allows the central management of all components and thus the sharing of results for further steps. Nevertheless, the remainder of this chapter explains in depth how the different stages are handled in each situation by complete DLP suites, which allows a more fine-grained analysis of the single capabilities.



Outline

The cited definition of DLP provides a basis for the structure of the remaining chapter. Since the single phases of DLP have very different requirements for both analysis processes and examination, the understanding of these phases is necessary for a structured evaluation. Thus Section 3.2 explains how valuable content can be defined. Section 3.3 lists different channels which can transport this valuable data and describes common practices for intercepting these channels. Finally, Section 3.4 describes how detected data breaches should be handled and reported. On this basis, Section 3.5 examines what requirements must be fulfilled by a product to be called a complete DLP suite.


Identify: How To Find Valuable Content

If sensitive data is to be protected, every kind of control mechanism needs to know what the valuable data looks like. So in a first step, methods for defining data and scanning for it are needed. It is not practicable to insert every piece of information that is worthy of protection into, for example, a database. Central management is needed since the policies must be consistent and manageable. This is not the case when policies are spread across different places or tools. It is also necessary to provide generic methods to define data both as generally and as specifically as needed. The following approaches provide capabilities to discover data in various ways, which also define the methods to describe data.

Rule-Based

Regular expressions are the most common technique for defining data via an abstract pattern. At the same time, this is the biggest constraint since this approach produces a high rate of false positives due to the limited scope and missing context awareness. For example, the term confidential can be used in various, even non-confidential, contexts. Because regular expressions can be processed quickly on huge amounts of text, they can be used as a first filter to reduce the amount of data for further processing with more sophisticated methods.

Database Fingerprinting

If it is possible to identify a database that holds a lot of sensitive data, this database can be used to perform exact file matching. The data is extracted via live connections or nightly database dumps in order to check whether intercepted data matches the database content. This method produces very few false positives but can have a clear impact on database performance and on the hardware requirements of the DLP server.

Exact File Matching

Like the data extracted from a database, existing data, e.g., on a file server, can be hashed and indexed. Using these footprints, matching can be performed on any file type with a low rate of false positives. But again, this heavily increases the hardware and bandwidth requirements.

Partial Document Matching

This technique processes complete documents in order to match parts of them that appear in other documents. So every part of a sensitive document can be matched even if it is only partially included in another document or just copied and pasted into an email. To scan large amounts of documents efficiently, a special process called cyclic hashing is used. The first hash value indexes the first N characters of the document; the next value covers the next part, which also includes an overlapping section of the first one. It is important that the resulting index contains an overlapping map of the document. If suspicious documents are to be examined, the same algorithm can be used to determine whether they include sensitive data. This method produces a low rate of false positives if standard phrases are excluded, but again causes a high CPU load due to excessive hashing.

Statistical Analysis

Some DLP solutions use machine learning techniques, either by processing a set of training data or by learning continuously during operation. As is the case with learning spam filters, this approach will lead to false positives and false negatives. Nevertheless, the resulting evaluation can be used to calculate an overall “leakage score”.

Conceptual

If it is possible to define a dictionary of sensitive data, the DLP solution can judge content based on the words it contains. In this way unstructured but sensitive or unwanted data can be assessed and scored. This extends the concept of regular expressions since a complete dictionary is used for comparison. If more than N of the defined terms are found in the intercepted data, the data is judged to be confidential. In contrast, regular expressions match immediately on every occurrence.

Categories

Many organizations use information classification based on categories like secret, internal or public. If documents include an area or label representing this category, the DLP solution can recognize and process this information.
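The contrast between the rule-based and the conceptual approach can be illustrated with a minimal sketch. The pattern, the dictionary terms, and the threshold N used here are hypothetical examples chosen for illustration; they are not taken from any of the products discussed.

```python
import re

# Hypothetical rule: a single regular expression that fires on every occurrence.
RULE = re.compile(r"\bconfidential\b", re.IGNORECASE)

# Hypothetical dictionary of sensitive terms and a hypothetical threshold N.
DICTIONARY = {"confidential", "merger", "salary", "prototype"}
THRESHOLD = 2


def rule_based_match(text: str) -> bool:
    """Return True as soon as the regular expression matches anywhere."""
    return RULE.search(text) is not None


def conceptual_match(text: str, threshold: int = THRESHOLD) -> bool:
    """Return True only if more than `threshold` distinct dictionary terms occur."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & DICTIONARY) > threshold
```

A harmless sentence that merely mentions the word confidential triggers the rule-based check but stays below the dictionary threshold, which is the kind of false positive described for regular expressions.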

All of these approaches analyze content of data. The solution must thus have the ability to understand lots of different file types like Word documents, zip archives or images. This deep content analysis allows the further processing using the methods mentioned above.
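The overlapping index described under Partial Document Matching can be sketched as follows. This is a simplified illustration, not the algorithm of any particular product: the window size, the step width, the use of SHA-256, and the function names are assumptions. The protected document is indexed with overlapping windows, while the suspicious text is hashed at every offset so that a copied excerpt is found regardless of where it starts.

```python
import hashlib


def window_hashes(text: str, size: int = 20, step: int = 10) -> set:
    """Hash overlapping windows of `size` characters, advancing by `step`."""
    # Normalize case and whitespace so trivial reformatting does not hide a match.
    normalized = " ".join(text.split()).lower()
    last_start = max(len(normalized) - size, 0)
    return {
        hashlib.sha256(normalized[i:i + size].encode("utf-8")).hexdigest()
        for i in range(0, last_start + 1, step)
    }


def build_index(document: str, size: int = 20, step: int = 10) -> set:
    """Index a protected document; step < size makes the windows overlap."""
    return window_hashes(document, size, step)


def contains_protected_text(suspect: str, index: set, size: int = 20) -> bool:
    """Check every offset of the suspect text against the index."""
    return any(h in index for h in window_hashes(suspect, size, 1))
```

Hashing the suspect text at every offset is the expensive part, which matches the high CPU load mentioned for this technique.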


Monitor: Controlling Every Channel

In a second step, data must be accessible to allow the application of any kind of control. While the last section focused on content analysis, this section considers the context of data. The context of data is again related to the different states of data: in motion, at rest, and in use. Data in motion can have the context of an email or an FTP file transmission. This means a complete solution needs as many monitoring approaches as there are ways of transmitting data.

In Motion

Data in motion basically means all data that is transferred over the network. There are several monitoring points at which traffic can be intercepted. These include both integration into web proxies or mail servers and passive network monitoring using a special network port or similar controls.


At Rest

The scanning of data at rest is one of the most important use cases for a DLP solution, since it lets an organization recognize where its sensitive data is distributed. Most of the scanning can be done using network sharing methods, but in some cases, such as the assessment of endpoint systems or application servers, a software agent is needed to access all data.

In Use

Since every user action is a different use case for leakage prevention, it is not possible to monitor data in use remotely. To control every action a user may take, an endpoint agent is needed. This agent hooks important operating system functions to recognize all actions a user takes, like copying to the clipboard.


React: Handling Policy Breaches

There are several controls to handle the different kinds of detected data loss. Depending on the state of data, an appropriate reaction policy is necessary. It is not appropriate to delete sensitive documents which are found on a public file server: this would radically decrease the availability of data even if it avoided any leakage. There must be fine-grained possibilities to determine what controls should be applied. In the case of the file server, the file should be moved to a secure place, leaving a “moved to” message behind. Encryption is another way to secure discovered data: the password for decryption can be requested by providing a challenge code. The data then has to be moved to another place to avoid further policy breaches.

It can also be necessary to go through an initial phase of reconnaissance to build and improve a basic rule set. Thus it must also be possible to define policies which just notify about possible data leakage instead of protecting the data from leaking.

Similar techniques can protect the content of intercepted emails or HTTP communication. These channels can be intercepted by proxies or are even designed to use mail gateways. All complete DLP suites provide plugins or services which integrate into the proxy and mail gateway landscape. This can be implemented using a plugin to extend the proxy functionality or by providing a dedicated service which acts as a proxy and forwards the received data to the real proxy. It is more difficult to recognize sensitive data in an unknown network stream which is usually not intercepted by any station. Even if data can be monitored using a special network port, preventing the leak is hard. One possibility would be to terminate the connection with TCP reset packets [Inf81] using the sniffed sequence number.
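To make the role of the sniffed sequence number concrete, the following minimal sketch builds the 20-byte TCP header of a reset segment. It only constructs bytes and sends nothing; the function names and the example addresses are hypothetical. A receiver accepts a reset only if its sequence number falls into the current receive window, which is why the sequence number has to be taken from the monitored stream.

```python
import socket
import struct

TCP_FLAG_RST = 0x04


def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's-complement checksum used by TCP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_rst_segment(src_ip: str, dst_ip: str,
                      src_port: int, dst_port: int, seq: int) -> bytes:
    """Build a TCP header (no options, no payload) with only the RST flag set."""
    offset_and_flags = (5 << 12) | TCP_FLAG_RST  # data offset: 5 words
    header = struct.pack("!HHIIHHHH", src_port, dst_port, seq, 0,
                         offset_and_flags, 0, 0, 0)
    # The checksum also covers a pseudo header with addresses and protocol.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP, len(header)))
    checksum = internet_checksum(pseudo + header)
    return header[:16] + struct.pack("!H", checksum) + header[18:]
```

Because the monitoring station only sees copies of the packets, the reset may arrive after further data has already been transmitted, which is one reason why this approach is fragile.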

Since this connection break is not practicable due to some further factors, it is necessary to integrate the DLP solution into firewalls and routers or to deploy endpoint agents. With access to the network infrastructure, it would be possible to deploy dynamic blocking rules which drop connections that transport sensitive data.

Endpoint agents can be installed on every endpoint system and communicate with the central policy server. When they are installed and the latest policy is deployed, they can monitor all actions that take place on that particular system. It is even possible to monitor, on a non-network layer, applications whose network streams cannot be analyzed by intercepting the traffic, e.g. due to proprietary protocols or encryption. Additionally, the transfer of data to removable storage media has to be monitored and, if a policy breach is detected, blocked.


Current DLP Products

In 2008, many DLP solutions were released on the market. These new products as well as new directions in DLP are summarized every year by Gartner [QP08]. The report provides an overview of the key features of current DLP solutions. It also defines the minimal requirements that a product has to fulfill to be called a DLP suite. Appropriate products must provide, e.g., complete monitoring of all network data and monitoring of either data at rest or data in use. A central management interface is necessary to control all functions of the solution in an effective way.

These requirements distinguish complete solutions from products which focus only on one part of the complete DLP approach, like content discovery. In 2008, the report listed 16 products which met these requirements. This is an increase of almost 80% over the 9 solutions mentioned in 2007 [EQM07]. Despite this strong growth, the market is still called adolescent and evolving.

This more competitive market results in more extensive capabilities of the solutions in 2008: In 2007, it was sufficient to provide filtering or discovery on the network layer. In 2008, in line with the evolving market, all products supported scanning of data in at least two of the three states. The integration of endpoint agents into the overall solution is thereby the most important change in product capabilities.



The DLP-related definitions given in this section lay the groundwork for the understanding of DLP solutions. Without this understanding, it would not be possible to develop appropriate testcases which cover all capabilities of the solutions and also all possible leakage vectors. Additionally, the listed characteristics of DLP suites will show why the two examined DLP solutions are representative examples of this class of products.




Since the DLP market is still adolescent and evolving, it is important to evaluate current products before trusting them. Appendix B also shows that even security technologies frequently contain critical security vulnerabilities. New software or, more generally, new approaches need to be analyzed and reviewed to reveal any flaws. So the following analysis examines whether the two DLP solutions McAfee Host Data Loss Prevention [McA08] and Websense Data Security Suite [Web08] are able to protect the confidentiality of data in typical use cases.

Since the implementation of DLP components on endpoint systems was one of the main changes in 2008 [QP08], the corresponding endpoint agents of the two suites were interesting points of research. To examine these components, adequate testcases for endpoint agents are chosen from the developed general testcases (Section 4.1). Sections 4.2 and 4.3 explain the system design of the solutions and list the gathered security-related findings.



When evaluating any piece of software, the testcases derive from its application and specification. The test scenarios of a DLP endpoint solution therefore must cover all use cases which represent typical user behavior that could affect the confidentiality of data. In doing so, both intentional and unintentional leakage of data must be handled properly. Intentional data leakage primarily includes malicious activities, but also covers an employee who is restricted by the DLP solution and wants to get his work done, e.g., by sending an email containing important information to a colleague.

These use cases lead to concrete technical checks, which are summarized in Table 1. The sectioning once more follows the different stages of a DLP process. The discovery functionality must implement all the different methods to define data properly. If regular expressions are one possibility to describe a certain kind of data, these regular expressions must be applied properly. Further tests must use different file types to check whether the solutions parse known file types properly and also recognize special features like embedded data structures. A compressed archive of a Word document containing Excel sheets and images would be a first test. If this processing works, it must be examined whether real deep content analysis is performed. This includes the processing of unknown file formats, even binary files, and the processing of known data which has been tampered with to slightly obfuscate the mime type, for example by removing the first line of a PDF document. Another special case is the handling of encrypted data: information should be encrypted when sent over untrustworthy channels, but attackers can use encryption to protect their communication, too. If there are corporate encryption tools which use special mime types or other characteristics, it must be possible to add these signatures to the DLP solution.
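What "recognizing embedded data structures" demands of a scan engine can be sketched with nested ZIP archives. This is a minimal illustration under simplifying assumptions: only ZIP containers are unpacked, the search pattern is the test string SECRET, and the helper names are hypothetical. Real document formats need format-specific parsers.

```python
import io
import re
import zipfile

PATTERN = re.compile(rb"SECRET")


def contains_pattern(data: bytes, depth: int = 0, max_depth: int = 5) -> bool:
    """Search raw bytes and, recursively, the members of ZIP containers."""
    if PATTERN.search(data):
        return True
    if depth >= max_depth or not zipfile.is_zipfile(io.BytesIO(data)):
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return any(
            contains_pattern(archive.read(name), depth + 1, max_depth)
            for name in archive.namelist()
        )


def make_zip(members: dict) -> bytes:
    """Build an in-memory ZIP archive from a name -> bytes mapping (test fixture)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

The depth limit is a design choice that keeps deeply nested archives from exhausting the scanner; anything nested more deeply would go unexamined, so a tester should probe that boundary as well.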


As Section 3.5 shows, it is becoming more and more important to monitor all possible channels for data transmission. So the different monitoring modules must be evaluated as to whether they reliably intercept data. The first tests should check whether sensitive data is recognized during the transfer to USB sticks (which must contain multiple partitions and different filesystems), via email, and over HTTP connections. Further steps need to get more sophisticated: different network protocols, like FTP or more special protocols like SNMP, must be checked. The DLP solution must monitor all media, including special cases like alternate data streams of the NTFS filesystem, which append additional data to a file in a second data stream, and unknown blocks of network protocols. Even unknown network protocols should be examined using at least textual analysis.

The last step of DLP, the reaction policies, must take care of proper response processes. Depending on the policy, it must be ensured that any reaction, like the blocking of file transfers and the transmission of notifications, is performed correctly. To ensure a proper reaction, the actions should block an operation instead of reacting to it afterwards. This differentiation is necessary for the following scenario: if a file transfer has already been performed, it is hard to eliminate the possibility of so-called race conditions. If data is copied, e.g. to a USB stick, the USB stick may be removed before any reaction can be performed. Similarly, all reporting messages must be sent using secure channels. The event collector service must authenticate itself and ensure the message integrity as well as the client’s identity.

Since a lot of valuable information is needed to operate a DLP solution (policy descriptions, stored files from incidents), the DLP server is a worthwhile target for attackers. It is therefore necessary that the DLP solution was developed following secure coding principles and testing methodologies. If a solution regularly exhibits vulnerabilities, this might not have been the case. Further examination can include the use of fuzzing methods to search for new vulnerabilities. Regardless of existing vulnerabilities, all access to both the management console and the clients must be encrypted and authenticated. Only authorized users should be able to read and change policies according to their access rights. The security of the system can be confirmed by a security certification like an EAL of the Common Criteria; at the least, it is a sign that the product was developed with security in mind.


Identify
• Are all methods to match data working properly?
• Are all file types handled properly?
• Are all file extensions handled properly?
• Are unknown data structures handled properly?
• Is encrypted data handled properly?

Monitor
• Are all removable devices (USB, floppy disc, CD) monitored properly?
• Are all file systems monitored properly, including all special functionalities?
• Are all network protocols (Layer 2/3/4) handled properly?
• Are all intercepting network devices monitored properly?
• Is there a possibility to decrypt files using an enterprise encryption system?

React
• Is sensitive data blocked?
• Are all incidents reported properly?
• Are there reaction or blocking rules?
• Do reaction rules allow race conditions?
• Is there a firewall/proxy integration to block network connections?

System Security
• Is all sensitive traffic encrypted?
• Do any publicly available vulnerabilities exist?
• Can vulnerabilities easily be found using simple vulnerability assessment methods?
• Are all access rights set properly?
• Is there a security certification like a Common Criteria Level?

Table 1: General testcases


McAfee Host Data Loss Prevention

One of the new DLP solutions which arose in 2008 was the McAfee Host Data Loss Prevention suite. It is currently available in version 2.2 and exists both as a full DLP suite, including network monitoring and data discovery, and as an endpoint monitoring solution. The central management is realized via a plugin which can be integrated into the McAfee management console ePolicy Orchestrator (ePO). This management console is the central point for administering all kinds of McAfee security software, including the anti-virus components and endpoint agents. So the central maintenance of DLP policies can be easily integrated into the workflows of anti-virus administration.

The solution basically provides the monitoring of removable storage media like USB sticks or floppy discs. All content which should be written to these media is monitored and analyzed based on the central policies.

There are several methods of content monitoring, which are explained in Section 4.2.1. Based on these specifications and characteristics, custom testcases are listed in Section 4.2.2. The evaluation of these testcases and the resulting findings are listed in Section 4.2.3 and discussed centrally in Section 4.4.

4.2.1 Environment And Specifications

The McAfee DLP solution consists of several modules:

• ePO Plugin

• ePO Reporting Plugin

• EventCollectorService

• Endpoint Agent

• Database Instance

Figure 1 shows the basic interaction of the individual components. The central tool for administration is the Policy Server. Using ePO 4.0, the DLP management plugin integrates into the existing anti-virus console. This plugin allows defining DLP policies, reaction rules, and global settings for the endpoint agents, and assigning these policies to the managed systems. For storing the settings, a Microsoft SQL Server installation is needed. A new DLP database is created during the installation and is reserved for further use. Both the policy data (written by the management console) and the reporting and notification events (written by the EventCollectorService) are stored in the database. This EventCollectorService listens for messages from the endpoint agents, which report successful update processes or policy breaches. When the service receives data, it writes it to the database. Finally, the reporting plugin can fetch this data and perform further processing.

Since these components are all detached services, it is possible to install them on different machines. This allows the performance to be scaled according to the number of clients. In the test environment that was built, all services, including the database and the domain controller, ran on a single virtual machine as depicted in Figure 2. Due to the small number of two clients, this led to no performance issues.

Figure 1: Architecture of McAfee Host Data Loss Prevention

Using this environment, it is possible to monitor every removable storage device which is connected to either of the two client systems. There are several ways to define sensitive data: regular expressions, mime type recognition and manual tagging. The defined regular expressions are applied to all data which is to be read from or written to removable media and which meets the specification of the policy rule. This rule specification can contain restrictions to certain mime types or file extensions. Additionally, it is possible to tag files manually and so mark special files which do not match any category. If a breach of policy is detected, a notification is shown to the user, the data is blocked and an alert message is sent to the event collector service.
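The combination of a pattern with a restriction to certain file extensions can be modeled with a small data structure. This is a hypothetical sketch for illustration only; the class and its fields are assumptions and do not reflect the rule format actually used by the McAfee product.

```python
import re
from dataclasses import dataclass
from pathlib import PurePath


@dataclass
class PolicyRule:
    """Hypothetical content rule: a pattern limited to certain file extensions."""
    pattern: str
    extensions: frozenset = frozenset()  # empty set means "all extensions"

    def applies_to(self, filename: str) -> bool:
        """Check whether the rule's extension restriction covers this file."""
        suffix = PurePath(filename).suffix.lower()
        return not self.extensions or suffix in self.extensions

    def violates(self, filename: str, content: str) -> bool:
        """A file violates the rule if it is in scope and the pattern matches."""
        return (self.applies_to(filename)
                and re.search(self.pattern, content) is not None)
```

The sketch also shows a side effect of such restrictions: a file whose extension is outside the rule's scope is never inspected, regardless of its content.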

4.2.2 Testcases

As mentioned in Section 4.1, the concrete evaluation of an application derives from its capabilities. The examination of an endpoint agent needs only a subset of all testcases. Since only the endpoint agent for removable media monitoring is examined, no network monitoring is performed, and so Table 2 lists the necessary evaluation items.

4.2.3 Results

To examine the McAfee DLP solution, specific tests were performed to determine whether the solution can fulfill the requirements listed in Table 2. To evaluate the capabilities of the system, a simple policy was created: every file that contains the string SECRET should be blocked from being written to any removable media.


Figure 2: Structure of the test environment

Discover
• Are regular expressions matched and is data blocked?
• Are all file types handled properly?
• Are all file extensions handled properly?

Monitor
• Are all devices monitored?
• Are all file systems monitored properly?

React
• Is sensitive data blocked?
• Are all incidents reported properly?

System Security
• Is all sensitive traffic encrypted?
• Exist any obvious vulnerabilities?

Table 2: Testcases for McAfee DLP


Figure 3: Text pattern for matching the string secret

Figure 4: The complete reaction rule using the text pattern

Additionally, a notification message should show up. The blocked data should then be stored for further analysis. This policy could be applied using the text pattern shown in Figure 3 and the reaction rule shown in Figure 4. The examination of this basic system is again divided into three sections which derive from the individual stages identify, monitor and react.


The first simple test was the copying of a text file containing the text SECRET to a USB stick. As expected, the text file was blocked (Figure 5 and 6) and stored in the repository of the ePO server (Figure 7).

Figure 5: Copying the secret file to a USB stick

After this check of the correct functionality of the system, further tests regarding the file type recognition were performed. Every DLP solution supports many file types that can be understood and parsed. So the following combinations of more sophisticated file types were generated:

• PDF document containing the string SECRET

• Zip archive containing this PDF document

• Word document containing an embedded Excel table containing the string SECRET


• Zip archive containing this Word document

Since all of these file types are officially supported, all four files were detected as valuable content and thus blocked.

To check whether a real deep content analysis is performed, the test was repeated using the same PDF file with its first line, which usually contains hints on the file type, removed. The following listing shows that the first line %PDF-1.4 is missing in the second file secret-nomime.pdf:

$ diff -a -u secret.pdf secret-nomime.pdf
--- secret.pdf 2009-01-24 13:00:51.000000000 +0100
+++ secret-nomime.pdf 2009-01-27 10:14:38.000000000 +0100
@@ -1,4 +1,3 @@


This slightly modified file was not detected by the scan engine, as Figure 8 shows. In general, this means that unknown mime types are not monitored. Similarly, a PNG image which contained the EXIF comment SECRET (EXIF is a standard for metadata embedded in images) was copied to a USB stick. This test examines whether recognized document formats are parsed completely and correctly. As the file could be copied to the USB stick, this is another failure of the solution.
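The effect of the removed first line can be reproduced with a toy model of a scan engine that only inspects files whose type it recognizes by their leading bytes. This is an assumption made for illustration; the actual detection logic of the product is not known. The sample data is a simplified stand-in for a PDF file, and real PDF content streams are often compressed, so a raw search is not a complete remedy either.

```python
PDF_MAGIC = b"%PDF-"


def strip_first_line(data: bytes) -> bytes:
    """Return the data without its first line (up to and including the newline)."""
    _, _, rest = data.partition(b"\n")
    return rest


def looks_like_pdf(data: bytes) -> bool:
    """Naive type detection that only inspects the leading magic bytes."""
    return data.startswith(PDF_MAGIC)


def scan(data: bytes, pattern: bytes = b"SECRET", fallback: bool = False) -> bool:
    """Scan recognized files; optionally fall back to a raw search otherwise."""
    if looks_like_pdf(data) or fallback:
        return pattern in data
    return False  # unknown type: the content is never inspected
```

With `fallback=True` the tampered file is still caught, which corresponds to the requirement that unknown formats should at least be examined textually.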


The next step is to examine whether all possible channels are monitored properly to ensure that all data can be discovered and analyzed. In a first step, it was tested whether the monitoring engine is restricted to certain file systems. A file containing an NTFS alternate data stream was prepared and copied to the USB stick. This special extension of the NTFS file system was monitored correctly (Figure 9).

Completely different file systems, like the Linux ext3 file system, which was made available by installing a special driver, were also monitored. A file containing the secret string was detected and blocked.


Figure 8: PDF files with removed leading line are not recognized


Figure 10: Unmonitored partition

Figure 11: Monitored partition

In contrast, it is a severe restriction that devices containing more than one partition are not properly monitored. Using a USB hard drive containing three partitions, only the last mounted partition was monitored during several attempts. The first two were not monitored. Thus it was possible to write arbitrary data containing the secret string to the first two partitions (Figure 10 and 11). Another kind of side channel is the possibility of copying a file which is named SECRET.txt to a USB stick. Even though the file name contains the valuable information SECRET, the file is not blocked (Figure 12).

To ensure the stability of the endpoint agent, it is important to analyze the behavior of the system when a big file must be examined. Since McAfee Host Data Loss Prevention does not analyze files which are bigger than 500 MB, it was not possible to run a test using a 5 GB file containing random data.


Figure 12: Information contained in file names is not monitored


Since the last two paragraphs established which data is recognized and monitored, the reaction methods now need to be examined in detail.

As Figure 6 showed, data is deleted if valuable information is discovered. To check whether valuable data is really deleted, a completely empty USB stick was prepared. Every sector of the flash memory was overwritten with zeros (more technical descriptions of all further steps can be found in Appendix C, the documentation for this paragraph in C.1). When this USB stick was inserted into the client system, it had to be formatted and could then be mounted. A file containing SECRET several times was written to the stick and, as expected, was blocked and deleted. After this process, the stick was examined on a very low level without using any filesystem structures (the standard forensic methods used are described in Appendix C.1). The resulting data still contained the secret string SECRET. Thus it is possible to bypass the DLP solution by recovering the deleted files. To eliminate the side effects of wear leveling techniques [Cha07], the test was verified using a floppy disc, ending up with the same result.
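A low-level examination of this kind essentially searches the raw bytes of the medium without interpreting any filesystem structures. The following sketch shows the idea on an arbitrary byte stream; it is an illustration and not the forensic procedure documented in Appendix C.1, and the function name and chunk size are assumptions.

```python
def find_in_raw_image(stream, needle: bytes, chunk_size: int = 4096) -> list:
    """Return the byte offsets of `needle` in a raw stream, read chunk by chunk."""
    offsets = []
    overlap = len(needle) - 1  # keep enough bytes to catch matches across chunks
    tail = b""
    position = 0  # absolute offset of the first byte of `tail + chunk`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        hit = window.find(needle)
        while hit != -1:
            offsets.append(position + hit)
            hit = window.find(needle, hit + 1)
        tail = window[-overlap:] if overlap else b""
        position += len(window) - len(tail)
    return offsets
```

The stream can be any binary file object, for example a raw image of the medium opened in binary mode. Keeping a short overlap between chunks ensures that an occurrence straddling a chunk boundary is still found, and found only once.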

In a last step of reaction, the blocked data is stored in a central repository. This central repository is made available as a network share via SMB. Monitoring this transmission showed that the file is simply delivered via SMB without further encryption. Since SMB is a plain text protocol, it was no problem to intercept the traffic and extract the sensitive information (Figure 13).


Figure 13: Files containing sensitive data are transmitted via SMB in plain text

System Security

It is obvious that the DLP architecture is a worthwhile target for attackers. A lot of valuable data is stored in the incident repository and descriptions of all secret patterns are available, so no further searching would be necessary. Thus the overall security of the DLP system has to be appropriate to protect the mass of confidential data. The first phase of reconnaissance revealed several exposed network services which could be discovered using port scanning techniques (results are listed in Appendix C.2). The most interesting port was TCP port 8443 of the ePO server since it is used for logging in to the management console. The use of the encrypted HTTPS protocol for all traffic is definitely necessary, but the server supported insecure encryption algorithms with insufficient key lengths (Appendix C.3), which would allow the decryption of intercepted traffic. Usually most clients support strong cipher algorithms, but if any kind of mobile or embedded device is used to access the web service, it is possible that only a weak cipher is negotiated.
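Whether a negotiated cipher offers a sufficient key length can be checked mechanically. The sketch below classifies a cipher description in the three-value form that Python's `ssl.SSLSocket.cipher()` returns for an established connection (cipher name, protocol version, number of secret bits). The 128-bit threshold and the sample tuples are assumptions for illustration; they are not the findings listed in Appendix C.3.

```python
def is_weak(negotiated: tuple, min_bits: int = 128) -> bool:
    """Return True if the negotiated cipher uses fewer than `min_bits` secret bits."""
    _name, _protocol, secret_bits = negotiated
    return secret_bits < min_bits


def weak_cipher_names(negotiated_ciphers: list, min_bits: int = 128) -> list:
    """Return the sorted names of all ciphers below the threshold."""
    return sorted(c[0] for c in negotiated_ciphers if is_weak(c, min_bits))
```

A tester would repeat the handshake with different client cipher preferences and feed each result into such a check, since the server's weakest accepted cipher determines what a restricted client may end up using.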

Another web interface on the central server can be reached using TCP port 80. It provides remote access to functions like DeleteAllEvents, which purges all events from the reporting system. Since the unencrypted HTTP protocol is used, the NTLM handshake used for authentication can be sniffed. If weak passwords [Ril06] are used, the password can be brute forced and an attacker could log in to delete all events. To mitigate these threats, strong SSL ciphers with appropriate authentication should be used to ensure that no communication can be sniffed.

To find programming flaws in the endpoint agent, all interfaces must be enumerated. In a first step, a port scan (Appendix C.2) revealed a listening network service on TCP port 8081 on the client side. Since this interface allows an attacker to send arbitrary data to the service, it is important that no vulnerabilities exist in this exposed network interface. Otherwise it could be possible to compromise the complete system by exploiting the management port of the DLP solution.

The second step is the enumeration of local functions which are intercepted by the DLP solution. All software solutions which monitor activities on the system, like anti-virus software or personal firewalls, must intercept certain system interfaces to see whether they are accessed. It is for example possible to hook the operating system function that creates new files. Based on this interception, a DLP solution can read files that are about to be created and block these calls if they would violate any policy.
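The principle of hooking a file creation function can be shown in miniature. The sketch is an analogy in Python, not the mechanism used by an endpoint agent, which intercepts operating system interfaces; the class `RemovableMedia` and all other names are hypothetical stand-ins.

```python
import functools


class PolicyViolation(Exception):
    """Raised by the hook when a write would violate the policy."""


class RemovableMedia:
    """Stand-in for an operating system interface that creates files."""

    def __init__(self):
        self.files = {}

    def create_file(self, name: str, data: bytes) -> None:
        self.files[name] = data


def install_hook(target, method_name: str, is_sensitive) -> None:
    """Replace target.method_name with a wrapper that inspects the data first."""
    original = getattr(target, method_name)

    @functools.wraps(original)
    def hooked(name, data):
        if is_sensitive(data):
            # Block before the original function runs, so nothing is written.
            raise PolicyViolation(f"write of {name!r} blocked")
        return original(name, data)

    setattr(target, method_name, hooked)
```

Because the check runs before the original function is called, the operation is blocked rather than undone afterwards, which avoids the race conditions described for purely reactive rules.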

Local attackers or malicious users could also inject code into these functions. This could for example disable the DLP solution or help the attacker gain the higher privileges of the DLP solution. These so-called API hooks were enumerated using memory analysis techniques (Appendix C.4).

Using this knowledge of hooked functions and network ports, it was possible to analyze the behavior of these interfaces using so-called fuzzing techniques [SGA07]. This black box analysis method reveals major programming flaws which can also lead to security vulnerabilities. In general, it gives an indication of whether the application was properly tested and developed following secure coding best practices. The performed tests interacted with both the network socket and the hooked API functions and are described in detail in Appendix C.4. This basic evaluation did not reveal any crashes of either the DLP client or the operating system.
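The basic idea of mutation-based fuzzing can be sketched in a few lines. This is a generic illustration, not the tooling described in Appendix C.4: the target is an arbitrary Python callable, and treating `ValueError` as a clean rejection of malformed input is an assumption of this sketch.

```python
import random


def mutate(seed: bytes, rng: random.Random, max_flips: int = 4) -> bytes:
    """Return a copy of `seed` with a few randomly chosen bytes replaced."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, max_flips)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int = 1000, rng_seed: int = 0) -> list:
    """Feed mutated inputs to `target` and collect inputs that fail unexpectedly."""
    if not seed:
        raise ValueError("seed must not be empty")
    rng = random.Random(rng_seed)  # fixed seed keeps a run reproducible
    findings = []
    for _ in range(iterations):
        candidate = mutate(seed, rng)
        try:
            target(candidate)
        except ValueError:
            pass  # a clean rejection of malformed input is expected behavior
        except Exception as error:  # anything else counts as a finding
            findings.append((candidate, error))
    return findings
```

An empty result, as in the evaluation described here, only shows that this particular set of inputs caused no failure; it does not prove the absence of flaws.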


Websense Data Security Suite

According to the DLP market overview by Gartner [EQM07], the Websense Data Security Suite was already one of the leading DLP solutions in 2007. At that time, however, it only provided network monitoring and discovery functionalities. Since an endpoint agent was added in 2008, it is a suitable complement to the DLP newcomer McAfee. Central management of both policies and client administration is realized by standalone or web applications.

Like the McAfee Host Data Loss Prevention, the Data Security Suite monitors all removable storage media so that the same set of testcases can be applied.

4.3.1 Environment And Specifications

Regarding only the functionality relevant to endpoint protection, the Websense Data Security Suite includes the following components:

• DSS Server

• Endpoint Agent


Figure 14: Architecture of the Websense Data Security Suite

The DSS Server is the central management and analysis component. Additionally it provides log files, statistics and stored incidents. To ensure the availability of this core system, it is possible to operate multiple instances of it; in this case, one of them must be the master DSS Server. Figure 14 shows the basic interaction between the individual components. In contrast to McAfee Host Data Loss Prevention, the Websense solution installs its own database server by default. It is possible to specify an existing database server, but since there is no need for a dedicated installation it is not shown in Figure 14. The DSS Server itself is divided into three tools:

• DSS Manager

• Management Console

• Policy Wizard

The DSS Manager provides a frontend to access both analysis data and global configuration settings. It offers detailed data on each incident, statistics and timelines. Additionally, endpoint and server agents can be configured at this single point. The Management Console is used to create the policies which define valuable data and adequate reaction rules, and to administer users and roles. There are also many predefined policies which protect classes of data that are particularly valuable in a specific country. For example, there is a policy which protects social security numbers which are used in the USA. These policies can be accessed via the Policy Manager.


Figure 15: The file containing the filtered string is blocked.

4.3.2 Testcases

To make a statement about the security of the two DLP solutions, it is necessary to have a direct comparison. Since the two solutions can address the same leakage vector and use similar protection methods, the same testcases as for the McAfee solution were applied (Table 2). This comparison allows a more substantiated conclusion on the overall security and maturity of DLP solutions.

4.3.3 Results

The performed tests resulted in different findings which are listed in the remainder of this section. Again, the same challenge as for the McAfee Host Data Loss Prevention was used: a policy that should prevent the copying of files containing the string SECRET to USB media was deployed to an endpoint agent. To verify the correct functionality, a PDF document containing the defined string was copied to a USB stick. As Figure 15 shows and the policy dictates, the file is blocked. The mentioned tests were performed using this test environment. The results are listed below and are divided into the categories identify, monitor, react and system security.


Figure 16: A PDF document without its first line is not recognized.


To verify the correct processing of different file types and file operations, a PDF document containing the string SECRET was copied to a USB stick and the detection worked correctly: the file was blocked. Again, the next test was the copying of the same file after its first line was removed. This small change was enough to circumvent the DLP solution, and the file was copied successfully to the USB stick, as Figure 16 shows. This test checks whether files are inspected according to their MIME type. Since the McAfee DLP solution also failed to recognize this file without the MIME type information, the same PNG file containing the EXIF comment SECRET was copied to the USB stick. This time, the Websense Data Security Suite did not block the copying, as Figure 17 shows.
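The PDF result is consistent with a scanner that selects its content decoder from the file header. The following toy model in Python shows why removing the first line can be sufficient under that assumption. It is not the vendor's implementation: the blob built here is not a valid PDF, the zlib stream merely stands in for a compressed PDF content stream, and all names are hypothetical.

```python
import zlib

KEYWORD = b"SECRET"
PDF_MAGIC = b"%PDF-"


def make_toy_pdf(text: bytes) -> bytes:
    """Build a PDF-like blob: a header line followed by zlib-compressed text."""
    return b"%PDF-1.4\n" + zlib.compress(text)


def extract_pdf_text(blob: bytes) -> bytes:
    """Stand-in for a real PDF text extractor: skip the header line, then inflate."""
    _, _, body = blob.partition(b"\n")
    return zlib.decompress(body)


def naive_scan(blob: bytes) -> bool:
    """Return True if the file should be blocked.

    Assumed flaw: content is only decoded when the header identifies the type.
    """
    if blob.startswith(PDF_MAGIC):
        return KEYWORD in extract_pdf_text(blob)
    # Unknown type: only the raw bytes are searched, so compressed text is missed.
    return KEYWORD in blob


original = make_toy_pdf(b"This document is SECRET. " * 4)
stripped = original.partition(b"\n")[2]  # the same file without its first line
```

In this model, `naive_scan(original)` finds the keyword, while `naive_scan(stripped)` does not, although both blobs carry the same compressed text.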


The system monitors all tested channels correctly. For example, both NTFS alternate data streams and the Linux ext3 filesystem were monitored and every information breach was detected. However, a test of the stability of the application failed: during the copying of a 5 GB file filled with random data (which also contained the string SECRET) to a USB hard disk, the complete operating system froze in all three test runs. Without the endpoint agent running, this process completed without any problems. In the worst case, this system freeze could hint at some kind of overrun in the agent software. Depending on the kind of overrun, the vulnerability might be exploitable by an attacker. Otherwise, it only affects the availability of the system.
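A test file of this kind could be produced with a short Python sketch like the following. The function name, the chunk size and the choice to place the keyword in the middle of the file are assumptions made for illustration; the thesis does not state how its file was generated.

```python
import os


def write_test_file(path, size, keyword=b"SECRET", offset=None, chunk=1 << 20):
    """Write `size` random bytes to `path` and overwrite `keyword` at `offset`.

    The data is written in chunks so that large files do not have to fit in memory.
    Returns the offset at which the keyword was placed.
    """
    if offset is None:
        offset = size // 2
    if offset < 0 or offset + len(keyword) > size:
        raise ValueError("keyword does not fit into the file")
    with open(path, "wb") as fh:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            fh.write(os.urandom(n))
            remaining -= n
        # Jump back and replace part of the random data with the keyword.
        fh.seek(offset)
        fh.write(keyword)
    return offset
```

For a file like the one described above, `size` would be `5 * 1024**3`; a much smaller value is sufficient to try the function.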


Figure 17: EXIF comments in images are not recognized.


By default, the communication between the endpoint agents and the DSS server is handled using HTTPS, which implies that all communication is encrypted. This encryption can be turned off, which made it possible to analyze the communication protocol. The following listing shows the plain text communication (without its HTTP header information, to improve readability) that takes place when the endpoint agent registers at the server after each start:

Request (client to server):

CPS_CLIENT4626415283943780173|xp-template|N/A|N/A| .K|....

Response (server to client):

CPS_CLIENT4626415283943780173|78|.XaO.... ...M.e.s.s.a.g.e. .w.a.s. .h.a.n.d.l.e.d. .s.u.c.c.e.s.s.f.u.l.l.y...

This protocol is vulnerable to at least one attack. An attacker is able to intercept the traffic from the client to the server and vice versa (Appendix C.5). If this happens, the attacker can drop the requests from the client to the server and send arbitrary answers to the client. The complete response of the server is predictable:

CPS_CLIENT4626415283943780173: Can be extracted from the client request.

78: Answer code which stands for Message was handled successfully. There are also other message codes, such as the one for Incident was handled successfully.


Message body: Derives from the answer code.

Thus an attacker can intercept the reporting of an incident, drop the request and send the answer Incident was handled successfully to the client. The incident report would never reach the server and thus would never be recorded.
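How little information is needed to predict such a response can be sketched in Python. The sketch relies only on the field order visible in the listing (session id, answer code, body, separated by pipes). The function names are hypothetical, and the body is a placeholder because the real body bytes are truncated in the listing.

```python
SUCCESS_CODE = "78"  # answer code observed for "Message was handled successfully"


def session_id(request: str) -> str:
    """Return the first pipe-delimited field of a client request."""
    return request.split("|", 1)[0]


def forge_response(request: str, code: str = SUCCESS_CODE, body: str = "<body>") -> str:
    """Assemble a response in the observed field order: session id, code, body.

    `body` is a placeholder; the real body derives from the answer code.
    """
    return "|".join((session_id(request), code, body))


# Registration request as shown in the listing (trailing fields are elided there).
request = "CPS_CLIENT4626415283943780173|xp-template|N/A|N/A|..."
response = forge_response(request)
```

Everything a man in the middle needs to build the response is either contained in the intercepted request or fixed for a given answer code.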

It could also be possible for an attacker to inject faked messages into the incident reporting system. It was not possible to replay an initial registration request of the client. However, since there is no additional client verification using, for example, certificates, and the session id (in the example above: CPS_CLIENT4626415283943780173) is generated only on the client side, the server has no means to verify the identity of the client. If an attacker gets access to the data on the client system, he has access to all data the endpoint agent uses to generate the session id following a certain algorithm. Since all data and program code for this algorithm reside on the client side, an attacker might be able to reconstruct the algorithm and then generate valid session ids.

System Security

As mentioned in Section 4.3.3, the communication between the endpoint agent and the DSS server is encrypted using HTTPS. HTTPS is based on SSL, which in turn uses certificates to authenticate the two endpoints. Since the Websense Data Security Suite uses a certificate only on the server side, the client is not authenticated. Additionally, the client does not verify whether the server’s certificate is valid. Thus an attacker is able to perform an SSL man-in-the-middle attack, which is described in Appendix C.5. This results in the decryption of the protocol and, in turn, in the disclosure of sensitive data. Since only data that is judged to match the policy (that is, the valuable data) is sent, this actually adds a further vector for data leakage.
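For comparison, the two missing checks (validation of the server certificate by the client and authentication of the client by the server) can be sketched with Python's standard `ssl` module. This is a generic illustration of mutual authentication, not a description of how the Websense product could be configured; all file paths are hypothetical parameters.

```python
import ssl


def client_context(ca_file=None, cert_file=None, key_file=None):
    """TLS context for an agent that verifies the server and can present its own certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # create_default_context already enables both checks; they are repeated for clarity.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The client certificate is what allows the server to authenticate the agent.
        ctx.load_cert_chain(cert_file, key_file)
    return ctx


def server_context(cert_file=None, key_file=None, client_ca_file=None):
    """TLS context for a server that rejects agents without a valid client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH, cafile=client_ca_file)
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)
    return ctx
```

With both contexts in place, a man in the middle would need a certificate trusted by the client as well as a client certificate trusted by the server in order to relay the traffic unnoticed.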



The findings from Sections 4.2.3 and 4.3.3 show that the DLP solutions are not yet mature, even considering the fact that the Websense Data Security Suite is one of the leading solutions in this field.

The McAfee Host Data Loss Prevention left so many possibilities for even accidental leakage (e.g. the copying of data to one of the unmonitored partitions of a USB hard drive) that it would be dangerous to rely on the system as part of the security concept. And even if the scan engine of the Websense solution may be able to prevent accidental leakage, it introduces an additional leakage vector to the network due to the lack of mutual authentication. These findings are summarized in Table 3 to provide a quick overview of the capabilities of the solutions. If a solution passed a test, a check mark is used; otherwise, a cross is shown.




