2017 2nd International Conference on Artificial Intelligence: Techniques and Applications (AITA 2017) ISBN: 978-1-60595-491-2
Data Loss Prevention System Based on Big Data
Chun-wei WANG
*Room 318, Taiyangdao Building Tower B, No. 33 Shijingshan Street, Shijingshan District, Beijing, China
*Corresponding author
Keywords: Data loss prevention, Big data, Data classification, Policy design, Incidents analysis.
Abstract. Data leakage now causes great losses, including both heavy economic damage and serious social impact. In this paper, sensitive data is classified and graded, and strategies are designed using big data and artificial intelligence, so that data loss through the network, the host, and database storage can be monitored and blocked, and the resulting events can be analyzed intelligently. With these intelligent methods, a new data loss prevention system based on big data is constructed to protect the security of enterprise data assets.
Introduction
With the arrival of the Internet Plus era, data has become a core asset for more and more organizations. Information is moving from paper to electronic form, including transaction records, customer information, investment plans and so on. The transmission and use of data on the Internet have increased data security threats and vulnerabilities. Incomplete security measures, poor security awareness, and imperfect management mechanisms have all increased the risk of data loss.
Data at risk of loss can be classified into three categories: at rest, in motion, and in use. Enterprises generally rely on firewalls, antivirus software and similar methods to prevent outside intrusion. However, according to a survey from the China Information Technology Security Evaluation Center, the ratio of important data stolen by hackers to important data leaked by internal employees is 1:99, so most information leaks originate inside the enterprise.
At present, the state attaches great importance to information security and has successively promulgated the "Network Security Law", the "National Cyberspace Security Strategy", the "Network Security Classification Prevention System 2.0" and other laws and regulations, setting explicit requirements for the protection of key information infrastructure. It is clearly stated that network operators are responsible for protecting network security and for preventing data from being lost, stolen, or tampered with [1].
Therefore, a data loss prevention system based on big data is constructed. It classifies and grades sensitive data and designs strategies intelligently to improve the monitoring and blocking of sensitive data that leaves through the network or the host, or that is stored illegally in databases. Intelligent analysis and judgment of events improves the efficiency of event discrimination. By protecting enterprise data assets comprehensively, the system prevents the loss of core data that would otherwise affect business operation, and ensures that sensitive data loss can be discovered beforehand, intercepted while in progress, and traced afterwards.
Introduction to Data Loss Prevention
To meet the data loss prevention requirements and risks of different stages, current industry solutions include network-based data loss prevention (NDLP), host-based data loss prevention (HDLP), and illegal storage discovery (Discover storage). NDLP is usually deployed inside the internal network and at the exit to the external network, so that all data entering or leaving the internal network is checked for security compliance. HDLP is deployed on the terminal hosts where sensitive data is stored; once data on a protected host is found to have been transferred out, HDLP intercepts the transfer or raises an alert. Discover storage is deployed in a data storage area to detect large volumes of sensitive data stored in violation of policy. The specific deployment is shown in Figure 1.
[Figure 1 labels: Internet; Management Platform; Network Monitoring; Scan Discovery; Office Network; Sensitive Network; Application & Storage; Network Blocking]
Figure 1. Data loss prevention deployment.
Construction of Data Loss Prevention System Based on Big Data
With the rapid development of information technology, the applications of cloud computing, big data, the Internet of Things, and the mobile Internet have gradually deepened. They bring great help to the development of all walks of life but also bring great security risks, such as frequent information loss incidents, huge economic losses, and adverse social influence. The demand for data protection from the state, enterprises, and individuals has reached an unprecedented height. However, traditional risk-oriented security prevention can hardly meet the current security situation. To cope with systemic data loss risk, our solution is to build an intelligent data loss prevention system. Through the classification and gradation of sensitive data and strategies based on big data and artificial intelligence, the system monitors or blocks data leaving the network or the terminal host, as well as data stored illegally in databases. The resulting events are then analyzed and judged, forming a closed loop of data loss prevention and realizing all-round protection of enterprise data assets. The data loss prevention system is shown in Figure 2.
[Figure 2 labels: governance-related goals (classification and organization, risk evaluation, intelligent policy); data loss prevention for network (HTTP, HTTPS, FTP, webmail), endpoint (USB, print, share, IM), and storage (database, file server); incident intelligence analysis; incident handler; data management system]
Figure 2. Data loss prevention system based on big data.
In response to the state's call for "core technology, product autonomy, controllability and localization", and based on the idea of the data loss prevention system, ZhongJi Petroleum Communication Construction Co Ltd. has independently researched and developed a set of data loss prevention products named "Eagle Eyes of Information Security". Working together, the products form a complete DLP solution based on auditing of network protocols, application content, and file content, with a focus on detection strategy. The products have been applied successfully in information security audit projects, providing all-round, multi-angle monitoring and prevention at different granularities for important data of the state and enterprises. In the future, with big data technology, the products will become more intelligent and systematic.
Data Classification
To build a data loss prevention system, sensitive data must first be sorted and analyzed. To answer the questions "what data needs to be protected, and what is the priority of data prevention?", the content and security level of sensitive data assets must be classified and graded [2]. Completing this step is of great importance to the implementation of a DLP project. On the one hand, it provides clear objects for data loss prevention; on the other hand, it provides a clear direction for formulating data loss prevention rules. Data classification and gradation proceeds as follows, and the process is shown in Figure 3:
1) Determine the expected goals of data prevention and the content to be protected;
2) Determine the data classification: the secrecy department, business departments and IT departments should work closely together. The data of an enterprise can be classified according to the enterprise's classified catalogue of trade secrets;
3) Determine the data level: according to data sensitivity and importance, grade the sensitive data into different levels, such as core business secret, common business secret and so on.
[Figure 3 labels: Data Asset Collection; Data Classification; Data Gradation; Data Asset Management Rules; Business Data; Management Data; Structured Data; Unstructured Data; Semi-structured Data; Data Feature Association Analysis; Data Protection Requirement; Protection Policy]
Figure 3. Process of Enterprise Data Classification.
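As a minimal sketch of the classification and gradation steps, the following Python fragment records a category and a grade for each data asset and orders assets by protection priority. The `Grade` levels, the `DataAsset` fields, and the sample assets are hypothetical and only illustrate the idea; they are not taken from the system described in this paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Hypothetical sensitivity grades, ordered from least to most sensitive."""
    PUBLIC = 0
    COMMON_BUSINESS_SECRET = 1
    CORE_BUSINESS_SECRET = 2


@dataclass(frozen=True)
class DataAsset:
    name: str
    category: str    # assumed labels, e.g. "business" or "management" data
    structure: str   # "structured", "semi-structured", or "unstructured"
    grade: Grade


def protection_priority(assets):
    """Sort assets so that the most sensitive ones come first (ties by name)."""
    return sorted(assets, key=lambda a: (-a.grade, a.name))


assets = [
    DataAsset("meeting_minutes.docx", "management", "unstructured",
              Grade.COMMON_BUSINESS_SECRET),
    DataAsset("cafeteria_menu.txt", "management", "unstructured", Grade.PUBLIC),
    DataAsset("customer_table", "business", "structured",
              Grade.CORE_BUSINESS_SECRET),
]
ordered = protection_priority(assets)
```

Ordering by grade gives the later policy design a direct answer to the question of which data should be protected first.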
Intelligent Strategy Design Based on Big Data
After data classification is complete, effort should focus on the design of the prevention strategy. Strategy is the core of a data loss prevention system, and its quality directly affects the effectiveness of event monitoring. Intelligent strategy design is realized by using high-performance strategy matching technology. Figure 4 shows the specific implementation process:
[Figure 4 labels: Data Assets Sample; Intelligent Chinese Segmentation; Intelligent Clustering; Keywords Extraction; Multidimensional Keyword Intelligent Analysis; Classification Result Screening; High Frequency Word Intelligent Screening; Strategy Compilation; Policy Validation; Policy Issue; Corpus; Policy Library]
Figure 4. Process of Intelligent Strategy Design.
1) Use big data technology to analyze sensitive events, establish a data analysis model, and obtain data information and associated information;
2) Further, cyclic iteration can be adopted to optimize strategies that have a high false positive rate and to improve strategy accuracy;
3) Deploy the generated strategy into the data loss prevention system to realize all-round monitoring and prevention of sensitive data leaving through the network boundary or terminal channels, or stored in a database.
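The design loop above can be sketched in Python under strong simplifying assumptions: keywords are picked by document frequency in sensitive samples, and the strategy is tightened iteratively until its false positive rate on benign validation samples is acceptable. The function names, the whitespace-style tokenizer (a real system would use Chinese word segmentation), the sample texts, and the `max_fpr` threshold are all hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase word tokens; stands in for Chinese word segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(sensitive_docs, benign_docs, top_n):
    """Pick words frequent in sensitive samples and absent from benign ones."""
    benign_vocab = {tok for doc in benign_docs for tok in tokenize(doc)}
    doc_freq = Counter(tok for doc in sensitive_docs for tok in set(tokenize(doc)))
    candidates = [w for w in doc_freq if w not in benign_vocab]
    candidates.sort(key=lambda w: (-doc_freq[w], w))  # deterministic order
    return candidates[:top_n]


def matches(policy, text, min_hits):
    """A text triggers the policy if it contains at least min_hits keywords."""
    return len(set(tokenize(text)) & set(policy)) >= min_hits


def false_positive_rate(policy, benign_docs, min_hits):
    hits = sum(matches(policy, doc, min_hits) for doc in benign_docs)
    return hits / len(benign_docs)


def tune_policy(policy, validation_benign, max_fpr):
    """Raise the hit threshold step by step until the FPR is acceptable."""
    for min_hits in range(1, len(policy) + 1):
        if false_positive_rate(policy, validation_benign, min_hits) <= max_fpr:
            return min_hits
    return len(policy)


sensitive = [
    "contract price for pipeline project alpha",
    "pipeline project alpha bid price",
    "project alpha contract draft",
]
benign = ["lunch menu for friday", "team meeting on friday"]
validation_benign = ["project meeting on friday", "price of lunch", "holiday notice"]

policy = extract_keywords(sensitive, benign, top_n=4)
min_hits = tune_policy(policy, validation_benign, max_fpr=0.1)
```

In this toy data a single keyword hit flags one of three benign messages, so the loop raises the threshold to two hits, which still matches every sensitive sample.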
Data Loss Prevention Systems
Data loss prevention takes data as the center and considers the entire life cycle of data generation, storage, and transmission. It develops all-round protection for the network, the host, and storage, building an integrated prevention system that includes sensitive data network monitoring technology, serial blocking technology, and illegal storage discovery technology.
Data Loss Prevention Based on Network. Network data loss prevention includes network monitoring and network blocking. Both can identify mainstream application layer protocols and provide loss prevention for channels such as web, mail, file transfer, and instant messaging. According to actual business needs, data labeled as key or important should be blocked to prevent information assets from leaving in breach of security policy, whether intentionally or accidentally, and so avoid economic losses to the enterprise.
1) Network monitoring
Through high-speed network packet capture, network session restoration, protocol analysis, file content extraction and other technical means, network monitoring can detect data loss security incidents.
Within the enterprise intranet, Internet egress traffic is large, and the Internet egress bandwidth of large enterprises is above 1.0 GB/s. At such high bandwidth, packet capture must keep up with the traffic and avoid packet loss so that the captured data remains complete. Session restoration then reorganizes the packets of each session and reconstructs the session content. Protocol analysis is performed per session: the output of session restoration is analyzed according to the protocol characteristics of SMTP, FTP, HTTP and so on, and entity files are extracted from the protocol data that match those characteristics. Finally, file content extraction is conducted. File types are identified from file type characteristics and attributes, and the contents of an entity file are extracted through file parsing functions. The extracted contents are converted into a structured format that is convenient for handling. For commonly used mail formats, both message contents and attachments are extracted. All of the above is combined with the designed strategies to monitor sensitive data. The network monitoring process is shown in Figure 5.
[Figure 5 labels: Policy Configuration and Delivery; Parameter Configuration; Data Packet; Session Restore; Protocol Analysis; File Parsing; Strategy Extraction; Rule Matching; Event Generation; Event Processing; Event Reporting]
Figure 5. Process of Network Monitoring.
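The monitoring stages can be illustrated with a small Python sketch: session restoration from out-of-order segments, a rough protocol guess, content extraction, and rule matching. It assumes a single session whose segments arrive as `(sequence_number, payload)` pairs without gaps or overlaps; the function names, the protocol signatures, and the sample keywords are hypothetical simplifications rather than the product's implementation.

```python
def restore_session(segments):
    """Reassemble a byte stream from (sequence_number, payload) pairs.

    Retransmitted segments (same sequence number) are kept only once.
    """
    unique = {}
    for seq, payload in segments:
        unique.setdefault(seq, payload)
    return b"".join(unique[seq] for seq in sorted(unique))


def identify_protocol(stream):
    """Rough application-protocol guess from the first bytes sent by the client."""
    if stream.startswith((b"GET ", b"POST ", b"PUT ")):
        return "HTTP"
    if stream.startswith((b"EHLO", b"HELO")):
        return "SMTP"
    if stream.startswith(b"USER "):
        return "FTP"
    return "UNKNOWN"


def extract_http_body(stream):
    """Return the bytes after the blank line that ends the HTTP header block."""
    _, _, body = stream.partition(b"\r\n\r\n")
    return body


def match_rules(content, keywords):
    """Return the policy keywords found in the extracted content."""
    text = content.decode("utf-8", errors="replace").lower()
    return [kw for kw in keywords if kw in text]


# Segments arrive out of order, and the first one is retransmitted.
segments = [
    (1021, b"\r\n\r\nproject alpha bid price"),
    (1000, b"POST /upload HTTP/1.1"),
    (1000, b"POST /upload HTTP/1.1"),
]
stream = restore_session(segments)
protocol = identify_protocol(stream)
hits = match_rules(extract_http_body(stream), ["project alpha", "salary"])
```

A non-empty `hits` list would correspond to event generation in the monitoring process, after which the event is processed and reported.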
2) Network blocking
[Figure 7 labels: Policy Configuration and Delivery; Parameter Configuration; Data Packet; Session Restore; Protocol Analysis; File Parsing; Strategy Extraction; Rule Matching; Event Generation; Data Blocking (Yes); Data Release (No); Event Reporting]
Figure 7. Process of Network Blocking.
a) Intercept and cache data packets. Network monitoring takes its data from port mirroring, so packet loss caused by device failure does not affect the network. Network blocking, in contrast, must intercept and hold network data, which cannot be released until analysis is complete and the data is confirmed to be legitimate.
b) Add identification of underlying protocols. In addition to identifying application layer protocols, network blocking identifies transport layer and network layer protocols to ensure that data of different protocols is transmitted and forwarded correctly.
c) Add network session blocking control. In a real network environment, if any packet in a session is blocked, the session can no longer continue transmitting data. The blocking server then cannot restore the information being transmitted and therefore cannot determine whether the document is illegal. Network blocking must handle the timing of network packet transmission properly so that it intervenes in the session at the right moment without affecting normal network operation.
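The hold-then-decide behavior in item a) can be sketched as a store-and-forward gateway in Python. This is only one possible timing strategy and rests on a strong assumption: the gateway can accept a whole session on behalf of the destination before forwarding anything. The `BlockingGateway` class, its methods, and the keyword list are hypothetical.

```python
class BlockingGateway:
    """Caches each session and releases it only after the content is judged."""

    def __init__(self, keywords):
        self.keywords = [kw.lower() for kw in keywords]
        self.cache = {}      # session_id -> list of cached payloads
        self.released = []   # (session_id, payload) pairs forwarded onward
        self.events = []     # session_ids that were blocked

    def on_packet(self, session_id, payload, last=False):
        """Hold the packet; judge the whole session once it is complete."""
        self.cache.setdefault(session_id, []).append(payload)
        if last:
            self._judge(session_id)

    def _judge(self, session_id):
        payloads = self.cache.pop(session_id)
        content = b"".join(payloads).decode("utf-8", errors="replace").lower()
        if any(kw in content for kw in self.keywords):
            self.events.append(session_id)   # block: nothing is forwarded
        else:
            self.released.extend((session_id, p) for p in payloads)


gateway = BlockingGateway(["confidential"])
# The keyword is split across two packets, so only the restored session reveals it.
gateway.on_packet("s1", b"this is confi")
gateway.on_packet("s1", b"dential data", last=True)
gateway.on_packet("s2", b"lunch ")
gateway.on_packet("s2", b"menu", last=True)
```

The split keyword in session `s1` shows why a verdict needs the restored session rather than individual packets, which is the difficulty item c) describes.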
Data Loss Prevention Based on Host. Through behavior control and file filtering technology, the copying of shared sensitive data from a networked host to a USB device, an SD card, or a burned CD is monitored, and the operation is blocked in time according to the strategy. This keeps sensitive data from being exported in violation of policy, proliferated, or retained, stopping it at the source and giving explicit control over outgoing document permissions.
In host-based data loss prevention, hook procedures are installed on functions in the application layer and kernel layer of the operating system. Whenever a particular message, such as a copy or paste, is delivered, the hook program captures it first and gains control before the message reaches the user. In this manner, sensitive data being sent out or shared on the terminal is monitored and blocked. The host-based data loss prevention process is shown in Figure 8.
[Figure 8 labels: Policy Configuration and Delivery; Parameter Configuration; Task Distribution; Task Scheduling Management; Local Storage Scan; File Parsing; Policy Matching; Secure Disposal; File Feature Extraction; Event Collection; Event Reporting; Storage Peripheral Audit; Application Outsourcing Audit]
Figure 8. Process of Host-based Data Loss Prevention.
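The hook idea can be illustrated without any operating system API by a toy dispatcher in Python, in which installed hooks see each message before the default handler does and may veto it. The `MessageDispatcher` class, the message fields, and the `dlp_hook` rule are hypothetical; a real product would hook operating system functions instead.

```python
class MessageDispatcher:
    """Toy stand-in for a message path: hooks run before the default handler."""

    def __init__(self, default_handler):
        self.default_handler = default_handler
        self.hooks = []

    def install_hook(self, hook):
        self.hooks.append(hook)

    def dispatch(self, message):
        """Deliver the message unless some hook returns False."""
        for hook in self.hooks:
            if not hook(message):
                return False   # vetoed: the message never reaches the handler
        self.default_handler(message)
        return True


delivered = []   # messages that reached the default handler
audit_log = []   # (action, target) pairs recorded for blocked operations


def dlp_hook(message):
    """Block copies of sensitive content to removable media and log them."""
    sensitive = "core business secret" in message["content"].lower()
    if sensitive and message["target"] in {"usb", "sd_card", "cd"}:
        audit_log.append((message["action"], message["target"]))
        return False
    return True


dispatcher = MessageDispatcher(delivered.append)
dispatcher.install_hook(dlp_hook)
blocked = dispatcher.dispatch(
    {"action": "copy", "target": "usb", "content": "Core business secret: bid plan"}
)
allowed = dispatcher.dispatch(
    {"action": "copy", "target": "usb", "content": "lunch menu"}
)
```

Because the hook runs first, the sensitive copy is stopped and audited before the normal copy handler is ever called.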
Illegal Storage Discovery. By scanning stored files, the content of database storage can be audited, illegally stored data can be identified, and unprotected confidential data on servers can be discovered, reducing the spread of confidential data.
[Figure 9 labels: Policy Configuration and Delivery; Parameter Configuration; Scan Task Distribution; Task Scheduling Management; Server Scan; Scanned Content Analysis; Policy Matching; Secure Disposal; File Feature Extraction; Event Collection; Event Reporting]
Figure 9. Process of Illegal Storage Discovery.
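A storage scan of this kind can be sketched in Python with the standard library alone. The sketch assumes plain text files on a file server and a keyword-only policy; `scan_storage`, the file names, and the keywords are hypothetical, and a real scanner would also parse document formats and database content.

```python
import os
import tempfile


def scan_storage(root, keywords):
    """Walk a directory tree and report files that contain policy keywords."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                text = handle.read().lower()
            hits = [kw for kw in keywords if kw in text]
            if hits:
                findings.append((os.path.relpath(path, root), hits))
    return findings


# Build a small throwaway storage area to scan.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "notes.txt"), "w", encoding="utf-8") as f:
        f.write("Weekly lunch menu")
    with open(os.path.join(root, "bid.txt"), "w", encoding="utf-8") as f:
        f.write("Project Alpha bid price: confidential")
    findings = scan_storage(root, ["project alpha", "confidential"])
```

Each finding pairs a file path with the keywords it matched, which is the information an event report or a secure disposal step would need.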
Intelligent Event Analysis Based on Big Data
Using big data analysis technology combined with vector learning, hash matching and other advanced technologies and algorithms, the intelligent event analysis system conducts characteristic analysis, judgment, association, and comprehensive evaluation on the outgoing sensitive data gathered by the loss prevention system. The system can prompt users about important outgoing events with a higher sensitivity level, helping them judge incidents more efficiently and giving them a stronger basis for decisions on events. Figure 11 shows the process of event intelligence analysis based on big data.
[Figure 11 labels: Event Acquisition; Event Gathering and Processing; Association Event Mining; Intelligent Analysis of Data Characteristics; Summary of Sensitive Event Characteristics; Event Intelligence Analysis; Event Analysis Results; Suspected Business Secret Incident; Event Portrait; Feature Library]
Figure 11. Process of Event intelligence analysis based on big data.
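Hash matching between a source file and outgoing content can be illustrated with word shingles in Python. The paper does not specify its hashing scheme, so this sketch assumes one common approach: hash every run of three consecutive words and compare the two fingerprint sets with Jaccard similarity. The function names and sample texts are hypothetical.

```python
import hashlib


def shingle_hashes(text, size=3):
    """Hash every run of `size` consecutive words into a fingerprint set."""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in shingles}


def similarity(source_text, outgoing_text, size=3):
    """Jaccard similarity of the two fingerprint sets, between 0.0 and 1.0."""
    a = shingle_hashes(source_text, size)
    b = shingle_hashes(outgoing_text, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


source = "the bid price for project alpha is final"
outgoing = "fyi the bid price for project alpha"
score = similarity(source, outgoing)
```

Here four of the seven distinct shingles are shared, so the score is 4/7; a partial excerpt of a protected file still yields a clearly non-zero similarity.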
The main contents of event intelligence analysis include:
1) According to the features of the source file and the sensitive event, a comprehensive evaluation of the event is conducted to obtain its sensitivity level, seriousness, similarity to the source file, and other important values, which together form a scientific and effective comprehensive evaluation. Users can use the evaluation results as a reference for judgment, and when the number of events is too large, outgoing events with a higher risk level can be picked out first, improving the efficiency of event identification;
2) Based on the data content, event type, hit policy and other attribute information, the characteristics of sensitive events are analyzed comprehensively. Event retrieval and association analysis are conducted based on this characteristic analysis and the sensitive events. The association between sensitive events is calculated to discover potential sensitive data threats, so that potential sensitive events can be warned of beforehand;
3) By scanning terminal file identification information, the statistical distribution of sensitive events is calculated. Information from the egress and internal network boundaries and from terminal reports is analyzed, and the flow path of sensitive files is tracked [4]. The overall situation of sensitive events can be monitored in real time, so that control measures can be applied promptly.
As big data and cloud computing capabilities mature, a data loss prevention system based on big data can check content more quickly, build and optimize strategies more accurately, and manage violations more effectively. The system helps users better understand where data is stored and how it is used, so that they can control where the data can go. By improving data recognition, learning, analysis, decision making, and intelligent execution, the intelligence and automation level of the data loss prevention system is enhanced. With improved information security and reduced labor cost, the overall data prevention effect will be prominent.
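The comprehensive evaluation and prioritization in item 1) can be sketched as a weighted score in Python. The features, the 0 to 2 grade scale, the cap of ten hits, and the weights are illustrative assumptions only; the paper does not give its scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    grade: int          # assumed sensitivity grade of the matched data, 0 to 2
    hit_count: int      # number of policy hits in the outgoing content
    similarity: float   # similarity to a known source file, 0.0 to 1.0


def risk_score(event, weights=(0.5, 0.2, 0.3)):
    """Weighted sum of normalized features; the weights are illustrative."""
    w_grade, w_hits, w_sim = weights
    grade_part = event.grade / 2
    hits_part = min(event.hit_count, 10) / 10
    return w_grade * grade_part + w_hits * hits_part + w_sim * event.similarity


def prioritize(events, top_k):
    """Return the top_k events with the highest risk score."""
    return sorted(events, key=risk_score, reverse=True)[:top_k]


events = [
    Event("e3", grade=0, hit_count=1, similarity=0.0),
    Event("e1", grade=2, hit_count=5, similarity=0.9),
    Event("e2", grade=1, hit_count=10, similarity=0.2),
]
top = prioritize(events, top_k=2)
```

When the event volume is large, reviewing only the top-ranked events first is one way to realize the efficiency gain described in item 1).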
Conclusions
[...] assets and competitiveness of enterprises. The intelligent DLP system based on big data combines advanced technologies such as big data analysis and artificial intelligence, and can improve the effect of data loss prevention monitoring and the efficiency of event checking. On the premise of a clear prevention scope and an explicit relationship of rights and responsibilities among organizations and departments, and taking the management system as the restraint, the standard process as the norm, and the technical platform as the support, a comprehensive data loss prevention system covering the management, checking, and monitoring of data will be constructed to provide effective prevention and management of enterprise sensitive data throughout its lifecycle.
Acknowledgement
This research was financially supported by the China National Petroleum Corporation.
References
[1] Standing Committee of the National People's Congress. Network Security Law of the People's Republic of China; 2017.6.
[2] Chen K, Liu L. Privacy preserving data classification with rotation perturbation. In: Fifth IEEE International Conference on Data Mining; 2005. p. 4. doi:10.1109/ICDM.2005.121.
[3] Alneyadi S, Sithirasenan E, Muthukkumarasamy V. A survey on data leakage prevention systems. Journal of Network and Computer Applications, Volume 62, February 2016, Pages 137-15.