INTERSET
The Foundations of Big Data Behavioral
Analytics
By Stephan Jou, CTO, Interset
7/15/2014
Introduction
There is no longer any question about the realities of big data. Big data is here to stay: we have never before had as much access to as much data. In the context of Information Security, we have endpoint logs, activity data, traffic and machine logs, HR records, reputation services, etc.
Similarly, we do not need to be sold on the value of analytics. What better weapon to make sense of all the big data that we are collecting and to which we have access? We see the success of analytics in the consumer
marketing space, marveling at the ability of Amazon’s recommendation engines and LinkedIn’s data science teams to turn large amounts of user behavioral data into actionable and profitable insight. Surely, the same approaches apply to the challenges we face in Information Security.
Indeed, the application of big data analytics in security can be incredibly effective when done in a principled manner. We have an opportunity to define an approach that can form the basis for an advanced and effective behavioral threat platform for Information Security.
Big data for ubiquitous context
VOLUME, VELOCITY AND VARIETY
The phrase “big data” can be frustratingly elusive, but the popular “four V” categorization can be useful to remind us of all the data available to us. Many of us by now have heard of the first three Vs: Volume, Velocity and Variety. Volume reminds us of all the audit data that we are now able to store. It is no longer impossible to store audit records that represent the critical behaviors of all your employees, across all your departments, for all time. Indeed, most of us now archive terabytes of this information routinely.
Velocity reminds us of the activity and transactional information that streams into our systems. This is point-in-time and ephemeral information that comes in from networks, proxies and sensors. We do our best to try to detect important events within a window of increasingly fast and real time streams of data.
Variety reminds us that data available to us no longer must fit into rectangular-shaped tables, neat rows and columns. Unstructured information in documents, emails, IM chats, and even video and audio can also contain valuable clues that we cannot afford to ignore.
THE FOURTH V: VERACITY
But what is the fourth V, veracity? A word meaning “trustworthiness,” veracity is the newest recognized dimension of big data and potentially the most important for Information Security. Veracity reminds us that different data sources carry different levels of risk and trust: a company’s HR database, for example, is relatively high trust, while what employees are saying on Twitter is lower trust. If the level of trust can be quantified and embraced, however, then we can join low-trust data sources with high-trust data sources rather than ignore the value embedded in all the available data. Accounting for veracity allows us to see the employee who has received a bad performance rating in the HR system and is now damaging the company’s reputation through posts on Twitter.
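To make the idea concrete, here is a minimal sketch of one way trust could be quantified and embraced. The source names, trust weights and indicator scores are illustrative assumptions of mine, not values from this paper or any particular product:

```python
# Minimal sketch: weight behavioral indicators by the trust (veracity) of their
# source before combining them. Source names, weights and scores are illustrative
# assumptions, not a description of any particular product's model.

SOURCE_TRUST = {
    "hr_database": 0.9,   # authoritative internal system
    "proxy_logs": 0.7,    # machine-generated, occasionally noisy
    "twitter": 0.3,       # public and unverified, but not worthless
}

def combined_risk(indicators):
    """indicators: list of (source, score) pairs, each score in [0, 1]."""
    if not indicators:
        return 0.0
    weighted = [SOURCE_TRUST.get(source, 0.1) * score for source, score in indicators]
    # A simple trust-weighted average; a production system would use a
    # calibrated probabilistic combination instead.
    return sum(weighted) / len(weighted)

# A bad HR rating plus hostile public posts: the Twitter signal is down-weighted
# but still contributes instead of being discarded.
print(combined_risk([("hr_database", 0.8), ("twitter", 0.9)]))
```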
The four Vs are useful because they remind us that we have more data than ever before. The ability to combine them allows us to build a complete and comprehensive data plane of the ecosystem that must be protected. By aiming to correlate across all available data, we come closer to being able to build comprehensive coverage and detect complex threats that leave an evidence trail spanning multiple data sources.
THE FIGHT AGAINST FALSE POSITIVES
The ubiquity and coverage of our data plane can also form the basis of a strategy against false positives.
Suppose John Sneakypants accessed an unusually large volume of files on a network share. This may represent a threat, but it may also be a false positive: perhaps John just changed roles and is accessing those files for valid reasons.
But suppose John Sneakypants also did this at a time of day when he is never active at his computer, and from a login location he has never been seen at before, and then copied those files to a USB key, packed them into an archive file, and renamed the extension to “.mp3.” And so on…
As more anomalous events co-occur across more datasets, John’s activity looks less and less like a false positive, and John Sneakypants increasingly represents a true person of interest demanding investigation. Big data, and the availability of more data and context than ever before, gives us the ability to corroborate and triangulate analytics to distinguish the true threats from the false positives.
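A back-of-the-envelope sketch shows why corroboration works. Assuming, for illustration only, independent per-source estimates of how likely each observation is to occur innocently (the probabilities below are made up), the chance that all of them co-occur by coincidence collapses quickly:

```python
# Minimal sketch: why anomalies corroborated across data sources are unlikely
# to be false positives. Each probability is a hypothetical estimate of how
# often this behavior occurs innocently for this user, from a different source.

from math import prod

p_innocent = {
    "unusual_file_volume": 0.05,    # file-share audit logs
    "never_seen_login_hour": 0.10,  # authentication logs
    "new_login_location": 0.08,     # VPN / network logs
    "usb_copy_and_rename": 0.02,    # endpoint logs
}

# Under a simplifying independence assumption, the chance that ALL of these
# happen together for innocent reasons is the product of the individual terms.
joint = prod(p_innocent.values())
print(f"Chance this is all coincidence: {joint:.0e}")  # 8e-06
```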
Analytics and data science
If big data represents the crude oil to power our energy needs, analytics is the refining process that turns crude oil into usable fuel. Analytics must, however, offer more than bar charts and naive statistics.
Amassing a comprehensive and large data plane is, therefore, only half the battle: we also need to apply principled mathematics to help us turn the data into actionable insights.
DATA SCIENCE
“Data science” is a term that is perhaps as guilty of being ill-defined as “big data” and “analytics.” One useful way to understand data science comes from Drew Conway’s Venn diagram, illustrated below. The diagram makes the point that effective data science requires the intersection of three disciplines: hacking skills (not in the “Black Hat” sense, but in terms of manipulating data), substantive domain expertise, and math and statistics.
Figure 1: Drew Conway's Venn diagram of data science, 2010, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
Hacking skills refers to the effective use of computer technology to deal with the large and messy data of the real world – the four Vs of big data. If we hope to put our data science into production, this means that our data science must be done with full awareness of how calculations can be run at scale, on high volume and high velocity. Fortunately, the big data technologies and community are mature and robust, and the challenges of big data are reasonably well understood and learnable.
Substantive expertise means that domain knowledge of Information Security is critical for a proper understanding and interpretation of the data. Again, the Information Security space is rich with excellent research and resources. Research groups within universities like Carnegie Mellon and companies such as Intel are publishing great research on behavioral risk indicators and threat patterns.
The third category, math and statistics knowledge, represents, in my opinion, a great opportunity for Information Security. We have a few examples of applied mathematics, statistics and machine learning, but we need more. Our industry has tremendous hacking skills and substantive expertise, but with a stronger investment in math and statistics knowledge, we can avoid the “danger zone” pointed out by Conway’s Venn diagram: without proper understanding of the underlying principles, we run the risk of building analytics that do not work.
FEATURE ENGINEERING AND MODELING
The actual work of data science can be divided into three inter-related phases: data handling, feature engineering, and modeling. Data handling refers to acquiring, cleaning and preparing the data, and is reasonably well understood. Modeling is the area that gets all the press, and is where you find buzzwords like Bayesian methods, deep learning, neural networks and machine learning.
Feature engineering, however, is the important glue that connects the data to the math and makes it work. Feature engineering refers to the process of examining the input columns available in the data, understanding them, and sometimes enhancing or modifying them to increase their predictive power. Data scientists spend a lot more time in feature engineering than in model selection, and for good reason: the best model in the world cannot be effective without good input columns.
The importance of feature engineering is particularly striking in the context of big data. Trying to treat all data sources generically from an analytics perspective is hard: it is much more effective to squeeze every ounce of insight for each different data source type with the right features. If you ignore differences in specializations between data sources, you will miss insights.
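As an illustration, here is a minimal sketch of data-source-specific feature engineering for an endpoint file event. The field names, thresholds and derived features are hypothetical examples of mine, not the features any particular product uses:

```python
# Minimal sketch of feature engineering: turning a raw endpoint file event
# (hypothetical field names) into derived columns with more predictive power
# than the raw data alone.

from datetime import datetime

def engineer_features(event):
    ts = datetime.fromisoformat(event["timestamp"])
    return {
        "hour_of_day": ts.hour,                    # enables time-of-day baselines
        "is_weekend": ts.weekday() >= 5,
        "megabytes": event["bytes"] / 1_048_576,   # scale raw byte counts
        "to_removable_media": event["target"] == "usb",
        # An archive saved with a ".mp3" extension is far more suspicious than
        # either the archive or the extension alone.
        "extension_mismatch": event["file_name"].lower().endswith(".mp3")
                              and event["content_type"] == "archive",
    }

print(engineer_features({
    "timestamp": "2014-07-12T02:14:00",
    "file_name": "payroll.mp3",
    "content_type": "archive",
    "bytes": 52_428_800,
    "target": "usb",
}))
```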
PROBABILISTIC APPROACHES AND THE IMPORTANCE OF A NUMBER
One key math and statistics principle that is useful for threat and anomaly detection is to move away from simple rule- or boolean threshold-based approaches, in favor of probabilistic approaches. Using a rule or boolean style “alert” means that every event is classified as either good or bad. This is a notoriously hard approach to scale and maintain, resulting in thousands of daily alerts in a large enterprise with no ability to prioritize and take action. It is critical to algorithmically consolidate the noisy flood of alerts into a small, manageable and prioritized set of information. Doing this on a set of information-poor, boolean alerts where every event is labeled either “good” or “bad”, and the “good” events are thrown away, is hard.
A probabilistic approach can be much more effective. By building probabilistic models that quantify how bad, suspicious or abnormal an event is, we can keep all events and their associated scores for our consolidation and correlation. This allows us both to assess the overall risk posture of any entity inside our system more accurately and to detect “low and slow” threats, because low-probability events that a threshold-based approach would discard are no longer ignored.
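The contrast can be sketched in a few lines. The toy per-user baseline below (a Gaussian over daily file-access counts, with made-up numbers) is an illustrative assumption; the point is simply that every event receives a score rather than a keep-or-discard label:

```python
# Minimal sketch: a boolean threshold versus a probabilistic score.
# The per-user Gaussian baseline and the counts are illustrative assumptions.

from math import erf, sqrt
from statistics import mean, stdev

baseline = [12, 9, 15, 11, 14, 10, 13, 12]   # hypothetical daily file-access counts
mu, sigma = mean(baseline), stdev(baseline)  # 12.0 and 2.0 for this toy baseline

def boolean_alert(count, threshold=100):
    # Everything below the threshold is thrown away.
    return count > threshold

def anomaly_score(count):
    # Upper-tail probability under the user's own baseline: smaller = more unusual.
    z = (count - mu) / sigma
    return 0.5 * (1 - erf(z / sqrt(2)))

for count in (14, 18, 25, 300):
    print(count, boolean_alert(count), f"{anomaly_score(count):.2g}")
# 25 files never trips the boolean rule, yet its tiny tail probability (~4e-11)
# marks it as extremely unusual: exactly the "low and slow" signal a threshold loses.
```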
DAVENPORT MATURITY MODEL AND THE CURRENT STATE OF THE NATION
Thomas Davenport suggested a maturity model for analytics, implying a sequential order of increasingly sophisticated and mature techniques. The current state of readily available analytics in Information Security software can then be viewed as quite nascent compared to other industries.
Figure 2: Davenport, T.H., & Harris, J.G. (2007). Competing on Analytics: The New Science of Winning (p. 8). Boston: Harvard Business School Press.
The vast majority of information security software performs analytics by doing standard, ad-hoc and query-based reporting and alerts – classical business intelligence. Only a small number of security vendors have started looking at the application of forecasting, simulation and predictive modeling to behavioral threat detection.
The exciting interpretation is that there is a lineup of powerful analytics, proven effective in other industries, that we have yet to take advantage of in Information Security.
Big data analytics in real life
Finally, to complete our advanced behavioral threat platform, we need to account for the realities of our environment, and specifically handle scalability and consumability.
SCALABILITY
Scalability in this context means that the technology must run in production environments, on production-scale data. As mentioned earlier, there have been fantastic advances in big data technology in recent years, but challenges remain. Not all mathematics is map-reducible, not all models are computable in real time, and no amount of open source will eliminate the need for clever mathematics and well-designed architecture. Linear scalability and predictable infrastructure cost are critical for any successful platform in the real world.
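To illustrate what “map-reducible” means in practice, here is a minimal sketch (with made-up partitions) of a statistic that distributes cleanly because it needs only small, mergeable per-partition summaries:

```python
# Minimal sketch of a map-reducible statistic: mean and variance can be computed
# from small per-partition summaries (count, sum, sum of squares) that merge by
# simple addition. An exact median has no such compact summary, which is one
# reason not all mathematics distributes so easily.

def summarize(partition):
    # "Map": each worker reduces its partition to three numbers.
    return (len(partition), sum(partition), sum(x * x for x in partition))

def merge(a, b):
    # "Reduce": summaries combine by element-wise addition.
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(summary):
    n, s, ss = summary
    m = s / n
    return m, ss / n - m * m

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]   # toy data split across workers
total = (0, 0.0, 0.0)
for p in partitions:
    total = merge(total, summarize(p))
print(mean_and_variance(total))   # (3.5, 2.916...)
```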
There is also a non-technology component to scalability: a solution must remain usable by human operators when deployed in a corporation with hundreds of thousands of employees. The user experience must scale. Lists of a hundred thousand names or charts with a million bars are impossible to take action on, even if they could be rendered in a reasonable amount of time.
CONSUMABILITY
In the real world, the most sophisticated analytics have no value if they cannot be understood. Analytics must be consumable in order to be actionable. Well-designed visualizations that abstract complexity and provide evidence, reports that proactively summarize and deliver the right information at the right time, and explanations in plain language all become increasingly important, particularly as the mathematics becomes more sophisticated. Human comprehension and consumability will always remain important, for everyone from the NOC operator taking action and the security professional investigating an event to the executive stakeholder assessing the organization’s overall risk posture.
Platform for Advanced Behavioral Threat Detection
Putting all the components together produces an effective and powerful platform for advanced behavioral threat detection.
We start by connecting to all the data available to us, taking advantage of big data in all four senses to build a comprehensive data plane. We apply data science to this substrate, doing data-source-specific feature engineering and building probabilistic models and increasingly powerful analytics, to extract higher-level behavioral patterns and to quantify, aggregate, correlate and corroborate them, identifying the true threats and avoiding the false positives. We deploy at scale, both for the technical infrastructure and for the human audience, providing consumable, and therefore actionable, intelligence.
The Information Security segment is ripe for, and in a very real sense demands, big data analytics. It is real, it works, and it can help solve some really hard challenges in our space. The main question remaining is: what are we waiting for?
About Interset
Interset provides a highly intelligent and accurate insider and targeted outsider threat detection solution that unlocks the power of behavioral analytics, machine learning and big data to provide the fastest, most flexible and affordable way for IT teams of all sizes to operationalize a data protection program. Utilizing lightweight, agentless data collectors, advanced behavioral analytics and an intuitive user interface, Interset provides unparalleled visibility over sensitive data, enabling early attack detection and actionable forensic
intelligence without false positives or white noise. Interset solutions are deployed to protect critical data across the manufacturing, life sciences, hi-tech, finance, government, aerospace & defense and securities brokerage industries.
www.interset.com
© 2014 Interset & FileTrek, Inc. All Rights Reserved. Interset, FileTrek and the FileTrek logo are trademarks of FileTrek, Inc. All other logos are the property of their respective owners. The content of this document is subject to change without notice.
16 Fitzgerald Road, Suite 150 Ottawa, ON K2H 8R6
Canada Phone: (613) 226-9445