Key requirements for detecting data breaches and addressing compliance.
Architecture: In-depth look at the architecture throughout the Hadoop stack.
Plan: Best practices to consider when building out your data security plan.
Implement: Building blocks for implementing effective data monitoring.
Operationalize: Operationalize your processes with extra emphasis given to handling security breaches and forensic investigations.
toward enterprise‑readiness for Hadoop
Hadoop is delivering insights for many organizations that are using it. However, the security risks remain high. Although some Hadoop distributions do support various security and authentication solutions, there has not been a comprehensive data activity monitoring solution for Hadoop until now. Considering that even robust and mature enterprise relational database systems are often the target of attacks, the relative lack of controls around Hadoop makes it an attractive target, especially as more sensitive and valuable data from a wide variety of sources moves into the Hadoop cluster.
Organizations that tackle this issue head-on, sooner rather than later, position themselves to expand their use of Hadoop for enhanced business value. They can proceed with the confidence that they can address regulatory requirements and detect breaches quickly, thus reducing overall business risk for the Hadoop project. Ideally, organizations should be able to integrate big data applications and analysis into an existing data security infrastructure, rather than relying on homegrown scripts and monitors, which can be labor-intensive, error-prone and subject to misuse.
With IBM® InfoSphere® Guardium® data security solutions, much of the heavy lifting is taken care of for you. You define security policies that specify what data needs to be retained and how to react to policy violations. Data events are written directly to a hardened appliance, leaving no opportunity for even privileged users to access that data and hide their tracks. Out-of-the-box reports and policies get you up and running quickly, and those reports and policies are easily customized to align with your audit requirements.
Comprehensive monitoring for Hadoop
InfoSphere Guardium helps you make sense of what’s going on by actively monitoring activity throughout the Cloudera or IBM InfoSphere BigInsights Hadoop stack (see Figure 1), including Hue/Beeswax or BigInsights Web Console, MapReduce, Hive, HBase and HDFS.
Not only does this comprehensive monitoring help with data protection, it can also help you find and react to breaches or unauthorized access more quickly by making it easier to see what is happening. Even though much of the activity in Hadoop breaks down to MapReduce and HDFS, at that level, you may not be able to tell what a user higher up in the stack was really trying to do, or even who the user was. It is similar to showing disk segment I/O operations instead of an audit trail of a database.
Figure 1. InfoSphere Guardium can capture activity as it flows through the Hadoop stack
Figure 1 labels: User interface. Who submitted the job/query? What jobs? What queries? Is this an authorized job? Permission exceptions? What files were accessed?
Monitoring at different levels makes it more likely that you will understand the activity, and it also lets you audit activity that enters directly through lower points in the stack.
For example, the Hue/Beeswax report included with InfoSphere Guardium will show you the actual Hive queries that were run, as shown in Figure 2. A report in the same time period for HDFS would show you that activity at a file-system level.
Figure 2. See commands, users, exceptions and more.
Figure 2 columns: Type, Server IP, Hive Parsed SQL, HiveUser, HiveCommand, HiveDatabase, Hive TableName, Hive Error. The sample rows (Type HADOOP, Server IP 10.70.146.211) show Hive statements such as SELECT * FROM “DavidTest”, DROP TABLE “sample_07”, SELECT * FROM “sample_08” and CREATE EXTERNAL TABLE demo22 (a int, b int, c int), issued by the users david and cloudera against the default database with the Hive commands get_table and create_table, plus a NoSuchObjectException (default.JoeD2222 table not found).
Architecture of the solution
As shown in Figure 3, InfoSphere Guardium continuously monitors data activity using lightweight software probes called S-TAPs without relying on logs. The S-TAPs also do not require any changes to the Hadoop servers or applications.
Because privileged users can delete or modify logs, InfoSphere Guardium helps ensure separation of duties by immediately intercepting and forwarding data activity to a separate hardened appliance, known as a Collector. There, the activity messages are compared to previously defined policies to detect violations that could, for example, generate an alert in real time. The relevant activity is stored in the Guardium repository from which you can also do forensic analysis and schedule regular audit reports.
The InfoSphere Guardium S-TAP was originally designed for performance with low overhead; after all, the S-TAP is also used to monitor production database environments.
Figure 3. Architecture enforces separation of duties
Figure 3 labels: Cluster Clients; MapReduce jobs; HDFS and HBase commands; InfoSphere Guardium S-TAP; InfoSphere Guardium Collector; InfoSphere Guardium reporting and alerting.
Make a plan
Data activity monitoring for Hadoop is newer than Hadoop itself, but with InfoSphere Guardium, a wide variety of enterprise data sources can be monitored using the same scalable environment. If you are already monitoring relational databases, the planning concepts will be similar, even if the specifics are different.
Here are some questions to aid in planning a monitoring and auditing solution for Hadoop:
• Who needs to be involved?
• Where is the monitoring software installed? Where should the appliances be located?
• How should the deployment be rolled out?
• What are the business requirements for data monitoring?
Who needs to be involved?
Where is software installed? Where should appliances be located?
InfoSphere Guardium consists of software components that sit on the Hadoop cluster servers (the S-TAPs and the optional installation manager agents) and separate hardware or software appliances. The appliances can be fully configured software solutions delivered on physical appliances provided by IBM or software images that you deploy on your own hardware.
InfoSphere Guardium scalable architecture
The InfoSphere Guardium distributed architecture is built to scale, from small to very large, using a graduated system of collectors and aggregators, as well as the ability to perform load balancing (see Figure 4).
Table 1. Team members for an InfoSphere Guardium deployment (primary and contributing team members)
Business Analyst: Collects and documents business requirements for auditing, monitoring and logging.
Data Monitoring: Responsible for defining reports, policies and audit processes. To properly observe segregation of duties requirements, members of this team should not have privileges to install policies or modify the contents of groups that are defined for use in Guardium policies and reports, such as authorized users, privileged users or sensitive data.
Project Manager: Manages product implementations and upgrades.
Network Engineer: Assigns IP addresses to the InfoSphere Guardium appliance, and ensures connectivity through network infrastructure including firewalls.
Storage and: Ensure that retention period policies are in compliance, and proper operational procedures are in place.
Security Escalation: Performs/activates forensic analysis if a data security breach is reported.
Security Team: Produces standards for monitoring; stays up-to-date on industry data security requirements and government regulations.
Technology Group: Evaluates, tests and certifies new software releases and patches; produces technical documentation.
Application Managers: Keep InfoSphere Guardium application administrator informed of non-BAU activity and implementation of new modules that may impact data collection.
Hadoop Administrator: Keeps InfoSphere Guardium application administrator informed of changes in platform environment, such as upgrades of OS and introductions of new servers.
System Administrator: Typically installs software on operating systems. They would install
A Collector collects data activity, analyzes it in real time, and logs it in the internal repository for further analysis and for real-time reaction (alerting). Depending on how much audit data you collect (which is determined by your business requirements for auditing), you may need multiple Collectors, which should be co-located in the same data center as the Hadoop cluster.
The Aggregator is used to collect and merge information from multiple appliances (collectors and other aggregators) to produce a holistic view of the entire environment and generate enterprise-level reports. The Aggregator does not collect data itself; it just aggregates data from multiple sources. A single Aggregator can support up to ten Collectors. The Aggregator can be located anywhere, but requires network connectivity to the Collector units.
Figure 4. Scalable, distributed architecture
Figure 4 labels: Policies, groups and users are pushed down from the Central Manager. Definitions are pushed up from Collectors and Aggregators to the Central Manager. Nightly audit data uploads from Collectors. Components shown: Central Manager, Aggregators, Collectors.
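As a rough sizing aid for the Collector and Aggregator tiers described above, the following minimal sketch applies the one stated ratio (a single Aggregator supports up to ten Collectors). The function name is hypothetical, and the number of Collectors you need still depends on your audit volume, which this sketch does not estimate.

```python
import math

# From the text: a single Aggregator can support up to ten Collectors.
COLLECTORS_PER_AGGREGATOR = 10


def aggregators_needed(collector_count: int) -> int:
    """Return the minimum number of Aggregators for a given Collector count.

    Hypothetical helper for planning only; it ignores load, geography and
    any nesting of Aggregators under other Aggregators.
    """
    if collector_count < 0:
        raise ValueError("collector_count must be non-negative")
    return math.ceil(collector_count / COLLECTORS_PER_AGGREGATOR)


if __name__ == "__main__":
    for n in (1, 10, 11, 25):
        print(f"{n} collector(s) -> {aggregators_needed(n)} aggregator(s)")
```

The ceiling division matters at the boundary: ten Collectors fit under one Aggregator, while an eleventh requires a second one.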
The Central Manager is used to manage the entire InfoSphere Guardium deployment (all the collectors and aggregators) from a single console, including patch installation, software updates, and the management
InfoSphere Guardium S‑TAPs reside on Hadoop servers
Think of the S-TAP as the listener for data activity; one is installed on each Hadoop server that requires monitoring (see Figure 5). Each S-TAP must be configured with one or more inspection engines. This is how you tell InfoSphere Guardium S-TAP which ports to monitor. For example, if you have the HDFS NameNode and Hive master on the same machine, you would need one S-TAP configured with two inspection engines.
To configure the inspection engines, you will need to work with the network or Hadoop administrator to get a list of the ports, such as the JobTracker ports and HBase master.
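To make the relationship between servers, S-TAPs and inspection engines concrete, here is a minimal planning sketch. It is not Guardium configuration syntax: the class names and host name are hypothetical, and the port numbers are commonly seen defaults that you would need to confirm with your Hadoop or network administrator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InspectionEngine:
    """One monitored service port on a server (illustrative model only)."""
    service: str
    port: int


@dataclass
class STap:
    """One S-TAP per monitored Hadoop server, with one or more engines."""
    host: str
    engines: List[InspectionEngine] = field(default_factory=list)

    def monitored_ports(self) -> List[int]:
        return sorted(engine.port for engine in self.engines)


# Example from the text: NameNode and Hive master on the same machine means
# one S-TAP with two inspection engines.
# ASSUMPTION: host name and ports (8020, 10000) are placeholders to verify.
master = STap(
    host="hadoop-master-01",
    engines=[
        InspectionEngine("HDFS NameNode", 8020),
        InspectionEngine("Hive", 10000),
    ],
)

if __name__ == "__main__":
    print(master.host, "monitors ports", master.monitored_ports())
```

Keeping a list like this per server gives you a checklist to review with the administrators before you configure the real inspection engines.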
IBM InfoSphere Guardium provides a centralized solution for installing and updating multiple S-TAPs using the InfoSphere Guardium Installation Manager (GIM). GIM sits on a Central Manager and provides a user interface that makes S-TAP management, including applying software maintenance, simpler and more automated. This requires installing an InfoSphere Guardium Installation Manager S-TAP agent on each server, which you can do during any maintenance window; you then use GIM to install the S-TAPs.
Figure 5. Hadoop servers with Guardium S-TAPs
Figure 5 labels: Clients; HiveServer; JobTracker; NameNode; SecondaryNN; HBase Master; NoSQL DB (HBase); distributed query processing; distributed data processing (Map/Reduce); distributed data storage (HDFS); four worker nodes, each with Data Node, Task Tracker and HBase Region; optional S-TAP required only for monitoring HBase Put commands.
How should the deployment be rolled out?
As with any significant IT infrastructure enhancement, it’s a good idea to do a proof-of-concept in a sandbox environment. Not only will this help you validate the auditing solution, it will give you the opportunity to see for yourself how data activity is stored. It may also help you identify processes and procedures you need to put in place to make sure the production deployment will go smoothly, and to help support automation procedures. For example, it is possible to automatically update privileged users groups or sensitive data objects in the InfoSphere Guardium system on a scheduled basis.
For a production deployment, IBM services can help you create a project plan that will include education, planning, installation and configuration.
What are the business requirements for data monitoring?
Although InfoSphere Guardium provides a comprehensive data monitoring solution, in reality you don’t need to monitor everything. For example, Hadoop has a “chatty” protocol, so InfoSphere Guardium includes a built-in policy with rules to filter out some of the internal messages the system uses for health checks. Over time, you can add rules to ensure that you are capturing activity that is required for audit.
There are different levels of auditing to consider:
• Privileged user audit applies only to specific users or groups of users, and everything else is filtered out before even being sent to the InfoSphere Guardium appliance.
• Selective auditing means that only a subset of data activity is logged. However, in this case, everything is sent to the InfoSphere Guardium appliance, where it is determined whether the information is relevant and should be maintained.
• Comprehensive auditing means that everything is audited and logged.
If you are already using database activity monitoring for audit and compliance, someone with Hadoop expertise may be able to map between the requirements on databases and those for Hadoop. For example, permission exceptions in Hadoop are file system permission errors rather than database authorization errors.
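The three auditing levels listed above differ in where filtering happens: before activity leaves the server, on the appliance, or not at all. The sketch below models that difference only; the event fields, user names and the relevance rule are hypothetical, and it is not how the product is implemented.

```python
from enum import Enum
from typing import Callable, Dict, FrozenSet, List, Tuple

Event = Dict[str, str]


class AuditLevel(Enum):
    PRIVILEGED_USER = "privileged_user"
    SELECTIVE = "selective"
    COMPREHENSIVE = "comprehensive"


def audit(
    events: List[Event],
    level: AuditLevel,
    privileged_users: FrozenSet[str] = frozenset(),
    is_relevant: Callable[[Event], bool] = lambda event: True,
) -> Tuple[List[Event], List[Event]]:
    """Return (events sent to the appliance, events logged on the appliance)."""
    if level is AuditLevel.PRIVILEGED_USER:
        # Everything else is filtered out before it is even sent.
        sent = [e for e in events if e["user"] in privileged_users]
        logged = list(sent)
    elif level is AuditLevel.SELECTIVE:
        # Everything is sent; the appliance decides what to keep.
        sent = list(events)
        logged = [e for e in sent if is_relevant(e)]
    else:
        # Comprehensive: everything is sent and logged.
        sent = list(events)
        logged = list(sent)
    return sent, logged


# Hypothetical sample traffic.
EVENTS = [
    {"user": "hdfs", "command": "delete"},
    {"user": "david", "command": "getFileInfo"},
    {"user": "david", "command": "open"},
]

if __name__ == "__main__":
    for level in AuditLevel:
        sent, logged = audit(
            EVENTS,
            level,
            privileged_users=frozenset({"hdfs"}),
            is_relevant=lambda e: e["command"] != "getFileInfo",
        )
        print(level.value, "sent:", len(sent), "logged:", len(logged))
```

The practical trade-off is network and appliance load versus flexibility: filtering at the source sends the least data, while selective auditing lets you change what is retained without touching the servers.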
After you get the appliances and S-TAPs installed and connected on the network, all the planning work you did around business requirements will be beneficial when implementing monitoring.
You will start with the basic building blocks of creating groups and build upon that as follows:
1. Define and populate groups. This includes groups of users, sensitive data objects, applications, server IPs and client IPs.
2. Define a security policy.
3. Customize out-of-the-box auditing reports, or create your own.
Create and populate groups to simplify management and maintenance
Groups are central to simplifying management and control of the auditing environment. By classifying users, applications, servers, data objects and more into groups, you can fully take advantage of the flexibility and power of the InfoSphere Guardium system, while also keeping it manageable. Think about some of the following groups:
• Privileged users (administrators)
• Sensitive objects (files or HBase tables)
• Applications
• Server IPs (this will help with managing traffic coming from multiple IPs)
• Client IPs to help you manage and track back suspicious activity
• Commands (are there certain commands you want to capture and/or filter out?)
For example, Figure 6 shows how to create a group that contains the authorized programs MapReduce and sortlines. The new group is named “Hadoop Authorized Job List.” Use the Guardium Group Builder to populate the group with members.
Figure 6. Creating a group of authorized programs
Figure 7 shows a partial report from Cloudera (CDH4) that includes a query to show activity from any application that is NOT in the authorized job list group. The program PiEstimator has not been added to the authorized job list, and you can see its activity in this report.
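The logic behind such a report is a simple "not in group" filter. The following sketch illustrates it with the group members and the PiEstimator job named in the text; the record layout and the second sample row are hypothetical, and the real report is built with the Guardium query builder rather than code like this.

```python
from typing import Dict, List, Set

# Members of the "Hadoop Authorized Job List" group from the text.
AUTHORIZED_JOBS: Set[str] = {"MapReduce", "sortlines"}

# Observed MapReduce activity. The first row mirrors the report extract;
# the second row is a hypothetical authorized job added for contrast.
ACTIVITY: List[Dict[str, str]] = [
    {"user": "svoruga", "job_name": "PiEstimator",
     "job_id": "job_201209042356_0007"},
    {"user": "svoruga", "job_name": "sortlines",
     "job_id": "job_201209042356_0008"},
]


def unauthorized_jobs(activity: List[Dict[str, str]],
                      authorized: Set[str]) -> List[Dict[str, str]]:
    """Return activity rows whose job name is NOT in the authorized group."""
    return [row for row in activity if row["job_name"] not in authorized]


if __name__ == "__main__":
    for row in unauthorized_jobs(ACTIVITY, AUTHORIZED_JOBS):
        print(row["user"], row["job_name"], row["job_id"])
```

Because the report is driven by group membership, authorizing a new job later is a matter of adding it to the group, not editing the report.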
There are several options for creating groups. You will probably use several approaches to create and automate the update of these groups, including:
• Manual entry by working with application owners to identify sensitive data objects for specific environments
• An API to script the creation of groups from your own input
• Populate from a query using observed traffic from InfoSphere Guardium
• LDAP/Active Directory integration to
The automation process can be scheduled to run on a periodic basis to pick up any new changes in the Hadoop system, such as new users.
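A scheduled group update boils down to comparing the current group members with a fresh list from your source of truth (for example, a directory export or observed traffic). This minimal sketch computes that difference; the function name and sample users are illustrative, and applying the changes through the product's API or Group Builder is left out.

```python
from typing import Dict, Iterable, List


def plan_group_update(current_members: Iterable[str],
                      source_members: Iterable[str]) -> Dict[str, List[str]]:
    """Return which members to add to and remove from a group.

    current_members: members already in the group.
    source_members: members according to the authoritative source.
    """
    current = set(current_members)
    source = set(source_members)
    return {
        "add": sorted(source - current),
        "remove": sorted(current - source),
    }


if __name__ == "__main__":
    # Hypothetical example: a new Hadoop user appears, an old one is gone.
    plan = plan_group_update({"david", "cloudera"}, {"david", "svoruga"})
    print("add:", plan["add"])
    print("remove:", plan["remove"])
```

Computing a plan first, rather than overwriting the group, lets you review removals before they take effect, which can matter for groups such as privileged users.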
Figure 7. Extract of an unauthorized job activity report
Columns: Name, MapReduceUser, MapReduceName, MapReduce Job, SourceProgram
Row: SVORUGA, svoruga, PiEstimator, job_201209042356_0007, HADOOP PROTOBUF CLIENT PROGRAM
Define a security policy
Policies are sets of rules and actions that direct the operations and behavior of the InfoSphere Guardium system, including which traffic is ignored and which is logged; which activities require more granular logging; and when to trigger real-time alerts. InfoSphere Guardium includes a Hadoop policy that you can customize, as shown in Figure 8. The purpose of the predefined policy rules is to filter out traffic that is not needed for auditing. The policies make use of predefined groups such as Hadoop-SkipObjects. You will likely create and modify such groups based on the traffic observed in your system.
You can then add rules, such as ignoring trusted sessions or logging the activities of privileged users in more detail. Again, this is where your predefined group of privileged users will help.
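To illustrate how an ordered rule set can drive ignore and logging decisions, here is a minimal sketch. It assumes first-match semantics (the first matching rule decides the action), which is a simplification and may differ from how the product evaluates rules; the group contents, event fields and action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]
    action: str  # "ignore", "log" or "log_detailed" (illustrative names)


def evaluate(event: Event, rules: List[Rule],
             default_action: str = "log") -> str:
    """Return the action of the first rule that matches the event."""
    for rule in rules:
        if rule.matches(event):
            return rule.action
    return default_action


# Hypothetical group contents; real groups are maintained in Guardium.
SKIP_OBJECTS = {"/tmp/heartbeat"}
TRUSTED_CLIENT_IPS = {"10.0.0.5"}
PRIVILEGED_USERS = {"hdfs"}

RULES = [
    Rule("skip internal objects",
         lambda e: e["object"] in SKIP_OBJECTS, "ignore"),
    Rule("ignore trusted sessions",
         lambda e: e["client_ip"] in TRUSTED_CLIENT_IPS, "ignore"),
    Rule("privileged users in detail",
         lambda e: e["user"] in PRIVILEGED_USERS, "log_detailed"),
]

if __name__ == "__main__":
    event = {"user": "hdfs", "object": "/data/payroll",
             "client_ip": "10.1.1.1"}
    print(evaluate(event, RULES))
```

Under this assumption rule order matters: placing the filtering rules first keeps noisy internal traffic out of the repository even when it comes from a privileged account.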
Figure 8. Hadoop policy built-in rules
In addition, you can use policies to define real-time alerts. For example, you can create a rule in which an alert is fired whenever a user from a particular group, such as a privileged user, attempts to access a sensitive data set that they are not authorized to access. This requires creating a group of privileged users and a group of sensitive data objects. Figure 9 is an example of how this alert will appear on the Guardium Incident Management tab. Alerts can also be sent to email addresses.
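The alert condition described above combines two groups: who the user is and what object was touched. The sketch below shows that two-group test on a stream of events; the group members, paths and message format are hypothetical, and it simplifies "not authorized" to plain membership in both groups.

```python
from typing import Dict, Iterable, List, Set

Event = Dict[str, str]


def should_alert(event: Event, watched_users: Set[str],
                 sensitive_objects: Set[str]) -> bool:
    """True when a watched user touches an object in the sensitive group."""
    return (event["user"] in watched_users
            and event["object"] in sensitive_objects)


def alerts(events: Iterable[Event], watched_users: Set[str],
           sensitive_objects: Set[str]) -> List[str]:
    """Return one alert message per matching event."""
    return [
        f"ALERT: user {e['user']} accessed sensitive object {e['object']}"
        for e in events
        if should_alert(e, watched_users, sensitive_objects)
    ]


if __name__ == "__main__":
    # Hypothetical groups and traffic.
    privileged = {"hdfs"}
    sensitive = {"/data/payroll"}
    traffic = [
        {"user": "hdfs", "object": "/data/payroll"},
        {"user": "hdfs", "object": "/data/public"},
        {"user": "david", "object": "/data/payroll"},
    ]
    for message in alerts(traffic, privileged, sensitive):
        print(message)
```

Only the first sample event raises an alert, because it is the only one where both group conditions hold.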
Figure 9. Alert on access to sensitive files by a user who is not authorized
Customize reports and create compliance automation workflows
Because InfoSphere Guardium stores all information from all monitored sources into a common schema, many existing reports included with InfoSphere Guardium will show valid information for Hadoop, such as session information. InfoSphere Guardium also includes several reports that have already been tailored for Hadoop, including MapReduce activity, detecting unauthorized MapReduce jobs, Hue/Beeswax reports for Hive, HDFS activity, and full details reports. You can customize these reports, or build your own tailored to your own audit process requirements using the robust query building and report building capabilities in InfoSphere Guardium.
InfoSphere Guardium includes workflow capabilities to enable the distribution and signoff of audit reports. Results can be delivered to users, groups of users, or roles. (Using roles is recommended to enable more than one user to review and sign off. Roles also make it easier to manage employee absence and turnover.) Start by:
1. Identifying who should receive reports for what job function (for example, information security manager).
2. Identifying groups of users with the same job function and grouping them into roles. You can use the predefined roles in InfoSphere Guardium, or create your own customized roles.
3. Creating users and assigning them to their appropriate roles.
4. Determining how often reports need to be generated.
5. Determining who receives the reports, whether review/signoff is required and whether the delivery should stop at any user or role until they complete the required action.
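The last step above, deciding whether delivery stops at a receiver until they act, can be pictured as an ordered distribution chain. This is a minimal sketch of that idea only; the class, the role names beyond the information security manager, and the stopping rule are assumptions for illustration, not the product's workflow engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Receiver:
    role: str
    blocking: bool           # delivery stops here until the action is done
    completed: bool = False  # review/sign-off has been performed


def delivered_to(chain: List[Receiver]) -> List[str]:
    """Return the roles that have received the report so far.

    Delivery proceeds in order and stops after the first blocking
    receiver that has not yet completed the required action.
    """
    reached = []
    for receiver in chain:
        reached.append(receiver.role)
        if receiver.blocking and not receiver.completed:
            break
    return reached


if __name__ == "__main__":
    # "auditor" is a hypothetical second role.
    chain = [
        Receiver("information security manager", blocking=True),
        Receiver("auditor", blocking=False),
    ]
    print(delivered_to(chain))
    chain[0].completed = True  # the manager signs off
    print(delivered_to(chain))
```

Assigning receivers by role rather than by named user means anyone holding the role can complete the blocking step, which is the reason the text recommends roles.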
Operationalize your processes
Operational procedures should be defined for each of the teams that are involved in administering the InfoSphere Guardium environment or in evaluating and acting on monitoring results. Process flows can be very useful in defining responsibilities and the sequence of steps needed to address a particular situation, such as when new users are authorized to the InfoSphere Guardium system, or when policy rules need to change.
Extra emphasis should be given to processes related to handling security breaches and forensic investigations. The support team needs to be made aware of the rules and trained on steps to be performed in case a security breach occurs.
Based on the business requirements, daily, weekly, monthly, quarterly and cyclical tasks should be defined and documented. Here is a simplified example of a plan:
The InfoSphere Guardium Administrator:
• Verifies archiving/aggregation and backup
• Follows up on self-monitoring alerts from the previous night
The Audit team:
• Performs review of the automated audit processes set up on the system
• Investigates any activity that is not business as usual
• Escalates data security breach attempts
The InfoSphere Guardium Administrator:
• Verifies space utilization on the appliance
• Verifies that data is being logged correctly
• Verifies that the InfoSphere Guardium appliance is purging and archiving correctly
• Verifies that all scheduled jobs are executed on time
The Audit team:
• Meets with the members of the Hadoop
The implementation of an InfoSphere Guardium data activity monitoring solution for Hadoop can help jump-start your organization’s use of Hadoop for enhanced business value. With the correct planning and understanding of your business requirements for monitoring and auditing, InfoSphere Guardium can help you address regulatory requirements and reduce your risk of data breaches from hackers or insiders.
For more information, please visit ibm.com/guardium.
InfoSphere Guardium Data Security v9
Deliver real-time activity monitoring and automated compliance reporting for Big Data security
Big data security and auditing with IBM InfoSphere Guardium
Monitor and audit access for IBM InfoSphere BigInsights and Cloudera Hadoop
Understanding holistic database security
8 steps to successfully securing enterprise data sources
For more information on managing database security in your organization, visit
© Copyright IBM Corporation 2012 IBM Corporation
Software Group Route 100 Somers, NY 10589
Produced in the United States of America December 2012
IBM, the IBM logo, ibm.com, InfoSphere, and Guardium are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.