Real-Time Big Data Analytics for the Enterprise


Executive Summary

Companies are using real-time big data analytics to reshape the competitive landscape in their industries. They do it by capturing, storing, and analyzing volumes and varieties of data that were previously unmanageable, and then extracting insights fast enough to support real-time business processes. What started with a few leading Internet companies has spread to finance, healthcare, government, manufacturing, retail, scientific research, and many other fields.

Yet implementing real-time big data analytics can be challenging. IT organizations must build mission-critical solutions based, at least in part, on open-source software that does not always meet enterprise requirements. Not only is integration complex, but IT organizations must also establish security, compliance, and high availability from the ground up so that the system can safely house sensitive data and support revenue-generating business processes.

Intel and SAP have addressed these challenges to provide an enterprise-ready solution for real-time big data analytics. With SAP HANA* running on the latest Intel® Xeon® processor E7 family and the Intel® Distribution for Apache Hadoop* software running on the latest Intel Xeon processor E5 family, businesses can ingest, store, and analyze petabytes of polystructured data, and they can generate insights in fractions of a second to support real-time business processes.

This solution includes a rich set of data management and business intelligence tools for turning data into high-value insights that can be embedded into other applications and business processes. Just as importantly, the solution is designed to meet enterprise requirements of security, compliance, and high availability so businesses can confidently integrate sensitive data into their analytics environment.

This white paper discusses the value of performing real-time analytics using all available enterprise data and describes how Intel and SAP have overcome the inherent challenges to deliver an enterprise-ready solution.


Table of Contents

Executive Summary
Extending Real-Time Analytics to All Enterprise Data
Solving the Challenges of Big Data Integration
Advanced Analytics across All Data Sets
Industry-Leading Performance for Apache Hadoop
Integrated Data Management
An Enterprise-Ready Platform
End-to-End Security
High Availability
Enterprise-Class Manageability
SAP and Intel: A Shared Vision for Big Data Integration
SAP: Single Point of Contact for Service and Support


Extending Real-Time Analytics to All Enterprise Data

Advances in data analytics are changing the way businesses compete, enabling them to make faster and better decisions based on real-time analysis. Until recently, companies had to make tradeoffs between deep analysis of large data sets and fast time to results. Intel and SAP are eliminating the need to compromise with an analytics platform designed to deliver real-time query performance while acting on petabytes of both structured and unstructured data.

SAP HANA provides a real-time analytics platform using an in-memory database. Organizations can combine large data sets from their operational systems and other sources and perform complex queries in real time, typically in milliseconds. They can even use a single SAP HANA instance as a common foundation for all their applications, both transactional and analytical. This approach streamlines infrastructure and eliminates the physical and operational complexities of moving large amounts of data from operational systems to analytic systems. With these capabilities, SAP HANA answers the business challenge of delivering data-driven intelligence to support real-time business processes.

Big data introduces a new set of challenges. Companies generate enormous volumes of poly-structured data from Web logs, sensors, call records, social network posts, emails, and many other sources. They need a cost-effective, massively scalable solution for capturing, storing, and analyzing this data. They also need to be able to integrate their big data into their real-time analytics environment to maximize business value. For example, many companies want to analyze the clickstream trails of online customers in combination with historical purchasing patterns to deliver personalized offers and information. Deep analysis across diverse data sets can improve outcomes in such scenarios, but results are needed quickly to positively impact online transactions.

Intel and SAP have collaborated to meet this challenge by integrating the Intel Distribution for Apache Hadoop (IDH) software with SAP HANA, SAP Data Services, and SAP Business Objects. The result is a real-time analytics platform designed to efficiently ingest, store, integrate, and analyze all enterprise data. The platform offers:

• Real-time analytics with cost-effective storage that can scale to petabytes, and potentially exabytes, of data.

• Transparent data integration and query federation, so advanced analytics can be applied across all data using SAP tools and familiar SQL-based programming models.

• Enterprise-class support for security, compliance, and manageability so businesses can realize the advantages of real-time big data analytics more quickly and with reduced cost and risk.


Solving the Challenges of Big Data Integration

SAP HANA is known for its unmatched query performance at scale. Intel collaborated with SAP engineers to help them optimize their in-memory processing platform to get maximum benefit from the hardware capabilities of the Intel Xeon processor E7 family, including its multicore architecture, large cache, large memory capacity, and high-bandwidth I/O channels. Based on these efforts, SAP HANA speeds query processing times by as much as 10,000 times¹ versus traditional data warehouse solutions.

The latest Intel Xeon processor E7 v2 family delivers even greater performance benefits and can process much larger in-memory data sets. These new processors support three times more memory than previous-generation processors: up to 6 TB on a four-socket server and up to 12 TB on an eight-socket server. They also provide more cores, threads, and system bandwidth to enable up to 2x faster performance² for complex, ad hoc queries, compared to previous-generation SAP HANA platforms.
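The "up to 2x" figure can be checked against the two queries-per-hour scores reported in footnote 2. The short sketch below simply divides those published numbers and expresses the quoted memory ceilings per socket; the variable names are illustrative and are not part of any Intel or SAP tool.

```python
# Queries-per-hour scores quoted in footnote 2
# (Intel internal measurements, November 2013).
baseline_qph = 110_061   # four Intel Xeon processors E7-4870
new_gen_qph = 218_406    # four Intel Xeon processors E7-4890 v2

speedup = new_gen_qph / baseline_qph
print(f"Measured speedup: {speedup:.2f}x")

# Maximum memory figures quoted in the text, expressed per socket.
max_memory_tb = {4: 6, 8: 12}   # sockets -> maximum memory in TB
per_socket_tb = {sockets: tb / sockets for sockets, tb in max_memory_tb.items()}
print(per_socket_tb)
```

Both server sizes work out to the same per-socket ceiling, which suggests the larger configuration scales capacity by adding sockets rather than by raising the limit per processor.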

The distributed architecture of Apache Hadoop addresses very different requirements than SAP HANA. Hadoop enables query performance and data capacity to be scaled cost-effectively across tens to hundreds of standard, two-socket servers based on Intel Xeon processors and configured with direct-attached storage drives. This clustered architecture stores and processes data at a cost-per-terabyte that is far lower than traditional data warehousing systems. Although Hadoop enables fast processing of massive data sets, queries typically take minutes to hours to complete. This creates challenges when integrating Hadoop into a real-time analytics environment.

Intel and SAP address these challenges in two ways. First, IDH is highly optimized for performance on Intel® architecture (see sidebar). Second, Intel and SAP make it easy to generate queries that make efficient use of both platforms.

Advanced Analytics across All Data Sets

SAP HANA and SAP Business Objects provide comprehensive support for advanced analytics, including traditional SQL-based queries, dashboards, predictive analytics, planning, text mining, and more. In combination with IDH, these models can be applied transparently across the data stored in both platforms.

BI users and developers see data stored in IDH as an extension of the data stored in SAP HANA. The queries they generate are automatically federated, as appropriate, across the two platforms. For example, one part of a query might extract customer purchasing data from SAP HANA; another part might search associated Web server logs or call center data records in the Hadoop cluster. The results are then combined and further analyzed in SAP HANA to provide desired insights. As part of this query federation process, some components of the SQL queries generated by BI users and developers are automatically translated into MapReduce* applications that can run natively in Hadoop.
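To make the idea of translating SQL into MapReduce more concrete, the sketch below expresses a simple aggregate, roughly `SELECT customer_id, COUNT(*) FROM web_logs GROUP BY customer_id`, as map, shuffle, and reduce steps in plain Python. It is only an illustration of the general shape of such a job, not the translation the platform actually performs; the record layout and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical web-server log records; the field names are illustrative only.
web_logs = [
    {"customer_id": 1, "page": "/home"},
    {"customer_id": 2, "page": "/cart"},
    {"customer_id": 1, "page": "/cart"},
    {"customer_id": 1, "page": "/checkout"},
]


def map_phase(record):
    """Emit one (key, 1) pair per record, keyed by the GROUP BY column."""
    yield record["customer_id"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Apply the aggregate, here COUNT(*), to all values for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


page_views = run_job(web_logs)
print(page_views)
```

In a real cluster the map and reduce functions would run in parallel on many nodes, each working on the portion of the data stored locally.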

The separate parts of a federated query can be performed simultaneously. They can also be performed asynchronously, so that intermediate results from the Hadoop cluster are available as needed to support real-time processes in SAP HANA. Query performance statistics are provided, so developers can shape queries to address specific latency requirements.
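The following sketch illustrates the federation pattern described above under simplifying assumptions: an in-memory SQLite database stands in for the in-memory platform, a plain Python function stands in for the Hadoop side, and the two parts run concurrently before their results are combined. All table names, function names, and sample values are hypothetical, and nothing here reflects the actual SAP HANA or IDH interfaces.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def query_in_memory_store():
    """Stand-in for the in-memory side: total purchases per customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [(1, 40.0), (1, 60.0), (2, 25.0)],
    )
    rows = conn.execute(
        "SELECT customer_id, SUM(amount) FROM purchases GROUP BY customer_id"
    ).fetchall()
    conn.close()
    return dict(rows)


def query_batch_store():
    """Stand-in for the Hadoop side: page views per customer from web logs."""
    web_logs = [(1, "/home"), (2, "/cart"), (1, "/cart")]
    views = {}
    for customer_id, _page in web_logs:
        views[customer_id] = views.get(customer_id, 0) + 1
    return views


def federated_query():
    """Run both parts concurrently, then combine them on customer_id."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        batch_future = pool.submit(query_batch_store)
        memory_future = pool.submit(query_in_memory_store)
        spend = memory_future.result()
        views = batch_future.result()
    return {
        customer_id: {
            "total_spend": spend[customer_id],
            "page_views": views.get(customer_id, 0),
        }
        for customer_id in sorted(spend)
    }


print(federated_query())
```

Because each part is submitted as a future, the caller could also continue with other work and collect the slower result only when it is needed, which mirrors the asynchronous mode mentioned in the text.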

Industry-Leading Performance for Apache Hadoop*

The Intel® Distribution for Apache Hadoop* (IDH) software is optimized with the latest Intel® Xeon® processors, Intel® Solid-State Drives, and 10 gigabit Intel® Ethernet Adapters to deliver:

• Up to 30x higher performance than unoptimized Hadoop software running on legacy hardware.³

• Up to 2.6x faster performance than other open-source Hadoop distributions running on the same hardware platform.⁴

Additional optimizations within IDH help to improve performance for other key functions, such as MapReduce* job launches and Hive* queries. (Hive provides data-warehouse-like functionality for Hadoop environments and is a key component for integrating the Intel Distribution with SAP HANA*.) These and other optimizations help to shorten query completion times. They also allow organizations to perform more queries in the time available, which provides greater agility and better utilization of the infrastructure.


Much of this functionality is supported through the SAP HANA Smart Data Access connector, which Intel and SAP have optimized for use with IDH (Figure 1). This connector supports data relocation as well as the creation of proxy tables within SAP HANA to simplify and accelerate data access and query execution.

Intel implemented a number of optimizations to improve query performance on Apache Hadoop. One example is hot replication, in which multiple replicas of frequently used data are automatically created to avoid contention. Suppose a company launches a popular new product, and the associated data is under continuous demand. Dozens or even hundreds of replicas can be generated so the data can be accessed and manipulated without bottlenecks.

Another performance-enhancing feature is caching. Frequently used data and intermediate query results are automatically stored in the in-memory database of SAP HANA, so they can be accessed almost instantly when needed.
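As a rough illustration of hot replication, the sketch below picks a replica count from observed read demand. The policy, the `reads_per_replica` threshold, and the `max_replicas` cap are all hypothetical assumptions made for this example; the source does not describe how IDH actually decides how many replicas to create.

```python
import math

# Hadoop's default replication factor, as noted in the High Availability section.
DEFAULT_REPLICAS = 3


def target_replicas(reads_per_minute, reads_per_replica=1000, max_replicas=200):
    """Return a replica count for a data block given its observed read rate.

    reads_per_replica and max_replicas are hypothetical tuning knobs: the
    first is the load one replica is assumed to serve without contention,
    the second caps how far a single hot block may be replicated.
    """
    needed = math.ceil(reads_per_minute / reads_per_replica)
    return max(DEFAULT_REPLICAS, min(needed, max_replicas))


for demand in (500, 45_000, 1_000_000):
    print(demand, "reads/min ->", target_replicas(demand), "replicas")
```

Under this assumed policy, rarely read data keeps the default three replicas, while data for a popular new product could grow to dozens or hundreds of replicas, up to the cap.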

With these and other optimizations, Intel and SAP help to make the integration between SAP HANA and IDH as seamless and as transparent as possible for BI users and developers.
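The caching idea mentioned above can be sketched as a small least-recently-used cache that keeps recent query results in memory and recomputes only on a miss. This is a generic technique chosen for illustration; the class name, the eviction policy, and the `compute` callback are assumptions, and the source does not say which caching strategy the platform uses.

```python
from collections import OrderedDict


class QueryResultCache:
    """Keep the most recently used query results in memory (LRU eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        """Return the cached result for query, calling compute(query) on a miss."""
        if query in self._entries:
            self._entries.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self._entries[query]
        self.misses += 1
        result = compute(query)
        self._entries[query] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result


def slow_lookup(query):
    """Hypothetical stand-in for a long-running query against the cluster."""
    return query.upper()


cache = QueryResultCache(capacity=2)
for query in ("top products", "top products", "daily clicks"):
    cache.get(query, slow_lookup)
print(cache.hits, cache.misses)
```

Only the first request for each distinct query reaches the slow path; repeated requests are served from memory until the entry is evicted.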

Figure 1. The SAP HANA* Smart Data Access connector has been engineered and optimized by Intel and SAP to simplify and accelerate data sharing and query execution across both platforms. As a result, analysts can achieve fast query results across petabytes of structured and unstructured data.

[Figure 1 diagram labels: SAP Business Objects; OLAP analysis; data mining; reporting; SAP HANA*; real time; big data; SAP Data Services; ETL; SAP HANA Smart Data Access, optimized for data relocation and for query federation and acceleration (proxy tables, hot replication, caching); market data; location data; Web logs; call logs; sensor logs; Intel® Distribution for Apache Hadoop software with open source components: HDFS (Hadoop* Distributed File System), YARN* (+ MapReduce*) distributed processing framework, Pig* scripting, Mahout* machine learning, R* stats, HCatalog* metadata, Hive* query, Sqoop* data exchange, Oozie* workflow, HBase* NoSQL store, Zookeeper* coordination, Flume* log collector; connectors (ingest, export); Intel® Manager for Apache Hadoop* Software (deployment, configuration, monitoring, alerts, and security).]


Integrated Data Management

SAP Data Services provides an integrated, enterprise-class platform for data integration, data quality, data profiling, and metadata management. System administrators can use it to load and manage data across both SAP HANA and IDH for SAP. They can also use it to manage data that has been loaded independently into the Hadoop cluster.

An Enterprise-Ready Platform

SAP HANA is engineered specifically to support mission-critical computing environments. Intel implements advanced security and reliability features in the Intel Xeon processor E7 family and related platform components, and works with SAP to ensure they are fully utilized throughout the SAP HANA solution stack.

Apache Hadoop, on the other hand, is an open-source software application that combines features and optimizations generated by many companies and individuals. This development model enables exceptionally fast innovation, which is evidenced by the rapid evolution of the Hadoop software ecosystem. However, because of this rapid evolution, there are gaps in most available Hadoop distributions, particularly with respect to security, availability, and manageability. These gaps have kept many businesses from deploying Hadoop in production environments.

Intel has worked to close those gaps in IDH. IDH includes the full open source solution stack, with all components pre-integrated and optimized to improve performance on Intel architecture. Intel also integrates a combination of open source and proprietary tools to provide a platform that addresses the requirements of enterprise deployments.

Figure 2. Project Rhino: Establishing comprehensive security for Apache Hadoop*. The Intel® Distribution for Apache Hadoop* includes extensive enhancements for enterprise-class security and compliance, and Intel is working on Project Rhino to establish a comprehensive security framework across the Hadoop* ecosystem. The goal is to provide a common authentication and authorization framework with integrated support for regulatory requirements in financial, healthcare, government, and

[Figure 2 diagram labels: Intel® Distribution for Apache Hadoop; Intel® Manager (deployment, configuration, tuning, upgrade, alerts, unified logging, resource monitor, job profiler, security controls, behavior model); connectors (Netezza, Oracle, SAP, SQLServer, Teradata, DB2); legend: Intel proprietary components, Intel-optimized open source components, includes Intel security enhancements; Kafka* event bus; Pig* scripting; HCatalog* metadata; SLURM* scheduler; Hive query; R* stats; Lucene*, Solr* search; Mahout* machine learning; YARN* (+ MapReduce*) distributed processing framework; HDFS | Lustre* | GlusterFS Hadoop-compatible file systems; high availability and disaster recovery; Rhino (security): encryption, authentication, authorization, auditing; graph mining; low-latency SQL-92; Gryphon*; recommendation engine; analytics workbench; HBase* explorer; vertical accelerators; heat map; Sqoop* data transfer; HBase; Flume* log collector; Oozie* workflow; Zookeeper* coordination.]

End-to-End Security

IDH provides end-to-end security to protect data. Tools and capabilities include:

• Authentication and Access Control. IDH supports user authentication and role-based access controls. Queries generated in SAP Business Objects are authenticated just once for both SAP HANA and IDH, and IDH provides granular access controls for data and services. Users and queries can only access authorized data sets, which helps to protect sensitive data against both internal threats and external hackers.

• Fast, transparent data encryption. IDH uses Intel® Data Protection Technology with Advanced Encryption Standard New Instructions⁵ (AES-NI), which accelerates encryption and decryption performance by up to 19 times⁶, to enable strong data protection without compromising query performance. Data can be encrypted selectively and transparently, both in motion and at rest, to meet security and compliance requirements. Within IDH, transparent encryption is supported in Hive, Pig*, MapReduce, HBase*, and the Hadoop Distributed File System* (HDFS*).

• Governance. All database operations are logged across both SAP HANA and IDH and can be audited to verify that users only access authorized data sets and services. Reports and automated alerts help IT protect data and document compliance.

Intel is working to extend these and other security capabilities across the Hadoop ecosystem through an open source project called Project Rhino (Figure 2). The goal is to establish a comprehensive security framework for Hadoop that will help businesses address security issues and compliance protocols across a wide range of use cases in financial, healthcare, government, and e-commerce environments. Project Rhino will contribute code to the Apache Software Foundation so these capabilities will be freely available.

High Availability

Big data analytics are often used to improve outcomes in revenue-producing business processes, so high availability is important. SAP HANA provides integrated support for data replication and system failover to prevent downtime. Hadoop implements 3-way data replication by default, so that any data node in a cluster could fail without impacting service or data availability. However, the cluster NameNode and JobTracker servers, which are required in all Hadoop deployments, are potential single points of failure. IDH provides integrated support for high availability for both of these critical servers. Intel is also working on the open source Project Ladon, which is designed to support disaster recovery of Apache Hadoop through multisite data replication.

Enterprise-Class Manageability

SAP HANA is typically delivered as an appliance for onsite deployments. All hardware and software are tightly integrated and optimized to simplify deployment and management. Apache Hadoop, on the other hand, is based on open source software that is designed to run on large numbers of off-the-shelf servers. Management can be complex in this more distributed computing environment, and the challenges increase as a cluster grows.

IDH includes Intel® Manager for Apache Hadoop software, which combines open source and proprietary tools to provide enterprise-level manageability, including:

• A user-friendly interface for managing access controls and for updating the system. Built-in wizards provide workflows and guidance to speed deployment, simplify upgrades, and improve results.

• Automatic cluster configuration and tuning, using the Intel® Active Tuner. Advanced machine-learning algorithms select the best setup based on workload characteristics to deliver optimized query performance quickly and with no need for complex manual tuning.

• Built-in monitoring, with a dashboard that provides a comprehensive view of the cluster and system health.

• Flexible extensibility, with an application programming interface (API) that allows third-party and custom applications to access the functions in Intel Manager for Apache Hadoop.

SAP and Intel: A Shared Vision for Big Data Integration

Intel and SAP continue to jointly engineer, optimize, and enhance the integration of SAP HANA and IDH. The companies are working together to integrate new functionality and to optimize software to derive maximum benefit from advances in hardware. Some objectives of this collaboration include:

• Simplified troubleshooting, so query failures can be identified, diagnosed, and fixed more quickly and efficiently. Future solutions will include built-in analytics for root-cause analysis.

• Enhanced data relocation, so data can be moved more quickly, flexibly, and transparently between the two platforms.

• Stronger security, by further improving integration and by providing more comprehensive, multilayered protections in both hardware and software.

Intel is also deeply involved in hundreds of open source projects to increase Hadoop performance and functionality, and the results of these efforts will continue to increase the capability and value of IDH. Many of these developments are also offered back to the open source community to help drive innovation and interoperability across the broader big data ecosystem.


SAP: Single Point of Contact for Service and Support

SAP HANA and IDH are available from SAP sales teams worldwide. SAP offers full support for the joint solution. SAP also offers comprehensive consulting services, from initial planning and assessment through implementation and ongoing optimization. The speed, scale, and flexibility of the platform go far beyond what has been possible in the past, and IT organizations can accelerate deployment by working with experts who have extensive experience with SAP HANA and Apache Hadoop.

Conclusion

SAP and Intel provide an optimized solution for real-time big data analytics based on SAP HANA and the Intel Distribution for Apache Hadoop. Using this joint solution, data and business analysts can combine the performance of in-memory analytics with the massive scalability of Apache Hadoop. As a result, they can store and analyze petabytes of poly-structured data cost effectively at the speeds needed to support real-time business processes.

Intel and SAP have worked closely together to optimize the combined platform to support fast, federated queries that tighten the seams between the two platforms and make it easier for BI users to get the results they want without worrying about the infrastructure. The solution is designed to support enterprise requirements for security, availability, and manageability, so IT organizations can integrate the platform into their datacenter while minimizing cost and risk.

Intel Distribution for Apache Hadoop: http://hadoop.intel.com

SAP Big Data: www.sap.com/bigdata

1. Source: Sikka, Vishal, SAP. “The Business Value of Speed! Lessons from 10,000X SAP HANA Performance Club.” August 2012. http://www.saphana.com/community/blogs/blog/2012/08/05/the-business-value-of-speed.

2. Source: Intel internal measurements, November 2013. Configurations: Baseline 1.0x: Intel® E7505 Chipset using four Intel® Xeon® processors E7-4870 (4P/10C/20T, 2.4GHz) with 256GB DDR3-1066 memory scoring 110,061 queries per hour. Source: Intel Technical Report #1347. New Generation 2x: Intel® C606J Chipset using four Intel® Xeon® processors E7-4890 v2 (4P/15C/30T, 2.8GHz) with 512GB DDR3-1333 (running 2:1 VMSE) memory scoring 218,406 queries per hour. Source: Intel Technical Report #1347.

3. Source: TeraSort Benchmarks conducted by Intel in December 2012. Custom settings: mapred.reduce.tasks=100 and mapred.job.reuse.jvm.num.tasks=-1. Cluster configuration: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Cisco Nexus* 5020 10 Gigabit switch. Performance measured using Iometer* with Queue Depth 32. Baseline worker node: SuperMicro SYS-1026T-URF 1U servers with two Intel® Xeon® processors X5690 @ 3.47 GHz, 48 GB RAM, 700 GB 7200 RPM SATA hard drives, Intel® Ethernet Server Adapter I350-T2, Apache Hadoop* 1.0.3, Red Hat Enterprise Linux* 6.3, Oracle Java* 1.7.0_05. Baseline storage: 700 GB 7200 RPM SATA hard drives, upgraded storage: Intel® Solid-State Drive 520 Series (the Intel® Solid-State Drive 520 Series is currently not validated for data center usage). Baseline network adapter: Intel® Ethernet Server Adapter I350-T2, upgraded network adapter: Intel® Ethernet Converged Network Adapter X520-DA2. Upgraded software in worker node: Intel® Distribution for Apache Hadoop* software 2.1.1. Note: Solid-state drive performance varies by capacity. More information: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html.

4. Source: Terasort Benchmarks conducted by Intel. Configuration details: One head node (name node, job tracker), 10 workers (data nodes, task trackers), Dual Intel® Xeon® processor E5-2680 @ 2.70 GHz, 32 cores per node, 7 x 1 TB dedicated data disks per node, 10 GbE network. System Swap turned off, Kernel Buffer Cache cleared before each performance test.

5. No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.

6. Source: Intel Internal tests using OpenSSL 1.0.1c* encryption software to encrypt and decrypt a 1 GB text file, with and without AES-NI enabled. Server configuration: 4-socket server with 4 x Intel® Xeon® processor E5-2690 (32 core system, 1 core used in testing), 32 GB memory, CentOS 6.3* operating system, Apache Hadoop Distributed File System* (HDFS*) with namenode, datanode, and the test program all run on the same server, 240 GB Intel® Solid State Drive 320 Series storage. For details, see the Intel Solution Brief, “Fast, Low-Overhead Encryption for Apache Hadoop*.” http://hadoop.intel.com/pdfs/IntelEncryptionforHadoopSolutionBrief.pdf

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A “Mission Critical Application” is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL’S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS’ FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
