From Big Data to Actionable Insight


Bob Palmer | Senior Director, SAP National Security Services™ (SAP NS2™)

Dan Dorchinsky | Client Director, SAP National Security Services™ (SAP NS2™)

August 2012 | www.SAPNS2.com

About SAP National Security Services™ (SAP NS2™)

SAP National Security Services (SAP NS2) offers a full suite of enterprise applications, analytics, database, cloud, and mobility software solutions from SAP and Sybase with specialized levels of security and support to meet the unique mission requirements of US national security and critical infrastructure customers. SAP National Security Services™ and SAP NS2™ are trademarks owned by SAP Government Support and Services (SAP GSS). For more information, visit www.SAPNS2.com.

Big Data has become a hot topic as information technology (IT) leaders in business and government struggle with how best to leverage the rising flood of data coming from myriad sources. The explosion of information is a double-edged sword. On one hand, the data can reveal new insights that would have previously remained hidden. On the other hand, the quantity of data brings challenges in capturing, storing, sharing, searching, and analyzing that data. The term “Big Data” has come to describe data sets that are so large and complex that they are too cumbersome to manage using traditional tools or processes.


What is “Big Data”?

In today’s world, we have access to unimaginably large volumes of information from a growing number of data sources. While the exponentially expanding stream of information has made it possible to accomplish more and to address problems differently (for example, spotting new business trends or drawing conclusions on theories that could never before be tested), the sheer volume quickly becomes overwhelming. (See Figure 1.)

Data sources are increasing daily:

• There are 5.9 billion mobile phone subscriptions worldwide, equivalent to 87 percent of the world’s population.1

• Wal-Mart handles more than 1 million customer transactions every hour.2

• More than 2 billion people are accessing the Internet on a regular basis, creating data with every click.3

• The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007. It is predicted that the amount of traffic flowing over the Internet will reach 667 exabytes annually by 2013, and the world’s stores of digital information will increase tenfold every five years.4

1 International Telecommunication Union, “ICT Facts and Figures.” ITU Telecom World (2011), 2, http://www.itu.int/ITU-D/ict/facts/2011/material/ICTFactsFigures2011.pdf.

2 “Data, data everywhere.” The Economist, 25 February 2010, http://www.economist.com/node/15557443.

3 International Telecommunication Union, “ICT Facts and Figures.” ITU Telecom World (2011), 1, http://www.itu.int/ITU-D/ict/facts/2011/material/ICTFactsFigures2011.pdf.


The explosion of information is a double-edged sword. On one hand, the data can reveal new insights that would have previously remained hidden. On the other hand, the quantity of data brings challenges in capturing, storing, sharing, searching, and analyzing the data. Thus, the term “Big Data” has become an industry catchphrase to describe data sets that are so large and complex that they are too cumbersome to manage using traditional tools or processes.

Before discussing specific technologies for harnessing Big Data, it is important to understand that there is no single solution or universal approach to Big Data. The platforms, specific functionalities, and tools chosen need to be based on the requirements of the mission. The platform will likely include a combination of open-source and commercial-grade enterprise software to create an end-to-end solution with data on one end and analytics on the other. This paper provides a notional architecture of this end-to-end platform and highlights how the Department of Defense (DoD) can use a combination of solutions to maximize the value of its data.

DoD Problem Statement

Addressing large volumes of data is not new, per se. However, there are key differences between addressing today’s “Big Data” conundrum and addressing large data volumes in the past:

(1) The recent confluence of cloud, social media, and mobile computing trends has reshaped the approach to Big Data. The ability to collect and analyze the staggering volume of data these


(2) Data consumers (e.g., constituents, service members, and stakeholders) expect that data is served up instantaneously. In other words, they assume there is little to no data latency.

(3) Data consumers expect that applications running against the data have the ability to proactively identify trends. And they expect the data to allow for self-service discovery in a user-friendly, visually appealing manner.

Like the business and science communities, the Pentagon faces a huge Big Data challenge, especially when it comes to “unstructured” data that does not fit well into a relational table and may include text, sensor data, images, or other elements.

According to recent news reports, the DoD has invested billions of dollars in new electronic systems that gather and store vast quantities of imagery and other data from the battlefield. However, the digital deluge is so vast that sifting through it manually to generate actionable information is not practical, sustainable, or cost-effective. According to Zach Lemnios, the Assistant Secretary of Defense for Research and Engineering, the DoD has “progressed the quality of imagers and the quality of sensors to the point where our limitation is no longer the front end collection tools, it's the back-end decision tools. How do I take a large data set and integrate it in a time-critical way?”5 Clearly, deriving new insights, recognizing relationships, and making increasingly accurate predictions are critically important to all of the defense and intelligence services. In addition to ingesting and digesting the sheer volume of data, the DoD faces an additional challenge: sharing information across different mission areas and partners. There is no standardized way of dealing with Big Data across such boundaries.

5 Jared Serbu, “DoD R&D prioritizes 'Big Data.'” Federal News Radio, 11 April 2012.


Building the Solution Platform

Step 1: Capturing and Storing Huge Volumes of Unstructured Data

As stated previously, many of the functions behind Big Data are not new—data warehousing, mining, analytics—but have been out of reach of many organizations due to prohibitive storage and processing costs. Obviously, a lower-cost approach will always prove most favorable for adoption.

Therefore, it is not a surprise that open-source software frameworks, such as Hadoop, have gained popularity for managing a major aspect of the Big Data platform, storing unstructured data, in large part because of the cost.6 For example, storing data on a Hadoop cluster costs less than one-tenth as much as storing it in an equivalent relational database. Hadoop clusters were also designed to better process unstructured data, which represents 95% of the data currently being created. Video, images, and voice files are not easily stored in relational databases. Lastly, Hadoop offers significant advantages with linear scalability.

Although this paper is not intended to focus on Hadoop, it is important to define it for uninitiated executives. Keep in mind that SAP NS2’s position is not “for or against” any specific technology. Rather, we embrace and extend other technologies. Combining the best features of an open-source solution (such as Hadoop) with innovative, enterprise-ready technology will deliver the most value—at the best cost—for many organizations with Big Data requirements. Further, a Big Data solution should let an organization store data in its native object format and then enable users to pick and choose what items to bring forward for analysis.

Quite simply, Hadoop is a distributed file system, not a database. The Hadoop Distributed File System (HDFS) manages the splitting up and storage of large data files across many inexpensive commodity servers, which are known as “worker nodes” and cost hundreds, not thousands, of dollars per terabyte. When Hadoop splits up a file, it puts redundant copies of each chunk on more than one disk drive, providing self-healing redundancy in case a low-cost commodity server fails. Hadoop also manages the distribution of scripts that perform business logic on the data files that are split up across those many server nodes. Distributing the business logic to the CPU and RAM of each inexpensive worker node is what makes Hadoop work well on very large Big Data files: analysis logic runs in parallel on all of the server nodes at once, each node working on its own 64 MB or 128 MB chunks of the file. Hadoop is written in Java and is free to license; it was developed as an open-source project of the Apache Software Foundation.
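The splitting and redundant storage described above can be illustrated with a small Python sketch. It is a toy model, not Hadoop code: the replica count of 3, the node names, and the round-robin placement are assumptions made for illustration, and a real HDFS cluster uses a more elaborate, rack-aware placement policy.

```python
BLOCK_SIZE_MB = 128   # one of the chunk sizes mentioned in the text
REPLICATION = 3       # assumed replica count; the paper does not state one


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block of a file of the given size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after a node fails."""
    return [
        block for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical 1,000 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(1000)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c", "node-d"])
readable = surviving_blocks(placement, "node-a")
```

Because every block lives on several distinct nodes, losing any single node in this model still leaves every block readable, which is the redundancy property the paragraph describes.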

Assuming that the technological challenge of capturing and storing massive unstructured data has been addressed with a solution like Hadoop, two vitally important questions remain: 1) How can the

organization make the data relevant to act upon in real time, and 2) How can the organization accomplish this cost-effectively?

Step 2: Adding an Analytical Data Warehouse for Real-Time Logic

Our approach combines the Hadoop distributed file system for storing large amounts of unstructured data with (1) an analytical data warehouse for processing real-time analysis logic, and (2) a self-service, web-based user interface for visualizing the data.

This approach may irritate purists who advocate solely for the Free and Open-Source Software (FOSS) movement or for the more traditional model of commercial-off-the-shelf (COTS) software. But in looking at this problem, we have concluded that a combination of solutions brings the right tools to the job. The Hadoop system is a scalable approach to inexpensively take in and store very large data files of unstructured and semi-structured data. Then the content of those files can be sorted and processed in parallel as instructed by code written by data scientists using the MapReduce methodology.7
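For readers unfamiliar with MapReduce, the following single-process Python sketch shows the three conceptual phases (map, shuffle, reduce) on a word-count task. It does not use the Hadoop API; the function names and sample chunks are invented for illustration, and on a real cluster each chunk would be mapped on the node that stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def map_reduce(chunks):
    emitted = []
    for chunk in chunks:  # sequential here; parallel across nodes on a cluster
        emitted.extend(map_phase(chunk))
    return reduce_phase(shuffle(emitted))


# Two hypothetical file chunks.
chunks = ["sensor alert sensor", "alert image sensor"]
counts = map_reduce(chunks)
```

Here `counts` maps each word to its total across both chunks. Changing the question, for example to count only alerts per hour, means rewriting and rerunning the job over the whole input.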

However, Hadoop and MapReduce have limitations and cannot effectively address Big Data solely on their own. A scalability problem arises in a pure Hadoop environment, and it is not merely a problem of scaling the number of cheap server nodes or the disk space in the data center. Hadoop with MapReduce is essentially batch-oriented; a developer builds a MapReduce script to operate on the whole file, which may be multiple terabytes in size, and the script may run for 20 minutes, an hour, or even longer. There is no indexing or schema for the file system, nor is there the capability to create, update, or delete individual records. Given appropriate time and enough skilled data scientists, any type of analysis (predictive, comparative, pattern recognition, text analysis, or time series) can be run against the data in a batch process using MapReduce. While this process is effective, the iterative nature of data analytics creates a conundrum. For example, the answer to the initial query may be accurate, but often a new requirement will emerge to interrogate the data again, this time slightly differently. This creates a second scalability problem: how many hours will it take for highly skilled data scientists to write and rewrite analyses as driven by the shifting needs of mission specialists and warfighters?

Step 3: A Hybrid Solution Combining the Distributed File System with a Columnar-Structured Analytical Repository

The ultimate goal of a Big Data solution is to derive actionable insight from as much information as possible. A state-of-the-art approach empowers end users to query the data, with near-instant response time, in a much more self-service manner. The approach can be delivered by combining Hadoop with a high-performance data store in columnar format. The columnar data format is appropriate for this use case for three reasons:


• The columnar data store obviates the need to build a pre-conceived indexing strategy for the data, because in a sense, in a columnar data store, the data is the index.

• And finally, the columnar data store allows for extreme data compression through bit-mapping, because all of the attributes of the data are organized together in columns, instead of being distributed in rows. Certainly, data compression is appropriate in what will be a large volume of data being operated upon in this Big Data scenario.
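The idea that "the data is the index" and that columns lend themselves to bit-mapping can be sketched in a few lines of Python. This is a simplified illustration under assumed sample data, not a description of how Sybase IQ or SAP HANA are implemented; the field names and values are hypothetical.

```python
# Hypothetical records as they might arrive row by row.
rows = [
    {"sensor": "radar", "region": "north", "reading": 12},
    {"sensor": "sonar", "region": "south", "reading": 7},
    {"sensor": "radar", "region": "south", "reading": 9},
    {"sensor": "radar", "region": "north", "reading": 4},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per attribute."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps


def matching_rows(bitmap, n_rows):
    """Row positions whose bit is set in the bitmap."""
    return [i for i in range(n_rows) if (bitmap >> i) & 1]


columns = to_columns(rows)
sensor_bits = bitmap_index(columns["sensor"])
region_bits = bitmap_index(columns["region"])

# "radar AND north" becomes a single bitwise AND over two small integers.
hits = sensor_bits["radar"] & region_bits["north"]
total = sum(columns["reading"][i] for i in matching_rows(hits, len(rows)))
```

A low-cardinality column such as `sensor` collapses to one bitmap per distinct value, which is where the compression comes from, and ad hoc filters combine bitmaps instead of scanning whole rows.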

SAP has been a thought leader in columnar database development, and since the acquisition of Sybase in 2010, its expertise has deepened. For years, Sybase IQ was the pioneering columnar database in the market; and now, with more than 13 years of continuous development, it is widely deployed in mission-critical applications in government and the financial industry. Sybase IQ leverages the columnar approach and advanced compression techniques to rapidly produce insight across both structured and unstructured data sets.

On the cutting edge of database evolution, SAP now offers a columnar database and computation engine that is entirely in-memory. SAP HANA™ is a flexible, multipurpose, in-memory data warehouse that combines SAP software components optimized on hardware provided and delivered by leading hardware vendors.8 SAP HANA is not merely a cache of a subset of relational database tables. Rather, the data warehouse is completely resident in RAM in columnar format, and the computational engine is resident in the same memory. Therefore, there is no need for disk storage except for backups and disaster recovery. The net result is speed. Simply put, SAP HANA is blazingly fast, up to 100,000 times faster than disk-based systems in many cases.

To enable powerful real-time analysis, the in-memory data warehouse performs the “slicing and dicing” of data and performs rigorous predictive analysis without any I/O from data storage, in the same memory space as the data. It can process the in-memory data using numerous algorithms in the HANA Predictive Analysis Library (PAL), including K-Means, KNN, C4.5 Decision Tree, Linear Regression, Apriori, Moving Averages, et cetera, and it also makes available more than 3,500 open-source algorithms from the R programming language (another nexus between open source and our COTS solutions).
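As a concrete picture of one technique on that list, the sketch below computes a simple moving average over data already held in memory. It is plain Python and does not call the HANA Predictive Analysis Library; the function name, window size, and sample readings are assumptions for illustration only.

```python
from collections import deque


def moving_average(values, window):
    """Return the mean of each consecutive run of `window` values."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # holds at most the last `window` values
    running_total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_total -= recent[0]  # value about to fall out of the window
        recent.append(value)
        running_total += value
        if len(recent) == window:
            averages.append(running_total / window)
    return averages


# Hypothetical in-memory series of readings.
readings = [10, 12, 14, 13, 15]
smoothed = moving_average(readings, 3)
```

Each value is touched once and nothing is read from storage during the pass, which is the property the in-memory approach relies on for fast analysis.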

8 By working collaboratively with Intel Corporation, SAP has enabled HANA to leverage the full capabilities of the Intel Nehalem chip set, on hardware available from leading hardware vendors that include the top manufacturers in the industry, such as Dell, IBM, Hewlett Packard, Cisco, Fujitsu, etc.


SAP proposes that customers tackle Big Data with a hybrid solution including Hadoop and an Analytical Data Warehouse. Rather than asking data scientists to write detailed analyses using MapReduce to perform analytical processes on entire data files, an innovative approach would be to use the MapReduce code to deliver an ETL (Extract, Transform and Load) of a large data domain to the in-memory, columnar data warehouse. The MapReduce process would sort and process the unstructured data, and then the data would be loaded into the in-memory data warehouse, using SAP Data Services, a click-and-drag graphical user-interface tool for mapping data.
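The extract, transform, and load flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the log format, field names, and in-memory `table` structure are hypothetical, and the code stands in for, rather than reproduces, a MapReduce job feeding SAP Data Services and an in-memory columnar warehouse.

```python
import re

# Assumed line format for illustration: "<timestamp> unit=<name> fuel=<percent>"
LINE_PATTERN = re.compile(r"(?P<timestamp>\S+) unit=(?P<unit>\w+) fuel=(?P<fuel>\d+)")

raw_lines = [
    "2012-08-01T06:00 unit=alpha fuel=80",
    "corrupted line without fields",
    "2012-08-01T07:00 unit=bravo fuel=55",
    "2012-08-01T08:00 unit=alpha fuel=62",
]


def extract(lines):
    """Yield the named fields of every line that matches the pattern."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:  # lines that do not match are skipped
            yield match.groupdict()


def transform(record):
    """Give each field a proper type so it can be analyzed later."""
    return {
        "timestamp": record["timestamp"],
        "unit": record["unit"],
        "fuel": int(record["fuel"]),
    }


def load(records):
    """Append records to a column-oriented, in-memory table."""
    table = {"timestamp": [], "unit": [], "fuel": []}
    for record in records:
        for name in table:
            table[name].append(record[name])
    return table


table = load(transform(record) for record in extract(raw_lines))

# Once loaded, a follow-up question is a cheap in-memory filter, not a new batch job.
alpha_fuel = [fuel for unit, fuel in zip(table["unit"], table["fuel"]) if unit == "alpha"]
```

The batch step runs once to structure the data; later questions, such as the `alpha_fuel` filter, run against the loaded columns without going back to the raw files.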

At this point, the large data domain needed for fast, real-time analysis will be on hand in a structured data warehouse, in columnar format, with metadata context. Users can create powerful analytical views of the data in a self-service web client environment (discussed in the next section). This will reduce the number of iterative cycles imposed on the data scientists to tune and re-tune the analyses as required by end users. Ultimately, it will serve to reduce the time lag between the warfighters’ analytical needs and their fulfillment.

The in-memory analytical data warehouse is uniquely qualified to be the target of the large result sets of the MapReduce process. It is optimized so that there is no bottleneck in reporting performance for even extremely large data sets coming from a MapReduce job. The in-memory solution means that the “seek time” response is much shorter, even when analyzing the very large data sets that may be the result of the Hadoop MapReduce script. In fact, performance benchmarks routinely show sub-second response times, even for complex queries of millions of records.


Step 4: Add Business Analytics

Big Data is meaningless without the ability to cull information at the right time and in the right format that materially impacts the mission.

SAP BusinessObjects business intelligence (BI) solutions can operate directly against the data stored in the columnar or in-memory repository. BI solutions provide business users with both an analytics and reporting framework for the data. They also allow end users to interface with existing applications and operational software such as Microsoft Office.

Powerful analytical capabilities can be exposed to end-users using the SAP BusinessObjects web-based user interface. This interface is designed to empower subject matter experts who are not IT personnel and who have no coding or scripting skills. Importantly, these users do not need to understand the structure of the underlying data store to create effective ad hoc queries and visualizations. They will not have to burden the data scientists with running additional MapReduce queries, as the data will already reside in the in-memory data warehouse.


Conclusion

The hybrid solution, combining both open-source and commercial technologies, can best solve the Big Data challenges faced by the defense and intelligence communities. End users, who are increasingly influenced by consumer applications, expect data to be provided to their fingertips with zero latency, and with the ability to proactively identify trends and conduct self-service discovery in a visually appealing environment. National security organizations will benefit from a hybrid solution that provides the ability to collect a staggering volume of data and analyze it with agility in real time, with lower costs and increased speed to insight.

But this is just the beginning. The in-memory computing platform for analyzing unstructured Big Data coming from Hadoop could also be the nexus for many other data sources in the DoD enterprise. Logistics, Order of Battle, Readiness, Force Generation, Human Resources, and Financial data all represent organizational functions that could be synthesized and powerfully analyzed using a similar in-memory approach, with real-time visibility.

Authors:

Bob Palmer, Senior Director

SAP National Security Services™ (SAP NS2™)

301.641.7785

Dan Dorchinsky, Client Director

SAP National Security Services™ (SAP NS2™)

301.693.9000

SAP Government Support and Services, Inc. (SAP GSS), a Delaware corporation, is a wholly owned, independent US subsidiary of SAP and does business as SAP National Security Services™ (SAP NS2™). SAP National Security Services™ and SAP NS2™ are trademarks owned by SAP GSS. SAP NS2 offers the combined power of enterprise applications, analytics, database, cloud and mobile software solutions from SAP and Sybase with specialized levels of security and support to meet the unique mission requirements of US national security and critical infrastructure customers. In addition to US national security customers, SAP NS2 also supports private companies such as defense contractors, telecom carriers, and major financial institutions that have specialized information assurance needs.


countries. Business Objects is an SAP Company. All other product and service names mentioned and associated logos displayed are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP® product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice. SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.