Staying agile with Big Data

An Ovum white paper for Red Hat

Publication Date: 09 Sep 2014

Summary

Catalyst

As with any major technology project, organizations implementing Big Data face challenges with aligning business cases, selecting the right technology solutions, and mobilizing the right skillsets. Big Data projects add some unique challenges to the mix. With growing awareness of the value of data, organizations must innovate rapidly in the use of that data to preserve competitive edge. And, as they implement solutions, they must contend with a technology base that is a rapidly moving target. The open source technology development and delivery model has played a large part in fostering the rate of innovation in Big Data platforms and solutions. Faced with rapidly evolving business needs and technologies, it is critical that organizations preserve their agility and flexibility in choosing platforms and solutions, and in evolving their practices, to ensure that they derive tangible value from their Big Data investments.

Ovum view

Agility is the key to benefiting from the use of Big Data for operational excellence and business insight. Almost everything about Big Data, from the underlying technology platforms to analytic approaches and availability of data sources, is a moving target. For instance, the Hadoop platform is no longer synonymous with batch-style MapReduce processing; new alternatives are emerging for incorporating machine learning, search, graph processing, and real-time streaming operational decision support. Enabling technologies, such as interactive SQL and query federation, will allow Big Data analytics and operational decision support applications to integrate with existing enterprise applications. Because none of this will happen overnight, Ovum believes that Big Data implementation should be evolutionary so organizations can keep their future options open. Organizations should take an iterative approach to choosing data sources and analytic approaches, and should consider public or private cloud deployments, either as a strategy for rapid piloting or for production. When choosing technology suppliers for Big Data implementations, organizations should seek providers that allow them to keep their options open.

Key messages

- Agility is essential to Big Data implementations because the underlying platforms, technologies, data sources, and analytic approaches are evolving rapidly; agile approaches enable organizations to keep their options open and adopt emerging technologies quickly and effectively.
- Open source technology has been instrumental in enabling development of Big Data platforms and tools via the proven community development model.
- Open source can also enable agility by allowing the freedom of choice that is essential to successful Big Data implementations.

The big picture is getting bigger

The growing reach of data and compute

Data has always been important to enterprises – but the data that enterprises need to address competitive or operational imperatives has changed. It has always been important for enterprises to understand the data that their internal systems already transact, but increasingly, competitive advantage comes to organizations that can gain visibility or new insight from data that traditionally fell outside the domain of transaction systems and data warehouses. New data platforms and compute frameworks are bringing this data very much within the reach of operational and analytic applications. Innovations from Internet data centers are bringing the power of massive scale-out compute grids; petabyte-scale storage; and low-cost, high-bandwidth connectivity within the reach of enterprises. Together, these trends have made it possible for enterprises to extend their reach to millions of remotely connected devices that provide an operational window on the real world, and to connect hundreds or thousands of compute nodes to remove limits on compute capacity. And the emergence of the cloud has brought all of these capabilities within the budget of even small and midsize firms.

Yet the business problems remain familiar

Ovum defines Big Data as data that is not readily accommodated by traditional enterprise transaction systems and data warehousing platforms. Ovum has found that Big Data adoption has graduated from the early-adopter phase: starting with the Internet companies that created the open source communities that spawned innovations in data platforms and computing, the first wave of "mainstream" enterprise adoption came from digital media, telecom carriers, and financial services companies. More recently, Ovum has seen adoption from consumer goods companies, transportation and logistics providers, life sciences, and the public sector. The key is not finding "Big Data problems," but instead confronting business or operational challenges that may require new sources of data or analytic approaches. Not surprisingly, Ovum has found that the problems being addressed with Big Data are typically quite familiar; the most common use cases for Big Data in the enterprise center around Customer, Risk/Fraud/Security, Operations, and Enterprise Data Warehouse (EDW) optimization, as shown in Figure 1.

Figure 1. Big Data common use cases

Source: Ovum

Big Data extends the visibility and effectiveness of these use cases. For instance, customer-focused applications augment the traditional "Customer 360" transactional view with behavior-focused data from social networks and mobile devices. Risk mitigation may supplement existing data with external feeds of market-relevant events and related economic indicators for judging the degree of risk in executing a financial transaction, while operational efficiency can tap into the world of machine data to provide more granular views.

Real-time streaming of that machine data is one area where competing open source frameworks have emerged: Storm and Spark Streaming (the latter an extension of the Spark project). Where there are competing frameworks, there are rival vendors; Storm vs. Spark Streaming will intensify the proxy war between Hortonworks and Cloudera, and we expect them to differentiate on which streaming project to support. Other open source real-time streaming projects continue to emerge, the latest being Tigon, a technology jointly developed by Cask, a startup firm formerly known as Continuuity, and AT&T Labs. Potential IoT streaming applications will also draw competition from proprietary vendor streaming solutions.
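To make the streaming model concrete, below is a minimal word-count sketch using Spark Streaming's Python DStream API; the socket source, host, port, and batch interval are illustrative assumptions, and an equivalent Storm topology would express the same pipeline with spouts and bolts.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Process a continuous stream of text in 5-second micro-batches.
sc = SparkContext(appName="StreamingWordCountSketch")
ssc = StreamingContext(sc, 5)

# Illustrative source: lines arriving on a TCP socket (Kafka, Flume, and
# other connectors are also available for Spark Streaming).
lines = ssc.socketTextStream("localhost", 9999)

# Count words within each micro-batch and print each batch's result.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

The same micro-batch pipeline could later be pointed at a production message broker without changing the counting logic, which is the kind of flexibility the competing streaming frameworks are racing to provide.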

The need for agility

With business conditions constantly changing and technology rapidly evolving, enterprises must preserve their freedom of action: they cannot afford to get locked in by a specific approach to analyzing data, a specific technology architecture, or a specific vendor product stack. Enterprises implementing Big Data projects need agility.

Admittedly, agility has multiple meanings in the technology world, with different connotations for different audiences. For the business, agility means having the ability to readily embrace new approaches to addressing competitive challenges as the market changes. For IT, agility means having the freedom to plug and play different layers of the technology stack to avoid getting locked in to a particular architecture, solution, version, or vendor implementation. For Big Data projects specifically, it means having the ability to ingest any type of data source without having to change the underlying technology infrastructure.

Approach Big Data implementation iteratively

Big Data technologies and practices are still young; as underlying Big Data platforms mature, new approaches are emerging for tackling analytic problems and managing operational decision support. For instance, as the Hadoop platform matures, new processing frameworks are making the platform far more versatile. Gone are the days when Hadoop could only be used for batch-style MapReduce analytic runs. New frameworks such as Spark and Storm promise improved performance for machine learning, stream processing, graph processing, and more. There are multiple approaches for conducting interactive SQL query on Hadoop. Furthermore, with wider data sets and emerging open source frameworks such as Lucene indexing and the Solr and Elasticsearch engines, search offers the potential of becoming the next "killer" application for Big Data.
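As one illustration of interactive SQL on Hadoop, the sketch below uses Spark's Python DataFrame/SQL API (exposed through SparkSession in current Spark releases); the HDFS path, dataset, and column names are illustrative assumptions, and engines such as Hive, Impala, or Drill offer comparable SQL access by different means.

```python
from pyspark.sql import SparkSession

# Query semi-structured data that already lives in HDFS, interactively,
# without writing a separate batch MapReduce job. Path and schema are illustrative.
spark = SparkSession.builder.appName("InteractiveSQLSketch").getOrCreate()

events = spark.read.json("hdfs:///data/clickstream/2014/09/*.json")
events.createOrReplaceTempView("clickstream")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```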

In turn, with more data and new data sets at their disposal, organizations will need to take a step-by-step approach to understanding which data will be relevant and produce the greatest value. Identifying the right data sets will ultimately prove just as important as selecting the right analytic or operational processing approaches.

As such, organizations should keep their options open. Because Big Data implementation will be a journey, they should embrace agile strategies as they select their platforms, analytic approaches, and data sets.

The role of Open Source in triggering Big Data

Cost trends set the stage

While the impact of Moore's Law on computing is well understood, similar phenomena for data storage, bandwidth, and connectivity have changed the game. For instance, the declining cost and increasing capacity of hard drives have been especially significant, with prices dropping nearly 50% over the past five years while mainstream drive capacities have grown to as much as four terabytes. Similarly, Ethernet connectivity costs have been dropping, with 10-Gigabit connections starting to join 1-Gigabit interconnects as the norm for cluster deployments. These developments set the stage for making external human- and machine-generated data available, and for making Big Data computing affordable.

Open source spurs platform development

Open source has proven to be an efficient mechanism for introducing new foundational technologies. It taps the skills of highly diverse developer communities and delivers technology that is highly accessible because of its core business model. According to the most recent Future of Open Source survey, conducted by Black Duck Software in 2013, over 1 million unique open source projects are active today. The survey also reveals that most enterprises expect over half of all purchased software to be open source within five years.

This technology development model brought Linux to the enterprise. Open source technologies at multiple tiers of the stack have unlocked the power and scalability of commodity infrastructure through middleware such as the JBoss and Tomcat projects, and cloud computing has made technology and access to data affordable to even the smallest enterprises.

The same community development model produced Hadoop, whose core components include the following (a minimal MapReduce sketch follows the list):

- HDFS, Hadoop's core file system, based on the Google File System (GFS);
- MapReduce, a compute framework generalized by Yahoo from Google's MapReduce;
- HBase, Hadoop's database, based on Google BigTable and used at scale early on at Facebook;
- Hive, Hadoop's SQL-like metadata store and data warehousing infrastructure, developed at Facebook; and
- Pig, a data transformation scripting language developed at Yahoo.
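For readers unfamiliar with the batch model that HDFS and MapReduce embody, here is a minimal word-count sketch in the Hadoop Streaming style; the file name and invocation are illustrative, and on a real cluster the two phases would be supplied to Hadoop Streaming as the -mapper and -reducer scripts.

```python
#!/usr/bin/env python3
# wordcount.py - word count in the Hadoop Streaming style (illustrative).
# Local simulation of the two phases:
#   echo "to be or not to be" | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys
from itertools import groupby


def mapper(stream):
    # Map phase: emit one tab-separated (word, 1) pair per word.
    for line in stream:
        for word in line.split():
            print(f"{word}\t1")


def reducer(stream):
    # Reduce phase: input arrives sorted by key; sum the counts per word.
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if phase == "map" else reducer)(sys.stdin)
```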

Other open source innovations that have enriched the Big Data ecosystem include NoSQL data stores such as Cassandra, MongoDB, and Couchbase; search engines such as Solr and Elasticsearch; and processing frameworks such as Spark and Storm. The open source development model has become sufficiently mainstream that it has been embraced by virtually every major IT technology vendor.

Choosing the right technology partner

Because Big Data adoption should be an iterative process, organizations should choose technology partners that allow them to keep their options open. For instance, the availability of mobile data feeds may drive organizations that relied on offline analytics for Customer 360 targeting strategies to embrace real-time streaming analytics as well. Evolving security challenges may require organizations to embrace machine learning approaches to contend with issues that become moving targets. With new data sources constantly emerging, organizations must keep their options open on compute platforms and processing approaches for addressing competitive or operational challenges. They need partners that have the right mix of expertise and solutions spanning infrastructure, data integration, and analytics. They must also have the flexibility to plug and play the right technologies, solutions, and processing frameworks to support the problems they must address, without having to invest in an entire stack. "All or nothing" or "one size fits all" technology strategies will not work in the Big Data world.

Red Hat’s strategy

Red Hat has built its business around open source and today offers the market-leading implementation of Linux with Red Hat Enterprise Linux. Linux has become the de facto standard operating environment for Big Data implementations; not surprisingly, the vast majority of applications, tools, and frameworks for Big Data implementations run in this environment.

From its origins as the leading Linux provider, Red Hat has grown its mission to provide organizations with a wide array of software-defined services: comprehensive, flexible solutions for storing heterogeneous data, leveraging the elasticity of the cloud, and managing and deploying data and applications (see Figure 2). It has adopted a modular architecture that allows enterprises to start at any point in the technology stack and derive value without requiring specific underlying products. It is also designed to interoperate with existing IT infrastructure thanks to published APIs. Red Hat is very active in the Big Data technology community.

As shown in Figure 2, the bottom layer, encompassing Red Hat Enterprise Linux and Red Hat Storage Server, enables database and systems administrators to work with the tools of their choice. Red Hat provides freedom of choice in the data sources layer: because Red Hat is the leading enterprise Linux distribution provider, all major Big Data platform and data source providers certify their systems on Red Hat Enterprise Linux. That encompasses Hadoop, all major NoSQL databases, established data warehouse platforms, and emerging sources such as the new generation of in-memory platform providers, real-time data streaming engines, and more.

Red Hat Storage Server unifies the storage environment with a platform that allows Big Data sources to leverage commodity, server-based, scale-out storage infrastructure, with access from a wide variety of industry-standard interfaces including POSIX-compliant, object, and HDFS-compatible interfaces. Red Hat Storage can also be deployed on physical, virtual, or cloud infrastructure. By supporting OpenStack, Red Hat customers gain access to a wide choice of cloud service providers. Red Hat's commitment to OpenStack is reflected by the fact that it has been the leading contributor of technology in the last two releases, and it offers the largest ecosystem of certified partners.

Red Hat JBoss Data Virtualization provides a layer where data can be integrated on the fly, thanks to an integrated modeling and execution environment for transforming and combining data across heterogeneous sources, and support for real-time data access and provisioning from legacy, SQL, NoSQL, and cloud data sources. JBoss Data Grid provides a high-performance, in-memory engine for IOPS-intensive, data-driven applications. It complements SQL transaction databases with a distributed caching layer that avoids I/O bottlenecks, while providing elastic scalability that can deal with sudden bursts or fluctuations in workload. JBoss BRMS provides business event and decision management for applications that are rules- or event-driven, and supports an agile, iterative approach to developing and deploying applications whose rules of engagement change rapidly. Finally, OpenShift provides an open source cloud Platform-as-a-Service (PaaS) tier where developers can code, test, and deploy Big Data applications quickly. Businesses can continue to use their favorite analytics applications, such as reporting tools, dashboards, and third-party analytics suites, enriched with the additional business insights mined via the middleware and PaaS layers, without impacting their productivity.
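To illustrate what on-the-fly integration can look like from an analyst's desk, the sketch below issues a single SQL query against a virtual database over ODBC; the DSN, credentials, schemas, and table names are hypothetical placeholders, and the point is simply that one query can span sources that physically live in different systems.

```python
import pyodbc

# Connect to a virtual database exposed by the data virtualization layer
# over ODBC. DSN, user, and password are hypothetical placeholders.
conn = pyodbc.connect("DSN=customer_vdb;UID=analyst;PWD=changeme")
cursor = conn.cursor()

# One SQL statement joins records that physically live in different systems
# (for example, a relational warehouse and a store of social sentiment data).
cursor.execute("""
    SELECT c.customer_id, c.segment, s.sentiment_score
    FROM warehouse.customers AS c
    JOIN social.sentiment AS s ON s.customer_id = c.customer_id
    WHERE s.sentiment_score < 0.2
""")
for row in cursor.fetchall():
    print(row.customer_id, row.segment, row.sentiment_score)

conn.close()
```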

Recommendations for enterprises

The explosion of data and the emergence of new capabilities for harnessing off-the-shelf infrastructure to process that data have made mastering Big Data a front-burner issue for many organizations. Big Data can enhance how organizations address core challenges such as optimizing customer interaction and operational efficiency, while improving security and reducing the level of risk or the incidence of fraud. However, Big Data is a moving target in many ways: new data sources are constantly emerging; new techniques, tools, and applications are becoming available to analyze the data; and new data platforms are emerging that provide new options for managing data. Furthermore, best practices and skills for addressing Big Data challenges are at early stages of maturity.

Consequently, organizations cannot afford to lock themselves into a single platform or approach for analyzing Big Data. They must position themselves to take advantage of new platforms, tools, or applications as they emerge, and they must be able to iterate their approach to problem solving as new best practices emerge. As such, agility and freedom of choice must be the watchwords. For technology developers, open source has played a pivotal role in the development and delivery of technology because it provides an agile mechanism for the community to innovate. It provides not only the business model that makes technology affordable to enterprises, but also the freedom of choice that enables organizations to remain agile in implementing fast-evolving technology.

Appendix

Author

Ovum Consulting

We hope that this analysis will help you make informed and imaginative business decisions. If you have further requirements, Ovum’s consulting team may be able to help you. For more information about Ovum’s consulting capabilities, please contact us directly at consulting@ovum.com.

Copyright notice and disclaimer

The contents of this product are protected by international copyright laws, database rights and other intellectual property rights. The owner of these rights is Informa Telecoms and Media Limited, our affiliates or other third party licensors. All product and company names and logos contained within or appearing on this product are the trademarks, service marks or trading names of their respective owners, including Informa Telecoms and Media Limited. This product may not be copied, reproduced, distributed or transmitted in any form or by any means without the prior permission of Informa Telecoms and Media Limited.

Whilst reasonable efforts have been made to ensure that the information and content of this product was correct as at the date of first publication, neither Informa Telecoms and Media Limited nor any person engaged or employed by Informa Telecoms and Media Limited accepts any liability for any errors, omissions or other inaccuracies. Readers should independently verify any facts and figures as no liability can be accepted in this regard - readers assume full responsibility and risk accordingly for their use of such information and content.

Any views and/or opinions expressed in this product by individual authors or contributors are their personal views and/or opinions and do not necessarily reflect the views and/or opinions of Informa Telecoms and Media Limited.

CONTACT US
www.ovum.com
askananalyst@ovum.com

INTERNATIONAL OFFICES
Beijing, Dubai, Hong Kong, Hyderabad, Johannesburg, London, Melbourne, New York, San Francisco, Sao Paulo, Tokyo
