
TECH NOTE. Hadoop Alone Is Not Big Data



Twenty-one years ago, a year before the first web browser appeared, Walmart’s Teradata data warehouse exceeded a terabyte of data and kicked off a revolution in supply-chain analytics.

Today Hadoop is doing the same for demand-chain analytics. The question is, will we just add more zeros to our storage capacity this time or will we learn from our data warehouse infrastructure mistakes?

These mistakes include:

• data silos
• organizational silos
• confusing velocity with response time

DATA SILOS

A data silo is a system that has lots of inputs but few outputs. The Wikipedia page for “data warehouse” shows an architecture diagram with operational systems on the left, data marts on the right, and a “data vault” in the middle, but the third definition of “vault” at Merriam-Webster.com is “a burial chamber.” All too often, enterprise data warehouses have become data burial chambers, or perhaps, data hospice facilities: places where data goes to die.

To prevent this from happening to Hadoop systems, we need more techniques to get data out of the central data store to people and other systems. A few data marts just aren’t sufficient anymore for connecting with development partners, ad tech vendors, and the myriad customer touch-points available to retailers and brands. Data export techniques should cover a variety of performance characteristics so that the best technique can be used for each use case. Such techniques include:

• good ol’ batch FTP of flat files, XML files, and compact binary file formats such as Avro
• publish-subscribe messaging interfaces, AKA enterprise message busses, such as Kafka
• real-time REST APIs built on high-speed databases such as HBase and Voldemort
• OLAP and data visualization user interfaces for business analysts who aren’t data scientists, such as Pentaho, Tableau, and Simba for Excel.
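One way to read "the best technique for each use case" is as a lookup from a freshness requirement to an export path. The sketch below is only an illustration: the `ExportTechnique` class, the `pick_technique` helper, and every latency figure are assumptions made for this example, not measurements of FTP, Kafka, HBase, or Voldemort. It also assumes that slower paths are cheaper to run, and it leaves out OLAP and visualization tools because they serve people rather than systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportTechnique:
    name: str
    typical_latency_s: float  # assumed order of magnitude, not a benchmark


# Ordered slowest to fastest. All figures are illustrative assumptions.
TECHNIQUES = [
    ExportTechnique("batch file transfer", 24 * 3600),
    ExportTechnique("publish-subscribe message bus", 1.0),
    ExportTechnique("real-time REST API", 0.1),
]


def pick_technique(max_latency_s: float) -> ExportTechnique:
    """Return the slowest (assumed cheapest) technique within the latency budget."""
    for technique in TECHNIQUES:
        if technique.typical_latency_s <= max_latency_s:
            return technique
    raise ValueError(f"no technique meets a {max_latency_s} second budget")


if __name__ == "__main__":
    # Hypothetical use cases with their freshness budgets in seconds.
    for use_case, budget in [("weekly partner feed", 7 * 24 * 3600),
                             ("ad tech event stream", 5.0),
                             ("on-page recommendation", 0.1)]:
        print(use_case, "->", pick_technique(budget).name)
```

A weekly partner feed tolerates the batch path, while an on-page recommendation with a 0.1 second budget can only be served by the API path under these assumed figures.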

Let’s consider the last two techniques, real-time APIs and OLAP and data visualization interfaces, in more detail. First, “real-time” means different things to different people. Fifty milliseconds (1/20th of a second) is real-time for stock trading. Google found that an increase of 500 milliseconds (1/2 a second) in page load time decreased traffic by 20%, and Amazon found that even a 100 millisecond (1/10th of a second) increase in load time significantly decreased retail website revenue.¹

One-tenth of a second response time is a high bar for APIs to meet. To achieve it at the 95th percentile, retailers need multiple data centers per market so that shoppers always use a data center that is close by, thereby minimizing response times. In short, they need multiple front-end data centers for each Hadoop back-end data center.
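To make "at the 95th percentile" concrete, the sketch below checks a set of response times against the 100 millisecond bar using the nearest-rank percentile definition. The sample latencies and the `percentile` and `meets_budget` helpers are hypothetical and exist only for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_budget(latencies_ms, budget_ms=100.0, pct=95):
    """True if the pct-th percentile latency is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


# Hypothetical response times in milliseconds, invented for this example.
nearby_data_center = [40, 45, 50, 55, 60, 62, 65, 70, 80, 90]
distant_data_center = [90, 95, 100, 105, 110, 115, 120, 130, 150, 300]

if __name__ == "__main__":
    print("nearby meets 100 ms at p95:", meets_budget(nearby_data_center))
    print("distant meets 100 ms at p95:", meets_budget(distant_data_center))
```

Judging by the 95th percentile rather than the average means a few very slow responses are enough to fail the bar, which is why distance to the data center matters so much.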

Second, OLAP and data visualization are part of an exciting industry trend toward the “democratization of data,” where the goal is to enable people to access the data they need themselves, rather than routing queries through some central analytics department. Nike FuelBand, Fitbit, and 23andMe are examples of this trend in consumer products, and OLAP and data visualization are enabling technologies for business users. Democratization of data holds the promise of preventing another big data warehouse mistake from the past: organizational silos.

1 John Rauser (Amazon), “The impact of website performance on conversion,” June 8, 2004; Greg Linden (Amazon), “Make data useful,” http://www.scribd.com/doc/4970486/Make-Data-Useful-by-Greg-Linden-Amazoncom; see also Eric Schurman (Microsoft) and Jake Brutlag (Google),

Figure labels: Back-end Data Center; Front-end Data Center


ORGANIZATIONAL SILOS

An organizational silo, like a data silo, has lots of inputs but few outputs: it’s a people bottleneck. Too often, if business analysts wanted data they had to go to a central analytics team, wait in line, get the analytics team to understand their need, wait a few days for the results, realize that the results weren’t what they thought they’d asked for, and repeat the process until one side gave up. Then, when business analysts complained and asked why on earth it could take so long, analytics just said, “There’s a lot of math involved. You wouldn’t understand.”

Over the past 20 years, that situation has created a kind of analytics aristocracy that’s not very useful. If large companies can create such organizational silos with SQL, BI, and SAS, just imagine the kind of silos they’ll be able to create with the new technologies Hadoop, MapReduce, and R. Data democratization is the cure for organizational silos.


VELOCITY VS. RESPONSE TIME

The last data warehouse mistake we can avoid with Hadoop systems is confusing velocity with response time. Consider an analogy. Suppose you’re shipping a package from Los Angeles to San Francisco, but because of your shipper’s infrastructure, it goes through Memphis. If it takes 12 hours from LA to Memphis (1,800 miles) and 12 hours from Memphis to San Francisco (2,000 miles), that’s 3,800 miles in 24 hours, or 158 miles per hour. Pretty fast. However, if you cut out Memphis and go directly from LA to San Francisco (380 miles) in 12 hours, then that’s just 32 miles per hour: pretty slow. Yet the slower route gets the package delivered 12 hours earlier. The point is that velocity should be measured from the customer’s point of view, not the infrastructure’s, since infrastructure only exists to serve the customer.
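The arithmetic in the shipping analogy can be checked with a few lines of Python. The distances and durations come from the paragraph above. The function and variable names are made up for this sketch.

```python
def average_speed_mph(leg_miles, total_hours):
    """Average speed over all legs: total distance divided by total time."""
    return sum(leg_miles) / total_hours


# Figures taken from the analogy above.
via_memphis_hours = 24
direct_hours = 12

via_memphis_mph = average_speed_mph([1800, 2000], via_memphis_hours)
direct_mph = average_speed_mph([380], direct_hours)

if __name__ == "__main__":
    print(f"via Memphis: {via_memphis_mph:.0f} mph over {via_memphis_hours} hours")
    print(f"direct:      {direct_mph:.0f} mph over {direct_hours} hours")
    print(f"direct arrives {via_memphis_hours - direct_hours} hours earlier")
```

The route with the higher speed (velocity, in the analogy) has the worse response time, which is the only number the customer sees.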

The following diagram shows what used to be a typical data flow from a customer, through a data warehouse, and then back to the customer, where each of the eight steps was scheduled and run in batch. Even if each link is fast, the whole round trip is rather slow.

With cloud-based Hadoop systems we can simplify this and greatly reduce response time. Data is pushed directly from Hadoop to front-ends for use by real-time APIs, and to data marts for use by business analysts. Instead of customer attributes being updated daily, weekly, or quarterly, this architecture enables real-time updates, click by click.
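The gap between an eight-step batch chain and a direct push can be roughly sketched as follows. This is a simplified model, not a description of any particular system: it assumes each batch step runs on a fixed schedule, that data becomes ready at a uniformly random point in that schedule (so it waits half a period on average and a full period at worst), and that steps run strictly one after another. The step periods and run times are invented for illustration.

```python
def round_trip_hours(steps):
    """steps: list of (schedule_period_h, run_time_h) pairs, executed in sequence.

    Returns (average, worst_case) end-to-end latency in hours under the
    assumptions stated above.
    """
    average = sum(period / 2 + run for period, run in steps)
    worst_case = sum(period + run for period, run in steps)
    return average, worst_case


# Hypothetical: eight nightly batch steps, each taking 30 minutes to run.
batch_chain = [(24.0, 0.5)] * 8

# Hypothetical: two push hops with no scheduled wait, about 36 seconds each.
direct_push = [(0.0, 0.01)] * 2

if __name__ == "__main__":
    for label, steps in [("batch chain", batch_chain), ("direct push", direct_push)]:
        average, worst_case = round_trip_hours(steps)
        print(f"{label}: average {average:.2f} h, worst case {worst_case:.2f} h")
```

In this model each batch step takes only half an hour to run, yet the waits between schedules dominate the round trip, which is the sense in which fast links can still add up to a slow journey.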

Hadoop holds immense promise for adding many more zeros to our storage and analytics capacity, and for transforming companies to be more data driven. However, to reach its full potential we should avoid the mistakes of the past. Otherwise, we’re in for another twenty years of silos, aristocracies, and inadequate response times, or as aristocrats sometimes say, “different tree, same monkeys.”

Figure labels: Customer; Operational System; ETL; Staging Area; Operational Data Store; ETL; Data Mart; Hadoop; Front-ends; FTP; Data Mart; Message Bus


ABOUT RICHRELEVANCE

RichRelevance is the global leader in omni-channel personalization. More than 160 international companies use RichRelevance to turn data into actionable insight, which delivers the most relevant experience for consumers as they shop across web, store and mobile. RichRelevance drives more than one billion decisions every day, and has delivered over $10 billion in attributable sales to its clients, which include Target, Marks & Spencer and Priceminister. Recently, the company opened its cloud-based platform to allow clients to easily merge disparate data sources and build real-time applications tailored to their specific business needs. RichRelevance is headquartered in San Francisco and serves clients in 40 countries from 9 offices around the globe.

For more information, please visit www.richrelevance.com. © 2014 RichRelevance, Inc.
