Ten Mistakes to Avoid in Hadoop Implementations

By Krish Krishnan

EXCLUSIVELY FOR TDWI PREMIUM MEMBERS

FOREWORD

Data management and analytics are foundational requirements for creating, managing, and executing a successful business. From an infrastructure perspective, however, it is a struggle to build an integrated data platform that can support the information architecture required by an enterprise data repository and analytics hub.

In the past decade, we have seen a successful set of distributed processing architectures, including those of Google and Nutch, that inspired us to build a distributed data processing architecture with Hadoop and its ecosystem of projects. Enterprises have explored Hadoop since 2009, and many start-ups are now focusing on that ecosystem.

Today, this infrastructure distribution is being implemented as the enterprise hub for all data; some implementations are successful, but many others are abysmal failures. Why do so many fail? Where did they go wrong? How do we identify and avoid the mistakes? When inspecting failures and listening to companies and teams, we see that fundamental steps have been missed or ignored, including end-user management, data security, performance tuning, infrastructure configuration, and sizing. From the Hadoop infrastructure perspective, simply applying workarounds to implementations doesn’t work.

In this Ten Mistakes to Avoid, we identify the mistakes with the most negative impact on Hadoop implementations and recommend solutions you can apply to your own environment.

ABOUT THE AUTHOR

Krish Krishnan is the founder and president of Sixth Sense Advisors Inc., a management consulting and independent technology analyst firm based in Chicago. He is a recognized expert worldwide in the strategy, architecture, and implementation of big data, text analytics, and high-performance data warehousing, and a recognized author of books, articles, white papers, and e-books. Krish teaches at TDWI events and speaks at many technology conferences.

MISTAKE ONE:

The biggest blunder enterprises commit is assuming technology is a silver bullet. What must we understand to avoid this thinking? Let’s examine the enterprise information and data management landscape.

Integrating all data from internal and external data sets drives our need for data storage and analysis platforms that can handle many data formats delivered at different times and speeds. Organizations have not understood the maturity of the platform and therefore have not been able to align their business requirements to its processing capability. From an enterprise perspective, data management includes acquisition, quality, integration, and analytics, but these are not the requirements that drive successful Internet-based companies that are contributing to the Hadoop ecosystem. Their foundational requirement is to collect data of all types, sizes, and states into a platform that can be used to discover and tag relationships between data elements and analyze and store them in a structured database for analytics. To successfully adopt Hadoop, consider:

• Hadoop is a foundational platform for data integration, analysis, and processing, and is a good enterprise option, but do not get carried away by others’ success.

• Your enterprise must understand its maturity as a data processor, the maturity of its Hadoop ecosystem, and your road map of initiatives from a Hadoop perspective.

• To implement a program with Hadoop, change how you execute an initiative. Business teams must own the program, manage the data governance experts, and drive the change. Those with data processing skills should form an independent team that will align with business teams to create the processing rules and architecture. IT will own the execution process and assist business and data processing teams.

By determining the business requirements that will drive your project, deciding how you’ll align those requirements across business and technology groups, and understanding your cost, governance, and skill goals, you will get better at implementing and integrating the technology.

MISTAKE TWO:

Hadoop’s “store first” feature should be avoided in any implementation. Organizations tend to listen to the claims that an enterprise should first acquire or store data and then worry about processing it. Internet-based companies have made this a successful entry/exit criterion because they need to capture data quickly, tag it, and provide it for end-user consumption in search, mashup reports, and analytical dashboards.

However, enterprises cannot afford to use the same strategy because their business goals differ. “Store first” offers a shaky foundation because we have not done a good job of using existing data from analytics or BI initiatives yet. We don’t understand how to focus on a data-driven model; acquiring data without understanding how to align it with business goals and insights is foolhardy.

Take, for example, integrated data analysis from social media forums and campaign success analytics (such as a storyboard). Imagine a large retail company conducting campaigns for a new laundry detergent. The company varies its campaign strategies based on climate, socioeconomic background, age, and ethnicity. The campaign analytics show a slower-than-expected move by consumers to try the new product, but the analysis doesn’t explain the reasons or how to make the campaign more effective.

To learn more, the enterprise considers data from social media and creates a crawler to monitor the product or the brand (or both). The crawler collects data from social media and stores and tags the data in a Hadoop platform. Business subject matter experts (SMEs) and data analysts examine the collected and tagged data and discover that their new product has not matched price expectations in the East, Northeast, and Southwest regions. Sales have been higher in the upper Midwest and Northwest regions but are far below expectations in the lower Midwest and Southeast.

This data provides some initial insights, and the team decides to conduct full text analytics while adjusting the campaign by region to better match consumer needs. Repeated attempts start producing improvements, and a few months later the campaign analytics dashboard shows positive readings. Such behavior is the change we need when implementing a Hadoop platform.

MISTAKE THREE:

FAILURE TO PREVENT TRANSFORMATION OVERLOAD

When I wrote my first programs on the Hadoop platform as a developer, I was confronted with an issue many people face over time: data transformation. How do we apply the rules of data processing on a file-server-based system where data consistency is not an issue, considering that we can process the same data in different data nodes (storage areas) using different algorithms and programs for discovery and analysis? Why is transformation an issue on this platform? When does it become an issue, and how can we identify the symptoms and rectify the problem?

These questions immediately occurred to the architect side of my brain. In many of today’s Hadoop deployments, transformation overload and its associated problems with data processing during discovery and analysis persist.

From a Hadoop and technology perspective, our goal has been to create an infrastructure that can scale up and out on demand, whether that means to acquire and store data or process the data from our data discovery and analysis.

As an example: “@jdoe #acme #airlines long line, #noupgd platinum_mbr 14hr flt #disgusting #fail #nofly” is a tweet a business user wants to transform to “Brand: Acme Airlines, User: John Doe, Status: Platinum Member, Issue: Long Travel, No Upgrade, Long Lines to Check in, Sentiment: Negative, Outcome: Will not fly again with Acme. Date Posted, Airport, Retweets, SM_Status: Influencer.”

In these exercises, there is no mention of the schema or structured storage model that the data must be converted into at the end of discovery and analysis, when it moves from unstructured or semi-structured to structured form. Without establishing this need at data acquisition time, we will lose our way in the discovery phase, applying incorrect transformations and incorrect business semantics. The result is cycle after cycle of processing that yields no meaningful insights. Eventually, we stop processing the data because we have no idea what is needed. This is a mistake we must avoid.
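To make the point about fixing the target structure up front more concrete, here is a minimal Python sketch that turns the sample tweet into a structured record. The `TweetRecord` schema and the `RULES` lookup table are hypothetical stand-ins for business rules agreed on before acquisition; they are not part of any Hadoop API, and a real deployment would draw such rules from governed taxonomies.

```python
import re
from dataclasses import dataclass, field

# Hypothetical business rules: raw token -> (target field, normalized value).
RULES = {
    "#noupgd": ("issues", "No Upgrade"),
    "long line": ("issues", "Long Lines to Check in"),
    "14hr flt": ("issues", "Long Travel"),
    "platinum_mbr": ("status", "Platinum Member"),
    "#disgusting": ("sentiment", "Negative"),
    "#fail": ("sentiment", "Negative"),
    "#nofly": ("outcome", "Will not fly again"),
}


@dataclass
class TweetRecord:
    """Illustrative target schema, fixed before any data is acquired."""
    user: str = ""
    brand: str = ""
    status: str = ""
    issues: list = field(default_factory=list)
    sentiment: str = ""
    outcome: str = ""


def transform(tweet: str) -> TweetRecord:
    """Apply the rule table to one tweet and fill the target schema."""
    text = tweet.lower()
    record = TweetRecord()
    handle = re.search(r"@(\w+)", text)
    if handle:
        record.user = handle.group(1)
    # Assumed brand rule: the two hashtags together identify Acme Airlines.
    if "#acme" in text and "#airlines" in text:
        record.brand = "Acme Airlines"
    for token, (target, value) in RULES.items():
        if token not in text:
            continue
        if target == "issues":
            record.issues.append(value)
        else:
            setattr(record, target, value)
    return record


tweet = ("@jdoe #acme #airlines long line, #noupgd platinum_mbr "
         "14hr flt #disgusting #fail #nofly")
record = transform(tweet)
print(record)
```

Because the schema exists before the first rule is written, every transformation has a named destination field, which is the discipline the paragraph above argues for.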

MISTAKE FOUR:

Executive sponsorship is a sensitive subject but must be addressed, especially in the context of Hadoop implementation. We are not talking about the 1980s, when CxOs attended conferences and returned with specific thoughts on the competition, the marketplace, enterprise alignment, and customer life cycle management. Today, we have evolved to social and collaborative data-driven operations where the entire enterprise is transparent in all of its activities. Competitive research has taken on a whole new meaning and involves direct access to any prospects and customers over all channels, including social media. This evolution has created opportunities for innovation to be driven both top-down and bottom-up.

For this new mode of operation to be successful, especially with our need for transparency and innovation, we need new attitudes at the executive levels. The enterprise must become transparent to the appropriate managers and business users; they must understand the business strategy and know the enterprise’s competitors. This provides focus for the programs in the areas of competitive research, social media analytics, and campaign management. Changes in these focus areas can be measured and presented to executives for discussion with colleagues at their level and above.

Another change in executive support is related to processing data (analysis) in the Hadoop initiative. In the new world of data discovery with large data sets, business users as well as business and data analysts must create a discovery road map with search and semantic workflows. IT must support their implementation and manage the necessary technology integration and usage. When IT is the system integrator, not the solution provider, executive support and direction will need to be clear to avoid potential issues.

How do organizations succeed? It is a rough path for anyone to tread. We recommend that organizations wishing to engage in a Hadoop program create a set of measurable executive sponsorship goals. These goals are not ROI or financial goals, but rather goals to gain competitive insights, understand market alignment, visualize social media analytics, and create a change in business strategy to emerge as a visionary leader.

MISTAKE FIVE:

FAILURE TO ESTABLISH GOALS

In today’s world, we often execute on an outcomes-based approach. For example, we start our day with a quick look at e-mail and our calendar, then decide which efforts take priority. This is easy to manage for an individual, but a large enterprise must consider payroll processing, for example, with all the inputs being updated until the last minute before beginning to process the cycle for each month. This approach is not easy to implement across all aspects of data processing. However, if we do not establish a key set of goals, our program will fail. For payroll processing, the goals are to ensure all employees are paid and that their earned and used hours of vacation, 401k, and other benefits are calculated accurately.

The new world of data management in any enterprise will evolve from its current architecture of the data warehouse to a new model that will include Hadoop, NoSQL, in-memory, and cloud technologies along with the data warehouse, master data management, and metadata programs. This new world has many terms: “logical data warehouse,” “unstructured data warehouse,” and “next-generation data warehouse.”

The overall ecosystem will process data through acquisition, discovery, analysis, and transformation; integrate structured data from the current-state data warehouse, data marts, and other enterprise systems; and then present an integrated set of outcomes for analytics and BI processing. Analytics can also be executed on any part of this ecosystem as needed.

When implementing Hadoop as a platform, we fail in the integration process for two reasons. The first is that we do not understand the entire data life cycle as we start the acquisition process; as a result, we end up with transformation overload (as discussed in Mistake Three). The second reason is that we fail to establish desired outcomes from the data discovery and analysis processes, which leaves the data transformation in a mismatched state, resulting in chaos and failure.

How can we avoid this situation? There are two distinct steps. The first is in the proof-of-concept stage where we can establish outcomes from the discovery and analysis phase. The second is where we start the Hadoop program and get the business SMEs and data analysts together to create a high-level set of outcomes from discovery to analytics phases. Associate KPIs at each instance as desired.

MISTAKE SIX:

IMPROPER INFRASTRUCTURE PLANNING

Why does infrastructure planning become a mess? One of the most important reasons is the lack of a standard configuration guide, which is partially due to the newness of the infrastructure. Also, one size does not fit all.

The problems we have seen with Hadoop are primarily those of memory, CPU, MapReduce, and YARN configuration and management. When configuring Hadoop, check several infrastructure issues that tend to overload memory management, especially with data and computing. When programs slow down and start to fail, we try to fix the error and not simply throw more resources at the problem. We make configuration mistakes with MapReduce and YARN that we need to revisit multiple times in order to strengthen the processing of data through the infrastructure. Unfortunately, to make memory specifications feasible, we overload the CPU cycles and harm performance.

For each proof of concept or implementation, the project team must work with the vendor to configure the selection of infrastructure for Hadoop. The team must outline the performance and scalability expectations for the infrastructure for the next five years. This will result in some initial over-configuration of the infrastructure, but once data acquisition and discovery start, the configuration can be fine-tuned to sustain performance with simultaneous loading and analysis of data.

Improper configuration can be handled, and memory crashes with YARN and MapReduce can be avoided. Processor overloading can be prevented by configuring the cycles of CPU per job, which can be managed by focusing on workflow configuration for the nodes and the task trackers associated with the actual job execution. Although this sounds like a complex scenario, job execution is easy to manage, and we anticipate the processor cycle to be tuned more than once in a Hadoop cluster.
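As a rough illustration of how memory and CPU settings interact, the following Python sketch derives a few per-node values from hypothetical hardware figures. The property names are standard YARN and MapReduce configuration keys; the node size, the reserved memory, the one-vcore-per-container rule, and the 0.8 heap fraction are assumptions chosen for the example, not vendor guidance.

```python
def plan_node(node_ram_mb, reserved_mb, container_mb, vcores, heap_fraction=0.8):
    """Derive per-node settings; every input is an illustrative assumption."""
    yarn_mb = node_ram_mb - reserved_mb          # memory left for YARN containers
    by_memory = yarn_mb // container_mb          # containers that fit in memory
    by_cpu = vcores                              # assumes one vcore per container
    heap_mb = int(container_mb * heap_fraction)  # JVM heap kept below the container limit
    return {
        "yarn.nodemanager.resource.memory-mb": yarn_mb,
        "yarn.nodemanager.resource.cpu-vcores": vcores,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
        # Derived figures for the discussion, not Hadoop properties:
        "containers_by_memory": by_memory,
        "containers_by_cpu": by_cpu,
        "containers_per_node": min(by_memory, by_cpu),
    }


# Hypothetical worker node: 64 GB RAM, 8 GB reserved for the OS and daemons.
plan = plan_node(node_ram_mb=65536, reserved_mb=8192, container_mb=2048, vcores=16)
for name, value in plan.items():
    print(f"{name} = {value}")
```

With these assumed numbers, memory alone would admit 28 containers while the 16 vcores support only 16, so sizing by memory alone would oversubscribe the CPU, which is the kind of imbalance described above.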

MISTAKE SEVEN:

FAILURE TO PROPERLY CONFIGURE AND MANAGE SEMANTIC DATA

Many organizations forget semantic data and its emergence as a metadata and taxonomy integrator as they begin implementing Hadoop projects.

Data discovery is an important step for every Hadoop initiative because it is the first step in the next generation of data processing. In the data discovery process, the goal is to identify the data, select tags, and create a tag index that includes the metadata, the line of occurrence, the context of its occurrence, and the number of occurrences in the current process. To complete this step successfully, we must use a combination of metadata, taxonomies, semantic workflows, databases, and business-rule engines. This extended set of metadata elements is needed to create a robust data map as we discover the data without losing sight of any piece of the information or leading the team into chaos with bad data maps. We must configure the metadata and semantic libraries with the appropriate set of business rules and context rules applied, which will process the data with the closest and most correct matches.
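The tag index described above can be sketched as a small in-memory structure. This is a simplified illustration in plain Python, not a Hadoop component: the `build_tag_index` function, the sample comments, and the `category` metadata are all hypothetical, and context is approximated by a few words on either side of the tag.

```python
def build_tag_index(lines, tags, window=2):
    """Record metadata, line of occurrence, context, and count for each tag."""
    index = {}
    for line_no, line in enumerate(lines, start=1):
        words = line.lower().split()
        for pos, word in enumerate(words):
            token = word.strip(".,;:!?\"'()")
            if token not in tags:
                continue
            entry = index.setdefault(
                token, {"metadata": tags[token], "count": 0, "occurrences": []}
            )
            # Context: up to `window` words before and after the tag.
            context = " ".join(words[max(0, pos - window): pos + window + 1])
            entry["count"] += 1
            entry["occurrences"].append({"line": line_no, "context": context})
    return index


# Hypothetical input: three social media comments and two business tags.
comments = [
    "The new detergent price is too high.",
    "Great scent, but the price keeps me away.",
    "Detergent works well in cold water.",
]
tags = {
    "price": {"category": "pricing"},
    "detergent": {"category": "product"},
}
tag_index = build_tag_index(comments, tags)
for tag, entry in tag_index.items():
    print(tag, entry["count"], [o["line"] for o in entry["occurrences"]])
```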

Another area of semantic technology is machine learning. By applying semantic libraries to aid in machine learning:

• Data mapping and linking are clean with the right metadata, and data lineage is clearer to process

• Data duplication is tagged and removed

• Stop-words processing for text analytics is easier

• Stemming of any text data can use semantic libraries

• Tagging is accomplished easily

These advantages make automating machine learning simple, and we can develop a smart algorithm from simple blocks into a large module for execution. Today we see machine learning in healthcare, insurance, and credit card fraud analytics.
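A toy Python example may help show how stop-word removal, stemming, and duplicate tagging fit together. The stop-word set and the suffix-stripping `stem` function are deliberately naive assumptions made for this sketch; they stand in for a real semantic library and are not a production stemmer.

```python
# Illustrative stop-word list; a semantic library would supply a governed one.
STOP_WORDS = {"the", "is", "a", "and", "it", "this", "was"}
# Naive suffix list, checked in order; not a linguistic stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return tuple(stem(t) for t in tokens if t and t not in STOP_WORDS)


def deduplicate(docs):
    """Keep the first document for each normalized form."""
    seen, unique = set(), []
    for doc in docs:
        key = normalize(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    "The flight was delayed.",
    "Flights delayed!",
    "Upgrade denied and the line is long.",
]
unique_docs = deduplicate(docs)
print(unique_docs)
```

The first two documents reduce to the same normalized form, so the second is treated as a duplicate; whether that is the right call in practice depends on the business rules behind the semantic library.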

To choose the best strategy in this architecture and avoid the chaos of incorrect semantic integration:

1. Create a data-discovery strategy road map for your Hadoop implementation

2. Conduct a proof-of-concept exercise of data discovery using semantic libraries

3. Ask business user teams to identify the most appropriate semantic libraries to add to the platform

MISTAKE EIGHT:

A fundamental issue that has hampered many Hadoop implementations is related to MapReduce programming. In the world of Hadoop systems, MapReduce is the lowest grain of execution of programs, whether you execute Hive queries, Pig scripts, or Java or Python code. The confusion starts with how to configure the MapReduce processing ecosystem. Let us understand the flow of MapReduce processing in Hadoop:

MapReduce programming model

• Map: Process a key/value pair to generate intermediate key/value pairs

• Reduce: Merge all intermediate values associated with the same key

• Users implement an interface of two primary methods:

1. Map: (key1, val1) → (key2, val2)

2. Reduce: (key2, [val2]) → [val3]

Developers first determine the number of maps and reducers to create. For example, a text document will need tagging and then a set of maps based on number of tags. An Excel spreadsheet will need a set of maps matching the number of columns.

Execution process

• The Input phase generates a number of FileSplits from input files (one per Map task)

• The Map phase executes a user function to transform input key/value pairs into a new set of key/value pairs

• The Hadoop framework sorts and shuffles the key/value pairs to output nodes

• The Reduce phase combines all key/value pairs with the same key into new key/value pairs

• The Output phase writes the resulting pairs to files

• The Hadoop framework handles scheduling of tasks on the cluster

• The Hadoop framework handles recovery when a node fails
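The programming model and execution phases described above can be imitated in a few lines of plain Python. This is a single-process sketch for intuition only: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, each input string stands in for one split, and nothing here uses the Hadoop API or runs on a cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: (key1, val1) -> list of (key2, val2); here, (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_key, values):
    """Reduce: (key2, [val2]) -> [val3]; here, the total count."""
    return [sum(values)]


def run_job(splits):
    """Run map, shuffle/sort, and reduce in sequence on in-memory splits."""
    # Map phase: one map call per input split.
    intermediate = []
    for split_id, line in enumerate(splits):
        intermediate.extend(map_fn(split_id, line))
    # Shuffle and sort: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key, in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


result = run_job(["long line no upgrade", "long flight no upgrade"])
print(result)
```

In a real cluster the framework performs the grouping, scheduling, and failure recovery; the developer supplies only the two functions, which is why their design matters so much.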

This is where we as developers believe the framework exists to handle all the infrastructure management, memory allocation, parallelism, and distributed computing. The framework can certainly handle the tasks, but not all of them all the time. The question is: How do we know what to do and when to implement each of them? Here are several tips:

• Do not create a distributed processing approach where the objects in the Map and Reduce programs need to communicate repeatedly. This will create issues with object state preservation and its associated memory allocation.

• Try to process aggregations locally rather than distributed over the network. This step will provide greater scalability and improve performance, especially if you use partitions and combiners effectively to manage the MapReduce process.

• Try to use in-memory combiner mapping where possible to create a scalable process flow.

• Pairs and Stripes are useful in MapReduce to process text and semi-structured data. Mahout algorithms use these effectively. Look at the Apache Mahout website for additional design details.

• Design your MapReduce programs with a test data set that contains all the types, formats, and structures of the data, and identify tuning requirements. Start applying the steps as discussed to arrive at an optimal execution state.
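To illustrate the local-aggregation and pairs-versus-stripes tips, the sketch below counts word co-occurrences within a line in two ways. It is plain Python written for this example, not Mahout or Hadoop code, and the function names are hypothetical. The pairs mapper emits one record per co-occurrence, while the stripes mapper aggregates in memory and emits one record per distinct word.

```python
from collections import Counter, defaultdict


def map_pairs(line):
    """Pairs: emit ((word, neighbor), 1) for every co-occurrence in the line."""
    words = line.split()
    return [((w, n), 1)
            for i, w in enumerate(words)
            for j, n in enumerate(words) if i != j]


def map_stripes(line):
    """Stripes: aggregate locally, then emit one (word, Counter) per word."""
    words = line.split()
    stripes = defaultdict(Counter)
    for i, w in enumerate(words):
        for j, n in enumerate(words):
            if i != j:
                stripes[w][n] += 1
    return list(stripes.items())


def reduce_stripes(stripes):
    """Merge all stripes that share a key by adding their counts."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return total


lines = ["long line no upgrade", "long flight no upgrade"]
pairs_emitted = sum(len(map_pairs(line)) for line in lines)
stripes_emitted = sum(len(map_stripes(line)) for line in lines)

# Gather the stripes for the key "long" from both lines and merge them.
long_stripes = [s for line in lines for w, s in map_stripes(line) if w == "long"]
merged = reduce_stripes(long_stripes)
print(pairs_emitted, stripes_emitted, dict(merged))
```

On this small input the stripes approach emits 8 intermediate records where the pairs approach emits 24. The trade-off is that each stripe must fit in the mapper's memory, so test data that reflects real vocabulary sizes is needed before choosing between the two.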

MISTAKE NINE:

FAILING TO PROVIDE BIG DATA GOVERNANCE

Data governance has always been a sensitive subject in the world of data management. The first question is often whether data governance is appropriate for big data. We are still talking about “data,” so yes, we need a governance program no matter what kind of data we have: big, small, fast, wide, or deep—it doesn’t matter.

We need governance for this program because:

• Data and user security are still evolving in Hadoop.

• The data must be discovered and tagged before analysis; this can require several types of governance rules to be processed and applied to the data to get the correct results and eventually the appropriate analytic outcomes.

• The data is free form, and acquisition of this data requires strong governance on HCatalog and associated metadata processed and applied to the data.

• Executive sponsorship must be managed to successfully implement Hadoop. This can be executed as a part of your enterprise’s governance program.

• The use of taxonomies and ontologies, if not governed and managed, can cause a data processing overrun from acquisition to analytics on the Hadoop platform.

In current Hadoop implementations, we see issues and mistakes arising with governance. Further analysis of these issues reveals consistent ignorance about security, compression, metadata management, and integration of taxonomies and ontologies. When presented with a road map for big data implementations and the maturity models available today (TDWI, EMC) where one core area is governance, teams agreed that paying attention to data governance could have made the difference between success and failure.

Governance is critical for the overall success of all data-driven and data-oriented programs, including Hadoop implementations. For your implementation to succeed, we recommend that you create a road map and a governance maturity model that will guide you from today into the future.

MISTAKE TEN:

USING HADOOP AS AN ENTERPRISE DATA REPOSITORY

A classic mistake made by enterprises invested in Hadoop technology is treating the enterprise data lake, data hub, or data repository as the desired end state. Although it is common to store all enterprise data centrally to be consumed by the enterprise, this cannot be accomplished by one technology platform; it needs an ecosystem of components that can integrate and perform the desired functions. In this respect, Hadoop by itself is not a complete technology platform but a high-performing storage architecture with provisions for computing data at the same layers.

There are several reasons to avoid this end state. First, enterprises have a considerable amount of data in different formats. The data has different grains of accuracy, different velocity, and different veracity. The data needs multiple metadata and master-data sets to process. Although Hadoop has been publicly acknowledged as a platform that can handle and satisfy these requirements, in real-life implementations questions remain about linking and securing the data and executing analytics on the data, none of which has been completely implementable in Hadoop. This is primarily because Hadoop is an ever-evolving environment and some of these features have yet to be developed. Hadoop is a technology platform and not a solution.

Before you jump into the enterprise data hub or data lake hype, understand the requirements of the data and how it will be used, or you will be implementing a version of the enterprise swamp. To make sure of your approach, you do not need a team of data scientists but rather a team of enterprise architects, business SMEs, and data leaders across your organization to establish and validate your Hadoop requirements and detail the outcomes you expect. When you document all your viewpoints, outcomes, and expectations, you can decide how you will bring Hadoop in as an enterprise platform and use it to answer your business questions.

TDWI is your source for in-depth education and research on all things data. For 20 years, TDWI has been helping data professionals get smarter so the companies they work for can innovate and grow faster. TDWI provides individuals and teams with a comprehensive portfolio of business and technical education and research to acquire the knowledge and skills they need, when and where they need them. The in-depth, best-practices-based information TDWI offers can be quickly applied to develop world-class talent across your organization’s business and IT functions to enhance analytical, data-driven decision making and performance. TDWI advances the art and science of realizing business value from data by providing an objective forum where industry experts, solution providers, and practitioners can explore and enhance data competencies, practices, and technologies. TDWI offers five major conferences, topical seminars, onsite education, a worldwide membership program, business intelligence certification, live Webinars, resourceful publications, industry news, an in-depth research program, and a comprehensive website: tdwi.org.