COLLABORATIVE WHITEPAPER SERIES
The Fast-Track to Hands-On Understanding of Big Data Technology
Big Data can intimidate even the most seasoned IT professional. It is not simply the charged nature of the term “Big” that is ominous; the underlying technology is also app-centric in a very open-source way. If, like most professionals, you do not have a working knowledge of MapReduce, JSON, Hive, or Flume, diving into the deep end of the Big Data technology pool may seem like a time-consuming process. Even if you possess these skill sets, launching a Hadoop environment and deploying an application that streams Twitter data into it, in a way that is accessible through standard ODBC tools, would seem like a task measured in weeks, not days.
It may surprise people looking to get hands-on with Big Data technology that they can do so in short order: with the right approach, you can stream live social data to your own Hadoop cluster and report on the information through Excel in less than one day. This whitepaper series offers a “fast track” approach to creating your personal Big Data lab environment powered by Apache Hadoop. This first part in the series is aimed at IT professionals with a passing interest in Big Data. It will:
• Give reasons to explore the world of Big Data and the Big Data skills gap.
• Present a practical, lightweight approach to getting hands-on with Big Data technology.
• Describe the use case and the supporting technical components in more detail.
• Provide step-by-step instructions for setting up the lab environment, and direct readers to Cloudera’s streaming Twitter agent tutorial.
• Enhance Cloudera’s tutorial in the following ways:
- Make the tutorial real-time.
- Provide steps to establish ODBC connectivity and to execute Cloudera’s sample queries in Excel.
- Configure and register libraries at an overall environment level.
- Provide sample code and troubleshooting tips.
I. A reason to explore the universe of Big Data
Before beginning this exercise, the first question IT professionals may ask is why one would care to explore the universe of Big Data. The fact is that the universe of data is expanding at an accelerating rate, and increasingly the growth is driven by sources of unstructured or machine-generated Big Data (e.g., social media, blogs, the “Internet of Things”). The latest IDC Digital Universe Study reveals an explosion of stored information: more than 2.8 zettabytes, or roughly 3 billion terabytes, of information was created and replicated in 2012 alone. To put this number in perspective, 95.07 terabytes of information was produced per second over the course of the year.
Organizations are increasingly aware that this unrefined data represents an opportunity to gain valuable insight into areas such as ongoing clinical research and financial risk monitoring. From a practical perspective, this means that business people will be asking IT for answers to questions that can be supported by sources of Big Data, and in some cases processed by Big Data technologies. A recent Harvard Business Review survey suggests that 85 percent of organizations had funded Big Data initiatives in process or in the planning stage, but the survey also reveals a severe gap in analytical skills: 70 percent of respondents describe finding qualified data scientists as “challenging” to “very difficult” [1]. Thus, Big Data introduces an opportunity to the business, but exposes a skills and technology gap for IT. This gap must be filled quickly; otherwise businesses will find themselves at a competitive disadvantage, and IT’s ability to support the business will be questioned.
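The per-second figure quoted in this section can be reproduced with a few lines of arithmetic. The sketch below is only a plausibility check. It assumes binary prefixes (1 zettabyte = 2^30 terabytes) and the 366 days of the leap year 2012; neither assumption is stated in the text, but together they reproduce both the "roughly 3 billion terabytes" and the 95.07 terabytes per second.

```python
# Assumptions (not stated in the paper): binary prefixes and a 366-day year.
ZETTABYTES_2012 = 2.8
TB_PER_ZB = 2 ** 30                    # 1 ZB = 2**30 TB with binary prefixes
SECONDS_IN_2012 = 366 * 24 * 60 * 60   # 2012 was a leap year


def terabytes_per_second(zettabytes: float, seconds: int = SECONDS_IN_2012) -> float:
    """Average data rate in TB/s for a yearly volume given in zettabytes."""
    return zettabytes * TB_PER_ZB / seconds


total_tb = ZETTABYTES_2012 * TB_PER_ZB
rate = terabytes_per_second(ZETTABYTES_2012)
print(f"{total_tb / 1e9:.2f} billion TB in total, {rate:.2f} TB per second")
```

With decimal prefixes (1 ZB = 10^9 TB) the same calculation gives a lower rate, so the quoted number depends on which convention is used.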
II. The right approach
If you are convinced that an understanding of Big Data is important to your business and IT initiatives, you need a practical, low-cost, and ultimately relevant approach to understanding the technology and the conventional use cases that resonate with the business. After all, IT resources are stretched thin, and many of us in IT who are new to the world of Big Data could spend weeks getting up to speed on the various options before taking the first step. For those of us with day jobs, there aren’t enough hours in the day to dissect the various Big Data technology players, or to build open source components that won’t necessarily prove anything from a technology or business point of view. Fortunately, the following game plan provides a universal use case as a starting point, and practical ways to get a lab environment running so that you can mobilize business sponsors and technical staff around Big Data capabilities in less than eight hours. For learning purposes, it makes sense to pursue a scenario that is fairly common across industries. In our case, we will stream social media (specifically tweets from Twitter) to a lab environment, and then report on the data through everyone’s favorite BI tool, Excel. The use case will be described in more detail later on. From a technical training perspective, the approach relies heavily on Apache Hadoop.
Figure 1 lists the reasons why Hadoop is the preferred platform to learn Big Data and to implement this scenario:
There are many ways to deploy Apache Hadoop. Our example relies on Cloudera’s distribution of Apache Hadoop (CDH) running in a Linux VM image. Figure 2 lists key considerations and why CDH was used.
Figure 2: Why CDH was used
Consideration: Building the Hadoop environment from scratch as opposed to using CDH.
Direction and Rationale: We considered building our Hadoop environment from scratch through the Apache Hadoop projects. If your learning objectives include understanding what it takes to ensure compatibility of each Hadoop project, or if you need to tweak the source code, then you should include this step in the approach. Given the time commitment, it seemed more useful to take an existing distribution that ensured interoperability and compatibility of the projects.
Consideration: Using CDH over Hortonworks or MapR.
Direction and Rationale: Alternatives from Hortonworks and MapR were considered, including Microsoft’s HDInsight distribution, which uses Hortonworks. Ultimately, the example uses Cloudera’s software, support resources, and Twitter feed, all of which are available for download and general use. Cloudera also offers downloadable VM images that include a free edition of Cloudera Manager and all of the Apache Hadoop projects required by the scenario.
Consideration: Deploying Hadoop in the Cloud.
Direction and Rationale: Deploying the environment to the cloud was considered, and in some cases it may be preferred. We used an instance of Microsoft HDInsight running in Azure and would have pursued it at greater length, but the lease on the Azure instance expired and inquiries about extending the lease went unanswered.
The CDH stack (Figure 3) summarizes the core projects included with CDH; the projects relevant to the use case are captioned [2].
It should be noted that this is a learning exercise, not a performance benchmark. Thus, a single-node Hadoop cluster running inside a Linux VM is deemed sufficient for those of us wanting to learn Hadoop. If performance tuning is crucial to your learning objectives, a more robust environment would be required. The business use case would still be relevant, since streaming live social data can generate millions of transactions, depending on the keywords you have specified. The specifications of the VM image and the host machine are listed in the Appendix. Finally, there are ways to make this use case more comprehensive. For instance, once the streamed data is captured, you could use Apache Mahout to cluster and classify the data using the various algorithms available. Since much of the classification would depend on business input, it seems reasonable to take a first iteration through the use case as presented, and then proceed with next steps in concert with more involvement or direction from the business.
III. Use case and supporting Hadoop components
Streaming social media data is a fairly common use case for Big Data and applies across industries. Cloudera provides a tutorial that represents an implementation of this use case. This paper builds on Cloudera’s tutorial and extends it by making the data available in real time and reporting on the data in Excel. Using the approach described above and following the instructions, you will have tweets streaming into your Hadoop sandbox, reportable in Excel, in less than a business day.
Cloudera’s tutorial is documented thoroughly in a series of blog postings, and the source code is available on GitHub [4-7]. Figure 4 represents our version of the streaming Twitter tutorial. Major components are numbered and their purpose explained in Figure 4.
[Figure 3: The CDH stack (CDH, 100% open source). Layers shown: UI and Workflow; Batch Processing; Batch Compute; Resource Management and Coordination; Storage; Real-time Access and Compute; Metadata; Cloud Integration. Components shown: Sqoop, Flume, Fuse DFS, WebHDFS, HttpFS, ODBC & JDBC, Hue, Oozie, Pig, Whirr, Mahout, DataFu, MapReduce2, YARN, ZooKeeper, Impala, HDFS, HBase, Access, Metastore, MapReduce, Hive.]
Captions for the projects relevant to the use case:
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of datasets stored in HDFS. Hive provides a mechanism to project structure onto this data and query that data using a SQL-like language called HiveQL.
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie workflow jobs triggered by time (frequency) and data availability.
• Metastore captures the Hive metadata and is transparent to the overall application.
• ODBC and JDBC connectivity is supported for interfacing with other platforms like databases, BI, and data integration tools.
• MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Figure 4: Streaming social media use case and supporting technical components
Environment: CDH V4.1.1 single-node cluster running on a 64-bit Linux VM (VMware).
1. Twitter users generate tweets. Twitter users tweet about various topics. In this example, we want to capture in real time any tweet related to Big Data keywords such as “hadoop”, “big data”, “analytics”, and “mapreduce”.
2. Flume organizes keywords. A Flume agent runs on our CDH VM. It streams any tweet containing the Big Data keywords listed above to HDFS. It places the tweets in a folder structure organized by year, month, day, and hour, and creates a new subfolder as it rolls forward each time interval.
3. Oozie process runs Hive script. An Oozie workflow runs in the background. Once a new subdirectory is created by the Flume agent, the workflow executes a Hive script (Hive Q script: add hourly partition) that adds a new partition to the “tweets” table corresponding to the year, month, day, and hour of the directory.
4. JAR files are created in Hive. Two JARs are registered in Hive. The first (JSON SerDe) defines the JSON extensions for the Hive external table “tweets”. The second (custom MapReduce pathFilter) instructs MapReduce to exclude any TMP files created by Flume. Flume writes to a TMP file, and these TMP files cannot be accessed by MapReduce.
5. Hive aggregates data from Excel. Hive contains an external table named “tweets” that is modeled on the JSON message from Twitter. In addition to partitioning the table by year, month, day, and hour through the Q script executed within Oozie, Hive is used to aggregate tweets for Excel. The queries are written in HQL (Hive Query Language), are executed through ODBC, and produce the following reports: Big Data tweets by time zone and day; Top 15 Big Data hashtags; Top 10 retweeted users on Big Data topics.
6. Hive initiates MapReduce program. Hive translates any query that requires aggregation, sorting, or filtering into a MapReduce program. MapReduce grabs the data from HDFS and passes the result set to Hive.
IV. How to stream social data to Hadoop in less than a day
Now that we have established a rationale, an approach, and use case for learning Big Data, we can get started. Despite the many moving parts listed in the use case, you can have the streaming social media use case operational in your own lab environment in less than a working day.
The steps below describe how to make this happen, including any non-obvious instructions that the tutorial does not provide. Where appropriate, explanations are included so that the significant concepts and mechanics are understood and reinforced.
1. First and foremost, you need a CDH lab environment. Building and configuring this environment from the OS up could take time. Fortunately, Cloudera provides a VM image that is available for download with all of the necessary Hadoop projects pre-installed and pre-configured. You can download the VM from Cloudera’s website.
2. To run the lab environment, you will also need VMware Player. You can download and install VMware Player from the VMware website.
3. Verify you have sufficient resources to run the VM image on your host machine. Please refer to Cloudera’s system requirements, and the appendix for the host and guest machine specifications used in this example.
4. Start the VM. Once started, you can begin the Cloudera Twitter tutorial. For the most part, you can follow the instructions exactly as provided in the GitHub Tutorial instructions. The following instructions should be followed in addition to those provided by Cloudera, and the rationale for the amendments is also provided:
4.1. Unless you have a need to build the JARs from scratch, you should find the JARs referenced in the tutorial will already exist on the VM image provided by Cloudera. If you build the JARs from the source, you will probably need two days to get the tutorial operational.
4.2. Before starting, we found that GCC was missing from the VM image. To install it (it is required to install other libraries):
sudo su -
yum install gcc
4.3. When following the steps under “Configuring Flume”:
4.3.1. Step 3 – We had to manually create the flume-ng-agent file with the following contents:
# FLUME_AGENT_NAME=kings-river-flume
FLUME_AGENT_NAME=TwitterAgent
4.3.2. Step 4 – If you are not familiar with the details of your Twitter app, this step may cause confusion. All that is required is a Twitter account. Once you have a Twitter account, you need to register the Flume Twitter agent with Twitter so that Twitter has a record of your agent and can govern the various third parties that stream Twitter data.
4.3.2.1. To register your Twitter App, go to https://dev.twitter.com.
4.3.2.2. Sign in with your Twitter account.
4.3.2.3. Click “Create a new Application”.
4.3.2.5. Your new application will provide you with four security tokens that will be specified in the flume.conf file. These properties are highlighted below.
4.3.2.6. Using the values for the application properties highlighted above, enter the following parameters in flume.conf. If flume.conf does not exist on /etc/flume-ng/conf, please download it from the GitHub project:
TwitterAgent.sources.Twitter.consumerKey = <consumer_key_from_twitter>
TwitterAgent.sources.Twitter.consumerSecret = <consumer_secret_from_twitter>
TwitterAgent.sources.Twitter.accessToken = <access_token_from_twitter>
TwitterAgent.sources.Twitter.accessTokenSecret = <access_token_secret_from_twitter>
4.3.2.7. In flume.conf, modify the following parameter according to the keywords by which you want to filter tweets. Note that the default flume.conf provided by Cloudera misspelled “data scientist”; the correct spelling is used below:
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientist, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
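To get a feel for which tweets a keyword list like this would capture, the sketch below approximates keyword filtering locally. It is a simplification under stated assumptions, not the logic used by Twitter or by the Flume agent: we assume that commas separate alternative phrases, that a phrase matches when all of its words appear somewhere in the tweet, and that case is ignored. The function names and sample tweets are hypothetical.

```python
import re


def parse_keywords(setting: str):
    """Split a comma-separated keywords setting into lowercase phrases."""
    return [phrase.strip().lower() for phrase in setting.split(",") if phrase.strip()]


def tweet_matches(text: str, phrases) -> bool:
    """Assumed rule: any phrase matches if all of its words occur in the tweet."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(all(term in words for term in phrase.split()) for phrase in phrases)


KEYWORDS = "hadoop, big data, analytics, bigdata, data scientist, mapreduce"
phrases = parse_keywords(KEYWORDS)

samples = [
    "Learning #Hadoop this weekend",   # matches the single word "hadoop"
    "Big plans for our data team",     # contains both "big" and "data"
    "Coffee first, then email",        # no keyword present
]
results = [tweet_matches(s, phrases) for s in samples]
print(results)
```

Under this assumed rule, the second sample shows why multi-word keywords can pull in tweets that are not really about Big Data; narrowing the list reduces both the noise and the volume.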
4.3.2.8. At this point you probably realize the importance of flume.conf. In addition to the details of the Twitter app and the keywords, it contains the following parameters, which govern how large a Flume file grows before Flume rolls to a new file. These parameters are significant because changing them also changes the latency of the tweets. The complete listing of the Flume parameters can be found on Cloudera’s website.
# Number of events written to file before it is flushed to HDFS
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
# File size to trigger roll (in bytes)
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
# Number of events written to file before it is rolled
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
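The effect of these parameters on latency can be estimated with simple arithmetic. The sketch below is a rough model only: it assumes a steady tweet rate (the 5 tweets per second is a made-up figure) and considers only the event-count thresholds shown above, ignoring any other trigger that may also flush or roll a file.

```python
def seconds_until(event_threshold: int, tweets_per_second: float) -> float:
    """Time needed to accumulate event_threshold events at a steady rate."""
    if tweets_per_second <= 0:
        raise ValueError("tweets_per_second must be positive")
    return event_threshold / tweets_per_second


BATCH_SIZE = 1000    # hdfs.batchSize from flume.conf
ROLL_COUNT = 10000   # hdfs.rollCount from flume.conf
ASSUMED_RATE = 5.0   # hypothetical tweets per second matching the keywords

flush_seconds = seconds_until(BATCH_SIZE, ASSUMED_RATE)
roll_seconds = seconds_until(ROLL_COUNT, ASSUMED_RATE)
print(f"flush about every {flush_seconds:.0f} s, "
      f"roll about every {roll_seconds / 60:.1f} min")
```

Lowering either threshold shortens the corresponding delay at the cost of more frequent writes and more, smaller files.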
4.3.2.9. Place flume.conf under /etc/flume-ng/conf as instructed in Step 4.
4.4. When following the steps under “Setting up Hive”:
4.4.1. Copy the hive-serdes-1.0-SNAPSHOT.jar from Step 1 to /usr/lib/hadoop:
cp hive-serdes-1.0-SNAPSHOT.jar /usr/lib/hadoop
4.4.2. After Step 4, create a new Java package using the following steps. No Java programming knowledge is required; simply follow these instructions. It is necessary to create this Java class and package it in a JAR so that you can exclude the temporary Flume files created as tweets are streamed to HDFS [8].
mkdir com
mkdir com/twitter
mkdir com/twitter/util
export CLASSPATH=/usr/lib/hadoop/hadoop-common-2.0.0-
vi com/twitter/util/FileFilterExcludeTmpFiles.java
Copy the Java source code in the appendix into the file and save it.
javac com/twitter/util/FileFilterExcludeTmpFiles.java
jar cf TwitterUtil.jar com
cp TwitterUtil.jar /usr/lib/hadoop
4.4.3. Edit the file /etc/hive/conf/hive-site.xml and add the following properties. The first property ensures that you won’t have to add the JSON SerDe package, and the new custom package that excludes Flume temporary files, in each Hive session; both become part of the overall Hive configuration that is available to every Hive session. The second property tells MapReduce the class name and location of the new Java class that we created and compiled above.
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
<property>
<name>mapred.input.pathFilter.class</name>
<value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>
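Small typing mistakes in these two properties (a stray space in a JAR path, a missing dot in the class name) are easy to make and produce errors that are hard to trace. The sketch below is an optional sanity check, not part of the tutorial: it parses an inline copy of the two properties and applies a few heuristic rules. The helper names and the rules themselves are our own assumptions.

```python
import xml.etree.ElementTree as ET

# Inline copy of the two properties from step 4.4.3, wrapped in a root element.
SNIPPET = """
<configuration>
  <property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
  </property>
  <property>
    <name>mapred.input.pathFilter.class</name>
    <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
  </property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Return a {name: value} mapping of the <property> elements."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): (p.findtext("value") or "").strip()
            for p in root.iter("property")}


def check(props: dict) -> list:
    """Heuristic checks; returns a list of human-readable problems."""
    problems = []
    for jar in props.get("hive.aux.jars.path", "").split(","):
        if not jar.startswith("file:///") or not jar.endswith(".jar") or " " in jar:
            problems.append(f"suspicious JAR entry: {jar!r}")
    package, _, simple_name = props.get("mapred.input.pathFilter.class", "").rpartition(".")
    # Java convention: lowercase package, capitalized class name.
    if not package or not simple_name[:1].isupper():
        problems.append("pathFilter class should look like package.ClassName")
    return problems


print(check(read_properties(SNIPPET)) or "no problems found")
```

To check a real file, the same functions could be applied to the text of /etc/hive/conf/hive-site.xml instead of the inline snippet.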
4.4.4. Bounce the Hive servers:
sudo service hive-server stop
sudo service hive-server2 stop
sudo service hive-server start
sudo service hive-server2 start
4.5. When following the steps under “Prepare the Oozie workflow”:
4.5.1. For all steps, download the Oozie files from the Cloudera GitHub site.
4.5.2. Before Step 4, edit the job.properties file accordingly.
4.5.2.1. Make sure the following parameters reference localhost.localdomain, not just localhost:
nameNode=hdfs://localhost.localdomain:8020
jobTracker=localhost.localdomain:8021
4.5.2.2. The jobStart, jobEnd, tzOffset, and initialDataset parameters require explanation. Let’s say Flume is streaming the tweets to an HDFS folder, /user/flume/tweets/*. The parameter initialDataset tells the workflow the earliest year, month, day, and hour for which tweets are available. The jobEnd parameter can be set well into the future. In the following example, the parameters specify that the first set of tweets lives on HDFS under /user/flume/tweets/2013/01/17/08, and once the directory is available the workflow will execute the Hive Query Language script “add-partition.q”.
jobStart=2013-01-17T13:00Z
jobEnd=2013-12-12T23:00Z
initialDataset=2013-01-17T08:00Z
tzOffset=-5
4.5.2.3. Edit coord-app.xml:
a. Change the timezone from “America/Los_Angeles” to “America/New_York” (or the corresponding timezone for your location):
initial-instance="${initialDataset}" timezone="America/New_York">
b. Remove the following tags. This is extremely important in making the tutorial as real-time as possible. The default Oozie workflow defines a readyIndicator, which acts as a wait event: it instructs the workflow to create a new partition only after an hour completes. Thus, if you leave this configuration as-is, there will be a lag of up to one hour between when tweets arrive and when they can be queried. The reason for this default configuration is that the tutorial did not define the custom JAR we built and deployed for Hive, which instructs MapReduce to omit temporary Flume files. Because we have deployed this custom package, we do not have to wait for a full hour to complete before querying tweets.
<data-in name="readyIndicator" dataset="tweets">
<!-- I’ve done something here that is a little bit of a hack. Since Flume doesn’t have a good mechanism for notifying an application of when it has rolled to a new directory, we can just use the next directory as an input event, which instructs Oozie not to kick off a coordinator action until the next dataset starts being available. -->
<instance>${coord:current(1 + (coord:tzOffset() / 60))}</instance>
</data-in>
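The relationship between the UTC timestamps in job.properties, the tzOffset value, and the hourly directories can be sketched as follows. This reflects our reading of the parameters rather than Oozie's internal logic: we assume the directory hour is the UTC time shifted by tzOffset hours, and that the readyIndicator makes the default workflow wait for the directory of the following hour. The function names and the base path default are illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a job.properties style timestamp such as 2013-01-17T13:00Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)


def hourly_directory(stamp: str, tz_offset_hours: int,
                     base_dir: str = "/user/flume/tweets") -> str:
    """Directory for the local hour that corresponds to a UTC timestamp."""
    local = parse_utc(stamp) + timedelta(hours=tz_offset_hours)
    return f"{base_dir}/{local:%Y/%m/%d/%H}"


def ready_indicator_directory(stamp: str, tz_offset_hours: int) -> str:
    """Directory the default workflow waits for: the following hour."""
    return hourly_directory(stamp, tz_offset_hours + 1)


current = hourly_directory("2013-01-17T13:00Z", -5)
waits_for = ready_indicator_directory("2013-01-17T13:00Z", -5)
print(current)
print(waits_for)
```

With the readyIndicator in place, the partition for the first directory is only added once the second one exists, which is where the lag of up to one hour comes from.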
4.5.3. If you haven’t done so already, enable the Oozie web console according to the Cloudera documentation. Doing so allows Oozie coordinator jobs and workflows to be accessed from the console located at http://localhost.localdomain:11000/oozie/.
Tweets streaming to your HDFS.
4.6.1. You can browse the HDFS directory structure from the Hadoop NameNode console on your cluster. You can also access the cluster from http://localhost.localdomain:50070/dfshealth.jsp.
4.6.2. If you are experiencing technical issues, please refer to the Troubleshooting Guide in the appendix.
5. Setup ODBC connectivity through Excel:
5.1. ODBC connectivity to Hive from an application is a logical extension of the Cloudera Twitter tutorial.
5.1.1. There are several ODBC drivers for Hive, but many were either not compatible with Excel (e.g., Cloudera’s ODBC driver for Tableau) or not compatible with Cloudera’s environment (e.g., Microsoft’s ODBC driver for Hive, which only worked when connecting to Microsoft HDInsight).
5.1.2. We successfully used MapR’s ODBC driver for Windows located here. Since we are running 32-bit Excel, we needed to download the 32-bit ODBC driver for Hive, but MapR has a driver for 64-bit as well.
5.1.3. Download and install the appropriate ODBC driver from MapR’s website.
5.1.4. Configure an ODBC connection to the Hive database.
5.1.4.1. We recommend specifying an entry in your Windows hosts file (C:\Windows\System32\drivers\etc\hosts) to alias the IP address for your VM machine. You can get the IP address from your VM by
5.1.5.1. From “Data” tab.
5.1.5.2. Select “From Other Sources”.
5.1.5.3. Select “From Data Connection Wizard”.
5.1.5.5. Select the DSN you set up using the MapR driver (Cloudera Hive VM MapR).
5.1.5.6. Select the “tweets” table.
5.1.5.7. Select “Finish”.
override the HQL in order for the query to execute. At the time this article was written, the
major ODBC drivers append “default” to Hive Query and the MapR ODBC driver is the only one able to establish connectivity which would allow us to override the HQL.
5.1.5.9. Select the “Definition” tab. Using one of the Hive queries provided in the Appendix, copy the HQL and paste it into “Command Text”. Also select “Save password”.
5.1.5.10. Hit OK to import the data.
5.1.5.11. Repeat for the remaining queries in the appendix. Create as many queries as you see fit. HQL is very SQL-like, so those of us who know SQL will find it easy to adapt the queries from the appendix into other statements that provide the views you need.
V. Summary
Once you have successfully completed this tutorial, you should have a clearer understanding of Hadoop, specifically:
1. A quick overview of core Hadoop projects and how each is used to support streaming social media and reporting through a standard ODBC connection.
2. An operational Hadoop sandbox that can be used for training, local development, and proof of concepts that you can navigate and explore.
3. A real-world reference model for a use case illustrating the amazing streaming capabilities in Hadoop.
4. How to model semi-structured JSON data in Hive and query it in a conventional manner.
Lastly, this exercise should leave individuals wanting to take the Hadoop experience to the next level. Independently, you can layer in a Mahout program to cluster and classify the tweets, thereby simulating some form of sentiment analysis. You may also want to layer geospatial data into the set to provide more advanced analytics. You could consider streaming data from other social media sites (if so, we recommend starting here). Above all, you may want to show the result to someone from the business to illustrate what this new technology can do. By demystifying Big Data technology, you can take your understanding, and your ability to support additional business use cases, to these next levels.
Appendix: Custom Java code for MapReduce PathFilter
package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    // Accept a file only if it is not hidden ("_" or "." prefix)
    // and is not a temporary Flume file (".tmp" suffix).
    public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}
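To see which files the filter lets through without deploying anything, the same predicate can be mirrored in a few lines of Python. This is only an illustration of the rule implemented by the Java class above; the file names are hypothetical examples, not output captured from a real cluster.

```python
import posixpath


def accept(path: str) -> bool:
    """Mirror of the Java filter: skip hidden files and *.tmp files."""
    name = posixpath.basename(path)
    return not (name.startswith("_") or name.startswith(".") or name.endswith(".tmp"))


# Hypothetical directory listing for one hourly folder.
candidates = [
    "/user/flume/tweets/2013/01/17/08/FlumeData.1358427600000",
    "/user/flume/tweets/2013/01/17/08/FlumeData.1358427900000.tmp",
    "/user/flume/tweets/2013/01/17/08/_SUCCESS",
    "/user/flume/tweets/2013/01/17/08/.hidden",
]
visible = [p for p in candidates if accept(p)]
print(visible)
```

Only the completed file remains visible, which is why queries can run against an hourly directory while Flume is still writing to it.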
Appendix: Hardware/software environment
Figure 5
Host
• OS: Windows 7 Enterprise 64-bit
• Processor: Intel® Core™ 2 Duo CPU P8400 @ 2.26GHz, 2.27GHz
• Memory: 8GB (7.9 addressable)
• Disk: 300GB
• Software: VM Player 3.1.2 build-301548; Microsoft Office 32-bit
Guest
• OS: CentOS 6.2 Linux 64-bit
• Processor: Intel® Core™ 2 Duo CPU P8400 @ 2.26GHz
• Memory: 2.98GB
• Disk: 23.5GB
• Software: Cloudera Manager Free Edition 4.1.1; CDH4.1.2
Appendix: Troubleshooting guide
Each entry below lists the error message or stack trace, its cause, and the resolution.
Error message/stack trace: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException SerDe com.cloudera.hive.serde.JSONSerDe does not exist
Cause: Hive cannot find hive-serdes-1.0-SNAPSHOT.jar.
Resolution:
1. Place hive-serdes-1.0-SNAPSHOT.jar in /usr/lib/hadoop.
2. Edit /etc/hive/conf/hive-site.xml and add the following:
<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>
</property>
3. Stop and restart the Hive services.
Error message/stack trace: 2013-01-17 13:57:37,027 INFO org.apache.oozie.command.coord.CoordActionInputCheckXCommand: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[0000068-130117082739514-oozie-oozi-C] ACTION[0000068-130117082739514-oozie-oozi-C@2] [0000068-130117082739514-oozie-oozi-C@2]::ActionInputCheck:: In checkListOfPaths: hdfs://localhost.localdomain:8020/user/flume/tweets/2013/01/17/10 is Missing.
Cause: Permissions on /user/flume/*.
Resolution: Change the permissions on /user/flume:
sudo -u flume hadoop fs -chmod -R 777 /user/flume
Error message/stack trace: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]
Cause: Missing MySQL driver.
Resolution:
cp /var/lib/oozie/mysql-connector-java.jar oozie-workflows/lib
Error message/stack trace: OLE DB or ODBC error: [MapR][Hardy] (22) Error from ThriftHiveClient: Query returned non-zero code: 2, cause: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask; HY000. An error occurred while the partition, with the ID of ‘Tweets By Timezone_cbf7182e-a7a6-416c-a3fd-d7f484952cc6’, Name of ‘Tweets By Timezone’ was being processed.
Cause: Flume temp file permissions issue.
Resolution: Walk through the instructions under “Setting up Hive” to ensure the custom Java class that sets the MapReduce pathFilter is built, deployed, and referenced in Hive as specified.
Appendix: Excel queries
Tweets by time zone and day [9]
SELECT
  user.time_zone,
  SUBSTR(created_at, 0, 3),
  COUNT(*) AS total_count
FROM tweets
WHERE user.time_zone IS NOT NULL
GROUP BY
  user.time_zone,
  SUBSTR(created_at, 0, 3)
Top 15 Big Data hashtags
SELECT
  LOWER(hashtags.text),
  COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
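For readers new to LATERAL VIEW EXPLODE, the sketch below shows the same idea on a handful of in-memory tweets: every hashtag of every tweet becomes its own row, the text is lowercased, and the rows are counted per hashtag. The sample tweets are invented, and the ranking with a limit of 15 is taken from the report title rather than from the query text.

```python
from collections import Counter


def top_hashtags(tweets, limit=15):
    """Count lowercased hashtags across all tweets, most frequent first."""
    counts = Counter(
        tag["text"].lower()
        for tweet in tweets                                   # FROM tweets
        for tag in tweet.get("entities", {}).get("hashtags", [])  # EXPLODE
    )
    return counts.most_common(limit)


# Hypothetical tweets reduced to the fields the query touches.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "BigData"}, {"text": "Hadoop"}]}},
    {"entities": {"hashtags": [{"text": "bigdata"}]}},
    {"entities": {"hashtags": []}},
    {"entities": {"hashtags": [{"text": "hadoop"}, {"text": "BIGDATA"}]}},
]
ranking = top_hashtags(sample_tweets)
print(ranking)
```

A tweet without hashtags contributes no rows at all, which matches how the exploded view behaves in the query.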
Top 10 retweeted users on Big Data topics
SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name as retweeted_screen_name,
        retweeted_status.text,
        max(retweet_count) as retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10
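The two-level grouping in this query is easier to follow step by step. The sketch below reproduces it on invented data: the inner step keeps the highest retweet count seen for each (user, text) pair, and the outer step sums those maxima per user. One deliberate difference from the query is that tweets without a retweeted_status are skipped here, whereas the query would group them under a NULL user.

```python
from collections import defaultdict


def top_retweeted_users(tweets, limit=10):
    """Return (screen_name, total_retweets, tweet_count), highest total first."""
    # Inner query: max(retweet_count) per (screen_name, text).
    max_per_tweet = {}
    for tweet in tweets:
        status = tweet.get("retweeted_status")
        if not status:
            continue  # simplification: ignore tweets that are not retweets
        key = (status["user"]["screen_name"], status["text"])
        max_per_tweet[key] = max(max_per_tweet.get(key, 0), tweet["retweet_count"])

    # Outer query: sum of the maxima and number of distinct texts per user.
    totals = defaultdict(lambda: [0, 0])
    for (screen_name, _text), retweets in max_per_tweet.items():
        totals[screen_name][0] += retweets
        totals[screen_name][1] += 1

    rows = [(name, total, count) for name, (total, count) in totals.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Hypothetical tweets reduced to the fields the query touches.
sample_tweets = [
    {"retweet_count": 5, "retweeted_status": {"user": {"screen_name": "alice"}, "text": "A"}},
    {"retweet_count": 8, "retweeted_status": {"user": {"screen_name": "alice"}, "text": "A"}},
    {"retweet_count": 3, "retweeted_status": {"user": {"screen_name": "alice"}, "text": "B"}},
    {"retweet_count": 10, "retweeted_status": {"user": {"screen_name": "bob"}, "text": "C"}},
    {"retweet_count": 0},
]
leaders = top_retweeted_users(sample_tweets)
print(leaders)
```

Taking the maximum first avoids counting the same original tweet several times when many of its retweets were captured in the stream.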
Top 200 most active users on Big Data topics
select
  user.screen_name,
  count(*) tweet_cnt
from tweets
group by user.screen_name
order by tweet_cnt desc
limit 200
References
1. http://blogs.hbr.org/cs/2012/11/the_big_data_talent_gap_no_pan.html
2. Definitions from the Apache Hadoop website for each respective package
3. http://www.cloudera.com/content/cloudera/en/products/cdh.html
4. http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
5. http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume
6. http://blog.cloudera.com/blog/2012/11/analyzing-twitter-with-hadoop-part-3-querying-semi-structured-data-with-hive/
7. https://github.com/cloudera/cdh-twitter-example
8. Known issue with Flume, see https://issues.apache.org/jira/browse/FLUME-1702
9. Adapted from http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
Collaborative Consulting is a leading information technology services firm dedicated to helping our clients achieve business advantage through the use of strategy and technology. We deliver a comprehensive set of solutions across multiple industries, with a focus on business process and program management, information management, software solutions, and software performance and quality. We also have a set of offerings specific to the life sciences and financial services industries. Our unique model offers both onsite management and IT consulting as well as U.S.-based remote solution delivery.
To learn more about Collaborative, please visit our website at www.collaborative.com, email us at [email protected], or contact us at 877-376-9900.