Sentimental Analysis using Hadoop Phase 2: Week 2

(1)

Sentimental Analysis using Hadoop Phase 2: Week 2

M A R KE T / I N DU STRY, F U T U R E S COP E BY

A N KUR UP R I T

(2)

The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular item of data.

The key can be synthetic or auto-generated while the value can be String, JSON, BLOB (basic large object) etc.

This key/value type database allow clients to read and write values using a key as follows:

1. Get(key), returns the value associated with the provided key.

2. Put(key, value), associates the value with the key.

3. Multi-get(key1, key2, .., keyN), returns the list of values associated with the list of keys.

4. Delete(key), removes the entry for the key from the data store.

(3)

While Key/value type database seems helpful in some cases, but it has some weaknesses as well. One, is that the model will not provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously). Such capabilities must be provided by the application itself.

Secondly, as the volume of data increases, maintaining unique values as keys may become more difficult;

addressing this issue requires the introduction of some complexity in generating character strings that will remain unique among an extremely large set of keys.

Key Value

“India” {“B-25, Sector-58, Noida, India – 201301″

“Romania”

{“IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606″,City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara,

300011″}

“US” {“3975 Fair Ridge Drive. Suite 200

South, Fairfax, VA 22033″}

Reference: http://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases

(4)

MongoDB

MongoDB is a document-oriented database that natively supports JSON format. It is extremely easy to use and operate so it is very popular with developers and doesn’t require a database administrator (DBA) to bootstrap.

Redis

Redis is one of the fastest data stores available today. An open source, in-memory and NoSQL database known for its speed and performance, Redis has become popular with developers and has a growing and vibrant community. It features several data types that make implementing various functionalities and flows extremely simple.

(5)

Cassandra

Created at Facebook, Cassandra has emerged as a useful hybrid of a column-oriented database with a key-value store. Grouping families gives the familiar feeling of tables and provides good replication and consistency for easy linear scaling. Cassandra is most effective when used for managing really big volumes of data (the kind that don’t fit in a single server), such as Web/click analytics and measurements from the Internet of Things – writing to Cassandara is extremely fast.

CouchDB

CouchDB gets accessed in JSON format over HTTP. This makes it simple to use for

Web applications. Perhaps not surprisingly, CouchDB is suited best for the Web with

some interesting applications for offline mobile apps.

(6)

HBase

Deep down in Hadoop is the powerful database, HBase, that spreads data out among nodes using HDFS. It is, perhaps, most appropriate to use for managing huge tables consisting of billions of rows. Being a part of Hadoop, it allows using map/reduce processing on the data for complicated computational jobs, but also provides real-time data processing capabilities.

Both HBase and Cassandra follow the BigTable model. As such, HBase can be scaled linearly simply by adding more nodes to the setup. HBase is best suited for real-time querying of Big Data.

Reference: http://www.itbusinessedge.com/slideshows/top-five-nosql-databases-and-when-to-use-them.html

(7)

Most of the scripting languages like php, python, perl, ruby bash is good. Any language able to read from stdin, write to sdtout and parse tab and new line characters will work: Hadoop

Streaming just pipes the string representations of key value pairs as concatenated with a tab to a program that must be executable on each task tracker node.

On most linux distros used to setup hadoop clusters, python, bash, ruby, perl... are already installed but nothing will prevent to roll up your own execution environment for your favorite scripting or compiled programming language.

But, the difference between java and scripting language, it is "Heart Beat of child nodes will not be sent to the parent nodes when we are using scripting languages".

(8)

Reference:

1. http://stackoverflow.com/questions/8572339/which-language-to-use-for-hadoop-map-reduce-programs-java-or- php?answertab=active#tab-top

2. http://www.slideshare.net/corleycloud/big-data-just-an-introduction-to-hadoop-and-scripting-languages

(9)

Clover provides the metrics you need to better balance the effort between writing code that does stuff, and code that tests stuff.

Clover runs in your IDE or your continuous integration system, and includes test optimization to make your tests run faster, and fail more quickly.

Reference: https://www.atlassian.com/software/clover/overview

(10)

JUnit is a simple framework to write repeatable tests. It is an instance of the xUnit architecture for unit testing frameworks.

Reference: http://blog.cloudera.com/blog/2008/12/testing-hadoop/

How to create a test:https://github.com/junit-team/junit/wiki/Getting-started

(11)

Hadoop admin who can work independently with excellent communication skill. Good knowledge of Linux (security, configuration, tuning, troubleshooting and monitoring). Able to deploy Hadoop cluster, add and remove nodes, troubleshoot failed jobs, configure and tune the cluster, monitor critical parts of the cluster, etc.

Hadoop Developer: A Hadoop Developer has many responsibilities. And the job

responsibilities are dependent on your domain/sector, where some of them would

be applicable and some might not

(12)

Hadoop development and implementation.

Loading from disparate data sets.

Pre-processing using Hive and Pig.

Designing, building, installing, configuring and supporting Hadoop.

Translate complex functional and technical requirements into detailed design.

Perform analysis of vast data stores and uncover insights.

Maintain security and data privacy.

http://www.edureka.co/blog/hadoop-developer-job-responsibilities-skills/

https://www.dice.com/jobs/detail/Hadoop-Solution-Architect-%26%2347-Information-and-BI-Architect-%26%2347-Big-Data- Architect-%26%2347-Hadoop-Admin-IDC-Technologies-Sunnyvale-CA-94085/10114879/727691

Create scalable and high-performance web services for data tracking.

High-speed querying.

Managing and deploying HBase.

Being a part of a POC effort to help build new Hadoop clusters.

Test prototypes and oversee handover to operational teams.

Propose best practices/standards.

(13)

Reference: http://hortonworks.com/training/certification/

(14)

You would not compare so does Hive vs Hbase - Commonly happend because of SQL-like layer on Hive - Hbase is a Database but Hive is never a Database.

Hive is a MapReduce based Analysis/ Summarisation tool running on Top of Hadoop. Hive depends on Mapreduce( Batch Processing) + HDFS

Hbase is a Database (NoSQL) - which is used to store and retrieve data.

To Query(Scans) Hbase - mapreduce is not required - So HBase depends only on HDFS - not on Mapreduce So Hbase is Online Processing System

http://www.quora.com/Hive-vs-HBase-Which-one-wins-the-battle-Which-is-used-in-which-scenario

(15)

Depending on where you work, you may need to simply use whatever standards your company has established.

For example, Hive is commonly used at Facebook for analytical purposes. Facebook promotes the Hive language and their employees frequently speak about Hive at Big Data and Hadoop conferences.

However, Yahoo! is a big advocate for Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world. Their data engineers use Pig for data processing on their Hadoop clusters.

Alternatively, you may have a choice of Pig or Hive at your organization, especially if no standards have yet been established, or perhaps multiple standards have been set up.

However, compared to Hive, Pig needs some mental adjustment for SQL users to learn. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!). Pig requires more verbose coding, although it’s still a fraction of what straight Java MapReduce programs require. Pig also gives you more control and optimization over the flow of the data than Hive does.

If you are a data engineer, then you’ll likely feel like you’ll have better control over the dataflow (ETL) processes when you use Pig Latin, especially if you come from a procedural language background. If you are a data analyst, however, you will likely find that you can ramp up on Hadoop faster by using Hive, especially if your previous experience was more with SQL than with a procedural programming language. If you really want to become a Hadoop expert, then you should learn both Pig Latin and Hive for the ultimate flexibility.