Getting to Know Big Data

(1)

Dr. Putchong Uthayopas

Department of Computer Engineering, Faculty of Engineering, Kasetsart University

(2)

Information Tsunami

• Rapid expansion of smartphone usage, social computing, mobile applications, and gaming

• Rapid increases in network bandwidth and coverage

– Wi-Fi, 4G

• Rapid move toward the Internet of Things (IoT)

(3)

(4)

During the first day of a baby’s life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. | Photo Credit: ©Catherine Balet “Strangers in the light” (Steidl) 2012 / from The Human Face of Big Data

(5)

By signing up with the personal genetics company 23andMe, Yasmine Delawari Johnson, producer of the documentary We Came Home, was able to get a glimpse into the future. | Photo Credit: © Douglas Kirkland 2012 / from The Human Face of Big Data

(6)

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

(7)

Properties of Big Data

[Diagram labels: Big Data, Volume, Velocity]

(8)

Volume

• Big data must be huge

– Beyond the capability of a single computer server to process it

– Possible to store the data but difficult to process it

(9)

Velocity

• Big data accumulates at a very fast speed

– Stock market data

– Internet access logs

– Social media data (Twitter, Facebook, IG)

• We need to

– Extract meaning as fast and as much as we can before throwing away the data
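
One way to read "extract meaning before throwing away the data" is a single pass over the stream that keeps only small summaries. The sketch below is a minimal illustration under that assumption; the function name `summarize_stream`, the event format, and the sample events are all hypothetical.

```python
from collections import Counter, deque


def summarize_stream(events, window=3):
    """Consume (source, value) events once, keeping only small summaries."""
    totals = Counter()             # running total per source
    recent = deque(maxlen=window)  # only the last `window` raw events survive
    for source, value in events:
        totals[source] += value
        recent.append((source, value))
    return totals, list(recent)


def sample_events():
    # Hypothetical events; a real feed would be a socket or a log tail.
    yield from [("twitter", 1), ("facebook", 2), ("twitter", 3), ("ig", 1)]


totals, recent = summarize_stream(sample_events())
print(totals["twitter"], recent)
```

Memory use depends on the number of distinct sources and the window size, not on the length of the stream, which is the point of processing at velocity.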

(10)

Variety

• Data comes in many varieties

– Traditional databases

– Documents

– Web pages

– Social media data

– Images

– Video/Audio

– Location

(11)

Diya Soubra, The 3Vs that define Big Data, 2012

(12)

(13)

Why?

• Improve products and services

• Increase customer satisfaction/behavior

• Improve operational efficiency

• Understand emerging market trends

The real value of big data is in the insights it produces when analyzed— discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.

Know thy self, know thy enemy. A thousand battles, a thousand victories.

http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf

(14)

Google Flu

• A pattern emerges when all the flu-related search queries are added together.

• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.

• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.

http://www.google.org/flutrends/about/how.html
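
Comparing query counts with surveillance data can be illustrated with a plain correlation coefficient. This is only a sketch of the idea, not Google's actual method; the `pearson` helper and the weekly numbers are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up weekly numbers, for illustration only (not Google or surveillance data).
weekly_query_counts = [120, 150, 310, 480, 390, 200]
weekly_reported_cases = [10, 14, 30, 45, 41, 18]

r = pearson(weekly_query_counts, weekly_reported_cases)
print(f"correlation = {r:.2f}")
```

A value close to 1 would suggest that query volume rises and falls with reported cases, which is the kind of relationship the bullets above describe.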

(15)

Social Media Analytics

• Social media analytics is the practice of gathering data from blogs and social media websites and analyzing that data to make business decisions. The most common use of social media analytics is to mine customer sentiment in order to support marketing and customer service activities.
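
Mining customer sentiment can be sketched, in its simplest form, as counting positive and negative words in each post. The word lists, the `sentiment_score` function, and the sample posts below are hypothetical; production systems typically rely on much larger lexicons or trained models.

```python
# Tiny hand-made word lists; real systems use far larger lexicons or trained models.
POSITIVE = {"love", "great", "good", "fast"}
NEGATIVE = {"hate", "slow", "bad", "broken"}


def sentiment_score(text):
    """Return (#positive words - #negative words) for one post."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


posts = [
    "I love this phone, great camera!",
    "Delivery was slow and the box arrived broken.",
    "It is a phone.",
]
scores = [sentiment_score(p) for p in posts]
print(scores)  # [2, -2, 0]
```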

(16)

Cupid in Your Network

• Study of matchmakers

– Surveyed approximately 1,500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years ago

– Asked them how they met their partner and who introduced them (if anyone).

– Analyzed network properties of couples and their matchmakers using de-identified, aggregated data.

• Matchmaker characteristics

– Matchmakers have far more friends than the people they're setting up.

– Matchmakers' networks have a different structure: their networks are less dense, so their friends are less likely to know each other

– Matchmakers were more likely to be close friends, rather than acquaintances.
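
"Less dense" can be made concrete as the fraction of a person's friend pairs who also know each other. The sketch below assumes that definition; `friend_density` and the sample names are hypothetical and are not taken from the study.

```python
from itertools import combinations


def friend_density(friends, friendships):
    """Fraction of friend pairs who are themselves friends (0.0 to 1.0)."""
    pairs = list(combinations(sorted(friends), 2))
    if not pairs:
        return 0.0
    known = {frozenset(edge) for edge in friendships}
    linked = sum(frozenset(pair) in known for pair in pairs)
    return linked / len(pairs)


# Hypothetical example: four friends, only one pair knows each other.
friends = {"ann", "bob", "cat", "dan"}
friendships = [("ann", "bob")]
print(friend_density(friends, friendships))
```

Under this definition, a matchmaker would tend to have a lower value than the people being introduced, because the matchmaker's friends come from groups that do not overlap much.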

(17)

Considerations for Applying Big Data

(18)

NoSQL (Not Only SQL)

• A NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

– Typically non-relational, distributed, open-source, and horizontally scalable

– Used to handle huge amounts of data

– Originally intended for modern web-scale databases

(19)

• MongoDB is a general-purpose, open-source database.

• MongoDB features:

– Document data model with dynamic schemas

– Full, flexible index support and rich queries

– Auto-sharding for horizontal scalability

– Built-in replication for high availability

– Text search
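
Two of the ideas above, documents with dynamic schemas and sharding for horizontal scale, can be illustrated without any database at all. The sketch below uses plain Python dictionaries and is not MongoDB's API; `find`, `shard_for`, the hash-based placement, and the sample documents are all assumptions made for illustration.

```python
import hashlib

# Documents in one collection need not share the same fields (dynamic schema).
users = [
    {"_id": "u1", "name": "Ann", "city": "Bangkok"},
    {"_id": "u2", "name": "Bob", "city": "Bangkok", "tags": ["gamer"]},
    {"_id": "u3", "name": "Cat", "email": "cat@example.com"},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash, so data spreads across servers."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


print([doc["name"] for doc in find(users, city="Bangkok")])  # ['Ann', 'Bob']
print({doc["_id"]: shard_for(doc["_id"], 3) for doc in users})
```

Because the shard is computed from the key alone, any server can work out where a document lives without consulting a central table, which is one reason hashing is a common choice for spreading data.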

(20)

• Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

• The base Apache Hadoop framework is composed of the following modules:

– Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

– Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

– Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications; and

– Hadoop MapReduce – a programming model for large-scale data processing.

• Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.

(21)

Magic behind Hadoop and HDFS

• The problem is divided into two phases

– Map: applies some action to the data as <key, value> pairs and produces intermediate results

– Reduce: summarizes the intermediate <key, value> results and returns them to the main program

Ricky Ho, How Hadoop Map/Reduce works,

(22)

Example: Word count

• Counting words in an input text file.

– How many times does the word “love” appear in a novel? ^_^

• In the map phase, each sentence is split into words, which form the initial key-value pairs <word, 1>

– “tring tring the phone rings” becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>

• In the reduce phase, the keys are grouped together and the values for identical keys are added.

– There is only one pair of identical keys, ‘tring’. The values for these keys are added, so the output key-value pairs are <tring,2>, <the,1>, <phone,1>, <rings,1>

• Reduce forms an aggregation phase for keys

– This gives the number of occurrences of each word in the input.

http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
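
The word count example can be traced in a few lines of ordinary Python. This is a single-machine sketch of the map, group, and reduce steps, not Hadoop code; the function names are hypothetical, and on a real cluster the grouping step is carried out by the framework across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a <word, 1> pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Add up the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["tring tring the phone rings"]))
# {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
```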

(23)

Data Product

• A data product provides actionable information without exposing the decision maker to the underlying data or analytics

– Movie Recommendations

– Weather Forecast

– Stock Market Prediction

– Operational improvement

– Health Diagnosis

(24)

(25)

Bottom up approach

• What data do we have?

• How can we collect and store it?

• What infrastructure and tools can process this big data?

• What analytics methods can be applied?

• What insight can we gain from this data and analysis?

(26)

Top down

• What business challenge can create value and impact for the organization?

• What data do we need?

• What tools and analytics approaches should be used?

• What infrastructure is needed?

(27)

(28)

(29)

(30)

(31)

(32)

(33)

(34)

(35)

(36)

(37)

In-memory Database

• An in-memory database is a database management system that primarily relies on main memory for computer data storage.

• It is faster than disk-optimized databases since the internal optimization algorithms are simpler and execute fewer CPU instructions.

• Accessing data in memory eliminates seek time when querying the data, which provides faster and more predictable performance than disk.
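
A quick way to try the idea is SQLite, which can keep an entire database in main memory when opened with the special name `:memory:`. SQLite is an embedded database rather than a dedicated in-memory product, so treat this only as a small sketch; the `page_counts` function, table layout, and sample rows are hypothetical.

```python
import sqlite3


def page_counts(clicks):
    """Load (visitor, page) rows into an in-memory database and count hits per page."""
    conn = sqlite3.connect(":memory:")  # ":memory:" keeps the database in RAM
    try:
        conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT)")
        conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
        query = "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Hypothetical sample rows.
print(page_counts([("ann", "home"), ("bob", "home"), ("ann", "news")]))
# [('home', 2), ('news', 1)]
```

Nothing is written to disk here, so the data disappears when the connection closes; dedicated in-memory databases add persistence and replication on top of the same basic idea.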

(38)

Spark at Yahoo

• Two use cases: one for personalizing news pages for Web visitors and another for running analytics for advertising.

• For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.

– Wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)

– With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.

• The second use case shows off the interactive capability of Hive on Spark (Shark).

– Use existing BI tools to view and query the advertising analytics data collected in Hadoop.

http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/

(39)

Big Data Infrastructure Goes to Cloud

• Data is already on the cloud

– Virtual organization

– Cloud-based SaaS services

• Big Data As a Service on the Cloud

– Private Cloud

– Public Cloud

• IBM Bluemix, Amazon AWS (EMR) and many

Big Data Services

Services App

(40)

Big Data Analytics

• A set of advanced technologies designed to work with large volumes of heterogeneous data.

• Explores the data to discover interrelationships and patterns using sophisticated quantitative methods such as

– machine learning

– neural networks

– robotics algorithms

– computational mathematics

– artificial intelligence

(41)

Deep Learning

• Deep learning is a subcategory of machine learning that uses neural networks to improve things like speech recognition, computer vision, and natural language processing.
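
The building block of such networks is a layer that multiplies its inputs by weights, adds a bias, and applies a simple nonlinearity; deep networks stack many of these. The sketch below shows only that layered structure with two small layers. The weights are hand-picked for illustration rather than learned, and all names are hypothetical.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, x)


def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def tiny_network(x1, x2):
    # Hidden layer: two ReLU units. Weights are hand-picked, not learned.
    hidden = layer([x1, x2], [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    # Output layer: one linear unit combining the hidden features.
    (output,) = layer(hidden, [[1.0, -2.0]], [0.0], lambda v: v)
    return output


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, tiny_network(a, b))
```

With these particular weights the network outputs 1 only when exactly one input is 1, a pattern a single layer cannot represent. In deep learning the weights are found automatically from training data instead of being set by hand.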

(42)

Applying Deep Learning

• In 2011, Stanford computer science professor Andrew Ng founded Google’s Google Brain project, which created a neural network trained with deep learning algorithms. It famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos and without ever having been told what a “cat” is.

• Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.

• Voice recognition systems such as Google Now and Apple’s Siri now use deep learning.

– According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, stands at 25% lower than previous versions of the software.

Source: http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning

http://www.wired.com/2014/08/deep-learning-yann-lecun/

(43)

IBM Watson and Cognitive Technology

• Watson is a cognitive technology that processes information more like a human than a computer, by understanding natural language, generating hypotheses based on evidence, and learning as it goes. And learn it does.

• Watson “gets smarter” in three ways:

– being taught by its users

– learning from prior interactions

– being presented with new information.

• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.

(44)

Applying Watson in Healthcare

• WellPoint, Inc. is an Indianapolis-based health benefits company.

– Approximately 37 million health plan members

– Processes more than 550 million claims per year

• Using IBM Watson™ to improve the quality and efficiency of healthcare decisions.

– WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management. Natural language processing leverages unstructured data, such as text-based treatment requests.

• Benefits

– Helps utilization management (UM) nurses make faster UM decisions about treatment requests

– Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive