Getting to Know Big Data
Dr. Putchong Uthayopas
Department of Computer Engineering,
Faculty of Engineering, Kasetsart University Email: putchong@ku.th
Information Tsunami
• Rapid expansion of smartphone usage, social
computing, mobile applications, gaming
• Rapid increases in network bandwidth and coverage
– Wi-Fi, 4G
• Rapid move toward the Internet of Things (IoT)
During the first day of a baby’s life, the amount of data generated by humanity is equivalent to 70 times the information contained in the Library of Congress. | Photo Credit: ©Catherine Balet “Strangers in the light” (Steidl) 2012 / from The Human Face of Big Data
By signing up with the personal genetics company 23andMe, producer of the documentary We Came Home Yasmine Delawari Johnson was able to get a glimpse into the future. | Photo Credit: © Douglas Kirkland 2012 / from The Human Face of Big Data
Big data is high-volume, high-velocity and
high-variety information assets that demand
cost-effective, innovative forms of information
processing for enhanced insight and decision
making.
Properties of Big Data
• Volume
• Velocity
• Variety
Volume
• Big data must be
huge
– Beyond the
capability of a single
computer server to
process it
– Possible to store the
data but difficult to
process it
Velocity
• Big data accumulates at a
very fast speed
– Stock market data
– Internet access logs
– Social media data
• Twitter, Facebook, Instagram
• We need to
– Extract meaning as fast and
as much as we can before
throwing away the data
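The idea of extracting meaning before the data is thrown away can be sketched as a single pass over a stream that keeps only running aggregates. This is a minimal illustration, not a production streaming system; the symbols `AAA` and `BBB` and their prices are made up for the example.

```python
def summarize(stream):
    """Consume (symbol, price) events once, keeping only running aggregates."""
    summary = {}
    for symbol, price in stream:
        stats = summary.setdefault(symbol, {"count": 0, "mean": 0.0, "max": price})
        stats["count"] += 1
        # Incremental mean: earlier prices do not need to be stored.
        stats["mean"] += (price - stats["mean"]) / stats["count"]
        stats["max"] = max(stats["max"], price)
    return summary

# Hypothetical tick data; an iterator can only be read once, like a live feed.
ticks = iter([("AAA", 34.0), ("BBB", 130.0), ("AAA", 35.0), ("AAA", 36.0)])
summary = summarize(ticks)
print(summary["AAA"])  # {'count': 3, 'mean': 35.0, 'max': 36.0}
```

Each event is seen once and then discarded; only the small summary dictionary stays in memory.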
Variety
• Data comes in many
forms
– Traditional
databases
– Documents
– Web pages
– Social media
data
– Images
– Video/Audio
– Location
Diya Soubra, The 3Vs that define Big Data, 2012
Why?
• Improve products and
services
• Increase customer
satisfaction and understand customer behavior
• Improve operational
efficiency
• Understand
emerging market
trends
The real value of big data is in the insights it produces when analyzed— discovered patterns, derived meaning, indicators for decisions, and ultimately the ability to respond to the world with greater intelligence.
Know thyself, know thy enemy. A thousand battles, a thousand victories.
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/big-data-cloud-technologies-brief.pdf
Google Flu Trends
• A pattern emerges when all the flu-related search queries are added together.
• We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.
• By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.
http://www.google.org/flutrends/about/how.html
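The comparison between query counts and traditional surveillance data can be illustrated with a simple correlation. This is only a sketch of the general idea, not Google's actual method; the weekly figures below are invented for illustration, and `pearson` is a helper written for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly figures, invented for illustration only.
query_counts = [120, 150, 310, 480, 450, 200]   # flu-related searches per week
reported_cases = [10, 14, 30, 49, 44, 19]       # cases from a surveillance system

r = pearson(query_counts, reported_cases)
print(round(r, 3))
```

A value of `r` close to 1 would suggest that the query counts rise and fall together with the reported cases, which is the kind of relationship the slide describes.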
Social Media Analytics
• Social media analytics is
the practice of gathering
data from blogs and social
media websites and
analyzing that data to
make business decisions.
The most common use
of social media analytics is
to
mine customer
sentiment in order to
support marketing and
customer service activities.
Cupid in Your Network
• Matchmaker study
– Surveyed approximately 1,500 English speakers around the world who had listed a relationship on their profile at least one year ago but no more than two years ago
– Asked them how they met their partner and who introduced them (if anyone).
– Analyzed network properties of couples and their matchmakers using de-identified, aggregated data.
• Matchmaker characteristics
– Matchmakers have far more friends than the people they're setting up.
– Matchmakers' networks have a different structure
• their networks are less dense: their friends are
less likely to know each other
– Matchmakers were more likely to be close friends, rather than acquaintances.
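The notion of a "less dense" network can be made concrete as the fraction of pairs of friends who also know each other. The sketch below assumes that definition; the function name `friend_density` and the small friend lists are hypothetical and are not taken from the study.

```python
from itertools import combinations

def friend_density(friends, friendships):
    """Fraction of pairs of friends who are themselves friends with each other."""
    pairs = list(combinations(sorted(friends), 2))
    if not pairs:
        return 0.0
    known = {frozenset(edge) for edge in friendships}
    linked = sum(1 for pair in pairs if frozenset(pair) in known)
    return linked / len(pairs)

# Hypothetical matchmaker: many friends, few of whom know each other.
matchmaker_friends = {"a", "b", "c", "d", "e"}
matchmaker_links = [("a", "b"), ("c", "d")]

# Hypothetical non-matchmaker: a small, tightly knit circle.
other_friends = {"x", "y", "z"}
other_links = [("x", "y"), ("y", "z"), ("x", "z")]

print(friend_density(matchmaker_friends, matchmaker_links))  # 0.2
print(friend_density(other_friends, other_links))            # 1.0
```

A lower value means the person's friends are less likely to know each other, which matches the structure described for matchmakers.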
Considerations for Applying Big Data
NoSQL (Not Only SQL)
• A NoSQL (often interpreted as Not only SQL)
database provides a mechanism for storage and
retrieval of data that is modeled by means other than
the tabular relations used in relational databases.
– Typically non-relational, distributed,
open-source and horizontally scalable
– Used to handle huge amounts of data
– Originally intended for modern web-scale
databases
• MongoDB is a general purpose,
open-source database.
• MongoDB features:
– Document data model with
dynamic schemas
– Full, flexible index support and rich
queries
– Auto-Sharding for horizontal
scalability
– Built-in replication for high
availability
– Text search
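Two of the features above, dynamic schemas and sharding, can be illustrated in plain Python without a database server. This sketch does not use the MongoDB API; the `find` and `shard_for` helpers, the customer records and the choice of three shards are all assumptions made for the example, and the hashing scheme is a simplification of hash-based sharding in general.

```python
import hashlib

# A "collection" as a list of documents; documents need not share the same fields.
customers = [
    {"_id": 1, "name": "Customer A", "email": "a@example.com"},
    {"_id": 2, "name": "Customer B", "phones": ["000-0000"], "address": {"city": "Bangkok"}},
    {"_id": 3, "name": "Customer C"},
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

def shard_for(key, num_shards):
    """Pick a shard from a stable hash of the shard key."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 3  # hypothetical cluster size
shards = {i: [] for i in range(NUM_SHARDS)}
for doc in customers:
    shards[shard_for(doc["_id"], NUM_SHARDS)].append(doc)

print(find(customers, name="Customer B"))
print({shard: [doc["_id"] for doc in docs] for shard, docs in shards.items()})
```

Because the shard is derived from the key alone, any node can work out where a document lives, and adding documents spreads the load across machines rather than growing a single server.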
• Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets on
computer clusters built from commodity hardware.
• The base Apache Hadoop framework is composed of the following
modules:
– Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
– Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
– Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications; and
– Hadoop MapReduce – a programming model for large scale data processing.
• Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
Magic behind Hadoop and HDFS
• A problem is divided into two phases
– Map: apply some action to the data, given as <key, value>
pairs, and produce intermediate results
– Reduce: summarize the intermediate <key, value> results
and return them to the main program
Ricky Ho, How Hadoop Map/Reduce works
Example: Word count
• Counting words in an input text file.
– How many times does the word “love” appear in a novel? ^_^
• In the map phase the sentence is split into words, which
form the initial key-value pairs <word, 1>
• “tring tring the phone rings” becomes <tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>
– In the reduce phase the keys are grouped together and the values
for identical keys are added.
• There is only one pair of identical keys, ‘tring’; the values for these keys are added, so the output key-value pairs are
• <tring,2>, <the,1>, <phone,1>, <rings,1>
• Reduce forms an aggregation phase for keys
– This gives the number of occurrences of each word in the
input.
http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html
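The word count steps above can be sketched on a single machine in plain Python. This is not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for this illustration, and the grouping step stands in for the work the framework does between the two phases.

```python
from collections import defaultdict

def map_phase(text):
    """Map: split the text into words and emit a <word, 1> pair for each."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group the intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: add up the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(text):
    return reduce_phase(shuffle(map_phase(text)))

counts = word_count("tring tring the phone rings")
print(counts)  # {'tring': 2, 'the': 1, 'phone': 1, 'rings': 1}
```

On a real cluster the map calls run in parallel on different blocks of the input and the reduce calls run in parallel on different keys; the logic of each phase stays the same.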
Data Product
• A data product provides actionable information
without exposing the decision maker to the
underlying data or analytics
– Movie Recommendations
– Weather Forecast
– Stock Market Prediction
– Operation improvement
– Health Diagnosis
Bottom up approach
• What is the data that we have?
• How can we collect and store it?
• What are the infrastructure and
tools to process this big data?
• What analytics methods can be
applied?
• What is the insight we can gain
from this data and analysis?
Top down
• What is the business
challenge that can create
value and impact to the
organization?
• What is the data that we
need?
• What tools and analytics
approaches should be used?
• What is the infrastructure
needed?
In-memory Database
• An in-memory database is
– a database management
system that primarily relies
on main memory for computer
data storage.
• It is faster than disk-optimized databases because its internal optimization algorithms are simpler and execute fewer CPU instructions.
• Accessing data in memory eliminates seek time when querying the data, which provides faster and more
predictable performance than disk.
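A quick way to try the idea is SQLite's in-memory mode from Python's standard library. SQLite is used here only as a convenient stand-in, not as an example of a purpose-built in-memory database system; the `access_log` table and its rows are made up for the illustration.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (visitor TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/news")],
)

# Count visits per page; the query runs entirely against data held in memory.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM access_log GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('/home', 2), ('/news', 1)]
conn.close()
```

The data disappears when the connection is closed, which is the usual trade-off: speed comes from memory, so durability has to be handled separately.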
Spark at Yahoo
• Two use cases: personalizing news pages for Web visitors, and running analytics for advertising.
• For news personalization, the company uses ML algorithms running on Spark to figure out what individual users are interested in, and also to categorize news stories as they arise to figure out what types of users would be interested in reading them.
– Yahoo wrote a Spark ML algorithm in 120 lines of Scala. (Previously, its ML algorithm for news personalization was written in 15,000 lines of C++.)
– With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
• The second use case shows off the interactive capability of Hive on Spark (Shark).
– Existing BI tools can be used to view and query the advertising analytic data collected in Hadoop.
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Big Data Infrastructure Goes to the Cloud
• Data is already in the cloud
– Virtual organization
– Cloud-based SaaS services
• Big Data as a Service on the cloud
– Private Cloud
– Public Cloud
• IBM Bluemix, Amazon AWS (EMR) and many
Big Data Services
Services App
Big Data Analytics
• A set of advanced technologies
designed to work with large
volumes of heterogeneous data.
• Explores the data to discover
interrelationships and patterns
using sophisticated quantitative
methods such as
– machine learning
– neural networks
– robotics algorithms
– computational mathematics
– artificial intelligence
Deep Learning
• Deep learning is a subcategory of machine learning
that uses neural networks to improve tasks
such as speech recognition, computer vision,
and natural language processing.
Applying Deep Learning
• In 2011, Stanford computer science professor Andrew Ng founded Google’s Google Brain project, which created a neural network trained with deep learning algorithms. The network famously proved capable of recognizing high-level concepts, such as cats, after watching just YouTube videos and without ever having been told what a “cat” is.
• Facebook is using deep learning expertise to help create solutions that will better identify faces and objects in the 350 million photos and videos uploaded to Facebook each day.
• Voice recognition systems such as Google Now and Apple’s Siri now use deep learning.
– According to Google researchers, the voice error rate in the new version of Android, after adding insights from deep learning, stands at 25% lower than previous versions of the software.
Sources:
http://www.fastcolabs.com/3026423/why-google-is-investing-in-deep-learning
http://www.wired.com/2014/08/deep-learning-yann-lecun/
IBM Watson and Cognitive Technology
• Watson is a cognitive
technology that processes
information more like a human than a computer, by understanding
natural language, generating
hypotheses based on evidence, and learning as it goes. And learn it does.
• Watson “gets smarter” in three
ways:
– being taught by its users
– learning from prior interactions
– being presented with new information.
• This means organizations can more fully understand and use the data that surrounds them, and use that data to make better decisions.
Applying Watson in Healthcare
• WellPoint, Inc. is an Indianapolis-based health benefits company.
– Approximately 37 million health plan members
– Processes more than 550 million claims per year
• Uses IBM Watson™ to improve the quality and efficiency of healthcare decisions.
– WellPoint trained Watson with 25,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management (UM). Natural language processing leverages unstructured data, such as text-based treatment requests.
• Benefits
– Helps UM nurses make faster UM decisions about treatment requests
– Could accelerate healthcare preapprovals, which can be critical when treatments are time-sensitive
– Includes unstructured data in the streamlined decision process