eTransactions & Data Science I:
The Power of Big Data & Data Analytics
Electronic Transactions
Defining “Big Data”
Context
More context
“Big” context
Rise of data markets
Rise of open data portals
Then, Big Data happens
What is ...?
• Big Data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing applications software. - Wikipedia
• Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. - Gartner, 2001
• Big Data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. - SAS
What is ...?
• Big Data refers to data that would typically be too expensive to store, manage, and analyze using traditional (relational and/or monolithic) database systems. Usually, such systems are cost-inefficient because of their inflexibility for storing unstructured data (such as images, text, and video), accommodating “high-velocity” (real-time) data, or scaling to support very large (petabyte-scale) data volumes. - Google Cloud Platform
• Big Data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases. - Amazon, AWS
The initial 3 V’s...
1. Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.
2. Velocity: Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near real time and will require real-time evaluation and action.
3. Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.
... and the next 2 Vs.
4. Veracity: Refers to the biases, noise and abnormality in data. Data collected and stored from various sources, in different forms, is often inaccurate. We therefore have to deal with large amounts of poor-quality data that is imprecise and uncertain. Quality and accuracy are less controllable, so veracity is the biggest challenge in data analysis when compared with things like volume and velocity.
5. Value: Refers to our ability to turn our data into value. Having endless amounts of data is one thing, but unless it can be turned into value it is useless. It is important that businesses make a case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the business value they will bring.
Focusing on Value
V for value sits at the top of the big data pyramid. This refers to the ability to transform a tsunami of data into business value.
Business value is in the insights, which were not available before. Acting upon the insights is imperative.
The most important part of embarking on a big data initiative is to understand the costs and benefits of collecting and analyzing the data, to ensure that ultimately the data that is reaped can be monetized.
The key is to make a difference
Putting it all together
Beyond the 5 Vs
Validity
Closely related to veracity is validity: is the data correct and accurate for its intended use?
Volatility
Big data volatility refers to how long data remains valid and how long it should be stored. In a world of real-time data, you need to determine the point at which data is no longer relevant to the current analysis.
Visualization
Refers to the challenge of visualizing big data. Current big data visualization tools face technical challenges due to limitations of in-memory technology and poor scalability, functionality, and response time. You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data.
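One common way around the "billion data points" problem is to aggregate before drawing. The sketch below is only an illustration under assumed names (`bin_points`, `cell_size`) and synthetic data: it counts points per grid cell, so that a plot needs one mark per cell instead of one per point.

```python
import random
from collections import Counter

def bin_points(points, cell_size):
    """Count how many (x, y) points fall into each square grid cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

random.seed(0)
# Stand-in for a large dataset: 100,000 random points in the unit square.
points = [(random.random(), random.random()) for _ in range(100_000)]
grid = bin_points(points, cell_size=0.1)

# At most 10 x 10 = 100 cells need to be drawn, whatever the input size.
print(len(grid), sum(grid.values()))
```

The cell counts can then be drawn as a heat map; the trade-off is that detail inside a cell is lost.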
The characteristics of Big Data (again)
Big Data Sources
• Social Networks
• Log Files
• Web Traffic – Clicks
• Network Data
• Activity Data
• Streaming Endpoints
• Images – Photos
• Free Speech
• Text Documents
• ERPs – CRMs
• Transactions (commercial transactions, banking/stock records, e-commerce, credit cards)
• Internet of Things (traffic, weather, GPS, mobile, satellite)
• Open Data
Data sources evolution
[Figure: tiers of data sources, ordered by data volume (Megabytes to Petabytes) and by increasing data variety & complexity]
ERP (Megabytes): Purchase detail, Purchase record, Payment record
CRM (Gigabytes): Segmentation, Offer details, Customer touches, Support contacts
WEB (Terabytes): A/B testing, Dynamic pricing, Affiliate networks, Search marketing, Behavioural targeting, Dynamic funnels, Web logs, Offer history
BIG DATA (Petabytes): Sensors / RFID / Devices, User click stream, Mobile web, User generated content, Sentiment, Social interactions & feeds, Spatial GPS coordinates, External demographics, Business data feeds, Video, Audio, Images, Speech to text, Product/Service logs, Messages
BIG DATA = TRANSACTIONS + INTERACTIONS + OBSERVATIONS
NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison, A. B. M. Moniruzzaman, Syed Akhter Hossain
Big Data Facts – How Big is Big Data?
https://digitalmarketingphilippines.com/big-data-8-surprising-facts-to-know-infographic/
Big Data Challenges
Source: https://resources.sei.cmu.edu/asset_files/Presentation/2014_017_101_89659.pdf
Big Data for the Businesses
Big Data Market Trends and Forecasts
Source: https://statista.com
Big data market size revenue forecast worldwide from 2011 to 2027 (in billion U.S. dollars)
Big Data Market Trends and Forecasts
Nearly 50% of respondents to a recent McKinsey Analytics survey say analytics and Big Data have fundamentally changed business practices in their sales and marketing functions.
Also, more than 30% say the same about R&D across industries, with respondents in High Tech and Basic Materials & Energy reporting the greatest number of functions being transformed by analytics and Big Data.
Source: Analytics Comes of Age, published in January 2018
Big Data Market Trends and Forecasts
Big Data applications and analytics are projected to grow from $5.3B in 2018 to $19.4B in 2026, attaining a CAGR of 15.49%.
The Professional Services segment of the worldwide Big Data market is projected to grow from $16.5B in 2018 to $21.3B in 2026.
Source: Wikibon, as reported by Statista.
Big Data Market Trends and Forecasts
Comparing the worldwide demand for advanced analytics and Big Data-related hardware, services and software, the latter category’s dominance becomes clear.
The software segment is projected to grow the fastest of all categories, increasing from $14B in 2018 to $46B in 2027 and attaining a CAGR of 12.6%.
Sources: Wikibon; SiliconANGLE; Statista estimates, as reported by Statista.
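The compound annual growth rates (CAGR) quoted in these forecasts can be checked with the standard formula CAGR = (end / start)^(1 / periods) - 1. The sketch below is not taken from the cited sources; the function name `cagr` and the choice of period counts are assumptions made for illustration.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate over `periods` compounding periods."""
    return (end_value / start_value) ** (1 / periods) - 1

# Applications and analytics: $5.3B (2018) to $19.4B (2026).
# 9 = the years 2018..2026 counted inclusively (an assumption).
print(f"{cagr(5.3, 19.4, 9):.2%}")

# Software: $14B (2018) to $46B (2027).
# 10 = the years 2018..2027 counted inclusively (an assumption).
print(f"{cagr(14, 46, 10):.2%}")
```

With these period counts the results land close to the quoted 15.49% and 12.6%. Using the number of elapsed years instead (8 and 9) gives higher rates, so it is worth checking which convention a forecast uses before comparing figures.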
“Big Data” - “Big Deal”
• Government
• European Data Governance Act
• health data: improving personalised treatments, providing better healthcare, and helping cure rare or chronic diseases, saving approximately €120 billion a year in the EU health sector and providing a more effective and quicker response to the global COVID-19 health crisis;
• mobility data: saving more than 27 million hours of public transport users’ time and up to €20 billion a year in labour costs of car drivers thanks to real-time navigation;
• environmental data: combatting climate change, reducing CO₂ emissions and fighting emergencies, such as floods and wildfires;
• agricultural data: developing precision farming, new products in the agri-food sector and new services in general in rural areas;
• public administration data: delivering better and more reliable official statistics and contributing to evidence-based decisions.
• Private Sector
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
• Facebook handles 40 billion photos from its user base.
• Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts worldwide.
• Science
• Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days.
• Biomedical computation, such as decoding the human genome and personalized medicine
• Social science revolution
https://resources.sei.cmu.edu/asset_files/Presentation/2014_017_101_89659.pdf
What is Big Data for the Businesses
Big data is the elephant in the boardroom. Companies know it’s important, but most just don’t know what to do with it. Not understanding what big data really is puts some companies at a disadvantage, since they do not see its importance or its future in its entirety.
According to a study of over 1,600 businesses, a striking 76 percent lacked an understanding of the potential value of their information and therefore had not invested in the data management platforms and data management software to help them use their data strategically.
- Salesforce
Big Data: Drivers of change
How Big Data creates value?
Big Data & Data Analytics
Data Analytics
Descriptive Analytics
• Help users answer the question: “What happened and why?” Examples include traditional query and reporting environments with scorecards and dashboards.
Predictive Analytics
• Help users estimate the probability of a given event in the future. Examples include early alert systems, fraud detection, preventive maintenance applications, and forecasting.
Prescriptive Analytics
• Provide specific (prescriptive) recommendations to the user. They address the question: “What should I do if ‘x’ happens?”
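The three kinds of analytics can be contrasted on one small dataset. The following is a toy sketch, not a production method: the sales figures, the `STOCK_ON_HAND` threshold and the reorder rule are all made up for illustration.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold), for illustration only.
sales = [100, 110, 125, 130, 150, 160]

# Descriptive: what happened?
average = mean(sales)
growth = sales[-1] - sales[0]

# Predictive: fit a least-squares trend line and extrapolate one month ahead.
months = list(range(len(sales)))
m_bar, s_bar = mean(months), mean(sales)
slope = (sum((m - m_bar) * (s - s_bar) for m, s in zip(months, sales))
         / sum((m - m_bar) ** 2 for m in months))
intercept = s_bar - slope * m_bar
forecast = intercept + slope * len(sales)

# Prescriptive: turn the forecast into a recommendation (toy rule).
STOCK_ON_HAND = 140  # assumed inventory level
action = "reorder" if forecast > STOCK_ON_HAND else "hold"

print(f"average={average:.1f}, growth={growth}, forecast={forecast:.1f}, action={action}")
```

The descriptive step only summarises the past, the predictive step estimates a future value, and the prescriptive step maps that estimate to an action.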
Data Analytics
Types of Data Analytics
• Text analysis
• Sentiment analysis
• Social Media & Social Graph analytics
• Face recognition
• Voice analytics
• Movement analysis
• Profiling
• Segmentation
• Clustering
• Classification
• Outlier detection
• ...
ML Algorithms
• Classification & Regression: Logistic Regression (Binomial, Multinomial), Decision Tree Classifier, Random Forest Classifier, Linear SVM, Naïve Bayes, (…), Linear Regression, Generalised Linear Regression, Decision Tree Regression, Random Forest Regression, Isotonic Regression, (…), Linear Methods, Tree Ensembles (…)
• Feature Extraction, Transformation & Selection: TF-IDF, Word2Vec, Count Vectorizer, Feature Hasher, PCA, Discrete Cosine Transform, Tokenizer, Polynomial Expansion, (…), Vector Slicer, Chi-Squared Selection, (…), LSH (…)
• Recommendation: Collaborative Filtering
• Clustering: K-means, Gaussian Mixture Models, LDA, Bisecting K-means
• Frequent Pattern Mining: FP-Growth
• Basic Statistics: Correlation Calculation, Hypothesis Testing
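To make one of the listed feature-extraction techniques concrete, here is a minimal TF-IDF sketch. It uses the plain textbook weighting (term frequency times log of inverse document frequency); libraries typically apply smoothed variants, so their numbers will differ. The function name `tf_idf` and the sample sentences are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "big data needs new tools".split(),
    "open data needs open licences".split(),
    "linked data adds context".split(),
]
scores = tf_idf(docs)

# "data" occurs in every document, so it carries no weight;
# "open" is frequent in one document only, so it scores highest there.
print(scores[1])
```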
Business Problems
• Classification & class probability estimation
• Regression (value estimation)
• Similarity Matching
• Clustering
• Co-occurrence grouping
• Outlier Detection
• Profiling
• Link Prediction
• Data/Dimensionality Reduction
• Causal Modelling
Key considerations for executing an analytic strategy
• Have a purpose: a concrete business objective, not driven by technological advancements
• Link “insight” to action
• Push analytics to business end-points
• Create feedback loops
• Do not overlook the importance of properly preparing and sampling data
How Big Data Works (AWS)
• Collect. Collecting the raw data – transactions, logs, mobile devices and more – is the first challenge many organizations face when dealing with big data. A good big data platform makes this step easier, allowing developers to ingest a wide variety of data – from structured to unstructured – at any speed – from real-time to batch.
• Store. Any big data platform needs a secure, scalable, and durable repository to store data before or even after processing tasks. Depending on your specific requirements, you may also need temporary stores for data in transit.
• Process & Analyze. This is the step where data is transformed from its raw state into a consumable format – usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.
• Consume & Visualize. Big data is all about getting high value, actionable insights from your data assets. Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets. Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” – in the case of predictive analytics – or recommended actions – in the case of prescriptive analytics.
https://aws.amazon.com/big-data/what-is-big-data/
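The four stages above can be mimicked end to end in a few lines. This is a deliberately tiny, in-memory sketch: the event records, field names and the `process` function are invented for illustration, and a real platform would use dedicated ingestion, storage and BI services instead of Python lists.

```python
import json
from collections import defaultdict

# Collect: raw events as they might arrive from logs or devices (made-up data).
raw_events = [
    '{"user": "a", "action": "click", "ms": 120}',
    '{"user": "b", "action": "purchase", "ms": 340}',
    '{"user": "a", "action": "purchase", "ms": 200}',
    'not valid json',
]

# Store: keep the raw records unchanged; a list stands in for durable storage.
raw_store = list(raw_events)

# Process & analyze: parse, drop malformed records, aggregate per action.
def process(records):
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for line in records:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        stats = totals[event["action"]]
        stats["count"] += 1
        stats["total_ms"] += event["ms"]
    return dict(totals)

summary = process(raw_store)

# Consume & visualize: a plain-text report in place of a BI dashboard.
for action, stats in sorted(summary.items()):
    avg = stats["total_ms"] / stats["count"]
    print(f"{action}: {stats['count']} events, {avg:.0f} ms on average")
```

Keeping the raw store untouched means the processing step can be rerun later with different logic.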
The whole cycle
Data Analytics: Examples + Pitfalls
Data Science: An indicative workflow
Data Analysis - ML
Data Preprocess
Exploratory Analysis
Boxplots…& Outliers
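A boxplot usually flags as outliers the points lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to made-up numbers; the function name `iqr_outliers` is an assumption, and quartile conventions differ slightly between tools, so borderline points may be classified differently elsewhere.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obviously extreme value.
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))
```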
Features and Labels
Train – Validation - Test Split & Model Fitting
Optimize an objective function
Regression
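The split, objective-function and regression slides can be tied together in one small sketch. Everything here is assumed for illustration: the data is synthetic (y = 3x + 2 plus noise), the split is 60/20/20, and the objective is the mean squared error, which the closed-form least-squares line minimises.

```python
import random
from statistics import mean

random.seed(42)
# Synthetic data: y = 3x + 2 plus Gaussian noise (an assumption for the demo).
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3 * x + 2 + random.gauss(0, 1)) for x in xs]

# Shuffle, then split 60% / 20% / 20% into train / validation / test.
random.shuffle(data)
train, val, test = data[:120], data[120:160], data[160:]

def fit_line(points):
    """Least-squares slope and intercept, i.e. the minimiser of the MSE."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    return slope, y_bar - slope * x_bar

def mse(points, slope, intercept):
    """Mean squared error: the objective function being minimised."""
    return mean((y - (slope * x + intercept)) ** 2 for x, y in points)

slope, intercept = fit_line(train)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"validation MSE={mse(val, slope, intercept):.2f}")
print(f"test MSE={mse(test, slope, intercept):.2f}")
```

The model is fitted on the training set only. The validation set is meant for comparing candidate models, and the test set is touched once at the end for an unbiased error estimate.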
Correlation vs Causation
How to fit a model ?
(Under- / Over-) Fitting a model
Representative Datasets
Binary Classification
Multi-class classification
Overfitting Classification
Clustering I
Clustering II
Clustering III
Clustering IV – the curse of dimensionality
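One face of the curse of dimensionality is that distances between random points become nearly equal as the number of dimensions grows, which weakens distance-based clustering. The sketch below illustrates this with uniformly random points; the function name `distance_contrast` and all parameters are assumptions for the demonstration.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one random query point
    to n_points random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast between nearest and farthest neighbour shrinks as dim grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```

When the contrast is small, "nearest" and "farthest" neighbours are almost equally far away, so cluster assignments based on distance carry little information.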
Self-selection bias
Hypothesis
“Students who attend a test preparation course get better scores on the course’s final exams”
Higher test scores might be observed among students who choose to participate in the preparation course itself
Due to self-selection, there may be a number of differences between the people who choose to take the course and those who choose not to, such as motivation, socioeconomic status, or prior test-taking experience.
An outcome might be that those who elect to do the preparation course would have achieved higher scores in the actual test anyway.
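A small simulation makes the self-selection argument tangible. In the sketch below the course has no effect at all by construction, yet course takers still score higher, because a hidden trait (here called `motivation`, an assumed variable) drives both enrolment and exam scores. All numbers are invented.

```python
import random
from statistics import mean

random.seed(1)

takers, non_takers = [], []
for _ in range(10_000):
    motivation = random.gauss(0, 1)  # hidden trait, never observed
    # More motivated students are more likely to enrol (assumed behaviour).
    enrols = motivation + random.gauss(0, 1) > 0
    # The exam score depends on motivation only: the course has NO effect here.
    score = 60 + 10 * motivation + random.gauss(0, 5)
    (takers if enrols else non_takers).append(score)

gap = mean(takers) - mean(non_takers)
print(f"course takers score {gap:.1f} points higher on average")
```

A naive comparison of the two groups would credit the course with the whole gap. Random assignment to the course, or controlling for the hidden trait, would be needed to estimate its real effect.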
Selection Bias
• Undercoverage occurs when some members of the population are inadequately represented in the sample. A classic example of undercoverage is the Literary Digest voter survey, which predicted that Alfred Landon would beat Franklin Roosevelt in the 1936 presidential election. The survey sample suffered from undercoverage of low-income voters, who tended to be Democrats. Undercoverage is often a problem with convenience samples.
• Voluntary response bias occurs when sample members are self-selected volunteers, as in voluntary samples. An example would be call-in radio shows that solicit audience participation in surveys on controversial topics (abortion, affirmative action, gun control, etc.). The resulting sample tends to overrepresent individuals who have strong opinions.
• Nonresponse bias. Sometimes, individuals chosen for the sample are unwilling or unable to participate in the survey. This can be a big problem with mail surveys, where the response rate can be very low.
Open Data
In a Nutshell…
Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control, in a timely and accessible way.
Why Open Data?
• More information might lead to more informed and better decisions
• Higher degree of effectiveness and efficiency
• Strengthen trust
• Leverage benefits of peer production
• New business models
• “Peoples right to know”
1. A government org publishes data
2. Citizens & developers engage, providing feedback
3. That govt. org incorporates feedback, improving data
4. Demonstrable use inspires that govt. org to publish more
5. More data attracts more data consumers
6. Positive interaction inspires more governments to follow suit
8 Principles of Open Data
1. Data Must Be Complete
2. Data Must Be Primary
3. Data Must Be Timely
4. Data Must Be Accessible
5. Data Must Be Machine Processable
6. Access Must Be Non-Discriminatory
7. Data Formats Must Be Non-Proprietary
8. Data Must Be License-free
Open Data Publication
• Top-down approach:
• A national plan for coordinating the data publication is created by committees involving all stakeholders before public organizations actually release any data
• Defining and reaching consensus on a consistent set of terms and their relations (ontology)
• Bottom-up approach:
• Data should be published by all public organizations
• Any interested party can use the available data in raw formats
• Coordination efforts to join them together should follow at a later stage
The Open Data Publisher/Subscriber Equation
Transforming governments from data collectors → data producers → data publishers
Linked Data
Open Data... Is this enough?
★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ make it available in a non-proprietary open format (e.g., CSV instead of Excel)
★★★★ use URIs to denote things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
[Source: http://5stardata.info/en/]
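The step from three to five stars can be sketched in code. The example below starts from a CSV table (three stars), gives every row a URI (four stars) and adds a link to another dataset (five stars). The `http://example.org/` namespace, the place names and the values are placeholders, not real data; only the `owl:sameAs` property URI is an existing W3C term.

```python
import csv
import io

# Three stars: an open, non-proprietary format (CSV). Placeholder values.
csv_text = "city,population\nTownA,1000\nTownB,2500\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def to_triples(row):
    """Four stars: name each thing with a URI. Five stars: link to other data."""
    subject = f"{BASE}city/{row['city']}"
    return [
        (subject, f"{BASE}prop/population", row["population"]),
        # Link to an external dataset so consumers can follow it for context.
        (subject, SAME_AS, f"{BASE}other-dataset/{row['city']}"),
    ]

triples = [t for row in rows for t in to_triples(row)]
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Each (subject, predicate, object) triple is the basic unit of Linked Data; real publications would serialise them in an RDF format rather than as Python tuples.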