eTransactions & Data Science I:

Academic year: 2022


(1)

eTransactions & Data Science I:

The Power of Big Data & Data Analytics

Electronic Transactions

(2)

Defining “Big Data”

(3)

Context

(4)

More context

(5)

“Big” context

(6)

Rise of data markets

(7)

Rise of open data portals

(8)

Then, Big Data happens

(9)

Then, Big Data happens

(10)

What is ...?

• Big Data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing applications software. - Wikipedia

• Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. - Gartner, 2001

• Big Data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. - SAS

(11)

What is ...?

• Big Data refers to data that would typically be too expensive to store, manage, and analyze using traditional (relational and/or monolithic) database systems. Usually, such systems are cost-inefficient because of their inflexibility for storing unstructured data (such as images, text, and video), accommodating “high-velocity” (real-time) data, or scaling to support very large (petabyte-scale) data volumes. - Google Cloud Platform

• Big Data can be described in terms of data management challenges that – due to increasing volume, velocity and variety of data – cannot be solved with traditional databases. - Amazon, AWS

(12)

The initial 3 V’s...

1 Volume

The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.

2 Velocity

Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real time or near real time and will require real-time evaluation and action.

3 Variety

Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.

(13)

... and the next 2 Vs.

4 Veracity

Refers to the biases, noise, and abnormality in data. Data collected and stored from various sources, in different forms, is often inaccurate. We therefore have to deal with large amounts of poor-quality data that is imprecise and uncertain.

Quality and accuracy are less controllable, so veracity is the biggest challenge in data analysis when compared to things like volume and velocity.

5 Value

Refers to our ability to turn our data into value. Having endless amounts of data is one thing, but unless it can be turned into value it is useless. It is important that businesses make a case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the business value they will bring.

(14)

Focusing on Value

V for value sits at the top of the big data pyramid. This refers to the ability to transform a tsunami of data into business.

Business value is in the insights, which were not available before. Acting upon the insights is imperative.

The most important part of embarking on a big data initiative is to understand the costs and benefits of collecting and analyzing the data to ensure that ultimately the data that is reaped can be monetized.

(15)

The key is to make a difference

(16)

Putting it all together

(17)

Beyond the 5 Vs

Validity

Like veracity, validity concerns whether the data is correct and accurate for the intended use.

Volatility

Big data volatility refers to how long data remains valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.

Visualization

Refers to the challenge of visualizing big data. Current big data visualization tools face technical challenges due to limitations of in-memory technology and poor scalability, functionality, and response time. You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data.

(18)

The characteristics of Big Data (again)

(19)

Big Data Sources

• Social Networks

• Log Files

• Web Traffic

• Clicks

• Network Data

• Activity Data

• Streaming Endpoints

• Images

• Photos

• Free Speech Text

• Documents

• ERPs

• CRMs

• Transactions (commercial transactions, banking/stock records, e-commerce, credit cards)

• Internet of Things (traffic, weather, GPS, mobile, satellite)

• Open Data

(20)

Data sources evolution

ERP: Purchase detail, Purchase record, Payment record

CRM: Segmentation, Offer details, Customer touches, Support contacts

WEB: A/B testing, Dynamic pricing, Affiliate networks, Search marketing, Behavioural targeting, Dynamic funnels, Web logs, Offer history

BIG DATA: Sensors / RFID / Devices, User click stream, Mobile web, User generated content, Sentiment, Social interactions & feeds, Spatial GPS coordinates, External demographics, Business data feeds, Video, Audio, Images, Speech to text, Product/Service logs, Messages

Axes: Increasing Data Variety & Complexity; data size from Megabytes through Gigabytes and Terabytes to Petabytes

BIG DATA = TRANSACTIONS + INTERACTIONS + OBSERVATIONS

Source: NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison, A. B. M. Moniruzzaman, Syed Akhter Hossain

(21)

Big Data Facts – How Big is Big Data?

https://digitalmarketingphilippines.com/big-data-8-surprising-facts-to-know-infographic/

(22)

Big Data Facts – How Big is Big Data?

(23)

Big Data Challenges

Source: https://resources.sei.cmu.edu/asset_files/Presentation/2014_017_101_89659.pdf

(24)

Big Data for the Businesses

(25)

Big Data Market Trends and Forecasts

Source: https://statista.com

Big data market size revenue forecast worldwide from 2011 to 2027 (in billion U.S. dollars)

(26)

Big Data Market Trends and Forecasts

Nearly 50% of respondents to a recent McKinsey Analytics survey say analytics and Big Data have fundamentally changed business practices in their sales and marketing functions.

Also, more than 30% say the same about R&D across industries, with respondents in High Tech and Basic Materials & Energy reporting the greatest number of functions being transformed by analytics and Big Data.

Source: Analytics Comes of Age, published in January 2018

(27)

Big Data Market Trends and Forecasts

Big Data applications and analytics is projected to grow from $5.3B in 2018 to $19.4B in 2026, attaining a CAGR of 15.49%.

Professional Services in the worldwide Big Data market is projected to grow from $16.5B in 2018 to $21.3B in 2026.

Source: Wikibon, reported by Statista.

(28)

Big Data Market Trends and Forecasts

Comparing the worldwide demand for advanced analytics and Big Data-related hardware, services and software, the latter category’s dominance becomes clear.

The software segment is projected to increase the fastest of all categories, increasing from $14B in 2018 to $46B in 2027, attaining a CAGR of 12.6%.

Sources: Wikibon; SiliconANGLE; Statista estimates, reported by Statista.
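The growth figures on these slides are compound annual growth rates (CAGR), computed as (end / start)^(1 / periods) - 1. The sketch below applies that formula to the software-segment figures. The number of periods is an assumption, because the slides do not say how the years were counted, so the loop tries both plausible counts. The function name `cagr` is illustrative.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate over `periods` compounding steps."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


# Software segment as quoted on the slide: $14B (2018) to $46B (2027).
# Assumption: 9 periods if 2018 is the base year, 10 if every calendar
# year from 2018 to 2027 is counted as a growth year.
for periods in (9, 10):
    print(f"{periods} periods: {cagr(14.0, 46.0, periods):.1%}")
```

Because the rate is sensitive to the period count, it is worth checking which convention a source uses before comparing CAGR figures across reports.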

(29)

“Big Data” - “Big Deal”

• Government

• European Data Governance Act

• health data: improving personalised treatments, providing better healthcare, and helping cure rare or chronic diseases, saving approximately €120 billion a year in the EU health sector and providing a more effective and quicker response to the global COVID-19 health crisis;

• mobility data: saving more than 27 million hours of public transport users’ time and up to €20 billion a year in labour costs of car drivers thanks to real-time navigation;

• environmental data: combatting climate change, reducing CO₂ emissions and fighting emergencies, such as floods and wildfires;

• agricultural data: developing precision farming, new products in the agri-food sector and new services in general in rural areas;

• public administration data: delivering better and more reliable official statistics and contributing to evidence-based decisions.

• Private Sector

• Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes of data

• Facebook handles 40 billion photos from its user base.

• Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide

• Science

• Large Synoptic Survey Telescope will generate 140 terabytes of data every 5 days.

• Biomedical computation, like decoding the human genome & personalized medicine

• Social science revolution

https://resources.sei.cmu.edu/asset_files/Presentation/2014_017_101_89659.pdf

(30)

What is Big Data for the Businesses

Big data is the elephant in the boardroom. Companies know it’s important, but most just don’t know what to do with it. Not understanding what big data really is puts some companies at a disadvantage, since they do not see its importance or its future in its entirety.

According to a study of over 1,600 businesses, an outstanding 76 percent of businesses lack an understanding of the potential value of their information and therefore have not invested in the data management platforms and data management software to help them use their data strategically.

Salesforce

(31)

Big Data: Drivers of change

(32)

How Big Data creates value?

(33)

Big Data & Data Analytics

(34)

Data Analytics

Descriptive Analytics

• Help users answer the question: “What happened and why?” Examples include traditional query and reporting environments with scorecards and dashboards.

Predictive Analytics

• Help users estimate the probability of a given event in the future. Examples include early alert systems, fraud detection, preventive maintenance applications, and forecasting.

Prescriptive Analytics

• Provide specific (prescriptive) recommendations to the user. They address the question: “What should I do if ‘x’ happens?”

(35)

Data Analytics

(36)

Data Analytics

(37)

Data Analytics

(38)

Types of Data Analytics

• Text analysis

• Sentiment analysis

• Social Media & Social Graph analytics

• Face recognition

• Voice analytics

• Movement analysis

• Profiling

• Segmentation

• Clustering

• Classification

• Outlier detection

• ...

(39)

ML Algorithms

Classification & Regression: Logistic Regression (Binomial, Multinomial), Decision Tree Classifier, Random Forest Classifier, Linear SVM, Naïve Bayes, (…), Linear Regression, Generalised Linear Regression, Decision Tree Regression, Random Forest Regression, Isotonic Regression, (…), Linear Methods, Tree Ensembles (…)

Feature Extraction, Transformation & Selection: TF-IDF, Word2Vec, Count Vectorizer, Feature Hasher, PCA, Discrete Cosine Transform, Tokenizer, Polynomial Expansion, (…), Vector Slicer, Chi-Squared Selection, (…), LSH (…)

Recommendation: Collaborative Filtering

Clustering: K-means, Gaussian Mixture Models, LDA, Bisecting K-means

Frequent Pattern Mining: FP-Growth

Basic Statistics: Correlation Calculation, Hypothesis Testing

(40)

Business Problems

• Classification & class probability estimation

• Regression (value estimation)

• Similarity Matching

• Clustering

• Co-occurrence grouping

• Outlier Detection

• Profiling

• Link Prediction

• Data/Dimensionality Reduction

• Causal Modelling

(41)

Key considerations for executing an analytic strategy

• Have a purpose: a concrete business objective, not driven by technological advancements

• Link “insight” to action

• Push analytics to business end-points

• Create feedback loops

• Do not overlook the importance of properly preparing and sampling data

(42)

How Big Data Works (AWS)

• Collect. Collecting the raw data – transactions, logs, mobile devices and more – is the first challenge many organizations face when dealing with big data. A good big data platform makes this step easier, allowing developers to ingest a wide variety of data – from structured to unstructured – at any speed – from real-time to batch.

• Store. Any big data platform needs a secure, scalable, and durable repository to store data prior to or even after processing tasks. Depending on your specific requirements, you may also need temporary stores for data in-transit.

• Process & Analyze. This is the step where data is transformed from its raw state into a consumable format – usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualization tools.

• Consume & Visualize. Big data is all about getting high value, actionable insights from your data assets. Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualization tools that allow for fast and easy exploration of datasets. Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” – in the case of predictive analytics – or recommended actions – in the case of prescriptive analytics.

https://aws.amazon.com/big-data/what-is-big-data/
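The four stages above can be sketched as a toy pipeline. This is only an illustration on a handful of made-up records held in memory. It does not use any AWS service, and the function names (`collect`, `store`, `process`, `consume`) are chosen here to mirror the stage names, not taken from any library.

```python
import json
from collections import defaultdict

# Collect: raw events as they might arrive from logs or devices (made-up data).
raw_events = [
    '{"user": "a", "amount": 12.5}',
    '{"user": "b", "amount": 7.0}',
    'not valid json',
    '{"user": "a", "amount": 3.5}',
]


def collect(lines):
    """Parse raw lines, skipping records that cannot be decoded."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def store(records):
    """Stand-in for a durable repository: here just an in-memory list."""
    return list(records)


def process(records):
    """Aggregate raw records into a consumable format (total per user)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["user"]] += record["amount"]
    return dict(totals)


def consume(totals):
    """Present the result, sorted by descending total."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


report = consume(process(store(collect(raw_events))))
print(report)
```

In a real platform each stage would be a separate, scalable service. The point here is only the order of the stages and the hand-off between them.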

(43)

The whole cycle

(44)

Data Analytics: Examples + Pitfalls

(45)

Data Science: An indicative workflow

(46)

Data Analysis - ML

(47)

Data Preprocess

(48)

Exploratory Analysis

(49)

Boxplots…& Outliers
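A boxplot conventionally flags points that lie more than 1.5 interquartile ranges beyond the quartiles (Tukey's rule). The following is a minimal sketch assuming that convention, since the slide does not state which rule it uses. The data are made up and the function name `iqr_outliers` is illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements with one obviously extreme value.
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))
```

Whether a flagged point is an error or a genuine extreme observation is a judgment call that the rule itself cannot make.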

(50)

Features and Labels

(51)

Train – Validation - Test Split & Model Fitting

Optimize an objective function
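As a hedged illustration of this slide, the sketch below splits a synthetic dataset into train, validation, and test parts and fits a straight line by minimising an objective function (mean squared error) with gradient descent. The data, the 60/20/20 split, the learning rate, and the step count are all assumptions chosen for the example.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
data = [(i / 10, 2.0 * (i / 10) + 1.0 + random.gauss(0, 0.5)) for i in range(100)]
random.shuffle(data)

# 60% train, 20% validation, 20% test.
train, val, test = data[:60], data[60:80], data[80:]


def mse(points, w, b):
    """Objective function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def fit(points, lr=0.01, steps=5000):
    """Minimise the MSE on `points` with plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = fit(train)
print(f"w={w:.2f}, b={b:.2f}")
print(f"train MSE={mse(train, w, b):.3f}")
print(f"val MSE={mse(val, w, b):.3f}")    # used to compare candidate models
print(f"test MSE={mse(test, w, b):.3f}")  # reported once, at the end
```

The model only ever sees the training part. The validation part guides choices between candidate models, and the test part is held back for a single final estimate.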

(52)

Regression

(53)

Correlation vs Causation

(54)

How to fit a model ?

(55)

(Under- / Over-) Fitting a model

(56)

Representative Datasets

(57)

Binary Classification

(58)

Multi-class classification

(59)

Overfitting Classification

(60)

Clustering I

(61)

Clustering II

(62)

Clustering III

(63)

Clustering IV – the curse of dimensionality

(64)

Self-selection bias

Hypothesis

“Students who attend a test preparation course get better scores on the course’s final exams”

Higher test scores might be observed among students who choose to participate in the preparation course itself

Due to self-selection, there may be a number of differences between the people who choose to take the course and those who choose not to, such as motivation, socioeconomic status, or prior test-taking experience.

An outcome might be that those who elect to do the preparation course would have achieved higher scores in the actual test anyway
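The effect described above can be reproduced in a small simulation. In this sketch the course has no effect at all by construction: scores depend only on a hidden "motivation" variable, and motivated students are simply more likely to enrol. All numbers (enrolment probabilities, score scale) are made up for illustration.

```python
import random
import statistics

random.seed(1)

students = []
for _ in range(10_000):
    motivation = random.gauss(0, 1)
    # Self-selection: motivated students are more likely to choose the course.
    takes_course = random.random() < (0.8 if motivation > 0 else 0.2)
    # The course itself contributes nothing to the score in this simulation.
    score = 60 + 10 * motivation + random.gauss(0, 5)
    students.append((takes_course, score))

course = [score for took, score in students if took]
no_course = [score for took, score in students if not took]
gap = statistics.mean(course) - statistics.mean(no_course)

print(f"course: {statistics.mean(course):.1f}")
print(f"no course: {statistics.mean(no_course):.1f}")
print(f"gap: {gap:.1f}")
```

A naive comparison of the two groups would credit the course with the gap, even though the gap comes entirely from who chose to enrol.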

(65)

Selection Bias

• Undercoverage occurs when some members of the population are inadequately represented in the sample. A classic example of undercoverage is the Literary Digest voter survey, which predicted that Alfred Landon would beat Franklin Roosevelt in the 1936 presidential election. The survey sample suffered from undercoverage of low-income voters, who tended to be Democrats. Undercoverage is often a problem with convenience samples.

• Voluntary response bias occurs when sample members are self-selected volunteers, as in voluntary samples. An example would be call-in radio shows that solicit audience participation in surveys on controversial topics (abortion, affirmative action, gun control, etc.). The resulting sample tends to overrepresent individuals who have strong opinions.

• Nonresponse bias. Sometimes, individuals chosen for the sample are unwilling or unable to participate in the survey. This can be a big problem with mail surveys, where the response rate can be very low.

(66)

Open Data

(67)

In a Nutshell…

Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control in a timely and accessible way.

(68)

Why Open Data?

• More information might lead to more informed and better decisions

• Higher degree of effectiveness and efficiency

• Strengthen trust

• Leverage benefits of peer production

• New business models

• “People’s right to know”

(69)

1. A government org publishes data

2. Citizens & developers engage, providing feedback

3. That govt. org incorporates feedback, improving data

4. Demonstrable use inspires that govt. org to publish more

5. More data attracts more data consumers

6. Positive interaction inspires more governments to follow suit

(70)

8 Principles of Open Data

1. Data Must Be Complete

2. Data Must Be Primary

3. Data Must Be Timely

4. Data Must Be Accessible

5. Data Must Be Machine Processable

6. Access Must Be Non-Discriminatory

7. Data Formats Must Be Non-Proprietary

8. Data Must Be License-free

(71)

Open Data Publication

• Top-down approach:

• A national plan for coordinating the data publication is created by committees involving all stakeholders before public organizations actually release any data

• Defining and reaching consensus on a consistent set of terms and their relations (ontology)

• Bottom-up approach:

• Data should be published by all public organizations

• Any interested party can use the available data in raw formats

• Coordination efforts to join them together should follow at a later stage

(72)

The Open Data Publisher/Subscriber Equation

Transforming governments from data collectors → data producers → data publishers

(73)

Linked Data

(74)

Open Data.. Is this enough ?

★ make your stuff available on the Web (whatever format) under an open license

★★ make it available as structured data (e.g., Excel instead of image scan of a table)

★★★ make it available in a non-proprietary open format (e.g., CSV instead of Excel)

★★★★ use URIs to denote things, so that people can point at your stuff

★★★★★ link your data to other data to provide context

[Source: http://5stardata.info/en/]

(75)

Linked Data – the idea

• The main strength of the world wide web lies in the ability to link between different web pages.

• This way, a webpage may provide its customer a link to another website in order to retrieve additional information about a topic.

• Could we apply the same principle on data?

• Linked data, much like websites, can live in different places, be maintained by different organizations, and still be used as a single system from the user’s perspective.

Linked Data are structured data which are interlinked with other data, so that they become more useful through semantic queries
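To make the idea concrete, here is a minimal sketch of two tiny datasets expressed as subject-predicate-object triples. Every URI, property name, and value below is made up for illustration, and a real deployment would use an RDF store and a query language such as SPARQL instead of Python lists.

```python
# All URIs below are hypothetical examples.
TRANSPORT = "http://transport.example.org/"
CITY = "http://city.example.net/"

# Dataset 1: maintained by a (hypothetical) transport operator.
transport_triples = [
    (TRANSPORT + "stop/42", TRANSPORT + "prop/name", "Central Station"),
    (TRANSPORT + "stop/42", TRANSPORT + "prop/locatedIn", CITY + "district/7"),
]

# Dataset 2: maintained by a (hypothetical) city administration.
city_triples = [
    (CITY + "district/7", CITY + "prop/name", "Old Town"),
    (CITY + "district/7", CITY + "prop/partOf", CITY + "city/1"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Both datasets use the same URI for the district, so they merge into one graph.
graph = transport_triples + city_triples


def district_name_of_stop(stop_uri):
    """Follow stop -> district (dataset 1), then district -> name (dataset 2)."""
    names = []
    for district in objects(graph, stop_uri, TRANSPORT + "prop/locatedIn"):
        names.extend(objects(graph, district, CITY + "prop/name"))
    return names


print(district_name_of_stop(TRANSPORT + "stop/42"))
```

The query crosses organisational boundaries only because the transport dataset refers to the district by the city's own URI, which is the fourth and fifth star of the 5-star scheme in practice.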

(76)

Linked Data – Characteristics

• Linked Data allow us to easily reference the same entity in different datasets.

• Using linked data we can refer to and extend data external to our organization.

• Linked data usage is ideal for data exchange between different systems, especially when each of the systems maintains only part of the overall information regarding each entity.

• Linked data usage reduces the cost of data exchange and maintenance, while increasing the cost of data generation and usage.

• Benefits of linked data greatly depend on correct usage of the paradigm and well-designed datasets.

(77)

Linked Data – Examples

(78)

Linked Open Data Cloud (a while ago…)

(79)

Big Data Analytics Technologies

(80)

Big Data Landscape (2012)

(81)

Big Data Landscape (2016)

(82)

Big Data Landscape (2018)

(83)

Big Data Technologies

There are six primary needs that Big Data technologies address:

• 1. Distributed Storage and Processing

• 2. Non-Relational database with Low latency

• 3. Streams and Complex Event Processing

• 4. Data Processing of Special big data data-types

• 5. In-Memory Processing

• 6. Reporting

(84)

QUESTIONS

Tsapelas I.

Dimitropoulos N.
