
DATA VISUALIZATION: FINDING PICTURES IN NUMBERS


Pratap Vardhan, Data Scientist, Gramener

@PratapVardhan

(2)
(3)
(4)

A DATA VISUALISATION CHALLENGE

You will see 3 questions. You have 30 seconds. Try it!

Your timer starts now

(5)

1. HOW MANY NUMBERS ARE ABOVE 100?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
(6)

2. HOW MANY NUMBERS ARE BELOW 10?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
(7)

3. WHICH QUADRANT HAS HIGHEST TOTAL?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
(8)

A DATA VISUALISATION

The same questions again, but with a few visual cues. See how long it takes now.

Your timer starts now

(9)

1. HOW MANY NUMBERS ARE ABOVE 100?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
(10)

2. HOW MANY NUMBERS ARE BELOW 10?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
(11)

3. WHICH QUADRANT HAS HIGHEST TOTAL?

23 32 71 72 58 87 11 77 70 16 17 21 56 44 68 51 84 20 60 40 37 8 107 14 12 41 69 14 18 71 62 55 59 64 33 55 71 58 103 92 101 56 45 34 43 15 73 78 6 93 39 53 22 26 26 94 60 82 99 74 11 12 36 67 70 71 97 59 73 99 75 74 69 69 51 48 2 66 92 98 15 10 41 58 104 94 92 84 74 82 12 52 10 57 33 77 88 81 81 91 15 56 25 30 21 7 66 66 78 87 29 23 5 34 11 96 74 99 99 88 37 10 43 15 50 71 65 60 101 98 46 34 19 102 57 70 95 84 63 91 3 34 39 37 60 81 65 63 9 71 48 46 25 50 22 64 91 76 71 79
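The visual cues on the original slides did not survive as text. As a rough sketch of the idea, the snippet below marks the values that answer a question so they stand out from the rest of the grid. It uses only the first 40 numbers of the grid; the `highlight` helper, the bracket markers and the ten-per-row layout are assumptions made for illustration, not the slide's actual design.

```python
def highlight(values, predicate, per_row=10):
    """Return the values as a text grid, wrapping matches in [ ] as a visual cue."""
    rows = []
    for start in range(0, len(values), per_row):
        cells = []
        for v in values[start:start + per_row]:
            text = f"[{v}]" if predicate(v) else f" {v} "
            cells.append(text.rjust(6))
        rows.append("".join(cells))
    return "\n".join(rows)


# First 40 numbers of the grid (row width of 10 is an assumption).
sample = [23, 32, 71, 72, 58, 87, 11, 77, 70, 16,
          17, 21, 56, 44, 68, 51, 84, 20, 60, 40,
          37, 8, 107, 14, 12, 41, 69, 14, 18, 71,
          62, 55, 59, 64, 33, 55, 71, 58, 103, 92]

above_100 = [v for v in sample if v > 100]
below_10 = [v for v in sample if v < 10]

print(highlight(sample, lambda v: v > 100))
print(len(above_100), "above 100;", len(below_10), "below 10")
```

With the matches bracketed, the count can be read at a glance instead of by scanning every number.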

(12)

YOU WILL BE SHOWN A SET OF NUMBERS ALONG WITH A SUMMARY (AVERAGE, ETC)

CAN YOU MAKE SENSE OF THE FIGURES?

(13)

Take a look at the sales report below. A company has branches in 4 cities, and each branch changes the product price every month. This leads to a corresponding change in sales.

Here is the performance of the 4 branches, with the monthly price and sales for each. Going by the averages, the four branches perform identically.

2010             Boston       Chicago       Detroit      New York
Month      Price  Sales  Price  Sales  Price  Sales  Price  Sales
Jan         10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
Feb          8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
Mar         13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
Apr          9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
May         11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
Jun         14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
Jul          6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
Aug          4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
Sep         12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
Oct          7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
Nov          5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89
Average      9.0   7.50    9.0   7.50    9.0   7.50    9.0   7.50
Variance    10.0   3.75   10.0   3.75   10.0   3.75   10.0   3.75

Average price is the same. Average sales is the same too.

Variance in price is the same. So is the variance in sales.
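The identical summary rows can be reproduced from the monthly figures. The sketch below recomputes the average and variance per branch using Python's standard `statistics` module. It assumes the Variance row is a population variance (`pvariance`), since that is what matches the printed 10.0 and 3.75; the variable names are illustrative.

```python
from statistics import mean, pvariance

# Monthly figures for Jan to Nov, copied from the sales report.
shared_prices = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
branches = {
    "Boston": (shared_prices,
               [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Chicago": (shared_prices,
                [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Detroit": (shared_prices,
                [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "New York": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                 [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One summary row per branch: averages and population variances, rounded as in the report.
summary = {
    city: {
        "avg_price": round(mean(price), 1),
        "avg_sales": round(mean(sales), 2),
        "var_price": round(pvariance(price), 1),
        "var_sales": round(pvariance(sales), 2),
    }
    for city, (price, sales) in branches.items()
}

for city, row in summary.items():
    print(city, row)
```

All four branches produce the same rounded summary, which is exactly why the summary alone cannot tell them apart.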

DO THESE FOUR CITIES LOOK IDENTICAL TO YOU?

(14)

ARE THEY REALLY IDENTICAL? CHECK AGAIN

But in fact, the four cities are totally different in behaviour.

Boston’s sales generally increase with price.

Detroit has a nearly perfect increase in sales with price, except for one aberration.

Chicago shows a decline in sales beyond a price of 10.

New York’s sales fluctuate despite a nearly constant price.

[Figure: four panels, one per city: Boston, Chicago, New York, Detroit]

(15)

We handle terabyte-size data via non-traditional analytics and visualise it in real-time.

Gramener visualises your data

Gramener transforms your data into concise dashboards that make your business problem & solution visually obvious. We help you find insights quickly, based on cognitive research, and our visualisations guide you towards actionable decisions.

(16)
(17)
(18)

100 YEARS OF INDIA'S WEATHER

[Figure: timeline with axis marks every ten years from 1901 to 2001]
(19)
(20)

IN 2014 ELECTIONS, WHICH STATE PRODUCED MOST NUMBER OF CROREPATI CANDIDATES?

AND WHICH STATE HAS HIGHEST % OF CROREPATI

(21)

GEOGRAPHY OF CANDIDATE WEALTH

Uttar Pradesh, with over 400 crorepati candidates, tops the list.

The North-eastern states have the largest percentage of crorepati candidates.

(22)

AMONG THE MAINSTREAM PARTIES, WHICH PARTY HAS HIGHEST % OF CRIMINAL

(23)

CRIMINAL CASES

MNS seems to be the winner here, closely followed by RJD and MDMK.

Size: number of candidates. Color: % of criminal candidates.

(24)
(25)
(26)

AND, ONE MORE THING..

NAMESAKES OF 2014

(27)

CHANDU LALS OF MAHASAMUND

Winner’s margin: 1,217 votes

Namesakes polled: 60,000+ votes

(28)

MOST OF WHAT I DO TODAY IS VISUALISING DATA ANOMALIES

YOU DON'T NEED SOPHISTICATED ANALYSES FOR THIS
(29)

EDUCATION

PREDICTING MARKS

What determines a child’s marks?

Do girls score better than boys?

Does the choice of subject matter?

Does the medium of instruction matter?

Does community or religion matter?

Does their birthday matter?

(30)

LET'S LOOK AT 15 YEARS OF US BIRTH DATA

This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.

For example,

• Are birthdays uniformly distributed?

• Do doctors or parents exercise the C-section option to move dates?

• Is there any day of the month that has unusually high or low births?

• Are there any months with relatively high or low births?

• Very high births in September. But this is fairly well known: most conceptions happen during the winter holiday season.

• Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.

• Most people prefer not to have children on the 13th of any month, since it is considered an unlucky day.

• Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.
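Patterns like these show up once births are counted per calendar day. The sketch below builds the month-by-day grid that such a calendar view would colour, and expresses each cell relative to a typical day. The function names are illustrative, and the sample dates are invented purely to exercise the code; they are not taken from the 1975 – 1990 dataset.

```python
from collections import Counter
from datetime import date


def month_day_counts(birth_dates):
    """Count births per (month, day) cell of the calendar grid."""
    return Counter((d.month, d.day) for d in birth_dates)


def relative_to_average(counts):
    """Express each cell as a ratio to the mean cell count (1.0 = a typical day)."""
    avg = sum(counts.values()) / len(counts)
    return {cell: n / avg for cell, n in counts.items()}


# Invented sample dates (assumption), only to show the shape of the output.
sample = [date(1980, 9, 16), date(1981, 9, 16), date(1982, 9, 16),
          date(1980, 12, 25), date(1983, 2, 14), date(1984, 2, 14)]

counts = month_day_counts(sample)
ratios = relative_to_average(counts)
for cell in sorted(ratios):
    print(cell, counts[cell], round(ratios[cell], 2))
```

Ratios well above or below 1.0 are the cells a heatmap would render in its strongest colours.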

(31)

THE PATTERN IN INDIA IS QUITE DIFFERENT

This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.

For example,

• Is there an aversion to the 13th or is there a local cultural nuance?

• Are holidays avoided for births?

• Which months have a higher propensity for births, and why?

• Are there any patterns not found in the US data?

• Very few children are born in the month of August, and thereafter. Most births are concentrated in the first half of the year.

• We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, on round numbered dates.

• Such round numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.
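A round-date spike of this kind can be checked with a plain frequency count, with no sophisticated analysis needed. The sketch below flags days of the month that occur far more often than a uniform spread would suggest. The `overrepresented_days` helper, the 1.5 ratio and the synthetic input are all assumptions for illustration; the baseline also ignores that some months are shorter than 31 days.

```python
from collections import Counter


def overrepresented_days(days, ratio=1.5):
    """Return days of the month whose count exceeds `ratio` times a uniform baseline."""
    counts = Counter(days)
    expected = len(days) / 31  # crude uniform baseline over 31 possible days
    return sorted(d for d, n in counts.items() if n > ratio * expected)


# Synthetic input (assumption): 3 records on every day, plus extras on round dates.
days = [d for d in range(1, 32) for _ in range(3)]
days += [5, 10, 15, 20, 25] * 6

flagged = overrepresented_days(days)
print(flagged)
```

On real admission records, the same count would be run on the day-of-month field of each recorded birth date.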

(32)

THIS ADVERSELY IMPACTS CHILDREN'S MARKS

It’s a well established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer.

Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.

[Figure legend: higher marks vs. lower marks, on average, for children born on a given day of the year (from 2007 to 2013)]

Children “born” on round numbered days score lower marks on average, due to a higher proportion of younger children
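The comparison behind this claim is a grouped average. The sketch below splits records into round numbered days and all other days, then averages the marks in each group. The set of round days (1, 5, 10, 15, 20, 25) is an assumption that expands the "etc." above, and the sample records are invented to show the mechanics only; they say nothing about the real results.

```python
from collections import defaultdict
from statistics import mean

# Assumed reading of "1st, 5th, 10th, 15th etc." (20 and 25 are an assumption).
ROUND_DAYS = {1, 5, 10, 15, 20, 25}


def mean_marks_by_group(records):
    """Average marks for children recorded as born on round days vs. other days."""
    groups = defaultdict(list)
    for day_of_month, marks in records:
        key = "round" if day_of_month in ROUND_DAYS else "other"
        groups[key].append(marks)
    return {key: mean(values) for key, values in groups.items()}


# Invented (day of month, marks) records, purely illustrative.
records = [(1, 61), (5, 58), (10, 63), (3, 70), (12, 66), (27, 71)]

result = mean_marks_by_group(records)
print(result)
```

The same grouping applied per day of the year, rather than per group, would give the values plotted in the figure.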

(33)

EXPLORING THE MAHABHARATA

How does Mahabharata, one of the largest epics with 1.8 million words, lend itself to text analytics?

Can this ‘unstructured data’ be processed to extract analytical insights?

What does sentiment analysis of this tome convey?

Is there a better way to explore relations between characters?

How can closeness of characters be analysed & visualized?
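One simple way to approach the closeness question is to count how often two characters are named in the same sentence. The slides do not say which method was actually used, so the sketch below is only an assumed starting point: the `cooccurrence` helper is hypothetical, and the sample sentences are invented, not quoted from the epic.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text, names):
    """Count sentences in which each pair of character names appears together."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[A-Za-z]+", sentence))
        present = sorted(name for name in names if name in words)
        pairs.update(combinations(present, 2))
    return pairs


# Invented sample text (assumption), only to exercise the function.
text = "Arjuna spoke to Krishna. Krishna advised Arjuna and Bhima. Bhima rested."
names = ["Arjuna", "Krishna", "Bhima"]

pairs = cooccurrence(text, names)
for pair, n in pairs.most_common():
    print(pair, n)
```

The resulting pair counts can serve as edge weights in a character network, where thicker or shorter links indicate closer characters.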

(34)
(35)
(36)
(37)
(38)
(39)

DETECTING FRAUD

We know meter readings are incorrect, for various reasons.

We don’t, however, have the concrete proof we need to start the process of meter reading automation.

Part of our problem is the volume of data that needs to be analysed.

The other is our inexperience with the tools and analyses needed to identify such patterns.

(40)

BILLING FRAUD AT AN ENERGY UTILITY

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.

Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).

Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
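The jump at a slab boundary is easy to see with a worked example. The sketch below bills the entire consumption at the rate of the slab it falls in, as described above. The slab limits and rates are hypothetical values chosen for illustration; they are not the utility's real tariff.

```python
# Hypothetical slabs as (upper limit in units, rate per unit); not the real tariff.
SLABS = [(100, 2.0), (200, 3.0), (float("inf"), 5.0)]


def bill(units):
    """Charge every unit at the rate of the slab that the total usage falls in."""
    for upper_limit, rate in SLABS:
        if units <= upper_limit:
            return units * rate
    raise ValueError(f"no slab found for usage {units!r}")


at_boundary = bill(100)    # last unit of the first slab
just_over = bill(101)      # first unit of the second slab

print(at_boundary, just_over, just_over - at_boundary)
```

Under these assumed rates, one extra unit raises the bill from 200.0 to 303.0, which illustrates why staying at the boundary is so attractive.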

An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.

Most fraud detection software failed to load the data, and sampled data revealed little or no insight.

This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turning off their lights and fans the instant their usage hits the slab boundary.

Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that it stays exactly at the slab boundary, giving them the advantage of a lower price.
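The pile-up at slab boundaries can be found by comparing the count at each boundary with the counts at nearby readings. The sketch below is one assumed way to do that; the `boundary_spikes` helper, the window of 2 units, the ratio of 2.0 and the synthetic readings are all illustrative choices, not the analysis actually run on the utility's data.

```python
from collections import Counter


def boundary_spikes(readings, boundaries, window=2, ratio=2.0):
    """Flag boundaries whose count is more than `ratio` times the average
    count of the readings within `window` units on either side."""
    counts = Counter(readings)
    spikes = {}
    for b in boundaries:
        neighbours = [counts[b + k] for k in range(-window, window + 1) if k != 0]
        baseline = sum(neighbours) / len(neighbours)
        # Use a floor of 1 so empty neighbourhoods do not flag tiny counts.
        if counts[b] > ratio * max(baseline, 1):
            spikes[b] = (counts[b], baseline)
    return spikes


# Synthetic meter readings (assumption): heaps at 100 and 200 units.
readings = ([98] * 4 + [99] * 5 + [100] * 20 + [101] * 1 + [102] * 3
            + [150] * 4
            + [198] * 3 + [199] * 4 + [200] * 15 + [201] * 1 + [202] * 2)

spikes = boundary_spikes(readings, boundaries=[100, 200, 300])
print(spikes)
```

Because this needs only a frequency count over every reading, it can run on the full data rather than a sample.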

(41)

LINKS

Github: https://github.com/pratapvardhan

Elections: https://gramener.com/election/

Speechopedia: https://gramener.com/speechopedia/

AAP: https://gramener.com/aapdonations/

Cricket: https://gramener.com/cricket/

Flags: https://gramener.com/flags/

(42)

Try it! All you need is some data and some curiosity to…

VISUALISE DATA YOURSELF!

@PratapVardhan

[email protected]

+91-837-460-9651
