Finding statistical patterns in Big Data

(1)

Patrick Rubin-Delanchy

University of Bristol & Heilbronn Institute for Mathematical Research

IAS Research Workshop: Data science for the real world (workshop 1)

1st May 2015

(2)

This talk is about hypothesis testing/feature extraction/anomaly detection for problems involving large amounts of data

Computational issues are largely ignored

Instead, the focus is on some of the pitfalls of hypothesis testing

The solutions presented are simple

Recommendations are meant to help exploratory research; I’m not taking any position on publishing standards

(7)

Pitfalls of hypothesis testing with Big Data

This talk focusses on two pervasive issues:

1. Hypothesis testing when the model is wrong
2. Multiple testing

(8)

The cyber-security application

Network flow data at Los Alamos National Laboratory: ∼ 30 GB/day

Typical attack pattern:

A. Opportunistic infection
B. Network traversal
C. Data exfiltration

Figure: Network traversal, source: Neil et al. (2013)

(9)

The cyber-security application

Global cost of cybercrime is estimated at $400 billion (CSIS, 2014)

The botnet behind a third of the spam sent in 2010 earned about $2.7 million, while spam prevention cost more than $1 billion (Anderson et al., 2013)

UK National Security Strategy: Priority Risks (Cabinet Office, 2010)

International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland.

Hostile attacks upon UK cyber space by other states and large scale cyber crime.

A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic.

An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

(10)

Hypothesis testing framework

1. Null hypothesis $H_0$, e.g., the drug has no effect; any difference between the two groups is due to chance
2. Alternative hypothesis $H_1$, e.g., the drug has a (positive) effect
3. Test statistic $T$, e.g., difference in treatment outcomes
4. P-value: $p = P_0(T^* \geq T)$, where $P_0$ is the distribution of $T$ under $H_0$ and $T^*$ is a replicate of $T$ under $H_0$
5. We reject the null hypothesis when $p$ is small, e.g., less than 5%

(11)

Hypothesis testing with the wrong model

Example hypothesis test:

$$H_0: \underbrace{\text{“the data are Gaussian”}}_{\text{FALSE}},\ \underbrace{\mu_1 = \mu_2}_{\text{TRUE}}$$

$$H_1: \underbrace{\text{“the data are Gaussian”}}_{\text{FALSE}},\ \underbrace{\mu_1 > \mu_2}_{\text{FALSE}}$$

It’s right to reject the null hypothesis, but wrong to accept the alternative.

In practice this seems to lead to p-values with “U-shaped” distributions.

(14)

Example: two-sample t-test

$G_1$: $10^1,\ 10^8,\ 10^4$

$G_2$: $10^3,\ 10^6,\ 10^6,\ 10^1,\ 10^3,\ 10^6,\ 10^6,\ 10^5,\ 10^5,\ 10^5$

Difference:

$$\bar{G}_1 - \bar{G}_2 \approx 3.29 \cdot 10^7$$

t-test:

$$T = \frac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}} \approx 2.03$$

P-value:

$$p \approx 0.03$$

(15)

Example: two-sample t-test

The t-test assumes:

1. $H_0$: the data are independent and Gaussian with the same mean and variance.
2. $H_1$: $\mu_1 > \mu_2$

Although we might correctly reject $H_0$, we don’t know whether we are rejecting:

a) the data are Gaussian
b) the data have the same mean
c) the data have the same variance

(16)

Example: two-sample t-test

A vector $X_1, \dots, X_n$ is exchangeable if its joint distribution is the same as that of $X_{\sigma(1)}, \dots, X_{\sigma(n)}$ for any permutation $\sigma$.

Instead of assuming a Gaussian model under $H_0$, assume

1. $H_0$: the data are exchangeable
2. $H_1$: the data are not exchangeable, $\mu_1 > \mu_2$

Under $H_0$, we can think of $T$ as a random draw from $T_1^*, \dots, T_M^*$, where $T_i^*$ is the statistic computed on the $i$th permutation of $G_1$ and $G_2$

(Formally, we are conditioning on a sufficient statistic for the unknowns)

(18)

Example: two-sample t-test

$G_1^*$: $10^5,\ 10^3,\ 10^6$

$G_2^*$: $10^6,\ 10^6,\ 10^8,\ 10^1,\ 10^6,\ 10^5,\ 10^4,\ 10^1,\ 10^5,\ 10^3$

Resampled difference:

$$\bar{G}_1^* - \bar{G}_2^* \approx -9.95 \cdot 10^6$$

Resampled statistic:

$$T^* = \frac{\bar{G}_1^* - \bar{G}_2^*}{s_{12}^*\sqrt{1/n_1 + 1/n_2}} \approx -0.53$$

(19)

Example: two-sample t-test

$G_1^*$: $10^5,\ 10^6,\ 10^8$

$G_2^*$: $10^1,\ 10^4,\ 10^1,\ 10^6,\ 10^6,\ 10^6,\ 10^5,\ 10^5,\ 10^3,\ 10^3$

Resampled difference:

$$\bar{G}_1^* - \bar{G}_2^* \approx 3.34 \cdot 10^7$$

Resampled statistic:

$$T^* = \frac{\bar{G}_1^* - \bar{G}_2^*}{s_{12}^*\sqrt{1/n_1 + 1/n_2}} \approx 2.07$$

(20)

Example: two-sample t-test

Difference:

$$\bar{G}_1 - \bar{G}_2 \approx 3.29 \cdot 10^7$$

Statistic:

$$T = \frac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}} \approx 2.03$$

P-value (permutation-based):

$$\hat{p} = \frac{1}{M + 1} \sum_{i=0}^{M} I(T_i^* \geq T) \approx 0.21,$$

where $T_0^* = T$.
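The permutation p-value can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the split of the thirteen values into a group of three and a group of ten is read off the table in the slides, and the function names, the number of permutations and the seed are all hypothetical choices. Under this split the difference in means works out to about 3.29 · 10^7 and the pooled statistic to about 2.03.

```python
import math
import random

# Assumed split of the example data: three values in G1, ten in G2.
G1 = [1e1, 1e8, 1e4]
G2 = [1e3, 1e6, 1e6, 1e1, 1e3, 1e6, 1e6, 1e5, 1e5, 1e5]


def pooled_t(g1, g2):
    """Two-sample t statistic using the pooled standard deviation s12."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss1 = sum((x - m1) ** 2 for x in g1)
    ss2 = sum((x - m2) ** 2 for x in g2)
    s12 = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (m1 - m2) / (s12 * math.sqrt(1 / n1 + 1 / n2))


def permutation_p_value(g1, g2, n_perm=9999, seed=0):
    """Monte Carlo estimate of (1 / (M + 1)) * sum_{i=0..M} I(T*_i >= T)."""
    rng = random.Random(seed)
    t_obs = pooled_t(g1, g2)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    count = 1  # the i = 0 term: T*_0 = T always satisfies T*_0 >= T
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        t_star = pooled_t(pooled[:n1], pooled[n1:])
        # Small tolerance so that relabellings tied with T are counted.
        if t_star >= t_obs - 1e-9:
            count += 1
    return count / (n_perm + 1)


t_obs = pooled_t(G1, G2)
p_hat = permutation_p_value(G1, G2)
print(f"T = {t_obs:.2f}, permutation p-value = {p_hat:.3f}")
```

The estimate should land close to the 0.21 quoted above, although the exact figure depends on the random draws. With only 286 distinct ways of choosing three values out of thirteen, an exhaustive enumeration would also be feasible here; random sampling is shown because it is what scales to larger problems.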

(21)

Reasons to consider a non-parametric approach (≈ reasons modelling could be hard):

1. Visualisation and curation are difficult (e.g. for logistical or privacy reasons)
2. The same analytic is to be used on different data sources
3. A lack of domain expertise
4. The data are complicated objects, e.g. graphs

Reasons not to:

1. Sometimes a non-parametric approach is not available
2. Model-based approaches can have greater power
3. There is a question of how to balance computational effort against simulation error

(23)

Multiple testing

Michael Jordan: “When you have large amounts of data, your appetite for hypotheses tends to get even larger.” (IEEE Spectrum, 20th October 2014).

In fact, the number of hypotheses tested often grows much faster than the data.

The basic problem: as the number of tests gets large, the probability of finding at least one significant result purely by chance becomes very high.
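A small calculation makes the point concrete. The sketch below assumes that every null hypothesis is true and that the tests are independent, so each p-value falls below the threshold with probability equal to that threshold; the function name and the 5% default are illustrative choices.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one p-value <= alpha) when all nulls hold and the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

Under these assumptions the probability is already about 40% with 10 tests and exceeds 99% with 100.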

(24)

Example: spurious correlations

Figure: Spurious correlations, source: http://www.tylervigen.com

(25)

Two approaches to multiple testing

Suppose we have p-values $p_1, \dots, p_n$. The two canonical tasks are:

1. sub-select a set for further analysis
2. combine the p-values into one overall score of significance

(26)

Sub-selection

Define the false discovery rate to be (Benjamini and Hochberg, 1995):

$$Q = \begin{cases} 0 & \text{if no hypothesis is rejected} \\ \dfrac{\#\text{incorrect rejections}}{\#\text{total rejections}} & \text{otherwise} \end{cases}$$

Benjamini and Hochberg (1995) propose that $Q$ is the quantity we want to control. For a chosen level $q$:

1. Let $p_{(1)} \leq \dots \leq p_{(n)}$ denote the ordered p-values
2. Let $k$ be the largest $i$ such that $p_{(i)} \leq \frac{i}{n} q$
3. Reject hypotheses corresponding to $p_{(1)}, \dots, p_{(k)}$

Then, if the p-values corresponding to the true null hypotheses are independent,

$$E(Q) \leq q$$
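The step-up procedure above can be sketched as follows. The function name and the convention of returning positions in the input list are illustrative choices, not something taken from the slides.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the positions (in p_values) of the hypotheses rejected at level q."""
    n = len(p_values)
    # Positions sorted by increasing p-value, so order[i - 1] holds p_(i).
    order = sorted(range(n), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * q / n:
            k = rank  # keep the largest rank that satisfies the condition
    # Reject everything up to rank k, including ranks that failed on their own.
    return sorted(order[:k])


p_values = [0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.06, 0.074, 0.205, 0.212]
print(benjamini_hochberg(p_values, q=0.05))
```

Note that the loop keeps the largest qualifying rank rather than stopping at the first failure: a p-value that misses its own threshold is still rejected if a larger one meets its threshold.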

(28)

Combining p-values I

In the second approach, we consider the joint hypothesis test:

$\tilde{H}_0$: all of the null hypotheses hold

$\tilde{H}_1$: at least one alternative holds

One method for combining p-values:

1. Let $\pi = \min_{i \in \{1, \dots, n\}} \left\{ \dfrac{n\, p_{(i)}}{i} \right\}$
2. Then if the p-values are independent under $H_0$ (Simes, 1986), $\pi \sim \text{uniform}[0, 1]$
3. Furthermore, if the p-values are positively dependent (Sarkar and Chang, 1997), $\pi \geq_{st} \text{uniform}[0, 1]$,

where $\geq_{st}$ denotes the usual stochastic order. In statistical terminology, rejecting
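As an illustration, the combined score π from step 1 can be computed as below. The function name is a hypothetical choice; the body simply evaluates the minimum of n·p_(i)/i over the ordered p-values.

```python
def simes(p_values):
    """Simes combined p-value: the minimum over i of n * p_(i) / i."""
    n = len(p_values)
    ordered = sorted(p_values)
    return min(n * p / i for i, p in enumerate(ordered, start=1))


print(simes([0.01, 0.04, 0.03, 0.5]))
```

For the four p-values in the usage line, the candidates are 4 · 0.01 / 1, 4 · 0.03 / 2, 4 · 0.04 / 3 and 4 · 0.5 / 4, so the score is driven by the smallest p-value.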

(32)

Combining p-values II

Consider the more specific “needles-in-a-haystack” scenario:

1. $n$ very large
2. Most of the tests are expected to have no signal:

$\tilde{H}_0$: all of the null hypotheses hold

$\tilde{H}_1$: a vanishing proportion of the alternatives hold

Let

$$\pi^* = \max_{0 \leq \alpha \leq \alpha_0} \sqrt{n}\, \frac{(\text{Fraction significant at } \alpha) - \alpha}{\sqrt{\alpha(1 - \alpha)}}$$

Then, under the sparsity conditions set out in Donoho and Jin (2004), $\pi^*$ will detect the alternative whenever detection is asymptotically possible in theory.
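A minimal sketch of this statistic follows. It rests on two assumptions that are not in the slides: the maximum is evaluated only at the observed p-values (taking α = p_(i), so that the fraction significant is i/n), and α_0 defaults to 0.5. The function name is hypothetical.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Higher criticism score, evaluated at the observed p-values up to alpha0."""
    n = len(p_values)
    ordered = sorted(p_values)
    best = -math.inf  # returned unchanged if no p-value lies in (0, alpha0]
    for i, p in enumerate(ordered, start=1):
        if p > alpha0:
            break
        if p <= 0.0 or p >= 1.0:
            continue  # the standardisation is undefined at alpha = 0 or 1
        # At alpha = p_(i), a fraction i / n of the tests is significant.
        score = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, score)
    return best


# An evenly spread set of p-values, then the same set with three tiny p-values.
null_like = [(i - 0.5) / 100 for i in range(1, 101)]
with_needles = [1e-6] * 3 + null_like[3:]
print(higher_criticism(null_like), higher_criticism(with_needles))
```

In the usage lines, the evenly spread p-values give a small score, while replacing just three of the hundred with very small values produces a much larger one, which is the sparse-signal behaviour the statistic is designed to pick up.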

(34)

Conclusion

1. We’ve advertised a few simple techniques that can help with hypothesis testing at scale
2. Computational issues have been largely ignored, e.g. the permutation test is more effort (but we can control that)
3. Some of the concepts touched upon (e.g. exchangeability, dependence, stochastic orders) have a much deeper theory that is possibly very relevant to the theory of Big Data.

(35)

Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The economics of information security and privacy, pages 265–300. Springer.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300.

Cabinet Office and National security and intelligence (2010). A strong Britain in an age of uncertainty: the national security strategy.

Center for Strategic and International Studies (2014). Net losses: Estimating the global cost of cybercrime. Economic impact of cybercrime II.

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, pages 962–994.

Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414.

Sarkar, S. K. and Chang, C.-K. (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440):1601–1608.

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754.
