Program for statistical comparison of histograms

(1)

Program for statistical comparison of

histograms

S. Bityukov (IHEP, INR), N. Tsirova (NPI MSU)

Scope

• Introduction

• Statistical comparison of two histograms

• Normalized significance

• Example

• “Distance measure”

• Internal resolution of the method

• Script stat analyzer.C

• Missing ET: large difference between histograms • PT light jet: only statistical difference

• Output of the script

(2)

Introduction

Sometimes we have two histograms and are faced with the question: Are the realizations of random variables in the form of these histograms taken from the same statistical population?

There are many approaches for comparison of two histograms: Kolmogorov-Smirnov test, Cramer-von-Mises test, Anderson-Darling test, Likelihood ratio test for shape and for slope, ... .

The review on this subject is in the paper “Testing Consistency of Two Histograms” by F. Porter.

http://arxiv.org/abs/0804.0380

Usually, a special test-statistic is used for comparing the two histograms

(3)

Statistical comparison of two histograms

A method, which uses the distribution of test-statistics

for each bin of histograms, is proposed in note arXiv:1302.2651. These test-statistics obeys the distribution close to

standard normal distribution if both histograms are obtained from the same statistical population by the same analyzer of data. Correspondingly, the joint dis-tribution must be close to standard normal distribu-tion.

For example, if we consider observed values which are produced by the same Poisson flow of events dur-ing equal independent time ranges then a test statistic

ˆ Sc12 = 2· ( √ ˆ ni1 − √ ˆ ni2),

where nˆi1 is a number events in bin i of histogram

1, nˆi2 is a number events in bin i of histogram 2, is

the test statistic of such type. It is shown in paper PoS(ACAT08)118.

Often test-statistics of such type are named as “sig-nificances of a deviation” or “sig“sig-nificances of an en-hancement”.

(4)

Normalized significance

Let us consider a model with two histograms where the random variable in each bin obeys the normal dis-tribution ϕ(x_|nik) = 1 √ 2πσik e− (x₋_nik₎₂ 2σ2 ik . (1)

Here the expected value in the bin i is equal to nik and

the variance σ_n2_ik is also equal to nik. k is the histogram

number (k = 1,2).

Let integral in the histogram 1 equals N₁ and

inte-gral in the histogram 2 equals N₂ (for example, if we

have different integrated luminosity in experiments). We define the normalized significance as

ˆ Si(K) = ˆ ni1 − Knˆi2 r ˆ σ_n2_i₁ + K2σˆ_n2_i₂. (2)

Here nˆik is an observed value in the bin i of the

his-togram k, σˆ_n2_ik = ˆnik and coefficient of normalization

K = N1 N2

(5)

Example

h1 Entries 1000 Mean 666.6 RMS 235.8 0 100 200 300 400 500 600 700 800 900 1000 0 200 400 600 800 1000 h1 Entries 1000 Mean 666.6 RMS 235.8 hist1 u1 Entries 1000 Mean 499 RMS 290.3 0 100 200 300 400 500 600 700 800 900 1000 -3 -2 -1 0 1 2 3 u1 Entries 1000 Mean 499 RMS 290.3 s_i h2 Entries 1000 Mean 665.8 RMS 235.6 0 100 200 300 400 500 600 700 800 900 1000 0 100 200 300 400 500 600 700 800 900 1000 h2 Entries 1000 Mean 665.8 RMS 235.6 hist2 u1 Entries 1000 Mean -0.01873 RMS 1.035 -4 -2 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 u1 Entries 1000 Mean -0.01873 RMS 1.035 Signif

Figure 1: Triangle distributions (K = 2, M = 1000): the observed values ˆni₁ in the

first histogram (left,up), the observed values ni₂ in the second histogram (left, down),

observed normalized significances ˆSi bin-by-bin (right, up), the distribution of observed

normalized significances (right, down).

The example with histograms produced from the same events flow during unequal independent time ranges shows that the standard deviation (RMS in the ROOT notation) of the distribution in the picture (right, down) can be used as estimator of the statisti-cal difference between histograms (this distribution is close to standard normal distribution in our example).

(6)

“Distance measure”

The RMS as a “distance measure” in our case has a clear interpretation (in fact, we set a scale of this “differmeter”):

• RM S = 0 – histograms are identical;

• RM S ∼ 1 – both histograms are obtained (by the

using independent samples) from the same parent distribution;

• RM S >> 1 – histograms are obtained from

differ-ent pardiffer-ent distributions.

An accuracy (internal resolution) of the method

de-pends on the number of bins M in histograms and on

the normalized coefficient K. The accuracy can be

estimated via Monte Carlo experiments.

p1 Entries 5000 Mean 1.005 RMS 0.04395 / ndf 2 χ 101.8 / 110 Constant 111.2 ± 2.1 Mean 1.004 ± 0.001 Sigma 0.04157 ± 0.00048 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 0 20 40 60 80 100 120 p1 Entries 5000 Mean 1.005 RMS 0.04395 / ndf 2 χ 101.8 / 110 Constant 111.2 ± 2.1 Mean 1.004 ± 0.001 Sigma 0.04157 ± 0.00048 RMS

Figure 2: Distribution of RMS for 5000 comparisons of histograms (triangle distribution,

(7)

Internal resolution of the method

The dependencies of the internal resolution on the

bin numbers M and on the value of coefficient of

nor-malization K are shown in Fig. 3.

M, bins 0 200 400 600 800 1000 RMS σ 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 K = 1 K = 0.5 The precision of the RMS estimation

Monte Carlo

(5000 trials for each point) standard deviation of RMS & binning for histogram

Figure 3: The dependence of the standard deviation of RMS on number of binsM. This

(8)

Script stat analyzer.C

Two input files (*.root) with a set of TH1D his-tograms to compare.

User should indicate which histograms he/she wants to compare.

Processing:

– calculate mean value, RMS and p-value for each pair of histograms;

– sort variables by RMS in descending order; – print sorted results

(variable – p-value – RMS – mean). Output:

– to screen; – to text file.

Commands for start-up of script in ROOT root[0] .L stat analyzer.C++

(9)

Missing

E

T

: large difference between

histograms

0 10 20 30 40 50 60 70 80 90 100 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08

MET, signal _Entriess1 ₂₀

Mean 54.33 RMS 31.1 0 10 20 30 40 50 60 70 80 90 100 -15 -10 -5 0 5 10 15 s1 Entries 20 Mean 54.33 RMS 31.1 Significances, bin-by-bin s_i 0 10 20 30 40 50 60 70 80 90 100 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

MET, anomalous Wtb coupling

d1 Entries 20 Mean 1.993 RMS 10.74 -20 -15 -10 -5 0 5 10 15 20 0 0.5 1 1.5 2 2.5 3 3.5 4 d1 Entries 20 Mean 1.993 RMS 10.74 Distribution of significances p-value = 1.7231041929895835e-32 Signif

Figure 4: MET: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down).

Let us consider the distributions of the probability

(with errors) of the missing ET to be in corresponding

bin of histogram in the case of registration of Standard Model events and the events which are produced due to the presence of anomalous Wtb coupling. The com-parison of these distributions is shown in Fig. 4 (RMS

(10)

P

T

light jet: only statistical difference

40 60 80 100 120 140 160 180 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 Pt_LJ, signal _Entriess1 ₂₀ Mean 83.32 RMS 55.64 0 20 40 60 80 100 120 140 160 180 -1.5 -1 -0.5 0 0.5 1 1.5 s1 Entries 20 Mean 83.32 RMS 55.64 Significances, bin-by-bin s_i 40 60 80 100 120 140 160 180 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 Pt_LJ, anomalous Wtb coupling d1 Entries 20 Mean 0.1446 RMS 0.9824 -10 -8 -6 -4 -2 0 2 4 6 8 10 0 1 2 3 4 5 6 7 8 d1 Entries 20 Mean 0.1446 RMS 0.9824 Distribution of significances p-value = 0.54364536413492959 Signif

Figure 5: Pt LJ: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down).

The case of the absence of the difference between distributions for the production of single top quark in frame of SM and model with anomalous Wtb coupling is shown in Fig. 5 for Pt distribution of jet from light

(11)

Output of the script

(12)

Conclusions

• We discussed the possible tool for comparison of

histograms in frame of the ROOT system (script stat analyzer.C).

• This tool very easy in use and very easy to

under-stand the results.

• This tool can be used in tasks of monitoring of the

equipment. In this moment the method is used for choice of the most significant variables in mul-tivariate analysis.

• This tool requires the additional development (now

(13)

η

of lepton: the difference exits

-2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 Eta_Lep, signal s1 Entries 20 Mean 1.245 RMS 0.6696 0 0.5 1 1.5 2 2.5 -2 -1 0 1 2 3 s1 Entries 20 Mean 1.245 RMS 0.6696 Significances, bin-by-bin s_i -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 0 0.02 0.04 0.06 0.08 0.1

Eta_Lep, anomalous Wtb coupling Entries d1 20 Mean 0.2672 RMS 1.374 -10 -8 -6 -4 -2 0 2 4 6 8 10 0 1 2 3 4 5 6 7 d1 Entries 20 Mean 0.2672 RMS 1.374 Distribution of significances p-value = 0.16255840792058071 Signif

Figure 7: Eta Lep: the observed values in the first histogram (left,up), the observed values in the second histogram (left, down), observed normalized significances bin-by-bin (right, up), the distribution of observed normalized significances (right, down).

The case of small difference between the histograms is shown in Fig. 7 (RMS = 1.37, Mean = 0.29).