Classification of High Dimensional Small Sample Genetic Data by Forward Maximum Likelihood Ratio Stepwise Logistic Regression

(1)

2019 International Conference on Artificial Intelligence and Computing Science (ICAICS 2019) ISBN: 978-1-60595-615-2

Classification of High Dimensional Small Sample Genetic Data by

Forward Maximum Likelihood Ratio Stepwise Logistic Regression

Li-hao WANG, Qiao-han CHU, Yi-ying ZHANG and Kun-ping ZHU

*

Department of Mathematics, East China University of Science and Technology, Shanghai, China.

*Corresponding author

Keywords: Classification, Logistic regression,Likelihood ratio, Principal components.

Abstract. Analyzing genetic data has been one of the effective ways in early diagnosis of cancer. However, the method of machine learning usually needs the support of large data. For the small sample and with high dimensional genetic data, the accuracy of classification is generally not ideal by means of machine learning directly. This paper deals with the modeling of classification for high dimensional genetic data with only 62 samples. By extracting the principal components, it shows that the cumulative contribution rate of the first nine principal components can reach 80.73%, but the classification effect based on the principal components is not desirable. For the sake of reduction of the dimensions, the test of maximum Likelihood Ratio is applied for the selection of variables in logistic stepwise regression, which ensures that only the variables with significant influence on classification can enter the model. With the procedure, the final model fits well and is of good predicting performance.

Introduction

(2)

Data Information

The data to be analyzed comes from question A of 2010 National Post-Graduate Mathematical Contest in Modeling. The given data file contains the log2 intensity of the 2000 genes with highest minimal intensity (rows) of the 62 samples (columns), including 22 normal samples and 40 cancer samples. Based on the training data, the problem is to establish a classification model for the purpose of cancer prediction.

Fist, all the 2000 characteristics’ value are normally distributed. Due to the small sample data, we take K-S test into account. As shown in Table 1 of the third characteristic data for example, the Sig. is 0.2, which means the normality of the 3rd variable underSignificance level 0.05.For more intuitive, the Q-Q graph is given in figure 1, where x-axis stands for the observed value and y-axis stands for the expected standard value.

Table 1. Normality test.

Normality Test

Kolmogorov-Smirnova _Shapiro-Wilk

statistics df Sig. statistics df Sig.

[image:2.612.100.510.238.560.2]

3 .082 62 .200* _.977 ₆₂ _.302

Figure 1. Standard Q-Q graph.

Although there are 2000 characteristics, there is a strong correlation between them. Taking the first 100 variables as an example, the range of correlation is from -0.223 to 0.994. Thereby, we can understand the data information by extracting the principal components. The extraction result shows that the cumulative contribution rate of the first 2 principal components is 55.9%, and the cumulative contribution of the first 9 principal components reaches 80.73%. We try to classify according to the 9 principal components, but the practice shows thataccuracy of classification is not high by principal components.

(3)

[image:3.612.185.418.69.245.2]

Figure 2. Distribution of the two principal components.

Logistic Regression

In general, Logistic Regression [3,4] is a method to solve binary problems. Through estimating parameters of the logistic model to fit the data, it can predict the binary results. Conditions can be divided into “Yes” or “not”, they are often labeled as “1” or “0” as indicator variable. For instance, it can predict the probability of developing a certain disease (cancer), according to observed characteristics dependent on different people (various gene expression data). We suppose that x1, x2,

..., xn are n variables of different characteristics, y is the presence (y=1) or absence (y=0) of a certain

result and the p indicates the probability of presence (y=1). Then the following equation describes the relationships between different characteristics and p value [5]:

) ... ( 0 11

1

n nx

x

e

p

₋_β ₊_β ₊ ₊_β

+

=

(1)

where the coefficients βi are the parameters of the model associated with xi, including constant term

β0.We hereby use the gene related characteristics to substitute xi. Note that p is interpreted as the

probability of the dependent characteristics equaling the presence of the case rather than the absence. (e.g. if p=0.6, it tells that a patient have a 60% chance of having a malignant tumor ).

Generally, we consider that

otherwise p y

0

5 . 0 1

{ ≥

= (2)

which hereby means if p≥0.5 then the patient has cancer, otherwise the patient is healthy.

Estimation of Parameters

According to what has been supposed above, we suppose that l=βT_x+_β

0, where β= (β1, β2,..., βn)T. Then

from logistic function, we can get

l

e

l ₋

+ =

1 1 ) ( φ

(3)

Due to the fact that p indicates the probability of “1”, we can obtain that

y y

l l

x y

P = − 1−

)) ( 1 ( ) ( ) ; |

( β φ φ . And we suppose that there are m samples, then the likelihood function is

(4)

∏

= − =

−

=

m j y j j y j m j j j j

l

x

y

P

L

1 1 ) ( ) ( ) ( 1 ) ( ) ( ()

))

(

1

(

))

(

)

;

|

(

)

(

β

φ

(4)

Logarithm to the formula (4), we get:



= − − + = m j j j j j l In y l In y L In 1 ) ( ) ( ) ( ) ( ))) ( 1 ( ) 1 ( )) ( ( ( )) (

( β φ φ

(5)

where () () ₀

β

β +

= T j

j x

l , j indicates the jth sample, y(j)indicates the real value of the jth sample and ((j)) l

φ

indicates the predict value of the jth _sample.

If we put a negative sign in front of it, we should strive for minimize this function, which means that it can be used as a cost function:



=

−

+

−

=

−

=

m j j j j j

l

In

y

l

In

y

L

In

J

1 ) ( ) ( ) ( ) (

)))

(

1

(

)

1

(

))

(

))

(

)

(

β

φ

(6) )) ( 1 ( ) 1 ( )) ( ( ) ; ), (

( l y yIn l y In l

J

φ

β

=−

φ

− − −

φ

₍₇₎

Now we use gradient descent to calculate the parameters:

β β

β_：= +∆ _;∆β=-η∇J(β)_;βk：=βk+∆βk_; k J β β η β ∂ ∂ − =

∆ _k ( )

(8)

where β_k indicates the kth_{characteristic coefficient;}_η_{indicates the learning rate to control the steps.}

Because of φ'(l)=φ(l)(1−φ(l)), we can get that



= = − − = ∂ ∂ − − − − = ∂ ∂ m j j k j j m j j j j j j j k x l y l l y l y J 1 ) ( ) ( ) ( 1 ) ( ) ( ) ( ) ( ) ( )) ( ( ) ( ) ) ( 1 1 ) 1 ( ) ( 1 ( ) ( φ β φ φ φ β β (9)

Therefore, the maximum likelihood estimation of the parameters is



= − + = m j j k j j x l y 1 ) ( ) ( ) ( k

k

β

η

(

φ

( ))

β

：

(10)

Likelihood Ratio Test

Basic principles: we start from a simple regression formula with only a constant term, then pick some of the variables which contribute most to the regression into the regression formula to minimize residual sum of squares and at the same time to avoid multi-collinearity.

We suppose that (X1,X2,...,Xn) are samples obtained from population ζ, (x1,x2,...,xn) are observed

value corresponding to the samples. The density function of population ζ is p(x,θ), θ∈Θ, Where

Θ=Θ0∪Θ1 indicates the value space of θ, and Θ0∩Θ1=∅, the test of parameter θ: H0:θ∈Θ0,

H1:θ∈Θ1, we take

) ( sup ) ( sup ) ,..., , ( 0 2 1 θ θ λ θ θ L L x x x _n Θ ∈ Θ ∈

= (11)

as the statistic of the generalized likelihood ratio and call it likelihood ratio function [8].

(5)

)) ,..., , ( ( 2

- In λ x₁ x₂ x_n (12)

which obey chi square distribution whose degree of freedom is 1 [10,11].

Analysis and Conclusion

We first use Z-score standardization to standardize all the characteristics by:

i i

S x x

z= i− ₍₁₃₎

We calculate the maximum likelihood estimation before and after considering this variable in the function and then put them into (13). We set a threshold for it, if it’s smaller than the threshold and if those variables pass the test with given confidence level, they can be taken into account. The analysis process is shown in Figure 3 as follows:

[image:5.612.105.507.254.357.2]

rd M

Figure 3. Process of analysis.

According to the steps given above, we have considered all the variables and finally 5 variables that meet the requirement are taken into account. They are 493rd_{, 721}st_{, 1622}nd_{, 1818}th_{, 1891}st_variables.

The intermediate results of analysis are given in Table 2.

Table 2. Variables in the equation.

B S.E. Wald df Sig. Exp(B)

Step1 Zvar 493

Step1 Constant

-2.246 0.977 0.590 0.381 14.475 6.567 1 1 0.000 0.010 106 2.658

Step2 Zvar 493

Step1 Zvar1622

Step1 Constant

-3.735 1.969 1.671 1.045 0.610 0.607 12.778 10.417 7.574 1 1 1 0.000 0.001 0.006 0.024 7.164 5.315 Step3 Zvar 493

Step1 Zvar1622

Step1 Zvar1891

Step1 Constant

-6.875 2.240 3.390 2.418 2.238 0.805 1.335 1.085 9.438 7.747 6.446 4.971 1 1 1 1 0.002 0.005 0.011 0.026 0.001 9.389 29.654 11.225 Step4 Zvar 493

Step1 Zvar721

Step1 Zvar1622

Step1 Zvar1891

Step1 Constant

-15.022 -4.197 6.625 7.5 6.334 7.828 2.870 3.611 3.817 3.756 3.682 2.138 3.365 3.860 2.843 1 1 1 1 1 0.055 0.144 0.067 0.049 0.092 0.000 0.015 753.372 1807.993 563.3 Step5 Zvar 493

Step1 Zvar721

Step1 Zvar1622

Step1 Zvar1818

Step1 Zvar1891

Step1 Constant

[image:5.612.81.535.450.722.2]

(6)

According to the calculation, the final regression function is:

) 935 . 87 568 . 126 721 . 1 2 -094 . 93 892 . 51 454 . 231

( ₄₉₃ ₇₂₁ ₁₆₂₂ ₁₈₁₈ ₁₈₉₁

1

+ +

+ −

− −

+

=

_x _x _x _x _x

e

y

₍₁₄₎

The result of classification is given in Table 3, which shows that the final regression function (14) can correctly classify the 62 samples with only 5 characteristics and is of 100% prediction accuracy. Although the model has perfectly fitted the data, there are still problems with this seemingly wonderful result according to related reference [12]. One of the most probable problems is that it’s over-fitting under some circumstances. Theoretically, further research for the method of forward stepwise likelihood ratio logistic regression is needed.

Table.3 Classification table.

Observed

Predicted

Yes or No Percentage Correct

0 1

Step 1 Yes or No 0

Step 1 Yes or No 1

Step 1 Overall Percentage

16 4

6 36

72.7 90.0 93.9 Step 2 Yes or No 0

Step 1 Yes or No 1

19 4

3 36

86.4 90.0 88.7 Step 3 Yes or No 0

Step 1 Yes or No 1

22 2

0 38

100.0 95.0 96.8 Step 4 Yes or No 0

Step 1 Yes or No 1

22 1

0 39

100.0 97.5 98.4 Step 5 Yes or No 0

Step 1 Yes or No 1

22 0

0 40

100.0 100.0 100.0

References

[1] J.D. Cohen et al., Science 10.1126 / science.aar3247 (2018).

[2] L. Qi and J. Shen, Construction of recurrence model of patients with hepatocellular carcinoma by gene mutation information in TCGA database combined with machine learning software RapidMiner, Chinese Journal of Liver Diseases (Electronic Version), 2018 10(3).

[3] Walker, S.H.; Duncan, D.B. (1967). “Estimation of the probability of an event as a function of several independent variables”. Biometrika. 54 (1/2): 167–178. doi:10.2307/2333860. JSTOR 2333860

[4] Cox, DR (1958). “The regression analysis of binary sequences (with discussion)”. J Roy Stat Soc B. 20 (2): 215–242. JSTOR 2983890.

[5] M. Hassan, M.A. Butt and M.Z. Baba, Logistic Regression Versus Neural Networks: The Best Accuracy in Prediction of Diabetes Disease, Asian Journal of Computer Science and Technology, ISSN: 2249-0701 Vol. 6 No. 2, 2017, pp. 33-42.

[6] Z. Zhou, Machine Learning [M], Beijing: Tsinghua University Press, 2016: 53-68.

(7)

[8] J. Liu, K. Zhu and Y. Lu, Applied Mathematical Statistics [M], Shanghai: East China University of Science and Technology Press, 2014: 95-98, 141-148.

[9] Hosmer, David W.; Lemeshow, Stanley (2000). Applied Logistic Regression (2nd ed.). Wiley. ISBN 978-0-471-35632-5.

[10]Cohen, Jacob; Cohen, Patricia; West, Steven G.; Aiken, Leona S. (2002). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Routledge. ISBN 978-0-8058-2223-6.

[11]Allison, Paul D. “Measures of Fit for Logistic Regression”. Statistical Horizons LLC and the University of Pennsylvania.