Imaging preprocessing and feature extraction software summary

12.21. Imaging preprocessing and feature extraction software summary

In summary, Figure 12.1 shows the module dependencies within the architecture of the software for image processing and feature extraction.

13. Software documentation: Feature selection and ML models

This chapter documents in full the radiomics implementations for feature selection and ML modelling. The implementations follow the theory presented in section 4.

13.1. radiomics_functions.py

This module defines a class that provides functions for feature preprocessing tasks, such as PCA application and feature correlation assessment. It uses the libraries Scikit-learn, Pandas and NumPy.

• f_corr_assestment(self, df, d=1.0):

assesses correlations between the features of the input DataFrame df, using the correlation threshold d, which defaults to 1.0 (a Pearson correlation coefficient of 100%).

Listing 13.1: Python code for correlation assessment of a DataFrame object, radiomics_functions.py.

    def f_corr_assestment(self, df, d=1.0):
        # extract correlation matrix from the pandas method.
        corrMatrix = df.corr()
        corrMatrix.loc[:, :] = np.tril(corrMatrix, k=-1)
        already_in = set()
        result = []
        for col in corrMatrix:
            perfect_corr = corrMatrix[col][corrMatrix[col] >= d].index.tolist()
            if perfect_corr and col not in already_in:
                already_in.update(set(perfect_corr))
                perfect_corr.append(col)
                result.append(perfect_corr)
        return result
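The behaviour of the correlation assessment can be illustrated on a small example. The sketch below restates the logic of Listing 13.1 as a free function; the name correlated_groups and the toy DataFrame are assumptions made for illustration and are not part of radiomics_functions.py.

```python
import numpy as np
import pandas as pd

def correlated_groups(df, d=1.0):
    """Group columns whose pairwise Pearson correlation is at least d."""
    corr = df.corr()
    # Keep only the strict lower triangle so each pair is seen once.
    corr.loc[:, :] = np.tril(corr, k=-1)
    already_in = set()
    result = []
    for col in corr:
        matched = corr[col][corr[col] >= d].index.tolist()
        if matched and col not in already_in:
            already_in.update(matched)
            matched.append(col)
            result.append(matched)
    return result

# Hypothetical toy data: "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})
groups = correlated_groups(toy, d=0.99)
print(groups)
```

A threshold slightly below 1.0 is used in the example because floating-point rounding can leave the coefficient of two perfectly correlated columns marginally under 1.0.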

• apply_PCA_in_f(self, X_train, X_test, cluster):

applies principal component analysis (PCA), imported from Scikit-learn, to the training (X_train) and test (X_test) DataFrame objects, using the lists of highly correlated features in cluster.

Listing 13.2: Python code for PCA applications of DataFrame objects, radiomics_functions.py.

    def apply_PCA_in_f(self, X_train, X_test, cluster):
        # Initialize storage of pca features
        X_train_pca = np.zeros((X_train.shape[0], len(cluster)))
        X_test_pca = np.zeros((X_test.shape[0], len(cluster)))
        # Initialize PCA method from scikit learn to combine correlating feature to one single component
        pca = PCA(n_components=1)
        # for each patient in 'data': apply PCA to each of the clusters in 'clustered_features'
        for i in range(len(cluster)):
            comp1, comp2 = [X_train[cluster[i]], X_test[cluster[i]]]
            comp1 = pca.fit_transform(comp1)
            comp2 = pca.transform(comp2)
            X_train_pca[:, i] = comp1[:, 0]
            X_test_pca[:, i] = comp2[:, 0]
        # Translate the numpy object to the original pandas object
        X_train_pca = pd.DataFrame(X_train_pca)
        # Rename features
        names = []
        for j in range(len(cluster)):
            names.append(cluster[j][-1])
        X_train_pca.columns = names
        X_test_pca = pd.DataFrame(X_test_pca)
        X_test_pca.columns = names
        return X_train_pca, X_test_pca
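A short usage sketch may help to show what the PCA step produces. The function pca_per_cluster and the synthetic data below are assumptions for illustration only; unlike Listing 13.2, the sketch creates a fresh PCA object per cluster, which should give the same result because fit_transform refits the model on every call.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_per_cluster(X_train, X_test, clusters):
    """Replace each cluster of correlated columns by its first principal component."""
    train_out, test_out = {}, {}
    for cluster in clusters:
        pca = PCA(n_components=1)
        # The component is named after the last feature of the cluster.
        name = cluster[-1]
        # Fit on the training data only, then project the test data.
        train_out[name] = pca.fit_transform(X_train[cluster])[:, 0]
        test_out[name] = pca.transform(X_test[cluster])[:, 0]
    return pd.DataFrame(train_out), pd.DataFrame(test_out)

# Hypothetical data: "b" is perfectly correlated with "a".
rng = np.random.default_rng(0)
base = rng.normal(size=20)
data = pd.DataFrame({"a": base, "b": 2.0 * base, "c": rng.normal(size=20)})
X_train, X_test = data.iloc[:15], data.iloc[15:]

train_pca, test_pca = pca_per_cluster(X_train, X_test, [["b", "a"]])
print(train_pca.shape, test_pca.shape)
```

Fitting on the training split and only transforming the test split keeps information from the test patients out of the learned component.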

13.2. filter_selection.py

The module contains utilities for filter feature selection methods (cf. section 4.3). It uses the functions chi2, f_classif and mutual_info_classif from the module Scikit-learn.feature_selection, as well as mutual_info_score from Scikit-learn.metrics, and SciPy.stats.
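As a minimal sketch of how the three Scikit-learn scoring functions named above are called, the example below applies them to synthetic data. The data and variable names are assumptions for illustration; the rescaling step reflects the fact that chi2 only accepts non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 40 samples, 4 features, binary target.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 4))
X[:, 1] += 3.0 * y  # make feature 1 informative about y

# chi2 needs non-negative inputs, so the features are rescaled to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)
chi2_scores, chi2_pvalues = chi2(X_scaled, y)

# f_classif returns scores and p-values; mutual_info_classif returns scores only.
f_scores, f_pvalues = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

print(int(np.argmax(f_scores)))
```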

• pearson_scorer(X, y):

computes the Pearson correlation coefficient between two ndarrays, X and y.

• kendall_score(X, y):

computes the Kendall correlation coefficient between two ndarrays, X and y.

• spearman_score(X, y):

computes the Spearman correlation coefficient between two ndarrays, X and y.
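The three correlation scorers can be sketched with SciPy.stats as follows. The helper correlation_scores and the toy arrays are hypothetical and only indicate one way such scorers might be written; the actual signatures and return values in filter_selection.py may differ.

```python
import numpy as np
from scipy import stats

def correlation_scores(X, y, corr_func):
    """Correlation of every column of X with y, using a scipy.stats function."""
    scores = np.empty(X.shape[1])
    for k in range(X.shape[1]):
        # Each scipy.stats function returns (coefficient, p-value).
        coefficient, _pvalue = corr_func(X[:, k], y)
        scores[k] = coefficient
    return scores

# Hypothetical data: column 0 increases with y, column 1 decreases with y.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([2.0 * y, -y])

pearson = correlation_scores(X, y, stats.pearsonr)
kendall = correlation_scores(X, y, stats.kendalltau)
spearman = correlation_scores(X, y, stats.spearmanr)
print(pearson, kendall, spearman)
```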

• mutual_info_feature_selection(X, y):

computes the mutual information feature selection relevance coefficient between two ndarrays, X and y.
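The idea behind this coefficient (mutual information with the target minus the summed mutual information with all other features) can be sketched for discrete features. The function relevance_minus_redundancy and the toy arrays are assumptions for illustration; the sketch passes one-dimensional label arrays to mutual_info_score, which is the input form that function expects.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def relevance_minus_redundancy(X, y):
    """Per-feature score: MI with the target minus summed MI with other features."""
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    for i in range(n_features):
        redundancy = sum(
            mutual_info_score(X[:, i], X[:, j])
            for j in range(n_features) if j != i
        )
        scores[i] = mutual_info_score(X[:, i], y) - redundancy
    return scores

# Hypothetical discrete data: feature 0 copies y, feature 1 is independent of both.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
X = np.column_stack([y, np.array([0, 1, 0, 1, 0, 1, 0, 1])])

scores = relevance_minus_redundancy(X, y)
print(scores)
```

In this example the first feature scores ln 2 (in nats), since it determines the binary target completely, while the second scores zero.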

Listing 13.3: Python code for computation of mutual information feature selection coefficient of DataFrame objects, filter_selection.py.

    def mutual_info_feature_selection(X, y):
        # initialize relevance score per feature
        P = np.array([])
        for i in range(X.shape[1]):
            temp = 0
            for j in range(X.shape[1]):
                if j != i:
                    # computes mutual information coefficient across feature space
                    temp += mutual_info_score(X[:, [i]], X[:, [j]])
            # computes relevance score accordingly
            P = np.append(P, mutual_info_score(X[:, [i]], y.reshape(-1, 1)) - temp)
        return P

• conditional_mutual_information(X, Y, Z):

calculates the conditional mutual information relevance coefficient between three ndarrays, X, Y and Z.

Listing 13.4: Python code for computation of conditional mutual information coefficient of DataFrame objects, filter_selection.py.

    def conditional_mutual_information(X, Y, Z):
        # Initialize score per feature
        P = np.array([])
        # initialize number of bins for histogram in target label classes
        n_bins_z = len(np.unique(Z))
        for i in range(X.shape[1]):
            I = 0
            for j in range(Y.shape[1]):
                if j != i:
                    n_bins_x = len(Z)
                    n_bins_y = len(Z)
                    argument_xyz = np.concatenate((X[:, [i]], Y[:, [j]], Z.reshape(-1, 1)), axis=1)
                    argument_xz = np.concatenate((X[:, [i]], Z.reshape(-1, 1)), axis=1)
                    argument_yz = np.concatenate((Y[:, [j]], Z.reshape(-1, 1)), axis=1)
                    # construct conditional joint probability distribution
                    p_xyz, _ = np.histogramdd(argument_xyz, bins=[n_bins_x, n_bins_y, n_bins_z])
                    # construct conditional probability distribution
                    p_xz, _, _ = np.histogram2d(argument_xz[:, 0], argument_xz[:, 1], bins=[n_bins_x, n_bins_z])
                    p_yz, _, _ = np.histogram2d(argument_yz[:, 0], argument_yz[:, 1], bins=[n_bins_y, n_bins_z])
                    # construct probability distribution
                    p_z, _ = np.histogram(Z, bins=n_bins_z)
                    p_xyz = p_xyz/np.sum(p_xyz)
                    p_xz = p_xz/np.sum(p_xz)
                    p_yz = p_yz/np.sum(p_yz)
                    p_z = p_z/np.sum(p_z)
                    # computes argument of the equation for conditional mutual information coefficient
                    p_xy_z = p_z*p_xyz
                    log_p_xy_z = p_xyz*np.log10(p_xy_z)
                    temp = p_xz*p_yz
                    log_temp = p_xyz*np.log10(temp)
                    a = log_p_xy_z - log_temp
                    # computes coefficient for numerical values
                    I += np.sum(a[np.isfinite(a)])
            P = np.append(P, I)
        return P

• joint_mutual_information(X, y):

calculates the joint mutual information coefficient based on the conditional_mutual_information function.

• conditional_infomax(X, y):

computes the conditional infomax feature extraction coefficient based on the conditional_mutual_information and mutual_info_feature_selection functions.
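Since the joint mutual information and conditional infomax coefficients both rest on conditional mutual information, a compact sketch of that quantity may be useful. The function below is not the histogram-based implementation of Listing 13.4; it is a hypothetical alternative for discrete variables that uses the identity I(X;Y|Z) = sum over z of p(z) I(X;Y | Z=z), and its name and the toy data are assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mi_discrete(x, y, z):
    """I(X;Y|Z) in nats for discrete 1-D arrays, averaging MI over the strata of z."""
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    total = 0.0
    for value in np.unique(z):
        mask = z == value
        # mask.mean() is the empirical probability p(Z = value).
        total += mask.mean() * mutual_info_score(x[mask], y[mask])
    return total

# Hypothetical XOR example: y = x XOR z, so x alone tells nothing about y,
# but x and y are fully dependent once z is known.
x = np.array([0, 0, 1, 1, 0, 0, 1, 1])
z = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y = np.bitwise_xor(x, z)

marginal_mi = mutual_info_score(x, y)
conditional_mi = conditional_mi_discrete(x, y, z)
print(marginal_mi, conditional_mi)
```

The example shows why conditioning matters for feature selection: a feature with zero marginal relevance can become fully informative in combination with another variable.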

13.3. feature_selection.py

The module compiles algorithms that stratify features according to feature selection approaches. It is built on top of Scikit-learn functions and utilities.
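As an illustration of the kind of Scikit-learn utility the module builds on, the sketch below ranks features with GenericUnivariateSelect in k_best mode. The synthetic data and the choice of f_classif as scoring function are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

# Hypothetical data: 40 samples, 5 features, feature 2 shifted by the class label.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 5))
X[:, 2] += 3.0 * y

# Keep the two best-scoring features according to the ANOVA F-value.
selector = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=2)
selector.fit(X, y)

# Feature indexes sorted from most to least relevant.
ranking = np.argsort(selector.scores_)[::-1]
print(ranking[:2], selector.get_support())
```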

• select_features_corr(X_train, y_train, feature_names, n_features, training_size, n_rounds, stat_test, class_weight = None):

selects features according to the chosen filter method (stat_test), with bootstrapping for feature stability. The user can set the fraction of samples used for training (training_size), the number of resampling rounds (n_rounds) and the maximum number of features (n_features).

Listing 13.5: Python code for selection of features according to relevance indexes of DataFrame objects, feature_selection.py.

    def select_features_corr(X_train, y_train, feature_names,
                             n_features, training_size, n_rounds, stat_test, class_weight=None):
        # Chi-square works only for positive values, so escalation accordingly is needed.
        if stat_test == chi2:
            slc = MinMaxScaler()
            X_train = slc.fit_transform(X_train)
        # Initialize feature relevance matrix per resample iteration
        total_selected_features_per_round = np.empty([n_features, n_rounds], dtype='|S600')
        for j in range(n_rounds):
            # Activates balanced or unbalanced label classes for binary problems
            if (class_weight == 'balanced') & (len(y_train < 0.5) > len(y_train > 0.5)):
                X_positive, y_positive = X_train[y_train == 1], y_train[y_train == 1]
                X_negative, y_negative = shuffle(X_train[y_train == 0], y_train[y_train == 0], n_samples=len(y_positive))
                X_train = np.append(X_positive, X_negative, axis=0)
                y_train = np.append(y_positive, y_negative)
            elif (class_weight == 'balanced') & (len(y_train > 0.5) > len(y_train < 0.5)):
                X_negative, y_negative = X_train[y_train == 0], y_train[y_train == 0]
                X_positive, y_positive = shuffle(X_train[y_train == 1], y_train[y_train == 1], n_samples=len(y_negative))
                X_train = np.append(X_positive, X_negative, axis=0)
                y_train = np.append(y_positive, y_negative)
            # resample train and test datasets randomly
            X, y = shuffle(X_train, y_train)
            # Initialize feature selection object according to filter function and maximum number of features allowed
            feature_selector = GenericUnivariateSelect(score_func=stat_test, mode='k_best', param=n_features)
            # Chooses the number of samples to train
            number_of_samples = int(round(training_size*len(y)) + 1)
            X_train_boots, y_train_boots = X[:number_of_samples], y[:number_of_samples]
            feature_selector.fit(X_train_boots, y_train_boots)
            # Obtain indexes of the most important features according to filter function
            indexes_fs = np.flipud(np.argsort(feature_selector.scores_[:, 0]))
            indexes_fs = indexes_fs[:n_features]
            # translate indexes to feature names
            selected_features = np.array([])
            for i in indexes_fs:
                selected_features = np.append(selected_features, feature_names[i])
            scores = np.flipud(np.sort(feature_selector.scores_[:, 0]))
            scores = scores[:n_features]
            total_selected_features_per_round[:, j] = selected_features
        # Excludes feature repetition in resampling iteration
        all_relevant_features = np.unique(total_selected_features_per_round)
        # Computes feature stability relevance
        feature_relevance = {}
        for item in all_relevant_features:
            rank, round_of_appereance = np.where(
                total_selected_features_per_round == item)
            n_of_appereance_feature = len(rank)
            relevance_score = n_of_appereance_feature*(1.0/np.mean(rank + np.ones(len(rank))))*(1.0/(np.std(rank) + 1.0))/n_rounds
            feature_relevance[item] = relevance_score
        # Initialise computation of most stable and relevant features
        relevant_features = []
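The stability relevance score computed in the listing above can be isolated in a small function to make its behaviour easier to inspect. The function name stability_relevance and the toy selection matrix are assumptions for illustration; the formula itself follows the listing: number of appearances, divided by the mean one-based rank, by one plus the standard deviation of the rank, and by the number of rounds.

```python
import numpy as np

def stability_relevance(selected, n_rounds):
    """Score each feature by how often and how consistently it is selected.

    selected has shape (n_features, n_rounds); the row index is the rank
    (0 = best) that the named feature reached in the given round.
    """
    relevance = {}
    for name in np.unique(selected):
        rank, _round = np.where(selected == name)
        n_appearances = len(rank)
        relevance[str(name)] = (
            n_appearances
            * (1.0 / np.mean(rank + 1.0))
            * (1.0 / (np.std(rank) + 1.0))
            / n_rounds
        )
    return relevance

# Hypothetical outcome of three resampling rounds with two features kept per round.
selected = np.array([
    ["f1", "f1", "f2"],   # rank 0 in rounds 0, 1, 2
    ["f2", "f3", "f1"],   # rank 1 in rounds 0, 1, 2
])
relevance = stability_relevance(selected, n_rounds=3)
print(relevance)
```

A feature that is selected in every round, at a high and stable rank, obtains the largest score; a feature that appears only once at a low rank obtains a small one.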

In document CT-radiomics in the Context of Outcome Prediction after Chemoradio Therapy (CRT) in Cancer Patients (Page 170-200)