Spatial Preprocessing Based Multinomial Logistic Regression for Hyperspectral Image Classification


Procedia Computer Science 46 ( 2015 ) 1817 – 1826

Peer-review under responsibility of organizing committee of the International Conference on Information and Communication Technologies (ICICT 2014) doi: 10.1016/j.procs.2015.02.140


International Conference on Information and Communication Technologies (ICICT 2014)


Nidhin Prabhakar T V^a,*, Gintu Xavier^a, Geetha P^a, Soman K P^a

^a Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India

Abstract

The paper presents a fast, reliable and efficient method for improving hyperspectral image classification aided by segmentation. The Multinomial Logistic Regression (MLR) algorithm can be extended to semi-supervised learning of the posterior class distribution using unlabeled samples actively selected from the dataset. Classification results obtained from the regression model are improved by performing a maximum a posteriori segmentation, as it considers the spatial information of the hyperspectral image. The addition of a spatial processing step prior to the above mentioned classification scheme improves the overall accuracy of the process. The accuracies obtained before and after applying the preprocessing are compared.

Keywords: diffusion; hyperspectral image classification; multinomial logistic regression; semisupervised learning; hyperspectral image segmentation.

1. Introduction

Advances in hyperspectral sensing technologies and computational mathematics have opened up new opportunities in earth remote sensing and related applications. Land cover mapping, target detection, material identification, precision agriculture and abundance estimation are a few of the applications that utilize the wealth of information obtained through earth remote sensing [1,2,3]. The large volume of data obtained from a hyperspectral sensor needs to be converted to a tractable form of information for easy storage, data handling, interpretation and drawing conclusions.

* Corresponding author. Tel.: +91-9446471110.

E-mail address: [email protected]

Classification is one such way of information extraction in data mining, where each constituent pixel of the image is identified and a corresponding thematic map is generated. The quality of further analysis and decision making for an application depends on the accuracy of the classification performed. Hyperspectral images are inherently noisy, which may hinder proper classification, so a preprocessing step becomes necessary to reduce the noise. A band-by-band nonlinear diffusion applied prior to classification increases class separability and reduces noise. This step ensures that the noise in the data is not propagated along a cascaded processing scheme.

The selection of a classification strategy should ensure high performance and speed, which in the hyperspectral scenario is a challenging problem [2]. The numerous contiguous spectral bands and relatively few labeled samples available for training lead to the Hughes phenomenon [3,6]. Feature extraction by transformation to subspaces and band selection are ways of reducing the curse of dimensionality. But these steps are time consuming and may result in loss of information, affecting the classification accuracy [2].

A deterministic classification approach is a satisfactory solution for the aforementioned bottlenecks. Multinomial logistic regression (MLR) is such a deterministic approach, which directly learns and models the posterior class probabilities in a Bayesian framework without learning the class-conditional densities, facilitating classification with a smaller training set and less complexity [5,6,7,8]. A semi-supervised scheme with active learning incorporates unlabeled samples along with the labeled samples for training, solving the problem of limited labeled samples for training the model [6].

The classification accuracy of the above model can be further improved by incorporating the spatial information of the hyperspectral image along with its spectral information [6,9,10]. Hyperspectral segmentation using the posterior class distribution modeled by the MLR, and the spatial information modeled by a multilevel logistic (MLL) prior, offers better accuracy. The final segmentation is obtained through maximum a posteriori (MAP) segmentation computed using the α-expansion min-cut based optimization technique.

2. Proposed Method

The flow diagram of the proposed method is shown in Fig. 1. The steps involved are described in the following sections.

3. Denoising

Hyperspectral images are inherently noisy, as hyperspectral sensors introduce noise due to photon effects, sensor distortions, calibration etc. The aim is to enhance the quality of the image by selecting a denoising technique that smooths the homogeneous regions of the imagery without affecting the significant edge information, for better classification accuracy.

For a 2-D image X, the PDE based general diffusion equation is represented as

$$\frac{\partial X(x,y,t)}{\partial t} = \operatorname{div}\big(c(x,y,t)\,\nabla X(x,y,t)\big) \tag{1}$$

where $c(x,y,t)$ is the diffusion coefficient (diffusivity), $t$ denotes the number of iterations and $\nabla$ represents the gradient operator [12,13,14]. This diffusion is linear and space invariant when $c(x,y,t)$ is constant, but has the disadvantage of blurring the edges, resulting in the loss of significant information during denoising.

In order to reduce the smoothing effect along the edges, Perona and Malik proposed a diffusion coefficient based on the gradient of the image at time t [13]. This edge seeking function is given as

$$c(\|\nabla X\|) = e^{-\left(\|\nabla X\| / k\right)^{2}} \tag{2}$$

where the dependence on $\|\nabla X\|$ and $k$ makes the diffusion equation non-linear and space variant. The value of $k$ controls the sensitivity to the edges. A larger value of $k$ implies a larger $c$, so the edges will be heavily blurred, whereas a smaller value of $k$ leads to slower diffusion. The value of $k$ is usually chosen experimentally.

Fig. 1. Flow graph of the proposed method

The norm of the gradient $\|\nabla X\|$ is larger at the edges and smaller in the homogeneous regions, thereby ensuring less blurring at edges and more smoothing in homogeneous regions. Hence the Perona-Malik non-linear diffusion (when applied to each band of the hyperspectral imagery) improves the image quality by modifying the spatial content (spectral information is not considered).

For a hyperspectral image $X(s) = (X_1(s), \ldots, X_d(s))^T$, $s = (x, y)$, with n pixels in each of the d bands, the evolution of the PDE in (1) for the multi-scale representation of the image is given as [12]

$$\frac{\partial X_i}{\partial t} = \operatorname{div}\big(c(\|\nabla X_i\|)\,\nabla X_i\big) \tag{3}$$

where $i = 1, 2, \ldots, d$. The sequential denoising of the individual bands is followed by concatenation to form the hyperspectral image cube and by feature vector formation through reshaping the hypercube. Classification is performed on this pre-processed image.

[Fig. 1 block labels: Load Hyperspectral Image Data; Band-by-Band Perona-Malik Diffusion; Hypercube & Feature Vector Formation; MLR Modeling and Regression Coefficient Estimation; Learning Posterior Probability Distribution; Graph-Cut based Segmentation; Land Cover Map. Stage labels: Training Phase; Testing Phase; Spatial Preprocessing; Supervised Classification; Segmentation.]
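The band-by-band diffusion described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an explicit 4-neighbour discretization of (3) with the exponential coefficient of (2), zero flux at the image border, and band intensities scaled to roughly [0, 1] so that a value such as k = 0.1 is meaningful. The function names `perona_malik` and `diffuse_cube` and their default arguments are hypothetical.

```python
import numpy as np

def perona_malik(band, n_iter=3, k=0.1, step=0.15):
    """Explicit Perona-Malik diffusion on one 2-D band (assumed scheme)."""
    x = band.astype(np.float64).copy()
    for _ in range(n_iter):
        # Differences to the four neighbours; border differences stay zero.
        dn = np.zeros_like(x); dn[1:, :] = x[:-1, :] - x[1:, :]
        ds = np.zeros_like(x); ds[:-1, :] = x[1:, :] - x[:-1, :]
        de = np.zeros_like(x); de[:, :-1] = x[:, 1:] - x[:, :-1]
        dw = np.zeros_like(x); dw[:, 1:] = x[:, :-1] - x[:, 1:]
        # Edge-seeking coefficient c = exp(-(|grad|/k)^2) applied per direction.
        flux = sum(np.exp(-(d / k) ** 2) * d for d in (dn, ds, de, dw))
        x += step * flux  # step <= 0.25 keeps the 4-neighbour scheme stable
    return x

def diffuse_cube(cube, **kwargs):
    """Apply the diffusion to every band of an (H, W, d) cube independently."""
    bands = [perona_malik(cube[:, :, i], **kwargs) for i in range(cube.shape[2])]
    return np.stack(bands, axis=2)
```

Because the coefficient is close to zero across strong edges, flat regions are smoothed while large intensity steps are left almost untouched.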

4. Classification: A Deterministic Approach

4.1. Problem Statement

Let the image be represented as $x = (x_1, \ldots, x_n)$, where the n pixels are represented as d-dimensional feature vectors. Let $I = \{1, \ldots, n\}$ be the set of integers indexing the n pixels of the hyperspectral image (or n feature vectors) and $L = \{1, \ldots, K\}$ be the set of K class labels. For $i \in I$, let the image of class labels be represented as $y = (y_1, \ldots, y_n)$, where $y_i \in L$ denotes the class label for the ith pixel. The aim of hyperspectral image classification is to infer $y_i \in L$ for all $i$ from the feature vector $x_i$, and to generate a 2-D image of class labels representing the class information of the original d-dimensional hyperspectral image.

4.2. MLR Modelling and Regressor Estimation

The aim of the MLR based supervised learning algorithm is to design a classifier, based on L labeled training samples ( L I ), that is capable of distinguishing K classes when feature vector(s) (or a basis function computed from the observed feature) are given as the input for classification. The algorithm used in this paper includes a training phase for estimating the regressors $w$ and a testing phase for finding the posterior probability of each feature vector, from which the class labels are inferred [6,9,15].

4.2.1. Training-Supervised

Let the L training samples with known class labels be represented as a training set $D_L = \{(x_1, y_1), \ldots, (x_L, y_L)\}$. The posterior class distributions under this MLR model are calculated for the MAP estimation of the multinomial logistic regressors $w$ (the initial $w$ is an assumption) [16]. The general MLR model representation [9,15] is:

$$P(y_i = k \mid x_i, w) = \frac{\exp\big(w^{(k)T} x_i\big)}{\sum_{j=1}^{K} \exp\big(w^{(j)T} x_i\big)} \tag{4}$$

where the right hand side represents the probability that $x_i$ belongs to class $k$, for $k \in L$. $w^{(k)}$ is the set of logistic regressors for class $k$, where $w = (w^{(1)}, \ldots, w^{(K-1)})$, and $w^{(K)}$ is usually set to 0, as the conditional probability for the Kth class can be found directly by subtracting the sum of the other (K-1) conditional class probabilities from one [16].
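As a small illustration of the MLR model in (4), the sketch below evaluates the posterior for a batch of feature vectors with the regressor of the Kth class fixed to zero. The function name `mlr_posterior` and the array layout (samples in rows) are assumptions made for this example; the max-subtraction is only a numerical safeguard and does not change the probabilities.

```python
import numpy as np

def mlr_posterior(X, W):
    """Posterior class probabilities of the MLR model.

    X: (n, d) feature vectors, one per row.
    W: (d, K-1) regressors for classes 1..K-1; class K uses w = 0.
    Returns an (n, K) array whose rows sum to one.
    """
    n = X.shape[0]
    # Scores w^(k)T x_i, with a zero column appended for the Kth class.
    scores = np.hstack([X @ W, np.zeros((n, 1))])
    scores -= scores.max(axis=1, keepdims=True)  # avoid overflow in exp
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)
```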

$x = (x_1, \ldots, x_t)$ represents the feature vectors selected to train the model. Non-linear functions like symmetric kernels improve the data separability in the transformed space [17] and hence are ideal for hyperspectral image classification. This paper uses a Gaussian Radial Basis Function, represented as $K(x_i, x_j) = \exp\big(-\|x_i - x_j\|^2 / (2\sigma^2)\big)$, for representing the training vectors [17].
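The Gaussian RBF above can be evaluated for all pairs of vectors at once. The following sketch assumes feature vectors stored as rows; the function name `rbf_kernel` and the default `sigma` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF K(a, b) = exp(-||a - b||^2 / (2 sigma^2)).

    A: (m, d) and B: (n, d) arrays of feature vectors (rows).
    Returns the (m, n) kernel matrix.
    """
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at zero
    # to remove tiny negative values caused by rounding.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))
```

With L training vectors as the second argument, each pixel is represented by L kernel values, which is why the RBF representation is costlier than using the spectral vector directly.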

The expressions for the MAP estimation of $w$ using expectation maximization (EM) and the block Gauss-Seidel iterative algorithm are summarized below (detailed explanation in [9,11,18]).

In the Bayesian approach, the posterior density of $w$, with $Y_L$ and $X_L$ as the set of labels and the set of feature vectors in the given labeled training samples respectively, is given as

$$p(w \mid Y_L, X_L) \propto p(Y_L \mid X_L, w)\, p(w \mid X_L) \tag{5}$$

The regressors $w$ can be estimated as the value of $w$ that maximizes the conditional log data likelihood together with the log prior. This MAP estimate is given by

$$\hat{w} = \arg\max_{w} \{\, l(w) + \log p(w \mid X_L) \,\} \tag{6}$$

with the log-likelihood

$$l(w) \equiv \log p(Y_L \mid X_L, w) \equiv \log \prod_{i=1}^{L} p(y_i \mid x_i, w) \equiv \sum_{i=1}^{L} \Big( x_i^T w^{(y_i)} - \log \sum_{j=1}^{K} \exp\big(x_i^T w^{(j)}\big) \Big) \tag{7}$$

$p(w \mid X_L)$ is the prior on $w$ and acts as a regularizer, promoting better solutions during the testing phase [11].

The Gaussian prior used is

$$p(w \mid \Gamma) \propto \exp\Big\{ -\frac{1}{2}\, w^T \Gamma\, w \Big\} \tag{8}$$

The MAP estimation is performed using an expectation-maximization algorithm [9,19]. EM is an iterative procedure in which an E-step (expectation) and an M-step (maximization) are performed at each iteration. Specifically, the tth iteration can be summarized as

E-step: $$Q(w \mid \hat{w}_t) = E\big[\log p(w, \Gamma \mid D_L) \mid \hat{w}_t\big] \tag{9}$$

M-step: $$\hat{w}_{t+1} = \arg\max_{w} Q(w \mid \hat{w}_t) \tag{10}$$

Detailed explanations and algorithms are presented in [9,11,18,19].
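To make the MAP objective in (6) concrete, the sketch below maximizes the log-likelihood (7) plus a log Gaussian prior by plain gradient ascent. This is a deliberate simplification and not the EM / block Gauss-Seidel procedure of the cited works: it assumes an isotropic prior with a fixed precision `lam`, zero-based class labels, and the last class as the reference class with zero regressors. The names `fit_map`, `predict`, `lam` and `lr` are hypothetical.

```python
import numpy as np

def fit_map(X, y, K, lam=1e-2, lr=0.1, n_iter=500):
    """Gradient-ascent MAP estimate of the MLR regressors (assumed scheme).

    X: (L, d) training vectors, y: (L,) labels in 0..K-1.
    Returns W of shape (d, K-1); the regressor of class K-1 is fixed to 0.
    """
    L, d = X.shape
    W = np.zeros((d, K - 1))
    Y = np.eye(K)[y]  # one-hot labels
    for _ in range(n_iter):
        S = np.hstack([X @ W, np.zeros((L, 1))])
        S -= S.max(axis=1, keepdims=True)
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)
        # Gradient of l(w) is X^T (Y - P); the Gaussian prior adds -lam * W.
        grad = X.T @ (Y - P)[:, : K - 1] - lam * W
        W += lr * grad / L
    return W

def predict(W, X):
    """Class with the largest score (reference class has score 0)."""
    S = np.hstack([X @ W, np.zeros((X.shape[0], 1))])
    return S.argmax(axis=1)
```

A constant feature can be appended to each vector to act as a bias term, which matches the extra row in the regressor matrices reported later (201 and 501 rows).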

4.2.2. Training-Semisupervised

For cases where the number of labeled samples available is small, a larger set of unlabeled samples, $X_U = \{x_{L+1}, \ldots, x_{L+U}\}$, and a smaller set of labeled samples, $D_L = \{(x_1, y_1), \ldots, (x_L, y_L)\}$, form the training set. Methods like random selection and maximum entropy are used [6] to actively select the unlabeled samples from the dataset. The MAP estimation of the regressors follows the same formulation as in the supervised case, except for the modified training set representation during the iterations [6].
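The maximum entropy criterion mentioned above can be illustrated with a short sketch: given the current posterior probabilities of the unlabeled samples, the samples whose class distribution is most uncertain are picked first. The function name `select_max_entropy` and the clipping constant are assumptions for this example; the selection procedure of [6] may differ in detail.

```python
import numpy as np

def select_max_entropy(P, n_select):
    """Indices of the n_select rows of P with the largest entropy.

    P: (U, K) posterior probabilities of the unlabeled samples.
    """
    P = np.clip(P, 1e-12, 1.0)          # avoid log(0)
    H = -(P * np.log(P)).sum(axis=1)    # entropy of each posterior
    return np.argsort(-H, kind="stable")[:n_select]
```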

Since the dataset used in this paper, AVIRIS Indian Pines, comes with complete label information, the semi-supervised learning approach is not emphasized when evaluating the effect of adding the preprocessing step.

4.2.3. Testing

The estimated regression coefficients along with the testing samples are fed to the model to determine the posterior class densities of each feature vector for all K classes. The testing data can be either all the feature vectors, including the training samples, or the feature vectors excluding the training samples. The index corresponding to the maximum posterior class probability for a feature vector gives the class label for that vector. In this way the complete classification map is generated.
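The label inference step amounts to an argmax over the posterior of each pixel followed by a reshape into the image grid. The sketch below assumes the pixels are stored in row-major order and that classes are numbered 1..K; both are assumptions of this example, and a column-major layout (as produced by MATLAB reshaping) would need `order="F"` in the reshape. The function name `classification_map` is hypothetical.

```python
import numpy as np

def classification_map(P, height, width):
    """Turn (n, K) posteriors into a (height, width) map of labels 1..K."""
    labels = P.argmax(axis=1) + 1  # index of the maximum posterior, 1-based
    return labels.reshape(height, width)
```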

5. Segmentation

5.1. Problem Statement

Hyperspectral earth images are very likely to show similarity among neighbouring pixels, and exploiting this property by segmenting the similar regions improves the classification performance. For the image represented as $x = (x_1, \ldots, x_n)$, with $I = \{1, \ldots, n\}$ as the set of integer indexes and $L = \{1, \ldots, K\}$ as the set of class labels, segmentation partitions the set $I$ into regions such that the pixels in each region are similar in some sense. This process, which incorporates the contextual information, improves the overall classification accuracy.

The MAP segmentation under the Bayesian framework is given as [9]

$$\hat{y} = \arg\max_{y \in L^n} \Big\{ \sum_{i=1}^{n} \big( \log p(y_i \mid x_i) - \log p(y_i) \big) + \log p(y) \Big\} \tag{11}$$

or equivalently

$$\hat{y} = \arg\min_{y \in L^n} \Big\{ \sum_{i=1}^{n} -\big( \log p(y_i \mid \hat{w}) - \log p(y_i) \big) - \log p(y) \Big\} \tag{12}$$

where $p(y_i \mid \hat{w})$ is the outcome of the classification step and $p(y)$ is the MLL prior.

5.2. MLL Prior Estimation

The MLL prior promotes a piecewise smooth segmentation, in which adjacent pixels are assumed likely to belong to the same class. This Markov Random Field (MRF) prior is widely used in segmentation [21]. The prior for segmentation has a Gibbs distribution [21,23] and is represented as

$$p(y) = \frac{1}{Z}\, e^{-\sum_{c \in C} V_c(y)} \tag{13}$$

where $Z$ is a normalizing constant and $V_c(y)$ is the prior potential corresponding to the set of cliques $C$ over the image [6,21,28].

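Equation (13) can be rewritten as

$$p(y) = \frac{1}{Z}\, \exp\Big( \sum_{i \in I} V_{y_i} + \mu \sum_{(i,j) \in C} \delta(y_i - y_j) \Big) \tag{14}$$

where $V_{y_i}$ is the single-site term, $\delta(y_i - y_j)$ is the unit impulse function representing the pairwise interaction and $\mu$ controls the level of smoothness for segmentation [6].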

5.3. Graph-cut Segmentation

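From (12) and (14), the MAP estimation can be finally given as [6]

$$\begin{aligned}
\hat{y} &= \arg\min_{y \in L^n} \Big\{ \sum_{i=1}^{n} -\big( \log p(y_i \mid \hat{w}) - \log p(y_i) \big) - \Big( \sum_{i \in I} \log p(y_i) + \mu \sum_{(i,j) \in C} \delta(y_i - y_j) \Big) \Big\} \\
&= \arg\min_{y \in L^n} \Big\{ \sum_{i=1}^{n} -\log p(y_i \mid \hat{w}) - \mu \sum_{(i,j) \in C} \delta(y_i - y_j) \Big\}
\end{aligned} \tag{15}$$

Here the single-site term of the prior is taken as $\log p(y_i)$, so that it cancels the class prior in (12), and the constant $\log Z$ is dropped because it does not affect the minimizer.

Equation (15) is an integer optimization problem. A satisfactory approximation to this MAP segmentation problem can be obtained by mapping it onto a graph and performing the α-expansion algorithm through min-cut computations [24-27].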
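The energy in (15) can be written down directly, which helps to see what the optimizer is trading off. The sketch below evaluates that energy on a 4-connected grid and then reduces it with iterated conditional modes (ICM). ICM is used here only because it fits in a few lines; it is a simpler local optimizer and is not the α-expansion graph-cut method used in the paper. The function names `map_energy` and `icm_segment`, the zero-based labels and the 4-neighbourhood are assumptions of this example.

```python
import numpy as np

def map_energy(log_post, labels, mu):
    """Energy of (15): unary -log p(y_i | w) minus mu per agreeing neighbour pair.

    log_post: (H, W, K) log posteriors; labels: (H, W) integers in 0..K-1.
    """
    unary = -np.take_along_axis(log_post, labels[..., None], axis=2).sum()
    agree = (labels[1:, :] == labels[:-1, :]).sum() + \
            (labels[:, 1:] == labels[:, :-1]).sum()
    return float(unary - mu * agree)

def icm_segment(log_post, mu, n_sweeps=5):
    """Greedy pixel-wise minimization of the energy (ICM, not graph cuts)."""
    H, W, K = log_post.shape
    labels = log_post.argmax(axis=2)  # start from the classification map
    for _ in range(n_sweeps):
        changed = False
        for r in range(H):
            for c in range(W):
                cost = -log_post[r, c].copy()
                # Each neighbour rewards the label it currently holds.
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < H and 0 <= cc < W:
                        cost[labels[rr, cc]] -= mu
                best = int(cost.argmin())
                if best != labels[r, c]:
                    labels[r, c] = best
                    changed = True
        if not changed:
            break
    return labels
```

Each ICM update can only lower or keep the energy, so the result is a local minimum; α-expansion generally reaches lower-energy solutions because it moves many pixels at once.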

6. Experimental Results

This paper claims a fast, reliable and efficient method for improving the supervised classification aided by segmentation [6] of the Indian Pines scene. The justification is provided in the following sections.

6.1. Dataset description

The data collected by the AVIRIS system, operated by the NASA Jet Propulsion Laboratory, over northwestern Indiana in 1992 are used for validating the results (Fig. 2).


Fig. 2. (a) AVIRIS data scene (b) Ground truth image with class descriptions

The Indian Pines scene comprises 224 spectral channels (bands) in the wavelength range of 400-2500 nm, with a nominal spectral resolution of 10 nm and a spatial resolution of 20 m per pixel. The image has 145x145 pixels per band. Fig. 2(a) shows the AVIRIS Indian Pines data scene and Fig. 2(b) shows the labeled ground truth information of the 16 mutually exclusive classes, of which ten correspond to different crops, five represent vegetation types and one represents buildings. The background is represented as white and is not considered during classification.

6.2. Accuracy Assessment Measures

The best way of representing the classification outcome is through a confusion (error) matrix. However, statistical parameters are chosen as a common measure of accuracy for the classification and segmentation stages: Overall Accuracy (OA), Class Accuracy, Average Accuracy, Kappa index etc. In this paper we consider only the Overall Accuracy.

$$\text{Overall Accuracy} = \frac{\text{Total number of correctly classified pixels}}{\text{Total number of pixels}}$$

Since the method used in this paper is probabilistic, Monte Carlo (MC) runs are performed to obtain a stable result. Five MC runs are used in this paper, and the mean of the Overall Accuracies over all the MC runs is computed for evaluation.
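The accuracy measure and its Monte Carlo average can be sketched as below. The example assumes that background pixels carry the label 0 in the ground truth and are excluded from the count, in line with the statement that the background is not considered; the label value 0 and the function names are assumptions of this sketch.

```python
import numpy as np

def overall_accuracy(pred, truth, background=0):
    """Fraction of labeled (non-background) pixels classified correctly."""
    mask = truth != background
    return float((pred[mask] == truth[mask]).mean())

def mean_overall_accuracy(accuracies):
    """Mean Overall Accuracy (MOA) over a list of Monte Carlo runs."""
    return float(np.mean(accuracies))
```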

6.3. Results and Discussions

First, the Perona-Malik diffusion algorithm is applied band-by-band to the Indian Pines image. The number of iterations, the step size and the diffusion control parameter k are chosen experimentally to give the best result in the classification stage. Three iterations with a step size of less than 0.2 and k=0.1 are chosen by trial and error for this experiment. The outcome of this step for a single band is shown in Fig. 3.

Each diffused image (145x145) is concatenated to form the hypercube (145x145x220), which is reshaped into a matrix (220x21025) in which each column represents a feature vector. The 20 noisy and water absorption bands [29] are removed to form a 200x21025 matrix. This matrix is saved as a ‘.mat’ file, which is loaded for the classification in the proposed method.

Fig. 4 shows the classification and segmentation results obtained before and after applying the preprocessing step. The parameters chosen for the supervised MLR based classification and segmentation are the same for both methods (with and without preprocessing) except for the dataset used. The MLR probabilistic model, being predictive [7], will generate results with slight variation on each run. Hence, Monte Carlo runs are performed to get more accurate and stable output values. Linear and radial basis function (RBF) kernel learning methods are used for comparison and for deriving a fast and reliable method. For the training phase, 500 training samples (20 from each class) are selected. As the number of training samples increases, the accuracy also increases. Due to space limits, only the most significant outcomes are furnished. The smoothness parameter μ chosen for segmentation is 4 [6]. For testing, all the feature vectors are used along with the training samples, since the method [6] is fast and predictive.
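The hypercube-to-matrix step described above can be sketched as follows. The indices of the 20 removed bands are not listed in the text, so `drop_bands` is a caller-supplied, hypothetical argument; the row-major pixel ordering and the function name `cube_to_features` are likewise assumptions of this example.

```python
import numpy as np

def cube_to_features(cube, drop_bands=()):
    """Reshape an (H, W, d) hypercube into a (d_kept, H*W) feature matrix.

    Each column is the spectral feature vector of one pixel; the bands
    listed in drop_bands (zero-based indices) are removed.
    """
    h, w, d = cube.shape
    X = cube.reshape(h * w, d).T  # (d, H*W), pixels in row-major order
    keep = np.setdiff1d(np.arange(d), np.asarray(drop_bands, dtype=int))
    return X[keep, :]
```

For a 145x145x220 cube with 20 bands dropped, this yields the 200x21025 matrix mentioned in the text.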


Fig. 3. Effect of Perona-Malik diffusion: (a) Original Band 3; (b) Denoised and smoothened Band 3 after diffusion

Table 1 shows the comparison of the Mean Overall Accuracy (MOA) for the different learning methods, before and after preprocessing. From Table 1 it can be inferred that the proposed method offers better accuracies than the method without preprocessing. Also, upon preprocessing, the linear method attains accuracies equal to or greater than those of the RBF method. Table 1 also shows that the computational time taken by the linear method (which uses the feature vectors directly) is much less than that of the RBF method, because RBF computes the function values among all feature vectors. In this case, the matrix dimension for w is 501x15 for RBF, whereas for linear, w is 201x15.

Table 1. Comparison of the performance of proposed and existing method (MC runs=5)

Method                       Kernel   Computation Time (mins)   Classification (MOA %)   Segmentation (MOA %)
Without spatial processing   Linear   1.30                      50.30                    73.50
Without spatial processing   RBF      7.04                      74.87                    83.47
With spatial processing      Linear   2.05                      86.37                    90.62
With spatial processing      RBF      7.45                      86.32                    90.31

Hence, with preprocessing, the linear learning method for MLR based classification aided by segmentation becomes the fast, reliable (since its speed allows the number of MC runs to be increased for more consistent results) and accurate method that this paper claims.

7. Conclusion

A spatial preprocessing step prior to MLR based hyperspectral image classification was proposed in this paper to improve the overall performance. The experimental results and discussions demonstrate the efficiency of the proposed method. The comparisons show that, when the proposed preprocessing step is applied, the linear kernel learning method becomes the fast, reliable and efficient method that improves the classification and segmentation results on the AVIRIS Indian Pines dataset.

Acknowledgements

The authors would like to thank Ms. Jun Li, Dr. Jose M. Bioucas Dias and Antonio Plaza for sharing the demo codes of MLR based hyperspectral classification and segmentation along with the dataset files [16]. We also thank Mrs.


Fig. 4. (a)-(b) Classification map (OA 53%) and segmentation map (OA 70%) before preprocessing, respectively. (c)-(d) Classification map (OA 86%) and segmentation map (OA 90%) after preprocessing, respectively.

References

1. Muhammad Ahmad, Sungyoung Lee, Ihsan Ul Haq and Qaisar Mushtaq. Hyperspectral remote sensing: Dimensional reduction and end member extraction. International Journal of Soft Computing and Engineering; May 2012. 2-2.

2. Heesung Kwon, Xiaofei Hu, James Theiler, Alina Zare and Prudhvi Gurram. Algorithms for multispectral and hyperspectral image analysis. Journal of Electrical and Computer Engineering; 2013.

3. G. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Trans. on Information Theory; Jan 1968. 14.

4. Gustavo Camps-Valls, Luis Gomez-Chova, Jordi Muñoz-Mari, Joan Vila-Frances and Javier Calpe-Maravilla. Composite Kernels for Hyperspectral Image Classification. IEEE Geoscience and Remote Sensing Letters; Jan 2006. 93-97. 3-1.

5. Y. Dan Rubinstein. Discriminative vs. informative learning. Dissertation submitted to Stanford University; Jan 1998.

6. Jun Li, José M. Bioucas-Dias and Antonio Plaza. Semisupervised hyperspectral image segmentation using multinomial logistic regression with active learning. IEEE Trans on Geoscience and Remote Sensing; Nov 2010. 48.

7. Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naïve Bayes. Advances in neural information processing systems; 2002. 841.

8. V.N Vapnik. Statistical learning theory. John Wiley and Sons; 1998.

9. J. Borges, J. Bioucas-Dias and A. Marçal. Bayesian hyperspectral image segmentation with a discriminative class learning. Proceedings of IEEE Int. Geosci. Remote Sens. Symp.; 2007. 3810-3813.

10. Mathieu Fauvel, Jon Atli Benediktsson, Jocelyn Chanussot, Johannes R. Sveinsson. Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles. IEEE trans on Geoscience and Remote Sensing; 2008. 3804-3814:46-11.

11. Balaji Krishnapuram, David Williams, Ya Xue, Alex Hartemink and Lawrence Carin. On Semi-Supervised Classification. Advances in neural information processing systems; 2004. 721-728.

12. Kavitha Balakrishnan, Sowmya V and Dr. K.P. Soman. Spatial Preprocessing for Improved Sparsity Based Hyperspectral Image Classification. International Journal of Engineering Research & Technology; July 2012. 1-5.

13. P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell.; Jul. 1990. 629-639:12-7.

14. S. K. Weeratunga and C. Kamath. A Comparison of PDE-based Nonlinear Anisotropic Diffusion Techniques for Image Denoising. Image Processing: Algorithms and Systems II, SPIE Electronic Imaging. Santa Clara; Jan 2003.

15. D. Böhning. Multinomial logistic regression algorithm. Ann. Inst. Stat. Math; Mar. 1992. 197-200:44-1.

16. http://www.lx.it.pt/~jun/#code.

17. G. Camps-Valls and L. Bruzzone. Kernel-based methods for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens.; Jun. 2005. 1351-1362:43-6.

18. J. Bernardo and A. Smith. Bayesian Theory. Wiley; 1994.

19. A. Dempster, N. Laird and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc.; 1977. 1-38:39-1.

20. D. MacKay. Information-based objective functions for active data selection. Neural Comput; Jul. 1992. 590-604:4-4.

21. S. Geman, D. Geman. Stochastic relaxation Gibbs distribution and Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell.; Nov. 1984. 721-741:6.

22. S.Z. Li. Markov random field modeling in computer vision. Springer-Verlag; 1995.

23. J. Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc B; 1974. 192-236:36-2.

24. Yuri Boykov, Olga Veksler and Ramin Zabih. Fast approximate energy minimization via graph cut. IEEE trans. on PAMI; Nov 2001. 1222-1239:20-12.

25. Vladimir Kolmogorov, Ramin Zabih. What energy functions can be minimized via graph cuts?. IEEE Trans on Pattern Analysis and Machine Intelligence; Feb 2004. 147-159:26-2.

26. Yuri Boykov, Vladimir Kolmogorov. An experimental comparison of Min-Cut/Max-Flow algorithms for energy minimization in vision. IEEE Trans on Pattern Analysis and Machine Intelligence; Sep 2004. 1124-1137:26-9.

27. Shai Bagon. Matlab wrapper for Graph Cut. Available at www.wisdom.weizmann.ac.il/~bagon. December 2006.

28. S.Z Li. Markov Random Field modeling in image analysis. 2nd ed. Springer-Verlag; 2001.

29. Landgrebe, http://dynamo.ecn.purdue.edu/~biehl/MultiSpec; 1992.

30. Santiago Velasco-Forero, Dr. Vidya Manian. Improving hyperspectral image classification using spatial preprocessing. IEEE Trans.