Application of Kohonen Self-Organizing Map for Urban Structure Analysis


Abstract—The Kohonen Self-Organizing Map (SOM) has been widely used to discover clusters in datasets from various real-world applications. Urban structure, as characterized by various social, economic, and environmental features, can be explored with SOM. In this paper, we present a case study of applying SOM for urban structure analysis to the city of New Orleans. One novel aspect of this work is the inclusion of environmental data from satellite imagery for clustering. Ten social-economic variables from Census 2000 data and the Normalized Difference Vegetation Index (NDVI) from Landsat 7 Enhanced Thematic Mapper Plus (ETM+) imagery were used as inputs to SOM to group 482 census block groups of New Orleans into 9 clusters. The clustering results show that block groups with high economic status and environmental quality tended to cluster at 3 major outer locations, while block groups with low economic status and environmental quality were concentrated in the mid-city area, which is a well-known pattern due to the suburbanization process.

Three major components were extracted by a principal component analysis, and they were compared with the clustering results from SOM. Fisher’s linear discriminant analysis shows good separability among the 9 clusters discovered by the SOM.

Index Terms—Clustering, Landsat 7 ETM+, Discriminant Analysis, Self-Organizing Map, Urban Structure

I. INTRODUCTION

The traditional approach to urban structure research focused on the distribution of land use zones, which led to Park and Burgess’s concentric zone model [1], Hoyt’s sector model [2], and Harris-Ullman’s multiple nuclei scheme [3]. However, when examining the details of spatial structure in a specific city, it is not enough to simply fit the city into one of these models. Each city must be studied specifically through its social, economic, and environmental variables to derive the underlying spatial patterns.

As a topology-preserving clustering technique, the Kohonen Self-Organizing Map (SOM) [4] has recently been applied to urban structure analysis. The cluster centers in an SOM are typically arranged in a two-dimensional grid topology, and the centers of similar clusters are placed in neighboring locations in the topology [5], [6]. This is more meaningful than the ordinal ordering that is widely used in choropleth maps.

Manuscript received December 27, 2005.

Wenxue Ju is with the Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803. (E-mail: [email protected]).

Nina Siu-Ngan Lam is with the Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803. (E-mail: [email protected]).

Jianhua Chen is with the Department of Computer Science, Louisiana State University, Baton Rouge, LA 70803. (E-mail: [email protected]).

Social-economic census data have been widely used in interpreting the gentrification phenomenon [7], urban zone clustering [8], and city/county systems [9], [10]. Environmental variables have been considered in some urban research [8] but are often neglected, although an urban system is also a reflection of environmental factors. The performance of SOM has been addressed previously [5], but it still needs further evaluation.

In this paper, we utilize social, economic, and environmental indices from census data and satellite imagery to analyze the spatial pattern of New Orleans at the census block group (a geographical unit for census) level. We compare the results with principal component analysis [11], [12], and evaluate the clustering performance with linear discriminant analysis [13].

II. STUDY AREA AND DATASET

A. Study Area

The city of New Orleans (see Fig. 1), which has 485 block groups, was selected as the study area. This area is of great interest for urban social-economic structure analysis for several reasons. The city of New Orleans has just experienced Hurricane Katrina; the city population has a highly diversified demographic composition, with a high percentage of African Americans among the residents. Moreover, the social-economic conditions vary greatly. In terms of environmental features, the city of New Orleans is unique. As we all know now, the city is situated at an elevation below sea level and relies on levees to protect it from flooding. It is of particular interest today to analyze the urban structure of New Orleans from a social, economic, environmental, and vulnerability perspective, since the discovered patterns can provide meaningful information to help the city’s urban planning and reconstruction after Hurricane Katrina.


Fig. 1. Study area: Orleans Parish, LA


Wenxue Ju, Nina Siu-Ngan Lam, and Jianhua Chen


B. The Dataset

Ten social-economic variables derived from the census 2000 data were used in this study:

1) Pop: Population density (people per square kilometer)

2) Hou: House density (number of houses per square kilometer)

3) Vac: Vacancy rate (the ratio of vacant houses to total houses)

4) Veh: Multi-vehicle rate (the rate of occupied households that have two or more vehicles)

5) Edu: Bachelor rate (the rate of people aged 25 years or older who have a Bachelor's or higher degree)

6) Pov: Poverty rate (the rate of households that fall below the poverty level)

7) Unem: Unemployment rate (the rate of labor-force people aged 25 years or older who are unemployed)

8) White: % whites

9) Rent: Median rent (median house rent of occupied households)

10) Inc: Median household income (median household income of occupied households)

We use the median values of rent and income to minimize the effect of outliers. Landsat 7 Enhanced Thematic Mapper Plus (ETM+) imagery was used in this research to derive environmental variables. Landsat 7 ETM+ has seven multi-spectral bands whose spatial resolution is about 30 meters; its fourth and third bands measure the energy reflected from the ground surface in the near-infrared and red ranges of the electromagnetic spectrum, respectively. These two bands can be used to measure the vegetation condition of the ground surface in terms of the Normalized Difference Vegetation Index (NDVI) [14]:

$$\delta = \frac{b_4 - b_3}{b_4 + b_3} \qquad (1)$$

where $\delta$ denotes NDVI, and $b_4$ and $b_3$ denote the corresponding pixel values of band 4 and band 3.
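The NDVI formula can be illustrated with a minimal NumPy sketch. The function name `ndvi`, the handling of pixels where both bands are zero (set to NaN here), and the sample pixel values are assumptions made for illustration; they are not taken from the processing chain used in this study.

```python
import numpy as np

def ndvi(band4, band3):
    """Per-pixel (b4 - b3) / (b4 + b3); pixels with b4 + b3 == 0 become NaN."""
    b4 = np.asarray(band4, dtype=float)
    b3 = np.asarray(band3, dtype=float)
    total = b4 + b3
    out = np.full(total.shape, np.nan)
    # Divide only where the denominator is non-zero (assumed convention).
    np.divide(b4 - b3, total, out=out, where=total != 0)
    return out

# Hypothetical 2x2 pixel values for band 4 (near infrared) and band 3 (red).
nir = np.array([[120, 80], [0, 50]])
red = np.array([[40, 80], [0, 150]])
print(ndvi(nir, red))
```

Values close to 1 indicate dense vegetation, while values near or below 0 indicate bare or built-up surfaces and water.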

An image, which was taken on November 18, 1999 (see Fig. 2) and obtained from the Global Land Cover Facility (GLCF, www.landcover.org), was used in this paper.

Fig. 2. Landsat 7 ETM+ imagery of New Orleans

The NDVI image, which was derived from Landsat 7 ETM+ imagery, is in raster format. The census block group boundaries, which are in vector format, were used as a zone layer with the NDVI image as a class layer to perform zonal statistics in Erdas Imagine to derive the average NDVI for each census block group.
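The zonal averaging step was carried out in Erdas Imagine. As a rough illustration of what a zonal mean computes, the sketch below assumes the block group polygons have already been rasterized into an integer zone-ID grid aligned with the NDVI raster; the function name `zonal_mean` and the sample arrays are hypothetical.

```python
import numpy as np

def zonal_mean(values, zones, n_zones):
    """Mean of `values` for each zone ID in 0..n_zones-1; NaN for empty zones."""
    v = np.asarray(values, dtype=float).ravel()
    z = np.asarray(zones, dtype=int).ravel()
    sums = np.bincount(z, weights=v, minlength=n_zones)   # sum of values per zone
    counts = np.bincount(z, minlength=n_zones)            # pixel count per zone
    means = np.full(n_zones, np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means

# Hypothetical 2x2 NDVI raster and matching zone IDs (zone 2 has no pixels).
ndvi_raster = np.array([[0.1, 0.3], [0.5, 0.7]])
zone_ids = np.array([[0, 0], [1, 1]])
print(zonal_mean(ndvi_raster, zone_ids, n_zones=3))
```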

We used the census 2000 Topologically Integrated Geographic Encoding and Referencing (TIGER) file as the boundary file, with the exclusion of Lake Pontchartrain and the Mississippi River. Three block groups were excluded from the clustering procedure because their house densities and many other variables are 0.

III. SOM FOR URBAN STRUCTURE ANALYSIS

The Kohonen Self-Organizing Map is commonly used for clustering. A Kohonen network is a two-layer neural network (see Fig. 3) whose learning process is competitive.

Fig. 3. Kohonen neural network

A. Characteristics and terms

1) Map topology

The output neurons are organized into a map, which can be of any dimension. Three 2-dimensional topologies are widely used: grid, hexagon, or random.

2) Output-neuron distance function

The distance between output neurons is measured from their spatial positions in the SOM map. Commonly used functions include Euclidean, link, box, and Manhattan [4], [15]. Euclidean distance is the straight-line distance between the spatial positions of two output neurons; link distance is the number of links between neighboring neurons that must be traversed to connect two output neurons; Manhattan distance is the sum of the distances between two neurons along all dimensions of the SOM map.
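A small sketch may help compare these distance functions on a rectangular grid. The helper names are hypothetical, and the box distance (not defined in the text above) is assumed here to be the largest coordinate difference, as in common SOM toolboxes.

```python
import numpy as np

def grid_positions(rows, cols):
    """(row, col) coordinates of every output neuron on a rectangular grid."""
    return np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

def euclidean(p, q):
    """Straight-line distance between two neuron positions."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

def manhattan(p, q):
    """Sum of coordinate differences along all map dimensions."""
    return float(np.sum(np.abs(p - q)))

def box(p, q):
    """Largest coordinate difference (assumed definition of box distance)."""
    return float(np.max(np.abs(p - q)))

pos = grid_positions(3, 3)
a, b = pos[0], pos[8]          # opposite corners of a 3x3 grid
print(euclidean(a, b), manhattan(a, b), box(a, b))
```

For opposite corners of a 3×3 grid, the three measures differ noticeably, which changes which neurons count as neighbors of the winner for a given distance threshold.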

3) Learning rate

The learning rate controls how strongly the weight vectors are adjusted. It ranges from 0 to 1 and decays from a larger value (such as 0.9) to a small value (such as 0.02) over the course of network training. The decay can be linear or exponential.
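The two decay options can be sketched as follows. The function names and the exact parameterization (interpolating between a start and an end value over a fixed number of steps) are assumptions for illustration; other schedules are equally valid.

```python
def linear_decay(step, total_steps, start=0.9, end=0.02):
    """Learning rate falling on a straight line from `start` to `end`."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac

def exponential_decay(step, total_steps, start=0.9, end=0.02):
    """Learning rate falling geometrically from `start` to `end`."""
    frac = step / max(total_steps - 1, 1)
    return start * (end / start) ** frac

# Compare both schedules at the start, middle, and end of 101 steps.
for s in (0, 50, 100):
    print(s, round(linear_decay(s, 101), 3), round(exponential_decay(s, 101), 3))
```

Both schedules share the same endpoints, but the exponential form drops quickly at first and spends more of the training at small learning rates.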

B. General algorithm

The general algorithm for SOM [4], [6], [15] includes five steps:

1) Choose the number of input and output neurons and the map topology. Initialize the weight vectors. Input data must be scaled to [0, 1] to avoid variable domination.

2) Randomly select a data record X as the input.

3) Determine the winning output neuron.


Calculate the Euclidean distance (||X-C||) or the dot product (X·C), where X and C are the input vector and the weight vector of an output neuron, respectively. The output neuron with the minimum Euclidean distance or the maximum dot product is declared the winner.

4) Adjust the weight vectors of the winning neuron and its neighboring neurons so that they are dragged closer to the input vector.

$$C_i' = C_i + r \times a(d) \times (X - C_i) \qquad (2)$$

where $C_i$ and $C_i'$ are the weight vectors of cluster i before and after adjustment, X is the input vector, r is the current learning rate, and a(d) is a distance activation function that is 1 for the winner, 0.5 for neurons that fall within a specified distance d of the winner, and 0 for all other neurons.

5) Repeat steps 2)-4) for a set number of epochs or until convergence.
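The five steps can be put together in a minimal NumPy sketch. This is an illustration under stated assumptions, not the implementation used in this study: the function names, the per-epoch random permutation of records, the linear learning-rate decay, the neighborhood radius of 1, and the random demo data are all hypothetical choices. The update moves each affected weight vector toward the input by the vector difference X − C_i.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=100, r_start=0.9, r_end=0.02,
              radius=1.0, seed=0):
    """Train a small SOM; returns one weight vector (row) per output neuron."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)      # assumed already scaled to [0, 1]
    n, dim = data.shape
    # Step 1: grid positions; weights initialized at the midpoint 0.5.
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = np.full((rows * cols, dim), 0.5)
    total, step = epochs * n, 0
    for _ in range(epochs):                   # Step 5: fixed number of epochs
        for idx in rng.permutation(n):        # Step 2: pick records in random order
            x = data[idx]
            # Step 3: the winner has the minimum Euclidean distance to X.
            winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
            # Step 4: a(d) = 1 for the winner, 0.5 within `radius`, 0 otherwise.
            d = np.linalg.norm(pos - pos[winner], axis=1)
            a = np.where(d == 0, 1.0, np.where(d <= radius, 0.5, 0.0))
            r = r_start + (r_end - r_start) * step / max(total - 1, 1)
            weights += r * a[:, None] * (x - weights)
            step += 1
    return weights

def assign_clusters(data, weights):
    """Index of the nearest weight vector for every record."""
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical data: 40 records with 11 variables already scaled to [0, 1].
demo = np.random.default_rng(1).random((40, 11))
w = train_som(demo, epochs=5)
print(w.shape, np.bincount(assign_clusters(demo, w), minlength=9))
```

Because every update is a weighted average of the old weight and the input, the weights stay within the range of the scaled data.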

C. The Implementation of SOM

The SOM used in this paper contains 11 input neurons, which represent the 11 social, economic, and environmental variables. These variables were scaled to [0, 1]. The output layer is a 3×3 grid topology (see Fig. 4). The choice of 9 clusters in a two-dimensional grid is based on ease of visualization and interpretation: too many clusters make the results hard to visualize and interpret, while too few clusters are not sufficient to capture the potential patterns in the data. Weight vectors of the output neurons were initialized at the midpoint (value = 0.5 for all variables). The learning rate started from 0.9 and decayed to 0.02 during the learning phase. The Euclidean distance function was used (for the output neuron distance d) in the weight vector adjustment procedure. We also used the Euclidean distance to measure the dissimilarity between an input vector and an output neuron’s weight vector.
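The scaling method is not specified above; one common way to bring variables with very different units (for example, income in dollars and NDVI) into [0, 1] is min-max scaling. The sketch below is an assumption for illustration, with a hypothetical function name and hypothetical sample values.

```python
import numpy as np

def min_max_scale(data):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    x = np.asarray(data, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (x - lo) / span

# Hypothetical records: (median income, NDVI) for three block groups.
raw = np.array([[15000.0, 0.05], [30000.0, 0.10], [45000.0, 0.20]])
print(min_max_scale(raw))
```

Without such scaling, a variable with a large numeric range would dominate the Euclidean distance used to select the winning neuron.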

Fig. 4. The SOM 3×3 Grid topology

IV. RESULTS AND ANALYSIS

After 4000 epochs, the SOM network converged. The final weight vectors are listed in Table I and visualized in Fig. 5.

Fig. 5. Weight vectors of different clusters (horizontal axis: the 11 variables, Pop to NDVI; vertical axis: weights, 0.0 to 0.9; one series per cluster, 1 to 9)

The output layer of the SOM has a predefined topology, as shown in Fig. 4, which means that neighboring clusters are more similar to each other. The topology is preserved, and the dissimilarity among the final clusters is measured by Euclidean distance (see Table II) to show their closeness.
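A proximity matrix such as the one in Table II can be computed from the final weight vectors as pairwise Euclidean distances. The sketch below uses a hypothetical function name and hypothetical two-variable weight vectors; it does not reproduce the values of Table II.

```python
import numpy as np

def proximity_matrix(weights):
    """Pairwise Euclidean distances between cluster weight vectors."""
    w = np.asarray(weights, dtype=float)
    diff = w[:, None, :] - w[None, :, :]      # shape (k, k, dim)
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical scaled weight vectors for three clusters.
w = np.array([[0.0, 0.0], [0.3, 0.4], [0.6, 0.8]])
print(np.round(proximity_matrix(w), 3))
```

The result is symmetric with zeros on the diagonal, so only the upper or lower triangle carries information.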

The characteristics of different clusters are analyzed below:

1) The block groups that have high income, rent, %whites, education status, vehicle rate, and NDVI, and low unemployment, poverty, vacancy rate, and population and house densities were assigned to clusters 9 and 8, which occupy the upper right corner of the output layer. Cluster 9 holds most of the corresponding highest or lowest variable values among all clusters.

2) The block groups that have low income, rent, %whites, education status, vehicle rate, and NDVI, and high unemployment, poverty, vacancy rate, and population and house densities were assigned to clusters 1, 2, and 4, which occupy the lower left corner of the output layer. Cluster 1 holds most of the corresponding highest or lowest variable values among all clusters.

3) The block groups that have moderate variable values were assigned to clusters 3, 5, 6, and 7, which occupy the diagonal of the output layer from the upper left corner to the lower right corner.

4) Clusters 6 and 7 each share a different part of their characteristics with clusters 9 and 8, and they are distinct from each other.

According to the clustering result, each input record (block group) was assigned to its own cluster, and the result was visualized using ArcView, with a dot filling scheme for clusters 1, 2, and 4, a complex dot texture filling scheme for clusters 8 and 9, and a line filling scheme for clusters 3, 5, 6, and 7 (see Fig. 6).

Based on the visualization of clusters’ spatial distribution and the comparison with reference maps (see Fig. 7), the following patterns were discovered:

1) The block groups in clusters 8 and 9 tended to be concentrated in the outer parts of New Orleans, especially in the three circled areas (see Fig. 7) that have golf parks: City Park (upper left circle), Audubon Golf Club (lower left circle), and English Turn Golf and Country Club (lower right circle).

2) The block groups in clusters 1, 2, and 4 were generally located in the mid-city area.

3) The block groups in clusters 3, 5, 6, and 7 tended to be scattered among the other clusters and occupied the east part of New Orleans.

TABLE I

WEIGHT VECTORS OF DIFFERENT CLUSTERS (UNITS ARE THE SAME AS IN PART B (DATASET), SECTION II)

Cluster  Pop   Hou   Vac    Veh    Edu    Pov    Unem   White  Rent  Inc    NDVI   Cases
1        5495  2411  0.183  0.141  0.088  0.447  0.187  0.075  296   15669  0.053  89
2        4667  1995  0.146  0.196  0.104  0.365  0.146  0.073  325   19635  0.080  82
3        3420  1394  0.100  0.295  0.140  0.259  0.102  0.081  363   26498  0.124  61
4        4873  2456  0.180  0.172  0.219  0.365  0.140  0.309  359   19935  0.049  34
5        3579  1658  0.106  0.329  0.291  0.222  0.083  0.363  435   33749  0.118  25
6        2716  1121  0.073  0.408  0.234  0.169  0.073  0.211  421   36022  0.159  59
7        3461  2064  0.139  0.325  0.475  0.179  0.055  0.733  520   39170  0.093  42
8        2992  1581  0.091  0.436  0.524  0.123  0.040  0.780  568   50510  0.150  64
9        2441  1063  0.061  0.512  0.429  0.103  0.048  0.556  538   52056  0.189  26

TABLE II

PROXIMITY MATRIX OF CLUSTERING RESULTS (EUCLIDEAN DISTANCE, DISSIMILARITY MEASUREMENT)

Cluster  1      2      3      4      5      6      7      8      9
1        0.000  0.170  0.409  0.307  0.576  0.632  0.920  1.078  0.976
2        0.170  0.000  0.241  0.292  0.455  0.475  0.848  0.984  0.851
3        0.409  0.241  0.000  0.406  0.343  0.259  0.783  0.874  0.687
4        0.307  0.292  0.406  0.000  0.357  0.517  0.622  0.798  0.747
5        0.576  0.455  0.343  0.357  0.000  0.237  0.443  0.539  0.408
6        0.632  0.475  0.259  0.517  0.237  0.000  0.639  0.675  0.449
7        0.920  0.848  0.783  0.622  0.443  0.639  0.000  0.227  0.401
8        1.078  0.984  0.874  0.798  0.539  0.675  0.227  0.000  0.290
9        0.976  0.851  0.687  0.747  0.408  0.449  0.401  0.290  0.000

Generally speaking, the spatial structure of New Orleans is still at the common suburbanization stage [16], and no downtown renaissance has been found so far.

V. DISCUSSION

A. Factors that lead to the pattern

The spatial pattern discovered and described above was derived from the 11 variables, but the degree to which the variables explain the pattern is unknown. Principal component analysis [11] can derive the main components that explain the variance in the data. In this case, the first three components, whose eigenvalues are larger than 1, explain 77.7% of the total variance. After varimax rotation, the component matrix was derived as shown in Table III.
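The component selection step (keeping components whose eigenvalues exceed 1) can be sketched with NumPy. This sketch assumes the analysis is run on the correlation matrix of the variables, which is the usual setting for the eigenvalue-greater-than-1 rule; it returns unrotated loadings and does not perform the varimax rotation. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def pca_kaiser(data):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    x = np.asarray(data, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # ascending order
    order = np.argsort(eigvals)[::-1]                # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    explained = eigvals[keep].sum() / eigvals.sum()  # share of total variance
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # unrotated loadings
    return eigvals, loadings, explained

# Synthetic data: two independent factors, each measured by two noisy variables.
rng = np.random.default_rng(0)
a, b = rng.normal(size=500), rng.normal(size=500)
data = np.column_stack([a, a + 0.1 * rng.normal(size=500),
                        b, b + 0.1 * rng.normal(size=500)])
eigvals, loadings, explained = pca_kaiser(data)
print(np.round(eigvals, 2), loadings.shape, round(explained, 3))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 means that the component explains more variance than any single standardized variable.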

TABLE III

ROTATED COMPONENT MATRIX

Variables  Component 1  Component 2  Component 3
POP        -0.270       -0.033        0.923
HOU        -0.057       -0.270        0.915
VAC        -0.134       -0.822        0.077
VEH         0.531        0.632       -0.349
EDU         0.915        0.147       -0.046
POV        -0.683       -0.413        0.335
UNEM       -0.607       -0.264        0.289
WHITE       0.923        0.025       -0.023
RENT        0.813        0.201       -0.180
INC         0.702        0.468       -0.207
NDVI        0.180        0.824       -0.123

The first component was highly loaded by income, rent, %whites, education status, unemployment rate (negative), and poverty rate (negative), and hence can be interpreted as an economic-social factor. The second component was highly loaded by NDVI, vehicle rate, and vacancy rate (negative), although the loading of vehicle rate on this component (0.632) is not much higher than its loading on the first component (0.531); hence it can be interpreted as an environmental-economic factor. The third component was highly loaded by population density and house density and can be interpreted as a density factor. The distributions of these three factor scores were visualized in Figs. 8-10. Unlike the SOM topology, these figures simply show classes ordered by component score, which approximately represents the distribution of several corresponding variables simultaneously. They can be used as supplements to understand another aspect of the spatial pattern: the pattern of each individual component is quite similar to the cluster distributions in Fig. 6, and all three components can be used together to illustrate the factor aspect of the urban structure.

B. Separability of clusters

Linear discriminant analysis [13] was performed in SPSS to evaluate the clustering results. The general idea of linear discriminant analysis is to find linear combinations of input variables that separate different groups. We used the whole dataset, including the clustering result from SOM, as the input to derive the linear discriminant equations and performed a self-classification to test the separability of the different clusters in the instance space.

A good result was achieved. The first five discriminant functions, which explain 99.9% of the cumulative variance, were verified to be significant using Wilks’ Lambda. The overall classification accuracy (see Table IV) is 88%, which means the clusters are generally well separated.

TABLE IV

CLASSIFICATION CONFUSION MATRIX

                    Predicted Group Membership
Original   1    2    3    4    5    6    7    8    9    Omit %
1          67   22   0    0    0    0    0    0    0    75
2          2    77   3    0    0    0    0    0    0    94
3          0    5    55   0    0    1    0    0    0    90
4          0    2    0    29   3    0    0    0    0    85
5          0    0    0    1    24   0    0    0    0    96
6          0    0    3    0    4    52   0    0    0    88
7          0    0    0    1    3    0    35   3    0    83
8          0    0    0    0    0    0    3    60   1    94
9          0    0    0    0    1    0    0    0    25   96
Commit %   97   73   90   94   69   98   92   95   96   88
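The percentages in Table IV can be checked directly from the counts. In the sketch below, the matrix is transcribed from Table IV (rows are the original SOM clusters, columns are the predicted clusters); the variable names are illustrative. The final column and final row of the table correspond to the share of correctly classified cases per row and per column, respectively.

```python
import numpy as np

# Counts from Table IV: rows = original cluster, columns = predicted cluster.
confusion = np.array([
    [67, 22, 0, 0, 0, 0, 0, 0, 0],
    [2, 77, 3, 0, 0, 0, 0, 0, 0],
    [0, 5, 55, 0, 0, 1, 0, 0, 0],
    [0, 2, 0, 29, 3, 0, 0, 0, 0],
    [0, 0, 0, 1, 24, 0, 0, 0, 0],
    [0, 0, 3, 0, 4, 52, 0, 0, 0],
    [0, 0, 0, 1, 3, 0, 35, 3, 0],
    [0, 0, 0, 0, 0, 0, 3, 60, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 25],
])

correct = np.diag(confusion)                          # correctly classified cases
overall = correct.sum() / confusion.sum()             # 424 of 482 block groups
row_pct = 100 * correct / confusion.sum(axis=1)       # last column of Table IV
col_pct = 100 * correct / confusion.sum(axis=0)       # last row of Table IV

print(round(100 * overall))                           # 88
print(np.rint(row_pct).astype(int).tolist())
print(np.rint(col_pct).astype(int).tolist())
```

Of the 482 block groups, 424 lie on the diagonal, which gives the 88% overall accuracy reported above.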

VI. CONCLUSIONS

This work utilized the Kohonen Self-Organizing Map to cluster census block groups of New Orleans based on 11 social, economic, and environmental variables derived from census 2000 data and Landsat 7 ETM+ imagery. The characteristics of the nine clusters were interpreted and analyzed. The spatial distribution of these clusters was visualized and further analyzed to demonstrate the spatial aggregation of certain clusters. The results showed that the block groups with high economic and environmental status tend to be clustered at 3 major outer locations, while those with low economic and environmental status are concentrated in the mid-city area, which is a typical structure resulting from the suburbanization process [16].

Principal component analysis was used as a supplement to understand the spatial organizing patterns of New Orleans and to identify critical factors contributing to the spatial patterns. Three major components were extracted and linked to certain variables. The distributions of component scores were quite similar to the pattern from SOM.

Clustering performance was evaluated by using linear discriminant analysis. The result showed that the differences among clusters were significant and they can be well distinguished, which means the clustering was successful.

If vulnerability data (such as elevation data for flooding, distance to freeways, and high daytime population for evacuation planning) were considered, this method and its results could be used to discover more underlying patterns related to the recovery of New Orleans from Hurricane Katrina, which would have a broader impact than theoretical urban structure research alone.

ACKNOWLEDGMENT

This research was partially supported by a National Science Foundation grant: IIS-0326387 and an AFOSR grant: FA9550-05-1-0454. We also sincerely thank three anonymous referees for their very helpful comments.

REFERENCES

[1] E. W. Burgess, “The growth of the city: an introduction to a research project,” in The City, R. E. Park, E. W. Burgess, and D. McKenzie, Eds. Chicago: Chicago University Press, 1925, pp. 47-62.

[2] H. Hoyt, The Structure and Growth of Residential Neighbourhoods in American Cities, Washington, DC: Federal Housing Administration, 1939.

[3] C. D. Harris and E. L. Ullman, “The nature of cities,” Annals of the American Academy of Political and Social Science, vol. 242, 1945, pp. 7-17.

[4] T. Kohonen, Self-Organizing Maps, Berlin: Springer-Verlag, 1995.

[5] G. Foody, “Applications of the self-organising feature map neural network in community data analysis,” Ecological Modelling, vol. 120, no. 2, 1999, pp. 97-107.

[6] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, 1st ed. Boston: Addison Wesley, 2005.

[7] M. Takatsuka, “An application of the self-organizing map and interactive 3-D visualization to geospatial data,” in Proc. of the 6th International Conference on GeoComputation, Brisbane, Australia, 2001.

[8] L. Franzini, P. Bolchi, and L. Diappi, “Self organizing maps: a clustering neural method for urban analysis,” in Proc. of the V Rencontres de Théo Quant., A. Banos, F. Banos, J. Bolot, and C. Coutherut, Eds. 2001, pp. 1-15.

[9] J. Kropp, “A neural network approach to the analysis of city systems,” Applied Geography, vol. 18, no. 1, 1998, pp. 83-96.

[10] A. Skupin and R. Hagelman, “Visualizing demographic trajectories with Self-Organizing Maps,” Geoinformatica, vol. 9, no. 2, 2005, pp. 159-179.

[11] S. Daultrey, “Principal component analysis,” in Concepts and Techniques in Modern Geography, vol. 8, Norwich: Geo Abstracts Ltd., 1976, pp. 42.

[12] M. Y. Kiang and A. Kumar, “An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications,” Information Systems Research, vol. 12, no. 2, 2001, pp. 177-194.

[13] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, 1936, pp. 179-188.

[14] T. M. Lillesand and R. W. Kiefer, Remote Sensing and Image Interpretation, 4th ed. New York: John Wiley & Sons, 2000.

[15] J. Vesanto, J. Himberg, E. Alhoniemi, and J. Parhankangas, SOM toolbox for Matlab 5, Report A57, Helsinki University of Technology, 2000.

[16] T. Hall, Urban Geography, Routledge Contemporary Human Geography Series, 2nd ed. London, New York: Routledge, 2001.