Image Texture Feature Extraction Based on Hadoop Cloud Platform and New ImageClass


Available at http://www.joics.com


Haodong Zhu^{a,b}, Hongchan Li^{b,∗}, Di Wu^{b}, Deshuang Huang^{a}, Bing Wang^{a}

^{a} Computer Science and Technology Department, College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China

^{b} School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China

Abstract

With the increasing amount of digital image data, image texture feature extraction has become a key step in digital image processing. Hadoop, an open source cloud platform with excellent capacity for processing and storing massive data, provides the MapReduce parallel computation model and the HDFS distributed file system. In this paper, we first introduce the Hadoop programming framework and the Tamura texture features, and then carry out image texture feature extraction on the Hadoop platform. In this process, every image file is treated as a split that is processed as one job record, every Map task corresponds to one image file, and a Reduce task writes the result to the specified location in the specified format. We also define an ImageClass that not only implements the basic function modules required by Hadoop but can also be extended with further modules as actually needed. The comparison results show that for low-resolution images the Matlab platform has a clear advantage over the Hadoop platform, whereas for high-resolution images the Hadoop platform takes less time and shows better data processing capability.

Keywords: Hadoop; Texture Feature; Image Processing; Feature Extraction

1 Introduction

As one of the three underlying features of an image, texture does not depend on color or brightness. It captures the arrangement and organization of surface structures, shows their connection with the surrounding context, and reflects homogeneous patterns that recur in the visual features [1]. Texture features have therefore been widely used in image classification and retrieval.

⋆ Project supported by the National Natural Science Foundation of China (No. 61201447).

∗ Corresponding author.

Email address: [email protected] (Hongchan Li).


Based on psychological studies of human visual perception of texture, Tamura [2] proposed a new way to express texture features. Its six components correspond to six psychological properties of texture, namely coarseness, contrast, directionality, line-likeness, regularity, and roughness. Of these, the three most important are coarseness, directionality and contrast [3, 4]. The higher the image resolution, the better the image detail is expressed and the better the texture features that can be obtained, but the computation time increases quickly. For example, we extracted texture features from four kinds of leaf images in three different image databases: ImageClef, Flavia and ICL. The image resolutions of the three databases are 500 × 800, 800 × 600, and 300 × 400, respectively. The time consumed extracting texture features from the three image databases on Matlab is shown in Fig. 1.

[Bar chart: Time/s (0 to 45) for the Flavia, ICL and ImageClef databases]

Fig. 1: Time consumed extracting texture features from three image databases on Matlab

As Fig. 1 shows, the images from ImageClef and Flavia have higher resolution, so their average extraction time is longer, about 30 s. The images from ICL have lower resolution, so their average extraction time is shorter, about 6 s.

In order to reduce the time consumed extracting texture features from massive image collections, this paper combines the Tamura algorithm with Hadoop and proposes a new Tamura algorithm based on the Hadoop platform to rapidly extract image texture features.

2 Hadoop

Hadoop [5] is a distributed computation framework for massive data developed by the Apache Software Foundation. With this framework, users can develop distributed programs without concerning themselves with the underlying details, and can make full use of the computing speed and storage capacity of a cluster [6]. As a widely used distributed computation platform, Hadoop has been widely applied in image-related fields [7]. Zhu Yi-ming [8] developed an image classification method based on the Hadoop platform. Zhang Liang-jiang [9] implemented a parallel image processing system that can resize images, perform Canny edge detection, and so on. Yong Feng [10] designed a parallel automatic deep Web data extraction method based on Hadoop. In order to achieve higher image processing speed, Ranajoy Malakar [11] integrated CUDA acceleration into the Hadoop framework. Liu [12] developed a massive image data management system based on HBase and MapReduce. Other related studies of image processing on Hadoop are presented in [13-16].

Hadoop contains multiple subprojects, such as MapReduce, HDFS and HBase. Due to limited space, we only briefly introduce MapReduce and HDFS.

2.1 MapReduce

MapReduce [6] is a distributed programming framework developed by Google. Application programs that follow this framework can run on large clusters containing thousands of computers and process massive data in a reliable, fault-tolerant way. Hadoop can process various types of files, such as text, image and video files, and a dedicated program can be written for each requirement. Here we illustrate the MapReduce program flow with the Hadoop WordCount example; its execution flow is shown in Fig. 2.

[Flow diagram: related data, TextInputFormat and TextRecordReader, (InputKey, InputValue), Map, (MapOutputKey, MapOutputValue) intermediate files processed according to the key value, Reduce, OutputFormat and RecordWriter, output]

Fig. 2: MapReduce execution flow

Step 1: TextInputFormat divides the target file into logical splits, each of which is assigned to a single mapper. TextInputFormat also provides a RecordReader that reads the data in a logical split and generates the key-value pairs that serve as the input of a map task.

Step 2: The mapper receives the key-value pairs and processes them with the user-defined map logic. New key-value pairs are then generated and sorted by key. Finally, the combine process is executed to add up the values of pairs that have the same key.

Step 3: The input data is sorted and processed by the user-defined reduce function, and new key-value pairs are generated. These pairs are written to the specified location according to the OutputFormat defined by the designer.
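The three steps can be imitated in a few lines of plain Python. The sketch below is only an in-memory illustration of the word-count flow, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and each string in the input list stands in for one logical split.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: add up the counts collected for one key."""
    return key, sum(values)

def word_count(splits):
    intermediate = []
    for split in splits:  # on a cluster, each split would go to its own mapper
        intermediate.extend(map_phase(split))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count([
    "hadoop map reduce",
    "map tasks run in parallel",
    "reduce writes output",
])
```

On a real cluster the map calls run on different nodes and the grouping is performed by the framework; only the map and reduce logic is written by the user.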


2.2 HDFS

HDFS [7, 8] was developed as the infrastructure of the Apache Nutch search engine project and is now part of the Apache Hadoop Core project. It has a lot in common with many existing distributed file systems, but the differences are also evident. HDFS is a highly fault-tolerant system that is suitable for deployment on cheap machines. It provides high-throughput data access and is well suited to applications with large data sets.

As Fig. 3 shows, HDFS adopts a master/slave architecture. Its major components are the Client, NameNode, Secondary NameNode and DataNode, described as follows:

[Architecture diagram: Clients, a NameNode (name space state, block map), a Secondary NameNode (NameNode metadata, local disk), and four DataNodes that store replicated blocks b1 to b4 and send Heartbeat & BlockReport messages to the NameNode]

Fig. 3: The architecture diagram of HDFS

(1) Client: It communicates with the NameNode and the DataNodes on behalf of the user, and thus accesses files in HDFS.

(2) NameNode: The entire cluster has only one NameNode. It is the brain of HDFS and is responsible for managing the HDFS directory tree and the associated file metadata, and for monitoring the health status of each DataNode.

(3) Secondary NameNode: It regularly merges the fsimage file with the edits log and transfers the result to the NameNode.

(4) DataNode: Each slave node runs a DataNode, which regularly reports information about its stored data to the NameNode.

3 Tamura Texture Feature

3.1 Coarseness Degree

Coarseness is the most essential feature of texture and measures the granularity of the texture. When texture patterns differ in element size, the pattern with the larger element size is coarser. The procedure can be summarized in the following steps:

Step 1. Set the size of the active window to $2^k \times 2^k$. The average gray value of the pixels in the window neighborhood at point $(x, y)$ is

$$A_k(x, y) = \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \; \sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} \frac{f(i, j)}{2^{2k}} \qquad (1)$$

where $k = 0, 1, \ldots, 5$ and $f(i, j)$ is the gray value at $(i, j)$.

Step 2. At each point, calculate the average intensity difference between non-overlapping windows in both the horizontal (h) and the vertical (v) direction:

$$E_{k,h}(x, y) = \left|A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y)\right|, \quad E_{k,v}(x, y) = \left|A_k(x, y + 2^{k-1}) - A_k(x, y - 2^{k-1})\right| \qquad (2)$$

Step 3. At each point, set the best size $S_{best}(i, j) = 2^k$, where $k$ is the value that maximizes $E$ in either direction.

Step 4. Take the average of $S_{best}$ over the picture to get the coarseness degree:

$$F_{crs} = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} S_{best}(i, j) \qquad (3)$$

where $m$ and $n$ are the effective width and height of the picture.

Regarding the range of k, Tamura describes two cases. In the first, k = 0, 1, 2, 3: the images have no noise, Sbest is always the maximum, and the amount of calculation is lower. In the second, k = 0, 1, ..., 5: the images have noise and Sbest is unstable, which affects the calculation results. Therefore, the image must be preprocessed to eliminate noise before texture features are extracted.
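The four steps can be written down directly. The following is a minimal, unoptimized Python sketch under several assumptions that the text does not fix: the image is a list of rows of gray values, k runs from 1 to k_max so that 2^(k-1) is an integer, pixels outside the image are skipped when a window is averaged, and ties between window sizes are resolved in favor of the smaller k. The function names are hypothetical.

```python
def window_mean(img, x, y, k):
    """A_k(x, y): mean gray value of the 2^k x 2^k window around (x, y).

    Pixels outside the image are skipped (a border-handling assumption);
    None is returned when the window lies entirely outside the image.
    """
    height, width = len(img), len(img[0])
    half = 2 ** (k - 1)
    total, count = 0.0, 0
    for j in range(y - half, y + half):
        for i in range(x - half, x + half):
            if 0 <= i < width and 0 <= j < height:
                total += img[j][i]
                count += 1
    return total / count if count else None

def abs_diff(a, b):
    """|a - b|, or 0 when one of the windows does not exist."""
    return 0.0 if a is None or b is None else abs(a - b)

def coarseness(img, k_max=5):
    """F_crs: average over all pixels of the best window size 2^k."""
    height, width = len(img), len(img[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            best_k, best_e = 1, -1.0
            for k in range(1, k_max + 1):
                half = 2 ** (k - 1)
                e_h = abs_diff(window_mean(img, x + half, y, k),
                               window_mean(img, x - half, y, k))
                e_v = abs_diff(window_mean(img, x, y + half, k),
                               window_mean(img, x, y - half, k))
                e = max(e_h, e_v)
                if e > best_e:  # strict comparison keeps the smaller k on ties
                    best_k, best_e = k, e
            total += 2 ** best_k
    return total / (width * height)

flat = [[128] * 8 for _ in range(8)]
step = [[0] * 4 + [255] * 4 for _ in range(8)]
flat_crs = coarseness(flat, k_max=2)
step_crs = coarseness(step, k_max=2)
```

Every window mean is recomputed from scratch here, so the cost grows quickly with resolution and k_max, which is consistent with the observation that coarseness dominates the computation time for high-resolution images.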

3.2 Contrast Degree

The contrast degree of the target image is obtained from the statistical distribution of its gray values. The kurtosis is usually defined as $\alpha_4 = \mu_4 / \sigma^4$, where $\mu_4$ is the fourth moment about the mean and $\sigma^2$ is the variance. The contrast degree is measured by the following formula:

$$F_{con} = \frac{\sigma}{\alpha_4^{1/4}} \qquad (4)$$
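As a minimal sketch of formula (4), the function below computes the contrast of a gray image given as a list of rows. The function name is hypothetical, and returning 0 for a perfectly flat image (where the kurtosis is undefined) is an assumption of this sketch.

```python
def contrast(img):
    """F_con = sigma / alpha_4^(1/4), with alpha_4 = mu_4 / sigma^4."""
    pixels = [float(p) for row in img for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n  # sigma^2
    if variance == 0:
        return 0.0  # flat image: treated as zero contrast (assumption)
    mu4 = sum((p - mean) ** 4 for p in pixels) / n       # fourth central moment
    alpha4 = mu4 / variance ** 2                         # kurtosis
    return variance ** 0.5 / alpha4 ** 0.25

checker = contrast([[0, 255], [255, 0]])
flat = contrast([[7, 7], [7, 7]])
```

For the two-level checkerboard the kurtosis is 1, so the contrast equals the standard deviation, 127.5.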


3.3 Directionality

Because different texture images have different directionality, Tamura uses directionality to describe whether the texture directions diverge or concentrate on certain directions. The procedure can be summarized in the following steps:

Step 1. Calculate the gradient vector at each pixel. The modulus and the direction of the vector are given by

$$|\Delta G| = \frac{|\Delta_H| + |\Delta_V|}{2}, \quad \theta = \tan^{-1}\left(\frac{\Delta_V}{\Delta_H}\right) + \frac{\pi}{2} \qquad (5)$$

where $\Delta_H$ and $\Delta_V$ are calculated as the convolution of the input image with the following 3 × 3 operators:

$$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$$

Step 2. The histogram of θ is obtained with

$$H_D(k) = \frac{N_\theta(k)}{\sum_{i=0}^{n-1} N_\theta(i)} \qquad (6)$$

where $n$ is the number of quantization levels of the direction angle, $t$ is a threshold, and $N_\theta(k)$ is the number of pixels for which $\frac{(2k-1)\pi}{2n} \le \theta \le \frac{(2k+1)\pi}{2n}$ and $|\Delta G| \ge t$. For a texture image without obvious directionality, the histogram is relatively flat; otherwise it shows distinct peaks.

Step 3. Directionality is calculated with

$$F_{dir} = \sum_{p}^{n_p} \sum_{\phi \in w_p} (\phi - \phi_p)^2 H_D(\phi) \qquad (7)$$

where $n_p$ is the number of peaks in the histogram $H_D$, $p$ is one of these peaks, $w_p$ is the set of discrete bins that belong to peak $p$, and $\phi_p$ is the center position of the highest bin in $w_p$.
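Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the operators are applied as a sliding-window sum over interior pixels only, the default values n_bins = 16 and threshold = 12 are arbitrary choices, and the function names are hypothetical. The last function evaluates formula (7) only for the simplified case of a single peak whose range covers all bins, with bin indices used as positions; detecting several peaks and their valleys is not shown.

```python
import math

PREWITT_H = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
PREWITT_V = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

def apply_3x3(img, kernel, x, y):
    """Weighted sum of the 3 x 3 neighborhood of the interior pixel (x, y)."""
    return sum(kernel[dy + 1][dx + 1] * img[y + dy][x + dx]
               for dy in (-1, 0, 1) for dx in (-1, 0, 1))

def direction_histogram(img, n_bins=16, threshold=12.0):
    """H_D: normalized histogram of theta over pixels with |dG| >= threshold."""
    height, width = len(img), len(img[0])
    counts = [0] * n_bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            dh = apply_3x3(img, PREWITT_H, x, y)
            dv = apply_3x3(img, PREWITT_V, x, y)
            if (abs(dh) + abs(dv)) / 2.0 < threshold:
                continue
            if dh == 0:  # limit of atan(dv / dh) for a vanishing dh
                angle = math.pi / 2 if dv > 0 else -math.pi / 2
            else:
                angle = math.atan(dv / dh)
            theta = angle + math.pi / 2  # lies in [0, pi]
            # Bin k is centered at k * pi / n_bins; theta = pi wraps to bin 0.
            k = int(math.floor(theta * n_bins / math.pi + 0.5)) % n_bins
            counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins

def directionality_single_peak(hist):
    """Second moment of H_D about its highest bin (single-peak simplification)."""
    peak = max(range(len(hist)), key=hist.__getitem__)
    return sum((k - peak) ** 2 * h for k, h in enumerate(hist))

vertical_edge = [[0] * 3 + [255] * 3 for _ in range(6)]
horizontal_edge = [[0] * 6] * 3 + [[255] * 6] * 3
hist_v = direction_histogram(vertical_edge)
hist_h = direction_histogram(horizontal_edge)
```

For a single straight edge all gradient directions fall into one bin, so the histogram has one sharp peak and the second moment about it is zero; a texture without a dominant direction spreads over many bins and yields a larger value.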

4 Proposed Image Texture Feature Extraction

We treat each image file as a split that is processed as one job record. Each Map task corresponds to one image file and extracts the texture features of that image. A Reduce task then writes the results to the specified location in the specified format. To achieve this, we first implement a new data type, ImageClass, which is used to process and store images. Second, InputFormat and RecordReader are redefined to transform image files into that data type. Third, the image texture features are extracted in the Map function.


4.1 Defined New ImageClass

A class that implements the Writable interface can act as a value in Hadoop, but Hadoop provides no built-in class that can serve as a key or value type for images. To solve this problem, we define the ImageClass, which not only implements the basic function modules required by Hadoop but can also be extended with additional modules as actually needed. The functions of the class are shown in Fig. 4.

As Fig. 4 shows, the ImageClass not only implements the basic functions required by Hadoop, but can also convert an image to grayscale, get the size of the target image, get and set pixel data, and perform other functions.

[Class diagram: Image with the functions Rgb2gray, Get width, Get height, Get pixel, Get data and Set pixel]

Fig. 4: Newly defined ImageClass
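To make the role of such a class concrete, the following Python sketch mimics the functions listed in Fig. 4 together with a pair of serialization methods that play the part of the write and readFields methods of Hadoop's Writable interface. It is an analogy only: the actual ImageClass would be a Java class, and the byte layout (two big-endian integers followed by raw RGB bytes) and the grayscale weights 0.299, 0.587 and 0.114 are assumptions of this sketch.

```python
import io
import struct

class ImageClass:
    """In-memory RGB image that can serialize itself to a byte stream."""

    def __init__(self, width=0, height=0, pixels=None):
        self.width = width
        self.height = height
        self.pixels = pixels if pixels is not None else [(0, 0, 0)] * (width * height)

    def get_width(self):
        return self.width

    def get_height(self):
        return self.height

    def get_pixel(self, x, y):
        return self.pixels[y * self.width + x]

    def set_pixel(self, x, y, rgb):
        self.pixels[y * self.width + x] = tuple(rgb)

    def get_data(self):
        return list(self.pixels)

    def rgb2gray(self):
        """Gray values as a flat list, using weighted luma coefficients."""
        return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in self.pixels]

    def write(self, out):
        """Serialize: width, height, then three bytes per pixel."""
        out.write(struct.pack(">ii", self.width, self.height))
        for r, g, b in self.pixels:
            out.write(bytes((r, g, b)))

    def read_fields(self, inp):
        """Deserialize from a stream previously produced by write()."""
        self.width, self.height = struct.unpack(">ii", inp.read(8))
        raw = inp.read(self.width * self.height * 3)
        self.pixels = [tuple(raw[i:i + 3]) for i in range(0, len(raw), 3)]

image = ImageClass(2, 1)
image.set_pixel(0, 0, (255, 0, 0))
image.set_pixel(1, 0, (0, 0, 255))
buffer = io.BytesIO()
image.write(buffer)
copy = ImageClass()
copy.read_fields(io.BytesIO(buffer.getvalue()))
```

The round trip through write and read_fields is what allows an image object to be passed between tasks as a value; the image-specific functions are the extra modules added on top of that basic requirement.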

4.2 Image InputFormat

InputFormat describes the detailed rules for the input of a MapReduce job. FileInputFormat is the base class of all InputFormat implementations that use files as their data source. We implement the following two classes using the Hadoop API:

1) ImageFileInputFormat: a class that inherits from FileInputFormat; it treats one image as one split and does not divide the file further.

2) ImageRecordReader: a class that inherits from RecordReader; it transforms the input split into a key-value pair.
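The effect of these two classes on the job can be imitated in memory. In the Python sketch below, which is an illustration and not Hadoop code, record_reader yields exactly one (file name, image) record per file, map_task handles one image, and a single reduce_task formats all results. The mean gray value is only a stand-in for the Tamura features, and the function names and the tab-separated output format are assumptions.

```python
def record_reader(image_files):
    """Yield one (file name, image) record per file; files are never subdivided."""
    for name in sorted(image_files):
        yield name, image_files[name]

def map_task(name, image):
    """Map: extract features from one whole image (placeholder feature)."""
    pixels = [p for row in image for p in row]
    return name, {"mean_gray": sum(pixels) / len(pixels), "pixels": len(pixels)}

def reduce_task(mapped):
    """Reduce: write one formatted line per image, ordered by file name."""
    lines = []
    for name, feats in sorted(mapped, key=lambda item: item[0]):
        lines.append(f"{name}\t{feats['mean_gray']:.2f}\t{feats['pixels']}")
    return "\n".join(lines)

def run_job(image_files):
    mapped = [map_task(name, image) for name, image in record_reader(image_files)]
    return reduce_task(mapped)

output = run_job({
    "b.png": [[0, 255]],
    "a.png": [[10, 20], [30, 40]],
})
```

Because each record is a complete image, the map tasks are independent of one another and can be distributed over the cluster without any communication between them.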

5 Experiments

5.1 Experiment Platform

We use two experiment platforms: (1) Common platform: it is configured with an eight-core Intel Core i7 processor, 4 GB of memory, a 1 TB hard disk, and Matlab R2012a. (2) Hadoop platform: it contains one master node and four slave nodes and is configured with an eight-core Intel Core i7 processor, 4 GB of memory, a 1 TB hard disk, Hadoop 2.0.4, and Java 1.7.25.


5.2 Experiment Results and Analysis

In order to verify the efficiency of the algorithm at different resolutions, we select three data sets: Flavia, ICL and ImageClef. Their URLs are shown in Table 1. We first extract 2000 images from the three data sets and divide them into five groups of 100, 200, 500, 1000 and 2000 images, and then extract image texture features on the two platforms. Fig. 5, Fig. 6 and Fig. 7 show the consumed time. Fig. 8 and Fig. 9 show the speedup rate.

Table 1: Dataset URL

Image database    Download URL
Flavia            http://flavia.sourceforge.net/
ImageClef         http://www.imageclef.org/2012/plant
ICL               http://www.intelengine.cn/dataset

[Plot: Time/s (0 to 4.5, axis scale ×10^4) versus image number (100, 200, 500, 1000, 2000) for ICL, Flavia and ImageClef]

Fig. 5: Consumed time of Matlab platform

From Fig. 5, Fig. 6 and Fig. 7, we can see that the consumed time grows by multiples as the number of images increases. There are two reasons: (1) The image resolution is 800 × 600 in Flavia, about 500 × 800 in ImageClef, and about 300 × 400 in ICL. Because the amount of calculation for the coarseness degree is closely related to the image resolution, the computation time grows more markedly for the higher-resolution data sets. (2) The Matlab platform extracts image texture features serially, while the Hadoop platform runs multiple Map tasks in parallel, so its computation time is lower than that of the Matlab platform.

As the above five charts show, the Tamura algorithm runs more efficiently on the Hadoop platform than on the Matlab platform, and the speedup increases with the number of images. The algorithm also becomes more efficient on larger data sets as the number of nodes in the cluster increases. Compared with the traditional method, Tamura on Hadoop has higher efficiency and scalability and can be used to extract features from larger image data sets.


[Plot: Time/s (0 to 7000) versus image number (100, 200, 500, 1000, 2000) for ICL, Flavia and ImageClef]

Fig. 6: Consumed time of Hadoop platform with 3 nodes

[Plot: Time/s (0 to 5000) versus image number (100, 200, 500, 1000, 2000) for ICL, Flavia and ImageClef]

Fig. 7: Consumed time of Hadoop platform with 4 nodes

[Plot: Speedup (4.0 to 8.0) versus image number (0 to 2500) for ICL, Flavia and ImageClef]

Fig. 8: Speedup rate of Hadoop platform with 3 nodes

[Plot: Speedup (5.0 to 10.0) versus image number (0 to 2500) for ICL, Flavia and ImageClef]

Fig. 9: Speedup rate of Hadoop platform with 4 nodes

6 Conclusions and Future Work

We extracted texture features from massive image files on Hadoop. To solve the problem that Hadoop cannot load image files directly, we designed a new input format, ImageInputFormat, and a new data type, ImageClass, to meet the needs of image input and image processing. The method makes full use of the parallel processing capabilities of Hadoop, ensures the accuracy of the data and shortens the computation time. The experimental results show that the method is effective. However, because the Hadoop block size is 64 MB and each image used in the experiment is smaller than 1 MB, storage space is wasted. The scheduling policy of the Hadoop platform also affects the timeliness of the proposed method. How to improve the utilization of system memory when storing a large number of small files, and how to design a better scheduling strategy, are the focus of our future research.


Acknowledgement

The authors would like to thank the editors and anonymous reviewers for their valuable comments. This work is also supported in part by the Technology Innovation Outstanding Talents Plan Project of Henan Province of China under Grant No. 134200510025, the Youth Backbone Teachers Funding Planning Project of Colleges and Universities in Henan Province of China under Grant No. 2014GGJS-084, the Science and Technology Research Key Project of Education Department of Henan Province of China under Grant No. 13A520367, the Youth Backbone Teachers Training Targets Funded Project of Zhengzhou University of Light Industry of Henan Province of China under Grant No. XGGJS02, the Ph.D. Research Funded Project of Zhengzhou University of Light Industry of Henan Province of China under Grant No. 2010BSJJ038, and the Science and Technology Innovation Fund Project of Postgraduate of Zhengzhou University of Light Industry.

References

[1] Yubao Hao, Renli Wang, Jun Ma, Image retrieval based on improved Tamura texture features, Science of Surveying and Mapping, 35(4), 2010, 136-138

[2] Xiaoqi Lu, Jinge Guo, Yuhong Zhao, Research and realization of Tamura texture feature extraction method based on image segmentation, Chinese Journal of Tissue Engineering Research, 16(17), 2012, 3160-3163

[3] Shunjie Wang, Chun Qi, Yusheng Cheng, Application of Tamura texture feature to classify underwater targets, Applied Acoustics, 31(2), 2012, 135-139

[4] Tomas Majtner, David Svoboda, Extension of Tamura texture features for 3D fluorescence microscopy, Proc. of 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, 2012, 301-307

[5] M. Armbrust, A. Fox, R. Griffith, A view of cloud computing, Communications of the ACM, 53(4), 2010, 50-58

[6] A. Srinivasan, T. A. Faruquie, S. Joshi, Data and task parallelism in ILP using MapReduce, Machine Learning, 86(1), 2012, 141-168

[7] Jongwoo Song, Seongjoo Song, A quantile estimation for massive data with generalized Pareto distribution, Computational Statistics & Data Analysis, 56(1), 2012, 143-150

[8] Yiming Zhu, Image classification based on Hadoop platform, Journal of Southwest University of Science and Technology, 26(2), 2011, 70-73

[9] Liangjiang Zhang, Fei Huan, Yangde Wang, Parallel image processing implementation under Hadoop cloud platform, Information Security and Communication Privacy, (10), 2012, 59-62

[10] Yong Feng, Dongfeng Jia, Huijuan Wang, PFIME: Parallel automatic deep Web data extraction based on Hadoop, Journal of Computational Information Systems, 10(9), 2014, 3863-3870

[11] Ranajoy Malakar, Naga Vydyanathan, A CUDA-enabled Hadoop cluster for fast distributed image processing, Proc. of the 2013 National Conference on Parallel Computing Technologies, 2013, 1-5

[12] Yuehu Liu, Bin Chen, Wenxi He, Massive image data management using HBase and MapReduce, Proc. of the 2013 21st International Conference on Geoinformatics, 2013, 1-5

[13] Weiwei Li, Hang Zhao, Yang Zhang, Research on massive data mining based on MapReduce, Computer Engineering and Applications, 49(20), 2013, 112-117

[14] Ziwen Chi, Zhang Feng, Zhenhong Du, Cloud storage of massive remote sensing data based on distributed file system, Proc. of 2013 IEEE International Conference on Communication and Computing, Signal Processing, 2013, 1-4

[15] Chao-tung Yang, Kuan-lung Huang, William C. Chu, Implementation of video and medical image services in cloud, Proc. of 2013 IEEE 37th Annual Computer Software and Application Conference Workshops, 2013, 451-456

[16] Wichian Premchaiswadi, Anucha Tungkatsathan, Sarayut Intarasema, Improving performance of content-based image retrieval schemes using Hadoop MapReduce, Proc. of 2013 International Conference on High Performance Computing and Simulation, 2013, 615-620