An Experimental Construction for Distributed Image Processing Using Hadoop


8 International Journal for Modern Trends in Science and Technology, Volume 3, Special Issue 5, October 2017


N Jagajeevan¹ | P. Subba Rao² | G Sreenivasulu³ | Y. Prakash Rao⁴

¹,² Department of CSE, CIET, Lam Guntur, Andhra Pradesh, India.

³,⁴ Department of CSE, Madanapalle Institute of Technology and Science, Madanapalle, Andhra Pradesh, India.

To Cite this Article

N Jagajeevan, P. Subba Rao, G Sreenivasulu and Y.Prakash Rao, “An Experimental Construction for Distributed Image Processing Using Hadoop”, International Journal for Modern Trends in Science and Technology, Vol. 04, Special Issue 01, January 2018, pp. 08-14.

ABSTRACT

Image collections are growing dramatically and now reach petabytes of data. Such large volumes cannot be analyzed on a personal computer within a reasonable time; processing modern image collections therefore requires distributed computing. This paper presents the MapReduce Image Processing framework (MIPr), which makes distributed computing available for image processing. MIPr is based on MapReduce and its open source implementation, Apache Hadoop. MIPr provides various image representations in the Hadoop internal format and input/output tools for integrating image processing into the Hadoop data workflow. The image formats in MIPr are based on popular image processing libraries. Furthermore, MIPr includes a high-level Image processing API for developers who are not familiar with Hadoop. This API allows developers to create sequential functions that process one image or a group of related images; the framework then applies such functions to large numbers of images in parallel. In addition, MIPr includes MapReduce implementations of popular image processing algorithms, which can be used for distributed image processing without any software development. The MIPr framework significantly simplifies image processing in the Hadoop distributed environment.

Keywords: image processing, MapReduce, Hadoop, distributed computing

I. INTRODUCTION

Several technologies for parallel and distributed computing exist today, the most popular being MPI, OpenMP, and MapReduce [1]. For most image processing tasks MapReduce is the preferred technology, because it provides automatic parallelization, work distribution among cluster nodes, and fault tolerance. Hence, application developers can concentrate on the image analysis instead of dealing with the complicated details of distributed computing. In addition, popular MapReduce implementations include a distributed file system (DFS) that stores large volumes of data on commodity hardware. As a result, the cost of storing large image collections decreases significantly. The most popular open source MapReduce implementation is Apache Hadoop [3].

Image processing is often used in many scientific areas, such as medical imaging, astronomical data analysis, remote sensing, and so on. During the last few years, the sizes of image collections increased dramatically and reached petabytes of data. Such volumes cannot be processed on a personal computer within a reasonable time. Hence, contemporary image processing tasks require distributed computing.

Available online at: http://www.ijmtst.com/ncracse2018.html

Special Issue from 3rd National Conference on Recent Advances in Computer Science and Engineering, 27th–28th January 2018, Guntur, Andhra Pradesh, India

Hadoop itself, however, provides no built-in image data types or image input/output formats. Therefore, application developers must implement many routine tasks themselves, even though these tasks are common to every Hadoop-based image processing application. Examples of such tasks are reading an image from DFS into memory, converting the image to the Hadoop internal representation, and writing the image back to DFS after processing.

The MapReduce Image Processing (MIPr) framework presented in this paper aims to make image processing in Hadoop easy and efficient. The framework provides image representations in the internal Hadoop formats and input/output tools for integrating image processing into the Hadoop data workflow.

Furthermore, MIPr offers an Image processing API for developers who are not familiar with Hadoop. The API hides the internal details of the Hadoop distributed computing environment from application developers and lets them concentrate on creating image processing algorithms.

II. BACKGROUND

A. MapReduce, Hadoop, and HDFS

In MapReduce, data are represented as key-value pairs, and processing consists of two computational phases (Map and Reduce) and one communication phase (Shuffle and Sort). The advantage of the MapReduce programming model is the ability to process different key-value pairs independently and in parallel. This parallelization is provided automatically by the MapReduce implementation; application developers need only write serial Map and Reduce functions.
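The three phases can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) and the sample job that counts files per extension are assumptions made for illustration only, and a real implementation would run the Map and Reduce calls on different cluster nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the serial map function to every input (key, value) pair."""
    emitted = []
    for key, value in records:
        emitted.extend(map_fn(key, value))
    return emitted


def shuffle_and_sort(pairs):
    """Communication phase: group emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reduce_fn):
    """Apply the serial reduce function to each (key, values) group."""
    output = []
    for key, values in groups:
        output.extend(reduce_fn(key, values))
    return output


def run_job(records, map_fn, reduce_fn):
    return reduce_phase(shuffle_and_sort(map_phase(records, map_fn)), reduce_fn)


# Hypothetical job: count input files per (lower-cased) file extension.
def ext_map(name, size_bytes):
    yield name.rsplit(".", 1)[-1].lower(), 1


def count_reduce(ext, ones):
    yield ext, sum(ones)


files = [("a.jpg", 10), ("b.png", 20), ("c.JPG", 30)]
result = run_job(files, ext_map, count_reduce)
print(result)
```

Only `ext_map` and `count_reduce` are job-specific; everything else is generic plumbing, which is the part a MapReduce implementation supplies automatically.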

Despite the limitations of the MapReduce model, many image processing tasks map onto it easily. The simplest case is independent processing of a large volume of images, for example SIFT descriptor extraction or face recognition. This scenario can be implemented as a Map-only job in which each Map function processes a single image. More complicated cases, for example co-addition [5], require processing of related images. In such cases, separate Map functions process the related images, and a Reduce function combines the Map outputs into one resulting image.
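The related-images case can be sketched as follows. This is a simplified illustration, not an astronomical co-addition pipeline: it assumes each image is a flat list of pixel values tagged with a hypothetical `region` key, and it combines related images by a plain pixel-wise mean, ignoring registration and weighting.

```python
from collections import defaultdict


def coadd_map(image):
    """Map: key each image by the region it covers so related images meet in one Reduce."""
    yield image["region"], image["pixels"]


def coadd_reduce(region, pixel_lists):
    """Reduce: average the related images pixel by pixel into one resulting image."""
    count = len(pixel_lists)
    summed = [sum(column) for column in zip(*pixel_lists)]
    return region, [value / count for value in summed]


def run(images):
    """Single-process stand-in for the Map, Shuffle and Sort, and Reduce phases."""
    groups = defaultdict(list)
    for image in images:
        for region, pixels in coadd_map(image):
            groups[region].append(pixels)
    return dict(coadd_reduce(region, lists) for region, lists in sorted(groups.items()))


images = [
    {"region": "r1", "pixels": [10, 20, 30]},
    {"region": "r1", "pixels": [30, 40, 50]},
    {"region": "r2", "pixels": [5, 5, 5]},
]
stacked = run(images)
print(stacked)
```

A Map-only job is the degenerate case of this sketch in which the grouping step and `coadd_reduce` are dropped and each Map output is written directly.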

Google proposed the MapReduce programming model but did not share its implementation. Based on the paper by Google [1], several open source MapReduce implementations were created, the most popular of which is Apache Hadoop [3]. In addition to MapReduce, Hadoop includes the Hadoop Distributed File System (HDFS) [11].

B. Hadoop Data Workflow

To enable image processing in Hadoop, it is necessary to create three components:

• Image representations in Hadoop internal format with the ability of serialization and deserialization.

• Custom implementations of InputFormat and RecordReader for reading images from HDFS to Hadoop for MapReduce processing.

• Custom implementations of OutputFormat and RecordWriter for writing images to HDFS after MapReduce processing.

Hadoop automates the data workflow in MapReduce. Firstly, the data must be read from HDFS. For this purpose Hadoop uses the InputFormat class, which divides the input data into logical parts called splits and provides a RecordReader. Each split is processed by a separate Map process (Mapper). The RecordReader reads the data and presents them as key-value pairs.

Secondly, the data must be represented in an internal Hadoop format suitable for serialization, because Hadoop writes data to a byte stream for network transfer or temporary storage on local disks. The most popular serialization technology is Hadoop Writable, but there are other options such as Apache Avro or Apache Thrift.
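To make the serialization requirement concrete, the sketch below mimics the shape of the Hadoop Writable contract (a `write` method that emits fields to a byte stream and a `readFields` method that restores them) in plain Python. The `ImageRecord` class, its fields, and the byte layout are hypothetical and chosen only for illustration; they are not the MIPr or HIPI wire format.

```python
import io
import struct


class ImageRecord:
    """Hypothetical stand-in for a Writable image wrapper (not a Hadoop or MIPr class)."""

    def __init__(self, file_name="", width=0, height=0, pixels=b""):
        self.file_name = file_name
        self.width = width
        self.height = height
        self.pixels = pixels  # one byte per gray-scale pixel, row-major

    def write(self, stream):
        """Serialize all fields to a byte stream (big-endian, length-prefixed name)."""
        name = self.file_name.encode("utf-8")
        stream.write(struct.pack(">I", len(name)))
        stream.write(name)
        stream.write(struct.pack(">II", self.width, self.height))
        stream.write(self.pixels)

    def read_fields(self, stream):
        """Restore all fields from a byte stream written by write()."""
        (name_length,) = struct.unpack(">I", stream.read(4))
        self.file_name = stream.read(name_length).decode("utf-8")
        self.width, self.height = struct.unpack(">II", stream.read(8))
        self.pixels = stream.read(self.width * self.height)


original = ImageRecord("cat.png", 2, 2, bytes([0, 64, 128, 255]))
buffer = io.BytesIO()
original.write(buffer)

buffer.seek(0)
restored = ImageRecord()
restored.read_fields(buffer)
print(restored.file_name, restored.width, restored.height, list(restored.pixels))
```

The round trip through `buffer` stands in for the network transfer or local spill that Hadoop performs between the Map and Reduce phases.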

Lastly, the data must be written back to HDFS after MapReduce processing. As with reading, Hadoop uses the OutputFormat class, which determines the structure of the output and supplies a RecordWriter. The RecordWriter writes the key-value pairs to HDFS.
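The whole read, process, write workflow can be sketched in a few lines of Python. The sketch assumes an in-memory dictionary in place of HDFS and uses hypothetical function names that only mirror the roles of InputFormat, RecordReader, and RecordWriter; none of them are Hadoop APIs.

```python
def get_splits(file_names, files_per_split):
    """InputFormat role: divide the input into logical splits."""
    return [file_names[i:i + files_per_split]
            for i in range(0, len(file_names), files_per_split)]


def record_reader(storage, split):
    """RecordReader role: present the data of one split as key-value pairs."""
    for name in split:
        yield name, storage[name]


def record_writer(storage, key, value):
    """RecordWriter role: write one key-value pair back to storage."""
    storage[key] = value


def run_map_only(storage_in, storage_out, map_fn, files_per_split=2):
    """Process every split as a separate Mapper would, then write the results."""
    for split in get_splits(sorted(storage_in), files_per_split):
        for key, value in record_reader(storage_in, split):
            out_key, out_value = map_fn(key, value)
            record_writer(storage_out, out_key, out_value)


# The dictionaries below stand in for HDFS directories (an assumption of this sketch).
fake_hdfs_in = {"a.raw": b"\x01\x02", "b.raw": b"\x03", "c.raw": b"\x04\x05\x06"}
fake_hdfs_out = {}
run_map_only(fake_hdfs_in, fake_hdfs_out, lambda name, data: (name, data[::-1]))
print(fake_hdfs_out)
```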

III. RELATED WORK

Hadoop has been used for image processing in several scientific projects. Because the Hadoop runtime has a complicated structure that is hard to extend, developers tried to process images with standard Hadoop capabilities. In most cases, images were represented by byte arrays. Each Map and Reduce function converts the byte array to an image, processes the image as required, and then converts it back to a byte array. Such an approach is not convenient, but it is usually efficient. In rare cases, approaches that are both less efficient and less convenient were used. For example, one project converted remote sensing images to text files. Each line in the files represented one pixel and contained the RGB color codes in text format. Although this approach allowed the use of Hadoop text processing capabilities, it significantly increased the size of the data.
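The size increase of the text representation is easy to estimate. The sketch below compares raw 24-bit RGB storage with a one-line-per-pixel text encoding; the exact text layout (decimal values separated by spaces, newline-terminated) is an assumption, since the source does not specify it.

```python
def binary_size(width, height):
    """Raw RGB storage: three bytes per pixel."""
    return width * height * 3


def text_size(pixels):
    """Assumed text layout: one line per pixel, decimal R G B separated by spaces."""
    return sum(len(f"{r} {g} {b}\n") for r, g, b in pixels)


pixels = [(255, 255, 255), (0, 0, 0), (12, 200, 7), (128, 64, 32)]
raw = binary_size(2, 2)
txt = text_size(pixels)
print(raw, txt, round(txt / raw, 2))
```

Under this layout a pixel costs between 6 and 12 bytes instead of 3, before any comparison with compressed formats such as JPEG or PNG.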

There are two existing systems that can be used for image processing in Hadoop: HIPI (Hadoop Image Processing Interface) [12] and OpenIMAJ (Open Intelligent Multimedia Analysis for Java). HIPI is a framework designed specifically to enable image processing in Hadoop. OpenIMAJ is a set of Java libraries for image and video analysis; some OpenIMAJ tools have Hadoop implementations.

The HIPI framework provides all necessary parts of Hadoop data workflow for image processing:

• The internal Hadoop representation of an image (FloatImage), based on a simple float array of pixels.

• The InputFormat and the OutputFormat with the corresponding RecordReader and RecordWriter.

HIPI uses the Hadoop Writable interface to serialize FloatImage. In addition, FloatImage includes a small set of image processing operations, such as conversion from color to gray-scale, resizing, and cropping. Before processing, the images must be packed into a HIPI Image Bundle, a non-standard HIPI file format similar to the Hadoop Sequence file. In addition to the images, the HIPI Image Bundle includes image descriptors, which can be used to filter images in the bundle, and an index for quick image search.

The HIPI Image Bundle provides higher MapReduce image processing performance than standard Hadoop capabilities. The main disadvantage of HIPI is its limited functionality: application developers are given only a simple array of pixels, so image processing algorithms must be implemented from scratch. Another disadvantage is the lack of interoperability: images must be packed into the HIPI Image Bundle, which cannot be read by any other system.

OpenIMAJ offers more image processing and analysis features than HIPI, including implementations of computer vision algorithms, clustering, classification, etc. Like HIPI, OpenIMAJ requires packaging image files into one large file. In contrast to HIPI, however, OpenIMAJ uses standard Hadoop Sequence files for this purpose.

Another significant difference from HIPI is the lack of a convenient Hadoop image representation. Images in OpenIMAJ are represented by byte arrays. Hence, image processing requires a conversion from byte array to image and back in each Map and Reduce function. A further disadvantage of OpenIMAJ is that it includes only ready-to-use MapReduce implementations of image processing algorithms and does not provide development tools for MapReduce.

IV. THE MIPR FRAMEWORK

[Figure: Architecture of the MIPr framework]

MIPr is our framework for MapReduce image processing, aimed at providing a simple and convenient way to process images in Hadoop. MIPr allows developers to process large numbers of images in parallel using familiar tools, without having to learn the details of distributed computing.

The MIPr framework uses image representations based on popular image processing libraries. Hence, application developers can use the existing implementations of image processing algorithms from these libraries instead of starting from scratch.

The internal details of distributed processing in Hadoop are hidden from application developers. The MIPr framework provides a simple interface for creating serial functions that process a single image or a group of related images. The image is already loaded into memory and presented in a convenient format. The framework applies these serial functions to large volumes of images in parallel using Hadoop capabilities.

A. Architecture of the MIPr Framework

The MIPr framework architecture consists of three layers (Fig. 2): core components, the Image processing API, and image processing libraries. The core components layer provides the basis for image processing in Hadoop. It includes various image representations in a Hadoop format suitable for use as values in MapReduce programs. This layer also provides the input/output tools needed to integrate image processing into the Hadoop data workflow.

The main goal of the Image processing API layer is to hide the internal details of Hadoop, MapReduce, and distributed processing from application developers. The layer includes the Image processing API for developing serial image processing functions and the MapReduce drivers for executing these functions as Hadoop jobs.

The third layer contains MapReduce implementations of image processing algorithms, which can be used for distributed image processing without any software development.

B. Internal Image Representation

Currently, the MIPr framework includes image representation formats based on Java 2D (the BufferedImage class) and OpenIMAJ (the FImage class for gray-scale images and MBFImage for color images). Hadoop Writable is used as the serialization technology. The Writable wrappers for the MIPr image formats are shown in the diagram above.

Images in the proposed formats can be used as values in MapReduce programs. Note that such images cannot be used as keys, because keys in Hadoop must implement the WritableComparable interface rather than just Writable.

C. Images Input/Output Tools

For each type of Writable image representation, dedicated InputFormat and OutputFormat implementations were developed. The InputFormat implementations (and the corresponding RecordReaders) read an image from HDFS, create the desired Writable image wrapper, and produce a key-value pair for Map input. The pair consists of Null as the key and the image in its Writable wrapper as the value. The current implementation does not allow file splitting; hence, the entire image file is read and processed by one Mapper.

The OutputFormat and RecordWriter implementations write images back to HDFS after processing. Each image is written to HDFS as a single file. In contrast to the traditional MapReduce approach, the default behavior of the OutputFormat is to preserve the file names and extensions of the original images. This information is stored in metadata fields of the Writable image representation.

In contrast to HIPI and OpenIMAJ, MIPr does not require packaging images into one large file. Although Hadoop handles large numbers of small files poorly, the problem can be solved by using CombineFileInputFormat, the standard Hadoop technology for combining several small files into one large split. The split can be read from HDFS in one operation, which is much faster than reading the small files separately. Another advantage of CombineFileInputFormat is that each Mapper processes several images instead of one. As a result, fewer Mappers are required, and consequently the overhead of starting and stopping Mappers decreases.

CombineFileInputFormat is an abstract class and requires concrete implementations. Such implementations were created in the MIPr framework for each type of Writable image representation.
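The effect of combining small files can be shown with a simplified packing routine. This is only a sketch of the idea: the greedy strategy, the function name `combine_into_splits`, and the sample sizes are assumptions, and the real CombineFileInputFormat also takes data locality into account, which is ignored here.

```python
def combine_into_splits(file_sizes, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of at most max_split_bytes.

    A file larger than the limit still gets a split of its own.
    """
    splits, current, current_bytes = [], [], 0
    for name, size in file_sizes:
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Hypothetical file sizes in arbitrary units.
files = [("img1.jpg", 40), ("img2.jpg", 50), ("img3.jpg", 30),
         ("img4.jpg", 90), ("img5.jpg", 10)]
splits = combine_into_splits(files, max_split_bytes=100)
print(len(files), "files ->", len(splits), "splits:", splits)
```

With one Mapper per split, the five files in this example need three Mappers instead of five; the saving grows with the ratio of split size to typical image size.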

D. Image Processing API

The current MIPr implementation includes a Java Image processing API. The main class of the API is ImageProcessor, with one important method, processImage. The method receives a source image in the required format (Fig. 3) as an argument and must return the processed image. The Image processing API also contains various MapReduce drivers that execute image processing functions, provided by the application developer, as a Hadoop job.

The MapReduce driver sets up the appropriate InputFormat, OutputFormat, and image representation format. After that, the driver executes the job on the Hadoop cluster.

The current MIPr implementation provides Map-only drivers for Hadoop jobs. In such jobs, only Mappers are used to process images, and each Mapper deals with a single image. To process the image, the Mapper calls the processImage method of the ImageProcessor class supplied by the application developer. The processed images generated by the Mappers are written to HDFS. Reducers are not used.
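The division of labor between the developer-supplied function and the driver can be sketched as follows. The actual MIPr API is written in Java; the Python classes below are a hypothetical analogue in which only the names `ImageProcessor` and `process_image` (for processImage) echo the text, and images are assumed to be lists of rows of 8-bit gray-scale values.

```python
class ImageProcessor:
    """Hypothetical Python analogue of the ImageProcessor class described above."""

    def process_image(self, image):
        """Receive one in-memory image and return the processed image."""
        raise NotImplementedError


class Invert(ImageProcessor):
    """Example developer-supplied processor: invert an 8-bit gray-scale image."""

    def process_image(self, image):
        return [[255 - pixel for pixel in row] for row in image]


def map_only_driver(images, processor):
    """Call process_image once per image, as each Mapper would; no Reduce step.

    File names are kept as keys so outputs retain the original names.
    """
    return {name: processor.process_image(image) for name, image in images.items()}


images = {"a.png": [[0, 255], [10, 245]], "b.png": [[100]]}
output = map_only_driver(images, Invert())
print(output)
```

The developer writes only the `Invert` class; reading, distribution, and writing stay inside the driver, which is the separation the API is designed to provide.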

V. EXPERIMENTS

A series of experiments was carried out to evaluate the scalability of the MIPr framework and to compare its performance with HIPI and OpenIMAJ.

The MIRFLICKR-1M image collection was used as the dataset in the experiments. The collection contains one million images downloaded from Flickr; its size is 118 GB.

The experiments were performed on a 6-node Hadoop cluster of the Institute of Mathematics and Mechanics UrB RAS. The cluster consists of one management node and five computing nodes with the following configuration: OS Linux CentOS 6.5, 2 CPU AMD Opteron 2218, 8 GB RAM, 500 GB SATA drive, Hadoop 2.2 (Cloudera Distribution of Hadoop 5).

A. Scalability

In order to evaluate the scalability of MIPr, four image processing operations were performed (Table I) on the MIRFLICKR-1M dataset. The images from the dataset were packed into Sequence files before processing. Two series of experiments were conducted. The purpose of the first series was to evaluate how the MIPr framework scales with an increasing number of nodes in the cluster. The entire MIRFLICKR-1M dataset was processed on the Hadoop cluster using various numbers of nodes. The results are presented in Fig. 4a. The second series was aimed at estimating the scalability of MIPr with a growing volume of images. During this series, the entire Hadoop cluster was used to process various numbers of images. The results are shown in Fig. 4b (logarithmic scale).

As can be seen from Fig. 4, the MIPr framework scales near-linearly both with the number of nodes in the cluster and with the number of images to process. The reason for the good scalability is that each image is processed independently.

B. Performance Comparison of MIPr, HIPI, and OpenIMAJ

To compare the performance of MIPr with HIPI and OpenIMAJ, all images from the MIRFLICKR-1M dataset were converted from color to gray-scale format using the entire Hadoop cluster. This operation was chosen because the developers of HIPI used it to evaluate the performance of the HIPI Image Bundle; hence, an existing implementation could be used for HIPI. Unfortunately, that implementation does not write the converted gray-scale images to HDFS, because HIPI can read and write only color images. Therefore, to obtain results suitable for comparison, writing of the original color images to HDFS was added to the existing HIPI implementation.
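For reference, a color to gray-scale conversion of the kind used as the benchmark operation can be sketched as below. The text does not state which formula the compared systems use; the ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are an assumption of this sketch, as is the representation of an image as rows of (r, g, b) tuples.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 8-bit luma values.

    Uses the BT.601 weights, an assumption; other conversions exist.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


image = [[(255, 255, 255), (0, 0, 0)],
         [(255, 0, 0), (0, 0, 255)]]
gray = to_gray(image)
print(gray)
```

Because each output pixel depends only on the corresponding input pixel, the operation needs no communication between images and fits a Map-only job.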

The performance measurements are presented in Table II. The time needed to create the Sequence files or the HIPI Image Bundles and to copy the images to HDFS is not taken into account.

HIPI has the best performance owing to its special image file format, the HIPI Image Bundle. The performance of OpenIMAJ and MIPr is almost the same when Sequence files are used to store the images. In addition to Sequence files, MIPr was also used to process images stored as separate small files. However, processing a large number of small images, even with CombineFileInputFormat, still requires more time than processing one large Sequence file.

VI. CONCLUSION

This paper presented the MIPr framework for distributed image processing using Hadoop. The framework extends Hadoop by making it possible to use images in MapReduce programs.

The images in the MIPr framework are represented using popular image processing libraries (currently, Java 2D and OpenIMAJ). This makes it possible to implement image processing algorithms quickly with the existing libraries instead of developing the algorithms from the ground up.

The Image processing API and library hide the complexity of Hadoop from the application developers and allow them to use distributed image processing without prior MapReduce knowledge.

In contrast to HIPI and OpenIMAJ, the MIPr framework is able to process images not only in one large file, but also in many small files. The performance of small-file processing is significantly improved with the help of CombineFileInputFormat.

The experiments with the MIPr framework demonstrated its scalability and good performance.

Future work includes development of Python and C/C++ Image processing APIs, creation of an OpenCV-based Hadoop image representation, and extension of the image processing library.

REFERENCES

1. J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492

2. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The google file system,” SIGOPS Oper. Syst. Rev., vol. 37, no. 5, pp. 29–43, Oct. 2003. [Online]. Available: http://doi.acm.org/10.1145/1165389.945450

3. Apache hadoop. [Online]. Available: https://hadoop.apache.org

4. D. Moise, D. Shestakov, G. Gudmundsson, and L. Amsaleg, “Indexing and searching 100m images with map-reduce,” in Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ser. ICMR ’13. New York, NY, USA: ACM, 2013, pp. 17–24. [Online]. Available: http://doi.acm.org/10.1145/2461466.2461470

5. K. Wiley, A. J. Connolly, J. P. Gardner, K. S. Krughoff, M. Balazinska, B. Howe, Y. Kwon, and Y. Bu, “Astronomy in the cloud: Using mapreduce for image coaddition,” CoRR, vol. abs/1010.1015, 2010.

6. M. H. Almeer, “Cloud hadoop map reduce for remote sensing image analysis,” Journal of Emerging Trends in Computing and Information Sciences, vol. 3, no. 4, April 2012. [Online]. Available: http://www.cisjournal.org/journalofcomputing/archive/vol3no4/vol3no4_23.pdf

7. A. Cary, Z. Sun, V. Hristidis, and N. Rishe, “Experiences on processing spatial data with mapreduce,” in Proceedings of the 21st International Conference on Scientific and Statistical Database Management, ser. SSDBM 2009. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 302–319. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-02279-1_24

8. Z. Lv, Y. Hu, H. Zhong, J. Wu, B. Li, and H. Zhao, “Parallel k-means clustering of remote sensing images based on mapreduce,” in Proceedings of the 2010 International Conference on Web Information Systems and Mining, ser. WISM’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 162–170. [Online]. Available: http://dl.acm.org/citation.cfm?id=1927661.1927687

9. B. White, T. Yeh, J. Lin, and L. Davis, “Web-scale computer vision using mapreduce for multimedia data mining,” in Proceedings of the Tenth International Workshop on Multimedia Data Mining, ser. MDMKDD ’10. New York, NY, USA: ACM, 2010, pp. 9:1–9:10. [Online]. Available: http://doi.acm.org/10.1145/1814245.1814254

10. D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the International Conference on Computer Vision - Volume 2, ser. ICCV ’99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 1150–. [Online]. Available: http://dl.acm.org/citation.cfm?id=850924.851523

11. K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,” in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, May 2010, pp. 1–10.

12. C. Sweeney, L. Liu, S. Arietta, and J. Lawrence, “Hipi: A hadoop image processing interface for image-based mapreduce tasks,” B.S. Thesis, University of Virginia, 2011. [Online]. Available: http://cs.ucsb.edu/~cmsweeney/papers/undergrad_thesis.pdf