Numerical Calculations for Geophysics Inversion Problem Using Apache Hadoop Technology


Łukasz Krauzowicz, Kamil Szostek, Maciej Dwornik, Paweł Oleksik, and Adam Piórkowski

Department of Geoinformatics and Applied Computer Science, AGH University of Science and Technology, Cracow, Poland {szostek,oleksik,pioro}@agh.edu.pl, [email protected]

http://www.agh.edu.pl

Abstract. This article considers the problem of time-consuming calculations. Computational problems of this type arise in many areas of the earth sciences. Distributed computing allows such calculations to be performed in a reasonable time, but this way of processing requires a cluster architecture.

The authors propose using Apache Hadoop technology to solve a geophysics inversion problem. Hadoop is designed primarily for data analysis, but it can also be used to perform computations. An architecture for the solution is proposed, and real tests were carried out to determine the performance of the method.

Key words: parallel computing, distributed computing, cluster, numerical computing

1 Introduction

Earth sciences, and especially geophysics, are an area of intense research. This research involves a great deal of modeling and data analysis, tasks that require high computational power. A single computer is often unable to perform these calculations in a reasonable time. In such cases it is necessary to use clusters.

The problem of ground vibration modeling is presented in [1]. The authors used a very large model. The numerical calculations were time-consuming and were therefore carried out in parallel on an efficient computer cluster.

Seismic wave field modeling is the topic of [2,3]. This is another time-consuming problem in geophysics. It can reveal the nature of an analyzed wave phenomenon. This modeling is often part of complex and extremely time-consuming methods with an almost unlimited demand for computational resources; such computations are therefore usually carried out in academic centers, especially with support from oil and gas companies. Both a GPU-PC cluster and a cluster based on component technologies were tested.


Another geophysical phenomenon is the geothermal field [4,5]. Heat transfer modeling is very important in solving physical problems in Earth science, such as volcanoes, intrusions, earthquakes, mountain building or metamorphism. This kind of calculation requires computational power that exceeds the capabilities of a single PC, so a high-performance cluster can be used. A solution based on component technologies was developed, but it was not fault-tolerant and did not support load balancing.

2 Inverse Problem for Vertical Transverse Isotropy Geological Medium

Knowledge of the velocity distribution in a geological medium is one of the most important things in mining exploration. The process of reconstructing the velocity distribution is called an inverse problem. Solving an inverse problem, especially in an anisotropic medium, is difficult because of the non-linear relationship between the distribution of the elastic parameter values and the received travel times of wave propagation. This relationship means that deterministic inversion methods are of little use. One way to obtain the velocity distribution is stochastic inversion. In this paper the Monte Carlo method was used to obtain the parameter values.

Stochastic inversion is based on generating a huge set of models, calculating the theoretical travel times for each seismic ray, and evaluating each solution. To compare the estimated travel times with the received ones over all N seismic rays, the following equation was used:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left| T_i^{est} - T_i^{rec} \right| \tag{1}$$

2.1 Forward Problem

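Travel times were estimated using the Shortest Time Method (e.g. [6], [7], [8]). In this method the geological medium is divided into several velocity cells described by the Thomsen parameters [9]:

$$v_{P0} = \sqrt{\frac{c_{33}}{\rho}} \tag{2a}$$

$$v_{S0} = \sqrt{\frac{c_{44}}{\rho}} \tag{2b}$$

$$\gamma \equiv \frac{c_{66} - c_{44}}{2 c_{44}} \tag{2c}$$

$$\varepsilon \equiv \frac{c_{11} - c_{33}}{2 c_{33}} \tag{2d}$$

$$\delta \equiv \frac{(c_{13} + c_{44})^2 - (c_{33} - c_{44})^2}{2 c_{33} (c_{33} - c_{44})} \tag{2e}$$

where $c_{11}$, $c_{13}$, $c_{33}$, $c_{44}$, $c_{66}$ are coefficients of the stiffness tensor and $\rho$ is the density. Using these parameters, the velocity of a seismic wave in direction $\theta$ in a vertical transverse isotropy medium is described by the following equations [9]: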

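$$v_P(\theta) \approx v_{P0} \left[ 1 + \delta \sin^2\theta \cos^2\theta + \varepsilon \sin^4\theta \right] \tag{3a}$$

$$v_{SV}(\theta) \approx v_{S0} \left[ 1 + \left( \frac{v_{P0}}{v_{S0}} \right)^2 (\varepsilon - \delta) \sin^2\theta \cos^2\theta \right] \tag{3b}$$

$$v_{SH}(\theta) = v_{S0} \left[ 1 + \gamma \sin^2\theta \right] \tag{3c}$$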
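The Thomsen parameters and the three velocity formulas can be sketched in code as follows. This is a minimal illustration in Python, not the authors' implementation: the function names are hypothetical, θ is assumed to be given in radians and measured from the vertical symmetry axis, and δ is taken in Thomsen's form [9], with the difference of the two squared terms in the numerator.

```python
import math


def thomsen_parameters(c11, c13, c33, c44, c66, rho):
    """Return (vP0, vS0, gamma, epsilon, delta) from stiffness coefficients and density."""
    vp0 = math.sqrt(c33 / rho)
    vs0 = math.sqrt(c44 / rho)
    gamma = (c66 - c44) / (2.0 * c44)
    epsilon = (c11 - c33) / (2.0 * c33)
    delta = ((c13 + c44) ** 2 - (c33 - c44) ** 2) / (2.0 * c33 * (c33 - c44))
    return vp0, vs0, gamma, epsilon, delta


def v_p(theta, vp0, epsilon, delta):
    """Approximate P-wave velocity in direction theta (weak anisotropy)."""
    s, c = math.sin(theta), math.cos(theta)
    return vp0 * (1.0 + delta * s ** 2 * c ** 2 + epsilon * s ** 4)


def v_sv(theta, vp0, vs0, epsilon, delta):
    """Approximate SV-wave velocity in direction theta (weak anisotropy)."""
    s, c = math.sin(theta), math.cos(theta)
    return vs0 * (1.0 + (vp0 / vs0) ** 2 * (epsilon - delta) * s ** 2 * c ** 2)


def v_sh(theta, vs0, gamma):
    """SH-wave velocity in direction theta."""
    return vs0 * (1.0 + gamma * math.sin(theta) ** 2)
```

For an isotropic medium (c11 = c33, c44 = c66, c13 = c33 - 2·c44) all three anisotropy parameters vanish and the velocities no longer depend on θ, which gives a quick sanity check of such a routine.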

2.2 Seismic Inversion

Crosswell tomography is a method for reconstructing the distribution of Thomsen parameters. A seismic wave is generated in one well and received by geophones in a neighbouring well. To reconstruct the velocity field, 31 shot points and 76 receiver points were used. In this case only the P-wave was used for Thomsen parameter reconstruction, so it was impossible to obtain the γ and vS0 values. The geological medium was divided into 24 velocity cells. With three parameters per cell (vP0, ε and δ), that gives 72 independent values to obtain.

A parameter space of this dimension rules out regular sampling. The typical method is to generate a huge number of models and keep only a few of the best solutions. This method gives no information about under-sampled areas or about promising areas (regions with a small error value that could be sampled more densely). The first problem will not be discussed in this work. The second problem, on the other hand, can be partially solved by sampling the space near the best solutions.
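The sampling strategy described above can be sketched as follows. This is only an illustrative Python outline under stated assumptions: `forward` is a hypothetical stand-in for the travel-time calculation (the Shortest Time Method in this paper), `bounds` is an assumed list of (low, high) limits for each of the unknown values, and the uniform sampling and the `radius` parameter are choices made for the example rather than details taken from the authors' code.

```python
import heapq
import random


def misfit(t_est, t_rec):
    """Mean absolute difference between estimated and received travel times (Eq. 1)."""
    return sum(abs(e - r) for e, r in zip(t_est, t_rec)) / len(t_rec)


def monte_carlo_search(forward, t_rec, bounds, n_models, n_best, rng):
    """Draw n_models uniformly inside bounds and keep the n_best lowest-misfit models."""
    heap = []  # min-heap on (-misfit, counter, model): heap[0] is the worst kept model
    for k in range(n_models):
        model = [rng.uniform(lo, hi) for lo, hi in bounds]
        err = misfit(forward(model), t_rec)
        if len(heap) < n_best:
            heapq.heappush(heap, (-err, k, model))
        elif -err > heap[0][0]:  # better than the worst model kept so far
            heapq.heapreplace(heap, (-err, k, model))
    return sorted((-neg_err, model) for neg_err, _, model in heap)


def refine_near_best(forward, t_rec, bounds, best, n_per_model, radius, rng):
    """Resample around each kept model; radius is a fraction of each parameter range."""
    candidates = list(best)
    for _, centre in best:
        for _ in range(n_per_model):
            model = [min(hi, max(lo, c + rng.uniform(-radius, radius) * (hi - lo)))
                     for c, (lo, hi) in zip(centre, bounds)]
            candidates.append((misfit(forward(model), t_rec), model))
    return sorted(candidates)[:len(best)]
```

Keeping only a small heap of the best models means memory use does not grow with the number of generated models, and the refinement step can never return a worse best model than the one it started from, because the original candidates stay in the pool.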

3 The Apache Hadoop

The Apache™ Hadoop™ technology provides a framework for distributed processing of data sets. It is meant to be run on clusters of computers, from single servers up to thousands of nodes, and to process millions of megabytes. The library is designed to be resilient to hardware failures: it implements malfunction detection at the application layer [10,11]. Apache Hadoop technology is widely used in cloud computing systems designed to process large amounts of heterogeneous data in a distributed environment [12].

Apache Hadoop consists of three main subprojects:

– Hadoop Common, which supports the other subprojects,

– Hadoop Distributed File System (HDFS), a distributed file system providing high-throughput data access,

– Hadoop MapReduce, a framework for processing large data sets on clusters.

Hadoop technology was invented to process large amounts of data. In 2009 this framework won the one-minute sort benchmark: 500 GB was sorted in 59 seconds on 1406 nodes. A 100-terabyte sort was then performed in 173 minutes on 3400 nodes [13]. Many companies and organizations use Apache Hadoop for research and production, e.g. Amazon, Yahoo!, Google, Facebook and more [14].


3.1 Hadoop Distributed File System and MapReduce

Apache Hadoop technology takes advantage of the MapReduce programming model. It allows users to write their own Map and Reduce tasks in Java or, by means of wrappers, in other programming languages such as C++, Ruby or Python. As Apache Hadoop itself is written in Java, using this language for MapReduce tasks is the fastest option, because otherwise Hadoop Streaming or Hadoop Pipes has to be used; e.g. for C++ these wrappers execute the C++ Map or Reduce class and communicate with it through sockets (Fig. 1).

The second major strength of Hadoop technology is the highly fault-tolerant HDFS, designed to cope with large amounts of data. As the data is distributed across all cluster nodes in HDFS, MapReduce can easily process it in parallel.
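The MapReduce programming model itself can be illustrated without a cluster. The following Python sketch runs the map, group-by-key and reduce phases inside a single process; it does not use the Hadoop API, and `run_mapreduce`, `word_mapper` and `count_reducer` are hypothetical names chosen for this example. In Hadoop the map tasks would run on the nodes that hold the corresponding HDFS blocks and the grouping step would move data across the network.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map phase: emit (key, value) pairs
            groups[key].append(value)      # shuffle phase: collect values per key
    # reduce phase: one reducer call per distinct key, in sorted key order
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)
```

Calling `run_mapreduce(["a b a", "b c"], word_mapper, count_reducer)` counts the words in the two lines. Because each mapper call sees one record and each reducer call sees one key, both phases can be split across machines without changing the user code.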

[Figure: job submission diagram. Components: client node (client JVM running the MapReduce program and Job Client), Job Tracker node, Task Tracker node (Task Tracker, child JVM running the Map or Reduce task, C++ wrapper library connected through a socket to the C++ Map or Reduce class, exchanging input and output key/values), and the shared file system (HDFS). Labelled steps: 1: run job, 2: get new job ID, 3: copy job resources, 4: submit job, 5: initialize job, 6: retrieve input splits, 7: heartbeat (returns task), 8: retrieve job resources, 9: launch.]

Figure 1. Conventional job submission diagram in Apache Hadoop framework with C++ wrapper.

3.2 Apache Hadoop Job Workflow

Figure 1 shows the default Apache Hadoop job workflow:

1. The client user runs the program. In this paper the Job Tracker node, which is the main node, is used for this purpose.

2. Client receives the job's ID from Job Tracker.

3. Client copies the job resources to HDFS.

4. Client confirms the job.

5. Job Tracker initiates the job.

6. Job Tracker receives the input data from HDFS, splits it into parts and sends them back to HDFS.

7. Task Tracker periodically sends a heartbeat signal to the Job Tracker to inform it about its presence and readiness to work.

8. All working Task Trackers receive the split data from HDFS.

9. Task Trackers start the task.

10. The job is executed in the child JVM. If a C++ program was submitted, a wrapper is used to run the C++ MapReduce libraries, and it communicates with them through a socket. Otherwise Java classes are used directly.

11. Task Trackers return the results to HDFS.

4 Tests

The main idea behind using Apache Hadoop is to search for promising regions by analyzing the data produced by seismic modeling with the Monte Carlo approach. To take advantage of the Apache Hadoop results, they should be sent back to the seismic modeling algorithm and tested in order to produce more precise Thomsen parameters. The process should be repeated until satisfactory results are achieved.

The algorithm used to process this data was written in both Java and C++; Apache Hadoop executes the latter using wrappers, as mentioned before. The algorithm itself is not complicated, because most of the complex work is handled by the Apache Hadoop framework. This significantly reduces the time needed to write the code.

The algorithm is performed in two stages. First, in the Mapper stage, it gathers parameter sets with a similar comparison estimator and produces key-value tuples, where the key is the rounded estimator and the value is the set of parameters. Next, in the Reducer stage, all tuples with the same key are processed together: for each parameter, the average value over these sets is calculated and emitted as output. The results of this approach are sufficient, but it may be extended in the future to give more accurate results.
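The two stages can be sketched as follows. This is a single-process Python illustration, not the authors' Java or C++ code: the record layout (one whitespace-separated line holding the estimator followed by the parameter values), the rounding precision `decimals` and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_record(line, decimals=1):
    """Mapper: emit (rounded estimator, parameter list) for one input record."""
    fields = [float(x) for x in line.split()]
    estimator, params = fields[0], fields[1:]
    return round(estimator, decimals), params


def reduce_group(param_sets):
    """Reducer: average each parameter position over all sets sharing one key."""
    n = len(param_sets)
    return [sum(column) / n for column in zip(*param_sets)]


def analyse(lines, decimals=1):
    """Group records by rounded estimator and average the parameters in each group."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():  # skip empty lines
            key, params = map_record(line, decimals)
            groups[key].append(params)
    return {key: reduce_group(sets) for key, sets in sorted(groups.items())}
```

With the three short records `"0.52 1.0 2.0 3.0"`, `"0.49 3.0 4.0 5.0"` and `"1.23 0.0 0.0 6.0"`, the first two share the key 0.5 and are averaged together, while the third forms its own group. The rounding precision controls how coarse the groups are, so it trades the number of output records against the resolution of the promising regions.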

Tests were run on clusters of 1, 2, 4 and 8 nodes. Each node was a low-cost PC running openSUSE 11.4 (Linux kernel 2.6.37), with an Intel(R) Pentium(R) 4 2.8 GHz CPU, 1 GB RAM and a 100 Mbit/s Ethernet connection. The nodes were connected in a star network using a 100 Mbit/s switch. The Job Tracker also acted as a Task Tracker and was exactly the same kind of machine as all the other Task Trackers (Fig. 2).

The input data was collected in 8 files, 1.5 GB in total, whose records consist of the comparison estimator and 72 values. Every test was repeated 30 times, measuring the times of three stages: copying the input data to HDFS, performing the calculations, and copying the results back from HDFS.

[Figure: star network in which a switch connects the Job Tracker and the nodes / Task Trackers.]

Figure 2. The network configuration used in the tests. The number of nodes varied from 1 to 8.

5 Results

Tests show that increasing the number of nodes affects the computation time, but the results are not impressive (Fig. 4). The time required to distribute the data to all nodes increases with the number of nodes, while the computation time decreases only slightly. Moreover, the time needed to copy the data back from HDFS is negligible: because the results are small, it takes less than 1% of the whole processing time. For more accurate results the Reducer task should generate more data, which might significantly increase the processing and copying time, but on the other hand would take more advantage of Apache Hadoop's features.

6 Conclusion and Future Work

In this paper the Apache Hadoop technology was used to perform geophysical numerical analysis. As this technology is aimed at processing large data sets, an algorithm for reconstructing the distribution of Thomsen parameters was implemented and tested. The main advantages of using Apache Hadoop technology appear to be high scalability and the ease of writing MapReduce code. Unfortunately, to benefit fully from this framework it is necessary to use a larger number of more powerful computers as well as a faster network.

[Figure: stacked bar chart of time [s] (0 to 800) for 2, 4 and 8 nodes at input sizes of 182 MB, 364 MB, 728 MB and 1456 MB; series: copying to HDFS, calculations, copying from HDFS.]

Figure 3. Times of data analysis for different numbers of nodes.

[Figure 4: bar chart of time [s] (0 to 700) for 2, 4 and 8 nodes; series: C++, Java.]

Future work will focus on increasing the number of nodes and the amount of data: the inverse problem will be extended to more accurate models, which in consequence produce more data and are therefore better suited to the Apache Hadoop framework. Moreover, to increase the speed and accuracy of the inverse problem, the forward solution will be implemented as part of the MapReduce class. This will make it more convenient to use the results of MapReduce tasks in the next Thomsen parameter estimation. As the small-file problem might be significant for the analysis speed in the presented configuration, certain optimizations should be considered in future work [15]. Furthermore, the next tests should be performed using a slightly different configuration. As mentioned in the Apache Hadoop documentation, when the network consists of more than four nodes it is better to set up the Job Tracker and the Name Node on separate machines.

Acknowledgments. The study was financed in part by the statutory research project No 11.11.140.561 of the Department of Geoinformatics and Applied Computer Science, AGH UST, and by grant No. N N525 256040 from the Ministry of Science and Higher Education.

This work was co-financed by the AGH University of Science and Technology, Faculty of Geology, Geophysics and Environmental Protection, Department of Geoinformatics and Applied Computer Science, as a part of a statutory project.

References

1. Pięta, A., Danek, T., Leśniak, A.: Numerical modeling of ground vibration caused by underground tremors in the LGOM mining area. Gospodarka Surowcami Mineralnymi - Mineral Resources Management, Vol. 25, No. 3, pp. 261-271 (2009).

2. Danek, T.: Parallel and distributed seismic wave field modeling with combined Linux clusters and graphics processing units. IEEE International Symposium on Geoscience and Remote Sensing IGARSS, pp. 2588-2591 (2009).

3. Kowal, A., Piórkowski, A., Danek, T., Pięta, A.: Analysis of selected component technologies efficiency for parallel and distributed seismic wave field modeling. Proceedings of the 2008 International Conference on Systems, Computing Sciences and Software Engineering (SCSS), part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering, CISSE 2008, Bridgeport, Connecticut, USA. In: Innovations and Advances in Computer Sciences and Engineering, Springer, pp. 359-362 (2010).

4. Piórkowski, A., Pięta, A., Kowal, A., Danek, T.: The Performance of Geothermal Field Modeling in Distributed Component Environment. Proceedings of the 2009 International Conference on Systems, Computing Sciences and Software Engineering (SCSS), part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 09), Bridgeport, Connecticut, December 4-12, 2009. In: Sobh, Tarek (ed.) et al., Innovations in Computing Sciences and Software Engineering. Springer, pp. 279-283 (2010).

5. Kowal, A., Piórkowski, A., Pięta, A., Danek, T.: Efficiency of selected component technologies for parallel and distributed heat transfer modeling. Mineralia Slovaca, ISSN 0369-2086, Vol. 41, No. 3, suppl. Geovestnik, p. 361 (2009).

6. Moser, T.J.: Shortest path calculation of seismic rays. Geophysics, 56, pp. 59-67 (1991).

7. Fischer, R., Lees, J.L.: Shortest path ray tracing with sparse graph. Geophysics, 58, pp. 987-996 (1993).

8. Dwornik, M., Pięta, A.: Efficient algorithm for 3D ray tracing in 3D anisotropic medium. 71st EAGE Conference & Exhibition incorporating SPE EUROPEC 2009, Extended Abstracts, Amsterdam, Holland, P138 (2009).

9. Thomsen, L.: Weak elastic anisotropy. Geophysics, 51, pp. 1954-1966 (1986).

10. http://hadoop.apache.org/

11. White, T.: Hadoop: The Definitive Guide, Second Edition. O'Reilly Media, ISBN 978-1-449-38973-4 (2010).

12. Kim, H., Kim, W., Lee, K., Kim, Y.: A Data Processing Framework for Cloud Environment Based on Hadoop and Grid Middleware. In: Grid and Distributed Computing, CCIS, vol. 261, pp. 515-524. Springer, Heidelberg (2011).

13. Sort Benchmark Home Page, http://sortbenchmark.org/

14. http://wiki.apache.org/hadoop/PoweredBy#G

15. Mohandas, N., Thampi, S. M.: Improving Hadoop Performance in Handling Small Files. In: Advances in Computing and Communications, CCIS, vol. 193, pp. 187-194. Springer, Heidelberg (2011).