Cloud-Based Data Analytic and Visualization
Framework for Dengue Fever Analytics
Sifei Lu, Xiaorong Li, Long Wang, Zhaoxia Wang, Henry Palit, Henry Kasim, Terence Hung
Institute of High Performance Computing
A*STAR Institute, Singapore
{lus, lixr, wangl, wangz, henry, kasimh, terence}@ihpc.a-star.edu.sg
Abstract—Cloud computing is a new paradigm that is changing the platform infrastructure and services provided by enterprises, and it offers a new architecture for supporting Complex computational and Data-enabled Science and Engineering (CDS&E) applications. Users face integration challenges when developing and deploying such applications in current commercial and private clouds. In this paper, we present a practical, effective, user-friendly, high-performance cloud analytics framework that supports CDS&E applications by combining dynamic cloud resource provisioning, adaptive workflow scheduling and management, data analytics and harmonization, and result visualization. A dengue fever incidence analytics application is used as a use case to explore the auto-scaling of the workflow management engine, and to demonstrate the flexibility and simplicity with which users can conduct time-series data analysis, spatio-temporal data analysis, data harmonization and visualization in public and private clouds.
Keywords-Cloud Computing; workflow scheduling; time-series data analysis; spatio-temporal data analysis; visualization
I. INTRODUCTION
Cloud computing is a new computing paradigm that lets enterprises transform their business models and enhance their capability to deliver new products and services quickly and economically. CIOs are moving IT infrastructure from server virtualization and private clouds to hybrid clouds. Increasing demand also drives cloud service providers, e.g. Amazon, Google, Microsoft and Salesforce, to invest heavily in delivering new technologies and services frequently. Meanwhile, researchers are exploring ways to support real-world science and engineering applications in a hybrid cloud, in order to gain in-depth understanding of and insight into complex problems [1][2].
An integrated environment is expected to let a user extract data from the Internet, save historical data, define complicated processing, run analysis tasks using a variety of analytic tools, and present the visualized results. A workflow management system helps to get data, define a suitable workflow, obtain resources from private and public clouds, schedule task execution on different instances to meet task dependencies and specific requirements, and manage failures in task execution. One challenge is allocating the right number and the right type of instances at run time under cost, security and QoS constraints. Another challenge is that complex data analytics may need a system that can simultaneously support a variety of data analytics tools, such as Rapidminer [4], R [5] and Hadoop [6]. The difficulty of performing data harmonization and visualization on both the server side and the client side also limits users' development of real-world applications.
Dengue fever is a common mosquito-borne disease caused by infection with the dengue virus. According to WHO data [7], the global incidence of dengue has grown dramatically in recent decades. It is prevalent in urban and semi-urban areas of tropical regions, since warm weather can shorten the life cycle of the Aedes mosquito. There is no specific treatment for dengue fever, and doctors can only provide supportive care to patients. In Singapore, the National Environment Agency (NEA) maintains a website that provides advice on preventing Aedes mosquito breeding, related educational materials and the latest dengue cases with their locations [8]. Several studies have shown that dengue cases are linked to weather variables, such as temperature and humidity [9]. Hence data analytics is needed to turn information into insight and to help guide actions.
In this paper, we present a practical framework comprising dynamic cloud resource provisioning, adaptive workflow scheduling and management, data analytics and harmonization, and result visualization. It integrates common data analytics tools: Rapidminer, R and Hadoop. A case study on dengue fever analysis and prediction is used to evaluate the scalability of the workflow management module, and the flexibility and simplicity with which users can conduct spatio-temporal data analysis, data harmonization and visualization in public and private clouds.
II. RELATED WORK
Several workflow management systems have been developed to handle scientific applications in hybrid clouds. A Workflow Engine has been introduced that integrates with Amazon EC2, Aneka Cloud and local clusters to provide workflow design, workflow scheduling, fault tolerance and data movement for scientific applications [10][11][12]. The work in [10] reports an order-of-magnitude improvement in run time when large, paid cloud resources are used. CometCloud is used as the core engine of a framework for autonomic management of scientific workflow applications on hybrid clouds [14][15]. Its architecture supports autonomic cloud bursting and autonomic cloud bridging, as well as Hadoop/MapReduce.
An extended hybrid workflow management framework was developed to integrate with Rapidminer and Rapidanalytics for multiscale climate data analytics and visualization [2][4], bridging the gap between scientific workflow management systems and commonly used open source data analytics tools. Radoop also integrates Rapidminer with Hadoop [16][17]. Although it can make use of the Rapidminer GUI designer, the full set of operators and the visualization tools, it does not support cloud environments.
The delayed effects of weather variables on the incidence of dengue fever in Singapore have been investigated in [9], and the results show strong correlations between dengue incidence and weather variables such as temperature and humidity. In this paper, we extend that work to process Geo-spatial data (e.g. rainfall data) related to dengue incidence research, and to provide common R and Hadoop analysis support.
OpenGeo Suite is an open source tool set that supports the spatial relational database PostgreSQL, provides map and feature services using GeoServer, and publishes data using the OpenLayers and GeoExt JavaScript libraries [18]. It can be used for Geo-spatial data processing and visualization.
III. CASE STUDY ON SPATIAL-TEMPORAL ANALYTICS
A case study of a dengue fever application is used to demonstrate how the cloud workflow framework manages data processing, analysis and visualization. The desired outcome is to support monitoring and prediction of weather and dengue cases, and to guide community responses for dengue fever prevention and control.
A. Objectives and motivations
Rapid advances in technology have made a wide variety of tools available for big data analytics. Our objective is to provide an integrated framework for big data analytics in a fast and cost-effective way. More specifically, the objectives are:
• To provide an integrated framework in a hybrid cloud platform with workflow management
• To integrate with common open source data analytics tools, such as Rapidminer, R and Hadoop.
• To explore interactive visualization tools for Geo-spatial data
• To evaluate the concept with a real world showcase.
• To discover insight knowledge and provide suggestions for dengue fever prevention and control.
B. Requirement of hybrid cloud spatio-temporal analytics
There is still a gap between data analytics in local servers and cloud based data analytics which leverages the cloud resources to execute cloud analytics tasks. We considered the following requirements for hybrid cloud spatio-temporal data analytics.
• Able to capture, extract, transform various data (e.g. climate data and dengue cases data) from different sources using various tools in one platform
• Able to develop different data processing and data analytics tasks using such tools
• Able to harmonize data and visualize the result using geo-spatial enabled database and map service.
• Able to use a workflow management system to orchestrate dynamic computing resource allocation, data movement, task scheduling and execution, fault tolerance and security awareness in the cloud.
• Able to build a private cloud platform using virtualization technology.
• Able to easily change the whole tool set to solve other related data analytics problems.
C. Data sets
The data used in the following showcase come from different sources, in a variety of formats and volumes.
1) Dengue fever data
The weekly dengue data are retrieved from the Communicable Diseases Division of the Ministry of Health Singapore (MOH) and from the weekly epidemiological publication of MOH, while the latest locations of active clusters are captured from the Campaign Against Dengue web site [8].
2) Global summary of day data
The daily weather data are downloaded from the National Climatic Data Center's Climate Data Online service [19]. Daily mean temperature and mean dew point from the Changi meteorological station are used to calculate relative humidity.
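The conversion from temperature and dew point to relative humidity can be sketched as follows. The paper does not state which approximation was used, so the Magnus formula and its coefficients below are an assumption, and the function names are illustrative only.

```python
import math

# Magnus coefficients (Alduchov and Eskridge, 1996). This choice is an
# assumption: the paper does not say which approximation was applied.
MAGNUS_A = 17.625
MAGNUS_B = 243.04  # degrees Celsius


def saturation_vapour_pressure(temp_c: float) -> float:
    """Approximate saturation vapour pressure (hPa) at temp_c degrees Celsius."""
    return 6.1094 * math.exp(MAGNUS_A * temp_c / (MAGNUS_B + temp_c))


def relative_humidity(temp_c: float, dew_point_c: float) -> float:
    """Relative humidity (%) from daily mean temperature and mean dew point."""
    actual = saturation_vapour_pressure(dew_point_c)
    saturated = saturation_vapour_pressure(temp_c)
    return 100.0 * actual / saturated


def weekly_mean(daily_values: list[float]) -> float:
    """Average of the daily values belonging to one week."""
    return sum(daily_values) / len(daily_values)
```

Daily humidity values computed this way could then be averaged per week with `weekly_mean` to obtain a weekly mean humidity series.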
Figure 1. Weekly dengue cases, mean temperature and humidity.
Fig 1. shows the weekly dengue cases, weekly mean temperature and weekly mean humidity in the years 2000-2010. The unit of mean temperature is degrees Celsius (°C), while mean relative humidity is expressed as a percentage (%).
3) Simple relationship of dengue cases and weather variables
Fig 2. shows the simple relationships between dengue cases and weather variables in 2000-2010, generated with R. It shows that mean temperature and mean relative humidity are linearly correlated.
Figure 2. Simple relationship among dengue cases, mean temperature and humidity.
4) Rainfall images
70 km rainfall radar images are collected from the National Environment Agency (NEA), Singapore. Fig 3 shows the rainfall information on 30 June 2011 at 08:10 am. An image is captured every 5 minutes, giving about 16 GB of data for 3 years.
Figure 3. Radar cloud rainfall image.
D. Data analysis and visualization methods
A few analysis and processing tools are used in the showcase. They are introduced below.
1) Time lag correlation coefficient
Because dengue incidence is related to temperature and humidity with a time lag, a time lag (τ) is added to the standard Spearman's rank correlation coefficient (SRCC), as given in equation (1).
In equation (1), N is the length of the selected sliding window in weeks, x(t) is the number of weekly dengue cases, and y(t) is the weekly temperature or weekly relative humidity; x̄ and ȳ are the mean values of x(t) and y(t) respectively.
2) Rainfall data process
From the radar color image, several processing steps and tools are applied to extract the maximum, minimum or average rainfall for a specific area in Singapore:
• Filtering out the ocean and landscape colors and changing the image colors accordingly for better raster processing
• Assigning geographical coordinates to the picture
• Using the tool raster2pgsql [20] to convert the raster into a pgsql command script
• Uploading the data to a local PostgreSQL database
• Transferring raster records to geography records
• Clustering the rainfall values
• Integrating with the data table
• Exporting data for consolidation and harmonization
3) Data visualization
In the application, the high-level dengue case visualization tasks are as follows. The dengue case data are verified, tagged with longitude and latitude, and uploaded to the PostgreSQL database for visualization. Then a presentation layer is built for each week of dengue cases. Finally, the result is published on a Tomcat web site, overlaid with temperature data.
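The grouping of geocoded cases into one presentation layer per week can be illustrated with a short sketch. It emits GeoJSON instead of writing to the PostgreSQL tables used in the paper, purely to stay self-contained; the record keys `date`, `lon` and `lat` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def cases_to_weekly_layers(cases):
    """Group geocoded cases by ISO (year, week) into GeoJSON FeatureCollections.

    `cases` is an iterable of dicts with the hypothetical keys 'date'
    (a datetime.date), 'lon' and 'lat' (decimal degrees, WGS84).
    """
    layers = defaultdict(lambda: {"type": "FeatureCollection", "features": []})
    for case in cases:
        iso = case["date"].isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        layers[key]["features"].append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point",
                         "coordinates": [case["lon"], case["lat"]]},
            "properties": {"date": case["date"].isoformat()},
        })
    return dict(layers)
```

Each resulting collection corresponds to one weekly layer that a map server could publish alongside a temperature layer.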
IV. SYSTEM ARCHITECTURE
We propose a cloud-based, workflow-enabled data analytics framework that allows multiple users to implement big data analytics applications using common open source tools such as R, Rapidminer and Hadoop. Details are discussed in the following subsections.
A. Application and framework architecture
The framework is based on a private cloud implemented using Eucalyptus [21] and the public cloud Amazon EC2 [22]. Several types of Linux virtual machine images have been created, with Rapidanalytics, R, Hadoop and OpenGeo Suite installed separately. A Hadoop system is also installed in the private cluster and is linked to the framework through the name node.
An application may use only a subset of the tools listed above, or may be deployed only in the private cloud for cost or security reasons. The cost of building images in Amazon EC2 and transferring data to Amazon S3 should also be considered in the implementation phase. Therefore, we recommend using the hybrid cloud to develop, test and deploy most scientific applications.
Fig 4. shows the architecture of the cloud based data analytics framework.
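$$R_{XY}(\tau) = \frac{\sum_{t=1}^{N} \bigl(x(t+\tau) - \bar{x}\bigr)\bigl(y(t) - \bar{y}\bigr)}{\sqrt{\sum_{t=1}^{N} \bigl(x(t+\tau) - \bar{x}\bigr)^{2}}\,\sqrt{\sum_{t=1}^{N} \bigl(y(t) - \bar{y}\bigr)^{2}}} \tag{1}$$
Figure 4. Architecture of the cloud based data analytics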
The core of the framework is the workflow engine. It maintains the workflow repositories and provides a web interface for users to create, modify, execute and review workflows. It requests the proper resources from the cloud clustering service, schedules tasks to those instances, executes tasks on separate instances, monitors execution results, provides fault tolerance support, orchestrates data movement and harmonization, and releases cloud resources after workflow completion.
The cloud clustering service is used to maintain and find the proper type of virtual machine images with data analytics tools (e.g. R, Rapidanalytics and MySQL) installed and configured.
The framework also provides a user-friendly interface for building analytics tasks using common open source data analytics tools (e.g. R, Rapidminer, Hadoop), as well as customized data analytics processes. It also provides a practical way to effectively support a variety of data harmonization and visualization methods, especially for Geo-spatial data processing and visualization.
B. Interface to Rapidminer, R and Hadoop
Rapidanalytics is the server version of Rapidminer; it can share processes with Rapidminer and uses the same run-time library to execute data analytics tasks. A proxy shell program is built to execute a configured Rapidanalytics process through a web service call. Separate additional workflow tasks are added to move data to and from Rapidanalytics VM instances.
A similar proxy shell program is used to execute R scripts, and workflow tasks to move data to and from R instances are also needed here. In the test environment, we installed R and Rapidanalytics in one VM image.
The workflow engine can also define and execute Hadoop tasks through the Hadoop name node. Development, testing and execution are conducted in the Hadoop environment. A proxy shell program is built to trigger the execution of Hadoop tasks from the workflow engine, and separate additional tasks to move data to or from the Hadoop Distributed File System (HDFS) are also required. Fig 5. shows how to define a Hadoop task in a workflow. In the same workflow, the workflow engine is able to schedule and execute multiple dependent Hadoop tasks, as well as other analytics tasks (e.g. R or Rapidminer tasks).
Figure 5. Workflow with Hadoop task
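How dependent tasks in such a workflow might be ordered can be sketched with a topological sort. This is a minimal illustration using Python's standard `graphlib` module, not the workflow engine's actual scheduler; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "upload_to_hdfs": set(),
    "hadoop_job": {"upload_to_hdfs"},
    "download_from_hdfs": {"hadoop_job"},
    "r_analysis": {"download_from_hdfs"},
}


def execution_batches(dependencies):
    """Return batches of tasks in dependency order.

    All tasks in one batch have their dependencies satisfied, so they
    could be dispatched to separate instances at the same time.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

For the linear workflow above each batch contains a single task; a workflow with independent branches would yield batches with several tasks that can run in parallel.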
Some tasks may only be scheduled in the private cloud because of security and privacy requirements. The workflow engine finds this information in a pre-designed configuration file and schedules those tasks on private cloud VM instances.
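The placement constraint can be illustrated with a small sketch. The paper does not describe the format of the configuration file, so the set of private-only tasks, the `Instance` record and the first-fit selection below are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical configuration: tasks that must stay in the private cloud.
PRIVATE_ONLY_TASKS = {"load_dengue_case_records"}


@dataclass
class Instance:
    name: str
    cloud: str          # "private" or "public"
    busy: bool = False


def pick_instance(task, instances, private_only=PRIVATE_ONLY_TASKS):
    """Return the first idle instance the task may run on, or None."""
    for inst in instances:
        if inst.busy:
            continue
        if task in private_only and inst.cloud != "private":
            continue  # restricted tasks never leave the private cloud
        return inst
    return None
```

A restricted task waits (here, `None` is returned) when no private instance is idle, even if public instances are available.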
C. Interface for data visualization
Usually, data visualization is implemented on the client side, using either a special tool or a common browser to display and view the result. In the framework, after the workflow execution completes, the result can be moved to the local client through scp, or accessed from a Network File System (NFS) shared folder. The visualization tool (e.g. ncview [23] or a web browser) can then display the result.
For Geo-spatial data, we use OpenGeo Suite to design and publish the data with GeoServer. The layers, styles and data store are properly defined, and the data in PostgreSQL are dynamically updated after the result is generated by the workflow execution. From a web browser, a user can access the web pages published through a Tomcat web server, which calls the map service provided by GeoServer.
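A request to such a map service can be sketched as a standard WMS GetMap URL. The parameter names follow the OGC WMS 1.1.1 convention; the host, layer names and bounding box used in the usage note are hypothetical and not taken from the paper.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width=768, height=512,
                   srs="EPSG:4326"):
    """Build a WMS 1.1.1 GetMap URL that overlays the given layers.

    bbox is (min_lon, min_lat, max_lon, max_lat) in the given SRS.
    """
    params = {
        "service": "WMS",
        "version": "1.1.1",
        "request": "GetMap",
        "layers": ",".join(layers),   # drawn in the listed order
        "styles": "",                 # empty value selects default styles
        "bbox": ",".join(str(v) for v in bbox),
        "width": width,
        "height": height,
        "srs": srs,
        "format": "image/png",
        "transparent": "true",
    }
    return f"{base_url}?{urlencode(params)}"
```

For example, `wms_getmap_url("http://localhost:8080/geoserver/wms", ["demo:temperature", "demo:dengue_cases"], (103.6, 1.15, 104.1, 1.48))` would request a temperature layer with dengue cases drawn on top, using an approximate bounding box for Singapore.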
V. RESULTS AND DISCUSSION
The experiments were conducted on a hybrid cloud to examine the performance of the proposed framework in terms of speedup and functionality. The private cloud is composed of 6 nodes, each with a 24-core 2.93 GHz processor and 96 GB of RAM. Four of the nodes are used for Hadoop, and two are used to provide 24 virtual machines. Three types of Ubuntu 12.04 virtual machine (VM) instance are created: a small VM with 1 core and 2 GB of memory, a medium VM with 2 cores and 4 GB of memory, and a large VM with 4 cores and 8 GB of memory. The public cloud is composed of 5 Amazon AWS EC2 medium instances (2 cores with 2 ECU and 3.75 GB of RAM) in the Asia Pacific (Southeast) region.
A. Scalability Result
Fig 6. shows the time taken to process 24 hours of rainfall images on the three VM configurations.
We observe that the processing time is related to the size of the image. If there was rain at capture time, the radar image is larger and the processing time is longer. The median processing times are 109.5s, 41s and 40s for small, medium and large VM instances respectively, while the mean processing times are 174.5s, 64.2s and 62.5s respectively. We also observe that the processing time of the rainfall images can vary greatly.
Figure 6. Data processing time
Fig 7 shows the cost-performance ratio of each task on medium and large VMs. The mean speedup ratios are 2.65 and 2.72 respectively. The cost is calculated from the number of cores and the memory size of each VM instance type. The price of Linux Amazon EC2 instances in the Asia Pacific region is $0.08, $0.16 and $0.32 per hour for Small, Medium and Large instances respectively. For this particular Geo image processing task, the Medium VM type gives the best cost-performance ratio. This shows that there is a limit to scaling up a data analytics application, so users need to select an appropriate instance type based on the requirements of the application.
Figure 7. Cost performance ratio of different types of VM
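The cost-performance comparison can be reproduced from the reported figures with a few lines of arithmetic. The sketch below assumes that the ratio is the mean speedup over the small VM divided by the price relative to the small VM, which is one plausible reading of the "speedup ratio/cost" measure; the per-task values in Fig 7 are not reproduced.

```python
# Hourly prices (USD) and mean speedups over the small VM, as reported above.
PRICE_PER_HOUR = {"small": 0.08, "medium": 0.16, "large": 0.32}
MEAN_SPEEDUP = {"small": 1.0, "medium": 2.65, "large": 2.72}


def cost_performance(vm_type, baseline="small"):
    """Speedup divided by relative cost (assumed definition).

    Values above 1 mean more work per dollar than the baseline VM type.
    """
    relative_cost = PRICE_PER_HOUR[vm_type] / PRICE_PER_HOUR[baseline]
    return MEAN_SPEEDUP[vm_type] / relative_cost


def best_vm_type():
    """VM type with the highest cost-performance ratio."""
    return max(PRICE_PER_HOUR, key=cost_performance)
```

Under this assumption the medium VM scores 2.65 / 2 = 1.325 and the large VM 2.72 / 4 = 0.68, which agrees with the observation that the medium type is the most cost-effective for this task.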
We also conducted a scale-out test on the radar images of another typical day using medium VM instances. Fig 8. shows the execution time for different numbers of VMs. The result also provides information on how to allocate the right number of instances with the right type of VM image for data processing.
Figure 8. Execution time of VM scale out
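The effect of adding VMs can be explored with a simple estimate of the total execution time for a given set of per-image processing times. The longest-processing-time-first heuristic used here is an illustrative stand-in and is not the scheduling algorithm of the workflow engine.

```python
import heapq


def makespan(task_seconds, n_vms):
    """Estimate wall-clock time when tasks are spread over n_vms identical VMs.

    Tasks are taken longest first and each is given to the VM with the
    smallest accumulated load (LPT heuristic, an assumption).
    """
    if n_vms < 1:
        raise ValueError("n_vms must be at least 1")
    loads = [0.0] * n_vms
    heapq.heapify(loads)
    for seconds in sorted(task_seconds, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + seconds)
    return max(loads)
```

Because the estimate can never drop below the longest single task, adding VMs beyond a certain number brings no further gain, which is one reason the right number of instances depends on the workload.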
B. Time-lag correlation coefficient result
Analyzing the time-lag correlation coefficients between dengue cases and weather variables over 10 years of data using R shows that the number of dengue cases is correlated with temperature and humidity. It also shows that the number of dengue cases is linked to the periodicity of the temperature. Fig 9. shows the time-lag correlation coefficients between dengue cases and temperature, where RD-T denotes the correlation coefficient. In most weeks, the P-value is lower than the conventional 5% level (red line, P_value = 0.05), so the correlation is statistically significant. We also analyze the time-lag correlation coefficients between dengue cases and rainfall for a specific location in Singapore, for which rainfall images are processed and the classified data are saved to the database.
Figure 9. Time-lag correlation coefficients between dengue cases and temperature
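The time-lag correlation coefficient of equation (1) can be sketched as follows. The analysis in the paper was done in R; Python is used here only for illustration. Two assumptions are made: the means are taken over the overlapping window, and the Spearman variant is obtained by applying the same formula to ranks. The P-value computation is not shown.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def lagged_correlation(x, y, lag, use_ranks=True):
    """Correlation between x(t + lag) and y(t), following equation (1).

    x: weekly dengue cases, y: weekly weather variable, lag: weeks (>= 0).
    With use_ranks=True the formula is applied to ranks (Spearman).
    """
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    n = min(len(x) - lag, len(y))
    xs, ys = list(x[lag:lag + n]), list(y[:n])
    if use_ranks:
        xs, ys = ranks(xs), ranks(ys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Evaluating `lagged_correlation` for a range of lags and keeping the lag with the largest absolute coefficient gives an estimate of the delay between a weather variable and dengue incidence.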
C. Integrated visualization result
OpenGeo Suite is used for Geo-spatial data visualization. GeoServer is responsible for designing and publishing the map service. Shapefile data are stored on the server, while structured data are uploaded to PostgreSQL databases. Apache Tomcat is used to publish web pages that integrate the map service provided by GeoServer.
After temperature data are collected by area, GRASS [24] is used to interpolate and classify the values for each area. The processed result is then output as shapefiles and imported into GeoServer as a temperature distribution layer. Dengue cases with location information are stored in the PostgreSQL database and published as another layer in GeoServer. In Fig 10, the user is able to view the dengue incidents and mean temperature information in a browser. The visualization suggests that areas with higher temperature have more dengue incidents.
Figure 10. Dengue incidence visualization.
VI. CONCLUSIONS AND FUTURE WORK
In this paper we present a practical, effective high performance cloud framework to leverage common data analytics tools R, Rapidminer and Hadoop for CDS&E applications in hybrid cloud. After a user completes the configuration of workflow, workflow tasks will be scheduled, managed and monitored via workflow engine; then the data and result are imported to OpenGeo Suite for visualization.
A real-world data analytics application for dengue fever incidence in Singapore is used to explore the auto-scaling of the workflow engine, and to demonstrate the flexibility and simplicity of time-series data analytics, spatio-temporal data analysis, data harmonization and visualization in public and private clouds.
In future, the scheduling algorithms of the workflow engine will be improved with run-time prediction based on executing sample data, and interfaces to other data analytics tools will be explored. More knowledge may be discovered once the delayed relationship between location-level dengue cases and rainfall data is investigated. This may help to provide dengue fever incidence forecasts and effective measures for community dengue fever prevention and control.
ACKNOWLEDGMENT
Thanks to Dr. Ta Duong for providing cloud clustering service.
REFERENCES
[1] Manish Parashar, Moustafa AbdelBaky, Ivan Rodero, Aditya Devarakonda, “Cloud Paradigms and Practices for Computational and Data-Enabled Science and Engineering”, in Proc. of the Eighth International Workshop on System Management Techniques, Processes and Services (SMTPS 2012) at 2012 International Parallel and Distributed Processing Symposium (IPDPS), Shanghai, China, May 21, 2012.
[2] Sifei Lu, Reuben Mingguang Li, William Chandra Tjhi, Kee Khoon Lee, Long Wang, Xiarong Li and Di Ma. (2011). “A Framework for Cloud-Based Large-Scale Data Analytics and Visualization: Case Study on Multiscale Climate Data”, The 2011 Workshop on Integration and Application of Cloud Computing to High Performance Computing (HPCCloud), Athens, Greece, 29 Nov - 1 Dec, 2011
[3] Gartner report, “The Top 10 Technology Trends for 2012”, The report is available on Gartner’s website at http://www.gartner.com/DisplayDocument?id=1926316
[4] Rapidminer, Rapidanalytics, http://rapid-i.com/
[5] R, http://www.r-project.org/
[6] Hadoop, http://hadoop.apache.org/
[7] WHO, Dengue and severe dengue,
http://www.who.int/mediacentre/factsheets/fs117/en/index.html
[8] Campaign Against Dengue, NEA, Singapore,
http://www.dengue.gov.sg/
[9] Zhaoxia Wang, Hoong Maeng Chan, Martin L. Hibberd, Gary Kee Khoon Lee, “Delayed Effects of Climate Variables on Incidence of Dengue in Singapore during 2000-2010”, in Proc. 3rd International Conference on Environmental Science and Development (ICESD 2012), 2012.
[10] Suraj Pandey, Dileban Karunamoorthy and Rajkumar Buyya, “Workflow Engine for Clouds”, Chapter 12, pp. 321-344, Cloud Computing: Principles and Paradigms, R. Buyya, J. Broberg, A.Goscinski (eds), ISBN-13: 978-0470887998, Wiley Press, New York, USA, February 2011.
[11] Mustarfizur Rahman, Xiaorong Li, Henry Palit, “Hybrid Heuristic for Scheduling Data Analytics Workflow Applications in Hybrid Cloud Environment”, in Proc. High-Performance Grid and Cloud Computing Workshop 2011, in conjunction with International Parallel and Distributed Processing Symposium (IPDPS 2011), 2011.
[12] Rajkumar Buyya, Suraj Pandey, and Christian Vecchiola, “Cloudbus Toolkit for Market-Oriented Cloud Computing”, in Proc. of the 1st International Conference on Cloud Computing (CloudCom 2009), Beijing, China, December 1-4, 2009.
[13] Rajkumar Buyya, Rajiv Ranjan, and Rodrigo N. Calheiros, “InterCloud: Utility-Oriented Federation of Cloud Computing Environments for Scaling of Application Services”, in Proc. of the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010), pp. 13-31, Busan, South Korea, May 21-23, 2010.
[14] Hyunjoo Kim, Yaakoub el-Khamra, Ivan Rodero, Shantenu Jha, Manish Parashar, “Autonomic management of application workflows on hybrid computing infrastructure”, Scientific Programming 19(2-3):75-89, 2011, IOS Press.
[15] Hyunjoo Kim, Manish Parashar, “CometCloud: an autonomic Cloud engine”, Chapter 10, pp. 275-297, Cloud Computing: Principles and Paradigms, R. Buyya, J. Broberg, A.Goscinski (eds), ISBN-13: 978-0470887998, Wiley Press, New York, USA, February 2011.
[16] Radoop, http://www.radoop.eu/
[17] Z. Prekopcsák, G. Makrai, T. Henk, C. Gáspár-Papanek, “Radoop: Analyzing Big Data with RapidMiner and Hadoop”, in Proc. of the 2nd RapidMiner Community Meeting and Conference (RCOMM 2011), 2011.
[18] OpenGeo Suite, http://opengeo.org/
[19] Climate Data Online, http://www.ncdc.noaa.gov/cdo-web/
[20] Raster Data Management, Queries, and Applications,
http://postgis.refractions.net/docs/using_raster.xml.html
[21] Eucalyptus, http://www.eucalyptus.com/
[22] Amazon Elastic Compute Cloud (EC2), http://www.amazon.com/ec2/
[23] ncview RPM DEB Download, http://pkgs.org/download/ncview
[24] GRASS: Development, http://grass.osgeo.org/devel/index.php