The UNECE Big Data Sandbox: What Means to What Ends?

Distr. GENERAL

UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE (UNECE)

CONFERENCE OF EUROPEAN STATISTICIANS

Working Paper No. 17th April 2015

ENGLISH ONLY

Workshop on the Modernisation of Statistical Production (Geneva, 15-17 April 2015)

Topic (iii): Innovation in technology and methods driving opportunities for modernisation.

Prepared by Bruno Voisin and Niall Wilson

Irish Centre for High-End Computing, Ireland

Abstract

The UNECE Big Data Sandbox, a shared infrastructure for Big Data technology experimentation based on a Hadoop environment, is about to be made available in its second iteration. While ubiquitous, Hadoop isn't the only approach to Big Data. As an early exercise in planning the future of the Sandbox, this paper presents a number of hardware and software technologies that may be considered in the context of the modernisation of statistical production. Additional considerations are also made regarding a possible broadening of the Sandbox's mandate towards a production-like infrastructure, as well as enhancing third-party collaboration through the use of synthetic data.

I. Introduction

1. The UNECE Sandbox provides a shared infrastructure for experimenting with Big Data technologies, where the volume, velocity or complexity of data becomes a limiting factor to computation. It currently provides a Hadoop environment testbed that has already been used by 20 institutes for the development of scalable parallel computations. Due to rapid technological improvements, the Sandbox's old hardware fast became slow and uneconomical, and it is in the process of being replaced by a new system due for open access in May 2015. While the current Sandbox effort is heavily focused on the Hadoop ecosystem, many other technologies are available for large, complex data processing through distributed/parallel computing. Looking forward to the future, with so many possibilities in hardware and software, which ones should be kept in mind and considered for future developments of the Sandbox? Which types of problems are currently being investigated, or will be in the future? And what should be the role of the system in terms of experiment, production and collaboration?

2. This paper briefly describes the current status of the UNECE Big Data Sandbox before presenting a number of promising technologies, both in terms of hardware and software. It then illustrates through current use cases at ICHEC how particular problems are a better fit for a particular solution. Finally, it considers the possible expansion of a Sandbox-type infrastructure towards a broader mandate: provision of a production environment and third-party collaboration through synthetic data.

II. The UNECE Big Data Sandbox

A. Mission/Mandate

3. A “Sandbox” environment was created in April 2014, with support from the Central Statistics Office (CSO) of Ireland and the Irish Centre for High-End Computing (ICHEC). The aim is to provide a technical platform to load Big Data sets and tools and give participating statistical organisations the opportunity to:

a) Test the feasibility of remote access and processing – Statistical organisations around the world will be able to access and analyse Big Data sets held on a central server. Could this approach be used in practice? What are the issues to be resolved?

b) Test whether existing statistical standards / models / methods etc. can be applied to Big Data

c) Determine which Big Data software tools are most useful for statistical organisations

d) Learn more about the potential uses, advantages and disadvantages of Big Data sets – “learning by doing”

e) Build an international collaboration community to share ideas and experiences on the technical aspects of using Big Data.

B. Previous Platform

4. The initial hardware dedicated to the Sandbox system consisted of 20 nodes from a decommissioned high performance computing cluster. Each node consisted of:

a) 2 x Intel Xeon X5560 quad-core processors

b) 48GB RAM

c) 1 x 1TB disk

d) DDR Infiniband (16 Gbit)

Thus the cluster as a whole had approximately 20TB of shared storage available via an HDFS file system. With the default three-way replication this allowed for approximately 6TB of data. The Hortonworks Data Platform was chosen as the Hadoop distribution to be installed; in addition, RHadoop (a Hadoop interface for the R statistical suite) and Pentaho Business Analytics software were available to users. The Sandbox system proved to be a very successful initial step in enabling collaborative groups to work together, as documented in the final report published in January 2015 [1]. However, due to the age of the hardware used for this Sandbox infrastructure and the requirement to continue some experiments into 2015, it was decided to purchase and commission new dedicated hardware which would have greater capacity and capability and would enable the deployment of updated software.
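The capacity figures quoted for the previous platform can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that usable capacity is simply raw capacity divided by the replication factor, ignoring space reserved for the operating system, temporary files and HDFS overheads (which may explain why the paper quotes roughly 6TB rather than the 6.7TB computed here).

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3):
    """Return (raw, usable) capacity in TB, ignoring all overheads (assumption)."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb, raw_tb / replication

# Previous platform: 20 nodes, one 1TB disk each, three-way replication.
raw, usable = usable_capacity_tb(nodes=20, disks_per_node=1, disk_tb=1)
print(f"raw: {raw} TB, usable at 3x replication: {usable:.1f} TB")
```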

C. Upcoming Platform

5. The design philosophy for the updated Sandbox is that it should have greater capability but with fewer nodes and that the software stack should be easier to maintain and upgrade as newer versions and components of the Hadoop ecosystem become available. Hence, in February 2015 a new set of hardware was procured and installed. The cluster consisted of separate admin and login nodes as well as initially 4 compute/data nodes each with the following specification:

a) 2 x Intel Xeon E5-2650 v3 10-core processors

b) 128 GB RAM

c) 4 x 4TB disk

Hortonworks Data Platform (version 2.2) has been installed on this system and a total of 56TB is available for user datasets. The system is currently undergoing the final stages of software installation and testing and once user data has been migrated, full access is planned to open on May 4th, 2015.

III. Big Data Technologies: Means and Ends

A. Hardware Means

6. Traditional x86 CPU architectures dominate the platforms on which the application software implementing data analytics algorithms is run. However, other options exist which have been used in high performance computing and other domains to reduce the runtime of application software. Below we list some of these alternative compute technologies along with some comments on how they may be suitable for big data analytics.

a) GPUs : Graphics Processing Units have long provided dedicated hardware to offload and accelerate the graphics pipeline from the CPU. The inherent parallelism of graphics processing meant that this hardware possessed hundreds of independent processing cores and local memory which could be repurposed to perform generic non-graphical computation. This often requires some algorithmic redesign and re-implementation using platform-specific frameworks such as Nvidia’s CUDA but for suitable applications can yield very significant speedups. The primary performance penalty derives from data movement across the PCI bus between main memory and the GPU device. Unless minimised, this data movement can dominate and starve the GPU of useful work. As datasets become too large to fit in GPU memory they need to be partitioned and loaded sequentially as needed. By using asynchronous memory transfer and streaming it is possible to keep both CPU and GPU busy while moving data. By using these techniques and rewriting applications, it has been shown that GPUs are capable of speedups of the order of 80x for large dataset analytical problems.

b) FPGAs : Field Programmable Gate Arrays are devices which allow integrated circuits to be reconfigured after manufacture. Hence they can provide similar functionality to Application Specific Integrated Circuits (ASICs) but can be designed and changed by the user so as to provide the ability to implement algorithms in hardware rather than as software running on microprocessors. From a data processing point of view these devices have traditionally been used in a stream processing model whereby an input stream of data is analysed or transformed in real time. Some FPGA devices have on-board high performance network interfaces to more easily enable this.

c) Manycore Coprocessors : These devices (e.g. Intel Xeon Phi) consist of many (currently 60) simple x86 cores on a single processor. While the computational capability of each core is much less than that of a current generation high-end CPU core, the number of cores and their interconnection allow for significant parallel capability. A key advantage is that applications do not need to be modified to run on these devices as they use the ubiquitous x86 architecture. However, code-level changes are still required to achieve optimal performance and, with current generations, data movement between these PCIe-hosted devices and main memory is a limitation, as with GPUs. Future tighter integration with main memory and other hardware improvements have the potential to greatly increase performance, and the familiar programming model could enable this platform to become an attractive target for parallelising data analytics workflows. A standards-based programming model (MPI, OpenMP) means that a portable parallel code base should adapt to this platform more easily than to proprietary models.

d) Large Shared Memory : Distinguished only by an amount of memory and a number of processors significantly higher than normal, dedicated large-memory servers with multiple terabytes of RAM can be used to ingest large data sets into memory for processing. Given the very large difference in access times for a CPU to request data from disk versus main memory, having the full working dataset in memory can greatly speed up the overall analysis. Applications will often need to be modified to take full advantage of this type of system but only for the I/O component. Ease of use is the primary advantage of such systems, but it comes at the cost of expensive hardware.

e) Burst Buffer Cache : Datasets that are too large to be held in main memory must be accessed from a filesystem on disk. The bandwidth and latency of this access are a severe limitation, so new technologies based on arrays of solid state disks and intelligent software are starting to emerge which will act as a high performance cache between main memory and the persistent storage on disk. By pre-fetching data from disk to this buffer cache, significant speedups can be achieved for I/O bound problems. This still-emerging technology provides its benefits transparently, requiring no additional programming from the user.
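Two of the items above, asynchronous transfers to a GPU and pre-fetching into a burst buffer, rest on the same idea: overlapping data movement with computation. The following is a minimal software analogy of that idea, not a description of how either technology is implemented. The `read_chunk` function is a hypothetical stand-in for a slow read, and the buffer depth is an arbitrary assumption.

```python
import queue
import threading
import time

def read_chunk(i):
    """Hypothetical slow read: sleeps to mimic disk or bus latency."""
    time.sleep(0.01)
    return list(range(i * 1000, (i + 1) * 1000))

def prefetching_reader(n_chunks, depth=2):
    """Yield chunks while a background thread reads ahead into a bounded buffer."""
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for i in range(n_chunks):
            buffer.put(read_chunk(i))  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        chunk = buffer.get()
        if chunk is sentinel:
            return
        yield chunk

total = 0
for chunk in prefetching_reader(n_chunks=8):
    total += sum(chunk)  # processing overlaps with the next read
print(total)
```

The bounded queue plays the role of the intermediate cache: the consumer rarely waits for data as long as reads and processing take comparable time, and memory use stays limited to `depth` chunks.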

B. Software Means

7. Having access to advanced hardware is one thing, being able to use it to best effect is another, and this is where the software stack comes in. Its role is to provide fine control over the hardware resources to the user, and to facilitate the implementation of a potentially complex and distributed processing task over a specific set of data. Each solution comes with its own compromise in achieving these two tasks, fulfilling both at the same time being still an open problem. Though the parallel computing and distributed storage landscape is large, some solutions offer what we believe are key advantages in particular areas, and form a good set of technologies to have in mind when pondering the needs of a future evolution of the Sandbox.

8. Already deployed on the Sandbox, Hadoop is ubiquitous in discussions of Big Data. Its free-software nature, robustness, usability on commodity hardware and relative ease of use have helped its wide adoption over the years by many businesses looking to scale up their storage and analytics capacities. Hadoop is an ecosystem composed of a number of components. Of these, the Hadoop Distributed File System (HDFS) and MapReduce form a solid core of what Hadoop provides: HDFS is a distributed storage facility that allows storage and retrieval of data across a number of distributed nodes connected over a network. MapReduce is a fault-tolerant distributed processing facility that provides a simple way to define data parallel processing, i.e. run a number of identical tasks over different slices of a data set (defined by the Map function), and merge the results afterwards (through the Reduce function). Though the simple nature of the MapReduce approach helped popularise Hadoop, the model's restriction to a single type of parallelism became an obstacle to taking further advantage of the platform. Therefore, a dedicated resource manager called YARN (Yet Another Resource Negotiator) was developed as a new component. It focuses on distributed resource management, separating it from the processing engine and allowing different types of processing models (like MapReduce) to be implemented on top of it. By comparison, the original MapReduce framework handled both resource management and processing. Finally, another Hadoop component of note is HBase: a non-relational database that provides hosting and processing of very large tables rather than standard file processing. In addition to the core Hadoop ecosystem, additional analytics frameworks are available for more specific needs. Of particular interest are Storm, focused on complex processing of streaming data, and Spark, which provides in-memory computing for near real-time analytics, as well as a machine learning library (MLlib) for out-of-the-box parallel implementations of commonly used algorithms. Both platforms are available on the Sandbox.
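The map-then-merge pattern described above can be illustrated in a few lines of plain Python. This is a deliberately simplified sketch and not the Hadoop API: there is no shuffle phase, no fault tolerance and no distribution across nodes, and the record layout and function names are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_slice(records):
    """Map step: count records per key within one slice of the data."""
    return Counter(country for country, _ in records)

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical records: (country, value) pairs, already split into slices.
slices = [
    [("IE", 1), ("FR", 2), ("IE", 3)],
    [("DE", 4), ("IE", 5)],
    [("FR", 6), ("DE", 7), ("DE", 8)],
]

# On a cluster each slice would be handled by a separate task.
partials = [map_slice(s) for s in slices]
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

Because the merge is associative, the partial results can be combined in any order, which is what allows a framework to schedule the map tasks independently.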

9. A different take on large-scale data analytics with a focus on business intelligence and real-time analytics is that of in-memory column-oriented databases. Such systems are typically deployed on servers with a large amount of shared memory where the entire database is stored (with regular storage on disk for persistence or versioning). In addition to the data tables being in RAM for faster access, they are also stored per column, making it faster again to read an entire column for a calculation. Such systems are a particularly good fit for real-time business intelligence where column-based operations like aggregations are very common. Examples of such platforms are SAP HANA, MonetDB and kdb+, which all offer integration with various programming languages. Potential candidates for these technologies would be smart meter management or mobile phone analytics, where real-time analysis may be a key requirement. A major limitation of these systems, however, is the cost of the large-memory hardware they require, which makes them typical candidates for the hot storage tier in a tiered storage solution when dealing with a Big Data scenario.
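The difference between row-oriented and column-oriented layouts can be shown with a toy example. The sketch below uses hypothetical smart meter readings and plain Python containers; it illustrates the layout only and says nothing about how any of the products named above store data internally.

```python
# Row-oriented layout: one record per entry (hypothetical meter readings).
rows = [
    {"meter_id": 1, "hour": 0, "kwh": 0.42},
    {"meter_id": 2, "hour": 0, "kwh": 0.31},
    {"meter_id": 1, "hour": 1, "kwh": 0.40},
    {"meter_id": 2, "hour": 1, "kwh": 0.35},
]

# Column-oriented layout: one contiguous sequence per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: the aggregate has to visit each record to pick out one field.
total_from_rows = sum(row["kwh"] for row in rows)

# Column layout: the aggregate reads only the single column it needs.
total_from_columns = sum(columns["kwh"])

print(total_from_rows, total_from_columns)
```

In a real column store the benefit comes from the column values being adjacent in memory, so an aggregation scans one compact array instead of skipping over unrelated fields.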

10. Finally, when speed is the prime requirement of a complex analytics task, or when the parallel computation follows a complex model, another alternative should be considered: the development of a dedicated standalone application using traditional High Performance Computing tools. The Message Passing Interface (MPI) library, available for various programming languages, allows for the development of a complex parallel processing application through a messaging system enabling communication between its various processes. Though such an approach is more typical of large-scale scientific simulations, it may be that future statistics will have to deal with complex cases requiring extreme performance. Furthermore, some technological lines are blurring with, for example, the development of an MPI implementation for YARN, the Hadoop resource manager, enabling the development of parallel applications with complex communication patterns over a Hadoop cluster.

11. The various hardware and software solutions at hand mean that a Sandbox infrastructure will require careful consideration of the problems it is intended to tackle. What will statisticians want to do with the processing power made available by modern computing technologies? Though the answer to this question can only come from a long reflection by the community itself, we provide here the example of two recent cases where ICHEC was involved in taking advantage of computing technologies to make a difference in a statistical context.

12. The first case, in partnership with the Economic and Social Research Institute (ESRI, Ireland), is typical of traditional statistical computations whose scale is now starting to outgrow the capacities of the system they were developed for. The work consisted of accelerating the processing of Eurostat employment data. Quarterly individual worker information from the 28 member states of the European Union over a number of years was to be read, processed and aggregated according to various criteria. In-house benchmarking at the ESRI showed that their traditional statistical computing platform was hitting a limit in terms of performance as the data size grew, limiting the current method's usability to smaller population samples or shorter periods of time. Analysis showed that the main reasons for the performance decrease were disk operations and single core sequential processing. The disk operations were particularly problematic since the platform was designed around working on disk files, making table scans and column updates all the more costly. An alternative implementation using the R platform was therefore developed with these problems in mind. Due to the confidential nature of the Eurostat data, a first requirement for ICHEC as a third-party collaborator was to generate a comparable synthetic data set to work with. This had the advantage of allowing ICHEC to run tests on any given data volume. The code then calculates partial aggregates in parallel over small slices of the dataset before merging them for the final results, with each data slice being read from disk only once. This type of approach, known as data parallelism, is applicable to many statistical tasks and is typical of, but not limited to, MapReduce operations in a Hadoop environment.

13. The second case, in partnership with the CSO, is a study of the possibilities in mobile phone location data analysis. Again, for data confidentiality reasons, synthetic data was generated. The study, focused on tourist mobility patterns in Ireland derived from coarse phone locations recorded at phone usage events, consisted of exploring the statistical measures and machine learning algorithms that would best identify significant patterns. The exercise showed that, given a list of locations of interest, a mix of statistical summaries per visitor nationality and machine learning induced rules provided an interesting picture of national behaviours such as typical arrival places, duration of stay, group sizes or areas visited together. As the data is also linked to geographical locations, it proved naturally suited to the visualisation of rules such as those produced by an Apriori algorithm.
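The partial-aggregate approach described for the first case (paragraph 12) can be sketched as follows. The original work was implemented in R; Python is used here purely for illustration, and the record layout, function names and use of a thread pool are assumptions rather than a description of the actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(chunk):
    """Return {group: (sum, count)} for one slice, visiting each record once."""
    out = {}
    for group, value in chunk:
        s, n = out.get(group, (0.0, 0))
        out[group] = (s + value, n + 1)
    return out

def merge(partials):
    """Combine per-slice (sum, count) pairs and derive the mean per group."""
    merged = {}
    for part in partials:
        for group, (s, n) in part.items():
            ms, mn = merged.get(group, (0.0, 0))
            merged[group] = (ms + s, mn + n)
    return {g: s / n for g, (s, n) in merged.items()}

# Hypothetical synthetic records: (country, hours worked), split into slices.
chunks = [
    [("IE", 38.0), ("FR", 35.0)],
    [("IE", 40.0), ("DE", 36.0)],
    [("FR", 37.0), ("DE", 40.0)],
]

# A process pool or a cluster would replace the thread pool in practice.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_aggregate, chunks))

means = merge(partials)
print(means)
```

A mean cannot be merged directly from per-slice means, so each slice carries a (sum, count) pair instead; the same trick extends to variances and other statistics that decompose into additive terms.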

14. In terms of scale, while the study of tourism data in Ireland for a single season can still reasonably be done with a traditional approach (it is currently implemented as an R package), running such an analysis over a longer time period or comparing holiday seasons across many years could easily outgrow the performance window of a workstation, as would running the analysis on a larger country with more visitors. Our current work therefore investigates which technologies can be used for a mobile phone data analytics information system. The Hadoop ecosystem looks like a good candidate as HBase can provide a storage solution for the call location data and most of the analysis is data parallelisable, and therefore easily portable to a MapReduce approach. However, looking for a more integrated solution and considering the potential usefulness of real-time analytics in the domain (moving from low resolution tourist data to high resolution commuter data for traffic analysis), an in-memory database approach was picked, using SAP HANA. The core advantage of such a technology resides in the memory storage as opposed to disk storage, as well as the column-based storage and indexing which accelerate the computation of statistical aggregates across millions of rows (timestamped phone locations), making it suitable for real-time usage. The current HANA prototype is fed the data directly from the R data generator using the RODBC connector, and provides both analytics and visualisation through the integrated functionalities of the database (PAL for analytics, XSJS/SAPUI5 for the web interface and visualisation). The native HANA/R integration is currently being investigated for the provision of more complex analytics.

15. Though the in-memory database system seems a good fit for real-time mobile data analytics, large-memory systems are generally costly hardware. Therefore, as the mobile data collection grows, it is likely that the historical data will outgrow the capacity of the database. A Hadoop solution, by contrast, would not provide the same real-time performance but could be scaled up at a lower hardware cost when needed. These technologies do not need to work in isolation, however: a tiered storage/analytics system could be considered, where a Hadoop/HBase platform would store the entire data archive and provide its own analytics layer, and where an in-memory database like HANA would store the most critical data for which real-time analysis is a requirement. Looking back at the potential use of mobile phone location for commuter traffic analysis, it is likely that only data from the past one to twenty-four hours would be required as in-memory hot storage. Such a tiered solution would then provide real-time analysis of the most important data (particular time window, particular region), while offering a cheap scalable archive for historical data preservation and analysis.
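The tiered arrangement outlined above can be pictured with a toy model. In the sketch below, two Python lists stand in for the in-memory database and the Hadoop/HBase archive; the class name, the event layout and the 24-hour window are illustrative assumptions, not a description of an existing system.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(hours=24)  # assumed retention for the in-memory tier

class TieredStore:
    """Toy two-tier store: everything is archived, recent events stay 'hot'."""

    def __init__(self):
        self.hot = []      # stands in for an in-memory database
        self.archive = []  # stands in for a Hadoop/HBase archive

    def ingest(self, event):
        """Write every event to the archive and to the hot tier."""
        self.archive.append(event)
        self.hot.append(event)

    def age_out(self, now):
        """Evict events older than the hot window from the hot tier only."""
        cutoff = now - HOT_WINDOW
        self.hot = [e for e in self.hot if e["time"] >= cutoff]

now = datetime(2015, 4, 17, 12, 0)
store = TieredStore()
for hours_ago in (1, 5, 30, 72):
    store.ingest({"time": now - timedelta(hours=hours_ago), "mast": "M1"})
store.age_out(now)
print(len(store.hot), len(store.archive))
```

Real-time queries would then run against the small hot tier, while historical analyses would run against the complete archive.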

D. Broadening the mandate

16. A sandbox is, by definition, a platform for experimentation. In our case, this means experimenting with new hardware and software technologies and different distributed/parallel computing methods, and testing data scaling. The governing criterion for a successful and useful sandbox system is that it has the ability to adapt quickly to new requirements. Considerations such as stability and availability are consequently secondary. However, at some point, certain services and workflows may be considered to be stable enough and of sufficient general demand to be moved into production. In contrast to the requirements of an experimental sandbox, a production service must provide certain guarantees so that people can plan with confidence in its future availability and resource levels. It also requires the deployment of a dedicated infrastructure, and exposes the solution to new requirements: 24/7 activity, accumulation of persistent data, data streaming, etc. But these requirements also open up new possibilities, in that such a service can form the basis for a collaborative data analytics platform where different organisations can come together to work on shared datasets and methodologies on the same centrally managed infrastructure. This, however, is currently out of the scope of the UNECE Sandbox. Therefore, we believe that whether a production platform is desirable as a shared infrastructure functionality is an important question that, if answered positively, opens up many more:

• How many resources should be provided?

• How many concurrent services should be supported?

• How many users must it support?

• With a limited resource pool, how should it be decided which project gets allocated resources first?

• Should such a service be used to fund maintenance and future expansion of the Sandbox?

• What level of separation in terms of network and file system is required for sensitive data?

• What level of technical support is appropriate?

17. Another aspect of the Sandbox usage is tied to its nature as a shared infrastructure. Being an “outside” system with necessarily lower restrictions on access than a national statistics institute’s internal network, it is a convenient platform for collaboration between users, groups or institutes. A critical impediment to such work, though, is data confidentiality. When processing sensitive data, getting a third party involved may require authorisations to be granted, NDAs to be signed or data sets to be anonymised. These constitute barriers to collaboration, heavily restraining the context of the work and preventing the wide publicising of an interesting problem that could be pushed to, for example, the research community. A workaround to these issues is the use of synthetic data. A synthetic data set is either a thoroughly anonymised/transformed product of real data, or a data set generated from scratch according to a set of rules. Though this is not useful for actual policy-driving data analysis, the use of a synthetic data generator provides some key advantages in the development of an analytics methodology. First, it makes it possible to explore the extent to which some analysis techniques can detect specific patterns, thanks to these patterns being inserted into the data during its generation. In our work on mobile phone data for example, we inserted a number of phones that travelled together (in groups of 2, 3, 4 or 10) to evaluate whether a clique-based approach could detect such groups despite the sparse and noisy nature of the data (only mast location data during phone calls). Second, synthetic data can be generated to any given size, allowing one to run scaling experiments, which are critical to the development of a reliable 'big data' analytics methodology. Finally, it provides one with a data set (or a data generator) that can be freely exchanged, experimented upon, and more simply 'leave the walls of the institute' without legal requirements or risk of a confidentiality breach. Confidentiality issues have already been identified as a critical limitation of the Sandbox and, we believe, constitute an avoidable restraint for a platform whose core aim is the development of methodologies more than specific results from real data. Indeed, Sandbox teams working on smart meter data and scanner data have already used synthetic data for that purpose. Provided that the synthetic data accurately mimics the format of the original, this allows the development of an entire software stack that can later be deployed 'within the walls' of each institute that hosts the real data source. It also provides a great opportunity for collaboration with third level education, through the promotion of these data sets as sources for data analytics students. Looking to the future then, it is worth considering whether a data generation facility should be developed and made available on the Sandbox for a number of core data types. What sort of patterns would we want to be able to model into the generated sets? How generic could such a platform be?
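The idea of planting a known pattern in generated data and checking that an analysis recovers it can be sketched briefly. The example below is a heavily simplified illustration and not the generator or the clique-based method used in our work: every phone is observed at every time step, there is no noise, and detection relies on simple pairwise co-location counts. All names and parameters are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def generate_events(n_phones, n_masts, n_steps, group, seed=0):
    """Return {(step, mast): set of phone ids}, with `group` moving together."""
    rng = random.Random(seed)
    sightings = {}
    for step in range(n_steps):
        group_mast = rng.randrange(n_masts)  # the planted pattern
        for phone in range(n_phones):
            mast = group_mast if phone in group else rng.randrange(n_masts)
            sightings.setdefault((step, mast), set()).add(phone)
    return sightings

def co_location_counts(sightings):
    """Count how often each pair of phones shares a mast at the same step."""
    counts = Counter()
    for phones in sightings.values():
        counts.update(combinations(sorted(phones), 2))
    return counts

n_steps = 40
planted = {3, 7, 11}
sightings = generate_events(n_phones=30, n_masts=50, n_steps=n_steps, group=planted)
counts = co_location_counts(sightings)

# Pairs seen together at every step are candidate travelling companions.
detected = {p for pair, c in counts.items() if c >= n_steps for p in pair}
print(sorted(detected))
```

Because the planted group is known in advance, the detection step can be scored exactly, and the same generator can be rerun with larger `n_phones` or `n_steps` for scaling experiments.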

IV. Conclusion

18. While the current Sandbox does satisfy its original mandate, it is worth being aware of the alternatives in terms of Big Data computing. As more use cases are being discovered, it is important to be able to recognise which model of parallel/distributed computing it fits best, and which technologies would provide the best environment for it. These considerations, in turn, will allow for engaged critical thinking by the community when the time comes to plan future iterations of the Sandbox.

19. Another important consideration is the Sandbox's mandate. Due to its nature as a shared dynamic testbed, it isn't currently providing a production-like environment. Such a facility may however be deemed desirable, for short to medium term testing of an application with actual confidential data or for evaluation of a 24/7 data streaming scenario. These would have to be limited in time though, since a long term production service would ultimately remain the responsibility of its host statistical institute.

20. Finally, as an external shared infrastructure, the Sandbox is a convenient platform for collaboration between statistical institutes. This has brought to the fore a number of collaborative issues, one of them being data confidentiality. More systematic use of synthetic data is a way to get around this problem, and some effort invested in developing quality synthetic data generation facilities could be a useful foundation for getting third parties involved in the development of Big Data methodologies. Making it easier for other disciplines and communities to access national statistical production will bring different and often complementary expertise to that of the statistical institute's staff, and with high quality synthetic data sets, any developed application or codebase should be immediately applicable to the real data when needed.

References

[1] “The Role of Big Data in the Modernisation of Statistical Production”, January 2015, reports available at: http://www1.unece.org/stat/platform/display/bigdata/2014+Project