• No results found

3.8 Computing Architectures

3.8.2 Cloud Computing

The term “Cloud Computing” refers to an architecture which spreads processing and data across many servers, storing data in multiple redundant locations and so offering a level of resilience that is now expected from the major Internet companies. These sys- tems are built to be big and scalable, running on cheap commodity hardware with fault tolerance emerging from data replication. Central to how this architecture functions is the MapReduce algorithm [DG08] which was published by Google in 2008. Many vendors offer this type of cloud service for hire at a cost per compute cycle, or bit of

data stored. Examples include Microsoft’s Azure platform, Amazon’s compute cloud and Google’s AppEngine, or Hadoop, Casandra (Facebook), HyperTable and a host of other 3rd party projects. The basic architecture uses a distributed file system, for ex- ample HDFS, with a compute layer running on top which allows queries to be made. Although this is a very simplistic view, it makes the important point about the ability to analyse large datasets and harness more compute power than has previously been available to the general public. It also shows the move away from relational databases and towards the analysis of unstructured data which does not fit easily into the entity relational model. In addition to this, GIS systems like Geotools are beginning to be seen running on these distributed architectures.

It is worth pointing out that, while the web as we know it wouldn’t be possible with- out cloud computing, it is not without problems as the following two quotes explain:

“The interesting thing about cloud computing is that we’ve redefined cloud computing to include everything that we already do. I can’t think of anything that isn’t cloud computing with all of these announcements. The computer industry is the only industry that is more fashion-driven than women’s fash- ion. Maybe I’m an idiot, but I have no idea what anyone is talking about. What is it? It’s complete gibberish. It’s insane. When is this idiocy going to

stop?” (Larry Ellison, CEO Oracle21)

“One reason you should not use web applications to do your computing is that you lose control. It’s just as bad as using a proprietary program. Do your own computing on your own computer with your copy of a freedom- respecting program. If you use a proprietary program or somebody else’s web server, you’re defenceless. You’re putty in the hands of whoever devel-

oped that software.” (Richard Stallman, GNU FSF22)

To put these two quotes into context, Larry Ellison, as the CEO of Oracle, has an interest in distributed databases rather than Internet technologies and his quote follows on from Oracle 9G using GRID technology. The later acquisition of Sun Microsystems and Java and the release of Apache Hadoop tools for the core database show the fluid way that the company positions itself in the market. Richard Stallman makes a good 21Taken from Larry Ellison’s keynote speech at OpenWorld in September 2012 http:// www.oracle.com/ openworld/ keynotes/

index.html(February 2013).

22Quote from Richard Stallman of the Free Software Foundation in a Guardian interview in September 2008 http:// www.

3.8. Computing Architectures 91 argument for maintaining control over ownership of your own data and the inability to look inside the box and see how a cloud computing application works.

Where cloud computing fails is when large amounts of data need to be moved be- tween the cloud and client machines. In this situation, the Internet connection is the weakest link, limiting the transfer speed. At the time of writing, 100MBit connections are typical in Universities, with this slowly being replaced by 1GBit, but only due to the direct connection. More typical of home users is 12MBits, with some companies providing 50MBit where a fibre connection is available. In general, the way this is handled is to ensure that the processing of large or real-time data occurs on the same system as the one holding the original data. The Network Rail API for all UK train movements is an example of this, being hosted on the Amazon cloud. This fact is ad- vertised in the documentation and clients are advised to host their own services on the Amazon cloud to take advantage of the faster transfer speeds. Another example is the TfL Trackernet API for London Underground, which is now hosted on Microsoft Azure after originally launching on TfL’s own in-house servers and failing under high load. Best practice when trying to track all public transport in London when the data is split between Amazon and Azure is still not fully investigated. This could be the Achilles’ heel of cloud computing with large and fast datasets.

Moves are already being made towards cloud-based GIS systems, both in the com- mercial market and open-source communities. The release of ‘ESRI’s GIS Tools for Hadoop’ (ESRI, 2013) on 25 March 2013 shows how they are positioning themselves in the market, but the indicators have been around for the last year at least. Reading the blog of Mansour Raad23, a senior software architect at ESRI and Cloudera Certified De-

veloper for Apache Hadoop, there are examples using the core Hadoop project and its other closely associated Apache projects of ‘Hive’ [Apa16a] and ‘Pig’ [Apa16b]. The Apache ‘Hive’ project deals with the storage of relational data on the NoSQL HDFS file system which Hadoop uses, while ‘Pig’ is the Swiss army knife of real world data processing on Hadoop. Pig provides job control for MapReduce jobs run on Hadoop. Because of the way the functional MapReduce programming works, a typical task will be split into a number of dependent jobs, so, for example, the data needs to be moved from the local file system onto the remote HDFS, then one or more parallel map and reduce iterations will be used to transform the data, using collation and sorting where

necessary. Then the task will finish by collecting up all the outputs and writing the result back to the local file system. For a full description of MapReduce and distributed file systems see [DG08].

Where real-time processing with MapReduce is required, the “Spark”

[Apa16c], “Storm” [Apa14a] and “Shark”24projects adapt the original job-based method- ology to a variation that runs in memory with a more immediate turnaround time for jobs. These projects are all interesting in that they are second generation Hadoop de- velopments which are spin-offs from attempts to adapt the technology using real world experiences. In a similar thread, data mining and machine learning are significant areas of interest in “Big Data Analytics” and one that is being addressed by projects such as “OpenML” [Rij+13] for machine learning and “Massive Online Analysis” [Uni14] for data stream mining. These projects are a potential developmental next step in spatial analytics and real-time spatial processing if the techniques and experience are transfer- able to the spatial analysis domain.

The following quote about the applicability of parallel cloud computing techniques is relevant though:

“MapReduce is not always the best algorithm. MapReduce is a profound idea: taking a simple functional programming operation and applying it, in parallel, to gigabytes or terabytes of data. But there is a price. For that par- allelism, you need to have each MR operation independent from all others. If you need to know everything that has gone before, you have a problem.”

(Apache Hadoop Wiki: http:// wiki.apache.org/ hadoop/ HadoopIsNot) In his masters thesis, Nathan Kerr [Ker09] analyses the performance of a “Spatial Hadoop” system and compares it against an alternative system built using message passing. While the spatial Hadoop system comes second in this performance test, util- ity computing is now so widespread and cheap to use that its value cannot be under- estimated. A full description of utility and warehouse scale computing can be found in Chapter 6 of [HP11], where the economies of scale and use of cheap unreliable hardware that make today’s cloud computing possible are analysed. Much of the in- formation behind this chapter comes from a release of information by Google on their shipping container computers in 2006, so it is generally assumed that the current state 24Originally a separate project, Shark is now an Apache Spark module called Spark SQL. The original Berkeley project is

3.8. Computing Architectures 93 of the art is a long way ahead of this. Conversations with current Google employ- ees would seem to support this assumption. The Apache “MESOS” project [Apa14b] comes from the “The Datacenter Needs an Operating System” work of Zaharia et al. [Zah+11] where they highlight the need for effective job scheduling. At the scale of a warehouse computer, even a 1% improvement in efficiency is significant. The work of the Berkeley AMP Lab has led to these computing architectures becoming mainstream as the technology has filtered down from the cloud computing providers where it has been tried and tested and proved to work. The main attraction of this utility computing is its ‘Pay as You Go’ nature, otherwise termed ‘Infinite Provisioning’. This means that, for a piece of analytical work requiring more computing power than is available to an organisation, it is cost-effective to pay for only the compute power required to accomplish the task.25