Solution Spotlight PREPARING A DATABASE STRATEGY FOR BIG DATA


‘Big data’ applications bring new database choices, challenges

IT industry veteran demystifies the scale up vs. scale out debate

Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave. Read this SearchDataManagement.com E-Guide to learn why big data presents new database challenges and opportunities. Discover why, according to experts, massively parallel processing (MPP) and distributed computing approaches are becoming a popular choice to handle growing data volumes.


'BIG DATA' APPLICATIONS BRING NEW DATABASE CHOICES, CHALLENGES

I started out my career as a systems programmer and database administrator, working on what was then the state of the art in the world of databases: IMS, from IBM. Companies needed somewhere to put (and sometimes even retrieve) their data, and once things had moved beyond basic file systems, databases were the way to go.

The volumes of data that had to be handled back then seem amusingly modest by the standards of today’s “big data” applications, with IBM’s 3380 disk drive able to store what seemed like a capacious 2.5 GB of data when it was launched in 1980. To put data into IMS, you needed to understand how to navigate the physical structure of the database itself, and it was a radical step indeed when IBM launched DB2 in 1983. In this new approach, programmers would write in a language called SQL and the database management system itself would figure out the best access path to the data via an “optimiser.”

I recall some of my colleagues’ deep scepticism about such dark arts, but the relational approach caught on, and within a few years the database world was a competitive place, with Ingres, Informix, Sybase and Oracle all battling it out along with IBM in the enterprise transaction processing market. A gradual awareness that such databases were all optimised for transaction processing performance allowed a further range of innovation, and the first specialist analytical databases appeared. Products from smaller companies, with odd names like Red Brick and Essbase, became available and briefly a thousand database flowers bloomed.

By the dawn of the millennium, the excitement had died down, and the database market had seen dramatic consolidation. Oracle, IBM and Microsoft dominated the landscape, having either bought out or crushed most of the competition. Object databases were snuffed out, and the database administrator beginning his or her career in 2001 could look forward to a stable world of a few databases, all based on SQL. Teradata had carved out a niche at the high-volume end, and Sybase had innovated with columnar instead of row-based storage, but the main debates in database circles were reduced to arguments over arcane revisions to the SQL standard. The database world had seemingly grown up and put on a cardigan and slippers.

How misleading that picture turned out to be. Few appreciated at the time that rapid growth in both the volume and types of data that companies collect was about to challenge the database incumbents and spawn another round of innovation. Whilst Moore’s Law was holding for processing speed, it was most decidedly not working for disk access speed. Solid-state drives helped some, but they were, and still are, very expensive. Database volumes were increasing faster than ever due primarily to the explosion of social media data and machine-generated data, such as information from sensors, point-of-sale systems, mobile phone networks, Web server logs and the like.

In 1997 (according to Winter Corp., which measures such things), the largest commercial database in the world was 7 TB in size, and that figure had only grown to about 30 TB by 2003. Yet it more than tripled to 100 TB by 2005, and by 2008 the first petabyte-sized database appeared. In other words, the largest databases increased tenfold in size between 2005 and 2008. The strains of analysing such volumes of data started to stretch and exceed the capacity of the mainstream databases.
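The growth figures above can be checked with a few lines of arithmetic. The sketch below assumes that 1 PB is taken as 1,000 TB and that the cited sizes are exact; the names `SIZES_TB` and `growth` are illustrative only.

```python
# Largest commercial database sizes cited above (year -> size in TB).
# Assumption: 1 PB is taken as 1,000 TB.
SIZES_TB = {1997: 7, 2003: 30, 2005: 100, 2008: 1000}


def growth(start_year, end_year, sizes=SIZES_TB):
    """Return (growth factor, compound annual growth rate) between two years."""
    factor = sizes[end_year] / sizes[start_year]
    years = end_year - start_year
    annual_rate = factor ** (1 / years) - 1
    return factor, annual_rate


if __name__ == "__main__":
    years = sorted(SIZES_TB)
    for start, end in zip(years, years[1:]):
        factor, rate = growth(start, end)
        print(f"{start}-{end}: x{factor:.1f} overall, about {rate:.0%} per year")
```

On these figures, the implied annual growth rate rises from roughly 27% between 1997 and 2003 to roughly 115% between 2005 and 2008, which is the acceleration the paragraph describes.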

ENTER MPP, COLUMNAR AND HADOOP

The database industry has responded in a number of ways. Throwing hardware at the problem was one way. Massively parallel processing (MPP) databases allow database loads to be split amongst many processors. The columnar data structure pioneered by Sybase turned out to be well suited to analytical processing workloads, and a range of new analytical databases sprang up, often combining columnar and MPP approaches. The giant database vendors responded with either their own versions or by simply purchasing upstart rivals. For example, Oracle brought out its Exadata offering, IBM purchased Netezza and Microsoft bought DATAllegro. There is also a range of independent alternatives remaining on the market.
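One reason a columnar layout suits analytical work is that an aggregate over a single column only needs that column's values, whereas a row store has to visit every row record. The toy sketch below illustrates the difference in plain Python; the table, column names and values are made up, and it says nothing about how any particular product stores data.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table, column names and values are made up for illustration.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "APAC", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 310.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def total_row_store(rows, column):
    """Sum a column by visiting every row record (a row store reads whole rows)."""
    return sum(row[column] for row in rows)


def total_column_store(columns, column):
    """Sum a column by reading only that column's list of values."""
    return sum(columns[column])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_row_store(ROWS, "amount"))
    print(total_column_store(columns, "amount"))
```

Both functions return the same total; the point of the sketch is only that the columnar version never touches `order_id` or `region`.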

However, the big data challenge is of such a scale that more radical approaches have also appeared. Google, having to deal with exponentially growing Web traffic, devised an approach called MapReduce, designed to work with a massively distributed file system. That work inspired an open source technology called Hadoop, along with an associated file system called HDFS. New databases followed that spurned SQL entirely or in large part, endeavouring to allow more predictable scalability and eliminating the constraints of a fixed database schema.
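The MapReduce idea can be shown in miniature: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group. The sketch below runs all three steps in a single Python process to count words; it is not the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in one process over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big challenges", "data volumes"]))
```

In a real deployment the map and reduce calls would run on many machines against data held in a distributed file system; here they are ordinary function calls so the data flow is easy to follow.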

This “NoSQL” approach brings with it a range of issues. A generation of programmers and software products have relied on a common standard for database access, with the removal of the need to understand internal database structure allowing considerable productivity gains. Programming for big data applications is an altogether trickier affair, and IT departments that are staffed with people who understand SQL are ill-equipped to tackle the world of MapReduce programming, parallel programming and key-value databases that is starting to represent the state of the art in tackling very large data sets. There are also considerable challenges to the new database technologies in coping with high availability, guaranteed consistency and tolerance to hardware failure, things which many organisations had previously started to take for granted.

Of course, not everyone is equally affected by such developments. The big data issues are most acutely felt in certain industries, such as Web marketing and advertising, telecoms, retail and financial services, and certain government activities. Understanding the relationships between data is important in areas as diverse as fraud detection, counter-terrorism, medical research and energy metering.

However, the recent data explosion is going to make life difficult in many industries, and those companies that can adapt well and gain the ability to analyse such data will have a considerable advantage over those that lag. New skill sets are going to be needed, and these skills will be scarce. Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave.

ANDY HAYLER is co-founder and CEO of analyst firm The Information Difference and a regular keynote speaker at international conferences on MDM, data governance and data quality. He is also a respected restaurant critic and author (see www.andyhayler.com).

IT INDUSTRY VETERAN DEMYSTIFIES THE SCALE UP VS. SCALE OUT DEBATE

Choosing sides on the scale up vs. scale out debate used to be easy. Choices were few and for most organizations, it was symmetric multiprocessing (SMP) all the way – the classic scale up approach to computing.

But with the rise of commodity hardware – and more organizations looking to capitalize on the Internet-driven "big data" explosion – scaling out has become a more viable option. As a result, massively parallel processing (MPP) and distributed computing approaches are growing more popular all the time, according to Tony Iams, a senior vice president with IT analyst firm Ideas International in Port Chester, N.Y.

SearchDataManagement.com got on the phone with Iams to learn more about the longstanding scale up vs. scale out debate. Iams explained the most common uses of SMP, MPP and distributed computing and had some advice for those seeking to match database workloads with the correct architecture approach.

Could you describe the prevailing approaches to hardware architecture and how they fit into the scale up vs. scale out debate?

Tony Iams: The first is symmetric multiprocessing, or SMP, which is the classic form of scaling up. That's where you have lots of processing units inside of a single computer. And I say "processing units" because that line is blurring. It used to be "processors," but now processors have multiple cores inside of them and those cores might have a lot of threads. But the point is that all of it is inside of one enclosure, one computer system. That is the traditional approach to scaling up.

What are the other two major approaches?

Iams: Then you have scaling out and the most extreme form of that is what [is known as distributed computing, where you have many separate] computers that are cooperating to solve a problem. The third approach is massively parallel processing (MPP) and that kind of sits in between [SMP and distributed computing] in the sense that you still have lots of machines that are collaborating on solving a problem. The difference is that with massively parallel processing, there is usually some assumption that there is some sharing of something. At a minimum, you have shared management in that there may be many separate computers but you manage them as if they are a single computer. With MPP, there is also usually some form of sharing memory.
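The scale-out pattern described in this answer, many workers each handling part of a problem and a coordinator combining their results, can be sketched as follows. This is a minimal illustration under stated assumptions: threads stand in for separate machines, records are (key, value) pairs, and the function names are hypothetical rather than taken from any MPP product.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_workers):
    """Assign each (key, value) record to a worker by hashing its key."""
    parts = [[] for _ in range(n_workers)]
    for key, value in records:
        parts[hash(key) % n_workers].append((key, value))
    return parts


def local_totals(part):
    """Work done by one node: aggregate only the records it holds."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals


def scale_out_totals(records, n_workers=4):
    """Scatter partitions to workers, then merge their partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(local_totals, partition(records, n_workers))
        merged = Counter()
        for partial in partials:
            merged.update(partial)  # adds the per-key partial sums together
    return dict(merged)


if __name__ == "__main__":
    sales = [("uk", 5), ("us", 7), ("uk", 3), ("fr", 1)]
    print(scale_out_totals(sales, n_workers=3))
```

Because each worker sees only its own partition, nothing is shared until the final merge; how much is shared, and at what cost, is where real implementations differ.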

How does MPP differ from SMP in terms of sharing memory?

Iams: With SMP, by definition, all reading and writing can be done by any thread, core or processor. They can all get to the memory equally easily for reading or for writing purposes. In MPP, again there is usually some form of sharing, but depending on the implementation – and there are many different implementations of MPP – you have different compromises that you have to make in terms of who can get to what memory; whether there is a penalty for reading the memory; and whether there is a penalty for writing the memory. With MPP, users have to consider how that memory sharing works: Who can read? Who can write? The answers to those questions are going to vary significantly [depending on] the implementation.


Why do I feel like I'm hearing vendor references to MPP more often than in the past?

Iams: I think because the industry in general is trending towards scaling out. What I just explained to you is purely an architectural view. But the more important aspect of this is now matching [the computer hardware architecture] with workloads. Depending on what kind of workload you're trying to host, each of these approaches is going to make more sense or less sense. You always have to be careful about matching the right workload with the right scalability approach.

How has the process of choosing the right scalability approach changed over the years?

Iams: The rules 10 to 15 or 20 years ago were pretty clear. If you wanted to do heavy duty database processing, you wanted it to scale up and you would use SMP. That was it. End of story. The number of workloads that you would want to use with distributed computing or even MPP was extremely limited. But nowadays with the Internet and Web-based computing and all of these services that people are using on the Web like Facebook and Google – a lot of those work really well with a scale out approach, and distributed computing and MPP are starting to be applied more widely.


How should an IT organization go about matching database workloads with the right scalability approach?

Iams: If you're just talking about the classic transaction-driven database workloads that drive the day to day operations of a typical business, that still works best on SMP type systems. That is because with transactions, you're going to be writing a lot of data by definition. If you're going to write something, you have to have very efficient access to that memory, and that is why you need SMP. There is no more efficient way to access memory than with SMP.

What if the database workload is created for business intelligence or analytics purposes?

Iams: More and more database work is based on analysis, which is not necessarily writing data. In this case, you're more interested in reading the data because you're going to go through there and analyze it looking for patterns, business opportunities and trends. That is increasingly where the value is and that is not a new thing. People have been doing data warehouses and stuff like that for a long time. But now there has been an uptick in the volume of data. Internet usage is generating data, mobile phone usage is generating data and all that stuff is tracked now. That has produced an explosion in data. [Data warehouses and associated analytics projects] have to scale more than ever before,


FREE RESOURCES FOR TECHNOLOGY PROFESSIONALS

TechTarget publishes targeted technology media that address your need for information and resources for researching products, developing strategy and making cost-effective purchase decisions. Our network of technology-specific Web sites gives you access to industry experts, independent content and analysis and the Web’s largest library of vendor-provided white papers, webcasts, podcasts, videos, virtual trade shows, research reports and more, drawing on the rich R&D resources of technology providers to address market trends, challenges and solutions. Our live events and virtual seminars give you access to vendor-neutral, expert commentary and advice on the issues and challenges you face daily. Our social community IT Knowledge Exchange allows you to share real world information in real time with peers and experts.

WHAT MAKES TECHTARGET UNIQUE?

TechTarget is squarely focused on the enterprise IT space. Our team of editors and network of industry experts provide the richest, most relevant content to IT professionals and management. We leverage the immediacy of the Web, the networking and face-to-face opportunities of events and virtual events, and the ability to interact with peers, all to create compelling and actionable information for enterprise IT professionals across all industries and markets.