PRAISE FOR UNDERSTANDING BIG DATA SCALABILITY

“This book is useful to anyone who works with data and wants to learn more about scaling. Cory helps you understand what causes databases to slow down as data volumes grow over time. He then reviews a number of strategies that you have at your disposal to manage the growth, including software and database tuning, hardware upgrades, read-replication, and ultimately horizontal partitioning of data.”

—Dan Lynn, cofounder of FullContact

“Understanding Big Data Scalability presents the fundamentals of scaling databases from a single node to large clusters. It provides a practical explanation of what Big Data systems are, and the fundamental issues to consider when optimizing for performance and scalability. Cory draws on his many years of database experience to explain the issues involved in working with data sets that can no longer be handled with single monolithic relational databases.

“When transitioning from a traditional relational database deployment, it is tempting to ignore traditional database discipline regarding data modeling and data integrity. Much of this has been motivated by the proliferation of schema-less NoSQL databases. In spite of this trend, Cory shows why it is still important to carefully structure your data to maintain data integrity and allow sharding in such a way as to avoid costly distributed scan/shuffle operations. He discusses a practical approach to this called relational sharding. This is a commonsense method that avoids the pitfalls of black-box sharding. Cory’s approach is particularly relevant now that relational data models are making a comeback via SQL interfaces to popular NoSQL databases and Hadoop distributions.

“Understanding Big Data Scalability addresses practical problems in Big Data processing systems using real-life examples. This book should be especially useful to database practitioners new to the process of scaling a database beyond a traditional single-node deployment.”

“Software is like magic, or so many people think. In practice, many fundamental principles of computation are constrained by immutable principles derived from the laws of physics (think transistors). One such limitation is easily understood by analyzing the trade-offs between horizontal and vertical scaling. Cory does a great job justifying the inescapable need for distributed systems, while also acknowledging their fundamental pitfalls. Good read for those who are trying to understand databases from first principles, rather than by analogy.”

—Ivan Bercovich, senior director of engineering, FindTheBest.com

“My first database project was in 1982. Since then, Moore’s law has expressed itself in just about every corner of technology—except database and database management. I’d argue alongside Cory that a significant part of the database ‘innovations’ we have experienced since then have been mostly an illusion. Can this be true? Really? If there is just a tinge of doubt in your mind, open these pages—Cory wrote this book for you.”

—Matthew Rockwell, systems design and architecture consultant

“I work with customers across the spectrum—from small start-ups to large enterprises— and they all have one thing in common: concerns and questions about how to handle database growth as their user base expands. When they have complex database scaling issues, I send them to Cory Isaacson and this book illustrates why. Not only can Cory explain the ‘whys’ behind their issues in clear language with relevant examples and analogies, but he also has the in-depth knowledge to handle the ‘hows’ in addressing those problems.”

UNDERSTANDING BIG DATA SCALABILITY

BIG DATA SCALABILITY SERIES

PART I

Cory Isaacson

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the United States, please contact [email protected].

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-13-359870-4 ISBN-10: 0-13-359870-5

Second digital release, August 2014

Executive Editor: Bernard Goodwin
Development Editor: Chris Zahn
Managing Editor: John Fuller
Full-Service Production Manager: Julie B. Nahil
Project Editor: Anna Popick
Copy Editor: Anna Popick
Indexer: Jack Lewis
Proofreader: Anna Popick
Editorial Assistant: Michelle Housley
Compositor: Anna Popick

CONTENTS

Preface
About the Author

Chapter 1: Introduction
  What You Will Learn
  The Challenge of Big Data
  Today’s Big Data Explosion
  Managing and Capitalizing on the Current Data Boom
  Your Role as a Data Architect
  The Acceleration of Big Data Innovation
  Background for This Book
  Why the Focus on Database Sharding?
  Summary

Chapter 2: Why Databases Slow Down
  The Database Slowdown Curve
  A Hard-Won Lesson
  The Root Cause
  The Lesson Learned
  The Enemies of Database Performance
  Enemy #1: Table Scans
  Enemy #2: Slow Writes
  Enemy #3: Concurrency Contention
  Enemy #4: Missing Indexes
  How to Identify Database Slowdown Issues
  Summary

Chapter 3: What Is Big Data?
  What Is Big Data Anyhow?
  A Formal Definition for Big Data
  A Practical Big Data Definition
  Sources of Big Data
  The Advent of the Search Engine
  The Rise of Social Networks
  Introducing the Social Network Application
  The Adoption of the Smartphone and Tablets
  Traditional Big Data Sources
  Future Big Data Generators
  Summary

Chapter 4: Big Data in the Real World
  Some Real-World Examples of Big Data
  The Big Data Challenge
  Adopting a Big Data Streaming Approach
  The Big Data Repository
  Positioned for Rapid Growth
  Social Point
  Scaling from the Start
  The Current Cluster
  The Database Cluster Payoff
  Summary

Chapter 5: Scaling Your Application
  The Goals of a Scalable Application Platform
  The Excitement of a High-Growth Success
  Application Scalability Fundamentals
  Detecting Performance Bottlenecks
  Scalability and Reliability
  Scaling Up
  Scaling Out
  A Typical Online Application Architecture
  The Load Balancer Tier
  The Application Server Tier
  The Database Tier
  Analytics Application Architectures
  Scaling an Analytics Application
  How to Scale a Traditional Online Application
  Load Balancer Tier
  Application Server Tier
  Database Tier
  Summary

Chapter 6: When to Scale Your Database
  The Last Mile of Application Scalability
  How Do You Know When to Scale Your Database?
  Options for Increasing Database Performance
  Performance Optimization on Your Monolithic Database
  Vertical Scaling
  Read Scaling
  Implementing a Full Big Data Scalability Environment
  Indications of the Need for Scale
  Slow Read Queries
  Slow Database Writes
  Summary

Chapter 7: All Data Is Relational
  Relational Data Overview
  The Meaning of Data
  Relationships Matter
  Why Data Modelling Is Critical to Success

Chapter 8: It’s All About Sharding
  Sharding: The Ultimate Answer to Database Slowdown
  The Laws of Databases
  Sharding Defined
  Black-Box Sharding
  Relational Sharding
  Summary

Chapter 9: Scaling Big Data: The Endgame
  The Game of Big Data Scalability
  The Ideal Big Data Infrastructure
  Scaling Big Data Theory
  The Good, the Bad . . .
  . . . And the Ugly
  The Big Data Endgame
  Data Locality
  Summary

PREFACE

The Big Data Scalability series is a comprehensive, four-part volume containing information on many facets of database performance and scalability. This book, Understanding Big Data Scalability, is the first book in the series.

There is no doubt that Big Data and scalability are some of the hottest and most important topics in today’s fast-growing applications. Not only is data constantly growing, the rate at which we accumulate data from a variety of sources is accelerating at a dramatic pace.

Understanding Big Data Scalability provides the foundation and important basic concepts for learning how scaling a Big Data infrastructure operates, and covers the most important governing factors for implementing a scalable and dynamic Big Data cluster. Later books in the series will delve into more technical detail, including the various data architectures available, optimization techniques, innovative ways to model and process data, and the types of database management system (DBMS) engines that can be used.

This book (and series) is targeted at application architects, database architects, and those with a need to scale their database tier. It is oriented toward high-volume applications.

While the subject of Big Data analytics is covered to some degree in the context of how it fits with typical online and business applications, this is not a book that concentrates exclusively on Hadoop or MapReduce batch-processing analytics. There are many books on this subject, and rather than writing another possibly redundant text, I decided to focus on the common applications we deal with every day. No doubt, these applications will require analytics capabilities, including MapReduce batch-processing, so where I discuss this approach I do so in the context of how it fits into a typical application infrastructure. In my experience, our “common” applications today are experiencing unprecedented data growth, often from multiple sources. They are driving a need for real-time processing at rates that would have been inconceivable just a short time ago. Therefore, I hope to provide you with a different perspective—one that has not been thoroughly covered by other authors.

Understanding Big Data Scalability includes the following chapters:

• Chapter 1, “Introduction”: This initial chapter goes into far more detail on the purpose of the book, the origins of Big Data, and what is to be covered.
• Chapter 2, “Why Databases Slow Down”: Simply put, today’s typical monolithic databases frequently experience performance issues. By exploring the causes of this performance degradation, you will understand the basic principles involved that drive the need for Big Data scalability—as well as the foundation for knowing how to properly implement a scalable data architecture.
• Chapter 3, “What Is Big Data?”: It is important to establish a good working definition for Big Data and how it affects your own applications.
• Chapter 4, “Big Data in the Real World”: In this chapter I cover two actual case studies of common applications that had a need to scale. You will see how these solutions were implemented—information that you can use to formulate your own Big Data scalability strategy.
• Chapter 5, “Scaling Your Application”: Any scalability strategy includes scaling all of your application components. That is covered here, including the various tiers of a typical application and what is involved in scaling each.
• Chapter 6, “When to Scale Your Database”: Scaling your data tier is an important decision with large consequences in any application environment. It is important to know when you need to scale your database, and what factors drive scalability; these items are covered in this chapter.
• Chapter 7, “All Data Is Relational”: A fundamental concept that affects all application data is data relationships. This is true regardless of the type of DBMS engine you are using, and in fact it can be even more important with non-relational databases. Thus I provide a brief overview of the importance of data relationships and how they pervade virtually everything to do with working with data.
• Chapter 8, “It’s All About Sharding”: In this chapter you will learn why virtually every database scalability implementation relies on some form of database sharding. You will also learn the various approaches to sharding, and the strengths and weaknesses of each.
• Chapter 9, “Scaling Big Data: The Endgame”: In this final chapter of Part I of the series, I cover the most important objectives of a scalable Big Data implementation. These objectives will serve as the most important guidelines for ensuring your architecture and implementation are both successful and high performance.

There is a companion Web site for the entire series, with supplementary blog posts, forum discussions, and other relevant information related to the Big Data Scalability Series: www.bigdatascalability.com.

I hope you find this book informative and, most of all, useful, generating practical results in your own Big Data implementations. I am always interested in hearing feedback from readers, as well; you can reach me through the Big Data Scalability Series Web site noted above.

ABOUT THE AUTHOR

Cory Isaacson is CEO/CTO of CodeFutures Corporation, maker of dbShards, a leading database scalability suite providing a true shared-nothing architecture for relational databases. Cory has authored numerous articles in a variety of publications, including SOA Magazine and Database Trends and Applications. He is a regular columnist for JAXenter magazine, and the author of the book Software Pipelines and SOA (Addison-Wesley, 2009). Cory has more than twenty-five years’ experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory has spoken at hundreds of public events and seminars, and has helped numerous organizations address the real-world challenges of database performance and scalability—effectively applying Big Data technologies to a variety of fields, including social networking, mobile applications, gaming, and high-volume transaction systems. In his prior position as president of Rogue Wave Software, he actively led the company back to a position of profitable growth, culminating in a successful acquisition by a leading private equity firm. Cory can be reached through the Big Data Scalability Series Web site: www.bigdatascalability.com.

CHAPTER 1: INTRODUCTION

Big Data Scalability is a comprehensive, four-part volume containing information on many facets of database performance and scalability. I have found through years of experience that in order to master any aspect of software, databases, or computer science, the most important thing is a solid foundation and understanding of the fundamental concepts. Once you have this knowledge, you can apply it to solve real-world problems with your own database tier.

This series is primarily focused on how to apply Big Data scalability to common application problems with the wide selection of DBMS engines available today. As such, it is less focused on pure analytics Big Data problems with engines like Apache Hadoop (there are many other books dedicated to that), and more directed toward taking advantage of Big Data scalability with popular DBMS engines used in enterprise and online applications.

WHAT YOU WILL LEARN

This part of the series will cover the following critical topics for building a high-performance, scalable, and reliable database tier:

• The primary causes of database performance issues
• What Big Data is
• Options for scaling your application
• When and how to scale your database
• The fundamentals of data distribution (database sharding)

THE CHALLENGE OF BIG DATA

Have you ever experienced a database performance problem? It’s a silly question, I know. Everyone who has worked with databases for any amount of time has seen performance issues. In fact, it’s so common that database performance is almost an oxymoron. The very nature of database technology is to get slower over time, sometimes dramatically so. Why? Because databases only tend to get bigger through increased usage, and of course with more data the database management system (DBMS) has to do more work to keep up with application demands. Add to the mix an explosive high-transaction volume and you have the perfect recipe for something we all dread: slow database performance. In fact, sometimes it’s downright horrible database performance.

What about the memorable experience of a “database down” on a live production system? This can easily be your worst day on the job. Sometimes, when you get that call, you wish you could just forget it and not even go to work. Not only are big, growing databases hard to keep fast, they are very painful to repair when something bad happens. I’ve seen applications that were down for days due to extended database recovery times, and I’m sure you have heard of some notorious all-too-public examples of this as well. So an all-important part of a Big Data success is, in fact, planning for failure—ensuring you are prepared, with a robust, high-availability (HA) runtime environment and a solid disaster recovery (DR) plan for a worst-case scenario (they do happen).

What can you do about it? That’s what this book is all about: what you can do to make your database tier fast, scalable, and reliable, delivering the performance that your application needs. The focus of this portion of the book is not on any particular DBMS or database technology, but rather the answers to these questions—questions that apply to any database and any DBMS technology:

• What are the primary causes of database slow-down?
• How do you keep your database tier reliable and operational (especially in demanding, 24X7 environments)?
• What are the best ways to scale your database to accommodate extreme growth in transaction volume and data sizes?
• How do you pick the best database technology (often technologies) for your application needs?

The answers to these questions and more are covered throughout the text. The aim is to arm you with the important concepts and facts common to all DBMS types, so that you can make the best decisions possible in managing your own data explosion, and come out on top.

Please note that in subsequent parts of the series I will provide a tour and overview of many common and leading-edge DBMS engines, both conceptually by category and as specific solutions you can use in your Big Data arsenal. With this information, combined with the fundamentals covered in Part I, my hope is to assist you in preparing for your own Big Data adventure, one that brings much success and reward.

TODAY’S BIG DATA EXPLOSION

Next let’s look at the drivers and scope of Big Data.

MANAGING AND CAPITALIZING ON THE CURRENT DATA BOOM

We are living in the midst of a data explosion, a true boom in databases and database technology, the likes of which the world has never seen. Since you are reading this book, I assume you have an interest in database performance and scalability, the same as I do. Figure 1.1 illustrates the Big Data explosion behind the current data boom, and how critical it is for us to be able to extract meaning from all of this data.

Figure 1.1 The Big Data explosion

What do I mean by a data boom? Given that information is often the most valuable commodity in today’s tech-centric world, this means that we as data professionals hold the keys to the kingdom. Not only is data getting more massive each and every year, but the rate at which data is being generated is itself accelerating. It’s tough enough to figure out how to capture and manage all that data, but more importantly we need to identify the right set of architectural decisions, tools, and capabilities to allow our organizations to capitalize on that data. In other words, giving meaning to raw data is just as important as the collection, reliability, and management of Big Data environments.

The causes of the explosion of database data are not hard to find:

• The advent of the Internet and the World Wide Web has generated exponential growth in the global user community—users with ever-expanding access to computing power and bandwidth.
• The interaction of these users with Internet applications has resulted in unprecedented levels of data and transaction volumes.
• The shift to online advertising supported by the likes of Google, Yahoo, and others is a key driver in the data boom we are seeing today.
• The overall expansion of the worldwide economy has spurred massive data growth for traditional commerce (e.g., increased airline travel, international purchases, online products, etc.).
• The core social networks (e.g., Facebook, Twitter, LinkedIn, and now Google+), by their very nature, have generated massive new ways for people to communicate and interact, resulting in correspondingly large data sets and transaction volume.
• Many specialized social networks have also arisen—everything from match-making sites to special interest groups, and even “buy-sell” applications that have generated their own micro-economies.
• An entirely new breed of social network applications has been spawned, leveraging the inter-connection of social network users in fascinating ways, driving exponential growth in application volume, again with huge transaction volumes and data sizes (sometimes virtually overnight success stories).
• Web- and advertising-analytics applications abound, crawling and analyzing virtually every aspect of the user interaction described above, again resulting in massive data sets with intense database access needs.
• An entirely new breed of chatter trend analytics applications has emerged, analyzing things like Twitter tweets, Facebook chats, and so on, requiring massive levels of data storage and access.
• Last, the world has gone mobile. In fact, in burgeoning economies and established countries alike, smartphones and tablets are by far the most readily available, high-growth, and commonly used communication vehicle for much of the world’s population, generating a nearly incomprehensible stream of data, transactions, application interaction, and messaging volume (with no end in sight).

Any one of the factors above would, in itself, have created unprecedented growth in databases, but taken collectively the impact is truly awe-inspiring and overwhelming. I never cease to be amazed by the number of extremely bright people I meet who have an ever-widening set of ideas—ideas that are building new businesses that are further fueling the growth of databases to support these fast-growing industries and sectors.

YOUR ROLE AS A DATA ARCHITECT

For those of us who are data architects, this is the most exciting time ever. Make no mistake about it: You are the key to the success of this new environment. For without the talent and dedication of the individual data architect to find solutions to address today’s data explosion, none of these ventures would succeed.

If you are a database architect or database administrator (DBA)—one who is driven to deliver the fastest and most scalable database platform possible for your organization—this book is addressed to you. In the end, it is your ability and intelligence that will solve these problems for real; all the technology in the world can at best assist you to do the job.

THE ACCELERATION OF BIG DATA INNOVATION

The phenomenon of the Big Data explosion has not only yielded huge growth in data, but also has spurred ingenious innovation to conquer the Big Data challenge. The sheer number of new database platforms and DBMS engines introduced in the past few years is mind-boggling. Daunting as it may seem to stay current with this rate of technological advancement, the innovation itself is nothing but good for the professional data architect. This investment in Big Data solutions has yielded an incredibly wide array of technology options for you to consider.

While the sheer rate of innovation has created an environment that is often nothing short of exhilarating (at least for a database fanatic like me), it can at times be intimidating as well.

Therefore, knowing how to evaluate these technologies so you can pick the right options for your application is critical. In fact, it’s a vital skill you (and the companies or organizations you support) cannot afford to be without. A portion of this series is devoted to a review of the various types of data platforms available, explaining typical use cases, available solutions, and how and where to apply them. But more importantly, the underlying concepts—applicable to all DBMS engines— are covered, enabling you to make informed evaluations and decisions for your specific application requirements.

I firmly believe (and have found through hard-won experience) that the more you understand the fundamentals of Big Data platforms and technologies, the better you will be at implementing truly successful database environments, taming whatever Big Data explosion you are in the midst of.

BACKGROUND FOR THIS BOOK

My favorite subject is databases—pretty much everything about them, from database performance to management and of course scalability most of all. You might say I’m a bit of a fanatic about it (okay, maybe I’m an extreme fanatic), but I wouldn’t do it if I didn’t love the field so much. I have been working intensively with all types of database technologies throughout my career. In one of my first companies, we built a successful independent database and application development firm, beginning with Sybase and expanding into other DBMS engines after that. More recently I ran Rogue Wave Software, a C++ tools vendor with a large concentration of Wall Street customers. While working with these customers, which included some of the top architects in the world, I saw a clear trend that spurred my interest even further: No matter how well an application could be scaled, the database tier was always the bottleneck. In fact, I learned that the database tier forms the last mile of application scalability, being at the very foundation of all functionality. The message was clear: Find a solution to that problem, and the potential was huge.

I have experienced the very best (and the very, very worst) that database management systems have to offer, in the trenches and firsthand. I’ve had the fortune and privilege of serving as the architect and manager for many mission-critical applications, from the advent of client-server technology to the Big Data systems of today. In addition to having lived through just about every database nightmare you can imagine, I’ve also seen the results when the database tier is performing outrageously well. I’ve seen how hard great software teams work to conquer these problems; there are few careers that require as much intelligence, persistence, and long, demanding hours. This has given me a profound respect for sharp technologists that conquer Big Data problems, the engines that are the very backbone of every application, not to mention the foundation of much of today’s economic growth. In fact, it’s obvious that the database explosion of the information age is upon us, and it’s likely to continue for many years to come as new and clever business opportunities arise to capitalize on this first-ever phenomenon encountered in world history.

Through this extensive and often intense experience, I’ve gained many insights over the years. My intent in writing this book is to share this experience with a wide audience, with the hope that you can avoid many of the common pitfalls I have seen repeated over and over. The book is written from a practical point of view, with what I believe are the salient points required to make the database tier of an application really cook. Many of the techniques can be applied easily, without the need to purchase or implement new technology, while others are intended as guides for taking full advantage of the latest and greatest database capabilities emerging from the fantastic technical advances made available in the past few years.

As CEO and CTO of CodeFutures (the makers of dbShards and AgilData technology), I have had the pleasure of working with the sharpest technical teams in my career—not just within our own engineering team, but also the customers that demanded the absolute best in database performance and reliability. We have worked with some of the world’s largest social networking applications, with phenomenal data growth and transaction volumes. Frankly, in the early days of our technology, we lived through a few genuinely hair-raising experiences, but with hard work in partnership with our customers, we always were able to succeed, even against the toughest challenges. More important, the wide array of applications where we have applied database sharding resulted in a broad range of experience, with large data sets and transaction volumes that were unimaginable just a few years ago. Many of these applications experienced extreme growth in transaction volume, an extraordinary thing to behold (to put it mildly). For example, watching a social game grow from nothing to over 1 million daily users in a few short months is a sight to see, and when you are part of the team responsible for ensuring the database platform handles the load, continuously running 24X7, it’s an experience that really grabs your attention (and can result in some very late nights!).

As a disclaimer right up front, many of the examples and experiences are derived from our work with our own products. dbShards, and our newest offering, AgilData, are most definitely commercial products, and of course part of my purpose in writing this book is to gain more attention for what we deliver. Having said that, I have worked hard to keep the book balanced, concentrating on the important facts that drive database scalability irrespective of a particular DBMS engine or technology. In fact, the book discloses a lot of the “tricks” and designs we use to deliver such scalable performance, with the hope that you will find them beneficial and useful in your own database environment. And don’t worry, we have new capabilities coming out with our advanced AgilData technology that will blow the world away. Who knows, they may provide material for future books in this series.

WHY THE FOCUS ON DATABASE SHARDING?

You will notice as you read this book that much of the focus is on database sharding, a technique and architecture for horizontal shared-nothing partitioning of database data across independent nodes or servers. From extensive experience and research, we have found that, regardless of what DBMS you use, database sharding is the only effective means for scaling a database. This applies to all types of databases, including the traditional relational database management system (RDBMS), NoSQL platforms, Big Data engines, Cloud databases, or databases of the so-called NewSQL paradigm. If you look closely at any of these offerings, particularly if they advertise a scalable platform, you will see that some sort of horizontal partitioning of data across nodes is used. Thus, database sharding is the answer, used in one way or another for virtually every leading database platform. Figure 1.2 shows the concept of database sharding, breaking a large monolithic database into multiple, smaller, sharded database instances across multiple servers.

Figure 1.2 Database sharding
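To make the idea concrete, here is a minimal sketch in Python of routing rows to shards by hashing a shard key. It is an illustration only, not how dbShards or any particular product works: the class name `ShardedStore`, the choice of SHA-256, the simple modulo placement, and the in-memory dictionaries standing in for independent database servers are all assumptions made for this example.

```python
import hashlib


class ShardedStore:
    """Toy shared-nothing store: each shard is an independent dict.

    In a real deployment each shard would be a separate database
    instance on its own server; dicts are a stand-in (assumption).
    """

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, shard_key):
        # Use a stable hash so the same key maps to the same shard
        # on every run (Python's built-in hash() is salted for strings).
        digest = hashlib.sha256(str(shard_key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, shard_key, row):
        # A write touches exactly one shard.
        self.shards[self.shard_for(shard_key)][shard_key] = row

    def get(self, shard_key):
        # A read by shard key also touches exactly one shard.
        return self.shards[self.shard_for(shard_key)].get(shard_key)


# Hypothetical data: 1,000 users spread across 4 shards.
store = ShardedStore(shard_count=4)
for user_id in range(1000):
    store.put(user_id, {"user_id": user_id, "name": f"user{user_id}"})
```

Because every row for a given key lives on exactly one shard, a lookup by shard key involves a single node and shares nothing with the others. One limitation of this simple modulo placement is that changing the number of shards remaps most keys, which is why real systems use more careful placement schemes.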

Understanding the underlying principles of how database sharding works is the answer to controlling your own destiny regarding the ultimate performance of your Big Data implementation. The basic concepts behind sharding and how it works are not difficult, and once you grasp them you can implement amazing levels of scalability. Ultimately this book is about making you successful with your database tier, and the more you know about this concept, the more power you will have over defining, predicting, optimizing, and controlling the performance of your own databases.

NOTE

Some vendors have downplayed the importance of database sharding, or have outright stated that sharding is an outmoded or useless idea. Nothing could be further from the truth, and once you understand the full meaning and applicability of the term, you will see that these very same vendors are using database sharding in their own technologies. Much more on the topic will be covered throughout the series.

SUMMARY

I hope you find the ensuing pages interesting, informative, and most of all useful. If you gain even a single new idea that results in improved database performance for your application, I will have achieved what I set out to do with this work.


Index 101

INDEX

Numbers

24X7 availability. See also high-availability (HA)
  author’s background with databases and, 7
  in ideal infrastructure for Big Data, 90
  redundancy and, 45

A

Address book management, 34. See also FullContact
AgilData, 7
Agile nature, of data structures, 90
Analytics, challenges of managing Big Data, 26–27
Analytics applications
  architecture of, 50–53
  scaling, 53
Apache Cassandra, 36
Apache server logs, 50–51
Apache Storm framework, 35–36
AppDynamics, system monitoring tools, 65
Application logic, separating from user interface, 49
Application scalability
  detecting performance bottlenecks, 42–44
  excitement at success of high-growth application, 41–42
  fundamentals of, 42
  goals of scalable application platforms, 39–41
  reliability and, 44–45
  scaling analytics applications, 53
  scaling out (horizontal scaling), 45
  scaling traditional online applications, 53–55
  scaling up (vertical scaling), 45
  summary, 55
  understanding analytics architecture and, 50–53
  understanding application architecture and, 46–50
Application server tier
  scaling traditional online applications, 54
  in three-tier architecture, 49
Application servers
  implementing horizontal scaling in Big Data infrastructure, 63–64
  round robin routing and, 48
  sticky routing and, 48–49
Applications/apps
  mobile device, 31
  social networking, 30
  understanding analytics architecture, 50–53
  understanding application architecture, 46–50
Architecture
  understanding analytics architecture, 50–53
  understanding application architecture, 46–50

B

Big Data streaming. See streaming
Big Data, what it is
  contemporary sources, 28–31
  formal definition, 25–27
  future generators of, 31–32
  overview of, 24
  practical definition, 27–28
  summary, 32
  traditional sources, 31
Bigtable parallel framework, 29
Black-box sharding
  overview of, 83–86
  pros/cons of distributed data, 93
Black-box testing, 84
Bottlenecks
  database tier as, 6
  detecting, 42–44


B-trees
  slow writes as cause of database slowdown, 16–17
  use in DBMS engines, 79

C

Cache/caching
  high CPU usage as indicator of performance issue, 22
  memory uses of DBMSs, 60
Cassandra (Apache), 36
Cloud
  cost management and, 40
  pros/cons of, 41
  Social Point hosting games in, 36
  uses of database sharding and, 8
Clusters. See database clusters
Column databases, 80
Concurrency
  pros/cons of distributed data, 92–93
  read scaling and, 62–63
Concurrency contention
  causes of database slowdown, 17–20
  connection limits and, 19–20
  overview of, 17
  in single-threaded NoSQL engines, 19
  in traditional RDBMSs, 18
Connection limits, concurrency contention and, 19–20
Consistent hashing
  in black-box sharding, 84
  database sharding and, 82
Contact Enrichment technique, in FullContact, 34–35
Contact management, 34. See also FullContact
Contention. See concurrency contention
CPUs
  detecting performance bottlenecks, 42–44
  distributed operations and, 93
  high usage as indicator of performance issue, 22
  system metrics in detecting issues, 65–66
  upgrading combined with vertical scaling, 59–60
Crawling/Web crawling
  causes of data explosion, 4
  challenges of managing Big Data, 26
  federated search engine used by FullContact for, 35
  search engines and, 29
Cross-shard, database clusters, 13
Curation (management), challenges of managing Big Data, 26

D

Data, relational. See relational data
Data architects
  role in managing Big Data, 5
  use of data modeling by, 75
Data de-normalization, 97–98
Data explosion, 3–4
Data locality, modeling database tier and, 98–99
Data modeling
  critical to success, 74–76
  data locality and, 98–99
  game application and, 97–98
  strategies for the scaling endgame, 96–98
Data sets, limits on size of, 25–26
Data store (repository), in FullContact, 36
Data structures, agile nature of, 90
Database administrator (DBA), role in managing Big Data, 5
Database clusters
  dynamic clusters in ideal infrastructure for Big Data, 90
  high-availability (HA) and, 55
  implementing horizontal scaling and, 64
  performance bottlenecks and, 13
  rebalancing, 94–95
  required for working with Big Data, 27
  scaling up database tier, 57
Database management system (DBMS)
  author’s background with, 6–7
  black-box sharding and, 86
  challenge of Big Data, 2–3
  choices in, 69–70
  CPU upgrade and, 60
  data modeling and, 75
  lesson relative to slowdown, 11–13
  memory upgrade and, 60–61
  performance degradation cycle (slowdown), 79
  relational. See relational database management system (RDBMS)


Database scalability
  database tier and, 56–57
  determining when to scale, 57–58
  horizontal scaling, 63–64
  indicators of need, 65–66
  options for improving database performance, 58–59
  overview of, 56
  read scaling, 62–63
  slow reads indicating need, 66–67
  slow writes indicating need, 67–68
  upgrades combined with vertical scaling, 59–62
  vertical scaling, 59
Database sharding
  black-box sharding, 83–86
  definition of, 80–83
  laws of databases and, 78–80
  limitations of manual approach to, 80–81
  overview of, 77
  partitioning and, 54–55
  reasons for focus on, 8–9
  relational sharding, 86–88
  as response to database slowdown, 77–78
  in Social Point, 37–38
  strategies for the scaling endgame, 96–98
  summary, 88
  virtual sharding, 38
Database tier
  as bottleneck, 6
  in ideal infrastructure for Big Data, 90
  modeling, 98–99
  scaling traditional online applications, 54–55
  in three-tier architecture, 50
Databases
  author’s background with, 6–8
  challenge of Big Data, 2–3
  clustering. See database clusters
  common causes of slowdown, 14
  concurrency contention causing slowdown, 17–20
  future options for tapping into storage data, 31–32
  identifying issues that cause slowdown, 21–22
  indicators of need for scaling, 65–66
  laws of, 78–80
  lesson relative to slowdown, 11–13
  missing index causing slowdown, 20–21
  options for improving performance, 58–59
  scaling. See database scalability
  slow reads indicating need for database scaling, 66–67
  slow writes causing slowdown, 16–17
  slow writes indicating need for database scaling, 67–68
  slowdown (degradation) curve, 10–11
  time requirement for updating statistics, 21
  traditional extraction of data from, 31
DBA (database administrator), role in managing Big Data, 5
DBMS. See database management system (DBMS)
DbShards, 7
Degradation cycle. See slowdown
Denial-of-service (DoS) attacks, 52
Developers, use of data modeling by, 75
Disaster recovery (DR), 2
Disk I/O
  detecting performance bottlenecks, 43–44
  indicators of performance issue, 22
  slow reads indicating need for database scaling, 66–67
  slow writes indicating need for database scaling, 67–68
  system metrics in detecting issues, 65–66
  upgrading disk drive combined with vertical scaling, 61–62
Disk reads. See reads
Disk writes. See writes
Distributed data
  database sharding and, 82
  horizontal scaling and, 46
  ideals in scaling endgame, 96
  indicator of when needed, 17
  pros/cons of distributed operations, 91–94
Distributed joins, 93–94
DoS (denial-of-service) attacks, 52
DR (disaster recovery), 2
Dragon City game, 36–37
Dynamic clusters, in ideal infrastructure for Big Data, 90


E

Elastic scaling, 40
Embarrassingly parallel applications, 53
ETL (extract/transform/load), 31

F

Facebook
  one-to-many and many-to-many relationships in, 72
  social networks as source of Big Data, 29–30
  Social Point as game provider on, 36
Failure points
  parallel computing and, 45
  scalable databases and, 55
Flexibility, of ideal infrastructure for Big Data, 90
Fragmented data, 98
FullContact
  data store (repository) in, 36
  facing challenge of Big Data, 35
  horizontal scaling in, 36
  overview of, 34–35
  streaming approach in, 35–36

G

Games
  relational sharding and, 97–98
  relationships in, 73–74
  Social Point as game provider, 36
  as source of Big Data, 30
Global economy, causes of data explosion, 4
Google, 29
Google+, 30
Graphical user interface (GUI), interacting with back-end DBMS, 12

H

HA. See high-availability (HA)
Hadoop engines, 51–52
Hash algorithms/hashing
  consistent hashing and, 82, 84
  rebalancing clusters, 95
  shard key and, 82
Healthcare monitoring, 32
High-availability (HA). See also 24X7 availability
  database clusters and, 55
  in ideal infrastructure for Big Data, 90
  implementing horizontal scaling, 64
  meeting the challenge of Big Data, 2
Horizontal scaling (scaling out)
  in database sharding, 78–80
  exceeding limits of single server and, 42
  in FullContact, 36
  implementing, 63–64
  read scaling, 62–63
  scaling applications, 45
  in Social Point, 37
“Hype cycle,” new technology and, 33–34

I

Indexes/indexing
  as de-normalized structure, 98
  memory use and, 60
  missing index as cause of database slowdown, 20–21
  read performance and, 77
  slow writes as cause of database slowdown, 16–17
  updates causing slow writes, 67–68
Infrastructure, ideal for Big Data, 90
Inherent scalability, of database tier, 53
In-memory DBMS engines, 80
InnoDB engine, 11
Innovation, meeting the challenge of Big Data, 5–6
Instagram, 31
Internet
  causes of data explosion, 4
  search engines and, 29
Internet of Things (IoT), 32
IoT (Internet of Things), 32

J

Java, data relationships in, 71
JavaScript Object Notation (JSON)
  Social Point using JSON objects, 37–38
  storing JSON structures, 75

K

Key range approach, to clustering, 95
Key-value stores, NoSQL/NewSQL, 84

L

Linear scalability
  near-linear scalability in ideal infrastructure, 90


Load, issues database cannot handle, 27–28
Load balancer tier
  round robin routing by, 48
  scaling traditional online applications, 53
  sticky routing by, 48–49
  in three-tier architecture, 47
Load testing, slowdown (degradation) curve example, 11
Locks
  concurrency contention. See concurrency contention
  lesson relative to database slowdown, 11
Logs, use of Apache server log in analytics, 50–51

M

Many-to-many relationships, 72
MapReduce
  choices in DBMS engines, 69
  definition of, 52
  use of Apache server log in analytics, 51–52
Meaning, data and, 70–72
Memory
  detecting performance bottlenecks, 43
  high usage as indicator of performance issue, 22
  system metrics in detecting issues, 65–66
  upgrading combined with vertical scaling, 60
Memory leaks, 43
Memory swapping, 43, 60–61
Metrics, analytics applications, 50
Microsoft SQL Server, 69
Mobile devices, as source of Big Data, 5, 31
MySQL
  choices in DBMS engines, 69
  slowdown (degradation) curve example, 11
  Social Point using, 37

N

Near-linear scalability, in ideal infrastructure for Big Data, 90
Nesting, black-box sharding and, 85
New Relic, system monitoring tools, 65
NewSQL. See NoSQL/NewSQL
nmon utility, checking OS statistics with, 22
NoSQL/NewSQL
  B-tree use for indexing, 79
  choices in DBMS engines, 69
  concurrency contention in single-threaded NoSQL engines, 19
  data modeling and, 75
  key-value stores, 84
  MySQL as NoSQL data store, 37
  performance issues in, 14
  shared contention model, 20

O

Object-oriented languages, data relationships and, 71
One-to-many relationships, 72
Online advertising, causes of data explosion, 4
Online applications, scaling traditional, 53–55
Operating systems (OSes)
  identifying performance issues, 21–22
  memory swapping, 43
Oracle, 69

P

Parallel computing. See also database clusters
  addition of failure points, 45
  embarrassingly parallel applications, 53
  horizontal scaling and, 46
  required for working with Big Data, 27
Partitioning
  database sharding and, 54–55, 80
  implementing horizontal scaling in Big Data infrastructure, 63–64
Performance
  challenge of Big Data, 2–3
  degradation cycle (slowdown), 79
  detecting bottlenecks, 42–44
  identifying issues, 21–22
  impact of single suboptimal query on, 13
  indexes for read performance, 77
  in NoSQL/NewSQL, 14
  options for improving, 58–59
  slowdown (degradation) curve, 10–11
  table scans causing slowdown, 14–16
Player games
  black-box sharding and, 85
  relational sharding and, 87
PostgreSQL, 69


Primary Key, in black-box sharding, 84
Programming languages, 49

Q

Queries
  impact of single suboptimal query on performance, 13
  memory uses of DBMSs, 60
  missing index as cause of database slowdown, 20–21
  slow reads indicating need for database scaling, 66–67
  table scans causing performance slowdown, 14–16
Query analyzer, 22

R

RAID (Redundant Array of Independent Disks), 61–62
RAM
  detecting performance bottlenecks, 43
  high memory usage as indicator of performance issue, 22
  memory upgrade and, 60–61
Read scaling, 62–63
Reads
  excessive disk I/O as indicator of performance issue, 22
  indexes for read performance, 77
  slow reads indicating need for database scaling, 66–67
  slowdown (degradation) curve, 10–11
  upgrading disk drive combined with vertical scaling, 61–62
Rebalancing
  clusters, 94–95
  ideals in scaling endgame, 96
Redundancy, 24X7 availability and, 45
Redundant Array of Independent Disks (RAID), 61–62
Relational data
  data modeling, 75–76
  importance of relationships, 73–74
  meaning of data, 70–72
  overview of, 69–70
  summary, 75–76
Relational database management system (RDBMS)
  concurrency contention in, 18
  data modeling and, 75
  difficulty in using with Big Data, 25–27
  relational nature of data and, 69
  relational sharding and, 86–88
Relational semantics (one-to-many, many-to-many, and scale-out), 35
Relational sharding
  game application, 97–98
  overview of, 86–88
  pros/cons of distributed data, 93
Relationships, importance in understanding data, 73–74
Reliability
  of ideal infrastructure for Big Data, 90
  scaling applications and, 44–45
Repository (data store), in FullContact, 36
Round robin routing, by load balancers, 48
Rows, table scans causing performance slowdown, 14–16

S

SANs (storage area networks), 61–62
Scaling
  database sharding and, 8–9
  database tier as last mile in, 56–57
  detecting performance bottlenecks, 42–44
  determining when to scale databases, 57–58
  excitement from high-growth success, 41–42
  fundamentals of, 42
  goals of scalable application platforms, 39–41
  horizontal scaling, 63–64
  indicators of need for, 65–66
  options for improving database performance, 58–59
  overview of, 39
  read scaling, 62–63
  reliability and, 44–45
  required for working with Big Data, 27
  scaling analytics applications, 53
  scaling out (horizontal scaling), 45
  scaling traditional online applications, 53–55
  scaling up (vertical scaling), 45
  slow reads indicating need for, 66–67
  slow writes indicating need for, 67–68


  in Social Point with shared-nothing sharding, 37
  summary, 55, 68
  understanding analytics architecture and, 50–53
  understanding application architecture and, 46–50
  upgrades combined with vertical scaling, 59–62
  vertical scaling (scaling up), 59
Scaling, endgame in
  data locality and, 98–99
  ideal circumstances, 96
  ideal infrastructure for Big Data, 90
  overview of, 89–90
  pros/cons of distributed operations, 91–94
  rebalancing and, 94–95
  sharding and data modeling strategies for, 96–98
  summary, 99–100
Scaling down, 40
Scatter-gather approach, black-box sharding and, 85
Schema-less database design, data modeling and, 75–76
Scientific data, traditional sources of Big Data, 31
Search engines
  federated search engine used by FullContact, 35
  as source of Big Data, 29
Searches, challenges of managing Big Data, 26
SELECT queries, 21
Server logs, use of Apache server log in analytics, 50–51
Service tier, application server tier and, 49
Session state, 49
Shard key. See also database sharding
  black-box sharding and, 86
  hash algorithms and, 82
  rebalancing clusters and, 95
  routing messages based on, 35
  selecting sharding strategy, 96–97
Shared contention model, 20
Shared-nothing architecture
  linear scalability and, 83
  partitioning, 8–9
  sharding and, 37
Sharing, challenges of managing Big Data, 26
Shuffle join, 93–94
Single shard server operation
  ideals in scaling endgame, 96
  pros/cons of distributed data, 91–92
Six Degrees of Separation, 72
Slowdown
  challenge of Big Data and, 2
  common causes, 14
  concurrency contention causing, 17–20
  database sharding as response to, 77–78
  identifying issues, 21–22
  indicating need for database scaling, 65–66
  lesson in, 11–13
  missing index as cause, 20–21
  overview of, 10
  performance degradation curve, 10–11
  performance degradation cycle, 79
  slow reads, 66–67
  slow writes, 16–17, 67–68
  summary, 23
  table scans causing, 14–16
Smart meters, 32
Smart phones, as source of Big Data, 31
Snapchat, 31
Social games, 30
Social networks, as source of Big Data, 4, 29–30
Social Point
  clustering in, 37–38
  overview of, 36–37
  scaling with shared-nothing sharding, 37
Software Pipelines and SOA (Isaacson), 45
Solid-state drives (SSDs)
  comparing with spinning disks, 43
  upgrading to, 22, 61–62
Sources of Big Data
  future generators, 31–32
  mobile devices, 31
  overview of, 28
  search engines, 29
  social networks, 29–30
  traditional sources (system monitoring and scientific data), 31


SQL SELECT query, 15–16
SSDs. See solid-state drives (SSDs)
State
  application server tier and, 49
  database tier and, 50
  stateless nature of load balancers, 47
Sticky routing, by load balancers, 48–49
Storage, challenges of managing Big Data, 26
Storage area networks (SANs), 61–62
Storm framework (Apache), 35–36
Streaming
  in FullContact, 35–36
  future options for tapping into storage data, 32
System metrics
  analytics applications, 50
  indicating need for database scaling, 65–66
System monitoring, traditional sources of Big Data, 31

T

Table scans
  causes of database slowdown, 14–16
  lesson relative to database slowdown, 11–13
  upgrading disk drive combined with vertical scaling, 61–62
Tablet devices, 31
Three-tier architecture
  application server tier, 49
  database tier, 50
  load balancer tier, 47–49
  overview of, 46–47
Transfer, managing Big Data, 26
Twitter
  Apache Storm framework created by, 35
  causes of data explosion, 4, 29–30

U

UI (user interface), 49
Upgrades, combining with vertical scaling, 59–62
Use cases, real world example of Big Data
  FullContact, 34–36
  overview of, 33–34
  Social Point, 36–38
  summary, 38
User interface (UI), 49

V

Vertical scaling (scaling up)
  application scalability, 45
  automating to meet growth, 40
  combining upgrades with, 59–62
  database scalability, 59
  exceeding limits of single server and, 42
Virtual sharding, in Social Point, 38
Visualization, challenges of managing Big Data, 26

W

Web crawling. See crawling/Web crawling
Web-analytics, 4
World Wide Web, 4, 29
Writes
  detecting performance bottlenecks, 43
  distributed data and, 92
  excessive disk I/O as indicator of performance issue, 22
  performance degradation curve, 10–11
  scaling up (vertical scaling) and, 45
  slow writes as cause of database slowdown, 16–17
  slow writes indicating need for database scaling, 67–68
  upgrading disk drive combined with vertical scaling, 61–62

www.bigdatascalability.com.