BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS


INTRODUCTION

Big Data applications and the Internet of Things (IoT) are changing and often improving our lives. These applications strive to be simple to use, but the technology stack required to make them work can be very complex. Enterprise applications require data to be highly available and massively scalable, but they also need to be easy to manage.

This whitepaper shows how Basho Data Platform addresses challenges in Big Data, IoT, and hybrid cloud applications.

We start with an example of how a company can use integrated services to meet its business requirements. We then outline how Basho Data Platform can enhance your enterprise application by integrating NoSQL with caching, real-time analytics, and search. Finally, we illustrate flexible deployment options.

WHY USE BASHO DATA PLATFORM?

Basho Data Platform provides a new way to build, deploy, and manage your Enterprise Applications. It integrates Riak® KV software with Apache Spark™, Redis, and Apache Solr™, and controls the replication and synchronization of data between these components, simplifying management of your applications.

BASHO DATA PLATFORM BENEFITS

Reduce complexity with integrated NoSQL databases, caching, in-memory analytics, and search components

Enhance high availability and fault tolerance across components

Integrate real-time analytics with Apache Spark™ and Riak® KV

Increase application performance with integrated Redis caching and Riak KV

Optimize search with Apache Solr™ and Riak KV



EXAMPLE

Picture an advertising company that runs an ad-exchange. In this model, they provide ad-serving, a “marketplace” of advertisements, and billing & reporting capabilities.

SERVING ADS

Let’s begin with the “core” of the business, serving advertisements.

In a world where impressions drive revenue, latency matters. Many have solved this by tuning their database read and write parameters. However, at massive scale, even a highly efficient database will incur too much latency. Many companies address high latency by placing a caching mechanism in front of their data persistence solution. Caching allows control of the latency profile of an application, but requires custom code to enable replication from database into the cache.

Basho Data Platform solves the problem of replication by intelligently pairing Redis with Riak KV. Basho Data Platform provides both speed and high availability. Auto-sharding and cluster management capabilities ensure that the environment is stable and easy to manage, turning Redis into an enterprise-grade solution. Redis then handles the ad serving while Riak KV provides the distributed, scalable, and available data store for ad persistence.

Since latency matters, data location is also very important: the closer the data is to the end user, the faster it will be served. Basho Data Platform’s multi-cluster replication ensures ads are near the presentation endpoint, which significantly reduces latency.

SEARCH FOR ADS

The advertising marketplace must allow customers to search for either their own advertisements or for ads to place into rotation onto their websites. With Basho Data Platform, Redis can be used as a caching solution for type-ahead prediction (auto-complete), and Solr can be added to search for characteristics that have been tagged to each advertisement.
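The type-ahead lookup mentioned above can be sketched as a prefix search over a sorted list of terms. This is a plain-Python illustration only, not the platform's implementation: the function name `suggest`, the sample terms, and the use of a sorted list (comparable in spirit to a lexicographically ordered Redis sorted set) are assumptions.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from a sorted list that start with `prefix`."""
    # Binary search finds the first position where a matching term could sit.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Matches are contiguous in sorted order, so stop at the first non-match.
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


# Hypothetical ad tags that a customer might type into the search box.
terms = sorted(["sale", "sandals", "sneakers", "snowboard", "socks"])
suggestions = suggest(terms, "sn")
```

Because matching terms sit next to each other in sorted order, each keystroke costs one binary search plus a short scan, which is what makes a cache a natural home for this kind of lookup.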

This implementation differs greatly from what’s described in the “Serving Ads” section above. In this case, the customer would use multi-cluster replication to serve advertisements from cluster A while providing search capabilities from cluster B, something no other solution offers.

BILL FOR ADS

Placing ads and finding ads are the core of an advertising exchange, but the business wouldn’t survive for long without the ability to bill and generate revenue.

The advertising exchange tracks advertisement impressions in a very simple fashion: a date-time of impression per ad. The data must be correlated and analyzed over time intervals determined by the business (minute, hour, day, week, etc.). The process of correlating this data, performing the analysis for time ranges, and writing the data back into Riak is handled by a periodically running Spark job. The Spark Add-On handles both reading data from Riak KV and writing the result set back to Riak KV for persistence and consumption by the billing application.
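The aggregation step described above can be sketched in a few lines. The example is a plain-Python stand-in for the Spark job, not the Spark Add-On API: the `(ad_id, timestamp)` record layout, the function name `count_impressions`, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Each format truncates a timestamp to the start of its interval (assumed bucket sizes).
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


def count_impressions(impressions, interval="hour"):
    """Count impressions per (ad_id, time bucket).

    `impressions` is an iterable of (ad_id, datetime) pairs, mirroring the
    "date-time of impression per ad" records described in the text.
    """
    fmt = BUCKET_FORMATS[interval]
    counts = Counter()
    for ad_id, seen_at in impressions:
        counts[(ad_id, seen_at.strftime(fmt))] += 1
    return dict(counts)


# Hypothetical impression log.
sample = [
    ("ad-1", datetime(2015, 6, 1, 9, 5)),
    ("ad-1", datetime(2015, 6, 1, 9, 40)),
    ("ad-1", datetime(2015, 6, 1, 10, 2)),
    ("ad-2", datetime(2015, 6, 1, 9, 59)),
]
hourly = count_impressions(sample, "hour")
daily = count_impressions(sample, "day")
```

In the scenario above, a result set like `hourly` or `daily` is what would be written back to Riak KV for the billing application to read.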


Basho Data Platform provides a comprehensive set of data services that take the complexity out of manually deploying and managing separate clusters and instances of Riak KV with Spark, Redis, and Solr. These data services are integrated as a set of Core Services, Storage Instances, and Service Instances, which jointly form Basho Data Platform. This is illustrated in the origami graphic.

BASHO DATA PLATFORM CORE SERVICES

Big Data applications require a set of core services to keep them running smoothly. Manually deploying and managing separate clusters and instances of NoSQL databases, caching, and in-memory analytics is difficult and complex.

The Basho Data Platform Core Services provide a distributed, scalable, fault-tolerant framework and resource manager for integrating databases and other key components of Big Data applications. These services impact data accuracy, high availability, scalability, and operational simplicity. Basho Data Platform Core Services deploy, manage, and synchronize data in and between Storage Instances (Riak KV, Riak S2) and Service Instances (Apache Spark, Redis, Apache Solr).

DATA REPLICATION & SYNCHRONIZATION

In addition to replicating and synchronizing data within and across Riak clusters, Basho Data Platform makes Redis and Spark clusters highly available. For Redis queries, when the data is not found in the Redis cluster, it is read from Riak KV, returned to the client application, and synchronized to Redis. Spark data is persisted in Riak KV, so Spark can execute queries against data imported from Riak KV as well as existing Spark RDDs.

CLUSTER MANAGEMENT & MONITORING

Automated cluster management downloads, builds, and deploys clusters of Riak KV, Riak S2, Apache Spark, and Redis. Monitoring auto-detects incidents within and across clusters and auto-restarts clusters. It also auto-scales clusters as data grows. For Spark, cluster management plus the Riak KV Ensemble for leader election eliminates the need for Zookeeper.

INTERNAL DATA STORE

Built-in distributed data store for speed, fault tolerance, and ease of operations. It is used to persist configurations as well as static and dynamic data (port number, IP address) for sessions running across the Basho Data Platform.

MESSAGE ROUTING

A high-throughput distributed message system for speed, scalability, and availability. The platform’s enhanced message system persists and routes messages across platform clusters.

LOGGING AND ANALYTICS

Event logs provide valuable information to assist with enhanced tuning of clusters and to analyze dataflow across the cluster.

WHAT IS BASHO DATA PLATFORM?


Big Data applications need multiple data models to support different use cases in the same enterprise, and often in the same application. Integrating these models requires additional development and operational skills, which adds complexity.

Basho Data Platform simplifies this by supporting Storage Instances that include today’s most flexible NoSQL database, Riak KV, and large object storage software, Riak S2, that are architected for high availability and horizontal scale.

Making it easy to deploy and manage these Storage Instances with Service Instances (Spark, Redis, and Solr), Basho Data Platform also replicates and synchronizes data between them.

RIAK KV — DISTRIBUTED NOSQL DATABASE

A key/value data store that is highly available, scalable, and easy to operate. Automatic data distribution across the cluster ensures fast performance and fault tolerance. Multi-cluster replication delivers low-latency global performance and robust business continuity.

RIAK S2 — OBJECT STORAGE SOFTWARE

Simple, available, distributed large object storage for public, private, or hybrid clouds. Cost-effective compared to traditional storage at petabyte scale. It is also compatible with Amazon S3 and OpenStack Swift for easy integration into existing production workloads.

BASHO DATA PLATFORM SERVICE INSTANCES

Big Data applications never stand alone. They are highly distributed and composed of multiple components, including NoSQL databases, caching, and in-memory analytics, as well as separate configuration and resource management. Just keeping it all running and available takes a considerable commitment of effort and resources.

Basho Data Platform takes the difficulty out of doing this by integrating Riak KV with these Add-On Service Instances:

APACHE SPARK

Integrated real-time analytics

REDIS

Faster application performance with integrated Redis caching

BASHO DATA PLATFORM STORAGE INSTANCES

[Figure: multi-cluster replication across Datacenter #1, Datacenter #2, and Datacenter #3. Riak stores and manages data.]


Integration of Riak KV with Apache Spark™ provides real-time analytics using the Spark connector. Built-in cluster management eliminates the use of Zookeeper.

The rapid growth of unstructured data has changed the way that modern Big Data applications are designed and deployed. These unstructured data sets must be processed quickly, in real time, to reveal patterns, trends, and associations. The Spark connector for Basho Data Platform connects directly to Riak KV instances and moves required data to the Spark cluster. When data is required for analysis in Spark, it is read from Riak KV and processed in Spark, and the results can be stored back in Riak KV. The ability to persist these results to Riak KV retains flexibility for future data processing.

As part of Basho Data Platform, Spark Cluster deployments with Riak KV are as simple as specifying where code should be deployed. Both static information (configuration) and dynamic information (port numbers, etc.) are managed at installation time for newly deployed instances and existing Spark clusters. This makes it easy to manage Spark clusters without the use of Zookeeper.

APACHE SPARK™ ADD-ON FOR BASHO DATA PLATFORM

CLUSTER MANAGEMENT

– Eliminate Zookeeper

Built-in leader election makes it easy to manage Spark clusters at scale.

FAST DATA MOVER

– Add Spark to your Riak data

Intelligently load data from Riak KV into Spark clusters to minimize network traffic and processing overhead.

RIAK WRITE-BACK

– Make persistence simple

Store intermediate and final results in Riak KV for further processing by Spark or other components of your Big Data application.

PERFORMANCE AT SCALE

– Process Big Data fast

Architected for high performance, real-time analysis, and persistence of your Big Data.

AUTOMATED DEPLOYMENT

– Run Spark easily

Quickly deploy and configure Spark clusters with Riak KV. Auto-start failed Spark instances to reduce manual operations.

APPLICATION SIMPLICITY

– Don’t DIY

Systematically integrate and update analytics, caching, and search technologies to simplify the design and operations of your Big Data application.


HIGH AVAILABILITY

– Ensure Uptime

Integration with Riak KV makes the high-performance caching capabilities of Redis also highly available.

FAST CACHE

– Optimize for milliseconds

The speed of Redis is combined with the power of Riak KV to ensure low latency at scale.

AUTOMATIC DATA SYNCHRONIZATION

– Get your data when and where you need it

Data is automatically synchronized between Redis and Riak KV, and Basho Data Proxy resolves cache misses without requiring custom code to populate the cache.

AUTOMATIC SHARDING

– Eliminate painful manual sharding

Easily shard data automatically between multiple cache servers to reduce the time and errors of implementing manual sharding.

AUTOMATED DEPLOYMENT

– Save time

Easily deploy and configure Redis instances with Riak KV. Automatically restart failed Redis instances or disable on failure to reduce manual processes.

APPLICATION SIMPLICITY

– Improve Efficiency

Systematically integrate and update caching, analytics, and search technologies to simplify your Big Data application.
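To make the automatic sharding item above concrete, the sketch below assigns each key to a cache server with a stable hash. The source does not say which sharding scheme the platform uses, so simple hash-modulo placement is an assumption here, as are the function name `shard_for` and the server names.

```python
import hashlib


def shard_for(key, servers):
    """Pick a cache server for `key` using a stable hash.

    hashlib is used instead of Python's built-in hash(), which is randomized
    per process for strings and would place keys differently on each run.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["cache-a", "cache-b", "cache-c"]  # hypothetical server names
placement = {key: shard_for(key, servers) for key in ("ad:1", "ad:2", "ad:3", "ad:4")}
```

Modulo placement is easy to reason about, but it remaps most keys whenever the number of servers changes. Schemes such as consistent hashing are commonly used to limit that churn.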

“WRITE IT LIKE RIAK, CACHE IT LIKE REDIS”

Redis caching with Riak KV improves application performance by reducing latency. Built-in cluster management, high availability, automatic data sharding, and the ability to replicate and sync data between Riak KV and Redis make Redis enterprise grade.

The combined power of Redis caching and Riak KV reduces latency to improve application performance. Basho Data Platform adds high availability and fault tolerance to Redis, and extends the operational simplicity of Riak KV to Redis for instance management and auto-sharding.

Since any Redis client can query the cache, no changes are required for existing Redis clients to access data in Riak KV. If Redis doesn’t have the data in cache, it is accessed from Riak KV. Data is also automatically synchronized between Redis and Riak KV, increasing availability by allowing read-failures in Redis to be resolved by Riak KV and written back to the Redis cache. As part of Basho Data Platform, Redis deployment with Riak KV is as simple as specifying where the code should be deployed. Both static (configuration) and dynamic information (port numbers, etc.) are managed at the time of installation for both newly deployed instances and existing Redis installations.


The inclusion of integrated search means it’s easy to query Riak KV data sets using Apache Solr™. As data changes, search indexes are automatically synchronized. Get the full-text search power of Solr with the availability and scalability of Riak KV.

Storing unstructured data in Riak KV is only one component of a Big Data application. It is also necessary to retrieve that data for application consumption. The Solr Add-On brings together the strengths of Riak KV’s scalable, distributed database with the powerful full-text search functionality of Apache Solr. This allows for transparent indexing and querying of Riak KV data values. In addition, there is direct support for Solr client query APIs, which enables integration with existing software solutions (either homegrown or commercial).

With the Solr Add-On, Riak KV is responsible for the data and Solr is responsible for the indexes.

Riak KV monitors for changes to data and propagates those changes to indexes managed by Solr.

This data synchronization is critical to ensuring that full-text search results are up to date as data changes.
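The division of labor described above, where the store owns the data and the index follows every change, can be illustrated with a toy example. Nothing here is the Solr Add-On's actual mechanism: the class `IndexedStore`, its in-memory dictionaries, and the whitespace tokenizer are assumptions chosen to keep the sketch self-contained.

```python
from collections import defaultdict


class IndexedStore:
    """Toy key/value store that keeps a word index in step with every write."""

    def __init__(self):
        self.data = {}                  # stands in for Riak KV (owns the data)
        self.index = defaultdict(set)   # stands in for a Solr index: word -> keys

    def _unindex(self, key):
        # Remove index entries that point at the value currently stored for `key`.
        for word in self.data.get(key, "").lower().split():
            self.index[word].discard(key)

    def put(self, key, text):
        self._unindex(key)              # drop stale entries before re-indexing
        self.data[key] = text
        for word in text.lower().split():
            self.index[word].add(key)

    def delete(self, key):
        self._unindex(key)
        self.data.pop(key, None)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


store = IndexedStore()
store.put("ad-1", "red running shoes")
store.put("ad-2", "blue running jacket")
store.put("ad-1", "red hiking boots")   # update: "running" no longer matches ad-1
running_ads = store.search("running")
```

The update to `ad-1` shows why synchronization matters: without the un-index step, a search for "running" would keep returning an ad whose text no longer contains that word.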

APACHE SOLR™ ADD-ON FOR BASHO DATA PLATFORM

DISTRIBUTED FULL-TEXT SEARCH

– Connect to one, talk to all

Standard full-text Solr queries are automatically expanded into distributed search queries to provide a complete result set across instances.

AD-HOC QUERY SUPPORT

– Ask complex questions of your data

Broad support for a wide range of Solr query parameters: exact match, range queries, and/or/not, sorting, pagination, scoring, ranking, etc.

INDEX SYNCHRONIZATION

– Automate index updates

Automatically synchronize data between Riak KV and Solr. Intelligent monitoring picks up changes to data and propagates those changes to Solr indexes.

SOLR API SUPPORT

– Integrate with existing software

Query data in Riak KV using existing Solr software, adding a powerful data source to Big Data applications.

AUTO-RESTART

– Reduce or eliminate slow manual restarts

Monitor the Solr OS processes and automatically start or restart processes when failures are detected.

APPLICATION SIMPLICITY

– Make the complex simple

Systematically integrate and update search, caching, and analytics technologies to simplify the design and operations of your Big Data application.
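As an illustration of the ad-hoc query support listed above, the sketch below builds a request URL using standard Solr query parameters (`q`, `sort`, `start`, `rows`, `wt`). The host, path, index name, and field names (`tags`, `width`) are placeholders; how the platform exposes its Solr endpoint is not specified here, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, query, sort=None, start=0, rows=10):
    """Build a Solr-style query URL with boolean/range query, sorting, and paging."""
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    if sort:
        params["sort"] = sort
    # urlencode percent-encodes brackets, colons, and spaces in the query string.
    return base_url + "?" + urlencode(params)


url = build_search_url(
    "http://search.example.com/solr/ads/select",                 # placeholder endpoint
    "tags:sports AND NOT tags:alcohol AND width:[300 TO 728]",   # and/not plus a range
    sort="score desc",                                           # rank by relevance
    start=20,                                                    # third page of 10
    rows=10,
)
decoded = parse_qs(urlsplit(url).query)
```

Since the request uses only standard Solr parameters, the same URL shape should work with any client that already speaks the Solr query API.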


CONFIGURATION FLEXIBILITY

Basho Data Platform Fully Managed Configuration

INSTALLATION CHOICES

Basho Data Platform provides customers with the flexibility to install one, some, or all available components of Basho Data Platform. Often, customers choose to install Spark and Redis with Riak KV for a fully managed implementation that ensures high availability and scalability for all of the data components in the solution (i.e., Riak KV, Spark, and Redis).

Customers can choose to install any combination of Riak KV, Spark, Redis, and Solr that best fits their needs.

Customers can add the Basho Data Platform to their existing installation of Spark and/or Redis. However, we recommend a


FULLY MANAGED SPARK ADD-ON

Basho Data Platform includes a Spark Connector to implement real-time analytics seamlessly. This creates a 1:1 mapping between Riak KV and Spark data and optionally allows query results to be persisted back into Riak KV. The Spark Connector provides both power and flexibility. It provides high availability for Spark by using Riak KV, rather than Zookeeper, for leader election. For greater flexibility, the Basho Spark Connector does not require you to run Spark on the same node as the source database: you can run Spark anywhere you want and use query results from Riak KV, Solr, or both.

FULLY MANAGED REDIS ADD-ON

Basho Redis Proxy supports multiple caching scenarios, including read-through cache. The diagram below shows a client application attempting to read a value from cache. The proxy service first tries to retrieve the value from Redis. If the value isn’t found in Redis, it is read from Riak KV. This method of caching is called a read-through cache.
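The read path described above can be sketched with two dictionaries standing in for Redis and Riak KV. This is a minimal illustration of the read-through pattern, not the proxy's implementation: the class `ReadThroughCache`, its `misses` counter, and the sample key are assumptions. Populating the cache on a miss follows the write-back behavior described in the Redis section.

```python
class ReadThroughCache:
    """Serve reads from a cache, falling back to a backing store on a miss."""

    def __init__(self, cache, backing_store):
        self.cache = cache                  # stands in for Redis
        self.backing_store = backing_store  # stands in for Riak KV
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit: no backing-store read
        self.misses += 1
        value = self.backing_store.get(key)
        if value is not None:
            self.cache[key] = value         # populate the cache for later reads
        return value


riak = {"ad:42": "banner-42.png"}   # hypothetical stored ad
redis = {}                          # cache starts empty
proxy = ReadThroughCache(redis, riak)
first = proxy.get("ad:42")          # miss: read from the backing store
second = proxy.get("ad:42")         # hit: served from the cache
```

The client only ever calls `get`, so the fallback and cache population stay out of application code, which is the point of placing a proxy in front of the cache.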


“Basho specializes in solving distributed systems challenges, and integrated approaches such as Basho Data Platform help ensure that applications are highly available, massively scalable, and easy to deploy at production scale.”

— Mac Devine, VP CTO IBM Cloud Services at IBM

CONCLUSION

Architecting Big Data applications that rely on multiple data stores requires a clear vision of the specific components and integration points in the data flow pipeline. Modern solutions can include hybrid cloud, IoT streams, and many other components or ‘flavors’ of Big Data. Rather than assuming that all data components easily fit together, effective use of well-designed integration services is key to successful implementation.

Basho Data Platform is designed to deliver maximum data availability, to scale linearly using commodity hardware, and to provide operational simplicity at production scale.

CHALLENGES:

• Complex data models

• Complex interactions

• Complex fault tolerance

• Complex query patterns

BASHO DATA PLATFORM:

• Supports multiple database models

• Integrates NoSQL with real-time analytics & caching

• Ensures high availability and fault tolerance

• Provides rich query capabilities

YOU GAIN:

• Faster time to market

• More uptime

• Faster application performance

• Integrated real-time analytics

COMPLEX DATA PROJECTS SIMPLIFIED

ABOUT BASHO TECHNOLOGIES

Basho is a distributed systems company dedicated to developing disruptive technology that simplifies enterprises’ most critical data management challenges. Basho has attracted one of the most talented groups of engineers and technical experts ever assembled, devoted exclusively to solving some of the most complex issues presented by scaling distributed systems.

Basho’s distributed database, Riak® KV, the industry leading distributed NoSQL database, and Basho’s cloud storage software, Riak® S2, are used by fast growing Web businesses and by one third of the Fortune 50 to power their critical Web, mobile and social applications. The Basho Data Platform helps enterprises reduce the complexity of supporting Big Data applications by integrating Riak KV and Riak S2