ebook Big Data Explained, Analysed, Solved

This eBook gives an overview of what big data is and its growing importance. It talks about some of the different kinds of big data, as well as some of the different things you would do with it.

Canonical is involved in Big Data

Canonical, the company behind Ubuntu, works closely with its partners on all aspects of infrastructure and partner solutions to support storing, managing, and analysing big data.

What you will learn

The functional section of this book discusses applications, tools, managed services and clouds, used together or separately, that will help you benefit most from big data. You can skip directly to any section and focus on what’s most important to you, or read the book straight through.

Bill Bauman

Strategy & Content, Canonical

Bill Bauman, Strategy & Content, Canonical, began his technology career in processor development and has worked in systems engineering, sales, business development, and marketing roles. He holds patents on memory virtualization technologies and is published in the field of processor performance. Bill has a passion for emerging technologies and explaining how things work. He loves helping others benefit from modern technology.

Overview

• What is Big Data, a general overview
• The increasing importance of Big Data
• Different types of Big Data
• Big Data analysis and action
• Do I need a cloud for Big Data?

Big Data general overview

Traditional Data

To understand big data, consider some examples of traditional data. Traditional data may be a database of clients with their associated contact information. It could be a database of cars: years, makes, models. This sort of data usually grows gradually in size, and the types of data stored rarely change.

Traditional data is generally well-structured and fits predefined or predictable categories.

Structured Database

The general purpose

These gigantic data sets are being compiled and stored so that we can analyse the data. Analysis includes recognising patterns, trends, associations, etc. The outcome of analysis is actions that would not be possible without big data.

In the next section of this eBook, we go into further detail about big data analysis - why we do it, why it’s important, and the sort of information for which we’re looking.

Big Data

When we look at Big Data, typically the data is not so neatly organised. Some big data examples could be random spots on a map, documents, images, huge lists of named or unnamed individuals who all happened to be in the same general area at a given time, or the millions of clicks on a web page in a given week.

Big Data can be structured or unstructured, but generally the database and analysis tools are specially designed for a given purpose and to handle the tremendously large scale, size, velocity, and variety that most big data datasets represent.

Often Unstructured

Purpose-specific toolset

Collect

Organisations of all sizes and functions are increasingly gathering more information about their interactions and transactions. They are also looking to third parties to provide additional data. Regardless of how they gather data, the types and quantities are increasing. In a modern, data-driven world, an organisation that isn’t taking advantage of big data collection, analytics, and action is likely to become uncompetitive with those that are.

Analyse

The analysis of big data can have big returns. The ability to understand the types of data that are collected, to correlate one type of data with another, to observe trends, to identify outliers, and to perform many other analytic functions is increasingly valuable in organisations of all types.

Without thorough analysis via the use of modern, big data analytics tools, it can be easy to miss or overlook important trends, shifts in perspective, or subtle changes in customer interaction. Through analysis, you can learn patterns and predict actions before they occur and even begin to direct them via actions discussed here in the Act section.

Act

The ability to do something with the data that is collected and analysed is the most compelling part of big data. Corporations can offer more compelling products and solutions. Governments can better predict and serve the needs of citizens. Even small businesses can identify short and long term trends in their sales and interactions with customers, as well as other businesses. All of these outcomes are about improved efficiencies and experiences for everyone involved, from the provider to the consumer.

The increasing importance of Big Data

Big data can be structured or unstructured. New tools and datasets are blurring the lines that separate the two. Below are some common examples of big data types.

Types of Big Data

Structured big data

Remember that structured data can still be big data. Structured big data could be compiled from millions or billions of data points, daily or even hourly.

User Input

This is data that is created when a user responds to a prompt or requested action. This could be a ratings system, a survey, a loyalty program, or any other prompt for the user to input specific data in specific fields that are then stored in a structured manner.

Compilations

Compiled big data results from merging existing, otherwise disparate databases into a single dataset. For example, the data could include names, locations, demographics, account balances, credit scores, etc., all combined into a compiled big data dataset.
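As a minimal sketch of compilation, assume three small sources keyed by a shared customer ID. The dataset names, field names, and values below are hypothetical and only illustrate the idea of merging disparate records into one dataset.

```python
# Hypothetical source datasets, each keyed by a shared customer ID.
contacts = {
    101: {"name": "Ada", "location": "Leeds"},
    102: {"name": "Grace", "location": "Austin"},
}
accounts = {101: {"balance": 2500.0}, 102: {"balance": 310.5}}
credit = {101: {"credit_score": 720}}  # no credit record for customer 102


def compile_records(*sources):
    """Merge several keyed datasets into one record per key."""
    compiled = {}
    for source in sources:
        for key, fields in source.items():
            # Create the record on first sight, then add this source's fields.
            compiled.setdefault(key, {}).update(fields)
    return compiled


compiled = compile_records(contacts, accounts, credit)
for customer_id, record in sorted(compiled.items()):
    print(customer_id, record)
```

Records that are missing from one source simply lack those fields in the compiled result, which is a common situation when combining databases that were never designed to work together.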

Transactions

Transactional big data is everything having to do with a transaction, including whether the transaction was even completed. The data could include what was purchased, how long it took, whether it was online or in-store, and whether other items were typically purchased together.
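The questions above can be sketched in a few lines, assuming a tiny, hypothetical transaction log (the items, channels, and field names are invented for illustration). The sketch counts which items are bought together and how many transactions were completed.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction log; "completed" records whether the sale went through.
transactions = [
    {"items": ["bread", "butter", "jam"], "channel": "in-store", "completed": True},
    {"items": ["bread", "butter"], "channel": "online", "completed": True},
    {"items": ["jam", "tea"], "channel": "online", "completed": False},
    {"items": ["bread", "jam"], "channel": "in-store", "completed": True},
]

pair_counts = Counter()
for tx in transactions:
    if not tx["completed"]:
        continue  # abandoned transactions stay in the log but are skipped here
    # Sort so ("bread", "butter") and ("butter", "bread") count as one pair.
    for pair in combinations(sorted(set(tx["items"])), 2):
        pair_counts[pair] += 1

# True counts as 1 and False as 0, so the sum is the number of completed sales.
completion_rate = sum(tx["completed"] for tx in transactions) / len(transactions)

print(pair_counts.most_common())
print(completion_rate)
```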

User-generated Content

Every day, millions of Internet users post pictures, videos, short messages, audio, and more. Much of this data is completely unassociated with a category or field. Essentially, it is completely unstructured, and it is the function of targeted big data applications to aggregate, cull, present, and analyse these datasets.

Unstructured Big Data

Big data is most commonly associated with unstructured data. Unstructured data, like photos and IoT datasets, was largely the genesis of modern big data.

Passive Data

This is generally the data that is generated without specific intent or interaction from users. For example, cell phones are perpetually updating GPS coordinates of their users’ respective locations. Logistics information, bar code scans, and delivery information are all data that are passively updated but can provide valuable insights when analysed.

Predictive Analytics

Probably the most common type of analysis, using past patterns or performance to determine future actions is one of the best known uses of big data. It’s important to analyse data from a multitude of different perspectives and to include cross-referenced, sometimes loosely associated data, to establish the most comprehensive patterns and future predictions. Predictive analytics can also be bolstered by machine learning, whereby, over time, the system builds its own intelligence profile on a given subject, individual, or topic.
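A very small illustration of using a past pattern to predict a future value is a straight-line trend fitted by ordinary least squares. This is only a sketch under simple assumptions: the weekly sales figures are hypothetical, and real predictive analytics would draw on far more data and richer models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: units sold in each of the last five weeks.
weeks = [1, 2, 3, 4, 5]
sales = [100, 112, 119, 131, 138]

slope, intercept = fit_line(weeks, sales)
forecast_week_6 = slope * 6 + intercept  # extrapolate the trend one week ahead

print(f"trend: {slope:.1f} units per week, forecast for week 6: {forecast_week_6:.1f}")
```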

Descriptive Analytics

The focus here is on metrics, a summary of what has happened. This could be views, clicks, counts, posts, etc. While descriptive metrics are not necessarily incredibly useful on their own, they are the underlying data points that feed more advanced analysis and actions. Descriptive analytics have been used for many years now, and are the foundation of the many graphs and charts we see on the Internet and in presentations today.
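Descriptive metrics of this kind are simple counts. As a minimal sketch, assuming a handful of hypothetical clickstream events (the page paths and field names are invented for illustration), the totals can be computed like this:

```python
from collections import Counter

# Hypothetical clickstream events; the field names are illustrative only.
events = [
    {"page": "/home", "action": "view"},
    {"page": "/home", "action": "click"},
    {"page": "/pricing", "action": "view"},
    {"page": "/home", "action": "view"},
    {"page": "/pricing", "action": "click"},
    {"page": "/blog", "action": "view"},
]

# Summarise what has happened: totals per action, and views per page.
counts_by_action = Counter(event["action"] for event in events)
views_by_page = Counter(e["page"] for e in events if e["action"] == "view")

print(counts_by_action.most_common())
print(views_by_page.most_common(1))
```

Counts like these are the raw material for the charts mentioned above, and for the predictive and prescriptive approaches described in the neighbouring sections.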

Prescriptive Analytics

Prescriptive analytics is largely an intelligent evolution of predictive analytics, in which data analysis is used to determine recommended actions. Where predictive analytics looks at patterns and makes predictions, prescriptive analytics looks at patterns, associates them with additional datasets, determines where individual data points coincide or where there are recurring common descriptors or activities, and then prescribes a potential course of action or solution. Prescriptive analytics is generally underutilised but offers great potential to reduce time to market for solutions or assessment times for individuals in various fields.
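One way to picture the step from prediction to prescription is to combine a forecast with a second dataset and turn the result into a recommended action. The sketch below assumes a hypothetical demand forecast and stock list; the item names, quantities, and the 10% safety margin are all invented for illustration.

```python
import math

# Hypothetical inputs: predicted units needed next week, and current stock.
forecast_demand = {"bread": 120, "jam": 40, "tea": 15}
stock_on_hand = {"bread": 80, "jam": 60, "tea": 15}


def prescribe(forecast, stock, safety_margin_pct=10):
    """Recommend an action per item by comparing the forecast with stock."""
    actions = {}
    for item, demand in forecast.items():
        # Target level is the forecast plus a safety margin, in percent.
        target = demand * (100 + safety_margin_pct) / 100
        shortfall = target - stock.get(item, 0)
        if shortfall > 0:
            actions[item] = f"reorder {math.ceil(shortfall)} units"
        else:
            actions[item] = "no action"
    return actions


actions = prescribe(forecast_demand, stock_on_hand)
for item, action in actions.items():
    print(item, "->", action)
```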

Do I need a cloud for Big Data?

Even though big data was born in the cloud, that doesn’t mean you need a cloud to take advantage of big data solutions or to act on the data. The most important aspects of working with big data are that you have chosen the right tools and the right applications for your solution. Canonical can help you with both.

Canonical has created an open source solution for system design and service modelling called Juju. Juju simplifies the process of designing your solution, then configuring, associating, and deploying the applications in it. Having a tool like Juju means that selecting the right big data applications for your needs is the most important remaining factor.

For more information on Juju, see the section Design, deploy, package Big Data solutions.

Although it isn’t necessary, a cloud can be tremendously beneficial to big data processing. The nature of big data is that it is constantly changing, and the purpose of that data, the analysis of that data, and the storage of that data can change just as quickly. A tool like Juju can help you keep up with the change in usage by deploying new big data charmed solutions. But Juju can’t do it all.

For system scalability and the ability to easily access different types of storage for different needs, a cloud is recommended. Juju can talk directly to both public and private cloud solutions, like AWS and Canonical OpenStack, respectively.

For more on building your own private cloud, see the sections OpenStack is a Big Data warehouse and BootStack for Big Data later in this eBook.

Choosing the right applications

There are many ways to go about application selection. Some people already know which big data processing solutions they want to use. Others are looking for advice, or looking to explore potential new solutions.

In the Juju Big Data Charms section of this book, we outline many big data software solutions that are available, and give a brief description of their purpose. This is a great starting point to see what’s out there, and Juju makes it easy to try them all.

Additionally, in the BootStack for Big Data section of this book, we go into detail on how a BootStack cloud helps to start processing big data quickly and efficiently.

Juju is the game-changing service modelling tool that lets you build entire cloud environments with only a few commands.

BootStack is your OpenStack private cloud, running on your hardware, in your choice of datacentre with Canonical’s experts responsible for design, deployment and availability.

Whether in a cloud or on a dedicated system, managing all the applications in a big data solution is best handled by a tool that does more than static configuration management or orchestration deployment.

Juju is a service modelling product from Canonical that gives you a blank canvas on which you can visually lay out all of your big data apps. Communications and data paths are defined as relationships between the applications by connecting the apps on your canvas. The visual solution design and all the application relationships can be deployed immediately, and exported and saved as a bundle for future use.

Juju, Charms, and Bundles

The use of Charms is what gives Juju its incredible capabilities to manage applications in complex infrastructures. Charms are intelligent scripts wrapped around big data applications that allow them to be dynamically configured and deployed without manual configuration. The abstraction of application relationship management by Juju’s Charms is what allows big data solutions to be rapidly deployed and seamlessly scaled. Without the application abstraction that Juju provides, big data system services require manual intervention or iteration of inflexible, static configuration scripts any time the solution design needs to be updated or changed.

Evolving the solution

When it comes to big data processing, the solution is rarely static. Big data deployments evolve over time, and that often involves adding or removing component services. The same tool that you used to design and deploy the solution can be used to dynamically add and remove components within it. Juju’s service modelling approach lets you evolve your solution and keep pace with the rapidly changing big data market.
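The idea of a service model, applications connected by relationships that can be added and removed over time, can be sketched as a small data structure. This is purely a conceptual illustration: the `ServiceModel` class and its methods are hypothetical and are not Juju's API or bundle format.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceModel:
    """Hypothetical, simplified model: applications plus relations between them."""

    applications: dict = field(default_factory=dict)  # name -> number of units
    relations: set = field(default_factory=set)       # frozenset({app_a, app_b})

    def add_application(self, name, units=1):
        self.applications[name] = units

    def relate(self, a, b):
        # A relation only makes sense between applications already in the model.
        for name in (a, b):
            if name not in self.applications:
                raise KeyError(f"unknown application: {name}")
        self.relations.add(frozenset((a, b)))

    def remove_application(self, name):
        # Removing an application also removes every relation it took part in.
        del self.applications[name]
        self.relations = {r for r in self.relations if name not in r}


# Initial design: ingest -> processing -> storage.
model = ServiceModel()
model.add_application("kafka")
model.add_application("spark", units=3)
model.add_application("cassandra")
model.relate("kafka", "spark")
model.relate("spark", "cassandra")

# Evolve the design: swap the storage component and scale out processing.
model.remove_application("cassandra")
model.add_application("mongodb")
model.relate("spark", "mongodb")
model.applications["spark"] = 5

print(model.applications)
```

Keeping relations as part of the model is what lets a component be swapped without touching the rest of the design by hand.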

Design, deploy, package Big Data solutions

Ingest & Messaging
• Message Processing: Flume, Kafka
• Message Queues: RabbitMQ, ZeroMQ

Structured Data
• MySQL
• PostgreSQL
• Percona Cluster
• MariaDB

Scale Out Storage
• Ceph
• Swift

NoSQL
• Stack: Elasticsearch, Logstash, Kibana
• Document Databases: MongoDB, CouchDB, Couchbase
• Column & KV: Cassandra / DSE, quasardb, memcached, Redis

Analytics / Search / Visualisation
• SpagoBI
• Saiku
• Storm
• Spark
• Datafari (ManifoldCF, Solr)
• Zeppelin
• IPython Notebook

As discussed on the Design, deploy, package Big Data solutions page, these are a sample of the Charms available for big data. With Juju, you can readily deploy any combination of these Charms and define their configurations and data paths all from a graphical interface, CLI, or API.

Hadoop
• Hadoop Flavours: Apache Hadoop, Cloudera Hadoop
• YARN
• Hive
• Mahout
• HBase
• Pig
• ZooKeeper
• Flume
• Kafka
• Tez

Spark
• Spark
• Spark Streaming
• Spark SQL
• SparkML
• GraphX

Container Ecosystem & Orchestration
• Docker
• LXD / LXC
• Kubernetes
• Mesos

Big data frameworks are available for deployment in Juju. You can deploy an entire Hadoop cluster with a Juju Charm bundle, or Spark, Docker, or Kubernetes, for example. The Charms listed on the Juju Big Data Charms page can all be associated with the frameworks listed here, as appropriate. All of these frameworks benefit from Juju’s ability to automatically configure application data paths and relationships.

Ubuntu Server is the most popular cloud operating system in use. There are many reasons why Ubuntu is so popular, but one of the primary reasons is that Canonical started to focus on OS scalability many years ago. When you’re working with big data, you need a cloud-ready platform, like Ubuntu, that is designed for scalability and reliability.

Ubuntu for Big Data systems

Ubuntu allows you to process your big data anywhere. Keep sensitive information in-house, leverage the public cloud for unpredictable workloads, and trusted private cloud partners for both.

Ubuntu Server can be used as a traditional operating system. There are also optimised variants for low latency and other task-specific solutions, like big data processing.

Where Ubuntu runs:

• On-premise, in your own cloud

• In an external, private cloud

• On public clouds, like AWS, Azure, Rackspace, Google Cloud Platform, IBM, and many others; please see the Ubuntu Certified

How Ubuntu runs:

The flexibility of Ubuntu to run anywhere on almost any architecture makes it the ideal platform choice to execute big data workloads.

• Bare metal server on x86, ARM, POWER, or z Mainframe
• Container on bare metal
• Public cloud guest instance
• Container as a cloud instance
• Virtual machine on KVM, VMware, Hyper-V, and other hypervisors
• Container as a virtual machine
• Private cloud guest instance

The section Do I need a cloud for Big Data in this book addresses some of the benefits of clouds for big data. Specifically, an OpenStack cloud is the most popular private cloud solution for big data.

OpenStack is a community-based private cloud solution. It is not a single product, but a collection of individual projects designed to seamlessly interact to create a functional cloud. Canonical OpenStack is a production-ready, supported OpenStack distribution, and more.

The best way to build an OpenStack cloud is using Autopilot. Autopilot is a graphical installation tool that allows you to select the components of OpenStack you would like to install and deploys them for you. It can even deploy them with high availability.

Autopilot is designed to work with an extended tool set beyond just OpenStack. MAAS (Metal as a Service) automates the configuration of the physical nodes in your OpenStack environment. Juju, discussed further in the Design, deploy, package Big Data solutions section of this eBook, allows you to automatically deploy applications and their respective relationships within your OpenStack cloud. Landscape manages the Autopilot experience, as well as the cloud itself, and the guest instances within it. The comprehensive tool set that comes with a Canonical OpenStack cloud makes deploying big data solutions easier, faster, and more robust, from the bare metal to the platform operating system to the applications themselves.

The base platform of Canonical OpenStack is Ubuntu. Ubuntu is not only the most popular cloud operating system, it is also the most popular OpenStack infrastructure operating system. Ubuntu runs on the OpenStack physical nodes, providing critical services like compute, networking, and storage. It is also the platform for your guest instances, whether they are LXD machine containers or virtual machines, where you run your big data applications.

Combining OpenStack with Canonical’s feature-rich tools and Ubuntu creates a scalable, reliable, automated platform for deploying and managing big data solutions for any type of analytics, monitoring, and more. Canonical even guarantees the upgradeability of your OpenStack Big Data cloud.

BootStack is a unique, managed Canonical OpenStack offering. It is unique in that you may choose to run the solution in your own datacenter, on your own hardware, or in a third-party hosted facility, like IBM SoftLayer, an Ubuntu Certified Public Cloud partner. Canonical’s engineers have years of OpenStack experience. With BootStack, you can draw on their know-how and best practices and have a Canonical OpenStack cloud ready for big data processing in days.

With BootStack, you focus on the data, and Canonical takes care of the infrastructure. Additionally, when you want, Canonical can transfer total control of your OpenStack environment to you.

All of the tools that make Canonical OpenStack the platform of choice for big data are included in BootStack. Even better, they can be preconfigured for you and ready for use. As soon as your BootStack cloud is ready, you can start using all the big data solutions in the Juju Charm Store. You’ll find the core big data solutions you expect and can even start discovering new big data solutions from all our Charm partners.

BootStack is billed on a pay-for-use model. The model is similar to that of Ubuntu Advantage Storage. These unique and innovative price models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds.

Whether you just want to try it out, don’t have the in-house skills, or want to get up and running quickly, BootStack can provide the answer to a big data cloud. To learn more about BootStack, and use the BootStack calculator to calculate potential savings, visit the BootStack managed cloud page.

Ubuntu Advantage Storage is a unique and ideal storage solution for big data storage and real-time processing. It is based on Software Defined Storage (SDS) solutions, allowing for flexibility and modern data management approaches.

Choose the right technology

Ceph, NexentaEdge, Swift and SwiftStack are all supported by Ubuntu Advantage Storage. That means you choose the right technology for your solution, and it is all directly supported by Canonical. The hardware you choose to run the solution on is just as important, and Canonical’s partners and engineers can help you with that as well.

Pay for what you use

Another unique feature of Ubuntu Advantage Storage is its pay-for-use, metered model. As opposed to paying for all the storage in your datacenter, you just pay for the storage that’s actively in use. Additionally, you don’t pay for replicas or online backups. The cost savings compared to other SDS-based and managed storage solutions can be 2x to 3x, or even more.
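The arithmetic behind a metered model can be sketched as follows. Every number here is hypothetical (the capacities, the replica counts, and the unit price are invented, and the same unit price is assumed for both models purely to isolate the effect of what is counted); it is not Canonical's price list.

```python
def metered_bill(used_tb, price_per_tb):
    """Bill only for data actively stored; replicas and free space are not counted."""
    return used_tb * price_per_tb


def capacity_bill(used_tb, replicas, unused_tb, price_per_tb):
    """Bill for every raw terabyte provisioned: data, its replicas, and free space."""
    raw_tb = used_tb * replicas + unused_tb
    return raw_tb * price_per_tb


# Hypothetical figures: 100 TB of content, 300 TB of free capacity,
# and an invented price of 10 currency units per TB per month.
used_tb = 100
unused_tb = 300
price = 10

metered = metered_bill(used_tb, price)
capacity_3_replicas = capacity_bill(used_tb, 3, unused_tb, price)
capacity_4_replicas = capacity_bill(used_tb, 4, unused_tb, price)

print(metered, capacity_3_replicas, capacity_4_replicas)
```

In this sketch the metered bill does not change when redundancy or free capacity grows, while the capacity-based bill rises with both.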

The pay-for-use model of Ubuntu Advantage Storage is similar to that of our managed OpenStack solution, BootStack. These unique and innovative price models are part of the initiative to make private cloud usage and consumption as easy to calculate and predict as that of public clouds.

Ubuntu Advantage Storage

Figure: Grow your capacity without growing your bill. Increase your redundancy, pay the same! (The panels compare used capacity, unused capacity, redundant data, and total capacity against what you pay for.)

Machine containers are a relatively new technology in the virtualisation ecosystem. Delivered by Ubuntu as a technology called LXD, they provide the management of traditional virtual machines without the system overhead.

Many big data solutions execute optimally when run at bare metal speed. That can limit the use of virtualisation, though, and restrict system placement. By using LXD, multiple services can share a single system and all have direct hardware access.

LXD isn’t just about performance. There are big data workloads that run in public clouds as guest instances. Almost all of those instances are virtual machines. One of the benefits of LXD machine containers is that they provide process isolation and application mobility (live migration) to running processes. That means increased manageability for public cloud instances, as well as bare metal and private cloud solutions.

Machine Containers for Big Data

Canonical as a strategic partner for Big Data

Working with Canonical as your valued partner will maximise your success with big data. Some attributes to keep in mind, and that Canonical delivers, are:

• Scalability
• 24/7 Support
• Prebuilt, integrated bundles
• Managed offerings
• Application catalog
• Existing expertise
• Time to solution
• ...and more

Your strategic big data partner should understand and have experience designing, building, deploying, and managing scalable infrastructures and big data applications. Ideally that partner brings with it an entire ecosystem of additional big data partners. Canonical works closely with a multitude of big data software and platform providers to ensure choice in solutions while maintaining quality and integrity in the overall stack.

There are many kinds of big data.

There are many big data applications, services, and solutions.

Canonical has domain expertise, understands big data, has strong industry partnerships, and can provide a scalable, supported solution.

If you’re excited to hear more and talk to us directly, you can reach us on our Contact Us page.

To learn more about a managed solution for big data, download the paper BootStack Your Big Data Cloud.

If you want to start trying things out immediately, we highly encourage you to visit Juju solutions for big data.

Conclusion

Your data is important. You need to know how to store, process, and act on your data. The overview, explanations, and solutions outlined in this book will get you started or accelerate your journey to maximising the benefits of the data you have and the new data you will start collecting.

Your best next step is to contact Canonical today.

At Canonical, we are passionate about the potential of open source software to transform business. For over a decade, we have supported the development of Ubuntu and promoted its adoption in the enterprise.

By providing custom engineering, support contracts and training, we help clients in the telecoms and IT services industries to cut costs, improve efficiency and tighten security with Ubuntu and OpenStack. We work with hardware manufacturers like HP, Dell and Intel, to ensure the software we create can be delivered on the world’s most popular devices. And we contribute thousands of man-hours every year to projects like OpenStack, to ensure that the world’s best open source software continues to fulfil its potential.
