Analyzing Big Data with Splunk: A Cost Effective Storage Architecture and Solution

Jonathan Halstuch, COO, RackTop Systems

We hear so much about Big Data and its effect on business (fragmented information streams, the challenge of manipulating and using massive amounts of collected information, storage considerations, and the hardships these issues bring) that we have to wonder: isn’t there a better way?

It is often said that when a company needs to look at information generated by its own devices, it gets a piecemeal view of the results: data logs are over here; billing and call records are over there; performance metrics require a totally different screen and application; and clickstream data needs yet another destination for review. And all of these live in a completely different headspace from the question of where to store all this data. What to do?

The solution is to centralize and organize this disparate machine data in one easy-to-use application, with storage at the back end, so that mission-critical data is easy to gather, access, analyze, secure, and store, which in turn enhances business continuity.

Today, there is an even greater need to understand your business environment in order to drive insightful decisions. Newer and better search and visualization tools are required to analyze data (question-focused data sets). Automated reporting is needed to consolidate and speed up business analysis and, most important, the overall view of this data must move from piecemeal snapshots to a comprehensive aggregate overview. All of this data then has to be stored and made accessible. With this operational intelligence, companies are realizing significant ROI, such as fewer servers; tool consolidation; lower personnel costs; less troubleshooting time per transaction; faster root cause identification; fewer outages and less downtime; lower mean time to repair; infrastructure savings; and, most important, improved customer satisfaction.

Indexing

As Splunk accumulates and indexes data, it creates two types of files: 1) Raw Data, which is a compressed version of all data; and 2) Index Files, which are flat files of extracted information based on highly customizable, user-defined fields.

Collecting and repurposing information is at the heart of Splunk. Today, many plug-ins exist for collecting information from disparate operating systems and applications. The collected data is indexed, which then allows for fast search, retrieval, and manipulation of information. The data is further enhanced by separating the data stream into individual, searchable events; creating or identifying time stamps; extracting fields such as host, source, and source type; and performing user-defined actions, such as identifying custom fields, masking sensitive data, filtering unwanted events, and routing events to specified indexes or servers.
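The enrichment steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Splunk itself is implemented: the timestamp format, the 16-digit masking rule, the DEBUG filter, the index names, and the function name `process_stream` are all assumptions made for the example, and each line is treated as one event for simplicity.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only; not Splunk's own rules.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
CARD_RE = re.compile(r"\b\d{12}(\d{4})\b")  # 16-digit number, keep last 4


def process_stream(raw, host, source, sourcetype):
    """Split a raw text stream into events and enrich each one.

    Simplification: every non-empty line is treated as a single event.
    """
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Filter unwanted events (assumed rule: drop DEBUG lines).
        if " DEBUG " in line:
            continue
        # Identify the time stamp at the start of the line, if present.
        match = TIMESTAMP_RE.match(line)
        timestamp = None
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # Mask sensitive data: hide all but the last four digits.
        masked = CARD_RE.sub(lambda m: "x" * 12 + m.group(1), line)
        # Route the event to an index (assumed rule: logins go to "security").
        index = "security" if "login" in masked.lower() else "main"
        events.append({
            "time": timestamp,
            "host": host,
            "source": source,
            "sourcetype": sourcetype,
            "index": index,
            "raw": masked,
        })
    return events
```

Calling `process_stream(text, "web01", "/var/log/app.log", "app_log")` returns a list of dictionaries, one per retained event, each carrying the host, source, and source type fields alongside the masked text.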

What Splunk Does

From Indexes to Buckets

Splunk collects and stores indexes in directories called Buckets, each of which consists of the index files and the raw data. The index and raw data are moved through the bucket stages (Hot, Warm, Cold, Frozen, Thawed) based upon timing or capacity thresholds. The availability and purpose of the data change based upon the bucket in which it resides.

Bucket Stage | Description | Searchable
------------ | ----------- | ----------
Hot | Contains newly indexed data. Open for writing. One or more hot buckets for each index. | Yes
Warm | Data rolled from hot. There are many warm buckets. | Yes
Cold | Data rolled from warm. There are many cold buckets. | Yes
Frozen | Data rolled from cold. Splunk deletes frozen data by default; however, it can be archived in the frozen bucket. | No
Thawed | Restored archive (frozen) data. Doesn’t age off. | Yes
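The bucket stages in the table can be modeled as a small state machine. The sketch below is a simplified illustration under stated assumptions: the names `Stage`, `roll`, `thaw`, and `should_roll`, and the threshold parameters, are hypothetical and do not correspond to Splunk settings.

```python
from enum import Enum


class Stage(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"
    THAWED = "thawed"


# Per the table: every stage except Frozen is searchable.
SEARCHABLE = {Stage.HOT, Stage.WARM, Stage.COLD, Stage.THAWED}

# Order in which data rolls from one stage to the next.
ROLL_ORDER = [Stage.HOT, Stage.WARM, Stage.COLD, Stage.FROZEN]


def is_searchable(stage):
    return stage in SEARCHABLE


def should_roll(age_days, size_gb, max_age_days, max_size_gb):
    """Hypothetical trigger: roll when a time or capacity threshold is hit."""
    return age_days >= max_age_days or size_gb >= max_size_gb


def roll(stage):
    """Return the next stage. Frozen and thawed data do not roll further."""
    if stage in (Stage.FROZEN, Stage.THAWED):
        return stage
    return ROLL_ORDER[ROLL_ORDER.index(stage) + 1]


def thaw(stage):
    """Restore archived (frozen) data so it becomes searchable again."""
    if stage is not Stage.FROZEN:
        raise ValueError("only frozen data can be thawed")
    return Stage.THAWED
```

Modeling the stages explicitly makes the storage implications easy to see: data stays searchable until it reaches Frozen, and it becomes searchable again only after an explicit thaw.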

Lifecycle of Bucket Data

Hot / Warm Buckets require high-performance reads and writes, as this data generates a lot of random I/O. Cold Buckets require data integrity and capacity, but they still need to handle reads in the case of long search queries. Frozen Buckets also require data integrity and capacity, as the Frozen Bucket is the data archive; its data can be thawed when needed for a search.

Now that you have all that info, where do you store it?

BrickStor as a Solution

Splunk makes data aggregation easily available, but file counts can quickly grow into the billions, and data integrity must be protected over the long run. The storage system for all this accumulated data must scale to terabyte and petabyte capacities, offer efficient storage management, and protect against data corruption.

BrickStor uses ZFS, an open source technology, to help organizations implement high-performance yet cost-effective data storage solutions by taking advantage of features such as compression, inline deduplication, unlimited snapshots and cloning, and high availability support. Additional key features include:

• ZFS technology: the most scalable and flexible 128-bit file system
• Unlimited File Size
• Unlimited Snapshots
• Native Inline Deduplication & Compression
• Hybrid Storage Pools
• End-to-End Data Integrity (ZFS, checksumming, etc.)
• Heterogeneous Block & File Replication
• Block-level Mirroring
• Simplified Disk Management

ZFS Hybrid Storage Pools

ZFS uses robust, scalable technology with features not available in other file systems today. ZFS Hybrid Storage Pools (HSP) allow you to combine DRAM, SSDs, and spinning HDDs into an accelerated storage medium. These pools optimize performance for any given working set by minimizing I/O bottlenecks. In addition, by reducing read and write latency, they give users a system that outperforms legacy storage systems while having a much lower total cost of ownership. You can also have multiple pools within a single appliance, further reducing overall costs and simplifying management. When this concept is applied to applications such as Splunk, it enables end users to address Big Data without worrying about storage management.

checksumming. In applications such as Splunk, this level of flexibility enables users to create hybrid storage pools designed and tuned for a specific purpose or Splunk bucket. This further reduces total cost of ownership without sacrificing performance.

Specifically, for high performance and low latency, a user may consider a pool with striped vdevs and a read cache sized for the expected working set. For extreme performance, a professional may even consider an all-flash SSD tier. Both configurations are ideal for the Hot/Warm Bucket and index files. For the Warm and Cold Bucket, professionals should consider a RAID-Z2 configuration, which provides higher usable capacity than mirrored vdevs (the RAID-10 equivalent). Flexibility, including the ability to reconfigure disks and add caching later, is the key to near- and long-term success. The flexibility of Hybrid Storage Pools means that a forklift replacement is not required should the environment or business needs change in the future.
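The capacity difference between mirrored vdevs and RAID-Z2 comes down to simple arithmetic, sketched below. The function names are hypothetical, and the figures ignore file system metadata, padding, and reserved space, so real usable capacity will be somewhat lower.

```python
def mirror_usable_tb(disks, disk_tb, way=2):
    """Usable capacity of a pool of striped N-way mirrors.

    Each mirror vdev stores one disk's worth of data regardless of
    how many copies it holds.
    """
    if disks % way != 0:
        raise ValueError("disk count must be a multiple of the mirror width")
    return (disks // way) * disk_tb


def raidz2_usable_tb(disks_per_vdev, vdevs, disk_tb):
    """Usable capacity of a pool of RAID-Z2 vdevs.

    Each RAID-Z2 vdev gives up two disks' worth of space to parity.
    """
    if disks_per_vdev <= 2:
        raise ValueError("a RAID-Z2 vdev needs more than two disks to hold data")
    return (disks_per_vdev - 2) * vdevs * disk_tb


# Example with twelve 2 TB disks (an illustrative configuration):
print(mirror_usable_tb(12, 2))      # six 2-way mirrors
print(raidz2_usable_tb(6, 2, 2))    # two 6-disk RAID-Z2 vdevs
print(raidz2_usable_tb(12, 1, 2))   # one 12-disk RAID-Z2 vdev
```

With the same twelve disks, the mirrored layout yields 12 TB, two 6-disk RAID-Z2 vdevs yield 16 TB, and a single 12-disk RAID-Z2 vdev yields 20 TB. The mirrored layout trades that capacity for better random I/O, which is why it suits the Hot/Warm tier.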

Performance

RackTop’s tiered approach to storage allows for differing configurations to be used, depending on desired performance, capacity needs, speed of access required, and overall budgetary constraints. As the following table illustrates, Splunk users (and all others dealing with enormous volumes of data) can regulate or fine tune their hardware configuration to meet their specific demands.

Splunk Bucket Performance in Regards to Hardware1 On SSDs

So, for example, one solution might be to use SSDs for Hot and Warm Splunk indexes while employing spinning media for Cold. This approach becomes even more cost-efficient relative to traditional disk drives if the price per GB of SSDs keeps falling.
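A rough way to size such a split is to multiply the daily on-disk volume by the number of days data spends in each tier. The sketch below assumes retention is expressed in days per tier and that a single on-disk ratio (stored size divided by raw ingest) applies throughout; the function name and every input value are hypothetical and should be replaced with measurements from your own environment.

```python
def tier_capacity_gb(daily_ingest_gb, on_disk_ratio, hot_warm_days, cold_days):
    """Estimate capacity per tier for an SSD (Hot/Warm) plus HDD (Cold) layout.

    on_disk_ratio is the assumed stored size divided by the raw ingest size,
    covering both compressed raw data and index files.
    """
    daily_on_disk_gb = daily_ingest_gb * on_disk_ratio
    return {
        "ssd_hot_warm_gb": daily_on_disk_gb * hot_warm_days,
        "hdd_cold_gb": daily_on_disk_gb * cold_days,
    }


# Illustrative inputs only: 100 GB/day ingested, half that size on disk,
# 30 days in Hot/Warm, and a further 335 days in Cold.
sizes = tier_capacity_gb(100, 0.5, 30, 335)
print(sizes)
```

Because the Hot/Warm window is usually much shorter than the Cold window, the SSD tier stays comparatively small, which is what makes the tiered layout attractive.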

Benefits of Splunk with SSDs

Legend for graph:

• 7200: 2×4 2.40GHz, 16GB, 12x2TB 7200 RPM SATA, RAID 10
• 10k: 2×6 2.677GHz, 48GB, 4x900GB 10K RPM SAS, RAID 10
• 15k: 2×6 2.667GHz, 12GB, 6x146GB 15K RPM SAS, RAID 10
• SSD: 2×4 2.40GHz, 16GB, 1x240GB (same as 7200 w PCIe SSD)

Hardware

Three machines were used for this benchmark, classified by disk speed. CPU and memory were not identical. [2]

[2] Source: http://blogs.splunk.com/2012/05/10/quantifying-the-benefits-of-splunk-with-ssds/

For applications that generate huge volumes of data, such as Splunk, RackTop BrickStor is uniquely positioned to address the resulting storage needs: it provides the only product that enables a flexible and scalable tiered storage architecture across storage access protocols. BrickStor offers almost limitless scalability, with the ability to expand both capacity and performance over time. It is a ZFS storage solution based on open source technology that provides future-proofing through ease of scalability, backed by an IT partner that acts as an extension of your internal IT team.