Issue 4. Near Real-time Analytics in the Bigdata Ecosystem. Featuring research from


Contents

Introduction

Applying the Big Data Ecosystem

Near Real-Time Analytics with Hadoop, Storm, Druid, and Lambda Architecture

About Cybage

Introduction

In the last few years, enterprises have realized the importance of data and information in real time. Real-time analytics involves on-demand access to business-specific information. Businesses require a snapshot of what is happening in the business right now. Access to data in real time, and the ability to query it, gives businesses real-time insight, which in turn helps them make quick decisions.

However, real-time processing poses unique challenges, because real-time data streams need more advanced processing technologies.

The current enterprise management systems are designed to process and stream real-time data and also act on multiple data streams at the same time.

Source: Cybage

Near Real-time Analytics in the Bigdata Ecosystem is published by Cybage. Editorial supplied by Cybage is independent of Gartner analysis. All Gartner research is © 2014 by Gartner, Inc. All rights reserved. All Gartner materials are used with Gartner’s permission. The use or publication of Gartner research does not indicate Gartner’s endorsement of Cybage’s products and/or strategies. Reproduction or distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice. Although Gartner research may include a discussion of related legal issues, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company, and its shareholders may include firms and funds that have financial interests in entities covered in Gartner research. Gartner’s Board of Directors may include senior managers of these firms or funds. Gartner research is produced independently by its research organization without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner research, see “Guiding Principles on Independence and Objectivity” on its website, http://www.gartner.com/technology/about/ombudsman/omb_guide2.jsp.

From the Gartner Files:

Applying the Big Data Ecosystem

The goal of any big data initiative is to impact business outcomes positively. After determining the vision and strategy, IT leaders need to develop their technology approach.

Understanding the differences in available technologies and how they interact is crucial to maximizing big data investment.

Key Challenges

• Organizations must determine the right mix of technologies to form their big data ecosystems.

• Expanding your big data applications beyond traditional components requires new skills and analytical approaches.

• Capitalizing on big data opportunities requires organizations to consume data sources with varying levels of veracity, consistency and structure.

Recommendations

IT leaders should:

• Assess the full range of big data technologies, including in-memory technologies, streaming computation engines and graph databases, to complement new or existing big data initiatives.

• If new technologies are necessary to add value, engage with proven vendors and service providers to assist with architecture, implementation and skills transfer to address internal talent and skill gaps.

• Research available data sources, particularly unstructured sources or sensor data, to see if they can be included in an existing or new big data initiative to support additional business objectives.

Introduction

Big data initiatives (see Note 1) promise to allow businesses to make better decisions, discover hidden insights and automate some business processes that could otherwise not be automated. The challenge facing enterprises is how to obtain these benefits.

Information management leaders must select and incorporate various technologies to deliver on the promise of big data. These technologies exist in a complex, interlocking set of offerings from commercial and open source providers. And the applicable set of technologies varies based on use case.

Conducting analytics on massive amounts of data at rest is the classic story of big data. Accomplishing this requires a large amount of storage capacity, as well as a method to execute analysis in parallel.

These efforts may be to derive predictive models for facets of your business, such as determining customer loyalty based on purchase activity, or to explore factors impacting supply chain effectiveness.

Once these models are created, they must be applied. Carrying over a previous example, a customer loyalty model can be actively applied to existing data to determine which of your customers are likely to defect, allowing you to begin selective retention efforts. However, this periodic batch approach may not be proactive enough to prevent defections. Combining a batch processing method to flag at-risk customers with real-time data, such as call center interactions, new purchase history or usage activity, allows enterprises to be proactive with retention efforts.

Determining how items are related to each other is another problem falling into the domain of big data. The desired business outcome may be to determine influencers in a network of people, how people are connected, or how an event associates to a specific location. Unlike previous use cases, these types of problems require highly connected, or “linked,” data and specialized technologies to uncover insights.

Another common task organizations face is transforming new or existing data either in bulk or as the data enters the data center. This extraction, transformation and loading process may be used to migrate data between data stores, such as between a relational database and a NoSQL database or distributed file system. Data in flight may be transformed or enriched to facilitate downstream processing or in-process analytics.

Deriving value from big data likely involves several sources of data with varying levels of structure and relationships. The big data technology ecosystem will be driven in part by the physical and logical attributes of data in combination with the desired business outcome. Several different technologies must be combined when multiple outcomes are desired, such as real-time data processing and interactive data exploration.

Analysis

Select the Right Ecosystem Technologies

The big data ecosystem consists of multiple components, each designed to process data in different ways or at different points in its life cycle. The most commonly used components in this technology ecosystem include:

• Distributed batch processing.

• Complex-event processing (CEP) and distributed stream computing platforms.

• In-memory databases and data grids.

• Graph databases.

This section describes each of these technologies and how they complement and support the broader big data technology ecosystem.

Distributed Batch Processing for Data at Rest

Distributed batch processing is an ideal technology choice when the use case is driven by processing massive amounts of data at rest.

Distributed processing paradigms, such as MapReduce, enable some of these benefits by processing massive amounts of structured and unstructured data sources in support of new analytical capabilities.

Examples of the types of tasks supported by distributed processing include:

• Conducting analytics on large amounts of email traffic to indicate trending project success or failure.

• Creating an insurance fraud model based on decades of structured and unstructured claim information.

• [...] sources to get a better assessment of investment risk.

The most popular framework supporting distributed batch processing is Apache Hadoop. Apache Hadoop is a framework consisting of a number of components enabling distributed processing across a cluster of servers, or nodes. While the majority of these components aren’t required for most use cases, the core framework consists of the MapReduce and Hadoop Distributed File System (HDFS) components. HDFS stores and replicates data across the cluster and is accessed by MapReduce. The MapReduce component splits the application workload into fragments, which may be executed on one or more nodes (see Note 2). Both components are designed to gracefully handle failure of one or more nodes.
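The split into map and reduce steps can be illustrated with a minimal, single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the fragments run in sequence here, whereas a real cluster would assign them to different nodes.

```python
from collections import defaultdict


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values collected for one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment could be mapped on a different node; here they run in sequence.
    mapped = [pair for fragment in fragments for pair in map_phase(fragment)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data at rest", "data in motion", "big clusters"])
print(counts["big"], counts["data"])  # prints: 2 2
```

Because each fragment is mapped independently and each key is reduced independently, either step can be spread across nodes without the workers sharing state.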

In addition to the Apache Hadoop framework, several NoSQL databases offer MapReduce functionality in varying capacities. These databases often provide MapReduce as an extension to native query capabilities, making them potentially easier to use than Apache Hadoop. However, these alternative MapReduce implementations may have limitations imposed by the underlying storage mechanism.

Despite these capabilities, Hadoop is limited by its batch-oriented nature. Depending on the environment and use case, it may take a Hadoop task anywhere from minutes to days to return a result or analytical decision.

If the result isn’t available in time for the business to gain a positive outcome, the effort is wasted and the opportunity is lost.

For example, detecting and preventing a fraudulent purchase at transaction time is more valuable than detecting and resolving it after the fact. The model used to detect fraud can be created in Hadoop and applied in real time using supporting technologies, such as complex-event processing or distributed stream computing platforms.

Complex-Event Processing and Distributed Stream Computing Platforms for Real-Time Analytics and Monitoring

Real-time monitoring and analytics processes data in motion for comparison against one or more analytical models developed from previously collected data. Common processing methods include CEP and distributed streaming computation platforms (DSCP). Of the two, CEP applications are far more commonly available from vendors, with distributed streaming computation beginning to gain relevance.

Complex-event processing solutions consume inbound events from one or more input sources to provide distilled information about the current conditions of an enterprise. Processing is triggered when an event is received, supporting near real-time applications. This allows enterprises to identify outlying conditions and triage them as they occur.
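As a rough illustration of event-triggered processing, the sketch below evaluates one rule each time an event arrives: it fires when the same source produces a given number of events within a time window. The class name `ThresholdRule`, its parameters and the meter identifiers are hypothetical, and the sketch assumes events arrive in timestamp order. Commercial CEP products offer far richer rule languages.

```python
from collections import defaultdict, deque


class ThresholdRule:
    """Flags a key once it produces `limit` events within `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.recent = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Called once per inbound event; returns True when the rule fires."""
        times = self.recent[key]
        times.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.limit


rule = ThresholdRule(limit=3, window=60)
events = [("meter-7", 0), ("meter-7", 20), ("meter-2", 25), ("meter-7", 50), ("meter-7", 200)]
alerts = [(key, ts) for key, ts in events if rule.on_event(key, ts)]
print(alerts)  # prints: [('meter-7', 50)]
```

The rule keeps only the small amount of state it needs per key, which is the kind of contextual information a CEP engine has to hold between events.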

Traditional arenas leveraging CEP include financial services and utilities, but this technology is rapidly moving into retail as well as into operational technology (OT), which encompasses events generated by the numerous sensors deployed in static and dynamic locations. Examples of applied OT include smart power meters, remote healthcare diagnostics and condition-based maintenance.

Of the technologies described, distributed streaming computation platforms are conceptually similar to the MapReduce model used by Apache Hadoop. Distributed streaming computing platforms (DSCPs) operate across a cluster of servers, executing user-defined functions on a continuous stream of data. These platforms are ideal for a range of use cases, such as performing a continuous query, executing computationally expensive operations in parallel or maintaining a database based on inbound events. DSCPs are distinct from CEPs in that [...] events and reason about an outcome. DSCPs generally lack this functionality.

Historically, complex-event processing required “stateful” and contextual information to be effective. This shared state model limited the ability of CEP products to scale horizontally. CEP products such as IBM’s InfoSphere Streams, HStreaming and Vitria CEP have evolved to include the massive parallelism available in DSCP frameworks like Storm and Apache S4.¹

Complex-event processing and distributed streaming computation complement Hadoop deployments as well as business intelligence dashboards. Since complex-event and distributed stream computing platforms have access to data as it enters the data warehouse, they can preprocess and optionally enrich data before sending it on. Intermediate results from CEP or DSCP, commonly called rollups, can be passed into HDFS or another data store, such as an in-memory DBMS, to power analytics dashboards and visualization tools.
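A rollup can be as simple as a count per time bucket. The sketch below assumes events are `(timestamp_seconds, event_type)` tuples and groups them by minute; the function name `rollup_per_minute` and the event shape are illustrative choices, not part of any product API.

```python
from collections import Counter


def rollup_per_minute(events):
    """Count events per (minute_start, event_type) bucket.

    `events` is an iterable of (timestamp_seconds, event_type) tuples.
    """
    counts = Counter()
    for timestamp, event_type in events:
        minute_start = int(timestamp // 60) * 60  # start of the minute the event falls in
        counts[(minute_start, event_type)] += 1
    return dict(counts)


stream = [(5, "click"), (42, "click"), (61, "view"), (119, "click")]
print(rollup_per_minute(stream))
```

The resulting dictionary is much smaller than the raw stream, which is what makes it practical to load into a dashboard-facing data store.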

In-Memory Databases and Data Grids for Interactive and High-Speed Analytics

In-memory technologies are another component in a broader big data ecosystem.

In-memory databases (IMDBs) store data and data structures entirely in memory instead of on spinning disk, providing faster access times with more predictable performance. In-memory data grids (IMDGs) are distributed, in-memory data stores aimed at high-performance, high-scale, data-intensive applications. The distinction between IMDBs and IMDGs is that IMDGs can be thought of as a distributed cache, without the underlying relational structure present in IMDBs.

Big data has helped drive interest in IMDBs and IMDGs and several options are available from both new and established vendors.

Due to their inherent speed, in-memory technologies supplement Apache Hadoop solutions by supporting ad hoc analytics and real-time analytics processing. Inserting the results of MapReduce tasks, CEP applications and streaming computation engines into an in-memory data store allows business users to explore data in aggregate without the delay of batch processes or overhead of engaging with IT to create and execute new analytics programs.

For example, business analysts can interactively explore shopping cart contents in real time and experiment with different analytical models to determine a more effective marketing mix. Performing this process with iterative batch methods may take days or weeks to complete. A number of IMDBs and IMDGs also allow other applications to easily access results through standard interfaces, such as ODBC or JDBC, instead of having to integrate with HDFS.

Graph Databases for Networks of Data

Analytics problems solved in Hadoop tend to be those which are readily parallelized.

Parallel processing is best accomplished when relationships between disparate data sources don’t exist since the sharing of state impacts processing performance. But analyzing unrelated data only provides part of the solution.

Interconnected data, such as communication patterns, social networks and biological interactions, requires a different view of data, which is where graph databases come in.² Graph databases represent data as a network of nodes connected by edges.

This representation allows you to readily determine and qualify connectivity between entities. A node may represent an office, a member of a social network or a buyer on an auction site. Edge properties describe attributes connecting nodes.

Example properties include the physical distance between offices, the frequency of communication between departments or the purchase affinity between a buyer and an item. Graph databases support analytical processing which may be computationally expensive or impossible with other data representations.
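The node-and-edge representation can be sketched with a small in-memory structure. This is an illustration only, not how any particular graph database stores data; the `Graph` class, the person names and the `messages_per_week` property are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """A tiny in-memory property graph: nodes joined by edges that carry attributes."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {neighbor: edge properties}

    def connect(self, a, b, **properties):
        # Store the edge in both directions so the graph is undirected.
        self.edges[a][b] = properties
        self.edges[b][a] = properties

    def neighbors(self, node):
        return set(self.edges.get(node, {}))

    def most_connected(self):
        """Return the node with the most edges, a crude proxy for an influencer."""
        return max(self.edges, key=lambda node: len(self.edges[node]))


g = Graph()
g.connect("ana", "ben", messages_per_week=12)
g.connect("ana", "cho", messages_per_week=3)
g.connect("ana", "dev", messages_per_week=7)
g.connect("ben", "dev", messages_per_week=1)
print(g.most_connected())                     # prints: ana
print(g.edges["ana"]["ben"]["messages_per_week"])  # prints: 12
```

Questions such as "who is connected to whom, and how strongly" become direct lookups here, whereas in a tabular layout they would need repeated joins.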

Recommendation:

• Assess the full range of big data technologies, including in-memory technologies, streaming computation engines and graph databases, to complement new or existing big data initiatives.

Develop Necessary New Skills

The technologies described above have not yet seen wide adoption for various reasons.

Complex-event processing applications typically have unique deployment and operational considerations relative to other application types. To be effective and determine which events are relevant to the business, CEP systems are effectively repositories for business logic. CEP also requires substantial coordination between business users and the information technology team. Codifying and implementing such logic is not trivial and can impact deployment.

As the newest, and consequently least mature, product, distributed streaming computation platforms have not seen wide adoption. DSCPs typically provide little more than a framework for performing real-time computation in a distributed fashion, leaving enterprises to perform the task of implementation themselves. Platforms also vary in areas such as robustness, manageability and durability. For example, some DSCPs consider it acceptable to lose data when a compute node fails. This may not be a problem if the cluster is consuming millions of events for a use case related to search advertising personalization. However, this may be unacceptable in high-frequency trading scenarios.

Graph databases present a separate and distinct set of challenges. The value of graphs is in their inherent connectedness. As such, it is difficult to distribute, or shard, components of a graph among a network of servers as graphs grow larger. The complexity of managing a large, distributed graph requires substantial operational considerations that may not be effectively supported by various vendors of graph databases. Graph databases also have unique design considerations that differ from skills found in most IT organizations.

Additionally, the long-standing relationship between computer science and advanced mathematics is more prevalent in these solutions, indicating that training in statistics and graph theory may be appropriate.

Integrating, deploying and managing these technologies require skills that many IT departments may lack. Address these shortcomings by working with your existing vendors on proofs of concept, either on-premises or in the cloud, to validate business value as well as to develop internal expertise.

Recommendation:

• If new technologies are necessary to add value, engage with proven vendors and service providers to assist with architecture, implementation and skills transfer to address internal talent and skill gaps.

Understand Your Data Landscape

A recommended practice for new big data projects is to leverage existing data sources available in-house. Commonly referred to as “dark data,” these data sources are typically stored as a part of normal business activity but aren’t utilized for analytics or monetization. Dark data includes emails, contract data, customer service calls, a variety of system logs and event streams, among other sources. Leveraging dark data allows information management leaders to extract value from an existing resource and minimize project startup cost and risk since the data is readily available.³

Big data also brings capabilities to analyze new types of data that were previously unavailable to enterprises due to the nature of the data itself (such as unstructured data) and/or the sheer volume of data (sensor feeds or massive government public datasets). Information management teams can’t just assume that tapping into these data sources will provide value. Each data source will have different levels of volume, variety and velocity.

[...] may require different combinations of technologies from the big data ecosystem.

For example, an enterprise may analyze social network data using an in-memory database to detect changing brand sentiment after a press release. The same social data can be processed in Hadoop to find insights that would cause the firm to update supply chain forecasting models. By analyzing demand indicated in social media, an enterprise can refine forecasts and update suppliers with new information.

Information management leaders should apply innovative thinking and consider how these new sources of data can improve their enterprise, for example:

• Social data. Social network data (Facebook, Twitter, YouTube) provides marketing departments with rich insight on customer service trends and brand positioning.

• [...] creating a huge new source of data. Some of the feeds, such as sensor data, are publicly available (such as Dr. Foster Intelligence). Things data will also come from operational assets such as shop floor manufacturing equipment or healthcare assets such as patient monitoring equipment.

• Raw data. In some cases already captured data can be made richer. For example, point-of-sale data coming from cash registers is often stripped of “non-essential data” such as cashier ID before it is stored in a data warehouse. Enterprises using big data technologies to keep the raw data, in this case keeping cashier ID, have been able to unearth new analytics such as cashier productivity by store.

• Public data sources. An increasing number of public sources of data now have APIs to access data (created by freely available public or commercial implementations). Examples include Data.gov.uk, public financial trade listings, earnings call transcripts, OECD, World Bank and healthcare datasets (for example, Stanford’s HIVDB for listing HIV drug resistance test results).

[...] benchmark industry data are also rich sources of data for comparison purposes. For example, Dr. Foster Intelligence provides operational benchmarks for U.K. healthcare facilities.

• Context data. There are many data providers that can provide additional context data to augment an enterprise’s data. For example, marketing services providers have long had databases about demographic and behavioral information on households/consumers. However, context data goes well beyond marketing data: location, credit score, traffic speed, current weather information and time of day are all data sources that can provide additional context in the right situation.

Recommendation:

• Research available data sources, particularly unstructured sources or sensor data, to see if they can be included in an existing or new big data initiative to support additional business objectives.

Evidence

¹ Distributed Stream Computing Platform

² “Advanced Analytics Enables Real-Time Business Optimization”

³ Gartner client inquiries


Note 1

Defining Big Data Initiatives

A big data initiative represents an effort undertaken by an organization to benefit one or more aspects of the business through the use of innovative technologies to process data that has massive volume, variety or velocity.

Note 2

The Role of Statistical Primitives in Big Data

Map and Reduce, while often treated as one function, are actually two separate and distinct statistical “motifs” or primitives. Statistical primitives are operators that scale linearly in binary numbers, which means that, other than physics limitations of transport and electrons, they scale almost perfectly in computing systems. A well-designed primitive process can therefore be supported by simply adding more processors until interference from physical barriers overwhelms the scaling capability.

There are currently four statistical primitives that are either already extensively utilized in computing environments or are growing rapidly. While complex in statistical form, functional descriptions of these four primitives can be simplified for immediate understanding. Join is a statistical operative that answers the argument “does left side exactly equal right-side” of the equation. Join is the basis for all relational queries whether true or false and is the basis for relational data science.

Map answers the argument “when not equal is there a valid proxy or designated approximate equivalent.” Map puts data into more dynamically defined buckets and creates substitute values for Join. Reduce answers the argument “what is the best statistical representation of a group based upon quantitative or qualitative parameters.” It is more complex and requires that the programmer understand how even innocuous appearing parameters can create outcome bias — and is the reason that more advanced understanding of statistical analysis is required of data scientists.

Graph is the most advanced of the four primitives currently utilized in analytics. Graph utilizes Join, Map and Reduce to create “nodes” and determine “edges” of those nodes. The parameters of the Map and Reduce statements are evaluated to determine the affinity weight of adjacent, near and distant nodes and a numeric score is added to the possible Joins.

Graph databases store the results of such analysis in order to provide optimal performance on the second and all future Graph analyses of that same dataset. Graph analytics must be repeated every time new data is added to the dataset and as a result, in-line scoring of data as it moves through the data bus is highly preferred.

Gartner RAS Core Research Note G00252014, Nick Heudecker, Hung LeHong, 19 July 2014

Near Real-Time Analytics with Hadoop, Storm, Druid, and Lambda Architecture

The following sections in this document highlight the various components in Big Data ecosystems along with the technologies that are best suited for processing data at rest and streaming data.

We also explore the Lambda Architecture, which combines these technologies into an optimized solution for real-time analytics.

Over time, technology stacks have been developed that combine with Hadoop’s distributed batch processing capabilities to address Hadoop’s limitations in real-time analytics.

Some of the following technologies are used in combination to provide highly flexible and reliable real-time analytics solutions.

Kafka

Apache Kafka is a fast, scalable, and durable publish-subscribe messaging system that is ‘distributed by design’. Its durability and fault-tolerance guarantee is based on a modern cluster-centric design.

A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers.

Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without any impact on the performance.
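The idea of partitioning a stream by key can be sketched without a Kafka cluster. The example below is not the Kafka client API: `choose_partition` and `partition_stream` are hypothetical helpers, and `zlib.crc32` is only a stand-in for whatever hash function a real client library uses. It shows one property that partitioning relies on: messages with the same key always land in the same partition, in the order they were sent.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key to a partition; equal keys always land together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_stream(messages, num_partitions):
    """Spread (key, value) messages over per-partition append-only logs."""
    logs = [[] for _ in range(num_partitions)]
    for key, value in messages:
        logs[choose_partition(key, num_partitions)].append((key, value))
    return logs


logs = partition_stream([("user-1", "login"), ("user-2", "view"), ("user-1", "logout")], 4)
for number, log in enumerate(logs):
    print(number, log)
```

Because each partition is an independent log, different machines can store different partitions and different consumers can read them in parallel.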

Storm

Apache Storm is a free, open-source, distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, and it can be used with any programming language.

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and others.

Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable and fault-tolerant, it guarantees that data will be processed, and it is easy to set up and operate.

Storm integrates with the queuing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation as needed.
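The shape of a topology, with a source feeding successive processing stages and the stream repartitioned between them, can be sketched with plain Python generators. This is not the Storm API; the function names are hypothetical, the input is a short list rather than an unbounded stream, and the "tasks" are simply counters in one process.

```python
from collections import Counter


def sentence_source(sentences):
    """Stands in for a stream source; a real stream would be unbounded."""
    for sentence in sentences:
        yield sentence


def split_words(stream):
    """Stage 1: turn each sentence into one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_words(stream, num_tasks):
    """Stage 2: repartition by word so each task owns a disjoint set of keys."""
    tasks = [Counter() for _ in range(num_tasks)]
    for word in stream:
        tasks[hash(word) % num_tasks][word] += 1
    return tasks


tasks = count_words(split_words(sentence_source(["storm kafka storm", "kafka druid"])), 3)
totals = sum(tasks, Counter())
print(totals["storm"], totals["kafka"], totals["druid"])  # prints: 2 2 1
```

Routing every occurrence of a word to the same task is what lets each task keep a correct running count without coordinating with the others.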


Spark

Spark, an Apache project, is an open-source, distributed computing framework for advanced analytics in Hadoop. Originally developed as a research project at UC Berkeley’s AMPLab, Spark seeks to address the critical challenges to advanced analytics in Hadoop in the following ways:

1. Spark is designed to support in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data. This enables true high-performance advanced analytics. For techniques such as logistic regression, project sponsors report runtimes in Spark a hundred times faster than what they are able to achieve with MapReduce.

2. Spark offers an integrated framework for advanced analytics, including a machine learning library (MLlib), a graph engine (GraphX), a streaming analytics engine (Spark Streaming), and a fast interactive query tool (Shark). This eliminates the need to support multiple point solutions, such as Giraph, GraphLab, and Tez for graph engines; Storm and S4 for streaming; or Hive and Impala for interactive queries. A single platform simplifies integration and ensures that users can produce consistent results across various types of analysis.

Druid

Druid is an open-source infrastructure for real-time exploratory analytics and one that supports fast, ad-hoc queries on large-scale data sets.

Real-time ingestion

Real-time data. Typical analytics databases ingest data through batches. Ingesting one event at a time is often accompanied by transactional locks and other overheads that slow down the ingestion rate. Druid’s real-time nodes employ lock-free ingestion of append-only data sets to allow simultaneous ingestion and querying of more than 10,000 events per second. Simply put, the latency between when an event happens and when it is visible is limited only by how quickly the event can be delivered to Druid.

Scalable

In-memory or on-disk. Druid leverages the memory mapping capabilities of modern operating systems to allow only relevant data to be loaded into memory while the rest can live on disk. This means that if your performance requirements dictate that the data must be in memory, you can configure each node to only accept an amount of data that is equivalent to the available memory, and it will all be in memory.

If you are okay with only having the working set in memory, each node can hold more than just the working set on a given machine, and the requisite data will be swapped into memory on demand.


Highly available. Scaling up or down, replicating nodes, or recovering from failure typically impacts availability and performance. Druid uses a distributed architecture that allows replication at the segment level, relieving the load on ‘hot segments’. And, because of replication, Druid supports rolling deployments and restarts. Scale up or scale down just by adding or removing nodes—it’s that easy and no data has to be re-processed or re-indexed, just re-replicated.

Real-time queries

Ad hoc, multi-dimensional filtering. Druid maintains bitmap indexes compressed using CONCISE to determine which data it has to look at before it ever starts looking at data. This significantly speeds up ad hoc filtered queries, even OR filters, which are traditionally slow. All this is made possible without a significant impact on data footprint.

Column-oriented for speed. Data is laid out in columns so that scans are limited to the specific data being searched. Compression decreases the overall data footprint.
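To see why bitmap indexes make filters cheap, consider the uncompressed sketch below. It is not Druid code and omits CONCISE compression entirely: each distinct column value maps to a Python integer whose set bits mark the matching rows, so AND and OR filters reduce to single bitwise operations. The helper names and the sample rows are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int whose set bits mark matching rows."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Expand a bitmap back into the list of row ids it selects."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"country": "UK", "device": "mobile"},
    {"country": "IN", "device": "desktop"},
    {"country": "UK", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")

print(matching_rows(by_country["UK"] | by_country["IN"]))      # prints: [0, 1, 2]
print(matching_rows(by_country["UK"] & by_device["desktop"]))  # prints: [2]
```

The filter is resolved entirely on the index; only the rows it selects ever need to be read from the column data.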

What does Real²time mean?

Real²time reflects the fact that Druid encompasses both the common definitions of real time in the data processing space.

Real-time queries refer to responsive or interactive queries; that is, you have your data and want to be able to ask questions pertaining to the data quickly.

Real-time ingestion refers to ingesting data and making it available for querying in real time; that is, minimizing the latency between when an event occurs and when it is reflected in your query results.

Putting it all together

Certain high-level architectural constructs can help us mentally visualize how the various types of applications mentioned previously fit into the Big Data architecture and how some of these technologies can be leveraged to transform the existing enterprise software landscape.

Lambda Architecture is a useful framework to think about designing Big Data applications. Nathan Marz designed this generic architecture addressing the common requirements of Big Data based on his experience while working on distributed data processing systems at Twitter.

Lambda Architecture has three major components.

1. Batch layer: This layer provides the following functionalities:

i. Managing the master dataset—an immutable, append-only set of raw data.

ii. Pre-computing arbitrary query functions, which are called batch views.

2. Serving layer: This layer indexes the batch views so that they can be queried ad-hoc, with low latency.

3. Speed layer: This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the Speed layer deals with recent data only.
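The three layers above can be sketched in a few lines, assuming a page-view count as the query function. The event tuples, the SpeedLayer class, and the query function are hypothetical names for illustration. Merging the batch view with the real-time view at query time is an assumption of this sketch, not a detail stated in the text.

```python
from collections import Counter

# Hypothetical events of the form (timestamp, page). The master dataset is append-only.
master_dataset = [(1, "home"), (2, "cart"), (3, "home")]


def compute_batch_view(events):
    """Batch layer: recompute page-view counts from the full master dataset."""
    return Counter(page for _, page in events)


class SpeedLayer:
    """Speed layer: incrementally counts only events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        _, page = event
        self.realtime_view[page] += 1


def query(batch_view, realtime_view, page):
    """Serving side: answer a query by combining the batch view with the real-time view."""
    return batch_view[page] + realtime_view[page]


# A batch run over the data available so far.
batch_view = compute_batch_view(master_dataset)

# New events are appended to the master dataset (for the next batch run)
# and are also handed to the speed layer so they are queryable immediately.
speed = SpeedLayer()
for event in [(4, "home"), (5, "checkout")]:
    master_dataset.append(event)
    speed.on_event(event)

home_views = query(batch_view, speed.realtime_view, "home")
checkout_views = query(batch_view, speed.realtime_view, "checkout")
```

When the next batch run completes, its view already includes the newer events, so the real-time view can be reset; the speed layer only ever has to cover the gap since the last batch run.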

Each of these layers can be realized using various Big Data technologies. For instance, the Batch layer datasets can be stored in a distributed file system, while MapReduce can be used to create batch views that are fed to the Serving layer. The Serving layer can be implemented using NoSQL technologies such as HBase, while querying can be handled by technologies such as Druid. Finally, the Speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming.

MapReduce allows high-speed streaming data to be written directly to the Hadoop storage layer, while allowing stream-processing applications such as Storm or Spark Streaming to run as an independent service within the cluster. The processing application now becomes more of a subscriber to the incoming data feed. If a failure occurs and the original application goes down, a new instance of the application can pick up the data stream within seconds of where the original instance dropped off. An added advantage of this architecture is that the streaming data is also available to the Batch and Serving layers.
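The failure recovery described above can be sketched with a persisted read offset, assuming the incoming feed is stored as an ordered log. The CheckpointStore and StreamProcessor classes are hypothetical and held in memory for illustration; they are not the API of Storm, Spark Streaming, or Hadoop.

```python
class CheckpointStore:
    """Hypothetical durable store holding the offset of the next unprocessed event."""

    def __init__(self):
        self.offset = 0

    def save(self, offset):
        self.offset = offset

    def load(self):
        return self.offset


class StreamProcessor:
    """Subscriber that reads the stored feed starting from the last saved offset."""

    def __init__(self, log, checkpoints):
        self.log = log
        self.checkpoints = checkpoints
        self.processed = []

    def run(self, max_events=None):
        """Process events in order; `max_events` simulates a crash after that many events."""
        offset = self.checkpoints.load()
        handled = 0
        while offset < len(self.log):
            if max_events is not None and handled >= max_events:
                break
            self.processed.append(self.log[offset])
            offset += 1
            handled += 1
            self.checkpoints.save(offset)  # record progress after each event
        return self.processed


# The feed as written to the storage layer (hypothetical events).
feed = ["e0", "e1", "e2", "e3", "e4"]
checkpoints = CheckpointStore()

# The original instance handles two events and then goes down.
first_result = StreamProcessor(feed, checkpoints).run(max_events=2)

# A new instance resumes from the saved offset rather than from the start.
second_result = StreamProcessor(feed, checkpoints).run()
```

Because the feed lives in storage rather than inside the processing application, the replacement instance needs only the saved offset to continue.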

Lambda Architecture lets us build data pipelines for both real-time data and delayed data, a mix that is common in real-life scenarios. One pipeline can run on Storm or Spark Streaming and the other on Hadoop MapReduce. The output of both pipelines can be loaded into Druid.

This combination of technologies is flexible and fast. It can handle a wide variety of data processing requirements, real-time or delayed, and huge query loads to provide real-time analytics. Each piece of this technology stack is independently capable of handling specific tasks very well. A well-orchestrated architecture as described can combine the best of these technologies and aid in handling complex event processing, with events running into billions, in real time.

Source: Cybage

Founded in 1995, Cybage Software is a leading offshore software services company, offering solutions that accelerate, simplify and enrich business processes to give its clients an edge over competition.

We are an SEI-CMMI-DEV v1.3, Level 5 and ISO 27001 company based in Pune, India. Our success is built on a pool of 5,000 software professionals. Based on a remarkable record of quality, consistency and outstanding technological prowess, we have partnered with more than 200 global software houses of fine repute. Our array of services includes Outsourced Product Development (OPD), enterprise business solutions and value-added services. Cybage specializes in the implementation of the Offshore Development model.

Our domain expertise spans several business verticals such as Media and Entertainment, Travel and Hospitality, Healthcare and Life Sciences, Retail and Distribution and Hi-Tech. Cybage has eight defined, technology-focused Centres of Excellence (CoEs): E-commerce, Enterprise Mobility, Customer Relationship Management, Business Intelligence, Enterprise Content Management, Cloud Computing, E-learning and Supply Chain Management (SCM). Our unique model of operational efficiency, ExcelShore®, helps de-risk our approach and provide the best value per unit cost.

About Cybage’s Business Intelligence Center of Excellence

Cybage has deep domain understanding and expertise, built on years of focused knowledge assimilation and extensive experience in providing full life cycle BI services and BIGDATA solutions. The BI Center of Excellence at Cybage comprises domain and technology experts who provide end-to-end offerings for development, customization, testing and maintenance solutions across industry verticals.

Business Intelligence COE Quick Facts:

• 7,000 person-months of experience in building end-to-end BI services, right from strategy to implementation

• 170+ Professionals comprising Solution Architects, Business Analysts and Functional Experts

• Key areas of expertise – Data Integration, Reporting and Analytics, Product Engineering and BI-DWH Consultation

• Extensive experience in BIGDATA – Integration with Hadoop and Hive

• BIGDATA visualization using MicroStrategy and QlikView

• Industry leader partnerships: AWS, Cloudera, VoltDB, MongoDB, and SAP

About Cybage

Headquarters:

Cybage Software Pvt. Ltd.

Cybage Towers,

Survey No. 13A/1+2+3/1, Vadgaon Sheri Pune 411014 India

Phone: +91-20-6604-1700 E-mail: [email protected]

Additional Development Centers:

• India: Pune, Hyderabad, Gandhi Nagar

• The United States: Redmond, Washington

International Sales and Development Centers