Online analytical processing (OLAP), which was so popular in the last decade of the last century and the first decade of this century and beyond, got sidelined by the big data tools expansion, but it is making a comeback, by reinventing itself for the world of big data. There were two reasons why OLAP became so popular.
1. As data sets were growing in size and relational databases were becoming bloated and slow in query response,
especially for analytical queries, database engineers invented the concepts of OLAP and cubes to perform analytical queries with lower latency.
2. It provided the right data structures to do multidimensional querying, especially for complicated analytic queries, with each OLAP provider having its own querying language, until providers came up with a standard language called MDX (Multidimensional Expressions).
One of the reasons OLAP slowly started losing its value, especially as data sets started getting bigger and queries became more complex, was because OLAP relied on building either one right cube that could address almost all the queries or multiple cubes, a combination of which could support most of the query workloads. Building cubes consumed a lot of time, resources, storage, and maintenance work during the ETL stage. This often entailed the possibilities of missing out on the SLAs. However, more recently, two new OLAP engines on Hadoop with an SQL interface (not MDX) are gaining traction in the world of big data for doing SQL-based analytics. Apache Kylin, an open source project started in South Korea, and AtScale, a startup based in the Silicon Valley, are trying to innovate and build OLAP engines for big data, using the current tools in the big data ecosystem. We discuss Apache Kylin and its architecture in the following sections.
CHAPTER 6 ■ INNOVATIONS AND THE ROAD AHEAD
Kylin is an open source distributed analytics engine that provides SQL interface and OLAP capabilities on Hadoop for large data sets. Kylin has been designed to reduce query latency on Hadoop and provide interactive SQL analysis on Hadoop. This allows Kylin to be integrated well with BI and third-party tools. Internally, Kylin is a MOLAP engine (multidimensional OLAP engine), in which the data is precomputed along different dimensions of interest and stored as pre-built and precomputed cubes. MOLAP is much faster but is inflexible and requires the cubes to be refreshed as data changes. The other type of OLAP is ROLAP (relational OLAP), as used in star or snow-flake schemas in data warehouses to do runtime aggregation. ROLAP is flexible but much slower. All existing SQL-on-Hadoop engines can be classified as ROLAP engines.
Kylin builds data cubes (MOLAP) from an underlying Hive table (ROLAP), according to the metadata specification. When an SQL query comes into Kylin, if the query can be fulfilled by an existing pre-built data cube, Kylin routes the query to the data cube and delivers the results immediately. These precomputed data cubes reside in HBase. If the query can’t be fulfilled by an existing cube, Kylin will route the query to the Hive table that is ROLAP and trigger a cube-building process for the query, so that future queries can be supported by the cube.
Apart from a more general set of features, Kylin also provides the following enterprise-level features out of the box:
• Incremental refresh of cubes
• A web interface to manage, build, monitor, and query cubes • Under-the-covers leverage HBase for query latency
• Approximate query capabilities for distinct count performance, using the HyperLogLog algorithm
• Compression and encoding Support • Job management and monitoring • LDAP integration
Figure 6-9 and Figure 6-10 provide a high-level overview of the internals of Apache Kylin and how it works.
Kylin is fronted by a REST server, which accepts SQL queries from third-party tools. On receiving a query, the REST server hands over the SQL query to the query engine, which decides the optimal way to execute the query. This optimization can be for either the MapReduce-based job for building the cube to satisfy the query or for running the query on an already built data cube that resides in HBase.
Kylin maintains two types of Metadata: end-user Metadata, which forms the data model for the star schema, and metadata for the cubes that have been generated by Kylin and stored in HBase.
Cubes are built using MapReduce, and, hence, this can be a time-consuming process. Typically, this process would be baked in as part of the ETL pipeline, so that the cube is ready to be queried when queries arrive in real time. In case a cube has not been built for a specific SQL query, and the already built cubes in HBase cannot satisfy the query, i.e, do not have the data for the query, Kylin offloads the cube-building process by running a MapReduce job on Hive for building the new cube. This new cube is then moved to HBase, the query results are executed on HBase, and the results are returned. The latter workflow can result in significant delays in query latency.
Kylin uses HBase as the storage engine and leverages the HBase coprocessor to reduce network traffic, parallelize scan logic, and improve the latency of the queries.
As seen in Figure 6-10 , – Apache Calcite is used internally by Kylin as the SQL query
CLI/BI
ODBC/JDBC SQL REST API
REST Server
Query Engine
Switcher
Metadata
Cube Builder OLAP Cubes Hive
APACHE KYLIN Build Cubes
CHAPTER 6 ■ INNOVATIONS AND THE ROAD AHEAD