Chapter 2. IBM Watson Explorer overview
2.6 Application Builder architecture
2.6.4 Scaling and availability for enterprise applications
The web-based nature of Application Builder applications liberates those applications from many of the limitations of traditional enterprise applications, such as local installation and system requirements, or only being available from a single server. The ease of access to web-based applications through any standards-compliant browser makes it even more important that the data resources that are required by web-based applications are always available. Increased availability is usually facilitated by eliminating single points of failure that can prevent successful application execution and operation. Eliminating single points of failure is typically done by replicating application-specific configuration information and data on multiple systems on a network:
- Replicating data across multiple hosts provides opportunities for both reliability and performance improvements by distributing application load across multiple systems.
- Replicating configuration data across multiple systems eliminates configuration data as a single point of failure when balancing front-end application load across multiple systems.
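The effect of replication on both load distribution and availability can be illustrated with a minimal sketch. It is not Application Builder code: the class name `ReplicaPool`, its methods, and the host names are hypothetical, and the sketch assumes that some external health check reports which hosts are down. Requests are spread across replicas in round-robin order, and a failed replica is skipped rather than causing the request to fail.

```python
class ReplicaPool:
    """Round-robin selection over replica hosts, skipping hosts marked as down.

    Illustrative only: the names here are hypothetical and are not part of
    any Watson Explorer API.
    """

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one host is required")
        self.hosts = list(hosts)
        self.down = set()
        self._next = 0  # index of the host to try first on the next pick

    def mark_down(self, host):
        self.down.add(host)

    def mark_up(self, host):
        self.down.discard(host)

    def pick(self):
        """Return the next available host, or raise if every replica is down."""
        for _ in range(len(self.hosts)):
            host = self.hosts[self._next]
            self._next = (self._next + 1) % len(self.hosts)
            if host not in self.down:
                return host
        raise RuntimeError("no replica is available")


pool = ReplicaPool(["host-a", "host-b", "host-c"])
print([pool.pick() for _ in range(4)])  # ['host-a', 'host-b', 'host-c', 'host-a']
pool.mark_down("host-b")
print([pool.pick() for _ in range(3)])  # ['host-c', 'host-a', 'host-c']
```

With a single host in the pool, marking that host as down makes every request fail, which is the single point of failure that replication is meant to remove.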
Although data replication and front-end and back-end load balancing are common ways of ensuring that applications are always available, and can scale to satisfy increasing usage and numbers of users, enterprise applications must also support continually increasing amounts of data and associated storage.
Although Watson Explorer indexes are usually much smaller than the data repository for which they provide an index, using a flexible model for allocating and distributing the back-end storage that is associated with an Application Builder application eliminates many common storage problems. Using a flexible model also simplifies managing that data if it must be redistributed, expanded, or moved to new or additional storage devices or systems.
The next two sections provide an overview of how Application Builder addresses replicating configuration data and indexes to satisfy the requirements of today’s enterprise applications, while building in the flexibility and scalability that is required to support the increasing demands of tomorrow.
Replicating configuration data using ZooKeeper
Application Builder applications store configuration data, such as their entity model, in a networked data repository known as ZooKeeper.
ZooKeeper is an open source data repository that was designed to provide a highly reliable, synchronized repository for data, such as configuration data that is required by large-scale distributed systems. Application-specific configuration data is stored in distinct namespaces, enabling single ZooKeeper installations to simultaneously support the configuration requirements of multiple applications. ZooKeeper satisfies its reliability, availability, and serviceability (RAS) requirements by coordinating its content across multiple systems that are running a ZooKeeper server. These ZooKeeper instances must be associated with each other and support the same namespace for configuration data. A single ZooKeeper server is often used when developing Application Builder applications, but three or more ZooKeeper nodes are commonly used in actual deployments.
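As an illustration of how several applications can share one set of ZooKeeper servers while keeping their data apart, the following sketch builds a ZooKeeper client connection string of the general form `host:port,host:port/path`, where the optional path suffix confines a client to one subtree. The host names and namespace paths are hypothetical, and whether Application Builder itself isolates applications in exactly this way is an assumption made only for the example. Port 2181 is the conventional ZooKeeper client port.

```python
def connect_string(hosts, namespace, port=2181):
    """Build a ZooKeeper connection string that is rooted at one namespace.

    hosts and namespace are hypothetical example values; the function only
    formats a string and does not contact any server.
    """
    if not hosts:
        raise ValueError("at least one ZooKeeper host is required")
    if not namespace.startswith("/"):
        raise ValueError("namespace must be an absolute path such as /app1")
    servers = ",".join(f"{host}:{port}" for host in hosts)
    return f"{servers}{namespace}"


ensemble = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]

# Two applications share the same three servers but use separate subtrees.
print(connect_string(ensemble, "/appbuilder/sales"))
print(connect_string(ensemble, "/appbuilder/support"))
```

Because every server in the string belongs to the same ensemble, a client can connect to any of them and see the same namespace content.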
An odd number of ZooKeeper servers enables a set of related servers to use majority voting to select one of those nodes as an authoritative master node. Updates that are sent to any server in the set are synchronized with the master node, which then updates all other servers appropriately. Production environments often use at least five ZooKeeper nodes so that one or more nodes can be unavailable, for example during system or network maintenance, without sacrificing availability or reliability.
Other software in the IBM big data portfolio uses ZooKeeper, such as BigInsights. Although both BigInsights and Watson Explorer use ZooKeeper, Watson Explorer solutions might not be able to use the same ZooKeeper ensemble as BigInsights. Therefore, maintain separate ensembles unless otherwise directed. Always check the system requirements and product guidance for version compatibility.
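The arithmetic behind majority voting explains the preference for odd ensemble sizes and for five nodes in production. A majority of n servers is floor(n/2) + 1, so an ensemble keeps operating as long as no more than n minus that majority are unavailable. The short sketch below works through a few sizes; the function names are illustrative and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can be unavailable while a majority can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```

A four-node ensemble tolerates only one failure, the same as a three-node ensemble, so the extra node adds cost without adding resilience. A five-node ensemble tolerates two failures, which allows one node to be taken down for maintenance while still surviving an unplanned failure of another.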
Enterprise requirements and the BigIndex API
Data repository indexes and the mechanisms used to access them have historically been created manually, using the Watson Explorer Engine administration tool. Similarly, the distributed indexing capabilities of Watson Explorer Engine have usually been configured manually within that tool. Being able to replicate and distribute indexes across multiple hosts is a fundamental requirement for the RAS expectations of enterprise applications, such as those created using Application Builder, but requiring manual intervention to do so is not scalable.
To enable the programmatic creation, distribution, and management of indexes and associated storage, Watson Explorer provides an enterprise-caliber API known as the BigIndex API.
This API hides the complexity that can be associated with creating indexes, querying those indexes or other content sources, and returning the results of those queries. The BigIndex API also hides the complexity of replicating Watson Explorer indexes across multiple servers, providing opportunities for automating load and storage balancing across participating servers, automating fault tolerance, and porting indexes from one server to another.
Applications that require these capabilities can simply use and configure them through the BigIndex API, rather than requiring thousands of hours to develop and test those capabilities for each application. Using the BigIndex API reduces code complexity and makes working with complex distributed systems as easy as working with a local client API.
To support enterprise application requirements, such as high availability and load balancing, indexes that are created using the BigIndex API are typically partitioned into several segments, known as shards, which are optionally replicated across the servers associated with a given index. Segmenting indexes into shards also provides opportunities for reducing the footprint and memory requirements of an index on any given server by tuning the size of these shards and distributing them across larger numbers of servers. The servers that are associated with an index created using the BigIndex API are identified in a cluster collection store entity in that application’s entity model. This entity definition also includes attributes that specify the number of shards that are associated with that index, and the manner in which the index is used.
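The general idea of sharding and replication can be sketched independently of the BigIndex API. The following example is not how the BigIndex API places shards, and it does not use any BigIndex classes. It assumes a simple scheme in which a document is assigned to a shard by hashing its identifier, and each shard is copied to a fixed number of consecutive servers. The function names, server names, and document identifier are hypothetical.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document identifier to a shard number with a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def place_shards(servers, num_shards, replicas):
    """Assign each shard to `replicas` distinct servers, rotating the start."""
    if replicas < 1 or replicas > len(servers):
        raise ValueError("replicas must be between 1 and the number of servers")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = [
            servers[(shard + offset) % len(servers)] for offset in range(replicas)
        ]
    return placement


servers = ["server1", "server2", "server3"]
placement = place_shards(servers, num_shards=4, replicas=2)
for shard, hosts in placement.items():
    print(shard, hosts)
# 0 ['server1', 'server2']
# 1 ['server2', 'server3']
# 2 ['server3', 'server1']
# 3 ['server1', 'server2']

# Any one server can fail and every shard still has a surviving copy.
shard = shard_for("doc-42", num_shards=4)
print(placement[shard])
```

Increasing the number of shards makes each shard smaller, which reduces the footprint on any one server. Increasing the number of replicas raises availability at the cost of additional storage.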
The BigIndex API was developed in conjunction with Application Builder, but makes it easy for other applications to incorporate the data exploration, navigation, and search capabilities that IBM Watson Explorer technologies provide. For more detailed information about the BigIndex API, see the documentation and example applications that are delivered as part of the Watson Explorer product.