focus
between the application’s object model and the data’s relational model. Object-relational mapping (ORM) frameworks exist to bridge this divide, but ORMs aren’t trivial to use and often introduce more complexity than the problem they solve.1
Relational databases use a generic storage model based on tables and columns, so it’s no wonder they don’t scale well for all data types. Google designed BigTable2 and Amazon came up with Dynamo3 specifically to address these data-base issues. Both were implemented as distributed data stores designed to scale to a very large size. Recent experiences in data modeling with social networking applications such as Facebook,4 Twit-ter,5,6 and Digg7 also demonstrate this deficiency. One way to address this problem is to model data so that it remains closer to the way the ap-plication uses it. On the basis of practical experi-ences, I describe a multiparadigm data-storage model that permits applications to work with data in a way that’s semantically closer to its usage patterns.
Toward a Semantically
Richer Data Storage
Consider an enterprise application module that must handle document sets that can have many optional attributes—for example, an Address Book that can have fields like Name, House Num-ber, Street Name, City, Zip Code, and Country Code. It can optionally include a list of telephone numbers and email addresses as well. With a rela-tional model, you design it as a table with nullable attributes for optional items. For repeating entries, you use normalization techniques and store data in multiple tables to avoid data redundancy and inconsistent updates and deletes.
Distributing data across multiple tables has the effect of destructuring the semantics of the way your application looks at the data. The applica-tion would prefer to store the Address Book in a single structure that keeps the document whole and consistent with the way the domain model would use it.
Today, we have data stores that let us model
R
egardless of the paradigm used to model the application domain, most
en-terprise applications use the relational model for data storage. Relational
da-tabase technology is mature, widely understood, and successfully deployed
in countless applications. However, its dominance has also had some
unde-sirable consequences for application development. For an application that models the
business logic in an object-oriented way, the developer faces an impedance mismatch
Storing data the same
way it’s used in an
application simplifies
the programming
model, making it
easier to decentralize
data processing.
Emerging NoSQL
data-storage engines
support this strategy.
Debasish Ghosh, Anshin Software
Multiparadigm
Data Storage for
Enterprise Applications
multiparadigm programming
58 I E E E S O F T W A R E w w w . c o m p u t e r. o r g / s o f t w a r e
our data layer exactly this way. CouchDB (http:// couchdb.org) and MongoDB (http://mongodb.org) store data in JavaScript Object Notation (JSON, http://json.org) “documents” and let applications manipulate document structures directly through their query engines.
Consider another example where your applica-tion must store various routes across cities to find optimal shipping strategies for your clients. Typi-cally, you think of this as a graph, with the cities being the nodes and the connecting routes being the edges. You could store the structure in a re-lational database and use SQL queries that em-ploy complicated joins across multiple tables. Or you could store the data in a graph database like Neo4J (http://neo4j.org) that offers nodes and re-lationships as first-class abstractions and various graph-manipulation APIs for use directly within the application layer.
Both of these examples have one thing in common. They express a need for a data model much richer than the universal table/column-based representation that a relational database offers—a need for a specialized representation of each individual data type used in your appli-cation. Specialized representation also implies specific query languages for each data store. This is a benefit in that you can use the most expres-sive language to query your data structures, but it also means learning a multitude of languages and their best practices.
SQL is no longer the universal query interface, so these new data stores are popularly referred to as NoSQL stores. “NoSQL” has many connotations, but the most popular one today is Not Only SQL.
Multiparadigm Programming
with Data
When you’re using multiple data-representation techniques, you need the right tool for the right job. When you’re working on a large-scale application, use data stores that meet your application’s access-pattern requirements and offer the desired level of scalability and performance guarantees.
A relational database management system (RD-BMS) engine is the right tool for handling rela-tional data used in transactions requiring atomic-ity, consistency, isolation, and durability (ACID). However, an RDBMS isn’t an ideal platform for modeling complicated social data networks that in-volve huge volumes, network partitioning, and rep-lication. Graph databases like Neo4J model such relationships much better. CouchDB offers offline data-processing capabilities through replication techniques and allows synchronization with other
copies at a later time. MongoDB has blazing-fast in-memory operations.
Cassandra (http://cassandra.org) supports de-centralized data storage for efficient columnar ac-cess from your application. It has the fault toler-ance of Dynamo while offering a more advtoler-anced data model. For applications that need huge write scalability, Cassandra has proved to be a very good option.
If you need write scalability, Riak (http://riak. basho.com) is another option that models a key-value data store and offers decentralized access, availability, and network-partition tolerance.
The underlying idea is to use each data store’s strengths to meet your data model’s requirements. This brings the data’s storage model closer to the application-domain model that uses it. Plus, it gives you the scalability benefits these engines of-fer. The result is a multiparadigm strategy for data management.
With all these NoSQL stores acting as the inter-face to your application’s domain model, you might still need an underlying relational database to serve as the system of record for generating reports and audit trails, running of other batch processes, and so on. NoSQL stores don’t scale well for such jobs, and the RDBMS world offers lots of tool support in these areas.
Asynchronous Messaging
for Eventual Consistency
However, one question still remains: How do you keep the underlying relational store in sync with the other data stores?
One option is to use asynchronous messaging as the backbone for an integration layer. By combin-ing asynchronous messagcombin-ing with the actor com-munication model,11 we can establish an architec-ture for achieving eventual consistency between the underlying database and the online domain-specific data stores.
Figure 1 gives the overall architecture for such a multiparadigm database infrastructure. In this approach, asynchronous messaging repli-cates necessary changes in individual data stores in the main relational database. The applica-tion determines which changes must be propa-gated to the underlying store as the system of record. For example, if you’re using MongoDB as a data store for online processing, the busi-ness components will store all your transaction data in MongoDB collections. When the appli-cation logic updates a collection online, the mes-saging system will trigger downstream updates to schedule jobs on the queue. These jobs are
The idea is
to use each
data store’s
strengths
to meet
your data
model’s
requirements.
processed asynchronously and keep the under-lying relational data store consistent with the frontal online stores.
The overall application architecture benefits from asynchronous updates in a couple of ways. First, it scales well, because asynchronous pro-cessing doesn’t block—and can even be sched-uled as offline threads of execution. Second, it guarantees delivery of updates to the underlying store within a specified time interval. The two stores might be inconsistent for a short time, but many applications can tolerate this delay to meet other scalability objectives. In other words, the consistency isn’t instant, so it’s called eventual consistency.
In a sense, the frontal data store is like a cache between the application and the database of re-cord—that is, the SQL store. However, unlike a traditional database cache, the frontal store can provide a persistence abstraction that best fits the application. This multilevel model is also useful when the application requires an SQL store of re-cord for nontechnical business reasons.
NoSQL Stores:
A User’s Point of View
You must consider two important aspects of your application and infrastructure requirements be-fore selecting a NoSQL storage engine.
First, every data store needs a user-friendly query interface. Unlike the relational world where SQL provides the universal query language, the NoSQL world has no such unifying query lan-guage. Every data store provides query interfaces in multiple host languages—Java, Ruby, Python, and so on. The languages vary significantly across storage engines, and the query mechanisms differ across the various stores.
Stores like MongoDB offer user-friendly query APIs as part of the client library implementation. In CouchDB, you must write views, using a map/ reduce paradigm, to get data from the database.12 CouchDB comes with a default view-engine im-plementation in JavaScript. However, its view ar-chitecture is decoupled from the core server, so you can write your own view server using your preferred language.
So, every data store has its own query model and language that might be optimal for its under-lying database engine, yet, the absence of a uni-fied query model like SQL in the relational world could deter early adoption of NoSQL data stores.
The second important consideration when se-lecting a data store is scalability requirements. Each candidate data store has different strengths
and weaknesses, which you will need to align with your application requirements. Some fea-tures for determining your selection are
■throughput requirements of reads and writes for your application,
■whether your application needs to handle data distributed across nodes and serve query re-quests from users even when some nodes fail, ■whether your application needs offline
data-processing capabilities,
■your application’s availability requirements, and
■your application’s data consistency re quirements.
A complete analysis of all the stores with respect to distribution and scalability is beyond this article’s scope. For more details, see the documentation for each product.
Advantages of
Asynchronous Messaging
Architectural blueprints like the one in Figure 1 are becoming more common. In one project at Anshin Software, we use MongoDB for document storage in collaboration with Oracle as the under-lying system of record. The implementation
archi-(Actors) Cassandra database Graph-structured domain rules Columnar data access with decentralization Module 2 (Actors) Module 1 (Actors) Asynchronous message passing CouchDB database Document structures Document structures with offline processing (Actors) Module 3 Neo4J
database MongoDBdatabase
Relational database
Module 4
Figure 1. Architecture for a multiparadigm database infrastructure. The application uses multiple frontal data stores, depending on the way each component uses the data. Message-oriented middleware uses asynchronous replication to achieve eventual consistency with the underlying relational database management system.
60 I E E E S O F T W A R E w w w . c o m p u t e r. o r g / s o f t w a r e
tecture benefited from this model, first, because it stores the data in a model that’s closest to the data-access pattern. This minimizes the impedance mismatch between the data and the application layer. Furthermore, because application-layer data access is simpler and more direct, the application code base is much more expressive, concise, and maintainable.
Using asynchronous messaging as the bind-ing glue to manage back-end data consistency also gives us a horizontally scalable infrastructure. Fi-nally, it distributes online data-processing load across multiple data stores. There’s no single point of failure as there is when a single RDBMS handles all the loads.
Challenges of
Asynchronous Messaging
As with any architectural paradigm, there are a host of pitfalls to be aware of.
First, not all systems are suitable for enforc-ing an eventually consistent model. If your appli-cation requires all operations to be immediately available from the underlying relational database, this model isn’t for you. A typical example of such a use case is handling a banking system’s debit and credit transactions, which must be ACID consistent.
Second, the NoSQL database systems dis-cussed here are relatively immature compared to the well-known SQL systems. Many of them are still evolving rapidly. All of them are being used in production systems, but few have reached a version 1.0 level. You can expect these systems to make some incompatible changes in the near term. Furthermore, if the application or data architect doesn’t use the architecture pattern carefully, the result could turn into a database cacophony.
Finally, traditional database architects, accus-tomed to using a single RDBMS for an applica-tion, will have to be convinced of the wisdom of a multiparadigm strategy.
W
hen eventual consistency is sufficient for your application, asynchronous messaging provides a robust, scal-able way to synchronize data between different data stores. It lets you decentralize data process-ing and store data in structures that are more closely aligned with the application logic. This leads to prospects for a much simpler program-ming model without the incidental complexi-ties that additional glue frameworks bring to an application.References
1. T. Neward, “The Vietnam of Computer Science,” blog, June 2006, http://blogs.tedneward.com/2006/06/26/ The+Vietnam+Of+Computer+Science.aspx.
2. F. Chang et al., “BigTable: A Distributed Storage System for Structured Data,” Proc. 7th Symp. Operating
System Design and Implementation (OSDI 06), Usenix
Assoc., 2006; www.usenix.org/events/osdi06/tech/ chang.html.
3. G. DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store,” ACM SIGOPS Operating
Systems Rev., vol. 41, no. 6, 2007, pp. 205–220;
http://s3.amazonaws.com/AllThingsDistributed/sosp/ amazon-dynamo-sosp2007.pdf.
4. A. Lakshman, P. Malik, and K. Ranganathan, “Cassan-dra: A Structured Storage System on a P2P Network,” slide presentation at ACM SIGMOD Int’l Conf. Man-agement of Data (SIGMOD 08), 2008; www.slideshare. net/jhammerb/data-presentations-cassandra-sigmod. 5. J. Adams, “Billions of Hits: Scaling Twitter,” slide
presentation presented at the Chirp 2010 Official Twit-ter Developer Conf., 2010; www.slideshare.net/netik/ billions-of-hits-scaling-twitter.
6. N. Kallen, “Big Data in Real-Time at Twitter,” slide presentation, 2010; www.slideshare.net/nkallen/ q-con-3770885.
7. I. Eure, “Looking to the Future with Cassandra,” blog, 9 Sept. 2009, http://about.digg.com/blog/ looking-future-cassandra.
8. C. Hewitt, P. Bishop, and R. Steiger, “A Universal Modular ACTOR Formalism for Artificial gence,” Proc. 3rd Int’l Joint Conf. Artificial
Intelli-gence, Morgan Kaufmann, 1973, pp. 235–245.
9. S. Helmberger, “Introduction to CouchDB Views,” 2 Apr. 2010; http://wiki.apache.org/
couchdb/Introduction_to_CouchDB_views.
About the Author
Debasish Ghosh is the chief technology evangelist at Anshinsoft (www.anshinsoft. com), where he specializes in leading delivery of enterprise-scale solutions for clients ranging from small to Fortune 500 companies. His research interests are functional programming, domain-specific languages, and NoSQL databases. Debasish received his bachelor’s degree in computer science and engineering from Jadavpur University, India. He’s a senior member of the ACM and author of the book DSLs In Action, to be published this year by Manning. Read his programming blog at http://debasishg.blogspot.com and contact him at [email protected].