
The Riak key-value store allows each of its stored values to be augmented with link metadata. Each link is one-way, pointing from one stored value to another. Riak allows any number of these links to be walked (in Riak terminology), making the model somewhat connected. However, this link walking is powered by map-reduce, which has relatively high latency. Unlike the relationships in a graph database, these links are suitable only for simple graph-structured programming rather than general graph algorithms.

There’s another weak point in this scheme. Because there are no identifiers that “point” backward (the foreign aggregate “links” are not symmetric, of course), we lose the ability to run other interesting queries on the database. For example, with the structure shown in Figure 2-3, asking the database who has bought a particular product, perhaps for the purpose of making a recommendation based on customer profile, is an expensive operation. If we want to answer this kind of question, we will likely end up exporting the dataset and processing it via some external compute infrastructure, such as Hadoop, to brute-force compute the result. Alternatively, we can retrospectively insert backward-pointing foreign aggregate references and then query for the result. Either way, the results will be slow to arrive.

It’s tempting to think that aggregate stores are functionally equivalent to graph databases with respect to connected data. But this is not the case. Aggregate stores do not maintain consistency of connected data, nor do they support what is known as index-free adjacency, whereby elements contain direct links to their neighbors. As a result, for connected data problems, aggregate stores must employ inherently high-latency methods for creating and querying relationships outside the data model.

Let’s see how some of these limitations manifest themselves. Figure 2-4 shows a small social network as implemented using documents in an aggregate store.

Figure 2-4. A small social network encoded in an aggregate store

With this structure, it’s easy to find a user’s immediate friends—assuming, of course, the application has been diligent in ensuring identifiers stored in the friends property are consistent with other record IDs in the database. In this case we simply look up immediate friends by their ID, which requires numerous index lookups (one for each friend) but no brute-force scans of the entire dataset. Doing this, we’d find, for example, that Bob considers Alice and Zach to be friends.

But friendship isn’t always symmetric. What if we’d like to ask “who is friends with Bob?” rather than “who are Bob’s friends?” That’s a more difficult question to answer, and in this case our only option would be to brute-force scan across the whole dataset looking for friends entries that contain Bob.
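The difference between the two questions can be sketched in a few lines of Python. The dictionary below is a hypothetical stand-in for an aggregate store keyed by record ID. Its contents and the function names friends_of and who_considers_friend are illustrative assumptions, not a reproduction of Figure 2-4 or of any particular store's API.

```python
# Hypothetical stand-in for an aggregate store: record ID -> document.
users = {
    "alice": {"name": "Alice", "friends": ["bob"]},
    "bob": {"name": "Bob", "friends": ["alice", "zach"]},
    "zach": {"name": "Zach", "friends": ["alice"]},
}


def friends_of(store, user_id):
    """Who are this user's friends? One keyed lookup per friend ID."""
    return [store[friend_id]["name"] for friend_id in store[user_id]["friends"]]


def who_considers_friend(store, user_id):
    """Who lists this user as a friend? Every document must be inspected."""
    return sorted(
        doc["name"] for doc in store.values() if user_id in doc["friends"]
    )


bobs_friends = friends_of(users, "bob")
friends_with_bob = who_considers_friend(users, "bob")
```

The first function touches only Bob's record and the records of his friends. The second has no choice but to visit every document in the store, however few of them actually mention Bob.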

O-Notation and Brute-Force Processing

We use O-notation as a shorthand way of describing how the performance of an algorithm changes with the size of the dataset. An O(1) algorithm exhibits constant-time performance; that is, the algorithm takes the same time to execute irrespective of the size of the dataset. An O(n) algorithm exhibits linear performance; when the dataset doubles, the time taken to execute the algorithm doubles. An O(log n) algorithm exhibits logarithmic performance; when the dataset doubles, the time taken to execute the algorithm increases by a fixed amount. The relative performance increase may appear costly when a dataset is in its infancy, but it quickly tails off as the dataset gets a lot bigger. An O(m log n) algorithm is the most costly of the ones considered in this book. With an O(m log n) algorithm, when the dataset doubles, the execution time doubles and increments by some additional amount proportional to the number of elements in the dataset.
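As a rough illustration of the linear and logarithmic cases, the following sketch counts the steps taken by a full scan and by a binary search when the value sought is absent (the worst case for both). The function names are our own, and step counts stand in for execution time.

```python
def linear_scan_steps(n):
    """Count comparisons made by a full scan for a value that is absent."""
    data = list(range(n))
    target = n  # not present, so every element is examined
    steps = 0
    for value in data:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(n):
    """Count iterations of a binary search for a value that is absent."""
    data = list(range(n))  # already sorted
    target = n  # larger than every element
    lo, hi, steps = 0, len(data), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if data[mid] < target:
            lo = mid + 1  # discard the lower half
        else:
            hi = mid  # discard the upper half
    return steps


for size in (1024, 2048):
    print(size, linear_scan_steps(size), binary_search_steps(size))
```

Doubling the input from 1,024 to 2,048 elements doubles the work of the scan but adds only a single iteration to the binary search.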

Brute-force computing an entire dataset is O(n) in terms of complexity because all n aggregates in the data store must be considered. That’s far too costly for most reasonable-sized datasets, where we’d prefer an O(log n) algorithm (which is somewhat efficient because it discards half the potential workload on each iteration) or better. Conversely, a graph database provides constant order lookup for the same query. In this case, we simply find the node in the graph that represents Bob, and then follow any incoming friend relationships; these relationships lead to nodes that represent people who consider Bob to be their friend. This is far cheaper than brute-forcing the result because it considers far fewer members of the network; that is, it considers only those that are connected to Bob. Of course, if everybody is friends with Bob, we’ll still end up considering the entire dataset.
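The idea of following incoming relationships can be sketched with a toy in-memory structure in which each node holds direct references to its neighbors. The Person class and befriend function are hypothetical names chosen for illustration; this is not how any particular graph database lays out its data.

```python
class Person:
    """Toy in-memory node holding direct references to its neighbors."""

    def __init__(self, name):
        self.name = name
        self.friends_out = []  # people this person considers friends
        self.friends_in = []   # people who consider this person a friend


def befriend(source, target):
    """Record one directed friend relationship on both of its end nodes."""
    source.friends_out.append(target)
    target.friends_in.append(source)


alice, bob, zach = Person("Alice"), Person("Bob"), Person("Zach")
befriend(bob, alice)
befriend(bob, zach)
befriend(alice, bob)
befriend(zach, alice)

# Follow only Bob's incoming relationships; no other node is visited.
friends_with_bob = [person.name for person in bob.friends_in]
```

The cost of the final query depends on how many people consider Bob a friend, not on how many people are in the network.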

To avoid having to process the entire dataset, we could denormalize the storage model by adding backward links. Adding a second property, called perhaps friended_by, to each user, we can list the incoming friendship relations associated with that user. But this doesn’t come for free. For starters, we have to pay the initial and ongoing cost of increased write latency, plus the increased disk utilization cost for storing the additional metadata. On top of that, traversing the links remains expensive, because each hop requires an index lookup. This is because aggregates have no notion of locality, unlike graph databases, which naturally provide index-free adjacency through real (not reified) relationships. By implementing a graph structure atop a nonnative store, we get some of the benefits of partial connectedness, but at substantial cost.
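A minimal sketch of this denormalization, again using a plain dictionary as a hypothetical aggregate store, makes the extra write visible. The add_friend helper is an assumption for illustration; a real application would also have to handle failures between the two updates.

```python
# Hypothetical aggregate store: record ID -> document with both link lists.
store = {
    "alice": {"friends": [], "friended_by": []},
    "bob": {"friends": [], "friended_by": []},
    "zach": {"friends": [], "friended_by": []},
}


def add_friend(store, source_id, target_id):
    """Record a friendship and its backward link; return documents written."""
    store[source_id]["friends"].append(target_id)      # forward link
    store[target_id]["friended_by"].append(source_id)  # backward link
    return 2  # two documents are updated for every new friendship


friendships = [("bob", "alice"), ("bob", "zach"), ("alice", "bob")]
total_writes = sum(add_friend(store, src, dst) for src, dst in friendships)

# "Who is friends with Bob?" is now a single keyed read of Bob's document.
friended_bob = store["bob"]["friended_by"]
```

Every friendship now updates two documents instead of one, and it is the application, not the store, that must keep the two lists in step.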

This substantial cost is amplified when it comes to traversing deeper than just one hop. Friends are easy enough, but imagine trying to compute, in real time, friends-of-friends, or friends-of-friends-of-friends. That’s impractical with this kind of database because traversing a fake relationship isn’t cheap. This not only limits your chances of expanding your social network, but it reduces profitable recommendations, misses faulty equipment in your data center, and lets fraudulent purchasing activity slip through the net. Many systems try to maintain the appearance of graph-like processing, but inevitably it’s done in batches and doesn’t provide the real-time interaction that users demand.
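To see how the lookups accumulate, the sketch below walks friend links to a given depth over a hypothetical keyed store and counts one index lookup per record fetched. The data and the function name people_at_depth are assumptions for illustration only.

```python
# Hypothetical keyed store: record ID -> list of friend IDs.
users = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "zach"],
    "carol": ["dave"],
    "zach": ["dave", "erin"],
    "dave": [],
    "erin": [],
}


def people_at_depth(store, start_id, depth):
    """Return the IDs first reached at `depth` hops and the lookups used."""
    seen = {start_id}
    frontier = {start_id}
    lookups = 0
    for _ in range(depth):
        next_frontier = set()
        for user_id in frontier:
            lookups += 1  # each record fetched costs one index lookup
            for friend_id in store[user_id]:
                if friend_id not in seen:
                    seen.add(friend_id)
                    next_frontier.add(friend_id)
        frontier = next_frontier
    return frontier, lookups


friends_of_friends, lookups_depth_2 = people_at_depth(users, "alice", 2)
third_degree, lookups_depth_3 = people_at_depth(users, "alice", 3)
```

Even in this six-person example, reaching friends-of-friends takes three lookups and going one hop further takes five. In a realistic network, where each person has many friends, the number of lookups grows with every person reached at each level.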