Data Understandability - JCR or RDBMS why, when, how?

Making architectural and implementation models understandable is one of the key aspects of the elaboration phase. A clear architecture that can easily be communicated allows people to get into the project more quickly. It is also easier to define tasks and duties if the architecture is clear and made of separate modules.

Generally the architecture is defined or refined by an architect or an analyst during the elaboration stage. This actor takes the requirements identified during the inception phase as input and delivers blueprints which explain the behavior of the system at different levels. At an application level, these blueprints generally include use case diagrams, collaboration diagrams or class diagrams. To show how the application’s data persists, these schemas are often translated into database schemas which take the properties of the data model into account.

JCR development

As mentioned, the structure and the content are indivisible in JCR. However it is possible to define a semantic which shows how data and structure will be instantiated. In this semantic, some aspects of the content can be omitted.

For example, if a semantic item has an unstructured basis, any conceivable property can be saved under it. Thus, there is no need to mention properties that are not mandatory or do not have to respect specific constraints. It is enough to declare them in the application’s schemas, as is done in a class diagram. Consequently, the semantic diagram of a Java content repository says less than the other architectural diagrams, which improves its readability. In fact, reading the semantic of a repository gives a snapshot of the final application and helps to understand its general behavior.

Figure 5.1-1: JCR translation

Another interesting aspect is that the complexity of the JCR semantic is not multiplied by many-to-many relationships. No intermediary nodes or artifacts are needed to represent these associations. Thus, these diagrams stay very close to the other architectural schemas, and no translation rules are needed to create them.
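As a minimal sketch of this point, the following Python fragment models repository nodes as plain dictionaries. It is a stand-in for a content tree, not the JCR API, and the paths and the `authors` property are hypothetical. The many-to-many association between articles and authors is held directly in a multi-valued reference property, without any intermediary node.

```python
# Hypothetical content tree: each node is a dict of properties.
# Paths play the role of node identifiers; this is not the JCR API.
repository = {
    "/authors/alice": {"name": "Alice"},
    "/authors/bob": {"name": "Bob"},
    "/articles/a1": {"title": "JCR basics", "authors": ["/authors/alice", "/authors/bob"]},
    "/articles/a2": {"title": "SQL basics", "authors": ["/authors/alice"]},
}


def authors_of_node(article_path):
    """Follow the multi-valued reference property of an article."""
    return [repository[ref]["name"] for ref in repository[article_path]["authors"]]


def articles_referencing(author_path):
    """Find the articles whose reference property points at an author."""
    return sorted(
        path
        for path, node in repository.items()
        if author_path in node.get("authors", [])
    )
```

The structure of the data mirrors the class diagram one to one: an article simply lists its authors.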

Relational development

Class diagrams can be used as input to generate relational schemas. Entity-relationship diagrams (15) or Crow's Foot diagrams are often used to represent them. Translation rules are generally needed to produce these schemas. Far from summarizing the architecture, they enumerate nearly all the aspects of the final application in detail.

Figure 5.1-2: SQL translation

Everything has to be explicitly mentioned in these database schemas. Only the records which respect the data structure can be instantiated in a relational database. Thus, it is necessary to carefully define this structure and make it fit in perfectly with the application architecture.

Many-to-many associations cannot be represented in relational database schemas without reification. This means that many-to-many associations always require intermediary entities. Consequently, the internal complexity of a relational schema increases faster than the complexity of the other architectural diagrams. Thus, relational schemas do not really help to understand the application. They are more often used as implementation blueprints.
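The reification described above can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for any relational database; the table and column names are hypothetical. The `article_author` table is the intermediary entity that the class diagram does not contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Intermediary entity that reifies the many-to-many association.
    CREATE TABLE article_author (
        article_id INTEGER NOT NULL REFERENCES article(id),
        author_id  INTEGER NOT NULL REFERENCES author(id),
        PRIMARY KEY (article_id, author_id)
    );
    INSERT INTO author  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO article VALUES (1, 'JCR basics'), (2, 'SQL basics');
    INSERT INTO article_author VALUES (1, 1), (1, 2), (2, 1);
    """
)


def authors_of_article(article_id):
    """Resolve the association through the intermediary table."""
    rows = conn.execute(
        """
        SELECT author.name
        FROM author
        JOIN article_author ON article_author.author_id = author.id
        WHERE article_author.article_id = ?
        ORDER BY author.name
        """,
        (article_id,),
    )
    return [name for (name,) in rows]
```

Two entities in the application model become three tables in the schema, which illustrates why the relational schema grows faster than the other diagrams.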

5.2 Coding Efficiency

The construction phase of a development process is highly influenced by efficiency. Coding requires time, resources and money. These parameters are very sensitive. Furthermore, if developers have to write code twice, there is a high probability that they will make more than double the programming errors. Thus, efficiency also impacts quality.

Measuring coding efficiency implies some soft parameters. The programmer’s education and knowledge should be taken into account. Furthermore, the semantic and the readability of the code are also significant. These parameters make it difficult to judge the technology’s efficiency. Without going too deep into these questions, the following sections contain useful information which can be taken into consideration when making a decision in this area.

JCR development

Programmers are not really familiar with the JCR API and do not really know the best practices linked to content repositories. However, the API is in large part self-explanatory and people are generally used to thinking in terms of hierarchies. These factors should make JCR easy to learn.

Some interactions are possible between the query part of the API and its navigational part. One of the big advantages of JCR lies in the fact that these aspects are merged coherently and are not considered as different abstraction levels.

The quantity of code depends heavily on the use case. If complex joining operations are mainly required, JCR will not be an efficient choice. However, if navigation is required, the size of the code will be much smaller. If special requirements such as versioning or fine-grained access control are needed, it clearly becomes difficult to reach, by other means, the level that JCR offers.

Relational development

Nearly all programmers are familiar with the relational model, which has been widely used in recent years. Thus, SQL and APIs such as JDBC are part of the common language. In real-world situations, this general knowledge often favors the relational model. However, some problems need to be treated in a specific manner, and the intuitive approach often gives bad results.

If complex operations are required by the use case, the relational model should not be bypassed. The completeness of the queries and the range of available operations make it very efficient in terms of code quantity. However, if the use case implies requirements such as navigation or versioning, the developer will have to add artifacts to the implementation model to manage parameters such as tree structure or order, and will also have to implement a large amount of application logic. Thus, in terms of efficiency, the choice of model should be driven by an honest analysis of the use case’s properties.

5.3 Application Changeability

Requirements which appear during the development process are often difficult to include in a previously defined architecture. Modern software development processes generally address this problem with iteration cycles (16). Well managed, iterations should allow new requirements to be included efficiently. However, because each logic level is generally impacted by architectural changes made during the elaboration phase, late iterations are more expensive than early ones.

Clearly decoupling the logic levels can reduce this increasing cost. Thus, data models which can transparently accept changes are highly valued. To make this point, we will consider how simple changes impact the data logic of a system.

JCR development

As mentioned in the “Schema understandability” section, a repository’s schemas summarize the other architectural diagrams. While this could appear meaningless, it is really not the case. Keeping the repository as weakly structured as possible allows new requirements to be included without touching the data logic level. Only the application logic level is impacted. Thus, adding a property at the application level does not necessarily require touching the repository’s organization.
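A minimal sketch of this idea, under the assumption that an unstructured node can be approximated by a Python dictionary (this is not the JCR API, and the field names are hypothetical): when a form gains a new field, only the application code that fills the node changes.

```python
# A node kept deliberately unstructured: any property may be stored under it.
# The dict stands in for a repository node; it is not the JCR API.
customer_node = {"name": "Alice", "email": "alice@example.org"}


def save_form(node, form_fields):
    """Application-level code: persist whatever fields the form provides."""
    node.update(form_fields)
    return node


# A new requirement adds a "phone" field to the form. Only the application
# changes; the storage layer needs no schema migration.
save_form(customer_node, {"phone": "+41 00 000 00 00"})
```

The trade-off is that constraints on the new property, if any are needed, have to be enforced by the application.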

To be sure, deep changes do impact the data logic, and JCR does not provide a magic solution either. However, JCR allows most of the data logic to be decoupled from the application and interface levels. It is also interesting to note that frameworks like Sling allow the application logic to be decoupled from the interface logic in a similar manner. This approach is clearly an attractive one, especially in environments driven by change and agility.

Relational development

Nearly every modification made to the overall architecture will impact the data logic level. This comes from the fact that relational databases do not allow elements to be instantiated which have not previously been defined in the structure. Thus, there is a high probability that a change made to a form of the interface or to the application logic will require changes to the data model.
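The following sketch illustrates the point with Python's built-in `sqlite3` module, used here as a stand-in for any relational database; the `customer` table and its columns are hypothetical. A value for a column that has not been defined is rejected, and the structure has to be altered before the new field can be stored.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
db.execute("INSERT INTO customer (name, email) VALUES ('Alice', 'alice@example.org')")

# The form gains a "phone" field, but the column has not been defined yet.
try:
    db.execute("INSERT INTO customer (name, phone) VALUES ('Bob', '000')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True  # the table has no column named phone

# The data logic has to change before the new field can be stored.
db.execute("ALTER TABLE customer ADD COLUMN phone TEXT")
db.execute("INSERT INTO customer (name, phone) VALUES ('Bob', '000')")
phones = db.execute("SELECT name, phone FROM customer ORDER BY id").fetchall()
```

Existing rows receive an empty value for the new column, which is one form of the footprint a schema change leaves on data already in production.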

Some frameworks provide tools to automate these changes. However, if the system has a production version, the change, once executed, will have a big footprint on all the database’s items. Furthermore, classical model-view-controller frameworks do not really decouple the application level from the interface. For example, a change made to a controller will often impact views and models.

5.4 Synthesis

At a project level, people are often looking for solutions which allow changes to be integrated quickly into their environment. In situations where changes have to be performed, the semi-structured nature of JCR will certainly be appreciated. Furthermore, the inclusion of features such as navigation, versioning or access control can save a lot of time.

Nevertheless, it is important to keep in mind that the efficiency of both solutions depends to a large extent on the nature of the use case. The agility of JCR should not influence this aspect. Furthermore, agility is linked in no small way to the project team. Thus, saying that JCR is a way to achieve agility is too big a shortcut.

In all cases, the choice of a database technology should be discussed during the inception and elaboration phases of the first iteration of the development process. This can be done by weighing the different parameters. Changing the persistence technology cannot easily be achieved after the first iteration. Consequently, this choice will have a strong impact on the rest of the project.

6 Product comparison

Choosing between database products implies the use of different criteria. We can mention compliance with a standard, the additional features proposed by the provider, the support offered by a company or by a community, or the scalability of the solution. All these criteria matter. They should be weighed carefully and a choice made depending on the situation.

In our context, basic and significant differences distinguish Java content repositories from relational databases. Thus, the decision to employ one technology instead of the other should be taken at a lower level. However, when it comes to products, people often ask whether, in terms of performance, they should use a relational database or a Java content repository to manage their hierarchical information. This section tries to answer this question by recalling some basic theoretical concepts which relate to data structures and to the cost of associations. Then, at a more practical level, a benchmark of several database products checks whether these assumptions hold.

6.1 Theoretical analysis

In general, database products use basic data structures to manage their data. This section recalls simple concepts which relate to these structures and to the cost of associations made between data items. The goal is to determine whether a product’s performance will be significantly impacted by the underlying approach.

Hierarchical and network database

In the hierarchical and network models, associations are made by storing references or pointers between items. The advantage of this kind of structure is that, because each node stores direct references to other nodes, a constant number of read accesses is needed to go from one node to its target.

Creating an association between two nodes also has a constant cost because the number of operations needed to perform this is always the same.

Thus, the cost of crossing and creating associations is constant and could be noted as O(1) in big O notation. Some people say that these associations are pre-computed.

Several strategies allow the representation of directed graphs such as those needed by the hierarchical and the network models. The most classical representations are adjacency lists and adjacency matrices (17). Generally, the choice between the two approaches is made simply by analyzing the density of the graph.

If the number of arcs in the graph is close to the square of the number of nodes, an adjacency matrix will give better results. However, the JCR model is mainly driven by hierarchical associations. In this context, the number of arcs will not be much larger than the number of nodes. Thus, an adjacency list uses memory more sparingly, since it requires only the space needed to store the associations. It is also interesting to note that this kind of organization makes it easy to give an order to the children of a node.

Figure 6.1-1: A hierarchy and its adjacency matrix

Implementing this in a programming language can be accomplished with several data structures such as arrays, maps or hash tables. Other solutions could also be presented, but the main idea is that crossing an association has a constant cost and that crossing a graph has a cost which is proportional to the number of nodes and arcs traversed. Thus, managing this kind of data is cost effective.
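A minimal sketch of an adjacency list, assuming a Python dictionary that maps each node to the ordered list of its children (the node names are illustrative). Crossing one association is a single dictionary lookup, and a full traversal visits each node and each arc once.

```python
# Adjacency list: each node maps to the ordered list of its children.
# Node names are illustrative.
children = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}


def first_child(node):
    """Crossing one association is a single dictionary lookup (average O(1))."""
    kids = children[node]
    return kids[0] if kids else None


def traverse(node):
    """Depth-first walk; each node and each arc is visited exactly once."""
    visited = [node]
    for child in children[node]:  # list order preserves sibling order
        visited.extend(traverse(child))
    return visited
```

Because the children are kept in a list, the order of siblings comes for free, which is the ordering property mentioned above.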

Relational database

In the relational model, associations are made between relations by computing the matching values stored in two domains. This allows for the expression of all imaginable associations between two or more data sets.

What is the cost of computing and creating associations in a relational database? To compute an association, a relational database has to cross the targeted set to find the matching values. In this case, the cost of the association is O(n), where n is the number of tuples stored in the source and in the target. However, most database products provide indexing facilities such as B-tree indexes. So, in most cases, finding the matching entries has a cost of O(log(n)). While B-tree indexes are good, some articles (18) argue that network models can reach better performance because their associations are pre-computed.

However, in most cases there is no need for comparison operators other than “=” or “≠” to express relationships such as those of a hierarchical or network model. Consequently, hash indexes can be used on the domains which constitute the association. If the relational database provides a good implementation of hash indexes, the cost of retrieving data through associations will be close to O(1). Adding new items to the targeted sets and to the index also has a constant cost of O(1). Thus, there are virtually no significant differences between the associations of the relational model and those of the hierarchical model.
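The difference between scanning a relation and using a hash index can be sketched as follows. This is an illustration in plain Python, not the internals of any database product: the `rows` list is a hypothetical node relation, and a dictionary plays the role of the hash index on its `parent_id` domain.

```python
from collections import defaultdict

# Rows of a hypothetical "node" relation: (id, parent_id).
rows = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2)]


def children_by_scan(parent_id):
    """Without an index, every tuple is compared: O(n) per association."""
    return [node_id for node_id, parent in rows if parent == parent_id]


# Build a hash index on parent_id once; each insertion costs O(1) on average.
index = defaultdict(list)
for node_id, parent in rows:
    index[parent].append(node_id)


def children_by_hash(parent_id):
    """With a hash index, an equality match is one lookup: O(1) on average."""
    return index.get(parent_id, [])
```

Both functions return the same children; only the number of comparisons differs, which is why equality-only associations suit hash indexes well.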

6.2 Benchmark

The previous section has summarized a huge problem very succinctly, and too quickly. However, the main point to keep in mind is that no intolerable differences should appear whether hierarchical data is managed with a content repository or with a relational database. The following benchmark was carried out to verify this assumption.

Four products are included in this benchmark. CRX is a native implementation of the JCR specification. The persistence of the items is managed with a proprietary technology which is based on the tar file format (19) and implemented in Java. H2 and Derby are two open source relational databases written in Java. MySQL is one of the most widely used open source databases.

A simple wrapper has been defined for this benchmark. This wrapper offers basic functions to create trees made of nodes and properties. The CRX wrapper directly uses the functionalities provided by the API. The SQL wrapper uses a simple database schema: one table stores the nodes and another table stores the properties. The associations between items are managed with a parent foreign key, and the default indexes of the database are used on all fields. Queries are performed through JDBC, using prepared statements to avoid parsing the SQL statements each time.
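The exact schema used by the SQL wrapper is not given, so the following is only a sketch of what such a two-table layout could look like. It uses Python's built-in `sqlite3` module rather than JDBC; the table names, column names and helper functions are assumptions, and only the foreign key columns are indexed here. Parameterized statements play the role of prepared statements.

```python
import sqlite3

store = sqlite3.connect(":memory:")
store.executescript(
    """
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),  -- NULL for the root
        name      TEXT NOT NULL
    );
    CREATE TABLE property (
        id      INTEGER PRIMARY KEY,
        node_id INTEGER NOT NULL REFERENCES node(id),
        name    TEXT NOT NULL,
        value   TEXT
    );
    CREATE INDEX node_parent   ON node(parent_id);
    CREATE INDEX property_node ON property(node_id);
    """
)


def add_node(parent_id, name):
    """Insert a node under its parent and return the new node id."""
    cursor = store.execute(
        "INSERT INTO node (parent_id, name) VALUES (?, ?)", (parent_id, name)
    )
    return cursor.lastrowid


def set_property(node_id, name, value):
    """Attach a property to a node."""
    store.execute(
        "INSERT INTO property (node_id, name, value) VALUES (?, ?, ?)",
        (node_id, name, value),
    )


def child_names(node_id):
    """List the names of the direct children of a node."""
    rows = store.execute(
        "SELECT name FROM node WHERE parent_id = ? ORDER BY id", (node_id,)
    )
    return [name for (name,) in rows]
```

A tree is then built by calling `add_node` with the id returned for the parent, for example `add_node(add_node(None, "root"), "child")`.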

The benchmark is composed of four parts which all measure the time required to perform an operation in hierarchies of different sizes. Each node of these base hierarchies has 5 sub-nodes and 5 properties, except leaves, which only have 5 properties. The first hierarchy has one level. Each following hierarchy includes one more level. The tests were launched 5 times on a Dell Latitude D820 running Windows XP (processor: Intel Core Duo 2.00 GHz, virtual memory: 2.00 GB). The average result is used in the following diagrams.
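The sizes of the base hierarchies follow directly from this description. The short calculation below is a sketch that assumes the root counts as one node carrying 5 properties and that "levels" counts the levels of sub-nodes below it; under that assumption it reproduces the item counts 36, 186, 936, 4686 and 23436 that appear on the chart axes.

```python
def item_count(levels, fanout=5, properties=5):
    """Number of items (nodes plus properties) in a base hierarchy.

    levels: number of levels of sub-nodes below the root node.
    """
    nodes = sum(fanout ** depth for depth in range(levels + 1))
    return nodes + nodes * properties


sizes = [item_count(levels) for levels in range(1, 6)]
```

In closed form, a hierarchy with d levels holds 6 · (5^(d+1) − 1) / 4 items, so each additional level multiplies the size by roughly 5.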

Writing the hierarchy

This test measures the time required to create the base hierarchy. The throughputs correspond to the time needed to write one item of the hierarchy. While the differences between products seem huge, each throughput stays constant as the hierarchy grows. The assumption that native implementations of JCR and relational databases should be equivalent in terms of performance holds in this case. MySQL cannot be embedded in the application, which has a high impact on its result. H2 does not appear in the chart because its write performance is too good to show at this scale.

Reading the hierarchy

This test consists of reading all the items of the base hierarchy once, from the root to the leaves. The throughputs displayed in the chart correspond to the average time needed to read one item of the hierarchy. For most databases the results seem to be constant. Derby is simply out of range: when recursive queries are performed on this database, the results are not tolerable.

[Figures: two charts plotting time in milliseconds (y-axis, scales 0.00 to 18.00 and 0.00 to 0.20) against the number of items in the hierarchy (x-axis: 36, 186, 936, 4686, 23436) for crx, h2, mysql and derby.]

Randomly writing the hierarchy

The test consists of randomly writing 100 sub-hierarchies in the base hierarchy. Each sub-hierarchy has a depth of 2 levels. Each level has two sub-nodes and two properties. Thus, each sub-hierarchy is composed of 21 items. The throughputs relate to the average time required to create all the items of one sub-hierarchy. The results are quite similar to those of the first test. The good point is that all the databases have constant results.

Randomly reading the hierarchy

The test consists of randomly reading 100 nodes and their descendants on two levels in the base hierarchy. The throughput relates to the average time required to read one node and its descendants. As in the second test, Derby is simply out of range; the same problem is encountered with recursive queries. It appears that CRX is well optimized for these situations. To be really conclusive, this test should be launched on bigger hierarchies. However, the difference between the results is constant, and the relational databases do not show extremely bad performance for recursive queries.
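How the recursive queries were issued in the benchmark is not described. One plausible reading, sketched below with Python's built-in `sqlite3` module and a hypothetical `node` table, is that the wrapper reads a node and then queries the children of each visited node in turn, down to a fixed number of levels.

```python
import sqlite3

tree = sqlite3.connect(":memory:")
tree.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT)")
tree.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "a"), (3, 1, "b"), (4, 2, "a1"), (5, 4, "a1x")],
)


def read_subtree(node_id, levels):
    """Read a node and its descendants, issuing queries recursively per node."""
    (name,) = tree.execute("SELECT name FROM node WHERE id = ?", (node_id,)).fetchone()
    names = [name]
    if levels > 0:
        child_ids = tree.execute(
            "SELECT id FROM node WHERE parent_id = ? ORDER BY id", (node_id,)
        ).fetchall()
        for (child_id,) in child_ids:
            names.extend(read_subtree(child_id, levels - 1))
    return names
```

With this approach the number of statements grows with the number of nodes visited, so the per-statement overhead of each product weighs heavily on the result.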

6.3 Synthesis

As shown in this chapter, performance should not be
