Thesis

(1)

The last fifteen years have seen a renaissance in database software, as developers and consumers have finally agreed to branch out beyond the standard of relational, SQL-based database software. One of the new database types that has come to the forefront (under the umbrella of “NoSQL databases”) is graph databases. This paper will examine the performance of a specific graph database (Neo4J) on a specific problem, and elucidate more general conclusions about whether a graph database is the right choice for a prospective problem.

Relational Databases and NoSQL

The general history of database software can be traced back to the 1960s, when computing power became efficient and cheap enough to encourage storing data using software instead of physical methods (Vaughn). This saw the rise of two main data models: the Hierarchical System and the Network System. The Hierarchical System organizes data into a tree structure, where each node (aside from the root) has one (and only one) parent node and zero or more children; the most widely used Database Management System (DBMS) today that utilizes a Hierarchical Model is IBM’s IMS (Vaughn). The Network System, by contrast, relaxes the restriction on how many parents a node may have, allowing for a general graph structure where any node is connected to zero or more other nodes by one or more relationships from a defined set of relationships (Maurer).

(2)

The need for NoSQL databases became apparent as businesses began collecting large amounts of data that didn’t easily fit into a relational model; with data too complex to fit into a few tables, many businesses started seeing extremely poor performance using relational databases, due to the expensive cost of multiple joins in one query. NoSQL databases (literally, “Not only SQL”, though most NoSQL databases are non-relational) have been around since the 1960s as well, but only began to flourish in the late 1990s and early 2000s, when development begins on several different NoSQL databases, including Google’s Big Table, Memcache, and Neo4j (Haegan). As the name states, NoSQL databases don’t normally use SQL as a query language and usually have a non-relational model; instead, they have a data model that emphasizes scalability, to handle large amounts of data, and flexibility, to handle complex or sparse data types. By 2010, academics had started to separate NoSQL databases into four types: key-value stores, column stores, document databases, and graph databases. Key-value stores are databases with a data model of a value and a key to access it; document databases are similar to key-value stores, except that they store documents (complex data objects) instead of a simple key. Column stores are like key-value stores, except they are organized into a vertical structure, with values sharing similar columns or families. Graph databases are built on graph theory and their underlying data model consists of nodes, which represent objects, and relationships between those nodes (Vardanyan).

Graph databases and Neo4j

(3)

successors to the Network Database model that was created in the 1960s. The Network Model was replaced by the Relational Model as hardware improvements made the Network Model’s superior performance obsolete and businesses realized the flexibility of a relational model (Vaughn). The Relational Model made sense so long as businesses were keeping track of disconnected or only loosely-connected data, but as businesses evolved and started gathering data that more closely approximates the real world (where most data is interconnected in one or more ways), the Relational Model became a hindrance (Eifrem). In order to model those relationships in a Relational Model, you must perform joins on the tables involved; as tables get larger, joins become exponentially more expensive to perform - a specific example of this can be seen in the background of the Chemotext project, complete with numbers (Baker). Neo4j was created in response to these issues with databases, and is probably the most widely used graph database software in development today (DB-Engines Ranking).

Neo4j follows the basic graph database model, having two primitive types: nodes and relationships. Nodes can consist of one or more identifying attributes (every node has a unique internal Node ID that cannot be changed that serves as its primary identifier). Relationships exist only between two nodes and can have their own attributes; the only required attribute is a Type attribute.

Neo4j allows for indexing (its indexing system is built upon a Lucene1_{core) on relationships or}

nodes; the legacy version of indexing would index upon a node or relationship attribute and required that a node must be manually added to the index (aside from one auto_index for nodes and one for relationships; these indexes would automatically adjust on addition/deletion/update of appropriate nodes). This system was replaced in Neo4j 2.0 with the advent of labels; labels are special attributes that can be applied to a node or relationship as an identifier (like, but separate from, the Type attribute of a relationship); the new indexing system allows for indexing on an attribute of all nodes/relationships with a specific label and automatically maintains the index as creation, deletion,

(4)

and edit updates are made. There are many more specifics to the Neo4j system than those outlined above, but those are the only truly necessary parts that were required to build the Chemotext system. An example detailing how indexing works in Neo4j 2.0 is shown in Figure 1:

Figure 1: An example Neo4j database. Every node is of type Person (labels and relationships not shown for clarity). The index on Person(id) lists

every node’s location. The figure on the right shows the same database after inserting a new node, and the effect on the index. The added node

and its entry in the index are highlighted in red.

Chemotext Background and Prior Work

Chemotext was originally conceived of by Nancy Baker during her term as a PhD student at the University of North Carolina. She wanted a tool that could harness the approximately twenty three million articles in the MEDLINE database to discover new implicit connections between drugs and diseases for potential new therapeutic drug treatments.

(5)

Figure 2: This is the theoretical structure of Swanson’s ABC paradigm. Chemical A is linked to Disease C through some intermediate

terms B.

Figure 3: An example of Swanson’s ABC paradigm as it would be modeled in Neo4j

(6)

The dictionary of MeSH Terms is the set of concepts the National Library of Medicine uses as a controlled vocabulary thesis. It is a set of approximately twenty seven thousand concepts that serve as the overall vocabulary for indexing articles in MEDLINE; each MeSH term has been classified by Nancy as a chemical, disease, or protein. for the purposes of the Chemotext system. By combining the Dictionary of MeSH Terms with the MEDLINE database, Nancy had the set of all chemical/effects recognized by the national library of medicine and over 22 million articles relating the elements of that set.

The initial implementation for Chemotext used a SQL database as the back-end database and a PHP front-end (developed by Samuel Gass, Nick Bartlett, and Chris Allen) to deliver the data. The PHP front-end just made SQL requests to the database using user-provided information and returned the results of the query in a browser-digestible format; the majority of the work in this initial implementation was the structure of the SQL database. The structure consisted of several tables: a table each for chemicals, diseases, and proteins, three tables relating articles to chemicals, diseases, and proteins, and a table of articles; this structure is modeled in Figure 4 (Baker).

(7)

The Chemotext system performed one main query: if provided with a single MeSH Term, it would return a listing of all MeSH Terms related to that MeSH Term by a linking article. The query format for this database is shown in Figure 5:

SELECT c.name,

Count(a.pmid) AS pmid_ct, a.uid,

b.uid

FROM chemicals e JOIN chemart a ON e.uid = a.uid JOIN disart b

ON a.pmid = b.pmid JOIN diseases c ON b.uid = c.uid JOIN articles d ON a.pmid = d.pmid WHERE a.chemname = "" GROUP BY a.uid,

c.name

ORDER BY pmid_ct DESC

Figure 5: The structure of the Chemotext query in the relational version of the database

An example of this query (using “aspirin” as the starting chemical) across the tables is seen in Figure 6:

Figure 6: An example of how a query issued on the relational version of Chemotext would actually be executed. Source: NoSQL Databases and their Applications: A Case Study by Ian Kim

(8)

when the database was fully loaded, these joins could occur across several tables with tens of millions of rows, resulting in a total query time of up to fifteen minutes in some cases (Kim). This performance was unacceptable in a web application and rendered the system largely useless.

The next version of Chemotext was developed largely by Ian Kim. Recognizing that the performance of the database was limited by the inherent structure of a relational database, he decided to move the back end of the system to a database built on Neo4j. This involved a change in the data model.

Figure 7: A diagram of the schema of Ian’s Neo4j implementation of Chemotext

The new data model (pictured in Figure 7) placed both chemicals/effects and articles as nodes, with relationships between chemicals/effects and articles. This vastly simplified the query syntax; the equivalent query in the Neo4j database to the one shown in Figure 5 is shown in Figure 8:

START chem:node_auto_index(lc_name=”aspirin”)

MATCH chem < articles > other

WHERE others.type = "discond"

RETURN other.name

(9)

Ian then implemented this data model using only articles from 2002, taking the data from the original Relational database developed by Nancy’s team, ending up with approximately fifty thousand terms and articles and four hundred thousand relationships (Kim). Using this new system, Ian was able to reduce query times down to 60 milliseconds for his benchmark query, which is the query for all terms related to aspirin that was shown in Figure 9.

Creating the Full Database

Having taken the project over from Ian, our first goal was to create a working instance of the database that included all the articles from the original Relational database and could be considered up-to-date. Initial attempts to do so involved downloading the raw XML file from the NLM and processing it using Python. The Python module lxml (Behnel), a Pythonic wrapper for the C libraries libxml2 and libxslt, proved very useful in this regard; the native speed of the C libraries, combined with custom code from Liza Daly created specifically to handle large XML files, turned the process of parsing a 90+GB file into a matter of hours (approximately 18-24) instead of days (Daly). However, even with improved processing time, we could not reproduce the filtering and labeling of data that Nancy had created in the original relational system, as during her processing, she trimmed the number of articles from the approximately 20 million in PubMed to just over 11 million relevant articles, as well as added type (chemical, disease, or protein) and other annotations to the MeSH Terms. Thankfully, contacting Nancy revealed that she had continued to curate and update her data up to the current year, so we used her records of the MeSH Terms and Articles (current up through 2013) as the basis of the new Neo4j database.

(10)

statement involves a separate transaction for each CREATE statement (Espeed). However, Neo4j offers a batch import tool that is specifically designed for importing large amounts of data quickly; the native Batch Importer is written in Java (like the rest of Neo4j), but Michael Hunger, who is one of the Neo4j Developers, has written a Pythonic wrapper for the Batch Importer; Ian used this to create his version of the Neo4j database (Kim). The batch import tool takes in a configuration file, along with one or more tab-separated csv files for the nodes and zero or more tab-separated csv files for the relationships (Hunger). Using the batch import tool, we created two CSV files for nodes and two for relationships. The format of the database (nodes as circles and relationships as arcs) is shown in Figure 9:

Figure 9: The new schema for the Neo4j implementation, including the addition of labels as well as the new isMasterTerm relationship.

Note that the “Type” attributes for the relationships are not labels and they have fixed values (isMasterTerm for the relationship between a master term and its synonym and mentions for the relationship between an article and a MeSH Term it mentions).

(11)

applied to the database (see the section on indexing), the benchmark query for all terms related to aspirin (but with a search space of roughly twenty times the size) in the full implementation of Neo4j runs in 5080 ms, approximately 100 times slower than the same query run on Ian’s implementation of Neo4j. However, an average speed test of the benchmark query on 1000 random terms resulted in a mean query time of 650ms and a median query time of 42ms, implying that the Aspirin query is not a good indicator of the average query time.

Synonyms

Once we had upgraded the database to include the most up to date data from Nancy, the next step was to add some functionality to the database. There are approximately 20,000 terms in the MeSH Dictionary, but each term has a number of alternate names (referred to from now on as “synonyms”) that it may be known by. In order to prevent confusion if a researcher enters a synonym and the database returns zero results (since the articles reference the master term, or canonical term, not synonyms), we built the “isMasterTerm” relationship into the database: it connects every synonym with the canonical term it refers to. In order to avoid errors when a researcher used the canonical term in a search (or having to create a second query syntax), the canonical term has an “isMasterTerm” relationship from itself to itself. The query syntax for getting related terms after this change to the database is shown in Figure 10 and a graph representation of the query is shown in Figure 11:

MATCH

(synonym:MeSHTerm) <-[:isMasterTerm]-(canonicalTerm:MeSHTerm) < -[:mentions]- (article:Article) -[:mentions]-> (relatedTerm:MeSHTerm)

WHERE startTerm.name = “<name>”

RETURN relatedTerm.name

(12)

Figure 11: A graph representation of the query in Figure 10.

Performance Testing

Before we managed any performance testing, we discovered one very important fact about the database: due to the size of the database (>100 million primitives), the system we were running it on did not have enough available RAM to run queries efficiently. According to the Neo4j documentation, for a system with approximately 100 million primitives, you need a minimum of 4GB of available RAM for Neo4j; working with less than that requires paging out the database constantly, just like what happened with the Relational version of the database (Neo4j). Accordingly, we moved the Neo4j implementation to a system with 7.8GB of RAM, and eight i7-3770 (3.40GHz each) CPUs.

Our next step was to determine if we could improve the performance of the database in any way. Our first test was to see if reintroducing SQL to a portion of the process could improve performance. We created a standard Relational database in MySQL that contained a single table, which held the synonym relationship captured in the “isMasterTerm” relationship. We then tested the performance of getting the synonym through the Relational database and querying for related terms to the canonical term in Neo4j versus doing the entire query (synonym relationship included) in Neo4j. The thinking on this is that the SQL lookup would be very fast and would save time searching in the Neo4j database; however, this was not borne out in the testing.

(13)

Mean Time (s) Median Time (s)

Native Neo4j request 2.93 .59

Neo4j + SQL request 5.09 .65

Figure 12: A table showing the results of the two implementations of the Synonym function. The implementation using both SQL and Neo4j had a

slower median and mean time than the implementation using only Neo4j.

The miniscule difference in performance time meant that either solution was acceptable; we opted to continue with the synonym relationships in Neo4j, as this not only simplified the query process by eliminating an entire system, but it also freed up system memory for Neo4j.

The next performance step was to try and include a middle-man to limit requests to the Neo4j database. We chose to use Memcached, a high-performance distributed caching system originally developed by Danga Interactive (Memcached). Memcached not only allows you to use excess memory for caching, but also allows you to spread the load across multiple systems, creating a consistent virtual memory across all of them instead of duplicating a cache on each machine separately (Memcached).

The idea was that with so many related terms, a batch of queries would request a small section of localized, repeated nodes, and we could store those in a cache instead of going to the database. Further reflection upon this revealed that this cache would be of limited usefulness, as it could not efficiently store the relationships between a MeSH Term and its articles and then the relationships between those articles and other terms without duplicating a portion of the Neo4j database, which it would do a poorer job of and be a non-efficient use of memory. However, the caching system could be useful in requesting information on a node, so we decided to test it.

(14)

searched (in the same order), but only using Neo4j’s native caching agent. The test results can be seen below:

Mean Time (ms) Median Time (ms)

Native Neo4j Caching 0.142 0.129

Memcached-Assisted Caching 0.221 0.216

Figure 13: A table showing the performance differences between caching using Memcached and caching using Neo4j’s native caching utility

.

Since the nodes in the Neo4j database are so small (the amount of data they contain is tiny), querying for them directly is so inexpensive that going out to another system for caching actually slows the system down.

We also considered implementing an ElasticSearch client to sit on top of the Neo4j database, but further examination revealed that ElasticSearch, which is designed largely for full-text indexing and search, would only hinder the performance of the database (ElasticSearch).

(15)

Summary

The overall conclusions we drew from this research were the following:

Neo4j (and arguably graph databases overall) are no longer beta software. When you can create a database over a hundred million primitives that performs at sub-second query times, the software is mature enough to be usable in a performance environment.

The specifics of your problem determine whether a graph database will perform well. The Chemotext problem was uniquely suited to a graph database, because it represents a “friend of a friend”, or second-degree separation, connection. This second-degree connection, if solved in a relational database, requires multiple joins and becomes prohibitively expensive as soon as the tables of data reach any reasonably large size.

When the problem is right, a standalone graph database is the best option for performance and trying to add other software in to improve performance will either produce no improvement or actually degrade performance. This was evidenced with the tests using MySQL and Memcache.

(16)

Works Cited

Baker, Nancy C., and Bradley M. Hemminger. "Abstract." National Center for Biotechnology Information. U.S. National Library of Medicine, 27 Mar. 2010. Web. 10 Apr. 2014. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2902698/>.

Behnel, Stefan. "Lxml - XML and HTML with Python." Lxml. N.p., n.d. Web. 13 Apr. 2014. <http://lxml.de/index.html#introduction>.

Codd, Edgar F. "A Relational Model of Data for Large Shared Data Banks." N.p., n.d. Web. 10 Apr. 2014. <http://technology.amis.nl/wp-content/uploads/images/RJ599.pdf>.

Daly, Liza. "High-performance XML Parsing in Python with Lxml." High-performance XML

Parsing in Python with Lxml. IBM DeveloperWorks, 24 Mar. 2011. Web. 21 Apr. 2014.

<http://www.ibm.com/developerworks/xml/library/x-hiperfparse/>.

"DB-Engines Ranking." DB-Engines. N.p., Apr. 2014. Web. 10 Apr. 2014. <http://db-engines.com/en/ranking>.

Eifrem, Emil. "Neo4j -- or Why Graph Dbs Kick Ass." Neo4j -- or Why Graph Dbs Kick Ass. N.p., 22 Nov. 2008. Web. 10 Apr. 2014. <http://www.slideshare.net/emileifrem/neo4j-presentation-at-qcon-sf-2008-presentation>.

"Open Source Distributed Real Time Search & Analytics | Elasticsearch."Elasticsearch.org. Elastic Search BV, n.d. Web. 19 Apr. 2014. <http://www.elasticsearch.org/>.

Espeed. "Fastest Way to Perform Bulk Add/insert in Neo4j with Python?" Stack Overflow. N.p., 1 Oct. 2012. Web. 13 Apr. 2014. <http://stackoverflow.com/questions/12643662/fastest-way-to-perform-bulk-add-insert-in-neo4j-with-python>.

Guzunda, Leon, and Nick Quinn. "An Introduction to Graph databases." An Introduction to Graph databases. N.p., n.d. Web. 10 Apr. 2014.

<http://www.slideshare.net/infinitegraph/an-introduction-to-graph-databases>.

Haegan, Knut. "A Brief History of NoSQL." All About the Code. N.p., May 2010. Web. 10 Apr. 2014. <http://blog.knuthaugen.no/2010/03/a-brief-history-of-nosql.html>.

Horne, Christopher. "IQube Marketing Limited » Glossary of Big Data Terminology." IQube Marketing Limited. N.p., n.d. Web. 10 Apr. 2014.

<http://www.iqubemarketing.com/glossary-big-data-terminolgy/>.

Hunger, Michael. "Public Jexp/batch-import." GitHub. N.p., n.d. Web. 13 Apr. 2014. <https://github.com/jexp/batch-import>.

(17)

for Mining Undiscovered Public Knowledge." BMC Bioinformatics. N.p., n.d. Web. 24 Apr. 2014. <http://www.biomedcentral.com/1471-2105/11/S2/S3>

Lindberg, Donald. "Internet Access to the National Library of Medicine." ACP Online. N.p., Sept. 2000. Web. 10 Apr. 2014.

<http://www.acponline.org/clinical_information/journals_publications/ecp/sepoct00/ nlm.pdf>.

Maurer, H., and N. Scherbakov. "1. Network (CODASYL) Data Model." 1. Network (CODASYL) Data Model. N.p., n.d. Web. 10 Apr. 2014.

"Memcached - a Distributed Memory Object Caching System." Memcached.org. Dormando, n.d. Web. 19 Apr. 2014. <http://memcached.org/about>.

Neo4j. "22.8. JVM Settings." 22.8. JVM Settings. N.p., n.d. Web. 13 Apr. 2014. <http://docs.neo4j.org/chunked/stable/configuration-jvm.html>.

Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. Boston: McGraw-Hill, 2003. Print.

"Using PubMed." National Center for Biotechnology Information. U.S. National Library of Medicine, n.d. Web. 19 Apr. 2014. <http://www.ncbi.nlm.nih.gov/pubmed>.

Vardanyan, Mikayel. "Home Industry Info Picking the Right NoSQL Database Tool." Uptime

Performance Tips Tips for SysAdmin Webmaster Network Admin. N.p., 22 May 2011.

Web. 10 Apr. 2014. <http://blog.monitis.com/2011/05/22/picking-the-right-nosql-database-tool/>.