
Tries (prefix trees) [27, 26] are a special tree structure that uses string-like keys. In a trie, the key of a node is interpreted as a string of m elements. The root node (first level) contains a child for each distinct value of the first element in the key string. Each node on the second level contains a child node for each distinct value of the second key element. On the third level, each node contains a child for each distinct value of the third element in the key string, and so on. Generally, the n-th level of the tree discriminates the keys based on the n-th element in the key string. The depth of a leaf node in a trie therefore corresponds to the length of the key that leads to the leaf. Only the leaf nodes contain actual values associated with a key. Each node must maintain references to its child nodes, at most one for each distinct value of the key element at the position corresponding to the level of the node.

Figure 8: Objects of a BST. Arrows depict references.

[Figure: object diagram of bst:BstTree, whose root reference points to node1:BstNode; the nodes node1 to node7 are connected by left and right references.]

Listing 15: A node used in a trie that stores strings of Booleans.

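class BinaryTrieNode<Aspect> {
    // Child for keys whose next element is true.
    private BinaryTrieNode<Aspect> trueChild;
    // Child for keys whose next element is false.
    private BinaryTrieNode<Aspect> falseChild;
    // Value associated with the key that ends at this node.
    private Aspect data;
}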

If the number of those distinct values is small, each node may contain one explicit reference for each possible element value. For example, in a trie that stores strings of Boolean values, each node can have at most two children: one for strings in which the next element is true and one for strings in which the next element is false. In this case, both references can be provided as distinct fields (Listing 15).
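The explicit-field variant can be sketched as follows. This is a minimal illustration, not the implementation behind Listing 15: the class name BooleanTrieSketch and the methods put and get are hypothetical, and the sketch assumes that the root node stands for the empty key.

```java
// Hypothetical sketch: names are illustrative and do not come from the listings.
class BooleanTrieSketch<V> {
    private BooleanTrieSketch<V> trueChild;   // subtree for keys whose next element is true
    private BooleanTrieSketch<V> falseChild;  // subtree for keys whose next element is false
    private V data;                           // value stored at the node where a key ends

    // Walks one level per key element and creates missing children on the way.
    void put(boolean[] key, V value) {
        BooleanTrieSketch<V> node = this;
        for (boolean element : key) {
            if (element) {
                if (node.trueChild == null) {
                    node.trueChild = new BooleanTrieSketch<>();
                }
                node = node.trueChild;
            } else {
                if (node.falseChild == null) {
                    node.falseChild = new BooleanTrieSketch<>();
                }
                node = node.falseChild;
            }
        }
        node.data = value;
    }

    // Returns the stored value, or null if the key was never inserted.
    V get(boolean[] key) {
        BooleanTrieSketch<V> node = this;
        for (boolean element : key) {
            node = element ? node.trueChild : node.falseChild;
            if (node == null) {
                return null;
            }
        }
        return node.data;
    }
}
```

Because there are only two possible element values, choosing the child is a single field access and no secondary data structure is needed.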

However, if the number of children is unknown beforehand or if the number of possible distinct values for the next level is considerably larger than the actual number of children, a trie node can use a secondary storage to store the references to those children, for example using a search tree or a hash map (see Listing 16).

Asymptotic time complexity of basic operations on tries

Generally, the exact complexity of the operations on tries depends on the secondary data structure used in each node to store the children.

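Listing 16: A node for a trie that uses key-tuples with elements of type Object.

import java.util.Map;

class TrieNode<Aspect> {
    // Secondary storage: maps a key element to the child node for that element.
    private Map<Object, TrieNode<Aspect>> children;
    // Value associated with the key that ends at this node.
    private Aspect data;
}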

Exact query

Finding the node with a key of length k requires k searches in the secondary data structures, one for each element in the key string. For example, if a hash table is used as the secondary data structure, each of these searches takes expected constant time (see Section 7.5), so the complexity of an exact query is constant with respect to the number of elements in the trie and linear in k.
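The k searches of an exact query can be sketched as a loop over the key elements. The sketch assumes a java.util.HashMap as secondary storage; the class LookupTrieNode and the helpers childFor and setData are hypothetical names that exist only to make the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an exact query; not taken from the listings.
class LookupTrieNode<V> {
    // Secondary storage (assumed here to be a hash map).
    private final Map<Object, LookupTrieNode<V>> children = new HashMap<>();
    private V data;

    // Setup helper: returns the child for the element, creating it if absent.
    LookupTrieNode<V> childFor(Object element) {
        return children.computeIfAbsent(element, e -> new LookupTrieNode<>());
    }

    void setData(V data) {
        this.data = data;
    }

    // Exact query: one map lookup per key element, so k lookups for a key of length k.
    V find(List<?> key) {
        LookupTrieNode<V> node = this;
        for (Object element : key) {
            node = node.children.get(element);
            if (node == null) {
                return null; // no stored key has this prefix
            }
        }
        return node.data;
    }
}
```

The loop never touches nodes outside the path of the key, which is why the cost does not grow with the number of stored elements.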

Range query

Tries are also called prefix trees because finding all nodes with a key that starts with a specific sequence is efficient. For a prefix of length p, p searches in the secondary data structures need to be executed to find the node that represents the requested prefix. All leaf nodes that can be reached from this node are then part of the requested range. Again, if hash tables are used as secondary storage, this node can be found in constant time with respect to the number of elements in the trie; enumerating the range additionally takes time proportional to the number of nodes below that node.
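A prefix query can be sketched in two steps: descend along the prefix, then collect everything below the node that was reached. The class PrefixTrieNode and its methods put, findByPrefix and collect are hypothetical names, and the hash map as secondary storage is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a prefix (range) query; not taken from the listings.
class PrefixTrieNode<V> {
    private final Map<Object, PrefixTrieNode<V>> children = new HashMap<>();
    private V data;

    // Inserts a value, creating the nodes along the key as needed.
    void put(List<?> key, V value) {
        PrefixTrieNode<V> node = this;
        for (Object element : key) {
            node = node.children.computeIfAbsent(element, e -> new PrefixTrieNode<>());
        }
        node.data = value;
    }

    // Step 1: p lookups to reach the prefix node. Step 2: collect its subtree.
    List<V> findByPrefix(List<?> prefix) {
        PrefixTrieNode<V> node = this;
        for (Object element : prefix) {
            node = node.children.get(element);
            if (node == null) {
                return new ArrayList<>(); // no key starts with this prefix
            }
        }
        List<V> result = new ArrayList<>();
        node.collect(result);
        return result;
    }

    // Depth-first traversal; the result order is unspecified with a hash map.
    private void collect(List<V> result) {
        if (data != null) {
            result.add(data);
        }
        for (PrefixTrieNode<V> child : children.values()) {
            child.collect(result);
        }
    }
}
```

With a search tree as secondary storage instead of a hash map, the same traversal would return the range in key order.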

Insertion

Finding the node at which a new value needs to be added is similar to an exact query. For a node with a key of length k, k exact queries in the secondary storage are required.

Removal

Similarly, removing a node requires k searches in the secondary data structures to find the node to remove, and one additional removal from the secondary data structure of its parent node to detach it.
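Insertion and removal can be sketched together. The class MutableTrieNode and its methods are hypothetical names, and the hash map as secondary storage is an assumption. The sketch also prunes ancestors that are left without data and without children, which goes beyond the single removal described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of insertion and removal; not taken from the listings.
class MutableTrieNode<V> {
    private final Map<Object, MutableTrieNode<V>> children = new HashMap<>();
    private V data;

    // Insertion: k lookups, creating a child wherever the path does not exist yet.
    void insert(List<?> key, V value) {
        MutableTrieNode<V> node = this;
        for (Object element : key) {
            node = node.children.computeIfAbsent(element, e -> new MutableTrieNode<>());
        }
        node.data = value;
    }

    V get(List<?> key) {
        MutableTrieNode<V> node = this;
        for (Object element : key) {
            node = node.children.get(element);
            if (node == null) {
                return null;
            }
        }
        return node.data;
    }

    // Removal: returns true if a value was stored under the key.
    boolean remove(List<?> key) {
        return remove(key, 0);
    }

    private boolean remove(List<?> key, int level) {
        if (level == key.size()) {
            boolean hadData = data != null;
            data = null;
            return hadData;
        }
        Object element = key.get(level);
        MutableTrieNode<V> child = children.get(element);
        if (child == null) {
            return false; // key is not stored
        }
        boolean removed = child.remove(key, level + 1);
        // Detach the child from this node's secondary storage once it is empty.
        if (removed && child.data == null && child.children.isEmpty()) {
            children.remove(element);
        }
        return removed;
    }

    int childCount() {
        return children.size();
    }
}
```

Passing the current level as a parameter avoids storing the depth in each node.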

Space complexity

The space complexity of tries largely depends on the choice of the secondary storage that is used at each intermediate node on the paths from the root to the leaves. In the worst case, no keys of the inserted values share a common prefix. In this case, the number of nodes grows linearly with the number of inserted values times the length of the key string. The size of each secondary storage naturally depends on the choice of the data structure for the secondary storage and is, as such, an implementation detail.

The size of a single node that contains P children can be estimated with:

Because we assume that the length of a key tuple is fixed, we could use two different kinds of nodes: an intermediate node that does not store a reference to any data, and a leaf node that does contain that reference but lacks a secondary storage, because it cannot have children. For simplicity, we use the same node everywhere. The number of nodes required for storing a trie depends on how many prefixes the key tuples of the stored associations share. Because this is not easy to predict, we consider the worst and best case.

In the worst case, no key tuple shares a prefix. In this case, the number of nodes is the number of associations times the length of the key tuples, that is $M \cdot N$. In the best case, the key tuples of all stored associations are equal except for the last element in the key tuple, that is, the length of the shared prefix is $M - 1$. In this case, the number of nodes is $N + (M - 1)$. We can therefore estimate an upper and lower boundary for the size of a trie:

$$(N + M - 1) \cdot \mathit{sizeof}(\mathit{TrieNode}) \le \mathit{sizeof}(\mathit{TrieStorage}_{M,N}) \le M \cdot N \cdot \mathit{sizeof}(\mathit{TrieNode})$$

Example

Listing 17 shows a node of a trie. The secondary storage can, for example, be a hash table. To find an aspect instance given a tuple, the protected find method receives the current level in the trie (this removes the need to store it in the node) and the whole tuple. If the current level equals the length of the tuple, the data of the current node is the sought aspect instance. Otherwise, the method continues the search with the child node that is associated with the element of the tuple corresponding to the current level.

Figure 9 on page 64 shows the nodes of a trie that associates text strings with objects of type Data. In the figure, the trie contains the keys bar, baz and bob. The paths in the trie leading to the leaf nodes for bar and baz share the common prefix ba (represented by node2). All leaf nodes have a common root node b (node1).
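As a quick check of the size bounds from the space complexity discussion, the three keys bar, baz and bob can be counted by hand. The calculation below assumes one node per distinct non-empty prefix and does not count a separate empty root; this counting convention is an assumption of the sketch and may differ from the figure.

```latex
\begin{align*}
N = 3 \text{ associations}, \quad M &= 3 \text{ elements per key} \\
\text{lower bound: } N + M - 1 &= 5 \\
\text{nodes needed: } |\{b,\ ba,\ bo,\ bar,\ baz,\ bob\}| &= 6 \\
\text{upper bound: } M \cdot N &= 9 \\
5 \cdot \mathit{sizeof}(\mathit{TrieNode}) \;\le\; 6 \cdot \mathit{sizeof}(\mathit{TrieNode}) &\;\le\; 9 \cdot \mathit{sizeof}(\mathit{TrieNode})
\end{align*}
```

The trie lies between the two bounds because only bar and baz share the prefix ba, while bob shares only the first element with them.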