Graphs - Creating an Extensible Unit Converter Using OpenMath as the Representation of the Sema

A major part of this project involves getting from one unit to another. In some cases this is trivial: for example, the foot is defined in terms of the metre, so it should be fairly straightforward to switch between the two. The difficult case arises when the two units are non-adjacent, i.e. the definition of one involves at least one intermediary before reaching the second. As an example, the yard is defined in terms of the foot, which is defined in terms of the metre, so converting yards to metres would also require the definition of the foot. The obvious way to deal with this appears to be to treat it as a graph-traversal problem: load all the unit definitions and map the relations between them onto a graph.
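The yard-to-metre case can be illustrated with a minimal sketch. The `DEFINITIONS` table and the `to_metres` function are hypothetical names chosen for illustration, and the sketch assumes every unit is defined in terms of exactly one other unit, ending at the metre.

```python
# Hypothetical definition table: unit -> (unit it is defined in terms of, factor),
# meaning 1 unit = factor * base unit.
DEFINITIONS = {
    "yard": ("foot", 3.0),
    "foot": ("metre", 0.3048),
}


def to_metres(value, unit):
    """Follow the chain of definitions until the metre is reached."""
    while unit != "metre":
        base, factor = DEFINITIONS[unit]
        value *= factor  # apply this step of the chain
        unit = base      # move one definition closer to the metre
    return value


# Converting yards uses the definition of the foot as an intermediary.
print(to_metres(2.0, "yard"))
```

This only works because every chain leads to the metre; converting between two arbitrary units is what motivates treating the definitions as a graph.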

2.6.1 Storage

Bell College (2006) describes the data structures that can be used for the nodes and edges of a graph (albeit in Java), so we intend to base our design on this. Skiena (1997) agrees that an adjacency list, rather than an adjacency matrix, would be the most appropriate data structure for a small graph such as ours.

We think that an undirected, unweighted graph should be created for each dimension of unit, storing the conversion factor between two units as a property of the edge between their nodes. The factor would not be the edge weight, as we do not believe it is worth setting different weights for different calculations. If we end up using an algorithm which requires edge weights, we will set them all to 1: as most of the conversions will be simple multiplications, the cost of calculating an exact weighting would far exceed the gain from doing so. We choose an undirected graph because the conversion factor in one direction is simply the inverse of the conversion factor in the other, and the OpenMath object stores the conversion factor as well as how it relates to both units. The dimension of each unit is stored in the Small Type System file for the CD that contains the unit.
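One possible shape for such a structure is sketched below. The `UnitGraph` class and its method names are assumptions made for illustration, not part of any design in the sources cited. The sketch stores each undirected edge as two adjacency-list entries, so the inverse factor is available in the reverse direction.

```python
from collections import defaultdict


class UnitGraph:
    """Adjacency-list graph for one dimension (hypothetical sketch)."""

    def __init__(self):
        # unit name -> {neighbouring unit: factor to multiply by}
        self.adjacent = defaultdict(dict)

    def add_definition(self, unit, base, factor):
        """Record that 1 unit = factor * base, plus the inverse relation."""
        self.adjacent[unit][base] = factor
        self.adjacent[base][unit] = 1.0 / factor


length = UnitGraph()
length.add_definition("foot", "metre", 0.3048)
length.add_definition("yard", "foot", 3.0)
```

The conversion factor is kept as edge data only; a path-finding algorithm that needs weights can treat every edge as having weight 1.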

CHAPTER 2. LITERATURE SURVEY 21

2.6.2 Traversal: Shortest Path

A large number of papers have been written on the subject of finding paths through graphs, describing algorithms of varying efficiency and function. As this is a time-consuming task, we intend to have a program that runs once to load all the relevant online CDs and build up a list of units and the shortest paths between them. This list could then be loaded and used as required, without having to work it out every time. The only times these data would require updating would be when the required CDs are updated, when new ones are added or, in a related case, when the user supplies their own. Even so, we feel this will be significantly quicker than calculating the best path each time the user wishes to convert units.
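The compute-once, load-later idea might look like the following sketch. The JSON format, the file name `unit_paths.json` and the function names are all assumptions made for illustration; the source does not specify how the list would be stored.

```python
import json
import os
import tempfile


def save_paths(paths, filename):
    """Write the precomputed path table to disk."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(paths, handle)


def load_paths(filename):
    """Read a previously saved path table."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical precomputed table: paths[source][target] lists the units visited.
paths = {"yard": {"metre": ["yard", "foot", "metre"]}}

cache_file = os.path.join(tempfile.gettempdir(), "unit_paths.json")
save_paths(paths, cache_file)   # done once, after the CDs are loaded
restored = load_paths(cache_file)  # done on each later run
```

The saved file would need regenerating whenever a CD changes or the user supplies a new one.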

There are several kinds of shortest (“best”) path algorithm available: those that find the shortest path between two nodes, those that find the shortest path from one node to all the other nodes, and those that find the shortest paths between all the nodes. There are variants of these for both weighted and unweighted, as well as directed and undirected, graphs. As we have chosen to use an adjacency list as the predominant data structure, rather than an adjacency matrix, this rules out those algorithms that demand matrices (Seidel (1992), Zwick (1998) being examples). The most famous algorithm in the area is the first found in Dijkstra (1959), but this algorithm only finds all the paths from one source node. Finding all the paths using Dijkstra (1959) thus involves running the algorithm N times for N units, with a resulting minimal complexity of $O(N^2 \log N)$, or at most $O(N^3)$.
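Running a single-source algorithm once per unit can be sketched as follows. This is a generic priority-queue formulation of Dijkstra's algorithm with every edge weight assumed to be 1, not a transcription of the 1959 paper; the function names and the sample graph are hypothetical.

```python
import heapq


def dijkstra(adjacent, source):
    """Return {node: number of edges from source}, with every weight set to 1."""
    distance = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distance[node]:
            continue  # stale entry: a shorter route was already found
        for neighbour in adjacent[node]:
            candidate = dist + 1
            if candidate < distance.get(neighbour, float("inf")):
                distance[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distance


def all_pairs(adjacent):
    """Run the single-source search once for each of the N units."""
    return {source: dijkstra(adjacent, source) for source in adjacent}


# Hypothetical adjacency list for a small length graph.
adjacent = {
    "metre": ["foot", "kilometre"],
    "foot": ["metre", "yard"],
    "yard": ["foot"],
    "kilometre": ["metre"],
}
table = all_pairs(adjacent)
```

Here `table["yard"]["kilometre"]` holds the number of definitions that must be chained to get from yards to kilometres.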

As we envisage calculating all the paths to all the nodes first, and storing them in a file, it may be worth finding a more efficient algorithm for finding all the paths at once. Johnson (1977) describes just such an algorithm for finding the shortest path between all nodes, but goes on to cite Wagner (1976) as a better algorithm when the weights are low (as previously stated, we anticipate the weights all being set to 1). The graph will be what is known as an “edge-sparse graph”, as there will be few edges between units: approximately N − 1 edges for N units. This is based on the expectation that each unit will be defined in terms of exactly one other unit, so the graph will mostly have a star-like structure, with several nodes coming from one central node, rather than many of the nodes being connected to each other. Using Wagner’s algorithm would be faster than Dijkstra’s in this case (Wagner 1976). However, Wagner’s algorithm is significantly more complicated to implement than Dijkstra’s.

Dijkstra (1959) also contains a second algorithm, which could be used to find the shortest route without having to load all the nodes of the graph: starting from the source node, only those nodes visited before the destination node is found need be loaded. This could speed up the processing, or traversal, of the graph, while still finding the shortest path. Zwick (2002) (an update to Zwick (1998)) provides a different approach, attempting to find the shortest path that additionally uses the fewest edges. This is unnecessary in our case, firstly because our graphs are unweighted, so any shortest path will already use the fewest possible edges, and secondly because of the added complexity involved. In addition, it uses matrices.

Chan (2006) and Chan (2007) describe faster algorithms, but the algorithms in the latter are more efficient for dense graphs, and our graphs will be very sparse. Chan (2006) points out that using a breadth-first search from every vertex to find all the shortest paths has complexity $O(mn)$, where m is the number of edges and n is the number of vertices. As we are anticipating m = n − 1 (or certainly $O(n)$), this results in $O(n^2)$. The breadth-first search seems to be one of the simpler algorithms to implement, and gives the optimal result when the weight of every edge is the same, as it will always find the route encompassing the fewest edges possible. However, a single breadth-first search is designed to find a route between two nodes, which is less efficient if we wish to find all the routes between all the nodes, and as such it may be better to consider this algorithm only if we decide to perform the lookup each time the user enters a query. On the other hand, if we do have to implement an algorithm ourselves, this is a fairly easy one to implement, and because n will be so small, the complexity is not really going to be a significant issue; the time taken to implement a particular algorithm will be more of a deciding factor. We will investigate these options further during the design phase of the project; a highly optimised algorithm is probably more effort than it is worth, as none of our graphs will have many nodes or edges.
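A per-query breadth-first search could be sketched as below. The `convert` function and the sample adjacency list are hypothetical; the sketch assumes each edge carries the factor to multiply by when travelling in that direction, and it stops as soon as the target unit is reached.

```python
from collections import deque


def convert(adjacent, value, source, target):
    """Breadth-first search from source, multiplying factors along the route."""
    queue = deque([(source, value)])
    visited = {source}
    while queue:
        unit, amount = queue.popleft()
        if unit == target:
            return amount  # first arrival uses the fewest edges
        for neighbour, factor in adjacent[unit].items():
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, amount * factor))
    raise ValueError(f"no route from {source} to {target}")


# Hypothetical adjacency list: unit -> {neighbour: factor to multiply by}.
adjacent = {
    "yard": {"foot": 3.0},
    "foot": {"yard": 1.0 / 3.0, "metre": 0.3048},
    "metre": {"foot": 1.0 / 0.3048},
}

print(convert(adjacent, 2.0, "yard", "metre"))
```

Because the running value is carried through the queue, no separate pass is needed to multiply the factors once the route is known.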

2.6.3 Implementation

We could write our own graph-related data structures and algorithms. However, as described in de Halleux (2007), there are freely available libraries for .NET for creating and traversing graphs. QuickGraph is a free, open-source set of libraries written in C#. It includes functions for several shortest path algorithms, including Dijkstra’s algorithm. As stated previously, we would prefer to use a different algorithm, as Dijkstra’s is not the most efficient, but we may choose to implement using this method, then change to a more efficient algorithm if necessary once the system is working.

In document Creating an Extensible Unit Converter Using OpenMath as the Representation of the Semantics of the Units (Page 38-40)