Chapter 4 Conceptual Graph Dataset
4.2 Conceptual Graph Complexity
An issue often not addressed is thecomplexity of the underlying MRL, CG or knowl-
edge in general. Generally it is assumed that larger input will produce more complex KR, the reason why all parsers’ accuracy deteriorates with increased input size [Choi
et al., 2015]. In the case of CG, complexity is measurable because it refers tograph
complexity and we can therefore use existing approaches to quantify it and map it, and then deduce how input complexity relates to agent accuracy and performance.
Hereinafter the use ofnode to edge ratio|V|/|E|(also known asgraph sparse-
ness) and average path length lG of a graph are used as indicators of the structural
complexity of a graph; other more complex measures such asbetweenness, radius,
closeness, clusterization[Barooah and Hespanha, 2007; Wright, 1977] were not used due to requiring more elaborate work on the CG without necessarily providing data that could aid the study of knowledge complexity.
In addition to the above, the edge search space was used (e.g., how many
possible edge combinations exist in a graph) as a metric of defining the entire set of possible edges. What was discovered (and discussed in detail in the next Chapter) is that similar to other research in NLU, accuracy and performance deteriorates with increased complexity.
Other metrics such asclustering coefficient ordegree distribution could have
been implemented; however the focus of this Chapter is the dataset and not its complexity metrics and analysis, which was an empirical byproduct. Furthermore,
as shown in (4.1), whereCi is the cluster coefficient, ejk is an edge from vj tovk,
andki is the number of neighbours to a node.
Ci =
|{ejk:vj, vk∈Ni, ejk∈E}|
ki(ki−1)
. (4.1)
However, the notion of a triplet in a CG is ill-defined; instead it should be a quadruple since the use of a triplet is violated by the fact that concepts via relations would connect to another concept thereby forming a rhombus/quadruplet instead of a triplet. This difference is demonstrated in Figure 4.3, where the left-side rhombus is a CG cluster, whereas the right-side triangle is the traditional approach described by equation (4.1).
Figure 4.3: Clustering Coefficient: CG rhombus versus KR triplet
Through the creation process (Section 4.1.1) certain empirical observations were made, based on visual analysis of the data; recurring patterns and common rules were identified which have been recorded and are herein presented. In all graphs presented in Section 4.2.1 the root node is always at the left side.
4.2.1 Graph Columns and Linearity
Certain patterns are associated with acolumn-like appearance, related to the linear
structure of the phrase. Such an example is shown in Appendix B.1. Those graphs are assumed to be easier to re-create and learn since their structure is linear; fur- thermore they produce simpler training material due to fewer contradicting nodes and edges. A column like graph will have a single average path length, and the
specific graph in Appendix B.1 has|V|/|E|= 0.54 andlG= 8. Another column-like
graph is shown in Appendix B.2, with a|V|/|E|= 1.14 andlG= 4.
Some of the linear graphs are less column-like, and resemble more of a but- terfly or tree branch, with large concept clusters, such as the one shown in B.3. In
this graph, the ratio is|V|/|E|= 0.62 and lG= 2.
A more complex graph is shown in Appendix B.4; this one has a large mid centre cluster of concepts, which are connected by two relations before and after. In
this graph, the ratio is |V|/|E|= 0.78 and lG = 4. Other forms of column graphs are identified by clustering of concept nodes, which are all connected to the same group of relations. Relation grouping is also possible, but more rare than concept grouping.
4.2.2 Graph Branching and Grouping
Branches of graphs increase complexity, especially when a graph contains multiple branches. Appendix B.5 shows what a shallow graph with a high ratio of concepts to relations, and a high ratio of nodes to edges looks like, especially when combined
with a very small path length. The actual ratio is |V|/|E| = 0.91 and lG = 2.
However, in this instance the graph has two groups, the first connecting the left- most concepts through relation ”is”, and the second relation ”affecting” connecting to the second group on the right.
This type of causality is represented by branching, which adds a level of complexity. A measure which could potentially aid would be the clustering coeffi- cient for directed graphs. Yet that would not suffice since the distinction between groups based on edges is more difficult to calculate and would thus need to be com- bined with degree distribution. The graph shown in Appendix B.6 which has a ratio
|V|/|E|= 0.51 andlG= 4.
Branching may include groups of column-like (linear) graphs, such as the one shown in Appendix B.7. This type of graph is a peculiar entity: there appear to exist two distinct column-like graphs, joined together by the relation ”for”, it has
a |V|/|E| = 0.63 and lG = 8. Those graphs are often hard to get right because
they are specialisations; they define a new rule (or often an exception to a recurring pattern) which is an outlier.
Another column-like graph in Appendix B.8 shows how even small branches
may affect accuracy; it has a|V|/|E|= 0.93 andlG= 6.2. Similar to Appendix B.7
is Appendix B.9, only this one is more symmetric, has a smallerlG = 4 and a ratio
|V|/|E|= 0.84. Last but not least, there is Appendix B.10 which albeit having an
lG = 5.3 is characterised by a ratio|V|/|E|= 0.92, thus making it more sparse than
dense.
4.2.3 Graphs and Operators
Analysing graph patterns during their creation (empirically) provided some inter- esting observations, most of which revolve around logic embedded in the KR and
and effect, others hint state changes, and whilst there are exceptions they are very rare. A clear connection between relations (in CG-terms) and operators exists; op- erators are always relations and relations are always the node in a graph which
diverges into branches and leafs. For example, in Appendix B.7 it is a ”for” that
creates the downside branch, and in Appendix B.9 it is an ”or” and an ”in” that
act as the connecting branch nodes.
This form of reappearing patterns is what inspired the theoretical algorithm in Section 3.12. Much of the premise of Semantic parsing in the agent relies on
the meaning of those terms which act as operators in the graph; therefore the
hypothesis is that Semantic-based parsing is preferable to Syntactic parsing, since it can inherently provide greater accuracy; this hypothesis is examined later in Chapter 5.