Frame State Calculation - Module 1

4.2 Module 1 – Pre-Processor

4.2.2 Frame State Calculation

The first stage of pre-processing is to extract the relevant qualitative states of each frame of the input. Each individual frame of the system input is iteratively processed to find a set of variables that describe the state of the objects of that frame and the state of the relationships between each pair of objects.

These variables are similar to the variables that are calculated by dos Santos et al. (2009); differences will be noted as each variable is discussed. There are three per-object variables and four variables that describe the relationships between each pair of objects. The first per-object variable, a variable not used by dos Santos et al. (2009), marks whether an object is currently visible¹. The other two per-object variables are the x and y positions of each object (if more than one object is part of the same box, those objects share the same position). The four variables that describe the relationship between each pair of objects are the straight-line distance between the centres of the two objects, the connectivity between the two objects, the horizontal relationship between the two objects and the vertical relationship between the two objects.

The connectivity of the two objects represents whether the objects appear to be touching one another and, if so, how. This is encoded by means of three mutually exclusive fluents. Table 4.1 describes the meaning of each predicate. The fluents are taken from the RCC-3 variant of the Region Connection Calculus (Randell et al., 1992) proposed by Santos & Shanahan (2002). The original Region Connection Calculus has eight different states, but these assume that the relative positions of each region can be perfectly distinguished, even when one object is completely enclosed within the other. The variant by Santos & Shanahan (2002) assumes that regions cannot be perfectly distinguished.

¹ For the purposes of scalability, the system assumes that by default a predicate is false if it has not been asserted.

    frame(9, 2,                        % Frame number, box count
      [                                % List of boxes
        box(9, 18, 16,                 % Frame number, box ID, parent box ID
          [person],                    % List of detected object labels
          [0.89],                      % List of object existence probabilities
          [                            % List of box geometry
            [291.5, 352],              % Coordinates of the box centre
            [93, 308],                 % Minimum box size (x, y)
            [100, 314]                 % Maximum box size (x, y)
          ]),
        box(9, 19, 17, [ball], [0.99], [
            [285, 197], [46, 46], [52, 53]
          ])
      ]).

    frame(10, 1, [
        box(10, 20, 18, [person, ball], [0.81, 0.85], [
            [293, 340.5], [96, 331], [105, 340]
          ])
      ]).

Figure 4.2: An annotated example of the system’s input data.
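As an illustration of how the per-object variables could be read off input of the kind shown in Figure 4.2, the following is a minimal Python sketch. The `Box` class, its field names and the `per_object_state` function are hypothetical names introduced here, and identifying an object by its detected label is an assumption of the sketch rather than something stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """One box(...) term of the input; field names are chosen for this sketch."""
    frame: int
    box_id: int
    parent_id: int
    labels: Tuple[str, ...]           # detected object labels
    probabilities: Tuple[float, ...]  # existence probability per label
    centre: Point                     # coordinates of the box centre
    min_size: Point                   # minimum box size (x, y)
    max_size: Point                   # maximum box size (x, y)


def per_object_state(boxes: List[Box]) -> Dict[str, Point]:
    """Map each visible object to the (x, y) position of its box centre.

    An object missing from the result is treated as not visible, which
    mirrors the default-false treatment of unasserted predicates.
    """
    state: Dict[str, Point] = {}
    for box in boxes:
        for label in box.labels:
            # Objects that are part of the same box share the same position.
            state[label] = box.centre
    return state


# Frame 10 of Figure 4.2: a single box labelled with both objects.
frame_10 = [
    Box(10, 20, 18, ("person", "ball"), (0.81, 0.85),
        (293, 340.5), (96, 331), (105, 340)),
]
```

In this sketch, calling `per_object_state(frame_10)` gives both `person` and `ball` the position of the shared box centre, and any object that does not appear in the frame is simply absent from the result.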

| Fluent | Meaning |
|---|---|
| co(o₁, o₂) | o₁ is coalescent with o₂: the boxes of the two objects o₁ and o₂ are either the same or overlap to the extent that the two objects cannot be reliably distinguished. |
| extC(o₁, o₂) | o₁ is externally connected with o₂: the boxes of the two objects o₁ and o₂ are touching but do not overlap by more than a given error margin. |
| disC(o₁, o₂) | o₁ is disconnected from o₂: the boxes of the two objects o₁ and o₂ are distinctly separate. |

Table 4.1: A list of the meanings of the connectedness predicates available.

Because the Santos & Shanahan (2002) Region Connection Calculus variant assumes that regions cannot be perfectly distinguished, it implies that the boundary of a region is also uncertain. This assumption also implies that as two objects transition between one of the states and another, there is some uncertainty as to when the transition happens. The way the states are calculated takes both of these factors into account.

The coalescence fluent holds either when the two objects are contained in the same box or when the area of overlap between the two boxes, as a proportion of the smaller box area, is at or above a given threshold (τ_area). This criterion is expressed in equation 4.1. The external connection fluent holds when the overlap is below the threshold τ_area (including the case of no overlap at all) and the distance between the nearest two points on the box boundaries (calculated by the function dist(o₁, o₂)) is below a given threshold (τ_dist). This criterion is expressed in equation 4.2. The disconnection fluent holds when the distance between the nearest two points on the box boundaries is at or above τ_dist. This criterion is expressed in equation 4.3.

$$\mathrm{co}(o_1, o_2) \leftrightarrow \tau_{area} \le \frac{\mathrm{area}(o_1 \cap o_2)}{\min(\mathrm{area}(o_1), \mathrm{area}(o_2))} \tag{4.1}$$

$$\mathrm{extC}(o_1, o_2) \leftrightarrow \tau_{area} > \frac{\mathrm{area}(o_1 \cap o_2)}{\min(\mathrm{area}(o_1), \mathrm{area}(o_2))} \land \tau_{dist} > \mathrm{dist}(o_1, o_2) \tag{4.2}$$

$$\mathrm{disC}(o_1, o_2) \leftrightarrow \tau_{dist} \le \mathrm{dist}(o_1, o_2) \tag{4.3}$$

These criteria differ in some aspects from Santos & Shanahan (2002) and therefore also differ from dos Santos et al. (2009), as that paper uses the same criteria. The difference is that Santos & Shanahan define the criterion for coalescence to be when the distance between the nearest two points on the box boundaries is zero; the distance threshold (τ_dist) separating the states of external connection and disconnection remains the same. By using the overlap area percentage, the criterion used here takes into account the sizes of the objects involved; an overlap area of 10 pixels may be considered small for an object with a total area of 1000 pixels but be considered large for an object with a total area of 20 pixels. This is not taken into account by the criterion used by Santos & Shanahan (2002), which assumes that any overlap should be considered a sign of coalescence and therefore does not respect the original implication that there needs to be a margin of tolerance in the region boundaries.
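The criteria of equations 4.1 to 4.3 could be sketched in Python as follows. This is an illustration under stated assumptions, not the system's actual implementation: the `Rect` class and the function names are hypothetical, boxes are assumed to be axis-aligned with positive area, dist(o₁, o₂) is assumed to be zero whenever the boxes touch or overlap, and both thresholds are assumed to be positive so that the three fluents are mutually exclusive. The case of two objects sharing one box, which the text treats as coalescent, is not handled separately.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box given by its centre and full width and height."""
    x: float
    y: float
    width: float
    height: float

    @property
    def left(self) -> float:
        return self.x - self.width / 2

    @property
    def right(self) -> float:
        return self.x + self.width / 2

    @property
    def top(self) -> float:
        return self.y - self.height / 2

    @property
    def bottom(self) -> float:
        return self.y + self.height / 2

    @property
    def area(self) -> float:
        return self.width * self.height


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of the two boxes (0 if they do not overlap)."""
    dx = min(a.right, b.right) - max(a.left, b.left)
    dy = min(a.bottom, b.bottom) - max(a.top, b.top)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def boundary_dist(a: Rect, b: Rect) -> float:
    """Distance between the nearest points of the two boxes.

    Assumption of this sketch: the distance is 0 when the boxes touch or overlap.
    """
    gap_x = max(b.left - a.right, a.left - b.right, 0.0)
    gap_y = max(b.top - a.bottom, a.top - b.bottom, 0.0)
    return math.hypot(gap_x, gap_y)


def connectivity(a: Rect, b: Rect, tau_area: float, tau_dist: float) -> str:
    """Return 'co', 'extC' or 'disC' following equations 4.1 to 4.3."""
    ratio = overlap_area(a, b) / min(a.area, b.area)
    if tau_area <= ratio:                  # equation 4.1
        return "co"
    if tau_dist > boundary_dist(a, b):     # equation 4.2 (ratio < tau_area here)
        return "extC"
    return "disC"                          # equation 4.3


# Hypothetical threshold values, chosen only for the example.
TAU_AREA = 0.5
TAU_DIST = 5.0
```

With these example thresholds, two 10 by 10 boxes whose centres are 2 pixels apart overlap by 80% and are coalescent, the same boxes 9 pixels apart overlap by only 10% and are externally connected, and boxes separated by a gap of 20 pixels are disconnected.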

The horizontal and vertical relationships between each pair of objects refer to the positioning of the objects in relation to each other. Each relationship is separately encoded by means of three mutually exclusive fluents: left(o₁, o₂), inlineX(o₁, o₂) or left(o₂, o₁) for the horizontal relationship and above(o₁, o₂), inlineY(o₁, o₂) or above(o₂, o₁) for the vertical relationship.

Table 4.2 describes the meaning of each predicate. The first three fluents refer to the horizontal relationship and the last three refer to the vertical relationship. These fluents are similar to those proposed by Frank (1992), but instead of defining 9 states of the compass, the relations are separated into their horizontal and vertical components and only one of the two non-inline relations is defined, relying on the other being defined by transposing the object symbols. This extends the expressiveness of the positioning fluents used by dos Santos et al. (2009), which only defined the left fluent. These predicates are all calculated using the position of the centre of one box in relation to the border of the other box. For example, left(o₁, o₂) holds when the centre of the box of object o₂ is to the right of the right box edge of object o₁; inlineX(o₁, o₂) holds when the centre of the box of object o₂ is between the left and right edges of the box of object o₁; and left(o₂, o₁) holds when the centre of the box of object o₂ is to the left of the left box edge of object o₁.

| Fluent | Meaning |
|---|---|
| left(o₁, o₂) | o₁ is to the left of o₂ |
| left(o₂, o₁) | o₂ is to the left of o₁ |
| inlineX(o₁, o₂) | Both o₁ and o₂ are in-line in the x axis. Note that inlineX(o₁, o₂) = inlineX(o₂, o₁) |
| above(o₁, o₂) | o₁ is above o₂ |
| above(o₂, o₁) | o₂ is above o₁ |
| inlineY(o₁, o₂) | Both o₁ and o₂ are in-line in the y axis. Note that inlineY(o₁, o₂) = inlineY(o₂, o₁) |

Table 4.2: A list of the meanings of the horizontal and vertical object inter-relation predicates available.

Equations 4.4 to 4.9 provide definitions for all of the horizontal and vertical object relationship fluents. The function pos_x() returns the horizontal position of the centre of the object and, similarly, the function pos_y() returns the vertical position of the centre of the object.

$$\mathrm{left}(o_1, o_2) \leftrightarrow \mathrm{pos}_x(o_2) > \mathrm{pos}_x(o_1) + \frac{\mathrm{width}(o_1)}{2} \tag{4.4}$$

$$\mathrm{left}(o_2, o_1) \leftrightarrow \mathrm{pos}_x(o_2) < \mathrm{pos}_x(o_1) - \frac{\mathrm{width}(o_1)}{2} \tag{4.5}$$

$$\mathrm{inlineX}(o_1, o_2) \leftrightarrow \lnot\,\mathrm{left}(o_1, o_2) \land \lnot\,\mathrm{left}(o_2, o_1) \tag{4.6}$$

$$\mathrm{above}(o_1, o_2) \leftrightarrow \mathrm{pos}_y(o_2) > \mathrm{pos}_y(o_1) + \frac{\mathrm{height}(o_1)}{2} \tag{4.7}$$

$$\mathrm{above}(o_2, o_1) \leftrightarrow \mathrm{pos}_y(o_2) < \mathrm{pos}_y(o_1) - \frac{\mathrm{height}(o_1)}{2} \tag{4.8}$$

$$\mathrm{inlineY}(o_1, o_2) \leftrightarrow \lnot\,\mathrm{above}(o_1, o_2) \land \lnot\,\mathrm{above}(o_2, o_1) \tag{4.9}$$

To summarise this subsection, Table 4.3 lists each of the variables that are calculated to embody the state of the input frame. For each variable, it lists the variable, the possible fluents that can represent the state, and any constraints for the fluent. Note that in the list of constraints the set Objects refers to the set of all objects recognised by the system (independent of whether an object is visible at any particular time), the set ℕ₀ refers to the set of natural numbers (including zero) that can be represented by the computer and the set ℝ₀⁺ refers to the set of non-negative real numbers that can be represented by the computer.

| Variable | Fluent | Constraints |
|---|---|---|
| Object o is visible in the scene. | visible(o) | o ∈ Objects |
| Object’s x position | pos_x(o, x) | o ∈ Objects, x ∈ ℕ₀ |
| Object’s y position | pos_y(o, y) | o ∈ Objects, y ∈ ℕ₀ |

Table 4.3: A list of the variables used to represent the state of a frame of system input.
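As an illustration of equations 4.4 to 4.9, the horizontal and vertical fluents for an ordered pair of objects could be computed along the following lines. This is a sketch only: the `Rect` class and the functions `horizontal` and `vertical` are hypothetical names introduced here, and the width and height are assumed to be the full extent of each box. Equation 4.7 makes o₁ the upper object when o₂ has the larger y value, which corresponds to a y axis that grows downwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box given by its centre and full width and height."""
    x: float
    y: float
    width: float
    height: float


def horizontal(o1: Rect, o2: Rect) -> str:
    """Return the horizontal fluent that holds for the ordered pair (o1, o2)."""
    if o2.x > o1.x + o1.width / 2:      # equation 4.4
        return "left(o1, o2)"
    if o2.x < o1.x - o1.width / 2:      # equation 4.5
        return "left(o2, o1)"
    return "inlineX(o1, o2)"            # equation 4.6: neither left fluent holds


def vertical(o1: Rect, o2: Rect) -> str:
    """Return the vertical fluent that holds for the ordered pair (o1, o2)."""
    if o2.y > o1.y + o1.height / 2:     # equation 4.7
        return "above(o1, o2)"
    if o2.y < o1.y - o1.height / 2:     # equation 4.8
        return "above(o2, o1)"
    return "inlineY(o1, o2)"            # equation 4.9: neither above fluent holds
```

Because each function returns exactly one of three results, the mutual exclusivity of the fluents is built into the sketch. Note that only the width and height of o₁ enter the tests, as in the equations; the size of o₂ plays no part.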

In document The Application of Classical Conditioning to the Machine Learning of a Commonsense Knowledge of Visual Events (Page 110-114)