Modeling intent and destination prediction within a Bayesian framework: Predictive touch as a usecase

(1)

RESEARCH ARTICLE

Modeling intent and destination prediction within a Bayesian

framework: Predictive touch as a usecase

Runze Gan1,2,* , Jiaming Liang1,2 , Bashar I. Ahmad1and Simon Godsill1 1_{Engineering Department, University of Cambridge, Cambridge, United Kingdom}

2_{Equal Contribution}

*Corresponding author. E-mail:[email protected]

Received: 30 June 2020; Revised: 02 September 2020; Accepted: 04 September 2020

Keywords: Destination prediction; human–computer interaction; intent modeling; Kalman filter; object tracking; particle filter; sequential Monte Carlo; touchless interaction

Abstract

In various scenarios, the motion of a tracked object, for example, a pointing apparatus, pedestrian, animal, vehicle, and others, is driven by achieving a premeditated goal such as reaching a destination. This is albeit the various possible trajectories to this endpoint. This paper presents a generic Bayesian framework that utilizes stochastic models that can capture the influence of intent (viz., destination) on the object behavior. It leads to simple algorithms to infer, as early as possible, the intended endpoint from noisy sensory observations, with relatively low computational and training data requirements. This framework is introduced in the context of the novel predictive touch technology for intelligent user interfaces and touchless interactions. It can determine, early in the interaction task or pointing gesture, the interface item the user intends to select on the display (e.g., touchscreen) and accordingly simplify as well as expedite the selection task. This is shown to significantly improve the usability of displays in vehicles, especially under the influence of perturbations due to road and driving conditions, and enable intuitive contact-free interactions. Data collected in instrumented vehicles are shown to demonstrate the effectiveness of the proposed intent prediction approach.

Impact Statement

The presented Bayesian framework facilitates automated decision-making, resource allocation and future action planning with applications in various fields, such as in human–computer interaction (HCI), surveillance, robotics, to name a few. It led to the introduction of the patented HCI technology predictive touch, developed as part of a collaboration with Jaguar Land Rover and is set for commercialization; it won a Jaguar Land Rover TATA Innovista Award 2020 (“Dare To Try” category). Predictive touch does not only offer an intuitive approach to touchless interactions (i.e., no physical contact with the display is required), but also it can significantly improve the usability of interactive displays in vehicles or any moving platform, reduce the attention they require and enhance the input accuracy, including under the influence of perturbations due to road and driving conditions. This has been demonstrated in various on-road trials. This touchless interaction technology can have widespread applications in a post COVID-19 world by minimizing the risk of transmission of pathogens via touch surfaces, for instance, when using ticketing or self checkout machines, control panels, and interactive displays in public spaces, kiosks, or workplaces, and so on. It also offers a means to easily interact with emerging display technologies that do not have a physical surface, such as 2D/3D projections and in virtual or augmented reality, and offers additional design flexibility to support inclusive design practices.

© The Author(s), 2020. Published by Cambridge University Press. This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.

(2)

1. Introduction

In conventional sensor-level tracking, the objective is typically to estimate the hidden state xtof an object

of interest (e.g., pointing apparatus, pedestrian, vehicle, vessel, airplane, etc.), where xt is the target

location, orientation, velocity, higher order kinematics, or other spatio-temporal characteristics. This state is assumed to be related to the available noisy sensory measurements (e.g., from camera, radar, inertial measurement units, radio frequency transmissions, global navigation satellite system, acoustic signals, etc.) as per a defined observation model. Plethora of well-established algorithms for estimating x_texist, including from multiple data sources, see Bar-Shalom et al. (2011) and Haug (2012). They often implicitly assume that the object moves in an unpremeditated manner and suitable motion models are accordingly employed.

In this paper, the aim is not to estimate the state x_t, but instead to infer the underlying intent that is driving the object motion, namely its destination. This capitalizes on the premise that the target motion (e.g., the trajectory followed by a pointing finger while interacting with a display) is dictated by the intended endpoint (e.g., the sought interface item), and that the destination influence on the target movements can be modeled. Therefore, the sought probabilistic modeling and destination predictor(s) belong to a higher system level compared with the sensor-level tracking techniques, hence dubbed meta-level tracking algorithms. They have several applications, such as in surveillance, human–computer interaction (HCI), robotics, and others, since such meta-level approaches can facilitate automated decision-making, resources allocation and informed future action planning. They offer a more integrated viewpoint of a scene where intents can be automatically learnt and conflict or opportunities can be identified in a timely manner. The HCI technology, dubbed predictive touch, is used here as an application or motivation for the proposed Bayesian meta-level inference framework. Nonetheless, this approach can be applied in numerous other areas and scenarios.

1.1. Predictive touch

Predictive touch is an emerging HCI technology for intelligent displays and touchless interactions that can predict the interface component the user intends to select (e.g., a selectable graphical user interface [GUI] displayed on a touch screen), notably early in the pointing-selection task (Ahmad et al.,2017). This is based on the available freehand pointing movements in 3D, for example, provided by gesture trackers, and potentially other available sensory data such as eye-gaze. The pointing-selection task is then simplified and expedited by the predictive touch solution via applying a suitable selection facilitation scheme. This can significantly reduce the effort and distractions associated with using in-vehicle displays while driving (Jæger et al., 2008), including under the influence of perturbations, for example, vibrations and accelerations due to the road and driving conditions. Such perturbations can have a detrimental impact on the usability of displays in moving platforms, such as in-vehicle touch screens (Goode et al.,2012; Ahmad et al.2015), which often act as the gateway to control in-vehicle infotainment systems. For instance, pointing time can be reduced by over 30% and effort/workload halved with predictive touch, see Ahmad et al. (2017). It is noted that gesture trackers are increasingly becoming commonplace in automotive, gaming, infotainment applications in general and more recently in smartphones, see Quinn et al. (2019), due to recent advancements in sensing and computer-vision systems. Thus, predictive touch system typically assumes the presence of a gesture tracker (including integrated into the display, e.g., computer-vision solution with several built-in cameras on a touch screen), which it can utilize.

Figure 1depicts the system block diagram which comprises of the following four main modules: • Pointing gesture tracker: provides, in real-time, the pointing hand/finger(s) location in 3D, for

example, y0:nis the partial (filtered) pointing trajectory pertaining to the time instants tf1, t2,…,tng at

time tn.

• Intent predictor: for a set of ND selectable interface icons, fDi: i ¼ 1,2,…,NDg, this module

(3)

• Selection facilitation: based on the prediction results, the system simplifies-expedites the selection task. Various such facilitation schemes can be applied (e.g., expand or highlight/fade or drag the item closer to the pointing location, etc.) and were the subject of the studies in Ahmad et al. (2019a) for automotive applications. It was reported that the system autonomously selecting the predicted GUI item on behalf of the user, thus immediate mid-air selection, is an effective facilitation scheme leading to touchless or contact-free interactions.

• Additional data: available additional sensory data, such as inertial (accelerometer/gyroscope), eye-gaze measurements, environmental data can be utilized to improve the prediction results. For instance, vehicle CAN-bus data (e.g., suspensions and speed signals) can indicate the level of experienced perturbations due to road-driving conditions.

Therefore, it is software-based touchless technology where the user does not need to physically touch a display to select an interface component. Predictive touch can not only improve the usability and performance of interactive displays, but it also provides the means to interact with new display technologies that do not have a physical surface such as head-up displays, holograms and 3D projections (Bark et al.,2014; Broy et al.2015). This novel HCI solution uses the intuitive free hand pointing gestures and intrinsically relies on predicting the user intent, rather than using the pointing finger/arm location or orientation as a pointing apparatus as in Roider and Gross (2018). Thereby, predictive touch is not a mid-air pointing or ray-casting approach (Plaumann et al.,2018), and it is fundamentally distinct from gesture-recognition-based interactions that require the user to pre-learn particular“symbolic” gesture shapes to trigger certain interface responses (May et al.2017). It also offers several design flexibility in terms of the display placement and GUI design which is otherwise limited by the reach and motor capabilities of the user. This can promote inclusive design practices by tailoring the display operation to the user require-ments via configuring the prediction algorithms and facilitation schemes.

Figure 1. Block diagram of an in-vehicle predictive touch system. The dotted line is a recorded full in-car pointing trajectory.The gesture tracker (sensor is facing downwards to increase the region of coverage and minimize occlusions) provides at time tnthe pointing finger/hand Cartesian

(4)

1.2. Related work and contributions

The Bayesian framework for intent prediction presented in this article was introduced in Ahmad et al. (2016b) and Ahmad et al. (2018) for predictive touch and other applications; see Ahmad et al. (2019b) for a short overview. It treats the problem within an object tracking formulation, albeit not necessarily seeking state estimation, such that the influence of intended destination is captured by utilizing suitable stochastic motion model with a few unknown parameters. The latter parameters can be estimated from a small number of example motion patterns or trajectories. Linear Time-Invariant Gaussian systems were considered in the aforementioned papers and more recently nonlinear behavior due to external forces (e.g., jumps and jolts in the pointing movements due to the road/driving conditions) was briefly addressed in Gan et al. (2019). Here and compared with previous work, we

1. present an overview and unified treatment of the intent prediction task for linear as well as nonlinear (albeit within a conditionally linear formulation) motion models and systems,

2. propose a new approach to the bridging distributions (BD) class of intent-driven models, which have a moderate computational requirement and a clear stochastic interpretation. In this context, the previously unconsidered bridged (nearly) constant acceleration dynamic model is shown to deliver the highest prediction performance for a predictive touch system, and

3. benchmark various prediction models using significantly larger data set of pointing gestures recorded in instrumented vehicles under various road-driving conditions.,

In the tracking area, incorporating known predictive information to improve the accuracy of state estimation has a long history, for example Castanon et al. (1985) and Baccarelli and Cusani (1998). Additionally, mean-driven models such as those derived from an Ornstein–Uhlenbeck (OU) process, with known means, were to better estimate behavior of certain objects, for example vessel Millefiori et al. (2016) or financial time series data in Christensen et al. (2012). Also, the use of stochastic context-free grammar (SCFG) and conditionally Markov process/reciprocal process has been pro-posed to predict intent as in Fanaswala and Krishnamurthy (2015) and Rezaie and Li (2019a,b). In this paper, the destination (i.e., intent) is assumed to be unknown and predictors are developed to infer it. The adopted formulation here leads to significantly simpler algorithms with no constraints on the trajectory followed by the object (e.g., freehand pointing finger), unlike those using SCGF which discretizes the state space. The employed continuous state space models within the introduced Bayesian framework, such as OU-type process and bridging distributions (both are detailed in the next Section), enable treating asynchronous sensory measurements. A noteworthy fact is that the bridging distribution can be viewed as a special case of conditionally Markov models in Rezaie and Li (2019a) under certain assumptions.

On the other hand, modeling and inferring complex intentions, such as drivers behaviors at junctions, pedestrians at crosswalks, and human daily activities, can be tackled with data-driven or classification approaches, possibly combined with an a priori learnt pattern of life. They assume the availability of sufficiently complete and diverse training data sets with several well-established such prediction techniques, for example Bando et al. (2013), Völz et al. (2018), and Gaurav and Ziebart (2019). However, in this paper, the objective is to develop a simple and computationally efficient destination prediction algorithm where limited training data are available. For example, it can be very challenging and expensive to collect data sets of 3D freehand pointing gestures that sufficiently sample possible paths/trajectories to the display, starting locations of the gesture (e.g., steering wheel, armrest and others), road/driving conditions, context of use, user interface design, screen size/reach, etc. Instead, suitable state space models are employed here, albeit with a few unknown parameters, as is common in object tracking. They enable modeling and robustly inferring the intended endpoint of a tracked object, especially that the possible intentions are a finite set of nominal destinations, for example selectable interface items. Subsequently, the introduced Bayesian intent predictors have minimal training requirements.

(5)

1.3. Paper layout

The remainder of the paper is organized as follows. The overall inference framework, various approaches to modeling intent, and the system model are described inSection 2. Destination predictors for linear and nonlinear settings are then outlined inSection 3. Results using real pointing data, recorded by in-vehicle predictive touch prototypes under various road conditions, are presented inSection 4, and conclusions are drawn inSection 5.

2. Bayesian Framework: Modeling Intent and Overall System

Here, the destination inference problem is treated within a Bayesian framework. Let D ¼

Di: i ¼ 1,2,…,ND

f g be the set of ND nominal endpoints (e.g., selectable on-display interface icons)

of a tracked object (e.g., a pointing finger-tip). The objective is to sequentially calculate the probability of each destination (i.e., selectable interface components)D_i∈D being the intended endpoint at the current/ latest time instant tn, thus p D ¼ Dð ijy0:nÞ, i ¼ 1,2,…,ND, from the available sensory measurements

y₀_:n¼ yf ₀, y₁,…,y_ng. We recall that in a predictive touch system observations y₀_:nare provided by the gesture tracker and other sensors at the successive time instants tf0, t1,…,tng, for instance, ynis the 3D

Cartesian coordinates of the pointing finger/hand at tnas inFigure 1. For eachDi∈D and per Bayes’ rule,

we have

p D ¼ Dð ijy0:nÞ∝p yð 0:njD ¼ DiÞp D ¼ Dð iÞ, i ¼ 1,…,ND, (1)

where p D ¼ Dð iÞ is the prior on the ith possible destination. In predictive touch this prior can be attained

from semantic data, frequency of use, interface design, other sensory data, etc. The task of the inference module (i.e., intent predictor inFigure 1) at tn is hence to estimate the likelihoods p yð 0:njD ¼ DiÞ,

i ¼ 1,2,…,ND. This makes the Bayesian formulation particularly appealing since additional contextual

information can be easily incorporated, whenever available.

2.1. Destination-driven motion models

A key challenge within the introduced Bayesian approach is employing suitable motion models that represent the influence of intent on the object motion and devising inference algorithms to reveal it. The object motion (e.g., pointing gesture movement) towards an intended item on a display is not deterministic or necessarily takes the shortest path to the endpoint. This is because this movement is driven by a very complex sensorimotor system, capable of autonomous action based on various modalities (e.g., vision and can utilize feedback on the action) and is also subjected to various constraints (e.g., to optimize action required to deliver/predict smooth movement trajectories and minimize the variance of the eye or arm’s position, in the presence of biological noise due to mechanical properties of muscles) and possibly perturbed by external forces such as due to road/driving conditions or walking, see Harris and Wolpert (1998). Thereby, models of such motion are intrinsically uncertain and any prediction of the object movements at a future time instant should not be a single point following a particular deterministic path. Instead, it should be expressed as a probability distribution in space.

Stochastic processes can adequately capture the aforementioned motion uncertainties, where state x_n (e.g., pointing finger true position in 3D) at tn is related to its position at the previous time step tn1,

according to a given probability distribution defined by the following evolution of the state over time

x_n,i¼ f_i,hðx_n1,iÞ þ ε_n1 (2)

where f_i,hð Þ is the state transition function between t: _n1and tnand h ¼ tn tn1. Here, this function can be

nonlinear and it is assumed to be dependent on the intended endpointD_i; thus the subscript index i. Whereas,ε_n1is the process noise, which is often assumed to be independently and identically distributed (i.i.d) and represents the uncertainty in motion. For example, a zero-mean Gaussian process noise with covariance Q_i,hand a linear time-invariant transition function, for example x_n,i¼ F_i,hx_n1,iþ μ_i,hþ ε_n1, lead to a transition density of the state at tndescribed by a multivariate Gaussian distribution. It is given by:

(6)

p xð n,ijxn1,iÞ ¼ N xn,ijFi,hxn1,iþ μi,h, Qi,h

where its mean is dependent on the previous position x_n1,i, input term μ_i,h and covariance Q_i,h. The latter represents the potential level of uncertainty between successive movements.

2.1.1. Linear Gaussian motion models

Approximate motion models that enable inferring intent, that is not necessarily the exact modeling of the object motion, can suffice for the task of destination prediction. Under this assertion, Gaussian Linear Time Invariant (LTI) models can be particularly favorable since they can be easily formulated and lead to computationally efficient prediction algorithms, compared with nonlinear non-Gaussian models (Godsill,

2007; Haug,2012). Next, two classes of Gaussian LTI intent-driven models, namely mean-reverting and bridging distributions, are introduced.

2.1.1.1. Linear Gaussian mean reverting models. The OU process with mean reverting property offers an effective way to model the destination-driven behavior. By setting the mean term of the underlying model according to the destination information, the target would revert to the premeditated endpoint and finally arrive somewhere nearby. Denote the continuous-time destination dependent target state as vector x_t,i, then the OU-based models can be described in continuous time by the following stochastic differential equation (SDE),

dxt,i¼ A μð i xt,iÞdt þ σdβt, (3)

whereβ_tis a multivariate standard Wiener process. For a 3D pointing movement, x_t,i¼ x 0_t,i,1, x0_t,i,2, x0_t,i,30, with x_t,i,s∈ℝ2(position and velocity) orℝ3(position, velocity and acceleration), s ¼ 1, 2,3f g, and

A¼ diag Af 1, A2, A3g, μ_i¼ μ_i,1,μ_i,2,μ_i,3

0 , σ ¼ σ1 0 0 0 σ2 0 0 0 σ3 2 6 4 3 7 5,βt¼ βt,1,βt,2,βt,3 0 : (4)

Different orders of kinematics included in each“substate” x_t,i,salong with the corresponding parameters lead to distinct SDEs as per (3), for instance: (a) the mean reverting diffusion (MRD) model which only includes position in the state (Ahmad et al.,2016b), (b) equilibrium reverting velocity (ERV) that model position and velocity (Ahmad et al.,2016b), and (c) equilibrium reverting acceleration (ERA) represent-ing position, velocity, and acceleration (Gan et al.,2019). These three models have similar mean reverting behavior, that is, the state will revert to the mean termμ_i, for example set as the destination position for MRD and with (nearly) zero velocity and acceleration for ERV and ERA, respectively.

Here we only discuss the set up for ERA model for simplicity, while other models follow the similar rationale, refer to Ahmad et al. (2016b) for further details. For ERA, the submatrices and vectors in

Equation (4)for the sth dimension are

A_s¼ 0 1 0 0 0 1 η ρ γ 2 6 4 3 7 5, μi,s¼ pi,s, 0, 0

, x_t,i,s¼ x½ _t,i,s,_x_t,i,s,€x_t,i,s0, σ_s¼ 0,0,σ½ 0, (5)

where p_i,sis the position of destinationD_iin the sth dimension, and _xt,i,s,€xt,i,sdenote the second and third

derivative (velocity and acceleration) of xt,i,s. The above setup assumes independent transitions for each

coordinate, specifically, it can be specified by the following SDE, d€xt,i,s¼ η pi,s xt,i,s

dt ρ_xt,i,sdt γ€xt,i,sdtþ σdβt,s: (6)

One can see that the object motion governed by such an SDE will initially gravitate to the destination position (i.e., p_i,sprescribed in the mean vectorμ_i,sof this OU process) with increasing acceleration due to the positive reversion factorη, then the positive damping factor ρ and γ would guarantee the target slows

(7)

down and arrives the destination in an equilibrium state, with nearly zero velocity and acceleration. This velocity behavior can be demonstrated as the blue line inFigure 2a, which is the deterministic transition (i.e., withσ as zero) of the norm velocity of the ERA model. The norm velocities of an ERA model depicted inFigure 2a, that is sample realizations as well as their mean, are generated from the parameters manually tuned to maximize the intent prediction accuracy. They noticeably capture, on average, an overall profile similar to that exhibited by the real pointing gesture data shown in Figure 2b.

Solving (3) yields the general discrete LTI transition function for all three models (MRD, ERV and ERA) as per,

p xð nþ1,ijxn,iÞ ¼ N xnþ1,ijFi,tnþ1tnxn,iþ Mi,tnþ1tn, Qi,tnþ1tn

(7) such as F_i,t_nþ1_t_n ¼ eA tðnþ1tnÞ_, M_i,t_nþ1_t_n ¼ I eA tðnþ1tnÞ μ_i, Q_i,t nþ1tn ¼ ð_t_nþ1_t_n 0 eA tðnþ1tnvÞσσ0_eA0ðtnþ1tnvÞdv: (8)

Whereas, A, μ_i, and σ are parameters set for the specific model, I is the identity matrix with the corresponding size. The derivation of this solution and calculation for Q_i,h can be found in Ahmad et al. (2016b) and references therein. Note that the x_t,i in such models is constructed to revert to the destinationD_i, and thus the transition function (7) can be equivalently described as the destination-conditioned transition density, that is, p xð nþ1jxn,DiÞ ¼ p xð nþ1,ijxn,iÞ where xndescribes the general state

(without conditioned-destination information), and the conditionD_i can be further introduced by the destination reverting construction.

2.1.1.2. Bridging distributions. While the destination information is modeled above by the mean of the OU process, another approach to incorporate such knowledge can be provided by the bridging distribu-tions method. This is particularly relevant if we use a known or legacy motion model, which does not encapsulate the influence of intent on the object motion as with numerous models in the tracking literature, for instance the nearly constant velocity (CV) and acceleration (CA) models; see Li and Jilkov (2003) for a

(a) ERA Velocity Samples (b) Real Pointing Data

Figure 2. The 3D norm velocity profile generated by the equilibrium reverting acceleration (ERA) model is shown in (a), where the black lines are 100 random realizations, the red line is the mean of them,

and the blue line shows the deterministic transition of the norm velocity of the same ERA model. (b) Shows the velocity profile from 95 real pointing data, where the red line is the mean trajectory.

(8)

comprehensive overview. Additionally, in some scenarios, an OU process might not accurately charac-terize the destination reverting behavior of the tracked object. In such cases, BD permits more free underlying motion dynamics, and at the same time, ensures the object arrival at/near its endpoint.

Bridging distributions capture the destination influence on the target behavior by constructing a Markov bridge between the intended endpoint and the target current state at tn. This capitalizes on the

premise that the trajectory followed by the object (e.g., pointing finger) must terminate at the endpoint (on-display selectable interface item), at arrival timeT , despite the random behavior between the current time step tn andT . BD accordingly introduces this knowledge into a motion model via a prior and

facilitates destination-aware behavior modeling without requiring the development of specialized sto-chastic processes that are intrinsically intent-driven. Nonetheless, BD may be applied to OU-type models for means dictated by a destination or not, for endpoint-driven OU process BD can reduce their sensitivity to parameterization as discussed in Ahmad et al. (2018).

Assuming that the target will reach the destination at time tN¼ T , a terminal state is defined as xN. A

bridged state transition distribution in a Markovian system, which conditions on the destination and the arrival time, can be expressed as the conditional distribution p xð njxn1,Di,TÞ. There exists several ways

of finding this conditional density and they may differ based on the made assumption(s). For example, Ahmad et al. (2018) assumes the terminal state xNhas exactly the same position as the destinationDi, and

the destination-related information is introduced via a Gaussian prior at t0, p xð NjDi,TÞ ¼ N xð Njai,ΣiÞ

with a_ibeing the mean,Σ_ithe covariance matrix and i ¼ 1,2,…,ND. This covariance can model the

size-orientation of the endpoint and hence with BD destinations can be regions rather than single spatial points as with OU-type models. Based on this assumption, the sought transition density p xð njxn1,Di,TÞ is a

Markov transition density for the current state (x_n), conditioning on its terminal state (x_N), i.e.,

p xð njxn1, xN,TÞ∝p xð njxn1Þp xð Njxn,TÞ: (9)

Given the fact that the terminal state xNis fixed, one can construct a joint state vector zn¼ x½ n, xN0and

obtain the transition density for z_n accordingly. The joint state transition will ultimately lead x_n to its terminal state x_Nwhich follows the prior p xð NjDi,TÞ. When observations are available, such a

construc-tion of z_npermits a joint estimation on destination and kinematic state.

An alternative formulation of BD can be found in Liang et al. (2019), in which the destination information is interpreted as a“pseudo-observation” instead of as a state prior. Specifically, a linear and Gaussian pseudo-observation model,

p eyi_N¼ a_ijx_N¼ N a_ijeGx_N,Σ_i

, (10)

was considered with eGbeing the mapping matrix. It was shown in Liang et al. (2019), Algorithm 2, that this interpretation leads to the following destination-conditioned state transition density,

p xð njxn1,Di,TÞ∝

ð

p eyi_N¼ a_ijx_Np xð Njxn,TÞp xð njxn1ÞdxN, (11)

where the Markovian assumption is preserved between the terminal state and the initial state.

Motivated by the pseudo-observation–based formulation of BD, in this paper we introduce a new intent prediction algorithm which utilizes (11) as its main ingredient. Similar to Ahmad et al. (2018) and Liang et al. (2019), we will focus on linear Gaussian models because they lead to analytically tractable results. First, consider the following LTI SDE, where x_t¼ x½ _t,_x_t,€x_t, y_t,_y_t,€y_t, zt,_zt,€zt0has the same physical

meaning as inSection 2.1.1(i.e., position, velocity and acceleration in 3D Cartesian coordinates),

dxt¼ Axtdtþ σdβt, (12)

with A_s¼ 0,1,0;0,0,1;0,0,0½ (see againSection 2.1.1 for further details related to the noise compo-nents). It can be shown that the transition density resulting from this SDE is of the form:

(9)

with F_hbeing the state transition matrix, Q_hthe process noise covariance and h ¼ tn tn1. In comparison

to (7), this transition density has no dependency on a destination. When the process noise level is relatively low, (13) corresponds to the nearly CA model (also known as the Wiener-process acceleration model). Substituting (13) into (11) yields

p xð njxn1,Di,TÞ∝ ð N aijeGxN,Σi N xNjFT tnxn, QT tn N xð njFhxn1, QhÞdxN, ∝ N aijeGFT tnxn, eGQT tnGe 0 þ Σi N xð njFhxn1, QhÞ

¼ N xnjFi,hxn1þ Mi,h, Qi,h

,

(14)

where

F_i,h¼ Q_i,hQ1_h F_h, M_i,h¼ Q_i,hLa_i, Q_i,h¼ Q1_h þ LeGF_{T t}_n

1 , L¼ eGF_{T t}_n 0 e GQ_{T t}_nGe0þ Σ_i 1 : Here (14) serves as the transition density for the pseudo-observation process, that is satisfies p xð njDi,TÞ,

based on which the state will evolve under the guidance of destination information.Figure 3gives an example of the marginal distributions obtained according to the above pseudo-observation based process where the influence of the endpoint on the state distribution over time is evident. It can also be shown that the limiting distribution, lim_t_N_{¼T !∞}p xð NjDi,TÞ, of a state process having (14) as its transition density

equates toN að _i,Σ_iÞ when eG¼ I. Moreover, setting eG¼ I and Σ_i¼ 0 produces the same state transition density as (9), namely a canonical Gaussian bridge (Gasbarra et al.,2007) terminated at a certain state, with the fact that the endpoint x_N is certain. It should be stressed that the form of mapping matrix eG depends on what destination-related information is available at hand and thus it is not necessarily equal to an identity matrix; any such matrix is included in (14).

The state transition distributions inEquations (9) and (11)build the destination knowledge into the state dynamics and thus form the basis of BD-based destination-driven (or destination-constrained) motion models. For all nominal destinationsD_i∈D, N_Dsuch bridges are constructed, one per endpoint. In scenarios where we want the terminal state x_Nat tNas well as xnat the current time step tnto be jointly

estimated, the transition model prescribed by (9) may be utilized. However, if the main objective is to predict the intended destination as in this paper with available information on the nominal endpoints (e.g., a certain region/area represented by an ellipsoidal shape), (11) can be used to construct a computationally efficient predictor since the hidden state dimension in this case is less than that of the joint estimation scheme (i.e., includes xN). InSection 3.1.2, we present a new intent predictor based on the

destination-constrained prior as with (11). In comparison to Ahmad et al. (2016b), the new predictor requires less computations as it does not estimate the terminal state at tN. It is constructed using pseudo-observation and

therefore the underlying state process is still a Markov process. It also differs from the pseudo-observation

Figure 3. 1D (alongx-axis in a Cartesian coordinate) distributions of a pseudo-observation based bridging distributions-constant acceleration process (with eG ¼ I, Σi= 0); in this case, the distribution at

the endpoint (asterisk) reduces to p xð NjDi,TÞ ¼ δaið Þ with δxN ð Þ being the Dirac delta function. From left to right: (a). p xð njDi,TÞ; (b). p _xð njDi,TÞ; (c). p €xð njDi,TÞ; and (d). velocity norm. Horizontal axes are

(10)

based intent predictors presented in Liang et al. (2019) in that it utilizes a destination-constrained state transition density throughout the filtering procedure (although this implies a slightly higher computational burden). Finally, a pseudo-measurement technique for jointly estimating the object state and its destina-tion is presented in Zhou et al. (2020) based on a linear equality constraint. It dictates that the object follows some straight line to its intended endpoint. Although this simplifies the inference procedure as the condition of arrival time is avoided, it does not capture realistic motion behavior of several objects of interest (e.g., constraint-free pointing motion in 3D). On the contrary, the presented stochastic modeling is general and does not impose such restrictive constraints on the target trajectory.

2.1.2. Nonlinear motion models: conditionally linear Gaussian settings

The computationally efficient Gaussian model assumes that the change in the object motion (i.e., pointing movements) in any time interval always follows a Gaussian distribution. However, for some irregular movements which cause rapid spatial changes (e.g., jolts in the pointing motion due to perturbations or any external nonintent-driven force), such an assumption is unsuitable and can lead to large inference errors. In order to model such erratic perturbations-induced maneuvers, we introduce a pure jump process to the original (destination-aware) Gaussian processes. Such formulations are known as jump diffusion models or Markov/semi-Markov jump models.

The adopted jump diffusion models retain the Brownian motion as one of the driven noise, and thus they can be considered as a conditionally linear Gaussian system. In particular, when the non-Gaussian pure jump process is given as a condition, the dynamics can be constructed in a standard Gaussian form to ensure computational efficiency.

Such approaches have been extensively adopted in financial modeling to describe the discrete movements (Kou2002), and in object tracking field to capture sudden maneuvers undertaken by the target or induced by external forces (Godsill, 2007). Owing to the clear physical representation and computation tractability, such jump diffusion dynamical models have also been employed in Ahmad et al. (2014) and Gan et al. (2019) within a predictive touch system under high levels of perturbations due to road-driving conditions. The approach presented in Ahmad et al. (2014) embedded a self-decay jump process within a Gaussian process to pre-process the highly-perturbed pointing data, with the aim to obtain a smoothed trajectory for the later intent inference task, whereas Gan et al. (2019) introduces a jump diffusion model for a unified scheme for destination and state estimation. In this paper, we mainly discuss the latter recent work given its improved performance.

Since the target motion (e.g., pointing gesture movements) impacted by severe external perturbations or fast maneuvering is still destination reverting, we consider the jump diffusion model based on the following linear mean reverting SDE (3)

dxt,i¼ A μð i xt,iÞdt þ σdβtþ BdJt, (15)

where most parameters have the same definition as described in the previous sections. If we assume that the jumps only occur at key driving elements of the state (e.g., position for MRD, or acceleration in ERA), the parameter B¼ diag Bf 1, B2, B3g (for 3D movements) such that Bs¼ 0,0,1½ 0ðs ¼ 1, 2,3Þ for ERA and

0, 1

½ 0 for ERV. The multivariate jump process J_t here is a compound Poisson process with Gaussian distributed jump size. Specifically, we have J_t¼P_τ

k<tSk, with the jump size Sk∈ℝ

3_{and S}

k Sð kjμJ,ΣJÞ.

Note that if isotropic distributed jump (i.e., the jump on each direction of the space are identically distributed) is considered, the parameters can be simplified asμ_J¼ 0 and Σ_J¼ σ2

JI, whereσJis defined

as the standard deviation of the jump size in any dimension. The jump time τ_k which follows the Poisson process has the property thatτ_k τ_k1 exp_λ

Jð Þ, where λ

1

J is the mean value of the jump

interarrival time.

Solving SDE (15) yields the transition density as follows,

p xð nþ1,ijxn,i,τn:nþ1Þ ¼ N xnþ1,ijμ∗nþ1,Σ∗nþ1

(11)

with μ∗ nþ1¼ Fi,tnþ1tnxtþ Mi,tnþ1tnþ X tn<τk≤tnþ1 F_i,t_nþ1_τ_kBμ_J, (17) Σ∗ nþ1¼ Qi,tnþ1tnþ X tn<τk≤tnþ1 F_i,t_nþ1_τ_kBΣ_JB0F0_i,t_nþ1_τ_k0, (18) where F,M and Q have been defined in (8), and jump time sequenceτ_n:nþ1consists of all jump times that occurred in the interval tðn, tnþ1, that is τn:nþ1¼

S

tn<τk≤tnþ1τk. 2.1.3. Observation model

The available sensory measurement y_n(e.g., gesture-tracker output) is a noisy observation of the true hidden state x_n(e.g., pointing finger actual location). In a state space form, it is described at time tnby

y_n¼ h_nðx_n,iÞ þ w_n, (19)

where h_nð Þ is the mapping from the hidden state to the observed measurement(s) and w: _nis the measurement noise. Here and for simplicity, a linear and Gaussian measurement model can be assumed such that y_n¼ Hx_n,iþ w_n, with zero mean i.i.d Gaussian noise where w_n N 0,Vð _nÞ. For instance, if gesture tracker provides locations of the pointing finger in 3D and latent state x_n,i∈ℝ3_{consists of the object location, the}

mapping measurement matrix in (19) is a 3 3 identity matrix, H ¼ I3. The noise covariance matrix Vnis

specified by the tracker accuracy, that is in terms of determining the pointing finger position.

The overall system is described by the motion and observation models in (2) and (19), respectively. Next, we introduce various destination inference algorithms to estimate the sought probabilities p D ¼ Dð ijy1:nÞ,Di∈D. As shown below, the intent inference routine complexity is dependent on the

employed motion model. For instance, a Gaussian LTI set-up leads to a simple and computationally efficient Kalman-filer-based predictor for the destination inference task.

3. Destination Prediction

Recall from (1) that the key to sequentially infer the probability of the destinationD_ibeing the intended one is to estimate the likelihood p yð 0:njD ¼ DiÞ. Furthermore, this likelihood can be recursively expanded

according to prediction error decomposition (PED; Harvey,1990) given by

p yð 0:njDiÞ ¼ p yð njy0:n1,DiÞp yð 0:n1jDiÞ, (20)

where we have abbreviated the conditionD ¼ D_iasD_ihenceforth to simplify notation. This sequential likelihood estimation serves as the basis of online Bayesian intent predictor as it only requires the evaluation of predictive likelihood p yð njy0:n1,DiÞ at each time instant. In this section, we discuss the

strategy to compute this predictive likelihood for the various models introduced inSection 2.

3.1. LTI Gaussian systems

The destination reverting models inSection 2.1.1are devised in a Gaussian LTI form, which leads to linear Gaussian transition densities. Meanwhile, a linear Gaussian observation model (e.g., for an off-the-shelf gesture tracker) is assumed for (19). The standard Kalman filter is then sufficient to carry out the recursive filtering for intent inference, namely to produce the (optimal in the mean least squares error sense) PED (Haug,2012), rather than the conventional state estimation task as shown next.

3.1.1. OU-based intent predictors

Recall that the destination conditioned transition function for OU-based model, for example in (7), and the adopted observation function are both linear Gaussian, the estimated target state can thus be explicitly described by a normal distribution. Specifically,

(12)

p xð njy1:n,DiÞ ¼ N xnjμn∣n, Cn∣n , (21) p xð njy1:n1,DiÞ ¼ N xnjμn∣n1, Cn∣n1 : (22)

The predictive likelihood can be computed as follows, p yð _njy1:n1,DiÞ ¼

ð

p yð _njxnÞp xð njy1:n1,DiÞdxn, (23)

and this leads to a Gaussian density description p yð _njy1:n1,DiÞ ¼ N ynjμyn, Cyn

, where

μyn¼ Hμn∣n1, (24)

C_y_n¼ HC_n∣n1H0þ V_n: (25)

To computeμ_n∣n1and C_n∣n1at each time step, the standard Kalman filter is required to estimate the state recursively, summarized as follows,

p xð njy1:n1,DiÞ ¼

ð

p xð n1jy1:n1,DiÞp xð njxn1,DiÞdxn1, (26)

p xð njy1:n,DiÞ∝p yð njxnÞp xð njy1:n1,DiÞ: (27)

The corresponding matrix description is

μ_n∣n1¼ F_i,hμ_n1∣n1þ M_i,h, (28)

C_n∣n1¼ F_i,hC_n1∣n1F0_i,hþ Q_i,h, (29)

K¼ C_n∣n1H0 Cy_n 1 , (30) μ_n∣n¼ μ_n∣n1þ K y_n μ_y n , (31) C_n∣n¼ I KHð ÞC_n∣n1: (32)

The above equations specify the the computation of predictive likelihood for a single time step, the likelihood for each destination being the intended one can then be evaluated with (1) and (20).

3.1.2. BD-based intent predictor using pseudo-observation formulation

In principle, BD-based intent predictors, including those in Ahmad et al. (2018) and Liang et al. (2019) and the new approach introduced here, all utilize (1) and (20) for inferring the target destination from the available noisy sensory observations. However, for the BD approach proposed here, we have p yð 0:njDi,TÞ ¼ p yð 0:njΘi,TÞ with Θi containing destination-specific parameters (here, ai and Σi) and

T being the arrival time at the destination. As the likelihood term is further conditioning on an unknown arrival timeT , Equation (20) needs to be revised as follows:

p yð 0:njDi,TÞ ¼ p yð njy0:n1,Di,TÞp yð 0:n1jDi,TÞ, (33)

based on which p yð 0:njDiÞ can be obtained via

p yð 0_:njDiÞ ¼

ð

p yð 0_:njDi,TÞp T jDð iÞdT , (34)

where p T jDð iÞ is the prior distribution on the unknown arrival time. In general, the above integration is

(13)

since the arrival time is a one-dimensional quantity (and thereby the integral). In this paper, we will adopt the same quadrature approximation scheme as in Ahmad et al. (2018) for obtaining (34).

Henceforth, the aim is to compute the arrival-time-conditioned PED and likelihood (i.e., the unknown arrival time is treated as if it is available). We illustrate how to develop an intent predictor based on the destination-constrained state process defined inSection 2.1.1. Given observations up to tn, the

T -conditioned likelihood term of interest can be expressed by p yð _njy0:n1,Di,TÞ ¼

ð

p yð _njxnÞp xð njxn1,Di,TÞp xð n1jy0:n1,Di,TÞdxn1dxn, (35)

where the first component in the integral is the observation density, the second component is the destination-constrained state transition density as defined in (11) and the last component is a filtering distribution obtained at time tn1. Next, we outline how to calculate p yð njy0_:n1,Di,TÞ at each time step

for a linear and Gaussian dynamic system. For simplicity and without loss of generality, we use the same state model as with (13) with destination information incorporated via (10). This implies the availability of the destination-conditioned state transition density inEquation (14). With a linear Gaussian observation model, we have

p yð _njxnÞ ¼ N yð njHxn, VnÞ, (36)

where H is the observation matrix and V_nis the measurement noise covariance matrix. As a result, the filtering distribution p xð n1jy0:n1,Di,TÞ at the previous time step tn1can be obtained using a standard

Kalman filter in which (14) is used as the state transition density. Assuming at tnwe have obtained the

filtering distribution given by the Kalman filter associated withD_ifrom the last time step tn1as

p xð n1jy0:n1,Di,TÞ ¼ N xn1jμin1∣n1, Cin1∣n1

, (37)

withμi

n1∣n1and Cin1∣n1being the mean and covariance respectively, and substituting (36), (14) and

(37) into (35), the sought likelihood can be shown to be p yð _njy0:n1,Di,TÞ ¼

ð

N yð _njHx_n, V_nÞN x _njF_i,hx_n1þ M_i,h, Q_i,h N xn1jμin1∣n1, Cin1∣n1

dxn1dxn

¼ N y_njH Fi,hμin1∣n1þ Mi,h

, H F_i,hCi_n1∣n1F0_i,hþ Q_i,h

H0þ V_n

:

(38)

The above calculation can be further simplified by noticing that μi

n∣n1 ¼ Fi,hμin1∣n1þ Mi,h

Ci_n∣n1 ¼ F_i,hCi_n1∣n1F0_i,hþ Q_i,h (39) are actually the mean and covariance of the intermediate distribution p xð njy0_:n1,Di,TÞ ¼

N xn1jμin∣n1, Cin∣n1

obtained at the Kalman prediction step. As a result, there is no need to re-calculate these two quantities twice.

Combining (33), (38), and (34), p yð 0:njDiÞ can be evaluated sequentially when new measurements

become available. To complete the intent prediction algorithm, the above calculation needs to be performed for each destinationD_i∈D. Furthermore, when a quadrature approximation scheme is used, (38) needs to be evaluated at each quadrature point ofT ¼ fT_q, q ¼ 1,2,…,NT}. A detailed

implemen-tation note is summarized in Algorithm I. It is noted that a guidance on the choice number of quadrature points for BD methods can be found in Ahmad et al. (2018).

(14)

Algorithm I BD-based Intent Predictor

Input: Observations: yf 0_:Ng, Pseudo-observations: af i,Σig1≤i≤N_D,T ¼ Tq

1_≤q≤N_T;

Initialization:ND NT Kalman filters, each initialized with meanμi,q_1∣1and covariance Ci,q_1∣1.

forn ¼ 0 : N do ⊳ For each time instant

forD_i∈D do ⊳ For each destination

forT_q∈T do ⊳ For each quadrature point

Construct the intent-driven transition density p xnjxn1,Di,Tq

via (14); Standard Kalman prediction to obtainμi,q_n∣n1and Ci,q_n∣n1via (39); Standard Kalman update to obtainμi,q_n∣nand Ci,q_n∣n;

Compute: li,qn ¼ p ynjy0:n1,Di,Tq

via (38);

UpdateTq-conditioned likelihood via (33): p y0:njDi,Tq

¼ Li,q

n ¼ Li,qn1 li,qn

end for

Approximate p yð 0:njDiÞ numerically using Li,qn , q ¼ 1,2,…,NT

; end for

Obtain destination posterior at tn: p Dð ijy0_:nÞ≈P p yð 0:njDiÞp Dð Þi

D j∈Dp yð0:njDjÞp Dð Þ;j end for

3.2. Intent predictors for jump diffusion models

The jump diffusion model introduced inSection 2.1.2is constructed as a conditionally Gaussian form (16), that is, the transition density from time t to t þ h is a Gaussian density if the nonlinear component jump time sequenceτ_t:tþhis given as a condition. Thus an efficient strategy would be estimatingτ_t:tþhin a Monte Carlo sense, then for each sample ofτ_t:tþh, p xð tþh,ijxt,i,τt:tþhÞ is retained as Gaussian form so that

the standard Kalman filter can be employed to carry out the estimation. Such strategy, known as Rao-Blackwellized variable rate particle filter (Godsill,2007; Christensen et al.,2012), aims to strengthen the estimation accuracy by employing analytical computations as much as possible.

When NDpossible destinations are considered, the same number of particle filters are required, each with NPparticles for a particular destinationDi. Here, we allow the NDdifferent particle filters to share the same

sample set of jump times. This not only reduces the inference computational complexity, but can also circumvent spurious large differences between the likelihoods of the various destinations, induced by individual sample outlier(s). Nonetheless, this particular consideration is not expected to noticeably impact the intent prediction performance since the aim in this paper is not to accurately estimate the object state or the individual destination likelihood p yð 0:njDiÞ. Instead, the focus is on comparing the likelihoods for all nominal

destinations, calculated under the same conditions, in order to determine the intended endpoint from the observed motion. At time tn, each variable rate particle filter stores the samplesτð Þ0p:n(p ¼ 1,2,…,NP), the

normalized weightωð Þnp,i, the mean μð Þ_n∣np,i and covariance C_n∣nð Þp,i for Gaussian density p xnjy0:n,τp ð Þ 0_:n,Di

. Subsequently, the empirical estimations for jump time and state can be described as follows,

p τð 0_:njy0:n,DiÞ≈ XNP p¼1 ωð Þp,i n δτð Þp 0:nðτ0:nÞ, (40) p xð njy0_:n,DiÞ≈ XNP p¼1 ωð Þp,i n N xnjμð Þn∣np,i, C p,i ð Þ n∣n : (41)

(15)

Accordingly, the predictive likelihood can be approximated as p y_nþ1jy0:n,Di ¼ ð p y_nþ1jy0:n,τ0_:nþ1,Di p τð n:nþ1jτ0_:nÞp τð 0_:njy0:n,DiÞdτ0_:nþ1 ≈XNP p¼1 eωð Þp,i nþ1, (42)

where the updated weight eωð Þ_nþ1p,i, in the bootstrap particle filter setting, is defined as eωð Þp,i nþ1¼ ωð Þnp,ip ynþ1jy0:n,τ p ð Þ 0_:nþ1,Di , (43)

and the new jump time samplesτð Þ_n:nþ1p , in the corresponding (bootstrap) setup, are propagated according to the Poisson transition described inSection 2.1.2,

τð Þp

n:nþ1 p τn:nþ1jτð Þ0p_:n

: (44)

It can be shown that the updated weighteωð Þ_nþ1p,i is also required to compute the normalized weightωð Þ_nþ1p,i: ωð Þp,i nþ1¼ eωð Þp,i nþ1 PNP p¼1eω p,i ð Þ nþ1 : (45) Similar to (23), the p ynþ1jy0_:n,τp ð Þ 0:nþ1,Di

in (43) can be computed in a closed form with the stored mean μð Þp,i n∣n and covariance C p,i ð Þ n∣n , that is p y_nþ1jy0:n,τ p ð Þ 0:nþ1,Di ¼ N y_nþ1jHμð Þp,i nþ1∣n, HC p,i ð Þ nþ1∣nH0þ Vn , (46) where μð Þp,i nþ1∣n ¼ Fi,tnþ1tnμ p,i ð Þ n∣n þ Mi,tnþ1tnþ X tn<τð Þ_kp≤tnþ1 F_i,t nþ1τð ÞkpBμJ ,

Cð Þ_nþ1∣np,i ¼ F_i,t_nþ1_t_nCð Þ_n∣np,iF0_i,t nþ1tnþ X tn<τð Þkp≤tnþ1 F i,tnþ1τð Þ_kpBΣJB 0_F0 i,tnþ1τð Þ_kp þ Qi,tnþ1tn: (47)

In order to updated the stored density meanμð Þ_nþ1∣nþ1p,i and covariance Cð Þ_nþ1∣nþ1p,i , the following standard Kalman filter updated steps are required:

μð Þp,i nþ1∣nþ1 ¼ μ p,i ð Þ nþ1∣nþ K yn Hμp,i ð Þ nþ1∣n , Cð Þ_nþ1∣nþ1p,i ¼ I KHð ÞCð Þ_nþ1∣np,i , K ¼ Cð Þ_nþ1∣np,i H0 HCð Þ_nþ1∣np,i H0þ V_n 1 : (48)

This procedure completes the variable rate particle filtering for a single time step and the overall intent prediction algorithm is summarized asAlgorithm II.

Algorithm II Intent Inference with the jump model

Initialization: CreateNDvariable rate particle filters, each with NP particles; for each observationsn ¼ 1 : N captured at tndo

for particlesp ¼ 1 : NPdo

(16)

end for

for destinationsi ¼ 1 : NDdo if Resample then

Resample particles and set weightsωð Þ_n1p,i ¼ 1=N_P; end if

for particlesp ¼ 1 : NPdo

Predict the meanμð Þ_nþ1∣np,i and covariance Cð Þ_nþ1∣np,i via (47); Calculate the updated weighteωð Þ_nþ1p,i according to (43)(46); Update the meanμð Þ_nþ1∣nþ1p,i and covariance Cð Þ_nþ1∣nþ1p,i via (48) end for

Produce the predictive likelihood p ynþ1jy0:n,Di

from (42); Calculate the normalized weightωð Þ_nþ1p,i according to (45); Calculate likelihood p y0_:nþ1jDi

in (20); end for

Determine endpoint probability: p Dð ijy0_:nÞ in (1);

end for

4. Results

0Figure 1. This system used the off-the-shelf sensor, Leap Motion, which can reliably track hand and finger positions in 3D during the pointing-selection tasks, at a rate exceeding 30 Hz. The utilized dataset contains 95 trajectories pertaining to four participants while undertaking pointing-selection tasks under various road and driving conditions. Here, we divide these data into two sets:

1. Dataset A with all 95 pointing tracks; this allows us to perform a comprehensive comparison between different algorithms for various levels of present perturbations (e.g., static, motorway driving and off-road driving).

2. Dataset B with 10 trajectories when the user input was subjected to severe level of noise due to driving on a badly maintained road or off-road driving. This dataset is a subsect of Dataset A and was collected in a Land Rover. It is particularly relevant to examine the outcome of the algorithms that incorporate a jump process, that is employ jump diffusion models.

During the above interaction tasks, an experimental user interface with multiple selectable circular icons was displayed on a touchscreen mounted to the car dashboard. The number of selectable icons is∣D∣ ¼ 21 for Dataset A, and∣D∣ ¼ 37 for Dataset B. Two typical pointing trajectories of each dataset are presented in

Figure 4. Similar to the common ISO 9241 pointing task, often referred to as Fitt’s law task, one randomly

chosen GUI item is highlighted at a time and the user is expected to select it. Identical uniform prior is placed on all of the interface items, that is, p D ¼ Dð iÞ ¼ 1=NDfor allDi∈D in order for the results to be

comparable to those in previous work.

Below, we use the aggregate inference success and the timely successful prediction over pointing duration to evaluate the predictors performance; both apply a maximum a posteriori criterion (i.e., pick the most probable icon) as per:

bD tð Þ ¼ argmaxn Di∈Dp D ¼ Dð ijy0:nÞ:

More specifically, the first is defined as the proportion of the total pointing gesture (in time), from its start at t0until touching the display surface at timeT , for which the predictor correctly inferred the true

endpointDTrue∈D. The second captures the percentage of the correct prediction over all tested dataset as a

function of the percentage of pointing task duration, thus indicating how early the predictor assigns the highest probability for the correct destination.

(17)

4.1. Prediction performance with linear Gaussian intent-driven models

For the 95 pointing tracks covering different levels of perturbations (i.e., Dataset A), the computationally efficient LTI Gaussian models are sufficient to predict the intended icon with a high accuracy. In this section, we evaluate all LTI Gaussian models introduced inSection 2.1.1for this dataset. The parameters for all tested predictors are listed inTable 1. They are chosen in a manual way and from examining a few possible values (i.e., no training or fine tuning across all test trajectories was undertaken).

It is noted that this model parameterization is aimed at demonstrating the low training requirement of the adopted state-space-modeling-based inference approach, since the models are physically meaningful. Take the linear Gaussian mean reverting model as an example. Although a higher noise parameterσ would lead to higher uncertainty on the final endpoint, it permit more flexibility in the target dynamics manifested in elaborate maneuvers (e.g., swings) of the target (i.e., pointing finger) en-route to its destination, instead of simply following a straight line. A higher reversion parameterη would cause the stronger force towards the endpoint, such that a higher damping factorρ (and/or γ) ensures that the finger speed upon touch is reasonable. A set of fined-tuned parameters can trade-off generalizability of the model to new data for a high (validation) prediction accuracy. Alternatively, the parameters of OU models may be set based on maximization of the likelihoodQK_k¼1p y½ 0k:njD ¼ Di,Ω

for a sample of K typical full pointing finger trajectories;Ω is the set of the parameters for an intent-driven dynamic model. As the driver/passenger uses the touch system, the system can refine the applied model parameters from the larger available dataset(s). On the other hand, the automatic parameter tuning for BD models is more complicated due to the condition on unknown arrival time. Nonetheless, from our extensive experiments and Table 1 we can confirm that these empirically selected parameters of the BD methods work sufficiently well.

The timely successful prediction over pointing duration is shown inFigure 5. As expected, all methods generally exhibit an upward trend, that is their performance improves as the the pointing finger-hand approaches the intended endpoint. Specifically, the ERA model can perform poorly at the beginning period of the pointing motion (e.g., in the first 30%); however, it delivers comparable results thereafter. Combined with the overall success rate shown inTable 1, it can be seen that all examined models achieve comparable prediction successes. Hence, the predictive touch system could infer the intended on-display item remarkably early in the pointing-selection tasks. Nonetheless, it can be noticed fromTable 1and

Figure 5that the BD models achieve better results compared with the Gaussian mean reverting models. Furthermore, performance of models whose acceleration is driven by a Wiener process (BD–CA) are also superior to those constructed merely on target position and velocity (BD–CV). This may be due to the fact

(a) Dataset A (21 icons) (b) Dataset B (37 icons) Figure 4. Example trajectories of collected real pointing trajectories.

(18)

that present accelerations can reflect the movement trend with more details. Additionally, the advantage of BD methods may be gained from more accurate end state construction such that a successful prediction can always be achieved at the end of pointing period. It is worth mentioning that, in our case, the intent predictors implemented according to Algorithm 1 of Liang et al. (2019) have the lowest complexity compared with other BD counterparts while the OU-based predictors have the least computational cost among all evaluated methods. Note that for better visualization we have chosen to only display the success rate against gesture time for the BD method proposed in this paper because the lines from previous BD formulations are visually very similar to that introduced here.

4.2. Highly perturbed scenarios and particle filtering

The intent inference performance for highly perturbed trajectories in Dataset B has been tested with jump models and Gaussian mean reverting models in Gan et al. (2019); Ahmad et al. (2016a). Results from the BD models introduced in this paper are also included for comparison. The aggregate inference success for

Table 1. Linear time invariant (LTI) Gaussian models parameters and overall prediction performance for 95 tracks.

Models Parameter values Success rates

ERV (Ahmad et al.,2016b) η ¼ 55, ρ ¼ 15, σ ¼ 3000 62:9% ERA (Gan et al.,2019) η ¼ 1150, ρ ¼ 320, γ ¼ 29, σ ¼ 1:7 104 _63:4%

BD–CV (Ahmad et al.,2018)a σCV¼ 650, σDposi ¼ 1:5, σDveli¼ 100 64:4%

BD–CV (Liang et al.,2019)a σCV¼ 650, σDposi ¼ 1:5, σDveli¼ 100 65:4%

BD–CA (Liang et al.,2019)a σCA¼ 9400, σDposi ¼ 1:5, σD

i vel¼ 50, σDacci¼ 1500 68:3% BD–CV (this paper)a _σ CV¼ 650, σDposi ¼ 1:5, σD i vel¼ 100 65:2%

BD–CA (this paper)a _σ

CA¼ 9500, σDposi ¼ 1:5, σD

i

vel¼ 25, σDacci¼ 1500 68:3%

Abbreviations: BD–CA, bridging distributions-constant acceleration; BD–CV, bridging distributions-constant velocity; ERA, equilibrium reverting acceleration; ERV, equilibrium reverting velocity.

a

For all BD models, p T jDð iÞ ¼ Unif 0:1s,1:9sð Þ, the number of quadrature points NT¼ 30, eG ¼ I and σDiform the corrospondingΣi.

(19)

all algorithms and the timely successful prediction from four selected algorithms (i.e., omitting non-BD models for the clarity of presentation) are depicted inFigures 6and7, respectively. The applied jump models below are described in Algorithm II and each use 2000 particles, but it has been observed that a comparable performance can be achieved with a small number (e.g., 500) of particles; their parameters are listed inTable 2(the jumps are assumed to be isotropic) and those for all of the LTI Gaussian models remain the same as inTable 1.

FromFigure 7, one can see that the BD–CA model always achieves the highest successful prediction

after the first 20 percents duration, and the BD models can always achieve the accurate prediction at the end stage of pointing due to its Markov bridge nature. Similar to the LTI ERA model, the jump-ERA model ascends from a relatively low successful prediction, to a comparable successful rate on the second half of the pointing duration. This insensitivity may be caused by a longer reflection on the observation from the acceleration constructed intention. The average success rate in Figure 6 indicates that the BD–CA outperforms other models for this dataset, while the jump-ERV model achieves the second best success rate. This may lead to the conclusion that the BD–CA is the best among other models on characterizing the intention of the hand pointing. However, it is worthwhile to note that the exploration for the parameters of jump models are more restrictive due to their larger number of parameters and

Figure 6. Average success rate for Dataset B.

(20)

time-consuming evaluation process. Thus it is possible that a better results can be achieved with other parameters for jump models, especially for the jump-ERA model. Additionally, the present jumps/jolts in those 10 tracks might not be of the severity (magnitude and/or transience) that a BD–CA model cannot successfully smooth out or follow. Under such high-levels of perturbations, the numerical marginalization of arrival time with BD can be challenging as the pointing-task duration can be subject to large delays, with the risk of it being very distinctive from the prior ofT . Nevertheless, the use of the particle filtering with a jump process offers additional advantages, not necessarily relevant to the predictive touch usecase, such as detecting the location-time of the perturbations-induced fast maneuvers (jumps) and potentially better destination-aware tracking results, see Gan et al. (2019).

5. Conclusion

In this paper, we presented an overview of the existing stochastic dynamic modeling methods for destination inference, with the in-vehicle predictive touch system as the case study. It covers linear Gaussian and nonlinear setups, both proposed within a Bayesian framework. The adopted continuous time intent-driven state space models naturally facilitate treating asynchronous data, including from multiple sensors. In addition, a new bridging distribution approach was proposed here, which has a moderate computational requirement and a clear stochastic interpretation compared with previous formulations. Results from real data of a predictive touch system demonstrated the efficacy of the various considered prediction algorithms, namely their ability to infer the user intent remarkably early in the pointing-selection task. Thereby, this can facilitate effective touchless interactions via the intuitive free hand pointing gestures. It is emphasized that the presented prediction techniques are also applicable to other fields, for example surveillance, smart navigation, robotics, etc. Nevertheless, there are several extensions to this work, for example bridging distributions for nonlinear and/or non-Gaussian systems (e.g., a stable Lévy system in Gan and Godsill,2020), considering intrinsically nonlinear intent-driven motion models for highly maneuverable objects and various measurement models (one such example can be found in Liang et al.,2020). This paper serves as an impetus to further research on meta-level tracking models and inference algorithms.

Funding Statement. This research was supported by grants from Jaguar Land Rover under the Centre for Advanced Photonics and

Electronics CAPE agreement.

Competing Interests. The authors declare no competing interests exist.

Data Availability Statement. The data used in this work is proprietary and confidential; it cannot be made publicly available.

Readers are nonetheless encouraged to contact authors where data and code could be shared subject to the recipient abiding by certain terms and conditions.

Ethical Standards. The conducted user studies for predictive touch met all ethical guidelines of the University of Cambridge and

Jaguar Land Rover, including adherence to the legal requirements of the study country.

Authorship Contributions. Conceptualization: all; Data curation: B. A.; Formal analysis: R. G., J. L., and B. A.; Funding

acquisition: S. G., and B. A.; Investigation: R. G. and J. L.; Methodology: all; Software: R. G., and J. L.; Supervision: B. A., and S. G.; Validation: R. G., and J. L.; Writing-original draft: R.G, J. L., and B. A.; Writing-review editing: all; All authors approved the final submitted draft.

Table 2. Jump models parameter sets.

Models Mean-reverting dynamics Jumps

Jump-ERV η ¼ 60, ρ ¼ 15, σ ¼ 450 μJ¼ 0, σJ¼ 866, λ1J ¼ 1

Jump-ERA η ¼ 1150, ρ ¼ 320, γ ¼ 30, σ ¼ 8000 μJ¼ 0, σJ¼ 1:6 104,λ1J ¼ 0:2

(21)

Notation

D discrete set of possible destinations,D ¼ Df i: i ¼ 1,2,…,NDg

ND number of nominal endpoints

Di the ith endpoint

D considered intended destination

bD maximum a posteriori estimate for the intended destination

T destination arrival time,T ¼ tN

eyi

N pseudo-observation vector for destinationDi

Σi covariance of the Gaussian pseudo-observation model for destinationDi

x_n target dynamic state at time tn

x_n,i dynamic state at time tnfor an object travelling toDi

y_n observation vector captured at time tn

β_t multivariate standard Wiener process

J_t compound Poisson process with Gaussian distributed jump size Sk, that is Jt¼

P

τk<tSk

τk arrival time of the k th jump

N xjm,Cð Þ multivariate normal distribution for random variable x with mean m and covariance C NP number of particles used in the particle filtering

ωð Þp,i

n normalized weight at time tnfor the p th particle for destination Di

eωð Þp,i

n updated weight for the p th particle for endpoint Di

I identity matrix with the suitable size

:0 _{transpose operation}

p ð Þ probability density function

References

Ahmad BI, Hare C, Singh H, Shabani A, Lindsay B, Skrypchuk L, Langdon P and Godsill S (2019a) Touchless selection schemes for intelligent automotive user interfaces with predictive mid-air touch. International Journal of Mobile Human Computer Interaction (IJMHCI) 11(3), 18–39.

Ahmad BI, Langdon PM and Godsill SJ (2019b) A Bayesian framework for intent prediction in object tracking. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 8439–8443.

Ahmad BI, LangdonPM, Godsill SJ, Donkor R, Wilde R and Skrypchuk L (2016a) You do not have to touch to select: a study on predictive in-car touchscreen with mid-air selection. In Proceedings of the 8th International Conference on Automotive User Interfaces and Interactive Vehicular Applications. ACM, pp. 113–120.

Ahmad BI, Langdon PM, Godsill SJ, Hardy R, Skrypchuk L and Donkor R (2015) Touchscreen usability and input performance in vehicles under different road conditions: an evaluative study. In Proceedings of the 7th International Conference on Automotive User Interfaces and Interactive Vehicular Applications. ACM, pp. 47–54.

Ahmad BI, Murphy J, Langdon PM and Godsill SJ (2014) Filtering perturbed in-vehicle pointing gesture trajectories: Improving the reliability of intent inference. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, pp. 1–6.

Ahmad BI, Murphy JK, Godsill S, Langdon P and Hardy R (2017) Intelligent interactive displays in vehicles with intent prediction: a Bayesian framework. IEEE Signal Processing Magazine 34(2):82–94.

Ahmad BI, Murphy JK, Langdon P, Godsill S, Hardy R and Skrypchuk L (2016b) Intent inference for pointing gesture based interactions in vehicles. IEEE Transactions on Cybernetics 46, 878–889.

Ahmad BI, Murphy JK, Langdon PM and Godsill SJ (2018) Bayesian intent prediction in object tracking using bridging distributions. IEEE Transactions on Cybernetics, 48(1), 215–227.

Baccarelli E and Cusani R (1998) Recursive filtering and smoothing for reciprocal Gaussian processes with Dirichlet boundary conditions. IEEE Transactions on Signal Processing 46(3), 790–795.

Bando T, Takenaka K, Nagasaka S and Taniguchi T (2013) Unsupervised drive topic finding from driving behavioral data. In Proceedings of IEEE Intelligent Vehicles Symposium (IV). IEEE, pp. 177–182.

Bark K, Tran C, Fujimura K and Ng-Thow-Hing V (2014) Personal navi: Benefits of an augmented reality navigational aid using a see-thru 3d volumetric hud. In Proceedings of the 6th International Conference on Automotive User Interfaces and Interactive Vehicular Applications. ACM, pp. 1–8.