

Figure 4-5: This picture now shows that we have the choice of path ‘a’ or path ‘b’, based upon the mean cost of each path, where that mean cost is not precisely known. We would pose the question: “What is the probability that path ‘a’ is a better option than path ‘b’, based on how well we know the expected outcome of ‘a’?” As a point of clarification, even if travel time (cost) were deterministic, there would still be some uncertainty about what that deterministic cost is.

where $\varepsilon$ is a random variable. We can view the inputs to the above solution in two different ways: as statistical units or as an overdetermined system of linear equations.

We are given a set of $n$ samples from a system, where each sample $X_i$ is a feature vector consisting of $f$ features and one output $y_i$. $X$ represents the entire sample set and $X_{ij}$ denotes feature $j$ of sample $i$. We wish to learn the unknown parameters $\beta$ with which to predict the dependent variable $y$ given the independent variable $X$. Assuming that the estimation errors are uncorrelated with each other and with $X_i$ and have equal variance, the best linear unbiased estimator of $\beta$, which minimizes the sum of squared residuals, is:

$$\hat{\beta} = (X^T X)^{-1} X^T y \tag{4.15}$$
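As a minimal sketch of equation (4.15), the estimator can be computed with NumPy. The data below are hypothetical (a constant column plus one input, with noise-free outputs), and the helper name `ols_fit` is an illustrative assumption, not something defined in the text.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data: n = 5 samples, f = 2 features (constant column and one input).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = 1.0 + 2.0 * X[:, 1]      # noise-free outputs generated from y = 1 + 2x
beta_hat = ols_fit(X, y)     # recovers the generating coefficients
```

Solving the system rather than inverting $X^T X$ is a numerical design choice; the result is the same estimator.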

Statistical view

In the statistical view, $\varepsilon$ is typically assumed to be a Gaussian random vector of mean zero and variance $\sigma^2$.

$$\varepsilon \sim \mathcal{N}(0, \sigma^2) \tag{4.16}$$

This of course presupposes the above “best linear unbiased estimator” assumptions.

Regardless of the distribution type, the statistical view is that there are inherent system dynamics that are stochastic. Therefore, we could not possibly model these stochastic dynamics with a more sophisticated model. The only option is to model them with a representative distribution.

Overdetermined system view

In practice, however, the disturbance term $\varepsilon$ accounts for noise in the process, but it also accommodates the inadequacy of the linear model to capture the system dynamics. Consider the following overdetermined system of $f$ unknown coefficients $\beta_1, \ldots, \beta_f$ and $n$ linear equations, where $n > f$:

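$$\sum_{j} X_{ij}\,\beta_j = y_i \quad \text{written as} \quad X\beta = y, \quad \text{where} \tag{4.17}$$

$$
X = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1f} \\
X_{21} & X_{22} & \cdots & X_{2f} \\
X_{31} & X_{32} & \cdots & X_{3f} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nf}
\end{bmatrix},
\quad
\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_f \end{bmatrix},
\quad
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix},
\quad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
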

The system usually has no solution, so the goal is to find coefficients β which fit the equations best. In this view we are fitting the output from a non-linear system, onto a linear model. The system cannot be perfectly described using a linear model, resulting in some error. The error has nothing to do with system stochastics. If we assume the system to be deterministic, the same error exists. Therefore, the error is due to an insufficient representation.
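The point that the error persists for a deterministic system can be illustrated with a small sketch. The quadratic target below is a hypothetical stand-in for a non-linear system; no randomness is involved, yet the least squares fit leaves a residual.

```python
import numpy as np

# n = 21 equations, f = 2 unknowns (intercept and slope): an overdetermined system.
x = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([np.ones_like(x), x])
y = x ** 2                     # deterministic, non-linear system output (assumed)

# Least squares coefficients that fit the equations best.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta        # error caused purely by the linear representation

max_abs_error = np.max(np.abs(residual))
```

Repeating the computation gives exactly the same residual, which is consistent with reading the error as insufficient representation rather than noise.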

Homoscedasticity Functions

The above representation only allows for homoscedastic errors. It assumes that the variance of the error is constant across observations, as depicted in Figure 4-6. Put another way, the function does not have the capability of attributing a different level of uncertainty to different parts of the prediction space.

Figure 4-6: One dimensional homoscedastic function reflecting a constant variance across all predictions

Heteroscedasticity Functions and Autonomy

But sometimes the prediction variance is correlated with the input variables. In other words, the prediction uncertainty differs depending on what part of the input space you are working in, as depicted in Figure 4-7. This relates to the problem of autonomy in the following way. Suppose we are using a function to predict the value of taking a given action. Our function conveys how certain we are about that value. That value certainty translates to how certain we are about what to do. Clearly, the level of certainty and confidence varies based on the situation. However, if that value is estimated with a homoscedastic function, we would not be able to represent that difference.

Figure 4-7: One dimensional heteroscedastic function showing tie-down points and variance for a given prediction
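A small simulation can make the limitation concrete. The noise model below (standard deviation growing with the input) is an assumption chosen purely for illustration. A single pooled variance estimate understates the uncertainty in one region and overstates it in the other.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=2000)
noise_std = 0.1 + x                          # assumed: uncertainty grows with x
y = 2.0 * x + rng.normal(0.0, 1.0, size=x.size) * noise_std

# Ordinary least squares fit with an intercept.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta

pooled_var = residual.var()                  # what a homoscedastic model reports
var_low = residual[x < 0.5].var()            # actual spread for small inputs
var_high = residual[x >= 0.5].var()          # actual spread for large inputs
```

Here `pooled_var` falls between `var_low` and `var_high`, so the homoscedastic summary is wrong in both halves of the input space.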

Next we will review some methods that allow heteroscedasticity.

Kernel Smoothing Equations

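$$\hat{P}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) \tag{4.18}$$

$$K(x) = \frac{1}{(2\pi)^{p/2}} \exp\!\left(-\frac{1}{2} \sum_{i=1}^{p} x_i^2\right) \tag{4.19}$$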
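Equations (4.18) and (4.19) can be sketched as follows for the one-dimensional case ($p = 1$). The function names and the sample values are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u):
    """Equation (4.19): Gaussian kernel for each row of u (dimension p)."""
    u = np.atleast_2d(u)
    p = u.shape[1]
    return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (p / 2.0)

def kernel_density(x, samples, h):
    """Equation (4.18): sum of kernels centred on each sample, scaled by 1/(n h)."""
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    n = samples.shape[0]
    u = (np.asarray(x, dtype=float).reshape(1, -1) - samples) / h
    return gaussian_kernel(u).sum() / (n * h)

# Hypothetical one-dimensional samples and bandwidth.
samples = [0.0, 1.0, 3.0]
density_at_one = kernel_density(1.0, samples, h=0.5)
```

The bandwidth `h` controls the smoothness: a larger value spreads each sample's contribution over a wider region.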

The kernel trick

The kernel trick is a calculation that accomplishes the above kernel smoothing in a very elegant way. It also allows for various kinds of kernels. It is called the kernel trick because it allows you to create a function in a higher dimensional space without explicitly elaborating the higher dimensional basis or features. In other words, it allows you to create a more expressive, more complicated mapping with a relatively simple calculation. That calculation does not involve transforming your basis into a more complicated basis. When using a Gaussian kernel, the function is theoretically infinite dimensional. Here we will go through an example using a polynomial kernel. The reader may refer to MacKay [2003] regarding the infinite dimensional case.
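To make the idea concrete, here is a minimal check for a scalar input and a degree-2 polynomial kernel. The explicit feature map `phi` is written out only for comparison; the kernel never needs it. Both function names are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 basis for a scalar input: (1, sqrt(2) x, x^2)."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def poly_kernel(x, z, d=2):
    """Polynomial kernel (x z + 1)^d, computed without any basis expansion."""
    return (x * z + 1.0) ** d

x, z = 0.7, -1.3
explicit = phi(x) @ phi(z)      # inner product in the expanded space
implicit = poly_kernel(x, z)    # same number from the original inputs
```

Both routes give the same value because $(xz + 1)^2 = 1 + 2xz + x^2 z^2$, which is exactly the inner product of the two expanded vectors.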

Non-parametric Least Squares Regression with a polynomial Kernel

Non-parametric least squares regression explicitly estimates future data values using weighted combinations of the training data set. The method is fast and effective given certain conditions. Its accuracy is good but limited by the number of samples that can be handled by the regression method. It works as follows.

Suppose we have a set of state samples $X$ and a set of target values $y$. $X$ is an $n$ by $f$ matrix, where $n$ is the number of state samples and $f$ is the number of features (variables used to characterize the state). $y$ is an $n$ by 1 matrix, where $n$ is the number of state samples.

We compute the function as follows:

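$$\alpha = y^T \left( \left(X X^T + 1\right)^d + \lambda I \right)^{-1} \tag{4.20}$$

where $\lambda I$ is a ridge penalty and $d$ is the kernel degree, applied to each element of $X X^T + 1$. We can then estimate a new target value $\hat{y}$ given a new feature vector $x_{new}$, a single feature vector of size $1 \times f$:

$$\hat{y} = \alpha \left( X x_{new}^T + 1 \right)^d \tag{4.21}$$

where $X$ is the set of training samples and $x_{new}$ is a new sample.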
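A minimal sketch of equations (4.20) and (4.21) follows. The function names, the training data, and the values of `d` and `lam` are illustrative assumptions.

```python
import numpy as np

def poly_kernel(A, B, d):
    """Element-wise polynomial kernel (A B^T + 1)^d."""
    return (A @ B.T + 1.0) ** d

def fit_alpha(X, y, d, lam):
    """Equation (4.20): alpha = y^T ((X X^T + 1)^d + lam * I)^{-1}."""
    G = poly_kernel(X, X, d) + lam * np.eye(X.shape[0])
    # G is symmetric, so solving G a = y yields the same vector as y^T G^{-1}.
    return np.linalg.solve(G, y)

def predict(alpha, X, x_new, d):
    """Equation (4.21): y_hat = alpha (X x_new^T + 1)^d."""
    return alpha @ poly_kernel(X, np.atleast_2d(x_new), d)

# Hypothetical training set: n = 5 samples, f = 1 feature, quadratic targets.
X = np.array([[-1.0], [-0.5], [0.0], [0.5], [1.0]])
y = X[:, 0] ** 2
alpha = fit_alpha(X, y, d=2, lam=1e-6)
y_hat = predict(alpha, X, np.array([0.25]), d=2)   # close to 0.25 ** 2
```

With a degree-2 kernel and a very small ridge penalty the quadratic target is reproduced almost exactly; a larger `lam` would trade some of that accuracy for smoothness.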

Complexity Penalty Matrix and Kernel Degree

A ridge penalty, $\lambda$ multiplied by the identity matrix $I$, is a common regularizer, which first and foremost ensures that the regularized Gram matrix is invertible. Secondly, the ridge penalty can be increased to reduce estimation variance at the expense of increased bias, which controls the complexity of the function. For convenience we will write $R = \lambda I$. In general any regularizing prior $R$ can be used.

$d$ is the polynomial degree of the kernel. Thus, $X X^T$ is the Gram matrix and $\left(X X^T + 1\right)^d$ is the result of applying a polynomial kernel to the Gram matrix. Essentially, this is a non-parametric way to create and use a set of polynomial basis vectors for function approximation.
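The invertibility point can be checked numerically. In the hypothetical example below there are five samples of a single feature, so a degree-2 polynomial kernel spans fewer independent directions than there are samples, and the kernel matrix alone is singular.

```python
import numpy as np

X = np.array([[-1.0], [-0.5], [0.0], [0.5], [1.0]])   # n = 5 samples, f = 1
d, lam = 2, 1e-3                                      # assumed degree and penalty

G = (X @ X.T + 1.0) ** d               # polynomial kernel applied to the Gram matrix
rank_G = np.linalg.matrix_rank(G)      # rank deficient: the kernel spans only 1, x, x^2

R = lam * np.eye(len(X))               # ridge penalty R = lambda * I
rank_regularized = np.linalg.matrix_rank(G + R)   # full rank, hence invertible
```

Adding `R` lifts every eigenvalue by `lam`, which is why even a small penalty is enough to make the inverse exist.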

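One interesting aspect of this function approximation method is that the expensive part of the computation can be isolated. Let

$$K = \left( \left(X X^T + 1\right)^d + R \right)^{-1},$$

where $R$ is a regularizing matrix. The kernel function $k(X, X)$ performs the following on each element of the Gram matrix:

$$k(x, x') = \left(x x'^T + 1\right)^d$$

Suppose $R = 0$ and $d = 1$; then we have a linear kernel.

With this definition we can break up the equation

$$\alpha = y^T \left( k(X, X) + R \right)^{-1}$$

such that

$$\alpha = y^T K$$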

Thus, if we perform a value iteration backup (where our function approximator represents our value function) using the exact same state samples $X$ as our original function, we can easily and quickly recreate the function approximation simply by replacing $y$ with the updated targets. In other words, the expensive part of the function ($K$) remains the same.
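The reuse described above can be sketched as follows. The state samples, the kernel settings, and the "updated" targets are all hypothetical; the point is only that the inverse is computed once and each new target vector costs a single matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))    # hypothetical state samples
d, lam = 2, 1e-3                            # assumed kernel degree and ridge penalty

# Expensive step, done once: K = ((X X^T + 1)^d + R)^{-1} with R = lam * I.
G = (X @ X.T + 1.0) ** d + lam * np.eye(len(X))
K = np.linalg.inv(G)

def backup_alpha(y):
    """Cheap step, repeated per backup: alpha = y^T K."""
    return y @ K

y0 = X[:, 0] ** 2                # initial targets
alpha0 = backup_alpha(y0)
y1 = y0 + 0.5 * X[:, 1]          # stand-in for targets after one backup
alpha1 = backup_alpha(y1)
```

Storing the explicit inverse is what makes repeated backups cheap; if the targets were only ever fitted once, solving the linear system directly would be the more usual choice.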

This is called the kernel trick because it allows you to work in the higher dimensional space without ever explicitly representing that space. This is an easy way to get a more expressive function.

Extracting the covariance

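$$k K^{-1} k^T \tag{4.22}$$

$$f(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i), \quad \text{where } n \in \mathbb{N},\; x_i \in X \text{ and } \alpha_i \in \mathbb{R} \tag{4.23}$$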

Gaussian kernel

$$k(x, x') = \sigma_f^2 \exp\!\left( -\frac{1}{2 l^2} \sum_{i=1}^{p} \left(x_i - x'_i\right)^2 \right) \tag{4.24}$$

We then use cross validation to learn the best bandwidth ($l$) for the kernel based estimator. This idea can be expanded into the Gaussian process smoother, to capture and express its own representational uncertainty within the function.
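One possible way to carry out that bandwidth search is sketched below. It assumes a ridge-regularized kernel estimator of the same form as equation (4.23), a simple interleaved 5-fold split, and a small hand-picked grid of candidate bandwidths; none of these choices come from the text.

```python
import numpy as np

def gaussian_kernel(A, B, length, sigma_f=1.0):
    """Equation (4.24) evaluated for every pair of rows in A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length ** 2))

def cv_error(X, y, length, lam=1e-3, folds=5):
    """Mean squared held-out error of the kernel estimator for one bandwidth."""
    idx = np.arange(len(X))
    errors = []
    for k in range(folds):
        test = idx[k::folds]                  # every folds-th sample is held out
        train = np.setdiff1d(idx, test)
        G = gaussian_kernel(X[train], X[train], length) + lam * np.eye(len(train))
        alpha = np.linalg.solve(G, y[train])
        pred = gaussian_kernel(X[test], X[train], length) @ alpha
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))

# Hypothetical one-dimensional data set and candidate bandwidths.
X = np.linspace(0.0, 2.0 * np.pi, 40).reshape(-1, 1)
y = np.sin(X[:, 0])
candidates = [0.01, 0.1, 1.0, 10.0]
scores = {length: cv_error(X, y, length) for length in candidates}
best_length = min(scores, key=scores.get)
```

A bandwidth far smaller than the sample spacing predicts almost nothing between training points, so its held-out error is large; the search settles on a value comparable to the scale over which the target varies.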

4.4 Function approximation is the source of uncertainty
