Robust Regression - On using automated algorithms to parameterise molecules for molecular dynam

I. SpinningTop


$-\pi$ and $\pi$ is returned. Equations 3.8 and 3.10 provide the polar parameters of $r$ strictly in terms of the polar parameters of $a$ and $b$.

By employing this result, we can fit for phase. By applying a known phase shift to $\alpha$ and $\beta$, for example 0 and $\pi/2$ respectively, the simultaneous equations of equation 3.4 can be solved for $a$ and $b$ with $\delta$ fixed to 0; equations 3.8 and 3.10 can then be used to determine the values of $k$ and $\delta$.
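The idea of fitting two fixed-phase components and then converting to an amplitude and a phase can be illustrated with a small Python sketch. Equations 3.4, 3.8 and 3.10 are not reproduced in this excerpt, so the convention used here, $k\cos(x+\delta) = a\cos(x) + b\cos(x+\pi/2)$ with $k=\sqrt{a^2+b^2}$ and $\delta=\operatorname{atan2}(b,a)$, is an assumption, and the function name `fit_amplitude_phase` is hypothetical.

```python
import math

def fit_amplitude_phase(xs, ys):
    """Fit y = k*cos(x + delta) through the linear form a*cos(x) + b*cos(x + pi/2)."""
    c = [math.cos(x) for x in xs]                # component with phase shift 0
    s = [math.cos(x + math.pi / 2) for x in xs]  # component with phase shift pi/2
    # Normal equations of the linear least squares problem in a and b.
    cc = sum(ci * ci for ci in c)
    ss = sum(si * si for si in s)
    cs = sum(ci * si for ci, si in zip(c, s))
    cy = sum(ci * yi for ci, yi in zip(c, ys))
    sy = sum(si * yi for si, yi in zip(s, ys))
    det = cc * ss - cs * cs
    a = (cy * ss - sy * cs) / det
    b = (sy * cc - cy * cs) / det
    # Convert the pair (a, b) to polar form; atan2 returns a value in [-pi, pi].
    return math.hypot(a, b), math.atan2(b, a)

# Synthetic, noise-free data with a known amplitude and phase (illustrative values).
xs = [2.0 * math.pi * i / 12 for i in range(12)]
ys = [1.5 * math.cos(x - 2.0) for x in xs]
k_fit, delta_fit = fit_amplitude_phase(xs, ys)
print(k_fit, delta_fit)
```

For a dihedral term the argument `x` would presumably be a multiple of the dihedral angle; that mapping is likewise an assumption.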

3.2. Robust Regression

A common problem when fitting data is noise. With respect to the fitting of dihedral rotation energy surfaces presented here, sources of noise can include rotation-induced steric clashes, which cause singularities in the potential that are not well captured by a Fourier expansion, or incorrect convergence of the underlying calculations. In any case, the noisy data should be accounted for so that any effect on the fitted results is minimised. Least squares regression is generally used to fit data, but it has no inherent ability to manage noisy data. Here, robust regression is used instead.

The residuals of a fit are the differences between the reference data, $y_i$, and the expected values given by the fit, $y(x_i)$. Least squares regression is a means to minimise the sum of the squares of these residuals, $S$:

$$S = \frac{1}{2}\sum_{i=1}^{n}\left(y(x_i) - y_i\right)^2 \tag{3.11}$$

The square of the residuals is used instead of the absolute residual values because it allows the residuals to be treated as a continuous differentiable quantity. However, use of the squares of the residuals does have some drawbacks. In particular, outlying points can have a disproportionate effect on the fit. Take the example of determining the mean of a set of numbers* such as 1.05, 0.98, 0.93 and 12.2. The numbers have been taken from a sample with a known mean of 1. The solution of $\mu = 3.79$ is far from the true mean due to the presence of a single outlier.
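As a quick numerical check of this toy example, the following Python sketch computes the least squares estimate for the four quoted numbers and compares it with the estimate obtained when the outlier is left out. The names `samples` and `sum_of_squares` are illustrative and do not come from the thesis.

```python
samples = [1.05, 0.98, 0.93, 12.2]

def sum_of_squares(c, ys):
    """Equation 3.11 for the constant model y = c."""
    return 0.5 * sum((c - y) ** 2 for y in ys)

# For a constant model, the minimiser of equation 3.11 is the arithmetic mean.
mu = sum(samples) / len(samples)
print(f"least squares estimate: {mu:.2f}")

# Without the single outlier the estimate sits close to the known mean of 1.
inliers = [y for y in samples if y < 10.0]
mu_inliers = sum(inliers) / len(inliers)
print(f"estimate without 12.2: {mu_inliers:.3f}")
```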

Robust regression is a means to incorporate robustness into the estimation of a fit to data. This is accomplished by introducing a loss function, $\rho(z)$, which grows more slowly than linearly, to formulate a least-squares-like problem:

$$S = \frac{1}{2}\sum_{i=1}^{n}\rho\left(\left(y(x_i) - y_i\right)^2\right) \tag{3.12}$$
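To make the role of the loss function concrete, the sketch below evaluates equation 3.12 for three choices of $\rho$: the identity, a Huber-type loss and the Cauchy loss (the latter two are named later in this section). The Huber form with its threshold at $z = 1$ and the residual values are assumptions made for illustration.

```python
import math

def linear(z):
    """rho(z) = z, which turns equation 3.12 back into equation 3.11."""
    return z

def huber(z):
    """Huber-type loss on a squared residual z, threshold at z = 1 (assumed form)."""
    return z if z <= 1.0 else 2.0 * math.sqrt(z) - 1.0

def cauchy(z):
    """Cauchy loss, rho(z) = ln(1 + z)."""
    return math.log1p(z)

def robust_sum(residuals, rho):
    """Equation 3.12: half the sum of rho applied to each squared residual."""
    return 0.5 * sum(rho(r * r) for r in residuals)

# Three small residuals and one outlier (illustrative values).
residuals = [0.05, -0.02, -0.07, 11.2]
for rho in (linear, huber, cauchy):
    print(rho.__name__, round(robust_sum(residuals, rho), 3))
```

The outlier dominates the sum under the identity loss, while its contribution is strongly damped by the sub-linear losses.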

* Though trivial, this can be thought of as minimising the equation $y = mx + c$ where $m$ is fixed at 0. When $y = c$ is substituted into equation 3.11 and differentiated with respect to $c$ in order to minimise $S$, the result is $c = \frac{1}{n}\sum_{i=1}^{n} y_i$,

3. Theory and Method Development

A number of possible loss functions are available, from relatively mild functions such as Huber loss [5] to strongly sub-linear functions such as Cauchy loss [6]. Equation 3.12 collapses to equation 3.11 when the loss function is set to $\rho(z) = z$. For the toy example from earlier, we can apply Cauchy loss, where $\rho(z) = \ln(1 + z)$, to obtain a robust estimate for the mean:

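$$S = \frac{1}{2}\sum_{i=1}^{n}\ln\left(1 + (c - y_i)^2\right) \tag{3.13}$$

which when differentiated with respect to $c$ gives:

$$\frac{\partial S}{\partial c} = \sum_{i=1}^{n}\frac{c - y_i}{(c - y_i)^2 + 1} = 0 \tag{3.14}$$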

Solving for $c$ gives $c \approx 1.017$, which, given the parameters of the sample, is a much better estimate for the mean than that obtained by minimising equation 3.11.
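One way to check this value numerically is to rewrite equation 3.14 as a weighted mean, $c = \sum_i w_i y_i / \sum_i w_i$ with $w_i = 1/(1 + (c - y_i)^2)$, and iterate it to a fixed point. This reweighting scheme is a stand-in chosen for brevity, not necessarily the solver used in the thesis, and the name `cauchy_mean` is hypothetical.

```python
samples = [1.05, 0.98, 0.93, 12.2]

def cauchy_mean(ys, tol=1e-12, max_iter=1000):
    """Solve equation 3.14 by repeatedly recomputing a weighted mean."""
    c = sum(ys) / len(ys)  # start from the least squares mean
    for _ in range(max_iter):
        # Points far from the current estimate receive small weights.
        weights = [1.0 / (1.0 + (c - y) ** 2) for y in ys]
        c_new = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(c_new - c) < tol:
            return c_new
        c = c_new
    return c

c_robust = cauchy_mean(samples)
# Left-hand side of equation 3.14 at the solution; it should be close to zero.
gradient = sum((c_robust - y) / ((c_robust - y) ** 2 + 1.0) for y in samples)
print(f"c = {c_robust:.4f}, dS/dc = {gradient:.1e}")
```

The result should land close to the value quoted above, with the outlier at 12.2 carrying a weight of less than 0.01.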

The form of the derivative of the Cauchy loss function is much more complex than that of the least squares function, even in this simple toy case. As such, solving the equations analytically will in general be difficult, if not impossible. Thus, iterative numerical methods, such as Newton's method [7], are required. Numerical methods require an initial estimate of the result and, because multiple minima may exist, the quality of the initial estimate is important. There is a risk of divergence in the iterative process if the initial guess is too far from a root, and of getting stuck in a local minimum as opposed to the desired global minimum. Divergence can be avoided by choosing step sizes in the iterative process such that each step reduces the sum $S$. Convergence to a local minimum is more difficult to avoid. The simplest means of obtaining an initial estimate of the fit is to perform a least squares fit first and use its result as the input for the robust regression fit. However, outliers affecting the least squares fit can lead to convergence to a local minimum.
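Both points, step control and sensitivity to the starting value, can be seen on the toy example. The sketch below applies Newton's method to equation 3.14 and halves each step until the objective of equation 3.13 does not increase. The fallback to a plain downhill step where the curvature is not positive, and all function names, are assumptions made for this illustration rather than details taken from the thesis.

```python
import math

samples = [1.05, 0.98, 0.93, 12.2]

def objective(c, ys):
    """Equation 3.13."""
    return 0.5 * sum(math.log1p((c - y) ** 2) for y in ys)

def gradient(c, ys):
    """Equation 3.14, the first derivative of the objective."""
    return sum((c - y) / ((c - y) ** 2 + 1.0) for y in ys)

def curvature(c, ys):
    """Second derivative of the objective with respect to c."""
    return sum((1.0 - (c - y) ** 2) / ((c - y) ** 2 + 1.0) ** 2 for y in ys)

def damped_newton(c, ys, tol=1e-9, max_iter=200):
    for _ in range(max_iter):
        g = gradient(c, ys)
        if abs(g) < tol:
            break
        h = curvature(c, ys)
        # Newton step where the curvature is positive, otherwise a plain
        # downhill step along the negative gradient.
        step = -g / h if h > 0.0 else -g
        s_old = objective(c, ys)
        # Halve the step (at most 60 times) until the objective does not increase.
        for _ in range(60):
            if objective(c + step, ys) <= s_old:
                break
            step *= 0.5
        c += step
    return c

from_mean = damped_newton(sum(samples) / len(samples), samples)
from_outlier = damped_newton(12.0, samples)
print(from_mean, objective(from_mean, samples))
print(from_outlier, objective(from_outlier, samples))
```

Started from the least squares mean, the iteration settles near the three clustered values. Started at 12.0, it settles in a shallower local minimum next to the outlier, where the objective is higher.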

To limit the possibility of convergence to a non-optimal root, an iterative process is used here to determine the initial guess. A least squares fit to the reference data is performed, and the residual at each point is calculated. If a datum has a residual larger than some cut-off value, in this case 25% of the median absolute reference data value, it is removed from the data set and the least squares fit is performed again. Once either five data points have been removed or all data points have residuals less than the cut-off, the least squares result is used as the initial conditions for the robust regression, and all previously removed data are returned to the data set. The final least squares fit from this iterative process is not used as the final fit because the discarded data may contain important information on the shape of the curve, which would otherwise be lost.

Figure 3.2 shows the effect of using different fitting methods to fit slightly noisy data containing three outliers. The least squares approach performs poorly, being heavily influenced
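The initial-guess procedure described above can be sketched in Python as follows. A straight line stands in for the Fourier series fitted in the thesis, and the worst offending point is removed one at a time. Both choices, along with the names `fit_line` and `initial_guess`, are assumptions made for illustration.

```python
import statistics

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def initial_guess(points, fraction=0.25, max_removed=5):
    """Least squares fit with up to max_removed outlying points set aside."""
    # Cut-off: a fraction of the median absolute reference value.
    cutoff = fraction * statistics.median(abs(y) for _, y in points)
    kept = list(points)  # work on a copy so the full data set stays intact
    removed = 0
    while True:
        slope, intercept = fit_line(kept)
        residuals = [abs(slope * x + intercept - y) for x, y in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] <= cutoff or removed == max_removed:
            return slope, intercept
        del kept[worst]
        removed += 1

# Points on y = 2x + 1 with two corrupted values (illustrative data).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data[4] = (4.0, 40.0)
data[7] = (7.0, -20.0)
slope0, intercept0 = initial_guess(data)
print(slope0, intercept0)
```

The returned parameters would then seed the robust regression, which is run on the complete data set, including the points set aside here.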

In document On using automated algorithms to parameterise molecules for molecular dynamics simulations and investigating suitable ensembles for the simulation of naphthalimide monolayers : a thesis submitted to Massey University in Albany, Auckland in fulfilment of t (Page 60-62)