I. SpinningTop
$-\pi$ and $\pi$ is returned. Equations 3.8 and 3.10 provide the polar parameters of $r$ strictly in terms of the polar parameters of $a$ and $b$.
By employing this result, we can fit for phase. By applying a known phase shift to $\alpha$ and $\beta$, for example 0 and $\pi/2$ respectively, the simultaneous equations of equation 3.4 can be used to solve for $a$ and $b$ with $\delta$ fixed to 0; equations 3.8 and 3.10 can then be used to determine the values of $k$ and $\delta$.
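The two-step phase fit can be illustrated with a minimal sketch. Equations 3.4, 3.8 and 3.10 are not reproduced in this section, so the code below assumes the standard identity $k\cos(n\phi - \delta) = a\cos(n\phi) + b\sin(n\phi)$ for a single Fourier term, which corresponds to phase shifts of 0 and $\pi/2$. The function name `fit_amplitude_phase`, the periodicity `n` and the synthetic data are hypothetical and chosen only for illustration.

```python
import math

def fit_amplitude_phase(phis, energies, n):
    """Fit E(phi) = a*cos(n*phi) + b*sin(n*phi) by linear least squares,
    then convert (a, b) to an amplitude k and a phase delta.
    Assumes a single Fourier term of known periodicity n."""
    cos_col = [math.cos(n * p) for p in phis]
    sin_col = [math.sin(n * p) for p in phis]
    # Entries of the 2x2 normal equations.
    cc = sum(c * c for c in cos_col)
    ss = sum(s * s for s in sin_col)
    cs = sum(c * s for c, s in zip(cos_col, sin_col))
    cy = sum(c * y for c, y in zip(cos_col, energies))
    sy = sum(s * y for s, y in zip(sin_col, energies))
    det = cc * ss - cs * cs
    a = (ss * cy - cs * sy) / det
    b = (cc * sy - cs * cy) / det
    # Polar form: atan2 returns a phase between -pi and pi.
    return math.hypot(a, b), math.atan2(b, a)

# Synthetic, noise-free data with a known amplitude and phase.
phis = [2.0 * math.pi * i / 36 for i in range(36)]
energies = [2.0 * math.cos(3 * p - 0.7) for p in phis]
k, delta = fit_amplitude_phase(phis, energies, n=3)
print(f"k = {k:.3f}, delta = {delta:.3f}")
```

Solving the linear problem first and converting to polar form afterwards avoids a non-linear fit in $\delta$.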
3.2. Robust Regression
A common problem when fitting data is noise. For the dihedral rotation energy surfaces fitted here, sources of noise include rotation-induced steric clashes, which cause singularities in the potential that are not well captured by a Fourier expansion, and incorrect convergence of the underlying calculations. In either case, the noisy data should be accounted for so that its effect on the fitted results is minimised. Least squares regression is generally used to fit data, but it has no inherent ability to manage noisy data. Here, robust regression is used instead.
The residuals of a fit are the differences between the reference data, $y_i$, and the expected values given by the fit, $y(x_i)$. Least squares regression minimises the sum of the squares of these residuals, $S$:
\[ S = \frac{1}{2} \sum_{i=1}^{n} \left( y(x_i) - y_i \right)^2 \tag{3.11} \]
The square of the residuals is used instead of the absolute residual values because it allows the residuals to be treated as a continuous, differentiable quantity. However, the use of squared residuals has some drawbacks. In particular, outlying points can have a disproportionate effect on the fit. Take the example of determining the mean of a set of numbers* such as 1.05, 0.98, 0.93 and 12.2. The numbers have been taken from a sample with a known mean of 1. The least squares solution of $\mu = 3.79$ is far from the true mean due to the presence of a single outlier.
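The arithmetic behind this toy example can be checked with a short sketch. It assumes the constant model $y(x) = c$, for which minimising equation 3.11 gives the arithmetic mean. The helper name `least_squares_cost` and the threshold used to separate the outlier are hypothetical.

```python
# Toy data from the text: three samples near 1 and a single outlier.
samples = [1.05, 0.98, 0.93, 12.2]

def least_squares_cost(c, ys):
    """Equation 3.11 evaluated for the constant model y(x) = c."""
    return 0.5 * sum((c - y) ** 2 for y in ys)

# Setting dS/dc = sum(c - y_i) = 0 gives the arithmetic mean.
ls_estimate = sum(samples) / len(samples)

# The same estimate with the outlier left out, for comparison only.
# The threshold of 5.0 is a hand-picked, hypothetical value.
inliers = [y for y in samples if y < 5.0]
inlier_estimate = sum(inliers) / len(inliers)

print(f"least squares estimate: {ls_estimate:.2f}")
print(f"estimate without the outlier: {inlier_estimate:.2f}")
```

A single point out of four moves the least squares estimate from roughly 1 to 3.79, which is the disproportionate effect described above.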
Robust regression is a means of incorporating robustness into the estimation of a fit to data. This is accomplished by introducing a loss function, $\rho(z)$, which grows more slowly than linearly, to formulate a least-squares-like problem:
\[ S = \frac{1}{2} \sum_{i=1}^{n} \rho\left( \left( y(x_i) - y_i \right)^2 \right) \tag{3.12} \]
*Though trivial, this can be thought of as fitting the equation $y = mx + c$ where $m$ is fixed at 0. When $y = c$ is substituted into equation 3.11 and differentiated with respect to $c$ in order to minimise $S$, the result is $c = \frac{1}{n} \sum_{i=1}^{n} y_i$, which is the arithmetic mean.
3. Theory and Method Development
A number of possible loss functions are available, from relatively mild functions such as Huber loss⁵ to strongly sub-linear functions such as Cauchy loss.⁶ Equation 3.12 collapses to equation 3.11 when the loss function is set to $\rho(z) = z$. For the toy example from earlier, we can apply Cauchy loss, where $\rho(z) = \ln(1 + z)$, to obtain a robust estimate for the mean:
\[ S = \frac{1}{2} \sum_{i=1}^{n} \ln\left( 1 + (c - y_i)^2 \right) \tag{3.13} \]
which, when differentiated with respect to $c$, gives:
\[ \frac{\partial S}{\partial c} = \sum_{i=1}^{n} \frac{c - y_i}{(c - y_i)^2 + 1} = 0 \tag{3.14} \]
Solving for $c$ gives $c \approx 1.017$, which given the parameters of the sample is a much better estimate for the mean than that obtained by minimising equation 3.11.
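Equation 3.14 can be solved numerically for the toy data. The sketch below uses simple bisection rather than any method named in the text, and the bracket of 0.5 to 1.5 is a hand-picked assumption based on where the non-outlying samples lie. The function names are hypothetical.

```python
import math

samples = [1.05, 0.98, 0.93, 12.2]

def cauchy_cost(c, ys):
    """Equation 3.13: S = 1/2 * sum of ln(1 + (c - y_i)^2)."""
    return 0.5 * sum(math.log1p((c - y) ** 2) for y in ys)

def cauchy_gradient(c, ys):
    """Left-hand side of equation 3.14."""
    return sum((c - y) / ((c - y) ** 2 + 1.0) for y in ys)

def bisect_root(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi] by repeated interval halving."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Assumed bracket around the cluster of non-outlying samples.
robust_estimate = bisect_root(lambda c: cauchy_gradient(c, samples), 0.5, 1.5)
print(f"robust estimate: {robust_estimate:.3f}")
```

The root found this way should lie close to the value quoted above and far from the least squares estimate of 3.79.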
The form of the derivative of the Cauchy loss function is much more complex than that of the least squares function, even in this simple toy case. As such, solving the equations analytically will in general be difficult, if not impossible. Thus, iterative numerical methods, such as Newton’s method,⁷ are required. Numerical methods require an initial estimate of the result, and due to the possibility of multiple minimum values, the quality of the initial estimate is important. There is a risk of divergence in the iterative process if the initial guess is too far from a root, and the possibility of getting stuck in a local minimum as opposed to the desired global minimum. Divergence can be avoided by taking step sizes in the iterative process such that each step reduces the sum of residuals. Convergence to a local minimum is more difficult to avoid. The simplest means of obtaining an initial estimate of the fit is by performing a least squares fit first, and using the result of that as an input for the robust regression fit. However, outliers affecting the least squares fit can lead to convergence to a local minimum.
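The dependence on the initial estimate can be seen even in the toy problem of equation 3.13. The following is a minimal sketch of a Newton-type iteration with step halving, so that every accepted step reduces $S$. The use of the absolute curvature to keep the step pointing downhill, the tolerances and the function names are assumptions made for this illustration, not a description of the implementation used in this work.

```python
import math

samples = [1.05, 0.98, 0.93, 12.2]

def cost(c):
    """Equation 3.13 for the toy data."""
    return 0.5 * sum(math.log1p((c - y) ** 2) for y in samples)

def gradient(c):
    """First derivative of the cost (equation 3.14)."""
    return sum((c - y) / ((c - y) ** 2 + 1.0) for y in samples)

def curvature(c):
    """Second derivative of the cost with respect to c."""
    return sum((1.0 - (c - y) ** 2) / ((c - y) ** 2 + 1.0) ** 2 for y in samples)

def damped_newton(c0, tol=1e-9, max_iter=100):
    c = c0
    for _ in range(max_iter):
        g = gradient(c)
        if abs(g) < tol:
            break
        h = curvature(c)
        # Dividing by |h| keeps the step pointing downhill even where
        # the curvature is negative (an assumed safeguard).
        step = -g / abs(h) if abs(h) > 1e-8 else -g
        # Halve the step until the cost decreases.
        t = 1.0
        while t > 1e-12 and cost(c + t * step) >= cost(c):
            t *= 0.5
        if t <= 1e-12:
            break  # no further reduction possible at this precision
        c += t * step
    return c

from_mean = damped_newton(sum(samples) / len(samples))  # least squares start
from_outlier = damped_newton(12.0)                      # start beside the outlier
print(f"start at mean:    c = {from_mean:.3f}, S = {cost(from_mean):.3f}")
print(f"start at outlier: c = {from_outlier:.3f}, S = {cost(from_outlier):.3f}")
```

In this particular case the least squares start reaches the minimum near 1, while a start beside the outlier settles into a shallower local minimum close to 12, with a larger value of $S$.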
To limit the possibility of convergence to a non-optimal root, an iterative process is used here to determine the initial guess. A least squares fit to the reference data is performed, and the residual at each point is calculated. If a datum has a residual larger than some cut-off value, in this case 25% of the median absolute reference data value, it is removed from the data set and the least squares fit is performed again. Once either five data points have been removed or all data points have residuals less than the cut-off, the least squares result is used as the initial conditions for the robust regression, and all previously removed data are returned to the data set. The final least squares fit from this iterative process is not used as the final fit, because the discarded data may contain important information on the shape of the curve that would otherwise be lost. Figure 3.2 shows the effect of using different fitting methods to fit slightly noisy data containing three outliers. The least squares approach performs poorly, being heavily influenced