5 Approximation and fitting - Additional Exercises Sol

5.1 Three measures of the spread of a group of numbers. For x∈ Rⁿ, we define three functions that measure the spread or width of the set of its elements (or coefficients). The first function is the spread, defined as

$$\phi_{\mathrm{sprd}}(x) = \max_{i=1,\ldots,n} x_i - \min_{i=1,\ldots,n} x_i.$$

This is the width of the smallest interval that contains all the elements of x.

The second function is the standard deviation, defined as

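$$\phi_{\mathrm{stdev}}(x) = \left( \frac{1}{n}\sum_{i=1}^n x_i^2 - \Big( \frac{1}{n}\sum_{i=1}^n x_i \Big)^2 \right)^{1/2}.$$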

This is the statistical standard deviation of a random variable that takes the values x₁, . . . , x_n, each with probability 1/n.

The third function is the average absolute deviation from the median of the values:

$$\phi_{\mathrm{aamd}}(x) = \frac{1}{n}\sum_{i=1}^n |x_i - \mathrm{med}(x)|,$$

where med(x) denotes the median of the components of x, defined as follows. If n = 2k − 1 is odd, then the median is defined as the value of the middle entry when the components are sorted, i.e., med(x) = x_[k], the kth largest element among the values x₁, . . . , x_n. If n = 2k is even, we define the median as the average of the two middle values, i.e., med(x) = (x_[k] + x_[k+1])/2.

Each of these functions measures the spread of the values of the entries of x; for example, each function is zero if and only if all components of x are equal, and each function is unaffected if a constant is added to each component of x.

Which of these three functions is convex? For each one, either show that it is convex, or give a counterexample showing it is not convex. By a counterexample, we mean a specific x and y such that Jensen’s inequality fails, i.e., φ((x + y)/2) > (φ(x) + φ(y))/2.

Solution. The first one is straightforward. The maximum of x_i is convex, and the minimum is concave. The difference, which is φsprd(x), is therefore convex.

The second one is also convex. The standard deviation is the Euclidean norm of a linear function of x,

$$\phi_{\mathrm{stdev}}(x) = \frac{1}{\sqrt{n}} \left\| \left( I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \right) x \right\|_2,$$

and so is convex.

The third function is also convex. The snappiest proof we know goes like this. Consider the function defined as

$$\phi(x) = \inf_t \|x - t\mathbf{1}\|_1.$$

Since $\|x - t\mathbf{1}\|_1$ is convex in (x, t), it follows that φ (which is obtained by minimizing over t) is convex in x. The median t = med(x) minimizes $\|x - t\mathbf{1}\|_1$. (When n is even, any number between the two middle ones is also a minimizer.) Using this value of t, we see that

φ_aamd(x) = (1/n)φ(x),

and so is convex.

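By the way, the same proof works for the other two functions, with the ℓ₁ norm in the definition of φ replaced by the ℓ_p norm. For p = ∞, the t that minimizes $\|x - t\mathbf{1}\|_\infty$ is $t = (\max_i x_i + \min_i x_i)/2$, and we have

$$\phi_{\mathrm{sprd}}(x) = 2\phi(x).$$

For p = 2, the t that minimizes $\|x - t\mathbf{1}\|_2$ is the mean, $t = (1/n)\sum_i x_i$. Then we have

$$\phi_{\mathrm{stdev}}(x) = \frac{1}{\sqrt{n}}\,\phi(x).$$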
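The three identities above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`spread`, `stdev`, `aamd`, `norm_dist`) and the test vector are illustrative choices, not part of the original solution.

```python
import math

def spread(x):
    return max(x) - min(x)

def stdev(x):
    # Standard deviation of a variable taking each value with probability 1/n.
    n = len(x)
    mean = sum(x) / n
    return math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)

def median(x):
    s = sorted(x)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else 0.5 * (s[mid - 1] + s[mid])

def aamd(x):
    m = median(x)
    return sum(abs(xi - m) for xi in x) / len(x)

def norm_dist(x, t, p):
    """||x - t*1||_p for p in {1, 2, math.inf}."""
    d = [abs(xi - t) for xi in x]
    if p == 1:
        return sum(d)
    if p == 2:
        return math.sqrt(sum(di * di for di in d))
    return max(d)

x = [1.0, 2.0, 4.0, 7.0]
n = len(x)

# Plug in the minimizing t for each norm: midrange, mean, median.
sprd_via_norm = 2.0 * norm_dist(x, 0.5 * (max(x) + min(x)), math.inf)
stdev_via_norm = norm_dist(x, sum(x) / n, 2) / math.sqrt(n)
aamd_via_norm = norm_dist(x, median(x), 1) / n

print(spread(x), sprd_via_norm)
print(stdev(x), stdev_via_norm)
print(aamd(x), aamd_via_norm)
```

Each printed pair should agree up to rounding, which is only a sanity check on one vector, not a proof.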

We mention another nice proof of convexity of φaamd(x). Suppose that n = 2k− 1 is odd. We use x_[1] to denote the largest element of x, x_[2] the second largest, and so on. The median of x is given by x_[k]. Our function can be expressed as

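$$\phi_{\mathrm{aamd}}(x) = \frac{1}{n}\left( \sum_{i=1}^{k-1} x_{[i]} - \sum_{i=k+1}^{n} x_{[i]} \right).$$

But we know that for any r, the function

$$\sum_{i=1}^{r} x_{[i]},$$

the sum of the r largest entries of x, is convex. The sum of the r smallest elements can be expressed as

$$\sum_{i=n-r+1}^{n} x_{[i]} = \mathbf{1}^T x - \sum_{i=1}^{n-r} x_{[i]},$$

and so is concave. It follows that φ_aamd(x) is the difference of a convex and a concave function, and so is convex. The same type of argument works when n is even.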

And here is yet another proof of convexity of φ_aamd. Let’s take n even for simplicity. Consider each vector c with all entries either 1 or −1, with an equal number of 1s and −1s. (There are $\binom{n}{n/2}$ such vectors.) The linear function c^T x gives the difference between the sum of some half of the entries of x and the sum of the other half. The function n φ_aamd(x) is the maximum over all such linear functions, and therefore φ_aamd is convex.
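For small even n, the claim that n·φ_aamd(x) equals both "sum of the largest half minus sum of the smallest half" and the maximum of c^T x over balanced sign vectors can be checked by brute force. This is a sketch under that reading; the function names and test vectors are illustrative.

```python
from itertools import combinations

def aamd(x):
    s = sorted(x)
    n = len(s)
    med = s[n // 2] if n % 2 == 1 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return sum(abs(xi - med) for xi in x) / n

def top_minus_bottom(x):
    # Sum of the n/2 largest entries minus sum of the n/2 smallest (n even).
    s = sorted(x, reverse=True)
    h = len(x) // 2
    return sum(s[:h]) - sum(s[h:])

def max_balanced_sign(x):
    # Maximum of c^T x over all c in {-1, +1}^n with equally many +1 and -1.
    n = len(x)
    best = float("-inf")
    for plus in combinations(range(n), n // 2):
        plus = set(plus)
        val = sum(xi if i in plus else -xi for i, xi in enumerate(x))
        best = max(best, val)
    return best

x4 = [1.0, 2.0, 4.0, 7.0]
x6 = [3.0, -1.0, 0.5, 2.0, 2.0, -4.0]
for x in (x4, x6):
    print(len(x) * aamd(x), top_minus_bottom(x), max_balanced_sign(x))
```

The enumeration grows like the binomial coefficient, so this is only practical for very small n.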

5.2 Minimax rational fit to the exponential. (See exercise 6.9 of Convex Optimization.) We consider the specific problem instance with data

$$t_i = -3 + 6(i-1)/(k-1), \qquad y_i = e^{t_i}, \qquad i = 1, \ldots, k,$$

where k = 201. (In other words, the data are obtained by uniformly sampling the exponential function over the interval [−3, 3].) Find a function of the form

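$$f(t) = \frac{a_0 + a_1 t + a_2 t^2}{1 + b_1 t + b_2 t^2}$$

that minimizes $\max_{i=1,\ldots,k} |f(t_i) - y_i|$. (We require that $1 + b_1 t_i + b_2 t_i^2 > 0$ for i = 1, . . . , k.)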

Find optimal values of a₀, a₁, a₂, b₁, b₂, and give the optimal objective value, computed to an accuracy of 0.001. Plot the data and the optimal rational function fit on the same plot. On a different plot, give the fitting error, i.e., f (ti)− yi.

Hint. To check if a feasibility problem is feasible, in Matlab, you can use strcmp(cvx_status,'Solved') after cvx_end. In Python, use problem.status == 'optimal'. In Julia, use problem.status == :Optimal.

In Julia, make sure to use the ECOS solver.

Solution. The objective function (and therefore also the problem) is not convex, but it is quasiconvex. We have $\max_{i=1,\ldots,k} |f(t_i) - y_i| \le \gamma$ if and only if

$$\left| \frac{a_0 + a_1 t_i + a_2 t_i^2}{1 + b_1 t_i + b_2 t_i^2} - y_i \right| \le \gamma, \quad i = 1, \ldots, k.$$

This is equivalent to (since the denominator is positive)

$$|a_0 + a_1 t_i + a_2 t_i^2 - y_i (1 + b_1 t_i + b_2 t_i^2)| \le \gamma (1 + b_1 t_i + b_2 t_i^2), \quad i = 1, \ldots, k,$$

which is a set of 2k linear inequalities in the variables a and b (for fixed γ). In particular, this shows the objective is quasiconvex. (In fact, it is a generalized linear fractional function.)

To solve the problem we can use a bisection method, solving an LP feasibility problem at each step.

At each step we select some value of γ and solve the feasibility problem

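$$\begin{array}{ll}
\mbox{find} & a,\ b \\
\mbox{subject to} & |a_0 + a_1 t_i + a_2 t_i^2 - y_i (1 + b_1 t_i + b_2 t_i^2)| \le \gamma (1 + b_1 t_i + b_2 t_i^2), \quad i = 1, \ldots, k,
\end{array}$$

with variables a and b. (Note that as long as γ > 0, the condition that the denominator is positive is enforced automatically.) This can be turned into the LP feasibility problem

$$\begin{array}{ll}
\mbox{find} & a,\ b \\
\mbox{subject to} & a_0 + a_1 t_i + a_2 t_i^2 - y_i (1 + b_1 t_i + b_2 t_i^2) \le \gamma (1 + b_1 t_i + b_2 t_i^2), \quad i = 1, \ldots, k, \\
& a_0 + a_1 t_i + a_2 t_i^2 - y_i (1 + b_1 t_i + b_2 t_i^2) \ge -\gamma (1 + b_1 t_i + b_2 t_i^2), \quad i = 1, \ldots, k.
\end{array}$$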
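The bisection loop itself does not depend on how feasibility is tested. The sketch below separates the two: `bisect_min_level` is a hypothetical helper that takes any feasibility oracle, and the oracle shown is a deliberately simple stand-in (Chebyshev fit of data by a single constant, where feasibility has a closed form) rather than the LP of this exercise.

```python
def bisect_min_level(is_feasible, lo, hi, tol=1e-3):
    """Return an upper bound within tol of the smallest feasible level gamma.

    Assumes is_feasible(hi) is True and that feasibility is monotone in gamma,
    which is what quasiconvexity of the objective provides.
    """
    while hi - lo >= tol:
        gamma = 0.5 * (lo + hi)
        if is_feasible(gamma):
            hi = gamma      # gamma is achievable: tighten the upper bound
        else:
            lo = gamma      # gamma is not achievable: raise the lower bound
    return hi

# Toy oracle: is there a constant c with |c - y_i| <= gamma for all i?
# That holds exactly when max(y) - min(y) <= 2*gamma.
y = [0.3, 1.1, 2.0, 0.7]

def constant_fit_feasible(gamma):
    return max(y) - min(y) <= 2.0 * gamma

upper0 = max(abs(v) for v in y)   # c = 0 is feasible at this level
gamma_star = bisect_min_level(constant_fit_feasible, 0.0, upper0, tol=1e-6)
print(gamma_star)  # close to (max(y) - min(y)) / 2
```

For the rational fit, the oracle would instead solve the LP feasibility problem above for the given γ, as the listings that follow do with CVX, CVXPY, and Convex.jl.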

The following Matlab code solves the problem for the particular problem instance.

k=201;
t=(-3:6/(k-1):3)';
y=exp(t);
Tpowers=[ones(k,1) t t.^2];
u=exp(3); l=0; % initial upper and lower bounds
bisection_tol=1e-3; % bisection tolerance
while u-l>= bisection_tol
    gamma=(l+u)/2;
    cvx_begin % solve the feasibility problem
        cvx_quiet(true);
        variable a(3);
        variable b(2);
        subject to
            abs(Tpowers*a-y.*(Tpowers*[1;b])) <= gamma*Tpowers*[1;b];
    cvx_end
    if strcmp(cvx_status,'Solved')
        u=gamma;
        a_opt=a;
        b_opt=b;
        objval_opt=gamma;
    else
        l=gamma;
    end
end
y_fit=Tpowers*a_opt./(Tpowers*[1;b_opt]);
figure(1);
plot(t,y,'b', t,y_fit,'r+');
xlabel('t');
ylabel('y');
figure(2);
plot(t, y_fit-y);
xlabel('t');
ylabel('err');

The following Python code solves the problem for the particular problem instance.

import numpy as np
import matplotlib.pyplot as plt
import cvxpy as cvx

k = 201
t = -3 + 6 * np.arange(k) / (k - 1)
y = np.exp(t)
Tpowers = np.vstack((np.ones(k), t, t**2)).T

a = cvx.Variable(3)
b = cvx.Variable(2)
gamma = cvx.Parameter(nonneg=True)  # CVXPY 1.x syntax

num = Tpowers @ a              # a0 + a1*t + a2*t^2
den = 1 + Tpowers[:, 1:] @ b   # 1 + b1*t + b2*t^2
lhs = cvx.abs(num - cvx.multiply(y, den))
rhs = gamma * den
problem = cvx.Problem(cvx.Minimize(0), [lhs <= rhs])

l, u = 0, np.exp(3)   # initial upper and lower bounds
bisection_tol = 1e-3  # bisection tolerance
while u - l >= bisection_tol:
    gamma.value = (l + u) / 2
    # solve the feasibility problem
    problem.solve()
    if problem.status == 'optimal':
        u = gamma.value
        a_opt = a.value
        b_opt = b.value
        objval_opt = gamma.value
    else:
        l = gamma.value

y_fit = (Tpowers @ a_opt) / (1 + Tpowers[:, 1:] @ b_opt)

plt.figure()
plt.plot(t, y, 'b', t, y_fit, 'r+')
plt.xlabel('t')
plt.ylabel('y')
plt.show()

plt.figure()
plt.plot(t, y_fit - y)
plt.xlabel('t')
plt.ylabel('err')
plt.show()

The following Julia code solves the problem for the particular problem instance.

using Convex, ECOS, PyPlot
set_default_solver(ECOSSolver(verbose=false))
k = 201
t = -3:6/(k-1):3
y = exp(t)
Tpowers = [ones(k) t t.^2];
a = Variable(3)
b = Variable(2)
l, u = 0, exp(3) # initial upper and lower bounds
bisection_tol = 1e-3 # bisection tolerance
a_opt, b_opt, objval_opt = zeros(3), zeros(2), 0
while u - l >= bisection_tol
    gamma = (l + u) / 2
    lhs = abs(Tpowers * a - (y .* Tpowers) * [1; b])
    rhs = gamma * Tpowers * [1; b]
    problem = minimize(0, [lhs <= rhs])
    # solve the feasibility problem
    solve!(problem)
    if problem.status == :Optimal
        u = gamma
        a_opt = evaluate(a)
        b_opt = evaluate(b)
        objval_opt = gamma
    else
        l = gamma
    end
end
y_fit = (Tpowers * a_opt ./ (Tpowers * [1; b_opt]))
figure()
plot(t, y, "b", t, y_fit, "r+")
xlabel("t")
ylabel("y")
show()
figure()
plot(t, y_fit - y)
xlabel("t")
ylabel("err")
show()

The optimal values are

a₀ = 1.0099, a₁ = 0.6117, a₂ = 0.1134, b₁ = −0.4147, b₂ = 0.0485, and the optimal objective value is 0.0233. We plot the fit and the error in Figures 5 and 6.
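As a rough plausibility check, the reported coefficients can be plugged back into the rational function and evaluated on the same grid. This sketch uses only the four-digit values quoted above; because the denominator is small near t = 3, rounding the coefficients perturbs the fit noticeably, so the maximum error computed this way need not reproduce 0.0233 exactly.

```python
import math

# Coefficients as reported (rounded to four decimals).
a0, a1, a2 = 1.0099, 0.6117, 0.1134
b1, b2 = -0.4147, 0.0485

k = 201
t = [-3 + 6 * i / (k - 1) for i in range(k)]

def denominator(ti):
    return 1 + b1 * ti + b2 * ti ** 2

def f(ti):
    return (a0 + a1 * ti + a2 * ti ** 2) / denominator(ti)

den_min = min(denominator(ti) for ti in t)
max_err = max(abs(f(ti) - math.exp(ti)) for ti in t)

print("smallest denominator on the grid:", den_min)
print("max |f(t_i) - y_i| with rounded coefficients:", max_err)
```

The positive minimum of the denominator confirms that the rounded fit has no pole on the sampling grid.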

Figure 5: Chebyshev fit with rational function. The line represents the data and the crosses the fitted points.

Figure 6: Fitting error for Chebyshev fit of exponential with rational function.

5.3 Approximation with trigonometric polynomials. Suppose y : R → R is a 2π-periodic function. We will approximate y with the trigonometric polynomial

$$f(t) = \sum_{k=0}^{K} a_k \cos(kt) + \sum_{k=1}^{K} b_k \sin(kt).$$

We consider two approximations: one that minimizes the L₂-norm of the error, defined as

$$\|f - y\|_2 = \left( \int_{-\pi}^{\pi} (f(t) - y(t))^2 \, dt \right)^{1/2},$$

and one that minimizes the L₁-norm of the error, defined as

$$\|f - y\|_1 = \int_{-\pi}^{\pi} |f(t) - y(t)| \, dt.$$

The L₂ approximation is of course given by the (truncated) Fourier expansion of y.

To find an L₁ approximation, we discretize t at 2N points,

$$t_i = -\pi + i\pi/N, \quad i = 1, \ldots, 2N,$$

and approximate the L₁ norm as

$$\|f - y\|_1 \approx (\pi/N) \sum_{i=1}^{2N} |f(t_i) - y(t_i)|.$$

(A standard rule of thumb is to take N at least 10 times larger than K.) The L₁ approximation (or really, an approximation of the L1 approximation) can now be found using linear programming.
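One standard way to write the discretized L₁ problem as a linear program is sketched below. The symbols A, c, ȳ, and s are introduced here for illustration and are not notation from the exercise: row i of A holds the basis functions evaluated at t_i, c stacks the coefficients, ȳ holds the samples y(t_i), and s is an auxiliary vector of residual bounds.

```latex
\begin{array}{ll}
\mbox{minimize}   & (\pi/N)\, \mathbf{1}^T s \\
\mbox{subject to} & -s \preceq A c - \bar y \preceq s,
\end{array}
\qquad
\begin{array}{l}
A_{i,:} = (1, \cos t_i, \ldots, \cos K t_i, \sin t_i, \ldots, \sin K t_i), \\
c = (a_0, \ldots, a_K, b_1, \ldots, b_K), \quad \bar y_i = y(t_i), \quad i = 1, \ldots, 2N.
\end{array}
```

At an optimal point each s_i equals |(Ac − ȳ)_i|, so the LP objective equals the discretized L₁ norm of the error.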

We consider a specific case, where y is a 2π-periodic square-wave, defined for −π ≤ t ≤ π as

$$y(t) = \begin{cases} 1 & |t| \le \pi/2 \\ 0 & \mbox{otherwise.} \end{cases}$$

(The graph of y over a few cycles explains the name ‘square-wave’.)

Find the optimal L₂ approximation and (discretized) L₁ optimal approximation for K = 10. You can find the L₂ optimal approximation analytically, or by solving a least-squares problem associated with the discretized version of the problem. Since y is even, you can take the sine coefficients in your approximations to be zero. Show y and the two approximations on a single plot.

In addition, plot a histogram of the residuals (i.e., the numbers f(t_i) − y(t_i)) for the two approximations. Use the same horizontal axis range, so the two residual distributions can easily be compared. (Matlab command hist might be helpful here.) Make some brief comments about what you see.

Solution.

The optimal approximations are shown below. The L₂ optimal approximation has the familiar Gibbs ringing near the discontinuities in y, but the L₁ optimal approximation has much less pronounced oscillation near the discontinuity in y. The L₁ optimal approximation has very small error, except near the discontinuity.
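The analytic route to the L₂ approximation can be sketched as follows. For this square wave the Fourier cosine coefficients work out to a₀ = 1/2 and a_k = 2 sin(kπ/2)/(kπ) for k ≥ 1 (the sine coefficients vanish because y is even). The code below evaluates the truncated series for K = 10 on the grid used in the exercise; note that this is the continuous L₂ solution, which is close to but not identical with the discretized least-squares fit in the Matlab listing.

```python
import math

K = 10

def fourier_coeffs(K):
    # a_0 = 1/2, a_k = 2*sin(k*pi/2)/(k*pi); even-k terms vanish up to rounding.
    a = [0.5]
    for k in range(1, K + 1):
        a.append(2.0 * math.sin(k * math.pi / 2) / (k * math.pi))
    return a

def f_l2(t, a):
    return sum(ak * math.cos(k * t) for k, ak in enumerate(a))

a = fourier_coeffs(K)
N = 10 * K
grid = [-math.pi + i * math.pi / N for i in range(1, 2 * N + 1)]

overshoot = max(f_l2(t, a) for t in grid) - 1.0
print("f(0)      =", f_l2(0.0, a))
print("f(pi/2)   =", f_l2(math.pi / 2, a))   # midpoint of the jump
print("overshoot =", overshoot)              # ringing above the level 1
```

The value at the discontinuity sits at the midpoint of the jump, and the positive overshoot is the ringing visible in the plot.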

[Figure: the square wave together with its two approximations; legend entries: square, L2-norm, L1-norm.]

The residual distributions of the two approximations are shown below. As expected, the L₂ approximation has more small residuals, and fewer large residuals, compared to the L₁ approximation. The L₁ approximation has a large number of zero and very small errors, and a few large ones.

[Figure: histograms of the residuals, with one panel for the l2-norm approximation and one for the l1-norm approximation.]

The Matlab code is as follows.

% square wave approximations
close all; clear all;
K = 10;
N = 10*K;
t = -pi+pi/N:pi/N:pi;
y = double(abs(t) <= pi/2)';
% A = [f0 f1 f2 f3 ... f9 f10];
A = ones(2*N,1);
for k = 1:K
    A = [A cos(k*t)'];
end
% L2 approximation
c = A\y
y_2 = A*c;
% L1 approximation
cvx_begin
    variable x(K+1)
    minimize(norm(A*x-y,1))
cvx_end
y_1 = A*x;
% plot of the square-wave and optimal fits
figure;
plot(t,y,t,y_2,t,y_1,'--');
title('approximations')
legend('square','l2norm','l1norm');
axis([-pi pi -0.2 1.2])
% distribution of residual magnitudes
figure;
subplot(2,1,1), hist(y - y_2, 25), axis([-.6 .6 0 150]);
title('l2 and l1 residual distributions'), ylabel('l2-norm')
subplot(2,1,2), hist(y - y_1, 25), axis([-.6 .6 0 150]);

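5.4 Penalty function approximation. We consider the approximation problem

$$\mbox{minimize} \quad \phi(Ax - b)$$

where A ∈ R^{m×n} and b ∈ R^m, the variable is x ∈ Rⁿ, and φ : R^m → R is a convex penalty function that measures the quality of the approximation Ax ≈ b. We will consider the following choices of penalty function:

(a) Euclidean norm.

$$\phi(y) = \|y\|_2 = \Big( \sum_{k=1}^m y_k^2 \Big)^{1/2}.$$

(b) ℓ₁-norm.

$$\phi(y) = \|y\|_1 = \sum_{k=1}^m |y_k|.$$

(c) Sum of the largest m/2 absolute values.

$$\phi(y) = \sum_{k=1}^{\lfloor m/2 \rfloor} |y|_{[k]},$$

where |y|_[1], |y|_[2], . . . denote the absolute values of the components of y, sorted in decreasing order.

(d) A piecewise-linear penalty.

$$\phi(y) = \sum_{k=1}^m h(y_k), \qquad h(u) = \begin{cases} 0 & |u| \le 0.2 \\ |u| - 0.2 & 0.2 \le |u| \le 0.3 \\ 2|u| - 0.5 & |u| \ge 0.3. \end{cases}$$

(e) Huber penalty.

$$\phi(y) = \sum_{k=1}^m h(y_k), \qquad h(u) = \begin{cases} u^2 & |u| \le M \\ M(2|u| - M) & |u| \ge M, \end{cases}$$

with M = 0.2.

(f) Log-barrier penalty.

$$\phi(y) = \sum_{k=1}^m h(y_k), \qquad h(u) = -\log(1 - u^2), \qquad \mathrm{dom}\, h = \{u \mid |u| < 1\}.$$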

Here is the problem. Generate data A and b as follows:

m = 200;

n = 100;

A = randn(m,n);

b = randn(m,1);

b = b/(1.01*max(abs(b)));

(The normalization of b ensures that the domain of φ(Ax− b) is nonempty if we use the log-barrier penalty.) To compare the results, plot a histogram of the vector of residuals y = Ax− b, for each of the solutions x, using the Matlab command

hist(A*x-b,m/2);

Some additional hints and remarks for the individual problems:

(a) This problem can be solved using least-squares (x=A\b).

(b) Use the CVX function norm(y,1).

(d) Use CVX, with the overloaded max(), abs(), and sum() functions.

(e) Use the CVX function huber().

(f) The current version of CVX handles the logarithm using an iterative procedure, which is slow and not entirely reliable. However, you can reformulate this problem as

$$\mbox{maximize} \quad \Big( \prod_{k=1}^m \big(1 - (Ax - b)_k\big)\big(1 + (Ax - b)_k\big) \Big)^{1/(2m)},$$

and use the CVX function geo_mean().
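The equivalence behind the hint for (f) takes one line to verify. Writing u = Ax − b and G(u) for the geometric mean in the reformulated objective (G is a name introduced here for convenience), we have for |u_k| < 1:

```latex
\phi(u) = -\sum_{k=1}^m \log(1 - u_k^2)
        = -\sum_{k=1}^m \big( \log(1 - u_k) + \log(1 + u_k) \big)
        = -2m \log G(u),
\qquad
G(u) = \Big( \prod_{k=1}^m (1 - u_k)(1 + u_k) \Big)^{1/(2m)}.
```

Since −2m log(·) is strictly decreasing on the positive reals, minimizing φ(Ax − b) over the domain and maximizing G(Ax − b) have the same solutions.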

Solution.

We show that each of the functions φ is convex and, hence, φ(Ax− b) is convex.

• Parts 1,2. Any norm ‖·‖ is a convex function.

• Part 3. See exercise 3.19 (b).

• Parts 4,5,6. Making a plot of each of these functions h clearly shows that they are convex.

For the piecewise-linear penalty, we can also note that

h(u) = max{0, |u| − 0.2, 2|u| − 0.5},

so it is the pointwise maximum of convex functions. In part 6, h is the Euclidean norm of the vector (√ρ, u).

• Part 7. We can express h as the sum of two convex functions:

h(u) =− log(1 + u) − log(1 − u).
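The scalar penalties are easy to tabulate, which is a quick way to do the "make a plot" check numerically. The sketch below defines the piecewise-linear penalty via the max expression above, a Huber penalty with half-width M = 0.2 (assumed here to take the form u² for |u| ≤ M and M(2|u| − M) otherwise, matching the value 0.2 passed to huber() in the listing below), and the log barrier. It then tests midpoint convexity on a grid. The function names and the grid are illustrative.

```python
import math

def h_pwl(u):
    # Pointwise maximum of three convex functions of u.
    return max(0.0, abs(u) - 0.2, 2.0 * abs(u) - 0.5)

def h_huber(u, M=0.2):
    return u * u if abs(u) <= M else M * (2.0 * abs(u) - M)

def h_logbar(u):
    # -log(1 - u^2) on |u| < 1, +inf outside the domain.
    if abs(u) >= 1.0:
        return math.inf
    return -math.log(1.0 - u * u)

def midpoint_convex(h, points, tol=1e-12):
    # Checks h((u+v)/2) <= (h(u)+h(v))/2 for all pairs of grid points.
    return all(
        h(0.5 * (u + v)) <= 0.5 * (h(u) + h(v)) + tol
        for u in points for v in points
    )

grid = [-0.95 + 0.05 * i for i in range(39)]   # from -0.95 to 0.95

for name, h in (("pwl", h_pwl), ("huber", h_huber), ("logbar", h_logbar)):
    print(name, midpoint_convex(h, grid))
```

A passing grid test is consistent with convexity but does not prove it; the arguments in the bullets above do that.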

The Matlab code is as follows.

m = 200; n = 100;
A = randn(m,n);
b = randn(m,1);
b = b/(1.01*max(abs(b)));
% Part 1. L2
x1 = A\b;
% Part 2. L1
cvx_begin
    variable x2(n)
    minimize(norm(A*x2-b,1))
cvx_end
% Part 3. Sum of largest abs. values
cvx_begin
    variable x3(n)
    minimize(norm_largest(A*x3-b,floor(m/2)))
cvx_end
% Part 4. PWL
cvx_begin
    variable x4(n)
    minimize(sum(max([zeros(m,1), abs(A*x4-b)-0.2, 2*abs(A*x4-b)-0.5]')))
cvx_end
% Part 5. huber
disp('huber')
cvx_begin
    variable x5(n)
    minimize(sum(huber(A*x5 - b, .2)))
cvx_end
% Part 6. Smoothed l1 (the helper sl1 is not defined in this listing)
disp('sqrt')
cvx_begin
    variable x6(n)
    minimize(sl1(A*x6-b, 1e-8))
cvx_end
% Part 7. Log barrier, via the geo_mean reformulation
cvx_begin
    variable x7(n)
    maximize(geo_mean([1-(A*x7-b); 1+(A*x7-b)]))
cvx_end

The residual distributions of an example problem are shown in the figure.

[Figure: histograms of the residuals, one panel per penalty function.]

5.5 ℓ_1.5 optimization. Optimization and approximation methods that use both an ℓ₂-norm (or its square) and an ℓ₁-norm are currently very popular in statistics, machine learning, and signal and image processing. Examples include Huber estimation, LASSO, basis pursuit, SVM, various ℓ₁-regularized classification methods, total variation de-noising, etc. Very roughly, an ℓ₂-norm corresponds to Euclidean distance (squared), or the negative log-likelihood function for a Gaussian; in contrast the ℓ₁-norm gives ‘robust’ approximation, i.e., reduced sensitivity to outliers, and also tends to yield sparse solutions (of whatever the argument of the norm is). (All of this is just background; you don’t need to know any of this to solve the problem.)

In this problem we study a natural method for blending the two norms, by using the ℓ_1.5-norm, defined as

$$\|z\|_{1.5} = \Big( \sum_{i=1}^k |z_i|^{3/2} \Big)^{2/3}$$

for z ∈ R^k. We will consider the simplest approximation or regression problem:

$$\mbox{minimize} \quad \|Ax - b\|_{1.5},$$

with variable x ∈ Rⁿ, and problem data A ∈ R^{m×n} and b ∈ R^m. We will assume that m > n and that A is full rank (i.e., rank n). The hope is that this ℓ_1.5-optimal approximation problem should share some of the good features of ℓ₂ and ℓ₁ approximation.

(a) Give optimality conditions for this problem. Try to make these as simple as possible.

(b) Explain how to formulate the ℓ_1.5-norm approximation problem as an SDP. (Your SDP can include linear equality and inequality constraints.)

(c) Solve the problem instance with data generated by

randn('state',0);
A=randn(100,30);
b=randn(100,1);

Numerically verify the optimality conditions. Give a histogram of the residuals, and repeat for the ℓ₂-norm and ℓ₁-norm approximations. You can use any method you like to solve the problem (but of course you must explain how you did it); in particular, you do not need to use the SDP formulation found in part (b).

Solution.

(a) We can just as well minimize the objective to the 3/2 power, i.e., solve the problem

$$\mbox{minimize} \quad f(x) = \sum_{i=1}^m |a_i^T x - b_i|^{3/2}.$$

This objective is differentiable, in fact, so the optimality condition is simply that the gradient should vanish. (But it is not twice differentiable.) The gradient is

$$\nabla f(x) = \sum_{i=1}^m (3/2)\, \mathrm{sign}(a_i^T x - b_i)\, |a_i^T x - b_i|^{1/2} a_i,$$

so the optimality condition is just

$$\sum_{i=1}^m (3/2)\, \mathrm{sign}(r_i)\, |r_i|^{1/2} a_i = 0,$$

where r_i = a_i^T x − b_i is the ith residual. We can, of course, drop the factor 3/2.
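The optimality condition is easy to see in action when x is a scalar, because the gradient is then a nondecreasing function of x and its root can be found by bisection. The following is a minimal sketch for that special case only (n = 1, all a_i nonzero); the function names and the tiny data set are illustrative, and this is not the method used for the 100 × 30 instance.

```python
import math

def grad(x, a, b):
    # Derivative of sum_i |a_i*x - b_i|^(3/2) with respect to scalar x.
    g = 0.0
    for ai, bi in zip(a, b):
        r = ai * x - bi
        if r != 0.0:
            g += 1.5 * math.copysign(math.sqrt(abs(r)), r) * ai
    return g

def solve_scalar(a, b, tol=1e-12):
    # The minimizer lies between the smallest and largest b_i/a_i, where the
    # gradient is nonpositive and nonnegative respectively (assumes a_i != 0).
    ratios = [bi / ai for ai, bi in zip(a, b)]
    lo, hi = min(ratios), max(ratios)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid, a, b) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

a = [1.0, 1.0, 1.0]
b = [0.0, 0.0, 1.0]
x_star = solve_scalar(a, b)
print(x_star, grad(x_star, a, b))
```

For this data the condition reads 2√x = √(1 − x), so the minimizer is x = 0.2, between the ℓ₁ solution (the median, 0) and the ℓ₂ solution (the mean, 1/3).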

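(b) We can write an equivalent problem

$$\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T t \\
\mbox{subject to} & s_i^{3/2} \le t_i, \quad i = 1, \ldots, m, \\
& -s_i \le a_i^T x - b_i \le s_i, \quad i = 1, \ldots, m,
\end{array}$$

with new variables t, s ∈ R^m.

We need a way to express $s_i^{3/2} \le t_i$ using LMIs. We first write it as $s_i^2 \le t_i \sqrt{s_i}$. Recall that the general 2 × 2 LMI

$$\begin{bmatrix} u & v \\ v & w \end{bmatrix} \succeq 0$$

is equivalent to u ≥ 0, w ≥ 0, uw ≥ v². So we can write $s_i^2 \le t_i \sqrt{s_i}$ as

$$\begin{bmatrix} \sqrt{s_i} & s_i \\ s_i & t_i \end{bmatrix} \succeq 0.$$

Now this is not yet an LMI, because the 1, 1 entry is not affine in the variables. To deal with this, we introduce a new variable y ∈ R^m with $y_i^2 \le s_i$ (so that $y_i \le \sqrt{s_i}$), which can be expressed as the LMI

$$\begin{bmatrix} s_i & y_i \\ y_i & 1 \end{bmatrix} \succeq 0,$$

and we replace the 1, 1 entry $\sqrt{s_i}$ with y_i. (Here we use the fact that if we increase the 1, 1 entry of a matrix, it gets more positive semidefinite. That’s informal, but you know what we mean.)

Now we can assemble an SDP to solve our ℓ_1.5-norm approximation problem:

$$\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T t \\
\mbox{subject to} & -s \preceq Ax - b \preceq s, \\
& \begin{bmatrix} y_i & s_i \\ s_i & t_i \end{bmatrix} \succeq 0, \quad \begin{bmatrix} s_i & y_i \\ y_i & 1 \end{bmatrix} \succeq 0, \quad i = 1, \ldots, m,
\end{array}$$

with variables x ∈ Rⁿ and s, t, y ∈ R^m.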
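The pair of 2 × 2 LMIs used in the CVX SDP listing for part (c), namely [y_i, s_i; s_i, t_i] ⪰ 0 and [s_i, y_i; y_i, 1] ⪰ 0, can be sanity-checked on scalars. The sketch below tests positive semidefiniteness of a symmetric 2 × 2 matrix through its diagonal entries and determinant; `psd_2x2` and `lmi_pair_feasible` are hypothetical helper names.

```python
import math

def psd_2x2(u, v, w, tol=1e-12):
    """[[u, v], [v, w]] is PSD iff u >= 0, w >= 0 and u*w >= v^2."""
    return u >= -tol and w >= -tol and u * w - v * v >= -tol

def lmi_pair_feasible(s, y, t):
    # First LMI encodes s^2 <= y*t, second encodes y^2 <= s.
    return psd_2x2(y, s, t) and psd_2x2(s, y, 1.0)

s = 4.0
tight = lmi_pair_feasible(s, math.sqrt(s), s ** 1.5)          # y = sqrt(s), t = s^(3/2)
t_too_small = lmi_pair_feasible(s, math.sqrt(s), s ** 1.5 - 0.1)
smaller_y = lmi_pair_feasible(s, 1.5, s ** 1.5)               # y < sqrt(s) needs larger t
y_too_big = lmi_pair_feasible(s, 2.5, 100.0)                  # violates y^2 <= s

print(tight, t_too_small, smaller_y, y_too_big)
```

The pattern matches the argument in the text: the smallest feasible t is reached at y = √s and equals s^{3/2}, and any other admissible y forces t to be larger.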

Here is another solution several of you used, which we like. The final SDP is minimize z

Evidently we minimize z, and therefore the righthand side above. For s_i fixed, the choice y_i = √s_i minimizes the objective, so we are minimizing

(c) We’re going to use CVX to solve the problem. The function norm(r,1.5) isn’t implemented yet, so we’ll have to do it ourselves. One simple way is to note that, for s > 0, s^{3/2} = s²/√s, which is the composition of the quadratic over linear function x₁²/x₂ with x₁ = s, x₂ = √s. Fortunately, the result is convex, since the quadratic over linear function is convex and decreasing in its second argument, so it can accept a concave positive function there. In other words, CVX will accept quad_over_lin(s,sqrt(s)), and recognize it as convex. So we have a snappy, short way to express s^{3/2} for s > 0. Now we have to form the composition of this with the convex function |r_i| = |a_i^T x − b_i|. Here is one way to do this.

cvx_begin

variables x(n) s(m);

s >= abs(A*x-b);

minimize(sum(quad_over_lin(s,sqrt(s),0)));

cvx_end

The following code solves the problem for the different norms and plots histograms of the residuals.

n=30;
m=100;
randn('state',0);
A=randn(m,n);
b=randn(m,1);
%l1.5 solution
cvx_begin
    variables x(n) s(m);
    s >= abs(A*x-b);
    minimize(sum(quad_over_lin(s,sqrt(s),0)));
cvx_end
%l2 solution
xl2=A\b;
%l1 solution
cvx_begin
    variables xl1(n);
    minimize(norm(A*xl1-b,1));
cvx_end
r=A*x-b; %residuals
rl2=A*xl2-b;
rl1=A*xl1-b;
%check optimality condition
A'*(3/2*sign(r).*sqrt(abs(r)))
subplot(3,1,1)
hist(r)
axis([-2.5 2.5 0 50])
xlabel('r')
subplot(3,1,2)
hist(rl2)
axis([-2.5 2.5 0 50])
subplot(3,1,3)
hist(rl1)
axis([-2.5 2.5 0 50])
xlabel('r1')
%solution using SDP
cvx_begin sdp
    variables xdf(n) r(m) y(m) t(m);
    A*xdf-b<=r;
    -r<=A*xdf-b;
    minimize(sum(t));
    for i=1:m
        [y(i), r(i); r(i), t(i)]>=0;
        [r(i), y(i); y(i), 1]>=0;
    end
cvx_end

Figure 7: Histogram of the residuals for ℓ_1.5-norm, ℓ₂-norm, and ℓ₁-norm

Figure 7 shows the histograms of the residuals for the three norms.

5.6 Total variation image interpolation. A grayscale image is represented as an m × n matrix of intensities U^orig. You are given the values U^orig_ij, for (i, j) ∈ K, where K ⊂ {1, . . . , m} × {1, . . . , n}. Your job is to interpolate the image, by guessing the missing values. The reconstructed image will be represented by U ∈ R^{m×n}, where U satisfies the interpolation conditions U_ij = U^orig_ij for (i, j) ∈ K.

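The reconstruction is found by minimizing a roughness measure subject to the interpolation conditions. One common roughness measure is the ℓ₂ variation (squared),

$$\sum_{i=2}^m \sum_{j=2}^n \left( (U_{ij} - U_{i-1,j})^2 + (U_{ij} - U_{i,j-1})^2 \right).$$

Another method minimizes instead the total variation,

$$\sum_{i=2}^m \sum_{j=2}^n \left( |U_{ij} - U_{i-1,j}| + |U_{ij} - U_{i,j-1}| \right).$$

Evidently both methods lead to convex optimization problems.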
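To make the two roughness measures concrete, here is a small sketch that evaluates them for an image stored as a list of rows. It sums over all horizontally and vertically adjacent pixel pairs, which is the convention the CVX code in the solution uses when it forms Ux and Uy; the treatment of the first row and column is therefore an assumption of this sketch, and the function names are illustrative.

```python
def differences(U):
    # All horizontal and vertical differences between adjacent pixels.
    m, n = len(U), len(U[0])
    dx = [U[i][j + 1] - U[i][j] for i in range(m) for j in range(n - 1)]
    dy = [U[i + 1][j] - U[i][j] for i in range(m - 1) for j in range(n)]
    return dx + dy

def l2_variation_sq(U):
    return sum(d * d for d in differences(U))

def total_variation(U):
    return sum(abs(d) for d in differences(U))

U = [[0.0, 1.0],
     [2.0, 4.0]]
print(l2_variation_sq(U), total_variation(U))

# A sharp edge and a gradual ramp between the same end values:
step = [[0.0, 0.0, 1.0, 1.0]]
ramp = [[0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0]]
print(total_variation(step), total_variation(ramp))   # equal
print(l2_variation_sq(step), l2_variation_sq(ramp))   # ramp is smaller
```

The last two lines illustrate the qualitative difference: total variation charges a sharp edge no more than a gradual ramp, while the ℓ₂ variation prefers the ramp and so tends to blur edges.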

Carry out ℓ₂ and total variation interpolation on the problem instance with data given in tv_img_interp.m. This will define m, n, and matrices Uorig and Known. The matrix Known is m × n, with (i, j) entry one if (i, j) ∈ K, and zero otherwise. The mfile also has skeleton plotting code. (We give you the entire original image so you can compare your reconstruction to the original; obviously your solution cannot access U^orig_ij for (i, j) ∉ K.)

Solution. The code for the interpolation is very simple. For ℓ₂ interpolation, the code is the following.

cvx_begin
    variable Ul2(m, n);
    Ul2(Known) == Uorig(Known); % Fix known pixel values.
    Ux = Ul2(1:end,2:end) - Ul2(1:end,1:end-1); % x (horiz) differences
    Uy = Ul2(2:end,1:end) - Ul2(1:end-1,1:end); % y (vert) differences
    minimize(norm([Ux(:); Uy(:)], 2)); % l2 roughness measure
cvx_end

For total variation interpolation, we use the following code.

cvx_begin
    variable Utv(m, n);
    Utv(Known) == Uorig(Known); % Fix known pixel values.
    Ux = Utv(1:end,2:end) - Utv(1:end,1:end-1); % x (horiz) differences
    Uy = Utv(2:end,1:end) - Utv(1:end-1,1:end); % y (vert) differences
    minimize(norm([Ux(:); Uy(:)], 1)); % tv roughness measure
cvx_end

We get the following images

[Figure: image panels; the panel titles that survive are ‘orig’ and ‘obscure’.]

5.7 Piecewise-linear fitting. In many applications some function in the model is not given by a formula, but instead as tabulated data. The tabulated data could come from empirical measurements, historical data, numerically evaluating some complex expression or solving some problem, for a set of values of the argument. For use in a convex optimization model, we then have to fit these data with a convex function that is compatible with the solver or other system that we use. In this problem we explore a very simple problem of this general type.

Suppose we are given the data (x_i, y_i), i = 1, . . . , m, with x_i, y_i ∈ R. We will assume that the x_i are sorted, i.e., x₁ < x₂ < · · · < x_m. Let a₀ < a₁ < a₂ < · · · < a_K be a set of fixed knot points, with a₀ ≤ x₁ and a_K ≥ x_m. Explain how to find the convex piecewise linear function f, defined over [a₀, a_K], with knot points a_i, that minimizes the least-squares fitting criterion

$$\sum_{i=1}^m (f(x_i) - y_i)^2.$$

You must explain what the variables are and how they parametrize f , and how you ensure convexity of f .

Hints. One method to solve this problem is based on the Lagrange basis, f₀, . . . , f_K, which are the piecewise linear functions that satisfy

$$f_j(a_i) = \delta_{ij}, \quad i, j = 0, \ldots, K.$$

Another method is based on defining f(x) = α_i x + β_i, for x ∈ (a_{i−1}, a_i]. You then have to add conditions on the parameters α_i and β_i to ensure that f is continuous and convex.

Apply your method to the data in the file pwl_fit_data.m, which contains data with x_j ∈ [0, 1].

Find the best affine fit (which corresponds to a = (0, 1)), and the best piecewise-linear convex function fit for 1, 2, and 3 internal knot points, evenly spaced in [0, 1]. (For example, for 3 internal knot points we have a₀ = 0, a₁ = 0.25, a₂ = 0.50, a₃ = 0.75, a₄ = 1.) Give the least-squares fitting cost for each one. Plot the data and the piecewise-linear fits found. Express each function in the form

$$f(x) = \max_{i=1,\ldots,K} (\alpha_i x + \beta_i).$$

(In this form the function is easily incorporated into an optimization problem.)

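Solution. Following the hint, we will use the Lagrange basis functions f₀, . . . , f_K. On [a₀, a_K] these can be expressed as

$$f_0(x) = \max\Big\{0, \frac{a_1 - x}{a_1 - a_0}\Big\}, \qquad f_K(x) = \max\Big\{0, \frac{x - a_{K-1}}{a_K - a_{K-1}}\Big\},$$

$$f_i(x) = \max\Big\{0, \min\Big\{ \frac{x - a_{i-1}}{a_i - a_{i-1}}, \frac{a_{i+1} - x}{a_{i+1} - a_i} \Big\}\Big\}, \quad i = 1, \ldots, K-1.$$

The function f can be parametrized as

$$f(x) = \sum_{i=0}^{K} z_i f_i(x),$$

where z_i = f(a_i), i = 0, . . . , K. The least-squares fitting criterion is then

$$J = \sum_{i=1}^m (f(x_i) - y_i)^2 = \|Fz - y\|_2^2,$$

where F ∈ R^{m×(K+1)} is the matrix with entries F_ij = f_j(x_i), i = 1, . . . , m, j = 0, . . . , K.

We must add the constraint that f is convex. This is the same as the condition that the slopes of the segments are nondecreasing, i.e.,

$$\frac{z_{i+1} - z_i}{a_{i+1} - a_i} \ge \frac{z_i - z_{i-1}}{a_i - a_{i-1}}, \quad i = 1, \ldots, K-1.$$

This is a set of linear inequalities in z. Thus, the best PWL convex fit can be found by solving the QP

$$\begin{array}{ll}
\mbox{minimize} & \|Fz - y\|_2^2 \\
\mbox{subject to} & \dfrac{z_{i+1} - z_i}{a_{i+1} - a_i} \ge \dfrac{z_i - z_{i-1}}{a_i - a_{i-1}}, \quad i = 1, \ldots, K-1.
\end{array}$$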
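The relationship between the knot values z, the Lagrange (hat) basis, and the max-affine form can be sketched without any solver. The code below evaluates the hat basis on [a₀, a_K], checks the slope condition, and converts knot values into slopes α_i and intercepts β_i. All names and the small example are illustrative, and the fitting step itself (the QP) is not included.

```python
def hat_basis(x, a):
    """Values f_0(x), ..., f_K(x) of the hat functions for x in [a[0], a[-1]]."""
    K = len(a) - 1
    vals = []
    for j in range(K + 1):
        left = (x - a[j - 1]) / (a[j] - a[j - 1]) if j > 0 else 1.0
        right = (a[j + 1] - x) / (a[j + 1] - a[j]) if j < K else 1.0
        vals.append(max(0.0, min(left, right)))
    return vals

def slopes(z, a):
    return [(z[i] - z[i - 1]) / (a[i] - a[i - 1]) for i in range(1, len(a))]

def is_convex(z, a, tol=1e-12):
    s = slopes(z, a)
    return all(s[i] <= s[i + 1] + tol for i in range(len(s) - 1))

def max_affine_form(z, a):
    # Segment i has slope alpha_i and passes through (a_i, z_i).
    alpha = slopes(z, a)
    beta = [z[i] - alpha[i - 1] * a[i] for i in range(1, len(a))]
    return alpha, beta

a = [0.0, 0.5, 1.0]
z = [1.0, 0.0, 1.0]          # knot values of a convex function
alpha, beta = max_affine_form(z, a)

def f_basis(x):
    return sum(zi * fi for zi, fi in zip(z, hat_basis(x, a)))

def f_max(x):
    return max(al * x + be for al, be in zip(alpha, beta))

for x in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
    print(x, f_basis(x), f_max(x))
```

When the slope condition holds, the two evaluations agree at every x in the interval, which is why the fitted function can be reported in the max form requested by the exercise.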

The following code solves this problem for the data in pwl_fit_data.

figure
plot(x,y,'k:','linewidth',2)
hold on
% Single line
p = [x ones(100,1)]\y;
alpha = p(1)
beta = p(2)
plot(x,alpha*x+beta,'b','linewidth',2)
mse = norm(alpha*x+beta-y)^2
for K = 2:4
    % Generate Lagrange basis
    a = (0:(1/K):1)';
    F = max((a(2)-x)/(a(2)-a(1)),0);
    for k = 2:K
        a_1 = a(k-1);
        a_2 = a(k);
        a_3 = a(k+1);
        f = max(0,min((x-a_1)/(a_2-a_1),(a_3-x)/(a_3-a_2)));
        F = [F f];
    end
    f = max(0,(x-a(K))/(a(K+1)-a(K)));
    F = [F f];
    % Solve problem
    cvx_begin
        variable z(K+1)
        minimize(norm(F*z-y))
        subject to
            (z(3:end)-z(2:end-1))./(a(3:end)-a(2:end-1)) >=...
            (z(2:end-1)-z(1:end-2))./(a(2:end-1)-a(1:end-2))
    cvx_end
    % Calculate alpha and beta
    alpha = (z(2:end)-z(1:end-1))./(a(2:end)-a(1:end-1))
    beta = z(2:end)-alpha(1:end).*a(2:end)
    % Plot solution
    y2 = F*z;
    mse = norm(y2-y)^2
    if K==2
        plot(x,y2,'r','linewidth',2)
    elseif K==3
        plot(x,y2,'g','linewidth',2)
    else
        plot(x,y2,'m','linewidth',2)
    end
end
xlabel('x')
ylabel('y')

Figure 8: Piecewise-linear approximations for K = 1, 2, 3, 4

This generates figure 8. We can see that the approximation improves as K increases. The following table shows the result of this approximation.

K    α₁, . . . , α_K             β₁, . . . , β_K               J
1    1.91                        −0.87                         12.73
2    −0.27, 4.09                 −0.33, −2.51                  2.62
3    −1.80, 2.67, 4.25           −0.10, −1.59, −2.65           0.60
4    −3.15, 2.11, 2.68, 4.90     0.03, −1.29, −1.57, −3.23     0.22
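The tabulated slopes and intercepts can be cross-checked for internal consistency: within each row the slopes should be nondecreasing (convexity), and adjacent affine pieces should meet at the evenly spaced knots i/K up to the two-decimal rounding of the table. This sketch uses only the numbers in the table; the dictionary layout and helper names are illustrative.

```python
# (alpha, beta) for K = 1, ..., 4, copied from the table.
fits = {
    1: ([1.91], [-0.87]),
    2: ([-0.27, 4.09], [-0.33, -2.51]),
    3: ([-1.80, 2.67, 4.25], [-0.10, -1.59, -2.65]),
    4: ([-3.15, 2.11, 2.68, 4.90], [0.03, -1.29, -1.57, -3.23]),
}

def f_max(x, alpha, beta):
    # The fitted function in the max-affine form requested by the exercise.
    return max(al * x + be for al, be in zip(alpha, beta))

def knot_gaps(alpha, beta):
    # Mismatch between neighbouring affine pieces at the internal knots i/K.
    K = len(alpha)
    gaps = []
    for i in range(1, K):
        knot = i / K
        left = alpha[i - 1] * knot + beta[i - 1]
        right = alpha[i] * knot + beta[i]
        gaps.append(abs(left - right))
    return gaps

slopes_nondecreasing = all(
    al[i] <= al[i + 1] for al, _ in fits.values() for i in range(len(al) - 1)
)
max_gap = max(g for al, be in fits.values() for g in knot_gaps(al, be))

print("slopes nondecreasing:", slopes_nondecreasing)
print("largest mismatch at a knot:", max_gap)
print("K=2 fit at x=0:", f_max(0.0, *fits[2]))
```

A small mismatch at the knots is expected from rounding alone, so the check only confirms that the table is consistent with a continuous convex fit, not that it is optimal.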

There is another way to solve this problem. We are looking for a piecewise linear function. If we have at least one internal knot (K ≥ 2), the function should satisfy the two following constraints:
