What is an optimum? - Global Optimization Algorithms


Number of Criteria

Optimization algorithms can be divided into those that try to find the best values of a single objective function f and those that optimize sets F of target functions. This distinction between single-objective and multi-objective optimization is discussed in depth in Section 1.2.2.

1.2 What is an optimum?

We have already said that global optimization is about ﬁnding the best possible solutions for given problems. Thus, it cannot be a bad idea to start out by discussing what it is that makes a solution optimal¹³.

1.2.1 Single Objective Functions

In the case of optimizing a single criterion f, an optimum is either its maximum or its minimum, depending on what we are looking for. If we own a manufacturing plant and have to assign incoming orders to machines, we will do this in a way that minimizes the time needed to complete them. On the other hand, we will arrange the purchase of raw material, the employment of staff, and the placing of commercials in a way that maximizes our profit. In global optimization, it is a convention that optimization problems are most often defined as minimizations; if a criterion f is subject to maximization, we simply minimize its negation (−f).
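The negation convention can be sketched in a few lines of Python. The helper `minimize_on_grid`, the sample `profit` curve, and the candidate grid are assumptions made purely for illustration; they do not come from the text.

```python
def minimize_on_grid(f, candidates):
    """Return the candidate with the smallest value of f (brute-force search)."""
    return min(candidates, key=f)


def profit(x):
    # Hypothetical profit curve with its peak at x = 3.
    return -(x - 3.0) ** 2 + 10.0


# Candidate inputs 0.0, 0.5, ..., 6.0 (an arbitrary illustrative grid).
grid = [i * 0.5 for i in range(13)]

# Maximizing profit is the same as minimizing its negation.
best = minimize_on_grid(lambda x: -profit(x), grid)
```

A minimizer is all that is needed here: the maximization problem is handled by wrapping the criterion, not by writing a second search routine.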

Figure 1.2 illustrates such a function f deﬁned over a two-dimensional space X = (X1, X2). As outlined in this graphic, we distinguish between local and global optima. A global optimum is an optimum of the whole domain X while a local optimum is an optimum of only a subset of X.

Definition 1.6 (Local Maximum). A (local) maximum $\hat{x}_l \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is an input element with $f(\hat{x}_l) \geq f(x)$ for all $x$ neighboring $\hat{x}_l$.

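If $X \subseteq \mathbb{R}^n$, we can write:

$$\forall \hat{x}_l \; \exists \varepsilon > 0 : f(\hat{x}_l) \geq f(x) \quad \forall x \in X, \; |x - \hat{x}_l| < \varepsilon \tag{1.1}$$

Definition 1.7 (Local Minimum). A (local) minimum $\check{x}_l \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is an input element with $f(\check{x}_l) \leq f(x)$ for all $x$ neighboring $\check{x}_l$.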

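If $X \subseteq \mathbb{R}^n$, we can write:

$$\forall \check{x}_l \; \exists \varepsilon > 0 : f(\check{x}_l) \leq f(x) \quad \forall x \in X, \; |x - \check{x}_l| < \varepsilon \tag{1.2}$$

Definition 1.8 (Local Optimum). A (local) optimum $x^\star_l \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is either a local maximum or a local minimum.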

Definition 1.9 (Global Maximum). A global maximum $\hat{x} \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is an input element with $f(\hat{x}) \geq f(x) \; \forall x \in X$.

Definition 1.10 (Global Minimum). A global minimum $\check{x} \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is an input element with $f(\check{x}) \leq f(x) \; \forall x \in X$.

13 http://en.wikipedia.org/wiki/Maxima_and_minima [accessed 2007-07-03]

Fig. 1.2: Global and local optima of a two-dimensional function. (Labels in the plot: local maximum, local minimum, global minimum, local maximum, global maximum.)

Definition 1.11 (Global Optimum). A global optimum $x^\star \in X$ of one (objective) function $f : X \mapsto \mathbb{R}$ is either a global maximum or a global minimum.

Even a one-dimensional function $f : X = \mathbb{R} \mapsto \mathbb{R}$ may have more than one global maximum, multiple global minima, or even both in its domain X. Take the cosine function for example: It has global maxima $\hat{x}_i$ at $\hat{x}_i = 2i\pi$ and global minima $\check{x}_i$ at $\check{x}_i = (2i + 1)\pi$ for all $i \in \mathbb{Z}$. The correct solution of such an optimization problem would then be a set $X^\star$ of all optimal inputs in X rather than a single maximum or minimum. Furthermore, the exact meaning of optimal is problem dependent. In single-objective optimization, it means either minimum or maximum. In multi-objective optimization, there exists a variety of approaches to define optima, which we will discuss in depth in Section 1.2.2.
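The cosine example can be checked numerically. The following is a minimal sketch; the function name `cosine_optima` and the choice of the index range −2..2 are assumptions made for illustration.

```python
import math


def cosine_optima(i_range):
    """Global maxima 2*i*pi and global minima (2*i + 1)*pi of cos for the given i."""
    maxima = [2 * i * math.pi for i in i_range]
    minima = [(2 * i + 1) * math.pi for i in i_range]
    return maxima, minima


maxima, minima = cosine_optima(range(-2, 3))

# Every listed maximum attains the value 1, every listed minimum the value -1
# (up to floating-point rounding).
max_values = [math.cos(x) for x in maxima]
min_values = [math.cos(x) for x in minima]
```

Because the index i ranges over all integers, any finite list such as the one above is only a subset of the full optimal set.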

Definition 1.12 (Optimal Set). The optimal set X^⋆ is the set that contains all optimal elements.

There are normally multiple, often even infinitely many, optimal solutions. Since the memory of our computers is limited, we can find only a finite (sub-)set of them. We thus distinguish between the global optimal set $X^\star$ and the finite set of (seemingly optimal) elements which an optimizer returns. The tasks of global optimization algorithms are

1. to find solutions that are as good as possible and

2. that are also widely different from each other [24].

The second goal becomes obvious if we assume that we have an objective function $f : \mathbb{R} \mapsto \mathbb{R}$ which is optimal for all $x \in [0, 10] \Leftrightarrow x \in X^\star$. This interval contains uncountably many solutions, and an optimization algorithm may yield $X_1^\star = \{0, 0.1, 0.11, 0.05, 0.01\}$ or $X_2^\star = \{0, 2.5, 5, 7.5, 10\}$ as result. Both sets represent only a small subset of the possible solutions. The second result ($X_2^\star$), however, gives us a broader view of the optimal set.

Even good optimization algorithms do not necessarily ﬁnd the real global optima but may only be able to approximate them. In other words, X₃^⋆={−0.3, 5, 7.5, 11} is also a possible result of the optimization process, although containing two sub-optimal elements.
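The difference between the three result sets can be made concrete with a short Python sketch. The helper names `in_optimal_interval` and `spread`, and the use of the covered range as a crude diversity measure, are assumptions chosen for illustration only.

```python
def in_optimal_interval(x, low=0.0, high=10.0):
    """True if x lies in the optimal interval [0, 10] of the example."""
    return low <= x <= high


def spread(result):
    """Width of the range covered by a result set (a crude diversity measure)."""
    return max(result) - min(result)


x1 = [0, 0.1, 0.11, 0.05, 0.01]
x2 = [0, 2.5, 5, 7.5, 10]
x3 = [-0.3, 5, 7.5, 11]

# x1 covers only a tiny part of [0, 10], x2 covers all of it.
spread_x1 = spread(x1)
spread_x2 = spread(x2)

# Elements of x3 that fall outside the optimal interval.
sub_optimal = [x for x in x3 if not in_optimal_interval(x)]
```

Here `sub_optimal` contains the two elements −0.3 and 11, matching the two sub-optimal elements mentioned in the text.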

In Chapter 19 on page 291, we will introduce different algorithms and approaches that can be used to maintain an optimal set or to select the optimal elements from a given set during an optimization process.

1.2.2 Multiple Objective Functions

Global optimization techniques are not just used for finding the maxima or minima of single functions f. In many real-world design or decision making problems, they are rather applied to sets F consisting of $n = |F|$ objective functions $f_i$, each representing one criterion to be optimized [25, 26, 27].

$$F = \{ f_i : X \mapsto Y_i : 0 < i \leq n, \; Y_i \subseteq \mathbb{R} \} \tag{1.3}$$

Algorithms designed to optimize such sets of objective functions are usually named with the prefix multi-objective, like multi-objective evolutionary algorithms, which are discussed in Definition 2.2 on page 76.

Examples

Factory Example

Multi-objective optimization often means finding a compromise between conflicting goals. If we go back to our factory example, we can specify the following objectives that are all subject to optimization:

• Minimize the time between an incoming order and the shipment of the corresponding product.

• Maximize proﬁt.

• Minimize costs for advertising, personnel, raw materials, etc.

• Maximize product quality.

• Minimize negative impact on environment.

The last two objectives seem to clearly contradict the cost minimization. Between the personnel costs, the time needed for production, and the product quality there should also be some kind of (contradictory) relation. The exact mutual influences between objectives can become complicated and are not always obvious.

Artificial Ant Example

Another example for such a situation is the Artificial Ant problem¹⁴ where the goal is to find the most efficient controller for a simulated ant. The efficiency of an ant should not only be measured by the amount of food it is able to pile. For every food item, the ant needs to walk to some point. The more food it piles, the longer the distance it needs to walk. If its behavior is driven by a clever program, it may walk along a shorter route which would not be discovered by an ant with a clumsy controller. Thus, the distance it has to cover to find the food or the time it needs to do so may also be considered in the optimization process. If two control programs produce the same results and one is smaller (i. e., contains fewer instructions) than the other, the smaller one should be preferred. Like in the factory example, the optimization goals conflict with each other.

From both of these examples, we can gain another insight: finding the global optimum could mean maximizing one function $f_i \in F$ and minimizing another one $f_j \in F$ ($i \neq j$).

Hence, it makes no sense to talk about a global maximum or a global minimum in terms of multi-objective optimization. We will thus retreat to the notion of the set of optimal elements $x^\star \in X^\star \subseteq X$.

Since compromises for conflicting criteria can be defined in many ways, there exist multiple approaches to define what an optimum is. These different definitions, in turn, lead to different sets $X^\star$.

Fig. 1.3: Two functions $f_1$ and $f_2$ with different maxima $\hat{x}_1$ and $\hat{x}_2$. (Curves: $y = f_1(x)$ and $y = f_2(x)$ over $x \in X_1$.)

Graphical Example 1

We will discuss some of these approaches in the following by using two graphical examples for illustration purposes. In the first example, pictured in Figure 1.3, we want to maximize two independent objective functions $F_1 = \{f_1, f_2\}$. Both objective functions have the real numbers $\mathbb{R}$ as problem space $X_1$. The maximum (and thus, the optimum) of $f_1$ is $\hat{x}_1$ and the largest value of $f_2$ is at $\hat{x}_2$. In Figure 1.3, we can easily see that $f_1$ and $f_2$ are partly conflicting: Their maxima are at different locations and there even exist areas where $f_1$ rises while $f_2$ falls and vice versa.

Graphical Example 2

Fig. 1.4: Two functions $f_3$ and $f_4$ with different minima $\check{x}_1$, $\check{x}_2$, $\check{x}_3$, and $\check{x}_4$. (Both panels are plotted over the two axes $x_1$ and $x_2$ of $X_2$.)

The objective functions $f_1$ and $f_2$ in the first example are mappings of a one-dimensional problem space $X_1$ to the real numbers that are to be maximized. In the second example, sketched in Figure 1.4, we instead minimize two functions $f_3$ and $f_4$ that map a two-dimensional problem space $X_2 \subset \mathbb{R}^2$ to the real numbers $\mathbb{R}$. Both functions have two global minima; the lowest values of $f_3$ are at $\check{x}_1$ and $\check{x}_2$ whereas $f_4$ gets minimal at $\check{x}_3$ and $\check{x}_4$. It should be noted that $\check{x}_1 \neq \check{x}_2 \neq \check{x}_3 \neq \check{x}_4$.

14 See Section 21.3.1 on page 338 for more details.

Weighted Sums (Linear Aggregation)

The simplest method to define what is optimal is computing a weighted sum $g(x)$ of all the functions $f_i(x) \in F$.¹⁵ Each objective $f_i$ is multiplied with a weight $w_i$ representing its importance. Using signed weights also allows us to minimize one objective and to maximize another. We can, for instance, apply a weight $w_a = 1$ to an objective function $f_a$ and the weight $w_b = -1$ to the criterion $f_b$. By minimizing $g(x)$, we then actually minimize the first and maximize the second objective function. If we instead maximize $g(x)$, the effect is reversed: $f_b$ is minimized and $f_a$ is maximized. Either way, multi-objective problems are reduced to single-objective ones by this method.

$$g(x) = \sum_{i=1}^{n} w_i f_i(x) = \sum_{\forall f_i \in F} w_i f_i(x) \tag{1.4}$$

$$x^\star \in X^\star \Leftrightarrow g(x^\star) \geq g(x) \quad \forall x \in X \tag{1.5}$$
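A weighted sum with signed weights might be sketched as follows. The two objective functions `f_a` and `f_b`, the candidate grid, and the helper `weighted_sum` are hypothetical and serve only to illustrate the case with weights 1 and −1 where the sum is minimized.

```python
def weighted_sum(objectives, weights):
    """Build g(x) = sum_i w_i * f_i(x) from objective functions and weights."""
    def g(x):
        return sum(w * f(x) for f, w in zip(objectives, weights))
    return g


def f_a(x):
    # Hypothetical objective to be minimized (smallest at x = 1).
    return (x - 1.0) ** 2


def f_b(x):
    # Hypothetical objective to be maximized (largest at x = 2).
    return -(x - 2.0) ** 2


# w_a = 1 and w_b = -1: minimizing g minimizes f_a and maximizes f_b.
g = weighted_sum([f_a, f_b], [1.0, -1.0])

candidates = [i / 10 for i in range(31)]  # 0.0, 0.1, ..., 3.0
best = min(candidates, key=g)
```

With these assumed functions, the minimizer of g lies at x = 1.5, halfway between the individual optima of `f_a` and `f_b`, which shows how the aggregation collapses both criteria into one compromise point.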

Graphical Example 1

Figure 1.5 demonstrates optimization with the weighted sum approach for the example given in Section 1.2.2. Both weights are set to $w_1 = w_2 = 1$. If we maximize $g_1(x)$, we will thus also maximize the functions $f_1$ and $f_2$. This leads to a single optimum $x^\star = \hat{x}$.

Fig. 1.5: Optimization using the weighted sum approach (first example). (Curves: $y_1 = f_1(x)$, $y_2 = f_2(x)$, and $y = g_1(x) = f_1(x) + f_2(x)$ over $x \in X_1$, with the optimum $\hat{x}$ marked.)

Graphical Example 2

The sum of the two-dimensional functions $f_3$ and $f_4$ from the second graphical example given in Section 1.2.2 is sketched in Figure 1.6. Again we set the weights $w_3$ and $w_4$ to 1. The sum $g_2$, however, is subject to minimization. The graph of $g_2$ has two especially deep valleys. At the bottoms of these valleys, the two global minima $\check{x}_5$ and $\check{x}_6$ can be found.

Problems with Weighted Sums

The drawback of this approach is that it cannot properly handle functions that rise or fall with different speed¹⁶. In Figure 1.7, we have sketched the sum $g(x)$ of the two objective functions $f_1(x) = -x^2$ and $f_2(x) = e^{x-2}$. When minimizing or maximizing this sum, we will always disregard one of the two functions, depending on the interval chosen. For small x, $f_2$ is negligible compared to $f_1$. For $x > 5$ it begins to outpace $f_1$ which, in turn, will now become negligible. Such functions cannot be added up properly using constant weights.

15 This approach applies a linear aggregation function for fitness assignment and is therefore also often referred to as linear aggregating.

16 See Section 30.1.3 on page 550 or http://en.wikipedia.org/wiki/Asymptotic_notation [accessed 2007-07-03] for related information.

Fig. 1.6: Optimization using the weighted sum approach (second example). (Surface: $y = g_2(x) = f_3(x) + f_4(x)$ over the axes $x_1$ and $x_2$.)

Fig. 1.7: A problematic constellation for the weighted sum approach. (Curves: $y_1 = f_1(x)$, $y_2 = f_2(x)$, and $y = g(x) = f_1(x) + f_2(x)$.)

Even if we set $w_1$ to the really large number $10^{10}$, $f_1$ will become insignificant for all $x > 40$, because

$$\left| \frac{-(40^2) \cdot 10^{10}}{e^{40-2}} \right| \approx 0.0005.$$

Therefore, weighted sums are only suitable to optimize functions that at least share the same asymptotic growth in big-O notation (see Section 30.1.3 on page 550). Often, it is not obvious how the objective functions will fall or rise. How can we, for instance, determine whether the objective maximizing the food piled by an Artificial Ant rises in comparison to the objective minimizing the distance walked by the simulated insect?
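The magnitude quoted above can be reproduced with a few lines of Python. This is only an arithmetic check of the ratio between the weighted $f_1$ and $f_2$ at x = 40; the variable names are chosen for illustration.

```python
import math

w1 = 1e10                 # the very large weight applied to f1
x = 40.0

f1 = -(x ** 2)            # f1(x) = -x^2
f2 = math.exp(x - 2.0)    # f2(x) = e^(x-2)

# Magnitude of the weighted f1 relative to f2 at x = 40.
ratio = abs(w1 * f1 / f2)
```

The value of `ratio` is about 5e-4, so even a weight of $10^{10}$ cannot keep the quadratic term relevant against the exponential one.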

And even if the shape of the objective functions and their complexity class were clear, the question about how to set the weights w properly still remains open in most cases [28]. In the same paper, Das and Dennis [28] also show that with weighted sum approaches, not necessarily all elements considered optimal in terms of Pareto domination will be found.

Pareto Optimization

Pareto efficiency¹⁷ (also called Pareto optimality) is an important notion in neoclassical economics with broad applications in game theory, engineering, and the social sciences [29, 30]. It defines the frontier of solutions that can be reached by trading off conflicting objectives in an optimal manner. From this front, a decision maker (be it a human or another algorithm) can finally choose the configurations that, in their opinion, suit best [31, 32, 33, 34, 35]. The notion of optimal in the sense of Pareto efficiency is strongly based on the definition of domination:

Definition 1.13 (Domination). An element $x_1$ dominates (is preferred to) an element $x_2$ ($x_1 \vdash x_2$) if $x_1$ is better than $x_2$ in at least one objective function and not worse with respect to all other objectives. Based on the set F of objective functions f, we can write:

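$$x_1 \vdash x_2 \Leftrightarrow \forall i : 0 < i \leq n \Rightarrow \omega_i f_i(x_1) \leq \omega_i f_i(x_2) \;\wedge\; \exists j : 0 < j \leq n : \omega_j f_j(x_1) < \omega_j f_j(x_2) \tag{1.6}$$

$$\omega_i = \begin{cases} 1 & \text{if } f_i \text{ should be minimized} \\ -1 & \text{if } f_i \text{ should be maximized} \end{cases} \tag{1.7}$$

Different from the weights in the weighted sum approach, the factors $\omega_i$ only carry sign information, which allows us to maximize some objectives and to minimize others.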
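The domination relation of Equation 1.6 translates almost directly into code. The sketch below assumes that each element is already represented by its vector of objective values; the function name `dominates` and the sample vectors are illustrative only.

```python
def dominates(a, b, omegas):
    """True if objective vector a dominates objective vector b.

    omegas[i] is 1 if objective i is minimized and -1 if it is maximized,
    so that every sign-adjusted component is subject to minimization.
    """
    sa = [w * v for w, v in zip(omegas, a)]
    sb = [w * v for w, v in zip(omegas, b)]
    not_worse = all(x <= y for x, y in zip(sa, sb))
    strictly_better = any(x < y for x, y in zip(sa, sb))
    return not_worse and strictly_better


# Both objectives are maximized in this illustration.
omegas = [-1, -1]

better_in_one = dominates((3, 4), (3, 2), omegas)   # equal in f1, better in f2
trade_off = dominates((3, 2), (1, 4), omegas)       # better in f1, worse in f2
identical = dominates((3, 4), (3, 4), omegas)       # no strict improvement
```

Only the first comparison is a domination. The second pair is a genuine trade-off and the third pair is identical, which reflects that the relation is a strict partial order and leaves some pairs incomparable.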

The Pareto domination relation deﬁnes a strict partial order (see Deﬁnition 27.31 on page 463) on the space of possible objective values. In contrast, the weighted sum approach imposes a total order by projecting it into the real numbers R.

Definition 1.14 (Pareto Optimal). An element x^⋆ ∈ X is Pareto optimal (and hence, part of the optimal set X^⋆) if it is not dominated by any other element in the problem space X. In terms of Pareto optimization, X^⋆ is called the Pareto set or the Pareto Frontier.

$$x^\star \in X^\star \Leftrightarrow \nexists x \in X : x \vdash x^\star \tag{1.8}$$

Graphical Example 1

In Figure 1.8, we illustrate the impact of the definition of Pareto optimality on our first example (outlined in Section 1.2.2). We assume again that $f_1$ and $f_2$ should both be maximized and hence, $\omega_1 = \omega_2 = -1$. The areas shaded in dark gray are Pareto optimal and thus represent the optimal set $X^\star = [x_2, x_3] \cup [x_5, x_6]$, which here contains infinitely many elements¹⁸. All other points are dominated, i.e., not optimal.

The points in the area between $x_1$ and $x_2$ (shaded in light gray) are dominated by other points in the same region or in $[x_2, x_3]$, since both functions $f_1$ and $f_2$ can be improved by increasing x. If we start at the leftmost point in X (which is position $x_1$), for instance, we can go one small step $\Delta$ to the right and will find a point $x_1 + \Delta$ dominating $x_1$ because $f_1(x_1 + \Delta) > f_1(x_1)$ and $f_2(x_1 + \Delta) > f_2(x_1)$. We can repeat this procedure and will always find a new dominating point until we reach $x_2$. The point $x_2$ marks the global maximum of $f_2$, the point with the highest possible $f_2$ value, which cannot be dominated by any other point in X by definition (see Equation 1.6).

From here on, $f_2$ will decrease for a while, but $f_1$ keeps rising. If we now go a small step $\Delta$ to the right, we will find a point $x_2 + \Delta$ with $f_2(x_2 + \Delta) < f_2(x_2)$ but also $f_1(x_2 + \Delta) > f_1(x_2)$. One objective can only get better if another one degenerates. In order to increase $f_1$, $f_2$ would have to be decreased and vice versa, and so the new point is not dominated by $x_2$. Although some of the $f_2(x)$ values of the other points $x \in [x_1, x_2)$ may be larger than $f_2(x_2 + \Delta)$, $f_1(x_2 + \Delta) > f_1(x)$ holds for all of them. This means that no point in $[x_1, x_2)$ can dominate any point in $[x_2, x_4]$ because $f_1$ keeps rising until $x_4$ is reached.

17 http://en.wikipedia.org/wiki/Pareto_efficiency [accessed 2007-07-03]

18 In practice, of course, our computers can only handle finitely many elements.

Fig. 1.8: Optimization using the Pareto Frontier approach. (Curves: $y = f_1(x)$ and $y = f_2(x)$ over $x \in X_1$, with the positions $x_1$ to $x_6$ marked on the horizontal axis.)

At $x_3$, however, $f_2$ steeply falls to a very low level, lower than $f_2(x_5)$. Since the $f_1$ values of the points in $[x_5, x_6]$ are also higher than those of the points in $(x_3, x_4]$, all points in the set $[x_5, x_6]$ (which also contains the global maximum of $f_1$) dominate those in $(x_3, x_4]$.

For all the points in the white area between x4 and x5 and after x6, we can derive similar relations. All of them are also dominated by the non-dominated regions that we have just discussed.

Graphical Example 2

Another method to visualize the Pareto relationship is outlined in Figure 1.9 for our second graphical example. For a certain resolution of the problem space $X_2$, we have counted the number of elements that dominate each element $x \in X_2$. The higher this number, the worse is the element x in terms of Pareto optimization. Hence, those solution candidates residing in the valleys of Figure 1.9 are better than those which are part of the hills. This Pareto ranking approach is also used in many optimization algorithms as part of the fitness assignment scheme (see Section 2.3.3 on page 92, for instance). A non-dominated element is, as the name says, not dominated by any other solution candidate. These elements are Pareto optimal and have a domination count of zero. In Figure 1.9, there are four such areas $X^\star_1$, $X^\star_2$, $X^\star_3$, and $X^\star_4$.
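The domination count described above can be sketched for a small finite set of objective vectors. The sample points are made up for illustration, and both objectives are assumed to be minimized, as in the second graphical example.

```python
def dominates(a, b):
    """a dominates b when every objective is subject to minimization."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def domination_counts(points):
    """For each point, count how many points of the set dominate it."""
    return [sum(dominates(q, p) for q in points) for p in points]


# Hypothetical objective vectors (f3, f4) of five solution candidates.
points = [(1, 5), (2, 2), (5, 1), (3, 3), (6, 6)]
counts = domination_counts(points)

# Candidates with a count of zero are non-dominated, i.e. Pareto optimal.
pareto_optimal = [p for p, c in zip(points, counts) if c == 0]
```

The pairwise comparison needs a number of domination tests that grows quadratically with the number of candidates, which is one reason why a fixed resolution of the problem space is used for such a plot.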

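Fig. 1.9: Optimization using the Pareto Frontier approach (second example). (Surface: the domination count #dom over the axes $x_1$ and $x_2$, with the non-dominated areas $X^\star_1$, $X^\star_2$, $X^\star_3$, and $X^\star_4$ marked.)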

If we compare Figure 1.9 with the plots of the two functions $f_3$ and $f_4$ in Figure 1.4, we can see that hills in the domination space occur at positions where both $f_3$ and $f_4$ have high values. Conversely, regions of the problem space where both functions have small values are dominated by very few elements.

Besides these examples here, another illustration of the domination relation which may help in understanding Pareto optimization can be found in Section 2.3.3 on page 92 (Figure 2.4 and Table 2.1).

Problems of Pure Pareto Optimization

The complete Pareto optimal set is often not the desired result of an optimization algorithm. Usually, we are rather interested in some special areas of the Pareto front only.

Artificial Ant Example

We can again take the Artificial Ant example to visualize this problem. In Section 1.2.2 on page 9, we have introduced multiple conflicting criteria in this problem:

• Maximize the amount of food piled.

• Minimize the distance covered or the time needed to ﬁnd the food.

• Minimize the size of the program driving the ant.

Pareto optimization may now yield for example:

• A program consisting of 100 instructions, allowing the ant to gather 50 food items when walking a distance of 500 length units.

• A program consisting of 100 instructions, allowing the ant to gather 60 food items when walking a distance of 5000 length units.

• A program consisting of 10 instructions, allowing the ant to gather 1 food item when walking a distance of 5 length units.

• A program consisting of 0 instructions, allowing the ant to gather 0 food items when walking a distance of 0 length units.
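That all four programs listed above are mutually non-dominated can be verified with a short sketch. The `Program` tuple, the `key` helper, and the filter `non_dominated` are hypothetical names introduced only for this illustration; the numbers are taken from the list.

```python
from typing import NamedTuple


class Program(NamedTuple):
    size: int       # number of instructions, to be minimized
    food: int       # food items gathered, to be maximized
    distance: int   # length units walked, to be minimized


def key(p):
    # Sign-adjusted objective vector: every component is minimized.
    return (p.size, -p.food, p.distance)


def dominates(a, b):
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(programs):
    """Keep the programs that no other program in the list dominates."""
    return [p for p in programs
            if not any(dominates(key(q), key(p)) for q in programs)]


programs = [
    Program(size=100, food=50, distance=500),
    Program(size=100, food=60, distance=5000),
    Program(size=10, food=1, distance=5),
    Program(size=0, food=0, distance=0),
]
front = non_dominated(programs)
```

Under these three criteria, the empty program survives the filter just like the two large ones, since nothing can be smaller or walk a shorter distance.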

The result of the optimization process obviously contains two useless but non-dominated individuals which occupy space in the population and the non-dominated set. We also invest processing time in evaluating them, and even worse, they may dominate solutions that are not optimal but fall into the space behind the interesting part of the Pareto front. Furthermore, memory restrictions usually force us to limit the size of the list of non-dominated solutions found during the search. When this size limit is reached, some optimization algorithms use a clustering technique to prune the optimal set while maintaining diversity. On the one hand, this is good since it will preserve a broad scan of the Pareto frontier. On the other hand, in this case a short but dumb program is of course very different from a longer, intelligent one. Therefore, it will be kept in the list, and other solutions which differ less from each other but are more interesting for us will be discarded.

Furthermore, non-dominated elements have a higher probability of being explored further. This then leads inevitably to the creation of a great proportion of useless offspring. In the next generation, these useless offspring will need a good share of the processing time to be evaluated.

Thus, there are several reasons to force the optimization process into a desired direction. In Section 22.2.2 on page 374, you can find an illustrative discussion on the drawbacks of strict Pareto optimization in a practical example (evolving web service compositions).

1.2.3 Constraint Handling

Such a region of interest is one of the reasons for one further extension of the definition of optimization problems: In many scenarios, p inequality constraints g and q equality constraints h may be imposed in addition to the objective functions. Then, a solution candidate x is feasible if and only if $g_i(x) \geq 0 \; \forall i = 1, 2, \ldots, p$ and $h_i(x) = 0 \; \forall i = 1, 2, \ldots, q$ hold. Obviously, only a feasible individual can be a solution, i.e., an optimum, for a given optimization problem.
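The feasibility condition can be sketched as a simple predicate. The constraint functions below and the numerical tolerance `tol` for the equality constraints are assumptions made for illustration; the text itself demands exact equality.

```python
def is_feasible(x, inequality, equality, tol=1e-9):
    """True if g_i(x) >= 0 for all inequality constraints and
    h_i(x) = 0 (within the assumed tolerance tol) for all equality constraints."""
    return (all(g(x) >= 0 for g in inequality)
            and all(abs(h(x)) <= tol for h in equality))


# Hypothetical constraints on a two-dimensional candidate x = (x0, x1).
inequality = [
    lambda x: x[0],                 # x0 >= 0
    lambda x: 10 - x[0] - x[1],     # x0 + x1 <= 10
]
equality = [
    lambda x: x[0] - x[1],          # x0 == x1
]

ok = is_feasible((2.0, 2.0), inequality, equality)
violates_inequality = is_feasible((-1.0, -1.0), inequality, equality)
violates_equality = is_feasible((2.0, 3.0), inequality, equality)
```

A tolerance is used here because floating-point results rarely hit an equality constraint exactly; setting `tol=0` recovers the strict definition.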

Death Penalty

Probably the easiest way of dealing with constraints is to simply reject all infeasible solution candidates right away and not consider them any further in the optimization process. This death penalty can only work in problems where the feasible regions are very large and
