

6.3 Combining Enumeration with Unification

6.3.2 Program Synthesis using Decision Trees

Recall Algorithm 4.1, SynthForPoints, shown in Chapter 4. Algorithm 4.1 essentially synthesizes one expression such that the expression satisfies the given specification for all the concrete inputs in a given set $P$. In essence, it enumerates all conditional expressions implicitly as a part of its search.

The basic idea behind the algorithm which we now present is that we do not need to synthesize a single expression which satisfies the specification for all concrete inputs. We can instead learn a set $E$ of expressions, such that each expression satisfies the specification for some subset $P' \subseteq P$ of the concrete inputs, and such that for every concrete input $p \in P$, there exists some expression $e \in E$ that satisfies the specification at $p$. Once we have gathered such a set $E$, we can then enumerate a sufficient set of atomic predicates from $G_P$. These atomic predicates can then be combined using Boolean connectives to form the conditions in a conditional expression that combines the terms in $E$ to produce an expression which is correct over all the concrete inputs. The computational problem of generating this conditional expression can be reduced to one of learning an appropriate decision tree, as we describe in this section.

Formally, we are given a canonicalized, separable SyGuS specification for one function $f$ of the form $\psi_{can} \triangleq \exists f\, \forall x, a\; \varphi_{can}[f, x, a]$, defined earlier in this section. We are also given two grammars $G_T$ and $G_P$, which are as described earlier. We abuse notation slightly, and also use $G_T$ and $G_P$ to refer to the sets of terms and predicates generated by the grammars $G_T$ and $G_P$ respectively, whenever the context creates no opportunity for ambiguity. Further, we have a set of valuations $P$ of the variables in $xa$, where each $\sigma \in P$ maps a variable $v \in xa$ to a value $\sigma(v)$ of the appropriate type. We define a function $L : P \to 2^{G_T}$, such that for any term $t$ and any point $p \in P$, $t \in L(p)$ if and only if $\varphi_{can}[t[p], xa \mapsto p]$ evaluates to true. Note that the notation $\varphi_{can}[t[p], xa \mapsto p]$ denotes that first, every occurrence of a variable from $a$ in $t$ has been replaced by its valuation according to $p$, which is denoted as $t[p]$. Following this, every occurrence of $f(\cdot)$ in $\varphi_{can}$ is replaced by $t[p]$, and lastly, all other occurrences of variables from $xa$ in $\varphi_{can}$ are also replaced by their valuations according to $p$, which is denoted by $xa \mapsto p$.

Now, we can view the set of valuations $P$ as a sample set. The labeling function is now essentially a multi-labeling function $L$, which maps each point $p \in P$ to a set of labels drawn from the set $G_T$. Further, for each point $p \in P$, the results of evaluating each predicate $g \in G_P$ at $p$ form a vector of Boolean attributes for $p$, which may be of infinite length. Given these parallels, it is now clear how we can treat this as a decision tree learning problem, except for one wrinkle: each sample may be multiply labeled. The possibility that a point may be labeled with multiple terms causes problems in the computation of entropy according to Equation (6.4), which requires the fraction of samples labeled with a particular label. Applying this equation naïvely will result in $\sum_{l \in L} \Pr(l) \neq 1$, and thus the function $\Pr$ will no longer be a probability mass function.
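To make the wrinkle concrete, the following minimal sketch computes the naive per-label fractions on a small multi-labeled sample. The label sets are hypothetical data chosen for illustration, and `naive_label_fractions` is a name introduced here, not part of the original text.

```python
from fractions import Fraction

# Hypothetical multi-labeled sample: one label set L(p) per point p.
label_sets = [{"x"}, {"x", "x+y"}, {"y", "x+y"}, {"y"}]


def naive_label_fractions(label_sets):
    """Fraction of samples carrying each label, as in the single-label setting."""
    counts = {}
    for labels in label_sets:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label: Fraction(n, len(label_sets)) for label, n in counts.items()}


naive = naive_label_fractions(label_sets)
total = sum(naive.values())  # 3/2 here, so the fractions are not a distribution
```

Each of the three labels appears on two of the four points, so every fraction is 1/2 and the total is 3/2 rather than 1.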

To deal with this wrinkle, given a sample set $P$, we define a conditional distribution on the probabilities of labels, i.e., the probability of a point $p$ being assigned a label $l \in L(p)$, conditioned on the fact that a particular point $p \in P$ has been chosen. In the original single-label case, given a point $p \in P$, we know that it can be assigned only one label: $\mathit{label}(p)$. In the multi-label case, our formulation takes the view that once a point $p \in P$ has been picked, it can be assigned any label $l \in L(p)$ according to a probability distribution. This conditional probability distribution is defined as follows:

$$\Pr(\mathit{label}(p) = t \mid p) = \begin{cases} 0 & \text{if } t \notin L(p) \\[4pt] \dfrac{\mathit{cover}(t)}{\sum_{t' \in L(p)} \mathit{cover}(t')} & \text{if } t \in L(p) \end{cases} \tag{6.6}$$

where, given a sample set $P$, the function $\mathit{cover} : G_T \to \mathbb{N}$ denotes how many samples in $P$ can possibly be labeled with a given term $t \in G_T$, and is a rough measure of how relevant a particular term is. This function is defined as follows:

$$\mathit{cover}(t) \equiv |\{p \in P : t \in L(p)\}| \tag{6.7}$$

Now, given the sample set $P$, we can determine the unconditional label probabilities by summing the conditional probability shown in Equation 6.6 over all the points in $P$. Thus, the probability of a randomly chosen point from $P$ being labeled with $t \in G_T$ is:

$$\Pr(t) = \sum_{p \in P} \Pr(\mathit{label}(p) = t \mid p) \times \Pr(p)$$

Now, assuming that each point $p \in P$ is equally likely to be chosen, i.e., we sample from $P$ uniformly at random, we obtain:

$$\Pr(t) = \frac{1}{|P|} \sum_{p \in P} \Pr(\mathit{label}(p) = t \mid p) \tag{6.8}$$

We can now directly use Equation 6.8 to compute the entropy according to Equation 6.4, and thus information gain according to Equation 6.5, which can then be used to learn a decision tree based on the greedy information gain heuristic. Finally, we note that the conditional distribution that we have defined in Equation 6.6 makes intuitive sense, and works well in practice, as we will demonstrate shortly. However, better choices for this probability distribution might still be possible, and this conditional distribution must therefore be viewed as tunable.
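Equations 6.6 to 6.8 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names and the two-point sample are assumptions introduced here, and the entropy is taken to be the usual Shannon entropy in bits, with terms of probability zero contributing nothing.

```python
from fractions import Fraction
from math import log2


def label_probabilities(label_sets):
    """Pr(t) per Equation 6.8, given one label set L(p) per point p."""
    # cover(t): number of points that can be labeled with t (Equation 6.7).
    cover = {}
    for labels in label_sets:
        for t in labels:
            cover[t] = cover.get(t, 0) + 1
    probs = {t: Fraction(0) for t in cover}
    for labels in label_sets:
        total = sum(cover[t] for t in labels)
        for t in labels:
            # Equation 6.6 for this point, weighted by Pr(p) = 1/|P|.
            probs[t] += Fraction(cover[t], total * len(label_sets))
    return probs


def entropy(probs):
    """Shannon entropy in bits; zero-probability labels are skipped."""
    return -sum(float(p) * log2(float(p)) for p in probs.values() if p > 0)


# Hypothetical two-point sample with label sets {x} and {x, x+y}.
probs = label_probabilities([{"x"}, {"x", "x+y"}])
# probs == {"x": Fraction(5, 6), "x+y": Fraction(1, 6)}
```

As long as every point has at least one label, the resulting probabilities sum to one, which is what the entropy computation needs.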

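| Row # | $p \in P$ | $L(p)$ | $\mathit{attrib}(p)$ |
|---|---|---|---|
| 1 | $\langle x: 2,\ y: 1 \rangle$ | $\{x\}$ | $\langle x < y: F,\ x = 0: F,\ y = 0: F \rangle$ |
| 2 | $\langle x: 1,\ y: 0 \rangle$ | $\{x,\ x + y\}$ | $\langle x < y: F,\ x = 0: F,\ y = 0: T \rangle$ |
| 3 | $\langle x: 0,\ y: 1 \rangle$ | $\{y,\ x + y\}$ | $\langle x < y: T,\ x = 0: T,\ y = 0: F \rangle$ |
| 4 | $\langle x: 1,\ y: 2 \rangle$ | $\{y\}$ | $\langle x < y: T,\ x = 0: F,\ y = 0: F \rangle$ |

Table 6.1: A multi-labelled sample set over which a decision tree is to be learned

An Illustrative Example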

We now illustrate the techniques which we have just described with an example. Consider the following specification, which describes a binary function $f$ over integers that is expected to return the maximum of its arguments:

$$\exists f\, \forall x, y\;\; f(x, y) \geq x \,\wedge\, f(x, y) \geq y \,\wedge\, \big(f(x, y) = x \,\vee\, f(x, y) = y\big)$$

Suppose that the set of terms that we are working with is $\{x, y, x + y\}$ and the set of predicates is $\{x < y,\ x = 0,\ y = 0\}$. Further, the set $P$ for our example contains the four valuations shown in the second column of Table 6.1. The third column shows the set of labels (terms) that satisfy the specification at each sample (or point), and the fourth column shows the attribute vector, which consists of the predicates and their truth values at the corresponding point. For instance, the row numbered one in the table considers the valuation where $x$ is two and $y$ is one. We see that the term $x$ is the only term from among the terms $x$, $x + y$ and $y$ that satisfies the specification at this point. Lastly, for this valuation, all the predicates that we consider, i.e., the predicates $x < y$, $y = 0$ and $x = 0$, evaluate to false, as shown in the last column.
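The label sets and attribute vectors of Table 6.1 can be recomputed mechanically. The sketch below is an illustration under stated assumptions: the comparisons in the specification are read as non-strict, as a maximum function requires, and all names (`POINTS`, `TERMS`, `PREDICATES`, `labels`, `attributes`) are introduced here for the example.

```python
# The four valuations of Table 6.1 as (x, y) pairs.
POINTS = [(2, 1), (1, 0), (0, 1), (1, 2)]

# Candidate terms and atomic predicates, keyed by their textual form.
TERMS = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x+y": lambda x, y: x + y,
}
PREDICATES = {
    "x<y": lambda x, y: x < y,
    "x=0": lambda x, y: x == 0,
    "y=0": lambda x, y: y == 0,
}


def satisfies_max_spec(value, x, y):
    """The max specification with f(x, y) replaced by a concrete value."""
    return value >= x and value >= y and (value == x or value == y)


def labels(point):
    """L(p): the terms whose value at p satisfies the specification."""
    x, y = point
    return {name for name, term in TERMS.items()
            if satisfies_max_spec(term(x, y), x, y)}


def attributes(point):
    """attrib(p): the truth value of every atomic predicate at p."""
    x, y = point
    return {name: pred(x, y) for name, pred in PREDICATES.items()}


table = [(p, labels(p), attributes(p)) for p in POINTS]
```

Each entry of `table` corresponds to one row of Table 6.1; for example, the first point is labeled only with `x`, and all three predicates are false there.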

To learn a decision tree over this sample set, we need to evaluate the entropies that result from splitting the set of valuations on each of the atomic predicates. We then choose the predicate whose split results in the smallest entropy, and split the set of valuations according to that predicate. To illustrate, let us first consider splitting this sample set according to the predicate $x < y$. Splitting on this predicate yields two partitions of the set of valuations $P$. Let us refer to these partitions as $P_1$ and $P_2$, where $P_1$ contains the rows numbered one and two, where $x < y$ evaluates to false, and $P_2$ contains the rows numbered three and four, where $x < y$ evaluates to true. We need to compute the entropy for each of these partitions. The total entropy for the partitioned set of valuations is then

the sum of entropies of each of these partitions, weighted by the fraction of valuations in the respective partition.

Table 6.2 shows the partitions that result from splitting on the predicate $x < y$, as well as the label probabilities computed according to Equation 6.8, with each partition taken as the sample set. The entropy of each partition is then computed according to Equation 6.4, using the set $\{x, y, x + y\}$ as the set of all possible labels. Note that in this table, the partition named $P_1$ corresponds to the rows in Table 6.1 where the predicate $x < y$ evaluates to false, and the partition $P_2$ corresponds to the rows where the predicate $x < y$ evaluates to true. Also, for the purposes of entropy calculations, we assume that $0 \times \log_2(0) = 0$. The overall entropy that results from the split using the predicate $x < y$ is the weighted sum $\frac{1}{2} \times 0.650022 + \frac{1}{2} \times 0.650022 = 0.650022$.

| Partition | Points in partition | Label probabilities | Entropy |
|---|---|---|---|
| $P_1$ | $\langle x: 2,\ y: 1 \rangle$, $\langle x: 1,\ y: 0 \rangle$ | $\Pr(\mathit{label}(p) = x) = \frac{5}{6}$, $\Pr(\mathit{label}(p) = x + y) = \frac{1}{6}$, $\Pr(\mathit{label}(p) = y) = 0$ | 0.650022 |
| $P_2$ | $\langle x: 0,\ y: 1 \rangle$, $\langle x: 1,\ y: 2 \rangle$ | $\Pr(\mathit{label}(p) = x) = 0$, $\Pr(\mathit{label}(p) = x + y) = \frac{1}{6}$, $\Pr(\mathit{label}(p) = y) = \frac{5}{6}$ | 0.650022 |

Table 6.2: Entropies that result by splitting the sample set shown in Table 6.1 using the predicate $x < y$

| Partition | Points in partition | Label probabilities | Entropy |
|---|---|---|---|
| $P_1$ | $\langle x: 2,\ y: 1 \rangle$, $\langle x: 1,\ y: 0 \rangle$, $\langle x: 1,\ y: 2 \rangle$ | $\Pr(\mathit{label}(p) = x) = \frac{5}{9}$, $\Pr(\mathit{label}(p) = x + y) = \frac{1}{9}$, $\Pr(\mathit{label}(p) = y) = \frac{1}{3}$ | 1.351644 |
| $P_2$ | $\langle x: 0,\ y: 1 \rangle$ | $\Pr(\mathit{label}(p) = x) = 0$, $\Pr(\mathit{label}(p) = x + y) = \frac{1}{2}$, $\Pr(\mathit{label}(p) = y) = \frac{1}{2}$ | 1 |

Table 6.3: Entropies that result by splitting the sample set shown in Table 6.1 using the predicate $x = 0$

Now, repeating the same procedure to determine the entropy obtained by splitting on the predicate $x = 0$ yields the results shown in Table 6.3, where $P_1$ contains the rows in which $x = 0$ evaluates to false and $P_2$ the row in which it evaluates to true. The overall entropy from the split is the weighted sum $\frac{3}{4} \times 1.351644 + \frac{1}{4} \times 1 = 1.263733$. The results of splitting on the predicate

x < y
    N: use the term x
    Y: use the term y

Figure 6.1: The decision tree learned for the sample set shown in Table 6.1

Algorithm 6.2: ExpandTermSet: Expand the labeling function $L$ to include more terms

Input: A canonicalized SyGuS specification $\psi_{can} \triangleq \exists f\, \forall a, x\; \varphi_{can}[f, x, a]$.
       A list of $n$ valuations of the variables in $xa$, called $P$.
       A stateful enumerator $\mathit{enumerator}(G_T)$ for terms.
       A map $L$ from $P$ to subsets of terms from $G_T$.

Output: An expanded map $L'$, such that for all $p \in P$, $L'(p) \supseteq L(p)$.

new_terms ← the next $K_T$ terms from $\mathit{enumerator}(G_T)$
foreach $t \in$ new_terms do
    $s \leftarrow \langle \varphi_{can}[t[p], xa \mapsto p]$, for $p$ in $P \rangle$
    if there exists a term $t' \neq t$, such that for all $i \in [1, \mathit{length}(P)]$, $s[i]$ iff $t' \in L(P[i])$ then
        continue
    foreach $i \in [1, \mathit{length}(P)]$ such that $s[i] = \mathit{true}$ do
        $L[P[i]] \leftarrow L[P[i]] \cup \{t\}$
return $L$
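Algorithm 6.2 can be sketched in Python as follows. This is a hedged illustration rather than the author's implementation: terms are modeled as hypothetical `(name, function)` pairs, the specification is a callable `spec(value, point)`, the map $L$ is a list of sets aligned with the list of points, and the existence check ranges only over terms already recorded in $L$.

```python
from itertools import islice


def expand_term_set(spec, points, enumerator, labels, k_t):
    """Extend labels[i], the set L(P[i]), with up to k_t newly enumerated terms.

    spec(value, point) -> bool stands in for phi_can evaluated at one point;
    enumerator yields hypothetical (name, function) pairs for terms of G_T.
    """
    for name, fn in islice(enumerator, k_t):
        # s[i] is True iff the term satisfies the specification at points[i].
        s = [spec(fn(*p), p) for p in points]
        known = set().union(*labels)
        # Skip the term if a recorded term already labels exactly these points.
        if any(other != name
               and all(s[i] == (other in labels[i]) for i in range(len(points)))
               for other in known):
            continue
        for i, ok in enumerate(s):
            if ok:
                labels[i].add(name)
    return labels


def max_spec(value, point):
    """The max specification, with non-strict comparisons (an assumption)."""
    x, y = point
    return value >= x and value >= y and (value == x or value == y)


points = [(2, 1), (1, 0), (0, 1), (1, 2)]
terms = iter([("x", lambda x, y: x), ("y", lambda x, y: y),
              ("x+y", lambda x, y: x + y), ("y+x", lambda x, y: y + x)])
expanded = expand_term_set(max_spec, points, terms, [set() for _ in points], k_t=4)
```

Starting from empty label sets, the call reproduces the label sets of Table 6.1. The term `y+x` is skipped because it satisfies the specification at exactly the same points as `x+y`.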

$y = 0$ will be similar, as the cases $x = 0$ and $y = 0$ are symmetric; they result in exactly the same entropy and are not shown here. Thus, the entropy obtained by splitting on the predicate $x < y$ is the minimum among the choices, and will therefore yield the highest information gain. So, the decision tree learning algorithm splits according to the predicate $x < y$ at the first level. Once this has been done, notice that the sample set $P_1$ that results from the split can be labeled consistently by the label $x$, which results in the specification being satisfied at all the valuations in the set. Similarly, the label $y$ can be chosen for the set $P_2$. Thus, the decision tree learned for this example is as shown in Figure 6.1. From this tree, the expression $\mathit{ite}(x < y, y, x)$ can easily be deduced, which is a correct solution for this example.
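The choice of the first split can be checked with a short script. The sketch below covers only a single level of the greedy procedure, assuming Shannon entropy in bits and label probabilities computed within each partition. The names `ROWS`, `entropy`, `split_entropy` and `common_label` are introduced here for illustration.

```python
from fractions import Fraction
from math import log2

# Rows of Table 6.1: (label set L(p), attribute vector).
ROWS = [
    ({"x"},        {"x<y": False, "x=0": False, "y=0": False}),
    ({"x", "x+y"}, {"x<y": False, "x=0": False, "y=0": True}),
    ({"y", "x+y"}, {"x<y": True,  "x=0": True,  "y=0": False}),
    ({"y"},        {"x<y": True,  "x=0": False, "y=0": False}),
]


def entropy(label_sets):
    """Entropy of the label distribution of Equation 6.8 over one partition."""
    cover = {}
    for labels in label_sets:
        for t in labels:
            cover[t] = cover.get(t, 0) + 1
    probs = dict.fromkeys(cover, Fraction(0))
    for labels in label_sets:
        total = sum(cover[t] for t in labels)
        for t in labels:
            probs[t] += Fraction(cover[t], total * len(label_sets))
    return -sum(float(p) * log2(float(p)) for p in probs.values() if p > 0)


def split_entropy(rows, pred):
    """Entropies of the two partitions, weighted by partition size."""
    parts = [[ls for ls, attrs in rows if attrs[pred] == value]
             for value in (False, True)]
    return sum(len(part) / len(rows) * entropy(part) for part in parts if part)


def common_label(rows):
    """A label shared by every row of a partition, or None if there is none."""
    shared = set.intersection(*(ls for ls, _ in rows))
    return min(shared) if shared else None


best = min(["x<y", "x=0", "y=0"], key=lambda pred: split_entropy(ROWS, pred))
no_rows = [row for row in ROWS if not row[1][best]]
yes_rows = [row for row in ROWS if row[1][best]]
expression = f"ite({best}, {common_label(yes_rows)}, {common_label(no_rows)})"
```

Here `best` is `x<y` and `expression` is `ite(x<y, y, x)`. A full learner would recurse on any partition for which `common_label` returns `None`.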