A Case-Based Approach for Reuse in Software Design
4.6 Learning Module
Learning in REBUILDER is performed by storing new cases in the case library.
Design cases can result from the software designer’s work or from the CBR process. In both situations, it is up to the system’s administrator to accept (or not) the case as a valid one and add it (or not) to the case library.
Several Case Base Maintenance (CBM) strategies have been implemented in REBUILDER. They provide guidance to the software designer on whether a case should be added (or not) to the case base. Most of these criteria were inspired by approaches developed for cases represented as vectors of attribute/value pairs. Since we have a complex case representation, we had to adapt these strategies to it. Our contribution to this field is therefore the adaptation of these strategies to a complex case representation.
In REBUILDER, the learning mechanism is a tool at the disposal of the KB administrator. When the software designer thinks that a design is worth being integrated in the design repository (the case library), s/he can submit it. This action sends the design to the list of unconfirmed cases in the case library. The KB administrator can then call the Activate Learning command to start the CBM process. For each submitted case, the learning module runs the CBM strategies that were selected by the KB administrator. These strategies provide advice about adding (or not) submitted cases to the case library. They also give advice about the deletion of cases in the case base. The final decision is always up to the KB administrator. The next subsections describe the CBM strategies integrated in REBUILDER and how they were adapted to our case representation.
4.6.1 Frequency Deletion Criteria
The frequency deletion criterion, developed by Minton [Minton, 1990], suggests cases for deletion based on the frequency of case access. This implies the existence of an access counter associated with each case. The counter starts at zero and is incremented each time the case is retrieved. A maximum number of cases, called the swamping limit, is established for the case library. When the learning mechanism is activated, if the number of cases in the library reaches the swamping limit, the least retrieved cases (enough to stay within the swamping limit) are suggested for deletion. If the swamping limit is not reached, new cases can be added to the case library with their frequency initialized to zero. If two or more cases have the same frequency of use, then the one with the oldest access is suggested for deletion.
In REBUILDER, if the number of retrieval operations is not enough to single out a case, then the smallest case is suggested for deletion.
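The ranking described above can be sketched as follows. This is a minimal illustration, not REBUILDER's code: the `CaseStats` record, its field names, and the exact tie-break order (retrieval count, then oldest access, then smallest size) are assumptions made for the example, and the list passed in is assumed to already contain any submitted cases.

```python
from dataclasses import dataclass


@dataclass
class CaseStats:
    # Hypothetical per-case bookkeeping; field names are illustrative only.
    name: str
    retrievals: int      # access counter, starts at zero
    last_access: float   # time of the most recent retrieval (smaller = older)
    size: int            # assumed size measure, e.g. number of diagram objects


def suggest_deletions(cases, swamping_limit):
    """Return the cases suggested for deletion so the library fits the limit."""
    excess = len(cases) - swamping_limit
    if excess <= 0:
        return []  # limit not exceeded: nothing to suggest
    # Least retrieved first; ties go to the oldest access, then the smallest case.
    ranked = sorted(cases, key=lambda c: (c.retrievals, c.last_access, c.size))
    return ranked[:excess]
```

The function only produces suggestions; as in the text, the final decision would stay with the KB administrator.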
4.6.2 Subsumption Criteria
The subsumption criterion, developed by Racine and Yang [Racine and Yang, 1997], defines a case as redundant if it is equal to another case, is equivalent to another case, or is subsumed by another case.
When a new case is to be added to the case library, it must be checked for redundancy. If the case is considered redundant, then it is not added to the case library.
CHAPTER 4. CBR Engine
Cases are redundant when they are subsumed by other cases. In REBUILDER, a case C1 is considered subsumed by case C2 if the root package of C1 is subsumed by a package of C2. A package Pk1 is subsumed by a package Pk2 if: Pk1 and Pk2 have the same synset, all the diagram objects of Pk1 (except sub-packages) have an equivalent in Pk2, and all sub-packages in Pk1 are subsumed by a sub-package of Pk2. This definition is recursive, which suits the tree-like structure of class diagrams. This process can be time consuming, especially when dealing with large design cases. Subsumed cases are suggested for deletion.
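The recursive definition can be illustrated with a short sketch. The `Package` type is hypothetical, and "having an equivalent" is simplified here to equality of object names, which is an assumption; REBUILDER's actual equivalence test between diagram objects is not specified in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    # Hypothetical, simplified package model for illustration.
    synset: str
    objects: frozenset = frozenset()   # non-package diagram objects, by name
    subpackages: list = field(default_factory=list)


def package_subsumed(p1, p2):
    """True if package p1 is subsumed by package p2 (recursive definition)."""
    if p1.synset != p2.synset:
        return False
    if not p1.objects <= p2.objects:   # assumed equivalence: same name
        return False
    return all(
        any(package_subsumed(s1, s2) for s2 in p2.subpackages)
        for s1 in p1.subpackages
    )


def all_packages(p):
    """Yield p and every package nested inside it."""
    yield p
    for sub in p.subpackages:
        yield from all_packages(sub)


def case_subsumed(root1, root2):
    """C1 is subsumed by C2 if C1's root package is subsumed by a package of C2."""
    return any(package_subsumed(root1, p) for p in all_packages(root2))
```

Each comparison may visit every pair of sub-packages, which is consistent with the remark that the check becomes expensive for large designs.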
4.6.3 Footprint Deletion Criteria
Smyth and Keane [Smyth and Keane, 1995b] developed the footprint deletion criterion, which involves two important notions: coverage and reachability. The coverage of a case is the neighborhood of the case within certain adaptation limits. In other words, coverage relates to the set of problems that a case can solve.
$$\mathrm{CoverageSet}(c \in C) = \{c' \in C : \mathrm{Solves}(c, c')\} \tag{4.41}$$
where C is the case base considered, and c and c′ are cases. The reachability set of a case is the set of cases that can solve this case.
$$\mathrm{ReachabilitySet}(c \in C) = \{c' \in C : \mathrm{Solves}(c', c)\} \tag{4.42}$$
Since a target problem in REBUILDER is a class diagram, we say that a case c solves a problem p when c subsumes p, which means:
$$\mathrm{Solves}(c, p) = \mathrm{Subsumes}(c, p) \tag{4.43}$$
Using these two concepts, Smyth and Keane divide cases in the case library into four types: pivotal, auxiliary, spanning, and support cases. Pivotal cases represent unique ways to answer a specific query. Auxiliary cases are those which are completely subsumed by other cases in the case base. Spanning cases are cases between pivotal and auxiliary cases, which link together areas covered by other cases. Support cases exist in groups to support an idea. The recommended order of deletion is: auxiliary, support, spanning, and pivotal cases.
In REBUILDER, support cases cannot be distinguished from spanning cases, due to the definitions used for the coverage and reachability sets, so we decided not to consider support cases. The formal definitions used in REBUILDER for these types of case are:
$$\begin{aligned}
\mathrm{Pivotal}(c) &\Leftarrow \forall c' \in C : \neg \mathrm{Subsumes}(c', c) \\
\mathrm{Spanning}(c) &\Leftarrow \exists c' \in C : \mathrm{Subsumes}(c', c) \wedge \exists c'' \in C : \mathrm{Subsumes}(c, c'') \\
\mathrm{Auxiliary}(c) &\Leftarrow \exists c' \in C : \mathrm{Subsumes}(c', c) \wedge \forall c'' \in C : \neg \mathrm{Subsumes}(c, c'')
\end{aligned} \tag{4.44}$$
When there is a draw, the similarity between the candidate cases and the new case (see subsection 4.1.2) is used to decide: the candidate cases most similar to the new case are suggested for deletion.
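As a rough illustration of definitions (4.44), the sketch below classifies cases given any `subsumes` predicate. It assumes that the quantifiers range over cases other than c itself (otherwise a reflexive `subsumes` would leave no pivotal cases); that reading, and the function names, are assumptions of this example.

```python
def classify(case, case_base, subsumes):
    """Label a case as pivotal, spanning or auxiliary, following (4.44)."""
    others = [c for c in case_base if c is not case]  # assumed: c' != c
    is_subsumed = any(subsumes(o, case) for o in others)
    subsumes_other = any(subsumes(case, o) for o in others)
    if not is_subsumed:
        return "pivotal"
    return "spanning" if subsumes_other else "auxiliary"


# Recommended deletion order, with support cases left out as in REBUILDER.
DELETION_RANK = {"auxiliary": 0, "spanning": 1, "pivotal": 2}


def deletion_order(case_base, subsumes):
    """Sort cases so that the first ones are the first candidates for deletion."""
    return sorted(
        case_base,
        key=lambda c: DELETION_RANK[classify(c, case_base, subsumes)],
    )
```

In a toy setting where a case is a frozenset of objects and `subsumes(x, y)` is `x >= y`, the largest set is pivotal and the smallest nested set is auxiliary.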
4.6.4 Footprint-Utility Deletion Criteria
This criterion, also developed by Smyth and Keane [Smyth and Keane, 1995b], is the same as the Footprint Deletion Criteria, except that when there is a draw the selection is based on case usage: less used cases are suggested for deletion.
4.6.5 Coverage Criteria
The coverage criterion, as first devised by Smyth and McKenna [Smyth and McKenna, 1998], involves three factors: case base size, case base density, and case base distribution.
Intuitively, the competence of a case base is strongly influenced by its size. Another relevant factor is case base density, with the local density of a case c within a group of cases G in the case base C being defined by:
$$\mathrm{CaseDensity}(c, G) = \frac{\sum_{c' \in G - \{c\}} \mathrm{Sim}(c, c')}{|G| - 1} \tag{4.45}$$
where Sim(c, c′) returns the case similarity between c and c′. The third factor that influences the competence of a case base is case distribution. This factor is more complex than the previous ones, because if the CBR system performs adaptation and verification of solutions, these steps also influence the case base distribution (besides retrieval, of course). To assess the competence of a case base, Smyth and McKenna compute the local case coverage sets and determine how these sets combine and interact to form the case base competence. They define a competence group as a set of cases that are related to each other and that make a contribution to the case base competence which is independent from other competence groups. The definition of competence group is based on the shared coverage concept. Two cases exhibit shared coverage if their coverage sets overlap:
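$$\mathrm{SharedCoverage}(c, c') = \mathit{true} \Leftarrow \mathrm{CoverageSet}(c) \cap \mathrm{CoverageSet}(c') \neq \emptyset \tag{4.46}$$
A competence group can be defined as:
$$\mathrm{CompetenceGroup}(G) = \{\forall c_i \in G, \exists c_j \in G - \{c_i\} : \mathrm{SharedCoverage}(c_i, c_j) = \mathit{true}\} \wedge \{\forall c_k \in C - G, \neg\exists c_l \in G : \mathrm{SharedCoverage}(c_k, c_l) = \mathit{true}\} \tag{4.47}$$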
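One way to obtain groups that satisfy (4.47) is to take the connected components of the shared-coverage relation, as in the sketch below. This is an assumption of the example rather than a documented REBUILDER procedure. A case that shares coverage with no other case ends up alone in its group, which strictly speaking does not meet the first clause of (4.47); it is kept here as a singleton group.

```python
def competence_groups(coverage):
    """Partition cases into groups linked by shared coverage.

    `coverage` maps each case to its coverage set (a Python set).
    """
    unvisited = set(coverage)
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            # Cases whose coverage set overlaps the current case's (4.46).
            linked = {o for o in unvisited if coverage[current] & coverage[o]}
            unvisited -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(group)
    return groups
```

Because every case lands in exactly one connected component, the result is a partition of the case base, in line with the observation that a case belongs to only one competence group.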
Smyth and McKenna show that, according to this definition, a case belongs to only one competence group. The size and number of competence groups depend on four factors: distribution of cases, density of cases, retrieval mechanism, and adaptation mechanism. Group coverage can be defined by the number and density of cases in the group. The number of cases in the group is easy to measure. Group density is given by:
$$\mathrm{GroupDensity}(G) = \frac{\sum_{c \in G} \mathrm{CaseDensity}(c, G)}{|G|} \tag{4.48}$$
Group coverage is based on group size and group density, and is defined as:
$$\mathrm{GroupCoverage}(G) = 1 + [\,|G| \cdot (1 - \mathrm{GroupDensity}(G))\,] \tag{4.49}$$
The total coverage of a case base comprising several competence groups is given by the following formula (G = {G1, ..., Gn}):
$$\mathrm{Coverage}(G) = \sum_{G_i \in G} \mathrm{GroupCoverage}(G_i) \tag{4.50}$$
The learning module of REBUILDER uses these definitions to decide which cases should be suggested for deletion. A new case is added to the case library if the swamping limit has not been reached or if its inclusion increases the ratio between case base coverage and the number of cases in the case base. Otherwise, the KB administrator is advised to delete the case.
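Equations (4.45) and (4.48) to (4.50) translate almost directly into code. The sketch below assumes a caller-supplied similarity function `sim` and treats the density of a case in a singleton group as 0, since (4.45) is undefined when |G| = 1; both choices are assumptions of the example.

```python
def case_density(c, group, sim):
    """Equation (4.45): mean similarity between c and the other group members."""
    others = [o for o in group if o != c]
    if not others:
        return 0.0  # assumption: (4.45) is undefined for a singleton group
    return sum(sim(c, o) for o in others) / len(others)


def group_density(group, sim):
    """Equation (4.48): mean case density over the group."""
    return sum(case_density(c, group, sim) for c in group) / len(group)


def group_coverage(group, sim):
    """Equation (4.49): larger and sparser groups cover more."""
    return 1 + len(group) * (1 - group_density(group, sim))


def coverage(groups, sim):
    """Equation (4.50): total coverage over all competence groups."""
    return sum(group_coverage(g, sim) for g in groups)
```

To apply the addition rule described above, one would compute `coverage(groups, sim)` divided by the number of cases before and after inserting the new case and compare the two ratios.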
4.6.6 Case-Addition Criteria
The case-addition criterion [Zhu and Yang, 1999] involves the notion of case neighborhood, which in REBUILDER is defined as:
$$\mathrm{Neighborhood}(c) = \{c' \in C, \tau \in [0, 1] : (\mathrm{RPSynset}(c) = \mathrm{RPSynset}(c')) \wedge (\mathrm{Sim}(c, c') > \tau)\} \tag{4.51}$$
where RPSynset(c) is the root package synset of case c, and τ is a threshold value used to define a case neighborhood. This criterion also uses the notion of benefit of a case c in relation to a case set S, which we define as:
$$\mathrm{Benefit}(c, S) = \sum_{c' \in \mathrm{Neighborhood}(c) - \mathrm{Neighborhood}(S)} P(c') \tag{4.52}$$
The neighborhood of a set of cases is given by:
$$\mathrm{Neighborhood}(S) = \bigcup_{c \in S} \mathrm{Neighborhood}(c) \tag{4.53}$$
1. Determine the neighborhood for every case in the case base.
2. Set S to ∅.
3. Select a case from C − S with the minimal benefit with respect to the neighborhood of S and add it to S.
4. Repeat step 3 until Neighborhood(C) − Neighborhood(S) is empty or S has k elements.
Figure 4.45: The case deletion algorithm used by the case-addition criteria.
In our implementation of this criterion we defined P(c) as the frequency function of case c, which is computed using an access counter associated with each case in the case library. The algorithm in figure 4.45 is then used to determine a set of cases S defining the optimal case base coverage.
At the end, the cases in S should remain in the case base and all the others should be removed. To determine whether a case that is not in the case base should be added to it, add the case to the case base and run the algorithm. At the end, if the case is in S then it should be added to the case base; otherwise its deletion is suggested.
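The three quantities used by this criterion, equations (4.51) to (4.53), can be sketched as plain set operations. The parameter names (`synset`, `sim`, `tau`, `freq`) are placeholders chosen for this example: `synset` returns the root package synset of a case, `sim` is the case similarity, and `freq` maps each case to its access count P(c).

```python
def neighborhood(c, case_base, synset, sim, tau):
    """Equation (4.51): same root package synset and similarity above tau."""
    return {o for o in case_base if synset(o) == synset(c) and sim(c, o) > tau}


def set_neighborhood(cases, case_base, synset, sim, tau):
    """Equation (4.53): union of the neighborhoods of the given cases."""
    covered = set()
    for c in cases:
        covered |= neighborhood(c, case_base, synset, sim, tau)
    return covered


def benefit(c, selected, case_base, synset, sim, tau, freq):
    """Equation (4.52): summed frequency of c's neighbors not yet covered."""
    fresh = (neighborhood(c, case_base, synset, sim, tau)
             - set_neighborhood(selected, case_base, synset, sim, tau))
    return sum(freq[o] for o in fresh)
```

The selection loop of figure 4.45 would call `benefit` for every case in C − S at each iteration, so caching the neighborhoods computed in step 1 is a natural refinement.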
4.6.7 Relative Coverage and Condensed NN Criteria
Smyth and McKenna [Smyth and McKenna, 1999] developed this criterion with the goal of maximizing coverage while minimizing case base size. The proposed technique for building case bases is to use the Condensed Nearest Neighbor rule (CNN, see [Hart, 1968]) on cases that have first been arranged in descending order of their relative coverage contributions. CNN is a method proposed to reduce the storage requirements of the original data set, for the efficient implementation of the nearest neighbor decision rule in pattern classification problems. The relative coverage is defined as:
$$\mathrm{RelativeCoverage}(c) = \sum_{c' \in \mathrm{CoverageSet}(c)} \frac{1}{|\mathrm{ReachabilitySet}(c')|} \tag{4.54}$$
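A minimal sketch of equation (4.54), assuming the coverage and reachability sets are derived from a caller-supplied `solves` predicate as in (4.41) and (4.42). The helper `build_sets` is introduced only for this example.

```python
def build_sets(case_base, solves):
    """Coverage (4.41) and reachability (4.42) sets for every case."""
    coverage = {c: {o for o in case_base if solves(c, o)} for c in case_base}
    reachability = {c: {o for o in case_base if solves(o, c)} for c in case_base}
    return coverage, reachability


def relative_coverage(c, coverage, reachability):
    """Equation (4.54): each covered case counts less the more cases reach it."""
    # If o is covered by c, then c is in reachability[o], so the set is never empty.
    return sum(1 / len(reachability[o]) for o in coverage[c])
```

A case that is the only one able to solve its targets gets full credit for each of them, while credit for widely reachable targets is shared among the cases that cover them.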
OrderedSet ← Rank CaseBase and NewCase by Relative Coverage
EvaluatedSet ← ∅
Changes ← true
WHILE Changes DO
    Changes ← false
    FORALL Case IN OrderedSet DO
        IF EvaluatedSet cannot solve Case THEN
            Changes ← true
            Add Case to EvaluatedSet
            Remove Case from OrderedSet
        ENDIF
    ENDFOR
ENDWHILE
RETURN EvaluatedSet
Figure 4.46: The relative coverage criteria algorithm.
Our implementation of this criterion is detailed in figure 4.46. CaseBase and NewCase are, respectively, the set of cases in the case base and a new case not yet in the case base.
The algorithm starts by ranking the cases in the case base, including the new case, by relative coverage into the OrderedSet list. It sets the list of evaluated cases (EvaluatedSet) to empty and the Changes flag to true. While there are changes in the list of evaluated cases, as signaled by the Changes flag, it sets Changes to false and then, for each case in OrderedSet that the cases in the evaluated list cannot solve, sets Changes to true, adds the case to the evaluated list, and removes it from OrderedSet. At the end, it returns the evaluated list.
With this algorithm, if NewCase is part of EvaluatedSet then it should be added to the case base; otherwise it should be suggested for deletion. Cases in the case base that are not in EvaluatedSet should be removed from the case base.
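The procedure described above can be written compactly as follows. This is a sketch under stated assumptions: `relative_coverage` is any function returning the relative coverage of a case, `solves(a, b)` tells whether case a can solve case b, and both are supplied by the caller.

```python
def condensed_case_base(case_base, new_case, relative_coverage, solves):
    """Return the evaluated set built by the CNN-style pass over ranked cases."""
    # Rank the existing cases plus the new one by descending relative coverage.
    ordered = sorted(list(case_base) + [new_case],
                     key=relative_coverage, reverse=True)
    evaluated = []
    changes = True
    while changes:
        changes = False
        for case in list(ordered):       # iterate over a copy while removing
            if not any(solves(e, case) for e in evaluated):
                changes = True
                evaluated.append(case)
                ordered.remove(case)
    return evaluated
```

The new case is then kept if it appears in the returned list, for instance `new_case in condensed_case_base(...)`.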
4.6.8 Relative Performance Metric Criteria
Leake and Wilson [Leake and Wilson, 2000] developed this criterion, which uses the notion of relative performance to decide whether a case should be added to the case base. The relative performance of a case is defined as follows:
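$$\mathrm{RP}(c) = \sum_{c' \in \mathrm{CoverageSet}(c)} \left(1 - \frac{\mathrm{AdaptCost}(c, c')}{\max_{c'' \in \mathrm{ReachabilitySet}(c') - \{c\}} \{\mathrm{AdaptCost}(c'', c')\}}\right) \tag{4.55}$$
where AdaptCost(c, c′) is the adaptation cost of transforming c into c′.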
In REBUILDER, AdaptCost is defined as the modulus of the difference between objects in c and objects in c′. A submitted case should be added to the case base if its relative performance is higher than a threshold value defined by the KB administrator. In our experiments we have used 0.5 as the threshold value.
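A hedged sketch of equation (4.55) follows. Two points are assumptions of the example and not stated in the text: the adaptation cost is taken as the absolute difference in the number of objects of the two cases, and a term whose denominator is undefined or zero (no other case reaches the target, or all rivals adapt at zero cost) is simply skipped.

```python
def relative_performance(c, coverage, reachability, adapt_cost):
    """Equation (4.55), skipping terms whose denominator is undefined or zero."""
    total = 0.0
    for target in coverage[c]:
        rivals = reachability[target] - {c}
        worst = max((adapt_cost(r, target) for r in rivals), default=0)
        if worst == 0:
            continue  # assumption: undefined term contributes nothing
        total += 1 - adapt_cost(c, target) / worst
    return total


def should_add(c, coverage, reachability, adapt_cost, threshold=0.5):
    """Suggest adding c when its relative performance exceeds the threshold."""
    return relative_performance(c, coverage, reachability, adapt_cost) > threshold
```

The default threshold of 0.5 mirrors the value reported for the experiments; in practice it would be set by the KB administrator.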
4.6.9 Competence-Guided Criteria
The competence-guided criterion [McKenna and Smyth, 2000] extends previous work by Smyth, and uses the notion of case competence based on case coverage and reachability. This criterion uses three ordering functions:
Reach for Cover (RFC): uses the size of the reachability set of a case. The RFC evaluation function implements this idea: the usefulness of a case is an inverse function of its reachability set size.
Maximal Cover (MCOV): is based on the size of the coverage set of a case. Cases with large coverage sets can classify many target cases and in this way must make a significant contribution to classification competence.
Relative Coverage (RC): is defined in the relative coverage criteria (see subsection 4.6.7).
In REBUILDER, the algorithm that implements this criterion is presented in figure 4.47 (CaseBase and NewCase are defined as in the relative coverage and condensed NN criteria). The algorithm starts by initializing the set of remaining cases (RemainingSet) with all the cases in the case base and the new case being evaluated. Then the edited set of cases (EditedSet) is initialized to empty. While there are cases in RemainingSet, the algorithm gets the next case (Case) in RemainingSet according to the selected ordering function, adds it to EditedSet, removes all cases in the coverage set of Case from RemainingSet, and updates the reachability and coverage sets of the cases in RemainingSet. NewCase is only added to the case base if it is part of EditedSet.
RemainingSet ← CaseBase ∪ NewCase
EditedSet ← ∅
WHILE RemainingSet ≠ ∅ DO
    Case ← Next case in the RemainingSet according to the selected ordering function
    Add Case to EditedSet
    Remove all cases in CoverageSet(Case) from RemainingSet
    Update the reachability and coverage sets of the cases in the RemainingSet
ENDWHILE
RETURN EditedSet
Figure 4.47: The competence-guided criteria deletion algorithm.
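The loop of figure 4.47 can be sketched as below, with two of the ordering functions (MCOV and RFC) as examples. The sketch recomputes the coverage and reachability sets over the remaining cases at every iteration, which is one simple way to perform the update step; it also removes the selected case itself, in case a `solves` predicate does not make a case cover itself. Cases are assumed to be sortable values such as names, so that ties are broken deterministically. All function names are illustrative.

```python
def competence_guided_edit(case_base, new_case, solves, ordering):
    """Build the edited set by repeatedly picking the best remaining case."""
    remaining = set(case_base) | {new_case}
    edited = []
    while remaining:
        # Update coverage and reachability with respect to the remaining cases.
        coverage = {c: {o for o in remaining if solves(c, o)} for c in remaining}
        reach = {c: {o for o in remaining if solves(o, c)} for c in remaining}
        case = ordering(remaining, coverage, reach)
        edited.append(case)
        remaining -= coverage[case] | {case}
    return edited


def mcov(remaining, coverage, reach):
    """Maximal Cover: prefer the case with the largest coverage set."""
    return max(sorted(remaining), key=lambda c: len(coverage[c]))


def rfc(remaining, coverage, reach):
    """Reach for Cover: prefer the case with the smallest reachability set."""
    return min(sorted(remaining), key=lambda c: len(reach[c]))
```

As in the text, the new case would be accepted only if it appears in the returned list.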