Case Indexing and Storage - Case-Based Design

Background Knowledge

2.1 Case-Based Design

2.1.4 Case Indexing and Storage

Case indexing and storage are tightly connected: the way cases are organized in memory constrains the indexing schemes that can be used for retrieval. The main aspects that influence the organization of the case base are the case contents, the paradigms used to index cases, and case visualization. Two other factors are important in defining case indexing and structure: flexibility and efficiency. Flexibility matters because a case may have to be retrieved in different ways, depending on the retrieval perspective. Efficiency matters because the size of the case base can impose strong computational limitations.

Some of the most common indexing and storage mechanisms used in CBD are:

Flat Organization In a flat case base, cases are stored in a list (see [Kolodner, 1993, page 293]). The list can be kept in a specific order, for example with cases ranked alphabetically, but the main issue is that there is no clustering of cases to speed up retrieval. On the other hand, every case can be inspected, so retrieval accuracy can be higher than with clustering structures.

There is no indexing scheme, since cases are inspected one after another, starting at the beginning of the list. If the case base is large, retrieval can become infeasible within the available time. Adding or deleting cases is trivial and computationally inexpensive.
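Retrieval over a flat case base can be sketched as a linear scan. This is only an illustration: it assumes cases are dictionaries of attribute/value pairs and that similarity is the fraction of matching attributes, and the names `overlap` and `flat_retrieve` are hypothetical.

```python
def overlap(problem, case):
    """Fraction of the problem's attributes whose values the case matches."""
    if not problem:
        return 0.0
    hits = sum(1 for k, v in problem.items() if case.get(k) == v)
    return hits / len(problem)


def flat_retrieve(case_base, problem):
    """Scan every case in list order and return the best match (or None)."""
    best, best_score = None, -1.0
    for case in case_base:          # no index: each case is inspected in turn
        score = overlap(problem, case)
        if score > best_score:
            best, best_score = case, score
    return best


# Illustrative case base; the attributes are made up for the example.
case_base = [
    {"name": "c1", "material": "steel", "span": "long"},
    {"name": "c2", "material": "wood", "span": "short"},
]
case_base.append({"name": "c3", "material": "steel", "span": "short"})  # adding a case is cheap

best = flat_retrieve(case_base, {"material": "steel", "span": "short"})
```

Every query touches every case, which is what makes the scan accurate but slow on large case bases.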

Hierarchical Organization One way to make retrieval more efficient is to cluster cases in a hierarchical structure, for example a decision tree or a shared-feature network (see [Kolodner, 1993, page 295]). In this approach, cases are indexed by taxonomy nodes and can be accessed using these nodes as indexes. Generally, nodes are attributes and edges have an associated value (a decision tree), which corresponds to indexing cases by attribute/value pairs.

This organization works well with an attribute/value pair representation. Another way to organize cases is to use parts of cases as indexes and to build a hierarchical partonomic structure that clusters the cases [Gomes and Bento, 1998]. An index in this situation is a piece of a case.
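The decision-tree style of organization can be sketched as follows. The sketch assumes cases are attribute/value dictionaries and that the split order of the attributes is given in advance; `Node`, `build` and `retrieve` are hypothetical names, and a real system would typically choose the splits automatically.

```python
class Node:
    """Decision-tree node: tests one attribute; edges are attribute values."""

    def __init__(self, attribute=None):
        self.attribute = attribute   # None marks a leaf
        self.children = {}           # value -> Node
        self.cases = []              # cases stored at a leaf


def build(cases, attributes):
    """Cluster cases by splitting on the given attributes in order."""
    if not attributes or len(cases) <= 1:
        leaf = Node()
        leaf.cases = list(cases)
        return leaf
    node = Node(attributes[0])
    groups = {}
    for case in cases:
        groups.setdefault(case.get(node.attribute), []).append(case)
    for value, group in groups.items():
        node.children[value] = build(group, attributes[1:])
    return node


def retrieve(node, problem):
    """Follow the edges matching the problem's values down to a leaf."""
    while node.attribute is not None:
        child = node.children.get(problem.get(node.attribute))
        if child is None:
            return []                # no cluster for this value
        node = child
    return node.cases


cases = [
    {"name": "c1", "material": "steel", "span": "long"},
    {"name": "c2", "material": "steel", "span": "short"},
    {"name": "c3", "material": "wood", "span": "short"},
]
tree = build(cases, ["material", "span"])
found = retrieve(tree, {"material": "steel", "span": "short"})
```

Only the cases in the matching cluster are returned, so a case that differs on an early attribute is never inspected, which is the accuracy trade-off mentioned for clustering structures.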

Model Organization Cases can also be organized using models, such as Structure-Behavior-Function models (SBF [Goel, 1992]), Function-Behavior-Structure models (FBS [Qian and Gero, 1996]), causal or qualitative models (see CADET [Sycara and Navinchandra, 1991]), or even dependency network models. Each type of model has its own specific way of indexing cases. We will focus on the SBF models cited above. SBF models describe the structure, behavior and function of the design they represent. Using the SBF formalism, cases can be indexed by structural variables, behavioral variables, and/or functional variables. Some systems that use this representation formalism are capable of learning which descriptors are relevant for case indexing [Bhatta and Goel, 1992, Bhatta and Goel, 1993a, Bhatta and Goel, 1993b].

The indexing process consists of associating one or more indexes with a case, defining the situations in which the case is relevant. Index selection is an interpretation process based on the context in which the case is considered to be useful. There are three aspects to take into account in the indexing mechanism:

1. Indexing vocabulary.

2. Indexing process.

3. Index organization.

An index is a label associated with a case, which establishes a relation between the indexed case and the particular situation that the index represents. Indexes are usually associated with problem specifications and can have different levels of abstraction. An index that can be directly extracted from the case description is called a specific index. Abstract indexes are derived from the case description, but do not correspond directly to case features.

Specific indexes are easy to extract from cases, but in general they are not good indexes, because they represent superficial aspects of the situation described in the case. Abstract indexes are more difficult to extract, because extraction involves an inference step. On the other hand, they are more accurate for determining case relevance.


The indexing vocabulary often coincides with the design attributes, or is a subset of these attributes. When abstract indexes are used, the vocabulary does not coincide with the description attributes.
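The difference between specific and abstract indexes can be illustrated with a small sketch. The attributes, the threshold and the inference rules below are all assumptions made up for the example; they only show that a specific index is copied from the case description while an abstract index is inferred and uses vocabulary that is not in the description.

```python
def specific_indexes(case, vocabulary):
    """Indexes copied directly from the case description."""
    return {(attr, case[attr]) for attr in vocabulary if attr in case}


def abstract_indexes(case):
    """Indexes inferred from the description (illustrative rules only)."""
    indexes = set()
    if case.get("span_m", 0) > 100:               # assumed threshold
        indexes.add(("span_class", "long"))
    if case.get("material") in {"steel", "concrete"}:
        indexes.add(("load_class", "heavy"))      # assumed inference
    return indexes


case = {"material": "steel", "span_m": 120}
specific = specific_indexes(case, ["material"])
abstract = abstract_indexes(case)
```

Here `span_class` and `load_class` do not appear among the description attributes, which mirrors the point that the vocabulary of abstract indexes differs from the description vocabulary.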

Indexes must have certain characteristics to be effective. They must predict the situations in which a case can be used. They must be abstract enough to be used in several situations, but specific enough to be easily identified. They must also enable discrimination between different scenarios, so that the differences between a case and a target problem can easily be identified.

Case indexing can be performed manually by a designer, or in an automated way using indexing algorithms. There are three main ways to select indexes automatically: based on lists, based on differences, and based on explanations. In list-based indexing there is a list of attributes that are used as indexes. Difference-based indexing uses differences between cases’ descriptions to determine the relevant indexes. Indexing based on explanations extracts indexes using models of the working domain.
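The first two automatic strategies can be sketched in a few lines. This is a simplified reading, assuming attribute/value dictionaries; the function names are hypothetical, and explanation-based indexing is left out because it needs a domain model.

```python
def list_based_indexes(case, index_attributes):
    """List-based: a fixed list says which attributes act as indexes."""
    return {a: case[a] for a in index_attributes if a in case}


def difference_based_indexes(case, other_cases):
    """Difference-based: keep attributes whose value sets this case apart
    from at least one other case."""
    return {
        a: v
        for a, v in case.items()
        if any(other.get(a) != v for other in other_cases)
    }


c1 = {"material": "steel", "span": "long", "type": "bridge"}
c2 = {"material": "wood", "span": "long", "type": "bridge"}

by_list = list_based_indexes(c1, ["type", "span"])
by_difference = difference_based_indexes(c1, [c2])
```

In this example only `material` distinguishes the two cases, so it is the only index selected by the difference-based strategy.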

2.1.5 Retrieval

Retrieval of relevant cases is an important task for a CBD system. Case relevance can have different interpretations, depending on the system’s goals. It is generally defined in terms of the similarity between a case description and the problem specification. Nevertheless, there are other ways of defining relevance. For instance, instead of being based on similarity with the problem, it can be defined in terms of the effort needed to adapt a case to the target problem (see [Smyth and Keane, 1995a]).

Retrieval can be decomposed into two tasks: retrieval of relevant cases from the case base (we will call it the retrieval task), and ranking of these cases against the target problem (the ranking task). In general, CBD systems incorporate both tasks, but some systems omit one of them.

The retrieval task is usually performed using the indexes, and is basically a search process in the indexing structure. Retrieval must be efficient and accurate in order to save time in the subsequent reasoning phases.

After a set of relevant cases is identified by the retrieval task, these cases are ranked in relation to the target problem. Usually cases are ranked using a similarity metric, which assesses the degree of similarity between the target problem and the case being ranked. A similarity metric can be a weighted sum of the degrees of similarity between attributes.

An example of a similarity function is given in equation 2.1, where $P$ is the target problem, $C$ is the case, $\omega_i$ are the weights associated with the attributes ($\sum \omega_i = 1$), $\mathit{AttrSim}$ is the attribute similarity, $p_i$ is a problem attribute, $c_i$ is a case attribute, and $n$ is the number of attributes in the case description.

$$\mathit{Similarity}(P, C) = \frac{\sum_{i=1}^{n} \omega_i \cdot \mathit{AttrSim}(p_i, c_i)}{n} \qquad (2.1)$$
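Equation 2.1 can be sketched in code as follows. The text does not define AttrSim, so the version here is an assumption: exact match for symbolic values and a range-scaled distance for numeric ones. The `ranges` parameter and both function names are hypothetical.

```python
def attr_sim(p, c, value_range=None):
    """Assumed AttrSim: scaled distance for numbers, exact match otherwise."""
    if value_range and isinstance(p, (int, float)) and isinstance(c, (int, float)):
        return max(0.0, 1.0 - abs(p - c) / value_range)
    return 1.0 if p == c else 0.0


def similarity(problem, case, weights, ranges=None):
    """Equation 2.1: weighted sum of attribute similarities divided by n."""
    ranges = ranges or {}
    attributes = list(case)          # n = number of attributes in the case description
    n = len(attributes)
    if n == 0:
        return 0.0
    total = sum(
        weights.get(a, 0.0) * attr_sim(problem.get(a), case[a], ranges.get(a))
        for a in attributes
    )
    return total / n


problem = {"material": "steel", "span_m": 100}
case = {"material": "steel", "span_m": 80}
weights = {"material": 0.5, "span_m": 0.5}   # weights sum to 1

score = similarity(problem, case, weights, ranges={"span_m": 200})
```

With these values the material attribute contributes 0.5 · 1.0 and the span attribute 0.5 · 0.9, so the weighted sum is 0.95 and the score after dividing by n = 2 is 0.475.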

Weights assign different degrees of importance to attributes. There are other similarity metrics, like recursive metrics for hierarchical case representations [Bergmann and Stahl, 1998], or less sophisticated metrics, like counting the number of word occurrences in a textual description of a case.

An important aspect of retrieval is attribute matching. When cases are represented by attribute/value pairs, a simple way to perform attribute matching is to match attributes with the same name. But even this rather simple approach can fail, for example when the same attribute is allowed to occur two or more times in a case description. The matching phase then has to decide which attributes to match. More elaborate schemes use the semantic similarity of attributes or, in hierarchical or graph case representations, structural matching, which introduces a great deal of complexity into the similarity metric.
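The repeated-attribute problem can be made concrete with a small sketch. It assumes cases are lists of (name, value) pairs with numeric values, and it resolves the ambiguity greedily by pairing each problem attribute with the closest unused case attribute of the same name. This greedy rule is only one possible choice, picked here for illustration, and `match_attributes` is a hypothetical name.

```python
def match_attributes(problem, case):
    """Pair problem and case attributes that share a name.

    Both arguments are lists of (name, value) pairs, so a name may repeat.
    When it does, each problem attribute takes the unused case attribute
    with the closest value (a greedy choice, assumed for illustration).
    """
    unused = list(case)
    pairs = []
    for name, value in problem:
        candidates = [c for c in unused if c[0] == name]
        if not candidates:
            continue                 # no case attribute with this name
        best = min(candidates, key=lambda c: abs(c[1] - value))
        unused.remove(best)
        pairs.append(((name, value), best))
    return pairs


problem = [("width", 10), ("width", 30)]
case = [("width", 28), ("width", 12), ("height", 5)]
pairs = match_attributes(problem, case)
```

Matching by name alone would leave it undefined which `width` pairs with which; here the values decide.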

Some systems make no distinction between retrieval and ranking, since they use a K-Nearest Neighbor (KNN) algorithm [Dudani, 1976] to retrieve and rank cases. This algorithm applies the similarity metric described in equation 2.1 to all cases in the case base and selects the K most similar cases. In this situation, there is no retrieval task in the sense of using indexes to retrieve cases from the case base. This can be a time-consuming way of doing retrieval, but it is one of the most accurate, because it performs a global search of the case base.
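A KNN-style retrieval can be sketched as scoring every case and keeping the K best. To stay short, the sketch uses a simplified metric (the summed weights of the attributes whose values match exactly) rather than a full attribute similarity; the metric and the function names are assumptions for illustration.

```python
import heapq


def similarity(problem, case, weights):
    """Summed weights of attributes with equal values (assumed metric)."""
    return sum(w for a, w in weights.items() if problem.get(a) == case.get(a))


def knn_retrieve(case_base, problem, weights, k):
    """Score every case in the case base and return the k most similar."""
    return heapq.nlargest(
        k, case_base, key=lambda case: similarity(problem, case, weights)
    )


case_base = [
    {"name": "c1", "material": "steel", "span": "long"},
    {"name": "c2", "material": "wood", "span": "short"},
    {"name": "c3", "material": "steel", "span": "short"},
]
weights = {"material": 0.6, "span": 0.4}

nearest = knn_retrieve(case_base, {"material": "steel", "span": "short"}, weights, k=2)
```

The result is already ranked from most to least similar, so retrieval and ranking happen in a single pass over the whole case base.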


2.1.6 Adaptation

After the retrieval phase, the selected cases may differ considerably from the target problem, making it necessary for the CBD system to adapt one or more retrieved cases to fit the target problem description [Kolodner, 1993, pages 393-437]. This phase is named adaptation and comprises the modification of one or more retrieved cases to address the specifications of the target problem. There are three basic adaptation methods (see [Kolodner, 1993, page 395] or [Maher et al., 1995, pages 110-111]): substitution, transformation and derivational replay.

Substitution methods replace values or components of the retrieved case. In a transformation method, rules or procedures are used to modify values or case components. Derivational replay assumes that the retrieved case includes the method or procedure used to generate the case solution, and applies this generation method to the target problem to produce a new solution that fits it.

Most adaptation methods use domain knowledge to modify the case solution. Several types of domain knowledge are used by adaptation methods, the most common being heuristics, domain models, causal models, rules, and semantic nets. In the remainder of this subsection we will detail some of the most common adaptation methods.

Substitution methods work by replacing a case element that is not coherent with the specification of the target problem. One possible substitution method is reinstantiation. It assumes that the retrieved case solution fits the target problem. Based on this, it uses the same structure from the case’s solution to build a new solution, with old components replaced by ones that have the same structural and functional roles but are applicable to the target problem specifications. Another substitution method is parameter adjustment, which uses interpolation to replace a variable’s value. This method is normally used on numeric variables, though it can also be applied to symbolic attributes. Local search is a method used for replacing a solution substructure. This method is local to the old solution, and it uses other memory structures to select the elements that are going to be replaced. Another method in this category is case-based substitution, which uses other cases to find a substitute component.
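Two of these substitution methods can be sketched briefly. The sketch assumes that reinstantiation is driven by a given mapping from old components to new ones, and that parameter adjustment interpolates linearly between two retrieved cases; the function names, the mapping and the numbers are hypothetical.

```python
def reinstantiate(solution, replacements):
    """Keep the solution structure; swap components that play the same role."""
    return {role: replacements.get(comp, comp) for role, comp in solution.items()}


def adjust_parameter(target_x, case_a, case_b):
    """Linearly interpolate a solution value between two retrieved cases.

    Each case is an (x, y) pair: x is a specification value and y the
    solution value that was used for it.
    """
    (xa, ya), (xb, yb) = case_a, case_b
    if xa == xb:
        return (ya + yb) / 2         # same specification: average the solutions
    t = (target_x - xa) / (xb - xa)
    return ya + t * (yb - ya)


old_solution = {"storage": "tape", "display": "crt"}
new_solution = reinstantiate(old_solution, {"tape": "disk"})

adjusted = adjust_parameter(15, (10, 100), (20, 200))
```

In both functions the structure of the old solution is untouched; only a component or a value is replaced.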

Substitution methods assume that there is a component in the old solution that can be replaced. In transformation this to-be-replaced component does not exist, so it is necessary to transform the retrieved solution so that it fits the target problem. There are several transformation methods, two of which are the use of rules and the use of domain models. In rule-based transformation, the adaptation is determined by a set of rules that transform the retrieved solution. In model-based transformation, the retrieved solution is modified using domain models. Transformation can also be performed using heuristics to determine what should be modified in the old solution. These heuristics are highly dependent on the application domain.
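Rule-based transformation can be sketched as a list of condition/action pairs applied to a copy of the retrieved solution. The rule, the attribute names and the `transform` function below are hypothetical and only illustrate the mechanism.

```python
import copy


def transform(solution, problem, rules):
    """Apply each rule whose condition holds to a copy of the solution."""
    adapted = copy.deepcopy(solution)    # the retrieved case stays unchanged
    for condition, action in rules:
        if condition(problem, adapted):
            action(problem, adapted)
    return adapted


# Hypothetical rule: add a component that the retrieved solution lacks.
rules = [
    (
        lambda p, s: p.get("needs_backup") and "backup" not in s["components"],
        lambda p, s: s["components"].append("backup"),
    ),
]

retrieved = {"components": ["server", "database"]}
adapted = transform(retrieved, {"needs_backup": True}, rules)
```

Unlike substitution, the rule here adds a component with no counterpart in the old solution.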

The previous methods modify the retrieved solution to reach a new one. In derivational replay it is not the retrieved solution that is reused, but the process that produced it. The method replays the generation process used to construct the retrieved solution, this time with the target problem specifications. The outcome is a new solution that fits the target problem specifications.
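Derivational replay can be sketched by storing, with each case, the ordered steps that produced its solution and re-running them on a new specification. The design steps, attribute names and numeric factors below are invented for illustration and do not come from any particular system.

```python
def replay(derivation, problem):
    """Re-run the stored design steps against a new problem specification."""
    solution = {}
    for step in derivation:
        step(problem, solution)      # each step reads the spec and extends the solution
    return solution


# Hypothetical derivation stored with a retrieved case.
def choose_material(problem, solution):
    solution["material"] = "steel" if problem["load"] > 50 else "wood"


def size_beam(problem, solution):
    factor = 0.05 if solution["material"] == "steel" else 0.1
    solution["depth"] = problem["span"] * factor


retrieved_case = {"derivation": [choose_material, size_beam]}
new_solution = replay(retrieved_case["derivation"], {"load": 80, "span": 20})
```

The old solution is never consulted: only the stored process is reused, so the result can differ from the retrieved solution in every component.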

In document A Case-Based Approach to Software Design (Page 75-80)