Applying Experiences of Organizing Large-Scale Knowledge Bases to Industry-Sized Software Reuse

Yufeng F. Chen and Nazir A. Warsi

Army Center of Excellence in Information Science, Department of Computer and Information Science, Clark Atlanta University, Atlanta, GA 30314.

Tel: (404) 880-6943, Fax: (404) 880-6963, Email: [email protected]

Keywords: compositional software reuse, large-scale knowledge base organization, contextual knowledge representation, multiple-view approach, case-based reasoning.

1. INTRODUCTION

Software reuse is widely believed to be a promising means of improving software productivity and reliability, and it is therefore an issue of growing interest in software engineering. Unfortunately, many difficulties prevent reuse from being widely practiced in industrial software-production environments. Although many AI knowledge-based approaches are being applied to the software engineering process [Green et al., 1986], [Lowry, 1992], software engineering practitioners have reasons for skepticism about AI's potential for solving real-world problems. This paper intends to ease this concern by transferring experience from organizing large-scale knowledge bases [Chen, 1993] to handling industry-sized software reuse problems. We first address the large-scale software reuse problems associated with the issues of complexity (depth) and scalability (volume). We then propose a multiple-view framework for compositional software reuse to overcome their inherent difficulties. AI contextual knowledge representation and case-based reasoning are integrated to implement this framework.

2. COMPLEXITY PROBLEM IN SOFTWARE REUSE

Organization Problem

The central problem in compositional software reuse is organizing collections of reusable components for effective search and retrieval. A classification and retrieval library scheme involves the construction of a taxonomic model for the set of data operations embodied in a library of software components, categorized along different aspects (called facets in the faceted classification scheme [Prieto-Diaz, 1991]). The library is queried by specifying facets and values (also called terms). However, unlike the subjects of a classic taxonomy (e.g., plants and animals), software exists in a dynamic social environment, so the organization of the component taxonomy will be evolutionary. Consequently, any taxonomy flexible enough to capture the dynamic nature of the components being classified must resist rigid definitions. For example, the popular faceted classification approach is more amenable to a query-modify retrieval cycle than a pure keyword description, but once designed, its classification scheme is static and fixed.
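The facet-and-term query described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the facet names (function, object, medium), the component names, and the exact-match rule are hypothetical and are not taken from any particular reuse library.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A library entry described by one term per facet (all values hypothetical)."""
    name: str
    facets: dict = field(default_factory=dict)


def query(library, **wanted):
    """Return names of components matching every requested (facet, term) pair."""
    return [c.name for c in library
            if all(c.facets.get(f) == t for f, t in wanted.items())]


# Illustrative library; the scheme (set of facets) is fixed once chosen.
LIBRARY = [
    Component("quick_sort", {"function": "sort", "object": "array", "medium": "memory"}),
    Component("merge_files", {"function": "merge", "object": "file", "medium": "disk"}),
    Component("heap_sort", {"function": "sort", "object": "array", "medium": "memory"}),
]
```

A call such as query(LIBRARY, function="sort") returns both sorting components, which also hints at the coarse-term problem: components described by the same terms cannot be told apart without changing the scheme itself.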

In AI, research on knowledge organization and integration has focused on similar issues in incorporating new information to update an existing knowledge base. For example, the Botany Knowledge Base developed by Potter et al. [1988] used a form of domain knowledge called views to define a segment of the knowledge base comprising concepts that interact in some significant way, and to heuristically guide search during knowledge integration. Cyc's knowledge base [Guha and Lenat, 1994], containing more than two million assertions, is organized into multiple "microtheories" which make different assumptions and simplifications about the world.

To cope with dynamic changes in software taxonomic models, we propose a multiple-view organization that allows different taxonomic versions to coexist. Different views make different assumptions and simplifications about the taxonomic representation. The system provides a mechanism for recording and reasoning with these assumptions. The problem of maintaining consistency is transformed from a global one to a local one, which in practice is vastly simpler and faster. This becomes especially important as the total size of the reusable software component library increases.
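As a minimal sketch of this idea, the following Python fragment keeps two taxonomic versions side by side and checks consistency only within the view being updated. The class name View, the recorded assumptions, and the cycle test used as the consistency criterion are our own illustrative assumptions, not part of the proposed system.

```python
class View:
    """One taxonomic version: its own assumptions and its own term -> broader-term links."""

    def __init__(self, name, assumptions):
        self.name = name
        self.assumptions = set(assumptions)  # recorded so they can be reasoned about
        self.parent = {}                     # links are valid only inside this view

    def classify(self, term, broader):
        # Local consistency check: walk only this view's chain to reject cycles.
        node = broader
        while node is not None:
            if node == term:
                raise ValueError(f"cycle in view {self.name!r}: {term}")
            node = self.parent.get(node)
        self.parent[term] = broader

    def ancestors(self, term):
        """Broader terms of term, nearest first, within this view only."""
        out = []
        node = self.parent.get(term)
        while node is not None:
            out.append(node)
            node = self.parent.get(node)
        return out


# Two coexisting taxonomies that classify the same component differently.
v1 = View("by_data_structure", {"components grouped by storage layout"})
v2 = View("by_application", {"components grouped by the domain that uses them"})
v1.classify("stack", "linear_structure")
v1.classify("linear_structure", "container")
v2.classify("stack", "expression_evaluation")
v2.classify("expression_evaluation", "compiler")
```

Updating v1 never inspects v2, so the cost of a consistency check depends on the size of one view rather than on the size of the whole library.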

Acquisition Problem

A taxonomic approach provides a framework for organizing a software component library but does not offer guidance about how to represent and retrieve the components. When reusers try to retrieve software components from an unfamiliar reuse library, they do more than just search for a piece of code; they also analyze it in the context of the semantics of the source application. The informal expression of computational intent must be created through a process of analysis. As the informal concepts are discovered, they are assigned to the specific implementation structures within the software component library [Biggerstaff et al., 1994]. Therefore, the retrieval process can be thought of as a series of queries that reusers must ask and answer to gain a proper understanding of the software components they intend to retrieve. These queries require different kinds of retrieval, which depend on knowledge associated with different views of the system [Devanbu, et al., 1991]. On the other hand, building a precise description of software components requires considerable domain analysis. The main problem in applying this approach to software libraries is that many domains cannot be easily circumscribed and the domain analysis is very difficult. This makes the construction of large-scale reuse systems very tedious and expensive.

In AI, the corresponding research areas are called knowledge acquisition and ontological engineering [Silverman and Murray, 1989]. An ontological model provides a set of general top-level semantic categories along with the associated relations and assumptions that make the model as explicit as possible. For example, the Penman Upper Model [Bateman et al., 1990] provides a domain- and task-independent classification system based on the patterns found in English expressions. On the other hand, case-based reasoning (CBR) [Kolodner, 1991] is used to alleviate the knowledge acquisition bottleneck in ontology building. This approach identifies and exploits commonalities among large sets of real-world systems. A CBR system learns continually by acquiring new cases, and it stores each case without generalizing it. A more complete and consistent ontology is built up as more new situations are encountered, which makes a CBR approach more robust. This approach is therefore quite different from the inferential one, where each solution is derived from scratch, relying on well-defined generic and domain knowledge.
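The retain-and-retrieve cycle of CBR described above can be illustrated with a small Python sketch. The feature-set encoding of a case and the Jaccard similarity measure are assumptions made for this example only; a real CBR system would use richer indexes and an adaptation step.

```python
class CaseBase:
    """Stores cases verbatim (no generalization) and retrieves the most similar one."""

    def __init__(self):
        self.cases = []  # list of (feature set, solution) pairs

    def retain(self, features, solution):
        # Learning is just storing the new case as it was encountered.
        self.cases.append((frozenset(features), solution))

    def retrieve(self, features):
        """Return the solution of the case with the highest Jaccard similarity, or None."""
        probe = frozenset(features)
        best, best_score = None, 0.0
        for stored, solution in self.cases:
            union = probe | stored
            score = len(probe & stored) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = solution, score
        return best


# Hypothetical cases: each reusable component is stored as one case.
cb = CaseBase()
cb.retain({"sort", "array", "in-place"}, "heap_sort")
cb.retain({"sort", "file", "external"}, "merge_sort_files")
```

Before any parsing case is stored, a probe such as {"parse"} retrieves nothing; after cb.retain({"parse", "expression"}, "postfix_parser") the same probe succeeds, which is the sense in which coverage grows as new situations are encountered.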

Furthermore, in building a complex knowledge-based system, it is widely recognized that no single model is adequate for a wide range of problem-solving tasks [Chittaro et al., 1993]. Efficiency cannot, in general, be achieved using only one model. Appropriate concept decomposition and cooperating views are possibly the only way of adequately coping with deep-understanding issues. Therefore, our multiple-view approach is characterized by the representation of many diverse, explicit views of software components, which are used in a cooperative way in software reuse tasks. At the same time, all software components are represented as cases, and thus there is no need for a knowledge base filled with domain concepts. Reusers only have to identify general (not domain-specific) software reuse primitives as the basis for each software compositional ontology. The combination further demonstrates how each ontological view provides guidance for indexing a case. This guidance is based on insights into the use of the case that come from a solid model of software reuse. The hybrid methodology permits relaxation of the completeness and consistency constraints imposed by the model-based approach and can help overcome shortcomings in human capabilities. CBR allows the system to improve on the original software ontological model built by humans.

3. SCALABILITY ISSUE IN SOFTWARE REUSE

Expressiveness Issue

To build an industrial-sized software reuse library, it is important that the component representation language be expressive enough to allow us to say exactly what needs to be said about a reusable component. For example, if the term space is too coarse-grained, it will be difficult to find accurate terms for classification, and a large number of components are likely to be retrieved with the same terms. This will cause a large number of uninteresting components to be found during a search. Moreover, if several equivalent conceptual forms are possible for a signature, then the technology must provide a method for determining reusability among them. This adds a significant level of computational complexity to the recognition system, making it impractical for all but small programs. As an example, the representation scheme used in the faceted classification approach is rather weak; one cannot express constraints on how a component should be reused [Sørumgård et al., 1993].

AI's experience has shown that the computational effort needed to achieve a specific goal grows exponentially with the number of variables in a single representational scheme. Instead of exhaustively expanding the term space in a single dimension, our multiple-view representation promotes expressiveness by implementing contextual knowledge using higher-order logics (like natural language). The knowledge base is organized to include a number of contexts, each representing a view with its own vocabulary (facets or terms). The notion of context is an effective solution to the problem of representing and reasoning with multiple domain theories [Guha and Lenat, 1994]. The basic idea is to encapsulate a domain view within a context, and to provide a mechanism for using the simplest and most efficient applicable context in every situation. With this capability, the same symbol can be used to denote different meanings in different view contexts. This often leads to simplification of the term space used in a large-scale software component library.
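A small Python sketch may help make the notion of context concrete. It shows only one aspect, namely that the same symbol can resolve to different meanings in different view contexts. The Context class, the fallback to a more general parent context, and the sample definitions are assumptions introduced for illustration; they do not reproduce Cyc's microtheory mechanism.

```python
class Context:
    """A view with its own vocabulary; may fall back to a more general parent context."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.meaning = {}  # symbol -> meaning, local to this context

    def define(self, symbol, meaning):
        self.meaning[symbol] = meaning

    def lookup(self, symbol):
        # Search the most specific context first, then progressively more general ones.
        ctx = self
        while ctx is not None:
            if symbol in ctx.meaning:
                return ctx.meaning[symbol]
            ctx = ctx.parent
        raise KeyError(symbol)


# Hypothetical contexts: one shared vocabulary and two view-specific ones.
generic = Context("generic")
generic.define("push", "add an element to a container")

structural = Context("structural", parent=generic)
structural.define("stack", "array or linked list with a top index")

functional = Context("functional", parent=generic)
functional.define("stack", "last-in, first-out store")
```

Because each context holds only the terms it needs, the symbol "stack" is defined twice without conflict, and neither view has to carry a combined term such as "stack-as-structure" or "stack-as-function".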

Trade-off Issue

People often trade off expressiveness to make the classification algorithm faster and easier to implement. A small term space is also easier to become familiar with, and hence easier to maintain and use. It follows that retrieval computation over the representation language must be restricted in some way to avoid complicated search chains that take exponential or unbounded time. Approaches such as plan-driven [Kozaczynski et al., 1992] or domain-model-driven [Devanbu, et al., 1991] recognition have the desirable characteristic that they derive concepts completely within small-scale programs, but they cannot deal readily with large-scale programs because the parsing procedures tend to become computationally intractable. To be effective, parsing technologies rely heavily on the premise that the concepts to be recognized are completely and unambiguously determined by the formal features of the entity being parsed. These features are therefore contextually quite local, which limits their expressiveness.

There has been much AI research on the trade-off between the amount of representational freedom and the complexity of computing subsumption in the KL-ONE family [MacGregor, 1991], [Doyle and Patil, 1991]. While the KL-ONE family restricts representational expressiveness to allow subsumption to be computed completely and efficiently, Cyc researchers argue that what should be sacrificed is completeness, not expressiveness and efficiency, both of which are viewed as essential in large-scale knowledge-based systems [Lenat and Guha, 1991]. For the software component representation language of a large-scale software library, our proposed solution follows Cyc's approach. Efficiency is handled by incorporating a host of heuristics abstracted from regularities in software properties. Completeness may be sacrificed. However, the user can determine the level of completeness by specifying resource bounds (e.g., in terms of time spent or depth of search).
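The user-specified resource bounds mentioned above can be sketched as a search that reports whether it was cut short. The function below is a hypothetical illustration: the breadth-first strategy, the parameter names max_depth and max_seconds, and the small taxonomy are our assumptions, not the retrieval procedure of the proposed system.

```python
import time
from collections import deque


def bounded_search(start, neighbors, is_match, max_depth=3, max_seconds=1.0):
    """Breadth-first search that stops when the depth or time budget is spent.

    Returns (matches, complete); complete is False if a bound cut the search short.
    """
    deadline = time.monotonic() + max_seconds
    seen, queue = {start}, deque([(start, 0)])
    matches, complete = [], True
    while queue:
        if time.monotonic() > deadline:
            return matches, False  # time budget exhausted
        node, depth = queue.popleft()
        if is_match(node):
            matches.append(node)
        for nxt in neighbors(node):
            if nxt in seen:
                continue
            if depth + 1 > max_depth:
                complete = False  # something lies beyond the depth bound
                continue
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return matches, complete


# Hypothetical taxonomy used only to exercise the bounds.
GRAPH = {
    "container": ["stack", "queue"],
    "stack": ["array_stack", "list_stack"],
    "queue": [],
}


def children(node):
    return GRAPH.get(node, [])


def is_stack_impl(node):
    return node.endswith("_stack")
```

With max_depth=2 the search finds both stack implementations and reports a complete result; with max_depth=1 it returns no matches but flags the result as incomplete, so the reuser knows that raising the bound might help.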

4. A MULTIPLE-VIEW FRAMEWORK FOR SOFTWARE REUSE

Model-based reasoning has been a very active AI research field for diagnostic problem solving [Chandrasekaran and Milne, 1985]. At the same time, increasing attention has been devoted to the cooperation of multiple models of the same system in order to improve the effectiveness and efficiency of reasoning processes. We extend the structure-behavior-function framework to our multiple-view approach for compositional software reuse. We identify four basic views for software reuse (structural, behavioral, functional, and teleological), each of which allows a variety of concrete implementations. A specific (user-defined) view can be built by modifying (composing, simplifying, or extending) these basic views using domain-specific knowledge. Several decisions remain to be made in defining the particular vocabulary of each view that is appropriate for the case under consideration. This approach treats software reuse as a cooperative activity that integrates the contributions of the diverse view models, each of which can perform a specific reuse task.

Individual Views

Structural View: It focuses on the topology of a software component; it describes which parts constitute the component and their interconnections. The vocabulary used in this view is based on syntactic description (data types, etc.). For example, the structural description of the software component stack will be either a linked list or an array. The retrieval mechanism is strongly related to syntactic pattern recognition [Gonzalez and Thomason, 1978]. Structural decomposition typically uses a parser to make the transition from source code to basic structural representation blocks (statements, procedures, etc.). Focusing on the structural view is useful when syntactic structural retrieval is preferred. As an example, when searching for a program that performs matrix multiplication, we may look for a code fragment with multiple nested iterative statements [Paul and Prakash, 1994].
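The nested-loop heuristic above can be tried out with Python's standard ast module. This is only a sketch under our own assumptions: Python source stands in for the component language, the threshold of three nested loops is an arbitrary choice, and the function names are hypothetical. It is not the pattern language of [Paul and Prakash, 1994].

```python
import ast


def max_loop_depth(node, depth=0):
    """Deepest nesting of for/while statements found below node."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        inner = depth + 1 if isinstance(child, (ast.For, ast.While)) else depth
        deepest = max(deepest, max_loop_depth(child, inner))
    return deepest


def functions_with_nested_loops(source, min_depth=3):
    """Names of top-level functions whose loop nesting reaches min_depth."""
    tree = ast.parse(source)
    return [f.name for f in tree.body
            if isinstance(f, ast.FunctionDef) and max_loop_depth(f) >= min_depth]


# Two illustrative components; only the first has triply nested loops.
SOURCE = '''
def matmul(a, b, n):
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
'''
```

Calling functions_with_nested_loops(SOURCE) selects matmul and rejects total. The match is purely syntactic, so a candidate found this way still has to be checked against the other views.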

Behavioral View: It is devoted to representing the potential behavior of a software component; it describes how components operate and interact with each other through relationships among input/output quantities. The behavior sampling technique [Podgurski and Pierce, 1993] for retrieval of reusable components is based on this view. This approach specifies the software component's interface (the number, types, and input/output modes of its parameters), executes candidate components on a representative sample of inputs, and retrieves those components that produce output satisfying criteria specified by the searcher.
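The idea of behavior sampling can be sketched as follows. The candidate components, the sampled inputs, and the acceptance predicate are all hypothetical, and the sketch omits interface matching and the sampling strategy itself; it is not the procedure of [Podgurski and Pierce, 1993].

```python
def behavior_sample(candidates, samples, accept):
    """Keep the candidates whose output satisfies accept on every sampled input."""
    selected = []
    for name, func in candidates.items():
        try:
            if all(accept(args, func(*args)) for args in samples):
                selected.append(name)
        except Exception:
            # A candidate that fails on a sample (e.g. wrong arity) is simply rejected.
            continue
    return selected


# Hypothetical library of callable components.
CANDIDATES = {
    "sorted_copy": lambda xs: sorted(xs),
    "reversed_copy": lambda xs: list(reversed(xs)),
    "largest": lambda xs: max(xs),
    "add": lambda a, b: a + b,
}

# Each sample is a tuple of arguments for a one-parameter interface.
SAMPLES = [([3, 1, 2],), ([5, 4],), ([1],)]


def is_ascending_permutation(args, out):
    """Searcher's criterion: the output is the input list in ascending order."""
    (xs,) = args
    return out == sorted(xs)
```

Here behavior_sample(CANDIDATES, SAMPLES, is_ascending_permutation) keeps only sorted_copy. Since candidates are actually executed, a practical system would need to run them in a controlled environment.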

Functional View: It is aimed at describing how the behaviors of individual components contribute to the achievement of the common goal assigned to the application by its designer. For example, the functional view of the component stack emphasizes its last-in, first-out functionality. This approach has recently been exploited by the reverse engineering community to recognize specific "plans" or "concepts" [Kozaczynski et al., 1992].

Teleological View: It describes the purpose of a software component by specifying the goals associated with it and their relationships. Continuing the same example, the purpose of the stack component in a program that evaluates a postfix expression is to "remember" the precedence rule during the conversion process. This view is intended to retrieve reusable components indexed by their goals, with the reuser specifying the purpose of reusing the component. For example, AdaBasis, a library of reusable Ada software components, is classified into nine application domains (artificial intelligence, compiler, database management, etc.). Each domain category can be used as a facet vocabulary in the teleological view.

Query View: When setting up a discovery process for software reuse, the request is posed in a query view that captures the assumptions made by the specific reuse approach. To respond to the request, indexing information can be lifted from the generic views above (represented as different contexts). This view is usually created dynamically by the reuser and is often short-lived.

Cooperating with Different Views

The structural and behavioral views provide the basic knowledge used for software component retrieval independent of application context. The functional view is defined as the relationship between a component's behavior and the goal assigned to it by the searcher. It serves as a bridge between behavioral (context-independent) and teleological (context-dependent) knowledge. Their overall interrelationship can be simplified as shown in Figure 1. The functional view therefore focuses mainly on system organization and implies a teleological explanation of behavior. The vocabulary of the functional view, which realizes the transition between these two views, must share some aspects of both ontologies.

As an application example in software reuse, there is a connection between a "slice" [Weiser, 1984] at the behavioral level and a program "concept" [Kozaczynski et al., 1992] at the functional level. The expressive power of slices is used to decompose a program into its functionally meaningful parts and to extract the code structures performing the functions [Fouqué and Matwin, 1993]. Other applications, such as the generalized behavior-based retrieval approach [Hall, 1993], also complement a software component's behavior with its functional view.

[Figure 1: diagram relating the Structural View, Behavioral View, Functional View, and Teleological View.]

5. CONCLUSIONS AND DISCUSSION

We apply the experience gained from organizing large-scale knowledge bases to develop a framework for industrial-sized software reuse. We use multiple views as vehicles for separation of concerns. They allow reusers to address only those concerns or criteria that are of interest, ignoring others that are unrelated. They also facilitate the partitioning of a software reuse process into loosely coupled, distributable views that encapsulate partial specifications described in different notations and that are locally developed and managed according to different reuse work plans. Based on this principle, our approach reduces the complexity of the software component library. In particular, it has allowed us to envisage the consequences of a fundamental decentralization of software engineering knowledge.

Contextual knowledge representation is adopted to enhance the expressiveness of software component library descriptions. An integrated CBR mechanism alleviates the knowledge acquisition difficulty, since cases are easy to obtain and need not be interpreted for their knowledge content. We are implementing our multiple-view framework using the knowledge organizational representation language (KORL) [Chen and Warsi, 1994], under development by the Army Center of Excellence in Information Science at Clark Atlanta University. Compared with highly domain-specific and algorithmic methods, our approach has the potential to handle large-scale compositional software reuse, and its computational growth appears to be linear in the length of the components under retrieval. This approach, however, may suffer from the converse problem that its results are more ambiguous and incomplete.

References

[Bateman et al., 1990] J. A. Bateman, R. T. Kasper, J. D. Moore, and R. A. Whitney, "A General Organization of Knowledge for Natural Language Processing: the Penman Upper Model," Technical report, USC/Information Sciences Institute, Marina del Rey, CA, 1990.

[Biggerstaff et al., 1994] T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster, "Program Understanding and the Concept Assignment Problem," in Communications of the ACM, vol. 37, no. 5, May 1994, pp. 72-83.

[Chandrasekaran and Milne, 1985] B. Chandrasekaran and R. Milne, "Special Section on Reasoning About Structure, Behavior and Function," in ACM SIGART Newsletter, July 1985, no. 93, pp. 4-55.

[Chen, 1993] Y. F. Chen, "Organizing Relations in Large Knowledge Bases," Ph.D. Dissertation, Department of Electrical and Computer Engineering, University of South Carolina, 1993.

[Chen and Warsi, 1994] Y. F. Chen and N. A. Warsi, "A Knowledge-Based Design for Pattern Recognition Software Re-engineering," in Proceedings of the Sixth International Conference on Artificial Intelligence and Expert Systems Applications, 1994, pp. 85-90.

[Chittaro et al., 1993] L. Chittaro, G. Guida, C. Tasso, and E. Toppano, "Functional and Teleological Knowledge in the Multimodeling Approach for Reasoning About Physical Systems: A Case Study in Diagnosis," in IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, no. 6, November/December 1993, pp. 1718-1751.

[Devanbu, et al., 1991] P. Devanbu, R. Brachman, P. Selfridge, and B. Ballard, "LaSSIE: A Knowledge-Based Information System," in Communications of the ACM, vol. 34, no. 5, May 1991, pp. 35-49.

[Doyle and Patil, 1991] J. Doyle and R. S. Patil, "Two theses of knowledge representation: language restrictions, taxonomic classification, and the utility of representational services," in Artificial Intelligence Journal, vol. 48, no. 3, 1991, pp. 261-297.

[Fouqué and Matwin, 1993] G. Fouqué and S. Matwin, "Compositional Software Reuse with Case-Based Reasoning," in Proceedings of the 9th Conference on Artificial Intelligence for Applications, 1993, pp. 128-134.

[Gonzalez and Thomason, 1978] R. C. Gonzalez and M. G. Thomason, Syntactic Pattern Recognition, An Introduction, Addison-Wesley Publishing Company, Reading, Massachusetts, 1978.

[Green et al., 1986] C. Green, D. Luckham, R. Balzer, T. Cheatham, and C. Rich, "Report on a Knowledge-Based Software Assistant," in Readings on Artificial Intelligence and Software Engineering, eds. C. Rich and R. Waters, Morgan Kaufmann, San Mateo, CA, 1986, pp. 377-428.

[Guha and Lenat, 1994] R. V. Guha and D. B. Lenat, "Enabling Agents to Work Together," in Communications of the ACM, vol. 37, no. 7, July 1994, pp. 127-142.

[Hall, 1993] R. J. Hall, "Generalized Behavior-based Retrieval," in Proceedings of 15th International

[Kolodner, 1991] J. L. Kolodner, "Improving human decision making through case-based decision aiding," in AI Magazine, vol. 12, no. 2, Summer 1991, pp. 52-68.

[Kozaczynski et al., 1992] W. Kozaczynski, J. Ning, and T. Sarver, "Program Concept Recognition," in Proceedings of The Seventh Knowledge-Based Software Engineering Conference, September 1992, pp. 216-225.

[Lenat and Guha, 1991] D. B. Lenat and R. V. Guha, "The Evolution of CycL, The Cyc Representation Language," in ACM SIGART Bulletin, vol. 2, no. 3, June 1991, pp. 84-87.

[Lowry, 1992] M. R. Lowry, "Software Engineering in the Twenty-First Century," in AI Magazine, vol. 13, no. 3, Fall 1992, pp. 71-87.

[MacGregor, 1991] R. MacGregor, "The Evolving Technology of Classification-based Knowledge Representation Systems," in Principles of Semantic Networks, ed. by J. F. Sowa, Morgan Kaufmann, San Mateo, CA, 1991, pp. 385-400.

[Paul and Prakash, 1994] S. Paul and A. Prakash, "A Framework for Source Code Search Using Program Patterns," in IEEE Transactions on Software Engineering, vol. 20, no. 6, June 1994, pp. 463-475.

[Podgurski and Pierce, 1993] A. Podgurski and L. Pierce, "Retrieving Reusable Software by Sampling Behavior," in ACM Transactions on Software Engineering and Methodology, vol. 2, no. 3, July 1993, pp. 286-303.

[Potter et al., 1988] Potter et al., "AI research in the context of a multifunctional knowledge base: The Botany Knowledge Base project," Technical Report AI-TR-88-88, Department of Computer Sciences, University of Texas at Austin, 1988.

[Prieto-Diaz, 1991] R. Prieto-Diaz, "Implementing Faceted Classification for Software Reuse," in Communications of the ACM, vol. 34, no. 5, May 1991, pp. 89-97.

[Silverman and Murray, 1989] B. G. Silverman and A. J. Murray, "Full-Sized Knowledge-Based Systems Research Workshop," in AI Magazine, Special Issue, vol. 11, no. 5, January 1991, pp. 88-94.

[Sørumgård et al., 1993] L. Sørumgård, G. Sindre and F. Stokke, "Experiences from Application of a Faceted Classification Scheme," in Proceedings of Advances in Software Reuse, 1993, pp. 116-124.

[Weiser, 1984] M. Weiser, "Program Slicing," in IEEE Transactions on Software Engineering, vol. 10, no. 4, 1984, pp. 352-357.