A software perspective
4.3 Program understanding
4.3.2 Reverse engineering
4.3.2.3 Design recovery
There is an inherent limit to the amount of aid structural redocumentation can provide to program understanding. This is because structural redocumentation uses source code alone as the basis for the reconstruction of architectural representations of the subject system. To gain an even deeper understanding, one must move to a higher abstraction level: that of the design.
Design recovery is a sub-area of reverse engineering that uses domain knowledge, external and/or informal information, and heuristics, in addition to traditional source-level analyses, to aid program understanding [Bigg89]. Its ambitious goal is to reproduce all the information needed for someone to fully understand the subject system.
An issue related to design recovery is teleological maintenance [Kar90]. It is the attempt to recover information from the subject system based on a specific user model, for example business rules, rather than from the source code. It is representative of future research in reverse engineering.
4.3.3 Approaches
At present, there is no single comprehensive approach to aiding program understanding through reverse engineering. Rather, there is a wide spectrum of tools that provide different capabilities addressing various pieces of the reverse engineering process. Although many research groups have focused their efforts on the development of tools and techniques for program understanding, the current state of practice in reverse engineering tools and techniques tends to be informal and ad hoc.
The major research issues include the design of selection algorithms to locate relevant information, the need for formalisms to represent program structure and behavior, and presentation methods to visualize system architecture and run-time execution behaviour.
Figure 4.1: Reverse engineering classification axes. Axes: domain retargetability (specific to general), automation level (manual to automatic), and analysis level (textual to conceptual).
In [Nin89], Ning identified four levels of abstraction for reverse engineering: implementation, structural, functional, and domain. The implementation-level view examines individual programming constructs. The program is typically represented as an abstract syntax tree (AST), symbol table, or plain source text. The structural-level view examines the structural relationships among the program constructs. Dependencies among program components are explicitly represented. The functional-level view examines the relationships between program structures and their behaviour ("function"). The rationale behind program constructs is investigated. The domain-level view examines concepts specific to the application domain. Thus, reverse engineering has many supporting aspects. It may focus on features such as control flows, global variables, data structures, and resource exchanges. At a higher semantic level, it may focus on behavioral features such as memory usage, uninitialized variables, value ranges, and algorithmic plans. Each of these points of investigation must be addressed differently.
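As a minimal sketch of the difference between the implementation-level and structural-level views, the following Python fragment parses a small program into an AST (implementation level) and then derives caller-to-callee dependencies among its functions (structural level). The sample program, the function name `call_dependencies`, and the use of Python's standard `ast` module are assumptions made for illustration; they are not taken from [Nin89].

```python
import ast

# Hypothetical subject program, used only for illustration.
SOURCE = '''
def read_record(path):
    return open(path).read()

def parse_record(text):
    return text.split(",")

def load(path):
    return parse_record(read_record(path))
'''

def call_dependencies(source):
    """Map each top-level function to the sorted names it calls directly."""
    tree = ast.parse(source)  # implementation-level view: the AST
    deps = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for inner in ast.walk(node):
                # Only simple calls of the form name(...) are recorded;
                # method calls such as text.split(...) are ignored.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            deps[node.name] = sorted(callees)
    return deps  # structural-level view: explicit dependencies

if __name__ == "__main__":
    for caller, callees in call_dependencies(SOURCE).items():
        print(caller, "->", callees)
```

The functional-level and domain-level views are not attempted here, since they depend on rationale and application knowledge that the source text alone does not carry.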
There are many commercial reverse engineering and re-engineering tools available; catalogs such as [OS93, Zve94] describe several hundred such packages. Most commercial systems focus on source-code analysis and simple source-code restructuring, and use the most common form of reverse engineering: information abstraction via program analysis. Research in reverse engineering consists of many diverse approaches, including: formal transformations [ABFP86], meaning-preserving restructuring [Gri91], plan recognition [RW90], function abstraction [HPLH90], maverick identification [SAP89], and graph queries [CMR92].
There are many ways of classifying reverse engineering approaches. Three of the most important are (1) by domain retargetability; (2) by automation level; and (3) by analysis level. As illustrated in Figure 4.1, these dimensions form a classification space along the following axes:
Domain retargetability
A domain is a problem area [DMR94]. An approach to reverse engineering, and the environment supporting the approach, must be flexible so that it can be applied to diverse target domains. "Domains" in this sense is an overburdened term. It includes different application domains, such as database systems, health information systems, and online documentation systems; implementation domains, including the application's implementation language; and the reverse engineering domain, in which the user applies reverse engineering to the problem of program understanding.
Automation level
While creating the semantic abstractions during the system comprehension process, it should be possible to include human input and expertise in the decision making. There is a tradeoff between what can be automated and what should or must be left to humans; the best solution lies in a combination of the two. Hence, the construction of abstract representations manually, semi-automatically, or automatically (where applicable), should be possible. Through user control, the comprehension process can be based on diverse criteria such as business policies, tax laws, or other semantic information not directly accessible from the gathered data.
Analysis level
Program understanding techniques may consider source code in increasingly abstract forms, including: raw text, preprocessed text, lexical tokens, syntax trees, annotated abstract syntax trees with symbol tables, control/data flow graphs, program plans, and conceptual models. The more abstract forms entail additional syntactic and semantic analysis that corresponds more to the meaning and behavior of the code and less to the form and structure. Different levels of analysis are necessary for different users and different program understanding purposes.
The taxonomy used here is (3): analysis levels, a classification of reverse engineering approaches based on pattern matching and abstraction levels.
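The first few forms in the analysis-level progression can be made concrete with a short sketch. Assuming Python as the subject language, and its standard `tokenize` and `ast` modules as stand-ins for a lexer and a parser, the fragment below views one statement as raw text, as lexical tokens, and as a syntax tree. The statement and the helper names `lexical_tokens` and `syntax_node_types` are hypothetical.

```python
import ast
import io
import tokenize

# Hypothetical one-line subject program.
SOURCE = "total = price * quantity\n"

def lexical_tokens(source):
    """Return (token type name, text) pairs, skipping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]

def syntax_node_types(source):
    """Return the node type names of the syntax tree, in ast.walk order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

if __name__ == "__main__":
    print("raw text:    ", repr(SOURCE))
    print("tokens:      ", lexical_tokens(SOURCE))
    print("syntax nodes:", syntax_node_types(SOURCE))
```

Each step discards some surface form (spacing, then token boundaries) and exposes more structure, which is the trade the more abstract forms make.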
This classification scheme is chosen because searching for code is an extremely common activity in reverse engineering. Maintainers must first find the relevant code before they can correct, enhance, or re-engineer it. Software engineers usually look for code that fits certain patterns.
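One way to search for code that fits a pattern is to match against the syntax tree rather than the text. The sketch below looks for one stereotypical pattern, a loop that accumulates into a variable with `+=`. The pattern, the sample program, and the function name `find_accumulation_loops` are assumptions chosen for illustration, not a technique taken from the works cited here.

```python
import ast

# Hypothetical subject program.
SOURCE = '''
def total(prices):
    s = 0
    for p in prices:
        s += p
    return s

def show(prices):
    for p in prices:
        print(p)
'''

def find_accumulation_loops(source):
    """Return (line number, accumulator name) for each for-loop whose body
    directly contains an augmented addition to a simple variable."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.For):
            continue
        for stmt in node.body:
            if (isinstance(stmt, ast.AugAssign)
                    and isinstance(stmt.op, ast.Add)
                    and isinstance(stmt.target, ast.Name)):
                matches.append((node.lineno, stmt.target.id))
    return matches

if __name__ == "__main__":
    for lineno, name in find_accumulation_loops(SOURCE):
        print(f"line {lineno}: loop accumulates into {name!r}")
```

A purely textual search for `+=` would also match augmented additions outside loops; matching on the tree restricts hits to the intended syntactic shape.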
Those patterns that are somehow common and stereotypical are known as clichés. Patterns can be structural or behavioral, depending on whether one is searching for code that has a specified syntactic or semantic structure, or looking for code components that share specific data-flow, control-flow, or dynamic (program execution-related) relationships. The sections below describe approaches to program understanding through reverse engineering based on syntactic, semantic,