
Persisting here means two things: keep rolling that rock, and figure out what the requirements mean for your persistent data.

You can spend your entire life gathering requirements rather than developing systems. Although endless acquisition of knowledge is useful, unless someone pays you for just that, at some point you need to start developing your software. Knowing when to persist in gathering and interpreting requirements and when to move on comes from experience. You can use sophisticated metrics to judge the adequacy of your requirements, or you can use your intuition. Both require a good deal of experience, either to gather benchmark data or to develop your feelings to the point of being able to judge that you've done enough. It's important to probe to the level of ambiguity resolution that suits the needs of the project.

Persistence also extends to the data itself. Since this is a book on database design, it focuses on the requirements for such design. The last section listed various requirements for the system as a whole. Running through all those requirements are hidden assumptions about persistent data. Examining a few of the requirements shows you what this means.

• The client for the commonplace book is the consulting private detective at Holmes PLC. This system is proprietary and is part of the intellectual property of Holmes PLC.

Nothing much related to databases here.

• The value of this system to the client is at least 20 million pounds sterling over a five-year period in increased revenues. These revenues come from better identification of revenue opportunities and better results of investigations, leading to increased marketability of detective services.

Nor here.

• There is increasing pressure for an automated solution to this problem from the clients, who want faster access to better information than the current system provides. There have been too many Godfrey Stauntons of late.

Here the first assumption appears about the underlying technology. "Faster access to better information" implies the storage of that information and the existence of access paths to it. The only qualifier is "faster," which could mean anything. In this case, "faster" refers to the current paper-based system. Pushing down on the requirement would probably elicit some details such as the need for access from mobile locations, the need for immediate response over a wide-area network, and the distribution of the data to Holmes PLC locations around the world. "Faster" then becomes relative to the current system, in which detectives need to call into a central office to get information, which researchers look up in paper-based file storage. The term "better" probably refers to the later requirement about "complete" data, meaning that not only must we move the database into electronic form but we must improve its content.

• The content of the system ranges from critical information about criminals and criminal events to "nice-to-have" information about vampires and vipers. Critical information includes biographical data about criminals and relevant people and organizations, case histories from Holmes PLC and police files, and agony column entries.

Here is some specific data that must persist: biographical and police information about criminals. There is also information about relevant other people, information about organizations (criminal, corporate, nonprofit, and so on), case history data, and media publications that reflect potential criminal or "interesting" activity.

• Not all information must be available immediately. Recent information is more important, and you can add historical information as time permits. Biographical information needs to be present in the first release for all criminals known to the agency. Information in the current paper-based system is more important than other information.

This requirement is similar to the "faster access" one that came previously.

• The clients would like to have the existing paper system accessible through computer searching within a year. The old system must continue to exist until the new system is up and running with equivalent information.

This requirement helps us prioritize the data requirements. See the next section for details.

• Information is more important than time, in this case. It would be better to have more complete information than to get the system running in a year.

This is a broad requirement relating to data content. The last section discussed the huge degree of ambiguity in this requirement, so it will need quite a bit of probing to refine it into something useful. It is, however, a critical requirement for the persistent data. "Complete" in this sense certainly applies to the persistent data, not to the software that accesses it. You should realize, however, that this requirement is unusual; time is often as important as, or more important than, completeness of information.

felon has been acquitted by a jury for lack of evidence. This database focuses on a longer-term perspective, which is vital to understanding how to structure the application and database.

• The information addresses the detectives' need for factual information about criminals and others involved in cases. These facts provide the basis for deductive and inductive detective work; without them, you cannot reason effectively. The computer system solves the problem of accessing a huge quantity of information quickly. It also solves the problem of cross-indexing the information for access by concept. A potential for conflict exists here because the cognitive maps of different clients differ greatly: Holmes might have an interest in "Voyages," while Watson might want to look for "Gloria Scott."

Here we have two elements that relate to persistent data: access methods and second-order data. The "huge quantity" implies that there will be terabytes of data, which requires special database management software, extensive hardware analysis, and specific requirements for access paths such as indexing, clustering, and partitioning. These will all be part of the physical database design. As well, "cross-indexing" requires data about the data—data that describes the data in ways that permit rapid access to the data you want to see. There are several different ways to organize second-order data. You can rely on text searching, which is not particularly good at finding relevant information. You can develop a keyword system, which is quite labor intensive but pays dividends in access speed: this is what Holmes did with his paper-based "indexes." You can use a simple enumerative scheme such as the Dewey decimal system that libraries use, or you can develop more elaborate systems using facets, dimensions of the problem domain, in various combinations. This is the classic style of data warehousing, for example. Probing a bit may elicit requirements in this case for extensible indexing, letting the client add facets or classifications to the scheme as needed rather than relying on a static representation.
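To make the idea of extensible, facet-based second-order data concrete, here is a minimal Python sketch. It is an illustration only, not the design this book develops: the class name FacetIndex, the facet names "topic" and "name", and the record identifiers are all hypothetical. The point it shows is that a new facet requires no change to the structure, and that two clients with different cognitive maps can reach the same record through different facets.

```python
from collections import defaultdict


class FacetIndex:
    """Hypothetical index mapping (facet, value) pairs to record identifiers."""

    def __init__(self):
        # Each (facet, value) key holds the set of records tagged with it.
        self._postings = defaultdict(set)

    def tag(self, record_id, facet, value):
        # Any facet name is accepted, so clients can add classifications
        # without a change to the index structure.
        self._postings[(facet, value.lower())].add(record_id)

    def find(self, **criteria):
        """Return the ids of records that match every facet=value criterion."""
        sets = [
            self._postings.get((facet, value.lower()), set())
            for facet, value in criteria.items()
        ]
        return set.intersection(*sets) if sets else set()


index = FacetIndex()
index.tag("case-001", "topic", "Voyages")
index.tag("case-001", "name", "Gloria Scott")
index.tag("case-002", "topic", "Voyages")

by_topic = index.find(topic="voyages")          # one client's route in
by_name = index.find(name="gloria scott")       # another client's route in
both = index.find(topic="Voyages", name="Gloria Scott")
```

In a real system the postings would live in the database rather than in memory, and the physical design would decide how to index and partition them. The sketch only shows the logical shape of the second-order data.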

• Data quality is vital. The wrong information is not just worthless, it actually impedes deductive and inductive thought. That means that data validation tools are an essential component of the system, as are processes for accomplishing validation. It also means that the database must be secure from damage, either through accident or malicious intent.

This one definitely impacts persistent data. It means a number of things. First, you must have some way to tell whether conflicting information is part of the reality your database represents or is simply wrong. People disagree, for example, in their recollection of events and things. This is not a data quality problem but an information issue. You need not concern yourself with anything but data quality. Second, you must establish a process for data entry and validation. Third, you must err on the side of caution in establishing enforcement of referential integrity constraints and other business rules. Fourth, you must defer to the database in such enforcement. Fifth, you must impose security on the persistent data. You must protect the data going in and coming out.
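The third and fourth points, cautious constraints that the database itself enforces, can be sketched with Python's standard sqlite3 module. This is a small illustration under stated assumptions, not a proposed schema: the table names person and case_history, their columns, and the sample rows are hypothetical. It shows the database, rather than the application, rejecting a case history that refers to a nonexistent person and a person with a blank name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless you turn it on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL CHECK (length(trim(name)) > 0)
    );
    CREATE TABLE case_history (
        case_id    INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES person(person_id),
        summary    TEXT NOT NULL
    );
""")

# Illustrative sample rows only.
conn.execute("INSERT INTO person (person_id, name) VALUES (1, 'Godfrey Staunton')")
conn.execute("INSERT INTO case_history (subject_id, summary) VALUES (1, 'sample summary')")


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# A case history that points at a person who does not exist.
orphan = try_insert(
    "INSERT INTO case_history (subject_id, summary) VALUES (99, 'unknown subject')"
)
# A person whose name is only whitespace.
blank = try_insert("INSERT INTO person (name) VALUES ('   ')")

case_count = conn.execute("SELECT COUNT(*) FROM case_history").fetchone()[0]
```

Because the rules live in the schema, every application and ad hoc tool that touches the data is subject to them, which is the sense in which you defer to the database for enforcement.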

• Security during system access not only is important for data validity, it also ensures that the client maintains confidentiality and secrecy during ongoing investigations. Subjects of investigations should not be able to determine whether the system has information about them or whether clients are accessing such information.

While you might think this requirement affects the software rather than the persistent data, it can affect the data also. For example, if you want the greater security of a mandatory access security scheme, you need to associate security labels with the data at all levels of the system, including the database and operating system. This quickly narrows down your technology choices, as there aren't many operating systems and database managers that support this level of security. It also has implications for authentication and access to the persistent data, which overlaps with the previous requirement relating to data integrity. Also, espionage is now easier than ever with Internet access to data. Holmes PLC depends upon the confidentiality of its trade secrets and other types of intellectual property in the database. Security at the database level protects those secrets from access through mechanisms other than applications.
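A rough Python sketch can show what associating security labels with the data means. It is a deliberate simplification: real mandatory access control is enforced by the database manager and operating system, not by application code, and the label names, the LabeledRow type, and the visible_rows function are hypothetical. The sketch implements only a simple "no read up" rule, in which a reader sees a row only when the reader's clearance is at least as high as the row's label.

```python
from dataclasses import dataclass
from enum import IntEnum


class Label(IntEnum):
    """Hypothetical security levels, ordered from least to most sensitive."""
    PUBLIC = 0
    CONFIDENTIAL = 1
    SECRET = 2


@dataclass(frozen=True)
class LabeledRow:
    subject: str
    note: str
    label: Label  # the label travels with the data itself


def visible_rows(rows, clearance):
    """Return only the rows whose label does not exceed the reader's clearance."""
    return [row for row in rows if row.label <= clearance]


rows = [
    LabeledRow("Subject A", "newspaper clipping", Label.PUBLIC),
    LabeledRow("Subject A", "informant report", Label.SECRET),
    LabeledRow("Subject B", "open case notes", Label.CONFIDENTIAL),
]

clerk_view = visible_rows(rows, Label.PUBLIC)
detective_view = visible_rows(rows, Label.SECRET)
```

Note that rows above the reader's clearance are silently omitted rather than reported as forbidden. That choice fits the requirement that subjects should not be able to determine whether the system holds information about them.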

• The most important problem with this system is a combination of invasion of privacy and intellectual property issues. Much of the material from which Holmes PLC gathers information is public. As time goes on, Holmes PLC will increasingly use information gathered from individual informants, police agencies, and other private organizations. This could cause problems with privacy laws and regulations, particularly wiretapping and electronic eavesdropping laws in varying jurisdictions around the world. Also, material gathered from public sources may be subject to copyright or other intellectual property restrictions.

The major implication of this requirement is a need to store licenses and permissions relating to proprietary and restricted data. This is another kind of second-order data. This also relates back to the security requirement, as keeping intellectual property secure is often required, as is privacy of data applying to individuals.
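One way to picture licenses and permissions as second-order data is the following sqlite3 sketch. Everything in it is an assumption made for illustration: the license and document tables, their columns, the sample rows, and the rule that a document with no recorded license counts as unrestricted. A real design would need the probing described earlier to settle what permissions actually look like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE license (
        license_id INTEGER PRIMARY KEY,
        holder     TEXT NOT NULL,
        terms      TEXT NOT NULL,
        expires_on TEXT              -- ISO 8601 date; NULL if the permission never expires
    );
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        license_id  INTEGER REFERENCES license(license_id)  -- NULL: no restriction recorded
    );
""")

# Illustrative sample rows only.
conn.execute(
    "INSERT INTO license VALUES (1, 'Example Newspaper Ltd', 'internal use only', '2001-12-31')"
)
conn.execute("INSERT INTO document VALUES (1, 'Agony column clipping', 1)")
conn.execute("INSERT INTO document VALUES (2, 'Public court record', NULL)")


def usable_on(date_iso):
    """List titles whose license, if any, has not expired on the given ISO date."""
    rows = conn.execute(
        """
        SELECT d.title
        FROM document d
        LEFT JOIN license l ON l.license_id = d.license_id
        WHERE d.license_id IS NULL
           OR l.expires_on IS NULL
           OR l.expires_on >= ?
        ORDER BY d.document_id
        """,
        (date_iso,),
    )
    return [title for (title,) in rows]


during_license = usable_on("2001-06-01")
after_expiry = usable_on("2002-01-01")
```

Storing dates as ISO 8601 text lets a plain string comparison order them correctly, which keeps the sketch short. The license rows describe the documents rather than the criminals or cases, which is what makes them second-order data.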