CHAPTER 9: THESAURUS CONSTRUCTION

6. Perform steps 4 and 5 for each level starting with level 1.

9.7 CONCLUSION

This chapter began with an introduction to thesauri and a general description of thesaural features. Two major automatic thesaurus construction methods have been detailed. A few related issues pertinent to thesauri have not been considered here: evaluation of thesauri, maintenance of thesauri, and how to automate the usage of thesauri. The focus has been on the central issue, which is the construction of thesauri. However, these secondary issues will certainly be important in any realistic situation.

REFERENCES

AITCHISON, J., and A. GILCHRIST. 1972. Thesaurus Construction -- A Practical Manual. London: ASLIB.

BOOKSTEIN, A., and D. R. SWANSON. 1974. "Probabilistic Models for Automatic Indexing." J. American Society for Information Science, 25(5), 312-18.

CAN, F., and E. OZKARAHAN. 1985. Concepts of the Cover-Coefficient-Based Clustering Methodology. Paper presented at the Eighth International Conference on Research and Development in Information Retrieval. Association for Computing Machinery, 204-11.

CHOUEKA, Y. 1988. Looking for Needles in a Haystack OR Locating Interesting Collocational Expressions in Large Textual Databases. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass. 609-23.

FORSYTH, R., and R. RADA. 1986. Machine Learning -- Applications in Expert Systems and Information Retrieval. West Sussex, England: Ellis Horwood Series in Artificial Intelligence.

FOX, E. A. 1981. "Lexical Relations: Enhancing Effectiveness of Information Retrieval Systems." SIGIR Newsletter, 15(3).

FOX, E. A., J. T. NUTTER, T. AHLSWERE, M. EVENS, and J. MARKOWITZ. 1988. Building A Large Thesaurus for Information Retrieval. Paper presented at the Second Conference on Applied Natural Language Processing. Association for Computational Linguistics, 101-08.

FOX, C. Fall 1989/Winter 1990. "A Stop List for General Text." SIGIR Forum, 21(1-2), 19-35.

FROST, C. O. 1987. "Subject Searching in an Online Catalog." Information Technology and Libraries, 6, 60-63.

GUNTZER, U., G. JUTTNER, G. SEEGMULLER, and F. SARRE. 1988. Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions. Paper presented at the Conference on User-Oriented Content-Based Text and Image Handling, MIT, Cambridge, Mass., 588-96.

HARTER, S. P. 1975. "A Probabilistic Approach to Automatic Keyword Indexing. Parts I and II." J. American Society for Information Science, 26, 197-206 and 280-89.

MCGILL, M. et al. 1979. An Evaluation of Factors Affecting Document Ranking by Information Retrieval Systems. Project report. Syracuse, New York: Syracuse University School of Information Studies.

SALTON, G., and M. MCGILL. 1983. Introduction to Modern Information Retrieval. New York: McGraw-Hill.

SALTON, G., and C. S. YANG. 1973. "On the Specification of Term Values in Automatic Indexing." Journal of Documentation, 29(4), 351-72.

SOERGEL, D. 1974. "Automatic and Semi-Automatic Methods as an Aid in the Construction of Indexing Languages and Thesauri." Intern. Classif. 1(1), 34-39.

SRINIVASAN, P. 1990. "A Comparison of Two-Poisson, Inverse Document Frequency and Discrimination Value Models of Document Representation." Information Processing and Management, 26(2), 269-78.

WANG, Y-C., J. VANDENDORPE, and M. EVENS. 1985. "Relationship Thesauri in Information Retrieval." J. American Society for Information Science, 15-27.

APPENDIX
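
The program in this appendix reads an inverted file made of "term, document number, weight" records, with all records for a term grouped together. As a minimal sketch of that input format, and not the program's own read_invfile routine, the fragment below scans such records from a string and counts distinct terms and postings. The function name count_terms, the in-memory input, and the whitespace-separated fields are assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAXWORD 20 /* maximum size of a term, matching the program's setting */

/* Hypothetical helper: count the distinct terms in an inverted file held
   in memory. Each record is "term document-number weight". Records for
   the same term are assumed to be adjacent, so a change of term marks a
   new entry. The number of records read is stored in *postings.
   Returns the number of distinct terms, or -1 on a malformed record. */
int count_terms(const char *text, int *postings)
{
    char term[MAXWORD], previous[MAXWORD] = "";
    int doc, used, terms = 0;
    float weight;

    assert(text != NULL && postings != NULL);
    *postings = 0;

    /* %19s leaves room for the terminating '\0' in a MAXWORD buffer;
       %n records how many characters this record consumed. */
    while (sscanf(text, " %19s %d %f%n", term, &doc, &weight, &used) == 3) {
        if (strcmp(term, previous) != 0) { /* first record of a new term */
            terms++;
            strcpy(previous, term);
        }
        (*postings)++;
        text += used;
    }

    /* Anything left other than whitespace is a malformed record. */
    while (isspace((unsigned char) *text))
        text++;
    return *text == '\0' ? terms : -1;
}
```

Called on the three records "cat 1 0.5", "cat 3 0.2", and "dog 2 1.0", this sketch reports two terms and three postings. Because it relies on grouping, an ungrouped file would count a term once for each separate run of its records, which is why the program header asks for grouped input.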

/*
PURPOSE: This program will generate a hierarchy in two ways.
         1) It can simply read the parent-child links from an input file and
            store the links in the inverted file structure,
         OR
         2) It can use the Rada algorithm which splits up words
            into different frequency groups and then builds links
            between them.

INPUT FILES REQUIRED: (Depends on the option selected).
         Option 1: requires inverted file and link file.
         Option 2: requires inverted file.
         1) inverted file: sequences of
               term   document number   weight.
            (multiple entries for any term should be grouped together)
         2) links file: sequences of
               parent term   child term

NOTES:   Filters such as stop lists and stemmers should be used
         before running this program.

PARAMETERS TO BE SET BY USER:
         1) MAXWORD: identifies the maximum size of a term
         2) NUMBER_OF_LEVELS: specifies the desired number of levels
            in the thesaurus hierarchy to be generated, used by the
            Rada algorithm

COMMAND LINE: (INPUT & OUTPUT FILES ARE SPECIFIED INTERACTIVELY)
         hierarky
***********************************************************************/
#include <stdio.h>
#include <stdlib.h>   /* exit ( ) */
#include <string.h>
#include <math.h>

#define MAXWORD 20           /* maximum size of a term */
#define NUMBER_OF_LEVELS 10  /* # of levels desired in the thesaurus */

struct doclist {                    /* sequences of document # and weight pairs */
    int doc;                        /* document number */
    float weight;                   /* term weight in document */
    struct doclist *nextdoc;        /* ptr. to next doclist record */
} doclistfile;

struct parentlist {
    char term[MAXWORD];             /* parent term */
    struct invert *parent;          /* ptr. to parent term in inverted file */
    struct parentlist *nextparent;  /* ptr. to next parentlist record */
} parentfile;

struct childlist {
    char term[MAXWORD];             /* child term */
    struct childlist *nextchild;    /* ptr. to next childlist record */
} childfile;

struct invert {                     /* inverted file */
    char term[MAXWORD];             /* term */
    struct doclist *doc;            /* sequences of document # and weight */
    struct parentlist *parents;     /* ptr. to parent terms */
    struct childlist *children;     /* ptr. to child terms */
    int level;                      /* thesaurus level based on term frequency */
    struct invert *nextterm;        /* ptr. to next invert record */
} invfile;

struct invert *startinv;            /* ptr. to first record in inverted file */
struct invert *lastinv;             /* ptr. to last record in inverted file */
struct doclist *lastdoc;            /* ptr. to last document in doclist */

static char currentterm[MAXWORD];   /* tracks current term in inverted file */
static int Number_of_docs;          /* total # of documents which is computed */

/* these 4 functions will obtain memory for records. The type of */
/* record is indicated by the name of the function               */
static struct invert *get_mem_invert ( );
static struct doclist *get_mem_doclist ( );
static struct parentlist *get_mem_parentlist ( );
static struct childlist *get_mem_childlist ( );

static FILE *input;                 /* inverted file */
static FILE *input1;                /* link file */
static FILE *output;                /* holds any output */

static
float cohesion ( ),                 /* compute cohesion between two terms */
      total_wdf ( ),                /* compute total frequency of term in dbse. */
      get_freq_range ( );

static
void read_invfile ( ),              /* read in the inverted file */
     read_links ( ),                /* read in the links file */
     add_link ( ),                  /* called within read_links ( ) */
     pr_invert ( ),                 /* print the inverted file */
     add_invert ( ),                /* called within read_invfile ( ) */
     write_levels ( ),              /* initialize the levels information */
     generate_Rada_hierarchy ( ),   /* generate the Rada hierarchy */
     get_term_data ( );             /* get basic information about terms */

struct invert *find_term ( );       /* searches for term in inverted file and */
                                    /* returns its address. */

int main (argc)
int argc;
{
    char ch, fname[128];

    currentterm [ 0 ] = '\0';
    Number_of_docs = 0;

    if (argc > 1)
    {
        (void) printf ("There is an error in the command line\n");
        (void) printf ("Correct usage is:\n");
        (void) printf ("hierarchy\n");
        exit (1);
    }

    (void) printf ("\nMake a selection\n");
    (void) printf ("To simply read links from a link file enter 1\n");
    (void) printf ("To use Rada's algorithm to generate links enter 2\n");
    (void) printf ("To quit enter 3\n");
    (void) printf ("Enter selection: ");
    ch=getchar ( );

    switch (ch)
    {
    case '1':
        (void) printf ("\nEnter name of inverted file: ");
        (void) scanf ("%s", fname);
        if ( (input=fopen (fname, "r") ) ==NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);   /* do not continue without an inverted file */
        }
        (void) printf ("Enter name of link file: ");
        (void) scanf ("%s", fname);
        if ( (input1=fopen (fname,"r") ) == NULL) {
            (void) printf ("cannot open file %s\n",fname);
            exit (1);
        }
        (void) printf ("Enter name of output file: ");
        (void) scanf ("%s", fname);
        if ( (output=fopen (fname,"w") ) ==NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        read_invfile ( );
        (void) fprintf (output,"\nINVERTED FILE\n\n");
        pr_invert ( );
        read_links ( );
        (void) fprintf (output,"\nINVERTED FILE WITH LINK INFORMATION\n\n");
        pr_invert ( );
        (void) fclose (input);
        (void) fclose (input1);
        (void) fclose (output);
        break;

    case '2':
        (void) printf ("\nEnter name of inverted file: ");
        (void) scanf ("%s", fname);
        if ( (input=fopen (fname,"r") ) ==NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        (void) printf ("Enter name of output file: ");
        (void) scanf ("%s", fname);
        if ( (output=fopen (fname,"w") ) ==NULL) {
            (void) printf ("cannot open file %s\n", fname);
            exit (1);
        }
        read_invfile ( );
        (void) fprintf (output,"\nINVERTED FILE\n\n");
        pr_invert ( );
        generate_Rada_hierarchy ( );
        (void) fprintf (output,"\nINVERTED FILE AFTER GENERATING RADA HIERARCHY\n\n");
        pr_invert ( );
        (void) fclose (input);
        (void) fclose (output);
        break;