Special Feature: An Automated Chinese Telephone Directory
Y. H. Chin, J. W. Jou, W. H. Peng, and C. C. Yang
Telecommunication Laboratories of Taiwan
Introduction
A real-time, on-line information storage and retrieval system for telephone information service has been designed at the Telecommunication Laboratories of Taiwan. The input and output of the system are processed in Chinese via a keyboard and a graphic display unit, respectively. System functions include sorting, merging, updating, displaying, and printing of data.
In order to avoid a serial, exhaustive search, the file is subdivided into blocks of length x. The optimal size of each block is determined so as to minimize the average number of comparisons (file accesses) to locate a record.
The system has been tested on a telephone office of medium capacity (about 8000 subscribers). The number of daily queries is about 1600 and the updating rate is approximately 50 per day. On the basis of a comparison of the system's slowest response time (0.24 sec) with the quickest manual response time, the system is at least 130 times faster.
The design philosophy stresses fast responses to queries and quick updating of the data.
For a computer, a Chinese character is not a character but a pattern; therefore, a portion of the coding is related to Chinese.
System Description
The system hardware (see Figure 1) consists of an HP/2100A computer under the control of the Moving-Head Disk Operating System (DOS-III). Two secondary storage devices are attached: a moving-head disk unit and a magnetic tape unit. The disk unit is used to store the system programs, utility programs, and data files. The data files contain the telephone directory file and the Chinese character pattern file. The magnetic tape is used as a backup unit. The addresses of the telephone subscribers are stored on the tape for printing of the telephone book.*

Input devices consist of a paper tape reader, used as a main input device for reading utility programs and data, and a keyboard. The keyboard, which is used to type the phonetic spellings of the input Chinese characters (Reference 4), contains 37 phonetic symbols: 21 consonants and 16 vowels. The phonetic spellings are translated into a string of numerical codes which is used as a search key to locate the desired record and to display it in Chinese.

*In general, an operator finds a user's phone number whenever a user's name is given in a query. Therefore, the operator at the service center has no need to find a phone number through the home address of a particular subscriber. In fact, the subscriber's address is confidential at the Query Service Center.

Figure 1. System Block Diagram of the Automatic Telephone Directory Inquiry Services System (blocks shown: phonetic symbol keyboard, tape reader, computer, graphic display, electrostatic printer, magnetic disk, magnetic tape)

Figure 2. Logical Structure of Data Base: (a) Data Base Structure and Searching Sequence of Telephone Directory Data Base (key table, telephone directory file, Chinese character pattern file of dot matrices, displaying); (b) Record Format for Telephone Directory File (user's name, 16 words; Tel. No., 3 words; pointer, 2 words)

Output devices include an HP/1331C X-Y graphic display, which is used to display the desired telephone record in Chinese, and an electrostatic printer (a VERSATEC plotter) for printing the telephone book in Chinese. The printing speed is about two lines/sec in Chinese characters. Depending on the size of the printed Chinese character, each line can contain 40 to 80 Chinese characters.

Except for the DOS-III system program, we developed our own utility programs for data processing. The functions of these programs are searching, sorting, merging, converting, updating, displaying, printing, and plotting.
Data Base Structure
In this system we have two data bases: a telephone directory data base and a Chinese character pattern data base. The telephone directory data base is a three-level indexed sequential file containing three files: (1) index table, (2) key table, and (3) telephone directory file (Figure 2a). The record format of the key table has 6 bytes for the search key, 2 bytes for the corresponding record number in the telephone directory file, and 2 bytes for a pointer used for updating. The record format of the telephone directory file (Figure 2b) consists of three fixed-length fields: a 32-byte name field, which holds a variable-length name of 1 to 16 Chinese character codes; a 6-byte telephone number field; and a 4-byte pointer field for printing the telephone book. Since each sector (256 bytes) can store six 42-byte records, it is very easy to calculate the actual location on the disk. Each record is assigned a unique record number. Through this number, the physical address of a specified record can be identified with the following rule:

$$\text{sector} = \text{starting sector of the file} + \left\lfloor \frac{S-1}{R} \right\rfloor \qquad (1)$$

$$\text{record number in the sector} = \text{remainder of } \frac{S-1}{R} \qquad (2)$$

where $S$ is the record number of the specified record, $R$ is the number of records in each sector, and the quotient in (1) is the integer part of the division. In the key table file,
the entries are stored in ascending order with respect to the numerical value of their search keys as well as their record numbers. In order to minimize the mean number of comparisons to locate the desired record, the key table file is subdivided into blocks of length x. The optimal block size x is determined with respect to the combined search methods so that the mean number of comparisons is minimized. In the key table file, each block is initially 90% filled, with the remainder reserved for overflow. After a period of time the key table is reorganized to avoid filling up the reserved overflow space. If the space does fill up, an overflow signal is given, and a new key table with 10% reserved block space is automatically regenerated. Therefore, the logical structure of the data base does not need to be changed when the system is applied to a larger city like Taipei, which has 260,000 telephone subscribers.
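The addressing rule in formulas (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `locate` and the `start_sector` argument are hypothetical, since the article does not give the file's starting sector, and record numbers are taken to start at 1.

```python
SECTOR_BYTES = 256   # sector size stated in the text
RECORD_BYTES = 42    # 32-byte name + 6-byte number + 4-byte pointer


def locate(record_number, records_per_sector, start_sector=0):
    """Apply formulas (1) and (2): return (sector, record number in the sector).

    record_number is S (numbered from 1) and records_per_sector is R.
    start_sector is a hypothetical file origin, not given in the article.
    """
    if record_number < 1:
        raise ValueError("record numbers start at 1")
    quotient, remainder = divmod(record_number - 1, records_per_sector)
    return start_sector + quotient, remainder


# Six 42-byte records fit in one 256-byte sector, as the text states.
R_DIRECTORY = SECTOR_BYTES // RECORD_BYTES

print(locate(1, R_DIRECTORY))      # first record of the file
print(locate(7, R_DIRECTORY))      # seventh record starts the next sector
print(locate(8000, R_DIRECTORY))   # last record of an 8000-subscriber office
```

Because both values come from a single integer division, no table lookup is needed to turn a record number into a disk address.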
Every Chinese character is represented by an 18 X 15 dot matrix and is assigned a unique number as its code, i.e., its record number (see Figure 3). Hence each Chinese character needs 36 bytes for its dot matrix representation. There are 1400 Chinese characters in use in our present system. These are stored in ascending order with respect to frequency of usage, so that characters with high usage can be accessed (i.e., made core-resident) very rapidly (Reference 3).

Figure 3. Chinese Character Pattern and the Record Format (a Chinese character expressed in an 18 X 15 dot matrix; record format: 18 words, 36 bytes, holding the memory data of the dot matrix). Memory data shown for words 1 to 18: 00174, 00104, 04104, 04174, 77104, 04104, 16174, 16000, 16777, 16100, 15377, 24445, 44111, 04221, 04441, 24101, 14102, 04004.

These 1400 character representations form a file called the Chinese character pattern file. To access a Chinese character, simply get this character's code and set R=7 in formulas (1) and (2). Then the location is found and the desired dot matrix of the character is displayed. This operation takes, at most, one disk access. The reasons for using a dot matrix rather than other forms, such as the Chiao-Tung Radical System (Reference 5), are that it requires a smaller number of disk accesses and permits an elegant and symmetrical presentation of the Chinese character in display and print. (Typical video and hard copy outputs are shown in Figures 4 and 5.)
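As a rough illustration of the pattern record, the sketch below decodes the 18 words printed in Figure 3 into rows of dots. It rests on assumptions the article does not state: that each printed word is an octal value, that its 15 low-order bits are the 15 dots of one row, and that the most significant bit is the leftmost dot. The word values are copied from the figure as printed and may contain reproduction errors; the names `decode` and `FIGURE_3_WORDS` are hypothetical.

```python
ROWS, COLS = 18, 15
BYTES_PER_CHARACTER = ROWS * 2                   # 18 sixteen-bit words = 36 bytes
CHARS_PER_SECTOR = 256 // BYTES_PER_CHARACTER    # R = 7, as used in the text

# Word values as printed in Figure 3, treated as octal (assumption).
FIGURE_3_WORDS = [
    "00174", "00104", "04104", "04174", "77104", "04104",
    "16174", "16000", "16777", "16100", "15377", "24445",
    "44111", "04221", "04441", "24101", "14102", "04004",
]


def decode(words, cols=COLS):
    """Turn octal words into text rows: '#' for a set dot, '.' for a clear dot."""
    rows = []
    for word in words:
        bits = format(int(word, 8), "0{}b".format(cols))
        if len(bits) > cols:
            raise ValueError("word {} does not fit in {} dots".format(word, cols))
        rows.append(bits.replace("1", "#").replace("0", "."))
    return rows


for row in decode(FIGURE_3_WORDS):
    print(row)
```

Whether the printed pattern matches the glyph in the original figure depends on the bit-order assumption above.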
Relationship Between Search Methods and Optimal Block Size
To locate a desired record $R_i$ from the file, the search sequence is as follows. A user's name is first translated into phonetic spellings, which are typed by an operator into the system through a keyboard; then the spellings are translated into a string of numerical codes which is used as a search key. (This input operation is illustrated in Figure 6.) The search key is first compared with each entry in the index table of $n/x$ entries in order to locate the block, say $B_j$, where the desired record $R_i$ is stored. The key is then sought within $B_j$. In case of a match, each Chinese character code in $R_i$ is looked up in the Chinese character pattern file in order to find its corresponding dot matrix. From the dot matrices, the desired record $R_i$ can be displayed and printed in Chinese. In case of a mismatch, an error message is sent to the operator. Depending on the received information, the operator can either retype the query or ask the requester to repeat his message.
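The search sequence above can be sketched as a two-step lookup. This is a minimal illustration, not the authors' implementation: it assumes that each index entry holds the highest key of its block, that keys are plain integers rather than the 6-byte keys of the real key table, and it pairs a sequential scan of the index with a binary search inside the block, which is only one of the combinations the article considers. The names `build_index` and `find_record` are hypothetical.

```python
from bisect import bisect_left


def build_index(key_table, block_size):
    """Split a sorted key table of (key, record_number) pairs into blocks.

    The index keeps the highest key of each block (an assumption).
    """
    blocks = [key_table[i:i + block_size]
              for i in range(0, len(key_table), block_size)]
    index = [block[-1][0] for block in blocks]
    return index, blocks


def find_record(search_key, index, blocks):
    """Return the record number for search_key, or None on a mismatch."""
    # Step 1: scan the index table sequentially to locate block B_j.
    for j, highest_key in enumerate(index):
        if search_key <= highest_key:
            break
    else:
        return None  # the key is larger than every entry in the table
    # Step 2: binary search inside block B_j.
    keys = [key for key, _ in blocks[j]]
    pos = bisect_left(keys, search_key)
    if pos < len(keys) and keys[pos] == search_key:
        return blocks[j][pos][1]
    return None


# Hypothetical key table: keys 100, 110, ..., 990 mapped to records 1..90.
key_table = [(key, number)
             for number, key in enumerate(range(100, 1000, 10), start=1)]
index, blocks = build_index(key_table, block_size=9)

print(find_record(500, index, blocks))   # a stored key
print(find_record(505, index, blocks))   # a mismatch, reported as None
```

A `None` result corresponds to the error message sent to the operator.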

Figure 4. The displayed result: (a) multiple-phone-number record; (b) variable-length record; (c) subscribers with the same name are distinguished by their addresses inside the parentheses; (d) updated record shown in the second row.

Among the various methods known for retrieving information records from a data base, we have chosen ISAM (indexed sequential access method) and the binary search method in our application. For the time being, the other methods are unsuitable for our case. For example, the difficulties in using scatter storage techniques are (1) finding a satisfactory hashing function so that, after hashing the phonetic spellings of Chinese characters, the search keys are uniformly distributed in each block of the key table; and (2) estimating a suitable storage allocation scheme for the hashing table, because the number of telephone users is increasing rapidly. This problem is particularly serious for a minicomputer with only 16K of core memory. On the other hand, linear search is time-consuming and therefore contradicts the requirement of a real-time reply to an inquiry.

In order to minimize the mean number of comparisons to locate the record $R_i$ from the block $B_j$, the problem of determining the optimal block size $x$ with respect to the search methods is solved as follows:

(1) For a sequential search in the index table and the selected block $B_j$, the optimal block size is $n^{1/2}$;
(2) For a sequential search in the index table and a binary search in the block $B_j$, the optimal block size is $(n \ln 2)/2$;
(3) For a binary search in the index table and the selected block $B_j$, the optimal block size is $n^{1/2}$;
(4) For a binary search in the index table and a sequential search in the selected block $B_j$, the optimal block size does not exist;

where $n$ is the number of records in the telephone directory.**
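The square-root results in cases (1) and (3) can be checked numerically by brute force. The sketch below uses the mean-comparison expressions given in the article's detailed proof and tries every integer block size; the function names and the choice of $n = 10{,}000$ (picked only because its square root is an integer) are assumptions for illustration.

```python
import math


def mean_comparisons_seq_seq(n, x):
    """Case (1): sequential search in the index table and in the block."""
    return 0.5 * (n / x + 1) + 0.5 * (x + 1)


def mean_comparisons_bin_bin(n, x):
    """Case (3): binary search in the index table and in the block."""
    return (math.log2(n / x + 1) - 1) + (math.log2(x + 1) - 1)


def best_block_size(cost, n):
    """Try every integer block size from 1 to n and keep the cheapest."""
    return min(range(1, n + 1), key=lambda x: cost(n, x))


n = 10_000
for label, cost in [("sequential/sequential", mean_comparisons_seq_seq),
                    ("binary/binary", mean_comparisons_bin_bin)]:
    x = best_block_size(cost, n)
    print(label, "best x =", x, "mean comparisons =", round(cost(n, x), 2))
print("square root of n =", math.isqrt(n))
```

Both searches settle on the same block size, but the mean number of comparisons differs by roughly an order of magnitude.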
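**Detailed proof:
An ordered file of $n$ records (either of fixed length or of variable length) is subdivided into blocks of length $x$. There are $n/x$ blocks, each having $x$ records. The optimal block size $x$ with respect to the various search methods is proved as follows, where $a$ denotes the mean number of comparisons to locate a record from the file.

(1) For a sequential search in the index table and in the block $B_j$, the optimal block size is $n^{1/2}$.
Proof: The mean number of comparisons to locate a record from the file is
$$a = \frac{1}{2}\,\frac{(n/x+1)(n/x)}{n/x} + \frac{1}{2}\,\frac{x(x+1)}{x} = \frac{1}{2}\left(\frac{n}{x}+1\right) + \frac{1}{2}(x+1)$$
$$\frac{da}{dx} = 0 \;\Rightarrow\; \frac{1}{2}\left(-\frac{n}{x^2}+1\right) = 0 \;\Rightarrow\; x = n^{1/2}$$
$$\left.\frac{d^2a}{dx^2}\right|_{x=n^{1/2}} = \frac{n}{x^3} = \frac{1}{n^{1/2}} > 0 \;\Rightarrow\; x_{\min} = n^{1/2}$$

(2) For a sequential search in the index table and a binary search in the block $B_j$, the optimal block size is $(n \ln 2)/2$.
Proof:
$$a = \frac{1}{2}\left(\frac{n}{x}+1\right) + \left(\frac{(x+1)\log_2(x+1)}{x} - 1\right) \approx \frac{1}{2}\left(\frac{n}{x}+1\right) + \left(\log_2(x+1) - 1\right)$$
(see Reference 1, for $x > 10$)
$$\frac{da}{dx} = 0 \;\Rightarrow\; x \approx \frac{n \ln 2}{2}$$
$$\left.\frac{d^2a}{dx^2}\right|_{x=(n\ln 2)/2} \approx \frac{4}{(\ln 2)^3 n^2} > 0 \;\Rightarrow\; x_{\min} = \frac{n \ln 2}{2}$$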

Figure 5. The Printed Result. The 5-digit numbers represent continued phone numbers; e.g., 316364 represents two telephone numbers, 3163 and 3164.

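(3) For a binary search in the index table and the block $B_j$, the optimal block size is $n^{1/2}$.
Proof:
$$a = \left(\log_2(n/x+1) - 1\right) + \left(\log_2(x+1) - 1\right)$$
$$\frac{da}{dx} = 0 \;\Rightarrow\; x = n^{1/2}$$
$$\left.\frac{d^2a}{dx^2}\right|_{x=n^{1/2}} = \frac{2}{\ln 2 \; n^{1/2}\left(n^{1/2}+1\right)^2} > 0 \;\Rightarrow\; x_{\min} = n^{1/2}$$

(4) For a binary search in the index table and a sequential search in the block $B_j$, the optimal block size does not exist.
Proof:
$$a = \left(\log_2(n/x+1) - 1\right) + \frac{1}{2}(x+1)$$
$$\frac{da}{dx} = 0 \;\Rightarrow\; x = 0 \text{ or } x = -n$$
This is impossible since $x$ is a positive integer.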
In all four of these expressions, the first term and the second term represent the mean number of comparisons in the index table and in the selected block $B_j$, respectively. For large $n$ (e.g., $n > 1000$), the value of $a$ in (2) and (3) is always less than the value in (1) whenever $x = x_{\min}$; therefore, the binary search should be adopted in either the index table or the block $B_j$, or in both.

Figure 6. Typical Example of Input Operations for Searching a Telephone Record (rows shown: user's name in Chinese, given by an inquirer; phonetic spellings, typed by an operator on a keyboard; translated numeric value, system generated; actual search key, starting to search: 00536 05070 07366)
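The comparison of cases (1) to (3) at their optimal block sizes can be made concrete for the office sizes mentioned in the article. The sketch below evaluates the proof's expressions for $a$ at $x = x_{\min}$; it uses the large-$x$ approximation for case (2), and the function names are hypothetical.

```python
import math


def a_case1(n, x):
    """Sequential search in the index table and in the block."""
    return 0.5 * (n / x + 1) + 0.5 * (x + 1)


def a_case2(n, x):
    """Sequential index search, binary block search (large-x approximation)."""
    return 0.5 * (n / x + 1) + (math.log2(x + 1) - 1)


def a_case3(n, x):
    """Binary search in the index table and in the block."""
    return (math.log2(n / x + 1) - 1) + (math.log2(x + 1) - 1)


def compare(n):
    """Mean comparisons for each case at its own optimal block size."""
    root = math.sqrt(n)
    return {
        "case 1": a_case1(n, root),
        "case 2": a_case2(n, n * math.log(2) / 2),
        "case 3": a_case3(n, root),
    }


# 8000 is the tested office; 260,000 is the Taipei figure quoted in the text.
for n in (1000, 8000, 260_000):
    values = compare(n)
    print(n, {name: round(value, 2) for name, value in values.items()})
```

For each of these sizes the two cases that use a binary search need far fewer comparisons than the purely sequential case, in line with the article's recommendation.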
System Testing
To insert a new record into the system, the whole information of a telephone record is first stored at the bottom of the telephone directory file. Then the record's search key (the phonetic spellings of the selected characters from the user's name) is stored at the appropriate location in the ordered key table file. This operation takes two disk accesses: one for writing the search key into the key table and one for writing the record in the telephone directory file. It takes an average of 10 seconds to key in a telephone record of 8 Chinese characters, which includes the character codes, the phonetic spellings of the search key, and the telephone number. Hence, the on-line insertion operation is not used in the testing system with one CPU during the busy hours (9-12 A.M. and 2-5 P.M. daily; Reference 2).
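The insertion step described above can be sketched as follows. This is an in-memory illustration under assumptions, not the authors' disk routine: the directory file is modeled as a Python list, one key table block as a sorted list of (search key, record number) pairs with a fixed capacity, and the names `insert_record` and `OverflowSignal` are hypothetical.

```python
from bisect import insort


class OverflowSignal(Exception):
    """Raised when a key table block has no reserved space left."""


def insert_record(directory, key_block, capacity, search_key, record):
    """Append the record to the directory and file its key in order.

    Returns the new record number (record numbers start at 1).
    """
    if len(key_block) >= capacity:
        raise OverflowSignal("block is full; the key table must be regenerated")
    directory.append(record)                        # stored at the bottom of the file
    record_number = len(directory)
    insort(key_block, (search_key, record_number))  # keep the block in key order
    return record_number


# Hypothetical data: one existing subscriber, block capacity of three entries.
directory = [("name-1", "3163")]
key_block = [(300, 1)]

print(insert_record(directory, key_block, 3, 100, ("name-2", "3164")))
print(key_block)
```

The capacity check comes first so that an overflow leaves both the directory and the key table untouched.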
The modification is simple: the record to be modified is first fetched and displayed on the graphic display (see Figure 5); then the information to be added or deleted can be processed similarly to an edit operation.

The system's regenerating time, defined as the time of regenerating the key table, equals

$$\frac{b\,(2d_t + p_t)}{n}$$

where $b$ is the number of blocks in the key table, $d_t$ is the average disk access time, $p_t$ is the CPU time, and $n$ is the total number of records in the telephone directory file. At present, the system's regeneration time (i.e., rebuilding cost) is approximately 0.7 milliseconds per record.

The total size of our utility programs is about 6K bytes, and the size of the data file is about 313K bytes, of which 210K bytes are used for the telephone directory file, 50K bytes for the key table, and 53K bytes for the Chinese character pattern file. The total size of the system is about 320K bytes.
The response time to display a record in Chinese depends on the number of Chinese characters in the retrieved record. On the average, it takes 30 milliseconds (the average access time of an HP/7900 disk) to access a Chinese character. In the worst case, it takes 0.24 sec to display a record of 8 Chinese characters on the HP/1331C graphic display.
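The worst-case figure follows from one disk access per character: 8 characters at 30 milliseconds each give 240 milliseconds. The sketch below reproduces that arithmetic and shows the effect of keeping some characters in core. It is a simplified model under assumptions: CPU and display time are ignored, a core-resident character is taken to cost no disk access, and the character codes are hypothetical.

```python
ACCESS_MS = 30   # average HP/7900 disk access time quoted in the text


def display_time_ms(character_codes, core_resident=frozenset(), access_ms=ACCESS_MS):
    """Estimate display time as one disk access per non-resident character."""
    disk_reads = sum(1 for code in character_codes if code not in core_resident)
    return disk_reads * access_ms


# Hypothetical record of 8 character codes.
record = [12, 7, 300, 45, 9, 1021, 77, 640]

print(display_time_ms(record))                             # 240
print(display_time_ms(record, core_resident={7, 9, 12}))   # 150
```

Under this model, every frequently used character held in core removes one 30-millisecond access from the response time.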
Judging by the comparison of the system's slowest response time (0.24 sec) with the quickest manual response time, when some of the most frequently used Chinese characters (about 100) are made core-resident, the system's response will improve to 200 times faster than that of a manual system (Reference 3).

Acknowledgment
The authors would like to acknowledge the corrections and suggestions of Dr. R.C.T. Lee and Mr. Pierre Loisel. They are also indebted to the referees for their useful comments and to colleagues in the Laboratories for their discussions.

References
1. G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Company, N.Y., 1968.
2. C. C. Yang, "Statistic Analysis on the Traffic of Taipei Query Service Center," Quarterly Report of the Telecommunication Labs, Vol. 3, No. 4, pp. 69-126, October 1973.
3. C. V. Ramamoorthy and Y. H. Chin, "An Efficient Organization of Large Frequency Dependent Files for Binary Searching," IEEE Trans. on Computers, October 1971, pp. 1178-1187.
4. S. K. Chang, C. S. Chiu, M. H. Yang, and B. S. Lin, "PEACE - A Phonetic Encoding and Chinese Editing System," Proceedings of the First International Symposium on Computers and Chinese I/O Systems, Academia Sinica, Taipei, Taiwan, R.O.C., Aug. 14-16, 1973, pp. 29-47.
5. C. C. Hsieh, M. W. Du, et al., "The Chiao-Tung Radical System," Proceedings of the First International Symposium on Computers and Chinese I/O Systems, pp. 49-78.

Yeh-hao Chin is an assistant professor in the Computer Sciences Department at Northwestern University. Earlier, he was with the Telecommunication Laboratories in Chung-Li, Taiwan, where he was responsible for the design and development of software systems for telephone service systems and electronic switching systems. During this period he was also a research fellow at the Electronic Laboratory of UC Berkeley, and an adjunct associate professor in the Computer Science Department of the National Chiao-Tung University.

He received the BSEE from the National Taiwan University in 1966 and the MS and PhD degrees in electrical engineering from the University of Texas in 1970 and 1972. Dr. Chin's research and teaching interests are in the design and development of data base systems for Chinese input and output.

Jun Wun Jou is a research scientist at the Telecommunication Laboratories in Taiwan, where he leads a group in designing an automatic telephone directory inquiry service system for the Taipei area. Jou received his MS degree from National Chiao-Tung University, Hsinchu, Taiwan, in 1970. He is a member of the Institute of Electrical Engineering of the Republic of China.

System Improvement
At present, each operation takes a certain number of disk accesses. The disk accesses include not only record retrieval but also location of the Chinese characters in the retrieved record; the time spent in locating the Chinese characters is especially significant. In order to reduce the disk access time, the usage frequency of each Chinese character is being monitored so that those characters used most often can be made core-resident. Also, some coding methods such as threshold functions for queries will be adopted to facilitate a user's query. The "best" threshold value is being studied so that whenever a communication ambiguity occurs, the misspelled phonetic input will still give a correct answer (i.e., get the desired record). In summary, we plan to do the following in the near future: (1) make the input query format more flexible, (2) reduce the number of disk accesses, and (3) modify and extend the operating system when the system is working in a time-sharing environment with multiple terminals.

W. H. Peng works for the Telecommunication Laboratories in Chung-Li, Taiwan, where he is engaged in research programs concerning information storage and retrieval. He received the BSEE and MSEE from the National Chiao-Tung University, Hsinchu, Taiwan, in 1968 and 1970.

Chen Chau Yang is a research scientist in the Computer Scientist Group of the Telecommunication Laboratories, where he is in charge of the design and implementation of a system at the Taipei Query Station. Yang received the BSEE and MSEE from National Chiao-Tung University, Hsinchu, Taiwan, in 1969 and 1971, and is a member of the Institute of Electrical Engineering of the Republic of China. His current research interests are in the field of information storage and retrieval.