BM307 File Organization

(1)

Gazi University

Computer Engineering Department

(2)

Index



Sequential File Organization



Binary Search



Interpolation Search



Self-Organizing Sequential Search



Direct File Organization



Locating Information



Hashing Functions



Collision Resolution



Coalesced Hashing

(3)

File Organization



Goal



Organizing files efficiently in terms of both

space and performance



File Organization

File Access

sequential

indexed sequential

sequential & direct

(4)

File Access Types



Sequential – accessing multiple records (often an entire

file) and usually according to a predefined order



Direct (random) – locating a single record



Question



How can we have an effective organization?



Answer



matching the type of organization with the

(5)

Sequential File Organization

Background

Fields (e.g., employee name, number)

Records → contain data about individual entities

Files (e.g., employee list)

Primary key → field(s) that uniquely distinguish a record from all others

(6)

Sequential File Organization



File



consists of records of the same format



Fixed-length records



Variable-length records



Sequential File Organization



(i+1)st element of a file

(7)

Sequential File Organization



Sequential access



moving from one record in the

file to the next by incrementing the address of the

current record by the record size



Direct access



processing a single record directly if

(8)

Sequential File Organization



Probe



access to a distinct location



Sequential Search



In an entire file of N records



N/2 probes are needed on average

Need to probe the entire file for an unsuccessful retrieval

Computational complexity O(N)



Appropriate when N is small



Performance improvement?
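The probe counts above can be checked with a minimal sketch. The function name `sequential_search` and the sample keys are illustrative assumptions, not part of the slides. For N records the exact average for a successful search is (N+1)/2, which is close to N/2 for large N.

```python
def sequential_search(records, sought):
    """Return (index, probes); index is None when the key is absent."""
    probes = 0
    for index, key in enumerate(records):
        probes += 1                      # one probe per record examined
        if key == sought:
            return index, probes
    return None, probes                  # unsuccessful: every record was probed


keys = [27, 18, 29, 28, 39, 13, 16]
print(sequential_search(keys, 28))       # successful retrieval
print(sequential_search(keys, 99))       # unsuccessful retrieval

# Average number of probes over all successful retrievals.
average = sum(sequential_search(keys, k)[1] for k in keys) / len(keys)
print(average)
```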

(9)

Eg. -

Sequential Search



100000 records, each record size is 400 bytes, block size is 2400 bytes.

Sequential search time for retrieving 10000 records?

Each probe → one block of data

(100000 * 400) / 2400 ≈ 16667 blocks

Reading time for one block → 0.84 ms (IBM 3380)

Time requirement for each record → (16667 / 2) * 0.84 ms ≈ 7 sec

For 10000 records → 7 sec * 10000 ≈ 19 hours



Better organization is needed!!
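The arithmetic of this example can be reproduced in a few lines. The variable names are illustrative; the figures are the ones given on the slide.

```python
import math

records = 100_000
record_size = 400          # bytes
block_size = 2400          # bytes
block_read_ms = 0.84       # reading time for one block (IBM 3380)
lookups = 10_000

blocks = math.ceil(records * record_size / block_size)
# On average half of the blocks are read to find one record.
seconds_per_record = (blocks / 2) * block_read_ms / 1000
hours_for_all = seconds_per_record * lookups / 3600

print(blocks, round(seconds_per_record, 2), round(hours_for_all, 1))
```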

(10)

Sequential File Organization



Binary Search

Requires sorting

Compares the key of the sought record with the middle record of the file

Half of the file is eliminated in each turn

Computational complexity O(log₂ n)
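A minimal sketch of binary search with a probe counter, assuming the file is modelled as a sorted Python list held in memory (the names `binary_search`, `lower`, `upper` are illustrative):

```python
def binary_search(sorted_keys, sought):
    """Return (index, probes); index is None when the key is absent."""
    lower, upper = 0, len(sorted_keys) - 1
    probes = 0
    while lower <= upper:
        middle = (lower + upper) // 2
        probes += 1
        if sorted_keys[middle] == sought:
            return middle, probes
        if sorted_keys[middle] < sought:
            lower = middle + 1           # discard the lower half
        else:
            upper = middle - 1           # discard the upper half
    return None, probes


keys = sorted([27, 18, 29, 28, 39, 13, 16])
print(binary_search(keys, 39))
print(binary_search(keys, 20))
```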

(11)

Sequential File Organization

(12)

Sequential File Organization

Interpolation Search



Approximate relative position

Eg.: Searching for a name in a telephone book

Chooses the next position for a comparison based upon the estimated position of the sought key relative to the remainder of the file to be searched

$$\text{NEXT} := \text{LOWER} + \frac{\text{key}[\text{sought}] - \text{key}[\text{LOWER}]}{\text{key}[\text{UPPER}] - \text{key}[\text{LOWER}]} \, (\text{UPPER} - \text{LOWER})$$

Worst case computational complexity O(n)

Average case computational complexity O(log₂ log₂ n)

Its performance improves as the distribution of keys becomes more uniform
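A minimal sketch of interpolation search over a sorted list of integer keys. The slide does not say how the fractional position is rounded; this sketch rounds down, which is an assumption, and the function name is illustrative.

```python
def interpolation_search(sorted_keys, sought):
    """Return (index, probes); index is None when the key is absent."""
    lower, upper = 0, len(sorted_keys) - 1
    probes = 0
    while lower <= upper and sorted_keys[lower] <= sought <= sorted_keys[upper]:
        key_range = sorted_keys[upper] - sorted_keys[lower]
        if key_range == 0:
            nxt = lower                  # all remaining keys are equal
        else:
            # NEXT := LOWER + (sought - key[LOWER]) / (key[UPPER] - key[LOWER]) * (UPPER - LOWER)
            # computed with integer division (rounding down: an assumption)
            nxt = lower + (sought - sorted_keys[lower]) * (upper - lower) // key_range
        probes += 1
        if sorted_keys[nxt] == sought:
            return nxt, probes
        if sorted_keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return None, probes


uniform_keys = list(range(10, 110, 10))   # 10, 20, ..., 100
print(interpolation_search(uniform_keys, 70))
print(interpolation_search(uniform_keys, 75))
```

With uniformly distributed keys the first estimate already lands on the sought record, which illustrates why performance improves as the distribution becomes more uniform.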

(13)

Binary search should be preferred when the data is stored in primary memory

Why?

Interpolation search should be preferred when the data is stored in auxiliary memory

(14)

Binary search should be preferred when the data is stored in primary memory

The additional calculations needed for the interpolation search cancel any savings gained from fewer probes

Interpolation search should be preferred when the data is stored in auxiliary memory

An access of auxiliary storage is an order of magnitude greater than

(15)

Sequential File Organization

Self-Organizing Sequential Search



Modifies the order of records



Moves the most frequently retrieved records to the beginning

of the file

Most popular algorithms:



Move_to_front



Transpose

(16)

Sequential File Organization



Move_to_front

The sought record is moved to the front position of the file

Risk of a big mistake: a record may be accessed, moved to the front of the file, and then rarely if ever accessed again

A linked implementation is preferable even though it takes more storage

Appropriate when space is not limited and locality of access is important

Essentially the same as the LRU (least recently used) paging algorithm

(17)

Eg. -

Move_to_front

The records are accessed in the order of “fileediting”

initial: a b c d e f g h i j k l m n o p r q s t v w y z
f: f a b c d e g h i j k l m n o p r q s t v w y z
i: i f a b c d e g h j k l m n o p r q s t v w y z
l: l i f a b c d e g h j k m n o p r q s t v w y z
e: e l i f a b c d g h j k m n o p r q s t v w y z
e: e l i f a b c d g h j k m n o p r q s t v w y z
d: d e l i f a b c g h j k m n o p r q s t v w y z
i: i d e l f a b c g h j k m n o p r q s t v w y z
t: t i d e l f a b c g h j k m n o p r q s v w y z
i: i t d e l f a b c g h j k m n o p r q s v w y z
n: n i t d e l f a b c g h j k m o p r q s v w y z
g: g n i t d e l f a b c h j k m o p r q s v w y z
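The trace above can be replayed with a short simulation. This is a sketch that models the file as a Python list rather than the linked implementation the slides recommend; the function names are illustrative.

```python
def move_to_front(records, sought):
    """Sequentially search for `sought`, move it to the front, return probes used."""
    position = records.index(sought)             # sequential search
    records.insert(0, records.pop(position))     # move the sought record to the front
    return position + 1


def run_move_to_front(initial, accesses):
    """Apply a sequence of accesses and return the final order as a string."""
    records = list(initial)
    for key in accesses:
        move_to_front(records, key)
    return "".join(records)


print(" ".join(run_move_to_front("abcdefghijklmnoprqstvwyz", "fileediting")))
```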

(18)

Sequential File Organization



Transpose

Interchanges the sought record with its immediate predecessor

More stable than the Move_to_front algorithm

A record needs to be accessed many times before it is moved to the front of the list

Easily implemented

Does not need additional space

(19)

Eg. -

Transpose

The records are accessed in the order of “fileediting”

initial: a b c d e f g h i j k l m n o p r q s t v w y z
f: a b c d f e g h i j k l m n o p r q s t v w y z
i: a b c d f e g i h j k l m n o p r q s t v w y z
l: a b c d f e g i h j l k m n o p r q s t v w y z
e: a b c d e f g i h j l k m n o p r q s t v w y z
e: a b c e d f g i h j l k m n o p r q s t v w y z
d: a b c d e f g i h j l k m n o p r q s t v w y z
i: a b c d e f i g h j l k m n o p r q s t v w y z
t: a b c d e f i g h j l k m n o p r q t s v w y z
i: a b c d e i f g h j l k m n o p r q t s v w y z
n: a b c d e i f g h j l k n m o p r q t s v w y z
g: a b c d e i g f h j l k n m o p r q t s v w y z
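A short simulation of the Transpose rule, useful for checking a trace step by step. The file is modelled as a Python list and the function names are illustrative assumptions.

```python
def transpose(records, sought):
    """Sequentially search for `sought`, swap it with its predecessor, return probes used."""
    position = records.index(sought)             # sequential search
    if position > 0:                             # the first record has no predecessor
        records[position - 1], records[position] = records[position], records[position - 1]
    return position + 1


def run_transpose(initial, accesses):
    """Apply a sequence of accesses and return the final order as a string."""
    records = list(initial)
    for key in accesses:
        transpose(records, key)
    return "".join(records)


print(" ".join(run_transpose("abcdefghijklmnoprqstvwyz", "fileediting")))
```

Each access moves a record by at most one position, which is why a record must be accessed many times before it reaches the front.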

(20)

Sequential File Organization



Count

Keeps count of the number of accesses of each record

The file is always kept in decreasing order of access frequency

Requires extra storage to keep the count
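A minimal sketch of the Count method. The slides do not say how ties are broken; here a record only moves ahead of records with a strictly smaller count, which is an assumption, as are the names used.

```python
def count_access(records, counts, sought):
    """Access `sought`, update its count, and restore decreasing-count order."""
    position = records.index(sought)             # sequential search
    counts[sought] = counts.get(sought, 0) + 1   # the extra storage: one counter per record
    # Move the record toward the front past every record with a smaller count.
    while position > 0 and counts.get(records[position - 1], 0) < counts[sought]:
        records[position - 1], records[position] = records[position], records[position - 1]
        position -= 1
    return position


records = list("abcde")
counts = {}
for key in "ecc":
    count_access(records, counts, key)
print("".join(records), counts)
```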

(21)

Direct File Organization

Ideally, we want to go directly to the address where the record is stored

A key can be a unique address → one probe

More address space than needed

[Figure: 1-1 correspondence between the key space (0 to 999-99-9999) and the address space (0 to 999-99-9999)]

(22)

Direct File Organization

Converting information into a unique address

Eg.: Airline reservation system

Flight numbers from 1 to 999

Days are numbered from 1 to 366

Flight number and day of the year could be concatenated to determine the location

Location = flight number || day of the year

Address range → 001001-999366 (???367 - ???999 would not exist)
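The concatenation can be sketched as simple arithmetic on a three-digit flight number and a three-digit day. The function name `location` and the range checks are illustrative assumptions.

```python
def location(flight_number, day_of_year):
    """Concatenate a 3-digit flight number and a 3-digit day into one address."""
    if not 1 <= flight_number <= 999:
        raise ValueError("flight number must be between 1 and 999")
    if not 1 <= day_of_year <= 366:
        raise ValueError("day of the year must be between 1 and 366")
    return flight_number * 1000 + day_of_year


first = location(1, 1)          # 001001
last = location(999, 366)       # 999366
used = 999 * 366                # addresses that can actually occur
span = last - first + 1         # addresses covered by the range
print(f"{first:06d} {last:06d} {used / span:.2f}")
```

Only a little over a third of the address range can ever be occupied, which is the wasted address space the next slides address.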

(23)

Direct File Organization



The key converts to a probable address

If we remove most of the empty spaces in the address space, we lose the 1-1 correspondence between keys and addresses

Hashing functions are used to map the wider range of key values into the narrower range of address values

Hash(key) → probable address

Initial probable address → home address

A hashing function should

evenly distribute the keys among the addresses

execute efficiently

(24)

Direct File Organization



A collision occurs when two distinct keys map to the same address

Hashing is then composed of two aspects:

The function

The collision resolution method

[Figure: key space (0 to 999-99-9999) mapped onto a smaller address space (0 to 1200)]

(25)

Direct File Organization

(26)

Direct File Organization

Hashing Functions

Squaring

Taking the square of a key and then substringing or truncating a portion of the result

Radix conversion

The key is considered to be in a base other than 10 and is then converted into a number in base 10

Eg.: Base 11

$$1234_{11} = 1 \cdot 11^3 + 2 \cdot 11^2 + 3 \cdot 11^1 + 4 \cdot 11^0 = 1331 + 242 + 33 + 4 = 1610$$
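Both functions can be sketched briefly. The slide does not say which portion of the square is kept; taking the middle digits is an assumption, and the function names and the `digits` parameter are illustrative. In practice the result would still be reduced to the address range, for example with a mod operation.

```python
def squaring_hash(key, digits=4):
    """Square the key and keep `digits` digits from the middle of the result (assumed portion)."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def radix_conversion_hash(key, base=11):
    """Read the decimal digits of the key as a number written in `base`."""
    return int(str(key), base)


print(squaring_hash(1234))            # 1234 * 1234 = 1522756
print(radix_conversion_hash(1234))    # the base 11 example from the slide
```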

(27)

Direct File Organization

Hashing Functions

Polynomial hashing

The key is divided by a polynomial

f(information area) → cyclic check bytes

Alphabetic keys

Alphabetic or alphanumeric key values can be input to a hashing

(28)

Direct File Organization

Collisions

For a given set of data, one hashing function may distribute the keys more evenly over the address space than another

A hashing function that has a large number of collisions is said to exhibit primary clustering

It is better to have a slightly more expensive hashing function for data that need to be stored on auxiliary storage

Another method for reducing collisions is reducing the packing factor

$$\text{Packing factor} = \frac{\text{number of records stored}}{\text{total number of storage locations}}$$

[Figure: plot with axis label "collisions"]

(29)

Direct File Organization

Collision Resolution



Collision resolution with links



Collision resolution without links



Static positioning of records



Dynamic positioning of records

(30)

Direct File Organization

Collision resolution with links

If multiple synonyms occur for a particular home address, we form a chain of synonym records

Disadvantage → extra storage is needed

Collision resolution without links

We can use implied links by applying a convention, or set of rules, for deciding where to go next

A simple convention is to look at the next location in memory
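The "look at the next location" convention can be sketched as follows. The hashing function key mod table size, the wrap-around at the end of the table, and the sample keys are assumptions made for illustration.

```python
def insert_next_location(table, key):
    """Store `key` at its home address or, if occupied, at the next free location."""
    size = len(table)
    home = key % size                    # assumed hashing function
    for step in range(size):
        address = (home + step) % size   # implied link: the next location, wrapping around
        if table[address] is None:
            table[address] = key
            return address
    raise RuntimeError("table is full")


table = [None] * 11
addresses = [insert_next_location(table, key) for key in (27, 18, 29, 28, 39)]
print(addresses)
```

No link fields are stored; the price is that synonyms of different home addresses can end up interleaved in neighbouring locations.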

(31)

Direct File Organization

Coalesced Hashing

Coalescing occurs when we attempt to insert a record with a home address that is already occupied by a record from a chain with a different home address

The two chains with records having different home addresses coalesce or grow together

(32)

Direct File Organization

(Eg.)

Hash(key) = key mod 11

Keys: 27, 18, 29, 28, 39, 13, 16

42 & 17 added

(33)

Direct File Organization

Discussion

Packing factor of the final table = 9/11 (82%)

One method of reducing coalescing is to reduce the packing factor

It would be advisable to place the most frequently accessed records early in the insertion process

Deleting a record is complicated

If coalescing has occurred, a simple deletion procedure is to move a record later in the probe chain into the position of the deleted record

(34)

Direct File Organization

Variants

Table organization (whether or not a separate overflow area is used)

The manner of linking a colliding item into a chain

The manner of choosing unoccupied locations

Table Organization

Table → primary area + overflow area

Address factor = (primary area) / (total table size)

Best performance when the address factor is 0.86

(35)

Direct File Organization

Variants

Late Insertion Standard Coalesced Hashing (LISCH)

New records are inserted at the end of a probe chain

Lack of a cellar

Late Insertion Coalesced Hashing (LICH)

Uses a cellar

Eg. Keys: 27, 18, 29, 28, 39, 13, 16, 42, 17

hashing function: key mod 7

Average # of probes → 1.3

(It was 1.8 for LISCH)

In general, for a 90 percent packing factor, using a cellar will reduce the number of probes by about 6 percent compared
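The two averages quoted above can be reproduced with a small simulation. It is a sketch under stated assumptions: the table has 11 locations, free locations are taken from the bottom (highest address) of the table, and for LICH the hashing function key mod 7 leaves addresses 7 to 10 as the cellar. The function and parameter names are illustrative.

```python
def late_insertion(keys, table_size, primary_size=None):
    """Insert keys with late insertion; return (slots, links, average retrieval probes)."""
    if primary_size is None:
        primary_size = table_size                # no cellar: LISCH
    slots = [None] * table_size
    links = [None] * table_size
    free = table_size - 1                        # assumed: search for free space from the bottom
    for key in keys:
        address = key % primary_size
        if slots[address] is None:
            slots[address] = key                 # home address is empty
            continue
        while links[address] is not None:        # walk to the end of the probe chain
            address = links[address]
        while free >= 0 and slots[free] is not None:
            free -= 1
        if free < 0:
            raise RuntimeError("table is full")
        slots[free] = key
        links[address] = free                    # late insertion: link at the end of the chain
    total = 0
    for key in keys:                             # retrieve every key once and count probes
        address, probes = key % primary_size, 1
        while slots[address] != key:
            address = links[address]
            probes += 1
        total += probes
    return slots, links, total / len(keys)


keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
lisch_slots, _, lisch_average = late_insertion(keys, table_size=11)
lich_slots, _, lich_average = late_insertion(keys, table_size=11, primary_size=7)
print(round(lisch_average, 2), round(lich_average, 2))
```

Under these assumptions the sketch gives about 1.78 probes without a cellar and about 1.33 with one, consistent with the rounded figures on the slide.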

(36)

Direct File Organization

Variants

Early Insertion Standard Coalesced Hashing (EISCH)

Inserts a new record into a position on the probe chain immediately after the record stored at its home address

Insertion of the record with key 17 according to the EISCH algorithm:

Hash(key) = key mod 11
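The insertion of key 17 can be followed with a minimal sketch of early insertion. It assumes a table of 11 locations, the key sequence used in the earlier example, and free locations taken from the bottom (highest address) of the table; the names are illustrative.

```python
def eisch_insert(slots, links, key):
    """Insert `key`; on a collision, link it immediately after its home address."""
    size = len(slots)
    home = key % size
    if slots[home] is None:
        slots[home] = key
        return home
    free = size - 1                      # assumed: search for free space from the bottom
    while free >= 0 and slots[free] is not None:
        free -= 1
    if free < 0:
        raise RuntimeError("table is full")
    slots[free] = key
    links[free] = links[home]            # new record points to the old successor
    links[home] = free                   # home record points to the new record
    return free


slots = [None] * 11
links = [None] * 11
for key in (27, 18, 29, 28, 39, 13, 16, 42, 17):
    eisch_insert(slots, links, key)

# Follow the probe chain that starts at the home address of key 17.
chain, address = [], 17 % 11
while address is not None:
    chain.append(slots[address])
    address = links[address]
print(chain)
```

Key 17 is now reached in two probes instead of the four needed with late insertion, at the cost of one extra probe for the record that previously followed the home record.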

(37)

Direct File Organization

Variants

Random Early Insertion Standard Coalesced Hashing (REISCH)

Choosing a random unoccupied location for the new insertion

Gives only a 1% improvement over EISCH

Random Late Insertion Standard Coalesced Hashing (RLISCH)

Bidirectional Late Insertion Standard Coalesced Hashing (BLISCH)

Choosing the overflow location for a collision insertion by alternating the selection between the top and bottom of the table

(38)