BM307 File Organization

Academic year: 2021

(1)

Gazi University

Computer Engineering Department

(2)

Index

Sequential File Organization

Binary Search

Interpolation Search

Self-Organizing Sequential Search

Direct File Organization

Locating Information

Hashing Functions

Collision Resolution

Coalesced Hashing

(3)

File Organization

Goal

Organizing files efficiently in terms of both space and performance

File Organization      File Access
sequential             sequential
indexed sequential     sequential & direct

(4)

File Access Types

Sequential: accessing multiple records (often an entire file), usually according to a predefined order

Direct (random): locating a single record

Question

How can we have an effective organization?

Answer

matching the type of organization with the

(5)

Sequential File Organization

Background

• Fields (e.g. employee name, number)
• Records: contain data about individual entities
• Files (e.g. employee list)
• Primary key: field(s) that uniquely distinguish a record from all others

(6)

Sequential File Organization

File

consists of records of the same format

Fixed-length records

Variable-length records

Sequential File Organization

(i+1)st element of a file

(7)

Sequential File Organization

Sequential access: moving from one record in the file to the next by incrementing the address of the current record by the record size

Direct access: processing a single record directly if

(8)

Sequential File Organization

Probe: an access to a distinct location

Sequential Search

• In a file of N records, about N/2 probes are needed on average for a successful retrieval
• The entire file must be probed for an unsuccessful retrieval
• Computational complexity: O(N)
• Appropriate when N is small

Performance improvement?

(9)

E.g. - Sequential Search

100,000 records, each record size is 400 bytes, block size is 2,400 bytes.

Sequential search time for retrieving 10,000 records?

• Each probe reads one block of data
• Number of blocks: (100,000 × 400) / 2,400 ≈ 16,667
• Reading time for one block: 0.84 ms (IBM 3380)
• Average time requirement for each record: (16,667 / 2) × 0.84 ms ≈ 7 s
• For 10,000 records: 7 s × 10,000 = 70,000 s ≈ 19 hours

Better organization is needed!
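The arithmetic above can be rechecked with a short script. This is only a sketch of the slide's estimate; the variable names are illustrative and the model assumes, as the slide does, that a retrieval reads half of the blocks on average.

```python
import math

records = 100_000          # records in the file
record_size = 400          # bytes per record
block_size = 2_400         # bytes per block
block_read_ms = 0.84       # time to read one block (figure quoted for the IBM 3380)
lookups = 10_000           # number of records to retrieve

# Number of blocks needed to hold the file (6 records fit in each block).
blocks = math.ceil(records * record_size / block_size)

# On average a successful sequential search reads half of the blocks.
seconds_per_lookup = (blocks / 2) * block_read_ms / 1000
total_hours = seconds_per_lookup * lookups / 3600

print(blocks, round(seconds_per_lookup, 2), round(total_hours, 1))
```

The unrounded total is slightly above 19 hours because the slide rounds the per-record time to 7 s before multiplying.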

(10)

Sequential File Organization

Binary Search

• Requires sorting
• Compares the key of the sought record with the middle record of the file
• Half of the file is eliminated in each turn
• Computational complexity: O(log₂ n)
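A minimal sketch of the procedure described above, assuming the keys are held in a sorted Python list. The function name and the returned probe count are illustrative additions, not part of the slides.

```python
def binary_search(keys, sought):
    """Return (index, probes) for `sought` in the sorted list `keys`,
    or (None, probes) if the key is not present."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper:
        middle = (lower + upper) // 2     # compare with the middle record
        probes += 1
        if keys[middle] == sought:
            return middle, probes
        if keys[middle] < sought:
            lower = middle + 1            # discard the lower half
        else:
            upper = middle - 1            # discard the upper half
    return None, probes


keys = list(range(0, 200, 2))             # 100 sorted keys
print(binary_search(keys, 42))
print(binary_search(keys, 43))
```

For 100 keys no search needs more than 7 probes, in line with the O(log₂ n) bound.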

(12)

Sequential File Organization

Interpolation Search

• Approximates the relative position of the sought key
• E.g. searching for a name in a telephone book
• Chooses the next position for a comparison based upon the estimated position of the sought key relative to the remainder of the file to be searched

$$\text{NEXT} := \text{LOWER} + \frac{\text{key}[\text{sought}] - \text{key}[\text{LOWER}]}{\text{key}[\text{UPPER}] - \text{key}[\text{LOWER}]}\,(\text{UPPER} - \text{LOWER})$$

• Worst-case computational complexity: O(n)
• Average-case computational complexity: O(log₂ log₂ n)
• Its performance improves as the distribution of keys becomes more uniform
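The NEXT formula can be tried out with a small sketch, assuming numeric keys in a sorted Python list. Rounding the estimate down with integer division is an assumption made here, since the slide does not say how NEXT is rounded; the function name is also illustrative.

```python
def interpolation_search(keys, sought):
    """Return (index, probes) for `sought` in the sorted list `keys`,
    or (None, probes) if the key is not present."""
    lower, upper = 0, len(keys) - 1
    probes = 0
    while lower <= upper and keys[lower] <= sought <= keys[upper]:
        if keys[upper] == keys[lower]:
            nxt = lower                   # avoid dividing by zero
        else:
            # NEXT := LOWER + (key[sought] - key[LOWER]) /
            #                 (key[UPPER] - key[LOWER]) * (UPPER - LOWER)
            nxt = lower + ((sought - keys[lower]) * (upper - lower)
                           // (keys[upper] - keys[lower]))
        probes += 1
        if keys[nxt] == sought:
            return nxt, probes
        if keys[nxt] < sought:
            lower = nxt + 1
        else:
            upper = nxt - 1
    return None, probes


keys = list(range(0, 1000, 10))           # perfectly uniform keys
print(interpolation_search(keys, 420))
```

With uniformly spaced keys the first estimate lands on the record, which is the best case the last bullet refers to.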

(13)

Binary search should be preferred when the data is stored in primary memory

• Why?

Interpolation search should be preferred when the data is stored in auxiliary memory

(14)

Binary search should be preferred when the data is stored in primary memory

• The additional calculations needed for the interpolation search cancel any savings gained from fewer probes

Interpolation search should be preferred when the data is stored in auxiliary memory

• An access of auxiliary storage is an order of magnitude greater than

(15)

Sequential File Organization

Self-Organizing Sequential Search

• Modifies the order of records
• Moves the most frequently retrieved records to the beginning of the file

Most popular algorithms:

• Move_to_front
• Transpose

(16)

Sequential File Organization

Move_to_front

• The sought record is moved to the front position of the file
• Can make big mistakes: a record may be accessed, moved to the front of the file, and then rarely if ever accessed again
• A linked implementation is preferable even though it takes more storage
• Appropriate when space is not limited and locality of access is important
• Essentially the same as the LRU (least recently used) paging algorithm

(17)

E.g. - Move_to_front

The records are accessed in the order of "fileediting". Each row shows the file after the access named on the left.

start  a b c d e f g h i j k l m n o p r q s t v w y z
f      f a b c d e g h i j k l m n o p r q s t v w y z
i      i f a b c d e g h j k l m n o p r q s t v w y z
l      l i f a b c d e g h j k m n o p r q s t v w y z
e      e l i f a b c d g h j k m n o p r q s t v w y z
e      e l i f a b c d g h j k m n o p r q s t v w y z
d      d e l i f a b c g h j k m n o p r q s t v w y z
i      i d e l f a b c g h j k m n o p r q s t v w y z
t      t i d e l f a b c g h j k m n o p r q s v w y z
i      i t d e l f a b c g h j k m n o p r q s v w y z
n      n i t d e l f a b c g h j k m o p r q s v w y z
g      g n i t d e l f a b c h j k m o p r q s v w y z
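The Move_to_front rule can be replayed with a few lines of code. This is a sketch that keeps the file as a Python list; the function name and the `history` list are illustrative, and the starting order is copied from the slide.

```python
def move_to_front(records, accesses):
    """Apply the Move_to_front rule for each accessed key and
    return the file contents after every access."""
    records = list(records)
    history = []
    for key in accesses:
        position = records.index(key)             # sequential search
        records.insert(0, records.pop(position))  # move the record to the front
        history.append("".join(records))
    return history


start = "abcdefghijklmnoprqstvwyz"                # starting order used on the slide
for state in move_to_front(start, "fileediting"):
    print(" ".join(state))
```

A list makes the move cost proportional to the record's position; the linked implementation mentioned on the previous slide avoids shifting the records in between.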

(18)

Sequential File Organization

Transpose

• Interchanges the sought record with its immediate predecessor
• More stable than the Move_to_front algorithm
• A record needs to be accessed many times before it is moved to the front of the list
• Easily implemented
• Does not need additional space

(19)

E.g. - Transpose

The records are accessed in the order of "fileediting". Each row shows the file after the access named on the left.

start  a b c d e f g h i j k l m n o p r q s t v w y z
f      a b c d f e g h i j k l m n o p r q s t v w y z
i      a b c d f e g i h j k l m n o p r q s t v w y z
l      a b c d f e g i h j l k m n o p r q s t v w y z
e      a b c d e f g i h j l k m n o p r q s t v w y z
e      a b c e d f g i h j l k m n o p r q s t v w y z
d      a b c d e f g i h j l k m n o p r q s t v w y z
i      a b c d e f i g h j l k m n o p r q s t v w y z
t      a b c d e f i g h j l k m n o p r q t s v w y z
i      a b c d e i f g h j l k m n o p r q t s v w y z
n      a b c d e i f g h j l k n m o p r q t s v w y z
g      a b c d e i g f h j l k n m o p r q t s v w y z
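A sketch of the Transpose rule, again keeping the file as a Python list with illustrative names. It applies the interchange to every letter of "fileediting", including the access to l, and records the file after each access.

```python
def transpose(records, accesses):
    """Swap each accessed record with its immediate predecessor and
    return the file contents after every access."""
    records = list(records)
    history = []
    for key in accesses:
        position = records.index(key)             # sequential search
        if position > 0:                          # the first record stays put
            records[position - 1], records[position] = (
                records[position], records[position - 1])
        history.append("".join(records))
    return history


start = "abcdefghijklmnoprqstvwyz"                # starting order used on the slide
for state in transpose(start, "fileediting"):
    print(" ".join(state))
```

After eleven accesses no record has moved more than a few positions, which illustrates why Transpose is described as more stable than Move_to_front.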

(20)

Sequential File Organization

Count

• Keeps count of the number of accesses of each record
• The file is always ordered in decreasing order of frequency of access
• Requires extra storage to keep the count
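One way to sketch the Count rule is to keep the counts in a dictionary and bubble an accessed record forward past any record with a smaller count. The names and the tie-breaking choice (a record does not pass others with an equal count) are assumptions made for this illustration.

```python
def count_access(records, counts, key):
    """Record one access to `key`, then restore decreasing order of
    access counts. Returns the new position of the record."""
    position = records.index(key)                 # sequential search
    counts[key] = counts.get(key, 0) + 1
    # Move the record forward while its predecessor has a smaller count.
    while position > 0 and counts.get(records[position - 1], 0) < counts[key]:
        records[position - 1], records[position] = (
            records[position], records[position - 1])
        position -= 1
    return position


records = list("abcd")
counts = {}                                       # the extra storage the slide mentions
for key in "ccbc":
    count_access(records, counts, key)
print(records, counts)
```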

(21)

Direct File Organization

• Ideally, we want to go directly to the address where the record is stored
• A key can be a unique address → one probe
• More address space than needed

[Figure: key space (0 to 999-99-9999) in 1–1 correspondence with address space (0 to 999-99-9999)]

(22)

Direct File Organization

Converting information into a unique address

• E.g. airline reservation system
• Flight numbers from 1 to 999
• Days are numbered from 1 to 366
• Flight number and day of the year could be concatenated to determine the location

Location = flight number || day of the year, address range → 001001-999366 (???367 - ???999 would not exist)
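The concatenation can be written as simple arithmetic: shifting the flight number three decimal places leaves room for the day. The function name and the range checks are illustrative assumptions, not part of the slides.

```python
def reservation_address(flight_number, day_of_year):
    """Concatenate a flight number (1-999) and a day of the year (1-366)
    into a single six-digit location."""
    if not 1 <= flight_number <= 999:
        raise ValueError("flight number must be between 1 and 999")
    if not 1 <= day_of_year <= 366:
        raise ValueError("day of the year must be between 1 and 366")
    return flight_number * 1000 + day_of_year


print(f"{reservation_address(1, 1):06d}")         # lowest address
print(f"{reservation_address(999, 366):06d}")     # highest address
```

Every key maps to its own address, but locations whose last three digits are 000 or 367 to 999 are never used, which is the wasted address space the slide points out.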

(23)

Direct File Organization

The key converts to a probable address

• If we remove most of the empty spaces in the address space, we lose the 1–1 correspondence between keys and addresses
• Hashing functions are used to map the wider range of key values into the narrower range of address values

Hash(key) → probable address

• Initial probable address → home address

A hashing function should

• Evenly distribute the keys among the addresses
• Execute efficiently

(24)

Direct File Organization

A collision occurs when two distinct keys map to the same address

Hashing is then composed of two aspects:

• The function
• The collision resolution method

[Figure: key space (0 to 999-99-9999) mapped onto a smaller address space (0 to 1200)]

(26)

Direct File Organization

Hashing Functions

• Squaring
  Taking the square of a key and then substringing or truncating a portion of the result
• Radix conversion
  The key is considered to be in a base other than 10 and is then converted into a number in base 10
  E.g. base 11:

$$1234_{11} = 1 \cdot 11^3 + 2 \cdot 11^2 + 3 \cdot 11^1 + 4 \cdot 11^0 = 1331 + 242 + 33 + 4 = 1610_{10}$$
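Both functions can be sketched in a few lines. The function names are illustrative, and the choice of which digits of the square to keep (`start`, `length`) is an assumption, since the slide does not fix it. In practice the result would still have to be reduced to the address range.

```python
def radix_convert(key, base=11):
    """Read the decimal digits of `key` as a number written in `base`."""
    value = 0
    for digit in str(key):
        value = value * base + int(digit)
    return value


def square_and_extract(key, start, length):
    """Square the key and keep `length` digits beginning at index `start`."""
    digits = str(key * key)
    return int(digits[start:start + length])


print(radix_convert(1234))                # the base-11 example from the slide
print(square_and_extract(1234, 2, 3))     # digits taken from 1234 * 1234 = 1522756
```

Python's built-in `int("1234", 11)` performs the same radix conversion.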

(27)

Direct File Organization

Hashing Functions

• Polynomial hashing
  The key is divided by a polynomial
  f(information area) → cyclic check bytes
• Alphabetic keys
  Alphabetic or alphanumeric key values can be input to a hashing

(28)

Direct File Organization

Collisions

• For a given set of data, one hashing function may distribute the keys more evenly over the address space than another
• A hashing function that has a large number of collisions is said to exhibit primary clustering
• It is better to have a slightly more expensive hashing function for data that need to be stored on auxiliary storage
• Another method for reducing collisions is reducing the packing factor

$$\text{Packing factor} = \frac{\text{number of records stored}}{\text{total number of storage locations}}$$

[Figure: plot with an axis labeled "collisions"]

(29)

Direct File Organization

Collision Resolution

Collision resolution with links

Collision resolution without links

Static positioning of records

Dynamic positioning of records

(30)

Direct File Organization

Collision resolution with links

• If multiple synonyms occur for a particular home address, we form a chain of synonym records
• Disadvantage → extra storage is needed

Collision resolution without links

• We can use implied links by applying a convention, or set of rules, for deciding where to go next
• A simple convention is to look at the next location in memory
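The "next location in memory" convention can be sketched as follows. The hash function (key mod table size), the wrap-around at the end of the table, and the function name are assumptions made for this illustration.

```python
def insert_next_location(table, key):
    """Insert `key` at its home address or, if that is occupied,
    at the next free location, wrapping around the end of the table."""
    size = len(table)
    home = key % size                     # assumed hashing function
    for step in range(size):
        address = (home + step) % size    # implied link: simply the next slot
        if table[address] is None:
            table[address] = key
            return address
    raise OverflowError("table is full")


table = [None] * 11
for key in (27, 16, 28):
    print(key, "->", insert_next_location(table, key))
```

Keys 27 and 16 share home address 5, so 16 is pushed to address 6, which in turn displaces 28 from its own home address. No link field is stored, at the cost of longer searches once the table fills up.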

(31)

Direct File Organization

Coalesced Hashing

Occurs when we attempt to insert a record with a home address that is already occupied by a record from a chain with a different home address

The two chains with records having different home addresses coalesce, or grow together

(32)

Direct File Organization

Coalesced Hashing

(E.g.)

• Hash(key) = key mod 11
• Keys: 27, 18, 29, 28, 39, 13, 16
• 42 & 17 added
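The example can be replayed in code. This sketch assumes that a colliding record is linked to the end of the chain that passes through its home address and is stored in the highest-numbered empty location; the slide does not state how the empty location is chosen, so that rule and the function names are assumptions.

```python
def coalesced_insert_all(keys, table_size):
    """Build a coalesced hash table with late insertion (end of chain).
    Returns the table and the link field of each location."""
    table = [None] * table_size
    link = [None] * table_size
    free = table_size - 1                 # empty slots are taken from the bottom
    for key in keys:
        home = key % table_size
        if table[home] is None:
            table[home] = key
            continue
        end = home
        while link[end] is not None:      # follow the chain to its end
            end = link[end]
        while free >= 0 and table[free] is not None:
            free -= 1
        if free < 0:
            raise OverflowError("table is full")
        table[free] = key
        link[end] = free
    return table, link


def probes_to_find(table, link, key):
    """Number of probes needed to retrieve a key known to be in the table."""
    position, probes = key % len(table), 1
    while table[position] != key:
        position, probes = link[position], probes + 1
    return probes


keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
table, link = coalesced_insert_all(keys, 11)
print(table)
print([probes_to_find(table, link, k) for k in keys])
```

Key 42 has home address 9, which already holds 39 from the chain that starts at address 6, so the two chains coalesce. Key 17 then needs four probes (addresses 6, 9, 4, 3).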

(33)

Direct File Organization

Coalesced Hashing

Discussion

• Packing factor of the final table = 9/11 (82%)
• One method of reducing coalescing is to reduce the packing factor
• It would be advisable to place the most frequently accessed records early in the insertion process
• Deleting a record is complicated
• If coalescing has occurred, a simple deletion procedure is to move a record later in the probe chain into the position of the deleted record

(34)

Direct File Organization

Coalesced Hashing

Variants

• Table organization (whether or not a separate overflow area is used)
• The manner of linking a colliding item into a chain
• The manner of choosing unoccupied locations

Table Organization

• Table → primary area + overflow area
• Address factor = (primary area) / (total table size)
• Best performance when the address factor is 0.86

(35)

Direct File Organization

Coalesced Hashing

Variants

• Late Insertion Standard Coalesced Hashing (LISCH)
  New records are inserted at the end of a probe chain
  Lack of a cellar
• Late Insertion Coalesced Hashing (LICH)
  Uses a cellar
  E.g. keys: 27, 18, 29, 28, 39, 13, 16, 42, 17; hashing function: key mod 7
  Average number of probes: 1.3 (it was 1.8 for LISCH)
  In general, for a 90 percent packing factor, using a cellar will reduce the number of probes by about 6 percent compared
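The two averages can be reproduced with a sketch that treats the cellar as extra locations placed after the primary area. It assumes a primary area of 7 locations plus a cellar of 4 for LICH (11 in total, matching the earlier table) and that empty locations are taken from the bottom of the table, so the cellar fills first. These sizes, that rule, and the function names are assumptions, not stated on the slide.

```python
def build_table(keys, primary_size, cellar_size=0):
    """Late-insertion coalesced hashing. Keys hash into the primary area;
    colliding records go to the highest-numbered empty location."""
    size = primary_size + cellar_size
    table, link = [None] * size, [None] * size
    free = size - 1
    for key in keys:
        home = key % primary_size
        if table[home] is None:
            table[home] = key
            continue
        end = home
        while link[end] is not None:      # late insertion: go to the chain's end
            end = link[end]
        while free >= 0 and table[free] is not None:
            free -= 1
        if free < 0:
            raise OverflowError("table is full")
        table[free] = key
        link[end] = free
    return table, link


def average_probes(keys, table, link, primary_size):
    """Average number of probes to retrieve each stored key once."""
    total = 0
    for key in keys:
        position, total = key % primary_size, total + 1
        while table[position] != key:
            position, total = link[position], total + 1
    return total / len(keys)


keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
lisch = build_table(keys, primary_size=11)                # no cellar
lich = build_table(keys, primary_size=7, cellar_size=4)   # with a cellar
print(round(average_probes(keys, *lisch, 11), 1))
print(round(average_probes(keys, *lich, 7), 1))
```

Under these assumptions the averages are 16/9 for LISCH and 12/9 for LICH, which round to the 1.8 and 1.3 quoted above. The cellar absorbs the colliding records, so no chains coalesce in this small example.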

(36)

Direct File Organization

Coalesced Hashing

Variants

• Early Insertion Standard Coalesced Hashing (EISCH)
  Inserts a new record into a position on the probe chain immediately after the record stored at its home address
  Insertion of the record with key 17 according to the EISCH algorithm: Hash(key) = key mod 11
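A sketch of early insertion for the same keys: the new record is linked directly behind the record at its home address instead of at the end of the chain. As before, taking the highest-numbered empty location is an assumption, and the function name is illustrative.

```python
def eisch_insert_all(keys, table_size):
    """Coalesced hashing with early insertion: a colliding record is
    linked immediately after the record at its home address."""
    table = [None] * table_size
    link = [None] * table_size
    free = table_size - 1                 # empty slots are taken from the bottom
    for key in keys:
        home = key % table_size
        if table[home] is None:
            table[home] = key
            continue
        while free >= 0 and table[free] is not None:
            free -= 1
        if free < 0:
            raise OverflowError("table is full")
        table[free] = key
        link[free] = link[home]           # new record inherits the old successor
        link[home] = free                 # and becomes the first link after home
    return table, link


keys = [27, 18, 29, 28, 39, 13, 16, 42, 17]
table, link = eisch_insert_all(keys, 11)

# Walk the chain that starts at the home address of key 17.
position, chain = 17 % 11, []
while position is not None:
    chain.append((position, table[position]))
    position = link[position]
print(chain)
```

Key 17 is now reached in two probes (addresses 6, 3) instead of four, while 39 moves one step further down the same chain.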

(37)

Direct File Organization

Coalesced Hashing

Variants

• Random Early Insertion Standard Coalesced Hashing (REISCH)
  Choosing a random unoccupied location for the new insertion
  Gives only a 1% improvement over EISCH
• Random Late Insertion Standard Coalesced Hashing (RLISCH)
• Bidirectional Late Insertion Standard Coalesced Hashing (BLISCH)
  Choosing the overflow location for a collision insertion by alternating the selection between the top and bottom of the table
