(1)

Efficient File Storage Using Content-based Indexing

Distributed Systems Group - INESC-ID Lisbon

Technical University of Lisbon

http://www.gsd.inesc-id.pt/

João Barreto

[email protected]

Paulo Ferreira

(2)

Why Use Content-based Indexing for File Storage?

• Extending content-based indexing (e.g., as used by LBFS [MCM01]) from the network transfer of file contents to local file storage is a natural step.

• It is particularly interesting as storage-efficient support for:

– Versioning file systems

– Resource-constrained embedded file systems

• However, access performance must be acceptable and storage gains significant.

(3)

Existing Solutions: Chunk Repository Storage Model

• To some extent, all existing file storage architectures based on content-based indexing share a core storage model [CN02, QD02, BF04].

– File contents are divided into disjoint chunks of data, each stored individually, together with a unique hash of its contents, in a repository of chunks.

– The actual files are then stored as sequences of references to chunks in the repository; a chunk may be referenced by more than one file.

[Figure: Example using the Chunk Repository Model. The chunk repository holds (hash, contents) pairs (h1, c1), (h2, c2), ..., (hn, cn); file1, file2 and file3 are each stored as a sequence of references ri into the repository.]

Legend: ci – chunk contents; hi – chunk hash
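The chunk repository model described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any of the cited systems: the names `ChunkRepository`, `store_file` and `read_file` are hypothetical, SHA-1 is an assumed hash function, and fixed-size 4-byte chunks are a simplification (the slides do not say how chunk boundaries are chosen).

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, for illustration only


class ChunkRepository:
    """Stores each distinct chunk once, keyed by the hash of its contents."""

    def __init__(self):
        self.chunks = {}  # hash (hex string) -> chunk contents

    def put(self, data):
        h = hashlib.sha1(data).hexdigest()
        self.chunks.setdefault(h, data)  # an already-known chunk is not stored again
        return h

    def get(self, h):
        return self.chunks[h]


def store_file(repo, contents):
    """A file becomes a sequence of references (hashes) into the repository."""
    return [repo.put(contents[i:i + CHUNK_SIZE])
            for i in range(0, len(contents), CHUNK_SIZE)]


def read_file(repo, refs):
    """Reading a file means resolving every reference, even for unshared data."""
    return b"".join(repo.get(r) for r in refs)


repo = ChunkRepository()
file1 = store_file(repo, b"AAAABBBBCCCC")
file2 = store_file(repo, b"BBBBDDDDAAAA")
```

Here `file1` and `file2` together reference six chunks, but the repository stores only the four distinct ones. Note that every read goes through the repository, whether or not the file shares anything.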

(4)

Problems of Chunk Repository Storage Model

• Storage penalties grow as the chunk size decreases:

1. Increased chunk metadata overhead (mainly from chunk hashes);

2. Increased internal fragmentation, if the chunk repository is stored on a block-based device;

3. Lower chunk compression ratios.

– This trade-off restricts the expected chunk size to relatively high values, so existing solutions do not fully exploit the similarity that may exist in a file system.

• Sequential read performance is penalized even if the file being accessed shares no chunks with other files.
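To make the first penalty concrete, the following sketch computes the hash overhead as a fraction of chunk size. The 20-byte hash (the size of a SHA-1 digest) and the chunk sizes are assumptions chosen for illustration; the slides give no figures.

```python
HASH_BYTES = 20  # assumption: one SHA-1 sized hash stored per chunk


def hash_overhead(chunk_size):
    """Fraction of extra storage spent on the hash of one chunk."""
    return HASH_BYTES / chunk_size


for size in (8192, 1024, 128):
    print(f"{size:>5}-byte chunks: {hash_overhead(size):.2%} hash overhead")
```

Under these assumptions the overhead is a fraction of a percent for 8 KB chunks but exceeds 15% for 128-byte chunks, which is why small chunks are costly in the chunk repository model.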

(5)

The Proposed Storage Model

Two distinguishing principles:

1. If a file shares no chunks with the rest of the file system, it is stored in plain form.

• Subsequent files that share chunks with that file reference the corresponding portions of its plain contents.

2. Chunk hashes are not stored permanently alongside the contents of the file system.

• Detecting similarity in new file system data therefore requires indexing the whole contents of the file system.

• This indexing is performed periodically in the background.

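[Figure: The previous example using the Proposed Model. file1, file2 and file3 store chunk contents (c1, c2, c3, c4, c5, ck, cn) in plain form; shared chunks appear as references (r3, rn).]

Legend: ci – chunk contents; ri – chunk reference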
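The two principles above can be sketched as follows. This is one possible reading of the slides, not the authors' implementation: `Store`, `Plain`, `Ref` and `deduplicate` are hypothetical names, chunks are fixed-size, SHA-1 is an assumed hash, and the background pass is invoked by hand on a single file.

```python
import hashlib
from dataclasses import dataclass

CHUNK = 4  # assumption: tiny fixed-size chunks, for illustration only


@dataclass
class Plain:
    data: bytes   # contents stored in plain form


@dataclass
class Ref:
    file: str     # file whose contents are referenced
    offset: int   # byte offset into that file's contents
    length: int


class Store:
    def __init__(self):
        self.files = {}  # name -> list of Plain / Ref extents

    def write(self, name, contents):
        # Principle 1: new data is written in plain form, with no hashing.
        self.files[name] = [Plain(contents)]

    def read(self, name):
        out = b""
        for ext in self.files[name]:
            if isinstance(ext, Plain):
                out += ext.data
            else:
                out += self.read(ext.file)[ext.offset:ext.offset + ext.length]
        return out

    def deduplicate(self, name):
        """Background pass. Principle 2: the hash index is rebuilt from the
        other files' contents each time and discarded afterwards."""
        index = {}
        for other in self.files:
            if other == name:
                continue
            data = self.read(other)
            for off in range(0, len(data), CHUNK):
                digest = hashlib.sha1(data[off:off + CHUNK]).digest()
                index.setdefault(digest, (other, off))
        data = self.read(name)
        extents = []
        for off in range(0, len(data), CHUNK):
            chunk = data[off:off + CHUNK]
            hit = index.get(hashlib.sha1(chunk).digest())
            if hit is not None:
                # A real system would also compare the bytes before sharing.
                extents.append(Ref(hit[0], hit[1], len(chunk)))
            elif extents and isinstance(extents[-1], Plain):
                extents[-1].data += chunk  # keep unshared data contiguous
            else:
                extents.append(Plain(chunk))
        self.files[name] = extents


store = Store()
store.write("file1", b"AAAABBBBCCCC")
store.write("file2", b"XXXXBBBBCCCC")
store.deduplicate("file2")
```

After the pass, `file1` is still a single plain extent, so reading it costs the same as in a regular file system, while `file2` keeps only `XXXX` in plain form and references the rest. The sketch ignores details a real design must handle, such as avoiding circular references and updating references when the referenced file changes.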

(6)

Chunk Coalescing

• Optimises the case where consecutive pointers to contiguous shared chunks are detected.

• Such pointers are replaced by a single multiple-chunk pointer, reducing the storage overhead and allowing faster access.

• In practice, this is comparable (though not always equivalent) to using a larger chunk size wherever a smaller one would yield no additional similarity gains.

[Figure: file2 and file3 before and after chunk coalescing. Before: file3 holds c4, c1 and the consecutive references r3, rn. After: file3 holds c4, c1 and a single multiple-chunk reference r3,n.]
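Chunk coalescing as described above can be sketched as a single pass over a file's extents. The tuple layout (`("plain", data)` and `("ref", file, offset, length)`) and the function name `coalesce` are assumptions made for this example.

```python
def coalesce(extents):
    """Merge consecutive references to contiguous ranges of the same file
    into one multiple-chunk reference."""
    merged = []
    for ext in extents:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev[0] == "ref" and ext[0] == "ref"
                and prev[1] == ext[1]               # same referenced file
                and prev[2] + prev[3] == ext[2]):   # next range starts where previous ends
            merged[-1] = ("ref", prev[1], prev[2], prev[3] + ext[3])
        else:
            merged.append(ext)
    return merged


before = [("plain", b"XXXX"), ("ref", "file1", 4, 4), ("ref", "file1", 8, 4)]
after = coalesce(before)
```

Here the two 4-byte references into `file1` become one 8-byte reference, while references to non-adjacent ranges are left untouched.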

(7)

Advantages

• When there is no similarity, no storage overhead is imposed and read performance is identical to that of a regular file system.

• The storage penalties that result from dividing files into smaller chunks are eliminated:

– If a chunk is not shared, it incurs no chunk storage overhead;

– If a chunk is shared, the storage overhead is negligible and always compensated by the gains from the increased chunk sharing:

• Hash values are not stored along with contents;

• Chunk coalescing reduces the chunk reference overhead;

– With an underlying block-based file system, internal fragmentation is, on average, not affected;

– Data compression may be applied to the unshared portions of each file as a whole, achieving higher compression ratios than when applied to individual chunks.
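The compression point can be checked with a small experiment. The sketch below uses Python's `zlib` as an assumed compressor and a repetitive sample text as assumed data; it compares compressing a file as a whole with compressing each 128-byte chunk separately.

```python
import zlib


def compressed_whole(data):
    """Size of the data when compressed as one stream."""
    return len(zlib.compress(data))


def compressed_per_chunk(data, chunk_size):
    """Total size when every chunk is compressed independently."""
    return sum(len(zlib.compress(data[i:i + chunk_size]))
               for i in range(0, len(data), chunk_size))


sample = b"the quick brown fox jumps over the lazy dog. " * 200
whole = compressed_whole(sample)
chunked = compressed_per_chunk(sample, 128)
print(f"whole: {whole} bytes, per-chunk: {chunked} bytes")
```

Compressing each chunk separately pays the per-stream overhead once per chunk and cannot exploit redundancy across chunk boundaries, so the per-chunk total is larger for this sample.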

(8)

Current Status

• Partially functional in a simulator.

• Currently being implemented as a Linux Virtual File System.

References

[BF04] J. Barreto and P. Ferreira. A replicated file system for resource constrained mobile devices. In Proceedings of IADIS International Conference on Applied Computing, 2004.

[CN02] L. Cox and B. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the Fifth ACM/USENIX Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002.

[MCM01] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174–187, 2001.

[QD02] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.
