Efficient File Storage Using Content-based Indexing

(1)

Distributed Systems Group - INESC-ID Lisbon

Technical University of Lisbon

http://www.gsd.inesc-id.pt/

João Barreto

Paulo Ferreira

(2)

Why Use Content-based Indexing for File Storage?

• Extending content-based indexing (e.g. as used by LBFS [MCM01]) from the network transfer of file contents to local file storage is a natural step.

• Particularly interesting as a storage-efficient support for:

– Versioning file systems

– Resource-constrained embedded file systems

• However, access performance must be acceptable and storage gains significant.

(3)

Existing Solutions: Chunk Repository Storage Model

• To some extent, all existing file storage architectures that are based on content-based indexing share a core storage model [CN02, QD02, BF04].

– File contents are divided into disjoint chunks of data, each individually stored with a unique hash of its contents in a repository of chunks.

– The actual files are then stored as sequences of possibly shared references to chunks in the repository.

[Figure: Example using the Chunk Repository Model. The chunk repository holds pairs (h_i, c_i); file₁, file₂ and file₃ are stored as sequences of references r_i to chunks in the repository, some of which are shared. Legend: c_i – chunk contents; h_i – chunk hash.]
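
The chunk repository model described above can be sketched in a few lines. This is only an illustration under stated assumptions: fixed-size chunks, SHA-1 as the chunk hash, and the names `ChunkRepository`, `store_file` and `read_file` are all hypothetical and not taken from any of the cited systems.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, to keep the example readable


class ChunkRepository:
    """Maps each chunk hash h_i to its contents c_i; a distinct chunk is stored once."""

    def __init__(self):
        self.chunks = {}

    def put(self, data: bytes) -> bytes:
        h = hashlib.sha1(data).digest()  # assumption: SHA-1 as the chunk hash
        self.chunks.setdefault(h, data)  # keep the first copy, share it afterwards
        return h

    def get(self, h: bytes) -> bytes:
        return self.chunks[h]


def store_file(repo: ChunkRepository, data: bytes) -> list:
    """Store a file as a sequence of references (here, hashes) to chunks."""
    return [repo.put(data[i:i + CHUNK_SIZE]) for i in range(0, len(data), CHUNK_SIZE)]


def read_file(repo: ChunkRepository, refs: list) -> bytes:
    """Rebuild a file by following every reference into the repository."""
    return b"".join(repo.get(r) for r in refs)


repo = ChunkRepository()
file1 = store_file(repo, b"AAAABBBBCCCC")
file2 = store_file(repo, b"BBBBDDDDAAAA")
print(len(repo.chunks), "distinct chunks stored for", len(file1) + len(file2), "references")
```

Note that every read goes through the repository, even for a file that shares nothing with any other file.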

(4)

Problems of Chunk Repository Storage Model

• Higher storage penalties as chunk size is decreased:

1. Increased chunk meta-data overhead (mainly with chunk hashes);

2. Increased internal fragmentation, if the chunk repository is stored on a block-based device;

3. Lower chunk compression ratios are achieved.

– This trade-off restricts the choice of the expected chunk size to relatively high values; hence existing solutions do not fully exploit the similarity that may exist in a file system.

• Sequential read performance is penalized even if the file being accessed does not share any chunk with other files.
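
The first two penalties can be made concrete with some rough arithmetic. The figures below are assumptions chosen for illustration only: a 20-byte hash per chunk, 4096-byte device blocks, and each chunk padded to whole blocks on its own. The function names are hypothetical.

```python
import math

HASH_BYTES = 20     # assumption: one SHA-1-sized hash stored per chunk
BLOCK_BYTES = 4096  # assumption: block size of the underlying device


def metadata_overhead(chunk_bytes: int) -> float:
    """Hash bytes stored per byte of chunk contents."""
    return HASH_BYTES / chunk_bytes


def fragmentation_overhead(chunk_bytes: int) -> float:
    """Fraction of allocated space wasted when a chunk is padded to whole blocks."""
    allocated = math.ceil(chunk_bytes / BLOCK_BYTES) * BLOCK_BYTES
    return (allocated - chunk_bytes) / allocated


for size in (512, 2048, 8192):
    print(size, f"{metadata_overhead(size):.2%}", f"{fragmentation_overhead(size):.2%}")
```

Under these assumptions both overheads grow as the chunk size shrinks, which is the trade-off that pushes existing solutions towards larger chunks.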

(5)

The Proposed Storage Model

Two distinguishing principles:

1. If a file shares no chunks with the rest of the file system, it is stored in plain form.

• Subsequent files that share chunks with that file reference those portions of its plain contents.

2. Hashes of chunks are not stored permanently along with the contents of the file system.

• Detecting content similarity in new file system data therefore requires indexing the whole contents of the file system.

• This indexing is performed periodically in the background.

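[Figure: The previous example using the proposed model. file₁, file₂ and file₃ hold their chunk contents in plain form, with references such as r₃ and r_n standing in for chunks shared with another file. Legend: c_i – chunk contents; r_i – chunk reference.]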
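
A minimal sketch of the two principles follows. It is not the authors' implementation: the fixed chunk size, the SHA-1 hash, and the names `Store`, `Plain`, `Ref`, `build_index`, `write` and `read` are assumptions made for illustration. Unshared data stays in plain form, shared chunks become references into another file's plain data, and the hash index is rebuilt on demand rather than stored with the contents.

```python
import hashlib
from dataclasses import dataclass

CHUNK = 4  # assumption: tiny fixed-size chunks


@dataclass(frozen=True)
class Plain:
    offset: int  # offset into this file's own plain data
    length: int


@dataclass(frozen=True)
class Ref:
    file: str    # file whose plain data holds the shared chunk
    offset: int
    length: int


class Store:
    def __init__(self):
        self.plain = {}   # file name -> bytes kept in plain form
        self.layout = {}  # file name -> ordered list of Plain / Ref segments

    def build_index(self) -> dict:
        """Background pass: hash every chunk of plain data.
        The index is transient; no hash is kept alongside the contents."""
        index = {}
        for name, data in self.plain.items():
            for off in range(0, len(data) - CHUNK + 1, CHUNK):
                h = hashlib.sha1(data[off:off + CHUNK]).digest()
                index.setdefault(h, (name, off))
        return index

    def write(self, name: str, data: bytes, index: dict) -> None:
        own, layout = bytearray(), []
        for off in range(0, len(data), CHUNK):
            chunk = data[off:off + CHUNK]
            hit = index.get(hashlib.sha1(chunk).digest())
            if hit is not None:
                layout.append(Ref(hit[0], hit[1], len(chunk)))  # shared: reference only
            else:
                layout.append(Plain(len(own), len(chunk)))      # unshared: keep plain
                own += chunk
        self.plain[name] = bytes(own)
        self.layout[name] = layout

    def read(self, name: str) -> bytes:
        out = bytearray()
        for seg in self.layout[name]:
            src = self.plain[name] if isinstance(seg, Plain) else self.plain[seg.file]
            out += src[seg.offset:seg.offset + seg.length]
        return bytes(out)


store = Store()
store.write("file1", b"AAAABBBBCCCC", store.build_index())  # no similarity: all plain
store.write("file2", b"DDDDBBBBEEEE", store.build_index())  # BBBB references file1
print(store.layout["file2"])
```

The sketch ignores what happens when a referenced file is later modified or deleted, which a real file system would have to handle.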

(6)

Chunk Coalescing

• Optimises cases where consecutive pointers to contiguous shared chunks are detected.

• Such pointers are replaced by a simple multiple-chunk pointer, thus reducing the storage overhead and improving access performance.

• In practice, this is comparable (though not always equivalent) to using a larger chunk size whenever a smaller size would yield no additional similarity gains.

[Figure: file₃ before chunk coalescing, with separate references r₃ and r_n, and after chunk coalescing, with the single multiple-chunk reference r_3,n.]
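
Chunk coalescing can be sketched as a single pass over a file's reference list. The `Ref` tuple (file, offset, length) and the function name `coalesce` are hypothetical; the slides do not specify how a multiple-chunk pointer is encoded.

```python
from typing import List, NamedTuple


class Ref(NamedTuple):
    file: str    # file holding the referenced data
    offset: int  # start of the referenced data in that file
    length: int  # number of bytes referenced


def coalesce(refs: List[Ref]) -> List[Ref]:
    """Merge consecutive references that point to contiguous data in the same file."""
    merged: List[Ref] = []
    for ref in refs:
        last = merged[-1] if merged else None
        if (last is not None and last.file == ref.file
                and last.offset + last.length == ref.offset):
            # Contiguous with the previous reference: extend it instead of adding one.
            merged[-1] = Ref(last.file, last.offset, last.length + ref.length)
        else:
            merged.append(ref)
    return merged


refs = [Ref("file1", 0, 4), Ref("file1", 4, 4), Ref("file2", 0, 4), Ref("file1", 8, 4)]
print(coalesce(refs))
```

Only references that are adjacent in the list and contiguous in the same source file are merged; the last reference above stays separate because another reference sits between it and the first two.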

(7)

Advantages

• In case of no similarity, no storage overhead is imposed and read access performance is identical to that of a regular file system.

• Storage penalties that result from dividing files into smaller chunks are eliminated:

– In case of no sharing of a chunk, no chunk storage overhead;

– In case of sharing, storage overhead is negligible and always compensated by the gains resulting from the increased chunk sharing:

• Hash values are not stored along with contents;

• Chunk coalescing reduces chunk reference overhead;

– In case of an underlying block-based file system, internal fragmentation is, on average, not affected;

– Data compression may be applied to the unshared portions of each file as a whole, thus achieving higher compression ratios than when applied to individual chunks.
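
The compression point can be checked with a small experiment. The synthetic data, the 256-byte chunk size and the use of zlib are assumptions made for illustration; actual ratios depend on the data and the compressor.

```python
import zlib

# Synthetic, moderately redundant data standing in for the unshared part of a file.
data = b"".join(b"record %d: value=%d\n" % (i, i * i) for i in range(2000))
CHUNK = 256  # assumption: small chunk size

# Compress the data as a whole, then chunk by chunk.
whole = len(zlib.compress(data))
per_chunk = sum(len(zlib.compress(data[i:i + CHUNK])) for i in range(0, len(data), CHUNK))

print("original:", len(data), "whole:", whole, "per chunk:", per_chunk)
```

Compressing chunk by chunk restarts the compressor for every chunk, so redundancy that spans chunk boundaries is lost and the per-stream overhead is paid each time.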

(8)

Current Status

• Partially functional in a simulator.

• Currently being implemented as a Linux Virtual File System.

References

[BF04] J. Barreto and P. Ferreira. A replicated file system for resource constrained mobile devices. In Proceedings of IADIS International Conference on Applied Computing, 2004.

[CN02] L. Cox and B. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the Fifth ACM/USENIX Symposium on Operating Systems Design and Implementation, Boston, MA, December 2002.

[MCM01] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174–187, 2001.

[QD02] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.