Strategy 2, the file-based approach, is a combination of current file syncing and deduplication techniques with know-how of the virtual machine structure. Of both techniques empirical data is available about the performance.
Performance factor Performance Score
Virtual Machine Characteristics
Support and scope of VM variations + (all major filesystems and OSes) Deduplication of the Operating System + (file-based, difference-only possible) Deduplication of the Applications + (file-based, difference-only possible) Deduplication of the ‘User Data’ ++ (file-based, difference-only possible) Deduplication of Revisions +/- (difference-only, not functionality)
Distribution Factors
Speed of importing VM image + (5 to 15 minutes) Speed of reassembling VM image +/- (10 to 30 minutes) Minimal size for effective cache + (very flexible)
Re-usability of data-objects +/- (difference-only, not functionality)
Support and scope of virtual machine variations
Strategy 2 has to understand the files on a virtual machine’s disk. For this it has to understand virtual disk images, disk layouts, partition tables, filesystems, possibly LVM (logical volume management), mountpoint configurations etcetera.
Luckily, of all these aspects there are a limited set of choices. Only a couple of virtual disk image formats exists, disk layout is universal, for filesystems there are multiple choices but only a few are popular, for LVM there is a limited set of technologies and mountpoints are according to Windows’ standard or the Unix standard. For all of these components there are good (open source) drivers and/or documentation available to implement support.
This results in this performance factor scoring ‘good’: +.
Deduplication of the Operating System
Strategy 2 is quite good for deduplicating operating system content. Since most files will be identical between different virtual machines as long as they will have the same
operating system distribution’s version installed, all these files can easily be
deduplicated. There might even be a possibility to deduplicate between different versions, since small differences might be just described by a difference-only-description from a common version. The same might apply even with different distributions or platform
74
architectures, since many files are stored on the same path with the same file name and could easily be matched for alikeness.
Of course, deduplication is not possible between whole different operating systems. The score of this performance is therefore ‘good’: +.
Deduplication of the Applications
Just as with the operating system all identical files of applications can easily be matched in strategy 2. No trouble for in which order the applications were installed, the path and name of the file will always be consistent and usable. Again alikeness between different revisions and distributions can also be checked if path and filename correspond and stored as difference-only-descriptions.
Also the popularity of applications can be indexed per file. Thus allowing for caching more popular or universal components of the application, while keeping optional or diverse files aside. The score of this performance factor is ‘good’: +.
Deduplication of the ‘User Data’
Strategy 2 is relatively suitable for deduplication of ‘user data’. Of course, if the content ‘user data’ is not popular or nothing alike to be found in other virtual machines, there is nothing to be gained. But in case there is any likeness of the ‘user data’ between virtual machines and it’s popular, this strategy should be able to deduplicate it. It can look for identical content and just deduplicate it. Or it can detect identical paths, filenames and try to describe any data by difference-only-descriptions from a common revision. The score of this performance factor is ‘good’: +.
Deduplication of Revision Information
Strategy 2 is in general suitable for deduplicating content that has multiple revisions. It can leverage the knowledge of path and filename to detect common ancestry between files and us it too its advantage.Unfortunately the strategy doesn’t allow for replacing a file with another one with equal functionality, because the file is still assessed on its content and not on its function. Thus differences between a file’s revisions always have to be distributed. The score of this performance factor is +/-.
Speed of importing VM image
The speed of importing a virtual machine image is quite good for strategy 2. The virtual disk image is read file by file, which is not the fastest method since the data is not in- order on the disk. But if a smart filesystem was used the retrieval of the meta-data and the layout of the data on the disk might be quite optimal. Also having files that are larger in size can have a positive impact on the average transfer speed. Adding the files to the database should be fast, since the files are stored separate of the meta-data and the amount of data-objects is limited. The performance factor is ‘good’: +.
Speed of reassembling VM image
The speed of retrieving a virtual machine image is reasonable for strategy 2. First the virtual disk image, the filesystems etcetera have to be created at the target location. All the files can be retrieved relatively fast from a database and if large enough this good also go with reasonable average transfer speed. But when the files are written to the filesystem on the target virtual disk the filesystem has to decide on all the locations, which might be
75
non-linear. Also all the meta-data on the files will have to be applied. This makes this performance factor +/-.
Minimal size for effective cache
Strategy 2 allows for any size of database. All files, whether they are part of the operating system, applications or ‘user data’ they can all fairly be judged on their popularity to be cached or not. Also the spread of the curve for the popularity of files is not too equal. Thus it is possible to just cache just the popular instead of all files. The score of this performance factor is ‘good’: +.
Re-usability of data-objects
As described earlier, sometimes alikeness can be used to cache and reuse some files with just a difference-only-description. And of course identical files can be reused.
Unfortunately no re-usage on functionality is possible with this strategy, which limits re- usability between different versions of a file. This performance factor scores +/-.