5.2. Implementation of the Digital Library Feature
5.2.3. Testing J-ISIS Performance
This section presents the results of testing J-ISIS performance, both in general and in comparison with the well-known digital library software Greenstone.
To assess the performance of J-ISIS in terms of storage space, indexing and searching speed, and the ability to handle different file formats, we ran several tests on a digital library collection of 200 documents in various file formats: mostly PDF (as can typically be expected) plus a few .doc(x), .html, .ppt and .xls files. The original documents average 1940 KB in size, with a maximum of 209 MB (a large PDF); the smallest files are typically .html files of about 20 KB. The extracted text files average 106 KB, with a maximum of 2818 KB and a minimum of 1.3 KB. The Pearson correlation coefficient between the original and extracted sizes is 0.56, indicating that the ratio of extracted text to original file size varies widely with the content type of the files.
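As an illustration of how such a correlation between original and extracted sizes can be computed, the following is a minimal sketch in plain Python. The function name `pearson` and the two size lists are hypothetical and chosen for illustration only; they are not the measured data of the test collection.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sizes in KB, NOT the measured test collection.
original_kb = [20, 350, 1200, 4800, 15000]
extracted_kb = [1.3, 90, 40, 610, 220]

r = pearson(original_kb, extracted_kb)
print(f"r = {r:.2f}")
```

A value close to 1 would mean that the extracted text grows almost proportionally with the original file; a moderate value such as the 0.56 reported above is what one would expect when image-heavy and text-heavy documents are mixed.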
J-ISIS proved to be quite powerful with respect to the above-mentioned parameters. For instance, indexing one document took only 3 seconds on average, files of any size could be stored, and searching and viewing of results were almost immediate. All documents were included in the collection without exception.
Moreover, to compare the performance of J-ISIS with that of Greenstone, we installed Greenstone and used the same collection (200 documents in different formats) to keep the comparison consistent. Greenstone reported that 189 documents were successfully processed, 1 was unrecognized and 11 were rejected, despite our having used the most recent version (v2.85, published in fall 2011) with the latest PDFBox installed as per the instructions. This suggests that J-ISIS, which relies on Tika, has more robust mechanisms for recognizing file formats and extracting their text than Greenstone, which uses different plug-ins for handling different document types.
In addition, we compared the indexing speed and storage space of J-ISIS and Greenstone. All tests were done on a Windows PC with a dual-core 2.20 GHz CPU and 3 GB of RAM. Although Greenstone has its own indexing engine (MGPP), it also uses Lucene to support incremental indexing; for the sake of comparison, Lucene was used as the indexer. Greenstone took 52 seconds to index 189 files, whereas J-ISIS took only 12.5 seconds for the whole set of 200 documents.
Finally, storage space was compared. To store its 189 documents, Greenstone used 626 MB, whereas J-ISIS used only 204 MB for 200 files. This is a huge difference, with consequences for librarians planning larger collections: for only 200 documents, the lower storage efficiency of Greenstone already costs hundreds of MB. Extrapolating to thousands of documents, one can see that Greenstone needs much more storage space. The difference lies not only in the indexes (the J-ISIS indexes take only about 1/10th of the space of the Greenstone indexes, where for some reason a copy of each document is stored again and a separate XML file is kept) but also in the storage of the documents themselves, so the gap will keep growing with the number of documents.
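The extrapolation can be made concrete with a little arithmetic. The sketch below normalizes the reported totals to storage per document and projects them linearly to larger collections. It assumes that larger collections have the same mix of document sizes as the test collection, which is a simplification; the function names are illustrative only.

```python
# Figures reported in the text: total storage and number of documents stored.
GSDL_MB, GSDL_DOCS = 626, 189
JISIS_MB, JISIS_DOCS = 204, 200


def mb_per_doc(total_mb, docs):
    """Average storage used per document, in MB."""
    return total_mb / docs


def projected_mb(total_mb, docs, target_docs):
    """Linear extrapolation; assumes the same mix of document sizes."""
    return mb_per_doc(total_mb, docs) * target_docs


gsdl_rate = mb_per_doc(GSDL_MB, GSDL_DOCS)      # about 3.31 MB per document
jisis_rate = mb_per_doc(JISIS_MB, JISIS_DOCS)   # 1.02 MB per document

for n in (1_000, 10_000):
    g = projected_mb(GSDL_MB, GSDL_DOCS, n)
    j = projected_mb(JISIS_MB, JISIS_DOCS, n)
    print(f"{n:>6} docs: Greenstone ~{g / 1024:.1f} GB, J-ISIS ~{j / 1024:.1f} GB")
```

Under this linear assumption Greenstone needs a little over three times the storage of J-ISIS per document, so the absolute difference scales directly with collection size.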
During the tests, we observed that Greenstone is capable of batch importing, while J-ISIS is not. This makes the initial creation of a collection faster in Greenstone, whereas in J-ISIS documents have to be added one by one, which takes more time. For incremental building of the collection, however, the J-ISIS approach, which takes only seconds to add a document, is much faster.
A summary of the findings is presented in the following table:
Table 3: Comparison of Greenstone Digital Library System with J-ISIS
| Feature/Criterion | Greenstone Digital Library System | J-ISIS | Comments |
|---|---|---|---|
| Success coping with test collection documents | 91% | 100% | GSDL: v2.85 with new PDFBox installed as instructed |
| Total time to build collection | 15:45 | 33 min. (10 seconds average per document) | GSDL: batch import possible; J-ISIS: speed highly dependent on file-selection interaction; both: all docs in 1 folder available |
| Average time adding one document | 6 min | 10 sec | GSDL: minimal rebuild only, no hashing for optimal speed, plugins list optimized to collection profile; both: no meta-data added but embedded meta-data processed |
| Indexing full collection | 55 sec | 9 sec | Lucene is used in both cases |
| Total storage for collection | 626 MB | 204 MB | GSDL: keeps copy of original files in 2 locations |
| Space occupied by indexes | 223 MB | 23 MB | |
| Web-interface end-users | Yes | Yes | J-ISIS: Web-JISIS prototype only |
| Possibility to edit text-contents of document | No | Yes | e.g. tagging inside document |
| Advanced features e.g. page-browsing, intra-document sections | Yes | No | GSDL, as a dedicated DL software, emphasizes these |
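The trade-off between a fast batch build and fast incremental additions can be quantified from the timings in Table 3. The following sketch computes after how many later additions the total time spent with J-ISIS drops below that of Greenstone. It rests on two assumptions that are not stated in the source: that "15:45" means 15 minutes 45 seconds, and that later documents are added one at a time at the reported average cost per addition. The function names are illustrative only.

```python
# Timings from Table 3, converted to seconds.
# Assumption: "15:45" is read as minutes:seconds.
GSDL_BUILD_S = 15 * 60 + 45   # 945 s for the initial batch build
GSDL_ADD_S = 6 * 60           # 360 s per document added later
JISIS_BUILD_S = 33 * 60       # 1980 s for the initial one-by-one build
JISIS_ADD_S = 10              # 10 s per document added later


def total_seconds(build_s, add_s, added_docs):
    """Initial build time plus the time for documents added afterwards."""
    return build_s + add_s * added_docs


def break_even_additions():
    """Smallest number of later additions at which J-ISIS total time is lower."""
    head_start = JISIS_BUILD_S - GSDL_BUILD_S   # Greenstone's initial lead
    gain_per_doc = GSDL_ADD_S - JISIS_ADD_S     # J-ISIS gain per added document
    return head_start // gain_per_doc + 1


k = break_even_additions()
print(k)  # -> 3
```

Under these assumptions, Greenstone's head start from batch importing is used up after only three single-document additions, which illustrates why the incremental behaviour matters for collections that keep growing.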
As Table 3 shows, the two software packages differ considerably in their architecture. For example, GSDL allows batch import, which can be handy in specific circumstances. GSDL uses the file system as its ‘database’ manager and therefore organizes collections in a complex set of folders with many subfolders. This is not as fast as the way a powerful database such as Berkeley DB deals with records, which is a significant advantage of the database-based approach of J-ISIS. The main difference may lie in the storage space needed for typical collections, where the observed differences will be magnified by GSDL's habit of copying the original files not only into the ‘import’ folder but again into the ‘archives’ and even the index folders. On the other hand, Greenstone offers some advanced features, especially for browsing the database, not only by ‘classifiers’ (e.g. metadata fields) but also by page and document subsection. This makes GSDL the excellent solution it is, without doubt, for digital libraries. But all this clearly comes at a price in speed and storage.
To sum up, even taking into account that a full comparison would probably need a more elaborate set-up and discussion (in view of the many architectural differences), we found that J-ISIS outperforms Greenstone in terms of storage space and indexing speed. This is a remarkable result, given that Greenstone was designed with digital library capability in mind, whereas J-ISIS has been used for traditional library database purposes.