
Software Implementation

4.1 Choice of language

The original paper on TOPSIG utilised an “unoptimised Java engine” [Geva and De Vries, 2011] for testing and for comparing the approach against competing algorithms. For the purposes of this research project, the TOPSIG algorithm was reimplemented from scratch in a lower-level language, C.

Moving the implementation to a lower-level language provides a number of advantages:

4.1.1 Performance

All other things being equal, software that is implemented as native code from the beginning can provide superior performance to higher-level counterparts that may be interpreted or that utilise an intermediate language that is dynamically recompiled at runtime.

4.1.2 Direct access to memory

Lower-level languages provide access to a computer’s memory in a manner that more accurately reflects the actual state of the machine than many higher-level languages do.

It is usually clear whether the memory allocated to a particular data structure will be stack-allocated or heap-allocated, exactly how much memory will be allocated, when the memory will be allocated and when it will be deallocated. In addition, without a virtual environment making its own claims on memory, more of it is available to the program.
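As a minimal sketch of this point, the following C fragment fixes the size of a signature at compile time and makes the allocation and deallocation points explicit. The names `signature`, `SIGNATURE_BITS` and `signature_array_new` are hypothetical and chosen for illustration only, as is the 1024-bit width; none of them is taken from the TOPSIG implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SIGNATURE_BITS  1024                    /* assumed width, for illustration */
#define SIGNATURE_WORDS (SIGNATURE_BITS / 64)   /* 16 words of 64 bits each */

/* A fixed-width signature: its size is known exactly at compile time. */
typedef struct {
    uint64_t bits[SIGNATURE_WORDS];
} signature;

/* Heap-allocate zero-initialised storage for n signatures.
   Returns NULL on failure; the caller releases the memory with free(). */
static signature *signature_array_new(size_t n)
{
    return calloc(n, sizeof(signature));
}
```

A local variable declared as `signature query;` is stack-allocated and disappears when its enclosing function returns, while the array returned by `signature_array_new` lives on the heap until `free` is called on it. In both cases `sizeof(signature)` states how many bytes each signature occupies.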

This is crucial in information retrieval, where performance is critical and extremely large collections are common.

4.1.3 Abstraction

Lower-level languages often do not abstract the details of the underlying hardware away from the programmer as thoroughly as higher-level languages do. Abstraction is normally valuable, as it frees the programmer from thinking about the hardware the code is running on and makes it easier for code to be ported to other platforms.

Abstraction, however, makes getting a clear picture of exactly what the hardware is doing in a particular situation more difficult. Without understanding exactly how objects or data structures are laid out in memory it can be difficult to optimise those data structures to make more efficient use of CPU cache. Another abstraction that can cause problems when attempting to measure performance is garbage collection. While a useful tool to free the programmer from having to manually manage memory, garbage collection can also run at inopportune times, unexpectedly consuming CPU resources.

4.1.4 Flexibility of representation

Document signature approaches can also cause certain problems when implemented in higher-level languages. Much of the motivation behind their design comes from the fact that the bit manipulation operations used to generate and search them are highly efficient to implement in low-level languages. If these advantages are not present in the implementation language much of the justification for the use of signatures is rendered irrelevant.

As one example, a naïve implementation of signature search simply consists of performing a Hamming distance computation between the search signature and all the document signatures in a collection. Unless there is something very wrong with the implementation, the main computational task will be the Hamming distance computation, making how well a given language can implement this computation an important issue.

The most obvious way to implement a Hamming distance computation, conceptually, is to:

• Retrieve a register’s worth of bits from the first bit sequence

• Retrieve a register’s worth of bits from the second bit sequence

• Perform an exclusive-or (XOR) operation on the two registers, producing a register containing a 1 bit for each bit that differs between the two registers

• Count the number of 1 bits in this register through some method

• Increment a counting register by this count

• Repeat process until the two bit sequences have been entirely processed
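The steps above can be sketched in C as follows. This is a minimal illustration that assumes each bit sequence is stored as an array of 64-bit words; the function names `popcount64` and `hamming_distance` are hypothetical and not taken from the TOPSIG source. The bit count here uses a portable loop that clears one set bit per iteration, which is only one of several possible methods.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 1 bits in a 64-bit word by repeatedly clearing the lowest set bit. */
static unsigned popcount64(uint64_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit of x */
        count++;
    }
    return count;
}

/* Hamming distance between two bit sequences, each n_words 64-bit words long. */
static size_t hamming_distance(const uint64_t *a, const uint64_t *b, size_t n_words)
{
    size_t distance = 0;                  /* the "counting register" */
    for (size_t i = 0; i < n_words; i++) {
        uint64_t diff = a[i] ^ b[i];      /* 1 bit wherever the two words differ */
        distance += popcount64(diff);     /* add the number of differing bits */
    }
    return distance;
}
```

Whether `a[i]`, `b[i]`, `diff` and `distance` actually live in registers is left to the compiler.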

While even lower-level languages like C do not provide control over whether a variable is stored in a register or not,¹ automatic compiler optimisations are usually clever enough to work this out, so this aspect is not an issue.

¹ C has the register storage-class specifier, but it is only advisory. Its main effect is to forbid a program from taking the address of such a variable, which makes it possible, rather than guaranteed, for the compiler to store the variable in a register.

The efficiency of retrieving and performing operations on the bits in the two sequences is also highly dependent on how the bit sequences are represented in memory. Lower-level languages that allow type punning² can make this a non-issue; if the bit sequences are present in memory they can be extracted and treated as any type.
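As a hedged illustration, the sketch below reads a word-sized chunk out of a buffer of raw bytes, such as one filled directly from a file. The function name `load_word` is hypothetical. It uses `memcpy` rather than a pointer cast, which is one common way to reinterpret bytes as another type without running into aliasing or alignment problems; optimising compilers typically reduce such a fixed-size copy to a single load instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret 8 consecutive bytes of a raw buffer as one 64-bit word.
   The caller must ensure the buffer holds at least (word_index + 1) * 8 bytes. */
static uint64_t load_word(const unsigned char *bytes, size_t word_index)
{
    uint64_t word;
    memcpy(&word, bytes + word_index * sizeof word, sizeof word);
    return word;
}
```

The value obtained depends on the byte order of the machine, which does not matter for a Hamming distance computation as long as both sequences are read in the same way.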

Higher-level languages can have problems with obtaining this level of access to the underlying representations of data types. For example, if the type that works most naturally with the language’s facilities for reading and writing binary data from files is a “byte” type, gaining direct access to word-sized chunks of this memory without expending additional computation time may be a problem.

In other cases, a language may offer a more suitable type to represent this model of use.

For example, Java has a BitSet type used for holding vectors of binary digits. BitSet also has binary operations such as and, or and xor that can be used with other BitSet objects, as well as a cardinality operation to count the number of set bits, so this data type would seem to be optimal. The performance of BitSet is frequently subject to criticism, however, with competing replacements such as OpenBitSet³ and FastBitSet⁴ designed to address this limitation. It should be noted that OpenBitSet, while boasting greater performance than the stock BitSet, is implemented in Java itself and therefore cannot take advantage of certain CPU features suitable for accelerating the cardinality operations.

This is only one example, but it shows how higher-level languages leave you at the mercy of the feature set and the quality of the implementation of those features. You can sometimes be snared by the idiosyncrasies of the language when trying to do things your computer is quite capable of doing.

Naturally, none of this is a problem in lower-level languages, and compiler features can even be used to access instructions such as POPCNT (count the number of set bits in a word) that can be put to good use for computing Hamming distances.
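One way this might look in practice is sketched below, assuming GCC or Clang, both of which provide the `__builtin_popcountll` builtin. The wrapper name `popcount64_fast` is hypothetical. On x86-64 the builtin is compiled to a single POPCNT instruction only when the target is declared to support it (for example with `-mpopcnt` or `-march=native`); otherwise the compiler substitutes a software routine. Other compilers fall through to a portable loop.

```c
#include <stdint.h>

/* Count the set bits in a 64-bit word, using a compiler builtin where available. */
static unsigned popcount64_fast(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* GCC/Clang builtin; may be emitted as a hardware POPCNT instruction. */
    return (unsigned)__builtin_popcountll(x);
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
#endif
}
```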

² Accessing data stored as one type as if it were another type. If the programmer is not careful, this can lead to issues with aliasing.

³ http://lucene.apache.org/core/3_4_0/api/all/org/apache/lucene/util/OpenBitSet.html (retrieved October 19, 2015)

⁴ http://javolution.org/apidocs/javolution/util/FastBitSet.html (retrieved October 19, 2015)
