Software Implementation
4.5 Signature storage
Document signatures, after being created by the indexer, need to be stored in some form so that they will be available for queries. This could be in memory, but usually means on disk so the generated signatures are available for subsequent invocations of the program, as well as available to be used by other software, moved to other machines, processed etc. The way signatures are stored, both on disk and in memory, has various implications for portability, performance and convenience.
For the moment, the scenario the storage scheme is being designed around is that of exhaustive sequential searching; that is, every signature in the collection is compared against the search signature. Other means of searching signatures and what they imply for how signatures are stored will be discussed later.
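The exhaustive sequential search described above can be illustrated with a minimal sketch. It assumes signatures are fixed-width bit vectors held in one contiguous array and compared by Hamming distance (XOR followed by a bit count); the function names are hypothetical and are not taken from the TOPSIG source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one byte (portable; no compiler intrinsics). */
static unsigned popcount8(uint8_t b)
{
    unsigned n = 0;
    while (b) {
        b &= (uint8_t)(b - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance between two signatures of `bytes` bytes each. */
unsigned hamming_distance(const uint8_t *a, const uint8_t *b, size_t bytes)
{
    unsigned dist = 0;
    for (size_t i = 0; i < bytes; i++)
        dist += popcount8((uint8_t)(a[i] ^ b[i]));
    return dist;
}

/* Exhaustive search: compare the query against every signature in a
 * contiguous array and return the index of the closest one.
 * The first signature wins when distances are tied. */
size_t nearest_signature(const uint8_t *sigs, size_t count,
                         size_t bytes, const uint8_t *query)
{
    assert(count > 0);
    size_t best = 0;
    unsigned best_dist = hamming_distance(sigs, query, bytes);
    for (size_t i = 1; i < count; i++) {
        unsigned d = hamming_distance(sigs + i * bytes, query, bytes);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}
```

A production search would normally process whole machine words rather than single bytes; the byte-wise loop is used here only to keep the sketch short.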
4.5.1 In-memory representation
First of all, it is highly desirable to have the in-memory representation of signature data match the on-disk representation. This means that the bits that make up the signature, together with all the metadata stored alongside them, are laid out in a block of memory exactly as they are laid out on disk. This has a number of implications:
• Reading signatures from disk into memory, the necessary first step for searching a collection [9], is as fast as possible.
If the in-memory and on-disk representations differ at all, the signature data will need to be processed as it is being read in. Depending on the extent of the processing, this can be done without slowing down the reading; however, at the same time, CPU time that could be spent on something else in a different thread is wasted. Having identical representations avoids this and allows you to spend idle CPU while the signature data is being read on something else.
[9] Assuming the collection is being searched in-memory. It is possible to stream a signature index, which can be useful if only one query is being executed and/or the signature data requires more memory than is available to the machine doing the searching.
• Writing signatures to disk is also faster. A highly optimised signature indexer will typically have threads processing the collection into signatures that are written out by a dedicated writer thread. Keeping the representations as similar as possible means less processing is required on the writer thread’s side, meaning signature indexing takes less time.
• Portability is a desirable property and moving signature data between architectures can be problematic with mirrored in-memory and on-disk representations; occasionally small sacrifices will need to be made.
One of the more well known architectural differences in the storage of data types is endianness. In “little endian” byte ordering a data word has the bytes ordered from least significant to most significant while “big endian” byte ordering has the bytes in a word ordered from most significant to least significant. The x86 architecture uses little endian. Some architectures such as ARM, MIPS and PowerPC make both byte orderings available; however, this does not mean that a given system using one of these architectures will be able to use programs with either. MIPS, for example, requires that the byte order be specified on boot, which means the underlying operating system and all the programs running on it will need to share endianness.
While little endian is almost ubiquitous, especially on consumer PCs, big endian byte orderings are still used and cannot be ignored entirely.
The raw signature data itself is byte-order-agnostic, as the bits that make up the signature are stored in the same order regardless. While larger data units such as words may be used to accelerate operations such as XOR and POPCNT across the signature bit vectors, these operations are themselves agnostic to the underlying byte-ordering and will produce the same results on any architecture. Hence, if only the signatures are stored in the data file, endianness will not have any impact on the results.
This does not apply to any metadata stored with the signatures, which may feature word-sized integers among other things. This can become a minor tradeoff between performance and portability; it is most efficient to work with in-memory words that use the same byte ordering as the architecture.
A potential compromise is to pick one of the byte orderings (such as little endian, due to its ubiquity on consumer hardware) and use special access routines to read signature metadata. These routines can, at compile time, be replaced with native memory accesses on little endian architectures and with more complicated routines that reverse the byte order of data words during reads/writes on big endian architectures.
This allows the software to take advantage of performance gains from using a consistent byte ordering that mirrors the native byte ordering of the hardware the program is running on while also working, albeit with slightly poorer performance, on incompatible hardware without sacrificing the portability of the signature data files.
• A similar problem, also related to portability, is the potential difference between the optimal data word sizes between architectures. Typically a given architecture will have a word size which is the most efficient unit of memory to work with on that architecture.
Hence, for frequently-accessed data, use of an architecture’s native word size is highly desirable when considering the problem from the perspective of performance.
On modern consumer hardware, this typically comes down to a choice between 32 bit and 64 bit word sizes. 64 bit hardware is almost ubiquitous in consumer PCs; however, 32 bit machines still exist and some consumers may still be using older operating system software forcing their 64 bit machines to run in 32 bit mode.
This is, however, only a minor issue. Standard 64 bit consumer machines suffer no performance penalties for accessing 32 bit words and unless a larger range of values is necessary there is no justification for using 64 bit words for storing signature metadata.
64 bit accesses can be used on 64 bit machines to make signature processing more efficient; however, this has no implications for the underlying data format and is simply an advantage those machines gain.
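The byte-order compromise described in the list above can be sketched as a pair of access routines for 32 bit metadata words. This is an illustration only: it assumes metadata is stored little endian on disk, the names meta_read_u32 and meta_write_u32 are hypothetical, and the compile-time test relies on the __BYTE_ORDER__ macro predefined by GCC and Clang. Compilers that do not define it fall back to the portable path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: metadata integers are stored little endian in the file. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)

/* Native order matches the file order: a plain copy is enough. */
uint32_t meta_read_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v); /* memcpy tolerates unaligned addresses */
    return v;
}

void meta_write_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}

#else

/* Big endian or unknown host: assemble the value byte by byte. */
uint32_t meta_read_u32(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    assert(p != NULL);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

void meta_write_u32(void *p, uint32_t v)
{
    uint8_t *b = (uint8_t *)p;
    assert(p != NULL);
    b[0] = (uint8_t)(v & 0xFF);
    b[1] = (uint8_t)((v >> 8) & 0xFF);
    b[2] = (uint8_t)((v >> 16) & 0xFF);
    b[3] = (uint8_t)((v >> 24) & 0xFF);
}

#endif
```

Both variants give the same results for the same file bytes, so code that goes through these routines does not need to know which one was compiled in.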
4.5.2 Signature metadata
Raw signature data, by itself, is not particularly useful for searching. After a signature that closely matches the query is found, some way of tying that signature to a document is desirable, especially as the signature is not itself human-readable and hence not terribly useful to a human user.
To solve this problem some amount of metadata will need to be stored along with the signature data. There are a number of potential storage options for the metadata:
[Signature] [Signature] [Signature] [Metadata] [Metadata] [Metadata]
Figure 4.2: Depiction of document metadata stored separately from the signature data. The metadata could appear in a separate data file, or elsewhere in the same file.
[Signature] [Metadata] [Signature] [Metadata] [Signature] [Metadata]
Figure 4.3: Depiction of document metadata stored interleaved with the signature data. The metadata appears in a contiguous block with the signature it represents. This keeps all the data a given signature requires in one place.
• Stored separately from the signature data (Figure 4.2). This effectively implies two separate blocks: one of pure signature data and one of metadata (generally stored in the same order to ensure that there is a way of mapping signature data to metadata).
These blocks could be placed in separate files or stored in the same file. What matters is the organisation.
This approach makes a lot of sense for searching if the code doing the searching does not make use of the metadata until it has determined which signatures to return. CPUs have caches of high-speed memory that is loaded in cache lines from regular memory as memory accesses are made. This means that accessing contiguous bytes is very efficient. However, data that is read into the cache and not subsequently used wastes valuable cache space, which is what will happen to metadata that is stored interleaved with signature data if the signature data is all used but the metadata is not.
• Interleaved with the signature data. (Figure 4.3) This typically means that, for each signature, there is a small block of signature data and a small block of metadata. This provides some of its own advantages.
With the metadata and the signature data forming one cohesive block, if that block is available all the data required for working with the signature is as well. The metadata is never apart from the signature data. This is inviting purely from the standpoint that it is a very simple representation; signature blocks can be concatenated together to form larger ones. Signature data can be streamed, either from disk or over a network. The signature indexer only has to write out single blocks to a file; nothing needs to be kept after it is written out. All operations on signatures are much simpler when metadata and signature data are kept together.
There are also several reasons to prefer that the signature metadata be available while searching, in which case it is better for the data to be together to make best use of the CPU cache. Certain kinds of metadata, such as “document quality” or “document length” values created while indexing, may be useful to the search code, making it possible to consider some factors other than Hamming distance from the search signature when making a determination about how relevant a document is. Data that is relevant for more targeted searching, such as category information, may also be useful to the search code.
4.5.3 Signature file format
While both of the aforementioned storage models for signatures have their own benefits, the model chosen for this research implementation is that with the interleaved signatures. This is mostly to simplify file formats and make it easier to stream signature data for cases in which the signature file cannot be kept entirely in memory. Note that, although Figure 4.3 showed the metadata following the signature data, in the file format described below the signature data actually follows the metadata.
The signature file format used by this implementation is described in Figure 4.4. The format consists of two types of block: the collection header, which appears once per file, and the signature block, of which one appears for each signature in the file.
The header does not contain the number of signatures in the file. This is because the signature count is a value that is typically not known during indexing, and to write this to the file would require rewinding after the file has been indexed, which can be problematic in some situations. Furthermore, the number of signatures in a file can be calculated from information in the header (the header size, the length of the document name field and the signature width) combined with the file size.
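The calculation mentioned above can be sketched as follows, using the field sizes listed in Figure 4.4. The struct and function names are hypothetical, the struct is only an illustrative mirror of the header fields, and the sketch assumes that the signature width is a multiple of 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the collection header. Field names are
 * assumptions; only the sizes and order follow Figure 4.4. */
struct sig_header {
    uint32_t header_size;  /* size of this header in bytes */
    uint32_t version;      /* signature file format version */
    uint32_t docname_len;  /* length of the document name field */
    uint32_t sig_width;    /* signature width in bits */
    uint32_t density;      /* signature density parameter */
    uint32_t seed;         /* additional seed (salt) */
    char     method[64];   /* generation method name, zero padded */
};

/* The signature block holds eight 4-byte metadata fields. */
#define SIG_FIXED_META_BYTES (8u * 4u)

/* Size in bytes of one signature block. */
uint64_t sig_block_size(const struct sig_header *h)
{
    assert(h->sig_width % 8 == 0); /* assumption: whole bytes only */
    return (uint64_t)h->docname_len + 1  /* name field is length + 1 */
         + SIG_FIXED_META_BYTES
         + h->sig_width / 8;
}

/* Number of signatures implied by the file size, or -1 if the file
 * size is not consistent with the header. */
int64_t sig_count(const struct sig_header *h, uint64_t file_size)
{
    uint64_t block = sig_block_size(h);
    if (file_size < h->header_size)
        return -1;
    uint64_t payload = file_size - h->header_size;
    if (payload % block != 0)
        return -1;
    return (int64_t)(payload / block);
}
```

For example, with an 88 byte header, a 255 byte name field and 1024 bit signatures, each block occupies 256 + 32 + 128 = 416 bytes, so a file of 88 + 4160 bytes holds 10 signatures.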
Size    Field description
Collection header
4 B     The size of this header (in bytes)
4 B     The version of the signature file format. This allows the file format to be changed without breaking old signature files, as new versions of the software can detect the old files. The current version is 2. Version 1 lacked a signature seed field.
4 B     The length of the document name field in the signature block
4 B     The signature width (in bits)
4 B     The density of the signature. A value of x means a signature density of 1/x.
4 B     The additional seed (or salt) added to the term signature seeds
64 B    The name of the signature generation method, padded with 0-bytes
Signature block
?       The name of the document this signature represents. The length of this field is equal to the value in the document name length field plus 1.
4 B     The number of unique terms that appear in this document
4 B     The length of the original document (in characters)
4 B     The total number of terms in the original document
4 B     The document’s “quality” index. This is used for breaking ties when searching.
4 B     The starting position within the original document of the segment this signature was generated from
4 B     The end position within the original document of the segment this signature was generated from
4 B     Unused
4 B     Unused
?       The raw signature data. The size of this field is determined by the signature width given in the header.
Figure 4.4: The file format of the signature data files used by TOPSIG

The header contains the information necessary to create signatures that are compatible with the signatures in this data file and hence enable the searching of the signature file in question. This information is stored in the header rather than in the individual signature blocks due to a number of factors:
• Having signatures of different formats in the same collection makes that collection more difficult to search. The standard signature search approach involves creating one search signature and comparing it to all the signatures in the collection. If the search signature must be recreated for every signature in the collection the additional computational expense involved would make signature searching infeasible.
• Some parameters, such as the document name field length and the signature width, affect the size of the signature blocks. Having signatures of different sizes in the one collection makes handling the data file less efficient, as determining the position of signature n would require interpreting the previous n − 1 signatures first.
• The extra data stored with every signature would cause the size of the file to increase dramatically, making it more difficult to store large collections in memory.
This is also the reason why it is important that the document name field length be sized appropriately for the collection; the default size (255 bytes) is large and designed to accommodate even really long document names; if the document names in a particular collection are no longer than a certain size, it is useful to set this field to that size to reduce the file size and memory usage of the signature data.
If the collection headers of two signature files match, the signature blocks within them can be combined. This means that multiple signature data files can be concatenated together very efficiently, which is a useful property. For instance, while this research does not cover indexing over multiple networked machines or clusters, different parts of a collection can be indexed with identical settings on multiple machines, after which the resulting signature files can be combined into one large signature file that can be searched like any other.
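A minimal sketch of this concatenation is given below. It works on in-memory copies of two signature files and assumes that "matching" headers means byte-identical headers; the function name sig_concat and the decision to pass the header size in as a parameter are illustrative choices, not part of TOPSIG.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Concatenate two in-memory signature files whose headers match.
 * Returns a malloc'd buffer (the caller frees it) and stores its size
 * in *out_size. Returns NULL if the headers differ, if either input is
 * shorter than the header, or if allocation fails. */
uint8_t *sig_concat(const uint8_t *a, size_t a_size,
                    const uint8_t *b, size_t b_size,
                    size_t header_size, size_t *out_size)
{
    assert(a && b && out_size);
    if (a_size < header_size || b_size < header_size)
        return NULL;
    if (memcmp(a, b, header_size) != 0)
        return NULL; /* incompatible collections */

    size_t total = a_size + (b_size - header_size);
    uint8_t *out = malloc(total);
    if (!out)
        return NULL;

    memcpy(out, a, a_size);              /* header plus A's blocks */
    memcpy(out + a_size, b + header_size,
           b_size - header_size);        /* B's blocks only */
    *out_size = total;
    return out;
}
```

Because every signature block in a collection has the same size, no block needs to be parsed or rewritten; the second file's blocks are simply appended after the first file's.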
4.6 Availability
To make the software available to others for their own research, the decision was made for the software to be released under an open license (the GNU GPL v3). The source is available at www.topsig.org.
4.7 Summary
In the course of investigating document signatures for the purpose of this research, the open-source TOPSIG information retrieval platform was developed. The software was developed in C to take advantage of that language’s performance, simplicity and ability to handle low-level data representations efficiently. The initial goal for TOPSIG was to port the original TOPSIG [Geva and De Vries, 2011] algorithm over to C, with particular attention paid to performance and extensibility, in order to develop the approach further.
TOPSIG was implemented as a collection of interconnected modules, which expose only a thin interface and the layout of certain data structures and act as a black box from the perspective of other modules. This design encourages limited inter-module coupling and information hiding. TOPSIG can launch in a variety of different modes and the root module is responsible for transferring control to the most appropriate module for a given execution mode. Certain modules are not directly responsible for a given execution mode but instead provide common functionality to other modules, such as the search and threading modules.
The configuration module is a special case in that it provides access to user configuration settings to the rest of the application.
User configuration is performed through a flexible combination of configuration files and command-line arguments, and works on the basis that configuration settings declared later overwrite those that are declared earlier. This makes the order in which configuration is loaded important. By default, TOPSIG reads all the configuration information from the config.txt file in the working directory from which TOPSIG is launched, then collects additional configuration information through command-line arguments, allowing the user to keep a common set of configuration options in this file and set additional options at each launch. Additional flexibility comes from the fact that the user can supply additional configuration files through the CONFIG option, which immediately loads in all the configuration information from the specified file as soon as it is seen, allowing configuration files to reference other configuration files, or the user to specify a different configuration file at launch. Configuration files can even be nested a variable number of levels deep. All configuration options are specified as key-value pairs through this approach and the value of a given key can be read by any module that links the configuration module. Many configuration settings are entirely optional and will assume sensible defaults if no value is given, while others require a value to be set before the execution modes that use it can be run.
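The "later settings overwrite earlier ones" rule and the fallback to defaults can be sketched with a small key-value store. This is a simplified illustration, not the TOPSIG configuration module: the names cfg_set and cfg_get, the fixed table size and the 63 character limit on keys and values are all assumptions, and file parsing and the CONFIG option are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CFG_MAX 64 /* hypothetical limit on distinct keys */
#define CFG_LEN 64 /* keys and values are truncated to 63 characters */

static char cfg_keys[CFG_MAX][CFG_LEN];
static char cfg_vals[CFG_MAX][CFG_LEN];
static int  cfg_count = 0;

/* Set a key. A later call with the same key overwrites the earlier
 * value. Returns 0 on success or -1 if the table is full. */
int cfg_set(const char *key, const char *val)
{
    int i;
    assert(key && val);
    for (i = 0; i < cfg_count; i++)
        if (strcmp(cfg_keys[i], key) == 0)
            break;
    if (i == cfg_count) {
        if (cfg_count == CFG_MAX)
            return -1;
        snprintf(cfg_keys[i], CFG_LEN, "%s", key);
        cfg_count++;
    }
    snprintf(cfg_vals[i], CFG_LEN, "%s", val);
    return 0;
}

/* Look up a key, returning `fallback` when it has never been set. */
const char *cfg_get(const char *key, const char *fallback)
{
    assert(key);
    for (int i = 0; i < cfg_count; i++)
        if (strcmp(cfg_keys[i], key) == 0)
            return cfg_vals[i];
    return fallback;
}
```

Under this scheme, loading the configuration file first and the command-line arguments second means that a command-line value replaces the value from the file simply because it is set later.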
Term signatures are generated reproducibly using the ISAAC pseudo-random number generator, seeded with the text of the term that the signature is being generated for. The generator is then used to select positions within the signature to set. This is a highly efficient process, four times faster than the naïve approach, albeit one that is also affected by the signature density settings. Higher-density settings require more bits to be set and hence require more random values to be generated.
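The idea of seeding a generator with the term text and using it to choose positions can be sketched as below. This sketch does not use ISAAC: it substitutes an FNV-1a hash of the term as the seed and a small xorshift32 generator, purely to keep the example self-contained. The function name and the rule of drawing width / density positions for a density of 1/density are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a hash of the term text, used here to derive the seed. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* xorshift32: a small stand-in generator, NOT ISAAC. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill `sig` (width_bits / 8 bytes) with a reproducible signature for
 * `term`. width_bits / density positions are drawn; a position drawn
 * twice is only set once, so slightly fewer bits may end up set. */
void term_signature(const char *term, uint8_t *sig,
                    uint32_t width_bits, uint32_t density)
{
    assert(term && sig);
    assert(width_bits % 8 == 0 && density > 0);

    uint32_t state = fnv1a(term);
    if (state == 0)
        state = 1; /* xorshift32 must not start from zero */

    memset(sig, 0, width_bits / 8);
    for (uint32_t i = 0; i < width_bits / density; i++) {
        uint32_t pos = xorshift32(&state) % width_bits;
        sig[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}
```

Because the generator state depends only on the term text, the same term always yields the same signature, and a higher density (a smaller divisor) requires more random draws.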
As the expense associated with generating these signatures takes up a relatively large portion of the indexing time, it is useful to store term signatures that have already been generated in a term signature cache. The term signature cache is simply a hash table that terms are added to after their signatures have been generated. For subsequent document signature generations, term signatures that are cached do not need to be regenerated and can simply be added directly to the document signature.
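A term signature cache of this kind can be sketched as a chained hash table keyed by the term text. The names, the bucket count and the fixed 1024 bit signature size below are assumptions made for the sketch and do not describe the TOPSIG implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_BUCKETS 1024 /* hypothetical table size */
#define SIG_BYTES 128      /* hypothetical: 1024 bit signatures */

struct cache_entry {
    struct cache_entry *next; /* next entry in the same bucket */
    char *term;               /* owned copy of the term text */
    uint8_t sig[SIG_BYTES];   /* the cached term signature */
};

static struct cache_entry *cache[CACHE_BUCKETS];

/* Map a term to a bucket using the FNV-1a hash. */
static size_t bucket_of(const char *term)
{
    uint32_t h = 2166136261u;
    while (*term) {
        h ^= (uint8_t)*term++;
        h *= 16777619u;
    }
    return h % CACHE_BUCKETS;
}

/* Return the cached signature for `term`, or NULL if it is not cached. */
const uint8_t *cache_lookup(const char *term)
{
    assert(term);
    for (struct cache_entry *e = cache[bucket_of(term)]; e; e = e->next)
        if (strcmp(e->term, term) == 0)
            return e->sig;
    return NULL;
}

/* Store a copy of `sig` under `term`. Callers are expected to call
 * cache_lookup first. Returns 0 on success, -1 on allocation failure. */
int cache_insert(const char *term, const uint8_t *sig)
{
    assert(term && sig);
    size_t b = bucket_of(term);
    size_t len = strlen(term) + 1;
    struct cache_entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->term = malloc(len);
    if (!e->term) {
        free(e);
        return -1;
    }
    memcpy(e->term, term, len);
    memcpy(e->sig, sig, SIG_BYTES);
    e->next = cache[b];
    cache[b] = e;
    return 0;
}
```

During indexing, a term would first be looked up in the cache; only on a miss is its signature generated and then inserted for later documents.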