5.5 Distributed computing
Not covered in this chapter is the topic of utilising distributed computing for signature indexing and searching. Distributed processing is conceptually similar to parallel processing, with the difference being that the computation is spread over multiple computers, instead of simply multiple threads.
This has several implications; for instance, the costs required in synchronisation are far greater when they require communication over a network, and there is usually no shared memory available. Document signature approaches suit distributed computing platforms due to the fact that indexing can be performed almost entirely in isolation, with the signature files generated on multiple machines being trivial to combine for searching later.
Searches can also be distributed in the same fashion; however, this could potentially cause issues with latency. A better approach may be to distribute the indexing, but to have a copy of the signature data on each individual machine and use the machines to load-balance search requests, allowing requests from many users to be handled. This is feasible due to the relatively small file size of the indexed signature data.
5.6 Summary
One of the easiest ways of increasing the speed of an application is to take advantage of the latent processing capabilities of unused cores on a multi-core or multi-processor system.
Hypothetically, doubling the number of threads can double the performance of a program; however, this performance improvement is limited to the portion of time the application spends in processing that can be parallelised. Amdahl’s law shows how adding more processing threads will produce diminishing returns in performance improvement, the severity of which depends on how great the non-parallelisable portions of the application are. The non-parallelisable (sequential) parts of the application include any processing that is run before the execution threads have been spawned, any processing run after the execution threads have been joined and any time that is spent in the middle waiting for other threads to do something.
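As an illustration of Amdahl's law, the following minimal sketch computes the predicted speedup 1 / ((1 - p) + p / n), where p is the fraction of single-threaded run time that can be parallelised and n is the number of threads. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen only for this example.

```c
#include <assert.h>

/* Amdahl's law: predicted speedup with n threads when a fraction p
 * (0 <= p <= 1) of the single-threaded run time can be parallelised. */
double amdahl_speedup(double p, unsigned n)
{
    assert(n > 0 && p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound on the speedup as n grows without limit: 1 / (1 - p).
 * Only defined when some sequential work remains (p < 1). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with p = 0.95 the predicted speedup with 16 threads is roughly 9.1×, and no number of threads can push it beyond 20×, which is the diminishing-returns behaviour described above.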
The action of potentially waiting for a particular resource or location of memory to become free is typically known as synchronisation. Synchronisation typically needs to be performed with data structures and operations specifically designed for the task, known as synchronisation primitives. Regular shared memory typically does not work due to the fact that optimisations and instruction reordering in the compiler and on the CPU can cause operations to not happen in the order specified, can cause reads and writes to work from cache rather than actual memory and can sometimes cause operations to be skipped entirely.
Synchronisation primitives range from mutexes (mutual exclusions) and semaphores to low-level operations such as atomic test-and-set or fetch-and-swap; however, all are designed to be used to safely access shared resources between threads. Those parts of the program protected by synchronisation primitives such that only one thread can be present in them at a time are known as critical sections.
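A critical section protected by a mutex can be sketched with pthreads as follows. This is an illustrative example only and is not taken from TOPSIG; the names `shared_counter`, `increment_worker` and `run_counter` are hypothetical, and error checking on the pthread calls is omitted for brevity.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

struct shared_counter {
    pthread_mutex_t lock;  /* protects value */
    long value;
};

struct worker_arg {
    struct shared_counter *counter;
    long iters;
};

static void *increment_worker(void *arg)
{
    struct worker_arg *a = arg;
    for (long i = 0; i < a->iters; i++) {
        pthread_mutex_lock(&a->counter->lock);    /* enter critical section */
        a->counter->value++;                      /* only one thread at a time */
        pthread_mutex_unlock(&a->counter->lock);  /* leave critical section */
    }
    return NULL;
}

/* Runs n_threads workers that each increment the shared counter iters
 * times, then returns the final counter value. */
long run_counter(int n_threads, long iters)
{
    pthread_t tid[MAX_THREADS];
    struct shared_counter counter;
    struct worker_arg arg;

    if (n_threads < 1) n_threads = 1;
    if (n_threads > MAX_THREADS) n_threads = MAX_THREADS;

    counter.value = 0;
    pthread_mutex_init(&counter.lock, NULL);
    arg.counter = &counter;
    arg.iters = iters;

    for (int t = 0; t < n_threads; t++)
        pthread_create(&tid[t], NULL, increment_worker, &arg);
    for (int t = 0; t < n_threads; t++)
        pthread_join(tid[t], NULL);

    pthread_mutex_destroy(&counter.lock);
    return counter.value;
}
```

Without the lock and unlock calls, the increments from different threads could interleave and the final value would not reliably equal n_threads × iters.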
While invaluable to the stability of any multithreaded application that utilises shared resources, synchronisation is an inherently sequential process, as it ultimately means waiting for other threads to finish working and doing nothing in the meantime. Synchronisation even incurs a performance cost in cases where there is no need to wait, as is the case with lock-free data structures, as operations on shared memory need to invalidate any cache lines that happen to reference it, which might be present on any CPU dealing with the same data.
Highly efficient parallel processing relies on avoiding excessive amounts of synchronisation.
One aspect of this is determining how best to partition a large task among many threads. An ideal partitioning divides the data to be processed into chunks that are small enough that there are plenty to share among the processing threads, so that time is not wasted waiting for a thread that happens to be running slower than the others to finish, but large enough that the threads do not have to synchronise too often. Another issue to consider when partitioning is whether certain subtask divisions require more synchronisation than others; for instance, if several subtasks may end up requiring shared access to the same resource, it may be better to split along allocations of that resource simply to avoid having to synchronise it.
To show these factors being applied, a simple array bit-counting algorithm was developed, using the POSIX threads API (pthreads) for multithreading. The bit-counting algorithm was designed to be very fast, processing data 64 bits at a time and utilising the SSE4.2 64-bit POPCNT instruction. The algorithm was initially designed with the largest possible subtasks, with the array simply divided into equal portions for each thread. This is an incredibly simple way of dividing up a task and manages to entirely avoid synchronisation until the end. However, this resulted in a performance shortfall as more threads were added, with about half the threads finishing earlier than the other half and being left with nothing to do.
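The equal-portions design can be sketched as below. This is a minimal illustration under stated assumptions rather than the code used for the measurements: it relies on the GCC/Clang builtin `__builtin_popcountll` (which compiles to POPCNT when the target supports it), the names `slice`, `count_slice` and `count_bits_static` are hypothetical, and error checking is omitted.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64

struct slice {
    const uint64_t *data;  /* first word of this thread's portion */
    size_t len;            /* number of 64-bit words in the portion */
    uint64_t count;        /* result, written only by the owning thread */
};

static void *count_slice(void *arg)
{
    struct slice *s = arg;
    uint64_t total = 0;
    for (size_t i = 0; i < s->len; i++)
        total += (uint64_t)__builtin_popcountll(s->data[i]);
    s->count = total;
    return NULL;
}

/* Counts the set bits in data[0..n_words) by giving each thread one
 * contiguous, near-equal portion. No synchronisation is needed until
 * the threads are joined. */
uint64_t count_bits_static(const uint64_t *data, size_t n_words, int n_threads)
{
    pthread_t tid[MAX_THREADS];
    struct slice sl[MAX_THREADS];

    if (n_threads < 1) n_threads = 1;
    if (n_threads > MAX_THREADS) n_threads = MAX_THREADS;

    size_t base = n_words / (size_t)n_threads;
    size_t extra = n_words % (size_t)n_threads;  /* spread the remainder */
    size_t off = 0;

    for (int t = 0; t < n_threads; t++) {
        size_t len = base + ((size_t)t < extra ? 1 : 0);
        sl[t].data = data + off;
        sl[t].len = len;
        sl[t].count = 0;
        off += len;
        pthread_create(&tid[t], NULL, count_slice, &sl[t]);
    }

    uint64_t total = 0;
    for (int t = 0; t < n_threads; t++) {
        pthread_join(tid[t], NULL);
        total += sl[t].count;
    }
    return total;
}
```

Because each thread writes only to its own `slice`, the per-thread results need no locking; the cost of this simplicity is that a thread that finishes early has nothing further to do.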
This was addressed by splitting the task into smaller subtasks and placing them into a queue from which the threads could take a new subtask whenever they finished their previous one. In order for multiple threads to share a queue of this form, some level of synchronisation is necessary, as the threads need to coordinate so that no two threads take the same subtask. While a mutex would suffice for this task, in this case a shared atomic variable was sufficient, as there is never any need for threads to wait; each thread only needs to take the next available subtask and stop when all the subtasks are taken.
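One way to realise such a queue with a single shared atomic variable is sketched below. It assumes C11 atomics (`atomic_fetch_add` from `<stdatomic.h>`) and the GCC/Clang builtin `__builtin_popcountll`; the text does not state which atomic operation was actually used, and the names `job`, `chunk_worker` and `count_bits_queued` are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 64

struct job {
    const uint64_t *data;
    size_t n_words;
    size_t chunk_words;          /* words per subtask */
    size_t n_chunks;             /* total number of subtasks */
    atomic_size_t next_chunk;    /* index of the next untaken subtask */
    atomic_uint_fast64_t total;  /* sum of all per-thread counts */
};

static void *chunk_worker(void *arg)
{
    struct job *j = arg;
    uint64_t local = 0;
    for (;;) {
        /* Atomically claim the next subtask; no thread ever waits. */
        size_t c = atomic_fetch_add(&j->next_chunk, 1);
        if (c >= j->n_chunks)
            break;  /* all subtasks have been taken */
        size_t begin = c * j->chunk_words;
        size_t end = begin + j->chunk_words;
        if (end > j->n_words)
            end = j->n_words;  /* last chunk may be short */
        for (size_t i = begin; i < end; i++)
            local += (uint64_t)__builtin_popcountll(j->data[i]);
    }
    atomic_fetch_add(&j->total, local);  /* one shared update per thread */
    return NULL;
}

uint64_t count_bits_queued(const uint64_t *data, size_t n_words,
                           size_t chunk_words, int n_threads)
{
    pthread_t tid[MAX_THREADS];
    struct job j;

    if (chunk_words == 0) chunk_words = 1;
    if (n_threads < 1) n_threads = 1;
    if (n_threads > MAX_THREADS) n_threads = MAX_THREADS;

    j.data = data;
    j.n_words = n_words;
    j.chunk_words = chunk_words;
    j.n_chunks = (n_words + chunk_words - 1) / chunk_words;
    atomic_init(&j.next_chunk, 0);
    atomic_init(&j.total, 0);

    for (int t = 0; t < n_threads; t++)
        pthread_create(&tid[t], NULL, chunk_worker, &j);
    for (int t = 0; t < n_threads; t++)
        pthread_join(tid[t], NULL);

    return atomic_load(&j.total);
}
```

The `chunk_words` parameter exposes the trade-off discussed earlier: smaller chunks balance the load better, while larger chunks reduce the number of atomic operations on the shared counter.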
With the full set of optimisations enabled, the multithreaded bit counter was capable of counting all the bits in a 1 MiB array in just 24 905 ns using 16 threads, processing data at a rate of over 40 GiB s−1. This is ∼ 7.29× the rate of the run with only 1 thread, and while this is slower than might be expected for a task processed with 16 threads, the data is also approaching memory bandwidth limitations, meaning that the maximum rate at which memory can be read in might be the ceiling for the algorithm. To verify this, the algorithm was modified to skip counting every second word, and doing so barely showed any improvement at all for the 16 thread run. By skipping every second word the system still had to transfer the same amount of memory into cache, but half of it was unused by the program.
This showed that the speed limit was very likely imposed by memory bandwidth limitations.
The two main jobs TOPSIG has are signature indexing and searching. The basic model behind the signature searching approach is that a given collection needs to be indexed once, which is the process by which the signature file is generated, and after that it can be searched many times with different queries. Both the indexing and searching operations can be multithreaded, and while the tasks are entirely different, they can both be implemented using a multithreading model similar to that of the bit counter. With the bit counter, the data array was divided up into chunks which were stored in a central pool that threads could read from once they had finished processing their previous chunks.
In the case of the signature indexer, the resource in question is a file handle, reading a file (potentially with compression), which is then through some process broken up into individual documents, which are then converted into signatures and written out to a data file. Reading the data does require some processing; however, this operation is largely I/O-bound. After this, the individual documents can easily be processed independently so it makes sense to adopt a similar model to the bit counter and put these on a queue for processing threads to read. The biggest difference between the signature indexer and the bit counter is the large amount of I/O taking place on both ends. While the bit counter created the array before launching any of the threads, reading the signature data could take some time and it makes sense to allow threads to start processing data even before all of it has been read into memory. In fact, in the case of particularly large collections, it may be necessary to do this; the ClueWeb09 collection occupies 25 TiB in its uncompressed state, which is a lot to fit into RAM at once. On the output side, it makes sense to have a similar queueing system;
although in this case the processing threads will add generated signatures to the queue where they will be written to a file by the output thread. To avoid the additional synchronisation overhead at the cost of an additional memory overhead, each thread is given its own term signature cache. With the use of multithreading, the time taken to index the Wikipedia XML collection is reduced from 48 min 37 s with 1 thread to 3 min 59 s with 16 threads, a performance improvement of over 12× (Table 5.8, page 136).
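The queue between the reading thread and the processing threads could be built in several ways. The sketch below shows one conventional option, a bounded queue guarded by a mutex and two condition variables; it is an assumption made for illustration and not a description of TOPSIG's internals. All names (`doc_queue`, `dq_init`, `dq_push`, `dq_close`, `dq_pop`) and the capacity are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* bounds memory use while the reader runs ahead */

struct doc_queue {
    void *items[QUEUE_CAP];  /* ring buffer of document pointers */
    size_t head, count;
    int closed;              /* set once the reader reaches end of input */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

void dq_init(struct doc_queue *q)
{
    q->head = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Called by the reader; blocks while the queue is full. */
void dq_push(struct doc_queue *q, void *doc)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = doc;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Called by the reader when no more documents will arrive. */
void dq_close(struct doc_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);  /* wake all waiting workers */
    pthread_mutex_unlock(&q->lock);
}

/* Called by workers. Returns 1 and stores a document in *doc, or
 * returns 0 once the queue is closed and fully drained. */
int dq_pop(struct doc_queue *q, void **doc)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return 0;
    }
    *doc = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return 1;
}
```

The same structure could serve on the output side, with the processing threads calling `dq_push` on generated signatures and a single output thread calling `dq_pop` and writing to the file. Bounding the queue lets processing begin before the whole collection has been read while keeping memory use limited.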
Signature searching can be implemented either as a batch model or as a single query model. In a batch model, all of the queries are available from the start and it is up to the search engine to execute all of them as quickly as possible. In a single query model, queries are executed one at a time and the search engine is optimised to execute each query as quickly as possible. For TOPSIG the decision was made to implement the single query search model as it is more flexible, despite the fact that its optimisation ceiling will inherently be lower than that of the batch model, which requires less synchronisation. The implementation of multithreaded signature search is again quite simple; the document signatures are divided up between the threads and the threads each return an individual top-k list. The top-k lists are then consolidated into one top-k list which is returned to the user or written out to a topic file. With multithreading, the time to search the Wikipedia XML collection is reduced from 160.531 ms with 1 thread to 28.373 ms with 15 threads, a performance improvement of 5.66× (Table 5.13, page 151).
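The consolidation of per-thread top-k lists can be sketched as follows. The example assumes that each result carries a distance where lower is better (as with a Hamming distance between signatures); the `hit` structure and the `merge_topk` function are hypothetical names, and the simple concatenate-and-sort strategy is chosen for clarity rather than speed.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct hit {
    unsigned doc_id;
    unsigned dist;  /* assumed: lower distance means a better match */
};

/* Orders hits by ascending distance, breaking ties by document id. */
static int hit_cmp(const void *a, const void *b)
{
    const struct hit *x = a, *y = b;
    if (x->dist != y->dist)
        return x->dist < y->dist ? -1 : 1;
    return (x->doc_id > y->doc_id) - (x->doc_id < y->doc_id);
}

/* Merges n_lists per-thread result lists (lists[i] holds counts[i] hits)
 * into out[], keeping at most the k best. Returns the number of hits
 * written, or 0 if there is nothing to merge or allocation fails. */
size_t merge_topk(const struct hit *const *lists, const size_t *counts,
                  size_t n_lists, struct hit *out, size_t k)
{
    size_t total = 0;
    for (size_t i = 0; i < n_lists; i++)
        total += counts[i];
    if (total == 0 || k == 0)
        return 0;

    struct hit *all = malloc(total * sizeof *all);
    if (all == NULL)
        return 0;

    size_t n = 0;
    for (size_t i = 0; i < n_lists; i++) {
        if (counts[i] > 0) {
            memcpy(all + n, lists[i], counts[i] * sizeof *all);
            n += counts[i];
        }
    }

    qsort(all, total, sizeof *all, hit_cmp);

    size_t m = total < k ? total : k;
    memcpy(out, all, m * sizeof *out);
    free(all);
    return m;
}
```

Since each thread contributes at most k hits, the merge handles at most k times the number of threads entries, so this final sequential step is small compared with the scan over the document signatures.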