
4.6 Use of High-Performance Computation to Accelerate MOTIVATED

4.6.6 Performance Observations

For reasons detailed in the previous section, the most effective performance of the MOTIVATED system was observed when it was run using the local MAVEN cluster. Because SGE manages a shared resource, exclusive use of the cluster cannot always be guaranteed, and so neither can a particular performance improvement. However, a set of small benchmark runs was evaluated on the system at a period when no external jobs were allocated. There are a number of important considerations in ensuring efficient and effective use of the cluster-based system. Firstly, the workload must be balanced equally across all the available nodes. Furthermore, it is desirable for the individual jobs to terminate as close to each other as possible, as the MO ranking requires all results to be available to compare and sort. A final consideration is that the workload given to each node is of a practical length: setting a very short workload, such as a single SPICE evaluation, is inefficient for a number of reasons. Whilst the job submission process on SGE is rapid compared to that for Grid-submission systems, it can take a number of seconds from submission to the initiation of jobs; having multiple jobs start simultaneously adds delays due to the shared file system; and the polling process adds a slight delay between the completion of the final job on the cluster and the acknowledgement of completion on the remote system on which MOTIVATED is running.

As a consequence, it is generally desirable to have more than one SPICE netlist evaluated by each available node, and so a multiple of 48 (the number of logical nodes) is generally chosen for the population size.
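The balancing argument can be illustrated with a minimal sketch. It assumes 48 logical nodes, as used in the benchmarks, and an even round-robin style split; the function names are hypothetical and are not part of MOTIVATED.

```python
def split_population(population_size: int, node_count: int = 48) -> list[int]:
    """Return how many individuals each node evaluates, as evenly as possible."""
    base, extra = divmod(population_size, node_count)
    # The first `extra` nodes take one more individual than the rest.
    return [base + 1 if i < extra else base for i in range(node_count)]


def is_balanced(population_size: int, node_count: int = 48) -> bool:
    """True when every node receives exactly the same number of individuals."""
    return population_size % node_count == 0


if __name__ == "__main__":
    for size in (48, 96, 100, 192):
        shares = split_population(size)
        print(size, min(shares), max(shares), is_balanced(size))
```

With a population that is not a multiple of the node count, some nodes receive one more individual than others, so the generation cannot be ranked until those longer jobs finish while the remaining nodes sit idle.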

The time taken for NGSPICE to process a netlist on a single node can be broken into two parts: firstly, the relatively static time taken to load the software, read the netlist and calculate the initial conditions; secondly, the time taken to perform the transient analysis and write the corresponding output data to the file system. The first of these typically takes around half a second on the systems used (footnote 6); a precise figure is hard to determine because of the varying loads on the file system. The time taken to perform the transient analysis is roughly linear in the number of steps to be calculated, which is determined by the combination of the sample rate, the clock speed and the length of the transients.
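The two-part timing model can be written down as a short sketch. The 2 GHz clock and the samples-per-cycle values come from the benchmark description; the figure of 21 clock cycles is inferred from 1050 data points at 50 samples per cycle, and the per-point cost is a hypothetical parameter that would have to be measured, so no value is assumed for it here.

```python
CLOCK_HZ = 2e9   # 2 GHz clock frequency used in the benchmarks
CYCLES = 21      # inferred: 1050 data points / 50 samples per clock cycle


def sample_interval_ps(samples_per_cycle: int, clock_hz: float = CLOCK_HZ) -> float:
    """Time between transient-analysis samples, in picoseconds."""
    period_ps = 1e12 / clock_hz
    return period_ps / samples_per_cycle


def data_points(samples_per_cycle: int, cycles: int = CYCLES) -> int:
    """Number of transient-analysis steps for the whole simulation."""
    return samples_per_cycle * cycles


def estimated_runtime_s(points: int, seconds_per_point: float,
                        startup_s: float = 0.5) -> float:
    """Fixed start-up cost plus a cost that grows linearly with the step count.

    `seconds_per_point` is a placeholder to be fitted from measurements.
    """
    return startup_s + seconds_per_point * points


if __name__ == "__main__":
    for spc in (50, 100, 200):
        print(spc, sample_interval_ps(spc), data_points(spc))
```

Under this model the fixed start-up cost matters most for short transient analyses, which is one reason very small per-node workloads are inefficient.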

A set of performance benchmarks was carried out to compare the performance of running RandomSPICE, NGSPICE and the data handling on a local workstation with that of running them on Maven. For these tests, two different test circuits were optimised: first a four-transistor NAND gate and secondly a nine-transistor XOR gate from the VSCLIB standard-cell library (described in Section 5.2.2). For each individual, RandomSPICE creates 10 netlists which are evaluated in NGSPICE. The tests were carried out at three different sample rates for the SPICE transient analysis: firstly taking 50 samples per clock cycle (a 10 ps sample rate based on the 2 GHz clock frequency, for a total of 1050 data points), then 100 samples per clock cycle (5 ps sample rate, 2100 data points) and finally 200 samples per clock cycle (2.5 ps sample rate, 4200 data points). The tests were carried out using single, dual and quad SPICE threads on the local workstation (which has a quad-core Intel Q6600 CPU) and then using all 48 logical nodes on the MAVEN cluster, with overall population sizes of 48, 96 and 192 (i.e. one, two and four RandomSPICE runs per node respectively). On both systems other processes and jobs were minimised to ensure maximum CPU availability, and the tests were repeated three times with an average result taken. Each algorithm was terminated after 10 generations, with the average time to evaluate one individual calculated by dividing the total time the algorithm was running by the number of individuals evaluated.
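The averaging procedure described above can be sketched as follows. The sketch assumes that the number of individuals evaluated equals the population size multiplied by the number of generations, which the text does not state explicitly, and the run times in the usage example are hypothetical values chosen only to show the arithmetic.

```python
from statistics import mean

GENERATIONS = 10  # each run was terminated after 10 generations


def time_per_individual(total_seconds: float, population_size: int,
                        generations: int = GENERATIONS) -> float:
    """Average wall-clock time to evaluate one individual in a single run."""
    # Assumption: individuals evaluated = population size x generations.
    individuals = population_size * generations
    return total_seconds / individuals


def evaluations_per_minute(total_seconds: float, population_size: int,
                           generations: int = GENERATIONS) -> float:
    """Throughput figure of the kind plotted in the benchmark results."""
    return 60.0 / time_per_individual(total_seconds, population_size, generations)


def averaged_time(run_times: list[float], population_size: int) -> float:
    """Mean time per individual over repeated runs of the same benchmark."""
    return mean(time_per_individual(t, population_size) for t in run_times)


if __name__ == "__main__":
    hypothetical_runs = [600.0, 630.0, 570.0]  # illustrative, not measured data
    print(averaged_time(hypothetical_runs, population_size=48))
```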

The results for the benchmarks are shown in Figure 4.7, with the relative performance increases above single-threaded operation on an overall and per-core basis illustrated in Figure 4.8. Across the runs operating just on the workstation, a speed-up of between 1.46× and 1.82× (µ = 1.60×, σ = 0.15) was observed when moving from single-threaded to dual-threaded operation. When moving to four threads, the speed-up measured was between 2.72× and 3.50× (µ = 3.17×, σ = 0.37) relative to the single-thread performance, demonstrating that the evolutionary loop scales well through parallel threading of the RandomSPICE and SPICE operations. The largest improvement in performance was actually found with the smaller circuits and lower sample rates; it is suggested that this is likely due to hard disc access being the bottleneck, with smaller reads and writes to the disc needed with smaller circuits.
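A minimal sketch of how range, mean and standard deviation figures of this kind can be derived from per-benchmark throughput is given below. The rates in the usage example are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether σ is a sample or a population statistic.

```python
from statistics import mean, stdev


def speedups(baseline_rates: list[float], parallel_rates: list[float]) -> list[float]:
    """Per-benchmark speed-up of a parallel configuration over the baseline."""
    return [p / b for b, p in zip(baseline_rates, parallel_rates)]


def summarise(values: list[float]) -> tuple[float, float, float, float]:
    """Return (minimum, maximum, mean, sample standard deviation)."""
    return min(values), max(values), mean(values), stdev(values)


if __name__ == "__main__":
    # Hypothetical evaluations-per-minute figures, one per benchmark case.
    single_thread = [4.0, 3.0, 2.0]
    dual_thread = [7.0, 4.8, 3.0]
    low, high, mu, sigma = summarise(speedups(single_thread, dual_thread))
    print(f"{low:.2f}x to {high:.2f}x (mean {mu:.2f}x, sd {sigma:.2f})")
```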

Footnote 6: This figure is roughly the same on both the 2.66 GHz Intel Q6600-based local machines and the nodes in the MAVEN cluster.

[Figure 4.7 plot: "Benchmark performance of Motivated". Axes: evaluations per minute (0 to 100) against circuit and sample rate (NAND and EXOR at 10 ps, 5 ps and 2.5 ps). Series: workstation with 1, 2 and 4 threads; cluster with population sizes 48, 96 and 192.]

Figure 4.7: Benchmarks of the performance of MOTIVATED at evaluating different circuits on a single workstation and on the MAVEN cluster

When moving to the Maven cluster, the speed-up over the single-threaded workstation ranges from 15.8× to 21.0× (µ = 18.35×, σ = 1.95) when a population size of 48 is used (equivalent to one RandomSPICE run per compute node), rising to between 18.3× and 24.7× (µ = 20.92×, σ = 2.36) when a population of 96 is used. A much smaller increase in performance is observed when the population size is doubled again to 192, with a speed-up of 19.2× to 25.0× (µ = 21.93×, σ = 2.52). When the cluster computation is compared to the four-threaded single-workstation performance, a speed-up of between 6.82× and 7.77× (µ = 7.27×, σ = 0.31) is observed, with a greater improvement found when using the smaller NAND circuit than the EXOR circuit, and also when the lower sample rates are used. This again points towards a data-storage (or network transport) bottleneck, which is probably a side-effect of all the worker nodes sharing the fileserver resources.
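The per-CPU view of these results can be approximated from the reported mean speed-ups. The sketch below assumes that per-CPU performance is the overall speed-up divided by the number of threads or logical nodes in use, expressed as a percentage of single-threaded performance; this is one plausible reading of "weighted so the single-threaded performance = 1", not a confirmed description of how the thesis figures were computed.

```python
# Mean speed-ups reported in the text, paired with the number of CPUs in use.
REPORTED_MEAN_SPEEDUP = {
    "Workstation, 2 threads": (1.60, 2),
    "Workstation, 4 threads": (3.17, 4),
    "Cluster, population 48": (18.35, 48),
    "Cluster, population 96": (20.92, 48),
    "Cluster, population 192": (21.93, 48),
}


def per_cpu_efficiency(speedup: float, cpus: int) -> float:
    """Speed-up per CPU as a percentage of single-threaded performance."""
    return 100.0 * speedup / cpus


if __name__ == "__main__":
    for label, (speedup, cpus) in REPORTED_MEAN_SPEEDUP.items():
        print(f"{label}: {per_cpu_efficiency(speedup, cpus):.1f}% per CPU")
```

Under this assumption the workstation retains roughly 80% per core, while the cluster delivers between roughly 38% and 46% per logical node, which is consistent with the shared-fileserver bottleneck suggested above.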

[Figure 4.8 plot: "Relative performance of Motivated". Axes: relative overall performance (0 to 25) and relative performance per CPU (0 to 100%). Categories: workstation with 1, 2 and 4 threads; cluster with population sizes 48, 96 and 192. Series: overall performance and performance per CPU.]

Figure 4.8: The relative performance improvement when MOTIVATED is run both in multi-threaded, single-workstation mode and on the MAVEN cluster, across all the benchmark runs, weighted so the single-threaded performance = 1.

This chapter has discussed the algorithm that will be used for optimising the transistor dimensions within logic circuits, and the set of tools that allow the algorithm to be used in conjunction with RandomSPICE and NGSPICE, both locally on a workstation and on an HPC cluster based on Sun Grid Engine. The following chapter details the experiments that have been carried out to attempt to optimise logic circuits using the system.

Optimising Standard Cell Libraries

This chapter describes the experiments that have been carried out with the aim of optimising standard cell libraries, and the results that have been achieved. The results are given in approximate chronological order of when the experiments were carried out. It is important to point out that the system used to evolve circuits, and the model libraries used, have themselves evolved over the period of the experiments, guided by observations made from the results and performance, and also by the progress made by other nano-CMOS project partners, notably the DMG responsible for creating the transistor models.

5.1 Early Results: Evolving Transistor Topologies using the SGA

The first experiments that were carried out using the SGA system actually attempted to evolve circuit topologies, as opposed to just optimising transistor dimensions within known designs. The results from these experiments, and from similar parallel experiments into evolving topologies using an early CGP-based implementation, guided many of the decisions made in later revisions of the MOTIVATED system, described in the previous chapter, and so are included here.

In document Evolving Variability Tolerant Logic (Page 136-140)