Parallel Gate Level Simulation - Multi-level simulation of nano-electronic digital circuits on GPUs

Solving linear equation systems with sparse matrices is an essential part of SPICE simulation, and many optimizations have been proposed for GPU-accelerated solvers that utilize efficient LU factorization. In [CRWY15] and [HTWS16] the authors show, for example, that right-looking approaches can exploit parallelism from vector columns, submatrices and vector operations simultaneously. The reported speedups reached up to two orders of magnitude compared to conventional solvers. A recent work [vSAH18] that accelerates SPICE can consider parameter variations in the transistor models for modeling aging effects. The simulation is based on CUSPICE, a GPU-accelerated version of ngspice [CUS19]. It achieved speedups of up to 218× on a recent GPU architecture and was able to simulate designs with more than 200,000 transistors (128-bit multiplier) in 18.5 hours.
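
To illustrate where the parallelism in a right-looking factorization comes from, the following is a minimal dense sketch in Python. It is not the sparse GPU solver of [CRWY15] or [HTWS16]; the function name `lu_right_looking` is hypothetical, and the sketch assumes that all pivots are nonzero, so no pivoting is performed.

```python
def lu_right_looking(a):
    """Return (L, U) with A = L * U for a square matrix given as a list of rows.

    Assumption of this sketch: all pivots are nonzero, so no pivoting is done.
    """
    n = len(a)
    m = [row[:] for row in a]  # work on a copy of the input
    for k in range(n):
        pivot = m[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        # Scale the column below the pivot; the entries are independent.
        for i in range(k + 1, n):
            m[i][k] /= pivot
        # Update the trailing submatrix; every (i, j) pair is independent
        # and could be handled by its own GPU thread.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                m[i][j] -= m[i][k] * m[k][j]
    lower = [[1.0 if i == j else (m[i][j] if j < i else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[m[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper


lower, upper = lu_right_looking([[4.0, 3.0], [6.0, 3.0]])
```

After each pivot step, both the column scaling and the trailing submatrix update consist of mutually independent operations, which is the property a parallel solver can exploit.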

Note that even with the speedups achieved, GPU-accelerated SPICE implementations still exhibit relatively high runtimes for small to medium-sized problems. Current works either focus on speeding up single simulation instances [GCKS09, vSAH18] or parallelize over circuit instances [HF16] with no scalability in terms of design size. This is usually related to limitations in the GPU memory or to increased communication and synchronization overhead, which limit the problem size and computational throughput, and hence the applicability to larger simulation problems.

4.3 Parallel Gate Level Simulation

In order to cope with larger designs with millions of cells, and with even more faults and test patterns, simulators must use higher-level modeling and simulation approaches that exploit massive parallelism from many dimensions simultaneously and provide high throughput when solving many simulation problems. While GPU-accelerated simulators for RT level [QD11, BFG12] and system level [NPJS10] exist, they are not suitable for timing-accurate simulation, since they provide neither the capabilities to model circuit structure nor the accuracy to reflect timing. Circuit functionality and timing are usually evaluated at (logic) gate level, and various research has been conducted on parallelizing gate level simulation on GPUs.

4.3.1 Logic Simulation

[GK08] presented a first accelerated logic simulation in two-valued Boolean logic on GPUs, which is also utilized for stuck-at fault simulation. It implements a forward simulation approach that uses compact look-up tables (LUTs) to compute gate functions; the LUTs are stored in the cached read-only constant texture memory on the GPU. The simulation is performed in a levelized manner by calling an evaluation kernel for each level. Only the circuit data of the current level is required, which avoids the need to store the entire netlist in the (limited) GPU memory. A compact encoding allows each thread to compute two gates simultaneously. The underlying algorithm exploits structural parallelism through concurrent threads for the gates on a level, and data parallelism through the parallel simulation of patterns (pattern parallelism) for each gate. In this way, a speedup of over 238× on average was achieved over a commercial solution.
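
The combination of levelized evaluation, LUT-based gate functions and pattern parallelism can be sketched as follows. This is a sequential Python illustration under simplifying assumptions, not the implementation of [GK08]: all gates have two inputs, the 4-bit LUT encoding and the names `LUT` and `simulate_levelized` are hypothetical, and the two inner loops stand in for what would be parallel GPU threads.

```python
# Truth tables as 4-bit look-up tables: bit (2*a + b) holds the output
# for the input combination (a, b). The encoding is an assumption.
LUT = {"AND": 0b1000, "OR": 0b1110, "XOR": 0b0110, "NAND": 0b0111, "NOR": 0b0001}


def simulate_levelized(levels, input_patterns):
    """Simulate a levelized netlist for several patterns.

    levels: list of levels; each level is a list of (out, gate_type, in_a, in_b).
    input_patterns: dict mapping each primary input to a list of 0/1 values,
    one value per pattern.
    """
    values = {sig: list(bits) for sig, bits in input_patterns.items()}
    num_patterns = len(next(iter(values.values())))
    for level in levels:  # corresponds to one kernel call per level
        for out, gate_type, in_a, in_b in level:  # structural parallelism
            lut = LUT[gate_type]
            values[out] = [
                (lut >> (2 * values[in_a][p] + values[in_b][p])) & 1
                for p in range(num_patterns)  # pattern parallelism
            ]
    return values


levels = [
    [("s", "XOR", "a", "b"), ("c", "AND", "a", "b")],  # level 1
    [("y", "OR", "s", "c")],                           # level 2
]
result = simulate_levelized(levels, {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
```

Gates on the same level read only values produced by earlier levels, so every (gate, pattern) pair within a level can be evaluated independently.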

The evaluation of gates on levels in a levelized circuit by parallel threads is a common method to exploit structural parallelism in circuits. This has been adopted in many other publications [CDB09b, GK09, SRG+11] and in this thesis as well.

In [CDB09b] a parallel cycle-based logic level simulator for GPUs was presented in which the netlist is partitioned into clusters, each of which computes the gates in the cone of influence of a circuit output. Each cluster is processed by an individual thread block, where the threads of a block concurrently process the gates of the cluster in levelized order. After each level, the threads of the block are synchronized. However, thread blocks can work independently of each other. Truth tables of gates as well as intermediate signal values are kept in the local shared memory, while inputs and outputs of the clusters are stored in the device memory. Independent evaluation of clusters by different thread blocks is achieved through duplication of the netlist gates that reside in the cone of influence of multiple outputs. Experimental results demonstrated a speedup of 14.4× over the sequential simulation of the algorithm. A similar approach using a clustering-based method was proposed in [SABM10].
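
The clustering by cones of influence, and the gate duplication it implies, can be sketched as follows. This is an illustrative Python sketch only; the netlist representation (a `fanin` dictionary) and the function names are assumptions and are not taken from [CDB09b].

```python
def cone_of_influence(fanin, output):
    """Return the set of gates that can be reached backwards from `output`.

    fanin: dict mapping each gate to the list of signals driving it.
    Primary inputs do not appear as keys and are therefore not collected.
    """
    cone, stack = set(), [output]
    while stack:
        sig = stack.pop()
        if sig in fanin and sig not in cone:
            cone.add(sig)
            stack.extend(fanin[sig])
    return cone


def cluster_by_outputs(fanin, outputs):
    """Build one cluster per output and count the duplicated gate instances."""
    clusters = {out: cone_of_influence(fanin, out) for out in outputs}
    total_instances = sum(len(cone) for cone in clusters.values())
    distinct_gates = len(set().union(*clusters.values()))
    return clusters, total_instances - distinct_gates


# Gate g1 feeds both outputs g2 and g3, so it lies in both cones.
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g1", "d"]}
clusters, duplicated = cluster_by_outputs(fanin, ["g2", "g3"])
```

In this small example the shared gate `g1` appears in both clusters, which makes the clusters independent at the cost of one duplicated gate instance.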

The first GPU-accelerated event-driven logic level simulator was presented in [CDB09a]. It uses a more fine-grained partitioning of the netlist into so-called macro-gates, each of which corresponds to a set of connected gates in the original netlist. A macro-gate is designed to be evaluated on a single streaming multiprocessor on the GPU, and a sensitivity list keeps track of any value changes at its inputs. If the sensitivity list of a macro-gate contains an event, the macro-gate is scheduled for execution on a multiprocessor, with all threads processing the gates level by level. Speedups of 13× over a commercial approach were reported, although duplication of gates is required for the independent evaluation of macro-gates.
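
The scheduling principle can be sketched as follows: a macro-gate is evaluated only if one of its inputs has changed. This is a sequential Python illustration under simplifying assumptions, not the simulator of [CDB09a]. Each macro-gate is abstracted as a single function with one output (hiding the level-by-level processing of its internal gates), and all names are hypothetical.

```python
def simulate_event_driven(macro_gates, order, values, new_inputs):
    """Apply new primary-input values and evaluate only the affected macro-gates.

    macro_gates: dict name -> (input signals, output signal, function).
    order: macro-gate names in topological order.
    values: dict signal -> current value; updated in place.
    new_inputs: dict primary input -> value for the current cycle.
    Returns the list of macro-gates that were actually evaluated.
    """
    changed = {s for s, v in new_inputs.items() if values.get(s) != v}
    values.update(new_inputs)
    evaluated = []
    for name in order:
        inputs, output, func = macro_gates[name]
        if changed.isdisjoint(inputs):
            continue  # no event on the inputs: skip this macro-gate
        evaluated.append(name)
        new_value = func(*(values[s] for s in inputs))
        if values.get(output) != new_value:
            values[output] = new_value
            changed.add(output)  # propagate the event to the fanout
    return evaluated


macro_gates = {
    "m1": (("a", "b"), "x", lambda a, b: a & b),
    "m2": (("c", "d"), "y", lambda c, d: c | d),
    "m3": (("x", "y"), "z", lambda x, y: x ^ y),
}
order = ["m1", "m2", "m3"]
values = {"a": 0, "b": 1, "c": 0, "d": 0, "x": 0, "y": 0, "z": 0}
first = simulate_event_driven(macro_gates, order, values, {"a": 1})
second = simulate_event_driven(macro_gates, order, values, {"a": 1})
```

Only `m1` and `m3` are evaluated in the first call, since `m2` sees no input event; the second call applies unchanged inputs and evaluates nothing.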

[SRG+11] proposes a GPU-accelerated logic level simulation approach with a pipelined evaluation of the circuit, in which alternate memory locations are used for accessing different patterns. When the gates of a circuit level are simulated, the corresponding threads access the patterns in alternate order for consecutive processing, which hides write latencies when storing the output information to the GPU memory before proceeding with the next level. To avoid the loss of intermediate signal values during simulation, additional gates have to be introduced as placeholders that maintain the signal values, which causes a high overhead in gates during evaluation. The authors reported speedups of roughly 10× compared to a serial execution.

The structural independence of gates is an important factor for the efficient parallelization of the simulation of a netlist. In order to achieve it, many algorithms rely on the duplication of structures, which often introduces a large overhead. While the data structures for the gates are compact and optimized for fast access and execution, the aforementioned algorithms only consider the functional behavior.

4.3.2 Timing-aware Simulation

The consideration of time in circuit simulation requires the modeling of the temporal behavior of gates and signals. Depending on the timing model and abstraction, the timing information requires a large amount of data to be stored and processed on the GPU. [WLHW14] presented a parallelized static timing analysis (STA) to compute the worst-case delay of a circuit on GPUs. The algorithm considers slopes of signals and computes propagation delays using a two-dimensional interpolation over look-up tables (LUTs) with respect to input slope and output capacitance. Gates are processed in parallel for each level, and each streaming multiprocessor on the GPU processes only gates of the same type in a data-parallel fashion. In this way, the corresponding LUTs need to be fetched only once and are cached for more efficient access. The reported speedup of the STA is 12.85× over a CPU-based implementation and three orders of magnitude over a commercial solution. Other GPU-accelerated STA simulators have been proposed in [DM08] and [Den10] as well. While the first one is simply based on a maximum operation, the latter approach is based on sparse matrix-vector products (SMVP), showing speedups of 50×. Yet, STA simulators usually reveal only worst-case timing information; they neither take switching activity from hazards and glitches into account nor identify false paths [MMGA19]. The authors of [GK09] presented the first GPU-accelerated Monte-Carlo-based statistical static timing analysis (SSTA) to estimate the delay deviations and yield of a design. It exploits structural parallelism from data-independent gates on each level in the circuit, as well as data parallelism from the simulation of Monte-Carlo instances in parallel. For this, parallel pseudo-random number generators (PPRNG) are implemented, such that each thread can generate individual samples to compute the propagation delays of a gate. While Monte-Carlo-based SSTA usually is a compute-intensive task, the presented approach showed a significant boost in speed with an average speedup of 260× on a single GPU. However, similar to STA, SSTA only provides probabilistic worst-case information.
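
The two-dimensional LUT interpolation used for propagation delays in STA can be sketched as a bilinear interpolation over input slope and output capacitance. The following Python sketch is an illustration only and is not taken from [WLHW14]; the function name, the table layout and the example values are assumptions, and units are left unspecified.

```python
from bisect import bisect_right


def interpolate_delay(slopes, caps, table, slope, cap):
    """Bilinearly interpolate a gate delay from a two-dimensional LUT.

    slopes, caps: ascending axis values of the table (at least two each).
    table[i][j]: delay at input slope slopes[i] and output capacitance caps[j].
    Query points outside the table are extrapolated linearly from the
    nearest table cell (an assumption of this sketch).
    """
    def bracket(axis, x):
        # Lower index of the table cell that contains x, clamped to the table.
        i = bisect_right(axis, x) - 1
        return max(0, min(i, len(axis) - 2))

    i, j = bracket(slopes, slope), bracket(caps, cap)
    ts = (slope - slopes[i]) / (slopes[i + 1] - slopes[i])
    tc = (cap - caps[j]) / (caps[j + 1] - caps[j])
    # Interpolate along the capacitance axis first, then along the slope axis.
    d0 = table[i][j] * (1 - tc) + table[i][j + 1] * tc
    d1 = table[i + 1][j] * (1 - tc) + table[i + 1][j + 1] * tc
    return d0 * (1 - ts) + d1 * ts


slopes = [0.0, 1.0]
caps = [0.0, 2.0]
table = [[1.0, 3.0], [2.0, 4.0]]
center = interpolate_delay(slopes, caps, table, 0.5, 1.0)
corner = interpolate_delay(slopes, caps, table, 1.0, 2.0)
```

Since all gates of one type share the same table, threads that evaluate gates of the same type read the same LUT entries, which is what makes fetching the table once and caching it effective.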

In [ZWD11, WZD10], an event-driven parallel logic level time simulation on GPUs was presented, based on a parallel implementation of a message passing algorithm [CM79, Bry77]. In general, the simulator performs three steps to propagate events through the circuit, each of which is handled by GPU kernels. First, the event queues of signals are passed to the input pins of gates, where the events are then processed in temporal order and stored in the respective output event queues of the gates. Individual threads are responsible for fetching the input event queues and assigning them to FIFOs at each gate input pin. Once assigned, the evaluation kernel processes the input events of each gate independently in temporal order, using individual threads, and writes the output signal to the gate output event queues. A memory paging mechanism is applied to manage the event queues in pages on the GPU that are dynamically swapped during simulation. Although complex dynamic memory management usually reduces the effectiveness of parallelization [BBC94, SG91], the achieved speedups were reported to be 47.4× on average. Yet, similar to other event-driven approaches, the algorithm only accelerates the simulation of single circuit instances and does not benefit from the simulation of patterns in parallel.
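
The evaluation step, in which the input events of a gate are processed in temporal order and written to its output event queue, can be sketched as follows. This is a simplified Python illustration and not the implementation of [ZWD11, WZD10]: it assumes a single transport delay per gate, represents each event queue as an initial value plus a time-sorted event list, and omits the memory paging. All names are hypothetical.

```python
from itertools import groupby


def evaluate_gate(func, delay, input_waves):
    """Process the input events of one gate in temporal order.

    func: Boolean gate function taking one argument per input pin.
    delay: propagation delay added to every output event (transport delay).
    input_waves: per input pin, a pair (initial_value, [(time, value), ...]).
    Returns (initial_output_value, output_events) in the same format.
    """
    state = [init for init, _ in input_waves]
    out_init = func(*state)
    # Merge the event queues of all input pins into one time-ordered list.
    merged = sorted(
        (t, pin, v)
        for pin, (_, events) in enumerate(input_waves)
        for t, v in events
    )
    out_events, last = [], out_init
    for t, group in groupby(merged, key=lambda e: e[0]):
        for _, pin, v in group:  # apply all events that occur at time t
            state[pin] = v
        new = func(*state)
        if new != last:  # emit an output event only on a value change
            out_events.append((t + delay, new))
            last = new
    return out_init, out_events


wave_a = (0, [(1, 1), (5, 0)])
wave_b = (1, [(3, 0)])
out_wave = evaluate_gate(lambda a, b: a & b, 2, [wave_a, wave_b])
```

For the AND gate with delay 2, the output rises at time 3 and falls at time 5; the event on `a` at time 5 causes no further output change. Each gate only needs its own input queues, so different gates can be evaluated by independent threads.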
