5.5 Hybrid Hardware-Software Priority Queue (HybridPQ)

In a single-weighted graph problem, the worst-case size of the priority queue is known. For example, the worst-case priority queue size for Dijkstra’s algorithm is |V|, the total number of vertices, so the queue size is deterministic. Multi-weighted graph problems, on the other hand, are NP problems: their algorithms are non-polynomial in complexity, and the worst-case priority queue size is non-deterministic. Hardware development, however, requires the exact priority queue size to be fixed in advance, because the available logic resources are limited. In other words, once the hardware priority queue size is predetermined and fixed, overflow can occur.
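The overflow condition can be illustrated with a minimal C sketch of a fixed-capacity queue. This is only a software analogy, not the hardware design: the type `fixed_pq`, the function `fixed_pq_insert` and the capacity `FIXED_PQ_CAPACITY` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity k; in hardware this would be fixed at design time. */
#define FIXED_PQ_CAPACITY 4

typedef struct {
    int keys[FIXED_PQ_CAPACITY]; /* kept sorted, smallest key at index 0 */
    size_t count;                /* number of occupied slots */
} fixed_pq;

/* Inserts key in sorted position.
 * Returns false when the queue is already full, i.e. on overflow. */
bool fixed_pq_insert(fixed_pq *q, int key)
{
    if (q->count == FIXED_PQ_CAPACITY)
        return false; /* no room left: the entry cannot be stored */

    size_t i = q->count;
    while (i > 0 && q->keys[i - 1] > key) {
        q->keys[i] = q->keys[i - 1]; /* shift larger keys one slot up */
        i--;
    }
    q->keys[i] = key;
    q->count++;
    return true;
}
```

With a capacity of four, the fifth call to `fixed_pq_insert` returns false, and the caller must either drop the entry or store it elsewhere.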

In software, a priority queue may be implemented with a number of advanced data structures, e.g. doubly linked lists and other pointer-based data structures. These structures allow the priority queue to grow or shrink throughout the computation. During INSERT, the processor can extend the priority queue through standard memory allocation. During EXTRACT, the processor can release the vacated memory location. In a software implementation, the size of the priority queue is thus ‘self-adaptive’; it is constrained only by the available memory.
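A minimal C sketch of this self-adaptive behaviour follows. For brevity it uses a singly linked sorted list rather than a doubly linked list, and the names `list_pq`, `list_pq_insert` and `list_pq_extract` are hypothetical. INSERT allocates one node with `malloc`, and EXTRACT releases it with `free`.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node {
    int key;
    struct node *next;
} node;

/* Sorted singly linked list; the smallest key is at the head. */
typedef struct {
    node *head;
} list_pq;

/* INSERT: grows the queue by one dynamically allocated node.
 * Returns false only if the allocation fails. */
bool list_pq_insert(list_pq *q, int key)
{
    node *n = malloc(sizeof *n);
    if (n == NULL)
        return false;
    n->key = key;

    /* Walk to the first node with a larger key and link in before it. */
    node **link = &q->head;
    while (*link != NULL && (*link)->key <= key)
        link = &(*link)->next;
    n->next = *link;
    *link = n;
    return true;
}

/* EXTRACT: removes the smallest key and releases its memory.
 * Returns false if the queue is empty. */
bool list_pq_extract(list_pq *q, int *out)
{
    node *n = q->head;
    if (n == NULL)
        return false;
    *out = n->key;
    q->head = n->next;
    free(n);
    return true;
}
```

The sorted list keeps the example short; its INSERT is linear in the queue length, so it illustrates only the memory behaviour, not the run-time of a practical software priority queue.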

Hence we propose the Hybrid Hardware-Software Priority Queue (HybridPQ), which combines the high-speed but fixed-size (hardware) Priority Queue Accelerator Module with a self-adaptive but reasonably fast software priority queue. The Fibonacci-Heap priority queue (FHPQ) is chosen as the target software priority queue implementation because it is asymptotically the fastest software priority queue, with O(1) amortized run-time complexity for INSERT and O(lg n) amortized run-time complexity for EXTRACT. The HybridPQ is implemented as a software layer abstracted above the hwPQ device driver and the FHPQ software routines, as illustrated in Figure 5.14.

[Block diagram. Labels: User Application; User API; HybridPQ; FHPQ; hwPQ Device Driver; HAL API; NIOS II Processor; Avalon System Bus; Graph Processing Unit (GPU); Priority Queue Accelerator Module; hwPQ Avalon Interface Unit; Hardware Priority Queue Unit (hwPQ); Nano-scale VLSI Routing Module (SRABI)]

Figure 5.14: Software Abstraction Layer of HybridPQ

A specific control mechanism is incorporated into HybridPQ so that, at the top level of abstraction, HybridPQ still supports the two basic priority queue operations, INSERT and EXTRACT (see Figure 5.15). The figure also shows the underlying functions of the hardware and software priority queues, which the control mechanism of HybridPQ uses.

[Block diagram. Labels: HybridPQ (HybridPQ_insert, HybridPQ_extract); HybridPQ Control Mechanism (C routine executed by NIOS II); Hardware Priority Queue Unit (hwPQ) in the Priority Queue Accelerator Module (hwPQ_insert, hwPQ_extract, hwPQ_peek, hwPQ_delete); FHPQ Software Priority Queue, resides on RAM (FHPQ_insert, FHPQ_extract, FHPQ_peek, FHPQ_delete)]

Figure 5.15: Functional Block Diagram of HybridPQ

The control mechanism is simple. The Hardware Priority Queue Unit (hwPQ) has a fixed size and accommodates up to k queue entries, whereas the software priority queue, FHPQ, has no fixed size limit. For an INSERT operation, if the hwPQ is not full, the new element is inserted into hwPQ; otherwise it is inserted into FHPQ (see Figure 5.16). Both hwPQ and FHPQ have O(1) run-time complexity for insertion. Hence HybridPQ maintains O(1) run-time complexity for INSERT, although inserting a new entry into FHPQ carries a larger constant run-time overhead than inserting into hwPQ.

For an EXTRACT operation, the top-priority elements of hwPQ and FHPQ are compared, and only the higher-priority of the two is returned (see Figure 5.17). Thus, if the extracted element originates from hwPQ, the run-time is still O(1); if it originates from FHPQ, the run-time is O(lg n).

[Flowchart: HybridPQ_insert → my_HWPQ full? → NO: my_HWPQ_insert; YES: FHPQ_insert → END]

Figure 5.16: INSERT control mechanism in HybridPQ

[Flowchart: HybridPQ_extract → my_HWPQ_peek, FHPQ_peek → Highest priority entry from my_HWPQ? → YES: my_HWPQ_delete, then from my_HWPQ return the highest priority entry; NO: FHPQ_delete, then from FHPQ return the highest priority entry → END]

Figure 5.17: EXTRACT control mechanism in HybridPQ

hybridPQ_reset(Q)
{
    my_HWPQ_reset(Q);
    FHPQ_create_heap(Q);
    queueCount = 0;
}

hybridPQ_insert(Q, x)
{
    if (queueCount < length-of-my_HWPQ) then
        my_HWPQ_insert(Q, x);
        queueCount++;
    else
        FHPQ_insert(Q, x);
    end if;
}

hybridPQ_extract(Q)
{
    variable var_my_HWPQ_min ← my_HWPQ_peek(Q);
    variable var_FHPQ_min ← FHPQ_peek(Q);

    if (var_my_HWPQ_min < var_FHPQ_min) then
        my_HWPQ_delete(Q);
        queueCount--;
        return (var_my_HWPQ_min);
    else
        FHPQ_delete(Q);
        return (var_FHPQ_min);
    end if;
}

Figure 5.18: Functions provided in HybridPQ
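The control mechanism can also be sketched in plain C so that it can be compiled and tried on a host machine. This is a sketch under stated assumptions, not the NIOS II implementation: the hwPQ is modelled in software as a small sorted array, a growable binary min-heap stands in for the Fibonacci heap (so its insert is O(lg n) rather than O(1)), and every name (`hybrid_pq`, `hw_model`, `sw_heap`, `HWPQ_LENGTH`) is hypothetical. The sketch also handles the empty-queue cases and breaks ties in favour of the hwPQ model, neither of which the pseudo-code specifies.

```c
#include <stdbool.h>
#include <stdlib.h>

#define HWPQ_LENGTH 4 /* hypothetical k, the fixed hwPQ size */

/* Software model of the fixed-size hwPQ: sorted array, minimum at index 0. */
typedef struct { int keys[HWPQ_LENGTH]; size_t count; } hw_model;

/* Stand-in for FHPQ: growable binary min-heap (NOT a Fibonacci heap). */
typedef struct { int *keys; size_t count, cap; } sw_heap;

typedef struct { hw_model hw; sw_heap sw; } hybrid_pq;

void hybrid_pq_init(hybrid_pq *q)
{
    q->hw.count = 0; /* plays the role of queueCount */
    q->sw.keys = NULL;
    q->sw.count = 0;
    q->sw.cap = 0;
}

void hybrid_pq_free(hybrid_pq *q) { free(q->sw.keys); }

static void hw_insert(hw_model *h, int key)
{
    size_t i = h->count++;
    while (i > 0 && h->keys[i - 1] > key) { h->keys[i] = h->keys[i - 1]; i--; }
    h->keys[i] = key;
}

static void hw_delete(hw_model *h)
{
    for (size_t i = 1; i < h->count; i++) h->keys[i - 1] = h->keys[i];
    h->count--;
}

static bool sw_insert(sw_heap *s, int key)
{
    if (s->count == s->cap) { /* grow through standard allocation */
        size_t cap = s->cap ? 2 * s->cap : 8;
        int *p = realloc(s->keys, cap * sizeof *p);
        if (p == NULL) return false;
        s->keys = p;
        s->cap = cap;
    }
    size_t i = s->count++;
    while (i > 0 && s->keys[(i - 1) / 2] > key) { /* sift up */
        s->keys[i] = s->keys[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    s->keys[i] = key;
    return true;
}

static void sw_delete(sw_heap *s)
{
    int last = s->keys[--s->count];
    size_t i = 0;
    for (;;) { /* sift the last key down from the root */
        size_t c = 2 * i + 1;
        if (c >= s->count) break;
        if (c + 1 < s->count && s->keys[c + 1] < s->keys[c]) c++;
        if (s->keys[c] >= last) break;
        s->keys[i] = s->keys[c];
        i = c;
    }
    s->keys[i] = last;
}

/* INSERT: fill the hwPQ model first, spill into the software heap when full. */
bool hybrid_pq_insert(hybrid_pq *q, int key)
{
    if (q->hw.count < HWPQ_LENGTH) { hw_insert(&q->hw, key); return true; }
    return sw_insert(&q->sw, key);
}

/* EXTRACT: compare both minima and remove the smaller one.
 * Returns false when both queues are empty. */
bool hybrid_pq_extract(hybrid_pq *q, int *out)
{
    bool hw_has = q->hw.count > 0, sw_has = q->sw.count > 0;
    if (!hw_has && !sw_has) return false;
    if (hw_has && (!sw_has || q->hw.keys[0] <= q->sw.keys[0])) {
        *out = q->hw.keys[0];
        hw_delete(&q->hw);
    } else {
        *out = q->sw.keys[0];
        sw_delete(&q->sw);
    }
    return true;
}
```

For example, inserting 9, 7, 5, 3 fills the hwPQ model, so the later entries 1 and 8 go to the software heap. Successive extractions still return the keys in ascending order, because each EXTRACT compares the two minima.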

The HybridPQ INSERT operation gives first priority to hwPQ. This ensures that queue entries fill hwPQ first, making full use of the high-speed but fixed-size hwPQ. Moreover, directing most entries into hwPQ increases the likelihood that a HybridPQ EXTRACT operation takes its entry from hwPQ, and thus runs in O(1) time rather than the O(lg n) time required when the entry comes from FHPQ. The trade-offs of HybridPQ are that one additional counter is required to track the fill level of hwPQ, and that the comparisons performed during INSERT and EXTRACT add a constant overhead. Pseudo-code describing the control mechanism is given in Figure 5.18.

The implementation of HybridPQ avoids the overflow condition in the hardware priority queue (hwPQ). Although the software priority queue (FHPQ) may still overflow when memory (RAM) is insufficient, that case is handled by the operating system, and software memory overflow is outside our scope. In any case, just as the hardware hwPQ can be extended by cascading, software memory overflow can be solved simply by expanding the memory capacity. Expanding software memory is certainly easier and cheaper than expanding the hardware priority queue.

In document Graph processing hardware accelerator for shortest path algorithms in nanometer very large-scale integration interconnect routing (Page 109-115)