Architecture and Background - Studies on the Impact of Cache Configuration on Multicore Process

Traditional bus structures were improved, to be used as interconnect in the multicore processors. However, with enhancement in the number of cores, use of bus as interconnect proved inefficient with increasing complexity. Moreover, bus does not scale well with increase in the number of cores. To address the scalability problem in multicore, NoC is used [12, 42].

In this chapter, we analyze the effect of interconnect on the multicore processors and have proposed a novel scalable on-chip interconnection mechanism for multicore processors.

4.2 Architecture and Background

Various work in current literature has explored the multicore architecture utilizing various performance metrics and application domain. Researchers like Bucker et al. [12] in their work have analyzed the effect of NoC on power consumption of multicore processors. They have discussed the challenges as well as progress in research and development of NoC for multicore processors. Zhou et al. [88] have proposed a new scheme for cost minimization named Interconnect Communication Cost Minimization

Model (ICCM). They have demonstrated an average of 50 % reduction in the commu-

nication cost, average of 23.1 % throughput improvement, and 35 % dynamic power reduction with the aid of their optimization mechanism.

Tullsen et al. [54] have analyzed a single-ISA heterogeneous multicore architecture for multi-threaded workload performance. The objective was to analyze the performance of multicore architectures for multi-threaded workloads. This section details the benefits of variation in the interconnection network in the multicore architecture with multi-threaded workloads.

Through various works, performance has been analyzed in both single core and multi-core architectures. Julian et al. [13] determined the relationship between performance and memory system in single core as well as multicore architecture. They have evaluated multiple performance parameters like cache size, core complexity. The authors have also discussed the effect of variation in cache size and core complexity

4.2 Architecture and Background

across the single core and multicore architecture.

Multicore architecture with multiple types of on-chip interconnect network [24, 80] are the recent trend of multicore architecture. This type of architecture have different types of interconnect network with different core complexity and cache configuration. A few of the architecture of this type has been described in the subsequent sections.

4.2.1 Metrics of Interconnection performance

The performance of interconnection networks can be described using a few metrics. These metrics have been cited below:

• Connectivity, or degree: This is the number of nodes that can be reached from a particular node in a single hop.

• Diameter: This gives the maximum delay that can occur during the passage of a message from one processor to another.

• Distance: The distance between two nodes is given by the total number of hops in the shortest path between two nodes.

• Average Distance: It is given by the the equation

davg= r P d=1(d − Nd ) N − 1

where N: The number of nodes, N_d is the number of nodes at distance d apart and r is the diameter.

• Bisection Bandwidth: It is the minimum number of wires that need to be discon- nected to divide the network into two equal halves.

4.2.2 Switches

Initially to establish communication in the multicore processor, buses were used as a shared fabric. In a bus, the communication is always in broadcast, in which, a message going from entity A to entity B, is seen by all other entities. In addition to it multiple

4.2 Architecture and Background

messages cannot be in transit simultaneously. But, when the cores and memory are placed on two either sides of the die, then it becomes a compelling choice to take the aid of crossbar interconnect for better communication. Even though crossbar are more costlier than buses with respect to the wiring overhead, still they enable simultaneous transmission of multiple messages, which in turn results in enhanced network bandwidth.

A crossbar circuit has N inputs and establishes a connection between these N inputs and M possible outputs. The circuit is setup as a grid of wires. At every point where the wires intersect, a pass transistor serves as a crosspoint connector which can short the two wires when enabled, which in turn connects the input to the output. We have small buffers at the crosspoint which can store the messages for a temporary period of time at times of contention for the target output port. A centralized arbiter is responsible to control the crossbar. This arbiter takes requests for output port from the incoming messages and computes a viable assignment of input port to the output port connections. The network model as shown in the proposed architecture in Figure 4.3 consists of set of nodes and switch nodes. Few links that connects the input and output buffers of a pair of nodes is also included in this network. The nodes existing in this network is classified as the end nodes and switch nodes. An end node is able to send or receive packets to or from another end node. Switches are used only to forward the packets between the switches or end nodes [44]. A link is particularly used for connection between the end nodes and switches. The internal architecture of a switch is displayed in the Figure 4.1. This switch includes a crossbar which connects each input buffer with every other output buffer. A packet at any input buffer can be routed to any of the output buffer at the tail of the switch [79].

4.2.3 Multicore Processor with Internal Network

The architecture of a multicore processor with internal network is shown in the Figure 4.2. This architecture is primarily targeted for the commercial server class applications that have a low degree of instruction level parallelism (ILP) but a high degree of thread level parallelism (TLP). It has a private L1 cache dedicated for each core. L1 caches

4.2 Architecture and Background

Figure 4.1: Switch Architecture Model

are connected to two L2 cache banks through a switch [78]. Each core is capable of executing eight threads independently. At every cycle, the thread selection stage of pipeline chooses one of the eight threads for execution. An instruction from the selected thread is fetched and pushed onto the beginning of the pipeline. The thread selection policy aims at switching threads in every cycle giving a fine-grained interleaving of threads in the pipeline [14].

Interconnect

MPIN architecture uses a crossbar switch as interconnection network. A core can communicate with another core or L2 cache only through the switch. The crossbar allows up to eight simultaneous accesses of the L2 cache. Four memory controllers are used to interface L2 cache with main memory.

Cache Crossbar Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7 L2 B0 L2 B1 L2 B2 L2 B3 L2 B4 L2 B5 L2 B6 L2 B7 Memory Controller 3 Memory Controller 2 Memory Controller 1 Memory Controller 0

In document Studies on the Impact of Cache Configuration on Multicore Processor (Page 48-52)