Dynamic Scheduling Issues in SMT Architectures


∗

Chulho Shin

System Design Technology Laboratory

Samsung Electronics Corporation

Seong-Won Lee

Dept. of Electrical Engineering - Systems

University of Southern California

Jean-Luc Gaudiot

Dept. of Electrical Engineering and Computer Science

University of California, Irvine

Abstract

Simultaneous Multithreading (SMT) attempts to attain higher processor utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Previous studies have shown, however, that its throughput may be limited by the number of threads. A reason is that a fixed thread scheduling policy cannot be optimal for the varying mixes of threads it may face in an SMT processor. Our Adaptive Dynamic Thread Scheduling (ADTS) was previously proposed to achieve higher utilization by allowing a detector thread to make use of wasted pipeline slots with nominal hardware and software costs. The detector thread adaptively switches between various fetch policies. Our previous study showed that a single fixed thread scheduling policy leaves much room (some 30%) for improvement compared to an oracle-scheduled case. In this paper, we take a closer look at ADTS. We implemented the functional model of the ADTS and its software architecture to evaluate various heuristics for determining a better fetch policy for the next scheduling quantum. We report that performance could be improved by as much as 25%.

1. Introduction

Simultaneous Multithreading (SMT) or Multithreaded Superscalar Architectures [4, 10, 21, 20, 5, 8] can achieve high processor utilization by allowing multiple independent threads to coexist in the processor pipeline and share resources with the support of multiple hardware contexts. SMT is an attempt to overcome the low resource utilization of wide-issue single-threaded superscalar processors by exploiting Thread-Level Parallelism (TLP) at a relatively low hardware cost for supporting the multiple hardware contexts.

Studies by Tullsen et al. and Ungerer et al. [21, 16] have shown that when the number of threads simultaneously active in an SMT processor becomes greater than four, performance often saturates and in some cases even degrades. In these studies, an attempt was made to overcome the saturation effect by finding a better fetch mechanism or increasing the number and availability of resources that would otherwise become bottlenecks (such as register files and instruction queues). It was also shown that increasing the size of the caches can result in a higher saturation point. Unfortunately, such remedies do not work in all cases because their effectiveness is heavily affected by the properties of the application mixtures. We believe that one fixed thread scheduling policy which performs better than others “on the average” cannot deliver the performance we anticipate in SMT processors with more than four thread contexts. We will show that with our adaptive dynamic thread scheduling policy [15], we can significantly improve the performance of SMT processors and prevent the saturation or degradation effects alluded to earlier.

∗ The material reported in this paper is based upon work supported in part by the National Science Foundation under Grants No. CSA-0073527 and INT-9815742. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Our work focuses on multiprogrammed or multi-user environments where the combinations of multiple threads that an SMT processor faces vary significantly over time. For multiprogramming or multi-user workloads consisting of threads running on the processor independently of one another, no information about any interactive behavior between threads may be known in advance. Consequently, it is indispensable to adopt a more “intelligent” and more dynamic thread scheduling capability if we are to sustain high throughput.

When parallelizing an application to generate multiple threads, the role of thread scheduling is to eliminate resource conflicts and avoid data dependencies in order to expose more parallelism. In contrast, the role of scheduling for multiple independent threads (of multiprogrammed workloads) is to perform better “traffic control” so as to sustain higher throughput by maintaining low interference between threads.

Tullsen et al. [20] evaluated several fetch policies and showed that the ICOUNT policy yields the best average performance. ICOUNT gives priority to the threads with fewer instructions in the decode stage, the rename stage, and the instruction queues. ICOUNT best accounts for what is taking place in SMT pipelines in general: since it gives priority to the threads that have fewer instructions in the earlier stages of the pipeline, a balanced use of the instruction window occurs. Since it gives more opportunities to the threads whose instructions drain through the pipeline more rapidly, a more efficient use of the pipeline results.
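As a rough illustration of the ICOUNT idea described above, the following Python sketch orders threads by the number of instructions they hold in the front-end stages. The dictionary layout and the name icount_priority are assumptions made for illustration only; they are not taken from [20].

```python
def icount_priority(front_end_counts):
    """Return thread ids ordered from highest to lowest fetch priority.

    front_end_counts maps a thread id to the number of its instructions
    currently in the decode stage, the rename stage and the instruction
    queues (hypothetical bookkeeping, for illustration only).
    """
    # Fewer front-end instructions means higher priority; ties are broken
    # by thread id so that the ordering is deterministic.
    return sorted(front_end_counts, key=lambda tid: (front_end_counts[tid], tid))


counts = {0: 12, 1: 3, 2: 7, 3: 3}
order = icount_priority(counts)
print(order)
```

In this toy input, threads 1 and 3 hold the fewest front-end instructions and are therefore ranked first.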

While ICOUNT is the scheduling policy that works best on the average, it does not address problems as directly as other policies such as BRCOUNT and MISSCOUNT do. (BRCOUNT prioritizes threads with fewer conditional branches.) Assume for example that the set of applications in an SMT processor consists of four control-intensive applications (with many conditional branches) and four other applications. Further assume that these four control-intensive applications are experiencing high branch misprediction rates at the moment. Then, the processor will suffer from wasted slots filled with wrong-path instructions of the four control-intensive applications, which prevents the other four threads from exploiting the resources in the pipeline. In this specific case, if BRCOUNT had been used, the four control-intensive threads would have found fewer chances to get fetched. Consequently, the number of (fetched) instructions of control-intensive threads would diminish while the number of instructions of the other four threads would increase, making the number of effective instructions even among all threads.

The main goal of a hardware thread scheduler is to avoid imbalance among threads, where imbalance on a resource means that usages or counts of the resource are not even among the threads. For example, if one thread has many more instructions in the early stages of the pipeline (the decode and rename stages and the instruction queue) than the others do, we have an imbalance in terms of instruction count. Imbalance adversely affects the throughput for the following reasons (it would result in lowered Thread-Level Parallelism):

• Since a small number of threads are occupying one type of resource, the other threads cannot have access to these same resources.

• The average number of non-dependent and “issuable” instructions per thread becomes lower for the other threads, lowering the average number of instructions that can proceed through the pipeline.

With adaptive dynamic thread scheduling, when a change in the system environment is detected, the fetch policy which should be used during the next interval is decided upon and put into effect to eliminate the problematic imbalance. However, having multiple fetch policies and decision-making algorithms in hardware could translate into high hardware complexity. In our previous work [15], we proposed our detector thread approach which could help lower the hardware requirements and also make use of unused pipeline slots to run decision-making algorithms and fetch policies. Our approach also has the advantage that thread scheduling can be manipulated even after the chip has been produced because the detector thread is programmable. The detector thread can also help lower the overhead of the system job scheduler by shortening its stay in the processor and analyzing information before the job scheduler needs it.

In this paper, we take a closer look at the software aspect of ADTS. We propose an effective software architecture for the detector thread. The core of this software is the heuristics for determining the fetch policy that will be used in the next scheduling quantum. We implement and evaluate the functional models of those heuristics.

This paper is organized as follows. In section 2, previous works related to our work are summarized. The adaptive dynamic thread scheduling is reviewed in section 3, its software architecture is discussed in section 4 and how we evaluate our idea is discussed in section 5. Results of our simulation experiments are presented and analyzed in section 6. Summary and conclusions appear in section 7.

2. Related Work

Wang et al. investigated the use of a special thread while aiming at realizing speculative precomputation in one of the two threads available on the Hyper-Threading architecture [22]. The study is targeted at improving the performance of single-threaded applications on two-context SMT processors.

DanSoft [6] proposed the idea of nanothreads in which one nanothread is given the control of the processor upon the stall of a main thread. The idea was based on a CMP with dual VLIW single-threaded cores and its success hinges on the effectiveness of the compiler. Assisted Execution [18] extended the nanothread idea for architectures that allow simultaneous execution of multiple threads including SMT. It attempts to improve the performance of a main thread by having multiple nanothreads perform prefetch and its success also hinges on the operation of the compiler.

Speculative data-driven multithreading [14] takes advantage of a speculative thread, called a data-driven thread (DDT), to pre-execute critical computations and consume latency on behalf of the main thread on SMT. This study also focused on improving the performance of a main thread. Luk [11] also proposed pre-execution for more effective prefetch of hard-to-predict data addresses, using idle threads to boost the performance of a primary thread.

Simultaneous Subordinate Microthreading (SSMT) [3] was proposed in an attempt to improve the performance of a single thread by having multiple subordinate microthreads perform useful work such as running sophisticated branch prediction algorithms. The idea was not based on an SMT architecture and also requires effective compiler technology.

Parekh et al. [13] investigated issues related to job scheduling for SMT processors. They compared the performance of oblivious and thread-sensitive scheduling. Oblivious scheduling means round-robin and random scheduling, while thread-sensitive scheduling takes into account the resource demands and the behavior of each thread. The study concluded that thread-sensitive IPC-based scheduling can achieve a significant speedup over round-robin methods. However, this study concerns system job scheduling and cannot be directly related to dynamic thread scheduling. Also, the job scheduler will have to be brought into the processor, resulting in a context switch of user threads. This job scheduler, however, can take advantage of our detector thread approach, as will be discussed in section 3.

Another similar study [17] investigated job scheduling for SMT processors. The study proposed a job scheduling scheme called SOS which involves an overhead-free sample phase where the performance of various schedules (mixes) is sampled and taken into account for the selection of tasks for the next time slice.

We recognize that this strategy can also benefit from our approach because the detector thread will always be active. It could make use of unused pipeline slots and resources to find out which threads should not be selected in the next job scheduling time slice while lowering the burden of the job scheduler.

Our adaptive dynamic thread scheduling approach [15] should not be confused with adaptive process scheduling [12] which addresses O/S job scheduling issues for SMT processors: the goal of our approach is to offer more efficient thread scheduling at the individual instruction level in the SMT pipeline.


A study that examines approaches to detect per-thread cache behavior using hardware counters, and to assist job scheduling on SMT based on the information obtained, was performed by Suh et al. [19]. This approach is similar to our idea of relating the detector thread with job schedulers. However, it does not aim at controlling thread fetch policies.

3. Adaptive Dynamic Thread Scheduling (ADTS) with a Detector Thread (DT)

Our Adaptive Dynamic Thread Scheduling (ADTS) was introduced and discussed in detail in [15]. Its implementation with a detector thread (DT) was also discussed. The ADTS with a DT tackles two problems: first, a new fetch policy can be activated if the system is suffering from low throughput. Second, it allows unused pipeline slots to be used to detect adverse changes in the system, identify threads that clog the pipeline, and take the actions needed to sustain high throughput. The actions that can be taken include context-switching a thread and preventing a specific thread from being fetched.

A detector thread is a special thread which reads thread status indicators and updates thread control flags based on the current values of the indicators so that the thread control hardware can take any necessary action to improve the performance of an SMT processor. The per-thread status indicators are updated by circuitry located throughout the processor pipeline, based upon specific events such as cache misses, pipeline stalls, population at each stage, etc.

[Figure 1 diagram: per-thread counters, the detector thread (DT) alongside threads A to H, flags, and thread selection units]

Figure 1. How a Detector Thread works with normal threads.

The role of the detector thread is to check the values of the various thread status indicators and, based on the conditions dynamically defined in software, to properly update the thread control flags as shown in Figure 1. A thread will have its own set of flags. One flag may tell whether a thread can be fetched in the next cycle while another flag may tell whether it should be context-switched at the next opportunity. When the system thread is loaded, it will look at the flag and suspend a clogging thread without going through the process of determining which thread to suspend. Then, the thread selection unit simply issues instructions from threads in their order of priority. Although the per-thread status indicators, thread control flags, and thread selection units are fixed in hardware, we can control the thread control behavior around those hardware resources by writing different program code for the detector thread.

Our previous work [15] proposed a way to implement the detector thread based on another study [3]. The detector thread will have its own program cache sufficiently large (2 or 4KB) to fit its small program image and its data accesses should be mostly to special registers such as the per-thread counters and general-purpose registers. Most of the time, the detector thread will be the lowest-priority thread. When the slots are almost fully occupied by normal threads, the detector thread will not obtain any more scheduling slots; this is acceptable because it means that the processor pipeline slots are enjoying high utilization.

Fetching the detector thread’s instructions should not result in significant overhead either. Since its instructions are coming from their own isolated program cache, they will not compete for fetch bandwidth with other normal threads. It should not affect the data memory bandwidth either because its data will be mostly coming from special registers. Also, it was shown that the detector thread’s job can fit within the cycle budget allowed in realistic situations [15].

The detector thread plays a major role in this process as shown in Figure 1. It keeps watching the per-thread status indicators and updates the flags based on its active policy. The indicators are updated by hardware on predetermined events in places spread across the pipeline. The detector thread has the lowest priority among threads. As long as the pipeline is well utilized, the detector thread will not often be activated. Can a detector thread experience starvation in such cases? This depends upon the occupancy rate of the instruction fetch buffer. As long as the instruction fetch buffer is full, no instructions from the detector thread can be fetched.

For this detector thread approach to work successfully, it has to be equipped with intelligent heuristics or algorithms to dynamically detect clogging (low throughput) and to choose a better fetch policy for the next time frame. However, since the resources allowed for the detector thread are quite limited in order to minimize hardware overhead, the algorithm is also limited in the data to which it can refer. This will be the topic of the next section.

4. Software Architecture of the Detector Thread

The software architecture of the detector thread for adaptive thread scheduling is shown in Figure 2. The status counters are updated at each cycle throughout the pipeline. For every period of 8K cycles, the number of committed instructions is counted, along with the maximum number of instructions that can be executed (8K × 8). If the interval is to remain constant, the maximum number need not be counted. The detector thread will check whether the IPC (the number of committed instructions per cycle) is less than the threshold. If so, the previous time frame will be identified as low-throughput.
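The low-throughput test above amounts to a single comparison per quantum. The Python sketch below shows one possible reading of it; the constant names, the function name is_low_throughput, and the interpretation of 8K as 8 × 1024 cycles are assumptions for illustration.

```python
QUANTUM_CYCLES = 8 * 1024   # one scheduling quantum (8K cycles, assumed 8 * 1024)
ISSUE_WIDTH = 8             # at most eight instructions per cycle


def is_low_throughput(committed, ipc_threshold, cycles=QUANTUM_CYCLES):
    """Return True when the IPC of the last quantum is below the threshold."""
    ipc_last = committed / cycles
    return ipc_last < ipc_threshold


# Upper bound on instructions per quantum (8K x 8).
max_instructions = QUANTUM_CYCLES * ISSUE_WIDTH
print(max_instructions)
print(is_low_throughput(committed=16384, ipc_threshold=3.0))
```

With 16384 committed instructions in a quantum the IPC is 2.0, so a threshold of 3.0 would mark that quantum as low-throughput.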

Once a previous scheduling quantum² is determined to be low-throughput, a new fetch policy has to be determined because the incumbent policy (the one which is currently engaged) turned out to perform poorly. Then, the policy that has been chosen to replace the incumbent policy for the next scheduling interval is activated. In the meantime, during the remaining idle slots, other functions can be accomplished. The first is to identify the clogging threads. By looking at the per-thread status counters, the threads that are clogging the pipelines for various reasons can be identified and marked so that the job scheduler can later suspend them once loaded, without going through the possibly long process of identifying them for itself. This results in a shorter period of activity for the job scheduler. The second is to enforce the incumbent policy. Per-thread status counters are checked and the priority array is updated depending on the values of the counters. Then, the thread selection unit will look at the array to decide which two threads should be selected for instruction fetch at each cycle.

² This scheduling quantum should not be confused with that of the job scheduler. The typical size of a quantum for job scheduling is in the range of milliseconds, which can be equivalent to a million cycles.

[Figure 2 flowchart: Status Counters Updated; IPC < Threshold (Yes/No); Determine New Policy; Policy Switch; Identify Clogging Threads; Policy Enforce; TSU]

Figure 2. Software architecture of the detector thread

4.1 Pseudo Code of the Detector Thread

The pseudo code of the detector thread is shown in Figure 3. The main subroutine Detector_Thread() has a large endless while loop with a jump location right ahead of it, East. If the condition IPC_last < IPC_thold holds true, it will be recognized as a low-throughput event and the required actions will be taken. IPC_last is the number of committed instructions per cycle during the last eight-kilo-cycle quantum and IPC_thold is the threshold value of the IPC, which is predetermined by the detector thread management kernel developer. This threshold value may also be chosen to be updated by the detector thread software.

Once a low-throughput condition is recognized, Identify_CloggingThreads() is called and the cause of the low throughput is analyzed to identify the clogging threads. Determine_NewPolicy() is called next to find out the policy that should be engaged in the next quantum. This stage needs the most effort since choosing a new policy will significantly affect the throughput of the next scheduling interval. The new policy is then engaged as the next incumbent policy by the function Policy_Switch() and a jump to the subroutine Policy_Enforce() is made. In this routine, the thread priority array (TPA) is updated depending on the current system state and the incumbent policy, while the thread selection unit (TSU) examines this array to determine the threads for instruction fetch at each cycle. The TSU selects up to two threads at each cycle because we are using ICOUNT.2.8 [20].

Figure 3. The framework of a detector thread in pseudo code (abridged)
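The control flow described in this subsection can be sketched in Python as follows. This is not the pseudo code of Figure 3: the function names, the layout of the counters dictionary, the clogging rule and the bounded loop over recorded quanta (in place of the endless loop) are all assumptions made for illustration.

```python
def identify_clogging_threads(counters, limit=32):
    # Placeholder rule (an assumption): flag threads whose front-end
    # instruction count exceeds `limit`.
    return sorted(t for t, n in counters["ICOUNT"].items() if n > limit)


def policy_enforce(policy, counters):
    # Thread priority array: thread ids sorted by the counter that the
    # incumbent policy tracks, lowest count first.
    per_thread = counters[policy]
    return sorted(per_thread, key=lambda t: (per_thread[t], t))


def detector_thread(quanta, ipc_thold, determine_new_policy, policy="ICOUNT"):
    """Process one (ipc_last, counters) pair per quantum and log the actions."""
    log = []
    for ipc_last, counters in quanta:
        if ipc_last < ipc_thold:                            # low-throughput event
            clogging = identify_clogging_threads(counters)
            policy = determine_new_policy(policy, counters)  # policy switch
            log.append(("switch", policy, clogging))
        log.append(("enforce", policy, policy_enforce(policy, counters)))
    return log


def toggle(incumbent, counters):
    # Simplest possible heuristic: alternate between two policies.
    return "BRCOUNT" if incumbent == "ICOUNT" else "ICOUNT"


sample = {"ICOUNT": {0: 10, 1: 40}, "BRCOUNT": {0: 6, 1: 2}}
log = detector_thread([(4.1, sample), (1.5, sample)], 2.0, toggle)
for entry in log:
    print(entry)
```

The heuristic is passed in as a function so that different ways of determining the new policy can be swapped in without touching the loop.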

4.2 Determination of Threshold Values

The big question to address before determining the next fetch policy is how we know whether the processor is experiencing low throughput or not. What threshold should serve as the reference upon which we can make accurate judgments? Figure 7 illustrates how the value of the IPC threshold affects the frequency of switching and the quality of a switch. If the threshold value is too low, very little switching will take place while the quality of a switch can often be high (the quality of a switch is high when the switch results in an increase of throughput in the next scheduling interval). In this case, the quality of a switch would be high because when low throughput is detected, the incumbent policy is less likely to be capable of improving the situation since the threshold value was very low. If the value is too high, switching will occur too frequently. Further, the “quality” of each switch can be very low since it is more likely that the situation cannot improve even with alternative policies because the current throughput can be fairly high.

4.3 Determination of Next Fetch Policy

4.3.1 Underlying Premises

Once it turns out that the incumbent fetch policy fails to sustain high throughput, the following may be taken into consideration to determine a new fetch policy.

• What was the fetch policy for the last quantum?

• What are the current conditions? (Instruction counts, cache miss rates, etc.)

• What has been the history of a fetch policy’s effect under a certain condition?

The more things we take into consideration, the more sophisticated and informed the determination heuristic becomes. However, overly sophisticated heuristics may not fit in the available cycle budget or in the DT PRAM whose size is also limited. The fewer things we take into consideration, the lower the overhead of the detector thread and the quicker the response of the detector thread. However, limiting the sophistication of the scheduling algorithm may result in weak performance. Thus, we need to find the trade-off where the overhead fits our budget while still producing good results.

The simplest way to determine the new policy is a fixed transition that does not consider current conditions. This is basically what we do in our Type 1 heuristic (Figure 4). However, it should be noted that switching to another specific policy may worsen an already deteriorating situation instead of improving it if the newly engaged policy does not happen to address the problems the system is currently experiencing. This kind of approach will also heavily rely on the value of the threshold because a higher value of the threshold is more likely to cause such adverse effects while a lower value is less so.

4.3.2 Various Heuristics

The first of the heuristics is called Type 1, the simplest way of determining a new fetch policy. In this scheme, no status indicators are referenced before making a decision and consequently it is not sensitive to the state in which the system currently is. As long as a low throughput condition is not detected, the current state, that is, the incumbent fetch policy, will be maintained. Once a low throughput condition has been detected, a transition to the other policy (either BRCOUNT or ICOUNT) will unconditionally be made. Initially, the default fetch policy will be ICOUNT. The advantage of this scheme is that the software overhead of the detector thread will be so small that it can be implemented in hardware. However, in that case the advantages of the detector thread such as flexibility and programmability will not be available.

[Figure 4 diagram: two states, BRCOUNT and ICOUNT]

Figure 4. Type 1 heuristic for determination of a new fetch policy
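A minimal Python sketch of the Type 1 heuristic, under the assumption that policies are represented by their names as strings (the function name type1_next_policy is hypothetical):

```python
def type1_next_policy(incumbent, low_throughput):
    """Two-state machine: toggle between ICOUNT and BRCOUNT on low throughput."""
    if not low_throughput:
        return incumbent          # keep the incumbent fetch policy
    return "BRCOUNT" if incumbent == "ICOUNT" else "ICOUNT"


policy = "ICOUNT"                 # default fetch policy
history = []
for low in [False, True, True, False]:
    policy = type1_next_policy(policy, low)
    history.append(policy)
print(history)
```

No status indicator is consulted: the only inputs are the incumbent policy and whether the last quantum was flagged as low-throughput.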

The Type 2 heuristic is another simple way of determining a new fetch policy. In this scheme, as in Type 1, no status indicators are referenced for the decision. The difference (Figure 5) is that one more state (or fetch policy) has been added to the original finite state machine. Variants of this scheme can be made by changing the sequence of the transitions, which currently is set to the order of ICOUNT, L1MISSCOUNT and BRCOUNT, or by adding more fetch policies to the current set of three. Type 1 and Type 2 only consider what the fetch policy was for the last quantum once low throughput is detected. There is only one state that can be transited to from a state. Thus, as long as low throughput is not avoided, one of the two or three states will be entered in a cyclic fashion.

[Figure 5 diagram: three states, BRCOUNT, ICOUNT and L1MISSCOUNT]
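The cyclic behavior of Type 2 can be sketched in Python as below. The list TYPE2_ORDER follows the transition order stated in the text; the function name and the wrap-around from BRCOUNT back to ICOUNT are assumptions consistent with the "cyclic fashion" described.

```python
TYPE2_ORDER = ["ICOUNT", "L1MISSCOUNT", "BRCOUNT"]


def type2_next_policy(incumbent, low_throughput, order=TYPE2_ORDER):
    """Advance to the next policy in a fixed cycle when throughput is low."""
    if not low_throughput:
        return incumbent
    return order[(order.index(incumbent) + 1) % len(order)]


print(type2_next_policy("ICOUNT", True))
print(type2_next_policy("BRCOUNT", True))
```

Variants with a different sequence or more policies only require passing a different order list.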

In the Type 3 heuristic (Figure 6), one of two states can be entered from a state depending on the value of some specific conditions. Depending on the value of a condition, the transition is made to the policy that is reckoned to improve throughput under the current condition. The Type 3 heuristic relies on the following conditions:

• COND_MEM is true when one of the following two sub-conditions is true.

1. L1 miss count for the last quantum is higher than its threshold value of 0.19 times/cycle.

2. Load/Store Queue becomes full too often, more often than its threshold value of 0.45 times/cycle.

• COND_BR is true when one of the following two sub-conditions is true.

1. Branch misprediction count for the last quantum is higher than its threshold value of 0.02 times/cycle.

2. The count of conditional branches for the last quantum is higher than its threshold value of 0.38 branches/cycle.

Above, the specific threshold values for the L1 miss count, the Load/Store Queue occupancy rate, the branch count and its misprediction count were determined by simulation. We ran eight-thread simulations in our SMT simulator with our 13 different mixes of applications and ended up with an average value for each metric. These measures are indeed dependent on the hardware configuration and on what kind of mixes are running in the processor. There can be no single “golden” reference measure that can always be used. To be more effective, the threshold values should be updated to reflect newly found information.

That is one of the reasons why the detector thread approach is good for adaptive dynamic thread scheduling. The system’s detector thread management kernel can profile the system and determine whether the current threshold numbers are obsolete and, if so, it may update the values to reflect the new state of the system. This update can be done by writing values in the detector thread’s DT DRAM through DMA [15].


[Figure 6 diagram: three states, BRCOUNT, ICOUNT and L1MISSCOUNT, with transitions labeled COND_MEM, !COND_MEM, COND_BR and !COND_BR]

The Type 3 heuristic works as follows. Suppose that BRCOUNT is the incumbent fetch policy when low throughput is detected. This implies that BRCOUNT has not worked well during the last quantum and that there is no crucial imbalance among the threads of the current set with respect to conditional branches; the imbalance might be in other factors. We can therefore guess that one of the other policies, ICOUNT or L1MISSCOUNT, may work better. Next, we consider the condition COND_MEM and check its value. If it holds true, then the imbalance might have been in the number of L1 cache misses or the usage of the load/store queue. Thus the transition will be made to L1MISSCOUNT. Otherwise, the problem might not lie in memory usage and the transition will be made to ICOUNT, which works best on the average.
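The following Python sketch is one possible rendering of the Type 3 transitions. The thresholds are the ones listed above and the transitions out of BRCOUNT follow the description in the text; the transitions out of ICOUNT and L1MISSCOUNT, the rates dictionary and all function names are assumptions (the L1MISSCOUNT case in particular is only a guess based on the transition labels of Figure 6).

```python
# Per-cycle thresholds quoted in the text.
L1_MISS_THOLD = 0.19
LSQ_FULL_THOLD = 0.45
BR_MISPREDICT_THOLD = 0.02
COND_BRANCH_THOLD = 0.38


def cond_mem(rates):
    return rates["l1_miss"] > L1_MISS_THOLD or rates["lsq_full"] > LSQ_FULL_THOLD


def cond_br(rates):
    return (rates["br_mispredict"] > BR_MISPREDICT_THOLD
            or rates["cond_branch"] > COND_BRANCH_THOLD)


def type3_next_policy(incumbent, rates):
    """Pick the next fetch policy once low throughput has been detected."""
    if incumbent == "BRCOUNT":
        return "L1MISSCOUNT" if cond_mem(rates) else "ICOUNT"
    if incumbent == "ICOUNT":
        return "BRCOUNT" if cond_br(rates) else "L1MISSCOUNT"
    # Incumbent is L1MISSCOUNT: assumed transitions, not stated in the text.
    return "BRCOUNT" if cond_br(rates) else "ICOUNT"


memory_bound = {"l1_miss": 0.25, "lsq_full": 0.10,
                "br_mispredict": 0.01, "cond_branch": 0.20}
print(type3_next_policy("BRCOUNT", memory_bound))
```

For the memory-bound sample rates, COND_MEM holds, so the transition out of BRCOUNT goes to L1MISSCOUNT.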

For another type of heuristic, Type 4, we add two features. The first one is to take into account the gradient of the throughput. Even when low throughput is detected, if the throughput is higher than the throughput observed one quantum earlier (positive gradient), switching policies is not allowed. That way, we wait for the situation to keep improving with the original fetch policy.

The second feature is to keep track of the switching history. In the switching history buffer, the following are recorded for each policy switching event.

• Incumbent policy: The fetch policy that is originally engaged before a switch takes place.

• Value of the condition: For each policy, there is one condition that is checked. The value of the condition is recorded.

• Counter for positive outcomes (poscnt): This counter is incremented every time a specific case ended up with an increase in throughput.

• Counter for negative outcomes (negcnt): This counter is incremented every time a specific case ended up with a decrease in throughput.

Before making the final decision, poscnt and negcnt are compared. If poscnt is greater, then a regular switch is made. Otherwise, the opposite direction will be chosen. For instance, suppose the incumbent policy was ICOUNT and low throughput is detected. Then with COND_BR being true, the transition would have been toward the BRCOUNT policy with the Type 3 heuristic. In Type 4, the counters (poscnt and negcnt) are examined and if poscnt is not greater than negcnt, the transition will be made toward the opposite, L1MISSCOUNT.
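The two Type 4 features can be sketched in Python as follows. Keying the history by (incumbent policy, condition value), passing the regular and opposite targets in from the caller, and the function names are all assumptions for illustration. The comparison is implemented literally as stated, so an empty history (poscnt equal to negcnt) selects the opposite direction.

```python
from collections import defaultdict


def new_history():
    # history[(incumbent, condition)] -> [poscnt, negcnt]
    return defaultdict(lambda: [0, 0])


def type4_next_policy(incumbent, condition, regular, opposite,
                      ipc_last, ipc_prev, history):
    """Apply the gradient check, then the history check."""
    if ipc_last > ipc_prev:
        return incumbent                  # positive gradient: do not switch
    poscnt, negcnt = history[(incumbent, condition)]
    return regular if poscnt > negcnt else opposite


def record_outcome(history, incumbent, condition, ipc_before, ipc_after):
    entry = history[(incumbent, condition)]
    if ipc_after > ipc_before:
        entry[0] += 1                     # poscnt
    elif ipc_after < ipc_before:
        entry[1] += 1                     # negcnt


history = new_history()
args = ("ICOUNT", True, "BRCOUNT", "L1MISSCOUNT")
first = type4_next_policy(*args, ipc_last=1.2, ipc_prev=1.5, history=history)
record_outcome(history, "ICOUNT", True, ipc_before=1.2, ipc_after=1.8)
second = type4_next_policy(*args, ipc_last=1.2, ipc_prev=1.5, history=history)
kept = type4_next_policy(*args, ipc_last=1.6, ipc_prev=1.5, history=history)
print(first, second, kept)
```

How outcomes are attributed to a history entry is not spelled out in the text, so record_outcome should be read as one plausible bookkeeping choice rather than the evaluated implementation.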

Fetch Policy     Description
BRCOUNT          Number of total branches for a thread
LDCOUNT          Number of total loads for a thread
MEMCOUNT         Number of total memory accesses for a thread
L1MISS_COUNT     Number of total L1 Cache misses for a thread
L1IMISS_COUNT    Number of total L1 ICache misses for a thread
L1DMISS_COUNT    Number of total L1 DCache misses for a thread
ICOUNT           Current Instruction Queue population for a thread
ACCIPC           Accumulated IPC for a thread
STALL_COUNT      Number of total stalls incurred for a thread
RR               Round-Robin scheduling

Table 1. Various Fetch Policies tested

5. Methodology

We used the SimpleSMT simulator [9] which is an extension of the SimpleScalar tool set [1]. It thus inherits most architectural specifications of the superscalar model in SimpleScalar. The main architectural difference between SimpleSMT and SimpleScalar is that SimpleSMT has separate integer and floating-point instruction queues and more pipeline stages to reflect the additional complexity of SMT. The simulation environment has been configured to have resources compatible with previous research on SMT [20] (for verification purposes) as we did in our previous work [15].

We used SPEC CPU2000 [7] as our simulation workloads and formed thirteen program mixtures depending on each program’s properties: IPC on a single-threaded machine model, memory footprint and whether an application requires floating-point operations or not. For combinations with a mix of integer and floating-point applications, we attempted to make the mix as even as possible. For simulation of 4- and 6-thread cases, some applications were randomly chosen to be excluded from the 8-thread mixes.

We modeled ten different fetch policies as shown in Table 1. BRCOUNT, L1DMISS_COUNT, ICOUNT and RR were proposed and evaluated in [20]. Additionally, we included in our list LDCOUNT, MEMCOUNT, ACCIPC and STALL_COUNT. The description of each policy is found in the table. L1MISS_COUNT and L1IMISS_COUNT were added to have a closer look at the effect of the caches.

At each cycle, the simulator sorts the threads according to the fetch policy. Instructions are fetched from the first thread until a cache block boundary is met. If no boundary is encountered, all eight instructions are fetched from that thread. Otherwise, the remaining instructions can be fetched from the next thread. We limited the number of threads that can be fetched from in one cycle to two. A study [2] showed that fetching all eight instructions from one thread can adversely affect performance due to fetch fragmentation. For fair comparison, we applied the same mechanism to both fixed scheduling and adaptive scheduling.
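The fetch mechanism described above can be sketched as follows. This is a simplified model under stated assumptions, not the simulator's code: program counters are expressed as instruction indices, the cache block size is given in instructions (`block_insts`, assumed to be 8 here), and the function and parameter names are hypothetical.

```python
def fetch_cycle(ranked_tids, pcs, fetch_width=8, max_threads=2, block_insts=8):
    """Fetch up to fetch_width instructions from at most max_threads threads.

    ranked_tids: thread ids already sorted by the active fetch policy.
    pcs:         dict mapping thread id -> next instruction index
                 (updated in place as instructions are fetched).
    block_insts: assumed cache block size, measured in instructions.
    Returns a list of (thread id, instructions fetched) pairs.
    """
    fetched = []
    remaining = fetch_width
    for tid in ranked_tids[:max_threads]:
        if remaining == 0:
            break
        # Instructions left before this thread reaches a block boundary.
        to_boundary = block_insts - (pcs[tid] % block_insts)
        n = min(remaining, to_boundary)
        fetched.append((tid, n))
        pcs[tid] += n
        remaining -= n
    return fetched


if __name__ == "__main__":
    pcs = {0: 5, 1: 16, 2: 0}
    # Thread 0 is 3 instructions from a boundary, so thread 1 fills the rest.
    print(fetch_cycle([0, 1, 2], pcs))
```

When the highest-priority thread starts at a block boundary, it supplies all eight instructions; when both selected threads hit a boundary early, fewer than eight instructions are fetched in that cycle because a third thread is never consulted.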

Because of the huge size of the SPEC 2000 applications, it is impractical to run simulations to the end of every program. A typical SPEC 2000 application in reference mode executes an average of 200 billion instructions, so at our simulator's speed of about 25K instructions per second it would take about three months to run one application to completion.
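The three-month estimate follows directly from the two figures quoted above. The short calculation below reproduces it; the constant and function names are only illustrative.

```python
INSTRUCTIONS_PER_APP = 200e9   # average instruction count quoted above
SIM_SPEED = 25e3               # simulated instructions per second quoted above
SECONDS_PER_DAY = 24 * 60 * 60


def full_run_days(instructions=INSTRUCTIONS_PER_APP, speed=SIM_SPEED):
    """Days needed to simulate one application from start to finish."""
    return instructions / speed / SECONDS_PER_DAY


if __name__ == "__main__":
    # 200e9 / 25e3 = 8e6 seconds, which is a little over 92 days.
    print(round(full_run_days(), 1))
```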


simulation results, we ran simulation for a million cycles in ten randomly chosen different intervals by taking advantage of the fast-forward feature of the SimpleScalar simulator [1].

6. Experimental Results

Figure 7 a) and c) verify what we had surmised in 4.2. As the threshold value increases, more switchings occur for all types of heuristics. The quality of a switch decreases as the threshold value increases, but not as fast as the number of switchings increases. Note that this is not the case with Type 1 and Type 2; the quality of a switch may be higher with a threshold value of 3 than with 2. Figure 7 b) and d) show how the policy determination heuristic type affects the frequency and quality of switchings. Type 3' represents the Type 3 heuristic plus consideration of the gradient of throughput. It is interesting to note that the Type 4 heuristic results in more low-quality (malignant) switchings. This implies that determining a new fetch policy based on historical performance is not effective.

[Figure 7 panels, each plotted for heuristic Types 1, 2, 3, 3' and 4 and threshold values 1 to 5: (a) Number of switchings vs. threshold value; (b) Number of switchings vs. policy determination heuristic type; (c) Probability of benign switches (aggregate) vs. threshold value; (d) Probability of benign switches (aggregate) vs. policy determination heuristic type]

Figure 7. Effect of the threshold value on switch occurrence and quality

Figure 8 shows the effects of the IPC threshold value on the throughput. The best performance is reached when the threshold value is 2 and the Type 3 heuristic is used. The maximum performance improvement over ICOUNT is about 30%. The values in the graphs are averages over all mixtures. We also found that greater improvements can be achieved when a mixture contains more similar applications. With a mixture of various applications, less improvement was achieved.

It should also be noted that the throughput values observed in this experiment are relatively low considering the number of threads involved. The reason lies in the configuration we chose for simulation. SimpleSMT, like SimpleScalar, does not simulate system calls; instead, it translates them into host system calls, trading accuracy for simulation efficiency. We assumed that when a thread encounters a system call, all threads must be flushed from the pipeline before the system call can start, which is the most conservative assumption. In real situations, this would be the case when the system call performs a critical operation such as memory allocation, because such operations may affect all threads resident in the processor.

[Figure 8 panels, plotted for heuristic Types 1, 2, 3, 3' and 4, ICOUNT, and threshold values m=1 to m=5: (a) Aggregate IPC vs. threshold value (average for all combinations); (b) Aggregate IPC vs. policy determination heuristic type (average for all combinations); (c) IPC vs. threshold value for each type; (d) IPC vs. type for each threshold value]

Figure 8. Effect of the threshold value and policy determination heuristic on throughput (average of all mixtures)

7. Summary and Conclusion

This paper has investigated how much improvement can be gained by using an adaptive dynamic thread scheduling approach rather than the fixed scheduling approaches employed in earlier work. It proposed the detector thread approach to implement adaptive scheduling with low hardware and software overhead. The detector thread is a special thread that occupies one designated thread context with minimal extra hardware. It is scheduled for execution when idle slots are available.

To validate the idea, we used the SimpleSMT simulator to derive an upper bound for the performance improvement we can hope to achieve using our approach. SPEC 2000 applications were used to create thirteen different mixes of applications based on single-application performance, memory footprint and type (integer or floating-point).

Simulation results showed that there is still significant room (27%) for performance improvement over fixed scheduling with eight threads, which adaptive scheduling can exploit. This paper stresses that adaptive scheduling is feasible because our platform is SMT, where it is possible to have one thread resident in the processor with minimal overhead.

The results we obtained in this study are greatly encouraging. Since SMT was introduced, studies have shown that having too many threads (usually more than four or five) does not yield the expected throughput increase and sometimes even lowers the throughput. Our study has shown that adaptive thread scheduling in combination with a detector thread can significantly extend the saturation point in terms of number of threads, provided that the detector thread is programmed with effective low-throughput detection and fetch policy selection algorithms.

The software architecture for the detector thread was developed, and various heuristics were evaluated for determining the fetch policy to be used for the next scheduling quantum. Type 3 turned out to work best with a threshold value of 2. Type 4, which keeps track of the outcomes of earlier decisions, turned out not to be worth the effort: there seemed to be no correlation in the time domain among fetch policies, because the interactions between independent threads follow no fixed pattern. Once the job scheduler is put into the picture, the set of applications will change more dynamically, and correlation in the time domain will be even harder to find.

We also found that with a mixture of various applications, less improvement was achieved with ADTS over the fixed scheduling of ICOUNT. That is because such a “good” mixture of applications already maintains high utilization of the various resources available in the processor. Consequently, we may ask the following question: why not just let the job scheduler concentrate on co-scheduling well-balanced sets of applications? Then ICOUNT would work well, and adaptive scheduling could not improve much over it.

Our answer to the question is no, for two reasons. The first is that the job scheduler cannot co-schedule well-balanced sets of applications all the time, especially when the number of jobs available in the system is not significantly larger than the number of hardware contexts of the SMT processor. The second is that, without the detector thread, the job scheduler would have to stay on the processor for a significantly longer duration.

References

[1] T. Austin. The SimpleScalar Architectural Research Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison, June 1997.

[2] J. Burns and J.-L. Gaudiot. Exploring the SMT Fetch Bottleneck. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC99), Orlando, Florida, January 1999.

[3] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 186–195, May 1999.

[4] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, pages 12–18, September/October 1997.

[5] M. Gulati and N. Bagherzadeh. Performance Study of a Multithreaded Superscalar Microprocessor. In Proceedings of the 2nd International Symposium on High Performance Computer Architecture, pages 291–301, February 1996.

[6] L. Gwennap. Dansoft Develops VLIW Design. Microprocessor Report, 11(2):18–22, February 1997.

[7] J. Henning. SPEC CPU2000: Measuring CPU Performance in the New Millennium. IEEE Computer, 33(7):28–35, July 2000.

[8] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 136–145, May 1992.

[9] S. Lee and J.-L. Gaudiot. ALPSS: Architectural Level Power Simulator for Simultaneous Multithreading, Version 1.0. Technical Report TR-02-04, University of Southern California, April 2002.

[10] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, pages 322–354, August 1997.

[11] C. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 40–51, June 2001.

[12] M. McCormick, J. Ledlie, and O. Zaki. Adaptively Scheduling Processes on a Simultaneous Multithreading Processor. Technical report, University of Wisconsin - Madison, 2000.

[13] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-Sensitive Scheduling for SMT Processors. Technical report, University of Washington, 2000.

[14] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th International Symposium on High Performance Computer Architecture, pages 37–48, Monterrey, Mexico, January 2001.

[15] C. Shin, S. Lee, and J.-L. Gaudiot. The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading. In Proceedings of the 1st Workshop on Hardware/Software Support for Parallel and Distributed Scientific and Engineering Computing (SPDSEC-02) in conjunction with the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT-02), September 2002.

[16] U. Sigmund and T. Ungerer. Evaluating a Multithreaded Superscalar Microprocessor versus a Multiprocessor Chip. In Proc. of the 4th PASA Workshop–Parallel Systems and Algorithms, pages 147–159, April 1996.

[17] A. Snavely and D. Tullsen. Symbiotic Jobscheduling for a Simultaneous Multithreading Architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 234–244, Cambridge, Massachusetts, November 2000.

[18] Y. Song and M. Dubois. Assisted Execution. Technical Report CENG 98-25, University of Southern California, 1998.

[19] G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling. In Proceedings of the High Performance Computer Architecture (HPCA’02) Conference, February 2002.

[20] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 191–202, May 1996.

[21] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392–403, June 1995.

[22] H. Wang, P. Wang, R. Weldon, et al. Speculative Precomputation: Exploring the Use of Multithreading for Latency.
