System Configuration and Programming Considerations for High Performance Embedded Systems on Multicore x86/64-Based Systems

Daron Underwood, CTO, IntervalZero, Inc.

400 Fifth Avenue, Fourth Floor

Waltham, MA 02451

Introduction

It is no secret that more and more companies are looking for ways to take advantage of the powerful, low-cost General Purpose Processors (GPP) available today in order to reduce the cost of product design, development and manufacturing, and to shorten their time-to-market.

Although these processors are very powerful, developers must have a basic understanding of the hardware architecture to squeeze out every last bit of performance and use these systems for extremely high-performance products.

This paper will focus on the areas where the maximum benefit can be realized to optimize the determinism of these multicore systems. The main areas of focus are:

• System configuration, hardware, firmware, and software

• Multicore/multithreading programming

• Memory, I/O and cache

The performance increases and scalability of SMP are immediately attractive to anyone developing embedded systems. On one end, you have those who want to take advantage of the ability to create systems that are seemingly segregated functionally, yet tightly integrated. On the other are those who need extremely high performance. Then there is everything in between. As attractive as all these design possibilities are with the new multicore systems, they do come at a price.

That price is the need to manage the resource usage of these robust processors, whose mechanisms are implemented to increase performance for normal general-purpose use, whether in consumer or server systems. In many cases, the functional requirements of embedded systems, whether real-time/deterministic or tightly bounded high performance, are adversely affected by these very mechanisms. Although this may seem to be a show stopper, there is a silver lining.

With some effort to understand these mechanisms as well as the software and configuration techniques available to manipulate them to one’s advantage, the pay-off can be tremendous in terms of performance and scalability.

There was a common opinion that the software developer’s free lunch went away when multicore, lower-frequency processor design was chosen as the way to move processor technology forward. This, of course, refers to the fact that an unchanged program could benefit from the previous practice of increasing processor frequencies in new single-core designs. The thinking was that the free lunch was now going to cost money, because work, in the form of software redesign, was needed to gain performance.

In some regards I think that was true, but only for software that existed during the transition. If you put the right effort into making your new designs multicore-aware and design the software to be scalable, I would argue that the free lunch still exists, and maybe even more so. Maybe it’s time to come up with a new term for the era of multicore. Perhaps we do have to pay a little now, but for that small price, you get an unlimited dinner buffet.

It is time for embedded systems to truly be in a position to take advantage of Moore’s Law, by developing systems that can readily scale to performance doubling roughly every 18 months.

System Configuration

It is extremely important to be able to configure a system to optimize it for the specific embedded application. One of the biggest black-boxes on a PC system is the hardware configuration which is controlled for the most part by the Basic Input Output System (BIOS). It is critical to select a system that allows for the most granular configuration of processor and chipset features from BIOS setup. Without this level of control, it can sometimes be a daunting task to qualify that a system is capable of providing the level of performance and determinism required by the embedded application.

There are many features of a system that can have significant effects on the performance of an embedded application. The system features with the most potential impact on embedded application performance fall into four functional areas:

• Legacy Device Support

• CPU Configuration

• Power Management

• Memory Configuration

Legacy Device Support

One area that has been seen to inject some unavoidable, although usually small, System Management Interrupt (SMI) activity in today’s systems is support for legacy devices. This is typically the use of a USB mouse and keyboard at boot time. On some systems, with this support enabled, some processor interruptions can be attributed to the SMIs required to handle these devices. This can cause significant variation in the otherwise near-constant execution time of certain code paths.

It is recommended to disable legacy USB support in BIOS. The drawback is that on some systems the keyboard will not be recognized during boot, so breaking into BIOS setup is no longer possible once this is disabled. If needed, a PS/2 keyboard can be used to break in. However, on many newer systems, the BIOS appears to handle the keyboard correctly even when legacy support is disabled.

CPU Configuration

Hyper-threading is specific to Intel in the x86/x64 architecture. More generically known as Simultaneous Multithreading (SMT), it provides a second hardware thread that can execute on a processor core. The goal of SMT is to keep the core busy more of the time by executing a second thread when the initial thread is waiting on some resource (known as a stall) to continue execution. This should effectively give you some performance increase; however, it is not that straightforward, as the threads share many processor components such as L1 and L2 cache, execution units, etc. Due to this sharing, there is a level of jitter introduced in the execution that may be too high in terms of determinism for some embedded systems.

It is recommended that, for any application where processors with hyper-threading are being considered, the developer qualify that the system can operate within the required time bounds of the real-time application.

Also, note that current versions of the 32-bit Windows operating system do not allow selection of which processors the OS executes on. This can be problematic for a Windows real-time extension, as it can only operate on the cores that Windows was instructed to ignore. Windows enumerates all physical cores first, followed by the HT cores. Given a system with a quad-core hyper-threaded processor, if Windows is directed to use only 4 cores, it will select the primary hardware thread from each core and enumerate them as processors 0-3. This would leave the 4 hyper-threads for the real-time extension. Since this could lead to an increase in cache conflicts, the performance of the real-time applications can be significantly impacted. It is recommended that hyper-threading be disabled, or that Windows be configured to use fewer cores, ensuring that at least one primary thread is available to the real-time system.

The figures below show examples of how cores would be enumerated on a Windows 7 32-bit OS running on a Quad-core hyper-threaded system.

Below, the red indicates RTSS cores. The gray indicates L1/L2 cache sharing, where a core and its hyper-thread are owned by different subsystems. This configuration would potentially have a higher incidence of cache collisions on the 4th and 5th cores due to sharing of the caches with Windows processes running on core 0 and core 1.

From the diagrams that follow, you can see that this method is employed even when adding additional processor sockets to the system. This has the potential for even more impact, as there are more cores/threads that could have shared caches.

64-bit Windows enumerates cores differently. Although specific processors currently cannot be selected, 64-bit Windows enumerates the 2 threads of a hyper-threaded core sequentially. This enumeration method gives the system designer much more flexibility in how processors are divided between Windows and the real-time extension, since it avoids many of the cache conflicts associated with sharing the threads of a single core, as noted above for the 32-bit Windows enumeration method. The key difference is that the shared resource is pushed out to the much larger L3 cache, which is further away physically and logically from the cores. The benefit is that the potential for Windows activities to dirty the more critical L1/L2 caches is significantly reduced. The figures below illustrate the 64-bit Windows enumeration method. Note the second diagram, which now shows no sharing of the L1/L2 caches with a 2+6 configuration on a system with 8 logical cores.

Power Management Configuration

There are a few power management features that can really impact the consistent performance level of today’s systems. This section focuses only on the Intel versions of these features. However, in many cases they have counterparts in processors from other manufacturers.

SpeedStep®

SpeedStep was one of the first technologies introduced by Intel® to attempt to reduce the amount of power consumed by processors when there was not enough work to keep them busy. SpeedStep effectively controls both the processor voltage and frequency in a dynamic way to reduce unneeded power and heat.

This power saving is great and very effective for typical consumer and even server systems; however, it can wreak havoc on an embedded system. Consider the case where a thread performing a specific function has been optimized to execute with a worst-case time of 100 microseconds (µs). Put another way, the system designer is 100% certain that the execution of that thread’s code path will always take less than 100µs. Using a typical safety margin of, say, 20%, the system should be capable of running this thread every 120µs without ever overrunning that cycle time. This is by definition very deterministic and exactly the type of performance that embedded systems developers require.

However, using the above example, if SpeedStep or similar technologies are enabled, the execution of the thread may incur drastic variances in execution time as a result of the processor’s voltage and frequency being dynamically ratcheted up and down. Due to this, it is recommended that embedded systems disable the SpeedStep setting in BIOS if possible.
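To make the deadline arithmetic above concrete, the following is a minimal sketch in C++. It assumes, as a simplification, that execution time scales inversely with clock frequency, which is not strictly true for memory-bound code. The function names and the clock frequencies in the usage note are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cycle time that leaves the given safety margin on top of the worst-case
// execution time (WCET). The margin is rounded up to a whole microsecond.
std::uint64_t cycle_with_margin_us(std::uint64_t wcet_us,
                                   std::uint64_t margin_percent) {
    return wcet_us + (wcet_us * margin_percent + 99) / 100;
}

// Estimated WCET at a different clock frequency, rounded up.
// Assumption: execution time is inversely proportional to frequency.
std::uint64_t scaled_wcet_us(std::uint64_t wcet_us,
                             std::uint64_t measured_mhz,
                             std::uint64_t current_mhz) {
    assert(current_mhz > 0);
    return (wcet_us * measured_mhz + current_mhz - 1) / current_mhz;
}

// True when the thread would run past the end of its cycle.
bool overruns_cycle(std::uint64_t wcet_us, std::uint64_t cycle_us) {
    return wcet_us > cycle_us;
}
```

With a 100µs worst case measured at a hypothetical 3000 MHz and a 20% margin, cycle_with_margin_us(100, 20) gives the 120µs cycle from the example. If the clock were stepped down to 2400 MHz, scaled_wcet_us(100, 3000, 2400) estimates 125µs, which already overruns that cycle.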

Turbo Boost Technology

Turbo Boost is a technology that allows processors to operate more efficiently by dynamically increasing the processor frequency when the operating system requests the highest available power state. Like SpeedStep, Turbo Boost causes the frequency to be essentially unknown and non-deterministic. However, unlike SpeedStep, an increase in frequency tends not to have a negative effect on thread execution times in the scenario above. In fact, the thread would typically execute in less time than average and therefore not pose an overrun risk. That said, it is recommended that the developer understand the base operating frequency of the processor and design to it, so that Turbo Boost does not affect the processing and performance of the embedded application.

C-States

CPU states, or C-States for short, are another mechanism designed to reduce the consumption of power in a system. Unlike the power states (P-States) that are used to control the clock frequency and voltage levels, C-States are used to efficiently describe the functional components of the CPU that can be turned off, when not in use. If it is not on, it is not drawing power and generating excess heat.

There are many C-States which can vary by processor and vendor. C0 is the special state assigned when the processor is fully turned on. C-States are grouped typically by what they turn off. C1 through C3 essentially cut the clock signal to the CPU or some of its internal components, while C4-C6 typically work by cutting the voltage levels. Then there are hybrid states that define a combination of both.

The drawback to C-States in embedded systems comes down to one question: is the function of the CPU that is needed at any given time available immediately for use? With C-States enabled, the answer is a big MAYBE. This is because, as clock signals and/or voltage levels are cut or reduced to the CPU and its internal components, it takes time to ‘wake’ these components up for use, and that time can vary significantly. It is difficult, if not impossible, to design a high-performing, highly deterministic embedded system under those conditions. Thankfully, the use of C-States can be disabled, typically in BIOS, or even with some suggestions to the operating system. Either way, it is recommended that these power saving states be disabled, or at a minimum reduced in use, so that the system can run as predictably as possible.

Memory Configuration

There are generally two memory architectures that can be configured on today’s systems. These are typically referred to as Symmetric Multiprocessing (SMP), or Uniform Memory Access (UMA), and Non-Uniform Memory Architecture (NUMA). The memory architecture can typically be selected from BIOS. If this option is not available, it is most likely that the SMP architecture is employed. SMP memory architecture simply means that all memory is equally addressable from any core and there is no concept of locality. The SMP memory configuration can be thought of as truly global in nature.

The NUMA architecture, on the other hand, is logically a hybrid of local and global memory. This means that all memory is globally accessible; however, memory does have locality associated with processing cores. NUMA memory can easily be thought of as another level of cache beyond L3 that makes use of the physical placement of system memory (RAM) to allow faster access to memory that is closer (more local) to a given processor/core.

Memory, I/O and Cache

This is probably the most technical section of this paper, as it is extremely important to understand how these hardware subsystems affect the performance of executing code. The use and tuning of these subsystems can make the difference between an embedded system that is highly deterministic and one that is not.

Everything done on the computer requires memory of some kind, whether it is RAM, I/O ports, device memory, etc. The problem with memory, as everyone is aware, is that accessing it is significantly slower than the speed at which processors execute instructions.

Since this gap has not been overcome, even on today’s advanced systems, processor vendors have developed multi-level memory hierarchies that include fast memory close to the processor, known as cache. The problem with cache is that it is expensive, small and fixed in capacity, and it needs to be on the same die as the processor to minimize the physical distance, which is required for the fastest possible access time. This, along with improvements in RAM speed, size and configuration (UMA/NUMA), is helping embedded system designers optimize these commodity systems for their needs. The cost, however, is the requirement to know how this memory hierarchy functions.

Let’s take a look at memory from the bottom up. That is, we’ll start with the memory that is closest to the processor and work out from there.

Cache Architecture

In this section, Level 1/2/3 is used interchangeably with L1/L2/L3.

It is critical that system designers and developers understand the cache architecture of the processors they are working with. Knowing this can make the difference not only in delivering a product on schedule, but in ending up with a product that meets or even exceeds its design requirements. A great paper that goes into depth about memory architecture is “What Every Programmer Should Know About Memory” by Ulrich Drepper at Red Hat.

As shown earlier, most x86/64 systems today have a 3-level hierarchical cache architecture.

The Level 1 cache is the smallest and closest to the processor and is split into an instruction cache and a data cache of equal size. The Level 2 cache is per processor core like L1, but is a unified cache containing both instructions and data. L2 is typically larger than L1. On hyper-threaded cores, the L1 and L2 caches are shared by both hardware threads. Finally there is the Level 3 cache, which is also referred to as smart cache. This cache is considerably larger than L1 and L2, but like L2 it is a unified cache. L3 is a processor-shared cache, meaning all cores/hardware threads in a processor package have equal use of it. The graphics from the CPU Configuration section above depict the logical hierarchy of the cache as described here.

Not to over-simplify things too much, but when the processor requires data to execute an instruction, that data is read and moved into L1 cache. When that data is no longer needed and other data is required for the processor to continue execution, the old data is evicted to the next level of cache away from the processor. This eviction can continue until the data exists only in system memory (RAM). If this data is required again, a lengthy fetch of the data from system memory must be performed. However, if the data was never evicted in the first place, access to it is quite fast as the cached version is still valid.
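The effect of eviction on repeated accesses can be illustrated with a deliberately simplified model. The sketch below simulates a single-level, direct-mapped cache in C++; real x86/64 caches are multi-level and set-associative, so this is only an illustration of the principle, not a model of any particular processor. The ToyCache class, the count_misses helper and the sizes used are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marker for a slot that does not hold any line yet.
const std::uint64_t kEmptySlot = ~static_cast<std::uint64_t>(0);

// Toy direct-mapped cache: each memory line can live in exactly one slot.
class ToyCache {
public:
    ToyCache(std::size_t slots, std::size_t line_bytes)
        : line_bytes_(line_bytes), tags_(slots, kEmptySlot) {
        assert(slots > 0 && line_bytes > 0);
    }

    // Returns true on a hit. On a miss the line is loaded into its slot,
    // evicting whatever line was there before.
    bool access(std::uint64_t address) {
        const std::uint64_t line = address / line_bytes_;
        const std::size_t slot = static_cast<std::size_t>(line % tags_.size());
        if (tags_[slot] == line) return true;
        tags_[slot] = line;
        return false;
    }

private:
    std::size_t line_bytes_;
    std::vector<std::uint64_t> tags_;
};

// Walks a buffer of `bytes` bytes one line at a time, `passes` times,
// and returns how many accesses missed the cache.
std::size_t count_misses(ToyCache& cache, std::size_t bytes,
                         std::size_t line_bytes, int passes) {
    std::size_t misses = 0;
    for (int p = 0; p < passes; ++p)
        for (std::uint64_t addr = 0; addr < bytes; addr += line_bytes)
            if (!cache.access(addr)) ++misses;
    return misses;
}
```

With a 4-line cache of 64-byte lines, walking a 256-byte buffer twice misses only on the first pass, because the data is never evicted. Walking a 512-byte buffer twice misses on every access, because each line is evicted before it is needed again.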

Understanding the cache architecture, and the logic it uses to keep data valid across multiple cores, is the best way to reliably predict where data will be when you need it, and so to ensure deterministic execution of real-time threads. Without this knowledge, at least at an operational level, you are simply allowing the system, and not you, to determine the way your applications execute. Embedded system design is about understanding the hardware and software at a level that allows as complete control of the system as possible, and on today’s systems it really does start with the cache.

Multicore/Multithreaded Programming

The biggest issue with programming on multicore systems is the need for data concurrency across all the cores. We have found that a significant portion of latency is in the allocation and use of memory and the impact that cache mechanisms have on performance.

The good news is there are design rules and techniques you can use both to minimize these types of performance hits and to tune a system for optimal performance. Intel documents these in the “Intel(R) 64 and IA-32 Architectures Optimization Reference Manual”. It is recommended that this manual be used in determining the best practices for developing critical areas of software where specific performance requirements must be met. Given that this document is over 700 pages, we highlight here recommended techniques that improve the very types of thread execution latency associated with false sharing and out-of-order processing.

False Sharing

The main culprit of latency found by a study of latency issues reported by IntervalZero customers was the sharing of modified data and false sharing between threads executing on different processors. Based on the Intel optimization manual noted above, memory sharing under these conditions should be designed carefully to avoid these types of performance penalties. This has all come about as modern CPUs have become much faster than modern memory systems. This disparity in speed, more than two orders of magnitude, has resulted in the multi-megabyte caches found on modern CPUs.

As the Manual describes in section “8.4.5 Prevent Sharing of Modified Data and False-Sharing”:

“…on an Intel Core Duo processor or a processor based on Intel Core microarchitecture, sharing of modified data incurs a performance penalty when a thread running on one core tries to read or write data that is currently present in modified state in the first level cache of the other core. This will cause eviction of the modified cache line back into memory and reading it into the first-level cache of the other core. The latency of such cache line transfer is much higher than using data in the immediate first level cache or second level cache.”

In the process of application design for high performance/deterministic functionality, cache impact must be considered.

1. The first decision is whether sharing memory between threads is really required.

2. If required, the second decision is whether the data in the shared memory should be protected.

3. If protection is required, use synchronization objects to protect the shared memory.

4. If deterministic latency of accessing the shared memory region is required, threads on different cores should avoid modifying data within the same cache line or sector.
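The last rule can be sketched in C++ by giving each thread its own cache-line-sized slot. This is a minimal illustration, not the layout used in the samples that follow: the 64-byte line size is an assumption that should be checked against the target processor, and all names (kCacheLineBytes, kThreads, PackedCounter, PaddedCounter, same_cache_line, counter_for_thread) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumed line size
constexpr std::size_t kThreads = 4;          // hypothetical number of worker threads

// Packed layout: adjacent 4-byte counters share one cache line.
struct PackedCounter {
    std::uint32_t value;
};

// Padded layout: alignas rounds the size of each element up to a full line.
struct alignas(kCacheLineBytes) PaddedCounter {
    std::uint32_t value;
};

static_assert(sizeof(PaddedCounter) == kCacheLineBytes,
              "each padded counter should occupy exactly one cache line");

alignas(kCacheLineBytes) PackedCounter g_packed[kThreads];
PaddedCounter g_padded[kThreads];

// True when both addresses fall inside the same cache line.
bool same_cache_line(const void* a, const void* b) {
    const auto ua = reinterpret_cast<std::uintptr_t>(a);
    const auto ub = reinterpret_cast<std::uintptr_t>(b);
    return ua / kCacheLineBytes == ub / kCacheLineBytes;
}

// Each thread writes only through its own padded slot.
std::uint32_t& counter_for_thread(std::size_t index) {
    assert(index < kThreads);
    return g_padded[index].value;
}
```

In the packed array, a write by one thread invalidates the line that holds its neighbours’ counters. In the padded array, no two elements share a line, at the cost of 60 unused bytes per counter.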

Following is a code sample that demonstrates this jitter. The jitter is due to the sharing of adjacent memory that is modified by threads on other processors. The writing of the adjacent data causes cache eviction that requires a trip to memory to reload the cache for subsequent reads.

Sample 1 is a code snippet showing how the latency of accessing the shared memory region is affected by the system cache. The sample creates multiple threads and assigns one thread per core.

Each thread repeatedly reads 4 bytes of data from one offset of the shared memory region and writes to another offset of the same region. The sample measures the time each pass of this loop takes. The problem is that each thread may modify data within a cache line used by other threads. As a result, latency may differ from one measurement to the next.

Sample 1

// Excerpt from the full test program. SAMPLES_PER_TIMER, NUM_INPUTS,
// NUM_OUTPUTS, g_usedCores, g_numDataPointsPerCycle, affinity, freq,
// StoreData and OutputData are defined elsewhere.

// Array types, so that INPUTS* and OUTPUTS* can point at a whole sample table.
typedef ULONG INPUTS[SAMPLES_PER_TIMER][NUM_INPUTS];
typedef ULONG OUTPUTS[SAMPLES_PER_TIMER][NUM_OUTPUTS];

// One shared region: every thread reads and writes these same arrays.
static ULONG Inputs[SAMPLES_PER_TIMER][NUM_INPUTS];
static ULONG Outputs[SAMPLES_PER_TIMER][NUM_OUTPUTS];

typedef struct _MEMORY {
    INPUTS* volatile Inputs;
    OUTPUTS* volatile Outputs;
    HANDLE hTimer;
    ULONG loopCount;
    ULONG data[1];
} MEMORY, *PMEMORY;

VOID RTFCNDCL InnerLoop(PVOID pArgument);

void _cdecl wmain(int argc, wchar_t **argv, wchar_t **envp)
{
    ULONG i;

    for (i = 0; i < g_usedCores; i++) {
        PMEMORY pMemory;   // per-thread block; its allocation is not shown in this excerpt

        pMemory->Inputs = &Inputs;
        pMemory->Outputs = &Outputs;

        // Set up a thread affined to each core
        HANDLE hThread = CreateThread(NULL, 0, InnerLoop, pMemory, CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(hThread, affinity);
        SetThreadPriority(hThread, RT_PRIORITY_MAX);
        ResumeThread(hThread);
        CloseHandle(hThread);
    }
}

VOID RTFCNDCL InnerLoop(PVOID pArgument)
{
    PMEMORY pMemory = (PMEMORY)pArgument;
    LARGE_INTEGER start = {0};
    LARGE_INTEGER stop = {0};
    LARGE_INTEGER diff = {0};
    ULONG i = g_numDataPointsPerCycle;
    ULONG max = 0;
    ULONG min = (ULONG)-1;
    ULONG curuS = 0;

    // Work loop variables
    int sampleCount, OutCount, InCount;

    for (; i > 0; i--) {
        // Read the time-stamp counter into start
        _asm {
            lea ebx, start
            rdtsc
            mov [ebx], eax
            mov [ebx+4], edx
        }

        // Do work.
        for (sampleCount = 0; sampleCount < SAMPLES_PER_TIMER; sampleCount++) {
            for (OutCount = 0; OutCount < NUM_OUTPUTS; OutCount++) {
                (*pMemory->Outputs)[sampleCount][OutCount] = 0;
                for (InCount = 0; InCount < NUM_INPUTS; InCount++) {
                    (*pMemory->Outputs)[sampleCount][OutCount]
                        += (*pMemory->Inputs)[sampleCount][InCount];
                }
            }
        }

        // Read the time-stamp counter into stop
        _asm {
            rdtsc
            lea ebx, stop
            mov [ebx], eax
            mov [ebx+4], edx
        }

        // Calculate the elapsed time in microseconds and track the extremes
        diff.QuadPart = stop.QuadPart - start.QuadPart;
        curuS = (ULONG)((diff.QuadPart * 1000000) / freq.QuadPart);
        if (curuS > max) max = curuS;
        if (curuS < min) min = curuS;
    }

    if (StoreData(pMemory, curuS, min, max)) {
        // Shouldn't return from here.
        OutputData(pMemory);
    }
}

Below is a sample of the timing results from running Sample 1 (all values in µs). It clearly shows a wide range of jitter due to these cache evictions of shared memory.

Thread:27  Cur: 402  Min: 401  Max: 432
Thread:27  Cur: 403  Min: 401  Max: 440
Thread:27  Cur: 402  Min: 401  Max: 428
Thread:27  Cur: 403  Min: 401  Max: 428
Thread:27  Cur: 404  Min: 401  Max: 439
Thread:27  Cur: 403  Min: 401  Max: 431
Thread:27  Cur: 401  Min: 401  Max: 447
Thread:27  Cur: 404  Min: 401  Max: 614
Thread:27  Cur: 513  Min: 401  Max: 649
Thread:27  Cur: 403  Min: 401  Max: 665

The above jitter issue is nearly eliminated by using separate memory, as in Sample 2, shown below.

Unlike Sample 1, where one memory area is allocated in a single call and a pointer to it is passed to each thread, Sample 2 allocates memory separately for each thread/core. These separate allocations assure that the memory blocks are on page boundaries and as such are far enough away from each other (physically in memory) that there will be no cache line sharing.

Sample 2

void _cdecl wmain(int argc, wchar_t **argv, wchar_t **envp)
{
    ULONG i;

    for (i = 0; i < g_usedCores; i++) {
        PMEMORY pMemory;   // per-thread block; its allocation is not shown in this excerpt

        // Give each thread its own input and output arrays
        pMemory->Inputs = (INPUTS*)malloc(sizeof(ULONG) * SAMPLES_PER_TIMER * NUM_INPUTS);
        pMemory->Outputs = (OUTPUTS*)malloc(sizeof(ULONG) * SAMPLES_PER_TIMER * NUM_OUTPUTS);

        // Set up a thread affined to each core
        HANDLE hThread = CreateThread(NULL, 0, InnerLoop, pMemory, CREATE_SUSPENDED, NULL);
        SetThreadAffinityMask(hThread, affinity);
        SetThreadPriority(hThread, RT_PRIORITY_MAX);
        ResumeThread(hThread);
        CloseHandle(hThread);
    }
}

Below is a sample of the timing results from the execution of Sample 2. This clearly shows a significant improvement in the amount of jitter due to separate memory allocations to assure non-adjacency of data.

Here the jitter is within 1µs using the same inner loop code. The only difference is the way memory was allocated.

Thread:61 Cur: 402 Min: 401 Max: 402 Thread:61 Cur: 402 Min: 401 Max: 402 Thread:61 Cur: 402 Min: 401 Max: 402 Thread:61 Cur: 401 Min: 401 Max: 402 Thread:61 Cur: 401 Min: 401 Max: 402 Thread:61 Cur: 401 Min: 401 Max: 402 Thread:61 Cur: 401 Min: 401 Max: 402 Thread:61 Cur: 401 Min: 401 Max: 402 Thread:61 Cur: 402 Min: 401 Max: 402 Thread:61 Cur: 402 Min: 401 Max: 402
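Sample 2 relies on the allocator placing each block far enough from the others. Where that behavior cannot be assumed, the separation can be made explicit. The following C++ sketch over-allocates and then aligns each buffer to a cache-line boundary using the standard std::align function. The 64-byte line size is an assumption, and LineAlignedBuffer and allocate_line_aligned are hypothetical names that are not part of the samples above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Owns an over-sized raw allocation and exposes the aligned part of it.
struct LineAlignedBuffer {
    std::unique_ptr<unsigned char[]> storage;  // raw allocation
    void* data = nullptr;                      // first cache-line boundary inside it
    std::size_t bytes = 0;                     // usable size, a multiple of kLineBytes
};

// Returns a buffer that starts on a cache-line boundary and whose size is
// rounded up to whole lines, so it never shares a line with another buffer.
LineAlignedBuffer allocate_line_aligned(std::size_t bytes) {
    LineAlignedBuffer buf;
    buf.bytes = (bytes + kLineBytes - 1) / kLineBytes * kLineBytes;

    // Reserve one extra line so an aligned start always fits.
    std::size_t space = buf.bytes + kLineBytes;
    buf.storage.reset(new unsigned char[space]);

    void* p = buf.storage.get();
    buf.data = std::align(kLineBytes, buf.bytes, p, space);
    assert(buf.data != nullptr);
    return buf;
}
```

Each thread would receive its own LineAlignedBuffer for inputs and another for outputs. Because every buffer begins on a line boundary and spans whole lines, a write by one thread cannot invalidate a line that holds another thread’s data.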

Although there are many areas for optimization, as the “Intel(R) 64 and IA-32 Architectures Optimization Reference Manual” describes, we are confident that the sharing of modified data and false sharing are the biggest contributors to the latencies that have been reported to IntervalZero.

Out of Order Execution

Another area of unexpected latency is introduced by the sophisticated and complex mechanism employed in today’s modern processors known as out-of-order execution. This feature is yet another mechanism added to reduce processor stalls from the mismatch of processing speed and memory access.

Reordering can occur in two places: in the hardware and in the compiler. You will potentially need to address one or both of these to make sure you are controlling the execution of your threads in a predictable way.

Memory fences can be used to address hardware out-of-order execution. A memory fence forces all reads and/or writes preceding the fence to complete before the instructions following the fence are executed. Which operations are forced is determined by the type of fence used: a read fence, a write fence, or a full fence. This can be important because out-of-order execution may cause cache evictions to occur more frequently, depending on the nature of the memory allocation in the program. An example of this is the code snippet below:

Sample 3

// Initial conditions:
int x = 0, y = 0;

// Thread A, started first:
while (x == 0)
    ;   // Spin until x is non-0.
std::cout << y;

// Thread B, started later:
y = 42;
x = 1;

Assume that thread A and thread B run on two different cores. Under certain conditions this code can output “0”. How can this be? The 42 was still buffered in the other core’s write buffer, so the write to x was seen by thread A before the write to y, even though no instruction reordering happened!

Also, cache write buffers are not flushed by the inter-processor cache coherency logic. To enforce correct operation, a write fence or a full fence should be placed between the two writes in thread B, and a read fence or a full fence after the while loop in thread A, before y is read.
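As one possible way to express this ordering, Sample 3 can be rewritten with the C++11 atomics library in place of explicit fence instructions. This is a sketch of a substituted technique, not the fence code the text describes: a release store on x orders the earlier write to y before it, and an acquire load on x orders the later read of y after it. The function names publish and wait_and_read are hypothetical.

```cpp
#include <atomic>
#include <cassert>

int y = 0;              // ordinary data written by thread B
std::atomic<int> x{0};  // flag that tells thread A the data is ready

// Thread B: write the data, then raise the flag.
void publish() {
    assert(x.load(std::memory_order_relaxed) == 0);  // intended to run once
    y = 42;
    // Release: the write to y becomes visible before the write to x.
    x.store(1, std::memory_order_release);
}

// Thread A: spin until the flag is raised, then read the data.
int wait_and_read() {
    // Acquire: the read of y below cannot be moved ahead of this load.
    while (x.load(std::memory_order_acquire) == 0) {
        // Spin until x is non-0.
    }
    return y;
}
```

With thread A calling wait_and_read() and thread B calling publish(), the value returned to thread A is 42. The atomic operations also stop the compiler from reordering the two writes or hoisting the read of x out of the loop.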

In the case of compiler reordering, each compiler has its own way of indicating that operations should not be optimized and thereby reordered. You should consult the compiler documentation. Keep in mind that the compiler cannot prevent runtime processor instruction reordering.

Code Branching

Although it seems common sense, code branching should be avoided in code where high performance and determinism are required. A branch can cause instruction cache misses, and a pipeline flush when it is mispredicted, on both the initial branch and the return, which can add a considerable amount of latency that is difficult to quantify. Because of this, code branching under these conditions should be avoided or at least minimized as much as possible.
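One way to reduce branching in a hot loop is to replace small conditional updates with arithmetic. The C++ sketch below does this for the kind of min/max tracking performed in the timing loop of Sample 1. The function names are hypothetical, and whether this is faster than a plain if statement is an assumption that must be measured, since many compilers already emit branch-free code for simple comparisons.

```cpp
#include <cassert>
#include <cstdint>

// Returns the larger of a and b without a conditional jump.
// mask is all ones when a < b and zero otherwise.
inline std::uint32_t branchless_max(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    return a ^ ((a ^ b) & mask);
}

// Returns the smaller of a and b without a conditional jump.
inline std::uint32_t branchless_min(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(b < a);
    return a ^ ((a ^ b) & mask);
}

// Folds one timing sample into the running minimum and maximum.
inline void track_extremes(std::uint32_t sample,
                           std::uint32_t* min, std::uint32_t* max) {
    assert(min != nullptr && max != nullptr);
    *min = branchless_min(*min, sample);
    *max = branchless_max(*max, sample);
}
```

A caller would start with min set to the largest representable value and max set to 0, then call track_extremes once per measurement.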

Conclusion

This article was designed to shine a spotlight on specific areas of multicore system design and the use of these systems as platforms for embedded systems. The areas of system configuration, memory architecture, and programming were addressed, but this is not a definitive list.

The information presented here should be used as a primer by embedded systems engineers to start thinking systematically about how today’s systems are architected and how off-the-shelf designs can be used to develop some of today’s highest-performing, deterministic applications. These range from machine tool control to digital media processing to cutting-edge medical instruments that require a rich interactive user interface together with the power, performance and scalability that only a multicore RTOS platform can provide.

So, take this information, use it as a starting point of understanding, and go build the platforms of tomorrow starting with the foundations of the platforms being built today.