Analyse de performances pour les systèmes intégrés multi-cœurs

(1)

Page | 1

Ben Salma SANA

Laboratoire TIMA SLS

Analyse de performances pour les systèmes

intégrés multi-cœurs

Encadrants:

Frederic Petrot

([email protected])

(2)

Page | 2

1 - Introduction:

Computer processor design has evolved at a constant pace for the last 20 years. The proliferation of computers into the mass market and the tasks we ask of them continue to push the need for more powerful processors. The market requirement for higher

performing processors is linked to the demand for more sophisticated software applications. E-mail, for instance, which is now used globally, was only a limited and expensive technology 10 years ago. Today, software applications span everything from helping large corporations better manage and protect their business-critical data and networks to allowing PCs in the home to edit home videos, manipulate digital photographs, and burn downloaded music to CDs.

Multi-core processors represent a major evolution in computing technology. This

important development is coming at a time when businesses and consumers are beginning to require the benefits offered by these processors due to the exponential growth of digital data and the globalization of the Internet. Multi-core processors will eventually become the pervasive computing model because they offer performance and productivity benefits beyond the capabilities of today’s single-core processors.

2 – Context:

Benchmarks are necessary to experimentally determine the benefits of new designs. The presence of parallel machines is now ubiquitous. Nevertheless, existing benchmarks have a number of limitations and will not be very useful to evaluate future CMPs. The lack of benchmark suites can hamper parallel architecture research and reduce its impact. The “Princeton Application Repository for Shared-Memory Computers” (PARSEC) was created to be useful for the design of next generation of processors. Parsec is benchmark suite for chip multiprocessor that started as cooperation between Intel and Princeton University. It offers new application areas for parallel machines through proliferation of CMPs.

Parsec was used with many processors such as i386, x86_64 and Sparc but not with the ARM processors.

During this semester, my TER project was in the research laboratory TIMA carrying out basic research in the field of Electronics: researches aim at the design, architecture, test, CAD, quality of integrated circuits and by means of data processing and microelectronics. My research project topics were to adapt Parsec benchmarks to ARM architecture throw a Cross Compilation and execute them on ARM.

(3)

Page | 3

3- What is Cross Compilation?

A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is run. Cross compiler tools are used to generate executables for embedded system or multiple platforms. It is used to compile for a

platform upon which it is not feasible to do the compiling, like microcontrollers that don't support an operating system. It has become more common to use this tool for

paravirtualization where a system may have one or more platforms in use.

Before doing the cross compilation I started with compiling and executing Benchmark Parsec on the local machine (Linux Intel) using the Gnu compiler gcc. Then I compiled the code with arm-none-linux-gnueabi-gcc to execute it finally on the ARM machine.

4 – The benchmarks Parsec:

4.1. General Presentation:

The Princeton Application Repository for Shared-Memory Computers is a benchmark suite

composed of multithreaded programs. It differs from other benchmark suites by different characteristics:

Multithreaded: PARSEC uses different ways of parallelization (threads, pipeline…). Diverse: PARSEC does try to explore different application domains such as financial analysis, Computer vision, animation, similarity search… Not HPC-Focused: The PARSEC suite anticipates the development of parallel programs in the domain of High-Performance Computing and tries to avoid a program selection which is skewed towards HPC. It focuses on programs from all domains, such as desktop and server applications.

Research: The suite is primarily intended for research. It can also be used for

performance measurements of real machines, but its original purpose is insight, not numbers. Benchmark suites intended for research usually go beyond pure scoring systems and provide infrastructure to instrument and manipulate included programs in an efficient manner.

(4)

Page | 4

In the section bellow I will give deeper explanation of Parsec application starting with Blackscholes application.

4.2. Parsec applications:

4.2.1. Blackscholes:

The Black-Scholes equation is a differential equation that describes how, under a certain set of assumptions, the value of an option changes as the price of the underlying asset changes. Parallelization: The parallelization algorithms are very simple: they simply price multiple options in parallel using Black-Scholes formula. Each thread prices an individual option. In practice financial houses price 10.s to 100.s of thousands of options using Black-Scholes.

Caractérisation des Blackscholes

4.2.2. Bodytrack :

The bodytrack computer vision application is an Intel RMS workload which tracks a 3D pose of a marker-less human body with multiple cameras through an image sequence. Bodytrack employs an annealed particle filter to track the pose using edges and the foreground silhouette as image features, based on a 10 segment 3D kinematic tree body model. These two image features were chosen because they exhibit a high degree of invariance under a wide range of conditions and because they are easy to extract. An annealed particle filter was employed in order to be able to search high dimensional configuration spaces without having to rely on any assumptions of the tracked body such

(5)

Page | 5

as the existence of markers or constrained movements. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video

surveillance, character animation and computer interfaces.

This work is based upon the following works : -Articulated Body Motion Capture by Stochastic Search Jonathan Deutscher and Ian Reid

http://www.springerlink.com/content/k842275774597530/

-A. Balan, L. Sigal and M. Black. A Quantitative Evaluation of video-based 3D Person Tracking.

http://www.cs.brown.edu/~alb/software.htm

Output of Bodytrack (Frame 1)

Parallelization: The program has three parallel kernels:

Edge detection: bodytrack employs a gradient based edge detection mask to find edges. The result is compared against a threshold to eliminate spurious edges. Edge detection is

implemented in function GradientMagThreshold. The output of this kernel will be further refined before it is used to compute the particle weights.

(6)

Page | 6

Edge smoothing: A separable Gaussian filter of size 7×7 pixels is used to smooth the edges in function GaussianBlur. The result is remapped between 0 and 1 to produce a pixel map in which the value of each pixel is related to its distance from an edge. The kernel has two parallel phases, one to filter image rows and one to filter image columns.

Calculate particle weights: This kernel evaluates the foreground silhouette and the image edges produced earlier to compute the weights for the particles. This kernel is executed once for every annealing layer during every time step, making it the computationally most

intensive part of the body tracker.

The parallel kernels use tickets to distribute the work among threads balance the load dynamically. The ticketing mechanism is implemented in class TicketDispenser and behaves like a shared counter.

(7)

Page | 7

4.2.3. Canneal :

This kernel was developed by Princeton University. It usescache-aware simulated annealing to minimize the routing cost of a chip design. Simulated annealing is a common method to approximate the global optimum in a large search space. Canneal pseudo-randomly picks pairs of elements and tries to swap them.

The program was included in the PARSEC program selection to represent engineering workloads, for the fine-grained parallelism with its lock-free synchronization techniques and due to its pseudo-random worst-case memory access pattern.

Canneal uses a very aggressive synchronization strategy that is based on data race recovery instead of avoidance. Pointers to the elements are dereferenced and swapped atomically, but no locks are held while a potential swap is evaluated. This can cause disadvantageous swaps if one of the relevant elements has been replaced by another thread during that time. This equals a higher effective probability to accept swaps which increase the routing cost, and the simulated annealing method automatically recovers from it. The swap operation employs lock-free synchronization which is implemented with atomic instructions. An alternative implementation which relied on conventional locks turned out to be too inefficient due to excessive locking overhead. The

synchronization routines with the atomic instructions are taken from the BSD kernel. Support for most new architectures can be added easily by copying the correct header file from the BSD kernel sources.

(8)

Page | 8

4.2.4. Dedup:

The dedup kernel was developed by Princeton University. It compresses a data stream with a combination of global compression and local compression in order to achieve high

compression ratios. Such a compression is called ’deduplication’. The reason for the inclusion of this kernel is that deduplication has become a mainstream method to compress storage footprints for new-generation backup storage systems and to compress communication data for newgeneration bandwidth optimized networking appliances. The kernel uses a pipelined programming model to parallelize the compression to mimic real-world implementations. Parallelization:

There are five pipeline stages the intermediate three of which are parallel. In the first stage, dedup reads the input stream and breaks it up into coarse-grained chunks to get independent work units for the threads. The second stage anchors each chunk into fine-grained small segments with rolling fingerprinting. The third pipeline stage computes a hash value for each data segment. The fourth stage compresses each data segment with the Ziv-Lempel algorithm and builds a global hash table that maps hash values to data. The final stage assembles the deduplicated output stream consisting of hash values and compressed data segments.

Dedup uses a separate thread pool for each parallel pipeline stage. Each thread pool should at least have a number of threads equal to the number of available cores to allow the system to fully work on any stage should the need arise.

The operating system scheduler is responsible for a thread schedule which will maximize the overall throughput of the pipeline. In order to avoid lock contention, the number of queues is scaled with the number of threads, with a small group of threads sharing an input and output queue at a time.

(9)

Page | 9

4.2.5. Facesim:

This Intel RMS application was originally developed by Stanford University. It takes a model of a human face and a time sequence of muscle activations and computes a visually realistic animation of the modeled face by simulating the underlying physics. The goal is to create a visually realistic result. Certain effects such as inertial movements would have only a small visible effect and are not simulated.

The workload was included in the benchmark suite because an increasing number of computer games and other forms of animation employ physical simulation to create more realistic virtual environment. Human faces in particular are observed with more attention from users than other details of a virtual world, making their realistic

presentation a key element for animations.

The parallelization uses a static partitioning of the mesh. Data that spans nodes belonging to more than one partition is replicated. Every time step the partitions process all elements that contain at least one node owned by the particle, but only results for nodes which are owned by the partition are written.

(10)

Page | 10

4.2.6. Ferret:

This application is based on the Ferret toolkit which is used for content-based similarity search of feature-rich data such as audio, images, video, 3D shapes and so on. It was developed by Princeton University.

The reason for the inclusion in the benchmark is that it represents emerging next generation desktop and Internet search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search.

Ferret is parallelized using the pipeline model with six stages. The first and the last stage are for input and output. The middle four stages are for query image segmentation, feature extraction, indexing of candidate sets with multi-probe Locality Sensitive Hashing (LSH) and ranking. Each stage has its own thread pool and the basic work unit of the pipeline is a query image.

(11)

Page | 11

4.2.7. Fluidanimate:

This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. Its output can be visualized by detecting and rendering the surface of the fluid. The force density fields are derived directly from the Navier-Stokes equation. Fluidanimate uses special-purpose kernels to increase stability and speed.

Fluidanimate was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for computer games and other forms of real-time animations.

The parallelization algorithm provided is based on spatial partitioning. As mentioned above, particles are sorted spatially into a uniform grid which covers the entire simulation domain. This grid is evenly partitioned along cell boundaries using axis-aligned splitting planes to produce subgrids, one per thread. Since particles which reside in adjacent cells may interact with each other, multiple threads may need to update the cells which lie on sub-grid boundaries.

(12)

Page | 12

Each grid cell has a flag which indicates whether or not it lies on a subgrid boundary. Whenever a particle in a boundary cell needs to be updated, fine-grain locks are used for that update. Locks are not used when updating particles away from boundaries since these updates will never occur concurrently.

Fluidanimate characteristics

4.2.8. Freqmine:

Frequent Itemsets Mining (FIM) is the basis of Association Rule Mining. Association Rule Mining is the process of analyzing a set of transactions to extract association rules.

Association Rule Mining is a very common used and well-studied data mining problem. The

mining is applicable to any sequential and time series data via discretization. Example domains are protein sequences, market data, web logs, text, music, stock market, etc.

To mine Association Rule Mining is converted to mine the frequent itemsets Lk, which

contains the frequent itemsets of length k. Many FIMI (FIM Implementation) algorithms have been proposed in the literature, including FP-growth and Apriori based approaches.

(13)

Page | 13

Researches showed that the FP-growth can get much faster than some old algorithms like the Apriori based approaches except in some cases the FP-tree can be too large to be stored in memory when the database size is so large or the database is too sparse.

Freqmine has been parallelized with OpenMP. It employs three parallel kernels: Build FP-tree header, Construct prefix tree and Mine data.

(14)

Page | 14

4.2.9. Streamcluster

_:

This RMS kernel was developed by Princeton University and solves the online clustering problem: For a stream of input points, it finds a predetermined number of medians so that each point is assigned to its nearest center. The quality of the clustering is measured by the sum of squared distances (SSQ) metric. Stream clustering is a common operation where large amounts or continuously produced data has to be organized under real-time conditions, for example network intrusion detection, pattern recognition and data mining. The program spends most of its time evaluating the gain of opening a new center.

This operation uses a parallelization scheme which employs static partitioning of data points. The program is memory bound for low-dimensional data and becomes increasingly

computationally intensive as the dimensionality increases. Due to its online character the working set size of the algorithm can be chosen independently from the input data.

Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.

(15)

Page | 15

4.2.10. Swaption:

The swaptions application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. The HJM framework describes how interest rates evolve for risk management and asset liability management for a class of models. Its central insight is that there is an explicit relationship between the drift and volatility parameters of the forward-rate dynamics in a noarbitrage market. Because HJM models are non-Markovian the analytic approach of solving the PDE to price a derivative cannot be used. Swaptions therefore employs Monte Carlo (MC) simulation to compute the prices.

The workload was included in the benchmark suite because of the significance of PDEs and the wide use of Monte Carlo simulation.

The program stores the portfolio in the swaptions array. Each entry corresponds to one derivative. Swaptions partitions the array into a number of blocks equal to the number of threads and assigns one block to every thread. Each thread iterates through all swaptions in the work unit it was assigned and calls the function HJM Swaption Blocking for every entry in order to compute the price.

Swaption characteristics

4.2.11. Vips:

This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system.

(16)

Page | 16

The benchmark includes fundamental image operations such as an affine transformation and a convolution. It was chosen because image transformations are a common task on desktop computers and for the ability of the VASARI system to construct multi-threaded image

processing pipelines transparently on the fly. Future libraries might use concepts such as the ones employed by VIPS to make multi-threaded functionality available to the user.

The image transformation pipeline of the vips benchmark has 18 stages. It is implemented in the VIPS operation in benchmark.

(17)

Page | 17

4.2.12. x264:

The x264 application is an H.264/AVC (Advanced Video Coding) video encoder. In the 4th annual video codec comparison it was ranked 2nd best codec for its high encoding quality. It is based on the ITU-T H.264 standard which was completed in May 2003 and which is now also part of ISO/IEC MPEG-4. In that context the standard is also known as MPEG-4 Part 10. H.264 describes the lossy compression of a video stream. It improves over previous video encoding standards with new features such as increased sample bit depth precision, higher-resolution color information, variable block-size motion compensation (VBSMC) or context-adaptive binary arithmetic coding (CABAC). These advancements allow H.264 encoders to achieve a higher output quality with a lower bit-rate at the expense of a significantly increased encoding and decoding time. The flexibility of H.264 allows its use in a wide range of contexts with different requirements, from video conferencing solutions to high-definition (HD) movie distribution. Next generation HD DVD or Blu-ray video players already require H.264/AVC encoding. The flexibility and wide range of application of the H.264 standard and its ubiquity in next generation video systems are the reasons for the inclusion of x264 in the PARSEC benchmark suite.

The parallel algorithm of x264 uses the pipeline model with one stage per input video frame. This results in a virtual pipeline with as many stages as there are input frames.

x264 processes a number of pipeline stages equal to the number of encoder threads in parallel, resulting in a sliding window which moves from the beginning of the pipeline to its end. For P- and B-Frames the encoder requires the image data and motion vectors from the relevant region of the reference frames in order to encode the current frame, and so each stage makes this information available as it is calculated during the encoding process. Fast upward movements can thus cause delays which can limit the achievable speedup of x264 in practice. In order to compensate for this effect, the parallelization model requires that x264 is executed with a number of threads greater than the number of cores to achieve maximum performance.

(18)

Page | 18

X264 characteristics

4.2.13. Raytrace:

Raytracing is a technique that generates a visually realistic image by tracing the path of light through a scene. It leverages the physical property that the path of light is always reversible to reduce the computational requirements by following the light rays from the eye point through the image plane to the source of the light. This way only light rays that contribute to the image are considered.

The raytrace benchmark program uses a variety of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.

For more details about the specifics of benchmarks Parsec:

http://parsec.cs.princeton.edu/doc/parsec-report.pdf

(19)

Page | 19

5. Project Milestones:

In the beginning of my TER I started with reading documentations and trying to understand the goals of the subject. It was the most difficult part of the work. The subject treats new concepts that I didn’t know and which are varied and complex. First the understanding of what is a benchmark and why they are useful then I started to explore the Parsec applications and that was very difficult because they treat different domains and use varied types of parallelization. Trying to understand Parsec took the biggest part of my project and I know that I didn’t really understand the real use of each benchmark it is so complicated and it takes more then a half day per week.

Then I started to discover the cross compilation.

I first began with compiling and executing the code on the local machine (Intel Linux) and add the necessary libraries.

Finally I made the cross compilation of the code using arm-none-linux-gnueabi-gcc and then simulate it on a four core ARM machine. I tried to execute all the benchmarks but I still have some benchmarks that didn’t yet work on the ARM machine.

The last step of the project was to analyze performance of ARM with benchmarks Parsec but I didn’t have time to make this step. Analyze needs a good comprehension of each benchmark and its inputs (the size of benchmarks parsec is about 6 Go).

6. Conclusion:

This project was very useful for me. I learned a lot of new concepts: what is Cross Compilation why we use it and how to cross compile a program for a specific architecture; it also improved my knowledge about scripting. Then I learned about the use of benchmarks and how

important they are for the analysis of the performance of an architecture and how it gives the possibility of comparing the performance of different architecture.

It was also very hard because benchmark Parsec use many technologies in many domains such as financial analysis, media processing, animation, visualization … I find a big difficulty in understanding the use of each benchmark so I spent a lot of time on reading specifications.

References

: