ple, the SGI UV 1000 has 8192 Intel Nehalem hardware threads interconnected via the NUMALink technology. Scalability bottlenecks, however, prevent applications from benefiting from the large number of threads on existing and emerging shared-memory architectures. Given the importance of scalability, it is necessary to identify and eliminate scalability bottlenecks in multithreaded applications. It is difficult, however, for programmers to conduct such analysis and apply appropriate optimizations, for three principal reasons. First, applications can be complex: a typical HPC application or industrial workload usually consists of hundreds of thousands of lines of code, hiding scalability bottlenecks deep in the code. Second, modern parallel architectures have sophisticated microarchitectures, integrating many hardware threads and multiple levels of memory; identifying which hardware component incurs a scalability bottleneck is challenging. Finally, program execution can exhibit complicated behaviors, such as interactions between threads as well as interactions between user code and the operating system, so understanding abnormal behavior in a long-running parallel program is hard. Given these three complexities, programmers need performance tools that automatically identify scalability bottlenecks and provide insightful guidance for optimization.
OpenMP is a shared-memory Application Programming Interface (API) whose aim is to ease shared-memory parallel programming. The OpenMP multithreading interface is designed to support HPC programs and is portable across shared-memory architectures. OpenMP is implemented as a combination of a set of pragmas (compiler directives) and a runtime providing both management of the thread pool and a set of library routines. These directives instruct the compiler to create threads, perform synchronization operations, and manage shared memory. OpenMP therefore requires specialized compiler support to understand and process the directives. At present, an increasing number of OpenMP implementations for Fortran, C, and C++ are freely available.
grams on shared-memory multiprocessors. It simulates the memory references made by the processors in an SMP node for a specified cache configuration, using traces obtained through on-the-fly dynamic binary rewriting. Figure 1.2 represents the entire framework. METRIC generates compressed trace files by dynamically instrumenting the memory references of each OpenMP thread in the executing application binary. These trace files serve as input to driver threads representing the processors in the system; each driver thread performs the simulation for its corresponding trace in parallel. A shared bus serves as the common interconnect, and a MESI bus-based protocol maintains cache coherence. Execution is simulated by implementing the OpenMP semantics, and detailed statistics for the execution are obtained: hits, misses, temporal and spatial locality, eviction-related information and, most significantly, coherence-related metrics. The important contribution lies in the simulation of coherence traffic, which helps to isolate the causes of invalidations and coherence misses that increase program latency. A notable feature is the simulator's ability to derive cumulative as well as per-reference statistics, which supports in-depth analysis of application behavior on the platform of interest. Causes of bottlenecks can be accurately determined and used to propose optimization techniques that avoid the detected problems.
This is a very real case of practical DFFT parallel computation. In this example we examine implementing the binary-exchange algorithm to compute an s-point DFFT on a hypercube with p processors, where p < s. Assume that both s and p are powers of two. Following Figure 10, we partition the sequence into blocks of s/p contiguous elements and assign one block to each processor. Assume that the hypercube is d-dimensional (p = 2^d) and s = 2^r. Figure 10 shows that elements with indices differing in their d (= 2) most significant bits are mapped onto different processors, whereas all elements with indices having the same r - d most significant bits are mapped onto the same processor. Hence, this parallel DFFT algorithm performs inter-process communication only during the first d = log p of the log s iterations; there is no communication during the remaining r - d iterations.
For a sequence of length n, the total space required to store the matrices is Θ(nm^2). The shared-memory algorithms were run with different numbers of processors (p), so depending on this parameter and on the memory storage capacity of the system, the time complexity will vary in a parallel environment [20, 24]. If p processors are used during execution of the parallel program, then the total time complexity is reduced approximately by a factor of p^2. This is because each processor works with n/p rows/columns of the matrices, so the multiplication of two matrices completes in (total time required)/p^2. The time complexity of the parallel program can therefore be expressed as,
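The formula itself is cut off here; the expression implied by the preceding sentence (a reconstruction, not the original equation) would be:

```latex
T_{\text{parallel}} \approx \frac{T_{\text{sequential}}}{p^{2}}
```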
Program development on a shared-memory architecture is easily achieved by transforming a sequential algorithm into a parallel one: one simply identifies areas of code that are suitable to run in parallel, i.e. in which few dependencies exist between iterations and different iterations access different data. Subsequently, local and shared variables are declared and parallel compiler directives are inserted. In contrast to distributed-memory programming, there is no need to handle overlapping data regions explicitly.
To take advantage of the processing power of chip multiprocessors, applications must be divided into semi-independent processes that can run concurrently on multiple cores within a system. Programmers must therefore insert thread synchronization primitives (i.e. locks, barriers, and condition variables) to synchronize data access between threads. Indeed, threads can spend a long time waiting to acquire the lock of a critical section. In addition, a processor has to stall execution while waiting for load accesses to complete. Furthermore, there are often independent instructions, including loads beyond the synchronization point, that could be executed in parallel while a thread waits on a synchronization primitive. The convenience of cache memories comes with some extra cost in chip multiprocessors: cache coherence mechanisms address the memory consistency problem, but they add considerable overhead to memory accesses. An aggressive prefetcher on each core of a chip multiprocessor can also lead to significant system performance degradation when running multithreaded applications. This is the result of prefetch-demand interference: when a prefetcher in one core pulls shared data from a producing core before it has been written, the cache block ends up transitioning back and forth between the cores, resulting in useless prefetches that saturate the memory bandwidth and substantially increase the latency to critical shared data.
Calculate the pageviews that your site currently undergoes based on current trends. From those figures, set up a JRun, LoadRunner, etc. script to simulate your users' usage patterns against your site. Ensure that the load test is a comprehensive test that covers the actual pattern of usage and not just one page. During the load test, record statistics on each of your different components. The components that you should monitor are: database (CPU, memory, shared memory, cursors), application server (CPU, memory, JVM heap space, file
the MTA is similar to and predates the MAOs of the SGI Origin 2000 and Cray T3E. Active messages are an efficient way to organize parallel applications [5, 32]. An active message includes the address of a user-level handler to be executed by the receiving processor upon message arrival, using the message body as the arguments. Active messages can be used to perform atomic operations on a synchronization variable's home node, which eliminates the need to shuttle the data back and forth across the network or to pay long latencies for remote invalidations. However, performing the synchronization operations on the node's primary processor rather than on dedicated synchronization hardware entails higher invocation latency and interferes with useful work being performed on that processor. In particular, the load imbalance induced by having a particular node handle synchronization traffic can severely impact performance due to Amdahl's Law effects. The I-structure and explicit token-store (ETS) mechanism supported by the early dataflow project Monsoon [4, 24] can be used to implement synchronization operations in a manner similar to active messages. A token comprises a value, a pointer to the instruction to execute (IP), and a pointer to an activation frame (FP). The instruction at the IP specifies an opcode (e.g., ADD), the offset in the activation frame where the match will take place (e.g., FP+3), and one or more destination instructions that will receive the result of the operation (e.g., IP+1). If synchronization operations are to be implemented on an ETS machine, software needs to manage a fixed node to handle the tokens and wake up stalled threads. QOLB [10, 15] by Goodman et al. serializes synchronization requests through a distributed queue supported by hardware. The hardware queue mechanism greatly
The school started with an introduction to shared-memory parallelism and multicore technology by UPCRC Co-Director Marc Snir. UPCRC Principal Investigator Danny Dig presented parallelism techniques for object-oriented languages and how refactoring can be applied to transform sequential applications into concurrent ones. Clay Breshears and Paul Peterson from Intel illustrated how OpenMP and Threading Building Blocks can be used for parallelization. In addition, they introduced cutting-edge developer tools by Intel: the Intel Parallel Inspector, the Intel Parallel Amplifier and the Intel Parallel Advisor. Phil Pennington and James Rapp from Microsoft presented the C++ Concurrency Runtime and the .NET Task Parallel Library (TPL). María Garzarán gave an overview of vectorization and described various techniques to apply it. UPCRC Illinois Co-Director and PI for the world's first NVIDIA CUDA Center of Excellence Wen-mei W. Hwu introduced OpenCL. John E. Stone of the Illinois Beckman Institute illustrated CUDA's utility with an Electrostatic Potential Maps application. Marc Snir concluded the school with his Taxonomy of Parallel Programming Models. As a special final event, we visited the Petascale Computing Facility at Illinois, which will house the Blue Waters sustained-petaflop supercomputer.
A GPU can manipulate matrices very efficiently. Here this is first used to show how to speed up the calculation of the cover set of propositional hypotheses. We adopt the terminology used in Description Logic and talk about concepts, defined as sets over a universe of individuals, in the same way in which unary predicates can be defined for a domain consisting of a set of terms. We can represent a propositional data set as a Boolean 2D matrix M, with each column corresponding to a concept and each row to an individual (see Figure 5). The membership of an individual in a concept (i.e. whether the unary predicate is true when it takes this term as its argument) is represented as a Boolean value in the matrix. To make it possible to process a large amount of data (of the order of gigabytes), this array is stored in the global memory of the GPU.
Abstract: Parallel principles are the most effective way to increase the performance of parallel computers, and of parallel algorithms (PA) too. The parallel use of multiple computing nodes (processors, cores), which have to cooperate with each other in solving complex problems in a parallel way, has opened the pressing problem of modeling communication complexity, both in symmetric multiprocessors (SMP) based on a single motherboard and in other asynchronous parallel computers (computer networks, clusters, etc.). In the currently dominant parallel computers based on NOW and Grid (a network of NOW networks), it is necessary to model communication latency, because it can be dominant when using massively parallel computers (more than 100 processors). In this sense, the paper is devoted to modeling communication complexity in parallel computing (parallel computers and algorithms). The paper first describes very briefly the various communication topologies and networks in use, and then summarizes the basic concepts for modeling communication complexity and latency. To illustrate the analyzed modeling concepts, the experimental part of the paper considers results for real analyzed examples of an abstract square matrix and its possible decomposition models. We chose these illustrative examples first because of the wide application of matrices in scientific and engineering fields, and second because they are typical exemplary representatives of many other PAs.
We compare the performance of DPLAL with another distributed-memory parallel implementation of the Louvain method, given in Charith et al. For a network with 500,000 nodes, Charith et al. achieved a maximum speedup of 6, whereas with DPLAL we get a speedup of 12 for a network with 317,080 nodes using 800 processors. The largest network they processed has 8M nodes and achieved a speedup of 4; our largest network achieves a comparable speedup (4-fold with 1M nodes). The work in  did not report runtime results, so we could not compare our runtime with theirs directly. Their work reported scalability to only 16 processors, whereas our algorithm is able to scale to almost a thousand processors.
The shared-memory component of HPC-GAP, which we refer to in this paper as GAP5, reimplements the latest version of GAP (GAP4) to include support for parallelism at a number of levels. This has a number of implications for the language design and implementation. Distributed-memory implementations of memory-hungry symbolic computations can duplicate large data structures and may even lead to memory exhaustion. For example, the standard GAP library already requires several hundred MB of memory per process, much of which is data stored in read-only tables; a multithreaded implementation can provide access to that data to multiple threads concurrently without replicating it for each thread. This is why it is crucial to use shared memory to achieve efficient fine-grained parallelisation of GAP applications. Secondly, we need to maintain consistency with the previous version of the GAP system, refactoring it to make it thread-safe. Achieving this has involved rewriting significant amounts of legacy code to eliminate common but unsafe practices, such as using global variables to store parameters, having mutable data structures that appear to be immutable to the programmer, and so on.
The application is implemented in the C programming language using the POSIX thread library on the Linux platform. The POSIX thread library (IEEE POSIX 1003.1c standard) is a standards-based thread API for C/C++. Pthreads is a shared-memory programming model in which parallelism takes the form of parallel function invocations by threads that can access shared global data. Pthreads defines a set of C programming language types, functions, and constants, and allows one to spawn a new concurrent flow of control. It is most effective on multiprocessor or multicore systems, where the flow can be scheduled to run on another processor, gaining speed through parallel or distributed processing.
This paper discusses three techniques used for parallelizing the ELLPACK software package for solving partial differential equations: an explicit approach using compiler directives available on a particular target machine, an automatic approach using an optimizing and parallelizing compiler, and a two-level approach using a set of low-level computational kernels. Results are reported for a Sequent Symmetry S81 with 1-16 processors, and general implications are noted for porting mathematical software to shared-memory machines. The authors conclude that identifying and parallelizing the low-level kernel routines provides the best performance results.
The default message creation in FairMQ hides all the memory allocation and management details from the user and simply provides a ready-to-use buffer for every message. However, in a few use cases the user might have very specific requirements for the memory layout, typically where hardware needs to write to that memory (e.g. detector readout hardware). Ideally, this memory should still be usable with no or minimal copying by further devices in the pipeline. For such cases FairMQ provides the UnmanagedRegion component, which allocates memory via the transport allocator and hands it to the user to manage. Messages can then be created out of subsets of this region. The framework performs no additional management for the region, except destroying it entirely when the region object goes out of scope. With the shared-memory implementation, messages created from this region can be given to devices on the same node without any copy of the data. As an additional argument in the region creation, one can provide a callback that will be called once the last user of the message on the node no longer needs the buffer (see Figure 2); this can then be used to clean up and reuse that part of the region's memory. For all devices except the region-creating device, region messages appear as regular messages, and no special care has to be taken in the user code to handle them, which keeps the usage simple.
As in the LBM, care must be taken to avoid common shared-memory problems such as race conditions and thread contention. To circumvent the problems associated with using locks, the SPH data can be structured to remove the possibility of thread contention altogether. By storing both the present and previous values of the SPH field variables (e.g. position, velocity, etc.), the necessary gradient terms can be calculated as functions of the values in previous memory, while updates are written to the current-value memory. This reduces the number of synchronizations per time step from two (if the gradient terms are calculated before synchronizing, followed by the update of the field variables) to one, and a rolling-memory algorithm switches the indices of the previous and current data with successive time steps.
Ray tracing is not the only viable method for large model visualization. The first system to render the Boeing dataset interactively, Boeing's FlyThru application, did not use ray tracing but instead relied on fast hardware from SGI and used model simplification [ABM96]. Correa et al. demonstrated an interactive out-of-core rendering system based on visibility preprocessing and prefetching that allowed a user to control the approximation error [CKS03]. The UNC GAMMA group has repeatedly demonstrated the feasibility of large-scale model visualization using view-dependent culling and mesh simplification [WVBSGM02, YSGM04]. The floating-point performance of the GPU has also been used to interactively render the Boeing 777 dataset at impressive frame rates using precomputation of visibility and LOD techniques [GM05].