high-performance parallel computing

Top PDF high-performance parallel computing:

High Performance Parallel Computing in Residue Number System

However, computations in RNS require a number of specific operations, without which it is impossible to represent numbers in RNS [3]. Figure 1 shows the general model of computation in RNS: it includes the conversion to RNS and the conversion from RNS back to positional notation. In addition, the class of non-modular operations can be represented as a separate computational structure; these operations require special approaches associated with estimating the result. Computations for each digit can be done independently and in parallel, so the highest performance is achieved for algorithms that are based on addition and multiplication.
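
The excerpt describes the model only in prose; the minimal Python sketch below illustrates the same three stages with a small, hypothetical set of pairwise coprime moduli (our own choice, purely for illustration). Forward conversion and the per-digit modular operations are the parallel-friendly part, while the reverse conversion is one of the non-modular operations that needs special treatment.

```python
# Illustrative sketch of the RNS computation model: forward conversion,
# independent per-digit arithmetic, and reverse conversion via the
# Chinese Remainder Theorem. Moduli are an arbitrary example.
from math import prod

MODULI = (7, 11, 13, 17)          # pairwise coprime moduli (example choice)
M = prod(MODULI)                  # dynamic range of the system

def to_rns(x):
    """Forward conversion: positional notation -> RNS digits."""
    return tuple(x % m for m in MODULI)

def rns_op(a, b, op):
    """Modular add/mul: each digit is processed independently,
    so the digits could be handled in parallel channels."""
    return tuple(op(ai, bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(digits):
    """Reverse conversion (a non-modular operation) via CRT."""
    x = 0
    for r, m in zip(digits, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return x % M

a, b = to_rns(1234), to_rns(567)
assert from_rns(rns_op(a, b, lambda x, y: x + y)) == (1234 + 567) % M
assert from_rns(rns_op(a, b, lambda x, y: x * y)) == (1234 * 567) % M
```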

Domain Decomposition Based High Performance Parallel Computing

The solution of the non-linear system of equations involves two iterative loops: the outer Newton iteration (outer loop) and the inner additive Schwarz domain decomposition iteration (inner loop), which solves the linear system of equations (Eq. 9) simultaneously on different processors. Each subdomain problem in the inner loop is solved by a direct solver. Since the left-hand matrix is not updated during the inner iterations, the re-solve facility of direct solvers can be used to skip the factorization phase (i.e. only the solve phase is invoked). The inner iterations continue until the error norm is within tolerance. In the outer Newton loop the Jacobian is updated; hence the factorization phase is invoked in each subdomain during the first inner iteration of every outer iteration. In the case of modified Newton, the Jacobian is not updated during the outer iterations, so the subdomains need not invoke the factorization phase during the first inner iteration of each outer iteration. In summary, for the modified Newton algorithm each subdomain performs the LU factorization only once, and all subsequent calls to the solver invoke only the solve phase. For the Newton algorithm, each subdomain invokes the LU factorization during the first inner iteration of every outer iteration. The performance of both Newton and modified Newton is examined in this paper. All the simulations are carried out on the Ohio Supercomputing cluster “Glenn”, a cluster of AMD Opteron multi-core, 2.6 GHz, 8 GB RAM machines.
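
The two-loop structure lends itself to a compact sketch. The Python code below is our own illustrative toy problem (a small 1D nonlinear system solved with NumPy/SciPy), not the paper's implementation: the outer loop is a Newton iteration, and the inner loop is a damped additive Schwarz iteration in which each subdomain block is LU-factorized once and only the solve phase is repeated. For the modified Newton variant described above, the factorization would be done once outside the outer loop and reused throughout.

```python
# Sketch of the outer Newton / inner additive Schwarz structure (illustrative only).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 40
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)       # 1D model operator
b = np.ones(n)
F = lambda u: A @ u + u + u**3 - b                          # nonlinear residual
J = lambda u: A + np.diag(1 + 3 * u**2)                     # Jacobian

# Two overlapping subdomains (index sets) standing in for the parallel ranks.
subs = [np.arange(0, 25), np.arange(15, 40)]

def schwarz_solve(Jmat, rhs, n_inner=80, damping=0.5):
    """Inner loop: damped additive Schwarz. The factorization phase runs once
    per subdomain; the inner iterations invoke only the solve phase."""
    factors = [lu_factor(Jmat[np.ix_(s, s)]) for s in subs]  # factorization phase
    x = np.zeros_like(rhs)
    for _ in range(n_inner):
        r = rhs - Jmat @ x                                   # global residual
        if np.linalg.norm(r) <= 1e-10 * np.linalg.norm(rhs):
            break                                            # inner tolerance reached
        for s, f in zip(subs, factors):
            x[s] += damping * lu_solve(f, r[s])              # solve phase only
    return x

u = np.zeros(n)
for k in range(20):                      # outer Newton loop
    res = F(u)
    if np.linalg.norm(res) < 1e-8:
        break
    u += schwarz_solve(J(u), -res)       # Jacobian (re)factorized every outer step
print(f"Newton steps: {k}, residual norm: {np.linalg.norm(F(u)):.2e}")
```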

High Performance Parallel Computing with Clouds and Cloud Technologies

Abstract. Infrastructure services (Infrastructure-as-a-Service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data- and compute-intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most “pleasingly parallel” applications can be implemented fairly easily using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad. However, many scientific applications with complex communication patterns still require the low-latency communication mechanisms and rich set of communication constructs offered by runtimes such as MPI. In this paper, we first discuss large-scale data analysis using different MapReduce implementations and then present a performance analysis of high-performance parallel applications on virtualized resources.

High Performance Parallel Computing with Clouds and Cloud Technologies

• Cloud technologies work for most pleasingly parallel applications
• Runtimes such as MapReduce++ extend MapReduce to the iterative MapReduce domain
• MPI applications experience moderate[r]

High Performance Parallel Computing with Clouds and Cloud Technologies

time for computation and communication inside the applications. Therefore, all the I/O operations performed by the applications are network-dependent. From figure 19 (right), it is clear that Dom0 needs to handle 8 event channels when there are 8 VM instances deployed on a single bare-metal node. Although the 8 MPI processes run on a single bare-metal node, since they are in different virtualized resources, each of them can communicate only via Dom0. This explains the higher overhead in our results for the 8-VMs-per-node configuration. The architecture reveals another important feature as well: in the case of the 1-VM-per-node configuration, when multiple processes (MPI or other) that run in the same VM communicate with each other via the network, all the communications must be scheduled by Dom0. This results in higher latencies. We could verify this by running the above tests with LAM MPI (a predecessor of OpenMPI, which does not have improved support for in-node communication on multi-core nodes). Our results indicate that, with LAM MPI, the worst performance for all the tests occurred when 1 VM per node is used. For example, figure 19 shows the performance of Kmeans clustering under the bare-metal, 1-VM, and 8-VMs-per-node configurations. This observation suggests that, when using VMs with multiple CPUs allocated to each of them for parallel processing, it is better to utilize parallel runtimes that have better support for in-node communication.

AN EVOLUTIONARY APPROACH TO PARALLEL COMPUTING USING GPU

For applications such as large-scale data processing, it is useful to reduce execution time and obtain results as fast as possible compared with serial execution. In the last few years, the programmable graphics processing unit has evolved into a true high-performance computing device. Shared-memory programming with OpenMP is based on an Application Program Interface (API) jointly defined by a group of major computer hardware and software vendors; OpenMP provides a portable, scalable model for developers of shared-memory parallel applications.
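
OpenMP itself targets C, C++ and Fortran, so the sketch below is only a rough Python analogue of the same shared-memory "parallel for" idea: the iteration space of a data-processing loop is split across a small pool of workers and the serial and parallel timings are compared. All names and sizes are illustrative.

```python
# Toy comparison of serial vs. pool-parallel execution of a data-processing loop.
import time
from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    # stand-in for a data-processing kernel
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    data = range(5_000_000)
    chunks = [range(s, 5_000_000, 4) for s in range(4)]    # 4-way strided split

    t0 = time.perf_counter()
    serial = work(data)
    t1 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as pool:       # analogous to a parallel for
        parallel = sum(pool.map(work, chunks))
    t2 = time.perf_counter()

    assert serial == parallel
    print(f"serial {t1 - t0:.2f}s  parallel {t2 - t1:.2f}s")
```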

Parallel Computing: High Performance

There are two critical forces shaping software development today. One is the popular adoption of parallel computing and the other is the trend toward Service Oriented Architecture. Both ideas have existed for quite a while, but the current technologies of CMT (Chip Multi-Threading) processor designs, horizontally scaled systems, near-zero-latency interconnects and new web service standards are all accelerating both ideas into the mainstream, where they are being adopted everywhere. It is quite easy to predict that most desktop machines or even laptops will be powered by multi-core or CMT processors over the next few years [4]. A more intriguing and important issue is whether the current state of software development is sufficient to produce good-quality parallel applications for the new computing machines.

ACCOMPLISHING HIGH PERFORMANCE DISTRIBUTED SYSTEM BY THE IMPLEMENTATION OF CLOUD, CLUSTER AND GRID COMPUTING

interconnected networks from external networking. Although clustering provides a considerable gain in computing power, it certainly has drawbacks and uncertainties as a comparatively new technology. Distributed computing addresses a wider sphere than clustering by permitting the nodes to be located all over the world and to be multi-purpose machines. Distributed computing follows a notion analogous to clustering, allowing many nodes to work on large problems in parallel after breaking them into smaller units. Work units are distributed several times to many nodes, curbing the probability of processing lapses and accounting for processing done on slow CPUs. The client supervises the data retrieval and submission phases, along with the code needed to tell the CPU how to process the work unit.

HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTING: OPTIMIZING MAPREDUCE

branch predictors guess which branch a conditional jump will go to and speculatively execute the corresponding instructions [66]. For distributed systems where communication overhead is substantial, task duplication redundantly executes some tasks on which other tasks critically depend [23]. Task duplication thus mitigates the penalty of data communication by running the same task on multiple nodes. Speculative execution in MapReduce employs a similar strategy but is mainly used for fault tolerance. It is implemented in Hadoop to cope with situations where some tasks in a job become laggards compared with others. The assumption is that the execution times of map tasks do not differ much, which makes it possible for Hadoop to predict task execution time without any prior knowledge. When Hadoop detects that a task runs longer than expected, it starts a duplicate task to process the same data. Whenever any copy of the task completes, the other duplicates are killed. This can improve fault tolerance and mitigate performance degradation. However, the performance gain is obtained at the cost of duplicate processing and more resource usage. In addition, speculative execution does not help at all when long run times are caused by the nature of the map operations themselves, because the duplicate tasks cannot shorten the run time either. Our work is complementary to task speculation in that task splitting and task duplication can be combined to deal with long-running tasks resulting from either the nature of map operations or system failure. Moreover, there has been some research on heterogeneity in MapReduce. A MapReduce implementation for the .NET platform was presented in [77].
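
As a toy illustration of the speculative-execution idea (this is not Hadoop code, and the timings are invented), the sketch below starts a duplicate of a straggling task and accepts whichever copy finishes first, cancelling the rest:

```python
# Toy speculative execution: duplicate a slow task and keep the first result.
import random, time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def map_task(task_id, slow=False):
    time.sleep(3.0 if slow else random.uniform(0.1, 0.3))   # laggard vs. normal attempt
    return f"task {task_id} done"

with ThreadPoolExecutor(max_workers=4) as pool:
    primary = pool.submit(map_task, 7, slow=True)            # straggling attempt
    time.sleep(0.5)                                           # scheduler notices the lag...
    backup = pool.submit(map_task, 7)                         # ...and starts a duplicate
    done, pending = wait({primary, backup}, return_when=FIRST_COMPLETED)
    result = done.pop().result()
    for f in pending:
        f.cancel()        # kill the remaining duplicate (may already be running)
    print(result)
```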

Scalable Deep Analytics on Cloud and High Performance Computing Environments

• We stress that the communication structure of data analytics is very different from classic parallel algorithms as one uses large collective operations (reductions or broadcasts) rather than the many small messages familiar from parallel particle dynamics and partial differential equation solvers.
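
A minimal mpi4py sketch of this collective-heavy pattern is shown below (script name and data are hypothetical): each rank holds a large partition of the data, and the only communication per step is one big reduction plus one broadcast, instead of many small point-to-point messages.

```python
# Collective-heavy data analytics pattern: one allreduce and one broadcast per step.
# Run e.g. with: mpiexec -n 4 python collectives.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.random.rand(1_000_000)          # each rank's partition of the data
partial = np.array([local.sum(), local.size], dtype="d")
total = np.empty_like(partial)

comm.Allreduce(partial, total, op=MPI.SUM) # one large collective reduction
mean = total[0] / total[1]

centroids = np.linspace(0, 1, 10) if rank == 0 else np.zeros(10)
comm.Bcast(centroids, root=0)              # broadcast of a model/centroid update

if rank == 0:
    print("global mean:", mean)
```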

d2o: a distributed data object for parallel high-performance computing in Python

DistArray [7] is very mature and powerful. Its approach is very similar to d2o: it mimics the interface of a multi-dimensional numpy array while distributing the data among the nodes of a cluster. However, DistArray involves a design decision that makes it inapt for our purposes: it has a strict client-worker architecture. DistArray either needs an ipython ipcluster [11] as back end or must be run with two or more MPI processes. The former must be started before an interactive ipython session is launched; this at least complicates the workflow in the prototyping phase and at worst is not practical for batch-system-based computing on a cluster. The latter forces tool developers who build on top of DistArray to demand that their code always be run in parallel. Both scenarios conflict with our goal of minimal second-order dependencies and maximal flexibility, cf. "Aim" section. Nevertheless, its theme also applies to d2o: “Think globally, act locally”.
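
A hedged sketch of the “think globally, act locally” idea is shown below (this is not d2o's or DistArray's actual API): every MPI rank knows the global shape, owns only its local slab of the array, applies numpy operations locally, and uses a collective only when a global quantity is needed.

```python
# Illustrative slab distribution of a global array across MPI ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

global_shape = (16, 8)                            # global array, known to every rank
rows = np.array_split(np.arange(global_shape[0]), size)[rank]
local = np.zeros((len(rows), global_shape[1]))    # this rank's local slab

local[:] = rows[:, None] * 1.0                    # act locally: fill/transform the slab
global_sum = comm.allreduce(local.sum(), op=MPI.SUM)   # think globally when needed

if rank == 0:
    print("global sum:", global_sum)
```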

High Performance Cloud Computing

The idea of using clouds for scientific applications has been around for several years, but it has not gained traction, primarily due to issues such as lower network bandwidth or poor and unstable performance. Scientific applications often rely on access to large legacy data sets and pre-tuned application software libraries. These applications today run in HPC environments with low-latency interconnects and rely on parallel file systems. They often require high-performance systems that have high I/O and network bandwidth. Using commercial clouds gives scientists the opportunity to use larger resources on demand. However, there is uncertainty about the capability and performance of clouds for running scientific applications because of their different nature. Clouds have a heterogeneous infrastructure compared with homogeneous high-end computing systems (e.g. supercomputers). The design goal of clouds was to provide shared resources to multiple tenants and to optimize cost and efficiency. On the other hand, supercomputers are designed to optimize performance and minimize latency [1].

Improving Power and Performance Efficiency in Parallel and Distributed Computing Systems

In Chapter 4, we determine the minimum energy consumption in voltage and frequency scaling systems for a given time delay. As we mentioned earlier, the benefit of DVFS varies with the workload characteristics of code regions; therefore, it is difficult to evaluate the effectiveness of a DVFS scaling algorithm. Our work establishes the optimal baseline of DVFS scheduling for any application. Given the baseline, one can better evaluate a specific DVFS technique. We assume we have a set of discrete places where scaling can occur. A brute-force solution is intractable even for a moderately sized set (although all programs presented can be solved with the brute-force approach). Our approach efficiently chooses the exact optimal schedule satisfying the given time constraint by estimation. We evaluate our time and energy estimates on the NPB serial benchmark suite. The results show that the running time of the scheduling algorithm can be reduced significantly compared with brute force. Our time and energy estimates for the optimal schedule are highly accurate, within 1.48% of the actual values. Chapter 5 describes an internal power meter that provides accurate, fine-grained measurements. The above projects use the entire system power consumption to quantify effectiveness. However, the power consumed by each component within a system actually varies during program execution, and external power measurements do not provide information about how the individual components utilize power. The fine-grained monitoring of the prototype is evaluated and compared with the accuracy and access latency of an external power meter. The results show that we can measure power consumption more accurately and more frequently (about 50 measurements per second) with low power-monitoring overhead (about 0.44 W). When combined with an external power meter, we can also derive the power supply efficiency and the hard disk power consumption.
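
The scheduling problem can be stated compactly. The sketch below is a toy, brute-force version with invented per-region time/energy numbers, not the dissertation's estimation-based algorithm; it only illustrates the objective: pick one frequency per code region so that the total time meets the deadline and the total energy is minimal.

```python
# Toy DVFS schedule selection: minimize energy subject to a time constraint.
from itertools import product

# (time_seconds, energy_joules) per region at each available frequency (made-up numbers)
regions = [
    {2.4e9: (1.0, 30.0), 1.8e9: (1.3, 24.0), 1.2e9: (1.9, 21.0)},   # compute-bound region
    {2.4e9: (2.0, 50.0), 1.8e9: (2.1, 38.0), 1.2e9: (2.3, 29.0)},   # memory-bound region
    {2.4e9: (0.5, 15.0), 1.8e9: (0.6, 12.0), 1.2e9: (0.8, 10.0)},
]
deadline = 4.6   # allowed time delay (seconds)

best = None
for choice in product(*(r.keys() for r in regions)):        # brute force: |freqs|^|regions|
    t = sum(regions[i][f][0] for i, f in enumerate(choice))
    e = sum(regions[i][f][1] for i, f in enumerate(choice))
    if t <= deadline and (best is None or e < best[0]):
        best = (e, t, choice)

energy, elapsed, freqs = best
print(f"energy {energy} J, time {elapsed} s, frequencies {[f / 1e9 for f in freqs]} GHz")
```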

Performance prediction and its use in parallel and distributed computing systems

The computing architectural landscape is changing. Resource pools that were once large, multi-processor supercomputing systems are being increasingly replaced by heterogeneous commodity PCs and complex powerful servers. These new architectural solutions, including the Internet computing model [20] and the grid computing [13, 18] paradigm, aim to create integrated computational and collaborative environments that provide technology and infrastructure support for the efficient use of remote high-end computing platforms. The success of such architectures rests on the outcome of a number of important research areas; one of these – performance – is fundamental, as the uptake of these approaches relies on their ability to provide a steady and reliable source of capacity and capability computing power, particularly if they are to become the computing platforms of choice.

How Amdahl’s Law limits the performance of large artificial neural networks

Simulating the inherently massively parallel brain utilizing inherently sequential conventional computing systems is a real challenge. According to recent studies [4, 36], both the purely SW simulation and the specially designed (but SPA processor-based) HW simulation currently show very similar performance limitations and saturation. The present paper interpreted why even the special-purpose HW simulator cannot match the performance of the human brain in real time. It was explained that the reason is the operating principle itself (or, more precisely, the computing paradigm plus its technical implementation together). Based on experience with the rigorously controlled database of supercomputer performance data, the performance of large artificial neural networks was placed on the map of performance of high-performance computers. The conclusion is that processor-based brain simulators using the present computing paradigms and technology surely cannot simulate the whole brain (i.e., study processes like plasticity, learning, and development), and especially not in real time.
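
Amdahl's Law, which the title refers to, makes the saturation argument concrete: with serial fraction s, the speedup on N processors is S(N) = 1 / (s + (1 - s)/N) and can never exceed 1/s. A short numeric illustration (the serial fraction is an arbitrary example, not a measured value):

```python
# Amdahl's Law: speedup saturates at 1/s regardless of processor count.
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

for n in (1, 10, 100, 10_000, 1_000_000):
    print(f"N = {n:>9,}  speedup = {amdahl_speedup(0.001, n):8.1f}")   # s = 0.1%
# Even with only a 0.1% sequential part, the speedup can never exceed 1000,
# no matter how many processors are added.
```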

High Performance Computing Meets Grid and Cloud Computing In Hybrid Computing

Each of the three major computing paradigms has its strengths and weaknesses. The motivation for hybrid computing is to combine all three paradigms so that strengths are maintained or even enhanced, and weaknesses are reduced. The strengths and weaknesses of each paradigm are well known and much studied. It is not our purpose here to provide a comprehensive evaluation of each individual approach. However, even a high-level summary of the attributes of each paradigm helps identify areas where combining approaches shows promise. Fig. 1 summarizes some key attributes of the HPC, Grid and Cloud approaches. We note that no single paradigm is the best solution from all points of view. For example, there are important differences in the three paradigms with respect to the ‘‘capacity vs. capability’’ distinction. Capability resources are needed for demanding applications, such as tightly coupled, highly parallel codes or large shared-memory applications. Capacity resources are well suited for throughput-intensive applications, which consist of many loosely coupled processes. A capacity resource is typically an installation with commodity components; a capability resource is an HPC installation with special hardware, such as low-latency interconnects, storage-area networks, many-core nodes, or nodes with hundreds of gigabytes of main memory. The traditional, owner-centric HPC paradigm excels in handling capability workloads in a well-managed, secure environment. However, capacity is fixed in this domain, and there is typically weak support for virtualization and resource sharing. Strengths of Grid computing include access to additional HPC capacity and capability, which also promotes better utilization of resources. Grid computing enables exposing heterogeneous resources with a unified interface, thereby allowing users to access multiple resources in a uniform manner. However, Grids have limited interoperability between different Grid software stacks.

High Performance Computing Clusters

A computer cluster is a group of interconnected computers that are connected so as to form a single computer. Interconnections between computers in a cluster are made through local area networks. Computing problems are solved by using high-performance computing (HPC), which is an amalgamation of supercomputers and computing clusters. HPC combines systems administration and parallel programming into a blend of computer architecture, system software, programming languages, algorithms and computational techniques. This paper describes the mechanism required for the creation of a 96-node single cluster.

Development of Operational Technology for Meteorological High Performance Computing

High-performance computer system construction, resource management, and technology development will better support the development of meteorological numerical model software, business operations and scientific research. In response to new trends in technology development, we have added applications for new technologies such as large-scale many-core and GPU computing. We must pay attention to training cross-cutting parallel computing talent, promote the migration of the meteorological business model to the new parallel technology architecture platform, and improve parallel scalability. We also need to coordinate the layout, construction and management of high-performance computing resources in the meteorological department, and gradually reduce small-scale, geographically dispersed systems. In order to meet the needs of numerical weather and climate forecasting model business operations and scientific research, we will build a new generation of domestic high-performance computer systems, alleviate the shortage of computing resources, and support the business and research work of numerical weather forecasting, climate prediction and climate change.

Identifying friction stir welding process parameters through coupled numerical and experimental analysis

Taking advantage of high-performance cluster parallel computing and Python scripting in the commercial finite element software ABAQUS, the finite element method was coupled with a genetic algorithm optimization to obtain the best values for the thermal input (heat from a moving heat source simulating friction stir welding) and the thermal film coefficient (between the workpiece and the support plate). By using the parameters predicted from one set of experimental results, the temperature distributions at other points are predicted and found to be in good agreement with the experimental results. The heat input predicted is also similar to that obtained in Ref. [4], in which a general inverse method is used. The optimization procedure presented in this paper performs the parameter identification automatically and could be extended to include the complex features of the welding tool. As the temperature history plays a very important part in the microstructure of welded zones, this
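
A hedged sketch of this coupling is shown below: a small genetic algorithm searches for the heat input and film coefficient that minimize the mismatch with measured temperatures. The expensive ABAQUS finite element run is replaced here by a cheap stand-in function, and all numbers are invented.

```python
# Toy GA-driven parameter identification: heat input and film coefficient.
import random

measured = [450.0, 380.0, 310.0]                 # temperatures at three probe points (made up)

def simulate(heat_input, film_coeff):
    """Placeholder for the FEM thermal model (ABAQUS in the paper)."""
    return [heat_input / (1.0 + film_coeff * d) for d in (1.0, 2.0, 3.0)]

def fitness(ind):
    q, h = ind
    return -sum((p - m) ** 2 for p, m in zip(simulate(q, h), measured))

pop = [(random.uniform(100, 2000), random.uniform(0.01, 2.0)) for _ in range(40)]
for gen in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                                               # selection
    children = []
    while len(children) < 30:
        (q1, h1), (q2, h2) = random.sample(parents, 2)
        q, h = (q1 + q2) / 2, (h1 + h2) / 2                          # crossover (blend)
        q *= random.uniform(0.9, 1.1); h *= random.uniform(0.9, 1.1) # mutation
        children.append((q, h))
    pop = parents + children

best = max(pop, key=fitness)
print("best heat input and film coefficient:", best, "squared error:", -fitness(best))
```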

Computational modelling of inelastic neutron scattering for nanomaterial characterisation in GPU architectures

Roach, Parallel computational modelling of inelastic neutron scattering in multi-node and multi-core architectures, in: IEEE HPCC-10: Int Conf on High Performance Computing and Communica[r]
