It is common practice to transfer data into the cloud, where storage, computation, analysis, and decision making take place. Because typical cloud infrastructure provides abundant resources (e.g., powerful high-end servers), cloud computing can be used to solve problems in IoT-based systems. Cloud computing has several advantages. Firstly, computational scalability can be achieved by properly leveraging the resources of high-end servers. Secondly, cloud services typically let users pay in a pay-as-you-go manner, making them more economical than having on-premise equipment. Thirdly, the cloud is maintained by the cloud service providers, sparing end users from technical management. These advantages show the importance of cloud computing for many problems in IoT-based systems.
configurable, flexible and scalable solution to the management of today’s complex telecommunication networks, thereby reducing the amount of necessary human interaction. MAGENTA provides autonomous, reactive, proactive, and communicative mobile agents. As discussed, the independence and mobility of mobile agents reduce bandwidth overloading by moving the processing of management data and decision making from centralized management stations to the managed devices. This saves many repetitive request/response round trips and also addresses the problems created by intermittent or unreliable network connections between the network management stations and the managed devices. Agents can easily work off-line and communicate their results when the application is back on-line. Moreover, agents support parallel execution (load balancing) of large computations, which can easily be divided among various computational resources. Thus, with the help of mobile agents, decentralization solves many problems of centralized network systems, makes the system more efficient, and increases performance.
However, PFs are not yet practical because of their excessive computational and memory cost. A natural approach to overcoming this limitation is to employ parallel & distributed (P&D) computing. Developing P&D algorithms and architectures that exploit the spatial and temporal concurrency of PF computations is one way to make PF-based multiple target tracking, and particle filtering in general, practical. This possibility has not been studied much so far, although useful work has been done in this direction, e.g., [6, 84, 9, 8, 10, 56]. The most critical and computationally expensive step in every implementation of particle filtering is resampling. Resampling in effect discards samples that have small probability and concentrates on samples that are more probable. In parallel implementations, resampling becomes a bottleneck due to its inherently sequential nature.
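To make the resampling step concrete, the following is a minimal Python sketch of systematic resampling, one common resampling scheme. The text does not say which scheme the cited works use, so the scheme, the function name systematic_resample and the parameter u0 are assumptions made for illustration only. The single pass over the cumulative weights is where the sequential dependency arises.

```python
from itertools import accumulate
import random


def systematic_resample(weights, u0=None):
    """Return the indices of the particles kept by systematic resampling.

    `weights` are non-negative particle weights (not necessarily normalised).
    `u0` is a single uniform draw in [0, 1); it is a parameter here only so
    that the sketch can be run deterministically (an assumption of this example).
    """
    n = len(weights)
    total = sum(weights)
    # Cumulative distribution of the normalised weights. Each entry depends
    # on all previous ones, which is the sequential part of resampling.
    cdf = list(accumulate(w / total for w in weights))
    cdf[-1] = 1.0  # guard against floating-point round-off
    if u0 is None:
        u0 = random.random()
    indices = []
    j = 0
    for i in range(n):
        position = (i + u0) / n  # evenly spaced positions in [0, 1)
        while position >= cdf[j]:
            j += 1  # skip particles whose weight interval was passed
        indices.append(j)
    return indices
```

Particles with small weights tend to be skipped, while particles with large weights are selected several times, which matches the description of discarding improbable samples and concentrating on probable ones.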
In a parallel database system, proper data placement is essential for load balancing. Ideally, interference between concurrent parallel operations can be avoided by having each operation work on an independent dataset. These independent datasets can be obtained by declustering (horizontal partitioning) the relations based on a function (hash function or range index) applied to some placement attribute(s), and allocating each partition to a different disk. As with horizontal fragmentation in distributed databases, declustering is useful to obtain inter-query parallelism, by having independent queries work on different partitions, and intra-query parallelism, by having a query’s operations work on different partitions. Declustering can be single-attribute or multi-attribute. In the latter case [Ghandeharizadeh et al., 1992], an exact-match query requiring equality on multiple attributes can be processed by a single node without communication. The choice between hashing and a range index for partitioning is a design issue: hashing incurs less storage overhead but provides direct support for exact-match queries only, while a range index can also support range queries. Initially proposed for shared-nothing systems, declustering has been shown to be useful for shared-memory designs as well, by reducing memory access conflicts [Bergsten et al., 1991]. Full declustering, whereby each relation is partitioned across all the nodes, causes problems for small relations or systems with large numbers of nodes. A better solution is variable declustering, where each relation is stored on a certain number of nodes as a function of the relation size and access frequency [Copeland et al., 1988]. This can be combined with multirelation clustering to avoid the communication overhead of binary operations.
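The difference between hash and range declustering can be illustrated with a small Python sketch. It assumes integer placement attributes, a modulo hash, and a sorted list of range boundaries; all function names are hypothetical and do not come from any of the cited systems.

```python
from bisect import bisect_right


def hash_node(key, n_nodes):
    """Hash placement: node chosen by a simple modulo hash (an assumption)."""
    return key % n_nodes


def range_node(key, boundaries):
    """Range placement: `boundaries` is a sorted list of n_nodes - 1 split points.

    Keys below boundaries[0] go to node 0, keys in
    [boundaries[0], boundaries[1]) go to node 1, and so on.
    """
    return bisect_right(boundaries, key)


def decluster(rows, attr, place):
    """Horizontally partition `rows` (dicts) by applying `place` to row[attr]."""
    partitions = {}
    for row in rows:
        partitions.setdefault(place(row[attr]), []).append(row)
    return partitions


def nodes_for_range_query(lo, hi, boundaries):
    """Nodes that a range query lo <= key <= hi must visit under range placement."""
    return list(range(range_node(lo, boundaries), range_node(hi, boundaries) + 1))
```

Under hash placement an exact-match query on key k is routed to the single node hash_node(k, n_nodes), but a range query has to visit every node. Under range placement, nodes_for_range_query shows that only the nodes whose intervals overlap the queried range are involved.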
In this thesis NVIDIA’s CUDA framework for general purpose computing on graphics processing units (GPGPU) was investigated as a low cost, high performance solution to address the computational demands of unsupervised multivariate data clustering for flow cytometry. The existing work on data clustering algorithms using GPGPU has been limited in the algorithms implemented, in the scalability of such algorithms (such as using multiple GPUs), and by a lack of optimization specifically for the flow cytometry problem. Two unsupervised multivariate clustering algorithms, C-means and Expectation Maximization with Gaussian mixture models, were implemented using CUDA and the Tesla architecture. Multiple GPUs on a single machine are leveraged using shared memory and threading with OpenMP. The parallelism is expanded to support GPUs spread across multiple nodes in a high-performance computing environment using MPI. Functionality is verified for all methods using synthetic data. Real flow cytometry data sets are used to assess the accuracy and quality of results. The performance of the sequential, single GPU, and multiple GPU implementations is compared in detail.
Two problems exist with current soft-output sphere decoders. First, IP reusability is a real concern in industry, and designing a reconfigurable IP that can work with different numbers of antennas and constellations at run time, while remaining efficient in terms of area, power, and throughput, is very important but has not yet been addressed. Second, the current designs in the literature are not very practical for MIMO-based standards: their throughput and area are both more than what these standards expect. By using the K-best algorithm combined with my proposed parallel merge architecture (PMA), the child reduction technique, and the MMSE-SQRD channel processing technique, I have designed a low-area, in-place reconfigurable architecture for 16QAM and 64QAM constellations. In addition, a multi-core architecture is the solution for a reconfigurable system that supports a large number of antennas, and I have incorporated it in the design.
In this chapter, we present Shredder, a system for performing efficient content-based chunking to support scalable incremental storage and computation. Shredder builds on the observation that neither the exclusive use of multicore CPUs nor the use of specialized hardware accelerators is sufficient to deal with large-scale data in a cost-effective manner: multicore CPUs alone cannot sustain a high throughput, whereas the specialized hardware accelerators lack programmability for other tasks and are costly. As an alternative, we explore employing modern GPUs to meet these high computational requirements (while, as evidenced by prior research [89, 96], also allowing for a low operational cost). The application of GPUs in this setting, however, raises a significant challenge: while GPUs have been shown to produce performance improvements for computation intensive applications, where the CPU dominates the overall cost envelope [89, 90, 96, 137, 138], it was unclear when we started this work whether GPUs are equally effective for data intensive applications, which need to perform large data transfers for a significantly smaller amount of processing.
In this context, and in relation to all these previous works, we focus on the use of middleware based on mobile agents. Mobile agents are considered a valuable new computer science technology, which is used in  to propose a new model for the automatic construction of business processes based on multi-agent systems. They are also used in  to improve the management, the flexibility and the reusability of a grid-like parallel computing architecture, and to improve the time efficiency of a medical reasoning system in . Thanks to the several interesting capabilities of mobile agents, we design and implement a parallel and distributed environment composed of middleware that assigns and orchestrates a set of mobile agents as AVPUs (Agent Virtual Processing Units) for each physical processor in heterogeneous parallel and distributed grid computing. It implements mechanisms for load balancing, fault tolerance, and reducing the communication cost, in order to address the challenges of parallel and distributed computing and ensure HPC. This paper is organized as follows:
Although general complex engineering simulations come to mind when identifying families of applications benefiting most from heterogeneous parallel architectures, in the upcoming era of IoT and Big Data there is significant interest in exploiting the capabilities offered by customised heterogeneous hardware such as FPGA, ASIP, MPSoC, heterogeneous CPU+GPU chips and heterogeneous multi-processor clusters, all of which have various memory hierarchies, sizes and access performance properties. In fact, online Big Data processing with nearly instantaneous results demands massive parallelism and well-devised divide-and-conquer approaches to exploit heterogeneous hardware, on both the client and server sides, to its fullest extent. Moreover, heterogeneous systems can not only handle workloads with fewer and/or smaller servers (cost saving) but also slash the energy used to run certain applications, which brings clear benefits and addresses the growing interest in green solutions and the pressure to reduce the environmental impact of, e.g., data centres. A common theme across all scenarios is the need for low-power computing systems that are fully interconnected, self-aware, context-aware and self-optimising within application boundaries.
This research surveys new trends that introduce a new approach to solving complex problems involving tridiagonal matrices. This new approach is expected to have a major impact on many aspects of mathematics and science. It is presented in the context of parallel computing technology and its evolution from sequential to parallel algorithmic practice. Iterative or domain decomposition approximate methods such as Jacobi and Red-Black Gauss-Seidel start from guessed values; a common initial guess is to assume that all unknowns are zero. Although other numerical methods are also useful, Red-Black Gauss-Seidel, owing to its advantages, was the method chosen for the parallel solution of PDE problems. The parallel computing experiments were carried out on more than two Intel Pentium IV PCs under a homogeneous Linux platform. Comparisons between the sequential and parallel algorithms will be presented. The results of some computational experiments and the parallel performance in solving the equations will be discussed.
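As an illustration of the Red-Black Gauss-Seidel idea for a tridiagonal system, the following Python sketch updates the even-indexed ("red") unknowns first and the odd-indexed ("black") unknowns second, starting from an all-zero guess. It is a sequential sketch under stated assumptions: the function name, the array layout (lower, diag, upper, rhs) and the fixed number of sweeps are hypothetical, and convergence is assumed, for example for a diagonally dominant matrix.

```python
def red_black_gauss_seidel(lower, diag, upper, rhs, sweeps=100):
    """Approximately solve a tridiagonal system with red-black ordering.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    x = [0.0] * n  # initial guess: all unknowns are zero
    for _ in range(sweeps):
        for colour in (0, 1):  # 0 = red (even indices), 1 = black (odd indices)
            # Within one colour every update uses only neighbours of the
            # other colour, so these updates are mutually independent and
            # could be distributed over several processors.
            for i in range(colour, n, 2):
                left = lower[i] * x[i - 1] if i > 0 else 0.0
                right = upper[i] * x[i + 1] if i < n - 1 else 0.0
                x[i] = (rhs[i] - left - right) / diag[i]
    return x
```

Because the two neighbours of an unknown in a tridiagonal system always have the opposite colour, each half-sweep has no internal data dependencies, which is the property a parallel implementation can exploit.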
dom memory accesses with poor spatial and temporal locality. Such problems lead to low utilisation of compute capacity, with execution times dominated by memory latency. Reconfigurable computing platforms such as field programmable gate arrays (FPGAs) can tackle this through customised hardware design and software flexibility. The use of linear algebra has recently proved a major alternative approach in distributed memory systems. Graph algorithms can be expressed as a sequence of linear algebraic operations, where BFS is equivalent to the fundamental operation of matrix-vector multiplication. In Chapter 5, we proposed a linear algebra based implementation of BFS on FPGA. We suggested mapping the BFS computation onto a reconfigurable platform based on sparse matrix-vector multiplication (SpMV). We proposed transforming complex sparse graph algorithms into a sparse matrix-matrix multiplication (SpMM) algorithm. This is a generic method for mapping sparse graph algorithms to FPGAs. The main property enabling embarrassing parallelism is that a matrix-matrix multiply can be thought of as a collection of matrix-vector multiplies with the same matrix but different vectors. This approach provided us with an embarrassingly parallel execution strategy in which graph data dependencies no longer exist and which can easily be realised as a dataflow implementation. The performance of our FPGA implementation was evaluated with real-world PPIs and achieved 1.9 times faster APSP computation than the sequential platform and a 1.5 times improvement over the GPU parallel implementation. We also assessed the performance of our FPGA implementation on graphs of increasing edge density and variable structure. Our FPGA implementation achieved a maximum speedup of 4 times over the sequential APSP implementation.
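The correspondence between BFS and SpMV can be sketched in software as follows. This is an illustrative Python sketch, not the FPGA design described above: it assumes an unweighted, undirected graph whose symmetric adjacency matrix is stored in CSR form (indptr, indices), and the function names are hypothetical. One BFS step is a Boolean sparse matrix-vector product with the current frontier, and running BFS from every source corresponds to the collection of independent matrix-vector products that makes up an SpMM.

```python
def spmv_bool(indptr, indices, x):
    """Boolean SpMV: y[i] is True if any neighbour j of vertex i has x[j] set."""
    n = len(indptr) - 1
    return [any(x[j] for j in indices[indptr[i]:indptr[i + 1]]) for i in range(n)]


def bfs_levels(indptr, indices, source):
    """BFS hop counts from `source` via repeated SpMV (-1 marks unreachable)."""
    n = len(indptr) - 1
    levels = [-1] * n
    levels[source] = 0
    frontier = [i == source for i in range(n)]
    depth = 0
    while any(frontier):
        depth += 1
        reached = spmv_bool(indptr, indices, frontier)
        # Mask out vertices that already have a level, keeping the new frontier.
        frontier = [reached[i] and levels[i] == -1 for i in range(n)]
        for i in range(n):
            if frontier[i]:
                levels[i] = depth
    return levels


def all_pairs_hops(indptr, indices):
    """Hop-count APSP: one independent BFS per source vertex.

    The BFS runs share the matrix but use different vectors, so they have
    no data dependencies between them.
    """
    n = len(indptr) - 1
    return [bfs_levels(indptr, indices, s) for s in range(n)]
```

Each call to bfs_levels in all_pairs_hops is independent of the others, which is the embarrassingly parallel structure referred to in the text.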
Abstract. In this paper we perform a comprehensive area, power, and energy analysis of some of the most recently developed lightweight block ciphers and compare them to the standard AES algorithm. We do this for several different architectures of the considered block ciphers. Our evaluation method consists of estimating the pre-layout power consumption and the derived energy using Cadence Encounter RTL Compiler and ModelSim simulations. We show that the area is not always correlated to the power and energy consumption, which is of importance for mobile battery-fed devices. As a result, this paper can be used to choose an architecture when the algorithm has already been fixed, or it can help in deciding which algorithm to choose based on energy and key/block length requirements.
In  the Pool/Map approach is described, as well as the Process/Queue approach. The Pool/Map approach spawns a pool of worker processes and returns a list of results. The multiprocessing module extends the Python map function through the Pool class, so that a single operation instantiates a set of worker processes that work in parallel and collects the results, using regular Python syntax. The Process/Queue approach also builds on the multiprocessing module and allows the concurrent function to take more than one input parameter. This approach creates two FIFO queues, one for the input data and one for receiving output data. The parallel worker processes are started using the Process class. In the same paper, the Pool/Map and Process/Queue approaches and Parallel Python are compared on parallel astronomical data processing applications, with the Process/Queue approach showing better performance. In terms of features, however, Parallel Python also enables execution on clusters, while the approaches based on multiprocessing can only be executed on a single node.
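The two approaches can be sketched with the standard multiprocessing module as follows. This is a minimal illustration rather than the code of the cited paper: the helper names, the use of math.factorial and integer powers as workloads, the None sentinel used to stop workers, and the preference for the "fork" start method where it is available are all assumptions of this sketch.

```python
import math
import multiprocessing as mp


def _context():
    # Assumption of this sketch: prefer the "fork" start method when the
    # platform offers it, otherwise fall back to the platform default.
    if "fork" in mp.get_all_start_methods():
        return mp.get_context("fork")
    return mp.get_context()


def pool_map_factorials(values, workers=2):
    """Pool/Map: one call spawns the workers, maps the function, collects results."""
    with _context().Pool(processes=workers) as pool:
        return pool.map(math.factorial, values)


def _worker(in_queue, out_queue):
    """Process/Queue worker: read (base, exponent) tasks until a None sentinel."""
    while True:
        task = in_queue.get()
        if task is None:
            break
        base, exponent = task  # more than one input parameter per task
        out_queue.put((base, exponent, base ** exponent))


def process_queue_powers(tasks, workers=2):
    """Process/Queue: one FIFO queue for input data, one for output data."""
    ctx = _context()
    in_queue, out_queue = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(in_queue, out_queue))
             for _ in range(workers)]
    for proc in procs:
        proc.start()
    for task in tasks:
        in_queue.put(task)
    for _ in procs:
        in_queue.put(None)  # one sentinel per worker
    # Drain the output queue before joining so that no worker blocks on put().
    results = [out_queue.get() for _ in tasks]
    for proc in procs:
        proc.join()
    return sorted(results)  # arrival order is not deterministic


if __name__ == "__main__":
    print(pool_map_factorials([3, 4, 5]))
    print(process_queue_powers([(2, 3), (3, 2)]))
```

Pool.map preserves the order of its input, whereas results taken from the output queue arrive in whatever order the workers finish, which is why the second function sorts them.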
In both of the cases mentioned, the starting point is to consider the collective communication commands of standardized development environments such as the MPI API, namely the collective commands Reduce or Gather. Some parallel computers provide an alternative global summation operation, gssum(), precisely to eliminate such bottlenecks. This operation iteratively sums the partial results of two computing nodes, which the two nodes exchange. Each partial sum received by one node of the pair is added to the sum already held in that node, and the result is transmitted to the next computing node of the defined communication chain. In this way the total over all computing nodes is gradually accumulated, so that the manager computing node 0 can perform the last sum and print the final result. This global summation procedure can also be programmed by hand, but using the already implemented procedures simplifies the implementation of parallel algorithms and also contributes to their effectiveness.
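The pairwise accumulation described above can be simulated in a few lines of Python. The sketch below is not the implementation of gssum() or of MPI Reduce; it is a single-process simulation, under the assumption of a binary-tree communication pattern, in which each list entry stands for the partial sum held by one computing node. The function name pairwise_global_sum is hypothetical.

```python
def pairwise_global_sum(partials):
    """Simulate a pairwise global sum that ends in node 0.

    `partials[i]` is the partial result held by node i. In every step each
    sender node passes its current sum to a receiver node, which adds it to
    its own. Returns (total held by node 0, number of communication steps).
    """
    values = list(partials)
    n = len(values)
    if n == 0:
        return 0, 0
    stride = 1
    steps = 0
    while stride < n:
        # All pairs within one step are independent and could communicate
        # at the same time on a real parallel machine.
        for receiver in range(0, n - stride, 2 * stride):
            sender = receiver + stride
            values[receiver] += values[sender]
        stride *= 2
        steps += 1
    return values[0], steps
```

With this pattern the number of communication steps grows with the logarithm of the number of nodes rather than linearly, which is the reason such collective operations relieve the bottleneck of having every node send its result directly to node 0.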
From the previous sections it is clear that the Helmholtz equation is converted into a system of linear equations using numerical methods. The order of this system is quite large, so iterative methods should be used to solve it; moreover, parallel programming gives us the chance to solve this large problem quickly. The programs were written in C++, and the parallel functions were taken from the MPI (Message Passing Interface) library. A primary reason for the usefulness of the message passing model is that it is extremely general. Essentially, any type of parallel computation can be cast in the message passing form. In addition, this model can be implemented on a wide variety of platforms, from shared-memory multiprocessors to networks of workstations and even single-processor machines. It generally allows more control over data location and flow within a parallel application than, for example, the shared memory model. Thus programs can often achieve higher performance using explicit message passing. Indeed, performance is a primary reason why message passing is unlikely to ever disappear from the parallel programming world.
This paper presents the design of a partially reconfigurable FIR filter using a systolic distributed arithmetic architecture. A two-dimensional, fully pipelined structure is used to implement a low power, high speed, computationally efficient FIR filter. A new look-up table (LUT) architecture for distributed arithmetic is proposed to reduce the partial reconfiguration time. In the partial reconfiguration module, the FIR filter is dynamically reconfigured by changing the filter coefficients to realize low pass and high pass filter characteristics. The design is implemented using the XUP Virtex 5 LX110T kit. The FIR filter design shows improvements in efficiency and time
The encoder works independently on each video frame, thus performing a so-called intra coding of the frames. The even indexed frames, referred to as key-frames (KF), are traditionally encoded using, for example, an H.264/AVC encoder operating in intra mode (i.e., without using any inter-frame prediction). The odd indexed frames, called Wyner-Ziv (WZ) frames, are instead encoded using the principles of distributed source coding. More specifically, these frames are first transformed with a block based DCT and then quantized using suitable quantization matrices. Homologous coefficients are then encoded, bit plane by bit plane, using a suitable turbo code. In particular, each bit plane of each frequency band is fed into a turbo encoder with rate 1/3; the information bits are discarded, while the parity bits are stored in a buffer. The encoded bit stream is thus composed of two different parts: the H.264/AVC intra coded key-frame stream and the WZ stream. The key-frame information is entirely sent to the decoder, while the parity bits are only partially sent, depending on the decoder requests to the encoder, provided iteratively through a feedback channel.
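Two of the encoder steps above, the split into key and WZ frames and the decomposition of quantized coefficients into bit planes, can be sketched in Python as follows. The sketch leaves out the DCT, the quantization and the turbo encoder entirely. The function names, the use of non-negative integer quantization indices, and the most-significant-bit-first ordering of the planes are assumptions made for illustration.

```python
def split_frames(n_frames):
    """Split frame indices into key frames (even) and Wyner-Ziv frames (odd)."""
    key_frames = [i for i in range(n_frames) if i % 2 == 0]
    wz_frames = [i for i in range(n_frames) if i % 2 == 1]
    return key_frames, wz_frames


def bit_planes(indices, n_bits):
    """Decompose one frequency band into bit planes.

    `indices` are the non-negative quantization indices of the homologous
    coefficients of one band. The result lists the planes from the most
    significant bit to the least significant bit (ordering assumed here);
    each plane holds one bit per coefficient and would be the input of the
    turbo encoder.
    """
    return [[(value >> bit) & 1 for value in indices]
            for bit in range(n_bits - 1, -1, -1)]
```

For a band quantized to n_bits bits, bit_planes returns n_bits planes, each as long as the number of coefficients in the band.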