Using Statistical Analysis to Improve Data Partitioning in Algorithms for Data Parallel Processing Implementation

Based on the results from this experiment, Figure 4.1.2 shows interaction plots for the effect of two-level interactions on the execution time. The interaction plots highlight how significantly the execution time changes when one factor is varied at a given level of another factor. For example, consider the interaction between the average time between customer arrivals (arrival rate) and the number of employees taking orders from the customers, shown in Figure 4.1.2 (a). Decreasing the number of employees taking orders in a simulation with a low arrival rate would not be as significant as it would be in a simulation with a higher arrival rate. In the high arrival rate scenario, having more employees taking orders can reduce the number of customers waiting in queues, reducing the time associated with the logic of pushing and pulling elements from queues in the program. However, in a simulation with a low arrival rate, there might not even be customers in a queue that could benefit from a large number of employees taking orders. This means that the magnitude of the effect in reducing the processing time depends on the value of the arrival rate. The same applies when comparing the relation between the number of cashiers and the number of employees taking orders, shown in Figure 4.1.2 (b). In this scenario, the processes are linked sequentially and one process is performed after the other in the simulation logic; delays in the time taking orders affect the time paying. However, if the delay is at the cashiers, then the other process is not affected.
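This kind of effect can be reproduced from raw experiment results with a standard two-factor interaction plot. The sketch below is a minimal illustration only; the results file and the column names (arrival_rate, order_takers, exec_time) are assumptions, not the thesis's actual data layout.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("simulation_results.csv")  # hypothetical results file

# Mean execution time for every combination of the two factors.
cell_means = (df.groupby(["arrival_rate", "order_takers"])["exec_time"]
                .mean()
                .unstack("order_takers"))

# Non-parallel lines in this plot indicate a two-factor interaction:
# the effect of adding order takers depends on the arrival rate.
cell_means.plot(marker="o")
plt.xlabel("arrival rate (mean time between arrivals)")
plt.ylabel("mean execution time")
plt.title("Interaction: arrival rate x number of order takers")
plt.show()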

PC Grade Parallel Processing and Hardware Acceleration for Large Scale Data Analysis

In this thesis, the concept of architectural patterns is used for categorizing parallel processing, while the GPGPU programming framework, which is based on a specific architectural pattern - Processor Farms - is used as a general guideline to develop GPGPU applications, where GPGPU is viewed as a branch of parallel computing. For a single GPGPU application, there should be a programming model to devise its solution. Basically, the programming model originates from the programming framework, which has been depicted in Figure 4.1; that figure also represents the progressively narrowing research pipeline involved in this thesis. After identifying the architectural patterns and the programming frameworks, the GPGPU programming model design becomes the next focus of this study. Referring to Figure 4.2, it is clear that any application can ultimately be divided into a combination of tasks and streams, each contained in its own container, the task pool and the stream pool, respectively. The four aforementioned implementation steps, partitioning, communication, agglomeration, and mapping, are then applied to each component in the task and stream pools. Through the analysis of GPGPU's programming framework depicted in Figures 4.3 and 4.4, it is concluded that communication and agglomeration are generally fixed, that is, they have common instructions and constant implementation orders for most GPGPU applications. For example, data agglomeration is normally achieved by the GPU's hardware; it is neither transparent nor programmable to developers, i.e., it cannot be directly controlled by developers. Although issued through API instructions, the communications between CPU and GPU are also relatively straightforward, as depicted in Figures 4.2 to 4.4.
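The largely fixed communication pattern described above (host-to-device copy, kernel launch over the mapped tasks, device-to-host copy) can be illustrated with a minimal sketch. The snippet below uses Numba's CUDA bindings purely as an analogue; the thesis targets a GPGPU API directly, and the kernel and names here are illustrative assumptions.

import numpy as np
from numba import cuda

@cuda.jit
def scale(src, dst, factor):
    i = cuda.grid(1)              # one stream element per thread
    if i < src.size:
        dst[i] = src[i] * factor  # the per-task computation

host_in = np.arange(1_000_000, dtype=np.float32)
dev_in = cuda.to_device(host_in)             # communication: host -> device
dev_out = cuda.device_array_like(dev_in)

threads = 256
blocks = (host_in.size + threads - 1) // threads
scale[blocks, threads](dev_in, dev_out, 2.0)  # mapping of tasks onto threads

host_out = dev_out.copy_to_host()            # communication: device -> host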

Big Data Analysis – Design And Implementation Of Novel Algorithms For Tweets Retrieval And Processing

As noted in [6], most Twitter search systems treat a tweet as plain text when modeling relevance. However, a series of conventions allows users to tweet in structured ways using combinations of different blocks of text. These blocks include plain text, hashtags, links, mentions, etc. Each block encodes a particular communicative intent, and the sequence of these blocks captures the changing discourse. Previous work shows that exploiting structural information can improve retrieval of structured documents, e.g., web pages. A set of features, derived from the blocks of text and their combinations, is used in a learning-to-rank scenario. The results show that structuring tweets can achieve state-of-the-art performance. The approach does not rely on social media features, but when this additional information is added, performance improves significantly.
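A minimal sketch of what block-derived features might look like is given below; the regular expressions and feature names are illustrative assumptions, not the feature set used in [6].

import re

def block_features(tweet: str) -> dict:
    hashtags = re.findall(r"#\w+", tweet)
    mentions = re.findall(r"@\w+", tweet)
    links = re.findall(r"https?://\S+", tweet)
    # Strip the structured blocks to approximate the plain-text block.
    plain = re.sub(r"(#\w+|@\w+|https?://\S+)", "", tweet).split()
    return {
        "n_hashtags": len(hashtags),
        "n_mentions": len(mentions),
        "n_links": len(links),
        "n_plain_terms": len(plain),
        "starts_with_mention": tweet.strip().startswith("@"),
    }

print(block_features("@nasa New #MarsRover images are up: https://example.org/pics"))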

Parallel Processing Of Data Mining Functions Using Clustering, Optimization And Classification Techniques

discussed the efficient parallelization of the alternating least squares-based Tucker decomposition algorithm for sparse tensors in shared and distributed memory systems. A fine-grain parallel algorithm was used in a distributed memory environment. Junbo Zhang et al. [29] processed large-scale incomplete data with rough set theory in parallel; rough set theory has been used successfully in solving problems in pattern recognition, machine learning, and data mining. Yunquan Zhang et al. [30] introduced the basic framework of MapReduce and its outstanding features as well as its deficiencies compared to DBMSs; MapReduce stands out for its simplicity, scalability, and fault tolerance. Adib Haron et al. [33] presented a communication-efficient mapping of a large-scale matrix multiplication algorithm onto the CIM architecture. The memristor-based CIM architecture is a new computing architecture that integrates computation and memory in the same physical location, in contrast to the Von Neumann architecture. Paweł Czarnul et al. [35] presented the implementation of a similarity measure for big data analysis in a parallel environment; the simulations made it possible to determine what could be performed on particular hardware. Ruben Mayer et al. [36] proposed a pattern-sensitive stream partitioning model, which supports the consistent parallelization of a wide class of important CEP operators. Themis Palpanas [46] described past efforts in designing techniques for indexing and mining truly massive collections of data series, based on indexing techniques for fast similarity search and on mining algorithms. Fabrizio Marozzo et al. [47] described the design and implementation of the Data Mining Cloud

Parallel Implementation of Big Data Pre-Processing Algorithms for Sentiment Analysis of Social Networking Data

The experimental setup includes three quite different configurations to test the suitability of various tools, ranging from Linux/Unix command line tools to GPGPUs and Hadoop Map-Reduce based tools. The first is a high-end Apple MacBook Pro running Mac OS X Mavericks with a quad-core i7 processor running at 2.5 GHz, 16 GB DDR3 RAM at 1600 MHz, a 6 MB L3 cache backed by a 128 MB L4 cache, and a 512 GB SSD drive. The second is a high-end Dell Precision T7600 workstation with dual Intel Xeon octa-core processors each running at 2.7 GHz, 32 GB DDR3 RAM, 20 MB L3 cache memory, and 4 x 600 GB 7200 RPM HDDs. Finally, there is a Hadoop cluster containing 10 high-end PCs and two high-end servers acting as master nodes. The master nodes are Dell R910 servers with four Intel Xeon octa-core processors running at 3.5 GHz, 64 GB DDR3 RAM, 30 MB L3 cache memory per processor, and 6 x 600 GB SAS 10K RPM drives in a RAID0 configuration. The data nodes are Dell OptiPlex 9020 PCs with quad-core i7 processors running at 2.7 GHz, 8 GB DDR3 RAM, and a 1 TB 7200 RPM HDD. For almost 90% of the first two phases, and for a considerable part of the third phase, using a high-end laptop is found to be much more flexible and effective than going for a workstation or a cluster. This is because the data handled during these phases fits in memory on such a laptop, which makes processing very effective and fast compared with the GPGPU-based approach, where the data has to be transferred from the HDD to the CPU memory, then from the CPU memory to the device memory, and the results must be copied back to the CPU memory from the device memory before being persisted to disk. The cluster, on the other hand, is basically used for Map-Reduce based batch processing, and its overhead is that all intermediate data needs to be persisted to the HDD and then retrieved again for the next phase. Since the first two phases involve many passes over the same data, each performing only a small amount of processing, it is very difficult to run such jobs on a GPU or on a cluster, as this involves a lot of memory or disk I/O overhead.

Efficient and Customizable Data Partitioning Framework for Distributed Big RDF Data Processing in the Cloud

In the next Hadoop job, we examine each triple against the set of 1-hop extended out-vertex blocks obtained in the previous phase, to determine which extended vertex blocks should include this triple in order to meet the k-hop guarantee. We call this phase the controlled triple replication. Concretely, we perform a full scan of the RDF dataset and examine in which 1-hop extended out-vertex blocks each triple should be added, in two steps. First, for each triple, one key-value pair is generated with the subject of the triple as the key and the remaining part (predicate and object) of the triple as the value. Then we group all triples with the same subject into a local subject-based triple group. In the second step, for each 1-hop extended out-vertex block and each subject with its local triple group, we test if the subject vertex matches the extended out-vertex block. By matching, we mean that the subject vertex is a border vertex of the extended out-vertex block. If a match is found, then we create a key-value pair with the unique ID of the extended out-vertex block as the key and the subject vertex of the given local subject-based triple group as the value. This ensures that all out-triples of this subject vertex are added to the matching extended out-vertex block. If there is no matching extended out-vertex block for the subject vertex, the triple group of the subject vertex is discarded. This process is repeated until all the subject-based local triple groups have been examined against the set of extended out-vertex blocks. Finally, we obtain a set of 2-hop extended out-vertex blocks from the reduce function, which merges all key-value pairs with the same key together such that the subject-based triple groups with the same extended out-vertex block ID (key) are integrated into the extended out-vertex block to obtain the 2-hop extended out-vertex block, which is stored as an HDFS file. For k > 2, we repeat the above process until the k-hop extended vertex blocks are constructed. In our experiment section we report the performance of the SPA partitioner in terms of both the time complexity and the data characteristics of the generated partitions, to illustrate why partitions generated by the SPA partitioner are much more efficient for RDF graph queries than random partitioning by HDFS chunks.
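The controlled triple replication step can be sketched as follows in a single process; the data shapes (triples as 3-tuples, blocks as sets of border vertices) are simplifying assumptions, and the real implementation runs as a Hadoop map and reduce.

from collections import defaultdict

triples = [("s1", "p1", "o1"), ("s1", "p2", "o2"), ("s2", "p1", "o3")]
# block_id -> border vertices of the 1-hop extended out-vertex block
border_vertices = {"blockA": {"s1"}, "blockB": {"s3"}}

# Map step 1: subject-based local triple groups.
triple_groups = defaultdict(list)
for s, p, o in triples:
    triple_groups[s].append((p, o))

# Map step 2: emit (block_id, triple group) when the subject is a border vertex.
pairs = []
for block_id, borders in border_vertices.items():
    for subject, group in triple_groups.items():
        if subject in borders:
            pairs.append((block_id, (subject, group)))
        # otherwise the triple group is simply not replicated into this block

# Reduce: merge all triple groups with the same block id into the 2-hop block.
two_hop_blocks = defaultdict(list)
for block_id, (subject, group) in pairs:
    two_hop_blocks[block_id].extend((subject, p, o) for p, o in group)

print(dict(two_hop_blocks))   # blockA receives both of s1's out-triples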

Big Data Processing of Data Services in Geo Distributed Data Centers Using Cost Minimization Implementation

Raghavendra P. and Z. Wang identify power delivery, power consumption, and heat management as the key challenges in data center environments. They propose using different power management strategies, such as a virtual machine controller and an efficiency controller, and use these strategies to validate power management in data centers. A. Sivasubramanian, B. Urgaonkar et al. observe that data center power consumption has a significant impact on both the recurring electricity bill (op-ex) and the one-time construction costs (cap-ex); they develop peak reduction algorithms that combine the UPS battery knob with existing throttling-based techniques for minimizing power costs in the data center. Sharad Agarwal, John Dunagan et al. note that as services grow to span more and more globally distributed data centers, automated mechanisms are urgently needed to place application data across these data centers. MapReduce is a programming model, with an associated implementation, for processing and generating large data sets; it runs on a large cluster of commodity machines, is highly scalable, and makes the system easy for programmers to use. Kuangyu Zheng, Xiaodong Wang et al. observe that data center power optimization has recently received a great deal of research attention and that traffic consolidation has been proposed to save energy in data center networks (DCNs); they propose PowerNetS, a power optimization strategy that leverages workload correlation analysis to jointly minimize the total power consumption of servers. Dan Xu, Xin Liu, and Bin Fan aim to achieve an optimal tradeoff between energy efficiency and service performance over a set of distributed IDCs with dynamic demand, by dynamically adjusting server capacity and performing load shifting at different time scales. They propose three

Performance Analysis of Classification Algorithms for Prediction of Student Failure in College using Academic Data Processing

Recent years have shown a growing interest and concern in many countries about the problem of school failure and the identification of its main contributing factors [2]. A considerable body of research [2] has been carried out on identifying the factors that affect the low performance of students (school failure and dropout) at different educational levels (primary, secondary and higher), using the large amounts of information that current computers can store in databases. All these data are a "gold mine" of valuable information about students. Identifying and finding useful information hidden in large databases is a difficult task. A very promising way to achieve this goal is the use of knowledge discovery in databases techniques, or data mining in education, known as educational data mining (EDM).

IoT Implementation in Manufacturing using Data Analysis and Data Management

The fifth paper we studied was written by M. Munir Ahmad and Nasreddin Dhafr, with the title "Establishing and improving manufacturing performance measures". The idea of the paper is to use IoT devices in manufacturing for better and more efficient processing and better product quality. The advantages of this approach were improved manufacturing and a reduced chance of breakdowns; the disadvantages were high cost and limited usefulness for small-scale industries. The sixth paper we reviewed was written by Xun Ye, Tae Yang Park, SeungHo Hong, Yuemin Ding, and Aidong Xu, with the title "Implementation of a Production-Control System using Integrated AutomationML and OPC UA". The idea of this paper was to integrate IoT devices into manufacturing for increased processing efficiency and increased product quality. Its advantages were that it can be used to achieve precise management of the data-exchange workflow and that OPC UA supports discovery, modeling, and security; the drawback is that the high level of abstraction in the framework leads to a gap in the required skill set.

Automatic Derivation of Statistical Data Analysis Algorithms: Planetary Nebulae and Beyond

the task clause as a special case. This allows the network structure to guide the application of the schemas and thus to constrain the combinatorial explosion of the search space, even if a large number of schemas is available. Schemas are implemented as Prolog clauses and search control is thus simply relegated to the Prolog interpreter: schemas are tried in their textual order. This simple approach has not caused problems so far, mainly because the domain admits a natural layering which can be used to organize the schema library. The top layer comprises network decomposition schemas which try to break down the network into independent subnets, based on independence theorems for Bayesian networks. These are domain-specific divide-and-conquer schemas: the emerging subnets are fed back into the synthesis process and the resulting programs are composed to achieve a program for the original problem. AutoBayes is thus able to automatically synthesize larger programs by composition of different schemas. The next layer comprises more localized decomposition schemas which work on products of i.i.d. variables. Their application is also guided by the network structure but they require more substantial symbolic computations. The core layer of the library contains statistical algorithm schemas such as expectation maximization (EM) [10, 11] and k-Means (i.e., nearest neighbor clustering); these generate the skeleton of the program. The final layer contains standard numeric optimization methods such as the Nelder-Mead simplex method or different conjugate gradient methods. These are applied after the statistical problem has been transformed into an ordinary numeric optimization problem and if AutoBayes failed to find a symbolic solution for the problem. Currently, the
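The layered, try-in-textual-order schema application can be illustrated with a small sketch; the schema names and problem representation below are hypothetical and are not AutoBayes code.

def decompose_independent_subnets(problem):
    subnets = problem.get("independent_subnets")
    if not subnets:
        return None                               # schema does not apply
    parts = [synthesize(sub) for sub in subnets]  # feed subnets back into synthesis
    return {"compose": parts}

def em_schema(problem):
    if problem.get("kind") != "mixture_model":
        return None
    return {"algorithm": "expectation-maximization skeleton"}

def numeric_fallback(problem):
    return {"algorithm": "Nelder-Mead on the induced objective"}

# Library layers in textual order: decomposition first, core statistical
# schemas next, numeric optimization as a last resort (mirroring the way
# Prolog tries clauses in their textual order).
SCHEMAS = [decompose_independent_subnets, em_schema, numeric_fallback]

def synthesize(problem):
    for schema in SCHEMAS:
        program = schema(problem)
        if program is not None:
            return program
    raise ValueError("no applicable schema")

print(synthesize({"kind": "mixture_model"}))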

Transcriptome analysis in preterm infants developing bronchopulmonary dysplasia: data processing and statistical analysis of microarray data

In comparative microarray experiments, gene expression is measured for defined groups of samples, the so-called treatment groups. These treatment groups may be, for example, a group of healthy subjects versus a group of patients, a control versus disease set-up, or different grades of a single disease. In a microarray experiment, samples taken at different time points and at different stages of a disease can also be compared. If a transcript or gene is over- or underexpressed in one or more groups, this transcript is differentially expressed (Li and Tibshirani, 2013). Differential expression analysis can also be used to identify transcripts that are correlated with a quantitative feature or with the survival of patients. In the following chapter, some common methods to identify differential expression in two- or multiple-group set-ups are introduced. But first the problem of testing multiple hypotheses at once needs to be addressed. Multiple testing problems occur when multiple parameters are tested in a single population of probes. In microarrays, thousands of genes or probe sets are monitored simultaneously. Owing to the multiple testing problem, the probability of a false discovery, i.e., a false positive detection of statistically significant differentially expressed genes, increases dramatically (Reiner et al., 2003). Therefore, the "use of ordinary t-tests or other traditional univariate statistics to assess differential expression [can] be disastrous" [(Ritchie et al., 2007) citing (Smyth, 2004)].
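The consequence of the multiple testing problem, and the effect of a false discovery rate adjustment, can be illustrated on simulated null data; the sketch below uses the Benjamini-Hochberg procedure as one common correction and is not the thesis's analysis pipeline.

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes, n_per_group = 5000, 10
# Null data: no gene is truly differentially expressed between the groups.
control = rng.normal(size=(n_genes, n_per_group))
disease = rng.normal(size=(n_genes, n_per_group))

_, pvals = ttest_ind(control, disease, axis=1)

print("raw p < 0.05:", int((pvals < 0.05).sum()))       # ~5% false positives by chance
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("significant after BH-FDR:", int(reject.sum()))   # expected: none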

Technique of Building Parallel Data Mining Algorithms

The described algorithm works with the original knowledge model, in which all elements have already been created based on metadata about the data being processed (the values of the target and non-target attributes). Consequently, the algorithm has no operations for adding elements to the model; there are only operations that change existing elements, in lines 3 and 6. The actions in lines 2 and 5 do not change the model. In addition, the algorithm consists of two nested loops: an external one over all vectors W and an internal one over all independent attributes A.
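Since the algorithm itself is not reproduced in this excerpt, the following is only a structural sketch of the two nested loops, with hypothetical update rules standing in for the operations referred to as lines 3 and 6.

import numpy as np

def update_model(model, vectors_W, attributes_A):
    for w in vectors_W:                 # external loop: all vectors W
        prepare = w.sum()               # analogue of a non-modifying step
        for a in attributes_A:          # internal loop: independent attributes A
            # analogue of a modifying step: change an existing model element,
            # never add a new one
            model[a] = model[a] + prepare * w[a]
    return model

W = [np.array([0.1, 0.4]), np.array([0.3, 0.2])]
print(update_model({0: 0.0, 1: 0.0}, W, attributes_A=[0, 1]))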

Distributed processing for statistical data mining

The execution of computational, or in silico, experiments like those addressed in this thesis follows the same patterns of scientific endeavour as any other scientific pursuit. Accordingly, clear, accurate and detailed record keeping is as important in silico as it would be in a physical laboratory. As all the information about the experiment set-up is already available in computerised form, it should, in principle, be a much simpler task to capture this information in a standardised way than it would be in a comparable physical laboratory. Various disciplines do specify standards and procedures for the collection and storage formats of metadata, but no encompassing standard or procedure exists. For instance, a data set analysed in a cheminformatics study may provide chemical names or Simplified Molecular Input Line Entry Specification (SMILES) [36] descriptors for the compounds under investigation, and will perhaps report the programs and techniques used to calculate the molecular descriptors used in the analysis. However, the parameters for these programs, and the optimisations performed along the way, may not be thoroughly reported in the text. While there may be alternative avenues to discover this additional information in some circumstances, such as direct contact with the author, this lack of consistent detail generally makes the reported findings far less useful, and certainly less reproducible.
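One way to capture such set-up information in a standardised, machine-readable form is sketched below; the fields and file names are illustrative assumptions rather than an existing standard.

import json
import platform
import sys
from datetime import datetime, timezone

def experiment_record(params: dict, inputs: list, outputs: list) -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "parameters": params,      # e.g. descriptor-calculation settings
        "input_files": inputs,
        "output_files": outputs,
    }

record = experiment_record(
    params={"descriptor_program": "hypothetical_tool", "version": "1.2", "seed": 42},
    inputs=["compounds.smi"],
    outputs=["descriptors.csv"],
)
with open("experiment_metadata.json", "w") as fh:
    json.dump(record, fh, indent=2)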

The Use of Statistical Methods for Data Processing

The creditor risk ratio, together with the financial independence ratio, adds up to 100%; it is therefore clear that the values of this indicator express the indebted[...]


Parallel Selective Algorithms for Nonconvex Big Data Optimization

In this paper, we focus on nonconvex problems in the form (1), and propose a new, broad, deterministic algorithmic framework for the computation of their stationary solutions, which hinges on ideas first introduced in [21]. The essential, rather natural idea underlying our approach is to decompose (1) into a sequence of (simpler) subproblems whereby the function is replaced by suitable convex approximations; the subproblems can be solved in a parallel and distributed fashion and it is not necessary to update all variables at each iteration. Key features of the proposed algorithmic framework are: i) it is parallel, with a degree of parallelism that can be chosen by the user and that can go from complete parallelism (every variable is updated in parallel to all the others) to fully sequential (only one variable is updated at each iteration), covering virtually all the possibilities in between; ii) it permits the update, in a deterministic fashion, of only some (block) variables at each iteration (a feature that turns out to be very important numerically); iii) it easily leads to distributed implementations; iv) no knowledge of the objective's parameters (e.g., the Lipschitz constant of the gradient of the smooth term) is required; v) it can tackle a nonconvex smooth term; vi) it is very flexible in the choice of the approximation of the smooth term, which need not necessarily be its first or second order approximation (as in proximal-gradient schemes); of course it includes, among others, updates based on gradient- or Newton-type approximations; vii) it easily allows for inexact solution of the subproblems (in some large-scale problems the cost of computing the exact solution of all the subproblems can be prohibitive); and viii) it appears to be numerically efficient. While features i)-viii), taken singularly, are certainly not new in the optimization community, we are not aware of any algorithm that possesses them all, the more so if one limits attention to methods that can handle nonconvex objective functions, which is in fact the main focus of this paper. Furthermore, numerical results show that our scheme compares favorably to existing ones.
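A toy instance of these ideas, reduced to an l1-regularized least-squares problem, is sketched below; the surrogate, the block-selection rule and all parameter values are simplifying assumptions and do not reproduce the paper's general framework.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(80, 200))
b = rng.normal(size=80)
lam, gamma = 0.1, 0.5
tau = np.linalg.norm(A, 2) ** 2          # curvature of the convex surrogate

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(A.shape[1])
for it in range(200):
    grad = A.T @ (A @ x - b)
    # Per-block surrogate minimizer (closed form for the l1-regularized
    # quadratic approximation); every block could be computed in parallel.
    x_hat = soft_threshold(x - grad / tau, lam / tau)
    # Selective update: touch only the blocks with the largest expected change.
    scores = np.abs(x_hat - x)
    selected = np.argsort(scores)[-20:]   # update 20 of 200 blocks this iteration
    step = np.zeros_like(x)
    step[selected] = x_hat[selected] - x[selected]
    x = x + gamma * step                  # combine surrogate solution and iterate

print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())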

Parallel processing of big point clouds using Z-Order-based partitioning

The release of big point cloud data sets, e.g. the Actueel Hoogtebestand Nederland 2 (AHN2) dataset (Swart, 2010), has enabled researchers to have access to a high-quality and high-density point cloud of a large area. It has also brought to light the challenges of dealing with point clouds whose total disk size is comparable to the volumes considered Big Data. AHN2, for example, consists of 446 billion points spread across 1351 LAS zip files, for a total uncompressed size of 9 TB. Handling and processing the data on a single machine is challenging, even impractical. The problem of handling large amounts of data is not unique to those working with big point clouds. In recent years, with the rise of new communication and information technologies as well as improved access to affordable computing power, substantial work is being done on developing systems, frameworks and algorithms for handling voluminous amounts of high-dimensionality and quickly accumulating data, collectively known as Big Data. By definition, Big Data is so voluminous that it typically does not fit on a single hard drive. Even if it does, transferring data from storage to the processing machine over a slow bus (network) results in a huge performance penalty. The solution to these problems is to split the data into partitions which may be stored across several processing machines. We want the data to be on the processing machine as much as possible, a concept known as data locality, to avoid slow network transfers.
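The Z-order (Morton) key at the heart of such partitioning is obtained by interleaving the bits of the quantized coordinates; the 2-D sketch below is a minimal illustration with assumed cell sizes, not the system described in the paper.

def morton2d(ix: int, iy: int, bits: int = 16) -> int:
    # Interleave the bits of the two grid indices into a single 1-D key.
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)       # x bits at even positions
        code |= ((iy >> b) & 1) << (2 * b + 1)   # y bits at odd positions
    return code

points = [(0.5, 0.7), (0.6, 0.8), (100.2, 250.9), (0.4, 0.9)]
cell_size = 1.0                                   # assumed quantization cell
keys = sorted((morton2d(int(x // cell_size), int(y // cell_size)), (x, y))
              for x, y in points)

# Splitting this sorted key range into contiguous chunks yields partitions
# in which spatially close points tend to land together (data locality).
print(keys)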

Secure Parallel Processing on Encryption Cloud Data Using Fully Homomorphic Encryption

Cloud computing provides cloud services to users, clients, organizations, the public, etc., on a pay-as-you-go basis. In cloud computing, security is the major concern. Data encryption techniques are commonly used by clients to secure their data in cloud computing, and they effectively protect client data in this public environment. A client can encrypt plaintext for security before storing data in the cloud, and decrypt it to get his own data back from the cloud. Generally, if a client wants to apply computational operations to his own encrypted data stored in the cloud, he must first retrieve the data and decrypt the ciphertext (i.e., convert the ciphertext into plaintext). After decryption he can apply the computing operations to that data; after applying the operations, he can encrypt the result again and store it in the cloud. Decrypting the data, applying the operations and re-encrypting the result is a costly procedure. This long procedure is avoided by using a homomorphic encryption method, which allows computations to be performed directly on the encrypted data.
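The homomorphic property that removes this round trip can be illustrated with a deliberately toy example; the snippet below uses textbook RSA with tiny keys, which is insecure and only multiplicatively homomorphic, whereas the paper relies on fully homomorphic encryption.

# Toy, insecure illustration of a homomorphic property: the server can
# combine ciphertexts without ever decrypting them.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent (client only)

def enc(m): return pow(m, e, n)        # client-side encryption
def dec(c): return pow(c, d, n)        # client-side decryption

c1, c2 = enc(7), enc(6)
# Cloud side: multiply the ciphertexts only; plaintexts are never visible here.
c_prod = (c1 * c2) % n
assert dec(c_prod) == 7 * 6            # client recovers the computed result
print(dec(c_prod))                     # 42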

Parallel processing using big data and machine learning techniques for intrusion detection

Currently, information technology is used in all domains of life. Many devices and equipment produce data and transfer them across the network. These transfers are not always secured and can contain new menaces and attacks invisible to current security tools. Moreover, the large amount and variety of the exchanged data make the identification of intrusions more difficult in terms of detection time. To address these issues, we suggest in this paper a new approach based on storing the large amount and variety of network traffic data using big data techniques, and analyzing these data with machine learning algorithms, in a distributed and parallel way, in order to detect new hidden intrusions with less processing time. According to the results of the experiments, the detection accuracy of the machine learning methods reaches up to 99.9%, and their processing time is reduced considerably by applying them in a parallel and distributed way, which shows that our proposed model is very effective for the detection of new hidden intrusions.
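A minimal sketch of training a classifier on network traffic records in a distributed way with Spark is shown below; the file name, feature columns and model choice are assumptions and not the paper's exact pipeline.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("parallel-ids").getOrCreate()
df = spark.read.csv("network_traffic.csv", header=True, inferSchema=True)

# Assumed numeric feature columns and a numeric "label" column.
feature_cols = ["duration", "src_bytes", "dst_bytes", "count"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="label", numTrees=100).fit(train)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", metricName="accuracy").evaluate(model.transform(test))
print(f"detection accuracy: {accuracy:.4f}")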

GENE EXPRESSION DATA ANALYSIS USING DATA MINING ALGORITHMS FOR COLON CANCER

It is observed that rapid developments in the fields of genomics and proteomics have created huge amounts of biological data to be analyzed. Making sense of this large body of biological data, or analyzing it by inferring structure or generalizations from it, has great potential to increase the interaction between data mining and bioinformatics. Bioinformatics and data mining provide exciting and challenging research and application opportunities in the computational sciences [3].

