UCI datasets

Top PDF UCI datasets:

Classification Of Complex UCI Datasets Using Machine Learning And Evolutionary Algorithms

In this paper we compare the performance of various classifiers. Eleven benchmark datasets from the UCI repository are used for experimentation, with 10-fold cross-validation in each case. In terms of overall performance, that is, considering accuracy, time complexity, MAE and RMSE together, MLP, NaiveBayes, RandomForest, J48 and Genetic Programming (GP) perform comparatively better than the others on all datasets. According to the rankings, VFI performed best on iris; Logistic on abalone; NaiveBayes on labor and lung cancer; GP on contact-lense, hayesroth and statlog; MLP on Soybean; RandomForest on glass identification; J48 on vote; and NaiveBayes and J48 jointly on teaching assistant evaluation. ZeroR performed worst in almost all cases. Since this work is mainly concerned with GP, we conclude from the results section that the accuracy of GP is appreciable on almost all datasets except abalone, labor and teaching assistant evaluation, and is the highest on contact-lense, hayesroth and statlog. The performance of GP decreases slightly as the number of instances increases, because GP is an iterative process and the number of iterations grows with the number of instances; this is the case with the abalone dataset. The performance of GP also decreases for datasets with missing attribute values. The time complexity charts for the different datasets are similar: the bar for GP is the highest in every chart, so the time complexity of GP is the maximum in every case. This is because GP is an iterative process, the
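
The 10-fold evaluation mentioned in this abstract can be sketched as follows. This is a minimal illustration only, not the paper's experimental setup: the function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and only accuracy is computed, using a ZeroR-style majority-class baseline as the classifier.

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def zero_r(train_labels):
    """ZeroR baseline: always predict the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority

def cross_validate(features, labels, fit=zero_r, k=10):
    """Return the mean accuracy of `fit` over k folds (assumes len(labels) >= k)."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        model = fit([labels[i] for i in train])
        hits = sum(model(features[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

Any other classifier can be swapped in through the `fit` argument, provided it returns a callable that maps one feature vector to a predicted label.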

Classification of Complex UCI Datasets Using Machine Learning Algorithms Using Hadoop

It is a software library [6] of highly scalable frameworks that allows distributed processing of large datasets across clusters of computers using a simple programming model. Internally, Hadoop is a Java implementation of MapReduce [7], a popular software architecture that facilitates processing large amounts of data in a distributed fashion. The application creates its own mapper and reducer implementations, registers the mapper and reducer classes in a Hadoop job, indicates the location of the input and output, and passes the job to the Hadoop framework. The framework takes care of reading the data from the input location, invokes the mapper and reducer application classes when needed in a concurrent and distributed fashion, and writes the result to the output location. Hadoop input is always read from, and output always written to, the Hadoop Distributed File System (HDFS).
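
The mapper and reducer roles described above can be illustrated with a small single-process sketch. It is written in Python rather than against Hadoop's Java API, and the function names and the comma-separated record layout (class label in the last field) are assumptions made for illustration only.

```python
from collections import defaultdict

def mapper(record):
    """Emit (class_label, 1) for one comma-separated record; the label is the last field."""
    label = record.strip().split(",")[-1]
    yield label, 1

def reducer(key, values):
    """Sum the counts emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Single-process stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())
```

In a real cluster the grouping step is performed by the framework across machines; here `run_job` only mimics that flow in memory to show which part the application supplies.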

Classification of remote sensed data using Artificial Bee Colony algorithm

To evaluate the performance of the data, selection of points from datasets is stored in the UCI datasets for training and signatures are controlled by the size of a colony (land cover [r]

Disease Diagnosis Using Machine Learning Techniques: A Review and Classification

based on the limitations of this review. The review in this study provides recommendations for researchers in their future studies. Based on the results of our review, we recommend that researchers use not only UCI datasets but also large datasets from other machine learning repositories in their experiments, to ensure the effectiveness of the developed method. Focusing on large datasets not only allows them to test the accuracy of the adopted classification, but can also motivate them to develop incremental learning methods that address time complexity. We hope that this research provides useful information about data mining techniques and their application in disease diagnosis, and helps researchers develop medical decision support systems with insight into the state of the art of development methods.

Coupled Kernel Ensemble Regression

In this paper, we investigated how to combine a set of kernel regressors into a unified ensemble regression framework. The framework can simultaneously couple multiple kernel regressors by minimizing the total loss of the ensemble in a Reproducing Kernel Hilbert Space. In this way, a kernel regressor that fits the data more precisely obtains a larger weight, which leads to better overall ensemble performance. Experimental results on several UCI datasets for regression and classification, compared with several single and ensemble models such as Gradient Boosting (GB), Tree Regression (TR), Support Vector Regression (SVR), Ridge Regression (RR) and Random Forest (RF), illustrate that the proposed method achieves the best performance among the compared methods.

A Novel Hybrid Whale Optimization Algorithm with Flower Pollination Algorithm for Feature Selection: Case Study Email Spam Detection

Feature selection (FS) is one of the most challenging and important activities in data mining and pattern recognition. The FS problem is to find the most important subset of the original attributes in a specific domain; its main purpose is to remove redundant or irrelevant features and ultimately improve the accuracy of classification algorithms. FS can therefore be treated as an optimization problem and solved with metaheuristic algorithms. In this paper, a new hybrid model named HWOAFPA, which combines the whale optimization algorithm (WOA) and the flower pollination algorithm (FPA), is presented for the FS problem based on the concept of Opposition-Based Learning (OBL). The proposed method uses the natural processes of WOA and FPA to solve the FS optimization problem, and uses an OBL method to ensure the convergence rate and accuracy of the algorithm. WOA creates solutions in its search space using the prey-encircling, bubble-net attack and search-for-prey processes, and tries to improve the solutions to the FS problem; alongside it, FPA improves the solution with its global and local search processes in the space opposite to the WOA solutions. In effect, we draw candidate solutions to the FS problem from both the solution search space and its opposite. To evaluate the performance of the proposed algorithm, experiments were carried out in two steps. In the first step, the experiments were performed on 10 FS datasets from the UCI data repository. In the second step, we tested the performance of the proposed algorithm on spam e-mail detection.
The results of the first step showed that the proposed algorithm, run on 10 UCI datasets, was more successful than other basic metaheuristic algorithms in terms of the average selection size and classification accuracy. The results of the second step showed that the proposed algorithm, run on the spam e-mail dataset, detected spam e-mails much more accurately than other similar algorithms.
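
The opposition-based step referred to in this abstract can be sketched with the usual definition of an opposite point, where each coordinate x in the range [lower, upper] is mirrored to lower + upper - x. This is a generic illustration under that assumption, not the HWOAFPA implementation; the function names and the convention that lower fitness is better are hypothetical.

```python
def opposite(solution, lower, upper):
    """Opposite point of `solution` inside the box [lower, upper], per dimension."""
    return [lo + hi - x for x, lo, hi in zip(solution, lower, upper)]

def keep_better(solution, lower, upper, fitness):
    """Evaluate a candidate and its opposite; return the one with the lower fitness."""
    mirrored = opposite(solution, lower, upper)
    return min(solution, mirrored, key=fitness)
```

Evaluating both a candidate and its mirror image costs one extra fitness call per candidate, which is the trade-off such schemes accept in exchange for covering the search space from both sides.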

Structural Regular Multiple Criteria Linear Programming for Classification Problem

In this subsection, we apply these methods to the UCI datasets [21]. For each dataset, we randomly select the same number of samples from each class to compose a dataset; 50% of each extracted dataset is used for training and 50% for testing. The results are shown in Table 1, from which we draw the following conclusions: 1) SRMCLP and SRSVM have better predictive ability than RMCLP in all cases. This shows that the prior structural information embedded in the classes greatly helps to improve classification performance. 2) SRMCLP is superior to SRSVM in most cases. This shows that SRMCLP is a strongly competitive method.
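
The sampling protocol described here (equal numbers of samples per class, then a 50/50 split) might look like the sketch below. The excerpt does not say how many samples are drawn per class, so using the size of the smallest class is an assumption, as are the function name and the fixed seed.

```python
import random
from collections import defaultdict

def balanced_half_split(samples, labels, seed=0):
    """Draw equally many samples per class, then split 50% train / 50% test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append((sample, label))
    # Assumption: every class contributes as many samples as the smallest class has.
    per_class = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, per_class))
    rng.shuffle(balanced)
    half = len(balanced) // 2
    return balanced[:half], balanced[half:]
```

Note that only the combined set is guaranteed to be class-balanced; the two halves are random and may differ slightly in class proportions.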

A Repository of Conversational Datasets

Progress in Machine Learning is often driven by the availability of large datasets, and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.
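
One plausible reading of the 1-of-100 accuracy metric is sketched below: within a batch of 100 (context, response) pairs, each context is scored against all 100 responses, and a hit is counted when its own response scores highest. That reading, the function name and the score-matrix layout are assumptions for illustration; the repository's own scripts define the actual procedure.

```python
def one_of_n_accuracy(scores):
    """Fraction of contexts whose true response gets the highest score in its row.

    scores[i][j] is the model score of context i against response j, and the
    true response for context i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(scores):
        best = max(range(len(row)), key=row.__getitem__)
        hits += (best == i)
    return hits / len(scores)
```

With a 100 x 100 score matrix this yields the 1-of-100 figure; the same function works for any square batch size.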

HW 5, Problems 5.4 EECS 203A, UCI, Fall 2004

I downloaded the image from the textbook website and used IrfanView to get the image information and find how many pixels the whole image has, then read it into Mathematica to display i[r]

Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition

The dataset contains a total of 240 clips with 66919 frames. The number of actors in the dataset is 10 and they performed each action 5–10 times. All the videos are provided with 1920 × 1080 resolution and 25 fps. The average duration of each action was 11.15 sec. A summary of the dataset is given in Table 1. The total clip length and mean clip length of each class are represented on the left side (blue) and right side (amber) bar graphs of Figure 3 respectively. In Table 2, we compare our dataset with eight recently published video datasets. These datasets have helped progress research in action recognition, gesture recognition, event recognition, and object tracking.
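
The reported figures are mutually consistent, as a quick check shows. The sketch assumes every clip is recorded at the stated 25 fps and that each clip contains a single action, so that the mean clip length equals the mean action duration.

```python
frames, fps, clips = 66919, 25, 240

total_seconds = frames / fps               # total footage length in seconds
mean_clip_seconds = total_seconds / clips  # average length of one clip

print(round(total_seconds, 2), round(mean_clip_seconds, 2))
```

The mean clip length comes out at about 11.15 seconds, matching the average action duration stated in the abstract.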

The SpeDial datasets: datasets for Spoken Dialogue Systems analytics

In Let’s Go, 1 was used when anger was detected and 0 otherwise. The labels used in the Movie Ticketing (MT) data were discrete scores in the interval [1, 5], ranging from very angry user utterances (1) to friendly utterances (5). To adopt the same scheme across datasets, values in [1, 3] were mapped to 1 and the values 4 and 5 were mapped to 0. The presence of anger was always signaled by shouting or the use of offensive language. However, there are other ways for users to express their (dis)satisfaction with the system. Therefore, Satisfaction was also annotated as a subjective measure of the user experience. As expected, all angry turns are also part of the dissatisfied turns. For Let’s Go, 0 was used when the user was satisfied and 1 when she was not. The MT data was annotated on a five-point scale from 1 (very unsatisfied) to 5 (very satisfied).
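
The anger-label harmonisation described above amounts to a simple threshold. A minimal sketch, with a hypothetical function name:

```python
def anger_to_binary(score):
    """Map a Movie Ticketing anger score (1 = very angry, 5 = friendly)
    to the Let's Go scheme (1 = anger detected, 0 = no anger)."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be in 1..5, got {score}")
    return 1 if score <= 3 else 0
```

Only the anger mapping is shown, because the excerpt gives no corresponding conversion between the two satisfaction scales.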

Early Diagnosis of Coronary Artery Disease using UCI Data set

In recent years, computerized diagnostic systems have been designed to enhance the ability of cardiologists to detect CAD patients with a higher accuracy rate [9]. Numerous studies on the diagnosis of heart disease have motivated this research, which aims to enhance the performance of CAD detection systems. Different feature extraction and selection algorithms have been integrated with classifiers such as Artificial Neural Networks, Support Vector Machines, Decision Trees and clustering approaches. This research focuses entirely on optimizing the classification architecture by reducing complexity and processing overhead in detecting pattern behavior for disease identification. Our research group is applying the proposed methodology to the standard UCI Heart datasets to extract their sensitive features and enhance the detection rate of CAD and normal cases.

Synthetic datasets

The easy way to do the elimination is to randomly eliminate each row of a type I non-symmetric synthetic dataset with a certain probability. However, after conducting this elimination on some datasets, we observed that the size of views in the new datasets does not change much. Thus, we derive the following elimination procedure for obtaining a type II non-symmetric synthetic dataset.
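
The simple per-row elimination mentioned at the start of this excerpt (not the type II procedure, which the excerpt does not include) can be sketched as follows. The function name, the list-of-rows representation and the optional seed are assumptions.

```python
import random

def eliminate_rows(rows, p, seed=None):
    """Drop each row independently with probability p; keep the rest in order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() >= p]
```

Since every row is dropped independently, the expected number of surviving rows is (1 - p) times the original count.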

SR 0011K CRAY OS Version 1 Reference Jul82 pdf

DATASET TYPES Temporary datasets Local datasets Mass storage permanent data sets • Magnetic tape datasets EXECUTE-ONLY DATASETS MEMORY-RESIDENT DATASETS • INTERACTIVE DATASETS • DATASET [r]

Teaching Contract Law through Common Law Analysis: The UCI Law Experiment

Teaching Contract Law through Common Law Analysis: The UCI Law Experiment. SMU Law Review, Volume 66, Issue 2, Article 3 (2013). Greg[.]

UNIVERSITY OF CALIFORNIA, IRVINE MANAGED BY UCI FACILITIES MANAGEMENT

It is just as important to educate about our diversion plan as it is to have one. UCI needs a comprehensive website that allows our students, faculty, and community to understand how UCI recycles and learn about ways they can participate in the recycling program. Our current website (www.sustainability.uci.edu) is shared with the UCI Sustainability website, but is not as interactive or as comprehensive as is necessary.

Visualisation of bioinformatics datasets

The distributional assumption for the noise model is important both for the type of data the model can handle and for the link function required to model different types of data (Kabán and Girolami, 2001). It has been argued (Kabán and Girolami, 2001; Bishop et al., 1998) that continuous data can be modelled by assuming independent and identically distributed (i.i.d.) Gaussian noise, which gives a tractable analytical solution but is not considered suitable for discrete datasets. For example, in the Gaussian case, which is appropriate for continuous features, the link function is a linear regression function of the latent vectors with a weight matrix (see Equation (6.1)). However, for simplicity and generality, an exponential family of distributions is assumed for the noise model in the derivation of the LTM algorithm (a model developed mainly for datasets with discrete features), so that different types of features in a dataset can be handled. The same idea of using an exponential family of distributions is adopted here for modelling mixed-type data under the latent variable framework, where we apply it to each type-specific subset of features (i.e. x^R, x^B or x^C). For simplicity we write x^M, where the superscript M can be replaced with R, B or C to indicate the type of the feature subset of a data point x. The functional form of the exponential family of distributions can be defined by

Interpolation of paleoclimatology datasets

Paleoclimatology data include measures of the amount of carbon dioxide in the atmosphere and the level and temperature of the oceans, among others. Recent climate records were taken at equidistant times, and the different variables were typically measured at the same time to allow association studies among them. However, there are no recorded measurements of climate data from thousands or millions of years ago, so scientists have had to devise alternative ways of measuring these quantities. These methods usually rely on indirect measurements, such as ice coring, where both the variable of interest and the time have to be estimated. As a result, paleoclimate data are a collection of time series whose observations are unequally spaced. Here we review a Bayesian statistical method to produce equally spaced series and apply it to three paleoclimatology datasets that span from 300 million years ago to the present.

Visualization of Online Datasets

As computing technology advances, computers are being used to orchestrate and advance wide spectrums of commercial and personal life. Information visualization becomes even more significant as we enter the era of big data, which leads to an economy heavily reliant on data mining and precise, meaningful visualizations. However, the accuracy of information visualization techniques depends heavily on the knowledge and capabilities of users, leaving novices in many fields at a disadvantage. This challenging problem has been inadequately addressed despite the influx of visualization tools. Therefore, this paper proposes a novel approach focused on online datasets that allows users to visualize datasets automatically and accurately. Experimental results show that, using a browser extension and specially created HTML tables whose custom attributes state the data attribute type, the approach is able to detect and present the most suitable visualizations at the click of a mouse. The proposed approach provides a means for novices to quickly and accurately visualize online datasets.

On the Classification of Imbalanced Datasets

In many real-world applications, an uneven distribution of data patterns is very common. In these cases the number of training samples of a minority class is much smaller than that of the majority classes. Microcalcification classification is one classical example of the imbalanced data problem. In this paper, we briefly review state-of-the-art techniques for imbalanced datasets and investigate the performance of different methods for handling data imbalance in microcalcification classification. A range of alternative classifiers is selected, and datasets are prepared or sampled in order to assess the performance. Future research can focus on medical images with intractable geometric complexity in data classification.