In Section 1.2, we discussed the goals that this dissertation addresses: multiple views of data, higher-order interactions, mixed data types, multivariate time series feature extraction, redundancy, and user-understandable correlation analysis. Achieving these goals, however, raises several challenges, which we briefly discuss in this section.
Challenge 1: Exponential search space
The estimation of multivariate correlations involves analyzing multiple feature subset combinations. For a dataset with 200 features, the number of possible feature subsets escalates to $2^{200}$. In real-world applications, the dimensionality can be much larger, and analyzing the relevance of every possible combination is computationally infeasible. Traditional approaches handle this problem by designing efficient search organization techniques [DV11, BALD14]. However, the challenge is not only to explore the exponential search space but also to infer maximum knowledge of the higher-order interactions while evaluating the subsets. Hence, it requires an ideal combination of a search organization technique and a correlation inference technique.
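To make the scale of this search space concrete, the following minimal sketch counts the subsets for 200 features and contrasts that with an exhaustive enumeration over four features. The function names and the evaluation rate of one billion subsets per second are assumptions chosen purely for illustration; they are not taken from the cited approaches.

```python
from itertools import combinations


def count_subsets(n_features: int) -> int:
    """Number of distinct feature subsets (including the empty set)."""
    return 2 ** n_features


def enumerate_subsets(features):
    """Yield every non-empty subset; only practical for very small inputs."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


# Four features: exhaustive enumeration is trivial (2^4 - 1 non-empty subsets).
small = list(enumerate_subsets(["f1", "f2", "f3", "f4"]))
print(len(small))

# 200 features: the count alone has dozens of digits.
total = count_subsets(200)
print(total)

# Hypothetical budget (an assumption): one billion subset evaluations per second.
EVALS_PER_SECOND = 1_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600
years = total / (EVALS_PER_SECOND * SECONDS_PER_YEAR)
print(f"about {years:.3e} years for an exhaustive search")
```

Even under this generous budget, exhaustive evaluation is out of reach, which is why the search organization matters as much as the correlation measure itself.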
Challenge 2: Multi-view
Considering multiple views (subsets) of the high-dimensional feature space for the evaluation of correlations may enhance the prediction quality by capturing the interactions between different dimensions. Traditional selection algorithms [KJ97, MBN02, YL03, LMD+12] aim to select a single feature subset by evaluating the correlation based on a single statistical property. However, evaluating multiple statistical properties in a dataset can enhance the prediction quality. Two major challenges exist in the process of selecting multiple views of a dataset. The first is to select the views that are relevant for a given task. The second is that multiple views capturing the same dependencies do not provide any novel information for the prediction task. Hence, it is necessary to ensure that the multiple views are complementary to each other. Overall, how to exploit the heterogeneity of the feature dependencies in multiple feature subsets to improve the prediction quality remains an open research question.
Challenge 3: Higher-order interactions
In complex systems, an individual feature exhibiting higher-order interactions influences the target differently when combined with different feature combinations [SBS+17, KRT+17]. In such cases, assigning a feature's importance without evaluating its higher-order interactions is misleading. Hence, it is necessary to assess the role of a feature by analyzing its interactions in multiple subsets. However, for high-dimensional datasets, it is not efficient to capture higher-order interactions through an exhaustive search [JBB15]. Traditional approaches [PLD05, Hal00] are based on pairwise correlation analysis and fail to capture interactions between more than two features. Thus, the challenge is to efficiently obtain a reasonable estimate of the feature relevance by evaluating its interactions in multiple subsets.
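A standard toy case of such an interaction is the XOR relation, sketched below. It is an illustration only, not an example from the cited works: the data, the helper names `pearson` and `determines`, and the choice of Pearson correlation as the pairwise measure are all assumptions. Each feature on its own shows zero correlation with the target, yet the two features together determine it completely.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def determines(rows, target):
    """True if every distinct row value maps to exactly one target value."""
    seen = defaultdict(set)
    for row, t in zip(rows, target):
        seen[row].add(t)
    return all(len(values) == 1 for values in seen.values())


# Full truth table of y = x1 XOR x2 (hypothetical toy data).
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

r1 = pearson(x1, y)  # pairwise view of x1 versus the target
r2 = pearson(x2, y)  # pairwise view of x2 versus the target
single_ok = determines(x1, y)               # x1 alone does not fix y
pair_ok = determines(list(zip(x1, x2)), y)  # the pair (x1, x2) does

print(r1, r2, single_ok, pair_ok)
```

A purely pairwise ranking would discard both features here, although the subset containing both is maximally informative.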
Challenge 4: Redundancy
In addition to estimating a feature's correlation to the target, it is also necessary to evaluate its non-redundancy with respect to other features in a subset. This ensures that each feature is not only relevant but also provides new information for a classification or regression algorithm. Experimental and empirical analyses in the feature selection literature show that removing redundant features improves the speed and accuracy of classification and regression algorithms [Hal00, KJ97, YL03, PLD05, DP05]. Because redundancy is not limited to two features being identical, the challenge is to quantify the magnitude of novelty that a feature contributes to the prediction task without actually training a prediction model.
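As a hedged illustration of redundancy without identity, the sketch below uses conditional entropy as one possible novelty measure. This measure, the helper names, and the toy features are assumptions made for the example and are not the method of the cited works. A relabeled copy of a feature is not identical to it, yet contributes zero bits of new information, whereas an unrelated feature contributes one full bit.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(candidate, given):
    """H(candidate | given) = H(candidate, given) - H(given), in bits."""
    return entropy(list(zip(candidate, given))) - entropy(given)


# Hypothetical toy features.
f1 = [0, 0, 1, 1, 2, 2]
f2 = ["a", "a", "b", "b", "c", "c"]  # relabeling of f1: different values, same information
f3 = [0, 1, 0, 1, 0, 1]              # varies independently of f1 in this sample

novelty_f2 = conditional_entropy(f2, f1)  # information in f2 not already in f1
novelty_f3 = conditional_entropy(f3, f1)  # information in f3 not already in f1

print(novelty_f2, novelty_f3)
```

No prediction model is trained at any point; the novelty estimate comes from the feature values alone.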
Challenge 5: Mixed datasets
Modern datasets usually contain a mixture of continuous and categorical data [SBS+17]. Traditional approaches transform a dataset with mixed data types into a homogeneous data type by discretizing the continuous features [Hal99]. Such transformation techniques avoid the necessity to treat each data type differently. However, discretization leads to information loss, and the effectiveness of the selected features is strongly influenced by the discretization method employed [JS02]. The other naïve preprocessing step is to encode the categories with numerical values. Such encodings are not meaningful because the categories represent a qualitative property, and assigning arbitrary numbers can be misleading: computing the distance between a categorical feature and the target can yield different results depending on how the categories are coded [HLY08]. Hence, the major challenge is to perform multivariate relevance and redundancy estimation of large datasets without the need for such additional preprocessing (i.e., discretization and encoding) techniques.
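The effect of arbitrary numeric codes can be seen in a small sketch. The feature values, the target, and both coding schemes below are hypothetical; Euclidean distance stands in for whichever distance measure an approach might use. The same categorical feature appears perfectly aligned with the target under one coding and far from it under another.

```python
from math import dist

# Hypothetical categorical feature and numeric target.
colors = ["red", "green", "blue", "red", "green", "blue"]
target = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

# Two equally arbitrary integer codings of the same categories.
coding_a = {"red": 1, "green": 2, "blue": 3}
coding_b = {"red": 3, "green": 1, "blue": 2}


def encode(values, coding):
    """Replace each category label by its numeric code."""
    return [float(coding[v]) for v in values]


# Euclidean distance between the encoded feature and the target.
dist_a = dist(encode(colors, coding_a), target)
dist_b = dist(encode(colors, coding_b), target)

print(dist_a, dist_b)
```

The underlying data are identical in both runs; only the arbitrary labels-to-numbers mapping changed, which is why such encodings can mislead a relevance estimate.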
Challenge 6: Multivariate time series correlation
The transformation of time series into static features is a prevalent concept in the literature. However, several existing approaches [NAM01, Mör03, WWW07, FJ14] do not address the multivariate nature of the time series. That is, by performing univariate transformations on multiple dimensions, they fail to encode the multivariate interactions of the time series into the resulting features. As explained in Section 1.2, in lengthy time series, only a subsequence is informative for the prediction task. In a multivariate time series, including the co-occurrences of subsequences from multiple dimensions leads to an exponentially growing number of subsequence combinations to evaluate. However, not all of them are relevant for the target and non-redundant to the already selected subsequences. In such a scenario, traditional approaches perform unsupervised feature transformation [NAM01, WSH06, WWW07, LKL12, Kat16], which may generate irrelevant and redundant features. Hence, the challenge lies in identifying the relevance of time series subsequences by evaluating their multivariate nature and non-redundancy.
Challenge 7: Efficiency
Both feature selection and extraction involve the analysis of an exponentially growing number of feature subset or subsequence combinations. In addition, several application domains incorporate new sensors and collect more data from existing sensors [LP03]. This directly leads to increasing dimensionality and database size. Regardless of the increasing volume of data, it is necessary to perform the correlation analysis efficiently. For this reason, several feature selection and extraction algorithms focus on algorithmic efficiency in addition to quality [MBN02, Fle04, YL03, BPZL12]. The computational efficiency of the correlation function and the search organization technique together determine the runtime of a feature selection algorithm. Hence, the challenge is to compute the feature dependencies with efficient correlation functions and scalable search space exploration techniques.
Challenge 8: Understanding multivariate correlations
To inspect the results of a feature selection algorithm, domain experts need to understand the algorithm. Conventional selection algorithms [KJ97, MBN02, YL03, LMD+12, DP05, PLD05] provide only a set of highly correlated features and do not show a summary of all correlations in a high-dimensional dataset. Though a few approaches aid in the understanding of bivariate correlations, difficulty arises when trying to comprehend the dependencies between more than two features. Overall, the dimensionality and the complex dependencies in the data impair an expert's understanding of the correlations. Hence, making multivariate correlation analysis a transparent process is still an unresolved problem. In high-dimensional datasets, there are both feature-to-target dependencies (i.e., relevance) and feature-to-feature dependencies (i.e., redundancy). For a dataset with hundreds of features, the major challenge is to visualize both kinds of dependencies and to explain the detected multivariate correlations to domain experts.