• No results found

Background on Multimodal Learning

10.2.1

What is Multimodal Learning?

Multimodal learning involves relating information from multiple sources [221]. Each of these sources is known as a modality [79]. The objective of multimodal

Chapter 10. Malware Detection Using All Modalities 101

learning is to build models that can process and relate information from multiple modalities. However, the research field of multimodal learning brings some unique challenges given the heterogeneity of the data.

Baltruˇsaitis et al. [79] identified and explored five challenges related to mul- timodal learning: (1) representation; (2) translation; (3) alignment; (4) fusion; and (5) co-learning. Representation refers to how represent and summarize multi- modal data in a way that exploits the complementarity and redundancy of multiple modalities. Translation answers how to map data from one modality to another. Alignment is to identify the direct relations between sub-elements from two or more different modalities. Fusion refers to combine information from two or more modalities to perform a prediction. Co-learning explores how knowledge learning from one modality can help a computational model trained on a different modality. From these multimodal challenges, in this dissertation we explore multimodal fusion for malware detection.

10.2.2

Multimodal Fusion

Multimodal data fusion is the process of integrating information from multiple modalities with the goal of predicting an outcome (e.g., malware versus non- malicious software) through classification or regression [79, 74]. The interest in multimodal fusion arises due to several advantages: (1) Having access to multiple modalities that observe the same event may allow robust predictions and might also allows us to capture complementary information; (2) A multimodal approach can still operate when one of the modalities is missing or has been compromised (e.g., an attacker modifies a modality). Next, we describe two multimodal fusion levels: feature level and decision level fusion.

10.2.3

Levels of Multimodal Fusion

There exists two levels of multimodal fusion: feature level and decision level. Fea- ture level (also known as early fusion) is the most widely used approach as it fuses

all the extracted features into one feature vector. Feature level fusion is accom- plished by simply combining the feature sets from different modalities. Let us suppose thatX={x1, x2, x3, ..., xn}andY ={y1, y2, y3, ..., yn}are feature vectors

(X ∈Rm and Y Rm) representing the information extracted from two different

modalities. The objective is to combine these two feature sets in order to obtain a new feature vectorZthat would be used for classification. An advantage of feature level fusion is that it uses the correlation between multiple features from different modalities at an early stage, which helps to perform tasks better. However, when doing feature level fusion is hard to represent the time synchronization between the multimodal features [296]. To avoid this issue, features should be represented in the same format before performing the fusion.

Decision level fusion (also known as late fusion) fuses multiple modalities in the semantic space [74]. Here, decisions are combined using a decision fusion unit to make a fused decision vector that is analyzed further to obtain a final decision D

about the task or hypothesis. Unlike feature level fusion, the decisions usually have the same format representation. Furthermore, decision level fusion offers scalability in terms of the modalities used during the fusion process, which is difficult to achieve in the feature level fusion [75]. Another advantage of decision level fusion, is that it allows us to use the most suitable methods for analyzing each modality. However, a disadvantage of decision level fusion is that as different learners are used to obtain the local decisions, the learning process for them becomes time consuming. Besides feature level and decision level fusion, some prior works have also used a hybrid approach by performing fusion on both feature level and decision level. The idea behind the hybrid fusion approach is to use the advantages of both early and late fusion strategies. Some prior works that used hybrid fusion focused on multimedia analysis [85, 223, 298].

The fusion of multiple modalities provides complementary information and it is recognized by prior works in multimedia analysis [74] and pattern recognition [79] to increase the classification performance. We used the findings from previous mul- timodal approaches on multimedia analysis and pattern recognition as motivation

Chapter 10. Malware Detection Using All Modalities 103

to explore the effectiveness of multimodal data fusion for malware detection.

10.2.4

Data Fusion Techniques

Some previous works have categorized distinct multimodal data fusion tech- niques [79, 127, 74]. Baltruˇsaitis et al. [79] divided the multimodal fusion techniques into two categories: model-agnostic approaches and model-based ap- proaches. Model agnostic approaches include fusion techniques, such as averag- ing, voting schemes, weighting-based, or a learned model; while model-based ap- proaches include fusion techniques like kernel-based methods, graphical models, and neural networks.

Durrant-Whyte et al. [127] described some probabilistic methods that are com- monly employed for data fusion in robotics. Most of these methods were based on the Bayes rule for combining prior and observation information. Typically, the Bayes rule can be implemented by using the Kalman and extended Kalman filters through sequential Monte Carlo methods or through the use of functional density estimates. However, there are several limitations when using probabilistic tech- niques: (1) complexity; (2) inconsistency; and (3) precision of models. To address these limitations in multimodal data fusion, they recommended using techniques like interval calculus, fuzzy logic and/or Dempster-Shafer methods.

Atrey et al. [74] presented three categories for multimodal data fusion tech- niques: rule-based methods, classification-based methods, and estimation-based methods. Rule-based fusion methods includes a variety of statistical rule-based methods like linear weighted fusion and majority voting. Linear weighted fusion is one of the simplest and widely used method, in which the information is combined in a linear fashion. Some previous works used the linear fusion strategy at the feature level (i.e., for video surveillance and traffic monitoring [175, 288]) and deci- sion level (i.e., for speaker recognition and speech event detection [100]) to perform multimedia analysis tasks. In the case of majority voting, the final decision is the one where the majority of the learners reach a similar decision [228]. Some fusion

methods under the classification-based category are the support vector machine, Bayesian inference, Dempster Shafer theory, dynamic Bayesian networks, neural networks, and the maximum entropy model. While the estimation category in- cludes fusion techniques, such as the Kalman filter, extended Kalman filter, and particle filter fusion methods.

From these data fusion techniques, we decided to used a deep neural network because they are a commonly used in the multimodal domain. For instance, mul- timodal neural networks have been explored for multimedia analysis tasks [142].