3.2 Malware Fingerprints
3.2.2 Malware Detection Framework
In this section, we leverage the proposed APK-DNA fingerprint for Android malware detection. More precisely, we present i) the family-fingerprinting approach, where we define and use a family fingerprint, and ii) the peer-matching approach, where we compute the similarity between malware fingerprints. Both approaches rely on the peer-fingerprint-voting mechanism to decide on malware detection and family attribution.
Peer Fingerprint Voting
As we have seen in Section 3.2.1, comparing two Android malware packages consists of computing the similarities between their metadata, binary, and assembly sub-fingerprints. Each similarity is a numerical value that indicates how similar the two packages are in a specific content category, as presented in Algorithm 2. In addition, we add the summation of all the similarities as a summary value of these sub-content similarities. Note that other summary values, such as the average or the maximum, could also be used. However, it is challenging to identify the most similar package when we compare an unknown package to known malware packages using multiple sub-fingerprints. The most obvious solution is to merge the bit-vectors of all content categories into one vector and then compute the similarity of the resulting feature vector. In our case, however, merging the bit-vectors would heavily reduce the contribution of some sub-fingerprints to the similarity computation. In particular, the assembly feature vector is considerably less dense than the binary feature vector. Consequently, we propose to use a composed similarity based on peer-fingerprint voting.
Algorithm 2: APK-DNA Similarity Computation
input : APK-DNA A: list
        APK-DNA B: list
output: similarity-list: list
similarity-list = empty-list();
for content in content categories do
    similarity = Jaccard bitwise(A[content], B[content]);
    similarity-list.add(similarity);
end
summation = sum(similarity-list);
similarity-list.add(summation);
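The similarity computation of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: each sub-fingerprint is modeled as a Python integer used as a bit-vector, the bitwise Jaccard similarity is taken to be the number of bits set in both vectors divided by the number of bits set in either one, and the names `CATEGORIES`, `jaccard_bitwise`, and `composed_similarity` are hypothetical.

```python
# Assumed keys for the three content categories of a fingerprint.
CATEGORIES = ("meta", "bin", "asm")


def jaccard_bitwise(a: int, b: int) -> float:
    """Bitwise Jaccard similarity of two bit-vectors stored as integers."""
    union = bin(a | b).count("1")
    if union == 0:
        # Assumption: two empty bit-vectors are treated as dissimilar.
        return 0.0
    return bin(a & b).count("1") / union


def composed_similarity(fp_a: dict, fp_b: dict) -> list:
    """Per-category similarities followed by their summation."""
    sims = [jaccard_bitwise(fp_a[c], fp_b[c]) for c in CATEGORIES]
    sims.append(sum(sims))  # summary value appended last, as in Algorithm 2
    return sims


# Toy fingerprints, invented for illustration only.
fp_a = {"meta": 0b1100, "bin": 0b1111, "asm": 0b0001}
fp_b = {"meta": 0b0100, "bin": 0b1111, "asm": 0b0010}
print(composed_similarity(fp_a, fp_b))
```

Keeping the per-category values separate, rather than merging the bit-vectors, is what allows a sparse category to weigh as much as a dense one in the later comparison.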
Algorithm 3: Peer-Fingerprint Voting Mechanism
input : similarity-list A-B: list
        similarity-list A-C: list
output: Decision
A-B-count = 0;
A-C-count = 0;
for content in content categories do
    if A-B[content] > A-C[content] then
        A-B-count += 1;
    else
        A-C-count += 1;
    end
end
if A-B-count > A-C-count then
    Decision = A-B;
else
    Decision = A-C;
end
The idea is to compare parts (sub-fingerprints) instead of full fingerprints, as depicted in Algorithm 3. In other words, we examine each pair of sub-similarities, and the decision is made by a voting mechanism on the results of these sub-comparisons. Moreover, in case of equal votes, we compare the summation of the sub-similarities to remove the ambiguity, as shown in the example depicted in Figure 3.5. At this stage, we can compare different Android packages and decide which package is the most similar to a given one. In what follows, we propose two approaches to malware detection.
[Figure: the MetaData, Binary, Assembly, and Summation values of Similarity 1 and Similarity 2 are compared pairwise (>, <, =) to select the more similar of the two.]
Figure 3.5: Peer-Fingerprint Voting
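The voting step, including the tie-break on the summation, can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that each similarity list holds the per-category values followed by the summation as its last element, and that a category in which both values are equal gives a vote to neither side. The function name `peer_vote` is hypothetical.

```python
def peer_vote(sim_ab: list, sim_ac: list) -> str:
    """Decide whether A is closer to B ("A-B") or to C ("A-C").

    Each argument is a list of per-category similarities followed by
    their summation as the last element (an assumed layout).
    """
    *subs_ab, total_ab = sim_ab
    *subs_ac, total_ac = sim_ac

    # One vote per content category; equal values give no vote (assumption).
    votes_ab = sum(x > y for x, y in zip(subs_ab, subs_ac))
    votes_ac = sum(y > x for x, y in zip(subs_ab, subs_ac))

    if votes_ab != votes_ac:
        return "A-B" if votes_ab > votes_ac else "A-C"
    # Equal votes: fall back to the summation to remove the ambiguity.
    return "A-B" if total_ab > total_ac else "A-C"


# One vote each plus one equal category, so the summation decides.
print(peer_vote([0.9, 0.2, 0.5, 1.6], [0.4, 0.6, 0.5, 1.5]))
```

If the votes and the summations are both equal, this sketch returns "A-C", mirroring the final else branch of Algorithm 3.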
Peer Matching
In the peer-matching approach, ROAR queries the fingerprints database for the most similar malware fingerprint. To detect variations of Android malware, we build a malware fingerprint database by computing the APK-DNA of known Android malware. The more fuzzy fingerprints this database contains, the broader the coverage of the detection system. As shown in Figure 3.6, for each new malware, we compute its APK-DNA and add it to the database.
[Figure: the APK-DNA generated for a new app is compared with the entries of the malware fingerprints database (Malware-DNA 1, Malware-DNA 2, Malware-DNA 3, Malware-DNA 4, …) for detection and family attribution; the APK-DNA of each new malware is used to update the database.]
Figure 3.6: Malware Detection Using Peer-Matching
To attribute a malware family to a new app, we first compute the similarity between its fingerprint and each entry in the database of known malware fingerprints, as depicted in Figure 3.6. To this end, we use the bitwise Jaccard similarity, presented in Section 3.2.1, to produce a set of sub-similarity values, i.e., the composed similarity. Afterwards, to compare the composed similarity values, we use the previously presented peer-voting technique. The entry with the highest similarity value that exceeds an acceptance threshold determines the malware family. In the current implementation, we use an experimentally derived static threshold. As such, peer-matching is a simple approach for malware detection and family attribution.
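The peer-matching lookup can be sketched as below. Several points are assumptions made for illustration and are not confirmed by the text: fingerprints are dictionaries of integers used as bit-vectors, the database is a list of (family, fingerprint) pairs, the best entry is found by a sequential pass in which each candidate challenges the current best through voting, and the acceptance threshold is applied to the summation. All names and the threshold values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Fingerprint = Dict[str, int]
CATEGORIES = ("meta", "bin", "asm")  # assumed category keys


def jaccard_bitwise(a: int, b: int) -> float:
    """Bitwise Jaccard similarity of two integer bit-vectors."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def composed_similarity(fp_a: Fingerprint, fp_b: Fingerprint) -> List[float]:
    """Per-category similarities followed by their summation."""
    sims = [jaccard_bitwise(fp_a[c], fp_b[c]) for c in CATEGORIES]
    return sims + [sum(sims)]


def wins(challenger: List[float], best: List[float]) -> bool:
    """True if the challenger beats the current best by peer voting."""
    pairs = list(zip(challenger[:-1], best[:-1]))
    votes_new = sum(x > y for x, y in pairs)
    votes_best = sum(y > x for x, y in pairs)
    if votes_new != votes_best:
        return votes_new > votes_best
    return challenger[-1] > best[-1]  # equal votes: compare summations


def peer_match(app_fp: Fingerprint,
               database: List[Tuple[str, Fingerprint]],
               threshold: float) -> Optional[str]:
    """Return the family of the most similar entry, or None if rejected."""
    best_family, best_sim = None, None
    for family, fp in database:
        sim = composed_similarity(app_fp, fp)
        if best_sim is None or wins(sim, best_sim):
            best_family, best_sim = family, sim
    # Assumption: the static threshold is checked against the summation.
    if best_sim is not None and best_sim[-1] > threshold:
        return best_family
    return None


# Toy database and app fingerprint, invented for illustration only.
db = [
    ("FamA", {"meta": 0b1111, "bin": 0b1010, "asm": 0b0011}),
    ("FamB", {"meta": 0b0001, "bin": 0b0100, "asm": 0b1000}),
]
app = {"meta": 0b1111, "bin": 0b1010, "asm": 0b0001}
print(peer_match(app, db, threshold=1.5))
```

Because pairwise voting is not guaranteed to be transitive, the sequential pass is only one possible way to pick a winner; it costs one similarity computation per database entry.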
Family-Fingerprinting
In this approach, some extra steps are needed to build a second database of malware family fingerprints. The aim is to reduce the number of database entries that must be compared in order to match an Android malware fingerprint. For this reason, we propose a custom approximate fingerprint for a malware family and leverage it for malware detection. The idea is to build a database of family fingerprints from known Android malware samples and to use this database for similarity computation with unknown apps. The size of the family-fingerprints database is bounded by the number of malware families. Notice that the fingerprint structure for a malware family is the same as for a single malware, i.e., metadata, binary, and assembly family sub-fingerprints.
Algorithm 4: Family Fingerprint Computation
input : Malware Family X Fingerprints: Set
output: Family X Fingerprint: FP X
FP X = new bitvector[Zeros];
for fprint in Fingerprints do
    FP X{meta} = FP X{meta} or fprint{meta};
    FP X{bin} = FP X{bin} or fprint{bin};
    FP X{asm} = FP X{asm} or fprint{asm};
end
Algorithm 4 depicts the computation of the family fingerprint based on the underlying content sub-fingerprints. First, each content sub-fingerprint of the family fingerprint is initialized to zeros. Afterward, the fingerprint is generated by applying a logical OR between the current value of the family fingerprint and a single malware fingerprint. Note that each content sub-fingerprint is computed separately. This operation is applied to all malware samples in the database. After calculating the fingerprints from known malware samples, we store them in a family-fingerprint database, which is used for detection and family attribution.
The detection process is composed of several steps. First, for a given Android package, we generate its fingerprint as described in Section 3.2.1. Then, we compute the similarity between this fingerprint and each family fingerprint in the database. The family with the highest similarity score is chosen as the family of the new app if the similarity value is above a defined threshold. In the current implementation, we use an experimentally derived static threshold, which is applied only to the summation part of the composed similarity. The result is similar to the single malware fingerprint, but it represents a malware family instead of a particular malware.
[Figure: family fingerprints (Family-DNA 1, Family-DNA 2, Family-DNA 3, Family-DNA 4) are computed and updated from the malware fingerprints database and stored in the family fingerprints database; the APK-DNA generated for a new app is compared with them for detection and family attribution.]
Figure 3.7: Malware Detection Using Family-Fingerprint
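The family-fingerprinting approach can be illustrated with the following minimal sketch. It assumes that sub-fingerprints are integers used as bit-vectors and, for brevity, ranks families by the summation of the sub-similarities only, which is a simplification of the composed similarity described above. The names `family_fingerprint` and `attribute_family`, the toy samples, and the threshold values are hypothetical.

```python
CATEGORIES = ("meta", "bin", "asm")  # assumed category keys


def family_fingerprint(fingerprints: list) -> dict:
    """OR together the sub-fingerprints of all samples of one family."""
    family = {c: 0 for c in CATEGORIES}  # initialized to zeros
    for fp in fingerprints:
        for c in CATEGORIES:
            family[c] |= fp[c]  # each category is accumulated separately
    return family


def jaccard_bitwise(a: int, b: int) -> float:
    """Bitwise Jaccard similarity of two integer bit-vectors."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def attribute_family(app_fp: dict, family_db: dict, threshold: float):
    """Return the best-matching family name, or None below the threshold.

    Simplification: families are ranked by the summation alone, and the
    static threshold is applied to that summation.
    """
    best_family, best_total = None, threshold
    for name, fam_fp in family_db.items():
        total = sum(jaccard_bitwise(app_fp[c], fam_fp[c]) for c in CATEGORIES)
        if total > best_total:
            best_family, best_total = name, total
    return best_family


# Toy samples of a hypothetical family "X", invented for illustration.
samples = [
    {"meta": 0b0011, "bin": 0b0101, "asm": 0b0001},
    {"meta": 0b0110, "bin": 0b0100, "asm": 0b1000},
]
family_db = {"X": family_fingerprint(samples)}
print(family_db["X"])
print(attribute_family(samples[0], family_db, threshold=1.0))
```

Since the OR only ever sets bits, a family fingerprint grows denser as samples are added, which is why it acts as an approximate summary of the family rather than an exact match for any single sample.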