6.2.1 Android App Repackaging/Plagiarism Detection
Desnos [62] defines a Dalvik bytecode similarity metric based on Normalized Compression Distance (NCD) [48] and applies it to find lists of identical, similar, and new/deleted methods between a pair of Android apps. NCD is a symmetric similarity metric between a pair of strings, based on the idea that compressing the concatenation of two similar strings produces a shorter output than compressing the concatenation of two dissimilar strings. The essence of this work is:
1. For each method of an APK sample, generate a string that encodes a (linearized) control flow graph, external API (Android/Java) invocations, and Java exceptions.
2. Use NCD to compute the similarity between a pair of such strings.
3. Similar strings correspond to similar methods.
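As a minimal sketch of the NCD idea, the snippet below uses the common formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the per-method signature strings are assumptions made for illustration; they are not taken from [62].

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input (the compressor is an assumed choice)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical per-method signature strings (control flow, API calls, exceptions).
method_a = b"B[I P1{Ljava/lang/String;->length()I} R] " * 40
method_b = method_a + b"B[G P1{Landroid/util/Log;->d} R] "   # lightly modified copy
unrelated = bytes(random.Random(0).randrange(256) for _ in range(2000))

print(ncd(method_a, method_b), ncd(method_a, unrelated))
```

With a real compressor the value is only approximately symmetric and can slightly exceed 1, so any threshold on it would need to be tuned empirically.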
Zhou et al. adapt Context Triggered Piecewise Hashing (CTPH) [120] in their DroidMOSS [233] system to measure the pairwise similarity of apps for detecting app repackaging. The essence of this work is:
1. Split each long sequence of executable code into short sequences whenever the hash value of a sliding window is prime; this procedure is named “resetting” in the paper.
2. Local changes, such as the insertion of an invocation instruction in a repackaged app, are localized by the sliding window and the resetting criterion.
3. Hashes of the short sequences are concatenated, and the edit distance [36] between two such concatenated hashes measures the similarity of the corresponding code.
4. Only the opcode of each code piece, rather than the easily replaceable operands, is used in computing the above hashes.
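The piecewise-hashing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DroidMOSS implementation: the window width, the trigger rule (window hash congruent to a fixed value modulo `MODULUS`, used here as a stand-in for the paper's resetting criterion), the SHA-1 based hash, and the opcode sequences are all hypothetical.

```python
import hashlib

WINDOW = 4    # sliding-window width in opcodes (assumed)
MODULUS = 5   # controls the average piece length (assumed)


def _h(tokens):
    """Deterministic hash of a run of opcodes."""
    digest = hashlib.sha1(" ".join(tokens).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def fuzzy_signature(opcodes):
    """Cut the opcode sequence at trigger points and emit one small hash per piece."""
    pieces, start = [], 0
    for i in range(len(opcodes)):
        window = opcodes[max(0, i - WINDOW + 1): i + 1]
        if _h(window) % MODULUS == MODULUS - 1:   # stand-in resetting criterion
            pieces.append(opcodes[start: i + 1])
            start = i + 1
    if start < len(opcodes):
        pieces.append(opcodes[start:])            # trailing piece
    return [_h(p) % 64 for p in pieces]


def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


original = ["const", "invoke-virtual", "move-result", "if-eqz", "goto", "return"] * 20
repackaged = original[:50] + ["invoke-static"] + original[50:]   # one inserted call
print(edit_distance(fuzzy_signature(original), fuzzy_signature(repackaged)))
```

Because only the windows that overlap the inserted opcode can change their trigger decision, a single insertion perturbs at most `WINDOW + 1` piece hashes in this sketch, which is the localization property described in step 2.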
Crussell et al. use Program Dependence Graphs (PDGs) [79] in their DNADroid [55] system to detect similar apps with different authorship, or app “clones” as they are called in the paper. The essence of this work is:
1. Convert Android app code from (Android’s) DEX format into (Java’s) JAR format using the tool dex2jar [194].
2. Remove common libraries from the code by Java package name and code checksum.
3. Use the existing Java analysis framework WALA [41] to extract the data dependency part of PDGs.
4. The justification for not including the control flow part of PDGs is “so that our detection is more robust against statement reordering, insertion and deletion” [55].
5. Apply heuristics (called “filters” in the paper) to remove pairs of apps that are unlikely to be clones from the pairwise comparison; one example of such heuristics is to remove PDGs that have fewer than 10 nodes.
6. Use the VF2 subgraph isomorphism algorithm [52] to identify isomorphic subgraphs in the PDGs, and use the ratio of the number of nodes in the isomorphic subgraphs to the number of nodes in the PDGs as the app similarity score.

Hanna et al. use k-gram features of code sequences and feature hashing in their Juxtapp [97] system to detect method-level similarity between Android apps. The essence of this work is:
1. Extract the basic blocks of each Java method from the APK.
2. Hash every k opcodes, along with all constant operands (but not variable operands), and set a corresponding bit in a feature bit-vector of the method.
3. The justification for dropping variable operands but retaining constant operands is that variable operands are easily changeable, whereas constant operands may contain control flow information as used by Java’s reflection mechanism [46].
4. Use agglomerative hierarchical clustering [61] on the aforementioned feature bit-vectors to group similar apps together.
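The k-gram feature hashing in step 2 of the Juxtapp description might look like the sketch below. The values of `K` and `NUM_BITS`, the use of CRC32 as the hash, the Jaccard comparison of bit-vectors, and the token sequences are assumptions for illustration rather than details confirmed by the text above.

```python
import zlib

K = 5            # k-gram length (assumed)
NUM_BITS = 1024  # size of the feature bit-vector (assumed)


def feature_bits(tokens, k=K, m=NUM_BITS):
    """Hash every window of k tokens and set the corresponding bit in an m-bit vector.

    Tokens are opcodes, optionally with constant operands attached; variable
    operands are assumed to have been dropped beforehand.
    """
    bits = 0
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k]).encode()
        bits |= 1 << (zlib.crc32(gram) % m)
    return bits


def jaccard_similarity(a, b):
    """Jaccard similarity of two bit-vectors stored as Python integers."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


method = ["const-string 'getDeviceId'", "invoke-static", "move-result-object",
          "invoke-virtual", "move-result", "if-eqz", "return"]
extended = method + ["const/4 0", "invoke-static", "return-void"]   # appended code
print(jaccard_similarity(feature_bits(method), feature_bits(extended)))
```

A fixed-size bit-vector keeps the per-method feature compact regardless of code length, at the cost of occasional hash collisions; the resulting vectors could then be fed to a clustering step such as the one in step 4.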
In their PiggyApp [234] system, Zhou et al. identify one type of app repackaging (what they call “piggybacked” apps), characterized by the relative independence of the added code (the “rider”) from the original app (the “carrier”), and propose to identify piggybacked apps by isolating the primary module (corresponding to the carrier) and comparing it for similarity. The essence of this work is:
1. Build a weighted graph of Java packages that represents the packages’ data/control dependence (including class inheritance, package homogeny, method calls, and member field references).
2. Use an agglomerative clustering algorithm to group class packages into different modules.
3. Use information declared in AndroidManifest.xml to determine the primary module, and use identifier (string) similarity to break ties.
4. Generate a bit vector (the feature vector) of the primary module that represents the presence/absence of the following features: requested permissions, Android API calls, intent types, use of native code/external classes, and authorship information.
5. Embed the bit vector into a metric space, and use the Jaccard distance between primary module feature vectors as the similarity score of apps.
6. Use a Vantage Point Tree (VPT) [222] as the metric space representation to reduce complexity from O(n²) to O(n log n).
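Steps 5 and 6 can be illustrated with a small vantage point tree over feature sets compared by Jaccard distance. This is a sketch under assumptions: the tree layout, the random choice of vantage points, and the feature names are hypothetical and are not taken from PiggyApp.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance between two feature sets (a metric on finite sets)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


class VPTree:
    """Minimal vantage point tree supporting nearest-neighbour queries."""

    def __init__(self, points, dist, seed=0):
        self.dist = dist
        self.root = self._build(list(points), random.Random(seed))

    def _build(self, pts, rng):
        if not pts:
            return None
        vp = pts.pop(rng.randrange(len(pts)))          # pick a vantage point
        if not pts:
            return (vp, 0.0, None, None)
        dists = [self.dist(vp, p) for p in pts]
        mu = sorted(dists)[len(dists) // 2]            # median split radius
        inner = [p for p, d in zip(pts, dists) if d < mu]
        outer = [p for p, d in zip(pts, dists) if d >= mu]
        return (vp, mu, self._build(inner, rng), self._build(outer, rng))

    def nearest(self, query):
        best = [float("inf"), None]

        def search(node):
            if node is None:
                return
            vp, mu, inner, outer = node
            d = self.dist(query, vp)
            if d < best[0]:
                best[0], best[1] = d, vp
            if d < mu:
                search(inner)
                if d + best[0] >= mu:                  # outer side may still hold a closer point
                    search(outer)
            else:
                search(outer)
                if d - best[0] < mu:                   # inner side may still hold a closer point
                    search(inner)

        search(self.root)
        return best[1], best[0]


# Hypothetical primary-module feature sets.
apps = [frozenset({"perm:INTERNET", "api:getDeviceId", "intent:VIEW"}),
        frozenset({"perm:INTERNET", "perm:SEND_SMS", "native"}),
        frozenset({"perm:CAMERA", "intent:SEND"})]
tree = VPTree(apps, jaccard_distance)
print(tree.nearest(frozenset({"perm:INTERNET", "api:getDeviceId"})))
```

The pruning tests rely on the triangle inequality, which is why the feature vectors need to live in a metric space; each query then inspects only part of the tree instead of every app.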
In AnDarwin [56], Crussell et al. build on the work of Gabel et al. [82] and Jiang et al. [112] to extract “semantic vectors” (histograms of code type frequencies) and identify similar code using Locality Sensitive Hashing (LSH) [9]. The essence of this work is:
1. Split the data dependency graph into connected components (the “semantic blocks”) and use the histogram of code type frequencies (the “semantic vector”) as the feature of each semantic block.
2. Semantic blocks with fewer than 10 nodes are discarded to reduce problem size, “because they represent trivial and uncharacteristic code.”
3. LSH is applied to each group of similarly sized semantic vectors (another heuristic to reduce problem size) to identify similar semantic vectors.
4. Common libraries are identified by their frequencies across all semantic vectors, and are removed from the data fed to the next stage.
5. MinHash [33] is used to compute an efficient approximation of the Jaccard distance between pairs of apps.

In FSquaDRA [229], Zhauniarovich et al. use the non-code resources contained in APKs for app repackaging detection to avoid the computational complexity associated with code analysis. This approach is also used by Viennot et al. in their large-scale study [205] of the Google Play store for detecting repackaged apps.
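Returning to the MinHash step in the AnDarwin description, the following sketch shows how a fixed-length signature can approximate the Jaccard similarity of two feature sets. The number of hash functions, the salted SHA-1 hash family, and the feature identifiers are assumptions for illustration only.

```python
import hashlib

NUM_HASHES = 256   # signature length (assumed); more hashes give a tighter estimate


def _hash(seed, item):
    """One member of a salted hash family, built from SHA-1 for determinism."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=NUM_HASHES):
    """For each hash function keep the minimum hash value over the (non-empty) set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical per-app feature identifiers; the true Jaccard similarity is 50/150.
app_a = {f"block{i}" for i in range(0, 100)}
app_b = {f"block{i}" for i in range(50, 150)}
sig_a, sig_b = minhash_signature(app_a), minhash_signature(app_b)
print(estimated_jaccard(sig_a, sig_b))
```

The Jaccard distance is one minus this estimate. Comparing two signatures costs time proportional to the signature length rather than to the set sizes, which is what makes pairwise comparison across many apps affordable.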
Unlike several of the methods mentioned above that eschew control flow graphs (CFGs) in favor of data dependence graphs (DDGs), Chen et al. [44] embed the Java-method-level CFG of an app (the “3D-CFG”) into a 3-dimensional metric space and use a weighted centroid (center of mass) of the 3D-CFG to represent the underlying method. The similarity of app methods is computed from the distance between their 3D-CFG centroids. Scalability derives from the observed “centroid effect”, i.e., the proximity of centroids is positively correlated with method similarity, and is achieved by restricting the search to methods whose centroids are close in the embedding metric space.
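A weighted centroid of this kind can be sketched as below. The coordinates assigned to each CFG node (block index, out-degree, loop depth) and the use of the statement count as the node weight are hypothetical choices made for this example; the text above does not specify the embedding.

```python
import math


def centroid(nodes):
    """Center of mass of CFG nodes given as ((x, y, z), weight) pairs."""
    total = sum(w for _, w in nodes)
    return tuple(sum(p[i] * w for p, w in nodes) / total for i in range(3))


def centroid_distance(c1, c2):
    """Euclidean distance between two centroids."""
    return math.dist(c1, c2)


# Hypothetical methods: (block index, out-degree, loop depth) with statement counts as weights.
method_a = [((0, 1, 0), 2), ((1, 2, 1), 2)]
method_b = [((0, 1, 0), 2), ((1, 2, 1), 3)]      # one statement added to the second block
method_c = [((0, 3, 0), 5), ((1, 1, 2), 1), ((2, 0, 2), 4)]

ca, cb, cc = centroid(method_a), centroid(method_b), centroid(method_c)
print(centroid_distance(ca, cb), centroid_distance(ca, cc))
```

Since each method collapses to a single point, candidate matches could be limited to methods whose centroids fall within a small radius, instead of comparing every pair of methods.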
In ViewDroid [225] by Zhang et al., a “featured view graph” is constructed from each APK sample, and app repackaging/plagiarism detection is performed pairwise on the corresponding featured view graphs using subgraph isomorphism. A featured view graph consists of user interface views (corresponding to Java classes that are derived from android.app.Activity in Android) as nodes and the transition relationships between the views as edges; Android APIs and the event callback methods that trigger view transitions are used to label the nodes and edges, respectively. As in Crussell et al.’s work [55], the VF2 [52] subgraph isomorphism detection algorithm is used for this purpose, perhaps due to the availability of its implementation [114].
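To make the subgraph isomorphism step concrete, here is a naive backtracking matcher for small directed graphs. It is explicitly not VF2, which adds feasibility rules and candidate ordering to prune the search; the graph representation, the omission of node and edge labels, and the view names are assumptions for illustration.

```python
def find_induced_match(pattern, target):
    """Return a node mapping that embeds `pattern` in `target` as an induced subgraph, or None.

    Graphs are dicts mapping each node to the set of its successor nodes;
    every node, including sinks, must appear as a key.
    """
    order = list(pattern)

    def consistent(u, v, mapping):
        for pu, tv in mapping.items():
            if (pu in pattern[u]) != (tv in target[v]):    # edge u -> pu must mirror v -> tv
                return False
            if (u in pattern[pu]) != (v in target[tv]):    # edge pu -> u must mirror tv -> v
                return False
        return (u in pattern[u]) == (v in target[v])       # self-loops must agree

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        u = order[len(mapping)]
        used = set(mapping.values())
        for v in target:
            if v not in used and consistent(u, v, mapping):
                mapping[u] = v
                found = extend(mapping)
                if found is not None:
                    return found
                del mapping[u]
        return None

    return extend({})


# Hypothetical view graphs: nodes are views, edges are view transitions.
views_a = {"Main": {"Settings", "Detail"}, "Settings": set(), "Detail": set()}
views_b = {"Home": {"Prefs", "Item", "Ads"}, "Prefs": set(), "Item": set(), "Ads": set()}
match = find_induced_match(views_a, views_b)
print(match, len(match) / len(views_b))   # mapping and fraction of matched nodes
```

The search is exponential in the worst case, so a sketch like this is only usable on graphs with a handful of nodes.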