Machine learning-based dynamic analysis of Android apps with improved code coverage

Yerima, Suleiman Y.; Alzaylaee, Mohammed K.; Sezer, Sakir

doi:10.1186/s13635-019-0087-1

Research
Open access
Published: 29 April 2019

Machine learning-based dynamic analysis of Android apps with improved code coverage

Suleiman Y. Yerima¹,
Mohammed K. Alzaylaee ORCID: orcid.org/0000-0002-3791-3458²^na1 &
Sakir Sezer²

EURASIP Journal on Information Security volume 2019, Article number: 4 (2019) Cite this article

11k Accesses
23 Citations
Metrics details

Abstract

This paper investigates the impact of code coverage on machine learning-based dynamic analysis of Android malware. In order to maximize the code coverage, dynamic analysis on Android typically requires the generation of events to trigger the user interface and maximize the discovery of the run-time behavioral features. The commonly used event generation approach in most existing Android dynamic analysis systems is the random-based approach implemented with the Monkey tool that comes with the Android SDK. Monkey is utilized in popular dynamic analysis platforms like AASandbox, vetDroid, MobileSandbox, TraceDroid, Andrubis, ANANAS, DynaLog, and HADM. In this paper, we propose and investigate approaches based on stateful event generation and compare their code coverage capabilities with the state-of-the-practice random-based Monkey approach. The two proposed approaches are the state-based method (implemented with DroidBot) and a hybrid approach that combines the state-based and random-based methods. We compare the three different input generation methods on real devices, in terms of their ability to log dynamic behavior features and the impact on various machine learning algorithms that utilize the behavioral features for malware detection. Experiments performed using 17,444 applications show that overall, the proposed methods provide much better code coverage which in turn leads to more accurate machine learning-based malware detection compared to the state-of- the- art approach.

1 Introduction

With nearly 80% market share, Google Android leads other mobile operating systems. Over 65 billion downloads have been made from the official Google play store, and there are currently more than 1 billion Android devices worldwide [1]. According to Statista [2], there will be around 1.5 billion Android devices shipped worldwide by 2021. Due to the increasing popularity of Android, malware targeting the platform has increased significantly over the last few years. According to a recent report from McAfee, there are around 2.5 million new Android malware samples exposed every year, thus increasing the total number of malware samples discovered in the wild to more than 12 million [3]. Android malware can be found in a variety of applications such as gaming apps, banking apps, social media apps, educational apps, and utility apps. Malware-infected applications can have access to privacy-sensitive information, send text messages to premium rate numbers without user approval, or even install a rootkit on the device allowing it to download and execute any code the malware developer wants to deploy etc.

In order to mitigate the spread of malware, Google introduced Bouncer to its store in Feb 2012. Bouncer is the system used to monitor submitted applications for potentially harmful behaviors by testing the submitted apps in a sandbox. However, Bouncer has been shown to be vulnerable to unsophisticated detection avoidance techniques that can evade sandbox-based dynamic analysis [4]. Furthermore, most third party app stores do not have any screening mechanism for submitted applications. There is therefore a need to further research more efficient approaches for detecting Android malware in the wild.

Android applications are heavily user interface (UI) driven. Because of the UI operation, efficient input generation is crucial for testing applications and many tools are available to support developers. Likewise, efficient input generation is needed to drive automated dynamic analysis for malware detection. On the Android platform, malware can hide their malicious activities behind events/actions that require user interaction. Hence, in order to facilitate effective malware detection, researchers integrate (into their analysis platforms) tools that can simulate the human interaction with the UI. The main goal is to reach a high percentage of code coverage such that most of the suspicious activities are revealed during the analysis. However, as highlighted in [5], some challenges still remain with automated test input generation. Many of the existing dynamic analysis systems rely on a random-based input generation strategy based on the Android Monkey UI exerciser tool. In fact, the random-based test input generation tool Monkey, is currently the most popular input generation tool used in most dynamic analysis systems. Its popularity can be attributed to being readily available as part of the Android developer’s toolkit and its ease of use.

An empirical study conducted in [6] compared the Monkey tool to other input generation tools. The study found that the highest code coverage was achieved by the random-based Monkey tool compared to the other tools. However, the study did not evaluate the impact on machine learning-based detection of Android malware. In our previous work [7], preliminary experiments conducted on 2444 Android apps (containing 1222 benign and 1222 malware samples) showed that dynamic analysis behavioral footprint enabled by the random-based Monkey tool could be improved further. The results showed that stateful approaches enabled better code coverage than the random-based approach. Out of the stateful approaches, the hybrid method performed better than the stand-alone state-based approach. The code coverage was quantified by the number of apps where specific dynamic behavioral features were traced or discovered from.

This paper extends the preliminary work in [7], and presents a more extensive comparative analysis of the code coverage by the random-based, state-based and hybrid input generation methods by means of a larger dataset of 15,000 apps. The study in this paper focuses on investigating the impact of the input generation methods on performance of machine learning-based Android malware detection, which has not yet been addressed in previous works. In particular, our paper seeks to answer the following research questions:

Given that the random-based test input generation approach is widely utilized, does this method enable the best possible code coverage for effective detection of Android malware?
Does the state-based method produce larger behavioral footprints for dynamic analysis compared to the random- based method?
When the state-based method is combined with the random-based method to enable hybrid input test generation, does this increase the behavioral footprint?
Lastly, what are the comparative performance differences for various machine learning classifiers that utilize the dynamic behavioral features logged using the random-based vs. state-based vs. the hybrid method? Most importantly, does the use of stateful methods increase classifier performance?

The reminder of the paper is structured as follows. Section 2 discusses the input generation methods investigated in the paper. Section 3 details the methodology and experiments undertaken. Section 4 presents and discusses the results. Section 5 gives an overview of related work, followed by conclusions and future work in Section 6.

2 Input/event generation methods for Android application dynamic analysis

In this section, we describe the input/event generation methods investigated in this paper. The input generation schemes are incorporated into an extended version of our dynamic analysis framework (DynaLog) for comparative analysis. DynaLog [8] enables dynamic analysis of Android apps by instrumenting the apps with APIMonitor [9] and then logging real-time behavioral features. DynaLog is extended to enable dynamic analysis on real devices (as described in [10]) in order to mitigate the potential impact of anti-emulation and environmental limitations of emulators on our dynamic analyses. Figure 1 shows an overview of the dynamic analysis process using DynaLog. It logs features from API calls and Intents. On the Android platform, Intents are used to signal to the system that a certain event has occurred. Examples of Intents include SMS_RECEIVED, PACKAGE_INSTALL, BOOT_COMPLETED, and PACKAGE_REMOVED. In order to receive the signal for an Intent, an app must register for that particular Intent.

2.1 Random input generation method

The random input generation method is a stateless approach that sends pseudo-random events of clicks, swipes, touch screens, scrolling etc. during run-time testing of an application in an emulator or real device. As mentioned before, the state-of-the-practice events generator for run-time testing of Android apps is the Monkey tool, which has been incorporated into many dynamic analyses systems (such as AASandbox [11], ANANAS [12], Mobile Sandbox [13], and vetDroid [14]). Monkey is a random-based events generation tool that is part of the Android Developers’ toolkit [15]. It is a command line tool that can be run on any emulator instance or on a device. It sends a pseudo-random stream of user events into the system and includes several configuration options. The seed value for the pseudo-random number generator can be set, and if you re-run Monkey with the same seed value it would generate the same sequence of events. A fixed delay can be inserted between the events and if not specified, there is no delay and the events are generated as rapidly as possible. Monkey can be configured to ignore crashes, timeouts or security exceptions and continue to send events to the system until the count is complete.

Although Monkey is quite effective as an event generation tool for dynamic analysis systems, it does not operate with any awareness of the current state of the system. For this reason, the code coverage might not be optimal, and this could reduce effectiveness of detection systems that utilize on it for event generation. This is what we will be investigating in the experiments presented later in the paper by comparative analysis with alternative approaches that have not been utilized in previous works.

The main advantage of the random input generation method is the speed, since events can be sent continuously without pausing to determine the current state of the app’s UI. The main disadvantage is that unintended consequences, such as turning off Internet connectivity, could occur as observed in [7]. Since the input/event generation is not guided, the phone’s settings could be inadvertently changed by indiscriminate streams of generated input. This may impact the extent of coverage of the behavioral footprint thus leaving out key features that could be indicative of malicious activities.

2.2 State-based input generation method

This approach is also known as model-based input generation or model-based exploration [16–20]. The method utilizes a finite state machine model of the app with activities as “states” and events as “transitions.” The finite state model can be built by statically analyzing the app’s source code to understand which UI events are relevant for a specific activity. Alternatively, the model can be built dynamically and terminated when all events that can be triggered from all the discovered state lead to already explored states.

In order to implement state-based test input generation for our study, we employed DroidBot [21], an open-source automated UI event generation tool for Android. DroidBot was integrated into our DynaLog framework and configured to operate in the state-based input generation mode. DroidBot generates UI-guided test inputs based on a state transition model generated on-the-fly from the information monitored at run-time. Hence, it does not require any prior knowledge of unexplored code. A depth-first exploration algorithm is used to generate the events. Figure 2 shows a simplified example of a state transition model. It shows a directed graph in which each node represents a device state, and the edge between the two nodes represents the input event that triggered the state transition. A state node typically contains the GUI information and the running process information, while an edge node contains details of the input event. The current state information is maintained, and after sending an input to the device, the state change is monitored. Once the device state is changed, the input is added as a new edge while a new node is added to represent the new state. Further details on DroidBot are given in [21].

DroidBot was selected for our work because compared to other model-based/state-based tools it is (a) open source, (b) easier to integrate with our existing analysis environment (DynaLog) since it does not require system/framework instrumentation, and (c) usable with common of the shelf Android devices without modification. The main disadvantage of the state-based approach is that it is considerably slower than the random-based approach. This is due to the fact that the app’s current state needs to be constantly tracked in between sending events. Note that the time taken to traverse the possible states of the app depends on the size and complexity of the apps. Running times therefore vary from app to app.

2.3 Hybrid input generation method

The hybrid input method is a stateful approach that combines the random-based method with the state-based method as described in [7]. This was motivated by desire to exploit the strengths of each method for possible improvement in behavioral footprint. Furthermore, we want to determine whether a combined scheme will impact the detection accuracy of machine learning-based systems built upon the dynamic analysis process that utilize the resulting run-time behavioral features.

The hybrid input generation system runs a random-based event generation script based on Monkey first. Afterwards, it commences the state-based phase by starting DroidBot (with dynamic policy mode). It also checks the device configuration in order to restore the device to its original starting configuration, in case this has been altered by any generated random events. Figure 3 shows a flow chart of the process [7]. Note that the execution order (i.e., running Monkey before DroidBot) is mandated by technical constraints that made it infeasible to start with DroidBot first.

3 Methodology and experiments

Two sets of experiments were performed to address the research questions outlined in Section 1: (a) Comparative analysis of the random-based, state-based, and hybrid input generation in terms of behavioral footprint. (b) Investigating the impact of the resulting behavioral footprints on the performance of various machine learning-based classifiers. In this section, we present the setup of our experiments.

3.1 Testbed configurations

The experiments were performed using real phones with the following configurations. Eight smartphones of different brands with each processing an average of 100 apps per day were utilized. The phones are equipped with Android 6.0 “Marshmallow” OS, 2.6 GHz CPU, 4GB RAM, 32GB ROM, and 32 GB of external SD card storage. Moreover, each phone was installed with a credit-loaded sim card to allow sending SMS, outgoing calls, and 3G data usage. The phones were also connected to an Internet-enabled Wi-Fi access point. The aim was to ensure that the dynamic analysis environment mimicked real smartphone operations as much as possible.

The running time was different depending on the input generation method. With the random-based method, each of the applications was installed on a phone and run for 300 s. Preliminary investigation confirmed that 300 s was a sufficient time to generate at least 2000 random events. The preliminary studies found that for most apps, 2000 events provided optimum coverage beyond which no improvement is observed. With the state-based input generation method, we specified a running time of 180 s which was enough to allow the possible states to be traversed and all relevant events invoked. This was also confirmed via preliminary studies. Therefore, for the hybrid test input generation, we adopted the sum of the times used in the two individual methods, i.e., 480 s.

3.2 Datasets

In order to evaluate behavioral footprints and subsequently measure the accuracy performance of the machine learning-based classifiers, we utilized two datasets. The first one (Dataset1) consisted of 2444 apps with equal numbers of benign and malware apps. The malware apps in Dataset1 are from 49 families of the Android Malware Genome project samples [22], while the benign samples are from McAfee Labs.

The second dataset (Dataset2) had 15,000 apps consisting of 6500 clean apps and 8500 malware apps all obtained from McAfee labs. Some of the apps in Dataset2 could not be processed successfully due to errors, crashes, or absence of “activity” components in the app. Out of the initial 6500 benign apps, 6069 were processed successfully. Also, out of 8500 malware apps, only 7434 were processed successfully. Thus, in the end, a total of 13,530 apps were utilized from Dataset2.

3.3 Extracting behavioral features

The behavioral footprints were extracted during app processing on the smartphones. Each of the smartphones was connected via the android debug bridge (adb) to a Santoku Virtual Machine [23]. Within the Santoku Virtual Machine, an instance of our dynamic analysis tool Dynalog was used to extract the behavior features for each app at run-time and then further process the behavior logs from all apps into a.csv file. The entire process was performed for each of the three analyses scenarios, i.e., with random-based, state-based, and hybrid input generation methods respectively.

For each application, a total of 178 dynamic features based on API calls and Intents were extracted. These features were utilized for the comparative behavioral footprint analyses of the input generation methods using the Dataset1 and Dataset2 respectively. After pre-processing, the total number of features that remained out of the initial 178 for training and evaluating the machine learning classifiers were 102 for random-based method, 110 for state-based method, and 110 for hybrid method.

3.4 Investigated machine learning classifiers

In the second set of experiments, the extracted behavioral features were used to investigate the performance of seven popular machine learning classifiers. The classifiers include the following: Sequential Minimal Optimization (SMO), Naive Bayes (NB), Simple Logistic (SL), Multilayer Perceptron (MLP), Partial Decision Trees (PART), Random Forest (RF), and J48 Decision Tree. These classifiers are all implemented in the popular machine learning software WEKA (Waikato Environment for Knowledge Analysis), and we used the default configurations of the classifiers within the software for our experiments. For each input generation method, we studied the performance of each classifier on the two datasets. The results of our experiments are presented in Section 4 using the “precision,” “recall,” and “weighted F-measure” performance metrics defined as follows:

Recall or sensitivity is the “true positive ratio” given by:

$$ \mathrm{Rec = \frac{TP}{TP+FN}} $$

(1)

Precision (also known as “positive predictive rate”) is given by:

$$ \mathrm{Prec = \frac{TP}{TP+FP}} $$

(2)

where TP is the true positives, i.e., number of correctly classified instances. FN stands for false negatives, i.e., the number of instances within a class that were incorrectly classified as the other class. FP is false positives, i.e., the number of instances of the other class that were incorrectly classified as the current class.

F-measure is a metric that combines precision and recall as follows:

$$ \mathrm{FM = \frac{2* recall * precision}{recall + precision}} $$

(3)

As the F-measure is calculated for both malware and benign classes, the combined measure known as weighted F-measure is the sum of the F-measures weighted by the number of instances in each class as follows:

$$ W-FM = \frac{(F_{m}.N_{m})+(F_{b}.N_{b})}{N_{m}+N_{b}} $$

(4)

where F _m and F _b are the F-measures of the malware and benign classes respectively, while N _m and N _b are the number of instances in the malware and benign classes respectively.

4 Results and disscussions

4.1 Comparisons of behavioral footprint: state-based vs. random vs. hybrid

4.1.1 Random-based approach vs. state-based approach

In this subsection, the results of the behavioral footprint analysis of the random-based method compared to the state-based method (from both Datasets 1 and 2) are presented. Figures 4 and 5 show the top 10 extracted run-time features with the largest differences between the two methods, from the malware and benign samples respectively in Dataset1. In this experiment, we discovered that some API calls were logged at a higher number with random-based method whereas others were logged at a higher number with state-based method.

In Fig. 4, the top 10 features where the state-based method had a larger overall behavioral footprint are depicted for the malware subset. The feature NetworkInfo;->getTypeName, logged 114 more with the state-based method than the random-based method. While the feature Landroid/telephony/TelephonyManager;->getDeviceId logged 100 more with the state-based method. In Fig. 5, much larger differences can be seen with the experiments performed on the benign subset of Dataset1. This also corresponds with the fact that the benign apps were generally larger in size than the malware apps in Dataset1.

Figures 6 and 7 show results of experiments performed on Dataset2 to compare random-based to state-based. In Fig. 6, the top 10 features with the largest differences where the state-based method triggered higher number of logs than the random-based method are shown for the malware subset. Similarly, Fig. 7 illustrates the same for the benign subset. We can see that with Dataset2 there are much higher differences in behavioral footprint between the two test input generation methods compared the Dataset1 scenario.

Figure 8 shows the results of a few exceptional cases where the random-based method outperformed the state-based approach from the malware subset of Dataset1. This indicates that for some applications, the random-based approach could sometimes reach parts of the applications that the state-based method was unable to reach. However, it is worth noting that the overall differences in Fig. 8 are smaller than those shown in Fig. 4 thus indicating that the overall footprint is larger with the state-based approach.

4.1.2 Random-based approach vs. hybrid approach

The results of the behavioral footprint analysis of the random-based method compared to the hybrid method (from both Datasets 1 and 2) are presented. Tables 1 and 2 show the top-10 extracted run-time features with the largest differences between the two methods, from the malware and benign samples respectively in Dataset1. Both tables show that more of the API calls were logged from larger number of APKs when the hybrid input test generation method was used compared to the random-based method. The API method call Ljava/io/File;->exists, for instance, was logged from 677 malware APKs using the hybrid test input method, while it was only logged from 477 malware APKs using the random-based approach. Similarly, the method Landroid/telephony/TelephonyManager;>getDeviceId in Table 1, was discovered from 429 malware APKs using the hybrid method, whereas only 315 malware APKs logged the same API method when random-based method was used.

Table 1 Top 10 API calls logged from malware samples where the hybrid approach was better than the random-based approach (Dataset1)

Machine learning-based dynamic analysis of Android apps with improved code coverage

Abstract

1 Introduction

2 Input/event generation methods for Android application dynamic analysis

2.1 Random input generation method

2.2 State-based input generation method

2.3 Hybrid input generation method

3 Methodology and experiments

3.1 Testbed configurations

3.2 Datasets

3.3 Extracting behavioral features

3.4 Investigated machine learning classifiers

4 Results and disscussions

4.1 Comparisons of behavioral footprint: state-based vs. random vs. hybrid

4.1.1 Random-based approach vs. state-based approach

4.1.2 Random-based approach vs. hybrid approach

4.1.3 State-based approach vs. hybrid approach

4.1.4 Explanation of the obtained results

4.2 Comparisons of impact of the code coverage on machine learning classifiers

4.3 Evaluating accuracy performance with additional statically obtained permission features.

5 Related work

6 Conclusion

7 Appendix

References

Acknowledgements

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords