Machine learning-based dynamic analysis of Android apps with improved code coverage

This paper investigates the impact of code coverage on machine learning-based dynamic analysis of Android malware. In order to maximize the code coverage, dynamic analysis on Android typically requires the generation of events to trigger the user interface and maximize the discovery of the run-time behavioral features. The commonly used event generation approach in most existing Android dynamic analysis systems is the random-based approach implemented with the Monkey tool that comes with the Android SDK. Monkey is utilized in popular dynamic analysis platforms like AASandbox, vetDroid, MobileSandbox, TraceDroid, Andrubis, ANANAS, DynaLog, and HADM. In this paper, we propose and investigate approaches based on stateful event generation and compare their code coverage capabilities with the state-of-the-practice random-based Monkey approach. The two proposed approaches are the state-based method (implemented with DroidBot) and a hybrid approach that combines the state-based and random-based methods. We compare the three different input generation methods on real devices, in terms of their ability to log dynamic behavior features and the impact on various machine learning algorithms that utilize the behavioral features for malware detection. Experiments performed using 17,444 applications show that overall, the proposed methods provide much better code coverage which in turn leads to more accurate machine learning-based malware detection compared to the state-of- the- art approach.


Introduction
With nearly 80% market share, Google Android leads other mobile operating systems. Over 65 billion downloads have been made from the official Google play store, and there are currently more than 1 billion Android devices worldwide [1]. According to Statista [2], there will be around 1.5 billion Android devices shipped worldwide by 2021. Due to the increasing popularity of Android, malware targeting the platform has increased significantly over the last few years. According to a recent report from McAfee, there are around 2.5 million new Android malware samples exposed every year, thus increasing the total *Correspondence: malzaylaee01@qub.ac.uk † Suleiman Y. Yerima and Mohammed K. Alzaylaee contributed equally to this work. 2 Centre for Secure Information Technologies (CSIT), Queen's University Belfast, Belfast, BT7 1NN Northern Ireland, UK Full list of author information is available at the end of the article number of malware samples discovered in the wild to more than 12 million [3]. Android malware can be found in a variety of applications such as gaming apps, banking apps, social media apps, educational apps, and utility apps.
Malware-infected applications can have access to privacysensitive information, send text messages to premium rate numbers without user approval, or even install a rootkit on the device allowing it to download and execute any code the malware developer wants to deploy etc. In order to mitigate the spread of malware, Google introduced Bouncer to its store in Feb 2012. Bouncer is the system used to monitor submitted applications for potentially harmful behaviors by testing the submitted apps in a sandbox. However, Bouncer has been shown to be vulnerable to unsophisticated detection avoidance techniques that can evade sandbox-based dynamic analysis [4]. Furthermore, most third party app stores do not have any screening mechanism for submitted applications. There is therefore a need to further research more efficient approaches for detecting Android malware in the wild. Android applications are heavily user interface (UI) driven. Because of the UI operation, efficient input generation is crucial for testing applications and many tools are available to support developers. Likewise, efficient input generation is needed to drive automated dynamic analysis for malware detection. On the Android platform, malware can hide their malicious activities behind events/actions that require user interaction. Hence, in order to facilitate effective malware detection, researchers integrate (into their analysis platforms) tools that can simulate the human interaction with the UI. The main goal is to reach a high percentage of code coverage such that most of the suspicious activities are revealed during the analysis. However, as highlighted in [5], some challenges still remain with automated test input generation. Many of the existing dynamic analysis systems rely on a randombased input generation strategy based on the Android Monkey UI exerciser tool. In fact, the random-based test input generation tool Monkey, is currently the most popular input generation tool used in most dynamic analysis systems. Its popularity can be attributed to being readily available as part of the Android developer's toolkit and its ease of use.
An empirical study conducted in [6] compared the Monkey tool to other input generation tools. The study found that the highest code coverage was achieved by the random-based Monkey tool compared to the other tools. However, the study did not evaluate the impact on machine learning-based detection of Android malware. In our previous work [7], preliminary experiments conducted on 2444 Android apps (containing 1222 benign and 1222 malware samples) showed that dynamic analysis behavioral footprint enabled by the random-based Monkey tool could be improved further. The results showed that stateful approaches enabled better code coverage than the random-based approach. Out of the stateful approaches, the hybrid method performed better than the stand-alone state-based approach. The code coverage was quantified by the number of apps where specific dynamic behavioral features were traced or discovered from.
This paper extends the preliminary work in [7], and presents a more extensive comparative analysis of the code coverage by the random-based, statebased and hybrid input generation methods by means of a larger dataset of 15,000 apps. The study in this paper focuses on investigating the impact of the input generation methods on performance of machine learning-based Android malware detection, which has not yet been addressed in previous works. In particular, our paper seeks to answer the following research questions: • Given that the random-based test input generation approach is widely utilized, does this method enable the best possible code coverage for effective detection of Android malware? • Does the state-based method produce larger behavioral footprints for dynamic analysis compared to the random-based method? • When the state-based method is combined with the random-based method to enable hybrid input test generation, does this increase the behavioral footprint?
• Lastly, what are the comparative performance differences for various machine learning classifiers that utilize the dynamic behavioral features logged using the random-based vs. state-based vs. the hybrid method? Most importantly, does the use of stateful methods increase classifier performance?
The reminder of the paper is structured as follows. Section 2 discusses the input generation methods investigated in the paper. Section 3 details the methodology and experiments undertaken. Section 4 presents and discusses the results. Section 5 gives an overview of related work, followed by conclusions and future work in Section 6.

Input/event generation methods for Android application dynamic analysis
In this section, we describe the input/event generation methods investigated in this paper. The input generation schemes are incorporated into an extended version of our dynamic analysis framework (DynaLog) for comparative analysis. DynaLog [8] enables dynamic analysis of Android apps by instrumenting the apps with API-Monitor [9] and then logging real-time behavioral features. DynaLog is extended to enable dynamic analysis on real devices (as described in [10]) in order to mitigate the potential impact of anti-emulation and environmental limitations of emulators on our dynamic analyses. Figure 1 shows an overview of the dynamic analysis process using DynaLog. It logs features from API calls and Intents. On the Android platform, Intents are used to signal to the system that a certain event has occurred. Examples of Intents include SMS_RECEIVED, PACKAGE_INSTALL, BOOT_COMPLETED, and PACKAGE_REMOVED. In order to receive the signal for an Intent, an app must register for that particular Intent.

Random input generation method
The random input generation method is a stateless approach that sends pseudo-random events of clicks, swipes, touch screens, scrolling etc. during run-time testing of an application in an emulator or real device. As mentioned before, the state-of-the-practice events generator for run-time testing of Android apps is the Monkey Extracting app behavioral footprint using Dynalog [10] tool, which has been incorporated into many dynamic analyses systems (such as AASandbox [11], ANANAS [12], Mobile Sandbox [13], and vetDroid [14]). Monkey is a random-based events generation tool that is part of the Android Developers' toolkit [15]. It is a command line tool that can be run on any emulator instance or on a device. It sends a pseudo-random stream of user events into the system and includes several configuration options. The seed value for the pseudo-random number generator can be set, and if you re-run Monkey with the same seed value it would generate the same sequence of events. A fixed delay can be inserted between the events and if not specified, there is no delay and the events are generated as rapidly as possible. Monkey can be configured to ignore crashes, timeouts or security exceptions and continue to send events to the system until the count is complete. Although Monkey is quite effective as an event generation tool for dynamic analysis systems, it does not operate with any awareness of the current state of the system. For this reason, the code coverage might not be optimal, and this could reduce effectiveness of detection systems that utilize on it for event generation. This is what we will be investigating in the experiments presented later in the paper by comparative analysis with alternative approaches that have not been utilized in previous works.
The main advantage of the random input generation method is the speed, since events can be sent continuously without pausing to determine the current state of the app's UI. The main disadvantage is that unintended consequences, such as turning off Internet connectivity, could occur as observed in [7]. Since the input/event generation is not guided, the phone's settings could be inadvertently changed by indiscriminate streams of generated input. This may impact the extent of coverage of the behavioral footprint thus leaving out key features that could be indicative of malicious activities.

State-based input generation method
This approach is also known as model-based input generation or model-based exploration [16][17][18][19][20]. The method utilizes a finite state machine model of the app with activities as "states" and events as "transitions. " The finite state model can be built by statically analyzing the app's source code to understand which UI events are relevant for a specific activity. Alternatively, the model can be built dynamically and terminated when all events that can be triggered from all the discovered state lead to already explored states.
In order to implement state-based test input generation for our study, we employed DroidBot [21], an open-source automated UI event generation tool for Android. Droid-Bot was integrated into our DynaLog framework and configured to operate in the state-based input generation mode. DroidBot generates UI-guided test inputs based on a state transition model generated on-the-fly from the information monitored at run-time. Hence, it does not require any prior knowledge of unexplored code. A depthfirst exploration algorithm is used to generate the events. Figure 2 shows a simplified example of a state transition model. It shows a directed graph in which each node represents a device state, and the edge between the two nodes represents the input event that triggered the state transition. A state node typically contains the GUI information and the running process information, while an edge node contains details of the input event. The current state information is maintained, and after sending an input to the device, the state change is monitored. Once the device state is changed, the input is added as a new edge while a new node is added to represent the new state. Further details on DroidBot are given in [21]. DroidBot was selected for our work because compared to other model-based/state-based tools it is (a) open source, (b) easier to integrate with our existing analysis environment (DynaLog) since it does not require system/framework instrumentation, and (c) usable with common of the shelf Android devices without modification. The main disadvantage of the state-based approach is that it is considerably slower than the random-based approach. This is due to the fact that the app's current state needs to be constantly tracked in between sending events. Note that the time taken to traverse the possible states of the app depends on the size and complexity of the apps. Running times therefore vary from app to app.

Hybrid input generation method
The hybrid input method is a stateful approach that combines the random-based method with the statebased method as described in [7]. This was motivated by desire to exploit the strengths of each method for possible improvement in behavioral footprint. Furthermore, we want to determine whether a combined scheme will impact the detection accuracy of machine learning-based systems built upon the dynamic analysis process that utilize the resulting run-time behavioral features.
The hybrid input generation system runs a randombased event generation script based on Monkey first.
Afterwards, it commences the state-based phase by starting DroidBot (with dynamic policy mode). It also checks the device configuration in order to restore the device to its original starting configuration, in case this has been altered by any generated random events. Figure 3 shows a flow chart of the process [7]. Note that the execution order (i.e., running Monkey before DroidBot) is mandated by technical constraints that made it infeasible to start with DroidBot first.

Methodology and experiments
Two sets of experiments were performed to address the research questions outlined in Section 1: (a) Comparative analysis of the random-based, state-based, and hybrid input generation in terms of behavioral footprint. (b) Investigating the impact of the resulting behavioral footprints on the performance of various machine learningbased classifiers. In this section, we present the setup of our experiments.

Testbed configurations
The experiments were performed using real phones with the following configurations. Eight smartphones of different brands with each processing an average of 100 apps per day were utilized. The phones are equipped with Android 6.0 "Marshmallow" OS, 2.6 GHz CPU, 4GB RAM, 32GB ROM, and 32 GB of external SD card storage. Moreover, each phone was installed with a credit- Checking and restoring device configurations on the hybrid input generation system comprising both random and state-based subsystems [7] loaded sim card to allow sending SMS, outgoing calls, and 3G data usage. The phones were also connected to an Internet-enabled Wi-Fi access point. The aim was to ensure that the dynamic analysis environment mimicked real smartphone operations as much as possible.
The running time was different depending on the input generation method. With the random-based method, each of the applications was installed on a phone and run for 300 s. Preliminary investigation confirmed that 300 s was a sufficient time to generate at least 2000 random events. The preliminary studies found that for most apps, 2000 events provided optimum coverage beyond which no improvement is observed. With the state-based input generation method, we specified a running time of 180 s which was enough to allow the possible states to be traversed and all relevant events invoked. This was also confirmed via preliminary studies. Therefore, for the hybrid test input generation, we adopted the sum of the times used in the two individual methods, i.e., 480 s.

Datasets
In order to evaluate behavioral footprints and subsequently measure the accuracy performance of the machine learning-based classifiers, we utilized two datasets. The first one (Dataset1) consisted of 2444 apps with equal numbers of benign and malware apps. The malware apps in Dataset1 are from 49 families of the Android Malware Genome project samples [22], while the benign samples are from McAfee Labs.
The second dataset (Dataset2) had 15,000 apps consisting of 6500 clean apps and 8500 malware apps all obtained from McAfee labs. Some of the apps in Dataset2 could not be processed successfully due to errors, crashes, or absence of "activity" components in the app. Out of the initial 6500 benign apps, 6069 were processed successfully. Also, out of 8500 malware apps, only 7434 were processed successfully. Thus, in the end, a total of 13,530 apps were utilized from Dataset2.

Extracting behavioral features
The behavioral footprints were extracted during app processing on the smartphones. Each of the smartphones was connected via the android debug bridge (adb) to a Santoku Virtual Machine [23]. Within the Santoku Virtual Machine, an instance of our dynamic analysis tool Dynalog was used to extract the behavior features for each app at run-time and then further process the behavior logs from all apps into a .csv file. The entire process was performed for each of the three analyses scenarios, i.e., with random-based, state-based, and hybrid input generation methods respectively. For each application, a total of 178 dynamic features based on API calls and Intents were extracted. These features were utilized for the comparative behavioral footprint analyses of the input generation methods using the Dataset1 and Dataset2 respectively. After pre-processing, the total number of features that remained out of the initial 178 for training and evaluating the machine learning classifiers were 102 for random-based method, 110 for state-based method, and 110 for hybrid method.

Investigated machine learning classifiers
In the second set of experiments, the extracted behavioral features were used to investigate the performance of seven popular machine learning classifiers. The classifiers include the following: Sequential Minimal Optimization (SMO), Naive Bayes (NB), Simple Logistic (SL), Multilayer Perceptron (MLP), Partial Decision Trees (PART), Random Forest (RF), and J48 Decision Tree. These classifiers are all implemented in the popular machine learning software WEKA (Waikato Environment for Knowledge Analysis), and we used the default configurations of the classifiers within the software for our experiments. For each input generation method, we studied the performance of each classifier on the two datasets. The results of our experiments are presented in Section 4 using the "precision, " "recall, " and "weighted F-measure" performance metrics defined as follows: Recall or sensitivity is the "true positive ratio" given by: Precision (also known as "positive predictive rate") is given by: where TP is the true positives, i.e., number of correctly classified instances. FN stands for false negatives, i.e., the number of instances within a class that were incorrectly classified as the other class. FP is false positives, i.e., the number of instances of the other class that were incorrectly classified as the current class.
F-measure is a metric that combines precision and recall as follows: As the F-measure is calculated for both malware and benign classes, the combined measure known as weighted F-measure is the sum of the F-measures weighted by the number of instances in each class as follows: where F m and F b are the F-measures of the malware and benign classes respectively, while N m and N b are the number of instances in the malware and benign classes respectively.

Random-based approach vs. state-based approach
In this subsection, the results of the behavioral footprint analysis of the random-based method compared to the state-based method (from both Datasets 1 and 2) are presented. Figures 4 and 5 show the top 10 extracted run-time features with the largest differences between the two methods, from the malware and benign samples respectively in Dataset1. In this experiment, we discovered that some API calls were logged at a higher number with random-based method whereas others were logged at a higher number with state-based method. In Fig. 4, the top 10 features where the state-based method had a larger overall behavioral footprint are depicted for the malware subset. The feature NetworkInfo;->getTypeName, logged 114 more with the state-based method than the random-based method. While the feature Landroid/telephony/TelephonyManager;->getDeviceId logged 100 more with the state-based method. In Fig. 5, much larger differences can be seen with the experiments performed on the benign subset of Dataset1. This also corresponds with the fact that the benign apps were generally larger in size than the malware apps in Dataset1. Figures 6 and 7 show results of experiments performed on Dataset2 to compare random-based to state-based. In Fig. 6, the top 10 features with the largest differences where the state-based method triggered higher number of logs than the random-based method are shown for the malware subset. Similarly, Fig. 7 illustrates the same for the benign subset. We can see that with Dataset2 there are much higher differences in behavioral footprint between the two test input generation methods compared the Dataset1 scenario. Figure 8 shows the results of a few exceptional cases where the random-based method outperformed the statebased approach from the malware subset of Dataset1. This indicates that for some applications, the random-based approach could sometimes reach parts of the applications that the state-based method was unable to reach. However, it is worth noting that the overall differences in Fig. 8 are smaller than those shown in Fig. 4 thus indicating that the overall footprint is larger with the state-based approach.

Random-based approach vs. hybrid approach
The results of the behavioral footprint analysis of the random-based method compared to the hybrid method (from both Datasets 1 and 2) are presented. Tables 1  and 2 show the top-10 extracted run-time features with the largest differences between the two methods, from the malware and benign samples respectively in Dataset1. Both tables show that more of the API calls were logged from larger number of APKs when the hybrid input test generation method was used compared to the random-based method. The API method call Ljava/io/File;->exists, for instance, was logged from 677 malware APKs using the hybrid test input method, while it was only logged from 477 malware APKs using the random-based approach. Similarly, the method Landroid/telephony/TelephonyManager;>getDeviceId in Table 1, was discovered from 429 malware APKs using the hybrid method, whereas only 315 malware APKs logged the same API method when random-based method was used. Similar results were obtained with the benign samples with even higher differences observed as shown in Table 2. For example, the method Landroid/net/Uri;>parse was logged from 492 benign APKs using the hybrid method, while the same method was extracted from only 192 benign APKs using the random-based approach.  The same experiment was repeated with the larger Dataset2 and the results are presented in Tables 3 and  4. The tables show the top-10 features with the largest differences. In Table 3 (malware subset), the hybrid method shows larger collective behavioral feature footprints than the random-based method. For example, the Ljava/util/zip/ZipInputStream;>read feature was logged from 2446 more malware APKs when the hybrid test input method was used compared to the random-based method. Likewise, the Ljava/lang/reflect/Method;->getClass feature was discovered from 4346 malware samples using the hybrid method compared to only 2468 using the  random-based method. From Table 6, out of 122 features with differences in Dataset2, the random-based method had only two features where it logged higher than the hybrid-based method. The results from the benign samples shown in Table 4 depict the same pattern of larger behavioral footprint with the hybrid input test generation method compared to the random-based input test generation method.

State-based approach vs. hybrid approach
The results of the behavioral footprint analysis of the state-based method compared to the hybrid method (from both Datasets 1 and 2) are presented. Tables 5 and 6 show the top-10 extracted run-time features with the largest differences where the hybrid method logged higher than the state-based method (in Dataset1). Both tables show that more of the API calls were logged from larger number of APKs when the hybrid input test generation method was used compared to the state-based method. It can be seen that the hybrid approach allows for the discovery of more API calls with a difference of over 100 in some cases. For instance, the class Ljava/util/Date was extracted from only 177 malware samples using state-based test input generation, while with the hybrid method, it was logged from 124 more malware samples. The differences decreased to less than 20 samples when we applied the same analysis to the benign sample set. Tables 7, 8, and 9 present results of comparative analysis with Dataset2. In Tables 7 and 8, we show the top 10 features where the hybrid approach has larger behavioral footprints than the state-based approach within the APK subsets. With 125, the feature Landroid/net/Uri;->parse has the largest difference in the malware sample set, followed by the Landroid/content/ContextWrapper;->sendBroadcast feature with 92. Similarly, with 117, the feature Landroid/os/Process->;myPid has the largest difference in the benign sample set, followed by the Ljava/Lang/ClassLoader->;loadClass feature with 61.
In Table 9, we depict the top 10 features where the statebased approach has larger behavioral footprints than the hybrid approach within the APK subsets. It was stated    in 726 more malware APKs, followed by the Network-Info;>isConnected feature found in 547 more malware APKs. The methods getSubscriberId and getSimOperator from the TelephonyManager class, were logged from 373 and 356 more malware samples respectively using the state-based compared to the hybrid method. In a nut shell, much larger differences can be seen in Dataset2 with the features were the state-based approach logged higher than the hybrid approach than vice-versa (for instance 726 vs. 125). Hence, this suggests that the state-based method enabled an overall larger behavioral footprint than the hybrid method. Investigating how these differences will impact on the machine learning-based detection systems trained on these set of features is the goal of the experiments presented in the next subsection. The summary of the overall differences in number of features where one method logged higher that the other (far Dataset1 and Dataset 2) are shown in Tables 10  and 11.
From Table 10, out of 76 features with differences, the random-based method had 23 features where it logged higher than the state-based method. Conversely, the statebased method had higher logs than the random-based method in 49 different features. In Dataset2 (Table 11) (out of 103 features with differences), the random-based is higher than state-based in 3 features, whereas the statebased is higher than the random-based in 100 features. These numbers suggest that the behavioral footprint is larger in both Datasets for the state-based compared to the random-based test input generation method.
Table 10 also shows that the hybrid method is higher than the random-based method in 62 features, while the random-based is higher than the hybrid in 10 features. Similarly, from Table 11 (Dataset2), hybrid exceeds random-based for 93 features and random-based exceeds hybrid for only 1 feature. These numbers also suggest that the behavioral footprint is larger in both datasets for the hybrid compared to the random-based test input generation method.
In Dataset 1 (Table 10) the state-based exceeds hybrid in 4 features while hybrid exceeds state-based in 64 features. This suggests that the behavioral footprint of the hybrid approach is larger than that of the state-based approach. In Dataset 2 (Table 11), we have the opposite, i.e., statebased exceeds hybrid in 80 features, while hybrid exceeds state-based in 22 features. This suggests that the behavioral footprint is larger for the state-based compared to the hybrid with Dataset2. In the Appendix, we present a full table of the features and number of apps that logged each feature for each respective test input generation method ( Table 6).

Explanation of the obtained results
The random-based method generates test input events at a much faster rate but a large percentage of these are usually redundant or irrelevant to the current state of the  application. This explains why some of the behavioral features might not be triggered and logged by the system despite sending thousands of events to the UI. On the other hand, the state-based approach although slower and with much less events being sent to the UI is more accurate because of the relevance of the events sent in response to the current state. The accuracy of the state-based method may be the reason why it enables larger numbers of the apps to log the behavioral features when tested in the dynamic analysis system. Another issue that was observed with the random-based method is that app being analyzed, or the phone was sometimes driven into a state where no further progress could be made in traversing the app. This also contributed to limiting the extent of the collected behavioral footprint.
In the smaller dataset (Dataset1), the hybrid approach logged higher for more features than the state-based approach. Whereas, with the larger dataset (Dataset2) the state-based approach logged higher for more features than the hybrid approach. This was counterintuitive to our original expectation since the combined approach was designed to exploit the advantages of both state-based and random-based methods. This suggests that in the hybrid system, rather than enhance the code coverage, the accuracy of the state-based method was affected by integrating it with the random-based method.
Looking back at Fig. 3, during the experiments we discovered that it was not always possible to restore the device configuration (i.e., airplane mode and/or WiFi connectivity) whenever it had been altered by the randombased subsystem (Monkey) within the hybrid system. In most cases the device was restored properly before the state-based subsystem was invoked. Instances where the proper restoration failed contributed to the lowering the accuracy of the state-based subsystem within the hybrid system. This explains the reason why the stand-alone state-based system had better code coverage than the hybrid system consisting of both state-based and randombased components. In the next subsection, we shall examine the impact of the different code coverage capacities of the three methods on various machine learning classifiers.

Comparisons of impact of the code coverage on machine learning classifiers
In this section the performance of the three input generation approaches are compared using seven popular machine learning classifiers. From the experiments we can gain insight into the impact of their relative code coverage capacities on machine learning-based Android malware detection performance. First, we discuss the performance evaluation results from Dataset1 (i.e., 1146 malware and 1109 benign samples). All results are obtained using 10 fold cross-validation approach. Figure 9 summarizes the weighted F-measures (W-FM) of the seven classifiers for the three input generation methods. The Random Forest (RF) classifier performs best for the three methods. The state-based method achieved the best W-FM of 0.943, followed by the hybrid method with 0.934 and then the random-based method with 0.926. In the remaining 6 classifiers, both state-based and hybrid methods outperformed the random-based method with higher W-FM results. These results can be seen in Table 15 in the Appendix. The hybrid approach showed slightly higher W-FM results for the SMO and J48 classifiers. For the MLP, PART, SL and NB classifiers, the statebased approach obtained better W-FM results. Thus, we can conclude that for Dataset1, the overall best accuracy performance is achieved with the state-based, followed by the hybrid method and lastly the random-based method. Table 16 in the Appendix contains the performance evaluation results for the seven classifiers, from a second experiment with Dataset2 (i.e., 7434 malware and 6069 benign samples). Figure 10 summarizes the weighted Fmeasures (W-FM) of the seven classifiers for the three input generation methods. These results for Dataset2 shows more significant performance differences between the three methods. Again, for each of the test input generation methods, RF achieves the best accuracy performance. For the RF classifier, the state-based method has WFM of 0.878, followed by the hybrid with 0.868 and then the random-based method with 0.832. Furthermore, both hybrid and state-based methods performed significantly better than the random-based method in the other 6 classifiers. Figure 10 shows that the W-FM results for PART, J48, MLP, SL, SMO, and NB were higher for the state-based method compared to the hybrid method. Figures 11 and 12 show the results from top 20, 40, 60, 80, and 100 information-gain ranked features for the analyzed Dataset1 and Dataset2 respectively. It illustrates the W-FM performance obtained for the RF algorithm trained on both datasets. In all cases, the random-based test input generation method achieved the lowest performance as the number of features is increased. In addition, from the figures it is clear that the overall detection performance of the state-based method surpassed the others especially from 60 features and above.
Even though the analyses in the previous section (as summarized in Tables 10 and 11) showed that each test input generation approach collected higher logs for some specific features, it is the "importance" of the features that will ultimately impact the machine learning performance. The relative importance of the features can be computed by a feature ranking algorithm such as information gain.
In Tables 12 and 13, the total information gain scores for each method are presented.
In both datasets, the combine scores for the state-based method is highest (with 1.735 and 1.661487 respectively). This is followed closely by that of the hybrid method, and both of them surpass the random-based method. Even when considering only the top 20 or top 40 ranked features, the combined information gain scores maintained the same ranking. These results show that the different code coverage capacities of the three methods had an impact on the most important/significant behavioral features, which ultimately affected the performance of the machine learning classifiers. It shows that compared to the state-based approach, the random-based method is not an optimal choice for dynamic behavioral analysis of Android apps for malware detection.

Evaluating accuracy performance with additional statically obtained permission features.
In this section we explore accuracy performance improvement of the RF dynamic classifiers (trained from Dataset2) with additional static features. In order to achieve this, we extended the dynamically obtained features (API calls + Intents) with static permission features. Using only Dataset2, the overall W-FM results are illustrated in Fig. 13. From the Figure we can see that for state-based, hybrid and random-based methods, the W-FM improves from 0.8774, 0.8674, and 0.8319 to 0.93, 0.926 and 0.918 respectively. This illustrates that machine learning-based malware detection systems that utilize both static and dynamic features still need to consider more effective test input generation for the run-time feature extraction aspect in order to maximize accuracy performance.

Related work
This section reviews related work on Android malware detection and automated test input generation for Android. Previous work on Android malware detection can be categorized under static analysis or dynamic analysis, although some systems combine both techniques. In the static analysis approach, the code is usually reverse engineered and examined for presence of any malicious code. [24][25][26][27][28][29][30][31][32][33][34] are examples of detection solutions based on static analysis. Dynamic analysis on the other hand, involves executing apps in a controlled environment such as a sandbox, virtual machine, or a physical device in order to trace its behavior. Several automated dynamic analysis systems such as [8,19,[35][36][37][38][39][40][41][42][43][44] have been proposed for detecting suspicious behaviors from Android applications. However, the efficiency of these systems depend on the ability to effectively trigger the malicious behaviors hidden within an application during the analysis. Dynamic analysis systems seek to improve behavioral footprint of apps under analysis by applying automated test input generation tools. Hence, test input generation needs to be as efficient as possible because it impacts on the discovery of malicious behavior. Several test input generation tools have been developed in order to trigger Android applications under testing.
Monkey [15], is the most frequently used tool for testing Android apps. Linares-Vàsquez et al. [5] stated that Monkey is the only random testing tool available for the researchers. It does not require any additional installation effort as it is part of the Android developers' toolkit. Monkey implements a random exploration strategy that considers the application under test as a black box to which it continuously sends UI events, until the specified maximum upper bound is reached. Monkey can be used on both emulators and real devices. Dynodroid [45] is a random-based test input generation tool for Android. It can generate both IU events and system events. It also has the ability to allow users to manually provide inputs (e.g., for authentication) when exploration is stalling. Its drawback is the need to instrument the Android framework in order to generate system events. Also, Dynodroid can only run on an emulator and cannot be used with real devices.
ORBIT [17], is proprietary tool developed by Fujitsu Labs. It implements a model-based exploration strategy (i.e., a state based approach). ORBIT statically analyses the application's source code to understand which UI events are relevant for a specific activity. Unlike ORBIT, Mobi-GUITAR [46], formerly known as GUIRipper [16] dynamically builds the GUI model of the application under test. It implements a depth-first search strategy and re-starts its exploration from its starting state when it can no longer detect new states during exploration. It can only generate UI events and not system events. Also, it is not an open source tool and is only available as a Windows binary.
A3E [18] is publicly available tool that consists of two strategies to trigger the applications: the DFS (depth first search) and a taint-targeted approach. However, the open source A3E repository does not provide the taint-targeted strategy. ACTEve [47] is developed to support both system events and UI events which is based on concolic-testing and symbolic execution. However, ACTEve needs to instrument both the Android framework and the application in order to perform the test. For this reason, ACTEve cannot be used for analysing apps on "common off the shelf " devices.
PUMA [20] is a tool designed to be a general purpose UI automator for Android. It provides the random exploration approach using Monkey. It is also extensible to implement different exploration strategies because it provides a finite state machine representation of the app. However, PUMA is only compatible with the most recent releases of the Android framework.
Choudhary et al. [6] performed a thorough comparison of the main existing test input generation tools for Android. They evaluated the effectiveness of these tools using four metrics: ability to detect faults, ability to work on multiple platforms, ease of use, and code coverage. They evaluated Monkey, ACTEve, Dynodroid, A3E, GUIRipper, SwiftHand, and PUMA on 68 applications. For the code coverage, on average Monkey and Dynodroid (which are both random-based) performed much better than the other tools. Overall, Monkey achieved the best code coverage. Their results also showed that maximum coverage was achieved by all tools within 5 to 10 min. The study in [6], however, quantified code coverage by the average number of Java statements covered by each tool. By contrast this paper uses the number of apps that logged each feature as an indicator of code coverage, which has more relevance to malware detection systems. Unlike [6], this paper focused on code coverage analysis of random-based, state-based and a hybrid approach to test input generation using a much larger number of applications consisting of benign and malware samples. As mentioned earlier, the Monkey tool was used to implement the random-based approach, while DroidBot, was used to implement the state-based approach. Both of these tools best met our requirements as representative candidates for the different methods due to the following reasons: (a) open source; (b) usable with common off the shelf Android devices without modification; (c) Android framework and platform independence, i.e., no requirement for platform or app instrumentation; (d) amenable to large scale automation for malware detection. Note that DroidBot was not one of the test input generation tools evaluated in [6].
Furthermore, several works have presented machine learning-based Android malware detection systems built upon dynamically obtained features. Marvin [57] applies a machine learning approach to features extracted from a combination of static and dynamic analysis techniques. Shabtai et al [58] presented a dynamic framework called Andromaly which applies several different machine learning algorithms, including random forest, naive Bayes, multilayer perceptron, Bayes net, logistic, and J48 to classify the Android applications. However, they assessed their performances on only four self-written malware applications. MADAM [59] is also a dynamic analysis framework that uses machine learning to classify Android apps. MADAM utilizes 13 features extracted from the user and kernel levels. Other systems include Droidcat [60], STREAM [61], Mobile-Sandbox (2014) [62], Dysign [63], Massarelli et. al [64], Alzaylaee et. al. [10], and Afonso et al. [65]. All of these machine learning-based malware detection systems employ the random-based Monkey tool for test input generation.
From the literature, it is clear that there has been extensive research and publications in the area of Android malware detection. However, unlike this paper, no study has been undertaken to date that comparatively evaluates the impact of automated test input generation methods on machine learning-based Android malware detection. Such a study provides valuable insights that will be useful for optimizing and improving dynamic analysis systems designed for future Android malware detection.

Conclusion
In this paper, stateful input generation approaches are proposed for machine learning-based dynamic analysis for malware detection. These include a state-based approach and a hybrid approach that combines the state-based with the random-based method. The stateful approaches were compared to the commonly used random-based method (utilizing the Monkey tool) by evaluating their respective code coverage capacities within a dynamic analysis system using real devices. The code coverage capacities were determined based on their respective behavioral footprints measured from logged API calls and Intents captured at run-time from two datasets each consisting of benign and malware applications. The statebased approach is implemented using DroidBot, while the hybrid approach combines the Monkey tool with DroidBot. The paper also presents experiments conducted to study the impact of the respective code coverage capacities of the input generation systems on various machine learning classifiers. It was found that both state-based and hybrid approaches provided much better code coverage than the random-based method. Contrary to expectation, the hybrid method was unable to improve the code coverage over the state-based method. This was because the random component of the hybrid system frequently interferes with device operation (despite our implementation of mitigating measures) leading to sub-optimal code coverage. The state-based approach ultimately enabled the best accuracy performance in majority of the machine learning classifiers.
Based on our findings, it is clear that Android dynamic analysis systems need to incorporate better input generation methods than the currently popular random-based Monkey tool. Furthermore, machine learning-based malware detection systems that employ dynamically obtained features need better input generation tools to improve code coverage. Utilizing a state-based/model-based tool such as DroidBot is definitely a step towards more robust dynamic analysis systems and higher accuracy malware detection capability.   6 2 P a c k a g e M a n a g e r ; -> g e t I n s t a l l e d A p p l i c a t i o n s 1 2 2 7 7 6 6 1 7 9 8