Summary of Effectiveness of Static Analysis Against Android Malware

One Line

This study explores the effect of obfuscation on Android malware detection using machine learning and suggests a strong detector.

Slides

Slide Presentation (15 slides)

Effectiveness of Static Analysis Against Android Malware

Source: arxiv.org (PDF, 13,467 words)

Introduction

• The spread of Android devices has led to an increase in malware crafted for this operating system.

• Machine learning (ML) algorithms have been developed for Android malware detection.

• ML algorithms depend on the quality and soundness of the data used to build the classifier.

Static Analysis vs Dynamic Analysis

• Static analysis involves inspecting the content of the app package file (APK).

• Dynamic analysis involves executing the app in a controlled environment and logging traces.

• Both techniques are valid for extracting valuable data from apps, but static analysis is computationally cheaper.
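As a minimal sketch of what static inspection of an APK can look like, the snippet below reads the permissions requested in a manifest. It assumes the manifest has already been decoded to plain XML (inside a real APK it is stored in a binary XML format, so a decoding step is needed first). The sample manifest and the function name `requested_permissions` are illustrative and do not come from the study.

```python
import xml.etree.ElementTree as ET

# Namespace used by Android manifest attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest, for illustration only.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.app">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application android:label="Example"/>
</manifest>"""


def requested_permissions(manifest_xml: str) -> set:
    """Return the permission names declared in <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    return {
        elem.attrib[name_attr]
        for elem in root.iter("uses-permission")
        if name_attr in elem.attrib
    }


permissions = requested_permissions(SAMPLE_MANIFEST)
print(sorted(permissions))
```

Nothing is executed here: the app is only read, which is why this kind of analysis is cheap compared with running the app in a sandbox.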

Obfuscation Techniques

• Obfuscation aims to prevent code analysis by transforming the code of apps without altering functionality.

• Obfuscation is commonly used by malware authors to bypass static analysis-based malware detectors.

• Some studies have shown that obfuscation harms detectors that rely on static analysis features.

Impact of Obfuscation on Static Analysis Features

• Obfuscation techniques affect different static analysis features to varying degrees across different tools.

• Certain features remain valid for ML malware detection even in the presence of obfuscation.

• A comprehensive assessment of the impact of obfuscation techniques on static analysis features is needed.

Experimental Results

• The impact of different obfuscation strategies and tools on static analysis features is evaluated.

• Insights about the use of these features for malware detection in obfuscated scenarios are provided.

• A high-performing ML-based Android malware detector that is robust against obfuscation is proposed.

Robust ML Malware Detector

• The proposed detector outperforms current state-of-the-art detectors.

• It can identify goodware and malware despite the presence of obfuscation.

• A novel dataset with more than 95K obfuscated Android apps is presented for testing the robustness of malware detection proposals.

Conclusion

• Certain features remain valid for ML malware detection even in the presence of obfuscation.

• The proposed ML-based Android malware detector is robust against obfuscation and outperforms current state-of-the-art detectors.

• The availability of the dataset and code used in the study promotes open science and reproducibility.

Impact of Obfuscation Techniques on Static Analysis Features

• Different obfuscation strategies were applied to Android apps using state-of-the-art obfuscation tools.

• The impact on various static analysis features was analyzed.

• Some features obtained from the manifest of applications were the most stable.

Differences in Features from Different Obfuscation Tools

• The differences in features depended on the tool used and were due to implementation peculiarities.

• The largest differences were observed for API function and Ad-hoc features.

• The way files to be transformed were selected also affected the differences observed between tools.

Sensitivity of ML Algorithms to Feature Vector Obfuscation

• Even small changes in feature vectors could have a significant impact on prediction performance.

• Some feature families exhibited high fluctuations in the decisions made by the models.

• This indicates greater sensitivity to changes introduced by obfuscation.

Robust Malware Detection Model

• A robust model was proposed based on features that exhibited high insensitivity to changes and high accuracy with non-obfuscated apps.

• The selected features included Permissions, API functions, and Strings.

• The model outperformed state-of-the-art obfuscation-resilient detectors in both non-obfuscated and obfuscated scenarios.

Limitations of Static Analysis Features

• Some feature types, such as file-related features, were not effective in differentiating malware.

• Feature persistence was not the sole factor influencing the robustness of the detection model.

• High insensitivity values were a more adequate indicator of robustness.

Conclusion

• Static analysis features can be effective for ML-based Android malware detection in the presence of obfuscation.

• Features that are both relevant and insensitive to changes can be used to build robust detection models.

• The proposed robust detection approach outperforms state-of-the-art obfuscation-resilient detectors.

Key Takeaways

• Obfuscation techniques are commonly used by malware authors to bypass static analysis-based malware detectors in Android.

• Certain features remain valid for ML malware detection even in the presence of obfuscation.

• The proposed ML-based Android malware detector is robust against obfuscation and outperforms current state-of-the-art detectors.

Key Points

Obfuscation techniques are commonly used by malware authors to bypass static analysis-based malware detectors in Android.
Some machine learning (ML) detection proposals have been developed that are resilient to obfuscation.
The impact of specific obfuscation techniques on static analysis features used for ML malware detection in Android is assessed.
Certain features remain valid for ML malware detection even in the presence of obfuscation.
A robust ML malware detector for Android is proposed that outperforms current state-of-the-art detectors.
Static analysis and dynamic analysis are two techniques used for data extraction in Android malware detection.
Obfuscation is a security technique used to prevent code analysis by transforming the code of apps without altering their functionality.
This study provides a comprehensive assessment of the impact of obfuscation techniques on static analysis features for ML malware detection in Android.

Summaries

19 word summary

This study examines the impact of obfuscation on machine learning malware detection in Android and proposes a robust detector.

60 word summary

This study explores the impact of obfuscation on static analysis features for machine learning (ML) malware detection in Android. Despite obfuscation, certain features remain valid for ML malware detection. The study proposes a robust ML malware detector for Android that outperforms current detectors. Obfuscation techniques can counteract the effectiveness of static analysis, but the proposed detector is robust against obfuscation.

137 word summary

This study examines the impact of obfuscation techniques on static analysis features for machine learning (ML) malware detection in Android. Despite the influence of obfuscation, certain features remain valid for ML malware detection. Based on these findings, a robust ML malware detector for Android is proposed that outperforms current state-of-the-art detectors. The rise of Android devices has led to an increase in malware targeting this operating system. To combat this, researchers have developed ML-based anti-malware solutions. Both dynamic and static analysis methods are valid for Android malware detection, but static analysis is more computationally efficient. However, obfuscation techniques can counteract the effectiveness of static analysis. This study comprehensively assesses the impact of obfuscation techniques on static analysis features for ML-based Android malware detection. The proposed ML-based Android malware detector is robust against obfuscation and outperforms current detectors.

510 word summary

This study examines the impact of obfuscation techniques on static analysis features used for machine learning (ML) malware detection in Android. The experiment evaluates how obfuscation affects different static analysis features across various tools. Despite the influence of obfuscation, certain features remain valid for ML malware detection. Based on these findings, a robust ML malware detector for Android is proposed that outperforms current state-of-the-art detectors.

The rise of Android devices has led to an increase in malware targeting this operating system. To combat this, researchers have developed ML-based anti-malware solutions. These algorithms analyze app data to classify apps as either goodware or malware. Data extraction for Android malware detection can be done through dynamic or static analysis. Dynamic analysis involves executing the app and monitoring its behavior, while static analysis involves inspecting the APK file. Both methods are valid, but static analysis is more computationally efficient. However, obfuscation techniques can counteract the effectiveness of static analysis.

Obfuscation is a technique that transforms app code to prevent code analysis. Both legitimate developers and malware authors use obfuscation. Some studies have shown that obfuscation hampers static analysis-based malware detection, while others propose feature extraction techniques that can identify obfuscated malware.

In conclusion, this study offers valuable insights into the impact of obfuscation techniques on static analysis features for Android malware detection. Certain features remain effective for ML-based malware detection despite obfuscation. The proposed ML-based Android malware detector outperforms current state-of-the-art detectors and is robust against obfuscation. The availability of the dataset and code used in the study promotes open science and reproducibility.

The study evaluates the effectiveness of static analysis features for detecting Android malware in the presence of obfuscation. Different obfuscation strategies are applied to Android apps using state-of-the-art tools, and the impact on various static analysis features is analyzed. Seven feature families are considered, including Permissions, Components, API functions, Opcodes, Strings, File Related, and Ad-hoc. The stability of these features when obfuscation is applied varies, with manifest-based features being the most stable.

The study then uses ML algorithms to assess the ability of static features to detect malware and their stability in the presence of obfuscation. In a clean environment, most feature families provide sufficient information for effective malware detection using ML algorithms, particularly API functions and Strings. However, File-Related features are found to be unsuitable.

The study investigates the sensitivity of ML algorithms to changes induced by feature vector obfuscation. Even small changes in feature vectors can significantly impact prediction performance.

628 word summary

Obfuscation is a technique that transforms app code to prevent code analysis. Both legitimate developers and malware authors use obfuscation. Legitimate developers use it to protect their code, while malware authors use it to hinder static analysis. Some studies have shown that obfuscation hampers static analysis-based malware detection, while others propose feature extraction techniques that can identify obfuscated malware. However, these studies have limitations in terms of reproducibility and experimental details.

To address these limitations, this study comprehensively assesses the impact of obfuscation techniques on static analysis features for ML-based Android malware detection. The strength, validity, and detection potential of various static analysis features are analyzed when obfuscation is present. The study evaluates the impact of different obfuscation strategies and tools on these features, providing insights into their use for detecting obfuscated malware. Based on the experimental results, a high-performing ML-based Android malware detector that is resilient against obfuscation is proposed. This detector surpasses current state-of-the-art detectors and can identify goodware and malware even in the presence of obfuscation. Additionally, the study provides a dataset of over 95K obfuscated Android apps, enabling researchers to test their malware detection proposals.

The study also examines the differences in features obtained using different obfuscation tools. These differences depend on the tool used and are due to implementation peculiarities. The largest differences are observed for API function and Ad-hoc features. The selection of files to be transformed also affects the differences between tools, as some tools perform additional checks to avoid modifying certain content.

The study then uses ML algorithms to assess the ability of static features to detect malware and their stability in the presence of obfuscation. The RandomForests classification algorithm is used without parameter optimization. In a clean environment, most feature families provide sufficient information for effective malware detection using ML algorithms, particularly API functions and Strings. However, File-Related features are found to be unsuitable.

The study investigates the sensitivity of ML algorithms to changes induced by feature vector obfuscation. Even small changes in feature vectors can significantly impact prediction performance.

1044 word summary

Malware authors often use obfuscation techniques to bypass static analysis-based malware detectors in Android. However, some machine learning (ML) detection proposals have been developed that are resilient to obfuscation. In this study, the impact of specific obfuscation techniques on static analysis features used for ML malware detection in Android is assessed. The experimental results show that obfuscation techniques affect different static analysis features to varying degrees across different tools. However, certain features remain valid for ML malware detection even in the presence of obfuscation. Based on these findings, a robust ML malware detector for Android is proposed that outperforms current state-of-the-art detectors.

The spread of Android devices has led to an increase in the amount of malware crafted for this operating system. As a result, researchers have developed anti-malware solutions based on ML algorithms. These algorithms are able to find patterns in app data that can be used to classify apps as either goodware or malware. The performance of ML algorithms depends on the quality and soundness of the data used to build the classifier. In the case of Android malware detection, data extraction can be performed using either dynamic or static analysis. Dynamic analysis involves executing the app in a controlled environment and logging traces that describe its behavior. Static analysis, on the other hand, involves inspecting the content of the app package file (APK). Both techniques are valid for extracting valuable data from apps, but static analysis is computationally cheaper. However, it can be counteracted by applying obfuscation techniques.

Obfuscation is a security through obscurity technique that aims to prevent code analysis by transforming the code of apps without altering their functionality. Obfuscation can be used by both legitimate software developers and malware authors. Legitimate developers use obfuscation to protect their code from being analyzed by third parties, while malware authors use it to prevent static analysis from obtaining meaningful information about the behavior of apps. Some studies have shown that obfuscation harms detectors that rely on static analysis features for malware detection, while others have proposed feature extraction techniques that enable successful identification of malware even when apps are obfuscated. However, these studies have limitations in terms of reproducibility, biased datasets, and lack of details about the experimental setups.

To address these limitations, this study presents a comprehensive assessment of the impact of obfuscation techniques on static analysis features for ML malware detection in Android. The study analyzes the strength, validity, and detection potential of a complete set of features obtained through static analysis when obfuscation is used. The impact of different obfuscation strategies and tools on static analysis features is evaluated, providing insights about the use of these features for malware detection in obfuscated scenarios. Based on the experimental results, a high-performing ML-based Android malware detector that is robust against obfuscation is proposed. The detector outperforms current state-of-the-art detectors and can identify goodware and malware despite the presence of obfuscation. The study also presents a novel dataset with more than 95K obfuscated Android apps, allowing researchers to test the robustness of their malware detection proposals.

In conclusion, this study provides valuable insights into the impact of obfuscation techniques on static analysis features for Android malware detection. The findings show that certain features remain valid for ML malware detection even in the presence of obfuscation. The proposed ML-based Android malware detector is robust against obfuscation and outperforms current state-of-the-art detectors. The availability of the dataset and code used in the study promotes open science and reproducibility.

The effectiveness of static analysis features for detecting Android malware in the presence of obfuscation was evaluated in this study. Different obfuscation strategies were applied to Android apps using state-of-the-art obfuscation tools, and the impact on various static analysis features was analyzed. Seven families of features were considered, including Permissions, Components, API functions, Opcodes, Strings, File Related, and Ad-hoc. The persistence of these features when obfuscation was applied varied, with the features obtained from the manifest of applications being the most stable.
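One plausible way to quantify the persistence of features is the share of an app's original features that are still present after obfuscation. The sketch below uses that definition as an assumption; the study's exact formula is not given in this summary, and the feature names are invented for illustration.

```python
def persistence(original: set, obfuscated: set) -> float:
    """Share of the original app's features that survive obfuscation.

    This is an assumed metric for illustration, not the study's definition.
    """
    if not original:
        return 1.0  # nothing to lose, so treat the feature set as fully persistent
    return len(original & obfuscated) / len(original)


# Hypothetical features before and after a string-renaming transformation.
features_before = {"perm:INTERNET", "perm:SEND_SMS", "str:login_url", "str:api_key"}
features_after = {"perm:INTERNET", "perm:SEND_SMS", "str:a1", "str:b2"}

score = persistence(features_before, features_after)
print(score)  # 0.5
```

In this toy case the manifest-style features (permissions) are untouched while the string features are lost, which mirrors the kind of per-family difference described above.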

The differences in features obtained using different obfuscation tools were also examined. It was found that the differences in features depended on the tool used and were due to implementation peculiarities. The largest differences were observed for API function and Ad-hoc features. The way in which files to be transformed were selected also affected the differences observed between tools. Some tools performed additional checks to avoid modifying certain content, while others did not, resulting in differences in the features obtained.

Machine learning (ML) algorithms were then used to evaluate the ability of static features to detect malware and their stability in the presence of obfuscation. The RandomForests classification algorithm was used without any parameter optimization. In a clean environment, most feature families provided enough information for effective malware detection using ML algorithms, particularly API functions and Strings. However, File-Related features were found to be unsuitable for this purpose.

The sensitivity of ML algorithms to changes induced by feature vector obfuscation was also investigated. It was found that even small changes in feature vectors could have a significant impact on prediction performance. Some feature families exhibited high fluctuations in the decisions made by the models, indicating greater sensitivity to changes introduced by obfuscation.
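The idea of a model being insensitive to obfuscation can be sketched as the fraction of apps whose predicted label stays the same once the app is obfuscated. This is an assumed, simplified reading of the notion; the function name and the example predictions are hypothetical.

```python
def insensitivity(pred_original, pred_obfuscated) -> float:
    """Fraction of apps whose predicted label is unchanged after obfuscation."""
    if len(pred_original) != len(pred_obfuscated):
        raise ValueError("prediction lists must be aligned app by app")
    if not pred_original:
        raise ValueError("at least one prediction is required")
    unchanged = sum(a == b for a, b in zip(pred_original, pred_obfuscated))
    return unchanged / len(pred_original)


# Hypothetical labels (1 = malware, 0 = goodware) for the same five apps.
labels_before = [1, 1, 0, 0, 1]
labels_after = [1, 0, 0, 0, 1]

score = insensitivity(labels_before, labels_after)
print(score)  # 0.8
```

A feature family whose models flip many decisions would score low here, even if only a few of its features changed.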

To address this issue, a robust malware detection model was proposed based on features that exhibited high insensitivity to changes and high accuracy with non-obfuscated apps. The selected features included Permissions, API functions, and Strings. A RandomForest classifier trained with these robust features outperformed state-of-the-art obfuscation-resilient detectors in both non-obfuscated and obfuscated scenarios.
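The selection step described above (keep feature families that are both accurate on non-obfuscated apps and insensitive to obfuscation) can be sketched as a simple threshold filter. The family names, scores, and thresholds below are placeholders invented for illustration; they are not results from the study.

```python
def select_robust_families(scores, min_accuracy, min_insensitivity):
    """Keep families whose (accuracy, insensitivity) both meet the thresholds."""
    return sorted(
        name
        for name, (accuracy, insens) in scores.items()
        if accuracy >= min_accuracy and insens >= min_insensitivity
    )


# Placeholder values: (accuracy on clean apps, insensitivity under obfuscation).
placeholder_scores = {
    "family_a": (0.95, 0.97),  # accurate and stable: kept
    "family_b": (0.96, 0.60),  # accurate but unstable: dropped
    "family_c": (0.70, 0.99),  # stable but uninformative: dropped
}

selected = select_robust_families(placeholder_scores, 0.9, 0.9)
print(selected)  # ['family_a']
```

Both conditions matter: a family that never changes but carries little information is as unhelpful as an informative one that obfuscation disrupts.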

The study also highlighted the limitations of static analysis features for detecting Android malware. It was found that some feature types, such as file-related features, were not effective in differentiating malware. Feature persistence was not the sole factor influencing the robustness of the detection model, and high insensitivity values were a more adequate indicator of robustness.

In conclusion, static analysis features can be effective for ML-based Android malware detection in the presence of obfuscation. Features that are both relevant and insensitive to changes can be used to build robust detection models. The proposed robust detection approach using Permissions, API functions, and Strings outperformed state-of-the-art obfuscation-resilient detectors. Future work could involve extending the analysis to additional obfuscation techniques and exploring richer app representations that integrate different static analysis data.

Raw indexed text (85,722 chars / 13,467 words / 2,132 lines)

Light up that Droid! On the Effectiveness of Static Analysis Features against App Obfuscation for Android Malware Detection

Borja Molina-Coronado, Antonio Ruggia, Usue Mori, Alessio Merlo, Alexander Mendiburu and Jose Miguel-Alonso

Abstract: Malware authors have seen obfuscation as the means to bypass malware detectors based on static analysis features. For Android, several studies have confirmed that many anti-malware products are easily evaded with simple program transformations. In contrast to these works, ML detection proposals for Android leveraging static analysis features have also been presented as obfuscation-resilient. Therefore, it needs to be determined to what extent the use of a specific obfuscation strategy or tool poses a risk to the validity of ML malware detectors for Android based on static analysis features. To shed some light on this question, in this article we assess the impact of specific obfuscation techniques on common features extracted using static analysis and determine whether the changes are significant enough to undermine the effectiveness of ML malware detectors that rely on these features. The experimental results suggest that obfuscation techniques affect all static analysis features to varying degrees across different tools. However, certain features retain their validity for ML malware detection even in the presence of obfuscation. Based on these findings, we propose an ML malware detector for Android that is robust against obfuscation and outperforms current state-of-the-art detectors.

Index Terms: machine learning, static analysis, malware detection, obfuscation, reliability, evasion

• Borja Molina-Coronado, Alexander Mendiburu and Jose Miguel-Alonso are with the Dept. of Computer Architecture and Technology, University of the Basque Country UPV/EHU, Donostia, Spain. E-mail: {borja.molina, alexander.mendiburu, j.miguel}@ehu.es

• Usue Mori is with the Dept. of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, Donostia, Spain. E-mail: [email protected]

• Antonio Ruggia is with the Dept. of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genoa, Genoa, Italy. E-mail: [email protected]

• Alessio Merlo is with the CASD - Centre for Advanced Defense Studies, Rome, Italy. E-mail: [email protected]

Corresponding author: Borja Molina-Coronado

1 INTRODUCTION

With the spread of Android devices, the amount of malware crafted for this OS has also experienced extraordinary growth [1, 2]. This has led researchers to devise cutting-edge anti-malware solutions based on machine learning (ML) algorithms. When fed with app data, these algorithms are able to find patterns that are characteristic and informative enough to classify apps as either goodware or malware. In this sense, the performance of ML highly depends on the quality and soundness of the data that is used to build the classifier [3, 4]. In the case of Android malware detection, the extraction of this data, in the form of a vector of features that represents the behavior of apps, is performed using either dynamic or static analysis [5, 6].
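To make the notion of a feature vector concrete, the following sketch encodes the features extracted from one app as a 0/1 vector over a fixed vocabulary. The vocabulary entries and the helper name `to_feature_vector` are assumptions chosen for illustration; the paper's actual feature encoding may differ.

```python
def to_feature_vector(app_features, vocabulary):
    """Encode a set of extracted features as a 0/1 vector over a fixed vocabulary."""
    return [1 if feature in app_features else 0 for feature in vocabulary]


# Hypothetical vocabulary shared by all apps in a dataset.
vocabulary = ["perm:INTERNET", "perm:SEND_SMS", "api:getDeviceId", "str:http://"]

# Hypothetical features extracted from a single app.
app_features = {"perm:INTERNET", "api:getDeviceId"}

vector = to_feature_vector(app_features, vocabulary)
print(vector)  # [1, 0, 1, 0]
```

A classifier only ever sees such vectors, so anything that changes which positions are set can change its decision.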

Dynamic analysis is performed in a controlled environment (sandbox) where the app is executed. During execution, traces that describe the behavior of the app, e.g., network activity, system calls, etc., are logged [5]. In contrast, static analysis is based on the inspection of the content of the package file (APK) of an app. This includes the compiled code and other resources such as image and database files [7]. Both techniques are valid to extract valuable data from apps. However, dynamic analysis involves a costly process whose success is dependent on the emulation method used and the absence of sandbox evasion artifacts in the code of apps. Static analysis, by comparison, is computationally cheaper, but it can be counteracted by applying app code transformations. Such transformations are commonly known as obfuscation [8].

Obfuscation is a security-through-obscurity technique that aims to prevent automatic or manual code analysis. It involves the transformation of the code of apps, making it more difficult to understand but without altering its functionality [9]. This characteristic has made obfuscation a double-edged sword, used by both goodware and malware authors. Developers of legitimate software leverage obfuscation to protect their code from being statically analyzed by third parties, e.g., trying to avoid app repackaging or intellectual property abuses [10]. Malware authors have seen obfuscation as a means to conceal the purpose of their code [11], preventing static analyses from obtaining meaningful information about the behavior of apps.
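A toy illustration of one such transformation, identifier renaming, is sketched below. The "app" is modeled as a mapping from developer-chosen method names to the framework APIs they call; all names are made up, and the renaming scheme (a hash prefix) is an assumption rather than the behavior of any particular obfuscation tool.

```python
import hashlib

# Toy "app": developer-defined method names mapped to the framework APIs they call.
app = {
    "sendPremiumSms": ["android.telephony.SmsManager.sendTextMessage"],
    "readContacts": ["android.content.ContentResolver.query"],
}


def rename_identifiers(methods):
    """Replace developer-chosen method names with opaque ones; bodies are unchanged."""
    renamed = {}
    for name, api_calls in methods.items():
        opaque = "m" + hashlib.sha256(name.encode()).hexdigest()[:8]
        renamed[opaque] = list(api_calls)
    return renamed


def identifier_features(methods):
    """Features derived from the names the developer chose."""
    return set(methods)


def api_call_features(methods):
    """Features derived from the framework APIs that are invoked."""
    return {call for calls in methods.values() for call in calls}


obfuscated = rename_identifiers(app)
print(identifier_features(app) & identifier_features(obfuscated))   # set()
print(api_call_features(app) == api_call_features(obfuscated))      # True
```

The functionality (which APIs get called) is preserved, yet every name-based feature is gone. This toy covers renaming only; other obfuscation strategies can alter other kinds of features as well.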

It may seem common sense that the application of any obfuscation technique, or a combination of several, will render fruitless any malware analysis that relies on features extracted using static analysis. However, it is unclear to what extent this is true. Some studies on Windows and Android executables have demonstrated that obfuscation harms detectors that rely on static analysis features. For example, packing (see footnote 1) prevents obtaining informative features [12, 13], which are essential to train classifiers. Similar conclusions have been drawn for other forms of transformation [14, 15], showing a major weakness in Android malware detectors. However, other studies contradict what has been stated in the aforementioned works, proposing feature extraction techniques via static analysis that enable a successful identification of malware even when apps are obfuscated [16–18].

All of these works appear promising in demonstrating either the flaws or the strengths of static analysis features for malware detection. However, these discrepancies complicate the extraction of sound conclusions regarding the validity of static analysis features for Android malware detection. In addition, many of these works focus solely on the labels predicted by the detectors, without analyzing the effect of the obfuscation on the apps and/or features used to train them [14, 15, 17, 19, 20]. This additional feature-centered information is important to understand and explain why the detectors are working or failing when obfuscation is present, and is crucial for building more robust detectors. Finally, another evident flaw of some of these studies is the lack of details concerning their datasets and the configuration of their experimental setups [16, 18, 21, 22]. Apart from the lack of reproducibility, biases in the datasets may skew the findings towards non-generalizable results. Therefore, the conclusions drawn from all these works may have limited applicability beyond the evaluated scenarios, and this may be the cause of the contradictions found in the literature.

To the best of our knowledge, this work presents the first comprehensive study of the impact of common obfuscation techniques on the information that is obtained through static analysis to perform malware detection with ML algorithms. The contributions of this paper can be summarized in the following highlights:

• We provide an agnostic evaluation (see footnote 2) of the strength, validity and detection potential of a complete set of features obtained by means of static analysis of APKs when obfuscation is used.

• We analyze the impact of a variety of obfuscation strategies and tools on static analysis features, providing insights about the use of these features for malware detection in obfuscated scenarios.

• We propose a high-performing ML-based Android malware detector leveraging a set of robust static analysis features. We demonstrate the ability of this detector to identify goodware and malware despite obfuscation, outperforming the state-of-the-art.

• We present a novel dataset with more than 95K obfuscated Android apps, allowing researchers to test the robustness of their malware detection proposals.

In the spirit of open science and to allow reproducibility, we make the code publicly available at gitlab-borja.

The rest of this paper is organized as follows. Section 2 introduces the literature that has previously tackled obfuscation as a problem in malware analysis. Section 3 provides basic information about topics that are required to understand the content of this paper. Section 4 describes the construction of the app dataset and presents the features that are considered in our experiments. Section 5 evaluates the impact of different obfuscation strategies and tools on static analysis features, as well as their validity for malware detection. Section 6 is devoted to assessing the robustness of our ML malware detection proposal. Section 7 includes a discussion of the main findings made throughout this paper. Finally, we conclude this paper in Section 8.

1. Packing is a particular form of obfuscation which hides the real code through one or more layers of compression/encryption. At runtime, the unpacking routine restores the original code in memory so that it can then be executed.

2. In this context, by agnostic we mean an analysis carried out without focusing on a specific malware detection proposal.
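Footnote 1 describes packing as hiding the real code behind layers of compression/encryption that are undone at run time. The sketch below mimics this with `zlib` compression followed by a one-byte XOR, which is only a toy stand-in for encryption; the "code" bytes, the key, and the function names are all hypothetical.

```python
import zlib

# Stand-in for an app's code; the marker below is what a static scan might look for.
original_code = b'const-string v0, "http://malicious.example/c2"\ninvoke sendTextMessage\n'
marker = b"sendTextMessage"


def pack(code: bytes, key: int = 0x5A) -> bytes:
    """Compress, then XOR every byte with a one-byte key (toy stand-in for encryption)."""
    return bytes(b ^ key for b in zlib.compress(code))


def unpack(packed: bytes, key: int = 0x5A) -> bytes:
    """Reverse of pack(): roughly what an unpacking routine would do at run time."""
    return zlib.decompress(bytes(b ^ key for b in packed))


packed = pack(original_code)

print(marker in original_code)           # True: visible to static inspection
print(marker in packed)                  # False: hidden in the packed form
print(unpack(packed) == original_code)   # True: behavior is fully recoverable
```

A static feature extractor that only sees the packed bytes finds none of the original strings or calls, which is why packing deprives classifiers of informative features.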

2 RELATED WORK

The related work can be divided into two groups: (1) studies that analyze the vulnerabilities of malware detectors when obfuscation is present, and (2) works that propose novel malware detectors which are presumably robust to obfuscation.

2.1 Study of the Vulnerabilities of Malware Detectors

The works that evaluate the negative effects of obfuscation on Android malware detectors have mainly been carried out for black-box malware detectors, i.e., the system or model is analyzed and evaluated based solely on its input-output behavior, without direct access to or knowledge of its internal workings. The first work of this type [19] studied how obfuscation impacts the detection ability of 10 popular anti-virus programs available in the VirusTotal platform. The work demonstrated that these detectors are vulnerable and lose their reliability in the identification of obfuscated malware. Similarly, in [20], 13 Android anti-virus programs from VirusTotal are assessed using different obfuscation strategies to modify malware. The results showed a modest improvement in detection accuracy over the findings of previous work [19] and showed that the companies responsible for developing these tools are trying to counteract obfuscation. A more comprehensive analysis of 60 anti-virus tools in VirusTotal has been presented in [14]. Again, the work demonstrated the vulnerabilities of most detectors when facing obfuscated malware. However, this analysis shows that the success in bypassing detection highly depends on the obfuscation tools and strategies considered.
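The black-box evaluation protocol described above can be sketched as follows: the detector is treated as an opaque function, and its detection rate is compared on original and obfuscated malware. The signature-matching detector, the Base64 "obfuscation", and the sample strings are toy assumptions made for illustration; they do not model any real anti-virus product or the tools used in the cited studies.

```python
import base64


def toy_detector(sample: str) -> bool:
    """Hypothetical black-box detector: flags samples containing a known string."""
    return "evil.example/payload" in sample


def encode_strings(sample: str) -> str:
    """Toy obfuscation: replace the sample's content with its Base64 encoding."""
    return base64.b64encode(sample.encode()).decode()


def detection_rate(detector, samples) -> float:
    """Fraction of samples that the detector flags as malware."""
    flagged = sum(1 for sample in samples if detector(sample))
    return flagged / len(samples)


# Hypothetical malware samples, represented as plain strings.
malware = [
    "connect http://evil.example/payload",
    "download evil.example/payload then run",
]
obfuscated = [encode_strings(sample) for sample in malware]

rate_plain = detection_rate(toy_detector, malware)
rate_obfuscated = detection_rate(toy_detector, obfuscated)
print(rate_plain, rate_obfuscated)  # 1.0 0.0
```

Only inputs and outputs are observed here, so the experiment shows that detection fails but not which features were disturbed, which is the limitation the paper raises about this line of work.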

In the mentioned studies, the detectors are commercial

products with unknown characteristics. Some other works

have focused on assessing the impact

of obfuscation in

published ML based detectors. In [17], an analysis of the

effect of obfuscation in two detectors, one relying on static

and the other on dynamic analysis features, is presented. It

is shown that the performance of the detector using dynamic

analysis features is not altered by obfuscation, contrary to

the detector that uses static analysis features. However,

the authors indicated that this effect can be easily mitigated by

including obfuscated samples during the training phase of

ML models. In [15], eight state-of-the-art Android malware

detectors leveraging static analysis features and ML algo-

rithms are assessed using obfuscated malware samples. The

authors demonstrated that obfuscation is a major weakness

of these popular solutions, since all of them suffered a drop

in their performance. One of the most recent and comprehensive

studies is carried out in [12]. This work analyzes

the effect of packing in ML malware detectors relying on

static analysis for Windows executables. The conclusions

drawn from the extensive set of experiments indicate that

ML malware detectors for Windows fail to identify the class

of transformed samples due to the insufficient informative

capacity of static analysis features.

All these works prove the added difficulty that obfusca-

tion entails for malware detection. However, most of them

fail to provide explanations behind accurate or erroneous

detections. In this sense, they treat the detectors as black-box

tools and do not analyze the effect of different obfuscation

strategies and tools on the apps and, specifically, on the fea-

tures that will be used for training the detectors. This makes

it difficult to extract meaningful insights and provides no

useful information to build more robust classifiers.

2.2 Obfuscation-Resilient Detectors

A second group of proposals focuses on the development

of obfuscation-resilient detectors, specifically designed to

operate effectively in the presence of obfuscated apps. Two

of the most relevant works in this regard are DroidSieve

[16] and RevealDroid [18]. The former categorizes static

analysis features as obfuscation-sensitive and obfuscation-

insensitive based on theoretical aspects. Feature frequency

is studied for different datasets with obfuscated and un-

obfuscated malware samples to support the idea that most

changing features provide better information. Consequently,

they proposed a detector that relies on the features

of both groups and offers good performance in terms of

malware detection and family identification. The latter work

argues against static analysis features such as Permissions,

Intents or Strings for robust malware detection. Contrary to

the authors of DroidSieve, they suggest that obfuscation-

sensitive features do not provide useful information to

detect malware. Instead, the authors propose a new set of

static analysis features based on a backward analysis of

the calls to dynamic code loading and reflection APIs. In

this way, the functions invoked at runtime are identified,

nullifying the effect of obfuscation and making the proposed

detector obfuscation-resilient.

Two allegedly obfuscation-resilient detectors leveraging

deep learning algorithms are presented in [21] and [22]. The

authors of these works suggest that the capacity of deep

learning to embed and extract useful information from the

features is enough to tackle obfuscation. The first work relies

on strings extracted from the app code.

Strings are then

transformed into sequences of characters to obtain an em-

bedded representation of the app that is then used for clas-

sification. Despite the excellent results reported for malware

detection, the ability of the detector to identify obfuscated

apps is based on (unproven) statements that are not specif-

ically tested. The latter proposal incorporates obfuscation-

sensitive and insensitive features, including permissions,

opcodes and meta-data from APKiD 3 , a signature-based

fingerprinting tool. Similarly to the previous proposal, the

obfuscation-resiliency of this work cannot be confirmed

based on the results, since the effect of the use of obfuscation

in the detector is based on theoretical aspects not specifically

covered by the experiments.

3. https://github.com/rednaga/APKiD

The experiments carried out in all these works present

some flaws that, in our opinion, call their capabilities

into question. For example, most of them do not describe, or

only vaguely analyze, the composition of their datasets in terms

of the number of obfuscated malware or goodware samples,

as well as the tools and strategies considered to obfuscate

the samples. Some articles focus their analyses exclusively

on obfuscated malware, either for the training or evaluation

of the detectors, but what about obfuscated goodware? How

do detectors behave in the presence of such apps? The use

of different obfuscation tools or strategies for malware and

for goodware introduces biases in ML algorithms, since the

generated models may associate obfuscation, or the use of a

particular obfuscation tool, with a specific class in the data

[12]. Additionally, experiments performed with malware

and goodware captured from different periods can cause

biases in the detectors [23].

Also, most of these studies

focused on a small set of features, arguing against other

types of features without providing any proof. All these

aspects may justify the good published results and cause

contradictions concerning other analyses carried out for ML-

based detectors [15, 17]. Finally, we also found that most

of them do not provide enough details to reproduce their

systems and thus lack reproducibility.

3 BACKGROUND

This section briefly introduces some basic concepts needed

to understand the rest

of this paper. This includes the

structure and content of an Android Application Package

(APK) from which static analysis features are extracted, and

the types of obfuscation techniques and their effect on the

apps.

3.1 Android Apps

Android apps are usually developed in Java or

Kotlin 4 .

When an app has to meet

very strict performance con-

straints, or interact directly with hardware components,

Android allows developers to introduce native components

written in C and C++ (i.e., native code). An Android app is

distributed and installed via an APK, a compressed (ZIP) file

containing all the resources needed (e.g., code, images) to

execute the app. Figure 1 shows the internal structure

of an APK file.
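Since an APK is an ordinary ZIP archive, its contents can be enumerated with standard tooling before any feature is extracted. The following is a minimal sketch in Java; the class name `ApkEntryLister` and both helper methods are hypothetical and do not belong to any tool used in this paper. The toy archive built in memory only mimics the layout of an APK and contains placeholder bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: lists the entries of an APK by reading it as a ZIP archive. */
final class ApkEntryLister {

    /** Returns the entry names found in the ZIP stream, in archive order. */
    static List<String> entryNames(InputStream apk) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(apk)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a small in-memory archive with the given entry names (placeholder contents). */
    static byte[] buildToyArchive(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] toy = buildToyArchive("AndroidManifest.xml", "classes.dex", "res/layout/main.xml");
        System.out.println(entryNames(new ByteArrayInputStream(toy)));
    }
}
```

For a real app, the same `entryNames` method could be given a `FileInputStream` opened on the APK file.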

Every APK must be signed with the private key of the

developer. To validate this signature, the APK contains the

public certificate of the developer inside the META-INF

folder. This mechanism guarantees the integrity of the

APK 5 . In a nutshell, before installing an app, Android veri-

fies if the files in the APK match a pre-computed signature

and continues with the installation only if the integrity check

succeeds.

The AndroidManifest.xml defines the structure of an Android app and its meta-data, such as the package name of the app, the required permissions, and the main components (i.e., Activity, Service, Broadcast Receiver, and Content Provider). An Android app can contain one or multiple DEX file(s) (i.e., classes*.dex), which include the compiled Java code. Each .dex file can reference up to 64k methods [24], such as the Android framework methods, other library methods, and the app-specific methods. For the native components, Android provides an Android Native Development Kit (NDK) [25] that generates native libraries in the form of Linux shared objects. Such objects are stored in the lib folder.

Finally, the res folder contains the compiled resources (e.g., images and strings), and the assets directory includes the raw resources, providing a way to add arbitrary files such as text, HTML, font, and video content into the app.

4. From now on, we will refer to Java code, although the techniques we describe are also valid for apps written in Kotlin.

5. Note that Android does not verify the validity of the developer’s certificate but instead uses this mechanism to validate the integrity of the content within the APK. Therefore, the developers’ certificates can be self-signed.

    META-INF/
        cert.rsa
        cert.sf
        manifest.mf
    lib/
    assets/
    res/
    classes.dex
    AndroidManifest.xml

Fig. 1: Structure of an APK file

3.2 Obfuscation

Obfuscation is the process of modifying an executable without altering its functionality [26]. It aims to counteract automatic or manual code analysis. In the Android context, many strategies can be applied to modify the code or resources within the APK file: from simple operations that change some metadata to bypass basic checks (e.g., signature-based anti-malware), to techniques that explicitly modify the DEX code or resources of the app [27]. It is worth emphasizing that obfuscation is more common in Android than in other binary code (e.g., x86 executables), because analyzing and repackaging an Android app is straightforward [13]. In the rest of this section, we present the types of modifications considered in this work.

3.2.0.1 Renaming: A DEX file stores the original string-valued identifiers (names) of fields, methods and classes [28]. Often, these identifiers leak information about code functionalities, lifecycle components and how they interact with each other. For instance, a common practice among programmers is to add “Activity” to the name of each Java class that implements an activity component. The renaming technique replaces these identifiers with meaningless strings, aiming to remove information about the functionality of the app. Consequently, renaming involves modifying the .dex files and the Manifest file (AndroidManifest.xml). Note that this technique cannot be applied to methods of the Android lifecycle (e.g., onCreate, onPause) or Android framework components because that would break the execution logic.

3.2.0.2 Code manipulation: These techniques manipulate the DEX code to remove useless operations, hide specific API invocations, and modify the execution flow. The main techniques in this category are:

• Junk code insertion (JCI): This technique introduces sequences of useless instructions, such as nop (i.e., no-operation instructions that do nothing). Other JCI strategies transform the control-flow graph (CFG) of apps by inserting goto instructions or arithmetic branches. For example, a goto may be introduced in the code pointing to a useless code sequence ending in another goto instruction, which points to the instruction after the first goto. The arithmetic branch technique inserts a set of arithmetic computations followed by a branch instruction that depends on the result of these computations, crafted in such a way that the branch is never taken [29].

• Call indirection (CI): This technique aims to modify the call graph and, therefore, the CFG of the app. It introduces a new intermediate chain of method invocations in the code, adding one or several nodes between a pair of nodes in the original graph. For example, given a method invocation from m_or1 to m_or2 in the code, m_or1 is modified to call the first of a sequence of n intermediate methods (m_i, 1 <= i <= n) that ends in a call to m_or2. In this way, the analysis may not reveal that m_or2 is actually invoked by m_or1 [30].

• Reflection: This technique uses the reflection capability of the Java language to replace direct method invocations with Java reflection methods that use class and method identifiers as parameters to perform the call. This makes actual method invocations difficult to inspect [30]. Listings 1 and 2 show an example of this transformation. In Listing 1, the method m1 (of the class MyObject) is accessed through the operator “.” from the object instance, whereas Listing 2 shows the same method invoked using the Java reflection API. In this example, a java.lang.reflect.Method object is created (lines 2-3) and invoked (line 4) for a specific object instance (i.e., obj), whereas the class and method names are passed as parameters of these functions.
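The arithmetic branch and call indirection transformations can be pictured at source level with a small Java sketch. This is only an illustration under simplifying assumptions: real tools operate on DEX bytecode rather than on Java source, and all class and method names below (`CodeManipulationSketch`, `mOr1`, `m1`, and so on) are hypothetical.

```java
/** Illustrative only: names are hypothetical and not taken from any obfuscation tool. */
final class CodeManipulationSketch {

    // Original code: mOr1 calls mOr2 directly, so the call graph has the edge mOr1 -> mOr2.
    static int mOr2(int x) {
        return x * 2;
    }

    static int mOr1(int x) {
        return mOr2(x) + 1;
    }

    // Call indirection with n = 2 intermediate methods: mOr1Indirect -> m1 -> m2 -> mOr2.
    static int m2(int x) {
        return mOr2(x);
    }

    static int m1(int x) {
        return m2(x);
    }

    static int mOr1Indirect(int x) {
        return m1(x) + 1;
    }

    // Arithmetic branch: x * x + x equals x * (x + 1), which is always even
    // (also under int overflow), so the branch below is never taken.
    static int mOr2WithJunk(int x) {
        int k = (x * x + x) % 2;
        if (k != 0) {
            return -1; // dead code, present only to alter the control-flow graph
        }
        return x * 2;
    }

    public static void main(String[] args) {
        System.out.println(mOr1(20) + " " + mOr1Indirect(20) + " " + mOr2WithJunk(20));
    }
}
```

In both cases the observable behavior is unchanged, while the call graph and the opcode sequence of the compiled code differ from the original.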

3.2.0.3 Encryption: This technique prevents access to part or all of the code or resources (e.g., strings and asset files) of the app by using symmetric encryption algorithms. It involves storing the original code or resources in an encrypted form so that a decryption routine, inserted in the code, is invoked whenever an encrypted part needs to be accessed. The decryption key is stored somewhere in the APK or calculated at runtime. This technique introduces extra latency during app execution and severely complicates the analysis of the functionality of the encrypted part [27].

It is worth emphasizing that different obfuscation techniques can be combined to improve their effectiveness. For example, encrypting the strings of reflective calls can hide the method and class names invoked at runtime. This makes it difficult to recover these values by static analysis of the apps. Listing 3 shows an example of the application of both obfuscation techniques to the code in Listing 1. In particular, the class and method names are decrypted at runtime (lines 2-3), hiding which methods are actually invoked. Note how these values are exposed only in an encrypted form, and could change if a different encryption key or algorithm was employed.

Listing (1) Java direct method invocation

    // MyObject is declared in the package com.example
    public class MyObject implements MyInterface {
        public Object m1(Object prm) { /* ... */ return null; }
    }

    public Object standardInvoke(com.example.MyInterface obj, Object arg) {
        return obj.m1(arg);
    }

Listing (2) Java reflective method invocation

    public Object reflectionInvoke(com.example.MyInterface obj, Object arg) throws ReflectiveOperationException {
        Class<?> cls = Class.forName("com.example.MyObject");
        Method m = cls.getDeclaredMethod("m1", Object.class);
        return m.invoke(obj, arg);
    }

Listing (3) Java reflective method invocation with encrypted values

    public Object refEncrInvoke(com.example.MyInterface obj, Object arg) throws ReflectiveOperationException {
        String className = decrypt("AXubduuiao...ZXW");
        String methodName = decrypt("uibdadBUID...ncu");
        Class<?> cls = Class.forName(className);
        Method m = cls.getDeclaredMethod(methodName, Object.class);
        return m.invoke(obj, arg);
    }
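The listings are fragments. A self-contained variant of the same idea can be sketched as follows, under two loudly stated assumptions: Base64 decoding stands in for the decryption routine (Base64 is an encoding, not encryption), and a standard library method (`String.toUpperCase`) plays the role of the hidden target. The class `ReflectionSketch` and its methods are hypothetical.

```java
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch: reflective call whose class and method names are only stored encoded. */
final class ReflectionSketch {

    /** Stand-in for a real decryption routine (assumption: Base64 instead of a cipher). */
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    /** Helper used to prepare the encoded literals an obfuscator would embed in the app. */
    static String encode(String plain) {
        return Base64.getEncoder().encodeToString(plain.getBytes(StandardCharsets.UTF_8));
    }

    /** Direct invocation: the callee is visible to static analysis. */
    static Object direct(String s) {
        return s.toUpperCase();
    }

    /** Reflective invocation: the callee is only known after decoding at runtime. */
    static Object reflective(String s, String encodedClass, String encodedMethod) {
        try {
            Class<?> cls = Class.forName(decode(encodedClass));
            Method m = cls.getMethod(decode(encodedMethod));
            return m.invoke(s);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cls = encode("java.lang.String");
        String method = encode("toUpperCase");
        System.out.println(direct("apk") + " " + reflective("apk", cls, method));
    }
}
```

A static analyzer that only inspects string constants and invocation targets would see the encoded literals and the calls to the reflection API, but not the name of the method that is finally executed.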

4 DATASET

For our experiments, firstly, we constructed a dataset with

obfuscated and non-obfuscated apps. From this collection

of apps and by means of static analysis, we obtain a set of

feature vectors that constitute the object of this study. This

section describes how the app dataset is built and the types

of features derived from the apps.

4.1 App Dataset

We build our app dataset using a subset of APKs from

the AndroZoo repository [31], which contains more than 20

million APKs with associated meta-data. This meta-data

includes the source of the APK, the date, and the number

of positive detections (VTD) in VirusTotal. Our objective

was to obtain a dataset with the same number of malware

and goodware samples, all of them free of obfuscation. We

downloaded thousands of samples and filtered out those

marked by APKiD 6 as “suspicious” of including obfusca-

tion. To label samples we relied on the VTD values [32]: an

app with VTD ≥ 7 is considered malware, while an app with

VTD = 0 is considered goodware (apps with intermediate

VTD values were filtered out).
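The labeling rule above can be written down directly. The sketch below is a hypothetical helper (the names `VtdLabeler`, `Label` and `MALWARE_THRESHOLD` are not from the paper); only the thresholds, VTD ≥ 7 for malware and VTD = 0 for goodware, come from the text.

```java
/** Hypothetical helper that applies the VTD-based labeling rule described in the text. */
final class VtdLabeler {

    enum Label { GOODWARE, MALWARE, DISCARDED }

    /** Minimum number of positive VirusTotal detections for a sample to be labeled malware. */
    static final int MALWARE_THRESHOLD = 7;

    static Label label(int vtd) {
        if (vtd < 0) {
            throw new IllegalArgumentException("VTD cannot be negative: " + vtd);
        }
        if (vtd >= MALWARE_THRESHOLD) {
            return Label.MALWARE;
        }
        if (vtd == 0) {
            return Label.GOODWARE;
        }
        return Label.DISCARDED; // intermediate values are filtered out
    }

    public static void main(String[] args) {
        for (int vtd : new int[] {0, 3, 7, 25}) {
            System.out.println(vtd + " -> " + label(vtd));
        }
    }
}
```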

6. https://github.com/rednaga/APKiD

In a second step, we generated obfuscated versions of the apps in the filtered dataset. To perform this process, we used the DroidChameleon [30], AAMO [33], and ObfuscAPK [29] tools. We chose these tools because (1) they are open source, (2) they provide a wide range of obfuscation techniques, and (3) they have previously been shown to effectively evade Android malware detectors. Specifically, for each obfuscation tool, we try to obfuscate every app in the filtered dataset using five obfuscation techniques: Renaming, Junk Code Insertion, Reflection, Call Indirection and Encryption. The configuration of the tools was left as default for all techniques. The results of this process are summarized in Table 1.

Note that some tool combinations failed due to errors

during the APK decompilation process. It is worth noting

that there were more failures in the case of malware apps

than in goodware apps. ObfuscAPK was the tool with the

best success rate, correctly obfuscating an average of 85%

of the apps. On the contrary, we were unable to obtain

obfuscated samples when trying to apply Encryption with

AAMO, due to bugs introduced in the code by this tool that

prevent the APK from being rebuilt. The attempts to use

Renaming with DroidChameleon were also unsuccessful

due to an error in the implementation of

the tool. For

other techniques, DroidChameleon and AAMO had average

success rates of 55% and 28%, respectively. During this pro-

cess, we realized that for some apps all the tool-technique

combinations failed, and thus these apps were removed

from the filtered dataset. As a result of this process, we

obtained a “Clean” dataset which consists of 4 749 goodware

and 4 067 malware (presumably) non obfuscated samples.

Table 2 summarizes the different datasets that will be

used in the experiments. The criteria for the composition of

these datasets will be explained in Section 5.

• NonObf: It includes the non-obfuscated versions of the apps for which we could not obtain an obfuscated version with all the tools for at least one technique, i.e., apps that can be obfuscated using a specific tool and technique but not with the remaining tools using the same technique.

• CleanSuccObf: It includes the subset of non-obfuscated apps present in Clean, but not in NonObf. That is, all the apps for which all the tools have worked for at least one technique.

• The remaining datasets (Renaming, JCI, CallIndirection, Reflection, and Encryption) contain the obfuscated versions of the apps in CleanSuccObf for that particular technique using all the tools.

TABLE 1: Success rate of different technique-tool obfuscation combinations for the apps in the Clean dataset. The first part of the name refers to the tool used to obfuscate the apps, with DC for DroidChameleon, AA for AAMO, and OA for ObfuscAPK. The characters after the underscore refer to the strategy followed to obfuscate the apps: renaming (rnm), junk code insertion (jcins), call indirection (ci), reflection (refl) and encryption (encr).

    tool-technique   #Goodware samples   #Malware samples   Obf. Success Rate
    DC_rnm           -                   -                  0%
    AA_rnm           2244                1953               34%
    OA_rnm           5690                4317               81%
    DC_jcins         1855                1123               24%
    AA_jcins         2289                2019               35%
    OA_jcins         6003                4755               87%
    DC_ci            3664                2209               47%
    AA_ci            1337                1362               22%
    OA_ci            6050                4765               87%
    DC_refl          6200                3993               82%
    AA_refl          1332                1402               22%
    OA_refl          6080                4802               88%
    DC_encr          5008                3746               70%
    AA_encr          -                   -                  -
    OA_encr          6074                4814               88%

TABLE 2: Composition of datasets used in this work. The columns indicate the number of samples that comprise each set. The CleanSuccObf dataset contains the clean (original) apps for which we obtained obfuscated versions with all tools for at least one technique.

    Dataset           #Goodware samples   #Malware samples
    Clean             4749                4067
    NonObf            1345                1211
    CleanSuccObf      3404                2856
    Renaming          3238                2868
    JCI               1515                1008
    CallIndirection   2118                1737
    Reflection        2667                2484
    Encryption        4790                4060

4.2 Feature Dataset

An app dataset has to be transformed into a dataset of

feature vectors prior to performing malware detection using

ML. Following a detailed literature analysis, we identified

seven families of static analysis features that have proven

to be useful for ML-based malware detection [6]. We used

two well-known and widely used static analysis frame-

works for Android to extract these features: Androguard

[34] and FlowDroid [35]. Sources of these features include:

the classes.dex and AndroidManifest.xml files, as well as the

contents of the res and assets directories of APKs.

4.2.1 Permissions

Permissions have commonly been used as a source of in-

formation for malware detection in Android [36–39]. In this

category, we consider as features the full set of permissions

provided by Google in the Android documentation, as

well as the set of custom permissions that developers may

declare to enforce some functionality in their apps. Follow-

ing this procedure, we extracted a set of binary features,

each corresponding to the presence or absence of a given

permission.

4.2.2 Components

An app consists of different software components that must

be declared in the AndroidManifest.xml file. These elements

have been widely used as a source of information for mal-

ware detectors [36, 38, 40, 41]. We extract a list of hardware

and software components that can be declared using the

<uses-feature> tag from the Android documentation 9 , as well

as every identifier for Activity, Service, ContentProvider,

BroadcastReceivers and Intent Filters. In total, we obtained

a set of 85 476 binary features, whose value is set to True

or False for an app according to the presence of the feature

in its AndroidManifest.xml file. We additionally derive seven

frequency features accounting for the number of elements

of each type in the app.

4.2.3 API functions

API libraries allow developers to easily incorporate addi-

tional functionality and features into their apps, being the

main means of communication between the programming

layer and the underlying hardware. As such, analyzing the

calls to methods of these libraries (API functions) constitutes

a good instrument to characterize the functionality of apps,

and, therefore, for malware detection. Following similar ap-

proaches to those proposed in the literature [36, 39, 40, 42],

we extract a binary feature for each API method, and set its

value to True if the app contains any call

to that method

within its code. In total, this set consists of 66 118 binary

features.

4.2.4 Opcodes

The compiled Android code (Dalvik) consists of a sequence of opcodes. Opcode-based features provide insights about the coding habits of developers as they represent fine-grained information about the functionality of apps [43]. Subsequences of opcodes, or simply n-grams, have been used for Android malware detection in [44–47]. Concerning the size of the subsequences, Jerome et al. [44] and Canfora et al. [45] observed that n = 2 offers a good trade-off between the size of the feature vector generated and the performance obtained by detectors. Therefore, we extract unique opcode subsequences of length 2 (or bigrams) from the code of the apps, and create a feature to represent the number of appearances of each bigram in the code. The resulting vector contains a total of 25 354 frequency features.

7. https://developer.android.com/reference/android/Manifest.permission

8. https://developer.android.com/guide/topics/permissions/defining

9. https://developer.android.com/guide/topics/manifest/uses-feature-element.html
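The bigram frequency features described in Section 4.2.4 can be sketched as a simple counting pass over an opcode sequence. The example assumes that the opcodes are already available as one flat array of numeric values (in practice they would be extracted with a tool such as Androguard, and possibly counted method by method); the class name `OpcodeBigrams` and the "a b" key format are illustrative choices.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: counts opcode bigrams in a flat sequence of numeric opcodes. */
final class OpcodeBigrams {

    /** Maps each bigram, written as "first second", to its number of occurrences. */
    static Map<String, Integer> count(int[] opcodes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 1 < opcodes.length; i++) {
            String bigram = opcodes[i] + " " + opcodes[i + 1];
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy sequence; the values are arbitrary opcode numbers chosen for illustration.
        int[] sequence = {90, 110, 90, 110, 40};
        System.out.println(count(sequence));
    }
}
```

Each distinct key observed across the whole dataset would become one frequency feature, with the count as its value for a given app.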

4.2.5 Strings

The APK file strings are a valuable source of information

for malware detection. In this regard, the most common

strings include IP addresses, host names and URLs [36, 48];

command names [49, 50] and numbers [48]. We processed

app files and found 2 425 892 unique strings. Following the

procedure in [12], we observed that 98.5% of the strings

were present in less than 1% of the samples. After removing

these rare strings, we obtained 39 793 binary features, each

representing the presence or absence of

a specific string

within the app files.
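The removal of rare strings amounts to a document-frequency filter. A minimal sketch follows, assuming each app is represented by the set of unique strings found in its files; the class `RareStringFilter` and the parameter `minFraction` are hypothetical names, and a value of 0.01 would correspond to the 1% cut-off mentioned above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: keeps only the strings that appear in enough samples. */
final class RareStringFilter {

    /** Returns the strings present in at least minFraction of the samples. */
    static Set<String> frequentStrings(List<Set<String>> samples, double minFraction) {
        Map<String, Integer> sampleCount = new HashMap<>();
        for (Set<String> sample : samples) {
            for (String s : sample) {
                sampleCount.merge(s, 1, Integer::sum);
            }
        }
        double threshold = minFraction * samples.size();
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : sampleCount.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> samples = List.of(
                Set.of("http://example.org", "su"),
                Set.of("http://example.org"),
                Set.of("http://example.org", "chmod"),
                Set.of("su", "10.0.0.1"));
        System.out.println(frequentStrings(samples, 0.5));
    }
}
```

Each surviving string then yields one binary feature per app, set according to whether the string occurs in that app.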

4.2.6 File related features

This type of features includes the size of

code files and

different file types inside the APK [16, 48, 49, 51]. We base

our file type extractor on both the extension of the file and

the identification of the first bytes of the content (i.e., magic

numbers) of files. The result is a new frequency feature

for every unique combination of the extension (ext) and

magic type (mtype), identified as ext_mtype. For files without

an extension, we use the complete file name instead. In total,

this set consists of 65 986 frequency features per app.

4.2.7 Ad-hoc Features

As explained earlier, some specific detectors claim to use

obfuscation-resistant features. We call the features used

by these detectors that do not fall into any of the above

categories ad-hoc features. They include: semantic features

based on sink and source relationships in the code [50];

certificate information [16]; flags about the use of crypto-

graphic, reflective, and command execution classes [42, 48,

51]; and resolved function names for native and reflective

calls [18]. Due to the computational cost of obtaining these

features, we limited the time spent computing them to 15

minutes per sample. The result is a set of 35 387 frequency

features, each representing the number of occurrences of the

feature within the app.

5 FEATURE VALIDITY

As a first step in this study, we have designed a set of experiments

to determine the robustness and detection ability

of the seven feature families described in the previous section

when obfuscation is present. The first experiment analyzes

the impact that different obfuscation strategies and

tools have on the features. In the second experiment we

evaluate the performance and stability of

ML algorithms

when using these features for malware detection.

5.1 Feature persistence

In this experiment, we aim to examine the impact of obfus-

cation on the features presented above. We analyze when

and how much the features

change in the presence of

obfuscation. We highlight the disparities among obfuscation

tools and how different implementation strategies to achieve

the same obfuscation objective can affect the features.

To analyze these aspects, we calculate the feature persis-

tence for each tool-technique obfuscation combination. This

is done by determining the average level of overlap between

the features of an original (clean) app and its obfuscated

counterparts. To compute the feature overlap, we compare

each pair of feature vectors calculated for an original

app

and its obfuscated version, and quantify the proportion of

features with exact value matches. Note that for binary-

featured representations (Permissions, Components, Strings

and API functions), this is equivalent to computing the

Jaccard index that measures the ratio between the shared

elements and the total number of elements in the union of

both feature vectors. Note also that, for frequency vectors,

an increment or decrease in one unit or ten units has the

same effect in this metric.
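The overlap computation can be sketched as follows. This is a minimal illustration of the exact-match proportion described above, not the code used in the experiments; the class `FeaturePersistence` and its method names are hypothetical, and feature vectors are assumed to be integer arrays of equal length (0/1 for binary features, counts for frequency features).

```java
/** Hypothetical sketch of the feature persistence metric (proportion of exact matches). */
final class FeaturePersistence {

    /** Fraction of positions at which the clean and obfuscated vectors hold the same value. */
    static double overlap(int[] clean, int[] obfuscated) {
        if (clean.length == 0 || clean.length != obfuscated.length) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        int same = 0;
        for (int i = 0; i < clean.length; i++) {
            if (clean[i] == obfuscated[i]) {
                same++;
            }
        }
        return (double) same / clean.length;
    }

    /** Average overlap over all (clean, obfuscated) pairs of one tool-technique combination. */
    static double persistence(int[][] cleanVectors, int[][] obfuscatedVectors) {
        if (cleanVectors.length == 0 || cleanVectors.length != obfuscatedVectors.length) {
            throw new IllegalArgumentException("need the same non-zero number of vectors on both sides");
        }
        double sum = 0.0;
        for (int app = 0; app < cleanVectors.length; app++) {
            sum += overlap(cleanVectors[app], obfuscatedVectors[app]);
        }
        return sum / cleanVectors.length;
    }

    public static void main(String[] args) {
        int[][] clean = {{1, 0, 3, 5}, {0, 0, 2, 2}};
        int[][] obfuscated = {{1, 0, 4, 15}, {0, 0, 2, 2}};
        System.out.println(persistence(clean, obfuscated));
    }
}
```

Note that a change from 5 to 15 counts exactly as much as a change from 3 to 4, which reflects the remark above about frequency vectors.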

The results of this experiment are shown in Table 3. We

find various degrees of persistence, in most cases over 0.8,

with many exact matches between the feature vectors of

clean and obfuscated APKs. Components and Permission

features suffer the smallest changes when applying strate-

gies such as Junk Code Insertion, Call Indirection, Reflection

and Encryption (independently of the tool). Despite being

affected by all techniques, File-Related features are also

among the least affected on average. On the contrary, Ad-

hoc, API functions and Opcode feature vectors change the

most when obfuscation is applied. Nonetheless, the average

persistence values for these features indicate that most fields

(about 75%) are not affected by obfuscation. Therefore, in

most cases, we conclude that the use of obfuscation is not

reflected as a radical change in the feature vectors.

Persistence values refer to the proportion of features that

remain unchanged, but do not tell us which particular fea-

tures change the most when a tool-technique combination

is applied. To shed some light in this regard, we selected

the 15 features that change the most when obfuscation is

applied. They may belong to different families. To obtain

them, we measured the degree of discrepancy in the number

of occurrences of each of these features, comparing the

original application and the obfuscated version. To simplify

the visualization, we show the results for each technique,

averaging the discrepancy values for the three tools.

The

resulting rankings are shown in Figure 4. The name of each

bar is the feature name (which includes its family).

The

number at the right of each bar is the degree of discrepancy,

i.e., the average difference in the frequency of the feature

between original and obfuscated versions of apps. Note that

for easier interpretation, the scales are specific to each figure.
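One way to obtain such a ranking is sketched below. It assumes that the degree of discrepancy of a feature is the mean absolute difference between its value in the original app and in the obfuscated version, averaged over all app pairs; the use of the absolute value is an assumption of this sketch, and the class `DiscrepancyRanking` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: ranks features by their average change under obfuscation. */
final class DiscrepancyRanking {

    /** Mean absolute difference per feature between clean and obfuscated vectors. */
    static double[] meanAbsDifference(int[][] clean, int[][] obfuscated) {
        if (clean.length == 0 || clean.length != obfuscated.length) {
            throw new IllegalArgumentException("need the same non-zero number of apps on both sides");
        }
        int numFeatures = clean[0].length;
        double[] scores = new double[numFeatures];
        for (int app = 0; app < clean.length; app++) {
            for (int f = 0; f < numFeatures; f++) {
                scores[f] += Math.abs(clean[app][f] - obfuscated[app][f]);
            }
        }
        for (int f = 0; f < numFeatures; f++) {
            scores[f] /= clean.length;
        }
        return scores;
    }

    /** Indices of the k features with the highest scores, in descending order of score. */
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> order = new ArrayList<>();
        for (int f = 0; f < scores.length; f++) {
            order.add(f);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return new ArrayList<>(order.subList(0, Math.min(k, order.size())));
    }

    public static void main(String[] args) {
        int[][] clean = {{1, 0, 10}, {2, 0, 10}};
        int[][] obfuscated = {{1, 3, 30}, {2, 5, 10}};
        System.out.println(topK(meanAbsDifference(clean, obfuscated), 2));
    }
}
```

With k = 15 and scores averaged over the three tools, this would correspond to the kind of ranking shown in Figure 4.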

Regarding the persistence of the different feature fami-

lies, Renaming mainly affected Components and API func-

tions features, due to changes in the names of user-defined

packages, classes, methods and fields. It also alters the decla-

ration of custom permissions present in the code, since they

depend on the name of the class where they are declared.

However, as can be seen in Figure 4a, none of these features

are among the 15 most affected, mainly because the names

assigned to the classes are app-specific. In contrast, Opcode

features are among those most significantly affected, due to

changes in the order of methods when processing class files.

This mainly changes the frequency of sequences that present invocation instructions (opcodes 110, 111 and 112).

Fig. 4: Top 15 of most changed features for each obfuscation strategy: (a) Renaming, (b) Junk Code Insertion, (c) Call Indirection, (d) Reflection, (e) Encryption. The values on the right indicate the disparity or difference in the frequency of features between the obfuscated apps and their original versions. Results from the three tools have been averaged.

TABLE 3: Persistence of static analysis features when comparing clean and obfuscated apps using ObfuscAPK (OA), DroidChameleon (DC) and AAMO (AA).

Feature families: Permissions, Components, API functions, Opcodes, Strings, File Related, Ad-hoc

Renaming: 0.972 0.219 0.480 0.985 0.995 0.895 0.923 1.0 0.999 0.717 0.993 1.0 0.993 0.942

Junk Code Insertion: 1.0 0.269 0.107 0.116 1.0 0.803 0.880 0.987 0.655 0.345 0.560

Call Indirection: 1.0 0.493 0.718 0.489 0.832 0.970 1.0 0.791 0.874 0.983 0.455 0.667 0.409

In concordance with persistence values in Table 3, Fig-

ures 4c and 4b show that Call Indirection and Junk Code

Insertion techniques are particularly detrimental for features

based on code information, with Opcode and Ad-hoc fea-

tures being the most sensitive to both types of obfuscation.

In particular, Ad-hoc features are the most affected by Call

Indirection (see Figure 4c) due to the added complexity

in the analyses required for

their extraction. This is the

case of sink and source relations between API

functions

such as Cursor.getString and Log.i. Also, due to the addition

of indirect calls, this technique increases the frequency of

some opcode sequences such as “90 110” formed by an

iput (90) instruction followed by an invoke (110). This

technique involves adding hundreds of auxiliary (indirect

caller) methods per class, either in separate or in the API

classes inside the API. However, these methods are ran-

domly named, which limits their impact (their popularity

will be low). As shown in Figure 4b, Junk Code Insertion

greatly alters the frequency of most Opcode sequences due

to the inclusion of useless instructions, mainly goto (40) and

invoke (110, 112, 113). The introduction of useless code also

greatly affects the size of the APK file (File-related

feature file:apk size).

Reflection changes the persistence of features extracted

from code analysis. This effect is clearly perceptible in

Figure 4d. With this technique, the code is modified to hide

the originally called methods and use reflective calls instead.

This mainly affects the API functions that are called more

frequently in the code, including string-related functions

such as toString, append, equals, or length. Ad-hoc features

are among the most changed due to the added complexity

of identifying sink and source relationships that

contain

reflective code. Reflection also results in new string features

that contain the class and function names invoked by re-

flection. However, these are declared once in the code, so

their frequency is kept low. Permission features are affected

because Reflection can hide the presence of protected API

functions that require specific permissions to be granted.

Encryption adds helper classes with the decryption rou-

tines that are used to hide user-defined strings and pa-

rameters. Therefore, API, Opcode, Ad-hoc, and File-related

features are affected by the modifications

introduced in

the code. However, the main targets of this technique are

String features, as illustrated in Figure 4e and Table 3.

These are heavily affected because their original values are

encrypted. In this regard, the top 15 features most changed

by encryption are strings related to the app’s user interface

(UTF-8, phone, id, title, type).

Reflection

0.932

1.0

0.985

0.979

0.906

0.845

0.888

0.799

1.0

0.407

0.919

1.0

0.904

0.859

Encryption

1.0

0.487

0.826

0.897

0.987

0.850

1.0

0.994

0.959

0.028

0.926

0.92

1.0

0.999

0.982

0.009

0.923

0.922

Avg.

0.977

0.939

0.751

0.833

0.907

0.722

5.1.1 Differences between Obfuscation Tools

As seen in Table 3, changes in features depend on the tool

used. These differences are due to implementation particu-

larities. Indeed, because of these peculiarities, obfuscation

can even alter features that are not the primary target of the

chosen obfuscation technique. Figure 5 depicts the average

level of overlap between the features obtained for the same

apps when obfuscated using different tools. Darker colors

indicate less overlap in the obtained feature vectors, while

lighter colors represent higher agreement. To better explain

the differences obtained, we manually examined the code of

these obfuscation tools as well as the features obtained from

different samples. Due to space limitations and in order

to make this paper more readable, we omit very specific

implementation details and limit our discussion to the more

prominent differences.

We observed that of all the tools analyzed, none of them

considered parameter randomization when implementing

the different obfuscation techniques except for Junk Code

Insertion. As such, for a given tool, the extracted feature

vectors are only dependent on the input (app data). In other

words, given the same input, a particular tool-technique

combination will always return the same output. It is worth

mentioning that all tools obfuscate (modify) the Android or

Java libraries when this type of content is included in the

APK, mainly due to poor checks during obfuscation. Since

this code is not user-related, such changes may break the

execution flow of apps.

The largest differences between obfuscation tools are

present for API function and Ad-hoc features. This aspect is

clearly perceptible in Figure 5. In the case of Reflection, we

noticed that ObfuscAPK and AAMO perform a fine-grained

checking when selecting the set of candidate function calls

to be transformed, so that errors introduced by obfuscation

are minimal. In contrast, DroidChameleon obfuscates calls

whose package matches any of the prefixes included in a

pre-defined list, without making any additional checks. In

consequence, as shown in Figures 5a and 5c, feature overlap

is low since DroidChameleon results in a higher number

of transformed API calls with respect to ObfuscAPK and

AAMO.

The way in which the files to be transformed are selected

is also the explanation behind the differences

observed

between AAMO and ObfuscAPK with Renaming (see Fig-

ure 5b). By default, ObfuscAPK selects all the files within

the APK as candidates for Renaming. This translates into

changes in the content of files even if they belong to the

Java library or the Android framework. Hence, features

in the Components and API functions families are greatly

(a) ObfuscAPK vs DroidChameleon

(b) ObfuscAPK vs AAMO

Fig. 5: Feature overlap between every pair of obfuscation tools using different obfuscation strategies. Note that because Rename and Encryption do not work for DroidChameleon and AAMO, respectively, the corresponding columns are omitted.

modified by this tool. In contrast, AAMO performs some

additional checks aiming to avoid modifying this type of

content and presents a reduced impact on these features.

Nonetheless, as shown by the persistence values for API-

related features in Table 3, these checks are insufficient. For

example, classes that are part of the “com.android” package

are obfuscated because they do not match the name of the

AAMO blacklisted “android” package.
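The blacklist problem described above can be sketched as follows. This is not AAMO's code: the function, the single-prefix blacklist and the class names are hypothetical, and the sketch only assumes that protection is decided by matching the start of the package name.

```python
# Assumption: the blacklist holds package prefixes, here only "android".
BLACKLISTED_PREFIXES = ("android",)


def is_protected(class_name, prefixes=BLACKLISTED_PREFIXES):
    """Return True when the class belongs to a blacklisted package prefix."""
    return any(
        class_name == prefix or class_name.startswith(prefix + ".")
        for prefix in prefixes
    )


# Hypothetical class names used only for illustration.
candidates = [
    "android.app.Activity",
    "com.android.internal.Helper",
    "com.example.Main",
]

# Classes that are not protected remain candidates for obfuscation.
renamed = [name for name in candidates if not is_protected(name)]
print(renamed)
```

Under these assumptions the class in the “com.android” package is not matched by the “android” prefix and is therefore obfuscated together with user code, which is consistent with the reduced persistence of API-related features reported for this tool.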

The disparities observed for API function features with

CallIndirection between the three tools

(see Figures 5a,

5b and 5c) are due to the way intermediate methods are

created. ObfuscAPK and AAMO insert the code of intermediate methods within the class file of the original calling method, whereas DroidChameleon adds this code to

a (separate) helper class. As a result, methods inserted by

ObfuscAPK and AAMO inside the class files of the Android

framework are considered as API features by the feature

extraction process. Since these tools use different naming

conventions for these new methods, the resulting features

do not overlap.

When applying encryption,

ObfuscAPK and Droid-

Chameleon use different algorithms and parameters. This

explains the differences observed between both tools in

Figure 5a. In particular, ObfuscAPK uses the AES cipher

for encryption, whereas DroidChameleon uses Caesar’s al-

gorithm. In both cases, the encryption key is hardcoded in

their respective code.
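To make the effect of string encryption on String features concrete, the following sketch applies a Caesar shift to a few of the string values mentioned earlier and measures the overlap between the clean and encrypted feature sets. The shift of 3 is a hypothetical key (the key used by DroidChameleon is not stated here), and the Jaccard overlap is used as a simple stand-in for the overlap measures reported in the paper.

```python
def caesar(text, shift=3):
    """Shift ASCII letters by `shift` positions; other characters are unchanged."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# String features taken from the examples in the text.
clean_strings = {"phone", "id", "title", "type"}

# With a hardcoded key, the same input always yields the same output.
encrypted_strings = {caesar(s) for s in clean_strings}

overlap = jaccard(clean_strings, encrypted_strings)
print(sorted(encrypted_strings), overlap)
```

Because the key is fixed, every app containing the string “phone” ends up with the same encrypted value, so the original value is hidden but the co-occurrence of strings across apps is preserved.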

5.2 ML Performance

We devise a second set of experiments to 1) analyze the

ability of static features to detect malware, and 2) study

the stability of these features for malware detection within

ML algorithms in the presence of obfuscation. In all exper-

iments, the RandomForests classification algorithm is used

without any parameter optimization [52], as implemented

in scikit-learn, a widely-used python library for ML [53].

The first scenario is focused on analyzing the predictive power of the different feature families in a fully non-obfuscated (clean) environment. For model training we use the NonObf dataset (see footnote 10). For evaluation purposes we use the apps from the CleanSuccObf dataset. Our objective is

to evaluate the ability of an off-the-shelf classifier to ap-

proximate the class y of apps (malware or goodware) as

a function of the original (non-obfuscated) features x orig

obtained from clean apps.

Table 4 shows the performance of the trained models.

As can be seen, most features present high true positive

rates (TPR above 0.8) and moderately low false positive

rates (FPR below 0.2). Therefore, it can be concluded that

most feature families provide enough information to enable

effective malware detection using ML algorithms.

This is

particularly true for API functions and String features. On

the contrary, the model generated using File-Related features performs similarly to a random choice model (an A mean value of 0.5); therefore, these features are not suitable for the purpose at hand.
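The metrics used in Table 4 can be written down as a short sketch. The function names are illustrative; the definitions follow the table caption (TPR, FPR, and A mean as the average of TPR and 1-FPR), and the confusion-matrix counts in the last line are hypothetical.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of malware correctly identified
    fpr = fp / (fp + tn)  # fraction of goodware flagged as malware
    return tpr, fpr


def a_mean(tpr, fpr):
    """Average of the true positive rate and the true negative rate (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# TPR and FPR values as reported in Table 4.
permissions = a_mean(0.867, 0.156)
file_related = a_mean(0.265, 0.197)

# Hypothetical counts, only to show how the rates are obtained.
example_tpr, example_fpr = rates(tp=80, fn=20, fp=10, tn=90)

print(permissions, file_related, example_tpr, example_fpr)
```

A classifier that guesses at random obtains an A mean close to 0.5, which is why the value computed for File-Related features indicates a lack of discriminative power.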

Even with high persistence values,

small changes in

feature vectors can lead to large changes in the performance

of an ML algorithm. This may be the case if the small set of

changed features is the most informative for a classifier and

strongly influences its prediction. Consequently, this second

scenario investigates the sensitivity of ML algorithms to the

changes induced by feature vector obfuscation.

We use the ML models trained in the previous exper-

iment (i.e., with clean apps from the NonObf set) and

compile two separate evaluation sets for each obfuscation

10. Note that an error during the obfuscation process of an app from

this set for a given tool can be due to an error in the obfuscation tool,

since the same app has been successfully obfuscated using other tools

for the same and other strategies.

TABLE 4: Performance of static analysis features for malware detection using non-obfuscated apps for both training and evaluation. TPR stands for the True Positive Rate, i.e., the number of malware correctly identified. FPR stands for the False Positive Rate, i.e., the number of goodware erroneously identified as malware. The A mean is the average of the TPR and the True Negative Ratio (1-FPR).

         Permissions  Components  API functions  Opcodes  Strings  File Related  Ad-hoc
TPR      0.867        0.808       0.928          0.884    0.907    0.265         0.768
FPR      0.156        0.157       0.081          0.252    0.082    0.197         0.143
A mean   0.855        0.825       0.923          0.816    0.912    0.534         0.812

strategy. The first set, known as the obfuscated evaluation

set, consists of (obfuscated) samples from the corresponding

Renaming, JCI, CallIndirection, Reflection or Encryption

datasets. The other set comprises the clean versions of those

apps in the obfuscated dataset. By comparing the predic-

tions made by the ML model for the clean and obfuscated

versions of the same app, we can assess whether or not

obfuscating an app can change the decision made by the ML

model. We leverage the Jaccard index to compute the over-

lap between the predictions for the clean and obfuscated

apps, and we refer to this measure as insensitivity.

Thus,

a high level of insensitivity indicates that the predictions

made by a model are preserved even when obfuscation is

applied to the apps.
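One possible reading of the insensitivity measure is sketched below. It assumes that the predictions for each evaluation set are represented as a set of (app, label) pairs and that the Jaccard index is computed over those two sets; the exact sets used by the authors are not spelled out here, and the app identifiers and labels are hypothetical.

```python
def insensitivity(clean_pred, obf_pred):
    """Jaccard index between (app, label) pairs of clean and obfuscated runs."""
    clean_pairs = set(clean_pred.items())
    obf_pairs = set(obf_pred.items())
    union = clean_pairs | obf_pairs
    return len(clean_pairs & obf_pairs) / len(union) if union else 1.0


# Hypothetical predictions for four apps before and after obfuscation.
clean = {
    "app1": "malware",
    "app2": "goodware",
    "app3": "malware",
    "app4": "goodware",
}
obfuscated = {
    "app1": "malware",
    "app2": "goodware",
    "app3": "goodware",  # the decision flips after obfuscation
    "app4": "goodware",
}

score = insensitivity(clean, obfuscated)
print(score)
```

With this definition, a model whose decisions never change obtains a value of 1.0, and every flipped decision lowers the score because the two disagreeing pairs both enter the union.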

The measured insensitivity values are compiled in Ta-

ble 5. As can be seen, the decisions of ML models for most

feature families are consistent. Permissions, Components

or API functions result in stable predictions regardless of

the obfuscation status of the apps, with insensitivity levels

exceeding 90%. On the contrary, Ad-Hoc, Opcode and File-

Related features exhibit high fluctuations in the decisions

made by the models. This suggests a greater sensitivity

of models to changes introduced by obfuscation in these

features.

We wondered if the persistence of features when obfus-

cation is applied is somehow related to the insensitivity

of ML models based on those features.

In Figure 6, we

represented the persistence and insensitivity values for the

different feature families. As can be seen, in general, there is a high correlation between low persistence and high sensitivity, meaning that larger changes in the feature vectors induce larger changes in the predictions of the ML algorithm (see, for example, Opcode features with JCI in Figure 6b and String features with Encryption in Figure 6e). However,

high persistence values do not necessarily mean that the

retained features are the ones that are more helpful to the

ML models in making accurate predictions. For example,

Ad-hoc features show high persistence values when apply-

ing Reflection (changed features are only 16% of the total).

Still, the insensitivity is rather low, indicating that the changed features include those that play an important role in the accuracy of

predictions (see Figure 6d). Another example is File-related

features, which show the most irregular behavior for ML

models despite the small proportion of features altered by

obfuscation (10% on average as shown in Figure 6f).

In this regard, the previous scenario showed that

File-related features lack informativeness for detection (see

Table 4).

The previous results highlight an important finding:

while persistent features are commonly considered reliable

predictors for malware detection, persistency is not the sole

factor influencing the robustness of the detection model.

In contrast, high insensitivity values imply high feature persistence, so insensitivity is a more adequate indicator of

robustness. Therefore, it is crucial to carefully examine the

impact of obfuscation-induced changes on the informative-

ness of the features, as even small changes can significantly

impact prediction performance. In the next section, we ex-

plore the selection of different feature vectors based on ML

performance and feature insensitivity to changes to develop

robust malware detection models.

ROBUST MALWARE DETECTION

We hypothesize that it is possible to build a robust classifier

(one with accurate predictions

when dealing with both

clean and obfuscated apps) by using features that are both

relevant (generate good models with clean apps) and insen-

sitive (the decision of the classifier does not change between

the clean and the obfuscated version of

an app). We call

these robust features. On the contrary, features that obtained

low insensitivity values (i.e., are highly sensitive) or are

irrelevant for ML models are prone to cause fluctuations

in the predictions of ML models when obfuscation is used.

To select a set of robust features based on the previous

statements, we use the A mean metric reported in Table 4 and

the average feature insensitivity reported in Table 5. Specif-

ically, we set three thresholds (0.8, 0.85, and 0.9) for both

metrics (A mean and feature insensitivity values) as criteria

for selecting different sets of robust features – to be used

for training and testing ML-based detectors. Table 6 shows

the feature families selected at each threshold: the strictest

threshold selects only the API functions family (A), the in-

termediate threshold selects API functions, Permissions, and

Strings (PAS), and the lower threshold selects Permissions,

API functions, Components, and Strings feature families

(PACS). Given the large number of features in the selected

groups (particularly in PAS and PACS), we rank the features

of each family based on the relevancy value computed by

the corresponding RandomForest models from Section 5.2,

and select only the best 2 000 features of each family. This

selection does not apply to the Permissions family, as it only

includes 683 features.
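The selection procedure can be sketched with the values reported in Tables 4 and 5. Two assumptions are made: the "Avg. Over." values of Table 5 are mapped to the feature families in column order, and a family is selected when both metrics are greater than or equal to the threshold. The variable and function names are illustrative.

```python
# A mean per feature family, as reported in Table 4.
A_MEAN = {
    "Permissions": 0.855, "Components": 0.825, "API functions": 0.923,
    "Opcodes": 0.816, "Strings": 0.912, "File Related": 0.534, "Ad-hoc": 0.812,
}

# Average insensitivity per family from Table 5 (column order is assumed).
AVG_INSENSITIVITY = {
    "Permissions": 0.968, "Components": 0.961, "API functions": 0.939,
    "Opcodes": 0.774, "Strings": 0.874, "File Related": 0.197, "Ad-hoc": 0.540,
}

FAMILY_SIZE_CAP = 2000   # best-ranked features kept per family
PERMISSIONS_SIZE = 683   # the Permissions family is kept whole


def select_families(threshold):
    """Families whose A mean and average insensitivity both reach the threshold."""
    return [
        family for family in A_MEAN
        if A_MEAN[family] >= threshold and AVG_INSENSITIVITY[family] >= threshold
    ]


def feature_count(families):
    """Total number of features after capping each family."""
    return sum(
        PERMISSIONS_SIZE if family == "Permissions" else FAMILY_SIZE_CAP
        for family in families
    )


for threshold in (0.8, 0.85, 0.9):
    chosen = select_families(threshold)
    print(threshold, chosen, feature_count(chosen))
```

Under these assumptions the three thresholds yield the PACS, PAS and A sets, with 6 683, 4 683 and 2 000 features respectively, matching the counts listed in Table 6.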

Three RandomForest classifiers are trained using the

apps in the NonObf dataset (which does not include ob-

fuscated samples): one for A features, another one for PAS

features and the third one for PACS features. For evaluation,

we use the CleanSuccObf dataset for the non-obfuscated

scenario, whereas for the obfuscated scenario, we use the

apps from the Renaming, JCI, CallIndirection, Reflection

and Encryption datasets. The prediction performances of

these three models are summarized in Table 7. As expected,

results are good for the tests without obfuscation, with true

positive rates over 90% and low ratios of

false positives

(under 8%). When tested with obfuscated apps, the model

TABLE 5: Feature insensitivity, i.e., the overlap between the classifications made by the ML models for original apps and their obfuscated variants using ObfuscAPK (OA), DroidChameleon (DC) and AAMO (AA).

Renaming

Permissions

Components

API functions

Opcodes

Strings

File Related

Ad-hoc

0.986

0.506

0.986

0.950

1.0

0.027

0.624

1.0

0.992

0.964

1.0

0.805

0.813

Junk Code

Insertion

1.0

0.446

0.235

0.087

1.0

0.013

0.052

0.384

0.706

0.012

0.583

Call

Indirection

1.0

0.899

0.963

0.900

1.0

0.013

0.072

0.380

0.434

0.751

0.415

TABLE 6: Features selected for robust malware detection based on different thresholds for the A mean and feature insensitivity.

Threshold  Feature types                                     #Features
0.8        Permissions, API functions, Components, Strings   6 683
0.85       Permissions, API functions, Strings               4 683
0.9        API functions                                     2 000

TABLE 7: Performance of different robust feature combinations for ML malware detection. A stands for the model using exclusively API functions. PAS refers to the proposal using Permissions, API functions and Strings, whereas PACS uses Permissions, API functions, Components and Strings.

TPR

FPR

A mean

Non-Obfuscated

PAS

PACS

0.928

0.920

0.930

0.081

0.065

0.923

0.927

0.932

0.858

0.060

0.898

Obfuscated

PAS

PACS

0.889

0.876

0.044

0.035

0.922

0.914

trained with API function features (A) showed a perfor-

mance reduction of 3% in terms of A mean with respect to

the use of the non-obfuscated versions of the same apps,

mainly due to the effect of Reflection. The use of additional

features (PAS, PACS) seems to provide valuable information

to models: with PAS features, we can observe a 2% reduc-

tion in the number of false positives, while the ability to

correctly detect malware increases 3% with respect to the

performance numbers obtained for the model

using API

functions. The addition of Components features (PACS) did

not improve the performance with respect to PAS, obtaining

2% fewer true positives and a reduction of 1% in the number

of obfuscated goodware being misclassified. Therefore, the

best model is PAS, the one that uses Permissions, API

functions and Strings.

For comparison, we also evaluate the performance of our

RandomForest classifier with PAS features against Reveal-

Droid, a robust state-of-the-art malware detector [18], and

Drebin, a high-performing detector [36]. Both detectors use

their own sets of static analysis features and ML algorithms.

RevealDroid features include API function and package

counts, native calls extracted from binary executables and

function names resolved from reflective and dynamic code

loading calls. These account for a total of 59 072 features that

are used to train a RandomForest model to perform malware

detection. The features used by Drebin comprise declared

and requested permissions, app components, hostnames,

IPs, commands and suspicious and restricted API functions.

This totals 253 881 binary features that are used to train a

linear Support Vector Machine (SVM) goodware-malware

Reflection

0.934

1.0

0.946

0.922

1.0

0.025

0.621

0.674

1.0

0.305

0.954

1.0

0.106

0.370

Encryption

1.0

0.998

0.893

1.0

0.575

0.238

1.0

0.923

0.296

0.091

0.720

1.0

0.940

0.078

0.030

0.741

Avg. Over.

0.968

0.961

0.939

0.774

0.874

0.197

0.540

classifier.

Table 8 shows the performance of our best model

(RandomForest with PAS features) against RevealDroid and

Drebin. As can be seen, PAS outperformed both state-of-

the-art detectors for the non-obfuscated and obfuscated

scenarios. With obfuscated apps, our robust proposal achieved a 1% and a 6% higher detection rate, and a 4% and an 8% lower false positive rate, with respect to Drebin and RevealDroid, respectively. These numbers indicate that a small set of Permissions, API functions and Strings is enough to perform malware detection in Android. In contrast to RevealDroid and Drebin, which mainly rely on

Strings and API features, PAS considers a more balanced

feature vector and hence, it is more robust against different

obfuscation strategies. This experiment demonstrates that

obfuscated malware and goodware can be identified using

off-the-shelf ML algorithms and features obtained from

static analysis, even without providing these algorithms

with any information about the strategy or tool used to

obfuscate apps.

DISCUSSION AND FUTURE WORK

The experiments carried out in this paper evidence that,

as commonly assumed [54], static analysis features can be

affected by specific obfuscation techniques. On one hand,

feature persistence showed that all the feature families are

affected by at least one obfuscation technique. Among them,

the features obtained from the manifest

of applications

proved to be the most stable. Nonetheless, and contrary

to what is commonly argued [55], our experiments also

demonstrate that static analysis features can be a reliable

source of information for ML malware detection. In this

regard, we observed that some obfuscation strategies can re-

sult in additional features while leaving the original features

unaltered. For example, this is the effect of CallIndirection

in API functions, or Reflection in Strings. In most cases, the

impact of obfuscation is limited to less than 20% of all the

features derived from the samples (for example, at most 20%

of the features are affected by Call Indirection and about

15% of them are altered by Reflection). An interesting line

of research in this regard could be to analyze whether static

analysis frameworks have flaws that magnify the risk of

obfuscation. Such an analysis would help developers improve static analysis tools and help practitioners select the most reliable tool.

We also observed that the alterations caused by obfus-

cation on the features vary significantly between different

tools, mainly due to implementation particularities. However, the lack of randomization in these tools makes them

(a) Rename (b) JCI (c) Call Indirection (d) Reflection (e) Encryption (f) All techniques

Fig. 6: Relation between persistence and insensitivity to changes of the different features for each obfuscation technique and tool. Every color makes reference to a feature family, with red for Permissions, green for Components, blue for API functions, yellow for Opcodes, gray for Strings, violet for File-Related, and orange for Ad-hoc features; whereas symbols make reference to the values reported on each feature family for ObfuscAPK (circles), AAMO (triangles) and DroidChameleon (squares). The average of all tools is represented by the cross symbol.

TABLE 8: Performance of robust ML detectors based on static analysis features. PAS refers to our robust detection proposal using Permissions, API functions and Strings.

Non-Obfuscated   TPR     FPR     A mean
RevealDroid      0.856   0.117   0.869
Drebin           0.914   0.103   0.905
PAS              0.920   0.065   0.927

produce the same output for the same input value, even

for different source apps or executions of the same tool.

Such behavior is useful to hide the explicit information

provided by, for example a class name, but it is insufficient

to conceal the intrinsic information, i.e., relationships be-

tween features, such as correlations. This means that apps

that contain a similar characteristic, when obfuscated using

the same tool, will maintain relations between the obfuscated values similar to those between the original (unobfuscated)

features. Additionally, most obfuscators need to improve the

implementation of some obfuscation techniques due to the

high failure rates they present. From the point of view of

the users of these tools, these aspects pose a limitation of

Obfuscated       TPR     FPR     A mean
RevealDroid      0.832   0.12    0.856
Drebin           0.876   0.088   0.893
PAS              0.889   0.044   0.922

obfuscators. Therefore, the proposition and implementation

of better obfuscation strategies and tools for Android is a

promising research area. Such strategies should be accom-

panied with evaluations to ensure that the execution of the

applications is not broken.

The performance analysis of ML models trained with

static analysis features revealed that

some feature types

(families) typically proposed for malware detection [56],

such as file-related features, are not effective in differ-

entiating malware. This experiment, conducted on non-

obfuscated applications, identified API functions and

Strings as the most informative features for malware detection, achieving detection rates of over 90% and a low false positive rate of 8%. We analysed the impact of obfuscation-induced changes on the informativeness of

features and

showed that even small changes can have a significant im-

pact on performance. Therefore, feature persistence should

not be considered as the sole criterion for robust malware

detection. This finding challenges a common assumption

in the Android malware detection field, which is to consider

highly persistent features as robust.

By combining features that exhibited high insensitivity

to changes and presented high ML accuracy with non-

obfuscated apps, we demonstrated that ML-based malware

detection using static analysis features can be robust in the

presence of obfuscation. Remarkably, this remains true even

in scenarios where no knowledge about

the obfuscation

techniques applied to the apps is assumed, i.e., the obfus-

cated data is not taken into account during the training

process. In this scenario, our proposed robust detection

approach based on a stock classifier outperformed Reveal-

Droid, the current state-of-the-art obfuscation-resilient de-

tector, and Drebin, the best proposal for malware detection

in Android according to a recent comparison [15]. Specifically, our detector achieved 92% correct classifications,

compared to 89% and 85% for Drebin and RevealDroid,

respectively. This result shows that Android malware detec-

tors go beyond the selection of a ML algorithm. Therefore,

instead of focusing on the ML aspect,

richer and more

robust app representations that benefit from the integration

of different static analysis data should be further explored.

As a final note, we are aware that some limitations apply

to the work carried out for this paper. The main one is that

our analysis is limited to individual obfuscation strategies.

However, these strategies can be combined in order to increase the probability of circumventing detectors. Also,

the order in which these obfuscation strategies are combined

influences the results obtained (note that the information

hidden by a previous obfuscation technique becomes invis-

ible for the next obfuscation strategy). Evaluating all these

combinations would drastically increase the number of scenarios to be evaluated. The cost of the required experimentation would be prohibitive since: (1) samples would have to be

obfuscated combining strategies and tools, and (2), feature

extraction and model training would have to be performed

for the resulting obfuscated samples. Moreover, in our opinion, such an extensive analysis would also make it harder to clearly determine the impact caused by each individual obfuscation

strategy. In this regard, this work can be seen as a first step

in investigating the impact that the combination of different

obfuscation strategies can have on static analysis features.

As for future work, we plan to extend our experiments to

additional obfuscation techniques, such as packing.

CONCLUSIONS

This paper delved into the effectiveness of

static analysis

features for ML-based Android malware detection in the

presence of obfuscation. To perform this assessment, we

generated a variety of datasets by applying different ob-

fuscation strategies to apps with the help of three state-of-

the-art obfuscators. Seven families of static analysis features

were defined and evaluated throughout an extensive set of

experiments. We identified which families are more persis-

tent when obfuscation is applied, and which families are

more informative for correct Android malware detection.

Based on these findings, we proposed the use of Permis-

sions, API functions and Strings for ML-based malware

detection. A stock implementation of the RandomForest

classification algorithm using these robust features was used

to generate a ML model

able to separate malware from

goodware with a remarkable success rate, without any prior

knowledge of the specific obfuscation techniques applied

to apps. In particular, this detector correctly identified

89% of evasion attempts with a low false positive rate of

4%, outperforming the current state-of-the-art solution for

obfuscation-resistant Android malware detection.

ACKNOWLEDGEMENTS

This work has received support from the following pro-

grams: PID2019-104966GB-I00AEI (Spanish Ministry of Sci-

ence and Innovation), IT-1504-22 (Basque Government), KK-

2022/00106 (Elkartek project supported by the Basque Gov-

ernment). Borja Molina-Coronado holds a predoctoral grant

(ref. PRE 2021 2 0230) by the Basque Government.

R EFERENCES

[1]

Statista, “Mobile operating systems’ market share world-

wide from january 2012 to january 2021,” 2021, [Online]

Available: https://www.statista.com/statistics/272698/

global-market-share-held-by-mobile-operating-systems-since-2009/.

[2] K. Labs, “Mobile malware evolution 2020,” Mar

2021,

[Online]

Available:

https://securelist.com/

mobile-malware-evolution-2020/101029/.

[3] T. Hastie, R. Tibshirani, and J. Friedman, The elements

of statistical learning: data mining, inference, and prediction.

Springer Science & Business Media, 2009.

[4] B. Molina-Coronado, U. Mori, A. Mendiburu, and

J. Miguel-Alonso, “Survey of network intrusion detection

methods from the perspective of the knowledge discovery

in databases process,” IEEE Transactions on Network and

Service Management, vol. 17, no. 4, pp. 2451–2479, 2020.

[5] A. Sadeghi, H. Bagheri, J. Garcia, and S. Malek, “A tax-

onomy and qualitative comparison of program analysis

techniques for security assessment of android software,”

IEEE Transactions on Software Engineering, vol. 43, no. 6, pp.

492–530, 2016.

[6] W. Wang, M. Zhao, Z. Gao, G. Xu, H. Xian, Y. Li, and

X. Zhang, “Constructing features for detecting android

malicious applications: issues, taxonomy and directions,”

IEEE access, vol. 7, pp. 67 602–67 631, 2019.

[7] L. Li, T. F. Bissyandé, M. Papadakis, S. Rasthofer, A. Bar-

tel, D. Octeau, J. Klein, and L. Traon, “Static analysis of

android apps: A systematic literature review,” Information

and Software Technology, vol. 88, pp. 67–95, 2017.

[8] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cav-

allaro, “The evolution of android malware and android

analysis techniques,” ACM Computing Surveys

(CSUR),

vol. 49, no. 4, pp. 1–41, 2017.

[9] V. Sihag, M. Vardhan, and P. Singh, “A survey of android

application and malware hardening,” Computer

Science

Review, vol. 39, p. 100365, 2021.

[10] C. S. Collberg and C. Thomborson, “Watermarking,

tamper-proofing, and obfuscation-tools for software pro-

tection,” IEEE Transactions on software engineering, vol. 28,

no. 8, pp. 735–746, 2002.

[11] S. Dong, M. Li, W. Diao, X. Liu, J. Liu, Z. Li, F. Xu, K. Chen, X. Wang, and K. Zhang, “Understanding android obfuscation techniques: A large-scale investigation in the wild,” in International conference on security and privacy in communication systems. Springer, 2018, pp. 172–192.

[12] H. Aghakhani, F. Gritti, F. Mecca, M. Lindorfer, S. Ortolani, D. Balzarotti, G. Vigna, and C. Kruegel, “When malware is packin’ heat; limits of machine learning classifiers based on static analysis features,” in Network and Distributed Systems Security (NDSS) Symposium 2020, 2020.

[13] A. Ruggia, E. Losiouk, L. Verderame, M. Conti, and A. Merlo, “Repack me if you can: An anti-repackaging solution based on android virtualization,” in Annual Computer Security Applications Conference, 2021, pp. 970–981.

[14] M. Hammad, J. Garcia, and S. Malek, “A large-scale empirical study on the effects of code obfuscations on android apps and anti-malware products,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 421–431.

[15] B. Molina-Coronado, U. Mori, A. Mendiburu, and J. Miguel-Alonso, “Towards a fair comparison and realistic evaluation framework of android malware detectors based on static analysis and machine learning,” Computers & Security, vol. 124, p. 102996, 2023.

[16] G. Suarez-Tangil, S. K. Dash, M. Ahmadi, J. Kinder, G. Giacinto, and L. Cavallaro, “Droidsieve: Fast and accurate classification of obfuscated android malware,” in Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, 2017, pp. 309–320.

[17] A. Bacci, A. Bartoli, F. Martinelli, E. Medvet, F. Mercaldo, and C. A. Visaggio, “Impact of code obfuscation on android malware detection based on static and dynamic analysis,” in ICISSP, 2018, pp. 379–385.

[18] J. Garcia, M. Hammad, and S. Malek, “Lightweight, obfuscation-resilient detection and family identification of android malware,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 26, no. 3, pp. 1–29, 2018.

[19] V. Rastogi, Y. Chen, and X. Jiang, “Catch me if you can: Evaluating android anti-malware against transformation attacks,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 1, pp. 99–108, Jan 2014.

[20] D. Maiorca, D. Ariu, I. Corona, M. Aresu, and G. Giacinto, “Stealth attacks: An extended insight into the obfuscation effects on android malware,” Computers & Security, vol. 51, pp. 16–31, 2015.

[21] W. Y. Lee, J. Saxe, and R. Harang, “Seqdroid: Obfuscated android malware detection using stacked convolutional and recurrent neural networks,” in Deep learning applications for cyber security. Springer, 2019, pp. 197–210.

[22] J. Wu and A. Kanai, “Utilizing obfuscation information in deep learning-based android malware detection,” in 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2021, pp. 1321–1326.

[23] D. Arp, E. Quiring, F. Pendlebury, A. Warnecke, F. Pierazzi, C. Wressnegger, L. Cavallaro, and K. Rieck, “Dos and don’ts of machine learning in computer security,” in Proc. of the USENIX Security Symposium, 2022.

[24] G. Developers, “Enable multidex for apps with over 64k methods,” 2020, accessed online: October 15, 2023. [Online]. Available: https://developer.android.com/studio/build/multidex

[25] S. Ratabouil, Android NDK: beginner’s guide. Packt Publishing Ltd, 2015.

[26] I. You and K. Yim, “Malware obfuscation techniques: A brief survey,” in 2010 International conference on broadband, wireless computing, communication and applications. IEEE, 2010, pp. 297–300.

[27] X. Zhang, F. Breitinger, E. Luechinger, and S. O’Shaughnessy, “Android application forensics: A survey of obfuscation, obfuscation detection and deobfuscation techniques and their impact on investigations,” Forensic Science International: Digital Investigation, vol. 39, p. 301285, 2021.

[28] Google, “Dalvik executable format,” 2020, accessed online: October 15, 2023. [Online]. Available: https://source.android.com/docs/core/runtime/dex-format

[29] S. Aonzo, G. C. Georgiu, L. Verderame, and A. Merlo, “Obfuscapk: An open-source black-box obfuscation tool for android apps,” SoftwareX, vol. 11, p. 100403, 2020.

[30] V. Rastogi, Y. Chen, and X. Jiang, “Droidchameleon: evaluating android anti-malware against transformation attacks,” in Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, 2013, pp. 329–334.

[31] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon, “Androzoo: Collecting millions of android apps for the research community,” in 2016 IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR). IEEE, 2016, pp. 468–471.

[32] S. Zhu, J. Shi, L. Yang, B. Qin, Z. Zhang, L. Song, and G. Wang, “Measuring and modeling the label dynamics of online anti-malware engines,” in 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, Aug. 2020, pp. 2361–2378.

[33] M. D. Preda and F. Maggi, “Testing android malware detectors against code obfuscation: a systematization of knowledge and unified methodology,” Journal of Computer Virology and Hacking Techniques, vol. 13, no. 3, pp. 209–232, 2017.

[34] A. Desnos, G. Gueguen, and S. Bachmann, “Androguard,” 2018, [Online]. Available: https://androguard.readthedocs.io/en/latest/

[35] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel, “Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps,” ACM SIGPLAN Notices, vol. 49, no. 6, pp. 259–269, 2014.

[36] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens, “Drebin: Effective and explainable detection of android malware in your pocket,” in NDSS, vol. 14, 2014, pp. 23–26.

[37] W. Wang, X. Wang, D. Feng, J. Liu, Z. Han, and X. Zhang, “Exploring permission-induced risk in android applications for malicious application detection,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 11, pp. 1869–1882, 2014.

[38] A. Feizollah, N. B. Anuar, R. Salleh, G. Suarez-Tangil, and S. Furnell, “Androdialysis: Analysis of android intent effectiveness in malware detection,” Computers & Security, vol. 65, pp. 121–134, 2017.

[39] H.-J. Zhu, Z.-H. You, Z.-X. Zhu, W.-L. Shi, X. Chen, and L. Cheng, “Droiddet: effective and robust detection of android malware using static analysis along with rotation forest model,” Neurocomputing, vol. 272, pp. 638–646, 2018.

[40] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, “Droidmat: Android malware detection through manifest and api calls tracing,” in 2012 Seventh Asia Joint Conference on Information Security. IEEE, 2012, pp. 62–69.

[41] K. Xu, Y. Li, and R. H. Deng, “Iccdetector: Icc-based malware detection on android,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 6, pp. 1252–1264, 2016.

[42] J. Koli, “Randroid: Android malware detection using random machine learning classifiers,” in 2018 Technologies for Smart-City Energy Security and Power (ICSESP). IEEE, 2018, pp. 1–6.

[43] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A multimodal deep learning method for android malware detection using various features,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 3, pp. 773–788, 2018.

[44] Q. Jerome, K. Allix, R. State, and T. Engel, “Using opcode-sequences to detect malicious android applications,” in 2014 IEEE international conference on communications (ICC). IEEE, 2014, pp. 914–919.

[45] G. Canfora, A. De Lorenzo, E. Medvet, F. Mercaldo, and C. A. Visaggio, “Effectiveness of opcode ngrams for detection of multi family android malware,” in 2015 10th International Conference on Availability, Reliability and Security. IEEE, 2015, pp. 333–340.

[46] B. Kang, S. Y. Yerima, K. McLaughlin, and S. Sezer, “N-opcode analysis for android malware classification and categorization,” in 2016 International conference on cyber security and protection of digital services (cyber security). IEEE, 2016, pp. 1–7.

[47] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé et al., “Deep android malware detection,” in Proceedings of the seventh ACM on conference on data and application security and privacy, 2017, pp. 301–308.

[48] X. Wang, W. Wang, Y. He, J. Liu, Z. Han, and X. Zhang, “Characterizing android apps’ behavior for effective detection of malapps at large scale,” Future generation computer systems, vol. 75, pp. 30–45, 2017.

[49] S. Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik, “A new android malware detection approach using bayesian classification,” in 2013 IEEE 27th international conference on advanced information networking and applications (AINA). IEEE, 2013, pp. 121–128.

[50] X. Zhang and Z. Jin, “A new semantics-based android malware detection,” in 2016 2nd IEEE International Conference on Computer and Communications (ICCC). IEEE, 2016, pp. 1412–1416.

[51] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “Riskranker: scalable and accurate zero-day android malware detection,” in Proceedings of the 10th international conference on Mobile systems, applications, and services, 2012, pp. 281–294.

[52] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.

[53] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[54] Y. Ye, T. Li, D. Adjeroh, and S. S. Iyengar, “A survey on malware detection using data mining techniques,” ACM Computing Surveys (CSUR), vol. 50, no. 3, pp. 1–40, 2017.

[55] K. Bakour, H. M. Ünver, and R. Ghanem, “The android malware static analysis: techniques, limitations, and open challenges,” in 2018 3rd International Conference on Computer Science and Engineering (UBMK). IEEE, 2018, pp. 586–593.

[56] Y. Pan, X. Ge, C. Fang, and Y. Fan, “A systematic literature review of android malware detection using static analysis,” IEEE Access, vol. 8, pp. 116 363–116 379, 2020.

Borja Molina-Coronado received his B.Sc. in Computer Engineering and his M.Sc. in Computer Science from the Technical University of Valencia and the University of the Basque Country UPV/EHU, in 2015 and 2017, respectively. He is a Ph.D. student in the Dept. of Computer Architecture and Technology of the UPV/EHU. His main research interests include malware analysis, network security, and machine learning.

Antonio Ruggia has been a Ph.D. student in Security, Risk, and Vulnerability at the University of Genoa since November 2020. He is interested in several security topics, including mobile security, with a specific interest in Android, malware, and data protection. He graduated in October 2020 from the University of Genoa and participated in the 2019 CyberChallenge.it, an Italian practical competition for students in cybersecurity. Since 2018, he has worked as a full-stack developer in a multinational corporation.

Usue Mori received her M.Sc. degree in Mathematics and her Ph.D. in Computer Science from the University of the Basque Country UPV/EHU, Spain, in 2010 and 2015, respectively. Since 2019, she has been working as a lecturer in the Dept. of Computer Science and Artificial Intelligence of the University of the Basque Country UPV/EHU. Her main research interests include clustering and classification of time series.

Alessio Merlo (Senior Member, IEEE) received the Ph.D. degree in computer science from the University of Genoa in 2010. He is currently a Full Professor in computer engineering with the Centre for Higher Defence Studies (CASD), Rome, Italy. He has published more than 120 scientific papers in international conferences and journals. His research interests include system security and mobile security, where he contributed to discovering several high-risk vulnerabilities in both applications and the Android OS.

Alexander Mendiburu is a full professor at the Dept. of Computer Architecture and Technology of the University of the Basque Country UPV/EHU, where he has been working since 1999. He received his B.Sc. degree in Computer Science and his Ph.D. degree from the University of the Basque Country, Spain, in 1995 and 2006, respectively. His main research areas are evolutionary computation, time series, probabilistic graphical models, and parallel computing.

Jose Miguel-Alonso is a full professor at the Dept. of Computer Architecture and Technology of the University of the Basque Country UPV/EHU. Formerly, he was a Visiting Assistant Professor at Purdue University. He received his M.Sc. in Computer Science in 1989 and his Ph.D. in Computer Science in 1996, both from the UPV/EHU. He carries out research related to parallel and distributed systems, in areas such as network security, software security, performance modeling, and resource management in large-scale computing systems. Prof. Miguel-Alonso is a member of the IEEE Computer Society and of the HiPEAC European Network of Excellence.