Practical Workflow of Human-Aided and Automated Machine Learning-based Android Malware Detection

Daisuke Nakajima, Mobile Malware Researcher, McAfee

The rapid increase in Android malware threats is making detection by anti-malware products harder and harder, especially when the detection technology is old-fashioned and relies on matching unique identifiers and/or signatures, an approach that cannot keep up with the volume and complexity of recent malware.

One promising newer detection methodology is Machine Learning (ML)-based malware detection, which can provide more generic, wider-coverage detection by learning from the structures and behaviors of an already-known set of malicious samples.

Despite its high detection rate, one of the biggest concerns with ML-based malware detection is a potentially higher false positive rate, which can easily destroy the usability of, and trust in, anti-malware products, especially in consumer markets where tens or hundreds of millions of users run apps every day.

To achieve an ideal malware detection capability, with a high detection rate against both known and unknown malware and a nearly zero false positive rate, we implemented a sophisticated workflow that creates and publishes malware detection ML models, applying pre- and post-processing filters informed by human malware researchers’ domain expertise. The generated ML models are also resource-efficient, deployable over the air to real consumer mobile devices running our detection engine, and easily updatable whenever a quick fix is necessary or detection coverage is improved.

In the presentation, I introduce this realistic approach to ML and a hybrid detection solution that effectively combines ML with traditional signature-based and online reputation-based methods. I also show evaluation results on real-world samples, including a case study of actual malware that ML detected in the wild on public application stores in Asia and that conventional methods could not detect.
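As a rough illustration of the hybrid idea, the sketch below layers three checks: an exact signature match, a reputation lookup that suppresses ML verdicts for apps already known to be clean, and per-family ML scoring for everything else. The ordering of the layers, the threshold, and every name in it (`hybrid_verdict`, `Verdict`, the hash strings, the `FakeBank` scorer) are assumptions made for illustration; it is not the detection engine described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass(frozen=True)
class Verdict:
    malicious: bool
    source: str                   # which layer produced the verdict
    family: Optional[str] = None  # malware family, when known


def hybrid_verdict(
    apk_hash: str,
    features: Set[str],
    signatures: Dict[str, str],                          # hash -> family (hypothetical store)
    trusted_hashes: Set[str],                            # reputation allowlist (hypothetical)
    family_scorers: Dict[str, Callable[[Set[str]], float]],
    threshold: float = 1.0,                              # assumed conservative cut-off
) -> Verdict:
    # Layer 1: traditional signature match on a known-bad sample.
    if apk_hash in signatures:
        return Verdict(True, "signature", signatures[apk_hash])
    # Layer 2: reputation. A known-clean app is never handed to the ML layer,
    # which keeps ML false positives away from popular apps.
    if apk_hash in trusted_hashes:
        return Verdict(False, "reputation")
    # Layer 3: per-family ML models; report the highest score at or above the threshold.
    best_family, best_score = None, threshold
    for family, scorer in family_scorers.items():
        score = scorer(features)
        if score >= best_score:
            best_family, best_score = family, score
    if best_family is not None:
        return Verdict(True, "ml", best_family)
    return Verdict(False, "none")


# Toy data: all hashes, families and the scorer below are made up.
SIGNATURES = {"hash-known-bad": "SmsSpy"}
TRUSTED = {"hash-known-good"}
SCORERS = {
    "FakeBank": lambda feats: 2.0
    if {"perm:SEND_SMS", "perm:SYSTEM_ALERT_WINDOW"} <= feats
    else -1.0
}
SUSPICIOUS = {"perm:INTERNET", "perm:SEND_SMS", "perm:SYSTEM_ALERT_WINDOW"}

print(hybrid_verdict("hash-new", SUSPICIOUS, SIGNATURES, TRUSTED, SCORERS))
```

In this arrangement the ML layer only ever sees samples that neither the signature set nor the reputation data can settle, which is one way the conventional methods can act as filters around the model.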

For this new ML workflow, we create a large number of finer-grained ML models, tunable per malware family, based on thorough static analysis of APK (Android Package) files and Linear SVM (Support Vector Machine) classifier training, to maximize accuracy by excluding noise in sample features. We also adopted conservative feature selection techniques, both knowledge-based and statistical, to minimize the false positive rate. Our workflow focuses on efficient generation and relearning of ML models, and it also greatly helps end-to-end anti-malware operation. For example, because the models are fine-grained, per-family micro ML models, we can rapidly fix any false positive that is found and deliver the fixed ML model, which is usually difficult with a single large ML model that only distinguishes malware from non-malware.
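The per-family idea can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the production workflow: features are reduced to sets of permission strings, the selection thresholds are invented, the linear SVM is trained with plain full-batch subgradient descent on the hinge loss rather than a tuned solver, and the family names, samples and helper functions (`select_features`, `train_family_model`, `score`) are all hypothetical.

```python
from collections import Counter


def select_features(malicious, benign, denylist=frozenset(),
                    min_mal_rate=0.5, max_benign_rate=0.0):
    """Keep features typical of one family and (by default) never seen in benign apps."""
    mal = Counter(f for sample in malicious for f in sample)
    ben = Counter(f for sample in benign for f in sample)
    return sorted(
        f for f, n in mal.items()
        if f not in denylist                          # knowledge-based filter
        and n / len(malicious) >= min_mal_rate        # statistical: common in the family
        and ben[f] / len(benign) <= max_benign_rate   # statistical: rare in benign apps
    )


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:  # margin violated
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def train_family_model(malicious, benign, denylist=frozenset()):
    feats = select_features(malicious, benign, denylist)
    if not feats:
        raise ValueError("no feature survived selection")
    X = [[1.0 if f in s else 0.0 for f in feats] for s in malicious + benign]
    y = [1] * len(malicious) + [-1] * len(benign)
    w, b = train_linear_svm(X, y)
    return {"features": feats, "weights": w, "bias": b}


def score(model, sample):
    """Linear decision value; positive means 'looks like this family'."""
    return model["bias"] + sum(
        w for f, w in zip(model["features"], model["weights"]) if f in sample)


# Toy data (made up): permission sets standing in for static-analysis features.
P = "perm:"
KNOWN_NOISY = frozenset({P + "INTERNET"})  # analyst knowledge: too common to be useful
BENIGN = [
    {P + "INTERNET"},
    {P + "INTERNET", P + "CAMERA"},
    {P + "INTERNET", P + "RECEIVE_BOOT_COMPLETED"},
    {P + "INTERNET", P + "CAMERA", P + "RECEIVE_BOOT_COMPLETED"},
]
BANK = {P + "INTERNET", P + "READ_SMS", P + "SEND_SMS", P + "SYSTEM_ALERT_WINDOW"}
LOCK = {P + "INTERNET", P + "BIND_DEVICE_ADMIN", P + "WAKE_LOCK"}
FAMILIES = {
    "FakeBank": [BANK, BANK | {P + "CAMERA"}, BANK | {P + "RECEIVE_BOOT_COMPLETED"}],
    "Locker": [LOCK, LOCK | {P + "CAMERA"}, LOCK | {P + "RECEIVE_BOOT_COMPLETED"}],
}

# One small model per family.
registry = {fam: train_family_model(s, BENIGN, KNOWN_NOISY) for fam, s in FAMILIES.items()}

# A benign app reported as a false positive of the FakeBank model.
fp_app = {P + "INTERNET", P + "SEND_SMS", P + "SYSTEM_ALERT_WINDOW"}
old_model, locker_before = registry["FakeBank"], registry["Locker"]
print("before fix:", old_model["features"], score(old_model, fp_app))

# Hot-fix: retrain only the affected family with the FP added to the benign set.
registry["FakeBank"] = train_family_model(FAMILIES["FakeBank"], BENIGN + [fp_app], KNOWN_NOISY)
print("after fix: ", registry["FakeBank"]["features"], score(registry["FakeBank"], fp_app))
```

The point of the toy hot-fix at the end is that only the `FakeBank` entry is retrained and replaced; the `Locker` model is untouched, so a fix for one family cannot shift the behavior of another. With a single monolithic malware-versus-clean classifier, the same correction would mean retraining and revalidating everything.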

We show that ML alone is not a silver bullet against malware, but that it can reach its full potential to protect us from ever-growing mobile malware threats such as banking trojans, spyware, click fraud, and ransomware when it is combined with human expertise accumulated continuously over many years and with insights based on big data.

Daisuke Nakajima

Daisuke Nakajima, based in Tokyo, Japan, is a Security Researcher on the Mobile Malware Research and Operations team, Mobile Security Engineering, at McAfee, following more than 10 years of experience in software research and development for applications and middleware on consumer embedded and mobile platforms at several companies. He specializes in mobile malware hunting and reverse engineering, malware detection signature development, and malware analysis report authoring; his recent focus is on architecting and developing a machine learning-based Android malware detection backend system. He holds a master’s degree in engineering science from Osaka University.