Machine Learning-based Malware Detection

Chao Ye, Chief Researcher, Rising

In the face of the explosive growth of malware and the ever-growing complexity of anti-detection technology, effectively detecting the latest malware has become a persistent problem for the security industry. Ransomware continues to grow rapidly, and zero-day vulnerabilities in MS Office, PDF, and Flash are frequently used in APT attacks. Traditional signature-based, passive-response approaches can no longer meet current requirements for security protection. This has led security vendors to develop more intelligent and more effective static detection and evaluation technologies.

These days machine learning is widely used in zero-day malware detection and is known as “Next Generation” anti-malware technology. Over the past four years, we have studied and practiced several applications of machine learning in malware detection, including SVM, k-NN, Decision Tree, and Random Forest. Based on this experience, in 2016 we completed RDM+, a heuristic engine based on machine learning (for Windows PE files only). It provides a cloud-based malware detection service. Its core is a two-layer Random Forest model with 4,778-dimensional input features, trained on more than 100 million malware and clean files. The model learns and updates every few hours.

This report will briefly describe how we have used machine learning to improve malware detection over the past four years. The implementation and operation of RDM+ will then be described in detail, including: how to choose features that conform to the malware evolution trend; how to choose the learning algorithm; how to combine models; how to use the two-layer model to shorten the training period; how to mitigate false positives; and how to build a long-term automated operation program.
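One of the topics listed above, mitigating false positives, can be illustrated with a generic technique. The abstract does not say how RDM+ handles false positives, so the following is only a minimal sketch of one common approach and not the RDM+ method: calibrating the decision threshold on scores from known-clean files so that the false-positive rate stays under a chosen budget. The function names `pick_threshold` and `false_positive_rate`, the sample scores, and the 20% budget are all hypothetical and chosen for illustration.

```python
import math


def false_positive_rate(clean_scores, threshold):
    """Fraction of known-clean files that would be flagged (score > threshold)."""
    flagged = sum(1 for score in clean_scores if score > threshold)
    return flagged / len(clean_scores)


def pick_threshold(clean_scores, max_fp_rate):
    """Pick a threshold such that at most max_fp_rate of the clean files
    score strictly above it. A file is flagged when score > threshold.

    Assumption: clean_scores are malware-probability scores produced by some
    classifier on a held-out set of files known to be clean.
    """
    if not clean_scores:
        raise ValueError("need at least one clean score")
    ordered = sorted(clean_scores, reverse=True)
    # Number of clean files we are allowed to misclassify.
    allowed = int(math.floor(max_fp_rate * len(ordered)))
    if allowed >= len(ordered):
        # The budget permits flagging every clean file.
        return float("-inf")
    # Only scores strictly above ordered[allowed] are flagged, and at most
    # `allowed` clean scores can be strictly greater than that value.
    return ordered[allowed]


# Hypothetical scores for ten clean files.
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.80, 0.95]
threshold = pick_threshold(scores, max_fp_rate=0.2)
observed = false_positive_rate(scores, threshold)
print(f"threshold={threshold}, false-positive rate={observed}")
```

Raising the threshold trades detection rate for fewer false alarms, so in practice the budget would be chosen together with a check of how many malware samples still score above the threshold.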

In addition, the report will introduce machine learning-based malware detection for non-PE files, focusing on feature extraction, and will show the detection effectiveness of the model.

Chao Ye

Has more than ten years of research experience in countering cyber threats. Currently works in Rising’s R&D center, mainly responsible for cyber threat analysis; static and dynamic malware identification technology; cyber threat big data integration, analysis, mining, and visualization; and the design of next-generation intelligent security solutions. Has designed and developed an anti-malware engine, a threat intelligence analysis platform, intelligent malware identification technology, and more.