Detection is not a classification: reviewing machine learning techniques for cybersecurity specifics
Alexander Chistyakov, Research-Developer, Kaspersky
While more and more security vendors are starting to use machine learning (ML) models for malware detection, the basic pipeline for constructing these detectors usually looks the same: collect a dataset of benign and malicious samples, train a binary classifier to predict the correct label, and treat a positive prediction of the model as a detection of new malware. However, this approach ignores one important and natural property: malicious code cannot become clean through the injection of new functionality. As a result, an intruder can often avoid detection simply by adding an obfuscated or clean-looking payload to the malware sample.
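One way to state this property formally is as a monotonicity constraint on the detector. The notation below is an assumption made for illustration and does not come from the talk: samples are taken to be binary feature vectors in which added functionality can only switch features on, and f is the model's maliciousness score.

```latex
f : \{0,1\}^d \to [0,1], \qquad
x \le x' \ \text{(componentwise)} \;\Longrightarrow\; f(x) \le f(x')
```

Under this constraint, appending a payload to a sample can leave its score unchanged or raise it, but can never lower it.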
In this talk we will show how to construct an ML detection model that is provably secure against such attacks, even after the model has been fully reverse engineered. Using the real-time malicious activity detection problem as an example, we will review the classical step-by-step pipeline for designing, training and using an ML classifier, and explain how to adapt it to the specifics of the malware detection problem.
We will explain how to transform almost any applicable ML architecture (deep neural networks, tree-based ensembles, kernel SVMs, etc.) to make your static or dynamic malware detection model more secure; how to update the model’s decision boundary without complete retraining; and how to explore the causes of a detection alert using the transformed architecture.
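The abstract does not disclose the transformation itself, so the following is only a minimal sketch of the general idea on the simplest possible model, not the speaker's method. It assumes binary behaviour features that can only be added to a sample, never removed, and trains a logistic regression whose weights are projected onto non-negative values after every gradient step. The feature layout, the toy data and the function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, w, b):
    """Maliciousness score in (0, 1) for a binary feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_monotone(X, y, lr=0.5, epochs=500):
    """Logistic regression with the weights clipped to w >= 0 after each step.

    With non-negative weights, switching any feature on can never lower the
    score, so the model is monotone in its inputs by construction.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [score(x, w, b) - t for x, t in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(d)]
        grad_b = sum(errs) / n
        # Projected gradient step: negative weights are clipped to zero.
        w = [max(wj - lr * gj, 0.0) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data. Columns: [suspicious behaviour, neutral, clean-looking]
X = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
y = [1, 1, 0, 0, 0, 0]  # 1 = malicious, 0 = benign

w, b = train_monotone(X, y)

malicious = [1, 0, 0]
padded = [1, 1, 1]  # the same sample with clean-looking features added

print("weights:", w)
print("score(malicious):", score(malicious, w, b))
print("score(padded):   ", score(padded, w, b))
```

An unconstrained classifier trained on the same data would be free to assign a negative weight to the clean-looking feature, which is exactly what a padding attack exploits. Here the padded sample can never score lower than the original. The price of the constraint is that evidence of benign behaviour can no longer reduce the score.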
Alexander’s career in data science started in 2013, when he joined the Bayesian Methods Research Group. His main research focused on developing probabilistic models for predicting the results of sports tournaments. Alexander graduated with distinction from Moscow State University in 2015 and from the Yandex School of Data Analysis in 2017.