Automating Behavior-based Ransomware Analysis, Detection, and Classification Using Machine Learning
Ransomware is malware that hijacks a victim's data using encryption and demands a ransom in exchange for the decryption key. It has gained prominence because of its attack vector and the irreversible damage it causes to data. Ransomware has indiscriminately attacked individuals and organizations worldwide, disrupting their businesses and services. The number of successful attacks across the globe highlights the inadequacy of existing ransomware defenses.
Static and dynamic analysis are two popular approaches to malware analysis. The former does not require execution of the malware binary, whereas the latter requires executing the binary in a controlled environment. Static analysis-based detection, e.g., signature matching, is widely adopted by commercial antivirus solutions but can be thwarted by evasion techniques used in modern malware, e.g., polymorphism and code obfuscation. Consequently, dynamic analysis-based, or behavior-based, detection approaches have gained popularity because malware behavior cannot be changed entirely across its variants. Signature-based and behavior-based detection complement each other.
Behavior-based ransomware detection comes with its own challenges. One is high data dimensionality, which arises because a process may execute thousands of API calls per second. Manually inspecting these API calls for feature engineering requires an expert and is time-intensive. Another problem with some existing ransomware detection models is their reliance on handcrafted malice scoring functions, which assign each process a score describing its threat level.
Other challenges to ransomware detection research include the limited availability of ransomware data sets that can be used with Machine Learning (ML) methods and the limited scope for reusing them. Reuse is limited by the format of the available data sets, e.g., sequential data may be used with recurrent neural networks but not with commonly used ML-based classification algorithms, and by their focus, e.g., network activity or filesystem activity. For the above-mentioned reasons, ransomware detection research is generally followed by ransomware analysis. However, to the best of our knowledge, few of the existing ransomware behavior analysis studies discuss the challenges involved in the process.
This thesis aims to automate solutions to problems in ransomware behavior detection and classification using evolutionary computation methods, i.e., particle swarm optimization and genetic programming, and deep neural networks, i.e., long short-term memory networks.
This thesis proposes a wrapper feature selection method to address the high dimensionality in ransomware behavior data. The proposed method utilizes particle swarm optimization to automatically select a suitable number of features from each feature group and therefore does not require expert input. This thesis further proposes an automated method of evolving malice scoring models for ransomware detection. The proposed method formulates the problem as a symbolic regression problem and solves it using genetic programming. Unlike existing methods, the proposed method does not require expert knowledge to design the model.
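To make the idea of wrapper feature selection with particle swarm optimization concrete, the following is a minimal sketch of a generic binary PSO, not the method proposed in this thesis. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier used as the wrapped learner, the size penalty, and the PSO constants. In particular, it selects from a single flat feature set and does not model the per-group selection described above.

```python
import math
import random


def make_data(rng, n=60, n_features=8):
    """Synthetic data (assumption): only features 0 and 1 carry class signal."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 3.0 * label
        row[1] -= 3.0 * label
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X, y, mask):
    """Wrapper evaluation: hold-out accuracy of a nearest-centroid classifier
    trained on the first half of the rows, using only the selected features."""
    idx = [j for j, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    half = len(X) // 2
    centroids = {}
    for label in (0, 1):
        rows = [X[i] for i in range(half) if y[i] == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in idx]
    correct = 0
    for i in range(half, len(X)):
        dist = {
            label: sum((X[i][j] - c[k]) ** 2 for k, j in enumerate(idx))
            for label, c in centroids.items()
        }
        if min(dist, key=dist.get) == y[i]:
            correct += 1
    return correct / (len(X) - half)


def binary_pso(X, y, n_particles=10, n_iters=30, seed=0, penalty=0.05):
    """Binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    d = len(X[0])

    def fitness(mask):
        # Accuracy minus a small penalty for selecting many features.
        return centroid_accuracy(X, y, mask) - penalty * sum(mask) / d

    pos = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles)]
    vel = [[0.0] * d for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(d):
                r1, r2 = rng.random(), rng.random()
                v = (0.7 * vel[i][j]
                     + 1.5 * r1 * (pbest[i][j] - pos[i][j])
                     + 1.5 * r2 * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-4.0, min(4.0, v))
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


if __name__ == "__main__":
    X, y = make_data(random.Random(1))
    mask, fit = binary_pso(X, y)
    print("selected features:", [j for j, b in enumerate(mask) if b])
    print("fitness:", round(fit, 3))
```

The point of the sketch is that the number of selected features is not fixed in advance: it emerges from the search, which is what removes the need for an expert to choose it. Reusing one hold-out split inside the fitness function keeps the example short; a real evaluation would use cross-validation and a separate test set.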
Furthermore, this thesis proposes an automated behavior analysis framework that highlights the challenges associated with ransomware behavior analysis and offers solutions to them. Finally, this thesis proposes a new representation of API call sequences that combines the API call names with important call arguments. The proposed representation improved early ransomware detection performance. All the methods proposed in this thesis either automate existing manual solutions or achieve performance comparable to or better than that of existing methods.
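As an illustration of what combining API call names with call arguments can look like, the sketch below turns a trace of calls into tokens and then into integer ids of the kind a sequence model such as an LSTM consumes. It is not the representation proposed in this thesis. The trace format, the `IMPORTANT_ARGS` table, the argument names `filepath` and `regkey`, and the normalization rules are all hypothetical choices made for the example.

```python
from pathlib import PureWindowsPath

# Hypothetical mapping from API call name to the arguments treated as important.
IMPORTANT_ARGS = {
    "NtCreateFile": ("filepath",),
    "NtWriteFile": ("filepath",),
    "RegSetValueExA": ("regkey",),
}


def normalize(arg_name, value):
    """Reduce a raw argument value to a coarse, reusable token part."""
    if arg_name == "filepath":
        # Keep only the lower-cased file extension, not the full path.
        suffix = PureWindowsPath(value).suffix.lower()
        return f"ext={suffix or 'none'}"
    if arg_name == "regkey":
        # Keep only the registry hive (the first path component).
        return "hive=" + value.split("\\", 1)[0].upper()
    return f"{arg_name}={value}"


def to_tokens(trace):
    """Build one token per call: the API name plus its normalized arguments."""
    tokens = []
    for call in trace:
        name = call["api"]
        parts = [name]
        for arg in IMPORTANT_ARGS.get(name, ()):
            if arg in call.get("args", {}):
                parts.append(normalize(arg, call["args"][arg]))
        tokens.append("|".join(parts))
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary; id 0 is left for padding."""
    return [vocab.setdefault(t, len(vocab) + 1) for t in tokens]


if __name__ == "__main__":
    trace = [
        {"api": "NtCreateFile", "args": {"filepath": "C:\\Users\\a\\report.DOCX"}},
        {"api": "NtWriteFile", "args": {"filepath": "C:\\Users\\a\\report.docx.locked"}},
        {"api": "CryptEncrypt", "args": {"length": 4096}},
        {"api": "NtCreateFile", "args": {"filepath": "C:\\Users\\a\\notes.docx"}},
    ]
    vocab = {}
    tokens = to_tokens(trace)
    print(tokens)
    print(encode(tokens, vocab))
```

With names alone, the two `NtCreateFile` calls and a create call on any other file type would be indistinguishable. Attaching a normalized argument lets the model separate, for example, writes to a `.locked` file from writes to an ordinary document, while the normalization keeps the vocabulary small.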