Recently, malware classification and detection have become very competitive among researchers4, each one uses a method to prove the effectiveness of its results.
We often see the use of machine learning techniques with their different algorithms and especially neural networks because they are able to analyze in depth the ransomware behavior5. Among the proposed ransomware classification methods, the authors6 suggested an approach using machine learning algorithms which have been used for binary classification of ransomware using static analysis of opcodes transformed into n-gram, they achieved an accuracy of 91.43%.
Classification of malware using artificial neural networks (ANN)
Rad et al.7 have collected different types of malware from a malicious repository shared by Naiyarah Hussain8. The authors used as many samples as possible (both benign and malicious), the samples are placed as follows:
Malicious samples are placed in the “malicious_samples” folder and the benign samples are placed in the “benign_samples” folder. With this way, they can dispense with manually labeling each obtained file. To ensure that the samples obtained are usable and properly labelled (malicious or benign), they proposed criteria the following criteria:
An executable file (PE for Portable Executable) is valid by checking the first 2 bytes of the binary file if it contains “MZ”.
Exclude packaged malware from the dataset so as not to damage the classification model by verifying the signature of each piece of software to see if it is packaged by a packer (such as UPX) or not.
The collected samples are compared with anti-viruses. If the malicious sample is detected (even at a low rate), it will be removed. For the benign sample, if the anti-virus has not detected anything (with a rate of 0%), the sample will be in the list of benign samples.
The authors required the relevance and effectiveness of the aforementioned criteria in order to remove any sample that might affect the model that they want to realize. They proposed a neural network which does a binary classification of an invisible PE (Portable Executable) file as a begnin or malicious, developed with the Python library « TensorFlow-GPU». In addition, they used «scikit-learn» to do the cross-validation method. The dataset contained 4000 samples (1000 benign and 3000 malicious), each class is represented by vector indicating (0, 1) if the sample is malicious and (1, 0) if it is benign. The input size (Windows library function calls) was 34087 after the pre-processing of the dataset. They validated the model using the cross-validation method, they have assumed that the model can perform on an unknown dataset, and will operate based on the results that have been achieved. With this model, they obtained an average accuracy of 97.8%, with 97.6% of precision and 96.6% of recall.
Malware comes in several types; our goal is to focus on ransomware. The paper9 presents our previous work, which was a comparative study of a proposed ransomware dataset on work10 with a new proposed dataset. We changed the learning files but we did not have a test dataset, so we took 20% of the learning database file to build our test database. The results were high compared to those in paper10 but the problem was that the 20% of files taken, was not removed from the learning database, we copied the test part in question to another file. We obtained approximately a rate of 70% for correctly classified instances, but they were not as surprising and relevant as expected. We also compared the two algorithms Artificial Neural Network (ANN) and Bayesian Network (BN), to deduce the one that gives us the best results in our work, and we got the right results with the ANN.
Classification of malware using convolutional neural networks (CNN)
Many researchers use CNN to classify and detect malware. Kabanga et al.11 proposed a model of convolutional neural networks to extract features from images at a time, these images extracted from a sample of malware will be classified. The model has images of a size 128*128*1 (1 being the channel width), they used the image library from the PIL (Python Image Library) package of Python to generate vectors of images and after the generation of the vectors, processing was performed on these vectors. Then, they designed a three-layer deep convolutional neural network to do the classification task. The output layer has 25 neurons that correspond to the 25 families of malware available in the dataset used. The goal of this neural network is to have a single class containing the malware. The Malimg dataset12 that was used, has 25 malware families (with several grayscale images), which 90% are used for training and 10% for testing. They obtained in the results an accuracy of 98%. Using this technique, they have proven that it is the best compared to other traditional methods of classification. However, using images to classify malware can lead to erroneous results due to poor image extraction.
Classification of malware using recurrent neural networks (RNN)
Many researchers have used RNN to do speech recognition and automatic natural language processing, and they have had pertinent results, the question we ask is as follows: can we use recurrent neural networks to classify and/or detect malware?
Pascanu et al.13 attempt to learn the language of malware in order to detect it by constructing a recurrent model to predict the next API call using the hidden state of the model (which encodes the history of past events) as a fixed-length vector that is given to a classifier (by the logistic regression or Multi-layer Perceptron “MLP” algorithm). And to improve their recurrent model, the authors introduce Max-Pooling on the values of the hidden units in the input sequence, and they propose the Half-Frame model that increases the memory capacity of the final representation by including the mid-state information of the sequence of events in the final state. The authors suppose that the memory window of the standard ESN (Echo State Network) and RNN models is not adequate, to solve that, a Leaky-Units architecture14 has been added to the recurrent architectures, which allows the increase of the system memory in the long term.
In paper15, The authors added a bidirectional mode that combines two distinct models, one learning by treating events in the forward direction and the other learning by processing the events in the reverse direction.
After the weights have been learned for the RNN or generated for the ESN, the next processing step is to create a fixed-length representation for the event stream provided at the classification input and also to determine the detection time of a malicious or non-malicious file. In order to do this, they decided to eliminate sequences shorter than 15 and explore the values 50, 100, 200, and 65,536 of sequences longer than N steps, which is considered a hyper-parameter. Only the first N events was selected.
They used both logistic regression and multi-layer perceptrons with rectifier units16 to classify fixed-length projections and also Dropouts which showed a significant improvement in the generalization of the MLP model17.
RNN and ESN are formed independently from the classifier. Thus, they act as feature extractors trained in an unsupervised manner. The results show that the combination of a recurrent model with a standard classifier can improve the classification of malware, and on the other hand, ESN models outperform RNN models in the majority of experiments, but the use of recurrent neural networks is a bit complicated. More, we cannot assume that this is the right method to do malware classification and detection because of the need to learn the language of malware, which is a bit difficult considering that malware varies.
Ransomware detection systems can improve their performance by having the ability to capture recurrent (or repetitive) behavior and general sequence learning. That’s why Agrawal et al.18 worked on the attention mechanisms while processing executable sequences for ransomware detection. They proposed an implementation of an improved neural cell to integrate the attention mechanism in learning named ARI (Attended Recent Inputs); this ARI is used by LSTM (Long Short-Term Memory) networks, named ARI-LSTM. This cell allows to perform a detailed analysis of ransomware executable, i.e. it processes input sequences by taking the attention weights for each recent input. In the learning phase, they took the LSTM model (improved by incorporating the ARI cell) and Max Pooling (LaMP). The goal of the learning is to classify the input sequences by labeling the ransomware by “1” and the benign by “0”. They used a set of sequence data containing benign and ransomware executable of a Windows operating system captured from user’s computers. The results show that the accuracy of the ARI-LSTM is better than an LSTM, proving the effectiveness of attention mechanisms in the learning phase.
Other than malware detection, the types of neural networks studied in this article have also been successfully applying in other fields, such as:
Analyzing user’s check-in to predict the locations that they may visit using RNN19.
In order to control greenhouse climate, the authors20 propose a model for predicting greenhouse climate by focusing on the climatic factors: crop growth, temperature, humidity, lighting, carbon dioxide concentration, and soil temperature and moisture, using LSTM (Long Short‐Term Memory) to capture the reliance between historical climate data.
Combining the attention mechanism with bidirectional gated recurrent units (GRU) to increase the prediction of the point of interest (POI) category, in order to mitigate the scarcity of users’ check-in data during the Coronavirus (COVID-19) period21.