Sequence Intent Classification Using Hierarchical Attention Networks

  • March 6, 2018
  • Views



In this code story, we will discuss applications of Hierarchical Attention Neural Networks for sequence classification. In particular, we will use our work the domain of malware detection and classification as a sample application. Malware, or malicious software, refers to harmful computer programs such as viruses, ransomware, spyware, adware, and others that are usually unintentionally installed and executed. When detecting malware in a running process, a typical sequence to analyze could be a set of disk access actions that the program has taken. To analyze software without running it, we can treat series of assembly instructions in the disassembled binary as sequences to be classified in order to identify sections of the code with malicious intent. In our novel approach, we apply techniques typically used to discover structure in a narrative text to the data that describes the behavior of executables.

Hierarchical Attention Networks (HANs) have been used successfully in the domain of document classification (Yang et al. 2016, PDF). This approach is potentially applicable to other use cases that involve sequential data, such as application logs analysis (detecting particular behavior patterns), network logs analysis (detecting attacks), classification of streams, events, etc. We will demonstrate the application of HANs for malware detection both statically (executable binary analysis) and dynamically (process behavior analysis). The supporting code uses Keras with Microsoft Cognitive Toolkit (CNTK) as the backend to train the model.

This work was inspired by our collaboration with Acronis and the services that they provide to ensure backup data stays safe.


We will use two publicly available datasets to illustrate our approach and results.

The first, CSDMC 2010 API, is a sequence corpus that was also used in a data mining competition associated with ICONIP 2010. This dataset contains Windows APIs calls from malware and benign software.

For simplicity, the CSDMC 2010 dataset contains only the names of Windows APIs called by a running process. The set contains class labels for each sequence corresponding to a complete running process instance. However, we will consider the classification performance for both sequences of API calls as well as the overall trace/process.

Microsoft Malware Classification Challenge at WWW 2015/BIG 2015 dataset. It contains a collection of metadata manifests from 9 classes of malware. The metadata manifests were generated using the IDA disassembler tool. These are log files containing information extracted from the binary, such as a sequence of assembly instructions for function calls, strings declarations, memory addresses, register names, etc.

For demonstration purposes (and keeping training time to minutes on NVIDIA GeForce GTX 1060) we used hundreds of malware examples for class 1 and class 2 only. The full dataset size is 0.5 TB. However, the sample code provided in this Jupyter notebook supports multiple classes. The disassembled binary executable dumps in the Microsoft Malware Classification dataset are composed of sequences of operations and operands of unequal length, one operation per line. We will analyze the content as sequences of tuples of fixed length, considering only a number of fields extracted from each line in each experiment. For example, tuples of one field will include the operation code only, tuples of two fields will contain the operation code and the first operand, etc. Classification performance will be considered per sequence as well as for the overall executable.

What is a Hierarchical Attention Neural Network?

Yang, Zichao et al. proposed a HAN for document classification, presented at NAACL 2016. The solution leverages the hierarchical structure of a document, from words to sentences to documents, mirrored in a multi-layer recurrent neural network with special attention layers applied at the word and the sentence levels. The architecture of the HAN is shown in the figure below. The overall process is nicely described in detail in this blog post.

Our goal is to apply the natural language processing (NLP) inspired approach to a different domain by leveraging the sequential nature of their data: for example, classifying dynamic process behavior (sequence of API calls) or classifying malware binaries (sequence of assembly instructions).

Considering the dynamic process analysis scenario where the goal is to detect malware in a running process, a HAN-based solution will leverage the multi-layers of memory to understand the sequence of observed process actions. Classifying patterns of process behavior could then be viewed as classifying documents. Here, the sequence of process actions are analogous to a sequence of words in a text, and groups of actions are analogous to sentences.

Code walkthrough

The HAN code we provide in this Jupyter notebook is based on the LSTM HAN implementation found in this GitHub repo by Andreas Argyriou, which in turn was based on Richard Liao’s implementation of hierarchical attention networks and a related Google group discussion. The notebook also includes code from the Keras documentation and blog as well as this word2vec tutorial.

Our code extends the base notebook to support random and per-process stratified sampling, down/up-sampling of classes (binary classes only), learning hyper-parameter tuning, per-sequence and per-process evaluation metrics, configurable tuples and vocabulary sizes, extended data ingestion, configurable training environment, and other modifications to support our experimentation needs.

Important notes

  1. Data format: The code requires that “words” in your sequences are separated by whitespaces (tabs, spaces).
  2. Pre-processing: If your original data source (log/trace file, disassembled code) has a huge vocabulary, consider normalization during pre-processing and/or limiting the tuple length and vocabulary size by frequency while training. Huge “usually” means more than 5-20K. Watch out for things that can “explode” your vocabulary: GUIDs, hex addresses, temp file names, IP addresses.
  3. The number of words per sentence and number of sentences in a document are configurable in the code.
  4. It’s assumed that input uses token “<eos>” as an end-of-input sequence separator.
    Each input will be further broken down into sequences (documents) composed of multiple sentences as the unit of classification in the HAN training code. It is important to indicate the last sequence for a given input to avoid mixing content from different inputs. Groups (sub-sequences) of actions are treated as sentences.
  5. Word embeddings can be initialized using pre-trained vectors or left uninitialized. This behavior is defined by the parameter USE_WORD2VEC.
  6. Other configurable options define various parameters such as network dimensions, learning hyper-parameters, data sampling options, and other training environment settings.
  7. Deep learning framework: We use Keras with a CNTK backend, with GPU support.

The following is a high-level walk-through of the main parts of the code. For more details refer to the documentation included in the Jupyter notebook available here.

  1. Configure the training environment by setting the global parameters. These include:
    – MAX_SENT_LEN : the number of words in each sentence (a word in any token appearing in a process action tuple).
    – MAX_SENTS : the number of sentences in each document.
    – MAX_NB_WORDS : the size of the word encoding (number of most frequent words to keep in the vocabulary).
    – MIN_FREQ: minimum token frequency to be included in vocabulary. The final vocabulary size is the effective minimum size after considering the token dictionary, MAX_NB_WORDS and MIN_FREQ settings.
    – Training and test sampling strategy: randomized or stratified by process.
    – USE_WORD2VEC: initialize the network using pre-trained word2vec embedding weights. Note that during our experimentation, word2vec weights did not have much impact on the model performance when training for 5+ epochs. They only accelerate the learning rate in the initial epochs.
    – Set the network layer sizes (EMBEDDING_DIM, GRU_UNITS and CONTEXT_DIM).
    – Set learning hyper-parameters (REG_PARAM, LEARN_RATE, BATCH_SIZE and NUM_EPOCHS).
  2. Ingest the data files corresponding to each class. The ingestion function get_sequences() breaks the input lines into tokens, sentences, and sequences according to the configured parameters. It also sets the required label per ingested sequence in the input file.

  3. Transform the data into the 3D format required by the network.
  4. Initialize the word embedding weights (if this optional step is selected).
  5. Build the network according to the configured layer sizes and the vocabulary size. Note that a large vocabulary will increase the size of the trained model considerably. For example, model size may vary from 2-3 MB to 80+ MB with a vocabulary size of a few hundred to several thousands of unique tokens.
  6. Sample and split the data into training and test datasets.
  7. Train the HAN network and report training and validation metrics.
  8. Evaluate model performance on the test set and report the classification performance per-sequence, i.e., chunks/subsets of the process actions.
  9. Report the overall per-process performance using a simple majority label vote (i.e., assign the most frequent per-sequence label to the process and evaluate the metrics accordingly). Other per-process heuristics may be applied according to the application needs. For example, in a dynamic process analysis scenario, one may choose to abort the process as soon as a certain threshold number of malign sequences are detected.

Dynamic malware detection: API calls classification

We are now ready to use the CSDMC 2010 API sequence corpus to classify malware vs benign software based on their API call history.
For preprocessing you’ll need to save data for malware and benign sequences in separate files: malware.txt and benign.txt. The label column needs to be removed.
Next separate traces with an “<eos>” token – every row in the source file corresponds to one full trace and will be broken down into segments of configurable length, corresponding to documents or sequences.

Since the CSDMC 2010 dataset contains only the API names with no additional arguments, we only have one field per action line. We split the dataset into training and validation sets using a 90/10 stratified sampling so that all sequences belonging to a trace go to either the training or the validation set for an unbiased evaluation. This resulted in a training set containing 23,470 sequences and 3,195 validation sequences.

A sample Jupyter notebook is available here. The solution in the notebook is more generalized and supports any number of fields per API call. It is a simple adaptation of the base solution provided here. Pre-processed training datasets are provided in the “data” subfolder.

The following is a sample training and validation run using sentences of 20 tokens (API calls) each, aggregated in groups of five sentences per sequence:

The model achieves an accuracy of 83%+ after a few epochs on the 3,195 validation sequences. Note that the per-sequence ground truth labels were simply a propagation of the full process classification label, i.e., a noisy label.

Now let’s use a simple majority vote of all sequences belonging to a trace/process in order to estimate the performance metrics at the process level.

The per-process accuracy achieved is over 97% in this simple training/validation exercise, misclassifying only 1 out of 39 processes. In the following section, we present a more complex experiment using the Kaggle malware dataset provided by Microsoft.

Static malware classification: Analysis of disassembled binaries

In this code story, we unzip around 40 GB (out of 0.5 TB) from Kaggle dataset provided in the Microsoft Malware Classification Challenge (BIG 2015) dataset and pick only a couple of classes (class 1 and 2, for example, Ramnit and Lollipop)


Manifest files generated by the IDA disassembler tool provide lots of information. Our goal is to preserve “sequential” information while removing/normalizing noise.

Here are the pre-processing steps that we’ve taken:

  1. Remove all comments.
  2. Extract only “.text” section of the log file – the one that has assembly instructions.
  3. Remove all hexadecimal and decimal values, and commas.
  4. Remove “offset” keyword as it is used too often and becomes a “stop word”.
  5. Extract the most used 5-7 character-grams to a dictionary. If a lengthier word contains a dictionary ngram, then replace the whole word with it.
    Example of popular ngrams:

    Examples of words that were normalized using the ngrams above:
  6. Finally, replace all lengthy works with some token value. These are most likely some very creative variable/function names and keeping them as is only explodes the vocabulary and does not make the “sequence” any stronger. Examples:

Here is a snippet of an original file:

And here is an example of a pre-processed file:

By pre-processing we reduced the vocabulary from 200,000+ words to just over 20,000 words.

There many ways to run the above-mentioned data preparation steps. We used sed, a Unix text parsing utility, for most of the string replace/remove operations. For string replacement (ngram-based dictionary redaction, for example) we found that using Python’s memory map files reduced the text parsing runtime from hours to minutes:


Below is an overview of the results we obtained on a binary malware classifier (class 1 is “Ramnit” malware and class 2 is “Lollipop” malware). The pre-processed disassembly data was split into 80% for training and 20% for testing. The train/test sampling is again stratified by the source assembly file (i.e., all content belonging to the same disassembled executable goes to either the training set or the test set for a realistic evaluation). Furthermore, 90% of the training set was used to train the model, and 10% was used for validation while training.

We experimented with various settings and learning parameters, such as the number of fields to use from each assembly line, the structure of the hierarchical network (number of words/sentence, sentences/document, units per layer, etc.), the maximum number of unique tokens to keep in the vocabulary, whether to initialize the word embedding units based on a pre-trained word2vec model, and other learning hyper-parameters.

The experimentation code is available in this Jupyter notebook. A small subset of the pre-processed data is provided in the “data” subfolder for illustration purposes only. To run a full training and validation exercise, download and pre-process more data from the Microsoft Malware Classification Challenge (BIG 2015) dataset.

The graphs below illustrate train and validation accuracy, as well as loss evolved, over epochs for one run with following settings:

Getting to 96+% validation accuracy in just 4 epochs is quite promising!
Interestingly, we observed similar results using 2 or 5 fields, with or without word2vec, and after slightly varying the size of the vocabulary.

Below are the results using the top 1000 vocab words only, with word2vec turned off and using only the first two fields per assembly line (MAX_FIELDS=2). The training progress looks almost identical to the full vocab case, again achieving 97+% accuracy. The model size is much smaller though (~2MB).

Below are the performance metrics on the held-out 20% test set. We evaluated the test performance per individual sequences, i.e., subsets of assembly code, as well as the overall classification performance per executable using a simple majority vote. Overall accuracy is 97.8% for individual sequences, 98.2% for the programs as a whole.

Now compare the same run but using the full vocabulary of 25,573 tokens, with pre-trained word2vec weight initialization. As you see it has produced very close results as compared to the 1,000-token vocabulary without word2vec initialization:

Note that even though the word embedding weights are not initialized, the learned model catches up with the pre-initialized one after one or two epochs.

Finally, these are the performance metrics on the held-out 20% test set: For both sequence and program classification, an accuracy of 98.2% is achieved.


Limiting the vocabulary size by normalizing the disassembled content and using the top 1000 unique tokens by frequency in two-field action tuples produces similar performance results as compared to the full vocabulary, with or without word embedding initialization.


Hierarchical Attention Networks (HANs) are well suited for classification of sequences, leveraging multi-layers of short and long-term memory to capture sequence intent based on context. When applied to malware detection, both statically and dynamically, high detection performance is achieved using a controlled vocabulary size with negligible to no loss in accuracy.

To control the vocabulary size in the training data, several techniques are beneficial:

  • Normalization of the raw content to conflate file names, paths, memory addresses, registry keys, etc.
  • Considering the most frequent unique tokens only, for e.g., top 1000, captures the most useful signals in the data.
  • The first two fields per action tuple, such as (action, target) or (operation, operand), are the most predictive.

The trained model size is proportional to the vocabulary used for training. Model sizes may vary from 2-3 MB with limited vocab up to 80+ MB and more with larger vocab sizes, with similar classification performance.


  1. Zichao Yang, Diyi Yang, et al, Hierarchical Attention Networks for Document Classification, HLT-NAACL 2016 (PDF)
  2. GitHub repository link
  3. Keras — Python Deep Learning library
  4. Microsoft Cognitive Toolkit

Cover image photo credit: Photo by Toa Heftiba on Unsplash

Related Articles

Leave a reply

Your email address will not be published. Required fields are marked *

2018-06-07 10:37:23

Mona says:

Hi Sonia, the code is available in the GitHub repo

2018-05-30 10:31:39

Sonia ang says:

WHo is the author of the programs - we would like to get a copy to apply to a client site? Needed asap:)