Today’s cybersecurity threats continue to find ways to fly and stay under the radar. Cybercriminals use polymorphic malware because a slight change in the binary code or script could allow the said threats to avoid detection by traditional antivirus software. Threat actors customize their wares specific to their target organizations to increase their chances of breaking into and moving laterally through an entire corporate network, exfiltrating data, and leaving with little or no trace. The underground economy is rife with malware builders, Trojanized versions of legitimate applications, and other tools and services that allow malware operators to deploy highly evasive malware.
As the number of threats seen in the wild continues to increase exponentially, the continued evolution and innovation of their evasion tactics create a scenario where most malware is seen only once. Therefore, in today’s threat landscape, security solutions should no longer be just about the number of unique malware they can detect. Instead, they should deliver durable solutions that can defend against existing as well as future attacks. This requires comprehensive visibility into threats, coupled with the ability to process vast amounts of data. Microsoft 365 Defender provides such a capability using its cross-domain optics and the transformation of data into actionable security information through innovative applications of AI and machine learning methodologies.
We have previously discussed how we apply deep learning in detecting malicious PowerShell, exploring new approaches to classify malware, and in detecting threats via the fusion of behavior signals. In this blog post, we discuss a new approach that combines deep learning with fuzzy hashing. This approach utilizes fuzzy hashes as input to identify similarities among files and to determine if a sample is malicious or not. Then, a deep learning methodology inspired by natural language processing (NLP) better identifies similarities that actually matter, thus improving detection quality and scale of deployment.
This model aims to improve the overall accuracy of classifying malware and continue closing the gap between malware release and eventual detection. It can detect and block malware at first sight, a critical capability in defending against the wide range of threats, including sophisticated cyberattacks.
Case study: New GoldMax malware blocked at first sight
In March this year, Microsoft 365 Defender successfully blocked a file that would later be confirmed as a variant of the GoldMax malware. GoldMax, a command-and-control backdoor that persists on networks as a scheduled task impersonating systems management software, is part the of tools, tactics, and procedures (TTPs) of NOBELIUM, the threat actor behind the attacks against SolarWinds in December 2020.
Microsoft was able to proactively defend its customers from this newly discovered GoldMax variant because it leveraged two main technologies: fuzzy hashing, which serves as the input, and deep learning techniques inspired by NLP and computer vision, among others.
The earliest GoldMax sample, which Microsoft detects as Trojan:Win64/GoldMax.A!dha, was first submitted on VirusTotal in September 2020. While the new file was confirmed to be GoldMax variant in June 2021, or three months after Microsoft first blocked it, we started defending customers as soon as we saw it. As seen in the screenshots below, the new file’s TLSH and SSDEP hashes—the fuzzy hashes exposed on VirusTotal—are observably similar to the first GoldMax variant. Both files also have the exact ImpHash and file size, further supporting our initial conclusion that the second file is also part of the GoldMax family.
Figure 1. File properties of the first GoldMax variant (top) and the new file detected in March (bottom) (from VirusTotal)
In the next sections, we discuss fuzzy hashes and how we use them in conjunction with deep learning to detect new and unknown threats.
Understanding fuzzy hashes
Hashing has become an essential technique in malware research literature and beyond because its output—hashes—are commonly used as checksums or unique identifiers. For example, it is common practice to use SHA-256 cryptographic hash to query a knowledge database like VirusTotal to determine whether a file is malicious or not. The first antivirus products operated this way before antivirus signatures existed.
However, to identify or detect similar malware, traditional cryptographic hashing poses a challenge because of its inherent property called cryptographic diffusion, whose purpose is to hide the relationship between the original entity and the hash so that these are still considered one-way functions. With this property, even a minimal change in the original entity—in this case, a file—yields a radically different, undetected hash.
Below are screenshots that illustrate this principle. The word change in the text file and the resulting change in the MD5 hash represent the effect of changes in binary content of other files:
Figure 2. Example of cryptographic hashing
Fuzzy hashing breaks the aforementioned cryptographic diffusion while still hiding the relationship between entity and hash. In doing so, this method provides similar resulting hashes when given similar inputs. Fuzzy hashing is the key to finding new malware that looks like something we have seen previously.
Like cryptographic hashes, there are several algorithms to calculate a fuzzy hash. Some examples are Nilsimsa, TLSH, SSDEEP, or sdhash. Using the previous text files example, below is a screenshot of their SSDEEP hashes. Note how observably similar these hashes are because there is only a one-word difference in the text:
Figure 3. Example of fuzzy hashing
The main benefit of fuzzy hashes is similarity. Since these hashes can be calculated on several parts or the entirety of a file, we can focus on hash sequences that are like one another. This is important in determining the maliciousness of a previously undetected file and in categorizing malware according to type, family, malicious behavior, or even related threat actor.
Fuzzy hashes as “natural language” for deep learning
Deep learning in its many applications has recently been remarkable at modeling natural human language. For example, convolutional architectures, recursive architectures like Gated Recurrent Units (GRUs) or Long Short Term Memory networks (LSTMs), and most recently attention-based networks like all the variants of Transformers have been proven to be state-of-the-art in tackling human language tasks like sentiment analysis, question answering, or machine translation. As such, we explored if similar techniques can be applied to computer languages like binary code, with fuzzy hashing as an intermediate step to reduce sequence complexity and length of the original space. We discovered that segments of fuzzy hashes could be treated as “words,” and some sequences of such words could indicate maliciousness.
Architecture overview and deployment at scale
A common deep learning approach in dealing with words is to use word embeddings. However, because fuzzy hashes are not exactly natural language, we could not simply use pre-trained models. Instead, we needed to train our embeddings from scratch to identify malicious indicators.
Once with these embeddings, we attempted to do most things with a language deep neural network. We explored different architectures using standard techniques from literature, explored convolutions over these embeddings, attempted with multilayer perceptrons, and tried traditional sequential models (like the previously-mentioned LSTM and GRU) and attention-based networks (Transformers).
Figure 4. Architecture overview of the deep learning model using fuzzy hashes
We got fairly good results with most techniques. However, to deploy and enable this model to the Microsoft 365 Defender, we looked into other factors like inference times and the number of parameters in the network. Inference time ruled out the sequential models because even though they were the best in terms of precision or recall, they are the slowest to run inference on. Meanwhile, the Transformers we experimented on also yielded excellent results but had several million parameters. Such parameters will be too costly to deploy at scale .
That left us with the convolutional approach and multilayer perceptron. The perceptron yielded slightly better results between these two because the spatial adjacency intrinsically provided by the convolutional filters does not properly capture the relationship among the embeddings.
Once we had landed on a viable architecture, we used modern tools available to us that Microsoft continues to extend. We used Azure Machine Learning GPU capabilities to train these models at scale, then exported them to Open Neural Network Exchange (ONNX), which gave us the extra performance we needed to operationalize this at scale on Microsoft Defender Cloud.
Deep learning fuzzy hashes: Looking for the similarities that matter
A question that arises from an approach like this is: why use deep learning at all?
Adding machine learning allows us to learn which similarities on fuzzy hashes matter and which ones don’t. Additionally, adding deep learning and training on vast amounts of data increases the accuracy of malware classification and allows us to understand the minor nuances that differentiate legitimate software from its malware or Trojanized versions.
A deep learning approach also has its inherent benefits, one of which is creating big pre-trains on massive amounts of data. One can then reuse this model for different classification, clustering, and other scenarios by using its transfer learning properties. This is similar to how modern NLP approaches language tasks, like how OpenAI’s GPT3 solves question answering.
Another inherent benefit of deep learning is that one does not have to retrain the model from scratch. Since new data is constantly flowing into the Microsoft Defender Cloud, we can fine-tune the model with these incoming data to adapt and quickly respond to an ever-changing threat landscape.
Conclusion: Continuing to harness the immense potential of deep learning in security
Deep learning continues to provide opportunities to improve threat detection significantly. The deep learning approach discussed in this blog entry is just one of the ways we at Microsoft apply deep learning in our protection technologies to detect and block evasive threats. Data scientists, threat experts, and product teams work together to build AI-driven solutions and investigation experiences.
By treating fuzzy hashes as “words” and not mere codes, we proved that natural language techniques in deep learning are viable methods to solve the current challenges in the threat landscape. This change in perspective presents different possibilities in cybersecurity innovation that we are looking forward to exploring further.
Numerous AI-driven technologies like this allow Microsoft 365 Defender to automatically analyze massive amounts of data and quickly identify malware and other threats. As the GoldMax case study showed, the ability to identify new and unknown malware is a critical aspect of the coordinated defense that Microsoft 365 Defender delivers to protect customers against the most sophisticated threats.
Edir Garcia Lazo
Microsoft 365 Defender Research Team