Grammatical inference is a form of machine learning focusing on induction of formal grammars. Such induction is particularly useful for learning abstractions of software systems, either actively through queries, or passively through observation of behavior. In this talk, I will give three examples of how grammatical inference can be applied in security and software analysis.
First, I will discuss inference of protocol state machines in the realistic high-latency network setting, and applications of such inference to the analysis of botnet Command and Control (C&C) protocols. The proposed technique enables an order of magnitude reduction in the number of queries and time needed to learn a proprietary botnet C&C protocol compared to classic algorithms. I will also show that the computed protocol state machines enable formal analysis for botnet defense.
Second, I will show how the inference of protocol state machines can be combined with Directed Automated Random Testing (DART). The combination, called Model-inference-Assisted Concolic Execution (MACE) learns a state machine that abstracts the implemented protocol state machine. Using the inferred state machine, MACE guides further search and keeps refining the abstraction. Experiments show that MACE finds 7x more vulnerabilities than the baseline approach, achieves up to 58% higher instruction coverage, and gets significantly deeper into the search space.
Third, I will discuss a novel approach to automated malware detection and classification based on tree automata inference. After a brief introduction to malware analysis, I’m going to present a technique for inference of k-testable tree automata from kernel system call dependency graphs and show how such automata can be used for detection and classification of malware.
Finally, I will discuss possible directions for the future research.