A Multi-Corpus Evaluation of Dynamic Markov Coding for Spam Filtering
- Gordon V. Cormack | University of Waterloo
In the TREC 2005 Spam Evaluation Track, a number of popular spam filters – all owing their heritage to Graham’s “A Plan for Spam” – did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique – Prediction by Partial Matching (PPM) – performed exceptionally well, at or near the top of every test.
Are the TREC results an anomaly? Is PPM really the best method for spam filtering? How are these results to be reconciled with others showing that methods like Support Vector Machines (SVM) are superior?
I address these issues in three different ways. First, I show that my method of Dynamic Markov Coding (DMC), a PPM competitor, achieves results at least as good as PPM on the TREC tests. Second, I show that PPM and DMC outperform SVM – and all other reported results of which I am aware – on the public Ling Spam, PU1, and PU3 corpora using 10-fold cross-validation. Third, I adapt implementations of SVM, Perceptron, and kNN filters to the TREC test methods, where they demonstrate inferior performance, even with pre-training that should be to their advantage.
Speaker Details
Gordon V. Cormack is a Professor in the David R. Cheriton School of Computer Science at the University of Waterloo. Cormack’s research interests include programming language design and implementation, computer systems, and information retrieval. He is a TREC program committee member, the TREC Spam Track coordinator, and general chair for CEAS 2006 – the Third Conference on Email and Anti-Spam. Cormack coaches Waterloo’s ACM International Collegiate Programming Contest team, and is a member of the Scientific Committee for the International Olympiad in Informatics.