Support vector machines are a set of algorithms that learn from data by creating models that maximize their margin of error.
Support vector machines (SVMs) are a family of algorithms for classification, regression, transduction, novelty detection, and semi-supervised learning. They work by choosing a model that maximizes the error margin of a training set.
SVMs were originally developed by Vladimir Vapnik in 1963. Since the mid-90s, a energetic research community has grown around them. If you want to learn more about SVMs, you can read Chris Burges’ tutorial. Nello Cristianini and John Shawe-Taylor have written a textbook about them. Bernhard Schölkopf and Alex Smola wrote a textbook about kernel methods, which are a closely-related set of methods.
Since 1998, we’ve done basic research into making SVMs be more user-friendly. Our research has resulted in:
- SMO: A fast algorithm for training SVMs from data, which is easy to understand and code.
- A method for calibrating the output of an SVM to yield probabilities.
- A simple method to convert a multi-class problem into a series of faster-to-solve two-class SVMs
- A method to apply SVMs to find unusual items in a training set (novelty detection).
- An online approximation to SVMs.
See the list of publications, below, for complete citations.
The real-world data sets described in the technical report (below) are available in a compressed ASCII format (zip format). Both the adult data and the web data are available. There is a readme.txt file in each zip archive that explains the format of the file. The testing set for the adult data, the testing set for the web data set, and the MNIST data set is also available.
MSR currently does not have any software that implements SVMs. LIBSVM is a popular package that is based on a SMO-like algorithm.
Check here for errata on the SMO “Fast training” physical paper (already corrected in the on-line version).
Sathiya Keerthi and colleagues have a paper that describes an improved SMO: instead of updating a single threshold, they update the bounds on permissible thresholds. They report substantial improvement in speed, especially for extreme C values.
There are some unclear ideas and errors in the “Fast Training” paper that I (John Platt) would like to clarify on this web page:
- Equations (12.9) and (12.10) in the paper are for when the decision function (1.7) subtracts b, not adds it.
- Equation (12.18) has a sign error: the plus should be a minus.
- The sentence after (12.24) should read “is negative” not “is positive”
- In the pseudo-code, after the “if (eta < 0)” statement, right before the “if (|a2-alph2| < …)” statement, a2 should be checked to see if it is within 1e-8 of 0 or C. If so, then a2 should be set to the nearest bound. This prevents round-off error from mistakenly forcing a point to be a support vector. Richard Lippmann points out that this value of 1e-8 is appropriate when inputs have approximately unit variance. Therefore, it is a good idea to scale your inputs (or kernel outputs) so that they are on the order of unity.
Technical Fellow & Managing Director, Microsoft Research New England, New York City and Montreal