We tested several machine learning techniques: two decision tree methods, Parzen windows, and Fisher's Linear Discriminant.
Decision trees are a standard tool in data mining, and many are available in packages such as CART and C4.5 . Decision trees are generally preferred over other nonparametric techniques because of the readability of their learned hypotheses and the efficiency of training and evaluation.
Decision trees are rooted, usually binary trees, with simple classifiers placed at each internal node and a classification at each leaf. In order to evaluate a particular tree T with respect to an input x, each classifier in the tree is assigned the argument x. The outputs of the simple classifiers at the nodes determine a unique path from the root to a leaf of the decision tree: at each internal node, the left edge to a child is taken if the output of the function associated with that internal node is +1, and vice versa if it is -1. This path is known as the evaluation path . The value of the function T(x) is the classification associated with the leaf reached by the evaluation path.
Decision trees are generally learned by means of a top down growth procedure, which starts from the root node and greedily chooses a split of the data that maximizes some cost function, usually a measure of the ``impurity'' of the subsamples implicitly defined by the split. After choosing a split, the subsamples are then mapped to the two children nodes. This procedure is then recursively applied to the children, and the tree is grown until some stopping criterion is met. The tree is then used as a starting point for a bottom up search, performing a pruning of the tree. This eliminates nodes that are redundant or are unable to ``pay for themselves'' in terms of the cost function.
Typically, the simple classifier at an internal node compares one of the input attributes against a threshold. This test partitions the input space with axis parallel splits. The standard algorithm of this kind is C4.5. Another strategy uses hyperplanes in general position. This is the technique adopted by systems like OC1. We use an improved version of OC1, called MOC1, which implements a bias toward large margin splits in the purity measure, theoretically motivated by Vapnik-Chervonenkis theory. MOC1 has been shown to outperform the standard OC1.
We use the systems C4.5 and MOC1 in their basic version with all the default settings. Note that it would be possible to devise modifications of the same systems to account, for example, for the unequal numbers of positive and negative training examples, which might improve their performance.
Parzen windows classification is a technique for nonparametric density estimation, which can also be used for classification. Using a given kernel function, the technique approximates a given training set distribution via a linear combination of kernels centered on the observed points. In this work, we separately approximate densities for each of the two classes, and we assign a test point to the class with maximal posterior probability.
The resulting algorithm is extremely simple and closely related to support vector machines. The decision function is
f(X)=sign(\sum y_{i}K(X_{i},X)),where the kernel function K is the radial basis function, without normalization applied to the inputs. As for the radial basis SVM, a constant is added to the kernel function whenever the two inputs are identical.
The Parzen windows classification algorithm does not require any training phase; however, the lack of sparseness makes the test phase quite slow. Furthermore, although asymptotical convergence guarantees on the perfomance of Parzen windows classifiers exist , no such guarantees exist for finite sample sizes.
Parzen windows can be regarded as a generalization of k-nearest neighbor techniques. Rather than choosing the k nearest neighbors of a test point and labelling the test point with the weighted majority of its neighbors' votes, one can consider all points in the voting scheme and assign their weight by means of the kernel function. With Gaussian kernels, the weight decreases exponentially with the square of the distance, so far away points are practically irrelevant. The width \sigma of the Guassian determines the relative weighting of near and far points. Tuning this parameter controls the predictive power of the system. We have empirically optimized the value of \sigma.
Fisher's linear discriminant is a classification method that projects high-dimensional data onto a line and performs classification in this one-dimensional space. The projection maximizes the distance between the means of the two classes while minimizing the variance within each class. This defines the Fisher criterion, which is maximized over all linear projections, w:
J(w) = (|m_1 - m_2 |^2) / (s_1^2 +s_2^2)where m represents a mean, s^2 represents a variance, and the subscripts denote the two classes. In signal theory, this criterion is also known as the signal-to-interference ratio. Maximizing this criterion yields a closed form solution that involves the inverse of a covariance-like matrix. This method has strong parallels to linear perceptrons. We learn the threshold by optimizing a cost function on the training set.