Spam Be Gone! was developed as an example application for one of the most
technically sound machine learning systems around. That system is known as
K* (short for Kolmogorov *) and is an instance-based, complexity-theoretic
machine learning system.
Initial versions of K* have been included in a state-of-the-art machine learning toolkit.
K* strengths include:
Spam Be Gone! seemed like a good application for the K* algorithm. If you have downloaded and installed
Spam Be Gone!, you will have noticed how easy it is to use, and how well it performs. As you teach it, it learns.
About K*
K* is an instance-based classifier; that is, the class of a test
instance is based upon the classes of those training instances
similar to it, as determined by some similarity function. The
underlying assumption of instance-based classifiers such as K*,
IB1, PEBLS, etc., is that similar instances will have similar
classes. The following description is an overview; consult
the definitive K* paper
for more details. The paper is available in PDF format only, but the K* thesis will be available soon :-)
There are two important parts to any instance-based classifier:
the similarity function that determines how alike two instances
are, and the classification function that determines how to use
the class of similar instances in predicting the class of the
test instance.
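To make those two parts concrete, here is a toy sketch in Python (the names are illustrative, not the actual K* code). The similarity function compares one-dimensional numeric instances, and the classification function is the simplest possible one: predict the class of the single most similar training instance.

```python
def similarity(a, b):
    # Toy similarity for one-dimensional numeric instances:
    # closer values score higher.
    return 1.0 / (1.0 + abs(a - b))

def classify_1nn(test, training):
    # 'training' is a list of (instance, label) pairs.
    # Predict the label of the most similar training instance.
    instance, label = max(training, key=lambda p: similarity(test, p[0]))
    return label

training = [(1, "ham"), (2, "ham"), (9, "spam"), (10, "spam")]
print(classify_1nn(3, training))   # -> ham
```

Any instance-based classifier can be described by swapping in different choices for these two functions; the rest of this page describes the choices K* makes.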
Similarity
The K* classifier assumes that if two instances are similar then
there is a high probability of one instance randomly
"transforming" into the other by some sequence of small
mutations. A transformation between instances may be envisaged
as a series of smaller, basic transformations. For example, a
numeric value may be mutated by adding one or subtracting one.
Assigning probabilities to the basic transformations allows us to
calculate the probability of the overall sequence. However, there
are many possible sequences that transform instance A into instance
B. K* incorporates all possible transformation paths into its
similarity function and takes the probability of A transforming
into B in an unspecified manner (i.e. the sum of the probabilities
of all possible transformation programs). The beauty of this
approach is that different attribute types (such as integer and
symbolic) can be modelled within the same framework. Adding
support for a new attribute type consists of defining an
appropriate set of basic transformations and their probabilities.
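As a rough illustration of summing over all transformation paths (a toy model, not the real K* mathematics), the sketch below gives each add-one and subtract-one mutation a probability p and stopping a probability s, then accumulates the probability of every mutation sequence of up to max_steps steps that ends at the target value:

```python
def transform_probability(a, b, p=0.2, max_steps=30):
    # Toy model: a transformation "program" is a sequence of +1/-1
    # mutations (each with probability p) followed by a stop symbol
    # (probability s). We sum the probabilities of ALL programs that
    # take value a to value b, truncating at max_steps mutations.
    s = 1.0 - 2.0 * p
    dist = {a: 1.0}                 # probability of each value so far
    total = dist.get(b, 0.0) * s    # programs that stop immediately
    for _ in range(max_steps):
        nxt = {}
        for v, prob in dist.items():
            nxt[v + 1] = nxt.get(v + 1, 0.0) + prob * p
            nxt[v - 1] = nxt.get(v - 1, 0.0) + prob * p
        dist = nxt
        total += dist.get(b, 0.0) * s   # programs that stop here
    return total

# Nearby values accumulate far more probability than distant ones:
print(transform_probability(5, 6) > transform_probability(5, 9))   # -> True
```

Note that the sum counts not only the shortest path from a to b but also every longer path that wanders and comes back, which is what "transforming in an unspecified manner" means above.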
Classification
For a simple instance-based classifier, such as a
k-nearest-neighbour algorithm, the k closest neighbours to the
test instance are found and the predicted class is determined by
majority vote. If 4 of the 5 nearest neighbours have class B,
then class B is predicted as the class of the test instance.
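That majority vote can be sketched in a few lines (toy one-dimensional numeric instances; the names are illustrative), matching the "4 of the 5 nearest neighbours" example above:

```python
from collections import Counter

def knn_predict(test, training, k=5):
    # 'training' is a list of (instance, label) pairs.
    # Take the k instances closest to the test instance...
    nearest = sorted(training, key=lambda p: abs(p[0] - test))[:k]
    # ...and predict the label with the most votes among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [(1, "B"), (2, "B"), (3, "B"), (4, "B"), (6, "A"), (20, "A")]
print(knn_predict(5, training))   # -> B (4 of the 5 nearest are class B)
```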
K* can be thought of as a k-nearest-neighbour classifier with
votes weighted according to similarity. The probability that a
test instance is of class C is the probability of the instance
randomly transforming into one of the training instances of class C
(i.e. the sum of the probabilities of transforming into each training
instance of that class). Because similar neighbours have a
higher transformation probability than dissimilar neighbours,
these instances carry a higher weighting in the final
classification. The predicted class is the one into which the test
instance is most likely to randomly transform.
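Here is a hedged sketch of that weighted scheme, using a simple exponential kernel as a stand-in for the real summed-path transformation probability (the names and the kernel are illustrative, not the K* implementation):

```python
import math

def transform_prob(a, b, scale=1.0):
    # Stand-in for the K* transformation probability: decays
    # smoothly as the two instances grow less similar.
    return math.exp(-abs(a - b) / scale)

def kstar_predict(test, training):
    # 'training' is a list of (instance, label) pairs.
    # Sum the transformation probability over every training
    # instance of each class, then pick the largest total.
    totals = {}
    for instance, label in training:
        totals[label] = totals.get(label, 0.0) + transform_prob(test, instance)
    return max(totals, key=totals.get)

training = [(1, "ham"), (2, "ham"), (9, "spam"), (10, "spam")]
print(kstar_predict(3, training))   # -> ham
```

Unlike the majority vote, every training instance contributes here, but distant instances contribute so little weight that nearby ones dominate the outcome.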
Blend parameter
The probabilities assigned to the basic transformations determine
what is considered a small or large difference. This may be
controlled through a blend parameter. The blend parameter sets
the size of a "sphere of influence" around the test instance.
A blend setting of 100% sets transformation probabilities so
that all differences are considered small, and thus all of the
training instances appear to be similar to the test instance. A
blend setting of 5% will result in approximately 5% of the
training instances being considered "close".
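One hypothetical way such a blend parameter could be realised (this mirrors the idea described above, not the exact K* procedure) is to search for a kernel scale at which the "effective number" of neighbours, taken here as the total weight divided by the largest single weight, is roughly the requested fraction of the training set:

```python
import math

def effective_neighbours(test, instances, scale):
    # Total weight divided by the largest single weight: close to 1
    # when one instance dominates, close to len(instances) when all
    # instances weigh about the same.
    weights = [math.exp(-abs(test - x) / scale) for x in instances]
    return sum(weights) / max(weights)

def scale_for_blend(test, instances, blend):
    # blend = 0.0 targets a single neighbour; blend = 1.0 targets all.
    target = 1 + (len(instances) - 1) * blend
    lo, hi = 1e-6, 1e6
    for _ in range(100):            # geometric bisection on the scale
        mid = math.sqrt(lo * hi)
        if effective_neighbours(test, instances, mid) < target:
            lo = mid                # sphere too small: widen it
        else:
            hi = mid                # sphere too big: shrink it
    return mid

instances = [1, 2, 3, 50, 51, 52, 53, 99, 100, 101]
small = scale_for_blend(2, instances, 0.05)
large = scale_for_blend(2, instances, 0.95)
print(small < large)   # -> True: a higher blend needs a wider sphere
```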
Advanced symbolic metric
K* optionally includes an advanced symbolic similarity function that incorporates ideas
from Stanfill & Waltz's Value Difference Metric.
© 1997 Internz Software
Last updated: 1997-07-10