This document was originally written as an aid for an MSc student project. It describes how ID3 can be modified to deal with comparisons between attributes as well as single-attribute conditions. A companion document, Pseudo-code for ID3, describes data structures and the algorithmic structure for ID3.
The ID3 algorithm is dealt with in chapter 4, Machine Learning, pages 90-95. This chapter also describes Query-by-Browsing, an experimental intelligent database interface which uses ID3 as its learning mechanism.
The fundamental reference for the ID3 algorithm is Quinlan.
The basic inductive learning algorithm is as follows.
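The listing itself is not reproduced in this extract, but in outline (numbered so that the references to step (2) below still make sense) the usual formulation runs roughly:

(1) If all the examples in the current set are positive, or all are negative, stop and make a leaf labelled with that class.
(2) Otherwise choose the attribute and condition C that best split the examples.
(3) Partition the examples according to C, apply the algorithm recursively to each part, and make the resulting sub-trees the branches of a node labelled with C.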
The most complicated step is (2), choosing the attribute. This is accomplished by looking at each attribute in turn, together with all candidate conditions on it, and seeing what sort of split each would achieve. The ID3 algorithm uses the entropy of the split to decide on the best attribute.
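As a concrete illustration of the measure, here is a minimal sketch in Python (an assumed rendering, not the project's code) of the entropy of a candidate binary split, with each branch weighted by its size; the condition whose split scores lowest is the one chosen.

    import math

    def entropy(pos, neg):
        # entropy (in bits) of a set with pos positive and neg negative examples;
        # 0 means the set is pure, 1 means an even positive/negative mix
        total = pos + neg
        if pos == 0 or neg == 0:
            return 0.0
        p, n = pos / total, neg / total
        return -(p * math.log2(p) + n * math.log2(n))

    def split_score(true_pos, true_neg, false_pos, false_neg):
        # score of a candidate binary split: the entropy of each branch,
        # weighted by the fraction of examples falling into it -- lower is better
        total = true_pos + true_neg + false_pos + false_neg
        return (((true_pos + true_neg) / total) * entropy(true_pos, true_neg)
                + ((false_pos + false_neg) / total) * entropy(false_pos, false_neg))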
Another complication is that the condition C may not be a simple true/false one. If the attribute takes one of a finite set of values v1, v2, ..., vn then we might build a tree with an n-way branch. The original ID3 algorithm did this. However, there are disadvantages to this approach, and for the purposes of this project a binary split is more appropriate.
Both the ID3 algorithm and standard databases are based on tuples, each of which consists of a number of attributes with values. For example, a simple banking database might be as follows:
name          | title | balance  | overdraft-limit
--------------|-------|----------|----------------
Jo Parkin     | Ms    |   575.23 |            0.00
John McCawber | Mr    |  -375.23 |         -100.00
Liz Windsor   | Mrs   | 27629.76 |        -9999.99
Peter Coney   | Dr    |  -730.37 |        -1000.00
Gladys Green  | Ms    |   -15.23 |            0.00
The bank manager might pick out John McCawber and Gladys Green from the above. These would form the positive set and the remaining tuples the negative set. The criterion being used is obvious -- those whose balance is below their overdraft limit!
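To make this concrete, here is a minimal sketch, assuming the tuples are held as Python dictionaries (with overdraft-limit renamed to the identifier overdraft_limit), of the rule the learner should eventually recover from the manager's selection:

    customers = [
        {"name": "Jo Parkin",     "title": "Ms",  "balance":   575.23, "overdraft_limit":     0.00},
        {"name": "John McCawber", "title": "Mr",  "balance":  -375.23, "overdraft_limit":  -100.00},
        {"name": "Liz Windsor",   "title": "Mrs", "balance": 27629.76, "overdraft_limit": -9999.99},
        {"name": "Peter Coney",   "title": "Dr",  "balance":  -730.37, "overdraft_limit": -1000.00},
        {"name": "Gladys Green",  "title": "Ms",  "balance":   -15.23, "overdraft_limit":     0.00},
    ]

    # balance below the overdraft limit (limits are stored as negative numbers)
    positive = [c for c in customers if c["balance"] < c["overdraft_limit"]]
    negative = [c for c in customers if c["balance"] >= c["overdraft_limit"]]
    # positive contains John McCawber and Gladys Green, as the manager intended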
This is an example of a data set exhibiting an inequality as its decision criterion. It could be made more complicated by adding fields to record those for whom letters have already been sent out and when (one wouldn't want to send letters every day).
In addition, equality constraints ought to be considered. For example, if we included the name of the personal manager responsible for each client, we might want to look for all tuples where the overdraft limit was large and the surname of the client was the same as the surname of the manager (possible collusion!).
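Purely for illustration (the manager fields are hypothetical and not in the table above), the kind of rule being described might look like the following, where surname and manager_surname are assumed to be stored as separate attributes so that the test is a plain attribute-to-attribute equality:

    # hypothetical attributes: surname and manager_surname (plus the existing overdraft_limit);
    # a 'large' overdraft limit is a very negative number in this representation,
    # and the -5000.0 threshold is arbitrary
    def possible_collusion(customer, large_limit=-5000.0):
        return (customer["overdraft_limit"] < large_limit
                and customer["surname"] == customer["manager_surname"])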
A range of similar example data sets should be developed. They should be bigger than the above (perhaps 50-100 tuples) and exhibit decision rules of varying complexity. Perhaps some student/course/lecturer type examples would be interesting.
Notice that the attributes fall into several types. The 'title' attribute takes only one of a finite set of values: Dr/Mr/Mrs/Ms/Miss. The 'name' attribute, on the other hand, can take a virtually unlimited range of character-string values. The balance and overdraft limit are similarly non-finite, but are numerical. The types of the attributes are important when we look for relationships between attributes. It is only sensible to make numerical inequality comparisons between numerical attributes:
balance < overdraft-limit
Similarly, one would not test for equality between a string attribute and a numerical one.
The general form of the inductive learning algorithm stays as it was before; it is only step (2) that becomes more complicated. Whereas in the standard algorithm we simply look at one attribute at a time, in the modified algorithm we can also look at pairs of attributes. The conditions one would have for single attributes would be:
attribute = constant value
attribute < (or >) constant value
Note that we do not have to treat 'not =' as a separate case as it yields the same branches as an '=' condition.
When we have two attributes (say A1 and A2) to consider, the possibilities include:
A1 = A2
A1 > A2 (or A1 < A2)
Indeed for numerical attributes, we could even consider general linear conditions:
alpha * A1 + beta * A2 > k
Step (2) of the algorithm is then of the form:
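The body of the step is not spelled out here; the sketch below (an assumed Python rendering, not the project's own pseudo-code) shows its general shape: generate candidate conditions over single attributes and over type-compatible pairs of attributes, score each by the weighted entropy of the binary split it produces, and keep the best. The names choose_condition, numeric_attrs and finite_attrs, and the dictionary representation of tuples, are inventions of this sketch, and the weighted-entropy score from the earlier fragment is restated so that this one is self-contained.

    import math

    def split_entropy(examples, labels, condition):
        # weighted entropy of the binary split produced by 'condition'
        counts = {True: [0, 0], False: [0, 0]}              # branch -> [positives, negatives]
        for ex, positive in zip(examples, labels):
            counts[bool(condition(ex))][0 if positive else 1] += 1
        total = len(examples)
        score = 0.0
        for pos, neg in counts.values():
            size = pos + neg
            if size == 0 or pos == 0 or neg == 0:
                continue                                    # empty or pure branches add no entropy
            p, n = pos / size, neg / size
            score += (size / total) * -(p * math.log2(p) + n * math.log2(n))
        return score

    def choose_condition(examples, labels, numeric_attrs, finite_attrs):
        # step (2): enumerate candidate conditions, return the one with the lowest split entropy
        candidates = []
        # single-attribute conditions: attribute = constant (finite-valued attributes only,
        # to keep the candidate set manageable) and attribute < constant (numeric attributes)
        for a in finite_attrs:
            for v in {ex[a] for ex in examples}:
                candidates.append((f"{a} = {v}", lambda ex, a=a, v=v: ex[a] == v))
        for a in numeric_attrs:
            for v in {ex[a] for ex in examples}:
                candidates.append((f"{a} < {v}", lambda ex, a=a, v=v: ex[a] < v))
        # two-attribute conditions, restricted to type-compatible (here, numeric) pairs;
        # equality between string attributes of the same kind would be generated similarly
        for a1 in sorted(numeric_attrs):
            for a2 in sorted(numeric_attrs):
                if a1 < a2:
                    candidates.append((f"{a1} = {a2}", lambda ex, a1=a1, a2=a2: ex[a1] == ex[a2]))
                    candidates.append((f"{a1} < {a2}", lambda ex, a1=a1, a2=a2: ex[a1] < ex[a2]))
                    candidates.append((f"{a1} > {a2}", lambda ex, a1=a1, a2=a2: ex[a1] > ex[a2]))
        return min(candidates, key=lambda c: split_entropy(examples, labels, c[1]))

Applied to the banking tuples above, with John McCawber and Gladys Green labelled positive, this sketch should recover the balance < overdraft_limit condition (or its mirror image), since that split is perfectly pure.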
As each condition yields a simple binary decision, we can use the same entropy-based measure of 'best' as in the standard algorithm.