Machine learning with variables

Alan Dix © 1993

This document and was originally written as an aid for an MSc student project. It describes how ID3 can be modified to deal with comparisons between attributes as well as single attribute conditions. A companion document Pseudo-code for ID3 describes data structures and algorithmic structure for ID3.

These documents are being released in connection with:: An introduction to Artificial Intelligence; Janet Finlay and Alan Dix, UCL Press, 1996.

The ID3 algorithm is dealt with in chapter 4, Machine Learning, pages 90-95. This chapter also describes Query-by-Browsing an experimental intelligent database interface, which uses ID3 as its learning mechanism.

Inductive learning

The fundamental reference for the ID3 algorithm is Quinlan.

The basic inductive learning algorithm is as follows.

Start with a set of positive example P and negative examples N.
Choose the 'best' attribute to split on and a condition C on that attribute.
Divide the examples into those for which the condition is true (Pt and Nt) and those which it is false (Pf and Nf).
Use the algorithm recursively to build two trees Tt and Tf based on the examples (Pt and Nt) and (Pf and Nf) respectively.
The algorithm returns the tree T with condition C at the root and with branches Tt and Tf.

The most complicated step is (2), choosing the attribute. This is accomplished by looking at each attribute in turn and all candidate conditions and seeing what sort of split they would achieve. The ID3 algorithm uses the entropy of the split to decide on the best attribute.

Another complication is that the condition C may not be a simple true/false one. If the attribute takes on a finite set of values v1, v2 ... vn then we might build a tree with an n way branch. The original ID3 algorithm did this. However, there are disadvantages to it and for the purposes of this project a binary split is more appropriate.

Example data set

Both the ID3 algorithm and standard databases are based on tuples, each of which consists of a number of attributes with values. For example, a simple banking database might be as follows:

name title balance overdraft-limit
Jo Parkin Ms 575.23 0.00
John McCawber Mr -375.23 -100.00
Liz Windsor Mrs 27629.76 -9999.99
Peter Coney Dr -730.37 -1000.00
Gladys Green Ms -15.23 0.00
The headings 'name', 'title' etc. are the attribute names, and for any tuple (say <Jo Parkin, Ms, 575.23, 0.00>) these attributes have particular values (e.g, balance=575.23).

The bank manager might pick out from the above John McCawber and Gladys Green. This would be the positive set and the remaining the negative set. The criteria being used is obvious -- those whose balance is below their overdraft limit!

This is an example of a data set exhibiting an inequality as its decision criterion. It could be made more complicated by adding fields to record those for whom letters have already been sent out and when (one wouldn't want to send letters every day).

In addition, equality constraints ought to be considered. For example, if we included the name of the personal manager responsible for each client, we might want to look for all tuples where the overdraft limit was large and the surname of the client was the same as the surname of the manager (possible collusion!).

A range of similar example data sets should be developed. They should be bigger than the above (perhaps 50-100 tuples) and exhibit decision rules of varying complexity. Perhaps some student/course/lecturer type examples would be interesting.

Notice that the attributes fell into several types. The 'title' attribute only takes one of a finite set of values Dr/Mr/Mrs/Ms/Miss. The 'name' attribute on the other hand can take on a virtually unlimited range of character string values. The balance and overdraft are similarly non-finite, but are numerical. The types of the attributes are important when we look for relationships between attributes. It is only sensible to make numerical inequality comparisons between numerical attributes:

balance < overdraft-limit

Similarly, one would not test for equality between a string attribute and a numerical one.

Modifying the inductive learning algorithm

The general form of the inductive learning algorithm stays as it was before. It is only that step 2 is made more complicated. Whereas in the standard algorithm we simply look at one attribute at a time. In the modified algorithm we can also look at pairs of attributes. The conditions one would have for single attributes would be:

attribute = constant value
attribute < (or >) constant value

Note that we do not have to treat 'not =' as a separate case as it yields the same branches as an '=' condition.

When we have two attributes (say A1 and A2) to consider, the possibilities include:

A1   =   A2
A1   >   A2    (or A1   <   A2)

Indeed for numerical attributes, we could even consider general linear conditions:

alpha * A1  +  beta * A2   >   k

Step (2) of the algorithm is then of the form:

2.1. consider each single attribute and possible conditions on it
2.2. consider each pair of string attributes and the equality constraint between them
2.3. consider each pair of numerical attributes and consider equality and inequality constraints between them
2.4. choose the best

As each condition yields a simple binary decision, we can use the same entropy based measure of 'best' as the standard algorithm.

name	title	balance	overdraft-limit
Jo Parkin	Ms	575.23	0.00
John McCawber	Mr	-375.23	-100.00
Liz Windsor	Mrs	27629.76	-9999.99
Peter Coney	Dr	-730.37	-1000.00
Gladys Green	Ms	-15.23	0.00