This document and was originally written as an aid for an MSc student project. It describes data structures and algorithmic structure for ID3. A companion document Machine learning with variables describes how ID3 can be modified to deal with comparisons between attributes as well as single attribute conditions.
The ID3 algorithm is dealt with in chapter 4, Machine Learning, pages 90-95 This chapter also describes Query-by-Browsing an experimental intelligent database interface, which uses ID3 as its learning mechanism.
There are two principal data structures, one for the examples and one for the decision tree. The examples are easy, being a list or array of values. These values are of types given in a separate 'schema' of attribute types.
example_type == ARRAY of attribute_value
example_set == ARRAY of example_type
schema == RECORD
nos_of_attributes : integer
attribute_types : ARRAY of attribute_type
attribute_type == [ discrete | continuous | binary | string ... ]
Where the examples are of known classes, the class may be stored as an extra attribute value. The attribute values may either be a variant type (string or number ...) or held in some single form. For example, the values could be strings where numbers are converted when required.
The decision tree is substantially more complex. Each node may either be a branch (with and associated decision criterion) or a leaf. The leaf has stored with it the class it decides for and a certainty factor. This certainty factor will be one if only examples of the specified class are found at that node, but less than one if the examples were mixed.
node == EITHER
leaf : RECORD
class : class_type
certainty : real
OR
branch: RECORD
criterion : criteria_type
yes_branch : node_ptr
no_branch : node_ptr
These variants may be stored as two different data types with some form of initial identifier (or two derived classes from a base class in C++). Alternatively, both variants may be stored using a single record type (useful for PASCAL file storage) and a special 'LEAF' value of the criterion used to indicate LEAF nodes.
In standard ID3, the criteria are of the form 'attribute is equal to (or is less than or greater than) some value'. As we wish to extend this, critera can be of two forms:
criteria_type == EITHER
inter : RECORD
attr : attribute_no
relation : { =, >, <, etc. }
val : attribute_value
OR
intra : RECORD
attr1 : attribute_no
relation : { = or > }
attr2 : attribute_no
Again, this can be represented either with variant records or by having a tag (inter or intra) and having both a 'val' and 'attr2' field of which only one is used.
To be a purist, I would say that the use of true variant recoreds is better, but programs often use the accumulated form of storage to make dynamic allocation and input/output to files easier.
Assuming we have a decision tree and an example, we can classify it by simply running down the tree:
classify( example: example_type, root: node_ptr ) returns class_type
current_pos = root
WHILE current_pos is not a leaf
IF example satisfies current_pos^.criteria
THEN current_pos := current_pos^.yes_branch
ELSE current_pos := current_pos^.no_branch
{ ends with current_pos on a leaf }
return current_pos^.class
It would be reasonable to get the classification procedure to return the certainty factor as well. This is a simple modification to the above.
Actually learning the decision tree is more complicated. Basically at each stage, one takes a set of examples, chooses a decision criterion, splits the examples into those that do or do not satisfy the criterion and then builds new trees for those new subsets of examples. The process stops when no further progress can be made.
buildtree ( examples: example_set ) returns node_ptr
IF examples all of the same class
THEN return a pointer to a LEAF where the class is the one with the most
examples and the certainty factor is the number
in the class divided by the total number of examples.
ELSE
choose criteria C
IF choosing process fails
THEN return pointer to LEAF as above
ELSE
split_examples( C, examples, yes_examples {VAR} , no_examples {VAR} )
LET yes_b = buildtree( yes_examples )
AND no_b = buildtree( no_examples )
return a pointer to a BRANCH
with criteria = C
yes_branch = yes_b
no_branch = no_b
END {if choosing fails}
END {if examples of same class}
The only two sub-processes are the splitting of the example set and the choosing of the best decision criterion. The former is simply a matter of looking at each example in turn, seeing whether it satisfies the criterion and then adding it to the appropriate subset.
This leaves us with choosing an appropriate criterion. This may either be a single attribute criterion or a paired attribute one:
choose_criteria ( examples: example_set,
crit: VAR criteria_type, found: VAR bool)
found = true
choose_single ( examples, s_crit, , s_ent, s_found )
choose_pair ( examples, p_crit, , p_ent, p_found )
IF s_found AND p_found
THEN
IF p_ent < s_ent
THEN
crit = p_crit
ELSE
crit = s_crit
END { if p_ent < s_ent }
ELSE IF s_found { only single criteria found }
crit = s_crit
ELSE IF p_found { only pair criteria found }
crit = p_crit
ELSE { no criteria found }
found = false
END { if s_found p_found etc. }
The procedures 'choose_single' and 'choose_pair' search for suitable criteria and if found return the entropy so that they can be compared.
choose_single ( examples: example_set,
crit: VAR criteria_type, ent: VAR real, found: VAR bool)
ent = BIG_NUMBER
best_crit = none
found = false
FOR EACH attribute A
depending of attribute type:
CASE discrete or bool
FOR EACH attribute value AV
LET try_crit = INTER with attr = A
relation = equals
val = AV
LET try_ent = entropy(try_crit,examples)
IF try_ent < best_ent AND useful split
THEN
found = true
best_ent = try_ent
best_crit = try_crit
END { if try_ent }
CASE string
{ no sensible criteria }
CASE continuous
{ should look for < or > some value, but don't bother for now }
The 'useful split' condition should check that the split on the attribute actually makes progress, that is that all the examples don't go the same way. This can be coded as a separate test or the entropy calculating function can return a special value.
The paired search is similar just looks at each attribute pair. These pairs should be different and as the < and > test yield the same information it is silly to consider the attributes in both orders. This can be achieved for numerically coded attribute numbers, by having a nested loop where the inner loop starts from A1+1.
choose_single ( examples: example_set,
crit: VAR criteria_type, ent: VAR real, found: VAR bool)
ent = BIG_NUMBER
best_crit = none
found = false
FOR EACH attribute pair A1 A2 with the same type
depending of attribute type:
CASE discrete, bool or string
LET try_crit = INTER with attr1 = A1
relation = equals
attr2 = A2
LET try_ent = entropy(try_crit,examples)
IF try_ent < best_ent AND useful split
THEN
found = true
best_ent = try_ent
best_crit = try_crit
END { if try_ent }
CASE continuous
FOR relation R in '=' and '<'
LET try_crit = INTER with attr1 = A1
relation = R
attr2 = A2
LET try_ent = entropy(try_crit,examples)
IF try_ent < best_ent AND useful split
THEN
found = true
best_ent = try_ent
best_crit = try_crit
END { if try_ent }
Finally, we need to calculate the entropy. We assume we have the procedure which splits an example set into those which do or do not satisfy a given criterion. We also assume a procedure which given an example set returns an array with the count of the number of examples in each class (array length two for a simple binary choice).
entropy ( crit : criteria_type, examples: example_set
yes_cts, no_cts: ARRAY [1..nos_classes] of real
yes_examples, no_examples : example_set
split_examples( crit, examples, yes_examples , no_examples )
count_classes ( yes_examples, yes_cts )
count_classes ( no_examples, no_cts )
total = number of examples
ent = 0
non_zero = 0
FOR c = 1 .. nos_classes
py = yes_cts[c] / total
pn = no_cts[c] / total
ent := ent - py * log(py) - pn * log (pn)
IF ( py > 0 ) OR ( px > 0 )
THEN non_zero = non_zero + 1
return entropy = ent
{ at end of loop if 'non_zero' is < 2, this is a useless split }
{ the useless test could be identical to this, or entropy could }
{ have var parameters, one of which was a useless flag, or one }
{ have a line like: IF non_zero < 2 THEN ent = BIG_NUMBER }
The above is PASCAL-like pseudo-code, but the algorithms could also be programmed in prolog. This has the advantage that the 'dynamic typing' of the attribute values is far easier. Also variants can be easily dealt with:
decide(leaf(Class,T),Ex,Class). decide(branch(Crit,Y,N),Ex,Class) :- satisfies(Crit,Ex), decide(Y,Ex,Class). decide(branch(Crit,Y,N),Ex,Class) :- not satisfies(Crit,Ex), decide(N,Ex,Class). satisfies(eq(L,R),Ex) :- getval(L,Ex,Val), getval(R,Ex,Val). satisfies(lt(L,R),Ex) :- getval(L,Ex,cts(Lv)), getval(R,Ex,cts(Rv)), Lv < Rv. getval(attr(A),Ex,V) :- !, getattr(A,Ex,V). getval(V,Ex,V). getattr(A,[],error). getattr(A,Ex,error) :- A < 0. getattr(1,[V|Rest],V). getattr(A,[V|Rest],V) :- A1 is A - 1, getattr(A1,Rest,V).
This is of course the easy bit: deciding from a given tree. However, the tree building should not be too bad, the worst bit being finding the best out of a set of possibilities. Prolog is good at finding one successful thing out of a set, but not the best. Below are two ways of doing it.
In each of the following, 'findbest' takes five arguments:
findbest(Clause,Ret,Val,BestRet,BestVal)
It repeatedly tries to satisfy 'Clause', and looks for the
solution which maximises 'Val'.
The best value of 'Val' is returned in 'BestVal' and
'BestRet' is the value of 'Ret' with the same bindings
in place as when the optimal 'Clause' was found.
For example, the following finds the pair of numbers
adding up to ten with the biggest product:
smallnumber(0). smallnumber(1). smallnumber(2). smallnumber(3).
smallnumber(4). smallnumber(5). smallnumber(6). smallnumber(7).
smallnumber(8). smallnumber(9). smallnumber(10).
sum10(X,Y) :- smallnumber(X), smallnumber(Y), X+Y =:= 10.
?- findbest(sum10(A,B),(A,B),A*B,R,V).
returns - findbest(sum10(_1,_2),(_1,_2),_1*_2,(5,5),25).
The first bit of code uses cuts, assert and retract liberally.
retract1(C) :- retract(C),!.
best1(R,V) :- bestsofaris(R,V),!.
findbest(_,_,_,_,_)
:- asserta(bestsofaris(badvalue,-10000)), fail.
findbest(C,R,V,_,_)
:- call(C), best1(CR,CV), CV < V,
retract1(bestsofaris(CR,CV)),
asserta(bestsofaris(R,V)), fail.
findbest(_,_,_,_,_)
:- best1(badvalue,-10000), !,
retract1(bestsofaris(badvalue,-10000)), fail.
findbest(_,_,_,BR,BV)
:- best1(BR,BV), !, retract1(bestsofaris(BR,BV)).
The second version uses the meta-logical predicate 'setof'. This is not available on all Prologs. Also if the search space is very large setof may run out of memory.
findbest(Clause,Ret,Val,BestRet,BestVal)
:- setof(Clause,(Ret,Val),Set),
bestofset(Set,badvalue,-10000,BestRet,BestVal),
BestVal =\\= -10000.
bestofset([],Ret,Val,Ret,Val).
bestofset([(CR,CV)|RestOfSet],Ret,Val,Ret,Val) :- CV <= Val.
bestofset([(CR,CV)|RestOfSet],Ret,Val,CR,CV) :- CV > Val.
N.B. none of the above prolog is tested!!!