Pseudo-code for ID3

Alan Dix © 1993


This document was originally written as an aid for an MSc student project. It describes the data structures and algorithmic structure for ID3. A companion document, Machine learning with variables, describes how ID3 can be modified to deal with comparisons between attributes as well as single attribute conditions.

These documents are being released in connection with:
An introduction to Artificial Intelligence
Janet Finlay and Alan Dix, UCL Press, 1996.

The ID3 algorithm is dealt with in chapter 4, Machine Learning, pages 90-95. This chapter also describes Query-by-Browsing, an experimental intelligent database interface, which uses ID3 as its learning mechanism.



Data structures

There are two principal data structures, one for the examples and one for the decision tree. The examples are easy, being a list or array of values. These values are of types given in a separate 'schema' of attribute types.

example_type == ARRAY of attribute_value
example_set  == ARRAY of example_type
schema == RECORD
             nos_of_attributes : integer
             attribute_types : ARRAY of attribute_type
attribute_type == [ discrete | continuous | binary | string ... ]

Where the examples are of known classes, the class may be stored as an extra attribute value. The attribute values may either be a variant type (string or number ...) or held in some single form. For example, the values could be strings where numbers are converted when required.
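
For concreteness, here is a Python sketch of this example store. It is illustrative only: the names and the string type tags are not part of the pseudo-code, and the known class is held as an extra final value, as suggested above.

from dataclasses import dataclass
from typing import Any, List

DISCRETE, CONTINUOUS, BINARY, STRING = 'discrete', 'continuous', 'binary', 'string'

@dataclass
class Schema:
    nos_of_attributes: int
    attribute_types: List[str]     # one of the type tags above, per attribute

Example = List[Any]                # one value per attribute
ExampleSet = List[Example]

# e.g. two attributes, with the known class stored as an extra final value
schema = Schema(2, [DISCRETE, CONTINUOUS])
examples = [['red', 3.2, 'yes'], ['blue', 1.7, 'no']]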

The decision tree is substantially more complex. Each node may either be a branch (with an associated decision criterion) or a leaf. The leaf has stored with it the class it decides for and a certainty factor. This certainty factor will be one if only examples of the specified class are found at that node, but less than one if the examples were mixed.

node == EITHER
          leaf : RECORD
                    class : class_type
                    certainty : real
        OR
          branch: RECORD
                    criterion : criteria_type
                    yes_branch : node_ptr
                    no_branch  : node_ptr

These variants may be stored as two different data types with some form of initial identifier (or two derived classes from a base class in C++). Alternatively, both variants may be stored using a single record type (useful for PASCAL file storage) and a special 'LEAF' value of the criterion used to indicate LEAF nodes.
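
In Python, for instance, the two variants can be two small classes, playing the role of the derived classes in C++. This is a sketch only; the field names are illustrative.

from dataclasses import dataclass
from typing import Any, Union

@dataclass
class Leaf:
    klass: Any            # the class this leaf decides for
    certainty: float      # 1 if the examples were pure, less if mixed

@dataclass
class Branch:
    criterion: Any        # a criteria_type value, see below
    yes_branch: 'Node'
    no_branch: 'Node'

Node = Union[Leaf, Branch]    # the two variants of the 'node' record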

In standard ID3, the criteria are of the form 'attribute is equal to (or is less than or greater than) some value'. As we wish to extend this, criteria can be of two forms:

criteria_type == EITHER
                    inter : RECORD
                              attr : attribute_no
                              relation : { =, >, <, etc. }
                              val :  attribute_value
                 OR
                    intra : RECORD
                              attr1 : attribute_no
                              relation : { = or > }
                              attr2 : attribute_no

Again, this can be represented either with variant records, or by having a tag (inter or intra) and both a 'val' and an 'attr2' field, of which only one is used.

To be a purist, I would say that the use of true variant records is better, but programs often use the combined form of storage to make dynamic allocation and input/output to files easier.
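
As a Python sketch, the purist's variant records are again just two record types under a common union (names illustrative):

from dataclasses import dataclass
from typing import Any, Union

@dataclass
class Inter:              # attribute compared with a constant value
    attr: int
    relation: str         # '=', '<', '>' etc.
    val: Any

@dataclass
class Intra:              # attribute compared with another attribute
    attr1: int
    relation: str         # '=' or '>'
    attr2: int

Criterion = Union[Inter, Intra]

With a tagged union like this, the 'only one field used' problem of the single-record form simply does not arise.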


Classification

Assuming we have a decision tree and an example, we can classify the example by simply running down the tree:

classify( example: example_type, root: node_ptr ) returns class_type
  current_pos = root
  WHILE  current_pos is not a leaf
     IF  example satisfies current_pos^.criterion
     THEN   current_pos := current_pos^.yes_branch
     ELSE   current_pos := current_pos^.no_branch
  { ends with current_pos on a leaf }
  return current_pos^.class

It would be reasonable to get the classification procedure to return the certainty factor as well. This is a simple modification to the above.
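
A Python sketch of this, returning the certainty as well. It continues the Leaf/Branch and Inter/Intra records sketched above; OPS and satisfies are illustrative helpers, not part of the pseudo-code, and attribute numbers are 0-based list indices here.

import operator

OPS = {'=': operator.eq, '<': operator.lt, '>': operator.gt}

def satisfies(criterion, example):
    # inter: attribute against a constant; intra: attribute against attribute
    if isinstance(criterion, Inter):
        return OPS[criterion.relation](example[criterion.attr], criterion.val)
    return OPS[criterion.relation](example[criterion.attr1], example[criterion.attr2])

def classify(example, root):
    current = root
    while isinstance(current, Branch):             # run down to a leaf
        if satisfies(current.criterion, example):
            current = current.yes_branch
        else:
            current = current.no_branch
    return current.klass, current.certainty       # class and certainty factor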


Learning

Actually learning the decision tree is more complicated. Basically, at each stage, one takes a set of examples, chooses a decision criterion, splits the examples into those that do and do not satisfy the criterion, and then builds new trees for those subsets of examples. The process stops when no further progress can be made.

buildtree ( examples: example_set ) returns node_ptr
  IF examples all of the same class
  THEN  return a pointer to a LEAF where the class is the one with the most
               examples and the certainty factor is the number
               in the class divided by the total number of examples.
  ELSE
      choose criteria C
      IF choosing process fails
      THEN return pointer to LEAF as above
      ELSE
        split_examples( C, examples, yes_examples {VAR} , no_examples {VAR} )
        LET yes_b = buildtree( yes_examples )
        AND no_b  = buildtree( no_examples )
        return a pointer to a BRANCH
                       with criterion  =  C
                            yes_branch =  yes_b
                            no_branch  =  no_b
      END {if choosing fails}
  END {if examples of same class}
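
The same structure in Python, continuing the sketches above. This is a sketch, not the pseudo-code itself: majority_leaf is a helper introduced here, choose_criteria is assumed to return a (criterion, found) pair instead of using VAR parameters, and split_examples is sketched just below.

from collections import Counter

def majority_leaf(examples):
    # leaf for the most common class; certainty = its share of the examples
    counts = Counter(ex[-1] for ex in examples)   # class held as last value
    klass, n = counts.most_common(1)[0]
    return Leaf(klass, n / len(examples))

def buildtree(examples):
    if len(set(ex[-1] for ex in examples)) == 1:
        return majority_leaf(examples)            # pure set: certainty is 1
    crit, found = choose_criteria(examples)
    if not found:                                 # choosing process failed
        return majority_leaf(examples)            # mixed leaf, certainty < 1
    yes_ex, no_ex = split_examples(crit, examples)
    return Branch(crit, buildtree(yes_ex), buildtree(no_ex))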

The only two sub-processes are the splitting of the example set and the choosing of the best decision criterion. The former is simply a matter of looking at each example in turn, seeing whether it satisfies the criterion and then adding it to the appropriate subset.
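
A Python sketch of this splitting, using the satisfies helper from the classification sketch:

def split_examples(criterion, examples):
    # look at each example in turn and add it to the appropriate subset
    yes_examples, no_examples = [], []
    for ex in examples:
        (yes_examples if satisfies(criterion, ex) else no_examples).append(ex)
    return yes_examples, no_examples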

This leaves us with choosing an appropriate criterion. This may either be a single attribute criterion or a paired attribute one:

choose_criteria ( examples: example_set,
                 crit: VAR criteria_type, found: VAR bool)

  found = true
  choose_single ( examples, s_crit, s_ent, s_found )
  choose_pair ( examples, p_crit, p_ent, p_found )

  IF s_found AND p_found
  THEN
     IF p_ent < s_ent
     THEN
       crit = p_crit
     ELSE
       crit = s_crit
     END { if p_ent < s_ent }
  ELSE IF s_found THEN   { only a single-attribute criterion found }
     crit = s_crit
  ELSE IF p_found THEN   { only a pair criterion found }
     crit = p_crit
  ELSE  { no criteria found }
     found = false
  END { if s_found p_found etc. }

The procedures 'choose_single' and 'choose_pair' search for suitable criteria and, if any are found, return the entropy so that the two candidates can be compared.

choose_single (  examples: example_set,
                 crit: VAR criteria_type, ent: VAR real, found: VAR bool)

   best_ent = BIG_NUMBER
   best_crit = none
   found = false

   FOR EACH attribute A
      depending on attribute type:
      CASE discrete or bool
          FOR EACH attribute value AV
                LET try_crit = INTER with attr = A
                                          relation = equals
                                          val  = AV
                LET try_ent = entropy(try_crit,examples)
                IF try_ent < best_ent AND useful split
                THEN
                   found = true
                   best_ent = try_ent
                   best_crit = try_crit
                END { if try_ent }
      CASE string
               { no sensible criteria }
      CASE continuous
               { should look for < or > some value, but don't bother for now }
   END { for each attribute }
   crit = best_crit
   ent  = best_ent

The 'useful split' condition should check that the split on the attribute actually makes progress, that is, that the examples do not all go the same way. This can be coded as a separate test, or the entropy calculating function can return a special value.
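
Coded as a separate test, the check is simply that neither subset is empty; a minimal Python sketch:

def useful_split(yes_examples, no_examples):
    # the split makes no progress if every example went the same way
    return len(yes_examples) > 0 and len(no_examples) > 0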

The paired search is similar, but looks at each attribute pair. The two attributes of a pair should be different, and as the < and > tests yield the same information, there is no point considering the attributes in both orders. This can be achieved, for numerically coded attribute numbers, by having a nested loop where the inner loop starts from A1+1.

choose_pair (  examples: example_set,
               crit: VAR criteria_type, ent: VAR real, found: VAR bool)

   best_ent = BIG_NUMBER
   best_crit = none
   found = false

   FOR EACH attribute pair A1 A2 with the same type
      depending on attribute type:
      CASE discrete, bool or string
          LET try_crit = INTRA with attr1 = A1
                                    relation = equals
                                    attr2 = A2
          LET try_ent = entropy(try_crit,examples)
          IF try_ent < best_ent AND useful split
          THEN
                found = true
                best_ent = try_ent
                best_crit = try_crit
          END { if try_ent }
      CASE continuous
          FOR relation R in '=' and '<'
                LET try_crit = INTRA with attr1 = A1
                                          relation = R
                                          attr2 = A2
                LET try_ent = entropy(try_crit,examples)
                IF try_ent < best_ent AND useful split
                THEN
                   found = true
                   best_ent = try_ent
                   best_crit = try_crit
                END { if try_ent }
   END { for each attribute pair }
   crit = best_crit
   ent  = best_ent

Finally, we need to calculate the entropy. We assume we have the procedure which splits an example set into those which do and do not satisfy a given criterion. We also assume a procedure which, given an example set, returns an array with the count of the number of examples in each class (the array has length two for a simple binary choice).

entropy ( crit : criteria_type, examples: example_set ) returns real
    yes_cts, no_cts : ARRAY [1..nos_classes] of real
    yes_examples, no_examples : example_set
    split_examples( crit, examples, yes_examples, no_examples )
    count_classes ( yes_examples, yes_cts )
    count_classes ( no_examples, no_cts )
    total = number of examples
    ent = 0
    non_zero = 0
    FOR c = 1 .. nos_classes
           py = yes_cts[c] / total
           pn = no_cts[c] / total
           ent := ent - py * log(py) - pn * log(pn)
                  { taking 0 * log(0) to be 0 }
           IF ( py > 0 ) OR ( pn > 0 )
           THEN non_zero = non_zero + 1
    { at the end of the loop, if 'non_zero' is < 2 this is a useless split }
    { the useless test could be identical to this, or entropy could have   }
    { VAR parameters, one of which is a useless flag, or one could have a  }
    { line like:  IF non_zero < 2 THEN ent = BIG_NUMBER                    }
    return ent
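
A Python version mirroring the formula above, continuing the earlier sketches. It assumes split_examples as sketched before; 'classes' is an assumed list of the possible classes, 0 log 0 is taken as 0, and float('inf') plays the role of BIG_NUMBER to flag a useless split.

from math import log

def count_classes(examples, classes):
    # count of the number of examples in each class (class is the last value)
    return [sum(1 for ex in examples if ex[-1] == c) for c in classes]

def entropy(criterion, examples, classes):
    yes_ex, no_ex = split_examples(criterion, examples)
    yes_cts = count_classes(yes_ex, classes)
    no_cts  = count_classes(no_ex, classes)
    total = len(examples)
    ent = 0.0
    non_zero = 0
    for yc, nc in zip(yes_cts, no_cts):
        py, pn = yc / total, nc / total
        ent -= py * log(py) if py > 0 else 0.0    # taking 0 log 0 = 0
        ent -= pn * log(pn) if pn > 0 else 0.0
        if py > 0 or pn > 0:
            non_zero += 1
    if non_zero < 2:                # useless split: flag with a special value
        return float('inf')
    return ent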


Choice of language

The above is PASCAL-like pseudo-code, but the algorithms could also be programmed in Prolog. This has the advantage that the 'dynamic typing' of the attribute values is far easier. Variants can also be dealt with easily:

decide(leaf(Class,T),Ex,Class).
decide(branch(Crit,Y,N),Ex,Class) :- satisfies(Crit,Ex), decide(Y,Ex,Class).
decide(branch(Crit,Y,N),Ex,Class) :- not satisfies(Crit,Ex), decide(N,Ex,Class).

satisfies(eq(L,R),Ex) :- getval(L,Ex,Val), getval(R,Ex,Val).
satisfies(lt(L,R),Ex) :- getval(L,Ex,cts(Lv)), getval(R,Ex,cts(Rv)), Lv < Rv.

getval(attr(A),Ex,V) :- !, getattr(A,Ex,V).
getval(V,Ex,V).

getattr(_,[],error).
getattr(A,_,error) :- A < 1.
getattr(1,[V|_],V).
getattr(A,[_|Rest],V) :- A > 1, A1 is A - 1, getattr(A1,Rest,V).

This is of course the easy bit: deciding from a given tree. However, the tree building should not be too bad, the worst part being finding the best of a set of possibilities. Prolog is good at finding one successful thing out of a set, but not the best. Below are two ways of doing it.


Finding 'optimal' solutions in Prolog

In each of the following, 'findbest' takes five arguments:

findbest(Clause,Ret,Val,BestRet,BestVal)

It repeatedly tries to satisfy 'Clause', and looks for the solution which maximises 'Val'. The best value of 'Val' is returned in 'BestVal' and 'BestRet' is the value of 'Ret' with the same bindings in place as when the optimal 'Clause' was found. For example, the following finds the pair of numbers adding up to ten with the biggest product:

smallnumber(0). smallnumber(1). smallnumber(2). smallnumber(3).
smallnumber(4). smallnumber(5). smallnumber(6). smallnumber(7).
smallnumber(8). smallnumber(9). smallnumber(10).

sum10(X,Y) :- smallnumber(X), smallnumber(Y), X+Y =:= 10.

?- findbest(sum10(A,B),(A,B),A*B,R,V).

returns - findbest(sum10(_1,_2),(_1,_2),_1*_2,(5,5),25).
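
For comparison, in a language with higher-order functions the whole search collapses to a couple of lines; a Python sketch of the same sum10 example:

# pairs of small numbers summing to 10; keep the one with the biggest product
pairs = [(x, y) for x in range(11) for y in range(11) if x + y == 10]
best = max(pairs, key=lambda p: p[0] * p[1])     # gives (5, 5), product 25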

The first version uses cuts, assert and retract liberally.

retract1(C) :- retract(C),!.
best1(R,V) :- bestsofaris(R,V),!.

% initialise the best-so-far record, then fail through to the search
findbest(_,_,_,_,_)
         :- asserta(bestsofaris(badvalue,-10000)), fail.
% for each solution of C, record it if its value beats the best so far
findbest(C,R,V,_,_)
         :- call(C), best1(CR,CV), VV is V, CV < VV,
            retract1(bestsofaris(CR,CV)),
            asserta(bestsofaris(R,VV)), fail.
% nothing beat the initial marker: no solutions, so tidy up and fail
findbest(_,_,_,_,_)
         :- best1(badvalue,-10000), !,
            retract1(bestsofaris(badvalue,-10000)), fail.
% otherwise return (and remove) the best solution found
findbest(_,_,_,BR,BV)
         :- best1(BR,BV), !, retract1(bestsofaris(BR,BV)).

The second version uses the meta-logical predicate 'setof'. This is not available in all Prologs; also, if the search space is very large, setof may run out of memory.

findbest(Clause,Ret,Val,BestRet,BestVal)
         :- setof((Ret,Val),Clause,Set),
            bestofset(Set,badvalue,-10000,BestRet,BestVal),
            BestVal =\= -10000.

bestofset([],Ret,Val,Ret,Val).
bestofset([(_,CV)|RestOfSet],Ret,Val,BestRet,BestVal)
         :- CV =< Val, bestofset(RestOfSet,Ret,Val,BestRet,BestVal).
bestofset([(CR,CV)|RestOfSet],Ret,Val,BestRet,BestVal)
         :- CV > Val, bestofset(RestOfSet,CR,CV,BestRet,BestVal).

N.B. none of the above Prolog is tested!!!


Alan Dix