Kohonen Neural Network Package
-
This code is being released in connection with:
-
An Introduction to Artificial Intelligence
- Janet Finlay and Alan Dix, UCL Press, 1996.
Overview
This package is based on the algorithms of T. Kohonen.
It was developed on DOS and UNIX platforms as a simple basis for student projects.
The package comprises four programs:
- ko_init - initialises the network with random values
- ko_train - performs unsupervised training of the network
- ko_label - labels the network using training data
- ko_class - classifies data using the network
Contents: file formats | ko_init | ko_train | ko_label | ko_class | downloading and compiling
File formats
There are two file formats used by this package:
- DATA format:
- line 1: wid - width of each data vector
- line 2-on: v_1 v_2 ... v_wid label - example vectors with labelling
- CODE format:
- line 1: wid X Y - vector width and network size
- line 2-on: same as DATA format
In the CODE format the data lines represent the network nodes in
the order:
[1,1], [1,2] ... [1,Y], [2,1] ... [2,Y] ... [X,1] ... [X,Y]
That is, the code file should contain exactly X*Y lines of data
in addition to the header line.
Examples:
DATA format - data1.dat
3
1.0 0.0 0.0 CX
0.0 1.0 0.0 DX
0.0 0.0 1.0 EE
0.7 0.2 0.1 CY
-0.2 0.5 -0.3 DY
0.3 0.3 0.4 EY
CODE format - code.cod
3 2 2
1.0 0.0 0.0 A
0.0 1.0 0.0 B
0.0 0.0 1.0 A
0.5 0.5 0.5 C
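
For readers writing their own tools around these files, the following is a
minimal C sketch of a reader for the CODE format. The type and function
names (Code, read_code) are illustrative, not those used in the package
source.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative reader for the CODE format described above.
   The names Code and read_code are not from the package itself. */
typedef struct {
    int wid, X, Y;       /* vector width and network size          */
    double *vec;         /* X*Y*wid values in the node order above */
    char (*label)[32];   /* one label per node                     */
} Code;

Code *read_code(FILE *fp)
{
    Code *c = malloc(sizeof(Code));
    int n, i, j;
    if (c == NULL) return NULL;
    if (fscanf(fp, "%d %d %d", &c->wid, &c->X, &c->Y) != 3) {
        free(c);                              /* bad header line   */
        return NULL;
    }
    n = c->X * c->Y;                          /* exactly X*Y data lines */
    c->vec   = malloc(n * c->wid * sizeof(double));
    c->label = malloc(n * sizeof *c->label);
    for (i = 0; i < n; i++) {
        for (j = 0; j < c->wid; j++)
            fscanf(fp, "%lf", &c->vec[i * c->wid + j]);
        fscanf(fp, "%31s", c->label[i]);      /* label field       */
    }
    return c;
}

/* N.B. a node left unlabelled by ko_label has a blank label field,
   which a token-based read like this would mis-parse; a line-based
   reader (fgets + sscanf) is safer for such files. */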
DETAILED DESCRIPTIONS
KO_INIT
initialise network with random values
-
usage:
-
ko_init wid X Y lo hi
wid - width of data set vectors
X,Y - dimensions of 2d kohonen network
lo,hi - range of values in network
-
description:
-
Produces a network of size X x Y; each node of the network is
a vector of length wid. Each element is initialised to a
random number in the range lo <= x < hi.
-
output:
-
stdout: a code file representing the initial state
-
stderr: log of parameters
-
example:
-
ko_init 4 6 6 0.0 1.0 >code1.cod
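
The initialisation itself is straightforward; the sketch below shows the
idea in plain ANSI C. It is not the package source (ko_init has its own
random number handling in my_rand.c); it simply fills each element of each
node vector with a value in the range lo <= x < hi and writes a CODE file
to stdout.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative version of the initialisation step, not the package
   source.  Usage: wid X Y lo hi, as for ko_init. */
int main(int argc, char **argv)
{
    int wid, X, Y, node, j;
    double lo, hi;

    if (argc != 6) {
        fprintf(stderr, "usage: wid X Y lo hi\n");
        return 1;
    }
    wid = atoi(argv[1]); X = atoi(argv[2]); Y = atoi(argv[3]);
    lo = atof(argv[4]);  hi = atof(argv[5]);

    /* srand(seed) could be called here to vary the network between runs */
    printf("%d %d %d\n", wid, X, Y);           /* CODE header line  */
    for (node = 0; node < X * Y; node++) {     /* X*Y nodes         */
        for (j = 0; j < wid; j++)              /* lo <= value < hi  */
            printf("%g ", lo + (hi - lo) * (rand() / (RAND_MAX + 1.0)));
        printf("\n");                          /* label left blank  */
    }
    return 0;
}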
KO_TRAIN
unsupervised training of network
-
usage:
-
ko_train datafile codefile rlen alpha diam K
datafile - file containing training data
codefile - file containing network (code book)
rlen - run length (number of iterations)
alpha - rate at which vectors swing
diam - initial size of affected region
K - rate at which alpha and the affected region shrink
-
description:
- The Kohonen network is trained by cycling through the data
file approximately rlen/dlen times, where dlen is the length
of the data set. The `time' t is the number of times it has
cycled. The parameter alpha determines how far the network
vectors are `swung' by the training vectors, diam is the
initial width of the `window' within which network nodes are
affected, and K is a `cooling' constant determining how fast
alpha and diam are reduced during training.
To be precise, let vec be a training vector during pass t.
Let N[x,y] be the vectors of the network nodes, and assume
xc, yc is the closest network element to vec. That is:
|| N[xc,yc] - vec || <= || N[x,y] - vec ||
for all x in (1...X), y in (1...Y)
The network nodes are then swung according to the following
formula:
N'[x,y] = N[x,y] + alpha(t) * ( vec - N[x,y] )
    if dist[(x,y),(xc,yc)] <= diam(t)
Both alpha(t) and diam(t) decrease exponentially with time:
alpha(t) = exp(-Kt) * alpha(0)
diam(t) = exp(-Kt) * diam(0)
The initial values at time zero are those determined by the
program parameters.
The exponential decay ensures that the network eventually
stabilises.
-
output:
-
stdout: a code file representing the trained network
-
stderr: log of parameters
-
example:
-
ko_train data2.dat code1.cod 1000 0.3 4 0.2 >code2.cod
# N.B. takes some time to run!
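
The following C sketch shows one application of the update rule above: find
the winning node (xc,yc) by Euclidean distance and swing every node within
diam(t) of it towards the training vector. It illustrates the formulas only;
the names, the flat array layout, and the grid distance measure (taken here
as Euclidean) are assumptions, not the package's own code.

#include <math.h>

/* One training step for a single training vector vec, following the
   formulas above.  N holds the X*Y node vectors of length wid in the
   node order given in the file format section.  Illustrative only. */
void train_step(double *N, int X, int Y, int wid,
                const double *vec, double alpha_t, double diam_t)
{
    int x, y, j, xc = 0, yc = 0;
    double best = -1.0;

    /* 1. find the closest node (xc,yc):
          || N[xc,yc] - vec || <= || N[x,y] - vec || for all x,y */
    for (x = 0; x < X; x++)
        for (y = 0; y < Y; y++) {
            double d = 0.0, *node = N + (x * Y + y) * wid;
            for (j = 0; j < wid; j++)
                d += (node[j] - vec[j]) * (node[j] - vec[j]);
            if (best < 0.0 || d < best) { best = d; xc = x; yc = y; }
        }

    /* 2. swing every node within diam(t) of the winner:
          N'[x,y] = N[x,y] + alpha(t) * ( vec - N[x,y] )
          if dist[(x,y),(xc,yc)] <= diam(t)
          (grid distance taken as Euclidean here; the package may use
          a different neighbourhood measure) */
    for (x = 0; x < X; x++)
        for (y = 0; y < Y; y++)
            if (sqrt((double)((x - xc) * (x - xc) + (y - yc) * (y - yc)))
                    <= diam_t) {
                double *node = N + (x * Y + y) * wid;
                for (j = 0; j < wid; j++)
                    node[j] += alpha_t * (vec[j] - node[j]);
            }
}

/* Across passes, alpha_t and diam_t decay as in the text:
   alpha_t = exp(-K * t) * alpha0;  diam_t = exp(-K * t) * diam0;  */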
KO_LABEL
label network using training data
-
usage:
-
ko_label datafile codefile hi lo
datafile - file containing training data
codefile - file containing network (code book)
hi - minimum proportion needed to assign a label
lo - maximum proportion allowed for any other label
-
description:
-
The data items are classified into network positions
(using closest Euclidean distance). For each network node
the number of data items with each label is calculated.
That is, for each node n and label l, we have n(l),
the number of data items with label l classified to n.
A node n is given a label L if:
- L is the most prevalent label for that node,
  i.e., n(L) >= n(l) for all l.
- The proportion of L-labelled data items is at least hi,
  i.e., n(L) >= hi * sigma n(l)
- The proportion of any other label is less than lo,
  i.e., n(l) < lo * sigma n(l') for all l != L
For two labels the use of both hi and lo is redundant, but
suppose that for data with three labels we had hi = 0.5 and lo = 0.3.
This would mean that if 20 data items were assigned to a node
and the numbers with each label were {13, 0, 7},
we would NOT label this node with the first label: although
more than 50% of the training data items have the first label,
over 30% have the third label.
The node would then simply be left unlabelled.
To label simply by the maximum, use hi=0.0 and lo=1.0.
-
output:
-
-
stdout:
- modified code file where each network node is either
given a label or left blank if it cannot be classified.
-
stderr:
- log of parameters and code file statistics.
These statistics are output in CODE format, with the
vector width equal to the number of labels + 1.
Each node vector represents the number of data items
of each label, and the label field is the label assigned
to that node.
-
N.B. you may want to redirect the stderr output (there may be
a lot of it), but remember to check it for real errors!
-
example:
-
ko_label data2.dat code2.cod 0.5 0.3 >code3.cod 2>stats.out
head -10 stats.out # check for errors
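
The hi/lo rule can be summarised in a few lines of C. The sketch below
decides the label for a single node given the counts n(l); the function
name and calling convention are illustrative, not the package's.

/* Decide the label for one node, given counts n[l] of the data items
   with each of nlab labels classified to it.  Returns the index of
   the chosen label, or -1 if the node is left unlabelled.
   Illustrative only: names are not taken from the package source.  */
int choose_label(const int *n, int nlab, double hi, double lo)
{
    int l, best = 0, total = 0;

    for (l = 0; l < nlab; l++) {
        total += n[l];
        if (n[l] > n[best]) best = l;          /* most prevalent label */
    }
    if (total == 0) return -1;                 /* no data at this node */

    if (n[best] < hi * total) return -1;       /* proportion below hi  */
    for (l = 0; l < nlab; l++)
        if (l != best && n[l] >= lo * total)   /* another label >= lo  */
            return -1;
    return best;
}

With the worked example above (counts {13, 0, 7}, hi = 0.5, lo = 0.3) this
returns -1, because 7 out of 20 is at least 30%; with hi = 0.0 and lo = 1.0
it reduces to labelling by the maximum.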
KO_CLASS
classify data using network
-
usage:
-
ko_class data code
data - file containing data to classify
code - file containing network (code book)
-
description:
-
Classifies each data vector to the closest node in the network,
where closest is measured by Euclidean distance.
-
output:
-
-
stdout:
- data file vectors with network node classification
and labels, in format:
v_1 ... v_wid old_label -> [ x y ] new_label
where x,y is the position of the node in the network
-
stderr: nothing
-
example:
-
ko_class data2.dat code3.cod >data2.class # training set
ko_class data3.dat code3.cod >data3.class # unseen data
# post-processing for single letter classes
# produces a summary of classifications
cut -c44,71 data2.class >data2.sum
awk -f count.awk data2.sum | sort
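
For completeness, here is a corresponding C sketch of the classification
step: each vector goes to the node with the smallest Euclidean distance and
is printed in the output format above. The flat array layout, the 1-based
grid position in the output, and all names are assumptions for illustration,
not taken from the ko_class source.

#include <stdio.h>
#include <float.h>

/* Classify one data vector against the network N (X*Y node vectors of
   length wid) and print it in the documented line format:
       v_1 ... v_wid old_label -> [ x y ] new_label
   Illustrative only; the real ko_class may format positions differently. */
void classify_one(const double *vec, const char *old_label,
                  const double *N, char (*node_label)[32],
                  int X, int Y, int wid)
{
    int x, y, j, xc = 0, yc = 0;
    double best = DBL_MAX;

    for (x = 0; x < X; x++)                    /* nearest node search */
        for (y = 0; y < Y; y++) {
            const double *node = N + (x * Y + y) * wid;
            double d = 0.0;
            for (j = 0; j < wid; j++)
                d += (node[j] - vec[j]) * (node[j] - vec[j]);
            if (d < best) { best = d; xc = x; yc = y; }
        }

    for (j = 0; j < wid; j++)
        printf("%g ", vec[j]);
    printf("%s -> [ %d %d ] %s\n",             /* 1-based, matching   */
           old_label, xc + 1, yc + 1,          /* the [1,1]..[X,Y]    */
           node_label[xc * Y + yc]);           /* node ordering above */
}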
Downloading and compiling
The package can be downloaded as:
To compile on a UNIX platform, simply type: make all
The makefile will also work with many DOS compilers, although some modifications may be required on other non-UNIX platforms.
In particular:
- my_rand.c has a #define of the maximum return value from the C library function rand.
This is set to the normal value of MAXINT on a 32-bit
platform, but may need adjusting for your platform.
- the main functions use UNIX/DOS argc/argv command line arguments.
Some Macintosh C compilers allow you to enter command line arguments in a dialogue box;
other Mac C compilers may require you to modify the main functions.
With these exceptions the code is written in as portable a fashion as possible.
A pre-processor flag 'ANSI' controls whether the programs compile for an
old-fashioned K&R compiler or an ANSI compiler.
See the makefile to modify this.
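
As an illustration of the K&R/ANSI switch, the usual pattern looks like the
sketch below. The macro name PROTO and the declared function are
hypothetical; the package's own headers may use a different scheme.

/* Hypothetical illustration of a K&R/ANSI prototype switch driven by
   the ANSI pre-processor flag; see the package headers and makefile
   for the scheme actually used. */
#ifdef ANSI
#define PROTO(args) args      /* ANSI compiler: keep full prototypes  */
#else
#define PROTO(args) ()        /* K&R compiler: drop the argument list */
#endif

double example_fn PROTO((double lo, double hi));  /* hypothetical declaration */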
Alan Dix