Garbage collection in the TIM machine

Reference as:
Alan Dix (1988).
Garbage collection in the TIM machine.
unpublished internal report, Department of Computer Science, University of York.

This unpublished paper was originally distributed in 1988 and drew on various conversations with Dave Wakeling, who was working on TIM at that stage.

Some of the ideas in this paper were later incorporated into:
D. Wakeling and A. Dix (1993).
Optimizing Partial Applications in TIM.
YCS 215, Department of Computer Science, University of York.

That report was itself first distributed privately later in 1988 and only published as a technical report when we realised it was being referenced elsewhere.

TIM, an abstract target machine for compiling lazy functional programs, was first described in:
J. Fairbairn and S. Wray. TIM: A Simple, Lazy Abstract Machine to Execute Supercombinators. In Proceedings of the 1987 Conference on Functional Programming Languages and Computer Architecture, pages 34-45. Springer-Verlag, September 1987. LNCS 274.

Alan Dix

1. Introduction

Several techniques for improving garbage collection in the TIM machine are described below. The TIM architecture is similar to that of conventional imperative and eager functional languages, being based on functions producing frames containing references to their arguments, and thus the techniques are likely to be of use in general frame-based architectures. Three topics are discussed: collecting marks, frame stacking and heap lopping.

2. TIM architecture

It is useful to distinguish three types of frame:

  1. Data values - frames used to produce structured data types
  2. Partial applications - for higher order types, produced when insufficient arguments are present for reduction to commence
  3. Full applications - produced to hold the arguments of a function in mid-computation

The categories are not in practice distinct:

In the TIM paper, data types are described as partial applications of "distributing functions", e.g.:

pair = λ x λ y λ f   f x y

This representation has the advantage that unshared data values need not be built, which is very efficient garbage collection!

Similarly, the TIM paper does not distinguish between partial and full frames, and in fact the technique of making partial applications share frames with a full application removes any distinction whatsoever.

In addition to frames, there is an argument stack, and markers. In the TIM paper, markers are set on the stack, but they could equally live on a separate stack of their own. The objects on the stack reference frames, but not vice versa. The markers, which are present to preserve laziness, contain a frame pointer and an argument number within that frame. Lastly there is a current frame pointer.
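As a concrete picture, the machine state just described might be represented as follows. This sketch is mine, not taken from the TIM paper, and all the names (Closure, Frame, StackEntry, TimState) and bounds are invented for illustration:

```c
#include <assert.h>

typedef struct frame Frame;

/* A closure pairs a code address with the frame that is its
 * environment; an int stands in for the code address here.   */
typedef struct {
    int    code;
    Frame *frame;
} Closure;

/* A frame is a vector of closures, one per argument.         */
struct frame {
    int     nargs;
    Closure arg[8];            /* fixed bound for the sketch  */
};

/* Stack entries are argument closures or update markers; a
 * marker names a frame and the argument slot to be updated.  */
typedef enum { ARG, MARK } Tag;
typedef struct {
    Tag     tag;
    Closure clo;               /* valid when tag == ARG       */
    Frame  *mframe;            /* valid when tag == MARK      */
    int     mslot;
} StackEntry;

/* The machine state: argument stack plus current frame.      */
typedef struct {
    StackEntry stack[256];
    int        sp;             /* next free slot              */
    Frame     *current;        /* the current frame pointer   */
} TimState;
```

Here markers share the argument stack, as in the TIM paper; a separate marker stack would simply split StackEntry into two arrays.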

3. Collecting marks

As markers contain references to frames, it would seem that they need to be scanned when garbage collecting; in fact this is not necessary. The only purpose of markers is to update an argument position in a frame. If that frame is not referenced from anywhere except markers, then that argument's new value will never be accessed, so the update would be unnecessary. Thus, if at garbage collection time there are no references to a frame except from markers, the frame may be collected and the markers removed.

To achieve this end, simply do not include the markers when scanning roots; then, after garbage collection has finished, do a post-pass of the stack/marker-stack, from bottom to top, removing markers to redundant frames and compacting.

The following is a rough sketch of the post-pass algorithm. If there is a separate mark stack, then the is_mark test is omitted.

object          stack[BIG];
stack_ptr       top_of_stack, base_of_stack;

/*      top_of_stack points to the next free location above the stack */

post_pass()
{
    stack_ptr   from, to;
    object      obj;

    from = to = base_of_stack;

    while ( from < top_of_stack ) {
        obj = stack[from];
        from = from + 1;
        if ( is_mark(obj) && inactive_frame(obj) )
            /* marker to an unreferenced frame - don't copy */ ;
        else {
            stack[to] = obj;
            to = to + 1;
        }
    }
    top_of_stack = to;
}

4. Frame stacking

4.1 General principles

In standard imperative structured languages such as Pascal or C, the frame for each function is put on a stack (often the same as the argument and computation stack). When the function exits, returning its value, the frame is freed. When closures are passed about, either because of first class functions, or in TIM because of laziness, references to a function's frame may outlive the function call. This means that frames must be allocated on the heap and must be garbage collected. By doing a little static analysis, it may be possible to allocate some frames on the stack, if it can be proved that when the function returns there are no references outstanding.

In eager languages, this is not too difficult, as functional values are often passed as parameters to other functions, but are not so often returned. By detecting when this happens, the resulting frame can be put on the stack. The ML compiler does this, and in fact the default is for stacked frames. This is helped in ML by the fact that the frame is pure, that is, never updated. All updateable objects are put in the heap, and as ML encourages a purely functional style there are not too many of these.

Lazy languages pose more problems: although at the programmer's level there may still be few functional results, at the implementation level the data values may be represented lazily as closures. In addition, in TIM the frames are not pure. In fact laziness is implemented by updating the frame. It leads one to wonder whether lazy functional programmers are really closet imperative hackers. Despite this we can still identify some conditions where frame stacking is possible.

4.2 Conditions for frame stacking

Consider the moment when a function, say f, returns (defined as the moment when the stack pointer next drops below the start position). Where are the possible references to its frame? The stack below clearly has no references to it, as the frame did not exist when those values were pushed. Frames that preceded f's can only acquire references to it if these are inserted by updates from markers, by a similar argument. But any such markers must either have been on the stack already, whence by the definition of function return they have not yet been encountered, or have been produced by sub-computations of existing lazy closures. These sub-computations can, by a recursive argument, be shown to have no references to the frame, and hence when the marks update these will not put references into the old frames. A similar argument holds for all frames produced during the computation of f (i.e. not those produced by forcing of existing lazy structures). Thus the only places that can hold references to the frame of f, either directly or indirectly, are the current frame, or arguments on the stack above the original start position.

Downward condition

One special case is when the return is of a machine value: at the point just before the Self is executed, the frame pointer will have the machine value in it, and the stack will be empty above the start position; thus there are no references to the frame, and it may be freed. Hence all functions that return machine values may have their frames put on the stack.

Simple upward condition

By a similar argument, if there is any intermediate stage when there are no references anywhere, we may free the frame at that stage, rather than at function return. Typically this occurs when doing an enter where the last remaining reference is the current frame itself. The simplest case of this is when no labels are pushed, and only unshared arguments. This case is mentioned in the TIM paper as one where the frame need not even be built (the ultimate in garbage collection).

Upward condition

More awkward cases can be dealt with when we can show that all intermediate (strict) calls of functions return leaving no references to the current frame. This happens either when they return machine values, or when no references have been passed to them (as in the simple case above).

4.3 Interaction with strictness analysis

The downwards condition can easily be detected by type analysis, and may often be obvious locally, for example in:

f(x,y,z)  =  g( h(x), k(y,x) )  +  m ( f( n(x), z, y ) , z )
We can see from the "+" that f returns an integer, and so we need no knowledge of g, h, k, and m at all.
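The purely local nature of this check can be pictured as follows. The C fragment is illustrative only (the node kinds and the function name are invented): it inspects just the top node of a body's expression tree, and reports the downwards condition whenever that node is an arithmetic operator or constant, regardless of any calls beneath it:

```c
#include <assert.h>
#include <stddef.h>

/* Node kinds: arithmetic always yields a machine value; a
 * call's result type is unknown without global information.  */
typedef enum { NUM, ADD, CALL } Kind;
typedef struct expr {
    Kind kind;
    struct expr *left, *right;  /* used by ADD                */
} Expr;

/* Downwards condition: does this body certainly return a
 * machine value?  Only the top of the tree is inspected.     */
int returns_machine_value(const Expr *e)
{
    return e->kind == NUM || e->kind == ADD;
}
```

For the body of f above, the top node is the "+", so the test succeeds without examining the sub-trees for g, h, k and m.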

The upwards condition can similarly be detected by local analysis as in:

f(x,y,z)  =  g (  h(x,y) + k ( m ( f ( z, y, p(x) ) ) )  )
We can tell that h and k return machine values because of the addition. So, if g is entered after its argument has been evaluated, the frame can be freed. In order to do this, however, we must know that g is strict; otherwise we would simply push a label and enter g, leaving the label as a reference to the current frame. Similarly, if we want to satisfy the alternative upwards condition of no pushed references, we will probably want to know that the called functions are strict. That is, for upward stack framing we need reasonable strictness analysis.

4.4 Implementation

When we detect that a frame can be stacked, where can we put the code that frees the frame? There are three alternatives:

  1. Push a return address to some "tail-ender" code which does the free
  2. Push a special marker
  3. Free just before the last callout

The first option is the simplest, the code for the function would look like:

f:      take  n
        push  label  ret
        < push other args, strictly where necessary >
        enter something
ret:    free frame
        enter top_of_stack
This is only possible where the final value returned is a machine value, as the return label needs to be picked up by a Self.

The second option is similar:

f:      take  n
        < push other args, strictly where necessary >
        enter something
Where take is amended to free the topmost stacked frame when a special mark is encountered.

The third option looks something like

f:      take  n
        < push other args, strictly where necessary >
        free frame
        enter something
However, the enter could be of an unshared argument, in which case we would need either a special "enter and free" instruction, or to push it on the top of the stack before freeing the frame and then enter it.

As we've noted, the first option can be used only with machine values. Option 3 is preferable where it can be used, as it frees the frame sooner, but it cannot be used if the reason for frame stacking is a downward condition. Option 1 unfortunately means that tail recursive functions that would normally run in constant stack space may no longer do so, because of the return labels. Option 2 need not have this problem, as the special marks could be freed in a similar manner to that described for normal marks, if normal garbage collection detects that the frame is already unreferenced before the special mark is encountered.

This however raises a more fundamental point: can putting frames on a stack actually increase space use? It may well be that the frame is actually unreferenced before execution reaches the point where we can prove this (and pop it). Consider the example:

fac(n,a)  =   if  n = 0  then a
              else  fac(n-1,n*a)
Assume that we have been unable to determine the strictness of fac in its first argument (!), but do know that it is strict in its second. We have also determined that it returns an integer. With the frames on the heap, this function would never need more than two frames for fac: one for the current call, and one for the lazy value n-1 inherited from its parent. If instead we had used the downward condition and put the frame on a stack, we would have had n frames in use at the bottom of the recursion.

This is of course a contrived example; it is far more likely that the strictness of the first argument would be detected as well, whence the upward condition could have been used and the frame freed before the tail recursive call. Alternatively, if no strictness had been detected at all, the accumulating argument would have retained references to all the frames anyway, and thus the space usage would be the same (but garbage collection harder). It is of course a general feature of lazy languages that they hog space unnecessarily, and thus any additional space leakage caused by stacking is likely to be outweighed by its advantages.

It is worth noting that these problems are not so severe for frame stacking detected by upward conditions. It still may be that the frame could be found dynamically to be unreferenced before it is explicitly freed, but the space for the continuation would still be necessary, hence the order of magnitude of space use would be the same, just with a larger constant.

If, instead of using a stack and popping frames when unneeded, we still allocate them on the heap but interpret explicit freeing as adding to some sort of frame freelist, then any remaining disadvantages in terms of space usage disappear (assuming we use option 2 rather than option 1 for the downwards condition), at the cost of a slightly more complex procedure.

The exact procedure chosen would depend on the general garbage collection policy (e.g. if reference counting were being used anyway, explicit freeing would have little cost). One may even vary the policy within an implementation depending on the calculated or observed properties of the particular function.

5. Heap lopping

5.1 General principle

As we've seen for stack framing, there are certain conditions when we can prove that a function's frame is no longer referenced. Heap lopping relies on the fact that sometimes we can determine that all space allocated between two program points, not just in the function analysed, but also in all functions called by it, is unreferenced. Thus we can explicitly free large amounts of space.

The advantage over other methods of static space reclamation analysis, such as stack framing, sharing analysis or listlessness, is that its extent is non-local. Thus even if some part of an algorithm is complex and unamenable to analysis, its caller can still free its space. A good example of this is obtaining the nth prime.

nth_prime n  =  select n primes
    where primes  =  sieve (from 2)

sieve (p:ps)  =  p : sieve (filter (λm  (m mod p) ≠ 0) ps)
select n is assumed to give the nth element of the list, from 2 delivers an infinite list of numbers starting at 2, and filter pred takes a list and produces a list of all the elements satisfying condition pred.

It is in fact quite easy to determine that all the space allocated during this computation is reclaimable. But the sieve of Eratosthenes is far less easy to analyse, and we would be unlikely to be able to explicitly free space within it.

5.2 Simple conditions for an eager language

In order to get a feel for the method, we first imagine an eager functional language, with space allocated sequentially from a heap (semi-space or compacted garbage collector). Imagine any sub-computation that returns a machine value as its result. It is reasonably easy to see that in a pure eager functional language, no references to space allocated within the sub-computation will remain. Thus before any such sub-computation a note can be made of the current top-of-heap pointer, and when the computation returns, this can be restored. That is, we have lopped off the top of the heap.
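Under these assumptions (a pure eager language and sequential allocation), the whole technique amounts to saving and restoring a single pointer. A minimal C sketch, with an invented bump allocator standing in for the heap:

```c
#include <assert.h>
#include <stddef.h>

#define HEAP_SIZE 1024
static char   heap[HEAP_SIZE];
static size_t heap_top = 0;          /* next free byte          */

static void *heap_alloc(size_t n)
{
    void *p = &heap[heap_top];
    heap_top += n;                   /* no overflow check here  */
    return p;
}

/* A sub-computation returning a machine value, which happens
 * to allocate scratch cells as it goes.                        */
static int sum_squares(int n)
{
    int i, s = 0;
    for (i = 1; i <= n; i++) {
        int *cell = heap_alloc(sizeof(int));
        *cell = i * i;
        s += *cell;
    }
    return s;
}

/* Heap lopping: note the top of heap, compute, restore.        */
static int lopped_sum_squares(int n)
{
    size_t note = heap_top;          /* mark                    */
    int    r    = sum_squares(n);    /* compute                 */
    heap_top = note;                 /* lop                     */
    return r;
}
```

After lopped_sum_squares returns, the heap pointer is exactly where it was before the call, however much the sub-computation allocated.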

Both assignment and laziness complicate this picture. Assignment, because we might pass references to variables into the sub-computation which are then assigned values created within it, or (worse still) the sub-computation might assign to global storage. It would probably restrict the method too much to apply it only to sub-computations that involve no assignment, but if we disallow global assignment and then either check that no references are passed in (easy, for instance, in the ML type scheme) or that assignment is only to local variables (i.e. not parameters), then the method can still be used successfully.

5.3 Complications with lazy languages

We return to the problem of closet hackers. At the implementation level lazy languages assign far more readily and with far less dynamic discipline than the worst imperative program. In addition they have the problem that space that apparently (at the source level) is allocated outside the sub-computation of interest is actually allocated dynamically within it. This lack of referential transparency in space allocation makes it difficult for the programmer to reason about space use (c.f. accumulating fac above) and correspondingly difficult for automatic algorithms.

As we've noted, there are two problems: assignment, and space creation from sub-expressions outside the expression of interest. In fact the two are only a problem in concert. Assignment in TIM only occurs when a mark is encountered. Marks already on the stack when the sub-computation begins cannot be encountered by it; hence it is only marks produced by lazy computations passed in that may cause problems. Similarly, space allocated by lazy computations passed in will have no residual references when the sub-computation ends, unless it was shared and uses a mark to preserve this sharing.

5.4 Simple conditions

As we've seen, the only problem with heap lopping arises when shared lazy computations are passed into it. The simplest way to stop this is to allow no lazy computations in. The easiest such case to detect is when all the values passed in and out are machine values (mvin-mvout). The machine values in prevent lazy computations getting in, and the machine value out prevents references to space allocated in the sub-computation getting out. Slightly more complex is when a ground value is passed in and a machine value out (gdin-mvout), a ground value being a data value with no lazy components.

We can ask for a data value to be ground in a weaker or stronger sense. It is acceptable for the value to have within it partial applications, so long as all the arguments to these partial applications are themselves ground values. This is the weaker meaning. The stronger meaning is when no higher order values are allowed within it. This stronger meaning is not necessary for heap lopping to work as the weaker condition still means that no lazy computation has got in, but it is likely to be more difficult to detect the weaker sense and may not be worthwhile for the extra cases it finds. Further, in the TIM architecture (although not in graph reduction) it is difficult to speculatively force higher order values.

In order for these cases to arise, some strictness analysis will be necessary. The case (mvin-mvout) only requires simple strictness analysis. (Gdin-mvout) is more difficult, as it requires stronger strictness analysis, detecting when values will need to be evaluated completely: detecting, for instance, that length evaluates its argument list completely, or that, in order to return a ground value itself, map f must evaluate its argument completely if f is strict.

5.5 Full analysis

When the result of a computation is a machine value, the two conditions mvin and gdin are approximations to the general condition of no shared lazy computations in. We could in fact statically search for this general condition. This analysis is a mix of sharing and strictness analysis and could be done at all sorts of levels of abstraction. The two cases above are particular abstractions, simple and full evaluation strictness analysis. We could similarly do simple sharing analysis and apply heap lopping if we can prove that the parameters in contain no values shared with anything outside. The appropriate level of analysis will probably have to be chosen by experiment, and depends of course on what other analyses are being performed anyway. For instance, simple sharing analysis is advantageous for other reasons in the TIM compiler and its results could be used for heap lopping.

5.6 Multiple static domains

We have concentrated on proving that there are no references to space allocated within a sub-computation outstanding when it returns its result. A less stringent requirement is for there to be no references to certain classes of object. This is a particular case of multiple garbage collection domains. The classes may either be static, (i.e. an object produced at a particular program point is always in the same class) or dynamic. Dynamic classes are obviously more flexible, but are more difficult to reason about and tend to require more run time mechanism (but not necessarily). We will thus consider static classes first.

We have already considered ground values, both weak and strong, so it seems reasonable to look at the three classes of data values, partial applications and lazy values. The TIM paper does not (deliberately) distinguish these, but it is possible to distinguish any particular frame as being of one of these types, and to allocate them from different heaps.

If we have an expression that is ground in and ground out (gdin-gdout), then we know we have no references to lazy objects created since the start of its computation, and hence can lop the lazy frame heap. Similarly, if we discover an expression has a strong ground result, even if it has only weak ground inputs (gdin-Sgdout), there will be no references to lazy or partial frames, and hence both heaps can be lopped.

Note that the former condition clashes with my suggestion for sharing the first full application frame to represent partial applications, or the alternative of using the parent, as both of these strategies conflate the partial and lazy frame spaces. However, the latter condition is more easily detected anyway, so this is not a strong disincentive.

In fact the analysis for these conditions is the same as required for the second of the simple conditions for heap lopping, and so the main cost of them is not in static analysis, but in run time upkeep of two (or three) heaps.

5.7 Dynamic domains

When we considered simple conditions for heap lopping, the downward condition of a machine value out was insufficient because of the possibility of lazy parameters creating frames shared outside the computation. If we can distinguish space allocated in computations originating outside the expression of interest from that generated within, we can lop the latter on completion, thus allowing a situation where any inputs are acceptable so long as the output is a machine value (anyin-mvout). That is, we create a new domain and allocate within it.

This is not too difficult to achieve. If we assume that it is possible to detect which domain a frame belongs to, then whenever we allocate space (with a take), we allocate it to the same domain as the current frame. Before embarking on a loppable sub-computation the parent computation creates a new domain, and then sets the current frame to a "dummy" frame in that domain. (All of this after pushing its return value of course!) On return, the whole domain is freed - the current frame will have already been returned to the parent's frame pointer by the return.
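The mechanism can be sketched concretely. In the fragment below (all names - Domain, DFrame, take, free_domain - are invented for illustration), each frame carries its domain, take allocates in the current frame's domain, and freeing a domain releases everything allocated in it at once:

```c
#include <assert.h>
#include <stdlib.h>

/* Every allocation is chained to its domain, so freeing the
 * domain frees everything allocated in it at once.            */
typedef struct block { struct block *next; } Block;
typedef struct { Block *blocks; } Domain;

/* A frame tagged with the domain it belongs to.               */
typedef struct { Domain *domain; } DFrame;

static void *domain_alloc(Domain *d, size_t n)
{
    Block *b = malloc(sizeof(Block) + n);
    b->next   = d->blocks;
    d->blocks = b;
    return b + 1;                /* data area after the header  */
}

/* take: the new frame lives in the current frame's domain.    */
static DFrame *take(DFrame *current, size_t extra)
{
    DFrame *f = domain_alloc(current->domain, sizeof(DFrame) + extra);
    f->domain = current->domain;
    return f;
}

static void free_domain(Domain *d)
{
    Block *b = d->blocks, *next;
    for (; b != NULL; b = next) { next = b->next; free(b); }
    d->blocks = NULL;
}
```

Before a loppable sub-computation, the parent creates a fresh Domain and a dummy DFrame in it; on return, a single free_domain releases every frame the sub-computation took, while frames taken in older domains are untouched.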

The simple heap lopping conditions are in fact special cases of this: it is only because we know the old domains cannot allocate during the life of the computation that they can share space with the new domain.

Care has to be taken with partial applications: if implemented naively they will allocate in their domain of creation, rather than their domain of application. This is a perfectly safe strategy, but may lose some space. The problem is particularly bad if the first application in the new domain is a partial application, as in:

f x y z   =   ( y x z ) :  g x z
Even though we know that (y x z) is a machine value, if y is a partial application then the domain created will never be used. Again, it is unclear whether this is worth worrying about, there being plenty of cases where the applied function is not a partial application; but if it is considered a problem, the change to the code for partial applications is not too difficult. When a partial application is performed, it can take a quick peek at the nearest return address and allocate in that domain. (??is this right or should it be the nearest mark??) If the return addresses have a separate stack then this is simply its top. If not, and the partially applied function returns a data or machine value, then the return address is just below its arguments; if it knows it has a higher order type, then it must look deeper. This leaves functions which do not know their order, which must take the conservative strategy. This will rarely arise, because even polymorphic schemes, such as Milner's, can be resolved into monomorphic instances.

It is of course possible to mix this scheme with the static domains: for instance, if we have anyin-gdout, we can create a new domain for lazy computations to be lopped on completion, and if we have anyin-Sgdout, we can create new domains for partials and lazies (if different). Again, whether the gain is worth it is not clear.

5.8 Interaction with normal garbage collection

Even if we employ heap lopping extensively, we are likely still to need normal garbage collection. Dynamic domains don't pose too many problems with collection, as all we have to do is ensure the domain identities are preserved. This will be quite easy in most schemes, including copying. Their problem is more in the creation of new domains: we cannot have simple sequential allocation, as in simple semi-space schemes, but must instead opt for freelist or page based schemes. Many garbage collectors allocate different types of objects to different page types anyway, and when reference counts are used this should combine well with the resulting freelists, so whether this is a problem will depend very much on the allocator.

Simple heap lopping works well with sequential schemes such as semi-space: allocation is sequential, and recovery is simply changing the top-of-heap pointer. Problems do occur during collection. If the collection scheme changes the relative order of frames, then we cannot simply adjust the relevant pointers; we may even need to scrap the information completely when a garbage collection occurs.

Another non-optimal, but better, strategy is to replace all heap markers with the new top of the heap after garbage collection. This is safe, as the heap marker says that all space allocated during the lifetime of the enclosing expression is unreferenced when it returns; in particular, everything allocated from the garbage collection to the return will be unreferenced and can be lopped. Unfortunately, this fails to lop everything allocated before the garbage collection.

A third alternative that does not have this problem is to deliberately leave a "gappy" new space. On commencing garbage collection, a space twice as big as the old is created. Then each domain (in the sense here of the space between heap marks) is allocated as much space as it has in the old space, and objects are copied into their respective domains, starting at their beginnings. We would rely on the memory management to ensure that the notional space in the gaps has no cost. This looks at first glance as if the space allocated would have to double upon each garbage collection, but this is not so. If, in addition, for each domain we keep a note of how much space was actually needed, then on the next garbage collection we can compact domains that have previously been garbage collected. Obviously a scheme of this sort would fit very well with generational schemes. If we are sure that the space allocated within a computation is bounded (and not too large), we may prefer to inhibit garbage collection entirely for its duration to prevent complications!

Applying simple heap lopping to non-sequential allocation schemes has little advantage of simplicity over using full dynamic domains, and is less powerful, so the latter would be preferred.
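The arithmetic of the gappy layout is simple, and can be sketched as follows (my own formulation, not from any published scheme): each inter-mark domain keeps its old extent in the new space, its live data is copied to the front of its region, and the remainder of each region is a notionally free gap.

```c
#include <assert.h>

#define MAXDOM 16

/* extent[i]: old size of domain i; live[i]: its surviving data.
 * start[i]:  where domain i begins in the new space;
 * gap[i]:    unused (notionally free) space at its end.
 * The new space is twice the old; the domains occupy the first
 * half, the rest being free for subsequent allocation.         */
static void gappy_layout(const int extent[], const int live[],
                         int ndom, int start[], int gap[])
{
    int i, at = 0;
    for (i = 0; i < ndom; i++) {
        start[i] = at;               /* region keeps its extent */
        gap[i]   = extent[i] - live[i];
        at       = at + extent[i];
    }
}
```

On the next collection, a domain already collected once can instead be allotted only its recorded live size, which is what stops the space doubling on every collection.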

Multiple static domains have much the same properties as simple heap lopping on a single domain.

5.9 Maximal monomorphic types

So far we have only considered the classes data/partial/lazy, but it is possible to perform far richer analyses than this. In an eager language we can use the type scheme to distinguish different garbage collection domains which inherit the hierarchical structure of the types. This is not so useful for lazy languages as the laziness tangles the hierarchy. Adding lazy/evaluated attributes to each type will help a bit (essentially what full evaluation analysis must do) but will not disentangle the knot enough.

A better strategy is to first of all generate a type for every program point, (where suitable with lazy/evaluated sub-types) and then investigate their relationships. The two important relations for garbage collection are the ref (may contain a reference to) and val (may be a) relations. Consider x+y, we associate two types with each element, lazy and eager, denoted by L and E. We have:

Lx+y   ref   Lx
Lx+y   ref   Ly
Ex+y   val   E+
In addition there will be the generic rule Lz val E z as any lazy value may have been evaluated (we could even distinguish definitely lazy values if this is considered of advantage), and also transitivity of ref and val.

The ref relation may be sufficient on its own, or for data values we may want to be more specific:

Ex:xs   val   CONS Lx Lxs
We thus obtain maximal monomorphic types of each expression, maximal in the sense that they distinguish more types than any other scheme associating types with expressions.

We can use this type information to create multiple static or dynamic domains. If we show that the types of all inputs to an expression never reference (under the transitive closure of ref) an object of type t, and if the output similarly does not reference it, then we can simply lop the t heap on completion. Similarly, if we can show merely that the output does not reference t, we may create a new dynamic domain for type t.
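The transitive closure of ref needed here is a standard computation; a sketch over a boolean matrix (Warshall's algorithm), with NT standing for the number of maximal monomorphic types:

```c
#include <assert.h>

#define NT 8

/* r[i][j] != 0 means type i may contain a reference to type j
 * (the ref relation); on return, r holds its transitive
 * closure, which is what the lopping conditions consult.      */
static void ref_closure(int r[NT][NT])
{
    int i, j, k;
    for (k = 0; k < NT; k++)
        for (i = 0; i < NT; i++)
            if (r[i][k])
                for (j = 0; j < NT; j++)
                    if (r[k][j]) r[i][j] = 1;
}
```

The val relation can be closed in the same way, and the generic rule Lz val Ez is just a set of initial entries in its matrix.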

Unfortunately, the complexity of such a scheme would be great, and it would need strong reasons to prompt its use.

5.10 Self or parent lopping

When we consider an expression of the form (f x) + 3, f returns a machine value, and hence all its space and everything called by it can be lopped. This could either be built into the code for f (as in the downwards condition for stack framing) or into the code that calls f (as in the upwards conditions for stack framing). We noted when considering stack framing that tail-enders caused problems with last call optimisation. We could do little about it there except making the return address into a special marker that can be identified by the garbage collector, because the caller would not know which frame to free. (Note: if it is really a stack then it does know, so we could make a rule of caller pops, but there are still problems with tail chains.) We have more freedom here, as either caller or called can mark the heap or create new domains, so we can make a general rule that, where possible, the caller frees, thus simplifying the deallocation code. For dynamic domains, this even works when there are multiple calls with tail optimisation:

f  x    =    if  g x   then 3 else h  ( x-1 )
h  x    =    if  k x   then 2 else m  ( x-1 )
m  x    =    if  p x   then 1 else q  ( x-1 )
because it is only the result which is important. For simple lopping the input is crucial, and hence there may be cases where the caller cannot lop. For instance, in the following code:
f  x y    =    if  x = 2  then  h ( y ) else h ( x )
If h is not known to be strict, then after the test, if it is false, f can mark the heap for lopping on return; the caller could not, of course, know when this is possible. Assuming it is possible for the caller to do the lopping, the called function may still need to do the marking:
f  x    =    g ( h ( x ) )
If we know that g is strict but don't know anything about h, then f can evaluate h(x) and then mark the heap, leaving it to its caller to lop after the return from g. The caller could not mark the heap, as lazy computation in the argument x may (but may not) cause allocation.

To summarise, in "straight line code" we push the deallocation back to the first strict evaluation where this is possible. However, when there is tail recursion, we cannot follow this rule, as it would mean that deallocation would only be done when the recursion is complete. For example, fac would never deallocate until it had finished. Thus sometimes tail-enders will be necessary.
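As an illustrative aside (not from the original report), the mark/lop mechanism itself can be sketched as a bump allocator in which a mark is just the saved allocation pointer; the names Heap, mark and lop below are hypothetical:

```python
# A minimal sketch of simple heap lopping, assuming a contiguous
# bump-allocated heap.  A mark is just the saved allocation pointer,
# and lopping restores it -- often a single register-register move.
class Heap:
    def __init__(self, size):
        self.cells = [None] * size
        self.top = 0                    # allocation pointer

    def alloc(self, n):
        addr = self.top
        self.top += n                   # bump allocation
        return addr

    def mark(self):
        return self.top                 # record the current watermark

    def lop(self, m):
        self.top = m                    # free everything above the mark

heap = Heap(1024)
m = heap.mark()
heap.alloc(3)                           # frames built while evaluating a
heap.alloc(2)                           # strict, machine-value sub-expression
heap.lop(m)                             # on return, reclaim all of it at once
assert heap.top == m
```

The caller-frees rule is then simply a question of which piece of code holds m and executes the lop.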

5.11 Over-lopping

The run time cost of simple heap lopping is low, often just a register-register move, so if there is unnecessary lopping, this is probably not a problem. Once the strategy gets more complex, either multiple domains or dynamic domain creation, the costs begin to rise, and we must be careful that we don't over-lop. Consider the Fibonacci function:

fib 0        0
fib 1        1
fib n        fib n-1 + fib n-2
The code for the general case of fib n can create a new domain before each recursive call, and then lop it after. However, because the recursive calls would themselves create new domains, only one frame would be allocated in each domain. If domains were created as tagged pages, then this might mean that each frame occupies a whole page of memory! Not very good for improving space performance.

The same problem can arise in straight line code, as in the example in the section above. We can, in a similar way, push the creation of new domains as far up the call tree as possible. To make this a little more formal, we can think of one function dominating another if all calls of the second are made within a call of the former. Further, the dominance is direct if there are never any intervening recursive calls. Given these definitions, if f directly dominates g, and both f and g can create new domains for a class t (or if they can simply lop it), then the action of the latter can be suppressed. Note that this is on a per-domain basis if there is more than one class of object. Also the directness is crucial: if there are any intermediate recursive calls then g may be called many times within the lifetime of f, and hence the lopping at g is useful.

Sometimes we may want to push the lopping in the other direction. If f dominates g, and both can lop class t, but we can show that none of the intermediate calls create objects of class t, then it is usually appropriate to forgo the lopping at f. Note that here the dominance does not need to be direct. The exception to this rule is where the dominance is direct, but where there is some fan-out (i.e. several, but finitely many, calls to g). In this case it may be better to lop at f, or at the lowest function dominating all calls to g. Doing this will require fewer lops, but will leave space allocated for longer.

The above analysis is complicated by functions called from several locations; however, outside of mutually recursive sets, we can always make multiple copies of such functions (later remerging them if the code generated is the same), and many other analyses assume that this has been done. In addition, complete copies of mutually recursive sets can be made for each call into the set.

So far, however, we have not addressed the example of fib which prompted this discussion. As it is recursive, we cannot simply push the responsibility onto its caller, as this would mean that store exponential in n would be allocated, rather than proportional to n. What we can do, however, is partially unroll the recursion. This is like loop unrolling: we don't know the maximum depth of the recursion, so we cannot unroll completely, but for a mutually recursive set F = {f, g, h}, we make m+1 copies F0, F1, F2, ..., Fm, so that the calls in Fi are to the functions in Fi+1, the calls in Fm are to F1, and all external entries are into F0. This copying could be real code copying, or notional, passing around an extra "recursion depth" parameter. Having generated these copies, we would only generate lopping code in F0 and Fm. (These could be the same, but for other analyses it may be useful to distinguish the initial entry.) By varying the choice of m, the amount of dynamic space use can be balanced against the time cost of lopping to any desired level.
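As a hypothetical sketch (not in the original), the notional version of this unrolling can be simulated by threading a recursion-depth parameter through fib and marking/lopping only at depth multiples of m; the Heap class and cell counts here are invented for illustration:

```python
# Partial unrolling of fib's recursion, simulated with a recursion-depth
# parameter instead of real code copies F0..Fm.  The heap is marked on
# entry to each F0 copy and lopped on exit, so one lop frees m levels.
class Heap:
    def __init__(self):
        self.top = 0                    # allocation pointer
        self.high = 0                   # high-water mark, for interest

    def alloc(self, n):
        self.top += n
        self.high = max(self.high, self.top)

    def mark(self):
        return self.top

    def lop(self, m):
        self.top = m

def fib(n, heap, depth=0, m=4):
    at_boundary = depth % m == 0        # are we in the F0 copy?
    if at_boundary:
        saved = heap.mark()
    heap.alloc(2)                       # this call's frame (2 cells, say)
    if n < 2:
        r = n
    else:
        r = fib(n - 1, heap, depth + 1, m) + fib(n - 2, heap, depth + 1, m)
    if at_boundary:
        heap.lop(saved)                 # one lop reclaims the whole group
    return r

h = Heap()
assert fib(10, h) == 55
assert h.top == 0                       # all space reclaimed on completion
```

Larger m means fewer lops but higher residency (visible in h.high); m = 1 lops at every level, recreating the over-lopping problem.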

5.12 Relation to distributed processing

The places where heap lopping (especially simple lopping) is possible are just where the interface between a sub-expression and its context is "narrow", that is precisely where executing the sub-expression on a separate processor is likely to be advantageous. The separate domains correspond to the actual physical memory of the different processors. A useful property for distributed processing is that after the evaluation of a sub-expression on a separate processor, there are no references to memory in that processor remaining, and it can be reinitialised. Heap lopping is exactly equivalent to this reinitialisation. Simple lopping is where all allocation is done on the separate processor. Dynamic domains corresponds to the second processor calling back to the parent for evaluation of lazy values passed in.

6. Other language and implementation paradigms

The various techniques have been presented in the context of a particular programming paradigm (lazy functional) and a particular implementation technique (TIM machine). Garbage collection of marks is obviously particular to the TIM architecture. Similarly, frame stacking is not an issue in traditional imperative languages where this is standard, and in eager languages the subject is well covered. Some graph reduction techniques are environment rather than combinator based, and these could benefit from similar rules. Heap lopping is of more interest in these other paradigms.

6.1 Heap lopping in imperative and eager languages

Heap lopping was introduced in the context of pure eager functional languages, and here it has very wide scope. It can be applied to any sub-expression that has a machine value result (any-in/mv-out). That is the same condition as for dynamic domains in TIM. In fact the conditions examined for heap lopping in TIM are exactly those where the sub-expression's interface characteristics were eager. This means, for instance, that all boolean expressions can be lopped.

Just as lazy languages have problems with updating after evaluation, imperative languages have problems with explicit update. We have already considered some of the ways of dealing with this when lopping was introduced. Clearly imperative sub-expressions can be lopped when they "behave like" pure functional ones. This property is often taken advantage of for other purposes, such as code movement, and Ada ensures that functions are pure. Most imperative languages do not have implicit garbage collection at present, but this may well change.

6.2 Heap lopping and graph reduction

Focusing on simple combinator-based graph reduction, we see that the same sort of conditions that applied to TIM apply here. If at any stage there is a sub-graph which evaluates to a machine value, and any sub-graph of it which is shared outside is evaluated, then any space allocated during the evaluation of that sub-graph may be freed upon completion. Ideally the combinator that creates the sub-graph should notice this and allocate space for it in the new domain, but this may not always be possible.

The exact implementation of this differs with the reduction technique. Most reducers use a stack, in which case the technique is rather like that for TIM, with similar problems about tail-enders. Pointer-reversal reducers would need to leave a heap-popping node that would pop the heap when the spine unwinds.

Dynamic domains pose some problems in keeping track of what should be allocated where. The general rule would be that all space allocated during the evaluation of a (non-lopped) graph node would be allocated in the same domain as the node. Here it becomes essential that the sub-expression is created in the new domain, as otherwise there will not be any record of that domain for further evaluation to latch onto. This is similar to the problem with partial applications in TIM, but far more pervasive. (Interestingly, partial applications don't pose a problem.)
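A toy sketch of that rule (hypothetical names, not the original's implementation): each node records the domain it lives in, anything allocated while reducing it goes to the same domain, and lopping the domain frees the whole evaluation's residue at once:

```python
# Illustrative allocation rule for dynamic domains in a graph reducer:
# every node carries its domain, and space allocated during the
# evaluation of a node is placed in that node's own domain.
class Domain:
    def __init__(self, name):
        self.name = name
        self.nodes = []                 # everything allocated here

class Node:
    def __init__(self, domain, payload):
        self.domain = domain
        self.payload = payload
        domain.nodes.append(self)

def alloc_during(node, payload):
    # the general rule: allocate in the same domain as the node under
    # reduction, so later (lazy) evaluation can find that domain again
    return Node(node.domain, payload)

root = Domain("root")
new = Domain("loppable")                # created for a mv-result sub-graph
sub = Node(new, "sub-expression")       # crucially, built in the NEW domain
spine = alloc_during(sub, "spine node")
assert spine.domain is new
new.nodes.clear()                       # lop: the whole domain goes at once
```

If sub had instead been built in root, nothing would record that new exists, which is exactly the "no record of that domain" problem noted above.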

6.3 Prolog

Prolog implementations typically allocate space in several domains with different garbage collection strategies anyway. To achieve this they use techniques at least as powerful as stack framing. Its execution is similar in many ways to lazy languages, in that there is much non-local update, but the possibility of backtracking means that, if anything, the web of references is even more convoluted! The equivalent of simple lopping is unlikely to be possible, but with the lifetime of a call defined as running from the first call to the last successful backtrack, dynamic domains may be a possibility.

6.4 Object oriented programming

In its more static and structured forms this would have properties similar to traditional imperative programming techniques. Self-modifying environments like Smalltalk (and the same goes for most LISPs) clearly have problems, as the static analysis has to be incrementally redone after changes, and in addition this may change the properties of functions in midst calculation. The sort of algorithms that could achieve this I shudder to imagine. They seem to be getting on all right with techniques of their own, so let's leave them to it.

6.5 Concurrent evaluation

As we've seen, there is a close interplay between the conditions for heap lopping and distributed processing of functional languages. If each processor has no more than one thread of control then simple heap lopping is quite possible. If, however, two threads of control coexist within the same memory allocation space, simple lopping needs to be modified. This should not be too difficult, as lopping is only possible where there is no interaction with other expressions, in particular with concurrent expressions. Effectively each thread of control needs to have a domain of its own. This is moving towards multiple dynamic domains, and not surprisingly these pose no problems.

For imperative languages, the conditions for heap lopping again imply a certain element of non-interference between candidate expressions, so similar arguments would apply to concurrency here.

7. Conclusions

The techniques presented above differ in their simplicity, generality and power.

The collection of marks is a simple process, and there is little reason not to use it, but the gains are not too great. It reduces space usage a little, and (perhaps more importantly) the number of shared applications.

Frame stacking conditions are fairly simple, with little run-time cost, and are an obvious extension of similar work for eager languages.

Heap lopping is the most radical technique, allowing one to dream that garbage collection may become a thing of the past. Its most important feature is its non-local freeing of space, and in this respect it differs from all other static space reclamation schemes I have encountered. It is this non-locality which gives it its generality (because the enclosed computation can be arbitrarily complex) and power (because an awful lot of space may be reclaimed). Unfortunately the conditions for it are correspondingly strong, restricting its range of applicability. It may in fact be better applied to eager languages than to lazy ones.

Most of the techniques discussed assume some level of strictness and/or sharing analysis; I suppose this is part of the general rule: "if you want efficient lazy languages, make them eager".

Appendix - Partial application methods

There are various ways of implementing partial applications in TIM, which interact with the garbage collection methods above.

The TIM paper describes a method of sharing partial applications of functions by way of "interrupting" take operations. This involves creating a special partial application node when a mark is encountered, and then beginning the take again. The semantic description of restarting the take involves pushing all the partially applied arguments back onto the stack and beginning again. The same procedure is used when a partial application is applied to more arguments.

Really implementing this by pushing and then popping is unnecessary. The pushed arguments will be enterargs into the partial frame, hence the new frame can be created with the enterargs explicitly, and then the rest of the arguments taken. To achieve this it is sensible to keep in the partial frame the address of the instruction after the take, rather than its own address, for re-entry.

The partial frame is of course needed for sharing, but is a special frame the best way to achieve this?

Ax.1 Child frame sharing

If we consider the case where the partial application is not reused, it is particularly easy to see the problems with this. Two frames will be used, one for the partial application, and one for the actual application (the child frame). In addition each partially applied argument is indirected through to the partial frame. This blossoming of enterargs for sharing will lead to still more unnecessary marks... David Wakeling's figures for TIM show that enterargs take up a considerable amount of time, and hence reducing them will save this, as well as making takes faster.

An alternative that does not have these disadvantages is for the partial application to share the frame of the first full application. In TIM we can be sure that there are always enough arguments for the full application; there may just be some marks in between. When we encounter a mark, say at the ith argument, instead of creating a special frame with the first i arguments in it, we can just update the mark with a partial-i code pointer and the frame pointer of the existing frame. This is semantically identical to the TIM paper proposal, but saves creating an extra frame and the associated indirections.

Of course subsequent applications of this partial-i object will have to produce indirections to it, but this is at least no worse than the original proposal. Another worry might be space leaks, but if the scheme of masks proposed in the TIM paper is used, the partial applications can be given appropriate masks, and hence the additional unshared arguments can be garbage collected when appropriate.

This scheme is particularly advantageous with a separate mark stack, as then the checking for marks, and the copying of arguments from the stack are independent, and thus the copying can be achieved with efficient memory move instructions.
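A small sketch of this take (invented names; the original is an abstract machine instruction, not Python) shows the idea: a mark met after i arguments is updated in place to a partial-i pointer plus the frame under construction, rather than triggering a separate partial frame:

```python
# Child frame sharing during a take of n arguments: marks found on the
# stack are updated with a "partial-i" code pointer and the frame being
# built, instead of allocating a separate partial-application frame.
MARK = object()                          # sentinel standing for a TIM mark

def take(n, stack, updates):
    frame = []
    while len(frame) < n:
        item = stack.pop()               # top of stack is the list's end
        if item is MARK:
            # update the shared node in place: (partial-i, this frame)
            updates.append(("partial-%d" % len(frame), frame))
        else:
            frame.append(item)
    return frame

stack = ["a3", "a2", MARK, "a1", "a0"]   # a mark sits after two arguments
updates = []
frame = take(4, stack, updates)
assert frame == ["a0", "a1", "a2", "a3"]
assert updates == [("partial-2", frame)] # the mark now shares this frame
```

With a separate mark stack the else branch disappears and the argument copy becomes one block move, as noted above.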

Ax.2 Parent frame sharing

The scenario presented earlier would have seemed even worse if we had considered several cascading partial applications of more and more arguments, and the resulting chains of enterargs. In fact this never happens if (as is assumed in TIM) the functions are super-combinators. Super-combinators are usually only curried at one place (although there may be several marks there), usually the last-but-one position. This is because the early arguments are produced by a parent super-combinator which supplies all the arguments until a new user-level argument is needed. As the parent knows where the currying can take place, it seems reasonable to let the parent worry about sharing. (A similar technique is to let the parent do the take and thus allow more efficient passing of arguments between the two, but the two techniques are independent.)

To achieve this, whenever a super-combinator is going to be called with too few arguments, before the arguments are pushed a special share instruction is executed, which updates any marks at the very top of the stack with the label of the share instruction and the current frame.

/*    code for  f x ( g y) 3  - where f needs more than 3 arguments  */
s_lab:  share      s_lab
        push  int  3
        push  lab  gy           /*  code for g y     */
        push  arg  2            /*  x is second arg  */
        enter comb f
/*    code for f  */
f:      take_unshared 4
        << code for f >>
The take_unshared instruction is a simple memory move from the stack, with no checking of marks.

Again one might be concerned about space leaks, but again the masks will take care of this.

Ax.3 Interaction with garbage collection techniques

The most obvious thing that both these techniques do is to destroy the distinction between partial application frames and normal frames. This is only important if we want to selectively garbage collect one domain or the other, as was suggested with multiple static domains. More important is that, when considering frame stacking and heap lopping, we must be aware that the current frame may be shared as a partial application.

For child sharing, this means that if it is possible that a frame is partially applied, then we cannot build it on the stack. This would tend to make one draw back from this technique if considering frame stacking. Alternatively, two sets of code could be generated for each function: one for the first shared application and one for all others. This would have no run-time cost; it would just mean extra code. As with all code copying techniques, if no optimisation is possible, the code can be remerged later. Heap lopping is not such a problem: if the frame may be shared, one merely marks the heap just above it, rather than just below. If it is the caller who is marking anyway, there is no problem.

Parent sharing means that one cannot consider a frame for stacking if a share instruction may update a frame that outlives the current call. Considering upward conditions, it doesn't matter for shares passed in to sub-expressions returning machine values, as no references will escape anyway, and if no labels are passed to a sub-expression then no shares can be encountered there. The only remaining problem is if the result of the computation is a function, and there is a share to achieve this, as in:

f x    if  x = 3   then  ( + 1 )   else  λ y  x + y
Such cases can either be rejected, or the share could branch on success (as with shared takes above). Heap lopping again is not so much of a problem. If the share instruction is encountered at embedded call levels, then the update will be to frames that will be lopped anyway; the only problem will be with a share instruction at the bottom level, that is, if the actual function result is being shared. But as the conditions for heap lopping did not allow higher order results, this will not be a problem. Downward conditions for frame stacking don't have any problems, for similar reasons.

Ax.4 Summary

Both the above techniques have been successfully used in an interpreted version of the TIM machine. My preference lies with parent frame sharing, as the sharing is made more explicit and is thus easier to understand informally. As we have seen, it is also less likely to cause problems with frame stacking and heap lopping.

Alan Dix, 8/1/2002