This unpublished paper was originally distributed in 1988 and drew on various conversations with Dave Wakeling who was working on TIM at that stage.
That report was itself first distributed privately later in 1988 and only published as a technical report when we realised it was being referenced elsewhere.
Alan Dix
Several techniques for improving garbage collection in the TIM machine are described below. The TIM architecture is similar to that of conventional imperative and eager functional language implementations, being based on functions producing frames containing references to their arguments, and thus the techniques are likely to be of use in general frame-based architectures. Three topics are discussed: the garbage collection of marks, frame stacking, and heap lopping.
It is useful to distinguish three types of frame: those representing data values, those representing partial applications, and those representing lazy (as yet unevaluated) values.
The categories are not in practice distinct:
In the TIM paper, data types are described as partial applications of "distributing functions", e.g.:
pair = λ x λ y λ f f x y
This representation has the advantage that unshared data values need not be built at all; that is very efficient garbage collection!
Similarly, they do not distinguish between partial and full frames, and in fact the technique of making partial applications share frames with a full application removes any distinction whatsoever.
In addition to frames, there is an argument stack, and markers. In the TIM paper, markers are set on the stack, but they could also be in a separate stack of their own. The objects on the stack reference frames, but not vice versa. The markers, which are present to preserve laziness, contain a frame pointer and an argument number within that frame. Lastly there is a current frame pointer.
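As a minimal sketch of these run-time objects (in C, with hypothetical names and layouts; the real representation would of course depend on the implementation):

    /* A closure is a code pointer paired with a frame pointer (hypothetical layout). */
    typedef struct frame frame;
    typedef struct { void *code; frame *frm; } closure;

    /* A frame holds the argument closures of one function application. */
    struct frame { int nargs; closure arg[1]; };

    /* A marker records which argument of which frame should be updated
       when the value it is waiting for has been computed. */
    typedef struct { frame *frm; int argno; } marker;

    frame *current_frame;    /* the current frame pointer */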
As markers contain references to frames, it would seem that they need to be scanned when garbage collecting; in fact this is not necessary. The only purpose of a marker is to update an argument position in a frame. If that frame is not referenced from anywhere except markers, then that argument's new value will never be accessed, so the update would be unnecessary. Thus, if at garbage collection time there are no references to a frame except from markers, the frame may be collected and the markers removed.
To achieve this end, simply do not include the markers when scanning roots; then, after garbage collection has finished, do a post-pass of the stack (or marker stack), from bottom to top, removing markers that refer to redundant frames, and compacting.
The following is a rough sketch of the post-pass algorithm. If there is a separate mark stack, then the is_mark test is omitted.
    object stack[BIG];
    stack_ptr top_of_stack, base_of_stack;
    /* top_of_stack points to the next free location above the stack */

    sweep_marks()
    {
        stack_ptr from, to;
        object obj;

        from = to = base_of_stack;
        while ( from < top_of_stack ) {
            obj = stack[from];
            from = from + 1;
            if ( is_mark(obj) && inactive_frame(obj) )
                ;                        /* marker to a dead frame: don't copy */
            else {
                stack[to] = obj;
                to = to + 1;
            }
        }
        top_of_stack = to;
    }
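The two tests used above are left undefined; the following is one possible (hypothetical) reading, assuming stack entries are small tagged records as in the earlier sketch, and that the collector proper sets a mark bit in every frame it reached from the real roots:

    /* Hypothetical definitions of the predicates used by sweep_marks. */
    int is_mark(object obj)
    {
        return obj.tag == MARKER_TAG;        /* stack entry is an update marker */
    }

    int inactive_frame(object obj)
    {
        /* The marker's target frame was not reached from any real root,
           so the update it would perform could never be observed. */
        return !obj.mrk.frm->gc_marked;
    }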
In standard imperative structured languages such as Pascal or C, the frame for each function is put on a stack (often the same as the argument and computation stack). When the function exits, returning its value, the frame is freed. When closures are passed about, either because of first class functions, or in TIM because of laziness, references to a function's frame may outlive the function call. This means that frames must be allocated on the heap and must be garbage collected. By doing a little static analysis, it may be possible to allocate some frames on the stack, if it can be proved that when the function returns there are no references outstanding.
In eager languages, this is not too difficult, as functional values are often passed as parameters to other functions, but are not so often returned. By detecting when this is the case, the resulting frame can be put on the stack. The ML compiler does this, and in fact the default is for stacked frames. This is helped in ML by the fact that the frame is pure, that is, never updated. All updateable objects are put in the heap, and as ML encourages a purely functional style there are not too many of these.
Lazy languages pose more problems: although at the programmer's level there may still be few functional results, at the implementation level the data values may be represented lazily as closures. In addition, in TIM the frames are not pure; in fact laziness is implemented by updating the frame. It leads one to wonder whether lazy functional programmers are really closet imperative hackers. Despite this we can still identify some conditions where frame stacking is possible.
Consider the moment when a function, say f, returns (defined as the moment when the stack pointer next drops below the start position). Where are the possible references to its frame? The stack below clearly has no references to it, as the frame did not exist when the values were pushed.
Frames that preceded f can only have references to it if these are inserted by updates from markers, by a similar argument. But any such markers must either have been on the stack already, whence by the definition of function return they have not yet been encountered, or have been produced by sub-computations of existing lazy closures. These sub-computations can by a recursive argument be shown to have no references to the frame, and hence when the marks update, these will not put references into the old frames.
A similar argument holds for all frames produced during the computation of f (i.e. not those produced by forcing of existing lazy structures). Thus the only places that can have references to the frame of f, either directly or indirectly, are the current frame, or arguments on the stack above the original start position.
Downward condition
One special case is when the return is of a machine value: at the point just before the Self is executed, the frame pointer will have the machine value in it, and the stack will be empty above the start position, thus there are no references to it, and the frame may be freed. Thus all functions that return machine values may have their frames put on the stack.
Simple upward condition
By a similar argument, if there is any intermediate stage when there are no references anywhere, we may free the frame at that stage, or at function return. Typically this occurs when doing an enter when the last remaining reference is the current frame itself. The simplest case of this is when no labels are pushed, and only unshared arguments. This case is mentioned in the TIM paper as a case where the frame need not even be built (the ultimate in garbage collection).
Upward condition
More awkward cases can be dealt with when we can show that all intermediate (strict) calls of functions return leaving no references to the current frame. This happens either when they return machine values, or when no references have been passed to them (as in the simple case above).
The downwards condition can easily be detected by type analysis, and may often be obvious locally, for example in:
    f(x,y,z) = g( h(x), k(y,x) ) + m( f( n(x), z, y ), z )

We can see from the "+" that f returns an integer, and so we need no knowledge of g, h, k and m at all.
The upwards condition can similarly be detected by local analysis as in:
    f(x,y,z) = g( h(x,y) + k( m( f( z, y, p(x) ) ) ) )

We can tell that h and k return machine values because of the addition.
So, if g is entered after its argument has been evaluated, the frame can be freed. In order to do this, however, we must know that g is strict; otherwise we would simply push a label and enter g, leaving the label as a reference to the current frame.
Similarly, if we want to satisfy the alternative upwards condition of no pushed references, we will probably want to know that the called functions are strict. That is, for upward stack framing we need reasonable strictness analysis.
When we detect that a frame can be stacked, where can we put the code that frees the frame? There are three alternatives: at a return label pushed by the function itself, at a special mark recognised by a later take, or explicitly just before the final enter.
The first option is the simplest; the code for the function would look like:

    f:    take n
          push label ret
          < push other args, strictly where necessary >
          enter something
    ret:  free frame
          enter top_of_stack

This is only possible where the final value returned is a data value, as the return label needs to be picked up by a Self.
The second option is similar:
    f:    take n
          push_special_mark
          < push other args, strictly where necessary >
          enter something

where take is amended to free the topmost stacked frame when a special mark is encountered.
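As a rough illustration of this second option, the following is a hedged C sketch of such an amended take; the types object, frame, closure and current_frame are as in the earlier sketches, and the helper routines (alloc_frame, pop_stack, as_closure, pop_stacked_frame, is_special_mark) are hypothetical names, not taken from the TIM paper. Normal (laziness) update marks are handled as in the TIM paper and omitted here.

    extern frame *alloc_frame(int nargs);
    extern object pop_stack(void);
    extern closure as_closure(object obj);
    extern void pop_stacked_frame(void);     /* free the topmost stacked frame */
    extern int is_special_mark(object obj);

    frame *take_with_free(int n)
    {
        frame *f = alloc_frame(n);
        int i = 0;
        while (i < n) {
            object obj = pop_stack();
            if (is_special_mark(obj))
                pop_stacked_frame();         /* a finished caller's frame dies here;
                                                the mark is consumed, not stored    */
            else {
                f->arg[i] = as_closure(obj);
                i = i + 1;
            }
        }
        current_frame = f;
        return f;
    }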
The third option looks something like:

    f:    take n
          < push other args, strictly where necessary >
          free frame
          enter something

However, the enter could be of an unshared argument, in which case we would need either a special "enter and free" instruction, or to push it on the top of the stack before freeing the frame and then enter it.
As we've noted, the first option can be used only with machine values. Option 3 is preferable where it can be used, as it frees the frame sooner, but cannot be used if the reason for frame stacking is a downward condition. Option 1 unfortunately means that tail recursive functions that would normally run in constant stack space may not be able to because of the return labels. Option 2 need not have this problem as the special marks could be freed in a similar manner to that described for normal marks if normal garbage collection detects that the frame is already unreferenced before the special mark is encountered.
This however raises a more fundamental point: can putting frames on a stack actually increase space use? It may well be that the frame is actually unreferenced before execution reaches the point where we can prove this (and pop it). Consider the example:
    fac(n,a) = if n = 0 then a else fac(n-1, n*a)

Assume that we have been unable to determine the strictness of fac (!) in its first argument, but do know that it is strict in its second! We have also determined that it returns an integer. With the frames on the heap, this function would never need more than two frames for fac: one for the current call, and one for the lazy value n-1 inherited from its parent. If instead we had used the downward condition and put the frame on a stack, we would have had n frames in use at the bottom of the recursion.
This is of course a contrived example; it is far more likely that the strictness of the first argument would be detected as well, whence the upward condition could have been used and the frame freed before the tail recursive call. Alternatively, if no strictness had been detected at all, the accumulating argument would have retained references to all the frames anyway and thus the space usage would be the same (but garbage collection harder). It is of course a general feature of lazy languages that they hog space unnecessarily, and thus additional space leakage caused by stacking is likely to be outweighed by its advantages.
It is worth noting that these problems are not so severe for frame stacking detected by upward conditions. It still may be that the frame could be found dynamically to be unreferenced before it is explicitly freed, but the space for the continuation would still be necessary, hence the order of magnitude of space use would be the same, just with a larger constant.
If, instead of using a stack and popping frames when unneeded, we still allocate them on the heap but interpret explicit freeing as adding to some sort of frame freelist, then any remaining disadvantages in terms of space usage disappear (assuming we use option 2 rather than option 1 for the downwards condition), but at the cost of a slightly more complex procedure.
The exact procedure chosen would depend on the general garbage collection policy (e.g. if reference counting were being used anyway, explicit freeing would have little cost). One may even vary the policy within an implementation depending on the calculated or observed properties of the particular function.
As we've seen for stack framing, there are certain conditions when we can prove that a function's frame is no longer referenced. Heap lopping relies on the fact that sometimes we can determine that all space allocated between two program points, not just in the function analysed, but also in all functions called by it, is unreferenced. Thus we can explicitly free large amounts of space.
The advantage over other methods of static space reclamation analysis, such as stack framing, sharing analysis or listlessness, is that its extent is non-local. Thus even if some part of an algorithm is complex and unamenable to analysis, its caller can still free its space. A good example of this is obtaining the nth prime.
    nth_prime n  = select n primes
                   where
                   primes       = sieve ( from 2 )
                   sieve (p:ps) = p : sieve ( filter (λm (m mod p) ≠ 0) ps )

select n is assumed to give the nth element from the list, from 2 delivers an infinite list of numbers starting at 2, and filter pred takes a list and produces a list of all the elements satisfying condition pred.
It is in fact quite easy to determine that all the space allocated during this computation is reclaimable. But the sieve of Eratosthenes is far less easy to analyse, and we would be unlikely to be able to explicitly free space within it.
In order to get a feel for the method, we first imagine an eager functional language, with space allocated sequentially from a heap (semi-space or compacting garbage collector). Imagine any sub-computation that returns a machine value as its result. It is reasonably easy to see that in a pure eager functional language, no references to space allocated within the sub-computation will remain. Thus before any such sub-computation a note can be made of the current top-of-heap pointer, and when the computation returns, this can be restored. That is, we have lopped off the top of the heap.
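In this simple eager setting the mechanism is just a save and restore of the allocation pointer. A minimal C sketch, assuming a hypothetical sequentially allocated heap whose next free location is held in heap_top:

    extern char *heap_top;                 /* next free heap location (hypothetical) */

    /* Run a sub-computation known to return a machine value, then lop the
       heap: nothing allocated inside it can still be referenced. */
    long lopped_call(long (*subcomp)(long), long arg)
    {
        char *mark = heap_top;             /* note the current top of heap */
        long result = subcomp(arg);        /* machine-value result */
        heap_top = mark;                   /* restore: lop off the top of the heap */
        return result;
    }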
Both assignment and laziness complicate this picture. Assignment, because we might pass references to variables into the sub-computation which are then assigned values created within it, or (worse still) the sub-computation might allocate to global storage. It would probably restrict the method too much to apply it only to sub-computations that involved no assignment, but if we disallow global assignment and then either check that no references are passed in (easy for instance in the ML type scheme) or that assignment is only to local variables (i.e. not parameters) then the method can still be successfully used.
We return to the problem of closet hackers. At the implementation level lazy languages assign far more readily and with far less dynamic discipline than the worst imperative program. In addition they have the problem that space that apparently (at the source level) is allocated outside the sub-computation of interest is actually allocated dynamically within it. This lack of referential transparency in space allocation makes it difficult for the programmer to reason about space use (cf. the accumulating fac above) and correspondingly difficult for automatic algorithms.
As we've noted, there are two problems: assignment and space creation from sub-expressions outside the expression of interest. In fact the two are only a problem in concert. Assignment in TIM only occurs when a mark is encountered. Marks already on the stack when the sub-computation begins cannot be encountered by it, hence it is only marks produced by lazy computations passed in that may cause problems. Similarly space allocated by lazy computations passed in will have no residual references when the sub-computation ends unless it was shared and uses a mark to preserve this sharing.
As we've seen, the only problem with heap lopping is when shared lazy computations are passed into it. The simplest way to stop this is to allow no lazy computations in. The easiest such case to detect is when all the values passed in and out are machine values (mvin-mvout). The machine values in prevent lazy computations getting in, and the machine value out prevents references to space allocated in the sub-computation getting out. Slightly more complex is when a ground value is passed in and a machine value out (gdin-mvout), a ground value being a data value with no lazy components.
We can ask for a data value to be ground in a weaker or stronger sense. It is acceptable for the value to have within it partial applications, so long as all the arguments to these partial applications are themselves ground values. This is the weaker meaning. The stronger meaning is when no higher order values are allowed within it. This stronger meaning is not necessary for heap lopping to work as the weaker condition still means that no lazy computation has got in, but it is likely to be more difficult to detect the weaker sense and may not be worthwhile for the extra cases it finds. Further, in the TIM architecture (although not in graph reduction) it is difficult to speculatively force higher order values.
In order for these cases to arise some strictness analysis will be necessary. The case (mvin-mvout) only requires simple strictness analysis. (Gdin-mvout) is more difficult as it requires stronger strictness analysis, detecting when values will need to be evaluated completely: detecting, for instance, that length evaluates its argument list completely, or that in order to return a ground value itself, map f must evaluate its argument completely if f is strict.
When the result of a computation is a machine value, the two conditions mvin and gdin are approximations to the general condition of no shared lazy computations in. We could in fact statically search for this general condition. This analysis is a mix of sharing and strictness analysis and could be done at all sorts of levels of abstraction. The two cases above are particular abstractions, simple and full evaluation strictness analysis. We could similarly do simple sharing analysis and apply heap lopping if we can prove that the parameters in contain no values shared with anything outside. The appropriate level of analysis will probably have to be chosen by experiment, and depends of course on what other analyses are being performed anyway. For instance, simple sharing analysis is advantageous for other reasons in the TIM compiler and its results could be used for heap lopping.
We have concentrated on proving that there are no references to space allocated within a sub-computation outstanding when it returns its result. A less stringent requirement is for there to be no references to certain classes of object. This is a particular case of multiple garbage collection domains. The classes may either be static (i.e. an object produced at a particular program point is always in the same class) or dynamic. Dynamic classes are obviously more flexible, but are more difficult to reason about and tend to require more run time mechanism (though not necessarily). We will thus consider static classes first.
We have already considered ground values, both weak and strong, so it seems reasonable to look at the three classes of data values, partial applications and lazy values. The TIM paper does not (deliberately) distinguish these, but it is possible to distinguish any particular frame as being of one of these types, and allocate them from different heaps.
If we have an expression that is ground in and ground out (gdin-gdout) then we know there are no references to lazy objects created since the start of its computation, and hence we can lop the lazy frame heap. Similarly, if we discover an expression has a strong ground result, even if it has only weak ground inputs (gdin-Sgdout), there will be no references to lazy or partial frames and hence both heaps can be lopped.
Note that the former condition clashes with my suggestion for sharing the first full application frame to represent partial applications, or the alternative of using the parent, as both of these strategies conflate the partial and lazy frame spaces. However the latter condition is more easily detected anyway, so this is not a strong disincentive.
In fact the analysis for these conditions is the same as required for the second of the simple conditions for heap lopping, and so the main cost of them is not in static analysis, but in run time upkeep of two (or three) heaps.
When we considered simple conditions for heap lopping, the downward condition of a machine value out was insufficient because of the possibility of lazy parameters creating frames shared outside the computation. If we can distinguish space allocated in computations originating outside the expression of interest from that allocated within, we can lop the latter on completion, thus allowing a situation where any inputs are acceptable so long as the output is a machine value (anyin-mvout). That is, we create a new domain and allocate within it.
This is not too difficult to achieve. If we assume that it is possible to detect which domain a frame belongs to, then whenever we allocate space (with a take), we allocate it to the same domain as the current frame. Before embarking on a loppable sub-computation the parent computation creates a new domain, and then sets the current frame to a "dummy" frame in that domain. (All of this after pushing its return label of course!) On return, the whole domain is freed; the current frame will have already been returned to the parent's frame pointer by the return.
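A minimal C sketch of such dynamic domains, assuming a hypothetical allocator in which each domain is a single contiguous region and each frame is allocated out of its creator's current domain (the names new_domain, alloc_in and free_domain are invented here, not taken from the TIM paper):

    #include <stdlib.h>

    #define DOMAIN_SIZE 65536              /* arbitrary region size for the sketch */

    typedef struct { char *base, *top, *limit; } domain;

    domain *current_domain;                /* domain of the current frame */

    domain *new_domain(void)               /* created by the caller of a loppable
                                              sub-computation */
    {
        domain *d = malloc(sizeof(domain));
        d->base = d->top = malloc(DOMAIN_SIZE);
        d->limit = d->base + DOMAIN_SIZE;
        return d;
    }

    void *alloc_in(domain *d, size_t size) /* a take allocates its frame here;
                                              no overflow handling in the sketch */
    {
        void *p = d->top;
        d->top = d->top + size;
        return p;
    }

    void free_domain(domain *d)            /* "lop": free everything allocated in
                                              the domain when the computation returns */
    {
        free(d->base);
        free(d);
    }

The parent would set current_domain (and the current frame) to the new domain before entering the sub-computation, and restore it and call free_domain on return.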
The simple heap lopping conditions are in fact special cases of this: only because we know the old domains cannot allocate during the life of the computation can the new domain share space with them.
Care has to be taken with partial applications: if implemented naively they will allocate in their domain of creation, rather than their domain of application. This is a perfectly safe strategy, but may lose some space. The problem is particularly bad if the first application in the new domain is a partial application, as in:
    f x y z = ( y x z ) : g x z

Even though we know that (y x z) is a machine value, if y is a partial application then the domain created will never be used.
Again, it is unclear whether this is worth worrying about, there being plenty of cases where the applied function is not a partial application; but if it is considered a problem, the change to the code for partial applications is not too difficult. When a partial application is performed, it can take a quick peek at the nearest return address and allocate in that domain.
(??is this right or should it be the nearest mark??)
If the return addresses have a separate stack then this is simply its top. If not, and the partially applied function returns a data or machine value, then the return address is just below its arguments; if it knows it has a higher order type, then it must look deeper. This leaves functions which do not know their order, which must take the conservative strategy. This will rarely arise, because even polymorphic schemes, such as Milner's, can be resolved into monomorphic instances.
It is of course possible to mix this scheme with the static domains: for instance if we have anyin-gdout, we can create a new domain for lazy computations to be lopped on completion, and if we have anyin-Sgdout, we can create new domains for partials and lazies (if different). Again whether the gain is worth it is not clear.
Even if we employ heap lopping extensively we are likely to still need normal garbage collection. Dynamic domains don't pose too many problems with collection, as all we have to do is ensure the domain identities are preserved. This will be quite easy in most schemes, including copying. Their problem is more in the creation of new domains: we cannot have simple sequential allocation, as in simple semi-space schemes, but must instead opt for freelist or page based schemes. Many garbage collectors allocate different types of objects to different page types anyway, and when reference counts are used this should combine well with the resulting freelists, so whether this is a problem will depend very much on the allocator.
Simple heap lopping works well with sequential schemes such as semi-space: allocation is sequential and recovery is simply changing the top-of-heap pointer. Problems do occur during collection. If the collection scheme changes the relative order of frames, then we cannot simply adjust the relevant pointers; we may even need to scrap the information completely when a garbage collection occurs. Another non-optimal, but better, strategy is to replace all heap markers with the new top of the heap after garbage collection. This is safe, as the heap marker says that all space allocated during the lifetime of the enclosing expression is unreferenced when it returns; in particular everything allocated from the garbage collection to the return will be unreferenced and can be lopped. Unfortunately, this fails to lop everything allocated before the garbage collection.

A third alternative that does not have this problem is to deliberately leave a "gappy" new space. On commencing garbage collection, a space twice as big as the old is created. Then each domain (in the sense here of between heap marks) is allocated as much space as it has in the old space, and objects are copied into their respective domains, starting at their beginnings. We would rely on the memory management to mean that the notional space in the gaps would have no cost. This looks at first glance as if the space allocated would have to double upon each garbage collection, but this is not so: if, in addition, for each domain we keep a note of how much space was actually needed, then on the next garbage collection we can compact domains that have been previously garbage collected. Obviously a scheme of this sort would fit very well with generational schemes.

If we are sure that the space allocated within a computation is bounded (and not too large), we may prefer to inhibit garbage collection entirely for its duration to prevent complications! Applying simple heap lopping to non-sequential allocation schemes has little advantage of simplicity over using full dynamic domains, and is less powerful, so the latter would be preferred.
Multiple static domains have much the same properties as simple heap lopping on a single domain.
So far we have only considered the classes data/partial/lazy, but it is possible to perform far richer analyses than this. In an eager language we can use the type scheme to distinguish different garbage collection domains which inherit the hierarchical structure of the types. This is not so useful for lazy languages as the laziness tangles the hierarchy. Adding lazy/evaluated attributes to each type will help a bit (essentially what full evaluation analysis must do) but will not disentangle the knot enough.
A better strategy is first of all to generate a type for every program point (where suitable, with lazy/evaluated sub-types) and then investigate their relationships. The two important relations for garbage collection are the ref (may contain a reference to) and val (may be a) relations.
Consider x+y: we associate two types with each element, lazy and eager, denoted by L and E. We have:

    Lx+y   ref   Lx
    Lx+y   ref   Ly
    Ex+y   val   E+

In addition there will be the generic rule

    Lz   val   Ez

as any lazy value may have been evaluated (we could even distinguish definitely lazy values if this is considered of advantage), and also transitivity of ref and val.
The ref relation may be sufficient on its own, or for data values we may want to be more specific:
    Ex:xs   val   CONS Lx Lxs

We thus obtain maximal monomorphic types of each expression, maximal in the sense that they distinguish more types than any other scheme associating types with expressions.
We can use this type information to create multiple static or dynamic domains. If we show that the types of all inputs to an expression never reference (transitive closure of ref) an object of type t, and if the output similarly does not reference it, then we can simply lop the t heap on completion. Similarly, if we can show merely that the output does not reference t, we may create a new dynamic domain for type t.
Unfortunately, the complexity of such a scheme would be great, and it would need strong reasons to prompt its use.
When we consider an expression of the form (f x) + 3, then f returns a machine value and hence all its space and everything called by it can be lopped. This could either be built into the code for f (as in the downwards condition for stack framing) or into the code that calls f (as in the upwards conditions for stack framing).
We noted when considering stack framing that tail-enders caused problems with last call optimisation. We could do little about it there except making the return address into a special marker that can be identified by the garbage collector, because the caller would not know which frame to free. (Note: if it's really a stack then it does, so we could make a rule of caller pops, but there are still problems with tail chains.) We have more freedom here, as either caller or called can mark the heap or create new domains, so we can make a general rule that where possible the caller frees, thus simplifying the deallocation code.
For dynamic domains, this even works when there are multiple calls with tail optimisation:

    f x = if g x then 3 else h ( x-1 )
    h x = if k x then 2 else m ( x-1 )
    m x = if p x then 1 else q ( x-1 )

because it is only the result which is important. For simple lopping the input is crucial, and hence there may be cases where the caller cannot lop. For instance, in the following code:

    f x y = if x = 2 then h ( y ) else h ( x )

if h is not known to be strict, then after the test, if it is false, f can mark the heap for lopping on return; the caller could not of course know when this is possible.
Assuming it is possible for the caller to do the lopping, the called function may still need to do the marking. Consider:

    f x = g ( h ( x ) )

If we know that g is strict but don't know anything about h, then f can evaluate h(x) and then mark the heap, leaving it to its caller to lop after the return from g. The caller could not mark the heap, as lazy computation in the argument x may (but may not) cause allocation.
To summarise, in "straight line code" we push the deallocation back to the first strict evaluation where this is possible. However, when there is tail recursion we cannot follow this, as it would mean that deallocation would only be done when the recursion is complete. For example, fac would never deallocate until it was complete. Thus sometimes tail-enders will be necessary.
The run time cost of simple heap lopping is low, often just a register-register move, so if there is unnecessary lopping, this is probably not a problem. Once the strategy gets more complex, either multiple domains or dynamic domain creation, the costs begin to rise, and we must be careful that we don't over-lop. Consider the Fibonacci function:
    fib 0 = 0
    fib 1 = 1
    fib n = fib (n-1) + fib (n-2)

The code for the general case of fib n can create a new domain before each recursive call, and then lop it after. However, because the recursive calls would themselves create new domains, only one frame would be allocated in each domain. If domains were created as tagged pages, then this might mean that each frame occupies a whole page of memory! Not very good for improving space performance.
The same problem can arise in straight line code, as in the example in the section above. We can in a similar way push the creation of new domains as far up the call tree as possible.
To make this a little more formal, we can think of one function dominating another if all calls of the second are made within a call of the former. Further, the dominance is direct if there are never any intervening recursive calls. Given these definitions, if f directly dominates g, and both f and g can create new domains for a class t (or if they can simply lop it), then the action of the latter can be suppressed. Note that this is on a per-domain basis, if there is more than one class of object. Also the directness is crucial: if there are any intermediate recursive calls then g may be called many times within the lifetime of f, and hence the lopping at g is useful.
Sometimes we may want to push the lopping in the other direction. If f dominates g, and both can lop class t, but we can show that none of the intermediate calls create objects of class t, then it is usually appropriate to forgo the lopping at f. Note that here the dominance does not need to be direct. The exception to this rule is where the dominance is direct, but where there is some fan-out (i.e. several, but finite, calls to g). In this case it may be better to lop at f, or at the lowest function dominating all calls to g. Doing this will require fewer lops, but will leave space allocated longer.
The above analysis is complicated by functions called from several locations; however, outside of mutually recursive sets, we can always make multiple copies of such functions (later remerging them if the code generated is the same), and many other analyses assume that this has been done. In addition, complete copies of mutually recursive sets can be made for each call in.
So far, however, we have not addressed the example of fib which prompted this discussion: as it is recursive, we cannot simply push the responsibility onto its caller, as this would mean store exponential in n would be allocated, rather than proportional to n. What we can do however is partially unroll the recursion. This is like loop unrolling: we don't know the maximum depth of the recursion, so we cannot unroll completely, but for a mutually recursive set F = {f, g, h}, we make m+1 copies F0, F1, F2, ..., Fm, so that the calls in Fi are to the functions in Fi+1, the calls in Fm are to F1, and all external entries are into F0. This copying could be real code copying, or notional, passing around an extra "recursion depth" parameter. Having generated these copies, we would only generate lopping code in F0 and Fm. (These could be the same, but for other analyses it may be useful to distinguish the initial entry.) By varying the choice of m, the amount of dynamic space use as compared to the time cost of lopping can be balanced to any desired level.
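As a rough illustration of the notional version, the following C sketch threads a recursion-depth parameter through fib and places domain creation and lopping only in F0 and Fm. The domain operations are the hypothetical ones sketched earlier; in a real lazy implementation the allocation would of course happen inside the runtime rather than in this skeleton, which only shows where the lopping code goes.

    /* Hypothetical domain operations from the earlier sketch. */
    typedef struct { char *base, *top, *limit; } domain;
    extern domain *current_domain;
    extern domain *new_domain(void);
    extern void free_domain(domain *d);

    #define M 8                            /* unrolling factor m, chosen freely */

    /* External entry corresponds to F0: call fib(n, 0). */
    long fib(long n, int depth)
    {
        long r;
        domain *d = NULL;
        domain *saved = current_domain;
        int next = (depth == M) ? 1 : depth + 1;   /* Fm calls back into F1 */

        if (n < 2) return n;
        if (depth == 0 || depth == M) {            /* lopping code only in F0 and Fm */
            d = new_domain();
            current_domain = d;                    /* allocations below land in d */
        }
        r = fib(n - 1, next) + fib(n - 2, next);
        if (d != NULL) {
            current_domain = saved;
            free_domain(d);                        /* lop everything allocated between
                                                      this level and the next lopping level */
        }
        return r;
    }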
The places where heap lopping (especially simple lopping) is possible are just where the interface between a sub-expression and its context is "narrow", that is, precisely where executing the sub-expression on a separate processor is likely to be advantageous. The separate domains correspond to the actual physical memory of the different processors. A useful property for distributed processing is that after the evaluation of a sub-expression on a separate processor, there are no references to memory in that processor remaining, and it can be reinitialised. Heap lopping is exactly equivalent to this reinitialisation. Simple lopping is where all allocation is done on the separate processor. Dynamic domains correspond to the second processor calling back to the parent for evaluation of lazy values passed in.
The various techniques have been presented in the context of a particular programming paradigm (lazy functional) and a particular implementation technique (TIM machine). Garbage collection of marks is obviously particular to the TIM architecture. Similarly, frame stacking is not an issue in traditional imperative languages where this is standard, and in eager languages the subject is well covered. Some graph reduction techniques are environment rather than combinator based, and these could benefit from similar rules. Heap lopping is of more interest in these other paradigms.
Heap lopping was introduced in the context of pure eager functional languages, and here it has very wide scope. It can be applied to any sub-expression that has a machine value result (anyin-mvout). That is the same condition as for dynamic domains in TIM. In fact the conditions examined for heap lopping in TIM are exactly those where the sub-expression's interface characteristics were eager. This means for instance, that all boolean expressions can be lopped.
As lazy languages had problems with updating after evaluation, imperative languages have problems with explicit update. We have already considered some of the ways of dealing with this when lopping was introduced. Clearly imperative sub-expressions can be lopped when they "behave like" pure functional ones. This property is often taken advantage of for other purposes, such as code movement, and ADA ensures that functions are pure. Most imperative languages do not have implicit garbage collection at the present time but this may well change.
Focusing on simple combinator based graph reduction, we see that the same sort of conditions that applied to TIM apply here. If at any stage there is a sub-graph which evaluates to a machine value and such that any sub-graph of it which is shared outside is evaluated, then any space allocated during the evaluation of that sub-graph may be freed upon completion. Ideally the combinator that creates the sub-graph should notice this and allocate space for it in the new domain, but this may not always be possible.
The exact implementation of this differs with the reduction technique. Most reducers use a stack, in which case the technique is rather like that for TIM, with similar problems about tail-enders. Pointer reversal reducers would need to leave a heap-popping node that would pop the heap when the spine unrolls.
Dynamic domains pose some problems, in keeping track of what should be allocated where. The general rule would be that all space allocated during the evaluation of a (non-lopped) graph node, would be allocated in the same domain as the node. Here it becomes essential that the sub-expression is created in the new domain, as otherwise there will not be any record of that domain for further evaluation to latch onto. This is similar to the problem with partial applications in TIM, but far more pervasive. (Interestingly, partial applications don't pose a problem)
Prolog implementations typically allocate space in several domains with different garbage collection strategies anyway. To achieve this they use techniques at least as powerful as stack framing. Prolog's execution is similar in many ways to that of lazy languages, in that there is much non-local update, but the possibility of backtracking means that if anything the web of references is more convoluted again! The equivalent of simple lopping is unlikely to be possible, but with the definition of the lifetime of a call being from the first call to the last successful backtrack, dynamic domains may be a possibility.
In its more static and structured forms this would have properties similar to traditional imperative programming techniques. Self modifying environments like Smalltalk (and the same goes for most LISPs) clearly have problems, as the static analysis has to be incrementally redone after changes, and in addition this may change the properties of functions in mid-calculation. The sort of algorithms that could achieve this I shudder to imagine. They seem to be getting on all right with techniques of their own, so let's leave them to it.
As we've seen, there is a close interplay between the conditions for heap lopping and distributed processing of functional languages. If each processor has no more than one thread of control then simple heap lopping is quite possible. If, however, two threads of control coexist within the same memory allocation space, simple lopping needs to be modified. This should not be too difficult, as lopping is only possible where there is no interaction with other expressions, in particular with concurrent expressions. Effectively each thread of control needs to have a domain of its own. This is moving towards multiple dynamic domains, and not surprisingly these pose no problems.
For imperative languages, the conditions for heap lopping again imply a certain element of non-interference between candidate expressions, so similar arguments would apply to concurrency here.
The techniques presented above differ in their simplicity, generality and power.
The collection of marks is a simple process, and there is little reason not to use it, but the gains are not too great. It does reduce by a little the space usage, and (perhaps more importantly) the number of shared applications.
Frame stacking conditions are fairly simple, with little run-time cost, and are an obvious extension of similar work for eager languages.
Heap lopping is the most radical technique, allowing one to dream that garbage collection may be a thing of the past. Its most important feature is its non-local freeing of space, and in this respect it differs from all other static space reclamation schemes I have encountered. It is this non-locality which gives it its generality (because the enclosed computation can be arbitrarily complex) and power (because an awful lot of space may be reclaimed). Unfortunately the conditions for it are correspondingly strong, restricting its range of applicability. It may in fact be better applied to eager languages than to lazy ones.
Most of the techniques discussed assume some level of strictness and/or sharing analysis, I suppose this is part of the general rule: "if you want efficient lazy languages, make them eager".
There are various ways of implementing partial applications in TIM, which interact with the garbage collection methods above.
The TIM paper describes a method of sharing partial applications of functions by way of "interrupting" take operations. This involves creating a special partial application node when a mark is encountered, and then beginning the take again. The semantic description of restarting the take involves pushing all the partially applied arguments back onto the stack, and beginning again. The same procedure is used when a partial application is applied to more arguments.
To really implement this by pushing and then popping is unnecessary. The pushed arguments will be enterargs into the partial frame, hence the new frame can be created with the enterargs explicitly, and then the rest of the arguments taken. To achieve this it is sensible to keep in the partial frame the address of the instruction after the take, rather than its own address, for re-entry.
The partial frame is of course needed for sharing, but is a special frame the best place to achieve this?
If we consider the case where the partial application is not reused, it is particularly easy to see the problems with this. Two frames will be used: one for the partial application, and one for the actual application (the child frame). In addition each partially applied argument is indirected through to the partial frame. This blossoming of enterargs for sharing will lead to still more unnecessary marks...
David Wakeling's figures for TIM show that enterargs take up a considerable amount of time, and hence reducing them will save this, as well as making takes faster.
An alternative that does not have these disadvantages is for the partial application to share the frame of the first full application. In TIM we can be sure that there are always enough arguments for the full application; there may just be some marks in between.
When we encounter a mark, say at the ith argument, instead of creating a special frame with the first i arguments in it, we can just update the mark with a partial-i code pointer and the frame pointer of the existing frame. This is semantically identical to the TIM paper proposal, but saves creating an extra frame and the associated indirections.
Of course subsequent applications of this partial-i object will have to produce indirections to it, but this is at least no worse than the original proposal. Another worry might be space leaks, but if the scheme of masks proposed in the TIM paper is used, the partial applications can be given appropriate masks, and hence the additional unshared arguments can be garbage collected when appropriate.
This scheme is particularly advantageous with a separate mark stack, as then the checking for marks, and the copying of arguments from the stack are independent, and thus the copying can be achieved with efficient memory move instructions.
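A hedged C sketch of such a take, which shares the frame under construction with any partial applications it meets: the types object, frame, closure and current_frame are as in the earlier sketches, and the helper routines and the partial_code table are hypothetical names for illustration only, not taken from the TIM paper.

    /* partial_code[i] is assumed to name the partial-i code: when the shared
       closure is later applied, that code builds a fresh frame whose first i
       arguments indirect (enterarg) into the frame shared here. */
    extern frame *alloc_frame(int nargs);
    extern object pop_stack(void);
    extern closure as_closure(object obj);
    extern int is_mark(object obj);
    extern void update_marker(object mark, void *code, frame *frm);
    extern void *partial_code[];

    frame *take_sharing(int n)
    {
        frame *f = alloc_frame(n);
        int i = 0;
        while (i < n) {
            object obj = pop_stack();
            if (is_mark(obj))
                /* Update the shared closure in place with partial-i and this
                   frame: no separate partial application frame is built. */
                update_marker(obj, partial_code[i], f);
            else {
                f->arg[i] = as_closure(obj);
                i = i + 1;
            }
        }
        current_frame = f;
        return f;
    }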
The scenario presented earlier would have seemed even worse if we had considered several cascading partial applications of more and more arguments, and the resulting chains of enterargs. In fact this never happens if (as is assumed in TIM) the functions are super-combinators.
In fact super-combinators are usually only curried at one place (although there may be several marks here), usually the last but one position. This is because the early arguments are produced by a parent super-combinator which supplies all the arguments until a new user-level argument is needed. As the parent knows where the currying can take place, it seems reasonable to let the parent worry about sharing. (A similar technique is to let the parent do the take and thus allow more efficient passing of arguments between the two, but the two techniques are independent.)
To achieve this, whenever a super-combinator is going to be called with too few arguments, before the arguments are pushed a special share instruction is executed which updates any marks at the very top of the stack with the label of the share instruction, and the current frame.

    /* code for f x (g y) 3 - where f needs more than 3 arguments */
    s_lab:  share s_lab
            push int 3
            push lab gy        /* code for g y */
            push arg 2         /* x is second arg */
            enter comb f

    /* code for f */
    f:      take_unshared 4
            << code for f >>

The take_unshared instruction is a simple memory move from the stack, with no checking of marks.
Again one might be concerned about space leaks, but again the masks will take care of this.
The most obvious thing that both these techniques do is to destroy the distinction between partial application frames and normal frames. This is only important if we want to selectively garbage collect one domain or other, as was suggested with multiple static domains. More important is that we must be aware when considering frame stacking and heap lopping that the current frame may be shared as a partial application.
For child sharing, this means that if it is possible that a frame is partially applied, then we cannot build it on the stack. This would tend to make one draw back from this technique if considering frame stacking. Alternatively, two sets of code could be generated for each function: one for the first shared application and one for all others. This would have no run time cost, just extra code. As with all code copying techniques, if no optimisation is possible, the code can be remerged later. Heap lopping is not such a problem: if the frame may be shared, one merely marks the heap just above, rather than just below, it. If it is the caller who is marking anyway there is no problem.
Parent sharing means that one cannot consider a frame for stacking if a share instruction may update a frame that outlives the current call. Considering upward conditions, it doesn't matter for shares passed in to sub-expressions returning machine values, as no references will escape anyway, and if no labels are passed to a sub-expression then no shares can be encountered there. The only remaining problem is if the result of the computation is a function, and there is a share to achieve this, as in:
    f x = if x = 3 then ( + 1 ) else λ y x + y

Such cases can either be rejected, or the share could branch on success (as with shared takes above).
Heap lopping again is not so much of a problem. If the share instruction is encountered in embedded call levels, then the update will be to frames that will be lopped anyway; the only problem will be with a share instruction at the bottom level, that is, if the actual function result is being shared. But as the conditions for heap lopping were not for higher order functions, this will not be a problem. Downward conditions for frame stacking don't have any problems, for similar reasons.
Both the above techniques have been successfully used in an interpreted version of the TIM machine. My preference lies with parent frame sharing as the sharing is made more explicit, and is thus easier to informally understand. As we have seen, it is also less likely to cause problems with frame stacking and heap lopping.
http://www.hcibook.com/alan/papers/gc-tim-99/ (Alan Dix, 8/1/2002)