Hierarchical grammars for more human-like compiler parsing

Nearly twenty years ago, back when I was in York, one of my student project suggestions was to try to make compiler parsers operate a little more like a human: scanning first for high-level structures like brackets and blocks and only moving on to finer-level features later. If I recall correctly there were several reasons for this, including connections with ‘dynamic pointers’¹, but most importantly to help error reporting, especially in cases of mismatched brackets or missing ‘;’ from line ends … still a big problem.

Looking back I can see that one MEng student considered it, but in the end didn’t do it, so it lay amongst that great pile of “things to do one day” and discuss occasionally over tea or beer. I recall too looking at grammar-to-grammar parsers … I guess nowadays I might imagine using XSLT!

Today, 18 years on, while scanning David Ungar’s publications I discovered that he actually did this in the Java parser at Sun². I don’t know if this is actually used in the current Java implementations. Their reasons for looking at the issue were to do with making the parser easier to maintain, so it may be that this is being done under the hood, but the benefits for the Java programmer are not being realised.

While I was originally thinking about programming languages, I have more recently found myself using the general methods in anger when doing data cleaning, as one often approaches this in a pipeline fashion, creating elements of structure along the way that are picked up by later parsing/cleaning steps.

To my knowledge there are no general purpose tools for doing this. So, if anyone is looking for a little project, here is my own original project suggestion from 1993 …

Background
When compilers parse a computer program, they usually proceed in a sequential, left-to-right fashion. The computational requirement of limited lookahead means that the syntax of programming languages must usually be close to LL(1) or LR(1). Human readers use a very different strategy. They scan the text for significant features, building up an understanding of the text in a more top down fashion. The human reader thus looks at the syntax at multiple levels and we can think of this as a hierarchical grammar.

Objective
The purpose of this project is to build a parser based more closely on this human parsing strategy. The target language could be Pascal or C (ADA is probably a little complex!). The parser will operate in two or more passes. The first pass would identify the block structure; for example, in C this would be based on matching various brackets and delimiters ‘{};,()’. This would yield a partially sequential, partially tree-like structure. Mismatched brackets could be detected at this stage, avoiding the normally confusing error messages generated by this common error. Subsequent passes would ‘parse’ this tree, eventually obtaining a standard syntax tree.
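As a rough illustration of what such a first pass might look like, here is a minimal Python sketch. It is not the 1993 design nor anything from the Sun parser; the names `Group` and `first_pass` are invented for this example, and it deliberately ignores string literals, character constants and comments, which a real C front end would have to skip. It only matches ‘(’ and ‘{’ with their closers, treats ‘;’ and ‘,’ as separators, and leaves everything else as uninterpreted text for later passes.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; brackets and separators follow the proposal's `{};,()`.
PAIRS = {"(": ")", "{": "}"}
CLOSERS = set(PAIRS.values())


@dataclass
class Group:
    opener: str   # "" for the top-level group
    line: int     # line on which the opener appeared (0 for the top level)
    children: list = field(default_factory=list)  # text chunks, separators, nested Groups


def first_pass(source):
    """Return (tree, errors): a bracket tree plus bracket-level error messages."""
    root = Group("", 0)
    stack = [root]
    errors = []
    text = []  # characters of the current run of uninterpreted text

    def flush():
        chunk = "".join(text).strip()
        if chunk:
            stack[-1].children.append(chunk)
        text.clear()

    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch in PAIRS:
                flush()
                group = Group(ch, lineno)
                stack[-1].children.append(group)
                stack.append(group)
            elif ch in CLOSERS:
                flush()
                top = stack[-1]
                if top.opener and PAIRS[top.opener] == ch:
                    stack.pop()
                elif top.opener:
                    errors.append(
                        f"line {lineno}: found '{ch}' but '{top.opener}' "
                        f"from line {top.line} is still open"
                    )
                else:
                    errors.append(f"line {lineno}: unmatched '{ch}'")
            elif ch in ";,":
                flush()
                stack[-1].children.append(ch)
            else:
                text.append(ch)
        text.append(" ")  # a line break only separates text
    flush()
    for group in stack[1:]:
        errors.append(f"line {group.line}: '{group.opener}' is never closed")
    return root, errors
```

Run on `int f() {\n  if (x {\n    y();\n  }\n}`, the first message names the ‘(’ opened on line 2 rather than complaining about some later token, which is the kind of report the proposal is after. A later pass would then walk the `Group` tree rather than the raw character stream.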

Options
Depending on progress, the project can develop in various ways. One option is to use the more human-like parsing to improve error reporting; for example, the first pass could identify the likely sites where brackets have been missed by analysing the indentation structure of the program. Another option would be to build a YACC-like tool to assist in the production of multi-level parsers.
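The indentation idea can be sketched in a few lines of Python. This is only one possible heuristic, written for illustration: the function name `likely_missing_close` is made up, and it assumes consistently indented code (no mixed tabs and spaces) and, again, ignores strings and comments. It remembers the indentation of each line that opens a ‘{’; if a later line is indented no deeper than that opener yet does not begin with ‘}’, the block was probably meant to have been closed before it.

```python
def likely_missing_close(source):
    """Return (opener_line, suspect_line) pairs where a '}' was probably omitted.

    Heuristic only: assumes consistent indentation and ignores strings/comments.
    """
    open_blocks = []  # (indent, line number) of each line that opened a '{'
    suspects = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no indentation information
        indent = len(line) - len(line.lstrip())
        # A line no deeper than the innermost opener, and not itself a closer,
        # suggests that block should already have ended.
        while (open_blocks and indent <= open_blocks[-1][0]
               and not stripped.startswith("}")):
            _, opener_line = open_blocks.pop()
            suspects.append((opener_line, lineno))
        for ch in stripped:
            if ch == "{":
                open_blocks.append((indent, lineno))
            elif ch == "}" and open_blocks:
                open_blocks.pop()
    return suspects
```

For a function whose inner `if (x) {` on line 2 is never closed before a statement at the same indentation on line 4, this returns `[(2, 4)]`: the brace opened on line 2 probably wanted closing just before line 4. A compiler could offer that as a hint alongside the usual end-of-file complaint.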

Reading

1. S. P. Robertson, E. F. Davis, K. Okabe and D. Fitz-Randolf, “Program comprehension beyond the line”³, pp. 959-963 in Proceedings of Interact’90, North-Holland, 1990.
2. Recommended reading from compiler construction course
3. YACC manual from UNIX manual set.

¹ For more on Dynamic Pointers see my first book “Formal Methods for Interactive Systems” and a CSCW journal paper, “Dynamic pointers and threads”.
² D. M. Ungar, “Modular parser architecture with mini parsers”, US Patent 7,089,541, 2006.
³ Incidentally, “Program comprehension beyond the line” is a fantastic paper both for its results and also methodologically. In the days when eye-tracking was still pretty complex (maybe still now!), they wanted to study program comprehension, so instead of following eye gaze, they forced experimental subjects to physically scroll through code using a single-line browser.

2 thoughts on “Hierarchical grammars for more human-like compiler parsing”

I have had for some time now the same ideas myself about a language based on hierarchical parsers, and I would like to experiment with it. Perhaps I would end up with something useful for other people to use, and perhaps it will have the potential of commercial value?

After reading the US patent of Ungar I wondered how it is possible to patent an idea like this. If I understand the patent text (always hard to read) right, the patent is on the general idea of hierarchical parsers?
I think such patents will act like a conservative force, protecting a language (Java) from possible future competition.
It will harm further development in this direction.

alan on May 11, 2011 at 12:26 pm said:

Although I am not against all software patents, there are many that appear crazily broad. It just seems that the US patent officers do not have sufficient technical expertise to sort them out prior to granting. Of course, they can still be challenged in court, but that is very risky, especially, as in this case, when the patent holder has a big-bucks company behind them!

For this particular patent, the project proposal in the blog dates from 1993, so that would kill the patent on ‘prior art’; also, I am sure that there are many ad hoc examples around.
