Mon 24 Sep 2012
Yes, parsing C++ is difficult. There are numerous web pages out there that detail the traps for the unwary, commemorate fallen projects, and offer up grammars in varying states of incompletion. While what we had to do for the beautifier’s parsing is many times simpler than what a compiler writer for C++ has to deal with, you still get some interesting issues that arise from trying to navigate the full C++ syntax without the benefit of the information available to a full-fledged C++ parser.
There’s not a lot of literature on parsing C++ with incomplete information.
Which, in hindsight, isn’t too surprising. It’s a messy problem that doesn’t have a clean “apply methodology X to problem Y, repeat” solution that you see in most publications. And unlike a compiler, the beautifier can’t afford to do a 100% accurate parse – that would require the beautifier’s parser to accept every grammar and dialect of every compiler product/version used by any of our customers, a tremendous effort on its own.
Since we don’t have the fully stocked symbol tables that most methods of resolving C++ grammar ambiguities require, we deal with them in other ways:
- Some ambiguities don’t matter because resolving them wouldn’t affect any of the beautifier’s rules, so we ignore them.
- More complicated cases are handled by parsing state that we refine as we work through the source, describing how sure we are of our parse (e.g. “in an expression or declaration”, “definitely in a type declaration”, “I do not recognize this place”).
- When things get truly bad, we’ll rewind the token stream for a statement and restart the parse, using the information we gathered in the failed parse to guide the parse down another path. It’s a little bit expensive, and makes it hard to debug the parser, so this is a method of last resort.
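The rewind-and-retry idea above can be sketched roughly like this. This is a minimal illustration, not the beautifier’s actual code: the token representation and all of the names are invented, and the “declaration” check is reduced to a single `type * name ;` shape.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token stream: just strings plus a cursor we can rewind.
struct TokenStream {
    std::vector<std::string> toks;
    std::size_t pos = 0;
};

static bool at(const TokenStream &ts, const char *s) {
    return ts.pos < ts.toks.size() && ts.toks[ts.pos] == s;
}

// Tentatively parse "type * name ;" as a pointer declaration.  Without a
// symbol table we can only check the shape; on failure we rewind the
// cursor so the caller can reparse under a different assumption.
static bool try_pointer_declaration(TokenStream &ts) {
    const std::size_t start = ts.pos;       // bread crumb for the rewind
    if (ts.pos >= ts.toks.size()) return false;
    ++ts.pos;                               // accept an identifier as the type
    if (!at(ts, "*")) { ts.pos = start; return false; }
    ++ts.pos;
    if (ts.pos >= ts.toks.size()) { ts.pos = start; return false; }
    ++ts.pos;                               // the declared name
    if (!at(ts, ";")) { ts.pos = start; return false; }
    ++ts.pos;
    return true;
}

enum class Kind { Declaration, Expression };

// Like C++ itself, prefer the declaration reading when both fit
// ("a * b;"), and fall back to an expression parse after rewinding.
Kind classify_statement(TokenStream ts) {
    if (try_pointer_declaration(ts))
        return Kind::Declaration;
    return Kind::Expression;  // cursor was rewound; reparse as expression
}
```

With this shape-only check, `a * b ;` classifies as a declaration while `a * b + c ;` fails the tentative parse, rewinds, and classifies as an expression – which is the kind of guess you can make without a symbol table.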
Preprocessing makes everything more complicated.
As helpful as it may be for coding, preprocessing complicates the state handling in the beautifier substantially. Preprocessor conditionals can be tricky if different branches have wildly different pieces of code in them where braces don’t match up locally, or in cases where a single statement is spread across several branches of different conditionals. Unlike a compiler, we don’t just see the one path through the conditionals that the compiler gets from the preprocessing stage. We have to examine all of the branches, and we’d like to do something sane with all of them.
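A small constructed example of the kind of thing we have to cope with: each conditional branch opens braces that are only closed after the `#endif`, so neither branch pairs up locally, and a tool that examines both branches can’t match braces the way a compiler (which sees only one expansion) can.

```cpp
// Either branch leaves a '{' open across the #endif; the closing braces
// below belong to whichever branch the preprocessor happens to keep.
int clamp_positive(int x)
#if defined(USE_GUARDED_BODY)
{
    if (x < 0) {
        x = 0;
#else
{
    {
        x = x < 0 ? 0 : x;
#endif
    }   // closes a '{' opened in *some* branch above
    return x;
}
```

Both expansions compile and behave identically here, but the beautifier sees the unexpanded text with both halves present at once.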
While simple macro expansions are less heinous, they can still be tricky for us when you consider we probably don’t know what the macro expands to.
You can always come up with a preprocessing trick that will trip up a parser, so the emphasis is more on minimizing the effects of the hiccups rather than assuming we can parse our way around all of them.
How you handle errors has a huge effect on your parsing strategy.
This is probably true for parsing in general, but differences in our strategy reemphasized the point for me.
For a compiler, you need errors to be informative, and you want to be conservative in how you pick restart points so you don’t flood the user with bogus cascading errors from an earlier syntax error.
For the beautifier, error handling is almost entirely dedicated to finding a good place further in the stream to restart the parse, while leaving bread crumbs for the output pass of the beautifier so it doesn’t become completely lost. The remainder of the handling is for either:
- Cases where we’ve obviously parsed ourselves into a corner, and need to rewind and reparse with a different assumption (the assumption usually being either “declaration” vs. “expression”, or “expression with a ‘<’” vs. “declaration with a template type in it”, etc.).
- Cases where we’ve hit an EOF during the parse. This can be common when beautifying selections, and beautifying while editing. You have to take care not to lose partial parse information when this happens, even if you technically don’t have enough information to form a valid node in the parse tree.
In our favor, we are much less constrained than a compiler when it comes to picking restart points. If we trip up in the declaration of a function, and scan forwards to the first statement in the function, we’re certainly losing information, but not anything that’s going to be relevant for the configuration rules we need to apply.
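Picking a restart point can be as simple as scanning forward for a token that very likely begins a fresh construct. A rough sketch of that idea, with invented names and a plain string-token representation rather than anything from the real beautifier:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// After a failed parse, skip ahead to a token that almost certainly
// starts something new: a ';' or '}' terminates the broken statement,
// while a '{' opens a block we can parse independently.  The skipped
// tokens get marked as unparsed for the output pass, not discarded.
std::size_t find_restart_point(const std::vector<std::string> &toks,
                               std::size_t error_pos) {
    for (std::size_t i = error_pos; i < toks.size(); ++i) {
        if (toks[i] == ";" || toks[i] == "}")
            return i + 1;   // resume just past the terminator
        if (toks[i] == "{")
            return i;       // resume at the block itself
    }
    return toks.size();     // hit EOF: nothing left to parse
}
```

A compiler has to be pickier than this about where it resumes, precisely because of the cascading-error problem mentioned above; we mostly just need to land somewhere the indenting and brace rules can get back on their feet.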
And finally, related to error handling…
Don’t underestimate the value of good defaults.
It is guaranteed that there will be parse problems in the field. Vendor extensions, perverse use of macros, oddball syntax accepted by ancient compilers, etc… So when the inevitable occurs, what we don’t want to do is announce this to the user by switching to brace style 3 with double spacing in the section of code that was parsed incorrectly. And yes, early development versions did do things like this. It was horrifying to behold.
90% of the problem was fixed by just changing both the configuration and behavior defaults to have a sane behavior in the absence of parse information. We handle the remaining issues with a bit of special-casing in the output stage when we’re dealing with unparsed source, so we leave unparsed code with as few changes as possible.
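The “leave unparsed code alone” default amounts to something like the following in the output pass. Again a sketch with invented types – the real rule application is far more involved than the placeholder transformation here:

```cpp
#include <string>
#include <vector>

// One run of source text, tagged by whether the parser understood it.
struct Region {
    std::string text;
    bool parsed;
};

// Stand-in for real rule application; here it just trims trailing
// spaces so the passthrough behavior is visible in the output.
std::string apply_style(const std::string &text) {
    std::string out = text;
    while (!out.empty() && out.back() == ' ')
        out.pop_back();
    return out;
}

// The sane default: unparsed regions pass through byte-for-byte, so a
// parse failure never announces itself with a surprise style change.
std::string emit(const std::vector<Region> &regions) {
    std::string out;
    for (const Region &r : regions)
        out += r.parsed ? apply_style(r.text) : r.text;
    return out;
}
```

The important property is in `emit`: nothing about a failed parse is allowed to alter the bytes of the region it failed on.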