124 lines
4.7 KiB
Markdown
124 lines
4.7 KiB
Markdown
# Error recovery (2022-05-07)
|
|
|
|
Problem: we have two fairly generic cases of recovery bounded within a range:
|
|
- sequences: `int x; this is absolute garbage; x++;`
|
|
- brackets: `void foo() { this is garbage too }`
|
|
|
|
So far, we've thought of these as different, and had precise ideas about
|
|
brackets ("lazy parsing") and vague ones about sequences.
|
|
Con we unify them instead?
|
|
|
|
In both cases we want to recognize the bounds of the garbage item based on
|
|
basic token-level features of the surrounding code, and avoid any interference
|
|
with the surrounding code.
|
|
|
|
## Brackets
|
|
|
|
Consider a rule like `compound-stmt := { stmt-seq }`.
|
|
|
|
The desired recovery is:
|
|
- if we fail at `{ . stmt-seq }`
|
|
- ... and we can find for the matching `}`
|
|
- then consume up to that token as an opaque broken `stmt-seq`
|
|
- ... and advance state to `{ stmt-seq . }`
|
|
|
|
We can annotate the rule to describe this: `{ stmt-seq [recovery] }`.
|
|
We can generalize as `{ stmt-seq [recovery=rbrace] }`, allowing for different
|
|
**strategies** to find the resume point.
|
|
|
|
(It's OK to assume we're skipping over one RHS nonterminal, we can always
|
|
introduce a nonterminal for the bracket contents if necessary).
|
|
|
|
## Sequences
|
|
|
|
Can we apply the same technique to sequences?
|
|
Simplest case: delimited right-recursive sequence.
|
|
|
|
```
|
|
param-list := param
|
|
param-list := param , param-list
|
|
```
|
|
|
|
We need recovery in **both** rules.
|
|
`param` in the first rule matches the *last* param in a list,
|
|
in the second rule it matches all others. We want to recover in any position.
|
|
|
|
If we want to be able to recovery `int x int y` as two parameters, then we
|
|
should extract a `param-and-comma` rule that recovery could operate on.
|
|
|
|
### Last vs not-last elements
|
|
|
|
Sequences will usually have two rules with recovery, we discussed:
|
|
- how to pick the correct one to recover with
|
|
- in a left-recursive rule they correspond to last & not-last elements
|
|
- the recovery strategy knows whether we're recoverying last or not-last
|
|
- we could have the strategy return (pos, symbol parsed), and artificially
|
|
require distinct symbols (e.g. `stmt` vs `last_stmt`).
|
|
- we can rewrite left-recursion in the grammar as right-recursion
|
|
|
|
However, on reflection I think we can simply allow recovery according to both
|
|
rules. The "wrong" recovery will produce a parse head that dies.
|
|
|
|
## How recovery fits into GLR
|
|
|
|
Recovery should kick in at the point where we would otherwise abandon all
|
|
variants of an high-level parse.
|
|
|
|
e.g. Suppose we're parsing `static_cast<foo bar baz>(1)` and are up to `bar`.
|
|
Our GSS looks something like:
|
|
|
|
```
|
|
"the static cast's type starts at foo"
|
|
---> {expr := static_cast < . type > ( expr )}
|
|
| "foo... is a class name"
|
|
+---- {type := identifier .}
|
|
| "foo... is a template ID"
|
|
+---- {type := identifier . < template-arg-list >}
|
|
```
|
|
|
|
Since `foo bar baz` isn't a valid class name or template ID, both active heads
|
|
will soon die, as will the parent GSS Node - the latter should trigger recovery.
|
|
|
|
- we need a refcount in GSS nodes so we can recognize never-reduced node death
|
|
- when a node dies, we look up its recovery actions (symbol, strategy).
|
|
These are the union of the recovery actions for each item.
|
|
They can be stored in the action table.
|
|
Here: `actions[State, death] = Recovery(type, matching-angle-bracket)`
|
|
- we try each strategy: feeding in the start position = token of the dying node
|
|
(`foo`) getting out the end position (`>`).
|
|
- We form an opaque forest node with the correct symbol (`type`) spanning
|
|
[start, end)
|
|
- We create a GSS node to represent the state after recovery.
|
|
The new state is found in the Goto table in the usual way.
|
|
|
|
```
|
|
"the static cast's type starts at foo"
|
|
---> {expr := static_cast < . type > ( expr )}
|
|
| "`foo bar baz` is an unparseable type"
|
|
+---- {expr := static_cast < type . > (expr)}
|
|
```
|
|
|
|
## Which recovery heads to activate
|
|
|
|
We probably shouldn't *always* create active recovery heads when a recoverable
|
|
node dies (and thus keep it alive).
|
|
By design GLR creates multiple speculative parse heads and lets incorrect heads
|
|
disappear.
|
|
|
|
Concretely, the expression `(int *)(x)` is a valid cast, we probably shouldn't
|
|
also parse it as a call whose callee is a broken expr.
|
|
|
|
The simplest solution is to only create recovery heads if there are no normal
|
|
heads remaining, i.e. if parsing is completely stuck. This is vulnerable if the
|
|
"wrong" parse makes slightly more progress than the "right" parse which has
|
|
better error recovery.
|
|
|
|
A sophisticated variant might record recovery opportunities and pick the one
|
|
with the rightmost *endpoint* when the last parse head dies.
|
|
|
|
We should consider whether including every recovery in the parse forest might
|
|
work after all - this would let disambiguation choose "broken" but likely parses
|
|
over "valid" but unlikely ones.
|
|
|
|
|