Parsing
MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University
Parsing is the second step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.
Figure 1. The parser performs syntax analysis and converts the token stream in a parse tree.
Grammars describe languages
A grammar for a subset of natural language:

Sentences in the language described by this grammar:
-
big Jim ate green cheese
-
green Jim ate green cheese
-
Jim ate cheese
-
cheese at Jim
Another example
A grammar for simple arithmetic expressions:

Context-free grammars
A context-free grammar is a 4-tuple \((V,\Sigma,P,S)\), where
-
\(V\) is a set of non-terminals or variables,
-
\(\Sigma\) is a set of terminals,
-
\(P\) is a set of rules or productions in the form \(N\to(V\cup\Sigma)^*\), where \(N \in V\),
-
\(S\in V\) is the start symbol.

Notational variations in productions

Derivations
A derivation is a sequence of rewriting operations that starts with the string \(\sigma=S\) and then repeats the following until \(\sigma\) contains only terminals:
- Choose a non-terminal \(X\) in \(\sigma\) and a production \(X \to\alpha\) in the grammar and replace \(X\) with \(\alpha\) in \(\sigma\).
As a picture \(\sigma=\lambda X \rho \Rightarrow \sigma'=\lambda\alpha\rho\)
An example

Intermediate strings are called sentential forms.
Language defined by a context-free grammar
We write \(S \Rightarrow^* \sigma\) if there exists a sequence \(S \Rightarrow \sigma_1 \Rightarrow \sigma_2 \Rightarrow\cdots\Rightarrow \sigma\)
Every grammar \(G\) defines a language
If \(G\) is a context-free grammar, \(\mathcal{L}(G)\) is a context-free language.
Example: The language defined by the Jim-ate-cheese grammar is
\(\mathcal{L}(G)\) = {Jim ate cheese, Jim ate Jim, big Jim ate cheese, …}
Context-free languages: examples
What is the language defined by the following grammar?
\(\mathcal{L}(G)=\{\sigma\sigma^R\mid \sigma\in\Sigma^*\}\), where \(\sigma^R\) is \(σ\) written backwards.
What is the language defined by the following grammar?
Neither of these two languages is regular.
There are more context-free languages than regular ones.
Parse trees
Every derivation can be represented by a parse tree:
-
The root is \(S\).
-
The children of each node are the symbols (terminals and non-terminals) it is replaced with.

The internal nodes are non-terminals. The leaves — called the yield of the parse-tree — are terminals.
Parse trees: examples (1)

Parse trees: examples (2)

Ambiguity
There are infinitely many context-free grammars generating a given context-free language.
- Just add arbitrary non-terminals to the right-hand sides of productions and then add \(\varepsilon\)-productions for these non-terminals.
There may be more than one parse tree for the same sentence generated by a grammar \(G\). If this is the case, we call \(G\) ambiguous.
To allow parsing of programming languages, their grammars have to be unambiguous.
Some context-free languages are inherently ambiguous, that is, do not have unambiguous grammars. Usually, this is not the case for programming languages.
Ambiguity: example (1)

Ambiguity: example (2)
An unambiguous grammar for the same language

This grammar respects precedence rules.
Ambiguity: example (3)
An unambiguous grammar for the same language

This grammar respects precedence rules. It also respects left-associativity.
Leftmost and rightmost derivations
For every sentence in a language defined by an unambiguous grammar, there is only one parse tree that generates this sentence.
There are many different derivations corresponding to this parse tree.
Leftmost derivation: replaces the leftmost non-terminal in each step.
Rightmost derivation: replaces the rightmost non-terminal in each step.

Context-free grammars for programming languages
Context-free grammars are used to formally describe the syntax of programming languages.
Parsing a program involves finding the parse tree of the program.
Context-free grammars for programming languages must be unambiguous and must capture the program structure.
The parser should be efficient:
-
Any context-free grammar can be used to derive a parser that runs in \(O(n^3)\) time.
-
We want linear time.
Regular grammars
A context-free grammar is right-linear if all productions are of the form \(A\to\sigma B\) or \(A\to\sigma\), where \(\sigma\) is a (possibly empty) string of terminals.
A context-free grammar is left-linear if all productions are of the form \(A\to B\sigma\) or \(A\to\sigma\), where \(\sigma\) is a (possibly empty) string of terminals.
A context-free grammar is regular if it is right-linear or left-linear.
The set of languages expressed by regular grammars is exactly the set of regular languages.
Regular grammars are too weak to express programming languages.
LL and LR grammars
Two kinds of grammars that can be parsed efficiently and guarantee unambiguousness:
An \(LL(k)\)-grammar can be scanned Left-to-right and generates a Leftmost derivation.
If the first letter in the current sentential form is a non-terminal, \(k\) tokens look-ahead in the input suffice to decide which production to use to expand it.
An \(LR(k)\)-grammar can be scanned Left-to-right and generates a Rightmost derivation.
The next \(k\) tokens in the input suffice to choose the next step the parser should perform. (We’ll see what these steps are.)
Almost every programming language can be described by an \(LL(1)\)- or \(LR(1)\)-grammar.
S-grammars
An \(S\)-grammar or simple grammar is a special case of an LL(1)-grammar.
A context-free grammar is an S-grammar if
-
Every production starts with a terminal and
-
The productions for the same LHS start with different terminals.

LL parsing: an example

LR parsing: an example

Parsing using LL(1) grammars
The crux with parsing deterministically using an LL(1) grammar is to decide which production to apply when the next symbol in the current sentential form is a non-terminal:
-
Input: \(a\)…
-
Sentential form: \(A\)…
-
Production: \(A\to\alpha\)?
Two cases
- \[A\Rightarrow\alpha\Rightarrow^*a\beta\]
- \(A\Rightarrow\alpha\Rightarrow^*\varepsilon\) and a derivation of \(A\) can be succeeded by \(a\).
Intuitively (but formally not quite correctly), a terminal \(a\) is in the predictor set of production \(A\to\alpha\) if \(A\beta \Rightarrow\alpha\beta\Rightarrow^*a\gamma\), for some \(\beta,\gamma\in\Sigma^*\).
A grammar is LL(1) if the predictor sets of all productions with the same LHS are disjoint.
