Parsing

MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University

Parsing is the second step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.

⊕ <a name='figure1'>Figure 1</a>. The parser performs syntax analysis and converts the token stream in a parse tree.
Figure 1. The parser performs syntax analysis and converts the token stream in a parse tree.

Grammars describe languages

A grammar for a subset of natural language:

Sentences in the language described by this grammar:

big Jim ate green cheese
green Jim ate green cheese
Jim ate cheese
cheese at Jim

Another example

A grammar for simple arithmetic expressions:

Context-free grammars

A context-free grammar is a 4-tuple $(V,\Sigma,P,S)$, where

$V$ is a set of non-terminals or variables,
$\Sigma$ is a set of terminals,
$P$ is a set of rules or productions in the form $N\to(V\cup\Sigma)^*$, where $N \in V$,
$S\in V$ is the start symbol.

Notational variations in productions

Derivations

A derivation is a sequence of rewriting operations that starts with the string $\sigma=S$ and then repeats the following until $\sigma$ contains only terminals:

Choose a non-terminal $X$ in $\sigma$ and a production $X \to\alpha$ in the grammar and replace $X$ with $\alpha$ in $\sigma$.
As a picture $\sigma=\lambda X \rho \Rightarrow \sigma'=\lambda\alpha\rho$

An example

Intermediate strings are called sentential forms.

Language deﬁned by a context-free grammar

We write $S \Rightarrow^* \sigma$ if there exists a sequence $S \Rightarrow \sigma_1 \Rightarrow \sigma_2 \Rightarrow\cdots\Rightarrow \sigma$

Every grammar $G$ deﬁnes a language

$$\mathcal{L}(G)=\{\sigma\in\Sigma^*\mid S\Rightarrow^*\sigma\}$$

If $G$ is a context-free grammar, $\mathcal{L}(G)$ is a context-free language.

Example: The language deﬁned by the Jim-ate-cheese grammar is

$\mathcal{L}(G)$ = {Jim ate cheese, Jim ate Jim, big Jim ate cheese, …}

Context-free languages: examples

What is the language deﬁned by the following grammar?

$$S\to\varepsilon$$ $$S\to0 S 0$$ $$S\to1 S 1$$

$\mathcal{L}(G)=\{\sigma\sigma^R\mid \sigma\in\Sigma^*\}$, where $\sigma^R$ is $σ$ written backwards.

What is the language deﬁned by the following grammar?

$$S\to\varepsilon$$ $$S\to0 S 1$$ \[\mathcal{L}(G)=\{0^n1^n\mid n\ge0\}\]

Neither of these two languages is regular.

There are more context-free languages than regular ones.

Parse trees

Every derivation can be represented by a parse tree:

The root is $S$.
The children of each node are the symbols (terminals and non-terminals) it is replaced with.

The internal nodes are non-terminals. The leaves — called the yield of the parse-tree — are terminals.

Parse trees: examples (1)

Parse trees: examples (2)

Ambiguity

There are inﬁnitely many context-free grammars generating a given context-free language.

Just add arbitrary non-terminals to the right-hand sides of productions and then add $\varepsilon$-productions for these non-terminals.

There may be more than one parse tree for the same sentence generated by a grammar $G$. If this is the case, we call $G$ ambiguous.

To allow parsing of programming languages, their grammars have to be unambiguous.

Some context-free languages are inherently ambiguous, that is, do not have unambiguous grammars. Usually, this is not the case for programming languages.

Ambiguity: example (1)

Ambiguity: example (2)

An unambiguous grammar for the same language

This grammar respects precedence rules.

Ambiguity: example (3)

An unambiguous grammar for the same language

This grammar respects precedence rules. It also respects left-associativity.

Leftmost and rightmost derivations

For every sentence in a language deﬁned by an unambiguous grammar, there is only one parse tree that generates this sentence.

There are many different derivations corresponding to this parse tree.

Leftmost derivation: replaces the leftmost non-terminal in each step.

Rightmost derivation: replaces the rightmost non-terminal in each step.

Context-free grammars for programming languages

Context-free grammars are used to formally describe the syntax of programming languages.

Parsing a program involves ﬁnding the parse tree of the program.

Context-free grammars for programming languages must be unambiguous and must capture the program structure.

The parser should be efﬁcient:

Any context-free grammar can be used to derive a parser that runs in $O(n^3)$ time.
We want linear time.

Regular grammars

A context-free grammar is right-linear if all productions are of the form $A\to\sigma B$ or $A\to\sigma$, where $\sigma$ is a (possibly empty) string of terminals.

A context-free grammar is left-linear if all productions are of the form $A\to B\sigma$ or $A\to\sigma$, where $\sigma$ is a (possibly empty) string of terminals.

A context-free grammar is regular if it is right-linear or left-linear.

The set of languages expressed by regular grammars is exactly the set of regular languages.

Regular grammars are too weak to express programming languages.

LL and LR grammars

Two kinds of grammars that can be parsed efﬁciently and guarantee unambiguousness:

An $LL(k)$-grammar can be scanned Left-to-right and generates a Leftmost derivation.

If the ﬁrst letter in the current sentential form is a non-terminal, $k$ tokens look-ahead in the input sufﬁce to decide which production to use to expand it.

An $LR(k)$-grammar can be scanned Left-to-right and generates a Rightmost derivation.

The next $k$ tokens in the input sufﬁce to choose the next step the parser should perform. (We’ll see what these steps are.)

Almost every programming language can be described by an $LL(1)$- or $LR(1)$-grammar.

S-grammars

An $S$-grammar or simple grammar is a special case of an LL(1)-grammar.

A context-free grammar is an S-grammar if

Every production starts with a terminal and
The productions for the same LHS start with different terminals.

LL parsing: an example

LR parsing: an example

Parsing using LL(1) grammars

The crux with parsing deterministically using an LL(1) grammar is to decide which production to apply when the next symbol in the current sentential form is a non-terminal:

Input: $a$…
Sentential form: $A$…
Production: $A\to\alpha$?

Two cases

\[A\Rightarrow\alpha\Rightarrow^*a\beta\]
$A\Rightarrow\alpha\Rightarrow^*\varepsilon$ and a derivation of $A$ can be succeeded by $a$.

Intuitively (but formally not quite correctly), a terminal $a$ is in the predictor set of production $A\to\alpha$ if $A\beta \Rightarrow\alpha\beta\Rightarrow^*a\gamma$, for some $\beta,\gamma\in\Sigma^*$.

A grammar is LL(1) if the predictor sets of all productions with the same LHS are disjoint.

Back to course home