Parsing

MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University

Parsing is the second step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.

<a name='figure1'>Figure 1</a>. The parser performs syntax analysis and converts the token stream in a parse tree.
Figure 1. The parser performs syntax analysis and converts the token stream in a parse tree.

Grammars describe languages

A grammar for a subset of natural language:

Alt text

Sentences in the language described by this grammar:

Another example

A grammar for simple arithmetic expressions:

Alt text

Context-free grammars

A context-free grammar is a 4-tuple \((V,\Sigma,P,S)\), where

Alt text

Notational variations in productions

Alt text

Derivations

A derivation is a sequence of rewriting operations that starts with the string \(\sigma=S\) and then repeats the following until \(\sigma\) contains only terminals:

An example

Alt text

Intermediate strings are called sentential forms.

Language defined by a context-free grammar

We write \(S \Rightarrow^* \sigma\) if there exists a sequence \(S \Rightarrow \sigma_1 \Rightarrow \sigma_2 \Rightarrow\cdots\Rightarrow \sigma\)

Every grammar \(G\) defines a language

$$\mathcal{L}(G)=\{\sigma\in\Sigma^*\mid S\Rightarrow^*\sigma\}$$

If \(G\) is a context-free grammar, \(\mathcal{L}(G)\) is a context-free language.

Example: The language defined by the Jim-ate-cheese grammar is

\(\mathcal{L}(G)\) = {Jim ate cheese, Jim ate Jim, big Jim ate cheese, …}

Context-free languages: examples

What is the language defined by the following grammar?

$$S\to\varepsilon$$
$$S\to0 S 0$$
$$S\to1 S 1$$

\(\mathcal{L}(G)=\{\sigma\sigma^R\mid \sigma\in\Sigma^*\}\), where \(\sigma^R\) is \(σ\) written backwards.

What is the language defined by the following grammar?

$$S\to\varepsilon$$
$$S\to0 S 1$$
\[\mathcal{L}(G)=\{0^n1^n\mid n\ge0\}\]

Neither of these two languages is regular.

There are more context-free languages than regular ones.

Parse trees

Every derivation can be represented by a parse tree:

Alt text

The internal nodes are non-terminals. The leaves — called the yield of the parse-tree — are terminals.

Parse trees: examples (1)

Alt text

Parse trees: examples (2)

Alt text

Ambiguity

There are infinitely many context-free grammars generating a given context-free language.

There may be more than one parse tree for the same sentence generated by a grammar \(G\). If this is the case, we call \(G\) ambiguous.

To allow parsing of programming languages, their grammars have to be unambiguous.

Some context-free languages are inherently ambiguous, that is, do not have unambiguous grammars. Usually, this is not the case for programming languages.

Ambiguity: example (1)

Alt text

Ambiguity: example (2)

An unambiguous grammar for the same language

Alt text

This grammar respects precedence rules.

Ambiguity: example (3)

An unambiguous grammar for the same language

Alt text

This grammar respects precedence rules. It also respects left-associativity.

Leftmost and rightmost derivations

For every sentence in a language defined by an unambiguous grammar, there is only one parse tree that generates this sentence.

There are many different derivations corresponding to this parse tree.

Leftmost derivation: replaces the leftmost non-terminal in each step.

Rightmost derivation: replaces the rightmost non-terminal in each step.

Alt text

Context-free grammars for programming languages

Context-free grammars are used to formally describe the syntax of programming languages.

Parsing a program involves finding the parse tree of the program.

Context-free grammars for programming languages must be unambiguous and must capture the program structure.

The parser should be efficient:

Regular grammars

A context-free grammar is right-linear if all productions are of the form \(A\to\sigma B\) or \(A\to\sigma\), where \(\sigma\) is a (possibly empty) string of terminals.

A context-free grammar is left-linear if all productions are of the form \(A\to B\sigma\) or \(A\to\sigma\), where \(\sigma\) is a (possibly empty) string of terminals.

A context-free grammar is regular if it is right-linear or left-linear.

The set of languages expressed by regular grammars is exactly the set of regular languages.

Regular grammars are too weak to express programming languages.

LL and LR grammars

Two kinds of grammars that can be parsed efficiently and guarantee unambiguousness:

An \(LL(k)\)-grammar can be scanned Left-to-right and generates a Leftmost derivation.

If the first letter in the current sentential form is a non-terminal, \(k\) tokens look-ahead in the input suffice to decide which production to use to expand it.

An \(LR(k)\)-grammar can be scanned Left-to-right and generates a Rightmost derivation.

The next \(k\) tokens in the input suffice to choose the next step the parser should perform. (We’ll see what these steps are.)

Almost every programming language can be described by an \(LL(1)\)- or \(LR(1)\)-grammar.

S-grammars

An \(S\)-grammar or simple grammar is a special case of an LL(1)-grammar.

A context-free grammar is an S-grammar if

Alt text

LL parsing: an example

Alt text

LR parsing: an example

Alt text

Parsing using LL(1) grammars

The crux with parsing deterministically using an LL(1) grammar is to decide which production to apply when the next symbol in the current sentential form is a non-terminal:

Two cases

Intuitively (but formally not quite correctly), a terminal \(a\) is in the predictor set of production \(A\to\alpha\) if \(A\beta \Rightarrow\alpha\beta\Rightarrow^*a\gamma\), for some \(\beta,\gamma\in\Sigma^*\).

A grammar is LL(1) if the predictor sets of all productions with the same LHS are disjoint.

Alt text


Back to course home