Scanning

MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University

Scanning is the first step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.

<a name='figure1'>Figure 1</a>. The scanner performs lexical analysis and converts the source program into a token stream.
Figure 1. The scanner performs lexical analysis and converts the source program into a token stream.

Formal languages

A formal language \(\mathcal{L}\) is a set of strings over an alphabet \(\Sigma\).

Examples of formal languages

Some finite languages

Some infinite languages

A simple application of formal languages

\[\begin{align*} email\_address &\to user\_id @ domain\\ user\_id &\to word\\ domain &\to word\ |\ word.domain \end{align*}\]

Formal languages and automata

Regular languages

A recursive definition

A few examples

Regular expressions

More concise notation to represent regular languages

Usual way to indicate grouping using parentheses

Operator precedence

Examples of regular expressions

Which regular expression expresses the set of strings that do not contain 101 as a substring?

Applications of regular expressions

Two common operations

Used in

History of regular languages and regular expressions

Theory of regular languages (regular sets)

Stephen Cole Kleene (1909–1994) “Representation of events in nerve nets and finite automata” (1956):

Implementation

Deterministic finite automata (DFA)

A deterministic finite automaton (DFA) is a simple machine that accepts or rejects a given input string.

This defines a formal language: the set of strings accepted by the DFA. We say the DFA recognizes this language.

We will show that a language is regular if and only if there exists a DFA that recognizes this language.

Informal definition of a DFA

Example of a DFA

Deterministic finite automata

Transition function:

    Symbol  
  State 0 1
Start: \(q_1\) \(q_1\) \(q_2\)
Final: \(q_2\) \(q_1\) \(q_2\)

Which language does this DFA recognize?

Another example of a DFA

Deterministic finite automata

Which language does this DFA recognize?

Formal definition of a DFA

Deterministic finite automata

A DFA is a 5-tuple (\(Q,\Sigma,\delta,q_0,F\))

The DFA accepts a string \(\sigma \in \Sigma^∗\) if and only if it reaches a final state after consuming all letters in \(\sigma\).

The language recognized by a DFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^∗\mid D\) accepts \(\sigma\)}.

We will prove that a language \(\mathcal{L}\) is regular if and only if there exists a DFA \(D\) such that \(\mathcal{L} = \mathcal{L}(D)\).

Examples of DFAs (1)

Construct DFAs that recognize the following languages:

Deterministic finite automata
Deterministic finite automata
Deterministic finite automata

Non-deterministic finite automata (NFA)

A DFA is deterministic in the sense that every input traces exactly one path through the automaton.

A non-deterministic finite automaton (NFA) is identical to a DFA, except that there are possibly many paths traced by an input.

Two sources of non-determinism

Deterministic finite automata
Deterministic finite automata

An NFA accepts a string \(\sigma\in\Sigma^*\) if one of the paths traced by \(\sigma\) ends in an accepting state.

Example of an NFA

Non-deterministic finite automata
Transition function

Which language does this NFA recognize?

Formal definition of an NFA

An NFA is a 5-tuple \((Q,\Sigma,\delta,q_0,F)\)

Every input \(\sigma\in\Sigma^*\) defines a set of paths traced by \(\delta\) while consuming \(\sigma\). The NFA accepts \(\sigma\) if and only if one such path defined by \(\sigma\) ends in a final states.

The language recognized by an NFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^*\mid D\) accepts \(\sigma\)}.

Are NFAs more powerful than DFAs?

The answer depends on what we mean by this question.

Is there a language \(\mathcal{L}\) recognized by an NFA that cannot be recognized by a DFA?

No! We will prove this.

Is it easier to construct an NFA for a regular language than to construct a DFA for the same language?

Finite automata

Equivalent characterizations of regular languages

Theorem: The following statements are equivalent:

Regular languages and regular expressions

A language \(\mathcal{L}\) is regular if and only if it can be expressed using a regular expression.

Regular languages and their equivalent regular expressions

From regular expression to NFA (1)

If a language \(\mathcal{L}\) can be expressed using a regular expression, there exists an NFA that recognizes it.

Regular expressions to NFA

From regular expression to NFA (2)

Regular expressions to NFAs

From NFA to regular expression (1)

If a language \(\mathcal{L}\) can be recognized by an NFA, it can be expressed using a regular expression.

A generalized NFA (GNFA)

Proof idea: NFA \(\Rightarrow\) GNFA \(\Rightarrow\) GNFA \(\Rightarrow\cdots\Rightarrow\) GNFA \(\Rightarrow\) regular expression

An NFA is a GNFA. “Normalize” it by adding \(\varepsilon\)-transitions from its final states to a new final state \(f\) and making all original final states non-final.

From NFA to regular expression (2)

From GNFA to regular expression

Extract regular expression

DFA to regular expression

From NFA to regular expression (3)

State reduction

State reduction

This may create loops because some states may simultaneously be in- and out-neighbours of \(s\).

From DFA to NFA

If a language \(\mathcal{L}\) can be recognized by a DFA, it can be recognized by an NFA.

A DFA is an NFA!

From NFA to DFA (1)

If a language \(\mathcal{L}\) can be recognized by an NFA, it can be recognized by a DFA.

Proof idea: Construct a DFA whose states represent the sets of states the NFA may be in at any given point in time.

Start state

From NFA to DFA (2)

Transition function and construction of more DFA states

\(Q'' := \cup_{q'\in Q'}\) ECLOSE(\(q'\)), where \(Q' := \cup_{q\in Q}\delta(q,a)\)

Accepting states

From NFA to DFA: example

From NFA to DFA

From NFA to DFA: example

From NFA to DFA

From NFA to DFA: exercise

Which language is accepted by the following NFA?

NFA

Scanning

A scanner produces a token (token type, value) stream from a character stream.

Two modes of operation

In either case, the scanner always recognizes the longest possible token.

Scanner implementation

Building a Scanner

Regular expression \(\to\) NFA \(\to\) DFA \(\to\) minimized DFA

Extensions to pure DFAs:

Extended example of a scanner

Pascal scanner


Back to course home