Scanning

MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University

Scanning is the first step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.

⊕ <a name='figure1'>Figure 1</a>. The scanner performs lexical analysis and converts the source program into a token stream.
Figure 1. The scanner performs lexical analysis and converts the source program into a token stream.

Formal languages

A formal language \(\mathcal{L}\) is a set of strings over an alphabet \(\Sigma\).

Alphabet \(\Sigma\): set of characters that can be used to form strings (letters, digits, punctuation, …)
String: ﬁnite sequence of characters
\(\varepsilon\) denotes the empty string (string with no letters: “”)
Length \(\mid s\mid\) of a string \(s\) = number of characters in \(s\)

\(\mid \varepsilon\mid\) = 0, \(\mid a \mid\) = 1, \(\mid abc \mid\) = 3, …
\(\Sigma^0\) = set of strings of length 0: \(\Sigma^0 = {\varepsilon}\)

\(\Sigma^1\) = set of strings of length 1: \(\Sigma^1\) = {a,b,c,…}

\(\Sigma^2\) = set of strings of length 2: \(\Sigma^2\) = {aa,ab,…,ca,…}

. . .
Kleene star: \(\Sigma^* = \Sigma^0\cup\Sigma^1\cup\Sigma^2\cup\ldots\) (set of all strings over alphabet \(\Sigma\))
A formal language \(\mathcal{L}\) over alphabet \(\Sigma\) is a subset \(\mathcal{L} \subseteq\Sigma^*\)

Examples of formal languages

Some ﬁnite languages

{}, {\(\varepsilon\)}, {0,1}, {\(\varepsilon\),0,1,00,01,100}
{the,is,I,you,he,she,it,man,are}
…

Some inﬁnite languages

{0}\(^∗\), {0,1}\(^∗\), {a,b,c}\(^∗\)
{01\(^n0 \mid n\ge 0\)}
{\(a^p \mid p\) is a prime number}
Set of all positive integers
Set of all syntactically correct C programs
…

A simple application of formal languages

Accept valid email addresses and reject invalid ones:

\[\begin{align*} email\_address &\to user\_id @ domain\\ user\_id &\to word\\ domain &\to word\ |\ word.domain \end{align*}\]

Formal languages and automata

Regular languages

– Recognized (decided) by ﬁnite automata

– Useful for tokenizing program text (lexical analysis)
Context-free languages

– Recognized (decided) by non-deterministic push-down automata

– Useful for parsing the syntax of a program (syntactic analysis)

Regular languages

A recursive deﬁnition

\(\emptyset\), {\(\varepsilon\)}, and {\(a\)} are regular languages, where \(a \in \Sigma\)
If \(A\) and \(B\) are regular languages, then the following are also regular languages:
- \[A\cup B\]
- \(AB := \{ab \mid a\in A, b \in B\}\) (the concatenation of two strings in \(A\) and \(B\))
- \[A^*\]

A few examples

{\(a,b,ab\)}
Any ﬁnite language
{0}\(^∗\)
{0,1}\(^∗\), {\(a,b,c\)}\(^∗\)
{\(01^n0\ \mid\ n\ge0\)}
Set of all positive integers in decimal representation
{\(0^m1^n\ \mid\ m\ge0,n\ge0\)}
{\(a^kb^mc^n\ \mid\ k \ge0,m\ge0,n\ge0\)}
{\((^n)^n\ \mid\ n \ge 0\)}
Set of all syntactically correct C programs
{\(a^p\ \mid\ p\) is a prime number}

Regular expressions

More concise notation to represent regular languages

\(\emptyset\): the empty language
\(a\), for \(a \in \Sigma\): {\(a\)}
\(\varepsilon\): {\(\varepsilon\)}
\(R_1\mid R_2\): the union of the two languages \(L_1\) and \(L_2\) deﬁned by regular expressions \(R_1\) and \(R_2\)
\(R_1R_2\): the concatenation of the two languages \(L_1\) and \(L_2\) deﬁned by \(R_1\) and \(R_2\)
\(R^*\): the Kleene star of the language deﬁned by \(R\)

Usual way to indicate grouping using parentheses

\(ab\mid c\): {\(ab,c\)}
\(a(b\mid c)\): {\(ab,ac\)}

Operator precedence

Parentheses
Kleene star (∗)
Concatenation
Union \((\mid)\)

Examples of regular expressions

\((0\mid1)^*0\): all binary strings that end in 0
\((0\mid1)00^*\): all bin. strings that start with 0 or 1, followed by one or more 0s
\((0\mid1)^*\): all binary strings

Which regular expression expresses the set of strings that do not contain 101 as a substring?

Applications of regular expressions

Two common operations

Searching for patterns in a text
Replacing text portions matching a pattern

Used in

Text editors: emacs, vim, …
System tools: shells, grep, lex, ﬂex, sed, awk, …
Programming languages: Perl, Ruby, Python, C/C++ (with regex library), Java (with regex package), …

History of regular languages and regular expressions

Theory of regular languages (regular sets)

Stephen Cole Kleene (1909–1994) “Representation of events in nerve nets and ﬁnite automata” (1956):

Used mathematical notion of regular sets to describe models of the nervous system by McCulloch and Pitts (1940s) as small simple automata.

Implementation

Stephen Kleene (1956)
Ken Thompson (1968): Editor QED and later ed and grep

– Patented algorithm

– Also used in awk, vi, lex, emacs
Henry Spencer (1986): C regular expression library used in tcl

– Also used in Perl

Deterministic finite automata (DFA)

A deterministic ﬁnite automaton (DFA) is a simple machine that accepts or rejects a given input string.

This deﬁnes a formal language: the set of strings accepted by the DFA. We say the DFA recognizes this language.

We will show that a language is regular if and only if there exists a DFA that recognizes this language.

Informal deﬁnition of a DFA

Finite set of states
Designated start state
Designated set of ﬁnal states**
Reads input one symbol at a time. New state is calculated as a function of current state and read symbol.
String is accepted if and only if a ﬁnal state is reached after reading the entire string.

Example of a DFA

Transition function:

		Symbol
	State	0	1
Start:	\(q_1\)	\(q_1\)	\(q_2\)
Final:	\(q_2\)	\(q_1\)	\(q_2\)

Which language does this DFA recognize?

Another example of a DFA

Which language does this DFA recognize?

Formal deﬁnition of a DFA

A DFA is a 5-tuple (\(Q,\Sigma,\delta,q_0,F\))

\(Q\) is a ﬁnite set of states
\(\Sigma\) is a ﬁnite set of input symbols (i.e., an alphabet)
\(\delta:Q\times\Sigma\to Q\) is a transition function
\(q_0\in Q\) is the start state
\(F \subseteq Q\) is the set of ﬁnal or accepting states

The DFA accepts a string \(\sigma \in \Sigma^∗\) if and only if it reaches a ﬁnal state after consuming all letters in \(\sigma\).

The language recognized by a DFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^∗\mid D\) accepts \(\sigma\)}.

We will prove that a language \(\mathcal{L}\) is regular if and only if there exists a DFA \(D\) such that \(\mathcal{L} = \mathcal{L}(D)\).

Examples of DFAs (1)

Construct DFAs that recognize the following languages:

{\(\sigma\in\{0,1\}^∗\) \(\mid\) the number of 0s in \(\sigma\) is even}

{\(\sigma\in{0,1}^∗\) \(\mid \sigma\) does not contain the substring 101}

Valid C comments

Non-deterministic finite automata (NFA)

A DFA is deterministic in the sense that every input traces exactly one path through the automaton.

A non-deterministic ﬁnite automaton (NFA) is identical to a DFA, except that there are possibly many paths traced by an input.

Two sources of non-determinism

\(\varepsilon\)-transitions

Multiple successor states for the same input symbol

An NFA accepts a string \(\sigma\in\Sigma^*\) if one of the paths traced by \(\sigma\) ends in an accepting state.

Example of an NFA

Which language does this NFA recognize?

Formal deﬁnition of an NFA

An NFA is a 5-tuple \((Q,\Sigma,\delta,q_0,F)\)

\(Q\) is a ﬁnite set of states
\(\Sigma\) is a ﬁnite set of input symbols (i.e., an alphabet)
\(\delta:Q\times(\Sigma\cup\{\varepsilon\})\to 2^Q\) is a transition function
\(q \in Q\) is the start state
\(F \subseteq Q\) is the set of ﬁnal or accepting states

Every input \(\sigma\in\Sigma^*\) deﬁnes a set of paths traced by \(\delta\) while consuming \(\sigma\). The NFA accepts \(\sigma\) if and only if one such path deﬁned by \(\sigma\) ends in a ﬁnal states.

The language recognized by an NFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^*\mid D\) accepts \(\sigma\)}.

Are NFAs more powerful than DFAs?

The answer depends on what we mean by this question.

Is there a language \(\mathcal{L}\) recognized by an NFA that cannot be recognized by a DFA?

No! We will prove this.

Is it easier to construct an NFA for a regular language than to construct a DFA for the same language?

\(\mathcal{L} \subseteq\{0,1\}^*\) deﬁned by the Perl regular expression /.*1../

Equivalent characterizations of regular languages

Theorem: The following statements are equivalent:

\(\mathcal{L}\) is a regular language.
\(\mathcal{L}\) is the language described by a regular expression (without extensions as, e.g., in Perl).
\(\mathcal{L}\) is recognized by an NFA.
\(\mathcal{L}\) is recognized by a DFA.

Regular languages and regular expressions

A language \(\mathcal{L}\) is regular if and only if it can be expressed using a regular expression.

Regular languages and their equivalent regular expressions

From regular expression to NFA (1)

If a language \(\mathcal{L}\) can be expressed using a regular expression, there exists an NFA that recognizes it.

From regular expression to NFA (2)

From NFA to regular expression (1)

If a language \(\mathcal{L}\) can be recognized by an NFA, it can be expressed using a regular expression.

A generalized NFA (GNFA)

Edges are labelled with regular expressions.
When in conﬁguration \((q_1,\alpha\beta)\) (state \(q_1\) with input \(\alpha\beta\) still left to be read) if and only if the edge \(q_1 \to q_2\) is labelled with a regular expression that matches \(\alpha\).

Proof idea: NFA \(\Rightarrow\) GNFA \(\Rightarrow\) GNFA \(\Rightarrow\cdots\Rightarrow\) GNFA \(\Rightarrow\) regular expression

An NFA is a GNFA. “Normalize” it by adding \(\varepsilon\)-transitions from its ﬁnal states to a new ﬁnal state \(f\) and making all original ﬁnal states non-ﬁnal.

From NFA to regular expression (2)

From GNFA to regular expression

Transform GNFA into equivalent GNFA with only two states: start state and ﬁnal state
Transform this two-state GNFA into a regular expression

Extract regular expression

From NFA to regular expression (3)

State reduction

This may create loops because some states may simultaneously be in- and out-neighbours of \(s\).

From DFA to NFA

If a language \(\mathcal{L}\) can be recognized by a DFA, it can be recognized by an NFA.

A DFA is an NFA!

From NFA to DFA (1)

If a language \(\mathcal{L}\) can be recognized by an NFA, it can be recognized by a DFA.

Proof idea: Construct a DFA whose states represent the sets of states the NFA may be in at any given point in time.

Start state

Before consuming any input, the NFA can be in its start state \(q_0\) or in any state that can be reached from \(q_0\) using a sequence of \(\varepsilon\)-transitions.
We call this the \(\varepsilon\)-closure ECLOSE(\(q_0\)) of \(q_0\).
ECLOSE(\(q_0\)) is the start state of the DFA.

From NFA to DFA (2)

Transition function and construction of more DFA states

Assume that after reading some input, the NFA can be in any of the states in a set \(Q\) represented by a DFA state. Which states can the NFA be in after reading an input symbol \(a\)?

\(Q'' := \cup_{q'\in Q'}\) ECLOSE(\(q'\)), where \(Q' := \cup_{q\in Q}\delta(q,a)\)

If \(Q''\) is not a state of the DFA yet, we add it to the set of DFA states.
We deﬁne \(δ'(Q,a):= Q''\).
We continue to inspect all DFA state-symbol pairs until we do not discover any new states.

Accepting states

A state \(Q\) of the DFA is accepting if, viewed as a set of NFA states, it contains an accepting state of the NFA.

From NFA to DFA: example

From NFA to DFA: exercise

Which language is accepted by the following NFA?

Scanning

A scanner produces a token (token type, value) stream from a character stream.

Two modes of operation

Complete pass produces the token stream, which is then passed to the parser.
Parser calls scanner to request next token.

In either case, the scanner always recognizes the longest possible token.

Scanner implementation

Hand-written, ad-hoc. Usually done when speed is a concern.
From regular expression using scanner generator. More convenient. Result:
- Case statements representing transitions of the DFA.
- Table representing the DFA’s transition function plus driver code to implement the DFA.

Building a Scanner

Regular expression \(\to\) NFA \(\to\) DFA \(\to\) minimized DFA

Extensions to pure DFAs:

Accepting a token is not enough. Need to know which token was accepted and its value.
One accepting state per token type
Return string read along the path to the accepting state
Keywords are not identiﬁers
- Look up identiﬁer in keyword table (e.g., hash table) to see whether it is in fact a keyword
“Look-ahead” to distinguish tokens with common preﬁx (e.g., 100 vs 100.5)
- Try to ﬁnd the longest possible match by continuing to scan from an accepting state.
- Backtrack to last accepting state when “stuck”.

Extended example of a scanner

Back to course home