Scanning
MotivationMaterial adapted from CSCI 3136 Principles of Programming Languages at Dalhousie University
Scanning is the first step in compilation, as shown in Figure 1, from Programming Language Pragmatics, by Michael L. Scott.
Figure 1. The scanner performs lexical analysis and converts the source program into a token stream.
Formal languages
A formal language \(\mathcal{L}\) is a set of strings over an alphabet \(\Sigma\).
- Alphabet \(\Sigma\): set of characters that can be used to form strings (letters, digits, punctuation, …)
- String: finite sequence of characters
- \(\varepsilon\) denotes the empty string (string with no letters: “”)
-
Length \(\mid s\mid\) of a string \(s\) = number of characters in \(s\)
\(\mid \varepsilon\mid\) = 0, \(\mid a \mid\) = 1, \(\mid abc \mid\) = 3, …
-
\(\Sigma^0\) = set of strings of length 0: \(\Sigma^0 = {\varepsilon}\)
\(\Sigma^1\) = set of strings of length 1: \(\Sigma^1\) = {a,b,c,…}
\(\Sigma^2\) = set of strings of length 2: \(\Sigma^2\) = {aa,ab,…,ca,…}
. . .
-
Kleene star: \(\Sigma^* = \Sigma^0\cup\Sigma^1\cup\Sigma^2\cup\ldots\) (set of all strings over alphabet \(\Sigma\))
- A formal language \(\mathcal{L}\) over alphabet \(\Sigma\) is a subset \(\mathcal{L} \subseteq\Sigma^*\)
Examples of formal languages
Some finite languages
- {}, {\(\varepsilon\)}, {0,1}, {\(\varepsilon\),0,1,00,01,100}
- {the,is,I,you,he,she,it,man,are}
- …
Some infinite languages
- {0}\(^∗\), {0,1}\(^∗\), {a,b,c}\(^∗\)
- {01\(^n0 \mid n\ge 0\)}
- {\(a^p \mid p\) is a prime number}
- Set of all positive integers
- Set of all syntactically correct C programs
- …
A simple application of formal languages
- Accept valid email addresses and reject invalid ones:
Formal languages and automata
-
Regular languages
– Recognized (decided) by finite automata
– Useful for tokenizing program text (lexical analysis)
-
Context-free languages
– Recognized (decided) by non-deterministic push-down automata
– Useful for parsing the syntax of a program (syntactic analysis)
Regular languages
A recursive definition
-
\(\emptyset\), {\(\varepsilon\)}, and {\(a\)} are regular languages, where \(a \in \Sigma\)
-
If \(A\) and \(B\) are regular languages, then the following are also regular languages:
- \[A\cup B\]
-
\(AB := \{ab \mid a\in A, b \in B\}\) (the concatenation of two strings in \(A\) and \(B\))
- \[A^*\]
A few examples
- {\(a,b,ab\)}
- Any finite language
- {0}\(^∗\)
- {0,1}\(^∗\), {\(a,b,c\)}\(^∗\)
- {\(01^n0\ \mid\ n\ge0\)}
- Set of all positive integers in decimal representation
- {\(0^m1^n\ \mid\ m\ge0,n\ge0\)}
- {\(a^kb^mc^n\ \mid\ k \ge0,m\ge0,n\ge0\)}
- {\((^n)^n\ \mid\ n \ge 0\)}
- Set of all syntactically correct C programs
- {\(a^p\ \mid\ p\) is a prime number}
Regular expressions
More concise notation to represent regular languages
-
\(\emptyset\): the empty language
-
\(a\), for \(a \in \Sigma\): {\(a\)}
-
\(\varepsilon\): {\(\varepsilon\)}
-
\(R_1\mid R_2\): the union of the two languages \(L_1\) and \(L_2\) defined by regular expressions \(R_1\) and \(R_2\)
-
\(R_1R_2\): the concatenation of the two languages \(L_1\) and \(L_2\) defined by \(R_1\) and \(R_2\)
-
\(R^*\): the Kleene star of the language defined by \(R\)
Usual way to indicate grouping using parentheses
-
\(ab\mid c\): {\(ab,c\)}
-
\(a(b\mid c)\): {\(ab,ac\)}
Operator precedence
-
Parentheses
-
Kleene star (∗)
-
Concatenation
-
Union \((\mid)\)
Examples of regular expressions
-
\((0\mid1)^*0\): all binary strings that end in 0
-
\((0\mid1)00^*\): all bin. strings that start with 0 or 1, followed by one or more 0s
-
\((0\mid1)^*\): all binary strings
Which regular expression expresses the set of strings that do not contain 101 as a substring?
Applications of regular expressions
Two common operations
- Searching for patterns in a text
- Replacing text portions matching a pattern
Used in
-
Text editors: emacs, vim, …
-
System tools: shells, grep, lex, flex, sed, awk, …
-
Programming languages: Perl, Ruby, Python, C/C++ (with regex library), Java (with regex package), …
History of regular languages and regular expressions
Theory of regular languages (regular sets)
Stephen Cole Kleene (1909–1994) “Representation of events in nerve nets and finite automata” (1956):
- Used mathematical notion of regular sets to describe models of the nervous system by McCulloch and Pitts (1940s) as small simple automata.
Implementation
- Stephen Kleene (1956)
-
Ken Thompson (1968): Editor QED and later
ed
andgrep
– Patented algorithm
– Also used in
awk
,vi
,lex
,emacs
-
Henry Spencer (1986): C regular expression library used in
tcl
– Also used in Perl
Deterministic finite automata (DFA)
A deterministic finite automaton (DFA) is a simple machine that accepts or rejects a given input string.
This defines a formal language: the set of strings accepted by the DFA. We say the DFA recognizes this language.
We will show that a language is regular if and only if there exists a DFA that recognizes this language.
Informal definition of a DFA
-
Finite set of states
-
Designated start state
-
Designated set of final states**
-
Reads input one symbol at a time. New state is calculated as a function of current state and read symbol.
-
String is accepted if and only if a final state is reached after reading the entire string.
Example of a DFA
Transition function:
Symbol | |||
---|---|---|---|
State | 0 | 1 | |
Start: | \(q_1\) | \(q_1\) | \(q_2\) |
Final: | \(q_2\) | \(q_1\) | \(q_2\) |
Which language does this DFA recognize?
Another example of a DFA
Which language does this DFA recognize?
Formal definition of a DFA
A DFA is a 5-tuple (\(Q,\Sigma,\delta,q_0,F\))
-
\(Q\) is a finite set of states
-
\(\Sigma\) is a finite set of input symbols (i.e., an alphabet)
-
\(\delta:Q\times\Sigma\to Q\) is a transition function
-
\(q_0\in Q\) is the start state
-
\(F \subseteq Q\) is the set of final or accepting states
The DFA accepts a string \(\sigma \in \Sigma^∗\) if and only if it reaches a final state after consuming all letters in \(\sigma\).
The language recognized by a DFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^∗\mid D\) accepts \(\sigma\)}.
We will prove that a language \(\mathcal{L}\) is regular if and only if there exists a DFA \(D\) such that \(\mathcal{L} = \mathcal{L}(D)\).
Examples of DFAs (1)
Construct DFAs that recognize the following languages:
- {\(\sigma\in\{0,1\}^∗\) \(\mid\) the number of 0s in \(\sigma\) is even}

- {\(\sigma\in{0,1}^∗\) \(\mid \sigma\) does not contain the substring 101}

- Valid C comments

Non-deterministic finite automata (NFA)
A DFA is deterministic in the sense that every input traces exactly one path through the automaton.
A non-deterministic finite automaton (NFA) is identical to a DFA, except that there are possibly many paths traced by an input.
Two sources of non-determinism
- \(\varepsilon\)-transitions

- Multiple successor states for the same input symbol

An NFA accepts a string \(\sigma\in\Sigma^*\) if one of the paths traced by \(\sigma\) ends in an accepting state.
Example of an NFA


Which language does this NFA recognize?
Formal definition of an NFA
An NFA is a 5-tuple \((Q,\Sigma,\delta,q_0,F)\)
-
\(Q\) is a finite set of states
-
\(\Sigma\) is a finite set of input symbols (i.e., an alphabet)
-
\(\delta:Q\times(\Sigma\cup\{\varepsilon\})\to 2^Q\) is a transition function
-
\(q \in Q\) is the start state
-
\(F \subseteq Q\) is the set of final or accepting states
Every input \(\sigma\in\Sigma^*\) defines a set of paths traced by \(\delta\) while consuming \(\sigma\). The NFA accepts \(\sigma\) if and only if one such path defined by \(\sigma\) ends in a final states.
The language recognized by an NFA \(D\) is the set of strings \(\mathcal{L}(D) := \{\sigma\in\Sigma^*\mid D\) accepts \(\sigma\)}.
Are NFAs more powerful than DFAs?
The answer depends on what we mean by this question.
Is there a language \(\mathcal{L}\) recognized by an NFA that cannot be recognized by a DFA?
No! We will prove this.
Is it easier to construct an NFA for a regular language than to construct a DFA for the same language?
- \(\mathcal{L} \subseteq\{0,1\}^*\) defined by the Perl regular expression
/.*1../

Equivalent characterizations of regular languages
Theorem: The following statements are equivalent:
-
\(\mathcal{L}\) is a regular language.
-
\(\mathcal{L}\) is the language described by a regular expression (without extensions as, e.g., in Perl).
-
\(\mathcal{L}\) is recognized by an NFA.
-
\(\mathcal{L}\) is recognized by a DFA.
Regular languages and regular expressions
A language \(\mathcal{L}\) is regular if and only if it can be expressed using a regular expression.

From regular expression to NFA (1)
If a language \(\mathcal{L}\) can be expressed using a regular expression, there exists an NFA that recognizes it.

From regular expression to NFA (2)

From NFA to regular expression (1)
If a language \(\mathcal{L}\) can be recognized by an NFA, it can be expressed using a regular expression.
A generalized NFA (GNFA)
-
Edges are labelled with regular expressions.
-
When in configuration \((q_1,\alpha\beta)\) (state \(q_1\) with input \(\alpha\beta\) still left to be read) if and only if the edge \(q_1 \to q_2\) is labelled with a regular expression that matches \(\alpha\).
Proof idea: NFA \(\Rightarrow\) GNFA \(\Rightarrow\) GNFA \(\Rightarrow\cdots\Rightarrow\) GNFA \(\Rightarrow\) regular expression
An NFA is a GNFA. “Normalize” it by adding \(\varepsilon\)-transitions from its final states to a new final state \(f\) and making all original final states non-final.
From NFA to regular expression (2)
From GNFA to regular expression
-
Transform GNFA into equivalent GNFA with only two states: start state and final state
-
Transform this two-state GNFA into a regular expression
Extract regular expression

From NFA to regular expression (3)
State reduction

This may create loops because some states may simultaneously be in- and out-neighbours of \(s\).
From DFA to NFA
If a language \(\mathcal{L}\) can be recognized by a DFA, it can be recognized by an NFA.
A DFA is an NFA!
From NFA to DFA (1)
If a language \(\mathcal{L}\) can be recognized by an NFA, it can be recognized by a DFA.
Proof idea: Construct a DFA whose states represent the sets of states the NFA may be in at any given point in time.
Start state
-
Before consuming any input, the NFA can be in its start state \(q_0\) or in any state that can be reached from \(q_0\) using a sequence of \(\varepsilon\)-transitions.
-
We call this the \(\varepsilon\)-closure
ECLOSE
(\(q_0\)) of \(q_0\). -
ECLOSE
(\(q_0\)) is the start state of the DFA.
From NFA to DFA (2)
Transition function and construction of more DFA states
- Assume that after reading some input, the NFA can be in any of the states in a set \(Q\) represented by a DFA state. Which states can the NFA be in after reading an input symbol \(a\)?
\(Q'' := \cup_{q'\in Q'}\) ECLOSE
(\(q'\)), where \(Q' := \cup_{q\in Q}\delta(q,a)\)
-
If \(Q''\) is not a state of the DFA yet, we add it to the set of DFA states.
-
We define \(δ'(Q,a):= Q''\).
-
We continue to inspect all DFA state-symbol pairs until we do not discover any new states.
Accepting states
- A state \(Q\) of the DFA is accepting if, viewed as a set of NFA states, it contains an accepting state of the NFA.
From NFA to DFA: example

From NFA to DFA: example

From NFA to DFA: exercise
Which language is accepted by the following NFA?

Scanning
A scanner produces a token (token type, value) stream from a character stream.
Two modes of operation
-
Complete pass produces the token stream, which is then passed to the parser.
-
Parser calls scanner to request next token.
In either case, the scanner always recognizes the longest possible token.
Scanner implementation
-
Hand-written, ad-hoc. Usually done when speed is a concern.
-
From regular expression using scanner generator. More convenient. Result:
-
Case statements representing transitions of the DFA.
-
Table representing the DFA’s transition function plus driver code to implement the DFA.
-
Building a Scanner
Regular expression \(\to\) NFA \(\to\) DFA \(\to\) minimized DFA
Extensions to pure DFAs:
-
Accepting a token is not enough. Need to know which token was accepted and its value.
-
One accepting state per token type
-
Return string read along the path to the accepting state
-
Keywords are not identifiers
- Look up identifier in keyword table (e.g., hash table) to see whether it is in fact a keyword
-
“Look-ahead” to distinguish tokens with common prefix (e.g., 100 vs 100.5)
-
Try to find the longest possible match by continuing to scan from an accepting state.
-
Backtrack to last accepting state when “stuck”.
-
Extended example of a scanner
