From a Regex to a Nondeterministic Finite Automata
July 25, 2019
Before building complex state machine we need to learn the basics blocks.
When the solution to a problem can be seen as set of states with transitions from ones to others, modeling them as a nondeterministic finite automatas makes clear how the solution works and allows to spot deficiencies.
A regular expression is an example of this. As an introductory step let’s review how to turn a regex into a NFA.
Take at look of the source code in Github.
From a regular expression to a NFA
Before getting deep in this, let’s define a very simple problem: we want to validate if a particular string follows or not a given structure.
Let’s assume that this structure can be writing using a regular language.
A regular expresion or regex is a handy way to write this in a concise way. Keep in mind that most of the regex engines are more powerful than a NFA so not all the features that such engines provide can be translated to a NFA.
But a NFA is powerful enough to solve a lot of problems so it worth it.
Labeled transitions
First, we say that the NFA can move from one state to another if there is a transition between the states and it is labeled with the same character that was read.
We represent this with a simple arrow connecting the two states labeled with the particular character.
A NFA allows the use of epsilon transitions or \(\epsilon\)-transitions for short.
A NFA moves from one state to another through a \(\epsilon\)-transition without reading any character: it represents the empty string match.
Optional match
We this two simple definitions we can build an optional match represented in regex syntax as a?
The optional part can be as complex as we want like another NFA, no necessary must be a simple literal.
Concatenation and repetition of NFAs
Concatenation of two state machines \(sm_1\) and \(sm_2\) (ab
in regex syntax).
Two or more NFAs can be concatenated to match a sequence of submatches being linked one to the other using \(\epsilon\)-transitions.
In regex notation this corresponds to ab
(a
followed by b
)
This is made obvious in the diagrams: \(sm_1\) cannot link to itself. Underwood we will have three \(sm_1\) identical objects.
As a extension, a NFA can be link to a clone of itself to match a sequence of repeated submatches. In regex syntax, a{n}
.
Repetition of \(sm_1\) three times (a{3}
in regex syntax).
We say that the link is to a clone because technically a link to itself would end up in an unbounded loop and what we want instead is a sequence of a fixed size.
When the NFA links to itself, the loop matches an unbounded repetition, a zero or more or klee construction, the famous a*
:
The repetition can have different finite lower and higher bounds to form a range with a minimum and a maximum of repetitions a{,2}
a{2,4}
or with the higher limit unbounded a{2,}
a+
.
Union
Finally, the a|b
regex. As you may guessed, we stick two or more state machines using \(\epsilon\)-transitions.
Further readings
Aho, Lam, Sethi and Ullman. Compilers: Principles, Techniques, & Tools, Second edition, Chapter 3.
You can find a NFA implementation in Python here in Github.
Related tags: regex, automata, state machine, NFA, string