This is a raw draft layout, questions, handnotes will be changed or removed, graphs will be added

This post is dedicated to regular expression implementation with Thompson algorithm (1969), which supports parallel searching and is much superior to the brute force algorithms sequentially looking for possibilities in exponential time.

Regular expressions principles

Note: terms metasymbol and metacharacter mean the same in this text, and they are parts of metasequences

Definition 1.
Regular expression, or regex, is sequence of characters each being either from alphabet $A$ or metasequence set $M$, which is mapped into a subset of the set of all strings from characters according to the function : () which determines transition on metasymbol-based rules. In other words, regular expression is one string which can be replaced by multiple strings depending on the metasequence transition rules, reduced example of which is described below (basically, different regular expressions implementations have different metasequence notation for different operations, but I will operate with a subset of POSIX):

Metasequence Meaning Examples
$$.$$ one any character $$ . \rightarrow A $$
$$?$$ $$\leq 1$$ instance of the previous expression $$ a? \rightarrow \{ \epsilon, a \} $$
$$*$$ any possible count of the previous expression $$ a* \rightarrow \{ \epsilon, a, aa, aaa, ... \} $$
$$+$$ at least one instance of the previous expression $$ a+ \rightarrow \{ a, aa, aaa, ... \} $$
$$|$$ only one of two expressions $$ (a | b) \rightarrow \{ a, b \} $$
$$\c$$ represent c literally, not as a metasymbol $$ \. \rightarrow \{ . \} $$
$$[a_1-a_n]$$ any one of diapason, multiple diapasons denoted by $$-$$ are possible $$ [a-z] \rightarrow \{ a, b, c, ... z \} $$, $$ [0-9a-z] \rightarrow \{0, 1, ... 9, a, b, ..., z \} $$
$$(s_n)$$ consider $$s_n$$ as one symbol $$ (abc)|(def) \rightarrow \{ abc, def \} $$ (a particular case is a full regex -- see $$|$$ example)
$$\{n\}$$, $$\{,n\}$$, $$\{n,\}$$, $$\{n,m\}$$ $$ x=n $$, $$ x \leq n $$, $$ x \geq n $$ and $$ n \leq x \leq m $$, where $$ x $$ is amount of instances of expression, respectively $$a\{2\} \rightarrow \{ aa \} $$, $$ a\{,2\} \rightarrow \{ \epsilon, a, aa \} $$, $$ a\{2,\} \rightarrow \{ aa, aaa, ... \} $$, $$ a\{2,5\} \rightarrow \{ aa, aaa, aaaa, aaaaa \} $$

here and below represents an empty string.

What about concatenation operator?

Basically, regexes allow for:

  • fundamental representation of large text patterns, with prooving theorems for regexes automatically leading to prooving of responsive theorems for such patterns of text. However, here one needs to proove the equivalence between regexes and the respective text patterns, which is a bit beyond the scope of this post.
  • squeezing of large text into smaller chunks. However, this text must have some internal patterns in it, and the further discussion of this topic is beyond the scope of this post. - generic search in text editors, programs like grep etc. This is the main area where the regex concept showed itself and I will cover it in a higher detail.

Relation to Text Search. Because of regex nature, each regex may act like a filter, which either approves each incoming text string consisting of alphabet characters, or rejects it. Because of this, if one has a task of finding some special text strings in the text stream, they may try to handle a reverse task of creating the regex for the given direct or approximate set of required strings (one knows what they want to find) and then passing to the regex matching program.
To understand the generic template of implementation of such programs (1) and to understand regexes better from formal point of view (2) one should firstly understand finite automatas concept.

screenshotRegular expression mapping example

Finite automata

Definition 2. Deterministic finite automata (DFA) is a set , where:

  • is a set of states
  • is the initial state
  • is the final states
  • is the alphabet
  • is function (or program) which defines the behaviour of automata. It is defined as:
    .

shows the behaviour of automata: if automata is currently in the given state , and if it reads the character , it moves to the next state , where is again applied.

When the string consisting from characters is “fed” to the automata, it is being processed according to the function , and the following mutually exclusive results may occur:

  • on some processing stage automata cannot move according to , because the respective transition is
    not defined (or defined as a transition to empty state)
  • the string has been processed, and automata is not in any of final states
  • the string has been processed, and automata is in one of final states

In the last case, automata allows the string, in the first two it does not.

DFA can be easily and conveniently represented in a form of oriented graph, where vertexes stand for states, and edges stand for transitions according to the function, with corresponding symbols being their additional values. Initial state and final states can be marked.

screenshotDeterministic finite automata example ( here basically means no element, so it is basically , not )

Definition 3 (independent on 2). Non-deterministic finite automata (NFA) is a set , where:

  • is a set of states
  • is the initial state
  • is the final states
  • is the alphabet
  • is the function: , is boolean of , or set of all
    its subsets

The main difference here is that NFA can move to more than one state for the given pair (state; symbol), sometimes simultaneously, and sometimes choosing some of such states, e.g. with the given probability distribution.

What about -transitions with no symbol?

I have not investigated the second topic, but
I guess that non-deterministic automatas are frequently faced in Quantum Mechanics in a way similar to this: the possible positions of electron in atom can be different (and there can be infinite amount of them depending whether Plank length exists or not, perhaps), and the next time we observe the electron ( its automata “next state”) it can be in any of them but with different probabilities. Here we just add the probability concept to non-deterministic automatas.

I want to concentrate on the first approach, where, shortly speaking, NFA is DFA updated to the point that makes simultaneous different transitions possible, when the automata will be in multiple states at the same operation stage.
The behaviour here while “feeding” the string into automata is similar to DFA, but with the exception that in this case we have more than one current state, and thus there is a separate structure representing all current states, and each automata “instance” corresponding to each of them is updated independently for each of them on each symbol reading stage, such that the amount of current states can rapidly grow (but also decrease).

Automata allows the string if there are any current states in the end of reading and at least one of them is marked final – this means there is a possibility (ways are mutually exclusive) of approving the string.

screenshotNon-deterministic finite automata example

Implementation of regular expressions based on finite automatas

Basically, regexes just show mapping of one character string with special characters (metacharacters) to a set of strings where there are no metacharacters anymore (basically, there can be the same characters if alphabet includes them but they will not have any special meaning).

Similarly, automatas just represent the system of states, and each automata allows some set of strings, rejecting others (but there is no notion of metacharacters there).

Combining these two concepts we may feel, that regexes can be implemented based on NFAs representing the same strings these regexes represent, without considering these strings themselves – and the last one is very important. We will also see how these automatas can perform regex matching.

What about a mathematical proof?

Basically, each complicated automata can be built on the simpler and simpler ones, corresponding to elementary constructs in regular expressions.

Such elementary constructs are shown below (although they can be applied to large expressions as well) and they are quite understandable (the sine-marked lines represent -transitions).

screenshotRegular expressions automata

There can be different ways of transforming the regex into such structure, but one of the most direct and easy ones is postfix stack calculation. According to such an approach, the regex is firstly transformed from infix to postfix form satisfying the following:

  • all s are removed
  • all blank symbols (not related to regex itself) are removed
  • all cases of concatenation like are made explicit by concatenation operator , which is binary metasymbol which should normally be the one not occurring in normal text (although I am referring to it as to in this text)
  • all binary operators are placed after their operands
    I will refer to this form as normalized postfix form.
For instance, $$a*(b c)abc *~$$ etc.

Any postfix expressions can be easily parsed based on stack in the following manner (the discussion of stack concept is beyond the scope of this post, but I want to note that stack is a data structure following LIFO principle – “last input first output”):

  • if the next character is operand, it is placed in the stack
  • if the next character is unary operator, the highest value from stack is taken, transformed and
    placed back
  • if the next character is binary operator, the same rule applies to two highest values from stack,
    keeping the order

During the applying of each transformation the structure of current automata is updated as shown on the picture above, considering half-finished automatas in the stack as substructures to operate on.

After reading of regex expression the last value taken from the stack will provide the complete structure of automata.

Parallel matcher

Now the above-described implementation of NFA for the given regex can obtain strings testing for matching this regex and either allow or reject them.

Basically, another data structure representing current positions each pointing to a particular state can be proposed. For each symbol being read each of all current positions will either “die” (if no transitions from its state are available for the given symbol) or move to another state if there is a respective transition, leaving its former state. If the state reached has -transitions, the position will clone itself for each of them and continue that recursively until no -transitions are available, thus reaching as high amount of possibilities as possible, pardon for tautology.

The results can be as follows:

  • If all positions “die” before the string end is reached, automata fails to match string.
  • If the string end is reached, and at least one of positions point to one of the final states,
    automata matches the string (calculating them may also say the possible amount of different ways of
    matching the same string (?))
  • Otherwise automata does not match the string

Basically, this is a simple NFA-matching of string with -transition rule added.
Also, NFA can match strings in other ways, but most of them will be sequential and therefore much slower than this algorithm.

Source code explanation

The program I wrote is available here.
It has clear and understandable source and looks like it works just great, performing all main advantages of the algorithm above, although I still have a lot of things to improve (see TODO file).

It is organized as follows:

  • main.c file code (in the main()):
    • reads one regular expression from command line together with all strings to match
    • it invokes functions updating the regex to normalized postfix form as expected by the NFA
    • it builds NFA based on this regex (only once)
    • for each of strings it invokes parallel matcher, printing the string if it matches the regex pattern, cleaning all left positions afterwards (since the same prebuilt form is matching other strings)
  • in2post.c file is updating the input expression:
    • icheck() checks the input expression for basic errors or unknown symbols
    • rmblank() removes all redundant blank symbols (which are not related to regex)
    • concat() performs adding the concat operator described before (it is encoded as character code
      1, unprintable ASCII character, since 0 is reserved for end of lines by C language conventions)
    • mkpar() adds parentheses around each expression where binary operation is performed, to prepare such expressions for transformations to the postfix form
    • in2post() actually changes the order inside such expressions, starting from the deepest nesting
    • rmpar() finally removes all parentheses, because the postfix expression does not need them by its nature
  • nfa.c file contains main implementation of NFA and all corresponding structures
    • mknfa() is stack-based implementation according to the algorithm described above.
      It produces the automata with only one final state.
    • different small functions perform partial NFA building according to the picture above
    • some functions perform allocation of NFAs, states and transitions
    • prnfa() prints NFA content to standard output, invoking recursive prst() which moves from the given set down by all connections in the graph – and infinite loops are prevented by locking this state from subfunctions code by marking it visited by setting the special flag in its structure. The state is unlocked after printing (the same principle can be applied for any similar
      recursive moving through the NFA or actually any similar structure).
  • match.c checks input string for the given NFA:
    • match() is the main function here, which creates the position for the first state, and then for each string character being read loops through available positions and moves them further if there are any transitions corresponding to the current character, by invoking visit(). If there are no positions on some iteration, as well as if none of positions remaining after all iterations is pointing to the final state, then the string cannot be matched. If some of positions point to the final state, the string is matched.
    • visit() is recursive function similar to prst() by its locking idea, but it allocates another position for each visited state and moves not through all transitions, but only through the empty ones, covering the highest possible amount of states each able to be reached from another one with only -transitions.
    • other functions allocate and show positions

I wrote the following simple recursive formula for regex language, assumed by functions in infix-postfix part:

which means that expression can be either literal (character), another expression in , unary operation call to expression or binary operation call to two expressions. From this description it is also seen that the first element – leaf – of each expression is either literal or and the last element is either literal, unary operator or .

Here are some handnotes and screenshots.
screenshot postfix form and automata
screenshot matching strings
screenshot matching strings screenshot matching strings

Conclusions

Basically, the following points should be mentioned:

  • regular expressions are character sequences representing the mapping of one string into multiple (sometimes, infinite amount) strings following regular pattern
  • finite automatas are discrete structures, representing states and transitions between them depending on input for each state, which allow some of strings consisting from alphabet
    symbols and reject others
  • due to this similarity of being equivalent to string set, automatas and regexes are also equivalent to each other and can replace each other
  • non-deterministic finite automata which can be in multiple states simultaneously allow for parallel, much faster (in orders I suppose) string matching than DFA


Now some set of tasks (for me as well):

  • prove the equivalence of automatas and regular expressions.
  • can you reduce all -transitions from and glue states in finite automata (do not look at picture hints! :))? Why or why not?
  • can you propose an algorithm for simplification of each regex (before automata is built on the simplified regex), similar to that of minimization of logical functions with DNF or Karno maps?
  • can you propose an algorithm for generation of regexes based on the given finite automata (the reverse task)?
  • how can Bayess formula (or other mathematical construct) be used in probability-based NFAs to consider possible ways of system behaviour given its initial state and pathway to it (this question may be quite silly, I have not investigated these topics yet)?
  • can you write a program (or its algorithm and data structures) which simulates the behaviour of probability-based NFA (so that for each function value there is a distribution of probabilities between states)? Computers only support pseudo-random numbers, if they do not rely on quantum effects, nevertheless you can simulate so that you will feel no difference.
  • imagine that you have a finite automata for a few million nodes (if they are built for regexes generated automatically to parse exceedingly large data or not for regexes at all), propose generic ways of simplification of its design or matching algorithm you can think about.