Constructing fast lexical analyzers with RE/flex – why another scanner generator?

Introduction

In this article I will give a brief introduction to the RE/flex lexical analyzer generator. The RE/flex open source project was motivated by the possibility of building a scanner generator on an entirely different approach to tokenization, one that permits regex libraries to be used in the generated scanners. This provides additional flexibility to use a rich regex syntax and a choice of POSIX-mode (and even Perl-mode) regex engines. A second motivation was the challenge of devising an algorithm to construct deterministic finite automata (DFAs) for efficient pattern matching with lazy quantifiers (a.k.a. non-greedy repeats, lazy repeats, reluctant repeats). This algorithm is also new.

Why RE/flex?

RE/flex is a new fast and flexible regex-based scanner generator for C++. It shares many similarities with Flex (the fast lexical analyzer) and Lex, but is fundamentally different in its overall design, which is based on one simple idea. This idea makes RE/flex faster than Flex and enables a more expressive regex syntax in lexer specifications. It also adds flexibility by permitting other regex libraries to be used, such as the Boost.Regex and RE/flex regex engines, and perhaps other libraries in future releases.

The basic idea is that any regex engine can in principle be used to tokenize input: given a set of n patterns p1, …, pn, a regex of the form “(p1)|(p2)|…|(pn)” can be used to match and tokenize the input. The capture group index i returned by the regex matcher identifies the pattern pi that matched. Note that if a pattern itself uses a group (X), that group must first be converted to a non-capturing group (?:X). The generated scanner then uses repeated partial matching to advance over the input:

int Lexer::lex()
{
  if (!has_matcher())
    matcher("(p1)|(p2)|...|(pn)"); // instantiate the matcher engine once
  while (true)
  {
    if (... EOF reached ...)
      return 0;
    switch (matcher().scan()) // scan input, return the capture index of the match
    {
      case 1:
        ... // action for pattern p1
        break;
      case 2:
        ... // action for pattern p2
        break;
      ...
      default:
        ... // no match: default rule
    }
  }
}

RE/flex is built on this basic idea: the generated scanner source code is structured this way and is therefore very different from Flex-generated code.

A very fast DFA for matching is generated with the reflex --fast option. The DFA is emitted as direct code instead of tables and may run faster than a Flex-based scanner. For example, the following direct-coded DFA generated by reflex for the pattern \w+ matches words very efficiently:

void reflex_code_FSM(reflex::Matcher& m)
{
 int c0 = 0, c1 = c0;
 m.FSM_INIT(c1);
S0:
 c0 = c1, c1 = m.FSM_CHAR();
 if (97 <= c1 && c1 <= 122) goto S5;
 if (c1 == 95) goto S5;
 if (65 <= c1 && c1 <= 90) goto S5;
 if (48 <= c1 && c1 <= 57) goto S5;
 return m.FSM_HALT(c1);
S5:
 m.FSM_TAKE(1);
 c0 = c1, c1 = m.FSM_CHAR();
 if (97 <= c1 && c1 <= 122) goto S5;
 if (c1 == 95) goto S5;
 if (65 <= c1 && c1 <= 90) goto S5;
 if (48 <= c1 && c1 <= 57) goto S5;
 return m.FSM_HALT(c1);
}
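Such an FSM is emitted directly into the generated scanner source when the --fast option is used; for example (the specification file name is just a placeholder):

reflex --fast mylexer.l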

Another advantage of RE/flex is the rich regex pattern syntax offered by Boost.Regex and by the new RE/flex regex library that is included with the scanner generator. The latter supports Unicode, indent/nodent/dedent anchors, lazy quantifiers, word boundaries, and more.

How fast is RE/flex?

RE/flex is faster than Flex, even when Flex is used with its best performance options. RE/flex is also faster than Boost.Spirit.Lex, and much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2, and RE2. For example, tokenizing a representative C source code file into 244 tokens takes only 13 microseconds:

Command / Function                                         Software                   Time (μs)
reflex --fast                                              RE/flex                           13
flex -+ --full                                             Flex                              17
reflex --full                                              RE/flex                           29
boost::spirit::lex::lexertl::actor_lexer::iterator_type   Boost.Spirit.Lex                  40
reflex -m=boost-perl                                       Boost.Regex (Perl mode)          230
pcre2_match()                                              PCRE2 (pre-compiled)             318
reflex -m=boost                                            Boost.Regex (POSIX mode)         450
flex -+                                                    Flex                            3968
RE2::Consume()                                             RE2 (pre-compiled)              5088
std::cregex_iterator()                                     C++11 std::regex               14784

The table shows the best of 10 test times, where each time is the average in microseconds over 100 runs (using clang 8.0.0 with -O2, 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3).

To create this comparison, I used RE/flex option --regexp-file to generate the regex string and used it with each regex library to tokenize a string of C source code stored in memory (avoiding file reads). The regex string combines n=57 patterns in the form “(p1)|(p2)|…|(pn)”, where each pi matches a C/C++ token lexeme. Inner groups are rewritten as non-capturing groups (?:…), so that the capture group index identifies the token matched.
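The following minimal sketch illustrates this scheme with a plain regex library. The three toy patterns and the use of std::regex are mine for illustration; the benchmark used the generated 57-pattern regex string instead:

#include <cstddef>
#include <iostream>
#include <regex>
#include <string>

int main()
{
  // three toy patterns: (1) identifier, (2) number, (3) punctuation
  std::regex tokens("([A-Za-z_][A-Za-z0-9_]*)|([0-9]+)|([,;{}()])");
  std::string input = "foo 42;";
  auto end = std::sregex_iterator();
  for (auto it = std::sregex_iterator(input.begin(), input.end(), tokens); it != end; ++it)
    for (std::size_t i = 1; i < it->size(); ++i)
      if ((*it)[i].matched) // the first non-empty capture group identifies the token
        std::cout << "token " << i << ": " << it->str() << "\n";
}

Note that a real scanner must match contiguously from the current position (RE/flex uses scan() for this), whereas std::sregex_iterator silently skips over non-matching input.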

Compatibility of RE/flex with Flex and Bison/Yacc

When designing RE/flex around the aforementioned approach, I wanted to make sure that the project is compatible with Bison and Yacc parsers. These tools are popular and typically come pre-installed on Linux and Unix distributions. I also wanted to make sure that RE/flex accepts standard Flex specifications. This means that if you are already familiar with Flex or Lex, then you will find RE/flex easy to use. To generate a yyFlexLexer class that has the same methods as a Flex++ yyFlexLexer class, use reflex option --flex.

RE/flex generates C++ scanners only. Using C++ permitted a significant number of additions to the lexer specification language and the introduction of many new methods that can be used in rule actions, as I will explain in the next section. You can still use RE/flex with C code, such as Bison-generated parsers that expect a global yylex() function. To do so, use the --flex and --bison command-line options:

reflex --flex --bison lexspec.l

RE/flex also supports options to build Bison reentrant parsers, and bison-bridge and bison-locations parsers. These parsers are MT-safe and manage the locations of syntax errors in cooperation with the lexer.
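For example (a sketch of the command lines, using the option names that match the parser flavors named above):

reflex --flex --bison-bridge lexspec.l
reflex --flex --bison-locations lexspec.l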

To get you started with RE/flex to generate parsers with Bison, you will find several examples in the RE/flex download package from GitHub and SourceForge.

The structure of a RE/flex lexer specification

The structure of a lexer specification is the same as in Flex, but with some additions. A lexer specification consists of three parts separated by two lines marked %%: the Definitions section, the Rules section, and the User code section.

  1. The Definitions section may also contain user code: a %top block with code that goes at the very top of the generated scanner source code, and the usual code between %{ and %} that is copied into the scanner’s source code. A %class block defines the lexer class member variables and functions. The %init block defines the lexer class constructor code for initializations. We can also add one or more %option lines to configure the scanner. Named patterns can also be defined in the Definitions section.
  2. The Rules section contains pattern-action pairs. Actions are written in C++ and are triggered when the pattern matches.
  3. The User code section at the end can be used to add any C++ code necessary. For stand-alone scanners, this typically defines a main() function. 

The following outlines the basic structure of a RE/flex lexer specification:



%top{
  #include <iostream>
%}

%{
  // code here is copied into the generated scanner's source
%}

%class{
 public:
  void a_public_function();
 private:
  std::string a_private_string;
%}

%init{
  a_private_string = "initialized";
%}

%option unicode

newline \n

%%

"foo"      { a_public_function(); }
"bar"      { a_private_string = text(); }
{newline}  // skip newline
.          { std::cerr << "unknown character at " << lineno() << ":" << columno(); }

%%

int main()
{
  Lexer lexer;
  while (lexer.lex() != 0)
    continue;
}

Patterns are regular expressions extended with Lex syntax: the named pattern {newline} is expanded into (?:\n), quoted text such as "foo" is matched literally, and a slash / (not shown) introduces a lookahead (trailing context) pattern. This example shows three built-in functions used in the rule actions: text(), which returns the 0-terminated matched text, lineno(), and columno(). The first two are the same as yytext and yylineno in Flex. Flex-compatible actions are enabled with option --flex.

The following table shows a few of the functions, from a long list, that can be used in rule actions:

RE/flex action   Flex action        Result
text()           YYText(), yytext   0-terminated text match
str()            n/a                std::string text match
wstr()           n/a                std::wstring text match
size()           YYLeng(), yyleng   size of the match in bytes
wsize()          n/a                number of wide chars matched
lineno()         yylineno           line number of match (>=1)
columno()        n/a                column number of match (>=0)
echo()           ECHO               echo the text match
start(n)         BEGIN n            set start condition to n
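For example, a rule action can combine several of these functions (a minimal sketch):

[0-9]+  { std::cout << "number " << str()
                    << " at " << lineno() << ":" << columno() << "\n"; }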

The classic Flex actions listed in the table can be used in a RE/flex lexer specification with reflex option --flex. Because the matcher object is accessible in rule actions via matcher(), you can also call its methods directly for added power. Here are some useful matcher() methods:

RE/flex action       Flex action   Result
matcher().input()    yyinput()     get next char from input
matcher().winput()   n/a           get wide character from input
matcher().unput(c)   unput(c)      put back 8-bit char
matcher().more()     yymore()      concatenate the next match to this match
matcher().less(n)    yyless(n)     shrink match length to n

These are only examples. For a complete list, see the RE/flex user manual.
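For example, a hypothetical rule can use matcher().input() to consume the remainder of the line after a match, assuming input() returns EOF at the end of input, like yyinput():

"#"  { // skip a #-comment by reading up to the end of the line
       for (int c = matcher().input(); c != EOF && c != '\n'; c = matcher().input())
         continue; }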

Unicode pattern matching

A scanner generator that cannot handle Unicode is about as useful as a typewriter for texting on a mobile phone. Most modern programming languages have Unicode lexical structures that require Unicode-capable scanners to construct parsers and compilers. At the very least, the widely-used UTF-8 encoding format should be natively supported. Scanner generators produce lexers that scan files (in addition to 8-bit strings and perhaps wide strings), which makes them different from regex engines (many regex libraries can only be applied to strings and wide strings held in memory). To scan files that contain Unicode wide characters, the files have to be decoded from UTF-8, UTF-16, or full UTF-32 (UCS-4) formats. This decoding is done automatically “on the fly” in RE/flex.

Unicode pattern matching is enabled with reflex command-line option --unicode or with %option unicode in a lexer specification:

%option unicode

With this option, the . (dot), \w, \s, \l, \u, \u{XXXX}, and \p{C} patterns match Unicode. These match, respectively, any Unicode character (except newline \n), a Unicode word character, Unicode white space, a Unicode lower case letter, a Unicode upper case letter, the Unicode character U+XXXX, and the Unicode character class C.

\s+          // skip Unicode white space
\w+          { std::cout << text() << "\n"; }
\p{Unicode}  { std::cout << "U+" << std::hex << wstr().at(0) << "\n"; }
.            { std::cerr << "invalid Unicode!\n"; }

This lexer removes all white space from a file and prints words that are composed of Unicode letters, numerical digits, and connector punctuation such as underscores. The matched text is printed in UTF-8 format to std::cout with text() (which is the same as yytext). Any other Unicode character is printed as a U+XXXX code using the first character of the wide string wstr() match. Finally, the . (dot) pattern catches all else, including invalid characters that are outside of the valid Unicode ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. When UTF-8 or UTF-16 files are scanned, an invalid UTF encoding such as an overlong form is also matched by . (dot).

To match any valid Unicode character, I recommend using \p{Unicode}. Use . (dot) only for catch-all-else cases as needed. One such case is matching invalid content, as shown in the example above, to prevent the scanner from jamming on its input when nothing valid matches. For this reason, . (dot) is designed to be permissive: it matches valid Unicode, but also matches invalid UTF-8 and UTF-16 sequences in order to catch lexical errors.

Indent, nodent, and dedent matching

RE/flex takes a direct approach to matching indent, nodent, and dedent positions in text by introducing two new anchors: \i for indent and \j for dedent.

To match and set an indent stop position on a new line with margin spacing matched by ^[ \t]+, simply add the \i anchor at the end: ^[ \t]+\i. This sets and matches a new indent stop when the margin spacing is increased.

To match a dedent when the margin spacing is decreased, use the \j anchor: ^[ \t]+\j. This matches a previous indent stop and removes it.

Because a larger decrease in margin spacing may remove multiple indent stop positions at once, we should also add a pattern with a single \j anchor to match each extra indent stop. This matches extra dedents that occur on the same line, one at a time:

^[ \t]+    // nodent: text is aligned to the current margin
^[ \t]*\i  // indent: sets a new indent stop
^[ \t]*\j  // dedent: removes the matching indent stop
\j         // dedent: matches each extra indent stop removed

Notice that matching a “nodent” line, which is aligned to the current margin, is done with a pattern that lacks the \i and \j anchors.

Let’s see how this works. Say our text to scan contains the following:

def abs(n):
    if n < 0:
        n = -n
        return n
    else:
        return n

Then the four patterns we defined are matched in this order: 2, 2, 1, 3, 2, 3, 3, where the final two dedents are matched when the remaining indent stops are removed at the end of the input. Note that the first line has no indent margin and therefore sets no indent stop.

You are not restricted to using white space, such as [ \t]+, to define margin spacing. The margin spacing can be defined to be anything, not just spaces and tabs, as long as the pattern does not include a newline character \n.
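For example, a hypothetical margin that also counts '>' quote marks, as used in email replies, could define the indent and dedent rules as:

^[ \t>]*\i  // indent: margin of spaces, tabs, and '>' marks
^[ \t>]*\j  // dedent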

It may be necessary to temporarily memorize the indent stops when scanning text that spans multiple lines with patterns that should not change the current indent stop positions, such as line continuations, multi-line comments, and expressions that require “line joining” as in Python. This can be done by saving the current indent stops with matcher().push_stops() and retrieving them again with matcher().pop_stops().

For example, to scan a multi-line comment over multiple (indented) lines without affecting the current indent stops, we should save the current stops and transition to a new start condition state CONTINUE:

"/*" matcher().push_stops(); 
 BEGIN CONTINUE; 
<CONTINUE>{
"*/" matcher().pop_stops(); 
 BEGIN INITIAL; 
.|\n 
}

We typically define start condition states, such as CONTINUE in this example, when scanning text that is matched by a different set of patterns and rules. The matcher().push_stops() method saves the current indent stops and then resets them. The stop positions can also be accessed directly with matcher().stops(), which returns a reference to a std::vector<size_t> of positions ordered from low to high (this order must be maintained).
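For example, a hypothetical action can inspect the stops vector directly:

^[ \t]*\i  { std::cout << "indent stops:";         // hypothetical: dump the
             for (size_t pos : matcher().stops())  // current stop columns
               std::cout << " " << pos;
             std::cout << "\n"; }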

There is a quick shortcut for matching multiple lines of text when the text has to be ignored by the scanner, such as multi-line comments and line continuations. To implement line continuations after a \ (backslash) placed at the end of a line, we must ignore the indentation on the next line. Similarly, C-style multi-line comments matched with the pattern "/*"(.|\n)*?"*/" (using a lazy repeat (.|\n)*?) should be ignored and should not affect the current indent stops. To do this, we use the RE/flex (?^X) “negative pattern”, which consumes the matched text without triggering any actions, making the text invisible:

(?^"/*".*?"*/") 
(?^\\\n[ \t]*) 

The negative pattern (?^X) is a RE/flex-specific addition.

Lazy repeats are nice, but also naughty

In the previous section we saw how to use a lazy repeat (.|\n)*? to match C-style multi-line comments with "/*"(.|\n)*?"*/". The extra ? is called a lazy quantifier and changes the greedy repeat * into a non-greedy lazy repeat *? (also called a reluctant repeat). Why is a lazy repeat needed here? Note that the pattern (.|\n) matches any character, since . (dot) matches any character except newline \n. The problem is that the greedy repeat (.|\n)* in the pattern "/*"(.|\n)*"*/" consumes all matching text, including */, so the match extends to the last */ in the input before EOF. The usual workaround for greedy repeats is to prevent the repeated pattern from matching */, as in "/*"([^*]|\*+[^/])*"*/". By contrast, lazy repeats are simpler and more elegant.

The lazy quantifier ? can be used to make the following greedy patterns lazy:

X??Y       matches zero or one X, as few as possible, followed by Y
X*?Y       matches zero or more X's, as few as possible, followed by Y
X+?Y       matches one or more X's, as few as possible, followed by Y
X{n,m}?Y   matches n to m X's, as few as possible, followed by Y

Take for example the pattern (a|b)*?bb. The texts abaabb and bb match this pattern, but bbb does not (actually, there is a partial match, since the leading bb matches the pattern, so the third b remains on the input for the next match by a scanner).

What about a pattern such as a*? that has no continuing pattern Y? It matches nothing. When the pattern Y permits an empty match, or matches the same text as X, laziness kills X. To make lazy repeats behave nicely, make sure they are followed by a pattern that matches non-empty text, or make sure they are properly anchored. For example, a*?$ matches a series of a's up to the end of the line. However, in this case laziness is not required and the greedy pattern a*$ works just fine.

Lazy repeats are not available in POSIX matchers. The concept was introduced by Perl, whose matching is regex-directed: it backtracks over the regex (or over an equivalent intermediate form compiled for the engine). POSIX matchers are text-directed and may use a deterministic finite automaton (DFA) to significantly speed up matching. A DFA-based matcher traditionally lacks lazy quantifiers, lookahead/lookbehind, and group captures, because a DFA cannot dynamically evaluate the regex context in which this information applies.

RE/flex adds lazy quantifiers to its POSIX matcher using a new DFA construction algorithm (technical publications are pending). A lazy repeat effectively cuts edges from the DFA, but may also lengthen the DFA to permit partial matches until the accept state is reached. Compare the DFA constructed for the greedy regex (a|b)*bb with the DFA constructed for the lazy regex (a|b)*?bb: the lazy DFA matches any a's and b's only until the first bb is matched, whereas the greedy DFA keeps going to match the last bb.
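The difference can be sketched as a direct-coded DFA for the lazy pattern (a|b)*?bb, in the style of the \w+ example above (this sketch is illustrative, not actual reflex output):

// Sketch of a direct-coded DFA for the lazy pattern "(a|b)*?bb"
void sketch_lazy_FSM(reflex::Matcher& m)
{
  int c1 = 0;
  m.FSM_INIT(c1);
S0: // scanning a's and b's, no 'b' pending
  c1 = m.FSM_CHAR();
  if (c1 == 97) goto S0;   // 'a'
  if (c1 == 98) goto S1;   // 'b'
  return m.FSM_HALT(c1);
S1: // one 'b' seen
  c1 = m.FSM_CHAR();
  if (c1 == 97) goto S0;   // 'a'
  if (c1 == 98) goto S2;   // 'b': we have "bb"
  return m.FSM_HALT(c1);
S2: // accept at the first "bb"; a greedy DFA would instead keep the
  m.FSM_TAKE(1);           // edges 'a' -> S0 and 'b' -> S2 here
  return m.FSM_HALT(c1);
}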

Using the smart input class to scan UTF files

We often have to scan files with text encoded in various popular formats such as UTF-8, UTF-16, UTF-32, ISO-8859-1, or EBCDIC. To support multiple input file formats, the RE/flex reflex::Input class converts the input format “on the fly” while scanning, using a smart buffering technique. This buffering does not require the entire file to be read first, which permits the implementation of streaming lexers and regex matchers. The streaming converter detects a UTF byte order mark (BOM) in the file to enable automatic UTF-8/16/32 conversion, normalizing the text to UTF-8 for matching. In fact, when Unicode matching is enabled with %option unicode, the input should be UTF-encoded in this way.

If the input file is encoded in EBCDIC, ISO-8859-1, or extended Latin-1, then you should determine this in your code and set the corresponding decoding flag to use the file with a RE/flex lexer, for example as follows:

#include "lex.yy.h" // generated with reflex --flex --header-file

int main()
{
  FILE *fd = fopen("iso-8859-1-example-file.txt", "r");
  if (fd != NULL)
  {
    // decode the file from ISO-8859-1 (Latin-1) while scanning
    reflex::Input input(fd, reflex::Input::file_encoding::latin);
    yyFlexLexer lexer(input);
    while (lexer.yylex() != 0)
      continue;
    fclose(fd);
  }
}

Performance tuning

There are several options to debug a lexer and analyze its performance on the input it scans. Use reflex command-line option -d to generate a scanner that prints the matched text. Use reflex command-line option -p to generate a scanner that prints a performance report. This report shows the performance statistics obtained for each rule in the lexer specification. The statistics are collected at run time, while the scanner processes the input you want to profile.
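For example (a sketch of the command lines; the json.l example below uses Flex-compatible actions, hence --flex):

reflex -d --flex json.l    # debug: the scanner prints the matched text
reflex -p --flex json.l    # instrument the scanner for a performance report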

The reflex -p option allows you to fine-tune the performance of your scanner by focusing on the patterns and rules that turn out to be computationally expensive. This is perhaps best illustrated with an example. The JSON parser json.l, located in the examples directory of the RE/flex download package, was built with this option and produced the following report for some JSON input:

reflex 0.9.22 json.l performance report:
  INITIAL rules matched:
    rule at line 51 accepted 58 times matching 255 bytes total in 0.009 ms
    rule at line 52 accepted 58 times matching 58 bytes total in 0.824 ms
    rule at line 53 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 54 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 55 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 56 accepted 5 times matching 23 bytes total in 0.007 ms
    rule at line 57 accepted 38 times matching 38 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  STRING rules matched:
    rule at line 60 accepted 38 times matching 38 bytes total in 0.021 ms
    rule at line 61 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 62 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 63 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 64 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 65 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 66 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 67 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 68 accepted 314 times matching 314 bytes total in 0.04 ms
    rule at line 69 accepted 8 times matching 25 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  WARNING: execution times are relative:
    1) includes caller's execution time between matches when yylex() returns
    2) perf-report instrumentation adds overhead and increases execution times

The timings shown include the time of the pattern match plus the time of the code executed by the rule. If the rule returns to the caller, then the time spent in the caller between matches is also included in this timing. For this example, we have two start condition states, namely INITIAL and STRING. The rule at line 52 is:

[][}{,:] { return yytext[0]; }

This returns a [ or ] bracket, a { or } brace, a comma, or a colon to the parser. Since the pattern and rule are very simple, we do not expect these to contribute much to the 0.824 ms time spent on this rule. The parser code executed when the scanner returns could be expensive. Depending on the character returned, the parser constructs a JSON array (bracket) or a JSON object (brace), and populates arrays and objects with an item each time a comma is matched. But which is most expensive? To obtain timings of these events separately, we can split the rule up into three similar rules:

[][] { return yytext[0]; }
[}{] { return yytext[0]; }
[,:] { return yytext[0]; }

Then we use reflex option -p, recompile the scanner lex.yy.cpp, and rerun our experiment to see the new timings:

    rule at line 52 accepted 2 times matching 2 bytes total in 0.001 ms
    rule at line 53 accepted 14 times matching 14 bytes total in 0.798 ms
    rule at line 54 accepted 42 times matching 42 bytes total in 0.011 ms

So it turns out that the construction of a JSON object by the parser, which happens when { and } are encountered in the input, is relatively speaking the most expensive part of our application. We should focus our optimization efforts there if we want to improve the overall speed of our JSON parser.

The JSON lexer and parser source code used in this example are available for download with this article.

Interactive input with readline

Shells, interpreters, and other interactive command-line tools allow users to enter commands that are interpreted and executed by the application. The GNU readline library greatly enhances the interactive experience by offering a history mechanism and line-based editing. By contrast, reading standard input directly gives users a cumbersome interactive experience.

Using readline with RE/flex is a good example of how the Flex-like wrap() function can be used to our advantage. The wrap() function is invoked when EOF is reached, permitting a new input source to be assigned to continue scanning. The lexer specification given here first includes the required C header files in the %top block, then defines the lexer class wrap() function in the %class block, and the lexer class constructor code in the %init block:

%top{
  #include <stdlib.h>
  #include <stdio.h>
  #include <readline/readline.h>
  #include <readline/history.h>
%}

%class{
  const char *prompt;

  // invoked when EOF is reached, to read a new line of input with readline()
  virtual bool wrap()
  {
    if (line)
    {
      free((void*)line);
      line = readline(prompt);
      if (line != NULL)
      {
        if (*line)
          add_history(line);
        linen.assign(line).push_back('\n');
        in(linen);
      }
    }
    // true: another line of input is available to continue scanning
    return line != NULL;
  }

  char *line;        // the last line returned by readline()
  std::string linen; // the line with a \n appended, used as the lexer input
%}

%init{
  prompt = NULL; // no prompt; readline(NULL) displays none
  line = readline(prompt);
  if (line != NULL)
  {
    if (*line)
      add_history(line);
    linen.assign(line).push_back('\n');
  }
  in(linen);
%}
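To complete this into a stand-alone scanner, a minimal Rules and User code section could be appended as follows (the echo rule is just a placeholder):

%%

.+  echo();  // placeholder rule: echo each line entered
\n  // skip the newline appended in wrap()

%%

int main()
{
  Lexer lexer;
  while (lexer.lex() != 0)
    continue;
}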

A readline-based example lexer similar to this example but using Flex actions is available for download with this article.

Conclusions

RE/flex is not merely a C++ version of Flex. It adds many useful features that Flex lacks, including critically required Unicode pattern matching, file input encoded in UTF-8/16/32 (with or without a BOM), ASCII, ISO-8859-1, or EBCDIC, lazy repeats in POSIX mode, and the generation of direct-coded DFAs that may be faster than Flex scanners.

Further reading

To learn more about RE/flex I encourage you to read Constructing Fast Lexical Analyzers with RE/flex and the RE/flex user guide.

Download & License

RE/flex is available for download from SourceForge and GitHub. RE/flex is licensed under the open source BSD-3 license.

History

April 25: First version of this article.

April 30: Some cosmetic updates.

May 16: Minor updates for clarification.
