Generating Tokenizers with Flat Automata

Hans de Nivelle
(School of Engineering and Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan)
Dina Muktubayeva
(School of Engineering and Digital Sciences, Nazarbayev University, Nursultan-City, Kazakhstan)

We introduce flat automata for automatic generation of tokenizers. Flat automata are a simple representation of standard finite automata. Using the flat representation, automata can be easily constructed, combined and printed.

Because they use border functions, flat automata are more compact than standard automata when intervals of characters are attached to transitions, and the standard algorithms on automata become simpler.
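The abstract does not define border functions, so the following is only a minimal sketch under one assumption: a border function stores just the characters at which its value changes, and the value at any character is the one attached to the greatest border not exceeding it. The names BorderFunction, set_border and digit_classifier are hypothetical and are not taken from the authors' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical illustration, not the API of the authors' library.
// Assumption: a border function is total on characters and is stored
// as the finite set of positions ("borders") where its value changes.
template <typename Value>
class BorderFunction {
    std::map<char32_t, Value> borders_;  // border -> value from that border upward

public:
    // A border at the smallest character (0) always exists,
    // which makes the function total.
    explicit BorderFunction(Value initial) {
        borders_[0] = std::move(initial);
    }

    // From character `from` upward (until the next border), the value is `value`.
    void set_border(char32_t from, Value value) {
        borders_[from] = std::move(value);
    }

    // Value at `c`: the value stored at the greatest border <= c.
    const Value& operator()(char32_t c) const {
        auto it = borders_.upper_bound(c);  // first border strictly greater than c
        assert(it != borders_.begin());     // holds because border 0 exists
        return std::prev(it)->second;
    }

    std::size_t size() const { return borders_.size(); }
};

// Example: the interval '0'..'9' is described by two borders plus the
// initial one, rather than by one table entry per character.
inline BorderFunction<std::string> digit_classifier() {
    BorderFunction<std::string> f("other");
    f.set_border(U'0', "digit");
    f.set_border(static_cast<char32_t>(U'9' + 1), "other");
    return f;
}
```

In this sketch a lookup costs one ordered-map search, and an interval of any width costs two borders. That is one way to read the compactness claim above, though the authors' actual data structure may differ.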

We give the standard algorithms for tokenizer construction with automata, namely construction using regular operations, determinization, and minimization. We prove their correctness. The algorithms work with intervals of characters, but are not more complicated than their counterparts on single characters. It is easy to generate C++ code from the final deterministic automaton. All procedures have been implemented in C++ and are publicly available. The implementation has been used in applications and in teaching.

In Pierre Ganty and Dario Della Monica: Proceedings of the 13th International Symposium on Games, Automata, Logics and Formal Verification (GandALF 2022), Madrid, Spain, September 21-23, 2022, Electronic Proceedings in Theoretical Computer Science 370, pp. 66–80.
An implementation of flat automata can be found at: www.compiler-tools.eu
Published: 20th September 2022.

ArXived at: https://dx.doi.org/10.4204/EPTCS.370.5