In ‘Thought 2: Regex is Like Assembly’ I wondered why we still write regex in this hard-to-understand, symbolic way when we have already invented high-level programming languages. There is no reason regex can’t be written as clearly as any other programming language we use today.
I thought doing this would be an interesting project, so I came up with Regexl, a high-level language for writing regex that can be used as a simple library. The core design philosophy is to create something that’s as simple as possible without losing understandability or power.
I’ll start with an introduction to the Regexl language and then give an overview of its technical architecture.
Below are a few examples:
/friend/i
is equivalent to the regexl:
select 'friend'
/^friend/i
is equivalent to the regexl:
// This is a regexl comment.
// This set_options configuration is equivalent to: '/i'
set_options({
case_sensitive: false,
})
select starts_with('friend')
/Hello*/g
is equivalent to the regexl:
set_options({
find_all_matches: true,
})
//-- This '--' is to help the syntax highlighter :)
//-- The '+' performs a simple concatenation, as all functions return strings
select 'Hell' + zero_plus_of('o')
/^Golang$/i
is equivalent to the regexl:
set_options({
case_sensitive: false,
})
//-- Functions can be nested, as outputs are strings.
//-- Alternative regexl: select starts_and_ends_with('Golang')
select ends_with(starts_with('Golang'))
/[abcd]/ig
(match any of these 4 letters) is equivalent to the regexl:
set_options({
find_all_matches: true,
case_sensitive: false,
})
//-- Can also be: select any_chars_of('abcd')
select any_chars_of('abc', 'd')
/[A-Z0-9]/ig
(match letters and numbers only) is equivalent to the regexl:
set_options({
find_all_matches: true,
case_sensitive: false,
})
//-- from_to('A', 'Z') converts to the range A-Z inside the character class
select any_chars_of(from_to('A', 'Z'), from_to(0, 9))
/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,10}/i
(a ‘simple’ email regex) is equivalent to the regexl:
set_options({
case_sensitive: false,
})
select
//-- Converts to: [A-Z0-9._%+-]+
one_plus_of(
any_chars_of(from_to('A', 'Z'), from_to(0, 9), '._%+-')
) +
//-- Converts to: @
'@' +
//-- Converts to: [A-Z0-9.-]+
one_plus_of(
any_chars_of(from_to('A', 'Z'), from_to(0, 9), '.-')
) +
//-- Converts to: \.
'.' +
//-- Converts to: [A-Z]{2,10}
count_between(
any_chars_of(from_to('A', 'Z')),
2,
10
)
I haven’t yet added all the functionality of regex, but doing so should be no more complex than adding support for a few more functions.
Regexl compiles into a normal regex string like /hello/ig, which is then given as input to the standard regex system of your programming language.
This means Regexl can be easily incorporated into existing projects.
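As a rough illustration of that last point, here is a minimal Go sketch that hands an already-compiled regex string to the standard regexp package. It does not call the Regexl library itself. The string is the ‘simple’ email regex from above, written by hand in Go syntax (the /i option becomes the inline (?i) flag), and the names compiledEmail and findEmail are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiledEmail stands in for the string a Regexl query would compile to.
// It is written by hand here (an assumption for illustration). Go has no
// /.../i literal syntax, so case-insensitivity is the inline (?i) flag.
const compiledEmail = `(?i)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,10}`

// findEmail returns the first email-like substring of text, or "" if
// there is none. It uses only Go's standard regexp package.
func findEmail(text string) string {
	re := regexp.MustCompile(compiledEmail)
	return re.FindString(text)
}

func main() {
	fmt.Println(findEmail("Contact: Jane.Doe@Example.org today"))
}
```

Because the result is an ordinary regex string, the rest of the program keeps using the regex functions it already uses.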
Technical Details
The Regexl code is that of a very simple compiler, where the general steps involved are:
- Input query text is tokenized
- Tokens are used to create an Abstract Syntax Tree (AST)
- The AST is fed into a ‘backend’ that outputs a specific regex string (e.g. Go regex)
To explain the above, let’s look at how the following query is compiled:
select starts_with('hello')
By tokenization we mean turning the input string into higher level segments, where each segment is split by some separator like a space, a bracket, and so on. In the above query you will get the following tokens:
- Token value: select; Type: keyword
- Token value: starts_with; Type: function name
- Token value: (; Type: open bracket
- Token value: hello; Type: string
- Token value: ); Type: close bracket
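The tokenization step can be sketched in Go roughly as follows. This is not Regexl’s actual tokenizer: the Token type, the tokenize function and the type names are assumptions made for illustration, and the sketch handles only identifiers, brackets and single-quoted strings.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token pairs a piece of the query text with its type (hypothetical type).
type Token struct {
	Value string
	Type  string
}

// keywords lists the words this sketch treats as keywords.
var keywords = map[string]bool{"select": true}

// isWordRune reports whether r can appear inside an identifier.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
}

// tokenize splits a query into tokens, skipping whitespace.
func tokenize(query string) ([]Token, error) {
	var tokens []Token
	runes := []rune(query)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case unicode.IsSpace(r):
			i++
		case r == '(':
			tokens = append(tokens, Token{"(", "open bracket"})
			i++
		case r == ')':
			tokens = append(tokens, Token{")", "close bracket"})
			i++
		case r == '\'':
			// Scan to the closing quote; the quotes are not part of the value.
			end := i + 1
			for end < len(runes) && runes[end] != '\'' {
				end++
			}
			if end == len(runes) {
				return nil, fmt.Errorf("unterminated string at offset %d", i)
			}
			tokens = append(tokens, Token{string(runes[i+1 : end]), "string"})
			i = end + 1
		case unicode.IsLetter(r) || r == '_':
			// Read a whole word, then decide if it is a keyword.
			end := i
			for end < len(runes) && isWordRune(runes[end]) {
				end++
			}
			word := string(runes[i:end])
			typ := "function name"
			if keywords[word] {
				typ = "keyword"
			}
			tokens = append(tokens, Token{word, typ})
			i = end
		default:
			return nil, fmt.Errorf("unexpected character %q at offset %d", r, i)
		}
	}
	return tokens, nil
}

func main() {
	tokens, err := tokenize("select starts_with('hello')")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range tokens {
		fmt.Printf("Token value: %s; Type: %s\n", t.Value, t.Type)
	}
}
```

Running it on the example query yields the same five tokens listed above.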
With this list of tokens, an AST is created. An Abstract Syntax Tree represents the structure of a program as a tree, where each parent node depends on its child nodes. For example, if function A calls B, then this function call node becomes a child of A, and the arguments of this call are children of the function call node.
In our query, the linear token list produces this AST:
|-- select
| |-- starts_with
| | |-- hello
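To make the token-to-tree step concrete, here is a small Go sketch of a recursive parser. It is only an illustration under simplifying assumptions: Node, parse, parseExpr and render are hypothetical names, and the grammar covers just a select followed by a string or a single-argument function call, which is not Regexl’s real parser.

```go
package main

import (
	"fmt"
	"strings"
)

// Token mirrors the token list above (hypothetical type).
type Token struct {
	Value string
	Type  string
}

// Node is one AST node; Children holds the nodes it depends on.
type Node struct {
	Value    string
	Children []*Node
}

// parse builds the AST for queries of the form: select <expr>
func parse(tokens []Token) (*Node, error) {
	if len(tokens) == 0 || tokens[0].Type != "keyword" {
		return nil, fmt.Errorf("query must start with a keyword")
	}
	pos := 1
	expr, err := parseExpr(tokens, &pos)
	if err != nil {
		return nil, err
	}
	if pos != len(tokens) {
		return nil, fmt.Errorf("unexpected token %q", tokens[pos].Value)
	}
	return &Node{Value: tokens[0].Value, Children: []*Node{expr}}, nil
}

// parseExpr parses a string literal or a single-argument function call,
// advancing *pos past the tokens it consumes.
func parseExpr(tokens []Token, pos *int) (*Node, error) {
	if *pos >= len(tokens) {
		return nil, fmt.Errorf("unexpected end of query")
	}
	tok := tokens[*pos]
	*pos++
	switch tok.Type {
	case "string":
		return &Node{Value: tok.Value}, nil
	case "function name":
		if *pos >= len(tokens) || tokens[*pos].Type != "open bracket" {
			return nil, fmt.Errorf("expected ( after %s", tok.Value)
		}
		*pos++
		arg, err := parseExpr(tokens, pos) // arguments become children
		if err != nil {
			return nil, err
		}
		if *pos >= len(tokens) || tokens[*pos].Type != "close bracket" {
			return nil, fmt.Errorf("expected ) after argument of %s", tok.Value)
		}
		*pos++
		return &Node{Value: tok.Value, Children: []*Node{arg}}, nil
	}
	return nil, fmt.Errorf("unexpected token %q", tok.Value)
}

// render draws the tree in the same style as the diagram above.
func render(n *Node, depth int) string {
	var b strings.Builder
	b.WriteString(strings.Repeat("| ", depth) + "|-- " + n.Value + "\n")
	for _, c := range n.Children {
		b.WriteString(render(c, depth+1))
	}
	return b.String()
}

func main() {
	tokens := []Token{
		{"select", "keyword"},
		{"starts_with", "function name"},
		{"(", "open bracket"},
		{"hello", "string"},
		{")", "close bracket"},
	}
	root, err := parse(tokens)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(render(root, 0))
}
```

Because parseExpr calls itself for the argument, nested calls such as ends_with(starts_with('Golang')) produce deeper trees without extra code.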
With the AST in place, we can traverse the tree and generate some output. In normal programming languages (e.g. C, Go, Python) the final output would be machine code, assembly, or perhaps byte code to be interpreted.
In Regexl, the output is some specific regex like Go-compatible regex, Python-compatible regex, and so on (regex syntax and features differ between implementations).
The Go regex produced for our example Regexl query is:
(?i)^hello
Equivalent to the more common regex notation:
/^hello/i
The nice thing about this setup is that to support a new regex implementation all one has to do is implement a new backend (step 3), while tokenization and AST generation are reused as-is. Currently only a Go backend is implemented.
Regexl is open source. You can find the source code and documentation on GitHub, and play with it on the interactive playground.