Regular Expressions#

Although the Fandango grammars cover a wide range of input language features, there are situations where they may be a bit cumbersome to work with. Consider specifying every digit except for zeros: this requires you to enumerate all the other digits 1, 2, and so on. This is why Fandango also supports regular expressions, which allow you to use a concise syntax for character ranges, repeated characters and more. Specifying all digits from 1 to 9, for instance, becomes the short regular expression r'[1-9]'.

About Regular Expressions#

Regular expressions form a language on their own and come with several useful features. To get an introduction to the regular expressions Fandango uses, read the Python Regular Expression HOWTO and check out the Python Regular Expression Syntax for a complete reference.

In Fandango, regular expressions are used for two purposes:

  • When producing inputs, a regular expression is instantiated into a random string that matches the expression.

  • When parsing inputs, a regular expression is used to parse and match inputs.

Writing Regular Expressions#

In Fandango, a regular expression comes as a string, prefixed with a r character. To express that a digit can have the values 0 to 9, instead of

<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

you can write

<digit> ::= r'[0-9]'

which is much more concise.

Likewise, to match a sequence of characters that ends in ;, you can write

<some_sequence> ::= r'[^;]+;'

Besides the r prefix indicating a regular expression, it also makes the string a raw string. This means that backslashes are treated as literal characters. The regular expression \d, for instance, matches a Unicode digit, which includes [0-9], and also many other digit characters. To include \d in a regular expression, write it as is; do not escape the backslash with another backslash (as you would do in a regular string):

<any_digit> ::= r'\d'

Important

Be aware of the specific syntax of r-strings as it comes to backslashes.

One consequence of backslashes being interpreted literally is that you cannot escape quote characters in a regular expression. This causes a problem if you need two kinds of quotes (" and ') in the same regular expression – say, a rule that checks for forbidden characters.

However, encodings of the form \xNN are also interpreted by regular expressions. Hence, if you need quotes, you can use

  • \x22 instead of "

  • \x27 instead of '

Here is an example:

<forbidden_characters> ::= r'[\x22\x27;]'

Fine Points about Regular Expressions#

For parsing inputs, Fandango uses the Python re module for matching strings against regular expressions; for producing inputs, Fandango uses the Python exrex module for generating strings that match regular expressions. All the re and exrex capabilities and limitations thus extend to Fandango.

Repetition Limits#

Most notably, exrex imposes a repetition limit of 20 on generated strings that in principle can have arbitrary length; a + or * operator will not expand to more than 20 repetitions. Thus, a grammar infinity.fan

<start> ::= r"(abc)+"

that in principle, could produce arbitrary long sequences abcabcabcabc... will be limited to 20 repetitions at most:

$ fandango fuzz -f infinity.fan -n 10
abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc
abcabcabcabcabcabcabcabcabcabcabc
abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc
abcabcabcabcabcabc
abcabcabcabcabcabcabcabcabcabcabcabcabc
abcabcabcabcabc
abc
abcabcabcabcabcabcabcabcabc
abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc
abcabcabcabc

To precisely control the number of repetitions, use the regular expression {m,n} construct, limiting the number of repetitions from m to n. Let us limit the number of repetitions to the range 1…5:

<start> ::= r"(abc){1,5}"

This is what we get:

$ fandango fuzz -f finity.fan -n 5
abc
abcabcabc
abcabc
abcabcabcabcabc
abcabcabcabc

Tip

Remember that grammars also have operators +, *, ?, and {N,M} which apply to the preceding grammar element, and work like their regular expression counterparts. Using these, we could also write the above as

<start> ::= "abc"+

and

<start> ::= "abc"{1,5}

respectively.

Regular Expressions over Bytes#

Regular expressions can also be formed over bytes. See Bytes and Regular Expressions for details.

Regular Expressions vs. Grammars#

In many cases, a grammar can be replaced by a regular expression and vice versa. This raises the question: When should one use a regular expression, and when a grammar? Here are some points to help you decide.

  • Regular expressions are often more concise (but arguably harder to read) than grammars.

  • If you want to reference individual elements of a string (say, as part of a constraint now or in the future), use a grammar.

  • Since their underlying model is simpler, regular expressions are faster to generate, and much faster to parse than grammars.

  • If your underlying language separates lexical and syntactical processing, use

    • regular expressions for specifying lexical parts such as tokens and fragments;

    • a grammar for the syntax; and

    • constraints for any semantic properties.

  • Prefer grammars and constraints over overly complex regular expressions.

Important

Do not use regular expressions for inputs that are recursive. Languages like HTML, XML, even e-mail addresses or URLs, are much easier to capture as grammars.

Regular Expressions as Equivalence Classes#

The choice of grammars vs. regular expressions also affects the Fandango generation algorithm. Generally speaking, Fandango attempts to cover all alternatives of a grammar. If, say, <digits> is specified as

<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

then Fandango will attempt to produce every digit at least once, and also try to cover digit combinations up to a certain depth. This is useful if you want to specifically test digit processing, or if each of the digits causes a different behavior that needs to be covered.

If, however, you specify <digits> as

<digit> ::= r'[0-9]'

then Fandango will treat this as a single alternative (with all expansions considered semantically equivalent), which once expanded into (some) digit will be considered as covered.

Tip

  • If you do want or need to differentiate between individual elements of a set (because they would be treated differently), consider grammar alternatives.

  • If you do not want or need to differentiate between individual elements of a set (because they would all be treated the same), consider a regular expression.