Generating Binary Inputs#
Creating binary inputs with Fandango is a bit more challenging than creating human-readable inputs. This is because they have a few special features, such as checksums and length encodings. Fortunately, we can address all of them with dedicated constraints.
Checksums#
Our first challenge is checksums. Binary input formats frequently use checksums to ensure integrity. The problem is that checksums cannot be expressed in a grammar alone, as grammars lack the arithmetic functions required to compute and check checksums. In Fandango, though, we can express the computation of a checksum in a dedicated function, which is then used in a dedicated constraint.
As an example for checksums, let’s have a look at credit card numbers. These are definitely very human-readable and not binary at all, but for an example, they will do fine. A credit card number consists of a series of digits, where the last one is a check digit. Here is a grammar that expresses the structure for 16-digit credit card numbers:
<start> ::= <credit_card_number>
<credit_card_number> ::= <number> <check_digit>
<number> ::= <digit>{15} # for 16-digit numbers
<check_digit> ::= <digit>
All credit cards use Luhn’s algorithm to compute the check digit.
Here is an implementation, adapted from the Faker library.
The function credit_card_check_digit() gets all numbers of a credit card (except the last digit) and returns the computed check digit.
def credit_card_check_digit(number: str) -> str:
"""Create a check digit for the credit card number `number`."""
luhn_lookup = {
"0": 0,
"1": 2,
"2": 4,
"3": 6,
"4": 8,
"5": 1,
"6": 3,
"7": 5,
"8": 7,
"9": 9,
}
# Calculate sum
length = len(number) + 1
reverse = number[::-1]
tot = 0
pos = 0
while pos < length - 1:
tot += luhn_lookup[reverse[pos]]
if pos != (length - 2):
tot += int(reverse[pos + 1])
pos += 2
# Calculate check digit
check_digit = (10 - (tot % 10)) % 10
return str(check_digit)
We can easily make use of credit_card_check_digit() in a constraint that ties <check_digit> and <number>:
where <check_digit> == credit_card_check_digit(str(<number>))
All of this can go into a single .fan file: credit_card.fan joins the above grammar, the credit_card_check_digit() definition, and the above constraint into a single file.
We can now use credit-card.fan to produce valid credit card numbers:
$ fandango fuzz -f credit_card.fan -n 10
0531196330795430
8348853708526450
8150383480034991
0310259795433660
8703502854235718
4947661575499980
1659090396842992
3112981830846382
3425749780512334
4739789203274178
We can also use the grammar to parse and check numbers. This credit card number should be valid:
$ echo -n 4931633575526870 | fandango parse -f credit_card.fan
$ echo $? # print exit code
0
Adding a 1 digit to this number should make it invalid:
$ echo -n 4931633575526871 | fandango parse -f credit_card.fan
FandangoParseError: Did not match the following constraints: <check_digit> == credit_card_check_digit(str(<number>))
FandangoParseError: Did not match the following constraints: <check_digit> == credit_card_check_digit(str(<number>))
Traceback (most recent call last):
File "/home/runner/work/fandango/fandango/.venv/bin/fandango", line 10, in <module>
sys.exit(main())
~~~~^^
File "<@beartype(fandango.cli.main) at 0x7fe66edd1080>", line 68, in main
File "/home/runner/work/fandango/fandango/src/fandango/cli/__init__.py", line 51, in main
last_status = run(command, args)
File "<@beartype(fandango.cli.commands.run) at 0x7fe66e9e1260>", line 51, in run
File "/home/runner/work/fandango/fandango/src/fandango/cli/commands.py", line 502, in run
print_exception(e)
~~~~~~~~~~~~~~~^^^
File "<@beartype(fandango.logger.print_exception) at 0x7fe66fb40400>", line 54, in print_exception
File "/home/runner/work/fandango/fandango/src/fandango/logger.py", line 48, in print_exception
raise e
File "/home/runner/work/fandango/fandango/src/fandango/cli/commands.py", line 500, in run
command(args)
~~~~~~~^^^^^^
File "<@beartype(fandango.cli.commands.parse_command) at 0x7fe66e9e09a0>", line 33, in parse_command
File "/home/runner/work/fandango/fandango/src/fandango/cli/commands.py", line 275, in parse_command
print_exception(e)
~~~~~~~~~~~~~~~^^^
File "<@beartype(fandango.logger.print_exception) at 0x7fe66fb40400>", line 54, in print_exception
File "/home/runner/work/fandango/fandango/src/fandango/logger.py", line 48, in print_exception
raise e
File "/home/runner/work/fandango/fandango/src/fandango/cli/commands.py", line 272, in parse_command
tree = parse_file(fd, args, grammar, constraints, settings)
File "<@beartype(fandango.cli.utils.parse_file) at 0x7fe66ec253a0>", line 136, in parse_file
File "/home/runner/work/fandango/fandango/src/fandango/cli/utils.py", line 323, in parse_file
raise FandangoParseError(
f"Did not match the following constraints: {', '.join(failed_constraints)}"
)
fandango.errors.FandangoParseError: Did not match the following constraints: <check_digit> == credit_card_check_digit(str(<number>))
$ echo $? # print exit code
1
Similarly, you can define any kind of checksum function and then use it in a constraint. In Python, it is likely that someone has already implemented the specific checksum function, so you can also import it:
The
hashlibmodule provides hash functions such as MD5 or SHA-256.The
binasciimodule offers CRC checks.The
zlibmodule provides CRC32 and ADLER32 checks used in zip files.
Characters and Bytes#
The second set of features one frequently encounters in binary formats is, well, bytes.
So far, we have seen Fandango operates on strings of Unicode characters, which use UTF-8 encoding.
This clashes with a byte interpretation as soon as the produced string contains a UTF-8 prefix byte, such as \xc2 or \xe0, which mark the beginning of a two- and three-byte UTF-8 sequence, respectively.
To ensure bytes will be interpreted as bytes (and as bytes only), place a b (binary) prefix in front of them.
This ensures that a byte b'\xc2' will always be interpreted as a single byte, whereas 💃 will be interpreted as a single character (despite occupying multiple bytes).
Tip
Fandango provides a <byte> symbol by default, which expands into all bytes b'\x00'..b'\xff'.
Text Files and Binary Files#
By default, Fandango will read and write files in text mode, meaning that characters will be read in using UTF-8 encoding.
However, if a grammar can produce bytes (or bits), the associated files will be read and written in binary mode, reading and writing bytes instead of (encoded) characters.
If your grammar contains bytes and strings, then the strings will be written in UTF-8 encoding into the binary file.
You can enforce a specific behavior using the Fandango --file-mode flag for the fuzz and parse commands:
fuzz --file-mode=textopens all files intextmode. Strings and bytes will be written in UTF-8 encoding.fuzz --file-mode=binaryopens all files inbinarymode. Strings will be written in UTF-8 encoding; bytes will be written as is.
The default is fuzz --file=mode=auto (default), which will use binary or text mode as described above.
Tip
Avoid mixing non-ASCII strings with bits and bytes in a single grammar.
Bytes and Regular Expressions#
Fandango also supports regular expressions over bytes.
To obtain a regular expression over a byte string, use both r and b prefixes.
This is especially useful for character classes.
Here is an example: binfinity.fan produces strings of five bytes outside the range \x80-\xff:
<start> ::= rb"[^\x80-\xff]{5}"
This is what we get:
$ fandango fuzz -f binfinity.fan -n 10
.bF5@
'b#EZ
j[rK9
)7fVY
LfvkI
jq\NA
5^1/4
@+4jd
[k!d:
BL_Kj
Length Encodings#
The third set of features one frequently encounters in binary formats is length encodings - that is, a particular field holds a value that represents the length of one or more fields that follow.
Here is a simple grammar that expresses this characteristic: A <field> has a two-byte length, followed by the actual (byte) content (of length <length>).
<start> ::= <field>
<field> ::= <length> <content>
<length> ::= <byte> <byte>
<content> ::= <byte>+
Encoding Lengths with Constraints#
The relationship between <length> and <content> can again be expressed using a constraint.
Let us assume that <length> comes as a two-byte (16-bit) unsigned integer with little-endian encoding - that is, the low byte comes first, and the high byte follows.
The value 258 (hexadecimal 0x0102) would thus be represented as the two bytes \x02 and \x01.
We can define a function uint16() that takes an integer and converts it to a two-byte string according to these rules.
The Python method N.to_bytes(LENGTH, ENDIANNESS) converts the integer N into a bytes string of length LENGTH. ENDIANNESS is either 'big' (default) or 'little'.
def uint16(n: int) -> bytes:
return n.to_bytes(2, 'little')
Using uint16(), we can now define how the value of <length> is related to the length of <content>:
where <length> == uint16(len(bytes(<content>)))
Tip
Having a derived value (like <length>) isolated on the left-hand side of an equality equation makes it easy for Fandango to first compute the content and then compute and assign the derived value.
Again, all of this goes into a single .fan file: binary.fan holds the grammar, the uint16() definition, and the constraint.
Let us produce a single output using binary.fan and view its (binary) contents, using od -c:
$ fandango fuzz -n 1 -f binary.fan | hexdump -C
00000000 0b 00 4c 5d c2 b0 c2 ba 0a c3 b7 c3 bb c2 9e 62 |..L]...........b|
00000010 0d c3 96 0a |....|
00000014
The hexadecimal dump shows that the first two bytes encode the length of the string of digits that follows. The format is correct - we have successfully produced a length encoding.
Encoding Lengths with Repetitions#
Another way to implement length constraints is by using repetitions.
In Fandango, repetitions {} can also contain expressions, and like constraints, these can also refer to nonterminals that have already been parsed or produced.
Hence, we can specify a rule
<content> ::= <byte>{f(<length>)}
where f() is a function that computes the number of <byte> repetitions based on <length>.
Let us define a variant binary-rep.fan that makes use of this.
Here, we specify that <content> consists of N bytes, where N is given as follows:
<start> ::= <field>
<field> ::= <length> <content>
<length> ::= <byte> <byte> := b'\x00\x04'
<content> ::= <byte>{from_uint16(bytes(<length>))}
The method <length>.value() returns the bytes string value of the <length> element.
The function from_uint16() is defined as follows:
def from_uint16(n: bytes) -> int:
return n[0] << 8 | n[1]
With this, we can easily produce length-encoded inputs:
$ fandango fuzz -n 1 -f binary-rep.fan | hexdump -C
00000000 00 04 2b c3 be 22 5e 0a |..+.."^.|
00000008
Tip
When parsing inputs, computed repetitions are much more efficient than constraints.
Converting Values to Binary Formats#
Instead of implementing uint16() manually, we can also use the Python struct module, which offers several functions to convert data into binary formats.
Using struct, we can redefine uint16() as
from struct import pack
def uint16(n: int) -> str:
return pack('<H', n) # convert to bytes
and obtain the same result:
0000000 016 \0 1 6 4 0 3 9 5 6 9 4 9 0 2 8
0000020 \n
0000021
Note that the return value of struct.pack() has the type bytes (byte string), which is different from the str Unicode strings that Fandango uses:
pack('<H', 20)
b'\x14\x00'
type(pack('<H', 20))
bytes
In Python, comparisons of different types always return False:
# Left hand is byte string, right hand is Unicode string
b'\x14\x00' == '\x14\x00'
False
Hence, a constraint that compares a Fandango symbol against a byte string will always fail.
Important
When comparing symbols against values, always be sure to convert the values to the appropriate type first.
Tip
Using the 'iso8859-1' encoding (also known as 'latin-1') allows a 1:1 conversion of byte strings with values '\x00'..'\xff' into Unicode str strings without further interpretation.
Tip
Adding type annotations to functions in .fan files allows for future static type checking and further optimizations.
Check out the struct module for additional encodings, including float types, long and short integers, and many more.