Generating Binary Inputs#
Creating binary inputs with Fandango is a bit more challenging than creating human-readable inputs. This is because they have a few special features, such as checksums and length encodings. Fortunately, we can address all of them with dedicated constraints.
Checksums#
Our first challenge is checksums. Binary input formats frequently use checksums to ensure integrity. The problem is that checksums cannot be expressed in a grammar alone, as grammars lack the arithmetic functions required to compute and check checksums. In Fandango, though, we can express the computation of a checksum in a dedicated function, which is then used in a dedicated constraint.
As an example for checksums, let’s have a look at credit card numbers. These are definitely very human-readable and not binary at all, but for an example, they will do fine. A credit card number consists of a series of digits, where the last one is a check digit. Here is a grammar that expresses the structure for 16-digit credit card numbers:
<start> ::= <credit_card_number>
<credit_card_number> ::= <number> <check_digit>
<number> ::= <digit>{15} # for 16-digit numbers
<check_digit> ::= <digit>
All credit cards use Luhn’s algorithm to compute the check digit.
Here is an implementation, adapted from the Faker library.
The function credit_card_check_digit()
gets all numbers of a credit card (except the last digit) and returns the computed check digit.
def credit_card_check_digit(number: str) -> str:
"""Create a check digit for the credit card number `number`."""
luhn_lookup = {
"0": 0,
"1": 2,
"2": 4,
"3": 6,
"4": 8,
"5": 1,
"6": 3,
"7": 5,
"8": 7,
"9": 9,
}
# Calculate sum
length = len(number) + 1
reverse = number[::-1]
tot = 0
pos = 0
while pos < length - 1:
tot += luhn_lookup[reverse[pos]]
if pos != (length - 2):
tot += int(reverse[pos + 1])
pos += 2
# Calculate check digit
check_digit = (10 - (tot % 10)) % 10
return str(check_digit)
We can easily make use of credit_card_check_digit()
in a constraint that ties <check_digit>
and <number>
:
where <check_digit> == credit_card_check_digit(str(<number>))
All of this can go into a single .fan
file: credit_card.fan
joins the above grammar, the credit_card_check_digit()
definition, and the above constraint into a single file.
We can now use credit-card.fan
to produce valid credit card numbers:
$ fandango fuzz -f credit_card.fan -n 10
8505467856770528
3524048931388029
4793825066012892
7569694534583917
3467989042450217
0284334624027256
9371315888716226
9299437791862894
6492227771438620
7978467113800084
We can also use the grammar to parse and check numbers. This credit card number should be valid:
$ echo -n 4931633575526870 | fandango parse -f credit_card.fan
$ echo $? # print exit code
0
Adding 1 to this number should make it invalid:
$ echo -n 4931633575526871 | fandango parse -f credit_card.fan
FandangoError: '<stdin>': constraint <check_digit> == credit_card_check_digit(str(<number>)) not satisfied
FandangoParseError: 1 error(s) during parsing
$ echo $? # print exit code
1
Similarly, you can define any kind of checksum function and then use it in a constraint. In Python, it is likely that someone has already implemented the specific checksum function, so you can also import it:
The
hashlib
module provides hash functions such as MD5 or SHA-256.The
binascii
module offers CRC checks.The
zlib
module provides CRC32 and ADLER32 checks used in zip files.
Characters and Bytes#
The second set of features one frequently encounters in binary formats is, well, bytes.
So far, we have seen Fandango operates on strings of Unicode characters, which use UTF-8 encoding.
This clashes with a byte interpretation as soon as the produced string contains a UTF-8 prefix byte, such as \xc2
or \xe0
, which mark the beginning of a two- and three-byte UTF-8 sequence, respectively.
To ensure bytes will be interpreted as bytes (and as bytes only), place a b
(binary) prefix in front of them.
This ensures that a byte b'\xc2'
will always be interpreted as a single byte, whereas 💃
will be interpreted as a single character (despite occupying multiple bytes).
Tip
Fandango provides a <byte>
symbol by default, which expands into all bytes b'\x00'..b'\xff'
.
Text Files and Binary Files#
By default, Fandango will read and write files in text
mode, meaning that characters will be read in using UTF-8 encoding.
However, if a grammar can produce bytes (or bits), the associated files will be read and written in binary
mode, reading and writing bytes instead of (encoded) characters.
If your grammar contains bytes and strings, then the strings will be written in UTF-8 encoding into the binary file.
You can enforce a specific behavior using the Fandango --file-mode
flag for the fuzz
and parse
commands:
fuzz --file-mode=text
opens all files intext
mode. Strings and bytes will be written in UTF-8 encoding.fuzz --file-mode=binary
opens all files inbinary
mode. Strings will be written in UTF-8 encoding; bytes will be written as is.
The default is fuzz --file=mode=auto
(default), which will use binary
or text
mode as described above.
Tip
Avoid mixing non-ASCII strings with bits and bytes in a single grammar.
Bytes and Regular Expressions#
Fandango also supports regular expressions over bytes.
To obtain a regular expression over a byte string, use both r
and b
prefixes.
This is especially useful for character classes.
Here is an example: binfinity.fan
produces strings of five bytes outside the range \x80-\xff
:
<start> ::= rb"[^\x80-\xff]{5}"
This is what we get:
$ fandango fuzz -f binfinity.fan -n 10
wm<Hy
I'C[9
i3>Ba
QYvjZ
F\)s(
Q5;.S
7p%b%
K;xEj
JmLLM
_zjW%
Length Encodings#
The third set of features one frequently encounters in binary formats is length encodings - that is, a particular field holds a value that represents the length of one or more fields that follow.
Here is a simple grammar that expresses this characteristic: A <field>
has a two-byte length, followed by the actual (byte) content (of length <length>
).
<start> ::= <field>
<field> ::= <length> <content>
<length> ::= <byte> <byte>
<content> ::= <byte>+
Encoding Lengths with Constraints#
The relationship between <length>
and <content>
can again be expressed using a constraint.
Let us assume that <length>
comes as a two-byte (16-bit) unsigned integer with little-endian encoding - that is, the low byte comes first, and the high byte follows.
The value 258 (hexadecimal 0x0102
) would thus be represented as the two bytes \x02
and \x01
.
We can define a function uint16()
that takes an integer and converts it to a two-byte string according to these rules.
The Python method N.to_bytes(LENGTH, ENDIANNESS)
converts the integer N
into a bytes string of length LENGTH
. ENDIANNESS
is either 'big'
(default) or 'little'
.
def uint16(n: int) -> bytes:
return n.to_bytes(2, 'little')
Using uint16()
, we can now define how the value of <length>
is related to the length of <content>
:
where <length> == uint16(len(bytes(<content>)))
Tip
Having a derived value (like <length>
) isolated on the left-hand side of an equality equation makes it easy for Fandango to first compute the content and then compute and assign the derived value.
Again, all of this goes into a single .fan
file: binary.fan
holds the grammar, the uint16()
definition, and the constraint.
Let us produce a single output using binary.fan
and view its (binary) contents, using od -c
:
$ fandango fuzz -n 1 -f binary.fan -o - | hexdump -C
fandango:WARNING: Could not generate a full population of unique individuals. Population size reduced to 5.
00000000 02 00 02 00 |....|
00000004
The hexadecimal dump shows that the first two bytes encode the length of the string of digits that follows. The format is correct - we have successfully produced a length encoding.
Encoding Lengths with Repetitions#
Another way to implement length constraints is by using repetitions.
In Fandango, repetitions {}
can also contain expressions, and like constraints, these can also refer to nonterminals that have already been parsed or produced.
Hence, we can specify a rule
<content> ::- <byte>{f(<length>)}
where f()
is a function that computes the number of <byte>
repetitions based on <length>
.
Let us define a variant binary-rep.fan
that makes use of this.
Here, we specify that <content>
consists of N
bytes, where N
is given as follows:
<start> ::= <field>
<field> ::= <length> <content>
<length> ::= <byte> <byte> := b'\x00\x04'
<content> ::= <byte>{from_uint16(<length>.value())}
The method <length>.value()
returns the bytes string value of the <length>
element.
The function from_uint16()
is defined as follows:
def from_uint16(n: bytes) -> int:
return n[0] << 8 | n[1]
With this, we can easily produce length-encoded inputs:
$ fandango fuzz -n 1 -f binary-rep.fan -o - | hexdump -C
00000000 00 04 7e ec 7d c8 |..~.}.|
00000006
Tip
When parsing inputs, computed repetitions are much more efficient than constraints.
Converting Values to Binary Formats#
Instead of implementing uint16()
manually, we can also use the Python struct
module, which offers several functions to convert data into binary formats.
Using struct
, we can redefine uint16()
as
from struct import pack
def uint16(n: int) -> str:
return pack('<H', n).decode('iso8859-1') # convert to string
and obtain the same result:
fandango:WARNING: <field>: Concatenating 'bytes' (<length>) and 'str' (<content>)
fandango:WARNING: <length> == uint16(len(<content>)): '==': Cannot compare 'bytes' and 'str'
ValueError: I/O operation on closed file.
0000000 002 \0 2 7
0000004
Note that the return value of struct.pack()
has the type bytes
(byte string), which is different from the str
Unicode strings that Fandango uses:
pack('<H', 20)
b'\x14\x00'
type(pack('<H', 20))
bytes
In Python, comparisons of different types always return False
:
# Left hand is byte string, right hand is Unicode string
b'\x14\x00' == '\x14\x00'
False
Hence, a constraint that compares a Fandango symbol against a byte string will always fail.
Warning
When comparing symbols against values, always be sure to convert the values to the appropriate type first.
Tip
Using the 'iso8859-1'
encoding (also known as 'latin-1'
) allows a 1:1 conversion of byte strings with values '\x00'..'\xff'
into Unicode str
strings without further interpretation.
Tip
Adding type annotations to functions in .fan
files allows for future static type checking and further optimizations.
Check out the struct
module for additional encodings, including float types, long and short integers, and many more.