Generating Binary Inputs

Generating Binary Inputs#

Creating binary inputs with Fandango is a bit more challenging than creating human-readable inputs. This is because they have a few special features, such as checksums and length encodings. Fortunately, we can address all of them with dedicated constraints.

Checksums#

Our first challenge is checksums. Binary input formats frequently use checksums to ensure integrity. The problem is that checksums cannot be expressed in a grammar alone, as grammars lack the arithmetic functions required to compute and check checksums. In Fandango, though, we can express the computation of a checksum in a dedicated function, which is then used in a dedicated constraint.

As an example for checksums, let’s have a look at credit card numbers. These are definitely very human-readable and not binary at all, but for an example, they will do fine. A credit card number consists of a series of digits, where the last one is a check digit. Here is a grammar that expresses the structure for 16-digit credit card numbers:

<start>              ::= <credit_card_number>
<credit_card_number> ::= <number> <check_digit>
<number>             ::= <digit>{15}  # for 16-digit numbers
<check_digit>        ::= <digit>

All credit cards use Luhn’s algorithm to compute the check digit. Here is an implementation, adapted from the Faker library. The function credit_card_check_digit() gets all numbers of a credit card (except the last digit) and returns the computed check digit.

def credit_card_check_digit(number: str) -> str:
    """Create a check digit for the credit card number `number`."""
    luhn_lookup = {
        "0": 0,
        "1": 2,
        "2": 4,
        "3": 6,
        "4": 8,
        "5": 1,
        "6": 3,
        "7": 5,
        "8": 7,
        "9": 9,
    }

    # Calculate sum
    length = len(number) + 1
    reverse = number[::-1]
    tot = 0
    pos = 0
    while pos < length - 1:
        tot += luhn_lookup[reverse[pos]]
        if pos != (length - 2):
            tot += int(reverse[pos + 1])
        pos += 2

    # Calculate check digit
    check_digit = (10 - (tot % 10)) % 10
    return str(check_digit)

We can easily make use of credit_card_check_digit() in a constraint that ties <check_digit> and <number>:

where <check_digit> == credit_card_check_digit(str(<number>))

All of this can go into a single .fan file: credit_card.fan joins the above grammar, the credit_card_check_digit() definition, and the above constraint into a single file.

We can now use credit-card.fan to produce valid credit card numbers:

$ fandango fuzz -f credit_card.fan -n 10

3084357621105227
0164092957427528
8200201604237094
8352339268976270
2933129425215733
6284184152018095
6650150713076530
1432794167535487
5921494416277911
9292276014824860

We can also use the grammar to parse and check numbers. This credit card number should be valid:

$ echo -n 4931633575526870 | fandango parse -f credit_card.fan
$ echo $?  # print exit code
0

Adding 1 to this number should make it invalid:

$ echo -n 4931633575526871 | fandango parse -f credit_card.fan

FandangoError: '<stdin>': constraint NonTerminal('<check_digit>') == credit_card_check_digit(str(NonTerminal('<number>'))) not satisfied
FandangoParseError: 1 error(s) during parsing

$ echo $?  # print exit code
1

Similarly, you can define any kind of checksum function and then use it in a constraint. In Python, it is likely that someone has already implemented the specific checksum function, so you can also import it:

The hashlib module provides hash functions such as MD5 or SHA-256.
The binascii module offers CRC checks.
The zlib module provides CRC32 and ADLER32 checks used in zip files.

Characters and Bytes#

The second set of features one frequently encounters in binary formats is, well, bytes. So far, we have seen Fandango operates on strings of Unicode characters, which use UTF-8 encoding. This clashes with a byte interpretation as soon as the produced string contains a UTF-8 prefix byte, such as \xc2 or \xe0, which mark the beginning of a two- and three-byte UTF-8 sequence, respectively.

To ensure bytes will be interpreted as bytes (and as bytes only), place a b (binary) prefix in front of them. This ensures that a byte b'\xc2' will always be interpreted as a single byte, whereas 💃 will be interpreted as a single character (despite occupying multiple bytes).

Tip

Fandango provides a <byte> symbol by default, which expands into all bytes b'\x00'..b'\xff'.

Text Files and Binary Files#

By default, Fandango will read and write files in text mode, meaning that characters will be read in using UTF-8 encoding. However, if a grammar can produce bytes (or bits), the associated files will be read and written in binary mode, reading and writing bytes instead of (encoded) characters. If your grammar contains bytes and strings, then the strings will be written in UTF-8 encoding into the binary file.

You can enforce a specific behavior using the Fandango --file-mode flag for the fuzz and parse commands:

fuzz --file-mode=text opens all files in text mode. Strings and bytes will be written in UTF-8 encoding.
fuzz --file-mode=binary opens all files in binary mode. Strings will be written in UTF-8 encoding; bytes will be written as is.

The default is fuzz --file=mode=auto (default), which will use binary or text mode as described above.

Tip

Avoid mixing non-ASCII strings with bits and bytes in a single grammar.

Bytes and Regular Expressions#

Fandango also supports regular expressions over bytes. To obtain a regular expression over a byte string, use both r and b prefixes. This is especially useful for character classes.

Here is an example: binfinity.fan produces strings of five bytes outside the range \x80-\xff:

<start> ::= rb"[^\x80-\xff]{5}"

This is what we get:

$ fandango fuzz -f binfinity.fan -n 10

'k`u.
e.)wF
<MJN:
d$E:B
oT1*o
9B8@v
jIqfS
x;&&n
At:\E
:Y(71

Length Encodings#

The third set of features one frequently encounters in binary formats is length encodings - that is, a particular field holds a value that represents the length of one or more fields that follow. Here is a simple grammar that expresses this characteristic: A <field> has a two-byte length, followed by the actual (byte) content (of length <length>).

<start>    ::= <field>
<field>    ::= <length> <content>
<length>   ::= <byte> <byte>
<content>  ::= <byte>+

Encoding Lengths with Constraints#

The relationship between <length> and <content> can again be expressed using a constraint. Let us assume that <length> comes as a two-byte (16-bit) unsigned integer with little-endian encoding - that is, the low byte comes first, and the high byte follows. The value 258 (hexadecimal 0x0102) would thus be represented as the two bytes \x02 and \x01.

We can define a function uint16() that takes an integer and converts it to a two-byte string according to these rules. The Python method N.to_bytes(LENGTH, ENDIANNESS) converts the integer N into a bytes string of length LENGTH. ENDIANNESS is either 'big' (default) or 'little'.

def uint16(n: int) -> bytes:
    return n.to_bytes(2, 'little')

Using uint16(), we can now define how the value of <length> is related to the length of <content>:

where <length> == uint16(len(bytes(<content>)))

Tip

Having a derived value (like <length>) isolated on the left-hand side of an equality equation makes it easy for Fandango to first compute the content and then compute and assign the derived value.

Again, all of this goes into a single .fan file: binary.fan holds the grammar, the uint16() definition, and the constraint. Let us produce a single output using binary.fan and view its (binary) contents, using od -c:

$ fandango fuzz -n 1 -f binary.fan -o - | hexdump -C

fandango:WARNING: Could not generate a full population of unique individuals. Population size reduced to 5.

00000000  02 00 02 00                                       |....|
00000004

The hexadecimal dump shows that the first two bytes encode the length of the string of digits that follows. The format is correct - we have successfully produced a length encoding.

Encoding Lengths with Repetitions#

Another way to implement length constraints is by using repetitions. In Fandango, repetitions {} can also contain expressions, and like constraints, these can also refer to nonterminals that have already been parsed or produced. Hence, we can specify a rule

<content> ::= <byte>{f(<length>)}

where f() is a function that computes the number of <byte> repetitions based on <length>.

Let us define a variant binary-rep.fan that makes use of this. Here, we specify that <content> consists of N bytes, where N is given as follows:

<start>    ::= <field>
<field>    ::= <length> <content>
<length>   ::= <byte> <byte> := b'\x00\x04'
<content>  ::= <byte>{from_uint16(<length>.value())}

The method <length>.value() returns the bytes string value of the <length> element. The function from_uint16() is defined as follows:

def from_uint16(n: bytes) -> int:
    return n[0] << 8 | n[1]

With this, we can easily produce length-encoded inputs:

$ fandango fuzz -n 1 -f binary-rep.fan -o - | hexdump -C

00000000  00 04 b9 90 d2 32                                 |.....2|
00000006

Tip

When parsing inputs, computed repetitions are much more efficient than constraints.

Converting Values to Binary Formats#

Instead of implementing uint16() manually, we can also use the Python struct module, which offers several functions to convert data into binary formats. Using struct, we can redefine uint16() as

from struct import pack

def uint16(n: int) -> str:
    return pack('<H', n).decode('iso8859-1')  # convert to string

and obtain the same result:

fandango:WARNING: <field>: Concatenating 'bytes' (<length>) and 'str' (<content>)
fandango:WARNING: NonTerminal('<length>') == uint16(len(NonTerminal('<content>'))): '==': Cannot compare 'bytes' and 'str'

ValueError: I/O operation on closed file.

0000000 003  \0   4   9   5
0000005

Note that the return value of struct.pack() has the type bytes (byte string), which is different from the str Unicode strings that Fandango uses:

pack('<H', 20)

b'\x14\x00'

type(pack('<H', 20))

bytes

In Python, comparisons of different types always return False:

# Left hand is byte string, right hand is Unicode string
b'\x14\x00' == '\x14\x00'

False

Hence, a constraint that compares a Fandango symbol against a byte string will always fail.

Important

When comparing symbols against values, always be sure to convert the values to the appropriate type first.

Tip

Using the 'iso8859-1' encoding (also known as 'latin-1') allows a 1:1 conversion of byte strings with values '\x00'..'\xff' into Unicode str strings without further interpretation.

Tip

Adding type annotations to functions in .fan files allows for future static type checking and further optimizations.

Check out the struct module for additional encodings, including float types, long and short integers, and many more.