Applied Pokology
Padding and aligning data in GNU poke
Jose E. Marchesi
March 2, 2021
Table of Contents
_________________
1. Esoteric and exoteric padding
2. Reserved fields
3. Payloads
4. Aligning struct fields
5. Padding array elements
It is often the case in binary formats that certain elements are
separated by some data that is not really used for any meaningful
purpose other than to occupy that space. The reason for keeping that
space varies from case to case: sometimes it is reserved for future
use, sometimes it makes sure that the following data is aligned to
some particular boundary. This is known as "padding". There are
several ways to implement padding in GNU poke. This article shows
these techniques and discusses their advantages and disadvantages.
1 Esoteric and exoteric padding
===============================
So padding is the technique of keeping some amount of space between
two different elements in some data stream. GNU poke provides two
different ways to express sequences of data elements: the fields of a
struct type, which are defined one after the other, and elements in an
array.
We call adding space between two struct fields esoteric (or internal)
padding.
We call adding space between two array elements exoteric (or external)
padding.
Let's see some examples of the two kinds and how to better handle them
in Poke.
2 Reserved fields
=================
People designing binary encoded formats tend to be cautious and try to
avoid future backward incompatibilities by keeping some unused fields
that are reserved for future use. This is the first kind of padding
we will be looking at, and is particularly common in structures like
headers.
See for example the header used to characterize compressed section
contents in ELF files:
,----
| type Elf64_Chdr =
| struct
| {
| Elf_Word ch_type;
| Elf_Word ch_reserved;
| offset<Elf64_Xword,B> ch_size;
| offset<Elf64_Xword,B> ch_addralign;
| };
`----
where the `ch_reserved' field is reserved for future use. When the
time comes, the space occupied by that field (32 bits in this case)
will be used to hold additional data in the form of one or more
fields. The idea is that implementations of the older formats will
still work.
The most obvious way to handle this in Poke is using a named field
like `ch_reserved' above. This field will be decoded/encoded by poke
when constructing/mapping/writing struct values of this type, and will
be available to the user as `chdr.ch_reserved'.
Sometimes reserved space is required to be filled with certain data
values, such as zeroes. This may be to simplify things, or to force
data producers to initialize the memory in order to avoid leaking
sensitive information. In these cases we can use Poke initial
values:
,----
| type Elf64_Chdr =
| struct
| {
| Elf_Word ch_type;
| Elf_Word ch_reserved = 0;
| offset<Elf64_Xword,B> ch_size;
| offset<Elf64_Xword,B> ch_addralign;
| };
`----
This makes poke check that `ch_reserved' is zero when constructing or
mapping headers for compressed sections, raising a constraint
violation exception otherwise. It also makes poke set `ch_reserved'
to 0 when constructing `Elf64_Chdr' struct values, and reject any
other value given explicitly:
,----
| (poke) Elf64_Chdr { ch_reserved = 23 }
| unhandled constraint violation exception
`----
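How a program reacts to such a violation can be sketched with a
`try'/`catch' statement. The `Header' type below is hypothetical and
not part of any pickle; it spells the check out as an explicit field
constraint, and only assumes the standard `E_constraint' exception:
,----
| /* Hypothetical type, for illustration only.  */
| type Header =
|   struct
|   {
|     uint<32> kind;
|     uint<32> reserved : reserved == 0;
|   };
|
| /* The default value satisfies the constraint.  */
| var h = Header {};
|
| /* Constructing with a non-zero reserved field raises.  */
| try h = Header { reserved = 23 };
| catch if E_constraint
|   {
|     print "reserved must be zero\n";
|   }
`----
The exception is raised while the struct is being constructed, so the
assignment never happens and `h' keeps its zeroed value.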
An alternative way to characterize reserved space in Poke is to use
anonymous fields. For example:
,----
| type Elf64_Chdr =
| struct
| {
| Elf_Word ch_type;
| Elf_Word;
| offset<Elf64_Xword,B> ch_size;
| offset<Elf64_Xword,B> ch_addralign;
| };
`----
Using Poke anonymous fields to implement reserved fields has at least
two advantages. First, the user can no longer tamper with the data
in the reserved space in an easy way, i.e. `chdr.ch_reserved = 666' is
no longer valid. Second, the printed representation of anonymous
struct fields is more compact and denotes better that the involved
space is not to be messed with:
,----
| (poke) Elf64_Chdr {}
| Elf64_Chdr {
| ch_type=0x0U,
| 0x0U,
| ch_size=0x0UL#B,
| ch_addralign=0x0UL#B
| }
`----
A disadvantage of using anonymous fields is that you cannot specify
constraint expressions for them, nor initial values. At some point we
will probably add syntax to declare certain struct fields as
read-only.
At this point, it is important to note that anonymous fields are still
encoded/decoded by poke every time the struct value is mapped or
written, exactly like regular fields. Therefore using them doesn't
offer any advantage in terms of performance.
3 Payloads
==========
The reserved fields discussed in the previous section are most often
discrete units like words, double-words, and the like. They are
usually of some fixed size, and they are used to delimit some space
that is not to be used.
Another kind of padding happens when an entity contains space to be
used to store some kind of payload whose contents are not determined.
This would be such an example:
,----
| type Packet =
| struct
| {
| offset<uint<32>,B> payload_size;
| byte[payload_size] payload;
| int flags;
| };
`----
In this example we are using a `payload' field which is an array of
bytes. The size of the payload is determined by the packet header,
and the contents are not determined. Of course this assumes that the
payload size is always a whole number of bytes; a bit-oriented format
may need to use an array of bits instead.
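A minimal sketch of such a bit-oriented variant could look as follows.
The `Bit_Packet' name is made up for illustration; the only changes
with respect to `Packet' are that the size is expressed in bits and
the payload elements are one bit wide:
,----
| /* Hypothetical bit-oriented packet, for illustration only.  */
| type Bit_Packet =
|   struct
|   {
|     offset<uint<32>,b> payload_size;   /* Size in bits.  */
|     uint<1>[payload_size] payload;     /* One element per bit.  */
|     int flags;
|   };
`----
Note that in this variant `flags' may start at an offset that is not a
multiple of eight bits.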
This approach of using a byte (or bit) array like in the example above
has the advantage of providing a field with the bytes (or bits) to the
user, for inspection and modification:
,----
| (poke) packet.payload
| [23UB, ...]
| (poke) packet.payload[0] = 0
`----
The user can still map whatever payload structure in that space using
the attributes of a mapped `Packet'. For example, if the packet
contains an array of ULEB128 numbers, we could do:
,----
| (poke) var numbers = ULEB128[packet.payload'size] @ packet.payload'offset
`----
But this approach has a disadvantage: every time the packet structure
is mapped or written the entire payload array gets decoded and
encoded. If the payloads are big enough (think about the data blocks
of a file described by a filesystem i-node for example) this can be a
big problem in terms of performance.
Another problem of using byte (or bit) arrays for payloads is that the
printed representation of the struct values includes the contents of
the arrays, and most often the user won't be interested in seeing
that:
,----
| (poke) Packet { payload_size = 23#B }
| Packet {
| payload_size=0x17U#B,
| payload=[0x0UB,0x0UB,0x0UB,0x0UB,0x0UB,...],
| flags=0x0
| }
`----
Another alternative is to implement the padding implied by a payload
using field labels:
,----
| type Packet =
| struct
| {
| offset<uint<32>,B> payload_size;
| int flags @ OFFSET + payload_size;
| };
`----
Note how a `payload' field no longer exists in the struct type, and
the field `flags' is defined to start at offset `OFFSET +
payload_size'. This way no explicit array is encoded/decoded when
manipulating `Packet' values:
,----
| (poke) .set omaps yes
| (poke) Packet { payload_size = 500#MB }
| Packet {
| payload_size=500000000U#B @ 0UL#b,
| flags=0 @ 4000000032UL#b
| } @ 0UL#b
`----
In this example we used the `omaps' option, which asks poke to print
the offsets of the fields. The offset of `flags' is 4000000032 bits,
that is, 500 megabytes plus the 4 bytes of `payload_size':
,----
| (poke) 4000000032UL #b/#MB
| 500UL
`----
Mapping this new `Packet' involves reading and decoding only eight
bytes: four for `payload_size' and four for `flags'. This is clearly
much faster and avoids unneeded IO.
However, you may be wondering how to access the payload space if there
is no explicit `payload' field. One way is to define a method in the
struct that provides the payload attributes:
,----
| type Packet =
| struct
| {
| offset<uint<32>,B> payload_size;
| var payload_offset = OFFSET;
| int flags @ OFFSET + payload_size;
|
| method get_payload_offset = off64:
| {
| return payload_offset;
| }
| };
`----
Note how we captured the offset of the payload using a variable in the
struct type definition. Returning `OFFSET' in `get_payload_offset'
wouldn't work for obvious reasons: in the body of the method `OFFSET'
evaluates to the end of `flags' in this case.
Using this method you can easily access the payload (again as an array
of ULEB128 numbers) like this:
,----
| var numbers = ULEB128[packet.payload_size] @ packet.get_payload_offset
`----
Finally, using labels for this purpose makes the printed
representation of the struct values more readable by not including the
payload bytes in it:
,----
| (poke) Packet {}
| Packet {
| payload_size=0x0U#B,
| flags=0x0
| }
`----
4 Aligning struct fields
========================
Another kind of esoteric padding happens when certain fields in
entities are required to be aligned to some particular alignment. For
example, suppose that the `flags' field in the packets used in the
previous sections is required to always be aligned to 4 bytes
regardless of the size of the payload. This would be a common
requirement if the format is intended to be implemented in systems
where data is to be accessed using its "natural" alignment.
Using explicit fields for both the payload and the additional padding,
we could come up with:
,----
| type Packet =
| struct
| {
| offset<uint<32>,B> payload_size;
| byte[payload_size] payload;
| byte[alignto (OFFSET, 4#B)] padding;
| int flags;
| };
`----
Where `alignto' is a little function defined in the Poke standard
library, like this:
,----
| fun alignto = (uoff64 offset, uoff64 to) uoff64:
| {
| return (to - (offset % to)) % to;
| }
`----
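A few sample evaluations show the amount of padding that `alignto'
computes. The input values are arbitrary and chosen only for
illustration:
,----
| (poke) alignto (5#B, 4#B) == 3#B
| 1
| (poke) alignto (8#B, 4#B) == 0#B
| 1
| (poke) alignto (129#B, 128#B) == 127#B
| 1
`----
In other words, the result is the distance from the given offset to
the next multiple of the alignment, or zero if the offset is already
aligned.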
Alternatively, using the labels approach (which is generally better as
we discussed in the last section) the definition would become:
,----
| type Packet =
| struct
| {
| offset<uint<32>,B> payload_size;
| var payload_offset = OFFSET;
| int flags @ OFFSET + payload_size + alignto (payload_size, 4#B);
|
| method get_payload_offset = off64:
| {
| return payload_offset;
| }
| };
`----
In this case, the payload space is still completely characterized by
the `payload_size' field and the `get_payload_offset' method.
5 Padding array elements
========================
Up to now all the examples of padding we have shown are in the
category of esoteric or internal padding, i.e. it was intended to add
space between fields of some particular entity.
However, sometimes we want to specify some padding between the
elements of a sequence of entities. In Poke this basically means an
array.
Suppose we have a simple filesystem that is made up of a sequence of
inodes. The contents of the filesystem have the following form:
,----
| +-----------------+
| | inode |
| +-----------------+
| : :
| : data :
| : :
| +-----------------+
| | inode |
| +-----------------+
| : :
| : data :
| : :
| +-----------------+
| | ... |
`----
That is, each inode describes a block of data of variable size that
immediately follows. Then more pairs of inode-data follow until the
end of the device. However, a requirement is that each inode has to
be aligned to 128 bytes.
Let's start by writing a simple type definition for the inodes:
,----
| type Inode =
| struct
| {
| string filename;
| int perms;
| offset<uint<32>,B> data_size;
| };
`----
This definition is simple enough, but it doesn't account for the data
and the padding that separate consecutive inodes, so it doesn't allow
us to just map an array of inodes like this:
,----
| (poke) Inode[] @ 0#B
`----
We could of course add the data and padding explicitly to the inode
structure:
,----
| type Inode =
| struct
| {
| string filename;
| int perms;
| offset<uint<32>,B> data_size;
| byte[data_size] data;
| byte[alignto (OFFSET, 128#B)] padding;
| };
`----
Then we could just map `Inode[] @ 0#B' and we would get the expected
result.
But this is not a good idea. On the one hand, as we know, this would
imply mapping the full filesystem data byte by byte, and that would be
very slow. On the other hand, the data is not part of the inode,
conceptually speaking.
A better solution is to use this idiom:
,----
| type Inode =
| struct
| {
| string filename;
| int perms;
| offset<uint<32>,B> data_size;
|
| byte[0] @ OFFSET + data_size + alignto (OFFSET + data_size, 128#B);
| };
`----
This uses an anonymous field at the end of the struct type, of size
zero, located at exactly the offset where the data plus padding would
end in the version with explicit fields.
This latter solution is fast and still allows us to get an array of
inodes reflecting the whole filesystem with:
,----
| (poke) var inodes = Inode[] @ 0#B
`----
Like in the previous sections, a method `get_data_offset' can be added
to the struct type in order to allow accessing the data blocks
corresponding to a given inode.
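A minimal sketch of such a method follows, mirroring
`get_payload_offset' from the `Packet' type. The variable name
`data_offset' is made up for illustration. The sketch pads from the
end of the data so that the next inode starts on a 128-byte boundary,
and it assumes that `OFFSET' is relative to the beginning of the
struct:
,----
| type Inode =
|   struct
|   {
|     string filename;
|     int perms;
|     offset<uint<32>,B> data_size;
|     /* Capture where the data block starts.  */
|     var data_offset = OFFSET;
|
|     byte[0] @ OFFSET + data_size + alignto (OFFSET + data_size, 128#B);
|
|     method get_data_offset = off64:
|     {
|       return data_offset;
|     }
|   };
`----
Under that assumption the returned offset is relative to the inode, so
the offset of the mapped inode has to be added when accessing the
data:
,----
| (poke) var i = inodes[1]
| (poke) var data = byte[i.data_size] @ (i'offset + i.get_data_offset)
`----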
Happy poking!