_____
---' __\_______
______) Values of the world, unite! - Offsets in Poke
__)
__)
---._______)
Jose E. Marchesi
June 06, 2019
Early in the design of what is becoming GNU poke I was struck by a
problem that, to my surprise, would prove not easy to fix in a
satisfactory way: would I make a byte-oriented program, or a
bit-oriented program? Considering that the program in question was
nothing less than an editor for binary data, this was no petty
dilemma.
Since the very beginning I had a pretty clear idea of what I wanted to
achieve: a binary editor that would be capable of editing user defined
data structures, besides bytes and bits. I also knew I needed some
sort of domain specific language to describe these structures and
operate on them. How that language would look like, and what kind of
abstractions it would provide, however, was not clear to me. Not at
all.
So once I sketched an initial language design, barely something very
similar to C structs, I was determined to not continue with the poke
implementation until I had described as many as binary formats in my
language as possible. That, I reckoned, was the only way to make sure
the implemented language would be expressive, complete and useful
enough to fulfill my requirements.
The first formats I implemented using my immature little language
included ELF, FLV, MP3, BSON... all of them describing structures
based on whole bytes. Even when they try to be compact, it is always
by packing bit-fields in elements that are, invariably, sized as a
multiple of bytes. Consequently, the language I was evolving became
byte oriented as well. No doubt also influenced by my C inheritance,
I would think of bitfields either as a sort of second class citizen,
or as mere results of shifting and masking.
This worked well. The language evolved to be able to express many
different aspects of these formats in a very nice way, like
variable-length data and holes in structures. Consider the following
definition in what actually is <b>not</b> valid today's Poke:
type Data =
struct
{
byte magic;
byte count;
byte dstart;
byte[count] data @ dstart;
};
The data starts with a byte that is a magic number. Then the size of
the data stored, in bytes, and then the data itself. This data,
however, doesn't start right after 'dstart': it starts at 'dstart',
which is expressed as an offset, in bytes, since the beginning of the
Data. I conceived struct field labels to be any expression evaluating
to an integer, which would be... bytes, obviously.
Very nice but...
Deflate!
Then, one day, it was the turn for IETF RFC1951, which is the
specification of the DEFLATE algorithm and associated file format. Oh
dear. Near the beginning of the spec document it can be read:
> This document does not address the issue of the order in which bits
> of a byte are transmitted on a bit-sequential medium, since the
> final data format described here is byte- rather than bit-oriented.
> However, we describe the compressed block format in below, as a
> sequence of data elements of various bit lengths, not a sequence of
> bytes.
Then it goes on describing rules to pack the DEFLATE elements into
bytes. I was appalled, and certainly sort of deflated as well. The
purpose of my program was precisely to edit binary in terms of the
data elements described by a format. And in this case, these data
elements came in all sort of bit lengths and alignments. This can be
seen in the following RFC1951 excerpt, that describes the header of a
compressed block:
> Each block of compressed data begins with 3 header bits
> containing the following data:
>
> first bit BFINAL
> next 2 bits BTYPE
>
> Note that the header bits do not necessarily begin on a byte
> boundary, since a block does not necessarily occupy an integral
> number of bytes.
At this point I understood that my little language on the works would
be never capable to describe the DEFLATE structures naturally: C-like
bit-fields, masking and shifting, all based on byte-oriented
containers and boundaries, would never provide the slickness I wanted
for my editor. I mean, just use C and get done with it.
This pissed me off. Undoubtely other formats and protocols would be
impacted in a similar way. Even when most formats are byte oriented,
what am I supposed to tell to the people hacking bit-oriented stuff?
"Sorry pal, this is not supported, this program is not for you"? No
way, I thought, not on my watch.
General Failure
The obvious solution for the problem, is to be general. In this case,
to express every offset and every memory size in bits instead of
bytes. While this obviously gives the language maximum expressiveness
power, and is ideal for expressing the few bit-oriented formats, it
has the disadvantage of being very inconvenient for most situations.
To see how annoying this is, let's revisit the little Data element we
saw above. In a bit-oriented description language, we would need to
write something like:
type BitData =
struct
{
byte magic;
byte count;
byte dstart;
byte[count] data @ dstart * 8;
};
Yeah... exactly. The '*' and '8' keys in the keyboards of the poke
users would wear out very fast, not to mention their patience as well.
Also, should I provide both 'sizeof' and 'bitsizeof' operators?
Nasty.
I am very fond of the maxim "Never write a program you would never use
yourself [1], so I resigned myself to make GNU poke byte oriented, and
to provide as many facilities for operating on bit-fields as possible.
Fortunately, I have smart friends :))
United Rabbits
During one of the Rabbit Herd's Hacking Weekends I shared my
frustration and struggle with the other rabbits, and we came to
realize that offsets and data sizes in Poke should not be pure
magnitudes or mere integer values: they should be united. They should
have units.
It makes full sense when you come to think about it. For a program
like poke, it is only natural to talk about different memory units,
like bits, bytes, kilobytes, megabits, and so on. Bits and bytes are
just too common units. Apart from allowing me to express values in
different units, this approach also has other benefits as we will see
shortly.
I'm really grateful to Bruno Haible, Luca Saiu and Nacho Gonzalez for
putting me on the right track.
So, armed with this insight, I developed the idea fast. The first
step was to come with a convenient syntax to denote these united
values, which I called <i>offsets</i> (because in a binary editor you
mostly use them to denote offsets in the file you are editing). After
a not very satisfactory attempt (where '[12 b]' would denote twelve
bits) I settled in what poke supports today:
12#B
7#b
1024#Kb
The offsets above denote twelve bytes, seven bits and one thousand
twenty four kilobytes, respectively. The unit can be separated from
the magnitude by blank characters, so you can write the following
instead if you are so inclined:
12 #B
7 #b
(1024 * 1024) #Kb
Then came a little algebra for offsets:
OFFSET + OFFSET -> OFFSET
OFFSET - OFFSET -> OFFSET
OFFSET * MAGNITUDE -> OFFSET
OFFSET / OFFSET -> MAGNITUDE
OFFSET % OFFSET -> OFFSET
The addition or subtraction of two offsets results in another offset:
'1#B + 1#b -> 9#b'. Multiplying an offset by a magnitude gives you
another offset [2]: '8#b * 2 -> 2#B'. Dividing two offsets gives you
a magnitude: '16#b / 1#B -> 2'. Finally, the modulus of two offsets
is another offset: '17#b / 1#B -> 1#b'.
These offsets soon proved a very comfortable and smooth way to convert
memory magnitudes between different units: just use units like you do
when doing physics or working with units in other contexts. To
promote this idea, I also added a little syntactic trick: if the
magnitude of an offset is 1, you can omit it. This allows you to
write things like:
24 #Kb/#B
Which gives you the number of bytes in 24 kilobytes.
Introducing offsets as first class citizen values in the Poke language
gave me exactly what I was aiming for. Typical byte oriented
constructs can still be expressed naturally:
type Data =
struct
{
byte magic;
byte count;
offset<byte,B> dstart;
byte[count] data @ dstart;
};
But also the odd bit-oriented format:
type BitData =
struct
{
byte magic;
byte count;
offset<byte,b> dstart;
byte[count] data @ dstart;
};
General Satisfaction
Once I had a first implementation of the offsets and their operators
working, I asked myself what would be the right set of units to offer
to users. In the offset syntax, units are specified as '#FOO', where
'FOO' is the name of the unit. I had a list of hardcoded units like
'b' for bits, 'B' for bytes, 'Kb' and so on. But what if the unit
that _you_ want is not among them?
So I also allowed to express units in multiples of the base unit,
which is the bit. Using this syntax, you can express offsets in any
arbitrary unit, as disparate as it may seem:
17#3
0#12
8#1
Thats it, 17 units of 3 bits each, zero units of 12 bits each, and
eight units of 1 bit each. Sweet!
But then, why stopping there? Poking is all about defining data
structures and operating on them... so why not using these structures
as units as well? Consider the following struct:
type Packet = struct { int i; long j; };
Unlike the 'Data' struct above, the size of a 'Packet' is known at
compile time. Wouldn't it be nice to use it as a unit in offsets?
Sure it is:
23#Packet
The above is the size occupied by 23 packets. And how many kilobytes
are in 23 packets? Easy:
23 #Packet/#Kb
Expressing offsets as united values also relieves the programmer from
doing many explicit unit conversions: poke can do them for you.
Consider for example an ELF section header. One of its fields is the
size of the described section, in bytes:
type Elf64_Shdr =
struct
{
...
offset<Elf64_Xword,B> sh_size;
...
};
If a given section is to contain, say, relocations with addends, we
can set its size doing something like this:
shdr.sh_size = 10#Elf64_Rela;
Instead of doing the conversion to bytes explicitly.
Finally, having fields in structs as offsets also makes poke aware of
what is a reference to other parts of the file. This will be used in
the near future when we add support for resizing structures
automagically... yeah that's coming.
Happy poking! :)
Footnotes:
[1] Actually it is Lord Vetinari's "Never build a dungeon you can't
get out of." but the point is the same.
[2] Considering memory is linear, in principle multiplying two offsets
woudn't make a lot of sense... then again, something like '2#B
* 8#B -> ##16#B<sup>2</sup>' could maybe be seful when working
with tables... hmmm tempting ;)