_____
 ---'   __\_______
            ______)      Values of the world, unite! - Offsets in Poke
            __)
           __)
 ---._______)

                                                      Jose E. Marchesi
                                                         June 06, 2019

Early in  the design of what  is becoming GNU  poke I was struck  by a
problem  that, to  my  surprise, would  prove  not easy  to  fix in  a
satisfactory  way:  would  I  make   a  byte-oriented  program,  or  a
bit-oriented program?   Considering that  the program in  question was
nothing  less than  an  editor  for binary  data,  this  was no  petty
dilemma.

Since the very beginning I had a pretty clear idea of what I wanted to
achieve: a binary editor that would be capable of editing user defined
data structures,  besides bytes and bits.   I also knew I  needed some
sort  of domain  specific language  to describe  these structures  and
operate on them.  How that language  would look like, and what kind of
abstractions it would  provide, however, was not clear to  me.  Not at
all.

So once I  sketched an initial language design,  barely something very
similar to C  structs, I was determined to not  continue with the poke
implementation until I  had described as many as binary  formats in my
language as possible.  That, I reckoned, was the only way to make sure
the  implemented language  would  be expressive,  complete and  useful
enough to fulfill my requirements.

The  first formats  I implemented  using my  immature little  language
included  ELF, FLV,  MP3, BSON...  all of  them describing  structures
based on whole bytes.  Even when they  try to be compact, it is always
by packing  bit-fields in  elements that are,  invariably, sized  as a
multiple of bytes.   Consequently, the language I  was evolving became
byte oriented as well.  No doubt  also influenced by my C inheritance,
I would think  of bitfields either as a sort  of second class citizen,
or as mere results of shifting and masking.

This worked  well.  The language  evolved to  be able to  express many
different  aspects  of  these  formats   in  a  very  nice  way,  like
variable-length data and holes  in structures.  Consider the following
definition in what actually is <b>not</b> valid today's Poke:

      type Data =
      struct
        {
          byte magic;
          byte count;
          byte dstart;

          byte[count] data @ dstart;
        };
The data starts with a byte that is a magic number. Then the size of the data stored, in bytes, and then the data itself. This data, however, doesn't start right after 'dstart': it starts at 'dstart', which is expressed as an offset, in bytes, since the beginning of the Data. I conceived struct field labels to be any expression evaluating to an integer, which would be... bytes, obviously. Very nice but...

Deflate!

Then, one day, it was the turn for IETF RFC1951, which is the specification of the DEFLATE algorithm and associated file format. Oh dear. Near the beginning of the spec document it can be read:
> This document does not address the issue of the order in which bits
> of a byte are transmitted on a bit-sequential medium, since the
> final data format described here is byte- rather than bit-oriented.
> However, we describe the compressed block format in below, as a
> sequence of data elements of various bit lengths, not a sequence of
> bytes.
Then it goes on describing rules to pack the DEFLATE elements into bytes. I was appalled, and certainly sort of deflated as well. The purpose of my program was precisely to edit binary in terms of the data elements described by a format. And in this case, these data elements came in all sort of bit lengths and alignments. This can be seen in the following RFC1951 excerpt, that describes the header of a compressed block:
> Each block of compressed data begins with 3 header bits
> containing the following data:
>
> first bit BFINAL
> next 2 bits BTYPE
>
> Note that the header bits do not necessarily begin on a byte
> boundary, since a block does not necessarily occupy an integral
> number of bytes.
At this point I understood that my little language on the works would be never capable to describe the DEFLATE structures naturally: C-like bit-fields, masking and shifting, all based on byte-oriented containers and boundaries, would never provide the slickness I wanted for my editor. I mean, just use C and get done with it. This pissed me off. Undoubtely other formats and protocols would be impacted in a similar way. Even when most formats are byte oriented, what am I supposed to tell to the people hacking bit-oriented stuff? "Sorry pal, this is not supported, this program is not for you"? No way, I thought, not on my watch.

General Failure

The obvious solution for the problem, is to be general. In this case, to express every offset and every memory size in bits instead of bytes. While this obviously gives the language maximum expressiveness power, and is ideal for expressing the few bit-oriented formats, it has the disadvantage of being very inconvenient for most situations. To see how annoying this is, let's revisit the little Data element we saw above. In a bit-oriented description language, we would need to write something like:
      type BitData =
      struct
        {
          byte magic;
          byte count;
          byte dstart;

          byte[count] data @ dstart * 8;
        };
Yeah... exactly. The '*' and '8' keys in the keyboards of the poke users would wear out very fast, not to mention their patience as well. Also, should I provide both 'sizeof' and 'bitsizeof' operators? Nasty. I am very fond of the maxim "Never write a program you would never use yourself [1], so I resigned myself to make GNU poke byte oriented, and to provide as many facilities for operating on bit-fields as possible. Fortunately, I have smart friends :))

United Rabbits

During one of the Rabbit Herd's Hacking Weekends I shared my frustration and struggle with the other rabbits, and we came to realize that offsets and data sizes in Poke should not be pure magnitudes or mere integer values: they should be united. They should have units. It makes full sense when you come to think about it. For a program like poke, it is only natural to talk about different memory units, like bits, bytes, kilobytes, megabits, and so on. Bits and bytes are just too common units. Apart from allowing me to express values in different units, this approach also has other benefits as we will see shortly. I'm really grateful to Bruno Haible, Luca Saiu and Nacho Gonzalez for putting me on the right track. So, armed with this insight, I developed the idea fast. The first step was to come with a convenient syntax to denote these united values, which I called <i>offsets</i> (because in a binary editor you mostly use them to denote offsets in the file you are editing). After a not very satisfactory attempt (where '[12 b]' would denote twelve bits) I settled in what poke supports today:
      12#B
      7#b
      1024#Kb
The offsets above denote twelve bytes, seven bits and one thousand twenty four kilobytes, respectively. The unit can be separated from the magnitude by blank characters, so you can write the following instead if you are so inclined:
      12 #B
      7 #b
      (1024 * 1024) #Kb
Then came a little algebra for offsets:
      OFFSET + OFFSET -> OFFSET
      OFFSET - OFFSET -> OFFSET
      OFFSET * MAGNITUDE -> OFFSET
      OFFSET / OFFSET -> MAGNITUDE
      OFFSET % OFFSET -> OFFSET
The addition or subtraction of two offsets results in another offset: '1#B + 1#b -> 9#b'. Multiplying an offset by a magnitude gives you another offset [2]: '8#b * 2 -> 2#B'. Dividing two offsets gives you

a magnitude: '16#b / 1#B -> 2'. Finally, the modulus of two offsets

is another offset: '17#b / 1#B -> 1#b'.

These offsets soon proved a very comfortable and smooth way to convert memory magnitudes between different units: just use units like you do when doing physics or working with units in other contexts. To promote this idea, I also added a little syntactic trick: if the magnitude of an offset is 1, you can omit it. This allows you to write things like:
      24 #Kb/#B
Which gives you the number of bytes in 24 kilobytes. Introducing offsets as first class citizen values in the Poke language gave me exactly what I was aiming for. Typical byte oriented constructs can still be expressed naturally:
      type Data =
        struct
        {
          byte magic;
          byte count;
          offset&lt;byte,B&gt; dstart;

          byte[count] data @ dstart;
        };
But also the odd bit-oriented format:
      type BitData =
        struct
        {
          byte magic;
          byte count;
          offset&lt;byte,b&gt; dstart;

          byte[count] data @ dstart;
        };

General Satisfaction

Once I had a first implementation of the offsets and their operators working, I asked myself what would be the right set of units to offer to users. In the offset syntax, units are specified as '#FOO', where 'FOO' is the name of the unit. I had a list of hardcoded units like 'b' for bits, 'B' for bytes, 'Kb' and so on. But what if the unit that _you_ want is not among them? So I also allowed to express units in multiples of the base unit, which is the bit. Using this syntax, you can express offsets in any arbitrary unit, as disparate as it may seem:
      17#3
      0#12
      8#1
Thats it, 17 units of 3 bits each, zero units of 12 bits each, and eight units of 1 bit each. Sweet! But then, why stopping there? Poking is all about defining data structures and operating on them... so why not using these structures as units as well? Consider the following struct:
      type Packet = struct { int i; long j; };
Unlike the 'Data' struct above, the size of a 'Packet' is known at compile time. Wouldn't it be nice to use it as a unit in offsets? Sure it is:
      23#Packet
The above is the size occupied by 23 packets. And how many kilobytes are in 23 packets? Easy:
      23 #Packet/#Kb
Expressing offsets as united values also relieves the programmer from doing many explicit unit conversions: poke can do them for you. Consider for example an ELF section header. One of its fields is the size of the described section, in bytes:
      type Elf64_Shdr =
        struct
        {
         ...
         offset&lt;Elf64_Xword,B&gt; sh_size;
         ...
        };
If a given section is to contain, say, relocations with addends, we can set its size doing something like this:
      shdr.sh_size = 10#Elf64_Rela;
Instead of doing the conversion to bytes explicitly. Finally, having fields in structs as offsets also makes poke aware of what is a reference to other parts of the file. This will be used in the near future when we add support for resizing structures automagically... yeah that's coming. Happy poking! :) Footnotes: [1] Actually it is Lord Vetinari's "Never build a dungeon you can't get out of." but the point is the same. [2] Considering memory is linear, in principle multiplying two offsets woudn't make a lot of sense... then again, something like '2#B * 8#B -> ##16#B<sup>2</sup>' could maybe be seful when working with tables... hmmm tempting ;)