_____
 ---'   __\_______
            ______)       Dealing with alternatives - Unions in Poke
            __)
           __)
 ---._______)

                                                      Jose E. Marchesi
                                                      October 18, 2019

The Poke type definitions can be seen as a sort of declarative
specifications for decoding and encoding procedures.  You specify the
structure of the data you want to operate on, and poke uses that
information to automagically decode and encode the data for you.
Under this perspective, struct types correspond to sequences of
instructions, array types to repetitions or loops, and union types to
alternatives or conditionals.

Decode-Compute-Encode


Computing with data whose form is not the most convenient way to be
manipulated, like is often the case in unstructured binary data,
requires performing a preliminary step that transforms the data into a
more convenient representation, usually featuring a higher level of
abstraction.  This step is known in computer jargon as
"unmarshalling", when the data is fetch from some storage or
transmission media or, more generally, "decoding".

Once the computation has been performed, the result should be
transformed back to the low-level representation to be stored or
transmitted.  This is performed in a closing step known as
"marshalling" or, more generally, "encoding".

Consider the following C program whose purpose is to read a 32-bit
signed integer from a byte-oriented storage media at a given offset,
multiply it by two, and store the result at the same offset.

void double_number (int fd, off_t offset, int endian)
{
   int number, i;
   int b[4];

   /* Decode.  */
   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      read (fd, &b[i], 1);

   if (endian == BIG)
     number = b[0] << 24 | b[1] << 16 | b[2] << 8 | b[3];
   else
     number = b[3] << 24 | b[2] << 16 | b[1] << 8 | b[0];

   /* Compute.  */
   number = number * 2;

   /* Encode.  */
   if (endian == BIG)
   {
     b[0] = (number >> 24) & 0xff;
     b[1] = (number >> 16) & 0xff;
     b[2] = (number >> 8) & 0xff;
     b[3] = number & 0xff;
   }
   else
   {
     b[3] = (number >> 24) & 0xff;
     b[2] = (number >> 16) & 0xff;
     b[1] = (number >> 8) & 0xff;
     b[0] = number & 0xff;
   }

   lseek (fd, offset, SEEK_SET);
   for (i = 0; i < 4; ++i)
      write (fd, &b[i], 1);
}


As we can see, decoding takes care of fetching the data from the
storage in simple units, bytes.  Then it mounts the more abstract
entity on which the computation will be performed, in this case a
signed 32-bit integer.  Considerations like endianness, negative
encoding (which is assumed to be two's complement in this example and
handled automatically by C) and error conditions (omitted in this
example for clarity) should be handled properly.

Conversely, encoding turns the signed 32-bit integer into a sequence
of bytes and then writes them out to the storage at the desired
offset.  Again, this requires taking endianness into account and
handling error conditions.

This example may look simplistic and artificial, and it is, but too
often the computation proper (like multiplying the integer by two) is
way more straightforward than the decoding and encoding of the data
used for the computation.

Generally speaking, decoding and encoding binary data is laborious and
error prone.  Think about sequences of elements, variable-length and
clever compact encodings, elements not aligned to byte boundaries, the
always bug-prone endianness, and a long etc.  Dirty business,
sometimes risky, and always *boring*.

Describe-Compute

This is where poke comes into play.  Basically, it allows you to
describe the characteristics of the data you want to compute on, and
then decodes and encodes it for you, taking care of the gory details.
That way you can concentrate your energy on the *fun* part: computing
on the data at your pleasure.

Of course, you are still required to provide a description of the
data.  In Poke, these descriptions take the form of "type
definitions", which are declarative: you specify *what* you want, and
poke extracts the *how* from that.

For example, consider the following Poke type definition:

type Packet =
  struct
  {
    uint<16> magic : magic == 0xef;
    uint<32> size;
    byte[size] data @ 64#B;
  };


This tells poke that, in order to decode a `Packet', it should perform
the following procedure (a similar procedure is implied for encoding):

* Read four bytes from the IO space, mount them into an unsigned
  16-bit integer using whatever current endianness, and put it in
  `magic'.

* Read eight bytes, mount them into an unsigned 32-bit integer using
  the same endianness, and put it in `size'.

* Seek the IO space to advance 16 bits.

* Do `size' times: read one byte and mount it into an unsigned 8-bit
  integer, then put the integer in the proper place in the `data'
  array.

If during this procedure an end of file is encountered, or a
constraint fails, an appropriate error is raised.

In the procedure sketched above we find a sequence of operations,
implied by the struct type, and a loop, implied by the array type.  As
we shall see below, it is also possible to decode conditionally.
Union types are used for that purpose.

BSON

BSON is a binary encoding for JSON documents.  The top-level entity
described by the spec is the document, which contains the total size
of the document, a sequence of elements, and a trailing null byte.

We can describe the above in Poke with the following type definition:

type BSON_Doc =
  struct
  {
    offset<int32,B> size;
    BSON_Elem[size - size'size - 1#B] elements;
    byte endmark : endmark == 0x0;
  };


BSON elements come in different kinds, which correspond to the
different types of JSON entities: 32-bit integers, 64-bit integers,
strings, arrays, timestamps, and so on.  Every element starts with a
tag, which is a 8-bit unsigned integer that identifies it's kind, and
a name encoded as a NULL-terminated string.  What comes next depends
on the kind of element.

The following Poke type definition describes a subset of BSON
elements, namely integers, big integers and strings:

type BSON_Elem =
  struct
  {
    byte tag;
    string name;

    union
    {
      int32 integer32 : tag == 0x10;
      int64 integer64 : tag == 0x12;
      BSON_String str : tag == 0x02;
    } value;
  };


The union in `BSON_Elem' corresponds to the variable part.  When poke
decodes an union, it tries to decode each alternative (union field) in
turn.  The first alternative that is successfully decoded without
raising a constraint violation exception is the chosen one.  If no
alternative can be decoded, a constraint violation exception is
raised.

To see this process in action, let's use the BSON corresponding to the
following little JSON document [1]:

{
    "name" : "Jose E. Marchesi",
    "age" : 39,
    "big" : 1076543210012345
}


Let's take a look to the different elements [2]:

(poke) .load bson.pk
(poke) var d = BSON_Doc @ 0#B
(poke) d.elements'length
<span class="integer">0x3UL
(poke) d.elements[0]
BSON_Elem&nbsp;{tag=0x2UB,name=&quot;name&quot;,value=struct&nbsp;{str=BSON_String&nbsp;{size=0x11,value=&quot;Jose&nbsp;E.&nbsp;Marchesi&quot;,chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem&nbsp;{tag=0x10UB,name=&quot;age&quot;,value=struct&nbsp;{integer32=0x27}}
(poke) d.elements[2]
BSON_Elem&nbsp;{tag=0x12UB,name=&quot;big&quot;,value=struct&nbsp;{integer64=0x3d31c3f9e3eb9L}}


Note how unions decode into structs featuring different fields.  What
field is available depends on the alternative chosen while decoding.

In the session above, `d.elements[1].value' contains an `integer32'
field, whereas `d.elements[2].value' contains an `integer64' field.
Let's see what happens if we try to access the wrong field:

(poke) d.elements[1].value.integer64
unhandled&nbsp;invalid&nbsp;element&nbsp;exception


We get a run-time exception.  This kind of errors cannot be catched at
compile time, since both `integer32' and `integer64' are potentially
valid fields in the union `value'.

Unions are Tagged

Unlike in C, Poke unions are tagged.  Unions have their own lexical
closures, and it is the capured values that determine what field is
chosen at every time.  Wherever the union goes, its tag accompanies
it.

To see this more clearly, consider the following alternative Poke
description of the BSON elements:

type BSON_Elem =
  union
  {
    struct
    {
      byte tag : tag == 0x10;
      string name;
      int32 value;
    } integer32;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      int64 value;
    } integer64;

    struct
    {
      byte tag : tag == 0x12;
      string name;
      BSON_String value;
    } str;
  };


This description is way more verbose, but it shows a few important
properties of Poke unions.

First, the constraints guiding the decoding are not required to appear
in the union itself: it is a recursive process.  In this example,
'BSON_String' could have constraints on it's own, and these
constraints will also impact the decoding of the union.

Second, there are generally many different ways to express the same
plain binary using different type structures.  This is no different
than getting different parse trees from the same sequence of tokens
using different grammars denoting the same language.

See how different a BSON element looks (and feels) using this
alternative description:

(poke) d.elements[0]
BSON_Elem {str=struct {tag=0x2UB,name=&quot;name&quot;,value=BSON_String {size=0x11,value=&quot;Jose E. Marchesi&quot;,chars=[0x4aUB,0x6fUB,0x73UB,0x65UB,0x20UB,0x45UB,0x2eUB,0x20UB,0x4dUB,0x61UB,0x72UB,0x63UB,0x68UB,0x65UB,0x73UB,0x69UB,0x0UB]}}}
(poke) d.elements[1]
BSON_Elem {integer32=struct {tag=0x10UB,name=&quot;age&quot;,value=0x27}}
(poke) d.elements[2]
BSON_Elem {integer64=struct {tag=0x12UB,name=&quot;big&quot;,value=0x3d31c3f9e3eb9L}}


What is the best way?  It certainly depends on the kind of data you
want to manipulate, and the level of abstraction you want to achieve.
Ultimately, it is up to you.

Generally speaking, the best structuring is the one that allows you to
manipulate the data in terms of the structured abstractions as
naturally as possible.  That's the art and craft of writing good
pickles.

GNU poke is evolving to facilitate this task as much as possible.
Features like struct flattening and unification of union fields
(i.e. to be able to have different alternatives with the same name but
potentially different types) are being worked on, so stay tuned.

Happy poking! :)

Footnotes:

[1] To generate the BSON from the JSON I used libbson and its little
    example program `json2bson'