_____
 ---'   __\_______
            ______)       Endianness in Poke - And a little nice hack
            __)
           __)
 ---._______)

                                                      Jose E. Marchesi
                                                      October 10, 2019


Byte endianness is an important aspect of encoding data.  As a good
binary editor, poke provides support for both little and big endian,
and will soon acquire the ability to handle exotic endiannesses such
as PDP endian.  Endianness control is integrated in the Poke language,
and is designed to be easy to use in type descriptions.  Let's see how.

GNU poke maintains a global variable that holds the current
endianness.  This is the endianness that will be used when mapping
integers whose types do not specify an explicit endianness.

Like other poke global state, this global variable can be modified
using the `.set' dot-command:

.set endian little
.set endian big
.set endian host


We can easily see how changing the current endianness indeed impacts
the way integers are mapped:

(poke) dump :from 0#B :size 4#B :ruler 0 :ascii 0
00000000: 8845 4c46
(poke) .set endian little
(poke) int @ 0#B
0x464c4588
(poke) .set endian big
(poke) int @ 0#B
0x88454c46
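
For readers who want to check the arithmetic outside of poke, here is
a minimal Python sketch that decodes the same four bytes both ways.
The helper name `read_u32' is made up for this illustration, and the
value is read as unsigned so that the hexadecimal digits match the
session above:

```python
# The four bytes shown by the dump above, in file order.
data = bytes([0x88, 0x45, 0x4C, 0x46])


def read_u32(buf: bytes, endian: str) -> int:
    """Decode the first four bytes of buf as an unsigned 32-bit integer.

    endian is "little" or "big", playing the role of `.set endian'.
    """
    return int.from_bytes(buf[:4], byteorder=endian, signed=False)


print(hex(read_u32(data, "little")))  # 0x464c4588
print(hex(read_u32(data, "big")))     # 0x88454c46
```

The bytes on disk never change; only the order in which they are
combined into an integer does.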


However, as handy as this dot-command may be, it is also important to
be able to change the current endianness programmatically from a Poke
program.  For that purpose, the PKL compiler provides a couple of
built-in functions: `get_endian' and `set_endian'.

Their definitions, along with the specific supported values, look
like:

var ENDIAN_LITTLE = 0;
var ENDIAN_BIG = 1;

fun get_endian = int: { ... }
fun set_endian = (int endian) int: { ... }


Accessing the current endianness programmatically is especially useful
in situations where the data being poked features a different
structure, depending on the endianness.

A good (or bad) example of this is the way registers are encoded in
eBPF instructions.  eBPF is the in-kernel virtual machine of Linux,
and features an ISA with ten general-purpose registers.  eBPF
instructions generally use two registers, namely the source register
and the destination register.  Each register is encoded using 4 bits,
and the fields encoding registers are consecutive in the instructions.

Typical.  However, for reasons I won't be discussing right now
(because I'm having a nice night and don't want to ruin it) the order
of the source and destination register fields is switched depending on
the endianness.

In big-endian systems the order is:

dst:4 src:4


Whereas in little-endian systems the order is:

src:4 dst:4


In Poke, the obvious way of representing data whose structure depends
on some condition is using a union.  In this case, it could read like
this:

type BPF_Insn_Regs =
  union
  {
    struct
    {
      BPF_Reg src;
      BPF_Reg dst;
    } le : get_endian == ENDIAN_LITTLE;

    struct
    {
      BPF_Reg dst;
      BPF_Reg src;
    } be;
  };


Note the call to the `get_endian' function (which takes no arguments
and thus can be called Algol68-style, without specifying an empty
argument list) in the constraint of the union alternative.  This way,
the register fields will have the right order corresponding to the
current endianness.

Nifty.  However, there is an even better way to denote the structure
of these fields.  This is it:

type BPF_Insn_Regs =
  struct
  {
    var little_p = (get_endian == ENDIAN_LITTLE);

    BPF_Reg src @ !little_p * 4#b;
    BPF_Reg dst @ little_p * 4#b;
  };


This version, where the ordering of the fields is implemented using
field labels, is not only more compact, but also has the virtue of not
requiring additional "intermediate" fields like `le' and `be' above.
It also shows how convenient it can be to declare variables inside
structs.

Let's see it in action:

(poke) BPF_Insn_Regs @ 1#B
BPF_Insn_Regs {src=%r4,dst=%r5}
(poke) .set endian big
(poke) BPF_Insn_Regs @ 1#B
BPF_Insn_Regs {src=%r5,dst=%r4}
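
The label arithmetic can be modelled outside of poke with a short
Python sketch.  It assumes that the byte at offset 1 is the 0x45 seen
in the earlier dump, and that bit offsets count from the most
significant bit of the byte, which is consistent with the session
above.  The names `nibble_at' and `decode_regs' are hypothetical
helpers, not part of poke:

```python
def nibble_at(byte: int, bit_offset: int) -> int:
    """Return the 4-bit field that starts bit_offset bits after the
    most significant bit of byte (assumed bit numbering)."""
    return (byte >> (8 - bit_offset - 4)) & 0xF


def decode_regs(byte: int, little: bool) -> dict:
    """Mimic the field labels of the struct version of BPF_Insn_Regs."""
    little_p = 1 if little else 0
    src_off = (1 - little_p) * 4  # mirrors: !little_p * 4#b
    dst_off = little_p * 4        # mirrors: little_p * 4#b
    return {"src": nibble_at(byte, src_off), "dst": nibble_at(byte, dst_off)}


print(decode_regs(0x45, little=True))   # {'src': 4, 'dst': 5}
print(decode_regs(0x45, little=False))  # {'src': 5, 'dst': 4}
```

Exactly one of the two offsets is zero for any given endianness, so
the two fields always tile the byte without overlapping.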


Note the pretty printing of registers.  This is achieved by having a
pretty-printer method in the definition of `BPF_Reg':

type BPF_Reg =
  struct
  {
   uint<4> code;

   fun _print = void:
   {
    print "%";
    if (code < BPF_R9)
      printf "r%i32d", code;
    else
      print "fp";
   }
  };


Changing the current endianness in constraint expressions is useful
when dealing with binary formats that specify the endianness of the
data that follows using some sort of tag.  This is the case of ELF,
for example.

The first few bytes in an ELF header make up what is known as the
`e_ident'.  One of these bytes is called `ei_data' and its value
specifies the endianness of the data stored in the ELF file.

This is how we handle this in Poke:

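/* Translate an `ei_data' value into the corresponding ENDIAN_* value,
   as expected by `set_endian'.  */
fun elf_endian = (byte ei_data) int:
 {
   if (ei_data == ELFDATA2LSB)
     return ENDIAN_LITTLE;
   else
     return ENDIAN_BIG;
 }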

[...]

type Elf64_Ehdr =
  struct
  {
    struct
    {
      byte[4] ei_mag : ei_mag[0] == 0x7fUB
                       && ei_mag[1] == 'E'
                       && ei_mag[2] == 'L'
                       && ei_mag[3] == 'F';
      byte ei_class;
      byte ei_data : (ei_data != ELFDATANONE
                      && set_endian (elf_endian (ei_data)));
      byte ei_version;
      byte ei_osabi;
      byte ei_abiversion;
      byte[6] ei_pad;
      offset<byte,B> ei_nident;
    } e_ident;

    [...]
  };


Note how `set_endian' returns an integer value, which is always 1.
This makes it easy to use in field constraint expressions: the call
changes the endianness as a side effect and never causes the
constraint to fail.
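
The same idea, a tag byte selecting the byte order of everything that
follows, can be sketched in Python.  The constants and offsets below
come from the ELF specification (ei_data is byte 5 of e_ident, and
e_type is the 16-bit field right after the 16-byte e_ident).  The
function names and the `make_header' helper are hypothetical and only
build a toy header for the demonstration:

```python
ELFDATANONE, ELFDATA2LSB, ELFDATA2MSB = 0, 1, 2
EI_DATA = 5         # index of ei_data within e_ident
E_TYPE_OFFSET = 16  # e_type follows the 16-byte e_ident


def byteorder_from_ident(header: bytes) -> str:
    """Return "little" or "big" according to the ei_data tag."""
    if header[:4] != b"\x7fELF":
        raise ValueError("bad ELF magic")
    ei_data = header[EI_DATA]
    if ei_data == ELFDATA2LSB:
        return "little"
    if ei_data == ELFDATA2MSB:
        return "big"
    raise ValueError("invalid ei_data value")


def read_e_type(header: bytes) -> int:
    """Decode e_type using the byte order announced by ei_data."""
    order = byteorder_from_ident(header)
    return int.from_bytes(header[E_TYPE_OFFSET:E_TYPE_OFFSET + 2], order)


def make_header(ei_data: int, e_type_bytes: bytes) -> bytes:
    """Hypothetical helper: a toy header with e_ident plus e_type only."""
    ident = b"\x7fELF" + bytes([2, ei_data, 1, 0, 0]) + bytes(7)
    return ident + e_type_bytes


# The same logical value (2, ET_EXEC) stored in both byte orders.
print(read_e_type(make_header(ELFDATA2LSB, b"\x02\x00")))  # 2
print(read_e_type(make_header(ELFDATA2MSB, b"\x00\x02")))  # 2
```

Unlike the Poke version, this sketch passes the byte order around
explicitly instead of changing global state from inside a constraint.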

Happy poking! :)