Applied Pokology - A blog about GNU poke

Endianness in Poke - And a little nice hack
[25-10-2019]

by Jose E. Marchesi

Byte endianness is an important aspect of encoding data. As a good binary editor, poke supports both little and big endian, and will soon acquire the ability to encode exotic endiannesses like PDP endian. Endianness control is integrated in the Poke language, and is designed to be easy to use in type descriptions. Let's see how.

GNU poke maintains a global variable that holds the current endianness. This is the endianness that will be used when mapping integers whose types do not specify an explicit endianness.

Like other poke global state, this global variable can be modified using the .set dot-command:

.set endian little
.set endian big
.set endian host

We can easily see how changing the current endianness indeed affects the way integers are mapped:

(poke) dump :from 0#B :size 4#B :ruler 0 :ascii 0
00000000: 8845 4c46
(poke) .set endian little
(poke) int @ 0#B
0x464c4588
(poke) .set endian big
(poke) int @ 0#B
0x88454c46
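
The same result can be cross-checked outside of poke. What follows is a small Python sketch, not Poke code, that decodes the four dumped bytes in both byte orders. The variable names are only illustrative. Poke's int is signed, so the sketch decodes the bytes as unsigned in order to get the same hexadecimal digits that poke prints.

```python
# The four bytes shown by the dump command above.
data = bytes([0x88, 0x45, 0x4c, 0x46])

# Interpret them as an unsigned 32-bit integer in each byte order.
as_little = int.from_bytes(data, byteorder="little")
as_big = int.from_bytes(data, byteorder="big")

print(hex(as_little))  # 0x464c4588
print(hex(as_big))     # 0x88454c46
```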

However, as handy as this dot-command may be, it is also important to be able to change the current endianness programmatically from a Poke program. For that purpose, the PKL compiler provides a couple of built-in functions: get_endian and set_endian.

Their definitions, along with the specific supported values, look like:

var ENDIAN_LITTLE = 0;
var ENDIAN_BIG = 1;
      
fun get_endian = int: { ... }
fun set_endian = (int endian) int: { ... }

Accessing the current endianness programmatically is especially useful in situations where the data being poked features a different structure, depending on the endianness.

A good (or bad) example of this is the way registers are encoded in eBPF instructions. eBPF is the in-kernel virtual machine of Linux, and features an ISA with ten general-purpose registers. eBPF instructions generally use two registers, namely the source register and the destination register. Each register is encoded using 4 bits, and the fields encoding registers are consecutive in the instructions.

Typical. However, for reasons I won't be discussing right now (because I'm having a nice night and don't want to ruin it) the order of the source and destination register fields is switched depending on the endianness.

In big-endian systems the order is:

dst:4 src:4

Whereas in little-endian systems the order is:

src:4 dst:4

In Poke, the obvious way of representing data whose structure depends on some condition is to use a union. In this case, it could read like this:

type BPF_Insn_Regs =
  union
  {
    struct
    {
      BPF_Reg src;
      BPF_Reg dst;
    } le : get_endian == ENDIAN_LITTLE;

    struct
    {
      BPF_Reg dst;
      BPF_Reg src;
    } be;
  };

Note the call to the get_endian function (which takes no arguments and thus can be called Algol68-style, without specifying an empty argument list) in the constraint of the union alternative. This way, the register fields will have the right order corresponding to the current endianness.

Nifty. However, there is an even better way to denote the structure of these fields. This is it:

type BPF_Insn_Regs =
  struct
  {
    var little_p = (get_endian == ENDIAN_LITTLE);
    
    BPF_Reg src @ !little_p * 4#b;
    BPF_Reg dst @ little_p * 4#b;
  };

This version, where the ordering of the fields is implemented using field labels, is not only more compact, but also has the virtue of not requiring additional "intermediate" fields like le and be above. It also shows how convenient it can be to declare variables inside structs.

Let's see it in action:

(poke) BPF_Insn_Regs @ 1#B
BPF_Insn_Regs {src=%r4,dst=%r5}
(poke) .set endian big
(poke) BPF_Insn_Regs @ 1#B
BPF_Insn_Regs {src=%r5,dst=%r4}
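
To make the arithmetic behind the field labels explicit, here is a hedged Python sketch of the same decoding. It is not Poke code, and decode_regs is a hypothetical helper introduced only for illustration. It assumes, as the session above suggests, that bit offsets count from the most significant bit of the byte, and it uses the byte 0x45 found at offset 1#B in the earlier dump.

```python
def decode_regs(byte, little):
    """Split one byte into (src, dst) 4-bit register codes.

    Mirrors the field labels of the struct: src sits at bit offset 0
    when the current endianness is little, and at bit offset 4 otherwise.
    """
    high = (byte >> 4) & 0xF  # field located at offset 0#b
    low = byte & 0xF          # field located at offset 4#b
    return (high, low) if little else (low, high)


print(decode_regs(0x45, little=True))   # (4, 5)
print(decode_regs(0x45, little=False))  # (5, 4)
```

The two results correspond to the src and dst values printed by poke before and after switching to big endian.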

Note the pretty printing of registers. This is achieved by having a pretty-printer method in the definition of BPF_Reg:

type BPF_Reg =
  struct
  {
   uint<4> code;

   fun _print = void:
   {
    print "%";
    if (code <= BPF_R9)
      printf "r%i32d", code;
    else
      print "fp";
   }
  };

Changing the current endianness in constraint expressions is useful when dealing with binary formats that specify the endianness of the data that follows using some sort of tag. This is the case of ELF, for example.

The first few bytes in an ELF header make up what is known as the e_ident. One of these bytes is called ei_data, and its value specifies the endianness of the data stored in the ELF file.

This is how we handle this in Poke:

fun elf_endian = (byte data) int:
 {
   /* Map the ei_data tag to the corresponding Poke endianness.  */
   if (data == ELFDATA2LSB)
     return ENDIAN_LITTLE;
   else
     return ENDIAN_BIG;
 }

[...]

type Elf64_Ehdr =
  struct
  {
    struct
    {
      byte[4] ei_mag : ei_mag[0] == 0x7fUB
                       && ei_mag[1] == 'E'
                       && ei_mag[2] == 'L'
                       && ei_mag[3] == 'F';
      byte ei_class;
      byte ei_data : (ei_data != ELFDATANONE
                      && set_endian (elf_endian (ei_data)));
      byte ei_version;
      byte ei_osabi;
      byte ei_abiversion;
      byte[6] ei_pad;
      offset<byte,B> ei_nident;
    } e_ident;

    [...]
  };

Note that set_endian returns an integer value, which is always 1. This makes it easy to use in field constraint expressions, where it can be chained with && without ever causing the constraint to fail.
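
The idea of selecting the byte order from ei_data before decoding the fields that follow can be mimicked outside of poke. The Python sketch below is not the way poke implements it. The function read_e_type and the two sample headers are hypothetical, made up for illustration. The ELFDATA* values and the position of ei_data and e_type are taken from the ELF specification.

```python
import struct

# Values of e_ident[EI_DATA] as defined by the ELF specification.
ELFDATANONE, ELFDATA2LSB, ELFDATA2MSB = 0, 1, 2


def read_e_type(header):
    """Return e_type from an ELF header, honoring the ei_data tag."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_data = header[5]  # e_ident[EI_DATA]
    if ei_data == ELFDATA2LSB:
        order = "<"
    elif ei_data == ELFDATA2MSB:
        order = ">"
    else:
        raise ValueError("invalid ei_data")
    # e_type is the 16-bit field right after the 16-byte e_ident.
    (e_type,) = struct.unpack_from(order + "H", header, 16)
    return e_type


# Two made-up 18-byte header prefixes that differ only in byte order.
le = b"\x7fELF" + bytes([2, ELFDATA2LSB, 1]) + bytes(9) + b"\x02\x00"
be = b"\x7fELF" + bytes([2, ELFDATA2MSB, 1]) + bytes(9) + b"\x00\x02"
print(read_e_type(le), read_e_type(be))  # 2 2
```

Both headers decode to the same e_type because the tag read from e_ident decides how the following bytes are interpreted, which is the role set_endian plays in the constraint above.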

Happy poking! :)
