Okay fine, it’s 130ish lines, but a decent number of them are just entrypoint glue and some small helper predicates.

Let’s walk through it together.

Has this ever happened to you: You need to write a parser for a binary file format, but it’s total drudgery to manually write out all the byte reading/shifting stuff and structs, and blah blah. Not only that, but once you’ve written your parser, you may still need to write an encoder as well, which is double the work.

My claim is that using a declarative, logical language like Prolog solves both of those problems (as long as you stick to the pure subset!) Occasionally I see solutions out there for this problem like Kaitai Struct, QuickBMS, or whatever, but really in my heart I know Prolog beats them all. It’s a full blown programming language that just happens to work really well for this kind of thing.

We’re going to declaratively define a parser for the tar format, and once we’re done, our code will not only extract .tar files, but create them as well.

In Prolog, there’s a feature called DCGs, or Definite Clause Grammars, which are basically a way to describe formats and parse them. In our case, we’re going to be using them to parse the tar spec as found here.

Say we wanted to describe a format where it’s 2 of the letter a, any number of b, then two more as. It’s a useless “format” but it will explain how these things work:

And we can then query Prolog and see that the list [a,a,b,a,a] does indeed match this format:

?- phrase(our_format, [a,a,b,a,a]).
true .

What’s interesting also, is that not only can it validate that the input fits the format, it can also generate new examples for us:

?- phrase(our_format, Example).
Example = [a, a, a, a] ;
Example = [a, a, b, a, a] ;
Example = [a, a, b, b, a, a] ;
Example = [a, a, b, b, b, a, a] ;
Example = [a, a, b, b, b, b, a, a] ;
Example = [a, a, b, b, b, b, b, a, a] .

Cool. Those are all valid examples of the format. This principle applies for whatever you define, as long as you’re using the pure logical subset of Prolog (which we will be doing).

Just for example again, let’s say we had a format that’s 2 of the letter a, any value, and then 2 more as.

This time you’ll see that we can actually extract out the value:

?- phrase(our_format(Val), [a,a,20,a,a]).
Val = 20.

And of course it can “encode” the value as well:

?- phrase(our_format(20), Encoded).
Encoded = [a, a, 20, a, a].

So that’s DCGs. Let’s look at the tar file format spec and start tackling it.

According to this (not very well written) spec, a tar file is basically a series of 512-byte blocks: a header that contains some metadata, and then some number of 512-byte blocks that are the file’s content, padded if necessary, then more headers if there’s more files, and so on. Finally there are 2 blocks of just all zeros to signal the end of the stream. So a file with just “hello” in it will consist of:

  1. One header block
  2. One 512-byte data block padded with zeros
  3. Two empty blocks of all zeros

The header block looks like this:

Offset Size Description
0 100 File name
100 8 File mode
108 8 Owner’s numeric user ID
116 8 Group’s numeric user ID
124 12 File size in bytes (octal base)
136 12 Last modification time in numeric Unix time format (octal)
148 8 Checksum for header record
156 1 Type flag
157 100 Name of linked file
257 6 UStar indicator “ustar” then NUL
263 2 UStar version “00”
265 32 Owner user name
297 32 Owner group name
329 8 Device major number
337 8 Device minor number
345 155 Filename prefix

and the spec notes that

The only trick is, that file size is not stored in binary, rather in an ASCII octal string. For example 1025 is stored as ‘000000002001’.

Bizarre, but that’s the way it is. Let’s define the header first, and then move onto whatever else we need.

, [0], "00", string(32, OwnerUserName), string(32, OwnerGroupName), u64(DeviceMajorNumber), u64(DeviceMinorNumber), string(155, FilenamePrefix), padding(12). % Seems to be some extension thing

Note that this definition looks very similar to the spec itself– We’re not writing any sort of instructions or logic, it’s just a plain description of the format of our data. One thing that’s slightly different from our previous examples, is that we’re using curly brackets. Whatever is inside of curly brackets is just normal Prolog code (not DCG) so all it’s saying here is that the TypeFormat byte will always be 48. We could also have written u8(48) but I left it like this for consistency.

The spec also doesn’t mention a 12-byte extension that seems to exist in files I created using bsdtar. We just treat it as padding and ignore it.

In order to try this out, we need to implement the helper predicates it uses, which are string//2, u8//1, u64//1, octal_number//1, and padding//1.

u8(N) --> [N], {N in 0..255}.

u32(N) --> [B1, B2, B3, B4],
           {N in 0..4294967295,
            [B1, B2, B3, B4] ins 0..255,
            N #= (B4 << 24) + (B3 << 16) + (B2 << 8) + B1}.
u64(N) --> u32(DW1), u32(DW2),
           {N in 0..18446744073709551615,
            N #= (DW2 << 32) + DW1}.

These predicates just describe unsigned integers– that is, they define 1 byte, 4 bytes, and 8 bytes respectively, and just do some bit-shifting to turn the bytes into values that we can work with.

For example, here’s u32 doing both encoding and decoding.

?- phrase(u32(3131313), Bytes).
Bytes = [177, 199, 47, 0].

?- phrase(u32(N), [177, 199, 47, 0]).
N = 3131313.

Strings in the tar format are fixed-width, and padded at the end with zeros. So, we need to define a predicate that describes a string of a certain length, and strip off the trailing zeros.

We state the following rules:

  • Bytes has length Len.
  • We define two lists, Str0 and Zeros and say that concatenated, they are known as Bytes.
  • Zeros is all zeros.
  • Str0‘s elements are all different than zero.

Then, we just say that this predicate describes Bytes with seq//1. From these rules, Prolog does the work of splitting Bytes into Str0 and Zeros, as we only want the actual string content.

Pretty similar to the bs example, except padding just describes some number of zeros. The first rule states that padding of length zero is nothing, and padding with length is one zero, followed by a recursive call for more.

Now for the annoying one– we need to write a predicate that converts an octal string into a regular number. Why? Because tar demands it.

The first predicate relates each ASCII code to its numeric value (a number from 0 to 7, because octal is base 8) and then recursively says that each digit is “shifted” right by multiplying by 8.

Here’s how you’d implement it in Python:

The thing is, these numbers are in little endian, so the least significant byte comes first, and we want to process the numbers starting with the most significant byte. So, we can use a wrapper that just reverses the list first. Also very annoyingly, the tar files generated by bsdtar (and I think GNU as well) contain random spaces at the end of the numbers which isn’t in the spec, so we have to deal with that by tearing off 0x20 if it appears in the list first. That’s pretty much all the second predicate is doing.

Finally, we wrap the whole thing in a DCG so we can call it from our existing tar header grammar, and that’s pretty much it– we can officially parse tar headers.

?- phrase_from_file((tar_header(H), ...), "out2.tar", [type(binary)]).
H = header("doc1.txt", 9066805720002608, 9060182779768880, 9060182779768880, 21, 1673997133, 2305900399475831088, 48, "", "", "", 9060182779768880, 9060182779768880, "") .

Look at that, "doc1.txt". And here’s the cool thing, we can actually encode it back into bytes:

?- phrase_from_file((tar_header(H), ...), "out2.tar", [type(binary)]),
   phrase(tar_header(H), Bytes).

H = header("doc1.txt", 9066805720002608, 9060182779768880, 9060182779768880, 21, 1673997133, 2305900399475831088, 48, "", "", "", 9060182779768880, 9060182779768880, ""),
Bytes = [100, 111, 99, 49, 46, 116, 120, 116, 0|...]

Now we just need to extract the data and job well done.

First let’s define a predicate for describing data blocks. Remember that each block is just 512 bytes:

We should also define a predicate for describing an arbitrary number of data blocks.

Same as the padding//1 predicate really, except we’re describing tar data blocks instead of zeros.

From here, we have everything we need to define a tar “entry”– or, a single file inside a tar archive.

A tar entry consists of:

  • A tar header
  • Some number of data blocks

So, we say, get the tar header, calculate the number of blocks needed using this formula, and read that many data blocks using tar_data_blocks.

Once it’s got the blocks it concatenates them all and removes the padding zeros using append//2.

Of course, I’m using “procedural” language here like “get” and “read” but in reality, this code is just “describing” a grammar and can work both ways– it can encode and decode.

At this point, we can describe an entire tar file:

That’s it! A tar file is any number of tar_entry followed by 1024 bytes of padding.

I wrote some glue to make it callable from a shell and:

$ ./tar.pl cf my-cool-tar.tar doc1.txt doc2.txt vorticelli.wav

$ tar tf my-cool-tar.tar

$ rm doc1.txt doc2.txt vorticelli.wav

$ ./tar.pl xf my-cool-tar.tar

$ ls
doc1.txt  doc2.txt  my-cool-tar.tar  tar.pl  vorticelli.wav

Ain’t that just neato?

It’s unfortunately absolutely dog slow. I think it has to do with SWI-Prolog’s underlying representation of the linked lists. Even just reading a 200k file into memory causes the runtime to error out, so I had to set a flag that expands the available memory to use– even just reading the file causes seconds of wait.

I’ve heard however that Scryer Prolog uses a more efficient internal representation, which might work better but I haven’t tried it out yet.

Also, this is a really simple implementation that doesn’t deal with file paths, permissions, user IDs, etc.

Read More

By |2023-01-18T22:22:12+00:00January 18th, 2023|Education|0 Comments

Leave A Comment