Trebuchet diff (tbdiff)

Trebuchet diff is a tool that compares two POSIX directory trees (a and b) and generates a batch of commands to transform the directory tree a into b. It takes into account POSIX metadata, such as permissions and ownership.

Tbdiff also aims to define a standard format for describing differences between POSIX directory trees, but the specification of that format will only come after the tool has been tested to cover all the possible use cases.

POSIX Extended Attribute (xattr) support is currently mandatory, but it has only been tested with user-defined attributes (the user. namespace), so its interaction with SMACK and SELinux has not been tested.

Code

The git repository can be found on git.baserock.org.

ToDo

  • Test suite

    A basic bash-based test framework is in place. No bug should be fixed without adding a unit test that prevents the regression from recurring. The suite is by no means exhaustive, and tests are being added as bugs are discovered. It also doesn't cover all code paths according to gcov, so more tests are needed.

  • File system safety

    Multiple real file system trees have been tested successfully, so basic operation works. However, it has not been tested whether a maliciously crafted patch could damage a filesystem. A depth counter prevents a patch from leaving more directories than it has entered, which would otherwise let it escape the directory being patched, and symbolic links should not be dereferenced.

    In early testing it was possible to delete files in a directory pointed at by a symbolic link. This has been fixed, but other surprises may still be hiding.

  • Split stream parsing and filesystem operations

    Right now the stream parsing and the filesystem operations are both performed in one go. These operations should be performed separately, so that dry runs are possible.

  • Data format

    Currently data sizes follow the convention of the machine that created the image. It is not reasonable to assume that patches will only be created on the same architecture they are deployed to: a patch may be created on a powerful 64-bit server and deployed to a 32-bit device.

    Either a standard transmission format that can accommodate all systems should be used, or the patch should define what format it is using.

    There should be a defined struct per command, to make it easier to understand what goes onto the wire.

  • Abstract the stream object

    Right now FILE* is used as the stream object. However, abstracting the stream object is desirable so that, for example, an SSL socket can be the stream.
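
The depth counter mentioned under "File system safety" can be sketched as follows. This is only an illustration in Python, not tbdiff's actual code: the command names "enter" and "leave" and the function check_depth are hypothetical, since the real stream format is not described here.

```python
# Hypothetical command names, used for illustration only.
ENTER = "enter"
LEAVE = "leave"

def check_depth(commands):
    """Return True if the commands never leave more directories than they entered."""
    depth = 0
    for command in commands:
        if command == ENTER:
            depth += 1
        elif command == LEAVE:
            if depth == 0:
                # Leaving now would step above the directory being patched.
                return False
            depth -= 1
        # Any other command does not change the depth.
    return True
```

A stream such as enter, leave passes the check, while enter, leave, leave is rejected at the second leave, before any filesystem operation for it would run.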

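One way to address the "Data format" item is to fix the width and byte order of every field, so the layout no longer depends on the machine that wrote the patch. The sketch below uses Python's struct module; the record layout (a 1-byte command id, a 4-byte mode and an 8-byte size, little-endian) is an assumption made for illustration and is not the format tbdiff uses.

```python
import struct

# Hypothetical per-command header. "<" selects little-endian byte order with
# standard sizes and no padding, so the layout is identical on 32-bit and
# 64-bit machines: B = 1 byte, I = 4 bytes, Q = 8 bytes.
HEADER = struct.Struct("<BIQ")

def pack_header(command, mode, size):
    """Encode one header in the fixed, architecture-independent layout."""
    return HEADER.pack(command, mode, size)

def unpack_header(data):
    """Decode a header produced by pack_header."""
    return HEADER.unpack(data)

encoded = pack_header(1, 0o644, 2**40)
decoded = unpack_header(encoded)
```

By contrast, native-sized fields (for example a C long) are 4 bytes on some systems and 8 bytes on others, which is the kind of mismatch the item above describes.
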
Problems Encountered

  • tbdiff tries to create a directory which already exists due to this bug

Why not use rsync?

Rsync has a batch mode, which can be used to create patches that can be deployed later. However it has some shortcomings.

  • Quick check algorithm

    By default rsync uses the heuristic that if files have the same mtime and size then they are the same. During tests this turned out not to be a reasonable assumption: several files with different contents but the same size and mtime were found.

  • Ignore times

    Quick check can be disabled completely with the --ignore-times option, and every file encountered will be delta encoded (or sent completely if --whole-file is specified).

    However, this makes patch creation impractical. I left it running overnight and it had not finished 16 hours later.

  • Checksumming

    Checksumming may be used instead with the --checksum option, so the data of every file is examined. This is an expensive operation, which cannot be disabled by the receiver, causing deploy times to be on the order of 12 times slower than tbdiff.
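
The quick check weakness described above can be reproduced with a small sketch. It is a simplified model of the heuristic (same size and mtime means same file), not rsync's actual code; the helper names and the use of SHA-256 for the content comparison are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

def quick_check_same(a, b):
    """Simplified quick check: equal size and equal mtime."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and sa.st_mtime == sb.st_mtime

def content_same(a, b):
    """Compare files by a digest of their full contents."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).digest()
    return digest(a) == digest(b)

with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "a")
    b = os.path.join(d, "b")
    # Same length, different contents.
    for path, data in ((a, b"version 1"), (b, b"version 2")):
        with open(path, "wb") as f:
            f.write(data)
        # Force identical access and modification times on both files.
        os.utime(path, (1_000_000_000, 1_000_000_000))
    quick = quick_check_same(a, b)   # the heuristic treats the files as equal
    exact = content_same(a, b)       # the contents actually differ
```

Here quick is True while exact is False, which is the situation in which a size-and-mtime heuristic would skip a file that has changed.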

Speed comparison

tbdiff

           creation time   patch size   deploy time
    f-1    8:22            21M          0:10
    1-2    13:32           27M          0:23
    2-3    16:47           88M          0:24

rsync (quick check)

           creation time   patch size   deploy time
    f-1    1:52            1.9M         0:21
    1-2    1:04            30M          0:48
    2-3    1:33            59M          0:53

rsync (checksum)

           creation time   patch size   deploy time
    f-1    4:08            2.3M         2:06
    1-2    5:18            54M          3:25
    2-3    5:17            69M          4:09
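
As a worked check of the "on the order of 12 times slower" figure, the deploy times above can be turned into ratios. The sketch assumes that both columns use the same colon-separated notation with 60 minor units per major unit; parse_time is a hypothetical helper, and the result does not depend on what the actual units are.

```python
def parse_time(text):
    """Convert 'major:minor' into a count of minor units (60 per major unit)."""
    major, minor = text.split(":")
    return int(major) * 60 + int(minor)

# Deploy times copied from the tbdiff and rsync (checksum) tables.
deploy = {
    "f-1": {"tbdiff": "0:10", "rsync_checksum": "2:06"},
    "1-2": {"tbdiff": "0:23", "rsync_checksum": "3:25"},
    "2-3": {"tbdiff": "0:24", "rsync_checksum": "4:09"},
}

# How many times longer the rsync (checksum) deploy took for each patch.
ratios = {
    name: parse_time(times["rsync_checksum"]) / parse_time(times["tbdiff"])
    for name, times in deploy.items()
}
```

The ratios come out at roughly 12.6 for f-1, 8.9 for 1-2 and 10.4 for 2-3, so the factor of 12 corresponds to the first patch and the other two are somewhat lower.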