Background

Introduction

This document attempts to lay out the concept behind, the context for, and a possible design for a tool we're calling 'Firehose'. The aim is that by the end of the first sections (concept, context, etc.) anyone interested in (and involved with) building systems using Baserock will understand what Firehose is and why we want it. By the end of the middle section, anyone regularly integrating new versions of software into Baserock systems will understand how they might use an implementation of Firehose to ease their workflow. By the end of the last section, anyone also familiar with how Baserock definitions are parsed should be capable of implementing a Firehose.

Integrating changes

The majority of time spent working on new kinds of systems using the Baserock tooling goes into integrating new software into the Baserock ecosystem. There is, however, an even larger body of work in integrating new versions of software into an extant Baserock system. This work is often undertaken by "integration engineers": they rarely write new code, but 'simply' take new versions and ensure that each new version can be brought into an extant system, and that the system works at least as well as, if not better than, before.

While the work of integration engineers can be supported by good use of CI/CD pipelines and effective automated testing, the actual integration work can still consume a lot of time -- time which could be better spent evaluating the results of that integration work instead.

The concept for Firehose

Firehose therefore seeks to provide a mechanism by which the mechanical integration work can be automated, leaving integration engineers to focus on the knottier problems of co-dependent changes and the critical work of evaluating the results of integrations to determine if they should be offered to the mainline of a project.

The goal for the Firehose project is to produce software which can mitigate the pre-evaluation workload for as wide a set of the inputs to a Baserock system as possible. A fully-functional system might involve upwards of six hundred components; if each component is updated on average once a month, and each mechanical integration step takes a human being 10 minutes including all associated paperwork, that is some 100 hours of integration workload per month -- nearly 5 hours per working day -- across an entire Baserock system.

Many projects do not update as often as once a month, and some update significantly more often. In an active development project you may want to be making updated releases of systems every day, or even every time a new commit hits master of a given piece of software.

Integrating automated integration

A little background on integrating components into Baserock Systems

In Baserock, software components (called chunks) are integrated into systems by way of their inclusion into aggregate components (called strata) and thence into full systems. In an ideal world, we would never need to modify the source code, build system, or default options for a chunk and all the integration work would exist solely in the definitions repository for the system in question. In addition, in our software utopia, upstream would never rebase or rework their branches and we could therefore rely on "once a commit exists, it'll always exist".

Unfortunately our world does not work like this. We often need to make very small (or occasionally large) patches to projects simply to get them to behave correctly. Sadly the projects most likely to fall into this particular pile are those at the very bottom of the software stack (compiler, libc, etc.) and those at the very top of the stack (custom applications from commercial third parties).

In some cases, upstreams are explicitly designed to require changes in order to be usefully integrated into systems. For example, if a project expects a configuration header file to be filled in to select compile-time behaviours, integration requires a commit which provides that header for your system.

Finally, there are always bugfixes which come about as part of the day-to-day work of integrating codebases together. These also sit on a branch, but may additionally be offered upstream, where they may be taken verbatim, dropped and rewritten, or even ignored.

All these cases will need to be handled by an integrator. Some are easier to automate than others, and Firehose aims to automate as many of them as possible.

What Firehose could do

The simplest case of taking an update into a Baserock system is merging one definitions change into another definitions branch. Firehose could address this, but frankly a human will very much want to be involved in upgrading the underlying system from one version of Baserock to another, and the mechanics of merging one system branch into another need little to no engineering effort anyway. As such, Firehose is not trying to address this case, despite it being semi-obvious low-hanging fruit.

Instead, the case Firehose can clearly look at is updating an externally supplied chunk (i.e. one not being developed in a system branch within the Baserock workflows) which has no code changes (either bugfixes or customisation) against its upstream. From here on, we'll refer to this kind of update as a 'simple sha to sha update', or 'shawise'.
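Mechanically, a shawise update is just repointing the chunk's ref in the relevant stratum. A minimal sketch in Python using PyYAML, assuming a stratum morphology whose 'chunks' entries carry 'ref' (the pinned SHA1) and 'unpetrify-ref' (the human-readable ref) fields:

import yaml

def shawise_update(stratum_path, chunk_name, new_sha, new_ref):
    # Load the stratum, repoint the named chunk, and write it back.
    with open(stratum_path) as f:
        stratum = yaml.safe_load(f)
    for chunk in stratum['chunks']:
        if chunk['name'] == chunk_name:
            chunk['ref'] = new_sha
            chunk['unpetrify-ref'] = new_ref
    with open(stratum_path, 'w') as f:
        yaml.safe_dump(stratum, f, default_flow_style=False)

Note that yaml.safe_dump discards comments and key ordering; a real implementation would want a round-tripping YAML library, as discussed under 'Implementation routes' below.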

Chunks whose repositories carry integration patches or bugfixes could be handled in a couple of ways. Firstly, Firehose could merge: either the local changes on top of the new upstream (merge-out), or the new upstream into the local code (merge-in). These produce subtly different trees, though the actual merge operation should be equivalent. Alternatively, Firehose could rebase the local changes on top of the new code (rebase). Rebasing is probably a good approach for rapidly changing components in an as-yet unreleased (or rarely released) system: it produces a less complex history and also offers the opportunity to drop patches which have become entirely redundant as Firehose goes.
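The three strategies differ only in which side the candidate branch starts from and which operation is applied. A minimal sketch, assuming a local clone with both refs already fetched; the candidate branch name is illustrative:

import subprocess

def git(*args, cwd):
    subprocess.check_call(['git'] + list(args), cwd=cwd)

def merge_out(repo, local_ref, upstream_ref):
    # merge-out: start from the new upstream, merge local changes onto it.
    git('checkout', '-B', 'firehose/candidate', upstream_ref, cwd=repo)
    git('merge', local_ref, cwd=repo)

def merge_in(repo, local_ref, upstream_ref):
    # merge-in: start from the local branch, merge the new upstream into it.
    git('checkout', '-B', 'firehose/candidate', local_ref, cwd=repo)
    git('merge', upstream_ref, cwd=repo)

def rebase(repo, local_ref, upstream_ref):
    # rebase: replay the local patches on top of the new upstream.
    git('checkout', '-B', 'firehose/candidate', local_ref, cwd=repo)
    git('rebase', upstream_ref, cwd=repo)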

Having performed one of these operations (shawise update, merge-in, merge-out or rebase), Firehose can produce an identity for the change, create a system branch, and push the content to that branch, offering it up for automated testing before the human integration engineers have to get involved.
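How the change identity is derived is an implementation detail; one hypothetical scheme is to hash the chunk name and target SHA1, so that re-running Firehose for the same update yields the same branch name:

import hashlib

def change_id(chunk_name, target_sha):
    # Stable identity: the same update always maps to the same branch.
    digest = hashlib.sha1(
        ('%s:%s' % (chunk_name, target_sha)).encode('utf-8')).hexdigest()
    return '%s-%s' % (chunk_name, digest[:8])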

What Firehose depends on in an infrastructure

Infrastructurally, Firehose can rely on as little as the git service of a Trove, since it requires somewhere to fetch upstream changes from, somewhere to fetch definitions branches from, and somewhere to push candidate branches to.

Firehose would, in an ideal world, be part of the Baserock infrastructure deployed as standard, either as part of Trove or as an add-on service such as CIAT. Again ideally, Firehose would be subscribed to the change feed from the git service, which could then trigger its operations; failing that, it should run periodically. Firehose itself requires very little in the way of local resources -- enough disk space to check out git repositories and perform merges, and enough network access to the git service and review service to acquire and push branches.

What is left for a human

All Firehose can do is the easy work. It cannot resolve conflicts, and it cannot decide whether a resultant system is actually good. Human integration engineers still get to do the real thinking parts of their current job -- the deep thought, the semantic quality judgements, and the evaluation -- while Firehose deals with the drudgery. At the end, a human engineer will also need to make the decision to trigger the merge of a change Firehose has proposed.

How Firehose might be built

This section attempts to provide a guideline for how an implementation of Firehose might come about. It is not a strict set of requirements and a specification, because there is still an element of research involved in Firehose. The proof-of-concept covered a single mode of operation for a single update method (all-for-one, shawise); the rest will need work.

Modes of operation

In a generalised sense, Firehose will take one or more inputs (chunks to update, and the methods by which to update them). Firehose will then perform these updates, analyse the resulting changes, and push commits to appropriate change branches. Ideally a CI system will then pick up the changes and take them further, eventually producing tested candidate branches for integration engineers to approve.

The input sets which Firehose deals with can be categorised in a number of ways:

All for One

The 'All for One' approach aggregates a number of updated chunks into a single change which is then offered for integration as a unit. This approach might be taken for groups of chunks such as the toolchain and libc, or perhaps the GNOME desktop.

Every Change for Itself

The 'Every Change for Itself' approach is a collation of chunk updates where each individual update is given its own change commit on its own branch. This means that each change can be evaluated independently and merged as soon as it is acceptable to do so. However, this approach cannot deal with mutually dependent updates.

Dependency Ordered Collation

The 'Dependency Ordered Collation' approach allows for a number of chunk updates to be brought together into a single "patch" which consists of a graph of commits which follow the dependency ordering. In the short term, this would then behave similarly to an all-for-one in the CI system, but later a more competent CI system could walk the graph and find the most complete change (nearest the tip) which passes the pipeline of tests.

This approach, like all-for-one, allows for resolving mutually dependent updates, while simultaneously offering the possibility of finding the best all-working subset of a number of pending changes -- though since only prefixes of the dependency-ordered chain can be selected, this only helps if the deepest change works.
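A sketch of the ordering step, under the assumption that each update's build-dependencies (restricted to chunks within the update set) are known; the function and data shapes are illustrative:

def dependency_order(updates):
    # 'updates' maps chunk name -> set of chunk names it build-depends on.
    ordered, placed = [], set()
    remaining = dict(updates)
    while remaining:
        ready = [c for c, deps in remaining.items() if deps <= placed]
        if not ready:
            raise ValueError('dependency cycle among updates')
        for chunk in sorted(ready):
            ordered.append(chunk)
            placed.add(chunk)
            del remaining[chunk]
    return ordered

# dependency_order({'glibc': set(), 'gcc': {'glibc'}, 'vim': {'glibc'}})
# returns ['glibc', 'gcc', 'vim']: one commit per update, in that order.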

Limited Power Set

The 'Limited Power Set' approach should not be used for large numbers of chunks, but can be valuable in finding possible approaches to solving mutually dependent changes which were previously not clear to the integration engineers. Given a set of N chunk updates, this approach will generate (2^N)-1 changes to hand to the CI system: one change for every non-empty subset of the updates. These are, trivially, the singleton subsets (as in 'Every Change for Itself'), the full set (as in 'All for One'), and an 'All for One' style change for every remaining non-empty proper subset.
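Generating the candidate set is trivial; a sketch, with names purely illustrative:

from itertools import chain, combinations

def candidate_changes(chunks):
    # Every non-empty subset of the updates becomes one candidate change.
    chunks = sorted(chunks)
    return [frozenset(s) for s in chain.from_iterable(
        combinations(chunks, r) for r in range(1, len(chunks) + 1))]

# candidate_changes(['gcc', 'glibc']) yields (2^2)-1 = 3 changes:
# {'gcc'}, {'glibc'}, and {'gcc', 'glibc'}.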

This approach produces a veritable spray of changes for consideration but could be used if none of the above approaches are solving a set of updates.

Inputs

Firehose's inputs probably all apply to a given definitions repository. It is debatable whether they should be limited to targeting a given system branch, but ultimately each considered set of chunk updates will be doing so, so for convenience we will consider that to be the case.

Each input to Firehose will be details of how to find out whether a given chunk has updates available. Even chunks built from the same source may behave differently in different strata, so Firehose will treat each as an independent element.

Each input will, at minimum, indicate which chunk it applies to, the update method it expects to use (shawise, merge-in, merge-out, or rebase), and the tracking point it uses. The tracking point could be a single ref in track-tip mode, or, in refs-tracking mode, a set of ref filters plus the transforms needed to order the refs which pass the filters. The former is good for following a stable branch such as cpython's 2.7 branch; the latter is good for a project which tags releases, such as vim.

Rough process plan

Firehose will read its input files and determine a set of updates to perform by examining the Trove's git repositories and the target branch of the definitions repository in question.
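A sketch of that decision step, assuming hypothetical helpers for reading the SHA1 currently pinned in the definitions branch and for resolving each input's tracked ref on the Trove:

def updates_needed(inputs, pinned_sha_of, latest_sha_of):
    # Yield (chunk, new_sha) for each input whose tracked upstream ref
    # has moved beyond the SHA1 pinned in the definitions branch.
    for spec in inputs:
        pinned = pinned_sha_of(spec['name'])
        latest = latest_sha_of(spec)
        if latest != pinned:
            yield spec['name'], latest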

For any updates where Firehose needs to do work (merge-in, merge-out, rebase) it will acquire the relevant repositories and prepare local copies of the changes.

Having determined an update set to perform, Firehose will create system branches for every change it has identified and will push a system branch for each change to the Trove. Where a system branch already exists, this indicates that Firehose's previous work was not merged (either due to timing or due to it failing to work). Firehose will push updated changes to those branches in the hope that work merged in the meantime has resolved the issue.
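A sketch of the push step; the branch naming scheme here is an assumption, not part of the design:

import subprocess

def push_candidate(defs_repo, change_id, commit):
    branch = 'refs/heads/firehose/%s' % change_id
    # Force-push so that an unmerged candidate from a previous run is
    # refreshed rather than rejected as a non-fast-forward.
    subprocess.check_call(
        ['git', 'push', '--force', 'origin', '%s:%s' % (commit, branch)],
        cwd=defs_repo)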

Ordering changes

When Firehose is trying to decide between various refs to merge from, there are a number of approaches it could take. It could attempt to determine the relative history depth of each ref, but since release tags often include commits not present on other tags (e.g. setting the release version, unsetting dev markers, etc.) this is very hard to do in a believably correct way.

Instead, Firehose will simply order the refs by name. This means that we need a stable sorting method for ref names and a way to transform arbitrary names from repositories into something which can be stably ordered.

In the prototype of Firehose we used the Debian version number comparison algorithm and a series of regular-expression-plus-replacement pairs to transform ref names for comparison. This worked reasonably well, and so we recommend the same be done for the real implementation.

This example is from Vim:

tracking:
  mode: refs
  filters:
    - ^refs/tags/v[0-9]-[0-9]
  # Turns vX-YaZ into vX-Y~aZ so that tags can be usefully ordered
  transforms:
    - match: v([0-9]-[0-9])([^-].*)
      replacement: \1~\2

It is not required that any future implementation use the same mechanism for ordering refs, but this approach has been shown to work for several repositories' schemes for naming tags.
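For concreteness, here is a sketch of how the filters and transforms above might drive the comparison, assuming the python-debian library (debian.debian_support) for the Debian version comparison algorithm:

import re
from debian.debian_support import Version

FILTERS = [re.compile(r'^refs/tags/v[0-9]-[0-9]')]
TRANSFORMS = [(re.compile(r'v([0-9]-[0-9])([^-].*)'), r'\1~\2')]

def candidates(refs):
    # Keep only the refs which match at least one filter.
    return [r for r in refs if any(f.match(r) for f in FILTERS)]

def sort_key(ref):
    name = ref[len('refs/tags/'):]
    for pattern, replacement in TRANSFORMS:
        name = pattern.sub(replacement, name)
    # 'v7-4a042' becomes '7-4~a042', which the Debian algorithm orders
    # before '7-4', so alpha tags sort below the release they precede.
    return Version(name.lstrip('v'))

refs = ['refs/tags/v7-4', 'refs/tags/v7-4a042', 'refs/heads/master']
latest = max(candidates(refs), key=sort_key)   # -> 'refs/tags/v7-4'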

Implementation routes

Firehose is currently implemented as a morph plugin, as that was the path of least resistance to gain access to the definition parsing and manipulation code.

A redesign would reduce complexity and improve composability by making use of libraries for caching git repositories, for managing trees of YAML documents, and for representing Baserock definitions in a way that allows them to be searched and manipulated, then written back to their source files.