What do we really require from reproducibility and traceability?

We have set design goals for Baserock around these two words, but are we meeting them?

Traceability

For us traceability means that we can

  • A reliably establish precisely what software is running in a given target foo, so that
  • B no time is wasted working with the wrong versions of software components
  • C problems and differences in behaviour can be isolated as effectively as possible
  • D we can efficiently study the known software, read and build its source (if source is available), run it locally, debug it, and modify it.

Note that

  • In real-world projects, lots of time is wasted due to version uncertainty and differences
  • Ideally we need to be able to trace every line of code for every component in a system including all build-time dependencies, runtime dependencies, configuration, environment and the whole OS.
  • For inputs delivered as binary (eg 3rd party proprietary IP) we believe we need to trace the exact version of the overall package and be able to identify, examine and copy all component files, and know their locations and configurations on the target.
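For binary deliveries, the file-level identification described above can be as simple as recording a manifest of paths, sizes and checksums at the point of delivery. A minimal sketch (not part of Baserock, purely illustrative):

```python
import hashlib
import os

def manifest(root):
    """Walk a delivered package tree and record each file's relative
    path, size and SHA-256 digest, so the exact contents can later be
    identified and compared against what is on the target."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            rel = os.path.relpath(path, root)
            entries[rel] = {'size': os.path.getsize(path),
                            'sha256': digest}
    return entries
```

Comparing two such manifests is then enough to tell whether a target's binary components match the delivered versions, without needing their source.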

Reproducibility

For us reproducibility means that we can

  • E reliably recreate a local version of foo, equivalent to an existing target version of foo (that may be in production, or on a customer site, or just archived), so that
  • F we can use our local version to check behaviour and reproduce problems seen on the target version, even if we can’t access the target foo directly
  • G we can modify our local version, address the problems and verify that our modified local version is better than the target version
  • H we can use our work as a basis for creating a new improved foo for delivery or deployment into production, either replacing or upgrading the existing target version of foo.

Note that

  • G and H are not strictly ‘reproducibility’ requirements, but they are real-world requirements, and E and F alone are not really worth very much
  • Being able to retrieve foo from backups may satisfy E and F, but is not normally sufficient for G and H

We believe that recreation of a modifiable local version of foo requires all of the source for foo, and the toolset and environment that was used to build it.

Arguably we only need working instances of the tools and environment. We don’t strictly need the source to reconstruct all of them too, but the design of Baserock makes it possible to do so.

It is not clear whether we need to be able to recreate artifacts bit-for-bit. As of now, Baserock cannot.

These requirements in themselves do not imply any forward or backward compatibility. So reproducibility and its implications do not force the need to be able to build an N+1 release of foo using the same toolset that was used to build release N.

In light of the above, note the following issues relating to our current approach, and some perceived requirements from customers and users.

Baserock Traceability

  • All source (and binaries) are held on Trove. All systems include metadata defining the specific commit for each source component.
  • We are successfully satisfying A (known software version)
  • We mostly satisfy B (not waste time), but bugs in the morph workflow can break this
  • We mostly satisfy C (isolate behaviour differences) but reliability problems (eg cache corruption) can break this
  • We mostly satisfy D (work efficiently with known version) for x86, but our process for other architectures is inefficient

Baserock Reproducibility

We can inspect the metadata on /baserock to satisfy E, F, G, H provided that

  • our repos are intact and available
  • and we have a machine to run morph in
  • and the machine is capable of building for the target architecture
  • and all other information about the target is captured (eg deployment configuration etc)

The metadata contains information such as which system was built, from which git repository, using which ref...
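As an illustration of how that metadata enables reproduction, here is a sketch of reading one such entry. The field names and values shown are assumptions for illustration; the real schema of the files under /baserock depends on the morph version that deployed the system:

```python
import json

# Illustrative only: /baserock holds a set of JSON metadata files,
# and these field names and values are invented for this sketch.
meta = json.loads('''
{
  "system": "devel-system-x86_64",
  "repo": "git://git.baserock.org/baserock/baserock/definitions",
  "ref": "master",
  "sha1": "0123456789abcdef0123456789abcdef01234567"
}
''')

# With this we can reconstruct the build: clone the definitions repo,
# check out the exact commit, and rebuild the named system with morph.
print('git clone %s' % meta['repo'])
print('git checkout %s' % meta['sha1'])
```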

Limitations with the Baserock solution

  • it requires a Trove. Troves are rather complex and require lots of storage
  • you need to have your own Trove, or push access to git.baserock.org
  • getting new components into a Trove is rather inefficient, with several friction points
    • manual creation of lorries
    • needs knowledge of gitano
    • needs keys setup
    • approval process
  • Trove/lorrying clashes with the ‘normal’ package manager approaches of NPM, PIP, RubyGems (let’s call them NPGs) etc.

NPG Support in Baserock

Because of our design goals of traceability and reproducibility we would prefer not to support NPGs in their normal development modes. However, we have committed to a proper consideration of how to address the NPG situation.

Option 0

We already have a Baserock web devel system with support for Ruby, Node, Python and their package managers. It is possible to install packages on this system in the ‘normal’ way for each toolset. However

  • there is no traceability for this approach - we are installing code from the ‘wild internet’
  • if the package manager host goes down, we are stuck
  • the resulting systems are no longer reproducible, since they have been post-processed using the untraceable software.

As a mitigation for traceability, it is possible to create (private) mirrors of the package manager hosts.

Option 1

NPGs all support some mechanism for ‘freezing’, ie installing all source including dependencies into the tree of the target application. We could enable these three tools in the Baserock devel system and encourage that mode of working as a standard workflow, with the frozen source committed to git.

Assuming the freezing does not affect anything outside of the tree, we can avoid tainting our devel system, and Git will keep things tidy. This should be relatively productive and low-friction. But

  • source history for dependencies is lost
  • we’re forking the chosen component by dumping its dependencies’ sources into our branch
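One small mitigation for the lost history is to at least record exactly which versions were frozen into the tree. A sketch, assuming an npm-style node_modules layout (the layout and one-level-deep scan are simplifying assumptions; real trees can nest):

```python
import json
import os

def frozen_versions(node_modules):
    """List name/version pairs for dependencies vendored under a
    node_modules directory, by reading each package's package.json.
    Only scans one level deep; real dependency trees can nest."""
    found = {}
    for entry in sorted(os.listdir(node_modules)):
        pkg_json = os.path.join(node_modules, entry, 'package.json')
        if os.path.exists(pkg_json):
            with open(pkg_json) as f:
                pkg = json.load(f)
            found[pkg['name']] = pkg['version']
    return found
```

Committing such a listing alongside the frozen source at least documents what was vendored, even though the upstream history itself is still lost.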

Option 2

Make morph aware of the dependency definitions for each NPG. This requires coding for each tool, but the basic approach would be:

  • parse the manifest to get the list of dependencies and their versions
  • for each dependency (and its dependencies, all the way down)
    • find the upstream repo (NPM specifies it; Gems may include it)
    • create a chunk based on that repo
    • set the ref to the version from the manifest (assuming it’s tagged)
    • for locked-down users, generate lorry file(s) and lorry the dependencies
  • create a stratum containing all the dependency chunks (either from upstream or from the Trove)
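The steps above can be sketched roughly as follows. This only walks the top-level dependencies of an npm-style manifest (a real implementation would recurse), and the chunk/stratum fields, the Trove repo prefix and the `v<version>` tag convention are all assumptions for illustration, not real morph syntax:

```python
import json

def chunks_from_manifest(manifest_text, trove_prefix='upstream:'):
    """Sketch of the Option 2 flow for an npm-style manifest: read the
    dependency list and emit one chunk definition per dependency, with
    the ref set to the pinned version tag."""
    manifest = json.loads(manifest_text)
    chunks = []
    for name, version in sorted(manifest.get('dependencies', {}).items()):
        chunks.append({
            'name': name,
            # Assumption: the Trove mirrors each upstream repo under a prefix.
            'repo': trove_prefix + name,
            # Assumption: upstream tags releases as v<version>.
            'ref': 'v' + version,
        })
    return {'name': manifest['name'] + '-deps',
            'kind': 'stratum',
            'chunks': chunks}
```

Even this toy version shows where the friction points are: finding the real upstream repo and a matching tag for each dependency is exactly the part that may need manual research.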

This is more in line with the mainline Baserock workflow approach but

  • there is more friction for developers
  • provenance of some dependencies (eg some Gems) will require manual googling
  • there may be circular dependencies involved when constructing packages from upstream repos.
  • it may go against the wishes of the communities who have developed and adopted their own packaging system: a unified packaging and distribution system independent of source repos allows them freedom to move, break and generally do what they want with their source repos.