Maintaining a distbuild network on a Calxeda Highbank server

This guide is for administrators who are managing a Morph distributed build network on a Calxeda Highbank ARM server. Note that Calxeda have gone out of business: for new deployments we recommend a cluster of NVIDIA Jetson boards for distributed builds (or other hardware of your choice).

This guide assumes you have a Baserock build environment set up. If you have not done this already, follow the 'Preparation' section in 'HOW-TOs - Using distbuild for Baserock on ARM'.

Placeholders that are used in this guide:

- $trove -- hostname and trove-id of your Trove
- $infrastructure.git -- your infrastructure definitions repo. (This can be
    a keyed Morph URL, e.g. baserock:baserock/infrastructure.git, or a full
    URL like git://$trove/$trove/site/definitions.)
- $cxmanage -- hostname of a machine that has `ipmitool` installed and has
    network access to your Calxeda server
- $version -- a version label. This can be anything you want that is valid
    as a directory name. I recommend using the date, e.g. `2015-04-20`.
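
For example, you might set the placeholders as shell variables in the session where you run the commands in this guide. This is a minimal sketch: the hostnames are hypothetical, and `date +%Y-%m-%d` produces a label in the recommended date format.

```shell
#!/bin/bash
# Hypothetical values: replace these with your own hostnames.
trove=trove.example.com
cxmanage=cxmanage.example.com

# "$infrastructure.git" then expands to this URL with ".git" appended,
# because '.' cannot be part of a shell variable name.
infrastructure=git://$trove/$trove/site/definitions

# Date-based version label, e.g. 2015-04-20; valid as a directory name.
version=$(date +%Y-%m-%d)

echo "infrastructure repo: $infrastructure.git"
echo "version label: $version"
```

With these set, commands such as `ssh root@$trove` and `git clone $infrastructure.git` can be pasted as written.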

You will need an 'infrastructure' repository, forked from the Baserock reference system definitions. You may have this in your Trove as git://$trove/$trove/site/definitions.git.

Upgrading distbuild to the latest release

This describes how to upgrade your distbuild network to an imaginary 20.00 release of Baserock.

First, clone your infrastructure definitions:

git clone $infrastructure.git infrastructure

Merge in the latest tagged release of the Baserock reference systems from the reference systems Git repository.

cd infrastructure
git remote add upstream git://git.baserock.org/baserock/baserock/definitions.git
git remote update upstream
git merge --no-ff baserock-20.00
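
The commands above merge a release tag you name explicitly. If you are not sure which release is the latest, a sketch like the following picks the highest `baserock-*` tag fetched from upstream. It assumes release tags follow the `baserock-NN.NN` pattern used above and that GNU `sort -V` (version sort) is available; `latest_release` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read tag names on stdin and print the highest
# baserock-* release, using GNU version sort so 9.00 < 14.40 < 15.10.
latest_release() {
    grep '^baserock-[0-9]' | sort -V | tail -n 1
}

# Usage inside your infrastructure checkout, after `git remote update upstream`:
#   git merge --no-ff "$(git tag -l 'baserock-*' | latest_release)"
```

A plain lexical `sort` would rank `baserock-9.00` above `baserock-15.10`, which is why version sort is used here.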

Build the new version of systems/build-system-armv7lhf-highbank.morph using your distbuild network. You need to push your branch first, so the distbuild network can see it.

git push origin HEAD
morph distbuild systems/build-system-armv7lhf-highbank.morph --local-changes=ignore

Deploy the cluster for your distbuild network. For the purposes of this guide, I'll assume that the cluster is called distbuild-cluster.morph and that it contains a system named distbuild, deployed with extensions/distbuild-trove-nfsboot.write. This creates a new rootfs for each node rather than overwriting the running one, so you can keep using the distbuild network while the deployment completes.

morph upgrade distbuild-cluster.morph --local-changes=ignore \
    distbuild.VERSION_LABEL=$version

When you are ready, restart the distbuild nodes so that they boot into the updated system. This will stop any builds that are running, so it's a good idea to check first whether any are in progress:

morph distbuild-list-jobs

Restart all of the nodes on your distbuild network using ipmitool. First, run ssh root@$cxmanage to log in to the machine that has network access to your Calxeda server. Then check and reset the power of each node. Besides ipmitool chassis power reset, you can use ipmitool chassis power status, chassis power off and chassis power on; these are useful if not all of the nodes on the server are currently powered on, or if you don't know their status.

The example below assumes your nodes' IPMI interfaces are on the subnet 172.17.1.0, with the first node's at 172.17.1.3, the second node's at 172.17.1.7, and so on. It also assumes you are using 8 nodes (numbered 0 to 7).

for i in $(seq 0 7); do
    # Node i's IPMI interface is at 172.17.1.(4i + 3): .3, .7, .11, ...
    ipmi_address=172.17.1.$((i * 4 + 3))
    ipmitool -U admin -P admin -H $ipmi_address chassis power reset
done
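
If some nodes may be powered off, chassis power reset alone may not bring them up. The following sketch checks each node's power state first and powers it on if it is off, otherwise resets it. It keeps the address layout, node count and admin/admin credentials assumed above; the function names are hypothetical, and matching on the text `is off` assumes ipmitool's usual `Chassis Power is on` / `Chassis Power is off` status message.

```shell
#!/bin/bash
# IPMI address of node $1 (0-based): 172.17.1.3, 172.17.1.7, ...
ipmi_address() {
    echo "172.17.1.$(( $1 * 4 + 3 ))"
}

# Run ipmitool against node $1 with the remaining arguments.
node_ipmi() {
    local node=$1
    shift
    ipmitool -U admin -P admin -H "$(ipmi_address "$node")" "$@"
}

# Power on node $1 if it reports being off, otherwise reset it.
restart_node() {
    if node_ipmi "$1" chassis power status | grep -q 'is off'; then
        node_ipmi "$1" chassis power on
    else
        node_ipmi "$1" chassis power reset
    fi
}
```

On $cxmanage you would then run `for i in $(seq 0 7); do restart_node "$i"; done`.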

Roll back distbuild nodes to an older version

If you find that a new version does not work for some reason and you want to roll back to the previous version, you need to update the 'default' symlinks on the NFS server to point to the old version.

First, log in to the Trove (assuming you use Trove as your NFS server) with ssh root@$trove. Then set a shell variable listing the name of each node. The example below assumes your nodes are called node0, node1 and node2.

nodes="node0 node1 node2"

You can see the 'default' symlink for each node with this command. This tells you which directory each node will read its root filesystem from next time it boots.

cd /srv/nfsboot
for name in $nodes; do readlink $name/systems/default; done
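
To see which older versions are available to roll back to, a sketch like this prints each node's current 'default' target followed by the version directories beside it. It assumes the /srv/nfsboot/<node>/systems/<version> layout used in this section; `show_versions` is a hypothetical helper that takes the NFS root as its first argument so it can be tried on a scratch directory.

```shell
#!/bin/bash
# Usage: show_versions NFS_ROOT NODE...
show_versions() {
    local root=$1
    shift
    local name dir v
    for name in "$@"; do
        dir=$root/$name/systems
        printf '%s: default -> %s\n' "$name" "$(readlink "$dir/default")"
        for v in "$dir"/*/; do
            [ -d "$v" ] || continue   # the glob matched nothing
            v=${v%/}
            v=${v##*/}
            # Skip the 'default' symlink; list only real version directories
            [ "$v" = default ] || printf '  %s\n' "$v"
        done
    done
}
```

On the Trove this would be run as `show_versions /srv/nfsboot $nodes`, leaving $nodes unquoted so the list splits into separate node names.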

To make each node boot from an older system version, update its 'default' symlink. For example, to switch the nodes to the version with label '2015-01-01', do this:

cd /srv/nfsboot
for name in $nodes; do
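    # -n replaces the existing symlink instead of creating a new link inside
    # the directory it currently points to
    ln -sfn /srv/nfsboot/$name/systems/2015-01-01 $name/systems/default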
done
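
A slightly more defensive variant refuses to repoint a node whose requested version directory does not exist, so a typo in the label cannot leave a node pointing at a missing root filesystem. `set_default_version` is a hypothetical helper. It uses `ln -sfn`: the `-n` makes `ln` replace the existing 'default' symlink rather than following it and creating a new link inside the directory it points to.

```shell
#!/bin/bash
# Usage: set_default_version NFS_ROOT VERSION NODE...
# Returns non-zero if any node lacks the requested version.
set_default_version() {
    local root=$1 version=$2
    shift 2
    local name target rc=0
    for name in "$@"; do
        target=$root/$name/systems/$version
        if [ ! -d "$target" ]; then
            echo "$name: no such version: $target" >&2
            rc=1
            continue
        fi
        # -n: replace the 'default' symlink itself rather than
        # following it into the directory it currently points to
        ln -sfn "$target" "$root/$name/systems/default"
        echo "$name: default -> $(readlink "$root/$name/systems/default")"
    done
    return "$rc"
}
```

On the Trove this would be run as `set_default_version /srv/nfsboot 2015-01-01 $nodes`.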

You can then reboot the nodes as described above; as before, rebooting cancels any running builds.

Debugging distbuild

If you find that your distbuild network doesn't work, start by logging into the controller node and checking:

systemctl status distbuild-setup.service
systemctl status morph-controller.service
systemctl status morph-controller-helper.service
systemctl status morph-worker.service
systemctl status morph-worker-helper.service
systemctl status morph-cache-server.service

You can find full logs for these services in /var/log.
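
To get one line per service instead of reading each systemctl status output in turn, a sketch like the following prints the state that `systemctl is-active` reports for each of the services listed above, and returns non-zero if any of them is `failed`. `check_distbuild_services` is a hypothetical helper. If distbuild-setup.service is a one-shot unit, it may legitimately report `inactive` once it has finished.

```shell
#!/bin/bash
# Print "<unit> <state>" for each distbuild service; return 1 if any failed.
check_distbuild_services() {
    local svc state rc=0
    for svc in distbuild-setup morph-controller morph-controller-helper \
               morph-worker morph-worker-helper morph-cache-server; do
        # is-active prints the unit state (active, inactive, failed, ...)
        state=$(systemctl is-active "$svc.service")
        printf '%-32s %s\n' "$svc.service" "$state"
        [ "$state" = failed ] && rc=1
    done
    return "$rc"
}
```

Any unit reported as `failed` is a good place to start with systemctl status and the logs in /var/log.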