
Running a cross-architecture Nomad cluster


I wanted to play around with some projects on RISC-V, so I got a VisionFive 2 and tried to add it to my Nomad mini cluster. Surprisingly, nothing worked on this experimental platform!

First, Nomad does not provide builds for RISC-V, and it does not build from source either: it has a transitive dependency on boltdb, which itself does not build for RISC-V.

When building Nomad, we need to disable a subset of boltdb's features. We can do so by using Elara's fork of raft-boltdb, via a replace directive in go.mod:

replace github.com/hashicorp/raft-boltdb/v2 => github.com/Elara6331/raft-boltdb/v2 v2.0.0-20230729002801-1a3bff1d87a7
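One way to add that directive (a sketch, assuming you are in a checkout of the Nomad repository) is with go mod edit:

# Add the replace directive to go.mod and re-resolve the module graph
go mod edit -replace github.com/hashicorp/raft-boltdb/v2=github.com/Elara6331/raft-boltdb/v2@v2.0.0-20230729002801-1a3bff1d87a7
go mod tidy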

And then we can cross-compile with:

make \
    GOOS=linux \
    GOARCH=riscv64 \
    CC=/usr/bin/riscv64-linux-gnu-gcc \
    CXX=/usr/bin/riscv64-linux-gnu-c++ \
    AR=/usr/bin/riscv64-linux-gnu-gcc-ar \
    dev
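Before shipping the binary to the board, it's worth sanity-checking the architecture; assuming make dev dropped it under bin/, file should report a RISC-V ELF:

# Expect something like: ELF 64-bit LSB executable, UCB RISC-V ...
file bin/nomad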

Though, when running Nomad, I kept having issues with the raw_exec driver. Take this simple batch job:

job "unamer" {
  type = "batch"

  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "riscv64"
  }
  task "uname-task" {
    driver = "raw_exec"
    config {
      command = "/bin/uname"
      args = ["-a"]
    }
  }
}
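Submitting it follows the usual flow (a sketch; unamer.nomad.hcl is whichever filename you saved the job under):

# Submit the job and see where it lands
nomad job run unamer.nomad.hcl
nomad job status unamer

# Inspect the failing allocation; the ID comes from the status output
nomad alloc status <alloc-id>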

The allocation failed, with the client complaining about being unable to manage CPU slices via cgroups:

2023-11-12T14:35:14.194Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4744f0ea-7616-ab3a-a420-8e25056e3861 task=envtask type="Setup Failure" msg="failed to setup alloc: pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus: no such file or directory" failed=true
2023-11-12T14:35:14.194Z [ERROR] client.alloc_runner: prerun failed: alloc_id=4744f0ea-7616-ab3a-a420-8e25056e3861 error="pre-run hook \"cpuparts_hook\" failed: open /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus: no such file or directory"

Comparing with other, better-behaved servers, I found that the cpuset controller was not enabled in /sys/fs/cgroup/cgroup.subtree_control:

user@starfive:~$ cat /sys/fs/cgroup/cgroup.subtree_control 
cpu
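For reference, in cgroup v2 a controller can be enabled for child groups by writing to cgroup.subtree_control, but only if the kernel advertises it in cgroup.controllers in the first place; a sketch of the general mechanism, which is exactly what fails here:

# Controllers the kernel makes available at the root
cat /sys/fs/cgroup/cgroup.controllers

# Enable cpuset for child cgroups; this fails if the controller is missing above
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control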

Turns out that the default config for StarFive's current 5.15 (ew) kernel does not have CONFIG_CPUSETS enabled, which is what provides the cpuset controller. I've sent a PR to StarFive to update these defaults, but it's not been merged.
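You can confirm the missing option on a running system, assuming the kernel exposes its build config (CONFIG_IKCONFIG_PROC, or a config file shipped in /boot):

# Either of these should print CONFIG_CPUSETS=y on a kernel with cpuset support
zcat /proc/config.gz | grep CPUSETS
grep CPUSETS /boot/config-$(uname -r)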

I've also sent a PR to Nomad to disable cgroups when the required controllers are unavailable.

I then moved to cwt's Arch image, which has the cgroup controllers enabled. This was mostly because I spent hours trying to cross-compile a kernel following the instructions provided by VisionFive, but none of the resulting kernels were bootable.

user@starfive:~$ cat /sys/fs/cgroup/cgroup.subtree_control 
cpu cpuset io memory pids

With a working cpuset controller, I scheduled a Docker container on the node:

job "docker-test" {
  type = "batch"

  constraint {
    attribute = "${attr.cpu.arch}"
    value     = "riscv64"
  }
  task "uname-task" {
    driver = "docker"
    config {
      image = "hello-world"
    }
  }
}
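As before, a quick run confirms it works (sketch; docker-test.nomad.hcl is the assumed filename, and the alloc ID comes from the job status output):

nomad job run docker-test.nomad.hcl

# The task's stdout should contain Docker's "Hello from Docker!" greeting
nomad alloc logs <alloc-id> uname-task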

And now I can finally run GitHub Actions on it, via github-act-runner.