

Messing up backups

This is the story of how I managed to trash my server, all my VMs, and two databases while upgrading Proxmox during a boring Sunday afternoon.

Upgrading the host

A few weeks ago Proxmox 6.0 was released and I decided to upgrade; this included a stretch -> buster upgrade.
I was quite confident that the upgrade was going to be successful, as I have upgraded this installation all the way from Wheezy... oh wow, it's been a long time.

The upgrade didn't go quite right: the kernel would silently hang after a reboot, with nothing to go on. Trying to boot previous kernels didn't help, and after two days of fighting in a chroot I opted to simply reinstall the system without taking any precautions; I was confident in my backups.

The mistake

Turns out that, while the data itself was safe, Proxmox's configuration lives on an in-memory filesystem mounted on /etc/pve, and my backup script calls rsync with -x (--one-file-system).
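
To illustrate the failure mode (this is not my actual backup script, just the behaviour of the flag): with -x, rsync stays within one filesystem, so the pmxcfs mount at /etc/pve is silently skipped unless it is listed as its own source:

# with --one-file-system, anything mounted below / (like /etc/pve) never makes it into the backup
rsync -a -x / /backup/host/
# the fix: back it up explicitly as a separate source
rsync -a /etc/pve/ /backup/host/etc/pve/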

The files that live within /etc/pve are purely metadata about the containers, such as which storage is used, number of cores, memory size, VLANs, and mounts.

While losing this metadata was quite annoying, it was not the end of the world, as all of the containers had been created with some ansible playbooks a few years ago:

commit ab5015c7cd11a31c7a7159a0384c627962ff6439
Author: David
Date:   Sun Dec 18 20:53:07 2016 -0300

    init dns container

A small upside?

For now, to keep this from happening again, I've added /etc/pve to the list of filesystems to back up and moved the creation of VMs/containers to ansible as well; an example snippet:

- hostname: web
  disk_size: 8
  cores: 4
  memory: 2048
  interfaces:
    - name: eth0
      gw: 192.168.20.1
      ip: 192.168.20.114/24
      bridge: vmbr20
  mounts: '{"mp0":"/storage/ownclouddata,mp=/ownclouddata"}'

Having the metadata in a static representation has the (unintended) side effect that static analysis also becomes easier.
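
For example, with the definitions sitting in a plain YAML file, questions like "how many containers are defined?" or "how much memory have I handed out?" become one-liners (vms.yml is a hypothetical file name, not necessarily my repo layout):

# count defined containers
grep -c 'hostname:' vms.yml
# sum the memory requested across all of them
awk '/memory:/ { total += $2 } END { print total " MiB" }' vms.yml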

Re-creating the VMs is then a simple call to the proxmox module in ansible, in a loop:

- name: create
  proxmox:
    node: "bigserver"
    api_user: "{{ api_user }}"
    api_password: "{{ api_password }}"
    hostname: "{{ item.hostname }}"
    storage: "storage"
    cpus: "1" # numa nodes
    pubkey: "{{ pubkey }}"
    ostemplate: "{{ item.template | default(_template) }}"
    unprivileged: "{{ item.unprivileged | default('yes') }}"
    cores: "{{ item.cores | default(1) }}"
    memory: "{{ item.memory | default(2048) }}"
    onboot: "{{ item.onboot | default(1) }}"
    disk: "{{ item.disk_size | default(3) }}"
    netif: "{{ lookup('proxmox_interface_format', item.interfaces) }}"
    state: present
  tags: [lxc_setup]
  loop: '{{ vms }}'

Restoring data

Once the VMs were re-created, I had to recover data for a few stateful containers. All of the data was still accessible from the host, as the containers' filesystems are ZFS subvolumes and those remained intact.
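
Finding the old data on the host is just a matter of looking under the pool, roughly like this (the dataset names match the ones that show up further down, but treat the exact paths as illustrative):

# list the container subvolumes on the proxmox host
zfs list -r tank/proxmox-images
# each subvolume is mounted under the pool, e.g. the old database container:
ls /tank/proxmox-images/subvol-105-disk-1/var/lib/mysql/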

InfluxDB

Restoring influx data was quite easy:

  • install influx
  • stop influx
  • overwrite /var/lib/influxdb/{data,wal} (sketched below)
  • run restore command
  • start influx
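
The overwrite step boils down to something like this, run inside the new container (the subvolume path is an assumption; bigserver is the proxmox host):

# stop the service before touching its data directories
systemctl stop influxdb
# pull data and WAL from the old container's subvolume on the host
rsync -a root@bigserver:/tank/proxmox-images/subvol-104-disk-1/var/lib/influxdb/data/ /var/lib/influxdb/data/
rsync -a root@bigserver:/tank/proxmox-images/subvol-104-disk-1/var/lib/influxdb/wal/ /var/lib/influxdb/wal/
chown -R influxdb:influxdb /var/lib/influxdb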

The restore command (which rebuilds the TSI index from the copied data) was:

root@db:~# sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data/ -waldir /var/lib/influxdb/wal/

Gogs

Restoring Gogs was also quite trivial; I only had to restore a few files (a rough sketch follows the list):

  • sqlite database
  • gogs daemon config
  • repositories
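
In practice that was a handful of copies, along these lines (the subvolume path and the service name are assumptions, and the paths assume a default-ish Gogs layout under the git user's home):

# OLD_ROOT points at the old container's filesystem on the host
OLD_ROOT=root@bigserver:/tank/proxmox-images/subvol-106-disk-1
systemctl stop gogs
# sqlite database, daemon config and the bare repositories
rsync -a "$OLD_ROOT/home/git/gogs/data/gogs.db" /home/git/gogs/data/
rsync -a "$OLD_ROOT/home/git/gogs/custom/conf/app.ini" /home/git/gogs/custom/conf/
rsync -a "$OLD_ROOT/home/git/gogs-repositories/" /home/git/gogs-repositories/
systemctl start gogs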

MySQL

Restoring MySQL was a disaster... by this point it was well past midnight and I made a grave mistake: I copied the brand-new (empty) metadata files over the original metadata files, making the problem much worse.

With information from multiple sources I managed to re-generate the metadata (.frm files); these are the commands I ran, as taken from my shell history.

 # make a local copy of the data to work on
 2005  scp -r root@bigserver:/tank/proxmox-images/subvol-105-disk-1/var/lib/mysql/owncloud/ .
 # to run mysqlfrm you need to have the mysql binaries installed locally
 2013  sudo apt install mysql-client mysqld
 # run a test to see the output
 2021  mysqlfrm --server=root:root@db owncloud:oc_accounts.frm --port=3307
 # this looks fine; simply outputs the `CREATE TABLE` commands

 # generate table schema for all tables
 2029  for f in *.frm; do mysqlfrm --server=root:root@db owncloud:"$f" --port=3308 >> results.sql; echo $f; done
 # 2 tables failed randomly -- re-running the command fixed it
 2031  mysqlfrm --server=root:root@db owncloud:oc_properties.frm --port=3308 >> results.sql 
 2032  mysqlfrm --server=root:root@db owncloud:oc_retention.frm --port=3308 >> results.sql 
 # To make this valid SQL I had to add a few missing ;
 2033  sed 's/COMPRESSED$/COMPRESSED;/' results.sql > rr.sql
 # Import the sql file to create the tables
 2036  mysql -u root -proot -h db owncloud < rr.sql
 # Discard the newly created (empty) tablespaces so the original data files can be imported
 2042  for f in *.frm; do echo $f; fname=$(echo $f | cut -d. -f1); mysql -u root -proot -h db owncloud -e "alter table owncloud.$fname DISCARD TABLESPACE;"; done
 # Overwrite the data
 2043  for f in *.ibd; do scp $f root@db:/var/lib/mysql/owncloud/$f; done
 # Re-import the tablespaces
 2044  for f in *.frm; do echo $f; fname=$(echo $f | cut -d. -f1); mysql -u root -proot -h db owncloud -e "alter table owncloud.$fname IMPORT TABLESPACE;"; done
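
A quick sanity check afterwards (not from my history, just a reasonable way to verify the import) would be:

# check that every table in the database opens and is consistent
mysqlcheck -u root -proot -h db owncloud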

This got the database back in working order... it was quite stressful, though.

Miscellaneous

For the rest of the VMs (music, web servers, reverse proxies, etc.) it was just a matter of re-running the ansible playbooks against them.
That worked quite well; there were a few differences I had to work around due to the base image changing from jessie to buster.
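
Re-running the playbooks themselves is nothing fancy; roughly something like this (playbook and inventory names are assumptions about my repo layout):

# re-run the playbooks against the re-created containers
ansible-playbook -i inventory site.yml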

Lessons

Backups are not backups until tested. This showed that while the data I have is kind-of safe, the cost of drives dying (and thus losing all the metadata as well) would be quite high. I intend to revisit the backup mechanism in the near future (a rough sketch follows the list):

  • Backup /etc/pve
  • Full mysql backup
  • Full influxdb backup
  • Full postgres backup
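
Those full backups could be as simple as nightly dump jobs along these lines (destination paths and exact flags are a sketch, not my current setup):

# proxmox metadata
rsync -a /etc/pve/ /backup/etc-pve/
# consistent logical dumps of each database
mysqldump --all-databases --single-transaction > /backup/mysql/all-$(date +%F).sql
influxd backup -portable /backup/influxdb/$(date +%F)
pg_dumpall > /backup/postgres/all-$(date +%F).sql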

I will also see if it makes sense to try out some form of semi-automated environment recovery.