You can use the CLI to create and restore backups of your cluster. You can also back up and restore the state of ZooKeeper and etcd running inside a DC/OS cluster.
Prerequisites
- A DC/OS Enterprise cluster
- The DC/OS CLI installed
- The DC/OS Enterprise CLI installed
Backing up a cluster
Backups are stored on the local file system of the master node. The backup state is maintained by a service running in the cluster, and backup/restore operations are initiated by calling this service directly.
- Create a backup and assign it a meaningful label. The label has the following restrictions:

  - It must be between 3 and 25 characters in length.
  - It cannot start with `..`.
  - It must be composed of the following characters: `[A-Za-z0-9_.-]`.

  ```bash
  dcos backup create --label=<backup-label>
  ```
- Verify that your backup has been created.

  ```bash
  dcos backup list
  ```

  Or use the following command to restrict your search results to the label you used when you created the backup.

  ```bash
  dcos backup list [label]
  ```
  The backup will initially transition into the `STATUS_BACKING_UP` state, and should eventually arrive at `STATUS_READY`. If something goes wrong, it will show a state of `STATUS_ERROR`. Use `dcos backup show <backup-id>` to discover why Marathon errored out during the course of the backup. (A consolidated sketch of creating a backup and waiting for it to become ready follows this list.)

- Use the ID produced by `dcos backup list` to refer to your backup in subsequent commands. A backup ID will resemble `<backup-label>-ea6b49f5-79a8-4767-ae78-3f874c90e3da`.
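As referenced above, a minimal sketch of creating a backup and waiting for it to become ready might look like the following. The label `pre-upgrade` is hypothetical, and the loop greps the CLI output rather than relying on exact JSON field names, which may vary between DC/OS versions.

```bash
# Create a backup with a hypothetical label that satisfies the naming rules above.
dcos backup create --label=pre-upgrade

# List backups restricted to that label and note the generated backup ID.
dcos backup list pre-upgrade

# Poll until the backup leaves STATUS_BACKING_UP; substitute the real ID for <backup-id>.
while dcos backup show <backup-id> | grep -q STATUS_BACKING_UP; do
  sleep 10
done
dcos backup show <backup-id>   # expect STATUS_READY; STATUS_ERROR indicates a failure
```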
Deleting a backup
Delete an unneeded backup.
```bash
dcos backup delete <backup-id>
```
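For example, using the ID format shown above (the label portion is hypothetical):

```bash
dcos backup delete pre-upgrade-ea6b49f5-79a8-4767-ae78-3f874c90e3da
```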
Restoring a cluster
- List the available backups, choose the backup you want to restore to, and make a note of the backup ID.

  ```bash
  dcos backup list
  ```

- Restore from the selected backup.

  ```bash
  dcos backup restore <backup-id>
  ```

- Monitor the status of the restore operation.

  ```bash
  dcos backup show <backup-id>
  ```

  The `restores.component_status.marathon` parameter of the JSON output will show `STATUS_RESTORING`, and then `STATUS_READY`. (A sketch for watching this field follows this list.)
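As noted in the monitoring step, a minimal sketch for watching that field is shown below. It assumes the `jq` utility is installed on the machine running the CLI; the exact JSON nesting may differ slightly between DC/OS versions, so adjust the filter if needed.

```bash
# Re-check the Marathon restore status every 10 seconds; substitute the real ID for <backup-id>.
watch -n 10 "dcos backup show <backup-id> | jq '.restores.component_status.marathon'"
```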
ZooKeeper backup and restore
This section describes the process of backing up and restoring the state of ZooKeeper running inside a DC/OS cluster.
Backing up ZooKeeper will allow you to return a cluster to a known good state. Therefore we highly recommend that you back up your ZooKeeper state regularly, to be prepared for the worst-case scenario. When performing maintenance operations, such as an upgrade or downgrade, you may wish to back up the ZooKeeper state before beginning the maintenance.
Backing up a ZooKeeper cluster
The ZooKeeper cluster within DC/OS is a system that provides distributed consensus between member nodes. An instance of ZooKeeper runs on each master node, and these instances serve the entire cluster. The ZooKeeper state can only progress once a quorum of nodes in the cluster has seen and agreed on a value. This implies that the state of any one ZooKeeper node contains the entire state information up until a certain point in time. Therefore, backing up only one ZooKeeper node is sufficient to get reasonably close to the latest state of the cluster. Because creating the backup takes time, the live system will most likely have moved past the backed-up state by the time the procedure finishes; however, the data available at the beginning of the procedure will be captured.
Prerequisites
- Make sure there is enough disk space available to temporarily store the ZooKeeper backup on a particular master node.
- Any shell commands must be issued as a privileged Linux user.
- Stop the ZooKeeper instance on only one master node via the Exhibitor `systemd` unit. (A consolidated sketch of the full backup sequence follows this list.)

  ```bash
  systemctl stop dcos-exhibitor
  ```
- Create a ZooKeeper backup via the provided DC/OS ZooKeeper backup script on the same master node.

  ```bash
  /opt/mesosphere/bin/dcos-shell dcos-zk backup <backup-tar-archive-path> -v
  ```

- Start the previously stopped ZooKeeper instance again on the same master node.

  ```bash
  systemctl start dcos-exhibitor
  ```

- Download the created ZooKeeper backup tar archive from this master node to a safe location outside of the DC/OS cluster.

- Remove the ZooKeeper backup tar archive from the master node.
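Taken together, the steps above might look like the following sketch when run as a privileged user on the chosen master node. The archive path `/tmp/zk-backup.tar.gz` is hypothetical; use any location with enough free disk space.

```bash
# Stop ZooKeeper on this one master node only.
systemctl stop dcos-exhibitor

# Create the backup archive (hypothetical path).
/opt/mesosphere/bin/dcos-shell dcos-zk backup /tmp/zk-backup.tar.gz -v

# Bring the local ZooKeeper instance back up.
systemctl start dcos-exhibitor

# From a machine outside the cluster, download the archive to safe storage, for example:
#   scp <master-host-ip>:/tmp/zk-backup.tar.gz .
# Then remove the local copy on the master node.
rm /tmp/zk-backup.tar.gz
```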
Restoring from a ZooKeeper backup
To restore, the backup taken from a single ZooKeeper node is copied to every ZooKeeper node in the cluster and restored on each of them. This ensures that all nodes return to operation from the same state, recorded up until the point when the backup procedure finished. Restoring requires all ZooKeeper nodes to be stopped, meaning this is only an option when an outage is tolerable or already ongoing.
- Copy the previously created single ZooKeeper backup tar archive to every master node's file system.

- Stop the ZooKeeper instances on every master node via the Exhibitor `systemd` unit.

  ```bash
  systemctl stop dcos-exhibitor
  ```

- Initiate the restore procedure via the provided DC/OS ZooKeeper restore script on every master node.

  ```bash
  /opt/mesosphere/bin/dcos-shell dcos-zk restore <backup-tar-archive-path> -v
  ```

- Start the previously stopped ZooKeeper instances again on every master node.

  ```bash
  systemctl start dcos-exhibitor
  ```
- Monitor the Exhibitor state of the DC/OS cluster via the Exhibitor cluster status API endpoint (no authentication required).

  ```bash
  curl https://<master-host-ip>/exhibitor/exhibitor/v1/cluster/status
  ```

  ```json
  [
    {
      "code": 3,
      "description": "serving",
      "hostname": "172.31.12.169",
      "isLeader": true
    },
    {
      "code": 3,
      "description": "serving",
      "hostname": "172.31.13.255",
      "isLeader": false
    },
    {
      "code": 3,
      "description": "serving",
      "hostname": "172.31.17.144",
      "isLeader": false
    }
  ]
  ```

  The restore procedure is successful when all instances are in the `serving` state and a leader has been elected. (A quick check using `jq` is sketched after this list.)
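As mentioned in the monitoring step, a quick check is sketched below. It assumes `jq` is available; the `-k` flag skips certificate verification for the IP-based URL and can be dropped if your cluster certificate validates.

```bash
# not_serving should be 0 and leaders should be 1 once the restore has succeeded.
curl -sk https://<master-host-ip>/exhibitor/exhibitor/v1/cluster/status \
  | jq '{not_serving: ([.[] | select(.description != "serving")] | length),
         leaders: ([.[] | select(.isLeader)] | length)}'
```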
Limitations to ZooKeeper backups
- Backing up the ZooKeeper state in the current form requires stopping one ZooKeeper node. In cases where you are using 3 master nodes, this significantly reduces the tolerance of master node outages for a DC/OS cluster while a backup is taken, and impacts the resilience to a lesser degree when using 5 master nodes.
- Restoring from a ZooKeeper backup requires stopping all ZooKeeper instances within DC/OS. Hence this is only recommended as a last resort for restoring an otherwise non-recoverable cluster.
etcd backup and restore
This section describes the process of backing up and restoring the state of etcd running inside a DC/OS cluster.
Backing up etcd will allow you to return a cluster to a known good state. Therefore we highly recommend that you back up your etcd state regularly, to be prepared for the worst-case scenario. When performing maintenance operations, such as an upgrade or downgrade, you may wish to back up the etcd state before beginning the maintenance.
Backing up an etcd cluster
The etcd cluster within DC/OS is a system that provides distributed consensus between member nodes. An instance of etcd runs on each master node, and these instances serve the entire cluster. The etcd state can only progress once a quorum of nodes in the cluster has seen and agreed on a value. This implies that the state of any one etcd node contains the entire state information up until a certain point in time. Therefore, backing up only one etcd node is sufficient to get reasonably close to the latest state of the cluster. Because creating the backup takes time, the live system will most likely have moved past the backed-up state by the time the procedure finishes; however, the data available at the beginning of the procedure will be captured.
Prerequisites
- Make sure there is enough disk space available to temporarily store the etcd backup on a particular master node.
- Any shell commands must be issued as a privileged Linux user.
- SSH to the master node that is the current Mesos leader. To discover which node is the leader, run this command on any cluster node; the IP address shown in the replies is the IP of the current leader:

  ```bash
  ping leader.mesos
  ```
- Run this command to create the backup (a consolidated sketch with a concrete path follows this list):

  ```bash
  sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl backup <backup-tar-archive-path>
  ```
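A consolidated sketch of these two steps, using a hypothetical archive path of `/tmp/etcd-backup.tar.gz`, might look like this:

```bash
# On any cluster node: the replies reveal the IP of the current Mesos leader.
ping -c 3 leader.mesos

# On that leader master node, as a privileged user, create the backup archive.
# Afterwards, copy the archive to safe storage outside the cluster.
sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl backup /tmp/etcd-backup.tar.gz
```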
Restoring from an etcd backup
- Copy the previously created single etcd backup tar archive to all master nodes. (A consolidated per-node sketch follows this list.)

- Stop the etcd instances on all master nodes:

  ```bash
  sudo systemctl stop dcos-etcd
  ```

- Create a copy of the etcd data directory on all master nodes:

  ```bash
  sudo cp -R /var/lib/dcos/etcd/default.etcd <backup-directory-path>
  ```

- Initiate the restore procedure using the provided DC/OS etcd restore script on all master nodes:

  ```bash
  sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl restore <backup-tar-archive-path>
  ```

- Start the previously stopped etcd instances on all master nodes:

  ```bash
  sudo systemctl start dcos-etcd
  ```
- Check the etcd cluster status on all master nodes:

  ```bash
  sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl diagnostic
  ```

  The above command presents the results of the etcdctl subcommands `endpoint health` and `member list -w json`. Given the output of these commands, a healthy etcd cluster should meet the following requirements:

  - `endpoint health` checks the health of the current etcd instance, which should be reported as "healthy".
  - `member list -w json` returns the cluster members, which should include all master nodes.
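As referenced in the first restore step, the per-node sequence might be consolidated as in the sketch below. The archive path `/tmp/etcd-backup.tar.gz` and the copy destination `default.etcd.bak` are hypothetical; run the commands as a privileged user on every master node.

```bash
# Stop etcd on this master node.
sudo systemctl stop dcos-etcd

# Keep a copy of the current data directory in case the restore must be rolled back.
sudo cp -R /var/lib/dcos/etcd/default.etcd /var/lib/dcos/etcd/default.etcd.bak

# Restore from the backup archive and start etcd again.
sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl restore /tmp/etcd-backup.tar.gz
sudo systemctl start dcos-etcd

# Verify cluster health once every master node has completed these steps.
sudo /opt/mesosphere/bin/dcos-shell dcos-etcdctl diagnostic
```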
Limitations to etcd backups
- Restoring from an etcd backup requires stopping all etcd instances within DC/OS. Hence this is only recommended as a last resort for restoring an otherwise non-recoverable cluster.