This section describes how to back up and restore a Kubernetes cluster in case of a disaster. The state of the cluster comprises the package service configuration and any Kubernetes resources that exist when the backup is performed.
Prerequisites
For the time being, the backup artifacts are stored in an AWS S3 bucket. Therefore, the AWS CLI must be installed and the following steps need to be completed before backing up the cluster:
- Create an IAM user; this guide uses the name velero:
$ aws iam create-user --user-name velero
- Attach policies to give the velero user the necessary permissions:
$ aws iam attach-user-policy \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
    --user-name velero
$ aws iam attach-user-policy \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess \
    --user-name velero
- Create an access key for the velero user:
$ aws iam create-access-key --user-name velero
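The access key is printed as JSON when it is created. As a minimal sketch, assuming the AWS CLI's global `--query` and `--output` options, the hypothetical helpers below capture the key ID and secret in shell variables so they can be passed to the backup command later. The function names and the variables `VELERO_KEY_ID` and `VELERO_SECRET` are illustrative and not required by any tool.

```shell
# Hypothetical helper: create an access key for the given IAM user and print
# "<access-key-id><TAB><secret-access-key>" on a single line.
create_access_key() {
  aws iam create-access-key \
    --user-name "$1" \
    --query 'AccessKey.[AccessKeyId,SecretAccessKey]' \
    --output text
}

# Hypothetical helper: store the two tab-separated fields in shell variables.
# VELERO_KEY_ID and VELERO_SECRET are illustrative names.
load_access_key() {
  key_line="$(create_access_key "$1")" || return 1
  VELERO_KEY_ID="$(printf '%s\n' "$key_line" | cut -f1)"
  VELERO_SECRET="$(printf '%s\n' "$key_line" | cut -f2)"
}
```

Calling `load_access_key velero` once, in the shell session used for the backup, makes both values available without copying them by hand.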
Disaster Recovery
You will use the command dcos kubernetes cluster to create a backup of your deployment and, if necessary, to restore it. The command has two subcommands: backup and restore.
Back up the cluster
Using the credentials created in the previous step, run dcos kubernetes cluster backup to start the backup process. Note that the flags --aws-region, --aws-bucket, --aws-access-key-id and --aws-secret-access-key are mandatory.
usage: dcos kubernetes cluster backup --cluster-name=CLUSTER-NAME [<flags>]
Flags:
-h, --help Show context-sensitive help.
-v, --verbose Enable extra logging of requests/responses
--version Show application version.
--cluster-name=CLUSTER-NAME
Name of the Kubernetes cluster
--aws-secret-access-key=""
AWS secret access key
--aws-access-key-id="" AWS access key id
--aws-region="" AWS S3 region
--aws-bucket="" AWS S3 bucket name
--backup-name="kubernetes-backup"
Name for the backup
--backup-ttl=720h How long before backup can be garbage collected
--timeout=1200s Maximum time to wait for the backup process to complete
$ dcos kubernetes cluster backup --cluster-name=CLUSTER-NAME --aws-region=us-east-1 --aws-bucket=my-bucket --aws-access-key-id=ABC --aws-secret-access-key=XYZ
Backup creation: [COMPLETE]
Backup has been successfully created!
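Because all four AWS flags are mandatory, it can help to check them before starting a backup. The following is a minimal sketch of a hypothetical wrapper that reads the values from environment variables and stops early if one is missing. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` are the variable names the AWS CLI itself reads; `BACKUP_BUCKET` and the function name are assumptions made for this example.

```shell
# Hypothetical wrapper: run a backup with the mandatory AWS values taken from
# the environment. BACKUP_BUCKET is an illustrative variable name.
backup_cluster() {
  cluster_name="$1"
  if [ -z "$cluster_name" ]; then
    echo "usage: backup_cluster <cluster-name>" >&2
    return 2
  fi
  # Fail early with a clear message if any mandatory value is unset or empty.
  for var in AWS_DEFAULT_REGION BACKUP_BUCKET AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $var" >&2
      return 1
    fi
  done
  dcos kubernetes cluster backup \
    --cluster-name="$cluster_name" \
    --aws-region="$AWS_DEFAULT_REGION" \
    --aws-bucket="$BACKUP_BUCKET" \
    --aws-access-key-id="$AWS_ACCESS_KEY_ID" \
    --aws-secret-access-key="$AWS_SECRET_ACCESS_KEY"
}
```

With the variables exported, `backup_cluster CLUSTER-NAME` runs the same command as the example above while keeping the literal key values out of the typed command line.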
View backup log messages
To diagnose a failed MKE cluster backup, you can view the log files for the Kubernetes pod launched to perform the backup. Use these steps:
- While the “dcos kubernetes cluster backup” command is running, get the ID of the pod that is performing the Ark backup:
$ kubectl get pods -n heptio-ark
- Check the log of the running heptio-ark pod that is performing the backup. Use the following command, replacing <pod-id> with the pod ID returned by the previous command:
$ kubectl logs -f -n heptio-ark <pod-id>
The -f option follows the log, so you see every message as it is written, including any error messages that help determine the cause of the backup failure.
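The two steps above can be combined so that the pod ID does not have to be copied by hand. This is a minimal sketch that uses kubectl's `-o jsonpath` output option. It assumes the backup pod is the first (or only) pod listed in the heptio-ark namespace; the function names are illustrative.

```shell
# Hypothetical helper: print the name of the first pod in the heptio-ark
# namespace. Assumes the backup pod is the first (or only) pod listed there.
ark_pod_name() {
  kubectl get pods -n heptio-ark -o jsonpath='{.items[0].metadata.name}'
}

# Hypothetical helper: follow that pod's log until interrupted with Ctrl-C.
follow_ark_logs() {
  pod="$(ark_pod_name)" || return 1
  if [ -z "$pod" ]; then
    echo "no pod found in namespace heptio-ark" >&2
    return 1
  fi
  kubectl logs -f -n heptio-ark "$pod"
}
```

If more than one pod runs in the namespace, list the pods manually as described above and pass the correct ID to kubectl logs.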
Remove backup entries
If you no longer need a backup entry, you can remove it from the Kubernetes cluster with the following steps.
- Get a list of the heptio-ark Kubernetes cluster backups:
$ kubectl get backup.ark.heptio.com -n heptio-ark
- Delete the heptio-ark backup entry, replacing <backup-id> with one of the backup names listed by the previous command:
$ kubectl delete -n heptio-ark backup.ark.heptio.com <backup-id>
- Use the AWS CLI to remove the backup content from the S3 bucket:
$ aws s3 rm --recursive s3://<bucket-for-backups>/<backup-id>
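The delete steps above can be wrapped in one function that guards against empty arguments. This is a minimal sketch; the function name and its argument order are assumptions made for this example. The guard matters because an empty backup ID would shorten the S3 path to the bucket root, and the recursive delete would then remove every object in the bucket.

```shell
# Hypothetical helper: delete an Ark backup entry and its stored content.
# Usage: remove_backup <backup-id> <bucket-for-backups>
remove_backup() {
  backup_id="$1"
  bucket="$2"
  # Refuse to run with an empty argument: an empty backup ID would turn the
  # S3 path into s3://<bucket>/ and the recursive delete would empty the bucket.
  if [ -z "$backup_id" ] || [ -z "$bucket" ]; then
    echo "usage: remove_backup <backup-id> <bucket-for-backups>" >&2
    return 2
  fi
  # Remove the backup entry from the cluster first; stop if that fails.
  kubectl delete -n heptio-ark backup.ark.heptio.com "$backup_id" || return 1
  # Then remove the stored backup content from S3.
  aws s3 rm --recursive "s3://$bucket/$backup_id"
}
```

For example, `remove_backup kubernetes-backup my-bucket` removes the default-named backup entry and then its content.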
Restore the cluster
The restore subcommand retrieves the backup artifacts from S3 and imports the saved state into a newly provisioned cluster. The flags --aws-region, --aws-bucket, --aws-access-key-id and --aws-secret-access-key are mandatory.
usage: dcos kubernetes cluster restore --cluster-name=CLUSTER-NAME [<flags>]
Flags:
-h, --help Show context-sensitive help.
-v, --verbose Enable extra logging of requests/responses
--version Show application version.
--cluster-name=CLUSTER-NAME
Name of the Kubernetes cluster
--aws-secret-access-key=""
AWS secret access key
--aws-access-key-id="" AWS access key id
--aws-region="" AWS S3 region
--aws-bucket="" AWS S3 bucket name
--backup-name="kubernetes-backup"
Name of the backup to restore
--timeout=1200s Maximum time to wait for the backup process to complete
--yes Disable interactive mode and assume "yes" is the answer to all prompts
$ dcos kubernetes cluster restore --cluster-name=CLUSTER-NAME --aws-region=us-east-1 --aws-bucket=my-bucket --aws-access-key-id=ABC --aws-secret-access-key=XYZ
Backup restore: [COMPLETE]
Backup successfully restored!
Verify
- On a running Kubernetes cluster, deploy a couple of pods:
$ kubectl create -f ./artifacts/nginx/nginx-deployment.yaml
$ kubectl get pods --all-namespaces
NAMESPACE   NAME                     READY   STATUS    RESTARTS   AGE
(...)
default     nginx-6c54bd5869-pt62l   1/1     Running   0          39s
default     nginx-6c54bd5869-xt82y   1/1     Running   0          39s
- Create a backup of the cluster:
$ dcos kubernetes cluster backup --cluster-name=CLUSTER-NAME --aws-region=us-east-1 --aws-bucket=my-bucket --aws-access-key-id=ABC --aws-secret-access-key=XYZ
- Delete the deployment that was previously created:
$ kubectl delete -f ./artifacts/nginx/nginx-deployment.yaml
- Restore the backup and verify that the pods are running again:
$ dcos kubernetes cluster restore --cluster-name=CLUSTER-NAME --aws-region=us-east-1 --aws-bucket=my-bucket --aws-access-key-id=ABC --aws-secret-access-key=XYZ
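The final check, that the pods are running again, can be scripted instead of inspected by eye. The sketch below polls kubectl until the expected number of pods report Running. It assumes the restored pods live in the default namespace and have names starting with `nginx-`, as in the sample output above; the function names, retry count and two-second delay are illustrative choices.

```shell
# Hypothetical helper: count pods in the default namespace whose name starts
# with "nginx-" and whose STATUS column reads Running.
running_nginx_pods() {
  kubectl get pods -n default --no-headers |
    awk '$1 ~ /^nginx-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll until at least <expected> pods are Running.
# Usage: wait_for_nginx <expected> [attempts]   (attempts defaults to 30)
wait_for_nginx() {
  expected="$1"
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    if [ "$(running_nginx_pods)" -ge "$expected" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}
```

After the restore completes, `wait_for_nginx 2` returns success once both pods from the example deployment are back.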