The DC/OS Apache Cassandra service provides a robust API (accessible by HTTP or DC/OS CLI) for managing, repairing, and monitoring the service. Here, only the CLI version is presented for conciseness, but see the API Reference for HTTP instructions.
Plan Operations
An instance of the DC/OS Apache Cassandra service is ultimately a set of plans (how to run the service) and a set of pods (what to run). The service is driven primarily by two plans, deploy
and recovery
.
The deploy
plan is used during initial installation, as well as during configuration and service updates. The recovery
plan is an always-running plan that ensures that the pods and tasks that should be running, stay running. Other plans (“sidecars”) are used for secondary operations.
Every plan is made up of a set of phases, each of which consists of one or more steps.
Listing plans
A list of plans can be retrieved using the CLI command:
dcos cassandra --name=cassandra plan list
Inspecting a plan
View the current status of a plan using the CLI command:
dcos cassandra --name=cassandra plan status <plan-name>
For example, the status of the completed deploy plan of the service will be:
dcos cassandra --name=cassandra plan status deploy
deploy (serial strategy) (COMPLETE)
└─ node-deploy (serial strategy) (COMPLETE)
├─ node-0:[server] (COMPLETE)
├─ node-0:[init_system_keyspaces] (COMPLETE)
├─ node-1:[server] (COMPLETE)
└─ node-2:[server] (COMPLETE)
Operating on a plan
Start
Start a plan with the CLI command:
dcos cassandra --name=cassandra plan start <plan-name>
Pause
Pauses the plan, or a specific phase in that plan with the provided phase name (or UUID).
dcos cassandra --name=cassandra plan pause <plan (required)> <phase (optional)>
Stop
Stops the running plan with the provided name.
dcos cassandra --name=cassandra plan stop <plan>
Plan Stop differs from Plan Pause in the following ways:
- Pause can be issued for a specific phase or for all phases within a plan. Stop can only be issued for a plan.
- Pause updates the underlying Phase/Step state. Stop both ceases execution and of the plan and resets the plan to its initial pending state.
Resume
Resumes the plan, or a specific phase in that plan, with the provided phase name (or UUID).
dcos cassandra --name=cassandra plan resume <plan (required)> <phase (optional)>
Force-Restart
Restarts the specified step, phase if no step is specified, or plan if no phase is specified.
dcos cassandra --name=cassandra force-restart <plan (required)> <phase (optional)> <step (optional)>
Force-Complete
Force completes a specific step in the provided phase of the plan. From the CLI it is only possible to force complete a step. The HTTP API supports force completing on phases and plans in their entirety.
dcos cassandra --name=cassandra force-complete <plan (required)> <phase (required)> <step (required)>
Pod Operations
A deployed instance of the DC/OS Apache Cassandra service is made up of a set of running pods. Using the pod API, it is possible to manage the lifecycle of these pods as well as investigate failures of the pods and their tasks.
List
To list all the pods of the service run the CLI command:
dcos cassandra --name=cassandra pod list
Status
To view the status of all pods or optionally just one pod run the CLI command:
dcos cassandra --name=cassandra pod status <pod-name (optional)>
This will show any status overrides for pods and their tasks.
Restart
To restart a pod in place, use the CLI command:
dcos cassandra --name=cassandra pod restart <pod-name>
This will kill the tasks of the pod, and relaunch them in-place. The progress of the restart can be monitored by watching the recovery
plan of the service.
Replace
Replace should be used only when the current instance of the pod should be completely destroyed. All persistent data (read: volumes) of the pod will be destroyed. Replace should be used when a DC/OS agent is being removed, is permanently down, or pod placement constraints need to be updated.
Issue a replace by running the CLI command:
dcos cassandra --name=cassandra pod replace <pod-name>
Pause
Pausing a pod relaunches it in an idle command state. This allows you to debug the contents of the pod, possibly making changes to fix problems, while still having access to all the context of the pod (such as volumes)
Using pause and dcos task exec
is a very powerful debugging tool. To pause a pod use the CLI command:
dcos cassandra --name=cassandra debug pod pause <pod-name>
To pause a specific task of a pod append the -t <task-name>
flag,
dcos cassandra --name=cassandra debug pod pause <pod-name> -t <task-name>
Use the pod status
command to check what tasks and pods are in an overridden state.
Resume
To resume a paused pod or task use the CLI command:
dcos cassandra --name=cassandra debug pod resume <pod-name> [-t <task-name>]
Check the pod status
command to verify that the pod (or task) has properly resumed.
Service Metrics
The DC/OS Apache Cassandra service pushes metrics into the DC/OS metrics system. Details on consuming them can be found at the documentation on the DC/OS metrics system.
Logging
DC/OS has three ways to access service scheduler and service task logs.
- Via the DC/OS GUI
- Via the Mesos GUI
- Via the DC/OS CLI
DC/OS GUI
A service’s logs are accessed by selecting the service in the Services tab. Both the service scheduler and service tasks are displayed side by side in this view. To view the tasks sandbox (files) as well as its stdout
and stderr
, click on a task.
Mesos GUI
The Mesos GUI provides similar access as that of the DC/OS UI, but the service scheduler is separate from the service tasks. The service tasks can all be found in the Frameworks tab under the framework with the same name as the service. The service scheduler can be found in the Marathon framework, it will be a task with the same name as the service.
Access both the files and logs of a service by clicking on its sandbox
link.
Figure 1 - Scheduler sandbox
DC/OS CLI
The dcos task log
subcommand allows you to pull logs from multiple tasks at the same time. The dcos task log <task-pattern>
command will fetch logs for all tasks matching the prefix pattern. See the help of the subcommand for full details of the command.
Mesos Agent logs
Occasionally, it can also be useful to examine what a given Mesos agent is doing. The Mesos Agent handles deployment of Mesos tasks to a given physical system in the cluster. One Mesos Agent runs on each system. These logs can be useful for determining if there is a problem at the system level that is causing alerts across multiple services on that system.
Mesos agent logs can be accessed via the Mesos GUI or directly on the agent. The GUI method is described here.
Navigate to the agent you want to view either directly from a task by clicking the “Agent” item in the breadcrumb when viewing a task (this will go directly to the agent hosting the task), or by navigating through the “Agents” menu item at the top of the screen (you will need to select the desired agent from the list).
In the Agent view, you will see a list of frameworks with a presence on that Agent. In the left pane you will see a plain link named “LOG”. Click that link to view the agent logs.
Figure 2 - List of frameworks
Performing Cassandra Cleanup and Repair Operations
You may manually trigger certain nodetool
operations against your Cassandra instance using the CLI or the HTTP API.
Cleanup
You may trigger a nodetool cleanup
operation across your Cassandra nodes using the cleanup
plan. This plan requires the following parameters to run:
CASSANDRA_KEYSPACE
: the Cassandra keyspace to be cleaned up.
Use the following command to initiate this plan from the command line:
dcos cassandra --name=<service-name> plan start cleanup -p CASSANDRA_KEYSPACE=space1
Use the following command to view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status cleanup
cleanup (IN_PROGRESS)
└─ cleanup-deploy (IN_PROGRESS)
├─ node-0:[cleanup] (COMPLETE)
├─ node-1:[cleanup] (STARTING)
└─ node-2:[cleanup] (PENDING)
When the plan is completed, its status will be marked COMPLETE
.
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.
For more information about nodetool cleanup
, see the Cassandra documentation.
Repair
You may trigger a nodetool repair
operation across your Cassandra nodes using the repair
plan. This plan requires the following parameters to be executed:
CASSANDRA_KEYSPACE
: the Cassandra keyspace to be cleaned up.
Use the following command to initiate CASSANDRA_KEYSPACE
from the command line:
dcos cassandra --name=<service-name> plan start repair -p CASSANDRA_KEYSPACE=space1
Use the following command to view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status repair
repair (STARTING)
└─ repair-deploy (STARTING)
├─ node-0:[repair] (STARTING)
├─ node-1:[repair] (PENDING)
└─ node-2:[repair] (PENDING)
When the plan is completed, its status will be marked as COMPLETE
.
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.
For more information about nodetool repair
, see the Cassandra documentation.
Seed nodes
Cassandra seed nodes are those nodes with indices smaller than the seed node count. By default, Cassandra is deployed with a seed node count of two (node-0 and node-1 are seed nodes). When a replace operation is performed on one these nodes, all other nodes must be restarted and updated with the ip address of the new seed node. This operation is performed automatically.
For example: Use the following command, if you want to replace node-0
:
dcos cassandra --name=<service-name> pod replace node-0
The output of a recovery plan is as follows:
dcos cassandra --name=<service-name> plan show recovery
recovery (IN_PROGRESS)
└─ permanent-node-failure-recovery (IN_PROGRESS)
├─ node-0:[server] (COMPLETE)
├─ node-1:[server] (STARTING)
└─ node-2:[server] (PENDING)
...
Back up and Restore
Backing Up to S3
You can back up an entire cluster’s data and schema to Amazon S3 or S3 compatible storage using the backup-s3
plan. This plan requires the following parameters to run:
SNAPSHOT_NAME
: The name of this snapshot. Snapshots for individual nodes will be stored as S3 folders inside of a top levelsnapshot
folder.CASSANDRA_KEYSPACES
: The Cassandra keyspaces to back up. The entire keyspace and its schema will be backed up for each keyspace specified.AWS_ACCESS_KEY_ID
: The access key ID for the AWS IAM user running this backupAWS_SECRET_ACCESS_KEY
: The secret access key for the AWS IAM user running this backupAWS_REGION
: The region of the S3 bucket being used to store this backupS3_BUCKET_NAME
: The name of the S3 bucket in which to store this backupHTTPS_PROXY
: Specifications for the backup plan, taken fromconfig.yaml
.
Optional parameters:
S3_ENDPOINT_URL
: URL of S3 compatible storage. If you want to backup to S3 compatible storage, please provide this parameter inbackup-s3
plan.AWS_SESSION_ID
: It may also be necessary to set the session ID depending on how you authenticate with AWS.AWS_SESSION_TOKEN
: It may also be necessary to set the session token depending on how you authenticate with AWS.
Make sure to provision your nodes with enough disk space to perform a backup. backups are stored on disk before being uploaded to S3, and will take up as much space as the data currently in the tables and requires half of your total available space to be free to backup every keyspace at once.
As noted in the documentation for the backup/restore strategy configuration option, it is possible to run transfers to S3 either in serial or in parallel, but care must be taken not to exceed any throughput limits you may have in your cluster. Throughput depends on a variety of factors, including uplink speed, proximity to region where the backups are being uploaded and downloaded, and the performance of the underlying storage infrastructure. You should perform periodic tests in your local environment to understand what you can expect from S3.
You can configure whether snapshots are created and uploaded in serial (default) or in parallel. The serial backup/restore strategy is recommended.
Before initiating the backup plan, it is recommended to reset the plan to its original state. This will ensure the backup plan runs in the correct sequence. You can reset the plan with the following command:
dcos cassandra --name=
Use the following parameters to initiate this plan from the command line:
SNAPSHOT_NAME=<my_snapshot>
CASSANDRA_KEYSPACES="space1 space2"
AWS_ACCESS_KEY_ID=<my_access_key_id>
AWS_SECRET_ACCESS_KEY=<my_secret_access_key>
AWS_REGION=us-west-2
S3_BUCKET_NAME=backups
dcos cassandra --name=<service-name> plan start backup-s3 \
-p SNAPSHOT_NAME=$SNAPSHOT_NAME \
-p "CASSANDRA_KEYSPACES=$CASSANDRA_KEYSPACES" \
-p AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-p AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-p AWS_REGION=$AWS_REGION \
-p S3_BUCKET_NAME=$S3_BUCKET_NAME
-p HTTPS_PROXY=http://internal.proxy:8080
If you are backing up multiple keyspaces, they must be separated by spaces and wrapped in quotation marks when supplied to the plan start
command, as shown in the above example. If the CASSANDRA_KEYSPACES
parameter is not supplied, then every keyspace in your cluster will be backed up.
To view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status backup-s3
backup-s3 (IN_PROGRESS)
├─ backup-schema (COMPLETE)
│ ├─ node-0:[backup-schema] (COMPLETE)
│ ├─ node-1:[backup-schema] (COMPLETE)
│ └─ node-2:[backup-schema] (COMPLETE)
├─ create-snapshots (IN_PROGRESS)
│ ├─ node-0:[snapshot] (STARTED)
│ ├─ node-1:[snapshot] (STARTED)
│ └─ node-2:[snapshot] (COMPLETE)
├─ upload-backups (PENDING)
│ ├─ node-0:[upload-s3] (PENDING)
│ ├─ node-1:[upload-s3] (PENDING)
│ └─ node-2:[upload-s3] (PENDING)
└─ cleanup-snapshots (PENDING)
├─ node-0:[cleanup-snapshot] (PENDING)
├─ node-1:[cleanup-snapshot] (PENDING)
└─ node-2:[cleanup-snapshot] (PENDING)
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.
Backing up to Azure
You can also back up to Microsoft Azure using the backup-azure
plan. This plan requires the following parameters to run:
SNAPSHOT_NAME
: the name of this snapshot. Snapshots for individual nodes will be stored as gzipped tarballs with the namenode-<POD_INDEX>.tar.gz
.CASSANDRA_KEYSPACES
: the Cassandra keyspaces to backup. The entire keyspace, as well as its schema, will be backed up for each keyspace specified.CLIENT_ID
: the client ID for the Azure service principal running this backupTENANT_ID
: the tenant ID for the tenant that the service principal belongs toCLIENT_SECRET
: the service principal’s secret keyAZURE_STORAGE_ACCOUNT
: the name of the storage account that this backup will be sent toAZURE_STORAGE_KEY
: the secret key associated with the storage accountCONTAINER_NAME
: the name of the container in which to store this backup.
Before initiating the backup plan, it is recommended to reset the plan to its original state. This will ensure the backup plan runs in the correct sequence. You can reset the plan with:
dcos cassandra --name=
You can initiate this plan from the command line in the same way as the Amazon S3 backup plan:
dcos cassandra --name=<service-name> plan start backup-azure \
-p SNAPSHOT_NAME=$SNAPSHOT_NAME \
-p "CASSANDRA_KEYSPACES=$CASSANDRA_KEYSPACES" \
-p CLIENT_ID=$CLIENT_ID \
-p TENANT_ID=$TENANT_ID \
-p CLIENT_SECRET=$CLIENT_SECRET \
-p AZURE_STORAGE_ACCOUNT=$AZURE_STORAGE_ACCOUNT \
-p AZURE_STORAGE_KEY=$AZURE_STORAGE_KEY \
-p CONTAINER_NAME=$CONTAINER_NAME
To view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status backup-azure
backup-azure (IN_PROGRESS)
├─ backup-schema (COMPLETE)
│ ├─ node-0:[backup-schema] (COMPLETE)
│ ├─ node-1:[backup-schema] (COMPLETE)
│ └─ node-2:[backup-schema] (COMPLETE)
├─ create-snapshots (COMPLETE)
│ ├─ node-0:[snapshot] (COMPLETE)
│ ├─ node-1:[snapshot] (COMPLETE)
│ └─ node-2:[snapshot] (COMPLETE)
├─ upload-backups (IN_PROGRESS)
│ ├─ node-0:[upload-azure] (COMPLETE)
│ ├─ node-1:[upload-azure] (STARTING)
│ └─ node-2:[upload-azure] (PENDING)
└─ cleanup-snapshots (PENDING)
├─ node-0:[cleanup-snapshot] (PENDING)
├─ node-1:[cleanup-snapshot] (PENDING)
└─ node-2:[cleanup-snapshot] (PENDING)
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.
Restore
All restore plans will restore the schema from every keyspace backed up with the backup plan and populate those keyspaces with the data they contained at the time the snapshot was taken. Downloading and restoration of backups will use the configured backup/restore strategy. This plan assumes that the keyspaces being restored do not already exist in the current cluster, and will fail if any keyspace with the same name is present.
Restoring From S3
Restoring cluster data is similar to backing it up. The restore-s3
plan assumes that your data is stored in an S3 bucket in the format that backup-s3
uses. The restore plan has the following required parameters:
SNAPSHOT_NAME
: the snapshot name from thebackup-s3
planAWS_ACCESS_KEY_ID
: the access key ID for the AWS IAM user running this restoreAWS_SECRET_ACCESS_KEY
: the secret access key for the AWS IAM user running this restoreAWS_REGION
: the region of the S3 bucket being used to store the backup being restoredS3_BUCKET_NAME
: the name of the S3 bucket where the backup is stored
Optional parameters:
S3_ENDPOINT_URL
: URL of S3 compatible storage. If you want to restore from S3 compatible storage, please provide this parameter inrestore-s3
plan.AWS_SESSION_ID
: It may also be necessary to set the session ID depending on how you authenticate with AWS.AWS_SESSION_TOKEN
: It may also be necessary to set the session token depending on how you authenticate with AWS.
To initiate this plan from the command line:
SNAPSHOT_NAME=<my_snapshot>
CASSANDRA_KEYSPACES="space1 space2"
AWS_ACCESS_KEY_ID=<my_access_key_id>
AWS_SECRET_ACCESS_KEY=<my_secret_access_key>
AWS_REGION=us-west-2
S3_BUCKET_NAME=backups
dcos cassandra --name=<service-name> plan start restore-s3 \
-p SNAPSHOT_NAME=$SNAPSHOT_NAME \
-p "CASSANDRA_KEYSPACES=$CASSANDRA_KEYSPACES" \
-p AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-p AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-p AWS_REGION=$AWS_REGION \
-p S3_BUCKET_NAME=$S3_BUCKET_NAME
To view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status restore-s3
restore-s3 (IN_PROGRESS)
├─ fetch-s3 (COMPLETE)
│ ├─ node-0:[fetch-s3] (COMPLETE)
│ ├─ node-1:[fetch-s3] (COMPLETE)
│ └─ node-2:[fetch-s3] (COMPLETE)
├─ restore-schema (IN_PROGRESS)
│ ├─ node-0:[restore-schema] (COMPLETE)
│ ├─ node-1:[restore-schema] (STARTED)
│ └─ node-2:[restore-schema] (PENDING)
└─ restore-snapshots (PENDING)
├─ node-0:[restore-snapshot] (PENDING)
├─ node-1:[restore-snapshot] (PENDING)
└─ node-2:[restore-snapshot] (PENDING)
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.
Restoring From Azure
You can restore from Microsoft Azure using the restore-azure
plan. This plan requires the following parameters to run:
SNAPSHOT_NAME
: The name of this snapshot. Snapshots for individual nodes will be stored as gzipped tarballs with the namenode-<POD_INDEX>.tar.gz
.CLIENT_ID
: The client ID for the Azure service principal running this backup.TENANT_ID
: The tenant ID for the tenant that the service principal belongs to.CLIENT_SECRET
: The service principal’s secret key.AZURE_STORAGE_ACCOUNT
: The name of the storage account that this backup will be sent to.AZURE_STORAGE_KEY
: The secret key associated with the storage account.CONTAINER_NAME
: The name of the container in which to store this backup.
You can initiate this plan from the command line similar to the Amazon S3 restore plan:
dcos cassandra --name=<service-name> plan start restore-azure \
-p SNAPSHOT_NAME=$SNAPSHOT_NAME \
-p CLIENT_ID=$CLIENT_ID \
-p TENANT_ID=$TENANT_ID \
-p CLIENT_SECRET=$CLIENT_SECRET \
-p AZURE_STORAGE_ACCOUNT=$AZURE_STORAGE_ACCOUNT \
-p AZURE_STORAGE_KEY=$AZURE_STORAGE_KEY \
-p CONTAINER_NAME=$CONTAINER_NAME
Use the following command to view the status of this plan from the command line:
dcos cassandra --name=<service-name> plan status restore-azure
restore-azure (IN_PROGRESS)
├─ fetch-azure (COMPLETE)
│ ├─ node-0:[fetch-azure] (COMPLETE)
│ ├─ node-1:[fetch-azure] (COMPLETE)
│ └─ node-2:[fetch-azure] (COMPLETE)
├─ restore-schema (COMPLETE)
│ ├─ node-0:[restore-schema] (COMPLETE)
│ ├─ node-1:[restore-schema] (COMPLETE)
│ └─ node-2:[restore-schema] (COMPLETE)
└─ restore-snapshots (IN_PROGRESS)
├─ node-0:[restore-snapshot] (COMPLETE)
├─ node-1:[restore-snapshot] (STARTING)
└─ node-2:[restore-snapshot] (PENDING)
The above plan start
and plan status
commands may also be made directly to the service over HTTP. Run the above commands with an additional -v
flag to see the queries involved.