Release notes for DC/OS Storage Service version 1.0.0
- This version of the DC/OS Storage Service is considered Generally Available and all users are strongly encouraged to upgrade from previous versions.
- This version of the DC/OS Storage Service requires DC/OS Enterprise version 1.13.2 or later.
New Features
- The `devices` volume provider now blacklists descendants of blacklisted devices by default. To override this default, explicitly blacklist a device using the `blacklist-exactly` configuration option.
- The `devices` and `lvm` volume providers now emit metrics. For more information see New Metrics.
- DSS can manage storage providers and volumes on agents that also advertise GPU resources.
- Operators can scrub volume removal operations that will never complete due to interrupted `DESTROY_DISK` operations.
- Operators can scrub local volumes and local volume providers that DSS reports as `MISSING`.
- Volume remove operations can be canceled. If no Mesos operations have been issued to remove the volume, you can cancel the removal request.
- Operators can more easily remove failing providers from a node.
- The `dcos storage volume create` command accepts create parameters via JSON file or `stdin`.
- The `dcos storage ...` commands accept a `-v` flag to toggle verbose logging.
Updates
- Additional logging of API requests and responses.
- Enforce uniqueness of device provider names.
- More robust enforcement of non-overlapping devices among multiple `lvm` volume providers.
- Device provider creation validates that the target node is known to DSS.
- Prevent volume lifecycle operations when the parent provider is being modified, or is otherwise not ready.
- Prevent provider modifications when that provider has an in-progress volume operation.
- Removed permissions that are no longer needed by the storage principal (related to marathon, package, storage service).
- DSS running on permissive mode clusters requires storage principal configuration.
- DSS running on strict mode clusters requires `enforce-authorization` to be enabled.
- `dcos storage ... list` commands display results in sorted order.
- The `dcos storage provider list` table header `STATE` is now called `STATUS` (for consistency).
- Removed the `--all` flag from `dcos storage provider list`.
- The `--timeout` flag sets a timeout after which the CLI will abort its operation instead of relying on the server to time out the operation. The CLI will keep retrying internally until the timeout is hit or a non-timeout error or success is achieved.
- Removed the previously deprecated “Artifacts Container” installation method.
- Secondary DSS instances will refuse to start if a primary instance is already running.
- Actively monitor Mesos heartbeats to DSS and trigger re-connection as needed.
- The DSS package includes a `LICENSES` file that contains copies of all OSS licenses.
- Service bug fixes, performance fixes, security fixes, as well as other doc fixes and improvements.
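The `--timeout` behavior described in the list above (retry on timeouts, stop on success, on any other error, or when the deadline passes) can be sketched as follows. This is only an illustration of the pattern, not the CLI's actual implementation: `TransientTimeout` and `call_with_deadline` are hypothetical names introduced here.

```python
import time


class TransientTimeout(Exception):
    """Hypothetical marker for a server-side timeout that is safe to retry."""


def call_with_deadline(operation, timeout_s, pause_s=0.5,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry `operation` until it succeeds, raises a non-timeout error,
    or the client-side deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            return operation()  # success ends the loop
        except TransientTimeout:
            # Only timeouts are retried; any other exception propagates.
            if clock() + pause_s >= deadline:
                raise TimeoutError(f"gave up after {timeout_s} seconds")
            sleep(pause_s)
```

The deadline is enforced by the caller, so the loop ends even if the server never reports a timeout of its own.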
Limitations
- Only local volume storage is currently supported.
- Only manual upgrades of a running DC/OS Storage Service on an existing cluster are supported at this time.
- Volume size must be a multiple of 4MiB, which is the default size of an LVM extent. Otherwise, DSS will reply with an error when attempting to create the volume.
- When planning to manually remove a logical volume via `lvremove`, the operator is responsible for zeroing the volume prior to removal.
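The 4 MiB size constraint above is easy to check before submitting a request. The sketch below validates a requested size and rounds it up to the next extent boundary; the helper names are hypothetical, and it assumes sizes are expressed in MiB and that the default 4 MiB extent size is in use.

```python
EXTENT_MIB = 4  # default LVM extent size, as stated in the limitation above


def is_valid_volume_size(size_mib: int) -> bool:
    """Return True if the size is a positive multiple of the extent size."""
    return size_mib > 0 and size_mib % EXTENT_MIB == 0


def round_up_to_extent(size_mib: int) -> int:
    """Round a requested size up to the next multiple of the extent size."""
    if size_mib <= 0:
        raise ValueError("volume size must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-size_mib // EXTENT_MIB) * EXTENT_MIB
```

For example, a request for 10 MiB is not valid as given, and `round_up_to_extent(10)` returns 12.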
Known Issues
- In the event of an unexpected device and/or volume change on an agent, you must restart the agent for the `devices` and `lvm` providers to reconcile the condition. For example, if you add or remove devices, restart the agent to update the `devices` volume provider with the changes.
- `dcos storage` CLI subcommands may fail with a gateway timeout error, but still complete successfully in the background.
- The Mesos SLRP implementation is not yet compatible with multiple profiles that consume capacity from the same provider in different ratios (for example, RAID1 and linear). To work around this, create multiple providers, each of which is wholly dedicated to linear or RAID1.
- The storage service should only list providers that it currently manages; incompletely removed providers may be incorrectly listed in some cases.
- Deleting a volume may fail with “Cannot allocate memory” on some versions of CoreOS. To avoid this issue, ensure you are using a supported version of CoreOS.
- Kernels from 3.10.0-862.6.3.el7 through 3.10.0-862.11.6.el7 (inclusive) may panic as a result of LVM operations (https://access.redhat.com/solutions/3520511).
- The DC/OS installer may issue one or more WARNING messages regarding missing kernel modules:
  Checking if kernel module raid1 is loaded: WARNING Kernel module raid1 is not loaded. DC/OS Storage Service (DSS) depends on it.
  Checking if kernel module dm_raid is loaded: WARNING Kernel module dm_raid is not loaded. DC/OS Storage Service (DSS) depends on it.
  To resolve the issue, configure the `raid1` and `dm_raid` kernel modules to load at OS boot time.
- Using NVMe storage with DSS may require additional modifications to the underlying OS. For more information see these suggested commands and helper scripts.
- The device names (e.g. `sda`) used to create volume providers can be unstable over time; take precautions to avoid this condition.
- The DC/OS UI shows an incorrect unit for DC/OS Storage volume size in the service create modal: the value is treated as MiB, not GiB as stated in the UI.
- The DC/OS cluster’s reported total disk resources are inflated due to double-counting of DSS devices.
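To check whether a node falls in the affected kernel range listed above, the release string can be compared component by component. The following is a minimal sketch: the function names are hypothetical, and it assumes release strings of the usual EL7 form, such as `3.10.0-862.9.1.el7.x86_64`. Anything that does not match that form is reported as not affected.

```python
import re

# Inclusive bounds taken from the known issue above.
LOW = (3, 10, 0, 862, 6, 3)
HIGH = (3, 10, 0, 862, 11, 6)

# Matches e.g. "3.10.0-862.9.1.el7.x86_64" (assumed EL7 release format).
_EL7_RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)\.el7")


def parse_el7_kernel(release):
    """Return the numeric components of an EL7 kernel release, or None."""
    match = _EL7_RELEASE.match(release)
    if match is None:
        return None
    base = tuple(int(match.group(i)) for i in (1, 2, 3))
    build = tuple(int(part) for part in match.group(4).split("."))
    return base + build


def is_affected(release):
    """Return True if the release lies within the affected range."""
    version = parse_el7_kernel(release)
    return version is not None and LOW <= version <= HIGH
```

On a node, `is_affected(platform.release())` (using the standard `platform` module) applies the check to the running kernel.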
New Metrics
All metrics related to the DC/OS Storage Service have a prefix of `csidevices_`, `csilvm_`, or `dss_`.
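Because the three prefixes are fixed, a metric name can be attributed to its component with a simple prefix match. The sketch below shows one way to do that; the mapping labels and the function name are illustrative choices, not part of DSS.

```python
# Prefixes from the sentence above; the labels are illustrative.
PREFIXES = {
    "csidevices_": "devices provider",
    "csilvm_": "lvm provider",
    "dss_": "DSS",
}


def component_for(metric_name):
    """Return the component label for a metric name, or None if unrelated."""
    for prefix, component in PREFIXES.items():
        if metric_name.startswith(prefix):
            return component
    return None
```

A filter built on `component_for` can separate storage metrics from everything else a cluster emits.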
`devices` provider metrics
New
- `csidevices_uptime`: the uptime (in seconds) of the process
- `csidevices_requests`: number of requests served, tagged by:
  - `result_type`: one of `success`, `error`
  - `method`: the RPC name, e.g., `/csi.v0.Controller/ListVolumes`
- `csidevices_requests_latency_(stddev,mean,lower,count,sum,upper)`: the request duration (in milliseconds), tagged by:
  - `method`: the RPC name, e.g., `/csi.v0.Controller/ListVolumes`
- `csidevices_devices`: the number of devices reported by ListVolumes
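The latency entry above uses a compact notation with a parenthesized list of suffixes. Assuming that notation stands for one metric per suffix (an interpretation, not something these notes state explicitly), the full names can be expanded like this; `expand_metric_family` is a hypothetical helper.

```python
import re


def expand_metric_family(pattern):
    """Expand 'name_(a,b,c)' into ['name_a', 'name_b', 'name_c'].

    Names without a parenthesized suffix list are returned unchanged.
    """
    match = re.fullmatch(r"(\w+)\(([\w,]+)\)", pattern)
    if match is None:
        return [pattern]
    prefix, suffixes = match.groups()
    return [prefix + suffix for suffix in suffixes.split(",")]
```

The same helper applies to the `csilvm_requests_latency_(...)` entry, which uses the identical notation.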
`lvm` volume provider metrics
New
- `csilvm_uptime`: the uptime (in seconds) of the process
- `csilvm_requests`: number of requests served, tagged by:
  - `result_type`: one of `success`, `error`
  - `method`: the RPC name, e.g., `/csi.v0.Controller/CreateVolume`
- `csilvm_requests_latency_(stddev,mean,lower,count,sum,upper)`: the request duration (in milliseconds), tagged by:
  - `method`: the RPC name, e.g., `/csi.v0.Controller/CreateVolume`
- `csilvm_volumes`: the number of active logical volumes
- `csilvm_bytes_total`: the total number of bytes in the volume group
- `csilvm_bytes_free`: the number of bytes available for creating a linear logical volume
- `csilvm_bytes_used`: the number of bytes allocated to active logical volumes
- `csilvm_pvs`: the number of physical volumes (PVs) in the volume group
- `csilvm_missing_pvs`: the number of PVs given on the command line but not found in the volume group
- `csilvm_unexpected_pvs`: the number of PVs not given on the command line but found in the volume group
- `csilvm_lookup_pv_errs`: the number of errors encountered while looking for PVs specified on the command line
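As an illustration of how the `lvm` provider metrics above might be combined, the sketch below derives a usage fraction and a PV mismatch flag from a single sample. It assumes the sample has already been collected into a plain dictionary keyed by metric name; how the metrics are scraped is out of scope, and `summarize_volume_group` is a hypothetical helper.

```python
def summarize_volume_group(sample):
    """Derive simple health indicators from one sample of csilvm_* metrics.

    `sample` is assumed to be a dict mapping metric names to numbers.
    """
    total = sample["csilvm_bytes_total"]
    used = sample["csilvm_bytes_used"]
    return {
        # Fraction of the volume group allocated to active logical volumes.
        "used_fraction": used / total if total > 0 else 0.0,
        # True when the PVs in the group differ from those configured.
        "pv_mismatch": (sample.get("csilvm_missing_pvs", 0) > 0
                        or sample.get("csilvm_unexpected_pvs", 0) > 0),
    }
```

Note that `csilvm_bytes_free` counts only the space available for a linear logical volume, so free and used bytes need not add up to the total.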
New DSS metrics
- `dss_agent_lookups_hits`: number of successful agent address lookups (via cache)
- `dss_agent_lookups_misses`: number of failed agent address lookups (via cache)
- `dss_mesosclient_master_getAgents_shared`: count of coalesced API calls
- `dss_obj_providers_missing`: number of `MISSING` providers
- `dss_obj_volumes_missing`: number of `MISSING` volumes
- `dss_ops_providers_create`: duration of provider create operations
- `dss_ops_providers_modify`: duration of provider modify operations
- `dss_ops_providers_remove`: duration of provider remove operations
- `dss_ops_volumes_create`: duration of volume create operations
- `dss_ops_volumes_remove`: duration of volume remove operations
- `dss_sched_hb_disabled`: non-zero if the scheduler is subscribed to Mesos without heartbeats enabled
- `dss_sched_hb_missed`: number of missed Mesos heartbeats
- `dss_sched_hb_missed2Many`: number of times that consecutively missed Mesos heartbeats triggered a reconnection to Mesos