Learn how to configure automatic cleanup of completed and idle workloads created by Kaptain components or the Kaptain SDK.
Prerequisites
- A provisioned Konvoy cluster running Konvoy v1.7.0 or later.
Automatically clean up idle Notebooks
Kaptain Notebooks are the primary interface for end users to interact with the platform. A notebook is a long-running application deployed as a Kubernetes StatefulSet with an attached volume for persisting your working directory. Although notebooks are not meant to be garbage collected, they can reserve significant amounts of cluster resources to run local (in-notebook) training. Once the training is complete, those resources remain unavailable to other workloads until the notebook is cleaned up.
The Notebook Controller provides functionality called “notebook culling” which can scale down idle notebooks. Scaling down notebooks frees up the resources allocated to those notebooks and makes them available to other workloads.
Jupyter Notebook exposes an endpoint reporting the last activity within a notebook; if the last activity exceeds the configured limit, the controller scales the underlying StatefulSet to 0 replicas.
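The culling check itself reduces to comparing the reported last-activity timestamp against a configured idle threshold. The following minimal Python sketch illustrates that decision; the function name and parameters are hypothetical and not part of the Notebook Controller's actual code:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def should_cull(last_activity: datetime, idle_timeout: timedelta,
                now: Optional[datetime] = None) -> bool:
    """Return True if the notebook has been idle longer than idle_timeout."""
    now = now or datetime.now(timezone.utc)
    return now - last_activity > idle_timeout

# A notebook last active 90 minutes ago exceeds a 60-minute idle limit.
last = datetime.now(timezone.utc) - timedelta(minutes=90)
print(should_cull(last, timedelta(minutes=60)))  # True
```

When the analogous check in the controller finds a notebook idle past the limit, it scales the underlying StatefulSet to 0 replicas.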
When a notebook is up and running, it is displayed as active in the UI and has one replica in the StatefulSet:
kubectl get notebooks.kubeflow.org && kubectl get statefulsets
NAME       AGE
notebook   9m36s

NAME       READY   AGE
notebook   1/1     9m37s
After the notebook has idled longer than the specified culling time, it is scaled down:
kubectl get notebooks.kubeflow.org && kubectl get statefulsets
NAME       AGE
notebook   12m

NAME       READY   AGE
notebook   0/0     13m
You can resume the notebook from the UI later, and the corresponding StatefulSet is scaled back to 1 replica if there are sufficient resources on the cluster. The workspace volume is automatically attached to the resumed notebook.
The notebook culling feature is disabled by default. To enable it, set the notebookEnableCulling parameter to true during installation, or update an existing Kaptain instance with the following command:
kubectl kudo update --instance kaptain --namespace kubeflow -p notebookEnableCulling=true
See the Configuration Reference for additional parameters for this functionality.
Automatic cleanup of completed Pipeline Runs (Workflows)
Overview
Kubeflow Pipelines rely on Argo Workflows for running workloads. Starting with Kaptain 1.1, Kubeflow Pipelines schedule the workflows in the user namespace, providing better multi-tenant isolation and workload locality. Once all the steps in the pipeline are complete, the Pods corresponding to the pipeline terminate, but the Argo Workflow custom resources (workflow.argoproj.io) remain in the namespace:
kubectl get workflows.argoproj.io
NAME                                          STATUS      AGE
data-passing-btwn-componefjdf8-1-3068851699   Running     17s
dsl-control-structures-rugqkrh-1-2276733026   Succeeded   111s
dsl-control-structures-rugqkrh-2-2259955407   Succeeded   51s
end-to-end-mnist-pipeline-mnrr6
Each step of a pipeline runs as a Pod, and pipeline Pods are not deleted as long as the Workflow that created them is present. Without cleanup, the namespace can fill up with completed Pods:
kubectl get pods -l workflows.argoproj.io/workflow=dsl-control-structures-rugqkrh-1-2276733026
NAME                                                     READY   STATUS      RESTARTS   AGE
dsl-control-structures-rugqkrh-1-2276733026-2018045073   0/2     Completed   0          4m11s
dsl-control-structures-rugqkrh-1-2276733026-2405487652   0/2     Completed   0          3m40s
dsl-control-structures-rugqkrh-1-2276733026-3461867059   0/2     Completed   0          3m51s
dsl-control-structures-rugqkrh-1-2276733026-4042755208   0/2     Completed   0          4m1s
Using Python DSL for setting Pipeline TTL
Kubeflow Pipelines provide a Python Domain Specific Language (DSL) that allows you to specify a time-to-live (TTL) for the submitted Pipeline. Here is an excerpt from the Pipeline tutorial:
@dsl.pipeline(
name="End-to-End MNIST Pipeline",
description="A sample pipeline to demonstrate multi-step model training, evaluation, export, and serving",
)
def mnist_pipeline(
input_bucket: str = "tutorial",
...
):
train_and_serve(
input_bucket=input_bucket,
...
)
...
# TTL for the workflow to persist after completion (1 hour)
dsl.get_pipeline_conf().set_ttl_seconds_after_finished(60 * 60)
This setting populates the ttlSecondsAfterFinished property in the Argo Workflow definition, which specifies how long the workflow persists after completion before the dedicated controller cleans it up.
Setting global TTL for completed Pipelines
Kaptain has a global configuration property that sets the default TTL for all created Pipelines; Workflow objects are deleted after the specified amount of time. The default is 24 hours. Note that this property only affects the Pipeline API Server; the Argo Workflow controller does not read it. Notebook users can still set a TTL via the DSL, but they cannot extend it beyond the global setting: the Pipeline component always applies the smaller of the two TTL values, the DSL config and the global property. Because Workflow objects are useful for debugging, choose a conservative value for the global property.
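The rule can be summarized as: the effective TTL is the minimum of the DSL-specified value and the global default. A small illustrative Python sketch (the function is hypothetical, not part of the Kaptain or Kubeflow Pipelines API):

```python
from typing import Optional

GLOBAL_TTL_SECONDS = 24 * 60 * 60  # the default global TTL (24 hours)

def effective_ttl(dsl_ttl: Optional[int],
                  global_ttl: int = GLOBAL_TTL_SECONDS) -> int:
    """TTL the Pipeline API Server applies to a submitted Workflow.

    With no DSL-level TTL, the global default wins; otherwise the
    smaller of the two values is applied.
    """
    if dsl_ttl is None:
        return global_ttl
    return min(dsl_ttl, global_ttl)

print(effective_ttl(60 * 60))        # 3600: the DSL value is below the cap
print(effective_ttl(7 * 24 * 3600))  # 86400: capped by the global default
```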
To set the default TTL for all Pipelines, install or update the Kaptain instance with the following parameter:
kubectl kudo update --instance kaptain --namespace kubeflow -p workflowsTTLSecondsAfterFinish="<ttl seconds>"
Automatic cleanup for resources created by KFServing
Overview
KFServing serves models over HTTP(S) using the Knative Serving component. When a model is deployed for serving, KFServing creates a set of Knative resources such as Service, Route, and Revision. There is always one Knative Service per model deployment; however, the number of Revisions grows over time because every new deployment (a new model version with a new image name) gets its own Revision. When a new Revision is deployed, the Deployment associated with the older Revision is scaled to zero replicas, but it is not deleted. Over time, the number of Revisions and associated Deployments can grow significantly; to avoid this overhead, it is recommended to garbage collect outdated Revisions.
For example:
$> kubectl get revisions
NAME                        CONFIG NAME           K8S SERVICE NAME            GENERATION   READY
dev-mnist-predictor-c5kzr   dev-mnist-predictor   dev-mnist-predictor-c5kzr   1            True
dev-mnist-predictor-d6tdr   dev-mnist-predictor   dev-mnist-predictor-d6tdr   2            True
dev-mnist-predictor-tqzqw   dev-mnist-predictor   dev-mnist-predictor-tqzqw   3            True
$> kubectl get deployments
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
dev-mnist-predictor-c5kzr-deployment   0/0     0            0           33m
dev-mnist-predictor-d6tdr-deployment   0/0     0            0           18m
dev-mnist-predictor-tqzqw-deployment   1/1     1            1           5m53s
Configure Knative addon cleanup
KFServing itself does not provide controls for garbage collecting stale Revisions; however, the underlying Knative addon that ships with Kaptain exposes a set of parameters that control it:
| Parameter | Default | Description |
|---|---|---|
| minNonActiveRevisions | 20 | Minimum number of non-active revisions to retain. If the number of revisions for a service is below this value, cleanup is not triggered, regardless of the other settings. |
| retainSinceCreateTime | 48h | Duration since a revision was created before it is considered for cleanup. The revision must be non-active to be considered. |
| retainSinceLastActiveTime | 15h | Duration since a revision was last active before it is considered for cleanup. An active revision is one that has service network traffic routed to it. |
| maxNonActiveRevisions | 1000 | Maximum number of non-active revisions to retain. Once this number is reached, the oldest non-active revision is deleted, regardless of the other settings. |
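Taken together, the four parameters gate revision deletion roughly as follows. The Python sketch below only approximates the decision for illustration; Knative's real garbage collector lives in the serving controller and is not implemented this way:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def eligible_for_gc(active: bool,
                    created: datetime,
                    last_active: datetime,
                    non_active_count: int,
                    min_non_active: int = 20,
                    max_non_active: int = 1000,
                    retain_since_create: timedelta = timedelta(hours=48),
                    retain_since_last_active: timedelta = timedelta(hours=15),
                    now: Optional[datetime] = None) -> bool:
    """Approximate per-revision GC eligibility under the settings above."""
    now = now or datetime.now(timezone.utc)
    if active:
        return False  # revisions receiving traffic are never collected
    if non_active_count <= min_non_active:
        return False  # below the retention floor: keep everything
    if non_active_count > max_non_active:
        return True   # over the hard cap: oldest revisions go regardless
    # Otherwise the revision must be both old enough and idle long enough.
    return (now - created > retain_since_create
            and now - last_active > retain_since_last_active)
```

With the defaults, a non-active revision is deleted only once its service has more than 20 non-active revisions and the revision is both older than 48 hours and inactive for more than 15 hours.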
Update Knative addon configuration
To specify or update the Knative addon configuration, edit the addon entry in the cluster.yaml file and set the values for the garbage collection settings:
- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
configVersion: stable-1.20-1.3.0
addonsList:
- name: knative
enabled: true
values: |
serving:
gc:
retainSinceCreateTime: "48h"
retainSinceLastActiveTime: "15h"
minNonActiveRevisions: "20"
maxNonActiveRevisions: "1000"
After updating the settings, run konvoy deploy addons to apply the changes.
Example configurations
If you only need to keep the latest revision of each model, use the following settings:
- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
configVersion: stable-1.20-1.3.0
addonsList:
- name: knative
enabled: true
values: |
serving:
gc:
minNonActiveRevisions: "0"
retainSinceCreateTime: "1s"
retainSinceLastActiveTime: "1s"
An example configuration that retains the last ten non-active revisions:
- configRepository: https://github.com/mesosphere/kubeaddons-kaptain
configVersion: stable-1.20-1.3.0
addonsList:
- name: knative
enabled: true
values: |
serving:
gc:
minNonActiveRevisions: "10"
retainSinceCreateTime: "1s"
retainSinceLastActiveTime: "1s"