Kaptain and KUDO
To print the status of the Kaptain operator instance:
kubectl kudo plan status -n kubeflow --instance kaptain
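If the deployment plan appears stuck, its history can show when it last progressed. This is a sketch assuming a KUDO CLI version that provides the history subcommand:
kubectl kudo plan history -n kubeflow --instance kaptain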
To show deployments and pods in the Kaptain operator instance:
kubectl get deployments -n kubeflow
kubectl get pods -n kubeflow
kubectl describe pod <pod_name> -n kubeflow
To print the logs from the KUDO controller:
kubectl logs -n kudo-system kudo-controller-manager-0 -f
Konvoy
To create the Konvoy diagnostics bundle, use:
konvoy diagnose --logs-all-namespaces --yes
Afterwards, consult the Konvoy troubleshooting documentation for further techniques.
ML workloads
TFJobs
List all TFJob runs in a user namespace:
kubectl get tfjobs -n <namespace>
Get the details of a TFJob run:
kubectl describe tfjob <job_name> -n <namespace>
List a TFJob’s pods:
kubectl get pods -l job-name=<job_name> -n <namespace>
Print the logs from a TFJob’s pods:
kubectl logs -l job-name=<job_name> --prefix=true -n <namespace>
List the Kubernetes Events associated with a TFJob:
kubectl get events --field-selector involvedObject.kind=TFJob,involvedObject.name=<job_name> -n <namespace>
Delete TFJob:
kubectl delete tfjob <job_name> -n <namespace>
PyTorchJobs
List all PyTorchJob runs in a user namespace:
kubectl get pytorchjob -n <namespace>
Get the details of a PyTorchJob run:
kubectl describe pytorchjob <job_name> -n <namespace>
List a PyTorchJob’s pods:
kubectl get pods -l pytorch-job-name=<job_name> -n <namespace>
Print the logs from a PyTorchJob’s pods:
kubectl logs -l pytorch-job-name=<job_name> --prefix=true -n <namespace>
List the Kubernetes Events associated with a PyTorchJob:
kubectl get events --field-selector involvedObject.kind=PyTorchJob,involvedObject.name=<job_name> -n <namespace>
Delete PyTorchJob:
kubectl delete pytorchjob <job_name> -n <namespace>
Experiments
List all Experiment runs in a user namespace:
kubectl get experiments -n <namespace>
Get the details of an Experiment:
kubectl describe experiment <experiment_name> -n <namespace>
Get the Experiment’s trials:
kubectl get trials -l experiment=<experiment_name> -n <namespace>
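To dig into an individual trial, for example when an Experiment is not progressing, it can be described like any other resource:
kubectl describe trial <trial_name> -n <namespace>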
Inference Services
Get the details of an Inference Service:
kubectl describe inferenceservice <inference_service_name> -n <namespace>
Get the Inference Service’s pods:
kubectl get pods -l serving.kubeflow.org/inferenceservice=<inference_service_name> -n <namespace>
Print the logs from the Inference Service’s pods:
kubectl logs -l serving.kubeflow.org/inferenceservice=<inference_service_name> --prefix=true -c storage-initializer -n <namespace>
kubectl logs -l serving.kubeflow.org/inferenceservice=<inference_service_name> --prefix=true -c kfserving-container -n <namespace>
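Before digging into individual containers, it can save time to check the overall readiness of the service first; the READY and URL columns should be populated once the predictor is available:
kubectl get inferenceservice <inference_service_name> -n <namespace>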
To find Knative revisions that are no longer being used, run:
kubectl get revisions -l serving.kubeflow.org/inferenceservice=<inference_service_name>,serving.knative.dev/routingState=reserve -n <namespace>
You can clean up revisions with:
kubectl delete revision <revision_name> -n <namespace>
This also removes the deployments associated with those revisions, which have already been scaled down to zero.
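To remove all reserved revisions of an Inference Service in one command, the two selectors shown above can be combined; this is a sketch using the same labels:
kubectl delete revisions -l serving.kubeflow.org/inferenceservice=<inference_service_name>,serving.knative.dev/routingState=reserve -n <namespace>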
Kubeflow Pipelines
List all pipeline runs in a user namespace:
kubectl get workflows.argoproj.io -n <namespace>
Print the logs from all pipeline steps:
kubectl logs -l workflows.argoproj.io/workflow=<workflow_name> -c main --prefix=true -n <namespace>
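To see the status of each step of a single run, describe the underlying Argo workflow; its Status section lists the individual nodes along with any step-level error messages:
kubectl describe workflows.argoproj.io <workflow_name> -n <namespace>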
Delete all completed pipeline runs:
kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true -n <namespace>
Delete all pipeline runs with a given final status (Succeeded or Failed):
kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true,workflows.argoproj.io/phase=Succeeded -n <namespace>
Common issues
Most problems can be identified by checking the following, going from a high-level description to low-level details:
- the description of the workload (TFJob, PyTorchJob, InferenceService, etc.)
- the Events associated with the workload
- the Status of the workload
- the pod logs
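As a quick first pass over a namespace, listing recent events sorted by timestamp often surfaces the problem immediately:
kubectl get events --sort-by=.lastTimestamp -n <namespace>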
Errors in job code
Check if the job is running:
$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Running   6m47s
Check the pod status of the job:
$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS             RESTARTS   AGE
tfjob-sample-chief-0    0/1     CrashLoopBackOff   6          7m10s
tfjob-sample-worker-0   0/1     CrashLoopBackOff   6          7m10s
tfjob-sample-worker-1   0/1     CrashLoopBackOff   6          7m10s
In this example, the job is in the Running state, but the pods are in the Error or CrashLoopBackOff state.
Start with the describe command, which provides a high-level view of the workload. The Status and Events fields give an overview of the current state of the workload along with the associated cluster events.
$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
...
Status:
  Conditions:
    Last Transition Time:  2021-11-14T22:08:40Z
    Last Update Time:      2021-11-14T22:08:40Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-11-14T22:08:49Z
    Last Update Time:      2021-11-14T22:08:49Z
    Message:               TFJob mynamespace/tfjob-sample is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Chief:
      Active:  1
    Worker:
      Active:  2
  Start Time:  2021-11-14T22:08:41Z
Events:
  Type     Reason                   Age                   From         Message
  ----     ------                   ----                  ----         -------
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-chief-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-1
  Warning  Error                    14m (x2 over 14m)     tf-operator  Error pod tfjob-sample-worker-1 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-1 exited with code 1
  Warning  Error                    14m (x5 over 14m)     tf-operator  Error pod tfjob-sample-chief-0 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-0 exited with code 1
  Normal   ExitedWithCode           14m (x6 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-chief-0 exited with code 1
  Warning  Error                    9m54s (x13 over 14m)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow exitCode: 1 terminated message:
  Warning  CrashLoopBackOff         118s                  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: back-off 5m0s restarting failed container=tensorflow pod=tfjob-sample-worker-1_mynamespace(13030281-b595-4486-89eb-1025455ec091)
From the events list, it is clear that all pods were created successfully and that the tensorflow container started but terminated with exit code 1:
Events:
  Type     Reason                   Age                   From         Message
  ----     ------                   ----                  ----         -------
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-chief-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      15m                   tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  15m                   tf-operator  Created service: tfjob-sample-worker-1
  Warning  Error                    14m (x2 over 14m)     tf-operator  Error pod tfjob-sample-worker-1 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-1 exited with code 1
  Warning  Error                    14m (x5 over 14m)     tf-operator  Error pod tfjob-sample-chief-0 container tensorflow exitCode: 1 terminated message:
  Normal   ExitedWithCode           14m (x2 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-worker-0 exited with code 1
  Normal   ExitedWithCode           14m (x6 over 14m)     tf-operator  Pod: mynamespace.tfjob-sample-chief-0 exited with code 1
  Warning  Error                    9m54s (x13 over 14m)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow exitCode: 1 terminated message:
  Warning  CrashLoopBackOff         118s                  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: back-off 5m0s restarting failed container=tensorflow pod=tfjob-sample-worker-1_mynamespace(13030281-b595-4486-89eb-1025455ec091)
This implies that scheduling was successful: the scheduler was able to find enough cluster resources to schedule the pods, and the image was pulled successfully.
The next step is to check the logs for the pod:
$ kubectl logs -l job-name=tfjob-sample --prefix=true
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:chief/task:0',)
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-chief-0/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-chief-0/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-chief-0/tensorflow] File "trainer.py", line 131, in <module>
[pod/tfjob-sample-chief-0/tensorflow] main()
[pod/tfjob-sample-chief-0/tensorflow] File "trainer.py", line 54, in main
[pod/tfjob-sample-chief-0/tensorflow] tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-chief-0/tensorflow] InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:0',)
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-worker-0/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-worker-0/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-worker-0/tensorflow] File "trainer.py", line 131, in <module>
[pod/tfjob-sample-worker-0/tensorflow] main()
[pod/tfjob-sample-worker-0/tensorflow] File "trainer.py", line 54, in main
[pod/tfjob-sample-worker-0/tensorflow] tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-worker-0/tensorflow] InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:1',)
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Waiting for the cluster, timeout = inf
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:Cluster is ready.
[pod/tfjob-sample-worker-1/tensorflow] INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = ...
[pod/tfjob-sample-worker-1/tensorflow] Traceback (most recent call last):
[pod/tfjob-sample-worker-1/tensorflow] File "trainer.py", line 131, in <module>
[pod/tfjob-sample-worker-1/tensorflow] main()
[pod/tfjob-sample-worker-1/tensorflow] File "trainer.py", line 54, in main
[pod/tfjob-sample-worker-1/tensorflow] tf.reshape([1, 2, 3], [2, 2])
[pod/tfjob-sample-worker-1/tensorflow] InvalidArgumentError: Input to reshape is a tensor with 3 values, but the requested shape has 4
In the logs above, we can see the source of the issue: an exception was raised while running the trainer code.
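If a pod is restarting, kubectl logs may show only the output of the current, possibly empty, attempt. The logs of the previous failed attempt can be retrieved with the standard --previous flag:
kubectl logs <pod_name> --previous -n <namespace>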
Missing image
Check if the job is running:
$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Running   1m47s
Check the pod status of the job:
$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS             RESTARTS   AGE
tfjob-sample-chief-0    1/1     Running            0          2m32s
tfjob-sample-worker-0   0/1     ImagePullBackOff   0          2m32s
tfjob-sample-worker-1   0/1     ImagePullBackOff   0          2m32s
The worker pods of the job are stuck in the ImagePullBackOff status.
We can get more details by describing the TFJob:
$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
...
API Version:  kubeflow.org/v1
Kind:         TFJob
Spec:
  Tf Replica Specs:
    Chief:
      ...
      Spec:
        Containers:
          ...
          Image:              mesosphere/kubeflow:mnist-sdk-example
          Image Pull Policy:  Always
          Name:               tensorflow
          ...
    Worker:
      Replicas:  2
      ...
      Spec:
        Containers:
          ...
          Image:  mesosphere/kubeflow:mnist-sdk-sample
          Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2021-11-14T23:07:19Z
    Last Update Time:      2021-11-14T23:07:19Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-11-14T23:07:25Z
    Last Update Time:      2021-11-14T23:07:25Z
    Message:               TFJob mynamespace/tfjob-sample is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Chief:
      Active:  1
    Worker:
  Start Time:  2021-11-14T23:07:20Z
Events:
  Type     Reason                   Age                    From         Message
  ----     ------                   ----                   ----         -------
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-worker-1
  Normal   SuccessfulCreatePod      10m                    tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  10m                    tf-operator  Created service: tfjob-sample-chief-0
  Warning  ImagePullBackOff         9m32s (x2 over 9m47s)  tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: Back-off pulling image "mesosphere/kubeflow:mnist-sdk-sample"
  Warning  ErrImagePull             9m31s (x7 over 10m)    tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
  Warning  ErrImagePull             9m21s (x6 over 10m)    tf-operator  Error pod tfjob-sample-worker-1 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
  Warning  ImagePullBackOff         4m2s (x11 over 9m47s)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: Back-off pulling image "mesosphere/kubeflow:mnist-sdk-sample"
In the Events, we can find the message that explains the cause of the issue: the image was not found in the registry:
Events:
  Type     Reason        Age                  From         Message
  ...
  Warning  ErrImagePull  9m31s (x7 over 10m)  tf-operator  Error pod tfjob-sample-worker-0 container tensorflow waiting message: rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/mesosphere/kubeflow:mnist-sdk-sample": failed to resolve reference "docker.io/mesosphere/kubeflow:mnist-sdk-sample": docker.io/mesosphere/kubeflow:mnist-sdk-sample: not found
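As a quick sanity check, you can try pulling the image manually from a machine with access to the registry, assuming Docker is installed there; a typo in the tag, as in this example (sample instead of example), surfaces as a "not found" error:
docker pull mesosphere/kubeflow:mnist-sdk-sample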
Insufficient resources
Check if the job is running:
$ kubectl get tfjob tfjob-sample
NAME           STATE     AGE
tfjob-sample   Created   90s
Check the pod status of the job:
$ kubectl get pods -l job-name=tfjob-sample
NAME                    READY   STATUS    RESTARTS   AGE
tfjob-sample-chief-0    0/1     Pending   0          109s
tfjob-sample-worker-0   0/1     Pending   0          109s
tfjob-sample-worker-1   0/1     Pending   0          109s
All of the pods are in the Pending state.
$ kubectl describe tfjob tfjob-sample
Name:         tfjob-sample
Namespace:    mynamespace
API Version:  kubeflow.org/v1
Kind:         TFJob
...
Spec:
  Tf Replica Specs:
    Chief:
      Replicas:  1
      ...
      Spec:
        Containers:
          ...
          Image:  mesosphere/kubeflow:mnist-sdk-example
          Name:   tensorflow
          Resources:
            Requests:
              Memory:  128G
          ...
    Worker:
      Replicas:  2
      ...
      Spec:
        Containers:
          ...
          Image:  mesosphere/kubeflow:mnist-sdk-example
          Name:   tensorflow
          Resources:
            Requests:
              Memory:  128G
Status:
  Conditions:
    Last Transition Time:  2021-11-14T23:26:56Z
    Last Update Time:      2021-11-14T23:26:56Z
    Message:               TFJob tfjob-sample is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Chief:
    Worker:
  Start Time:  2021-11-14T23:26:57Z
Events:
  Type     Reason                   Age    From         Message
  ----     ------                   ----   ----         -------
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-chief-0
  Normal   SuccessfulCreateService  2m20s  tf-operator  Created service: tfjob-sample-chief-0
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-worker-0
  Normal   SuccessfulCreatePod      2m20s  tf-operator  Created pod: tfjob-sample-worker-1
  Normal   SuccessfulCreateService  2m19s  tf-operator  Created service: tfjob-sample-worker-0
  Normal   SuccessfulCreateService  2m19s  tf-operator  Created service: tfjob-sample-worker-1
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-1 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-chief-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable            2m19s  tf-operator  Error pod tfjob-sample-worker-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
In the Events, we can find the message that explains the cause of the issue: there is insufficient memory to schedule the workload:
Events:
  Type     Reason         Age    From         Message
  ...
  Warning  Unschedulable  2m19s  tf-operator  Error pod tfjob-sample-worker-1 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable  2m19s  tf-operator  Error pod tfjob-sample-chief-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
  Warning  Unschedulable  2m19s  tf-operator  Error pod tfjob-sample-worker-0 condition message: 0/6 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 5 Insufficient memory.
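To see how the requested 128G compares with what the cluster can actually offer, each node's allocatable memory can be listed with a standard custom-columns query:
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory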
Limitations
Kubeflow Pipelines
Kubeflow Pipelines steps can fail if the main container exits too quickly and the Argo sidecar fails to collect the artifacts. This can happen when the container image is not available on a node and needs to be pulled from the registry first. Retry the pipeline run, or pre-pull the container image to the relevant nodes.
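One common way to pre-pull an image on every node is a small DaemonSet that pulls it in an init container and then idles. The sketch below is not part of Kaptain; the DaemonSet name and <pipeline_step_image> are placeholders, and the init container assumes the image provides a shell:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-pipeline-image
spec:
  selector:
    matchLabels:
      app: prepull-pipeline-image
  template:
    metadata:
      labels:
        app: prepull-pipeline-image
    spec:
      initContainers:
      # Pulls the pipeline step image onto the node, then exits immediately.
      - name: prepull
        image: <pipeline_step_image>
        command: ["sh", "-c", "exit 0"]
      containers:
      # Minimal long-running container so the DaemonSet pod stays healthy.
      - name: pause
        image: k8s.gcr.io/pause:3.2
Once all DaemonSet pods are running, the image is cached on every node and the DaemonSet can be deleted.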
Using Kubeflow Fairing with Private Docker Registries
Kubeflow Fairing does not currently support Docker registries using self-signed TLS certificates, certificate chaining, or insecure (plaintext HTTP) registries. It is recommended to use the Kaptain SDK for building and pushing Docker images as part of the model development process.
Spark and Horovod
Running Spark, or Horovod on Spark, in client mode from a notebook with Istio enabled is not supported. It is recommended to use the Spark Operator to run Spark applications.
Pocket Chrome Extension
Users who have the Google Chrome extension for Pocket installed may not be able to see large portions of the Kaptain UI. Disable the Pocket extension to ensure the Kaptain UI is completely visible.
Component Versions
Kaptain includes:
- Kubeflow 1.4.0
- Notebook controller 1.4.0
- Argo Workflows 3.1.6
- Katib 0.12.0
- KFServing 0.6.1
- Percona Kubernetes Operator 1.10.0
- Kubeflow Pipelines 1.7.0
- Training Operator 1.3.0
- MinIO Operator 4.0.3
- MinIO RELEASE.2021-03-01T04-20-55Z
- kubectl 1.21
- Kudobuilder 0.19.0
- Kaniko 1.3.0
- TensorFlow Serving 1.14.0
- ONNX server 0.5.1
- Nvidia TensorRT server 19.05
- Knative 20200410
- TFX MLMD Store Server 0.21.1
Python libraries (excluding transitive dependencies):
- Miniconda 4.10.3
- JupyterLab 3.0.16
- Kaptain SDK 1.0.0
- Kubernetes SDK 18.20.0
- ML Metadata 0.22.0
- Kubeflow Pipelines 1.8.1
- TensorFlow 2.5.0
- PyTorch 1.7.1
- MXNet 1.9.0
- Horovod 0.22.0
- CUDA 11.2
- Matplotlib 3.2.1
- Papermill 2.0.0
- Open MPI
- gensim
- future
- h5py
- Keras
- NLTK
- NumPy
- Pandas
- SciPy
- scikit-learn
- Seaborn
- spaCy
- statsmodels
- typing
- boto3
- ipywidgets
- NodeJS
- Plotly
- Toree