Kaptain and KUDO
To print the status of the Kaptain operator instance:
kubectl kudo plan status -n kubeflow --instance kaptain
To show deployments and pods in the Kaptain operator instance:
kubectl get deployments -n kubeflow
kubectl get pods -n kubeflow
kubectl describe pod <pod_name> -n kubeflow
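When the namespace holds many pods, it can help to narrow the list to those that are not healthy before describing them one by one. The following is a minimal sketch, not part of the Kaptain tooling: the function name `unhealthy_pods` is hypothetical, and it relies only on the standard `status.phase` field selector of `kubectl get pods`.

```shell
# Hypothetical helper: list pods that are neither Running nor Succeeded.
# $1: namespace (optional, defaults to the kubeflow namespace used above)
unhealthy_pods() {
  kubectl get pods -n "${1:-kubeflow}" \
    --field-selector=status.phase!=Running,status.phase!=Succeeded
}
```

Pods stuck in a restart loop usually still report the Running phase, so treat this filter as a first pass only and check the RESTARTS column of `kubectl get pods` as well.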
To print the logs from the KUDO controller:
kubectl logs -n kudo-system kudo-controller-manager-0 -f
Kubeflow Pipelines
List all pipeline runs in a user namespace:
kubectl get workflows.argoproj.io -n <namespace>
Print the logs from all pipeline steps:
kubectl logs -l workflows.argoproj.io/workflow=<workflow_name> -c main --prefix=true -n <namespace>
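The two commands above can be combined to collect logs from failed runs only. This is a sketch under stated assumptions: the function name `failed_run_logs` is hypothetical, and the `workflows.argoproj.io/phase` label is assumed to be set on the workflow resources, as in the delete commands in this section.

```shell
# Hypothetical helper: print main-container logs for every failed run in a namespace.
# $1: user namespace (required)
failed_run_logs() {
  [ -n "$1" ] || { echo "usage: failed_run_logs <namespace>" >&2; return 1; }
  kubectl get workflows.argoproj.io -n "$1" \
      -l workflows.argoproj.io/phase=Failed -o name |
  while IFS= read -r workflow; do
    # "-o name" prints <resource>/<name>; keep only the name part.
    name="${workflow##*/}"
    echo "=== ${name} ==="
    # --tail=-1 requests full logs; with a label selector kubectl
    # otherwise returns only the last 10 lines per pod.
    kubectl logs -l "workflows.argoproj.io/workflow=${name}" \
      -c main --prefix=true --tail=-1 -n "$1" </dev/null
  done
}
```

Call it as `failed_run_logs <namespace>`. Logs are only available while the step pods still exist, so runs whose pods were already cleaned up produce no output.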
Delete all completed pipeline runs:
kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true -n <namespace>
Delete all pipeline runs by final status (Succeeded or Failed):
kubectl delete workflows.argoproj.io -l workflows.argoproj.io/completed=true,workflows.argoproj.io/phase=Succeeded -n <namespace>
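To avoid typing the label selector by hand, the deletion by final status can be wrapped in a small function. This is only a sketch: the name `delete_runs_by_phase` is hypothetical, and the labels are the ones used in the commands above, joined with a comma so that both must match.

```shell
# Hypothetical helper: delete completed pipeline runs with a given final status.
# $1: user namespace, $2: final status (Succeeded or Failed)
delete_runs_by_phase() {
  [ -n "$1" ] || { echo "namespace is required" >&2; return 1; }
  case "$2" in
    Succeeded|Failed) ;;
    *) echo "status must be Succeeded or Failed" >&2; return 1 ;;
  esac
  # A comma-separated selector matches only resources that carry every label.
  kubectl delete workflows.argoproj.io -n "$1" \
    -l "workflows.argoproj.io/completed=true,workflows.argoproj.io/phase=$2"
}
```

For example, `delete_runs_by_phase <namespace> Failed` removes only the failed runs and leaves successful ones in place.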
Konvoy
To create the Konvoy diagnostics bundle, use:
konvoy diagnose --logs-all-namespaces --yes
Afterwards, check Konvoy troubleshooting techniques.
Limitations
Kubeflow Pipelines
Kubeflow Pipelines steps can fail if the main container exits too quickly and the Argo sidecar fails to collect artifacts. This can happen when the container image is not available on a node and needs to be pulled from the registry first. To work around this, retry the pipeline run or pre-download the container image to the relevant nodes.
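One possible way to pre-download an image is a temporary DaemonSet that references it in an init container, so that every node pulls it once. The sketch below is not a Kaptain feature and rests on several assumptions: the function name `prepull_manifest` and the DaemonSet name `image-prepull` are hypothetical, the image is assumed to contain a `true` executable, and the pause image tag is an example that may need to be replaced with one available in your registry.

```shell
# Hypothetical helper: print a DaemonSet manifest that pulls an image onto every node.
# $1: container image to pre-download, $2: namespace for the DaemonSet
prepull_manifest() {
  [ -n "$1" ] && [ -n "$2" ] || {
    echo "usage: prepull_manifest <image> <namespace>" >&2
    return 1
  }
  cat <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
  namespace: $2
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      initContainers:
        # Pulls the image and exits immediately (assumes "true" exists in the image).
        - name: prepull
          image: $1
          command: ["true"]
      containers:
        # Keeps the pod alive at negligible cost; the tag is an assumption.
        - name: pause
          image: k8s.gcr.io/pause:3.2
EOF
}
```

Apply it with `prepull_manifest <image> <namespace> | kubectl apply -f -` and delete the DaemonSet once all of its pods are running. Nodes with taints are skipped unless matching tolerations are added to the manifest.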
Using Kubeflow Fairing with Private Docker Registries
Kubeflow Fairing does not currently support Docker registries that use self-signed TLS certificates or certificate chaining, nor insecure (plaintext HTTP) registries. Use the Kaptain SDK instead to build and push Docker images as part of the model development process.
Spark and Horovod
Running Spark, or Horovod on Spark, in client mode from a notebook with Istio enabled is not supported. Use the Spark Operator to run Spark applications instead.
Pocket Chrome Extension
Users who have the Google Chrome extension for Pocket installed may not be able to see large portions of the Kaptain UI. Disable the Pocket extension to ensure the Kaptain UI is completely visible.
Component Versions
Kaptain includes:
- Kubeflow 1.3.0
- Notebook controller 1.3.0
- Argo Workflows 2.12.9
- Katib 0.11.0
- KFServing 0.5.1
- Percona Kubernetes Operator 1.7.0
- Kubeflow Pipelines 1.5.0
- PyTorch Operator 0.7.0
- TensorFlow Operator 1.1.0
- MinIO Operator 4.0.3
- MXNet Operator 1.1.0
- MinIO RELEASE.2021-03-01T04-20-55Z
- kubectl 1.19
- Kudobuilder 0.19.0
- KUDO Spark 3.0.0
- Kaniko 1.3.0
- TensorFlow Serving 1.14.0
- ONNX server 0.5.1
- Nvidia TensorRT server 19.05
- Knative 20200410
- TFX MLMD Store Server 0.21.1
Python libraries (excluding transitive dependencies):
- Miniconda 4.8.2
- JupyterLab 3.0.16
- Kaptain SDK 0.3.0
- kubernetes SDK 10.0.1
- ML Metadata 0.22.0
- Kubeflow Pipelines 1.4.0
- Kubeflow Fairing 1.0.1
- TensorFlow 2.4.0
- PyTorch 1.7.1
- MXNet 1.8.0
- Horovod 0.21.0
- CUDA 11.0
- Matplotlib 3.2.1
- Papermill 2.0.0
- Open MPI
- gensim
- future
- h5py
- Keras
- NLTK
- NumPy
- Pandas
- SciPy
- scikit-learn
- Seaborn
- spaCy
- statsmodels
- typing
- boto3
- ipywidgets
- NodeJS
- Plotly
- Toree