Hyperparameter Tuning with Katib
Introduction
Hyperparameter tuning is the process of optimizing a model’s hyperparameter values in order to maximize the predictive quality of the model. Examples of such hyperparameters are the learning rate, neural architecture depth (layers) and width (nodes), epochs, batch size, dropout rate, and activation functions. These are the parameters that are set prior to training; unlike the model parameters (weights and biases), these do not change during the process of training the model.
Katib automates the process of hyperparameter tuning by running a pre-configured number of training jobs (known as trials) in parallel. Each trial evaluates a different set of hyperparameter configurations. Within each experiment it automatically adjusts the hyperparameters to find their optimal values with regard to the objective function, which is typically the model’s metric (e.g. accuracy, AUC, F1, precision). An experiment therefore consists of an objective, a search space for the hyperparameters, and a search algorithm. At the end of the experiment, Katib outputs the optimized values, which are also known as suggestions.
What You Will Learn
This notebook shows how you can create and configure an Experiment for both TensorFlow and PyTorch training jobs.
In terms of Kubernetes, such an experiment is a custom resource handled by the Katib operator.
What You Need
A Docker image with either a TensorFlow or PyTorch model that accepts hyperparameters as arguments. Click on the links to see such models.
How to Specify Hyperparameters in Your Models
In order for Katib to be able to tweak hyperparameters, it needs to know what these are called in the model.
Beyond that, the model must accept these hyperparameters either as regular (command line) parameters or as environment variables.
Since the model needs to be containerized, any command line parameters or environment variables must be passed to the container that holds your model.
By far the most common, and also the recommended, way is to use command line parameters that are captured with argparse or similar; the trainer (function) then uses their values internally.
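As a minimal, hypothetical sketch (not the code inside the tutorial images), a trainer could capture its hyperparameters with argparse like this; the flag names are what the experiment's search space refers to later:
import argparse

def parse_args():
    # Each command line flag corresponds to one hyperparameter that Katib can tune.
    parser = argparse.ArgumentParser(description="Example trainer")
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--momentum", type=float, default=0.9)
    return parser.parse_args()

args = parse_args()
# The trainer then uses the parsed values internally, e.g. to configure the optimizer.
print(f"Training with learning rate {args.learning_rate} and momentum {args.momentum}")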
How to Expose Model Metrics as Objective Functions
By default, Katib collects metrics from the standard output of a job container by using a sidecar container.
In order to make the metrics available to Katib, they must be logged to stdout in the key=value format.
The job output is redirected to the /var/log/katib/metrics.log file.
This means that the objective function (for Katib) must match the metric's key in the model's output.
It is therefore possible to define custom model metrics for your use case.
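For example, here is a hedged sketch of how a trainer might report its objective metric; the metric name accuracy is only an assumption and has to match the objectiveMetricName used in the experiments below, and evaluate_model is a stand-in for your own evaluation code:
# Print the objective metric to stdout in key=value format so that the Katib
# metrics collector sidecar can pick it up from the container's log output.
accuracy = evaluate_model()  # hypothetical helper returning the test-set accuracy
print(f"accuracy={accuracy}")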
How to Create Experiments
Note that you typically use (YAML) resource definitions for Kubernetes from the command line, but you can do everything from a notebook, too!
Of course, if you are more familiar or comfortable with kubectl and the command line, feel free to use a local CLI or the embedded terminals from the Jupyter Lab launch screen.
TF_EXPERIMENT_FILE = "katib-tfjob-experiment.yaml"
PYTORCH_EXPERIMENT_FILE = "katib-pytorchjob-experiment.yaml"
Set the following constants depending on whether you want to use GPUs and how many trials you want to run.
GPUS = 1 # set to 0 if the experiment should not use GPUs
PARALLEL_TRIAL_COUNT = 3
TOTAL_TRIAL_COUNT = 9
Make the defined constants available as shell environment variables. They parameterize the Experiment manifests below.
%env GPUS $GPUS
%env TF_EXPERIMENT_FILE $TF_EXPERIMENT_FILE
%env PYTORCH_EXPERIMENT_FILE $PYTORCH_EXPERIMENT_FILE
%env PARALLEL_TRIAL_COUNT $PARALLEL_TRIAL_COUNT
%env TOTAL_TRIAL_COUNT $TOTAL_TRIAL_COUNT
env: GPUS=1
env: TF_EXPERIMENT_FILE=katib-tfjob-experiment.yaml
env: PYTORCH_EXPERIMENT_FILE=katib-pytorchjob-experiment.yaml
env: PARALLEL_TRIAL_COUNT=3
env: TOTAL_TRIAL_COUNT=9
Define a helper function to capture output from a cell that usually looks like some-resource created, using %%capture:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param str captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^([^/]+/)?(.+)\s+created", out)
    if matches is not None:
        return matches.group(2)
    else:
        raise Exception(
            f"Cannot get resource as its creation failed: {out}. It may already exist."
        )
TensorFlow: a TFJob Experiment
The TFJob definition for this example is based on the MNIST with TensorFlow notebook.
The model accepts several arguments:
--batch-size
--buffer-size
--epochs
--steps
--learning-rate
--momentum
For the experiment, focus on the learning rate and momentum of the SGD algorithm. You can add the other hyperparameters in a similar manner. Please note that discrete values (e.g. epochs) and categorical values (e.g. optimization algorithms) are supported, too.
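As an illustration only, an integer-valued and a categorical hyperparameter could be declared in the parameters section as sketched below; the names epochs and optimizer are made up and would only work if the training code accepts matching arguments:
- name: epochs
  parameterType: int
  feasibleSpace:
    min: "5"
    max: "20"
- name: optimizer
  parameterType: categorical
  feasibleSpace:
    list:
      - sgd
      - adam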
The following YAML file describes an Experiment object:
%env IMAGE mesosphere/kubeflow:2.0.0-mnist-tensorflow-2.8.0-gpu
%%bash
cat <<END > $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  name: katib-tfjob-experiment
spec:
  parallelTrialCount: $PARALLEL_TRIAL_COUNT
  maxTrialCount: $TOTAL_TRIAL_COUNT
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.3"
        max: "0.4"
    - name: momentum
      parameterType: double
      feasibleSpace:
        min: "0.6"
        max: "0.7"
  resumePolicy: Never
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: momentum
        description: Momentum for the training model
        reference: momentum
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: ${IMAGE}
                    imagePullPolicy: Always
                    command: ["python", "-u", "/mnist.py"]
                    args:
                      - "--learning-rate=\${trialParameters.learningRate}"
                      - "--momentum=\${trialParameters.momentum}"
                    resources:
                      limits:
                        cpu: 1
                        memory: 3G
                        nvidia.com/gpu: $GPUS
END
Please note that the Docker image that contains the model has to be set for the trialTemplate configuration.
This experiment will create 9 trials with different sets of hyperparameter values passed to each training job.
It uses a random search to maximize the accuracy on the test data set.
The Docker image is set via the IMAGE environment variable above. The one listed should work, but you may want to try it with an image from your own container registry.
The Experiment specification has the following sections to configure experiments:
spec.parameters contains the list of hyperparameters that are used to tune the model
spec.objective defines the metric to optimize
spec.algorithm defines which search algorithm to use for the tuning process
Many more configuration options exist, but they are too numerous to go through here. Please have a look at the official documentation for more details.
PyTorch: a PyTorchJob experiment
This example is based on the MNIST with PyTorch notebook. It accepts the following parameters relevant to training the model:
--batch-size
--epochs
--lr (i.e. the learning rate)
--gamma
For the experiment, find the optimal learning rate in the range of [0.7, 1.0] with regard to the accuracy on the test data set.
This is logged as accuracy=<value>, as can be seen in the original notebook for distributed training.
Run up to 9 trials with three such trials in parallel.
Again, use a random search:
%env IMAGE mesosphere/kubeflow:2.0.0-mnist-pytorch-1.11.0-gpu
%%bash
cat <<END > $PYTORCH_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  name: katib-pytorchjob-experiment
spec:
  parallelTrialCount: $PARALLEL_TRIAL_COUNT
  maxTrialCount: $TOTAL_TRIAL_COUNT
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.7"
        max: "1.0"
  resumePolicy: Never
  trialTemplate:
    primaryContainerName: pytorch
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: ${IMAGE}
                    imagePullPolicy: Always
                    command: ["python", "-u", "/mnist.py"]
                    args:
                      - "--epochs=5"
                      - "--batch-size=1024"
                      - "--lr=\${trialParameters.learningRate}"
                    resources:
                      limits:
                        nvidia.com/gpu: $GPUS
                      requests:
                        cpu: 1
                        memory: 1G
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: ${IMAGE}
                    imagePullPolicy: Always
                    args:
                      - "--epochs=5"
                      - "--batch-size=1024"
                      - "--lr=\${trialParameters.learningRate}"
                    resources:
                      limits:
                        nvidia.com/gpu: $GPUS
                      requests:
                        cpu: 1
                        memory: 1G
END
Please note the subtle differences in the trialTemplate: the kind is either TFJob or PyTorchJob, and the Docker images are obviously different.
Run and Monitor Experiments
You can either execute these commands on your local machine with kubectl, or you can run them from the notebook.
If you do run these locally, you cannot rely on cell magic, so you have to manually copy-paste the experiment name wherever you see $EXPERIMENT.
If you intend to run the following command locally, you have to set the user namespace for all subsequent commands:
kubectl config set-context --current --namespace=<insert-namespace>
Please change the namespace to whatever has been set up by your administrator.
Pick one of the following depending on which framework you want to use.
%env EXPERIMENT_FILE $PYTORCH_EXPERIMENT_FILE
%env EXPERIMENT_FILE $TF_EXPERIMENT_FILE
To submit the experiment, execute:
%%capture kubectl_output --no-stderr
%%sh
kubectl apply -f "${EXPERIMENT_FILE}"
The cell magic grabs the output of the kubectl command and stores it in an object named kubectl_output:
%env EXPERIMENT {get_resource(kubectl_output)}
To see the status, run:
%%sh
kubectl describe experiment.kubeflow.org $EXPERIMENT
To see experiment suggestions:
%%sh
kubectl describe suggestions.kubeflow.org $EXPERIMENT
To get the list of created trials, use the following command:
%%sh
kubectl get trials.kubeflow.org -l katib.kubeflow.org/experiment=$EXPERIMENT
NAME TYPE STATUS AGE
katib-pytorchjob-experiment-62b9lr7k Created True 2s
katib-pytorchjob-experiment-qcl4jkc6 Created True 2s
katib-pytorchjob-experiment-vnzgj7q6 Created True 2s
To fetch the logs of a particular trial, use the following command:
%%sh
kubectl logs -l job-name=<trial-name> --all-containers --prefix=true
After the experiment is completed, use describe to get the best trial results:
%%sh
kubectl describe experiment.kubeflow.org $EXPERIMENT
The relevant section of the output looks like this:
Name:          katib-pytorchjob-experiment
...
Status:
  ...
  Current Optimal Trial:
    Best Trial Name:  katib-pytorchjob-experiment-jv4sc9q7
    Observation:
      Metrics:
        Name:   accuracy
        Value:  0.9902
    Parameter Assignments:
      Name:   --lr
      Value:  0.5512569257804198
  ...
  Trials:            6
  Trials Succeeded:  6
...
To delete an Experiment, which also removes it from the “Experiments (AutoML)” page, run:
%%sh
kubectl delete experiments.kubeflow.org $EXPERIMENT
Katib UI
So far, you have created and submitted experiments via the command line or from within Jupyter notebooks. Katib provides a user interface which allows you to create, configure, monitor, and delete experiments from a browser. The Katib UI can be launched from Kubeflow’s central dashboard. Select “Experiments (AutoML)” in the navigation menu.
To see detailed information, such as trial results, metrics, and a plot, click on the experiment itself.