Training MNIST with PyTorch
Introduction
Recognizing handwritten digits based on the MNIST (Modified National Institute of Standards and Technology) data set is the “Hello, World” example of machine learning. Each (anti-aliased) black-and-white image represents a digit from 0 to 9 and fits in a 28×28 pixel bounding box. The problem of recognizing digits from handwriting is, for instance, important to the postal service when automatically reading zip codes from envelopes.
What You Will Learn
You will see how to use PyTorch to build a model with two convolutional layers and two fully connected layers to perform the multi-class classification of images provided. In addition, there is a dropout layer after the convolutional layers (and before the first fully connected layer) and another one right after the first fully connected layer.
The example in the notebook includes both training a model in the notebook and running a distributed `PyTorchJob` on the cluster, so you can easily scale up your own models.
For the distributed training job, you have to package the complete trainer code in a Docker image.
You will see how to do that with Kaptain SDK, so that you do not have to leave your favorite notebook environment at all!
You will also find instructions for local development, in case you prefer that.
`PyTorchJob` is a custom resource (definition) (CRD) provided by the PyTorch operator.
Operators extend Kubernetes by capturing domain-specific knowledge on how to deploy and run an application or service, how to deal with failures, and so on.
The PyTorch operator controller manages the lifecycle of a `PyTorchJob`.
What You Need
All you need is this notebook. If you prefer to create your Docker image locally, you must also have a Docker client installed on your machine.
Prerequisites
Before proceeding, check you are using the correct notebook image, that is, PyTorch is available:
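A minimal sketch of such a check, assuming only that the package is importable under its usual name `torch`. The helper name `pytorch_available` is hypothetical and not part of the notebook.

```python
import importlib.util


def pytorch_available() -> bool:
    """Return True if the `torch` package can be found by this kernel."""
    return importlib.util.find_spec("torch") is not None


if pytorch_available():
    import torch

    print(f"PyTorch {torch.__version__} is available")
else:
    print("PyTorch is not installed; switch to a PyTorch notebook image")
```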
To package the trainer in a container image, you need a file (on the cluster) that contains the code, as well as a file with the resource definition of the job for the Kubernetes cluster:
Define a helper function to capture output from a cell that usually looks like `some-resource created`, using `%%capture`:
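One way such a helper could look is sketched below. It works on plain text, on the assumption that a cell starting with `%%capture output` stores its output in `output` and exposes the text as `output.stdout`. The function name `get_resource` and the sample resource name are illustrative assumptions.

```python
import re


def get_resource(captured_stdout: str) -> str:
    """Extract the resource name from output such as 'some-resource created'."""
    match = re.search(r"^(\S+)\s+created\s*$", captured_stdout, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"No 'created' resource found in: {captured_stdout!r}")
    return match.group(1)


# In a notebook, the argument would be `output.stdout` from a `%%capture output` cell.
# The resource name below is a made-up example.
print(get_resource("pytorchjob.kubeflow.org/pytorchjob-mnist created\n"))
```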
How to Load and Inspect the Data
Before training, inspect the data that will go into the model:
That shows there are 60,000 28×28 pixel grayscale images. These have not yet been scaled into the [0, 1] range, as you can see for yourself:
The corresponding label is:
Normalize the data set to improve the training speed, which means you need to know the mean and standard deviation:
These are the values hard-coded in the transformations within the model. Ideally, you re-compute these based on the training data set to ensure you capture the correct values when the underlying data changes. This data set is static, though. Note that these values are always re-used (i.e. not re-computed) when predicting (or serving) to minimize training/serving skew. For this demonstration, it is fine to define these up front. Batch normalization would be an alternative that scales better with data sets of any size.
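The computation behind such constants can be sketched on a toy example. The 2×3 "image" below is made up; for the MNIST training set the commonly quoted values are approximately 0.1307 (mean) and 0.3081 (standard deviation) after scaling pixels into [0, 1].

```python
from statistics import fmean, pstdev

# A hypothetical 2x3 "image" with raw 8-bit pixel intensities (0-255).
raw_pixels = [[0, 51, 102], [153, 204, 255]]

# Step 1: scale into [0, 1] (torchvision's ToTensor does this for 8-bit images).
scaled = [value / 255.0 for row in raw_pixels for value in row]

# Step 2: compute the population mean and standard deviation over all pixels.
mean = fmean(scaled)
std = pstdev(scaled)

# Step 3: standardize. The same (mean, std) pair is re-used at prediction time.
normalized = [(value - mean) / std for value in scaled]

print(f"mean={mean:.4f}, std={std:.4f}")
```

In a real pipeline the statistics come from the full training set, never from the data being predicted on.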
Batch normalization computes the mean and variance per batch of training data and per layer to rescale the batch's input values with the aid of two learnable parameters: β (shift) and γ (scale). It is typically applied before the activation function (as in the original paper), although there is no consensus on the matter and there may be valid reasons to apply it afterward. Batch normalization makes the weights in later layers more robust against changes in the earlier ones: while the input of later layers can obviously vary from step to step, its mean and variance remain fairly constant. This is because you shuffle the training data set, so each batch is on average roughly representative of the entire data set. Because the per-batch mean and variance are noisy estimates of the population statistics, batch normalization also provides a slight regularizing effect that decreases as the batch size increases.
At prediction time, batch normalization requires (fixed) values for the mean and variance to transform the data. The population statistics are the obvious choice, estimated across training batches with moving averages (for example, exponentially weighted averages). It is often argued that you use these rather than the per-batch equivalents at inference time because you may not receive batches to predict on, and you cannot compute a meaningful mean and variance for an individual example. While that is certainly true in some cases (e.g. online predictions), the main reason is to avoid training/serving skew. Even with batches at inference, there may be significant correlation present (e.g. data from the same entity, such as a user, a session, a region, a product, a machine or assembly line, and so on). In these cases, the mean/variance of the prediction batch may not be representative of the population at large. Once scaled with the batch statistics, these input values may well end up near a mean of zero with unit variance, even though in the overall population they would have been near the tails of the distribution.
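The difference between training-time and prediction-time behaviour can be illustrated for a single scalar feature. This is a simplified sketch in plain Python, not the implementation used by PyTorch; the class name and the `momentum` default are assumptions chosen for illustration.

```python
import math


class BatchNorm1Feature:
    """Batch normalization for one scalar feature (illustrative only)."""

    def __init__(self, momentum: float = 0.1, eps: float = 1e-5):
        self.gamma = 1.0  # learnable scale
        self.beta = 0.0  # learnable shift
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, batch, training: bool):
        if training:
            # Use the statistics of the current batch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and track exponentially weighted averages for later use.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # At prediction time, fall back to the tracked population estimates.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in batch]


bn = BatchNorm1Feature()
for _ in range(200):
    train_out = bn.forward([4.0, 6.0, 8.0, 10.0], training=True)

# A single example can be transformed, because no batch statistics are needed.
single_out = bn.forward([10.0], training=False)
print(train_out, single_out)
```

Note that a batch of one example in training mode would always be mapped to β, which is why the tracked statistics matter for online predictions.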
Keep the `PyTorchJob` parameters separate from the main code.
The reason we do that is to ensure we can run the notebook in so-called headless mode with Papermill for custom parameters.
This allows us to test the notebooks end-to-end automatically.
If you check the cell tag of the next cell, you can see it is tagged as `parameters`.
Feel free to ignore it!
Make the defined constants available as shell environment variables. They parameterize the `PyTorchJob` manifest below.
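A minimal sketch of exporting constants from Python, assuming the parameter names below. `TRAINER_FILE` appears in this tutorial, whereas `EPOCHS` and `GPUS`, as well as all three values, are hypothetical stand-ins.

```python
import os

# Hypothetical parameter values standing in for the notebook's constants.
TRAINER_FILE = "mnist.py"
EPOCHS = 15
GPUS = 1

parameters = {"TRAINER_FILE": TRAINER_FILE, "EPOCHS": EPOCHS, "GPUS": GPUS}
for name, value in parameters.items():
    # Environment variables are always strings.
    os.environ[name] = str(value)

print({name: os.environ[name] for name in parameters})
```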
How to Train the Model in the Notebook
Since you ultimately want to train the model in a distributed fashion (potentially on GPUs), put all the code in a single cell. That way you can save the file and include it in a container image:
That saves the file as defined by `TRAINER_FILE`, but it does not run it.
The log entries for Katib are there so that the same file can be re-used for hyperparameter tuning, which is done in a separate notebook.
All you need to know for that is that Katib looks for `key=value` entries in the logs.
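To make the `key=value` convention concrete, the sketch below extracts such entries from log lines like the ones the trainer emits. It is not Katib's own parser; the pattern and the function name `parse_metrics` are assumptions for illustration.

```python
import re

# Matches a trailing `name=number` entry, e.g. "INFO:root:loss=0.0500".
METRIC_PATTERN = re.compile(
    r"(?P<key>[A-Za-z_][\w-]*)=(?P<value>[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(log_lines):
    """Collect all key=value metrics per key, in order of appearance."""
    metrics = {}
    for line in log_lines:
        match = METRIC_PATTERN.search(line)
        if match:
            metrics.setdefault(match.group("key"), []).append(float(match.group("value")))
    return metrics


logs = [
    "INFO:root:Epoch: 1 ( 95.5%) - Loss: 0.03434577211737633",
    "INFO:root:Test accuracy: 9834/10000 ( 98.3%)",
    "INFO:root:loss=0.0500",
    "INFO:root:accuracy=0.9834",
]
print(parse_metrics(logs))
```

Lines without an equals sign, such as the per-batch loss, are ignored, which is why the trainer logs the final metrics separately.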
A common choice for activation functions is a ReLU (Rectified Linear Unit). It is linear for non-negative values and zero for negative ones. The main benefits of ReLU as opposed to sigmoidal functions (e.g. logistic or `tanh`) are:
- ReLU and its gradient are very cheap to compute;
- Gradients are less likely to vanish, because for positive inputs the gradient is constant and therefore does not saturate, which can accelerate convergence in deep neural networks;
- ReLU has a regularizing effect, because it promotes sparse representations (i.e. some nodes' outputs are exactly zero);
- Empirically it has been found to work well.
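The saturation argument can be checked numerically. The sketch below compares the gradient of ReLU with that of the logistic function in plain Python; the function names are chosen for illustration.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # Constant for positive inputs, zero otherwise.
    return 1.0 if x > 0 else 0.0


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, 0.5, 5.0, 20.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.2e}")
```

For large positive inputs the logistic gradient shrinks towards zero, whereas the ReLU gradient stays at one.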
While it is not our intention to cover the basics of convolutional neural networks (CNNs), there are a few matters worth mentioning. Convolutional layers are spatial feature extractors for images. A series of convolutional kernels (of the same dimensions), also known as filters, is applied to the image to obtain different versions of the same base image (i.e. feature maps). These filters extract patterns hierarchically. In the first layer, filters typically capture dots, edges, corners, and so on. With each additional layer, these patterns become more complex and turn from basic geometric shapes into constituents of objects and entire objects. That is why the number of filters often increases with each additional convolutional layer: to extract more complex patterns.
Convolutional layers are often followed by a pooling layer to down-sample the input. This aids in lowering the computational burden as you increase the number of filters. A max pooling layer simply picks the largest pixel value in a small (rectangular) neighbourhood of a single channel (e.g. the red channel of an RGB image). This has the effect of making features locally translation-invariant, which is often desired: whether a feature of interest is on the left or right edge of a pooling window, which is also referred to as a kernel, is largely irrelevant to the problem of image classification. Note that this may not always be a desired characteristic and depends on the size of the pooling kernel. For instance, the precise location of tissue damage in living organisms or defects on manufactured products may be very significant indeed. Pooling kernels are generally chosen to be relatively small compared to the dimensions of the input, so that the invariance they introduce remains local.
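Local translation invariance can be seen in a small example. The sketch below implements 2×2 max pooling with stride 2 on a single channel in plain Python; the function name and the toy images are illustrative.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a single-channel image (list of rows)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def image_with_feature(row, col, size=4):
    """A blank image with a single bright pixel at (row, col)."""
    image = [[0] * size for _ in range(size)]
    image[row][col] = 9
    return image


top_left = max_pool_2x2(image_with_feature(0, 0))
shifted_within_window = max_pool_2x2(image_with_feature(1, 1))
shifted_across_windows = max_pool_2x2(image_with_feature(0, 2))
print(top_left, shifted_within_window, shifted_across_windows)
```

Moving the bright pixel inside one pooling window leaves the output unchanged, while moving it into the neighbouring window changes it, so the invariance is only local.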
Another common component of CNNs is a dropout layer. Dropout provides a mechanism for regularization that has proven successful in many applications. It is surprisingly simple: the outputs of some nodes in a specific layer are set to zero at random, that is, arbitrary nodes are temporarily removed from the network during a training step. This causes the network to not rely on any single node (a.k.a. neuron) for a feature, as each node can be dropped at random. The network therefore has to learn redundant representations of features. This is important because of what is referred to as internal covariate shift (often mentioned in connection with batch normalization): the change in the distribution of each layer's inputs caused by updates to the preceding layers, which can cause nodes to stop learning (i.e. updating their weights). Thanks to dropout, layers become more robust to changes, although it also limits what can be learned (as always with regularization). Layers with a high risk of overfitting (e.g. layers with many units and lots of inputs) typically have a higher dropout rate.
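A minimal sketch of the mechanism in plain Python, using the "inverted" variant in which surviving outputs are scaled by 1/(1 − p) during training so that the expected value is unchanged (PyTorch's `nn.Dropout` scales in the same way). The function name and the sample values are illustrative.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training (requires p < 1)."""
    if not training or p == 0.0:
        # At prediction time every node is kept and nothing is rescaled.
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print(sum(1 for a in train_out if a == 0.0), "of", len(train_out), "nodes dropped")
print("mean during training:", sum(train_out) / len(train_out))
```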
A nice visual explanation of convolutional layers is available here. If you are curious what a CNN "sees" while training, you can have a look here.
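The architecture described in the introduction (two convolutional layers, max pooling, two dropout layers, and two fully connected layers) can be sketched as follows. The channel counts, kernel sizes, dropout rates, and hidden layer width are assumptions borrowed from common MNIST examples and may differ from the values in the trainer file.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.dropout1 = nn.Dropout(0.25)  # after the convolutional layers
        self.dropout2 = nn.Dropout(0.5)  # after the first fully connected layer
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)  # one output per digit

    def forward(self, x):
        x = F.relu(self.conv1(x))  # 28x28 -> 26x26
        x = F.relu(self.conv2(x))  # 26x26 -> 24x24
        x = F.max_pool2d(x, 2)  # 24x24 -> 12x12
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        return F.log_softmax(self.fc2(x), dim=1)


model = Net().eval()  # eval() disables dropout
with torch.no_grad():
    out = model(torch.zeros(4, 1, 28, 28))
print(out.shape)  # torch.Size([4, 10])
```

The output contains log-probabilities, so it pairs with a negative log-likelihood loss during training.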
Run the code from within the notebook to check that it is correct:
```sh
INFO:root:Epoch: 1 ( 0.0%) - Loss: 2.293032646179199
INFO:root:Epoch: 1 ( 13.6%) - Loss: 0.5257666110992432
INFO:root:Epoch: 1 ( 27.3%) - Loss: 0.08510863780975342
INFO:root:Epoch: 1 ( 40.9%) - Loss: 0.32805368304252625
INFO:root:Epoch: 1 ( 54.6%) - Loss: 0.3279671370983124
INFO:root:Epoch: 1 ( 68.2%) - Loss: 0.06365685909986496
INFO:root:Epoch: 1 ( 81.9%) - Loss: 0.29687821865081787
INFO:root:Epoch: 1 ( 95.5%) - Loss: 0.03434577211737633
INFO:root:Test accuracy: 9834/10000 ( 98.3%)
INFO:root:loss=0.0500
INFO:root:accuracy=0.9834
INFO:root:Epoch: 2 ( 0.0%) - Loss: 0.08447802066802979
INFO:root:Epoch: 2 ( 13.6%) - Loss: 0.2620002329349518
INFO:root:Epoch: 2 ( 27.3%) - Loss: 0.10486980527639389
INFO:root:Epoch: 2 ( 40.9%) - Loss: 0.07522107660770416
INFO:root:Epoch: 2 ( 54.6%) - Loss: 0.044803790748119354
INFO:root:Epoch: 2 ( 68.2%) - Loss: 0.06450511515140533
INFO:root:Epoch: 2 ( 81.9%) - Loss: 0.25487586855888367
INFO:root:Epoch: 2 ( 95.5%) - Loss: 0.01875779777765274
INFO:root:Test accuracy: 9859/10000 ( 98.6%)
INFO:root:loss=0.0399
INFO:root:accuracy=0.9859
INFO:root:Epoch: 3 ( 0.0%) - Loss: 0.029139619320631027
INFO:root:Epoch: 3 ( 13.6%) - Loss: 0.09397225826978683
INFO:root:Epoch: 3 ( 27.3%) - Loss: 0.11303514242172241
INFO:root:Epoch: 3 ( 40.9%) - Loss: 0.14118748903274536
INFO:root:Epoch: 3 ( 54.6%) - Loss: 0.05904180556535721
INFO:root:Epoch: 3 ( 68.2%) - Loss: 0.04524335265159607
INFO:root:Epoch: 3 ( 81.9%) - Loss: 0.27801263332366943
INFO:root:Epoch: 3 ( 95.5%) - Loss: 0.03176506236195564
INFO:root:Test accuracy: 9886/10000 ( 98.9%)
INFO:root:loss=0.0359
INFO:root:accuracy=0.9886
INFO:root:Epoch: 4 ( 0.0%) - Loss: 0.07127423584461212
INFO:root:Epoch: 4 ( 13.6%) - Loss: 0.20250867307186127
INFO:root:Epoch: 4 ( 27.3%) - Loss: 0.0050563933327794075
INFO:root:Epoch: 4 ( 40.9%) - Loss: 0.14717304706573486
INFO:root:Epoch: 4 ( 54.6%) - Loss: 0.10025180876255035
INFO:root:Epoch: 4 ( 68.2%) - Loss: 0.13863351941108704
INFO:root:Epoch: 4 ( 81.9%) - Loss: 0.10420405864715576
INFO:root:Epoch: 4 ( 95.5%) - Loss: 0.004818277433514595
INFO:root:Test accuracy: 9887/10000 ( 98.9%)
INFO:root:loss=0.0329
INFO:root:accuracy=0.9887
INFO:root:Epoch: 5 ( 0.0%) - Loss: 0.008954059332609177
INFO:root:Epoch: 5 ( 13.6%) - Loss: 0.19676166772842407
INFO:root:Epoch: 5 ( 27.3%) - Loss: 0.0015074732946231961
INFO:root:Epoch: 5 ( 40.9%) - Loss: 0.09220609813928604
INFO:root:Epoch: 5 ( 54.6%) - Loss: 0.015971817076206207
INFO:root:Epoch: 5 ( 68.2%) - Loss: 0.05801410600543022
INFO:root:Epoch: 5 ( 81.9%) - Loss: 0.07174661010503769
INFO:root:Epoch: 5 ( 95.5%) - Loss: 0.0020152931101620197
INFO:root:Test accuracy: 9909/10000 ( 99.1%)
INFO:root:loss=0.0306
INFO:root:accuracy=0.9909
```
```dockerfile
FROM mesosphere/kubeflow:2.0.0-pytorch-1.11.0-gpu
ADD mnist.py /
ADD datasets /datasets

ENTRYPOINT ["python", "/mnist.py"]
```
The image is available as `mesosphere/kubeflow:2.0.0-mnist-pytorch-1.11.0-gpu` in case you want to skip the build for now.
How to Create a Distributed PyTorchJob
For large training jobs, run the trainer in distributed mode. Once the notebook server cluster can access the Docker image from the registry, you can launch a distributed PyTorch job.
The specification for a distributed `PyTorchJob` is defined using YAML:
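As a hedged sketch, an equivalent manifest can be assembled as a Python dictionary and serialized to JSON, which `kubectl` accepts in place of YAML. The job name, the `--epochs` flag, the GPU limit, and the TTL value are assumptions; the supported arguments are defined in `main()` of `mnist.py`, and the exact location of `ttlSecondsAfterFinished` may depend on the operator version.

```python
import json

IMAGE = "mesosphere/kubeflow:2.0.0-mnist-pytorch-1.11.0-gpu"


def replica_spec(replicas: int, image: str, epochs: int) -> dict:
    """Pod template shared by the Master and Worker replica types."""
    container = {
        "name": "pytorch",
        "image": image,
        "args": ["--epochs", str(epochs)],  # assumed flag name
        "resources": {"limits": {"nvidia.com/gpu": 1}},  # assumed limit
    }
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [container]}},
    }


def pytorch_job(name: str, workers: int = 2, epochs: int = 15, ttl_seconds: int = 600) -> dict:
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Placed as described in this tutorial; some operator versions
            # expect this field under spec.runPolicy instead.
            "ttlSecondsAfterFinished": ttl_seconds,
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, IMAGE, epochs),
                "Worker": replica_spec(workers, IMAGE, epochs),
            },
        },
    }


manifest = pytorch_job("pytorchjob-mnist")  # hypothetical job name
print(json.dumps(manifest, indent=2))
```

The image appears once per replica type, which is why it is specified twice in the manifest.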
What this does is create one master and two worker pods.
These can be adjusted via `spec.pytorchReplicaSpecs.<type>.replicas`, with `<type>` either `Master` or `Worker`.
The distributed sampler passes chunks of the training data set equally to the pods.
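The idea behind the equal split can be sketched in plain Python. This is similar in spirit to `torch.utils.data.DistributedSampler`, but it is not its implementation; the function name and the seed handling are assumptions.

```python
import random


def shard_indices(dataset_size: int, num_replicas: int, rank: int, seed: int = 0):
    """Return the sample indices that the replica with the given rank trains on."""
    indices = list(range(dataset_size))
    # Every replica uses the same seed, so all of them see the same shuffled order.
    random.Random(seed).shuffle(indices)
    per_replica = -(-dataset_size // num_replicas)  # ceiling division
    total = per_replica * num_replicas
    # Pad by repeating samples so that every replica gets the same number.
    indices += indices[: total - dataset_size]
    return indices[rank:total:num_replicas]


# One master and two workers: three replicas sharing 60,000 training images.
shards = [shard_indices(60_000, num_replicas=3, rank=rank) for rank in range(3)]
print([len(shard) for shard in shards])
```

Each replica therefore processes a disjoint third of the data per epoch, as long as the data set size is divisible by the number of replicas.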
Custom training arguments can be passed to the container by means of `spec.containers.args`.
What is supported is visible in `main()` of `mnist.py`.
The container image specified (twice) is the one built from the code shown above. Still, it is best to change the image name listed under the comments of the specification to an equivalent image in your own container registry, to ensure everything works as expected.
The job can run in parallel on CPUs or GPUs, provided these are available in your cluster.
To switch to CPUs or define resource limits, please adjust `spec.containers.resources` as required.
To clean up finished `PyTorchJob` resources, set `spec.ttlSecondsAfterFinished`. The cleanup may take an extra `ReconcilePeriod` seconds, since reconcile gets called periodically. The setting defaults to infinite.
You can either execute the following commands on your local machine with `kubectl` or directly from the notebook.
If you do run these locally, you cannot rely on cell magic, so you have to manually copy-paste the variables’ values wherever you see `$SOME_VARIABLE`.
If you execute the following commands on your own machine (and not inside the notebook), you obviously do not need the cell magic `%%` lines either.
In that case, you have to set the user namespace for all subsequent commands:
Please change the namespace to whatever has been set up by your administrator.
Deploy the distributed training job:
Check the status like so:
The output looks like this:
You should now be able to see the pods created, matching the specified number of replicas.
The job name matches `metadata.name` from the YAML.
As per the specification, the training runs for 15 epochs.
During that time, stream the logs from the `Master` pod to follow the progress:
Note that it may take a while when the image has to be pulled from the registry. It usually takes a few minutes, depending on the arguments and resources of the cluster, for the status for all pods to be ‘Running’.
The setting `spec.ttlSecondsAfterFinished` will result in the cleanup of the created job:
To delete the job manually, execute:
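A minimal sketch of the `kubectl` invocations for the steps in this section, expressed as Python argument lists that can be printed or passed to `subprocess.run`. The namespace, job name, and manifest file name are placeholders, and the pod name `<job>-master-0` is an assumption about how the operator names the master pod.

```python
import shlex


def kubectl_commands(namespace: str, job_name: str, manifest_file: str) -> dict:
    """Build the commands for deploying, following, and deleting a PyTorchJob."""
    return {
        "set-namespace": [
            "kubectl", "config", "set-context", "--current", f"--namespace={namespace}",
        ],
        "deploy": ["kubectl", "create", "-f", manifest_file],
        "status": ["kubectl", "get", "pytorchjobs", job_name],
        "pods": ["kubectl", "get", "pods"],
        "logs": ["kubectl", "logs", "-f", f"{job_name}-master-0"],  # assumed pod name
        "delete": ["kubectl", "delete", "pytorchjob", job_name],
    }


# All three arguments are hypothetical placeholders.
commands = kubectl_commands("demo", "pytorchjob-mnist", "pytorchjob.yaml")
for step, argv in commands.items():
    print(f"{step:>14}: {shlex.join(argv)}")
```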