Installing DC/OS Data Science Engine with TensorFlow 2.1.0

DC/OS Data Science Engine comes with TensorFlow 2.1.0 support by default. Run the following command:
dcos package install data-science-engine
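To confirm which TensorFlow version the notebook image provides, you can run a quick optional check in a notebook cell:

import tensorflow as tf

# A default installation should report '2.1.0'.
print(tf.__version__)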
Installing DC/OS Data Science Engine with TensorFlow 1.15

Run the following command:
dcos package install --options=options.json data-science-engine
with options.json having the following content:
{
  "service": {
    "jupyter_notebook_type": "TensorFlow-1.15"
  }
}
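After the service is up, the same optional check applies; with the options above it should report a 1.15.x version:

import tensorflow as tf

# With jupyter_notebook_type set to TensorFlow-1.15, expect a 1.15.x version string.
print(tf.__version__)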
TensorFlow Local Machine Learning
Open a Python 3 notebook and put the following sections in different code cells.
- Prepare the test data:

import tensorflow as tf

# Load MNIST and reshape the images to (28, 28, 1) for the convolutional layers.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)

# Normalize pixel values from [0, 255] to [0, 1].
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
- Define a model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D

model = Sequential()
model.add(Conv2D(28, kernel_size=(3, 3), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256, activation=tf.nn.relu))
model.add(Dropout(0.2))
model.add(Dense(10, activation=tf.nn.softmax))
- Train and evaluate the model:

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x=x_train, y=y_train, epochs=10)
model.evaluate(x_test, y_test)
- Use the model to predict a handwritten digit:

image_index = 5555  # the expected digit is '3'
pred = model.predict(x_test[image_index].reshape(1, 28, 28, 1))
print("Predicted Number: {}".format(pred.argmax()))
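To double-check the prediction visually, you can plot the sample with matplotlib. This is a minimal sketch; it assumes matplotlib is available in the notebook image:

import matplotlib.pyplot as plt

# Show the digit the model just classified, with the prediction as the title.
plt.imshow(x_test[image_index].reshape(28, 28), cmap='Greys')
plt.title("Predicted: {}".format(pred.argmax()))
plt.show()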
TensorFlow 2.1.0 Distributed Learning with Horovod on Spark
DC/OS Data Science Engine includes Horovod on Spark integration, which lets you run TensorFlow in distributed mode using Apache Spark as the execution engine.
Open a Python 3 notebook and put the following sections in different code cells.
- Define utility functions to prepare the dataset and model:

def get_dataset(rank=0, size=1):
    import tensorflow as tf
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data('MNIST-data-%d' % rank)
    # Each worker trains on its own shard of the data.
    x_train = x_train[rank::size]
    y_train = y_train[rank::size]
    x_test = x_test[rank::size]
    y_test = y_test[rank::size]
    # Normalize the pixel values by dividing by the max RGB value.
    x_train, x_test = x_train / 255.0, x_test / 255.0
    return (x_train, y_train), (x_test, y_test)

def get_model(num_classes=10):
    import tensorflow as tf
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
    return model

def deserialize(model_bytes):
    import horovod.tensorflow.keras as hvd
    import h5py
    import io
    bio = io.BytesIO(model_bytes)
    with h5py.File(bio, 'a') as f:
        return hvd.load_model(f)
- Implement the distributed training function using Horovod:

def train_hvd(num_classes=10, learning_rate=0.001, batch_size=128, epochs=2):
    import os
    import tempfile
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Horovod: pin GPU to be used to process local rank (one GPU per process).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    (x_train, y_train), (x_test, y_test) = get_dataset(hvd.rank(), hvd.size())
    model = get_model(num_classes)

    # Horovod: add Horovod DistributedOptimizer and scale the learning rate
    # by the number of workers.
    optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(lr=learning_rate * hvd.size()))

    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  experimental_run_tf_function=False,
                  metrics=['accuracy'])

    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=3, verbose=1),
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from
    # corrupting them.
    ckpt_dir = tempfile.mkdtemp()
    ckpt_file = os.path.join(ckpt_dir, 'checkpoint.h5')
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint(
            ckpt_file, monitor='accuracy', mode='max', save_best_only=True))

    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        callbacks=callbacks,
                        epochs=epochs,
                        verbose=2,
                        validation_data=(x_test, y_test))

    if hvd.rank() == 0:
        with open(ckpt_file, 'rb') as f:
            # Return a tuple of the training history and the serialized model bytes.
            return history.history, f.read()
- Create a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TF2HorovodOnSpark").getOrCreate()
- Run distributed training (a note on controlling parallelism follows these steps):

import horovod.spark

# horovod.spark.run returns one result per worker; worker 0 returns a
# (history, model_bytes) tuple, so take the serialized model from it.
model_bytes = horovod.spark.run(train_hvd, verbose=2)[0][1]
- Evaluate the model:

(x_train, y_train), (x_test, y_test) = get_dataset()
model = deserialize(model_bytes)
evaluation = model.evaluate(x_test, y_test, verbose=2)
print('Model Evaluation:', evaluation)
- Shut down the Spark workers:

spark.stop()
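By default, horovod.spark.run infers the number of training processes from the cluster (typically Spark's default parallelism). To pin the degree of parallelism explicitly, pass num_proc. A minimal sketch; the value 2 is illustrative, so size it to your cluster:

import horovod.spark

# Run the same training function on exactly two Horovod processes.
model_bytes = horovod.spark.run(train_hvd, num_proc=2, verbose=2)[0][1]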
TensorFlow 1.15 Distributed Learning with Horovod on Spark
Open a Python 3 notebook and put the following sections in different code cells.
- Describe the layers of the model:

def conv_model(feature, target, mode):
    """2-layer convolution model."""
    import tensorflow as tf
    layers = tf.layers

    # Convert the target to a one-hot tensor of shape (batch_size, 10),
    # with an on-value of 1 for each one-hot vector of length 10.
    target = tf.one_hot(tf.cast(target, tf.int32), 10, 1, 0)

    # Reshape feature to a 4d tensor with the 2nd and 3rd dimensions being
    # image width and height, and the final dimension being the number of color channels.
    feature = tf.reshape(feature, [-1, 28, 28, 1])

    # First conv layer will compute 32 features for each 5x5 patch.
    with tf.variable_scope('conv_layer1'):
        h_conv1 = layers.conv2d(feature, 32, kernel_size=[5, 5],
                                activation=tf.nn.relu, padding="SAME")
        h_pool1 = tf.nn.max_pool(
            h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Second conv layer will compute 64 features for each 5x5 patch.
    with tf.variable_scope('conv_layer2'):
        h_conv2 = layers.conv2d(h_pool1, 64, kernel_size=[5, 5],
                                activation=tf.nn.relu, padding="SAME")
        h_pool2 = tf.nn.max_pool(
            h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Reshape the tensor into a batch of vectors.
    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

    # Densely connected layer with 1024 neurons.
    h_fc1 = layers.dropout(
        layers.dense(h_pool2_flat, 1024, activation=tf.nn.relu),
        rate=0.5,
        training=mode == tf.estimator.ModeKeys.TRAIN)

    # Compute logits (1 per class) and compute the loss.
    logits = layers.dense(h_fc1, 10, activation=None)
    loss = tf.losses.softmax_cross_entropy(target, logits)

    return tf.argmax(logits, 1), loss
- Implement the training input generator:

def train_input_generator(x_train, y_train, batch_size=64):
    import numpy as np
    assert len(x_train) == len(y_train)
    while True:
        # Reshuffle the training set at the start of every pass over the data.
        p = np.random.permutation(len(x_train))
        x_train, y_train = x_train[p], y_train[p]
        index = 0
        while index <= len(x_train) - batch_size:
            yield x_train[index:index + batch_size], \
                  y_train[index:index + batch_size],
            index += batch_size
- Implement the distributed training function using Horovod:

def train_hvd():
    import os
    import tensorflow as tf
    import horovod.tensorflow as hvd
    import numpy as np
    from tensorflow import keras

    tf.logging.set_verbosity(tf.logging.INFO)

    # Horovod: initialize Horovod.
    hvd.init()

    # Download and load the MNIST dataset.
    (x_train, y_train), (x_test, y_test) = \
        keras.datasets.mnist.load_data('MNIST-data-%d' % hvd.rank())

    # The shape of the downloaded data is (-1, 28, 28), so we need to reshape it
    # into (-1, 784) to feed into our network. Also normalize the features
    # between 0 and 1.
    x_train = np.reshape(x_train, (-1, 784)) / 255.0
    x_test = np.reshape(x_test, (-1, 784)) / 255.0

    # Build model...
    with tf.name_scope('input'):
        image = tf.placeholder(tf.float32, [None, 784], name='image')
        label = tf.placeholder(tf.float32, [None], name='label')
    predict, loss = conv_model(image, label, tf.estimator.ModeKeys.TRAIN)

    # Horovod: adjust learning rate based on lr_scaler.
    lr_scaler = hvd.size()
    opt = tf.train.AdamOptimizer(0.001 * lr_scaler)

    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt, op=hvd.Average)

    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    hooks = [
        # Horovod: BroadcastGlobalVariablesHook broadcasts initial variable states
        # from rank 0 to all other processes. This is necessary to ensure consistent
        # initialization of all workers when training is started with random weights
        # or restored from a checkpoint.
        hvd.BroadcastGlobalVariablesHook(0),

        # Horovod: adjust the number of steps based on the number of CPUs/GPUs.
        tf.train.StopAtStepHook(last_step=20 // hvd.size()),

        tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                   every_n_iter=10),
    ]

    # Horovod: save checkpoints only on worker 0 to prevent other workers from
    # corrupting them.
    checkpoint_dir = '/tmp/checkpoints' if hvd.rank() == 0 else None

    training_batch_generator = train_input_generator(x_train, y_train, batch_size=100)

    # Horovod: pin GPU to be used to process local rank (one GPU per process).
    config = tf.ConfigProto()
    if tf.test.is_gpu_available():
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = str(hvd.local_rank())

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    step = 0
    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           hooks=hooks,
                                           config=config) as mon_sess:
        while not mon_sess.should_stop():
            # Run a training step synchronously.
            image_, label_ = next(training_batch_generator)
            mon_sess.run(train_op, feed_dict={image: image_, label: label_})
            step += 1
    print('Total Steps Taken: {}'.format(step))
- Create a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TF1HorovodOnSpark").getOrCreate()
- Run distributed training (see the note after these steps):

import horovod.spark

horovod.spark.run(train_hvd, verbose=2)
- Shut down the Spark workers:

spark.stop()
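Unlike the TensorFlow 2 example above, this training function does not return the trained model to the notebook: checkpoints are written to /tmp/checkpoints on the local disk of the executor that runs rank 0. As a quick, optional sanity check before launching a long run (while the Spark session is still active), you can have each Horovod process report the TensorFlow version it sees. A minimal sketch; tf_version is an illustrative helper, not part of the original example:

import horovod.spark

def tf_version():
    # Runs inside each Horovod process on the Spark executors.
    import horovod.tensorflow as hvd
    import tensorflow as tf
    hvd.init()
    return tf.__version__

# horovod.spark.run returns one result per process, e.g. ['1.15.0', '1.15.0'].
print(horovod.spark.run(tf_version, verbose=0))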
TensorBoard
DC/OS Data Science Engine comes with TensorBoard installed. It can be found at http://<dcos-url>/service/data-science-engine/tensorboard/.
Log directory
TensorBoard reads log data from a specific directory, with the default being /mnt/mesos/sandbox. It can be changed with the advanced.tensorboard_logdir option; HDFS paths are supported as well. Here is an example of pointing TensorBoard at HDFS (a sketch for writing logs from a notebook follows these steps):
- Install HDFS:

dcos package install hdfs
- Install data-science-engine with the overridden log directory option:

dcos package install --options=options.json data-science-engine

with options.json having the following content:

{
  "advanced": {
    "tensorboard_logdir": "hdfs://tf_logs"
  }
}
- Open TensorBoard at https://<dcos-url>/service/data-science-engine/tensorboard/ and confirm the change.
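Once a log directory is configured, any event files written under it appear in TensorBoard. A minimal sketch of emitting Keras training logs from a notebook cell, assuming the default /mnt/mesos/sandbox log directory rather than the HDFS override above (the fit-demo subdirectory name is illustrative):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Write TensorBoard event files into a subdirectory of the watched log directory.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='/mnt/mesos/sandbox/fit-demo')
model.fit(x_train, y_train, epochs=1, callbacks=[tb_callback])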
Disabling TensorBoard
DC/OS Data Science Engine can be installed with TensorBoard disabled by using the following configuration:
{
  "advanced": {
    "start_tensorboard": false
  }
}