Kerberos is an authentication system that allows DC/OS Data Science Engine to securely read data from, and write data to, a Kerberos-enabled HDFS cluster. Long-running jobs will renew their delegation tokens (authentication credentials).
This guide assumes you have already set up a Kerberos-enabled HDFS cluster.
## Configuring Kerberos with DC/OS Data Science Engine
DC/OS Data Science Engine and all Kerberos-enabled components need a valid `krb5.conf` configuration file. The `krb5.conf` file tells `data-science-engine` how to connect to your Kerberos key distribution center (KDC). You can specify properties for the `krb5.conf` file with the following options:
```json
{
  "security": {
    "kerberos": {
      "enabled": true,
      "kdc": {
        "hostname": "<kdc_hostname>",
        "port": <kdc_port>
      },
      "primary": "<primary_for_principal>",
      "realm": "<kdc_realm>",
      "keytab_secret": "<path_to_keytab_secret>"
    }
  }
}
```
Make sure your `keytab` file is in the DC/OS secret store, under a path that is accessible by the `data-science-engine` service.
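For example, you can store the keytab as a secret and then install the service with the options above. This is a minimal sketch, assuming the DC/OS Enterprise CLI with the `security` subcommand is installed; the file names and secret path are hypothetical and must match the `keytab_secret` value in your options:

```bash
# Base64-encode the keytab; the "__dcos_base64__" prefix in the secret name
# tells the service to decode the value before use.
base64 -w 0 jupyter.keytab > jupyter.keytab.base64

# Store the encoded keytab in the DC/OS secret store.
dcos security secrets create __dcos_base64__keytab --value-file jupyter.keytab.base64

# Install the service with the Kerberos options saved in options.json.
dcos package install data-science-engine --options=options.json
```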
## Example: Using HDFS with Spark in a Kerberized Environment
Here is an example notebook that runs TensorFlow on Spark using HDFS as a storage backend in a Kerberized environment.
First of all, make sure the HDFS service is installed and DC/OS Data Science Engine is configured with its endpoint. To read more about configuring HDFS integration for DC/OS Data Science Engine, see the Using HDFS with DC/OS Data Science Engine section.
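If you want to confirm what the HDFS service exposes before configuring the integration, the HDFS service CLI can list its client configuration endpoints. A minimal sketch, assuming the service is installed under the default name `hdfs`:

```bash
# List the configuration endpoints exposed by the HDFS service.
dcos hdfs endpoints

# Retrieve one of them, for example the client hdfs-site.xml.
dcos hdfs endpoints hdfs-site.xml
```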
- Make sure the HDFS Client service is installed and running with the “Kerberos enabled” option.
- Run the following commands to set up a directory on HDFS with the proper permissions:

  ```bash
  # Suppose the HDFS Client version you are running is "2.6.0-cdh5.9.1";
  # the commands will then be:
  dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -mkdir -p /data-science-engine'
  # Suppose the name of the primary mentioned above is "jupyter":
  dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chown jupyter:jupyter /data-science-engine'
  dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chmod 700 /data-science-engine'
  ```
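  To confirm that the ownership and permissions took effect, you can list the root directory from the same client task (a quick sanity check, not part of the original steps):

  ```bash
  # Expect /data-science-engine to be owned by jupyter:jupyter with mode drwx------.
  dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -ls /'
  ```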
- Launch Terminal from the Notebook UI.
- Clone the TensorFlowOnSpark repository and download a sample dataset:

  ```bash
  rm -rf TensorFlowOnSpark && git clone https://github.com/yahoo/TensorFlowOnSpark
  rm -rf mnist && mkdir mnist
  curl -fsSL -O https://infinity-artifacts.s3-us-west-2.amazonaws.com/jupyter/mnist.zip
  unzip -d mnist/ mnist.zip
  ```
- List the files in the target HDFS directory, and remove it if it already exists (the `rm` runs only when the `ls` succeeds):

  ```bash
  hdfs dfs -ls -R /data-science-engine/mnist_kerberos && hdfs dfs -rm -R /data-science-engine/mnist_kerberos
  ```
- Generate sample data and save it to HDFS:

  ```bash
  spark-submit \
    --verbose \
    $(pwd)/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
    --output /data-science-engine/mnist_kerberos/csv \
    --format csv

  hdfs dfs -ls -R /data-science-engine/mnist_kerberos
  ```
- Train the model and checkpoint it to the target directory in HDFS. You will need to specify two additional options to distribute the Kerberos ticket cache file to the executors: `--files <Kerberos ticket cache file>` and `--conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99"`. The executors will use the Kerberos ticket cache file to authenticate with the Kerberized HDFS:

  ```bash
  spark-submit \
    --files /tmp/krb5cc_99 \
    --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
    --verbose \
    --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
    $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
    --cluster_size 4 \
    --images /data-science-engine/mnist_kerberos/csv/train/images \
    --labels /data-science-engine/mnist_kerberos/csv/train/labels \
    --format csv \
    --mode train \
    --model /data-science-engine/mnist_kerberos/mnist_csv_model
  ```
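  If there is no ticket cache at `/tmp/krb5cc_99` yet, you can create one with `kinit` before submitting the job. A minimal sketch; the keytab path and principal are hypothetical, so substitute your own:

  ```bash
  # Obtain a ticket and write it to the cache file passed via --files above.
  kinit -c /tmp/krb5cc_99 -kt /mnt/mesos/sandbox/jupyter.keytab jupyter@EXAMPLE.COM

  # Verify that the cache contains a valid ticket.
  klist -c /tmp/krb5cc_99
  ```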
- Verify that the model has been saved:

  ```bash
  hdfs dfs -ls -R /data-science-engine/mnist_kerberos/mnist_csv_model
  ```
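- Optionally, copy the checkpoint out of HDFS to inspect it locally; the local destination path here is an arbitrary example:

  ```bash
  # Download the saved model checkpoint from Kerberized HDFS to the local filesystem.
  hdfs dfs -get /data-science-engine/mnist_kerberos/mnist_csv_model ./mnist_csv_model
  ls -la ./mnist_csv_model
  ```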