Kerberos is an authentication system that allows DC/OS Data Science Engine to read data from, and write data to, a Kerberos-enabled HDFS cluster securely. Long-running jobs will renew their delegation tokens (authentication credentials). This section assumes you have previously set up a Kerberos-enabled HDFS cluster.
## Configuring Kerberos with DC/OS Data Science Engine
DC/OS Data Science Engine and all Kerberos-enabled components need a valid `krb5.conf` configuration file. The `krb5.conf` file tells `data-science-engine` how to connect to your Kerberos key distribution center (KDC). You can specify properties for the `krb5.conf` file with the following options.
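As a rough sketch, assuming the Kerberos option keys follow the same pattern as other DC/OS SDK services (the KDC hostname and port, realm, and secret path below are placeholders), an `options.json` might look like this; check `dcos package describe data-science-engine --config` for the authoritative schema:

```bash
cat > options.json <<'EOF'
{
  "service": {
    "security": {
      "kerberos": {
        "enabled": true,
        "kdc": {
          "hostname": "kdc.marathon.autoip.dcos.thisdcos.directory",
          "port": 2500
        },
        "realm": "LOCAL",
        "keytab_secret": "__dcos_base64__keytab"
      }
    }
  }
}
EOF
dcos package install data-science-engine --options=options.json
```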
Make sure your `keytab` file is in the DC/OS secret store, under a path that is accessible by the `data-science-engine` service.
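For example, a keytab can be stored as a base64-encoded secret with the DC/OS Enterprise CLI (the secret path `__dcos_base64__keytab` is illustrative; the `__dcos_base64__` prefix tells DC/OS to decode the value before mounting it):

```bash
# Encode the keytab, then store it in the secret store:
base64 -w 0 hdfs.keytab > hdfs.keytab.base64
dcos security secrets create __dcos_base64__keytab --value-file hdfs.keytab.base64
```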
## Example: Using HDFS with Spark in a Kerberized Environment
This example notebook runs TensorFlow on Spark, using HDFS as a storage backend in a Kerberized environment. First of all, make sure that the HDFS service is installed and that DC/OS Data Science Engine is configured with its endpoint. To find out more about configuring the HDFS integration of DC/OS Data Science Engine, follow the Using HDFS with DC/OS Data Science Engine section.
- Make sure the HDFS Client service is installed and running with the “Kerberos enabled” option.
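  One way to check from the DC/OS CLI (the app name `hdfs-client` is an assumption; use whatever name you deployed the client under):

  ```bash
  # Confirm the HDFS package is installed and the client app is running:
  dcos package list hdfs
  dcos marathon app list | grep hdfs
  ```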
- Run the following commands to set up a directory on HDFS with proper permissions:
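  A minimal sketch, run from the HDFS client (the task name, keytab path, principal, and realm are placeholders; `nobody`, uid 99, is the default user the notebook runs as):

  ```bash
  # Open a shell in the HDFS client task:
  dcos task exec --interactive --tty hdfs-client bash
  # Inside the client: authenticate, then create a home directory
  # for the notebook user and hand over ownership:
  kinit -kt /path/to/hdfs.keytab hdfs@LOCAL
  hdfs dfs -mkdir -p /user/nobody
  hdfs dfs -chown nobody:nobody /user/nobody
  ```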
- Launch Terminal from the Notebook UI.
- Clone the TensorFlow on Spark repository and download a sample dataset:
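  For example (the repository is the public `yahoo/TensorFlowOnSpark` project; the MNIST dataset URL is illustrative and may move):

  ```bash
  git clone https://github.com/yahoo/TensorFlowOnSpark
  cd TensorFlowOnSpark
  # Fetch a sample MNIST archive for the examples:
  curl -fsSL -O https://s3.amazonaws.com/img-datasets/mnist.npz
  ```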
- List the files in the target HDFS directory and remove it if it is not empty:
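  Assuming the Hadoop CLI is available in the notebook container and `mnist` is the target directory (adjust to your own path):

  ```bash
  # Inspect the directory, then delete it recursively if present:
  hdfs dfs -ls -R mnist/ && hdfs dfs -rm -r mnist/
  ```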
- Generate sample data and save it to HDFS:
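  A sketch using the data-setup script from the upstream TensorFlowOnSpark MNIST example (the script path and its arguments follow the upstream examples and may differ between releases). Because this job also writes to Kerberized HDFS, the two ticket-cache options described in the training step below are passed here as well; run `kinit` first so the ticket cache exists (for uid 99, the default cache is `/tmp/krb5cc_99`):

  ```bash
  # Obtain a ticket; the keytab path and principal are placeholders:
  kinit -kt /mnt/mesos/sandbox/hdfs.keytab client@LOCAL

  spark-submit \
    --files /tmp/krb5cc_99 \
    --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
    TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
    --output mnist/csv --format csv
  ```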
- Train the model and checkpoint it to the target directory in HDFS. You will need to specify two additional options to distribute the Kerberos ticket cache file to executors: `--files <Kerberos ticket cache file>` and `--conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99"`. The Kerberos ticket cache file will be used by executors for authentication with Kerberized HDFS:
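  A sketch following the upstream TensorFlowOnSpark MNIST training example (the script names, dataset paths, and `mnist_model` output directory come from that example and may differ between releases); the two Kerberos options above are the essential part:

  ```bash
  spark-submit \
    --py-files TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
    --files /tmp/krb5cc_99 \
    --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
    TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
    --images mnist/csv/train/images --labels mnist/csv/train/labels \
    --format csv --mode train --model mnist_model
  ```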
- Verify that the model has been saved:
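  For example, by listing the checkpoint directory (`mnist_model` matches the training sketch above):

  ```bash
  hdfs dfs -ls -R mnist_model
  ```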