Prerequisites
If you plan to read and write from HDFS using DC/OS Data Science Engine, there are two Hadoop configuration files that you should include in the classpath:

- `hdfs-site.xml`, which provides default behaviors for the HDFS client.
- `core-site.xml`, which sets the default file system name.
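For reference, a minimal `core-site.xml` might look like the following sketch; the `fs.defaultFS` value shown is an illustrative assumption and must match your actual HDFS deployment:

```xml
<!-- Minimal illustrative core-site.xml. The fs.defaultFS value below is an
     assumption (a default Mesosphere HDFS nameservice); substitute the
     file system name of your own deployment. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs</value>
  </property>
</configuration>
```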
You can specify the location of these files at install time or for each DC/OS Data Science Engine instance.
Configuring DC/OS Data Science Engine to work with HDFS
Within the DC/OS Data Science Engine service configuration, set `service.jupyter_conf_urls` to a list of URLs that serve your `hdfs-site.xml` and `core-site.xml`. The following example assumes both files are available under http://mydomain.com/hdfs-config/ (that is, at http://mydomain.com/hdfs-config/hdfs-site.xml and http://mydomain.com/hdfs-config/core-site.xml):
```json
{
  "service": {
    "jupyter_conf_urls": "http://mydomain.com/hdfs-config"
  }
}
```
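If you are installing from the CLI rather than the UI, the options above can be applied at install time. A minimal sketch, assuming the package name `data-science-engine` and that the JSON above is saved as `options.json`:

```bash
# Install the service with the HDFS options defined in options.json.
dcos package install data-science-engine --options=options.json
```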
You can also specify the URLs through the UI. If you are using the default installation of HDFS from Mesosphere, the URL is http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints for an HDFS service installed under the name `hdfs`.
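As a quick sanity check, you can fetch that endpoints URL yourself; for a default HDFS installation it should list the served Hadoop configuration files (this assumes the request is made from a node that can resolve the `.thisdcos.directory` address):

```bash
# Should list the Hadoop config endpoints, e.g. hdfs-site.xml and core-site.xml.
curl -s http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints
```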
Example of Using HDFS with Spark
Here is an example notebook workflow for TensorFlowOnSpark using HDFS as a storage backend.
- Launch Terminal from the Notebook UI. This step is mandatory; all subsequent commands will be executed from the Terminal.
- Clone the TensorFlowOnSpark repository and download the sample dataset:

  ```bash
  rm -rf TensorFlowOnSpark && git clone https://github.com/yahoo/TensorFlowOnSpark
  rm -rf mnist && mkdir mnist
  curl -fsSL -O https://infinity-artifacts.s3-us-west-2.amazonaws.com/jupyter/mnist.zip
  unzip -d mnist/ mnist.zip
  ```
- List the files in the target HDFS directory and remove it if it is not empty:

  ```bash
  hdfs dfs -ls -R mnist/ && hdfs dfs -rm -R mnist/
  ```
- Generate sample data and save it to HDFS:

  ```bash
  spark-submit \
    --verbose \
    $(pwd)/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
    --output mnist/csv \
    --format csv

  hdfs dfs -ls -R mnist
  ```
- Train the model and checkpoint it to the target directory in HDFS:

  ```bash
  spark-submit \
    --verbose \
    --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
    $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
    --cluster_size 4 \
    --images mnist/csv/train/images \
    --labels mnist/csv/train/labels \
    --format csv \
    --mode train \
    --model mnist/mnist_csv_model
  ```
- Verify that the model has been saved:

  ```bash
  hdfs dfs -ls -R mnist/mnist_csv_model
  ```
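Once training completes, the checkpointed model can be exercised in inference mode. The following run is a sketch based on the inference mode of the same TensorFlowOnSpark example script; the `mnist/predictions` output path is an illustrative choice:

```bash
# Run the example in inference mode against the test set; predictions are
# written to the (illustrative) mnist/predictions directory in HDFS.
spark-submit \
  --verbose \
  --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --cluster_size 4 \
  --images mnist/csv/test/images \
  --labels mnist/csv/test/labels \
  --format csv \
  --mode inference \
  --model mnist/mnist_csv_model \
  --output mnist/predictions

hdfs dfs -ls -R mnist/predictions
```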
S3
To set up S3 connectivity, your cluster must be running in permissive or strict security mode. If a service account has not been created, follow the steps described in the Security section to create one.
After your service account is created, follow these steps to create AWS credentials secrets and configure DC/OS Data Science Engine to use them for authenticating with S3:
- Upload your credentials to the DC/OS secret store:

  ```bash
  dcos security secrets create <secret_path_for_key_id> -v <AWS_ACCESS_KEY_ID>
  dcos security secrets create <secret_path_for_secret_key> -v <AWS_SECRET_ACCESS_KEY>
  ```
- Grant the previously created service account read access to the secrets:

  ```bash
  dcos security org users grant <SERVICE_ACCOUNT> dcos:secrets:list:default:<secret_path_for_key_id> read
  dcos security org users grant <SERVICE_ACCOUNT> dcos:secrets:list:default:<secret_path_for_secret_key> read
  ```
- After uploading your credentials, the DC/OS Data Science Engine service can read them via service options:

  ```json
  {
    "service": {
      "service_account": "<service-account-id>",
      "service_account_secret": "<service-account-secret>"
    },
    "s3": {
      "aws_access_key_id": "<secret_path_for_key_id>",
      "aws_secret_access_key": "<secret_path_for_secret_key>"
    }
  }
  ```
- To enable Spark integration with credentials-based access to S3, change Spark's credentials provider to `com.amazonaws.auth.EnvironmentVariableCredentialsProvider` in the service options:

  ```json
  {
    "service": {
      "service_account": "<service-account-id>",
      "service_account_secret": "<service-account-secret>"
    },
    "spark": {
      "spark_hadoop_fs_s3a_aws_credentials_provider": "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
    },
    "s3": {
      "aws_access_key_id": "<secret_path_for_key_id>",
      "aws_secret_access_key": "<secret_path_for_secret_key>"
    }
  }
  ```
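Once the service is deployed with these options, a quick way to confirm connectivity is to list a bucket through the same s3a connector Spark uses. This is a sketch: the bucket name `my-bucket` is hypothetical, and it assumes the hadoop-aws connector is on the classpath and that the AWS credentials are exposed as environment variables in the notebook environment, which is what `EnvironmentVariableCredentialsProvider` expects:

```bash
# From the notebook Terminal: list a (hypothetical) bucket via the s3a
# connector. Requires AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY to be set
# in the environment.
hdfs dfs -ls s3a://my-bucket/
```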