Configuring DC/OS Data Science Engine to work with S3
To set up S3 connectivity, your cluster must be running in permissive or strict security mode. If a service account has not been created yet, follow the steps described in the Security section to create one.
After your service account is created, follow these steps to create AWS credentials secrets and configure DC/OS Data Science Engine to use them for authenticating with S3:
- Upload your credentials to the DC/OS secret store:

  ```bash
  dcos security secrets create <secret_path_for_key_id> -v <AWS_ACCESS_KEY_ID>
  dcos security secrets create <secret_path_for_secret_key> -v <AWS_SECRET_ACCESS_KEY>
  ```
- Grant the previously created service account read access to the secrets:

  ```bash
  dcos security org users grant <SERVICE_ACCOUNT> dcos:secrets:list:default:<secret_path_for_key_id> read
  dcos security org users grant <SERVICE_ACCOUNT> dcos:secrets:list:default:<secret_path_for_secret_key> read
  ```
- After uploading your credentials, the DC/OS Data Science Engine service can pick them up via service options:

  ```json
  {
    "service": {
      "service_account": "<service-account-id>",
      "service_account_secret": "<service-account-secret>"
    },
    "s3": {
      "aws_access_key_id": "<secret_path_for_key_id>",
      "aws_secret_access_key": "<secret_path_for_secret_key>"
    }
  }
  ```
- To enable the Spark integration with credentials-based access to S3, change Spark's credentials provider to `com.amazonaws.auth.EnvironmentVariableCredentialsProvider` in the service options:

  ```json
  {
    "service": {
      "service_account": "<service-account-id>",
      "service_account_secret": "<service-account-secret>"
    },
    "spark": {
      "spark_hadoop_fs_s3a_aws_credentials_provider": "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
    },
    "s3": {
      "aws_access_key_id": "<secret_path_for_key_id>",
      "aws_secret_access_key": "<secret_path_for_secret_key>"
    }
  }
  ```
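The `spark_hadoop_fs_s3a_aws_credentials_provider` option presumably maps to the Hadoop property `fs.s3a.aws.credentials.provider`; the `EnvironmentVariableCredentialsProvider` it names reads `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` from the environment. If you prefer to set the same property from a notebook session rather than through the service options, the following is a minimal sketch. It assumes the secrets are exposed to the notebook container as those two environment variables and uses a placeholder bucket name.

```python
import os
from pyspark.sql import SparkSession

# Sanity check: EnvironmentVariableCredentialsProvider reads these two
# variables; they should be present if the secrets were wired up correctly.
assert "AWS_ACCESS_KEY_ID" in os.environ and "AWS_SECRET_ACCESS_KEY" in os.environ

# Set the equivalent Hadoop property directly on the session (an
# alternative to configuring it through the package's service options).
spark = (
    SparkSession.builder
    .appName("S3 Credentials Provider Example")
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider",
    )
    .getOrCreate()
)

# Write a tiny DataFrame to a hypothetical bucket/prefix as a smoke test.
spark.range(1).write.mode("overwrite").parquet("s3a://<S3_BUCKET_NAME>/jupyter/smoke-test")
spark.stop()
```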
Example of Using S3 with Spark
Here is an example of running a Spark job that uses S3 as a storage backend.
- Launch a Python 3 notebook from the Notebook UI and put the following code in a code cell:

  ```python
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("S3 Read Write Example").getOrCreate()

  s3_path = "<S3_BUCKET_NAME>/jupyter/test"
  spark.range(10).write.parquet('s3a://{}'.format(s3_path))

  result = spark.read.parquet('s3a://{}'.format(s3_path))
  print("COUNT={}".format(result.count()))

  spark.stop()
  ```

  The expected output is:

  ```
  COUNT=10
  ```
- Verify that the files have been saved. Open a terminal from the Notebook UI and run the following command:

  ```bash
  aws s3 ls s3://<S3_BUCKET_NAME>/jupyter/test --recursive
  ```

  The expected output is:

  ```
  2020-06-09 15:31:01          0 jupyter/test/_SUCCESS
  2020-06-09 15:30:56        443 jupyter/test/part-00000-4a5040f9-a967-400a-b49b-9474f18be733-c000.snappy.parquet
  2020-06-09 15:30:59        445 jupyter/test/part-00001-4a5040f9-a967-400a-b49b-9474f18be733-c000.snappy.parquet
  ```
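If the AWS CLI is not available in the terminal image, a similar check can be run from a notebook cell with `boto3`. This is a minimal sketch under the assumptions that `boto3` is installed in the notebook environment and that the credentials are exposed as the standard AWS environment variables; the bucket and prefix are the placeholders used above.

```python
import boto3

# List the objects the Spark job wrote under the test prefix.
# Assumes boto3 is installed and reads AWS_ACCESS_KEY_ID /
# AWS_SECRET_ACCESS_KEY from the container environment.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="<S3_BUCKET_NAME>", Prefix="jupyter/test/")

for obj in response.get("Contents", []):
    print(obj["LastModified"], obj["Size"], obj["Key"])
```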