Integration with HDFS

Integrating HDFS with the DC/OS Apache Spark service

HDFS

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that you should include in the Spark classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default file system name.

You can specify the location of these files at install time or for each job.

Spark installation

Within the Spark service configuration, set hdfs.config-url to a URL that serves your hdfs-site.xml and core-site.xml. For example, if the files are served at http://mydomain.com/hdfs-config/hdfs-site.xml and http://mydomain.com/hdfs-config/core-site.xml, use the following configuration:

{
  "hdfs": {
    "config-url": "http://mydomain.com/hdfs-config"
  }
}

You can also specify the URL through the UI. If you are using the default installation of HDFS from Mesosphere, the URL is http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints.
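
For example, assuming the JSON above is saved locally as options.json (a file name chosen here for illustration), you can pass it to the package installer from the DC/OS CLI:

dcos package install spark --options=options.json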

Adding HDFS configuration files per-job

To add the configuration files manually for a job, use --conf spark.mesos.uris=<location_of_hdfs-site.xml>,<location_of_core-site.xml> when submitting. This downloads the files to the sandbox of the Spark driver, and DC/OS Spark automatically loads them into the correct location.

IMPORTANT: The files must be named hdfs-site.xml and core-site.xml.
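
For example, a job submission that pulls both files from the mydomain.com location used above might look like the following sketch (the application class and JAR URL are placeholders):

dcos spark run --submit-args="\
--conf spark.mesos.uris=http://mydomain.com/hdfs-config/hdfs-site.xml,http://mydomain.com/hdfs-config/core-site.xml \
--class MySparkJob http://mydomain.com/jars/my-spark-job.jar"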

Spark checkpointing

To use Spark with checkpointing, follow the checkpointing instructions in the Spark Streaming programming guide and use an HDFS directory as the checkpoint directory. For example:

val checkpointDirectory = "hdfs://hdfs/checkpoint"  // a directory on the HDFS cluster
val ssc = ...                                       // your StreamingContext
ssc.checkpoint(checkpointDirectory)

The checkpoint directory is created on HDFS automatically, and the Spark Streaming application recovers from the checkpointed data, even in the event of an application restart or failure.
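
As a fuller sketch, the recovery pattern from the Spark Streaming programming guide uses StreamingContext.getOrCreate so that the context is rebuilt from the checkpoint after a restart. The application name, batch interval, and input stream below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs://hdfs/checkpoint"

// Called only when no checkpoint exists yet; otherwise the context is
// reconstructed from the checkpointed data.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")  // placeholder app name
  val ssc = new StreamingContext(conf, Seconds(10))         // placeholder batch interval
  ssc.checkpoint(checkpointDirectory)
  // Placeholder input and output; replace with your real streams.
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
ssc.start()
ssc.awaitTermination()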

S3

You can read and write files in S3 by passing your AWS credentials as environment variable-based secrets.

  1. Upload your credentials to the DC/OS secret store:

    dcos security secrets create <secret_path_for_key_id> -v <AWS_ACCESS_KEY_ID>
    dcos security secrets create <secret_path_for_secret_key> -v <AWS_SECRET_ACCESS_KEY>
    
  2. After uploading your credentials, Spark jobs can consume them directly; the secrets are injected into the driver as the listed environment variables:

    dcos spark run --submit-args="\
    ...
    --conf spark.mesos.containerizer=mesos  # required for secrets
    --conf spark.mesos.driver.secret.names=<secret_path_for_key_id>,<secret_path_for_secret_key>
    --conf spark.mesos.driver.secret.envkeys=AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY
    ...
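
With the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables available to the job, the application itself needs no credentials in code. A minimal sketch, assuming the Hadoop S3A connector is on the classpath and using a hypothetical bucket and object name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("S3Example").getOrCreate()

// The S3A connector's default credential chain includes the
// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
val lines = spark.read.textFile("s3a://my-bucket/input.txt")  // hypothetical bucket/object
println(s"Read ${lines.count()} lines from S3")
lines.write.text("s3a://my-bucket/output")                    // hypothetical output prefix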