HDFS

ENTERPRISE

Using HDFS with DC/OS Data Science Engine

Prerequisites

NOTE: If you are planning to use HDFS on DC/OS Data Science Engine, you will need a minimum of five nodes.

If you plan to read and write from HDFS using DC/OS Data Science Engine, there are two Hadoop configuration files that you should include in the classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default file system name.

You can specify the location of these files at install time or for each DC/OS Data Science Engine instance.

Configuring DC/OS Data Science Engine to work with HDFS

Within the DC/OS Data Science Engine service configuration, set service.jupyter_conf_urls to be a list of URLs that serves your hdfs-site.xml and core-site.xml. The following example uses http://mydomain.com/hdfs-config/hdfs-site.xml and http://mydomain.com/hdfs-config/core-site.xml URLs:

{
 "service": {
   "jupyter_conf_urls": "http://mydomain.com/hdfs-config"
 }
}

You can also specify the URLs through the UI. If you are using the default installation of HDFS from Mesosphere, this would be http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints for HDFS service installed with the hdfs name.

Example of Using HDFS with Spark

Here is an example of running Spark Job using HDFS as a storage backend.

  1. Launch Python3 Notebook from the Notebook UI. Put the following code in a code cell.

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("HDFS Read Write Example").getOrCreate()
    
    hdfs_path = "hdfs://hdfs/jupyter/test"
    spark.range(10).write.parquet(hdfs_path)
    result = spark.read.parquet(hdfs_path)
    print("COUNT={}".format(result.count()))
    
    spark.stop()
    

    The expected output would be:

    COUNT=10
    
  2. Verify that the file has been saved. Go to the Terminal from the Notebook UI and run following command.

    hdfs dfs -ls -R /jupyter/test
    

    The expected output would be:

    -rw-r--r--   3 nobody supergroup          0 2020-06-09 15:25 /jupyter/test/_SUCCESS
    -rw-r--r--   3 nobody supergroup        443 2020-06-09 15:25 /jupyter/test/part-00000-4260aa54-4302-40a2-8fb6-370fe8393f8b-c000.snappy.parquet
    -rw-r--r--   3 nobody supergroup        445 2020-06-09 15:25 /jupyter/test/part-00001-4260aa54-4302-40a2-8fb6-370fe8393f8b-c000.snappy.parquet