Spark

Using Spark with DC/OS Data Science Engine

DC/OS Data Science Engine comes with Apache Spark integration and allows you to run Spark jobs from notebooks and a terminal.

Launching a Spark job

Using Terminal

Open a Terminal from the Notebook UI and run this example spark-submit job:

spark-submit --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.4.0.jar 100

Using a Python Notebook

Open a Python Notebook and put the following code in a code cell.

from __future__ import print_function
from random import random
from operator import add
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this notebook.
spark = SparkSession.builder.appName("Spark Pi").getOrCreate()

partitions = 100
n = 100000 * partitions

def f(_):
    # Sample a random point in the square [-1, 1] x [-1, 1] and return 1
    # if it falls inside the unit circle, 0 otherwise.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# Monte Carlo estimate of Pi: the fraction of points inside the circle
# approximates the ratio of the circle's area to the square's area (Pi / 4).
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()

Spark UI

The Spark UI starts automatically when a SparkContext is created, and is available at

http://<dcos_url>/service/data-science-engine/sparkui

If multiple Spark sessions are running, the Spark UI for each session is available at

http://<dcos_url>/service/data-science-engine/sparkui/<index>

where <index> is a number (0, 1, ..., N) indicating the order in which the Spark sessions were launched:

The Spark UI for the first session is available at

http://<dcos_url>/service/data-science-engine/sparkui/0

The Spark UI for the second session is available at

http://<dcos_url>/service/data-science-engine/sparkui/1
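
If you are not sure which entry belongs to your session, you can also inspect the address the active SparkContext bound to from inside the notebook. The sketch below uses the standard PySpark uiWebUrl property; the proxied /sparkui/<index> path above is still determined by launch order.

from pyspark.sql import SparkSession

# Reuse the notebook's active session (or create one if none exists).
spark = SparkSession.builder.getOrCreate()

# uiWebUrl reports the host and port this context's Spark UI bound to.
print(spark.sparkContext.uiWebUrl)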

Spark History Server

DC/OS Data Science Engine includes the Spark History Server (SHS), which runs by default using org.apache.spark.deploy.history.FsHistoryProvider as the provider, with spark.history.fs.logDirectory set to file:/mnt/mesos/sandbox/. It is highly recommended to use HDFS as the backend storage for SHS.
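
Whether a given notebook session's runs show up in SHS depends on that session's event-log settings. A minimal sketch for checking what the current session sees, using the standard spark.eventLog.* properties (these are stock Spark configuration keys, not DC/OS-specific):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If spark.eventLog.enabled resolves to "false", runs from this session
# are not written to the history log directory and will not appear in
# the Spark History Server.
print(spark.conf.get("spark.eventLog.enabled", "false"))
print(spark.conf.get("spark.eventLog.dir", "<not set>"))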

You can configure SHS to use HDFS with the following steps:

Installing HDFS

NOTE: HDFS requires at least five private nodes.

  1. Install HDFS:

    dcos package install hdfs
    
  2. Create a history directory in HDFS (the default is /history). SSH into your cluster, start an HDFS client container, and create the directory from inside it:

    docker run -it mesosphere/hdfs-client:1.0.0-2.6.0 bash
    ./bin/hdfs dfs -mkdir /history
    

Configuring Spark History Server

  1. Configure the Spark history log directory in service.json to point to the HDFS directory you created:

    {
        "service": {
            "jupyter_conf_urls": "<DCOS HDFS endpoint>"
        },
        "spark": {
            "start_spark_history_server": true,
            "spark_history_fs_logdirectory": "hdfs://hdfs/history"
        }
    }
    

    For more information about configuring HDFS integration for DC/OS Data Science Engine, see the Using HDFS with DC/OS Data Science Engine documentation.

  2. Enable the Spark event log and set the HDFS event log directory (a per-session equivalent is sketched after these steps):

    {
        "spark": {
            "start_spark_history_server": true,
            "spark_history_fs_logdirectory": "hdfs://hdfs/history",
            "spark_eventlog_enabled": true,
            "spark_eventlog_dir": "hdfs://hdfs/history"
        }
    }
    
  3. Restart the data-science-engine service to apply the changes.
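
The spark_eventlog_enabled and spark_eventlog_dir options above correspond to Spark's standard event-log properties, spark.eventLog.enabled and spark.eventLog.dir. For reference, a minimal per-session sketch that sets the same properties directly when building a SparkSession (hdfs://hdfs/history is the directory created earlier):

from pyspark.sql import SparkSession

# Sketch: enable event logging for a single session so its runs appear in
# the Spark History Server backed by the HDFS directory configured above.
spark = (SparkSession.builder
         .appName("History-enabled session")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "hdfs://hdfs/history")
         .getOrCreate())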

Confirming Spark History Server installation

The Spark History Server UI is available at http://<dcos_url>/service/data-science-engine/sparkhistory, listing incomplete and completed applications and attempts.
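
The history server also exposes Spark's standard REST API under /api/v1. The following sketch is illustrative only: it assumes the API is reachable through the same proxied path and that you pass a valid DC/OS authentication token (read here from a hypothetical DCOS_ACS_TOKEN environment variable).

import os
import requests

# Placeholder values: substitute your cluster URL and a valid DC/OS token.
dcos_url = "https://<dcos_url>"
token = os.environ["DCOS_ACS_TOKEN"]

# Spark's history REST API lists completed and incomplete applications.
resp = requests.get(
    dcos_url + "/service/data-science-engine/sparkhistory/api/v1/applications",
    headers={"Authorization": "token=" + token},
)
resp.raise_for_status()
for app in resp.json():
    print(app["id"], app["name"])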