DC/OS Data Science Engine comes with Apache Spark integration and allows you to run Spark jobs from notebooks and a terminal.
Launching a Spark job
Using Terminal
Open a Terminal from the Notebook UI and run this example spark-submit job:
spark-submit --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.4.0.jar 100
Using a Python Notebook
Open a Python Notebook and put the following code in a code cell:
from __future__ import print_function
from random import random
from operator import add
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this notebook.
spark = SparkSession.builder.appName("Spark Pi").getOrCreate()

partitions = 100
n = 100000 * partitions

def f(_):
    # Sample a random point in the unit square and test whether it falls inside the unit circle.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# Count the samples inside the circle and estimate Pi from the ratio.
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
Spark UI
The Spark UI starts automatically when a SparkContext is created, and is available at
http://<dcos_url>/service/data-science-engine/sparkui
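You can also ask a running SparkContext for the address it reports for its own UI from within a notebook. This is the internal address; on DC/OS you reach the UI through the proxied URL above. A minimal sketch:
from pyspark.sql import SparkSession

# Reuse the session created in the notebook (or create one) and print the
# address Spark itself reports for its UI. On DC/OS, access the UI through
# the proxied /service/data-science-engine/sparkui URL instead.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.uiWebUrl)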
If you have multiple Spark sessions, the Spark UI for each session is available at
http://<dcos_url>/service/data-science-engine/sparkui/<index>
where <index> is a number (0, 1, ..., N) reflecting the order in which the sessions were launched:
The Spark UI for the first session is available at
http://<dcos_url>/service/data-science-engine/sparkui/0
The Spark UI for the second session is available at
http://<dcos_url>/service/data-science-engine/sparkui/1
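For example, a session created from a second notebook runs as its own Spark application, so its UI would typically be served under the next index. The application name below is only illustrative:
from pyspark.sql import SparkSession

# Run in a second notebook kernel: this starts a separate Spark application.
# If it is the second session launched, its UI appears under .../sparkui/1.
spark = SparkSession.builder.appName("Second Spark Session").getOrCreate()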
Spark History Server
DC/OS Data Science Engine includes the Spark History Server (SHS), which runs by default using org.apache.spark.deploy.history.FsHistoryProvider as the default provider, with spark.history.fs.logDirectory set to file:/mnt/mesos/sandbox/. It is highly recommended to use HDFS as the backend storage for SHS. You can configure SHS to use HDFS with the following steps:
Installing HDFS
- Install HDFS:
dcos package install hdfs
- Create a history HDFS directory (the default is /history). SSH into your cluster and run:
docker run -it mesosphere/hdfs-client:1.0.0-2.6.0 bash
./bin/hdfs dfs -mkdir /history
Configuring Spark History Server
- Configure the Spark history log directory to point to the HDFS directory you created, in service.json:
{
  "service": {
    "jupyter_conf_urls": "<DCOS HDFS endpoint>"
  },
  "spark": {
    "start_spark_history_server": true,
    "spark_history_fs_logdirectory": "hdfs://hdfs/history"
  }
}
To learn more about configuring HDFS integration for DC/OS Data Science Engine, see the Using HDFS with DC/OS Data Science Engine documentation.
- Enable the Spark event log and set the HDFS directory (a per-session sketch of these settings follows this list):
{
  "spark": {
    "start_spark_history_server": true,
    "spark_history_fs_logdirectory": "hdfs://hdfs/history",
    "spark_eventlog_enabled": true,
    "spark_eventlog_dir": "hdfs://hdfs/history"
  }
}
- Restart the data-science-engine service to apply the changes.
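The event log options can also be set per notebook session instead of (or in addition to) the service-wide defaults above. A minimal sketch, assuming the hdfs://hdfs/history directory configured earlier:
from pyspark.sql import SparkSession

# Sketch: point an individual notebook session at the event log directory used
# by the Spark History Server. With the service-level options above, these
# settings are normally applied as defaults already.
spark = (SparkSession.builder
    .appName("History Server Example")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs://hdfs/history")
    .getOrCreate())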
Confirm Spark History Server installation
The Spark History Server UI is available at http://<dcos_url>/service/data-science-engine/sparkhistory, listing incomplete and completed applications and attempts.
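If you prefer to check programmatically, the history server also serves Spark's REST API. The sketch below assumes the API is reachable under the same proxied path and that you have a valid DC/OS authentication token; the placeholder values and token handling are illustrative and depend on your cluster's security setup:
import requests

# Hypothetical placeholders: substitute your cluster URL and a valid ACS token.
dcos_url = "https://<dcos_url>"
acs_token = "<auth token>"

response = requests.get(
    dcos_url + "/service/data-science-engine/sparkhistory/api/v1/applications",
    headers={"Authorization": "token=" + acs_token},
)
# Each entry describes a completed or incomplete application known to the history server.
print(response.json())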