This guide shows how to build a custom Docker image for Spark executors. A custom image makes additional libraries available to the executors that run your distributed Spark jobs.
- To identify the currently used image, open a Terminal from the Notebook UI and run the following command:

    cat /opt/spark/conf/spark-defaults.conf | grep "spark.mesos.executor.docker.image"
    # Output would be something like this:
    # mesosphere/jupyter-service-worker:26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613

- Create a Dockerfile. This example installs spacy with conda install -yq spacy for demo purposes. On your personal laptop or server, create a file named Dockerfile and put the following content into it:

    FROM mesosphere/jupyter-service-worker:26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613
    USER nobody
    RUN conda install -yq spacy

- Build and push the image. From the directory where the Dockerfile was created, run docker build and docker push. Assuming that the Docker repository name is docker123 and the image name is spacy-example, the commands are:

    docker build -t docker123/spacy-example .
    docker push docker123/spacy-example

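The steps above can also be scripted. The following is a minimal sketch rather than part of the product: the helper names find_executor_image, render_dockerfile, and docker_commands are hypothetical, and the parser assumes that spark-defaults.conf separates keys from values with whitespace or "=". The code only builds strings and argument lists; it does not run Docker.

```python
import re

PROPERTY = "spark.mesos.executor.docker.image"


def find_executor_image(conf_text, prop=PROPERTY):
    """Return the value of `prop` from spark-defaults.conf text, or None."""
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Assumed format: key and value separated by whitespace or "=".
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) == 2 and parts[0] == prop:
            return parts[1].strip()
    return None


def render_dockerfile(base_image, conda_packages):
    """Build Dockerfile text that installs conda packages on top of base_image."""
    lines = [f"FROM {base_image}", "USER nobody"]
    if conda_packages:
        lines.append("RUN conda install -yq " + " ".join(conda_packages))
    return "\n".join(lines) + "\n"


def docker_commands(repository, image_name):
    """Return the docker build and docker push commands as argument lists."""
    tag = f"{repository}/{image_name}"
    return [["docker", "build", "-t", tag, "."], ["docker", "push", tag]]


if __name__ == "__main__":
    sample_conf = (
        "spark.mesos.executor.docker.image  "
        "mesosphere/jupyter-service-worker:"
        "26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613\n"
    )
    base = find_executor_image(sample_conf)
    print(render_dockerfile(base, ["spacy"]))
    for command in docker_commands("docker123", "spacy-example"):
        print(" ".join(command))
```

Returning the commands as argument lists keeps them ready for subprocess.run without shell quoting, should you choose to execute them from Python.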
Example Notebook
In the following example, you will use a user-defined function (UDF) to import the library installed in the custom image.
Open a Python Notebook and put the following in a code cell:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Specify custom image to be used, via spark.mesos.executor.docker.image configuration property
spark = SparkSession.builder \
    .config("spark.mesos.executor.docker.image", "docker123/spacy-example") \
    .appName("Test UDF") \
    .getOrCreate()

def test_func(x):
    # The import runs on the executor, so it succeeds only if the image provides spacy
    import spacy
    return x

spark.udf.register("test_func", test_func, StringType())
# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0
spark.range(1, 20).createOrReplaceTempView("test")
spark.sql("SELECT test_func(id) from test").collect()
spark.stop()
Expected output would be:
[Row(test_func(id)='1'),
Row(test_func(id)='2'),
Row(test_func(id)='3'),
Row(test_func(id)='4'),
Row(test_func(id)='5'),
Row(test_func(id)='6'),
Row(test_func(id)='7'),
Row(test_func(id)='8'),
Row(test_func(id)='9'),
Row(test_func(id)='10'),
Row(test_func(id)='11'),
Row(test_func(id)='12'),
Row(test_func(id)='13'),
Row(test_func(id)='14'),
Row(test_func(id)='15'),
Row(test_func(id)='16'),
Row(test_func(id)='17'),
Row(test_func(id)='18'),
Row(test_func(id)='19')]
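
If the job fails instead of producing this output, it helps to check whether the executors can import the library at all. Below is a minimal sketch of a diagnostic function that uses only the Python standard library. The name probe_import is hypothetical and is not part of Spark or the Data Science Engine. It reports whether a module can be imported in the process where it runs, instead of raising an exception.

```python
import importlib


def probe_import(module_name):
    """Report whether module_name can be imported in the current process."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught here too.
        return f"missing: {module_name} ({exc})"
    version = getattr(module, "__version__", "unknown")
    return f"ok: {module_name} {version}"


if __name__ == "__main__":
    print(probe_import("json"))   # a standard library module, always importable
    print(probe_import("spacy"))  # importable only where spacy is installed
```

Registered in place of test_func with spark.udf.register("probe_import", probe_import, StringType()) and applied to the string literal 'spacy', the function should return its result from the executor side. This usage is an assumption and has not been verified on a Mesos cluster.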