This guide shows how to build a custom Docker image for the Spark executor. A custom image lets you make additional libraries available to distributed Spark jobs.
- To identify the image currently in use, open a Terminal from the Notebook UI and run the following command:
    cat /opt/spark/conf/spark-defaults.conf | grep "spark.mesos.executor.docker.image"
    # Output would be something like this:
    # mesosphere/jupyter-service-worker:26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613
- Create a Dockerfile. We will use conda install -yq spacy as an example, which installs spacy for demo purposes. On your personal laptop or server, create a file named Dockerfile and put the following content into it:
    FROM mesosphere/jupyter-service-worker:26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613
    USER nobody
    RUN conda install -yq spacy
- Build and push the image. From the directory where the Dockerfile was created, run the commands docker build and docker push as shown below. Assuming that the Docker repository name is docker123 and the image name is spacy-example, the commands will be:
    docker build -t docker123/spacy-example .
    docker push docker123/spacy-example
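
The three steps above can also be scripted. The following is a minimal sketch, not part of the product: the helper names `find_executor_image`, `render_dockerfile`, and `docker_commands` are hypothetical, and the parser assumes that each line of spark-defaults.conf separates the property name from its value with whitespace. The sample configuration text is made up for illustration and reuses the image tag shown earlier.

```python
import shlex

IMAGE_PROPERTY = "spark.mesos.executor.docker.image"


def find_executor_image(conf_text, key=IMAGE_PROPERTY):
    """Return the value of `key` in spark-defaults.conf text, or None if absent."""
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # assumption: whitespace-separated key and value
        if len(parts) == 2 and parts[0] == key:
            return parts[1].strip()
    return None


def render_dockerfile(base_image, conda_packages):
    """Return Dockerfile text that installs conda packages on top of base_image."""
    packages = " ".join(conda_packages)
    lines = [
        f"FROM {base_image}",
        "USER nobody",
        f"RUN conda install -yq {packages}",
    ]
    return "\n".join(lines) + "\n"


def docker_commands(repository, image_name):
    """Return the docker build and docker push commands as argument lists."""
    tag = f"{repository}/{image_name}"
    return [
        ["docker", "build", "-t", tag, "."],
        ["docker", "push", tag],
    ]


# Illustrative input only; on a real cluster, read /opt/spark/conf/spark-defaults.conf.
sample_conf = (
    "# sample configuration for illustration\n"
    "spark.mesos.executor.docker.image "
    "mesosphere/jupyter-service-worker:"
    "26a3231f513a686a2fcfb6f9ddb8acd45da467b261311b48a45b2a55bb0f2613\n"
)

base_image = find_executor_image(sample_conf)
print(render_dockerfile(base_image, ["spacy"]))
for command in docker_commands("docker123", "spacy-example"):
    print(shlex.join(command))
```

To execute the commands rather than print them, each argument list could be passed to `subprocess.run(command, check=True)` from the directory that contains the Dockerfile.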
Example Notebook
In the following example, you will use a user-defined function (UDF) that imports the library installed in the custom image.
Open a Python Notebook and put the following in a code cell:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Specify the custom image to be used via the spark.mesos.executor.docker.image configuration property
spark = SparkSession.builder.config("spark.mesos.executor.docker.image", "docker123/spacy-example").appName("Test UDF").getOrCreate()

def test_func(x):
    # The import runs on the executor, so it succeeds only if spacy is installed in the executor image
    import spacy
    # Return a string to match the declared StringType
    return str(x)

spark.udf.register("test_func", test_func, StringType())
# createOrReplaceTempView replaces the deprecated registerTempTable
spark.range(1, 20).createOrReplaceTempView("test")
spark.sql("SELECT test_func(id) from test").collect()
spark.stop()
The expected output is:
[Row(test_func(id)='1'),
Row(test_func(id)='2'),
Row(test_func(id)='3'),
Row(test_func(id)='4'),
Row(test_func(id)='5'),
Row(test_func(id)='6'),
Row(test_func(id)='7'),
Row(test_func(id)='8'),
Row(test_func(id)='9'),
Row(test_func(id)='10'),
Row(test_func(id)='11'),
Row(test_func(id)='12'),
Row(test_func(id)='13'),
Row(test_func(id)='14'),
Row(test_func(id)='15'),
Row(test_func(id)='16'),
Row(test_func(id)='17'),
Row(test_func(id)='18'),
Row(test_func(id)='19')]
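
If the UDF fails with an import error, it can help to check directly whether the executors can import the library. The following is a minimal sketch under stated assumptions: the helper names `probe_import` and `probe_executors` are hypothetical, and `probe_executors` expects an active SparkSession (created as in the cell above, before `spark.stop()` is called).

```python
import importlib


def probe_import(module_name):
    """Try to import module_name in the current process.

    Returns (True, version) on success, where version falls back to "unknown"
    if the module has no __version__ attribute, or (False, error message).
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, str(getattr(module, "__version__", "unknown"))


def probe_executors(spark, module_name, partitions=4):
    """Run probe_import inside Spark tasks and return the distinct results."""
    rdd = spark.sparkContext.parallelize(range(partitions), partitions)
    return rdd.map(lambda _: probe_import(module_name)).distinct().collect()


# Local check on the driver; no Spark session is needed for this call.
print(probe_import("json"))
```

In a notebook, calling `probe_executors(spark, "spacy")` would return a list of `(ok, detail)` tuples. A result containing `False` suggests that at least one executor is not running the custom image.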