Quick Start

ENTERPRISE

Get up and running with DC/OS Data Science Engine

This page explains how to install the DC/OS Data Science Engine Service.

Prerequisites

  • DC/OS and the DC/OS CLI installed on a cluster with a minimum of three agent nodes, with 8 GB of memory and 10 GB of disk space.
  • Depending on your security mode, DC/OS Data Science Engine requires service authentication for access to DC/OS. See Provisioning a service account for more information.

NOTE: If you are planning to use HDFS on DC/OS Data Science Engine, you will need a minimum of five nodes.

Security Mode    Service Account
Disabled         Not available
Permissive       Optional
Strict           Required

Install DC/OS Data Science Engine

From the DC/OS UI

  1. Select the Catalog tab, and search for DC/OS Data Science Engine. Select the data-science-engine package.

  2. Click the Review & Run button to display the Edit Configuration page.

  3. Configure the package settings using the DC/OS UI or by clicking JSON Editor and modifying the app definition manually. For example, you might customize the package by enabling HDFS support.

  4. Click Review & Run.

  5. Review the installation notes, then click Run Service to deploy the data-science-engine package.

From the command line

Install the data-science-engine package, which deploys the data-science-engine service. The installation may take a few minutes.

dcos package install data-science-engine

Expected output:

Installing Marathon app for package [data-science-engine] version [2.8.0-2.4.0]
DC/OS data-science-engine is being installed!

    Documentation: https://docs.d2iq.com/services/data-science-engine/
    Issues: https://docs.d2iq.com/support/
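
For scripted or repeated installs, the same command can be assembled and run from Python. The following is a minimal sketch, not part of the product: the helper names build_install_command and install are hypothetical, and it assumes the dcos CLI is on the PATH and already attached to the cluster. It relies on the --options and --yes flags of dcos package install.

```python
import subprocess


def build_install_command(package, options_file=None, assume_yes=True):
    """Return the argument list for a `dcos package install` invocation.

    Hypothetical helper for illustration; it only builds the command.
    """
    command = ["dcos", "package", "install", package]
    if options_file is not None:
        # Pass a JSON file with custom package settings.
        command.append(f"--options={options_file}")
    if assume_yes:
        # Skip the interactive confirmation prompt.
        command.append("--yes")
    return command


def install(package, options_file=None):
    """Run the install; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_install_command(package, options_file), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(" ".join(build_install_command("data-science-engine")))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command before anything is deployed.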

Run a Python Notebook Using Spark

  1. From the DC/OS UI, select Services, then click the “Open” icon for the data-science-engine service.

    Figure 1 - Open new Jupyter window

    This opens JupyterLab in a new browser window or tab. Log in with the password you set during installation of the data-science-engine package (the Service -> Jupyter Password option); if you did not set one, the default password is jupyter.

  2. In JupyterLab, create a new notebook by selecting File -> New -> Notebook:

    Figure 2 - Create a new notebook

  3. Select Python 3 as the kernel language.

  4. Rename the notebook to “Estimate Pi.ipynb” using the menu at File -> Rename Notebook.

  5. Paste the following Python code into the notebook. If desired, you can type sections of code into separate cells as shown below.

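from pyspark import SparkContext, SparkConf
import random

# Create a Spark context for this notebook session.
conf = SparkConf().setAppName("pi-estimation")
sc = SparkContext(conf=conf)

num_samples = 100000000

def inside(_):
    # Draw a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Count the sampled points that land inside the quarter circle, using Spark.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# The fraction of points inside the quarter circle approximates pi / 4.
pi = 4 * count / num_samples
print(pi)

# Release the resources held by this Spark context.
sc.stop()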

  6. Run the notebook. From the menu, select Run -> Run All Cells. The notebook runs for some time, then prints the calculated value.

    • Expected output (approximate; the exact digits vary from run to run because the samples are random): 3.1413234
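
To see why the result is only approximately pi, it can help to run the same Monte Carlo estimate locally without Spark. The sketch below is an illustration, not part of the product: the function name estimate_pi and the optional seed parameter are hypothetical. It samples points in the unit square and counts those that fall inside the quarter circle, exactly as the notebook's inside function does.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by sampling random points in the unit square.

    The fraction of points with x*x + y*y < 1 approximates pi / 4.
    """
    rng = random.Random(seed)  # a seed makes the run repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4 * hits / num_samples


if __name__ == "__main__":
    # Far fewer samples than the Spark job, so the estimate is coarser.
    print(estimate_pi(1_000_000))
```

The error of the estimate shrinks roughly with the square root of the number of samples, which is why the notebook distributes 100 million samples across Spark rather than looping over them in one process.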

Enable GPU support

DC/OS Data Science Engine supports GPU acceleration if the cluster nodes have GPUs available and CUDA drivers installed. To enable GPU support, add the following to the service configuration:

"service": {
    "gpu": {
        "enabled": true,
        "gpus": "<desired number of GPUs to allocate for the service>"
    }
}
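
One way to produce this configuration as a file is to generate it from Python. The following is a minimal sketch under stated assumptions: the helper names gpu_options and write_options are hypothetical, and the GPU count is written as an integer, although the source does not say whether the service expects a number or a string. The fragment is wrapped in an outer object so that the output is valid JSON.

```python
import json


def gpu_options(num_gpus):
    """Build the GPU section of the service configuration as a dict.

    Assumption: "gpus" is written as an integer count.
    """
    if isinstance(num_gpus, bool) or not isinstance(num_gpus, int) or num_gpus < 1:
        raise ValueError("num_gpus must be a positive integer")
    return {"service": {"gpu": {"enabled": True, "gpus": num_gpus}}}


def write_options(path, options):
    """Write the options dict to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(options, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    # Print the configuration for one GPU; call write_options to save it.
    print(json.dumps(gpu_options(1), indent=4))
```

A file written this way can be supplied at install time through the --options flag of dcos package install, or its contents can be pasted into the JSON Editor in the DC/OS UI.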