Tutorial - Resource Allocation

IMPORTANT: Tutorials are intended to give you hands-on experience working with a limited set of DC/OS features with no implied or explicit warranty of any kind. None of the information provided--including sample scripts, commands, or applications--is officially supported by Mesosphere. You should not use this information in a production environment without independent testing and validation.

Scenario 1: Resource Allocation

Setup

For this first scenario, deploy this app definition as follows:

$ dcos marathon app add https://raw.githubusercontent.com/dcos-labs/dcos-debugging/master/1.10/app-scaling1.json

Check the application status using the DC/OS web interface. The status of the application will most likely be “Waiting”, followed by a fraction of the form “x/1000”. “Waiting” refers to the overall application status, and “x” represents how many of the 1000 requested instances have successfully deployed (6 in this example).

You can also check this status from the CLI:

$ dcos marathon app list

which produces output similar to the following:

ID              MEM   CPUS  TASKS   HEALTH  DEPLOYMENT  WAITING  CONTAINER  CMD
/app-scaling-1  128    1    6/1000   ---      scale     True       mesos    sleep 10000

Or, if you want to see all ongoing deployments, enter:

$ dcos marathon deployment list

to see something like the following:

APP             POD  ACTION  PROGRESS  ID
/app-scaling-1  -    scale     1/2     c51af187-dd74-4321-bb38-49e6d224f4c8

So now we know that some (6/1000) instances of the application have successfully deployed, but the overall deployment status is “Waiting”. But what does this mean?

Resolution

The “Waiting” state means that DC/OS (or, more precisely, Marathon) is waiting for a suitable resource offer. So it seems to be a deployment issue, and we should start by checking the available resources.

If we look at the DC/OS dashboard, we should see a pretty high CPU allocation.

Since we are not yet at 100% allocation but are still waiting to deploy, something interesting is going on. Let’s look at the recent resource offers in the debug view of the DC/OS web interface.

We can see that there are no matching CPU resources. But again, the overall CPU allocation is only at 75%. More puzzling still, when we look at the ‘Details’ section further below, we see that the latest offers from different hosts matched most of the resource requirements of our application. For example, the first offer, coming from host 10.0.0.96, matched the role, constraint (not present in this app definition), memory, disk, and port requirements, but failed the CPU requirement. The offer before it also looks as if it should have met the resource requirements. So although it appears we have enough CPU resources available, the application is failing to deploy for exactly that reason.
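If you prefer the command line, recent versions of the DC/OS CLI expose similar offer-matching information through the Marathon debug subcommands (check dcos marathon debug --help on your cluster before relying on them):

$ dcos marathon debug summary /app-scaling-1

$ dcos marathon debug details /app-scaling-1

The summary shows how many of the recent offers satisfied each requirement (role, constraints, CPU, memory, disk, ports), while the details list the last offer received from each agent.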

Let’s look at the ‘Details’ more closely. Some of the remaining CPU resources are allocated to a different Mesos resource role and so cannot be used by our application (it runs in role ‘*’, the default role).

To check the roles of different resources, let’s have a look at the state-summary endpoint, which you can access at https://<master-ip>/mesos/state-summary.

That endpoint returns a rather long JSON document, so it is helpful to pipe the output through jq to make it readable:

curl -skSL \
  -X GET \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  -H "Content-Type: application/json" \
  "$(dcos config show core.dcos_url)/mesos/state-summary" |
  jq '.'

When looking at the agent information we can see two different kinds of agent. The first kind has no free CPU resources and also no reserved resources. Of course, this might look different if you had other workloads running on your cluster prior to these exercises. Note that these unreserved resources correspond to the default role ‘*’, which is the role in which we are trying to deploy our tasks.

The second kind has unused CPU resources, but these resources are reserved in the role ‘slave_public’.
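To make this comparison easier, you can ask jq to print just the per-agent CPU picture. This is a minimal sketch that assumes the standard fields exposed by the Mesos state-summary endpoint (resources, used_resources, unreserved_resources, reserved_resources):

curl -skSL \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/mesos/state-summary" |
  jq '.slaves[] | {hostname, cpus: .resources.cpus, used_cpus: .used_resources.cpus, unreserved_cpus: .unreserved_resources.cpus, reserved: .reserved_resources}'

Agents whose remaining CPUs appear only under reserved resources (for example in the ‘slave_public’ role) cannot satisfy offers for role ‘*’.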

We now know that the issue is that there are not enough resources in the desired resource role across the entire cluster. As a solution, we could either scale down the application (1000 instances does seem a bit excessive) or add more resources to the cluster.
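For example, to scale the app down to a number of instances that fits the unreserved CPUs in the cluster (6 is just an illustrative value here), you could run:

$ dcos marathon app update /app-scaling-1 instances=6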

General Pattern

When your application framework (e.g. Marathon) is not accepting resource offers, check whether there are sufficient resources available in the respective resource role.
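A quick way to see how much of each resource is available per role is the Mesos roles endpoint. This is a sketch that assumes the endpoint is reachable through Admin Router at /mesos/roles on your cluster:

curl -skSL \
  -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/mesos/roles" |
  jq '.roles[] | {name, resources}'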

This was a straightforward scenario with too few CPU resources. Typically, resource issues are caused by more complex factors, such as improperly configured port resources or placement constraints. Nonetheless, the general workflow pattern still applies.

Cleanup

Remove the application from the cluster with:

$ dcos marathon app remove /app-scaling-1