General Strategy: Debugging Application Deployment on DC/OS
Now that we have defined a toolset for debugging applications on DC/OS, let us consider a step-by-step general troubleshooting strategy for applying these tools in an application debugging scenario. After covering this general strategy, we will apply it to a few concrete scenarios in the practice section.
Beyond considering any information special to your scenario, a reasonable approach to debugging an application deployment issue is to apply our debugging tools in the following order:
- 1: Check Web Interfaces
- 2: Check Task Logs
- 3: Check Scheduler Logs
- 4: Check Agent Logs
- 5: Test Task Interactively
- 6: Check Master Logs
- 7: Ask Community
Step 1: Check the web interfaces
Start by examining the DC/OS web interface (or use the CLI) to check the status of the task. If the task has an associated health check, it is also a good idea to check the task’s health status.
If relevant, also check the Mesos web interface or the Exhibitor/ZooKeeper web interface for further debugging information.
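The same status check can be done from the CLI. The following is a minimal sketch, not an official recipe: `check_status` is a hypothetical helper name, and the task name and app ID are placeholders. Exact subcommands can differ between CLI versions (newer CLIs list tasks with `dcos task list`), so confirm with `dcos task --help` on your cluster.

```shell
# Hypothetical helper: show a task's state and its app's Marathon status.
# $1 = task name or ID (placeholder), $2 = Marathon app ID (placeholder).
check_status() {
  # Current state of the task (for example staging, running, or failed).
  dcos task "$1"
  # Full app status as JSON; when a health check is defined, fields such as
  # tasksHealthy and tasksUnhealthy summarize the health status.
  dcos marathon app show "$2"
}

# Example usage (not executed here):
#   check_status my-app /my-app
```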
Step 2: Check the Task Logs
If the web interfaces do not provide sufficient information, next check the task logs using the DC/OS web interface or the CLI. The logs give a better understanding of what might have happened to the application. If the issue is that the app is not deploying (for example, the task status remains waiting indefinitely), try looking at the ‘Debug’ page, which can help you understand which resources Mesos is offering.
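As a rough CLI counterpart to this step, the sketch below reads both output streams of a task and asks Marathon about apps that are still waiting. The helper names `task_logs` and `why_waiting` are hypothetical, the arguments are placeholders, and the `dcos marathon debug` subcommands are assumed to be available in your CLI version.

```shell
# Hypothetical helper: print both log files of a task.
# $1 = task name or ID (placeholder).
task_logs() {
  dcos task log "$1" stdout
  dcos task log "$1" stderr
}

# Hypothetical helper: inspect why an app has not been launched yet.
# $1 = Marathon app ID (placeholder).
why_waiting() {
  # Apps currently waiting in Marathon's launch queue.
  dcos marathon debug list
  # Summary of how recent resource offers matched the app's requirements.
  dcos marathon debug summary "$1"
}

# Example usage (not executed here):
#   task_logs my-app
#   why_waiting /my-app
```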
Step 3: Check the Scheduler Logs
Next, when there is a deployment problem and the task logs do not provide enough information to fix the issue, double-check the app definition. After confirming the app definition, check the Marathon log or web interface to better understand how the app was scheduled, or why it was not.
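One possible way to walk through this step from the CLI is sketched below. `scheduler_check` is a hypothetical helper name and the app ID is a placeholder; the sketch also assumes that the scheduler in question is the native Marathon instance, reachable under the service name `marathon`.

```shell
# Hypothetical helper: review an app definition and the scheduler's view of it.
# $1 = Marathon app ID (placeholder).
scheduler_check() {
  # The app definition as Marathon stores it (compare with what you intended).
  dcos marathon app show "$1"
  # Deployments currently in progress for this app.
  dcos marathon deployment list "$1"
  # Recent Marathon (scheduler) log output.
  dcos service log marathon
}

# Example usage (not executed here):
#   scheduler_check /my-app
```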
Step 4: Check the Agent Logs
The Mesos Agent logs provide information about how the task and that task’s environment are being started. Recall that increasing the log level can help in some cases by providing more information to work with.
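The agent logs can also be fetched through the CLI. In this minimal sketch, `agent_logs` is a hypothetical helper name and the Mesos agent ID is a placeholder that you would read from the node listing; the line count of 200 is an arbitrary choice.

```shell
# Hypothetical helper: print recent log lines from one agent node.
# $1 = Mesos agent ID of the node that ran the task (placeholder).
agent_logs() {
  dcos node log --mesos-id="$1" --lines=200
}

# Example usage (not executed here):
#   agent_logs <mesos-agent-id>
```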
Step 5: Test the Task Interactively
The next step is to interactively look at the task running inside the container. If the task is still running, dcos task exec or docker exec can be helpful to start an interactive debugging session. If the application is based on a Docker container image, manually starting it using docker run followed by docker exec can also get you started in the right direction.
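The two approaches described above might look like the following sketch. The helper names, the container name `debug-session`, and the arguments are all hypothetical. It assumes the image ships `/bin/sh` and that its default command keeps the container running long enough to attach.

```shell
# Hypothetical helper: open a shell inside a task that is still running.
# $1 = task name or ID (placeholder).
exec_in_task() {
  dcos task exec --interactive --tty "$1" /bin/sh
}

# Hypothetical helper: start a Docker image by hand, then attach to it.
# $1 = Docker image name (placeholder).
debug_image() {
  # Start the container in the background under a known name.
  docker run --detach --name debug-session "$1"
  # Open an interactive shell inside the running container.
  docker exec --interactive --tty debug-session /bin/sh
}

# Example usage (not executed here):
#   exec_in_task my-app
#   debug_image my-registry/my-image
```

Remove the manually started container with `docker rm --force debug-session` once the session is finished.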
Step 6: Check the Master Logs
If you want to understand why a particular scheduler has received certain resources or a particular status, then the master logs can be very helpful. Recall that the master forwards all status updates between the agents and the scheduler, so its logs can be helpful even when the agent node is not reachable (for example, because of a network partition or node failure).
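Because the master logs cover the whole cluster, it usually helps to filter them. The sketch below, with the hypothetical helper name `master_logs_for`, reads recent log lines from the leading master and keeps only those mentioning a given task or framework ID. The line count of 500 is an arbitrary choice.

```shell
# Hypothetical helper: show leading-master log lines that mention an ID.
# $1 = task ID or framework ID to search for (placeholder).
master_logs_for() {
  # -F treats the ID as a fixed string rather than a regular expression.
  dcos node log --leader --lines=500 | grep -F -- "$1"
}

# Example usage (not executed here):
#   master_logs_for <task-id>
```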
Step 7: Ask the Community
As mentioned above, the community can be very helpful in debugging further. You can reach it through the DC/OS Slack or the mailing list.
 DC/OS Documentation