Troubleshooting

Diagnosing Kafka

Configuration update errors

After a configuration change, the service may enter an unhealthy state. This commonly occurs when the user makes an invalid configuration change: certain configuration values may not be changed, or may not be decreased. To verify whether this is the case, check the service’s deploy plan for any errors:

dcos kafka --name=<service-name> plan show deploy
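
If the plan reports an error caused by a bad value, the usual fix is to re-run the configuration update with a corrected options file. A minimal sketch, assuming a recent package version that supports the update subcommand and that the corrected values are saved locally as options.json:

dcos kafka --name=<service-name> update start --options=options.json

Once the update is accepted, re-check the deploy plan with the command above to confirm that it completes.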

Accessing Logs

Logs for the scheduler and all service nodes can be viewed from the DC/OS web interface.

  • Scheduler logs are useful for determining why a node is not being launched (this is under the purview of the Scheduler).
  • Node logs are useful for examining problems in the service itself.

In all cases, logs are generally piped to files named stdout and/or stderr.

To view logs for a given node, perform the following steps:

  1. Visit <dcos-url> to access the DC/OS web interface.
  2. Navigate to Services and click on the service to be examined.
  3. In the list of tasks for the service, click on the task to be examined (the scheduler task is named after the service; node tasks are each named node-<#>-server according to their type).
  4. In the task details, click on the Logs tab to go into the log viewer. By default, you will see stdout, but stderr is also useful. Use the pull-down in the upper right to select the file to be examined.
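
If you prefer the command line, the DC/OS CLI can fetch the same files. A brief sketch, assuming the CLI is attached and authenticated to the cluster and that <task-name> is one of the task names listed by dcos task:

dcos task log --lines=100 <task-name> stdout
dcos task log --lines=100 <task-name> stderr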

You can also access the logs via the Mesos UI:

  1. Visit <dcos-url>/mesos to view the Mesos UI.
  2. Click the Frameworks tab in the upper left to get a list of services running in the cluster.
  3. Navigate into the correct framework for your needs. The scheduler runs under Marathon with a task name matching the service name (default kafka). Service nodes run under a framework whose name matches the service name (default kafka).
  4. You should now see two lists of tasks. Active Tasks are tasks currently running, and Completed Tasks are tasks that have exited. Click the Sandbox link for the task you wish to examine.
  5. The Sandbox view will list files named stdout and stderr. Click the file names to view the files in the browser, or click Download to download them to your system for local examination. Note that very old tasks will have their Sandbox automatically deleted to limit disk space usage.

Replacing a Permanently Failed Node

The DC/OS Apache Kafka service is resilient to temporary pod failures, automatically relaunching them in-place if they stop running. However, if a machine hosting a pod is permanently lost, manual intervention is required to discard the downed pod and reconstruct it on a new machine.

Use the following command to get a list of available pods:

dcos kafka --name=<service-name> pod list

Then use the following command to replace the pod residing on the failed machine, substituting the appropriate pod_name from the list above:

dcos kafka --name=<service-name> pod replace <pod_name>

You can then monitor the pod recovery via the recovery plan:

dcos kafka --name=<service-name> plan show recovery
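
For example, assuming the default service name kafka and a hypothetical pod named kafka-1 (use whichever pod name actually appears in your pod list output):

dcos kafka --name=kafka pod replace kafka-1
dcos kafka --name=kafka plan show recovery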

Restarting a Node

If you must forcibly restart a pod’s processes but do not wish to clear that pod’s data, you can restart the pod on the same agent machine where it currently resides. This will not result in an outage or loss of data.

Use the following command to get a list of available pods:

dcos kafka --name=<service-name> pod list

Then use the following command to restart the pod, substituting the appropriate pod_name from the list above:

dcos kafka --name=<service-name> pod restart <pod_name>

You can then monitor the pod recovery via the recovery plan:

dcos kafka --name=<service-name> plan show recovery
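
To confirm that the pod’s tasks have returned to a running state after the restart, you can also check the pod directly. A quick sketch, again using a hypothetical pod name taken from pod list:

dcos kafka --name=<service-name> pod status <pod_name>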

Tasks not deploying / Resource starvation

When the scheduler is trying to launch tasks, it will log its decisions about the offers it has received. This can be useful for determining why a task is failing to deploy at all.

In recent versions of the service, a scheduler endpoint at http://yourcluster.com/service/<service-name>/v1/debug/offers will display an HTML table containing a summary of recently-evaluated offers. This table’s contents are currently very similar to what can be found in logs, but in a slightly more accessible format. Alternatively, we can look at the scheduler’s logs in stdout.
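
If you would rather query that endpoint from a terminal than a browser, one option is to let the DC/OS CLI supply the cluster URL and auth token. A rough sketch, assuming the CLI is attached and authenticated to the cluster:

curl -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/<service-name>/v1/debug/offers"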

The following example assumes a hypothetical task requiring a pre-determined amount of resources. Scrolling through the scheduler logs, we see a couple of patterns. First, there are failures like the following, where the only thing missing is CPUs. The task requires 2.0 CPUs, but this offer apparently did not have enough:

INFO  2017-04-25 19:17:13,846 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 1 of 14 evaluation stages:
  PASS(PlacementRuleEvaluationStage): No placement rule defined
  PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
  PASS(ResourceEvaluationStage): Offer contains sufficient 'cpus': requirement=type: SCALAR scalar { value: 0.5 }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 8000.0 }
  PASS(MultiEvaluationStage): All child stages passed
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9042 end: 9042 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9160 end: 9160 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7000 end: 7000 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7001 end: 7001 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8609 end: 8609 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8182 end: 8182 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7199 end: 7199 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 21621 end: 21621 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8983 end: 8983 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7077 end: 7077 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7080 end: 7080 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7081 end: 7081 } }
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  PASS(ReservationEvaluationStage): Added reservation information to offer requirement

If we scroll up from this rejection summary, we find a message describing what the agent had offered in terms of CPU:

INFO  2017-04-25 19:17:13,834 [pool-8-thread-1] com.mesosphere.sdk.offer.MesosResourcePool:consumeUnreservedMerged(239): Offered quantity of cpus is insufficient: desired type: SCALAR scalar { value: 2.0 }, offered type: SCALAR scalar { value: 0.5 }

Understandably, our scheduler is refusing to launch a node on a system with 0.5 remaining CPUs when the node needs 2.0 CPUs.

Another pattern we see is a message like this, where the offer is being rejected for several reasons:

INFO  2017-04-25 19:17:14,849 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 6 of 14 evaluation stages:
  PASS(PlacementRuleEvaluationStage): No placement rule defined
  PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 0.5 } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'mem': name: "mem" type: SCALAR scalar { value: 8000.0 } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  FAIL(MultiEvaluationStage): Failed to pass all child stages
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9042 end: 9042 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9160 end: 9160 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7000 end: 7000 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7001 end: 7001 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8609 end: 8609 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8182 end: 8182 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7199 end: 7199 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 21621 end: 21621 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8983 end: 8983 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7077 end: 7077 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7080 end: 7080 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7081 end: 7081 } } role: "kafka-role" reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "kafka-role" disk { persistence { id: "" principal: "kafka-principal" } volume { container_path: "kafka-data" mode: RW } } reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "kafka-role" disk { persistence { id: "" principal: "kafka-principal" } volume { container_path: "solr-data" mode: RW } } reservation { principal: "kafka-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  PASS(ReservationEvaluationStage): Added reservation information to offer requirement

In this case, we see that none of the ports our task needs are available on this system (not to mention the lack of sufficient CPU and RAM). This will typically happen when we are looking at an agent that we have already deployed to. The agent in question here is likely running an existing node for this service, where we had already reserved those ports ourselves.

We are seeing that none of the remaining agents in the cluster have room to fit our third node. To resolve this, we need to either add more agents to the DC/OS cluster or we need to reduce the requirements of our service to make it fit. In the latter case, be aware of any performance issues that may result if resource usage is reduced too far. Insufficient CPU quota will result in throttled tasks, and insufficient RAM quota will result in OOMed tasks.
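
Before deciding between adding agents and shrinking the service, it can help to see how much capacity each agent is advertising. A rough sketch using the CLI, assuming jq is installed locally and that the agent entries returned by dcos node --json expose resources and used_resources fields (field names may vary by DC/OS version):

dcos node --json | jq '.[] | {hostname, resources, used_resources}'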

This is a good example of the kind of diagnosis you can perform by skimming the scheduler logs.

Accidentally deleted Marathon task but not service

A common mistake is to remove the scheduler task from Marathon, which does not uninstall the service tasks themselves. If you do this, you have two options:

Uninstall the rest of the service

If you really want to uninstall the service, you just need to complete the normal package uninstall steps described under Uninstall.

Recover the Scheduler

If you want to bring the scheduler back, you can run the dcos package install process, using the options that you had configured before. This will re-install a new scheduler that should match the previous one (assuming you got your options right), and it will resume where it left off. To ensure that you don’t forget the options your services are configured with, we recommend keeping a copy of your service’s options.json in source control so that you can easily recover it later.
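
A brief sketch of that reinstall, assuming the saved options file is named options.json and the service was installed from the kafka package:

dcos package install kafka --options=options.json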

‘Framework has been removed’

An instance of the service was previously installed with the same name and not properly uninstalled. See Uninstall for steps on completing an incomplete uninstall.

OOMed task

Your tasks can be killed by an OOM (out-of-memory) error if you did not give them sufficient memory. This will manifest as sudden Killed messages in task logs, sometimes consistently but often not. To verify that the cause is an OOM error, check the following:

  • Check Scheduler logs (or dcos <svcname> pod status <podname>) to see TaskStatus updates from Mesos for a given failed pod.
  • Check Agent logs directly for mention of the Mesos agent killing a task due to excess memory usage (see the sketch after this list).
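
For example, both signals can be pulled from the CLI. A hedged sketch, assuming the pod name comes from pod list, the agent ID comes from the ID column of dcos node, and the cluster’s logging API is available to dcos node log:

dcos kafka --name=<service-name> pod status <pod_name>
dcos node log --lines=200 --mesos-id=<agent-id>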

Once you have confirmed that the problem is indeed an OOM error, you can solve it by updating the service configuration to reserve more memory, as sketched below.
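
A minimal sketch of that change, assuming the package exposes broker memory as brokers.mem (in MB) and supports the update subcommand; check your package’s configuration schema for the exact key, units, and a safe value:

options.json:

{
  "brokers": {
    "mem": 4096
  }
}

dcos kafka --name=<service-name> update start --options=options.json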

Partition replication

Kafka may become unhealthy when it detects any under-replicated partitions. This error condition usually indicates a malfunctioning broker. Use the dcos kafka --name=kafka topic under_replicated_partitions and dcos kafka --name=kafka topic describe <topic-name> commands to find the problem broker and determine what actions are required.

Possible repair actions include restarting the affected broker (pod restart) and destructively replacing it (pod replace). The restart operation is not destructive: it simply relaunches the broker process in place. The replace operation is destructive and will irrevocably lose all data associated with the broker.
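
For example, assuming a hypothetical problem broker running in pod kafka-2 (use the pod name reported by pod list):

dcos kafka --name=kafka pod restart kafka-2

or, only if the broker’s data is already considered lost:

dcos kafka --name=kafka pod replace kafka-2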

Broker Shutdown

If the Kafka brokers do not complete a clean shutdown within the configured brokers.kill_grace_period (Kill Grace Period), extend the Kill Grace Period.
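
A hedged sketch of extending the grace period, assuming the option key is brokers.kill_grace_period and that its value is expressed in seconds (verify the key name, units, and default in your package’s configuration schema):

options.json:

{
  "brokers": {
    "kill_grace_period": 60
  }
}

dcos kafka --name=<service-name> update start --options=options.json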