Configuration update errors
After a configuration change, the service may enter an unhealthy state. This commonly occurs when an invalid configuration change was made by the user: certain configuration values may not be changed, or may not be decreased. To verify whether this is the case, check the service's deploy plan for any errors.
dcos confluent-kafka --name=<service-name> plan show deploy
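If the CLI plan command is unavailable, the same information can usually be retrieved from the scheduler's HTTP API. This is a hedged sketch assuming the standard DC/OS SDK plans endpoint and token-based CLI authentication; verify both against your cluster.
# Fetch the deploy plan directly from the scheduler (assumed SDK endpoint path).
curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/<service-name>/v1/plans/deploy"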
Accessing Logs
Logs for the scheduler and all service nodes can be viewed from the DC/OS web interface.
- Scheduler logs are useful for determining why a node is not being launched (this is under the purview of the Scheduler).
- Node logs are useful for examining problems in the service itself.
In all cases, logs are generally piped to files named `stdout` and/or `stderr`.
To view logs for a given node, perform the following steps:
- Visit `<dcos-url>` to access the DC/OS web interface.
- Navigate to `Services` and click on the service to be examined.
- In the list of tasks for the service, click on the task to be examined (the scheduler is named after the service; nodes are each named `node-<#>-server` according to their type).
- In the task details, click on the `Logs` tab to go into the log viewer. By default, you will see `stdout`, but `stderr` is also useful. Use the pull-down in the upper right to select the file to be examined. A CLI alternative is shown below.
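If you prefer the command line, recent versions of the DC/OS CLI can fetch the same files; the task name below is a placeholder for the task selected in the steps above.
# Show the last 100 lines of a task's stdout.
dcos task log --lines=100 <task-name>
# Follow stderr as it is written.
dcos task log --follow <task-name> stderr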
You can also access the logs via the Mesos UI:
- Visit `<dcos-url>/mesos` to view the Mesos UI.
- Click the `Frameworks` tab in the upper left to get a list of services running in the cluster.
- Navigate into the correct framework for your needs. The scheduler runs under `marathon` with a task name matching the service name (default: `confluent-kafka`). Service nodes run under a framework whose name matches the service name (default: `confluent-kafka`).
- You should now see two lists of tasks. `Active Tasks` are tasks currently running, and `Completed Tasks` are tasks that have exited. Click the `Sandbox` link for the task you wish to examine.
- The `Sandbox` view will list files named `stdout` and `stderr`. Click the file names to view the files in the browser, or click `Download` to download them to your system for local examination. Note that very old tasks will have their Sandbox automatically deleted to limit disk space usage. The same files can also be listed from the CLI, as shown below.
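As a CLI alternative to browsing the Sandbox in the Mesos UI, the following sketch assumes a DC/OS CLI version that includes the task file subcommands:
# List the files in a task's sandbox.
dcos task ls <task-name>
# Print one of the sandbox files, e.g. stderr.
dcos task log <task-name> stderr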
Replacing a Permanently Failed Node
The DC/OS Confluent Kafka service is resilient to temporary pod failures, automatically relaunching them in-place if they stop running. However, if a machine hosting a pod is permanently lost, manual intervention is required to discard the downed pod and reconstruct it on a new machine.
The following command should be used to get a list of available pods.
dcos confluent-kafka --name=<service-name> pod list
The following command should then be used to replace the pod residing on the failed machine, using the appropriate `pod_name` provided in the above list.
dcos confluent-kafka --name=<service-name> pod replace <pod_name>
The pod recovery may then be monitored via the `recovery` plan.
dcos confluent-kafka --name=<service-name> plan show recovery
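Putting it together, a typical replacement looks like the following; `kafka-2` is a hypothetical pod name taken from the `pod list` output.
# 1. Identify the pod running on the permanently failed machine.
dcos confluent-kafka --name=<service-name> pod list
# 2. Discard the downed pod and reconstruct it on a new machine (destructive for that pod's local data).
dcos confluent-kafka --name=<service-name> pod replace kafka-2
# 3. Monitor the recovery plan until the pod is back in a running state.
dcos confluent-kafka --name=<service-name> plan show recovery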
Restarting a Node
If you must forcibly restart a pod’s processes but do not wish to clear that pod’s data, use the following command to restart the pod on the same agent machine where it currently resides. This will not result in an outage or loss of data.
The following command should be used to get a list of available pods.
dcos confluent-kafka --name=<service-name> pod list
The following command should then be used to restart the pod, using the appropriate `pod_name` provided in the above list.
dcos confluent-kafka --name=<service-name> pod restart <pod_name>
The pod recovery may then be monitored via the `recovery` plan.
dcos confluent-kafka --name=<service-name> plan show recovery
Tasks not deploying / Resource starvation
When the scheduler is trying to launch tasks, it will log its decisions about offers it has received. This can be useful when you are determining why a task is failing to deploy at all.
In recent versions of the service, a scheduler endpoint at `http://yourcluster.com/service/<service-name>/v1/debug/offers` will display an HTML table containing a summary of recently-evaluated offers. This table's contents are currently very similar to what can be found in the logs, but in a slightly more accessible format. Alternatively, we can look at the scheduler's logs in `stdout`.
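A hedged example of fetching that offer summary without opening a browser, assuming token-based CLI authentication:
# Download the recently-evaluated offers table (HTML) from the scheduler.
curl -s -H "Authorization: token=$(dcos config show core.dcos_acs_token)" \
  "$(dcos config show core.dcos_url)/service/<service-name>/v1/debug/offers" -o offers.html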
The following example assumes a hypothetical task requiring a pre-determined amount of resources. From scrolling through the scheduler logs, we see a couple of patterns. First, there are failures like this, where the only thing missing is CPUs. The remaining task requires two CPUs but this offer apparently did not have enough:
INFO 2017-04-25 19:17:13,846 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 1 of 14 evaluation stages:
PASS(PlacementRuleEvaluationStage): No placement rule defined
PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
PASS(ResourceEvaluationStage): Offer contains sufficient 'cpus': requirement=type: SCALAR scalar { value: 0.5 }
PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
PASS(LaunchEvaluationStage): Added launch information to offer requirement
FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 8000.0 }
PASS(MultiEvaluationStage): All child stages passed
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9042 end: 9042 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9160 end: 9160 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7000 end: 7000 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7001 end: 7001 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8609 end: 8609 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8182 end: 8182 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7199 end: 7199 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 21621 end: 21621 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8983 end: 8983 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7077 end: 7077 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7080 end: 7080 } }
PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7081 end: 7081 } }
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
PASS(LaunchEvaluationStage): Added launch information to offer requirement
PASS(ReservationEvaluationStage): Added reservation information to offer requirement
If we scroll up from this rejection summary, we find a message describing what the agent had offered in terms of CPU:
INFO 2017-04-25 19:17:13,834 [pool-8-thread-1] com.mesosphere.sdk.offer.MesosResourcePool:consumeUnreservedMerged(239): Offered quantity of cpus is insufficient: desired type: SCALAR scalar { value: 2.0 }, offered type: SCALAR scalar { value: 0.5 }
Understandably, our scheduler is refusing to launch a node on a system with 0.5 remaining CPUs when the node needs 2.0 CPUs.
Another pattern we see is a message like this, where the offer is being rejected for several reasons:
INFO 2017-04-25 19:17:14,849 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 6 of 14 evaluation stages:
PASS(PlacementRuleEvaluationStage): No placement rule defined
PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 0.5 } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
PASS(LaunchEvaluationStage): Added launch information to offer requirement
FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'mem': name: "mem" type: SCALAR scalar { value: 8000.0 } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(MultiEvaluationStage): Failed to pass all child stages
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9042 end: 9042 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9160 end: 9160 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7000 end: 7000 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7001 end: 7001 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8609 end: 8609 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8182 end: 8182 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7199 end: 7199 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 21621 end: 21621 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8983 end: 8983 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7077 end: 7077 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7080 end: 7080 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7081 end: 7081 } } role: "confluent-kafka-role" reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "confluent-kafka-role" disk { persistence { id: "" principal: "confluent-kafka-principal" } volume { container_path: "confluent-kafka-data" mode: RW } } reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "confluent-kafka-role" disk { persistence { id: "" principal: "confluent-kafka-principal" } volume { container_path: "solr-data" mode: RW } } reservation { principal: "confluent-kafka-principal" labels { labels { key: "resource_id" value: "" } } }
PASS(LaunchEvaluationStage): Added launch information to offer requirement
PASS(ReservationEvaluationStage): Added reservation information to offer requirement
In this case, we see that none of the ports our task needs are available on this system (not to mention the lack of sufficient CPU and RAM). This will typically happen when we are looking at an agent that we have already deployed to. The agent in question here is likely running an existing node for this service, where we had already reserved those ports ourselves.
We are seeing that none of the remaining agents in the cluster have room to fit our third node. To resolve this, we need to either add more agents to the DC/OS cluster or we need to reduce the requirements of our service to make it fit. In the latter case, be aware of any performance issues that may result if resource usage is reduced too far. Insufficient CPU quota will result in throttled tasks, and insufficient RAM quota will result in OOMed tasks.
This is a good example of the kind of diagnosis you can perform by skimming the scheduler logs.
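If you decide to add agents or shrink the service as described above, it can help to first compare each agent's total and used resources. The sketch below assumes the `dcos node --json` output exposes `resources` and `used_resources` fields and that `jq` is installed; the `brokers.cpus` and `brokers.mem` option names are typical for this package but should be checked against your installed version's configuration schema.
# Compare total vs. used CPU and memory on each agent (field names are assumptions about the JSON output).
dcos node --json | jq '.[] | {hostname: .hostname, cpus: .resources.cpus, cpus_used: .used_resources.cpus, mem: .resources.mem, mem_used: .used_resources.mem}'
# To shrink the service instead, lower the broker resources in options.json, for example:
#   { "brokers": { "cpus": 1.5, "mem": 6000 } }
# then roll the change out (supported by recent SDK-based versions of the service):
dcos confluent-kafka --name=<service-name> update start --options=options.json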
Accidentally deleted Marathon task but not service
A common mistake is to remove the scheduler task from Marathon, which does not do anything to uninstall the service tasks themselves. If you do this, you have two options:
Uninstall the rest of the service
If you really want to uninstall the service, you just need to complete the normal `package uninstall` steps described under Uninstall.
Recover the Scheduler
If you want to bring the scheduler back, you can run the `dcos package install` process using the options that you had configured before. This will re-install a new scheduler that should match the previous one (assuming you got your options right), and it will resume where it left off. To ensure that you don't forget the options your service is configured with, we recommend keeping a copy of your service's `options.json` in source control so that you can easily recover it later.
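A minimal sketch of the re-install, assuming you kept the `options.json` used for the original deployment:
# Re-install the scheduler with the same options; the service name inside options.json must match the original.
dcos package install confluent-kafka --options=options.json --yes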
‘Framework has been removed’
This error means an instance of the service with the same name was previously installed and not properly uninstalled. See Uninstall for steps on completing an incomplete uninstall.
OOMed task
Your tasks can be killed by an OOM (out-of-memory) error if you did not give them sufficient resources. This will manifest as sudden `Killed` messages in task logs, sometimes consistently but often not. To verify that the cause is an OOM error, check the following places:
- Check the scheduler logs (or `dcos <svcname> pod status <podname>`) to see TaskStatus updates from Mesos for a given failed pod.
- Check the agent logs directly for mention of the Mesos agent killing a task due to excess memory usage, as shown in the example below.
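A hedged example of checking an agent directly for OOM-killer activity, assuming SSH access through the DC/OS CLI; the exact kernel message wording can vary by OS.
# SSH to the agent that ran the failed task (the agent ID is visible in `dcos task` output or the pod status).
dcos node ssh --master-proxy --mesos-id=<agent-id>
# On the agent, search the kernel log for OOM-killer messages around the failure time.
dmesg | grep -iE 'out of memory|killed process'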
After you have confirmed that the problem is indeed an OOM error, you can solve it by updating the service configuration to reserve more memory for the affected tasks.
Partition replication
Kafka may become unhealthy when it detects any under-replicated partitions. This error condition usually indicates a malfunctioning broker. Use the `dcos confluent-kafka --name=confluent-kafka topic under_replicated_partitions` and `dcos confluent-kafka --name=confluent-kafka topic describe <topic-name>` commands to find the problem broker and determine what actions are required.
Possible repair actions include restarting the affected broker or destructively replacing it. The replace operation is destructive and irrevocably discards all data associated with the broker; the restart operation is not destructive and simply restarts the broker process.
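For example, a typical investigation and non-destructive repair might look like this; the topic and pod names are placeholders.
# 1. List topics with under-replicated partitions.
dcos confluent-kafka --name=confluent-kafka topic under_replicated_partitions
# 2. Describe the affected topic to identify the problem broker.
dcos confluent-kafka --name=confluent-kafka topic describe <topic-name>
# 3. Restart the problem broker's pod (non-destructive); use `pod replace` only if the broker is permanently lost.
dcos confluent-kafka --name=confluent-kafka pod restart <pod_name>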
Broker Shutdown
If the Kafka brokers are not completing a clean shutdown within the configured `brokers.kill_grace_period` (Kill Grace Period), extend the Kill Grace Period.
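A hedged sketch of extending the grace period, assuming the option lives under `brokers.kill_grace_period` as referenced above and that your installed version supports CLI-driven configuration updates; in typical versions the value is expressed in seconds.
# options.json: raise the clean-shutdown grace period (120 is an illustrative value).
#   { "brokers": { "kill_grace_period": 120 } }
dcos confluent-kafka --name=<service-name> update start --options=options.json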