To create diagnostic bundles, D2iQ uses a customized version of troubleshoot.sh integrated into dkp-diagnose.
Customizations
To meet the specific needs of diagnosing DKP 2 clusters, we have developed custom collectors and modified the behavior of upstream collectors. See our repository for details of the changes.
ExecCopyFromHost collector
This is a new collector created specifically for gathering host-level information from cluster nodes. It runs a provided container image in privileged mode, as the root user, with additional Linux capabilities and with the host filesystem mounted in the container.
This lets you collect host-level information beyond copying host files, which is already possible with the CopyFromHost collector. Like the CopyFromHost collector, this collector runs as a Kubernetes DaemonSet executed on all nodes in the system. The data produced by the container is copied from a pre-defined directory into the diagnostics bundle under each node name. The name of the parent directory in the diagnostics bundle is determined by the collector name specified in its configuration.
The data written into the diagnostics bundle follows this format:
<collector-name> / <node-name> / data / (file1|file2|...)
The following is a sample configuration file:
spec:
  collectors:
    - execCopyFromHost:
        name: node-diagnostics
        image: mesosphere/dkp-diagnostics-node-collector:latest
        timeout: 30s
        command:
          - "/bin/bash"
          - "-c"
          - "/diagnostics/container.sh --hostroot /host --hostpath ${PATH} --outputroot /output"
        workingDir: "/diagnostics"
        includeControlPlane: true
        privileged: true
        capabilities:
          - AUDIT_CONTROL
          - AUDIT_READ
          - BLOCK_SUSPEND
          - BPF
          - CHECKPOINT_RESTORE
          - DAC_READ_SEARCH
          - IPC_LOCK
          - IPC_OWNER
          - LEASE
          - LINUX_IMMUTABLE
          - MAC_ADMIN
          - MAC_OVERRIDE
          - NET_ADMIN
          - NET_BROADCAST
          - PERFMON
          - SYS_ADMIN
          - SYS_BOOT
          - SYS_MODULE
          - SYS_NICE
          - SYS_PACCT
          - SYS_PTRACE
          - SYS_RAWIO
          - SYS_RESOURCE
          - SYS_TIME
          - SYS_TTY_CONFIG
          - SYSLOG
          - WAKE_ALARM
        extractArchive: true
The following is an example of the data produced by running this collector:
├── node-diagnostics
│   ├── troubleshoot-control-plane
│   │   └── data
│   │       ├── certs_expiration_kubeadm
│   │       ├── containerd_config.toml
...
│   │       └── whoami_validate
│   └── troubleshoot-worker
│       └── data
│           ├── containerd_config.toml
│           ├── containers_crictl
...
│           └── whoami_validate
If an error occurs while collecting node diagnostics, the node-diagnostics/<node>/pod-collector.json file contains the serialized JSON representation of the running pod, which helps debug the cause of the collection failure. The node-diagnostics/<node>/pod-collector.log file contains the stdout of the collector container that runs the diagnostics script. The command may also produce -error.txt files, for example file-copy-error.txt and pod-collector-files-copy-error.txt. These files contain error messages generated while trying to fetch log files from the collector.
To collect node-level information, this collector runs additional containers and requires the following Docker images:
mesosphere/pause-alpine:3.2
mesosphere/dkp-diagnostics-node-collector:$(dkp-diagnose version)
For more information on the configuration options, see ExecCopyFromHost in the pkg/apis/troubleshoot/v1beta2/exec_copy_from_host.go file.
AllLogs collector
This collector gathers pod logs from the specified namespaces, or from all namespaces if none are specified. The pod logs are collected under the allPodLogs directory.
The data written into the diagnostics bundle follows this format:
<collector-name> / <namespace-name> / <pod-name> - (container1|container2|...)
The following is a sample configuration file to collect logs from all the pods from all the namespaces:
spec:
  collectors:
    - allLogs:
        namespaces:
          - "*"
The following is a sample configuration file to collect logs from all the pods from specific namespaces:
spec:
  collectors:
    - allLogs:
        namespaces:
          - default
          - dev
          - prod
ConfigMap and Secret collector
Support for collecting from all namespaces
In the original collectors, namespace is a required parameter. This customization adds support for collecting from all namespaces by not setting namespace (or by setting it to "").
Note: To collect all config maps or secrets, an empty selector must be used (selector: [""]).
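As a hedged sketch of the two points above, the following fragment collects every config map and secret from all namespaces. It assumes the collector keys are configMap and secret, as in upstream troubleshoot.sh. Only namespace and selector are taken from the text above, so check any other options against the repository.

```yaml
spec:
  collectors:
    # Assumed upstream collector key; an empty namespace means all namespaces.
    - configMap:
        namespace: ""
        selector: [""]   # empty selector matches every config map
    # Same pattern for secrets (assumed upstream collector key).
    - secret:
        namespace: ""
        selector: [""]   # empty selector matches every secret
```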
Support for optional support-bundle name prefix
When generating a support bundle, an optional name prefix can be set to provide deterministic bundle identifiers. This is especially useful for our convenience extension that provides diagnostics for a bootstrap, Konvoy, or other Kubernetes cluster. Using an empty prefix keeps the original naming convention.
ClusterResources collector
This customization extends the collector to also gather custom resource definitions and all custom resources in the cluster.
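A minimal sketch of enabling this collector is shown below. It assumes the collector keeps the upstream key clusterResources and needs no additional option to pick up custom resources. Neither assumption is confirmed by the text above, so treat this as illustrative only.

```yaml
spec:
  collectors:
    # Assumed upstream collector key; no options are set in this sketch.
    - clusterResources: {}
```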