Custom collectors

Custom collectors

For creating diagnostic bundles, D2iQ is using a customized version of troubleshoot.sh integrated into dkp-diagnose.

Customizations

To meet the specific needs of diagnosing DKP 2 clusters we have developed custom collectors and modified the behavior of upstream collectors. Go to our repository for more information on the details of the changes.

ExecCopyFromHost collector

This is a new collector created specifically for gathering host level information from cluster nodes. The collector allows you to run a provided container image in a privileged mode, as a root user, with additional Linux capabilities and with the host filesystem mounted in the container.

You can collect host level information other than copying host level files. (This is already possible with the CopyFromHost collector.) Like the CopyFromHost collector, this collector runs as a Kubernetes DaemonSet executed on all nodes in the system. The data produced by the container are copied from a pre-defined directory into the diagnostics bundle under each node name. The name of the parent directory, in the diagnostics bundle, is determined by the name of the collector specified in its configuration.

The data written into the diagnostics bundle follows this format:

<collector-name> / <node-name> / data / (file1|file2|...)

The following is a sample configuration file:

spec:
  collectors:
    - execCopyFromHost:
        name: node-diagnostics
        image: mesosphere/dkp-diagnostics-node-collector:latest
        timeout: 30s
        command:
          - "/bin/bash"
          - "-c"
          - "/diagnostics/container.sh --hostroot /host --hostpath ${PATH} --outputroot /output"
        workingDir: "/diagnostics"
        includeControlPlane: true
        privileged: true
        capabilities:
          - AUDIT_CONTROL
          - AUDIT_READ
          - BLOCK_SUSPEND
          - BPF
          - CHECKPOINT_RESTORE
          - DAC_READ_SEARCH
          - IPC_LOCK
          - IPC_OWNER
          - LEASE
          - LINUX_IMMUTABLE
          - MAC_ADMIN
          - MAC_OVERRIDE
          - NET_ADMIN
          - NET_BROADCAST
          - PERFMON
          - SYS_ADMIN
          - SYS_BOOT
          - SYS_MODULE
          - SYS_NICE
          - SYS_PACCT
          - SYS_PTRACE
          - SYS_RAWIO
          - SYS_RESOURCE
          - SYS_TIME
          - SYS_TTY_CONFIG
          - SYSLOG
          - WAKE_ALARM
        extractArchive: true

The following is an example of the data produced by running this collector:

├── node-diagnostics
│   ├── troubleshoot-control-plane
│   │   └── data
│   │       ├── certs_expiration_kubeadm
│   │       ├── containerd_config.toml
                ...
│   │       └── whoami_validate
│   └── troubleshoot-worker
│       └── data
│           ├── containerd_config.toml
│           ├── containers_crictl
                ...
│           └── whoami_validate

In the event that an error occurs while collecting node diagnostics, the node-diagnostics/<node>/pod-collector.json file contains the serialized JSON representations of the running pod. This helps debug the reasons for the collection failure. The node-diagnostics/<node>/pod-collector.log file contains stdout from the collector container that runs the diagnostics script. In addition, the command may also produce certain -error.txt files. file-copy-error.txt and pod-collector-files-copy-error.txt are two file examples. These files contain error messages generated while trying to fetch log files from the collector.

When using this collector for node level information you must run additional docker containers and must have the following docker images:

  • mesosphere/pause-alpine:3.2
  • mesosphere/dkp-diagnostics-node-collector:$(dkp-diagnose version)

For more information on the configuration options see the ExecCopyFromHost in the pkg/apis/troubleshoot/v1beta2/exec_copy_from_host.go file.

AllLogs collector

This collector gathers pod logs from specified namespaces or from all namespaces if none are specified. You can collect logs of all the pods from all the namespaces. The pod logs are collected under the allPodLogs directory.

The data written into the diagnostics bundle follows this format:

<collector-name> / <namespace-name> / <pod-name> - (container1|container2|...)

The following is a sample configuration file to collect logs from all the pods from all the namespaces:

spec:
  collectors:
    - allLogs:
        namespaces:
          - "*"

The following is a sample configuration file to collect logs from all the pods from specific namespaces:

spec:
  collectors:
    - allLogs:
        namespaces:
          - default
          - dev
          - prod

The following is an example of the data produced by running this collector:

├── node-diagnostics
│   ├── troubleshoot-control-plane
│   │   └── data
│   │       ├── certs_expiration_kubeadm
│   │       ├── containerd_config.toml
                ...
│   │       └── whoami_validate
│   └── troubleshoot-worker
│       └── data
│           ├── containerd_config.toml
│           ├── containers_crictl
                ...
│           └── whoami_validate

In the event that an error occurs while collecting node diagnostics, the node-diagnostics/<node>/pod-collector.json file contains the serialized JSON representations of the running pod. This helps debug the reasons for the collection failure. The node-diagnostics/<node>/pod-collector.log file contains stdout from the collector container that runs the diagnostics script.

When using this collector for node level information you must run additional docker containers and must have the following docker images:

  • mesosphere/pause-alpine:3.2
  • mesosphere/dkp-diagnostics-node-collector:$(dkp-diagnose version)

For more information on the configuration options see the ExecCopyFromHost in the pkg/apis/troubleshoot/v1beta2/exec_copy_from_host.go file.

Support for collecting from all namespaces for ConfigMap and Secret collector

In the original collectors namespace there is a required parameter. This adds support for collecting from all namespaces by not setting the namespace (or setting it to ""). Note: To collect all config maps / secrets an empty selector must be used (selector: [""]).

Support for optional support-bundle name prefix

When generating a support bundle, you need naming defaults to provide deterministic bundle identifiers. This feature is especially useful for our convenience extension of providing diagnostics for both, a bootstrap, Konvoy, or other K8s cluster. Using an empty prefix keeps the original naming convention.

ClusterResources collector

Another customization is added to collect custom resource definitions and all custom resources in the cluster.