
Control Plane Testing

Deploying applications with Kubernetes is easier than ever, yet developers face increasing complexity.

Kubernetes simplifies deployment, but with it comes a labyrinth of potential issues. From resource conflicts to version incompatibilities, a failure in one component can cascade. Understanding application health through metric models like RED (Requests, Errors, Duration) and USE (Utilization, Saturation, Errors) isn't always enough. Latent errors might only surface during deployment or scaling.

For example, consider deploying a stateful PostgreSQL database via Flux on AWS. Problems can arise at several stages, and the usual pre-deployment validation approaches each catch only part of them (a command sketch follows this list):

  • Tools like helm template and helm lint can validate chart rendering and syntax, but they don't guarantee compatibility with a specific Kubernetes version or the operators running on the cluster.
  • ct install on a kind or simulated cluster can verify API compatibility and ensure all resources and operators work correctly in ideal conditions.
  • Deploying to a staging environment can help catch issues before they reach production, but this approach doesn't detect capacity, performance or latent errors that only surface under load.
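
For reference, a typical pre-deployment validation pass with these tools might look like the following sketch (the chart path and the chart-testing config file are placeholders chosen for illustration):

    # Validate chart rendering and syntax only -- no cluster required
    helm lint ./charts/postgresql
    helm template postgresql ./charts/postgresql > /dev/null

    # Install the chart into a throwaway cluster and verify resources become ready
    ct install --config ct.yaml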

Control plane testing can help improve resilience by continuously redeploying workloads, ensuring there is enough capacity within the system and that all operators and external dependencies are working correctly.

Canary checker is a Kubernetes-native test platform that continuously runs tests against your workloads using 30+ check styles. In this tutorial, we use it to continuously verify that a cluster can provision and run stateful workloads.
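
As a point of reference for the canary format, a minimal canary using another check style, http, looks roughly like this (the URL is a placeholder):

    apiVersion: canaries.flanksource.com/v1
    kind: Canary
    metadata:
      name: http-check-example
    spec:
      schedule: "@every 5m"
      http:
        - name: website-up
          url: https://example.com # placeholder endpoint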

The kubernetesResource check creates Kubernetes resources from the provided manifests and performs checks on them. It has five lifecycle stages; a field-level sketch follows the list below:

Lifecycle

  1. Apply Static Resources - Applies all staticResources that are required for all tests to pass, e.g. namespaces, secrets, etc.
  2. Apply Resources - Applies all the workloads defined in resources
  3. Wait - Using the parameters defined in waitFor, wait for the resources to be ready using is-healthy
  4. Run Checks - Run all the checks against the workloads
  5. Cleanup - Delete all the resources that were created during the test.
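
As a rough sketch, the stages map onto the fields of a kubernetesResource check like this (values elided):

    kubernetesResource:
      - name: example
        staticResources: [] # 1. applied once and re-used across runs
        resources: []       # 2. applied on every run
        waitFor: {}         # 3. readiness parameters (is-healthy)
        checks: []          # 4. further checks against the workloads
        # 5. cleanup of the created resources happens automatically afterwards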

Tutorial

Prerequisites

To follow this tutorial, you need:

  • A Kubernetes cluster
  • FluxCD installed (verification commands below)
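
You can confirm both prerequisites from the command line, for example:

    kubectl version   # confirms connectivity to the cluster
    flux check        # verifies the FluxCD controllers are installed and healthy
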
  1. Define the workload under test

    Before you create a canary, start with a working example of the resource under test. In this example, we use a HelmRelease to deploy a PostgreSQL database.

    apiVersion: v1
    kind: Namespace
    metadata:
      name: control-plane-tests
    ---
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: HelmRepository
    metadata:
      name: bitnami
      namespace: control-plane-tests
    spec:
      type: oci
      interval: 1h
      url: oci://registry-1.docker.io/bitnamicharts
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    metadata:
      name: postgresql
    spec:
      chart:
        spec:
          chart: postgresql
          sourceRef:
            kind: HelmRepository
            name: bitnami
            namespace: control-plane-tests
          version: "*"
      interval: 1h
      values:
        auth:
          database: my_database
          password: qwerty123
          username: admin
        primary:
          persistence:
            enabled: true
            size: 8Gi
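
    One way to verify the release on its own, assuming you saved the manifests above as workload.yaml (a file name chosen for illustration):

    kubectl apply -f workload.yaml
    kubectl get helmrelease -A -w   # wait for READY to become True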

    Once you have verified that the HelmRelease works on its own, you can begin building the control plane test with canary-checker.

  2. Install the canary-checker binary

    wget https://github.com/flanksource/canary-checker/releases/latest/download/canary-checker_linux_amd64 \
      -O /usr/bin/canary-checker && \
      chmod +x /usr/bin/canary-checker

    Operator Mode

    This tutorial uses the CLI for faster feedback. When rolling this out to production, we recommend installing canary-checker as an operator; a sketch follows.
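
    A sketch of installing the operator with Helm; the flanksource chart repository URL below is an assumption to verify against the canary-checker docs:

    helm repo add flanksource https://flanksource.github.io/charts
    helm repo update
    helm install canary-checker flanksource/canary-checker \
      --namespace canary-checker --create-namespace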

  3. Next, create a Canary custom resource using the kubernetesResource check type; the layout of the canary is as follows:

    apiVersion: canaries.flanksource.com/v1
    kind: Canary
    metadata:
      name: control-plane-tests
      namespace: control-plane-tests
    spec:
      # how often to run the test
      schedule: "@every 1h"
      kubernetesResource: # the type of check we are executing; canary-checker has many more
        - name: helm-release-postgres-check
          waitFor:
            # The time to wait for the resources to be ready before considering the test a failure
            timeout: 10m
          staticResources:
            - # A list of resources that should be created once only and re-used across multiple tests
          resources:
            - # A list of resources to be created every time the check runs
          display:
            # optional Go text template to display the results of the check
            template: |+
              Helm release created: {{ .health | toYAML }}

    Using the workload defined in step 1, the check definition is as follows:

    apiVersion: canaries.flanksource.com/v1
    kind: Canary
    metadata:
      name: control-plane-tests
      namespace: control-plane-tests
    spec:
      schedule: "@every 1h"
      kubernetesResource:
        - name: helm-release-postgres-check
          description: "Deploy postgresql via HelmRelease"
          waitFor:
            timeout: 1m
          display:
            template: |+
              Helm release created: {{ .health | toYAML }}
          staticResources:
            - apiVersion: source.toolkit.fluxcd.io/v1
              kind: HelmRepository
              metadata:
                name: bitnami
              spec:
                type: oci
                interval: 1h
                url: oci://registry-1.docker.io/bitnamicharts
          resources:
            - apiVersion: helm.toolkit.fluxcd.io/v2
              kind: HelmRelease
              metadata:
                name: postgresql
              spec:
                chart:
                  spec:
                    chart: postgresql
                    sourceRef:
                      kind: HelmRepository
                      name: bitnami
                interval: 5m
                values:
                  auth:
                    username: admin
                    password: qwerty123
                    database: exampledb
                  primary:
                    persistence:
                      enabled: true
                      size: 8Gi

  4. Save the canary as basic-canary.yaml and run the test locally:

    canary-checker run basic-canary.yaml
    18:01:52.745 INF (k8s) Using kubeconfig /Users/moshe/.kube/config
    18:01:52.749 INF Checking basic-canary.yaml, 1 checks found
    18:01:55.209 INF (control-plane-tests) HelmRelease/control-plane-tests/postgresql (created) +kustomized
    18:02:21.072 INF (control-plane-tests.helm-release-postgres-check) PASS duration=28321 Helm release created:
    control-plane-tests/HelmRelease/postgresql:
      health: healthy
      message: Helm install succeeded for release control-plane-tests/postgresql.v1 with chart postgresql@16.2.2
      ready: true
      status: InstallSucceeded
    control-plane-tests/HelmRepository/bitnami:
      health: unknown
      ready: true
    18:02:21.073 INF 1 passed, 0 failed in 28s

    If you then run kubectl get events, you should see:

    kubectl get events
    LAST SEEN TYPE REASON OBJECT MESSAGE
    26m Normal ChartPullSucceeded helmchart/control-plane-tests-postgresql pulled 'postgresql' chart with version '16.2.2'
    26m Normal Scheduled pod/postgresql-0 Successfully assigned control-plane-tests/postgresql-0 to ip-10-0-4-167.eu-west-1.compute.internal
    26m Normal Pulled pod/postgresql-0 Container image "docker.io/bitnami/postgresql:17.2.0-debian-12-r0" already present on machine
    26m Normal Created pod/postgresql-0 Created container postgresql
    26m Normal Started pod/postgresql-0 Started container postgresql
    26m Warning Unhealthy pod/postgresql-0 Readiness probe failed: 127.0.0.1:5432 - rejecting connections
    26m Warning Unhealthy pod/postgresql-0 Readiness probe failed: 127.0.0.1:5432 - no response
    26m Normal Killing pod/postgresql-0 Stopping container postgresql
    113s Normal Scheduled pod/postgresql-0 Successfully assigned control-plane-tests/postgresql-0 to ip-10-0-4-167.eu-west-1.compute.internal
    112s Normal Pulled pod/postgresql-0 Container image "docker.io/bitnami/postgresql:17.2.0-debian-12-r0" already present on machine
    112s Normal Created pod/postgresql-0 Created container postgresql
    112s Normal Started pod/postgresql-0 Started container postgresql
    96s Normal Killing pod/postgresql-0 Stopping container postgresql
    26m Normal HelmChartCreated helmrelease/postgresql Created HelmChart/control-plane-tests/control-plane-tests-postgresql with SourceRef 'HelmRepository/control-plane-tests/bitnami'
    26m Normal SuccessfulCreate statefulset/postgresql create Pod postgresql-0 in StatefulSet postgresql successful
    26m Normal InstallSucceeded helmrelease/postgresql Helm install succeeded for release control-plane-tests/postgresql.v1 with chart postgresql@16.2.2
    26m Normal UninstallSucceeded helmrelease/postgresql Helm uninstall succeeded for release control-plane-tests/postgresql.v1 with chart postgresql@16.2.2
    26m Normal HelmChartDeleted helmrelease/postgresql deleted HelmChart 'control-plane-tests/control-plane-tests-postgresql'
    116s Normal HelmChartCreated helmrelease/postgresql Created HelmChart/control-plane-tests/control-plane-tests-postgresql with SourceRef 'HelmRepository/control-plane-tests/bitnami'
    113s Normal SuccessfulCreate statefulset/postgresql create Pod postgresql-0 in StatefulSet postgresql successful
    101s Normal InstallSucceeded helmrelease/postgresql Helm install succeeded for release control-plane-tests/postgresql.v1 with chart postgresql@16.2.2
    96s Warning CalculateExpectedPodCountFailed poddisruptionbudget/postgresql Failed to calculate the number of expected pods: found no controllers for pod "postgresql-0"
    96s Normal UninstallSucceeded helmrelease/postgresql Helm uninstall succeeded for release control-plane-tests/postgresql.v1 with chart postgresql@16.2.2
    95s Normal HelmChartDeleted helmrelease/postgresql deleted HelmChart 'control-plane-tests/control-plane-tests-postgresql'
  5. Add a custom check

    By default, kubernetesResource only checks that the created resources become ready. However, you can add custom checks to validate them further.

    For example, you can validate that the PostgreSQL database is running and accepting connections with a custom postgres check:

    apiVersion: canaries.flanksource.com/v1
    kind: Canary
    #...
    spec:
      kubernetesResource:
        - #...
          checks:
            - postgres:
                - name: postgres schemas check
                  url: "postgres://$(username):$(password)@postgresql.default.svc:5432/exampledb?sslmode=disable"
                  username:
                    value: admin
                  password:
                    value: qwerty123
                  # Since we just want to check if the database is responding,
                  # a SELECT 1 query should suffice
                  query: SELECT 1

    Accessing variables

    This example uses the $(username) and $(password) syntax to access the username and password values hardcoded in the checks section. In a production setting, reference secrets using valueFrom instead, as in the sketch below.
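
    For example, the credentials could come from a Kubernetes Secret (the secret name postgres-credentials is illustrative):

    username:
      valueFrom:
        secretKeyRef:
          name: postgres-credentials # hypothetical secret
          key: username
    password:
      valueFrom:
        secretKeyRef:
          name: postgres-credentials
          key: password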

    Alternatives to custom checks

    Instead of using a custom check, you can also add a standard Helm test pod to your chart, or define a canary inside the chart to automatically include health checks for all workloads. A sketch of a Helm test pod follows.
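
    For instance, a standard Helm test hook bundled with the chart might look like this sketch (the image tag and credentials are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: "{{ .Release.Name }}-db-connection-test"
      annotations:
        "helm.sh/hook": test # run with `helm test <release>`
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: docker.io/bitnami/postgresql:17.2.0-debian-12-r0
          command: ["psql"]
          args:
            - "postgresql://admin:qwerty123@postgresql:5432/exampledb"
            - "-c"
            - "SELECT 1"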

  6. The final test looks like:

    apiVersion: canaries.flanksource.com/v1
    kind: Canary
    metadata:
      name: control-plane-tests
      namespace: control-plane-tests
    spec:
      schedule: "@every 1m"
      kubernetesResource:
        - name: helm-release-postgres-check
          namespace: default
          description: "Deploy postgresql via HelmRelease"
          staticResources:
            - apiVersion: source.toolkit.fluxcd.io/v1
              kind: HelmRepository
              metadata:
                name: bitnami
              spec:
                type: oci
                interval: 1h
                url: oci://registry-1.docker.io/bitnamicharts
          resources:
            - apiVersion: helm.toolkit.fluxcd.io/v2
              kind: HelmRelease
              metadata:
                name: postgresql
                namespace: default
              spec:
                chart:
                  spec:
                    chart: postgresql
                    sourceRef:
                      kind: HelmRepository
                      name: bitnami
                      namespace: control-plane-tests
                interval: 5m
                values:
                  auth:
                    username: admin
                    password: qwerty123
                    database: exampledb
                  primary:
                    persistence:
                      enabled: true
                      size: 8Gi
          checks:
            - postgres:
                - name: postgres schemas check
                  url: "postgres://$(username):$(password)@postgresql.default.svc:5432/exampledb?sslmode=disable"
                  username:
                    value: admin
                  password:
                    value: qwerty123
                  # Since we just want to check if the database is responding,
                  # a SELECT 1 query should suffice
                  query: SELECT 1
          checkRetries:
            delay: 15s
            interval: 10s
            timeout: 5m
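
    Once the operator is installed, apply the canary and watch its status (the file name is illustrative):

    kubectl apply -f control-plane-canary.yaml
    kubectl get canaries -n control-plane-tests -w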

Conclusion

Continuous testing of your control plane is essential for maintaining resilient infrastructure at scale. With tools like Canary Checker, Flux, and Helm, you can:

  • Catch breaking changes early
  • Validate infrastructure changes
  • Ensure security compliance
  • Maintain platform stability
  • Reduce incident recovery time

This proactive approach helps catch issues before they impact production environments and affect your users.
