Setting up GKE autopilot with custom Datadog metrics

As Pragli has started to scale, we identified a strong need to move our services off of GCP virtual machines and onto a platform that could help us scale horizontally without manually provisioning new resources. The obvious choice was GKE (Kubernetes). With the fairly recent release of GKE Autopilot, we figured we would give it a shot and see where we land!

As we started to onboard onto the platform, we hit a couple of bumps along the way. We resolved most issues with Google's existing documentation, but configuring Datadog to capture custom metrics for monitoring, alerting, and autoscaling was especially challenging to figure out.

This post assumes basic knowledge of Helm, the de facto package manager for Kubernetes. If you or your team are still in the early stages of setting up your GKE Autopilot cluster, there are good resources out there, and I would focus on getting a stable setup before embarking on custom metrics.

With that out of the way, I hope to outline the bare minimum to set up your service with Datadog running and capturing all your custom metrics!

Tech Stack / Requirements

  • Node (Typescript)
  • HotShots (NodeJS Datadog Lib)
  • Kubernetes
  • Helm (drastically reduces complexity for managing your cluster(s))
  • GKE Autopilot
  • Docker (not necessary for this tutorial, but we use Docker for packaging our applications)

Configure / Install Datadog Agent on your cluster

Helm uses configurable values in a values.yaml file to customize the Kubernetes resources that are deployed. We also include our custom Datadog configuration in a separate configurable YAML file called datadog-values.yaml. Here are the config params that you'll need to get metrics working in Datadog:

datadog.dogstatsd.originDetection: true - Used to tag all collected metrics with additional information such as the pod name.

datadog.dogstatsd.tagCardinality: orchestrator - The default value is "low". We set it to "orchestrator" because it adds pod name tagging, which we found necessary to get the granularity needed for monitoring our services; we would be surprised if this weren't the case for most use cases. More info can be found in the Datadog docs for setting up metric tagging.

datadog:
  apiKey: ...   # your Datadog API key
  appKey: ...   # your Datadog application key

  logs:
    enabled: true
    containerCollectAll: true
  apm:
    enabled: true
  kubeStateMetricsEnabled: false
  kubeStateMetricsCore:
    enabled: true

  dogstatsd:
    port: 8125
    useHostPort: true   # expose DogStatsD on the node so pods can reach it via the host IP
    nonLocalTraffic: true
    originDetection: true
    tagCardinality: orchestrator

clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
    useDatadogMetrics: true   # serve external metrics from DatadogMetric resources (used for autoscaling later)

providers:
  gke:
    autopilot: true   # tells the chart to deploy with Autopilot-compatible settings

Once you add these values, let's add the Datadog repo to our Helm configuration and get them deployed.

helm repo add datadog https://helm.datadoghq.com

helm install datadog -f <datadog-values.yaml> datadog/datadog

If you make changes to your Datadog values file in the future, you'll run helm upgrade instead of helm install.
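
For example, assuming the same release name and values file as above, the upgrade would look something like this:

helm upgrade datadog -f datadog-values.yaml datadog/datadog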

Deploy A Sample Service

There are two environment variables that are critical to make sure that Datadog forwards metrics correctly.

DD_AGENT_HOST: Used by the hot-shots library to send all necessary information to the Datadog agent.

DD_ENTITY_ID: Used by Datadog for origin detection. In short, this environment variable, in conjunction with enabling origin detection in our Datadog values file, will automatically tag metrics with information about where they come from. This will make it easier for your team to find metrics and properly set up dashboards within Datadog.

apiVersion: apps/v1
kind: Deployment

...
        env:
          - name: DD_AGENT_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: DD_ENTITY_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid

Now that your service is configured to send custom metrics to the Datadog agent, the last thing to do is use one of the handful of client libraries to publish those metrics. In our case, we are using the HotShots library, but Datadog has a long list of libraries depending on your language of choice.
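
If the library isn't in your project yet, it can be added like any other npm package (assuming npm as your package manager):

npm install hot-shots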

Here's a quick example of how to use HotShots with Node.

import StatsD from 'hot-shots';

// With no explicit host, hot-shots reads the DD_AGENT_HOST environment
// variable we set on the deployment above
const dogstatsd = new StatsD();

...

dogstatsd.increment("customMetric");
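
For reference, the active-connection metric we autoscale on in the bonus section below can be published the same way. Here's a minimal sketch; the activeConnections counter and the reporting interval are just illustrative:

import StatsD from 'hot-shots';

const dogstatsd = new StatsD();

// Illustrative counter of open WebSocket connections handled by this pod
let activeConnections = 0;

// Periodically report the current count as a gauge; with origin detection
// enabled, Datadog tags each point with the pod it came from
setInterval(() => {
  dogstatsd.gauge('connections.active', activeConnections);
}, 10000);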

The last step here is to go through your deploy process (build / package) and run a helm upgrade on your service. If everything has been set up properly, you should be able to use the Datadog Metrics Explorer to find your newly created metrics once they have been incremented at least once!
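
As a rough sketch of what that loop might look like (the project, service name, tag, chart path, and image.tag value are all placeholders for your own setup):

docker build -t gcr.io/<your-project>/<your-service>:<tag> .
docker push gcr.io/<your-project>/<your-service>:<tag>
helm upgrade <your-service> <path-to-your-chart> --set image.tag=<tag>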

Bonus: Autoscale your service based on a custom metric

Now that your cluster is properly set up to capture the custom metrics from your service, we can also use those metrics to autoscale. The code blocks below show how we scale our infrastructure based on the number of active WebSocket connections detected.

In our example below, we set up two Kubernetes resources: a DatadogMetric, which allows us to define and expose a Datadog query to our GKE cluster, and a HorizontalPodAutoscaler, the standard Kubernetes resource used for autoscaling.

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: active-connections
spec:
  query: default_zero(exclude_null(sum:connections.active{pod_name:sync-*}))
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: active-connection-scaling
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sync
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metric:
          name: datadogmetric@default:active-connections # "default" is just the name of your namespace in GKE
        target:
          type: AverageValue
          averageValue: 100
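
Once these are applied, you can sanity-check that everything is wired up (assuming the resources were created in the default namespace) with standard kubectl commands:

kubectl get datadogmetric active-connections
kubectl describe hpa active-connection-scaling

The DatadogMetric status should report the query as valid and updated, and the HPA output should show a current value for the external metric once Datadog starts returning data.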

Conclusion

Once set up properly, the integration between GKE Autopilot and Datadog works fantastically. It just takes the right settings for each piece to talk to the others correctly. If you have any questions, feel free to find me on Twitter @Scalahansolo!


What is Pragli?

Pragli is a team communication platform that makes remote and hybrid work faster, more fun, and more inclusive.

Our product is currently completely free - try it out at https://pragli.com/
