Nomad Prometheus Metrics with mTLS

Enabling mTLS for Nomad is a great way to secure traffic and restrict access to the Nomad API, but it comes with some downsides. One of them is that it breaks the default Prometheus setup, because the metrics API is secured with mTLS as well. This is easy to fix, but not entirely obvious if you aren’t experienced with Nomad and Prometheus.

This example assumes you have either:

  1. Existing certificates for CLI access
  2. Dynamic certificates through Vault

I won’t get into how to generate the certificates, but instead I’ll just focus on what needs to be adjusted to get metrics working again.

The first thing to do is to inject the certificates into the Prometheus job so that Prometheus can use them when scraping the metrics endpoints. This is easy to do by reading the secrets from Vault within the Nomad job.

This just uses the usual template stanzas in the Nomad task definition:

      template {
        change_mode = "noop"
        destination = "secrets/ca.crt"

        data = <<EOF
-----BEGIN CERTIFICATE-----
...snip...
-----END CERTIFICATE-----
EOF

      }

      template {
        change_mode = "noop"
        destination = "secrets/cli.pem"

        data = <<EOF
{{ with secret "secrets/data/nomad"}}{{ .Data.data.tls_key }}{{ end }}
EOF

      }

      template {
        change_mode = "noop"
        destination = "secrets/cli.crt"

        data = <<EOF
{{ with secret "secrets/data/nomad"}}{{ .Data.data.tls_cert }}{{ end }}
EOF

      }

The next step is to configure the Prometheus scraper to use HTTPS instead of HTTP:

scrape_configs:
  - job_name: 'nomad_metrics'

    tls_config:
      # Even though we're providing a CA, validation will fail unless each node's cert includes its own IP address
      # as a SAN, since most Nomad certs are only issued for `server.<region>.nomad` or `client.<region>.nomad`.
      # If you do give each node a cert with the appropriate IP SAN, you can remove this line.
      insecure_skip_verify: true
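      # these paths must match wherever the rendered template files are visible inside the Prometheus task
      ca_file: '/opt/ssl/ca.crt'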
      cert_file: '/opt/ssl/cli.crt'
      key_file: '/opt/ssl/cli.pem'

    consul_sd_configs:
    # consul server IP here
    - server: '10.10.10.10:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    # This will force Prometheus to access the metrics endpoint with HTTPS
    - source_labels: ['__scheme__']
      target_label: __scheme__
      replacement: https

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

Once you add these to the Prometheus job definition and redeploy, Prometheus should start scraping the API again.

One improvement you could make is to use Vault to generate the client certificates dynamically so that they are short-lived, instead of storing a long-lived cert in Vault, but that’s a topic for another day.