Prometheus: Discover services with DNS

In a previous post I covered how to use Consul for service discovery, allowing Prometheus to automatically discover what services to monitor.

There are some cases where either setting up Consul (or similar) is not viable, or adds complexity that is not required. If you are already running your own DNS nameservers, you could make use of DNS SRV records.

Common DNS record types

The most common DNS records are A, AAAA and PTR. An A record is a simple “name to IPv4” mapping, e.g. one.one.one.one would become 1.1.1.1. A AAAA record is the same, except for IPv6.

$ dig A one.one.one.one

; <<>> DiG 9.14.7 <<>> A one.one.one.one
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34576
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: 19b10b10c5d45d10 (echoed)
;; QUESTION SECTION:
;one.one.one.one.		IN	A

;; ANSWER SECTION:
one.one.one.one.	176	IN	A	1.1.1.1
one.one.one.one.	176	IN	A	1.0.0.1

;; Query time: 8 msec

$ dig AAAA one.one.one.one

; <<>> DiG 9.14.7 <<>> AAAA one.one.one.one
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12686
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: a91fb8e973aa5e78 (echoed)
;; QUESTION SECTION:
;one.one.one.one.		IN	AAAA

;; ANSWER SECTION:
one.one.one.one.	299	IN	AAAA	2606:4700:4700::1111
one.one.one.one.	299	IN	AAAA	2606:4700:4700::1001

;; Query time: 24 msec

A PTR record, or Pointer, is what provides reverse DNS. When you see IPs translated to a hostname (for example, in a traceroute), it is PTR records that are providing this. Some tools, like host, automatically translate the IP address into the correct format for PTR records: -

$ host 1.1.1.1
1.1.1.1.in-addr.arpa domain name pointer one.one.one.one.

However, other tools do not: -

$ dig PTR 1.1.1.1
; <<>> DiG 9.14.7 <<>> PTR 1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 40153
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: 4ed0f8ba650f2734 (echoed)
;; QUESTION SECTION:
;1.1.1.1.			IN	PTR

;; AUTHORITY SECTION:
.			773	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2019121000 1800 900 604800 86400

;; Query time: 1 msec

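To use dig to check a PTR record, you need to either pass the -x flag (dig -x 1.1.1.1) or supply the IP address with its octets in reverse order and in-addr.arpa appended. 1.1.1.1 reads the same in both directions, so the reversal is not visible here: -

1.1.1.1 -> 1.1.1.1.in-addr.arpa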

$ dig PTR 1.1.1.1.in-addr.arpa

; <<>> DiG 9.14.7 <<>> PTR 1.1.1.1.in-addr.arpa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19361
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: 2d9b763cb41a84ed (echoed)
;; QUESTION SECTION:
;1.1.1.1.in-addr.arpa.		IN	PTR

;; ANSWER SECTION:
1.1.1.1.in-addr.arpa.	248	IN	PTR	one.one.one.one.

;; Query time: 1 msec

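The same is true for IPv6 records, except the format is much longer: the address is fully expanded, every hexadecimal digit is written in reverse order, and ip6.arpa is appended: -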

2606:4700:4700::1111 -> 1.1.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.7.4.0.0.7.4.6.0.6.2.ip6.arpa

$ dig PTR 1.1.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.7.4.0.0.7.4.6.0.6.2.ip6.arpa

; <<>> DiG 9.14.7 <<>> PTR 1.1.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.7.4.0.0.7.4.6.0.6.2.ip6.arpa
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2723
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: 5762756121316ea0 (echoed)
;; QUESTION SECTION:
;1.1.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.7.4.0.0.7.4.6.0.6.2.ip6.arpa. IN PTR

;; ANSWER SECTION:
1.1.1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.7.4.0.0.7.4.6.0.6.2.ip6.arpa. 165 IN PTR one.one.one.one.

;; Query time: 1 msec
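
If you need to build these reverse names in a script rather than by hand, the conversions above can be sketched in Python. This is a minimal example using the standard library's ipaddress module; the ptr_name helper is a name made up for illustration.

```python
import ipaddress


def ptr_name(address: str) -> str:
    """Return the name to query for the PTR record of an IPv4 or IPv6 address."""
    # reverse_pointer reverses the octets (IPv4) or the hex digits of the
    # fully expanded address (IPv6) and appends in-addr.arpa or ip6.arpa.
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    for addr in ("1.1.1.1", "192.168.20.1", "2606:4700:4700::1111"):
        print(f"{addr} -> {ptr_name(addr)}")
```

The 192.168.20.1 case makes the reversal easier to see than 1.1.1.1 does, as it becomes 1.20.168.192.in-addr.arpa.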

What is an SRV record?

Rather than just being a mapping from a hostname to an IP (e.g. A or AAAA), or the reverse (PTR), an SRV record describes where a service can be reached. The record name carries the service and the protocol (TCP/UDP), and each answer carries a priority, a weight, a port and a target hostname. Common uses include SIP and Active Directory Domain Controller discovery.

If you try to join a Windows Domain with just the domain name (e.g. example.com), the client looks up a DNS SRV record under example.com, which returns the list of Domain Controllers: -

$ dig SRV _ldap._tcp.dc._msdcs.example.com

; <<>> DiG 9.14.7 <<>> SRV _ldap._tcp.dc._msdcs.example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54352
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 6

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: 00d2c28406648fbf (echoed)
;; QUESTION SECTION:
;_ldap._tcp.dc._msdcs.example.com. IN SRV

;; ANSWER SECTION:
_ldap._tcp.dc._msdcs.example.com. 600 IN SRV 0 100 389 dc-01.example.com.
_ldap._tcp.dc._msdcs.example.com. 600 IN SRV 0 100 389 dc-02.example.com.
_ldap._tcp.dc._msdcs.example.com. 600 IN SRV 0 100 389 dc-03.example.com.

;; ADDITIONAL SECTION:
dc-01.example.com. 3600	IN	A  192.168.20.1
dc-02.example.com. 3600	IN	A  192.168.20.2
dc-03.example.com. 3600	IN	A  192.168.20.3

It is worth noting that SRV records point to A/AAAA records (see the ADDITIONAL SECTION), so they must be set up too.

What the above gives you is the protocol (TCP, from the record name), the port (389) and the hostnames to reach Active Directory.
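
To make the fields in an SRV answer easier to read, here is a minimal sketch that splits one line of the dig ANSWER SECTION into named parts. The SrvRecord type and parse_srv_line function are hypothetical names for illustration, and the sketch assumes the whitespace-separated layout shown in the output above.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    name: str      # full record name, e.g. _ldap._tcp.dc._msdcs.example.com.
    service: str   # first label without the underscore, e.g. ldap
    protocol: str  # second label without the underscore, e.g. tcp
    ttl: int
    priority: int  # lower values are preferred by clients
    weight: int    # relative share among records of equal priority
    port: int
    target: str    # hostname that must have its own A/AAAA record


def parse_srv_line(line: str) -> SrvRecord:
    """Parse one 'name ttl IN SRV priority weight port target' line."""
    name, ttl, _cls, rtype, priority, weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {rtype}")
    labels = name.split(".")
    return SrvRecord(
        name=name,
        service=labels[0].lstrip("_"),
        protocol=labels[1].lstrip("_"),
        ttl=int(ttl),
        priority=int(priority),
        weight=int(weight),
        port=int(port),
        target=target,
    )


if __name__ == "__main__":
    line = "_ldap._tcp.dc._msdcs.example.com. 600 IN SRV 0 100 389 dc-01.example.com."
    print(parse_srv_line(line))
```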

How can Prometheus use this?

To monitor a host, Prometheus requires the IP/hostname, port and protocol. This is exactly what an SRV record exposes, so it can be leveraged for service discovery. The exact implementation is documented in the Prometheus configuration reference, under dns_sd_config.

Example: ETCD

ETCD (a distributed key/value store) can discover the other members of its cluster using DNS SRV records (documentation [here](https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/clustering.md)). We can use the same SRV records to monitor the ETCD instances too.

An example of an ETCD SRV record is below: -

$ dig SRV _etcd-client-ssl._tcp.staging.example.com.

; <<>> DiG 9.14.7 <<>> SRV _etcd-client-ssl._tcp.staging.example.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20086
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
; COOKIE: db9a9582f9740d45 (echoed)
;; QUESTION SECTION:
;_etcd-client-ssl._tcp.staging.example.com. IN SRV

;; ANSWER SECTION:
_etcd-client-ssl._tcp.staging.example.com. 204 IN	SRV 10 50 2379 etcd-10-11-99-42.staging.example.com.
_etcd-client-ssl._tcp.staging.example.com. 204 IN	SRV 10 50 2379 etcd-10-11-160-216.staging.example.com.
_etcd-client-ssl._tcp.staging.example.com. 204 IN	SRV 10 50 2379 etcd-10-11-164-63.staging.example.com.
_etcd-client-ssl._tcp.staging.example.com. 204 IN	SRV 10 50 2379 etcd-10-11-46-92.staging.example.com.
_etcd-client-ssl._tcp.staging.example.com. 204 IN	SRV 10 50 2379 etcd-10-11-97-104.staging.example.com.

;; ADDITIONAL SECTION:
etcd-10-11-99-42.staging.example.com. 4 IN A 10.11.99.42

;; Query time: 26 msec

To make use of this within Prometheus, you need to format the scrape configuration like so: -

  - job_name: 'etcd-scrape'
    scheme: https
    dns_sd_configs:
    - names:
      - '_etcd-client-ssl._tcp.staging.example.com.'
    tls_config:
      ca_file: etcd-certs/ca.pem
      cert_file: etcd-certs/client.pem
      key_file: etcd-certs/client-key.pem

The only part you need for DNS discovery is the dns_sd_configs section. The rest allows Prometheus to speak HTTPS to the ETCD API. The discovered hosts will then appear as targets in Prometheus.
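
As a rough illustration of how the SRV answers and the scrape configuration combine, the sketch below builds the URLs you would expect to see on the targets page. It is not Prometheus code: scrape_urls is a hypothetical helper, and it assumes the scheme from the job above, the default metrics path of /metrics, and that the trailing dot of the target is dropped.

```python
from typing import Iterable, List, Tuple

# (priority, weight, port, target), as returned in an SRV answer.
SrvAnswer = Tuple[int, int, int, str]


def scrape_urls(
    answers: Iterable[SrvAnswer],
    scheme: str = "https",
    metrics_path: str = "/metrics",
) -> List[str]:
    """Combine each SRV target and port with the job's scheme and path."""
    urls = []
    for _priority, _weight, port, target in answers:
        # Priority and weight are not used in this sketch; every answer
        # becomes a target. The trailing dot of the rooted name is dropped.
        host = target.rstrip(".")
        urls.append(f"{scheme}://{host}:{port}{metrics_path}")
    return urls


if __name__ == "__main__":
    answers = [
        (10, 50, 2379, "etcd-10-11-99-42.staging.example.com."),
        (10, 50, 2379, "etcd-10-11-160-216.staging.example.com."),
    ]
    for url in scrape_urls(answers):
        print(url)
```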

How to update the SRV record?

It all depends on your use case. In some cases, this may be a manual process. It can also be done by the systems themselves (Active Directory being a good example). Alternatively, use whatever automation method you feel is appropriate.

For example, I built a small Golang utility for ETCD that scrapes AWS tags and, for instances that have the correct etcd-cluster tag, updates the SRV record for that cluster. This has the advantage that all cluster nodes can run the utility, rather than relying on one node to make the updates.
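
The core of that idea can be sketched as a pure function. This is not the Golang utility itself: the instance dictionaries, the srv_values name and the default port, priority and weight are assumptions for illustration, and the hostname scheme is inferred from the dig output earlier in this post. A real version would fetch the instances from the AWS API and write the values to the DNS provider.

```python
from typing import Dict, Iterable, List


def srv_values(
    instances: Iterable[Dict],
    cluster: str,
    domain: str,
    port: int = 2379,
    priority: int = 10,
    weight: int = 50,
) -> List[str]:
    """Return 'priority weight port target' values for instances in a cluster."""
    values = []
    for inst in instances:
        # Keep only instances whose etcd-cluster tag matches this cluster.
        if inst.get("tags", {}).get("etcd-cluster") != cluster:
            continue
        # 10.11.99.42 -> etcd-10-11-99-42.<domain>. (assumed naming scheme)
        host = "etcd-" + inst["private_ip"].replace(".", "-") + "." + domain + "."
        values.append(f"{priority} {weight} {port} {host}")
    # Sort so every node computes the same record set in the same order.
    return sorted(values)


if __name__ == "__main__":
    instances = [
        {"private_ip": "10.11.99.42", "tags": {"etcd-cluster": "staging"}},
        {"private_ip": "10.11.1.1", "tags": {"etcd-cluster": "production"}},
        {"private_ip": "10.11.46.92", "tags": {"etcd-cluster": "staging"}},
    ]
    for value in srv_values(instances, "staging", "staging.example.com"):
        print(value)
```

Because the output depends only on the tagged instances, any node can run it and arrive at the same record set.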

Summary

My personal preference for service discovery is definitely Consul. However, if you already have DNS records that are being created (e.g. by ETCD or Active Directory), or the additional complexity of Consul will not provide enough benefit, DNS service discovery could be the way to go.