(I’m so bad at finding a good post title)

A story about PGBouncer being unable to resolve the name of a service in the same namespace, with a bit of c-ares and CoreDNS.

The DNS problem

“it’s always DNS”

my colleague, Feb 2024

Overall issue

While deploying the Percona Operator for PostgreSQL in Kubernetes, we discovered that the PGBouncer pod was not able to connect to the primary PostgreSQL service because of a DNS resolution issue. Yet our own attempts from inside the PGBouncer container worked fine, using getent hosts postgresql-primary, for both A and AAAA types.
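
To repeat that manual check from a script, here is a minimal sketch that resolves a name through the system resolver, one address family at a time. It assumes Python is available in the container, which may not be the case in the real image; the helper name libc_lookup is made up for this example. Python's socket.getaddrinfo goes through the C library, so it exercises the same path as getent and not the one PGBouncer uses.

```python
import socket


def libc_lookup(name, family):
    """Resolve `name` through the system resolver for a single address family.

    Returns a sorted list of addresses, or an empty list if resolution fails.
    """
    try:
        infos = socket.getaddrinfo(name, None, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # the resolver reported a failure for this family
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address.
    return sorted({info[4][0] for info in infos})
```

Calling libc_lookup("postgresql-primary", socket.AF_INET) and then the same with socket.AF_INET6 gives the IPv4 and IPv6 view of the lookup as the C library sees it.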

The strange thing is that inside minikube on my colleague’s laptop, it was working fine.

Going deeper into the rabbit hole

After some investigation (with strace and tcpdump), a colleague found out that the PGBouncer provided in the container does not rely on the venerable GNU libc to resolve hostnames but actually uses c-ares, an asynchronous resolver library.

Here are the steps observed during the DNS resolution:

  1. PGBouncer calls ares_gethostbyname on postgresql-primary with the AF_UNSPEC address family, which means it does not specify whether to use IPv4 or IPv6.
  2. The request is passed to c-ares, which tries IPv6 first, so it sends a AAAA query to the resolver.
  3. The AAAA query reaches CoreDNS (the in-cluster resolver), which at some point forwards it to the Kubernetes node.
  4. The resolution on the node is performed by systemd-resolved, which returns a SERVFAIL answer.
  5. PGBouncer receives the SERVFAIL answer and returns the error to the client. Final stop.
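
The steps above can be sketched as a tiny model. This is not c-ares code, only a simplification of the behaviour we observed: AAAA is tried first, a SERVFAIL ends the lookup, and only an empty NOERROR reply leads to an A query. The names resolve_unspec and query are made up for this example.

```python
from enum import Enum


class Rcode(Enum):
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3


def resolve_unspec(name, query):
    """Model of the observed AF_UNSPEC lookup.

    `query(name, qtype)` is a hypothetical callable that returns
    (Rcode, list_of_addresses) for qtype "AAAA" or "A".
    """
    rcode, answers = query(name, "AAAA")
    if rcode is not Rcode.NOERROR:
        # Observed behaviour: an error on AAAA stops everything, no A query is sent.
        raise LookupError(f"{name}: AAAA query failed with {rcode.name}")
    if answers:
        return answers
    # Empty but successful AAAA reply: fall back to an A query.
    rcode, answers = query(name, "A")
    if rcode is not Rcode.NOERROR or not answers:
        raise LookupError(f"{name}: A query failed with {rcode.name}")
    return answers
```

With a query function that answers SERVFAIL to AAAA, the A record is never asked for, even if it exists. That is exactly the dead end of step 5.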

A quick look at the CoreDNS configuration

The default CoreDNS configuration looks more or less like this:

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
    import config/*.config
}

The important points are

kubernetes cluster.local in-addr.arpa ip6.arpa {
    pods insecure
    fallthrough in-addr.arpa ip6.arpa
    ttl 30
}

which implements the service discovery DNS resolution inside the Kubernetes cluster,

and

forward . /etc/resolv.conf

which means the remaining queries are forwarded to the resolvers listed in /etc/resolv.conf in the CoreDNS pod, which is actually the file of the Kubernetes node.
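
To see where forwarded queries would end up, one can list the nameserver entries of that file. Here is a minimal sketch; the function name and the sample content are made up for this example (the address comes from a documentation range), so the real file on your nodes will differ.

```python
def upstream_nameservers(resolv_conf_text):
    """Return the addresses of the `nameserver` lines found in a resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file content, for illustration only.
SAMPLE = """\
# managed by the node
nameserver 192.0.2.53
search example.internal
options edns0
"""

print(upstream_nameservers(SAMPLE))
```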

Why does systemd-resolved return a SERVFAIL?

As I understand it, systemd-resolved, for the sake of security, does not forward what it considers local queries to internet resolvers, and answers SERVFAIL rather than NXDOMAIN.

Could it be related to the fact that the .local TLD is used by zeroconf and mDNS (Apple Bonjour or Avahi)? I don’t know.

Anyway, after this, PGBouncer stops any further DNS query.

Resolving the resolution

What we needed, in the end, was to make PGBouncer resolve using the A type instead of AAAA. We skipped any solution that would have required configuration on the hosts, because we don’t want to manage them.

Attempt #1: Reply with an A record to a AAAA query

CoreDNS provides a rewrite plugin which, as its name implies, allows rewriting DNS queries. Skipping over our tests and investigation, we eventually added this new config as a ConfigMap.

rewrite AAAA A

Which means: in the event of a AAAA query, return an A answer if the A answer exists. Full stop. This is a kind of bulldozer solution because, whatever the namespace or the pod, CoreDNS replies with A. It worked fine, at least until we discovered that the loki-gateway pod, which runs an nginx process as a reverse proxy to the various Loki components, was not working anymore because it was not happy receiving an A answer to a AAAA query.
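
To make the side effect concrete, here is a toy model of that rule. It is not CoreDNS code, and the strict client is only a stand-in for whatever check nginx actually performs, which we did not investigate. All function names and the record data are made up for this example.

```python
def rewrite_type(qtype):
    """Model of `rewrite AAAA A`: a AAAA question is turned into an A question."""
    return "A" if qtype == "AAAA" else qtype


def answer_query(qname, qtype, records):
    """Answer using the rewritten type.

    `records` is a hypothetical mapping of (name, type) to a list of addresses.
    """
    effective = rewrite_type(qtype)
    return [(effective, addr) for addr in records.get((qname, effective), [])]


def strict_client_accepts(asked_type, answer):
    """Hypothetical strict client: every record must have the type that was asked for."""
    return all(rtype == asked_type for rtype, _ in answer)


# Hypothetical service record, for illustration only.
records = {("some-service", "A"): ["10.0.0.7"]}
reply = answer_query("some-service", "AAAA", records)
print(reply, strict_client_accepts("AAAA", reply))
```

A lenient client is happy with the A record; a client that checks the record type against its question is not, whatever the name being resolved.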

Attempt #2: Return an empty reply to a AAAA query for the specific record

So back to work: we needed to limit the rewrite to the postgresql-primary record only, or else we could break other pods.

But as rewrite did not allow a regexp for TYPE, we used the template plugin, which lets you construct a DNS answer.

We ended up with this simple snippet:

template IN AAAA {
    match `^postgresql-primary.*`
    rcode NOERROR
    fallthrough
}

For a AAAA query whose name matches ^postgresql-primary.*, CoreDNS answers nothing (an empty NOERROR reply); otherwise the resolution continues with the next plugin.

So in our case, once PGBouncer receives the empty answer with NOERROR for AAAA, it sends a new DNS query, this time using the A type, and it is resolved without any issue.
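
The decision taken by that template block can be sketched as follows. This is a model of the rule, not CoreDNS code, and the function name and the namespace in the usage example are made up. Returning None stands for "fall through to the next plugin".

```python
import re

# Same pattern as in the template block. CoreDNS uses Go regular expressions,
# but this simple pattern behaves the same with Python's re module.
PATTERN = re.compile(r"^postgresql-primary.*")


def template_aaaa(qname, qtype):
    """Return ("NOERROR", []) when the rule applies, or None to fall through."""
    if qtype != "AAAA":
        return None  # the block is declared for AAAA queries only
    if PATTERN.match(qname):
        return ("NOERROR", [])  # successful reply with an empty answer section
    return None  # no regex match: resolution continues as usual


print(template_aaaa("postgresql-primary.db.svc.cluster.local.", "AAAA"))
print(template_aaaa("postgresql-primary.db.svc.cluster.local.", "A"))
```

Only the AAAA query for the matching name is short-circuited; the A query for the same name and every query for other names go through the normal Kubernetes resolution.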

Problem solved.