DNS-Based Service Discovery

Introduction

Service discovery is a fundamental problem in distributed systems: how does one component find and communicate with another when instances are ephemeral, addresses change, and the topology is dynamic?
While purpose-built service discovery tools like Consul, etcd, and ZooKeeper have gained popularity, DNS remains one of the most widely deployed and underappreciated mechanisms for service discovery.
Its ubiquity, simplicity, and broad client support make it a compelling choice, but its limitations around caching, propagation delay, and health checking require careful engineering to use effectively at scale.

This article examines DNS-based service discovery in detail, covering the protocol mechanisms that enable it, the SRV record type that makes it practical, common architectural patterns, and the tradeoffs engineers must navigate.

Why DNS?

Every networked application already speaks DNS.
There is no additional client library to install, no sidecar proxy to configure, and no proprietary protocol to learn.
This zero-dependency property is DNS's strongest advantage.
A service can be discovered with a standard library call (getaddrinfo, resolve, or equivalent) in any programming language on any operating system.

DNS also benefits from a mature caching infrastructure.
Recursive resolvers, stub resolvers, and application-level caches all reduce the load on authoritative servers.
For service discovery, this caching behavior is both an advantage (reducing latency and backend load) and a liability (serving stale records for instances that have been removed).

Core Mechanisms

A and AAAA Records

The simplest form of DNS-based service discovery maps a service name to one or more IP addresses using A (IPv4) or AAAA (IPv6) records.
For example:

api.internal.example.com.  30  IN  A  10.0.1.5
api.internal.example.com.  30  IN  A  10.0.1.6
api.internal.example.com.  30  IN  A  10.0.1.7

Clients resolve the hostname and receive multiple addresses.
Many DNS servers rotate the order of the returned records between responses, and clients commonly connect to the first address they receive, which provides basic load distribution.
The TTL (30 seconds in this example) controls how long resolvers cache the response before re-querying.

This approach is simple but limited.
It encodes only IP addresses, not ports, protocols, or instance priorities.
If a service moves to a non-standard port or you need weighted routing, A records alone are insufficient.
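
As a rough illustration of the lookup described above, the following Python sketch collects every address a name resolves to using the standard library's socket.getaddrinfo. The helper name resolve_addresses is an assumption made for this example, and the order of the results depends on the local resolver configuration.

```python
import socket


def resolve_addresses(host: str, port: int) -> list[str]:
    """Return the distinct IP addresses (IPv4 and IPv6) that `host` resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: list[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element is the address for both IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# Example call (raises socket.gaierror if the name does not resolve):
#   resolve_addresses("api.internal.example.com", 443)
```

Note that the result carries no port or weight information: the port must be known to the caller in advance, which is the gap SRV records fill.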

SRV Records

RFC 2782 introduced the SRV record type specifically to address service location.
SRV records encode the hostname, port, priority, and weight for instances of a named service.
The query name follows the convention _service._proto.name:

_http._tcp.api.example.com.  30  IN  SRV  10  60  8080  api-1.example.com.
_http._tcp.api.example.com.  30  IN  SRV  10  40  8080  api-2.example.com.
_http._tcp.api.example.com.  30  IN  SRV  20  50  8080  api-3.example.com.

The fields are: priority (lower is preferred), weight (for load distribution within a priority level), port, and target hostname.
Clients should first group records by priority, then distribute traffic within each group according to weights.
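
To make the field layout concrete, here is a minimal Python sketch that parses zone-file style SRV lines like the ones above and groups them by priority. The SrvRecord type and both function names are hypothetical helpers for this article, not part of any DNS library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv_line(line: str) -> SrvRecord:
    """Parse one line of the form: name TTL IN SRV priority weight port target."""
    fields = line.split()
    if len(fields) != 8 or fields[2].upper() != "IN" or fields[3].upper() != "SRV":
        raise ValueError(f"not an SRV record line: {line!r}")
    name, ttl, _cls, _rtype, priority, weight, port, target = fields
    return SrvRecord(name, int(ttl), int(priority), int(weight), int(port), target)


def group_by_priority(records):
    """Return {priority: [records]} with the most preferred (lowest) priority first."""
    groups = {}
    for rec in sorted(records, key=lambda r: r.priority):
        groups.setdefault(rec.priority, []).append(rec)
    return groups
```

Applied to the three example records, this yields two groups: priority 10 (weights 60 and 40, so api-1 should receive roughly 60% of first attempts) and priority 20, which is used only when every priority-10 target is unreachable.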

DNS-SD (DNS-Based Service Discovery)

RFC 6763 builds on SRV records to define a complete service discovery protocol.
DNS-SD adds browsing capabilities: a client can enumerate available service types and instances without knowing their names in advance.
It uses PTR records to list instances of a service type, SRV records for location, and TXT records for metadata (key-value pairs such as a protocol version or feature flags).

A typical DNS-SD lookup involves three steps:

1. Query a PTR record for _http._tcp.example.com to discover instance names.
2. Query an SRV record for each instance to get host, port, priority, and weight.
3. Optionally query a TXT record for each instance to get metadata.
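
The three steps can be sketched in Python against a stand-in for the DNS. Everything here is an assumption made for illustration: ZONE is a hypothetical in-memory table, query replaces a real resolver call, and the instance name web-1 and its TXT keys are invented.

```python
# Hypothetical in-memory zone: (owner name, record type) -> list of record data.
ZONE = {
    ("_http._tcp.example.com.", "PTR"): ["web-1._http._tcp.example.com."],
    ("web-1._http._tcp.example.com.", "SRV"): [(10, 60, 8080, "api-1.example.com.")],
    ("web-1._http._tcp.example.com.", "TXT"): [["version=2", "path=/v2"]],
}


def query(name, rtype):
    """Stand-in for a real DNS query; returns [] when no record exists."""
    return ZONE.get((name, rtype), [])


def discover(service_type):
    """Run the three lookup steps and return one dict per discovered instance."""
    results = []
    for instance in query(service_type, "PTR"):  # step 1: enumerate instances
        metadata = {}
        for strings in query(instance, "TXT"):  # step 3: optional metadata
            for item in strings:
                key, _, value = item.partition("=")
                metadata[key] = value
        for priority, weight, port, target in query(instance, "SRV"):  # step 2: locate
            results.append({
                "instance": instance,
                "target": target,
                "port": port,
                "priority": priority,
                "weight": weight,
                "metadata": metadata,
            })
    return results
```

A real client would replace query with calls to a resolver library and would still need to resolve each SRV target to an address before connecting.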

Walkthrough

The following walkthrough describes how a client resolves a service endpoint using SRV records, following the selection algorithm defined in RFC 2782.

SRV Record Selection Algorithm

diagram-1 — SRV record selection and retry flow

Input: Set of SRV records R for a given service name

1. Sort R by priority (ascending). Group records into priority classes P_1, P_2, ..., P_n.

2. Select the lowest priority class P = P_1.

3. While P is non-empty:
   a. Compute total_weight = sum of weights of all records in P.
   b. If total_weight == 0:
        Select a record uniformly at random from P.
      Else:
        Order the records in P so that all weight-0 records come first.
        Generate a random integer r in [0, total_weight] inclusive.
        Iterate through the records in that order, adding each record's
        weight to a running sum (starting at 0). Select the first record
        whose running sum >= r.
        Note: RFC 2782 specifies that records with weight 0 should have
        a very small chance of being selected. Because weight-0 records
        come first and add nothing to the running sum, they are selected
        only when r == 0.
   c. Remove the selected record from P.
   d. Attempt to connect to the selected record's target:port.
   e. If connection succeeds, return (target, port).

4. If every record in P has been tried without success,
   set P to the next priority class and repeat step 3.

5. If all priority classes are exhausted, return failure.

This algorithm ensures that lower-priority-value (higher preference) instances are tried first, and weighted random selection provides proportional load distribution within each tier.
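
The selection procedure can be sketched in Python as follows. This is one possible reading of the steps above rather than a reference implementation: records are plain (priority, weight, port, target) tuples, and the function names and the try_connect callback are assumptions made for the example.

```python
import random


def order_srv_targets(records, rng=random):
    """Yield (target, port) pairs in the order a client should try them.

    `records` is an iterable of (priority, weight, port, target) tuples.
    """
    by_priority = {}
    for rec in records:
        by_priority.setdefault(rec[0], []).append(rec)

    for priority in sorted(by_priority):  # lowest value is tried first
        # Weight-0 records go first so they are picked only when r == 0.
        pool = sorted(by_priority[priority], key=lambda rec: rec[1] != 0)
        while pool:
            total = sum(rec[1] for rec in pool)
            if total == 0:
                index = rng.randrange(len(pool))  # uniform choice
            else:
                r = rng.randint(0, total)  # inclusive on both ends
                running = 0
                for index, rec in enumerate(pool):
                    running += rec[1]
                    if running >= r:
                        break
            _priority, _weight, port, target = pool.pop(index)
            yield target, port


def connect_first(records, try_connect, rng=random):
    """Return the first (target, port) for which try_connect succeeds, else None."""
    for target, port in order_srv_targets(records, rng):
        if try_connect(target, port):
            return target, port
    return None
```

Passing a seeded random.Random instance as rng makes the ordering reproducible in tests, and try_connect can be any callable that returns True on a successful connection.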

Architectural Patterns

Internal DNS with Short TTLs

diagram-2 — Authoritative updates and heterogeneous cache convergence with short TTLs

The most common pattern for DNS-based service discovery in microservice architectures uses an internal authoritative DNS server that is updated by the orchestration layer.
Kubernetes, for example, runs CoreDNS as a cluster add-on that watches the Kubernetes API and creates DNS records for Services and Pods.

When a Deployment scales or a Pod is replaced, the DNS server updates its records.
With short TTLs (5 to 30 seconds) on those records, clients pick up changes relatively quickly.
The tradeoff is clear: shorter TTLs mean faster convergence at the cost of higher query volume.
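
A back-of-the-envelope estimate shows the shape of this tradeoff. The sketch below assumes a simplified steady state in which every caching resolver (or caching client) refreshes each name exactly once per TTL; the function name and the example figures are illustrative, not measurements.

```python
def authoritative_qps(num_caches: int, num_names: int, ttl_seconds: float) -> float:
    """Estimate queries per second reaching the authoritative server.

    Assumes each cache re-queries every name once per TTL (steady state).
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return num_caches * num_names / ttl_seconds


# 1,000 caches each tracking 50 service names:
print(authoritative_qps(1000, 50, 30))  # 30-second TTL
print(authoritative_qps(1000, 50, 5))   # 5-second TTL, six times the load
```

Under these assumptions, query volume scales inversely with the TTL, so cutting the TTL from 30 seconds to 5 multiplies authoritative load by six.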

Anycast and GeoDNS

For geographically distributed services, anycast DNS allows the same IP address to be announced from multiple locations via BGP.
Clients are routed to the nearest (in network terms) instance.
This is how many large CDNs and public DNS resolvers (such as 1.1.1.1 and 8.8.8.8) operate. Anycast is often combined with GeoDNS, which returns different records based on the client's inferred geographic location.
These approaches provide latency-based routing without client-side logic, but anycast failover depends on BGP convergence times, which can range from seconds to minutes.

Hybrid Approaches

Many production systems combine DNS with a more dynamic service mesh or load balancer layer.
DNS provides the initial endpoint (e.g., resolving to a load balancer's VIP), while the load balancer handles fine-grained health checking, circuit breaking, and traffic shifting.
This avoids pushing all dynamism into the DNS layer while still leveraging its universality for initial bootstrap.

Limitations and Tradeoffs

Caching and staleness. DNS caching is pervasive and not always well-behaved.
Some resolvers ignore TTLs or enforce minimum cache times.
Java's InetAddress cache, for instance, caches successful lookups forever by default when a security manager is installed and for an implementation-specific period otherwise; the networkaddress.cache.ttl security property overrides this, and application-level DNS caching behavior should always be verified for the specific runtime version in use.
Stale records pointing to dead instances cause connection failures or misdirected traffic.

No health checking. DNS is a passive system.
The authoritative server does not know whether a registered instance is actually healthy unless an external system performs health checks and updates the records accordingly.
This is a significant gap compared to systems like Consul, where health checks are a first-class concept integrated with the service registry.
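
A minimal sketch of such an external checker, in Python, might look like the following. It assumes that a successful TCP connection is an adequate health signal, which is often too weak in practice, and it leaves out the step that would actually push the surviving set to the DNS server. Both function names are hypothetical.

```python
import socket


def is_healthy(host: str, port: int, timeout: float = 1.0) -> bool:
    """Treat an instance as healthy if a TCP connection can be opened to it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_instances(instances, check=is_healthy):
    """Return the (host, port) pairs that pass the check.

    An updater would publish only these instances in DNS and withdraw the rest.
    """
    return [(host, port) for host, port in instances if check(host, port)]
```

The check parameter exists so that a stricter probe, such as an HTTP request to a health endpoint, can be substituted without changing the filtering logic.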

Propagation delay. Even with low TTLs, there is an inherent delay between an instance becoming unavailable and all clients learning about it.
With well-behaved caches the worst case is approximately equal to the TTL value, but in practice it can be longer when resolvers enforce minimum cache times or applications cache results independently.

Limited metadata. While TXT records can carry key-value metadata, the DNS protocol imposes a 512-byte UDP message limit without EDNS0 (RFC 6891).
With EDNS0, responses can be larger, up to the negotiated buffer size, but DNS remains a poor transport for complex service metadata, routing rules, or configuration data, which are better served by a dedicated registry.
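
To get a feel for how quickly metadata consumes that budget, the Python sketch below encodes key-value pairs the way TXT record data is laid out on the wire: each string is prefixed by a single length byte and can hold at most 255 bytes. The function name is hypothetical, and the size it reports covers only the record data, not the message header, question, or record name, so the room actually available is smaller than 512 bytes.

```python
CLASSIC_UDP_LIMIT = 512  # bytes for a whole DNS message without EDNS0


def txt_rdata(pairs: dict[str, str]) -> bytes:
    """Encode key=value pairs as TXT record data (length-prefixed strings)."""
    out = bytearray()
    for key, value in pairs.items():
        item = f"{key}={value}".encode("utf-8")
        if len(item) > 255:
            raise ValueError(f"TXT string for {key!r} exceeds 255 bytes")
        out.append(len(item))  # one length byte per string
        out += item
    return bytes(out)


metadata = {f"k{i}": "x" * 40 for i in range(12)}
size = len(txt_rdata(metadata))
print(size, size > CLASSIC_UDP_LIMIT)
```

In this example, a dozen modest 40-character values already exceed the classic limit before any other part of the message is counted.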

Load distribution is coarse. DNS round-robin distributes queries, not connections or requests.
A client that resolves once and holds a persistent connection will send all traffic to a single backend.
Combined with uneven TTL expiration across clients, this can produce significant load imbalance.
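
A small simulation makes the effect visible. This Python sketch assumes each client resolves once, picks a single backend at random, and sends all of its traffic there; the function name and the request rates are invented for illustration.

```python
import random
from collections import Counter


def simulate_sticky_clients(request_rates, backends, rng=random):
    """Each client picks one backend and sends all of its requests there.

    `request_rates` holds one requests-per-second figure per client.
    Returns a Counter mapping backend -> total requests per second.
    """
    load = Counter({backend: 0 for backend in backends})
    for rate in request_rates:
        load[rng.choice(backends)] += rate
    return load


backends = ["10.0.1.5", "10.0.1.6", "10.0.1.7"]
# One heavy client and three light ones.
print(simulate_sticky_clients([100, 5, 5, 5], backends, random.Random(1)))
```

Whichever backend the heavy client lands on carries at least 100 of the 115 requests per second, even though the clients themselves are spread evenly in expectation.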

When to Use DNS-Based Service Discovery

DNS-based service discovery works best in environments where simplicity and broad compatibility are prioritized, where the service topology changes infrequently (on the order of minutes, not seconds), or where DNS is already the established resolution mechanism (as in Kubernetes).
It is a poor fit for scenarios requiring sub-second failover, fine-grained load balancing, or rich metadata exchange between services.

In practice, most large-scale systems use DNS as one layer in a multi-tier discovery and routing stack.
DNS handles the coarse resolution, while application-level mechanisms or sidecar proxies handle the rest.

Key Points

DNS-based service discovery leverages a universally supported protocol, eliminating the need for specialized client libraries or agents.
SRV records (RFC 2782) extend basic DNS resolution with port, priority, and weight information, enabling structured service location.
DNS-SD (RFC 6763) adds browsing and metadata capabilities on top of SRV and TXT records for a complete discovery protocol.
Caching is both DNS's greatest operational strength and its most significant liability for service discovery, since stale records cause misdirected traffic.
DNS has no native health checking; an external system must monitor instance health and update records accordingly.
Short TTLs (5 to 30 seconds) improve convergence speed but increase query volume and do not eliminate the staleness window.
Most production architectures use DNS as a coarse discovery layer combined with load balancers or service meshes for fine-grained routing and failover.

References

P. Mockapetris. "Domain Names - Implementation and Specification." RFC 1035, Internet Engineering Task Force, November 1987.

A. Gulbrandsen, P. Vixie, L. Esibov. "A DNS RR for specifying the location of services (DNS SRV)." RFC 2782, Internet Engineering Task Force, February 2000.

S. Cheshire, M. Krochmal. "DNS-Based Service Discovery." RFC 6763, Internet Engineering Task Force, February 2013.

J. Damas, M. Graff, P. Vixie. "Extension Mechanisms for DNS (EDNS(0))." RFC 6891, Internet Engineering Task Force, April 2013.