Diagnosing DNS Issues Within Azure Pods: A Practical Troubleshooting Guide

This article provides a practical guide for troubleshooting Domain Name System (DNS) resolution failures specifically originating from inside Kubernetes pods within a Microsoft Azure Kubernetes Service (AKS) cluster. These issues manifest when outbound connections from a pod fail due to name resolution problems, even though DNS might be functioning correctly from the worker node itself. Understanding the DNS flow within AKS is crucial for effectively diagnosing these types of problems. We will cover the necessary tools and a step-by-step process to pinpoint the root cause of the failure.

Prerequisites

Before beginning the troubleshooting process, ensure you have the necessary tools installed and configured to interact with your AKS cluster and its underlying nodes.

  • Kubernetes kubectl tool: This command-line tool is essential for interacting with the Kubernetes API server. You will use kubectl to check the status of pods and services and to deploy temporary resources for testing. If you don’t have it, you can install it through the Azure CLI by running the az aks install-cli command.
  • apt-get command-line tool: While troubleshooting from within a test pod (often based on Debian/Ubuntu images), apt-get is used to install necessary utilities like DNS lookup tools. Ensure the image you choose for your test pod includes this package manager.
  • host command-line tool: The host utility is a simple and effective tool for performing DNS lookups. It’s part of the dnsutils package. Using host from within the pod helps confirm whether DNS queries are reaching the expected nameservers and if they are being resolved correctly.
  • systemctl command-line tool: This tool is part of the systemd init system, commonly found on Linux distributions used for AKS nodes. You might need systemctl to manage services, such as restarting the network service on a node, although direct node access should be used cautiously and typically as a last resort or for specific diagnostic steps.

Having these tools readily available will significantly streamline the troubleshooting workflow and allow you to quickly gather the necessary information from different points in the DNS resolution path.
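
For example, a minimal setup check, with the resource group and cluster name as placeholders for your own environment, might look like this:

    az aks install-cli                                                      # installs kubectl (and kubelogin)
    az aks get-credentials --resource-group <resource-group> --name <cluster-name>
    kubectl get nodes                                                       # confirms that kubectl can reach the cluster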

Background

To understand DNS resolution issues in AKS pods, it’s important to know the typical flow of a DNS request. When a pod needs to resolve a domain name (either internal or external), it sends a DNS query. This query is directed to the IP address specified as the nameserver in the pod’s /etc/resolv.conf file.

In a standard AKS setup, the resolv.conf inside a pod points to the cluster’s internal DNS service, which is handled by CoreDNS pods running in the kube-system namespace. These CoreDNS pods act as the central point for all DNS queries originating from within the cluster.

If the DNS query is for a service or pod within the same Kubernetes cluster (e.g., my-service.my-namespace.svc.cluster.local), the CoreDNS pod handles the resolution internally using Kubernetes service discovery mechanisms. However, if the request is for an external domain name (e.g., microsoft.com), the CoreDNS pod acts as a forwarder and sends the request to an upstream DNS server that can resolve names outside the cluster.

The upstream DNS servers that CoreDNS uses are typically derived from the /etc/resolv.conf file on the worker node where the CoreDNS pod is running. On systemd-based nodes, this file is often located at /run/systemd/resolve/resolv.conf, and it is populated from the DNS settings configured on the virtual network that the AKS cluster uses.

The path for an external DNS query is therefore: Pod -> CoreDNS -> worker node resolv.conf -> upstream DNS server (from the VNet settings). A failure at any point along this path can cause DNS resolution issues within the pod.
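
To see both ends of this path, you can compare the pod’s resolv.conf with the node’s. This is a minimal sketch; the pod, namespace, and node names are placeholders, and kubectl debug requires a reasonably recent kubectl:

    # The pod's view: the nameserver should be the cluster DNS service IP (CoreDNS).
    kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf

    # The node's view: the upstream servers CoreDNS forwards to. kubectl debug mounts the
    # node's filesystem under /host inside the debugging container.
    kubectl debug node/<node-name> -it --image=busybox -- cat /host/run/systemd/resolve/resolv.conf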

Troubleshooting Checklist

Follow these steps systematically to diagnose DNS issues starting from within the affected pod and moving outwards towards the cluster’s infrastructure and upstream DNS servers.

Step 1: Troubleshoot DNS Issues From Within the Cluster

Begin by checking the health and status of the CoreDNS deployment within your AKS cluster; issues here can affect every pod in the cluster. A consolidated sketch that combines the checks below appears after this list.

  1. Verify that the CoreDNS pods are running:
    Use kubectl to list the CoreDNS pods in the kube-system namespace. All pods labeled k8s-app=kube-dns should be in a Running state and have healthy restart counts.
    kubectl get pods -l k8s-app=kube-dns -n kube-system
    

    Look for any pods that are crashing, in a pending state, or have an excessive number of restarts. These can indicate configuration errors or resource constraints preventing them from functioning correctly.
  2. Check whether the CoreDNS pods are overused:
    High CPU or memory usage on CoreDNS pods can lead to slow responses or dropped requests, causing DNS resolution failures for other pods. Use kubectl top to check their resource consumption.
    $ kubectl top pods -n kube-system -l k8s-app=kube-dns
    NAME                      CPU(cores)   MEMORY(bytes)
    coredns-dc97c5f55-424f7   3m           23Mi
    coredns-dc97c5f55-wbh4q   3m           25Mi
    

    Compare these values to their resource requests and limits (if defined). Consistently high usage might require scaling up the CoreDNS deployment or increasing resource allocations.
  3. Verify that the nodes that host the CoreDNS pods aren’t overused:
    The performance of CoreDNS pods is also dependent on the health and resources of the underlying worker nodes they are scheduled on. First, identify the nodes hosting the CoreDNS pods.
    kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'
    

    This command outputs the names of the nodes. Note these node names for the next step.
  4. Check the usage of these nodes:
    Now, use kubectl top nodes to inspect the CPU and memory usage of the nodes identified in the previous step.
    kubectl top nodes
    

    If these nodes are experiencing high resource utilization, it could be impacting the performance and responsiveness of the CoreDNS pods running on them. Consider scaling the node pool or identifying other workloads consuming excessive resources on these nodes.
  5. Verify the logs for the CoreDNS pods:
    CoreDNS logs often contain valuable information about resolution failures, errors connecting to upstream servers, or configuration issues.
    kubectl logs -l k8s-app=kube-dns -n kube-system --tail=100
    

    Examine the logs for errors like connection timed out, cannot connect to, or configuration parsing errors. These messages can directly point to problems with upstream server reachability or CoreDNS configuration. Increase the --tail value or remove it to see more logs if needed.
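
The checks above can be combined into a single sweep. This is a sketch rather than an exhaustive health check; adjust the label selector and the log filter to your cluster:

    #!/usr/bin/env bash
    # Quick CoreDNS health sweep for an AKS cluster (sketch).
    set -euo pipefail

    NS=kube-system
    SELECTOR=k8s-app=kube-dns

    echo "== CoreDNS pod status =="
    kubectl get pods -n "$NS" -l "$SELECTOR" -o wide

    echo "== CoreDNS pod resource usage =="
    kubectl top pods -n "$NS" -l "$SELECTOR"

    echo "== Usage of the nodes that host CoreDNS =="
    for node in $(kubectl get pods -n "$NS" -l "$SELECTOR" -o jsonpath='{.items[*].spec.nodeName}'); do
      kubectl top node "$node"
    done

    echo "== Recent CoreDNS errors (if any) =="
    kubectl logs -n "$NS" -l "$SELECTOR" --tail=200 | grep -iE 'error|timeout|refused|servfail' || true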

Step 2: Create a Test Pod to Run Commands

To isolate the issue and test DNS resolution from a controlled environment identical to a regular workload pod, create a temporary test pod. This allows you to use standard network debugging tools directly within the pod’s network namespace.

  1. Run a test pod in the same namespace as the problematic pod:
    Using the same namespace is important because the DNS search domains in /etc/resolv.conf are configured based on the pod’s namespace (e.g., my-namespace.svc.cluster.local, svc.cluster.local, cluster.local). This ensures that the test environment accurately mimics the original pod’s configuration.
  2. Start a test pod in the cluster:
    Use kubectl run to create a temporary pod. The --rm flag ensures the pod is automatically deleted after you exit the shell, keeping your cluster clean. Choose a lightweight image with basic shell access like debian:stable or ubuntu:latest.
    kubectl run -it --rm aks-ssh --namespace <namespace> --image=debian:stable
    

    Replace <namespace> with the namespace of the pod experiencing DNS issues. This command creates the pod and immediately gives you an interactive terminal session inside it.
  3. Run the following commands to install the required packages:
    Once inside the test pod’s shell, you need to install network diagnostic tools.
    apt-get update -y
    apt-get install dnsutils -y
    

    dnsutils includes tools like host, nslookup, and dig, which are indispensable for diagnosing DNS issues. The -y flag automatically confirms the installation.
  4. Verify that the resolv.conf file has the correct entries:
    Inspect the contents of /etc/resolv.conf inside the pod. This file dictates how the pod’s applications perform DNS lookups.
    cat /etc/resolv.conf
    search default.svc.cluster.local svc.cluster.local cluster.local 00idcnmrrm4edot5s2or1onxsc.bx.internal.cloudapp.net
    nameserver 10.0.0.10
    options ndots:5
    

    The nameserver IP address (10.0.0.10 in this example, but it varies per cluster) is the cluster’s internal DNS service endpoint, usually a service that routes traffic to the CoreDNS pods. The search path defines the domains appended to unqualified names during lookup. The ndots option sets how many dots a name must contain before it is tried as an absolute (fully qualified) name first; names with fewer dots have the search suffixes appended first. Confirm that the nameserver IP matches your cluster’s DNS service IP and that the search path looks correct for your environment.
  5. Use the host command to determine whether the DNS requests are being routed to the upstream server:
    The host command can show the lookup process, including the servers being queried. Test resolving an external domain name.
    $ host -a microsoft.com
    Trying "microsoft.com.default.svc.cluster.local"
    Trying "microsoft.com.svc.cluster.local"
    Trying "microsoft.com.cluster.local"
    Trying "microsoft.com.00idcnmrrm4edot5s2or1onxsc.bx.internal.cloudapp.net"
    Trying "microsoft.com"
    Trying "microsoft.com"
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62884
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 27, AUTHORITY: 0, ADDITIONAL: 5
    
    ;; QUESTION SECTION:
    ;microsoft.com.                 IN      ANY
    
    ;; ANSWER SECTION:
    microsoft.com.          30      IN      NS      ns1-39.azure-dns.com.
    ...
    ...
    ns4-39.azure-dns.info.  30      IN      A       13.107.206.39
    
    Received 2121 bytes from 10.0.0.10#53 in 232 ms
    

    This output shows the search domains being tried and the final successful resolution. Crucially, it indicates the IP address from which the response was received (10.0.0.10#53), confirming that the request went to the cluster’s internal DNS service (CoreDNS). If this step fails or times out, the issue lies between the pod and CoreDNS.
  6. Check the upstream DNS server from the pod:
    If the lookup via the internal nameserver (10.0.0.10) fails for external domains, attempt to directly query the upstream DNS server(s) that your CoreDNS is supposed to be using. In Azure, a common upstream server is Azure DNS (168.63.129.16). Use nslookup or dig for this, specifying the server IP.
    $ nslookup microsoft.com 168.63.129.16
    Server:         168.63.129.16
    Address:        168.63.129.16#53
    ...
    ...
    Address: 20.81.111.85
    

    If this direct query to the upstream server (like 168.63.129.16) succeeds, but the lookup via the internal CoreDNS service IP fails (Step 5), the problem is with CoreDNS itself or with the path from CoreDNS to the upstream server, rather than with the upstream server being unreachable from the node. If the direct query also fails, the upstream server might be unreachable from the pod’s perspective (for example, because of network policies or routing issues) or from the node hosting the CoreDNS pod. A complementary check, sketched after this list, is whether internal cluster names still resolve while external names fail.
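
If external lookups fail, it is also worth confirming from the same test pod whether internal (cluster) names still resolve. The first lookup below uses kubernetes.default, which exists in every cluster; the second uses a hypothetical service name in your namespace:

    # An internal name that exists in every cluster:
    nslookup kubernetes.default.svc.cluster.local

    # A service in the same namespace (replace with a real service name, if one exists):
    nslookup <my-service>.<namespace>.svc.cluster.local

    # If these succeed while external names fail, CoreDNS is reachable and resolving cluster
    # names, and the problem is likely on the path from CoreDNS to the upstream servers.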

Step 3: Check Whether DNS Requests Work When the Upstream DNS Server is Explicitly Specified

If the direct query to the upstream DNS server (e.g., nslookup example.com 168.63.129.16) from the test pod succeeds, but a regular lookup (host example.com or nslookup example.com) pointing to the cluster’s internal DNS service IP (10.0.0.10) fails, it indicates a problem within the cluster’s DNS forwarding path or configuration. Verify the following conditions:

  1. Check for a custom ConfigMap for CoreDNS:
    AKS allows customizing CoreDNS behavior using a ConfigMap named coredns-custom. Check if this ConfigMap exists and inspect its contents.
    kubectl describe cm coredns-custom -n kube-system
    

    A misconfiguration in coredns-custom, such as incorrect forwarder addresses, invalid syntax, or conflicting rules, can disrupt DNS resolution. Refer to the official AKS documentation on customizing CoreDNS to validate the configuration. Incorrect settings here are a common source of problems when default resolution fails.
  2. Check whether a network policy is blocking traffic on User Datagram Protocol (UDP) port 53 to the CoreDNS pods in the kube-system namespace.
    Kubernetes Network Policies can restrict network traffic between pods. A policy might inadvertently block UDP traffic on port 53 (the standard DNS port) from your application pods (or the test pod) to the CoreDNS pods in the kube-system namespace. Review any active Network Policies, especially those targeting your application’s namespace or the kube-system namespace, to ensure they permit DNS traffic.
  3. Check whether the CoreDNS pods are on a different node pool (System node pool).
    By default, CoreDNS runs on the system node pool. If you have Network Security Groups (NSGs) applied to the Virtual Network subnet(s) used by your node pools, an NSG could be blocking traffic on UDP port 53 between the subnet hosting your application pods and the subnet hosting the system node pool where CoreDNS resides. Verify the NSG rules associated with both subnets.
  4. Check whether the virtual network was updated recently to add new DNS servers.
    Changes to the DNS server configuration at the Azure Virtual Network level are not picked up immediately by existing AKS nodes and CoreDNS pods. For the changes to take effect, the network configuration on the nodes must be reloaded, and the CoreDNS pods must then pick up the new settings or be restarted.
    If a virtual network DNS update occurred, check whether one of the following events has also occurred:
    • The nodes were restarted.
    • The network service in the node was restarted.
    • The CoreDNS pods were restarted or the deployment was rolled out.
      For the updated VNet DNS settings to be reflected in the node’s /etc/resolv.conf (which CoreDNS uses as upstream), and subsequently used by CoreDNS, a refresh is needed. The simplest way to guarantee the nodes pick up new VNet DNS settings is to reboot the nodes or scale the node pool (new nodes get the fresh config). Alternatively, you can attempt to restart the network service on the nodes and then restart CoreDNS.
      To restart the network service or the pods, use one of the following methods:
    • Restart the node: This is often the most reliable method because it ensures a clean state for the node’s networking configuration. Cordon and drain the node with kubectl, restart the underlying VM or virtual machine scale set instance from the Azure portal, and then uncordon the node once it’s back online.
    • Scale new nodes: Increase the node count of the node pool. New nodes will be provisioned with the current VNet DNS settings. You can then scale down the old nodes.
    • Restart the network service in the nodes, and then restart the CoreDNS pods: This method is more involved and requires SSH access to the nodes.
      1. Make a Secure Shell (SSH) connection to the nodes identified as hosting CoreDNS pods (or ideally, all nodes in the system pool). For more information, see Azure documentation on connecting to AKS nodes.
      2. From within the node, restart the network service responsible for processing VNet DNS settings. On many Linux distributions used by AKS, this is systemd-networkd.
        sudo systemctl restart systemd-networkd
        

        Note the use of sudo as this requires elevated privileges.
      3. Check whether the settings are updated in the node’s resolv.conf file.
        cat /run/systemd/resolve/resolv.conf
        

        Verify that this file now lists the correct, updated DNS servers configured in the Azure VNet.
      4. After confirming the node has the updated configuration, use kubectl from your local machine to restart the CoreDNS pods. This forces them to re-read the node’s resolv.conf.
        kubectl delete pods -l k8s-app=kube-dns -n kube-system
        

        Kubernetes will automatically create new CoreDNS pods to replace the deleted ones.
  5. Check whether more than one DNS server is specified in the virtual network DNS settings.
    This is a common cause of intermittent or confusing DNS issues that affect pods but not the nodes themselves. If multiple DNS servers are configured in the VNet settings, resolution behavior can differ depending on whether the query originates directly from the node’s OS or is forwarded by CoreDNS.

    • Node Resolution: The operating system’s resolver library (/etc/resolv.conf) typically queries the listed nameservers sequentially. It tries the first server, and if it times out or is unreachable, it tries the second, and so on. This means the node will reliably use the first responsive DNS server listed in its /run/systemd/resolve/resolv.conf.
    • CoreDNS Forwarding: CoreDNS uses its forward plugin to send external queries to upstream servers. By default, the forward plugin uses a random policy to select the upstream server from the list obtained from the node’s resolv.conf. This means each individual DNS query originating from a pod (and forwarded by CoreDNS) could potentially go to any of the configured upstream servers, chosen randomly.

    $ kubectl describe cm coredns -n kube-system
    ...
    Data
    ====
    Corefile:
    ----
    .:53 {
        errors
        ready
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf                            # CoreDNS forwards external queries to the servers in this file (default selection policy: random)
        cache 30
        loop
        reload
        loadbalance                                           # randomizes the order of A, AAAA, and MX records in answers
        import custom/*.override
    }
    import custom/*.server
    ...
    

    The forward plugin’s policy setting dictates how upstream servers are chosen. While random is the default, other policies like round_robin or sequential can be configured (often via the coredns-custom ConfigMap).

    Policy Name    Description
    random         Selects an upstream server randomly for each query (default).
    round_robin    Selects upstream servers in a rotating sequence.
    sequential     Selects upstream servers in the order they appear in resolv.conf.

    This difference in behavior (the node using sequential retry versus CoreDNS using random selection by default) is key when multiple upstream DNS servers are configured. The node might consistently use the first server, while CoreDNS might randomly pick a server that cannot resolve the required name, leading to failures for pods even when node resolution works. A quick way to test each configured upstream directly from the test pod is sketched after this list.
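
One way to narrow this down is to query each configured upstream directly from the test pod and compare the answers. This is a sketch; 10.1.0.4 and 10.1.0.5 stand in for your custom DNS servers, so substitute the servers actually listed in the node’s /run/systemd/resolve/resolv.conf:

    # Query each upstream directly and show only the status and responding server.
    for srv in 10.1.0.4 10.1.0.5 168.63.129.16; do
      echo "== $srv =="
      dig +time=2 +tries=1 microsoft.com @"$srv" | grep -E 'status:|SERVER:'
    done

    # A server that answers public names with "status: NOERROR" but fails (SERVFAIL, REFUSED,
    # or a timeout) for your internal names, or the other way around, is a likely source of
    # the intermittent failures described below.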

Cause: Multiple Destinations for DNS Requests

A common scenario leading to these issues occurs when the Azure Virtual Network DNS settings contain a mix of custom DNS servers and Azure DNS (168.63.129.16), for example a VNet whose DNS settings list Custom_DNS_Server_1, Custom_DNS_Server_2, and then 168.63.129.16.

When an AKS node needs to resolve a name, its OS resolver reads /run/systemd/resolve/resolv.conf. It will typically try Custom_DNS_Server_1 first. If that server is running and reachable, the node will use it successfully. This works well if Custom_DNS_Server_1 can resolve all necessary names, including internal corporate domains.

However, when a pod attempts to resolve an external domain name, the request goes to CoreDNS. CoreDNS reads the same /run/systemd/resolve/resolv.conf on its node to find upstream servers. With the default random forward policy, CoreDNS might randomly select Custom_DNS_Server_1, Custom_DNS_Server_2, or 168.63.129.16 for any given query. If Custom_DNS_Server_1 and Custom_DNS_Server_2 are only aware of internal domains and cannot resolve public internet names, or if 168.63.129.16 cannot resolve your internal custom domains, then queries randomly directed to the “wrong” server by CoreDNS will fail. This explains why resolution works from the node (which consistently uses the first capable server) but fails intermittently or consistently from the pod (which uses CoreDNS’s randomly selected server).

The conflict arises because CoreDNS treats all listed upstream servers equally and distributes queries among them according to its policy, whereas the node’s OS resolver uses a strict sequential fallback.
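
You can often observe this directly. The following rough sketch repeats a lookup against the cluster DNS service IP (10.0.0.10, as in the earlier example) for a name that only some of the upstream servers can resolve; internal.contoso.example is a hypothetical internal domain, and the hostname is varied on each attempt so that CoreDNS’s cache doesn’t mask the behavior:

    # With the default random policy, the status line typically alternates between
    # NXDOMAIN/NOERROR (a capable upstream was picked) and SERVFAIL or a timeout
    # (an incapable or unreachable upstream was picked).
    for i in $(seq 1 10); do
      dig +time=2 +tries=1 "host$i.internal.contoso.example" @10.0.0.10 | grep 'status:'
    done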

Solution: Remove Azure DNS from Virtual Network Settings

The recommended approach to avoid conflicts when using custom DNS servers with AKS is to configure DNS resolution hierarchically. Instead of listing both custom DNS servers and Azure DNS in the Azure Virtual Network settings, list only your custom DNS servers in the VNet settings.

Then, configure your custom DNS servers to use Azure DNS (168.63.129.16) as a forwarder for any queries they cannot resolve internally (e.g., public internet domains). This ensures that all DNS queries from the AKS VNet (both node-level and CoreDNS-forwarded) consistently go first to your reliable custom DNS servers. If a custom server cannot resolve a name, it will then forward the query to Azure DNS, which can resolve public names. This creates a predictable and unified resolution path.

By removing 168.63.129.16 directly from the VNet list, you prevent CoreDNS from randomly selecting it for internal domain lookups (which would fail) and ensure that all external lookups are correctly forwarded through your custom servers first, leveraging their forwarding configuration.
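
As a hedged example, the VNet DNS list can be updated with the Azure CLI; the resource group, VNet name, and IP addresses below are placeholders, and the nodes still need to be restarted or reimaged afterwards (as described in Step 3) before they pick up the new settings:

    # List only the custom DNS servers on the VNet; 168.63.129.16 is intentionally omitted.
    az network vnet update \
      --resource-group <resource-group> \
      --name <vnet-name> \
      --dns-servers 10.1.0.4 10.1.0.5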

For more detailed guidance on setting up name resolution using your own DNS server in Azure Virtual Networks, consult the official Microsoft documentation on name resolution for VMs and role instances, specifically the section on using your own DNS server. This setup guarantees that all DNS resolution within the VNet, including from AKS nodes and pods, follows the path you define through your managed DNS infrastructure, only leveraging Azure DNS as an external forwarder when explicitly configured to do so by your servers.

We hope this guide helps you effectively diagnose and resolve DNS resolution issues within your Azure AKS pods. Understanding the flow from the pod through CoreDNS to the upstream servers is key to identifying where the breakdown is occurring.

Do you have any further questions or scenarios you’d like to discuss regarding DNS troubleshooting in AKS? Please feel free to leave a comment below.
