42.8 Networking Debugging: DNS, Service, and Network Policy Issues

Alright, let’s get our hands dirty. Networking in Kubernetes is where the rubber meets the road, and where things often go spectacularly, head-scratchingly wrong. It’s a complex beast, but we can tame it by breaking it down into its core components: DNS, Services, and Network Policies. Forget the marketing fluff; we’re going to talk about what actually happens on the wire.

The First Command: `nslookup` is Your Best Friend

When a pod can’t talk to another pod via its service name, your very first move shouldn’t be to panic. It should be to drop into a shell on a pod and run nslookup. This humble tool will tell you if CoreDNS (or whatever DNS server you’re running) is even responding and if it can resolve the service name to a ClusterIP.

# Get a shell on a pod. Any pod will do, but one in the same namespace is best.
kubectl exec -it my-app-pod -- /bin/sh

# Now, inside the pod, try to resolve your service
nslookup my-svc.my-namespace.svc.cluster.local

If that fails, you’ve got a DNS problem. The most common culprits are:

The CoreDNS pods are down. Check them with kubectl get pods -n kube-system -l k8s-app=kube-dns.
The pod’s /etc/resolv.conf is wrong. It should point to the ClusterIP of the kube-dns service. Check it with cat /etc/resolv.conf. It should look something like:
```
nameserver 10.96.0.10
search my-namespace.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```
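That ndots:5 option is a classic foot-gun. It means any name with fewer than five dots is tried against the search domains first. So a lookup for my-svc (no dots) will try my-svc.my-namespace.svc.cluster.local, then my-svc.svc.cluster.local, then my-svc.cluster.local before finally trying just my-svc. This is mostly fine for in-cluster names, but an external name like www.example.com (two dots) goes through the same search list first, which costs several failed lookups before the real one and can cause surprising delays. It’s a design choice you mostly have to live with, although writing the name with a trailing dot (www.example.com.) marks it as fully qualified and skips the search list.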
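
To make the lookup order concrete, here is a minimal sketch of the search-list expansion in plain POSIX shell. The function name candidates is made up for illustration, and the logic only approximates what a glibc-style resolver does; other resolvers may differ in the details. It does no DNS queries at all; it just prints the names that would be tried, in order.

```shell
# candidates NAME NDOTS SEARCH_DOMAIN...
# Hypothetical helper: prints, one per line, the names a glibc-style
# resolver would try for NAME, in the order it would try them.
candidates() {
  name=$1
  ndots=$2
  shift 2                      # the remaining arguments are the search domains

  # A trailing dot marks the name as fully qualified: no search list at all.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # Count the dots by deleting every character that is not a dot.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}

  if [ "$dots" -ge "$ndots" ]; then
    # Enough dots: the name is tried as-is first, then with each search domain.
    printf '%s\n' "$name"
    for domain in "$@"; do printf '%s.%s\n' "$name" "$domain"; done
  else
    # Too few dots: every search domain is tried first, the bare name last.
    for domain in "$@"; do printf '%s.%s\n' "$name" "$domain"; done
    printf '%s\n' "$name"
  fi
}
```

Running candidates www.example.com 5 my-namespace.svc.cluster.local svc.cluster.local cluster.local prints three cluster-suffixed names before www.example.com itself, which is exactly the wasted work behind slow external lookups. With a trailing dot, the same call prints only www.example.com.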

When Services Lie (Well, Don’t Work as Expected)

So nslookup works, but your connection is still timing out or failing. The service exists, but it’s not routing traffic. Time to debug the service itself. A service is just an abstraction; under the hood it’s typically a set of iptables or IPVS rules that kube-proxy programs on each node to do the load balancing. Let’s dissect it.

First, describe the service. Is the Endpoints list populated?

kubectl describe svc my-broken-service

Look at the Endpoints: line. If it’s empty, that’s your problem: either the service’s selector doesn’t match any pods, or the pods it matches aren’t Ready. Check your labels first. It’s always the labels. If the Endpoints are there, the next step is to see if the service’s ClusterIP is being routed correctly. The best way to test this is from a pod on the same node as one of the backends, which takes cross-node CNI (Container Network Interface) traffic out of the equation for a moment.
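
For the empty-Endpoints case, it can help to put the selector and the pod labels side by side. The helper below is a minimal sketch: the function name show_selector_mismatch is made up for illustration, and it assumes kubectl is already pointed at the right cluster. The kubectl subcommands and flags themselves are standard.

```shell
# show_selector_mismatch SERVICE NAMESPACE
# Hypothetical helper: prints what the service is looking for, what the
# pods actually carry, and which endpoints resulted from the match.
show_selector_mismatch() {
  svc=$1
  ns=$2

  echo "--- selector of service $svc"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  echo

  echo "--- pods in $ns and their labels"
  kubectl get pods -n "$ns" --show-labels

  echo "--- endpoints of service $svc"
  kubectl get endpoints "$svc" -n "$ns"
}
```

Call it as show_selector_mismatch my-broken-service my-namespace. Every key/value pair in the selector has to appear on a pod for that pod to be picked up, so a single typo in one label is enough to leave the list empty. Once the endpoints look right, move on to the ClusterIP test.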

# Get the ClusterIP of your service
kubectl get svc my-broken-service -o wide

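# Now, from a shell on a pod, try to curl the ClusterIP and port *directly*.
# The address below is only an example; use the CLUSTER-IP and PORT(S) values from the output above.
curl -m 5 http://10.96.45.67:8080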

If that works, but using the service name doesn’t, you’ve got a DNS issue masquerading as a network issue. If it doesn’t work, you’ve got a real network or service routing issue. The next step would be to check the iptables rules on the node itself, but that’s a deep, dark rabbit hole. Often, it’s easier to first check that kube-proxy and the CNI plugin (Calico, Cilium, Flannel, or similar) are healthy on that node (kubectl get pods -n kube-system -o wide).
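
On a busy cluster the kube-system listing is long, so it helps to narrow it to the node that hosts the pod you are debugging. The following is a sketch under a few assumptions: the function name system_pods_on_node_of is invented for this example, and it assumes your CNI runs in kube-system, which is common but not universal (some installs use their own namespace).

```shell
# system_pods_on_node_of POD NAMESPACE
# Hypothetical helper: finds the node that hosts POD, then lists the
# kube-system pods scheduled on that same node.
system_pods_on_node_of() {
  pod=$1
  ns=$2

  # .spec.nodeName is empty until the scheduler has placed the pod.
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  if [ -z "$node" ]; then
    echo "pod $pod is not scheduled on any node yet" >&2
    return 1
  fi

  # The spec.nodeName field selector restricts the listing to that node.
  kubectl get pods -n kube-system -o wide --field-selector "spec.nodeName=$node"
}
```

For example, system_pods_on_node_of my-app-pod my-namespace. Anything in that list that is not Running, or that shows a climbing restart count, is worth a look before you go anywhere near iptables.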

The Network Policy Black Hole

You’ve confirmed DNS works, the service has endpoints, and the ClusterIP is reachable. But traffic is still being dropped. Welcome to the world of Network Policies. They are essentially firewall rules for your pods. The most common mistake is forgetting that the first policy to select a pod flips that pod to default-deny.

By default, pods accept traffic from anywhere. But as soon as any Network Policy with the Ingress policy type selects a pod, that pod is isolated for ingress: all inbound traffic to it is denied unless some policy explicitly allows it. Egress is isolated the same way, independently of ingress. This catches everyone out.

# This policy isolates ALL pods in the namespace "my-namespace"
# by creating a default-deny for all ingress traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-namespace
spec:
  podSelector: {} # An empty selector matches every pod in the namespace.
  policyTypes:
  - Ingress
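
Before writing new policies, it’s worth seeing which existing ones already select the pod you care about. This is a minimal sketch; the function name policies_for_pod is hypothetical, and matching labels against selectors is still left to your eyes.

```shell
# policies_for_pod POD NAMESPACE
# Hypothetical helper: prints the pod's labels next to the policies in its
# namespace so the two can be compared by eye.
policies_for_pod() {
  pod=$1
  ns=$2

  echo "--- labels on pod $pod"
  kubectl get pod "$pod" -n "$ns" --show-labels

  echo "--- network policies in $ns (compare the POD-SELECTOR column)"
  kubectl get networkpolicy -n "$ns"
}
```

Call it as policies_for_pod my-app-pod my-namespace. An empty selector in the second listing means the policy applies to every pod in the namespace, including yours.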

The fix is to create policies that explicitly allow the traffic you need. For example, to allow traffic from other pods in the same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: my-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {} # Allow from all pods in the same namespace

Always test network policies in a non-production namespace first. What an empty value means depends on where it appears, which is a common point of confusion and arguably a questionable choice by the API designers: podSelector: {} selects every pod in the namespace, while an empty or missing list of ingress rules allows nothing at all. It’s verbose, but it’s what we have to work with.
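
One way to test a policy is to probe the service from a throwaway pod in the namespace you expect to be allowed (or blocked). The sketch below makes some assumptions: the function name probe_from_namespace and the pod name netpol-probe are invented for this example, and it assumes the cluster can pull the busybox image.

```shell
# probe_from_namespace NAMESPACE URL
# Hypothetical helper: starts a one-off busybox pod in NAMESPACE, fetches
# URL once with a 3-second timeout, and deletes the pod afterwards.
probe_from_namespace() {
  ns=$1
  url=$2

  # --rm removes the pod when the command exits; -i keeps kubectl attached
  # so the output of wget is shown; --restart=Never makes it a bare pod.
  kubectl run netpol-probe --rm -i --restart=Never \
    --image=busybox -n "$ns" -- wget -q -O- -T 3 "$url"
}
```

For example, probe_from_namespace other-namespace http://my-svc.my-namespace.svc.cluster.local:8080. If the fetch succeeds from my-namespace but times out from other-namespace, a policy is the likely suspect rather than the service or DNS.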
