42.9 Kubernetes Events: Using kubectl get events Effectively

Right, let’s talk about events. If your cluster is a crime scene (and it often feels like one), kubectl get events is your first and best witness. It’s the system’s gossip column, a running log of the “who, what, and when” behind almost every state change. Ignore it, and you’re troubleshooting blind. Rely on it too much, and you’ll drown in a firehose of mostly-irrelevant data. Let’s learn how to drink from that firehose effectively.

42.8 Networking Debugging: DNS, Service, and Network Policy Issues

Alright, let’s get our hands dirty. Networking in Kubernetes is where the rubber meets the road, and where things often go spectacularly, head-scratchingly wrong. It’s a complex beast, but we can tame it by breaking it down into its core components: DNS, Services, and Network Policies. Forget the marketing fluff; we’re going to talk about what actually happens on the wire.

The First Command: nslookup is Your Best Friend

When a pod can’t talk to another pod via its service name, your very first move shouldn’t be to panic. It should be to drop into a shell on a pod and run nslookup. This humble tool will tell you if CoreDNS (or whatever DNS server you’re running) is even responding and if it can resolve the service name to a ClusterIP.

42.7 Control Plane Failures: API Server, etcd, Scheduler

Right, so your cluster has gone sideways. The apps are down, kubectl commands are timing out, and that little voice in your head is whispering, “Was it something I did?” Probably. But more likely, it’s the control plane throwing a tantrum. This isn’t your application code; this is the brain of your entire operation having a stroke. We need to triage the patient.

The control plane’s job is to maintain state. Its entire existence is a constant loop of “observe reality, compare to desired state, reconcile.” When it fails, that loop breaks. Your first clue is almost always the kubectl command hanging or spitting out a beautiful, utterly useless “The connection to the server <server-name:port> was refused - did you specify the right host or port?” Don’t panic. This just means the API server, the front door to everything, is closed for business.

42.6 Node NotReady: Common Causes and Remediation

Alright, let’s talk about a Node going into NotReady state. It’s Kubernetes’ way of telling you, “Hey, I’ve got a problem over here and I can’t schedule any more work on this server.” It’s not being lazy; it’s being honest. Your job is to figure out why. Think of the Kubelet on each node as a harried middle manager. Its sole job is to constantly report back to the Control Plane (Head Office) that its node (retail store) is open for business and has shelf space. The Node object is that status report. When the Kubelet stops sending good reports, or any reports at all, the Control Plane waits out a grace period of radio silence (well under a minute with default settings) and then marks the node as NotReady. It’s a safety mechanism. It’d rather stop sending you customers than send them to a store that might be on fire.

42.5 Debugging with Ephemeral Containers and kubectl debug

Right, so your pod is in a broken state. It’s either crashlooping, stuck in Pending, or just behaving in a way that makes absolutely no sense. Your first instinct is to kubectl exec into it to see what’s going on. But what if the container won’t start? You can’t exec into a container that isn’t running. This is the classic “my car won’t start, and I need to look under the hood but the hood is locked” scenario.

42.4 Exec Into a Running Container for Live Debugging

Right, so your pod is running, but it’s doing something deeply weird. Maybe it’s eating CPU like it’s at an all-you-can-eat buffet, or perhaps it’s just… not responding. The logs (kubectl logs) are useless, showing nothing but the digital equivalent of crickets chirping. This is where you stop looking at the autopsy report and start talking to the patient. You need to exec into the running container. Think of kubectl exec as your all-access backstage pass. It lets you open an interactive shell right inside the container, or run any one-off command you can dream up. It’s the difference between reading a log file and actually being there, poking around the filesystem, checking processes, and seeing what the application actually sees. It’s your primary tool for live debugging, and you should be deeply suspicious of anyone who tells you to debug a container without it.

42.3 Debugging with kubectl describe and kubectl logs

Right, so your pod is stuck in Pending or your application is coughing up an error. You’re not going to just stare at kubectl get pods and hope it magically starts working, are you? Of course not. You’re going to ask the cluster what on earth it’s thinking. Your two best friends for this are kubectl describe and kubectl logs. One tells you what the cluster thinks is happening to your pod, and the other tells you what’s actually happening inside it. Let’s break them down.

42.2 Pod Not Starting: Pending, CrashLoopBackOff, ImagePullBackOff

Alright, let’s get our hands dirty. Your pod isn’t starting. It’s just sitting there, mocking you with a status like Pending, CrashLoopBackOff, or ImagePullBackOff. This isn’t a failure; it’s the cluster’s way of sending you a strongly worded letter explaining exactly what you did wrong. Your job is to learn how to read it. First, the golden rule: always start with kubectl describe. Your kubectl get pods output is the headline; kubectl describe is the full investigative report. If you don’t do this first, I can’t help you. It’s like calling a mechanic and saying “my car is broken” but refusing to pop the hood.

42.1 Systematic Troubleshooting Methodology

Right, let’s get this sorted. You’re staring at a CrashLoopBackOff or some other Kubernetes-induced hieroglyphic, and the panic is starting to set in. Don’t. The single biggest mistake you can make is just frantically running kubectl describe on random things, hoping for a clue. That’s like trying to fix a car engine by randomly tapping components with a hammer. You might get lucky, but you’ll probably just make it worse.

72.9 faulthandler: Diagnosing Crashes and Segfaults

Right, so you’ve written some Python. It’s beautiful, it’s elegant, and then—without warning—it exits. No traceback, no KeyboardInterrupt, just a sudden, silent return to the comforting glow of your terminal prompt. Or worse, it spits out Segmentation fault (core dumped) and mocks you from the history log. This, my friend, is where your usual Python tools tap out. print() statements? Useless. The logging module? Never got the message. pdb? Didn’t even get to wake up. Your code has crashed in the C layers, far beneath the comfortable Python runtime where exceptions are raised and caught. This is the realm of dangling pointers, buffer overflows, and corrupted memory. And to diagnose this, you need a different kind of tool. You need faulthandler.
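As a minimal sketch of what turning faulthandler on can look like: the block below enables it against a temporary file and adds a helper that dumps every thread's stack on demand. The names crash_log and snapshot_stacks are made up for illustration, and a real service would more likely point faulthandler at stderr or a persistent log file.

```python
import faulthandler
import tempfile

# Hypothetical destination for crash reports. Keep this file object alive
# for as long as faulthandler is enabled: the module writes to the
# underlying file descriptor, not through the Python object.
crash_log = tempfile.TemporaryFile(mode="w+")

# From here on, a fatal signal such as SIGSEGV makes the interpreter dump
# the Python traceback of every thread into crash_log before it dies.
faulthandler.enable(file=crash_log, all_threads=True)


def snapshot_stacks() -> str:
    """Dump the current traceback of all threads and return it as text."""
    crash_log.seek(0)
    crash_log.truncate()
    faulthandler.dump_traceback(file=crash_log, all_threads=True)
    crash_log.seek(0)
    return crash_log.read()


report = snapshot_stacks()
```

The same switch is available without touching the code by running the interpreter with -X faulthandler or setting the PYTHONFAULTHANDLER environment variable.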

72.8 sys.settrace(): Writing Your Own Debugger or Profiler

Right, so you’ve graduated from print() statements and you’re ready to get serious. You’ve used pdb and maybe even a fancy IDE debugger, and a little voice in your head asked, “How do they do that?” The answer, my friend, is sys.settrace(). It’s the arcane, powerful, and slightly terrifying incantation that allows you to hook into the CPython interpreter’s execution flow. It’s how debuggers, profilers, and coverage tools are born.
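To make the hook concrete, here is a small sketch of a call tracer built on sys.settrace(). The helper names (make_call_tracer, trace_calls) and the two toy functions are assumptions for the example, not part of any library.

```python
import sys


def make_call_tracer(log):
    """Build a trace function that records call and return events in log."""
    def tracer(frame, event, arg):
        if event in ("call", "return"):
            log.append((event, frame.f_code.co_name))
        # Returning the tracer asks the interpreter to keep tracing
        # inside the frame that was just entered.
        return tracer
    return tracer


def trace_calls(func, *args, **kwargs):
    """Run func under the tracer and return (result, recorded events)."""
    log = []
    previous = sys.gettrace()  # do not clobber an active debugger or coverage tool
    sys.settrace(make_call_tracer(log))
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)
    return result, log


def add(a, b):
    return a + b


def add_twice(a, b):
    return add(a, b) + add(a, b)


result, events = trace_calls(add_twice, 1, 2)
```

The trace function is invoked with a frame, an event name, and an event-specific argument; a real debugger would also react to "line" and "exception" events, which this sketch ignores.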

72.7 Remote Debugging with debugpy (VS Code, PyCharm)

Right, so your code is misbehaving. But it’s misbehaving on a remote server, in a Docker container, or inside a virtual environment so alien it might as well be on the dark side of the moon. You can’t just slap a print("got here lol") statement in there and run it locally. This is where we graduate from caveman debugging to something with a bit more finesse: remote debugging. We’re going to use debugpy, Microsoft’s brilliantly capable debugger for Python, which speaks the Debug Adapter Protocol. It’s what lets VS Code’s debugger do its magic, and it plays nicely with PyCharm and other modern IDEs too.

72.6 breakpoint() and PYTHONBREAKPOINT

Right, so you’ve graduated from print("got here") to actual debugging. Congratulations, we’re all very proud. But let’s be honest, fumbling with import pdb; pdb.set_trace() is the digital equivalent of trying to start a fire with two wet sticks. It works, but it’s clumsy, it leaves a mess, and there’s a much better way. Enter breakpoint(). This isn’t just a new function; it’s a cultural shift in Python debugging, and it’s about damn time.
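One way to see why breakpoint() is more than a shorthand: it dispatches through sys.breakpointhook, so the behavior can be swapped without editing the call site. The sketch below installs a stand-in hook (recording_hook and compute are hypothetical names) instead of opening a debugger.

```python
import sys

hits = []


def recording_hook(*args, **kwargs):
    """Stand-in for a debugger: just note that a breakpoint was reached."""
    hits.append("reached")


def compute(x):
    breakpoint()  # calls whatever sys.breakpointhook currently points to
    return x * 2


original_hook = sys.breakpointhook
sys.breakpointhook = recording_hook
try:
    result = compute(21)
finally:
    sys.breakpointhook = original_hook  # put the previous hook back
```

With the default hook left in place, the PYTHONBREAKPOINT environment variable does the same job from outside the program: setting it to 0 turns every breakpoint() call into a no-op, and setting it to an importable callable such as pdb.set_trace selects the debugger to launch.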

72.5 pdb: Setting Breakpoints and Inspecting State

Right, so print() statements have failed you. They always do. Welcome to the big leagues. When your code is doing something so profoundly idiotic that you can’t even begin to guess why, you need to stop it mid-execution, climb inside its brain, and have a look around. That’s what pdb, the Python debugger, is for. It’s your surgical tool for figuring out what the hell is actually happening, not what you think is happening.

72.4 Structured Logging with structlog

Right, let’s talk about making your logs actually useful. You’ve probably been there: staring at a text file that looks like a frantic, unstructured diary entry written by a machine on three cups of espresso. Timestamp, log level, some vague message… good luck finding the one error in that mess. The default logging module is fine for telling you that something happened, but it’s terrible at telling you the story of why it happened. That’s where structlog comes in. It’s not just a library; it’s a philosophy for turning your logs from a liability into a debuggable, queryable asset.

72.3 Logging to Files, Rotating Handlers, and External Services

Right, so you’ve graduated from print() statements. Good for you. Now let’s talk about doing it properly. Logging to the console is fine for a quick script, but for anything that runs longer than five minutes, you need persistence. You need logs that survive a reboot, that you can grep through at 2 AM when things are on fire, and that don’t fill up your disk and bring the whole operation to a grinding halt. Let’s get into it.
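As a rough sketch of the "don't fill up your disk" part, the block below attaches a RotatingFileHandler to a logger. The directory, logger name, and the deliberately tiny size limit are assumptions chosen so that the rotation is visible in a short run.

```python
import logging
import logging.handlers
import os
import tempfile

# Hypothetical log location; a real application would use a fixed path.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Roll over before the file exceeds 1 KiB and keep at most three old files
# (app.log.1 to app.log.3). Older data is discarded.
handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1024, backupCount=3, encoding="utf-8"
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(handler)

for i in range(200):
    logger.info("processing item %d", i)

handler.close()
log_files = sorted(os.listdir(log_dir))
```

After the loop, the directory holds only the live file and three backups, however many records were written. For rotation by time instead of size, logging.handlers.TimedRotatingFileHandler follows the same pattern.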

72.2 basicConfig() vs Manual Configuration

Right, let’s settle this. You’ve seen basicConfig() everywhere. It’s the logging equivalent of a friendly “Easy” button. And you’ve probably also seen people creating Logger objects, Handler objects, Formatter objects… and thought, “Why would anyone do that the hard way?” I’m here to tell you that basicConfig() is a fantastic one-night stand, but for a serious, long-term relationship with your application’s logs, you need to do things manually. Let’s break down why.
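For a sense of what "doing things manually" involves, here is a minimal sketch that wires a Logger, a Handler, and a Formatter together by hand. The build_logger helper and the "payments" logger name are invented for the example.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Wire up one logger with its own handler and formatter."""
    logger = logging.getLogger(name)
    logger.setLevel(level)      # what the logger itself lets through
    logger.propagate = False    # do not also hand records to the root logger

    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.WARNING)  # this destination only wants WARNING and up
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("payments", buffer)
log.info("routine detail")                          # dropped by the handler level
log.warning("card declined for order %s", "A-17")   # written to the stream
output = buffer.getvalue()
```

The point of the extra ceremony is that each handler can have its own level, format, and destination, which a single basicConfig() call applied to the root logger does not give you.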

72.1 The logging Module: Levels, Loggers, Handlers, Formatters

Right, let’s talk about logging. You’ve been using print() statements to debug your code since you wrote your first Hello, World. I get it. It’s immediate, it’s simple, and when your script is three lines long, it’s perfect. But you’re not writing three-line scripts anymore, are you? You’re building applications. And when your application is running on a server at 2 AM and something goes horribly wrong, you’re not going to SSH in to tail -f a bunch of print() output. You need a system. A robust, configurable, and frankly, adult system for understanding what your code is doing when you’re not there to watch it. That system is the logging module.

41.5 Defensive Programming Strategies

Defensive programming is a disciplined approach to software development that prioritizes the creation of robust, fault-tolerant, and predictable code. It operates on the principle that software should not only function correctly under ideal conditions but should also behave gracefully and predictably when encountering unexpected inputs, internal errors, or external system failures. The core philosophy is one of deep skepticism: assume that inputs to a function may be invalid, that external systems may fail, and that code you depend on may have bugs. By proactively anticipating and handling these potential issues, you create systems that are more secure, stable, and easier to debug.
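The following sketch illustrates two of the assumptions named above, that inputs may be invalid and that external systems may fail. Both functions (parse_port and fetch_with_fallback) are hypothetical helpers written for this example.

```python
def parse_port(raw):
    """Convert untrusted input to a TCP port number, rejecting bad values."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise TypeError(f"port must be an int or str, got {type(raw).__name__}")
    try:
        port = int(raw)
    except ValueError:
        raise ValueError(f"port is not an integer: {raw!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def fetch_with_fallback(fetch, default):
    """Call an unreliable dependency and fall back to a known-safe value."""
    try:
        return fetch()
    except OSError:
        # Network and file failures derive from OSError; anything else is
        # more likely a bug and is allowed to propagate.
        return default


def broken_fetch():
    raise ConnectionError("upstream is down")


setting = fetch_with_fallback(broken_fetch, default="cached-value")
```

Catching only the failure class you expect is part of the discipline: a bare except here would hide programming errors behind the fallback.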

41.4 Deprecation Warnings in Library Code

Deprecation warnings serve as a crucial communication channel between library maintainers and their users, signaling that a specific function, class, module, or parameter is slated for removal in a future release. Their primary purpose is not to break existing code immediately but to provide a grace period for developers to update their codebases, thereby preventing abrupt and disruptive changes. This practice is a cornerstone of semantic versioning; deprecations are introduced in minor releases (e.g., 1.4.0) before the offending feature is removed in the next major release (e.g., 2.0.0). This systematic approach allows for stable, predictable evolution of an API.
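A minimal sketch of that grace period in code: an old function keeps working but announces its replacement. The function names and the version number in the message are hypothetical.

```python
import warnings


def new_area(width, height):
    return width * height


def old_area(width, height):
    """Deprecated alias, kept so existing callers continue to work."""
    warnings.warn(
        "old_area() is deprecated and will be removed in 2.0.0; use new_area()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this line
    )
    return new_area(width, height)


# Capture the warning so that it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_area(3, 4)
```

The stacklevel=2 argument matters for library code: it makes the reported file and line point at the user's call, which is the code that needs updating.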

41.3 The warnings Module: warn(), filterwarnings(), simplefilter()

The warnings Module: A Filtering System

Unlike exceptions, which are designed to halt program flow for critical errors, warnings are a mechanism for reporting non-fatal or deprecated usage issues to the developer without stopping execution. The warnings module in Python provides a sophisticated filtering system to control which warnings are shown, how they are formatted, and even how they are handled (e.g., ignored or elevated to exceptions). The system is built around the concept of a filter list, which is processed in order for every triggered warning to decide its fate.
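A short sketch of the filter list in action, under the assumption of two made-up functions that emit different warning categories: one filter elevates DeprecationWarning to an exception, another silences a specific UserWarning by message.

```python
import warnings


def legacy_call():
    warnings.warn("legacy_call() is going away", DeprecationWarning, stacklevel=2)
    return "ok"


def noisy_call():
    warnings.warn("cache is cold", UserWarning, stacklevel=2)
    return "ok"


# catch_warnings() restores the previous filter list on exit, so these
# changes stay local to the block.
with warnings.catch_warnings(record=True) as shown:
    # Each call inserts its filter at the front of the list.
    warnings.simplefilter("error", DeprecationWarning)
    warnings.filterwarnings("ignore", message="cache is cold", category=UserWarning)

    quiet_result = noisy_call()  # matched by the "ignore" filter
    try:
        legacy_call()            # matched by the "error" filter
        escalated = False
    except DeprecationWarning:
        escalated = True
```

The message argument of filterwarnings() is a regular expression matched against the start of the warning text, while simplefilter() matches on category alone.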

41.2 Why assert Is Not For Validation (and -O Disables It)

The assert statement in Python serves a specific purpose: it is a debugging aid that tests conditions which should never be false in a correctly running program. Its primary design goal is to catch programming errors, not user errors or invalid data from external sources. This critical distinction is the cornerstone of understanding its proper use and its behavior when Python is run with optimizations enabled.

The Core Purpose: Debugging Aid, Not Validation Logic

An assertion expresses an invariant: a condition that must always be true at a certain point in your code if the program’s logic is sound. For example, a function that calculates the square root of a number might assert that its input is non-negative. This isn’t to validate user input; it’s to verify the programmer’s assumption that before this function is called, the calling code has already ensured the value is valid. If the assertion fails, it signifies a bug in the program’s logic, not a mistake by the user.
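To illustrate the distinction, the sketch below validates external input with an explicit exception and then runs a one-line script in a child interpreter with and without -O to show the assertion disappearing. The names checked_sqrt, SNIPPET, and run_snippet are assumptions for the example.

```python
import math
import subprocess
import sys


def checked_sqrt(value):
    """Validation that survives -O: raise an explicit exception."""
    if value < 0:
        raise ValueError(f"value must be non-negative, got {value!r}")
    return math.sqrt(value)


# A tiny program whose assertion always fails.
SNIPPET = "assert False, 'stripped under -O'\nprint('still running')"


def run_snippet(optimize):
    """Run SNIPPET in a fresh interpreter, optionally with -O."""
    cmd = [sys.executable]
    if optimize:
        cmd.append("-O")
    cmd += ["-c", SNIPPET]
    return subprocess.run(cmd, capture_output=True, text=True)


normal = run_snippet(optimize=False)     # fails with AssertionError
optimized = run_snippet(optimize=True)   # the assert is compiled away
```

Under -O the built-in constant __debug__ is False and assert statements are not compiled into the bytecode at all, which is why any check that must always run belongs in an if statement with an explicit raise.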

41.1 assert: When and How to Use It

The assert statement is a powerful tool for embedding sanity checks directly into your code. It acts as a self-check mechanism that validates assumptions your program makes about its own state. When an assumption holds true, the program continues execution as normal. When it is false, the statement raises an AssertionError, which halts the program unless something catches it. This fail-fast behavior is a cornerstone of defensive programming, allowing developers to catch logic errors and invalid states as close to their source as possible, drastically simplifying the debugging process.
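As a small sketch of assertions used as internal sanity checks, the two hypothetical functions below assert conditions that can only fail if the code itself, or one of its callers, is wrong.

```python
def normalize(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        # Bad data from outside: an explicit exception, not an assert.
        raise ValueError("weights must sum to a positive number")
    result = [w / total for w in weights]
    # Internal invariant: if this fails, the arithmetic above is wrong.
    assert abs(sum(result) - 1.0) < 1e-9, f"normalized sum is {sum(result)}"
    return result


def midpoint(low, high):
    """Integer midpoint of a range the caller guarantees is ordered."""
    assert low <= high, f"caller bug: low={low} > high={high}"
    return low + (high - low) // 2


shares = normalize([1, 1, 2])
middle = midpoint(2, 10)
```

The optional message after the comma is worth the few extra characters: it shows up in the AssertionError and usually says enough to locate the faulty assumption without opening a debugger.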
