23.7 VPC Flow Logs: Capturing Accept and Reject Traffic for Analysis
Right, let’s talk about VPC Flow Logs. This is where we stop guessing why that darn instance can’t talk to the database and start knowing. Think of Security Groups and NACLs as your bouncers—they decide who gets in and who gets tossed out. Flow Logs are the meticulous club managers who keep a perfect record of every single decision those bouncers made, plus all the randos who showed up without an invite. It’s your first, last, and best tool for untangling the rat’s nest of network connectivity issues in your VPC.
I love them because they’re brutally honest. A security group might say it allows traffic, but Flow Logs will show you the packet that got rejected because it was trying to use the wrong protocol. They don’t lie. They just report the facts as observed by the network interface itself.
What Exactly Gets Logged?
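Every flow log record is a single line of space-separated text by default (Parquet is an option if you deliver to S3), and each one captures a specific network flow. A “flow” here is basically a 5-tuple: source IP, destination IP, source port, destination port, and protocol, aggregated per network interface over a capture window. It logs both accepted and rejected traffic, which is the whole point. Each record tells you the action (ACCEPT or REJECT), plus packet and byte counts, timestamps, and a bunch of other juicy details. What it won’t tell you is whether a security group or a network ACL made the call; you have to work that out from the pattern of records and your own rules.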
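To make the record layout concrete, here is a minimal Python sketch that splits one default-format (version 2) record into named fields and pulls out the 5-tuple. The function names and the sample line are made up for illustration; the field order is the documented default, and a custom format would need its own field list.

```python
# Field order of the default (version 2) flow log format.
DEFAULT_FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
]


def parse_record(line):
    """Split one default-format record into a field -> value dict."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(
            f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(DEFAULT_FIELDS, values))


def five_tuple(record):
    """Return (srcaddr, dstaddr, srcport, dstport, protocol) for an OK record.

    NODATA and SKIPDATA records carry "-" placeholders, so they are refused.
    """
    if record["log-status"] != "OK":
        raise ValueError("record has no flow data: " + record["log-status"])
    return (
        record["srcaddr"],
        record["dstaddr"],
        int(record["srcport"]),
        int(record["dstport"]),
        int(record["protocol"]),
    )


# Illustrative record, not captured from a real account.
sample = (
    "2 123456789012 eni-0123456789abcdef0 10.0.1.5 10.0.2.98 "
    "54332 443 6 4 291 1676484662 1676484662 REJECT OK"
)
record = parse_record(sample)
print(record["action"], five_tuple(record))
```

Everything stays a string until you ask for the 5-tuple, which keeps the parser from choking on the “-” placeholders that show up in NODATA and SKIPDATA records.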
Here’s how you enable them; swap in your own resource ID and role ARN and the command runs as-is. You can do it for an entire VPC, a subnet, or a specific network interface. I usually start at the subnet level for broad-stroke analysis.
# Enable flow logs for a subnet, capturing both accepted and rejected traffic
aws ec2 create-flow-logs \
    --resource-type Subnet \
    --resource-id subnet-0123456789abcdef0 \
    --traffic-type ALL \
    --log-destination-type cloud-watch-logs \
    --log-group-name "my-vpc-flow-logs" \
    --deliver-logs-permission-arn arn:aws:iam::123456789012:role/FlowLogsRole
The key flag here is --traffic-type ALL. You want ALL. Don’t settle for just ACCEPT or REJECT; you need the full picture to debug effectively. The other crucial bit is setting up that IAM role (FlowLogsRole) correctly. AWS documentation will give you the permissions policy, but the trust policy is what always trips people up. The role must trust the vpc-flow-logs.amazonaws.com service principal to deliver logs. Mess this up, and you’ll spend an hour wondering why your log group is emptier than a promises repository.
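Since the trust policy is the part that goes wrong, here is a small Python sketch that builds it. The function name is hypothetical; the service principal is the one named above, and the rest is the standard shape of an IAM trust policy document.

```python
import json


def flow_logs_trust_policy():
    """Trust policy letting the VPC Flow Logs service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "vpc-flow-logs.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


policy = flow_logs_trust_policy()
print(json.dumps(policy, indent=2))
```

One way to use it: write the printed JSON to a file and hand that file to aws iam create-role via --assume-role-policy-document. The permissions policy for writing to the log group is a separate document and isn't shown here.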
Decoding the Output: The Good Stuff
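Now, let’s look at a sample log entry. This is where the magic happens. The raw record is a single space-separated line; it’s laid out below as field/value pairs so you can actually read it. The fields beyond the default set (vpc-id, subnet-id, instance-id, tcp-flags, type, pkt-srcaddr, pkt-dstaddr) only show up if you ask for them with a custom --log-format.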
{
  "version": 3,
  "account-id": "123456789012",
  "interface-id": "eni-0123456789abcdef0",
  "srcaddr": "10.0.1.5",
  "dstaddr": "10.0.2.98",
  "srcport": 54332,
  "dstport": 443,
  "protocol": 6,
  "packets": 4,
  "bytes": 291,
  "start": 1676484662,
  "end": 1676484662,
  "action": "REJECT",
  "log-status": "OK",
  "vpc-id": "vpc-abcdef123",
  "subnet-id": "subnet-12345abc",
  "instance-id": "i-0987654321abcdef0",
  "tcp-flags": 2,
  "type": "IPv4",
  "pkt-srcaddr": "192.0.2.1",
  "pkt-dstaddr": "10.0.2.98"
}
See that action: REJECT? Gold. Now, look at pkt-srcaddr vs. srcaddr. This is a critical nuance that causes endless confusion. srcaddr/dstaddr are the addresses as recorded at the network interface being logged, which may belong to an intermediate hop. The pkt-srcaddr/pkt-dstaddr fields are the packet-level (original) addresses. When traffic passes through a NAT gateway, or comes from an EKS pod whose IP differs from its node’s interface, srcaddr can be the private IP of the thing in the middle, which tells you very little on its own. You need to look at pkt-srcaddr to see the actual original source IP. I’ve watched seasoned engineers facepalm when they learn this.
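The srcaddr versus pkt-srcaddr comparison is easy to automate. The sketch below assumes each record has already been parsed into a dict keyed by flow log field name; both helper names are hypothetical.

```python
def original_source(record):
    """Prefer the packet-level source address when the record carries one."""
    pkt_src = record.get("pkt-srcaddr")
    if pkt_src and pkt_src != "-":
        return pkt_src
    # Default-format records have no pkt-srcaddr, so fall back to srcaddr.
    return record["srcaddr"]


def passed_through_intermediate(record):
    """True when the interface-level and packet-level sources disagree."""
    pkt_src = record.get("pkt-srcaddr")
    return bool(pkt_src) and pkt_src != "-" and pkt_src != record["srcaddr"]


# Illustrative records, not captured from a real account.
custom_record = {"srcaddr": "10.0.1.5", "pkt-srcaddr": "192.0.2.1"}
default_record = {"srcaddr": "10.0.1.5"}

print(original_source(custom_record), passed_through_intermediate(custom_record))
print(original_source(default_record), passed_through_intermediate(default_record))
```

A mismatch doesn't tell you what the intermediate hop is, only that there is one, so treat it as a prompt to go look at the interface that produced the record.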
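The tcp-flags field is another superpower. It’s a bitmask (FIN = 1, SYN = 2, RST = 4, SYN-ACK = 18), and the flags seen during a capture window are OR-ed together. A value of 19 (SYN-ACK plus FIN) on an accepted flow is a normal short connection that opened and closed. A value of 2 (just SYN) on a rejected flow? That’s a TCP handshake that never completed, a sure sign something is being blocked. It’s like finding a broken handshake at a crime scene.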
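Decoding that bitmask by hand gets old quickly, so here is a minimal Python sketch. The helper names are hypothetical, and the table covers only the bits discussed here, with ACK = 16 inferred from SYN-ACK being 18 (SYN 2 plus ACK 16).

```python
# Bit values for the TCP flags discussed in this section.
TCP_FLAG_BITS = {"FIN": 1, "SYN": 2, "RST": 4, "ACK": 16}


def decode_tcp_flags(value):
    """Return the names of the flags set in a tcp-flags bitmask."""
    return [name for name, bit in TCP_FLAG_BITS.items() if value & bit]


def is_unanswered_syn(action, tcp_flags):
    """True for a rejected record where only SYN was ever seen."""
    return action == "REJECT" and tcp_flags == TCP_FLAG_BITS["SYN"]


print(decode_tcp_flags(19))            # ['FIN', 'SYN', 'ACK']
print(decode_tcp_flags(2))             # ['SYN']
print(is_unanswered_syn("REJECT", 2))  # True
```

Because the flags are OR-ed across the capture window, the decoded list says which flags appeared at some point, not the order they arrived in.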
Common Pitfalls and How to Avoid Them
First pitfall: cost. Flow logs are cheap… until they’re not. Logging ALL traffic for a high-throughput resource can generate a staggering amount of data. I once saw a team accidentally enable it on a NAT Gateway and get a bill that made them physically recoil. Use a retention policy on your CloudWatch Log Group (30 days is usually plenty) and consider aggregating your findings into metrics if you need long-term trends.
Second pitfall: analysis paralysis. The raw logs are a firehose. You must use tools to drink from it. CloudWatch Logs Insights is your best friend here. Don’t try to read these line by line; query them.
# Find the top 10 sources of rejected traffic.
# Uses srcAddr, which Logs Insights discovers automatically for default-format
# flow logs; custom-format fields such as pkt-srcaddr need a parse step first.
filter action = "REJECT"
| stats count(*) as rejectCount by srcAddr
| sort rejectCount desc
| limit 10
This query instantly shows you your noisiest would-be attackers or, more likely, your most misconfigured applications.
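If you have exported records to a file, or just want to sanity-check what the query should return, the same aggregation is a few lines of Python. This is a sketch that assumes default-format records (srcaddr in the fourth column, action in the thirteenth); the function name and sample lines are invented for illustration.

```python
from collections import Counter


def top_rejected_sources(lines, limit=10):
    """Count REJECT records per source address, most frequent first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        srcaddr, action = fields[3], fields[12]
        if action == "REJECT":
            counts[srcaddr] += 1
    return counts.most_common(limit)


# Illustrative records, not captured from a real account.
sample_lines = [
    "2 123456789012 eni-0123456789abcdef0 10.0.1.5 10.0.2.98 54332 443 6 4 291 1676484662 1676484662 REJECT OK",
    "2 123456789012 eni-0123456789abcdef0 10.0.1.5 10.0.2.98 54340 443 6 2 120 1676484670 1676484671 REJECT OK",
    "2 123456789012 eni-0123456789abcdef0 10.0.3.7 10.0.2.98 40112 5432 6 1 60 1676484680 1676484680 REJECT OK",
    "2 123456789012 eni-0123456789abcdef0 10.0.4.9 10.0.2.98 50001 443 6 10 4200 1676484690 1676484700 ACCEPT OK",
]

top = top_rejected_sources(sample_lines)
print(top)
```

For anything beyond a quick check, the Logs Insights query is the better tool; this version just makes the logic explicit.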
Finally, remember that Flow Logs are not a real-time tool. Records are aggregated over a capture window (10 minutes by default, 1 minute if you configure it), and delivery to CloudWatch Logs typically adds a few more minutes on top of that. Don’t frantically refresh the logs after changing a security group rule; go get a coffee first. They are for forensic analysis and trend spotting, not for live tailing a connection.
They are, without a doubt, one of the most powerful and underutilized features in AWS. Turn them on. Get familiar with queries. You’ll never have to blindly guess about network traffic again.