12.8 API Gateway Logging, Access Logs, and X-Ray Tracing

Right, let’s talk about visibility. You’ve built this beautiful, intricate API Gateway-powered clockwork mouse, and now you need to see if it’s actually running or if it’s just a pile of cogs and hopes. This is where logging and tracing come in. Without them, you’re flying blind, and when a client calls you at 3 AM because their “thingy is broken,” you’ll have precisely zero clues. We’re going to fix that.

The Two Flavors of Logging: Execution vs. Access

First, don’t get these confused. They serve different purposes.

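Execution Logs are API Gateway’s own play-by-play of how it handled a request: the method request, any authorizer call, the integration request and response, and whatever errors cropped up along the way. You switch them on per stage at ERROR or INFO level, and for a REST API they land in a CloudWatch log group that API Gateway manages, named like API-Gateway-Execution-Logs_{rest-api-id}/{stage-name}. Don’t confuse them with your Lambda function’s application logs, the console.log or logger.info statements you painstakingly put in your code. Those are written by Lambda, not the gateway, and they live in the function’s own log group, which looks something like /aws/lambda/my-function-name.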

Access Logs are the star of the show for API Gateway itself. They record one entry per request, in a format you define, describing the request and the response as seen by the gateway. This is your classic web server log on steroids. It tells you who asked for what, when, what status you sent back, and how long it all took. This is configured directly on the API Gateway stage.

Configuring Access Logs: The Good, The Bad, The JSON

You turn on Access Logs in the API Gateway console for a Stage, or via CloudFormation, by pointing the stage at a CloudWatch log group and giving it a Log Format. The format is the most important part. API Gateway will also accept CLF (Common Log Format), CSV, or XML, but for the love of all that is holy, use JSON. CLF is a space-delimited string that you then have to parse with regex nightmares. JSON logs go straight into CloudWatch Logs, where the fields are discovered automatically, so you can filter and aggregate on them in Logs Insights without any parsing.

Here’s a robust JSON log format you can steal. I’ve added comments and line breaks for readability; strip both before pasting, because the actual field expects plain JSON on a single line.

{
  "requestId":"$context.requestId",
  "ip":"$context.identity.sourceIp",
  "caller":"$context.identity.caller",
  "user":"$context.identity.user",
  "requestTime":"$context.requestTime",
  "httpMethod":"$context.httpMethod",
  "resourcePath":"$context.resourcePath",
  "status":"$context.status", // The HTTP status the client received
  "protocol":"$context.protocol",
  "responseLength":"$context.responseLength",
  "integrationError":"$context.integration.error", // Lifesaver for Lambda failures
  "integrationStatus":"$context.integration.status", // With proxy integration, the status your function code returned
  "responseLatency":"$context.responseLatency", // Total time in ms, as seen by the gateway
  "integrationLatency":"$context.integration.latency" // Time in ms spent waiting on the integration (Lambda)
}
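Since the commented version above cannot be pasted as-is, one option is to generate the real string from code. The following is a minimal Python sketch, not an official tool: the names ACCESS_LOG_FIELDS and build_access_log_format are made up for illustration, and it assumes the stage wants the format as comment-free JSON on one line.

```python
import json

# Same fields as the format above. Keys are arbitrary labels; the values are
# the $context variables API Gateway substitutes for each request.
ACCESS_LOG_FIELDS = {
    "requestId": "$context.requestId",
    "ip": "$context.identity.sourceIp",
    "caller": "$context.identity.caller",
    "user": "$context.identity.user",
    "requestTime": "$context.requestTime",
    "httpMethod": "$context.httpMethod",
    "resourcePath": "$context.resourcePath",
    "status": "$context.status",
    "protocol": "$context.protocol",
    "responseLength": "$context.responseLength",
    "integrationError": "$context.integration.error",
    "integrationStatus": "$context.integration.status",
    "responseLatency": "$context.responseLatency",
    "integrationLatency": "$context.integration.latency",
}


def build_access_log_format(fields):
    """Serialize the field map as compact, comment-free, single-line JSON."""
    return json.dumps(fields, separators=(",", ":"))


if __name__ == "__main__":
    # Paste the printed line into the stage's Log Format field.
    print(build_access_log_format(ACCESS_LOG_FIELDS))
```

Keeping the field map in code also makes it easy to reuse the same format across stages or feed it into whatever Infrastructure-as-Code tool you use.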

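The magic variables here are $context.integration.status and $context.integration.error. With Lambda Proxy Integration (which you should be using), the top-level status alone can mislead you in both directions. If your function catches its own failure and returns a 500, the gateway counts that as a perfectly successful integration: it got a well-formed response and passed it along. If your function throws an unhandled exception instead, the gateway never receives a valid proxy response and answers the client with a generic 502, and your best clue about the real error will be in integration.error. Always, always include these two fields.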
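To make that concrete, here is a small Python sketch of how you might triage parsed access log entries using those fields. It assumes the JSON format shown earlier, that every value arrives as a string, and that unset variables are logged as "-". The function name and the sample line are invented for illustration; the error text in particular is not a real API Gateway message.

```python
import json


def integration_failed(record):
    """Return True when a log entry points at a backend (integration) failure."""
    error = str(record.get("integrationError", "-"))
    status = str(record.get("status", "-"))
    has_error = error not in ("", "-")
    server_error = status.isdigit() and int(status) >= 500
    return has_error or server_error


# Hypothetical log line in the JSON format from this section.
SAMPLE_LINE = (
    '{"requestId":"abc-123","status":"502",'
    '"integrationStatus":"-","integrationError":"example error text"}'
)

if __name__ == "__main__":
    record = json.loads(SAMPLE_LINE)
    if integration_failed(record):
        print(record["requestId"], record["status"], record["integrationError"])
```

The check deliberately looks at both fields, so a request that returned 200 to the client but still carried an integration error does not slip through.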

Connecting the Dots with X-Ray

Access logs tell you what happened. X-Ray tells you why it was slow. It’s a distributed tracing system. You enable it on your API Gateway stage and on your Lambda function. When you do, AWS automatically instruments the journey of a request.

It creates a trace, which is a collection of segments and subsegments. API Gateway will create a segment for the time the request is in its domain. Then, when it calls Lambda, Lambda adds its own segments to the same trace, covering the invocation and what happened inside your function. Calls your function makes to DynamoDB can show up as further subsegments with their latency, but that part is not automatic: you have to instrument the AWS SDK client in your code with the X-Ray SDK (or an equivalent such as the AWS Distro for OpenTelemetry).

The visualizations in the X-Ray service console are fantastic for spotting bottlenecks, and they line up with the latency fields in your access logs. Is the time mostly in integrationLatency? Then your Lambda is the problem, and that includes cold starts, which is where Provisioned Concurrency helps. Is the gap between responseLatency and integrationLatency huge? That’s time spent in API Gateway itself, which usually means a slow authorizer, heavy mapping templates, or a massive request/response body.
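The same comparison can be run over the access logs themselves. This is a rough Python sketch, assuming the JSON format from earlier with both latencies logged in milliseconds as strings; the function names and sample records are hypothetical.

```python
def gateway_overhead_ms(record):
    """responseLatency minus integrationLatency, or None if either is missing."""
    try:
        return int(record["responseLatency"]) - int(record["integrationLatency"])
    except (KeyError, ValueError, TypeError):
        # Covers absent fields and the "-" placeholder for unset values.
        return None


def worst_overhead(records, top=3):
    """Return the entries with the largest gateway overhead, largest first."""
    measured = [r for r in records if gateway_overhead_ms(r) is not None]
    return sorted(measured, key=gateway_overhead_ms, reverse=True)[:top]


if __name__ == "__main__":
    sample = [
        {"requestId": "a", "responseLatency": "120", "integrationLatency": "110"},
        {"requestId": "b", "responseLatency": "900", "integrationLatency": "150"},
        {"requestId": "c", "responseLatency": "40", "integrationLatency": "-"},
    ]
    for rec in worst_overhead(sample):
        print(rec["requestId"], gateway_overhead_ms(rec))
```

A consistently large number here points away from your function and toward the gateway side of the request.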

Enabling it on the Lambda side takes two things: active tracing switched on in the function’s configuration, and an Execution IAM Role that is allowed to send trace data. The role needs these permissions:

{
  "Effect": "Allow",
  "Action": [
    "xray:PutTraceSegments",
    "xray:PutTelemetryRecords"
  ],
  "Resource": [
    "*"
  ]
}
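The permissions only allow trace data to be sent; tracing itself is a setting on the function and on the stage. If you would rather flip those switches from a script than from the console, a sketch along these lines should work. It only builds the request parameters, so it runs without AWS access. The boto3 calls in the trailing comment (update_function_configuration and update_stage) are the ones I would expect to use for a REST API, and the function name, API ID, and stage name are placeholders.

```python
def lambda_tracing_kwargs(function_name):
    """Parameters for the Lambda client's update_function_configuration call."""
    return {
        "FunctionName": function_name,
        "TracingConfig": {"Mode": "Active"},
    }


def stage_tracing_kwargs(rest_api_id, stage_name):
    """Parameters for the API Gateway (REST) client's update_stage call."""
    return {
        "restApiId": rest_api_id,
        "stageName": stage_name,
        "patchOperations": [
            {"op": "replace", "path": "/tracingEnabled", "value": "true"},
        ],
    }


if __name__ == "__main__":
    print(lambda_tracing_kwargs("my-function-name"))
    print(stage_tracing_kwargs("abc123", "prod"))

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("lambda").update_function_configuration(
#       **lambda_tracing_kwargs("my-function-name"))
#   boto3.client("apigateway").update_stage(
#       **stage_tracing_kwargs("abc123", "prod"))
```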

The Rough Edges and Pitfalls

It’s Not Free: This is the biggest one. Access logs and X-Ray data incur costs. For a high-volume API, the cost of logs can easily surpass the cost of the API execution itself. Be mindful of volume. By default X-Ray does not trace everything: it samples the first request each second plus 5% of any additional requests, which is usually sufficient.
Latency Overhead: Enabling X-Ray does add a tiny bit of overhead, even though trace data is sent asynchronously. It’s negligible for most applications, but it is not zero, so don’t treat the timings as perfectly precise.
Permission Nightmares: I cannot count the number of times I’ve seen “Failed to emit logs” errors because the IAM role that API Gateway assumes (the one that trusts apigateway.amazonaws.com) didn’t have the logs:CreateLogStream and logs:PutLogEvents permissions on the correct CloudWatch Log Group. For REST APIs that role’s ARN is set once per account and region in the API Gateway settings, and the AmazonAPIGatewayPushToCloudWatchLogs managed policy covers what it needs. Get your IAM roles right. The console setup wizard often does this for you, but if you’re using Infrastructure-as-Code, you have to wire it up yourself.
The VPC Problem: If your Lambda function is in a VPC without a NAT Gateway, it can’t reach the public X-Ray service endpoint to send trace data. Your traces will be incomplete. The solution is to create a PrivateLink VPC Endpoint for the xray service.
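On the cost point, it helps to put a number on how much X-Ray actually records. The sketch below is a back-of-the-envelope estimate in Python, assuming the default sampling rule as I understand it: a reservoir of one traced request per second, plus a fixed 5% of everything beyond that. The function name is made up, and real numbers will drift from this because the reservoir is not shared perfectly across services and any custom sampling rules change the math.

```python
def traced_requests_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Estimate traces recorded per second under a reservoir + fixed-rate rule."""
    guaranteed = min(requests_per_second, reservoir)
    sampled = (requests_per_second - guaranteed) * fixed_rate
    return guaranteed + sampled


if __name__ == "__main__":
    for rps in (0.5, 1, 100, 2000):
        print(rps, "req/s ->", round(traced_requests_per_second(rps), 2), "traces/s")
```

At low traffic nearly every request is traced, while at high traffic the share settles toward the fixed rate, so trace volume grows much more slowly than log volume does.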

So, enable Access Logs (as JSON), include the integration error fields, and turn on X-Ray when you need to debug performance. It turns a black box into a transparent, queryable system. And that’s how you avoid 3 AM phone calls. Mostly.