Structured Logging Patterns I Adopted This Year

Unstructured logs are how you lose hours in production. You know the symptom: something broke, you grep through CloudWatch, you find nothing useful, you add logs, you redeploy, you wait. Structured logging breaks that cycle. Here are the patterns I standardized on this year.

The Core Principle

Every log line is a JSON object with a fixed set of top-level fields and a flexible payload. That structure lets CloudWatch Logs Insights, Datadog, or any other log platform filter and aggregate on individual fields instead of pattern-matching free text.

python

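import json
import logging
import time

def log_event(event: str, **kwargs):
    # Fixed top-level fields come first; caller-supplied fields
    # (level, request_id, payload) are merged in after them.
    record = {
        'timestamp': time.time(),  # epoch seconds
        'event': event,            # e.g. 'post.created'
        'service': 'posts-api',
        **kwargs,
    }
    # One JSON object per line so the log platform can parse each entry.
    print(json.dumps(record))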

The Required Fields

Every log line I write has at minimum:

timestamp: epoch seconds
event: short identifier like post.created or email.send.failed
service: which service produced the log
level: info, warn, error
request_id: ties logs from one request together

With just those five fields I can answer most production questions.
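
One way to guarantee the five fields is to enforce them in a formatter for the standard `logging` module rather than trusting every call site. This is a minimal sketch, not the exact code I run: the `JsonFormatter` class, the `LEVEL_NAMES` map, and the convention of passing payload through `extra={"fields": ...}` are all assumptions made for illustration.

```python
import json
import logging

# Map stdlib levels onto the three names used in this post.
LEVEL_NAMES = {logging.INFO: "info", logging.WARNING: "warn", logging.ERROR: "error"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the five required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": record.created,   # epoch seconds
            "event": record.getMessage(),  # e.g. "post.created"
            "service": self.service,
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "request_id": getattr(record, "request_id", None),
        }
        # Optional payload; it can add keys but never overwrite a required one.
        for key, value in getattr(record, "fields", {}).items():
            payload.setdefault(key, value)
        return json.dumps(payload, default=str)


logger = logging.getLogger("posts-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("posts-api"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("post.created", extra={"request_id": "req-123", "fields": {"post_id": 42}})
```

Because the formatter owns the required fields, a call site that forgets `request_id` still produces a well-formed line with `request_id` set to null, which is easy to query for and fix.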

The Request-ID Pattern

The single highest-leverage change was propagating a request ID through every log call. A user reports a bug, they send me the request ID from the response header, I filter the logs by that ID, and I see every step of their request. Thirty seconds instead of thirty minutes.
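
Threading the ID through every function signature gets old fast. A sketch of one alternative, assuming a Python service: keep the ID in a `contextvars.ContextVar` so that every log call picks it up implicitly. The names `start_request` and `build_record` are hypothetical, and where the incoming ID comes from (for example a request header) depends on your framework.

```python
import contextvars
import json
import time
import uuid

# Holds the current request's ID; isolated per thread and per asyncio task.
_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_id=None):
    """Call once at the top of a request. Reuses an upstream ID if one was sent."""
    rid = incoming_id or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def build_record(event, level="info", **kwargs):
    """Assemble a log record; the request ID is read from context, not passed in."""
    return {
        **kwargs,
        "timestamp": time.time(),
        "event": event,
        "service": "posts-api",
        "level": level,
        "request_id": _request_id.get(),
    }


def log_event(event, level="info", **kwargs):
    print(json.dumps(build_record(event, level, **kwargs)))


rid = start_request()           # or start_request(id_from_upstream_header)
log_event("post.created", post_id=42)
log_event("email.send.failed", level="error", reason="smtp timeout")
```

Both lines carry the same `request_id`, and returning `rid` from `start_request` makes it easy to echo the value back in a response header.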

What Not to Log

Passwords, tokens, API keys (obviously)
Full request bodies unless sanitized
PII beyond what is necessary
Log-per-loop-iteration patterns (hello, bill explosion)
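
The first three items can be partly enforced in code. Here is a minimal sketch of a redaction pass to run over a payload before it is logged. The `SENSITIVE_KEYS` set and the `redact` helper are illustrative assumptions; a key-name denylist will not catch secrets embedded inside free-text values, so treat it as a safety net rather than a guarantee.

```python
# Key names treated as sensitive in this sketch; extend for your own payloads.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization", "secret"}
REDACTED = "[REDACTED]"


def redact(value):
    """Return a copy of value with sensitive dict keys masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


body = {
    "username": "sam",
    "password": "hunter2",
    "profile": {"token": "abc123", "tags": ["a", "b"]},
}
print(redact(body))
```

The function returns a new structure instead of mutating its input, so the original request body stays usable for the rest of the handler.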

The Shape Rule

Logs and metrics are different. Logs record discrete events with their context. Metrics are numbers aggregated over time. Do not try to use logs as metrics by counting or parsing ordinary log lines after the fact. Emit metrics separately using CloudWatch Embedded Metric Format or your platform equivalent.
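
For reference, an Embedded Metric Format document is a JSON object with an `_aws` metadata block that names which top-level keys are metrics and which are dimensions. The sketch below builds one by hand. The namespace `PostsApi`, the metric name `request_latency_ms`, and the helper name are assumptions for illustration, and in practice you may prefer an AWS-provided EMF client library over hand-rolled JSON.

```python
import json
import time


def emit_latency_metric(latency_ms, service="posts-api", namespace="PostsApi"):
    """Print one EMF document describing a single latency measurement."""
    doc = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # EMF expects epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["service"]],  # names of dimension keys below
                    "Metrics": [{"Name": "request_latency_ms", "Unit": "Milliseconds"}],
                }
            ],
        },
        # Dimension and metric values live at the top level of the document.
        "service": service,
        "request_latency_ms": latency_ms,
    }
    print(json.dumps(doc))
    return doc


emit_latency_metric(123.4)
```

The metric line is kept separate from the event lines on purpose: event logs stay queryable as events, and the numbers go where they can be graphed and alarmed on.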

Structured logging is not a silver bullet. It is a small discipline that compounds over the life of a project. Add it on day one, not on day one hundred.

The Log Levels Problem

Most apps have too many info logs and too few warn logs. The fix is discipline: info is for normal flow, warn is for recoverable issues, error is for things a human needs to see. If every log is info, nothing is actionable. Spend time tuning levels until warn and error are rare enough to be meaningful.
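
A retry loop is a handy place to see the three levels side by side. This is a sketch under assumptions: `send_with_retry`, the event names, and the pluggable `log` callable are hypothetical, chosen only to show which outcome maps to which level.

```python
import json
import time


def print_log(level, event, **fields):
    """Default sink: one JSON line per event."""
    print(json.dumps({"timestamp": time.time(), "level": level, "event": event, **fields}))


def send_with_retry(send, attempts=3, log=print_log):
    """Call send() up to `attempts` times, choosing the log level by outcome."""
    for attempt in range(1, attempts + 1):
        try:
            result = send()
        except Exception as exc:
            if attempt < attempts:
                # Recoverable: another attempt is coming, so this is a warn.
                log("warn", "email.send.retry", attempt=attempt, error=str(exc))
            else:
                # Out of retries: a human needs to see this, so it is an error.
                log("error", "email.send.failed", attempts=attempts, error=str(exc))
                raise
        else:
            # Normal flow.
            log("info", "email.sent", attempt=attempt)
            return result


print(send_with_retry(lambda: "queued"))
```

With this mapping, a transient failure leaves a warn trail, and an alert on error only fires once the retries are exhausted.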

What I Added Late

Metric logs for latency histograms (should have had earlier)
Correlation IDs across service boundaries (retrofitted painfully)
Sampling for high-volume log paths (keeps CloudWatch bills reasonable)

Each of these produced an obvious improvement the week it shipped. Each should have been in the initial design, not added in response to an outage.
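
Of the three, sampling is the easiest to sketch. One approach, assumed here for illustration and not necessarily the one you should ship: always keep warn and error, and sample info lines by hashing the request ID, so a sampled request keeps all of its lines and an unsampled one drops all of them. The `should_log` helper and the 10% default rate are hypothetical.

```python
import zlib


def should_log(level, request_id, sample_rate=0.1):
    """Decide whether to emit a line. Non-info levels are never dropped."""
    if level != "info":
        return True
    # Stable hash of the request ID, mapped to a bucket in [0, 10000).
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000


for rid in ("req-1", "req-2", "req-3"):
    print(rid, should_log("info", rid), should_log("error", rid))
```

Hashing instead of calling `random.random()` per line keeps the decision consistent across every log call in a request, which preserves the request-ID debugging workflow for the requests that are sampled in.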

See the AWS structured logging guide for Lambda-specific patterns.