Observability for Small Teams: What to Log, Metric, and Trace
Observability vendors will sell you a stack that costs more than your AWS bill. For a small team or solo operator, most of that is waste. Four months of production operations later, here is the minimal setup that actually keeps me informed without draining my budget.
The Three Pillars, Sized Down
The textbook says logs, metrics, traces. For a small team I interpret each narrowly. Logs are for unexpected events and debugging. Metrics are for dashboards and alerts. Traces are for the rare deep investigation.
Logs: Structured, Sparse, Searchable
I log the start and end of every request, every error, and every state-changing event. I do not log every database call. That path leads to CloudWatch bills that rival your compute bill.
Every log is JSON. Every log has a correlation ID that threads a request across services. CloudWatch Logs Insights handles my query needs and costs cents per month at my volume.
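As a minimal sketch of what such a log line could look like, the helper below writes one JSON object per event and attaches a correlation ID. The function names (`new_correlation_id`, `log_event`) and the field names (`ts`, `event`, `correlation_id`) are assumptions for illustration, not a fixed schema; any consistent set of keys would work.

```python
import json
import time
import uuid


def new_correlation_id() -> str:
    """Create an ID at the edge; downstream services should reuse the one they receive."""
    return uuid.uuid4().hex


def log_event(event: str, correlation_id: str, **fields) -> str:
    """Print one JSON log line to stdout and return it."""
    record = {
        "ts": int(time.time() * 1000),  # milliseconds since epoch
        "event": event,
        "correlation_id": correlation_id,
        **fields,  # extra context such as status or duration
    }
    line = json.dumps(record, default=str)
    print(line)
    return line


if __name__ == "__main__":
    cid = new_correlation_id()
    log_event("request.start", cid, method="POST", path="/signup")
    log_event("request.end", cid, status=201, duration_ms=42)
```

Because each line is JSON, CloudWatch Logs Insights can treat the keys as fields, so a query along the lines of `filter correlation_id = "<id>" | sort @timestamp asc` should pull up one request's events in order.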
Metrics: Four Golden Signals and a Business Metric
I track rate, errors, duration, and saturation on every service. On top of that I track one business metric per app, such as signups per hour, posts published per day, or emails sent per minute. These business metrics are what tell me the system is doing its actual job, not just staying up.
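# CloudWatch embedded metric format (EMF) in Lambda
import json
import time


def emit_metric(name: str, value: float, unit: str = 'Count'):
    # Printing this JSON to stdout lets CloudWatch extract the metric from the log line.
    print(json.dumps({
        '_aws': {
            'Timestamp': int(time.time() * 1000),  # milliseconds since epoch
            'CloudWatchMetrics': [{
                'Namespace': 'App',
                'Dimensions': [['Service']],
                'Metrics': [{'Name': name, 'Unit': unit}],
            }]
        },
        'Service': 'api',  # value for the 'Service' dimension
        name: value,  # the metric value, keyed by the metric name
    }))

Traces: Only When You Need Them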
I do not enable distributed tracing on every request. That gets expensive. I sample 1 percent of routine traffic and 100 percent of errors. To debug one specific flow, I can turn up sampling temporarily.
AWS X-Ray integrates with Lambda and API Gateway in two clicks. Honeycomb is worth it if you graduate beyond X-Ray's capabilities.
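The sampling policy described above (1 percent of routine traffic, every error) can be sketched as a small decision function. This is an illustration of the idea, not X-Ray's API: X-Ray's own sampling is configured through its sampling rules, and `should_keep_trace` and `ROUTINE_SAMPLE_RATE` are hypothetical names. The sketch assumes the decision is made after the request finishes, since only then is it known whether the request failed.

```python
import hashlib

ROUTINE_SAMPLE_RATE = 0.01  # 1 percent of routine traffic (assumed default)


def should_keep_trace(trace_id: str, is_error: bool,
                      sample_rate: float = ROUTINE_SAMPLE_RATE) -> bool:
    """Keep every error trace; keep a fixed fraction of everything else."""
    if is_error:
        return True
    # Hash the trace ID so every service makes the same decision for the
    # same request, instead of each one rolling its own random number.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


if __name__ == "__main__":
    print(should_keep_trace("req-123", is_error=True))   # errors are always kept
    print(should_keep_trace("req-123", is_error=False))  # depends on the hash bucket
```

Turning up sampling for one flow then amounts to passing a higher `sample_rate` for requests on that path and restoring the default afterwards.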
The One Dashboard
I keep a single dashboard with the five metrics I actually look at: request rate, error rate, p95 latency, business metric of the day, and cost accrued this month. If something is wrong, this dashboard tells me in five seconds.
More tools do not buy more insight. The real skill is the discipline of sizing observability to your actual needs. For platform-specific patterns, see the AWS observability best practices.