Observability for Developers: Logs, Metrics, Traces, and OpenTelemetry
Observability is the ability to understand what your system is doing from its external outputs. When something breaks in production -- and it will -- observability is the difference between fixing it in 5 minutes and debugging for 5 hours. This guide covers the three pillars (logs, metrics, traces), the tools that collect them, and how to build observable systems from the start.
The Three Pillars
| Pillar | What It Tells You | Example |
|---|---|---|
| Logs | What happened | "User 123 failed to authenticate: invalid password" |
| Metrics | How much / how often | "95th percentile response time: 450ms" |
| Traces | The journey of a request | "Request hit API gateway -> auth service -> user DB -> response (total: 230ms)" |
Each pillar answers different questions:
- Logs: What went wrong? What was the error message?
- Metrics: Is the system healthy? Are things getting worse?
- Traces: Where is the bottleneck? Which service is slow?
You need all three. Logs without metrics mean you can't detect problems until users complain. Metrics without traces mean you know something is slow but can't find where. Traces without logs mean you can see the slow request but can't understand why.
Structured Logging
Unstructured logs are almost useless at scale. You can't search, filter, or aggregate them reliably. Structured logging outputs JSON, which log aggregation tools (Grafana Loki, Datadog, ELK) can parse and query.
Unstructured vs Structured
# Unstructured -- impossible to parse reliably
[2026-02-09 14:23:01] ERROR: Failed to process order #4567 for user [email protected] - insufficient inventory for SKU-12345
# Structured -- every field is queryable
{"timestamp":"2026-02-09T14:23:01.000Z","level":"error","message":"Failed to process order","orderId":"4567","userId":"user_abc123","email":"[email protected]","sku":"SKU-12345","reason":"insufficient_inventory","service":"order-processor"}
Structured Logging with Pino (Node.js)
Pino is a very fast, low-overhead structured logger and the de facto standard for Node.js:
import pino from "pino";
// Create a logger
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }), // "info" instead of 30
  },
  // Add default fields to every log
  base: {
    service: "order-processor",
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});
// Create child loggers with request context
function requestLogger(req: Request) {
  return logger.child({
    requestId: req.headers["x-request-id"],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}
// Usage in a route handler
app.post("/orders", async (req, res) => {
  const log = requestLogger(req);
  log.info("Processing order");
  try {
    const order = await processOrder(req.body);
    log.info({ orderId: order.id, total: order.total }, "Order processed successfully");
    res.json(order);
  } catch (error) {
    log.error(
      { err: error, orderData: req.body },
      "Failed to process order"
    );
    res.status(500).json({ error: "Order processing failed" });
  }
});
Fastify Built-in Logging
Fastify uses Pino natively:
import Fastify from "fastify";
const app = Fastify({
  logger: {
    level: "info",
    transport:
      process.env.NODE_ENV === "development"
        ? { target: "pino-pretty" } // Human-readable in dev
        : undefined, // JSON in production
  },
});
app.get("/users/:id", async (request, reply) => {
  request.log.info({ userId: request.params.id }, "Fetching user");
  // Fastify automatically logs request/response with timing
});
Log Levels and When to Use Them
| Level | When to Use | Example |
|---|---|---|
| trace | Extremely detailed debugging | "Checking cache key: user_123" |
| debug | Useful during development | "Database query took 45ms" |
| info | Normal operations worth recording | "Order 123 processed successfully" |
| warn | Something unexpected but recoverable | "Rate limit approaching for user 456" |
| error | Something failed, needs attention | "Payment processing failed: timeout" |
| fatal | Application is crashing | "Database connection pool exhausted" |
Production should run at info level. Only drop to debug when actively investigating an issue. Never log at trace in production (the volume will overwhelm your log storage).
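Dropping to debug during an investigation doesn't have to mean a redeploy: Pino lets you change the level at runtime. A minimal sketch, assuming the logger and app from the earlier example -- the /admin/log-level route is a hypothetical example and should sit behind authentication:
// Flip the level at runtime, e.g. to "debug" while investigating and back to
// "info" afterwards. Child loggers created after the change (such as the
// per-request loggers above) pick up the new level.
const LEVELS = ["trace", "debug", "info", "warn", "error", "fatal"];
app.post("/admin/log-level", (req, res) => {
  const { level } = req.body;
  if (!LEVELS.includes(level)) {
    return res.status(400).json({ error: `Unknown level: ${level}` });
  }
  logger.level = level;
  res.json({ level: logger.level });
});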
What Not to Log
// NEVER log sensitive data
log.info({ password: user.password }); // NO
log.info({ token: authToken }); // NO
log.info({ creditCard: cardNumber }); // NO
log.info({ ssn: socialSecurityNumber }); // NO
// Redact sensitive fields automatically
const logger = pino({
  redact: ["password", "token", "authorization", "creditCard", "*.password"],
});
Metrics: Numbers That Tell Stories
Metrics are numerical measurements collected over time. They answer "how much" and "how often" questions and are the foundation of alerting.
Metric Types
| Type | What It Measures | Example |
|---|---|---|
| Counter | Cumulative count (only goes up) | Total HTTP requests, total errors |
| Gauge | Current value (goes up and down) | Active connections, memory usage |
| Histogram | Distribution of values | Request duration percentiles |
| Summary | Pre-calculated percentiles | Similar to histogram, calculated client-side |
Prometheus Metrics in Node.js
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from "prom-client";
const registry = new Registry();
// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({ register: registry });
// Custom metrics
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status_code"],
  registers: [registry],
});
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [registry],
});
// Middleware to instrument requests
app.use((req, res, next) => {
  const start = performance.now();
  activeConnections.inc();
  res.on("finish", () => {
    const duration = (performance.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status_code: res.statusCode,
    });
    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || req.path },
      duration
    );
    activeConnections.dec();
  });
  next();
});
// Expose metrics endpoint for Prometheus scraping
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.send(await registry.metrics());
});
Key Metrics to Track
For any web application, these metrics are essential:
# The RED method (for request-driven services)
Rate: http_requests_total
Errors: http_requests_total{status_code=~"5.."}
Duration: http_request_duration_seconds
# The USE method (for resources)
Utilization: cpu_usage_percent, memory_usage_bytes
Saturation: event_loop_lag_seconds, connection_pool_waiting
Errors: database_errors_total, cache_errors_total
Grafana Dashboard Configuration
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        },
        {
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p50"
        }
      ]
    }
  ]
}
Distributed Tracing with OpenTelemetry
When a request passes through multiple services (API gateway, auth service, database, cache, external API), logs from each service are disconnected. Distributed tracing connects them by propagating a trace ID across service boundaries.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the industry standard for distributed tracing. It's vendor-neutral, supported by every major observability platform, and provides a unified SDK for traces, metrics, and logs.
# Install OpenTelemetry packages
npm install @opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http
Automatic Instrumentation
The easiest way to start is automatic instrumentation, which patches popular libraries (Express, Fastify, pg, Redis, fetch) to generate traces without code changes:
// tracing.ts -- import this BEFORE your application code
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // OTLP HTTP endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://localhost:4318/v1/metrics",
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // Too noisy
    }),
  ],
});
sdk.start();
// Graceful shutdown
process.on("SIGTERM", () => sdk.shutdown());
# Start your app with the tracing module loaded first.
# Note: plain node cannot execute .ts files -- point these at your compiled
# output, or run everything through a TypeScript runner such as tsx.
node --require ./tracing.js src/server.js
# or
node --import ./tracing.js src/server.js # ESM
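Auto-instrumentation also propagates the trace context across outgoing HTTP calls (the default W3C propagator writes a traceparent header), which is what stitches spans from different services into one trace. For transports it doesn't cover, you can propagate the context yourself with the OpenTelemetry propagation API. A minimal sketch -- the queue client here is a hypothetical stand-in, not a real library:
import { context, propagation } from "@opentelemetry/api";
// Producer: copy the active trace context into the message headers.
// With the SDK above, this writes a `traceparent` entry into the carrier.
function publishWithTraceContext(
  queue: { publish(msg: { payload: unknown; headers: Record<string, string> }): void },
  payload: unknown
) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  queue.publish({ payload, headers });
}
// Consumer: restore the producer's context so spans started here join the same trace.
function handleMessage(msg: { payload: unknown; headers: Record<string, string> }) {
  const parentContext = propagation.extract(context.active(), msg.headers);
  return context.with(parentContext, () => {
    // Spans created inside this callback become children of the producer's span.
    // ... process msg.payload ...
  });
}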
Manual Instrumentation
For custom spans around business logic:
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderData: OrderInput) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.items_count", orderData.items.length);
      span.setAttribute("order.customer_id", orderData.customerId);
      // Each step creates a child span
      const inventory = await tracer.startActiveSpan(
        "checkInventory",
        async (childSpan) => {
          const result = await inventoryService.check(orderData.items);
          childSpan.setAttribute("inventory.all_available", result.allAvailable);
          childSpan.end();
          return result;
        }
      );
      if (!inventory.allAvailable) {
        // Throw and let the catch block record the error and end the span
        // (ending it here as well would end the span twice)
        throw new Error("Insufficient inventory");
      }
      const payment = await tracer.startActiveSpan(
        "processPayment",
        async (childSpan) => {
          const result = await paymentService.charge(orderData);
          childSpan.setAttribute("payment.amount", result.amount);
          childSpan.setAttribute("payment.method", result.method);
          childSpan.end();
          return result;
        }
      );
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      return { orderId: payment.orderId };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.end();
      throw error;
    }
  });
}
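Traces are most useful when your logs carry the same identifiers. A minimal sketch, assuming the Pino logger from earlier: a mixin stamps every log line emitted inside an active span with its trace and span IDs, so you can jump from a log line straight to the corresponding trace.
import pino from "pino";
import { trace } from "@opentelemetry/api";
const logger = pino({
  // mixin() runs on every log call; its return value is merged into the log line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});
The auto-instrumentations bundle also includes a Pino instrumentation that can inject these fields for you, so in many setups this correlation comes for free.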
Viewing Traces in Jaeger
Jaeger is a popular open-source trace visualization tool:
# docker-compose.yml for local development
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "4318:4318" # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
Open http://localhost:16686 to see traces. Each trace shows the full request journey:
[order-service] POST /orders (230ms)
├─ [order-service] processOrder (225ms)
│  ├─ [order-service] checkInventory (45ms)
│  │  └─ [inventory-service] GET /check (40ms)
│  │     └─ [postgres] SELECT (12ms)
│  ├─ [order-service] processPayment (150ms)
│  │  └─ [payment-service] POST /charge (145ms)
│  │     └─ [stripe-api] POST /v1/charges (130ms)
│  └─ [order-service] sendConfirmation (25ms)
│     └─ [email-service] POST /send (20ms)
The Observability Stack
Open-Source Stack (Self-Hosted)
| Component | Tool | Purpose |
|---|---|---|
| Trace collection | OpenTelemetry Collector | Receive, process, export telemetry |
| Trace storage/UI | Jaeger or Tempo | Store and visualize traces |
| Metrics storage | Prometheus | Time-series metrics database |
| Log aggregation | Grafana Loki | Log storage and querying |
| Dashboards | Grafana | Unified visualization |
| Alerting | Grafana Alerting or Alertmanager | Alert on thresholds |
# docker-compose.yml -- full observability stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config", "/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4318:4318" # OTLP HTTP
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
OpenTelemetry Collector Configuration
# otel-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
exporters:
  # Recent Collector releases removed the dedicated jaeger exporter;
  # export OTLP directly to Jaeger's OTLP gRPC port instead.
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Managed Alternatives
| Service | Strengths | Pricing Model |
|---|---|---|
| Datadog | Full platform, great UX | Per host + ingestion |
| Grafana Cloud | Open standards, generous free tier | Usage-based |
| Honeycomb | Best trace analysis, high cardinality | Events-based |
| New Relic | Full platform, free tier | Per-user + ingestion |
| Axiom | Simple, generous free tier | Ingestion-based |
For small teams and startups, Grafana Cloud's free tier (50GB logs, 10K metrics series, 50GB traces) is usually enough. Honeycomb is worth the price if you need deep trace analysis for debugging complex distributed systems.
Alerting That Doesn't Cause Alert Fatigue
Alert on Symptoms, Not Causes
# BAD: Alerting on a cause (noisy)
- alert: HighCPU
  expr: cpu_usage > 80
  # CPU at 80% might be fine. This alert fires constantly.

# GOOD: Alerting on a symptom (actionable)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m # Must persist for 5 minutes (not a blip)
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"

# GOOD: Alerting on user-facing impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above 2 seconds"
Severity Levels
| Severity | Response | Example |
|---|---|---|
| Critical | Page on-call immediately | Error rate > 5%, service down |
| Warning | Investigate during business hours | P95 latency > 2s, disk > 80% |
| Info | Review in daily standup | Unusual traffic spike, deployment completed |
Building Observable Applications: A Checklist
From Day One
- Structured logging with Pino or equivalent (JSON, not plaintext)
- Request ID propagation -- every log line includes a request/trace ID
- Health check endpoint -- /health returns service status and dependency checks (a minimal sketch follows this list)
- Basic metrics -- request rate, error rate, latency (RED method)
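A minimal health check sketch -- checkDatabase and checkRedis are hypothetical helpers standing in for whatever dependency pings your service actually needs:
app.get("/health", async (req, res) => {
  // Run each dependency check and record ok/failed instead of throwing
  const checks = {
    database: await checkDatabase().then(() => "ok").catch(() => "failed"),
    cache: await checkRedis().then(() => "ok").catch(() => "failed"),
  };
  const healthy = Object.values(checks).every((status) => status === "ok");
  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "degraded",
    version: process.env.APP_VERSION,
    checks,
  });
});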
Before Going to Production
- OpenTelemetry integration -- automatic instrumentation for your framework and database
- Error tracking (Sentry, Bugsnag, or similar) with source maps
- Dashboards -- at minimum, the RED metrics and resource utilization
- Alerts -- error rate, latency, and critical dependency health
At Scale
- Custom spans around critical business logic
- SLIs and SLOs -- define what "healthy" means and alert on violations
- Runbooks linked from alerts -- every alert should tell the responder what to do
- Log retention policies -- not everything needs to be kept forever
Summary
Observability isn't optional -- it's how you maintain production systems. Start with structured logging (it's the lowest effort, highest value change), add Prometheus metrics for dashboards and alerting, and integrate OpenTelemetry tracing when you have multiple services. The open-source stack (OTel + Grafana + Loki + Prometheus + Jaeger) is mature and production-ready. For managed services, Grafana Cloud offers the best free tier. Whatever you choose, instrument your code before you need it -- adding observability during an incident is the worst time to start.