Observability for Developers: Logs, Metrics, Traces, and OpenTelemetry
Observability is the ability to understand what your system is doing from its external outputs. When something breaks in production -- and it will -- observability is the difference between fixing it in 5 minutes and debugging for 5 hours. This guide covers the three pillars (logs, metrics, traces), the tools that collect them, and how to build observable systems from the start.
The Three Pillars
| Pillar | What It Tells You | Example |
|---|---|---|
| Logs | What happened | "User 123 failed to authenticate: invalid password" |
| Metrics | How much / how often | "95th percentile response time: 450ms" |
| Traces | The journey of a request | "Request hit API gateway -> auth service -> user DB -> response (total: 230ms)" |
Each pillar answers different questions:
- Logs: What went wrong? What was the error message?
- Metrics: Is the system healthy? Are things getting worse?
- Traces: Where is the bottleneck? Which service is slow?
You need all three. Logs without metrics mean you can't detect problems until users complain. Metrics without traces mean you know something is slow but can't find where. Traces without logs mean you can see the slow request but can't understand why.
Structured Logging
Unstructured logs are almost useless at scale. You can't search, filter, or aggregate them reliably. Structured logging outputs JSON, which log aggregation tools (Grafana Loki, Datadog, ELK) can parse and query.
Unstructured vs Structured
# Unstructured -- impossible to parse reliably
[2026-02-09 14:23:01] ERROR: Failed to process order #4567 for user [email protected] - insufficient inventory for SKU-12345
# Structured -- every field is queryable
{"timestamp":"2026-02-09T14:23:01.000Z","level":"error","message":"Failed to process order","orderId":"4567","userId":"user_abc123","email":"[email protected]","sku":"SKU-12345","reason":"insufficient_inventory","service":"order-processor"}
Structured Logging with Pino (Node.js)
Pino is a very fast, low-overhead structured logger and the de facto standard for Node.js:
import pino from "pino";
// Create a logger
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }), // "info" instead of 30
  },
  // Add default fields to every log
  base: {
    service: "order-processor",
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});
// Create child loggers with request context
function requestLogger(req: Request) {
  return logger.child({
    requestId: req.headers["x-request-id"],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}
// Usage in a route handler
app.post("/orders", async (req, res) => {
  const log = requestLogger(req);
  log.info("Processing order");
  try {
    const order = await processOrder(req.body);
    log.info({ orderId: order.id, total: order.total }, "Order processed successfully");
    res.json(order);
  } catch (error) {
    log.error(
      { err: error, orderData: req.body },
      "Failed to process order"
    );
    res.status(500).json({ error: "Order processing failed" });
  }
});
Fastify Built-in Logging
Fastify uses Pino natively:
import Fastify from "fastify";
const app = Fastify({
  logger: {
    level: "info",
    transport:
      process.env.NODE_ENV === "development"
        ? { target: "pino-pretty" } // Human-readable in dev
        : undefined, // JSON in production
  },
});
app.get("/users/:id", async (request, reply) => {
  request.log.info({ userId: request.params.id }, "Fetching user");
  // Fastify automatically logs request/response with timing
});
Log Levels and When to Use Them
| Level | When to Use | Example |
|---|---|---|
| trace | Extremely detailed debugging | "Checking cache key: user_123" |
| debug | Useful during development | "Database query took 45ms" |
| info | Normal operations worth recording | "Order 123 processed successfully" |
| warn | Something unexpected but recoverable | "Rate limit approaching for user 456" |
| error | Something failed, needs attention | "Payment processing failed: timeout" |
| fatal | Application is crashing | "Database connection pool exhausted" |
Production should run at info level. Only drop to debug when actively investigating an issue. Never log at trace in production (the volume will overwhelm your log storage).
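Dropping to debug during an investigation doesn't have to mean a redeploy: Pino lets you change the level at runtime. A minimal sketch, assuming the logger and app from the earlier example -- the /admin/log-level route is a hypothetical example and should sit behind authentication:
// Flip the level at runtime, e.g. to "debug" while investigating and back to
// "info" afterwards. Child loggers created after the change (such as the
// per-request loggers above) pick up the new level.
const LEVELS = ["trace", "debug", "info", "warn", "error", "fatal"];
app.post("/admin/log-level", (req, res) => {
  const { level } = req.body;
  if (!LEVELS.includes(level)) {
    return res.status(400).json({ error: `Unknown level: ${level}` });
  }
  logger.level = level;
  res.json({ level: logger.level });
});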
What Not to Log
// NEVER log sensitive data
log.info({ password: user.password }); // NO
log.info({ token: authToken }); // NO
log.info({ creditCard: cardNumber }); // NO
log.info({ ssn: socialSecurityNumber }); // NO
// Redact sensitive fields automatically
const logger = pino({
  redact: ["password", "token", "authorization", "creditCard", "*.password"],
});
Metrics: Numbers That Tell Stories
Metrics are numerical measurements collected over time. They answer "how much" and "how often" questions and are the foundation of alerting.
Metric Types
| Type | What It Measures | Example |
|---|---|---|
| Counter | Cumulative count (only goes up) | Total HTTP requests, total errors |
| Gauge | Current value (goes up and down) | Active connections, memory usage |
| Histogram | Distribution of values | Request duration percentiles |
| Summary | Pre-calculated percentiles | Similar to histogram, calculated client-side |
Prometheus Metrics in Node.js
import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from "prom-client";
const registry = new Registry();
// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({ register: registry });
// Custom metrics
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status_code"],
  registers: [registry],
});
const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [registry],
});
// Middleware to instrument requests
app.use((req, res, next) => {
  const start = performance.now();
  activeConnections.inc();
  res.on("finish", () => {
    const duration = (performance.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status_code: res.statusCode,
    });
    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || req.path },
      duration
    );
    activeConnections.dec();
  });
  next();
});
// Expose metrics endpoint for Prometheus scraping
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.send(await registry.metrics());
});
Key Metrics to Track
For any web application, these metrics are essential:
# The RED method (for request-driven services)
Rate: http_requests_total
Errors: http_requests_total{status_code=~"5.."}
Duration: http_request_duration_seconds
# The USE method (for resources)
Utilization: cpu_usage_percent, memory_usage_bytes
Saturation: event_loop_lag_seconds, connection_pool_waiting
Errors: database_errors_total, cache_errors_total
Grafana Dashboard Configuration
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        },
        {
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p50"
        }
      ]
    }
  ]
}
Distributed Tracing with OpenTelemetry
When a request passes through multiple services (API gateway, auth service, database, cache, external API), logs from each service are disconnected. Distributed tracing connects them by propagating a trace ID across service boundaries.
OpenTelemetry: The Standard
OpenTelemetry (OTel) is the industry standard for distributed tracing. It's vendor-neutral, supported by every major observability platform, and provides a unified SDK for traces, metrics, and logs.
# Install OpenTelemetry packages
npm install @opentelemetry/api \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http
Automatic Instrumentation
The easiest way to start is automatic instrumentation, which patches popular libraries (Express, Fastify, pg, Redis, fetch) to generate traces without code changes:
// tracing.ts -- import this BEFORE your application code
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // OTLP HTTP endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://localhost:4318/v1/metrics",
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // Too noisy
    }),
  ],
});
sdk.start();
// Graceful shutdown
process.on("SIGTERM", () => sdk.shutdown());
# Start your app with the tracing module loaded first.
# Note: plain node cannot execute .ts files -- point these at your compiled
# output, or run everything through a TypeScript runner such as tsx.
node --require ./tracing.js src/server.js
# or
node --import ./tracing.js src/server.js # ESM
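Auto-instrumentation also propagates the trace context across outgoing HTTP calls (the default W3C propagator writes a traceparent header), which is what stitches spans from different services into one trace. For transports it doesn't cover, you can propagate the context yourself with the OpenTelemetry propagation API. A minimal sketch -- the queue client here is a hypothetical stand-in, not a real library:
import { context, propagation } from "@opentelemetry/api";
// Producer: copy the active trace context into the message headers.
// With the SDK above, this writes a `traceparent` entry into the carrier.
function publishWithTraceContext(
  queue: { publish(msg: { payload: unknown; headers: Record<string, string> }): void },
  payload: unknown
) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  queue.publish({ payload, headers });
}
// Consumer: restore the producer's context so spans started here join the same trace.
function handleMessage(msg: { payload: unknown; headers: Record<string, string> }) {
  const parentContext = propagation.extract(context.active(), msg.headers);
  return context.with(parentContext, () => {
    // Spans created inside this callback become children of the producer's span.
    // ... process msg.payload ...
  });
}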
Manual Instrumentation
For custom spans around business logic:
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderData: OrderInput) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.items_count", orderData.items.length);
      span.setAttribute("order.customer_id", orderData.customerId);
      // Each step creates a child span
      const inventory = await tracer.startActiveSpan(
        "checkInventory",
        async (childSpan) => {
          const result = await inventoryService.check(orderData.items);
          childSpan.setAttribute("inventory.all_available", result.allAvailable);
          childSpan.end();
          return result;
        }
      );
      if (!inventory.allAvailable) {
        // Throw and let the catch block record the error and end the span
        // (ending it here as well would end the span twice)
        throw new Error("Insufficient inventory");
      }
      const payment = await tracer.startActiveSpan(
        "processPayment",
        async (childSpan) => {
          const result = await paymentService.charge(orderData);
          childSpan.setAttribute("payment.amount", result.amount);
          childSpan.setAttribute("payment.method", result.method);
          childSpan.end();
          return result;
        }
      );
      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      return { orderId: payment.orderId };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.end();
      throw error;
    }
  });
}
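Traces are most useful when your logs carry the same identifiers. A minimal sketch, assuming the Pino logger from earlier: a mixin stamps every log line emitted inside an active span with its trace and span IDs, so you can jump from a log line straight to the corresponding trace.
import pino from "pino";
import { trace } from "@opentelemetry/api";
const logger = pino({
  // mixin() runs on every log call; its return value is merged into the log line
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});
The auto-instrumentations bundle also includes a Pino instrumentation that can inject these fields for you, so in many setups this correlation comes for free.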
Viewing Traces in Jaeger
Jaeger is a popular open-source trace visualization tool:
# docker-compose.yml for local development
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "4318:4318" # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
Open http://localhost:16686 to see traces. Each trace shows the full request journey:
[order-service] POST /orders (230ms)
├─ [order-service] processOrder (225ms)
│  ├─ [order-service] checkInventory (45ms)
│  │  └─ [inventory-service] GET /check (40ms)
│  │     └─ [postgres] SELECT (12ms)
│  ├─ [order-service] processPayment (150ms)
│  │  └─ [payment-service] POST /charge (145ms)
│  │     └─ [stripe-api] POST /v1/charges (130ms)
│  └─ [order-service] sendConfirmation (25ms)
│     └─ [email-service] POST /send (20ms)
The Observability Stack
Open-Source Stack (Self-Hosted)
| Component | Tool | Purpose |
|---|---|---|
| Trace collection | OpenTelemetry Collector | Receive, process, export telemetry |
| Trace storage/UI | Jaeger or Tempo | Store and visualize traces |
| Metrics storage | Prometheus | Time-series metrics database |
| Log aggregation | Grafana Loki | Log storage and querying |
| Dashboards | Grafana | Unified visualization |
| Alerting | Grafana Alerting or Alertmanager | Alert on thresholds |
# docker-compose.yml -- full observability stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config", "/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4318:4318" # OTLP HTTP
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
OpenTelemetry Collector Configuration
# otel-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
exporters:
  # Recent Collector releases removed the dedicated jaeger exporter;
  # export OTLP directly to Jaeger's OTLP gRPC port instead.
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Managed Alternatives
| Service | Strengths | Pricing Model |
|---|---|---|
| Datadog | Full platform, great UX | Per host + ingestion |
| Grafana Cloud | Open standards, generous free tier | Usage-based |
| Honeycomb | Best trace analysis, high cardinality | Events-based |
| New Relic | Full platform, free tier | Per-user + ingestion |
| Axiom | Simple, generous free tier | Ingestion-based |
For small teams and startups, Grafana Cloud's free tier (50GB logs, 10K metrics series, 50GB traces) is usually enough. Honeycomb is worth the price if you need deep trace analysis for debugging complex distributed systems.
Alerting That Doesn't Cause Alert Fatigue
Alert on Symptoms, Not Causes
# BAD: Alerting on a cause (noisy)
- alert: HighCPU
  expr: cpu_usage > 80
  # CPU at 80% might be fine. This alert fires constantly.

# GOOD: Alerting on a symptom (actionable)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m # Must persist for 5 minutes (not a blip)
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"

# GOOD: Alerting on user-facing impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above 2 seconds"
Severity Levels
| Severity | Response | Example |
|---|---|---|
| Critical | Page on-call immediately | Error rate > 5%, service down |
| Warning | Investigate during business hours | P95 latency > 2s, disk > 80% |
| Info | Review in daily standup | Unusual traffic spike, deployment completed |
Building Observable Applications: A Checklist
From Day One
- Structured logging with Pino or equivalent (JSON, not plaintext)
- Request ID propagation -- every log line includes a request/trace ID
- Health check endpoint -- /health returns service status and dependency checks (a minimal sketch follows this list)
- Basic metrics -- request rate, error rate, latency (RED method)
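A minimal health check sketch -- checkDatabase and checkRedis are hypothetical helpers standing in for whatever dependency pings your service actually needs:
app.get("/health", async (req, res) => {
  // Run each dependency check and record ok/failed instead of throwing
  const checks = {
    database: await checkDatabase().then(() => "ok").catch(() => "failed"),
    cache: await checkRedis().then(() => "ok").catch(() => "failed"),
  };
  const healthy = Object.values(checks).every((status) => status === "ok");
  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "degraded",
    version: process.env.APP_VERSION,
    checks,
  });
});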
Before Going to Production
- OpenTelemetry integration -- automatic instrumentation for your framework and database
- Error tracking (Sentry, Bugsnag, or similar) with source maps
- Dashboards -- at minimum, the RED metrics and resource utilization
- Alerts -- error rate, latency, and critical dependency health
At Scale
- Custom spans around critical business logic
- SLIs and SLOs -- define what "healthy" means and alert on violations
- Runbooks linked from alerts -- every alert should tell the responder what to do
- Log retention policies -- not everything needs to be kept forever
Summary
Observability isn't optional -- it's how you maintain production systems. Start with structured logging (it's the lowest effort, highest value change), add Prometheus metrics for dashboards and alerting, and integrate OpenTelemetry tracing when you have multiple services. The open-source stack (OTel + Grafana + Loki + Prometheus + Jaeger) is mature and production-ready. For managed services, Grafana Cloud offers the best free tier. Whatever you choose, instrument your code before you need it -- adding observability during an incident is the worst time to start.