
Observability for Developers: Logs, Metrics, Traces, and OpenTelemetry

DevOps · 2026-02-09 · 9 min read · observability · opentelemetry · logging · tracing · metrics · grafana · jaeger · devops

Observability is the ability to understand what your system is doing from its external outputs. When something breaks in production -- and it will -- observability is the difference between fixing it in 5 minutes and debugging for 5 hours. This guide covers the three pillars (logs, metrics, traces), the tools that collect them, and how to build observable systems from the start.

The Three Pillars

| Pillar | What It Tells You | Example |
| --- | --- | --- |
| Logs | What happened | "User 123 failed to authenticate: invalid password" |
| Metrics | How much / how often | "95th percentile response time: 450ms" |
| Traces | The journey of a request | "Request hit API gateway -> auth service -> user DB -> response (total: 230ms)" |

Each pillar answers a different kind of question, and you need all three. Logs without metrics mean you can't detect problems until users complain. Metrics without traces mean you know something is slow but can't find where. Traces without logs mean you can see the slow request but can't understand why.

Structured Logging

Unstructured logs are almost useless at scale. You can't search, filter, or aggregate them reliably. Structured logging outputs JSON, which log aggregation tools (Grafana Loki, Datadog, ELK) can parse and query.

Unstructured vs Structured

# Unstructured -- impossible to parse reliably
[2026-02-09 14:23:01] ERROR: Failed to process order #4567 for user [email protected] - insufficient inventory for SKU-12345

# Structured -- every field is queryable
{"timestamp":"2026-02-09T14:23:01.000Z","level":"error","message":"Failed to process order","orderId":"4567","userId":"user_abc123","email":"[email protected]","sku":"SKU-12345","reason":"insufficient_inventory","service":"order-processor"}

Structured Logging with Pino (Node.js)

Pino is one of the fastest structured loggers for Node.js:

import pino from "pino";

// Create a logger
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }), // "info" instead of 30
  },
  // Add default fields to every log
  base: {
    service: "order-processor",
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});

// Create child loggers with request context
function requestLogger(req: Request) {
  return logger.child({
    requestId: req.headers["x-request-id"],
    userId: req.user?.id,
    method: req.method,
    path: req.url,
  });
}

// Usage in a route handler
app.post("/orders", async (req, res) => {
  const log = requestLogger(req);
  log.info("Processing order");

  try {
    const order = await processOrder(req.body);
    log.info({ orderId: order.id, total: order.total }, "Order processed successfully");
    res.json(order);
  } catch (error) {
    log.error(
      { err: error, orderData: req.body },
      "Failed to process order"
    );
    res.status(500).json({ error: "Order processing failed" });
  }
});

Fastify Built-in Logging

Fastify uses Pino natively:

import Fastify from "fastify";

const app = Fastify({
  logger: {
    level: "info",
    transport:
      process.env.NODE_ENV === "development"
        ? { target: "pino-pretty" } // Human-readable in dev
        : undefined, // JSON in production
  },
});

app.get("/users/:id", async (request, reply) => {
  request.log.info({ userId: request.params.id }, "Fetching user");
  // Fastify automatically logs request/response with timing
});

Log Levels and When to Use Them

| Level | When to Use | Example |
| --- | --- | --- |
| trace | Extremely detailed debugging | "Checking cache key: user_123" |
| debug | Useful during development | "Database query took 45ms" |
| info | Normal operations worth recording | "Order 123 processed successfully" |
| warn | Something unexpected but recoverable | "Rate limit approaching for user 456" |
| error | Something failed, needs attention | "Payment processing failed: timeout" |
| fatal | Application is crashing | "Database connection pool exhausted" |

Production should run at info level. Only drop to debug when actively investigating an issue. Never log at trace in production (the volume will overwhelm your log storage).
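
A small quality-of-life trick is to make the level adjustable at runtime, so you can drop to debug during an investigation without redeploying. A minimal sketch, assuming the Pino setup from above (the choice of SIGUSR2 as the trigger is arbitrary):

import pino from "pino";

const logger = pino({ level: process.env.LOG_LEVEL || "info" });

// Flip between info and debug on demand, e.g. `kill -USR2 <pid>`
process.on("SIGUSR2", () => {
  logger.level = logger.level === "debug" ? "info" : "debug";
  logger.info({ newLevel: logger.level }, "Log level changed");
});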

What Not to Log

// NEVER log sensitive data
log.info({ password: user.password }); // NO
log.info({ token: authToken }); // NO
log.info({ creditCard: cardNumber }); // NO
log.info({ ssn: socialSecurityNumber }); // NO

// Redact sensitive fields automatically
const logger = pino({
  redact: ["password", "token", "authorization", "creditCard", "*.password"],
});

Metrics: Numbers That Tell Stories

Metrics are numerical measurements collected over time. They answer "how much" and "how often" questions and are the foundation of alerting.

Metric Types

| Type | What It Measures | Example |
| --- | --- | --- |
| Counter | Cumulative count (only goes up) | Total HTTP requests, total errors |
| Gauge | Current value (goes up and down) | Active connections, memory usage |
| Histogram | Distribution of values | Request duration percentiles |
| Summary | Pre-calculated percentiles | Similar to histogram, calculated client-side |

Prometheus Metrics in Node.js

import { Registry, Counter, Histogram, Gauge, collectDefaultMetrics } from "prom-client";

const registry = new Registry();

// Collect default Node.js metrics (CPU, memory, event loop, etc.)
collectDefaultMetrics({ register: registry });

// Custom metrics
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status_code"],
  registers: [registry],
});

const httpRequestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [registry],
});

const activeConnections = new Gauge({
  name: "active_connections",
  help: "Number of active connections",
  registers: [registry],
});

// Middleware to instrument requests
app.use((req, res, next) => {
  const start = performance.now();
  activeConnections.inc();

  res.on("finish", () => {
    const duration = (performance.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status_code: res.statusCode,
    });
    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || req.path },
      duration
    );
    activeConnections.dec();
  });

  next();
});

// Expose metrics endpoint for Prometheus scraping
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.send(await registry.metrics());
});

Key Metrics to Track

For any web application, these metrics are essential:

# The RED method (for request-driven services)
Rate:     http_requests_total
Errors:   http_requests_total{status_code=~"5.."}
Duration: http_request_duration_seconds

# The USE method (for resources)
Utilization: cpu_usage_percent, memory_usage_bytes
Saturation:  event_loop_lag_seconds, connection_pool_waiting
Errors:      database_errors_total, cache_errors_total
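
Most of the RED numbers come straight from the HTTP middleware above; USE-style saturation signals usually need their own collectors. As a sketch of one of them -- reusing the registry from the earlier snippet and the event_loop_lag_seconds name from the block above (collectDefaultMetrics already exports a similar gauge, this just shows the mechanics):

import { monitorEventLoopDelay } from "perf_hooks";
import { Gauge } from "prom-client";

// Samples event loop delay in the background; values are in nanoseconds
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

const eventLoopLag = new Gauge({
  name: "event_loop_lag_seconds",
  help: "Approximate p99 event loop lag over the last scrape interval",
  registers: [registry],
  collect() {
    // prom-client calls this on every scrape of /metrics
    this.set(loopDelay.percentile(99) / 1e9);
    loopDelay.reset();
  },
});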

Grafana Dashboard Configuration

{
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(http_requests_total[5m])",
          "legendFormat": "{{method}} {{path}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        },
        {
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p50"
        }
      ]
    }
  ]
}

Distributed Tracing with OpenTelemetry

When a request passes through multiple services (API gateway, auth service, database, cache, external API), logs from each service are disconnected. Distributed tracing connects them by propagating a trace ID across service boundaries.
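
In practice the trace ID travels in the W3C traceparent header. The automatic instrumentation covered below injects it into outgoing HTTP calls for you, but if you ever build a request by hand, the OpenTelemetry propagation API does the same job -- a minimal sketch (the inventory-service URL is made up):

import { context, propagation } from "@opentelemetry/api";

async function callInventoryService(items: string[]) {
  const headers: Record<string, string> = {};
  // Writes the active span's context into the carrier as a traceparent header
  propagation.inject(context.active(), headers);

  return fetch("http://inventory-service/check", {
    method: "POST",
    headers: { "content-type": "application/json", ...headers },
    body: JSON.stringify({ items }),
  });
}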

OpenTelemetry: The Standard

OpenTelemetry (OTel) is the industry standard for distributed tracing. It's vendor-neutral, supported by every major observability platform, and provides a unified SDK for traces, metrics, and logs.

# Install OpenTelemetry packages
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

Automatic Instrumentation

The easiest way to start is automatic instrumentation, which patches popular libraries (Express, Fastify, pg, Redis, fetch) to generate traces without code changes:

// tracing.ts -- import this BEFORE your application code
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";

const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // OTLP HTTP endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: "http://localhost:4318/v1/metrics",
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on("SIGTERM", () => sdk.shutdown());

# Start your app with tracing loaded first. Compile to JS first or use a
# TypeScript runner such as tsx -- most Node versions can't load .ts files
# directly. The dist/ paths below assume compiled output.
node --require ./dist/tracing.js dist/server.js
# or, for ESM projects
node --import ./dist/tracing.js dist/server.js
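
Before pointing the exporter at a real collector, it can help to confirm spans are actually being produced. One quick check -- a sketch that swaps the OTLP exporter for a console one -- is to print spans to stdout:

// tracing-local.ts -- same idea as tracing.ts, but spans go to stdout
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ConsoleSpanExporter } from "@opentelemetry/sdk-trace-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "order-service",
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();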

Manual Instrumentation

For custom spans around business logic:

import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function processOrder(orderData: OrderInput) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.items_count", orderData.items.length);
      span.setAttribute("order.customer_id", orderData.customerId);

      // Each step creates a child span
      const inventory = await tracer.startActiveSpan(
        "checkInventory",
        async (childSpan) => {
          const result = await inventoryService.check(orderData.items);
          childSpan.setAttribute("inventory.all_available", result.allAvailable);
          childSpan.end();
          return result;
        }
      );

      if (!inventory.allAvailable) {
        // Throw and let the catch block below record the exception, set the
        // error status, and end the span exactly once
        throw new Error("Insufficient inventory");
      }

      const payment = await tracer.startActiveSpan(
        "processPayment",
        async (childSpan) => {
          const result = await paymentService.charge(orderData);
          childSpan.setAttribute("payment.amount", result.amount);
          childSpan.setAttribute("payment.method", result.method);
          childSpan.end();
          return result;
        }
      );

      span.setStatus({ code: SpanStatusCode.OK });
      span.end();
      return { orderId: payment.orderId };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      span.end();
      throw error;
    }
  });
}
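
One gotcha with manual instrumentation: if checkInventory or processPayment throws, the child span above is never ended, and every code path has to remember to end the parent. A small helper -- a sketch, not part of the OpenTelemetry API -- keeps that bookkeeping in one place:

import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

// Runs fn inside a new active span and guarantees the span is ended exactly
// once, with any exception recorded, regardless of which path the code takes
async function withSpan<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}

// Usage:
// const inventory = await withSpan("checkInventory", () => inventoryService.check(orderData.items));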

Viewing Traces in Jaeger

Jaeger is a popular open-source trace visualization tool:

# docker-compose.yml for local development
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

Open http://localhost:16686 to see traces. Each trace shows the full request journey:

[order-service] POST /orders                          (230ms)
  ├─ [order-service] processOrder                     (225ms)
  │   ├─ [order-service] checkInventory               (45ms)
  │   │   └─ [inventory-service] GET /check            (40ms)
  │   │       └─ [postgres] SELECT                     (12ms)
  │   ├─ [order-service] processPayment               (150ms)
  │   │   └─ [payment-service] POST /charge            (145ms)
  │   │       └─ [stripe-api] POST /v1/charges         (130ms)
  │   └─ [order-service] sendConfirmation              (25ms)
  │       └─ [email-service] POST /send                (20ms)

The Observability Stack

Open-Source Stack (Self-Hosted)

| Component | Tool | Purpose |
| --- | --- | --- |
| Trace collection | OpenTelemetry Collector | Receive, process, export telemetry |
| Trace storage/UI | Jaeger or Tempo | Store and visualize traces |
| Metrics storage | Prometheus | Time-series metrics database |
| Log aggregation | Grafana Loki | Log storage and querying |
| Dashboards | Grafana | Unified visualization |
| Alerting | Grafana Alerting or Alertmanager | Alert on thresholds |

# docker-compose.yml -- full observability stack
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config", "/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4318:4318"     # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
    volumes:
      - ./grafana/datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml

OpenTelemetry Collector Configuration

# otel-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  # Recent collector-contrib releases drop the dedicated jaeger and loki
  # exporters; both backends accept OTLP directly instead
  otlp/jaeger:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlphttp/loki:
    endpoint: "http://loki:3100/otlp"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]

Managed Alternatives

| Service | Strengths | Pricing Model |
| --- | --- | --- |
| Datadog | Full platform, great UX | Per host + ingestion |
| Grafana Cloud | Open standards, generous free tier | Usage-based |
| Honeycomb | Best trace analysis, high cardinality | Event-based |
| New Relic | Full platform, free tier | Per-user + ingestion |
| Axiom | Simple, generous free tier | Ingestion-based |

For small teams and startups, Grafana Cloud's free tier (50GB logs, 10K metrics series, 50GB traces) is usually enough. Honeycomb is worth the price if you need deep trace analysis for debugging complex distributed systems.

Alerting That Doesn't Cause Alert Fatigue

Alert on Symptoms, Not Causes

# BAD: Alerting on a cause (noisy)
- alert: HighCPU
  expr: cpu_usage > 80
  # CPU at 80% might be fine. This alert fires constantly.

# GOOD: Alerting on a symptom (actionable)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m  # Must persist for 5 minutes (not a blip)
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 1% for 5 minutes"

# GOOD: Alerting on user-facing impact
- alert: HighLatency
  expr: |
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above 2 seconds"

Severity Levels

| Severity | Response | Example |
| --- | --- | --- |
| Critical | Page on-call immediately | Error rate > 5%, service down |
| Warning | Investigate during business hours | P95 latency > 2s, disk > 80% |
| Info | Review in daily standup | Unusual traffic spike, deployment completed |

Building Observable Applications: A Checklist

From Day One

  1. Structured logging with Pino or equivalent (JSON, not plaintext)
  2. Request ID propagation -- every log line includes a request/trace ID (see the sketch after this list)
  3. Health check endpoint -- /health returns service status and dependency checks
  4. Basic metrics -- request rate, error rate, latency (RED method)
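
A minimal sketch of item 2, assuming Express, the Pino logger from earlier, and the OpenTelemetry API (app, logger, and the x-request-id header convention all come from the snippets above):

import { randomUUID } from "crypto";
import { trace } from "@opentelemetry/api";

// Give every request an ID, echo it back to the client, and attach a child
// logger so every log line carries the request ID and (when tracing is on)
// the current trace ID. Typing req.log needs a declaration merge, omitted here.
app.use((req, res, next) => {
  const requestId = (req.headers["x-request-id"] as string) || randomUUID();
  res.setHeader("x-request-id", requestId);

  (req as any).log = logger.child({
    requestId,
    traceId: trace.getActiveSpan()?.spanContext().traceId,
  });
  next();
});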

Before Going to Production

  1. OpenTelemetry integration -- automatic instrumentation for your framework and database
  2. Error tracking (Sentry, Bugsnag, or similar) with source maps
  3. Dashboards -- at minimum, the RED metrics and resource utilization
  4. Alerts -- error rate, latency, and critical dependency health

At Scale

  1. Custom spans around critical business logic
  2. SLIs and SLOs -- define what "healthy" means and alert on violations
  3. Runbooks linked from alerts -- every alert should tell the responder what to do
  4. Log retention policies -- not everything needs to be kept forever

Summary

Observability isn't optional -- it's how you maintain production systems. Start with structured logging (it's the lowest effort, highest value change), add Prometheus metrics for dashboards and alerting, and integrate OpenTelemetry tracing when you have multiple services. The open-source stack (OTel + Grafana + Loki + Prometheus + Jaeger) is mature and production-ready. For managed services, Grafana Cloud offers the best free tier. Whatever you choose, instrument your code before you need it -- adding observability during an incident is the worst time to start.