
Log Management Tools: ELK Stack, Loki, and Cloud Alternatives

DevOps · 2026-02-09 · 10 min read · logging · elk · loki · datadog · log-management · observability

Logs are the most expensive part of most observability stacks. Not because the tools are expensive (though they can be), but because logs are verbose by nature. Every HTTP request, every database query, every error, every debug statement -- they all generate log lines. At scale, you are storing and indexing terabytes of text data, most of which nobody will ever read.

The challenge of log management is not "how do I collect logs" -- that part is straightforward. The challenge is "how do I make logs searchable without going bankrupt, retain them long enough to be useful, and actually find the needle in the haystack when something breaks at 3 AM."

This guide compares the major log management solutions and covers the practices that matter more than any specific tool.

Structured Logging: The Foundation

Before choosing a log management tool, fix your logging. Unstructured log lines are the single biggest source of log management pain:

# Bad: unstructured log line
[2026-02-09 14:32:01] ERROR: Failed to process order 12345 for user user@example.com - connection timeout after 30s

# Good: structured log (JSON)
{"timestamp":"2026-02-09T14:32:01.234Z","level":"error","message":"Failed to process order","order_id":"12345","user_email":"[email protected]","error":"connection timeout","timeout_seconds":30,"service":"order-processor","trace_id":"abc123"}

Structured logs are machine-parseable. You can filter by order_id, aggregate by error type, correlate by trace_id, and alert on level: error without writing regex. Every log management tool works better with structured logs.

Implementing Structured Logging

In Node.js/TypeScript, use pino (fastest) or winston (most popular):

import express from "express";
import pino from "pino";

const app = express();
app.use(express.json());

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Use child loggers for request context
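// (TypeScript note: req.log is not on Express's Request type by default -- add a small declaration merge, or use the pino-http middleware, which attaches req.log for you)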
app.use((req, res, next) => {
  req.log = logger.child({
    request_id: req.headers["x-request-id"],
    method: req.method,
    path: req.path,
    user_id: req.user?.id,
  });
  next();
});

// Structured logging in handlers
app.post("/orders", async (req, res) => {
  req.log.info({ order_total: req.body.total }, "Processing order");

  try {
    const order = await processOrder(req.body);
    req.log.info({ order_id: order.id }, "Order processed successfully");
    res.json(order);
  } catch (err) {
    req.log.error({ err, order_data: req.body }, "Order processing failed");
    res.status(500).json({ error: "Internal server error" });
  }
});

In Python, use structlog:

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

def process_order(order_id: str, user_id: str):
    log = logger.bind(order_id=order_id, user_id=user_id)
    log.info("processing_order")

    try:
        result = do_process(order_id)
        log.info("order_processed", total=result.total)
    except Exception as e:
        log.error("order_failed", error=str(e))
        raise

Log Levels: Use Them Consistently

Define what each level means for your team and enforce it. A common convention:

DEBUG: detailed diagnostic output, disabled in production by default.
INFO: normal business events worth recording (order placed, job completed, service started).
WARN: something unexpected happened, but the system recovered on its own.
ERROR: an operation failed and a human should look at it.
FATAL: the service cannot continue and is shutting down.

A common mistake is logging too much at INFO level. If your INFO logs generate more than a few hundred lines per minute per service, you are probably logging things that should be DEBUG.
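
As a sketch of how that plays out with the pino setup from earlier (the field names are illustrative):

// Verbose per-request detail stays at debug, so production (LOG_LEVEL=info) never pays for it
req.log.debug({ headers: req.headers, body: req.body }, "Incoming request");

// Business events worth keeping go to info
req.log.info({ order_id: order.id, total: order.total }, "Order created");

// A recovered problem is a warning, not an error
req.log.warn({ attempt: 2 }, "Payment provider timed out, retrying");

// Only genuine failures are errors
req.log.error({ err }, "Order processing failed after retries");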

The ELK Stack

ELK -- Elasticsearch, Logstash, Kibana -- is the classic self-hosted log management stack. Elasticsearch indexes and stores logs. Logstash (or Filebeat/Fluentd) collects and ships them. Kibana provides the search UI and dashboards.

Architecture

Application -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                  |                        |
              (collection)            (storage + index)

In modern ELK deployments, Filebeat replaces Logstash for log collection (it is lighter and more reliable), and Logstash is used only when you need complex log transformations.
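
When you do need Logstash, the transformation lives in a filter block. A minimal sketch that parses the unstructured line format from the earlier example (the grok pattern is an assumption about your format):

# logstash.conf -- only needed for transformations Filebeat cannot do
filter {
  grok {
    match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{LOGLEVEL:level}: %{GREEDYDATA:msg}" }
  }
  date {
    match => ["timestamp", "yyyy-MM-dd HH:mm:ss"]
  }
}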

Setup with Docker Compose

version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - es-data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    volumes:
      - ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - elasticsearch

volumes:
  es-data:

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - decode_json_fields:
          fields: ["message"]
          target: ""
          overwrite_keys: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  indices:
    - index: "logs-%{+yyyy.MM.dd}"

Strengths

Full-text search over every field, which makes Elasticsearch the strongest option for arbitrary "find this string anywhere" investigations. Kibana is a mature UI for search, dashboards, and alerting, the Beats and ingest-pipeline ecosystem is huge, and everything can be self-hosted so logs never leave your infrastructure.

Weaknesses

Elasticsearch is operationally demanding: cluster sizing, shard management, JVM heap tuning, and upgrades all take real platform-engineering time. Because every log line is fully indexed, storage and memory costs grow quickly with volume, and an under-provisioned cluster degrades badly during ingest spikes.

Cost Management Tips

ELK costs are driven by storage and indexing. Reduce them with index lifecycle management (ILM) policies that roll indices through hot and warm tiers and delete old data, fewer replicas on log indices, and mappings that skip full-text indexing for fields you only ever filter on. For example:

// ILM policy: hot for 7 days, warm for 23 days, delete after 30 days
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
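
For the policy to do anything, it has to be created in Elasticsearch and referenced from the index template that matches your log indices. A minimal sketch in Kibana Dev Tools syntax, assuming the policy is named logs-30d (the rollover action also needs a write alias or a data stream; with plain date-based indices you can drop rollover and keep only the age-based phases):

# Create the policy, using the JSON above as the request body
PUT _ilm/policy/logs-30d
{ "policy": { "phases": { "hot": { ... }, "warm": { ... }, "delete": { ... } } } }

# Reference it from the index template that matches the log indices
PUT _index_template/logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": { "index.lifecycle.name": "logs-30d" }
  }
}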

Grafana Loki

Loki takes a fundamentally different approach to log management. Where Elasticsearch indexes the full content of every log line, Loki indexes only the metadata (labels) and stores log content as compressed chunks. This makes it dramatically cheaper to run but with different query trade-offs.

The Key Insight

Loki's philosophy is "like Prometheus, but for logs." Instead of full-text indexing, it uses labels to identify log streams:

{service="api-server", environment="production", level="error"}

When you query Loki, it first narrows down to the relevant log streams using labels, then does a brute-force search through the compressed log content. This is slower than Elasticsearch for arbitrary text search but fast enough for most operational use cases -- and vastly cheaper.

Setup

Loki integrates naturally with the Grafana stack:

# docker-compose.yml
version: "3.8"
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - loki-data:/loki
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml

  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin

volumes:
  loki-data:

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d

compactor:
  working_directory: /loki/retention
  retention_enabled: true
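
The promtail-config.yaml referenced in the Compose file is not shown above; a minimal sketch that tails /var/log and pushes to Loki (the label names are illustrative):

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          environment: production
          __path__: /var/log/*.log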

LogQL: Loki's Query Language

LogQL looks like PromQL with log-specific extensions:

# Find error logs from the API server
{service="api-server"} |= "error"

# Parse JSON logs and filter by field
{service="api-server"} | json | level="error" | order_id != ""

# Count errors per service over the last hour
sum by (service) (count_over_time({level="error"}[1h]))

# Find slow requests (parse duration from structured logs)
{service="api-server"} | json | duration > 5s

# Top 10 error messages over the last hour
topk(10, sum by (message) (
  count_over_time({service="api-server"} | json | level="error" [1h])))

# Pattern matching for unstructured logs
{service="legacy-app"} |~ "timeout|connection refused|ECONNRESET"

Strengths

Storage is cheap because only labels are indexed; log content is compressed into chunks on local disk or object storage. Loki is simple to operate for what it does, sits naturally next to Prometheus and Grafana, and LogQL is immediately familiar to anyone who writes PromQL.

Weaknesses

Arbitrary full-text searches over long time ranges are slow, because Loki has to decompress and scan chunks rather than hit an index. It also demands label discipline: high-cardinality labels (user IDs, request IDs) blow up the index and defeat the design, so that context belongs in the log line itself.

Datadog Logs

Datadog is the dominant SaaS observability platform. Datadog Logs provides log collection, indexing, search, and alerting as a managed service with deep integration into the rest of the Datadog platform (APM, metrics, infrastructure monitoring).

Setup

Datadog uses an agent for log collection:

# datadog-agent config: /etc/datadog-agent/datadog.yaml
api_key: YOUR_API_KEY
logs_enabled: true

# /etc/datadog-agent/conf.d/myapp.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: myapp
    source: nodejs
    sourcecategory: application

Or pipe logs directly from your application:

import { createLogger, format, transports } from "winston";

const logger = createLogger({
  format: format.json(),
  transports: [
    new transports.Http({
      host: "http-intake.logs.datadoghq.com",
      path: `/api/v2/logs?dd-api-key=${process.env.DD_API_KEY}`,
      ssl: true,
    }),
  ],
});

Strengths

Nothing to run yourself, a polished search and Live Tail experience, and tight correlation with Datadog APM, metrics, and infrastructure monitoring: you can pivot from a trace to its logs and back in one click. Parsing pipelines, facets, and alerting are all built in.

Weaknesses

Cost. Per-GB ingestion plus per-million-event indexing adds up quickly at volume, and an accidental debug-log flood becomes an unpleasant bill. It also locks you in: queries, dashboards, and pipelines do not port to another vendor.

Cost Management

Datadog's pricing model rewards careful log management: ingestion and indexing are billed separately, so the standard approach is to ingest everything, index only what you actually query, archive the rest to cheap object storage, use exclusion filters to drop noisy patterns (health checks, debug logs), and keep index retention short.
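
One cheap lever is dropping noise before it ever leaves the host. A sketch using the agent's log processing rules, extending the conf.yaml shown above (the pattern assumes the JSON log format from earlier):

# /etc/datadog-agent/conf.d/myapp.d/conf.yaml
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: myapp
    source: nodejs
    log_processing_rules:
      - type: exclude_at_match
        name: drop_debug_lines
        pattern: '"level":"debug"'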

Cloud-Native Options

AWS CloudWatch Logs

CloudWatch Logs is the default for AWS workloads. Lambda, ECS, EKS, and EC2 all ship logs to CloudWatch with minimal configuration.

Pros: Zero setup for AWS services. Integrated with CloudWatch Alarms and dashboards. Logs Insights provides a SQL-like query language. Pay per GB ingested ($0.50/GB) and stored ($0.03/GB/month).

Cons: The query language is limited compared to Elasticsearch or LogQL. The UI is functional but not pleasant. Cross-account and cross-region log aggregation is cumbersome. CloudWatch Logs Insights queries can be slow on large datasets.

Best for: AWS-heavy teams that want simplicity and do not need advanced search. Good enough for most small-to-medium workloads.
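
As an illustration of the Logs Insights query language (the field names follow the structured-log example from earlier and are assumptions about your logs):

# Most recent errors, with the fields worth scanning
fields @timestamp, level, message, order_id
| filter level = "error"
| sort @timestamp desc
| limit 50

# Error volume in 5-minute buckets
filter level = "error"
| stats count(*) as errors by bin(5m)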

Google Cloud Logging

Cloud Logging (formerly Stackdriver) integrates deeply with GCP services. Logs are automatically collected from GKE, Cloud Run, Cloud Functions, and Compute Engine.

Pros: Automatic collection from GCP services. Powerful query syntax. Log-based metrics (turn log patterns into time-series metrics). Integrated with Cloud Monitoring for alerting.

Cons: Pricing is complex (free allotment, then per-GB). The UI can be slow. Log routing and exclusion configuration is not intuitive.
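
A taste of the Cloud Logging query syntax (the resource type and field names are assumptions for a GKE workload emitting JSON logs):

resource.type="k8s_container"
resource.labels.namespace_name="production"
severity>=ERROR
jsonPayload.service="order-processor"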

Azure Monitor Logs

Azure's log management is built on Log Analytics workspaces and uses KQL (Kusto Query Language) for queries.

Pros: Deep Azure integration. KQL is a genuinely powerful query language. Application Insights for application-level logging and tracing.

Cons: KQL has a steep learning curve. Pricing is per-GB ingested. The portal experience is cluttered.
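
A short KQL example against a Log Analytics workspace (the table and column names are assumptions; workspace-based Application Insights stores application logs in tables such as AppTraces):

AppTraces
| where TimeGenerated > ago(1h) and SeverityLevel >= 3
| summarize errors = count() by AppRoleName, bin(TimeGenerated, 5m)
| order by errors desc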

Logs vs. Metrics vs. Traces: When to Use What

Logs are not always the answer. Many things people log should be metrics or traces instead.

Use metrics when you want to track a number over time. "How many requests per second?" "What is the 95th percentile latency?" "How much memory is the service using?" Metrics are cheap to store, fast to query, and ideal for dashboards and alerts.

Use traces when you want to understand the flow of a single request through multiple services. "Why was this request slow?" "Which downstream service caused the timeout?" Traces are request-scoped and show you the call graph.

Use logs when you need the full context of a specific event. "What was the exact error message?" "What was the request payload that caused the failure?" "What happened in the 30 seconds before the crash?" Logs are event-scoped and provide the detail that metrics and traces cannot.

The most common mistake is logging metrics. Do not write logger.info("Request took 234ms") -- emit a histogram metric instead. Do not write logger.info("Queue depth: 42") -- expose that as a gauge. Logs that contain numbers you want to aggregate over time should be metrics.
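
For the request-duration case, a Prometheus histogram in the Node.js service does the job without a single log line. A sketch with prom-client, reusing the Express app from earlier (metric and label names are illustrative):

import client from "prom-client";

// One histogram, aggregated by Prometheus -- instead of thousands of "Request took Xms" lines
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    end({ method: req.method, route: req.route?.path ?? req.path, status_code: res.statusCode });
  });
  next();
});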

Choosing a Solution

ELK Stack: Choose when you need powerful full-text search, have a team that can operate Elasticsearch, and want to self-host. Best for medium-to-large teams with dedicated platform engineering.

Grafana Loki: Choose when you want cost-effective log management, already use Grafana for metrics, and can live with label-based (rather than full-text) search. Best for teams that prioritize operational simplicity and cost.

Datadog Logs: Choose when you want a fully managed solution, are already using Datadog for other observability, and have the budget. Best for teams that value integration and are willing to pay for convenience.

Cloud-native (CloudWatch/Cloud Logging/Azure Monitor): Choose when you are all-in on a single cloud provider and want zero-setup log collection. Best for small teams and simple architectures.

Regardless of which tool you choose, the practices matter more: structure your logs as JSON, use log levels consistently, set retention policies aggressively, and always ask "should this be a metric instead?" before adding a new log line. The cheapest log is the one you never generate.