DuckDB: The Analytics Database That Runs Everywhere
SQLite changed how developers think about relational databases: no server, no setup, just embed it in your application. DuckDB is attempting a similar shift for analytics workloads.
It's an in-process OLAP (Online Analytical Processing) database. No server to run, no cluster to manage. You embed DuckDB in your Python script, your Go program, or your Node app, and it runs complex analytical queries locally at speeds that used to require dedicated infrastructure.
Why DuckDB Exists
Most databases are optimized for transactional workloads (OLTP): many small reads and writes, row-based storage, indexed lookups. This is what SQLite, PostgreSQL, and MySQL are built for.
Analytical workloads are different: full table scans, aggregations across millions of rows, column-oriented access patterns. Running SELECT AVG(revenue) FROM orders GROUP BY month across 10 million rows is where row-based databases struggle and columnar databases shine.
DuckDB is a columnar OLAP engine that:
- Runs as an embedded library (no network protocol, no server process)
- Reads Parquet, CSV, JSON, and Arrow files directly — no import required
- Executes SQL including window functions, CTEs, and complex aggregations
- Parallelizes queries across CPU cores automatically
- Handles datasets that don't fit in RAM via spill-to-disk
The result: data analysis tasks that take minutes in Python/pandas or require cloud infrastructure can run in seconds locally.
Installation
DuckDB is available for most languages:
Python:
pip install duckdb
Node.js:
npm install @duckdb/node-api
CLI:
# macOS
brew install duckdb
# Direct download
curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip
Rust, Java, Go, and R packages are also available. The same engine powers all language bindings — queries behave identically.
Querying Files Directly
DuckDB's most immediately useful feature is querying files without importing them:
-- Query a CSV file
SELECT * FROM read_csv('data.csv') LIMIT 10;
-- Query a Parquet file
SELECT year, SUM(revenue)
FROM read_parquet('sales/*.parquet')
GROUP BY year
ORDER BY year;
-- Query a remote S3 file (uses the httpfs extension; recent DuckDB versions load it automatically)
SELECT COUNT(*) FROM read_parquet('s3://my-bucket/data/events.parquet');
-- Query multiple files with a glob
SELECT * FROM read_csv('logs/*.csv') WHERE status = 'error';
This means DuckDB is often useful without even creating a database. Point it at your CSV export and start querying immediately.
Python Integration
The Python API is clean and integrates with pandas and Arrow:
import duckdb
# Run a query and get a pandas DataFrame
conn = duckdb.connect()
df = conn.execute("""
SELECT
date_trunc('month', created_at) AS month,
COUNT(*) AS orders,
AVG(order_value) AS avg_value
FROM read_csv('orders.csv')
GROUP BY 1
ORDER BY 1
""").df()
print(df)
Query a pandas DataFrame directly:
import pandas as pd
import duckdb
orders = pd.read_csv('orders.csv')
# DuckDB can query the in-memory DataFrame by name
result = duckdb.query("SELECT * FROM orders WHERE status = 'completed'").df()
Convert to Arrow for zero-copy interop:
arrow_table = conn.execute("SELECT * FROM large_table").arrow()
DuckDB's Python API is deliberately minimal. You don't need connection pools, ORMs, or connection string configuration for most use cases.
SQL Capabilities
DuckDB implements most of modern SQL:
Window functions:
SELECT
user_id,
event_date,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
) AS rolling_7day_revenue
FROM events;
CTEs and recursive queries:
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 1 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY depth, name;
Struct and list types:
SELECT
user_id,
LIST(event_type ORDER BY created_at) AS event_sequence
FROM events
GROUP BY user_id;
PIVOT:
PIVOT sales ON region USING SUM(revenue);
DuckDB's dialect largely follows PostgreSQL, and it adds some unique SQL conveniences such as EXCLUDE in SELECT (drop specific columns from a star expansion) and COLUMNS() expressions (apply one expression to every matching column).
Persistent Database Files
For persistent storage, specify a file path:
conn = duckdb.connect('analytics.duckdb')
conn.execute("""
CREATE TABLE IF NOT EXISTS events AS
SELECT * FROM read_parquet('raw/*.parquet')
""")
conn.execute("CREATE INDEX idx_user ON events(user_id)")
The .duckdb file stores data in DuckDB's columnar format, optimized for analytical queries. Subsequent reads are much faster than re-parsing the original files.
Performance
DuckDB is fast for analytical queries. Some rough numbers for context:
- Aggregating 100M rows with GROUP BY: ~1-3 seconds on a modern laptop
- Joining two 10M-row tables: ~500ms
- Reading a 1GB Parquet file: ~200ms
These numbers depend on hardware and query complexity, but the order of magnitude is correct. Operations that take minutes in Python with pandas often complete in seconds with DuckDB.
Why it's fast:
- Columnar storage: only read the columns a query needs
- Vectorized execution: process batches of values instead of one at a time
- Automatic parallelism: queries use all available CPU cores without configuration
- SIMD instructions: exploits modern CPU vector operations
DuckDB vs. SQLite vs. BigQuery
| Aspect | DuckDB | SQLite | BigQuery |
|---|---|---|---|
| Workload type | Analytics (OLAP) | Transactional (OLTP) | Analytics (OLAP) |
| Server required | No | No | No (managed) |
| Query speed (analytics) | Excellent | Poor | Excellent |
| Concurrent writes | Limited | Limited | Excellent |
| Row inserts/updates | Slow | Fast | Batch only |
| Cost | Free | Free | Pay per query |
| Data location | Local | Local | Cloud |
| Max data size | Limited by disk | Limited by disk | Petabytes |
DuckDB isn't a replacement for SQLite (wrong workload type) or BigQuery (wrong scale). It's the right tool when you have gigabytes-to-terabytes of data for analysis that fits on local infrastructure.
Common Use Cases
Log analysis: Query application logs exported as CSV or Parquet without importing into a database system.
duckdb -c "SELECT status, COUNT(*) FROM read_csv('nginx.log') GROUP BY status ORDER BY 2 DESC"
ETL pipelines: Transform data between formats with SQL instead of Python loops:
COPY (
SELECT
date_trunc('hour', timestamp) AS hour,
source,
COUNT(*) AS event_count
FROM read_json('raw/events/*.json.gz')
GROUP BY 1, 2
) TO 'output/hourly_rollup.parquet' (FORMAT parquet);
Data exploration: Explore a new dataset interactively before deciding how to process it. DuckDB's CLI provides a REPL with tab completion.
Replacing pandas for aggregations: DuckDB queries often replace complex pandas operations with more readable SQL and significantly better performance on larger datasets.
Testing analytics queries locally: Develop and test BigQuery or Snowflake queries locally against a subset of data before running them against cloud infrastructure.
Integrations
Jupyter notebooks: DuckDB's Python API works directly in notebooks. The duckdb-engine package enables SQLAlchemy integration for tools that use it.
dbt: DuckDB is a supported dbt adapter — run dbt transformations locally without a cloud data warehouse.
Apache Arrow: DuckDB and Arrow share data without copying, making DuckDB a natural fit in Arrow-based data pipelines.
Parquet ecosystem: DuckDB reads and writes Parquet natively, integrating cleanly with Spark-generated data, Delta Lake tables, and Iceberg catalogs.
Getting Started
The fastest way to explore DuckDB:
# Install CLI
brew install duckdb # or download binary
# Query a CSV directly
duckdb -c "SELECT * FROM read_csv_auto('your_file.csv') LIMIT 10"
# Start interactive session
duckdb
D SELECT version();
The DuckDB documentation is thorough and the SQL dialect is well-documented. For Python use, the Python client guide covers the full API.
DuckDB is a rare tool that genuinely changes how you work with data. Once you have it available, you'll find yourself reaching for it whenever a data task comes up — it's fast, it requires no infrastructure, and it's just SQL.