DuckDB: The Analytics Database That Runs Everywhere
SQLite changed how developers think about relational databases: no server, no setup, just embed it in your application. DuckDB is attempting a similar shift for analytics workloads.
It's an in-process OLAP (Online Analytical Processing) database. No server to run, no cluster to manage. You embed DuckDB in your Python script, your Go program, or your Node app, and it runs complex analytical queries locally at speeds that used to require dedicated infrastructure.
Why DuckDB Exists
Most databases are optimized for transactional workloads (OLTP): many small reads and writes, row-based storage, indexed lookups. This is what SQLite, PostgreSQL, and MySQL are built for.
Analytical workloads are different: full table scans, aggregations across millions of rows, column-oriented access patterns. Running SELECT AVG(revenue) FROM orders GROUP BY month across 10 million rows is where row-based databases struggle and columnar databases shine.
DuckDB is a columnar OLAP engine that:
- Runs as an embedded library (no network protocol, no server process)
- Reads Parquet, CSV, JSON, and Arrow files directly — no import required
- Executes SQL including window functions, CTEs, and complex aggregations
- Parallelizes queries across CPU cores automatically
- Handles datasets that don't fit in RAM via spill-to-disk
The result: data analysis tasks that take minutes in Python/pandas or require cloud infrastructure can run in seconds locally.
Installation
DuckDB is available for most languages:
Python:
pip install duckdb
Node.js:
npm install @duckdb/node-api
CLI:
# macOS
brew install duckdb
# Direct download
curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip
Rust, Java, Go, and R packages are also available. The same engine powers all language bindings — queries behave identically.
Querying Files Directly
DuckDB's most immediately useful feature is querying files without importing them:
-- Query a CSV file
SELECT * FROM read_csv('data.csv') LIMIT 10;
-- Query a Parquet file
SELECT year, SUM(revenue)
FROM read_parquet('sales/*.parquet')
GROUP BY year
ORDER BY year;
-- Query a remote S3 file (uses the httpfs extension; recent DuckDB versions load it automatically)
SELECT COUNT(*) FROM read_parquet('s3://my-bucket/data/events.parquet');
-- Query multiple files with a glob
SELECT * FROM read_csv('logs/*.csv') WHERE status = 'error';
This means DuckDB is often useful without even creating a database. Point it at your CSV export and start querying immediately.
Python Integration
The Python API is clean and integrates with pandas and Arrow:
import duckdb
# Run a query and get a pandas DataFrame
conn = duckdb.connect()
df = conn.execute("""
SELECT
date_trunc('month', created_at) AS month,
COUNT(*) AS orders,
AVG(order_value) AS avg_value
FROM read_csv('orders.csv')
GROUP BY 1
ORDER BY 1
""").df()
print(df)
Query a pandas DataFrame directly:
import pandas as pd
import duckdb
orders = pd.read_csv('orders.csv')
# DuckDB can query the in-memory DataFrame by name
result = duckdb.query("SELECT * FROM orders WHERE status = 'completed'").df()
Convert to Arrow for zero-copy interop:
arrow_table = conn.execute("SELECT * FROM large_table").arrow()
DuckDB's Python API is deliberately minimal. You don't need connection pools, ORMs, or connection string configuration for most use cases.
SQL Capabilities
DuckDB implements most of modern SQL:
Window functions:
SELECT
user_id,
event_date,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 7 PRECEDING AND CURRENT ROW
) AS rolling_7day_revenue
FROM events;
CTEs and recursive queries:
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 1 AS depth
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.depth + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY depth, name;
Struct and list types:
SELECT
user_id,
LIST(event_type ORDER BY created_at) AS event_sequence
FROM events
GROUP BY user_id;
PIVOT:
PIVOT sales ON region USING SUM(revenue);
DuckDB's dialect largely follows PostgreSQL, and it adds some unique SQL conveniences such as EXCLUDE in SELECT (drop specific columns from a star expansion) and COLUMNS() expressions (apply one expression to every matching column).
Persistent Database Files
For persistent storage, specify a file path:
conn = duckdb.connect('analytics.duckdb')
conn.execute("""
CREATE TABLE IF NOT EXISTS events AS
SELECT * FROM read_parquet('raw/*.parquet')
""")
conn.execute("CREATE INDEX idx_user ON events(user_id)")
The .duckdb file stores data in DuckDB's columnar format, optimized for analytical queries. Subsequent reads are much faster than re-parsing the original files.
Performance
DuckDB is fast for analytical queries. Some rough numbers for context:
- Aggregating 100M rows with GROUP BY: ~1-3 seconds on a modern laptop
- Joining two 10M-row tables: ~500ms
- Reading a 1GB Parquet file: ~200ms
These numbers depend on hardware and query complexity, but the order of magnitude is correct. Operations that take minutes in Python with pandas often complete in seconds with DuckDB.
Why it's fast:
- Columnar storage: only read the columns a query needs
- Vectorized execution: process batches of values instead of one at a time
- Automatic parallelism: queries use all available CPU cores without configuration
- SIMD instructions: exploits modern CPU vector operations
DuckDB vs. SQLite vs. BigQuery
| Aspect | DuckDB | SQLite | BigQuery |
|---|---|---|---|
| Workload type | Analytics (OLAP) | Transactional (OLTP) | Analytics (OLAP) |
| Server required | No | No | No (managed) |
| Query speed (analytics) | Excellent | Poor | Excellent |
| Concurrent writes | Limited | Limited | Excellent |
| Row inserts/updates | Slow | Fast | Batch only |
| Cost | Free | Free | Pay per query |
| Data location | Local | Local | Cloud |
| Max data size | Limited by disk | Limited by disk | Petabytes |
DuckDB isn't a replacement for SQLite (wrong workload type) or BigQuery (wrong scale). It's the right tool when you have gigabytes-to-terabytes of data for analysis that fits on local infrastructure.
Common Use Cases
Log analysis: Query application logs exported as CSV or Parquet without importing into a database system.
duckdb -c "SELECT status, COUNT(*) FROM read_csv('nginx.log') GROUP BY status ORDER BY 2 DESC"
ETL pipelines: Transform data between formats with SQL instead of Python loops:
COPY (
SELECT
date_trunc('hour', timestamp) AS hour,
source,
COUNT(*) AS event_count
FROM read_json('raw/events/*.json.gz')
GROUP BY 1, 2
) TO 'output/hourly_rollup.parquet' (FORMAT parquet);
Data exploration: Explore a new dataset interactively before deciding how to process it. DuckDB's CLI provides a REPL with tab completion.
Replacing pandas for aggregations: DuckDB queries often replace complex pandas operations with more readable SQL and significantly better performance on larger datasets.
Testing analytics queries locally: Develop and test BigQuery or Snowflake queries locally against a subset of data before running them against cloud infrastructure.
Integrations
Jupyter notebooks: DuckDB's Python API works directly in notebooks. The duckdb-engine package enables SQLAlchemy integration for tools that use it.
dbt: DuckDB is a supported dbt adapter — run dbt transformations locally without a cloud data warehouse.
Apache Arrow: DuckDB and Arrow share data without copying, making DuckDB a natural fit in Arrow-based data pipelines.
Parquet ecosystem: DuckDB reads and writes Parquet natively, integrating cleanly with Spark-generated data, Delta Lake tables, and Iceberg catalogs.
Getting Started
The fastest way to explore DuckDB:
# Install CLI
brew install duckdb # or download binary
# Query a CSV directly
duckdb -c "SELECT * FROM read_csv_auto('your_file.csv') LIMIT 10"
# Start interactive session
duckdb
D SELECT version();
The DuckDB documentation is thorough and the SQL dialect is well-documented. For Python use, the Python client guide covers the full API.
DuckDB is a rare tool that genuinely changes how you work with data. Once you have it available, you'll find yourself reaching for it whenever a data task comes up — it's fast, it requires no infrastructure, and it's just SQL.