SQLStream

A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, Markdown, and XML files, with lazy evaluation and query optimizations such as column pruning and predicate pushdown.


Quick Example

# Query a CSV file
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"

# Query JSON with nested paths
$ sqlstream query "users.json#json:data.users" "SELECT name, email FROM users"

# Query HTML tables from files or URLs
$ sqlstream query "report.html#html:0" "SELECT * FROM report WHERE revenue > 1000000"

# Query Markdown tables
$ sqlstream query "README.md#markdown:1" "SELECT column1, column2 FROM readme"

# Query XML files
$ sqlstream query "data.xml#xml:record" "SELECT name, age FROM data WHERE age > 25"

# Join multiple files (any format combination)
$ sqlstream query "SELECT c.name, o.total FROM 'customers.csv' c JOIN 'orders.parquet' o ON c.id = o.customer_id"

# Interactive shell with full TUI
$ sqlstream shell

# Query S3 files
$ sqlstream query "SELECT * FROM 's3://my-bucket/data.parquet' WHERE date > '2024-01-01'"

Choose Your Path

SQLStream can be used in three ways. Pick the one that fits your needs:

  • Python Module


    Write Python code to query files programmatically.

    Best for: Data analysis, ETL pipelines, notebooks, automation scripts

  • CLI Tool


    Run SQL queries from the command line or shell scripts.

    Best for: Shell scripts, one-liners, CI/CD pipelines, cron jobs

  • Interactive Shell


    Full-featured TUI for exploring and querying data interactively.

    Best for: Ad-hoc analysis, data exploration, generating reports

Not sure which to use?

See our decision guide to help you choose.


Key Features

  • Pure Python


    No database installation required. Works anywhere Python runs.

  • Multiple Formats


    Support for CSV, Parquet, JSON, JSONL, HTML, Markdown, XML files, HTTP URLs, and S3 buckets.

  • 10-100x Faster


    Optional DuckDB and pandas backends for a massive performance boost.

  • JOIN Support


    INNER, LEFT, RIGHT, and FULL OUTER joins across multiple files.

  • Aggregations


    GROUP BY with COUNT, SUM, AVG, MIN, MAX functions (see the sketch after this list).

  • Beautiful Output


    Rich tables, JSON, and CSV output with syntax highlighting.

  • Interactive Shell


    Full-featured TUI with multiple tabs, state persistence, file browser, and query plan visualization.

  • Table Extraction


    Extract and query tables from HTML pages and Markdown documents with multi-table support.

  • Smart Optimizations


    Column pruning, predicate pushdown, lazy evaluation.
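
As a concrete illustration of the aggregation feature, here is a hedged sketch using the Python API shown in the Quick Start below; the orders.csv file and its columns are illustrative, and the table name is assumed to follow the file's base name, as in the CLI examples:

from sqlstream import query

# Total and average order value per customer (GROUP BY with SUM and AVG)
results = query("orders.csv").sql(
    "SELECT customer_id, SUM(total), AVG(total) FROM orders GROUP BY customer_id"
)
for row in results:
    print(row)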

Check all features here


Installation

Using uv

# Base install
uv tool install sqlstream
# Add selected extras
uv tool install "sqlstream[parquet,pandas]"
# Everything
uv tool install "sqlstream[all]"
# Or pick exactly the extras you need
uv tool install "sqlstream[interactive,pandas,s3,http,html,duckdb]"

Using pip

pip install sqlstream
pip install "sqlstream[all]"

Quick Start

CLI Usage

# Simple query
$ sqlstream query data.csv "SELECT name, age FROM data WHERE age > 25"

# With output format
$ sqlstream query data.csv "SELECT * FROM data" --format json

# Show execution time
$ sqlstream query data.csv "SELECT * FROM data" --time

# Use pandas backend for performance
$ sqlstream query data.parquet "SELECT * FROM data" --backend pandas

Python API

from sqlstream import query

# Execute query
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")

# Iterate over results (lazy evaluation)
for row in results:
    print(row)

# Or convert to list
results_list = query("data.csv").sql("SELECT * FROM data").to_list()
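
Because results are produced lazily, they can be streamed straight into another sink without materializing everything. A hedged sketch that assumes each row behaves like a mapping of column names to values (the examples above only show printing rows and to_list()):

import csv
from sqlstream import query

# Stream filtered rows into a new CSV file one row at a time
rows = query("data.csv").sql("SELECT name, age FROM data WHERE age > 25")
with open("adults.csv", "w", newline="") as f:
    writer = None
    for row in rows:
        if writer is None:
            # Column names taken from the first row (mapping-like rows assumed)
            writer = csv.DictWriter(f, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(dict(row))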

Why SQLStream?

Perfect For

  • Data Exploration: Quick analysis without database setup
  • ETL Pipelines: Process CSV/Parquet files with SQL
  • Data Science: Filter and join datasets before pandas
  • DevOps: Query logs and data files in CI/CD
  • Learning: Understand query execution internals

Not For

  • Large Databases: Use PostgreSQL or MySQL instead
  • Real-time Analytics: Use ClickHouse or DuckDB
  • Production OLTP: SQLStream is read-only

Performance

SQLStream offers three execution backends:

Backend   Speed            Use Case
Python    Baseline         Learning, small files (<100K rows)
Pandas    10-100x faster   Basic queries, large files (>100K rows)
DuckDB    100x+ faster     Complex SQL, analytics, huge files

Performance Tips

  • Use --backend duckdb for complex SQL (CTEs, window functions)
  • Use --backend pandas for simple queries on large files
  • Use column pruning: SELECT name, age instead of SELECT * (see the sketch after these tips)
  • Add WHERE filters to reduce data scanned
  • Use Parquet format for better compression
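
The column-pruning and filtering tips translate directly to the Python API. A hedged sketch with illustrative file and column names; the table name is assumed to follow the file's base name, as in the CLI examples:

from sqlstream import query

# Good: only two columns are read and rows are filtered early,
# so column pruning and predicate pushdown keep the scan small
lean = query("sales.parquet").sql(
    "SELECT region, amount FROM sales WHERE amount > 1000"
)

# Avoid on large files: every column of every row flows through the pipeline
wide = query("sales.parquet").sql("SELECT * FROM sales")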

Project Status

SQLStream is in active development. Current phase: 12

  • ✅ Phase 0-2: Core query engine with Volcano model
  • ✅ Phase 3: Parquet support
  • ✅ Phase 4: Aggregations & GROUP BY
  • ✅ Phase 5: JOIN operations (INNER, LEFT, RIGHT, FULL OUTER)
  • ✅ Phase 5.5: Pandas backend (10-100x speedup)
  • ✅ Phase 6: HTTP data sources
  • ✅ Phase 7: CLI with beautiful output
  • ✅ Phase 7.5: Interactive shell with Textual
  • ✅ Phase 7.6: Inline file path support
  • ✅ Phase 7.7: S3 Support for CSV and Parquet
  • ✅ Phase 8: Type system & schema inference
  • ✅ Phase 9: Enhanced interactive shell (multiple tabs, state persistence, file browser, query plan)
  • ✅ Phase 10: HTML & Markdown readers with table extraction
  • ✅ Phase 11: Enhanced type system (Decimal, DateTime, Date, Time, JSON) & DuckDB backend integration
  • 🚧 Phase 12: Comprehensive testing & documentation (560 tests, 15% coverage → 80% target)

License

SQLStream is licensed under the MIT License.


Contributing

Contributions are welcome! See the Contributing Guide for details.