SQLStream

A lightweight, pure-Python SQL query engine for CSV, Parquet, JSON, JSONL, HTML, Markdown, and XML files, with lazy evaluation and query optimizations such as column pruning and predicate pushdown.


Quick Example

# Query a CSV file
$ sqlstream query "SELECT * FROM 'data.csv' WHERE age > 25"

# Query JSON with nested paths
$ sqlstream query "users.json#json:data.users" "SELECT name, email FROM users"

# Query HTML tables from files or URLs
$ sqlstream query "report.html#html:0" "SELECT * FROM report WHERE revenue > 1000000"

# Query Markdown tables
$ sqlstream query "README.md#markdown:1" "SELECT column1, column2 FROM readme"

# Query XML files
$ sqlstream query "data.xml#xml:record" "SELECT name, age FROM data WHERE age > 25"

# Join multiple files (any format combination)
$ sqlstream query "SELECT c.name, o.total FROM 'customers.csv' c JOIN 'orders.parquet' o ON c.id = o.customer_id"

# Interactive shell with full TUI
$ sqlstream shell

# Query S3 files
$ sqlstream query "SELECT * FROM 's3://my-bucket/data.parquet' WHERE date > '2024-01-01'"

Choose Your Path

SQLStream can be used in three ways. Pick the one that fits your needs:

  • Python Module


    Write Python code to query files programmatically.

    Best for: Data analysis, ETL pipelines, notebooks, automation scripts

  • CLI Tool


    Run SQL queries from the command line or shell scripts.

    Best for: Shell scripts, one-liners, CI/CD pipelines, cron jobs

  • Interactive Shell


    Full-featured TUI for exploring and querying data interactively.

    Best for: Ad-hoc analysis, data exploration, generating reports

Not sure which to use?

See our decision guide to help you choose.


Key Features

  • Pure Python


    No database installation required. Works anywhere Python runs.

  • Multiple Formats


    Support for CSV, Parquet, JSON, JSONL, HTML, Markdown, XML files, HTTP URLs, and S3 buckets.

  • 10-100x Faster


    Optional DuckDB and pandas backends for a massive performance boost.

  • JOIN Support


    INNER, LEFT, RIGHT, and FULL OUTER joins across multiple files.

  • Aggregations


    GROUP BY with COUNT, SUM, AVG, MIN, MAX functions (see the sketch after this list).

  • Beautiful Output


    Rich tables, JSON, and CSV output with syntax highlighting.

  • Interactive Shell


    Full-featured TUI with multiple tabs, state persistence, file browser, and query plan visualization.

  • Table Extraction


    Extract and query tables from HTML pages and Markdown documents with multi-table support.

  • Smart Optimizations


    Column pruning, predicate pushdown, lazy evaluation.
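
As a concrete illustration of the aggregation feature, here is a hedged sketch using the Python API shown in the Quick Start below; the orders.csv file and its columns are illustrative, and the table name is assumed to follow the file's base name, as in the CLI examples:

from sqlstream import query

# Total and average order value per customer (GROUP BY with SUM and AVG)
results = query("orders.csv").sql(
    "SELECT customer_id, SUM(total), AVG(total) FROM orders GROUP BY customer_id"
)
for row in results:
    print(row)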

Check all features here


Installation

Using uv

# Base install
uv tool install sqlstream
# Add selected extras
uv tool install "sqlstream[parquet,pandas]"
# Everything
uv tool install "sqlstream[all]"
# Or pick exactly the extras you need
uv tool install "sqlstream[interactive,pandas,s3,http,html,duckdb]"

Using pip

pip install sqlstream
pip install "sqlstream[all]"

Quick Start

CLI Usage

# Simple query
$ sqlstream query data.csv "SELECT name, age FROM data WHERE age > 25"

# With output format
$ sqlstream query data.csv "SELECT * FROM data" --format json

# Show execution time
$ sqlstream query data.csv "SELECT * FROM data" --time

# Use pandas backend for performance
$ sqlstream query data.parquet "SELECT * FROM data" --backend pandas

Python API

from sqlstream import query

# Execute query
results = query("data.csv").sql("SELECT * FROM data WHERE age > 25")

# Iterate over results (lazy evaluation)
for row in results:
    print(row)

# Or convert to list
results_list = query("data.csv").sql("SELECT * FROM data").to_list()
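
Because results are produced lazily, they can be streamed straight into another sink without materializing everything. A hedged sketch that assumes each row behaves like a mapping of column names to values (the examples above only show printing rows and to_list()):

import csv
from sqlstream import query

# Stream filtered rows into a new CSV file one row at a time
rows = query("data.csv").sql("SELECT name, age FROM data WHERE age > 25")
with open("adults.csv", "w", newline="") as f:
    writer = None
    for row in rows:
        if writer is None:
            # Column names taken from the first row (mapping-like rows assumed)
            writer = csv.DictWriter(f, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(dict(row))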

Why SQLStream?

Perfect For

  • Data Exploration: Quick analysis without database setup
  • ETL Pipelines: Process CSV/Parquet files with SQL
  • Data Science: Filter and join datasets before pandas
  • DevOps: Query logs and data files in CI/CD
  • Learning: Understand query execution internals

Not For

  • Large Databases: Use PostgreSQL or MySQL instead
  • Real-time Analytics: Use ClickHouse or DuckDB
  • Production OLTP: SQLStream is read-only

Performance

SQLStream offers three execution backends:

Backend   Speed            Use Case
Python    Baseline         Learning, small files (<100K rows)
Pandas    10-100x faster   Basic queries, large files (>100K rows)
DuckDB    100x+ faster     Complex SQL, analytics, huge files

Performance Tips

  • Use --backend duckdb for complex SQL (CTEs, window functions)
  • Use --backend pandas for simple queries on large files
  • Use column pruning: SELECT name, age instead of SELECT * (see the sketch after these tips)
  • Add WHERE filters to reduce data scanned
  • Use Parquet format for better compression
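
The column-pruning and filtering tips translate directly to the Python API. A hedged sketch with illustrative file and column names; the table name is assumed to follow the file's base name, as in the CLI examples:

from sqlstream import query

# Good: only two columns are read and rows are filtered early,
# so column pruning and predicate pushdown keep the scan small
lean = query("sales.parquet").sql(
    "SELECT region, amount FROM sales WHERE amount > 1000"
)

# Avoid on large files: every column of every row flows through the pipeline
wide = query("sales.parquet").sql("SELECT * FROM sales")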

Project Status

SQLStream is in active development. Current phase: 12

  • ✅ Phase 0-2: Core query engine with Volcano model
  • ✅ Phase 3: Parquet support
  • ✅ Phase 4: Aggregations & GROUP BY
  • ✅ Phase 5: JOIN operations (INNER, LEFT, RIGHT, FULL OUTER)
  • ✅ Phase 5.5: Pandas backend (10-100x speedup)
  • ✅ Phase 6: HTTP data sources
  • ✅ Phase 7: CLI with beautiful output
  • ✅ Phase 7.5: Interactive shell with Textual
  • ✅ Phase 7.6: Inline file path support
  • ✅ Phase 7.7: S3 Support for CSV and Parquet
  • ✅ Phase 8: Type system & schema inference
  • ✅ Phase 9: Enhanced interactive shell (multiple tabs, state persistence, file browser, query plan)
  • ✅ Phase 10: HTML & Markdown readers with table extraction
  • ✅ Phase 11: Enhanced type system (Decimal, DateTime, Date, Time, JSON) & DuckDB backend integration
  • 🚧 Phase 12: Comprehensive testing & documentation (560 tests, 15% coverage → 80% target)

License

SQLStream is licensed under the MIT License.


Contributing

Contributions are welcome! See the Contributing Guide for details.