Frequently Asked Questions (FAQ)
General Questions
How is SQLStream different from DuckDB?
DuckDB is a fully-featured embedded database with a complete SQL implementation, optimized for OLAP workloads.
SQLStream is a lightweight Python library focused on:
- Simplicity: Pure Python, minimal dependencies
- Learning: Understand query execution internals
- Portability: Works anywhere Python runs
- File-first: Query files without database setup
Use DuckDB if: You need maximum performance, full SQL compatibility, or production OLAP workloads.
Use SQLStream if: You want a lightweight tool for data exploration, learning query engines, or quick CSV/Parquet analysis.
How is SQLStream different from pandas?
pandas is a data manipulation library with a DataFrame API.
SQLStream provides a SQL interface for:
- Familiar syntax: Use SQL instead of DataFrame methods
- Lazy evaluation: Process large files efficiently
- Joining multiple files: Combine datasets without loading everything into memory
- SQL users: Leverage existing SQL knowledge
Interoperability: You can use pandas backend for performance (--backend pandas) or convert SQLStream results to pandas DataFrames.
Can I use this in production?
Current status: SQLStream is in active development and suitable for:
- ✅ Data exploration and analysis
- ✅ ETL scripts and data pipelines
- ✅ CI/CD data processing
- ✅ Learning and prototyping
Considerations:
- 🚧 API may change between versions
- ✅ Full SQL support with the DuckDB backend (CTEs, window functions, subqueries)
- 🚧 Limited SQL with the Python/Pandas backends
- 🚧 No transaction support (read-only)
- 🚧 Growing test coverage (560 tests, 15%; actively improving to 80%)
Recommendation: Use DuckDB backend for production workloads requiring complex SQL. Python/Pandas backends suitable for simple queries and learning.
What's the maximum file size?
Python Backend:
- RAM limited: Can handle files up to available memory
- Typical: 10-100 MB comfortably
- Large files: Use LIMIT or pandas backend
Pandas Backend:
- Much larger: 100 MB - 10 GB+ depending on RAM
- Chunk processing: Automatically handles larger-than-memory files with streaming
S3 files:
- Streaming: Data is streamed, not loaded entirely into memory
- Practical limit: 10 GB+ for Parquet, smaller for CSV
Best practices:
# Preview large files
sqlstream query large.csv "SELECT * FROM large LIMIT 1000" --backend pandas
# Use Parquet for better compression
sqlstream query large.parquet "SELECT * FROM large WHERE date > '2024-01-01'" --backend pandas
Does it work on Windows?
Yes! SQLStream is pure Python and works on:
- ✅ Windows (Windows 10/11, Server)
- ✅ macOS (Intel and Apple Silicon)
- ✅ Linux (Ubuntu, Debian, CentOS, etc.)
Platform-specific notes:
- Windows: Use PowerShell or CMD
- Interactive shell: Works best in Windows Terminal (not CMD)
- File paths: Use forward slashes / or escape backslashes \\
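For example, in Python code on Windows (the path below is a placeholder):
from sqlstream import query

# Forward slashes work in Windows paths...
results = query("C:/data/sales.csv").sql("SELECT * FROM sales")
# ...as do escaped backslashes: query("C:\\data\\sales.csv")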
Can I use it with Jupyter notebooks?
Yes! SQLStream works great in Jupyter:
# Install in notebook
!pip install "sqlstream[all]"
# Use in cells
from sqlstream import query
results = query("data.csv").sql("""
SELECT department, AVG(salary) as avg_salary
FROM data
GROUP BY department
""")
# Display as DataFrame
import pandas as pd
df = pd.DataFrame(results.to_list())
df
Tips:
- Use to_list() for small results
- Use iteration for large results to avoid memory issues (see the sketch below)
- Consider DuckDB for faster notebook performance
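A minimal sketch of the iteration tip, assuming the result object is iterable as the tips above imply (file name and per-row handling are placeholders):
from sqlstream import query

# Stream rows one at a time instead of materializing everything with to_list()
for row in query("big.csv").sql("SELECT * FROM big"):
    print(row)  # replace with your own per-row processing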
See Jupyter Integration Guide (coming soon) for more details.
How do I report bugs?
- Check existing issues: GitHub Issues
- Create new issue: Include:
  - SQLStream version (pip show sqlstream)
  - Python version
  - Operating system
  - Minimal reproduction code
  - Error message with full traceback
  - Expected vs actual behavior
Example:
## Bug Report
**Environment:**
- SQLStream: 0.2.5
- Python: 3.11.0
- OS: Ubuntu 22.04
**Code:**
\```python
from sqlstream import query
results = query("data.csv").sql("SELECT * FROM data")
\```
**Error:**
\```
TypeError: ...
\```
**Expected:** Should return results
**Actual:** Raises TypeError
How do I request features?
- Check roadmap: See Project Status
- Search discussions: GitHub Discussions
- Create discussion: Use "Ideas" category
- Include:
  - Use case description
  - Example of desired behavior
  - Why existing features don't work
  - Willingness to contribute
What features are planned?
- See Development Status for the roadmap
- Phase 10: Error handling & user feedback
- Phase 11: Testing & documentation
- Future: Enhanced Python/Pandas backend SQL support (window functions and CTEs are currently available in the DuckDB backend)
Is there commercial support?
Currently: No commercial support or SLA.
Community support:
- GitHub Discussions
- GitHub Issues
- Documentation: https://subhayu99.github.io/sqlstream
For enterprises: Consider DuckDB, which has commercial support options.
Installation & Setup
What dependencies does SQLStream have?
Minimal (CSV only):
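pip install sqlstream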
Recommended (all features):
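pip install "sqlstream[all]"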
By feature:
- sqlstream[parquet] - Parquet support (pyarrow)
- sqlstream[pandas] - Pandas backend (pandas)
- sqlstream[duckdb] - DuckDB backend (duckdb)
- sqlstream[s3] - S3 support (s3fs)
- sqlstream[cli] - Basic CLI shell (click)
- sqlstream[interactive] - Full-blown TUI (textual)
- sqlstream[http] - HTTP data sources (httpx)
- sqlstream[html] - HTML parsing (lxml)
How do I upgrade SQLStream?
# Upgrade to latest version
pip install --upgrade sqlstream
# Upgrade with all features
pip install --upgrade "sqlstream[all]"
# Check current version
pip show sqlstream
# or
sqlstream --version
Breaking changes? Check Changelog (coming soon).
Usage Questions
Can I query multiple files at once?
Yes! Use JOINs:
SELECT e.name, d.department_name, e.salary
FROM 'employees.csv' e
JOIN 'departments.csv' d ON e.dept_id = d.id
WHERE e.salary > 80000
Limitations:
- All files must be accessible (local or remote)
- JOIN performance depends on file sizes
- Use the pandas backend for better performance
Can I use aggregate functions?
Yes! Supported aggregations:
SELECT
department,
COUNT(*) as count,
AVG(salary) as avg_salary,
MIN(salary) as min_salary,
MAX(salary) as max_salary,
SUM(salary) as total_salary
FROM 'employees.csv'
GROUP BY department
ORDER BY avg_salary DESC
Supported: COUNT, SUM, AVG, MIN, MAX
Not yet: COUNT(DISTINCT), STDDEV, VARIANCE, custom aggregates
Can I use subqueries or CTEs?
Yes, with the DuckDB backend!
Supported (DuckDB backend):
- ✅ Subqueries in FROM clause
- ✅ Common Table Expressions (WITH clause)
- ✅ Correlated subqueries
- ✅ Subqueries in WHERE clause
Example:
from sqlstream import query
# Use CTEs with DuckDB backend
results = query().sql("""
WITH high_earners AS (
SELECT * FROM 'employees.csv'
WHERE salary > 100000
)
SELECT
department,
AVG(salary) as avg_salary
FROM high_earners
GROUP BY department
""", backend="duckdb")
Not supported (Python/Pandas backends):
- ❌ Python backend: Limited SQL support
- ❌ Pandas backend: Basic queries only
Recommendation: Use --backend duckdb for complex SQL features.
See DuckDB Backend Guide for full details.
How do I specify column types manually?
Currently: Types are inferred automatically.
CSV type inference:
- Samples first 100 rows
- Detects: int, float, string, date, datetime, boolean
Parquet: Types come from file metadata
Manual types: Not yet supported. Planned for future release.
Workaround: Use CAST (if implemented) or process with pandas first.
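A workaround sketch using pandas (column names and dtypes below are hypothetical): coerce the types in pandas, write to Parquet so the types are stored in the file, then query the Parquet file.
import pandas as pd

# Hypothetical example: force the column types up front, then let Parquet carry them
df = pd.read_csv("data.csv", dtype={"id": "string"}, parse_dates=["signup_date"])
df.to_parquet("data_typed.parquet")
# SQLStream then picks the types up from the Parquet metadata:
#   sqlstream query data_typed.parquet "SELECT * FROM data_typed"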
Can I write data (INSERT, UPDATE, DELETE)?
No. SQLStream is read-only.
Supported: SELECT queries only
Not supported:
- ❌ INSERT
- ❌ UPDATE
- ❌ DELETE
- ❌ CREATE TABLE
- ❌ ALTER TABLE
For data modification: Use pandas, DuckDB, or traditional databases.
Performance & Optimization
Why is my query slow?
Common reasons:
- Using Python backend on large files
  - Solution: Use --backend pandas
- Not using WHERE filters
  - Solution: Add filters to reduce data scanned
- Using SELECT *
  - Solution: Select only needed columns
- Large JOINs
  - Solution: Filter before joining
- CSV vs Parquet
  - Solution: Convert to Parquet for better performance
Example optimization:
# Slow (Python backend, all columns, all rows)
sqlstream query large.csv "SELECT * FROM large"
# Fast (pandas, filtered, specific columns)
sqlstream query large.parquet \
"SELECT name, salary FROM large WHERE date > '2024-01-01'" \
--backend pandas
What backends are supported?
SQLStream supports three execution backends:
- Python (Default/Educational): Pure Python implementation of the Volcano model. Great for learning how databases work, but slower for large data.
- Pandas: Translates SQL to pandas operations. 10-100x faster than Python backend. Best for general use.
- DuckDB (New!): Uses DuckDB's engine for full SQL support (window functions, CTEs, etc.) and maximum performance. 10-1000x faster.
How do I use the DuckDB backend?
Install with pip install duckdb or pip install "sqlstream[duckdb]".
Then use it in your code:
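from sqlstream import query

# Same pattern as the CTE example below: pass backend="duckdb" to .sql()
results = query().sql("SELECT * FROM 'employees.csv' WHERE salary > 80000", backend="duckdb")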
Or via CLI:
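# The --backend flag selects the engine (mirrors the --backend pandas examples above)
sqlstream query employees.csv "SELECT * FROM employees WHERE salary > 80000" --backend duckdb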
Does DuckDB backend support all file formats?
Yes! The DuckDB backend uses SQLStream's unified reader architecture, so it supports:
- CSV, Parquet, JSON
- HTML tables, Markdown tables
- S3 files (s3://)
- HTTP/HTTPS URLs
It automatically handles authentication and caching just like the other backends.
See Performance Guide (coming soon) for details.
Should I use CSV or Parquet?
Parquet is almost always better:
| Feature | CSV | Parquet |
|---|---|---|
| Read Speed | Baseline | 10-100x faster |
| File Size | Large | 2-10x smaller |
| Type Safety | Inferred | Stored in file |
| Column Access | Read all | Columnar (faster) |
| Compression | None/gzip | Snappy/Gzip/LZ4 |
Use CSV when:
- You need a human-readable format
- Editing files manually
- A tool doesn't support Parquet
Use Parquet when:
- Performance matters
- Large files
- Production pipelines
Convert CSV to Parquet:
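One straightforward way is with pandas and pyarrow (file names are examples):
import pandas as pd

# Read the CSV and write it back out as Parquet (requires pandas + pyarrow)
pd.read_csv("large.csv").to_parquet("large.parquet")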
Error Messages
"No module named 'textual'"
Problem: Interactive shell not installed
Solution:
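# Installs the textual-based interactive TUI
pip install "sqlstream[interactive]"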
"No module named 'pandas'"
Problem: Pandas backend not installed
Solution:
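# Installs the pandas backend
pip install "sqlstream[pandas]"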
"File not found" error
Possible causes:
- Wrong path: Double-check the file name and extension
- Relative vs absolute path: Paths are resolved relative to the directory where you run the command
- Path with spaces: Quote the path so the shell treats it as a single argument
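A couple of illustrative commands (paths are placeholders):
# Use an absolute path if a relative path does not resolve
sqlstream query /full/path/to/data.csv "SELECT * FROM data"
# Quote paths that contain spaces
sqlstream query "data files/sales.csv" "SELECT * FROM sales"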
S3 authentication errors
Problem: Can't access S3 files
Solutions:
- Set AWS credentials: Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (see the example below)
- Use AWS config: Credentials from ~/.aws/credentials (e.g., set up with aws configure) are typically picked up automatically
- Check bucket permissions: Ensure you have read access
- Check region: Some buckets require a specific region
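For example, using the standard AWS environment variables that s3fs reads (all values are placeholders):
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1
sqlstream query s3://your-bucket/data.parquet "SELECT * FROM data"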
See S3 Support for details.
Contributing
How can I contribute?
Ways to contribute:
- Report bugs: See "How do I report bugs?" above
- Request features: See "How do I request features?" above
- Fix bugs: Pick an issue labeled "good first issue"
- Add features: Discuss first in GitHub Discussions
- Improve docs: Submit PRs for typos or clarifications
- Write tests: Increase test coverage
- Write examples: Add real-world examples
Get started:
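A typical first setup might look like the following; the repository URL is inferred from the docs site and the exact commands are assumptions, so follow the Contributing Guide if they differ:
# Clone the repository and install in editable mode with all extras (assumed workflow)
git clone https://github.com/subhayu99/sqlstream.git
cd sqlstream
pip install -e ".[all]"
# Run the test suite
pytest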
See Contributing Guide for details.
Where should I start as a new contributor?
Good first issues:
- Look for the "good first issue" label
- Documentation improvements
- Test coverage improvements
- Bug fixes with reproduction steps
Learning path:
1. Read Architecture Guide
2. Read Volcano Model
3. Browse code in sqlstream/ directory
4. Run tests to understand behavior
5. Pick a small issue to work on