Troubleshooting Guide
This guide helps you diagnose and fix common issues when using SQLstream.
Table of Contents
- Installation Issues
- SQL Syntax Errors
- File Access Errors
- S3 Authentication Errors
- Type Conversion Errors
- Memory Errors
- Performance Issues
- Interactive Shell Issues
- Backend Selection Issues
Installation Issues
Problem: ModuleNotFoundError: No module named 'sqlstream'
Cause: SQLstream is not installed or not in your Python path.
Solution:
# Install from PyPI
pip install sqlstream
# Or install from source
git clone https://github.com/yourusername/sqlstream.git
cd sqlstream
pip install -e .
Problem: ImportError: cannot import name 'Query'
Cause: Circular import or outdated installation.
Solution:
# Reinstall to replace an outdated copy
pip uninstall -y sqlstream
pip install --upgrade sqlstream
Also check that your own script or directory isn't named sqlstream.py / sqlstream, which would shadow the installed package.
Problem: Optional dependencies missing (Pandas, DuckDB, S3, etc.)
Symptom: ImportError or ModuleNotFoundError for an optional package (for example, No module named 'duckdb') when a query needs that backend or reader.
Cause: Optional dependencies not installed.
Solution:
# Install with specific extras
pip install sqlstream[pandas] # Pandas backend
pip install sqlstream[duckdb] # DuckDB backend
pip install sqlstream[parquet] # Parquet support
pip install sqlstream[s3] # S3 support
pip install sqlstream[http] # HTTP sources
pip install sqlstream[html] # HTML table parsing
pip install sqlstream[cli] # Interactive shell
# Install all extras
pip install sqlstream[all]
Problem: Platform-specific installation errors
Windows
Symptom: error: Microsoft Visual C++ 14.0 is required
Solution:
- Install Microsoft C++ Build Tools
- Or use pre-built wheels: pip install --only-binary :all: sqlstream[all]
macOS
Symptom: clang: error: unsupported option '-fopenmp'
Solution:
- Install the Xcode command-line tools: xcode-select --install
- If the error persists, install OpenMP via Homebrew: brew install libomp
- Or use pre-built wheels: pip install --only-binary :all: sqlstream[all]
Linux
Symptom: error: command 'gcc' failed
Solution:
# Ubuntu/Debian
sudo apt-get install build-essential python3-dev
# CentOS/RHEL
sudo yum install gcc gcc-c++ python3-devel
# Then install sqlstream
pip install sqlstream
Problem: Python version incompatibility
Symptom: ERROR: Package 'sqlstream' requires a different Python: 3.9.0 not in '>=3.10'
Solution:
- SQLstream requires Python 3.10 or higher
- Upgrade Python or use a virtual environment:
# Using pyenv
pyenv install 3.11
pyenv local 3.11
# Or using conda
conda create -n sqlstream python=3.11
conda activate sqlstream
pip install sqlstream
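To confirm which interpreter your script will actually run under, a quick standard-library check:
import sys

# SQLstream requires Python 3.10+
print(sys.version)
assert sys.version_info >= (3, 10), "Python 3.10 or newer is required"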
SQL Syntax Errors
Problem: ParseError: Expected SELECT but found 'SHOW'
Cause: Unsupported SQL syntax. SQLstream supports a subset of SQL.
Solution:
- Use supported SQL features: SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
- Check the SQL Support docs for the full list
Supported: plain queries built from the features above, e.g. SELECT name FROM 'data.csv' WHERE age > 30 ORDER BY age LIMIT 10
Not Supported (with Python backend): statements outside that subset, such as SHOW commands or CTEs (WITH ...)
Workaround: Use DuckDB backend for advanced SQL features:
from sqlstream import query
result = query().sql("""
WITH cte AS (SELECT * FROM 'data.csv')
SELECT * FROM cte
""", backend="duckdb")
Problem: ParseError: Invalid column name
Symptom:
Cause: Column names with special characters need quoting.
Solution:
-- Use quotes for column names with special characters
SELECT "user-name", "first.name" FROM 'data.csv'
-- Or rename columns to avoid special characters
SELECT name AS user_name FROM 'data.csv'
Problem: ParseError: Unterminated string literal
Symptom:
Cause: Single quote inside string not escaped.
Solution:
-- Escape single quotes with another single quote
SELECT * FROM 'data.csv' WHERE name = 'O''Brien'
-- Or use different quoting
SELECT * FROM 'data.csv' WHERE name = "O'Brien"
Problem: Reserved keyword conflicts
Symptom:
Cause: Words such as from, to, and date are reserved keywords and cannot be used as bare column names.
Solution: Quote reserved words when referencing them as columns (for example SELECT "from", "to" FROM 'data.csv'), or rename them with AS.
Problem: JOIN syntax errors
Symptom:
Cause: Implicit joins (comma syntax) not fully supported.
Solution:
-- Use explicit JOIN syntax
SELECT *
FROM 'users.csv' AS users
INNER JOIN 'orders.csv' AS orders
ON users.id = orders.user_id
File Access Errors
Problem: FileNotFoundError: CSV file not found: data.csv
Cause: File doesn't exist or path is incorrect.
Solution:
# Use absolute path
from pathlib import Path
data_path = Path("/full/path/to/data.csv")
result = query(str(data_path)).sql("SELECT * FROM source")
# Or use relative path from current working directory
import os
print(os.getcwd()) # Check current directory
result = query("./data/data.csv").sql("SELECT * FROM source")
Problem: PermissionError: Permission denied: 'data.csv'
Cause: Insufficient file permissions.
Solution:
# Check file permissions
ls -l data.csv
# Fix permissions
chmod 644 data.csv # Read/write for owner, read for others
# Or run with appropriate user
sudo chown $USER data.csv
Problem: UnicodeDecodeError: 'utf-8' codec can't decode byte
Cause: File encoding doesn't match specified encoding.
Solution:
from sqlstream import query
# Try different encodings
result = query("data.csv", encoding="latin-1").sql("SELECT * FROM source")
# Or detect encoding
import chardet
with open("data.csv", "rb") as f:
encoding = chardet.detect(f.read())["encoding"]
result = query("data.csv", encoding=encoding).sql("SELECT * FROM source")
Common encodings:
- utf-8 (default, most common)
- latin-1 (Western European)
- windows-1252 (Windows default)
- utf-8-sig (UTF-8 with BOM)
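If you prefer not to depend on chardet, trying the common encodings in order also works; a sketch using the list above (it assumes the reader raises UnicodeDecodeError on a wrong encoding, as in the symptom):
from sqlstream import query

result = None
for enc in ("utf-8", "utf-8-sig", "windows-1252", "latin-1"):
    try:
        result = query("data.csv", encoding=enc).sql("SELECT * FROM source")
        print(f"Decoded with {enc}")
        break
    except UnicodeDecodeError:
        continue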
Problem: IsADirectoryError: Is a directory: 'data'
Cause: Trying to read a directory instead of a file.
Solution:
# Specify the exact file
result = query("data/sales.csv").sql("SELECT * FROM source")
# Or query multiple files
result = query().sql("""
SELECT * FROM 'data/sales_2023.csv'
UNION ALL
SELECT * FROM 'data/sales_2024.csv'
""")
Problem: Network paths not working on Windows
Symptom:
Solution:
# Use raw strings or forward slashes
query(r"\\server\share\data.csv") # Raw string
query("//server/share/data.csv") # Forward slashes (recommended)
S3 Authentication Errors
Problem: botocore.exceptions.NoCredentialsError: Unable to locate credentials
Cause: AWS credentials not configured.
Solution:
Option 1: Environment variables
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="us-east-1" # Optional
python your_script.py
Option 2: AWS credentials file
# Create ~/.aws/credentials
mkdir -p ~/.aws
cat > ~/.aws/credentials <<EOF
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
EOF
chmod 600 ~/.aws/credentials
Option 3: IAM role (EC2/ECS)
- No credentials needed if running on AWS with an IAM role attached
- Ensure the IAM role has S3 read permissions
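To confirm which credentials will actually be picked up, boto3 (already used in the examples below) can report the active identity; a quick check:
import boto3

# None here means no credentials were found in the environment, config files, or instance metadata
print(boto3.Session().get_credentials())
# Shows which user or role your calls will run as
print(boto3.client("sts").get_caller_identity()["Arn"])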
Problem: botocore.exceptions.ClientError: An error occurred (403) Forbidden
Cause: Insufficient S3 permissions.
Solution:
- Check the bucket policy: it must allow s3:GetObject (and s3:ListBucket for listing prefixes) for your principal
- Check the IAM policy: the calling user or role needs the same permissions on the bucket and its objects
- Test access from the command line: aws s3 ls s3://your-bucket/
Problem: s3fs.core.S3FileSystemError: Access Denied
Cause: Credentials work but lack specific permissions.
Solution:
# Test with public bucket first
result = query("s3://public-bucket/data.csv").sql("SELECT * FROM source")
# Check specific object permissions
import boto3
s3 = boto3.client('s3')
response = s3.head_object(Bucket='your-bucket', Key='data.csv')
print(response)
Problem: Cross-region S3 access is slow
Symptom: Queries against S3 bucket in different region are very slow.
Solution:
# Specify the bucket's region explicitly for s3fs
import s3fs
fs = s3fs.S3FileSystem(client_kwargs={'region_name': 'us-west-2'})
# An explicit endpoint URL can be set the same way:
# fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': 'https://s3.us-west-2.amazonaws.com'})
result = query("s3://bucket/data.csv").sql("SELECT * FROM source")
Better solution: Copy data to bucket in same region as your application.
Type Conversion Errors
Problem: TypeError: '>' not supported between instances of 'str' and 'int'
Cause: Type inference failed or column has mixed types.
Solution:
# Type inference happens per-row, may vary
# Manually cast in SQL
result = query("data.csv").sql("""
SELECT * FROM source
WHERE CAST(age AS INTEGER) > 30
""")
Or pre-process the CSV to ensure consistent types:
import pandas as pd
df = pd.read_csv("data.csv")
df['age'] = df['age'].astype(int) # Force type
df.to_csv("data_clean.csv", index=False)
result = query("data_clean.csv").sql("SELECT * FROM source WHERE age > 30")
Problem: Dates not being parsed
Symptom:
# Date column remains as string
result = query("events.csv").sql("SELECT date FROM source")
print(type(result.rows[0]['date'])) # <class 'str'>
Cause: SQLstream's CSV reader doesn't auto-detect dates (yet).
Solution:
# Use DuckDB backend for automatic date parsing
result = query("events.csv").sql("""
SELECT * FROM source
WHERE date > '2024-01-01'
""", backend="duckdb")
# Or manually parse dates
from datetime import datetime
rows = result.rows
for row in rows:
    row['date'] = datetime.strptime(row['date'], '%Y-%m-%d')
Problem: NULL handling inconsistencies
Symptom:
Cause: CSV empty cells become None, but type inference varies.
Solution:
# Normalize NULL checks
def is_null(value):
    return value is None or value == "" or value == "null"
# Or use SQL NULL handling
result = query("data.csv").sql("""
SELECT * FROM source
WHERE field IS NOT NULL
""")
Memory Errors
Problem: MemoryError or system runs out of RAM
Cause: Loading large files into memory.
Solution:
1. Use Pandas backend (more memory efficient):
result = query("very_large_file.csv").sql("SELECT * FROM source WHERE age > 30", backend="pandas")
2. Use DuckDB backend (disk-based, handles GB-scale):
result = query("very_large_file.csv").sql("""
SELECT * FROM source WHERE age > 30
""", backend="duckdb")
3. Use Parquet instead of CSV (for a CSV too large to load in one go, see the chunked conversion sketch after this list):
# Convert CSV to Parquet first (one-time cost)
import pandas as pd
df = pd.read_csv("large.csv")
df.to_parquet("large.parquet")
# Query Parquet (columnar format, much faster)
result = query("large.parquet").sql("SELECT * FROM source")
4. Filter early with predicate pushdown:
# GOOD: Filter pushed down to reader
result = query("large.csv").sql("""
SELECT name FROM source WHERE age > 30
""")
# BAD: Loads all data then filters
result = query("large.csv").sql("SELECT name FROM source")
# Then filter in Python - too late!
5. Select only needed columns:
# GOOD: Only loads 2 columns
result = query("large.csv").sql("SELECT name, age FROM source")
# BAD: Loads all 50 columns
result = query("large.csv").sql("SELECT * FROM source")
Problem: Cannot allocate memory on Linux
Cause: System swap disabled or insufficient.
Solution:
# Check swap
free -h
# Add swap space (temporary)
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Problem: Pandas backend still running out of memory
Symptom:
Solution:
Use chunking with Pandas backend:
# Process in chunks
import pandas as pd
chunks = []
for chunk in pd.read_csv("large.csv", chunksize=100000):
    filtered = chunk[chunk['age'] > 30]
    chunks.append(filtered)
result_df = pd.concat(chunks)
Or switch to the DuckDB backend, which handles larger-than-RAM data:
result = query("large.csv").sql("SELECT * FROM source WHERE age > 30", backend="duckdb")
Performance Issues
Problem: Queries are very slow
Symptom: Simple query takes minutes instead of seconds.
Diagnostic steps:
1. Check backend selection:
# Python backend is slowest (educational/debugging)
result = query("data.csv").sql("...", backend="python") # Slow
# Pandas backend is 10-100x faster
result = query("data.csv").sql("...", backend="pandas") # Fast
# DuckDB backend is 10-1000x faster for complex queries
result = query("data.csv").sql("...", backend="duckdb") # Fastest
2. Check file format:
# CSV is slowest (text parsing)
result = query("data.csv").sql("...") # Slow
# Parquet is 10-100x faster (columnar, compressed)
result = query("data.parquet").sql("...") # Fast
3. Use EXPLAIN to see query plan:
result = query("data.csv").sql("SELECT * FROM source WHERE age > 30")
print(result.explain()) # Shows optimizations applied
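If it's not obvious which backend a slow query is using, timing the same query against each makes the difference visible; a minimal sketch using only the standard library (assumes the pandas and duckdb extras are installed):
import time
from sqlstream import query

sql = "SELECT * FROM source WHERE age > 30"
for backend in ("python", "pandas", "duckdb"):
    start = time.perf_counter()
    query("data.csv").sql(sql, backend=backend)
    print(f"{backend}: {time.perf_counter() - start:.2f}s")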
Problem: HTTP sources are slow
Symptom: Querying HTTP URL takes a long time every time.
Cause: No caching enabled or cache expired.
Solution:
HTTP reader automatically caches in ~/.cache/sqlstream/http/:
# First query: Downloads and caches
result = query("https://example.com/data.csv").sql("SELECT * FROM source")
# Subsequent queries: Uses cache (fast)
result = query("https://example.com/data.csv").sql("SELECT * FROM source")
Check cache:
ls -lh ~/.cache/sqlstream/http/
Clear cache if needed:
rm -rf ~/.cache/sqlstream/http/
Problem: S3 queries are slow
Symptom: Queries against S3 are much slower than local files.
Solutions:
1. Use Parquet on S3:
# CSV on S3: Slow (downloads entire file)
result = query("s3://bucket/large.csv").sql("SELECT * FROM source WHERE age > 30")
# Parquet on S3: Fast (columnar reads, compression)
result = query("s3://bucket/large.parquet").sql("SELECT * FROM source WHERE age > 30")
2. Use partitioned Parquet:
s3://bucket/data/
  year=2023/
    month=01/data.parquet
    month=02/data.parquet
  year=2024/
    month=01/data.parquet
# Only reads relevant partitions
result = query("s3://bucket/data/year=2024/").sql("SELECT * FROM source")
3. Ensure bucket is in same region:
- Cross-region data transfer is slow and expensive
- Use aws s3 cp or AWS DataSync to move data to same region
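To check where a bucket actually lives before moving data, boto3 can report its region (LocationConstraint comes back empty for us-east-1):
import boto3

resp = boto3.client("s3").get_bucket_location(Bucket="your-bucket")
print(resp.get("LocationConstraint") or "us-east-1")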
Problem: JOIN queries are slow
Symptom: Query with JOIN takes very long.
Solution:
1. Join smaller table on the left:
-- GOOD: Small table LEFT JOIN large table
SELECT * FROM 'small.csv' AS s
LEFT JOIN 'large.csv' AS l ON s.id = l.id
-- BAD: Large table LEFT JOIN small table
SELECT * FROM 'large.csv' AS l
LEFT JOIN 'small.csv' AS s ON l.id = s.id
2. Filter before joining:
-- GOOD: Filter first
SELECT * FROM 'users.csv' AS u
INNER JOIN 'orders.csv' AS o ON u.id = o.user_id
WHERE u.active = true AND o.status = 'completed'
-- BAD: Join first, filter later (processes all data)
3. Use DuckDB backend for large joins:
result = query().sql("""
SELECT * FROM 'large1.csv' AS a
INNER JOIN 'large2.csv' AS b ON a.id = b.id
""", backend="duckdb") # Much faster for large joins
Problem: GROUP BY queries are slow
Symptom: Aggregation queries take a long time.
Solution:
# Use DuckDB backend for fast aggregations
result = query("large.csv").sql("""
SELECT category, COUNT(*), SUM(amount)
FROM source
GROUP BY category
""", backend="duckdb")
# Or use Pandas backend
result = query("large.csv").sql("""
SELECT category, COUNT(*), SUM(amount)
FROM source
GROUP BY category
""", backend="pandas")
Interactive Shell Issues
Problem: Shell doesn't start or crashes immediately
Symptom:
Cause: CLI dependencies not installed.
Solution:
pip install sqlstream[cli]
Problem: Shell rendering issues (garbled text, wrong colors)
Cause: Terminal doesn't support required features.
Solution:
1. Use a modern terminal:
- Windows: Windows Terminal, not CMD.exe
- macOS: iTerm2 or the built-in Terminal.app
- Linux: GNOME Terminal, Konsole, or Alacritty
2. Set the TERM environment variable:
export TERM=xterm-256color
3. Check terminal color support:
tput colors  # should report 256 or more
Problem: Query history not persisting
Symptom: Previous queries not available after restarting shell.
Cause: History file not writable or corrupted.
Solution:
# Check history file
ls -l ~/.sqlstream/history.txt
# Fix permissions
chmod 644 ~/.sqlstream/history.txt
# Or delete and recreate
rm ~/.sqlstream/history.txt
# Shell will recreate it on next run
Problem: Keyboard shortcuts not working
Symptom: F2, F4, Ctrl+X don't work in shell.
Cause: Terminal intercepts key codes.
Solution:
Check terminal preferences to ensure function keys are passed to applications:
- macOS Terminal: Preferences → Profiles → Keyboard → "Use Option as Meta key"
- iTerm2: Preferences → Keys → Key Mappings
- Windows Terminal: Settings → Actions
Backend Selection Issues
Problem: ImportError: DuckDB not available
Symptom:
Solution:
pip install sqlstream[duckdb]
Problem: DuckDB backend doesn't support feature
Symptom:
result = query("data.csv").sql("SELECT custom_function(col) FROM source", backend="duckdb")
# Error: Function 'custom_function' not found
Solution:
DuckDB has extensive built-in functions but may not support custom Python functions.
Workaround: Use the Python backend and apply the custom logic in Python, as sketched below.
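A minimal sketch of that workaround (custom_function stands in for your own Python callable, as in the error above):
from sqlstream import query

def custom_function(value):
    ...  # your own logic, as referenced in the error above

# Select the raw column with the Python backend, then apply the function in Python
result = query("data.csv").sql("SELECT col FROM source", backend="python")
values = [custom_function(row["col"]) for row in result.rows]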
Problem: Backend auto-selection chooses wrong backend
Symptom: Query runs slowly because wrong backend was chosen.
Solution:
Explicitly specify backend:
# Force DuckDB for complex analytics
result = query("data.csv").sql("""
SELECT category, AVG(price)
FROM source
GROUP BY category
HAVING AVG(price) > 100
""", backend="duckdb")
# Force Pandas for medium-size data
result = query("data.csv").sql("SELECT * FROM source", backend="pandas")
# Force Python for debugging or custom logic
result = query("data.csv").sql("SELECT * FROM source", backend="python")
Still Having Issues?
If your problem isn't covered here:
- Check the FAQ: docs/faq.md
- Check Limitations: docs/../reference/limitations.md
- Enable verbose logging (see the sketch after this list)
- File a bug report: GitHub Issues
  - Include: OS, Python version, SQLstream version, minimal reproducible example
  - Include: full error traceback
  - Include: query and data sample (if possible)
- Ask on Discussions: GitHub Discussions
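For the verbose logging step, a minimal sketch assuming SQLstream reports diagnostics through Python's standard logging module (the logger name here is a guess):
import logging

# DEBUG-level output for everything, including any sqlstream loggers
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("sqlstream").setLevel(logging.DEBUG)

from sqlstream import query
result = query("data.csv").sql("SELECT * FROM source")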
Last Updated: 2025-12-03