Querying JSON Data
SQLStream allows you to query JSON and JSONL (JSON Lines) files with powerful nested path syntax.
JSON File Format
JSON files can be:
- Array of objects: [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
- Object with records key: {"data": [{"id": 1}, {"id": 2}]}
Basic JSON Querying
Example 1: Simple JSON Array
from sqlstream import query
# data.json: [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
results = query("data.json").sql("SELECT * FROM data WHERE age > 25")
for row in results:
print(row)
Example 2: JSON with Nested Key
# api_response.json: {"result": {"users": [{"id": 1, "name": "Alice"}]}}
# Use fragment syntax to specify nested path
results = query("api_response.json#json:result.users").sql(
"SELECT * FROM users"
)
Advanced: Nested Path Syntax
SQLStream supports JSONPath-like syntax for accessing nested data:
Path Operators
| Syntax | Description | Example |
|---|---|---|
key |
Simple key access | users |
key.nested |
Nested object access | result.users |
key[0] |
Array index | users[0].transactions |
key[] |
Flatten array | users[].transactions |
| Combined | Any combination | data.users[].orders |
Example 3: Array Indexing
# Get transactions for the first user only
data = {
"users": [
{"name": "Alice", "transactions": [{"id": 1, "amount": 50}]},
{"name": "Bob", "transactions": [{"id": 2, "amount": 100}]}
]
}
# Access first user's transactions
results = query("data.json#json:users[0].transactions").sql(
"SELECT * FROM transactions"
)
Example 4: Array Flattening
# Merge transactions from ALL users into a single table
results = query("data.json#json:users[].transactions").sql(
"SELECT * FROM transactions WHERE amount > 50"
)
# This flattens [users] and extracts transactions from each
Example 5: Deeply Nested Data
# Complex nested structure
data = {
"api_version": "1.0",
"response": {
"data": {
"orders": [
{"id": 1, "status": "completed"},
{"id": 2, "status": "pending"}
]
}
}
}
# Navigate to deeply nested orders
results = query("api.json#json:response.data.orders").sql(
"SELECT * FROM orders WHERE status = 'completed'"
)
JSONL (JSON Lines) Format
JSONL files contain one JSON object per line - perfect for streaming and large datasets.
Format
{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": 35}
Querying JSONL
from sqlstream import query
# Query JSONL file (auto-detected by .jsonl extension)
results = query("logs.jsonl").sql(
"SELECT * FROM logs WHERE age > 25"
)
# Or specify format explicitly
results = query("data#jsonl").sql("SELECT * FROM data")
Why JSONL?
- ✅ Streaming: Process line-by-line without loading entire file
- ✅ Append-friendly: Easy to add new records
- ✅ Common format: Used by APIs, logging systems, data dumps
- ✅ Large files: Better for massive datasets than standard JSON
HTTP/URL Support
Query JSON/JSONL from URLs:
# Remote JSON
url = "https://api.example.com/users.json"
results = query(url).sql("SELECT * FROM users WHERE active = true")
# Remote JSONL
url = "https://example.com/logs.jsonl"
results = query(url).sql("SELECT * FROM logs WHERE level = 'ERROR'")
# With nested path
url = "https://api.example.com/data.json#json:result.items"
results = query(url).sql("SELECT * FROM items")
S3 Support
Query JSON/JSONL directly from S3:
# S3 JSON file
results = query("s3://my-bucket/data.json").sql("SELECT * FROM data")
# S3 JSONL file
results = query("s3://my-bucket/logs.jsonl").sql(
"SELECT * FROM logs WHERE timestamp > '2024-01-01'"
)
Auto-Detection
SQLStream automatically detects JSON format:
# By extension
query("data.json") # → JSON Reader
query("logs.jsonl") # → JSONL Reader
# By fragment
query("file#json") # → Force JSON Reader
query("file#jsonl") # → Force JSONL Reader
# By content (HTTP URLs)
query("https://api.example.com/data") # Detects JSON from content
Installation
JSON/JSONL support is included in the base installation:
No additional dependencies required!
Real-World Examples
Example: API Response Processing
# Process GitHub API response
url = "https://api.github.com/repos/owner/repo/issues"
results = query(url).sql("""
SELECT
number,
title,
state,
user.login as author
FROM issues
WHERE state = 'open'
ORDER BY created_at DESC
LIMIT 10
""")
Example: Log Analysis
# Analyze application logs (JSONL format)
results = query("app.jsonl").sql("""
SELECT
level,
COUNT(*) as count
FROM logs
WHERE timestamp >= '2024-01-01'
GROUP BY level
ORDER BY count DESC
""")
Example: Nested E-commerce Data
# Flatten order items from all customers
data = {
"customers": [
{
"id": 1,
"name": "Alice",
"orders": [
{"id": "o1", "total": 100},
{"id": "o2", "total": 200}
]
},
{
"id": 2,
"name": "Bob",
"orders": [
{"id": "o3", "total": 150}
]
}
]
}
# Get all orders across all customers
results = query("sales.json#json:customers[].orders").sql("""
SELECT
id,
total
FROM orders
WHERE total > 100
""")
Limitations
- Single
[]operator per path (e.g.,a[].b[].cnot supported) - JSON files loaded into memory (use JSONL for large datasets)
- Nested path syntax only for JSON (not JSONL)