Querying JSON Data

SQLStream allows you to query JSON and JSONL (JSON Lines) files with powerful nested path syntax.

JSON File Format

JSON files can be: - Array of objects: [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}] - Object with records key: {"data": [{"id": 1}, {"id": 2}]}

Basic JSON Querying

Example 1: Simple JSON Array

from sqlstream import query

# data.json: [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
results = query("data.json").sql("SELECT * FROM data WHERE age > 25")

for row in results:
    print(row)

Example 2: JSON with Nested Key

# api_response.json: {"result": {"users": [{"id": 1, "name": "Alice"}]}}

# Use fragment syntax to specify nested path
results = query("api_response.json#json:result.users").sql(
    "SELECT * FROM users"
)

Advanced: Nested Path Syntax

SQLStream supports JSONPath-like syntax for accessing nested data:

Path Operators

Syntax	Description	Example
`key`	Simple key access	`users`
`key.nested`	Nested object access	`result.users`
`key[0]`	Array index	`users[0].transactions`
`key[]`	Flatten array	`users[].transactions`
Combined	Any combination	`data.users[].orders`

Example 3: Array Indexing

# Get transactions for the first user only
data = {
    "users": [
        {"name": "Alice", "transactions": [{"id": 1, "amount": 50}]},
        {"name": "Bob", "transactions": [{"id": 2, "amount": 100}]}
    ]
}

# Access first user's transactions
results = query("data.json#json:users[0].transactions").sql(
    "SELECT * FROM transactions"
)

Example 4: Array Flattening

# Merge transactions from ALL users into a single table
results = query("data.json#json:users[].transactions").sql(
    "SELECT * FROM transactions WHERE amount > 50"
)

# This flattens [users] and extracts transactions from each

Example 5: Deeply Nested Data

# Complex nested structure
data = {
    "api_version": "1.0",
    "response": {
        "data": {
            "orders": [
                {"id": 1, "status": "completed"},
                {"id": 2, "status": "pending"}
            ]
        }
    }
}

# Navigate to deeply nested orders
results = query("api.json#json:response.data.orders").sql(
    "SELECT * FROM orders WHERE status = 'completed'"
)

JSONL (JSON Lines) Format

JSONL files contain one JSON object per line - perfect for streaming and large datasets.

Format

{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
{"id": 3, "name": "Charlie", "age": 35}

Querying JSONL

from sqlstream import query

# Query JSONL file (auto-detected by .jsonl extension)
results = query("logs.jsonl").sql(
    "SELECT * FROM logs WHERE age > 25"
)

# Or specify format explicitly
results = query("data#jsonl").sql("SELECT * FROM data")

Why JSONL?

✅ Streaming: Process line-by-line without loading entire file
✅ Append-friendly: Easy to add new records
✅ Common format: Used by APIs, logging systems, data dumps
✅ Large files: Better for massive datasets than standard JSON

HTTP/URL Support

Query JSON/JSONL from URLs:

# Remote JSON
url = "https://api.example.com/users.json"
results = query(url).sql("SELECT * FROM users WHERE active = true")

# Remote JSONL
url = "https://example.com/logs.jsonl"
results = query(url).sql("SELECT * FROM logs WHERE level = 'ERROR'")

# With nested path
url = "https://api.example.com/data.json#json:result.items"
results = query(url).sql("SELECT * FROM items")

S3 Support

Query JSON/JSONL directly from S3:

# S3 JSON file
results = query("s3://my-bucket/data.json").sql("SELECT * FROM data")

# S3 JSONL file
results = query("s3://my-bucket/logs.jsonl").sql(
    "SELECT * FROM logs WHERE timestamp > '2024-01-01'"
)

Auto-Detection

SQLStream automatically detects JSON format:

# By extension
query("data.json")    # → JSON Reader
query("logs.jsonl")   # → JSONL Reader

# By fragment
query("file#json")    # → Force JSON Reader
query("file#jsonl")   # → Force JSONL Reader

# By content (HTTP URLs)
query("https://api.example.com/data")  # Detects JSON from content

Installation

JSON/JSONL support is included in the base installation:

pip install sqlstream

No additional dependencies required!

Real-World Examples

Example: API Response Processing

# Process GitHub API response
url = "https://api.github.com/repos/owner/repo/issues"

results = query(url).sql("""
    SELECT 
        number,
        title,
        state,
        user.login as author
    FROM issues
    WHERE state = 'open'
    ORDER BY created_at DESC
    LIMIT 10
""")

Example: Log Analysis

# Analyze application logs (JSONL format)
results = query("app.jsonl").sql("""
    SELECT 
        level,
        COUNT(*) as count
    FROM logs
    WHERE timestamp >= '2024-01-01'
    GROUP BY level
    ORDER BY count DESC
""")

Example: Nested E-commerce Data

# Flatten order items from all customers
data = {
    "customers": [
        {
            "id": 1,
            "name": "Alice",
            "orders": [
                {"id": "o1", "total": 100},
                {"id": "o2", "total": 200}
            ]
        },
        {
            "id": 2,
            "name": "Bob",
            "orders": [
                {"id": "o3", "total": 150}
            ]
        }
    ]
}

# Get all orders across all customers
results = query("sales.json#json:customers[].orders").sql("""
    SELECT 
        id,
        total
    FROM orders
    WHERE total > 100
""")

Limitations

Single [] operator per path (e.g., a[].b[].c not supported)
JSON files loaded into memory (use JSONL for large datasets)
Nested path syntax only for JSON (not JSONL)