Real-World Scenario: Sales Analytics
This example demonstrates the power of SQLStream by joining three different file formats located on the web into a single result set.
The Goal
Generate a "Department Sales Performance" report by combining: 1. Employees (CSV) - Who made the sale? 2. Departments (CSV) - Which department are they in? 3. Sales (Parquet) - Transaction details.
The Code
from sqlstream import query
# Base URL for raw files
base = "https://github.com/subhayu99/sqlstream/raw/main/examples"
# 1. Join Sales (Parquet) -> Employees (CSV) -> Departments (CSV)
# 2. Aggregate revenue by Department Name
sql = f"""
SELECT
d.dept_name,
COUNT(s.transaction_id) as sales_count,
SUM(s.amount) as total_revenue,
ROUND(AVG(s.amount), 2) as avg_ticket_size
FROM '{base}/sales.parquet' s
JOIN '{base}/employees.csv' e ON s.emp_id = e.id
JOIN '{base}/departments.csv' d ON e.dept_id = d.dept_id
GROUP BY d.dept_name
ORDER BY total_revenue DESC
"""
print("Running Federation Query...")
results = query().sql(sql, backend="duckdb")
# Output results
for row in results:
print(f"{row['dept_name']}: ${row['total_revenue']} ({row['sales_count']} sales)")
Advanced: Adding JSON Product Data
Let's verify inventory levels for items sold in the "North" region.
# Join Sales (Parquet) -> Products (JSON)
sql = f"""
SELECT
p.name as product_name,
p.stock as current_stock,
SUM(s.amount) as sales_in_north
FROM '{base}/sales.parquet' s
JOIN '{base}/products.json' p ON s.product_id = p.id
WHERE s.region = 'North'
GROUP BY p.name, p.stock
ORDER BY sales_in_north DESC
LIMIT 3
"""
results = query().sql(sql, backend="duckdb")