Querying Parquet Files

Parquet is a column-oriented binary file format that provides high performance and efficient compression. SQLStream supports querying Parquet files directly, often with significant performance benefits due to column pruning and predicate pushdown.

Basic Usage

You can query a remote Parquet file just like a CSV.

Data Source: sales.parquet

Example 1: Basic Select

sqlstream query "SELECT * FROM 'https://github.com/subhayu99/sqlstream/raw/main/examples/sales.parquet' LIMIT 5"

Example 2: Aggregating Sales by Region

SQLStream (and the underlying engines) can optimize this query by only reading the region and amount columns.

from sqlstream import query

url = "https://github.com/subhayu99/sqlstream/raw/main/examples/sales.parquet"

results = query().sql(f"""
    SELECT region, SUM(amount) as total_sales
    FROM '{url}'
    GROUP BY region
    ORDER BY total_sales DESC
""")

print(results.to_list())

Mixing Formats (CSV + Parquet)

You can join data across different file formats. Here we join a CSV of employees with a Parquet file of sales transactions.

Data:

Employees (CSV): employees.csv
Sales (Parquet): sales.parquet

from sqlstream import query

base_url = "https://github.com/subhayu99/sqlstream/raw/main/examples"

# Find top performing employees by sales amount
sql = f"""
    SELECT 
        e.name, 
        e.dept_id, 
        SUM(s.amount) as total_revenue
    FROM '{base_url}/employees.csv' e
    JOIN '{base_url}/sales.parquet' s ON e.id = s.emp_id
    GROUP BY e.name, e.dept_id
    ORDER BY total_revenue DESC
    LIMIT 5
"""

results = query().sql(sql, backend='duckdb')