JOIN Operations
SQLStream supports joining data from multiple files, allowing you to combine datasets based on common columns. This is particularly powerful as it allows you to treat separate files (CSV, Parquet) as if they were tables in a relational database.
Syntax
The syntax follows standard SQL conventions. You can specify the files directly in the query using string literals.
Supported Join Types
SQLStream currently supports:
- INNER JOIN: Returns records that have matching values in both tables.
- LEFT JOIN: Returns all records from the left table, and the matched records from the right table.
Cross-Format Joins
You can join files of different formats. For example, you can join a CSV file with a Parquet file:
SELECT
users.name,
orders.amount
FROM 'users.csv' users
JOIN 'orders.parquet' orders
ON users.user_id = orders.user_id
Performance Considerations
- Backend: Using the
pandasbackend is generally faster for joins on larger datasets as it leverages optimized merge algorithms. - Memory: The Python backend uses a nested-loop or hash join implementation which streams data, making it memory efficient but potentially slower for large datasets compared to Pandas.