Optimizers Reference
Query optimization strategies.
Optimizer
Optimizer
Bases: ABC
Base class for all query optimizers
Each optimizer implements a single optimization rule. Optimizers are applied in a pipeline before query execution.
Source code in sqlstream/optimizers/base.py
__init__
can_optimize
abstractmethod
Check if this optimization can be applied
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization is applicable |
Source code in sqlstream/optimizers/base.py
optimize
abstractmethod
Apply the optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Modifies
- reader: Sets optimization hints
- self.applied: Marks optimization as applied
- self.description: Describes what was optimized
Source code in sqlstream/optimizers/base.py
get_name
abstractmethod
get_description
Get description of what was optimized
Returns:
| Type | Description |
|---|---|
str
|
Description string if applied, empty string otherwise |
was_applied
OptimizerPipeline
OptimizerPipeline
Pipeline that applies multiple optimizers in sequence
Optimizers are applied in order, and each can build on the previous optimizations.
Source code in sqlstream/optimizers/base.py
__init__
Initialize pipeline
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
optimizers
|
list[Optimizer]
|
List of optimizers to apply in order |
required |
optimize
Apply all optimizers in sequence
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Source code in sqlstream/optimizers/base.py
get_applied_optimizations
Get list of optimizations that were applied
Returns:
| Type | Description |
|---|---|
list[str]
|
List of optimization descriptions |
Source code in sqlstream/optimizers/base.py
get_summary
Get summary of all applied optimizations
Returns:
| Type | Description |
|---|---|
str
|
Human-readable summary |
Source code in sqlstream/optimizers/base.py
ColumnPruningOptimizer
ColumnPruningOptimizer
Bases: Optimizer
Prune (skip reading) unused columns
Benefits: - Massive I/O reduction for wide tables - Reduces memory usage - Critical for columnar formats (Parquet, ORC) - Can read 10x faster if selecting 1 column from 10
Example
SELECT name, age FROM employees -- 100 columns total
Without pruning: Read all 100 columns → Project 2 With pruning: Read only 2 columns → Much faster
Source code in sqlstream/optimizers/column_pruning.py
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 | |
can_optimize
Check if column pruning is applicable
Conditions: 1. Reader supports column selection 2. Not SELECT * (can't prune if all columns needed)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization can be applied |
Source code in sqlstream/optimizers/column_pruning.py
optimize
Apply column pruning optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Source code in sqlstream/optimizers/column_pruning.py
ColumnStatistics
ColumnStatistics
dataclass
Statistics about a single column
Attributes:
| Name | Type | Description |
|---|---|---|
distinct_count |
int
|
Number of distinct values (cardinality) |
null_count |
int
|
Number of NULL values |
min_value |
Any
|
Minimum value |
max_value |
Any
|
Maximum value |
avg_length |
float
|
Average value length (for strings) |
Source code in sqlstream/optimizers/cost_based.py
CostBasedOptimizer
CostBasedOptimizer
Bases: Optimizer
Cost-based optimization framework
This is a meta-optimizer that provides infrastructure for cost-based decisions in other optimizers.
Benefits: - Statistics-driven decisions - Better join ordering - Better index selection (future) - Adaptive query execution (future)
Note
This is a framework/placeholder. Real cost-based optimization requires statistics collection, which is expensive. For now, we just provide the infrastructure and simple cost models.
Source code in sqlstream/optimizers/cost_based.py
199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 | |
can_optimize
Check if cost-based optimization is applicable
For now, this is disabled as it requires statistics collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
False (disabled for now) |
Source code in sqlstream/optimizers/cost_based.py
optimize
Apply cost-based optimizations
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Note
This is a placeholder for future implementation
Source code in sqlstream/optimizers/cost_based.py
collect_statistics
Collect statistics from a data source
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reader
|
BaseReader
|
Data source to collect stats from |
required |
sample_size
|
int
|
Number of rows to sample (for efficiency) |
1000
|
Returns:
| Type | Description |
|---|---|
TableStatistics
|
Table statistics |
Note
This is expensive - requires reading data In production, stats would be cached and updated periodically
Source code in sqlstream/optimizers/cost_based.py
CostModel
CostModel
Cost model for estimating query operation costs
Costs are in abstract units. Lower is better. The goal is to compare different plans, not to predict absolute runtime.
Source code in sqlstream/optimizers/cost_based.py
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 | |
estimate_scan_cost
classmethod
Estimate cost of scanning a table
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
row_count
|
int
|
Number of rows to scan |
required |
Returns:
| Type | Description |
|---|---|
float
|
Estimated cost |
estimate_filter_cost
classmethod
Estimate cost of filtering rows
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
row_count
|
int
|
Number of input rows |
required |
selectivity
|
float
|
Fraction of rows that pass filter (0.0-1.0) |
0.1
|
Returns:
| Type | Description |
|---|---|
float
|
Estimated cost |
Source code in sqlstream/optimizers/cost_based.py
estimate_join_cost
classmethod
Estimate cost of hash join
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
left_rows
|
int
|
Number of rows in left table |
required |
right_rows
|
int
|
Number of rows in right table |
required |
selectivity
|
float
|
Fraction of cartesian product that matches |
0.1
|
Returns:
| Type | Description |
|---|---|
float
|
Estimated cost |
Source code in sqlstream/optimizers/cost_based.py
estimate_sort_cost
classmethod
Estimate cost of sorting
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
row_count
|
int
|
Number of rows to sort |
required |
Returns:
| Type | Description |
|---|---|
float
|
Estimated cost (O(N log N)) |
Source code in sqlstream/optimizers/cost_based.py
estimate_selectivity
classmethod
Estimate selectivity of a filter condition
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
condition
|
Condition
|
Filter condition |
required |
stats
|
ColumnStatistics | None
|
Column statistics (if available) |
None
|
Returns:
| Type | Description |
|---|---|
float
|
Estimated selectivity (0.0-1.0) |
Note
These are rough heuristics. Real databases use histograms.
Source code in sqlstream/optimizers/cost_based.py
TableStatistics
TableStatistics
dataclass
Statistics about a table/data source
Attributes:
| Name | Type | Description |
|---|---|---|
row_count |
int
|
Total number of rows |
column_stats |
dict[str, ColumnStatistics]
|
Per-column statistics (cardinality, min/max, nulls) |
size_bytes |
int
|
Approximate size in bytes |
Source code in sqlstream/optimizers/cost_based.py
JoinReorderingOptimizer
JoinReorderingOptimizer
Bases: Optimizer
Reorder joins to minimize intermediate result size
Benefits: - Smaller intermediate results = less memory - Faster execution (less data to process) - Better cache utilization
Strategy: - For now: Simple heuristic (join smallest first) - Future: Cost-based with statistics
Example
Tables: A (1M rows), B (100 rows), C (1K rows)
Bad order: A JOIN B JOIN C → (1M × 100) JOIN C = huge intermediate result
Good order: B JOIN C JOIN A → (100 × 1K) JOIN A = smaller intermediate result
Note
This is a placeholder implementation. Full join reordering requires table statistics and is complex. For now, we just track that joins could be reordered.
Source code in sqlstream/optimizers/join_reordering.py
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 | |
can_optimize
Check if join reordering is applicable
Conditions: 1. Query has JOIN clause 2. No circular dependencies in join conditions 3. All joins are inner joins (outer joins have order constraints)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization can be applied |
Source code in sqlstream/optimizers/join_reordering.py
optimize
Apply join reordering optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Note
This is a placeholder. Real implementation would: 1. Collect table statistics (row counts) 2. Estimate join selectivity 3. Build join graph 4. Find optimal join order (dynamic programming or greedy) 5. Rewrite AST with new join order 6. Preserve join semantics (INNER vs OUTER)
Source code in sqlstream/optimizers/join_reordering.py
LimitPushdownOptimizer
LimitPushdownOptimizer
Bases: Optimizer
Push LIMIT to the reader for early termination
Benefits: - Stop reading after N rows - Massive speedup for large files - Reduces memory usage
Example
SELECT * FROM large_file.csv LIMIT 10
Without pushdown: Read entire file → Take first 10 With pushdown: Stop reading after 10 rows → Much faster
Note
Cannot push down if query has: - ORDER BY (need to see all rows to sort) - GROUP BY (need to see all rows to group) - Aggregates (need all rows to aggregate) - JOIN (complex - may need all rows)
Source code in sqlstream/optimizers/limit_pushdown.py
can_optimize
Check if limit pushdown is applicable
Conditions: 1. Query has LIMIT clause 2. Reader supports limit pushdown 3. No ORDER BY (would need to read all rows first) 4. No GROUP BY (would need to read all rows first) 5. No aggregates (would need to read all rows first) 6. No JOIN (complex - skip for now)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization can be applied |
Source code in sqlstream/optimizers/limit_pushdown.py
optimize
Apply limit pushdown optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Source code in sqlstream/optimizers/limit_pushdown.py
PartitionPruningOptimizer
PartitionPruningOptimizer
Bases: Optimizer
Prune (skip) partitions that don't match filter conditions
Benefits: - Massive I/O reduction for partitioned datasets - Skip entire directories/files - Critical for data lakes and big data - Can reduce data read by 10x-1000x
Example
Dataset partitioned by date: year=YYYY/month=MM/day=DD/ Query: WHERE date >= '2024-01-01' → Skip all partitions before 2024
Source code in sqlstream/optimizers/partition_pruning.py
can_optimize
Check if partition pruning is applicable
Conditions: 1. Reader supports partition pruning 2. Query has WHERE clause 3. WHERE clause references partition columns
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization can be applied |
Source code in sqlstream/optimizers/partition_pruning.py
optimize
Apply partition pruning optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Source code in sqlstream/optimizers/partition_pruning.py
QueryPlanner
QueryPlanner
Query planner and optimizer orchestrator
Applies a pipeline of optimizations to improve query performance: 1. Join reordering - optimize join order for performance 2. Partition pruning - skip entire partitions/files based on filters 3. Predicate pushdown - push WHERE filters to readers 4. Column pruning - tell readers which columns to read 5. Limit pushdown - early termination for LIMIT queries 6. Projection pushdown - push computed expressions (future)
The planner modifies the reader in-place with optimization hints.
Example
Source code in sqlstream/optimizers/planner.py
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 | |
__init__
Initialize planner with default optimization pipeline
The order matters: 1. Join reordering first (affects join execution plan) 2. Partition pruning second (can skip entire files!) 3. Predicate pushdown third (reduces data read) 4. Column pruning fourth (narrows columns) 5. Limit pushdown fifth (early termination) 6. Projection pushdown last (transform data at source)
Source code in sqlstream/optimizers/planner.py
optimize
Apply all applicable optimizations
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Modifies
- reader: Sets optimization hints (filters, columns, limit, etc.)
- self.optimizations_applied: List of applied optimizations
Source code in sqlstream/optimizers/planner.py
get_optimization_summary
Get summary of optimizations applied
Returns:
| Type | Description |
|---|---|
str
|
Human-readable summary |
Example
"Optimizations applied: - Predicate pushdown: 2 condition(s) - Column pruning: 3 column(s) selected"
Source code in sqlstream/optimizers/planner.py
get_optimizers
Get list of all optimizers in the pipeline
Returns:
| Type | Description |
|---|---|
list
|
List of optimizer instances |
add_optimizer
Add a custom optimizer to the pipeline
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
optimizer
|
Optimizer
|
Optimizer instance to add |
required |
Source code in sqlstream/optimizers/planner.py
PredicatePushdownOptimizer
PredicatePushdownOptimizer
Bases: Optimizer
Push WHERE conditions to the reader
Benefits: - Reduces I/O by filtering at the source - Reduces memory usage - Especially effective for columnar formats (Parquet) - Can leverage indexes if available
Example
SELECT * FROM data WHERE age > 30
Without pushdown: Read all rows → Filter in memory With pushdown: Filter while reading → Less data read
Source code in sqlstream/optimizers/predicate_pushdown.py
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 | |
can_optimize
Check if predicate pushdown is applicable
Conditions: 1. Query has WHERE clause 2. Reader supports pushdown 3. Not a JOIN query (complex - needs smarter analysis)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if optimization can be applied |
Source code in sqlstream/optimizers/predicate_pushdown.py
optimize
Apply predicate pushdown optimization
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Source code in sqlstream/optimizers/predicate_pushdown.py
ProjectionPushdownOptimizer
ProjectionPushdownOptimizer
Bases: Optimizer
Push projection (SELECT expressions) to the reader
Benefits (when implemented): - Evaluate expressions at read time - Reduce data movement - Leverage database/engine native functions
Example (future): SELECT UPPER(name), age * 2 FROM data
With pushdown: Reader evaluates UPPER() and age*2
Without: Read raw data → Apply transformations later
Status: Placeholder - not yet implemented Reason: Requires expression evaluation framework
Source code in sqlstream/optimizers/projection_pushdown.py
can_optimize
Check if projection pushdown is applicable
Currently always returns False as this is not yet implemented.
Future conditions: 1. Reader supports expression evaluation 2. Expressions are supported by reader (native functions) 3. Not complex nested expressions
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |
Returns:
| Type | Description |
|---|---|
bool
|
False (not yet implemented) |
Source code in sqlstream/optimizers/projection_pushdown.py
optimize
Apply projection pushdown optimization
Currently a no-op placeholder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ast
|
SelectStatement
|
Parsed SQL statement |
required |
reader
|
BaseReader
|
Data source reader |
required |