Aggregation is one of the most fundamental operations in query processing.
Whether computing a SUM, COUNT, AVG, or GROUP BY, the database engine must decide how to organize tuples into groups and apply aggregate functions.
Two dominant strategies exist: hash aggregation and sort-based aggregation.
The choice between them has significant implications for memory consumption, I/O cost, CPU efficiency, and performance under varying data distributions.
The Aggregation Problem
Given a relation R and a set of grouping attributes G, the aggregation operator must partition tuples of R into groups that share the same values of G, then computes one or more aggregate functions over each group.
The operator produces one output tuple per distinct group.
This is conceptually simple, but the implementation must handle cases where the number of distinct groups is small (a few hundred) or enormous (millions), where the input fits in memory or spills to disk, and where downstream operators may benefit from sorted output.
Sort-Based Aggregation
Sort-based aggregation works by first sorting the input on the grouping columns, and then performing a single linear scan to compute aggregates.
Because sorting brings all tuples with identical group keys into contiguous positions, the scan needs only a constant amount of state: the current group key and the running aggregate accumulators.
Advantages
Sorted output. If the query plan includes an ORDER BY on the grouping columns, or if a downstream merge join benefits from sorted input, sort-based aggregation amortizes the sorting cost across multiple operators.
This is sometimes called "interesting order" propagation in query optimization (Selinger et al., 1979).
Bounded memory for the scan phase. Once sorted, the aggregation scan itself requires O(1) memory per group transition.
The memory challenge is entirely in the sort phase, which has well-understood external sorting algorithms.
Predictable I/O patterns. External merge sort produces sequential I/O during both the run-generation and merge phases.
On spinning disks, this advantage is substantial compared to the random I/O patterns that hash spilling can produce.
Handles skew gracefully. Sorting performance is largely independent of key distribution.
Whether groups are uniformly distributed or heavily skewed, the sort cost remains O(N log N).
Disadvantages
The O(N log N) cost is unavoidable even when the number of distinct groups is small.
If a table has 100 million rows but only 50 distinct group keys, sorting all 100 million rows is wasteful compared to maintaining a 50-entry hash table.
Hash Aggregation
Hash aggregation builds a hash table keyed on the grouping columns.
For each input tuple, the operator probes the hash table.
If an entry for that group exists, the aggregate accumulators are updated in place.
If not, a new entry is created.
After processing all input, the hash table is scanned to produce output tuples.
Advantages
O(N) expected time. Each tuple requires one hash computation and one hash table probe, both O(1) expected cost.
No comparison-based sorting lower bound applies.
Efficient when groups fit in memory. If the hash table for all distinct groups fits in memory, the entire operation completes in a single pass with no I/O beyond reading the input.
Early aggregation. Because accumulators are updated incrementally, hash aggregation naturally performs early (partial) aggregation.
This reduces the volume of data that must be spilled to disk when memory is insufficient.
Disadvantages
Unsorted output. The result emerges in hash table iteration order, which is effectively random.
If a subsequent operator requires sorted data, a separate sort is needed.
Memory pressure. The hash table must hold one entry per distinct group.
When the number of distinct groups exceeds available memory, the operator must spill partitions to disk and recursively aggregate them, which introduces complexity.
Hash table overhead. Each entry in the hash table carries overhead for pointers, hash values, and potential chaining structures.
For very wide group keys, the memory footprint can be significant.
Sensitivity to hash function quality. Poor hash functions or adversarial inputs can degrade probe performance from O(1) to O(N) in the worst case, though modern hash functions make this rare in practice.
Walkthrough
Hash Aggregation: Step by Step
Consider a query: SELECT department, SUM(salary) FROM employees GROUP BY department.
HASH_AGGREGATE(input, group_cols, agg_func):
ht = new HashTable()
for each tuple t in input:
key = extract(t, group_cols)
entry = ht.lookup(key)
if entry exists:
entry.accumulator = agg_func.merge(entry.accumulator, t)
else:
ht.insert(key, agg_func.init(t))
for each entry in ht:
emit(entry.key, agg_func.finalize(entry.accumulator))
When the hash table exceeds memory, the operator partitions input tuples by hash value into B partitions, which are written to disk.
Each partition is then loaded independently and is aggregated.
If a single partition still exceeds memory, recursive partitioning is applied.
The key correctness invariant in external hash aggregation is that tuples belonging to the same group must end up in the same partition.
Groups whose keys were aggregated into the in-memory hash table before spilling began must be flushed and written to their corresponding partition files (not emitted as final output), unless the implementation can guarantee those keys will not appear in any spilled partition.
A common safe approach is to flush all in-memory state to partition files whenever a spill is triggered, then process each partition independently:
HASH_AGGREGATE_EXTERNAL(input, group_cols, agg_func, mem_budget):
B = mem_budget / page_size
partitions = array of B empty files
// Phase 1: Partition input (with optional partial aggregation per partition)
// Use a secondary hash function to assign groups to partitions,
// ensuring all tuples for a given group key go to the same partition.
ht = new HashTable(mem_budget)
for each tuple t in input:
key = extract(t, group_cols)
p = hash2(key) mod B
if ht.has_space_for_partition(p) or ht.contains_in_partition(p, key):
merge or insert into ht partition p
else:
// Flush partition p's in-memory entries to disk, then write t
flush ht partition p to partitions[p]
write t to partitions[p]
// Flush remaining in-memory entries to their partition files
for each partition p with entries in ht:
flush ht partition p to partitions[p]
// Phase 2: Fully aggregate each partition independently
for each partition p in partitions:
if p fits in memory:
HASH_AGGREGATE(p, group_cols, agg_func) // emits final output
else:
HASH_AGGREGATE_EXTERNAL(p, group_cols, agg_func, mem_budget)
Sort-Based Aggregation: Step by Step
SORT_AGGREGATE(input, group_cols, agg_func):
sorted_input = external_sort(input, group_cols)
current_key = null
accumulator = null
for each tuple t in sorted_input:
key = extract(t, group_cols)
if key == current_key:
accumulator = agg_func.merge(accumulator, t)
else:
if current_key is not null:
emit(current_key, agg_func.finalize(accumulator))
current_key = key
accumulator = agg_func.init(t)
if current_key is not null:
emit(current_key, agg_func.finalize(accumulator))
The external sort itself uses a standard run-generation and k-way merge approach, with cost O(N log_M N) where M is the memory available in pages.
Cost Comparison
| Factor | Hash Aggregation | Sort-Based Aggregation |
|---|---|---|
| Time complexity | O(N) expected | O(N log N) |
| Memory requirement | O(G) where G = distinct groups | O(M) sort buffer, and O(1) for scan |
| I/O (in-memory case) | Single pass | Sort passes + single scan |
| I/O (spill case) | 2N I/O per partition level (read + write each tuple per recursive level) | 2N per sort pass; typically 2–3 passes total |
| Output order | Unordered | Sorted on group key |
| Skew sensitivity | Partition imbalance possible | Largely insensitive |
Note on spill I/O: Both strategies incur roughly 2N I/O per level/pass.
Hash aggregation with recursive partitioning requires one level per log_B(G/M) depth; sort-based aggregation requires ceil(log_M(N/M)) passes.
In practice, with sufficient memory, both converge to 2–3 passes for typical workloads.
When the number of distinct groups G is much smaller than the input size N (high aggregation ratio), hash aggregation wins decisively.
The hash table stays small, everything fits in memory, and the operation completes in a single O(N) pass.
When G approaches N (low aggregation ratio, many distinct groups), the hash table grows large and may spill.
In this regime, sort-based aggregation becomes competitive because its cost is predictable and it produces useful sorted output.
How Query Optimizers Decide
Modern query optimizers consider several factors when choosing between these strategies:
Cardinality estimates. The optimizer estimates the number of distinct groups.
A high group count generally favors sort-based aggregation when sorted output benefits downstream operators or when memory is constrained; a low group count strongly favors hashing.
Available memory. Systems with generous memory budgets favor hash aggregation because spilling becomes unlikely.
Interesting orders. If the plan already produces sorted input (from an index scan or a preceding sort), sort-based aggregation can skip the sort entirely and aggregate during the scan.
Conversely, if a downstream operator needs sorted output, sort-based aggregation provides it for free.
Pre-existing partitioning. In parallel/distributed systems, if data is already hash-partitioned on the group key, hash aggregation can proceed locally without redistribution.
Many systems (PostgreSQL, SQL Server, and DuckDB) implement both strategies and let the optimizer choose based on cost models.
Some systems, like older versions of MySQL, historically supported only one strategy, which could result in suboptimal plans for certain workloads.
Hybrid Approaches
Several systems use hybrid strategies.
A common pattern is pre-aggregation (also called partial or local aggregation) using a small hash table, followed by a sort-based final aggregation.
This approach reduces the volume of data fed into the sort, capturing the benefits of both methods.
Vectorized engines like DuckDB and HyPer use hash aggregation with overflow-aware partitioning that avoids worst-case recursive spilling by monitoring partition sizes during execution and adaptively switching strategies.
Another hybrid is the sorted hash aggregation used in some systems, where the hash table is periodically flushed in sorted order, producing sorted runs that are later merged.
This provides sorted output without a full separate sort pass.
Key Points
- Hash aggregation achieves O(N) expected time by maintaining a hash table of group accumulators, making it optimal when distinct groups fit in memory.
- Sort-based aggregation costs O(N log N) but produces sorted output, which can eliminate redundant sorts elsewhere in the query plan.
- The ratio of input size to distinct group count (aggregation ratio) is the primary factor determining which strategy is cheaper.
- When input is already sorted on the grouping key (e.g., from an index scan), sort-based aggregation degenerates to a simple O(N) linear scan.
- Hash aggregation is sensitive to memory pressure; when spilling occurs, recursive partitioning introduces additional I/O and complexity.
- Modern optimizers choose between the two strategies using cardinality estimates, available memory, and the presence of interesting orders in the query plan.
- Hybrid approaches (partial hash pre-aggregation followed by sort-based final aggregation) can capture advantages of both strategies.
References
P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, T. Price. "Access Path Selection in a Relational Database Management System." Proceedings of ACM SIGMOD, 1979.
G. Graefe. "Query Evaluation Techniques for Large Databases." ACM Computing Surveys, Vol. 25, No. 2, 1993.
G. Graefe, R. Bunker, S. Cooper. "Hash Joins and Hash Teams in Microsoft SQL Server." Proceedings of VLDB, 1998.
P. Larson. "Data Reduction by Partial Preaggregation." Proceedings of IEEE ICDE, 2002.