| Document: | DxxxxR0 |
| Date: | 2026-02-14 |
| Reply-To: | Andrew Drakeford <andreedrakeford@hotmail.com> |
| Audience: | LEWG, SG6 (Numerics), WG14 |
C++ today offers two endpoints for reduction:
- std::accumulate: a specified left-fold expression, inherently sequential.
- std::reduce: scalable, but permits reassociation; for non-associative operations (e.g., floating-point addition), the returned value may vary.

This paper specifies a canonical reduction expression structure: for a given input order and topology coordinate (lane count L), the expression — its parenthesization and operand order — is unique and fully specified. Implementations are free to schedule evaluation using parallelization, vectorization, or any other strategy, provided the returned value matches that of the specified expression.
The proposal standardizes the expression structure only. API design is deferred.
This proposal generalizes the Standard Library technique used by
std::accumulate: determinism is obtained by fixing the
abstract expression structure, not by constraining execution strategy or
floating-point arithmetic (see Appendix D). Bitwise identity of results
additionally depends on the floating-point evaluation model (§6).
Reading guide. The normative content of this paper is §4–§5 (~8 pages). Everything else is informative rationale and appendices.
| Reader | Read | Skim | Reference as needed |
|---|---|---|---|
| LEWG reviewer | §1, §2, §4, §5, Polls | §3 (design rationale) | Appendices |
| Implementer | Add Appendix B, N | §3.8 (tree shape rationale) | |
| Numerical analyst | Add §6, Appendix C, K | §3.8 (throughput data) | |
C++ specifies two reduction semantics, but there is a gap:
| Facility | Expression Specified | Scalable | Semantic Model |
|---|---|---|---|
| std::accumulate | ✓ (left-fold) | ✗ | Sequential, specified grouping |
| std::reduce | ✗ (unspecified) | ✓ | Parallel, unspecified grouping |
| Canonical reduction (this proposal) | ✓ (canonical tree) | ✓ | Parallel, specified grouping |
A summary comparison with HPC frameworks and industry practice appears in §7.
std::accumulate specifies a left-fold expression and
therefore a deterministic result for any given binary_op,
but it is inherently sequential. std::reduce enables
scalable execution by permitting reassociation; as a consequence, the
abstract expression is not specified, and for non-associative operations
the returned value may vary.
This proposal closes that gap by defining a canonical,
parallel-friendly expression structure: a deterministic lane
partitioning determined by the topology coordinate L
together with a canonical iterative-pairwise tree over lane results, as
defined in §4.
Because the expression is fixed, the facility also supports heterogeneous verification: the same topology coordinate can be used to reproduce an accelerator-produced result on a CPU by evaluating the identical abstract expression, provided the floating-point evaluation models are aligned.
[Note: In this paper, canonical refers strictly to the abstract expression structure (parenthesization and operand order). Implementations retain freedom in evaluation schedule; the facility provides a stable, specified baseline topology. —end note]
Each standard cycle without a specified parallel reduction compounds fragmentation. Frameworks like Kokkos, TBB, CUB, and oneMKL each provide their own solutions with different semantics. A standard semantic foundation enables convergence.
Why standardize, not just use a library? Today, you cannot write a generic library component that demands a specified reduction expression from the standard toolkit. Standardizing the semantic definition lets vendors and users align on a common contract.
Consider the same 6 floating-point inputs evaluated with the same
+ operator:
[1e16, 1, 1, -1e16, 1, 1]
Because floating-point addition is not associative, different parenthesizations can legitimately produce different results.
- A left fold (std::accumulate) groups as (((((1e16 + 1) + 1) + -1e16) + 1) + 1), which may yield one result (e.g., 4 on many IEEE-754 double implementations) because small terms can survive until after the large cancellation.
- A tree grouping (permitted by std::reduce) can group as ((1e16 + 1) + (1 + -1e16)) + (1 + 1), in which some +1 terms may be lost early, yielding a different result (e.g., 2), depending on evaluation strategy and rounding behavior.

Today, both outcomes are consistent with the Standard because
std::reduce does not fix the abstract expression (it
requires associativity/commutativity to be well-defined). This paper’s
goal is to define a single standard-fixed expression
(“canonical”) so that the returned value is reproducible within a fixed
consistency domain (fixed input order, fixed topology coordinate
L, and a fixed floating‑point evaluation model), while
still permitting parallel execution.
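The difference can be observed directly. The following demonstrator (informative, not part of the proposal) evaluates both parenthesizations explicitly; whether and how the printed values differ depends on the floating-point evaluation model (§6).

#include <cstdio>

int main() {
    const double e[] = {1e16, 1, 1, -1e16, 1, 1};

    // Left fold, the grouping specified by std::accumulate:
    // (((((e0 + e1) + e2) + e3) + e4) + e5)
    double left = ((((e[0] + e[1]) + e[2]) + e[3]) + e[4]) + e[5];

    // One tree grouping permitted by std::reduce:
    // ((e0 + e1) + (e2 + e3)) + (e4 + e5)
    double tree = ((e[0] + e[1]) + (e[2] + e[3])) + (e[4] + e[5]);

    // Both values are valid outcomes today; the grouping is not specified
    // by std::reduce, and the exact values depend on the evaluation model.
    std::printf("left fold: %.17g\ntree:      %.17g\n", left, tree);
}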
Appendix K.3.1 provides a cancellation‑heavy dataset in which
std::reduce exhibits run‑to‑run variability while the
canonical reduction remains stable for fixed L, and in
which varying L intentionally yields different results
(confirming that topology selection is semantic, not a hint).
The C++ Standard already achieves deterministic results for reductions by fixing the abstract expression structure, not by restricting floating‑point arithmetic or execution strategy.
std::accumulate is the canonical example. Its
specification mandates a left‑to‑right fold, fully determining the
parenthesization and left/right operand order of the reduction
expression. As a consequence, implementations are not permitted to
reassociate operations, even when the supplied binary_op is
non‑associative. This guarantee is structural, not numerical: the
Standard does not promise bitwise identity across platforms or builds,
but it does fully specify the abstract expression being evaluated.
This proposal applies the same semantic technique to parallel
reduction. Rather than permitting reassociation (as
std::reduce explicitly does), it standardizes a single
canonical reduction expression that admits parallel evaluation. The
novelty of this proposal lies in the topology of the expression, not in
the mechanism by which determinism is obtained. Appendix D analyzes the
std::accumulate precedent in detail and explains why the
same conformance model applies to the canonical reduction specified in
§4.
This paper proposes semantics only. It seeks LEWG
validation of the fixed expression structure defined in §4 before
committing to API design. This proposal introduces no new requirements
on binary_op beyond invocability and convertibility, and
does not modify existing algorithms.
This is not “just std::reduce + an execution policy”:
std::reduce deliberately leaves the abstract evaluation
expression unspecified (and requires associativity/commutativity for
meaning under parallelization), whereas this paper standardizes a
single fixed canonical expression that enables opt-in
reproducible results across implementations while still permitting
parallel execution.
The facility defines the returned value for a chosen topology
coordinate (lane count L). Execution strategy remains
unconstrained (§4).
Reproducibility under this facility is defined relative to a chosen
topology coordinate L. Different topology selections
intentionally define different abstract expressions and may therefore
produce different results for non-associative operations. Determinism is
guaranteed for a fixed input sequence, fixed topology coordinate
L, and fixed floating-point evaluation model.
For clarity: §4–§5 are normative and define the semantic contract of the facility. All other sections and all appendices are informative and do not introduce additional requirements. Appendix K records external demonstrator programs (Compiler Explorer links) used to validate the semantics on multiple architectures; they are semantic witnesses, not reference implementations and not part of the proposal.
In scope (this paper):
- The semantic definition of the canonical reduction expression (§4–§5), parameterized by the topology coordinate L, with a derived span spelling M (§9.2).

Deferred to a subsequent revision (pending LEWG direction):
- The API surface (new algorithm, execution policy, or range-based overload; see §3.9).
This facility is an opt-in choice for users who
require a stable, specified abstract expression structure (e.g., for
debugging, verification, regression testing, or reproducible numerical
workflows). It is not intended to replace existing high-throughput
facilities. Users who prioritize maximum throughput over expression
identity should continue to use std::reduce (or
domain-specific facilities) where unspecified reassociation is
acceptable.
Evaluation of the canonical expression requires a well-defined
N (the number of elements in the input range). The
iterative pairwise algorithm is single-pass, but the normative
definition (§4) is stated in terms of K = ceil(N/L), so the
specification as written requires N to be known. No
implicit materialization or allocation is performed by the facility. See
Appendix L for further discussion of ranges compatibility.
Exception handling follows the corresponding Standard Library rules for the selected execution mode:
- If evaluation is performed without an execution policy and binary_op throws, the exception is propagated to the caller ([algorithms.general]).
- If evaluation is performed with one of the standard execution policies and binary_op exits via an uncaught exception, std::terminate is called ([algorithms.parallel.exceptions]).

The expression-equivalence (returned-value) guarantee applies only
when evaluation completes normally. If std::terminate is
called under a policy-based evaluation, the state of any outputs and any
externally observable side effects is unspecified.
Evaluation of the canonical reduction for an input range of
N elements performs O(N) applications of
binary_op. Specifically, the canonical expression contains
exactly N − 1 applications of binary_op when
N > 0; absent-operand positions (§4.2.2) do not induce
additional applications.
This matches the work complexity required of
std::reduce. Only work complexity is normative in this
paper. The canonical abstract expression has O(log N) height; this paper
does not require implementations to realize that depth without auxiliary
storage. No guarantees are made about evaluation depth, degree of
parallelism, or auxiliary storage.
[Note: The natural shift-reduce evaluation strategy (§4.2.3) maintains a stack of depth O(log K) per lane, where K = ceil(N/L). For L lanes this implies O(L · log(N/L)) intermediate values of type A as working storage. This is modest in practice (e.g., 8 lanes × 30 stack entries for a billion elements) but is not zero. —end note]
This facility guarantees expression identity: for fixed input order and topology coordinate, the abstract reduction expression is identical across conforming implementations. Bitwise identity of results additionally depends on the floating-point evaluation model; §6 discusses the distinction in detail.
This section catalogs key design alternatives for a reproducible reduction facility and explains why this proposal chooses a user-selectable, interleaved-lane topology with a standard-defined canonical expression.
The goal is to close the “grouping gap” for parallel reductions by providing a facility that:
The following alternatives are evaluated against these requirements.
Every reduction computes an abstract expression: a parenthesized combination of operands with a defined left/right operand order. This expression exists independently of how it is evaluated in time. In C++, the abstract expression has historically been implicit — specified only indirectly through algorithm wording — and has never been named as a distinct semantic concern.
In practice, three concerns determine the behavior of a reduction:
The C++ Standard Library already relies on this separation, but does not articulate it explicitly. As a result, expression structure is routinely conflated with execution strategy, producing persistent confusion.
Existing facilities through this lens. Seen through this model, existing facilities differ primarily in expression ownership, not in parallelism:
- std::accumulate specifies a left-to-right fold. The algorithm fully defines the abstract expression, yielding a deterministic result for a given input order and operation.
- std::reduce explicitly declines to specify the expression structure. It defines a generalized sum, permitting reassociation to enable scalable execution.
- execution::seq does not impose a specific grouping or forbid reassociation; it affects how an algorithm runs, not what expression it computes.

This explains why std::reduce(execution::seq, ...) is
still permitted to produce different results for non-associative
operations: the algorithm has not specified the expression, and the
policy does not add semantic guarantees.
Why execution policies cannot carry expression semantics. It is natural to ask whether a canonical reduction could be expressed as an execution policy rather than a new algorithm. Under the current standard model, execution policies are deliberately non-semantic with respect to the returned value:
- std::reduce already proves the limitation: std::reduce(execution::seq, ...) still has generalized-sum semantics. If seq were sufficient to fix the expression, std::reduce(seq, ...) would collapse into std::accumulate — but the standard explicitly keeps them distinct.

Encoding expression structure in a policy would therefore require a fundamental change to the execution-policy model. This proposal intentionally avoids that scope.
Consequence. Because expression structure is a semantic concern that execution policies cannot express, and because views and range adaptors can reorder traversal but cannot constrain combination (§3.7), expression ownership must reside in the algorithm. This proposal makes the abstract expression explicit and assigns ownership of it to the algorithm, restoring a clean separation between what is computed (expression), what is guaranteed (algorithm), and how it is executed (execution policy).
This separation also clarifies the relationship to executors (§3.10).
A strict left-to-right fold has a fixed evaluation order:
// Left-fold
T result = init;
for (auto it = first; it != last; ++it)
    result = op(result, *it);

This is std::accumulate. It provides run-to-run
stability for a given input order, but it cannot parallelize effectively
because each operation depends on the prior result. The reduction depth
is O(N) rather than O(log N), making it unsuitable for scalable parallel
execution.
A left-fold therefore solves stability but not scalability, and it already exists in the standard library.
A common parallelization strategy is blocked decomposition: assign contiguous chunks to workers and reduce each chunk:
// Blocked (illustrative)
// Thread 0: E[0..N/4)
// Thread 1: E[N/4..N/2)
// Thread 2: E[N/2..3N/4)
// Thread 3: E[3N/4..N)

Blocked decomposition can be efficient, but it does not by itself provide a topology-stable semantic contract. To obtain a fully specified abstract expression, the standard would need to specify:
- the number of chunks,
- the chunk boundaries (the partitioning rule), and
- the grouping and order in which chunk results are combined.
Absent such specification, the resulting expression is naturally coupled to execution strategy (thread count, scheduler, partitioner), which varies across implementations and runs. Fully specifying these details effectively defines a new algorithm and topology — at which point the question becomes: which topology should be standardized?
This proposal selects a topology that is simple to specify canonically and maps well to common hardware structures (see §3.5).
Another approach is to standardize a fixed topology constant. This can appear attractive for simplicity and uniformity, but it ages poorly.
A fixed constant trades long-term efficiency and flexibility for initial simplicity and becomes an ongoing “default value” debate. This proposal instead makes topology selection an explicit part of the semantic contract (§3.5).
Allowing implementations to choose the reduction topology is
precisely the status quo problem. For parallel reductions such as
std::reduce, the Standard permits reassociation, and
therefore permits different abstract expressions across implementations
and settings.
An implementation-defined topology does not close the grouping gap; it merely re-labels it. A reproducibility facility must standardize the expression structure for a chosen topology, rather than leaving that structure implementation-defined.
The facility proposed here is defined by a family of canonical expressions parameterized by a user-selected topology coordinate. For a chosen coordinate, the Standard defines a single abstract expression, and the returned value is the result of evaluating that expression (as-if), independent of execution strategy.
Primary topology coordinate. The canonical
expression is defined in terms of the lane count
L. For intuition: L = 1 degenerates
to a single canonical tree over the input sequence (no lane
interleaving). Larger L introduces SIMD/GPU-friendly lane
parallelism, but still denotes one Standard-defined canonical expression
fully determined by (N, L). The normative definition of the
two-stage reduction (lane partitioning, per-lane canonical tree,
cross-lane canonical tree) appears in §4.
Why interleaved lanes, not contiguous blocks? A
contiguous-block partition (elements [0, N/L) to lane 0,
[N/L, 2N/L) to lane 1, etc.) would also be topology-stable.
The interleaved layout is chosen because it maps directly to SIMD
register filling: a single aligned vector load of L
consecutive elements places one element into each lane simultaneously.
Contiguous blocks would require gathering from L distant
memory locations to fill a register. Additionally, interleaving
guarantees that all lanes contain the same number of elements (to within
one), so all lanes execute the same canonical tree shape — enabling SIMD
lockstep execution without per-lane branching. This uniform tree shape
across lanes is what makes the two-stage decomposition (§4.4) efficient
in practice.
A byte span M may be provided as a derived
convenience by mapping L = M / sizeof(value_type), but
because sizeof(value_type) is implementation-defined (and
may vary across platforms/ABIs), specifying M does not, in
general, select the same canonical expression across implementations.
For layout-stable arithmetic types, the M spelling is a
convenient way to align topology with SIMD register width; it does not
change the semantic definition of the expression (§9.2).
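For illustration only (the names below are constants chosen for this example, not proposed API, and the example assumes value_type is double), the derived mapping might be spelled as:

#include <cstddef>

// Hypothetical derived spelling (§9.2): map a byte span M to a lane count L.
constexpr std::size_t M = 64;                  // e.g. one 512-bit vector register, in bytes
constexpr std::size_t L = M / sizeof(double);  // 8 on platforms where sizeof(double) == 8
static_assert(L >= 1);

Selecting M = 64 denotes the same canonical expression as selecting L = 8 only where sizeof(double) == 8; selecting L directly is the portable spelling.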
A common counter-argument is that because bitwise identity across different ISAs (e.g., x86 vs. ARM) cannot be guaranteed by the C++ Standard alone, the library should not attempt to provide run-to-run stability guarantees. This “all-or-nothing” view overlooks existing industry practice.
A widely used mechanism for improving reproducibility is to constrain computation structure (topology, kernel choice, or execution path) to remove sources of run-to-run variability introduced by parallel execution. This proposal targets one such source: unspecified reassociation inside standard parallel reductions. Fixing the abstract expression is therefore a necessary building block for reproducible parallel reductions; additional conditions (e.g., floating-point evaluation model and environment constraints) remain outside the scope of this paper.
| Library | Feature | Mechanism (documented) | Scope (typical) |
|---|---|---|---|
| Intel oneMKL | Conditional Numerical Reproducibility (CNR) | Constrains execution paths / pins to a logical ISA | Reproducibility across specified CPU configurations |
| NVIDIA CUB | Deterministic reduction variants | Uses a fixed reduction order for deterministic mode | Reproducibility on the same architecture |
| PyTorch / TensorFlow | Deterministic algorithms flags | Disables nondeterministic kernels / selects deterministic kernels | Reproducible training runs (scope varies) |
Intel oneMKL CNR: Intel documents that parallel algorithms can produce different results based on thread counts and ISA, and provides CNR modes to constrain execution paths. See: Intel oneMKL Developer Guide, “Introduction to Conditional Numerical Reproducibility” [IntelCNR].
NVIDIA CUB: NVIDIA distinguishes between “fast” nondeterministic reductions and “deterministic” variants that use a fixed-order reduction. See: NVIDIA CUB API Documentation [NvidiaCUB].
These libraries address reproducibility through different mechanisms and with different scope. This proposal does not claim to replicate their exact guarantees, but draws on the same insight: fixing the computation structure is a useful building block for reproducibility.
By fixing the abstract expression structure, this proposal provides a necessary (but not sufficient) condition for reproducibility. Sufficient conditions depend on the program’s floating-point evaluation model and environment. Without this proposal, even a user with strict control over their floating-point environment cannot achieve reproducible parallel results in general, because the standard library itself permits unspecified reassociation in parallel reductions.
A central tenet of this proposal is that run-to-run stability of the returned value is a property of the evaluation, not the data source. To achieve both logarithmic scalability and topological determinism, the algorithm itself must “own” the reduction tree. This logic cannot be injected into existing algorithms via a View or a Range adaptor.
std::reduce explicitly permits reassociation.
Therefore, even with a deterministic view and a fixed input order,
std::reduce may legally evaluate a different abstract
expression (see [numeric.ops.reduce]).

// A deterministic view does not make std::reduce deterministic:
auto view = data | views::transform(f); // preserves iteration order
auto r1 = std::reduce(std::execution::par, view.begin(), view.end(), init, op);
// r1 may still use an implementation-chosen reassociation.

To provide the guarantee defined in §4, the reduction facility itself must specify and “own” the abstract expression tree. As established in §3.0, expression structure is a semantic concern that cannot reside in execution policies (which constrain scheduling, not grouping), views (which reorder traversal, not combination), or executors (which determine how an expression is evaluated, not what it is). The algorithm is the only standard mechanism that can own expression semantics. This is the semantic gap addressed by this proposal.
Given that a new algorithm must own its expression tree (§3.7.2), the next design question is: which tree construction rule should the Standard specify? Two well-known candidates exist: iterative pairwise (shift-reduce) and recursive bisection. This subsection records the considerations that inform the choice.
Arguments favoring iterative pairwise (recommended):
Industry alignment: SIMD and GPU implementations commonly use iterative pairwise because it maps directly to hardware primitives (warp shuffle-down, vector lane pairing). This includes NVIDIA CUB’s deterministic reduction modes.
Implementation naturalness: For implementers familiar with GPU and SIMD programming, iterative pairwise matches the mental model of “pair adjacent lanes, carry the odd one” — the same pattern used in existing high-performance reduction kernels.
Adoption ease: Libraries that already implement deterministic reductions using iterative pairwise would require no algorithmic changes to conform to this specification. This lowers the barrier to adoption.
Direct SIMD mapping: The iterative pairwise pattern corresponds directly to shuffle-down operations on GPU warps and SIMD vector lanes, enabling efficient implementation without index remapping.
Arguments favoring recursive bisection:
Specification clarity: The recursive definition is three lines with no special cases. The iterative definition requires explicit handling of odd-count carry logic.
Tree symmetry: For non-power-of-two k, recursive
bisection produces more balanced subtrees. The “heavier” subtree (with
more elements) is always on the right, following the
m = floor(k/2) split.
Academic precedent: Recursive bisection corresponds to the “pairwise summation” algorithm analyzed in numerical analysis literature ([Higham2002] §4).
Why the choice matters:
For non-associative operations (e.g., floating-point addition), the two algorithms produce different numerical results for non-power-of-two k. Once standardized, the choice cannot be changed without breaking existing code that depends on specific results.
Why the choice is bounded:
Both algorithms have identical O(log k · ε) error bounds. Both produce identical trees for power-of-two k. The practical impact is limited to:
- non-power-of-two per-lane element counts (when ceil(N/L) is not a power of two),
- non-power-of-two lane counts (when L itself is not a power of two), and
- cross-lane reduction when N is not a multiple of L.
For the common case of power-of-two L and large N, the per-lane trees are dominated by power-of-two cases where both algorithms agree.
Visual comparison (k = 7): For power-of-two k, both algorithms produce identical trees. The difference is visible only for non-power-of-two k. The following side-by-side comparison for k = 7 illustrates the full extent of the difference:
Iterative Pairwise (IPR) Recursive Bisection
───────────────────────── ─────────────────────────
op op
/ \ / \
op op op op
/ \ / \ / \ / \
op op op e₆ e₀ op op op
/ \ / \ / \ / \ / \ / \
e₀ e₁ e₂ e₃ e₄ e₅ e₁ e₂ e₃ e₄ e₅ e₆
((e₀⊕e₁)⊕(e₂⊕e₃))⊕((e₄⊕e₅)⊕e₆) (e₀⊕(e₁⊕e₂))⊕((e₃⊕e₄)⊕(e₅⊕e₆))
Both trees have depth 3 (= ⌈log₂ 7⌉). Both perform exactly 6
applications of binary_op. The error bounds are identical.
The only difference is which element carries: IPR carries
e₆ (the last element, locally determined), while recursive
bisection splits at floor(7/2) = 3, changing the pairing of
e₀ (a globally determined property — the pairing of the
first element depends on the total count).
This paper specifies iterative pairwise because it aligns with existing practice in high-performance libraries that already provide deterministic reduction modes. This minimizes disruption for implementers and users who have existing workflows built around these libraries.
Streaming evaluation: the fundamental structural difference. The most significant distinction between the two candidates is not performance or symmetry — it is that iterative pairwise is an online algorithm and recursive bisection is a batch algorithm.
Recursive bisection’s first operation is to compute a midpoint:
mid = N/2. The entire split tree is determined top-down
from the global sequence length. Evaluation cannot begin until N is
known, and the split structure depends on a global property of the
input.
Iterative pairwise (shift-reduce) processes elements incrementally:
each element is shifted onto an O(log N) stack, and reductions are
triggered by a branchless bit-test on the running count
(ntz(count)). The stack captures the complete computation
state at any point. N is not needed until termination, when remaining
stack entries are collapsed.
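The following sketch (informative, not proposed API) illustrates this online behaviour for a single lane, using double addition as a stand-in for binary_op; the empty-input and init cases of §4.5 are omitted:

#include <bit>
#include <cstdint>
#include <vector>

// Single-lane shift-reduce evaluation of the canonical tree (illustrative only).
double shift_reduce_sum(const std::vector<double>& data) {
    std::vector<double> stack;                       // depth stays O(log N)
    std::uint64_t count = 0;                         // elements shifted so far
    for (double x : data) {
        stack.push_back(x);                          // shift
        ++count;
        // After the n-th shift, perform ntz(n) reductions (trailing zeros of n).
        for (int r = std::countr_zero(count); r > 0; --r) {
            double rhs = stack.back(); stack.pop_back();
            stack.back() = stack.back() + rhs;       // op(older entry, newer entry)
        }
    }
    // Termination (the only point where N matters): collapse the remaining
    // entries newest-first, which carries the odd trailing operand upward.
    while (stack.size() > 1) {
        double rhs = stack.back(); stack.pop_back();
        stack.back() = stack.back() + rhs;
    }
    return stack.empty() ? 0.0 : stack.back();       // N == 0 is handled by §4.5 in the paper
}

For eight elements this reproduces the k = 8 state table in §4.2.3; for seven it yields ((e₀ + e₁) + (e₂ + e₃)) + ((e₄ + e₅) + e₆), the canonical carry shape.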
This structural property has concrete consequences for how C++ is evolving:
- Ranges whose length is not cheaply available (e.g., input ranges without size()): IPR can begin reducing immediately with a single forward pass. Recursive bisection requires either a counting pass or random access to compute split points.
- Sender/receiver (std::execution, adopted for C++26) explicitly separates what to execute from where to execute it, and senders describe composable units of asynchronous work. A streaming reduction using IPR is naturally expressible as a sender that consumes elements as they flow through a pipeline. Recursive bisection requires knowing the complete input before constructing the expression — it cannot compose incrementally.

In a standard moving toward sender/receiver pipelines and composable asynchronous execution, the tree shape that can operate without global knowledge of N is the one structurally aligned with the future of C++ execution. This is not a minor implementation convenience — it is an architectural property that determines whether the canonical expression can participate in streaming pipelines at all.
[Note: The current normative definition (§4) is stated in
terms of K = ceil(N/L), so the specification as written
requires a well-defined N (§2.2). However, the streaming property is
inherent to the iterative pairwise algorithm: lane assignment
(i mod L) is known per-element, and the shift-reduce
procedure within each lane builds the tree incrementally without
foreknowledge of the lane’s element count. N is needed only to determine
when to stop and collapse remaining stack entries. Future API revisions
may exploit this property to weaken iterator and sizing requirements.
—end note]
Performance comparison: iterative pairwise vs recursive bisection. Because both trees have identical depth for power-of-two sizes and nearly identical depth otherwise, their throughput is similar on modern hardware. Recursive bisection can be implemented with a direct unrolled mapping for small sub-problems (e.g., a flat switch for N ≤ 32 eliminates recursive call overhead), and unaligned SIMD loads on contemporary microarchitectures are essentially free when data does not cross a cache line boundary — reducing the alignment advantage that earlier analyses relied on [Dalton2014]. The iterative formulation retains structural advantages (loop-based with a branchless bit-test for reduction, natural streaming order), but these translate to modest rather than dramatic throughput differences against a well-engineered recursive implementation. This approximate throughput equivalence between the two candidates means that performance alone cannot distinguish them — and the decision appropriately falls to the axes where they do differ: industry practice alignment, executor compatibility, and the fact that iterative pairwise is what existing SIMD and GPU reduction libraries already ship. The paper does not claim iterative pairwise is faster than recursive bisection; it claims that, at equivalent performance and identical error bounds, the tree that matches existing practice is the safer standard choice.
Executor compatibility. Once the expression is separated from execution (§3.0), the question becomes: which canonical expression can executors naturally evaluate? Iterative pairwise produces a local, lane-structured expression that maps directly to executor chunking and work-graph execution without requiring the full tree to be materialized. Recursive bisection produces a globally-recursive structure that is harder to realize incrementally. In executor terms, iterative pairwise defines an expression that executors can evaluate without first building the tree.
Standard regret. Once standardized, the tree shape becomes part of the language contract and cannot be changed without breaking code that depends on specific results. The worst consequence of choosing iterative pairwise is not incorrectness or inefficiency — it is commitment. For power-of-two sizes the two candidates are identical; the commitment applies only to non-power-of-two boundary cases, where iterative pairwise matches existing SIMD/GPU practice. This is the expression shape least likely to be regretted as execution hardware evolves.
Upper bound on regret. Regardless of tree shape, any balanced binary reduction has the same O(log N · ε) error bound [Higham2002]. No alternative tree can improve the asymptotic accuracy. On the throughput side, iterative pairwise already achieves 89% of the theoretical peak [Dalton2014]. Even if a superior tree shape were discovered in the future, the maximum possible throughput improvement over IPR is therefore bounded at approximately 11%. The regret is capped: the committee is not choosing between a good answer and an unknown potentially-much-better answer — it is choosing between 89% and at most 100%, with identical error bounds. That is a narrow window in which to find regret.
Measured throughput cost. The practical performance cost of iterative pairwise (shift-reduce) summation has been measured by Dalton, Wang & Blainey [Dalton2014]. Their SIMD-optimized implementation achieves 89% of the throughput of unconstrained naïve summation when data is not resident in L1 cache (86% from L2, 90% streaming from memory), while providing the O(log N · ε) error bound of pairwise summation — and twice the throughput of the best-tuned compensated sums (Kahan-Babuška).
The more telling comparison is against the current deterministic
baseline. Today, the only standard facility with a fully specified
expression is std::accumulate, which is a strict left fold
with a loop-carried dependency chain. It cannot utilize SIMD
parallelism: on AVX-512 hardware, a left fold occupies 1 of 8
double lanes (~12% SIMD utilization) or 1 of 16
float lanes (~6%). The canonical iterative pairwise tree,
by contrast, is inherently SIMD-friendly — all lanes active, with
measured throughput at ~89% of peak. This represents approximately a
7× improvement in the deterministic reduction path for
double on AVX-512 (and ~14× for float). The
question for LEWG is therefore not “what is the cost of owning the
expression?” but “how much performance does a specified expression
recover compared to the only specified expression we have
today?” The answer is: nearly all of it.
This paper acknowledges the importance of a Ranges-first model in modern C++, but the specific API surface (new algorithm, new execution policy, or range-based overload) is deliberately deferred to a subsequent revision. Expression structure is orthogonal to API surface; fixing it first allows API alternatives to be evaluated against a stable semantic baseline. Once LEWG reaches consensus on this semantic foundation, a follow-on revision can propose an API that aligns with modern library patterns.
The expression/algorithm/execution separation described in §3.0 aligns naturally with the sender/receiver execution model adopted for C++26 (P2300).
P2300 (std::execution) addresses the execution concern:
where work runs, when it runs, how it is
scheduled, and what progress guarantees apply. Its stated design
principle — “address the concern of what to execute separately
from the concern of where” — is precisely the separation this
proposal formalizes for reduction expressions. Senders and receivers are
intentionally agnostic to the abstract computation being performed. In
the absence of a specified expression, different schedulers may
legitimately induce different groupings for a reduction, leading to
scheduler-dependent results. This is not a defect in the execution
model; it reflects the fact that the algorithm has not fixed the
computation.
By fixing the abstract expression in the algorithm, this proposal provides executors with a stable semantic target:
Different schedulers — CPU thread pool, GPU stream, or distributed sender chain — may execute the canonical expression using different physical strategies, but they are required to produce the same returned value for a fixed expression and floating-point evaluation model.
In this sense, the proposal is orthogonal and complementary to P2300. It does not constrain execution; it enables cross-scheduler consistency by giving the execution model a deterministic expression to execute. The streaming property of iterative pairwise (§3.8) is particularly relevant here: the canonical expression can be evaluated incrementally as data flows through a sender pipeline, without requiring a synchronization barrier to determine N before reduction begins. This also explains why expression semantics could never have resided in execution policies or schedulers: those mechanisms are designed for the execution dimension, and encoding expression structure in them would require conflating the two concerns that this proposal separates.
Normative. Sections §4–§5 specify the semantic contract of this proposal. All other sections are informative.
The contract in this section specifies the returned value
only; the evaluation schedule (including the relative ordering
of independent subexpressions) is not specified. For
N > 0, the standard‑fixed abstract expression contains
exactly N − 1 applications of binary_op; this
paper does not specify which subexpression is evaluated first or how
evaluation is scheduled.
This section is the core of the proposal. It defines the fixed
abstract expression — the exact grouping and left/right operand order —
that determines the returned value for a given input order,
binary_op, and topology coordinate L. A
conforming implementation shall produce a result as-if it evaluates this
standard-fixed abstract expression; implementations remain free to
schedule evaluation using threads, vectorization, work-stealing, GPU
kernels, etc., provided the returned value matches.
[Note: The semantic contract in §4 is defined in terms of
the returned value. It does not specify scheduling, does not guarantee a
deterministic number of invocations of binary_op, and does
not guarantee deterministic side effects. —end note]
Overview: Two-stage canonical reduction
The canonical reduction proceeds in two stages over an interleaved
lane partition. The input sequence of N elements is
distributed across L lanes by index modulo (element
i → lane i mod L). In Stage 1, each lane
independently reduces its elements using a single canonical tree shape.
In Stage 2, the L lane results are themselves reduced using
the same canonical tree rule. Both stages use
CANONICAL_TREE_EVAL (§4.2.3), ensuring a fully determined
expression from input to result.
Example: N = 12, L = 4 (3 elements per lane)
Input: e₀ e₁ e₂ e₃ e₄ e₅ e₆ e₇ e₈ e₉ e₁₀ e₁₁
Lane: 0 1 2 3 0 1 2 3 0 1 2 3
┌─── Stage 1: Per-lane canonical trees ───┐
Lane 0 Lane 1 Lane 2 Lane 3
op op op op
/ \ / \ / \ / \
op e₈ op e₉ op e₁₀ op e₁₁
/ \ / \ / \ / \
e₀ e₄ e₁ e₅ e₂ e₆ e₃ e₇
↓ R₀ ↓ R₁ ↓ R₂ ↓ R₃
┌─── Stage 2: Cross-lane canonical tree ──┐
op
/ \
op op
/ \ / \
R₀ R₁ R₂ R₃
↓
result
The lane count L is the single topology parameter that
determines the shape of the entire computation. When L = 1,
the two-stage structure collapses to a single canonical tree over the
entire input sequence (one lane, no cross-lane reduction). Sections
§4.1–§4.4 define each component precisely; §4.3.4 addresses ragged tails
when N is not a multiple of L.
This specification defines the canonical expression structure only; it does not prescribe an evaluation strategy. However, the iterative pairwise formulation can be evaluated efficiently across threads while preserving the canonical tree. When each thread processes a power-of-two-sized chunk of the input, the shift-reduce process within that chunk collapses to a single completed subtree — the minimum possible merge state. Adjacent chunk results can then be combined in index order, recovering the canonical expression regardless of thread count. A detailed parallel realization strategy is described in Appendix N.
To mirror the style used in the Numerics clauses
(e.g. GENERALIZED_SUM), this paper introduces definitional
functions used solely to specify required parenthesization and
left/right operand order. These functions do not propose Standard
Library names.
For a fixed input order, lane count L, and binary_op, the abstract expression (parenthesization and left/right operand order) is uniquely determined: for the chosen lane count L, the input positions are partitioned into L logical subsequences (lanes) based on i % L, preserving input order within each lane (§4.3); each lane is reduced by the canonical tree rule (§4.2), followed by a second canonical reduction over lane results (§4.4).

[Note: The iterative pairwise-with-carry evaluation order used by this paper corresponds to the well-known shift-reduce (carry-chain / binary-counter stack) formulation of pairwise summation described by Dalton, Wang, and Blainey [Dalton2014]. —end note]
A near-balanced tree over a non-power-of-two number of leaves
implicitly contains operand positions for which no input element is
present. This paper makes that absence explicit in the semantic
model: when the tree geometry requires combining two operands but one
operand is absent, binary_op is not applied and the present
operand is propagated unchanged (§4.2.2). This avoids padding values and
imposes no identity-element requirements on binary_op.
Given a working sequence of length n, define the number
of pairs formed in each round:
h = floor(n / 2).

The split rule h = floor(n / 2) is normative and governs
the number of pairs formed in each round of the iterative pairwise
algorithm (§4.2.3). It does not by itself determine the tree shape; the
complete tree is determined by the iterative pairing and carry logic
defined in §4.2.3.
Let maybe<A> be a conceptual domain with values
either present(a) (for a of type
A) or ∅ (“absent”). The
maybe<A> and ∅ notation are purely
definitional devices used to specify the handling of non-power-of-two
inputs without requiring an identity element; they do not appear in any
proposed interface and implementations need not model absence
explicitly.
Define the lifted operation COMBINE(op, u, v) on
maybe<A>:
- COMBINE(op, ∅, x) = x
- COMBINE(op, x, ∅) = x
- COMBINE(op, ∅, ∅) = ∅
- COMBINE(op, present(x), present(y)) = present( op(x, y) )

[Note: In implementation terms, COMBINE is the
familiar SIMD tail-masking or epilogue pattern: when an input sequence
does not fill the last group evenly, the implementation skips the
missing positions rather than fabricating values. The formalism above
specifies this behavior precisely without prescribing the implementation
technique (predicated lanes, scalar epilogue, masked operations, etc.).
—end note]
This lifted operation does not require binary_op to have
an identity element. Absence is a property of the expression geometry
(whether an operator application exists), not a property of
binary_op.
The use of ∅ is a definitional device. For any given
(N, L), the locations of absent leaves are fully determined
(they occur only in the ragged tail, §4.3.4). Implementations can
therefore handle the tail using an epilogue or masking, without
introducing per-node conditionals in the main reduction.
This subsection defines the canonical tree-building algorithm using iterative pairwise reduction. The algorithm pairs adjacent elements left-to-right in each round, carrying the odd trailing element to the next round. This corresponds to the shift-reduce summation pattern described by Dalton, Wang, and Blainey [Dalton2014] and matches the shuffle-down pattern used in CUDA CUB and GPU warp reductions (see §3.8 for rationale).
Definition. Define
CANONICAL_TREE_EVAL(op, Y[0..k)), where
k >= 1 and each Y[t] is in
maybe<A>:
CANONICAL_TREE_EVAL(op, Y[0..k)):
if k == 1:
return Y[0]
// Iteratively pair adjacent elements until one remains
let W = Y[0..k) // working sequence (conceptual copy)
while |W| > 1:
let n = |W|
let h = floor(n / 2)
// Pair elements: W'[i] = COMBINE(op, W[2i], W[2i+1]) for i in [0, h)
let W' = [ COMBINE(op, W[2*i], W[2*i + 1]) for i in [0, h) ]
// If n is odd, carry the last element
if n is odd:
W' = W' ++ [ W[n-1] ]
W = W'
return W[0]
When CANONICAL_TREE_EVAL returns ∅, it
denotes that no present operands existed in the evaluated subtree.
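The following sketch (informative, not proposed API) models maybe&lt;A&gt; with std::optional&lt;A&gt; and evaluates the canonical tree exactly as defined above; it is a semantic illustration, not an implementation strategy:

#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// COMBINE (§4.2.2): binary_op is not applied when an operand is absent.
template <class A, class Op>
std::optional<A> combine(Op op, std::optional<A> u, std::optional<A> v) {
    if (!u) return v;
    if (!v) return u;
    return op(*u, *v);
}

// CANONICAL_TREE_EVAL (§4.2.3): pair adjacent entries left-to-right,
// carrying the odd trailing entry to the next round.
template <class A, class Op>
std::optional<A> canonical_tree_eval(Op op, std::vector<std::optional<A>> w) {
    if (w.empty()) return std::nullopt;              // the definition requires k >= 1
    while (w.size() > 1) {
        std::vector<std::optional<A>> next;
        std::size_t h = w.size() / 2;                // h = floor(n / 2), §4.2.1
        for (std::size_t i = 0; i < h; ++i)
            next.push_back(combine<A>(op, w[2 * i], w[2 * i + 1]));
        if (w.size() % 2 != 0)
            next.push_back(w.back());                // carry the odd trailing element
        w = std::move(next);
    }
    return w.front();
}

Applied to seven present double leaves with std::plus&lt;double&gt;{}, this evaluates ((e₀ + e₁) + (e₂ + e₃)) + ((e₄ + e₅) + e₆), matching the k = 7 diagram later in this subsection.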
Shift-reduce state table (k = 8):
The following table illustrates the iterative pairwise algorithm step
by step for k = 8 elements, adapted from [Dalton2014]
Figure 4. Each step either shifts (pushes an element onto the
stack) or reduces (combines the top two stack entries of the
same tree level). Lowercase letters denote intermediate results at
successively higher levels of the tree: a terms are sums of
two elements, b terms are sums of two a terms,
and so on.
Sequence                   Stack              Operation
─────────────────────────  ─────────────────  ──────────────────────
e₀ e₁ e₂ e₃ e₄ e₅ e₆ e₇    ∅                  shift e₀
e₁ e₂ e₃ e₄ e₅ e₆ e₇       e₀                 shift e₁
e₂ e₃ e₄ e₅ e₆ e₇          e₀ e₁              reduce a₀ = op(e₀, e₁)
e₂ e₃ e₄ e₅ e₆ e₇          a₀                 shift e₂
e₃ e₄ e₅ e₆ e₇             a₀ e₂              shift e₃
e₄ e₅ e₆ e₇                a₀ e₂ e₃           reduce a₁ = op(e₂, e₃)
e₄ e₅ e₆ e₇                a₀ a₁              reduce b₀ = op(a₀, a₁)
e₄ e₅ e₆ e₇                b₀                 shift e₄
e₅ e₆ e₇                   b₀ e₄              shift e₅
e₆ e₇                      b₀ e₄ e₅           reduce a₂ = op(e₄, e₅)
e₆ e₇                      b₀ a₂              shift e₆
e₇                         b₀ a₂ e₆           shift e₇
∅                          b₀ a₂ e₆ e₇        reduce a₃ = op(e₆, e₇)
∅                          b₀ a₂ a₃           reduce b₁ = op(a₂, a₃)
∅                          b₀ b₁              reduce c₀ = op(b₀, b₁)
∅                          c₀                 done
The final result c₀ is the value of the canonical
expression. The number of reductions after the n-th shift is
determined by the number of trailing zeros in the binary representation
of n [Dalton2014].
Tree diagram (k = 8):
The state table above produces the following canonical expression
tree — a perfectly balanced binary tree for power-of-two
k:
op (c₀)
/ \
op (b₀) op (b₁)
/ \ / \
op (a₀) op (a₁) op (a₂) op (a₃)
/ \ / \ / \ / \
e₀ e₁ e₂ e₃ e₄ e₅ e₆ e₇
Expression:
((e₀ ⊕ e₁) ⊕ (e₂ ⊕ e₃)) ⊕ ((e₄ ⊕ e₅) ⊕ (e₆ ⊕ e₇))
Tree diagram (k = 7, non-power-of-two):
When k is not a power of two, the odd trailing element
is carried forward, producing a slightly unbalanced tree. This is where
the carry logic in the algorithm definition above determines the
canonical shape:
op
/ \
op op
/ \ / \
op op op e₆
/ \ / \ / \
e₀ e₁ e₂ e₃ e₄ e₅
Expression:
((e₀ ⊕ e₁) ⊕ (e₂ ⊕ e₃)) ⊕ ((e₄ ⊕ e₅) ⊕ e₆)
The left subtree is identical to the k = 8 case with the
last element removed. The carry of e₆ at round 1 (odd
n = 7) places it as the right child of the right subtree’s
right branch.
The following diagrams illustrate the fixed abstract expression structure only; they do not imply any particular evaluation order, scheduling, or implementation strategy.
Legend (informative)
present(x) : a present operand holding value x (type A)
∅ : an absent operand position (no input element exists there)
combine(u,v) : lifted combine:
- combine(∅, x) = x
- combine(x, ∅) = x
- combine(∅, ∅) = ∅
- combine(x, y) = op(x, y) when both present
op(a,b) : the user-supplied binary_op, applied only when both operands exist
Example: absence propagation with k = 5, Y = [ X0, X1, X2, ∅, ∅ ]
COMBINE COMBINE(op(op(X0,X1), X2), ∅) = op(op(X0,X1), X2)
/ \
COMBINE ∅ carried from round 2 (odd n=3)
/ \
op COMBINE COMBINE(X2, ∅) = X2
/ \ / \
X0 X1 X2 ∅
Result: op(op(X0, X1), X2) — two binary_op
calls from three present elements. The two ∅ positions at
different tree levels each induce no application of
binary_op; the lifted COMBINE rule absorbs them
uniformly.
A near-balanced tree is not necessarily full; missing leaves may induce absent operands at internal combine points as the tree reduces. The lifted combine rule handles this uniformly.
Let E[0..N) denote the input elements in iteration
order, and let X[0..N) denote the corresponding conceptual
terms of the reduction expression (materialization and the reduction
state type are defined in §4.6).
For each lane index j in [0, L),
define:
I_j = < i in [0, N) : (i mod L) == j >, ordered by increasing i.
X_j = < X[i] : i in I_j >.
This preserves the original input order within each lane
(equivalently, X_j contains positions
j, j+L, j+2L, ... that are less than N).
Define K = ceil(N / L) when N > 0. (The
N == 0 case is handled by §4.5 and does not form
lanes.)
For each lane j in [0, L), define a
fixed-length conceptual sequence Y_j[0..K) of
maybe<A> leaves:
For each t in [0, K):
- let i = j + t*L;
- if i < N: Y_j[t] = present( X[i] );
- otherwise: Y_j[t] = ∅.

Thus all lanes use the same canonical tree shape over
K leaf positions. Lanes with fewer than
K elements simply have trailing ∅ leaves;
these do not introduce padding values and do not require
identity-element properties of binary_op.
[Note: Implementations must not pad absent operand positions
with a constant value (e.g., 0.0 for addition) unless that
value is the identity element for the specific binary_op
and argument types. The lifted COMBINE rules (§4.2.2)
define the correct handling of absent positions for arbitrary
binary_op. The demonstrators in Appendix K use zero-padding
only because they test std::plus<double>, for which
0.0 is the identity. —end note]
When N < L, some lane indices j
correspond to no input positions: Y_j[t] == ∅ for all
t, yielding R_j == ∅. No applications of
binary_op are induced for such lanes under the lifted
COMBINE rules.
[Note: When L > N, K = 1 and
each lane holds at most one element. Stage 1 performs no applications of
binary_op (each lane result is either a single present
value or ∅). Stage 2 then applies
CANONICAL_TREE_EVAL over the L lane results,
of which only N are present; the COMBINE rules
propagate the L − N absent entries without invoking
binary_op. The result is therefore equivalent to
CANONICAL_TREE_EVAL applied directly to the N
input elements. In this regime L has no observable effect
on the returned value. This is intentional: no diagnostic is required,
and implementations need not special-case it. —end note]
Example: N = 10, L = 4 ⇒
K = ceil(10/4) = 3
Input order (i): 0 1 2 3 4 5 6 7 8 9
Elements X[i]: X0 X1 X2 X3 X4 X5 X6 X7 X8 X9
Lane (i mod L): 0 1 2 3 0 1 2 3 0 1
Lanes preserve input order within each lane:
Lane 0: X0 X4 X8
Lane 1: X1 X5 X9
Lane 2: X2 X6
Lane 3: X3 X7
Fixed-length leaves with absence (no padding values):
Y_0: [ present(X0), present(X4), present(X8) ]
Y_1: [ present(X1), present(X5), present(X9) ]
Y_2: [ present(X2), present(X6), ∅ ]
Y_3: [ present(X3), present(X7), ∅ ]
When N is not a multiple of L, the final
group of input elements is incomplete: some lanes receive one fewer
element than others, producing a ragged trailing edge across the lane
partition. The canonical expression handles this uniformly through the
maybe<A> formalism defined in §4.2.2. Every lane uses
the same tree shape over K = ceil(N/L) leaf positions, but
lanes whose element count falls short of K have trailing
∅ leaves. The COMBINE rules propagate these
absences without invoking binary_op and without requiring
padding values or identity elements from the caller.
Example: N = 11, L = 4, so
K = ceil(11/4) = 3.
The input is distributed across lanes by i mod L:
Input: X0 X1 X2 X3 | X4 X5 X6 X7 | X8 X9 X10
Block: ──── full ──── | ──── full ──── | ─ ragged ──
Lane assignment:
Lane 0: X0, X4, X8 (3 elements — full)
Lane 1: X1, X5, X9 (3 elements — full)
Lane 2: X2, X6, X10 (3 elements — full)
Lane 3: X3, X7, ∅ (2 elements + 1 absent)
All four lanes evaluate the same canonical tree shape over
K = 3 leaf positions. For lanes 0–2, every leaf is present
and the tree evaluates normally. For lane 3, the tree encounters an
absent leaf:
COMBINE
/ \
op ∅
/ \
X3 X7
COMBINE(op(X3, X7), ∅) = op(X3, X7) — no
binary_op application occurs for the absent position. The
result is identical to reducing only the present elements
[X3, X7].
This mechanism generalizes to any (N, L) pair. The
number of ragged lanes is L - (N mod L) when
N mod L ≠ 0; these lanes each have exactly one trailing
∅. The remaining N mod L lanes have all
K positions present. When N is a multiple of
L, no lanes are ragged and no ∅ entries
arise.
In implementation terms, the ragged tail corresponds to the familiar
SIMD epilogue or tail-masking pattern: the final group of elements is
narrower than the full lane width, and the implementation must avoid
reading or combining nonexistent data. The maybe<A>
formalism specifies the required behavior without prescribing the
implementation technique (masking, scalar epilogue, predicated lanes,
etc.).
Let L >= 1 be the lane count and let op
denote the supplied binary_op.
For each lane index j in [0, L),
define:
R_j = CANONICAL_TREE_EVAL(op, Y_j) // returns maybe<A>
Define a conceptual sequence Z[0..L) by:
Z[j] = R_j for j in [0, L).

Then define:
R_all = CANONICAL_TREE_EVAL(op, Z) // returns maybe<A>
When N > 0, at least one lane contains a present
operand, therefore R_all is present(r) for
some r of type A. The interleaved reduction
result is that r.
Therefore, the overall expression is uniquely determined by the
canonical tree rule applied first within each lane and then across lanes
in increasing lane index order. When L = 1, there is a
single lane containing all N elements; Stage 2 receives one
input and returns it unchanged, so the result is simply
CANONICAL_TREE_EVAL(op, Y_0).
Summary definition. For convenience, define the composite definitional function:
CANONICAL_INTERLEAVED_REDUCE(L, op, X[0..N)):
Partition X into L lanes by index modulo (§4.3.1).
Form fixed-length leaf sequences Y_j[0..K) for each lane j (§4.3.2).
For each lane j: R_j = CANONICAL_TREE_EVAL(op, Y_j).
Form Z[0..L) where Z[j] = R_j.
Return CANONICAL_TREE_EVAL(op, Z).
When N == 0, CANONICAL_INTERLEAVED_REDUCE
is not invoked; the result is determined by §4.5.
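Continuing the informal std::optional model (not proposed API), and assuming the combine / canonical_tree_eval sketch shown after §4.2.3 is in scope, the composite definition can be sketched as:

#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Sketch of CANONICAL_INTERLEAVED_REDUCE for N > 0 (N == 0 and init are §4.5).
template <class A, class Op>
A canonical_interleaved_reduce(std::size_t L, Op op, const std::vector<A>& x) {
    const std::size_t n = x.size();                  // N, assumed > 0 here
    const std::size_t k = (n + L - 1) / L;           // K = ceil(N / L)

    std::vector<std::optional<A>> z(L);              // Z[0..L): lane results R_j
    for (std::size_t j = 0; j < L; ++j) {
        std::vector<std::optional<A>> y(k);          // Y_j[0..K): trailing leaves stay absent
        for (std::size_t t = 0; t < k; ++t) {
            const std::size_t i = j + t * L;         // lane j holds positions j, j+L, j+2L, ...
            if (i < n) y[t] = x[i];                  // present(X[i]); otherwise ∅
        }
        z[j] = canonical_tree_eval<A>(op, std::move(y));   // Stage 1
    }
    return *canonical_tree_eval<A>(op, std::move(z));      // Stage 2; present when N > 0
}

For the N = 10, L = 4 example, this sketch produces exactly the Stage 1 / Stage 2 structure shown in the summaries that follow.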
Stage 1 summary (N = 10, L = 4,
K = 3; all lanes use the same shape; ∅
propagates):
Y_0 = [X0, X4, X8] --(canonical tree k=3)--> R_0
Y_1 = [X1, X5, X9] --(canonical tree k=3)--> R_1
Y_2 = [X2, X6, ∅ ] --(canonical tree k=3)--> R_2
Y_3 = [X3, X7, ∅ ] --(canonical tree k=3)--> R_3
Stage 2 (cross-lane canonical reduction; same rules apply):
Example: L = 4, Z = [R0, R1, R2, R3]
combine
/ \
combine combine
/ \ / \
R0 R1 R2 R3
Conceptual completeness: the same lifted rule handles absence in Stage 2:
Example: Z = [ R0, ∅, R2, ∅ ]
combine
/ \
combine combine
/ \ / \
R0 ∅ R2 ∅
combine(R0, ∅) = R0
combine(R2, ∅) = R2
combine(R0, R2) = op(R0, R2)
For any lane j,
CANONICAL_TREE_EVAL(op, Y_j) evaluates the same abstract
expression as applying the canonical split rule (§4.2.1) to the
subsequence X_j containing only present terms, with the
understanding that absent operand positions do not create
binary_op applications. The explicit absence notation does
not affect the returned value; it makes the “absent operand” behavior
precise and enables a single tree shape for all lanes.
If an initial value init is provided, the abstract
expression is:
- If N == 0: return init.
- Otherwise: let R = the value extracted from R_all = CANONICAL_TREE_EVAL(op, Z) in §4.4.2, let I be a value of type A initialized from init (where A is defined in §4.6), and the result is op(I, R).

The placement of init is normative. In particular,
init is not interleaved into lanes and does not participate
in the canonical tree expression. Combining init with the
tree result in a single final application of binary_op
ensures that the canonical tree shape is independent of whether an
initial value is provided. The left‑operand placement is consistent with
existing fold‑style conventions; because this proposal does not require
commutativity of binary_op, the position of
init is specified. In particular,
std::accumulate places init as the left
operand at every step (op(op(init, x₀), x₁)...); this
proposal preserves that convention so that non-commutative operations
produce consistent results when migrating from
std::accumulate to the canonical reduction.
[Note: Whether a convenience form without an explicit
init is provided, and what default it uses, is an API
decision deferred to a future revision. —end note]
With init (conceptual):
Result = op( init, CANONICAL_INTERLEAVED_REDUCE(L, op, X[0..N)) )
init is not interleaved into lanes and is combined once with the overall result.
Informative contrast:
std::accumulate:
(((init op X0) op X1) op X2) ... op XN-1
This proposal:
init op ( fixed canonical tree expression over X0..XN-1 )
The preceding sections (§4.1–§4.4) define the canonical expression structure over abstract sequences. This section specifies the type rules that bridge the abstract expression to C++ evaluation.
Let:
- V be the value type of the input sequence elements,
- A be the reduction state type: if an initial value init of type T is provided, A = remove_cvref_t&lt;T&gt;; otherwise, A = V.

Define the conceptual term sequence X[0..N) of type A by converting each input element: X[i] = static_cast&lt;A&gt;(E[i]) for i in [0, N).
All applications of binary_op within the definitional
functions in §4 operate on values of type A.
Constraints: Let A be the reduction
state type defined above.
- When an initial value init is provided, the initialization A{init} shall be well-formed.
- Each X[i] is formed by conversion to A as specified above.
- When init is provided, it is materialized as a value I of type A initialized from init and participates in the expression as op(I, R) per §4.5.
- binary_op shall be invocable with two arguments of type A, and the result shall be convertible to A.

[Note: How V is derived from the input —
whether as iter_value_t<InputIt> for an iterator-pair
interface, range_value_t<R> for a range interface, or
otherwise — is an API decision deferred to a future revision. The
semantic contract requires only that V is well-defined and
that the conversion static_cast<A>(E[i]) is
well-formed. Proxy reference types (e.g.,
std::vector<bool>::reference) and their interaction
with V are likewise API-level concerns. —end
note]
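A small, purely illustrative example of these type rules (how V is obtained from a concrete input is an API matter, per the note above): float elements reduced with a double initial value yield A = double.

#include <type_traits>
#include <vector>

int main() {
    std::vector<float> e = {1.0f, 2.0f, 3.0f};       // V = float
    double init = 0.0;                               // init of type T = double

    using A = std::remove_cvref_t<decltype(init)>;   // reduction state type: A = double
    static_assert(std::is_same_v<A, double>);

    // Every term of the canonical expression is converted before any use:
    A x0 = static_cast<A>(e[0]);                     // X[0] = static_cast<A>(E[0])
    (void)x0;                                        // binary_op operates on values of type A
}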
This proposal does not impose associativity or
commutativity requirements on binary_op. Instead of
permitting implementations to reassociate or reorder (which can make
results unspecified for non-associative operations), this proposal
defines a single canonical abstract expression for fixed
(N, L). Determinism is obtained by fixing the expression,
not by restricting binary_op.
See §2.3 for exception/termination behavior; in particular, when
std::terminate is called under a policy-based evaluation,
the state of outputs and any externally observable side effects is
unspecified.
See §2.4 for complexity; guarantees are intentionally limited to work
complexity, consistent with std::reduce.
For a chosen topology coordinate (lane count
L), the fixed expression structure provides:
Topological determinism: For fixed input order,
binary_op, lane count L, and N,
the abstract expression (grouping and left/right operand order) is fully
specified by §4. It does not depend on implementation choices, SIMD
width, thread count, or scheduling decisions.
Layout invariance: Results are independent of memory alignment and physical placement, given the same input sequence as observed through the iterator range.
Execution independence: Implementations may evaluate independent subtrees in any order or concurrently. Only the grouping is specified, not the schedule.
Cross-invocation reproducibility: Given the same
topology coordinate L, input sequence,
binary_op, and floating-point evaluation model, the
returned value is stable across invocations (it is the value of the same
specified expression under the same evaluation model).
Scope of guarantee (returned value): The run-to-run
stability guarantee applies to the returned value of
the reduction. If binary_op performs externally observable
side effects, the order and interleaving of those side effects is not
specified by this paper and may vary between invocations.
Constraints on binary_op: Let
A be the reduction state type (§4.6). The only requirements
on binary_op are invocability with two arguments of type
A and convertibility of the result to A. No
associativity, commutativity, or identity-element requirements are
imposed. Because the canonical expression is fixed for a given
(N, L), determinism is obtained by fixing the expression
structure, not by restricting binary_op.
The remaining requirements on iterators, value types, and side
effects match those of the corresponding std::reduce
facility ([numeric.ops.reduce]):
- binary_op shall not invalidate iterators or subranges, nor modify elements in the input range.
- For overloads with an execution policy, binary_op is an element access function subject to the requirements in [algorithms.parallel.exec].
When evaluated without an execution policy, binary_op is
invoked as part of a normal library algorithm call; this paper does not
require concurrent evaluation. When evaluated with an execution policy,
the requirements of [algorithms.parallel.exec] additionally apply.
The run-to-run stability guarantee applies to the returned value when
binary_op is functionally deterministic — that is, when it
returns the same result for the same operand values. If
binary_op reads mutable global state, uses random number
generation, or is otherwise non-deterministic, the returned value may
vary even with fixed topology coordinate and input.
[Note: Functional determinism of binary_op is
not a formal requirement (the standard cannot enforce functional
purity), but an observation about when the stability guarantee is
meaningful. —end note]
Cross-platform reproducibility requires users to ensure an identical
topology coordinate L and an equivalent floating-point
evaluation model (§2.5, §6).
This section discusses what is and is not guaranteed about floating-point results under the canonical expression defined in §4.
Terminology: This paper uses floating-point
evaluation model to mean the combination of the program’s
runtime floating-point environment (e.g., <cfenv>
rounding mode) and the translation/target choices that affect how
floating-point expressions are evaluated (e.g., contraction/FMA, excess
precision, subnormal handling, fast-math).
What is specified: For a given topology coordinate
L, input sequence, binary_op, and
init, the result is the value obtained by evaluating the
canonical compute sequence defined in §4, in the floating-point
evaluation model in effect for the program.
What this enables: By removing library-permitted reassociation, repeated executions of the same program under a stable evaluation model can obtain the same result independent of thread count, scheduling, or SIMD width.
What it does not attempt to specify:
Cross-architecture bitwise identity is not a goal of this paper. Users
who require bitwise identity must additionally control the relevant
evaluation-model factors and ensure that sizeof(V) (and
thus lane count) is stable across the intended platforms.
Relationship to P3375 (Reproducible floating-point
results): Davidson’s P3375 [P3375R2] proposes a
strict_float type that specifies sufficient conformance
with ISO/IEC 60559:2020 to guarantee reproducible floating-point
arithmetic across implementations. This proposal and P3375 are
complementary: this paper fixes the expression structure
(parenthesization and operand order) of a parallel reduction, while
P3375 addresses the evaluation model (rounding, contraction,
intermediate precision) for individual operations. Together, they would
provide both necessary conditions for cross-platform bitwise
reproducibility of parallel reductions. Neither paper alone is
sufficient.
Relationship to std::simd (P1928): This
proposal is orthogonal to std::simd.
std::simd::reduce performs a horizontal reduction within a
single SIMD value with unspecified order; this facility defines a
deterministic expression structure over an arbitrary-length input range.
An implementation may use std::simd operations internally,
but the semantic contract does not depend on std::simd.
The C++ <cfenv> floating-point environment covers
rounding mode and exception flags; many other factors that affect
floating-point results (such as contraction/FMA and intermediate
precision) are translation- or target-dependent and are not fully
specified by the C++ abstract machine. This proposal therefore
guarantees expression identity, not universal bitwise identity.
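As a non-normative illustration of this split, the sketch below touches the runtime half of the evaluation model through <cfenv> and names, in comments, the translation/target half that <cfenv> does not reach; the pragmas and flags mentioned are standard or common toolchain spellings whose availability and effect are toolchain-specific.
#include <cfenv>

// Runtime half of the evaluation model: rounding mode and status flags,
// reachable through <cfenv>. (Strictly, accessing the environment requires
// #pragma STDC FENV_ACCESS ON where the implementation supports it.)
void pin_rounding_mode() {
    std::fesetround(FE_TONEAREST);  // round-to-nearest, the usual default
}

// Translation/target half: contraction (FMA), excess precision, subnormal
// handling, fast-math. These are governed by #pragma STDC FP_CONTRACT and
// flags such as -ffp-contract=off, -fno-fast-math, or /fp:precise, not by <cfenv>.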
This paper specifies a canonical expression structure for parallel reduction. The goal is to complete the reduction “semantic spectrum” in the Standard Library: from specified but sequential, to parallel but unspecified, to parallel and specified.
| Facility | Parallel | Expression specified | Notes |
|---|---|---|---|
| std::accumulate | No | Yes (left-fold) | Fully specified; sequential |
| std::reduce | Yes | No (generalized sum) | Unspecified grouping; results may vary for non-associative ops |
| HPC frameworks (Kokkos, etc.) | Yes | No | Strategy-dependent grouping; FP results may vary |
| oneTBB parallel_reduce | Yes | No | Join order varies; scheduler-dependent results |
| Canonical reduction (this proposal) | Yes | Yes (§4 tree) | Fixed parenthesization for chosen topology coordinate L; free scheduling |
What this proposal adds: a standard-specified expression for parallel reduction, closing the third cell in the table above.
[Note: This paper uses
canonical_reduce_lanes<L>(...) as the primary
illustrative spelling, reflecting that lane count L is the
semantic topology coordinate. For layout-stable numeric types, an API
may additionally provide a span spelling
canonical_reduce<M>(...) as a convenience (where
L = M / sizeof(V) when well-formed; see §9.2). No Standard
Library API is proposed in this paper. —end note]
This proposal is motivated by workloads where run-to-run stability matters, but existing parallel reductions are intentionally free to choose an evaluation order (and thus may vary with scheduling).
Typical use cases, with detailed examples (including code), are collected in Appendix M.
This section sketches possible directions for exposing the canonical expression defined in §4. The intent is to build consensus on the semantic contract before committing to an API surface.
A key design point is that the semantic topology parameter is
lane count L. A byte span M is a numerics convenience
coordinate that derives L = M / sizeof(V) when well-formed
(§9.2). An API may choose to expose one or both coordinates.
Two primary approaches exist.
Expose the semantics as a new algorithm (illustrative spelling only):
// Illustrative only: name/signature not proposed in this paper
// Lane-based topology (portable across ABIs for a fixed L)
template <size_t L, class InputIt, class T, class BinaryOp>
constexpr T canonical_reduce_lanes(InputIt first, InputIt last, T init, BinaryOp op);
// Span-based topology (numerics convenience; derives L from sizeof(V))
template <size_t M, class InputIt, class T, class BinaryOp>
constexpr T canonical_reduce(InputIt first, InputIt last, T init, BinaryOp op);
Rationale: The choice of topology affects observable results for non-associative operations (e.g., floating-point addition). This argues for an API whose contract explicitly includes the topology coordinate, rather than treating topology as an implementation detail.
Some environments prefer selecting topology using a byte-based span
coordinate M rather than specifying the lane count
L directly. Such a coordinate is derived only and does not
change the semantic definition in §4, which is defined entirely in terms
of L.
Let value_type denote
iter_value_t<InputIt>. When a span coordinate
M is used and is well-formed for value_type,
derive:
L = M / sizeof(value_type).
Interpret the span spelling using the same definitional function name:
CANONICAL_INTERLEAVED_REDUCE(M, value_type, op, X) = CANONICAL_INTERLEAVED_REDUCE(L, op, X),
with L as above.
M is well-formed for value_type only when:
- M >= sizeof(value_type), and
- M % sizeof(value_type) == 0.
A future API should reject invalid M rather than silently rounding to a different L.
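A non-normative sketch of this derivation and well-formedness check (hypothetical helper, not proposed):
#include <cstddef>

// Illustrative only: derive the lane count L from a byte span M for a
// layout-stable value type V, rejecting ill-formed spans at compile time.
template <std::size_t M, class V>
constexpr std::size_t derived_lane_count() {
    static_assert(M >= sizeof(V), "M must cover at least one element");
    static_assert(M % sizeof(V) == 0, "M must be a whole multiple of sizeof(V)");
    return M / sizeof(V);
}
// e.g. derived_lane_count<64, double>() == 8 and derived_lane_count<64, float>() == 16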
[Note: While the span spelling M provides
convenient alignment with SIMD register widths for layout-stable
arithmetic types, users requiring cross-platform expression identity
(e.g., for UDTs or heterogeneous verification) must specify topology via
lane count L. The mapping
L = M / sizeof(value_type) is platform-dependent when
sizeof(value_type) varies across targets. —end
note]
The span spelling selects topology by deriving
L = M / sizeof(value_type) when well-formed.
Example: M = 64 bytes
Case A: value_type = float (sizeof(float) = 4) => L = 64 / 4 = 16 lanes
Case B: value_type = double (sizeof(double) = 8) => L = 64 / 8 = 8 lanes
Interleaving rule (semantic): element i belongs to lane (i % L).
For L = 8 (e.g. double, M=64):
Input indices: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Lane index (i%8): 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 ...
Lanes (preserving original order within each lane):
lane 0: X[0], X[8], X[16], ...
lane 1: X[1], X[9], X[17], ...
lane 2: X[2], X[10], X[18], ...
...
lane 7: X[7], X[15], X[23], ...
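A minimal sketch of this semantic lane assignment (illustrative only; a conforming implementation is not required to materialize lanes):
#include <cstddef>
#include <vector>

// Element i belongs to lane (i % L); order within each lane follows input order.
std::vector<std::vector<double>> lanes_of(const std::vector<double>& X, std::size_t L) {
    std::vector<std::vector<double>> lanes(L);
    for (std::size_t i = 0; i < X.size(); ++i)
        lanes[i % L].push_back(X[i]);
    return lanes;
}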
When topology is selected via M, expression identity
across platforms additionally depends on sizeof(value_type)
being stable within the intended reproducibility domain. Users who
require topology identity independent of representation should specify
L directly.
Convenience spelling example (informative):
// Primary spelling: select topology via lane count L
auto r_lanes = canonical_reduce_lanes<8>(first, last, init, op); // L = 8
// Convenience spelling: select topology via a byte span M (numeric domains)
// For value_type = double, sizeof(double) == 8, so M = 64 derives L = 8:
auto r_span = canonical_reduce<64>(first, last, init, op); // M = 64 bytes
Expose the semantics as a new execution policy (illustrative spelling only):
// Illustrative only: policy type/spelling not proposed in this paper
template <size_t M>
struct canonical_policy { /* ... */ };
This approach integrates naturally with the existing parallel algorithms vocabulary. However, as established in §3.0, execution policies in the current standard are designed to constrain scheduling, not expression structure. Encoding topology in a policy would require the policy to carry semantic guarantees about the returned value — a role that policies do not currently play. It also raises unresolved questions about policy composition: what happens when two policies specify conflicting topologies, or when a topology-carrying policy composes with one that permits reassociation?
The expression/algorithm/execution analysis in §3.0 informs this trade-off:
- The algorithm approach keeps the expression structure in the algorithm's own contract, in the same way that std::accumulate computes a left fold as part of its specification.
- The policy approach would blur the current division of roles between std::reduce and execution::seq. It would also require new rules for policy dominance and semantic interaction that do not exist in the current standard.
This paper uses canonical_reduce_lanes<L>(...) as
not exist in the current standard.This paper uses canonical_reduce_lanes<L>(...) as
the primary illustrative spelling, reflecting that lane count
L is the semantic topology coordinate. A span spelling
canonical_reduce<M>(...) may also appear as a
numerics convenience (where L = M / sizeof(V) when
well-formed). Final naming should communicate that the topology coordinate is part of the semantic contract, not a performance hint.
If an eventual API offers a no-argument default topology selection, it should avoid “silent semantic drift” across targets or toolchain versions.
In this paper, a topology preset (also referred to as a span preset) is a named, standard-fixed constant used to select the topology coordinate—typically the interleave span M (bytes), and thereby the derived lane count L for layout-stable numeric types—in a way that remains stable across targets and toolchain versions.
| Option | Example spelling | Trade-off |
|---|---|---|
| No default | canonical_reduce<M>(...) required | Most explicit; no surprises; but higher user burden |
| Implementation-defined default | canonical_reduce(...) selects an implementation-defined topology | Easy to teach; can silently change returned values across targets/versions |
| Standard-fixed named presets | canonical_reduce<std::canonical_span_small>(...) | Readable; coordinated semantics; stable across targets/versions |
| Explicit literal | canonical_reduce_lanes<128>(...) | Maximum control; most verbose; requires users to pick a number |
A committee-robust approach is to provide standard-fixed named presets for the span coordinate, and (if a default is adopted for numeric types) define it in terms of such a preset constant.
Illustrative named presets (standard-fixed):
namespace std {
// Named semantic presets (bytes)
inline constexpr size_t canonical_span_small = 128; // baseline coordination preset
inline constexpr size_t canonical_span_large = 1024; // wide coordination preset
// The golden reference span for a given V (derives L == 1)
template<class V>
inline constexpr size_t canonical_span_max_portability = sizeof(V);
// Evolution rule: existing preset values never change.
// Future standards may add new presets under new names.
}
// This paper does not propose a global default topology.
// A follow-on API paper may propose a default in terms of a named preset.
[Note: Appendix J provides an indicative “straw-man” API that uses these standard-fixed named presets (Option 3) without committing the committee to a final spelling or placement. —end note]
For types where sizeof(V) is not stable across the
intended reproducibility domain, users requiring cross-platform topology
identity should select topology by lane count L (or enforce
representation as part of the contract).
The semantic coordinate is L. When using the span
convenience M, L = M / sizeof(value_type).
Performance is often improved when L aligns with the
target’s preferred execution granularity, but the Standard does not
prescribe hardware behavior.
- Golden reference / maximum portability: L = 1 (equivalently, M = sizeof(value_type)), yielding a single global canonical tree.
- CPU SIMD targets: L matching the target SIMD lane count (e.g., L = 4 for AVX2/double, L = 8 for AVX-512/double).
- GPU targets: L = warp_width (e.g., L = 32 on NVIDIA GPUs).
| Use case | L (double) | L (float) | M (bytes) | Notes |
|---|---|---|---|---|
| Golden reference | 1 | 1 | sizeof(V) | Single tree; for debugging/golden values |
| Portability baseline | 2 | 4 | 16 | SSE/NEON width |
| AVX / AVX2 | 4 | 8 | 32 | Desktop/server AVX2 |
| AVX-512 | 8 | 16 | 64 | AVX-512 servers |
| CUDA warp (double) | 32 | — | 256 | 32-thread warp |
We seek direction on the semantic contract and tree shape; API surface is explicitly deferred.
- Complexity: the canonical shape performs O(N) applications of binary_op (accounting for carry/propagation rules) and has O(log N) depth; the semantic definition imposes no workspace requirement (implementations may use workspace).
- Requirements on binary_op and iterators follow std::reduce and execution-policy requirements.
Vote categories for all Favor/Against polls: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
Question: We agree this paper proposes semantics only, and we want LEWG to validate the fixed expression structure before committing to API design.
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
Question: We agree the canonical expression defined in §4 is a suitable semantic contract: implementations may execute using threads/SIMD/work-stealing/GPU kernels, but must produce results as-if evaluating the canonical compute sequence defined in §4. (Approval of this poll validates the semantic contract only and does not approve an API.)
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
Question: We agree that iterative pairwise reduction with carry (§4.2.3) is an acceptable canonical tree shape for this proposal.
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
[Note: Recursive bisection (“balanced”) was considered as an alternative tree construction. It is documented in Appendix X (informative) for comparison and historical record, but is not proposed as an alternative in this paper. —end note]
If consensus is not reached on tree shape, authors will return with a follow‑up comparing alternatives; lack of consensus on this poll does not block agreement on the core semantic contract (Poll 2).
Question: We agree the authors may return with one or more API surface proposals (new algorithm and/or execution policy), including proposed naming and header placement, an initial overload set (iterators/ranges, with/without execution policy as applicable), constraints/mandates, and a proposal for topology selection (including any named topology presets and defaulting rules), contingent on favorable outcomes for Polls 1–2, without committing LEWG to adopt a specific surface.
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
[Note: Polls 4A and 4B are straw polls to guide the direction of a future API revision, contingent on favorable outcomes for Polls 1–3. They do not approve an API in this paper. —end note]
Question: For the API surface, we support pursuing a
new algorithm approach (topology parameter is visible
in the algorithm contract;
e.g. canonical_reduce_lanes<L>(...) (primary,
semantic coordinate) with an optional span spelling
canonical_reduce<M>(...) as convenience for numeric
domains).
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
Question: For the API surface, we support pursuing an execution policy approach (topology is encoded in a policy type and composes with existing parallel algorithm forms).
Vote: Strongly Favor / Weakly Favor / Neutral / Weakly Against / Strongly Against
Interpretation (non-normative): If both Poll 4A and Poll 4B are favored, return with both surfaces (or an algorithm-first surface plus a policy mapping) and ask LEWG to converge on a primary surface after reviewing wording/teachability/composition.
The author thanks Bryce Adelstein Lelbach for a helpful discussion on the algorithm-vs-execution-policy design trade-off.
DxxxxR0: Initial draft.
[algorithms.general] — ISO/IEC 14882, General requirements for algorithms. Specifies exception behavior for sequential algorithm overloads.
[algorithms.parallel.exceptions] — ISO/IEC
14882, Exception handling in parallel algorithms. Specifies that
std::terminate is called when an exception escapes during
parallel execution under standard execution policies.
[algorithms.parallel.exec] — ISO/IEC 14882, Execution policies. Specifies requirements for element access functions and parallel execution semantics.
[numeric.ops.accumulate] — ISO/IEC 14882,
Accumulate. Specifies the left-fold semantics of
std::accumulate.
[numeric.ops.reduce] — ISO/IEC 14882, Reduce.
Specifies the generalized sum semantics of std::reduce,
which permits reassociation for parallel execution.
[P1928] — Matthias Kretz. “std::simd — merge data-parallel types from the Parallelism TS 2.” WG21 paper P1928. Available at: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p1928r8.pdf
[P2300] — Michał Dominiak, Georgy Evtushenko,
Lewis Baker, Lucian Radu Teodorescu, Lee Howes, Kirk Shoop, Michael
Garland, Eric Niebler, Bryce Adelstein Lelbach.
“std::execution.” WG21 paper P2300R10, adopted for C++26 at
the St. Louis plenary, June 2024. Available at:
https://wg21.link/P2300R10
[P3375R2] — Guy Davidson. “Reproducible
floating-point results.” WG21 paper P3375R2, 2025. Proposes a
strict_float type specifying sufficient conformance with
ISO/IEC 60559:2020 to guarantee reproducible floating-point arithmetic
across implementations. Available at:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3375r2.html
[IntelCNR] — Intel Corporation. “Introduction to Conditional Numerical Reproducibility (CNR).” Intel oneAPI Math Kernel Library Developer Guide. Available at: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide/current/conditional-numerical-reproducibility.html
[NvidiaCUB] — NVIDIA Corporation. “CUB: CUDA UnBound.” NVIDIA CUB Documentation. Describes deterministic reduction variants with fixed-order tree reduction. Available at: https://nvlabs.github.io/cub/
[KokkosReduce] — Sandia National Laboratories. “Custom Reductions: Determinism.” Kokkos Documentation. Available at: https://kokkos.github.io/kokkos-core-wiki/API/core/parallel-dispatch/parallel_reduce.html
[IntelTBB] — Intel Corporation. “Specifying a Partitioner.” Intel oneAPI Threading Building Blocks Developer Guide. Available at: https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-reference/current/partitioner.html
[Dalton2014] — B. Dalton, E. Wang, and R. Blainey. “SIMDizing pairwise sums: a summation algorithm balancing accuracy with throughput.” Proceedings of the Workshop on Programming Models for SIMD/Vector Processing (WPMVP), 2014. DOI: https://doi.org/10.1145/2568058.2568070
[Demmel2013] — James Demmel and Hong Diep Nguyen. “Fast Reproducible Floating-Point Summation.” Proceedings of the 21st IEEE Symposium on Computer Arithmetic (ARITH), April 2013, pp. 163–172. DOI: https://doi.org/10.1109/ARITH.2013.9
[Higham2002] — Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms, Second Edition. Society for Industrial and Applied Mathematics (SIAM), 2002. ISBN: 978-0-89871-521-7. Chapter 4 discusses pairwise summation and error analysis of floating-point summation algorithms.
This appendix is illustrative; no API is proposed in this paper (§2). It shows one way the semantic definition could be expressed for a future Standard Library facility. Names, headers, and final API shape are intentionally provisional.
The following shows how the semantic definition could be expressed for a hypothetical algorithm that exposes both topology coordinates:
- a lane form that takes L directly, and
- a span form that derives L = M / sizeof(V) when well-formed.
// Lane-based topology (portable across ABIs for a fixed L)
template<size_t L, class InputIterator, class T, class BinaryOperation>
constexpr T canonical_reduce_lanes(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
template<size_t L, class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
T canonical_reduce_lanes(ExecutionPolicy&& policy,
ForwardIterator first, ForwardIterator last, T init,
BinaryOperation binary_op);
// Span-based topology (numerics convenience; derives L from sizeof(V))
template<size_t M, class InputIterator, class T, class BinaryOperation>
constexpr T canonical_reduce(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
template<size_t M, class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
T canonical_reduce(ExecutionPolicy&& policy,
ForwardIterator first, ForwardIterator last, T init,
BinaryOperation binary_op);
Constraints:
- For overloads with an ExecutionPolicy, ForwardIterator meets the Cpp17ForwardIterator requirements.
Mandates (lane form):
- L >= 1.
Mandates (span form):
- Let V = iter_value_t<Iter> for the relevant iterator type.
- M >= sizeof(V).
- M % sizeof(V) == 0.
- The span form is equivalent to the lane form with L = M / sizeof(V).
For the span form, let L = M / sizeof(V).
A conforming implementation returns a value that is as-if evaluation of the canonical abstract expression defined in §4, which may be constructed conceptually as follows:
1. Form the term sequence X[0..N) by converting each input element to the reduction state type A as specified in §4.6.
2. Select the lane count L (directly, or via M→L), and partition terms into L logical lanes by index modulo as specified in §4.3.1.
3. Let K = ceil(N / L) (for N > 0). For each lane index j in [0, L), form a fixed-length leaf sequence Y_j[0..K) of maybe<A> where Y_j[t] is present(X[j + t*L]) when j + t*L < N, and ∅ otherwise (§4.3.2).
Informative: When N < L, some lane indices have no corresponding input positions; such lanes are best understood as “no lane data exists”. In the canonical expression this is represented by Y_j[t] == ∅ for all t, yielding R_j == ∅.
4. For each lane j, compute R_j = CANONICAL_TREE_EVAL(binary_op, Y_j) using the canonical balanced tree shape and the lifted COMBINE rules for ∅ (§4.2–§4.4.1).
5. Form Z[0..L) where Z[j] = R_j and compute R_all = CANONICAL_TREE_EVAL(binary_op, Z) (§4.4.2).
6. Apply init (if provided). If an initial value is provided, the result is init when N == 0, otherwise binary_op(init, value(R_all)), as specified in §4.5.
[Note: init is combined once with the total
an extra element would shift lane assignments and change the selected
canonical expression topology. —end note]
Informative: An implementation may skip forming absent
leaves and lanes that contain no elements, provided the returned value
is as-if evaluation of the canonical expression (since ∅
does not induce applications of binary_op).
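The following sequential reference sketch follows the steps above (illustrative only; hypothetical names). It assumes the iterative pairwise-with-carry shape in which adjacent partials are paired on each pass and an odd trailing partial is carried forward; the normative tree shape is the one defined in §4.2.3 and takes precedence if it differs. Absent leaves are simply skipped, which is as-if because ∅ induces no application of binary_op.
#include <cstddef>
#include <optional>
#include <vector>

// Pairwise-with-carry tree over a level of maybe<A> values (∅ skipped).
template <class A, class BinaryOp>
std::optional<A> tree_eval(std::vector<std::optional<A>> level, BinaryOp op) {
    while (level.size() > 1) {
        std::vector<std::optional<A>> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2) {
            if (level[i] && level[i + 1]) next.push_back(op(*level[i], *level[i + 1]));
            else if (level[i])            next.push_back(level[i]);
            else                          next.push_back(level[i + 1]);
        }
        if (level.size() % 2 == 1) next.push_back(level.back()); // carry the odd partial
        level = std::move(next);
    }
    if (level.empty()) return std::nullopt;
    return level.front();
}

// Steps 1-6: per-lane trees, tree over lane results, then op(init, R).
template <class A, class BinaryOp>
A canonical_reference(const std::vector<A>& X, std::size_t L, A init, BinaryOp op) {
    const std::size_t N = X.size();
    if (N == 0) return init;
    std::vector<std::optional<A>> lane_results(L);
    for (std::size_t j = 0; j < L; ++j) {
        std::vector<std::optional<A>> leaves;                    // lane j: X[j], X[j+L], ...
        for (std::size_t t = 0; j + t * L < N; ++t) leaves.push_back(X[j + t * L]);
        lane_results[j] = tree_eval<A>(leaves, op);              // R_j (∅ if the lane is empty)
    }
    std::optional<A> R_all = tree_eval<A>(lane_results, op);     // tree over lane results
    return op(init, *R_all);                                     // op(I, R), per §4.5
}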
This appendix provides detailed guidance for implementers. It is informative only; the normative specification is the canonical compute sequence defined in §4.
General principle: Implementations need not instantiate lanes for empty subsequences; only the logical reduction order must be preserved. For example, if L = 1000 but N = 5, the implementation need not allocate 1000 accumulators — only the 5 non-empty subsequences participate in the final reduction.
Real implementations exploit two orthogonal forms of parallelism:
| Parallelism | Mechanism | Typical Scale |
|---|---|---|
| SIMD | Vector registers, GPU warps | 4–64 lanes |
| Threads | CPU cores, GPU blocks | 4–thousands |
This leads to two levels of reduction:
FINAL RESULT
|
+-----------+-----------+
| THREAD REDUCTION | <-- Combine thread results
+-----------+-----------+
|
+---------------+---------------+
| | |
+-----+-----+ +-----+-----+ +-----+-----+
| SIMD | | SIMD | | SIMD |
| REDUCTION | | REDUCTION | | REDUCTION |
+-----------+ +-----------+ +-----------+
Both levels must produce results matching the canonical expression.
The canonical expression supports efficient vertical addition:
For L = 4, N = 16:
Load V0 = {E[0], E[1], E[2], E[3]} → Acc = V0
Load V1 = {E[4], E[5], E[6], E[7]} → Acc = Acc + V1
Load V2 = {E[8], E[9], E[10], E[11]} → Acc = Acc + V2
Load V3 = {E[12], E[13], E[14], E[15]} → Acc = Acc + V3
Acc now holds {R_0, R_1, R_2, R_3}
Final horizontal reduction: op(op(R_0, R_1), op(R_2, R_3))
This is contiguous loads plus vertical adds — optimal for SIMD.
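The same vertical-accumulation pattern in scalar form (illustrative only; op is assumed to be addition, and N is assumed to be at least 4 and a multiple of 4, with remainder handling elided):
#include <cstddef>

// Four accumulators updated "vertically"; acc[lane] builds lane `lane`'s partial
// in input order. A SIMD implementation holds acc in a single vector register.
void vertical_accumulate(const double* E, std::size_t N, double acc[4]) {
    for (std::size_t lane = 0; lane < 4; ++lane) acc[lane] = E[lane];  // first block
    for (std::size_t base = 4; base + 4 <= N; base += 4)
        for (std::size_t lane = 0; lane < 4; ++lane)
            acc[lane] += E[base + lane];
    // acc now holds {R_0, R_1, R_2, R_3}; finish with the canonical horizontal tree.
}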
With multiple threads, each processes a contiguous chunk:
Thread 0: E[0..256) → {R_0^T0, R_1^T0, R_2^T0, R_3^T0}
Thread 1: E[256..512) → {R_0^T1, R_1^T1, R_2^T1, R_3^T1}
...
Thread partial results are combined per lane, then across lanes:
Global R_0 = CANONICAL_TREE_EVAL(op, {R_0^T0, R_0^T1, ...})
Global R_1 = CANONICAL_TREE_EVAL(op, {R_1^T0, R_1^T1, ...})
...
Final = CANONICAL_TREE_EVAL(op, {R_0, R_1, R_2, R_3})
The sequence maps to GPU architectures:
// Vertical addition within block (acc holds this lane's running partial;
// its initialization from the lane's first element is elided here)
for (size_t base = 0; base < n; base += L) {
    size_t idx = base + lane;
    if (idx < n) acc = op(acc, input[idx]);
}
// Publish per-lane partials before the horizontal tree
shared[lane] = acc;
__syncthreads();
// Horizontal reduction using canonical tree
for (size_t stride = L/2; stride > 0; stride /= 2) {
    if (lane < stride) shared[lane] = op(shared[lane], shared[lane + stride]);
    __syncthreads();
}
// Cross-block: fixed slots, not atomics
block_results[blockIdx.x] = shared[0]; // Deterministic position
Key constraints:
- Static work assignment (no work stealing)
- Fixed block result slots (no atomicAdd)
- Canonical tree for horizontal reduction
[Note: This GPU sketch is illustrative and non-normative; it demonstrates one possible mapping of the canonical expression to a GPU kernel and is not intended to constrain implementation strategy. —end note]
The proposal defines the expression structure. All conforming
implementations evaluate as-if by the same canonical parenthesization
and operand ordering for a given topology coordinate L.
Within a single, controlled environment (same
ISA/backend, same compiler and flags, and equivalent floating-point
evaluation models), this removes the run-to-run variability of
std::reduce by fixing the reduction tree.
Across architectures/backends (CPU ↔︎ CPU, CPU ↔︎ GPU): the facility guarantees expression/topology equivalence — i.e., the same abstract tree is specified. Achieving bitwise identity additionally requires aligning the floating-point evaluation environment (§2.5, §6).
This proposal standardizes the abstract expression structure (parenthesization and operand order). Bitwise-identical results across different platforms, compilers, or architectures generally require that fixed expression structure and an equivalent floating-point evaluation environment.
[Note: For fundamental floating-point types with
non-associative operations (e.g., float, double with
std::plus), factors that commonly affect bitwise results
include (non-exhaustive):
- Whether binary_op (or upstream transformations such as views::transform) calls implementation-defined math libraries; if so, results may differ.
- Whether both evaluations use the same topology coordinate (L, or M→L) and the same input order.
—end note]
What this proposal guarantees: for fixed topology
coordinate, input order, and binary_op, the abstract
expression (parenthesization and operand order) is identical across all
conforming implementations.
What remains platform-specific: the floating-point evaluation model (items above). Aligning it may require toolchain controls (for example, disabling contraction and avoiding fast-math; exact spellings are toolchain-specific).
Verification workflow: the demonstrators in Appendix K use a fixed seed and publish expected hex outputs for representative topology coordinates. Matching those values across platforms validates both (a) correct expression structure (this proposal) and (b) sufficiently aligned floating-point environments (user responsibility).
Illustrative CPU ↔︎ GPU check:
// Compile and run both sides under equivalent FP settings (toolchain-specific).
double cpu = canonical_reduce_lanes<16>(data.begin(), data.end(), 0.0, std::plus<>{});
double gpu = cuda_canonical_reduce_lanes<16>(d_data, N, 0.0);
// Bitwise equality is achievable when the FP evaluation environments are aligned:
assert(std::bit_cast<uint64_t>(cpu) == std::bit_cast<uint64_t>(gpu));
This enables “golden result” workflows where a reference evaluation can act as a baseline for accelerator correctness verification.
Physical > Logical (e.g., 256 GPU threads, L = 8): Multiple threads cooperate on each logical lane. Partial results combine using the canonical tree.
Physical < Logical (e.g., 4 physical SIMD lanes, L = 16 logical lanes): Process logical lanes in chunks across multiple physical iterations. The logical sequence is unchanged; only the physical execution differs.
Representative measurements appear in Appendix N.12. With proper SIMD optimization (8-block unrolling), the reference implementation indicates that the canonical expression structure can be evaluated at throughput comparable to unconstrained reduction, suggesting that conforming implementations need not incur prohibitive overhead. Observed overhead is workload- and configuration-dependent; see also the demonstrators in Appendix K for platform observations.
The tension between parallel performance and numerical reproducibility is a well-documented challenge in high-performance computing (HPC) and distributed systems. This appendix summarizes how existing frameworks address the “Grouping Gap.”
Kokkos is a widely used C++ programming model for exascale computing. Its documentation explicitly warns users about the lack of run-to-run stability (for FP) in reductions:
“The result of a parallel_reduce is not guaranteed to be bitwise identical across different runs on the same hardware, nor across different numbers of threads, unless the joiner operation is associative and commutative.” [KokkosReduce]
In practice, HPC users requiring reproducibility in Kokkos must often implement “two-pass” reductions or use fixed-point arithmetic, which imposes a significant developer burden. The interleaved topology proposed in §4.3 of this paper mirrors the “vector-lane” optimization patterns used in Kokkos’s backend, but elevates it to a deterministic semantic contract.
Intel TBB is the industry standard for task-parallelism on CPUs. Its
parallel_reduce implementation uses a recursive splitting
strategy that relies on a “Join” pattern:
the range is recursively split into subranges, each body reduces its subrange, and partial results are merged via the body’s join() method.
Because TBB uses a work-stealing scheduler, the order in which these
join() operations occur is non-deterministic — it depends
on which thread becomes free first. While TBB offers a
static_partitioner to minimize this, it does not guarantee
a canonical tree topology, making bitwise identity fragile across
different thread counts [IntelTBB].
For GPU architectures, NVIDIA provides the Thrust and CUB libraries.
Thrust: Generally favors throughput and uses atomic operations in many reduction paths. These atomics are processed in an order determined by hardware thread scheduling, leading to non-deterministic floating-point results.
CUB: Provides more granular control through “Block-level” and “Warp-level” primitives. The interleaved subsequences (S_j) defined in §4.3 are mathematically equivalent to the “blocked-arrangement” and “striped-arrangement” memory patterns used in CUB to achieve peak bandwidth on SIMD-intensive hardware [NvidiaCUB].
Significant research has been conducted into “Reproducible Summation” (e.g., [Demmel2013]). These solutions typically fall into two categories:
- Exact summation: make the computed sum independent of evaluation order (for example via wide accumulators or error-free transformations), at additional runtime cost.
- Fixed topology: fix the grouping and order of operations so that every execution evaluates the same expression.
This proposal follows the Fixed-Topology school of thought. It
recognizes that while we cannot easily standardize “Exact Summation” due
to its cost, we can standardize the topology, which provides
reproducibility for a given platform at a much lower performance cost
(10–15% overhead relative to std::reduce in representative
configurations; see Appendix H).
See References for full citations of [KokkosReduce], [IntelTBB], [NvidiaCUB], and [Demmel2013].
In the C++ Standard, sequential algorithms like
std::accumulate and std::ranges::fold_left
have a mandated evaluation order — the grouping of operations is fully
specified as a left-fold.
Important distinction: The Standard specifies evaluation order, not bitwise reproducibility. Even with identical evaluation order, results may differ across compilers or platforms due to floating-point evaluation model factors (see §D.5). However, a fixed evaluation order is a necessary precondition for reproducibility — without it, reproducibility is impossible even under identical floating-point evaluation models.
The evaluation order guarantee for std::accumulate is
found in the Numerics library section of the Standard.
Section: [numeric.ops.accumulate]/2 (In N4950/C++23: §27.10.3)
The Text: The Standard defines the behavior of
accumulate(first, last, init, binary_op) as:
“Computes its result by initializing the accumulator acc with the initial value init and then modifies it with
acc = std::move(acc) + *i or acc = binary_op(std::move(acc), *i) for every iterator i in the range [first, last) in order.”
What this guarantees: The expression structure is
fixed as (((init + a) + b) + c).... The grouping is fully
specified — there is no implementation freedom in how operations are
combined.
What this does not guarantee: Bitwise identical results across compilers or platforms. The same grouping may produce different bits due to floating-point evaluation model differences.
Contrast with std::reduce: std::reduce
uses a generalized sum ([numerics.defns]/1–2, [numeric.ops.reduce]/7),
allowing the implementation to group elements as
((a + b) + (c + d)) or any other valid tree. This makes
even the grouping non-deterministic.
For the newer Ranges-based algorithms, the specification is even more explicit about its algebraic structure.
Section: [alg.fold] (In N4950/C++23: §27.6.18)
The Text:
“The range fold algorithms are sequential operations that perform a left-fold…
ranges::fold_left(R, init, f) is equivalent to:
auto acc = init;
for (auto&& e : R)
    acc = f(std::move(acc), e);
return acc;”
What this guarantees: By defining the algorithm via an explicit for loop, the Standard fully specifies the evaluation order. The state of the accumulator at step N depends exactly and only on the state at step N-1 and the Nth element.
| Property | Sequential (accumulate/fold_left) | Parallel (reduce) |
|---|---|---|
| Standard Section | [numeric.ops.accumulate] / [alg.fold] | [numeric.ops.reduce] |
| Grouping | Mandated: Left-to-right | Generalized Sum (Unspecified) |
| Complexity | O(N) operations | O(N) operations |
| Evaluation Order | Fully specified | Not specified |
Key insight: std::accumulate and
std::reduce differ in whether the grouping is specified,
not in whether they guarantee “bitwise reproducibility” (neither does,
strictly speaking).
This proposal specifies init placement as op(I, R)
(where I is the initial value materialized as type
A per §4.6) — the initial value is combined once at the end
with the tree result. The table below compares this to
std::accumulate:
| Aspect | std::accumulate | This Proposal (op(I, R)) |
|---|---|---|
| Init handling | Folded at every step | Combined once at end |
| Structure | (((init ⊕ a) ⊕ b) ⊕ c) | init ⊕ tree_reduce(a,b,c,d) |
| Init participates in | N operations | 1 operation |
Implication for non-associative operations: For
associative operations, these approaches produce equivalent results. For
non-associative operations (e.g., floating-point addition), the results
may differ. Users migrating from std::accumulate should be
aware of this distinction.
See Appendix E for rationale behind the op(I, R) design
choice.
The Standard specifies evaluation order, not bitwise results. Even
when the grouping of operations is fully specified (as in
std::accumulate), bitwise identity across different
compilers, platforms, or even different runs is not guaranteed due
to:
- FMA contraction: a compiler may contract a * b + c into a single fused multiply-add instruction on one platform but not another, changing the result bits. This is controlled by #pragma STDC FP_CONTRACT and compiler flags like -ffp-contract.
What this proposal provides: A fully specified expression structure (grouping and operand order). Combined with user control of the floating-point evaluation model, this enables reproducibility.
What this proposal does not provide: Automatic cross-platform bitwise identity. Users must also control their floating-point evaluation model (see §6).
A potential concern is whether compilers might reassociate the reduction operations defined by this proposal, defeating the run-to-run stability guarantee. This section explains why such reassociation would be non-conforming.
The C++ Standard permits compilers to perform any transformation that does not change the observable behavior of a conforming program ([intro.abstract]/1). This is commonly called the “as-if rule.”
For floating-point arithmetic, reassociation (changing
(a + b) + c to a + (b + c)) generally changes
the result due to rounding. Therefore, a compiler cannot reassociate
floating-point operations under the as-if rule unless it can prove the returned value is unchanged, or the user has opted out of conforming floating-point behavior (e.g., via -ffast-math; see the discussion below).
std::accumulate mandates a specific evaluation order
([numeric.ops.accumulate]/2):
“Computes its result by initializing the accumulator acc with the initial value init and then modifies it with
acc = std::move(acc) + *i or acc = binary_op(std::move(acc), *i) for every iterator i in the range [first, last) in order.”
This wording constrains implementations to a left-fold:
(((init ⊕ a) ⊕ b) ⊕ c) ...
Compilers respect this constraint. A compiler that reassociated
std::accumulate into a tree reduction would be
non-conforming, because the observable result would differ for
non-associative operations (including floating-point addition).
This proposal mandates a specific expression structure with the same
normative force. The “Generalized Sum” semantics of
std::reduce explicitly grant permission to reassociate;
this proposal removes that permission — exactly as
std::accumulate mandates a specific left-fold
expression.
[Note: For std::accumulate, “expression
structure” and “evaluation order” coincide because the left-fold is
inherently sequential. For this proposal, the expression structure
(parenthesization) is fixed, but independent subexpressions may be
evaluated concurrently. —end note]
The only difference from std::accumulate is the shape of
the mandated expression:
| Algorithm | Mandated Expression Shape |
|---|---|
| std::accumulate | (((init ⊕ a) ⊕ b) ⊕ c) (linear fold) |
| This proposal | init ⊕ ((a ⊕ b) ⊕ (c ⊕ d)) (balanced binary tree) |
Both specifications constrain the parenthesization. Both are subject to the same as-if rule. A compiler that reassociates one would equally violate conformance by reassociating the other.
Compiler flags like -ffast-math,
-fassociative-math, or /fp:fast explicitly opt
out of IEEE 754 compliance. Under these flags:
- std::accumulate may not produce left-fold results
- std::reduce may not match any particular grouping
This is not a defect in the proposal — it is the documented behavior
of these flags. Users who enable -ffast-math have
explicitly traded determinism for performance. The same trade-off
applies to all floating-point code, not just reductions.
Recommendation for users requiring run-to-run
stability: Compile with -ffp-contract=off (or
equivalent) and avoid -ffast-math. This applies equally to
std::accumulate and to this proposal.
| Concern | Resolution |
|---|---|
| “Compilers might reassociate the tree” | Violates as-if rule, same as for std::accumulate |
| “What about -ffast-math?” | User has opted out of IEEE 754; all FP guarantees void |
| “Is this proposal different from existing algorithms?” | No — same conformance model as std::accumulate |
The expression-structure guarantee in this proposal has the same
normative force (with respect to parenthesization and operand order) as
the mandated left-fold structure in std::accumulate.
Compilers that respect the latter will respect the former.
This appendix provides rationale for the init placement design
choice. This proposal specifies Option A: op(I, R) —
the initial value (materialized as a value I of type
A per §4.6) is applied as the left operand to the result of
the canonical tree reduction. This appendix explains why this option was
chosen over alternatives.
[Note: init is combined once with the total
result and is not treated as an additional input element; treating it as
an extra element would shift lane assignments and change the selected
canonical expression topology. —end note]
Option A: Post-reduction (op(I, R)) —
CHOSEN
canonical_reduce<M>(E[0..N), init, op):
if N == 0:
return init
let R = interleaved_reduce<M>(E[0..N))
let I = A(init) // materialize init as the reduction state type
return op(I, R)
Pros:
- init is not part of the canonical reduction tree — clear separation of concerns
- Simplifies parallel implementation (tree can complete before init is available)
- Matches std::reduce’s treatment of init ([numeric.ops.reduce])
- Empty range handling is trivial: return init
Cons:
- Differs from std::accumulate’s left-fold semantics
- For non-associative operations, results differ from left-fold expectations
Option B: Post-reduction (op(R, I))
Same as Option A but with init as the right operand.
Pros:
- Same implementation simplicity as Option A
Cons:
- Less intuitive for users expecting init to be “first”
- Still differs from std::accumulate
Option C: Treat init as element 0 (prepend to sequence)
canonical_reduce<M>(E[0..N), init, op):
let E' = {init, E[0], E[1], ..., E[N-1]}
return interleaved_reduce<M>(E'[0..N+1))
Pros:
- init participates in the canonical tree with a known position
- More predictable for non-associative operations
Cons:
- Shifts all element indices by 1
- init assigned to lane 0, changing topology when L > 1
- Complicates the interleaving definition
Option D: Leave implementation-defined
The standard specifies that init participates in exactly one
binary_op application with the tree result, but does not
specify the order.
Pros:
- Maximum implementation flexibility
- Avoids contentious design decision
- For associative operations (the common case), result is unaffected
Cons:
- Non-deterministic for non-associative operations
- Users cannot rely on specific init behavior
For associative operations (the vast majority of use cases), all options produce equivalent results. The choice only matters for non-associative operations.
For non-associative operations, users already face the fact that the tree reduction differs from left-fold. The init placement is one additional degree of freedom that must be specified for full run-to-run stability.
This proposal specifies Option A (op(I, R)) for the
following reasons:
- It matches std::reduce’s treatment of init ([numeric.ops.reduce]) (as part of the generalized sum, not folded sequentially).
- It keeps the canonical tree shape independent of whether an initial value is provided, and simplifies parallel implementation (the tree can complete before init is applied).
Treating init as “element 0” (Option C) introduces complications: it conflates the role of the initial value with that of an input element, shifts all element indices by 1, and changes the lane assignment (and thus the selected topology) when L > 1.
The post-reduction design keeps these roles separate.
Users who require init to participate as a leaf within the tree (rather than post-reduction) can achieve this through composition:
// To get init as element 0 in the tree:
auto extended = std::views::concat(std::views::single(init), data);
auto result = canonical_reduce<M>(extended.begin(), extended.end(),
                                  identity_element, op);
// Or manually:
auto tree_result = canonical_reduce<M>(data.begin(), data.end(),
                                       identity_element, op);
auto result = op(init, tree_result); // explicit post-reduction (matches this proposal)
This flexibility allows users to achieve alternative semantics when needed while the standard provides a single, well-defined default.
This proposal underwent significant internal development before committee submission. The design space was explored systematically, with key decisions documented in §3 (Design Space) and the appendices. This section summarizes the major evolution points for reviewers interested in the design rationale.
| Decision | Alternatives Considered | Chosen Approach | Rationale |
|---|---|---|---|
| Topology | Left-fold, blocked, N-ary tree | Interleaved balanced binary | O(log N) depth, SIMD-friendly |
| Parameterization | Fixed constant, implementation-defined | User-specified topology: lane count L, with byte span M as convenience | Type-independent spelling available; future-proof |
| Init placement | As leaf, implementation-defined | Post-reduction op(I, R) | Algebraic clarity, matches std::reduce ([numeric.ops.reduce]) |
| Split rule | ⌈k/2⌉, variable | ⌊k/2⌋ (normative) | Unique grouping specification |
Lane count versus bytes: Initial designs were parameterized by lane count L. In the current design, L remains the semantic topology coordinate; M is a convenience spelling for numeric domains. M is convenient for layout-stable arithmetic types because it scales across element sizes (e.g., float vs double), but it does not change the semantic definition of the canonical expression.
Why interleaved, not blocked: Blocked decomposition creates topology that varies with thread count and alignment. Interleaved assignment (index mod L) produces stable topology regardless of execution strategy. See §3.2.
Why user-specified M: Fixed M ages poorly as hardware evolves; implementation-defined M defeats determinism. User specification places control where it belongs. See §3.3.
Init placement: Treating init as “element 0” would
shift all indices and change lane assignment. Post-reduction application
keeps the tree “pure” and matches std::reduce semantics
([numeric.ops.reduce]). See §4.5 and Appendix E.
The core design (fixed topology, user-controlled width) draws on approaches seen in production libraries:
- Intel oneMKL CNR (Conditional Numerical Reproducibility)
- NVIDIA CUB deterministic overloads
- PyTorch/TensorFlow deterministic modes
These libraries address similar problems through different mechanisms. Their existence suggests the design space is viable and addresses real needs. See §3.6 for references.
This appendix contains detailed rationale for design decisions that were summarized in Section 4. It is provided for reviewers seeking deeper understanding of the trade-offs.
M is expressed in bytes rather than lanes for type independence: lane
count is derived from sizeof(V), so a single M works across
element types.
| M (bytes) | double (8B) | float (4B) | int32_t (4B) |
|---|---|---|---|
| 16 | 2 lanes | 4 lanes | 4 lanes |
| 32 | 4 lanes | 8 lanes | 8 lanes |
| 64 | 8 lanes | 16 lanes | 16 lanes |
M corresponds directly to SIMD register width:
| Target | Register Width | M |
|---|---|---|
| SSE / NEON | 16 bytes | 16 |
| AVX / AVX2 | 32 bytes | 32 |
| AVX-512 | 64 bytes | 64 |
| Future (1024-bit) | 128 bytes | 128 |
Implementations with narrower physical registers execute the canonical expression through multiple iterations. The logical topology — which operations combine with which — is unchanged.
The interleaved topology supports the simplest and most efficient SIMD implementation pattern:
Memory: E[0] E[1] E[2] E[3] | E[4] E[5] E[6] E[7] | E[8] ...
└─── Vector 0 ────┘ └─── Vector 1 ────┘
Iteration 1: Load V0 = {E[0], E[1], E[2], E[3]} → Acc = V0
Iteration 2: Load V1 = {E[4], E[5], E[6], E[7]} → Acc = Acc + V1
...
Final: Acc = {R_0, R_1, R_2, R_3}
This pattern achieves:
- One contiguous vector load per iteration (optimal memory access)
- One vector add per iteration (single instruction)
- Streaming sequential access (spatially local)
- No shuffles or gathers until final horizontal reduction
A blocked topology would require gather operations or sequential per-lane processing, losing the SIMD benefit.
The divisibility constraint and formula
L = M / sizeof(V) act as a self-regulating mechanism for
memory efficiency. For M = 64 and double (8 bytes), L = 8, maintaining
exactly 64 bytes of independent partial-accumulator state per logical
stride.
For large types where sizeof(V) > M, the user
specifies M = sizeof(V), giving L = 1. This degenerates to
a single-lane balanced binary tree with perfect spatial locality.
GPU architectures achieve peak efficiency when reduction trees align with warp width (typically 32 threads). For double (8 bytes), a warp-level reduction operates on 32 × 8 = 256 bytes.
By specifying L=32 (equivalently M=256 for double), users define a canonical expression that maps to warp-level operations on GPU while CPU evaluates the same mathematical expression by iterating over narrower registers.
What expression-parity guarantees: All platforms evaluate the same mathematical expression — same parenthesization, same operand ordering, same reduction tree topology.
What it does not guarantee: Bitwise reproducibility requires equivalent floating-point semantics across architectures, which is difficult due to differences in FTZ/DAZ modes, FMA contraction, and rounding behavior.
Specifying L = 1 (equivalently
M = sizeof(V)) collapses the algorithm to a single global
balanced binary tree. This serves as a hardware-agnostic baseline for
CI/CD testing and debugging numerical discrepancies.
std::accumulate specifies a strict linear left-fold with
depth O(N). canonical_reduce specifies a balanced binary
tree with depth O(log N). For non-associative operations, these
represent fundamentally different algebraic expressions.
For floating-point summation, the tree structure is often numerically advantageous: error growth is O(log N · ε) versus O(N · ε) for left-fold. This is well-established as “pairwise summation” in numerical analysis [Higham2002].
If init placement were implementation-defined, two conforming implementations could produce different results for the same inputs. For non-commutative operations:
With op(a, b) = a * 2 + b (non-commutative), I = 5.0, and tree result R = 10.0:
op(I, R) = 5.0 * 2 + 10.0 = 20.0
op(R, I) = 10.0 * 2 + 5.0 = 25.0
Therefore, this proposal normatively specifies op(I, R)
— init as left operand of the final combination.
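The same check in code (a minimal sketch; the operation and values are those used in the example above):
#include <cassert>

int main() {
    auto op = [](double a, double b) { return a * 2 + b; };  // non-commutative
    double I = 5.0, R = 10.0;
    assert(op(I, R) == 20.0);  // specified placement: init as the left operand
    assert(op(R, I) == 25.0);  // the rejected op(R, I) placement would yield this instead
}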
This appendix provides representative prototype measurements to support the claim that enforcing a fixed expression structure is practical. These measurements are not a performance guarantee.
- Floating-point settings: -ffp-contract=off, -fno-fast-math (GCC/Clang); /fp:precise (MSVC)
- Observed overhead: on the order of 10–15% relative to std::reduce for typical inputs in the tested configurations.
Users opt into a canonical reduction when they value reproducibility
and auditability over peak throughput. Users requiring maximum
throughput can continue to use std::reduce (or
domain-specific facilities) where unspecified reassociation is
acceptable.
This appendix records rationale for providing a small set of standard-fixed span preset constants (in bytes) as coordination points for topology selection (§9.5). The goal is to provide readable, stable choices that do not vary with platform properties and therefore avoid “silent semantic drift” in returned values for non-associative operations.
The span coordinate is a numerics convenience: for layout-stable
numeric types, it derives the semantic lane count
L = M / sizeof(V).
A “small” preset should provide useful lane counts for the main scalar sizes while remaining narrow enough that overhead does not dominate for moderate N.
For common scalar sizes this yields:
- float (4 bytes): L = 128 / 4 = 32
- double (8 bytes): L = 128 / 8 = 16
- 64-bit integers (8 bytes): L = 16
This width is a practical coordination point for implementations that exploit instruction-level parallelism through vectorization and batching while preserving the canonical expression structure.
A “large” preset is an explicit opt-in for throughput-oriented structures and heterogeneous verification workflows.
For common scalar sizes this yields:
- float (4 bytes): L = 1024 / 4 = 256
- double (8 bytes): L = 1024 / 8 = 128
Such widths provide substantial independent work per lane and can support aggressive batching and parallel evaluation of independent reduction nodes while preserving the same abstract expression.
Because these presets are standard-fixed, a user can select the same
span constant when executing on different hardware or in different
deployment environments. When the floating-point evaluation model is
aligned (and the same binary_op and input order are used),
the abstract expression structure is identical; any remaining divergence
is attributable to differences in the underlying arithmetic environment
rather than to re-association or topology choice.
[Note: This appendix is informative. These values are coordination points for a stable abstract expression; they do not impose any particular scheduling, threading, or vectorization strategy. —end note]
This appendix is illustrative; no API is proposed in this paper (§2). It records one indicative way an eventual Standard Library facility could expose the semantics defined in §4, while adopting the standard-fixed named preset approach (Option 3 in §9.5). The intent is to give LEWG something concrete to react to, while keeping spelling and header placement explicitly provisional.
Presentation note (informative):
- The semantic topology coordinate is the lane count L.
- A byte span M is a derived convenience that maps to L using sizeof(value_type).
- When cross-platform expression stability is required, prefer APIs that specify L directly.
This approach exposes preset names as standard-fixed constants, and (optionally) defines a default in terms of one of those preset constants.
Illustrative (provisional) names:
namespace std {
// Standard-fixed span presets (bytes). Values never change.
inline constexpr size_t canonical_span_small = 128; // "narrow" preset in prototypes
inline constexpr size_t canonical_span_large = 1024; // "wide" preset in prototypes
// This paper does not propose a default. A follow-on API paper may propose one.
}
Clarification (informative): The preset span values
above are intended to be fixed, constexpr constants, not
implementation-tunable parameters, so that the same canonical expression
shape is selected across implementations. This does not imply
bitwise-identical results across different floating-point evaluation
models.
These byte-span presets are a convenience for selecting
L via L = M / sizeof(value_type); the
canonical expression is defined by the resulting L. Because
sizeof(value_type) may vary across ABIs/platforms, users
who require the same canonical expression across platforms should prefer
specifying L directly.
Interpretation (informative): for common arithmetic types, these spans correspond to the following lane counts:
- double (8 bytes): L_small = 16, L_large = 128
- float (4 bytes): L_small = 32, L_large = 256
These correspond to the “small/narrow” and “large/wide” semantics used in the prototype demonstrations and rationalized in Appendix I.
The simplest surface is an algorithm family parameterized by the span coordinate.
namespace std {
// Primary overloads (illustrative; header placement is out of scope for this paper)
template<size_t M,
class InputIterator, class T, class BinaryOperation>
constexpr T canonical_reduce(InputIterator first, InputIterator last,
T init, BinaryOperation binary_op);
template<size_t M,
class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
T canonical_reduce(ExecutionPolicy&& policy,
ForwardIterator first, ForwardIterator last,
T init, BinaryOperation binary_op);
}

Typical call sites:
// Readable coordination points (small/narrow vs large/wide)
auto a = std::canonical_reduce<std::canonical_span_small>(v.begin(), v.end(), 0.0, std::plus<>{});
auto b = std::canonical_reduce<std::canonical_span_large>(v.begin(), v.end(), 0.0, std::plus<>{});
// No default topology is proposed in this paper (users spell a preset or provide literal)
// If a future default is added, it should be standard-fixed and spelled in terms of a preset
auto c = std::canonical_reduce<std::canonical_span_small>(v.begin(), v.end(), 0.0, std::plus<>{});

Rationale for this shape (informative):
- Keeps the topology selection in the type system (NTTP), which is consistent with “semantic selection”, not a performance hint.
- Supports compile-time specialization for a fixed topology coordinate (L, or derived span M→L) without requiring new execution-policy machinery.
- Makes the coordination presets show up in diagnostics and in code review as names.
Where sizeof(V) is not stable across the intended
reproducibility domain (e.g., user-defined types), the paper already
recommends selecting topology by lane count L (§3.5, §4.3.1). An
indicative spelling for that coordinate is:
namespace std {
template<size_t L,
class InputIterator, class T, class BinaryOperation>
constexpr T canonical_reduce_lanes(InputIterator first, InputIterator last,
T init, BinaryOperation binary_op);
template<size_t L,
class ExecutionPolicy, class ForwardIterator, class T, class BinaryOperation>
T canonical_reduce_lanes(ExecutionPolicy&& policy,
ForwardIterator first, ForwardIterator last,
T init, BinaryOperation binary_op);
}

This keeps the semantic contract ABI-independent: for fixed L, the
canonical tree is identical even if sizeof(V) differs
across targets.
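As a concrete illustration (the spellings are the provisional ones sketched above, not proposed API), a fixed lane count pins the canonical tree even when element sizes differ across ABIs, whereas a byte span would derive different lane counts:

// Illustrative only; the element type and constants are hypothetical.
struct Sample { long double v; };          // size varies across common ABIs (e.g., 8, 12, or 16 bytes)
constexpr size_t span_derived_L = 128 / sizeof(Sample);   // differs across platforms
constexpr size_t fixed_L        = 16;                     // identical everywhere
// canonical_reduce<128>(...)            evaluates a tree shaped by span_derived_L;
// canonical_reduce_lanes<fixed_L>(...)  evaluates the same abstract expression on every platform.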
Whether such a facility is ultimately spelled as a new algorithm (canonical_reduce) or as a std::reduce-adjacent customization is an LEWG design choice; this appendix only shows one straightforward spelling.

[Note: This appendix is illustrative. It is intended to reduce “API haze” during discussion while keeping the semantic core of the paper independent of any particular surface spelling. —end note]
Range overloads are deferred per §2. The following sketches are included only to reduce “API haze” and to indicate one illustrative direction consistent with the semantic requirements of §4.
Key observation: the canonical expression in §4 is
defined over N elements, so a range surface generally needs N prior to
evaluation. A surface can obtain N without allocation (e.g.,
sized_range or a counting pass over a multipass range), or
it can require explicit materialization for single-pass sources.
namespace std::ranges {
// Span coordinate — sequential/reference mode
// Non-sized forward ranges can be supported via a counting pass to determine N.
template<size_t M,
forward_range R, class T, class BinaryOperation>
requires (M >= sizeof(range_value_t<R>)) &&
(M % sizeof(range_value_t<R>) == 0)
constexpr T canonical_reduce(R&& r, T init, BinaryOperation op);
// Span coordinate — execution policy overload (illustrative, conservative constraints)
// A subsequent API revision could choose conservative constraints (e.g., random_access_range)
template<size_t M,
class ExecutionPolicy,
random_access_range R, class T, class BinaryOperation>
requires sized_range<R> &&
is_execution_policy_v<remove_cvref_t<ExecutionPolicy>> &&
(M >= sizeof(range_value_t<R>)) &&
(M % sizeof(range_value_t<R>) == 0)
T canonical_reduce(ExecutionPolicy&& policy, R&& r, T init, BinaryOperation op);
// Lane coordinate — sequential/reference mode
template<size_t L,
forward_range R, class T, class BinaryOperation>
requires (L >= 1)
constexpr T canonical_reduce_lanes(R&& r, T init, BinaryOperation op);
// Lane coordinate — execution policy overload (illustrative, conservative constraints)
template<size_t L,
class ExecutionPolicy,
random_access_range R, class T, class BinaryOperation>
requires sized_range<R> &&
is_execution_policy_v<remove_cvref_t<ExecutionPolicy>> &&
(L >= 1)
T canonical_reduce_lanes(ExecutionPolicy&& policy, R&& r, T init, BinaryOperation op);
}

[Note: The final constraints for a range surface (e.g.,
whether to permit a counting pass for non-sized multipass ranges, and
whether to reject single-pass input_range to avoid implicit
allocation) are design questions for a subsequent API-focused revision
once LEWG has accepted the semantic contract in §4. Appendix L records
the relevant trade-offs. —end note]
This appendix is not part of the proposal. It records the demonstrator Compiler Explorer (“Godbolt”) programs used to validate the semantics described in §4 on multiple architectures, and to show the gross performance impact of enforcing a fixed abstract expression.
The programs referenced here are semantic witnesses,
not reference implementations. They exist to demonstrate that the
canonical expression defined in §4 can be evaluated on real hardware
(SIMD, GPU, multi-threaded). They are not normative examples and are not
intended as implementation guidance for general binary_op.
All demonstrators except GB-SEQ test only
std::plus<double> and may use 0.0 for
absent operand positions; conforming implementations must handle
arbitrary binary_op via the lifted COMBINE
rules of §4.2.2.
The following reference results are used throughout the demonstrators to validate bitwise reproducibility under the specified canonical expression and fixed PRNG seed:
- L = 16 (i.e. M = 128 bytes for double): 0x40618f71f6379380
- L = 128 (i.e. M = 1024 bytes for double): 0x40618f71f6379397

[Note: These values validate conformance to the canonical expression for the specific demonstrator environment. Bitwise agreement across platforms additionally requires aligned floating‑point evaluation models (§2.5, §6). —end note]
The intent is pragmatic: give reviewers a “click-run-inspect” artifact that:
- validates semantic invariants (tree identity and “golden hex” outputs) rather than trying to “win benchmarks”,
- prints a short verification block (seed, N, and the resulting hex value),
- checks run-to-run stability and translational invariance (address-offset invariance),
- and shows a coarse benchmark table comparing against std::accumulate and std::reduce variants.
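For concreteness, a minimal sketch of how a verification block can print a result as a golden hex value, assuming C++20 std::bit_cast (the exact output format used by the demonstrators may differ):

#include <bit>
#include <cstdint>
#include <cstdio>

// Print the bit pattern of a double result so bitwise agreement can be checked by eye.
void print_golden_hex(double result) {
    const auto bits = std::bit_cast<std::uint64_t>(result);
    std::printf("result = 0x%016llx\n", static_cast<unsigned long long>(bits));
}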
Important caveat: Compiler Explorer run times vary substantially with VM load, CPU model, and throttling. These tables are illustrative only; the repository benchmarks (multi-thread and CUDA) are the authoritative numbers.
The following Compiler Explorer (“Godbolt”) demonstrators are intended to be click-run-inspect artifacts for reviewers. Each link contains:
- a determinism/verification block (seed, N, and printed result hex),
- run-to-run stability checks,
- a coarse timing table comparing against std::accumulate and std::reduce variants under comparable FP settings.
| Demonstrator | Platform | Purpose | Arbitrary binary_op? | Notes |
|---|---|---|---|---|
| [GB-SEQ] single-thread reference | Portable | Sequential evaluation of the canonical expression (§4) | Yes (faithful §4 reference) | Debugger-friendly “golden” comparator; single-threaded only. |
| [GB-x86-AVX2] single-file x86 | x86-64 | Canonical reduction on x86 with AVX2 codegen; compares against std::accumulate and std::reduce | No (std::plus<double> only) | Uses safe feature gating; suitable for “Run”. |
| [GB-x86-MT] multi-threaded x86 | x86-64 | Deterministic multi-threaded reduction using shift-reduce stack state + deterministic merge | No (std::plus<double> only) | Demonstrates schedule-independent, thread-count-independent results. |
| [GB-x86-MT-PERF] multi-threaded perf | x86-64 | Production-quality multi-threaded reduction with thread pool and per-lane carry pairwise | No (std::plus<double> only) | Thread pool amortizes creation cost; shows throughput scaling. |
| [GB-NEON] single-file NEON | AArch64 | Canonical reduction on ARM64 using NEON (exact tree preserved) | No (std::plus<double> only) | Prints build-proof macros (__aarch64__, __ARM_NEON, __ARM_NEON_FP). |
| [GB-NEON-PERF] NEON performance | AArch64 | Shift-reduce with 8-block NEON pre-reduction for near-bandwidth throughput | No (std::plus<double> only) | 8-block straight-line NEON hot loop; carry cascade fires 8× less often. |
| [GB-CUDA] (optional) CUDA / CUB comparison | NVCC | Illustrates canonical topology evaluation on GPU and compares against CUB reduction | No (std::plus<double> only) | Heterogeneous “golden result” workflow demonstrator. |
Godbolt links:
[GB-SEQ] = https://godbolt.org/z/8EEhEqrz6
(single-threaded reference; faithful §4 implementation including
COMBINE rules; supports arbitrary
binary_op)
[GB-x86-AVX2] = https://godbolt.org/z/Eaa3vWYqb
(tests std::plus<double> only; uses 0.0
for absent positions; proves SIMD vertical-add matches §4 tree)
[GB-x86-MT] = https://godbolt.org/z/7a11r9o95
(tests std::plus<double> only; uses 0.0
for absent positions; proves thread-count and schedule
invariance)
[GB-x86-MT-PERF] = https://godbolt.org/z/sdxMohT48
(tests std::plus<double> only; thread pool + per-lane
carry pairwise; proves throughput scaling with thread count)
[GB-NEON] = https://godbolt.org/z/Pxzc3YM7q
(Arm v8 / AArch64; tests std::plus<double> only; uses
0.0 for absent positions; proves NEON vector reduction
matches §4 tree)
[GB-NEON-PERF] = https://godbolt.org/z/sY9W78rze
(Arm v8 / AArch64; tests std::plus<double> only;
shift-reduce with 8-block NEON pre-reduction; near-bandwidth
throughput)
[GB-CUDA] = https://godbolt.org/z/5n9EvGoeb
(CUDA/NVCC; tests std::plus<double> only; uses
0.0 for absent positions; proves warp shuffle matches §4
tree; includes L=16 and L=128)
All demonstrators use the same dataset generation (seeded RNG) and report the same canonical results for the two topology coordinates used throughout this paper (L = 16 and L = 128).

Each demonstrator prints a verification block (seed, N, and the resulting hex value), run-to-run stability checks, and a coarse timing table.
These are concrete checks that:
- the canonical expression is independent of execution schedule (single-thread vs vectorized),
- the topology is a function of the chosen topology coordinate (L, or derived span M→L) and input order,
- and the implementation does not accidentally depend on address alignment or allocator behavior.
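A minimal sketch of the address-offset (translational) invariance check. It assumes the provisional canonical_reduce_lanes spelling; data and reference_result stand for the demonstrator’s dataset and precomputed golden value, so this is a fragment rather than a complete program.

// Evaluate the same input copied to several different starting offsets; the canonical
// result must not depend on the buffer's address or alignment.
std::vector<double> buf(data.size() + 8);
const std::uint64_t golden = std::bit_cast<std::uint64_t>(reference_result);
for (std::size_t off = 0; off < 8; ++off) {
    std::copy(data.begin(), data.end(), buf.begin() + off);
    double r = canonical_reduce_lanes<16>(buf.begin() + off,
                                          buf.begin() + off + data.size(),
                                          0.0, std::plus<>{});
    assert(std::bit_cast<std::uint64_t>(r) == golden);   // bitwise match required
}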
For the canonical seed and N = 1,000,000 doubles, the demonstrators are configured to print the following expected result hex values:
- L = 16: 0x40618f71f6379380
- L = 128: 0x40618f71f6379397

The demonstrators treat a mismatch as a test failure.
The PRNG dataset validates implementation correctness but may not demonstrate why canonical reduction matters. Uniform random values of similar magnitude can produce nearly identical results under different tree shapes, making the golden match appear trivially achievable.
To demonstrate the facility’s core value proposition, demonstrators should additionally include a cancellation-heavy dataset where evaluation order visibly affects the result. The pattern from §1.2, scaled to large N:
// Cancellation stress: [+1e16, +1.0, -1e16, +1.0, ...] repeated N/4 times
std::vector<double> data;
for (size_t i = 0; i < N/4; ++i) {
data.push_back(+1e16);
data.push_back(+1.0);
data.push_back(-1e16);
data.push_back(+1.0);
}

For this dataset, the demonstrators should show all three of:

1. std::reduce is non-deterministic: repeated runs (or varying thread counts) produce different hex values, because the scheduler changes the effective parenthesization.
2. canonical_reduce with fixed L is deterministic: repeated runs produce identical hex values, because the tree is fixed.
3. Different L values produce different results: confirming that the topology coordinate controls the abstract expression and is not merely an implementation hint.

This triple is the paper’s entire motivation in one test output. If the PRNG dataset is the “does the implementation work?” test, the cancellation dataset is the “does the facility matter?” test.
Compiler: x86-64 clang (trunk) (or GCC)
Recommended flags (for running):
-O3 -std=c++20 -ffp-contract=off -fno-fast-math
Compiler: AArch64 clang (trunk) (or GCC)
Recommended flags (for running):
-O3 -std=c++20 -march=armv8-a -ffp-contract=off -fno-fast-math
If std::execution::par is unavailable or unreliable in
the CE environment, the demonstrator can be built with:
-DNO_PAR_POLICIES=1
The demonstrators typically report:
- std::accumulate as a deterministic, sequential baseline;
- std::reduce variants (no policy, seq, unseq, and optionally par*) as “existing practice” comparators;
- deterministic/canonical narrow and wide presets.
Readers should interpret the results as:
- Gross cost of structure: overhead (or sometimes speedup) relative to std::reduce(seq) under similar FP settings.
- Configuration sensitivity: narrow vs wide spans can trade off cache behavior, vectorization, and bandwidth.
- Non-authoritative: results on CE do not replace proper benchmarking; they are a convenience for quick inspection.
The accompanying repository (referenced in the paper’s artifacts section) contains:
- controlled micro-benchmarks on pinned CPUs (repeatable measurements),
- multi-threaded execution model comparisons,
- and a CUDA demonstrator evaluating the same canonical expression (for chosen topology) to support heterogeneous “golden result” workflows.
[Note: The paper’s semantic contract is independent of any specific demonstrator; these artifacts exist to make the semantics tangible and reviewable. —end note]
[Note: These demonstrators depend on specific compiler versions and Compiler Explorer VM configurations available at the time of writing. The normative specification is §4; demonstrator links provide supplementary illustration only. If a link becomes unavailable or produces results inconsistent with the published golden values due to toolchain changes, the specification remains unaffected. —end note]
This appendix is not part of the proposal. It records design considerations for eventual C++20/23 ranges overloads that expose the semantics of §4 in a composable way.
A range-based surface would not change the semantic contract: for a
fixed topology selector (L, or span M
interpreted as M→L), input order, and
binary_op, the returned value is as-if evaluating the
canonical expression defined in §4.
This paper intentionally defers the ranges surface to keep first discussion focused on the semantic contract (see §2). Appendix J.6 contains a straw-man sketch.
The canonical expression in §4 is defined over N elements in a fixed input order. A range-based surface therefore needs a way to determine N and to preserve the element order used by the expression.
Three implementation strategies exist:

1. Sized ranges: obtain N directly via ranges::size(r).
2. Multipass ranges: determine N with a counting pass (ranges::distance(r)), then evaluate the canonical expression in a second pass. This avoids allocation but may traverse the range twice.
3. Single-pass ranges: materialize the elements into a buffer first, making the allocation explicit.

One conservative design point (for a subsequent API revision) is to avoid implicit allocation:

- Sequential overloads accept forward_range and permit a counting pass when N is not otherwise known.
- Execution-policy overloads require stronger categories (e.g., random_access_range) and may also require sized_range, keeping buffering explicit.
auto filtered = data
| std::views::filter(pred)
| std::ranges::to<std::vector>();
auto r = std::ranges::canonical_reduce<std::canonical_span_small>(
filtered,
0.0,
std::plus<>{});

This makes allocation explicit and keeps the cost model under user control.
This paper does not attempt to design a projection parameter. Users
can express projection by composing with views::transform,
keeping the algorithm surface orthogonal and consistent with ranges
composition:
auto sum = std::ranges::canonical_reduce<std::canonical_span_small>(
data | std::views::transform([](const auto& x) { return x.value; }),
0.0,
std::plus<>{});

[Note: This appendix is informative. It does not commit the proposal to a particular ranges overload set; it documents constraints and trade-offs to inform a subsequent API-focused revision. —end note]
This appendix is not part of the proposal. It collects the longer use-case narratives (with examples) to keep the main document focused on the semantic contract in §4.
The following use cases illustrate the value of run-to-run stable parallel reduction.
[Note: Code examples use a placeholder spelling. No specific API is proposed in this paper. —end note]
// Test that fails intermittently with std::reduce
double baseline = load_golden_value("gradient_sum.bin"); // Previously computed
double computed = std::reduce(std::execution::par,
gradients.begin(), gradients.end(), 0.0);
EXPECT_DOUBLE_EQ(baseline, computed); // FAILS: result varies by thread scheduling
// With canonical_reduce_lanes: no run-to-run variation
double computed = canonical_reduce_lanes<8>( // L = 8 (or equivalently M = 64 bytes for double)
gradients.begin(), gradients.end(), 0.0);
EXPECT_DOUBLE_EQ(baseline, computed); // PASSES: expression structure is fixed
// (baseline must also have been computed with same L on same platform)

Value: Eliminate spurious test failures caused by run-to-run variation from unspecified reduction order. Within a consistent build/test environment, the returned value is stable across invocations.
// Machine learning gradient aggregation
// Expression structure fixed by the chosen lane count L, enabling reproducible checkpoints
auto gradient_sum = canonical_reduce_lanes<8>(
local_gradients.begin(), local_gradients.end(), 0.0);
// Later, on same or equivalent hardware:
auto restored_sum = canonical_reduce_lanes<8>(
restored_gradients.begin(), restored_gradients.end(), 0.0);
// Reproducible: same inputs + same L + the same floating-point evaluation model → same result
// No longer subject to thread-count or scheduling variation

Value: Enable checkpoint/restore with reproducible gradient aggregation, eliminating run-to-run variation from unspecified reduction order. Cross-platform restore requires matching floating-point semantics.
// Financial risk calculation that must be reproducible for auditors
struct RiskCalculation {
static constexpr size_t L = 8; // Documented in audit trail
double compute_var(const std::vector<double>& scenarios) {
return canonical_reduce_lanes<L>(
scenarios.begin(), scenarios.end(), 0.0,
[](double a, double b) { return a + b * b; });
}
};
// Auditor can reproduce exact result years later with same L

Value: Meet regulatory requirements for reproducible risk calculations (Basel III, Solvency II).
// Climate model energy conservation check
// Paper claims: "Energy drift < 1e-12 per century"
// Reviewers must reproduce this result
auto total_energy = canonical_reduce_lanes<8>(
grid_cells.begin(), grid_cells.end(), 0.0,
[](double sum, const Cell& c) { return sum + c.kinetic + c.potential; });
// Published with: "Results computed with L=8, compiled with -ffp-contract=off"
// Reviewer with the same L and evaluation-model settings gets same result

Value: Enable peer verification of published computational results when the floating-point evaluation model is documented and matched.
// DOE exascale application using Kokkos
// Current Kokkos parallel_reduce is non-deterministic
// Today: Results vary across runs, making debugging nearly impossible
Kokkos::parallel_reduce("EnergySum", N, KOKKOS_LAMBDA(int i, double& sum) {
sum += compute_energy(i);
}, total_energy);
// With standardized canonical reduction semantics, frameworks could expose a canonical mode
// (illustrative — actual Kokkos API would be determined by Kokkos maintainers)
Kokkos::parallel_reduce<Kokkos::CanonicalLanes<8>>("EnergySum", N, KOKKOS_LAMBDA(int i, double& sum) {
sum += compute_energy(i);
}, total_energy);
// Reproducible across runs on same platform
// Cross-platform reproducibility requires matching FP semantics

Value: Enable reproducible physics in production HPC codes, at least within a consistent execution environment.
// Game physics: aim for consistency between client platforms
constexpr size_t PHYSICS_L = 16; // 16 lanes (M = 64 bytes for float)
float compute_collision_impulse(const std::vector<Contact>& contacts) {
return canonical_reduce_lanes<PHYSICS_L>(
contacts.begin(), contacts.end(), 0.0f,
[](float sum, const Contact& c) { return sum + c.impulse; });
}
// Expression structure is fixed.
// Cross-platform match requires controlling floating-point evaluation model
// (e.g., disabling FMA, matching rounding modes)

Value: Fix the expression structure as one component of cross-platform consistency. Full cross-platform bitwise identity additionally requires matching hardware-level FP behaviors (e.g., contraction, intermediate precision, and subnormal handling), which are outside the scope of this proposal.
A practical requirement for reproducibility is the ability to validate and debug operator logic on a single thread while preserving the same abstract expression as production parallel execution. For that reason, if an eventual Standard Library API is adopted, the overload without an ExecutionPolicy should be specified to evaluate the identical canonical expression tree as the execution-policy overloads for the same L and inputs.
This “reference mode” enables:
- Reproducing “golden results” for audit or regression testing on a single thread,
- Debugging binary_op with conventional tools while guaranteeing expression-equivalence to the parallel workload,
- Cross-platform verification (e.g., CPU verification of accelerator-produced results) when evaluation models are aligned.
[Note: This paper does not propose a specific API shape; this subsection records a design requirement that follows from the semantic goal. —end note]
This appendix describes a multi-threaded implementation strategy that achieves parallel execution while preserving bitwise identity with the single-threaded canonical compute sequence. A reference implementation is available at [GB-x86-MT].
The single-threaded shift-reduce algorithm (§4.2) processes blocks sequentially, maintaining a binary-counter stack that implicitly encodes the canonical pairwise tree. To parallelize this while preserving determinism, we observe that stack states computed over power-of-two-aligned sub-ranges can be merged deterministically, reproducing exactly the carries that sequential processing would have performed (see the alignment property and merge algorithm below).

This leads to a three-phase algorithm: parallel local reduction, deterministic merge, and final fold.
An ordered reduction state summarizes a contiguous range as an ordered sequence of fully reduced power-of-two blocks.
For a contiguous range R, the state may be viewed as a
sequence:
[B0, B1, ..., Bm]
where each Bi is the canonical reduction of a contiguous
subrange of R, the subranges are disjoint and appear in
strictly increasing stream order, each block size is 2^k,
and the block sizes are strictly decreasing. The concatenation of these
subranges equals R.
The stack/bucket representation below is one concrete way to store this ordered block sequence and to implement the canonical “coalesce equal-sized adjacent blocks” rule used by shift-reduce.
A stack state compactly represents a partially reduced sequence as a collection of buckets:
template<size_t L>
struct StackState {
static constexpr size_t MAX_DEPTH = 32;
double buckets[MAX_DEPTH][L]; // buckets[k] holds 2^k blocks worth of reduction
size_t counts[MAX_DEPTH]; // lane counts (L for full, <L for partial)
uint32_t mask; // bit k set iff buckets[k] is occupied
};

Invariant: If mask has bits set at
positions {k₀, k₁, …, kₘ} where k₀ < k₁ < … < kₘ, then the
state represents a prefix of length:
num_blocks = 2^k₀ + 2^k₁ + ... + 2^kₘ
Each buckets[kᵢ] holds the fully reduced L-lane vector
of a contiguous block of 2^kᵢ input blocks. Lower-indexed
buckets represent more recent (rightward) blocks; higher-indexed buckets
represent older (leftward) blocks.
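For concreteness, a worked instance of the invariant (illustrative values):

// Worked example: after pushing 13 = 0b1101 blocks,
//   mask       == 0b1101                                 (levels 0, 2, 3 occupied)
//   buckets[3] :  canonical reduction of blocks [0, 8)   (oldest, leftmost)
//   buckets[2] :  canonical reduction of blocks [8, 12)
//   buckets[0] :  canonical reduction of block  [12, 13) (newest, rightmost)
//   num_blocks == 2^3 + 2^2 + 2^0 == 13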
The push operation incorporates a new L-lane vector into the stack state, implementing binary-counter carry propagation:
template<size_t L>
void push(StackState<L>& S, const double* vec, size_t count, size_t level = 0) {
double current[L];
copy(vec, current, count);
size_t current_count = count;
while (S.mask & (1u << level)) {
// Bucket exists: combine (older on left)
current = combine(S.buckets[level], S.counts[level],
current, current_count);
S.mask &= ~(1u << level); // Clear bucket
++level; // Carry to next level
}
S.buckets[level] = current;
S.counts[level] = current_count;
S.mask |= (1u << level);
}

Key property: Processing block i triggers carries corresponding to the trailing 1-bits in the binary representation of i. This exactly mirrors the canonical pairwise tree structure.
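For example (illustrative), pushing block index 7 when seven blocks have already been processed:

// Before the push: mask == 0b0111 (7 == 0b111 blocks processed; levels 0, 1, 2 occupied).
// Pushing block 7 carries through levels 0, 1, and 2 (the trailing 1-bits of 7)
// and deposits a single bucket at level 3:
//   after the push: mask == 0b1000, buckets[3] == canonical reduction of blocks [0, 8).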
After all blocks are pushed, the stack is collapsed to a single vector:
template<size_t L>
double* fold(const StackState<L>& S) {
double acc[L];
size_t acc_count = 0;
bool have = false;
// CRITICAL: Iterate low-to-high, bucket on LEFT
for (size_t k = 0; k < MAX_DEPTH; ++k) {
if (!(S.mask & (1u << k))) continue;
if (!have) {
acc = S.buckets[k];
acc_count = S.counts[k];
have = true;
} else {
// Higher bucket (older) goes on LEFT
acc = combine(S.buckets[k], S.counts[k], acc, acc_count);
}
}
return acc;
}

Critical detail: The fold must iterate from low to high indices, placing each bucket on the left of the accumulator. This reconstructs the canonical tree’s final merges where older (leftward) partial results combine with newer (rightward) ones.
For multi-threaded execution, we partition the block index space among threads. The critical requirement is that partition boundaries fall on power-of-2 aligned indices.
Definition: A block index b is k-aligned if b is a multiple of 2^k.
Observation: After processing blocks [0, b) where b = m × 2^k, the stack state has no occupied buckets below level k.
The binary representation of b has zeros in positions 0 through k-1. Since bucket[j] is occupied iff bit j is set in the count of processed blocks, buckets 0 through k-1 are empty.
Consequence: When thread boundaries are power-of-2 aligned, the “receiving” state A has no low-level buckets that could collide incorrectly with the “incoming” state B’s buckets during merge.
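For example (illustrative), take a boundary at block index 768, as in the chunking example below:

// b = 768 = 3 * 2^8 = 0b1100000000: bits 0..7 are zero, so after processing blocks [0, 768)
// only levels 9 and 8 can be occupied. An incoming state for blocks [768, 1000), whose
// buckets all sit at levels <= 7, therefore merges without colliding below the boundary level.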
Given B total blocks and T threads, choose chunk size C = 2^k where k is the largest integer such that B / 2^k ≥ T:
size_t choose_chunk_size(size_t num_blocks, size_t T) {
if (num_blocks <= T) return 1;
size_t k = bit_width(num_blocks / T) - 1;
return size_t{1} << k;
}

This yields ceil(B / C) chunks, each containing C blocks (except possibly the last). Thread t processes the chunks assigned to it, producing a local stack state for each.
Example: For B = 1000 blocks and T = 4 threads, C = 128 (2^7), giving 8 chunks (the last chunk holds the 104-block remainder):

| Thread | Chunks | Blocks | Output |
|---|---|---|---|
| 0 | 0, 1 | [0, 256) | Two states, each a single bucket at level 7 |
| 1 | 2, 3 | [256, 512) | Two states, each a single bucket at level 7 |
| 2 | 4, 5 | [512, 768) | Two states, each a single bucket at level 7 |
| 3 | 6, 7 | [768, 1000) | One full-chunk state (level 7) plus a remainder state with buckets at levels {6, 5, 3} |

Full chunks produce exactly one bucket at level k (since 2^k blocks reduce to one bucket). The remainder chunk (104 = 64 + 32 + 8) naturally decomposes via shift-reduce into multiple buckets.
To combine two stack states where A represents an earlier (leftward) portion of the sequence and B represents a later (rightward) portion:
template<size_t L>
void merge_into(StackState<L>& A, const StackState<L>& B) {
// Process B's buckets in increasing level order
for (size_t k = 0; k < MAX_DEPTH; ++k) {
if (B.mask & (1u << k)) {
// Push B's bucket into A, starting at level k (NOT level 0!)
push(A, B.buckets[k], B.counts[k], k);
}
}
}

Critical detail:
push(A, B.buckets[k], B.counts[k], k) starts at level k,
not level 0. This correctly reflects that B.buckets[k] represents 2^k
already-reduced blocks.
Why low-to-high order: Within B’s stack state, lower-indexed buckets represent more recent (rightward) elements within B’s range. Processing them first ensures they combine before higher-indexed (older) buckets, matching the canonical left-to-right order.
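A small worked example (illustrative): merging the states of two adjacent full chunks of 128 blocks each.

// A covers blocks [0, 128):   mask == 0b10000000 (one bucket at level 7)
// B covers blocks [128, 256): mask == 0b10000000 (one bucket at level 7)
// merge_into(A, B) pushes B.buckets[7] into A starting at level 7; A's level-7 bucket is
// occupied, so they combine (A's older bucket on the left) and carry to level 8:
//   result: mask == 0b100000000, buckets[8] == canonical reduction of blocks [0, 256)
// This is exactly the state sequential processing of blocks [0, 256) would have produced.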
Claim: The multi-threaded algorithm produces a result bitwise identical to single-threaded shift-reduce for any number of threads T ≥ 1 and any input size N.
Argument:
Shift-reduce correctness: Single-threaded shift-reduce evaluates the canonical pairwise-with-carry tree. The push operation’s carry pattern exactly mirrors binary increment, corresponding to completing subtrees of size 2^k.
Alignment property: For a prefix of length b = m × 2^k blocks, the stack state has no occupied buckets below level k (see observation above).
Merge correctness: merge_into(A, B)
produces the same state as processing A’s blocks followed by B’s blocks
sequentially. By the alignment property, when B starts at an aligned
boundary, A has no buckets below the alignment level. B’s buckets, when
pushed at their native levels, trigger exactly the carries that would
occur from processing B’s blocks after A’s blocks. The low-to-high
iteration order preserves within-B ordering.
Combining these: Thread boundaries are aligned, so merge correctness applies to each merge. The left-to-right merge order (thread 0, then 1, then 2, …) matches sequential block order.
template<size_t L>
double deterministic_reduce_MT(const double* input, size_t N, size_t T) {
const size_t num_blocks = (N + L - 1) / L;
const size_t C = choose_chunk_size(num_blocks, T);
const size_t num_chunks = (num_blocks + C - 1) / C;
vector<StackState<L>> states(num_chunks);
// Phase 1: Parallel local reductions
parallel_for(0, num_chunks, [&](size_t chunk) {
size_t b0 = chunk * C;
size_t b1 = min(b0 + C, num_blocks);
states[chunk] = replay_range(input, N, b0, b1);
});
// Phase 2: Serial merge (left-to-right order)
for (size_t i = 1; i < num_chunks; ++i) {
merge_into(states[0], states[i]);
}
// Phase 3: Final fold + cross-lane reduction
double* lane_result = fold(states[0]);
return cross_lane_pairwise_reduce(lane_result);
}

The semantics do not require parallel scalability; this section only demonstrates that the canonical expression admits scalable multi-threaded realizations.
| Phase | Work | Span (critical path) |
|---|---|---|
| Local reduction | O(N) total | O(N/T) |
| Ordered merge (sequential) | O(T · log N) | O(T · log N) |
| Ordered merge (tree-structured) | O(T · log N) | O(log T · log N) |
| Final fold | O(log N) | O(log N) |
Total work: O(N + T·log N) (typically
O(N) when T << N)
Span: With a tree-structured merge,
O(N/T + log T · log N).
Space: O(T · L · log N) for per-thread
states (typically a few KB per thread, depending on L).
A significant optimization reduces stack operation overhead by processing 8 blocks at once:
// Instead of pushing one block at a time:
for (size_t b = 0; b < num_blocks; ++b) {
push(S, load_block(b), L, 0); // O(num_blocks) stack operations
}
// Process 8 blocks in one SIMD reduction, push at level 3:
for (size_t g = 0; g < num_groups_of_8; ++g) {
double result[L];
reduce_8_blocks_simd(input + g * 8 * L, result); // 8 blocks → 1 vector
push(S, result, L, 3); // Start at level 3 (since 8 = 2^3)
}

This reduces stack operations from N/L to N/(8L), yielding
substantial performance improvements. The reference implementation
achieves throughput exceeding std::reduce while maintaining
full determinism.
The reference implementation at [GB-x86-MT] demonstrates:
| Variant | Throughput | vs std::accumulate |
|---|---|---|
| std::accumulate | 5.4 GB/s | baseline |
| std::reduce | 21.4 GB/s | +297% |
| Deterministic ST (L=16, 8-block unroll) | 26.5 GB/s | +391% |
| Deterministic MT (L=16, T=2) | 21.2 GB/s | +293% |
Key finding: With proper SIMD optimization,
deterministic reduction is faster than
non-deterministic std::reduce while guaranteeing bitwise
reproducibility. The 8-block unrolling minimizes stack overhead, and the
interleaved lane structure enables efficient vectorized combines.
Thread pool reuse: For production implementations, reuse a thread pool rather than spawning threads per invocation. The reference limits thread creation for Godbolt compatibility.
Parallel merge (optional): The merge phase can itself be parallelized using a tree structure:
Initial: +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
|S₀ | |S₁ | |S₂ | |S₃ | |S₄ | |S₅ | |S₆ | |S₇ |
+-+-+ +-+-+ +-+-+ +-+-+ +-+-+ +-+-+ +-+-+ +-+-+
| | | | | | | |
+-----+ +-----+ +-----+ +-----+
| | | |
Stride 1: +---+ +---+ +---+ +---+
|S₀₁| |S₂₃| |S₄₅| |S₆₇| (parallel)
+-+-+ +-+-+ +-+-+ +-+-+
| | | |
+-----------+ +-----------+
| |
Stride 2: +-------+ +-------+
|S₀₁₂₃ | |S₄₅₆₇ | (parallel)
+---+---+ +---+---+
| |
+-----------------------+
|
Stride 4: +-----------+
|S₀₁₂₃₄₅₆₇ | (final)
+-----------+
This reduces the merge critical path from O(T) to O(log T), though for typical T values the improvement is marginal compared to the dominant O(N/T) local reduction phase.
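A sketch of the tree-structured merge phase consistent with the diagram above (illustrative; parallel_for, states, num_chunks, and merge_into are the same pseudocode names used earlier in this appendix):

// Pairwise, stride-doubling merge: in each round, disjoint (left, right) pairs merge in
// parallel; the left (earlier-in-stream) state always receives the right (later) state,
// so stream order is preserved and the final result accumulates into states[0].
for (size_t stride = 1; stride < num_chunks; stride *= 2) {
    parallel_for(0, num_chunks, [&](size_t i) {
        if (i % (2 * stride) == 0 && i + stride < num_chunks)
            merge_into(states[i], states[i + stride]);
    });
}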
Memory efficiency: Stack states are fixed-size (O(L × log N) per state) with no heap allocation required during the hot path. The merge operation modifies A in-place, requiring only register-level temporaries for the carry chain.
The multi-threaded ordered stack-state merge algorithm achieves:
| Property | Guarantee |
|---|---|
| Determinism | Bitwise identical to single-threaded for any T |
| Correctness | Provably equivalent to canonical pairwise tree |
| Efficiency | O(N) work, O(N/T + T log N) span |
| Practicality | Demonstrated faster than std::reduce with SIMD |
The key insights enabling this approach are the binary-counter stack state (which implicitly encodes the canonical pairwise tree), power-of-two-aligned partition boundaries (which leave no low-level buckets to collide during merge), and the ordered merge that pushes incoming buckets at their native levels in low-to-high order.
This demonstrates that deterministic parallel reduction is not merely feasible but can be optimal — the constraint of a fixed evaluation order, far from being a performance liability, enables aggressive SIMD optimization that outperforms unconstrained implementations.
This appendix records a previously-considered alternative canonical tree construction based on recursive bisection. It is not part of the normative contract in §4; this paper specifies iterative pairwise (§4.2.3) as the sole canonical tree definition.
Define CANONICAL_TREE_EVAL_RECURSIVE(op, Y[0..k)), where
k >= 1 and each Y[t] is in
maybe<A>:
CANONICAL_TREE_EVAL_RECURSIVE(op, Y[0..k)):
if k == 1:
return Y[0]
let m = floor(k / 2)
return COMBINE(op,
CANONICAL_TREE_EVAL_RECURSIVE(op, Y[0..m)),
CANONICAL_TREE_EVAL_RECURSIVE(op, Y[m..k))
)
For k = 2^n (k = 2, 4, 8, …), recursive bisection and
iterative pairwise produce identical trees, and therefore define
identical abstract expressions.
For non-power-of-two k, the trees differ in structure
(but maintain the same asymptotic depth bounds).
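A small self-contained sketch (illustrative, not part of any proposed interface) that prints the recursive-bisection parenthesization and can be used to check the k = 5 example below:

#include <iostream>
#include <string>

// Renders the expression produced by CANONICAL_TREE_EVAL_RECURSIVE over leaves e[lo..hi).
std::string bisect(int lo, int hi) {
    if (hi - lo == 1) return "e" + std::to_string(lo);
    const int m = lo + (hi - lo) / 2;              // m = floor(k / 2) within this subrange
    return "(" + bisect(lo, m) + " + " + bisect(m, hi) + ")";
}

int main() {
    std::cout << bisect(0, 5) << '\n';   // prints ((e0 + e1) + (e2 + (e3 + e4)))
}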
Example (k = 5), writing + for op:

- Recursive bisection: (e0 + e1) + (e2 + (e3 + e4))
- Iterative pairwise (§4.2.3): ((e0 + e1) + (e2 + e3)) + e4
This paper selects iterative pairwise as the sole canonical tree definition (see the tree-shape rationale in §3.8). No other grouping is permitted.