Distributed databases must provide transactional guarantees across multiple nodes, each managing its own local storage and concurrency control.
A global transaction spans two or more of these nodes, and the system needs a mechanism to ensure atomicity: either all nodes commit the transaction's effects, or none of them do.
This is the domain of global transaction management, where a coordinator process orchestrates the participants through a commit protocol, handling failures and network partitions along the way.
The Problem: Distributed Atomicity
In a single-node database, the transaction manager can make an atomic commit decision by writing a single log record.
The commit or abort is a local event.
Once the decision is durable in the WAL, recovery can re-apply or undo the transaction's effects.
When a transaction touches data on multiple nodes, no single log record suffices.
Node A might be ready to commit while Node B has detected a constraint violation.
If A commits and B aborts, the database is in an inconsistent state.
The fundamental challenge is reaching agreement on the outcome despite the possibility that any participant or the coordinator itself may crash at any point during the protocol.
This is a consensus problem, but with a specific structure.
The participants are not choosing an arbitrary value.
They are voting on whether to commit or abort, with the constraint that even a single "abort" vote must cause a global abort.
Architecture: Coordinators and Participants
Global transaction management follows a coordinator-participant architecture.
When a client initiates a transaction that requires work on multiple nodes, one node (or a dedicated service) assumes the role of transaction coordinator (TC).
The remaining nodes involved in the transaction act as participants (sometimes called resource managers or cohorts).
The coordinator's responsibilities include:
- Assigning a globally unique transaction ID.
- Tracking which participants are involved in the transaction.
- Initiating the commit protocol when the client requests a commit.
- Making the final commit/abort decision and communicating it to all participants.
- Persisting the decision durably so that recovery after a crash can complete the protocol.
Each participant maintains its own local transaction state, acquires locks or manages MVCC versions locally, and writes local redo/undo log records.
The participant must be able to answer the coordinator's "can you commit?" question honestly, based on whether it can guarantee durability, and correctness for its portion of the work.
Two-Phase Commit (2PC)
The canonical protocol for distributed atomic commitment is two-phase commit, originally described by Jim Gray in the late 1970s.
It proceeds in two rounds of messages.
Phase 1: Prepare (Voting)
The coordinator sends a PREPARE message to each participant.
Each participant determines whether it can commit its local portion of the transaction.
If it can, it force-writes a prepare record to its local WAL, and responds with VOTE-COMMIT.
If it cannot (due to a conflict, constraint violation, or local failure), it responds with VOTE-ABORT and can unilaterally abort its local work.
The prepare record is critical.
By writing it before voting yes, the participant promises that it can commit the transaction even if it crashes and recovers.
The local redo information is durable and the participant will not unilaterally abort after voting yes.
Phase 2: Commit (Decision)
The coordinator collects all votes.
If every participant voted VOTE-COMMIT, the coordinator force-writes a commit record to its own log and sends GLOBAL-COMMIT to all participants.
If any participant voted VOTE-ABORT (or failed to respond within a timeout), the coordinator writes an abort record and sends GLOBAL-ABORT.
Upon receiving the global decision, each participant writes a commit or abort record to its local log, completes the transaction (releasing locks, making changes visible or rolling them back), and sends an acknowledgment to the coordinator.
Once the coordinator has received all acknowledgments, it can write an end record and garbage-collect the transaction's state.
Walkthrough
The following pseudocode shows the coordinator's logic:
function coordinate_commit(tx_id, participants):
// Phase 1: Prepare
for each p in participants:
send(p, PREPARE, tx_id)
votes = {}
for each p in participants:
votes[p] = receive_vote(p, timeout=T)
// Phase 2: Decision
if all votes are VOTE-COMMIT:
force_write_log(COMMIT, tx_id) // durable decision point
decision = GLOBAL-COMMIT
else:
force_write_log(ABORT, tx_id)
decision = GLOBAL-ABORT
for each p in participants:
send(p, decision, tx_id)
// Completion
acks = collect_acks(participants)
force_write_log(END, tx_id)
And the participant logic:
function on_prepare(tx_id):
if can_commit_locally(tx_id):
force_write_log(PREPARED, tx_id) // promise to commit if told to
send(coordinator, VOTE-COMMIT, tx_id)
enter_uncertain_state(tx_id) // blocked until decision arrives
else:
force_write_log(ABORT, tx_id)
send(coordinator, VOTE-ABORT, tx_id)
rollback_local(tx_id)
function on_decision(tx_id, decision):
if decision == GLOBAL-COMMIT:
force_write_log(COMMIT, tx_id)
apply_local(tx_id)
else:
force_write_log(ABORT, tx_id)
rollback_local(tx_id)
send(coordinator, ACK, tx_id)
The Blocking Problem
The most significant weakness of 2PC is that it is a blocking protocol.
If the coordinator crashes after sending PREPARE but before broadcasting the decision, participants that voted VOTE-COMMIT are stuck.
They have promised not to abort unilaterally; but they do not know the global decision.
They must hold their locks and wait for the coordinator to recover.
No participant can safely proceed because another participant might have voted abort, or the coordinator might have decided to commit.
During this window, the resources locked by the uncertain transaction are unavailable to other transactions, potentially causing cascading delays across the system.
Three-Phase Commit (3PC)
Skeen's three-phase commit protocol was designed to eliminate the blocking problem under certain failure models.
It introduces an intermediate pre-commit phase between voting and the final commit.
The idea is that no participant transitions directly from an uncertain state to committed.
Instead, a PRE-COMMIT message tells all participants that every vote was yes, so if the coordinator subsequently crashes, any participant that received the pre-commit can safely decide to commit by coordinating with the other participants.
3PC is non-blocking under the assumption that network partitions do not occur (or equivalently, that failed nodes can be reliably detected).
In practice, network partitions are common in distributed systems and 3PC does not provide correct behavior under partition scenarios.
For this reason, 3PC is rarely used in production systems.
Modern systems tend to address the blocking problem through coordinator replication rather than protocol changes.
Coordinator Fault Tolerance
Production systems mitigate 2PC's blocking vulnerability by making the coordinator highly available.
The standard approach is to replicate the coordinator's state using a consensus protocol such as Paxos or Raft.
If the primary coordinator fails, a replica that holds the transaction's state can take over and drive the protocol to completion.
Google's Spanner, for example, uses Paxos groups to replicate both participant state and coordinator state.
Each participant's prepare record is replicated within its Paxos group, and the coordinator's commit decision is replicated within the coordinator's Paxos group.
This transforms the blocking window from "until the coordinator's machine recovers" to "until a Paxos leader election completes," which is typically a few seconds at most.
This architecture can be understood as layering 2PC on top of consensus.
Each "participant" in 2PC is not a single node but a replicated state machine.
The coordinator is itself a replicated state machine.
The result is a system that provides atomic commitment with high availability, at the cost of additional message rounds for intra-group consensus.
Concurrency Control Interactions
The global transaction coordinator interacts with each node's local concurrency control mechanism.
Two common patterns emerge:
Strict 2PL with 2PC. Each participant acquires locks locally during transaction execution and holds them through the prepare phase.
Locks are released only after the global commit or abort decision arrives.
This means that the "prepare to commit" window extends the lock-holding duration, increasing contention.
MVCC with 2PC. Systems using multi-version concurrency control can reduce the impact of the prepare window.
Readers are not blocked by prepared-but-not-yet-committed transactions because they read from older snapshots.
However, write-write conflicts still require coordination, and the commit timestamp must be chosen carefully to maintain snapshot consistency across nodes.
Some systems (Spanner being a prominent example) use synchronized clocks (TrueTime) to assign globally consistent commit timestamps, avoiding the need for additional cross-node communication to determine timestamp ordering.
Presumed Abort and Presumed Commit
Mohan, Lindsay, and Obermarck's work on optimizing 2PC introduced two important variants: presumed abort and presumed commit.
These optimizations reduce the number of forced log writes and messages in common cases.
In presumed abort, if the coordinator has no record of a transaction during recovery, it presumes the transaction aborted.
This means the coordinator does not need to force-write an abort record, and participants that voted abort do not need to send acknowledgments.
Since aborts are common (due to deadlocks, conflicts, and application logic), this optimization is valuable.
Presumed commit works symmetrically, optimizing for the case where most transactions commit.
The choice between these variants depends on the workload's commit/abort ratio.
Key Points
- Global transaction atomicity requires a commit protocol because no single node can unilaterally decide the outcome for a multi-node transaction.
- Two-phase commit ensures atomicity by collecting durable "prepare" promises from all participants before the coordinator records a final commit or abort decision.
- 2PC is a blocking protocol: coordinator failure after prepare leaves participants holding locks indefinitely until the coordinator recovers.
- Three-phase commit eliminates blocking under a crash-stop failure model but does not handle network partitions, limiting its practical utility.
- Modern systems address 2PC's blocking problem by replicating the coordinator's state with Paxos or Raft, reducing the vulnerability window to the duration of a leader election.
- The prepare phase extends lock-holding duration in strict 2PL systems, making the interaction between global commit protocols and local concurrency control a key performance concern.
- Presumed abort and presumed commit optimizations reduce forced log writes and acknowledgment messages by exploiting assumptions about the dominant transaction outcome.
References
Gray, J. "Notes on Data Base Operating Systems." Operating Systems: An Advanced Course, Lecture Notes in Computer Science, Vol. 60, Springer-Verlag, 1978.
Mohan, C., Lindsay, B., and Obermarck, R. "Transaction Management in the R* Distributed Database Management System." ACM Transactions on Database Systems, Vol. 11, No. 4, 1986.
Skeen, D. "Nonblocking Commit Protocols." Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data, 1981.
Corbett, J. C., et al. "Spanner: Google's Globally-Distributed Database." Proceedings of OSDI, 2012.
Bernstein, P. A., Hadzilacos, V., and Goodman, N. "Concurrency Control and Recovery in Database Systems." Addison-Wesley, 1987.