Append commit instead of individual transactions to commitlog #4140
base: master
Conversation
This moves the following responsibilities to the datastore:
- maintenance of the transaction offset
- deciding how many transactions are in a commit

It also allows the `Committed` return to be restored.
Nominating @gefjon and @Centril because they appeared in the reviewer suggestions.
What is the typical length of the slice?
Well, 1 :D
Yeah, I think this is the right call until we need something else.
Shubham8287 left a comment:
This looks good, and it simplifies the commitlog's write API a lot.
I wonder if we should test replication with this branch, to surface any bugs that may exist for the n != 1 case.
We will benefit from `fn append_tx(&self, Transaction<Self::TxData>)`, which requires the offset to be supplied, but doesn't allocate a
We don't really need batched transactions at the moment, so avoid the boxed array allocation. `Durability::append_tx` takes a `Transaction`, though, requiring the offset to be supplied by the datastore.
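As a rough illustration of the API shape discussed above (all names here are hypothetical stand-ins, not the actual SpacetimeDB types): the durability layer accepts one `Transaction` at a time, carrying the offset assigned by the datastore, instead of a boxed slice of transactions.

```rust
// Hypothetical sketch of an offset-carrying append API; the real
// SpacetimeDB trait differs in its details.

/// A transaction paired with the offset assigned by the datastore.
struct Transaction<T> {
    offset: u64,
    txdata: T,
}

/// Toy durability layer that just records appended offsets.
struct ToyDurability {
    offsets: Vec<u64>,
}

impl ToyDurability {
    /// The caller supplies the offset; no boxed slice is allocated,
    /// because exactly one transaction is appended per call.
    fn append_tx<T>(&mut self, tx: Transaction<T>) {
        self.offsets.push(tx.offset);
    }
}

fn main() {
    let mut dur = ToyDurability { offsets: Vec::new() };
    dur.append_tx(Transaction { offset: 0, txdata: "tx-a" });
    dur.append_tx(Transaction { offset: 1, txdata: "tx-b" });
    assert_eq!(dur.offsets, vec![0, 1]);
}
```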
Note that Rust does not, by default, run destructors when the program is terminated by a signal (any signal). This, and the default being unconfirmed reads, is why the commitlog before this patch would flush after every write. I added a config option to preserve this behavior. The question is whether we should make it the default for standalone. (We should probably also make use of the
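The kind of config knob described above might look roughly like this (the struct and field names are invented for illustration; the actual option in the patch may differ):

```rust
// Hypothetical sketch of a flush-behavior option (names invented).

/// Commitlog write options.
struct CommitlogOptions {
    /// When true, flush to the OS after every commit, preserving the
    /// pre-patch behavior. Safer under abrupt termination (e.g. signals,
    /// where Rust runs no destructors), at some throughput cost.
    flush_every_commit: bool,
}

impl Default for CommitlogOptions {
    fn default() -> Self {
        // Whether this should default to `true` for standalone is the
        // open question raised above.
        Self { flush_every_commit: false }
    }
}

fn main() {
    let opts = CommitlogOptions::default();
    assert!(!opts.flush_every_commit);
}
```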
Old API:
```
baseline/n=1000 tx/commit=1 fsync=32
        time:   [65.266 ms 66.570 ms 68.001 ms]
        thrpt:  [14.706 Kelem/s 15.022 Kelem/s 15.322 Kelem/s]
change:
        time:   [+0.2793% +3.4722% +7.2653%] (p = 0.07 > 0.05)
        thrpt:  [-6.7732% -3.3556% -0.2785%]
        No change in performance detected.

large payload/n=1000 tx/commit=1 fsync=32
        time:   [79.236 ms 82.656 ms 86.054 ms]
        thrpt:  [11.621 Kelem/s 12.098 Kelem/s 12.621 Kelem/s]
change:
        time:   [-9.3387% -3.0356% +3.4453%] (p = 0.39 > 0.05)
        thrpt:  [-3.3305% +3.1306% +10.301%]
        No change in performance detected.

mixed payloads/n=1000 tx/commit=1 fsync=32
        time:   [58.803 ms 59.423 ms 60.213 ms]
        thrpt:  [16.608 Kelem/s 16.829 Kelem/s 17.006 Kelem/s]
change:
        time:   [-10.430% -6.5854% -3.3764%] (p = 0.00 < 0.05)
        thrpt:  [+3.4944% +7.0497% +11.645%]
        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

mixed payloads with batching/n=1000 tx/commit=16 fsync=32
        time:   [52.255 ms 53.006 ms 53.870 ms]
        thrpt:  [18.563 Kelem/s 18.866 Kelem/s 19.137 Kelem/s]
change:
        time:   [-1.9957% +0.2520% +2.6380%] (p = 0.83 > 0.05)
        thrpt:  [-2.5702% -0.2514% +2.0364%]
        No change in performance detected.
```
New API:
```
baseline/n=1000 tx/commit=1 fsync=32
        time:   [47.657 ms 47.991 ms 48.413 ms]
        thrpt:  [20.655 Kelem/s 20.837 Kelem/s 20.983 Kelem/s]
change:
        time:   [-1.3528% +0.0234% +1.4079%] (p = 0.97 > 0.05)
        thrpt:  [-1.3884% -0.0234% +1.3714%]
        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

large payload/n=1000 tx/commit=1 fsync=32
        time:   [90.199 ms 92.370 ms 94.611 ms]
        thrpt:  [10.570 Kelem/s 10.826 Kelem/s 11.087 Kelem/s]
change:
        time:   [-2.1623% +0.6369% +3.5718%] (p = 0.69 > 0.05)
        thrpt:  [-3.4487% -0.6329% +2.2101%]
        No change in performance detected.

mixed payloads/n=1000 tx/commit=1 fsync=32
        time:   [51.011 ms 51.733 ms 52.611 ms]
        thrpt:  [19.008 Kelem/s 19.330 Kelem/s 19.604 Kelem/s]
change:
        time:   [-2.7205% -1.0955% +0.7716%] (p = 0.26 > 0.05)
        thrpt:  [-0.7656% +1.1076% +2.7966%]
        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

mixed payloads with batching/n=1000 tx/commit=16 fsync=32
        time:   [85.693 ms 86.303 ms 86.920 ms]
        thrpt:  [11.505 Kelem/s 11.587 Kelem/s 11.670 Kelem/s]
change:
        time:   [-2.1417% -1.1554% -0.2178%] (p = 0.05 < 0.05)
        thrpt:  [+0.2183% +1.1689% +2.1886%]
        Change within noise threshold.
```
Changes the commitlog (and durability) write API such that the caller decides how many transactions are in a single commit, and has to supply the transaction offsets.
This simplifies the commitlog-side buffering logic to essentially a `BufWriter` (which, of course, we must not forget to flush). This will help throughput, but offers less opportunity to retry failed writes. That is probably a good thing: disks can fail in erratic ways, and we should rather crash and re-verify the commitlog (suffix) than continue writing. To that end, this patch liberally panics when there is a chance that internal state could be "poisoned" by partial writes, which may be debatable.
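The buffering model can be illustrated with a plain `std::io::BufWriter` (a toy stand-in, not the actual commitlog code): writes accumulate in a userspace buffer and only reach the underlying writer on flush, so a missed flush loses buffered commits on abrupt termination.

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Buffer commits in memory; nothing reaches the underlying writer
    // until the buffer fills or we flush explicitly.
    let mut log = BufWriter::new(Vec::new());
    log.write_all(b"commit-0;")?;
    log.write_all(b"commit-1;")?;
    log.flush()?; // forgetting this would leave commits in the buffer

    let bytes = log.into_inner().expect("no pending flush error");
    assert_eq!(bytes, b"commit-0;commit-1;");
    Ok(())
}
```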
Motivation
The main motivation is to avoid maintaining the transaction offset in two places in such a way that they could diverge. As ordering commits is the responsibility of the datastore, we make it authoritative on this matter -- the commitlog will still check that offsets are contiguous, and refuse to commit if that's not the case.
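The contiguity check might look roughly like this (a minimal sketch with invented names; the real implementation differs):

```rust
/// Errors the commitlog can raise on append (names illustrative).
#[derive(Debug, PartialEq)]
enum AppendError {
    /// The supplied offset does not follow the last committed offset.
    OutOfOrder { expected: u64, got: u64 },
}

struct Commitlog {
    /// Offset the next transaction must carry.
    next_offset: u64,
}

impl Commitlog {
    /// The datastore is authoritative for offsets; the commitlog only
    /// verifies that they are contiguous and refuses to commit otherwise.
    fn check_offset(&mut self, offset: u64) -> Result<(), AppendError> {
        if offset != self.next_offset {
            return Err(AppendError::OutOfOrder {
                expected: self.next_offset,
                got: offset,
            });
        }
        self.next_offset += 1;
        Ok(())
    }
}

fn main() {
    let mut log = Commitlog { next_offset: 0 };
    assert!(log.check_offset(0).is_ok());
    assert!(log.check_offset(1).is_ok());
    // A gap is rejected rather than silently accepted.
    assert_eq!(
        log.check_offset(3),
        Err(AppendError::OutOfOrder { expected: 2, got: 3 })
    );
}
```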
A secondary, related motivation is the following:
A "commit" is an atomic unit of storage, meaning that a torn (partial) write of a commit renders the entire commit corrupt. There hasn't been a compelling case where we would want multi-transaction commits, and we have always configured the server to write exactly one transaction per commit.
The code to handle buffering of transactions is, however, rather complex, as it tries hard to allow the caller to retry writes at commit boundaries. An unfortunate consequence of this is that we'd flush to the OS very often, leaving throughput performance on the table.
So, if there is a compelling case for batching multiple transactions in a commit, it should be the datastore's responsibility.
API and ABI breaking changes
Breaks internal APIs only.
Expected complexity level and risk
5 - Mostly for the risk
Testing
Existing tests.