Append commit instead of individual transactions to commitlog #4140
base: master
Conversation
This moves the following responsibilities to the datastore:
- maintenance of the transaction offset
- deciding how many transactions are in a commit

It also allows the `Committed` return to be restored.
Nominating @gefjon and @Centril because they appeared in the reviewer suggestions.
What is the typical length of the slice?
Well, 1 :D
Yeah, I think this is the right call until we need something else.
Shubham8287 left a comment:
This looks good, and it simplifies the commitlog's write API a lot.
I wonder if we should test replication with this branch, to surface any bugs that may exist for the n != 1 case.
We will benefit from `fn append_tx(&self, Transaction<Self::TxData>)`, which requires the offset to be supplied, but doesn't allocate a
We don't really need batched transactions at the moment, so avoid the boxed array allocation. `Durability::append_tx` takes a `Transaction`, though, requiring the offset to be supplied by the datastore.
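As a rough illustration of the API shape discussed above (all names here are hypothetical stand-ins, not the actual SpacetimeDB types): the durability layer accepts one `Transaction` at a time, carrying the offset assigned by the datastore, instead of a boxed slice of transactions.

```rust
// Hypothetical sketch of an offset-carrying append API; the real
// SpacetimeDB trait differs in its details.

/// A transaction paired with the offset assigned by the datastore.
struct Transaction<T> {
    offset: u64,
    txdata: T,
}

/// Toy durability layer that just records appended offsets.
struct ToyDurability {
    offsets: Vec<u64>,
}

impl ToyDurability {
    /// The caller supplies the offset; no boxed slice is allocated,
    /// because exactly one transaction is appended per call.
    fn append_tx<T>(&mut self, tx: Transaction<T>) {
        self.offsets.push(tx.offset);
    }
}

fn main() {
    let mut dur = ToyDurability { offsets: Vec::new() };
    dur.append_tx(Transaction { offset: 0, txdata: "tx-a" });
    dur.append_tx(Transaction { offset: 1, txdata: "tx-b" });
    assert_eq!(dur.offsets, vec![0, 1]);
}
```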
Note that Rust does not, by default, run destructors when the program is terminated by a signal (any signal). This, and the default being unconfirmed reads, is why the commitlog before this patch would flush after every write. I added a config option to preserve this behavior. The question is whether we should make it the default for standalone. (We should probably also make use of the
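The kind of config knob described above might look roughly like this (the struct and field names are invented for illustration; the actual option in the patch may differ):

```rust
// Hypothetical sketch of a flush-behavior option (names invented).

/// Commitlog write options.
struct CommitlogOptions {
    /// When true, flush to the OS after every commit, preserving the
    /// pre-patch behavior. Safer under abrupt termination (e.g. signals,
    /// where Rust runs no destructors), at some throughput cost.
    flush_every_commit: bool,
}

impl Default for CommitlogOptions {
    fn default() -> Self {
        // Whether this should default to `true` for standalone is the
        // open question raised above.
        Self { flush_every_commit: false }
    }
}

fn main() {
    let opts = CommitlogOptions::default();
    assert!(!opts.flush_every_commit);
}
```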
Old API:
```
baseline/n=1000 tx/commit=1 fsync=32
        time:   [65.266 ms 66.570 ms 68.001 ms]
        thrpt:  [14.706 Kelem/s 15.022 Kelem/s 15.322 Kelem/s]
change:
        time:   [+0.2793% +3.4722% +7.2653%] (p = 0.07 > 0.05)
        thrpt:  [-6.7732% -3.3556% -0.2785%]
        No change in performance detected.

large payload/n=1000 tx/commit=1 fsync=32
        time:   [79.236 ms 82.656 ms 86.054 ms]
        thrpt:  [11.621 Kelem/s 12.098 Kelem/s 12.621 Kelem/s]
change:
        time:   [-9.3387% -3.0356% +3.4453%] (p = 0.39 > 0.05)
        thrpt:  [-3.3305% +3.1306% +10.301%]
        No change in performance detected.

mixed payloads/n=1000 tx/commit=1 fsync=32
        time:   [58.803 ms 59.423 ms 60.213 ms]
        thrpt:  [16.608 Kelem/s 16.829 Kelem/s 17.006 Kelem/s]
change:
        time:   [-10.430% -6.5854% -3.3764%] (p = 0.00 < 0.05)
        thrpt:  [+3.4944% +7.0497% +11.645%]
        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

mixed payloads with batching/n=1000 tx/commit=16 fsync=32
        time:   [52.255 ms 53.006 ms 53.870 ms]
        thrpt:  [18.563 Kelem/s 18.866 Kelem/s 19.137 Kelem/s]
change:
        time:   [-1.9957% +0.2520% +2.6380%] (p = 0.83 > 0.05)
        thrpt:  [-2.5702% -0.2514% +2.0364%]
        No change in performance detected.
```
New API:
```
baseline/n=1000 tx/commit=1 fsync=32
        time:   [47.657 ms 47.991 ms 48.413 ms]
        thrpt:  [20.655 Kelem/s 20.837 Kelem/s 20.983 Kelem/s]
change:
        time:   [-1.3528% +0.0234% +1.4079%] (p = 0.97 > 0.05)
        thrpt:  [-1.3884% -0.0234% +1.3714%]
        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

large payload/n=1000 tx/commit=1 fsync=32
        time:   [90.199 ms 92.370 ms 94.611 ms]
        thrpt:  [10.570 Kelem/s 10.826 Kelem/s 11.087 Kelem/s]
change:
        time:   [-2.1623% +0.6369% +3.5718%] (p = 0.69 > 0.05)
        thrpt:  [-3.4487% -0.6329% +2.2101%]
        No change in performance detected.

mixed payloads/n=1000 tx/commit=1 fsync=32
        time:   [51.011 ms 51.733 ms 52.611 ms]
        thrpt:  [19.008 Kelem/s 19.330 Kelem/s 19.604 Kelem/s]
change:
        time:   [-2.7205% -1.0955% +0.7716%] (p = 0.26 > 0.05)
        thrpt:  [-0.7656% +1.1076% +2.7966%]
        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

mixed payloads with batching/n=1000 tx/commit=16 fsync=32
        time:   [85.693 ms 86.303 ms 86.920 ms]
        thrpt:  [11.505 Kelem/s 11.587 Kelem/s 11.670 Kelem/s]
change:
        time:   [-2.1417% -1.1554% -0.2178%] (p = 0.05 < 0.05)
        thrpt:  [+0.2183% +1.1689% +2.1886%]
        Change within noise threshold.
```
Changes the commitlog (and durability) write API such that the caller decides how many transactions are in a single commit, and has to supply the transaction offsets.
This simplifies the commitlog-side buffering logic to essentially a `BufWriter` (which, of course, we must not forget to flush). This will help throughput, but offers less opportunity to retry failed writes. That is probably a good thing: disks can fail in erratic ways, and we should rather crash and re-verify the commitlog (suffix) than continue writing. To that end, this patch liberally panics when there is a chance that internal state could be "poisoned" by partial writes, which may be debatable.
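The buffering model can be illustrated with a plain `std::io::BufWriter` (a toy stand-in, not the actual commitlog code): writes accumulate in a userspace buffer and only reach the underlying writer on flush, so a missed flush loses buffered commits on abrupt termination.

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Buffer commits in memory; nothing reaches the underlying writer
    // until the buffer fills or we flush explicitly.
    let mut log = BufWriter::new(Vec::new());
    log.write_all(b"commit-0;")?;
    log.write_all(b"commit-1;")?;
    log.flush()?; // forgetting this would leave commits in the buffer

    let bytes = log.into_inner().expect("no pending flush error");
    assert_eq!(bytes, b"commit-0;commit-1;");
    Ok(())
}
```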
Motivation
The main motivation is to avoid maintaining the transaction offset in two places in such a way that they could diverge. As ordering commits is the responsibility of the datastore, we make it authoritative on this matter -- the commitlog will still check that offsets are contiguous, and refuse to commit if that's not the case.
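The contiguity check might look roughly like this (a minimal sketch with invented names; the real implementation differs):

```rust
/// Errors the commitlog can raise on append (names illustrative).
#[derive(Debug, PartialEq)]
enum AppendError {
    /// The supplied offset does not follow the last committed offset.
    OutOfOrder { expected: u64, got: u64 },
}

struct Commitlog {
    /// Offset the next transaction must carry.
    next_offset: u64,
}

impl Commitlog {
    /// The datastore is authoritative for offsets; the commitlog only
    /// verifies that they are contiguous and refuses to commit otherwise.
    fn check_offset(&mut self, offset: u64) -> Result<(), AppendError> {
        if offset != self.next_offset {
            return Err(AppendError::OutOfOrder {
                expected: self.next_offset,
                got: offset,
            });
        }
        self.next_offset += 1;
        Ok(())
    }
}

fn main() {
    let mut log = Commitlog { next_offset: 0 };
    assert!(log.check_offset(0).is_ok());
    assert!(log.check_offset(1).is_ok());
    // A gap is rejected rather than silently accepted.
    assert_eq!(
        log.check_offset(3),
        Err(AppendError::OutOfOrder { expected: 2, got: 3 })
    );
}
```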
A secondary, related motivation is the following:
A "commit" is an atomic unit of storage, meaning that a torn (partial) write of a commit renders the entire commit corrupt. There hasn't been a compelling case where we would want multi-transaction commits, and we have always configured the server to write exactly one transaction per commit.
The code to handle buffering of transactions is, however, rather complex, as it tries hard to allow the caller to retry writes at commit boundaries. An unfortunate consequence of this is that we'd flush to the OS very often, leaving throughput performance on the table.
So, if there is a compelling case for batching multiple transactions in a commit, it should be the datastore's responsibility.
API and ABI breaking changes
Breaks internal APIs only.
Expected complexity level and risk
5 - Mostly for the risk
Testing
Existing tests.