-
-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Project Bloodstone - sled 1.0 #1456
Draft
spacejam
wants to merge
131
commits into
main
Choose a base branch
from
project_bloodstone
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+10,431
−28,753
Draft
Changes from all commits
Commits
Show all changes
131 commits
Select commit
Hold shift + click to select a range
3d4c6a7
Check-in basic project bloodstone implementation
spacejam cd4c58f
Sand-off a few edges, rename bloodstone->sled
spacejam b5d95d9
Add periodic flusher thread and some better in-memory size calculations
spacejam 41a9829
Cut new version for periodic flush thread
spacejam 3c19bfb
Move some dependencies into this crate for better customization. Cut …
spacejam 43f52dc
Clean up a variety of codepaths after merging crates
spacejam c9a11a3
Fsync metadata log directory before returning from flush
spacejam 6f46cde
Check-in simple arch description
spacejam 9df7f3c
Merge the tests and most of the Tree methods from old sled into sled 1.0
spacejam 87a26c4
Fix doctest compiler errors
spacejam 1b7f8ab
Concise iteration
wackbyte 819c2e0
Remove unnecessary `as_ref` calls
wackbyte e47bbcc
Only call `as_ref` once on keys
wackbyte c30b4aa
Remove unnecessary referencing
wackbyte dc1d2ab
Un-nest double `&mut` references
wackbyte 3e6436d
Fix typo in ARCHITECTURE.md
wackbyte a6c065d
Use `u8::MAX` over the soon-to-be-deprecated `u8::max_value()`
wackbyte 36fb34f
Merge pull request #1455 from wackbyte/fix-typo
spacejam cf627b4
Merge pull request #1454 from wackbyte/changes
spacejam d1b1a73
Fix temporary directory support
spacejam 5ebd72d
follow SPDX 2.1 license expression standard
spacejam c64fbbe
Implement Iter::next
spacejam e4fd52c
Implement the skeleton for Iter::next_back
spacejam 5e52a92
Fix a bug with pop_last_in_range
spacejam 5dfe390
Check-in some correctness efforts
spacejam 117ad01
Clear away other databases for comparative benchmarking for now
spacejam 4ea84fb
Cut alpha.104
spacejam b8924ef
Smooth out a number of issues, mostly around testing
spacejam 3016dde
Fix a few concurrency bugs with the flush epoch tracker
spacejam bb6d53a
Fix writebatch dirty page tracking issue that the concurrent crash te…
spacejam 6538301
Fix a test and tune-down log verbosity
spacejam 151feaa
Improve reverse iterator undershot detection to fix concurrent iterat…
spacejam eccaadc
Cut alpha.107
spacejam 6940c71
Fix overshoot bound detection in Db::page_in
spacejam 8012b12
Relax flush notification assertion. Avoid running transaction crash t…
spacejam d9179c8
Cut new release with bugfixes
spacejam b11ebea
Remove redundant open_default function
spacejam 1b9c057
Fix a few bugs that the intensive tests discovered
spacejam 5ee89a9
Fix an iterator bug, and begin testing with more interesting const ge…
spacejam 20ee4f5
Improve fuzz test, implement IntoIterator for &Db
spacejam 594bfd9
Avoid memory leak in FlushEpoch
spacejam 47ce0c9
Cut alpha.113 with memory leak fix
spacejam 6a1ff5e
Fix tests under linux
spacejam 72cfb06
Fix caching behavior
spacejam 6cdea88
Update gitignore to include fuzzer logs
spacejam 65a8a1a
Remove INDEX_FANOUT and EBR const generics from Db. Fix flush safety bug
spacejam b13abe8
Handle cooperative flushes more concretely to avoid bugs
spacejam a161729
Check-in more work on empty tree leaf merges and refactoring the dirt…
spacejam efc4242
Fix a couple merge bugs
spacejam 16dcc6b
Make deletion tracking account for specific flush epochs. Properly ma…
spacejam 8e95ec8
Merge pull request #1459 from spacejam/tyler_bloodstone_merges_and_st…
spacejam 26f1e74
Bump alpha version to 117
spacejam 49f73d2
Alter the storage format to include collection ID information in anti…
spacejam 4abb765
Avoid mapping from NodeId to InlineArray low key
spacejam 331ce89
Split out flush responsibility from Db
spacejam 1ec57a5
Move cache maintenance work to Io struct
spacejam 53c4654
Move shared IO behavior to new shared PageCache struct
spacejam cc8fe0b
Avoid race-prone Arc::strong_count checks in flush-on-shutdown logic
spacejam 365b60d
docs update
spacejam 80dbc40
Move most of the interesting methods from Db to Tree in preparation f…
spacejam 8eb2ea2
Fix performance regression for insertions
spacejam 9743807
Bump concurrent-map to take advantage of massive optimization in get_…
spacejam f5a554d
Move some ID allocation logic into its own module
spacejam 2f73989
Start threading CollectionId into the write and recovery paths to sup…
spacejam 1d76590
Restructure tests
spacejam bcea2e4
Implement multiple collections and import/export
spacejam 304c5d2
Improve flush epoch concurrent testing
spacejam 1786bdf
Bump version
spacejam 16b1d98
Extract PageTable into ObjectLocationMap
kolloch 0a91e3a
Merge pull request #1469 from kolloch/project_bloodstone
spacejam 9dded84
Restructure heap location tracking code slightly in preparation for h…
spacejam bc85cc9
Make assertions about expected location transitions, avoid todo panic…
spacejam 0209baa
Fix-up assertions so that tests pass
spacejam 4b1aea0
Improve naming
spacejam ce51eeb
Standardize naming on ObjectId. Include CollectionId and low key in O…
spacejam b765c2f
A collection of cleanups and the beginnings of heap defragmentation
spacejam af4ec89
Check-in initial GC object rewriting logic
spacejam 8228aaa
Add allocation and GC counters to Stats
spacejam fe48530
Check in gc pest
spacejam cb44c1f
A large number of improvements towards on-disk file GC
spacejam 86fe050
Re-add max_allocated as it will be used for file truncation
spacejam 971b91b
Reduce quickcheck operation counts a bit
spacejam 6e24fa3
Be more pedantic in deletion test
spacejam ed98f93
Have reads return optional values if the page was freed
spacejam cf118a8
Complete Merge
spacejam 024edbe
Clarify minimum flush epoch and better test it
spacejam 2647942
Thread significantly more event verification into the writepath
spacejam 947131e
Improve verification subsystem
spacejam 07fcb1d
Fix bug in paging out dirty pages
spacejam 784e287
Refine testing
spacejam bae5da3
Refine testing assertions
spacejam 48e6c7f
Remove unreachable wildcard match in history verifier
spacejam c9d66f6
Bump version to alpha.119
spacejam 8d34f6b
Address some TODOs, clean up the system more
spacejam 11b50b5
Use BTreeSet instead of BinaryHeap in the Allocator
spacejam ca270ab
prioritize TODOs
spacejam ecc717a
Perform file truncation when a slab is detected to be at 80% of its p…
spacejam aa1f899
Fix size calculation for file resizing
spacejam e6f509e
Bump version to 1.0.0-alpha.120
spacejam dd72722
Clear TODO related to file resiziing
spacejam 896b698
Abstract low-level Leaf access methods to enable lower defect prefix-…
spacejam e2e75ed
Have testing par! macro provide an InlineArray instead of a Vec<u8>
spacejam ffbdade
Add significantly more stats to the read and write paths.
spacejam ff5dec5
Use small cache for the concurrent iterator test, and run many more t…
spacejam bc8b14b
Update TODOs
spacejam 8c69736
Make concurrent iterator test more intense
spacejam d9533c3
Update ARCHITECTURE.md
spacejam c043a1f
Better abstract the leaf storage and make room for the soon-to-be-add…
spacejam 83a7bff
Handle flusher thread panics in a way that causes tests to fail as ex…
spacejam 433e9b0
Mark ObjectCache as RefUnwindSafe
spacejam bd65e8e
Increase the strictness of the event verifier and fix a variety of su…
spacejam d6ed26e
Update project TODOs and get things into place for proper fsync manag…
spacejam 16c108d
Tighten up concurrent tests by adding concurrent flushers to get epoc…
spacejam 6905c87
Add notion of max unflushed epoch to leaves
spacejam 3a87137
Properly log cooperative serialization in batch processing as Coopera…
spacejam 2a888db
Remove redundant and incorrect debug log event
spacejam 4e4a3f9
Provide more information for debugging failed crash tests
spacejam 1516077
Properly fsync slab files after write batches
spacejam 534fbfb
Silence warnings
spacejam a2fb9aa
Sync slabs dir after potentially initializing new files
spacejam 795a221
check-in this weekend's work before flying
spacejam cb14155
[new file format] use crc on frame lengths in metadata store
spacejam 3b4c889
Fix bug with crash test assertions. Reduce visibility of platform-spe…
spacejam a153ead
Improve crash test structure, add skeletons for more crash tests, add…
spacejam 355f4d3
Sync more thoroughly before recovering metadata store. Move crash dir…
spacejam 1a29b92
Add a lot more assertions, rely on a full Mutex in a couple places fo…
spacejam 83541fa
Explicitly bump ebr dependency to avoid bug
spacejam 6fdd358
Make crash test concurrent flushing more frequent
spacejam 1e32923
Add a SeqCst fence between Release writes and later Acquire reads
spacejam cd9d468
Retry page accesses on a rare unexpected state
spacejam 41a3293
Bump version which includes some additional strictness and fixes
spacejam File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,8 @@ | ||
fuzz-*.log | ||
default.sled | ||
crash_* | ||
timing_test* | ||
*db | ||
crash_test_files | ||
*conf | ||
*snap.* | ||
*grind.out* | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
<table style="width:100%"> | ||
<tr> | ||
<td> | ||
<table style="width:100%"> | ||
<tr> | ||
<td> key </td> | ||
<td> value </td> | ||
</tr> | ||
<tr> | ||
<td><a href="https://github.com/sponsors/spacejam">buy a coffee for us to convert into databases</a></td> | ||
<td><a href="https://github.com/sponsors/spacejam"><img src="https://img.shields.io/github/sponsors/spacejam"></a></td> | ||
</tr> | ||
<tr> | ||
<td><a href="https://docs.rs/sled">documentation</a></td> | ||
<td><a href="https://docs.rs/sled"><img src="https://docs.rs/sled/badge.svg"></a></td> | ||
</tr> | ||
<tr> | ||
<td><a href="https://discord.gg/Z6VsXds">chat about databases with us</a></td> | ||
<td><a href="https://discord.gg/Z6VsXds"><img src="https://img.shields.io/discord/509773073294295082.svg?logo=discord"></a></td> | ||
</tr> | ||
</table> | ||
</td> | ||
<td> | ||
<p align="center"> | ||
<img src="https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" width="40%" height="auto" /> | ||
</p> | ||
</td> | ||
</tr> | ||
</table> | ||
|
||
# sled 1.0 architecture | ||
|
||
## in-memory | ||
|
||
* Lock-free B+ tree index, extracted into the [`concurrent-map`](https://github.com/komora-io/concurrent-map) crate. | ||
* The lowest key from each leaf is stored in this in-memory index. | ||
* To read any leaf that is not already cached in memory, at most one disk read will be required. | ||
* RwLock-backed leaves, using the ArcRwLock from the [`parking_lot`](https://github.com/Amanieu/parking_lot) crate. As a `Db` grows, leaf contention tends to go down in most use cases. But this may be revisited over time if many users have issues with RwLock-related contention. Avoiding full RCU for updates on the leaves results in many of the performance benefits over sled 0.34, with significantly lower memory pressure. | ||
* A simple but very high performance epoch-based reclamation technique is used for safely deferring frees of in-memory index data and reuse of on-disk heap slots, extracted into the [`ebr`](https://github.com/komora-io/ebr) crate. | ||
* A scan-resistant LRU is used for handling eviction. By default, 20% of the cache is reserved for leaves that are accessed at most once. This is configurable via `Config.entry_cache_percent`. This is handled by the extracted [`cache-advisor`](https://github.com/komora-io/cache-advisor) crate. The overall cache size is set by the `Config.cache_size` configurable. | ||
|
||
## write path | ||
|
||
* This is where things get interesting. There is no traditional WAL. There is no LSM. Only metadata is logged atomically after objects are written in parallel. | ||
* The important guarantees are: | ||
* all previous writes are durable after a call to `Db::flush` (This is also called periodically in the background by a flusher thread) | ||
* all write batches written using `Db::apply_batch` are either 100% visible or 0% visible after crash recovery. If it was followed by a flush that returned `Ok(())` it is guaranteed to be present. | ||
* Atomic ([linearizable](https://jepsen.io/consistency/models/linearizable)) durability is provided by marking dirty leaves as participants in "flush epochs" and performing atomic batch writes of the full epoch at a time, in order. Each call to `Db::flush` advances the current flush epoch by 1. | ||
* The atomic write consists in the following steps: | ||
1. User code or the background flusher thread calls `Db::flush`. | ||
1. In parallel (via [rayon](https://docs.rs/rayon)) serialize and compress each dirty leaf with zstd (configurable via `Config.zstd_compression_level`). | ||
1. Based on the size of the bytes for each object, choose the smallest heap file slot that can hold the full set of bytes. This is an on-disk slab allocator. | ||
1. Slab slots are not power-of-two sized, but tend to increase in size by around 20% from one to the next, resulting in far lower fragmentation than typical page-oriented heaps with either constant-size or power-of-two sized leaves. | ||
1. Write the object to the allocated slot from the rayon threadpool. | ||
1. After all writes, fsync the heap files that were written to. | ||
1. If any writes were written to the end of the heap file, causing it to grow, fsync the directory that stores all heap files. | ||
1. After the writes are stable, it is now safe to write an atomic metadata batch that records the location of each written leaf in the heap. This is a simple framed batch of `(low_key, slab_slot)` tuples that are initially written to a log, but eventually merged into a simple snapshot file for the metadata store once the log becomes larger than the snapshot file. | ||
1. Fsync of the metadata log file. | ||
1. Fsync of the metadata log directory. | ||
1. After the atomic metadata batch write, the previously occupied slab slots are marked for future reuse with the epoch-based reclamation system. After all threads that may have witnessed the previous location have finished their work, the slab slot is added to the free `BinaryHeap` of the slot that it belongs to so that it may be reused in future atomic write batches. | ||
1. Return `Ok(())` to the caller of `Db::flush`. | ||
* Writing objects before the metadata write is random, but modern SSDs handle this well. Even though the SSD's FTL will be working harder to defragment things periodically than if we wrote a few megabytes sequentially with each write, the data that the FTL will be copying will be mostly live due to the eager leaf write-backs. | ||
|
||
## recovery | ||
|
||
* Recovery involves simply reading the atomic metadata store that records the low key for each written leaf as well as its location and mapping it into the in-memory index. Any gaps in the slabs are then used as free slots. | ||
* Any write that failed to complete its entire atomic writebatch is treated as if it never happened, because no user-visible flush ever returned successfully. | ||
* Rayon is also used here for parallelizing reads of this metadata. In general, this is extremely fast compared to the previous sled recovery process. | ||
|
||
## tuning | ||
|
||
* The larger the `LEAF_FANOUT` const generic on the high-level `Db` struct (default `1024`), the smaller the in-memory leaf index and the better the compression ratio of the on-disk file, but the more expensive it will be to read the entire leaf off of disk and decompress it. | ||
* You can choose to turn the `LEAF_FANOUT` relatively low to make the system behave more like an Index+Log architecture, but overall disk size will grow and write performance will decrease. | ||
* NB: changing `LEAF_FANOUT` after writing data is not supported. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,74 +1,73 @@ | ||
[package] | ||
name = "sled" | ||
version = "0.34.7" | ||
authors = ["Tyler Neely <[email protected]>"] | ||
version = "1.0.0-alpha.124" | ||
edition = "2021" | ||
authors = ["Tyler Neely <[email protected]>"] | ||
documentation = "https://docs.rs/sled/" | ||
description = "Lightweight high-performance pure-rust transactional embedded database." | ||
license = "MIT/Apache-2.0" | ||
license = "MIT OR Apache-2.0" | ||
homepage = "https://github.com/spacejam/sled" | ||
repository = "https://github.com/spacejam/sled" | ||
keywords = ["redis", "mongo", "sqlite", "lmdb", "rocksdb"] | ||
categories = ["database-implementations", "concurrency", "data-structures", "algorithms", "caching"] | ||
documentation = "https://docs.rs/sled/" | ||
readme = "README.md" | ||
edition = "2018" | ||
exclude = ["benchmarks", "examples", "bindings", "scripts", "experiments"] | ||
|
||
[package.metadata.docs.rs] | ||
features = ["docs", "metrics"] | ||
|
||
[badges] | ||
maintenance = { status = "actively-developed" } | ||
[features] | ||
# initializes allocated memory to 0xa1, writes 0xde to deallocated memory before freeing it | ||
testing-shred-allocator = [] | ||
# use a counting global allocator that provides the sled::alloc::{allocated, freed, resident, reset} functions | ||
testing-count-allocator = [] | ||
for-internal-testing-only = [] | ||
# turn off re-use of object IDs and heap slots, disable tree leaf merges, disable heap file truncation. | ||
monotonic-behavior = [] | ||
|
||
[profile.release] | ||
debug = true | ||
opt-level = 3 | ||
overflow-checks = true | ||
panic = "abort" | ||
|
||
[features] | ||
default = [] | ||
# Do not use the "testing" feature in your own testing code, this is for | ||
# internal testing use only. It injects many delays and performs several | ||
# test-only configurations that cause performance to drop significantly. | ||
# It will cause your tests to take much more time, and possibly time out etc... | ||
testing = ["event_log", "lock_free_delays", "light_testing"] | ||
light_testing = ["failpoints", "backtrace", "memshred"] | ||
lock_free_delays = [] | ||
failpoints = [] | ||
event_log = [] | ||
metrics = ["num-format"] | ||
no_logs = ["log/max_level_off"] | ||
no_inline = [] | ||
pretty_backtrace = ["color-backtrace"] | ||
docs = [] | ||
no_zstd = [] | ||
miri_optimizations = [] | ||
mutex = [] | ||
memshred = [] | ||
[profile.test] | ||
debug = true | ||
overflow-checks = true | ||
panic = "abort" | ||
|
||
[dependencies] | ||
libc = "0.2.96" | ||
crc32fast = "1.2.1" | ||
log = "0.4.14" | ||
parking_lot = "0.12.1" | ||
color-backtrace = { version = "0.5.1", optional = true } | ||
num-format = { version = "0.4.0", optional = true } | ||
backtrace = { version = "0.3.60", optional = true } | ||
im = "15.1.0" | ||
|
||
[target.'cfg(any(target_os = "linux", target_os = "macos", target_os="windows"))'.dependencies] | ||
bincode = "1.3.3" | ||
cache-advisor = "1.0.16" | ||
concurrent-map = { version = "5.0.31", features = ["serde"] } | ||
crc32fast = "1.3.2" | ||
ebr = "0.2.13" | ||
inline-array = { version = "0.1.13", features = ["serde", "concurrent_map_minimum"] } | ||
fs2 = "0.4.3" | ||
log = "0.4.19" | ||
pagetable = "0.4.5" | ||
parking_lot = { version = "0.12.1", features = ["arc_lock"] } | ||
rayon = "1.7.0" | ||
serde = { version = "1.0", features = ["derive"] } | ||
stack-map = { version = "1.0.5", features = ["serde"] } | ||
zstd = "0.12.4" | ||
fnv = "1.0.7" | ||
fault-injection = "1.0.10" | ||
crossbeam-queue = "0.3.8" | ||
crossbeam-channel = "0.5.8" | ||
tempdir = "0.3.7" | ||
|
||
[dev-dependencies] | ||
rand = "0.7" | ||
rand_chacha = "0.3.1" | ||
rand_distr = "0.3" | ||
quickcheck = "0.9" | ||
log = "0.4.14" | ||
env_logger = "0.9.0" | ||
zerocopy = "0.6.0" | ||
byteorder = "1.4.3" | ||
env_logger = "0.10.0" | ||
num-format = "0.4.4" | ||
# heed = "0.11.0" | ||
# rocksdb = "0.21.0" | ||
# rusqlite = "0.29.0" | ||
# old_sled = { version = "0.34", package = "sled" } | ||
rand = "0.8.5" | ||
quickcheck = "1.0.3" | ||
rand_distr = "0.4.3" | ||
libc = "0.2.147" | ||
|
||
[[test]] | ||
name = "test_crash_recovery" | ||
path = "tests/test_crash_recovery.rs" | ||
harness = false | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file was deleted.
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2024 by the time this is merged.