Project Bloodstone - sled 1.0 #1456

Draft: wants to merge 131 commits into base: main
Commits (131)
3d4c6a7
Check-in basic project bloodstone implementation
spacejam Jul 28, 2023
cd4c58f
Sand-off a few edges, rename bloodstone->sled
spacejam Jul 28, 2023
b5d95d9
Add periodic flusher thread and some better in-memory size calculations
spacejam Jul 28, 2023
41a9829
Cut new version for periodic flush thread
spacejam Jul 28, 2023
3c19bfb
Move some dependencies into this crate for better customization. Cut …
spacejam Jul 29, 2023
43f52dc
Clean up a variety of codepaths after merging crates
spacejam Jul 29, 2023
c9a11a3
Fsync metadata log directory before returning from flush
spacejam Jul 29, 2023
6f46cde
Check-in simple arch description
spacejam Jul 29, 2023
9df7f3c
Merge the tests and most of the Tree methods from old sled into sled 1.0
spacejam Jul 30, 2023
87a26c4
Fix doctest compiler errors
spacejam Jul 30, 2023
1b7f8ab
Concise iteration
wackbyte Jul 29, 2023
819c2e0
Remove unnecessary `as_ref` calls
wackbyte Jul 29, 2023
e47bbcc
Only call `as_ref` once on keys
wackbyte Jul 29, 2023
c30b4aa
Remove unnecessary referencing
wackbyte Jul 29, 2023
dc1d2ab
Un-nest double `&mut` references
wackbyte Jul 29, 2023
3e6436d
Fix typo in ARCHITECTURE.md
wackbyte Jul 29, 2023
a6c065d
Use `u8::MAX` over the soon-to-be-deprecated `u8::max_value()`
wackbyte Jul 31, 2023
36fb34f
Merge pull request #1455 from wackbyte/fix-typo
spacejam Aug 5, 2023
cf627b4
Merge pull request #1454 from wackbyte/changes
spacejam Aug 5, 2023
d1b1a73
Fix temporary directory support
spacejam Aug 5, 2023
5ebd72d
follow SPDX 2.1 license expression standard
spacejam Aug 5, 2023
c64fbbe
Implement Iter::next
spacejam Aug 5, 2023
e4fd52c
Implement the skeleton for Iter::next_back
spacejam Aug 5, 2023
5e52a92
Fix a bug with pop_last_in_range
spacejam Aug 5, 2023
5dfe390
Check-in some correctness efforts
spacejam Aug 5, 2023
117ad01
Clear away other databases for comparative benchmarking for now
spacejam Aug 5, 2023
4ea84fb
Cut alpha.104
spacejam Aug 5, 2023
b8924ef
Smooth out a number of issues, mostly around testing
spacejam Aug 5, 2023
3016dde
Fix a few concurrency bugs with the flush epoch tracker
spacejam Aug 5, 2023
bb6d53a
Fix writebatch dirty page tracking issue that the concurrent crash te…
spacejam Aug 5, 2023
6538301
Fix a test and tune-down log verbosity
spacejam Aug 5, 2023
151feaa
Improve reverse iterator undershot detection to fix concurrent iterat…
spacejam Aug 6, 2023
eccaadc
Cut alpha.107
spacejam Aug 6, 2023
6940c71
Fix overshoot bound detection in Db::page_in
spacejam Aug 6, 2023
8012b12
Relax flush notification assertion. Avoid running transaction crash t…
spacejam Aug 6, 2023
d9179c8
Cut new release with bugfixes
spacejam Aug 6, 2023
b11ebea
Remove redundant open_default function
spacejam Aug 6, 2023
1b9c057
Fix a few bugs that the intensive tests discovered
spacejam Aug 6, 2023
5ee89a9
Fix an iterator bug, and begin testing with more interesting const ge…
spacejam Aug 6, 2023
20ee4f5
Improve fuzz test, implement IntoIterator for &Db
spacejam Aug 6, 2023
594bfd9
Avoid memory leak in FlushEpoch
spacejam Aug 6, 2023
47ce0c9
Cut alpha.113 with memory leak fix
spacejam Aug 6, 2023
6a1ff5e
Fix tests under linux
spacejam Aug 6, 2023
72cfb06
Fix caching behavior
spacejam Aug 7, 2023
6cdea88
Update gitignore to include fuzzer logs
spacejam Aug 7, 2023
65a8a1a
Remove INDEX_FANOUT and EBR const generics from Db. Fix flush safety bug
spacejam Aug 12, 2023
b13abe8
Handle cooperative flushes more concretely to avoid bugs
spacejam Aug 12, 2023
a161729
Check-in more work on empty tree leaf merges and refactoring the dirt…
spacejam Aug 26, 2023
efc4242
Fix a couple merge bugs
spacejam Aug 26, 2023
16dcc6b
Make deletion tracking account for specific flush epochs. Properly ma…
spacejam Aug 26, 2023
8e95ec8
Merge pull request #1459 from spacejam/tyler_bloodstone_merges_and_st…
spacejam Aug 26, 2023
26f1e74
Bump alpha version to 117
spacejam Aug 26, 2023
49f73d2
Alter the storage format to include collection ID information in anti…
spacejam Sep 2, 2023
4abb765
Avoid mapping from NodeId to InlineArray low key
spacejam Sep 2, 2023
331ce89
Split out flush responsibility from Db
spacejam Sep 3, 2023
1ec57a5
Move cache maintenance work to Io struct
spacejam Sep 3, 2023
53c4654
Move shared IO behavior to new shared PageCache struct
spacejam Sep 3, 2023
cc8fe0b
Avoid race-prone Arc::strong_count checks in flush-on-shutdown logic
spacejam Sep 3, 2023
365b60d
docs update
spacejam Sep 3, 2023
80dbc40
Move most of the interesting methods from Db to Tree in preparation f…
spacejam Sep 3, 2023
8eb2ea2
Fix performance regression for insertions
spacejam Sep 3, 2023
9743807
Bump concurrent-map to take advantage of massive optimization in get_…
spacejam Sep 3, 2023
f5a554d
Move some ID allocation logic into its own module
spacejam Sep 3, 2023
2f73989
Start threading CollectionId into the write and recovery paths to sup…
spacejam Sep 4, 2023
1d76590
Restructure tests
spacejam Sep 4, 2023
bcea2e4
Implement multiple collections and import/export
spacejam Sep 5, 2023
304c5d2
Improve flush epoch concurrent testing
spacejam Sep 11, 2023
1786bdf
Bump version
spacejam Sep 11, 2023
16b1d98
Extract PageTable into ObjectLocationMap
kolloch Sep 25, 2023
0a91e3a
Merge pull request #1469 from kolloch/project_bloodstone
spacejam Sep 25, 2023
9dded84
Restructure heap location tracking code slightly in preparation for h…
spacejam Oct 3, 2023
bc85cc9
Make assertions about expected location transitions, avoid todo panic…
spacejam Oct 3, 2023
0209baa
Fix-up assertions so that tests pass
spacejam Oct 3, 2023
4b1aea0
Improve naming
spacejam Oct 3, 2023
ce51eeb
Standardize naming on ObjectId. Include CollectionId and low key in O…
spacejam Oct 3, 2023
b765c2f
A collection of cleanups and the beginnings of heap defragmentation
spacejam Oct 19, 2023
af4ec89
Check-in initial GC object rewriting logic
spacejam Oct 28, 2023
8228aaa
Add allocation and GC counters to Stats
spacejam Oct 29, 2023
fe48530
Check in gc pest
spacejam Nov 5, 2023
cb44c1f
A large number of improvements towards on-disk file GC
spacejam Nov 11, 2023
86fe050
Re-add max_allocated as it will be used for file truncation
spacejam Nov 11, 2023
971b91b
Reduce quickcheck operation counts a bit
spacejam Nov 11, 2023
6e24fa3
Be more pedantic in deletion test
spacejam Dec 17, 2023
ed98f93
Have reads return optional values if the page was freed
spacejam Dec 17, 2023
cf118a8
Complete Merge
spacejam Dec 24, 2023
024edbe
Clarify minimum flush epoch and better test it
spacejam Dec 24, 2023
2647942
Thread significantly more event verification into the writepath
spacejam Dec 24, 2023
947131e
Improve verification subsystem
spacejam Dec 25, 2023
07fcb1d
Fix bug in paging out dirty pages
spacejam Dec 25, 2023
784e287
Refine testing
spacejam Dec 25, 2023
bae5da3
Refine testing assertions
spacejam Dec 25, 2023
48e6c7f
Remove unreachable wildcard match in history verifier
spacejam Dec 25, 2023
c9d66f6
Bump version to alpha.119
spacejam Dec 25, 2023
8d34f6b
Address some TODOs, clean up the system more
spacejam Dec 25, 2023
11b50b5
Use BTreeSet instead of BinaryHeap in the Allocator
spacejam Dec 25, 2023
ca270ab
prioritize TODOs
spacejam Dec 25, 2023
ecc717a
Perform file truncation when a slab is detected to be at 80% of its p…
spacejam Dec 25, 2023
aa1f899
Fix size calculation for file resizing
spacejam Dec 25, 2023
e6f509e
Bump version to 1.0.0-alpha.120
spacejam Dec 25, 2023
dd72722
Clear TODO related to file resiziing
spacejam Dec 25, 2023
896b698
Abstract low-level Leaf access methods to enable lower defect prefix-…
spacejam Dec 27, 2023
e2e75ed
Have testing par! macro provide an InlineArray instead of a Vec<u8>
spacejam Dec 27, 2023
ffbdade
Add significantly more stats to the read and write paths.
spacejam Dec 28, 2023
ff5dec5
Use small cache for the concurrent iterator test, and run many more t…
spacejam Dec 28, 2023
bc8b14b
Update TODOs
spacejam Dec 29, 2023
8c69736
Make concurrent iterator test more intense
spacejam Dec 29, 2023
d9533c3
Update ARCHITECTURE.md
spacejam Jan 4, 2024
c043a1f
Better abstract the leaf storage and make room for the soon-to-be-add…
spacejam Feb 10, 2024
83a7bff
Handle flusher thread panics in a way that causes tests to fail as ex…
spacejam Feb 11, 2024
433e9b0
Mark ObjectCache as RefUnwindSafe
spacejam Feb 11, 2024
bd65e8e
Increase the strictness of the event verifier and fix a variety of su…
spacejam Feb 11, 2024
d6ed26e
Update project TODOs and get things into place for proper fsync manag…
spacejam Feb 18, 2024
16c108d
Tighten up concurrent tests by adding concurrent flushers to get epoc…
spacejam Mar 9, 2024
6905c87
Add notion of max unflushed epoch to leaves
spacejam Mar 10, 2024
3a87137
Properly log cooperative serialization in batch processing as Coopera…
spacejam Mar 23, 2024
2a888db
Remove redundant and incorrect debug log event
spacejam Mar 24, 2024
4e4a3f9
Provide more information for debugging failed crash tests
spacejam Apr 1, 2024
1516077
Properly fsync slab files after write batches
spacejam Apr 7, 2024
534fbfb
Silence warnings
spacejam Apr 7, 2024
a2fb9aa
Sync slabs dir after potentially initializing new files
spacejam Apr 7, 2024
795a221
check-in this weekend's work before flying
spacejam Apr 9, 2024
cb14155
[new file format] use crc on frame lengths in metadata store
spacejam Apr 14, 2024
3b4c889
Fix bug with crash test assertions. Reduce visibility of platform-spe…
spacejam May 11, 2024
a153ead
Improve crash test structure, add skeletons for more crash tests, add…
spacejam Jun 15, 2024
355f4d3
Sync more thoroughly before recovering metadata store. Move crash dir…
spacejam Jun 15, 2024
1a29b92
Add a lot more assertions, rely on a full Mutex in a couple places fo…
spacejam Jun 15, 2024
83541fa
Explicitly bump ebr dependency to avoid bug
spacejam Sep 5, 2024
6fdd358
Make crash test concurrent flushing more frequent
spacejam Oct 11, 2024
1e32923
Add a SeqCst fence between Release writes and later Acquire reads
spacejam Oct 11, 2024
cd9d468
Retry page accesses on a rare unexpected state
spacejam Oct 11, 2024
41a3293
Bump version which includes some additional strictness and fixes
spacejam Oct 11, 2024
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,6 +1,8 @@
fuzz-*.log
default.sled
crash_*
timing_test*
*db
crash_test_files
*conf
*snap.*
*grind.out*
74 changes: 74 additions & 0 deletions ARCHITECTURE.md
@@ -0,0 +1,74 @@
<table style="width:100%">
<tr>
<td>
<table style="width:100%">
<tr>
<td> key </td>
<td> value </td>
</tr>
<tr>
<td><a href="https://github.com/sponsors/spacejam">buy a coffee for us to convert into databases</a></td>
<td><a href="https://github.com/sponsors/spacejam"><img src="https://img.shields.io/github/sponsors/spacejam"></a></td>
</tr>
<tr>
<td><a href="https://docs.rs/sled">documentation</a></td>
<td><a href="https://docs.rs/sled"><img src="https://docs.rs/sled/badge.svg"></a></td>
</tr>
<tr>
<td><a href="https://discord.gg/Z6VsXds">chat about databases with us</a></td>
<td><a href="https://discord.gg/Z6VsXds"><img src="https://img.shields.io/discord/509773073294295082.svg?logo=discord"></a></td>
</tr>
</table>
</td>
<td>
<p align="center">
<img src="https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" width="40%" height="auto" />
</p>
</td>
</tr>
</table>

# sled 1.0 architecture

## in-memory

* Lock-free B+ tree index, extracted into the [`concurrent-map`](https://github.com/komora-io/concurrent-map) crate.
* The lowest key from each leaf is stored in this in-memory index (see the lookup sketch after this list).
* To read any leaf that is not already cached in memory, at most one disk read will be required.
* RwLock-backed leaves, using the ArcRwLock from the [`parking_lot`](https://github.com/Amanieu/parking_lot) crate. As a `Db` grows, leaf contention tends to go down in most use cases. But this may be revisited over time if many users have issues with RwLock-related contention. Avoiding full RCU for updates on the leaves results in many of the performance benefits over sled 0.34, with significantly lower memory pressure.
* A simple but very high performance epoch-based reclamation technique is used for safely deferring frees of in-memory index data and reuse of on-disk heap slots, extracted into the [`ebr`](https://github.com/komora-io/ebr) crate.
* A scan-resistant LRU handles eviction, implemented in the extracted [`cache-advisor`](https://github.com/komora-io/cache-advisor) crate. By default, 20% of the cache is reserved for leaves that are accessed at most once; this is configurable via `Config.entry_cache_percent`. The overall cache size is set by `Config.cache_size`.
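
To make the index routing concrete, here is a minimal single-threaded sketch of the "lowest key per leaf" lookup described above. It uses `std::collections::BTreeMap` as a stand-in for the lock-free `concurrent-map` index and ignores locking, caching, and paging entirely; it is an illustration, not the code in this PR.

```rust
use std::collections::BTreeMap;

// Stand-in types: the real index is a lock-free concurrent-map and leaves sit
// behind parking_lot ArcRwLocks; a BTreeMap is used here purely to show how the
// "lowest key per leaf" index routes a lookup to the owning leaf.
struct Leaf {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
}

struct Index {
    // Maps each leaf's lowest key to the leaf itself.
    leaves: BTreeMap<Vec<u8>, Leaf>,
}

impl Index {
    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        // The owning leaf is the one with the greatest low key <= `key`, so a
        // point read touches at most one leaf (and needs at most one disk read
        // if that leaf is not cached).
        let (_low_key, leaf) = self.leaves.range(..=key.to_vec()).next_back()?;
        leaf.entries.get(key)
    }
}
```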

## write path

* This is where things get interesting. There is no traditional WAL. There is no LSM. Only metadata is logged atomically after objects are written in parallel.
* The important guarantees are:
  * all previous writes are durable after a call to `Db::flush` (the background flusher thread also calls this periodically)
  * all write batches written using `Db::apply_batch` are either 100% visible or 0% visible after crash recovery. If a batch was followed by a call to `Db::flush` that returned `Ok(())`, it is guaranteed to be present (see the usage sketch after this list).
* Atomic ([linearizable](https://jepsen.io/consistency/models/linearizable)) durability is provided by marking dirty leaves as participants in "flush epochs" and performing atomic batch writes of the full epoch at a time, in order. Each call to `Db::flush` advances the current flush epoch by 1.
* The atomic write consists of the following steps:
  1. User code or the background flusher thread calls `Db::flush`.
  1. In parallel (via [rayon](https://docs.rs/rayon)), serialize and compress each dirty leaf with zstd (configurable via `Config.zstd_compression_level`).
  1. Based on the serialized size of each object, choose the smallest heap file slot that can hold the full set of bytes. This is an on-disk slab allocator.
  1. Slab slots are not power-of-two sized, but tend to increase in size by around 20% from one size class to the next, resulting in far lower fragmentation than typical page-oriented heaps with either constant-size or power-of-two sized leaves (a sizing sketch follows this list).
  1. Write each object to its allocated slot from the rayon threadpool.
  1. After all writes, fsync the heap files that were written to.
  1. If any write was appended to the end of a heap file, causing it to grow, fsync the directory that stores all heap files.
  1. After the writes are stable, it is now safe to write an atomic metadata batch that records the location of each written leaf in the heap. This is a simple framed batch of `(low_key, slab_slot)` tuples that is initially written to a log, but eventually merged into a simple snapshot file for the metadata store once the log becomes larger than the snapshot file.
  1. Fsync the metadata log file.
  1. Fsync the metadata log directory.
  1. After the atomic metadata batch write, the previously occupied slab slots are marked for future reuse with the epoch-based reclamation system. After all threads that may have witnessed the previous location have finished their work, each slab slot is added to the free `BinaryHeap` of the slab it belongs to so that it may be reused in future atomic write batches.
  1. Return `Ok(())` to the caller of `Db::flush`.
* The object writes that precede the metadata write land at effectively random offsets, but modern SSDs handle this well. Even though the SSD's FTL will be working harder to defragment things periodically than if we wrote a few megabytes sequentially with each write, the data that the FTL will be copying will be mostly live due to the eager leaf write-backs.
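
To make the batch guarantee above concrete, here is a minimal usage sketch. It assumes an API close to sled 0.34 (`sled::open`, `sled::Batch`, `Db::apply_batch`, `Db::flush`); the exact signatures on this branch may differ, so treat the names as assumptions rather than the final 1.0 surface.

```rust
// Sketch only: API names assumed from sled 0.34; verify against this branch.
fn durable_batch() -> Result<(), Box<dyn std::error::Error>> {
    let db = sled::open("my_db")?;

    let mut batch = sled::Batch::default();
    batch.insert("k1", "v1");
    batch.insert("k2", "v2");

    // After crash recovery, either both k1 and k2 are visible or neither is.
    db.apply_batch(batch)?;

    // Once flush returns Ok(()), the batch is guaranteed to survive a crash.
    db.flush()?;
    Ok(())
}
```

The slab size-class step can also be sketched: classes grow by roughly 20% rather than doubling, so the smallest slot that fits a serialized leaf wastes comparatively little space. The base size and growth rule below are illustrative assumptions, not the constants used by this PR.

```rust
// Hypothetical size-class walk: grow each class by ~20%, rounding up.
fn smallest_fitting_slot(object_len: u64) -> u64 {
    let mut slot = 64; // assumed smallest slot size in bytes
    while slot < object_len {
        slot += (slot + 4) / 5; // ~20% growth per size class
    }
    slot
}
```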

## recovery

* Recovery simply reads the atomic metadata store, which records the low key and heap location of each written leaf, and maps each entry into the in-memory index. Any gaps in the slabs are then used as free slots (see the sketch after this list).
* Any write that failed to complete its entire atomic writebatch is treated as if it never happened, because no user-visible flush ever returned successfully.
* Rayon is also used here for parallelizing reads of this metadata. In general, this is extremely fast compared to the previous sled recovery process.
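
As a conceptual sketch of this recovery path, the metadata store can be viewed as yielding `(low_key, slab_slot)` pairs that are folded into the index, with every unreferenced slot becoming free. Types and names below are illustrative only; the real code also parallelizes the metadata reads with rayon and tracks per-slab state.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative recovery: rebuild the low-key index from metadata entries and
// treat every slot that no live leaf references as free.
fn recover(
    metadata: Vec<(Vec<u8>, u64)>,
    total_slots: u64,
) -> (BTreeMap<Vec<u8>, u64>, BTreeSet<u64>) {
    let mut index = BTreeMap::new();
    let mut live = BTreeSet::new();
    for (low_key, slot) in metadata {
        live.insert(slot);
        index.insert(low_key, slot);
    }
    let free: BTreeSet<u64> = (0..total_slots).filter(|s| !live.contains(s)).collect();
    (index, free)
}
```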

## tuning

* The larger the `LEAF_FANOUT` const generic on the high-level `Db` struct (default `1024`), the smaller the in-memory leaf index and the better the compression ratio of the on-disk file, but the more expensive it will be to read the entire leaf off of disk and decompress it.
* You can set `LEAF_FANOUT` relatively low to make the system behave more like an Index+Log architecture, but overall disk size will grow and write performance will decrease (a call-site sketch follows this list).
* NB: changing `LEAF_FANOUT` after writing data is not supported.
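
For illustration, this is roughly what choosing the fanout looks like at the call site, assuming `Db` carries `LEAF_FANOUT` as a const generic and that a builder-style `Config` with `path`/`open` exists as in earlier sled releases; the exact method names and error type are assumptions to check against the code in this branch.

```rust
// Assumption: `Config::open` is generic over the leaf fanout; verify the real
// signatures before relying on this sketch.
fn open_with_small_leaves(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // A low fanout (e.g. 64) pushes the system toward Index+Log behavior:
    // a larger in-memory index and on-disk footprint in exchange for cheaper
    // individual leaf reads and decompression.
    let _db: sled::Db<64> = sled::Config::new().path(path).open()?;
    Ok(())
}
```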
95 changes: 47 additions & 48 deletions Cargo.toml
@@ -1,74 +1,73 @@
[package]
name = "sled"
version = "0.34.7"
authors = ["Tyler Neely <[email protected]>"]
version = "1.0.0-alpha.124"
edition = "2021"
authors = ["Tyler Neely <[email protected]>"]
documentation = "https://docs.rs/sled/"
description = "Lightweight high-performance pure-rust transactional embedded database."
license = "MIT/Apache-2.0"
license = "MIT OR Apache-2.0"
homepage = "https://github.com/spacejam/sled"
repository = "https://github.com/spacejam/sled"
keywords = ["redis", "mongo", "sqlite", "lmdb", "rocksdb"]
categories = ["database-implementations", "concurrency", "data-structures", "algorithms", "caching"]
documentation = "https://docs.rs/sled/"
readme = "README.md"
edition = "2018"
exclude = ["benchmarks", "examples", "bindings", "scripts", "experiments"]

[package.metadata.docs.rs]
features = ["docs", "metrics"]

[badges]
maintenance = { status = "actively-developed" }
[features]
# initializes allocated memory to 0xa1, writes 0xde to deallocated memory before freeing it
testing-shred-allocator = []
# use a counting global allocator that provides the sled::alloc::{allocated, freed, resident, reset} functions
testing-count-allocator = []
for-internal-testing-only = []
# turn off re-use of object IDs and heap slots, disable tree leaf merges, disable heap file truncation.
monotonic-behavior = []

[profile.release]
debug = true
opt-level = 3
overflow-checks = true
panic = "abort"

[features]
default = []
# Do not use the "testing" feature in your own testing code, this is for
# internal testing use only. It injects many delays and performs several
# test-only configurations that cause performance to drop significantly.
# It will cause your tests to take much more time, and possibly time out etc...
testing = ["event_log", "lock_free_delays", "light_testing"]
light_testing = ["failpoints", "backtrace", "memshred"]
lock_free_delays = []
failpoints = []
event_log = []
metrics = ["num-format"]
no_logs = ["log/max_level_off"]
no_inline = []
pretty_backtrace = ["color-backtrace"]
docs = []
no_zstd = []
miri_optimizations = []
mutex = []
memshred = []
[profile.test]
debug = true
overflow-checks = true
panic = "abort"

[dependencies]
libc = "0.2.96"
crc32fast = "1.2.1"
log = "0.4.14"
parking_lot = "0.12.1"
color-backtrace = { version = "0.5.1", optional = true }
num-format = { version = "0.4.0", optional = true }
backtrace = { version = "0.3.60", optional = true }
im = "15.1.0"

[target.'cfg(any(target_os = "linux", target_os = "macos", target_os="windows"))'.dependencies]
bincode = "1.3.3"
cache-advisor = "1.0.16"
concurrent-map = { version = "5.0.31", features = ["serde"] }
crc32fast = "1.3.2"
ebr = "0.2.13"
inline-array = { version = "0.1.13", features = ["serde", "concurrent_map_minimum"] }
fs2 = "0.4.3"
log = "0.4.19"
pagetable = "0.4.5"
parking_lot = { version = "0.12.1", features = ["arc_lock"] }
rayon = "1.7.0"
serde = { version = "1.0", features = ["derive"] }
stack-map = { version = "1.0.5", features = ["serde"] }
zstd = "0.12.4"
fnv = "1.0.7"
fault-injection = "1.0.10"
crossbeam-queue = "0.3.8"
crossbeam-channel = "0.5.8"
tempdir = "0.3.7"

[dev-dependencies]
rand = "0.7"
rand_chacha = "0.3.1"
rand_distr = "0.3"
quickcheck = "0.9"
log = "0.4.14"
env_logger = "0.9.0"
zerocopy = "0.6.0"
byteorder = "1.4.3"
env_logger = "0.10.0"
num-format = "0.4.4"
# heed = "0.11.0"
# rocksdb = "0.21.0"
# rusqlite = "0.29.0"
# old_sled = { version = "0.34", package = "sled" }
rand = "0.8.5"
quickcheck = "1.0.3"
rand_distr = "0.4.3"
libc = "0.2.147"

[[test]]
name = "test_crash_recovery"
path = "tests/test_crash_recovery.rs"
harness = false

1 change: 1 addition & 0 deletions LICENSE-APACHE
@@ -194,6 +194,7 @@
Copyright 2020 Tyler Neely
Copyright 2021 Tyler Neely
Copyright 2022 Tyler Neely
Copyright 2023 Tyler Neely

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
4 changes: 4 additions & 0 deletions LICENSE-MIT
@@ -1,8 +1,12 @@
Copyright (c) 2015 Tyler Neely
Copyright (c) 2016 Tyler Neely
Copyright (c) 2017 Tyler Neely
Copyright (c) 2018 Tyler Neely
Copyright (c) 2019 Tyler Neely
Copyright (c) 2020 Tyler Neely
Copyright (c) 2021 Tyler Neely
Copyright (c) 2022 Tyler Neely
Copyright (c) 2023 Tyler Neely
Review comment: nit: 2024 by the time this is merged.

Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
17 changes: 0 additions & 17 deletions benchmarks/criterion/Cargo.toml

This file was deleted.
