diff --git a/.gitignore b/.gitignore
index a2e14756a..fdd570918 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,8 @@
+fuzz-*.log
 default.sled
-crash_*
+timing_test*
 *db
+crash_test_files
 *conf
 *snap.*
 *grind.out*
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
new file mode 100644
index 000000000..94993d502
--- /dev/null
+++ b/ARCHITECTURE.md
@@ -0,0 +1,74 @@
[header table: key/value rows linking to "buy a coffee for us to convert into databases", "documentation", and "chat about databases with us"]
# sled 1.0 architecture

## in-memory

* Lock-free B+ tree index, extracted into the [`concurrent-map`](https://github.com/komora-io/concurrent-map) crate.
* The lowest key from each leaf is stored in this in-memory index.
* To read any leaf that is not already cached in memory, at most one disk read will be required.
* RwLock-backed leaves, using the `ArcRwLock` from the [`parking_lot`](https://github.com/Amanieu/parking_lot) crate. As a `Db` grows, leaf contention tends to go down in most use cases, but this may be revisited over time if many users have issues with RwLock-related contention. Avoiding full RCU for updates on the leaves results in many of the performance benefits over sled 0.34, with significantly lower memory pressure.
* A simple but very high-performance epoch-based reclamation technique is used for safely deferring frees of in-memory index data and reuse of on-disk heap slots, extracted into the [`ebr`](https://github.com/komora-io/ebr) crate.
* A scan-resistant LRU is used for handling eviction. By default, 20% of the cache is reserved for leaves that are accessed at most once. This is configurable via `Config.entry_cache_percent`, and eviction is handled by the extracted [`cache-advisor`](https://github.com/komora-io/cache-advisor) crate. The overall cache size is set by `Config.cache_capacity_bytes`.

## write path

* This is where things get interesting. There is no traditional WAL. There is no LSM. Only metadata is logged atomically after objects are written in parallel.
* The important guarantees are:
  * all previous writes are durable after a call to `Db::flush` (flush is also called periodically in the background by a flusher thread)
  * all write batches written using `Db::apply_batch` are either 100% visible or 0% visible after crash recovery. If a batch was followed by a flush that returned `Ok(())`, it is guaranteed to be present after recovery.
* Atomic ([linearizable](https://jepsen.io/consistency/models/linearizable)) durability is provided by marking dirty leaves as participants in "flush epochs" and performing atomic batch writes of the full epoch at a time, in order. Each call to `Db::flush` advances the current flush epoch by 1.
* The atomic write consists of the following steps:
  1. User code or the background flusher thread calls `Db::flush`.
  1. In parallel (via [rayon](https://docs.rs/rayon)), serialize and compress each dirty leaf with zstd (configurable via `Config.zstd_compression_level`).
  1. Based on the size of the bytes for each object, choose the smallest heap file slot that can hold the full set of bytes. This is an on-disk slab allocator.
  1. Slab slots are not power-of-two sized, but tend to increase in size by around 20% from one to the next, resulting in far lower fragmentation than typical page-oriented heaps with either constant-size or power-of-two sized leaves.
  1. Write the object to the allocated slot from the rayon threadpool.
  1. After all writes, fsync the heap files that were written to.
  1. If any writes went to the end of a heap file, causing it to grow, fsync the directory that stores all heap files.
  1. After the writes are stable, it is now safe to write an atomic metadata batch that records the location of each written leaf in the heap. This is a simple framed batch of `(low_key, slab_slot)` tuples that are initially written to a log, but eventually merged into a simple snapshot file for the metadata store once the log becomes larger than the snapshot file.
  1. Fsync the metadata log file.
  1. Fsync the metadata log directory.
  1. After the atomic metadata batch write, the previously occupied slab slots are marked for future reuse with the epoch-based reclamation system. After all threads that may have witnessed the previous location have finished their work, the slab slot is added to the free `BinaryHeap` of the slot that it belongs to so that it may be reused in future atomic write batches.
  1. Return `Ok(())` to the caller of `Db::flush`.
* The object writes that precede the metadata write land at effectively random offsets, but modern SSDs handle this well. Even though the SSD's FTL will be working harder to defragment things periodically than if we wrote a few megabytes sequentially with each write, the data that the FTL will be copying will be mostly live due to the eager leaf write-backs.

## recovery

* Recovery involves simply reading the atomic metadata store that records the low key for each written leaf as well as its location, and mapping it into the in-memory index. Any gaps in the slabs are then used as free slots.
* Any write that failed to complete its entire atomic write batch is treated as if it never happened, because no user-visible flush ever returned successfully.
* Rayon is also used here for parallelizing reads of this metadata. In general, this is extremely fast compared to the previous sled recovery process.

## tuning

* The larger the `LEAF_FANOUT` const generic on the high-level `Db` struct (default `1024`), the smaller the in-memory leaf index and the better the compression ratio of the on-disk file, but the more expensive it will be to read the entire leaf off of disk and decompress it (see the usage sketch below).
* You can set `LEAF_FANOUT` relatively low to make the system behave more like an Index+Log architecture, but overall disk size will grow and write performance will decrease.
* NB: changing `LEAF_FANOUT` after writing data is not supported.
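To make the slab-slot sizing in the write path section concrete, here is a small, self-contained illustration of the idea: size classes that grow by roughly 20% per step, with each object placed in the smallest class that fits. This is an explanatory sketch only, not sled's actual allocator code, and the 64-byte starting size is an arbitrary assumption.

```rust
/// Illustration of the ~20% slab sizing idea; not sled's allocator.
fn slot_sizes(max: u64) -> Vec<u64> {
    let mut sizes = vec![64_u64]; // arbitrary smallest class for the sketch
    while *sizes.last().unwrap() < max {
        // each class is ~20% larger than the previous one
        let next = (*sizes.last().unwrap() as f64 * 1.2).ceil() as u64;
        sizes.push(next);
    }
    sizes
}

/// Choose the smallest size class that can hold `object_len` bytes.
fn pick_slot(sizes: &[u64], object_len: u64) -> Option<u64> {
    sizes.iter().copied().find(|size| *size >= object_len)
}

fn main() {
    let sizes = slot_sizes(1 << 22);
    // e.g. a 100 KiB compressed leaf lands in a slot at most ~20% larger than it needs
    println!("{:?}", pick_slot(&sizes, 100 * 1024));
}
```

The user-facing knobs and durability guarantees described above can also be exercised end-to-end with a few calls. The following is a minimal usage sketch, not authoritative API documentation: the `Config` field names (`path`, `cache_capacity_bytes`, `entry_cache_percent`, `zstd_compression_level`, `flush_every_ms`) and the `Db<1024>` const generic are taken from `examples/bench.rs` in this change, and the `"example_db"` path is illustrative only.

```rust
use sled::{Config, Db};

fn main() {
    // LEAF_FANOUT is the `Db` const generic discussed in the tuning section;
    // 1024 is the default and must not be changed after data has been written.
    let db: Db<1024> = Config {
        path: "example_db".into(),
        // in-memory section: overall cache size and the portion reserved for
        // entries that have only been accessed once
        cache_capacity_bytes: 512 * 1024 * 1024,
        entry_cache_percent: 20,
        // write path: compression level used when dirty leaves are serialized,
        // and how often the background flusher thread flushes
        zstd_compression_level: 3,
        flush_every_ms: Some(200),
        ..Config::default()
    }
    .open()
    .unwrap();

    let key: &[u8] = &42_u32.to_be_bytes();
    db.insert(key, key).unwrap();

    // Durability guarantee from the write path section: once this returns Ok,
    // the flush epoch containing the insert above has been written, fsynced,
    // and recorded in the metadata store, so it survives a crash.
    db.flush().unwrap();
}
```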
diff --git a/Cargo.toml b/Cargo.toml
index 117f4a644..f26ff1524 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,74 +1,73 @@
 [package]
 name = "sled"
-version = "0.34.7"
-authors = ["Tyler Neely "]
+version = "1.0.0-alpha.124"
+edition = "2021"
+authors = ["Tyler Neely "]
+documentation = "https://docs.rs/sled/"
 description = "Lightweight high-performance pure-rust transactional embedded database."
-license = "MIT/Apache-2.0"
+license = "MIT OR Apache-2.0"
 homepage = "https://github.com/spacejam/sled"
 repository = "https://github.com/spacejam/sled"
 keywords = ["redis", "mongo", "sqlite", "lmdb", "rocksdb"]
 categories = ["database-implementations", "concurrency", "data-structures", "algorithms", "caching"]
-documentation = "https://docs.rs/sled/"
 readme = "README.md"
-edition = "2018"
 exclude = ["benchmarks", "examples", "bindings", "scripts", "experiments"]
-[package.metadata.docs.rs]
-features = ["docs", "metrics"]
-
-[badges]
-maintenance = { status = "actively-developed" }
+[features]
+# initializes allocated memory to 0xa1, writes 0xde to deallocated memory before freeing it
+testing-shred-allocator = []
+# use a counting global allocator that provides the sled::alloc::{allocated, freed, resident, reset} functions
+testing-count-allocator = []
+for-internal-testing-only = []
+# turn off re-use of object IDs and heap slots, disable tree leaf merges, disable heap file truncation.
+monotonic-behavior = []
 [profile.release]
 debug = true
 opt-level = 3
 overflow-checks = true
+panic = "abort"
-[features]
-default = []
-# Do not use the "testing" feature in your own testing code, this is for
-# internal testing use only.
It injects many delays and performs several -# test-only configurations that cause performance to drop significantly. -# It will cause your tests to take much more time, and possibly time out etc... -testing = ["event_log", "lock_free_delays", "light_testing"] -light_testing = ["failpoints", "backtrace", "memshred"] -lock_free_delays = [] -failpoints = [] -event_log = [] -metrics = ["num-format"] -no_logs = ["log/max_level_off"] -no_inline = [] -pretty_backtrace = ["color-backtrace"] -docs = [] -no_zstd = [] -miri_optimizations = [] -mutex = [] -memshred = [] +[profile.test] +debug = true +overflow-checks = true +panic = "abort" [dependencies] -libc = "0.2.96" -crc32fast = "1.2.1" -log = "0.4.14" -parking_lot = "0.12.1" -color-backtrace = { version = "0.5.1", optional = true } -num-format = { version = "0.4.0", optional = true } -backtrace = { version = "0.3.60", optional = true } -im = "15.1.0" - -[target.'cfg(any(target_os = "linux", target_os = "macos", target_os="windows"))'.dependencies] +bincode = "1.3.3" +cache-advisor = "1.0.16" +concurrent-map = { version = "5.0.31", features = ["serde"] } +crc32fast = "1.3.2" +ebr = "0.2.13" +inline-array = { version = "0.1.13", features = ["serde", "concurrent_map_minimum"] } fs2 = "0.4.3" +log = "0.4.19" +pagetable = "0.4.5" +parking_lot = { version = "0.12.1", features = ["arc_lock"] } +rayon = "1.7.0" +serde = { version = "1.0", features = ["derive"] } +stack-map = { version = "1.0.5", features = ["serde"] } +zstd = "0.12.4" +fnv = "1.0.7" +fault-injection = "1.0.10" +crossbeam-queue = "0.3.8" +crossbeam-channel = "0.5.8" +tempdir = "0.3.7" [dev-dependencies] -rand = "0.7" -rand_chacha = "0.3.1" -rand_distr = "0.3" -quickcheck = "0.9" -log = "0.4.14" -env_logger = "0.9.0" -zerocopy = "0.6.0" -byteorder = "1.4.3" +env_logger = "0.10.0" +num-format = "0.4.4" +# heed = "0.11.0" +# rocksdb = "0.21.0" +# rusqlite = "0.29.0" +# old_sled = { version = "0.34", package = "sled" } +rand = "0.8.5" +quickcheck = "1.0.3" +rand_distr = "0.4.3" +libc = "0.2.147" [[test]] name = "test_crash_recovery" path = "tests/test_crash_recovery.rs" harness = false + diff --git a/LICENSE-APACHE b/LICENSE-APACHE index 66199b067..5d10ac3ed 100644 --- a/LICENSE-APACHE +++ b/LICENSE-APACHE @@ -194,6 +194,7 @@ Copyright 2020 Tyler Neely Copyright 2021 Tyler Neely Copyright 2022 Tyler Neely + Copyright 2023 Tyler Neely Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/LICENSE-MIT b/LICENSE-MIT index c530b2d39..9feb8d078 100644 --- a/LICENSE-MIT +++ b/LICENSE-MIT @@ -1,8 +1,12 @@ +Copyright (c) 2015 Tyler Neely +Copyright (c) 2016 Tyler Neely +Copyright (c) 2017 Tyler Neely Copyright (c) 2018 Tyler Neely Copyright (c) 2019 Tyler Neely Copyright (c) 2020 Tyler Neely Copyright (c) 2021 Tyler Neely Copyright (c) 2022 Tyler Neely +Copyright (c) 2023 Tyler Neely Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated diff --git a/benchmarks/criterion/Cargo.toml b/benchmarks/criterion/Cargo.toml deleted file mode 100644 index 48e136f59..000000000 --- a/benchmarks/criterion/Cargo.toml +++ /dev/null @@ -1,17 +0,0 @@ -[package] -name = "critter" -publish = false -version = "0.1.0" -authors = ["Tyler Neely "] -edition = "2018" - -[[bench]] -name = "sled" -harness = false - -# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html - -[dependencies] -criterion = "0.3.0" -sled = { path = "../.." 
} -jemallocator = "0.3.2" diff --git a/benchmarks/criterion/benches/sled.rs b/benchmarks/criterion/benches/sled.rs deleted file mode 100644 index b5e3f5826..000000000 --- a/benchmarks/criterion/benches/sled.rs +++ /dev/null @@ -1,157 +0,0 @@ -use criterion::{criterion_group, criterion_main, Criterion}; - -use jemallocator::Jemalloc; - -use sled::Config; - -#[cfg_attr( - // only enable jemalloc on linux and macos by default - any(target_os = "linux", target_os = "macos"), - global_allocator -)] -static ALLOC: Jemalloc = Jemalloc; - -fn counter() -> usize { - use std::sync::atomic::{AtomicUsize, Ordering::Relaxed}; - - static C: AtomicUsize = AtomicUsize::new(0); - - C.fetch_add(1, Relaxed) -} - -/// Generates a random number in `0..n`. -fn random(n: u32) -> u32 { - use std::cell::Cell; - use std::num::Wrapping; - - thread_local! { - static RNG: Cell> = Cell::new(Wrapping(1406868647)); - } - - RNG.with(|rng| { - // This is the 32-bit variant of Xorshift. - // - // Source: https://en.wikipedia.org/wiki/Xorshift - let mut x = rng.get(); - x ^= x << 13; - x ^= x >> 17; - x ^= x << 5; - rng.set(x); - - // This is a fast alternative to `x % n`. - // - // Author: Daniel Lemire - // Source: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ - ((x.0 as u64).wrapping_mul(n as u64) >> 32) as u32 - }) -} - -fn sled_bulk_load(c: &mut Criterion) { - let mut count = 0_u32; - let mut bytes = |len| -> Vec { - count += 1; - count.to_be_bytes().into_iter().cycle().take(len).copied().collect() - }; - - let mut bench = |key_len, val_len| { - let db = Config::new() - .path(format!("bulk_k{}_v{}", key_len, val_len)) - .temporary(true) - .flush_every_ms(None) - .open() - .unwrap(); - - c.bench_function( - &format!("bulk load key/value lengths {}/{}", key_len, val_len), - |b| { - b.iter(|| { - db.insert(bytes(key_len), bytes(val_len)).unwrap(); - }) - }, - ); - }; - - for key_len in &[10_usize, 128, 256, 512] { - for val_len in &[0_usize, 10, 128, 256, 512, 1024, 2048, 4096, 8192] { - bench(*key_len, *val_len) - } - } -} - -fn sled_monotonic_crud(c: &mut Criterion) { - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); - - c.bench_function("monotonic inserts", |b| { - let mut count = 0_u32; - b.iter(|| { - count += 1; - db.insert(count.to_be_bytes(), vec![]).unwrap(); - }) - }); - - c.bench_function("monotonic gets", |b| { - let mut count = 0_u32; - b.iter(|| { - count += 1; - db.get(count.to_be_bytes()).unwrap(); - }) - }); - - c.bench_function("monotonic removals", |b| { - let mut count = 0_u32; - b.iter(|| { - count += 1; - db.remove(count.to_be_bytes()).unwrap(); - }) - }); -} - -fn sled_random_crud(c: &mut Criterion) { - const SIZE: u32 = 65536; - - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); - - c.bench_function("random inserts", |b| { - b.iter(|| { - let k = random(SIZE).to_be_bytes(); - db.insert(k, vec![]).unwrap(); - }) - }); - - c.bench_function("random gets", |b| { - b.iter(|| { - let k = random(SIZE).to_be_bytes(); - db.get(k).unwrap(); - }) - }); - - c.bench_function("random removals", |b| { - b.iter(|| { - let k = random(SIZE).to_be_bytes(); - db.remove(k).unwrap(); - }) - }); -} - -fn sled_empty_opens(c: &mut Criterion) { - let _ = std::fs::remove_dir_all("empty_opens"); - c.bench_function("empty opens", |b| { - b.iter(|| { - Config::new() - .path(format!("empty_opens/{}.db", counter())) - .flush_every_ms(None) - .open() - .unwrap() - }) - }); - let _ = std::fs::remove_dir_all("empty_opens"); -} - 
-criterion_group!( - benches, - sled_bulk_load, - sled_monotonic_crud, - sled_random_crud, - sled_empty_opens -); -criterion_main!(benches); diff --git a/benchmarks/criterion/src/lib.rs b/benchmarks/criterion/src/lib.rs deleted file mode 100644 index e69de29bb..000000000 diff --git a/benchmarks/stress2/Cargo.toml b/benchmarks/stress2/Cargo.toml deleted file mode 100644 index 3cd1daf34..000000000 --- a/benchmarks/stress2/Cargo.toml +++ /dev/null @@ -1,37 +0,0 @@ -[package] -name = "stress2" -version = "0.1.0" -authors = ["Tyler Neely "] -publish = false -edition = "2018" - -[profile.release] -panic = 'abort' -codegen-units = 1 -lto = "fat" -debug = true -overflow-checks = true - -[features] -default = [] -lock_free_delays = ["sled/lock_free_delays"] -event_log = ["sled/event_log"] -no_logs = ["sled/no_logs"] -metrics = ["sled/metrics"] -jemalloc = ["jemallocator"] -logging = ["env_logger", "log", "color-backtrace"] -dh = ["dhat"] -memshred = [] -measure_allocs = [] - -[dependencies] -rand = "0.7.3" -env_logger = { version = "0.7.1", optional = true } -log = { version = "0.4.8", optional = true } -color-backtrace = { version = "0.3.0", optional = true } -jemallocator = { version = "0.3.2", optional = true } -num-format = "0.4.0" -dhat = { version = "0.2.2", optional = true } - -[dependencies.sled] -path = "../.." diff --git a/benchmarks/stress2/lsan.sh b/benchmarks/stress2/lsan.sh deleted file mode 100755 index e33468551..000000000 --- a/benchmarks/stress2/lsan.sh +++ /dev/null @@ -1,9 +0,0 @@ -#!/usr/bin/env bash - -set -euxo pipefail - -echo "lsan" -export RUSTFLAGS="-Z sanitizer=leak" -cargo build --features=no_jemalloc --target x86_64-unknown-linux-gnu -rm -rf default.sled -target/x86_64-unknown-linux-gnu/debug/stress2 --duration=10 --set-prop=100000000 --val-len=100000 diff --git a/benchmarks/stress2/src/main.rs b/benchmarks/stress2/src/main.rs deleted file mode 100644 index db174baaf..000000000 --- a/benchmarks/stress2/src/main.rs +++ /dev/null @@ -1,456 +0,0 @@ -use std::{ - sync::{ - atomic::{AtomicBool, AtomicUsize, Ordering}, - Arc, - }, - thread, -}; - -#[cfg(feature = "dh")] -use dhat::{Dhat, DhatAlloc}; - -use num_format::{Locale, ToFormattedString}; -use rand::{thread_rng, Rng}; - -#[cfg(feature = "jemalloc")] -mod alloc { - use jemallocator::Jemalloc; - use std::alloc::Layout; - - #[global_allocator] - static ALLOCATOR: Jemalloc = Jemalloc; -} - -#[cfg(feature = "memshred")] -mod alloc { - use std::alloc::{Layout, System}; - - #[global_allocator] - static ALLOCATOR: Alloc = Alloc; - - #[derive(Default, Debug, Clone, Copy)] - struct Alloc; - - unsafe impl std::alloc::GlobalAlloc for Alloc { - unsafe fn alloc(&self, layout: Layout) -> *mut u8 { - let ret = System.alloc(layout); - assert_ne!(ret, std::ptr::null_mut()); - std::ptr::write_bytes(ret, 0xa1, layout.size()); - ret - } - - unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { - std::ptr::write_bytes(ptr, 0xde, layout.size()); - System.dealloc(ptr, layout) - } - } -} - -#[cfg(feature = "measure_allocs")] -mod alloc { - use std::alloc::{Layout, System}; - use std::sync::atomic::{AtomicUsize, Ordering::Release}; - - pub static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0); - pub static ALLOCATED_BYTES: AtomicUsize = AtomicUsize::new(0); - - #[global_allocator] - static ALLOCATOR: Alloc = Alloc; - - #[derive(Default, Debug, Clone, Copy)] - struct Alloc; - - unsafe impl std::alloc::GlobalAlloc for Alloc { - unsafe fn alloc(&self, layout: Layout) -> *mut u8 { - ALLOCATIONS.fetch_add(1, Release); - 
ALLOCATED_BYTES.fetch_add(layout.size(), Release);
-            System.alloc(layout)
-        }
-        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
-            System.dealloc(ptr, layout)
-        }
-    }
-}
-
-#[global_allocator]
-#[cfg(feature = "dh")]
-static ALLOCATOR: DhatAlloc = DhatAlloc;
-
-static TOTAL: AtomicUsize = AtomicUsize::new(0);
-
-const USAGE: &str = "
-Usage: stress [--threads=<#>] [--burn-in] [--duration=<s>] \
-    [--key-len=<l>] [--val-len=<l>] \
-    [--get-prop=<p>] \
-    [--set-prop=<p>] \
-    [--del-prop=<p>] \
-    [--cas-prop=<p>] \
-    [--scan-prop=<p>] \
-    [--merge-prop=<p>] \
-    [--entries=<n>] \
-    [--sequential] \
-    [--total-ops=<n>] \
-    [--flush-every=<ms>]
-
-Options:
-    --threads=<#>      Number of threads [default: 4].
-    --burn-in          Don't halt until we receive a signal.
-    --duration=<s>     Seconds to run for [default: 10].
-    --key-len=<l>      The length of keys [default: 10].
-    --val-len=<l>      The length of values [default: 100].
-    --get-prop=<p>     The relative proportion of get requests [default: 94].
-    --set-prop=<p>     The relative proportion of set requests [default: 2].
-    --del-prop=<p>     The relative proportion of del requests [default: 1].
-    --cas-prop=<p>     The relative proportion of cas requests [default: 1].
-    --scan-prop=<p>    The relative proportion of scan requests [default: 1].
-    --merge-prop=<p>
The relative proportion of merge requests [default: 1]. - --entries= The total keyspace [default: 100000]. - --sequential Run the test in sequential mode instead of random. - --total-ops= Stop test after executing a total number of operations. - --flush-every= Flush and sync the database every ms [default: 200]. - --cache-mb= Size of the page cache in megabytes [default: 1024]. -"; - -#[derive(Debug, Clone, Copy)] -struct Args { - threads: usize, - burn_in: bool, - duration: u64, - key_len: usize, - val_len: usize, - get_prop: usize, - set_prop: usize, - del_prop: usize, - cas_prop: usize, - scan_prop: usize, - merge_prop: usize, - entries: usize, - sequential: bool, - total_ops: Option, - flush_every: u64, - cache_mb: usize, -} - -impl Default for Args { - fn default() -> Args { - Args { - threads: 4, - burn_in: false, - duration: 10, - key_len: 10, - val_len: 100, - get_prop: 94, - set_prop: 2, - del_prop: 1, - cas_prop: 1, - scan_prop: 1, - merge_prop: 1, - entries: 100000, - sequential: false, - total_ops: None, - flush_every: 200, - cache_mb: 1024, - } - } -} - -fn parse<'a, I, T>(mut iter: I) -> T -where - I: Iterator, - T: std::str::FromStr, - ::Err: std::fmt::Debug, -{ - iter.next().expect(USAGE).parse().expect(USAGE) -} - -impl Args { - fn parse() -> Args { - let mut args = Args::default(); - for raw_arg in std::env::args().skip(1) { - let mut splits = raw_arg[2..].split('='); - match splits.next().unwrap() { - "threads" => args.threads = parse(&mut splits), - "burn-in" => args.burn_in = true, - "duration" => args.duration = parse(&mut splits), - "key-len" => args.key_len = parse(&mut splits), - "val-len" => args.val_len = parse(&mut splits), - "get-prop" => args.get_prop = parse(&mut splits), - "set-prop" => args.set_prop = parse(&mut splits), - "del-prop" => args.del_prop = parse(&mut splits), - "cas-prop" => args.cas_prop = parse(&mut splits), - "scan-prop" => args.scan_prop = parse(&mut splits), - "merge-prop" => args.merge_prop = parse(&mut splits), - "entries" => args.entries = parse(&mut splits), - "sequential" => args.sequential = true, - "total-ops" => args.total_ops = Some(parse(&mut splits)), - "flush-every" => args.flush_every = parse(&mut splits), - "cache-mb" => args.cache_mb = parse(&mut splits), - other => panic!("unknown option: {}, {}", other, USAGE), - } - } - args - } -} - -fn report(shutdown: Arc) { - let mut last = 0; - while !shutdown.load(Ordering::Relaxed) { - thread::sleep(std::time::Duration::from_secs(1)); - let total = TOTAL.load(Ordering::Acquire); - - println!( - "did {} ops, {}mb RSS", - (total - last).to_formatted_string(&Locale::en), - rss() / (1024 * 1024) - ); - - last = total; - } -} - -fn concatenate_merge( - _key: &[u8], // the key being merged - old_value: Option<&[u8]>, // the previous value, if one existed - merged_bytes: &[u8], // the new bytes being merged in -) -> Option> { - // set the new value, return None to delete - let mut ret = old_value.map(|ov| ov.to_vec()).unwrap_or_else(Vec::new); - - ret.extend_from_slice(merged_bytes); - - Some(ret) -} - -fn run(args: Args, tree: Arc, shutdown: Arc) { - let get_max = args.get_prop; - let set_max = get_max + args.set_prop; - let del_max = set_max + args.del_prop; - let cas_max = del_max + args.cas_prop; - let merge_max = cas_max + args.merge_prop; - let scan_max = merge_max + args.scan_prop; - - let keygen = |len| -> sled::IVec { - static SEQ: AtomicUsize = AtomicUsize::new(0); - let i = if args.sequential { - SEQ.fetch_add(1, Ordering::Relaxed) - } else { - thread_rng().gen::() - } % 
args.entries; - - let start = if len < 8 { 8 - len } else { 0 }; - - let i_keygen = &i.to_be_bytes()[start..]; - - i_keygen.iter().cycle().take(len).copied().collect() - }; - - let valgen = |len| -> sled::IVec { - if len == 0 { - return vec![].into(); - } - - let i: usize = thread_rng().gen::() % (len * 8); - - let i_keygen = i.to_be_bytes(); - - i_keygen - .iter() - .skip_while(|v| **v == 0) - .cycle() - .take(len) - .copied() - .collect() - }; - - let mut rng = thread_rng(); - - while !shutdown.load(Ordering::Relaxed) { - let op = TOTAL.fetch_add(1, Ordering::Release); - let key = keygen(args.key_len); - let choice = rng.gen_range(0, scan_max + 1); - - match choice { - v if v <= get_max => { - tree.get_zero_copy(&key, |_| {}).unwrap(); - } - v if v > get_max && v <= set_max => { - let value = valgen(args.val_len); - tree.insert(&key, value).unwrap(); - } - v if v > set_max && v <= del_max => { - tree.remove(&key).unwrap(); - } - v if v > del_max && v <= cas_max => { - let old = if rng.gen::() { - let value = valgen(args.val_len); - Some(value) - } else { - None - }; - - let new = if rng.gen::() { - let value = valgen(args.val_len); - Some(value) - } else { - None - }; - - if let Err(e) = tree.compare_and_swap(&key, old, new) { - panic!("operational error: {:?}", e); - } - } - v if v > cas_max && v <= merge_max => { - let value = valgen(args.val_len); - tree.merge(&key, value).unwrap(); - } - _ => { - let iter = tree.range(key..).map(|res| res.unwrap()); - - if op % 2 == 0 { - let _ = iter.take(rng.gen_range(0, 15)).collect::>(); - } else { - let _ = iter - .rev() - .take(rng.gen_range(0, 15)) - .collect::>(); - } - } - } - } -} - -fn rss() -> usize { - #[cfg(target_os = "linux")] - { - use std::io::prelude::*; - use std::io::BufReader; - - let mut buf = String::new(); - let mut f = - BufReader::new(std::fs::File::open("/proc/self/statm").unwrap()); - f.read_line(&mut buf).unwrap(); - let mut parts = buf.split_whitespace(); - let rss_pages = parts.nth(1).unwrap().parse::().unwrap(); - rss_pages * 4096 - } - #[cfg(not(target_os = "linux"))] - { - 0 - } -} - -fn main() { - #[cfg(feature = "logging")] - setup_logger(); - - #[cfg(feature = "dh")] - let _dh = Dhat::start_heap_profiling(); - - let args = Args::parse(); - - let shutdown = Arc::new(AtomicBool::new(false)); - - dbg!(args); - - let config = sled::Config::new() - .cache_capacity(args.cache_mb * 1024 * 1024) - .flush_every_ms(if args.flush_every == 0 { - None - } else { - Some(args.flush_every) - }); - - let tree = Arc::new(config.open().unwrap()); - tree.set_merge_operator(concatenate_merge); - - let mut threads = vec![]; - - let now = std::time::Instant::now(); - - let n_threads = args.threads; - - for i in 0..=n_threads { - let tree = tree.clone(); - let shutdown = shutdown.clone(); - - let t = if i == 0 { - thread::Builder::new() - .name("reporter".into()) - .spawn(move || report(shutdown)) - .unwrap() - } else { - thread::spawn(move || run(args, tree, shutdown)) - }; - - threads.push(t); - } - - if let Some(ops) = args.total_ops { - assert!(!args.burn_in, "don't set both --burn-in and --total-ops"); - while TOTAL.load(Ordering::Relaxed) < ops { - thread::sleep(std::time::Duration::from_millis(50)); - } - shutdown.store(true, Ordering::SeqCst); - } else if !args.burn_in { - thread::sleep(std::time::Duration::from_secs(args.duration)); - shutdown.store(true, Ordering::SeqCst); - } - - for t in threads.into_iter() { - t.join().unwrap(); - } - let ops = TOTAL.load(Ordering::SeqCst); - let time = now.elapsed().as_secs() as usize; - - 
println!( - "did {} total ops in {} seconds. {} ops/s", - ops.to_formatted_string(&Locale::en), - time, - ((ops * 1_000) / (time * 1_000)).to_formatted_string(&Locale::en) - ); - - #[cfg(feature = "measure_allocs")] - println!( - "allocated {} bytes in {} allocations", - alloc::ALLOCATED_BYTES - .load(Ordering::Acquire) - .to_formatted_string(&Locale::en), - alloc::ALLOCATIONS - .load(Ordering::Acquire) - .to_formatted_string(&Locale::en), - ); - - #[cfg(feature = "metrics")] - sled::print_profile(); -} - -#[cfg(feature = "logging")] -pub fn setup_logger() { - use std::io::Write; - - color_backtrace::install(); - - fn tn() -> String { - std::thread::current().name().unwrap_or("unknown").to_owned() - } - - let mut builder = env_logger::Builder::new(); - builder - .format(|buf, record| { - writeln!( - buf, - "{:05} {:25} {:10} {}", - record.level(), - tn(), - record.module_path().unwrap().split("::").last().unwrap(), - record.args() - ) - }) - .filter(None, log::LevelFilter::Info); - - if let Ok(env) = std::env::var("RUST_LOG") { - builder.parse_filters(&env); - } - - let _r = builder.try_init(); -} diff --git a/benchmarks/stress2/tsan.sh b/benchmarks/stress2/tsan.sh deleted file mode 100755 index f71d53345..000000000 --- a/benchmarks/stress2/tsan.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/usr/bin/env bash - -set -euxo pipefail - -echo "tsan" -export RUSTFLAGS="-Z sanitizer=thread" -export TSAN_OPTIONS="suppressions=/home/t/src/sled/tsan_suppressions.txt" -sudo rm -rf default.sled -cargo +nightly run --features=lock_free_delays,no_jemalloc --target x86_64-unknown-linux-gnu -- --duration=6 -cargo +nightly run --features=lock_free_delays,no_jemalloc --target x86_64-unknown-linux-gnu -- --duration=6 diff --git a/bindings/sled-native/Cargo.toml b/bindings/sled-native/Cargo.toml deleted file mode 100644 index d8a6d800a..000000000 --- a/bindings/sled-native/Cargo.toml +++ /dev/null @@ -1,19 +0,0 @@ -[package] -name = "sled-native" -version = "0.34.6" -authors = ["Tyler Neely "] -description = "a C-compatible API for sled" -license = "Apache-2.0" -homepage = "https://github.com/spacejam/sled" -repository = "https://github.com/spacejam/sled/sled-native" -keywords = ["database", "embedded", "concurrent", "persistent", "c"] -documentation = "https://docs.rs/sled-native/" -edition = "2018" - -[lib] -name = "sled" -crate-type = ["cdylib", "staticlib"] - -[dependencies] -libc = "0.2.62" -sled = {version = "0.34.6", path = "../.."} diff --git a/bindings/sled-native/README.md b/bindings/sled-native/README.md deleted file mode 100644 index 980f43a75..000000000 --- a/bindings/sled-native/README.md +++ /dev/null @@ -1,11 +0,0 @@ -# Native C-API for sled - -## Building - -``` -$ cargo install cargo-c -$ cargo cinstall --prefix=/usr --destdir=/tmp/staging -$ sudo cp -a /tmp/staging/* / -``` - - diff --git a/bindings/sled-native/cbindgen.toml b/bindings/sled-native/cbindgen.toml deleted file mode 100644 index d8657dd8a..000000000 --- a/bindings/sled-native/cbindgen.toml +++ /dev/null @@ -1,20 +0,0 @@ -header = "// SPDX-License-Identifier: Apache-2.0" -sys_includes = ["stddef.h", "stdint.h", "stdlib.h"] -no_includes = true -include_guard = "SLED_H" -tab_width = 4 -style = "Type" -# language = "C" -cpp_compat = true - -[parse] -parse_deps = true -include = ['sled'] - -[export] -prefix = "Sled" -item_types = ["enums", "structs", "unions", "typedefs", "opaque", "functions"] - -[enum] -rename_variants = "ScreamingSnakeCase" -prefix_with_name = true diff --git a/bindings/sled-native/src/lib.rs 
b/bindings/sled-native/src/lib.rs deleted file mode 100644 index c5dc6dcc6..000000000 --- a/bindings/sled-native/src/lib.rs +++ /dev/null @@ -1,236 +0,0 @@ -use sled; - -use std::ffi::CString; -use std::mem; -use std::ptr; -use std::slice; - -use libc::*; - -use sled::{Config, Db, IVec, Iter}; - -fn leak_buf(v: Vec, vallen: *mut size_t) -> *mut c_char { - unsafe { - *vallen = v.len(); - } - let mut bsv = v.into_boxed_slice(); - let val = bsv.as_mut_ptr() as *mut _; - mem::forget(bsv); - val -} - -/// Create a new configuration. -#[no_mangle] -pub unsafe extern "C" fn sled_create_config() -> *mut Config { - Box::into_raw(Box::new(Config::new())) -} - -/// Destroy a configuration. -#[no_mangle] -pub unsafe extern "C" fn sled_free_config(config: *mut Config) { - drop(Box::from_raw(config)); -} - -/// Set the configured file path. The caller is responsible for freeing the path -/// string after calling this (it is copied in this function). -#[no_mangle] -pub unsafe extern "C" fn sled_config_set_path( - config: *mut Config, - path: *const c_char, -) -> *mut Config { - let c_str = CString::from_raw(path as *mut _); - let value = c_str.into_string().unwrap(); - - let config = Box::from_raw(config); - Box::into_raw(Box::from(config.path(value))) -} - -/// Set the configured cache capacity in bytes. -#[no_mangle] -pub unsafe extern "C" fn sled_config_set_cache_capacity( - config: *mut Config, - capacity: size_t, -) -> *mut Config { - let config = Box::from_raw(config); - Box::into_raw(Box::from(config.cache_capacity(capacity as u64))) -} - -/// Configure the use of the zstd compression library. -#[no_mangle] -pub unsafe extern "C" fn sled_config_use_compression( - config: *mut Config, - use_compression: c_uchar, -) -> *mut Config { - let config = Box::from_raw(config); - Box::into_raw(Box::from(config.use_compression(use_compression == 1))) -} - -/// Set the configured IO buffer flush interval in milliseconds. -#[no_mangle] -pub unsafe extern "C" fn sled_config_flush_every_ms( - config: *mut Config, - flush_every: c_int, -) -> *mut Config { - let val = if flush_every < 0 { None } else { Some(flush_every as u64) }; - let config = Box::from_raw(config); - Box::into_raw(Box::from(config.flush_every_ms(val))) -} - -/// Open a sled lock-free log-structured tree. Consumes the passed-in config. -#[no_mangle] -pub unsafe extern "C" fn sled_open_db(config: *mut Config) -> *mut Db { - let config = Box::from_raw(config); - Box::into_raw(Box::new(config.open().unwrap())) -} - -/// Close a sled lock-free log-structured tree. -#[no_mangle] -pub unsafe extern "C" fn sled_close(db: *mut Db) { - drop(Box::from_raw(db)); -} - -/// Free a buffer originally allocated by sled. -#[no_mangle] -pub unsafe extern "C" fn sled_free_buf(buf: *mut c_char, sz: size_t) { - drop(Vec::from_raw_parts(buf, sz, sz)); -} - -/// Free an iterator. -#[no_mangle] -pub unsafe extern "C" fn sled_free_iter(iter: *mut Iter) { - drop(Box::from_raw(iter)); -} - -/// Set a key to a value. -#[no_mangle] -pub unsafe extern "C" fn sled_set( - db: *mut Db, - key: *const c_uchar, - keylen: size_t, - val: *const c_uchar, - vallen: size_t, -) { - let k = IVec::from(slice::from_raw_parts(key, keylen)); - let v = IVec::from(slice::from_raw_parts(val, vallen)); - (*db).insert(k, v).unwrap(); -} - -/// Get the value of a key. -/// Caller is responsible for freeing the returned value with `sled_free_buf` if -/// it's non-null. 
-#[no_mangle] -pub unsafe extern "C" fn sled_get( - db: *mut Db, - key: *const c_char, - keylen: size_t, - vallen: *mut size_t, -) -> *mut c_char { - let k = slice::from_raw_parts(key as *const u8, keylen); - let res = (*db).get(k); - match res { - Ok(Some(v)) => leak_buf(v.to_vec(), vallen), - Ok(None) => ptr::null_mut(), - // TODO proper error propagation - Err(e) => panic!("{:?}", e), - } -} - -/// Delete the value of a key. -#[no_mangle] -pub unsafe extern "C" fn sled_del( - db: *mut Db, - key: *const c_char, - keylen: size_t, -) { - let k = slice::from_raw_parts(key as *const u8, keylen); - (*db).remove(k).unwrap(); -} - -/// Compare and swap. -/// Returns 1 if successful, 0 if unsuccessful. -/// Otherwise sets `actual_val` and `actual_vallen` to the current value, -/// which must be freed using `sled_free_buf` by the caller if non-null. -/// `actual_val` will be null and `actual_vallen` 0 if the current value is not -/// set. -#[no_mangle] -pub unsafe extern "C" fn sled_compare_and_swap( - db: *mut Db, - key: *const c_char, - keylen: size_t, - old_val: *const c_uchar, - old_vallen: size_t, - new_val: *const c_uchar, - new_vallen: size_t, - actual_val: *mut *const c_uchar, - actual_vallen: *mut size_t, -) -> c_uchar { - let k = IVec::from(slice::from_raw_parts(key as *const u8, keylen)); - - let old = if old_vallen == 0 { - None - } else { - let copy = - IVec::from(slice::from_raw_parts(old_val as *const u8, old_vallen)); - Some(copy) - }; - - let new = if new_vallen == 0 { - None - } else { - let copy = - IVec::from(slice::from_raw_parts(new_val as *const u8, new_vallen)); - Some(copy) - }; - - let res = (*db).compare_and_swap(k, old, new); - - match res { - Ok(Ok(())) => 1, - Ok(Err(sled::CompareAndSwapError { current: None, .. })) => { - *actual_vallen = 0; - 0 - } - Ok(Err(sled::CompareAndSwapError { current: Some(v), .. })) => { - *actual_val = leak_buf(v.to_vec(), actual_vallen) as *const u8; - 0 - } - // TODO proper error propagation - Err(e) => panic!("{:?}", e), - } -} - -/// Iterate over tuples which have specified key prefix. -/// Caller is responsible for freeing the returned iterator with -/// `sled_free_iter`. -#[no_mangle] -pub unsafe extern "C" fn sled_scan_prefix( - db: *mut Db, - key: *const c_char, - keylen: size_t, -) -> *mut Iter { - let k = slice::from_raw_parts(key as *const u8, keylen); - Box::into_raw(Box::new((*db).scan_prefix(k))) -} - -/// Get they next kv pair from an iterator. -/// Caller is responsible for freeing the key and value with `sled_free_buf`. -/// Returns 0 when exhausted. 
-#[no_mangle] -pub unsafe extern "C" fn sled_iter_next( - iter: *mut Iter, - key: *mut *const c_char, - keylen: *mut size_t, - val: *mut *const c_char, - vallen: *mut size_t, -) -> c_uchar { - match (*iter).next() { - Some(Ok((k, v))) => { - *key = leak_buf(k.to_vec(), keylen); - *val = leak_buf(v.to_vec(), vallen); - 1 - } - // TODO proper error propagation - Some(Err(e)) => panic!("{:?}", e), - None => 0, - } -} diff --git a/examples/bench.rs b/examples/bench.rs new file mode 100644 index 000000000..c524ab909 --- /dev/null +++ b/examples/bench.rs @@ -0,0 +1,610 @@ +use std::path::Path; +use std::sync::Barrier; +use std::thread::scope; +use std::time::{Duration, Instant}; +use std::{fs, io}; + +use num_format::{Locale, ToFormattedString}; + +use sled::{Config, Db as SledDb}; + +type Db = SledDb<1024>; + +const N_WRITES_PER_THREAD: u32 = 4 * 1024 * 1024; +const MAX_CONCURRENCY: u32 = 4; +const CONCURRENCY: &[usize] = &[/*1, 2, 4,*/ MAX_CONCURRENCY as _]; +const BYTES_PER_ITEM: u32 = 8; + +trait Databench: Clone + Send { + type READ: AsRef<[u8]>; + const NAME: &'static str; + const PATH: &'static str; + fn open() -> Self; + fn remove_generic(&self, key: &[u8]); + fn insert_generic(&self, key: &[u8], value: &[u8]); + fn get_generic(&self, key: &[u8]) -> Option; + fn flush_generic(&self); + fn print_stats(&self); +} + +impl Databench for Db { + type READ = sled::InlineArray; + + const NAME: &'static str = "sled 1.0.0-alpha"; + const PATH: &'static str = "timing_test.sled-new"; + + fn open() -> Self { + sled::Config { + path: Self::PATH.into(), + zstd_compression_level: 3, + cache_capacity_bytes: 1024 * 1024 * 1024, + entry_cache_percent: 20, + flush_every_ms: Some(200), + ..Config::default() + } + .open() + .unwrap() + } + + fn insert_generic(&self, key: &[u8], value: &[u8]) { + self.insert(key, value).unwrap(); + } + fn remove_generic(&self, key: &[u8]) { + self.remove(key).unwrap(); + } + fn get_generic(&self, key: &[u8]) -> Option { + self.get(key).unwrap() + } + fn flush_generic(&self) { + self.flush().unwrap(); + } + fn print_stats(&self) { + dbg!(self.stats()); + } +} + +/* +impl Databench for old_sled::Db { + type READ = old_sled::IVec; + + const NAME: &'static str = "sled 0.34.7"; + const PATH: &'static str = "timing_test.sled-old"; + + fn open() -> Self { + old_sled::open(Self::PATH).unwrap() + } + fn insert_generic(&self, key: &[u8], value: &[u8]) { + self.insert(key, value).unwrap(); + } + fn get_generic(&self, key: &[u8]) -> Option { + self.get(key).unwrap() + } + fn flush_generic(&self) { + self.flush().unwrap(); + } +} +*/ + +/* +impl Databench for Arc { + type READ = Vec; + + const NAME: &'static str = "rocksdb 0.21.0"; + const PATH: &'static str = "timing_test.rocksdb"; + + fn open() -> Self { + Arc::new(rocksdb::DB::open_default(Self::PATH).unwrap()) + } + fn insert_generic(&self, key: &[u8], value: &[u8]) { + self.put(key, value).unwrap(); + } + fn get_generic(&self, key: &[u8]) -> Option { + self.get(key).unwrap() + } + fn flush_generic(&self) { + self.flush().unwrap(); + } +} +*/ + +/* +struct Lmdb { + env: heed::Env, + db: heed::Database< + heed::types::UnalignedSlice, + heed::types::UnalignedSlice, + >, +} + +impl Clone for Lmdb { + fn clone(&self) -> Lmdb { + Lmdb { env: self.env.clone(), db: self.db.clone() } + } +} + +impl Databench for Lmdb { + type READ = Vec; + + const NAME: &'static str = "lmdb"; + const PATH: &'static str = "timing_test.lmdb"; + + fn open() -> Self { + let _ = std::fs::create_dir_all(Self::PATH); + let env = heed::EnvOpenOptions::new() + 
.map_size(1024 * 1024 * 1024) + .open(Self::PATH) + .unwrap(); + let db = env.create_database(None).unwrap(); + Lmdb { env, db } + } + fn insert_generic(&self, key: &[u8], value: &[u8]) { + let mut wtxn = self.env.write_txn().unwrap(); + self.db.put(&mut wtxn, key, value).unwrap(); + wtxn.commit().unwrap(); + } + fn get_generic(&self, key: &[u8]) -> Option { + let rtxn = self.env.read_txn().unwrap(); + let ret = self.db.get(&rtxn, key).unwrap().map(Vec::from); + rtxn.commit().unwrap(); + ret + } + fn flush_generic(&self) { + // NOOP + } +} +*/ + +/* +struct Sqlite { + connection: rusqlite::Connection, +} + +impl Clone for Sqlite { + fn clone(&self) -> Sqlite { + Sqlite { connection: rusqlite::Connection::open(Self::PATH).unwrap() } + } +} + +impl Databench for Sqlite { + type READ = Vec; + + const NAME: &'static str = "sqlite"; + const PATH: &'static str = "timing_test.sqlite"; + + fn open() -> Self { + let connection = rusqlite::Connection::open(Self::PATH).unwrap(); + connection + .execute( + "create table if not exists bench ( + key integer primary key, + val integer not null + )", + [], + ) + .unwrap(); + Sqlite { connection } + } + fn insert_generic(&self, key: &[u8], value: &[u8]) { + loop { + let res = self.connection.execute( + "insert or ignore into bench (key, val) values (?1, ?2)", + [ + format!("{}", u32::from_be_bytes(key.try_into().unwrap())), + format!( + "{}", + u32::from_be_bytes(value.try_into().unwrap()) + ), + ], + ); + if res.is_ok() { + break; + } + } + } + fn get_generic(&self, key: &[u8]) -> Option { + let mut stmt = self + .connection + .prepare("SELECT b.val from bench b WHERE key = ?1") + .unwrap(); + let mut rows = + stmt.query([u32::from_be_bytes(key.try_into().unwrap())]).unwrap(); + + let value = rows.next().unwrap()?; + value.get(0).ok() + } + fn flush_generic(&self) { + // NOOP + } +} +*/ + +fn allocated() -> usize { + #[cfg(feature = "testing-count-allocator")] + { + return sled::alloc::allocated(); + } + 0 +} + +fn freed() -> usize { + #[cfg(feature = "testing-count-allocator")] + { + return sled::alloc::freed(); + } + 0 +} + +fn resident() -> usize { + #[cfg(feature = "testing-count-allocator")] + { + return sled::alloc::resident(); + } + 0 +} + +fn inserts(store: &D) -> Vec { + println!("{} inserts", D::NAME); + let mut i = 0_u32; + + let factory = move || { + i += 1; + (store.clone(), i - 1) + }; + + let f = |state: (D, u32)| { + let (store, offset) = state; + let start = N_WRITES_PER_THREAD * offset; + let end = N_WRITES_PER_THREAD * (offset + 1); + for i in start..end { + let k: &[u8] = &i.to_be_bytes(); + store.insert_generic(k, k); + } + }; + + let mut ret = vec![]; + + for concurrency in CONCURRENCY { + let insert_elapsed = + execute_lockstep_concurrent(factory, f, *concurrency); + + let flush_timer = Instant::now(); + store.flush_generic(); + + let wps = (N_WRITES_PER_THREAD * *concurrency as u32) as u64 + * 1_000_000_u64 + / u64::try_from(insert_elapsed.as_micros().max(1)) + .unwrap_or(u64::MAX); + + ret.push(InsertStats { + thread_count: *concurrency, + inserts_per_second: wps, + }); + + println!( + "{} inserts/s with {concurrency} threads over {:?}, then {:?} to flush {}", + wps.to_formatted_string(&Locale::en), + insert_elapsed, + flush_timer.elapsed(), + D::NAME, + ); + } + + ret +} + +fn removes(store: &D) -> Vec { + println!("{} removals", D::NAME); + let mut i = 0_u32; + + let factory = move || { + i += 1; + (store.clone(), i - 1) + }; + + let f = |state: (D, u32)| { + let (store, offset) = state; + let start = N_WRITES_PER_THREAD * 
offset; + let end = N_WRITES_PER_THREAD * (offset + 1); + for i in start..end { + let k: &[u8] = &i.to_be_bytes(); + store.remove_generic(k); + } + }; + + let mut ret = vec![]; + + for concurrency in CONCURRENCY { + let remove_elapsed = + execute_lockstep_concurrent(factory, f, *concurrency); + + let flush_timer = Instant::now(); + store.flush_generic(); + + let wps = (N_WRITES_PER_THREAD * *concurrency as u32) as u64 + * 1_000_000_u64 + / u64::try_from(remove_elapsed.as_micros().max(1)) + .unwrap_or(u64::MAX); + + ret.push(RemoveStats { + thread_count: *concurrency, + removes_per_second: wps, + }); + + println!( + "{} removes/s with {concurrency} threads over {:?}, then {:?} to flush {}", + wps.to_formatted_string(&Locale::en), + remove_elapsed, + flush_timer.elapsed(), + D::NAME, + ); + } + + ret +} + +fn gets(store: &D) -> Vec { + println!("{} reads", D::NAME); + + let factory = || store.clone(); + + let f = |store: D| { + let start = 0; + let end = N_WRITES_PER_THREAD * MAX_CONCURRENCY; + for i in start..end { + let k: &[u8] = &i.to_be_bytes(); + store.get_generic(k); + } + }; + + let mut ret = vec![]; + + for concurrency in CONCURRENCY { + let get_stone_elapsed = + execute_lockstep_concurrent(factory, f, *concurrency); + + let rps = (N_WRITES_PER_THREAD * MAX_CONCURRENCY * *concurrency as u32) + as u64 + * 1_000_000_u64 + / u64::try_from(get_stone_elapsed.as_micros().max(1)) + .unwrap_or(u64::MAX); + + ret.push(GetStats { thread_count: *concurrency, gets_per_second: rps }); + + println!( + "{} gets/s with concurrency of {concurrency}, {:?} total reads {}", + rps.to_formatted_string(&Locale::en), + get_stone_elapsed, + D::NAME + ); + } + ret +} + +fn execute_lockstep_concurrent< + State: Send, + Factory: FnMut() -> State, + F: Sync + Fn(State), +>( + mut factory: Factory, + f: F, + concurrency: usize, +) -> Duration { + let barrier = &Barrier::new(concurrency + 1); + let f = &f; + + scope(|s| { + let mut threads = vec![]; + + for _ in 0..concurrency { + let state = factory(); + + let thread = s.spawn(move || { + barrier.wait(); + f(state); + }); + + threads.push(thread); + } + + barrier.wait(); + let get_stone = Instant::now(); + + for thread in threads.into_iter() { + thread.join().unwrap(); + } + + get_stone.elapsed() + }) +} + +#[derive(Debug, Clone, Copy)] +struct InsertStats { + thread_count: usize, + inserts_per_second: u64, +} + +#[derive(Debug, Clone, Copy)] +struct GetStats { + thread_count: usize, + gets_per_second: u64, +} + +#[derive(Debug, Clone, Copy)] +struct RemoveStats { + thread_count: usize, + removes_per_second: u64, +} + +#[allow(unused)] +#[derive(Debug, Clone)] +struct Stats { + post_insert_disk_space: u64, + post_remove_disk_space: u64, + allocated_memory: usize, + freed_memory: usize, + resident_memory: usize, + insert_stats: Vec, + get_stats: Vec, + remove_stats: Vec, +} + +impl Stats { + fn print_report(&self) { + println!( + "bytes on disk after inserts: {}", + self.post_insert_disk_space.to_formatted_string(&Locale::en) + ); + println!( + "bytes on disk after removes: {}", + self.post_remove_disk_space.to_formatted_string(&Locale::en) + ); + println!( + "bytes in memory: {}", + self.resident_memory.to_formatted_string(&Locale::en) + ); + for stats in &self.insert_stats { + println!( + "{} threads {} inserts per second", + stats.thread_count, + stats.inserts_per_second.to_formatted_string(&Locale::en) + ); + } + for stats in &self.get_stats { + println!( + "{} threads {} gets per second", + stats.thread_count, + 
stats.gets_per_second.to_formatted_string(&Locale::en) + ); + } + for stats in &self.remove_stats { + println!( + "{} threads {} removes per second", + stats.thread_count, + stats.removes_per_second.to_formatted_string(&Locale::en) + ); + } + } +} + +fn bench() -> Stats { + let store = D::open(); + + let insert_stats = inserts(&store); + + let before_flush = Instant::now(); + store.flush_generic(); + println!("final flush took {:?} for {}", before_flush.elapsed(), D::NAME); + + let post_insert_disk_space = du(D::PATH.as_ref()).unwrap(); + + let get_stats = gets(&store); + + let remove_stats = removes(&store); + + store.print_stats(); + + Stats { + post_insert_disk_space, + post_remove_disk_space: du(D::PATH.as_ref()).unwrap(), + allocated_memory: allocated(), + freed_memory: freed(), + resident_memory: resident(), + insert_stats, + get_stats, + remove_stats, + } +} + +fn du(path: &Path) -> io::Result { + fn recurse(mut dir: fs::ReadDir) -> io::Result { + dir.try_fold(0, |acc, file| { + let file = file?; + let size = match file.metadata()? { + data if data.is_dir() => recurse(fs::read_dir(file.path())?)?, + data => data.len(), + }; + Ok(acc + size) + }) + } + + recurse(fs::read_dir(path)?) +} + +fn main() { + let _ = env_logger::try_init(); + + let new_stats = bench::(); + + println!( + "raw data size: {}", + (MAX_CONCURRENCY * N_WRITES_PER_THREAD * BYTES_PER_ITEM) + .to_formatted_string(&Locale::en) + ); + println!("sled 1.0 space stats:"); + new_stats.print_report(); + + /* + let old_stats = bench::(); + dbg!(old_stats); + + let new_sled_vs_old_sled_storage_ratio = + new_stats.disk_space as f64 / old_stats.disk_space as f64; + let new_sled_vs_old_sled_allocated_memory_ratio = + new_stats.allocated_memory as f64 / old_stats.allocated_memory as f64; + let new_sled_vs_old_sled_freed_memory_ratio = + new_stats.freed_memory as f64 / old_stats.freed_memory as f64; + let new_sled_vs_old_sled_resident_memory_ratio = + new_stats.resident_memory as f64 / old_stats.resident_memory as f64; + + dbg!(new_sled_vs_old_sled_storage_ratio); + dbg!(new_sled_vs_old_sled_allocated_memory_ratio); + dbg!(new_sled_vs_old_sled_freed_memory_ratio); + dbg!(new_sled_vs_old_sled_resident_memory_ratio); + + let rocksdb_stats = bench::>(); + + bench::(); + + bench::(); + */ + + /* + let new_sled_vs_rocksdb_storage_ratio = + new_stats.disk_space as f64 / rocksdb_stats.disk_space as f64; + let new_sled_vs_rocksdb_allocated_memory_ratio = + new_stats.allocated_memory as f64 / rocksdb_stats.allocated_memory as f64; + let new_sled_vs_rocksdb_freed_memory_ratio = + new_stats.freed_memory as f64 / rocksdb_stats.freed_memory as f64; + let new_sled_vs_rocksdb_resident_memory_ratio = + new_stats.resident_memory as f64 / rocksdb_stats.resident_memory as f64; + + dbg!(new_sled_vs_rocksdb_storage_ratio); + dbg!(new_sled_vs_rocksdb_allocated_memory_ratio); + dbg!(new_sled_vs_rocksdb_freed_memory_ratio); + dbg!(new_sled_vs_rocksdb_resident_memory_ratio); + */ + + /* + let scan = Instant::now(); + let count = stone.iter().count(); + assert_eq!(count as u64, N_WRITES_PER_THREAD); + let scan_elapsed = scan.elapsed(); + println!( + "{} scanned items/s, total {:?}", + (N_WRITES_PER_THREAD * 1_000_000) / u64::try_from(scan_elapsed.as_micros().max(1)).unwrap_or(u64::MAX), + scan_elapsed + ); + */ + + /* + let scan_rev = Instant::now(); + let count = stone.range(..).rev().count(); + assert_eq!(count as u64, N_WRITES_PER_THREAD); + let scan_rev_elapsed = scan_rev.elapsed(); + println!( + "{} reverse-scanned items/s, total {:?}", + 
(N_WRITES_PER_THREAD * 1_000_000) / u64::try_from(scan_rev_elapsed.as_micros().max(1)).unwrap_or(u64::MAX), + scan_rev_elapsed + ); + */ +} diff --git a/examples/playground.rs b/examples/playground.rs deleted file mode 100644 index 78afa41f7..000000000 --- a/examples/playground.rs +++ /dev/null @@ -1,81 +0,0 @@ -extern crate sled; - -use sled::{Config, Result}; - -fn basic() -> Result<()> { - let config = Config::new().temporary(true); - - let db = config.open()?; - - let k = b"k".to_vec(); - let v1 = b"v1".to_vec(); - let v2 = b"v2".to_vec(); - - // set and get - db.insert(k.clone(), v1.clone())?; - assert_eq!(db.get(&k).unwrap().unwrap(), (v1)); - - // compare and swap - match db.compare_and_swap(k.clone(), Some(&v1), Some(v2.clone()))? { - Ok(()) => println!("it worked!"), - Err(sled::CompareAndSwapError { current: cur, proposed: _ }) => { - println!("the actual current value is {:?}", cur) - } - } - - // scan forward - let mut iter = db.range(k.as_slice()..); - let (k1, v1) = iter.next().unwrap().unwrap(); - assert_eq!(v1, v2); - assert_eq!(k1, k); - assert_eq!(iter.next(), None); - - // deletion - db.remove(&k)?; - - Ok(()) -} - -fn merge_operator() -> Result<()> { - fn concatenate_merge( - _key: &[u8], // the key being merged - old_value: Option<&[u8]>, // the previous value, if one existed - merged_bytes: &[u8], // the new bytes being merged in - ) -> Option> { - // set the new value, return None to delete - let mut ret = old_value.map_or_else(Vec::new, |ov| ov.to_vec()); - - ret.extend_from_slice(merged_bytes); - - Some(ret) - } - - let config = Config::new().temporary(true); - - let db = config.open()?; - db.set_merge_operator(concatenate_merge); - - let k = b"k".to_vec(); - - db.insert(k.clone(), vec![0])?; - db.merge(k.clone(), vec![1])?; - db.merge(k.clone(), vec![2])?; - assert_eq!(db.get(&*k).unwrap().unwrap(), (vec![0, 1, 2])); - - // sets replace previously merged data, - // bypassing the merge function. - db.insert(k.clone(), vec![3])?; - assert_eq!(db.get(&*k).unwrap().unwrap(), (vec![3])); - - // merges on non-present values will add them - db.remove(&*k)?; - db.merge(k.clone(), vec![4])?; - assert_eq!(db.get(&*k).unwrap().unwrap(), (vec![4])); - - Ok(()) -} - -fn main() -> Result<()> { - basic()?; - merge_operator() -} diff --git a/examples/structured.rs b/examples/structured.rs deleted file mode 100644 index 7b900915f..000000000 --- a/examples/structured.rs +++ /dev/null @@ -1,238 +0,0 @@ -//! This example demonstrates how to work with structured -//! keys and values without paying expensive (de)serialization -//! costs. -//! -//! The `upsert` function shows how to use structured keys and values. -//! -//! The `variable_lengths` function shows how to put a variable length -//! component in either the beginning or the end of your value. -//! -//! The `hash_join` function shows how to do some SQL-like joins. -//! -//! Running this example several times via `cargo run --example structured` -//! will initialize the count field to 0, and on subsequent runs it will -//! increment it. -use { - byteorder::{BigEndian, LittleEndian}, - zerocopy::{ - byteorder::U64, AsBytes, FromBytes, LayoutVerified, Unaligned, U16, U32, - }, -}; - -fn upsert(db: &sled::Db) -> sled::Result<()> { - // We use `BigEndian` for key types because - // they preserve lexicographic ordering, - // which is nice if we ever want to iterate - // over our items in order. We use the - // `U64` type from zerocopy because it - // does not have alignment requirements. 
- // sled does not guarantee any particular - // value alignment as of now. - #[derive(FromBytes, AsBytes, Unaligned)] - #[repr(C)] - struct Key { - a: U64, - b: U64, - } - - // We use `LittleEndian` for values because - // it's possibly cheaper, but the difference - // isn't likely to be measurable, so honestly - // use whatever you want for values. - #[derive(FromBytes, AsBytes, Unaligned)] - #[repr(C)] - struct Value { - count: U64, - whatever: [u8; 16], - } - - let key = Key { a: U64::new(21), b: U64::new(890) }; - - // "UPSERT" functionality - db.update_and_fetch(key.as_bytes(), |value_opt| { - if let Some(existing) = value_opt { - // We need to make a copy that will be written back - // into the database. This allows other threads that - // may have witnessed the old version to keep working - // without taking out any locks. IVec will be - // stack-allocated until it reaches 22 bytes - let mut backing_bytes = sled::IVec::from(existing); - - // this verifies that our value is the correct length - // and alignment (in this case we don't need it to be - // aligned, because we use the `U64` type from zerocopy) - let layout: LayoutVerified<&mut [u8], Value> = - LayoutVerified::new_unaligned(&mut *backing_bytes) - .expect("bytes do not fit schema"); - - // this lets us work with the underlying bytes as - // a mutable structured value. - let value: &mut Value = layout.into_mut(); - - let new_count = value.count.get() + 1; - - println!("incrementing count to {}", new_count); - - value.count.set(new_count); - - Some(backing_bytes) - } else { - println!("setting count to 0"); - - Some(sled::IVec::from( - Value { count: U64::new(0), whatever: [0; 16] }.as_bytes(), - )) - } - })?; - - Ok(()) -} - -// Cat values will be: -// favorite_number + battles_won + -#[derive(FromBytes, AsBytes, Unaligned)] -#[repr(C)] -struct CatValue { - favorite_number: U64, - battles_won: U64, -} - -// Dog values will be: -// + woof_count + postal_code -#[derive(FromBytes, AsBytes, Unaligned)] -#[repr(C)] -struct DogValue { - woof_count: U32, - postal_code: U16, -} - -fn variable_lengths(db: &sled::Db) -> sled::Result<()> { - // here we will show how we can use zerocopy for inserting - // fixed-size components, mixed with variable length - // records on the end or beginning. 
- - // the hash_join example below shows how to read items - // out in a way that accounts for the variable portion, - // using `zerocopy::LayoutVerified::{new_from_prefix, new_from_suffix}` - - let dogs = db.open_tree(b"dogs")?; - - let mut dog2000_value = vec![]; - dog2000_value.extend_from_slice(b"science zone"); - dog2000_value.extend_from_slice( - DogValue { woof_count: U32::new(666), postal_code: U16::new(42) } - .as_bytes(), - ); - dogs.insert("dog2000", dog2000_value)?; - - let mut zed_pup_value = vec![]; - zed_pup_value.extend_from_slice(b"bowling alley"); - zed_pup_value.extend_from_slice( - DogValue { woof_count: U32::new(32113231), postal_code: U16::new(0) } - .as_bytes(), - ); - dogs.insert("zed pup", zed_pup_value)?; - - // IMPORTANT NOTE: German dogs eat food called "barf" - let mut klaus_value = vec![]; - klaus_value.extend_from_slice(b"barf shop"); - klaus_value.extend_from_slice( - DogValue { woof_count: U32::new(0), postal_code: U16::new(12045) } - .as_bytes(), - ); - dogs.insert("klaus", klaus_value)?; - - let cats = db.open_tree(b"cats")?; - - let mut laser_cat_value = vec![]; - laser_cat_value.extend_from_slice( - CatValue { - favorite_number: U64::new(11), - battles_won: U64::new(321231321), - } - .as_bytes(), - ); - laser_cat_value.extend_from_slice(b"science zone"); - cats.insert("laser cat", laser_cat_value)?; - - let mut pulsar_cat_value = vec![]; - pulsar_cat_value.extend_from_slice( - CatValue { - favorite_number: U64::new(11), - battles_won: U64::new(321231321), - } - .as_bytes(), - ); - pulsar_cat_value.extend_from_slice(b"science zone"); - cats.insert("pulsar cat", pulsar_cat_value)?; - - let mut fluffy_value = vec![]; - fluffy_value.extend_from_slice( - CatValue { - favorite_number: U64::new(11), - battles_won: U64::new(321231321), - } - .as_bytes(), - ); - fluffy_value.extend_from_slice(b"bowling alley"); - cats.insert("fluffy", fluffy_value)?; - - Ok(()) -} - -fn hash_join(db: &sled::Db) -> sled::Result<()> { - // here we will try to find cats and dogs who - // live in the same home. - - let cats = db.open_tree(b"cats")?; - let dogs = db.open_tree(b"dogs")?; - - let mut join = std::collections::HashMap::new(); - - for name_value_res in &cats { - // cats are stored as name -> favorite_number + battles_won + home name - // variable bytes - let (name, value_bytes) = name_value_res?; - let (_, home_name): (LayoutVerified<&[u8], CatValue>, &[u8]) = - LayoutVerified::new_from_prefix(&*value_bytes).unwrap(); - let (ref mut cat_names, _dog_names) = - join.entry(home_name.to_vec()).or_insert((vec![], vec![])); - cat_names.push(std::str::from_utf8(&*name).unwrap().to_string()); - } - - for name_value_res in &dogs { - // dogs are stored as name -> home name variable bytes + woof count + - // postal code - let (name, value_bytes) = name_value_res?; - - // note that this is reversed from the cat example above, where - // the variable bytes are at the other end of the value, and are - // extracted using new_from_prefix instead of new_from_suffix. 
- let (home_name, _dog_value): (_, LayoutVerified<&[u8], DogValue>) = - LayoutVerified::new_from_suffix(&*value_bytes).unwrap(); - - if let Some((_cat_names, ref mut dog_names)) = join.get_mut(home_name) { - dog_names.push(std::str::from_utf8(&*name).unwrap().to_string()); - } - } - - for (home, (cats, dogs)) in join { - println!( - "the cats {:?} and the dogs {:?} live in the same home of {}", - cats, - dogs, - std::str::from_utf8(&home).unwrap() - ); - } - - Ok(()) -} - -fn main() -> sled::Result<()> { - let db = sled::open("my_database")?; - upsert(&db)?; - variable_lengths(&db)?; - hash_join(&db)?; - - Ok(()) -} diff --git a/experiments/epoch/Cargo.toml b/experiments/epoch/Cargo.toml deleted file mode 100644 index 9dae2b3f1..000000000 --- a/experiments/epoch/Cargo.toml +++ /dev/null @@ -1,8 +0,0 @@ -[package] -name = "epoch" -version = "0.1.0" -authors = ["Tyler Neely "] -edition = "2018" - -[profile.release] -debug = true diff --git a/experiments/epoch/sanitizers.sh b/experiments/epoch/sanitizers.sh deleted file mode 100755 index 2e6a9e293..000000000 --- a/experiments/epoch/sanitizers.sh +++ /dev/null @@ -1,22 +0,0 @@ -#!/bin/bash -set -eo pipefail - -echo "asan" -cargo clean -export RUSTFLAGS="-Z sanitizer=address" -# export ASAN_OPTIONS="detect_odr_violation=0" -cargo +nightly run --target x86_64-unknown-linux-gnu -unset ASAN_OPTIONS - -echo "lsan" -cargo clean -export RUSTFLAGS="-Z sanitizer=leak" -cargo +nightly run --target x86_64-unknown-linux-gnu - -echo "tsan" -cargo clean -export RUSTFLAGS="-Z sanitizer=thread" -export TSAN_OPTIONS=suppressions=../../tsan_suppressions.txt -cargo +nightly run --target x86_64-unknown-linux-gnu -unset RUSTFLAGS -unset TSAN_OPTIONS diff --git a/experiments/epoch/src/main.rs b/experiments/epoch/src/main.rs deleted file mode 100644 index e08b21818..000000000 --- a/experiments/epoch/src/main.rs +++ /dev/null @@ -1,213 +0,0 @@ -/// A simple implementation of epoch-based reclamation. -/// -/// Using the `pin` method, a thread checks into an epoch -/// before operating on a shared resource. If that thread -/// makes a shared resource inaccessible, it can defer its -/// destruction until all threads that may have already -/// checked in have moved on. -use std::{ - cell::RefCell, - sync::{ - atomic::{AtomicPtr, AtomicUsize, Ordering::SeqCst}, - Arc, - }, -}; - -const EPOCH_SZ: usize = 16; - -#[derive(Default)] -struct Epoch { - garbage: [AtomicPtr>; EPOCH_SZ], - offset: AtomicUsize, - next: AtomicPtr, - id: u64, -} - -impl Drop for Epoch { - fn drop(&mut self) { - let count = std::cmp::min(EPOCH_SZ, self.offset.load(SeqCst)); - for offset in 0..count { - let mut garbage_ptr: *mut Box = - self.garbage[offset].load(SeqCst); - while garbage_ptr.is_null() { - // maybe this is impossible, but this is to - // be defensive against race conditions. 
- garbage_ptr = self.garbage[offset].load(SeqCst); - } - - let garbage: Box> = - unsafe { Box::from_raw(garbage_ptr) }; - - drop(garbage); - } - - let next = self.next.swap(std::ptr::null_mut(), SeqCst); - if !next.is_null() { - let arc = unsafe { Arc::from_raw(next) }; - drop(arc); - } - } -} - -struct Collector { - head: AtomicPtr, -} - -unsafe impl Send for Collector {} -unsafe impl Sync for Collector {} - -impl Default for Collector { - fn default() -> Collector { - let ptr = Arc::into_raw(Arc::new(Epoch::default())) as *mut Epoch; - Collector { head: AtomicPtr::new(ptr) } - } -} - -impl Collector { - fn pin(&self) -> Guard { - let head_ptr = self.head.load(SeqCst); - assert!(!head_ptr.is_null()); - let mut head = unsafe { Arc::from_raw(head_ptr) }; - let mut next = head.next.load(SeqCst); - let mut last_head = head_ptr; - - // forward head to current tip - while !next.is_null() { - std::mem::forget(head); - - let res = self.head.compare_and_swap(last_head, next, SeqCst); - if res == last_head { - head = unsafe { Arc::from_raw(next) }; - last_head = next; - } else { - head = unsafe { Arc::from_raw(res) }; - last_head = res; - } - - next = head.next.load(SeqCst); - } - - let (a1, a2) = (head.clone(), head.clone()); - std::mem::forget(head); - - Guard { - _entry_epoch: a1, - current_epoch: a2, - trash_sack: RefCell::new(vec![]), - } - } -} - -impl Drop for Collector { - fn drop(&mut self) { - let head_ptr = self.head.load(SeqCst); - assert!(!head_ptr.is_null()); - unsafe { - let head = Arc::from_raw(head_ptr); - drop(head); - } - } -} - -pub(crate) struct Guard { - _entry_epoch: Arc, - current_epoch: Arc, - trash_sack: RefCell>>, -} - -impl Guard { - pub fn defer(&self, f: F) - where - F: FnOnce() + Send + 'static, - { - let garbage_ptr = - Box::into_raw(Box::new(Box::new(f) as Box)); - let mut trash_sack = self.trash_sack.borrow_mut(); - trash_sack.push(garbage_ptr); - } -} - -impl Drop for Guard { - fn drop(&mut self) { - let trash_sack = self.trash_sack.replace(vec![]); - - for garbage_ptr in trash_sack.into_iter() { - // try to reserve - let mut offset = self.current_epoch.offset.fetch_add(1, SeqCst); - while offset >= EPOCH_SZ { - let next = self.current_epoch.next.load(SeqCst); - if !next.is_null() { - unsafe { - let raced_arc = Arc::from_raw(next); - self.current_epoch = raced_arc.clone(); - std::mem::forget(raced_arc); - } - offset = self.current_epoch.offset.fetch_add(1, SeqCst); - continue; - } - - // push epoch forward if we're full - let mut next_epoch = Epoch::default(); - next_epoch.id = self.current_epoch.id + 1; - - let next_epoch_arc = Arc::new(next_epoch); - let next_ptr = - Arc::into_raw(next_epoch_arc.clone()) as *mut Epoch; - let old = self.current_epoch.next.compare_and_swap( - std::ptr::null_mut(), - next_ptr, - SeqCst, - ); - if old != std::ptr::null_mut() { - // somebody else already installed a new segment - unsafe { - let unneeded = Arc::from_raw(next_ptr); - drop(unneeded); - - let raced_arc = Arc::from_raw(old); - self.current_epoch = raced_arc.clone(); - std::mem::forget(raced_arc); - } - offset = self.current_epoch.offset.fetch_add(1, SeqCst); - continue; - } - - self.current_epoch = next_epoch_arc; - offset = self.current_epoch.offset.fetch_add(1, SeqCst); - } - - let old = - self.current_epoch.garbage[offset].swap(garbage_ptr, SeqCst); - assert!(old.is_null()); - } - } -} - -#[derive(Debug)] -struct S(usize); - -fn main() { - let collector = Arc::new(Collector::default()); - - let mut threads = vec![]; - - for t in 0..100 { - use std::thread::spawn; - - 
let collector = collector.clone(); - let thread = spawn(move || { - for _ in 0..1000000 { - let guard = collector.pin(); - guard.defer(move || { - S(t as usize); - }); - } - }); - - threads.push(thread); - } - - for thread in threads.into_iter() { - thread.join().unwrap(); - } -} diff --git a/experiments/new_segment_ownership/Cargo.lock b/experiments/new_segment_ownership/Cargo.lock deleted file mode 100644 index 839910bd3..000000000 --- a/experiments/new_segment_ownership/Cargo.lock +++ /dev/null @@ -1,6 +0,0 @@ -# This file is automatically @generated by Cargo. -# It is not intended for manual editing. -[[package]] -name = "new_segment_ownership" -version = "0.1.0" - diff --git a/experiments/new_segment_ownership/Cargo.toml b/experiments/new_segment_ownership/Cargo.toml deleted file mode 100644 index 1d59d65da..000000000 --- a/experiments/new_segment_ownership/Cargo.toml +++ /dev/null @@ -1,9 +0,0 @@ -[package] -name = "new_segment_ownership" -version = "0.1.0" -authors = ["Tyler Neely "] -edition = "2018" - -# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html - -[dependencies] diff --git a/experiments/new_segment_ownership/src/main.rs b/experiments/new_segment_ownership/src/main.rs deleted file mode 100644 index 21fb2ebe0..000000000 --- a/experiments/new_segment_ownership/src/main.rs +++ /dev/null @@ -1,109 +0,0 @@ -use std::sync::{ - atomic::{AtomicUsize, Ordering}, - Arc, -}; - -const SZ: usize = 128; - -#[derive(Default, Debug)] -struct Log { - segment_accountant: Arc, - io_buf: Arc, -} - -impl Log { - fn new() -> Log { - let io_buf = Arc::new(IoBuf::default()); - let segment_accountant = io_buf.segment.segment_accountant.clone(); - Log { io_buf, segment_accountant } - } - - fn reserve(&mut self, size: usize) -> Reservation { - assert!(size <= SZ); - if self.io_buf.buf.load(Ordering::SeqCst) + size > SZ { - let segment = self.segment_accountant.clone().next_segment(); - let buf = AtomicUsize::new(0); - self.io_buf = Arc::new(IoBuf { segment, buf }); - } - let io_buf = self.io_buf.clone(); - io_buf.buf.fetch_add(size, Ordering::SeqCst); - Reservation { io_buf } - } -} - -#[derive(Default, Debug)] -struct Reservation { - io_buf: Arc, -} - -#[derive(Default, Debug)] -struct IoBuf { - segment: Arc, - buf: AtomicUsize, -} - -#[derive(Default, Debug)] -struct Segment { - offset: usize, - segment_accountant: Arc, -} - -#[derive(Default, Debug)] -struct SegmentAccountant { - tip: AtomicUsize, - free: Vec, -} - -impl SegmentAccountant { - fn next_segment(self: Arc) -> Arc { - let offset = SZ + self.tip.fetch_add(SZ, Ordering::SeqCst); - println!("setting new segment {}", offset); - Arc::new(Segment { segment_accountant: self, offset }) - } -} - -fn main() { - let mut log = Log::new(); - { - let _ = log.reserve(64); - let _ = log.reserve(64); - } - println!("src/main.rs:70"); - { - let _ = log.reserve(128); - } - println!("src/main.rs:74"); - { - let _ = log.reserve(128); - } - println!("src/main.rs:78"); - { - let _ = log.reserve(128); - } - println!("src/main.rs:77"); -} - -mod dropz { - use super::*; - - impl Drop for IoBuf { - fn drop(&mut self) { - println!("IoBuf::drop"); - } - } - impl Drop for Segment { - fn drop(&mut self) { - println!("dropping Segment {:?}", self.offset); - } - } - impl Drop for SegmentAccountant { - fn drop(&mut self) { - println!("SegmentAccountant::drop"); - } - } - impl Drop for Reservation { - fn drop(&mut self) { - println!("Reservation::drop"); - } - } -} diff --git a/fuzz/.gitignore b/fuzz/.gitignore new file mode 
100644 index 000000000..a0925114d --- /dev/null +++ b/fuzz/.gitignore @@ -0,0 +1,3 @@ +target +corpus +artifacts diff --git a/fuzz/Cargo.toml b/fuzz/Cargo.toml new file mode 100644 index 000000000..481ec21fc --- /dev/null +++ b/fuzz/Cargo.toml @@ -0,0 +1,31 @@ +[package] +name = "bloodstone-fuzz" +version = "0.0.0" +authors = ["Automatically generated"] +publish = false +edition = "2018" + +[package.metadata] +cargo-fuzz = true + +[dependencies.libfuzzer-sys] +version = "0.4.0" +features = ["arbitrary-derive"] + +[dependencies] +arbitrary = { version = "1.0.3", features = ["derive"] } +tempfile = "3.5.0" + +[dependencies.sled] +path = ".." +features = [] + +# Prevent this from interfering with workspaces +[workspace] +members = ["."] + +[[bin]] +name = "fuzz_model" +path = "fuzz_targets/fuzz_model.rs" +test = false +doc = false diff --git a/fuzz/fuzz_targets/fuzz_model.rs b/fuzz/fuzz_targets/fuzz_model.rs new file mode 100644 index 000000000..4af511255 --- /dev/null +++ b/fuzz/fuzz_targets/fuzz_model.rs @@ -0,0 +1,146 @@ +#![no_main] +#[macro_use] +extern crate libfuzzer_sys; +extern crate arbitrary; +extern crate sled; + +use arbitrary::Arbitrary; + +use sled::{Config, Db as SledDb, InlineArray}; + +type Db = SledDb<3>; + +const KEYSPACE: u64 = 128; + +#[derive(Debug)] +enum Op { + Get { key: InlineArray }, + Insert { key: InlineArray, value: InlineArray }, + Reboot, + Remove { key: InlineArray }, + Cas { key: InlineArray, old: Option, new: Option }, + Range { start: InlineArray, end: InlineArray }, +} + +fn keygen( + u: &mut arbitrary::Unstructured<'_>, +) -> arbitrary::Result { + let key_i: u64 = u.int_in_range(0..=KEYSPACE)?; + Ok(key_i.to_be_bytes().as_ref().into()) +} + +impl<'a> Arbitrary<'a> for Op { + fn arbitrary( + u: &mut arbitrary::Unstructured<'a>, + ) -> arbitrary::Result { + Ok(if u.ratio(1, 2)? { + Op::Insert { key: keygen(u)?, value: keygen(u)? } + } else if u.ratio(1, 2)? { + Op::Get { key: keygen(u)? } + } else if u.ratio(1, 2)? { + Op::Reboot + } else if u.ratio(1, 2)? { + Op::Remove { key: keygen(u)? } + } else if u.ratio(1, 2)? { + Op::Cas { + key: keygen(u)?, + old: if u.ratio(1, 2)? { Some(keygen(u)?) } else { None }, + new: if u.ratio(1, 2)? { Some(keygen(u)?) 
} else { None }, + } + } else { + let start = u.int_in_range(0..=KEYSPACE)?; + let end = (start + 1).max(u.int_in_range(0..=KEYSPACE)?); + + Op::Range { + start: start.to_be_bytes().as_ref().into(), + end: end.to_be_bytes().as_ref().into(), + } + }) + } +} + +fuzz_target!(|ops: Vec| { + let tmp_dir = tempfile::TempDir::new().unwrap(); + let tmp_path = tmp_dir.path().to_owned(); + let config = Config::new().path(tmp_path); + + let mut tree: Db = config.open().unwrap(); + let mut model = std::collections::BTreeMap::new(); + + for (_i, op) in ops.into_iter().enumerate() { + match op { + Op::Insert { key, value } => { + assert_eq!( + tree.insert(key.clone(), value.clone()).unwrap(), + model.insert(key, value) + ); + } + Op::Get { key } => { + assert_eq!(tree.get(&key).unwrap(), model.get(&key).cloned()); + } + Op::Reboot => { + drop(tree); + tree = config.open().unwrap(); + } + Op::Remove { key } => { + assert_eq!(tree.remove(&key).unwrap(), model.remove(&key)); + } + Op::Range { start, end } => { + let mut model_iter = + model.range::(&start..&end); + let mut tree_iter = tree.range(start..end); + + for (k1, v1) in &mut model_iter { + let (k2, v2) = tree_iter + .next() + .expect("None returned from iter when Some expected") + .expect("IO issue encountered"); + assert_eq!((k1, v1), (&k2, &v2)); + } + + assert!(tree_iter.next().is_none()); + } + Op::Cas { key, old, new } => { + let succ = if old == model.get(&key).cloned() { + if let Some(n) = &new { + model.insert(key.clone(), n.clone()); + } else { + model.remove(&key); + } + true + } else { + false + }; + + let res = tree + .compare_and_swap(key, old.as_ref(), new) + .expect("hit IO error"); + + if succ { + assert!(res.is_ok()); + } else { + assert!(res.is_err()); + } + } + }; + + for (key, value) in &model { + assert_eq!(tree.get(key).unwrap().unwrap(), value); + } + + for kv_res in &tree { + let (key, value) = kv_res.unwrap(); + assert_eq!(model.get(&key), Some(&value)); + } + } + + let mut model_iter = model.iter(); + let mut tree_iter = tree.iter(); + + for (k1, v1) in &mut model_iter { + let (k2, v2) = tree_iter.next().unwrap().unwrap(); + assert_eq!((k1, v1), (&k2, &v2)); + } + + assert!(tree_iter.next().is_none()); +}); diff --git a/src/alloc.rs b/src/alloc.rs new file mode 100644 index 000000000..474ca6bf7 --- /dev/null +++ b/src/alloc.rs @@ -0,0 +1,81 @@ +#[cfg(any( + feature = "testing-shred-allocator", + feature = "testing-count-allocator" +))] +pub use alloc::*; + +// the memshred feature causes all allocated and deallocated +// memory to be set to a specific non-zero value of 0xa1 for +// uninitialized allocations and 0xde for deallocated memory, +// in the hope that it will cause memory errors to surface +// more quickly. 
+ +#[cfg(feature = "testing-shred-allocator")] +mod alloc { + use std::alloc::{Layout, System}; + + #[global_allocator] + static ALLOCATOR: ShredAllocator = ShredAllocator; + + #[derive(Default, Debug, Clone, Copy)] + struct ShredAllocator; + + unsafe impl std::alloc::GlobalAlloc for ShredAllocator { + unsafe fn alloc(&self, layout: Layout) -> *mut u8 { + let ret = System.alloc(layout); + assert_ne!(ret, std::ptr::null_mut()); + std::ptr::write_bytes(ret, 0xa1, layout.size()); + ret + } + + unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { + std::ptr::write_bytes(ptr, 0xde, layout.size()); + System.dealloc(ptr, layout) + } + } +} + +#[cfg(feature = "testing-count-allocator")] +mod alloc { + use std::alloc::{Layout, System}; + + #[global_allocator] + static ALLOCATOR: CountingAllocator = CountingAllocator; + + static ALLOCATED: AtomicUsize = AtomicUsize::new(0); + static FREED: AtomicUsize = AtomicUsize::new(0); + static RESIDENT: AtomicUsize = AtomicUsize::new(0); + + fn allocated() -> usize { + ALLOCATED.swap(0, Ordering::Relaxed) + } + + fn freed() -> usize { + FREED.swap(0, Ordering::Relaxed) + } + + fn resident() -> usize { + RESIDENT.load(Ordering::Relaxed) + } + + #[derive(Default, Debug, Clone, Copy)] + struct CountingAllocator; + + unsafe impl std::alloc::GlobalAlloc for CountingAllocator { + unsafe fn alloc(&self, layout: Layout) -> *mut u8 { + let ret = System.alloc(layout); + assert_ne!(ret, std::ptr::null_mut()); + ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed); + RESIDENT.fetch_add(layout.size(), Ordering::Relaxed); + std::ptr::write_bytes(ret, 0xa1, layout.size()); + ret + } + + unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { + std::ptr::write_bytes(ptr, 0xde, layout.size()); + FREED.fetch_add(layout.size(), Ordering::Relaxed); + RESIDENT.fetch_sub(layout.size(), Ordering::Relaxed); + System.dealloc(ptr, layout) + } + } +} diff --git a/src/atomic_shim.rs b/src/atomic_shim.rs deleted file mode 100644 index f134008b7..000000000 --- a/src/atomic_shim.rs +++ /dev/null @@ -1,262 +0,0 @@ -///! 
Inline of `https://github.com/bltavares/atomic-shim` - -#[cfg(not(any( - target_arch = "mips", - target_arch = "powerpc", - feature = "mutex" -)))] -pub use std::sync::atomic::{AtomicI64, AtomicU64}; -#[cfg(any(target_arch = "mips", target_arch = "powerpc", feature = "mutex"))] -mod shim { - use parking_lot::{const_rwlock, RwLock}; - use std::sync::atomic::Ordering; - - #[derive(Debug, Default)] - pub struct AtomicU64 { - value: RwLock, - } - - impl AtomicU64 { - pub const fn new(v: u64) -> Self { - Self { value: const_rwlock(v) } - } - - #[allow(dead_code)] - pub fn load(&self, _: Ordering) -> u64 { - *self.value.read() - } - - #[allow(dead_code)] - pub fn store(&self, value: u64, _: Ordering) { - let mut lock = self.value.write(); - *lock = value; - } - - #[allow(dead_code)] - pub fn swap(&self, value: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = value; - prev - } - - #[allow(dead_code)] - pub fn compare_exchange( - &self, - current: u64, - new: u64, - _: Ordering, - _: Ordering, - ) -> Result { - let mut lock = self.value.write(); - let prev = *lock; - if prev == current { - *lock = new; - Ok(current) - } else { - Err(prev) - } - } - - #[allow(dead_code)] - pub fn compare_exchange_weak( - &self, - current: u64, - new: u64, - success: Ordering, - failure: Ordering, - ) -> Result { - self.compare_exchange(current, new, success, failure) - } - - #[allow(dead_code)] - pub fn fetch_add(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.wrapping_add(val); - prev - } - - #[allow(dead_code)] - pub fn fetch_sub(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.wrapping_sub(val); - prev - } - - #[allow(dead_code)] - pub fn fetch_and(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev & val; - prev - } - - #[allow(dead_code)] - pub fn fetch_nand(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = !(prev & val); - prev - } - - #[allow(dead_code)] - pub fn fetch_or(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev | val; - prev - } - - #[allow(dead_code)] - pub fn fetch_xor(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev ^ val; - prev - } - - #[allow(dead_code)] - pub fn fetch_max(&self, val: u64, _: Ordering) -> u64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.max(val); - prev - } - } - - impl From for AtomicU64 { - fn from(value: u64) -> Self { - AtomicU64::new(value) - } - } - - #[derive(Debug, Default)] - pub struct AtomicI64 { - value: RwLock, - } - - impl AtomicI64 { - pub const fn new(v: i64) -> Self { - Self { value: const_rwlock(v) } - } - - #[allow(dead_code)] - pub fn load(&self, _: Ordering) -> i64 { - *self.value.read() - } - - #[allow(dead_code)] - pub fn store(&self, value: i64, _: Ordering) { - let mut lock = self.value.write(); - *lock = value; - } - - #[allow(dead_code)] - pub fn swap(&self, value: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = value; - prev - } - - #[allow(dead_code)] - pub fn compare_exchange( - &self, - current: i64, - new: i64, - _: Ordering, - _: Ordering, - ) -> Result { - let mut lock = self.value.write(); - let prev = *lock; - if prev == current { - *lock = new; - Ok(current) - } 
else { - Err(prev) - } - } - - #[allow(dead_code)] - pub fn compare_exchange_weak( - &self, - current: i64, - new: i64, - success: Ordering, - failure: Ordering, - ) -> Result { - self.compare_exchange(current, new, success, failure) - } - - #[allow(dead_code)] - pub fn fetch_add(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.wrapping_add(val); - prev - } - - #[allow(dead_code)] - pub fn fetch_sub(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.wrapping_sub(val); - prev - } - - #[allow(dead_code)] - pub fn fetch_and(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev & val; - prev - } - - #[allow(dead_code)] - pub fn fetch_nand(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = !(prev & val); - prev - } - - #[allow(dead_code)] - pub fn fetch_or(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev | val; - prev - } - - #[allow(dead_code)] - pub fn fetch_xor(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev ^ val; - prev - } - - #[allow(dead_code)] - pub fn fetch_max(&self, val: i64, _: Ordering) -> i64 { - let mut lock = self.value.write(); - let prev = *lock; - *lock = prev.max(val); - prev - } - } - - impl From for AtomicI64 { - fn from(value: i64) -> Self { - AtomicI64::new(value) - } - } -} - -#[cfg(any( - target_arch = "mips", - target_arch = "powerpc", - feature = "mutex" -))] -pub use shim::{AtomicI64, AtomicU64}; diff --git a/src/backoff.rs b/src/backoff.rs deleted file mode 100644 index 46a65d7f4..000000000 --- a/src/backoff.rs +++ /dev/null @@ -1,43 +0,0 @@ -/// Vendored and simplified from crossbeam-utils -use core::cell::Cell; -use core::sync::atomic; - -const SPIN_LIMIT: u32 = 6; - -/// Performs exponential backoff in spin loops. -/// -/// Backing off in spin loops reduces contention and improves overall performance. -/// -/// This primitive can execute *YIELD* and *PAUSE* instructions, yield the current thread to the OS -/// scheduler, and tell when is a good time to block the thread using a different synchronization -/// mechanism. Each step of the back off procedure takes roughly twice as long as the previous -/// step. -pub struct Backoff { - step: Cell, -} - -impl Backoff { - /// Creates a new `Backoff`. - pub const fn new() -> Self { - Backoff { step: Cell::new(0) } - } - - /// Backs off in a lock-free loop. - /// - /// This method should be used when we need to retry an operation because another thread made - /// progress. - /// - /// The processor may yield using the *YIELD* or *PAUSE* instruction. - #[inline] - pub fn spin(&self) { - for _ in 0..1 << self.step.get().min(SPIN_LIMIT) { - // `hint::spin_loop` requires Rust 1.49. - #[allow(deprecated)] - atomic::spin_loop_hint(); - } - - if self.step.get() <= SPIN_LIMIT { - self.step.set(self.step.get() + 1); - } - } -} diff --git a/src/batch.rs b/src/batch.rs deleted file mode 100644 index e960fd8b5..000000000 --- a/src/batch.rs +++ /dev/null @@ -1,60 +0,0 @@ -#![allow(unused_results)] - -use super::*; - -/// A batch of updates that will -/// be applied atomically to the -/// Tree. 
-/// -/// # Examples -/// -/// ``` -/// # fn main() -> Result<(), Box> { -/// use sled::{Batch, open}; -/// -/// # let _ = std::fs::remove_dir_all("batch_db_2"); -/// let db = open("batch_db_2")?; -/// db.insert("key_0", "val_0")?; -/// -/// let mut batch = Batch::default(); -/// batch.insert("key_a", "val_a"); -/// batch.insert("key_b", "val_b"); -/// batch.insert("key_c", "val_c"); -/// batch.remove("key_0"); -/// -/// db.apply_batch(batch)?; -/// // key_0 no longer exists, and key_a, key_b, and key_c -/// // now do exist. -/// # let _ = std::fs::remove_dir_all("batch_db_2"); -/// # Ok(()) } -/// ``` -#[derive(Debug, Default, Clone, PartialEq, Eq)] -pub struct Batch { - pub(crate) writes: Map>, -} - -impl Batch { - /// Set a key to a new value - pub fn insert(&mut self, key: K, value: V) - where - K: Into, - V: Into, - { - self.writes.insert(key.into(), Some(value.into())); - } - - /// Remove a key - pub fn remove(&mut self, key: K) - where - K: Into, - { - self.writes.insert(key.into(), None); - } - - /// Get a value if it is present in the `Batch`. - /// `Some(None)` means it's present as a deletion. - pub fn get>(&self, k: K) -> Option> { - let inner = self.writes.get(k.as_ref())?; - Some(inner.as_ref()) - } -} diff --git a/src/cache_padded.rs b/src/cache_padded.rs deleted file mode 100644 index 4495ff206..000000000 --- a/src/cache_padded.rs +++ /dev/null @@ -1,66 +0,0 @@ -/// Vendored and simplified from crossbeam-utils. -use core::fmt; -use core::ops::{Deref, DerefMut}; - -// Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache -// lines at a time, so we have to align to 128 bytes rather than 64. -// -// Sources: -// - https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf -// - https://github.com/facebook/folly/blob/1b5288e6eea6df074758f877c849b6e73bbb9fbb/folly/lang/Align.h#L107 -// -// ARM's big.LITTLE architecture has asymmetric cores and "big" cores have 128 byte cache line size -// Sources: -// - https://www.mono-project.com/news/2016/09/12/arm64-icache/ -// -#[cfg_attr( - any(target_arch = "x86_64", target_arch = "aarch64"), - repr(align(128)) -)] -#[cfg_attr( - not(any(target_arch = "x86_64", target_arch = "aarch64")), - repr(align(64)) -)] -#[derive(Default, PartialEq, Eq)] -pub struct CachePadded { - value: T, -} - -#[allow(unsafe_code)] -unsafe impl Send for CachePadded {} - -#[allow(unsafe_code)] -unsafe impl Sync for CachePadded {} - -impl CachePadded { - /// Pads and aligns a value to the length of a cache line. - pub const fn new(t: T) -> CachePadded { - CachePadded:: { value: t } - } -} - -impl Deref for CachePadded { - type Target = T; - - fn deref(&self) -> &T { - &self.value - } -} - -impl DerefMut for CachePadded { - fn deref_mut(&mut self) -> &mut T { - &mut self.value - } -} - -impl fmt::Debug for CachePadded { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - f.debug_struct("CachePadded").field("value", &self.value).finish() - } -} - -impl From for CachePadded { - fn from(t: T) -> Self { - CachePadded::new(t) - } -} diff --git a/src/concurrency_control.rs b/src/concurrency_control.rs deleted file mode 100644 index 81f8d997b..000000000 --- a/src/concurrency_control.rs +++ /dev/null @@ -1,108 +0,0 @@ -#[cfg(feature = "testing")] -use std::cell::RefCell; -use std::sync::atomic::AtomicBool; - -use parking_lot::{RwLockReadGuard, RwLockWriteGuard}; - -use super::*; - -#[cfg(feature = "testing")] -thread_local! 
{ - pub static COUNT: RefCell = RefCell::new(0); -} - -const RW_REQUIRED_BIT: usize = 1 << 31; - -#[derive(Default)] -pub(crate) struct ConcurrencyControl { - active: AtomicUsize, - upgrade_complete: AtomicBool, - rw: RwLock<()>, -} - -static CONCURRENCY_CONTROL: Lazy< - ConcurrencyControl, - fn() -> ConcurrencyControl, -> = Lazy::new(init_cc); - -fn init_cc() -> ConcurrencyControl { - ConcurrencyControl::default() -} - -#[derive(Debug)] -#[must_use] -pub(crate) enum Protector<'a> { - Write(RwLockWriteGuard<'a, ()>), - Read(RwLockReadGuard<'a, ()>), - None(&'a AtomicUsize), -} - -impl<'a> Drop for Protector<'a> { - fn drop(&mut self) { - if let Protector::None(active) = self { - active.fetch_sub(1, Release); - } - #[cfg(feature = "testing")] - COUNT.with(|c| { - let mut c = c.borrow_mut(); - *c -= 1; - assert_eq!(*c, 0); - }); - } -} - -pub(crate) fn read<'a>() -> Protector<'a> { - CONCURRENCY_CONTROL.read() -} - -pub(crate) fn write<'a>() -> Protector<'a> { - CONCURRENCY_CONTROL.write() -} - -impl ConcurrencyControl { - fn enable(&self) { - if self.active.fetch_or(RW_REQUIRED_BIT, SeqCst) < RW_REQUIRED_BIT { - // we are the first to set this bit - while self.active.load(Acquire) != RW_REQUIRED_BIT { - // `hint::spin_loop` requires Rust 1.49. - #[allow(deprecated)] - std::sync::atomic::spin_loop_hint() - } - self.upgrade_complete.store(true, Release); - } - } - - fn read(&self) -> Protector<'_> { - #[cfg(feature = "testing")] - COUNT.with(|c| { - let mut c = c.borrow_mut(); - *c += 1; - assert_eq!(*c, 1); - }); - - let active = self.active.fetch_add(1, Release); - - if active >= RW_REQUIRED_BIT { - self.active.fetch_sub(1, Release); - Protector::Read(self.rw.read()) - } else { - Protector::None(&self.active) - } - } - - fn write(&self) -> Protector<'_> { - #[cfg(feature = "testing")] - COUNT.with(|c| { - let mut c = c.borrow_mut(); - *c += 1; - assert_eq!(*c, 1); - }); - self.enable(); - while !self.upgrade_complete.load(Acquire) { - // `hint::spin_loop` requires Rust 1.49. - #[allow(deprecated)] - std::sync::atomic::spin_loop_hint() - } - Protector::Write(self.rw.write()) - } -} diff --git a/src/config.rs b/src/config.rs index edfda4748..7eb6dab4b 100644 --- a/src/config.rs +++ b/src/config.rs @@ -1,824 +1,102 @@ -use std::{ - fs, - fs::File, - io, - io::{BufRead, BufReader, ErrorKind, Read, Seek, Write}, - ops::Deref, - path::{Path, PathBuf}, - sync::atomic::AtomicUsize, -}; +use std::io; +use std::path::{Path, PathBuf}; +use std::sync::Arc; -use crate::pagecache::{arr_to_u32, u32_to_arr, Heap}; -use crate::*; +use fault_injection::{annotate, fallible}; +use tempdir::TempDir; -const DEFAULT_PATH: &str = "default.sled"; - -/// The high-level database mode, according to -/// the trade-offs of the RUM conjecture. -#[derive(Debug, Clone, Copy)] -pub enum Mode { - /// In this mode, the database will make - /// decisions that favor using less space - /// instead of supporting the highest possible - /// write throughput. This mode will also - /// rewrite data more frequently as it - /// strives to reduce fragmentation. - LowSpace, - /// In this mode, the database will try - /// to maximize write throughput while - /// potentially using more disk space. 
- HighThroughput, -} - -/// A persisted configuration about high-level -/// storage file information -#[derive(Debug, Eq, PartialEq, Clone, Copy)] -struct StorageParameters { - pub segment_size: usize, - pub use_compression: bool, - pub version: (usize, usize), -} - -impl StorageParameters { - pub fn serialize(&self) -> Vec { - let mut out = vec![]; - - writeln!(&mut out, "segment_size: {}", self.segment_size).unwrap(); - writeln!(&mut out, "use_compression: {}", self.use_compression) - .unwrap(); - writeln!(&mut out, "version: {}.{}", self.version.0, self.version.1) - .unwrap(); - - out - } - - pub fn deserialize(bytes: &[u8]) -> Result { - let reader = BufReader::new(bytes); - - let mut lines = Map::new(); - - for line in reader.lines() { - let line = if let Ok(l) = line { - l - } else { - error!( - "failed to parse persisted config as UTF-8. \ - This changed in sled version 0.29" - ); - return Err(Error::Unsupported( - "failed to open database that may \ - have been created using a sled version \ - earlier than 0.29", - )); - }; - let mut split = line.split(": ").map(String::from); - let k = if let Some(k) = split.next() { - k - } else { - error!("failed to parse persisted config line: {}", line); - return Err(Error::corruption(None)); - }; - let v = if let Some(v) = split.next() { - v - } else { - error!("failed to parse persisted config line: {}", line); - return Err(Error::corruption(None)); - }; - lines.insert(k, v); - } - - let segment_size: usize = if let Some(raw) = lines.get("segment_size") { - if let Ok(parsed) = raw.parse() { - parsed - } else { - error!("failed to parse segment_size value: {}", raw); - return Err(Error::corruption(None)); - } - } else { - error!( - "failed to retrieve required configuration parameter: segment_size" - ); - return Err(Error::corruption(None)); - }; - - let use_compression: bool = if let Some(raw) = - lines.get("use_compression") - { - if let Ok(parsed) = raw.parse() { - parsed - } else { - error!("failed to parse use_compression value: {}", raw); - return Err(Error::corruption(None)); - } - } else { - error!( - "failed to retrieve required configuration parameter: use_compression" - ); - return Err(Error::corruption(None)); - }; - - let version: (usize, usize) = if let Some(raw) = lines.get("version") { - let mut split = raw.split('.'); - let major = if let Some(raw_major) = split.next() { - if let Ok(parsed_major) = raw_major.parse() { - parsed_major - } else { - error!( - "failed to parse major version value from line: {}", - raw - ); - return Err(Error::corruption(None)); - } - } else { - error!("failed to parse major version value: {}", raw); - return Err(Error::corruption(None)); - }; - - let minor = if let Some(raw_minor) = split.next() { - if let Ok(parsed_minor) = raw_minor.parse() { - parsed_minor - } else { - error!( - "failed to parse minor version value from line: {}", - raw - ); - return Err(Error::corruption(None)); - } - } else { - error!("failed to parse minor version value: {}", raw); - return Err(Error::corruption(None)); - }; - - (major, minor) - } else { - error!( - "failed to retrieve required configuration parameter: version" - ); - return Err(Error::corruption(None)); - }; - - Ok(StorageParameters { segment_size, use_compression, version }) - } -} - -/// Top-level configuration for the system. 
-/// -/// # Examples -/// -/// ``` -/// let _config = sled::Config::default() -/// .path("/path/to/data".to_owned()) -/// .cache_capacity(10_000_000_000) -/// .flush_every_ms(Some(1000)); -/// ``` -#[derive(Default, Debug, Clone)] -pub struct Config(Arc); - -impl Deref for Config { - type Target = Inner; - - fn deref(&self) -> &Inner { - &self.0 - } -} - -#[doc(hidden)] -#[derive(Debug, Clone)] -pub struct Inner { - #[doc(hidden)] - pub cache_capacity: usize, - #[doc(hidden)] - pub flush_every_ms: Option, - #[doc(hidden)] - pub segment_size: usize, - #[doc(hidden)] - pub path: PathBuf, - #[doc(hidden)] - pub create_new: bool, - #[doc(hidden)] - pub mode: Mode, - #[doc(hidden)] - pub temporary: bool, - #[doc(hidden)] - pub use_compression: bool, - #[doc(hidden)] - pub compression_factor: i32, - #[doc(hidden)] - pub idgen_persist_interval: u64, - #[doc(hidden)] - pub snapshot_after_ops: u64, - #[doc(hidden)] - pub version: (usize, usize), - tmp_path: PathBuf, - pub(crate) global_error: Arc>, - #[cfg(feature = "event_log")] - /// an event log for concurrent debugging - pub event_log: Arc, -} - -impl Default for Inner { - fn default() -> Self { - Self { - // generally useful - path: PathBuf::from(DEFAULT_PATH), - tmp_path: Config::gen_temp_path(), - create_new: false, - cache_capacity: 1024 * 1024 * 1024, // 1gb - mode: Mode::LowSpace, - use_compression: false, - compression_factor: 5, - temporary: false, - version: crate_version(), - - // useful in testing - segment_size: 512 * 1024, // 512kb in bytes - flush_every_ms: Some(500), - idgen_persist_interval: 1_000_000, - snapshot_after_ops: if cfg!(feature = "testing") { - 10 - } else { - 1_000_000 - }, - global_error: Arc::new(Atomic::default()), - #[cfg(feature = "event_log")] - event_log: Arc::new(crate::event_log::EventLog::default()), - } - } -} - -impl Inner { - // Get the path of the database - #[doc(hidden)] - pub fn get_path(&self) -> PathBuf { - if self.temporary && self.path == PathBuf::from(DEFAULT_PATH) { - self.tmp_path.clone() - } else { - self.path.clone() - } - } - - fn db_path(&self) -> PathBuf { - self.get_path().join("db") - } - - fn config_path(&self) -> PathBuf { - self.get_path().join("conf") - } - - pub(crate) fn normalize(&self, value: T) -> T - where - T: Copy - + TryFrom - + std::ops::Div - + std::ops::Mul, - >::Error: Debug, - { - let segment_size: T = T::try_from(self.segment_size).unwrap(); - value / segment_size * segment_size - } -} - -macro_rules! supported { - ($cond:expr, $msg:expr) => { - if !$cond { - return Err(Error::Unsupported($msg)); - } - }; -} +use crate::Db; macro_rules! builder { ($(($name:ident, $t:ty, $desc:expr)),*) => { $( #[doc=$desc] pub fn $name(mut self, to: $t) -> Self { - if Arc::strong_count(&self.0) != 1 { - error!( - "config has already been used to start \ - the system and probably should not be \ - mutated", - ); - } - let m = Arc::make_mut(&mut self.0); - m.$name = to; + self.$name = to; self } )* } } +#[derive(Debug, Clone)] +pub struct Config { + /// The base directory for storing the database. + pub path: PathBuf, + /// Cache size in **bytes**. Default is 512mb. + pub cache_capacity_bytes: usize, + /// The percentage of the cache that is dedicated to the + /// scan-resistant entry cache. + pub entry_cache_percent: u8, + /// Start a background thread that flushes data to disk + /// every few milliseconds. Defaults to every 200ms. + pub flush_every_ms: Option, + /// The zstd compression level to use when writing data to disk. Defaults to 3. 
+ pub zstd_compression_level: i32, + /// This is only set to `Some` for objects created via + /// `Config::tmp`, and will remove the storage directory + /// when the final Arc drops. + pub tempdir_deleter: Option>, + /// A float between 0.0 and 1.0 that controls how much fragmentation can + /// exist in a file before GC attempts to recompact it. + pub target_heap_file_fill_ratio: f32, +} + +impl Default for Config { + fn default() -> Config { + Config { + path: "bloodstone.default".into(), + flush_every_ms: Some(200), + cache_capacity_bytes: 512 * 1024 * 1024, + entry_cache_percent: 20, + zstd_compression_level: 3, + tempdir_deleter: None, + target_heap_file_fill_ratio: 0.9, + } + } +} + impl Config { /// Returns a default `Config` pub fn new() -> Config { Config::default() } - /// Set the path of the database (builder). - pub fn path>(mut self, path: P) -> Config { - let m = Arc::get_mut(&mut self.0).unwrap(); - m.path = path.as_ref().to_path_buf(); - self - } - - /// A testing-only method for reducing the io-buffer size - /// to trigger correctness-critical behavior more often - /// by shrinking the buffer size. Don't rely on this. - #[doc(hidden)] - pub fn segment_size(mut self, segment_size: usize) -> Config { - if Arc::strong_count(&self.0) != 1 { - error!( - "config has already been used to start \ - the system and probably should not be \ - mutated", - ); - } - let m = Arc::make_mut(&mut self.0); - m.segment_size = segment_size; - self - } - - /// Opens a `Db` based on the provided config. - pub fn open(&self) -> Result { - // only validate, setup directory, and open file once - self.validate()?; - - let mut config = self.clone(); - config.limit_cache_max_memory(); - - let file = config.open_file()?; + /// Returns a config with the `path` initialized to a system + /// temporary directory that will be deleted when this `Config` + /// is dropped. + pub fn tmp() -> io::Result { + let tempdir = fallible!(tempdir::TempDir::new("sled_tmp")); - let heap_path = config.get_path().join("heap"); - let heap = Heap::start(&heap_path)?; - maybe_fsync_directory(heap_path)?; - - // seal config in a Config - let config = RunningConfig { - inner: config, - file: Arc::new(file), - heap: Arc::new(heap), - }; - - Db::start_inner(config) - } - - #[doc(hidden)] - pub fn flush_every_ms(mut self, every_ms: Option) -> Self { - if Arc::strong_count(&self.0) != 1 { - error!( - "config has already been used to start \ - the system and probably should not be \ - mutated", - ); - } - let m = Arc::make_mut(&mut self.0); - m.flush_every_ms = every_ms; - self + Ok(Config { + path: tempdir.path().into(), + tempdir_deleter: Some(Arc::new(tempdir)), + ..Config::default() + }) } - #[doc(hidden)] - pub fn idgen_persist_interval(mut self, interval: u64) -> Self { - if Arc::strong_count(&self.0) != 1 { - error!( - "config has already been used to start \ - the system and probably should not be \ - mutated", - ); - } - let m = Arc::make_mut(&mut self.0); - m.idgen_persist_interval = interval; + /// Set the path of the database (builder). 
+ pub fn path>(mut self, path: P) -> Config { + self.path = path.as_ref().to_path_buf(); self } - fn gen_temp_path() -> PathBuf { - use std::time::SystemTime; - - static SALT_COUNTER: AtomicUsize = AtomicUsize::new(0); - - let seed = SALT_COUNTER.fetch_add(1, SeqCst) as u128; - - let now = SystemTime::now() - .duration_since(SystemTime::UNIX_EPOCH) - .unwrap() - .as_nanos() - << 48; - - #[cfg(not(miri))] - let pid = u128::from(std::process::id()); - - #[cfg(miri)] - let pid = 0; - - let salt = (pid << 16) + now + seed; - - if cfg!(target_os = "linux") { - // use shared memory for temporary linux files - format!("/dev/shm/pagecache.tmp.{}", salt).into() - } else { - std::env::temp_dir().join(format!("pagecache.tmp.{}", salt)) - } - } - - fn limit_cache_max_memory(&mut self) { - if let Some(limit) = sys_limits::get_memory_limit() { - if self.cache_capacity > limit { - let m = Arc::make_mut(&mut self.0); - m.cache_capacity = limit; - error!( - "cache capacity is limited to the cgroup memory \ - limit: {} bytes", - self.cache_capacity - ); - } - } - } - builder!( - ( - cache_capacity, - usize, - "maximum size in bytes for the system page cache" - ), - ( - mode, - Mode, - "specify whether the system should run in \"small\" or \"fast\" mode" - ), - (use_compression, bool, "whether to use zstd compression"), - ( - compression_factor, - i32, - "the compression factor to use with zstd compression. Ranges from 1 up to 22. Levels >= 20 are 'ultra'." - ), - ( - temporary, - bool, - "deletes the database after drop. if no path is set, uses /dev/shm on linux" - ), - ( - create_new, - bool, - "attempts to exclusively open the database, failing if it already exists" - ), - ( - snapshot_after_ops, - u64, - "take a fuzzy snapshot of pagecache metadata after this many ops" - ) + (flush_every_ms, Option, "Start a background thread that flushes data to disk every few milliseconds. Defaults to every 200ms."), + (cache_capacity_bytes, usize, "Cache size in **bytes**. Default is 512mb."), + (entry_cache_percent, u8, "The percentage of the cache that is dedicated to the scan-resistant entry cache."), + (zstd_compression_level, i32, "The zstd compression level to use when writing data to disk. 
Defaults to 3.") ); - // panics if config options are outside of advised range - fn validate(&self) -> Result<()> { - supported!( - self.segment_size.count_ones() == 1, - "segment_size should be a power of 2" - ); - supported!( - self.segment_size >= 256, - "segment_size should be hundreds of kb at minimum, and we won't start if below 256" - ); - supported!( - self.segment_size <= 1 << 24, - "segment_size should be <= 16mb" - ); - if self.use_compression { - supported!( - !cfg!(feature = "no_zstd"), - "the 'no_zstd' feature is set, but Config.use_compression is also set to true" - ); - } - supported!( - self.compression_factor >= 1, - "compression_factor must be >= 1" - ); - supported!( - self.compression_factor <= 22, - "compression_factor must be <= 22" - ); - supported!( - self.idgen_persist_interval > 0, - "idgen_persist_interval must be above 0" - ); - Ok(()) - } - - fn open_file(&self) -> Result { - let heap_dir: PathBuf = self.get_path().join("heap"); - - if !heap_dir.exists() { - fs::create_dir_all(heap_dir)?; - } - - self.verify_config()?; - - // open the data file - let mut options = fs::OpenOptions::new(); - - let _ = options.create(true); - let _ = options.read(true); - let _ = options.write(true); - - if self.create_new { - options.create_new(true); - } - - let _ = std::fs::File::create( - self.get_path().join("DO_NOT_USE_THIS_DIRECTORY_FOR_ANYTHING"), - ); - - let file = self.try_lock(options.open(&self.db_path())?)?; - maybe_fsync_directory(self.get_path())?; - Ok(file) - } - - fn try_lock(&self, file: File) -> Result { - #[cfg(all( - not(miri), - any(windows, target_os = "linux", target_os = "macos") - ))] - { - use fs2::FileExt; - - let try_lock = - if cfg!(any(feature = "testing", feature = "light_testing")) { - // we block here because during testing - // there are many filesystem race condition - // that happen, causing locks to be held - // for long periods of time, so we should - // block to wait on reopening files. - file.lock_exclusive() - } else { - file.try_lock_exclusive() - }; - - if try_lock.is_err() { - return Err(Error::Io( - ErrorKind::Other, - "could not acquire database file lock", - )); - } - } - - Ok(file) - } - - fn verify_config(&self) -> Result<()> { - match self.read_config() { - Ok(Some(old)) => { - if self.use_compression { - supported!( - old.use_compression, - "cannot change compression configuration across restarts. \ - this database was created without compression enabled." - ); - } else { - supported!( - !old.use_compression, - "cannot change compression configuration across restarts. \ - this database was created with compression enabled." - ); - } - - supported!( - self.segment_size == old.segment_size, - "cannot change the io buffer size across restarts." - ); - - if self.version != old.version { - error!( - "This database was created using \ - pagecache version {}.{}, but our pagecache \ - version is {}.{}. Please perform an upgrade \ - using the sled::Db::export and sled::Db::import \ - methods.", - old.version.0, - old.version.1, - self.version.0, - self.version.1, - ); - supported!( - self.version == old.version, - "The stored database must use a compatible sled version. - See error log for more details." 
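To make the reworked `Config` surface above concrete, here is a minimal usage sketch. It is illustrative only and not part of the patch; the path string is a placeholder. It uses the `Config::new` and `Config::tmp` constructors and the builder setters generated by the `builder!` macro earlier in this file:

```rust
// Illustrative sketch only, not part of the patch.
use sled::Config;

fn build_configs() -> std::io::Result<()> {
    // Persistent database with explicit tuning via the generated builder methods.
    let persistent = Config::new()
        .path("my_sled_1_0_db") // placeholder path
        .cache_capacity_bytes(1024 * 1024 * 1024) // 1gb
        .entry_cache_percent(20)
        .flush_every_ms(Some(200))
        .zstd_compression_level(3);

    // Throwaway database in a system temp dir; the directory is removed
    // once the last clone of this Config (holding the tempdir_deleter Arc)
    // is dropped.
    let scratch = Config::tmp()?;

    let _ = (persistent, scratch);
    Ok(())
}
```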
- ); - } - Ok(()) - } - Ok(None) => self.write_config(), - Err(e) => Err(e), - } - } - - fn serialize(&self) -> Vec { - let persisted_config = StorageParameters { - version: self.version, - segment_size: self.segment_size, - use_compression: self.use_compression, - }; - - persisted_config.serialize() - } - - fn write_config(&self) -> Result<()> { - let bytes = self.serialize(); - let crc: u32 = crc32(&*bytes); - let crc_arr = u32_to_arr(crc); - - let temp_path = self.get_path().join("conf.tmp"); - let final_path = self.config_path(); - - let mut f = - fs::OpenOptions::new().write(true).create(true).open(&temp_path)?; - - io_fail!(self, "write_config bytes"); - f.write_all(&*bytes)?; - io_fail!(self, "write_config crc"); - f.write_all(&crc_arr)?; - io_fail!(self, "write_config fsync"); - f.sync_all()?; - io_fail!(self, "write_config rename"); - fs::rename(temp_path, final_path)?; - io_fail!(self, "write_config dir fsync"); - maybe_fsync_directory(self.get_path())?; - io_fail!(self, "write_config post"); - Ok(()) - } - - fn read_config(&self) -> Result> { - let path = self.config_path(); - - let f_res = fs::OpenOptions::new().read(true).open(&path); - - let mut f = match f_res { - Err(ref e) if e.kind() == ErrorKind::NotFound => { - return Ok(None); - } - Err(other) => { - return Err(other.into()); - } - Ok(f) => f, - }; - - if f.metadata()?.len() <= 8 { - warn!("empty/corrupt configuration file found"); - return Ok(None); - } - - let mut buf = vec![]; - let _ = f.read_to_end(&mut buf)?; - let len = buf.len(); - let _ = buf.split_off(len - 4); - - let mut crc_arr = [0_u8; 4]; - let _ = f.seek(io::SeekFrom::End(-4))?; - f.read_exact(&mut crc_arr)?; - let crc_expected = arr_to_u32(&crc_arr); - - let crc_actual = crc32(&*buf); - - if crc_expected != crc_actual { - warn!( - "crc for settings file {:?} failed! \ - can't verify that config is safe", - path - ); - } - - StorageParameters::deserialize(&buf).map(Some) - } - - /// Return the global error if one was encountered during - /// an asynchronous IO operation. - #[doc(hidden)] - pub fn global_error(&self) -> Result<()> { - let guard = pin(); - let ge = self.global_error.load(Acquire, &guard); - if ge.is_null() { - Ok(()) - } else { - #[allow(unsafe_code)] - unsafe { - Err(*ge.deref()) - } - } - } - - pub(crate) fn reset_global_error(&self) { - let guard = pin(); - let old = self.global_error.swap(Shared::default(), SeqCst, &guard); - if !old.is_null() { - #[allow(unsafe_code)] - unsafe { - guard.defer_destroy(old); - } - } - } - - pub(crate) fn set_global_error(&self, error_value: Error) { - let guard = pin(); - let error = Owned::new(error_value); - - let expected_old = Shared::null(); - - let _ = self.global_error.compare_and_set( - expected_old, - error, - SeqCst, - &guard, - ); - } - - #[cfg(feature = "failpoints")] - #[cfg(feature = "event_log")] - #[doc(hidden)] - // truncate the underlying file for corruption testing purposes. - pub fn truncate_corrupt(&self, new_len: u64) { - self.event_log.reset(); - let path = self.db_path(); - let f = std::fs::OpenOptions::new().write(true).open(path).unwrap(); - f.set_len(new_len).expect("should be able to truncate"); - } -} - -/// A Configuration that has an associated opened -/// file. 
-#[allow(clippy::module_name_repetitions)] -#[derive(Debug, Clone)] -pub struct RunningConfig { - inner: Config, - pub(crate) file: Arc, - pub(crate) heap: Arc, -} - -impl Deref for RunningConfig { - type Target = Config; - - fn deref(&self) -> &Config { - &self.inner - } -} - -#[cfg(all(not(miri), any(windows, target_os = "linux", target_os = "macos")))] -impl Drop for RunningConfig { - fn drop(&mut self) { - use fs2::FileExt; - if Arc::strong_count(&self.file) == 1 { - let _ = self.file.unlock(); - } - } -} - -impl Drop for Inner { - fn drop(&mut self) { - if self.temporary { - // Our files are temporary, so nuke them. - debug!("removing temporary storage file {:?}", self.get_path()); - let _res = fs::remove_dir_all(&self.get_path()); - } - } -} - -impl RunningConfig { - // returns the snapshot file paths for this system - #[doc(hidden)] - pub fn get_snapshot_files(&self) -> io::Result> { - let conf_path = self.get_path().join("snap."); - - let absolute_path: PathBuf = if Path::new(&conf_path).is_absolute() { - conf_path - } else { - std::env::current_dir()?.join(conf_path) - }; - - let filter = |dir_entry: io::Result| { - if let Ok(de) = dir_entry { - let path_buf = de.path(); - let path = path_buf.as_path(); - let path_str = &*path.to_string_lossy(); - if path_str.starts_with(&*absolute_path.to_string_lossy()) - && !path_str.ends_with(".generating") - { - Some(path.to_path_buf()) - } else { - None - } - } else { - None - } - }; - - let snap_dir = Path::new(&absolute_path).parent().unwrap(); - - if !snap_dir.exists() { - fs::create_dir_all(snap_dir)?; + pub fn open( + &self, + ) -> io::Result> { + if LEAF_FANOUT < 3 { + return Err(annotate!(io::Error::new( + io::ErrorKind::Unsupported, + "Db's LEAF_FANOUT const generic must be 3 or greater." + ))); } - - Ok(snap_dir.read_dir()?.filter_map(filter).collect()) + Db::open_with_config(self) } } - -fn crate_version() -> (usize, usize) { - let vsn = env!("CARGO_PKG_VERSION"); - let mut parts = vsn.split('.'); - let major = parts.next().unwrap().parse().unwrap(); - let minor = parts.next().unwrap().parse().unwrap(); - (major, minor) -} diff --git a/src/context.rs b/src/context.rs deleted file mode 100644 index 5a68fc175..000000000 --- a/src/context.rs +++ /dev/null @@ -1,73 +0,0 @@ -use super::*; - -#[derive(Debug, Clone)] -#[doc(hidden)] -pub struct Context { - // TODO file from config should be in here - config: RunningConfig, - /// Periodically flushes dirty data. We keep this in an - /// Arc separate from the PageCache below to separate - /// "high-level" references from Db, Tree etc... from - /// "low-level" references like background threads. - /// When the last high-level reference is dropped, it - /// should trigger all background threads to clean - /// up synchronously. - #[cfg(not(miri))] - pub(crate) flusher: Arc>>, - #[doc(hidden)] - pub pagecache: PageCache, -} - -impl std::ops::Deref for Context { - type Target = RunningConfig; - - fn deref(&self) -> &RunningConfig { - &self.config - } -} - -impl Context { - pub(crate) fn start(config: RunningConfig) -> Result { - trace!("starting context"); - - let pagecache = PageCache::start(config.clone())?; - - Ok(Self { - config, - pagecache, - #[cfg(not(miri))] - flusher: Arc::new(parking_lot::Mutex::new(None)), - }) - } - - /// Returns `true` if the database was - /// recovered from a previous process. - /// Note that database state is only - /// guaranteed to be present up to the - /// last call to `flush`! 
Otherwise state - /// is synced to disk periodically if the - /// `sync_every_ms` configuration option - /// is set to `Some(number_of_ms_between_syncs)` - /// or if the IO buffer gets filled to - /// capacity before being rotated. - pub fn was_recovered(&self) -> bool { - self.pagecache.was_recovered() - } - - /// Generate a monotonic ID. Not guaranteed to be - /// contiguous. Written to disk every `idgen_persist_interval` - /// operations, followed by a blocking flush. During recovery, we - /// take the last recovered generated ID and add 2x - /// the `idgen_persist_interval` to it. While persisting, if the - /// previous persisted counter wasn't synced to disk yet, we will do - /// a blocking flush to fsync the latest counter, ensuring - /// that we will never give out the same counter twice. - pub fn generate_id(&self) -> Result { - let _cc = concurrency_control::read(); - self.pagecache.generate_id_inner() - } - - pub(crate) fn pin_log(&self, guard: &Guard) -> Result> { - self.pagecache.pin_log(guard) - } -} diff --git a/src/db.rs b/src/db.rs index d64357199..748cef060 100644 --- a/src/db.rs +++ b/src/db.rs @@ -1,250 +1,226 @@ -use std::ops::Deref; +use std::collections::HashMap; +use std::fmt; +use std::io; +use std::sync::{mpsc, Arc}; +use std::time::{Duration, Instant}; -use crate::*; +use parking_lot::Mutex; -const DEFAULT_TREE_ID: &[u8] = b"__sled__default"; +use crate::*; -/// The `sled` embedded database! Implements -/// `Deref` to refer to -/// a default keyspace / namespace / bucket. +/// sled 1.0 alpha :) +/// +/// One of the main differences between this and sled 0.34 is that +/// `Db` and `Tree` now have a `LEAF_FANOUT` const generic parameter. +/// This parameter is an interesting single-knob performance tunable +/// that allows users to traverse the performance-vs-efficiency +/// trade-off spectrum. The default value of `1024` causes keys and +/// values to be more efficiently compressed when stored on disk, +/// but for larger-than-memory random workloads it may be advantageous +/// to lower `LEAF_FANOUT` to between `16` to `256`, depending on your +/// efficiency requirements. A lower value will also cause contention +/// to be reduced for frequently accessed data. This value cannot be +/// changed after creating the database. +/// +/// As an alpha release, please do not expect this to be safe for +/// business-critical use cases. However, if you would like this to +/// serve your business-critical use cases over time, please give it +/// a shot in a low-risk non-production environment and report any +/// issues you encounter in a github issue. /// -/// When dropped, all buffered writes are flushed -/// to disk, using the same method used by -/// `Tree::flush`. +/// Note that `Db` implements `Deref` for the default `Tree` (sled's +/// version of namespaces / keyspaces / buckets), but you can create +/// and use others using `Db::open_tree`. 
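As a small illustration of the `LEAF_FANOUT` trade-off described in the doc comment above: the fanout is chosen at the type level when opening the database. This is a sketch under the alpha API shown elsewhere in this diff (`Config::open`, `insert`, `get`, `flush`, `InlineArray`), and the path is a placeholder:

```rust
// Sketch only, not part of the patch. Assumes the alpha API added in this
// diff: `sled::Config`, `sled::Db<LEAF_FANOUT>`, and `sled::InlineArray`.
use sled::{Config, Db, InlineArray};

fn open_with_low_fanout() -> std::io::Result<()> {
    // A smaller fanout (here 64) favors lower contention and smaller
    // per-leaf reads over on-disk compression efficiency, per the
    // trade-off described above; 1024 is the default, and values
    // below 3 are rejected by Config::open.
    let db: Db<64> = Config::new().path("fanout_demo_db").open()?;

    let key: InlineArray = b"some key".as_ref().into();
    let value: InlineArray = b"some value".as_ref().into();

    // Reads and writes here go through the default Tree via Deref.
    db.insert(key.clone(), value.clone())?;
    assert_eq!(db.get(&key)?, Some(value));

    db.flush()?;
    Ok(())
}
```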
#[derive(Clone)] -#[doc(alias = "database")] -pub struct Db { - #[doc(hidden)] - pub context: Context, - pub(crate) default: Tree, - tenants: Arc>>, -} - -impl Deref for Db { - type Target = Tree; - - fn deref(&self) -> &Tree { - &self.default - } +pub struct Db { + config: Config, + _shutdown_dropper: Arc>, + cache: ObjectCache, + trees: Arc>>>, + collection_id_allocator: Arc, + collection_name_mapping: Tree, + default_tree: Tree, + was_recovered: bool, } -impl Debug for Db { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - let tenants = self.tenants.read(); - writeln!(f, "Db {{")?; - for (raw_name, tree) in tenants.iter() { - let name = std::str::from_utf8(raw_name) - .ok() - .map_or_else(|| format!("{:?}", raw_name), String::from); - write!(f, " Tree: {:?} contents: {:?}", name, tree)?; - } - write!(f, "}}")?; - Ok(()) +impl std::ops::Deref for Db { + type Target = Tree; + fn deref(&self) -> &Tree { + &self.default_tree } } -impl Db { - pub(crate) fn start_inner(config: RunningConfig) -> Result { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_start); +impl fmt::Debug for Db { + fn fmt(&self, w: &mut fmt::Formatter<'_>) -> fmt::Result { + let alternate = w.alternate(); - let context = Context::start(config)?; + let mut debug_struct = w.debug_struct(&format!("Db<{}>", LEAF_FANOUT)); - #[cfg(not(miri))] - { - let flusher_pagecache = context.pagecache.clone(); - let flusher = context.flush_every_ms.map(move |fem| { - flusher::Flusher::new( - "log flusher".to_owned(), - flusher_pagecache, - fem, + if alternate { + debug_struct + .field("global_error", &self.check_error()) + .field( + "data", + &format!("{:?}", self.iter().collect::>()), ) - }); - *context.flusher.lock() = flusher; - } - - // create or open the default tree - let guard = pin(); - let default = - meta::open_tree(&context, DEFAULT_TREE_ID.to_vec(), &guard)?; - - let ret = Self { - context: context.clone(), - default, - tenants: Arc::new(RwLock::new(FastMap8::default())), - }; - - let mut tenants = ret.tenants.write(); - - for (id, root) in &context.pagecache.get_meta(&guard).inner { - let tree = Tree(Arc::new(TreeInner { - tree_id: id.clone(), - subscribers: Subscribers::default(), - context: context.clone(), - root: AtomicU64::new(*root), - merge_operator: RwLock::new(None), - })); - assert!(tenants.insert(id.clone(), tree).is_none()); - } - - drop(tenants); - - #[cfg(feature = "event_log")] - { - for (_name, tree) in ret.tenants.read().iter() { - tree.verify_integrity()?; - } - ret.context.event_log.verify(); + .finish() + } else { + debug_struct.field("global_error", &self.check_error()).finish() } - - Ok(ret) } +} - /// Open or create a new disk-backed Tree with its own keyspace, - /// accessible from the `Db` via the provided identifier. - pub fn open_tree>(&self, name: V) -> Result { - let name_ref = name.as_ref(); - - { - let tenants = self.tenants.read(); - if let Some(tree) = tenants.get(name_ref) { - return Ok(tree.clone()); +fn flusher( + cache: ObjectCache, + shutdown_signal: mpsc::Receiver>, + flush_every_ms: usize, +) { + let interval = Duration::from_millis(flush_every_ms as _); + let mut last_flush_duration = Duration::default(); + + let flush = || { + let flush_res_res = std::panic::catch_unwind(|| cache.flush()); + match flush_res_res { + Ok(Ok(_)) => { + // don't abort. 
+ return; + } + Ok(Err(flush_failure)) => { + log::error!( + "Db flusher encountered error while flushing: {:?}", + flush_failure + ); + cache.set_error(&flush_failure); + } + Err(panicked) => { + log::error!( + "Db flusher panicked while flushing: {:?}", + panicked + ); + cache.set_error(&io::Error::new( + io::ErrorKind::Other, + "Db flusher panicked while flushing".to_string(), + )); } - drop(tenants); } + std::process::abort(); + }; + + loop { + let recv_timeout = interval + .saturating_sub(last_flush_duration) + .max(Duration::from_millis(1)); + if let Ok(shutdown_sender) = shutdown_signal.recv_timeout(recv_timeout) + { + flush(); + + // this is probably unnecessary but it will avoid issues + // if egregious bugs get introduced that trigger it + cache.set_error(&io::Error::new( + io::ErrorKind::Other, + "system has been shut down".to_string(), + )); - let guard = pin(); + assert!(cache.is_clean()); - let mut tenants = self.tenants.write(); + drop(cache); - // we need to check this again in case another - // thread opened it concurrently. - if let Some(tree) = tenants.get(name_ref) { - return Ok(tree.clone()); + if let Err(e) = shutdown_sender.send(()) { + log::error!( + "Db flusher could not ack shutdown to requestor: {e:?}" + ); + } + log::debug!( + "flush thread terminating after signalling to requestor" + ); + return; } - let tree = meta::open_tree(&self.context, name_ref.to_vec(), &guard)?; + let before_flush = Instant::now(); - assert!(tenants.insert(name_ref.into(), tree.clone()).is_none()); + flush(); - Ok(tree) + last_flush_duration = before_flush.elapsed(); } +} - /// Remove a disk-backed collection. This is blocking and fairly slow. - pub fn drop_tree>(&self, name: V) -> Result { - let name_ref = name.as_ref(); - if name_ref == DEFAULT_TREE_ID { - return Err(Error::Unsupported("cannot remove the default tree")); +impl Drop for Db { + fn drop(&mut self) { + if self.config.flush_every_ms.is_none() { + if let Err(e) = self.flush() { + log::error!("failed to flush Db on Drop: {e:?}"); + } + } else { + // otherwise, it is expected that the flusher thread will + // flush while shutting down the final Db/Tree instance } - trace!("dropping tree {:?}", name_ref,); - - let mut tenants = self.tenants.write(); + } +} - let tree = if let Some(tree) = tenants.remove(name_ref) { - tree - } else { - return Ok(false); - }; +impl Db { + #[cfg(feature = "for-internal-testing-only")] + fn validate(&self) -> io::Result<()> { + // for each tree, iterate over index, read node and assert low key matches + // and assert first time we've ever seen node ID - // signal to all threads that this tree is no longer valid - tree.root.store(u64::max_value(), SeqCst); + let mut ever_seen = std::collections::HashSet::new(); + let before = std::time::Instant::now(); - let guard = pin(); + for (_cid, tree) in self.trees.lock().iter() { + let mut hi_none_count = 0; + let mut last_hi = None; + for (low, node) in tree.index.iter() { + // ensure we haven't reused the object_id across Trees + assert!(ever_seen.insert(node.object_id)); - let mut root_id = - Some(self.context.pagecache.meta_pid_for_name(name_ref, &guard)?); + let (read_low, node_mu, read_node) = + tree.page_in(&low, self.cache.current_flush_epoch())?; - let mut leftmost_chain: Vec = vec![root_id.unwrap()]; - let mut cursor = root_id.unwrap(); - while let Some(view) = self.view_for_pid(cursor, &guard)? 
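The flusher above combines three ingredients: `catch_unwind`, so a panicking flush can be escalated deliberately rather than silently killing the thread; `recv_timeout`, so the thread sleeps between flushes but still reacts promptly to shutdown; and an acknowledgement channel, so shutdown can wait for the final flush. A stripped-down, sled-independent sketch of the same loop (all names here are illustrative; the real flusher additionally records failures via `set_error` and aborts the process):

```rust
use std::panic;
use std::sync::mpsc;
use std::time::{Duration, Instant};

// Illustrative stand-in for the real flush work.
fn do_work() -> Result<(), String> {
    Ok(())
}

fn worker(shutdown: mpsc::Receiver<mpsc::Sender<()>>, interval: Duration) {
    let mut last_duration = Duration::default();

    let work_once = || {
        // Catch panics so the caller decides how to escalate.
        match panic::catch_unwind(do_work) {
            Ok(Ok(())) => {}
            Ok(Err(e)) => eprintln!("work failed: {e}"),
            Err(_) => eprintln!("work panicked"),
        }
    };

    loop {
        // Sleep out the rest of the interval, but wake early on shutdown.
        let timeout =
            interval.saturating_sub(last_duration).max(Duration::from_millis(1));

        if let Ok(ack) = shutdown.recv_timeout(timeout) {
            work_once(); // one final flush before acknowledging shutdown
            let _ = ack.send(());
            return;
        }

        let start = Instant::now();
        work_once();
        last_duration = start.elapsed();
    }
}
```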
{ - if view.is_index { - let leftmost_child = view.iter_index_pids().next().unwrap(); - leftmost_chain.push(leftmost_child); - cursor = leftmost_child; - } else { - break; - } - } + assert_eq!(read_node.object_id, node.object_id); + assert_eq!(node_mu.leaf.as_ref().unwrap().lo, low); + assert_eq!(read_low, low); - loop { - let res = self - .context - .pagecache - .cas_root_in_meta(name_ref, root_id, None, &guard)?; + if let Some(hi) = &last_hi { + assert_eq!(hi, &node_mu.leaf.as_ref().unwrap().lo); + } - if let Err(actual_root) = res { - root_id = actual_root; - } else { - break; + if let Some(hi) = &node_mu.leaf.as_ref().unwrap().hi { + last_hi = Some(hi.clone()); + } else { + assert_eq!(hi_none_count, 0); + hi_none_count += 1; + } } } - // drop writer lock and asynchronously - drop(tenants); + log::debug!( + "{} leaves looking good after {} micros", + ever_seen.len(), + before.elapsed().as_micros() + ); - guard.flush(); - - drop(guard); - - self.gc_pages(leftmost_chain)?; + Ok(()) + } - Ok(true) + pub fn stats(&self) -> Stats { + Stats { cache: self.cache.stats() } } - // Remove all pages for this tree from the underlying - // PageCache. This will leave orphans behind if - // the tree crashes during gc. - fn gc_pages(&self, mut leftmost_chain: Vec) -> Result<()> { - let mut guard = pin(); - - let mut ops = 0; - while let Some(mut pid) = leftmost_chain.pop() { - loop { - ops += 1; - if ops % 64 == 0 { - // we re-pin here to avoid memory blow-ups during - // long-running tree removals. - guard = pin(); - } - let cursor_view = - if let Some(view) = self.view_for_pid(pid, &guard)? { - view - } else { - trace!( - "encountered Free node pid {} while GC'ing tree", - pid - ); - break; - }; - - let ret = self.context.pagecache.free( - pid, - cursor_view.node_view.0, - &guard, - )?; - - if ret.is_ok() { - let next_pid = if let Some(next_pid) = cursor_view.next { - next_pid - } else { - break; - }; - assert_ne!(pid, next_pid.get()); - pid = next_pid.get(); - } - } + pub fn size_on_disk(&self) -> io::Result { + use std::fs::read_dir; + + fn recurse(mut dir: std::fs::ReadDir) -> io::Result { + dir.try_fold(0, |acc, file| { + let file = file?; + let size = match file.metadata()? { + data if data.is_dir() => recurse(read_dir(file.path())?)?, + data => data.len(), + }; + Ok(acc + size) + }) } - Ok(()) - } - - /// Returns the trees names saved in this Db. - pub fn tree_names(&self) -> Vec { - let tenants = self.tenants.read(); - tenants.iter().map(|(name, _)| name.clone()).collect() + recurse(read_dir(&self.cache.config.path)?) } /// Returns `true` if the database was @@ -253,24 +229,119 @@ impl Db { /// guaranteed to be present up to the /// last call to `flush`! Otherwise state /// is synced to disk periodically if the - /// `sync_every_ms` configuration option + /// `Config.sync_every_ms` configuration option /// is set to `Some(number_of_ms_between_syncs)` /// or if the IO buffer gets filled to /// capacity before being rotated. pub fn was_recovered(&self) -> bool { - self.context.was_recovered() + self.was_recovered } - /// Generate a monotonic ID. Not guaranteed to be - /// contiguous. Written to disk every `idgen_persist_interval` - /// operations, followed by a blocking flush. During recovery, we - /// take the last recovered generated ID and add 2x - /// the `idgen_persist_interval` to it. 
While persisting, if the - /// previous persisted counter wasn't synced to disk yet, we will do - /// a blocking flush to fsync the latest counter, ensuring - /// that we will never give out the same counter twice. - pub fn generate_id(&self) -> Result { - self.context.generate_id() + pub fn open_with_config(config: &Config) -> io::Result> { + let (shutdown_tx, shutdown_rx) = mpsc::channel(); + + let (cache, indices, was_recovered) = ObjectCache::recover(&config)?; + + let _shutdown_dropper = Arc::new(ShutdownDropper { + shutdown_sender: Mutex::new(shutdown_tx), + cache: Mutex::new(cache.clone()), + }); + + let mut allocated_collection_ids = fnv::FnvHashSet::default(); + + let mut trees: HashMap> = indices + .into_iter() + .map(|(collection_id, index)| { + assert!( + allocated_collection_ids.insert(collection_id.0), + "allocated_collection_ids already contained {:?}", + collection_id + ); + ( + collection_id, + Tree::new( + collection_id, + cache.clone(), + index, + _shutdown_dropper.clone(), + ), + ) + }) + .collect(); + + let collection_name_mapping = + trees.get(&NAME_MAPPING_COLLECTION_ID).unwrap().clone(); + + let default_tree = trees.get(&DEFAULT_COLLECTION_ID).unwrap().clone(); + + for kv_res in collection_name_mapping.iter() { + let (_collection_name, collection_id_buf) = kv_res.unwrap(); + let collection_id = CollectionId(u64::from_le_bytes( + collection_id_buf.as_ref().try_into().unwrap(), + )); + + if trees.contains_key(&collection_id) { + continue; + } + + // need to initialize tree leaf for empty collection + + assert!( + allocated_collection_ids.insert(collection_id.0), + "allocated_collection_ids already contained {:?}", + collection_id + ); + + let initial_low_key = InlineArray::default(); + + let empty_node = cache.allocate_default_node(collection_id); + + let index = Index::default(); + + assert!(index.insert(initial_low_key, empty_node).is_none()); + + let tree = Tree::new( + collection_id, + cache.clone(), + index, + _shutdown_dropper.clone(), + ); + + trees.insert(collection_id, tree); + } + + let collection_id_allocator = + Arc::new(Allocator::from_allocated(&allocated_collection_ids)); + + assert_eq!(collection_name_mapping.len() + 2, trees.len()); + + let ret = Db { + config: config.clone(), + cache: cache.clone(), + default_tree, + collection_name_mapping, + collection_id_allocator, + trees: Arc::new(Mutex::new(trees)), + _shutdown_dropper, + was_recovered, + }; + + #[cfg(feature = "for-internal-testing-only")] + ret.validate()?; + + if let Some(flush_every_ms) = ret.cache.config.flush_every_ms { + let spawn_res = std::thread::Builder::new() + .name("sled_flusher".into()) + .spawn(move || flusher(cache, shutdown_rx, flush_every_ms)); + + if let Err(e) = spawn_res { + return Err(io::Error::new( + io::ErrorKind::Other, + format!("unable to spawn flusher thread for sled database: {:?}", e) + )); + } + } + Ok(ret) } /// A database export method for all collections in the `Db`, @@ -304,11 +375,11 @@ impl Db { /// ``` /// # use sled as old_sled; /// # fn main() -> Result<(), Box> { - /// let old = old_sled::open("my_old__db")?; + /// let old = old_sled::open("my_old_db_export")?; /// /// // may be a different version of sled, /// // the export type is version agnostic. 
- /// let new = sled::open("my_new__db")?; + /// let new = sled::open("my_new_db_export")?; /// /// let export = old.export(); /// new.import(export); @@ -316,22 +387,31 @@ impl Db { /// assert_eq!(old.checksum()?, new.checksum()?); /// # drop(old); /// # drop(new); - /// # std::fs::remove_file("my_old__db"); - /// # std::fs::remove_file("my_new__db"); + /// # let _ = std::fs::remove_dir_all("my_old_db_export"); + /// # let _ = std::fs::remove_dir_all("my_new_db_export"); /// # Ok(()) } /// ``` pub fn export( &self, - ) -> Vec<(CollectionType, CollectionName, impl Iterator>>)> - { - let tenants = self.tenants.read(); + ) -> Vec<( + CollectionType, + CollectionName, + impl Iterator>> + '_, + )> { + let trees = self.trees.lock(); let mut ret = vec![]; - for (name, tree) in tenants.iter() { + for kv_res in self.collection_name_mapping.iter() { + let (collection_name, collection_id_buf) = kv_res.unwrap(); + let collection_id = CollectionId(u64::from_le_bytes( + collection_id_buf.as_ref().try_into().unwrap(), + )); + let tree = trees.get(&collection_id).unwrap().clone(); + ret.push(( b"tree".to_vec(), - name.to_vec(), + collection_name.to_vec(), tree.iter().map(|kv_opt| { let kv = kv_opt.unwrap(); vec![kv.0.to_vec(), kv.1.to_vec()] @@ -370,11 +450,11 @@ impl Db { /// ``` /// # use sled as old_sled; /// # fn main() -> Result<(), Box> { - /// let old = old_sled::open("my_old_db")?; + /// let old = old_sled::open("my_old_db_import")?; /// /// // may be a different version of sled, /// // the export type is version agnostic. - /// let new = sled::open("my_new_db")?; + /// let new = sled::open("my_new_db_import")?; /// /// let export = old.export(); /// new.import(export); @@ -382,8 +462,8 @@ impl Db { /// assert_eq!(old.checksum()?, new.checksum()?); /// # drop(old); /// # drop(new); - /// # std::fs::remove_file("my_old_db"); - /// # std::fs::remove_file("my_new_db"); + /// # let _ = std::fs::remove_dir_all("my_old_db_import"); + /// # let _ = std::fs::remove_dir_all("my_new_db_import"); /// # Ok(()) } /// ``` pub fn import( @@ -421,54 +501,78 @@ impl Db { } } - /// Returns the CRC32 of all keys and values - /// in this Db. - /// - /// This is O(N) and locks all underlying Trees - /// for the duration of the entire scan. - pub fn checksum(&self) -> Result { - let tenants_mu = self.tenants.write(); - - // we use a btreemap to ensure lexicographic - // iteration over tree names to have consistent - // checksums. - let tenants: BTreeMap<_, _> = tenants_mu.iter().collect(); - - let mut hasher = crc32fast::Hasher::new(); - - for (name, tree) in &tenants { - hasher.update(name); - - let mut iter = tree.iter(); - while let Some(kv_res) = iter.next_inner() { - let (k, v) = kv_res?; - hasher.update(&k); - hasher.update(&v); - } - } - - Ok(hasher.finalize()) + pub fn contains_tree>(&self, name: V) -> io::Result { + Ok(self.collection_name_mapping.get(name.as_ref())?.is_some()) } - /// Returns the on-disk size of the storage files - /// for this database. - pub fn size_on_disk(&self) -> Result { - self.context.pagecache.size_on_disk() - } + pub fn drop_tree>(&self, name: V) -> io::Result { + let name_ref = name.as_ref(); + let trees = self.trees.lock(); + + let tree = if let Some(collection_id_buf) = + self.collection_name_mapping.get(name_ref)? 
+ { + let collection_id = CollectionId(u64::from_le_bytes( + collection_id_buf.as_ref().try_into().unwrap(), + )); + + trees.get(&collection_id).unwrap() + } else { + return Ok(false); + }; + + tree.clear()?; - /// Traverses all files and calculates their total physical - /// size, then traverses all pages and calculates their - /// total logical size, then divides the physical size - /// by the logical size. - #[doc(hidden)] - pub fn space_amplification(&self) -> Result { - self.context.pagecache.space_amplification() + self.collection_name_mapping.remove(name_ref)?; + + Ok(true) } - /// Returns a true value if one of the tree names linked - /// to the database is found, if not a false will be returned. - pub fn contains_tree>(&self, name: V) -> bool { - self.tenants.read().contains_key(name.as_ref()) + /// Open or create a new disk-backed [`Tree`] with its own keyspace, + /// accessible from the `Db` via the provided identifier. + pub fn open_tree>( + &self, + name: V, + ) -> io::Result> { + let name_ref = name.as_ref(); + let mut trees = self.trees.lock(); + + if let Some(collection_id_buf) = + self.collection_name_mapping.get(name_ref)? + { + let collection_id = CollectionId(u64::from_le_bytes( + collection_id_buf.as_ref().try_into().unwrap(), + )); + + let tree = trees.get(&collection_id).unwrap(); + + return Ok(tree.clone()); + } + + let collection_id = + CollectionId(self.collection_id_allocator.allocate()); + + let initial_low_key = InlineArray::default(); + + let empty_node = self.cache.allocate_default_node(collection_id); + + let index = Index::default(); + + assert!(index.insert(initial_low_key, empty_node).is_none()); + + let tree = Tree::new( + collection_id, + self.cache.clone(), + index, + self._shutdown_dropper.clone(), + ); + + self.collection_name_mapping + .insert(name_ref, &collection_id.0.to_le_bytes())?; + + trees.insert(collection_id, tree.clone()); + + Ok(tree) } } diff --git a/src/debug_delay.rs b/src/debug_delay.rs deleted file mode 100644 index 214656378..000000000 --- a/src/debug_delay.rs +++ /dev/null @@ -1,105 +0,0 @@ -#![allow(clippy::float_arithmetic)] - -use std::sync::atomic::{AtomicUsize, Ordering::Relaxed}; - -use crate::Lazy; - -/// This function is useful for inducing random jitter into our atomic -/// operations, shaking out more possible interleavings quickly. It gets -/// fully eliminated by the compiler in non-test code. -pub fn debug_delay() { - use std::thread; - use std::time::Duration; - - static GLOBAL_DELAYS: AtomicUsize = AtomicUsize::new(0); - - static INTENSITY: Lazy u32> = Lazy::new(|| { - std::env::var("SLED_LOCK_FREE_DELAY_INTENSITY") - .unwrap_or_else(|_| "100".into()) - .parse() - .expect( - "SLED_LOCK_FREE_DELAY_INTENSITY must be set to a \ - non-negative integer (ideally below 1,000,000)", - ) - }); - - static CRASH_CHANCE: Lazy u32> = Lazy::new(|| { - std::env::var("SLED_CRASH_CHANCE") - .unwrap_or_else(|_| "0".into()) - .parse() - .expect( - "SLED_CRASH_CHANCE must be set to a \ - non-negative integer (ideally below 50,000)", - ) - }); - - thread_local!( - static LOCAL_DELAYS: std::cell::RefCell = std::cell::RefCell::new(0) - ); - - if cfg!(feature = "miri_optimizations") { - // Each interaction with LOCAL_DELAYS adds more stacked borrows - // tracking information, and Miri is single-threaded anyway. 
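Stepping back to the `Db` surface assembled in `db.rs` above: tree management consists of `open_tree`, `contains_tree`, and `drop_tree`, with `was_recovered` reporting whether on-disk state was found at startup. A short usage sketch against exactly those signatures (the tree name and log line are illustrative):

```rust
fn tree_management(db: &sled::Db<1024>) -> std::io::Result<()> {
    // `was_recovered` distinguishes a freshly created database from one
    // whose state was read back from disk.
    if db.was_recovered() {
        eprintln!("recovered existing state from disk");
    }

    // Keyspaces are created on demand; reopening by name returns the same
    // underlying Tree.
    let _queue = db.open_tree("work-queue")?;
    assert!(db.contains_tree("work-queue")?);

    // drop_tree clears the tree's contents and removes its name mapping;
    // the returned bool reports whether the tree existed.
    assert!(db.drop_tree("work-queue")?);
    assert!(!db.contains_tree("work-queue")?);

    Ok(())
}
```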
- return; - } - - let global_delays = GLOBAL_DELAYS.fetch_add(1, Relaxed); - let local_delays = LOCAL_DELAYS.with(|ld| { - let old = *ld.borrow(); - let new = (global_delays + 1).max(old + 1); - *ld.borrow_mut() = new; - old - }); - - if *CRASH_CHANCE > 0 && random(*CRASH_CHANCE) == 0 { - std::process::exit(9) - } - - if global_delays == local_delays { - // no other threads seem to be - // calling this, so we may as - // well skip it - return; - } - - if random(1000) == 1 { - let duration = random(*INTENSITY); - - #[allow(clippy::cast_possible_truncation)] - #[allow(clippy::cast_sign_loss)] - thread::sleep(Duration::from_micros(u64::from(duration))); - } - - if random(2) == 0 { - thread::yield_now(); - } -} - -/// Generates a random number in `0..n`. -fn random(n: u32) -> u32 { - use std::cell::Cell; - use std::num::Wrapping; - - thread_local! { - static RNG: Cell> = Cell::new(Wrapping(1_406_868_647)); - } - - #[allow(clippy::cast_possible_truncation)] - RNG.try_with(|rng| { - // This is the 32-bit variant of Xorshift. - // - // Source: https://en.wikipedia.org/wiki/Xorshift - let mut x = rng.get(); - x ^= x << 13; - x ^= x >> 17; - x ^= x << 5; - rng.set(x); - - // This is a fast alternative to `x % n`. - // - // Author: Daniel Lemire - // Source: https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ - (u64::from(x.0).wrapping_mul(u64::from(n)) >> 32) as u32 - }) - .unwrap_or(0) -} diff --git a/src/dll.rs b/src/dll.rs deleted file mode 100644 index c150f396b..000000000 --- a/src/dll.rs +++ /dev/null @@ -1,263 +0,0 @@ -#![allow(unsafe_code)] - -use std::{cell::UnsafeCell, ptr}; - -use super::lru::CacheAccess; - -/// A simple doubly linked list for use in the `Lru` -#[derive(Debug)] -pub(crate) struct Node { - inner: UnsafeCell, - next: *mut Node, - prev: *mut Node, -} - -impl std::ops::Deref for Node { - type Target = CacheAccess; - - fn deref(&self) -> &CacheAccess { - unsafe { &(*self.inner.get()) } - } -} - -impl std::ops::DerefMut for Node { - fn deref_mut(&mut self) -> &mut CacheAccess { - unsafe { &mut (*self.inner.get()) } - } -} - -impl Node { - fn unwire(&mut self) { - unsafe { - if !self.prev.is_null() { - (*self.prev).next = self.next; - } - - if !self.next.is_null() { - (*self.next).prev = self.prev; - } - } - - self.next = ptr::null_mut(); - self.prev = ptr::null_mut(); - } - - // This is a bit hacky but it's done - // this way because we don't have a way - // of mutating a key that is in a HashSet. - // - // This is safe to do because the hash - // happens based on the PageId of the - // CacheAccess, rather than the size - // that we modify here. - pub fn swap_sz(&self, new_sz: u8) -> u8 { - unsafe { std::mem::replace(&mut (*self.inner.get()).sz, new_sz) } - } -} - -/// A simple non-cyclical doubly linked -/// list where items can be efficiently -/// removed from the middle, for the purposes -/// of backing an LRU cache. 
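One detail worth noting from the removed `debug_delay.rs` above is its `random` helper: an xorshift32 generator combined with Lemire's multiply-then-shift reduction, which maps a full-range `u32` into `0..n` with a multiplication instead of the division behind `x % n`. A standalone check of that reduction (function and test values are illustrative):

```rust
// (x * n) >> 32 lands in 0..n for every u32 x, because x * n < n * 2^32.
fn reduce(x: u32, n: u32) -> u32 {
    ((u64::from(x) * u64::from(n)) >> 32) as u32
}

fn main() {
    let n = 1000;
    for x in [0u32, 1, u32::MAX / 2, u32::MAX] {
        assert!(reduce(x, n) < n);
    }
    // The extremes of the input range map to the ends of the output range.
    assert_eq!(reduce(0, n), 0);
    assert_eq!(reduce(u32::MAX, n), n - 1);
}
```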
-pub struct DoublyLinkedList { - head: *mut Node, - tail: *mut Node, - len: usize, -} - -unsafe impl Send for DoublyLinkedList {} - -impl Drop for DoublyLinkedList { - fn drop(&mut self) { - let mut cursor = self.head; - while !cursor.is_null() { - unsafe { - let node = Box::from_raw(cursor); - - // don't need to check for cycles - // because this Dll is non-cyclical - cursor = node.prev; - - // this happens without the manual drop, - // but we keep it for explicitness - drop(node); - } - } - } -} - -impl Default for DoublyLinkedList { - fn default() -> Self { - Self { head: ptr::null_mut(), tail: ptr::null_mut(), len: 0 } - } -} - -impl DoublyLinkedList { - pub(crate) const fn len(&self) -> usize { - self.len - } - - pub(crate) fn push_head(&mut self, item: CacheAccess) -> *mut Node { - self.len += 1; - - let node = Node { - inner: UnsafeCell::new(item), - next: ptr::null_mut(), - prev: self.head, - }; - - let ptr = Box::into_raw(Box::new(node)); - - self.push_head_ptr(ptr); - - ptr - } - - fn push_head_ptr(&mut self, ptr: *mut Node) { - if !self.head.is_null() { - unsafe { - (*self.head).next = ptr; - (*ptr).prev = self.head; - } - } - - if self.tail.is_null() { - self.tail = ptr; - } - - self.head = ptr; - } - - #[cfg(test)] - pub(crate) fn push_tail(&mut self, item: CacheAccess) { - self.len += 1; - - let node = Node { - inner: UnsafeCell::new(item), - next: self.tail, - prev: ptr::null_mut(), - }; - - let ptr = Box::into_raw(Box::new(node)); - - if !self.tail.is_null() { - unsafe { - (*self.tail).prev = ptr; - } - } - - if self.head.is_null() { - self.head = ptr; - } - - self.tail = ptr; - } - - pub(crate) fn promote(&mut self, ptr: *mut Node) { - if self.head == ptr { - return; - } - - unsafe { - if self.tail == ptr { - self.tail = (*ptr).next; - } - - if self.head == ptr { - self.head = (*ptr).prev; - } - - (*ptr).unwire(); - - self.push_head_ptr(ptr); - } - } - - #[cfg(test)] - pub(crate) fn pop_head(&mut self) -> Option { - if self.head.is_null() { - return None; - } - - self.len -= 1; - - unsafe { - let mut head = Box::from_raw(self.head); - - if self.head == self.tail { - self.tail = ptr::null_mut(); - } - - self.head = head.prev; - - head.unwire(); - - Some(**head) - } - } - - // NB: returns the Box instead of just the Option - // because the LRU is a map to the Node as well, and if the LRU - // accessed the map via PID, it would cause a use after free if - // we had already freed the Node in this function. 
- pub(crate) fn pop_tail(&mut self) -> Option> { - if self.tail.is_null() { - return None; - } - - self.len -= 1; - - unsafe { - let mut tail: Box = Box::from_raw(self.tail); - - if self.head == self.tail { - self.head = ptr::null_mut(); - } - - self.tail = tail.next; - - tail.unwire(); - - Some(tail) - } - } - - #[cfg(test)] - pub(crate) fn into_vec(mut self) -> Vec { - let mut res = vec![]; - while let Some(val) = self.pop_head() { - res.push(val); - } - res - } -} - -#[allow(unused_results)] -#[test] -fn basic_functionality() { - let mut dll = DoublyLinkedList::default(); - dll.push_head(5.into()); - dll.push_tail(6.into()); - dll.push_head(4.into()); - dll.push_tail(7.into()); - dll.push_tail(8.into()); - dll.push_head(3.into()); - dll.push_tail(9.into()); - dll.push_head(2.into()); - dll.push_head(1.into()); - assert_eq!(dll.len(), 9); - assert_eq!( - dll.into_vec(), - vec![ - 1.into(), - 2.into(), - 3.into(), - 4.into(), - 5.into(), - 6.into(), - 7.into(), - 8.into(), - 9.into() - ] - ); -} diff --git a/src/doc/engineering_practices/mod.rs b/src/doc/engineering_practices/mod.rs deleted file mode 100644 index 1eaca3fcf..000000000 --- a/src/doc/engineering_practices/mod.rs +++ /dev/null @@ -1,49 +0,0 @@ -//! Over the years that sled development has been active, some practices have -//! been collected that have helped to reduce risks throughout the codebase. -//! -//! # high-level -//! -//! * Start with the correctness requirements, ignore the performance impact -//! until the end. You'll usually write something faster by focusing on -//! keeping things minimal anyway. -//! * Throw away what can't be done in a day of coding. when you rewrite it -//! tomorrow, it will be simpler. -//! -//! # testing -//! -//! * Don't do what can't be tested to be correct -//! * For concurrent code, it must be delayable to induce strange histories when -//! running under test -//! * For IO code, it must have a failpoint so that IO errors can be injected -//! during testing, as most bugs in cloud systems happen in the untested -//! error-handling code -//! * Lean heavily into model-based property testing. sled should act like a -//! `BTreeMap`, even after crashes -//! -//! # when testing and performance collide -//! -//! * cold code is buggy code -//! * if you see a significant optimization that will make correctness-critical -//! codepaths harder to hit in tests, the optimization should only be created -//! if it's possible to artificially increase the chances of hitting the -//! codepath in test. Fox example, sled defaults to having an 8mb write -//! buffer, but during tests we often turn it down to 512 bytes so that we can -//! really abuse the correctness-critical aspects of its behavior. -//! -//! # numbers -//! -//! * No silent truncation should ever occur when converting numbers -//! * No silent wrapping should occur -//! * Crash or return a `ReportableBug` error in these cases -//! * `as` is forbidden for anything that could lose information -//! * Clippy's cast lints help us here, and it has been added to all pull -//! requests - -//! # package -//! -//! * dependencies should be minimized to keep compilation simple -//! -//! # coding conventions -//! -//! * Self should be avoided. We have a lot of code, and it provides no context -//! if people are jumping around a lot. Redundancy here improves orientation. diff --git a/src/doc/limits/mod.rs b/src/doc/limits/mod.rs deleted file mode 100644 index e2d5c4a5f..000000000 --- a/src/doc/limits/mod.rs +++ /dev/null @@ -1,22 +0,0 @@ -//! 
This page documents some limitations that sled imposes on users. -//! -//! * The underlying pagecache can currently store 2^36 pages. Leaf nodes in the -//! `Tree` tend to split when they have more than 16 keys and values. This -//! means that sled can hold a little less than **4,294,967,296 total items** -//! (index nodes in the tree will also consume pages, but ideally far fewer -//! than 1%). This is easy to increase without requiring migration, as it is -//! entirely a runtime concern, but nobody has expressed any interest in this -//! being larger yet. Note to future folks who need to increase this: increase -//! the width of the Node1 type in the pagetable module, and correspondingly -//! increase the number of bits that are used to index into it. It's just a -//! simple wait-free grow-only 2-level pagetable. -//! * keys and values use `usize` for the length fields due to the way that Rust -//! uses `usize` for slice lengths, and will be limited to the target -//! platform's pointer width. On 64-bit machines, this will be 64 bits. On -//! 32-bit machines, it will be limited to `u32::max_value()`. -//! * Due to the 32-bit limitation on slice sizes on 32-bit architectures, we -//! currently do not support systems large enough for the snapshot file to -//! reach over 4gb. The snapshot file tends to be a small fraction of the -//! total db size, and it's likely we'll be able to implement a streaming -//! deserializer if this ever becomes an issue, but it seems unclear if anyone -//! will encounter this limitation. diff --git a/src/doc/merge_operators/mod.rs b/src/doc/merge_operators/mod.rs deleted file mode 100644 index 99fa7a5b7..000000000 --- a/src/doc/merge_operators/mod.rs +++ /dev/null @@ -1,73 +0,0 @@ -//! Merge operators are an extremely powerful tool for use in embedded kv -//! stores. They allow users to specify custom logic for combining multiple -//! versions of a value into one. -//! -//! As a motivating example, imagine that you have a counter. In a traditional -//! kv store, you would need to read the old value, modify it, then write it -//! back (RMW). If you want to increment the counter from multiple threads, you -//! would need to either use higher-level locking or you need to spin in a CAS -//! loop until your increment is successful. Merge operators remove the need for -//! all of this by allowing multiple threads to "merge" in the desired -//! operation, rather than performing a read, then modification, then later -//! writing. `+1 -> +1 -> +1` instead of `w(r(key) + 1) -> w(r(key)+ 1) -> -//! w(r(key) + 1)`. -//! -//! Here's an example of using a merge operator to just concatenate merged bytes -//! together. Note that calling `set` acts as a value replacement, bypassing the -//! merging logic and replacing previously merged values. Calling `merge` is -//! like `set` but when the key is fetched, it will use the merge operator to -//! combine all `merge`'s since the last `set`. -//! -//! ```rust -//! fn concatenate_merge( -//! _key: &[u8], // the key being merged -//! old_value: Option<&[u8]>, // the previous value, if one existed -//! merged_bytes: &[u8] // the new bytes being merged in -//! ) -> Option> { // set the new value, return None to delete -//! let mut ret = old_value -//! .map(|ov| ov.to_vec()) -//! .unwrap_or_else(|| vec![]); -//! -//! ret.extend_from_slice(merged_bytes); -//! -//! Some(ret) -//! } -//! -//! let config = ConfigBuilder::new() -//! .temporary(true) -//! .build(); -//! -//! let tree = Tree::start(config).unwrap(); -//! 
tree.set_merge_operator(concatenate_merge); -//! -//! tree.set(k, vec![0]); -//! tree.merge(k, vec![1]); -//! tree.merge(k, vec![2]); -//! assert_eq!(tree.get(&k), Ok(Some(vec![0, 1, 2]))); -//! -//! // sets replace previously merged data, -//! // bypassing the merge function. -//! tree.set(k, vec![3]); -//! assert_eq!(tree.get(&k), Ok(Some(vec![3]))); -//! -//! // merges on non-present values will add them -//! tree.del(&k); -//! tree.merge(k, vec![4]); -//! assert_eq!(tree.get(&k), Ok(Some(vec![4]))); -//! ``` -//! -//! ### beyond the basics -//! -//! Merge operators can be used to express arbitrarily complex logic. You can -//! use them to implement any sort of high-level data structure on top of sled, -//! using merges of different values to represent your desired operations. -//! Similar to the above example, you could implement a list that lets you push -//! items. Bloom filters are particularly easy to implement, and merge operators -//! also are quite handy for building persistent CRDTs. -//! -//! ### warnings -//! -//! If you call `merge` without setting a merge operator, an error will be -//! returned. Merge operators may be changed over time, but make sure you do -//! this carefully to avoid race conditions. If you need to push a one-time -//! operation to a value, use `update_and_fetch` or `fetch_and_update` instead. diff --git a/src/doc/mod.rs b/src/doc/mod.rs deleted file mode 100644 index f08a22796..000000000 --- a/src/doc/mod.rs +++ /dev/null @@ -1,58 +0,0 @@ -//! #### what is sled? -//! -//! * an embedded kv store -//! * a construction kit for stateful systems -//! * ordered map API similar to a Rust `BTreeMap, Vec>` -//! * fully atomic single-key operations, supports CAS -//! * zero-copy reads -//! * merge operators -//! * forward and reverse iterators -//! * a monotonic ID generator capable of giving out 75-125+ million unique IDs -//! per second, never double allocating even in the presence of crashes -//! * [zstd](https://github.com/facebook/zstd) compression -//! * cpu-scalable lock-free implementation -//! * SSD-optimized log-structured storage -//! -//! #### why another kv store? -//! -//! People face unnecessary hardship when working with existing embedded -//! databases. They tend to have sharp performance trade-offs, are difficult to -//! tune, have unclear consistency guarantees, and are generally inflexible. -//! Facebook uses distributed machine learning to find configurations that -//! achieve great performance for specific workloads on rocksdb. Most engineers -//! don't have access to that kind of infrastructure. We would like to build -//! sled so that it can be optimized using simple local methods, with as little -//! user input as possible, and in many cases exceed the performance of popular -//! systems today. -//! -//! This is how we aim to improve the situation: -//! -//! 1. don't make the user think. the interface should be obvious. -//! 1. don't surprise users with performance traps. -//! 1. don't wake up operators. bring reliability techniques from academia into -//! real-world practice. 1. don't use so much electricity. our data structures -//! should play to modern hardware's strengths. -//! -//! sled is written by people with experience designing, building, testing, and -//! operating databases at high scales. we think the situation can be improved. -//! -//! #### targeted toward our vision of the future -//! Building a database takes years. Designers of databases make bets about -//! target usage and hardware. 
Here are the trends that we see, which we want to -//! optimize the experience around: -//! -//! 1. more cores on servers, spanning sockets and numa domains -//! 1. the vast majority of content consumption and generation happening on -//! phones 1. compute migrating to the edge, into CDNs -//! 1. conflict-free and OT-based replication techniques at the edge -//! 1. strongly-consistent replication techniques within and between datacenters -//! 1. event-driven architectures which benefit heavily from subscriber/watch -//! semantics - -pub mod engineering_practices; -pub mod limits; -pub mod merge_operators; -pub mod performance_guide; -pub mod reactive_semantics; -pub mod sled_architectural_outlook; -pub mod testing_strategies; diff --git a/src/doc/motivating_experiences/mod.rs b/src/doc/motivating_experiences/mod.rs deleted file mode 100644 index ef4c7fba9..000000000 --- a/src/doc/motivating_experiences/mod.rs +++ /dev/null @@ -1,96 +0,0 @@ -//!

-//! -//!

-//! -//! # Experiences with Other Systems -//! -//! sled is motivated by the experiences gained while working with other -//! stateful systems, outlined below. -//! -//! Most of the points below are learned from being burned, rather than -//! delighted. -//! -//! #### MySQL -//! -//! * make it easy to tail the replication stream in flexible topologies -//! * support merging shards a la MariaDB -//! * support mechanisms for live, lock-free schema updates a la -//! pt-online-schema-change -//! * include GTID in all replication information -//! * actively reduce tree fragmentation -//! * give operators and distributed database creators first-class support for -//! replication, sharding, backup, tuning, and diagnosis -//! * O_DIRECT + real linux AIO is worth the effort -//! -//! #### Redis -//! -//! * provide high-level collections that let engineers get to their business -//! logic as quickly as possible instead of forcing them to define a schema in -//! a relational system (usually spending an hour+ googling how to even do it) -//! * don't let single slow requests block all other requests to a shard -//! * let operators peer into the sequence of operations that hit the database -//! to track down bad usage -//! * don't force replicas to retrieve the entire state of the leader when they -//! begin replication -//! -//! #### HBase -//! -//! * don't split "the source of truth" across too many decoupled systems or you -//! will always have downtime -//! * give users first-class APIs to peer into their system state without -//! forcing them to write scrapers -//! * serve http pages for high-level overviews and possibly log access -//! * coprocessors are awesome but people should have easy ways of doing -//! secondary indexing -//! -//! #### RocksDB -//! -//! * give users tons of flexibility with different usage patterns -//! * don't force users to use distributed machine learning to discover -//! configurations that work for their use cases -//! * merge operators are extremely powerful -//! * merge operators should be usable from serial transactions across multiple -//! keys -//! -//! #### etcd -//! -//! * raft makes operating replicated systems SO MUCH EASIER than popular -//! relational systems / redis etc... -//! * modify raft to use leader leases instead of using the paxos register, -//! avoiding livelocks in the presence of simple partitions -//! * give users flexible interfaces -//! * reactive semantics are awesome, but access must be done through smart -//! clients, because users will assume watches are reliable -//! * if we have smart clients anyway, quorum reads can be cheap by -//! lower-bounding future reads to the raft id last observed -//! * expose the metrics and operational levers required to build a self-driving -//! stateful system on top of k8s/mesos/cloud providers/etc... -//! -//! #### Tendermint -//! -//! * build things in a testable way from the beginning -//! * don't seek gratuitous concurrency -//! * allow replication streams to be used in flexible ways -//! * instant finality (or interface finality, the thing should be done by the -//! time the request successfully returns to the client) is mandatory for nice -//! high-level interfaces that don't push optimism (and rollbacks) into -//! interfacing systems -//! -//! #### LMDB -//! -//! * approach a wait-free tree traversal for reads -//! * use modern tree structures that can support concurrent writers -//! * multi-process is nice for browsers etc... -//! 
* people value read performance and are often forgiving of terrible write -//! performance for most workloads -//! -//! #### Zookeeper -//! * reactive semantics are awesome, but access must be done through smart -//! clients, because users will assume watches are reliable -//! * the more important the system, the more you should keep old snapshots -//! around for emergency recovery -//! * never assume a hostname that was resolvable in the past will be resolvable -//! in the future -//! * if a critical thread dies, bring down the entire system -//! * make replication configuration as simple as possible. people will mess up -//! the order and cause split brains if this is not automated. diff --git a/src/doc/performance_guide/mod.rs b/src/doc/performance_guide/mod.rs deleted file mode 100644 index c6c876508..000000000 --- a/src/doc/performance_guide/mod.rs +++ /dev/null @@ -1,36 +0,0 @@ -//! ## Built-In Profiler -//! -//! To get a summary of latency histograms relating to different operations -//! you've used on a sled database, sled can print a nice table when the Db is -//! dropped by disabling the `no_metrics` default feature and setting -//! `print_profile_on_drop(true)` on a `ConfigBuilder`: -//! -//! ```rust -//! let config = sled::ConfigBuilder::new() -//! .print_profile_on_drop(true) -//! .build(); -//! -//! let db = sled::Db::start(config).unwrap(); -//! ``` -//! -//! This is useful for finding outliers, general percentiles about usage, and -//! especially for debugging performance issues if you create an issue on -//! github. -//! -//! ## Use jemalloc -//! -//! jemalloc can dramatically improve performance in some situations, but you -//! should always measure performance before and after using it, because maybe -//! for some use cases it can cause regressions. -//! -//! Cargo.toml: -//! ```toml -//! [dependencies] -//! jemallocator = "0.1" -//! ``` -//! -//! `your_code.rs`: -//! ```rust -//! #[global_allocator] -//! static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc; -//! ``` diff --git a/src/doc/reactive_semantics/mod.rs b/src/doc/reactive_semantics/mod.rs deleted file mode 100644 index b99127fe6..000000000 --- a/src/doc/reactive_semantics/mod.rs +++ /dev/null @@ -1,34 +0,0 @@ -//! As of sled `0.16.8` we support the [`watch_prefix` feature](https://docs.rs/sled/latest/sled/struct.Tree.html#method.watch_prefix) which allows a caller to create an iterator over all events that happen to keys that begin with a specified prefix. Supplying an empty vector allows you to subscribe to all updates on the `Tree`. -//! -//! #### reactive architectures -//! -//! Subscription to keys prefixed with "topic names" can allow you to treat sled -//! as a durable message bus. -//! -//! #### replicated systems -//! -//! Watching the empty prefix will subscribe to all updates on the entire -//! database. You can feed this into a replication system -//! -//! #### analysis tools and auditing -//! -//! #### ordering guarantees -//! -//! Updates are received in-order for particular keys, but updates for different -//! keys may be observed in different orders by different `Subscriber`s. As an -//! example, consider updating the keys `k1` and `k2` twice, adding 1 to the -//! current value. Different `Subscriber`s may observe the following histories: -//! -//! ``` -//! Set(k1, 100), Set(k1, 101), Set(k2, 200), Set(k2, 201) -//! or -//! Set(k1, 100), Set(k2, 200), Set(k1, 101), Set(k2, 201) -//! or -//! Set(k1, 100), Set(k2, 200), Set(k2, 201), Set(k1, 101) -//! or -//! 
Set(k2, 200), Set(k1, 100), Set(k1, 101), Set(k2, 201) -//! or -//! Set(k2, 200), Set(k1, 100), Set(k2, 201), Set(k1, 101) -//! or -//! Set(k2, 200), Set(k2, 201), Set(k1, 100), Set(k1, 101) -//! ``` diff --git a/src/doc/sled_architectural_outlook/mod.rs b/src/doc/sled_architectural_outlook/mod.rs deleted file mode 100644 index a19f5cad5..000000000 --- a/src/doc/sled_architectural_outlook/mod.rs +++ /dev/null @@ -1,184 +0,0 @@ -//! Here's a look at where sled is at, and where it's going architecturally. The system is very -//! much under active development, and we have a ways to go. If specific areas are interesting to -//! you, I'd love to [work together](https://github.com/spacejam/sled/blob/main/CONTRIBUTING.md)! -//! If your business has a need for particular items below, you can [fund development of particular -//! features](https://opencollective.com/sled). -//! -//! People face unnecessary hardship when working with existing embedded -//! databases. They tend to have sharp performance trade-offs, are difficult to -//! tune, have unclear consistency guarantees, and are generally inflexible. -//! Facebook uses distributed machine learning to find configurations that -//! achieve great performance for specific workloads on rocksdb. Most engineers -//! don't have access to that kind of infrastructure. We would like to build -//! sled so that it can be optimized using simple local methods, with as little -//! user input as possible, and in many cases exceed the performance of popular -//! systems today. -//! -//! This is how we aim to improve the situation: -//! -//! * low configuration required to get great performance on a wide variety of -//! workloads by using a modified Bw-Tree and keeping workload metrics that -//! allow us to self-tune -//! * first-class subscriber semantics for operations on specified prefixes of -//! keys -//! * first-class programmatic access to the binary replication stream -//! * serializable transactions -//! -//! ### Indexing -//! -//! sled started as an implementation of a Bw-Tree, but over time has moved away -//! from certain architectural aspects that have been difficult to tune. The -//! first thing to be dropped from the original Bw-Tree design was the in-memory -//! representation of a node as a long linked list of updates, terminating in -//! the actual base tree node. It was found that by leaning into atomic -//! reference counting, it became quite performant to perform RCU on entire tree -//! nodes for every update, because a tree node only needs 2 allocations (the -//! node itself, and a vector of children). All other items are protected by -//! their own rust `Arc`. This made reads dramatically faster, and allowed them -//! to avoid allocations that were required previously to build up a dynamic -//! "view" over a chain of partial updates. -//! -//! A current area of effort is to store tree nodes as a Rust -//! `RwLock>`. The Rust `Arc` has a cool method called `make_mut` -//! which can provide mutable access to an Arc if the strong count is 1, or make -//! a clone if it isn't and then provide a mutable reference to the local clone. -//! This will allow us to perform even fewer allocations and avoid RCU on the -//! tree nodes in cases of lower contention. Nesting an `Arc` in a lock -//! structure allows for an interesting "snapshot read" semantic that allows -//! writers not to block on readers. It is a middle ground between a `RwLock` -//! and RCU that trades lower memory pressure for occasional blocking when a -//! 
writer is holding a writer lock. This is expected to be a fairly low cost, -//! but benchmarks have not yet been produced for this prospective architecture. -//! -//! The merge and split strategies are kept from the Bw-Tree, but this might be -//! switched to using pagecache-level transactions once a cicada-like -//! transaction protocol is implemented on top of it. -//! -//! * [The Bw-Tree: A B-tree for New Hardware -//! Platforms](https://15721.courses.cs.cmu.edu/spring2018/papers/08-oltpindexes1/bwtree-icde2013.pdf) -//! * [Building a Bw-Tree Takes More Than Just Buzz Words](http://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pdf) -//! -//! ### Caching -//! -//! sled uses a pagecache that is based on LLAMA. This lets us write small -//! updates to pages without rewriting the entire page, achieving low write -//! amplification. Flash storage lets us scatter random reads in parallel, so to -//! read a logical page, we may read several fragments and collect them in -//! memory. The pagecache can be used to back any high level structure, and -//! provides a lock-free interface that supports RCU-style access patterns. When -//! the number of page deltas reaches a certain length, we squish the page -//! updates into a single blob. -//! -//! The caching is currently pretty naive. We use 256 cache shards by default. Each cache shard is a simple LRU cache implemented as a doubly-linked list protected by a `Mutex`. Future directions may take inspiration from ZFS's adaptive replacement cache, which will give us scan and thrash resistance. See [#65](https://github.com/spacejam/sled/issues/65). -//! -//! * [LLAMA: A Cache/Storage Subsystem for Modern Hardware](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/llama-vldb2013.pdf) -//! * [Adaptive replacement cache](https://en.wikipedia.org/w/index.php?title=Adaptive_replacement_cache&oldid=865482923) -//! -//! ### Concurrency Control -//! -//! sled supports point reads and writes in serializable transactions across -//! multiple trees. This is fairly limited, and does not yet use a -//! high-performance concurrency control mechanism. In order to support scans, -//! we need to be able to catch phantom conflicts. To do this, we are taking -//! some inspiration from Cicada, in terms of how they include index nodes in -//! transactions, providing a really nice way to materialize conflicts relating -//! to phantoms. sled has an ID generator built into it now, accessible from the -//! `generate_id` method on `Tree`. This can churn out 75-125 million unique -//! monotonic ID's per second on a macbook pro, so we may not need to adopt -//! Cicada's distributed timestamp generation techniques for a long time. We -//! will be using Cicada's approach to adaptive validation, causing early aborts -//! when higher contention is detected. -//! -//! * [Cicada: Dependably Fast Multi-Core In-Memory Transactions](http://15721.courses.cs.cmu.edu/spring2018/papers/06-mvcc2/lim-sigmod2017.pdf) -//! -//! ### Storage -//! -//! sled splits the main storage file into fixed-sized segments. We track which -//! pages live in which segments. A page may live in several segments, because -//! we support writing partial updates to a page with our LLAMA-like approach. -//! When a page with several fragments is squished together, we mark the page as -//! freed from the previous segments. When a segment reaches a configurable low -//! threshold of live pages, we start moving the remaining pages to other -//! 
segments so that underutilized segments can be reused, and we generally keep -//! the amount of fragmentation in the system controlled. -//! -//! As of July 2019, sled is naive about where it puts rewritten pages. Future directions will separate base pages from page deltas, and possibly have generational considerations. See [#450](https://github.com/spacejam/sled/issues/450). Also, when values reach a particularly large size, it no longer makes sense to inline them in leaf nodes of the tree. Taking a cue from `WiscKey`, we can eventually split these out, but we can be much more fine grained about placement strategy over time. Generally, being smart about rewriting and defragmentation is where sled may carve out the largest performance gains over existing production and research systems. -//! -//! * [The Design and Implementation of a Log-Structured File System](https://people.eecs.berkeley.edu/~brewer/cs262/LFS.pdf) -//! * [`WiscKey`: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf) -//! * [Monkey: Optimal Navigable Key-Value Store](http://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf) -//! * [Designing Access Methods: The RUM Conjecture](https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf) -//! * [The Unwritten Contract of Solid State Drives](http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf) -//! * [The five-minute rule twenty years later, and how flash memory changes the -//! rules](http://www.cs.cmu.edu/~damon2007/pdf/graefe07fiveminrule.pdf) -//! * [An Efficient Memory-Mapped Key-Value Store for `FlashStorage`](http://www.exanest.eu/pub/SoCC18_efficient_kv_store.pdf) -//! * [Generalized File System Dependencies](http://featherstitch.cs.ucla.edu/publications/featherstitch-sosp07.pdf) -//! -//! ### Replication -//! -//! We want to give database implementors great tools for replicating their data -//! backed by sled. We will provide first-class binary replication stream -//! access, as well as subscriber to high level tree updates that happen on -//! specified prefixes. These updates should be witnessed in the same order that -//! they appear in the log by all consumers. -//! -//! We will likely include a default replication implementation, based either on -//! raft or harpoon (raft but with leases instead of a paxos register-based -//! leadership mechanism to protect against bouncing leadership in the presence -//! of partitions). Additionally, we can get nice throughput gains over vanilla -//! raft by separating the concerns of block replication and consensus on -//! metadata. Blocks can be replicated in a more fragmented + p2p-like manner, -//! with HOL-blocking-prone consensus being run on ordering of said blocks. This -//! pushes a bit more complexity into `RequestVotes` compared to vanilla raft, -//! but allows us to increase throughput a bit. -//! -//! ### Reclamation -//! -//! We use epoch-based reclamation to ensure that we don't free memory until any -//! possible witnessing threads are done with their work. This is the mechanism -//! that lets us return zero-copy to values in our pagecache for tree gets. -//! -//! Right now we use crossbeam-epoch for this. We may create a shallow fork -//! (gladly contributed upstream if the maintainers are interested) that allows -//! different kinds of workloads to bound the amount of garbage that they clean -//! up, possibly punting more cleanups to a threadpool and operations that seem -//! 
to prioritize throughput rather than latency. -//! -//! Possible future directions include using something like -//! quiescent-state-based-reclamation, but we need to study more before -//! considering alternative approaches. -//! -//! * [Comparative Performance of Memory Reclamation Strategies for Lock-free and Concurrently-readable Data Structures](http://www.cs.utoronto.ca/~tomhart/papers/tomhart_thesis.pdf) -//! -//! ### Checkpointing -//! -//! Sled has an extremely naive checkpoint strategy. It periodically takes the -//! last snapshot, scans the segments in the log with an LSN higher than last -//! LSN applied to the snapshot, building a snapshot from the segments it reads. -//! A snapshot is effectively a CRDT, because it can use the LSN number on read -//! messages as a last-write-wins register. It is currently the same mechanism -//! as the recovery mechanism, where the data is read directly off the disk and -//! page metadata is stored in a snapshot that is updated. The snapshot is -//! entirely an optimization for recovery, and can be deleted without impacting -//! recovery correctness. -//! -//! We are moving to a CRDT-like snapshot recovery technique, and we can easily -//! parallelize recovery up until the "safety buffer" for the last few segments -//! of the log. -//! -//! We would also like to move toward the delta-checkpoint model used in -//! Hekaton, as it would allow us to further parallelize generation of -//! checkpoint information. -//! -//! ### Misc Considerations -//! -//! * [How to Architect a Query Compiler, Revisited](https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod18.pdf) -//! * shows that we can compile queries without resorting to complex -//! implementations by utilizing Futamura projections -//! * [CMU 15-721 (Spring 2018) Advanced Database Systems](https://15721.courses.cs.cmu.edu/spring2018/schedule.html) -//! * a wonderful overview of the state of the art in various database -//! topics. start here if you want to contribute deeply and don't know -//! where to begin! -//! * [Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask](http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf) -//! * suggests that we should eventually aim for an approach that is -//! shared-nothing across sockets, but lock-free within them diff --git a/src/doc/testing_strategies/mod.rs b/src/doc/testing_strategies/mod.rs deleted file mode 100644 index 8c336de95..000000000 --- a/src/doc/testing_strategies/mod.rs +++ /dev/null @@ -1,25 +0,0 @@ -//! We believe operators of stateful systems should get as much sleep as they -//! want. We take testing seriously, and we take pains to avoid the pesticide -//! paradox wherever possible. -//! -//! sled uses the following testing strategies, and is eager to expand their -//! use: -//! -//! * quickcheck-based model testing on the Tree, `PageCache`, and Log -//! * proptest-based model testing on the `PageTable` using the [model](https://docs.rs/model) -//! testing library -//! * linearizability testing on the `PageTable` using the [model](https://docs.rs/model) -//! testing library -//! * deterministic concurrent model testing using linux realtime priorities, -//! approaching the utility of the PULSE system available for the Erlang -//! ecosystem -//! * `ThreadSanitizer` on a concurrent workload -//! * `LeakSanitizer` on a concurrent workload -//! * failpoints with model testing: at every IO operation, a test can cause the -//! system to simulate a crash -//! 
* crash testing: processes are quickly spun up and then `kill -9`'d while -//! recovering and writing. the recovered data is verified to recover the log -//! in-order, stopping at the first torn log message or incomplete segment -//! * fuzzing: libfuzzer is used to generate sequences of operations on the Tree -//! * TLA+ has been used to model some of the concurrent algorithms, but much -//! more is necessary diff --git a/src/ebr/atomic.rs b/src/ebr/atomic.rs deleted file mode 100644 index fc0eee8a4..000000000 --- a/src/ebr/atomic.rs +++ /dev/null @@ -1,884 +0,0 @@ -use core::borrow::{Borrow, BorrowMut}; -use core::cmp; -use core::fmt; -use core::marker::PhantomData; -use core::mem::{self, MaybeUninit}; -use core::ops::{Deref, DerefMut}; -use core::slice; -use core::sync::atomic::{AtomicUsize, Ordering}; - -use std::alloc; - -use super::Guard; - -/// Given ordering for the success case in a compare-exchange operation, returns the strongest -/// appropriate ordering for the failure case. -pub(crate) const fn strongest_failure_ordering(ord: Ordering) -> Ordering { - use self::Ordering::*; - match ord { - Relaxed | Release => Relaxed, - Acquire | AcqRel => Acquire, - _ => SeqCst, - } -} - -/// The error returned on failed compare-and-set operation. -pub(crate) struct CompareAndSetError<'g, T: ?Sized + Pointable, P: Pointer> { - /// The value in the atomic pointer at the time of the failed operation. - pub(crate) current: Shared<'g, T>, - - /// The new value, which the operation failed to store. - pub(crate) new: P, -} - -impl<'g, T: 'g, P: Pointer + fmt::Debug> fmt::Debug - for CompareAndSetError<'g, T, P> -{ - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - f.debug_struct("CompareAndSetError") - .field("current", &self.current) - .field("new", &self.new) - .finish() - } -} - -/// Memory orderings for compare-and-set operations. -/// -/// A compare-and-set operation can have different memory orderings depending on whether it -/// succeeds or fails. This trait generalizes different ways of specifying memory orderings. -/// -/// The two ways of specifying orderings for compare-and-set are: -/// -/// 1. Just one `Ordering` for the success case. In case of failure, the strongest appropriate -/// ordering is chosen. -/// 2. A pair of `Ordering`s. The first one is for the success case, while the second one is -/// for the failure case. -pub(crate) trait CompareAndSetOrdering: Copy { - /// The ordering of the operation when it succeeds. - fn success(self) -> Ordering; - - /// The ordering of the operation when it fails. - /// - /// The failure ordering can't be `Release` or `AcqRel` and must be equivalent or weaker than - /// the success ordering. - fn failure(self) -> Ordering; -} - -impl CompareAndSetOrdering for Ordering { - #[inline] - fn success(self) -> Ordering { - self - } - - #[inline] - fn failure(self) -> Ordering { - strongest_failure_ordering(self) - } -} - -impl CompareAndSetOrdering for (Ordering, Ordering) { - #[inline] - fn success(self) -> Ordering { - self.0 - } - - #[inline] - fn failure(self) -> Ordering { - self.1 - } -} - -/// Returns a bitmask containing the unused least significant bits of an aligned pointer to `T`. -#[inline] -const fn low_bits() -> usize { - (1 << T::ALIGN.trailing_zeros()) - 1 -} - -/// Panics if the pointer is not properly unaligned. -#[inline] -fn ensure_aligned(raw: usize) { - assert_eq!(raw & low_bits::(), 0, "unaligned pointer"); -} - -/// Given a tagged pointer `data`, returns the same pointer, but tagged with `tag`. 
-/// -/// `tag` is truncated to fit into the unused bits of the pointer to `T`. -#[inline] -fn compose_tag(data: usize, tag: usize) -> usize { - (data & !low_bits::()) | (tag & low_bits::()) -} - -/// Decomposes a tagged pointer `data` into the pointer and the tag. -#[inline] -fn decompose_tag(data: usize) -> (usize, usize) { - (data & !low_bits::(), data & low_bits::()) -} - -/// Types that are pointed to by a single word. -/// -/// In concurrent programming, it is necessary to represent an object within a word because atomic -/// operations (e.g., reads, writes, read-modify-writes) support only single words. This trait -/// qualifies such types that are pointed to by a single word. -/// -/// The trait generalizes `Box` for a sized type `T`. In a box, an object of type `T` is -/// allocated in heap and it is owned by a single-word pointer. This trait is also implemented for -/// `[MaybeUninit]` by storing its size along with its elements and pointing to the pair of array -/// size and elements. -/// -/// Pointers to `Pointable` types can be stored in [`Atomic`], [`Owned`], and [`Shared`]. In -/// particular, Crossbeam supports dynamically sized slices as follows. -pub(crate) trait Pointable { - /// The alignment of pointer. - const ALIGN: usize; - - /// The type for initializers. - type Init; - - /// Initializes a with the given initializer. - /// - /// # Safety - /// - /// The result should be a multiple of `ALIGN`. - unsafe fn init(init: Self::Init) -> usize; - - /// Dereferences the given pointer. - /// - /// # Safety - /// - /// - The given `ptr` should have been initialized with [`Pointable::init`]. - /// - `ptr` should not have yet been dropped by [`Pointable::drop`]. - /// - `ptr` should not be mutably dereferenced by [`Pointable::deref_mut`] concurrently. - unsafe fn deref<'a>(ptr: usize) -> &'a Self; - - /// Mutably dereferences the given pointer. - /// - /// # Safety - /// - /// - The given `ptr` should have been initialized with [`Pointable::init`]. - /// - `ptr` should not have yet been dropped by [`Pointable::drop`]. - /// - `ptr` should not be dereferenced by [`Pointable::deref`] or [`Pointable::deref_mut`] - /// concurrently. - unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut Self; - - /// Drops the object pointed to by the given pointer. - /// - /// # Safety - /// - /// - The given `ptr` should have been initialized with [`Pointable::init`]. - /// - `ptr` should not have yet been dropped by [`Pointable::drop`]. - /// - `ptr` should not be dereferenced by [`Pointable::deref`] or [`Pointable::deref_mut`] - /// concurrently. - unsafe fn drop(ptr: usize); -} - -impl Pointable for T { - const ALIGN: usize = mem::align_of::(); - - type Init = T; - - unsafe fn init(init: Self::Init) -> usize { - Box::into_raw(Box::new(init)) as usize - } - - unsafe fn deref<'a>(ptr: usize) -> &'a Self { - &*(ptr as *const T) - } - - unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut Self { - &mut *(ptr as *mut T) - } - - unsafe fn drop(ptr: usize) { - drop(Box::from_raw(ptr as *mut T)); - } -} - -/// Array with size. -/// -/// # Memory layout -/// -/// An array consisting of size and elements: -/// -/// ```text -/// elements -/// | -/// | -/// ------------------------------------ -/// | size | 0 | 1 | 2 | 3 | 4 | 5 | 6 | -/// ------------------------------------ -/// ``` -/// -/// Its memory layout is different from that of `Box<[T]>` in that size is in the allocation (not -/// along with pointer as in `Box<[T]>`). -/// -/// Elements are not present in the type, but they will be in the allocation. 
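The `low_bits`, `compose_tag`, and `decompose_tag` helpers above pack a small tag into the alignment bits of a pointer. Here is a standalone sketch of the same arithmetic (toy code, not the module's own API), with the generic parameter written out explicitly:

```rust
// A pointer aligned to align_of::<T>() always has its low trailing-zero bits
// clear, so a small tag can be stored there and stripped off again.
use std::mem;

const fn low_bits<T>() -> usize {
    (1 << mem::align_of::<T>().trailing_zeros()) - 1
}

fn compose_tag<T>(data: usize, tag: usize) -> usize {
    (data & !low_bits::<T>()) | (tag & low_bits::<T>())
}

fn decompose_tag<T>(data: usize) -> (usize, usize) {
    (data & !low_bits::<T>(), data & low_bits::<T>())
}

fn main() {
    // A u64 is 8-byte aligned, so three tag bits are available.
    let boxed = Box::new(7_u64);
    let raw = Box::into_raw(boxed) as usize;
    assert_eq!(raw & low_bits::<u64>(), 0, "heap pointers are aligned");

    let tagged = compose_tag::<u64>(raw, 0b101);
    let (untagged, tag) = decompose_tag::<u64>(tagged);
    assert_eq!(untagged, raw);
    assert_eq!(tag, 0b101);

    // Reclaim the allocation now that we are done with the raw pointer.
    unsafe { drop(Box::from_raw(untagged as *mut u64)) };
}
```

The linked-list code later in this diff relies on exactly this trick, using tag bit 1 to mark an entry as logically deleted.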
-/// ``` -/// -// TODO(@jeehoonkang): once we bump the minimum required Rust version to 1.44 or newer, use -// [`alloc::alloc::Layout::extend`] instead. -#[repr(C)] -struct Array { - size: usize, - elements: [MaybeUninit; 0], -} - -impl Pointable for [MaybeUninit] { - const ALIGN: usize = mem::align_of::>(); - - type Init = usize; - - unsafe fn init(len: Self::Init) -> usize { - let size = - mem::size_of::>() + mem::size_of::>() * len; - let align = mem::align_of::>(); - let layout = alloc::Layout::from_size_align(size, align).unwrap(); - let ptr = alloc::alloc(layout) as *mut Array; - (*ptr).size = size; - ptr as usize - } - - unsafe fn deref<'a>(ptr: usize) -> &'a Self { - let array = &*(ptr as *const Array); - slice::from_raw_parts(array.elements.as_ptr(), array.size) - } - - unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut Self { - let array = &*(ptr as *mut Array); - slice::from_raw_parts_mut(array.elements.as_ptr() as *mut _, array.size) - } - - unsafe fn drop(ptr: usize) { - let array = &*(ptr as *mut Array); - let size = mem::size_of::>() - + mem::size_of::>() * array.size; - let align = mem::align_of::>(); - let layout = alloc::Layout::from_size_align(size, align).unwrap(); - alloc::dealloc(ptr as *mut u8, layout); - } -} - -/// An atomic pointer that can be safely shared between threads. -/// -/// The pointer must be properly aligned. Since it is aligned, a tag can be stored into the unused -/// least significant bits of the address. For example, the tag for a pointer to a sized type `T` -/// should be less than `(1 << mem::align_of::().trailing_zeros())`. -/// -/// Any method that loads the pointer must be passed a reference to a [`Guard`]. -/// -/// Crossbeam supports dynamically sized types. See [`Pointable`] for details. -pub(crate) struct Atomic { - data: AtomicUsize, - _marker: PhantomData<*mut T>, -} - -unsafe impl Send for Atomic {} -unsafe impl Sync for Atomic {} - -impl Atomic { - /// Allocates `value` on the heap and returns a new atomic pointer pointing to it. - pub(crate) fn new(init: T) -> Atomic { - Self::init(init) - } -} - -impl Atomic { - /// Allocates `value` on the heap and returns a new atomic pointer pointing to it. - pub(crate) fn init(init: T::Init) -> Atomic { - Self::from(Owned::init(init)) - } - - /// Returns a new atomic pointer pointing to the tagged pointer `data`. - fn from_usize(data: usize) -> Self { - Self { data: AtomicUsize::new(data), _marker: PhantomData } - } - - /// Returns a new null atomic pointer. - pub(crate) fn null() -> Atomic { - Self { data: AtomicUsize::new(0), _marker: PhantomData } - } - - /// Loads a `Shared` from the atomic pointer. - /// - /// This method takes an [`Ordering`] argument which describes the memory ordering of this - /// operation. - pub(crate) fn load<'g>( - &self, - ord: Ordering, - _: &'g Guard, - ) -> Shared<'g, T> { - unsafe { Shared::from_usize(self.data.load(ord)) } - } - - /// Stores a `Shared` or `Owned` pointer into the atomic pointer. - /// - /// This method takes an [`Ordering`] argument which describes the memory ordering of this - /// operation. - pub(crate) fn store>(&self, new: P, ord: Ordering) { - self.data.store(new.into_usize(), ord); - } - - /// Stores a `Shared` or `Owned` pointer into the atomic pointer, returning the previous - /// `Shared`. - /// - /// This method takes an [`Ordering`] argument which describes the memory ordering of this - /// operation. 
- pub(crate) fn swap<'g, P: Pointer>( - &self, - new: P, - ord: Ordering, - _: &'g Guard, - ) -> Shared<'g, T> { - unsafe { Shared::from_usize(self.data.swap(new.into_usize(), ord)) } - } - - /// Stores the pointer `new` (either `Shared` or `Owned`) into the atomic pointer if the current - /// value is the same as `current`. The tag is also taken into account, so two pointers to the - /// same object, but with different tags, will not be considered equal. - /// - /// The return value is a result indicating whether the new pointer was written. On success the - /// pointer that was written is returned. On failure the actual current value and `new` are - /// returned. - /// - /// This method takes a [`CompareAndSetOrdering`] argument which describes the memory - /// ordering of this operation. - pub(crate) fn compare_and_set<'g, O, P>( - &self, - current: Shared<'_, T>, - new_raw: P, - ord: O, - _: &'g Guard, - ) -> Result, CompareAndSetError<'g, T, P>> - where - O: CompareAndSetOrdering, - P: Pointer, - { - let new = new_raw.into_usize(); - self.data - .compare_exchange( - current.into_usize(), - new, - ord.success(), - ord.failure(), - ) - .map(|_| unsafe { Shared::from_usize(new) }) - .map_err(|cur| unsafe { - CompareAndSetError { - current: Shared::from_usize(cur), - new: P::from_usize(new), - } - }) - } - - /// Stores the pointer `new` (either `Shared` or `Owned`) into the atomic pointer if the current - /// value is the same as `current`. The tag is also taken into account, so two pointers to the - /// same object, but with different tags, will not be considered equal. - /// - /// Unlike [`compare_and_set`], this method is allowed to spuriously fail even when comparison - /// succeeds, which can result in more efficient code on some platforms. The return value is a - /// result indicating whether the new pointer was written. On success the pointer that was - /// written is returned. On failure the actual current value and `new` are returned. - /// - /// This method takes a [`CompareAndSetOrdering`] argument which describes the memory - /// ordering of this operation. - /// - /// [`compare_and_set`]: Atomic::compare_and_set - pub(crate) fn compare_and_set_weak<'g, O, P>( - &self, - current: Shared<'_, T>, - new_raw: P, - ord: O, - _: &'g Guard, - ) -> Result, CompareAndSetError<'g, T, P>> - where - O: CompareAndSetOrdering, - P: Pointer, - { - let new = new_raw.into_usize(); - self.data - .compare_exchange_weak( - current.into_usize(), - new, - ord.success(), - ord.failure(), - ) - .map(|_| unsafe { Shared::from_usize(new) }) - .map_err(|cur| unsafe { - CompareAndSetError { - current: Shared::from_usize(cur), - new: P::from_usize(new), - } - }) - } - - /// Bitwise "or" with the current tag. - /// - /// Performs a bitwise "or" operation on the current tag and the argument `val`, and sets the - /// new tag to the result. Returns the previous pointer. - /// - /// This method takes an [`Ordering`] argument which describes the memory ordering of this - /// operation. 
- pub(crate) fn fetch_or<'g>( - &self, - val: usize, - ord: Ordering, - _: &'g Guard, - ) -> Shared<'g, T> { - unsafe { - Shared::from_usize(self.data.fetch_or(val & low_bits::(), ord)) - } - } -} - -impl fmt::Debug for Atomic { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let data = self.data.load(Ordering::SeqCst); - let (raw, tag) = decompose_tag::(data); - - f.debug_struct("Atomic").field("raw", &raw).field("tag", &tag).finish() - } -} - -impl fmt::Pointer for Atomic { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let data = self.data.load(Ordering::SeqCst); - let (raw, _) = decompose_tag::(data); - fmt::Pointer::fmt(&(unsafe { T::deref(raw) }), f) - } -} - -impl Clone for Atomic { - /// Returns a copy of the atomic value. - /// - /// Note that a `Relaxed` load is used here. If you need synchronization, use it with other - /// atomics or fences. - fn clone(&self) -> Self { - let data = self.data.load(Ordering::Relaxed); - Atomic::from_usize(data) - } -} - -impl Default for Atomic { - fn default() -> Self { - Atomic::null() - } -} - -impl From> for Atomic { - /// Returns a new atomic pointer pointing to `owned`. - fn from(owned: Owned) -> Self { - let data = owned.data; - #[allow(clippy::mem_forget)] - mem::forget(owned); - Self::from_usize(data) - } -} - -impl From> for Atomic { - fn from(b: Box) -> Self { - Self::from(Owned::from(b)) - } -} - -impl From for Atomic { - fn from(t: T) -> Self { - Self::new(t) - } -} - -impl<'g, T: ?Sized + Pointable> From> for Atomic { - /// Returns a new atomic pointer pointing to `ptr`. - fn from(ptr: Shared<'g, T>) -> Self { - Self::from_usize(ptr.data) - } -} - -impl From<*const T> for Atomic { - /// Returns a new atomic pointer pointing to `raw`. - fn from(raw: *const T) -> Self { - Self::from_usize(raw as usize) - } -} - -/// A trait for either `Owned` or `Shared` pointers. -pub(crate) trait Pointer { - /// Returns the machine representation of the pointer. - fn into_usize(self) -> usize; - - /// Returns a new pointer pointing to the tagged pointer `data`. - /// - /// # Safety - /// - /// The given `data` should have been created by `Pointer::into_usize()`, and one `data` should - /// not be converted back by `Pointer::from_usize()` multiple times. - unsafe fn from_usize(data: usize) -> Self; -} - -/// An owned heap-allocated object. -/// -/// This type is very similar to `Box`. -/// -/// The pointer must be properly aligned. Since it is aligned, a tag can be stored into the unused -/// least significant bits of the address. -pub(crate) struct Owned { - data: usize, - _marker: PhantomData>, -} - -impl Pointer for Owned { - #[inline] - fn into_usize(self) -> usize { - let data = self.data; - #[allow(clippy::mem_forget)] - mem::forget(self); - data - } - - /// Returns a new pointer pointing to the tagged pointer `data`. - /// - /// # Panics - /// - /// Panics if the data is zero in debug mode. - #[inline] - unsafe fn from_usize(data: usize) -> Self { - debug_assert!(data != 0, "converting zero into `Owned`"); - Owned { data, _marker: PhantomData } - } -} - -impl Owned { - /// Returns a new owned pointer pointing to `raw`. - /// - /// This function is unsafe because improper use may lead to memory problems. Argument `raw` - /// must be a valid pointer. Also, a double-free may occur if the function is called twice on - /// the same raw pointer. - /// - /// # Panics - /// - /// Panics if `raw` is not properly aligned. 
- /// - /// # Safety - /// - /// The given `raw` should have been derived from `Owned`, and one `raw` should not be converted - /// back by `Owned::from_raw()` multiple times. - pub(crate) unsafe fn from_raw(raw_ptr: *mut T) -> Owned { - let raw = raw_ptr as usize; - ensure_aligned::(raw); - Self::from_usize(raw) - } - - /// Allocates `value` on the heap and returns a new owned pointer pointing to it. - pub(crate) fn new(init: T) -> Owned { - Self::init(init) - } -} - -impl Owned { - /// Allocates `value` on the heap and returns a new owned pointer pointing to it. - pub(crate) fn init(init: T::Init) -> Owned { - unsafe { Self::from_usize(T::init(init)) } - } - - /// Converts the owned pointer into a [`Shared`]. - #[allow(clippy::needless_lifetimes)] - pub(crate) fn into_shared<'g>(self, _: &'g Guard) -> Shared<'g, T> { - unsafe { Shared::from_usize(self.into_usize()) } - } - - /// Returns the tag stored within the pointer. - pub(crate) fn tag(&self) -> usize { - let (_, tag) = decompose_tag::(self.data); - tag - } - - /// Returns the same pointer, but tagged with `tag`. `tag` is truncated to be fit into the - /// unused bits of the pointer to `T`. - pub(crate) fn with_tag(self, tag: usize) -> Owned { - let data = self.into_usize(); - unsafe { Self::from_usize(compose_tag::(data, tag)) } - } -} - -impl Drop for Owned { - fn drop(&mut self) { - let (raw, _) = decompose_tag::(self.data); - unsafe { - T::drop(raw); - } - } -} - -impl fmt::Debug for Owned { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let (raw, tag) = decompose_tag::(self.data); - - f.debug_struct("Owned").field("raw", &raw).field("tag", &tag).finish() - } -} - -impl Clone for Owned { - fn clone(&self) -> Self { - Owned::new((**self).clone()).with_tag(self.tag()) - } -} - -impl Deref for Owned { - type Target = T; - - fn deref(&self) -> &T { - let (raw, _) = decompose_tag::(self.data); - unsafe { T::deref(raw) } - } -} - -impl DerefMut for Owned { - fn deref_mut(&mut self) -> &mut T { - let (raw, _) = decompose_tag::(self.data); - unsafe { T::deref_mut(raw) } - } -} - -impl From for Owned { - fn from(t: T) -> Self { - Owned::new(t) - } -} - -impl From> for Owned { - /// Returns a new owned pointer pointing to `b`. - /// - /// # Panics - /// - /// Panics if the pointer (the `Box`) is not properly aligned. - fn from(b: Box) -> Self { - unsafe { Self::from_raw(Box::into_raw(b)) } - } -} - -impl Borrow for Owned { - fn borrow(&self) -> &T { - self.deref() - } -} - -impl BorrowMut for Owned { - fn borrow_mut(&mut self) -> &mut T { - self.deref_mut() - } -} - -impl AsRef for Owned { - fn as_ref(&self) -> &T { - self.deref() - } -} - -impl AsMut for Owned { - fn as_mut(&mut self) -> &mut T { - self.deref_mut() - } -} - -/// A pointer to an object protected by the epoch GC. -/// -/// The pointer is valid for use only during the lifetime `'g`. -/// -/// The pointer must be properly aligned. Since it is aligned, a tag can be stored into the unused -/// least significant bits of the address. 
-pub(crate) struct Shared<'g, T: 'g + ?Sized + Pointable> { - data: usize, - _marker: PhantomData<(&'g (), *const T)>, -} - -impl Clone for Shared<'_, T> { - fn clone(&self) -> Self { - Self { data: self.data, _marker: PhantomData } - } -} - -impl Copy for Shared<'_, T> {} - -impl Pointer for Shared<'_, T> { - #[inline] - fn into_usize(self) -> usize { - self.data - } - - #[inline] - unsafe fn from_usize(data: usize) -> Self { - Shared { data, _marker: PhantomData } - } -} - -impl<'g, T> Shared<'g, T> { - /// Converts the pointer to a raw pointer (without the tag). - #[allow(clippy::trivially_copy_pass_by_ref)] - pub(crate) fn as_raw(&self) -> *const T { - let (raw, _) = decompose_tag::(self.data); - raw as *const _ - } -} - -impl<'g, T: ?Sized + Pointable> Shared<'g, T> { - /// Returns a new null pointer. - pub(crate) const fn null() -> Shared<'g, T> { - Shared { data: 0, _marker: PhantomData } - } - - /// Returns `true` if the pointer is null. - #[allow(clippy::trivially_copy_pass_by_ref)] - pub(crate) fn is_null(&self) -> bool { - let (raw, _) = decompose_tag::(self.data); - raw == 0 - } - - /// Dereferences the pointer. - /// - /// Returns a reference to the pointee that is valid during the lifetime `'g`. - /// - /// # Safety - /// - /// Dereferencing a pointer is unsafe because it could be pointing to invalid memory. - /// - /// Another concern is the possibility of data races due to lack of proper synchronization. - /// For example, consider the following scenario: - /// - /// 1. A thread creates a new object: `a.store(Owned::new(10), Relaxed)` - /// 2. Another thread reads it: `*a.load(Relaxed, guard).as_ref().unwrap()` - /// - /// The problem is that relaxed orderings don't synchronize initialization of the object with - /// the read from the second thread. This is a data race. A possible solution would be to use - /// `Release` and `Acquire` orderings. - #[allow(clippy::trivially_copy_pass_by_ref)] - #[allow(clippy::should_implement_trait)] - pub(crate) unsafe fn deref(&self) -> &'g T { - let (raw, _) = decompose_tag::(self.data); - T::deref(raw) - } - - /// Converts the pointer to a reference. - /// - /// Returns `None` if the pointer is null, or else a reference to the object wrapped in `Some`. - /// - /// # Safety - /// - /// Dereferencing a pointer is unsafe because it could be pointing to invalid memory. - /// - /// Another concern is the possibility of data races due to lack of proper synchronization. - /// For example, consider the following scenario: - /// - /// 1. A thread creates a new object: `a.store(Owned::new(10), Relaxed)` - /// 2. Another thread reads it: `*a.load(Relaxed, guard).as_ref().unwrap()` - /// - /// The problem is that relaxed orderings don't synchronize initialization of the object with - /// the read from the second thread. This is a data race. A possible solution would be to use - /// `Release` and `Acquire` orderings. - #[allow(clippy::trivially_copy_pass_by_ref)] - pub(crate) unsafe fn as_ref(&self) -> Option<&'g T> { - let (raw, _) = decompose_tag::(self.data); - if raw == 0 { - None - } else { - Some(T::deref(raw)) - } - } - - /// Takes ownership of the pointee. - /// - /// # Panics - /// - /// Panics if this pointer is null, but only in debug mode. - /// - /// # Safety - /// - /// This method may be called only if the pointer is valid and nobody else is holding a - /// reference to the same object. 
- pub(crate) unsafe fn into_owned(self) -> Owned { - debug_assert!( - !self.is_null(), - "converting a null `Shared` into `Owned`" - ); - Owned::from_usize(self.data) - } - - /// Returns the tag stored within the pointer. - #[allow(clippy::trivially_copy_pass_by_ref)] - pub(crate) fn tag(&self) -> usize { - let (_, tag) = decompose_tag::(self.data); - tag - } - - /// Returns the same pointer, but tagged with `tag`. `tag` is truncated to be fit into the - /// unused bits of the pointer to `T`. - #[allow(clippy::trivially_copy_pass_by_ref)] - pub(crate) fn with_tag(&self, tag: usize) -> Shared<'g, T> { - unsafe { Self::from_usize(compose_tag::(self.data, tag)) } - } -} - -impl From<*const T> for Shared<'_, T> { - /// Returns a new pointer pointing to `raw`. - /// - /// # Panics - /// - /// Panics if `raw` is not properly aligned. - fn from(raw_ptr: *const T) -> Self { - let raw = raw_ptr as usize; - ensure_aligned::(raw); - unsafe { Self::from_usize(raw) } - } -} - -impl<'g, T: ?Sized + Pointable> PartialEq> for Shared<'g, T> { - fn eq(&self, other: &Self) -> bool { - self.data == other.data - } -} - -impl Eq for Shared<'_, T> {} - -impl<'g, T: ?Sized + Pointable> PartialOrd> for Shared<'g, T> { - fn partial_cmp(&self, other: &Self) -> Option { - self.data.partial_cmp(&other.data) - } -} - -impl Ord for Shared<'_, T> { - fn cmp(&self, other: &Self) -> cmp::Ordering { - self.data.cmp(&other.data) - } -} - -impl fmt::Debug for Shared<'_, T> { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let (raw, tag) = decompose_tag::(self.data); - - f.debug_struct("Shared").field("raw", &raw).field("tag", &tag).finish() - } -} - -impl fmt::Pointer for Shared<'_, T> { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - fmt::Pointer::fmt(&(unsafe { self.deref() }), f) - } -} - -impl Default for Shared<'_, T> { - fn default() -> Self { - Shared::null() - } -} - -#[cfg(test)] -mod tests { - use super::Shared; - - #[test] - fn valid_tag_i8() { - Shared::::null().with_tag(0); - } - - #[test] - fn valid_tag_i64() { - Shared::::null().with_tag(7); - } -} diff --git a/src/ebr/collector.rs b/src/ebr/collector.rs deleted file mode 100644 index c63549e25..000000000 --- a/src/ebr/collector.rs +++ /dev/null @@ -1,91 +0,0 @@ -/// Epoch-based garbage collector. -use core::fmt; -use std::sync::Arc; - -use super::internal::{Global, Local}; -use super::Guard; -use crate::Lazy; - -/// The global data for the default garbage collector. -static COLLECTOR: Lazy Collector> = - Lazy::new(Collector::new); - -thread_local! { - /// The per-thread participant for the default garbage collector. - static HANDLE: LocalHandle = COLLECTOR.register(); -} - -/// Pins the current thread. -#[inline] -pub(crate) fn pin() -> Guard { - with_handle(LocalHandle::pin) -} - -#[inline] -fn with_handle(mut f: F) -> R -where - F: FnMut(&LocalHandle) -> R, -{ - HANDLE.try_with(|h| f(h)).unwrap_or_else(|_| f(&COLLECTOR.register())) -} - -/// An epoch-based garbage collector. -pub(super) struct Collector { - pub(super) global: Arc, -} - -unsafe impl Send for Collector {} -unsafe impl Sync for Collector {} - -impl Default for Collector { - fn default() -> Self { - Self { global: Arc::new(Global::new()) } - } -} - -impl Collector { - /// Creates a new collector. - pub(super) fn new() -> Self { - Self::default() - } - - /// Registers a new handle for the collector. 
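The `Collector`/`pin()`/`Guard` machinery above follows the usual epoch-reclamation lifecycle: pin, load, defer destruction, unpin. A sketch of that lifecycle written against the upstream `crossbeam-epoch` crate (from which this module was originally imported, and which is assumed here as a dependency, since the vendored copy is crate-private):

```rust
// Sketch of the pin / load / defer-destroy lifecycle using crossbeam-epoch.
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    // Publish a heap value through an atomic pointer.
    let slot: Atomic<u64> = Atomic::new(1);

    {
        // Pinning registers this thread as a participant in the current
        // epoch, so nothing it can still observe is freed underneath it.
        let guard = epoch::pin();

        // Swap in a replacement; AcqRel makes the new value's initialization
        // visible to readers that load with Acquire.
        let old = slot.swap(Owned::new(2), Ordering::AcqRel, &guard);

        // Other pinned threads may still hold the old pointer, so its
        // destruction is deferred until the epoch advances far enough.
        unsafe { guard.defer_destroy(old) };

        let current = slot.load(Ordering::Acquire, &guard);
        assert_eq!(unsafe { *current.deref() }, 2);
    }

    // Drop the final value when the structure itself is torn down.
    unsafe { drop(slot.into_owned()) };
}
```

The key property is that deferred destruction never frees memory another pinned thread might still be reading; the free waits until the global epoch has moved past every such reader.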
- pub(super) fn register(&self) -> LocalHandle { - Local::register(self) - } -} - -impl Clone for Collector { - /// Creates another reference to the same garbage collector. - fn clone(&self) -> Self { - Collector { global: self.global.clone() } - } -} - -/// A handle to a garbage collector. -pub(super) struct LocalHandle { - pub(super) local: *const Local, -} - -impl LocalHandle { - /// Pins the handle. - #[inline] - pub(super) fn pin(&self) -> Guard { - unsafe { (*self.local).pin() } - } -} - -impl Drop for LocalHandle { - #[inline] - fn drop(&mut self) { - unsafe { - Local::release_handle(&*self.local); - } - } -} - -impl fmt::Debug for LocalHandle { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - f.pad("LocalHandle { .. }") - } -} diff --git a/src/ebr/deferred.rs b/src/ebr/deferred.rs deleted file mode 100644 index d2f94f948..000000000 --- a/src/ebr/deferred.rs +++ /dev/null @@ -1,141 +0,0 @@ -use core::fmt; -use core::marker::PhantomData; -use core::mem::{self, MaybeUninit}; -use core::ptr; - -/// Number of words a piece of `Data` can hold. -/// -/// Three words should be enough for the majority of cases. For example, you can fit inside it the -/// function pointer together with a fat pointer representing an object that needs to be destroyed. -const DATA_WORDS: usize = 3; - -/// Some space to keep a `FnOnce()` object on the stack. -type Data = MaybeUninit<[usize; DATA_WORDS]>; - -/// A `FnOnce()` that is stored inline if small, or otherwise boxed on the heap. -/// -/// This is a handy way of keeping an unsized `FnOnce()` within a sized structure. -pub(super) struct Deferred { - call: unsafe fn(*mut u8), - data: Data, - _marker: PhantomData<*mut ()>, // !Send + !Sync -} - -impl fmt::Debug for Deferred { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> { - f.pad("Deferred { .. }") - } -} - -unsafe fn call_raw(raw: *mut u8) { - let f: F = ptr::read(raw as *mut F); - f(); -} - -unsafe fn call_raw_box(raw: *mut u8) { - // It's safe to cast `raw` from `*mut u8` to `*mut Box`, because `raw` is - // originally derived from `*mut Box`. - #[allow(clippy::cast_ptr_alignment)] - let b: Box = ptr::read(raw as *mut Box); - (*b)(); -} - -impl Deferred { - /// Constructs a new `Deferred` from a `FnOnce()`. - pub(super) fn new(f: F) -> Self { - let size = mem::size_of::(); - let align = mem::align_of::(); - - unsafe { - if size <= mem::size_of::() - && align <= mem::align_of::() - { - let mut data = Data::uninit(); - ptr::write(data.as_mut_ptr() as *mut F, f); - - Deferred { - call: call_raw::, - data, - _marker: PhantomData, - } - } else { - let b: Box = Box::new(f); - let mut data = Data::uninit(); - ptr::write(data.as_mut_ptr() as *mut Box, b); - - Deferred { - call: call_raw_box::, - data, - _marker: PhantomData, - } - } - } - } - - /// Calls the function. 
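`Deferred` above stores a `FnOnce()` inline in three words when it fits and boxes it otherwise. A simplified sketch that always boxes, showing only the interface and the bag-draining behaviour rather than the inline fast path that the real code exists for:

```rust
// Simplified deferral: the observable behaviour is just "run this FnOnce later".
struct Deferred {
    call: Box<dyn FnOnce()>,
}

impl Deferred {
    fn new<F: FnOnce() + 'static>(f: F) -> Self {
        Deferred { call: Box::new(f) }
    }
    fn call(self) {
        (self.call)();
    }
}

fn main() {
    use std::cell::Cell;
    use std::rc::Rc;

    let fired = Rc::new(Cell::new(0));

    // Stash some work, as the thread-local Bag does for unlinked nodes.
    let mut bag: Vec<Deferred> = Vec::new();
    for _ in 0..3 {
        let fired = fired.clone();
        bag.push(Deferred::new(move || fired.set(fired.get() + 1)));
    }
    assert_eq!(fired.get(), 0);

    // Draining the bag runs every deferred closure exactly once.
    for d in bag.drain(..) {
        d.call();
    }
    assert_eq!(fired.get(), 3);
}
```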
- #[inline] - pub(super) fn call(mut self) { - let call = self.call; - #[allow(trivial_casts)] - unsafe { - call(self.data.as_mut_ptr().cast()) - }; - } -} - -#[cfg(test)] -mod tests { - use super::Deferred; - use std::cell::Cell; - - #[test] - fn on_stack() { - let fired = &Cell::new(false); - let a = [0_usize; 1]; - - let d = Deferred::new(move || { - drop(a); - fired.set(true); - }); - - assert!(!fired.get()); - d.call(); - assert!(fired.get()); - } - - #[test] - fn on_heap() { - let fired = &Cell::new(false); - let a = [0_usize; 10]; - - let d = Deferred::new(move || { - drop(a); - fired.set(true); - }); - - assert!(!fired.get()); - d.call(); - assert!(fired.get()); - } - - #[test] - fn string() { - let a = "hello".to_string(); - let d = Deferred::new(move || assert_eq!(a, "hello")); - d.call(); - } - - #[test] - fn boxed_slice_i32() { - let a: Box<[i32]> = vec![2, 3, 5, 7].into_boxed_slice(); - let d = Deferred::new(move || assert_eq!(*a, [2, 3, 5, 7])); - d.call(); - } - - #[test] - fn long_slice_usize() { - let a: [usize; 5] = [2, 3, 5, 7, 11]; - let d = Deferred::new(move || assert_eq!(a, [2, 3, 5, 7, 11])); - d.call(); - } -} diff --git a/src/ebr/epoch.rs b/src/ebr/epoch.rs deleted file mode 100644 index e01ce3551..000000000 --- a/src/ebr/epoch.rs +++ /dev/null @@ -1,113 +0,0 @@ -//! The global epoch -//! -//! The last bit in this number is unused and is always zero. Every so often the global epoch is -//! incremented, i.e. we say it "advances". A pinned participant may advance the global epoch only -//! if all currently pinned participants have been pinned in the current epoch. -//! -//! If an object became garbage in some epoch, then we can be sure that after two advancements no -//! participant will hold a reference to it. That is the crux of safe memory reclamation. - -use core::sync::atomic::{AtomicUsize, Ordering}; - -/// An epoch that can be marked as pinned or unpinned. -/// -/// Internally, the epoch is represented as an integer that wraps around at some unspecified point -/// and a flag that represents whether it is pinned or unpinned. -#[derive(Copy, Clone, Debug, Eq, PartialEq)] -pub(super) struct Epoch { - /// The least significant bit is set if pinned. The rest of the bits hold the epoch. - data: usize, -} - -impl Epoch { - /// Returns the starting epoch in unpinned state. - pub(super) const fn starting() -> Self { - Epoch { data: 0 } - } - - /// Returns the number of epochs `self` is ahead of `rhs`. - /// - /// Internally, epochs are represented as numbers in the range `(isize::MIN / 2) .. (isize::MAX - /// / 2)`, so the returned distance will be in the same interval. - pub(super) const fn wrapping_sub(self, rhs: Self) -> usize { - // The result is the same with `(self.data & !1).wrapping_sub(rhs.data & !1) as isize >> 1`, - // because the possible difference of LSB in `(self.data & !1).wrapping_sub(rhs.data & !1)` - // will be ignored in the shift operation. - self.data.wrapping_sub(rhs.data & !1) >> 1 - } - - /// Returns `true` if the epoch is marked as pinned. - pub(super) const fn is_pinned(self) -> bool { - (self.data & 1) == 1 - } - - /// Returns the same epoch, but marked as pinned. - pub(super) const fn pinned(self) -> Epoch { - Epoch { data: self.data | 1 } - } - - /// Returns the same epoch, but marked as unpinned. - pub(super) const fn unpinned(self) -> Epoch { - Epoch { data: self.data & !1 } - } - - /// Returns the successor epoch. - /// - /// The returned epoch will be marked as pinned only if the previous one was as well. 
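The `Epoch` above keeps its pinned flag in the least significant bit and counts epochs in the remaining bits, so advancing by one epoch adds two to the raw value. A self-contained mirror of that arithmetic (a toy copy, not the module's type):

```rust
// LSB = "pinned" flag, remaining bits = epoch counter.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct Epoch {
    data: usize,
}

impl Epoch {
    const fn starting() -> Self {
        Epoch { data: 0 }
    }
    const fn is_pinned(self) -> bool {
        (self.data & 1) == 1
    }
    const fn pinned(self) -> Epoch {
        Epoch { data: self.data | 1 }
    }
    const fn unpinned(self) -> Epoch {
        Epoch { data: self.data & !1 }
    }
    const fn successor(self) -> Epoch {
        Epoch { data: self.data.wrapping_add(2) }
    }
    const fn wrapping_sub(self, rhs: Self) -> usize {
        self.data.wrapping_sub(rhs.data & !1) >> 1
    }
}

fn main() {
    let e0 = Epoch::starting();
    let e1 = e0.successor();
    let e2 = e1.successor();

    // Pinning only toggles the flag bit; the epoch number is unchanged.
    assert!(e1.pinned().is_pinned());
    assert_eq!(e1.pinned().unpinned(), e1);

    // Each advancement adds 2 to the raw value, i.e. 1 to the epoch number.
    assert_eq!(e1.wrapping_sub(e0), 1);
    assert_eq!(e2.wrapping_sub(e0), 2);
}
```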
- pub(super) const fn successor(self) -> Epoch { - Epoch { data: self.data.wrapping_add(2) } - } -} - -/// An atomic value that holds an `Epoch`. -#[derive(Default, Debug)] -pub(super) struct AtomicEpoch { - /// Since `Epoch` is just a wrapper around `usize`, an `AtomicEpoch` is similarly represented - /// using an `AtomicUsize`. - data: AtomicUsize, -} - -impl AtomicEpoch { - /// Creates a new atomic epoch. - pub(super) const fn new(epoch: Epoch) -> Self { - let data = AtomicUsize::new(epoch.data); - AtomicEpoch { data } - } - - /// Loads a value from the atomic epoch. - #[inline] - pub(super) fn load(&self, ord: Ordering) -> Epoch { - Epoch { data: self.data.load(ord) } - } - - /// Stores a value into the atomic epoch. - #[inline] - pub(super) fn store(&self, epoch: Epoch, ord: Ordering) { - self.data.store(epoch.data, ord); - } - - /// Stores a value into the atomic epoch if the current value is the same as `current`. - /// - /// The return value is always the previous value. If it is equal to `current`, then the value - /// is updated. - /// - /// The `Ordering` argument describes the memory ordering of this operation. - #[inline] - pub(super) fn compare_and_swap( - &self, - current: Epoch, - new: Epoch, - ord: Ordering, - ) -> Epoch { - use super::atomic::CompareAndSetOrdering; - - match self.data.compare_exchange( - current.data, - new.data, - ord.success(), - ord.failure(), - ) { - Ok(data) | Err(data) => Epoch { data }, - } - } -} diff --git a/src/ebr/internal.rs b/src/ebr/internal.rs deleted file mode 100644 index d2121737b..000000000 --- a/src/ebr/internal.rs +++ /dev/null @@ -1,679 +0,0 @@ -//! The global data and participant for garbage collection. -//! -//! # Registration -//! -//! In order to track all participants in one place, we need some form of participant -//! registration. When a participant is created, it is registered to a global lock-free -//! singly-linked list of registries; and when a participant is leaving, it is unregistered from the -//! list. -//! -//! # Pinning -//! -//! Every participant contains an integer that tells whether the participant is pinned and if so, -//! what was the global epoch at the time it was pinned. Participants also hold a pin counter that -//! aids in periodic global epoch advancement. -//! -//! When a participant is pinned, a `Guard` is returned as a witness that the participant is pinned. -//! Guards are necessary for performing atomic operations, and for freeing/dropping locations. -//! -//! # Thread-local bag -//! -//! Objects that get unlinked from concurrent data structures must be stashed away until the global -//! epoch sufficiently advances so that they become safe for destruction. Pointers to such objects -//! are pushed into a thread-local bag, and when it becomes full, the bag is marked with the current -//! global epoch and pushed into the global queue of bags. We store objects in thread-local storages -//! for amortizing the synchronization cost of pushing the garbages to a global queue. -//! -//! # Global queue -//! -//! Whenever a bag is pushed into a queue, the objects in some bags in the queue are collected and -//! destroyed along the way. This design reduces contention on data structures. The global queue -//! cannot be explicitly accessed: the only way to interact with it is by calling functions -//! `defer()` that adds an object to the thread-local bag, or `collect()` that manually triggers -//! garbage collection. -//! -//! 
Ideally each instance of concurrent data structure may have its own queue that gets fully -//! destroyed as soon as the data structure gets dropped. - -use core::cell::{Cell, UnsafeCell}; -use core::mem::{self, ManuallyDrop}; -use core::num::Wrapping; -use core::sync::atomic; -use core::sync::atomic::Ordering; -use core::{fmt, ptr}; - -use crate::CachePadded; - -use super::atomic::{Owned, Shared}; -use super::collector::{Collector, LocalHandle}; -use super::deferred::Deferred; -use super::epoch::{AtomicEpoch, Epoch}; -use super::list::{Entry, IsElement, IterError, List}; -use super::queue::{Queue, SIZE_HINT}; -use super::{unprotected, Guard}; - -/// Maximum number of objects a bag can contain. -#[cfg(not(feature = "lock_free_delays"))] -const MAX_OBJECTS: usize = 62; -#[cfg(feature = "lock_free_delays")] -const MAX_OBJECTS: usize = 4; - -/// A bag of deferred functions. -pub(super) struct Bag { - /// Stashed objects. - deferreds: [Deferred; MAX_OBJECTS], - len: usize, -} - -/// `Bag::try_push()` requires that it is safe for another thread to execute the given functions. -unsafe impl Send for Bag {} - -impl Bag { - /// Returns a new, empty bag. - pub(super) fn new() -> Self { - Self::default() - } - - /// Returns `true` if the bag is empty. - pub(super) const fn is_empty(&self) -> bool { - self.len == 0 - } - - /// Attempts to insert a deferred function into the bag. - /// - /// Returns `Ok(())` if successful, and `Err(deferred)` for the given `deferred` if the bag is - /// full. - /// - /// # Safety - /// - /// It should be safe for another thread to execute the given function. - pub(super) unsafe fn try_push( - &mut self, - deferred: Deferred, - ) -> Result<(), Deferred> { - if self.len < MAX_OBJECTS { - self.deferreds[self.len] = deferred; - self.len += 1; - Ok(()) - } else { - Err(deferred) - } - } - - /// Seals the bag with the given epoch. 
- fn seal(self, epoch: Epoch) -> SealedBag { - SealedBag { epoch, _bag: self } - } -} - -impl Default for Bag { - #[rustfmt::skip] - fn default() -> Self { - // TODO: [no_op; MAX_OBJECTS] syntax blocked by https://github.com/rust-lang/rust/issues/49147 - #[cfg(not(feature = "lock_free_delays"))] - return Bag { - len: 0, - deferreds: [ - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - ], - }; - #[cfg(feature = "lock_free_delays")] - return Bag { - len: 0, - deferreds: [ - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - Deferred::new(no_op_func), - ], - }; - } -} - -impl Drop for Bag { - fn drop(&mut self) { - // Call all deferred functions. - for deferred in &mut self.deferreds[..self.len] { - let no_op = Deferred::new(no_op_func); - let owned_deferred = mem::replace(deferred, no_op); - owned_deferred.call(); - } - } -} - -// can't #[derive(Debug)] because Debug is not implemented for arrays 64 items long -impl fmt::Debug for Bag { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - f.debug_struct("Bag") - .field("deferreds", &&self.deferreds[..self.len]) - .finish() - } -} - -const fn no_op_func() {} - -/// A pair of an epoch and a bag. -#[derive(Debug)] -struct SealedBag { - epoch: Epoch, - // exists solely to be dropped - _bag: Bag, -} - -/// It is safe to share `SealedBag` because `is_expired` only inspects the epoch. -unsafe impl Sync for SealedBag {} - -impl SealedBag { - /// Checks if it is safe to drop the bag w.r.t. the given global epoch. - const fn is_expired(&self, global_epoch: Epoch) -> bool { - // A pinned participant can witness at most one epoch advancement. Therefore, any bag that - // is within one epoch of the current one cannot be destroyed yet. 
- global_epoch.wrapping_sub(self.epoch) >= 2 - } -} - -/// The global data for a garbage collector. -pub(crate) struct Global { - /// The intrusive linked list of `Local`s. - locals: List, - - /// The global queue of bags of deferred functions. - queue: Queue, - - /// The global epoch. - pub(super) epoch: CachePadded, -} - -impl Global { - /// Creates a new global data for garbage collection. - #[inline] - pub(super) fn new() -> Self { - Self { - locals: List::new(), - queue: Queue::new(), - epoch: CachePadded::new(AtomicEpoch::new(Epoch::starting())), - } - } - - /// Pushes the bag into the global queue and replaces the bag with a new empty bag. - pub(super) fn push_bag(&self, bag: &mut Bag, guard: &Guard) { - let bag = mem::replace(bag, Bag::new()); - - atomic::fence(Ordering::SeqCst); - - let epoch = self.epoch.load(Ordering::Relaxed); - self.queue.push(bag.seal(epoch), guard); - } - - /// Collects several bags from the global queue and executes deferred functions in them. - /// - /// Note: This may itself produce garbage and in turn allocate new bags. - /// - /// `pin()` rarely calls `collect()`, so we want the compiler to place that call on a cold - /// path. In other words, we want the compiler to optimize branching for the case when - /// `collect()` is not called. - #[cold] - pub(super) fn collect(&self, guard: &Guard) { - /// Number of bags to destroy. - const COLLECT_STEPS: usize = 8; - - let steps = - COLLECT_STEPS.max(SIZE_HINT.load(Ordering::Relaxed) / 16).min(512); - - let global_epoch = self.try_advance(guard); - - #[cfg(feature = "testing")] - let mut count = 0; - - for _ in 0..steps { - match self.queue.try_pop_if( - |sealed_bag: &SealedBag| sealed_bag.is_expired(global_epoch), - guard, - ) { - None => break, - Some(sealed_bag) => { - drop(sealed_bag); - - #[cfg(feature = "testing")] - { - count += 1; - } - } - } - } - - #[cfg(feature = "testing")] - { - if count > 0 && SIZE_HINT.load(Ordering::Relaxed) > 5000 { - static O: std::sync::Once = std::sync::Once::new(); - - O.call_once(|| { - log::warn!( - "EBR collector grew to {} items", - SIZE_HINT.load(Ordering::Relaxed) - ); - }); - } - } - } - - /// Attempts to advance the global epoch. - /// - /// The global epoch can advance only if all currently pinned participants have been pinned in - /// the current epoch. - /// - /// Returns the current global epoch. - /// - /// `try_advance()` is annotated `#[cold]` because it is rarely called. - #[cold] - fn try_advance(&self, guard: &Guard) -> Epoch { - let global_epoch = self.epoch.load(Ordering::Relaxed); - atomic::fence(Ordering::SeqCst); - - // TODO(stjepang): `Local`s are stored in a linked list because linked lists are fairly - // easy to implement in a lock-free manner. However, traversal can be slow due to cache - // misses and data dependencies. We should experiment with other data structures as well. - for local in self.locals.iter(guard) { - match local { - Err(IterError::Stalled) => { - // A concurrent thread stalled this iteration. That thread might also try to - // advance the epoch, in which case we leave the job to it. Otherwise, the - // epoch will not be advanced. - return global_epoch; - } - Ok(local) => { - let local_epoch = local.epoch.load(Ordering::Relaxed); - - // If the participant was pinned in a different epoch, we cannot advance the - // global epoch just yet. 
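The `>= 2` in `is_expired` above encodes the core safety rule: a pinned participant can witness at most one epoch advancement, so garbage sealed at epoch `e` may only be reclaimed once the global epoch reaches `e + 2`. A toy simulation of that rule with plain integers standing in for the real types:

```rust
// Garbage sealed at epoch e is reclaimable only when global_epoch - e >= 2,
// because a thread pinned in e can still be reading while the epoch is e + 1.
struct SealedBag {
    sealed_at: u64,
    items: Vec<String>,
}

fn collect_expired(queue: &mut Vec<SealedBag>, global_epoch: u64) -> Vec<String> {
    let mut freed = Vec::new();
    queue.retain_mut(|bag| {
        if global_epoch.wrapping_sub(bag.sealed_at) >= 2 {
            freed.append(&mut bag.items);
            false // drop the bag
        } else {
            true // keep it queued
        }
    });
    freed
}

fn main() {
    let mut queue = vec![
        SealedBag { sealed_at: 5, items: vec!["old leaf".into()] },
        SealedBag { sealed_at: 6, items: vec!["newer leaf".into()] },
    ];

    // At epoch 7, only the bag sealed at epoch 5 is two epochs behind.
    assert_eq!(collect_expired(&mut queue, 7), vec!["old leaf".to_string()]);

    // One more advancement and the remaining bag becomes reclaimable too.
    assert_eq!(collect_expired(&mut queue, 8), vec!["newer leaf".to_string()]);
    assert!(queue.is_empty());
}
```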
- if local_epoch.is_pinned() - && local_epoch.unpinned() != global_epoch - { - return global_epoch; - } - } - } - } - atomic::fence(Ordering::Acquire); - - // All pinned participants were pinned in the current global epoch. - // Now let's advance the global epoch... - // - // Note that if another thread already advanced it before us, this store will simply - // overwrite the global epoch with the same value. This is true because `try_advance` was - // called from a thread that was pinned in `global_epoch`, and the global epoch cannot be - // advanced two steps ahead of it. - let new_epoch = global_epoch.successor(); - self.epoch.store(new_epoch, Ordering::Release); - new_epoch - } -} - -/// Participant for garbage collection. -pub(crate) struct Local { - /// A node in the intrusive linked list of `Local`s. - entry: Entry, - - /// The local epoch. - epoch: AtomicEpoch, - - /// A reference to the global data. - /// - /// When all guards and handles get dropped, this reference is destroyed. - collector: UnsafeCell>, - - /// The local bag of deferred functions. - pub(super) bag: UnsafeCell, - - /// The number of guards keeping this participant pinned. - pub(super) guard_count: Cell, - - /// The number of active handles. - handle_count: Cell, - - /// Total number of pinnings performed. - /// - /// This is just an auxiliary counter that sometimes kicks off collection. - pin_count: Cell>, -} - -// Make sure `Local` is less than or equal to 2048 bytes. -// https://github.com/crossbeam-rs/crossbeam/issues/551 -#[test] -fn local_size() { - assert!(2048 >= core::mem::size_of::()); -} - -impl Local { - /// Number of pinnings after which a participant will execute some deferred functions from the - /// global queue. - const PINNINGS_BETWEEN_COLLECT: usize = 128; - - /// Registers a new `Local` in the provided `Global`. - pub(super) fn register(collector: &Collector) -> LocalHandle { - unsafe { - // Since we dereference no pointers in this block, it is safe to use `unprotected`. - - let local = Owned::new(Local { - entry: Entry::default(), - epoch: AtomicEpoch::new(Epoch::starting()), - collector: UnsafeCell::new(ManuallyDrop::new( - collector.clone(), - )), - bag: UnsafeCell::new(Bag::new()), - guard_count: Cell::new(0), - handle_count: Cell::new(1), - pin_count: Cell::new(Wrapping(0)), - }) - .into_shared(unprotected()); - collector.global.locals.insert(local, unprotected()); - LocalHandle { local: local.as_raw() } - } - } - - /// Returns a reference to the `Global` in which this `Local` resides. - #[inline] - pub(super) fn global(&self) -> &Global { - &self.collector().global - } - - /// Returns a reference to the `Collector` in which this `Local` resides. - #[inline] - pub(super) fn collector(&self) -> &Collector { - unsafe { &**self.collector.get() } - } - - /// Adds `deferred` to the thread-local bag. - /// - /// # Safety - /// - /// It should be safe for another thread to execute the given function. - pub(super) unsafe fn defer(&self, mut deferred: Deferred, guard: &Guard) { - let bag = &mut *self.bag.get(); - - while let Err(d) = bag.try_push(deferred) { - self.global().push_bag(bag, guard); - deferred = d; - } - } - - pub(super) fn flush(&self, guard: &Guard) { - let bag = unsafe { &mut *self.bag.get() }; - - if !bag.is_empty() { - self.global().push_bag(bag, guard); - } - - self.global().collect(guard); - } - - /// Pins the `Local`. 
- #[inline] - pub(super) fn pin(&self) -> Guard { - let guard = Guard { - local: self, - #[cfg(feature = "testing")] - began: std::time::Instant::now(), - }; - - let guard_count = self.guard_count.get(); - self.guard_count.set(guard_count.checked_add(1).unwrap()); - - if guard_count == 0 { - let global_epoch = self.global().epoch.load(Ordering::Relaxed); - let new_epoch = global_epoch.pinned(); - - // Now we must store `new_epoch` into `self.epoch` and execute a `SeqCst` fence. - // The fence makes sure that any future loads from `Atomic`s will not happen before - // this store. - if cfg!(any(target_arch = "x86", target_arch = "x86_64")) { - // HACK(stjepang): On x86 architectures there are two different ways of executing - // a `SeqCst` fence. - // - // 1. `atomic::fence(SeqCst)`, which compiles into a `mfence` instruction. - // 2. `_.compare_and_swap(_, _, SeqCst)`, which compiles into a `lock cmpxchg` - // instruction. - // - // Both instructions have the effect of a full barrier, but benchmarks have shown - // that the second one makes pinning faster in this particular case. It is not - // clear that this is permitted by the C++ memory model (SC fences work very - // differently from SC accesses), but experimental evidence suggests that this - // works fine. Using inline assembly would be a viable (and correct) alternative, - // but alas, that is not possible on stable Rust. - let current = Epoch::starting(); - let previous = self.epoch.compare_and_swap( - current, - new_epoch, - Ordering::SeqCst, - ); - debug_assert_eq!( - current, previous, - "participant was expected to be unpinned" - ); - // We add a compiler fence to make it less likely for LLVM to do something wrong - // here. Formally, this is not enough to get rid of data races; practically, - // it should go a long way. - atomic::compiler_fence(Ordering::SeqCst); - } else { - self.epoch.store(new_epoch, Ordering::Relaxed); - atomic::fence(Ordering::SeqCst); - } - - // Increment the pin counter. - let count = self.pin_count.get(); - self.pin_count.set(count + Wrapping(1)); - - // After every `PINNINGS_BETWEEN_COLLECT` try advancing the epoch and collecting - // some garbage. - if count.0 % Self::PINNINGS_BETWEEN_COLLECT == 0 { - self.global().collect(&guard); - } - } - - guard - } - - /// Unpins the `Local`. - #[inline] - pub(super) fn unpin(&self) { - let guard_count = self.guard_count.get(); - self.guard_count.set(guard_count - 1); - - if guard_count == 1 - /* && !skip_collect */ - { - self.epoch.store(Epoch::starting(), Ordering::Release); - - if self.handle_count.get() == 0 { - self.finalize(); - } - } - } - - /// Decrements the handle count. - #[inline] - pub(super) fn release_handle(&self) { - let guard_count = self.guard_count.get(); - let handle_count = self.handle_count.get(); - debug_assert!(handle_count >= 1); - self.handle_count.set(handle_count - 1); - - if guard_count == 0 && handle_count == 1 { - self.finalize(); - } - } - - /// Removes the `Local` from the global linked list. - #[cold] - fn finalize(&self) { - debug_assert_eq!(self.guard_count.get(), 0); - debug_assert_eq!(self.handle_count.get(), 0); - - // Temporarily increment handle count. This is required so that the following call to `pin` - // doesn't call `finalize` again. - self.handle_count.set(1); - unsafe { - // Pin and move the local bag into the global queue. It's important that `push_bag` - // doesn't defer destruction on any new garbage. 
- let guard = &self.pin(); - self.global().push_bag(&mut *self.bag.get(), guard); - } - // Revert the handle count back to zero. - self.handle_count.set(0); - - unsafe { - // Take the reference to the `Global` out of this `Local`. Since we're not protected - // by a guard at this time, it's crucial that the reference is read before marking the - // `Local` as deleted. - let collector: Collector = ptr::read(&*(*self.collector.get())); - - // Mark this node in the linked list as deleted. - self.entry.delete(unprotected()); - - // Finally, drop the reference to the global. Note that this might be the last reference - // to the `Global`. If so, the global data will be destroyed and all deferred functions - // in its queue will be executed. - drop(collector); - } - } -} - -#[allow(trivial_casts)] -fn entry_offset() -> usize { - use std::mem::MaybeUninit; - let local: MaybeUninit = MaybeUninit::uninit(); - - // MaybeUninit is repr(transparent so we can treat a pointer as a pointer to the inner value) - let local_ref: &Local = - unsafe { &*(&local as *const MaybeUninit as *const Local) }; - let entry_ref: &Entry = &local_ref.entry; - - let local_ptr = local_ref as *const Local; - let entry_ptr = entry_ref as *const Entry; - - entry_ptr as usize - local_ptr as usize -} - -impl IsElement for Local { - fn entry_of(local: &Local) -> &Entry { - &local.entry - } - - unsafe fn element_of(entry: &Entry) -> &Local { - #[allow(trivial_casts)] - let local_ptr = - (entry as *const Entry as usize - entry_offset()) as *const Local; - &*local_ptr - } - - unsafe fn finalize(entry: &Entry, guard: &Guard) { - #[allow(trivial_casts)] - guard.defer_destroy(Shared::from(Self::element_of(entry) as *const _)); - } -} - -#[cfg(test)] -mod tests { - use std::sync::atomic::{AtomicUsize, Ordering}; - - use super::*; - - #[test] - fn check_defer() { - static FLAG: AtomicUsize = AtomicUsize::new(0); - fn set() { - FLAG.store(42, Ordering::Relaxed); - } - - let d = Deferred::new(set); - assert_eq!(FLAG.load(Ordering::Relaxed), 0); - d.call(); - assert_eq!(FLAG.load(Ordering::Relaxed), 42); - } - - #[test] - fn check_bag() { - static FLAG: AtomicUsize = AtomicUsize::new(0); - fn incr() { - FLAG.fetch_add(1, Ordering::Relaxed); - } - - let mut bag = Bag::new(); - assert!(bag.is_empty()); - - for _ in 0..MAX_OBJECTS { - assert!(unsafe { bag.try_push(Deferred::new(incr)).is_ok() }); - assert!(!bag.is_empty()); - assert_eq!(FLAG.load(Ordering::Relaxed), 0); - } - - let result = unsafe { bag.try_push(Deferred::new(incr)) }; - assert!(result.is_err()); - assert!(!bag.is_empty()); - assert_eq!(FLAG.load(Ordering::Relaxed), 0); - - drop(bag); - assert_eq!(FLAG.load(Ordering::Relaxed), MAX_OBJECTS); - } -} diff --git a/src/ebr/list.rs b/src/ebr/list.rs deleted file mode 100644 index 89662f97b..000000000 --- a/src/ebr/list.rs +++ /dev/null @@ -1,359 +0,0 @@ -//! Lock-free intrusive linked list. -//! -//! Ideas from Michael. High Performance Dynamic Lock-Free Hash Tables and List-Based Sets. SPAA -//! 2002. `http://dl.acm.org/citation.cfm?id=564870.564881` - -use core::marker::PhantomData; -use core::sync::atomic::Ordering::{Acquire, Relaxed, Release}; - -use super::{pin, Atomic, Guard, Shared}; - -/// An entry in a linked list. -/// -/// An Entry is accessed from multiple threads, so it would be beneficial to put it in a different -/// cache-line than thread-local data in terms of performance. -#[derive(Debug)] -pub struct Entry { - /// The next entry in the linked list. - /// If the tag is 1, this entry is marked as deleted. 
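`entry_offset` and `element_of` above recover a `Local` from a pointer to its embedded `Entry` by subtracting the field offset, the classic intrusive-list "container_of" trick. A sketch of the same recovery using toy types and `std::mem::offset_of!` (stable since Rust 1.77) instead of probing a `MaybeUninit`:

```rust
// container_of-style recovery: from a reference to an embedded field back to
// the struct that contains it. Toy types, not the real Local/Entry.
#[derive(Debug)]
struct Entry {
    next: usize, // stand-in for the real atomic link
}

#[derive(Debug)]
struct Local {
    pin_count: u64,
    entry: Entry,
}

/// # Safety
///
/// `entry` must really be the `entry` field of a live `Local`.
unsafe fn element_of(entry: &Entry) -> &Local {
    unsafe {
        let local_addr =
            (entry as *const Entry as usize) - std::mem::offset_of!(Local, entry);
        &*(local_addr as *const Local)
    }
}

fn main() {
    let local = Local { pin_count: 42, entry: Entry { next: 0 } };
    let recovered = unsafe { element_of(&local.entry) };
    assert_eq!(recovered.pin_count, 42);
    assert_eq!(recovered.entry.next, 0);
}
```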
- next: Atomic, -} - -/// Implementing this trait asserts that the type `T` can be used as an element in the intrusive -/// linked list defined in this module. `T` has to contain (or otherwise be linked to) an instance -/// of `Entry`. -/// -/// # Example -/// -/// This trait is implemented on a type separate from `T` (although it can be just `T`), because -/// one type might be placeable into multiple lists, in which case it would require multiple -/// implementations of `IsElement`. In such cases, each struct implementing `IsElement` -/// represents a distinct `Entry` in `T`. -/// -/// For example, we can insert the following struct into two lists using `entry1` for one -/// and `entry2` for the other: -pub trait IsElement { - /// Returns a reference to this element's `Entry`. - fn entry_of(_: &T) -> &Entry; - - /// Given a reference to an element's entry, returns that element. - /// - /// # Safety - /// - /// The caller has to guarantee that the `Entry` is called with was retrieved from an instance - /// of the element type (`T`). - unsafe fn element_of(_: &Entry) -> &T; - - /// The function that is called when an entry is unlinked from list. - /// - /// # Safety - /// - /// The caller has to guarantee that the `Entry` is called with was retrieved from an instance - /// of the element type (`T`). - unsafe fn finalize(_: &Entry, _: &Guard); -} - -/// A lock-free, intrusive linked list of type `T`. -#[derive(Debug)] -pub struct List = T> { - /// The head of the linked list. - head: Atomic, - - /// The phantom data for using `T` and `C`. - _marker: PhantomData<(T, C)>, -} - -/// An iterator used for retrieving values from the list. -pub struct Iter<'g, T, C: IsElement> { - /// The guard that protects the iteration. - guard: &'g Guard, - - /// Pointer from the predecessor to the current entry. - pred: &'g Atomic, - - /// The current entry. - curr: Shared<'g, Entry>, - - /// The list head, needed for restarting iteration. - head: &'g Atomic, - - /// Logically, we store a borrow of an instance of `T` and - /// use the type information from `C`. - _marker: PhantomData<(&'g T, C)>, -} - -/// An error that occurs during iteration over the list. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub enum IterError { - /// A concurrent thread modified the state of the list at the same place that this iterator - /// was inspecting. Subsequent iteration will restart from the beginning of the list. - Stalled, -} - -impl Default for Entry { - /// Returns the empty entry. - fn default() -> Self { - Self { next: Atomic::null() } - } -} - -impl Entry { - /// Marks this entry as deleted, deferring the actual deallocation to a later iteration. - /// - /// # Safety - /// - /// The entry should be a member of a linked list, and it should not have been deleted. - /// It should be safe to call `C::finalize` on the entry after the `guard` is dropped, where `C` - /// is the associated helper for the linked list. - pub unsafe fn delete(&self, guard: &Guard) { - self.next.fetch_or(1, Release, guard); - } -} - -impl> List { - /// Returns a new, empty linked list. - pub fn new() -> Self { - Self { head: Atomic::null(), _marker: PhantomData } - } - - /// Inserts `entry` into the head of the list. - /// - /// # Safety - /// - /// You should guarantee that: - /// - /// - `container` is not null - /// - `container` is immovable, e.g. 
inside an `Owned` - /// - the same `Entry` is not inserted more than once - /// - the inserted object will be removed before the list is dropped - pub(crate) unsafe fn insert<'g>( - &'g self, - container: Shared<'g, T>, - guard: &'g Guard, - ) { - // Insert right after head, i.e. at the beginning of the list. - let to = &self.head; - // Get the intrusively stored Entry of the new element to insert. - let entry: &Entry = C::entry_of(container.deref()); - // Make a Shared ptr to that Entry. - #[allow(trivial_casts)] - let entry_ptr = Shared::from(entry as *const _); - // Read the current successor of where we want to insert. - let mut next = to.load(Relaxed, guard); - - loop { - // Set the Entry of the to-be-inserted element to point to the previous successor of - // `to`. - entry.next.store(next, Relaxed); - match to.compare_and_set_weak(next, entry_ptr, Release, guard) { - Ok(_) => break, - // We lost the race or weak CAS failed spuriously. Update the successor and try - // again. - Err(err) => next = err.current, - } - } - } - - /// Returns an iterator over all objects. - /// - /// # Caveat - /// - /// Every object that is inserted at the moment this function is called and persists at least - /// until the end of iteration will be returned. Since this iterator traverses a lock-free - /// linked list that may be concurrently modified, some additional caveats apply: - /// - /// 1. If a new object is inserted during iteration, it may or may not be returned. - /// 2. If an object is deleted during iteration, it may or may not be returned. - /// 3. The iteration may be aborted when it lost in a race condition. In this case, the winning - /// thread will continue to iterate over the same list. - pub fn iter<'g>(&'g self, guard: &'g Guard) -> Iter<'g, T, C> { - Iter { - guard, - pred: &self.head, - curr: self.head.load(Acquire, guard), - head: &self.head, - _marker: PhantomData, - } - } -} - -impl> Drop for List { - fn drop(&mut self) { - unsafe { - let guard = pin(); - let mut curr = self.head.load(Relaxed, &guard); - while let Some(c) = curr.as_ref() { - let succ = c.next.load(Relaxed, &guard); - // Verify that all elements have been removed from the list. - assert_eq!(succ.tag(), 1); - - C::finalize(curr.deref(), &guard); - curr = succ; - } - } - } -} - -impl<'g, T: 'g, C: IsElement> Iterator for Iter<'g, T, C> { - type Item = Result<&'g T, IterError>; - - fn next(&mut self) -> Option { - while let Some(c) = unsafe { self.curr.as_ref() } { - let succ = c.next.load(Acquire, self.guard); - - if succ.tag() == 1 { - // This entry was removed. Try unlinking it from the list. - let succ = succ.with_tag(0); - - // The tag should always be zero, because removing a node after a logically deleted - // node leaves the list in an invalid state. - debug_assert!(self.curr.tag() == 0); - - // Try to unlink `curr` from the list, and get the new value of `self.pred`. - let succ = match self - .pred - .compare_and_set(self.curr, succ, Acquire, self.guard) - { - Ok(_) => { - // We succeeded in unlinking `curr`, so we have to schedule - // deallocation. Deferred drop is okay, because `list.delete()` can only be - // called if `T: 'static`. - unsafe { - C::finalize(self.curr.deref(), self.guard); - } - - // `succ` is the new value of `self.pred`. - succ - } - Err(e) => { - // `e.current` is the current value of `self.pred`. - e.current - } - }; - - // If the predecessor node is already marked as deleted, we need to restart from - // `head`. 
- if succ.tag() != 0 { - self.pred = self.head; - self.curr = self.head.load(Acquire, self.guard); - - return Some(Err(IterError::Stalled)); - } - - // Move over the removed by only advancing `curr`, not `pred`. - self.curr = succ; - continue; - } - - // Move one step forward. - self.pred = &c.next; - self.curr = succ; - - return Some(Ok(unsafe { C::element_of(c) })); - } - - // We reached the end of the list. - None - } -} - -#[cfg(test)] -mod tests { - #![allow(trivial_casts)] - use super::*; - use crate::ebr::collector::Collector; - use crate::ebr::Owned; - - impl IsElement for Entry { - fn entry_of(entry: &Entry) -> &Entry { - entry - } - - unsafe fn element_of(entry: &Entry) -> &Entry { - entry - } - - unsafe fn finalize(entry: &Entry, guard: &Guard) { - guard.defer_destroy(Shared::from( - Self::element_of(entry) as *const _ - )); - } - } - - /// Checks whether the list retains inserted elements - /// and returns them in the correct order. - #[test] - fn insert() { - let collector = Collector::new(); - let handle = collector.register(); - let guard = handle.pin(); - - let l: List = List::new(); - - let e1 = Owned::new(Entry::default()).into_shared(&guard); - let e2 = Owned::new(Entry::default()).into_shared(&guard); - let e3 = Owned::new(Entry::default()).into_shared(&guard); - - unsafe { - l.insert(e1, &guard); - l.insert(e2, &guard); - l.insert(e3, &guard); - } - - let mut iter = l.iter(&guard); - let maybe_e3 = iter.next(); - assert!(maybe_e3.is_some()); - assert!(maybe_e3.unwrap().unwrap() as *const Entry == e3.as_raw()); - let maybe_e2 = iter.next(); - assert!(maybe_e2.is_some()); - assert!(maybe_e2.unwrap().unwrap() as *const Entry == e2.as_raw()); - let maybe_e1 = iter.next(); - assert!(maybe_e1.is_some()); - assert!(maybe_e1.unwrap().unwrap() as *const Entry == e1.as_raw()); - assert!(iter.next().is_none()); - - unsafe { - e1.as_ref().unwrap().delete(&guard); - e2.as_ref().unwrap().delete(&guard); - e3.as_ref().unwrap().delete(&guard); - } - } - - /// Checks whether elements can be removed from the list and whether - /// the correct elements are removed. - #[test] - fn delete() { - let collector = Collector::new(); - let handle = collector.register(); - let guard = handle.pin(); - - let l: List = List::new(); - - let e1 = Owned::new(Entry::default()).into_shared(&guard); - let e2 = Owned::new(Entry::default()).into_shared(&guard); - let e3 = Owned::new(Entry::default()).into_shared(&guard); - unsafe { - l.insert(e1, &guard); - l.insert(e2, &guard); - l.insert(e3, &guard); - e2.as_ref().unwrap().delete(&guard); - } - - let mut iter = l.iter(&guard); - let maybe_e3 = iter.next(); - assert!(maybe_e3.is_some()); - assert!(maybe_e3.unwrap().unwrap() as *const Entry == e3.as_raw()); - let maybe_e1 = iter.next(); - assert!(maybe_e1.is_some()); - assert!(maybe_e1.unwrap().unwrap() as *const Entry == e1.as_raw()); - assert!(iter.next().is_none()); - - unsafe { - e1.as_ref().unwrap().delete(&guard); - e3.as_ref().unwrap().delete(&guard); - } - - let mut iter = l.iter(&guard); - assert!(iter.next().is_none()); - } -} diff --git a/src/ebr/mod.rs b/src/ebr/mod.rs deleted file mode 100644 index 5d6b19142..000000000 --- a/src/ebr/mod.rs +++ /dev/null @@ -1,91 +0,0 @@ -#![allow(unsafe_code)] -#![allow(clippy::match_like_matches_macro)] - -/// This module started its life as crossbeam-epoch, -/// and was imported into the sled codebase to perform -/// a number of use-case specific runtime tests and -/// dynamic collection for when garbage spikes happen. 
-/// Unused functionality was stripped which further -/// improved compile times. -mod atomic; -mod collector; -mod deferred; -mod epoch; -mod internal; -mod list; -mod queue; - -pub(crate) use self::{ - atomic::{Atomic, Owned, Shared}, - collector::pin, -}; - -use deferred::Deferred; -use internal::Local; - -pub struct Guard { - pub(super) local: *const Local, - #[cfg(feature = "testing")] - pub(super) began: std::time::Instant, -} - -impl Guard { - pub(crate) fn defer(&self, f: F) - where - F: FnOnce() -> R + Send + 'static, - { - unsafe { - self.defer_unchecked(f); - } - } - - pub(super) unsafe fn defer_unchecked(&self, f: F) - where - F: FnOnce() -> R, - { - if let Some(local) = self.local.as_ref() { - local.defer(Deferred::new(move || drop(f())), self); - } else { - drop(f()); - } - } - - pub(crate) unsafe fn defer_destroy(&self, ptr: Shared<'_, T>) { - self.defer_unchecked(move || ptr.into_owned()); - } - - pub fn flush(&self) { - if let Some(local) = unsafe { self.local.as_ref() } { - local.flush(self); - } - } -} - -impl Drop for Guard { - #[inline] - fn drop(&mut self) { - if let Some(local) = unsafe { self.local.as_ref() } { - local.unpin(); - } - - #[cfg(feature = "testing")] - { - if self.began.elapsed() > std::time::Duration::from_secs(1) { - log::warn!("guard lived longer than allowed"); - } - } - } -} - -#[inline] -unsafe fn unprotected() -> &'static Guard { - // HACK(stjepang): An unprotected guard is just a `Guard` with its field `local` set to null. - // Since this function returns a `'static` reference to a `Guard`, we must return a reference - // to a global guard. However, it's not possible to create a `static` `Guard` because it does - // not implement `Sync`. To get around the problem, we create a static `usize` initialized to - // zero and then transmute it into a `Guard`. This is safe because `usize` and `Guard` - // (consisting of a single pointer) have the same representation in memory. - static UNPROTECTED: &[usize] = &[0_usize; std::mem::size_of::()]; - #[allow(trivial_casts)] - &*(UNPROTECTED as *const _ as *const Guard) -} diff --git a/src/ebr/queue.rs b/src/ebr/queue.rs deleted file mode 100644 index 5c34b477e..000000000 --- a/src/ebr/queue.rs +++ /dev/null @@ -1,352 +0,0 @@ -//! Michael-Scott lock-free queue. -//! -//! Usable with any number of producers and consumers. -//! -//! Michael and Scott. Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue -//! Algorithms. PODC 1996. `http://dl.acm.org/citation.cfm?id=248106` -//! -//! Simon Doherty, Lindsay Groves, Victor Luchangco, and Mark Moir. 2004b. Formal Verification of a -//! Practical Lock-Free Queue Algorithm. `https://doi.org/10.1007/978-3-540-30232-2_7` - -use core::mem::MaybeUninit; -use core::sync::atomic::{ - AtomicUsize, - Ordering::{Acquire, Relaxed, Release}, -}; - -use crate::CachePadded; - -use super::{pin, unprotected, Atomic, Guard, Owned, Shared}; - -pub(in crate::ebr) static SIZE_HINT: AtomicUsize = AtomicUsize::new(0); - -// The representation here is a singly-linked list, with a sentinel node at the front. In general -// the `tail` pointer may lag behind the actual tail. Non-sentinel nodes are either all `Data` or -// all `Blocked` (requests for data from blocked threads). -#[derive(Debug)] -pub(in crate::ebr) struct Queue { - head: CachePadded>>, - tail: CachePadded>>, -} - -struct Node { - /// The slot in which a value of type `T` can be stored. - /// - /// The type of `data` is `MaybeUninit` because a `Node` doesn't always contain a `T`. 
- /// For example, the sentinel node in a queue never contains a value: its slot is always empty. - /// Other nodes start their life with a push operation and contain a value until it gets popped - /// out. After that such empty nodes get added to the collector for destruction. - data: MaybeUninit, - - next: Atomic>, -} - -// Any particular `T` should never be accessed concurrently, so no need for `Sync`. -unsafe impl Sync for Queue {} -unsafe impl Send for Queue {} - -impl Queue { - /// Create a new, empty queue. - pub(in crate::ebr) fn new() -> Queue { - let q = Queue { - head: CachePadded::new(Atomic::null()), - tail: CachePadded::new(Atomic::null()), - }; - let sentinel = Owned::new(Node { - data: MaybeUninit::uninit(), - next: Atomic::null(), - }); - unsafe { - let sentinel = sentinel.into_shared(unprotected()); - q.head.store(sentinel, Relaxed); - q.tail.store(sentinel, Relaxed); - } - q - } - - /// Attempts to atomically place `n` into the `next` pointer of `onto`, and returns `true` on - /// success. The queue's `tail` pointer may be updated. - #[inline] - fn push_internal( - &self, - onto: Shared<'_, Node>, - new: Shared<'_, Node>, - guard: &Guard, - ) -> bool { - // is `onto` the actual tail? - let o = unsafe { onto.deref() }; - let next = o.next.load(Acquire, guard); - if unsafe { next.as_ref().is_some() } { - // if not, try to "help" by moving the tail pointer forward - let _ = self.tail.compare_and_set(onto, next, Release, guard); - false - } else { - // looks like the actual tail; attempt to link in `n` - let result = o - .next - .compare_and_set(Shared::null(), new, Release, guard) - .is_ok(); - if result { - // try to move the tail pointer forward - let _ = self.tail.compare_and_set(onto, new, Release, guard); - } - result - } - } - - /// Adds `t` to the back of the queue, possibly waking up threads blocked on `pop`. - pub(in crate::ebr) fn push(&self, t: T, guard: &Guard) { - let new = Owned::new(Node { - data: MaybeUninit::new(t), - next: Atomic::null(), - }); - let new = Owned::into_shared(new, guard); - - loop { - // We push onto the tail, so we'll start optimistically by looking there first. - let tail = self.tail.load(Acquire, guard); - - // Attempt to push onto the `tail` snapshot; fails if `tail.next` has changed. - if self.push_internal(tail, new, guard) { - break; - } - } - SIZE_HINT.fetch_add(1, Relaxed); - } - - /// Attempts to pop a data node. `Ok(None)` if queue is empty; `Err(())` if lost race to pop. - #[inline] - fn pop_internal(&self, guard: &Guard) -> Result, ()> { - let head = self.head.load(Acquire, guard); - let h = unsafe { head.deref() }; - let next = h.next.load(Acquire, guard); - match unsafe { next.as_ref() } { - Some(n) => unsafe { - self.head - .compare_and_set(head, next, Release, guard) - .map(|_| { - let tail = self.tail.load(Relaxed, guard); - // Advance the tail so that we don't retire a pointer to a reachable node. - if head == tail { - let _ = self - .tail - .compare_and_set(tail, next, Release, guard); - } - guard.defer_destroy(head); - // TODO: Replace with MaybeUninit::read when api is stable - Some(n.data.as_ptr().read()) - }) - .map_err(|_| ()) - }, - None => Ok(None), - } - } - - /// Attempts to pop a data node, if the data satisfies the given condition. `Ok(None)` if queue - /// is empty or the data does not satisfy the condition; `Err(())` if lost race to pop. 
- #[inline] - fn pop_if_internal( - &self, - condition: F, - guard: &Guard, - ) -> Result, ()> - where - T: Sync, - F: Fn(&T) -> bool, - { - let head = self.head.load(Acquire, guard); - let h = unsafe { head.deref() }; - let next = h.next.load(Acquire, guard); - match unsafe { next.as_ref() } { - Some(n) if condition(unsafe { &*n.data.as_ptr() }) => unsafe { - self.head - .compare_and_set(head, next, Release, guard) - .map(|_| { - let tail = self.tail.load(Relaxed, guard); - // Advance the tail so that we don't retire a pointer to a reachable node. - if head == tail { - let _ = self - .tail - .compare_and_set(tail, next, Release, guard); - } - guard.defer_destroy(head); - Some(n.data.as_ptr().read()) - }) - .map_err(|_| ()) - }, - None | Some(_) => Ok(None), - } - } - - /// Attempts to dequeue from the front. - /// - /// Returns `None` if the queue is observed to be empty. - fn try_pop(&self, guard: &Guard) -> Option { - loop { - if let Ok(head) = self.pop_internal(guard) { - if head.is_some() { - SIZE_HINT.fetch_sub(1, Relaxed); - } - return head; - } - } - } - - /// Attempts to dequeue from the front, if the item satisfies the given condition. - /// - /// Returns `None` if the queue is observed to be empty, or the head does not satisfy the given - /// condition. - pub(in crate::ebr) fn try_pop_if( - &self, - condition: F, - guard: &Guard, - ) -> Option - where - T: Sync, - F: Fn(&T) -> bool, - { - loop { - if let Ok(head) = self.pop_if_internal(&condition, guard) { - if head.is_some() { - SIZE_HINT.fetch_sub(1, Relaxed); - } - return head; - } - } - } -} - -impl Drop for Queue { - fn drop(&mut self) { - unsafe { - let guard = pin(); - - while self.try_pop(&guard).is_some() {} - - // Destroy the remaining sentinel node. - let sentinel = self.head.load(Relaxed, &guard); - drop(sentinel.into_owned()); - } - } -} - -#[cfg(test)] -mod test { - use super::*; - use crate::pin; - - struct Queue { - queue: super::Queue, - } - - impl Queue { - fn new() -> Queue { - Queue { queue: super::Queue::new() } - } - - fn push(&self, t: T) { - let guard = &pin(); - self.queue.push(t, guard); - } - - fn is_empty(&self) -> bool { - let guard = &pin(); - let head = self.queue.head.load(Acquire, guard); - let h = unsafe { head.deref() }; - h.next.load(Acquire, guard).is_null() - } - - fn try_pop(&self) -> Option { - let guard = &pin(); - self.queue.try_pop(guard) - } - - fn pop(&self) -> T { - loop { - match self.try_pop() { - None => continue, - Some(t) => return t, - } - } - } - } - - #[test] - fn push_try_pop_1() { - let q: Queue = Queue::new(); - assert!(q.is_empty()); - q.push(37); - assert!(!q.is_empty()); - assert_eq!(q.try_pop(), Some(37)); - assert!(q.is_empty()); - } - - #[test] - fn push_try_pop_2() { - let q: Queue = Queue::new(); - assert!(q.is_empty()); - q.push(37); - q.push(48); - assert_eq!(q.try_pop(), Some(37)); - assert!(!q.is_empty()); - assert_eq!(q.try_pop(), Some(48)); - assert!(q.is_empty()); - } - - #[test] - fn push_try_pop_many_seq() { - let q: Queue = Queue::new(); - assert!(q.is_empty()); - for i in 0..200 { - q.push(i) - } - assert!(!q.is_empty()); - for i in 0..200 { - assert_eq!(q.try_pop(), Some(i)); - } - assert!(q.is_empty()); - } - - #[test] - fn push_pop_1() { - let q: Queue = Queue::new(); - assert!(q.is_empty()); - q.push(37); - assert!(!q.is_empty()); - assert_eq!(q.pop(), 37); - assert!(q.is_empty()); - } - - #[test] - fn push_pop_2() { - let q: Queue = Queue::new(); - q.push(37); - q.push(48); - assert_eq!(q.pop(), 37); - assert_eq!(q.pop(), 48); - } - - #[test] - fn 
push_pop_many_seq() { - let q: Queue = Queue::new(); - assert!(q.is_empty()); - for i in 0..200 { - q.push(i) - } - assert!(!q.is_empty()); - for i in 0..200 { - assert_eq!(q.pop(), i); - } - assert!(q.is_empty()); - } - - #[test] - fn is_empty_dont_pop() { - let q: Queue = Queue::new(); - q.push(20); - q.push(20); - assert!(!q.is_empty()); - assert!(!q.is_empty()); - assert!(q.try_pop().is_some()); - } -} diff --git a/src/event_log.rs b/src/event_log.rs deleted file mode 100644 index bd9a6ea0a..000000000 --- a/src/event_log.rs +++ /dev/null @@ -1,180 +0,0 @@ -//! The `EventLog` lets us cheaply record and query behavior -//! in a concurrent system. It lets us reconstruct stories about -//! what happened to our data. It lets us write tests like: -//! 1. no keys are lost through tree structural modifications -//! 2. no nodes are made inaccessible through structural modifications -//! 3. no segments are zeroed and reused before all resident -//! pages have been relocated and stabilized. -//! 4. recovery does not skip active segments -//! 5. no page is double-allocated or double-freed -//! 6. pages before restart match pages after restart -//! -//! What does it mean for data to be accessible? -//! 1. key -> page -//! 2. page -> lid -//! 3. lid ranges get stabiized over time -//! 4. lid ranges get zeroed over time -//! 5. segment trailers get written over time -//! 6. if a page's old location is zeroed before -//! `io_bufs` segment trailers have been written, -//! we are vulnerable to data loss -//! 3. segments have lifespans from fsync to zero -//! 4. -#![allow(missing_docs)] - -use crate::pagecache::DiskPtr; -use crate::*; - -use crate::stack::{Iter as StackIter, Stack}; - -/// A thing that happens at a certain time. -#[derive(Debug, Clone)] -enum Event { - PagesOnShutdown { pages: Map> }, - PagesOnRecovery { pages: Map> }, - MetaOnShutdown { meta: Meta }, - MetaOnRecovery { meta: Meta }, - RecoveredLsn(Lsn), - Stabilized(Lsn), -} - -/// A lock-free queue of Events. -#[derive(Default, Debug)] -pub struct EventLog { - inner: Stack, -} - -impl EventLog { - pub(crate) fn reset(&self) { - self.verify(); - let guard = pin(); - while self.inner.pop(&guard).is_some() {} - } - - fn iter<'a>(&self, guard: &'a Guard) -> StackIter<'a, Event> { - let head = self.inner.head(guard); - StackIter::from_ptr(head, guard) - } - - pub(crate) fn verify(&self) { - let guard = pin(); - let iter = self.iter(&guard); - - // if we encounter a `PagesOnRecovery`, then we should - // compare it to any subsequent `PagesOnShutdown` - - let mut recovered_pages = None; - let mut recovered_meta = None; - let mut minimum_lsn = None; - - for event in iter { - match event { - Event::Stabilized(lsn) | Event::RecoveredLsn(lsn) => { - if let Some(later_lsn) = minimum_lsn { - assert!( - later_lsn >= lsn, - "lsn must never go down between recoveries \ - or stabilizations. It was {} but later became {}. 
history: {:?}", - lsn, - later_lsn, - self.iter(&guard) - .filter(|e| matches!(e, Event::Stabilized(_)) - || matches!(e, Event::RecoveredLsn(_))) - .collect::>(), - ); - } - minimum_lsn = Some(lsn); - } - Event::PagesOnRecovery { pages } => { - recovered_pages = Some(pages.clone()); - } - Event::PagesOnShutdown { pages } => { - if let Some(ref par) = recovered_pages { - let pids = par - .iter() - .map(|(pid, _frag_locations)| *pid) - .chain( - pages.iter().map(|(pid, _frag_locations)| *pid), - ) - .collect::>() - .into_iter(); - - for pid in pids { - // we filter out the blob pointer in the log - // because it is expected that upon recovery, - // any blob pointers will be forgotten from - // the log now that they are present in the - // snapshot. - let locations_before_restart: Vec<_> = pages - .get(&pid) - .unwrap() - .iter() - .map(|ptr_ref| { - let mut ptr = *ptr_ref; - ptr.forget_heap_log_coordinates(); - ptr - }) - .collect(); - let locations_after_restart: Vec<_> = par - .get(&pid) - .unwrap_or_else(|| panic!("pid {} no longer present after restart", pid)) - .to_vec(); - assert_eq!( - locations_before_restart, - locations_after_restart, - "page {} had frag locations {:?} before \ - restart, but {:?} after restart", - pid, - locations_before_restart, - locations_after_restart - ); - } - } - } - Event::MetaOnRecovery { meta } => { - recovered_meta = Some(meta); - } - Event::MetaOnShutdown { meta } => { - if let Some(rec_meta) = recovered_meta { - assert_eq!(meta, rec_meta); - } - } - } - } - - debug!("event log verified \u{2713}"); - } - - pub(crate) fn stabilized_lsn(&self, lsn: Lsn) { - let guard = pin(); - self.inner.push(Event::Stabilized(lsn), &guard); - } - - pub(crate) fn recovered_lsn(&self, lsn: Lsn) { - let guard = pin(); - self.inner.push(Event::RecoveredLsn(lsn), &guard); - } - - pub(crate) fn pages_before_restart( - &self, - pages: Map>, - ) { - let guard = pin(); - self.inner.push(Event::PagesOnShutdown { pages }, &guard); - } - - pub(crate) fn pages_after_restart(&self, pages: Map>) { - let guard = pin(); - self.inner.push(Event::PagesOnRecovery { pages }, &guard); - } - - pub fn meta_before_restart(&self, meta: Meta) { - let guard = pin(); - self.inner.push(Event::MetaOnShutdown { meta }, &guard); - } - - pub fn meta_after_restart(&self, meta: Meta) { - let guard = pin(); - self.inner.push(Event::MetaOnRecovery { meta }, &guard); - } -} diff --git a/src/event_verifier.rs b/src/event_verifier.rs new file mode 100644 index 000000000..92282bea7 --- /dev/null +++ b/src/event_verifier.rs @@ -0,0 +1,157 @@ +use std::collections::BTreeMap; +use std::sync::Mutex; + +use crate::{FlushEpoch, ObjectId}; + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub(crate) enum State { + Unallocated, + Dirty, + CooperativelySerialized, + AddedToWriteBatch, + Flushed, + CleanPagedIn, + PagedOut, +} + +impl State { + fn can_transition_within_epoch_to(&self, next: State) -> bool { + match (self, next) { + (State::Flushed, State::PagedOut) => true, + (State::Flushed, _) => false, + (State::AddedToWriteBatch, State::Flushed) => true, + (State::AddedToWriteBatch, _) => false, + (State::CleanPagedIn, State::AddedToWriteBatch) => false, + (State::CleanPagedIn, State::Flushed) => false, + (State::Dirty, State::AddedToWriteBatch) => true, + (State::CooperativelySerialized, State::AddedToWriteBatch) => true, + (State::CooperativelySerialized, _) => false, + (State::Unallocated, State::AddedToWriteBatch) => true, + (State::Unallocated, _) => false, + (State::Dirty, State::Dirty) => true, + 
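+ // Within a single flush epoch, a Dirty object may be re-dirtied, have its
+ // serialization performed cooperatively by another thread, be added to a
+ // write batch, or be deallocated; any other transition out of Dirty is
+ // treated as a bug by the verifier.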
(State::Dirty, State::CooperativelySerialized) => true, + (State::Dirty, State::Unallocated) => true, + (State::Dirty, _) => false, + (State::CleanPagedIn, State::Dirty) => true, + (State::CleanPagedIn, State::PagedOut) => true, + (State::CleanPagedIn, State::CleanPagedIn) => true, + (State::CleanPagedIn, State::Unallocated) => true, + (State::CleanPagedIn, State::CooperativelySerialized) => true, + (State::PagedOut, State::CleanPagedIn) => true, + (State::PagedOut, _) => false, + } + } + + fn needs_flush(&self) -> bool { + match self { + State::CleanPagedIn => false, + State::Flushed => false, + State::PagedOut => false, + _ => true, + } + } +} + +#[derive(Debug, Default)] +pub(crate) struct EventVerifier { + flush_model: + Mutex>>, +} + +impl Drop for EventVerifier { + fn drop(&mut self) { + // assert that nothing is currently Dirty + let flush_model = self.flush_model.lock().unwrap(); + for ((oid, _epoch), history) in flush_model.iter() { + if let Some((last_state, _at)) = history.last() { + assert_ne!( + *last_state, + State::Dirty, + "{oid:?} is Dirty when system shutting down" + ); + } + } + } +} + +impl EventVerifier { + pub(crate) fn mark( + &self, + object_id: ObjectId, + epoch: FlushEpoch, + state: State, + at: &'static str, + ) { + if matches!(state, State::PagedOut) { + let dirty_epochs = self.dirty_epochs(object_id); + if !dirty_epochs.is_empty() { + println!("{object_id:?} was paged out while having dirty epochs {dirty_epochs:?}"); + self.print_debug_history_for_object(object_id); + println!("{state:?} {epoch:?} {at}"); + println!("invalid object state transition"); + std::process::abort(); + } + } + + let mut flush_model = self.flush_model.lock().unwrap(); + let history = flush_model.entry((object_id, epoch)).or_default(); + + if let Some((last_state, _at)) = history.last() { + if !last_state.can_transition_within_epoch_to(state) { + println!( + "object_id {object_id:?} performed \ + illegal state transition from {last_state:?} \ + to {state:?} at {at} in epoch {epoch:?}." + ); + + println!("history:"); + history.push((state, at)); + + let active_epochs = flush_model.range( + (object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX), + ); + for ((_oid, epoch), history) in active_epochs { + for (last_state, at) in history { + println!("{last_state:?} {epoch:?} {at}"); + } + } + + println!("invalid object state transition"); + + std::process::abort(); + } + } + history.push((state, at)); + } + + /// Returns the FlushEpochs for which this ObjectId has unflushed + /// dirty data for. 
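+ ///
+ /// An epoch counts as dirty here when the most recent state recorded for
+ /// `(object_id, epoch)` still reports `needs_flush()`.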
+ fn dirty_epochs(&self, object_id: ObjectId) -> Vec { + let mut dirty_epochs = vec![]; + let flush_model = self.flush_model.lock().unwrap(); + + let active_epochs = flush_model + .range((object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX)); + + for ((_oid, epoch), history) in active_epochs { + let (last_state, _at) = history.last().unwrap(); + if last_state.needs_flush() { + dirty_epochs.push(*epoch); + } + } + + dirty_epochs + } + + pub(crate) fn print_debug_history_for_object(&self, object_id: ObjectId) { + let flush_model = self.flush_model.lock().unwrap(); + println!("history for object {:?}:", object_id); + let active_epochs = flush_model + .range((object_id, FlushEpoch::MIN)..=(object_id, FlushEpoch::MAX)); + for ((_oid, epoch), history) in active_epochs { + for (last_state, at) in history { + println!("{last_state:?} {epoch:?} {at}"); + } + } + } +} diff --git a/src/fail.rs b/src/fail.rs deleted file mode 100644 index 6a415707a..000000000 --- a/src/fail.rs +++ /dev/null @@ -1,55 +0,0 @@ -use parking_lot::Mutex; - -use crate::{Lazy, Map}; - -type Hm = Map<&'static str, u64>; - -static ACTIVE: Lazy, fn() -> Mutex> = Lazy::new(init); - -fn init() -> Mutex { - Mutex::new(Hm::default()) -} - -/// Returns `true` if the given failpoint is active. -pub fn is_active(name: &'static str) -> bool { - let mut active = ACTIVE.lock(); - if let Some(bitset) = active.get_mut(&name) { - let bit = *bitset & 1; - *bitset >>= 1; - if *bitset == 0 { - active.remove(&name); - } - let ret = bit != 0; - - if ret { - log::error!( - "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! FailPoint {} triggered", - name - ); - } - - ret - } else { - false - } -} - -/// Enable a particular failpoint -pub fn set(name: &'static str, bitset: u64) { - ACTIVE.lock().insert(name, bitset); -} - -/// Clear all active failpoints. -pub fn reset() { - ACTIVE.lock().clear(); -} - -/// Temporarily pause all fault injection -pub fn pause_faults() -> Hm { - std::mem::take(&mut ACTIVE.lock()) -} - -/// Restore fault injection -pub fn restore_faults(hm: Hm) { - *ACTIVE.lock() = hm; -} diff --git a/src/fastcmp.rs b/src/fastcmp.rs deleted file mode 100644 index 1a48042af..000000000 --- a/src/fastcmp.rs +++ /dev/null @@ -1,56 +0,0 @@ -use std::cmp::Ordering; - -#[cfg(any(unix, windows))] -#[allow(unsafe_code)] -pub(crate) fn fastcmp(l: &[u8], r: &[u8]) -> Ordering { - let len = std::cmp::min(l.len(), r.len()); - let cmp = unsafe { libc::memcmp(l.as_ptr() as _, r.as_ptr() as _, len) }; - match cmp { - a if a > 0 => Ordering::Greater, - a if a < 0 => Ordering::Less, - _ => l.len().cmp(&r.len()), - } -} - -#[cfg(not(any(unix, windows)))] -#[allow(unsafe_code)] -pub(crate) fn fastcmp(l: &[u8], r: &[u8]) -> Ordering { - l.cmp(r) -} - -#[cfg(test)] -mod qc { - use super::fastcmp; - - fn prop_cmp_matches(l: &[u8], r: &[u8]) -> bool { - assert_eq!(fastcmp(l, r), l.cmp(r)); - assert_eq!(fastcmp(r, l), r.cmp(l)); - assert_eq!(fastcmp(l, l), l.cmp(l)); - assert_eq!(fastcmp(r, r), r.cmp(r)); - true - } - - #[test] - fn basic_functionality() { - let cases: [&[u8]; 8] = [ - &[], - &[0], - &[1], - &[1], - &[255], - &[1, 2, 3], - &[1, 2, 3, 0], - &[1, 2, 3, 55], - ]; - for pair in cases.windows(2) { - prop_cmp_matches(pair[0], pair[1]); - } - } - - quickcheck::quickcheck! 
{ - #[cfg_attr(miri, ignore)] - fn qc_fastcmp(l: Vec, r: Vec) -> bool { - prop_cmp_matches(&l, &r) - } - } -} diff --git a/src/fastlock.rs b/src/fastlock.rs deleted file mode 100644 index 61d96d6a8..000000000 --- a/src/fastlock.rs +++ /dev/null @@ -1,69 +0,0 @@ -use std::{ - cell::UnsafeCell, - ops::{Deref, DerefMut}, - sync::atomic::{ - AtomicBool, - Ordering::{AcqRel, Acquire, Release}, - }, -}; - -pub struct FastLockGuard<'a, T> { - mu: &'a FastLock, -} - -impl<'a, T> Drop for FastLockGuard<'a, T> { - fn drop(&mut self) { - assert!(self.mu.lock.swap(false, Release)); - } -} - -impl<'a, T> Deref for FastLockGuard<'a, T> { - type Target = T; - - fn deref(&self) -> &T { - #[allow(unsafe_code)] - unsafe { - &*self.mu.inner.get() - } - } -} - -impl<'a, T> DerefMut for FastLockGuard<'a, T> { - fn deref_mut(&mut self) -> &mut T { - #[allow(unsafe_code)] - unsafe { - &mut *self.mu.inner.get() - } - } -} - -#[repr(C)] -pub struct FastLock { - inner: UnsafeCell, - lock: AtomicBool, -} - -#[allow(unsafe_code)] -unsafe impl Sync for FastLock {} - -#[allow(unsafe_code)] -unsafe impl Send for FastLock {} - -impl FastLock { - pub const fn new(inner: T) -> FastLock { - FastLock { lock: AtomicBool::new(false), inner: UnsafeCell::new(inner) } - } - - pub fn try_lock(&self) -> Option> { - let lock_result = - self.lock.compare_exchange_weak(false, true, AcqRel, Acquire); - - let success = lock_result.is_ok(); - - if success { - Some(FastLockGuard { mu: self }) - } else { - None - } - } -} diff --git a/src/flush_epoch.rs b/src/flush_epoch.rs new file mode 100644 index 000000000..cb763c920 --- /dev/null +++ b/src/flush_epoch.rs @@ -0,0 +1,384 @@ +use std::num::NonZeroU64; +use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering}; +use std::sync::{Arc, Condvar, Mutex}; + +const SEAL_BIT: u64 = 1 << 63; +const SEAL_MASK: u64 = u64::MAX - SEAL_BIT; +const MIN_EPOCH: u64 = 2; + +#[derive( + Debug, + Clone, + Copy, + serde::Serialize, + serde::Deserialize, + PartialOrd, + Ord, + PartialEq, + Eq, + Hash, +)] +pub struct FlushEpoch(NonZeroU64); + +impl FlushEpoch { + pub const MIN: FlushEpoch = FlushEpoch(NonZeroU64::MIN); + #[allow(unused)] + pub const MAX: FlushEpoch = FlushEpoch(NonZeroU64::MAX); + + pub fn increment(&self) -> FlushEpoch { + FlushEpoch(NonZeroU64::new(self.0.get() + 1).unwrap()) + } + + pub fn get(&self) -> u64 { + self.0.get() + } +} + +impl concurrent_map::Minimum for FlushEpoch { + const MIN: FlushEpoch = FlushEpoch::MIN; +} + +#[derive(Debug)] +pub(crate) struct FlushInvariants { + max_flushed_epoch: AtomicU64, + max_flushing_epoch: AtomicU64, +} + +impl Default for FlushInvariants { + fn default() -> FlushInvariants { + FlushInvariants { + max_flushed_epoch: (MIN_EPOCH - 1).into(), + max_flushing_epoch: (MIN_EPOCH - 1).into(), + } + } +} + +impl FlushInvariants { + pub(crate) fn mark_flushed_epoch(&self, epoch: FlushEpoch) { + let last = self.max_flushed_epoch.swap(epoch.get(), Ordering::SeqCst); + + assert_eq!(last + 1, epoch.get()); + } + + pub(crate) fn mark_flushing_epoch(&self, epoch: FlushEpoch) { + let last = self.max_flushing_epoch.swap(epoch.get(), Ordering::SeqCst); + + assert_eq!(last + 1, epoch.get()); + } +} + +#[derive(Clone, Debug)] +pub(crate) struct Completion { + mu: Arc>, + cv: Arc, + epoch: FlushEpoch, +} + +impl Completion { + pub fn epoch(&self) -> FlushEpoch { + self.epoch + } + + pub fn new(epoch: FlushEpoch) -> Completion { + Completion { mu: Default::default(), cv: Default::default(), epoch } + } + + pub fn wait_for_complete(self) -> FlushEpoch { + let mut mu = 
self.mu.lock().unwrap(); + while !*mu { + mu = self.cv.wait(mu).unwrap(); + } + + self.epoch + } + + pub fn mark_complete(self) { + self.mark_complete_inner(false); + } + + fn mark_complete_inner(&self, previously_sealed: bool) { + let mut mu = self.mu.lock().unwrap(); + if !previously_sealed { + // TODO reevaluate - assert!(!*mu); + } + log::trace!("marking epoch {:?} as complete", self.epoch); + // it's possible for *mu to already be true due to this being + // immediately dropped in the check_in method when we see that + // the checked-in epoch has already been marked as sealed. + *mu = true; + drop(mu); + self.cv.notify_all(); + } + + #[cfg(test)] + pub fn is_complete(&self) -> bool { + *self.mu.lock().unwrap() + } +} + +pub struct FlushEpochGuard<'a> { + tracker: &'a EpochTracker, + previously_sealed: bool, +} + +impl<'a> Drop for FlushEpochGuard<'a> { + fn drop(&mut self) { + let rc = self.tracker.rc.fetch_sub(1, Ordering::SeqCst) - 1; + if rc & SEAL_MASK == 0 && (rc & SEAL_BIT) == SEAL_BIT { + crate::debug_delay(); + self.tracker + .vacancy_notifier + .mark_complete_inner(self.previously_sealed); + } + } +} + +impl<'a> FlushEpochGuard<'a> { + pub fn epoch(&self) -> FlushEpoch { + self.tracker.epoch + } +} + +#[derive(Debug)] +pub(crate) struct EpochTracker { + epoch: FlushEpoch, + rc: AtomicU64, + vacancy_notifier: Completion, + previous_flush_complete: Completion, +} + +#[derive(Clone, Debug)] +pub(crate) struct FlushEpochTracker { + active_ebr: ebr::Ebr, 16, 16>, + inner: Arc, +} + +#[derive(Debug)] +pub(crate) struct FlushEpochInner { + counter: AtomicU64, + roll_mu: Mutex<()>, + current_active: AtomicPtr, +} + +impl Drop for FlushEpochInner { + fn drop(&mut self) { + let vacancy_mu = self.roll_mu.lock().unwrap(); + let old_ptr = + self.current_active.swap(std::ptr::null_mut(), Ordering::SeqCst); + if !old_ptr.is_null() { + //let old: &EpochTracker = &*old_ptr; + unsafe { drop(Box::from_raw(old_ptr)) } + } + drop(vacancy_mu); + } +} + +impl Default for FlushEpochTracker { + fn default() -> FlushEpochTracker { + let last = Completion::new(FlushEpoch(NonZeroU64::new(1).unwrap())); + let current_active_ptr = Box::into_raw(Box::new(EpochTracker { + epoch: FlushEpoch(NonZeroU64::new(MIN_EPOCH).unwrap()), + rc: AtomicU64::new(0), + vacancy_notifier: Completion::new(FlushEpoch( + NonZeroU64::new(MIN_EPOCH).unwrap(), + )), + previous_flush_complete: last.clone(), + })); + + last.mark_complete(); + + let current_active = AtomicPtr::new(current_active_ptr); + + FlushEpochTracker { + inner: Arc::new(FlushEpochInner { + counter: AtomicU64::new(2), + roll_mu: Mutex::new(()), + current_active, + }), + active_ebr: ebr::Ebr::default(), + } + } +} + +impl FlushEpochTracker { + /// Returns the epoch notifier for the previous epoch. + /// Intended to be passed to a flusher that can eventually + /// notify the flush-requesting thread. 
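+ ///
+ /// Illustrative sketch only, mirroring the burn-in test further down in
+ /// this file; `tracker` is a placeholder name and the snippet is not
+ /// compiled as a doctest:
+ ///
+ /// ```ignore
+ /// let tracker = FlushEpochTracker::default();
+ ///
+ /// // Writers pin the current epoch while producing dirty state.
+ /// let guard = tracker.check_in();
+ /// let epoch = guard.epoch();
+ /// drop(guard);
+ ///
+ /// // The flusher seals the epoch and waits for it to quiesce.
+ /// let (previous_flush, vacancy, this_flush) = tracker.roll_epoch_forward();
+ /// previous_flush.wait_for_complete(); // prior epoch fully flushed
+ /// vacancy.wait_for_complete();        // no guards remain in the sealed epoch
+ /// // ... write out everything that was dirty at `epoch` ...
+ /// this_flush.mark_complete();         // unblock waiters on this epoch
+ /// ```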
+ pub fn roll_epoch_forward(&self) -> (Completion, Completion, Completion) { + let mut tracker_guard = self.active_ebr.pin(); + + let vacancy_mu = self.inner.roll_mu.lock().unwrap(); + + let flush_through = self.inner.counter.fetch_add(1, Ordering::SeqCst); + + let flush_through_epoch = + FlushEpoch(NonZeroU64::new(flush_through).unwrap()); + + let new_epoch = flush_through_epoch.increment(); + + let forward_flush_notifier = Completion::new(flush_through_epoch); + + let new_active = Box::into_raw(Box::new(EpochTracker { + epoch: new_epoch, + rc: AtomicU64::new(0), + vacancy_notifier: Completion::new(new_epoch), + previous_flush_complete: forward_flush_notifier.clone(), + })); + + let old_ptr = + self.inner.current_active.swap(new_active, Ordering::SeqCst); + + assert!(!old_ptr.is_null()); + + let (last_flush_complete_notifier, vacancy_notifier) = unsafe { + let old: &EpochTracker = &*old_ptr; + let last = old.rc.fetch_add(SEAL_BIT + 1, Ordering::SeqCst); + + assert_eq!( + last & SEAL_BIT, + 0, + "epoch {} double-sealed", + flush_through + ); + + // mark_complete_inner called via drop in a uniform way + //println!("dropping flush epoch guard for epoch {flush_through}"); + drop(FlushEpochGuard { tracker: old, previously_sealed: true }); + + (old.previous_flush_complete.clone(), old.vacancy_notifier.clone()) + }; + tracker_guard.defer_drop(unsafe { Box::from_raw(old_ptr) }); + drop(vacancy_mu); + (last_flush_complete_notifier, vacancy_notifier, forward_flush_notifier) + } + + pub fn check_in<'a>(&self) -> FlushEpochGuard<'a> { + let _tracker_guard = self.active_ebr.pin(); + loop { + let tracker: &'a EpochTracker = + unsafe { &*self.inner.current_active.load(Ordering::SeqCst) }; + + let rc = tracker.rc.fetch_add(1, Ordering::SeqCst); + + let previously_sealed = rc & SEAL_BIT == SEAL_BIT; + + let guard = FlushEpochGuard { tracker, previously_sealed }; + + if previously_sealed { + // the epoch is already closed, so we must drop the rc + // and possibly notify, which is handled in the guard's + // Drop impl. 
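+ // Dropping the guard decrements the reference count and, if it was the
+ // last guard in the sealed epoch, completes the vacancy notifier; the
+ // loop then retries against the freshly installed epoch tracker.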
+ drop(guard); + } else { + return guard; + } + } + } + + pub fn manually_advance_epoch(&self) { + self.active_ebr.manually_advance_epoch(); + } + + pub fn current_flush_epoch(&self) -> FlushEpoch { + let current = self.inner.counter.load(Ordering::SeqCst); + + FlushEpoch(NonZeroU64::new(current).unwrap()) + } +} + +#[test] +fn flush_epoch_basic_functionality() { + let epoch_tracker = FlushEpochTracker::default(); + + for expected in MIN_EPOCH..1_000_000 { + let g1 = epoch_tracker.check_in(); + let g2 = epoch_tracker.check_in(); + + assert_eq!(g1.tracker.epoch.0.get(), expected); + assert_eq!(g2.tracker.epoch.0.get(), expected); + + let previous_notifier = epoch_tracker.roll_epoch_forward().1; + assert!(!previous_notifier.is_complete()); + + drop(g1); + assert!(!previous_notifier.is_complete()); + drop(g2); + assert_eq!(previous_notifier.wait_for_complete().0.get(), expected); + } +} + +#[cfg(test)] +fn concurrent_flush_epoch_burn_in_inner() { + const N_THREADS: usize = 10; + const N_OPS_PER_THREAD: usize = 3000; + + let fa = FlushEpochTracker::default(); + + let barrier = std::sync::Arc::new(std::sync::Barrier::new(21)); + + let pt = pagetable::PageTable::::default(); + + let rolls = || { + let fa = fa.clone(); + let barrier = barrier.clone(); + let pt = &pt; + move || { + barrier.wait(); + for _ in 0..N_OPS_PER_THREAD { + let (previous, this, next) = fa.roll_epoch_forward(); + let last_epoch = previous.wait_for_complete().0.get(); + assert_eq!(0, pt.get(last_epoch).load(Ordering::Acquire)); + let flush_through_epoch = this.wait_for_complete().0.get(); + assert_eq!( + 0, + pt.get(flush_through_epoch).load(Ordering::Acquire) + ); + + next.mark_complete(); + } + } + }; + + let check_ins = || { + let fa = fa.clone(); + let barrier = barrier.clone(); + let pt = &pt; + move || { + barrier.wait(); + for _ in 0..N_OPS_PER_THREAD { + let guard = fa.check_in(); + let epoch = guard.epoch().0.get(); + pt.get(epoch).fetch_add(1, Ordering::SeqCst); + std::thread::yield_now(); + pt.get(epoch).fetch_sub(1, Ordering::SeqCst); + drop(guard); + } + } + }; + + std::thread::scope(|s| { + let mut threads = vec![]; + + for _ in 0..N_THREADS { + threads.push(s.spawn(rolls())); + threads.push(s.spawn(check_ins())); + } + + barrier.wait(); + + for thread in threads.into_iter() { + thread.join().expect("a test thread crashed unexpectedly"); + } + }); + + for i in 0..N_OPS_PER_THREAD * N_THREADS { + assert_eq!(0, pt.get(i as u64).load(Ordering::Acquire)); + } +} + +#[test] +fn concurrent_flush_epoch_burn_in() { + for _ in 0..128 { + concurrent_flush_epoch_burn_in_inner(); + } +} diff --git a/src/flusher.rs b/src/flusher.rs deleted file mode 100644 index 84c8549dc..000000000 --- a/src/flusher.rs +++ /dev/null @@ -1,184 +0,0 @@ -use std::thread; -use std::time::Duration; - -use parking_lot::{Condvar, Mutex}; - -use super::*; - -#[derive(Debug, Clone, Copy)] -pub(crate) enum ShutdownState { - Running, - ShuttingDown, - ShutDown, -} - -impl ShutdownState { - const fn is_running(self) -> bool { - matches!(self, ShutdownState::Running) - } - - const fn is_shutdown(self) -> bool { - matches!(self, ShutdownState::ShutDown) - } -} - -#[derive(Debug)] -pub(crate) struct Flusher { - shutdown: Arc>, - sc: Arc, - join_handle: Mutex>>, -} - -impl Flusher { - /// Spawns a thread that periodically calls `callback` until dropped. 
- pub(crate) fn new( - name: String, - pagecache: PageCache, - flush_every_ms: u64, - ) -> Self { - #[allow(clippy::mutex_atomic)] // mutex used in CondVar below - let shutdown = Arc::new(Mutex::new(ShutdownState::Running)); - let sc = Arc::new(Condvar::new()); - - let join_handle = thread::Builder::new() - .name(name) - .spawn({ - let shutdown2 = shutdown.clone(); - let sc2 = sc.clone(); - move || run(&shutdown2, &sc2, &pagecache, flush_every_ms) - }) - .unwrap(); - - Self { shutdown, sc, join_handle: Mutex::new(Some(join_handle)) } - } -} - -fn run( - shutdown_mu: &Arc>, - sc: &Arc, - pagecache: &PageCache, - flush_every_ms: u64, -) { - let flush_every = Duration::from_millis(flush_every_ms); - let mut shutdown = shutdown_mu.lock(); - let mut wrote_data = false; - while shutdown.is_running() || wrote_data { - let before = std::time::Instant::now(); - let cc = concurrency_control::read(); - match pagecache.log.roll_iobuf() { - Ok(0) => { - wrote_data = false; - if !shutdown.is_running() { - break; - } - } - Ok(_) => { - wrote_data = true; - if !shutdown.is_running() { - // loop right away if we're in - // shutdown mode, to flush data - // more quickly. - continue; - } - } - Err(e) => { - error!("failed to flush from periodic flush thread: {}", e); - - pagecache.log.iobufs.set_global_error(e); - - *shutdown = ShutdownState::ShutDown; - - // having held the mutex makes this linearized - // with the notify below. - drop(shutdown); - - let _notified = sc.notify_all(); - return; - } - } - drop(cc); - - // so we can spend a little effort - // cleaning up the segments. try not to - // spend more than half of our sleep - // time rewriting pages though. - // - // this looks weird because it's a rust-style do-while - // where the conditional is the full body - while { - let made_progress = match pagecache.attempt_gc() { - Err(e) => { - error!( - "failed to clean file from periodic flush thread: {}", - e - ); - - pagecache.log.iobufs.set_global_error(e); - - *shutdown = ShutdownState::ShutDown; - - // having held the mutex makes this linearized - // with the notify below. - drop(shutdown); - - let _notified = sc.notify_all(); - return; - } - Ok(false) => false, - Ok(true) => true, - }; - made_progress - && shutdown.is_running() - && before.elapsed() < flush_every / 2 - } {} - - if let Err(e) = pagecache.config.file.sync_all() { - error!("failed to fsync from periodic flush thread: {}", e); - } - - let sleep_duration = flush_every - .checked_sub(before.elapsed()) - .unwrap_or_else(|| Duration::from_millis(1)); - - if shutdown.is_running() { - // only sleep before the next flush if we are - // running normally. if we're shutting down, - // flush faster. - sc.wait_for(&mut shutdown, sleep_duration); - } - } - - *shutdown = ShutdownState::ShutDown; - - // having held the mutex makes this linearized - // with the notify below. 
- drop(shutdown); - - let _notified = sc.notify_all(); -} - -impl Drop for Flusher { - fn drop(&mut self) { - let mut shutdown = self.shutdown.lock(); - if shutdown.is_running() { - *shutdown = ShutdownState::ShuttingDown; - let _notified = self.sc.notify_all(); - } - - #[allow(unused_variables)] - let mut count = 0; - while !shutdown.is_shutdown() { - let _ = self.sc.wait_for(&mut shutdown, Duration::from_millis(100)); - count += 1; - - testing_assert!(count < 15); - } - - let mut join_handle_opt = self.join_handle.lock(); - if let Some(join_handle) = join_handle_opt.take() { - if let Err(e) = join_handle.join() { - error!("error joining Periodic thread: {:?}", e); - } - } - } -} diff --git a/src/fnv.rs b/src/fnv.rs deleted file mode 100644 index 7b623a5ce..000000000 --- a/src/fnv.rs +++ /dev/null @@ -1,32 +0,0 @@ -// extracted from the fnv crate for minor, mostly -// compile-time optimizations. fnv and sled are -// both licensed as Apache 2.0 OR MIT. -#[allow(missing_copy_implementations)] -pub struct Hasher(u64); - -impl Default for Hasher { - #[inline] - fn default() -> Hasher { - Hasher(0xcbf29ce484222325) - } -} - -impl std::hash::Hasher for Hasher { - #[inline] - fn finish(&self) -> u64 { - self.0 - } - - #[inline] - #[allow(clippy::cast_lossless)] - fn write(&mut self, bytes: &[u8]) { - let Hasher(mut hash) = *self; - - for byte in bytes.iter() { - hash ^= *byte as u64; - hash = hash.wrapping_mul(0x100000001b3); - } - - *self = Hasher(hash); - } -} diff --git a/src/heap.rs b/src/heap.rs new file mode 100644 index 000000000..7eaeec544 --- /dev/null +++ b/src/heap.rs @@ -0,0 +1,1118 @@ +use std::fmt; +use std::fs; +use std::io::{self, Read}; +use std::num::NonZeroU64; +use std::path::{Path, PathBuf}; +use std::sync::atomic::{fence, AtomicPtr, AtomicU64, Ordering}; +use std::sync::Arc; +use std::time::{Duration, Instant}; + +use ebr::{Ebr, Guard}; +use fault_injection::{annotate, fallible, maybe}; +use fnv::FnvHashSet; +use fs2::FileExt as _; +use parking_lot::{Mutex, RwLock}; +use rayon::prelude::*; + +use crate::object_location_mapper::{AllocatorStats, ObjectLocationMapper}; +use crate::{CollectionId, Config, DeferredFree, MetadataStore, ObjectId}; + +const WARN: &str = "DO_NOT_PUT_YOUR_FILES_HERE"; +pub(crate) const N_SLABS: usize = 78; +const FILE_TARGET_FILL_RATIO: u64 = 80; +const FILE_RESIZE_MARGIN: u64 = 115; + +const SLAB_SIZES: [usize; N_SLABS] = [ + 64, // 0x40 + 80, // 0x50 + 96, // 0x60 + 112, // 0x70 + 128, // 0x80 + 160, // 0xa0 + 192, // 0xc0 + 224, // 0xe0 + 256, // 0x100 + 320, // 0x140 + 384, // 0x180 + 448, // 0x1c0 + 512, // 0x200 + 640, // 0x280 + 768, // 0x300 + 896, // 0x380 + 1024, // 0x400 + 1280, // 0x500 + 1536, // 0x600 + 1792, // 0x700 + 2048, // 0x800 + 2560, // 0xa00 + 3072, // 0xc00 + 3584, // 0xe00 + 4096, // 0x1000 + 5120, // 0x1400 + 6144, // 0x1800 + 7168, // 0x1c00 + 8192, // 0x2000 + 10240, // 0x2800 + 12288, // 0x3000 + 14336, // 0x3800 + 16384, // 0x4000 + 20480, // 0x5000 + 24576, // 0x6000 + 28672, // 0x7000 + 32768, // 0x8000 + 40960, // 0xa000 + 49152, // 0xc000 + 57344, // 0xe000 + 65536, // 0x10000 + 98304, // 0x1a000 + 131072, // 0x20000 + 163840, // 0x28000 + 196608, + 262144, + 393216, + 524288, + 786432, + 1048576, + 1572864, + 2097152, + 3145728, + 4194304, + 6291456, + 8388608, + 12582912, + 16777216, + 25165824, + 33554432, + 50331648, + 67108864, + 100663296, + 134217728, + 201326592, + 268435456, + 402653184, + 536870912, + 805306368, + 1073741824, + 1610612736, + 2147483648, + 3221225472, + 4294967296, + 6442450944, + 
8589934592, + 12884901888, + 17_179_869_184, // 17gb is max page size as-of now +]; + +#[derive(Default, Debug, Copy, Clone)] +pub struct WriteBatchStats { + pub heap_bytes_written: u64, + pub heap_files_written_to: u64, + /// Latency inclusive of fsync + pub heap_write_latency: Duration, + /// Latency for fsyncing files + pub heap_sync_latency: Duration, + pub metadata_bytes_written: u64, + pub metadata_write_latency: Duration, + pub truncated_files: u64, + pub truncated_bytes: u64, + pub truncate_latency: Duration, +} + +#[derive(Default, Debug, Clone, Copy)] +pub struct HeapStats { + pub allocator: AllocatorStats, + pub write_batch_max: WriteBatchStats, + pub write_batch_sum: WriteBatchStats, + pub truncated_file_bytes: u64, +} + +impl WriteBatchStats { + pub(crate) fn max(&self, other: &WriteBatchStats) -> WriteBatchStats { + WriteBatchStats { + heap_bytes_written: self + .heap_bytes_written + .max(other.heap_bytes_written), + heap_files_written_to: self + .heap_files_written_to + .max(other.heap_files_written_to), + heap_write_latency: self + .heap_write_latency + .max(other.heap_write_latency), + heap_sync_latency: self + .heap_sync_latency + .max(other.heap_sync_latency), + metadata_bytes_written: self + .metadata_bytes_written + .max(other.metadata_bytes_written), + metadata_write_latency: self + .metadata_write_latency + .max(other.metadata_write_latency), + truncated_files: self.truncated_files.max(other.truncated_files), + truncated_bytes: self.truncated_bytes.max(other.truncated_bytes), + truncate_latency: self.truncate_latency.max(other.truncate_latency), + } + } + + pub(crate) fn sum(&self, other: &WriteBatchStats) -> WriteBatchStats { + use std::ops::Add; + WriteBatchStats { + heap_bytes_written: self + .heap_bytes_written + .add(other.heap_bytes_written), + heap_files_written_to: self + .heap_files_written_to + .add(other.heap_files_written_to), + heap_write_latency: self + .heap_write_latency + .add(other.heap_write_latency), + heap_sync_latency: self + .heap_sync_latency + .add(other.heap_sync_latency), + metadata_bytes_written: self + .metadata_bytes_written + .add(other.metadata_bytes_written), + metadata_write_latency: self + .metadata_write_latency + .add(other.metadata_write_latency), + truncated_files: self.truncated_files.add(other.truncated_files), + truncated_bytes: self.truncated_bytes.add(other.truncated_bytes), + truncate_latency: self.truncate_latency.add(other.truncate_latency), + } + } +} + +const fn overhead_for_size(size: usize) -> usize { + if size + 5 <= u8::MAX as usize { + // crc32 + 1 byte frame + 5 + } else if size + 6 <= u16::MAX as usize { + // crc32 + 2 byte frame + 6 + } else if size + 8 <= u32::MAX as usize { + // crc32 + 4 byte frame + 8 + } else { + // crc32 + 8 byte frame + 12 + } +} + +fn slab_for_size(size: usize) -> u8 { + let total_size = size + overhead_for_size(size); + for idx in 0..SLAB_SIZES.len() { + if SLAB_SIZES[idx] >= total_size { + return idx as u8; + } + } + u8::MAX +} + +pub use inline_array::InlineArray; + +#[derive(Debug)] +pub struct ObjectRecovery { + pub object_id: ObjectId, + pub collection_id: CollectionId, + pub low_key: InlineArray, +} + +pub struct HeapRecovery { + pub heap: Heap, + pub recovered_nodes: Vec, + pub was_recovered: bool, +} + +enum PersistentSettings { + V1 { leaf_fanout: u64 }, +} + +impl PersistentSettings { + // NB: should only be called with a directory lock already exclusively acquired + fn verify_or_store>( + &self, + path: P, + _directory_lock: &std::fs::File, + ) -> io::Result<()> { + let 
settings_path = path.as_ref().join("durability_cookie"); + + match std::fs::read(&settings_path) { + Ok(previous_bytes) => { + let previous = + PersistentSettings::deserialize(&previous_bytes)?; + self.check_compatibility(&previous) + } + Err(e) if e.kind() == std::io::ErrorKind::NotFound => { + std::fs::write(settings_path, &self.serialize()) + } + Err(e) => Err(e), + } + } + + fn deserialize(buf: &[u8]) -> io::Result { + let mut cursor = buf; + let mut buf = [0_u8; 64]; + cursor.read_exact(&mut buf)?; + + let version = u16::from_le_bytes([buf[0], buf[1]]); + + let crc_actual = (crc32fast::hash(&buf[0..60]) ^ 0xAF).to_le_bytes(); + let crc_expected = &buf[60..]; + + if crc_actual != crc_expected { + return Err(io::Error::new( + io::ErrorKind::InvalidData, + "encountered corrupted settings cookie with mismatched CRC.", + )); + } + + match version { + 1 => { + let leaf_fanout = u64::from_le_bytes(buf[2..10].try_into().unwrap()); + Ok(PersistentSettings::V1 { leaf_fanout }) + } + _ => { + Err(io::Error::new( + io::ErrorKind::InvalidData, + "encountered unknown version number when reading settings cookie" + )) + } + } + } + + fn check_compatibility( + &self, + other: &PersistentSettings, + ) -> io::Result<()> { + use PersistentSettings::*; + + match (self, other) { + (V1 { leaf_fanout: lf1 }, V1 { leaf_fanout: lf2 }) => { + if lf1 != lf2 { + Err(io::Error::new( + io::ErrorKind::Unsupported, + format!( + "sled was already opened with a LEAF_FANOUT const generic of {}, \ + and this may not be changed after initial creation. Please use \ + Db::import / Db::export to migrate, if you wish to change the \ + system's format.", lf2 + ) + )) + } else { + Ok(()) + } + } + } + } + + fn serialize(&self) -> Vec { + // format: 64 bytes in total, with the last 4 being a LE crc32 + // first 2 are LE version number + let mut buf = vec![]; + + match self { + PersistentSettings::V1 { leaf_fanout } => { + // LEAF_FANOUT: 8 bytes LE + let version: [u8; 2] = 1_u16.to_le_bytes(); + buf.extend_from_slice(&version); + + buf.extend_from_slice(&leaf_fanout.to_le_bytes()); + } + } + + // zero-pad the buffer + assert!(buf.len() < 60); + buf.resize(60, 0); + + let hash: u32 = crc32fast::hash(&buf) ^ 0xAF; + let hash_bytes: [u8; 4] = hash.to_le_bytes(); + buf.extend_from_slice(&hash_bytes); + + // keep the buffer to 64 bytes for easy parsing over time. 
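+ // Resulting layout: bytes 0..2 hold the little-endian version, bytes 2..10
+ // the little-endian leaf fanout, bytes 10..60 are zero padding, and bytes
+ // 60..64 carry crc32(first 60 bytes) ^ 0xAF, matching deserialize above.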
+ assert_eq!(buf.len(), 64); + + buf + } +} + +#[derive(Clone, Copy, Debug, PartialEq)] +pub(crate) struct SlabAddress { + slab_id: u8, + slab_slot: [u8; 7], +} + +impl SlabAddress { + pub(crate) fn from_slab_slot(slab: u8, slot: u64) -> SlabAddress { + let slot_bytes = slot.to_be_bytes(); + + assert_eq!(slot_bytes[0], 0); + + SlabAddress { + slab_id: slab, + slab_slot: slot_bytes[1..].try_into().unwrap(), + } + } + + #[inline] + pub const fn slab(&self) -> u8 { + self.slab_id + } + + #[inline] + pub const fn slot(&self) -> u64 { + u64::from_be_bytes([ + 0, + self.slab_slot[0], + self.slab_slot[1], + self.slab_slot[2], + self.slab_slot[3], + self.slab_slot[4], + self.slab_slot[5], + self.slab_slot[6], + ]) + } +} + +impl From for SlabAddress { + fn from(i: NonZeroU64) -> SlabAddress { + let i = i.get(); + let bytes = i.to_be_bytes(); + SlabAddress { + slab_id: bytes[0] - 1, + slab_slot: bytes[1..].try_into().unwrap(), + } + } +} + +impl Into for SlabAddress { + fn into(self) -> NonZeroU64 { + NonZeroU64::new(u64::from_be_bytes([ + self.slab_id + 1, + self.slab_slot[0], + self.slab_slot[1], + self.slab_slot[2], + self.slab_slot[3], + self.slab_slot[4], + self.slab_slot[5], + self.slab_slot[6], + ])) + .unwrap() + } +} + +#[cfg(unix)] +mod sys_io { + use std::io; + use std::os::unix::fs::FileExt; + + use super::*; + + pub(super) fn read_exact_at( + file: &fs::File, + buf: &mut [u8], + offset: u64, + ) -> io::Result<()> { + match maybe!(file.read_exact_at(buf, offset)) { + Ok(r) => Ok(r), + Err(e) => { + // FIXME BUG 3: failed to read 64 bytes at offset 192 from file with len 192 + println!( + "failed to read {} bytes at offset {} from file with len {}", + buf.len(), + offset, + file.metadata().unwrap().len(), + ); + let _ = dbg!(std::backtrace::Backtrace::force_capture()); + Err(e) + } + } + } + + pub(super) fn write_all_at( + file: &fs::File, + buf: &[u8], + offset: u64, + ) -> io::Result<()> { + maybe!(file.write_all_at(buf, offset)) + } +} + +#[cfg(windows)] +mod sys_io { + use std::os::windows::fs::FileExt; + + use super::*; + + pub(super) fn read_exact_at( + file: &fs::File, + mut buf: &mut [u8], + mut offset: u64, + ) -> io::Result<()> { + while !buf.is_empty() { + match maybe!(file.seek_read(buf, offset)) { + Ok(0) => break, + Ok(n) => { + let tmp = buf; + buf = &mut tmp[n..]; + offset += n as u64; + } + Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} + Err(e) => return Err(annotate!(e)), + } + } + if !buf.is_empty() { + Err(annotate!(io::Error::new( + io::ErrorKind::UnexpectedEof, + "failed to fill whole buffer" + ))) + } else { + Ok(()) + } + } + + pub(super) fn write_all_at( + file: &fs::File, + mut buf: &[u8], + mut offset: u64, + ) -> io::Result<()> { + while !buf.is_empty() { + match maybe!(file.seek_write(buf, offset)) { + Ok(0) => { + return Err(annotate!(io::Error::new( + io::ErrorKind::WriteZero, + "failed to write whole buffer", + ))); + } + Ok(n) => { + buf = &buf[n..]; + offset += n as u64; + } + Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} + Err(e) => return Err(annotate!(e)), + } + } + Ok(()) + } +} + +#[derive(Debug)] +struct Slab { + file: fs::File, + slot_size: usize, + max_live_slot_since_last_truncation: AtomicU64, +} + +impl Slab { + fn sync(&self) -> io::Result<()> { + self.file.sync_all() + } + + fn read( + &self, + slot: u64, + _guard: &mut Guard<'_, DeferredFree, 16, 16>, + ) -> io::Result> { + log::trace!("reading from slot {} in slab {}", slot, self.slot_size); + + let mut data = Vec::with_capacity(self.slot_size); + unsafe { + 
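+ // The length is bumped to the full slot size up front; every byte is then
+ // overwritten by the read_exact_at call below, and the trailing crc32
+ // check rejects torn or corrupted slots.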
data.set_len(self.slot_size); + } + + let whence = self.slot_size as u64 * slot; + + maybe!(sys_io::read_exact_at(&self.file, &mut data, whence))?; + + let hash_actual: [u8; 4] = + (crc32fast::hash(&data[..self.slot_size - 4]) ^ 0xAF).to_le_bytes(); + let hash_expected = &data[self.slot_size - 4..]; + + if hash_expected != hash_actual { + return Err(annotate!(io::Error::new( + io::ErrorKind::InvalidData, + "crc mismatch - data corruption detected" + ))); + } + + let len: usize = if self.slot_size <= u8::MAX as usize { + // crc32 + 1 byte frame + usize::from(data[self.slot_size - 5]) + } else if self.slot_size <= u16::MAX as usize { + // crc32 + 2 byte frame + let mut size_bytes: [u8; 2] = [0; 2]; + size_bytes + .copy_from_slice(&data[self.slot_size - 6..self.slot_size - 4]); + usize::from(u16::from_le_bytes(size_bytes)) + } else if self.slot_size <= u32::MAX as usize { + // crc32 + 4 byte frame + let mut size_bytes: [u8; 4] = [0; 4]; + size_bytes + .copy_from_slice(&data[self.slot_size - 8..self.slot_size - 4]); + usize::try_from(u32::from_le_bytes(size_bytes)).unwrap() + } else { + // crc32 + 8 byte frame + let mut size_bytes: [u8; 8] = [0; 8]; + size_bytes.copy_from_slice( + &data[self.slot_size - 12..self.slot_size - 4], + ); + usize::try_from(u64::from_le_bytes(size_bytes)).unwrap() + }; + + data.truncate(len); + + Ok(data) + } + + fn write(&self, slot: u64, mut data: Vec) -> io::Result<()> { + let len = data.len(); + + assert!(len + overhead_for_size(data.len()) <= self.slot_size); + + data.resize(self.slot_size, 0); + + if self.slot_size <= u8::MAX as usize { + // crc32 + 1 byte frame + data[self.slot_size - 5] = u8::try_from(len).unwrap(); + } else if self.slot_size <= u16::MAX as usize { + // crc32 + 2 byte frame + let size_bytes: [u8; 2] = u16::try_from(len).unwrap().to_le_bytes(); + data[self.slot_size - 6..self.slot_size - 4] + .copy_from_slice(&size_bytes); + } else if self.slot_size <= u32::MAX as usize { + // crc32 + 4 byte frame + let size_bytes: [u8; 4] = u32::try_from(len).unwrap().to_le_bytes(); + data[self.slot_size - 8..self.slot_size - 4] + .copy_from_slice(&size_bytes); + } else { + // crc32 + 8 byte frame + let size_bytes: [u8; 8] = u64::try_from(len).unwrap().to_le_bytes(); + data[self.slot_size - 12..self.slot_size - 4] + .copy_from_slice(&size_bytes); + } + + let hash: [u8; 4] = + (crc32fast::hash(&data[..self.slot_size - 4]) ^ 0xAF).to_le_bytes(); + data[self.slot_size - 4..].copy_from_slice(&hash); + + let whence = self.slot_size as u64 * slot; + + log::trace!("writing to slot {} in slab {}", slot, self.slot_size); + sys_io::write_all_at(&self.file, &data, whence) + } +} + +fn set_error( + global_error: &AtomicPtr<(io::ErrorKind, String)>, + error: &io::Error, +) { + let kind = error.kind(); + let reason = error.to_string(); + + let boxed = Box::new((kind, reason)); + let ptr = Box::into_raw(boxed); + + if global_error + .compare_exchange( + std::ptr::null_mut(), + ptr, + Ordering::SeqCst, + Ordering::SeqCst, + ) + .is_err() + { + // global fatal error already installed, drop this one + unsafe { + drop(Box::from_raw(ptr)); + } + } +} + +#[derive(Debug)] +pub enum Update { + Store { + object_id: ObjectId, + collection_id: CollectionId, + low_key: InlineArray, + data: Vec, + }, + Free { + object_id: ObjectId, + collection_id: CollectionId, + }, +} + +impl Update { + #[allow(unused)] + pub(crate) fn object_id(&self) -> ObjectId { + match self { + Update::Store { object_id, .. } + | Update::Free { object_id, .. 
} => *object_id, + } + } +} + +#[derive(Debug, PartialOrd, Ord, PartialEq, Eq)] +pub enum UpdateMetadata { + Store { + object_id: ObjectId, + collection_id: CollectionId, + low_key: InlineArray, + location: NonZeroU64, + }, + Free { + object_id: ObjectId, + collection_id: CollectionId, + }, +} + +impl UpdateMetadata { + pub fn object_id(&self) -> ObjectId { + match self { + UpdateMetadata::Store { object_id, .. } + | UpdateMetadata::Free { object_id, .. } => *object_id, + } + } +} + +#[derive(Debug, Default, Clone, Copy)] +struct WriteBatchStatTracker { + sum: WriteBatchStats, + max: WriteBatchStats, +} + +#[derive(Clone)] +pub struct Heap { + path: PathBuf, + slabs: Arc<[Slab; N_SLABS]>, + table: ObjectLocationMapper, + metadata_store: Arc>, + free_ebr: Ebr, + global_error: Arc>, + #[allow(unused)] + directory_lock: Arc, + stats: Arc>, + truncated_file_bytes: Arc, +} + +impl fmt::Debug for Heap { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.debug_struct("Heap") + .field("path", &self.path) + .field("stats", &self.stats()) + .finish() + } +} + +impl Heap { + pub fn recover( + leaf_fanout: usize, + config: &Config, + ) -> io::Result { + let path = &config.path; + log::trace!("recovering Heap at {:?}", path); + let slabs_dir = path.join("slabs"); + + // TODO NOCOMMIT + let sync_status = std::process::Command::new("sync") + .status() + .map(|status| status.success()); + + if !matches!(sync_status, Ok(true)) { + log::warn!( + "sync command before recovery failed: {:?}", + sync_status + ); + } + + // initialize directories if not present + let mut was_recovered = true; + for p in [path, &slabs_dir] { + if let Err(e) = fs::read_dir(p) { + if e.kind() == io::ErrorKind::NotFound { + fallible!(fs::create_dir_all(p)); + was_recovered = false; + continue; + } + } + maybe!(fs::File::open(p).and_then(|f| f.sync_all()))?; + } + + let _ = fs::File::create(path.join(WARN)); + + let mut file_lock_opts = fs::OpenOptions::new(); + file_lock_opts.create(false).read(false).write(false); + let directory_lock = fallible!(fs::File::open(&path)); + fallible!(directory_lock.try_lock_exclusive()); + + maybe!(fs::File::open(&slabs_dir).and_then(|f| f.sync_all()))?; + maybe!(directory_lock.sync_all())?; + + let persistent_settings = + PersistentSettings::V1 { leaf_fanout: leaf_fanout as u64 }; + + persistent_settings.verify_or_store(path, &directory_lock)?; + + let (metadata_store, recovered_metadata) = + MetadataStore::recover(path.join("metadata"))?; + + let table = ObjectLocationMapper::new( + &recovered_metadata, + config.target_heap_file_fill_ratio, + ); + + let mut recovered_nodes = + Vec::::with_capacity(recovered_metadata.len()); + + for update_metadata in recovered_metadata { + match update_metadata { + UpdateMetadata::Store { + object_id, + collection_id, + location: _, + low_key, + } => { + recovered_nodes.push(ObjectRecovery { + object_id, + collection_id, + low_key, + }); + } + UpdateMetadata::Free { .. 
} => { + unreachable!() + } + } + } + + let mut slabs = vec![]; + let mut slab_opts = fs::OpenOptions::new(); + slab_opts.create(true).read(true).write(true); + for i in 0..N_SLABS { + let slot_size = SLAB_SIZES[i]; + let slab_path = slabs_dir.join(format!("{}", slot_size)); + + let file = fallible!(slab_opts.open(slab_path)); + + slabs.push(Slab { + slot_size, + file, + max_live_slot_since_last_truncation: AtomicU64::new(0), + }) + } + + maybe!(fs::File::open(&slabs_dir).and_then(|f| f.sync_all()))?; + + log::debug!("recovery of Heap at {:?} complete", path); + + Ok(HeapRecovery { + heap: Heap { + slabs: Arc::new(slabs.try_into().unwrap()), + path: path.into(), + table, + global_error: metadata_store.get_global_error_arc(), + metadata_store: Arc::new(Mutex::new(metadata_store)), + directory_lock: Arc::new(directory_lock), + free_ebr: Ebr::default(), + truncated_file_bytes: Arc::default(), + stats: Arc::default(), + }, + recovered_nodes, + was_recovered, + }) + } + + pub fn get_global_error_arc( + &self, + ) -> Arc> { + self.global_error.clone() + } + + fn check_error(&self) -> io::Result<()> { + let err_ptr: *const (io::ErrorKind, String) = + self.global_error.load(Ordering::Acquire); + + if err_ptr.is_null() { + Ok(()) + } else { + let deref: &(io::ErrorKind, String) = unsafe { &*err_ptr }; + Err(io::Error::new(deref.0, deref.1.clone())) + } + } + + fn set_error(&self, error: &io::Error) { + set_error(&self.global_error, error); + } + + pub fn manually_advance_epoch(&self) { + self.free_ebr.manually_advance_epoch(); + } + + pub fn stats(&self) -> HeapStats { + let truncated_file_bytes = + self.truncated_file_bytes.load(Ordering::Acquire); + + let stats = self.stats.read(); + + HeapStats { + truncated_file_bytes, + allocator: self.table.stats(), + write_batch_max: stats.max, + write_batch_sum: stats.sum, + } + } + + pub fn read(&self, object_id: ObjectId) -> Option>> { + if let Err(e) = self.check_error() { + return Some(Err(e)); + } + + let mut guard = self.free_ebr.pin(); + let slab_address = self.table.get_location_for_object(object_id)?; + + let slab = &self.slabs[usize::from(slab_address.slab_id)]; + + match slab.read(slab_address.slot(), &mut guard) { + Ok(bytes) => Some(Ok(bytes)), + Err(e) => { + let annotated = annotate!(e); + self.set_error(&annotated); + Some(Err(annotated)) + } + } + } + + pub fn write_batch( + &self, + batch: Vec, + ) -> io::Result { + self.check_error()?; + let metadata_store = self.metadata_store.try_lock() + .expect("write_batch called concurrently! 
major correctness assumption violated"); + let mut guard = self.free_ebr.pin(); + + let slabs = &self.slabs; + let table = &self.table; + + let heap_bytes_written = AtomicU64::new(0); + let heap_files_used_0_to_63 = AtomicU64::new(0); + let heap_files_used_64_to_127 = AtomicU64::new(0); + + let map_closure = |update: Update| match update { + Update::Store { object_id, collection_id, low_key, data } => { + let data_len = data.len(); + let slab_id = slab_for_size(data_len); + let slab = &slabs[usize::from(slab_id)]; + let new_location = table.allocate_slab_slot(slab_id); + let new_location_nzu: NonZeroU64 = new_location.into(); + + let complete_durability_pipeline = + maybe!(slab.write(new_location.slot(), data)); + + if let Err(e) = complete_durability_pipeline { + // can immediately free the slot, as the failed write was never published in the metadata store + table.free_slab_slot(new_location); + return Err(e); + } + + // record stats + heap_bytes_written + .fetch_add(data_len as u64, Ordering::Release); + + if slab_id < 64 { + let slab_bit = 0b1 << slab_id; + heap_files_used_0_to_63 + .fetch_or(slab_bit, Ordering::Release); + } else { + assert!(slab_id < 128); + let slab_bit = 0b1 << (slab_id - 64); + heap_files_used_64_to_127 + .fetch_or(slab_bit, Ordering::Release); + } + + Ok(UpdateMetadata::Store { + object_id, + collection_id, + low_key, + location: new_location_nzu, + }) + } + Update::Free { object_id, collection_id } => { + Ok(UpdateMetadata::Free { object_id, collection_id }) + } + }; + + let before_heap_write = Instant::now(); + + let metadata_batch_res: io::Result<Vec<UpdateMetadata>> = + batch.into_par_iter().map(map_closure).collect(); + + let before_heap_sync = Instant::now(); + + fence(Ordering::SeqCst); + + for slab_id in 0..N_SLABS { + let dirty = if slab_id < 64 { + let slab_bit = 0b1 << slab_id; + + heap_files_used_0_to_63.load(Ordering::Acquire) & slab_bit + == slab_bit + } else { + let slab_bit = 0b1 << (slab_id - 64); + + heap_files_used_64_to_127.load(Ordering::Acquire) & slab_bit + == slab_bit + }; + + if dirty { + self.slabs[slab_id].sync()?; + } + } + + let heap_sync_latency = before_heap_sync.elapsed(); + + let heap_write_latency = before_heap_write.elapsed(); + + let metadata_batch = match metadata_batch_res { + Ok(mut mb) => { + // TODO evaluate the impact-to-cost ratio of this sort + mb.par_sort_unstable(); + mb + } + Err(e) => { + self.set_error(&e); + return Err(e); + } + }; + + // make metadata durable + let before_metadata_write = Instant::now(); + let metadata_bytes_written = + match metadata_store.write_batch(&metadata_batch) { + Ok(metadata_bytes_written) => metadata_bytes_written, + Err(e) => { + self.set_error(&e); + return Err(e); + } + }; + let metadata_write_latency = before_metadata_write.elapsed(); + + // reclaim previous disk locations for future writes + for update_metadata in metadata_batch { + let last_address_opt = match update_metadata { + UpdateMetadata::Store { object_id, location, .. } => { + self.table.insert(object_id, SlabAddress::from(location)) + } + UpdateMetadata::Free { object_id, ..
} => { + guard.defer_drop(DeferredFree { + allocator: self.table.clone_object_id_allocator_arc(), + freed_slot: object_id.0.get(), + }); + self.table.remove(object_id) + } + }; + + if let Some(last_address) = last_address_opt { + guard.defer_drop(DeferredFree { + allocator: self + .table + .clone_slab_allocator_arc(last_address.slab_id), + freed_slot: last_address.slot(), + }); + } + } + + // truncate files that are now too fragmented + let before_truncate = Instant::now(); + let mut truncated_files = 0; + let mut truncated_bytes = 0; + for (i, max_live_slot) in self.table.get_max_allocated_per_slab() { + let slab = &self.slabs[i]; + + let last_max = slab + .max_live_slot_since_last_truncation + .fetch_max(max_live_slot, Ordering::SeqCst); + + let max_since_last_truncation = last_max.max(max_live_slot); + + let currently_occupied_bytes = + (max_live_slot + 1) * slab.slot_size as u64; + + let max_occupied_bytes = + (max_since_last_truncation + 1) * slab.slot_size as u64; + + let ratio = currently_occupied_bytes * 100 / max_occupied_bytes; + + if ratio < FILE_TARGET_FILL_RATIO { + let target_len = if max_live_slot < 16 { + currently_occupied_bytes + } else { + currently_occupied_bytes * FILE_RESIZE_MARGIN / 100 + }; + + assert!(target_len < max_occupied_bytes); + assert!( + target_len >= currently_occupied_bytes, + "target_len of {} is above actual occupied len of {}", + target_len, + currently_occupied_bytes + ); + + if cfg!(not(feature = "monotonic-behavior")) { + if slab.file.set_len(target_len).is_ok() { + slab.max_live_slot_since_last_truncation + .store(max_live_slot, Ordering::SeqCst); + + let file_truncated_bytes = + currently_occupied_bytes.saturating_sub(target_len); + self.truncated_file_bytes + .fetch_add(file_truncated_bytes, Ordering::Release); + + truncated_files += 1; + truncated_bytes += file_truncated_bytes; + } else { + // TODO surface stats + } + } + } + } + + let truncate_latency = before_truncate.elapsed(); + + let heap_files_written_to = u64::from( + heap_files_used_0_to_63.load(Ordering::Acquire).count_ones() + + heap_files_used_64_to_127 + .load(Ordering::Acquire) + .count_ones(), + ); + + let stats = WriteBatchStats { + heap_bytes_written: heap_bytes_written.load(Ordering::Acquire), + heap_files_written_to, + heap_write_latency, + heap_sync_latency, + metadata_bytes_written, + metadata_write_latency, + truncated_files, + truncated_bytes, + truncate_latency, + }; + + { + let mut stats_tracker = self.stats.write(); + stats_tracker.max = stats_tracker.max.max(&stats); + stats_tracker.sum = stats_tracker.sum.sum(&stats); + } + + Ok(stats) + } + + pub fn heap_object_id_pin(&self) -> ebr::Guard<'_, DeferredFree, 16, 16> { + self.free_ebr.pin() + } + + pub fn allocate_object_id(&self) -> ObjectId { + self.table.allocate_object_id() + } + + pub(crate) fn objects_to_defrag(&self) -> FnvHashSet { + self.table.objects_to_defrag() + } +} diff --git a/src/histogram.rs b/src/histogram.rs deleted file mode 100644 index b18f119f5..000000000 --- a/src/histogram.rs +++ /dev/null @@ -1,263 +0,0 @@ -//! Copied from my historian crate. - Tyler Neely -//! -//! A zero-config simple histogram collector -//! -//! for use in instrumented optimization. -//! Uses logarithmic bucketing rather than sampling, -//! and has bounded (generally <0.5%) error on percentiles. -//! Performs no allocations after initial creation. -//! Uses Relaxed atomics during collection. -//! -//! When you create it, it allocates 65k `AtomicUsize`'s -//! that it uses for incrementing. Generating reports -//! 
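The slab-file truncation logic at the end of `write_batch` above is easy to lose in the diff, so here is a small self-contained sketch of the same arithmetic. The `target_fill_ratio` and `resize_margin` parameters stand in for the crate's `FILE_TARGET_FILL_RATIO` and `FILE_RESIZE_MARGIN` constants, whose concrete values are defined elsewhere in the patch; the numbers used in `main` are illustrative only.

```rust
// Returns Some(new_file_len) when the slab file should be shrunk, mirroring the
// ratio check in write_batch: truncate once the live region of the file falls
// below `target_fill_ratio` percent of its high-water mark.
fn truncation_target(
    max_live_slot: u64,
    max_slot_since_last_truncation: u64,
    slot_size: u64,
    target_fill_ratio: u64, // stand-in for FILE_TARGET_FILL_RATIO, a percentage
    resize_margin: u64,     // stand-in for FILE_RESIZE_MARGIN, a percentage >= 100
) -> Option<u64> {
    let currently_occupied = (max_live_slot + 1) * slot_size;
    let max_occupied =
        (max_slot_since_last_truncation.max(max_live_slot) + 1) * slot_size;

    let ratio = currently_occupied * 100 / max_occupied;
    if ratio >= target_fill_ratio {
        return None;
    }

    // leave some slack above the live region unless the slab is nearly empty
    let target_len = if max_live_slot < 16 {
        currently_occupied
    } else {
        currently_occupied * resize_margin / 100
    };
    Some(target_len)
}

fn main() {
    // A slab of 1 KiB slots that once held 1000 slots but now has 100 live ones.
    // 80 and 110 are example percentages, not the crate's actual constants.
    let target = truncation_target(99, 999, 1024, 80, 110);
    assert_eq!(target, Some(100 * 1024 * 110 / 100));
}
```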
after running workloads on dozens of `Histogram`'s -//! does not result in a perceptible delay, but it -//! might not be acceptable for use in low-latency -//! reporting paths. -//! -//! The trade-offs taken in this are to minimize latency -//! during collection, while initial allocation and -//! postprocessing delays are acceptable. -//! -//! Future work to further reduce collection latency -//! may include using thread-local caches that perform -//! no atomic operations until they are dropped, when -//! they may atomically aggregate their measurements -//! into the shared collector that will be used for -//! reporting. -#![allow(unused)] -#![allow(unused_results)] -#![allow(clippy::print_stdout)] -#![allow(clippy::float_arithmetic)] -#![allow(clippy::cast_precision_loss)] -#![allow(clippy::cast_possible_truncation)] -#![allow(clippy::cast_sign_loss)] - -use std::convert::TryFrom; -use std::fmt::{self, Debug}; -use std::sync::atomic::{AtomicUsize, Ordering}; - -const PRECISION: f64 = 100.; -const BUCKETS: usize = 1 << 16; - -/// A histogram collector that uses zero-configuration logarithmic buckets. -pub struct Histogram { - vals: Vec, - sum: AtomicUsize, - count: AtomicUsize, -} - -impl Default for Histogram { - #[allow(unsafe_code)] - fn default() -> Histogram { - #[cfg(not(feature = "miri_optimizations"))] - { - let mut vals = Vec::with_capacity(BUCKETS); - vals.resize_with(BUCKETS, Default::default); - - Histogram { - vals, - sum: AtomicUsize::new(0), - count: AtomicUsize::new(0), - } - } - - #[cfg(feature = "miri_optimizations")] - { - // Avoid calling Vec::resize_with with a large length because its - // internals cause stacked borrows tracking information to add an - // item for each element of the vector. - let mut raw_vals = - std::mem::ManuallyDrop::new(vec![0_usize; BUCKETS]); - let ptr: *mut usize = raw_vals.as_mut_ptr(); - let len = raw_vals.len(); - let capacity = raw_vals.capacity(); - - let vals: Vec = unsafe { - Vec::from_raw_parts(ptr as *mut AtomicUsize, len, capacity) - }; - - Histogram { - vals, - sum: AtomicUsize::new(0), - count: AtomicUsize::new(0), - } - } - } -} - -impl Debug for Histogram { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> { - const PS: [f64; 10] = - [0., 50., 75., 90., 95., 97.5, 99., 99.9, 99.99, 100.]; - f.write_str("Histogramgram[")?; - - for p in &PS { - let res = self.percentile(*p).round(); - let line = format!("({} -> {}) ", p, res); - f.write_str(&*line)?; - } - - f.write_str("]") - } -} - -impl Histogram { - /// Record a value. - #[inline] - pub fn measure(&self, raw_value: u64) { - #[cfg(feature = "metrics")] - { - let value_float: f64 = raw_value as f64; - self.sum.fetch_add(value_float.round() as usize, Ordering::Relaxed); - - self.count.fetch_add(1, Ordering::Relaxed); - - // compress the value to one of 2**16 values - // using logarithmic bucketing - let compressed: u16 = compress(value_float); - - // increment the counter for this compressed value - self.vals[compressed as usize].fetch_add(1, Ordering::Relaxed); - } - } - - /// Retrieve a percentile [0-100]. Returns NAN if no metrics have been - /// collected yet. - pub fn percentile(&self, p: f64) -> f64 { - #[cfg(feature = "metrics")] - { - assert!(p <= 100., "percentiles must not exceed 100.0"); - - let count = self.count.load(Ordering::Acquire); - - if count == 0 { - return std::f64::NAN; - } - - let mut target = count as f64 * (p / 100.); - if target == 0. 
{ - target = 1.; - } - - let mut sum = 0.; - - for (idx, val) in self.vals.iter().enumerate() { - sum += val.load(Ordering::Acquire) as f64; - - if sum >= target { - return decompress(idx as u16); - } - } - } - - std::f64::NAN - } - - /// Dump out some common percentiles. - pub fn print_percentiles(&self) { - println!("{:?}", self); - } - - /// Return the sum of all observations in this histogram. - pub fn sum(&self) -> usize { - self.sum.load(Ordering::Acquire) - } - - /// Return the count of observations in this histogram. - pub fn count(&self) -> usize { - self.count.load(Ordering::Acquire) - } -} - -// compress takes a value and lossily shrinks it to an u16 to facilitate -// bucketing of histogram values, staying roughly within 1% of the true -// value. This fails for large values of 1e142 and above, and is -// inaccurate for values closer to 0 than +/- 0.51 or +/- math.Inf. -#[allow(clippy::cast_sign_loss)] -#[allow(clippy::cast_possible_truncation)] -#[inline] -fn compress>(input_value: T) -> u16 { - let value: f64 = input_value.into(); - let abs = value.abs(); - let boosted = 1. + abs; - let ln = boosted.ln(); - let compressed = PRECISION.mul_add(ln, 0.5); - assert!(compressed <= f64::from(u16::max_value())); - - compressed as u16 -} - -// decompress takes a lossily shrunken u16 and returns an f64 within 1% of -// the original passed to compress. -#[inline] -fn decompress(compressed: u16) -> f64 { - let unboosted = f64::from(compressed) / PRECISION; - (unboosted.exp() - 1.) -} - -#[cfg(feature = "metrics")] -#[test] -fn it_works() { - let c = Histogram::default(); - c.measure(2); - c.measure(2); - c.measure(3); - c.measure(3); - c.measure(4); - assert_eq!(c.percentile(0.).round() as usize, 2); - assert_eq!(c.percentile(40.).round() as usize, 2); - assert_eq!(c.percentile(40.1).round() as usize, 3); - assert_eq!(c.percentile(80.).round() as usize, 3); - assert_eq!(c.percentile(80.1).round() as usize, 4); - assert_eq!(c.percentile(100.).round() as usize, 4); - c.print_percentiles(); -} - -#[cfg(feature = "metrics")] -#[test] -fn high_percentiles() { - let c = Histogram::default(); - for _ in 0..9000 { - c.measure(10); - } - for _ in 0..900 { - c.measure(25); - } - for _ in 0..90 { - c.measure(33); - } - for _ in 0..9 { - c.measure(47); - } - c.measure(500); - assert_eq!(c.percentile(0.).round() as usize, 10); - assert_eq!(c.percentile(99.).round() as usize, 25); - assert_eq!(c.percentile(99.89).round() as usize, 33); - assert_eq!(c.percentile(99.91).round() as usize, 47); - assert_eq!(c.percentile(99.99).round() as usize, 47); - assert_eq!(c.percentile(100.).round() as usize, 502); -} - -#[cfg(feature = "metrics")] -#[test] -fn multithreaded() { - use std::sync::Arc; - use std::thread; - - let h = Arc::new(Histogram::default()); - let mut threads = vec![]; - - for _ in 0..10 { - let h = h.clone(); - threads.push(thread::spawn(move || { - h.measure(20); - })); - } - - for t in threads { - t.join().unwrap(); - } - - assert_eq!(h.percentile(50.).round() as usize, 20); -} diff --git a/src/id_allocator.rs b/src/id_allocator.rs new file mode 100644 index 000000000..458fb5622 --- /dev/null +++ b/src/id_allocator.rs @@ -0,0 +1,165 @@ +use std::collections::BTreeSet; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::Arc; + +use crossbeam_queue::SegQueue; +use fnv::FnvHashSet; +use parking_lot::Mutex; + +#[derive(Default, Debug)] +struct FreeSetAndTip { + free_set: BTreeSet, + next_to_allocate: u64, +} + +#[derive(Default, Debug)] +pub struct Allocator { + free_and_pending: Mutex, 
+ /// Flat combining. + /// + /// A lock free queue of recently freed ids which uses when there is contention on `free_and_pending`. + free_queue: SegQueue, + allocation_counter: AtomicU64, + free_counter: AtomicU64, +} + +impl Allocator { + /// Intended primarily for heap slab slot allocators when performing GC. + /// + /// If the slab is fragmented beyond the desired fill ratio, this returns + /// the range of offsets (min inclusive, max exclusive) that may be copied + /// into earlier free slots if they are currently occupied in order to + /// achieve the desired fragmentation ratio. + pub fn fragmentation_cutoff( + &self, + desired_ratio: f32, + ) -> Option<(u64, u64)> { + let mut free_and_tip = self.free_and_pending.lock(); + + let next_to_allocate = free_and_tip.next_to_allocate; + + if next_to_allocate == 0 { + return None; + } + + while let Some(free_id) = self.free_queue.pop() { + free_and_tip.free_set.insert(free_id); + } + + let live_objects = + next_to_allocate - free_and_tip.free_set.len() as u64; + let actual_ratio = live_objects as f32 / next_to_allocate as f32; + + log::trace!( + "fragmented_slots actual ratio: {actual_ratio}, free len: {}", + free_and_tip.free_set.len() + ); + + if desired_ratio <= actual_ratio { + return None; + } + + // calculate theoretical cut-off point, return everything past that + let min = (live_objects as f32 / desired_ratio) as u64; + let max = next_to_allocate; + assert!(min < max); + Some((min, max)) + } + + pub fn from_allocated(allocated: &FnvHashSet) -> Allocator { + let mut heap = BTreeSet::::default(); + let max = allocated.iter().copied().max(); + + for i in 0..max.unwrap_or(0) { + if !allocated.contains(&i) { + heap.insert(i); + } + } + + let free_and_pending = Mutex::new(FreeSetAndTip { + free_set: heap, + next_to_allocate: max.map(|m| m + 1).unwrap_or(0), + }); + + Allocator { + free_and_pending, + free_queue: SegQueue::default(), + allocation_counter: 0.into(), + free_counter: 0.into(), + } + } + + pub fn max_allocated(&self) -> Option { + let next = self.free_and_pending.lock().next_to_allocate; + + if next == 0 { + None + } else { + Some(next - 1) + } + } + + pub fn allocate(&self) -> u64 { + self.allocation_counter.fetch_add(1, Ordering::Relaxed); + let mut free_and_tip = self.free_and_pending.lock(); + while let Some(free_id) = self.free_queue.pop() { + free_and_tip.free_set.insert(free_id); + } + + compact(&mut free_and_tip); + + let pop_attempt = free_and_tip.free_set.pop_first(); + + if let Some(id) = pop_attempt { + id + } else { + let ret = free_and_tip.next_to_allocate; + free_and_tip.next_to_allocate += 1; + ret + } + } + + pub fn free(&self, id: u64) { + if cfg!(not(feature = "monotonic-behavior")) { + self.free_counter.fetch_add(1, Ordering::Relaxed); + if let Some(mut free) = self.free_and_pending.try_lock() { + while let Some(free_id) = self.free_queue.pop() { + free.free_set.insert(free_id); + } + free.free_set.insert(id); + + compact(&mut free); + } else { + self.free_queue.push(id); + } + } + } + + /// Returns the counters for allocated, free + pub fn counters(&self) -> (u64, u64) { + ( + self.allocation_counter.load(Ordering::Acquire), + self.free_counter.load(Ordering::Acquire), + ) + } +} + +fn compact(free: &mut FreeSetAndTip) { + let next = &mut free.next_to_allocate; + + while *next > 1 && free.free_set.contains(&(*next - 1)) { + free.free_set.remove(&(*next - 1)); + *next -= 1; + } +} + +pub struct DeferredFree { + pub allocator: Arc, + pub freed_slot: u64, +} + +impl Drop for DeferredFree { + fn drop(&mut 
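To make the new id allocator above easier to follow, here is a simplified single-threaded model of its free-set-plus-tip scheme. It deliberately omits the flat-combining `SegQueue`, the counters, and the `monotonic-behavior` feature gate, and is not the crate's implementation.

```rust
use std::collections::BTreeSet;

// Simplified single-threaded model of the Allocator above.
struct ToyAllocator {
    free_set: BTreeSet<u64>,
    next_to_allocate: u64,
}

impl ToyAllocator {
    fn new() -> Self {
        ToyAllocator { free_set: BTreeSet::new(), next_to_allocate: 0 }
    }

    fn allocate(&mut self) -> u64 {
        self.compact();
        if let Some(id) = self.free_set.pop_first() {
            id // always prefer the lowest free id
        } else {
            let ret = self.next_to_allocate;
            self.next_to_allocate += 1;
            ret
        }
    }

    fn free(&mut self, id: u64) {
        self.free_set.insert(id);
        self.compact();
    }

    // Shrink the tip while the highest allocated ids are sitting in the free set.
    fn compact(&mut self) {
        while self.next_to_allocate > 0
            && self.free_set.remove(&(self.next_to_allocate - 1))
        {
            self.next_to_allocate -= 1;
        }
    }
}

fn main() {
    let mut a = ToyAllocator::new();
    assert_eq!((a.allocate(), a.allocate(), a.allocate()), (0, 1, 2));
    a.free(1);
    assert_eq!(a.allocate(), 1); // low ids are reused first
    a.free(2);
    assert_eq!(a.next_to_allocate, 2); // trailing free id compacted the tip
}
```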
self) { + self.allocator.free(self.freed_slot) + } +} diff --git a/src/iter.rs b/src/iter.rs deleted file mode 100644 index 06638661c..000000000 --- a/src/iter.rs +++ /dev/null @@ -1,255 +0,0 @@ -use std::ops::{Bound, Deref}; - -#[cfg(feature = "metrics")] -use crate::{Measure, M}; - -use super::*; - -#[cfg(any(test, feature = "lock_free_delays"))] -const MAX_LOOPS: usize = usize::max_value(); - -#[cfg(not(any(test, feature = "lock_free_delays")))] -const MAX_LOOPS: usize = 1_000_000; - -fn possible_predecessor(s: &[u8]) -> Option> { - let mut ret = s.to_vec(); - match ret.pop() { - None => None, - Some(i) if i == 0 => Some(ret), - Some(i) => { - ret.push(i - 1); - ret.extend_from_slice(&[255; 4]); - Some(ret) - } - } -} - -macro_rules! iter_try { - ($e:expr) => { - match $e { - Ok(item) => item, - Err(e) => return Some(Err(e)), - } - }; -} - -/// An iterator over keys and values in a `Tree`. -pub struct Iter { - pub(super) tree: Tree, - pub(super) hi: Bound, - pub(super) lo: Bound, - pub(super) cached_node: Option<(PageId, Node)>, - pub(super) going_forward: bool, -} - -impl Iter { - /// Iterate over the keys of this Tree - pub fn keys( - self, - ) -> impl DoubleEndedIterator> + Send + Sync { - self.map(|r| r.map(|(k, _v)| k)) - } - - /// Iterate over the values of this Tree - pub fn values( - self, - ) -> impl DoubleEndedIterator> + Send + Sync { - self.map(|r| r.map(|(_k, v)| v)) - } - - fn bounds_collapsed(&self) -> bool { - match (&self.lo, &self.hi) { - (Bound::Included(ref start), Bound::Included(ref end)) - | (Bound::Included(ref start), Bound::Excluded(ref end)) - | (Bound::Excluded(ref start), Bound::Included(ref end)) - | (Bound::Excluded(ref start), Bound::Excluded(ref end)) => { - start > end - } - _ => false, - } - } - - fn low_key(&self) -> &[u8] { - match self.lo { - Bound::Unbounded => &[], - Bound::Excluded(ref lo) | Bound::Included(ref lo) => lo.as_ref(), - } - } - - fn high_key(&self) -> &[u8] { - const MAX_KEY: &[u8] = &[255; 1024 * 1024]; - match self.hi { - Bound::Unbounded => MAX_KEY, - Bound::Excluded(ref hi) | Bound::Included(ref hi) => hi.as_ref(), - } - } - - pub(crate) fn next_inner(&mut self) -> Option<::Item> { - let guard = pin(); - let (mut pid, mut node) = if let (true, Some((pid, node))) = - (self.going_forward, self.cached_node.take()) - { - (pid, node) - } else { - let view = - iter_try!(self.tree.view_for_key(self.low_key(), &guard)); - (view.pid, view.deref().clone()) - }; - - for _ in 0..MAX_LOOPS { - if self.bounds_collapsed() { - return None; - } - - if !node.contains_upper_bound(&self.lo) { - // node too low (maybe merged, maybe exhausted?) - let view = - iter_try!(self.tree.view_for_key(self.low_key(), &guard)); - - pid = view.pid; - node = view.deref().clone(); - continue; - } else if !node.contains_lower_bound(&self.lo, true) { - // node too high (maybe split, maybe exhausted?) 
- let seek_key = possible_predecessor(node.lo())?; - let view = iter_try!(self.tree.view_for_key(seek_key, &guard)); - pid = view.pid; - node = view.deref().clone(); - continue; - } - - if let Some((key, value)) = node.successor(&self.lo) { - self.lo = Bound::Excluded(key.clone()); - self.cached_node = Some((pid, node)); - self.going_forward = true; - - match self.hi { - Bound::Unbounded => return Some(Ok((key, value))), - Bound::Included(ref h) if *h >= key => { - return Some(Ok((key, value))); - } - Bound::Excluded(ref h) if *h > key => { - return Some(Ok((key, value))); - } - _ => return None, - } - } else if let Some(hi) = node.hi() { - self.lo = Bound::Included(hi.into()); - continue; - } else { - return None; - } - } - panic!( - "fucked up tree traversal next({:?}) on {:?}", - self.lo, self.tree - ); - } -} - -impl Iterator for Iter { - type Item = Result<(IVec, IVec)>; - - fn next(&mut self) -> Option { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_scan); - let _cc = concurrency_control::read(); - self.next_inner() - } - - fn last(mut self) -> Option { - self.next_back() - } -} - -impl DoubleEndedIterator for Iter { - fn next_back(&mut self) -> Option { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_reverse_scan); - let guard = pin(); - let _cc = concurrency_control::read(); - - let (mut pid, mut node) = if let (false, Some((pid, node))) = - (self.going_forward, self.cached_node.take()) - { - (pid, node) - } else { - let view = - iter_try!(self.tree.view_for_key(self.high_key(), &guard)); - (view.pid, view.deref().clone()) - }; - - for _ in 0..MAX_LOOPS { - if self.bounds_collapsed() { - return None; - } - - if !node.contains_upper_bound(&self.hi) { - // node too low (maybe merged, maybe exhausted?) - let view = - iter_try!(self.tree.view_for_key(self.high_key(), &guard)); - - pid = view.pid; - node = view.deref().clone(); - continue; - } else if !node.contains_lower_bound(&self.hi, false) { - // node too high (maybe split, maybe exhausted?) 
- let seek_key = possible_predecessor(node.lo())?; - let view = iter_try!(self.tree.view_for_key(seek_key, &guard)); - pid = view.pid; - node = view.deref().clone(); - continue; - } - - if let Some((key, value)) = node.predecessor(&self.hi) { - self.hi = Bound::Excluded(key.clone()); - self.cached_node = Some((pid, node)); - self.going_forward = false; - - match self.lo { - Bound::Unbounded => return Some(Ok((key, value))), - Bound::Included(ref l) if *l <= key => { - return Some(Ok((key, value))); - } - Bound::Excluded(ref l) if *l < key => { - return Some(Ok((key, value))); - } - _ => return None, - } - } else if node.lo().is_empty() { - return None; - } else { - self.hi = Bound::Excluded(node.lo().into()); - continue; - } - } - panic!( - "fucked up tree traversal next_back({:?}) on {:?}", - self.hi, self.tree - ); - } -} - -#[test] -fn basic_functionality() { - assert_eq!(possible_predecessor(b""), None); - assert_eq!(possible_predecessor(&[0]), Some(vec![])); - assert_eq!(possible_predecessor(&[0, 0]), Some(vec![0])); - assert_eq!( - possible_predecessor(&[0, 1]), - Some(vec![0, 0, 255, 255, 255, 255]) - ); - assert_eq!( - possible_predecessor(&[0, 2]), - Some(vec![0, 1, 255, 255, 255, 255]) - ); - assert_eq!(possible_predecessor(&[1, 0]), Some(vec![1])); - assert_eq!( - possible_predecessor(&[1, 1]), - Some(vec![1, 0, 255, 255, 255, 255]) - ); - assert_eq!( - possible_predecessor(&[155]), - Some(vec![154, 255, 255, 255, 255]) - ); -} diff --git a/src/ivec.rs b/src/ivec.rs deleted file mode 100644 index ebd7de395..000000000 --- a/src/ivec.rs +++ /dev/null @@ -1,369 +0,0 @@ -#![allow(unsafe_code)] - -use std::{ - alloc::{alloc, dealloc, Layout}, - convert::TryFrom, - fmt, - hash::{Hash, Hasher}, - iter::FromIterator, - mem::size_of, - ops::{Deref, DerefMut}, - sync::atomic::{AtomicUsize, Ordering}, -}; - -const SZ: usize = size_of::(); -const CUTOFF: usize = SZ - 1; - -/// A buffer that may either be inline or remote and protected -/// by an Arc. The inner buffer is guaranteed to be aligned to -/// 8 byte boundaries. 
-#[repr(align(8))] -pub struct IVec([u8; SZ]); - -impl Clone for IVec { - fn clone(&self) -> IVec { - if !self.is_inline() { - self.deref_header().rc.fetch_add(1, Ordering::Relaxed); - } - IVec(self.0) - } -} - -impl Drop for IVec { - fn drop(&mut self) { - if !self.is_inline() { - let rc = self.deref_header().rc.fetch_sub(1, Ordering::Release) - 1; - - if rc == 0 { - let layout = Layout::from_size_align( - self.deref_header().len + size_of::(), - 8, - ) - .unwrap(); - - std::sync::atomic::fence(Ordering::Acquire); - - unsafe { - dealloc(self.remote_ptr() as *mut u8, layout); - } - } - } - } -} - -struct RemoteHeader { - rc: AtomicUsize, - len: usize, -} - -impl Deref for IVec { - type Target = [u8]; - - #[inline] - fn deref(&self) -> &[u8] { - if self.is_inline() { - &self.0[..self.inline_len()] - } else { - unsafe { - let data_ptr = self.remote_ptr().add(size_of::()); - let len = self.deref_header().len; - std::slice::from_raw_parts(data_ptr, len) - } - } - } -} - -impl AsRef<[u8]> for IVec { - #[inline] - fn as_ref(&self) -> &[u8] { - self - } -} - -impl DerefMut for IVec { - #[inline] - fn deref_mut(&mut self) -> &mut [u8] { - let inline_len = self.inline_len(); - if self.is_inline() { - &mut self.0[..inline_len] - } else { - self.make_mut(); - unsafe { - let data_ptr = self.remote_ptr().add(size_of::()); - let len = self.deref_header().len; - std::slice::from_raw_parts_mut(data_ptr as *mut u8, len) - } - } - } -} - -impl AsMut<[u8]> for IVec { - #[inline] - fn as_mut(&mut self) -> &mut [u8] { - self.deref_mut() - } -} - -impl Default for IVec { - fn default() -> Self { - Self::from(&[]) - } -} - -impl Hash for IVec { - fn hash(&self, state: &mut H) { - self.deref().hash(state); - } -} - -impl IVec { - fn new(slice: &[u8]) -> Self { - let mut data = [0_u8; SZ]; - if slice.len() <= CUTOFF { - data[SZ - 1] = (u8::try_from(slice.len()).unwrap() << 1) | 1; - data[..slice.len()].copy_from_slice(slice); - } else { - let layout = Layout::from_size_align( - slice.len() + size_of::(), - 8, - ) - .unwrap(); - - let header = RemoteHeader { rc: 1.into(), len: slice.len() }; - - unsafe { - let ptr = alloc(layout); - - std::ptr::write(ptr as *mut RemoteHeader, header); - std::ptr::copy_nonoverlapping( - slice.as_ptr(), - ptr.add(size_of::()), - slice.len(), - ); - std::ptr::write_unaligned(data.as_mut_ptr() as _, ptr); - } - - // assert that the bottom 3 bits are empty, as we expect - // the buffer to always have an alignment of 8 (2 ^ 3). 
- #[cfg(not(miri))] - assert_eq!(data[SZ - 1] & 0b111, 0); - } - Self(data) - } - - fn remote_ptr(&self) -> *const u8 { - assert!(!self.is_inline()); - unsafe { std::ptr::read(self.0.as_ptr() as *const *const u8) } - } - - fn deref_header(&self) -> &RemoteHeader { - assert!(!self.is_inline()); - unsafe { &*(self.remote_ptr() as *mut RemoteHeader) } - } - - #[cfg(miri)] - fn inline_len(&self) -> usize { - (self.trailer() >> 1) as usize - } - - #[cfg(miri)] - fn is_inline(&self) -> bool { - self.trailer() & 1 == 1 - } - - #[cfg(miri)] - fn trailer(&self) -> u8 { - self.deref()[SZ - 1] - } - - #[cfg(not(miri))] - const fn inline_len(&self) -> usize { - (self.trailer() >> 1) as usize - } - - #[cfg(not(miri))] - const fn is_inline(&self) -> bool { - self.trailer() & 1 == 1 - } - - #[cfg(not(miri))] - const fn trailer(&self) -> u8 { - self.0[SZ - 1] - } - - fn make_mut(&mut self) { - assert!(!self.is_inline()); - if self.deref_header().rc.load(Ordering::Acquire) != 1 { - *self = IVec::from(self.deref()) - } - } -} - -impl FromIterator for IVec { - fn from_iter(iter: T) -> Self - where - T: IntoIterator, - { - let bs: Vec = iter.into_iter().collect(); - bs.into() - } -} - -impl From<&[u8]> for IVec { - fn from(slice: &[u8]) -> Self { - IVec::new(slice) - } -} - -impl From<&str> for IVec { - fn from(s: &str) -> Self { - Self::from(s.as_bytes()) - } -} - -impl From for IVec { - fn from(s: String) -> Self { - Self::from(s.as_bytes()) - } -} - -impl From<&String> for IVec { - fn from(s: &String) -> Self { - Self::from(s.as_bytes()) - } -} - -impl From<&IVec> for IVec { - fn from(v: &Self) -> Self { - v.clone() - } -} - -impl From> for IVec { - fn from(v: Vec) -> Self { - IVec::new(&v) - } -} - -impl From> for IVec { - fn from(v: Box<[u8]>) -> Self { - IVec::new(&v) - } -} - -impl std::borrow::Borrow<[u8]> for IVec { - fn borrow(&self) -> &[u8] { - self.as_ref() - } -} - -impl std::borrow::Borrow<[u8]> for &IVec { - fn borrow(&self) -> &[u8] { - self.as_ref() - } -} - -impl From<&[u8; N]> for IVec { - fn from(v: &[u8; N]) -> Self { - Self::from(&v[..]) - } -} - -impl Ord for IVec { - fn cmp(&self, other: &Self) -> std::cmp::Ordering { - self.as_ref().cmp(other.as_ref()) - } -} - -impl PartialOrd for IVec { - fn partial_cmp(&self, other: &Self) -> Option { - Some(self.cmp(other)) - } -} - -impl> PartialEq for IVec { - fn eq(&self, other: &T) -> bool { - self.as_ref() == other.as_ref() - } -} - -impl PartialEq<[u8]> for IVec { - fn eq(&self, other: &[u8]) -> bool { - self.as_ref() == other - } -} - -impl Eq for IVec {} - -impl fmt::Debug for IVec { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - self.as_ref().fmt(f) - } -} - -#[cfg(test)] -mod qc { - use super::IVec; - - #[test] - fn ivec_usage() { - let iv1 = IVec::from(vec![1, 2, 3]); - assert_eq!(iv1, vec![1, 2, 3]); - let iv2 = IVec::from(&[4; 128][..]); - assert_eq!(iv2, vec![4; 128]); - } - - #[test] - fn boxed_slice_conversion() { - let boite1: Box<[u8]> = Box::new([1, 2, 3]); - let iv1: IVec = boite1.into(); - assert_eq!(iv1, vec![1, 2, 3]); - let boite2: Box<[u8]> = Box::new([4; 128]); - let iv2: IVec = boite2.into(); - assert_eq!(iv2, vec![4; 128]); - } - - #[test] - fn ivec_as_mut_identity() { - let initial = &[1]; - let mut iv = IVec::from(initial); - assert_eq!(initial, &*iv); - assert_eq!(initial, &mut *iv); - assert_eq!(initial, iv.as_mut()); - } - - fn prop_identity(ivec: &IVec) -> bool { - let mut iv2 = ivec.clone(); - - if iv2 != ivec { - println!("expected clone to equal original"); - return false; - } - - if *ivec != 
*iv2 { - println!("expected AsMut to equal original"); - return false; - } - - if *ivec != iv2.as_mut() { - println!("expected AsMut to equal original"); - return false; - } - - true - } - - quickcheck::quickcheck! { - #[cfg_attr(miri, ignore)] - fn ivec(item: IVec) -> bool { - prop_identity(&item) - } - } - - #[test] - fn ivec_bug_00() { - assert!(prop_identity(&IVec::new(&[ - 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, - ]))); - } -} diff --git a/src/lazy.rs b/src/lazy.rs deleted file mode 100644 index 91c523584..000000000 --- a/src/lazy.rs +++ /dev/null @@ -1,103 +0,0 @@ -//! This module exists because `lazy_static` causes TSAN to -//! be very unhappy. We rely heavily on TSAN for finding -//! races, so we don't use `lazy_static`. - -use std::sync::atomic::{ - AtomicBool, AtomicPtr, - Ordering::{Acquire, SeqCst}, -}; - -/// A lazily initialized value -pub struct Lazy { - value: AtomicPtr, - init_mu: AtomicBool, - init: F, -} - -impl Lazy { - /// Create a new Lazy - pub const fn new(init: F) -> Self - where - F: Sized, - { - Self { - value: AtomicPtr::new(std::ptr::null_mut()), - init_mu: AtomicBool::new(false), - init, - } - } -} - -impl Drop for Lazy { - fn drop(&mut self) { - let value_ptr = self.value.load(Acquire); - if !value_ptr.is_null() { - #[allow(unsafe_code)] - unsafe { - drop(Box::from_raw(value_ptr)) - } - } - } -} - -impl std::ops::Deref for Lazy -where - F: Fn() -> T, -{ - type Target = T; - - fn deref(&self) -> &T { - { - let value_ptr = self.value.load(Acquire); - if !value_ptr.is_null() { - #[allow(unsafe_code)] - unsafe { - return &*value_ptr; - } - } - } - - // We want to keep looping as long as it returns true, - // so we don't need any explicit conversion here. - while self - .init_mu - .compare_exchange(false, true, SeqCst, SeqCst) - .is_err() - { - // `hint::spin_loop` requires Rust 1.49. - #[allow(deprecated)] - std::sync::atomic::spin_loop_hint(); - } - - { - let value_ptr = self.value.load(Acquire); - // we need to check this again because - // maybe some other thread completed - // the initialization already. 
- if !value_ptr.is_null() { - let unlock = self.init_mu.swap(false, SeqCst); - assert!(unlock); - #[allow(unsafe_code)] - unsafe { - return &*value_ptr; - } - } - } - - { - let value = (self.init)(); - let value_ptr = Box::into_raw(Box::new(value)); - - let old = self.value.swap(value_ptr, SeqCst); - assert!(old.is_null()); - - let unlock = self.init_mu.swap(false, SeqCst); - assert!(unlock); - - #[allow(unsafe_code)] - unsafe { - &*value_ptr - } - } - } -} diff --git a/src/leaf.rs b/src/leaf.rs new file mode 100644 index 000000000..8822e86eb --- /dev/null +++ b/src/leaf.rs @@ -0,0 +1,94 @@ +use crate::*; + +#[derive(Debug, Clone, serde::Serialize, serde::Deserialize)] +pub(crate) struct Leaf { + pub lo: InlineArray, + pub hi: Option, + pub prefix_length: usize, + pub data: stack_map::StackMap, + pub in_memory_size: usize, + pub mutation_count: u64, + #[serde(skip)] + pub dirty_flush_epoch: Option, + #[serde(skip)] + pub page_out_on_flush: Option, + #[serde(skip)] + pub deleted: Option, + #[serde(skip)] + pub max_unflushed_epoch: Option, +} + +impl Leaf { + pub(crate) fn empty() -> Leaf { + Leaf { + lo: InlineArray::default(), + hi: None, + prefix_length: 0, + data: stack_map::StackMap::default(), + // this does not need to be marked as dirty until it actually + // receives inserted data + dirty_flush_epoch: None, + in_memory_size: std::mem::size_of::>(), + mutation_count: 0, + page_out_on_flush: None, + deleted: None, + max_unflushed_epoch: None, + } + } + + pub(crate) const fn is_empty(&self) -> bool { + self.data.is_empty() + } + + pub(crate) fn set_dirty_epoch(&mut self, epoch: FlushEpoch) { + assert!(self.deleted.is_none()); + if let Some(current_epoch) = self.dirty_flush_epoch { + assert!(current_epoch <= epoch); + } + if self.page_out_on_flush < Some(epoch) { + self.page_out_on_flush = None; + } + self.dirty_flush_epoch = Some(epoch); + } + + fn prefix(&self) -> &[u8] { + assert!(self.deleted.is_none()); + &self.lo[..self.prefix_length] + } + + pub(crate) fn get(&self, key: &[u8]) -> Option<&InlineArray> { + assert!(self.deleted.is_none()); + let prefixed_key = if self.prefix_length == 0 { + key + } else { + let prefix = self.prefix(); + assert!(key.starts_with(prefix)); + &key[self.prefix_length..] + }; + self.data.get(prefixed_key) + } + + pub(crate) fn insert( + &mut self, + key: InlineArray, + value: InlineArray, + ) -> Option { + assert!(self.deleted.is_none()); + let prefixed_key = if self.prefix_length == 0 { + key + } else { + let prefix = self.prefix(); + assert!(key.starts_with(prefix)); + key[self.prefix_length..].into() + }; + self.data.insert(prefixed_key, value) + } + + pub(crate) fn remove(&mut self, key: &[u8]) -> Option { + assert!(self.deleted.is_none()); + let prefix = self.prefix(); + assert!(key.starts_with(prefix)); + let partial_key = &key[self.prefix_length..]; + self.data.remove(partial_key) + } +} diff --git a/src/lib.rs b/src/lib.rs index e49760795..ab5baec03 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -1,50 +1,100 @@ -//! `sled` is an embedded database with +// 1.0 blockers +// +// bugs +// * tree predecessor holds lock on successor and tries to get it for predecessor. This will +// deadlock if used concurrently with write batches, which acquire locks lexicographically. 
+// * add merges to iterator test and assert it deadlocks +// * alternative is to merge right, not left +// * page-out needs to be deferred until after any flush of the dirty epoch +// * need to remove max_unflushed_epoch after flushing it +// * can't send reliable page-out request backwards from 7->6 +// * re-locking every mutex in a writebatch feels bad +// * need to signal stability status forward +// * maybe we already are +// * can make dirty_flush_epoch atomic and CAS it to 0 after flush +// * can change dirty_flush_epoch to unflushed_epoch +// * can always set mutation_count to max dirty flush epoch +// * this feels nice, we can lazily update a global stable flushed counter +// * can get rid of dirty_flush_epoch and page_out_on_flush? +// * or at least dirty_flush_epoch +// * dirty_flush_epoch really means "hasn't yet been cooperatively serialized @ F.E." +// * interesting metrics: +// * whether dirty for some epoch +// * whether cooperatively serialized for some epoch +// * whether fully flushed for some epoch +// * clean -> dirty -> {maybe coop} -> flushed +// * for page-out, we only care if it's stable or if we need to add it to +// a page-out priority queue +// +// reliability +// TODO make all writes wrapped in a Tearable wrapper that splits writes +// and can possibly crash based on a counter. +// TODO test concurrent drop_tree when other threads are still using it +// TODO list trees test for recovering empty collections +// TODO set explicit max key and value sizes w/ corresponding heap +// TODO add failpoints to writepath +// +// performance +// TODO handle prefix encoding +// TODO (minor) remove cache access for removed node in merge function +// TODO index+log hybrid - tinylsm key -> object location +// +// features +// TODO multi-collection batch +// +// misc +// TODO skim inlining output of RUSTFLAGS="-Cremark=all -Cdebuginfo=1" +// +// ~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1.0 cutoff ~~~~~~~~~~~~~~~~~~~~~~~~~~~ +// +// post-1.0 improvements +// +// reliability +// TODO bug hiding: if the crash_iter test panics, the test doesn't fail as expected +// TODO event log assertion for testing heap location bidirectional referential integrity, +// particularly in the object location mapper. +// TODO ensure nothing "from the future" gets copied into earlier epochs during GC +// TODO collection_id on page_in checks - it needs to be pinned w/ heap's EBR? +// TODO put aborts behind feature flags for hard crashes +// TODO re-enable transaction tests in test_tree.rs +// +// performance +// TODO force writers to flush when some number of dirty epochs have built up +// TODO serialize flush batch in parallel +// TODO concurrent serialization of NotYetSerialized dirty objects +// TODO make the Arc`, //! but with several additional capabilities for //! assisting creators of stateful systems. //! //! It is fully thread-safe, and all operations are -//! atomic. Most are fully non-blocking. Multiple -//! `Tree`s with isolated keyspaces are supported with the +//! atomic. Multiple `Tree`s with isolated keyspaces +//! are supported with the //! [`Db::open_tree`](struct.Db.html#method.open_tree) method. //! -//! ACID transactions involving reads and writes to -//! multiple items are supported with the -//! [`Tree::transaction`](struct.Tree.html#method.transaction) -//! method. Transactions may also operate over -//! multiple `Tree`s (see -//! [`Tree::transaction`](struct.Tree.html#method.transaction) -//! docs for more info). -//! -//! Users may also subscribe to updates on individual -//! 
`Tree`s by using the -//! [`Tree::watch_prefix`](struct.Tree.html#method.watch_prefix) -//! method, which returns a blocking `Iterator` over -//! updates to keys that begin with the provided -//! prefix. You may supply an empty prefix to subscribe -//! to everything. -//! -//! [Merge operators](https://github.com/spacejam/sled/wiki/merge-operators) -//! (aka read-modify-write operators) are supported. A -//! merge operator is a function that specifies -//! how new data can be merged into an existing value -//! without requiring both a read and a write. -//! Using the -//! [`Tree::merge`](struct.Tree.html#method.merge) -//! method, you may "push" data to a `Tree` value -//! and have the provided merge operator combine -//! it with the existing value, if there was one. -//! They are set on a per-`Tree` basis, and essentially -//! allow any sort of data structure to be built -//! using merges as an atomic high-level operation. -//! //! `sled` is built by experienced database engineers //! who think users should spend less time tuning and //! working against high-friction APIs. Expect //! significant ergonomic and performance improvements //! over time. Most surprises are bugs, so please -//! [let us know](mailto:t@jujit.su?subject=sled%20sucks!!!) if something -//! is high friction. +//! [let us know](mailto:tylerneely@gmail.com?subject=sled%20sucks!!!) +//! if something is high friction. //! //! # Examples //! @@ -68,10 +118,10 @@ //! let scan_key: &[u8] = b"a non-present key before yo!"; //! let mut iter = db.range(scan_key..); //! assert_eq!(&iter.next().unwrap().unwrap().0, b"yo!"); -//! assert_eq!(iter.next(), None); +//! assert!(iter.next().is_none()); //! //! db.remove(b"yo!"); -//! assert_eq!(db.get(b"yo!"), Ok(None)); +//! assert!(db.get(b"yo!").unwrap().is_none()); //! //! let other_tree: sled::Tree = db.open_tree(b"cool db facts").unwrap(); //! other_tree.insert( @@ -80,454 +130,287 @@ //! ).unwrap(); //! # let _ = std::fs::remove_dir_all("my_db"); //! 
``` -#![doc( - html_logo_url = "https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" -)] -#![cfg_attr( - feature = "testing", - deny( - missing_docs, - future_incompatible, - nonstandard_style, - rust_2018_idioms, - missing_copy_implementations, - trivial_casts, - trivial_numeric_casts, - unsafe_code, - unused_qualifications, - ) -)] -#![cfg_attr(feature = "testing", deny( - // over time, consider enabling the commented-out lints below - clippy::cast_lossless, - clippy::cast_possible_truncation, - clippy::cast_possible_wrap, - clippy::cast_precision_loss, - clippy::cast_sign_loss, - clippy::decimal_literal_representation, - clippy::doc_markdown, - // clippy::else_if_without_else, - clippy::empty_enum, - clippy::explicit_into_iter_loop, - clippy::explicit_iter_loop, - clippy::expl_impl_clone_on_copy, - clippy::fallible_impl_from, - clippy::filter_map_next, - clippy::float_arithmetic, - clippy::get_unwrap, - clippy::if_not_else, - // clippy::indexing_slicing, - clippy::inline_always, - //clippy::integer_arithmetic, - clippy::invalid_upcast_comparisons, - clippy::items_after_statements, - clippy::manual_find_map, - clippy::map_entry, - clippy::map_flatten, - clippy::match_like_matches_macro, - clippy::match_same_arms, - clippy::maybe_infinite_iter, - clippy::mem_forget, - // clippy::missing_docs_in_private_items, - clippy::module_name_repetitions, - clippy::multiple_inherent_impl, - clippy::mut_mut, - clippy::needless_borrow, - clippy::needless_continue, - clippy::needless_pass_by_value, - clippy::non_ascii_literal, - clippy::path_buf_push_overwrite, - clippy::print_stdout, - clippy::redundant_closure_for_method_calls, - // clippy::shadow_reuse, - clippy::shadow_same, - clippy::shadow_unrelated, - clippy::single_match_else, - clippy::string_add, - clippy::string_add_assign, - clippy::type_repetition_in_bounds, - clippy::unicode_not_nfc, - // clippy::unimplemented, - clippy::unseparated_literal_suffix, - clippy::used_underscore_binding, - clippy::wildcard_dependencies, +mod config; +mod db; +mod flush_epoch; +mod heap; +mod id_allocator; +mod leaf; +mod metadata_store; +mod object_cache; +mod object_location_mapper; +mod tree; + +#[cfg(any( + feature = "testing-shred-allocator", + feature = "testing-count-allocator" ))] -#![cfg_attr( - feature = "testing", - warn( - clippy::missing_const_for_fn, - clippy::multiple_crate_versions, - // clippy::wildcard_enum_match_arm, - ) -)] -#![allow(clippy::comparison_chain)] - -macro_rules! io_fail { - ($config:expr, $e:expr) => { - #[cfg(feature = "failpoints")] - { - debug_delay(); - if fail::is_active($e) { - $config.set_global_error(Error::FailPoint); - return Err(Error::FailPoint).into(); +pub mod alloc; + +#[cfg(feature = "for-internal-testing-only")] +mod event_verifier; + +#[inline] +fn debug_delay() { + #[cfg(debug_assertions)] + { + let rand = + std::time::SystemTime::UNIX_EPOCH.elapsed().unwrap().as_nanos(); + + if rand % 128 > 100 { + for _ in 0..rand % 16 { + std::thread::yield_now(); } } - }; + } } -macro_rules! 
testing_assert { - ($($e:expr),*) => { - #[cfg(feature = "testing")] - assert!($($e),*) - }; -} +pub use crate::config::Config; +pub use crate::db::Db; +pub use crate::tree::{Batch, Iter, Tree}; +pub use inline_array::InlineArray; -mod atomic_shim; -mod backoff; -mod batch; -mod cache_padded; -mod concurrency_control; -mod config; -mod context; -mod db; -mod dll; -mod ebr; -mod fastcmp; -mod fastlock; -mod fnv; -mod histogram; -mod iter; -mod ivec; -mod lazy; -mod lru; -mod meta; -#[cfg(feature = "metrics")] -mod metrics; -mod node; -mod oneshot; -mod pagecache; -mod result; -mod serialization; -mod stack; -mod subscriber; -mod sys_limits; -mod threadpool; -pub mod transaction; -mod tree; -mod varint; +const NAME_MAPPING_COLLECTION_ID: CollectionId = CollectionId(0); +const DEFAULT_COLLECTION_ID: CollectionId = CollectionId(1); +const INDEX_FANOUT: usize = 64; +const EBR_LOCAL_GC_BUFFER_SIZE: usize = 128; -/// Functionality for conditionally triggering failpoints under test. -#[cfg(feature = "failpoints")] -pub mod fail; +use std::collections::BTreeMap; +use std::num::NonZeroU64; +use std::ops::Bound; +use std::sync::Arc; -#[cfg(feature = "docs")] -pub mod doc; +use parking_lot::RwLock; -#[cfg(not(miri))] -mod flusher; +use crate::flush_epoch::{ + FlushEpoch, FlushEpochGuard, FlushEpochTracker, FlushInvariants, +}; +use crate::heap::{ + HeapStats, ObjectRecovery, SlabAddress, Update, WriteBatchStats, +}; +use crate::id_allocator::{Allocator, DeferredFree}; +use crate::leaf::Leaf; -#[cfg(feature = "event_log")] -/// The event log helps debug concurrency issues. -pub mod event_log; +// These are public so that they can be easily crash tested in external +// binaries. They are hidden because there are zero guarantees around their +// API stability or functionality. +#[doc(hidden)] +pub use crate::heap::{Heap, HeapRecovery}; +#[doc(hidden)] +pub use crate::metadata_store::MetadataStore; +#[doc(hidden)] +pub use crate::object_cache::{CacheStats, Dirty, FlushStats, ObjectCache}; /// Opens a `Db` with a default configuration at the /// specified path. This will create a new storage /// directory at the specified path if it does /// not already exist. You can use the `Db::was_recovered` /// method to determine if your database was recovered -/// from a previous instance. You can use `Config::create_new` -/// if you want to increase the chances that the database -/// will be freshly created. -pub fn open>(path: P) -> Result { +/// from a previous instance. +pub fn open>(path: P) -> std::io::Result { Config::new().path(path).open() } -/// Print a performance profile to standard out -/// detailing what the internals of the system are doing. -/// -/// Requires the `metrics` feature to be enabled, -/// which may introduce a bit of memory and overall -/// performance overhead as lots of metrics are -/// tallied up. Nevertheless, it is a useful -/// tool for quickly understanding the root of -/// a performance problem, and it can be invaluable -/// for including in any opened issues. 
-#[cfg(feature = "metrics")] -#[allow(clippy::print_stdout)] -pub fn print_profile() { - println!("{}", M.format_profile()); +#[derive(Debug, Copy, Clone)] +pub struct Stats { + pub cache: CacheStats, } -/// hidden re-export of items for testing purposes -#[doc(hidden)] -pub use self::{ - config::RunningConfig, - lazy::Lazy, - pagecache::{ - constants::{ - MAX_MSG_HEADER_LEN, MAX_SPACE_AMPLIFICATION, SEG_HEADER_LEN, - }, - BatchManifest, DiskPtr, Log, LogKind, LogOffset, LogRead, Lsn, - PageCache, PageId, - }, - serialization::Serialize, -}; - -pub use self::{ - batch::Batch, - config::{Config, Mode}, - db::Db, - iter::Iter, - ivec::IVec, - result::{Error, Result}, - subscriber::{Event, Subscriber}, - transaction::Transactional, - tree::{CompareAndSwapError, Tree}, -}; - -#[cfg(feature = "metrics")] -use self::{ - histogram::Histogram, - metrics::{clock, Measure, M}, -}; - -use { - self::{ - atomic_shim::{AtomicI64 as AtomicLsn, AtomicU64}, - backoff::Backoff, - cache_padded::CachePadded, - concurrency_control::Protector, - context::Context, - ebr::{ - pin as crossbeam_pin, Atomic, Guard as CrossbeamGuard, Owned, - Shared, - }, - fastcmp::fastcmp, - lru::Lru, - meta::Meta, - node::Node, - oneshot::{OneShot, OneShotFiller}, - result::CasResult, - subscriber::Subscribers, - tree::TreeInner, - }, - log::{debug, error, trace, warn}, - pagecache::{constants::MAX_BLOB, RecoveryGuard}, - parking_lot::{Condvar, Mutex, RwLock}, - std::{ - collections::BTreeMap, - convert::TryFrom, - fmt::{self, Debug}, - io::{Read, Write}, - sync::{ - atomic::{ - AtomicUsize, - Ordering::{Acquire, Relaxed, Release, SeqCst}, - }, - Arc, - }, - }, -}; +/// Compare and swap result. +/// +/// It returns `Ok(Ok(()))` if operation finishes successfully and +/// - `Ok(Err(CompareAndSwapError(current, proposed)))` if operation failed +/// to setup a new value. `CompareAndSwapError` contains current and +/// proposed values. +/// - `Err(Error::Unsupported)` if the database is opened in read-only mode. +/// otherwise. +pub type CompareAndSwapResult = std::io::Result< + std::result::Result, +>; + +type Index = concurrent_map::ConcurrentMap< + InlineArray, + Object, + INDEX_FANOUT, + EBR_LOCAL_GC_BUFFER_SIZE, +>; + +/// Compare and swap error. +#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)] +pub struct CompareAndSwapError { + /// The current value which caused your CAS to fail. + pub current: Option, + /// Returned value that was proposed unsuccessfully. + pub proposed: Option, +} -#[doc(hidden)] -pub fn pin() -> Guard { - Guard { inner: crossbeam_pin() } +/// Compare and swap success. +#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)] +pub struct CompareAndSwapSuccess { + /// The current value which was successfully installed. + pub new_value: Option, + /// Returned value that was previously stored. 
+ pub previous_value: Option, } -#[doc(hidden)] -pub struct Guard { - inner: CrossbeamGuard, +impl std::fmt::Display for CompareAndSwapError { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + write!(f, "Compare and swap conflict") + } } -impl std::ops::Deref for Guard { - type Target = CrossbeamGuard; +impl std::error::Error for CompareAndSwapError {} + +#[derive( + Debug, + Clone, + Copy, + serde::Serialize, + serde::Deserialize, + PartialOrd, + Ord, + PartialEq, + Eq, + Hash, +)] +pub struct ObjectId(NonZeroU64); - fn deref(&self) -> &CrossbeamGuard { - &self.inner +impl ObjectId { + fn new(from: u64) -> Option { + NonZeroU64::new(from).map(ObjectId) } } -#[derive(Debug)] -struct Conflict; +impl std::ops::Deref for ObjectId { + type Target = u64; -type Conflictable = std::result::Result; + fn deref(&self) -> &u64 { + let self_ref: &NonZeroU64 = &self.0; -fn crc32(buf: &[u8]) -> u32 { - let mut hasher = crc32fast::Hasher::new(); - hasher.update(buf); - hasher.finalize() -} + // NonZeroU64 is repr(transparent) where it wraps a u64 + // so it is guaranteed to match the binary layout. This + // makes it safe to cast a reference to one as a reference + // to the other like this. + let self_ptr: *const NonZeroU64 = self_ref as *const _; + let reference: *const u64 = self_ptr as *const u64; -fn calculate_message_crc32(header: &[u8], body: &[u8]) -> u32 { - trace!( - "calculating crc32 for header len {} body len {}", - header.len(), - body.len() - ); - let mut hasher = crc32fast::Hasher::new(); - hasher.update(body); - hasher.update(&header[4..]); - let crc32 = hasher.finalize(); - crc32 ^ 0xFFFF_FFFF + unsafe { &*reference } + } } -#[cfg(any(test, feature = "lock_free_delays"))] -mod debug_delay; - -#[cfg(any(test, feature = "lock_free_delays"))] -use debug_delay::debug_delay; - -/// This function is useful for inducing random jitter into our atomic -/// operations, shaking out more possible interleavings quickly. It gets -/// fully eliminated by the compiler in non-test code. -#[cfg(not(any(test, feature = "lock_free_delays")))] -const fn debug_delay() {} - -/// Link denotes a tree node or its modification fragment such as -/// key addition or removal. -#[derive(Clone, Debug, PartialEq)] -pub(crate) enum Link { - /// A new value is set for a given key - Set(IVec, IVec), - /// The kv pair at a particular index is removed - Del(IVec), - /// A child of this Index node is marked as mergable - ParentMergeIntention(PageId), - /// The merging child has been completely merged into its left sibling - ParentMergeConfirm, - /// A Node is marked for being merged into its left sibling - ChildMergeCap, +impl concurrent_map::Minimum for ObjectId { + const MIN: ObjectId = ObjectId(NonZeroU64::MIN); } -/// A fast map that is not resistant to collision attacks. Works -/// on 8 bytes at a time. -#[cfg(not(feature = "testing"))] -pub(crate) type FastMap8 = - std::collections::HashMap>; +#[derive( + Debug, + Clone, + Copy, + serde::Serialize, + serde::Deserialize, + PartialOrd, + Ord, + PartialEq, + Eq, + Hash, +)] +pub struct CollectionId(u64); -#[cfg(feature = "testing")] -pub(crate) type FastMap8 = BTreeMap; +impl concurrent_map::Minimum for CollectionId { + const MIN: CollectionId = CollectionId(u64::MIN); +} -/// A fast set that is not resistant to collision attacks. Works -/// on 8 bytes at a time. 
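The unsafe reference cast in `Deref for ObjectId` above relies on `NonZeroU64` sharing the layout of `u64`. The small check below, which is not part of the patch, spells that assumption out and shows the safe by-value alternative.

```rust
use std::mem::{align_of, size_of};
use std::num::NonZeroU64;

fn main() {
    // The Deref impl casts &NonZeroU64 to &u64. That is only sound because
    // NonZeroU64 is guaranteed to have the same size, alignment, and bit
    // layout as u64 (zero simply being an invalid value):
    assert_eq!(size_of::<NonZeroU64>(), size_of::<u64>());
    assert_eq!(align_of::<NonZeroU64>(), align_of::<u64>());

    // A purely safe alternative when a reference is not required:
    let id = NonZeroU64::new(7).unwrap();
    let as_u64: u64 = id.get();
    assert_eq!(as_u64, 7);
}
```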
-#[cfg(not(feature = "testing"))] -pub(crate) type FastSet8 = - std::collections::HashSet>; +#[derive(Debug, Clone)] +struct CacheBox { + leaf: Option>>, + #[allow(unused)] + logged_index: BTreeMap, +} -#[cfg(feature = "testing")] -pub(crate) type FastSet8 = std::collections::BTreeSet; +#[allow(unused)] +#[derive(Debug, Clone)] +struct LogValue { + location: SlabAddress, + value: Option, +} -#[cfg(not(feature = "testing"))] -use std::collections::HashMap as Map; +#[derive(Debug, Clone)] +pub struct Object { + object_id: ObjectId, + collection_id: CollectionId, + low_key: InlineArray, + inner: Arc>>, +} -// we avoid HashMap while testing because -// it makes tests non-deterministic -#[cfg(feature = "testing")] -use std::collections::{BTreeMap as Map, BTreeSet as Set}; +impl PartialEq for Object { + fn eq(&self, other: &Self) -> bool { + self.object_id == other.object_id + } +} -/// A function that may be configured on a particular shared `Tree` -/// that will be applied as a kind of read-modify-write operator -/// to any values that are written using the `Tree::merge` method. -/// -/// The first argument is the key. The second argument is the -/// optional existing value that was in place before the -/// merged value being applied. The Third argument is the -/// data being merged into the item. -/// -/// You may return `None` to delete the value completely. -/// -/// Merge operators are shared by all instances of a particular -/// `Tree`. Different merge operators may be set on different -/// `Tree`s. -/// -/// # Examples -/// -/// ``` -/// # fn main() -> Result<(), Box> { -/// use sled::{Config, IVec}; -/// -/// fn concatenate_merge( -/// _key: &[u8], // the key being merged -/// old_value: Option<&[u8]>, // the previous value, if one existed -/// merged_bytes: &[u8] // the new bytes being merged in -/// ) -> Option> { // set the new value, return None to delete -/// let mut ret = old_value -/// .map(|ov| ov.to_vec()) -/// .unwrap_or_else(|| vec![]); -/// -/// ret.extend_from_slice(merged_bytes); -/// -/// Some(ret) -/// } -/// -/// let config = Config::new() -/// .temporary(true); -/// -/// let tree = config.open()?; -/// tree.set_merge_operator(concatenate_merge); -/// -/// let k = b"k1"; -/// -/// tree.insert(k, vec![0]); -/// tree.merge(k, vec![1]); -/// tree.merge(k, vec![2]); -/// assert_eq!(tree.get(k), Ok(Some(IVec::from(vec![0, 1, 2])))); -/// -/// // Replace previously merged data. The merge function will not be called. -/// tree.insert(k, vec![3]); -/// assert_eq!(tree.get(k), Ok(Some(IVec::from(vec![3])))); -/// -/// // Merges on non-present values will cause the merge function to be called -/// // with `old_value == None`. If the merge function returns something (which it -/// // does, in this case) a new value will be inserted. -/// tree.remove(k); -/// tree.merge(k, vec![4]); -/// assert_eq!(tree.get(k), Ok(Some(IVec::from(vec![4])))); -/// # Ok(()) } -/// ``` -pub trait MergeOperator: - Send + Sync + Fn(&[u8], Option<&[u8]>, &[u8]) -> Option> -{ +/// Stored on `Db` and `Tree` in an Arc, so that when the +/// last "high-level" struct is dropped, the flusher thread +/// is cleaned up. 
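The `CacheBox` and `Object` types above keep the leaf body behind an `Option`, which presumably encodes residency: the index entry always exists, while the boxed leaf is `Some` only while it is cached. A stripped-down sketch of that slot shape, using std's `RwLock` and a stand-in `Leaf` type so it compiles on its own (the patch itself uses `parking_lot`):

```rust
use std::sync::{Arc, RwLock};

// Stand-in type for illustration; the real leaf is defined elsewhere in this patch.
struct Leaf;

// The slot itself is always reachable from the index; the boxed leaf body is
// Some only while the leaf is resident in the cache.
struct Slot {
    leaf: Arc<RwLock<Option<Box<Leaf>>>>,
}

impl Slot {
    fn is_resident(&self) -> bool {
        self.leaf.read().expect("lock poisoned").is_some()
    }
}
```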
+struct ShutdownDropper { + shutdown_sender: parking_lot::Mutex< + std::sync::mpsc::Sender>, + >, + cache: parking_lot::Mutex>, } -impl MergeOperator for F where - F: Send + Sync + Fn(&[u8], Option<&[u8]>, &[u8]) -> Option> -{ + +impl Drop for ShutdownDropper { + fn drop(&mut self) { + let (tx, rx) = std::sync::mpsc::channel(); + log::debug!("sending shutdown signal to flusher"); + if self.shutdown_sender.lock().send(tx).is_ok() { + if let Err(e) = rx.recv() { + log::error!("failed to shut down flusher thread: {:?}", e); + } else { + log::debug!("flush thread successfully terminated"); + } + } else { + log::debug!( + "failed to shut down flusher, manually flushing ObjectCache" + ); + let cache = self.cache.lock(); + if let Err(e) = cache.flush() { + log::error!( + "Db flusher encountered error while flushing: {:?}", + e + ); + cache.set_error(&e); + } + } + } } -mod compile_time_assertions { - use crate::*; - - #[allow(unreachable_code)] - const fn _assert_public_types_send_sync() { - _assert_send::(); - - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); - _assert_send_sync::(); +fn map_bound U>(bound: Bound, f: F) -> Bound { + match bound { + Bound::Unbounded => Bound::Unbounded, + Bound::Included(x) => Bound::Included(f(x)), + Bound::Excluded(x) => Bound::Excluded(f(x)), } +} - const fn _assert_send() {} +const fn _assert_public_types_send_sync() { + use std::fmt::Debug; - const fn _assert_send_sync() {} -} + const fn _assert_send() {} -#[cfg(all(unix, not(miri)))] -fn maybe_fsync_directory>( - path: P, -) -> std::io::Result<()> { - std::fs::File::open(path)?.sync_all() -} + const fn _assert_send_sync() {} + + /* + _assert_send::(); + _assert_send_sync::(); + _assert_send_sync::(); + _assert_send_sync::(); + */ + + _assert_send::(); -#[cfg(any(not(unix), miri))] -fn maybe_fsync_directory>( - _: P, -) -> std::io::Result<()> { - Ok(()) + _assert_send_sync::(); + _assert_send_sync::(); + _assert_send_sync::(); + _assert_send_sync::(); + _assert_send_sync::(); } diff --git a/src/lru.rs b/src/lru.rs deleted file mode 100644 index f69eaa5b8..000000000 --- a/src/lru.rs +++ /dev/null @@ -1,475 +0,0 @@ -#![allow(unsafe_code)] - -use std::{ - borrow::{Borrow, BorrowMut}, - convert::TryFrom, - hash::{Hash, Hasher}, - mem::MaybeUninit, - sync::atomic::{AtomicPtr, AtomicUsize, Ordering}, -}; - -use crate::{ - atomic_shim::AtomicU64, - debug_delay, - dll::{DoublyLinkedList, Node}, - fastlock::FastLock, - FastSet8, Guard, PageId, -}; - -#[cfg(any(test, feature = "lock_free_delays"))] -const MAX_QUEUE_ITEMS: usize = 4; - -#[cfg(not(any(test, feature = "lock_free_delays")))] -const MAX_QUEUE_ITEMS: usize = 64; - -#[cfg(any(test, feature = "lock_free_delays"))] -const N_SHARDS: usize = 2; - -#[cfg(not(any(test, feature = "lock_free_delays")))] -const N_SHARDS: usize = 256; - -struct AccessBlock { - len: AtomicUsize, - block: [AtomicU64; MAX_QUEUE_ITEMS], - next: AtomicPtr, -} - -impl Default for AccessBlock { - fn default() -> AccessBlock { - AccessBlock { - len: AtomicUsize::new(0), - block: unsafe { MaybeUninit::zeroed().assume_init() }, - next: AtomicPtr::default(), - } - } -} - -struct AccessQueue { - writing: AtomicPtr, - full_list: AtomicPtr, -} - -impl AccessBlock { - fn new(item: CacheAccess) -> AccessBlock { - let mut ret = AccessBlock { - len: AtomicUsize::new(1), - block: unsafe { MaybeUninit::zeroed().assume_init() }, - 
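`ShutdownDropper::drop` above uses a small channel handshake: it sends a fresh reply channel to the flusher, waits for the flusher to acknowledge, and only falls back to flushing inline if the flusher is already gone. The same handshake in isolation, with hypothetical names and std's `mpsc` (the real code keeps its sender behind a `Mutex`):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical worker demonstrating the ack-on-shutdown handshake: the owner
// sends a reply channel, the worker acknowledges on it just before exiting.
fn spawn_worker() -> mpsc::Sender<mpsc::Sender<()>> {
    let (shutdown_tx, shutdown_rx) = mpsc::channel::<mpsc::Sender<()>>();
    thread::spawn(move || {
        // ... background work would happen here ...
        if let Ok(ack) = shutdown_rx.recv() {
            // final cleanup, then acknowledge so the owner can return
            let _ = ack.send(());
        }
    });
    shutdown_tx
}

fn shut_down(shutdown_tx: &mpsc::Sender<mpsc::Sender<()>>) {
    let (ack_tx, ack_rx) = mpsc::channel();
    if shutdown_tx.send(ack_tx).is_ok() {
        let _ = ack_rx.recv(); // worker is alive: block until it finishes cleanup
    } else {
        // worker already exited: the caller must clean up inline instead
    }
}
```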
next: AtomicPtr::default(), - }; - ret.block[0] = AtomicU64::from(u64::from(item)); - ret - } -} - -impl Default for AccessQueue { - fn default() -> AccessQueue { - AccessQueue { - writing: AtomicPtr::new(Box::into_raw(Box::new( - AccessBlock::default(), - ))), - full_list: AtomicPtr::default(), - } - } -} - -impl AccessQueue { - fn push(&self, item: CacheAccess) -> bool { - loop { - debug_delay(); - let head = self.writing.load(Ordering::Acquire); - let block = unsafe { &*head }; - - debug_delay(); - let offset = block.len.fetch_add(1, Ordering::Acquire); - - if offset < MAX_QUEUE_ITEMS { - let item_u64: u64 = item.into(); - assert_ne!(item_u64, 0); - debug_delay(); - unsafe { - block - .block - .get_unchecked(offset) - .store(item_u64, Ordering::Release); - } - return false; - } else { - // install new writer - let new = Box::into_raw(Box::new(AccessBlock::new(item))); - debug_delay(); - let res = self.writing.compare_exchange( - head, - new, - Ordering::AcqRel, - Ordering::Acquire, - ); - if res.is_err() { - // we lost the CAS, free the new item that was - // never published to other threads - unsafe { - drop(Box::from_raw(new)); - } - continue; - } - - // push the now-full item to the full list for future - // consumption - let mut ret; - let mut full_list_ptr = self.full_list.load(Ordering::Acquire); - while { - // we loop because maybe other threads are pushing stuff too - block.next.store(full_list_ptr, Ordering::Release); - debug_delay(); - ret = self.full_list.compare_exchange( - full_list_ptr, - head, - Ordering::AcqRel, - Ordering::Acquire, - ); - ret.is_err() - } { - full_list_ptr = ret.unwrap_err(); - } - return true; - } - } - } - - fn take<'a>(&self, guard: &'a Guard) -> CacheAccessIter<'a> { - debug_delay(); - let ptr = self.full_list.swap(std::ptr::null_mut(), Ordering::AcqRel); - - CacheAccessIter { guard, current_offset: 0, current_block: ptr } - } -} - -impl Drop for AccessQueue { - fn drop(&mut self) { - debug_delay(); - let writing = self.writing.load(Ordering::Acquire); - unsafe { - Box::from_raw(writing); - } - debug_delay(); - let mut head = self.full_list.load(Ordering::Acquire); - while !head.is_null() { - unsafe { - debug_delay(); - let next = - (*head).next.swap(std::ptr::null_mut(), Ordering::Release); - Box::from_raw(head); - head = next; - } - } - } -} - -struct CacheAccessIter<'a> { - guard: &'a Guard, - current_offset: usize, - current_block: *mut AccessBlock, -} - -impl<'a> Iterator for CacheAccessIter<'a> { - type Item = CacheAccess; - - fn next(&mut self) -> Option { - while !self.current_block.is_null() { - let current_block = unsafe { &*self.current_block }; - - debug_delay(); - if self.current_offset >= MAX_QUEUE_ITEMS { - let to_drop_ptr = self.current_block; - debug_delay(); - self.current_block = current_block.next.load(Ordering::Acquire); - self.current_offset = 0; - debug_delay(); - let to_drop = unsafe { Box::from_raw(to_drop_ptr) }; - self.guard.defer(|| to_drop); - continue; - } - - let mut next = 0; - while next == 0 { - // we spin here because there's a race between bumping - // the offset and setting the value to something other - // than 0 (and 0 is an invalid value) - debug_delay(); - next = current_block.block[self.current_offset] - .load(Ordering::Acquire); - } - self.current_offset += 1; - return Some(CacheAccess::from(next)); - } - - None - } -} - -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub(crate) struct CacheAccess { - // safe because MAX_PID_BITS / N_SHARDS < u32::MAX - // 2**37 / 2**8 < 2**32 - pub pid: u32, - pub sz: u8, 
-} - -impl From for u64 { - fn from(ca: CacheAccess) -> u64 { - (u64::from(ca.pid) << 8) | u64::from(ca.sz) - } -} - -#[allow(clippy::fallible_impl_from)] -impl From for CacheAccess { - fn from(u: u64) -> CacheAccess { - let sz = usize::try_from((u << 56) >> 56).unwrap(); - assert_ne!(sz, 0); - let pid = u >> 8; - assert!(pid < u64::from(u32::MAX)); - CacheAccess { - pid: u32::try_from(pid).unwrap(), - sz: u8::try_from(sz).unwrap(), - } - } -} - -impl CacheAccess { - fn size(&self) -> usize { - 1 << usize::from(self.sz) - } - - fn new(pid: PageId, sz: usize) -> CacheAccess { - let rounded_up_power_of_2 = - u8::try_from(sz.next_power_of_two().trailing_zeros()).unwrap(); - - CacheAccess { - pid: u32::try_from(pid).expect("expected caller to shift pid down"), - sz: rounded_up_power_of_2, - } - } -} - -/// A simple LRU cache. -pub struct Lru { - shards: Vec<(AccessQueue, FastLock)>, -} - -impl Lru { - /// Instantiates a new `Lru` cache. - pub(crate) fn new(cache_capacity: usize) -> Self { - assert!( - cache_capacity >= N_SHARDS, - "Please configure the cache \ - capacity to be at least 256 bytes" - ); - let shard_capacity = cache_capacity / N_SHARDS; - - let mut shards = Vec::with_capacity(N_SHARDS); - shards.resize_with(N_SHARDS, || { - (AccessQueue::default(), FastLock::new(Shard::new(shard_capacity))) - }); - - Self { shards } - } - - /// Called when an item is accessed. Returns a Vec of items to be - /// evicted. Uses flat-combining to avoid blocking on what can - /// be an asynchronous operation. - /// - /// layout: - /// items: 1 2 3 4 5 6 7 8 9 10 - /// shards: 1 0 1 0 1 0 1 0 1 0 - /// shard 0: 2 4 6 8 10 - /// shard 1: 1 3 5 7 9 - pub(crate) fn accessed( - &self, - id: PageId, - item_size: usize, - guard: &Guard, - ) -> Vec { - const SHARD_BITS: usize = N_SHARDS.trailing_zeros() as usize; - - let mut ret = vec![]; - let shards = N_SHARDS as u64; - let (shard_idx, shifted_pid) = (id % shards, id >> SHARD_BITS); - let (access_queue, shard_mu) = &self.shards[safe_usize(shard_idx)]; - - let cache_access = CacheAccess::new(shifted_pid, item_size); - let filled = access_queue.push(cache_access); - - if filled { - // only try to acquire this if the access queue has filled - // an entire segment - if let Some(mut shard) = shard_mu.try_lock() { - let accesses = access_queue.take(guard); - for item in accesses { - let to_evict = shard.accessed(item); - // map shard internal offsets to global items ids - for pos in to_evict { - let address = - (PageId::from(pos) << SHARD_BITS) + shard_idx; - ret.push(address); - } - } - } - } - ret - } -} - -#[derive(Eq)] -struct Entry(*mut Node); - -unsafe impl Send for Entry {} - -impl Ord for Entry { - fn cmp(&self, other: &Entry) -> std::cmp::Ordering { - let left_pid: u32 = *self.borrow(); - let right_pid: u32 = *other.borrow(); - left_pid.cmp(&right_pid) - } -} - -impl PartialOrd for Entry { - fn partial_cmp(&self, other: &Entry) -> Option { - Some(self.cmp(other)) - } -} - -impl PartialEq for Entry { - fn eq(&self, other: &Entry) -> bool { - unsafe { (*self.0).pid == (*other.0).pid } - } -} - -impl BorrowMut for Entry { - fn borrow_mut(&mut self) -> &mut CacheAccess { - unsafe { &mut *self.0 } - } -} - -impl Borrow for Entry { - fn borrow(&self) -> &CacheAccess { - unsafe { &*self.0 } - } -} - -impl Borrow for Entry { - fn borrow(&self) -> &u32 { - unsafe { &(*self.0).pid } - } -} - -// we only hash on pid, since we will change -// sz sometimes and we access the item by pid -impl Hash for Entry { - fn hash(&self, hasher: &mut H) { - unsafe { 
(*self.0).pid.hash(hasher) } - } -} - -struct Shard { - dll: DoublyLinkedList, - entries: FastSet8, - capacity: usize, - size: usize, -} - -impl Shard { - fn new(capacity: usize) -> Self { - assert!(capacity > 0, "shard capacity must be non-zero"); - - Self { - dll: DoublyLinkedList::default(), - entries: FastSet8::default(), - capacity, - size: 0, - } - } - - /// `PageId`s in the shard list are indexes of the entries. - fn accessed(&mut self, cache_access: CacheAccess) -> Vec { - if let Some(entry) = self.entries.get(&cache_access.pid) { - let old_sz_po2 = unsafe { (*entry.0).swap_sz(cache_access.sz) }; - let old_size = 1 << usize::from(old_sz_po2); - - self.size -= old_size; - self.dll.promote(entry.0); - } else { - let ptr = self.dll.push_head(cache_access); - self.entries.insert(Entry(ptr)); - }; - - self.size += cache_access.size(); - - let mut to_evict = vec![]; - - while self.size > self.capacity { - if self.dll.len() == 1 { - // don't evict what we just added - break; - } - - let node = self.dll.pop_tail().unwrap(); - - assert!(self.entries.remove(&node.pid)); - - to_evict.push(node.pid); - - self.size -= node.size(); - - // NB: node is stored in our entries map - // via a raw pointer, which points to - // the same allocation used in the DLL. - // We have to be careful to free node - // only after removing it from both - // the DLL and our entries map. - drop(node); - } - - to_evict - } -} - -#[inline] -fn safe_usize(value: PageId) -> usize { - usize::try_from(value).unwrap() -} - -#[test] -fn lru_smoke_test() { - use crate::pin; - - let lru = Lru::new(2); - for i in 0..1000 { - let guard = pin(); - lru.accessed(i, 16, &guard); - } -} - -#[test] -fn lru_access_test() { - use crate::pin; - - let ci = CacheAccess::new(6, 20667); - assert_eq!(ci.size(), 32 * 1024); - - let lru = Lru::new(4096); - - let guard = pin(); - - assert_eq!(lru.accessed(0, 20667, &guard), vec![]); - assert_eq!(lru.accessed(2, 20667, &guard), vec![]); - assert_eq!(lru.accessed(4, 20667, &guard), vec![]); - assert_eq!(lru.accessed(6, 20667, &guard), vec![]); - assert_eq!(lru.accessed(8, 20667, &guard), vec![0, 2, 4]); - assert_eq!(lru.accessed(10, 20667, &guard), vec![]); - assert_eq!(lru.accessed(12, 20667, &guard), vec![]); - assert_eq!(lru.accessed(14, 20667, &guard), vec![]); - assert_eq!(lru.accessed(16, 20667, &guard), vec![6, 8, 10, 12]); - assert_eq!(lru.accessed(18, 20667, &guard), vec![]); - assert_eq!(lru.accessed(20, 20667, &guard), vec![]); - assert_eq!(lru.accessed(22, 20667, &guard), vec![]); - assert_eq!(lru.accessed(24, 20667, &guard), vec![14, 16, 18, 20]); -} diff --git a/src/meta.rs b/src/meta.rs deleted file mode 100644 index 3de1e88c3..000000000 --- a/src/meta.rs +++ /dev/null @@ -1,108 +0,0 @@ -use crate::*; - -/// A simple map that can be used to store metadata -/// for the pagecache tenant. 
-#[derive(Clone, Debug, Eq, PartialEq, Default)] -pub struct Meta { - pub(crate) inner: BTreeMap, -} - -impl Meta { - /// Retrieve the `PageId` associated with an identifier - pub(crate) fn get_root(&self, table: &[u8]) -> Option { - self.inner.get(table).cloned() - } - - /// Set the `PageId` associated with an identifier - pub(crate) fn set_root( - &mut self, - name: IVec, - pid: PageId, - ) -> Option { - self.inner.insert(name, pid) - } - - /// Remove the page mapping for a given identifier - pub(crate) fn del_root(&mut self, name: &[u8]) -> Option { - self.inner.remove(name) - } -} - -/// Open or create a new disk-backed Tree with its own keyspace, -/// accessible from the `Db` via the provided identifier. -pub(crate) fn open_tree( - context: &Context, - raw_name: V, - guard: &Guard, -) -> Result -where - V: Into, -{ - let name = raw_name.into(); - - // we loop because creating this Tree may race with - // concurrent attempts to open the same one. - loop { - match context.pagecache.meta_pid_for_name(&name, guard) { - Ok(root_id) => { - assert_ne!(root_id, 0); - return Ok(Tree(Arc::new(TreeInner { - tree_id: name, - context: context.clone(), - subscribers: Subscribers::default(), - root: AtomicU64::new(root_id), - merge_operator: RwLock::new(None), - }))); - } - Err(Error::CollectionNotFound) => {} - Err(other) => return Err(other), - } - - // set up empty leaf - let leaf = Node::new_empty_leaf(); - let (leaf_id, leaf_ptr) = context.pagecache.allocate(leaf, guard)?; - - trace!( - "allocated pid {} for leaf in new_tree for namespace {:?}", - leaf_id, - name - ); - - // set up root index - - // vec![0] represents a prefix-encoded empty prefix - let root = Node::new_root(leaf_id); - let (root_id, root_ptr) = context.pagecache.allocate(root, guard)?; - - debug!("allocated pid {} for root of new_tree {:?}", root_id, name); - - let res = context.pagecache.cas_root_in_meta( - &name, - None, - Some(root_id), - guard, - )?; - - if res.is_err() { - // clean up the tree we just created if we couldn't - // install it. - let _ = context - .pagecache - .free(root_id, root_ptr, guard)? - .expect("could not free allocated page"); - let _ = context - .pagecache - .free(leaf_id, leaf_ptr, guard)? 
- .expect("could not free allocated page"); - continue; - } - - return Ok(Tree(Arc::new(TreeInner { - tree_id: name, - subscribers: Subscribers::default(), - context: context.clone(), - root: AtomicU64::new(root_id), - merge_operator: RwLock::new(None), - }))); - } -} diff --git a/src/metadata_store.rs b/src/metadata_store.rs new file mode 100644 index 000000000..836c6d2fa --- /dev/null +++ b/src/metadata_store.rs @@ -0,0 +1,846 @@ +use std::collections::BTreeSet; +use std::fs; +use std::io::{self, Read, Write}; +use std::num::NonZeroU64; +use std::path::{Path, PathBuf}; +use std::sync::{ + atomic::{AtomicPtr, AtomicU64, Ordering}, + Arc, +}; + +use crossbeam_channel::{bounded, unbounded, Receiver, Sender}; +use fault_injection::{annotate, fallible, maybe}; +use fnv::FnvHashMap; +use inline_array::InlineArray; +use parking_lot::Mutex; +use rayon::prelude::*; +use zstd::stream::read::Decoder as ZstdDecoder; +use zstd::stream::write::Encoder as ZstdEncoder; + +use crate::{heap::UpdateMetadata, CollectionId, ObjectId}; + +const WARN: &str = "DO_NOT_PUT_YOUR_FILES_HERE"; +const TMP_SUFFIX: &str = ".tmp"; +const LOG_PREFIX: &str = "log"; +const SNAPSHOT_PREFIX: &str = "snapshot"; + +const ZSTD_LEVEL: i32 = 3; + +// NB: intentionally does not implement Clone, and +// the Inner::drop code relies on this invariant for +// now so that we don't free the global error until +// all high-level structs are dropped. This is not +// hard to change over time though, just a current +// invariant. +pub struct MetadataStore { + inner: Inner, + is_shut_down: bool, +} + +impl Drop for MetadataStore { + fn drop(&mut self) { + if self.is_shut_down { + return; + } + + self.shutdown_inner(); + self.is_shut_down = true; + } +} + +struct MetadataRecovery { + recovered: Vec, + id_for_next_log: u64, + snapshot_size: u64, +} + +struct LogAndStats { + file: fs::File, + bytes_written: u64, + log_sequence_number: u64, +} + +enum WorkerMessage { + Shutdown(Sender<()>), + LogReadyToCompact { log_and_stats: LogAndStats }, +} + +fn get_compactions( + rx: &mut Receiver, +) -> Result, Option>> { + let mut ret = vec![]; + + match rx.recv() { + Ok(WorkerMessage::Shutdown(tx)) => { + return Err(Some(tx)); + } + Ok(WorkerMessage::LogReadyToCompact { log_and_stats }) => { + ret.push(log_and_stats.log_sequence_number); + } + Err(e) => { + log::error!( + "metadata store worker thread unable to receive message, unexpected shutdown: {e:?}" + ); + return Err(None); + } + } + + // scoop up any additional logs that have built up while we were busy compacting + loop { + match rx.try_recv() { + Ok(WorkerMessage::Shutdown(tx)) => { + tx.send(()).unwrap(); + return Err(Some(tx)); + } + Ok(WorkerMessage::LogReadyToCompact { log_and_stats }) => { + ret.push(log_and_stats.log_sequence_number); + } + Err(_timeout) => return Ok(ret), + } + } +} + +fn worker( + mut rx: Receiver, + mut last_snapshot_lsn: u64, + inner: Inner, +) { + loop { + if let Err(error) = check_error(&inner.global_error) { + drop(inner); + + log::error!( + "compaction thread terminating after global error set to {:?}", + error + ); + + return; + } + + match get_compactions(&mut rx) { + Ok(log_ids) => { + assert_eq!(log_ids[0], last_snapshot_lsn + 1); + + let write_res = read_snapshot_and_apply_logs( + &inner.storage_directory, + log_ids.into_iter().collect(), + Some(last_snapshot_lsn), + &inner.directory_lock, + ); + match write_res { + Err(e) => { + set_error(&inner.global_error, &e); + log::error!("log compactor thread encountered error: {:?} - setting global fatal error and 
shutting down compactions", e); + return; + } + Ok(recovery) => { + inner + .snapshot_size + .store(recovery.snapshot_size, Ordering::SeqCst); + last_snapshot_lsn = + recovery.id_for_next_log.checked_sub(1).unwrap(); + } + } + } + Err(Some(tx)) => { + drop(inner); + if let Err(e) = tx.send(()) { + log::error!("log compactor failed to send shutdown ack to system: {e:?}"); + } + return; + } + Err(None) => { + return; + } + } + } +} + +fn set_error( + global_error: &AtomicPtr<(io::ErrorKind, String)>, + error: &io::Error, +) { + let kind = error.kind(); + let reason = error.to_string(); + + let boxed = Box::new((kind, reason)); + let ptr = Box::into_raw(boxed); + + if global_error + .compare_exchange( + std::ptr::null_mut(), + ptr, + Ordering::SeqCst, + Ordering::SeqCst, + ) + .is_err() + { + // global fatal error already installed, drop this one + unsafe { + drop(Box::from_raw(ptr)); + } + } +} + +fn check_error( + global_error: &AtomicPtr<(io::ErrorKind, String)>, +) -> io::Result<()> { + let err_ptr: *const (io::ErrorKind, String) = + global_error.load(Ordering::Acquire); + + if err_ptr.is_null() { + Ok(()) + } else { + let deref: &(io::ErrorKind, String) = unsafe { &*err_ptr }; + Err(io::Error::new(deref.0, deref.1.clone())) + } +} + +#[derive(Clone)] +struct Inner { + global_error: Arc>, + active_log: Arc>, + snapshot_size: Arc, + storage_directory: PathBuf, + directory_lock: Arc, + worker_outbox: Sender, +} + +impl Drop for Inner { + fn drop(&mut self) { + // NB: this is the only place where the global error should be + // reclaimed in the whole sled codebase, as this Inner is only held + // by the background writer and the heap (in an Arc) so when this + // drop happens, it's because the whole system is going down, not + // because any particular Db instance that may have been cloned + // by a thread is dropping. + let error_ptr = + self.global_error.swap(std::ptr::null_mut(), Ordering::Acquire); + if !error_ptr.is_null() { + unsafe { + drop(Box::from_raw(error_ptr)); + } + } + } +} + +impl MetadataStore { + pub fn get_global_error_arc( + &self, + ) -> Arc> { + self.inner.global_error.clone() + } + + fn shutdown_inner(&mut self) { + let (tx, rx) = bounded(1); + if self.inner.worker_outbox.send(WorkerMessage::Shutdown(tx)).is_ok() { + let _ = rx.recv(); + } + + self.set_error(&io::Error::new( + io::ErrorKind::Other, + "system has been shut down".to_string(), + )); + + self.is_shut_down = true; + } + + fn check_error(&self) -> io::Result<()> { + check_error(&self.inner.global_error) + } + + fn set_error(&self, error: &io::Error) { + set_error(&self.inner.global_error, error); + } + + /// Returns the writer handle `MetadataStore`, a sorted array of metadata, and a sorted array + /// of free keys. 
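`get_compactions` above blocks for one message and then drains whatever else queued up while the worker was busy, so a single compaction pass can fold in several logs at once. The same scoop-up pattern in isolation, using std's `mpsc` for a dependency-free sketch (the patch itself uses `crossbeam_channel`):

```rust
use std::sync::mpsc::Receiver;

// Block for the first item, then opportunistically drain everything already
// queued; returns an empty batch if the channel has been disconnected.
fn drain_pending<T>(rx: &Receiver<T>) -> Vec<T> {
    let mut batch = Vec::new();
    if let Ok(first) = rx.recv() {
        batch.push(first);
        while let Ok(item) = rx.try_recv() {
            batch.push(item);
        }
    }
    batch
}
```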
+ pub fn recover>( + storage_directory: P, + ) -> io::Result<( + // Metadata writer + MetadataStore, + // Metadata - node id, value, user data + Vec, + )> { + use fs2::FileExt; + + // TODO NOCOMMIT + let sync_status = std::process::Command::new("sync") + .status() + .map(|status| status.success()); + + if !matches!(sync_status, Ok(true)) { + log::warn!( + "sync command before recovery failed: {:?}", + sync_status + ); + } + + let path = storage_directory.as_ref(); + + // initialize directories if not present + if let Err(e) = fs::read_dir(path) { + if e.kind() == io::ErrorKind::NotFound { + fallible!(fs::create_dir_all(path)); + } + } + + let _ = fs::File::create(path.join(WARN)); + + let directory_lock = fallible!(fs::File::open(path)); + fallible!(directory_lock.sync_all()); + fallible!(directory_lock.try_lock_exclusive()); + + let recovery = + MetadataStore::recover_inner(&storage_directory, &directory_lock)?; + + let new_log = LogAndStats { + log_sequence_number: recovery.id_for_next_log, + bytes_written: 0, + file: fallible!(fs::File::create(log_path( + path, + recovery.id_for_next_log + ))), + }; + + let (tx, rx) = unbounded(); + + let inner = Inner { + snapshot_size: Arc::new(recovery.snapshot_size.into()), + storage_directory: path.into(), + directory_lock: Arc::new(directory_lock), + global_error: Default::default(), + active_log: Arc::new(Mutex::new(new_log)), + worker_outbox: tx, + }; + + let worker_inner = inner.clone(); + + let spawn_res = std::thread::Builder::new() + .name("sled_flusher".into()) + .spawn(move || { + worker( + rx, + recovery.id_for_next_log.checked_sub(1).unwrap(), + worker_inner, + ) + }); + + if let Err(e) = spawn_res { + return Err(io::Error::new( + io::ErrorKind::Other, + format!( + "unable to spawn metadata compactor thread for sled database: {:?}", + e + ), + )); + } + + Ok((MetadataStore { inner, is_shut_down: false }, recovery.recovered)) + } + + /// Returns the recovered mappings, the id for the next log file, the highest allocated object id, and the set of free ids + fn recover_inner>( + storage_directory: P, + directory_lock: &fs::File, + ) -> io::Result { + let path = storage_directory.as_ref(); + + log::debug!("opening MetadataStore at {:?}", path); + + let (log_ids, snapshot_id_opt) = enumerate_logs_and_snapshot(path)?; + + read_snapshot_and_apply_logs( + path, + log_ids, + snapshot_id_opt, + directory_lock, + ) + } + + /// Write a batch of metadata. `None` for the second half of the outer tuple represents a + /// deletion. Returns the bytes written. 
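`recover` above takes an exclusive advisory lock on the storage directory (via `fs2`) before doing anything else, so a second process cannot open the same files concurrently; the handle is then kept alive for directory fsyncs. A minimal sketch of just that guard, with an illustrative function name:

```rust
use std::fs;
use std::io;
use std::path::Path;

use fs2::FileExt;

// Open the storage directory and hold an exclusive advisory lock on it for the
// lifetime of the returned handle; fails fast if another process holds it.
fn lock_directory(path: &Path) -> io::Result<fs::File> {
    let dir = fs::File::open(path)?;
    dir.try_lock_exclusive()?;
    Ok(dir)
}
```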
+ pub fn write_batch(&self, batch: &[UpdateMetadata]) -> io::Result { + self.check_error()?; + + let batch_bytes = serialize_batch(batch); + let ret = batch_bytes.len() as u64; + + let mut log = self.inner.active_log.lock(); + + if let Err(e) = maybe!(log.file.write_all(&batch_bytes)) { + self.set_error(&e); + return Err(e); + } + + if let Err(e) = maybe!(log.file.sync_all()) + .and_then(|_| self.inner.directory_lock.sync_all()) + { + self.set_error(&e); + return Err(e); + } + + log.bytes_written += batch_bytes.len() as u64; + + if log.bytes_written + > self.inner.snapshot_size.load(Ordering::Acquire).max(64 * 1024) + { + let next_offset = log.log_sequence_number + 1; + let next_path = + log_path(&self.inner.storage_directory, next_offset); + + // open new log + let mut next_log_file_opts = fs::OpenOptions::new(); + next_log_file_opts.create(true).read(true).write(true); + + let next_log_file = match maybe!(next_log_file_opts.open(next_path)) + { + Ok(nlf) => nlf, + Err(e) => { + self.set_error(&e); + return Err(e); + } + }; + + let next_log_and_stats = LogAndStats { + file: next_log_file, + log_sequence_number: next_offset, + bytes_written: 0, + }; + + // replace log + let old_log_and_stats = + std::mem::replace(&mut *log, next_log_and_stats); + + // send to snapshot writer + self.inner + .worker_outbox + .send(WorkerMessage::LogReadyToCompact { + log_and_stats: old_log_and_stats, + }) + .expect("unable to send log to compact to worker"); + } + + Ok(ret) + } +} + +fn serialize_batch(batch: &[UpdateMetadata]) -> Vec { + // we initialize the vector to contain placeholder bytes for the frame length + let batch_bytes = 0_u64.to_le_bytes().to_vec(); + + // write format: + // 6 byte LE frame length (in bytes, not items) + // 2 byte crc of the frame length + // payload: + // zstd encoded 8 byte LE key + // zstd encoded 8 byte LE value + // repeated for each kv pair + // LE encoded crc32 of length + payload raw bytes, XOR 0xAF to make non-zero in empty case + let mut batch_encoder = ZstdEncoder::new(batch_bytes, ZSTD_LEVEL).unwrap(); + + for update_metadata in batch { + match update_metadata { + UpdateMetadata::Store { + object_id, + collection_id, + low_key, + location, + } => { + batch_encoder + .write_all(&object_id.0.get().to_le_bytes()) + .unwrap(); + batch_encoder + .write_all(&collection_id.0.to_le_bytes()) + .unwrap(); + batch_encoder.write_all(&location.get().to_le_bytes()).unwrap(); + + let low_key_len: u64 = low_key.len() as u64; + batch_encoder.write_all(&low_key_len.to_le_bytes()).unwrap(); + batch_encoder.write_all(&low_key).unwrap(); + } + UpdateMetadata::Free { object_id, collection_id } => { + batch_encoder + .write_all(&object_id.0.get().to_le_bytes()) + .unwrap(); + batch_encoder + .write_all(&collection_id.0.to_le_bytes()) + .unwrap(); + // heap location + batch_encoder.write_all(&0_u64.to_le_bytes()).unwrap(); + // metadata len + batch_encoder.write_all(&0_u64.to_le_bytes()).unwrap(); + } + } + } + + let mut batch_bytes = batch_encoder.finish().unwrap(); + + let batch_len = batch_bytes.len().checked_sub(8).unwrap(); + batch_bytes[..8].copy_from_slice(&batch_len.to_le_bytes()); + assert_eq!(&[0, 0], &batch_bytes[6..8]); + + let len_hash: [u8; 2] = + (crc32fast::hash(&batch_bytes[..6]) as u16).to_le_bytes(); + + batch_bytes[6..8].copy_from_slice(&len_hash); + + let hash: u32 = crc32fast::hash(&batch_bytes) ^ 0xAF; + let hash_bytes: [u8; 4] = hash.to_le_bytes(); + batch_bytes.extend_from_slice(&hash_bytes); + + batch_bytes +} + +fn read_frame( + file: &mut fs::File, + 
reusable_frame_buffer: &mut Vec, +) -> io::Result> { + let mut frame_size_with_crc_buf: [u8; 8] = [0; 8]; + // TODO only break if UnexpectedEof, otherwise propagate + fallible!(file.read_exact(&mut frame_size_with_crc_buf)); + + let expected_len_hash_buf = + [frame_size_with_crc_buf[6], frame_size_with_crc_buf[7]]; + + let actual_len_hash_buf: [u8; 2] = + (crc32fast::hash(&frame_size_with_crc_buf[..6]) as u16).to_le_bytes(); + + // clear crc bytes before turning into usize + let mut frame_size_buf = frame_size_with_crc_buf; + frame_size_buf[6] = 0; + frame_size_buf[7] = 0; + + if actual_len_hash_buf != expected_len_hash_buf { + return Err(annotate!(io::Error::new( + io::ErrorKind::InvalidData, + "corrupt frame length" + ))); + } + + let len_u64: u64 = u64::from_le_bytes(frame_size_buf); + let len: usize = usize::try_from(len_u64).unwrap(); + + reusable_frame_buffer.clear(); + reusable_frame_buffer.reserve(len + 12); + unsafe { + reusable_frame_buffer.set_len(len + 12); + } + reusable_frame_buffer[..8].copy_from_slice(&frame_size_with_crc_buf); + + fallible!(file.read_exact(&mut reusable_frame_buffer[8..])); + + let crc_actual = crc32fast::hash(&reusable_frame_buffer[..len + 8]) ^ 0xAF; + let crc_recorded = u32::from_le_bytes([ + reusable_frame_buffer[len + 8], + reusable_frame_buffer[len + 9], + reusable_frame_buffer[len + 10], + reusable_frame_buffer[len + 11], + ]); + + if crc_actual != crc_recorded { + log::warn!("encountered incorrect crc for batch in log"); + return Err(annotate!(io::Error::new( + io::ErrorKind::InvalidData, + "crc mismatch for read of batch frame", + ))); + } + + let mut ret = vec![]; + + let mut decoder = ZstdDecoder::new(&reusable_frame_buffer[8..len + 8]) + .expect("failed to create zstd decoder"); + + let mut object_id_buf: [u8; 8] = [0; 8]; + let mut collection_id_buf: [u8; 8] = [0; 8]; + let mut location_buf: [u8; 8] = [0; 8]; + let mut low_key_len_buf: [u8; 8] = [0; 8]; + let mut low_key_buf = vec![]; + loop { + let first_read_res = decoder + .read_exact(&mut object_id_buf) + .and_then(|_| decoder.read_exact(&mut collection_id_buf)) + .and_then(|_| decoder.read_exact(&mut location_buf)) + .and_then(|_| decoder.read_exact(&mut low_key_len_buf)); + + if let Err(e) = first_read_res { + if e.kind() != io::ErrorKind::UnexpectedEof { + return Err(e); + } else { + break; + } + } + + let object_id_u64 = u64::from_le_bytes(object_id_buf); + + let object_id = if let Some(object_id) = ObjectId::new(object_id_u64) { + object_id + } else { + return Err(annotate!(io::Error::new( + io::ErrorKind::InvalidData, + "corrupt object ID 0 somehow passed crc check" + ))); + }; + + let collection_id = CollectionId(u64::from_le_bytes(collection_id_buf)); + let location = u64::from_le_bytes(location_buf); + + let low_key_len_raw = u64::from_le_bytes(low_key_len_buf); + let low_key_len = usize::try_from(low_key_len_raw).unwrap(); + + low_key_buf.reserve(low_key_len); + unsafe { + low_key_buf.set_len(low_key_len); + } + + decoder + .read_exact(&mut low_key_buf) + .expect("we expect reads from crc-verified buffers to succeed"); + + if let Some(location_nzu) = NonZeroU64::new(location) { + let low_key = InlineArray::from(&*low_key_buf); + + ret.push(UpdateMetadata::Store { + object_id, + collection_id, + location: location_nzu, + low_key, + }); + } else { + ret.push(UpdateMetadata::Free { object_id, collection_id }); + } + } + + Ok(ret) +} + +// returns the deduplicated data in this log, along with an optional offset where a +// final torn write occurred. 
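For reviewers of the framing code in `serialize_batch`/`read_frame` above, the header layout is: a 6-byte little-endian payload length, a 2-byte truncated crc32 of those six length bytes, then the zstd payload, then a crc32 of header plus payload XORed with 0xAF so it is non-zero even for an empty frame. The helper below rebuilds only the 8-byte header and is illustrative, not part of the patch (it assumes the `crc32fast` crate already used here):

```rust
// Build the 8-byte frame header: the low 6 bytes carry the payload length,
// the last 2 bytes carry a truncated crc32 of those length bytes.
fn frame_header(payload_len: u64) -> [u8; 8] {
    let mut header = payload_len.to_le_bytes();
    assert_eq!(&header[6..8], &[0u8, 0], "frame length must fit in 6 bytes");
    let len_crc: [u8; 2] = (crc32fast::hash(&header[..6]) as u16).to_le_bytes();
    header[6..8].copy_from_slice(&len_crc);
    header
}
```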
+fn read_log( + directory_path: &Path, + lsn: u64, +) -> io::Result> { + log::trace!("reading log {lsn}"); + let mut ret = FnvHashMap::default(); + + let mut file = fallible!(fs::File::open(log_path(directory_path, lsn))); + + let mut reusable_frame_buffer: Vec = vec![]; + + while let Ok(frame) = read_frame(&mut file, &mut reusable_frame_buffer) { + for update_metadata in frame { + ret.insert(update_metadata.object_id(), update_metadata); + } + } + + log::trace!("recovered {} items in log {}", ret.len(), lsn); + + Ok(ret) +} + +/// returns the data from the snapshot as well as the size of the snapshot +fn read_snapshot( + directory_path: &Path, + lsn: u64, +) -> io::Result<(FnvHashMap, u64)> { + log::trace!("reading snapshot {lsn}"); + let mut reusable_frame_buffer: Vec = vec![]; + let mut file = + fallible!(fs::File::open(snapshot_path(directory_path, lsn, false))); + let size = fallible!(file.metadata()).len(); + let raw_frame = read_frame(&mut file, &mut reusable_frame_buffer)?; + + let frame: FnvHashMap = raw_frame + .into_iter() + .map(|update_metadata| (update_metadata.object_id(), update_metadata)) + .collect(); + + log::trace!("recovered {} items in snapshot {}", frame.len(), lsn); + + Ok((frame, size)) +} + +fn log_path(directory_path: &Path, id: u64) -> PathBuf { + directory_path.join(format!("{LOG_PREFIX}_{:016x}", id)) +} + +fn snapshot_path(directory_path: &Path, id: u64, temporary: bool) -> PathBuf { + if temporary { + directory_path + .join(format!("{SNAPSHOT_PREFIX}_{:016x}{TMP_SUFFIX}", id)) + } else { + directory_path.join(format!("{SNAPSHOT_PREFIX}_{:016x}", id)) + } +} + +fn enumerate_logs_and_snapshot( + directory_path: &Path, +) -> io::Result<(BTreeSet, Option)> { + let mut logs = BTreeSet::new(); + let mut snapshot: Option = None; + + for dir_entry_res in fallible!(fs::read_dir(directory_path)) { + let dir_entry = fallible!(dir_entry_res); + let file_name = if let Ok(f) = dir_entry.file_name().into_string() { + f + } else { + log::warn!( + "skipping unexpected file with non-unicode name {:?}", + dir_entry.file_name() + ); + continue; + }; + + if file_name.ends_with(TMP_SUFFIX) { + log::warn!("removing incomplete snapshot rewrite {file_name:?}"); + fallible!(fs::remove_file(directory_path.join(file_name))); + } else if file_name.starts_with(LOG_PREFIX) { + let start = LOG_PREFIX.len() + 1; + let stop = start + 16; + + if let Ok(id) = u64::from_str_radix(&file_name[start..stop], 16) { + logs.insert(id); + } else { + todo!() + } + } else if file_name.starts_with(SNAPSHOT_PREFIX) { + let start = SNAPSHOT_PREFIX.len() + 1; + let stop = start + 16; + + if let Ok(id) = u64::from_str_radix(&file_name[start..stop], 16) { + if let Some(snap_id) = snapshot { + if snap_id < id { + log::warn!( + "removing stale snapshot {id} that is superceded by snapshot {id}" + ); + + if let Err(e) = fs::remove_file(&file_name) { + log::warn!( + "failed to remove stale snapshot file {:?}: {:?}", + file_name, e + ); + } + + snapshot = Some(id); + } + } else { + snapshot = Some(id); + } + } else { + todo!() + } + } + } + + let snap_id = snapshot.unwrap_or(0); + for stale_log_id in logs.range(..=snap_id) { + let file_name = log_path(directory_path, *stale_log_id); + + log::warn!("removing stale log {file_name:?} that is contained within snapshot {snap_id}"); + + fallible!(fs::remove_file(file_name)); + } + logs.retain(|l| *l > snap_id); + + Ok((logs, snapshot)) +} + +fn read_snapshot_and_apply_logs( + path: &Path, + log_ids: BTreeSet, + snapshot_id_opt: Option, + locked_directory: &fs::File, +) -> 
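`log_path`, `snapshot_path`, and `enumerate_logs_and_snapshot` above rely on fixed-width, zero-padded hex file names, so lexicographic order matches numeric order and the id can be sliced back out at a fixed offset. A small self-contained sketch of that naming scheme (function names are illustrative):

```rust
// "{prefix}_{id:016x}" gives fixed-width names like "log_00000000000000ff".
fn format_name(prefix: &str, id: u64) -> String {
    format!("{prefix}_{id:016x}")
}

// Mirror of the fixed-offset slicing used during recovery: skip "<prefix>_",
// then parse exactly 16 hex digits.
fn parse_id(prefix: &str, file_name: &str) -> Option<u64> {
    let start = prefix.len() + 1;
    let hex = file_name.get(start..start + 16)?;
    u64::from_str_radix(hex, 16).ok()
}
```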
io::Result { + let (snapshot_tx, snapshot_rx) = bounded(1); + if let Some(snapshot_id) = snapshot_id_opt { + let path: PathBuf = path.into(); + rayon::spawn(move || { + let snap_res = read_snapshot(&path, snapshot_id) + .map(|(snapshot, _snapshot_len)| snapshot); + snapshot_tx.send(snap_res).unwrap(); + }); + } else { + snapshot_tx.send(Ok(Default::default())).unwrap(); + } + + let mut max_log_id = snapshot_id_opt.unwrap_or(0); + + let log_data_res: io::Result< + Vec<(u64, FnvHashMap)>, + > = (&log_ids) //.iter().collect::>()) + .into_par_iter() + .map(move |log_id| { + if let Some(snapshot_id) = snapshot_id_opt { + assert!(*log_id > snapshot_id); + } + + let log_data = read_log(path, *log_id)?; + + Ok((*log_id, log_data)) + }) + .collect(); + + let mut recovered: FnvHashMap = + snapshot_rx.recv().unwrap()?; + + log::trace!("recovered snapshot contains {recovered:?}"); + + for (log_id, log_datum) in log_data_res? { + max_log_id = max_log_id.max(log_id); + + for (object_id, update_metadata) in log_datum { + if matches!(update_metadata, UpdateMetadata::Store { .. }) { + recovered.insert(object_id, update_metadata); + } else { + let previous = recovered.remove(&object_id); + if previous.is_none() { + log::trace!("recovered a Free for {object_id:?} without a preceeding Store"); + } + } + } + } + + let mut recovered: Vec = recovered.into_values().collect(); + + recovered.par_sort_unstable(); + + // write fresh snapshot with recovered data + let new_snapshot_data = serialize_batch(&recovered); + let snapshot_size = new_snapshot_data.len() as u64; + + let new_snapshot_tmp_path = snapshot_path(path, max_log_id, true); + log::trace!("writing snapshot to {new_snapshot_tmp_path:?}"); + + let mut snapshot_file_opts = fs::OpenOptions::new(); + snapshot_file_opts.create(true).read(false).write(true); + + let mut snapshot_file = + fallible!(snapshot_file_opts.open(&new_snapshot_tmp_path)); + + fallible!(snapshot_file.write_all(&new_snapshot_data)); + drop(new_snapshot_data); + + fallible!(snapshot_file.sync_all()); + + let new_snapshot_path = snapshot_path(path, max_log_id, false); + log::trace!("renaming written snapshot to {new_snapshot_path:?}"); + fallible!(fs::rename(new_snapshot_tmp_path, new_snapshot_path)); + fallible!(locked_directory.sync_all()); + + for log_id in &log_ids { + let log_path = log_path(path, *log_id); + fallible!(fs::remove_file(log_path)); + } + + if let Some(old_snapshot_id) = snapshot_id_opt { + let old_snapshot_path = snapshot_path(path, old_snapshot_id, false); + fallible!(fs::remove_file(old_snapshot_path)); + } + + Ok(MetadataRecovery { + recovered, + id_for_next_log: max_log_id + 1, + snapshot_size, + }) +} diff --git a/src/metrics.rs b/src/metrics.rs deleted file mode 100644 index 348b908ff..000000000 --- a/src/metrics.rs +++ /dev/null @@ -1,416 +0,0 @@ -#![allow(unused_results)] -#![allow(clippy::cast_precision_loss)] -#![allow(clippy::cast_possible_truncation)] -#![allow(clippy::float_arithmetic)] - -use std::sync::atomic::AtomicUsize; - -#[cfg(not(target_arch = "x86_64"))] -use std::time::{Duration, Instant}; - -#[cfg(not(feature = "metrics"))] -use std::marker::PhantomData; - -#[cfg(feature = "metrics")] -use std::sync::atomic::Ordering::{Acquire, Relaxed}; - -use num_format::{Locale, ToFormattedString}; - -use crate::Lazy; - -use super::*; - -/// A metric collector for all pagecache users running in this -/// process. 
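The snapshot rewrite in `read_snapshot_and_apply_logs` above follows the usual crash-safe replacement protocol: write the new snapshot under a `.tmp` name, fsync it, rename it over the final name, fsync the directory, and only then delete the logs and the old snapshot it supersedes. A reduced sketch of that protocol with illustrative names (directory fsync shown via a fresh handle; the patch reuses its long-lived locked directory handle):

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Durably replace `final_name` inside `dir` with `bytes`: a crash at any point
// leaves either the old complete file or the new complete file, never a torn one.
fn replace_file_atomically(dir: &Path, final_name: &str, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{final_name}.tmp"));
    let dst = dir.join(final_name);

    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // contents are durable under the temporary name

    fs::rename(&tmp, &dst)?; // atomic swap of directory entries
    fs::File::open(dir)?.sync_all()?; // make the rename itself durable (unix)
    Ok(())
}
```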
-pub static M: Lazy Metrics> = Lazy::new(Metrics::default); - -#[allow(clippy::cast_precision_loss)] -pub(crate) fn clock() -> u64 { - if cfg!(not(feature = "metrics")) { - 0 - } else { - #[cfg(target_arch = "x86_64")] - #[allow(unsafe_code)] - unsafe { - let mut aux = 0; - core::arch::x86_64::__rdtscp(&mut aux) - } - - #[cfg(not(target_arch = "x86_64"))] - { - let u = uptime(); - (u.as_secs() * 1_000_000_000) + u64::from(u.subsec_nanos()) - } - } -} - -// not correct, since it starts counting at the first observance... -#[cfg(not(target_arch = "x86_64"))] -pub(crate) fn uptime() -> Duration { - static START: Lazy Instant> = Lazy::new(Instant::now); - - if cfg!(not(feature = "metrics")) { - Duration::new(0, 0) - } else { - START.elapsed() - } -} - -/// Measure the duration of an event, and call `Histogram::measure()`. -pub struct Measure<'h> { - start: u64, - histo: &'h Histogram, - #[cfg(not(feature = "metrics"))] - _pd: PhantomData<&'h ()>, -} - -impl<'h> Measure<'h> { - /// The time delta from ctor to dtor is recorded in `histo`. - #[inline] - #[allow(unused_variables)] - pub fn new(histo: &'h Histogram) -> Measure<'h> { - Measure { - #[cfg(not(feature = "metrics"))] - _pd: PhantomData, - histo, - start: clock(), - } - } -} - -impl<'h> Drop for Measure<'h> { - #[inline] - fn drop(&mut self) { - #[cfg(feature = "metrics")] - self.histo.measure(clock() - self.start); - } -} - -#[derive(Default, Debug)] -pub struct Metrics { - pub accountant_bump_tip: Histogram, - pub accountant_hold: Histogram, - pub accountant_lock: Histogram, - pub accountant_mark_link: Histogram, - pub accountant_mark_replace: Histogram, - pub accountant_next: Histogram, - pub accountant_stabilize: Histogram, - pub advance_snapshot: Histogram, - pub assign_offset: Histogram, - pub bytes_written_heap_item: CachePadded, - pub bytes_written_heap_ptr: CachePadded, - pub bytes_written_replace: CachePadded, - pub bytes_written_link: CachePadded, - pub bytes_written_other: CachePadded, - pub fuzzy_snapshot: Histogram, - pub compress: Histogram, - pub decompress: Histogram, - pub deserialize: Histogram, - pub get_page: Histogram, - pub get_pagetable: Histogram, - pub link_page: Histogram, - pub log_reservation_attempts: CachePadded, - pub log_reservations: CachePadded, - pub make_stable: Histogram, - pub page_out: Histogram, - pub pull: Histogram, - pub read: Histogram, - pub read_segment_message: Histogram, - pub replace_page: Histogram, - pub reserve_lat: Histogram, - pub reserve_sz: Histogram, - pub rewrite_page: Histogram, - pub segment_read: Histogram, - pub segment_utilization_startup: Histogram, - pub segment_utilization_shutdown: Histogram, - pub serialize: Histogram, - pub snapshot_apply: Histogram, - pub start_pagecache: Histogram, - pub start_segment_accountant: Histogram, - pub tree_cas: Histogram, - pub tree_child_split_attempt: CachePadded, - pub tree_child_split_success: CachePadded, - pub tree_del: Histogram, - pub tree_get: Histogram, - pub tree_loops: CachePadded, - pub tree_merge: Histogram, - pub tree_parent_split_attempt: CachePadded, - pub tree_parent_split_success: CachePadded, - pub tree_reverse_scan: Histogram, - pub tree_root_split_attempt: CachePadded, - pub tree_root_split_success: CachePadded, - pub tree_scan: Histogram, - pub tree_set: Histogram, - pub tree_start: Histogram, - pub tree_traverse: Histogram, - pub write_to_log: Histogram, - pub written_bytes: Histogram, -} - -impl Metrics { - #[inline] - pub fn tree_looped(&self) { - self.tree_loops.fetch_add(1, Relaxed); - } - - #[inline] - pub fn 
log_reservation_attempted(&self) { - self.log_reservation_attempts.fetch_add(1, Relaxed); - } - - #[inline] - pub fn log_reservation_success(&self) { - self.log_reservations.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_child_split_attempt(&self) { - self.tree_child_split_attempt.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_child_split_success(&self) { - self.tree_child_split_success.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_parent_split_attempt(&self) { - self.tree_parent_split_attempt.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_parent_split_success(&self) { - self.tree_parent_split_success.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_root_split_attempt(&self) { - self.tree_root_split_attempt.fetch_add(1, Relaxed); - } - - #[inline] - pub fn tree_root_split_success(&self) { - self.tree_root_split_success.fetch_add(1, Relaxed); - } - - pub fn format_profile(&self) -> String { - let mut ret = String::new(); - ret.push_str(&format!( - "sled profile:\n\ - {0: >17} | {1: >10} | {2: >10} | {3: >10} | {4: >10} | {5: >10} | {6: >10} | {7: >10} | {8: >10} | {9: >10}\n", - "op", - "min (us)", - "med (us)", - "90 (us)", - "99 (us)", - "99.9 (us)", - "99.99 (us)", - "max (us)", - "count", - "sum (s)" - )); - ret.push_str(&format!("{}\n", "-".repeat(134))); - - let p = - |mut tuples: Vec<(String, _, _, _, _, _, _, _, _, _)>| -> String { - tuples.sort_by_key(|t| (t.9 * -1. * 1e3) as i64); - let mut to_ret = String::new(); - for v in tuples { - to_ret.push_str(&format!( - "{0: >17} | {1: >10.1} | {2: >10.1} | {3: >10.1} \ - | {4: >10.1} | {5: >10.1} | {6: >10.1} | {7: >10.1} \ - | {8: >10.1} | {9: >10.1}\n", - v.0, v.1, v.2, v.3, v.4, v.5, v.6, v.7, v.8, v.9, - )); - } - to_ret - }; - - let lat = |name: &str, histo: &Histogram| { - ( - name.to_string(), - histo.percentile(0.) / 1e3, - histo.percentile(50.) / 1e3, - histo.percentile(90.) / 1e3, - histo.percentile(99.) / 1e3, - histo.percentile(99.9) / 1e3, - histo.percentile(99.99) / 1e3, - histo.percentile(100.) 
/ 1e3, - histo.count(), - histo.sum() as f64 / 1e9, - ) - }; - - let sz = |name: &str, histo: &Histogram| { - ( - name.to_string(), - histo.percentile(0.), - histo.percentile(50.), - histo.percentile(90.), - histo.percentile(99.), - histo.percentile(99.9), - histo.percentile(99.99), - histo.percentile(100.), - histo.count(), - histo.sum() as f64, - ) - }; - - ret.push_str("tree:\n"); - - ret.push_str(&p(vec![ - lat("traverse", &self.tree_traverse), - lat("get", &self.tree_get), - lat("set", &self.tree_set), - lat("merge", &self.tree_merge), - lat("del", &self.tree_del), - lat("cas", &self.tree_cas), - lat("scan", &self.tree_scan), - lat("rev scan", &self.tree_reverse_scan), - ])); - let total_loops = self.tree_loops.load(Acquire); - let total_ops = self.tree_get.count() - + self.tree_set.count() - + self.tree_merge.count() - + self.tree_del.count() - + self.tree_cas.count() - + self.tree_scan.count() - + self.tree_reverse_scan.count(); - let loop_pct = total_loops * 100 / (total_ops + 1); - ret.push_str(&format!( - "tree contention loops: {} ({}% retry rate)\n", - total_loops, loop_pct - )); - ret.push_str(&format!( - "tree split success rates: child({}/{}) parent({}/{}) root({}/{})\n", - self.tree_child_split_success.load(Acquire) - .to_formatted_string(&Locale::en) - , - self.tree_child_split_attempt.load(Acquire) - .to_formatted_string(&Locale::en) - , - self.tree_parent_split_success.load(Acquire) - .to_formatted_string(&Locale::en) - , - self.tree_parent_split_attempt.load(Acquire) - .to_formatted_string(&Locale::en) - , - self.tree_root_split_success.load(Acquire) - .to_formatted_string(&Locale::en) - , - self.tree_root_split_attempt.load(Acquire) - .to_formatted_string(&Locale::en) - , - )); - - ret.push_str(&format!("{}\n", "-".repeat(134))); - ret.push_str("pagecache:\n"); - ret.push_str(&p(vec![ - lat("get", &self.get_page), - lat("get pt", &self.get_pagetable), - lat("rewrite", &self.rewrite_page), - lat("replace", &self.replace_page), - lat("link", &self.link_page), - lat("pull", &self.pull), - lat("page_out", &self.page_out), - ])); - let hit_ratio = self.get_page.count().saturating_sub(self.pull.count()) - * 100 - / (self.get_page.count() + 1); - ret.push_str(&format!("hit ratio: {}%\n", hit_ratio)); - - ret.push_str(&format!("{}\n", "-".repeat(134))); - ret.push_str("serialization:\n"); - ret.push_str(&p(vec![ - lat("serialize", &self.serialize), - lat("deserialize", &self.deserialize), - ])); - - ret.push_str(&format!("{}\n", "-".repeat(134))); - ret.push_str("log:\n"); - ret.push_str(&p(vec![ - lat("make_stable", &self.make_stable), - lat("read", &self.read), - lat("write", &self.write_to_log), - sz("written bytes", &self.written_bytes), - lat("assign offset", &self.assign_offset), - lat("reserve lat", &self.reserve_lat), - sz("reserve sz", &self.reserve_sz), - ])); - let log_reservations = - std::cmp::max(1, self.log_reservations.load(Acquire)); - let log_reservation_attempts = - std::cmp::max(1, self.log_reservation_attempts.load(Acquire)); - let log_reservation_retry_rate = - log_reservation_attempts.saturating_sub(log_reservations) * 100 - / (log_reservations + 1); - ret.push_str(&format!( - "log reservations: {:>15}\n", - log_reservations.to_formatted_string(&Locale::en) - )); - ret.push_str(&format!( - "log res attempts: {:>15} ({}% retry rate)\n", - log_reservation_attempts.to_formatted_string(&Locale::en), - log_reservation_retry_rate, - )); - - ret.push_str(&format!( - "heap item reserved bytes: {:>15}\n", - self.bytes_written_heap_item - .load(Acquire) - 
.to_formatted_string(&Locale::en) - )); - ret.push_str(&format!( - "heap pointer reserved bytes: {:>15}\n", - self.bytes_written_heap_ptr - .load(Acquire) - .to_formatted_string(&Locale::en) - )); - ret.push_str(&format!( - "node replace reserved bytes: {:>15}\n", - self.bytes_written_replace - .load(Acquire) - .to_formatted_string(&Locale::en) - )); - ret.push_str(&format!( - "node link reserved bytes: {:>15}\n", - self.bytes_written_link - .load(Acquire) - .to_formatted_string(&Locale::en) - )); - ret.push_str(&format!( - "other written reserved bytes: {:>15}\n", - self.bytes_written_other - .load(Acquire) - .to_formatted_string(&Locale::en) - )); - - ret.push_str(&format!("{}\n", "-".repeat(134))); - ret.push_str("segment accountant:\n"); - ret.push_str(&p(vec![ - lat("acquire", &self.accountant_lock), - lat("hold", &self.accountant_hold), - lat("next", &self.accountant_next), - lat("stabilize", &self.accountant_stabilize), - lat("replace", &self.accountant_mark_replace), - lat("link", &self.accountant_mark_link), - ])); - - ret.push_str(&format!("{}\n", "-".repeat(134))); - ret.push_str("recovery:\n"); - ret.push_str(&p(vec![ - lat("start", &self.tree_start), - lat("advance snapshot", &self.advance_snapshot), - lat("fuzzy snapshot", &self.fuzzy_snapshot), - lat("load SA", &self.start_segment_accountant), - lat("load PC", &self.start_pagecache), - lat("snap apply", &self.snapshot_apply), - lat("segment read", &self.segment_read), - lat("log message read", &self.read_segment_message), - sz("seg util start", &self.segment_utilization_startup), - sz("seg util end", &self.segment_utilization_shutdown), - ])); - - ret - } -} diff --git a/src/node.rs b/src/node.rs deleted file mode 100644 index c7e8ab33e..000000000 --- a/src/node.rs +++ /dev/null @@ -1,3067 +0,0 @@ -#![allow(unsafe_code)] - -// TODO we can skip the first offset because it's always 0 - -use std::{ - alloc::{alloc_zeroed, dealloc, Layout}, - cell::UnsafeCell, - cmp::Ordering::{self, Equal, Greater, Less}, - convert::{TryFrom, TryInto}, - fmt, - mem::{align_of, size_of}, - num::{NonZeroU16, NonZeroU64}, - ops::{Bound, Deref, DerefMut}, - sync::Arc, -}; - -use crate::{varint, IVec, Link}; - -const ALIGNMENT: usize = align_of::
(); - -macro_rules! tf { - ($e:expr) => { - usize::try_from($e).unwrap() - }; - ($e:expr, $t:ty) => { - <$t>::try_from($e).unwrap() - }; -} - -// allocates space for a header struct at the beginning. -fn uninitialized_node(len: usize) -> Inner { - let layout = Layout::from_size_align(len, ALIGNMENT).unwrap(); - - unsafe { - let ptr = alloc_zeroed(layout); - let cell_ptr = fatten(ptr, len); - Inner { ptr: cell_ptr } - } -} - -#[repr(C)] -#[derive(Debug, Clone, Copy)] -pub struct Header { - // NB always lay out fields from largest to smallest to properly pack the struct - pub next: Option, - pub merging_child: Option, - lo_len: u64, - hi_len: u64, - pub children: u32, - fixed_key_length: Option, - // we use this form to squish it all into - // 16 bytes, but really we do allow - // for Some(0) by shifting everything - // down by one on access. - fixed_value_length: Option, - // if all keys on a node are equidistant, - // we can avoid writing any data for them - // at all. - fixed_key_stride: Option, - pub prefix_len: u8, - probation_ops_remaining: u8, - // this can be 3 bits. 111 = 7, but we - // will never need 7 bytes for storing offsets. - // address spaces cap out at 2 ** 48 (256 ** 6) - // so as long as we can represent the numbers 1-6, - // we can reach the full linux address space currently - // supported as of 2021. - offset_bytes: u8, - // can be 2 bits - pub rewrite_generations: u8, - // this can really be 2 bits, representing - // 00: all updates have been at the end - // 01: mixed updates - // 10: all updates have been at the beginning - activity_sketch: u8, - version: u8, - // can be 1 bit - pub merging: bool, - // can be 1 bit - pub is_index: bool, -} - -fn apply_computed_distance(mut buf: &mut [u8], mut distance: usize) { - while distance > 0 { - let last = &mut buf[buf.len() - 1]; - let distance_byte = u8::try_from(distance % 256).unwrap(); - let carry = if 255 - distance_byte < *last { 1 } else { 0 }; - *last = last.wrapping_add(distance_byte); - distance = (distance >> 8) + carry; - if distance != 0 { - let new_len = buf.len() - 1; - buf = &mut buf[..new_len]; - } - } -} - -// TODO change to u64 or u128 output -// This function has several responsibilities: -// * `find` will call this when looking for the -// proper child pid on an index, with slice -// lengths that may or may not match -// * `KeyRef::Ord` and `KeyRef::distance` call -// this while performing node iteration, -// again with possibly mismatching slice -// lengths. Merging nodes together, or -// merging overlays into inner nodes -// will rely on this functionality, and -// it's possible for the lengths to vary. -// -// This is not a general-purpose function. It -// is not possible to determine distances when -// the distance is not representable using the -// return type of this function. -// -// This is different from simply treating -// the byte slice as a zero-padded big-endian -// integer because length exists as a variable -// dimension that must be numerically represented -// in a way that preserves lexicographic ordering. 
-fn shared_distance(base: &[u8], search: &[u8]) -> usize { - const fn f1(base: &[u8], search: &[u8]) -> usize { - (search[search.len() - 1] - base[search.len() - 1]) as usize - } - fn f2(base: &[u8], search: &[u8]) -> usize { - (u16::from_be_bytes(search.try_into().unwrap()) as usize) - - (u16::from_be_bytes(base.try_into().unwrap()) as usize) - } - const fn f3(base: &[u8], search: &[u8]) -> usize { - (u32::from_be_bytes([0, search[0], search[1], search[2]]) as usize) - - (u32::from_be_bytes([0, base[0], base[1], base[2]]) as usize) - } - fn f4(base: &[u8], search: &[u8]) -> usize { - (u32::from_be_bytes(search.try_into().unwrap()) as usize) - - (u32::from_be_bytes(base.try_into().unwrap()) as usize) - } - testing_assert!( - base <= search, - "expected base {:?} to be <= search {:?}", - base, - search - ); - testing_assert!( - base.len() == search.len(), - "base len: {} search len: {}", - base.len(), - search.len() - ); - testing_assert!(!base.is_empty()); - testing_assert!(base.len() <= 4); - - let computed_gotos = [f1, f2, f3, f4]; - computed_gotos[search.len() - 1](base, search) -} - -#[derive(Debug, Clone, Copy)] -enum KeyRef<'a> { - // used when all keys on a node are linear - // with a fixed stride length, allowing us to - // avoid ever actually storing any of them - Computed { base: &'a [u8], distance: usize }, - // used when keys are not linear, and we - // store the actual prefix-encoded keys on the node - Slice(&'a [u8]), -} - -impl<'a> From> for IVec { - fn from(kr: KeyRef<'a>) -> IVec { - (&kr).into() - } -} - -impl<'a> From<&KeyRef<'a>> for IVec { - fn from(kr: &KeyRef<'a>) -> IVec { - match kr { - KeyRef::Computed { base, distance } => { - let mut ivec: IVec = (*base).into(); - apply_computed_distance(&mut ivec, *distance); - ivec - } - KeyRef::Slice(s) => (*s).into(), - } - } -} - -impl<'a> KeyRef<'a> { - const fn unwrap_slice(&self) -> &[u8] { - if let KeyRef::Slice(s) = self { - s - } else { - panic!("called KeyRef::unwrap_slice on a KeyRef::Computed"); - } - } - - fn write_into(&self, buf: &mut [u8]) { - match self { - KeyRef::Computed { base, distance } => { - let buf_len = buf.len(); - buf[buf_len - base.len()..].copy_from_slice(base); - apply_computed_distance(buf, *distance); - } - KeyRef::Slice(s) => buf.copy_from_slice(s), - } - } - - fn shared_distance(&self, other: &KeyRef<'_>) -> usize { - match (self, other) { - ( - KeyRef::Computed { base: a, distance: da }, - KeyRef::Computed { base: b, distance: db }, - ) => { - assert!(a.len() <= 4); - assert!(b.len() <= 4); - let s_len = a.len().min(b.len()); - let s_a = &a[..s_len]; - let s_b = &b[..s_len]; - let s_da = shift_distance(a, *da, a.len() - s_len); - let s_db = shift_distance(b, *db, b.len() - s_len); - match a.cmp(b) { - Less => shared_distance(s_a, s_b) + s_db - s_da, - Greater => (s_db - s_da) - shared_distance(s_b, s_a), - Equal => db - da, - } - } - (KeyRef::Computed { .. }, KeyRef::Slice(b)) => { - // recurse to first case - self.shared_distance(&KeyRef::Computed { base: b, distance: 0 }) - } - (KeyRef::Slice(a), KeyRef::Computed { .. 
}) => { - // recurse to first case - KeyRef::Computed { base: a, distance: 0 }.shared_distance(other) - } - (KeyRef::Slice(a), KeyRef::Slice(b)) => { - // recurse to first case - KeyRef::Computed { base: a, distance: 0 } - .shared_distance(&KeyRef::Computed { base: b, distance: 0 }) - } - } - } - - fn is_empty(&self) -> bool { - self.len() == 0 - } - - fn len(&self) -> usize { - match self { - KeyRef::Computed { base, distance } => { - let mut slack = 0_usize; - for c in base.iter() { - slack += 255 - *c as usize; - slack <<= 8; - } - slack >>= 8; - base.len() + if *distance > slack { 1 } else { 0 } - } - KeyRef::Slice(s) => s.len(), - } - } -} - -// this function "corrects" a distance calculated -// for shared prefix lengths by accounting for -// dangling bytes that were omitted from the -// shared calculation. We only need to subtract -// distance when the base is shorter than the -// search key, because in the other case, -// the result is still usable -fn unshift_distance( - mut shared_distance: usize, - base: &[u8], - search: &[u8], -) -> usize { - if base.len() > search.len() { - for byte in &base[search.len()..] { - shared_distance <<= 8; - shared_distance -= *byte as usize; - } - } - - shared_distance -} - -fn shift_distance( - mut buf: &[u8], - mut distance: usize, - mut shift: usize, -) -> usize { - while shift > 0 { - let last = buf[buf.len() - 1]; - let distance_byte = u8::try_from(distance % 256).unwrap(); - let carry = if 255 - distance_byte < last { 1 } else { 0 }; - distance = (distance >> 8) + carry; - buf = &buf[..buf.len() - 1]; - shift -= 1; - } - distance -} - -impl PartialEq> for KeyRef<'_> { - fn eq(&self, other: &KeyRef<'_>) -> bool { - if self.len() != other.len() { - return false; - } - self.cmp(other) == Equal - } -} - -impl Eq for KeyRef<'_> {} - -impl Ord for KeyRef<'_> { - fn cmp(&self, other: &KeyRef<'_>) -> Ordering { - // TODO this needs to avoid linear_distance - // entirely when the lengths between `a` and - // `b` are more than the number of elements - // that we can actually represent numerical - // distances using - match (self, other) { - ( - KeyRef::Computed { base: a, distance: da }, - KeyRef::Computed { base: b, distance: db }, - ) => { - let s_len = a.len().min(b.len()); - let s_a = &a[..s_len]; - let s_b = &b[..s_len]; - let s_da = shift_distance(a, *da, a.len() - s_len); - let s_db = shift_distance(b, *db, b.len() - s_len); - - let shared_cmp = match s_a.cmp(s_b) { - Less => s_da.cmp(&(shared_distance(s_a, s_b) + s_db)), - Greater => (shared_distance(s_b, s_a) + s_da).cmp(&s_db), - Equal => s_da.cmp(&s_db), - }; - - match shared_cmp { - Equal => a.len().cmp(&b.len()), - other2 => other2, - } - } - (KeyRef::Computed { .. }, KeyRef::Slice(b)) => { - // recurse to first case - self.cmp(&KeyRef::Computed { base: b, distance: 0 }) - } - (KeyRef::Slice(a), KeyRef::Computed { .. 
}) => { - // recurse to first case - KeyRef::Computed { base: a, distance: 0 }.cmp(other) - } - (KeyRef::Slice(a), KeyRef::Slice(b)) => a.cmp(b), - } - } -} - -impl PartialOrd> for KeyRef<'_> { - fn partial_cmp(&self, other: &KeyRef<'_>) -> Option { - Some(self.cmp(other)) - } -} - -impl PartialOrd<[u8]> for KeyRef<'_> { - fn partial_cmp(&self, other: &[u8]) -> Option { - self.partial_cmp(&KeyRef::Slice(other)) - } -} - -impl PartialEq<[u8]> for KeyRef<'_> { - fn eq(&self, other: &[u8]) -> bool { - self.eq(&KeyRef::Slice(other)) - } -} - -struct Iter<'a> { - overlay: std::iter::Skip>>, - node: &'a Inner, - node_position: usize, - node_back_position: usize, - next_a: Option<(&'a [u8], Option<&'a IVec>)>, - next_b: Option<(KeyRef<'a>, &'a [u8])>, - next_back_a: Option<(&'a [u8], Option<&'a IVec>)>, - next_back_b: Option<(KeyRef<'a>, &'a [u8])>, -} - -impl<'a> Iterator for Iter<'a> { - type Item = (KeyRef<'a>, &'a [u8]); - - fn next(&mut self) -> Option { - loop { - if self.next_a.is_none() { - log::trace!("src/node.rs:94"); - if let Some((k, v)) = self.overlay.next() { - log::trace!("next_a is now ({:?}, {:?})", k, v); - self.next_a = Some((k.as_ref(), v.as_ref())); - } - } - if self.next_b.is_none() - && self.node.children() > self.node_position - { - self.next_b = Some(( - self.node.index_key(self.node_position), - self.node.index_value(self.node_position), - )); - log::trace!("next_b is now {:?}", self.next_b); - self.node_position += 1; - } - match (self.next_a, self.next_b) { - (None, _) => { - log::trace!("src/node.rs:112"); - log::trace!("iterator returning {:?}", self.next_b); - return self.next_b.take(); - } - (Some((_, None)), None) => { - log::trace!("src/node.rs:113"); - self.next_a.take(); - } - (Some((_, Some(_))), None) => { - log::trace!("src/node.rs:114"); - log::trace!("iterator returning {:?}", self.next_a); - return self.next_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - (Some((k_a, v_a_opt)), Some((k_b, _))) => { - let cmp = KeyRef::Slice(k_a).cmp(&k_b); - match (cmp, v_a_opt) { - (Equal, Some(_)) => { - // prefer overlay, discard node value - self.next_b.take(); - log::trace!("src/node.rs:133"); - log::trace!("iterator returning {:?}", self.next_a); - return self.next_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - (Equal, None) => { - // skip tombstone and continue the loop - log::trace!("src/node.rs:141"); - self.next_a.take(); - self.next_b.take(); - } - (Less, Some(_)) => { - log::trace!("iterator returning {:?}", self.next_a); - return self.next_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - (Less, None) => { - log::trace!("src/node.rs:151"); - self.next_a.take(); - } - (Greater, Some(_)) => { - log::trace!("src/node.rs:120"); - log::trace!("iterator returning {:?}", self.next_b); - return self.next_b.take(); - } - (Greater, None) => { - log::trace!("src/node.rs:146"); - // we do not clear a tombstone until we move past - // it in the underlying node - log::trace!("iterator returning {:?}", self.next_b); - return self.next_b.take(); - } - } - } - } - } - } -} - -impl<'a> DoubleEndedIterator for Iter<'a> { - fn next_back(&mut self) -> Option { - loop { - if self.next_back_a.is_none() { - log::trace!("src/node.rs:458"); - if let Some((k, v)) = self.overlay.next_back() { - log::trace!("next_back_a is now ({:?}, {:?})", k, v); - self.next_back_a = Some((k.as_ref(), v.as_ref())); - } - } - if self.next_back_b.is_none() && self.node_back_position > 0 { - self.node_back_position 
-= 1; - self.next_back_b = Some(( - self.node.index_key(self.node_back_position), - self.node.index_value(self.node_back_position), - )); - log::trace!("next_back_b is now {:?}", self.next_back_b); - } - match (self.next_back_a, self.next_back_b) { - (None, _) => { - log::trace!("src/node.rs:474"); - log::trace!("iterator returning {:?}", self.next_back_b); - return self.next_back_b.take(); - } - (Some((_, None)), None) => { - log::trace!("src/node.rs:480"); - self.next_back_a.take(); - } - (Some((k_a, None)), Some((k_b, _))) if k_b == *k_a => { - // skip tombstone and continue the loop - log::trace!("src/node.rs:491"); - self.next_back_a.take(); - self.next_back_b.take(); - } - (Some((k_a, None)), Some((k_b, _))) if k_b > *k_a => { - log::trace!("src/node.rs:496"); - // we do not clear a tombstone until we move past - // it in the underlying node - log::trace!("iterator returning {:?}", self.next_back_b); - return self.next_back_b.take(); - } - (Some((k_a, None)), Some((k_b, _))) if k_b < *k_a => { - log::trace!("src/node.rs:503"); - self.next_back_a.take(); - } - (Some((_, Some(_))), None) => { - log::trace!("src/node.rs:483"); - log::trace!("iterator returning {:?}", self.next_back_a); - return self.next_back_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - (Some((k_a, Some(_))), Some((k_b, _))) if k_b > *k_a => { - log::trace!("src/node.rs:508"); - log::trace!("iterator returning {:?}", self.next_back_b); - return self.next_back_b.take(); - } - (Some((k_a, Some(_))), Some((k_b, _))) if k_b < *k_a => { - log::trace!("iterator returning {:?}", self.next_back_a); - return self.next_back_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - (Some((k_a, Some(_))), Some((k_b, _))) if k_b == *k_a => { - // prefer overlay, discard node value - self.next_back_b.take(); - log::trace!("src/node.rs:520"); - log::trace!("iterator returning {:?}", self.next_back_a); - return self.next_back_a.take().map(|(k, v)| { - (KeyRef::Slice(k), v.unwrap().as_ref()) - }); - } - _ => unreachable!( - "did not expect combination a: {:?} b: {:?}", - self.next_back_a, self.next_back_b - ), - } - } - } -} - -#[derive(Debug, PartialEq)] -pub struct Node { - // the overlay accumulates new writes and tombstones - // for deletions that have not yet been merged - // into the inner backing node - pub(crate) overlay: im::OrdMap>, - pub(crate) inner: Arc, -} - -impl Clone for Node { - fn clone(&self) -> Node { - Node { inner: self.merge_overlay(), overlay: Default::default() } - } -} - -impl Deref for Node { - type Target = Inner; - fn deref(&self) -> &Inner { - &self.inner - } -} - -impl Node { - fn iter(&self) -> Iter<'_> { - Iter { - overlay: self.overlay.iter().skip(0), - node: &self.inner, - node_position: 0, - next_a: None, - next_b: None, - node_back_position: self.children(), - next_back_a: None, - next_back_b: None, - } - } - - pub(crate) fn iter_index_pids(&self) -> impl '_ + Iterator { - log::trace!("iter_index_pids on node {:?}", self); - self.iter().map(|(_, v)| u64::from_le_bytes(v.try_into().unwrap())) - } - - pub(crate) unsafe fn from_raw(buf: &[u8]) -> Node { - Node { - overlay: Default::default(), - inner: Arc::new(Inner::from_raw(buf)), - } - } - - pub(crate) fn new_root(child_pid: u64) -> Node { - Node { - overlay: Default::default(), - inner: Arc::new(Inner::new_root(child_pid)), - } - } - - pub(crate) fn new_hoisted_root(left: u64, at: &[u8], right: u64) -> Node { - Node { - overlay: Default::default(), - inner: Arc::new(Inner::new_hoisted_root(left, 
at, right)), - } - } - - pub(crate) fn new_empty_leaf() -> Node { - Node { - overlay: Default::default(), - inner: Arc::new(Inner::new_empty_leaf()), - } - } - - pub(crate) fn apply(&self, link: &Link) -> Node { - use self::Link::*; - - assert!( - !self.inner.merging, - "somehow a link was applied to a node after it was merged" - ); - - match *link { - Set(ref k, ref v) => self.insert(k, v), - Del(ref key) => self.remove(key), - ParentMergeConfirm => { - assert!(self.merging_child.is_some()); - let merged_child = self - .merging_child - .expect( - "we should have a specific \ - child that was merged if this \ - link appears here", - ) - .get(); - let idx = self - .iter_index_pids() - .position(|pid| pid == merged_child) - .unwrap(); - let mut ret = - self.remove(&self.index_key(idx).into()).merge_overlay(); - Arc::get_mut(&mut ret).unwrap().merging_child = None; - Node { inner: ret, overlay: Default::default() } - } - ParentMergeIntention(pid) => { - assert!( - self.can_merge_child(pid), - "trying to merge {:?} into node {:?} which \ - is not a valid merge target", - link, - self - ); - let mut ret = self.merge_overlay(); - Arc::make_mut(&mut ret).merging_child = - Some(NonZeroU64::new(pid).unwrap()); - Node { inner: ret, overlay: Default::default() } - } - ChildMergeCap => { - let mut ret = self.merge_overlay(); - Arc::make_mut(&mut ret).merging = true; - Node { inner: ret, overlay: Default::default() } - } - } - } - - fn insert(&self, key: &IVec, value: &IVec) -> Node { - let overlay = self.overlay.update(key.clone(), Some(value.clone())); - Node { overlay, inner: self.inner.clone() } - } - - fn remove(&self, key: &IVec) -> Node { - let overlay = self.overlay.update(key.clone(), None); - let ret = Node { overlay, inner: self.inner.clone() }; - log::trace!( - "applying removal of key {:?} results in node {:?}", - key, - ret - ); - ret - } - - fn contains_key(&self, key: &[u8]) -> bool { - if key < self.lo() - || if let Some(hi) = self.hi() { key >= hi } else { false } - { - return false; - } - if let Some(fixed_key_length) = self.fixed_key_length { - if usize::from(fixed_key_length.get()) != key.len() { - return false; - } - } - self.overlay.contains_key(key) || self.inner.contains_key(key) - } - - // Push the overlay into the backing node. 
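// A minimal model of what the overlay holds and what the `merge_overlay`
// that follows does with it, using std collections instead of `im::OrdMap`
// and plain byte vectors instead of `IVec`. `Some(value)` is a pending
// write, `None` is a tombstone left by a removal; both shadow the backing
// node until the overlay is folded in. Names here are illustrative only.
use std::collections::BTreeMap;

fn fold_overlay_sketch(
    base: &BTreeMap<Vec<u8>, Vec<u8>>,
    overlay: &BTreeMap<Vec<u8>, Option<Vec<u8>>>,
) -> BTreeMap<Vec<u8>, Vec<u8>> {
    let mut merged = base.clone();
    for (key, update) in overlay {
        match update {
            // a pending write replaces whatever the backing node had
            Some(value) => {
                merged.insert(key.clone(), value.clone());
            }
            // a tombstone erases the key from the merged view
            None => {
                merged.remove(key);
            }
        }
    }
    merged
}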
- fn merge_overlay(&self) -> Arc { - if self.overlay.is_empty() { - return self.inner.clone(); - }; - - // if this is a node that has a fixed key stride - // and empty values, we may be able to skip - // the normal merge process by performing a - // header-only update - let can_seamlessly_absorb = self.fixed_key_stride.is_some() - && self.fixed_value_length() == Some(0) - && { - let mut prev = self.inner.index_key(self.inner.children() - 1); - let stride: u16 = self.fixed_key_stride.unwrap().get(); - let mut length_and_stride_matches = true; - for (k, v) in &self.overlay { - length_and_stride_matches &= - v.is_some() && v.as_ref().unwrap().is_empty(); - length_and_stride_matches &= KeyRef::Slice(k) > prev - && is_linear(&prev, &KeyRef::Slice(k), stride); - - prev = KeyRef::Slice(k); - - if !length_and_stride_matches { - break; - } - } - length_and_stride_matches - }; - - if can_seamlessly_absorb { - let mut ret = self.inner.deref().clone(); - ret.children = self - .inner - .children - .checked_add(u32::try_from(self.overlay.len()).unwrap()) - .unwrap(); - return Arc::new(ret); - } - - let mut items = - Vec::with_capacity(self.inner.children() + self.overlay.len()); - - for (k, v) in self.iter() { - items.push((k, v)) - } - - log::trace!( - "merging overlay items for node {:?} into {:?}", - self, - items - ); - - let mut ret = Inner::new( - self.lo(), - self.hi(), - self.prefix_len, - self.is_index, - self.next, - &items, - ); - - #[cfg(feature = "testing")] - { - let orig_ivec_pairs: Vec<_> = self - .iter() - .map(|(k, v)| (self.prefix_decode(k), IVec::from(v))) - .collect(); - - let new_ivec_pairs: Vec<_> = ret - .iter() - .map(|(k, v)| (ret.prefix_decode(k), IVec::from(v))) - .collect(); - - assert_eq!(orig_ivec_pairs, new_ivec_pairs); - } - - ret.merging = self.merging; - ret.merging_child = self.merging_child; - ret.probation_ops_remaining = - self.probation_ops_remaining.saturating_sub( - u8::try_from(self.overlay.len().min(u8::MAX as usize)).unwrap(), - ); - - log::trace!("merged node {:?} into {:?}", self, ret); - Arc::new(ret) - } - - pub(crate) fn set_next(&mut self, next: Option) { - Arc::get_mut(&mut self.inner).unwrap().next = next; - } - - pub(crate) fn increment_rewrite_generations(&mut self) { - let rewrite_generations = self.rewrite_generations; - - // don't bump rewrite_generations unless we've cooled - // down after the last split. 
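// The `activity_sketch` consulted just below is an 8-bit bitmap recording
// roughly where recent inserts landed in the node; besides gating the
// generation bump here, `weighted_split_point` later in this file averages
// its set bits to bias splits toward the hot end. A standalone sketch of
// that averaging, assuming a node with at least two children:
fn weighted_split_point_sketch(activity_sketch: u8, children: usize) -> usize {
    assert!(children >= 2);
    let bits_set = activity_sketch.count_ones() as usize;
    if bits_set == 0 {
        // no recorded activity: split down the middle
        return children / 2;
    }
    let mut weighted_count = 0_usize;
    for bit in 0..8 {
        if (1_u8 << bit) & activity_sketch != 0 {
            weighted_count += bit + 1;
        }
    }
    let average_bit = weighted_count / bits_set;
    (average_bit * children / 8).min(children - 1).max(1)
}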
- if self.activity_sketch == 0 { - Arc::make_mut(&mut self.inner).rewrite_generations = - rewrite_generations.saturating_add(1); - } - } - - pub(crate) fn receive_merge(&self, other: &Node) -> Node { - log::trace!("receiving merge, left: {:?} right: {:?}", self, other); - let left = self.merge_overlay(); - let right = other.merge_overlay(); - log::trace!( - "overlays should now be merged: left: {:?} right: {:?}", - left, - right - ); - - let ret = Node { - overlay: Default::default(), - inner: Arc::new(left.receive_merge(&right)), - }; - - #[cfg(feature = "testing")] - { - let orig_ivec_pairs: Vec<_> = self - .iter() - .map(|(k, v)| (self.prefix_decode(k), IVec::from(v))) - .chain( - other - .iter() - .map(|(k, v)| (other.prefix_decode(k), IVec::from(v))), - ) - .collect(); - - let new_ivec_pairs: Vec<_> = ret - .iter() - .map(|(k, v)| (ret.prefix_decode(k), IVec::from(v))) - .collect(); - - assert_eq!(orig_ivec_pairs, new_ivec_pairs); - } - - log::trace!("merge created node {:?}", ret); - ret - } - - pub(crate) fn split(&self) -> (Node, Node) { - let (lhs_inner, rhs_inner) = self.merge_overlay().split(); - let lhs = - Node { inner: Arc::new(lhs_inner), overlay: Default::default() }; - let rhs = - Node { inner: Arc::new(rhs_inner), overlay: Default::default() }; - - #[cfg(feature = "testing")] - { - let orig_ivec_pairs: Vec<_> = self - .iter() - .map(|(k, v)| (self.prefix_decode(k), IVec::from(v))) - .collect(); - - let new_ivec_pairs: Vec<_> = lhs - .iter() - .map(|(k, v)| (lhs.prefix_decode(k), IVec::from(v))) - .chain( - rhs.iter() - .map(|(k, v)| (rhs.prefix_decode(k), IVec::from(v))), - ) - .collect(); - - assert_eq!( - orig_ivec_pairs, new_ivec_pairs, - "splitting node {:?} failed", - self - ); - } - - (lhs, rhs) - } - - pub(crate) fn parent_split(&self, at: &[u8], to: u64) -> Option { - let encoded_sep = &at[self.prefix_len as usize..]; - if self.contains_key(encoded_sep) { - log::debug!( - "parent_split skipped because \ - parent node already contains child with key {:?} \ - pid {} \ - at split point due to deep race. parent node: {:?}", - at, - to, - self - ); - return None; - } - - if at < self.lo() - || if let Some(hi) = self.hi() { hi <= at } else { false } - { - log::debug!( - "tried to add split child at {:?} to parent index node {:?}", - at, - self - ); - return None; - } - - let value = Some(to.to_le_bytes().as_ref().into()); - let overlay = self.overlay.update(encoded_sep.into(), value); - - let new_inner = - Node { overlay, inner: self.inner.clone() }.merge_overlay(); - - Some(Node { overlay: Default::default(), inner: new_inner }) - } - - /// `node_kv_pair` returns either the existing (node/key, value, current offset) tuple or - /// (node/key, none, future offset) where a node/key is node level encoded key. 
- pub(crate) fn node_kv_pair<'a>( - &'a self, - key: &'a [u8], - ) -> (IVec, Option<&[u8]>) { - let encoded_key = self.prefix_encode(key); - if let Some(v) = self.overlay.get(encoded_key) { - (encoded_key.into(), v.as_ref().map(AsRef::as_ref)) - } else { - // look for the key in our compacted inner node - let search = self.find(encoded_key); - - if let Ok(idx) = search { - (self.index_key(idx).into(), Some(self.index_value(idx))) - } else { - (encoded_key.into(), None) - } - } - } - - pub(crate) fn successor( - &self, - bound: &Bound, - ) -> Option<(IVec, IVec)> { - let (overlay, node_position) = match bound { - Bound::Unbounded => (self.overlay.iter().skip(0), 0), - Bound::Included(b) => { - if let Some(Some(v)) = self.overlay.get(b) { - // short circuit return - return Some((b.clone(), v.clone())); - } - let overlay_search = self.overlay.range(b.clone()..).skip(0); - - let inner_search = if &**b < self.lo() { - Err(0) - } else { - self.find(self.prefix_encode(b)) - }; - let node_position = match inner_search { - Ok(idx) => { - return Some(( - self.prefix_decode(self.inner.index_key(idx)), - self.inner.index_value(idx).into(), - )) - } - Err(idx) => idx, - }; - - (overlay_search, node_position) - } - Bound::Excluded(b) => { - let overlay_search = if self.overlay.contains_key(b) { - self.overlay.range(b.clone()..).skip(1) - } else { - self.overlay.range(b.clone()..).skip(0) - }; - - let inner_search = if &**b < self.lo() { - Err(0) - } else { - self.find(self.prefix_encode(b)) - }; - let node_position = match inner_search { - Ok(idx) => idx + 1, - Err(idx) => idx, - }; - - (overlay_search, node_position) - } - }; - - let in_bounds = |k: &KeyRef<'_>| match bound { - Bound::Unbounded => true, - Bound::Included(b) => *k >= b[self.prefix_len as usize..], - Bound::Excluded(b) => *k > b[self.prefix_len as usize..], - }; - - let mut iter = Iter { - overlay, - node: &self.inner, - node_position, - next_a: None, - next_b: None, - node_back_position: self.children(), - next_back_a: None, - next_back_b: None, - }; - - let ret: Option<(KeyRef<'_>, &[u8])> = iter.find(|(k, _)| in_bounds(k)); - - ret.map(|(k, v)| (self.prefix_decode(k), v.into())) - } - - pub(crate) fn predecessor( - &self, - bound: &Bound, - ) -> Option<(IVec, IVec)> { - let (overlay, node_back_position) = match bound { - Bound::Unbounded => (self.overlay.iter().skip(0), self.children()), - Bound::Included(b) => { - let overlay = self.overlay.range(..=b.clone()).skip(0); - - let inner_search = if &**b < self.lo() { - Err(0) - } else { - self.find(self.prefix_encode(b)) - }; - let node_back_position = match inner_search { - Ok(idx) => { - return Some(( - self.prefix_decode(self.inner.index_key(idx)), - self.inner.index_value(idx).into(), - )) - } - Err(idx) => idx, - }; - - (overlay, node_back_position) - } - Bound::Excluded(b) => { - let overlay = self.overlay.range(..b.clone()).skip(0); - - let above_hi = - if let Some(hi) = self.hi() { &**b >= hi } else { false }; - - let inner_search = if above_hi { - Err(self.children()) - } else { - self.find(self.prefix_encode(b)) - }; - #[allow(clippy::match_same_arms)] - let node_back_position = match inner_search { - Ok(idx) => idx, - Err(idx) => idx, - }; - - (overlay, node_back_position) - } - }; - - let iter = Iter { - overlay, - node: &self.inner, - node_position: 0, - node_back_position, - next_a: None, - next_b: None, - next_back_a: None, - next_back_b: None, - }; - - let in_bounds = |k: &KeyRef<'_>| match bound { - Bound::Unbounded => true, - Bound::Included(b) => *k <= 
b[self.prefix_len as usize..], - Bound::Excluded(b) => *k < b[self.prefix_len as usize..], - }; - - let ret: Option<(KeyRef<'_>, &[u8])> = - iter.rev().find(|(k, _)| in_bounds(k)); - - ret.map(|(k, v)| (self.prefix_decode(k), v.into())) - } - - pub(crate) fn index_next_node(&self, key: &[u8]) -> (bool, u64) { - log::trace!("index_next_node for key {:?} on node {:?}", key, self); - assert!(self.overlay.is_empty()); - assert!(key >= self.lo()); - if let Some(hi) = self.hi() { - assert!(hi > key); - } - - let encoded_key = self.prefix_encode(key); - - let idx = match self.find(encoded_key) { - Ok(idx) => idx, - Err(idx) => idx.max(1) - 1, - }; - - let is_leftmost = idx == 0; - let pid_bytes = self.index_value(idx); - let pid = u64::from_le_bytes(pid_bytes.try_into().unwrap()); - - log::trace!("index_next_node for key {:?} returning pid {} after seaching node {:?}", key, pid, self); - (is_leftmost, pid) - } - - pub(crate) fn should_split(&self) -> bool { - log::trace!("seeing if we should split node {:?}", self); - let size_checks = if cfg!(any(test, feature = "lock_free_delays")) { - self.iter().take(6).count() > 5 - } else { - let size_threshold = 1024 - crate::MAX_MSG_HEADER_LEN; - let child_threshold = 56 * 1024; - - self.len() > size_threshold || self.children > child_threshold - }; - - let safety_checks = self.merging_child.is_none() - && !self.merging - && self.iter().take(2).count() == 2 - && self.probation_ops_remaining == 0; - - if size_checks { - log::trace!( - "should_split: {} is index: {} children: {} size: {}", - safety_checks && size_checks, - self.is_index, - self.children, - self.rss() - ); - } - - safety_checks && size_checks - } - - pub(crate) fn should_merge(&self) -> bool { - let size_check = if cfg!(any(test, feature = "lock_free_delays")) { - self.iter().take(2).count() < 2 - } else { - let size_threshold = 256 - crate::MAX_MSG_HEADER_LEN; - self.len() < size_threshold - }; - - let safety_checks = self.merging_child.is_none() - && !self.merging - && self.probation_ops_remaining == 0; - - safety_checks && size_check - } -} - -/// An immutable sorted string table -#[must_use] -pub struct Inner { - ptr: *mut UnsafeCell<[u8]>, -} - -/// -#[allow(trivial_casts)] -fn fatten(data: *mut u8, len: usize) -> *mut UnsafeCell<[u8]> { - // Requirements of slice::from_raw_parts. 
- assert!(!data.is_null()); - assert!(isize::try_from(len).is_ok()); - - let slice = unsafe { core::slice::from_raw_parts(data as *const (), len) }; - slice as *const [()] as *mut _ -} - -impl PartialEq for Inner { - fn eq(&self, other: &Inner) -> bool { - self.as_ref().eq(other.as_ref()) - } -} - -impl Clone for Inner { - fn clone(&self) -> Inner { - unsafe { Inner::from_raw(self.as_ref()) } - } -} - -unsafe impl Sync for Inner {} -unsafe impl Send for Inner {} - -impl Drop for Inner { - fn drop(&mut self) { - let layout = Layout::from_size_align(self.len(), ALIGNMENT).unwrap(); - unsafe { - dealloc(self.ptr(), layout); - } - } -} - -impl AsRef<[u8]> for Inner { - fn as_ref(&self) -> &[u8] { - self.buf() - } -} - -impl AsMut<[u8]> for Inner { - fn as_mut(&mut self) -> &mut [u8] { - self.buf_mut() - } -} - -impl Deref for Inner { - type Target = Header; - - fn deref(&self) -> &Header { - self.header() - } -} - -impl fmt::Debug for Inner { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let mut ds = f.debug_struct("Inner"); - - ds.field("header", self.header()) - .field("lo", &self.lo()) - .field("hi", &self.hi()); - - if self.fixed_value_length() == Some(0) - && self.fixed_key_stride.is_some() - { - ds.field("fixed node, all keys and values omitted", &()).finish() - } else if self.is_index { - ds.field( - "items", - &self - .iter_keys() - .zip(self.iter_index_pids()) - .collect::>(), - ) - .finish() - } else { - ds.field("items", &self.iter().collect::>()).finish() - } - } -} - -impl DerefMut for Inner { - fn deref_mut(&mut self) -> &mut Header { - self.header_mut() - } -} - -// determines if the item can be losslessly -// constructed from the base by adding a fixed -// stride to it. -fn is_linear(a: &KeyRef<'_>, b: &KeyRef<'_>, stride: u16) -> bool { - let a_len = a.len(); - if a_len != b.len() || a_len > 4 { - return false; - } - - a.shared_distance(b) == stride as usize -} - -impl Inner { - #[inline] - fn ptr(&self) -> *mut u8 { - unsafe { (*self.ptr).get() as *mut u8 } - } - - #[inline] - pub(crate) fn len(&self) -> usize { - self.buf().len() - } - - #[inline] - fn buf(&self) -> &[u8] { - unsafe { &*(*self.ptr).get() } - } - - #[inline] - fn buf_mut(&mut self) -> &mut [u8] { - unsafe { &mut *(*self.ptr).get() } - } - - unsafe fn from_raw(buf: &[u8]) -> Inner { - let mut ret = uninitialized_node(buf.len()); - ret.as_mut().copy_from_slice(buf); - ret - } - - fn new( - lo: &[u8], - hi: Option<&[u8]>, - prefix_len: u8, - is_index: bool, - next: Option, - items: &[(KeyRef<'_>, &[u8])], - ) -> Inner { - assert!(items.len() <= u32::MAX as usize); - - // determine if we need to use varints and offset - // indirection tables, or if everything is equal - // size we can skip this. If all keys are linear - // with a fixed stride, we can completely skip writing - // them at all, as they can always be calculated by - // adding the desired offset to the lo key. 
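// A standalone sketch of the stride detection described above: given the
// already-sorted, equal-length keys of a node (at most four bytes each),
// return the common gap between consecutive keys if one exists. The helper
// is hypothetical and simplified; the real check below also compares the
// node's lo key against the second item and bails out on any mismatch.
fn detect_stride_sketch(keys: &[&[u8]]) -> Option<u16> {
    if keys.len() < 2
        || keys.iter().any(|k| k.is_empty() || k.len() != keys[0].len() || k.len() > 4)
    {
        return None;
    }
    let widen = |s: &[u8]| -> u32 {
        let mut buf = [0_u8; 4];
        buf[4 - s.len()..].copy_from_slice(s);
        u32::from_be_bytes(buf)
    };
    let stride = widen(keys[1]).checked_sub(widen(keys[0]))?;
    let all_equidistant = keys
        .windows(2)
        .all(|pair| widen(pair[1]).checked_sub(widen(pair[0])) == Some(stride));
    if stride > 0 && all_equidistant {
        u16::try_from(stride).ok()
    } else {
        None
    }
}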
- - // we compare the lo key to the second item because - // it is assumed that the first key matches the lo key - // in the case of a fixed stride - let mut fixed_key_stride: Option = if items.len() > 1 - && lo[prefix_len as usize..].len() == items[1].0.len() - && items[1].0.len() <= 4 - { - assert!( - items[1].0 > lo[prefix_len as usize..], - "somehow, the second key on this node is not greater \ - than the node low key (adjusted for prefix): \ - lo: {:?} items: {:?}", - lo, - items - ); - u16::try_from( - KeyRef::Slice(&lo[prefix_len as usize..]) - .shared_distance(&items[1].0), - ) - .ok() - } else { - None - }; - - let mut fixed_key_length = match items { - [(kr, _), ..] if !kr.is_empty() => Some(kr.len()), - _ => None, - }; - - let mut fixed_value_length = items.first().map(|(_, v)| v.len()); - - let mut dynamic_key_storage_size = 0; - let mut dynamic_value_storage_size = 0; - - let mut prev: Option<&KeyRef<'_>> = None; - - // the first pass over items determines the various - // sizes required to represent keys and values, and - // whether keys, values, or both share the same sizes - // or possibly whether the keys increment at a fixed - // rate so that they can be completely skipped - for (k, v) in items { - dynamic_key_storage_size += k.len() + varint::size(k.len() as u64); - dynamic_value_storage_size += - v.len() + varint::size(v.len() as u64); - - if fixed_key_length.is_some() { - if let Some(last) = prev { - // see if the lengths all match for the offset table - // omission optimization - if last.len() == k.len() { - // see if the keys are equidistant for the - // key omission optimization - if let Some(stride) = fixed_key_stride { - if !is_linear(last, k, stride) { - fixed_key_stride = None; - } - } - } else { - fixed_key_length = None; - fixed_key_stride = None; - } - } - - prev = Some(k); - } - - if let Some(fvl) = fixed_value_length { - if v.len() != fvl { - fixed_value_length = None; - } - } - } - let fixed_key_length = fixed_key_length - .and_then(|fkl| u16::try_from(fkl).ok()) - .and_then(NonZeroU16::new); - - let fixed_key_stride = - fixed_key_stride.map(|stride| NonZeroU16::new(stride).unwrap()); - - let fixed_value_length = fixed_value_length - .and_then(|fvl| { - if fvl < u16::MAX as usize { - // we add 1 to the fvl to - // represent Some(0) in - // less space. - u16::try_from(fvl).ok() - } else { - None - } - }) - .and_then(|fvl| NonZeroU16::new(fvl + 1)); - - let key_storage_size = if let Some(key_length) = fixed_key_length { - assert_ne!(key_length.get(), 0); - if let Some(stride) = fixed_key_stride { - // all keys can be directly computed from the node lo key - // by adding a fixed stride length to the node lo key - assert!(stride.get() > 0); - 0 - } else { - key_length.get() as usize * items.len() - } - } else { - dynamic_key_storage_size - }; - - // we max the value size with the size of a u64 because - // when we retrieve offset sizes, we may actually read - // over the end of the offset array and into the keys - // and values data, and for nodes that only store tiny - // items, it's possible that this would extend beyond the - // allocation. This is why we always make the value buffer - // 8 bytes or more, so any overlap from the offset array - // does not extend beyond the allocation. 
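// The padding rule explained above exists because offsets are stored in
// `offset_bytes` (1..=6) bytes each but decoded by reading a full
// little-endian word and masking off the excess, which can run a few bytes
// past the end of the offset table. A safe, simplified model of that
// decode (the real version below uses an unchecked 8-byte copy for speed):
fn read_offset_sketch(table: &[u8], index: usize, offset_bytes: usize) -> usize {
    assert!((1..=6).contains(&offset_bytes));
    let start = index * offset_bytes;
    assert!(start + offset_bytes <= table.len());
    let mut word = [0_u8; 8];
    // copy whatever bytes exist in the window; anything past `offset_bytes`
    // is cleared by the mask anyway
    let end = (start + 8).min(table.len());
    word[..end - start].copy_from_slice(&table[start..end]);
    let mask = u64::MAX >> (8 * (8 - offset_bytes as u32));
    usize::try_from(u64::from_le_bytes(word) & mask).unwrap()
}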
- let value_storage_size = if let Some(value_length) = fixed_value_length - { - (value_length.get() - 1) as usize * items.len() - } else { - dynamic_value_storage_size - } - .max(size_of::()); - - let (offsets_storage_size, offset_bytes) = if fixed_key_length.is_some() - && fixed_value_length.is_some() - { - (0, 0) - } else { - let max_indexable_offset = - if fixed_key_length.is_some() { 0 } else { key_storage_size } - + if fixed_value_length.is_some() { - 0 - } else { - value_storage_size - }; - - #[cfg(target_pointer_width = "32")] - let bytes_per_offset: u8 = match max_indexable_offset { - i if i < 256 => 1, - i if i < (1 << 16) => 2, - i if i < (1 << 24) => 3, - _ => unreachable!(), - }; - #[cfg(not(target_pointer_width = "32"))] - let bytes_per_offset: u8 = match max_indexable_offset { - i if i < 256 => 1, - i if i < (1 << 16) => 2, - i if i < (1 << 24) => 3, - i if i < (1 << 32) => 4, - i if i < (1 << 40) => 5, - i if i < (1 << 48) => 6, - _ => unreachable!(), - }; - - (bytes_per_offset as usize * items.len(), bytes_per_offset) - }; - - let total_node_storage_size = size_of::
() - + hi.map(<[u8]>::len).unwrap_or(0) - + lo.len() - + key_storage_size - + value_storage_size - + offsets_storage_size; - - let mut ret = uninitialized_node(tf!(total_node_storage_size)); - - if offset_bytes == 0 { - assert!(fixed_key_length.is_some()); - assert!(fixed_value_length.is_some()); - } - - let header = ret.header_mut(); - header.fixed_key_length = fixed_key_length; - header.fixed_value_length = fixed_value_length; - header.lo_len = lo.len() as u64; - header.hi_len = hi.map(|h| h.len() as u64).unwrap_or(0); - header.fixed_key_stride = fixed_key_stride; - header.offset_bytes = offset_bytes; - header.children = tf!(items.len(), u32); - header.prefix_len = prefix_len; - header.version = 1; - header.next = next; - header.is_index = is_index; - - /* - header.merging_child = None; - header.merging = false; - header.probation_ops_remaining = 0; - header.activity_sketch = 0; - header.rewrite_generations = 0; - * TODO use UnsafeCell to allow this to soundly work - *ret.header_mut() = Header { - rewrite_generations: 0, - activity_sketch: 0, - probation_ops_remaining: 0, - merging_child: None, - merging: false, - fixed_key_length, - fixed_value_length, - lo_len: lo.len() as u64, - hi_len: hi.map(|hi| hi.len() as u64).unwrap_or(0), - fixed_key_stride, - offset_bytes, - children: tf!(items.len(), u32), - prefix_len, - version: 1, - next, - is_index, - }; - */ - - ret.lo_mut().copy_from_slice(lo); - - if let Some(ref mut hi_buf) = ret.hi_mut() { - hi_buf.copy_from_slice(hi.unwrap()); - } - - // we use either 0 or 1 offset tables. - // - if keys and values are all equal lengths, no offset table is - // required - // - if keys are equal length but values are not, we put an offset table - // at the beginning of the data buffer, then put each of the keys - // packed together, then varint-prefixed values which are addressed by - // the offset table - // - if keys and values are both different lengths, we put an offset - // table at the beginning of the data buffer, then varint-prefixed - // keys followed inline with varint-prefixed values. - // - // So, there are 4 possible layouts: - // 1. [fixed size keys] [fixed size values] - // - signified by fixed_key_length and fixed_value_length being Some - // 2. [offsets] [fixed size keys] [variable values] - // - fixed_key_length: Some, fixed_value_length: None - // 3. [offsets] [variable keys] [fixed-length values] - // - fixed_key_length: None, fixed_value_length: Some - // 4. [offsets] [variable keys followed by variable values] - // - fixed_key_length: None, fixed_value_length: None - let mut offset = 0_u64; - for (idx, (k, v)) in items.iter().enumerate() { - if fixed_key_length.is_none() || fixed_value_length.is_none() { - ret.set_offset(idx, tf!(offset)); - } - if fixed_key_length.is_none() { - assert!(fixed_key_stride.is_none()); - offset += varint::size(k.len() as u64) as u64 + k.len() as u64; - } - if fixed_value_length.is_none() { - offset += varint::size(v.len() as u64) as u64 + v.len() as u64; - } - - if let Some(stride) = fixed_key_stride { - assert!(stride.get() > 0); - } else { - // we completely skip writing any key data at all - // when the keys are linear, as they can be - // computed losslessly by multiplying the desired - // index by the fixed stride length. 
- let mut key_buf = ret.key_buf_for_offset_mut(idx); - if fixed_key_length.is_none() { - let varint_bytes = - varint::serialize_into(k.len() as u64, key_buf); - key_buf = &mut key_buf[varint_bytes..]; - } - k.write_into(&mut key_buf[..k.len()]); - } - - let mut value_buf = ret.value_buf_for_offset_mut(idx); - if fixed_value_length.is_none() { - let varint_bytes = - varint::serialize_into(v.len() as u64, value_buf); - value_buf = &mut value_buf[varint_bytes..]; - } - value_buf[..v.len()].copy_from_slice(v); - } - - if ret.is_index { - assert!(!ret.is_empty()) - } - - if let Some(stride) = ret.fixed_key_stride { - assert!( - ret.fixed_key_length.is_some(), - "fixed_key_stride is {} but fixed_key_length \ - is None for generated node {:?}", - stride, - ret - ); - } - - testing_assert!( - ret.is_sorted(), - "created new node is not sorted: {:?}, had items passed in: {:?} fixed stride: {:?}", - ret, - items, - fixed_key_stride - ); - - #[cfg(feature = "testing")] - { - for i in 0..items.len() { - if fixed_key_length.is_none() || fixed_value_length.is_none() { - assert!( - ret.offset(i) < total_node_storage_size, - "offset {} is {} which is larger than \ - total node storage size of {} for node \ - with header {:#?}", - i, - ret.offset(i), - total_node_storage_size, - ret.header() - ); - } - } - } - - log::trace!("created new node {:?}", ret); - - ret - } - - fn new_root(child_pid: u64) -> Inner { - Inner::new( - &[], - None, - 0, - true, - None, - &[(KeyRef::Slice(prefix::empty()), &child_pid.to_le_bytes())], - ) - } - - fn new_hoisted_root(left: u64, at: &[u8], right: u64) -> Inner { - Inner::new( - &[], - None, - 0, - true, - None, - &[ - (KeyRef::Slice(prefix::empty()), &left.to_le_bytes()), - (KeyRef::Slice(at), &right.to_le_bytes()), - ], - ) - } - - fn new_empty_leaf() -> Inner { - Inner::new(&[], None, 0, false, None, &[]) - } - - fn fixed_value_length(&self) -> Option { - self.fixed_value_length.map(|fvl| usize::from(fvl.get()) - 1) - } - - // returns the OPEN ENDED buffer where a key may be placed - fn key_buf_for_offset_mut(&mut self, index: usize) -> &mut [u8] { - assert!(self.fixed_key_stride.is_none()); - let offset_sz = self.children as usize * self.offset_bytes as usize; - if let Some(k_sz) = self.fixed_key_length { - let keys_buf = &mut self.data_buf_mut()[offset_sz..]; - &mut keys_buf[index * tf!(k_sz.get())..] - } else { - // find offset for key or combined kv offset - let offset = self.offset(index); - let keys_buf = &mut self.data_buf_mut()[offset_sz..]; - &mut keys_buf[offset..] - } - } - - // returns the OPEN ENDED buffer where a value may be placed - // - // NB: it's important that this is only ever called after setting - // the key and its varint length prefix, as this needs to be parsed - // for case 4. - fn value_buf_for_offset_mut(&mut self, index: usize) -> &mut [u8] { - let stride = self.fixed_key_stride; - match (self.fixed_key_length, self.fixed_value_length()) { - (_, Some(0)) => &mut [], - (Some(_), Some(v_sz)) | (None, Some(v_sz)) => { - let values_buf = self.values_buf_mut(); - &mut values_buf[index * tf!(v_sz)..] - } - (Some(_), None) => { - // find combined kv offset - let offset = self.offset(index); - let values_buf = self.values_buf_mut(); - &mut values_buf[offset..] 
- } - (None, None) => { - // find combined kv offset, skip key bytes - let offset = self.offset(index); - let values_buf = self.values_buf_mut(); - let slot_buf = &mut values_buf[offset..]; - let (key_len, key_varint_sz) = if stride.is_some() { - (0, 0) - } else { - varint::deserialize(slot_buf).unwrap() - }; - &mut slot_buf[tf!(key_len) + key_varint_sz..] - } - } - } - - // returns the OPEN ENDED buffer where a value may be read - // - // NB: it's important that this is only ever called after setting - // the key and its varint length prefix, as this needs to be parsed - // for case 4. - fn value_buf_for_offset(&self, index: usize) -> &[u8] { - let stride = self.fixed_key_stride; - match (self.fixed_key_length, self.fixed_value_length()) { - (_, Some(0)) => &[], - (Some(_), Some(v_sz)) | (None, Some(v_sz)) => { - let values_buf = self.values_buf(); - &values_buf[index * v_sz..] - } - (Some(_), None) => { - // find combined kv offset - let offset = self.offset(index); - let values_buf = self.values_buf(); - &values_buf[offset..] - } - (None, None) => { - // find combined kv offset, skip key bytes - let offset = self.offset(index); - let values_buf = self.values_buf(); - let slot_buf = &values_buf[offset..]; - let (key_len, key_varint_sz) = if stride.is_some() { - (0, 0) - } else { - varint::deserialize(slot_buf).unwrap() - }; - &slot_buf[tf!(key_len) + key_varint_sz..] - } - } - } - - #[inline] - fn offset(&self, index: usize) -> usize { - assert!(index < self.children as usize); - assert!( - self.offset_bytes > 0, - "offset invariant failed on {:#?}", - self.header() - ); - let offsets_buf_start = - tf!(self.lo_len) + tf!(self.hi_len) + size_of::
(); - - let start = offsets_buf_start + (index * self.offset_bytes as usize); - - let mask = usize::MAX - >> (8 - * (tf!(size_of::(), u32) - - u32::from(self.offset_bytes))); - - let mut tmp = std::mem::MaybeUninit::::uninit(); - let len = size_of::(); - - // we use unsafe code here because it cuts a significant number of - // CPU cycles on a simple insertion workload compared to using the - // more idiomatic approach of copying the correct number of bytes into - // a buffer initialized with zeroes. the seemingly "less" unsafe - // approach of using ptr::copy_nonoverlapping did not improve matters. - // using a match statement on offest_bytes and performing simpler - // casting for one or two bytes slowed things down due to increasing - // code size. this approach is branch-free and cut CPU usage of this - // function from 7-11% down to 0.5-2% in a monotonic insertion workload. - #[allow(unsafe_code)] - unsafe { - let ptr: *const u8 = self.ptr().add(start); - std::ptr::copy_nonoverlapping( - ptr, - tmp.as_mut_ptr() as *mut u8, - len, - ); - *tmp.as_mut_ptr() &= mask; - tmp.assume_init() - } - } - - fn set_offset(&mut self, index: usize, offset: usize) { - let offset_bytes = self.offset_bytes as usize; - let buf = { - let start = index * self.offset_bytes as usize; - let end = start + offset_bytes; - &mut self.data_buf_mut()[start..end] - }; - let bytes = &offset.to_le_bytes()[..offset_bytes]; - buf.copy_from_slice(bytes); - } - - fn values_buf_mut(&mut self) -> &mut [u8] { - let offset_sz = self.children as usize * self.offset_bytes as usize; - match (self.fixed_key_length, self.fixed_value_length()) { - (_, Some(0)) => &mut [], - (_, Some(fixed_value_length)) => { - let total_value_size = fixed_value_length * self.children(); - let data_buf = self.data_buf_mut(); - let start = data_buf.len() - total_value_size; - &mut data_buf[start..] - } - (Some(fixed_key_length), _) => { - let start = if self.fixed_key_stride.is_some() { - offset_sz - } else { - offset_sz + tf!(fixed_key_length.get()) * self.children() - }; - &mut self.data_buf_mut()[start..] - } - (None, None) => &mut self.data_buf_mut()[offset_sz..], - } - } - - fn values_buf(&self) -> &[u8] { - let offset_sz = self.children as usize * self.offset_bytes as usize; - match (self.fixed_key_length, self.fixed_value_length()) { - (_, Some(0)) => &[], - (_, Some(fixed_value_length)) => { - let total_value_size = fixed_value_length * self.children(); - let data_buf = self.data_buf(); - let start = data_buf.len() - total_value_size; - &data_buf[start..] - } - (Some(fixed_key_length), _) => { - let start = if self.fixed_key_stride.is_some() { - offset_sz - } else { - offset_sz - + tf!(fixed_key_length.get()) * self.children as usize - }; - &self.data_buf()[start..] - } - (None, None) => &self.data_buf()[offset_sz..], - } - } - - #[inline] - fn data_buf(&self) -> &[u8] { - let start = tf!(self.lo_len) + tf!(self.hi_len) + size_of::
();
-        &self.as_ref()[start..]
-    }
-
-    fn data_buf_mut(&mut self) -> &mut [u8] {
-        let start = tf!(self.lo_len) + tf!(self.hi_len) + size_of::<Header>
(); - &mut self.as_mut()[start..] - } - - fn weighted_split_point(&self) -> usize { - let bits_set = self.activity_sketch.count_ones() as usize; - - if bits_set == 0 { - // this shouldn't happen often, but it could happen - // if we burn through our probation_ops_remaining - // with just removals and no inserts, which don't tick - // the activity sketch. - return self.children() / 2; - } - - let mut weighted_count = 0_usize; - for bit in 0..8 { - if (1 << bit) & self.activity_sketch != 0 { - weighted_count += bit + 1; - } - } - let average_bit = weighted_count / bits_set; - (average_bit * self.children as usize / 8) - .min(self.children() - 1) - .max(1) - } - - fn split(&self) -> (Inner, Inner) { - assert!(self.children() >= 2); - assert!(!self.merging); - assert!(self.merging_child.is_none()); - - let split_point = self.weighted_split_point(); - - let left_max: IVec = self.index_key(split_point - 1).into(); - let right_min: IVec = self.index_key(split_point).into(); - - assert_ne!( - left_max, right_min, - "split point: {} node: {:?}", - split_point, self - ); - - // see if we can reduce the splitpoint length to reduce - // the number of bytes that end up in index nodes - let splitpoint_length = if self.is_index { - right_min.len() - } else { - // we can only perform suffix truncation when - // choosing the split points for leaf nodes. - // split points bubble up into indexes, but - // an important invariant is that for indexes - // the first item always matches the lo key, - // otherwise ranges would be permanently - // inaccessible by falling into the gap - // during a split. - right_min - .iter() - .zip(left_max.iter()) - .take_while(|(a, b)| a == b) - .count() - + 1 - }; - - let untruncated_split_key: IVec = self.index_key(split_point).into(); - - let possibly_truncated_split_key = - &untruncated_split_key[..splitpoint_length]; - - let split_key = - self.prefix_decode(KeyRef::Slice(possibly_truncated_split_key)); - - if untruncated_split_key.len() != possibly_truncated_split_key.len() { - log::trace!( - "shaving off {} bytes for split key", - untruncated_split_key.len() - - possibly_truncated_split_key.len() - ); - } - - log::trace!( - "splitting node with lo: {:?} split_key: {:?} hi: {:?} prefix_len {}", - self.lo(), - split_key, - self.hi(), - self.prefix_len - ); - - #[cfg(test)] - use rand::Rng; - - // prefix encoded length can only grow or stay the same - // during splits - #[cfg(test)] - let test_jitter_left = rand::thread_rng().gen_range(0, 16); - - #[cfg(not(test))] - let test_jitter_left = u8::MAX as usize; - - let additional_left_prefix = self.lo()[self.prefix_len as usize..] - .iter() - .zip(split_key[self.prefix_len as usize..].iter()) - .take((u8::MAX - self.prefix_len) as usize) - .take_while(|(a, b)| a == b) - .count() - .min(test_jitter_left); - - #[cfg(test)] - let test_jitter_right = rand::thread_rng().gen_range(0, 16); - - #[cfg(not(test))] - let test_jitter_right = u8::MAX as usize; - - let additional_right_prefix = if let Some(hi) = self.hi() { - split_key[self.prefix_len as usize..] 
- .iter() - .zip(hi[self.prefix_len as usize..].iter()) - .take((u8::MAX - self.prefix_len) as usize) - .take_while(|(a, b)| a == b) - .count() - .min(test_jitter_right) - } else { - 0 - }; - - log::trace!( - "trying to add additional left prefix length {} to items {:?}", - additional_left_prefix, - self.iter().take(split_point).collect::>() - ); - - let left_items: Vec<_> = self - .iter() - .take(split_point) - .map(|(k, v)| (IVec::from(k), v)) - .collect(); - - let left_items2: Vec<_> = left_items - .iter() - .map(|(k, v)| (KeyRef::Slice(&k[additional_left_prefix..]), *v)) - .collect(); - - // we need to convert these to ivecs first - // because if we shave off bytes of the - // KeyRef base then it may corrupt their - // semantic meanings, applying distances - // that overflow into different values - // than what the KeyRef was originally - // created to represent. - let right_ivecs: Vec<_> = self - .iter() - .skip(split_point) - .map(|(k, v)| (IVec::from(k), v)) - .collect(); - - let right_items: Vec<_> = right_ivecs - .iter() - .map(|(k, v)| (KeyRef::Slice(&k[additional_right_prefix..]), *v)) - .collect(); - - let mut left = Inner::new( - self.lo(), - Some(&split_key), - self.prefix_len + tf!(additional_left_prefix, u8), - self.is_index, - self.next, - &left_items2, - ); - - left.rewrite_generations = - if split_point == 1 { 0 } else { self.rewrite_generations }; - left.probation_ops_remaining = - tf!((self.children() / 2).min(u8::MAX as usize), u8); - - let mut right = Inner::new( - &split_key, - self.hi(), - self.prefix_len + tf!(additional_right_prefix, u8), - self.is_index, - self.next, - &right_items, - ); - - right.rewrite_generations = if split_point == self.children() - 1 { - 0 - } else { - self.rewrite_generations - }; - right.probation_ops_remaining = left.probation_ops_remaining; - - right.next = self.next; - - log::trace!( - "splitting node {:?} into left: {:?} and right: {:?}", - self, - left, - right - ); - - testing_assert!( - left.is_sorted(), - "split node left is not sorted: {:?}", - left - ); - testing_assert!( - right.is_sorted(), - "split node right is not sorted: {:?}", - right - ); - - (left, right) - } - - fn receive_merge(&self, other: &Inner) -> Inner { - log::trace!( - "merging node receiving merge left: {:?} right: {:?}", - self, - other - ); - assert_eq!(self.hi(), Some(other.lo())); - assert_eq!(self.is_index, other.is_index); - assert!(!self.merging); - assert!(self.merging_child.is_none()); - assert!(other.merging_child.is_none()); - - // we can shortcut the normal merge process - // when the right sibling can be merged into - // our node without considering keys or values - let can_seamlessly_absorb = self.fixed_key_stride.is_some() - && self.fixed_key_stride == other.fixed_key_stride - && self.fixed_value_length() == Some(0) - && other.fixed_value_length() == Some(0) - && self.fixed_key_length == other.fixed_key_length - && self.lo().len() == other.lo().len() - && self.hi().map(<[_]>::len) == other.hi().map(<[_]>::len); - - if can_seamlessly_absorb { - let mut ret = self.clone(); - if let Some(other_hi) = other.hi() { - ret.hi_mut().unwrap().copy_from_slice(other_hi); - } - ret.children = self.children.checked_add(other.children).unwrap(); - ret.next = other.next; - ret.rewrite_generations = - self.rewrite_generations.max(other.rewrite_generations); - return ret; - } - - let prefix_len = if let Some(right_hi) = other.hi() { - #[cfg(test)] - use rand::Rng; - - // prefix encoded length can only grow or stay the same - // during splits - #[cfg(test)] - 
let test_jitter = rand::thread_rng().gen_range(0, 16); - - #[cfg(not(test))] - let test_jitter = u8::MAX as usize; - - self.lo() - .iter() - .zip(right_hi) - .take(u8::MAX as usize) - .take_while(|(a, b)| a == b) - .count() - .min(test_jitter) - } else { - 0 - }; - - let extended_left: Vec<_>; - let extended_right: Vec<_>; - let items: Vec<_> = if self.prefix_len as usize == prefix_len - && other.prefix_len as usize == prefix_len - { - self.iter().chain(other.iter()).collect() - } else { - extended_left = self - .iter_keys() - .map(|k| { - prefix::reencode(self.prefix(), &IVec::from(k), prefix_len) - }) - .collect(); - - let left_iter = extended_left - .iter() - .map(|k| KeyRef::Slice(k.as_ref())) - .zip(self.iter_values()); - - extended_right = other - .iter_keys() - .map(|k| { - prefix::reencode(other.prefix(), &IVec::from(k), prefix_len) - }) - .collect(); - - let right_iter = extended_right - .iter() - .map(|k| KeyRef::Slice(k.as_ref())) - .zip(other.iter_values()); - - left_iter.chain(right_iter).collect() - }; - - let other_rewrite_generations = other.rewrite_generations; - let other_next = other.next; - - let mut ret = Inner::new( - self.lo(), - other.hi(), - u8::try_from(prefix_len).unwrap(), - self.is_index, - other_next, - &*items, - ); - - ret.rewrite_generations = - self.rewrite_generations.max(other_rewrite_generations); - - testing_assert!(ret.is_sorted()); - - ret - } - - fn header(&self) -> &Header { - assert_eq!(self.ptr() as usize % 8, 0); - unsafe { &*(self.ptr as *mut u64 as *mut Header) } - } - - fn header_mut(&mut self) -> &mut Header { - unsafe { &mut *(self.ptr as *mut Header) } - } - - fn is_empty(&self) -> bool { - self.children() == 0 - } - - pub(crate) fn rss(&self) -> u64 { - self.len() as u64 - } - - fn children(&self) -> usize { - self.children as usize - } - - fn contains_key(&self, key: &[u8]) -> bool { - if key < self.lo() - || if let Some(hi) = self.hi() { key >= hi } else { false } - { - return false; - } - if let Some(fixed_key_length) = self.fixed_key_length { - if usize::from(fixed_key_length.get()) != key.len() { - return false; - } - } - self.find(key).is_ok() - } - - fn find(&self, key: &[u8]) -> Result { - if let Some(stride) = self.fixed_key_stride { - // NB this branch must be able to handle - // keys that are shorter or longer than - // our fixed key length! 
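// The fixed-stride fast path below turns a lookup into arithmetic: compute
// the distance of the search key from the node's prefix-stripped lo key
// and divide by the stride. A simplified standalone model, assuming
// equal-length keys of at most four bytes that sort at or after lo (the
// real code also copes with shorter and longer keys, as the note above says):
fn stride_find_sketch(lo: &[u8], key: &[u8], stride: u16, children: usize) -> Result<usize, usize> {
    assert!(lo.len() <= 4 && key.len() <= 4 && stride > 0);
    let widen = |s: &[u8]| -> u32 {
        let mut buf = [0_u8; 4];
        buf[4 - s.len()..].copy_from_slice(s);
        u32::from_be_bytes(buf)
    };
    let (lo_n, key_n) = (widen(lo), widen(key));
    assert!(key_n >= lo_n, "the search key is assumed to sort at or after lo");
    let distance = (key_n - lo_n) as usize;
    let slot = distance / stride as usize;
    if key.len() == lo.len() && distance % stride as usize == 0 && slot < children {
        // the key sits exactly on a stride boundary, so it lives at `slot`
        Ok(slot)
    } else {
        // otherwise report an insertion point, as the real `find` does
        Err((slot + 1).min(children))
    }
}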
- let base = &self.lo()[self.prefix_len as usize..]; - - let s_len = key.len().min(base.len()); - - let shared_distance: usize = - shared_distance(&base[..s_len], &key[..s_len]); - - let mut distance = unshift_distance(shared_distance, base, key); - - if distance % stride.get() as usize == 0 && key.len() < base.len() { - // searching for [9] resulted in going to [9, 0], - // this must have 1 subtracted in that case - distance = distance.checked_sub(1).unwrap(); - } - - let offset = distance / stride.get() as usize; - - if base.len() != key.len() - || distance % stride.get() as usize != 0 - || offset >= self.children as usize - { - // search key does not evenly fit based on - // our fixed stride length - log::trace!("failed to find, search: {:?} lo: {:?} \ - prefix_len: {} distance: {} stride: {} offset: {} children: {}, node: {:?}", - key, self.lo(), self.prefix_len, distance, - stride.get(), offset, self.children, self - ); - return Err((offset + 1).min(self.children())); - } - - log::trace!("found offset in Node::find {}", offset); - return Ok(offset); - } - - let mut size = self.children(); - if size == 0 || self.index_key(0).unwrap_slice() > key { - return Err(0); - } - let mut left = 0; - let mut right = size; - while left < right { - let mid = left + size / 2; - - let l = self.index_key(mid); - let cmp = crate::fastcmp(l.unwrap_slice(), key); - - if cmp == Less { - left = mid + 1; - } else if cmp == Greater { - right = mid; - } else { - return Ok(mid); - } - - size = right - left; - } - Err(left) - } - - pub(crate) fn can_merge_child(&self, pid: u64) -> bool { - self.merging_child.is_none() - && !self.merging - && self.iter_index_pids().any(|p| p == pid) - } - - fn iter_keys( - &self, - ) -> impl Iterator> - + ExactSizeIterator - + DoubleEndedIterator - + Clone { - (0..self.children()).map(move |idx| self.index_key(idx)) - } - - fn iter_index_pids( - &self, - ) -> impl '_ - + Iterator - + ExactSizeIterator - + DoubleEndedIterator - + Clone { - assert!(self.is_index); - self.iter_values().map(move |pid_bytes| { - u64::from_le_bytes(pid_bytes.try_into().unwrap()) - }) - } - - fn iter_values( - &self, - ) -> impl Iterator + ExactSizeIterator + DoubleEndedIterator + Clone - { - (0..self.children()).map(move |idx| self.index_value(idx)) - } - - fn iter(&self) -> impl Iterator, &[u8])> { - self.iter_keys().zip(self.iter_values()) - } - - pub(crate) fn lo(&self) -> &[u8] { - let start = size_of::
();
-        let end = start + tf!(self.lo_len);
-        &self.as_ref()[start..end]
-    }
-
-    fn lo_mut(&mut self) -> &mut [u8] {
-        let start = size_of::<Header>();
-        let end = start + tf!(self.lo_len);
-        &mut self.as_mut()[start..end]
-    }
-
-    pub(crate) fn hi(&self) -> Option<&[u8]> {
-        let start = tf!(self.lo_len) + size_of::<Header>();
-        let end = start + tf!(self.hi_len);
-        if start == end {
-            None
-        } else {
-            Some(&self.as_ref()[start..end])
-        }
-    }
-
-    fn hi_mut(&mut self) -> Option<&mut [u8]> {
-        let start = tf!(self.lo_len) + size_of::<Header>
(); - let end = start + tf!(self.hi_len); - if start == end { - None - } else { - Some(&mut self.as_mut()[start..end]) - } - } - - fn index_key(&self, idx: usize) -> KeyRef<'_> { - assert!( - idx < self.children(), - "index {} is not less than internal length of {}", - idx, - self.children() - ); - - if let Some(stride) = self.fixed_key_stride { - return KeyRef::Computed { - base: &self.lo()[self.prefix_len as usize..], - distance: stride.get() as usize * idx, - }; - } - - let offset_sz = self.children as usize * self.offset_bytes as usize; - let keys_buf = &self.data_buf()[offset_sz..]; - let key_buf = { - match (self.fixed_key_length, self.fixed_value_length) { - (Some(k_sz), Some(_)) | (Some(k_sz), None) => { - &keys_buf[idx * tf!(k_sz.get())..] - } - (None, Some(_)) | (None, None) => { - // find offset for key or combined kv offset - let offset = self.offset(idx); - &keys_buf[offset..] - } - } - }; - - let (start, end) = if let Some(fixed_key_length) = self.fixed_key_length - { - (0, tf!(fixed_key_length.get())) - } else { - let (key_len, varint_sz) = varint::deserialize(key_buf).unwrap(); - let start = varint_sz; - let end = start + tf!(key_len); - (start, end) - }; - - KeyRef::Slice(&key_buf[start..end]) - } - - fn index_value(&self, idx: usize) -> &[u8] { - assert!( - idx < self.children(), - "index {} is not less than internal length of {}", - idx, - self.children() - ); - - if let Some(0) = self.fixed_value_length() { - return &[]; - } - - let buf = self.value_buf_for_offset(idx); - - let (start, end) = - if let Some(fixed_value_length) = self.fixed_value_length() { - (0, fixed_value_length) - } else { - let (value_len, varint_sz) = varint::deserialize(buf).unwrap(); - let start = varint_sz; - let end = start + tf!(value_len); - (start, end) - }; - - &buf[start..end] - } - - pub(crate) fn contains_upper_bound(&self, bound: &Bound) -> bool { - if let Some(hi) = self.hi() { - match bound { - Bound::Excluded(b) if hi >= b => true, - Bound::Included(b) if hi > b => true, - _ => false, - } - } else { - true - } - } - - pub(crate) fn contains_lower_bound( - &self, - bound: &Bound, - is_forward: bool, - ) -> bool { - let lo = self.lo(); - match bound { - Bound::Excluded(b) if lo < b || (is_forward && *b == lo) => true, - Bound::Included(b) if lo <= b => true, - Bound::Unbounded if !is_forward => self.hi().is_none(), - _ => lo.is_empty(), - } - } - - fn prefix_decode(&self, key: KeyRef<'_>) -> IVec { - match key { - KeyRef::Slice(s) => prefix::decode(self.prefix(), s), - KeyRef::Computed { base, distance } => { - let mut ret = prefix::decode(self.prefix(), base); - apply_computed_distance(&mut ret, distance); - ret - } - } - } - - pub(crate) fn prefix_encode<'a>(&self, key: &'a [u8]) -> &'a [u8] { - assert!(self.lo() <= key); - if let Some(hi) = self.hi() { - assert!( - hi > key, - "key being encoded {:?} >= self.hi {:?}", - key, - hi - ); - } - - &key[self.prefix_len as usize..] 
- } - - fn prefix(&self) -> &[u8] { - &self.lo()[..self.prefix_len as usize] - } - - #[cfg(feature = "testing")] - fn is_sorted(&self) -> bool { - if self.fixed_key_stride.is_some() { - return true; - } - if self.children() <= 1 { - return true; - } - - for i in 0..self.children() - 1 { - if self.index_key(i) >= self.index_key(i + 1) { - log::error!( - "key {:?} at index {} >= key {:?} at index {}", - self.index_key(i), - i, - self.index_key(i + 1), - i + 1 - ); - return false; - } - } - - true - } -} - -mod prefix { - use crate::IVec; - - pub(super) const fn empty() -> &'static [u8] { - &[] - } - - pub(super) fn reencode( - old_prefix: &[u8], - old_encoded_key: &[u8], - new_prefix_length: usize, - ) -> IVec { - old_prefix - .iter() - .chain(old_encoded_key.iter()) - .skip(new_prefix_length) - .copied() - .collect() - } - - pub(super) fn decode(old_prefix: &[u8], old_encoded_key: &[u8]) -> IVec { - let mut decoded_key = - Vec::with_capacity(old_prefix.len() + old_encoded_key.len()); - decoded_key.extend_from_slice(old_prefix); - decoded_key.extend_from_slice(old_encoded_key); - - IVec::from(decoded_key) - } -} - -#[cfg(test)] -mod test { - use std::collections::BTreeMap; - - use quickcheck::{Arbitrary, Gen}; - - use super::*; - - #[test] - fn keyref_ord_equal_length() { - assert_eq!( - KeyRef::Computed { base: &[], distance: 0 }, - KeyRef::Slice(&[]) - ); - assert_eq!( - KeyRef::Computed { base: &[0], distance: 0 }, - KeyRef::Slice(&[0]) - ); - assert_eq!( - KeyRef::Computed { base: &[0], distance: 1 }, - KeyRef::Slice(&[1]) - ); - assert_eq!( - KeyRef::Slice(&[1]), - KeyRef::Computed { base: &[0], distance: 1 }, - ); - assert_eq!( - KeyRef::Slice(&[1, 0]), - KeyRef::Computed { base: &[0, 255], distance: 1 }, - ); - assert_eq!( - KeyRef::Computed { base: &[0, 255], distance: 1 }, - KeyRef::Slice(&[1, 0]), - ); - assert!(KeyRef::Slice(&[1]) > KeyRef::Slice(&[0])); - assert!(KeyRef::Slice(&[]) < KeyRef::Slice(&[0])); - assert!( - KeyRef::Computed { base: &[0, 255], distance: 2 } - > KeyRef::Slice(&[1, 0]), - ); - assert!( - KeyRef::Slice(&[1, 0]) - < KeyRef::Computed { base: &[0, 255], distance: 2 } - ); - assert!( - KeyRef::Computed { base: &[0, 255], distance: 2 } - < KeyRef::Slice(&[2, 0]), - ); - assert!( - KeyRef::Slice(&[2, 0]) - > KeyRef::Computed { base: &[0, 255], distance: 2 } - ); - } - - #[test] - fn keyref_ord_varied_length() { - assert!( - KeyRef::Computed { base: &[0, 200], distance: 201 } - > KeyRef::Slice(&[1]) - ); - assert!( - KeyRef::Slice(&[1]) - < KeyRef::Computed { base: &[0, 200], distance: 201 } - ); - assert!( - KeyRef::Computed { base: &[2, 0], distance: 0 } - > KeyRef::Computed { base: &[2], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[2], distance: 0 } - < KeyRef::Computed { base: &[2, 0], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[0, 2], distance: 0 } - < KeyRef::Computed { base: &[2], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[2], distance: 0 } - > KeyRef::Computed { base: &[0, 2], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[2], distance: 0 } - != KeyRef::Computed { base: &[0, 2], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[2], distance: 0 } - != KeyRef::Computed { base: &[2, 0], distance: 0 } - ); - assert!( - KeyRef::Computed { base: &[1, 0], distance: 0 } - != KeyRef::Computed { base: &[255], distance: 1 } - ); - assert!( - KeyRef::Computed { base: &[0, 0], distance: 0 } - != KeyRef::Computed { base: &[255], distance: 1 } - ); - } - - #[test] - fn compute_distances() { - let table: 
&[(&[u8], &[u8], usize)] = - &[(&[0], &[0], 0), (&[0], &[1], 1), (&[0, 255], &[1, 0], 1)]; - - for (a, b, expected) in table { - assert_eq!(shared_distance(a, b), *expected); - } - } - - #[test] - fn apply_computed_distances() { - let table: &[(KeyRef<'_>, &[u8])] = &[ - (KeyRef::Computed { base: &[0], distance: 0 }, &[0]), - (KeyRef::Computed { base: &[0], distance: 1 }, &[1]), - (KeyRef::Computed { base: &[0, 255], distance: 1 }, &[1, 0]), - (KeyRef::Computed { base: &[2, 253], distance: 8 }, &[3, 5]), - ]; - - for (key_ref, expected) in table { - let ivec: IVec = key_ref.into(); - assert_eq!(&ivec, expected) - } - - let key_ref = KeyRef::Computed { base: &[2, 253], distance: 8 }; - let buf = &mut [0, 0][..]; - key_ref.write_into(buf); - assert_eq!(buf, &[3, 5]); - } - - #[test] - fn insert_regression() { - let node = Inner::new( - &[0, 0, 0, 0, 0, 0, 162, 211], - Some(&[0, 0, 0, 0, 0, 0, 163, 21]), - 6, - false, - Some(NonZeroU64::new(220).unwrap()), - &[ - (KeyRef::Slice(&[162, 211, 0, 0]), &[]), - (KeyRef::Slice(&[163, 15, 0, 0]), &[]), - ], - ); - - Node { overlay: Default::default(), inner: Arc::new(node) } - .insert(&vec![162, 211, 0, 0].into(), &vec![].into()) - .merge_overlay(); - } - - impl Arbitrary for Node { - fn arbitrary(g: &mut G) -> Node { - Node { - overlay: Default::default(), - inner: Arc::new(Inner::arbitrary(g)), - } - } - - fn shrink(&self) -> Box> { - let overlay = self.overlay.clone(); - Box::new( - self.inner.shrink().map(move |ni| Node { - overlay: overlay.clone(), - inner: ni, - }), - ) - } - } - - impl Arbitrary for Inner { - fn arbitrary(g: &mut G) -> Inner { - use rand::Rng; - - let mut lo: Vec = Arbitrary::arbitrary(g); - let mut hi: Option> = Some(Arbitrary::arbitrary(g)); - - let children: BTreeMap, Vec> = Arbitrary::arbitrary(g); - - if let Some((min_k, _)) = children.iter().next() { - if *min_k < lo { - lo = min_k.clone(); - } - } - - if let Some((max_k, _)) = children.iter().next_back() { - if Some(max_k) >= hi.as_ref() { - hi = None - } - } - - let hi: Option<&[u8]> = - if let Some(ref hi) = hi { Some(hi) } else { None }; - - let equal_length_keys = - g.gen::>().map(|kl| (kl % 32).max(1)); - - let min_key_length = equal_length_keys.unwrap_or(0); - - let equal_length_values = - g.gen::>().map(|vl| (vl % 32).max(1)); - - let min_value_length = equal_length_values.unwrap_or(0); - - let children_ref: Vec<(KeyRef<'_>, &[u8])> = children - .iter() - .filter(|(k, v)| { - k.len() >= min_key_length && v.len() >= min_value_length - }) - .map(|(k, v)| { - ( - if let Some(kl) = equal_length_keys { - KeyRef::Slice(&k[..kl]) - } else { - KeyRef::Slice(k.as_ref()) - }, - if let Some(vl) = equal_length_values { - &v[..vl] - } else { - v.as_ref() - }, - ) - }) - .collect::>() - .into_iter() - .collect(); - - let mut ret = - Inner::new(&lo, hi.map(|h| h), 0, false, None, &children_ref); - - ret.activity_sketch = g.gen(); - - if g.gen_bool(1. / 30.) { - ret.probation_ops_remaining = g.gen(); - } - - if g.gen_bool(1. / 4.) 
{ - ret.rewrite_generations = g.gen(); - } - - ret - } - - fn shrink(&self) -> Box> { - Box::new({ - let node = self.clone(); - let lo = node.lo(); - let shrink_lo = if lo.is_empty() { - None - } else { - Some(Inner::new( - &lo[..lo.len() - 1], - node.hi(), - node.prefix_len, - node.is_index, - node.next, - &node.iter().collect::>(), - )) - }; - - let shrink_hi = if let Some(hi) = node.hi() { - let new_hi = if node.is_empty() { - Some(&hi[..hi.len() - 1]) - } else { - let max_k = node.index_key(node.children() - 1); - if max_k >= hi[..hi.len() - 1] { - None - } else { - Some(&hi[..hi.len() - 1]) - } - }; - - Some(Inner::new( - node.lo(), - new_hi, - node.prefix_len, - node.is_index, - node.next, - &node.iter().collect::>(), - )) - } else { - None - }; - - let item_removals = (0..node.children()).map({ - let node = self.clone(); - move |i| { - let key = node.index_key(i).into(); - - Node { - overlay: Default::default(), - inner: Arc::new(node.clone()), - } - .remove(&key) - .merge_overlay() - .deref() - .clone() - } - }); - let item_reductions = (0..node.children()).flat_map({ - let node = self.clone(); - move |i| { - let (k, v) = ( - IVec::from(node.index_key(i)), - node.index_value(i).to_vec(), - ); - let k_shrink = k.shrink().flat_map({ - let node2 = Node { - overlay: Default::default(), - inner: Arc::new(node.clone()), - } - .remove(&k.deref().into()) - .merge_overlay() - .deref() - .clone(); - - let v = v.clone(); - move |k| { - if node2.contains_key(&k) { - None - } else { - let new_node = Node { - overlay: Default::default(), - inner: Arc::new(node2.clone()), - } - .insert( - &k.deref().into(), - &v.deref().into(), - ) - .merge_overlay() - .deref() - .clone(); - Some(new_node) - } - } - }); - let v_shrink = v.shrink().map({ - let node3 = node.clone(); - move |v| { - Node { - overlay: Default::default(), - inner: Arc::new(node3.clone()), - } - .insert(&k.deref().into(), &v.into()) - .merge_overlay() - .deref() - .clone() - } - }); - k_shrink.chain(v_shrink) - } - }); - - shrink_lo - .into_iter() - .chain(shrink_hi) - .into_iter() - .chain(item_removals) - .chain(item_reductions) - }) - } - } - - fn prop_indexable( - lo: Vec, - hi: Vec, - children: Vec<(Vec, Vec)>, - ) -> bool { - let children_ref: Vec<(KeyRef<'_>, &[u8])> = children - .iter() - .filter_map(|(k, v)| { - if k < &lo { - None - } else { - Some((KeyRef::Slice(k.as_ref()), v.as_ref())) - } - }) - .collect::>() - .into_iter() - .collect(); - - let ir = Inner::new( - &lo, - if hi <= lo { None } else { Some(&hi) }, - 0, - false, - None, - &children_ref, - ); - - assert_eq!(ir.children as usize, children_ref.len()); - - for (idx, (k, v)) in children_ref.iter().enumerate() { - assert_eq!(ir.index_key(idx), *k); - let value = ir.index_value(idx); - assert_eq!( - value, *v, - "expected value index {} to have value {:?} but instead it was {:?}", - idx, *v, value, - ); - } - true - } - - fn prop_insert_split_merge( - node: Inner, - key: Vec, - value: Vec, - ) -> bool { - // the inserted key must have its bytes after the prefix len - // be greater than the node's lo key after the prefix len - let skip_key_ops = !node - .contains_upper_bound(&Bound::Included((&*key).into())) - || !node - .contains_lower_bound(&Bound::Included((&*key).into()), true); - - let node2 = if !node.contains_key(&key) && !skip_key_ops { - let applied = Node { - overlay: Default::default(), - inner: Arc::new(node.clone()), - } - .insert(&key.deref().into(), &value.into()) - .merge_overlay() - .deref() - .clone(); - let applied_items: Vec<_> = 
applied.iter().collect(); - let clone = applied.clone(); - let cloned_items: Vec<_> = clone.iter().collect(); - assert_eq!(applied_items, cloned_items); - applied - } else { - node.clone() - }; - - if node2.children() > 2 { - let (left, right) = node2.split(); - let node3 = left.receive_merge(&right); - assert_eq!( - node3.iter().collect::>(), - node2.iter().collect::>() - ); - } - - if !node.contains_key(&key) && !skip_key_ops { - let node4 = Node { - overlay: Default::default(), - inner: Arc::new(node.clone()), - } - .remove(&key.deref().into()) - .merge_overlay() - .deref() - .clone(); - - assert_eq!( - node.iter().collect::>(), - node4.iter().collect::>(), - "we expected that removing item at key {:?} would return the node to its original pre-insertion state", - key - ); - } - - true - } - - quickcheck::quickcheck! { - #[cfg_attr(miri, ignore)] - fn indexable(lo: Vec, hi: Vec, children: BTreeMap, Vec>) -> bool { - prop_indexable(lo, hi, children.into_iter().collect()) - } - - #[cfg_attr(miri, ignore)] - fn insert_split_merge(node: Inner, key: Vec, value: Vec) -> bool { - prop_insert_split_merge(node, key, value) - } - - } - - #[test] - fn node_bug_00() { - // postmortem: offsets were not being stored, and the slot buf was not - // being considered correctly while writing or reading values in - // shared slots. - assert!(prop_indexable( - vec![], - vec![], - vec![(vec![], vec![]), (vec![1], vec![1]),] - )); - } - - #[test] - fn node_bug_01() { - // postmortem: hi and lo keys were not properly being accounted in the - // inital allocation - assert!(prop_indexable(vec![], vec![0], vec![],)); - } - - #[test] - fn node_bug_02() { - // postmortem: the test code had some issues with handling invalid keys for nodes - let node = Inner::new( - &[47, 97][..], - None, - 0, - false, - None, - &[(KeyRef::Slice(&[47, 97]), &[]), (KeyRef::Slice(&[99]), &[])], - ); - - assert!(prop_insert_split_merge(node, vec![], vec![])); - } - - #[test] - fn node_bug_03() { - // postmortem: linear key lengths were being improperly determined - assert!(prop_indexable( - vec![], - vec![], - vec![(vec![], vec![]), (vec![0], vec![]),] - )); - } - - #[test] - fn node_bug_04() { - let node = Inner::new( - &[0, 2, 253], - Some(&[0, 3, 33]), - 1, - true, - None, - &[ - ( - KeyRef::Computed { base: &[2, 253], distance: 0 }, - &620_u64.to_le_bytes(), - ), - ( - KeyRef::Computed { base: &[2, 253], distance: 2 }, - &665_u64.to_le_bytes(), - ), - ( - KeyRef::Computed { base: &[2, 253], distance: 4 }, - &683_u64.to_le_bytes(), - ), - ( - KeyRef::Computed { base: &[2, 253], distance: 6 }, - &713_u64.to_le_bytes(), - ), - ], - ); - - Node { inner: Arc::new(node), overlay: Default::default() }.split(); - } - - #[test] - fn node_bug_05() { - // postmortem: `prop_indexable` did not account for the requirement - // of feeding sorted items that are >= the lo key to the Node::new method. 
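As an aside on the shape of `prop_insert_split_merge` above: the core invariant it checks is that splitting a node and then merging the right sibling back must reproduce the original contents. Below is a minimal sketch of that round trip phrased against a plain sorted `Vec` model rather than sled's `Inner` node; all names here are illustrative only.

```rust
// Split/merge round trip against a simple sorted-Vec stand-in for a leaf.
fn split_then_merge_roundtrip(mut items: Vec<(u8, u8)>) -> bool {
    items.sort();
    items.dedup_by_key(|(k, _)| *k);

    if items.len() <= 2 {
        // mirrors the `node2.children() > 2` guard in the property above
        return true;
    }

    // split at the midpoint, as a B+ tree leaf split would
    let mid = items.len() / 2;
    let left = items[..mid].to_vec();
    let right = items[mid..].to_vec();

    // merging the right sibling back must reproduce the original contents
    let mut merged = left;
    merged.extend(right);
    merged == items
}
```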
- assert!(prop_indexable( - vec![1], - vec![], - vec![(vec![], vec![]), (vec![0], vec![])], - )) - } -} diff --git a/src/object_cache.rs b/src/object_cache.rs new file mode 100644 index 000000000..8301ae8f9 --- /dev/null +++ b/src/object_cache.rs @@ -0,0 +1,960 @@ +use std::cell::RefCell; +use std::collections::HashMap; +use std::io; +use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering}; +use std::sync::Arc; +use std::time::{Duration, Instant}; + +use cache_advisor::CacheAdvisor; +use concurrent_map::{ConcurrentMap, Minimum}; +use fault_injection::annotate; +use inline_array::InlineArray; +use parking_lot::RwLock; + +use crate::*; + +#[derive(Debug, Copy, Clone)] +pub struct CacheStats { + pub cache_hits: u64, + pub cache_misses: u64, + pub cache_hit_ratio: f32, + pub max_read_io_latency_us: u64, + pub sum_read_io_latency_us: u64, + pub deserialization_latency_max_us: u64, + pub deserialization_latency_sum_us: u64, + pub heap: HeapStats, + pub flush_max: FlushStats, + pub flush_sum: FlushStats, + pub compacted_heap_slots: u64, + pub tree_leaves_merged: u64, +} + +#[derive(Default, Debug, Clone, Copy)] +pub struct FlushStats { + pub pre_block_on_previous_flush: Duration, + pub pre_block_on_current_quiescence: Duration, + pub serialization_latency: Duration, + pub compute_defrag_latency: Duration, + pub storage_latency: Duration, + pub post_write_eviction_latency: Duration, + pub objects_flushed: u64, + pub write_batch: WriteBatchStats, +} + +impl FlushStats { + pub fn sum(&self, other: &FlushStats) -> FlushStats { + use std::ops::Add; + + FlushStats { + pre_block_on_previous_flush: self + .pre_block_on_previous_flush + .add(other.pre_block_on_previous_flush), + pre_block_on_current_quiescence: self + .pre_block_on_current_quiescence + .add(other.pre_block_on_current_quiescence), + compute_defrag_latency: self + .compute_defrag_latency + .add(other.compute_defrag_latency), + serialization_latency: self + .serialization_latency + .add(other.serialization_latency), + storage_latency: self.storage_latency.add(other.storage_latency), + post_write_eviction_latency: self + .post_write_eviction_latency + .add(other.post_write_eviction_latency), + objects_flushed: self.objects_flushed.add(other.objects_flushed), + write_batch: self.write_batch.sum(&other.write_batch), + } + } + pub fn max(&self, other: &FlushStats) -> FlushStats { + FlushStats { + pre_block_on_previous_flush: self + .pre_block_on_previous_flush + .max(other.pre_block_on_previous_flush), + pre_block_on_current_quiescence: self + .pre_block_on_current_quiescence + .max(other.pre_block_on_current_quiescence), + compute_defrag_latency: self + .compute_defrag_latency + .max(other.compute_defrag_latency), + serialization_latency: self + .serialization_latency + .max(other.serialization_latency), + storage_latency: self.storage_latency.max(other.storage_latency), + post_write_eviction_latency: self + .post_write_eviction_latency + .max(other.post_write_eviction_latency), + objects_flushed: self.objects_flushed.max(other.objects_flushed), + write_batch: self.write_batch.max(&other.write_batch), + } + } +} + +#[derive(Clone, Debug, PartialEq)] +pub enum Dirty { + NotYetSerialized { + low_key: InlineArray, + node: Object, + collection_id: CollectionId, + }, + CooperativelySerialized { + object_id: ObjectId, + collection_id: CollectionId, + low_key: InlineArray, + data: Arc>, + mutation_count: u64, + }, + MergedAndDeleted { + object_id: ObjectId, + collection_id: CollectionId, + }, +} + +impl Dirty { + pub fn is_final_state(&self) -> bool { 
+ match self { + Dirty::NotYetSerialized { .. } => false, + Dirty::CooperativelySerialized { .. } => true, + Dirty::MergedAndDeleted { .. } => true, + } + } +} + +#[derive(Debug, Default, Clone, Copy)] +struct FlushStatTracker { + count: u64, + sum: FlushStats, + max: FlushStats, +} + +#[derive(Debug, Default)] +pub(crate) struct ReadStatTracker { + pub cache_hits: AtomicU64, + pub cache_misses: AtomicU64, + pub max_read_io_latency_us: AtomicU64, + pub sum_read_io_latency_us: AtomicU64, + pub max_deserialization_latency_us: AtomicU64, + pub sum_deserialization_latency_us: AtomicU64, +} + +#[derive(Clone)] +pub struct ObjectCache { + pub config: Config, + global_error: Arc>, + pub object_id_index: ConcurrentMap< + ObjectId, + Object, + INDEX_FANOUT, + EBR_LOCAL_GC_BUFFER_SIZE, + >, + heap: Heap, + cache_advisor: RefCell, + flush_epoch: FlushEpochTracker, + dirty: ConcurrentMap<(FlushEpoch, ObjectId), Dirty, 4>, + compacted_heap_slots: Arc, + pub(super) tree_leaves_merged: Arc, + #[cfg(feature = "for-internal-testing-only")] + pub(super) event_verifier: Arc, + invariants: Arc, + flush_stats: Arc>, + pub(super) read_stats: Arc, +} + +impl std::panic::RefUnwindSafe + for ObjectCache +{ +} + +impl ObjectCache { + /// Returns the recovered ObjectCache, the tree indexes, and a bool signifying whether the system + /// was recovered or not + pub fn recover( + config: &Config, + ) -> io::Result<( + ObjectCache, + HashMap>, + bool, + )> { + let HeapRecovery { heap, recovered_nodes, was_recovered } = + Heap::recover(LEAF_FANOUT, config)?; + + let (object_id_index, indices) = initialize(&recovered_nodes, &heap); + + // validate recovery + for ObjectRecovery { object_id, collection_id, low_key } in + recovered_nodes + { + let index = indices.get(&collection_id).unwrap(); + let node = index.get(&low_key).unwrap(); + assert_eq!(node.object_id, object_id); + } + + if config.cache_capacity_bytes < 256 { + log::debug!( + "Db configured to have Config.cache_capacity_bytes \ + of under 256, so we will use the minimum of 256 bytes instead" + ); + } + + if config.entry_cache_percent > 80 { + log::debug!( + "Db configured to have Config.entry_cache_percent\ + of over 80%, so we will clamp it to the maximum of 80% instead" + ); + } + + let pc = ObjectCache { + config: config.clone(), + object_id_index, + cache_advisor: RefCell::new(CacheAdvisor::new( + config.cache_capacity_bytes.max(256), + config.entry_cache_percent.min(80), + )), + global_error: heap.get_global_error_arc(), + heap, + dirty: Default::default(), + flush_epoch: Default::default(), + #[cfg(feature = "for-internal-testing-only")] + event_verifier: Arc::default(), + compacted_heap_slots: Arc::default(), + tree_leaves_merged: Arc::default(), + invariants: Arc::default(), + flush_stats: Arc::default(), + read_stats: Arc::default(), + }; + + Ok((pc, indices, was_recovered)) + } + + pub fn is_clean(&self) -> bool { + self.dirty.is_empty() + } + + pub fn read(&self, object_id: ObjectId) -> Option>> { + match self.heap.read(object_id) { + Some(Ok(buf)) => Some(Ok(buf)), + Some(Err(e)) => Some(Err(annotate!(e))), + None => None, + } + } + + pub fn stats(&self) -> CacheStats { + let flush_stats = { *self.flush_stats.read() }; + let cache_hits = self.read_stats.cache_hits.load(Ordering::Acquire); + let cache_misses = self.read_stats.cache_misses.load(Ordering::Acquire); + let cache_hit_ratio = + cache_hits as f32 / (cache_hits + cache_misses).max(1) as f32; + + CacheStats { + cache_hits, + cache_misses, + cache_hit_ratio, + compacted_heap_slots: self + 
.compacted_heap_slots + .load(Ordering::Acquire), + tree_leaves_merged: self.tree_leaves_merged.load(Ordering::Acquire), + heap: self.heap.stats(), + flush_max: flush_stats.max, + flush_sum: flush_stats.sum, + deserialization_latency_max_us: self + .read_stats + .max_deserialization_latency_us + .load(Ordering::Acquire), + deserialization_latency_sum_us: self + .read_stats + .sum_deserialization_latency_us + .load(Ordering::Acquire), + max_read_io_latency_us: self + .read_stats + .max_read_io_latency_us + .load(Ordering::Acquire), + sum_read_io_latency_us: self + .read_stats + .sum_read_io_latency_us + .load(Ordering::Acquire), + } + } + + pub fn check_error(&self) -> io::Result<()> { + let err_ptr: *const (io::ErrorKind, String) = + self.global_error.load(Ordering::Acquire); + + if err_ptr.is_null() { + Ok(()) + } else { + let deref: &(io::ErrorKind, String) = unsafe { &*err_ptr }; + Err(io::Error::new(deref.0, deref.1.clone())) + } + } + + pub fn set_error(&self, error: &io::Error) { + let kind = error.kind(); + let reason = error.to_string(); + + let boxed = Box::new((kind, reason)); + let ptr = Box::into_raw(boxed); + + if self + .global_error + .compare_exchange( + std::ptr::null_mut(), + ptr, + Ordering::SeqCst, + Ordering::SeqCst, + ) + .is_err() + { + // global fatal error already installed, drop this one + unsafe { + drop(Box::from_raw(ptr)); + } + } + } + + pub fn allocate_default_node( + &self, + collection_id: CollectionId, + ) -> Object { + let object_id = self.allocate_object_id(FlushEpoch::MIN); + + let node = Object { + object_id, + collection_id, + low_key: InlineArray::default(), + inner: Arc::new(RwLock::new(CacheBox { + leaf: Some(Box::new(Leaf::empty())).into(), + logged_index: BTreeMap::default(), + })), + }; + + self.object_id_index.insert(object_id, node.clone()); + + node + } + + pub fn allocate_object_id( + &self, + #[allow(unused)] flush_epoch: FlushEpoch, + ) -> ObjectId { + let object_id = self.heap.allocate_object_id(); + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + object_id, + flush_epoch, + event_verifier::State::CleanPagedIn, + concat!(file!(), ':', line!(), ":allocated"), + ); + } + + object_id + } + + pub fn current_flush_epoch(&self) -> FlushEpoch { + self.flush_epoch.current_flush_epoch() + } + pub fn check_into_flush_epoch(&self) -> FlushEpochGuard { + self.flush_epoch.check_in() + } + + pub fn install_dirty( + &self, + flush_epoch: FlushEpoch, + object_id: ObjectId, + dirty: Dirty, + ) { + // dirty can transition from: + // None -> NotYetSerialized + // None -> MergedAndDeleted + // None -> CooperativelySerialized + // + // NotYetSerialized -> MergedAndDeleted + // NotYetSerialized -> CooperativelySerialized + // + // if the new Dirty is final, we must assert that + // we are transitioning from None or NotYetSerialized. + // + // if the new Dirty is not final, we must assert + // that the old value is also not final. + + let last_dirty_opt = self.dirty.insert((flush_epoch, object_id), dirty); + + if let Some(last_dirty) = last_dirty_opt { + assert!( + !last_dirty.is_final_state(), + "tried to install another Dirty marker for a node that is already + finalized for this flush epoch. \nflush_epoch: {:?}\nlast: {:?}", + flush_epoch, last_dirty, + ); + } + } + + // NB: must not be called while holding a leaf lock - which also means + // that no two LeafGuards can be held concurrently in the same scope due to + // this being called in the destructor. 
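One detail worth spelling out from `install_dirty` above: the dirty map is keyed by `(FlushEpoch, ObjectId)`, so the flusher can collect everything owed to a single epoch with one ordered range scan up to the epoch boundary. A rough sketch of that access pattern, using a `BTreeMap` in place of sled's lock-free `ConcurrentMap` (types and names simplified, not the real API):

```rust
use std::collections::BTreeMap;

type FlushEpoch = u64;
type ObjectId = u64;

// Everything dirtied at or before `flush_through` is removed and returned in
// (epoch, object id) order; later epochs are left for future flushes.
fn drain_epoch(
    dirty: &mut BTreeMap<(FlushEpoch, ObjectId), Vec<u8>>,
    flush_through: FlushEpoch,
) -> Vec<(ObjectId, Vec<u8>)> {
    let boundary = (flush_through + 1, 0);
    let keys: Vec<(FlushEpoch, ObjectId)> =
        dirty.range(..boundary).map(|(k, _)| *k).collect();

    keys.into_iter()
        .map(|key| {
            let bytes = dirty.remove(&key).expect("entry was just observed");
            (key.1, bytes)
        })
        .collect()
}
```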
+ pub fn mark_access_and_evict( + &self, + object_id: ObjectId, + size: usize, + #[allow(unused)] flush_epoch: FlushEpoch, + ) -> io::Result<()> { + let mut ca = self.cache_advisor.borrow_mut(); + let to_evict = ca.accessed_reuse_buffer(*object_id, size); + for (node_to_evict, _rough_size) in to_evict { + let object_id = + if let Some(object_id) = ObjectId::new(*node_to_evict) { + object_id + } else { + unreachable!("object ID must never have been 0"); + }; + + let node = if let Some(n) = self.object_id_index.get(&object_id) { + if *n.object_id != *node_to_evict { + log::debug!("during cache eviction, node to evict did not match current occupant for {:?}", node_to_evict); + continue; + } + n + } else { + log::debug!("during cache eviction, unable to find node to evict for {:?}", node_to_evict); + continue; + }; + + let mut write = node.inner.write(); + if write.leaf.is_none() { + // already paged out + continue; + } + let leaf: &mut Leaf = write.leaf.as_mut().unwrap(); + + if let Some(dirty_epoch) = leaf.dirty_flush_epoch { + // We can't page out this leaf until it has been + // flushed, because its changes are not yet durable. + leaf.page_out_on_flush = + leaf.page_out_on_flush.max(Some(dirty_epoch)); + } else if let Some(max_unflushed_epoch) = leaf.max_unflushed_epoch { + leaf.page_out_on_flush = + leaf.page_out_on_flush.max(Some(max_unflushed_epoch)); + } else { + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + node.object_id, + flush_epoch, + event_verifier::State::PagedOut, + concat!(file!(), ':', line!(), ":page-out"), + ); + } + write.leaf = None; + } + } + + Ok(()) + } + + pub fn heap_object_id_pin(&self) -> ebr::Guard<'_, DeferredFree, 16, 16> { + self.heap.heap_object_id_pin() + } + + pub fn flush(&self) -> io::Result { + let mut write_batch = vec![]; + + log::trace!("advancing epoch"); + let ( + previous_flush_complete_notifier, + this_vacant_notifier, + forward_flush_notifier, + ) = self.flush_epoch.roll_epoch_forward(); + + let before_previous_block = Instant::now(); + + log::trace!( + "waiting for previous flush of {:?} to complete", + previous_flush_complete_notifier.epoch() + ); + let previous_epoch = + previous_flush_complete_notifier.wait_for_complete(); + + let pre_block_on_previous_flush = before_previous_block.elapsed(); + + let before_current_quiescence = Instant::now(); + + log::trace!( + "waiting for our epoch {:?} to become vacant", + this_vacant_notifier.epoch() + ); + + assert_eq!(previous_epoch.increment(), this_vacant_notifier.epoch()); + + let flush_through_epoch: FlushEpoch = + this_vacant_notifier.wait_for_complete(); + + let pre_block_on_current_quiescence = + before_current_quiescence.elapsed(); + + self.invariants.mark_flushing_epoch(flush_through_epoch); + + let mut objects_to_defrag = self.heap.objects_to_defrag(); + + let flush_boundary = (flush_through_epoch.increment(), ObjectId::MIN); + + let mut evict_after_flush = vec![]; + + let before_serialization = Instant::now(); + + for ((dirty_epoch, dirty_object_id), dirty_value_initial_read) in + self.dirty.range(..flush_boundary) + { + objects_to_defrag.remove(&dirty_object_id); + + let dirty_value = self + .dirty + .remove(&(dirty_epoch, dirty_object_id)) + .expect("violation of flush responsibility"); + + if let Dirty::NotYetSerialized { .. 
} = &dirty_value { + assert_eq!(dirty_value_initial_read, dirty_value); + } + + // drop is necessary to increase chance of Arc strong count reaching 1 + // while taking ownership of the value + drop(dirty_value_initial_read); + + assert_eq!(dirty_epoch, flush_through_epoch); + + match dirty_value { + Dirty::MergedAndDeleted { object_id, collection_id } => { + assert_eq!(object_id, dirty_object_id); + + log::trace!( + "MergedAndDeleted for {:?}, adding None to write_batch", + object_id + ); + write_batch.push(Update::Free { object_id, collection_id }); + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + object_id, + dirty_epoch, + event_verifier::State::AddedToWriteBatch, + concat!( + file!(), + ':', + line!(), + ":flush-merged-and-deleted" + ), + ); + } + } + Dirty::CooperativelySerialized { + object_id: _, + collection_id, + low_key, + mutation_count: _, + mut data, + } => { + Arc::make_mut(&mut data); + let data = Arc::into_inner(data).unwrap(); + write_batch.push(Update::Store { + object_id: dirty_object_id, + collection_id, + low_key, + data, + }); + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + dirty_object_id, + dirty_epoch, + event_verifier::State::AddedToWriteBatch, + concat!( + file!(), + ':', + line!(), + ":flush-cooperative" + ), + ); + } + } + Dirty::NotYetSerialized { low_key, collection_id, node } => { + assert_eq!(low_key, node.low_key); + assert_eq!(dirty_object_id, node.object_id, "mismatched node ID for NotYetSerialized with low key {:?}", low_key); + let mut lock = node.inner.write(); + + let leaf_ref: &mut Leaf = if let Some( + lock_ref, + ) = + lock.leaf.as_mut() + { + lock_ref + } else { + #[cfg(feature = "for-internal-testing-only")] + self.event_verifier + .print_debug_history_for_object(dirty_object_id); + + panic!("failed to get lock for node that was NotYetSerialized, low key {:?} id {:?}", low_key, node.object_id); + }; + + assert_eq!(leaf_ref.lo, low_key); + + let data = if leaf_ref.dirty_flush_epoch + == Some(flush_through_epoch) + { + if let Some(deleted_at) = leaf_ref.deleted { + #[cfg(feature = "for-internal-testing-only")] + if deleted_at <= flush_through_epoch { + println!( + "{dirty_object_id:?} deleted at {deleted_at:?} \ + but we are flushing at {flush_through_epoch:?}" + ); + self.event_verifier + .print_debug_history_for_object( + dirty_object_id, + ); + } + assert!(deleted_at > flush_through_epoch); + } + + leaf_ref.max_unflushed_epoch = + leaf_ref.dirty_flush_epoch.take(); + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + dirty_object_id, + dirty_epoch, + event_verifier::State::AddedToWriteBatch, + concat!( + file!(), + ':', + line!(), + ":flush-serialize" + ), + ); + } + + leaf_ref.serialize(self.config.zstd_compression_level) + } else { + // Here we expect that there was a benign data race and that another thread + // mutated the leaf after encountering it being dirty for our epoch, after + // storing a CooperativelySerialized in the dirty map. 
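The comment above describes this race from the flusher's side; the writer's side (not shown in this file) is roughly the sketch below: a writer that finds a leaf still dirty for an earlier, in-flight epoch serializes that old state itself and parks the bytes in the dirty map before mutating the leaf for the current epoch. Types and names here are simplified stand-ins, not sled's actual structs.

```rust
use std::collections::BTreeMap;

type FlushEpoch = u64;
type ObjectId = u64;

enum DirtyEntry {
    NotYetSerialized,
    CooperativelySerialized { data: Vec<u8> },
}

fn cooperatively_serialize(
    dirty: &mut BTreeMap<(FlushEpoch, ObjectId), DirtyEntry>,
    object_id: ObjectId,
    leaf_dirty_epoch: FlushEpoch,
    current_epoch: FlushEpoch,
    serialized_old_state: Vec<u8>,
) {
    if leaf_dirty_epoch < current_epoch {
        // hand the old epoch's bytes to the flusher so its write batch stays complete
        dirty.insert(
            (leaf_dirty_epoch, object_id),
            DirtyEntry::CooperativelySerialized { data: serialized_old_state },
        );
    }
    // the writer then marks the leaf dirty for the current epoch as usual
    dirty.insert((current_epoch, object_id), DirtyEntry::NotYetSerialized);
}
```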
+ let dirty_value_2_opt = + self.dirty.remove(&(dirty_epoch, dirty_object_id)); + + if let Some(Dirty::CooperativelySerialized { + low_key: low_key_2, + mutation_count: _, + mut data, + collection_id: ci2, + object_id: ni2, + }) = dirty_value_2_opt + { + assert_eq!(node.object_id, ni2); + assert_eq!(node.object_id, dirty_object_id); + assert_eq!(low_key, low_key_2); + assert_eq!(node.low_key, low_key); + assert_eq!(collection_id, ci2); + Arc::make_mut(&mut data); + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + dirty_object_id, + dirty_epoch, + event_verifier::State::AddedToWriteBatch, + concat!( + file!(), + ':', + line!(), + ":flush-laggy-cooperative" + ), + ); + } + + Arc::into_inner(data).unwrap() + } else { + log::error!( + "violation of flush responsibility for second read \ + of expected cooperative serialization. leaf in question's \ + dirty_flush_epoch is {:?}, our expected key was {:?}. node.deleted: {:?}", + leaf_ref.dirty_flush_epoch, + (dirty_epoch, dirty_object_id), + leaf_ref.deleted, + ); + #[cfg(feature = "for-internal-testing-only")] + self.event_verifier.print_debug_history_for_object( + dirty_object_id, + ); + + unreachable!("a leaf was expected to be cooperatively serialized but it was not available"); + } + }; + + write_batch.push(Update::Store { + object_id: dirty_object_id, + collection_id: collection_id, + low_key, + data, + }); + + if leaf_ref.page_out_on_flush == Some(flush_through_epoch) { + // page_out_on_flush is set to false + // on page-in due to serde(skip) + evict_after_flush.push(node.clone()); + } + } + } + } + + if !objects_to_defrag.is_empty() { + log::debug!( + "objects to defrag (after flush loop): {}", + objects_to_defrag.len() + ); + self.compacted_heap_slots + .fetch_add(objects_to_defrag.len() as u64, Ordering::Relaxed); + } + + let before_compute_defrag = Instant::now(); + + if cfg!(not(feature = "monotonic-behavior")) { + for fragmented_object_id in objects_to_defrag { + let object_opt = + self.object_id_index.get(&fragmented_object_id); + + let object = if let Some(object) = object_opt { + object + } else { + log::debug!("defragmenting object not found in object_id_index: {fragmented_object_id:?}"); + continue; + }; + + if let Some(ref inner) = object.inner.read().leaf { + if let Some(dirty) = inner.dirty_flush_epoch { + assert!(dirty > flush_through_epoch); + // This object will be rewritten anyway when its dirty epoch gets flushed + continue; + } + } + + let data = match self.read(fragmented_object_id) { + Some(Ok(data)) => data, + Some(Err(e)) => { + let annotated = annotate!(e); + log::error!( + "failed to read object during GC: {annotated:?}" + ); + continue; + } + None => { + log::error!( + "failed to read object during GC: object not found" + ); + continue; + } + }; + + write_batch.push(Update::Store { + object_id: fragmented_object_id, + collection_id: object.collection_id, + low_key: object.low_key, + data, + }); + } + } + + let compute_defrag_latency = before_compute_defrag.elapsed(); + + let serialization_latency = before_serialization.elapsed(); + + let before_storage = Instant::now(); + + let objects_flushed = write_batch.len() as u64; + + #[cfg(feature = "for-internal-testing-only")] + let write_batch_object_ids: Vec = + write_batch.iter().map(Update::object_id).collect(); + + let write_batch_stats = if objects_flushed > 0 { + let write_batch_stats = self.heap.write_batch(write_batch)?; + log::trace!( + "marking {flush_through_epoch:?} as flushed - \ + {objects_flushed} objects written, 
{write_batch_stats:?}", + ); + write_batch_stats + } else { + WriteBatchStats::default() + }; + + let storage_latency = before_storage.elapsed(); + + #[cfg(feature = "for-internal-testing-only")] + { + for update_object_id in write_batch_object_ids { + self.event_verifier.mark( + update_object_id, + flush_through_epoch, + event_verifier::State::Flushed, + concat!(file!(), ':', line!(), ":flush-finished"), + ); + } + } + + log::trace!( + "marking the forward flush notifier that {:?} is flushed", + flush_through_epoch + ); + + self.invariants.mark_flushed_epoch(flush_through_epoch); + + forward_flush_notifier.mark_complete(); + + let before_eviction = Instant::now(); + + for node_to_evict in evict_after_flush { + // NB: since we dropped this leaf and lock after we marked its + // node in evict_after_flush, it's possible that it may have + // been written to afterwards. + let mut lock = node_to_evict.inner.write(); + let leaf = lock.leaf.as_mut().unwrap(); + + if let Some(dirty_epoch) = leaf.dirty_flush_epoch { + if dirty_epoch != flush_through_epoch { + continue; + } + } else { + continue; + } + + #[cfg(feature = "for-internal-testing-only")] + { + self.event_verifier.mark( + node_to_evict.object_id, + flush_through_epoch, + event_verifier::State::PagedOut, + concat!(file!(), ':', line!(), ":page-out-after-flush"), + ); + } + + lock.leaf = None; + } + + let post_write_eviction_latency = before_eviction.elapsed(); + + // kick forward the low level epoch-based reclamation systems + // because this operation can cause a lot of garbage to build + // up, and this speeds up its reclamation. + self.flush_epoch.manually_advance_epoch(); + self.heap.manually_advance_epoch(); + + let ret = FlushStats { + pre_block_on_current_quiescence, + pre_block_on_previous_flush, + serialization_latency, + storage_latency, + post_write_eviction_latency, + objects_flushed, + write_batch: write_batch_stats, + compute_defrag_latency, + }; + + let mut flush_stats = self.flush_stats.write(); + flush_stats.count += 1; + flush_stats.max = flush_stats.max.max(&ret); + flush_stats.sum = flush_stats.sum.sum(&ret); + + assert_eq!(self.dirty.range(..flush_boundary).count(), 0); + + Ok(ret) + } +} + +fn initialize( + recovered_nodes: &[ObjectRecovery], + heap: &Heap, +) -> ( + ConcurrentMap< + ObjectId, + Object, + INDEX_FANOUT, + EBR_LOCAL_GC_BUFFER_SIZE, + >, + HashMap>, +) { + let mut trees: HashMap> = HashMap::new(); + + let object_id_index: ConcurrentMap< + ObjectId, + Object, + INDEX_FANOUT, + EBR_LOCAL_GC_BUFFER_SIZE, + > = ConcurrentMap::default(); + + for ObjectRecovery { object_id, collection_id, low_key } in recovered_nodes + { + let node = Object { + object_id: *object_id, + collection_id: *collection_id, + low_key: low_key.clone(), + inner: Arc::new(RwLock::new(CacheBox { + leaf: None.into(), + logged_index: BTreeMap::default(), + })), + }; + + assert!(object_id_index.insert(*object_id, node.clone()).is_none()); + + let tree = trees.entry(*collection_id).or_default(); + + assert!( + tree.insert(low_key.clone(), node).is_none(), + "inserted multiple objects with low key {:?}", + low_key + ); + } + + // initialize default collections if not recovered + for collection_id in [NAME_MAPPING_COLLECTION_ID, DEFAULT_COLLECTION_ID] { + let tree = trees.entry(collection_id).or_default(); + + if tree.is_empty() { + let object_id = heap.allocate_object_id(); + + let initial_low_key = InlineArray::MIN; + + let empty_node = Object { + object_id, + collection_id, + low_key: initial_low_key.clone(), + inner: 
Arc::new(RwLock::new(CacheBox { + leaf: Some(Box::new(Leaf::empty())).into(), + logged_index: BTreeMap::default(), + })), + }; + + assert!(object_id_index + .insert(object_id, empty_node.clone()) + .is_none()); + + assert!(tree.insert(initial_low_key, empty_node).is_none()); + } else { + assert!( + tree.contains_key(&InlineArray::MIN), + "tree {:?} had no minimum node", + collection_id + ); + } + } + + for (cid, tree) in &trees { + assert!( + tree.contains_key(&InlineArray::MIN), + "tree {:?} had no minimum node", + cid + ); + } + + (object_id_index, trees) +} diff --git a/src/object_location_mapper.rs b/src/object_location_mapper.rs new file mode 100644 index 000000000..2aa672a80 --- /dev/null +++ b/src/object_location_mapper.rs @@ -0,0 +1,305 @@ +use std::num::NonZeroU64; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::Arc; + +use fnv::FnvHashSet; +use pagetable::PageTable; + +use crate::{ + heap::{SlabAddress, UpdateMetadata, N_SLABS}, + Allocator, ObjectId, +}; + +#[derive(Debug, Default, Copy, Clone)] +pub struct AllocatorStats { + pub objects_allocated: u64, + pub objects_freed: u64, + pub heap_slots_allocated: u64, + pub heap_slots_freed: u64, +} + +#[derive(Default)] +struct SlabTenancy { + slot_to_object_id: PageTable, + slot_allocator: Arc, +} + +impl SlabTenancy { + // returns (ObjectId, slot index) pairs + fn objects_to_defrag( + &self, + target_fill_ratio: f32, + ) -> Vec<(ObjectId, u64)> { + let (frag_min, frag_max) = if let Some(frag) = + self.slot_allocator.fragmentation_cutoff(target_fill_ratio) + { + frag + } else { + return vec![]; + }; + + let mut ret = vec![]; + + for fragmented_slot in frag_min..frag_max { + let object_id_u64 = self + .slot_to_object_id + .get(fragmented_slot) + .load(Ordering::Acquire); + + if let Some(object_id) = ObjectId::new(object_id_u64) { + ret.push((object_id, fragmented_slot)); + } + } + + ret + } +} + +#[derive(Clone)] +pub(crate) struct ObjectLocationMapper { + object_id_to_location: PageTable, + slab_tenancies: Arc<[SlabTenancy; N_SLABS]>, + object_id_allocator: Arc, + target_fill_ratio: f32, +} + +impl ObjectLocationMapper { + pub(crate) fn new( + recovered_metadata: &[UpdateMetadata], + target_fill_ratio: f32, + ) -> ObjectLocationMapper { + let mut ret = ObjectLocationMapper { + object_id_to_location: PageTable::default(), + slab_tenancies: Arc::new(core::array::from_fn(|_| { + SlabTenancy::default() + })), + object_id_allocator: Arc::default(), + target_fill_ratio, + }; + + let mut object_ids: FnvHashSet = Default::default(); + let mut slots_per_slab: [FnvHashSet; N_SLABS] = + core::array::from_fn(|_| Default::default()); + + for update_metadata in recovered_metadata { + match update_metadata { + UpdateMetadata::Store { + object_id, + collection_id: _, + location, + low_key: _, + } => { + object_ids.insert(**object_id); + let slab_address = SlabAddress::from(*location); + slots_per_slab[slab_address.slab() as usize] + .insert(slab_address.slot()); + ret.insert(*object_id, slab_address); + } + UpdateMetadata::Free { .. 
} => { + unreachable!() + } + } + } + + ret.object_id_allocator = + Arc::new(Allocator::from_allocated(&object_ids)); + + let slabs = Arc::get_mut(&mut ret.slab_tenancies).unwrap(); + + for i in 0..N_SLABS { + let slab = &mut slabs[i]; + slab.slot_allocator = + Arc::new(Allocator::from_allocated(&slots_per_slab[i])); + } + + ret + } + + pub(crate) fn get_max_allocated_per_slab(&self) -> Vec<(usize, u64)> { + let mut ret = vec![]; + + for (i, slab) in self.slab_tenancies.iter().enumerate() { + if let Some(max_allocated) = slab.slot_allocator.max_allocated() { + ret.push((i, max_allocated)); + } + } + + ret + } + + pub(crate) fn stats(&self) -> AllocatorStats { + let (objects_allocated, objects_freed) = + self.object_id_allocator.counters(); + + let mut heap_slots_allocated = 0; + let mut heap_slots_freed = 0; + + for slab_id in 0..N_SLABS { + let (allocated, freed) = + self.slab_tenancies[slab_id].slot_allocator.counters(); + heap_slots_allocated += allocated; + heap_slots_freed += freed; + } + + AllocatorStats { + objects_allocated, + objects_freed, + heap_slots_allocated, + heap_slots_freed, + } + } + + pub(crate) fn clone_object_id_allocator_arc(&self) -> Arc { + self.object_id_allocator.clone() + } + + pub(crate) fn allocate_object_id(&self) -> ObjectId { + // object IDs wrap a NonZeroU64, so if we get 0, just re-allocate and leak the id + + let mut object_id = self.object_id_allocator.allocate(); + if object_id == 0 { + object_id = self.object_id_allocator.allocate(); + assert_ne!(object_id, 0); + } + ObjectId::new(object_id).unwrap() + } + + pub(crate) fn clone_slab_allocator_arc( + &self, + slab_id: u8, + ) -> Arc { + self.slab_tenancies[usize::from(slab_id)].slot_allocator.clone() + } + + pub(crate) fn allocate_slab_slot(&self, slab_id: u8) -> SlabAddress { + let slot = + self.slab_tenancies[usize::from(slab_id)].slot_allocator.allocate(); + SlabAddress::from_slab_slot(slab_id, slot) + } + + pub(crate) fn free_slab_slot(&self, slab_address: SlabAddress) { + self.slab_tenancies[usize::from(slab_address.slab())] + .slot_allocator + .free(slab_address.slot()) + } + + pub(crate) fn get_location_for_object( + &self, + object_id: ObjectId, + ) -> Option { + let location_u64 = + self.object_id_to_location.get(*object_id).load(Ordering::Acquire); + + let nzu = NonZeroU64::new(location_u64)?; + + Some(SlabAddress::from(nzu)) + } + + /// Returns the previous address for this object, if it is vacating one. + /// + /// # Panics + /// + /// Asserts that the new location is actually unoccupied. This is a major + /// correctness violation if that isn't true. + pub(crate) fn insert( + &self, + object_id: ObjectId, + new_location: SlabAddress, + ) -> Option { + // insert into object_id_to_location + let location_nzu: NonZeroU64 = new_location.into(); + let location_u64 = location_nzu.get(); + + let last_u64 = self + .object_id_to_location + .get(*object_id) + .swap(location_u64, Ordering::Release); + + let last_address_opt = if let Some(nzu) = NonZeroU64::new(last_u64) { + let last_address = SlabAddress::from(nzu); + Some(last_address) + } else { + None + }; + + // insert into slab_tenancies + let slab = new_location.slab(); + let slot = new_location.slot(); + + let _last_oid_at_location = self.slab_tenancies[usize::from(slab)] + .slot_to_object_id + .get(slot) + .swap(*object_id, Ordering::Release); + + // TODO add debug event verifier here assert_eq!(0, last_oid_at_location); + + last_address_opt + } + + /// Unmaps an object and returns its location. 
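Zooming out from `insert` above: `ObjectLocationMapper` keeps two views in sync, `object_id -> SlabAddress` plus a per-slab `slot -> object_id` reverse index, and it does so with plain atomic swaps rather than locks, so a reader racing with a relocation always observes a whole value. A toy sketch of that shape, with fixed-size vectors of atomics standing in for sled's `PageTable`s (illustrative only):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct TinyMapper {
    object_to_location: Vec<AtomicU64>, // indexed by object id, 0 == unmapped
    slot_to_object: Vec<AtomicU64>,     // indexed by slot, 0 == vacant
}

impl TinyMapper {
    // Returns the previous location if the object is vacating one.
    fn insert(&self, object_id: u64, location: u64) -> Option<u64> {
        let prev = self.object_to_location[object_id as usize]
            .swap(location, Ordering::Release);
        self.slot_to_object[location as usize].swap(object_id, Ordering::Release);
        if prev == 0 { None } else { Some(prev) }
    }

    // Unmaps an object and returns the slot it occupied, if any.
    fn remove(&self, object_id: u64) -> Option<u64> {
        let prev = self.object_to_location[object_id as usize]
            .swap(0, Ordering::Release);
        if prev == 0 {
            return None;
        }
        // clear the reverse entry so the slot can eventually be reused
        self.slot_to_object[prev as usize].swap(0, Ordering::Release);
        Some(prev)
    }
}
```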
+ /// + /// # Panics + /// + /// Asserts that the object was actually stored in a location. + pub(crate) fn remove(&self, object_id: ObjectId) -> Option { + let last_u64 = self + .object_id_to_location + .get(*object_id) + .swap(0, Ordering::Release); + + if let Some(nzu) = NonZeroU64::new(last_u64) { + let last_address = SlabAddress::from(nzu); + + let slab = last_address.slab(); + let slot = last_address.slot(); + + let last_oid_at_location = self.slab_tenancies[usize::from(slab)] + .slot_to_object_id + .get(slot) + .swap(0, Ordering::Release); + + assert_eq!(*object_id, last_oid_at_location); + + Some(last_address) + } else { + None + } + } + + pub(crate) fn objects_to_defrag(&self) -> FnvHashSet { + let mut ret = FnvHashSet::default(); + + for slab_id in 0..N_SLABS { + let slab = &self.slab_tenancies[usize::from(slab_id)]; + + for (object_id, slot) in + slab.objects_to_defrag(self.target_fill_ratio) + { + let sa = SlabAddress::from_slab_slot( + u8::try_from(slab_id).unwrap(), + slot, + ); + + let rt_sa = if let Some(rt_raw_sa) = NonZeroU64::new( + self.object_id_to_location + .get(*object_id) + .load(Ordering::Acquire), + ) { + SlabAddress::from(rt_raw_sa) + } else { + // object has been removed but its slot has not yet been freed, + // hopefully due to a deferred write + // TODO test that with a testing event log + continue; + }; + + if sa == rt_sa { + let newly_inserted = ret.insert(object_id); + assert!(newly_inserted, "{object_id:?} present multiple times across slab objects_to_defrag"); + } + } + } + + ret + } +} diff --git a/src/oneshot.rs b/src/oneshot.rs deleted file mode 100644 index f709849de..000000000 --- a/src/oneshot.rs +++ /dev/null @@ -1,149 +0,0 @@ -use std::{ - future::Future, - pin::Pin, - task::{Context, Poll, Waker}, - time::{Duration, Instant}, -}; - -use parking_lot::{Condvar, Mutex}; - -use crate::Arc; - -#[derive(Debug)] -struct OneShotState { - filled: bool, - fused: bool, - item: Option, - waker: Option, -} - -impl Default for OneShotState { - fn default() -> OneShotState { - OneShotState { filled: false, fused: false, item: None, waker: None } - } -} - -/// A Future value which may or may not be filled -#[derive(Debug)] -pub struct OneShot { - mu: Arc>>, - cv: Arc, -} - -/// The completer side of the Future -pub struct OneShotFiller { - mu: Arc>>, - cv: Arc, -} - -impl OneShot { - /// Create a new `OneShotFiller` and the `OneShot` - /// that will be filled by its completion. - pub fn pair() -> (OneShotFiller, Self) { - let mu = Arc::new(Mutex::new(OneShotState::default())); - let cv = Arc::new(Condvar::new()); - let future = Self { mu: mu.clone(), cv: cv.clone() }; - let filler = OneShotFiller { mu, cv }; - - (filler, future) - } - - /// Block on the `OneShot`'s completion - /// or dropping of the `OneShotFiller` - pub fn wait(self) -> Option { - let mut inner = self.mu.lock(); - while !inner.filled { - self.cv.wait(&mut inner); - } - inner.item.take() - } - - /// Block on the `OneShot`'s completion - /// or dropping of the `OneShotFiller`, - /// returning an error if not filled - /// before a given timeout or if the - /// system shuts down before then. - /// - /// Upon a successful receive, the - /// oneshot should be dropped, as it - /// will never yield that value again. 
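The deleted `OneShot` above is essentially a fill-once future with a blocking wait. For context, a rough std-only stand-in for the contract its docs describe is a bounded channel of capacity one; this is just an illustration, not sled code:

```rust
use std::sync::mpsc::{sync_channel, Receiver, RecvTimeoutError, SyncSender};
use std::time::Duration;

// A capacity-1 channel gives the same fill-once, wait-with-timeout shape.
fn oneshot_pair<T>() -> (SyncSender<T>, Receiver<T>) {
    sync_channel(1)
}

fn main() -> Result<(), RecvTimeoutError> {
    let (filler, waiter) = oneshot_pair();
    filler.send(42u64).expect("receiver still alive");

    // Err(RecvTimeoutError::Disconnected) here plays the role of the
    // dropped-OneShotFiller case documented above.
    let value = waiter.recv_timeout(Duration::from_millis(10))?;
    assert_eq!(value, 42);
    Ok(())
}
```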
- pub fn wait_timeout( - &mut self, - mut timeout: Duration, - ) -> Result { - let mut inner = self.mu.lock(); - while !inner.filled { - let start = Instant::now(); - let res = self.cv.wait_for(&mut inner, timeout); - if res.timed_out() { - return Err(std::sync::mpsc::RecvTimeoutError::Disconnected); - } - timeout = timeout.checked_sub(start.elapsed()).unwrap_or_default(); - } - if let Some(item) = inner.item.take() { - Ok(item) - } else { - Err(std::sync::mpsc::RecvTimeoutError::Disconnected) - } - } -} - -impl Future for OneShot { - type Output = Option; - - fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll { - let mut state = self.mu.lock(); - if state.fused { - return Poll::Pending; - } - if state.filled { - state.fused = true; - Poll::Ready(state.item.take()) - } else { - state.waker = Some(cx.waker().clone()); - Poll::Pending - } - } -} - -impl OneShotFiller { - /// Complete the `OneShot` - pub fn fill(self, inner: T) { - let mut state = self.mu.lock(); - - if let Some(waker) = state.waker.take() { - waker.wake(); - } - - state.filled = true; - state.item = Some(inner); - - // having held the mutex makes this linearized - // with the notify below. - drop(state); - - let _notified = self.cv.notify_all(); - } -} - -impl Drop for OneShotFiller { - fn drop(&mut self) { - let mut state = self.mu.lock(); - - if state.filled { - return; - } - - if let Some(waker) = state.waker.take() { - waker.wake(); - } - - state.filled = true; - - // having held the mutex makes this linearized - // with the notify below. - drop(state); - - let _notified = self.cv.notify_all(); - } -} diff --git a/src/pagecache/constants.rs b/src/pagecache/constants.rs deleted file mode 100644 index cab16aedd..000000000 --- a/src/pagecache/constants.rs +++ /dev/null @@ -1,45 +0,0 @@ -use super::*; - -// crc: u32 4 -// kind: u8 1 -// seg num: u64 9 (varint) -// pid: u64 9 (varint) -// len: u64 9 (varint) -/// Log messages have a header that might eb up to this length. -pub const MAX_MSG_HEADER_LEN: usize = 32; - -/// Log segments have a header of this length. -pub const SEG_HEADER_LEN: usize = 20; - -/// During testing, this should never be exceeded. -// TODO drop this to 3 over time -#[allow(unused)] -pub const MAX_SPACE_AMPLIFICATION: f64 = 10.; - -pub(crate) const META_PID: PageId = 0; -pub(crate) const COUNTER_PID: PageId = 1; -pub(crate) const BATCH_MANIFEST_PID: PageId = PageId::max_value() - 666; - -pub(crate) const PAGE_CONSOLIDATION_THRESHOLD: usize = 10; -pub(crate) const SEGMENT_CLEANUP_THRESHOLD: usize = 50; - -// Allows for around 1 trillion items to be stored -// 2^37 * (assuming 50% node fill, 8 items per leaf) -// and well below 1% of nodes being non-leaf nodes. -#[cfg(target_pointer_width = "64")] -pub(crate) const MAX_PID_BITS: usize = 37; - -// Allows for around 32 billion items to be stored -// 2^32 * (assuming 50% node fill of 8 items per leaf) -// and well below 1% of nodes being non-leaf nodes. -// Assumed to be enough for a 32-bit system. -#[cfg(target_pointer_width = "32")] -pub(crate) const MAX_PID_BITS: usize = 32; - -// Limit keys and values to 128gb on 64-bit systems. -#[cfg(target_pointer_width = "64")] -pub(crate) const MAX_BLOB: usize = 1 << 37; - -// Limit keys and values to 512mb on 32-bit systems. 
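Two small arithmetic checks on the deleted constants here: the 32-byte message header budget covers a 4-byte crc, a 1-byte kind, and three u64 varints of at most 9 bytes each, and the `MAX_BLOB` limits noted in the comments work out to 128 GiB and 512 MiB respectively.

```rust
fn main() {
    // crc: u32 (4) + kind: u8 (1) + three varint u64 fields (<= 9 bytes each)
    assert_eq!(4 + 1 + 3 * 9, 32); // matches MAX_MSG_HEADER_LEN

    const GIB: u64 = 1 << 30;
    const MIB: u64 = 1 << 20;
    assert_eq!((1u64 << 37) / GIB, 128); // 64-bit MAX_BLOB: 128 GiB
    assert_eq!((1u64 << 29) / MIB, 512); // 32-bit MAX_BLOB: 512 MiB
}
```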
-#[cfg(target_pointer_width = "32")] -pub(crate) const MAX_BLOB: usize = 1 << 29; diff --git a/src/pagecache/disk_pointer.rs b/src/pagecache/disk_pointer.rs deleted file mode 100644 index 4499a0195..000000000 --- a/src/pagecache/disk_pointer.rs +++ /dev/null @@ -1,75 +0,0 @@ -use std::num::NonZeroU64; - -use super::{HeapId, LogOffset}; -use crate::*; - -/// A pointer to a location on disk or an off-log heap item. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub enum DiskPtr { - /// Points to a value stored in the single-file log. - Inline(LogOffset), - /// Points to a value stored off-log in the heap. - Heap(Option, HeapId), -} - -impl DiskPtr { - pub(crate) const fn new_inline(l: LogOffset) -> Self { - DiskPtr::Inline(l) - } - - pub(crate) fn new_heap_item(lid: LogOffset, heap_id: HeapId) -> Self { - DiskPtr::Heap(Some(NonZeroU64::new(lid).unwrap()), heap_id) - } - - pub(crate) const fn is_inline(&self) -> bool { - matches!(self, DiskPtr::Inline(_)) - } - - pub(crate) const fn is_heap_item(&self) -> bool { - matches!(self, DiskPtr::Heap(_, _)) - } - - pub(crate) const fn heap_id(&self) -> Option { - if let DiskPtr::Heap(_, heap_id) = self { - Some(*heap_id) - } else { - None - } - } - - #[doc(hidden)] - pub const fn lid(&self) -> Option { - match self { - DiskPtr::Inline(lid) => Some(*lid), - DiskPtr::Heap(Some(lid), _) => Some(lid.get()), - DiskPtr::Heap(None, _) => None, - } - } - - pub(crate) fn forget_heap_log_coordinates(&mut self) { - match self { - DiskPtr::Inline(_) => {} - DiskPtr::Heap(ref mut opt, _) => *opt = None, - } - } - - pub(crate) const fn original_lsn(&self) -> Lsn { - match self { - DiskPtr::Heap(_, heap_id) => heap_id.original_lsn, - DiskPtr::Inline(_) => panic!("called original_lsn on non-Heap"), - } - } - - pub(crate) const fn heap_pointer_merged_into_snapshot(&self) -> bool { - matches!(self, DiskPtr::Heap(None, _)) - } -} - -impl fmt::Display for DiskPtr { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - write!(f, "{:?}", self) - } -} diff --git a/src/pagecache/header.rs b/src/pagecache/header.rs deleted file mode 100644 index f0ce2611a..000000000 --- a/src/pagecache/header.rs +++ /dev/null @@ -1,66 +0,0 @@ -use super::*; - -// This is the most writers in a single IO buffer -// that we have space to accommodate in the counter -// for writers in the IO buffer header. 
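The deleted `header.rs` that follows packs five fields into a single `u64`: the buffer offset in the low 24 bits, a 7-bit writer count, seal and maxed flags at bits 31 and 32, and a 31-bit salt on top. A small worked example of reading that packing back out, reusing the same shift arithmetic (the helpers here mirror the deleted ones, the `main` is illustrative):

```rust
type Header = u64;

// Same extraction arithmetic as the deleted helpers: writer count lives in
// bits 24..=30, the buffer offset in the low 24 bits.
const fn n_writers(v: Header) -> Header {
    (v << 33) >> 57
}

const fn offset(v: Header) -> u64 {
    (v << 40) >> 40
}

fn main() {
    let mut header: Header = 0;
    header += 5;        // bump_offset(header, 5)
    header += 3 << 24;  // three calls to incr_writers
    header |= 1 << 31;  // mk_sealed
    assert_eq!(offset(header), 5);
    assert_eq!(n_writers(header), 3);
    assert_eq!(header & (1 << 31), 1 << 31); // is_sealed
}
```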
-pub(in crate::pagecache) const MAX_WRITERS: Header = 127; - -pub(in crate::pagecache) type Header = u64; - -// salt: 31 bits -// maxed: 1 bit -// seal: 1 bit -// n_writers: 7 bits -// offset: 24 bits - -pub(crate) const fn is_maxed(v: Header) -> bool { - v & (1 << 32) == 1 << 32 -} - -pub(crate) const fn mk_maxed(v: Header) -> Header { - v | (1 << 32) -} - -pub(crate) const fn is_sealed(v: Header) -> bool { - v & (1 << 31) == 1 << 31 -} - -pub(crate) const fn mk_sealed(v: Header) -> Header { - v | (1 << 31) -} - -pub(crate) const fn n_writers(v: Header) -> Header { - (v << 33) >> 57 -} - -#[inline] -pub(crate) fn incr_writers(v: Header) -> Header { - assert_ne!(n_writers(v), MAX_WRITERS); - v + (1 << 24) -} - -#[inline] -pub(crate) fn decr_writers(v: Header) -> Header { - assert_ne!(n_writers(v), 0); - v - (1 << 24) -} - -#[inline] -pub(crate) fn offset(v: Header) -> usize { - let ret = (v << 40) >> 40; - usize::try_from(ret).unwrap() -} - -#[inline] -pub(crate) fn bump_offset(v: Header, by: usize) -> Header { - assert_eq!(by >> 24, 0); - v + (by as Header) -} - -pub(crate) const fn bump_salt(v: Header) -> Header { - (v + (1 << 33)) & 0xFFFF_FFFD_0000_0000 -} - -pub(crate) const fn salt(v: Header) -> Header { - (v >> 33) << 33 -} diff --git a/src/pagecache/heap.rs b/src/pagecache/heap.rs deleted file mode 100644 index 92e294496..000000000 --- a/src/pagecache/heap.rs +++ /dev/null @@ -1,416 +0,0 @@ -#![allow(unsafe_code)] - -use std::{ - convert::{TryFrom, TryInto}, - fmt::{self, Debug}, - fs::File, - path::Path, - sync::{ - atomic::{AtomicU32, Ordering::Acquire}, - Arc, - }, -}; - -use crate::{ - ebr::pin, - pagecache::{pread_exact, pwrite_all, MessageKind}, - stack::Stack, - Error, Lsn, Result, -}; - -#[cfg(not(feature = "testing"))] -pub(crate) const MIN_SZ: u64 = 32 * 1024; - -#[cfg(feature = "testing")] -pub(crate) const MIN_SZ: u64 = 128; - -const MIN_TRAILING_ZEROS: u64 = MIN_SZ.trailing_zeros() as u64; - -pub type SlabId = u8; -pub type SlabIdx = u32; - -/// A unique identifier for a particular slot in the heap -#[allow(clippy::module_name_repetitions)] -#[derive(Clone, Copy, PartialOrd, Ord, Eq, PartialEq, Hash)] -pub struct HeapId { - pub location: u64, - pub original_lsn: Lsn, -} - -impl Debug for HeapId { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - let (slab, idx, original_lsn) = self.decompose(); - f.debug_struct("HeapId") - .field("slab", &slab) - .field("idx", &idx) - .field("original_lsn", &original_lsn) - .finish() - } -} - -impl HeapId { - pub fn decompose(&self) -> (SlabId, SlabIdx, Lsn) { - const IDX_MASK: u64 = (1 << 32) - 1; - let slab_id = - u8::try_from((self.location >> 32).trailing_zeros()).unwrap(); - let slab_idx = u32::try_from(self.location & IDX_MASK).unwrap(); - (slab_id, slab_idx, self.original_lsn) - } - - pub fn compose( - slab_id: SlabId, - slab_idx: SlabIdx, - original_lsn: Lsn, - ) -> HeapId { - let slab = 1 << (32 + u64::from(slab_id)); - let heap_id = slab | u64::from(slab_idx); - HeapId { location: heap_id, original_lsn } - } - - fn offset(&self) -> u64 { - let (slab_id, idx, _) = self.decompose(); - slab_id_to_size(slab_id) * u64::from(idx) - } - - fn slab_size(&self) -> u64 { - let (slab_id, _idx, _lsn) = self.decompose(); - slab_id_to_size(slab_id) - } -} - -pub(crate) fn slab_size(size: u64) -> u64 { - slab_id_to_size(size_to_slab_id(size)) -} - -fn slab_id_to_size(slab_id: u8) -> u64 { - 1 << (MIN_TRAILING_ZEROS + u64::from(slab_id)) -} - -fn size_to_slab_id(size: u64) -> SlabId { - // find 
the power of 2 that is at least 64k - let normalized_size = std::cmp::max(MIN_SZ, size.next_power_of_two()); - - // drop the lowest unused bits - let rebased_size = normalized_size >> MIN_TRAILING_ZEROS; - - u8::try_from(rebased_size.trailing_zeros()).unwrap() -} - -pub(crate) struct Reservation { - slab_free: Arc>, - completed: bool, - file: File, - pub heap_id: HeapId, - from_tip: bool, -} - -impl Drop for Reservation { - fn drop(&mut self) { - if !self.completed { - let (_slab_id, idx, _) = self.heap_id.decompose(); - self.slab_free.push(idx, &pin()); - } - } -} - -impl Reservation { - pub fn complete(mut self, data: &[u8]) -> Result { - log::trace!( - "Heap::complete({:?}) to offset {} in file {:?}", - self.heap_id, - self.heap_id.offset(), - self.file - ); - assert_eq!(data.len() as u64, self.heap_id.slab_size()); - - // write data - pwrite_all(&self.file, data, self.heap_id.offset())?; - - // sync data - if self.from_tip { - self.file.sync_all()?; - } else if cfg!(not(target_os = "linux")) { - self.file.sync_data()?; - } else { - #[allow(clippy::assertions_on_constants)] - { - assert!(cfg!(target_os = "linux")); - } - - #[cfg(target_os = "linux")] - { - use std::os::unix::io::AsRawFd; - let ret = unsafe { - libc::sync_file_range( - self.file.as_raw_fd(), - i64::try_from(self.heap_id.offset()).unwrap(), - i64::try_from(data.len()).unwrap(), - libc::SYNC_FILE_RANGE_WAIT_BEFORE - | libc::SYNC_FILE_RANGE_WRITE - | libc::SYNC_FILE_RANGE_WAIT_AFTER, - ) - }; - if ret < 0 { - let err = std::io::Error::last_os_error(); - if let Some(libc::ENOSYS) = err.raw_os_error() { - self.file.sync_all()?; - } else { - return Err(err.into()); - } - } - } - } - - // if this is not reached due to an IO error, - // the offset will be returned to the Slab in Drop - self.completed = true; - - Ok(self.heap_id) - } -} - -#[derive(Debug)] -pub(crate) struct Heap { - // each slab stores - // items that are double - // the size of the previous, - // ranging from 64k in the - // smallest slab to 2^48 in - // the last. 
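Restating the sizing rule from the deleted `heap.rs` above (note the comments mention 64k, but the non-testing `MIN_SZ` is `32 * 1024`, so the smallest slab actually holds 32k slots): slab `n` holds slots of `MIN_SZ << n` bytes, and an object is placed in the smallest slab whose slot size covers its power-of-two-rounded length. A self-contained sketch of that math, following the same logic as `size_to_slab_id` and `slab_id_to_size`:

```rust
const MIN_SZ: u64 = 32 * 1024;
const MIN_TRAILING_ZEROS: u64 = MIN_SZ.trailing_zeros() as u64;

fn slab_id_to_size(slab_id: u8) -> u64 {
    1 << (MIN_TRAILING_ZEROS + u64::from(slab_id))
}

fn size_to_slab_id(size: u64) -> u8 {
    // round up to a power of two, but never below the smallest slot size
    let normalized = std::cmp::max(MIN_SZ, size.next_power_of_two());
    u8::try_from((normalized >> MIN_TRAILING_ZEROS).trailing_zeros()).unwrap()
}

fn main() {
    assert_eq!(size_to_slab_id(1), 0); // tiny values land in the 32k slab
    assert_eq!(slab_id_to_size(0), 32 * 1024);
    assert_eq!(size_to_slab_id(32 * 1024 + 1), 1); // next slab doubles to 64k
    assert_eq!(slab_id_to_size(1), 64 * 1024);
}
```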
- slabs: [Slab; 32], -} - -impl Heap { - pub fn start>(p: P) -> Result { - let mut slabs_vec = vec![]; - - for slab_id in 0..32 { - let slab = Slab::start(&p, slab_id)?; - slabs_vec.push(slab); - } - - let slabs: [Slab; 32] = slabs_vec.try_into().unwrap(); - - Ok(Heap { slabs }) - } - - pub fn gc_unknown_items(&self, snapshot: &crate::pagecache::Snapshot) { - let mut bitmaps = vec![]; - for slab in &self.slabs { - let tip = slab.tip.load(Acquire) as usize; - bitmaps.push(vec![0_u64; 1 + (tip / 64)]); - } - - for page_state in &snapshot.pt { - for heap_id in page_state.heap_ids() { - let (slab_id, idx, _lsn) = heap_id.decompose(); - - // set the bit for this slot - let block = idx / 64; - let bit = idx % 64; - let bitmask = 1 << bit; - bitmaps[slab_id as usize][block as usize] |= bitmask; - } - } - - let iter = self.slabs.iter().zip(bitmaps.into_iter()); - - for (slab, bitmap) in iter { - let tip = slab.tip.load(Acquire); - - for idx in 0..tip { - let block = idx / 64; - let bit = idx % 64; - let bitmask = 1 << bit; - let free = bitmap[block as usize] & bitmask == 0; - - if free { - slab.free(idx); - } - } - } - } - - pub fn read(&self, heap_id: HeapId) -> Result<(MessageKind, Vec)> { - log::trace!("Heap::read({:?})", heap_id); - let (slab_id, slab_idx, original_lsn) = heap_id.decompose(); - self.slabs[slab_id as usize].read(slab_idx, original_lsn) - } - - pub fn free(&self, heap_id: HeapId) { - log::trace!("Heap::free({:?})", heap_id); - let (slab_id, slab_idx, _) = heap_id.decompose(); - self.slabs[slab_id as usize].free(slab_idx) - } - - pub fn reserve(&self, size: u64, original_lsn: Lsn) -> Reservation { - assert!(size < 1 << 48); - let slab_id = size_to_slab_id(size); - let ret = self.slabs[slab_id as usize].reserve(original_lsn); - log::trace!("Heap::reserve({}) -> {:?}", size, ret.heap_id); - ret - } -} - -#[derive(Debug)] -struct Slab { - file: File, - slab_id: u8, - tip: AtomicU32, - free: Arc>, -} - -impl Slab { - pub fn start>(directory: P, slab_id: u8) -> Result { - let bs = slab_id_to_size(slab_id); - let free = Arc::new(Stack::default()); - - let mut options = std::fs::OpenOptions::new(); - options.create(true); - options.read(true); - options.write(true); - - let file = - options.open(directory.as_ref().join(format!("{:02}", slab_id)))?; - let len = file.metadata()?.len(); - let max_idx = len / bs; - log::trace!( - "starting heap slab for sizes of {}. 
tip: {} max idx: {}", - bs, - len, - max_idx - ); - let tip = AtomicU32::new(u32::try_from(max_idx).unwrap()); - - Ok(Slab { file, slab_id, tip, free }) - } - - fn read( - &self, - slab_idx: SlabIdx, - original_lsn: Lsn, - ) -> Result<(MessageKind, Vec)> { - let bs = slab_id_to_size(self.slab_id); - let offset = u64::from(slab_idx) * bs; - - log::trace!("reading heap slab slot {} at offset {}", slab_idx, offset); - - let mut heap_buf = vec![0; usize::try_from(bs).unwrap()]; - - pread_exact(&self.file, &mut heap_buf, offset)?; - - let stored_crc = - u32::from_le_bytes(heap_buf[1..5].as_ref().try_into().unwrap()); - - let mut hasher = crc32fast::Hasher::new(); - hasher.update(&heap_buf[0..1]); - hasher.update(&heap_buf[5..]); - let actual_crc = hasher.finalize(); - - if actual_crc == stored_crc { - let actual_lsn = Lsn::from_le_bytes( - heap_buf[5..13].as_ref().try_into().unwrap(), - ); - if actual_lsn != original_lsn { - log::debug!( - "heap slot lsn {} does not match expected original lsn {}", - actual_lsn, - original_lsn - ); - return Err(Error::corruption(None)); - } - let buf = heap_buf[13..].to_vec(); - Ok((MessageKind::from(heap_buf[0]), buf)) - } else { - log::debug!( - "heap message CRC does not match contents. stored: {} actual: {}", - stored_crc, - actual_crc - ); - Err(Error::corruption(None)) - } - } - - fn reserve(&self, original_lsn: Lsn) -> Reservation { - let (idx, from_tip) = if let Some(idx) = self.free.pop(&pin()) { - log::trace!( - "reusing heap index {} in slab for sizes of {}", - idx, - slab_id_to_size(self.slab_id), - ); - (idx, false) - } else { - log::trace!( - "no free heap slots in slab for sizes of {}", - slab_id_to_size(self.slab_id), - ); - (self.tip.fetch_add(1, Acquire), true) - }; - - log::trace!( - "heap reservation for slot {} in the slab for sizes of {}", - idx, - slab_id_to_size(self.slab_id), - ); - - let heap_id = HeapId::compose(self.slab_id, idx, original_lsn); - - Reservation { - slab_free: self.free.clone(), - completed: false, - file: self.file.try_clone().unwrap(), - from_tip, - heap_id, - } - } - - fn free(&self, idx: u32) { - self.punch_hole(idx); - self.free.push(idx, &pin()); - } - - fn punch_hole(&self, #[allow(unused)] idx: u32) { - #[cfg(all(target_os = "linux", not(miri)))] - { - use std::{ - os::unix::io::AsRawFd, - sync::atomic::{AtomicBool, Ordering::Relaxed}, - }; - - use libc::{fallocate, FALLOC_FL_KEEP_SIZE, FALLOC_FL_PUNCH_HOLE}; - - static HOLE_PUNCHING_ENABLED: AtomicBool = AtomicBool::new(true); - const MODE: i32 = FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE; - - if HOLE_PUNCHING_ENABLED.load(Relaxed) { - let bs = i64::try_from(slab_id_to_size(self.slab_id)).unwrap(); - let offset = i64::from(idx) * bs; - - let fd = self.file.as_raw_fd(); - - let ret = unsafe { - fallocate( - fd, - MODE, - #[allow(clippy::useless_conversion)] - offset.try_into().unwrap(), - #[allow(clippy::useless_conversion)] - bs.try_into().unwrap(), - ) - }; - - if ret != 0 { - let err = std::io::Error::last_os_error(); - log::error!( - "failed to punch hole in heap file: {:?}. disabling hole punching", - err - ); - HOLE_PUNCHING_ENABLED.store(false, Relaxed); - } - } - } - } -} diff --git a/src/pagecache/iobuf.rs b/src/pagecache/iobuf.rs deleted file mode 100644 index bd89cf1d7..000000000 --- a/src/pagecache/iobuf.rs +++ /dev/null @@ -1,1268 +0,0 @@ -use std::{ - alloc::{alloc, dealloc, Layout}, - cell::UnsafeCell, - sync::atomic::AtomicPtr, -}; - -use crate::{pagecache::*, *}; - -macro_rules! 
io_fail { - ($self:expr, $e:expr) => { - #[cfg(feature = "failpoints")] - { - debug_delay(); - if crate::fail::is_active($e) { - $self.set_global_error(Error::FailPoint); - return Err(Error::FailPoint); - } - }; - }; -} - -struct AlignedBuf(*mut u8, usize); - -impl AlignedBuf { - fn new(len: usize) -> AlignedBuf { - let layout = Layout::from_size_align(len, 8192).unwrap(); - let ptr = unsafe { alloc(layout) }; - - assert!(!ptr.is_null(), "failed to allocate critical IO buffer"); - - AlignedBuf(ptr, len) - } -} - -impl Drop for AlignedBuf { - fn drop(&mut self) { - let layout = Layout::from_size_align(self.1, 8192).unwrap(); - unsafe { - dealloc(self.0, layout); - } - } -} - -pub(crate) struct IoBuf { - buf: Arc>, - header: CachePadded, - base: usize, - pub offset: LogOffset, - pub lsn: Lsn, - pub capacity: usize, - from_tip: bool, - stored_max_stable_lsn: Lsn, -} - -#[allow(unsafe_code)] -unsafe impl Sync for IoBuf {} - -#[allow(unsafe_code)] -unsafe impl Send for IoBuf {} - -impl IoBuf { - /// # Safety - /// - /// This operation provides access to a mutable buffer of - /// uninitialized memory. For this to be correct, we must - /// ensure that: - /// 1. overlapping mutable slices are never created. - /// 2. a read to any subslice of this slice only happens - /// after a write has initialized that memory - /// - /// It is intended that the log reservation code guarantees - /// that no two `Reservation` objects will hold overlapping - /// mutable slices to our io buffer. - /// - /// It is intended that the `write_to_log` function only - /// tries to write initialized bytes to the underlying storage. - /// - /// It is intended that the `write_to_log` function will - /// initialize any yet-to-be-initialized bytes before writing - /// the buffer to storage. #1040 added logic that was intended - /// to meet this requirement. - /// - /// The safety of this method was discussed in #1044. - pub(crate) fn get_mut_range( - &self, - at: usize, - len: usize, - ) -> &'static mut [u8] { - let buf_ptr = self.buf.get(); - - unsafe { - assert!((*buf_ptr).1 >= at + len); - std::slice::from_raw_parts_mut( - (*buf_ptr).0.add(self.base + at), - len, - ) - } - } - - // This is called upon the initialization of a fresh segment. - // We write a new segment header to the beginning of the buffer - // for assistance during recovery. The caller is responsible - // for ensuring that the IoBuf's capacity has been set properly. - fn store_segment_header( - &mut self, - last: Header, - lsn: Lsn, - max_stable_lsn: Lsn, - ) { - debug!("storing lsn {} in beginning of buffer", lsn); - assert!(self.capacity >= SEG_HEADER_LEN); - - self.stored_max_stable_lsn = max_stable_lsn; - - self.lsn = lsn; - - let header = SegmentHeader { lsn, max_stable_lsn, ok: true }; - let header_bytes: [u8; SEG_HEADER_LEN] = header.into(); - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - header_bytes.as_ptr(), - (*self.buf.get()).0, - SEG_HEADER_LEN, - ); - } - - // ensure writes to the buffer land after our header. 
- let last_salt = header::salt(last); - let new_salt = header::bump_salt(last_salt); - let bumped = header::bump_offset(new_salt, SEG_HEADER_LEN); - self.set_header(bumped); - } - - pub(crate) fn get_header(&self) -> Header { - debug_delay(); - self.header.load(Acquire) - } - - pub(crate) fn set_header(&self, new: Header) { - debug_delay(); - self.header.store(new, Release); - } - - pub(crate) fn cas_header( - &self, - old: Header, - new: Header, - ) -> std::result::Result<(), Header> { - debug_delay(); - let res = self.header.compare_exchange(old, new, SeqCst, SeqCst); - - res.map(|_| ()) - } -} - -#[derive(Debug)] -pub(crate) struct StabilityIntervals { - fsynced_ranges: Vec<(Lsn, Lsn)>, - batches: BTreeMap, - stable_lsn: Lsn, -} - -impl StabilityIntervals { - fn new(lsn: Lsn) -> StabilityIntervals { - StabilityIntervals { - stable_lsn: lsn, - fsynced_ranges: vec![], - batches: BTreeMap::default(), - } - } - - pub(crate) fn mark_batch(&mut self, interval: (Lsn, Lsn)) { - assert!(interval.0 > self.stable_lsn); - self.batches.insert(interval.0, interval.1); - } - - fn mark_fsync(&mut self, interval: (Lsn, Lsn)) -> Option { - trace!( - "pushing interval {:?} into fsynced_ranges {:?}", - interval, - self.fsynced_ranges - ); - if let Some((low, high)) = self.fsynced_ranges.last_mut() { - if *low == interval.1 + 1 { - *low = interval.0 - } else if *high + 1 == interval.0 { - *high = interval.1 - } else { - self.fsynced_ranges.push(interval); - } - } else { - self.fsynced_ranges.push(interval); - } - - #[cfg(any(test, feature = "event_log", feature = "lock_free_delays"))] - assert!( - self.fsynced_ranges.len() < 10000, - "intervals is getting strangely long... {:?}", - self - ); - - // reverse sort - self.fsynced_ranges - .sort_unstable_by_key(|&range| std::cmp::Reverse(range)); - - while let Some(&(low, high)) = self.fsynced_ranges.last() { - assert!(low <= high); - let cur_stable = self.stable_lsn; - assert!( - low > cur_stable, - "somehow, we marked offset {} stable while \ - interval {}-{} had not yet been applied!", - cur_stable, - low, - high - ); - if cur_stable + 1 == low { - debug!("new highest interval: {} - {}", low, high); - self.fsynced_ranges.pop().unwrap(); - self.stable_lsn = high; - } else { - break; - } - } - - let mut batch_stable_lsn = None; - - // batches must be atomically recoverable, which - // means that we should wait until the entire - // batch has been stabilized before any parts - // of the batch are allowed to be reused - // due to having marked them as stable. - while let Some((low, high)) = - self.batches.iter().map(|(l, h)| (*l, *h)).next() - { - assert!( - low < high, - "expected batch low mark {} to be below high mark {}", - low, - high - ); - - if high <= self.stable_lsn { - // the entire batch has been written to disk - // and fsynced, so we can propagate its stability - // through the `batch_stable_lsn` variable. - if let Some(bsl) = batch_stable_lsn { - assert!( - bsl < high, - "expected batch stable lsn of {} to be less than high of {}", - bsl, - high - ); - } - batch_stable_lsn = Some(high); - self.batches.remove(&low).unwrap(); - } else { - if low <= self.stable_lsn { - // the batch has not been fully written - // to disk, but we can communicate that - // the region before the batch has - // stabilized. - batch_stable_lsn = Some(low - 1); - } - break; - } - } - - if self.batches.is_empty() { - Some(self.stable_lsn) - } else { - batch_stable_lsn - } - } -} - -pub(crate) struct IoBufs { - pub config: RunningConfig, - - // A pointer to the current IoBuf. 
This relies on crossbeam-epoch - // for garbage collection when it gets swapped out, to ensure that - // no witnessing threads experience use-after-free. - // mutated from the maybe_seal_and_write_iobuf method. - // finally dropped in the Drop impl, without using crossbeam-epoch, - // because if this drops, all witnessing threads should be done. - pub iobuf: AtomicPtr, - - // Pending intervals that have been written to stable storage, but may be - // higher than the current value of `stable` due to interesting thread - // interleavings. - pub intervals: Mutex, - pub interval_updated: Condvar, - - // The highest CONTIGUOUS log sequence number that has been written to - // stable storage. This may be lower than the length of the underlying - // file, and there may be buffers that have been written out-of-order - // to stable storage due to interesting thread interleavings. - pub stable_lsn: AtomicLsn, - pub max_reserved_lsn: AtomicLsn, - pub max_header_stable_lsn: Arc, - pub segment_accountant: Mutex, - pub segment_cleaner: SegmentCleaner, - deferred_segment_ops: stack::Stack, -} - -impl Drop for IoBufs { - fn drop(&mut self) { - let ptr = self.iobuf.swap(std::ptr::null_mut(), SeqCst); - assert!(!ptr.is_null()); - unsafe { - Arc::from_raw(ptr); - } - } -} - -/// `IoBufs` is a set of lock-free buffers for coordinating -/// writes to underlying storage. -impl IoBufs { - pub fn start(config: RunningConfig, snapshot: &Snapshot) -> Result { - let segment_cleaner = SegmentCleaner::default(); - - let mut segment_accountant: SegmentAccountant = - SegmentAccountant::start( - config.clone(), - snapshot, - segment_cleaner.clone(), - )?; - - let segment_size = config.segment_size; - - let (recovered_lid, recovered_lsn) = - snapshot.recovered_coords(config.segment_size); - - let (next_lid, next_lsn, from_tip) = - match (recovered_lid, recovered_lsn) { - (Some(next_lid), Some(next_lsn)) => { - debug!( - "starting log at recovered active \ - offset {}, recovered lsn {}", - next_lid, next_lsn - ); - (next_lid, next_lsn, true) - } - (None, None) => { - debug!("starting log for a totally fresh system"); - let next_lsn = 0; - let (next_lid, from_tip) = - segment_accountant.next(next_lsn)?; - (next_lid, next_lsn, from_tip) - } - (None, Some(next_lsn)) => { - let (next_lid, from_tip) = - segment_accountant.next(next_lsn)?; - debug!( - "starting log at clean offset {}, recovered lsn {}", - next_lid, next_lsn - ); - (next_lid, next_lsn, from_tip) - } - (Some(_), None) => unreachable!(), - }; - - assert!(next_lsn >= Lsn::try_from(next_lid).unwrap()); - - debug!( - "starting IoBufs with next_lsn: {} \ - next_lid: {}", - next_lsn, next_lid - ); - - // we want stable to begin at -1 if the 0th byte - // of our file has not yet been written. 
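// Sketch of the idea behind StabilityIntervals above: buffers can be fsynced out
// of order, so completed (low, high) LSN ranges are buffered and the published
// "stable" LSN only advances across a contiguous prefix. Simplified and
// illustrative (no batch tracking, no coalescing of adjacent ranges); the type
// and method names here are assumptions, not the deleted API.
use std::collections::BTreeMap;

struct Stability {
    stable: i64,                 // highest contiguously-durable LSN so far
    pending: BTreeMap<i64, i64>, // fsynced ranges that start above stable + 1
}

impl Stability {
    fn new(stable: i64) -> Self {
        Stability { stable, pending: BTreeMap::new() }
    }

    // record that LSNs low..=high are now durable, and return the new
    // contiguous stable frontier
    fn mark_fsync(&mut self, low: i64, high: i64) -> i64 {
        assert!(low <= high && low > self.stable);
        self.pending.insert(low, high);
        // drain every pending range that now touches the stable frontier
        loop {
            let next = self.pending.iter().next().map(|(l, h)| (*l, *h));
            match next {
                Some((low, high)) if low == self.stable + 1 => {
                    self.stable = high;
                    self.pending.remove(&low);
                }
                _ => break,
            }
        }
        self.stable
    }
}

fn main() {
    let mut s = Stability::new(-1);
    assert_eq!(s.mark_fsync(100, 199), -1); // written out of order: not yet contiguous
    assert_eq!(s.mark_fsync(0, 49), 49);    // the prefix arrives, stable advances to 49
    assert_eq!(s.mark_fsync(50, 99), 199);  // the gap closes, both pending ranges drain
}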
- let stable = next_lsn - 1; - - // the tip offset is not completely full yet, reuse it - let base = assert_usize(next_lid % segment_size as LogOffset); - - let mut iobuf = IoBuf { - buf: Arc::new(UnsafeCell::new(AlignedBuf::new(segment_size))), - header: CachePadded::new(AtomicU64::new(0)), - base, - offset: next_lid, - lsn: next_lsn, - from_tip, - capacity: segment_size - base, - stored_max_stable_lsn: -1, - }; - - if snapshot.active_segment.is_none() { - iobuf.store_segment_header(0, next_lsn, stable); - } - - Ok(IoBufs { - config, - - iobuf: AtomicPtr::new(Arc::into_raw(Arc::new(iobuf)) as *mut IoBuf), - - intervals: Mutex::new(StabilityIntervals::new(stable)), - interval_updated: Condvar::new(), - - stable_lsn: AtomicLsn::new(stable), - max_reserved_lsn: AtomicLsn::new(stable), - max_header_stable_lsn: Arc::new(AtomicLsn::new(next_lsn)), - segment_accountant: Mutex::new(segment_accountant), - segment_cleaner, - deferred_segment_ops: stack::Stack::default(), - }) - } - - pub(in crate::pagecache) fn sa_mark_link( - &self, - pid: PageId, - cache_info: CacheInfo, - guard: &Guard, - ) { - let op = SegmentOp::Link { pid, cache_info }; - self.deferred_segment_ops.push(op, guard); - } - - pub(in crate::pagecache) fn sa_mark_replace( - &self, - pid: PageId, - old_cache_infos: &[CacheInfo], - new_cache_info: CacheInfo, - guard: &Guard, - ) -> Result<()> { - let worked: Option> = self.try_with_sa(|sa| { - #[cfg(feature = "metrics")] - let start = clock(); - sa.mark_replace(pid, old_cache_infos, new_cache_info)?; - for op in self.deferred_segment_ops.take_iter(guard) { - sa.apply_op(op)?; - } - #[cfg(feature = "metrics")] - M.accountant_hold.measure(clock() - start); - Ok(()) - }); - - if let Some(res) = worked { - res - } else { - let op = SegmentOp::Replace { - pid, - old_cache_infos: old_cache_infos.to_vec(), - new_cache_info, - }; - self.deferred_segment_ops.push(op, guard); - Ok(()) - } - } - - pub(in crate::pagecache) fn sa_stabilize(&self, lsn: Lsn) -> Result<()> { - // we avoid creating a Guard while blocking on the SA mutex, and we - // then drop the Guard only after the SA mutex is no longer held. - - let worked: Option> = self.try_with_sa(|sa| { - let guard = pin(); - for op in self.deferred_segment_ops.take_iter(&guard) { - sa.apply_op(op)?; - } - sa.stabilize(lsn, false)?; - Ok(guard) - }); - - if let Some(guard_res) = worked { - // we want to drop the EBR guard when we're not - // holding the SA mutex because it could take a while - // to clean up garbage. 
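// Sketch of the "try-lock or defer" pattern used by sa_mark_replace and
// sa_stabilize above: if the SegmentAccountant mutex is contended, the operation
// is queued and applied later by whichever thread next wins the lock. Simplified
// and illustrative: a Mutex<Vec<_>> stands in for the lock-free deferred stack,
// and all names here are assumptions, not the deleted types.
use std::sync::Mutex;

enum Op {
    Link(u64),
    Replace(u64),
}

#[derive(Default)]
struct Accountant {
    applied: Vec<Op>,
}

#[derive(Default)]
struct Coordinator {
    accountant: Mutex<Accountant>,
    deferred: Mutex<Vec<Op>>,
}

impl Coordinator {
    fn submit(&self, op: Op) {
        if let Ok(mut sa) = self.accountant.try_lock() {
            // we won the lock: apply our op plus anything queued earlier
            let backlog = std::mem::take(&mut *self.deferred.lock().unwrap());
            sa.applied.extend(backlog);
            sa.applied.push(op);
        } else {
            // someone else holds the accountant: leave the op for them, or for
            // the next caller that wins the try_lock
            self.deferred.lock().unwrap().push(op);
        }
    }
}

fn main() {
    let c = Coordinator::default();
    c.submit(Op::Link(1));
    c.submit(Op::Replace(1));
    assert_eq!(c.accountant.lock().unwrap().applied.len(), 2);
}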
- drop(guard_res?); - } - - Ok(()) - } - - /// `SegmentAccountant` access for coordination with the `PageCache` - pub(in crate::pagecache) fn try_with_sa(&self, f: F) -> Option - where - F: FnOnce(&mut SegmentAccountant) -> B, - { - debug_delay(); - if let Some(mut sa) = self.segment_accountant.try_lock() { - #[cfg(feature = "metrics")] - let start = clock(); - - let ret = f(&mut sa); - - #[cfg(feature = "metrics")] - M.accountant_hold.measure(clock() - start); - - debug_delay(); - - Some(ret) - } else { - None - } - } - - /// `SegmentAccountant` access for coordination with the `PageCache` - pub(in crate::pagecache) fn with_sa(&self, f: F) -> B - where - F: FnOnce(&mut SegmentAccountant) -> B, - { - #[cfg(feature = "metrics")] - let start = clock(); - - debug_delay(); - let mut sa = self.segment_accountant.lock(); - - #[cfg(feature = "metrics")] - let locked_at = clock(); - - #[cfg(feature = "metrics")] - M.accountant_lock.measure(locked_at - start); - - let ret = f(&mut sa); - - drop(sa); - - #[cfg(feature = "metrics")] - M.accountant_hold.measure(clock() - locked_at); - - ret - } - - /// Return an iterator over the log, starting with - /// a specified offset. - pub(crate) fn iter_from(&self, lsn: Lsn) -> LogIter { - trace!("iterating from lsn {}", lsn); - let segments = self.with_sa(|sa| sa.segment_snapshot_iter_from(lsn)); - - LogIter { - config: self.config.clone(), - max_lsn: Some(self.stable()), - cur_lsn: None, - segment_base: None, - segments, - last_stage: false, - } - } - - /// Returns the last stable offset in storage. - pub(in crate::pagecache) fn stable(&self) -> Lsn { - debug_delay(); - self.stable_lsn.load(Acquire) - } - - // Adds a header to the front of the buffer - #[allow(clippy::mut_mut)] - pub(crate) fn encapsulate( - &self, - item: &T, - header: MessageHeader, - mut out_buf: &mut [u8], - heap_reservation_opt: Option, - ) -> Result<()> { - // we create this double ref to allow scooting - // the slice forward without doing anything - // to the argument - let out_buf_ref: &mut &mut [u8] = &mut out_buf; - { - #[cfg(feature = "metrics")] - let _ = Measure::new(&M.serialize); - header.serialize_into(out_buf_ref); - } - - if let Some(heap_reservation) = heap_reservation_opt { - // write blob to file - io_fail!(self, "blob blob write"); - let mut heap_buf = vec![ - 0; - usize::try_from(super::heap::slab_size( - 13 + item.serialized_size() - )) - .unwrap() - ]; - - #[cfg(feature = "metrics")] - let serialization_timer = Measure::new(&M.serialize); - heap_buf[0] = header.kind.into(); - heap_buf[5..13].copy_from_slice( - &heap_reservation.heap_id.original_lsn.to_le_bytes(), - ); - let heap_buf_ref: &mut &mut [u8] = &mut &mut heap_buf[13..]; - item.serialize_into(heap_buf_ref); - #[cfg(feature = "metrics")] - drop(serialization_timer); - - let mut hasher = crc32fast::Hasher::new(); - hasher.update(&heap_buf[0..1]); - hasher.update(&heap_buf[5..]); - let crc = hasher.finalize().to_le_bytes(); - - heap_buf[1..5].copy_from_slice(&crc); - - // write the blob pointer and its original lsn into - // the log - heap_reservation.heap_id.serialize_into(out_buf_ref); - - // write the blob file - heap_reservation.complete(&heap_buf)?; - } else { - #[cfg(feature = "metrics")] - let _ = Measure::new(&M.serialize); - item.serialize_into(out_buf_ref); - }; - - assert_eq!( - out_buf_ref.len(), - 0, - "trying to serialize header {:?} \ - and item {:?} but there were \ - buffer leftovers at the end", - header, - item - ); - - Ok(()) - } - - // Write an IO buffer's data to stable storage and set up 
the - // next IO buffer for writing. - pub(crate) fn write_to_log(&self, iobuf: Arc) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.write_to_log); - let header = iobuf.get_header(); - let log_offset = iobuf.offset; - let base_lsn = iobuf.lsn; - let capacity = iobuf.capacity; - - let segment_size = self.config.segment_size; - - assert_eq!( - Lsn::try_from(log_offset % segment_size as LogOffset).unwrap(), - base_lsn % segment_size as Lsn - ); - - assert_ne!( - log_offset, - LogOffset::max_value(), - "created reservation for uninitialized slot", - ); - - assert!(header::is_sealed(header)); - - let bytes_to_write = header::offset(header); - - trace!( - "write_to_log log_offset {} lsn {} len {}", - log_offset, - base_lsn, - bytes_to_write - ); - - let maxed = header::is_maxed(header); - let unused_space = capacity - bytes_to_write; - let should_pad = maxed && unused_space >= MAX_MSG_HEADER_LEN; - - // a pad is a null message written to the end of a buffer - // to signify that nothing else will be written into it - if should_pad { - let pad_len = unused_space - MAX_MSG_HEADER_LEN; - let data = iobuf.get_mut_range(bytes_to_write, unused_space); - - let segment_number = SegmentNumber( - u64::try_from(base_lsn).unwrap() - / u64::try_from(self.config.segment_size).unwrap(), - ); - - let cap_header = MessageHeader { - kind: MessageKind::Cap, - pid: PageId::max_value(), - segment_number, - len: u64::try_from(pad_len).unwrap(), - crc32: 0, - }; - - trace!("writing segment cap {:?}", cap_header); - - let header_bytes = cap_header.serialize(); - - // initialize the remainder of this buffer (only pad_len of this - // will be part of the Cap message) - let padding_bytes = vec![ - MessageKind::Corrupted.into(); - unused_space - header_bytes.len() - ]; - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - header_bytes.as_ptr(), - data.as_mut_ptr(), - header_bytes.len(), - ); - std::ptr::copy_nonoverlapping( - padding_bytes.as_ptr(), - data.as_mut_ptr().add(header_bytes.len()), - padding_bytes.len(), - ); - } - - // this as to stay aligned with the hashing - let crc32_arr = u32_to_arr(calculate_message_crc32( - &header_bytes, - &padding_bytes[..pad_len], - )); - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - crc32_arr.as_ptr(), - // the crc32 is the first part of the buffer - data.as_mut_ptr(), - std::mem::size_of::(), - ); - } - } else if maxed { - // initialize the remainder of this buffer's red zone - let data = iobuf.get_mut_range(bytes_to_write, unused_space); - - #[allow(unsafe_code)] - unsafe { - // note: this could use slice::fill() if it stabilizes - std::ptr::write_bytes( - data.as_mut_ptr(), - MessageKind::Corrupted.into(), - unused_space, - ); - } - } - - let total_len = if maxed { capacity } else { bytes_to_write }; - - let data = iobuf.get_mut_range(0, total_len); - let stored_max_stable_lsn = iobuf.stored_max_stable_lsn; - - io_fail!(self, "buffer write"); - let f = &self.config.file; - pwrite_all(f, data, log_offset)?; - if !self.config.temporary { - if iobuf.from_tip { - f.sync_all()?; - } else if cfg!(not(target_os = "linux")) { - f.sync_data()?; - } else { - #[allow(clippy::assertions_on_constants)] - { - assert!(cfg!(target_os = "linux")); - } - - #[cfg(target_os = "linux")] - { - use std::os::unix::io::AsRawFd; - let ret = unsafe { - libc::sync_file_range( - f.as_raw_fd(), - i64::try_from(log_offset).unwrap(), - i64::try_from(total_len).unwrap(), - libc::SYNC_FILE_RANGE_WAIT_BEFORE - | libc::SYNC_FILE_RANGE_WRITE - | 
libc::SYNC_FILE_RANGE_WAIT_AFTER, - ) - }; - if ret < 0 { - let err = std::io::Error::last_os_error(); - if let Some(libc::ENOSYS) = err.raw_os_error() { - f.sync_all()?; - } else { - return Err(err.into()); - } - } - } - } - } - - // get rid of the iobuf as quickly as possible because - // it is a huge allocation - drop(iobuf); - - io_fail!(self, "buffer write post"); - - if total_len > 0 { - let complete_len = if maxed { - let lsn_idx = base_lsn / segment_size as Lsn; - let next_seg_beginning = (lsn_idx + 1) * segment_size as Lsn; - assert_usize(next_seg_beginning - base_lsn) - } else { - total_len - }; - - debug!( - "wrote lsns {}-{} to disk at offsets {}-{}, maxed {} complete_len {}", - base_lsn, - base_lsn + total_len as Lsn - 1, - log_offset, - log_offset + total_len as LogOffset - 1, - maxed, - complete_len - ); - self.mark_interval(base_lsn, complete_len); - } - - #[cfg(feature = "metrics")] - M.written_bytes.measure(total_len as u64); - - // NB the below deferred logic is important to ensure - // that we never actually free a segment until all threads - // that may have witnessed a DiskPtr that points into it - // have completed their (crossbeam-epoch)-pinned operations. - let guard = pin(); - let max_header_stable_lsn = self.max_header_stable_lsn.clone(); - guard.defer(move || { - trace!("bumping atomic header lsn to {}", stored_max_stable_lsn); - max_header_stable_lsn.fetch_max(stored_max_stable_lsn, SeqCst) - }); - guard.flush(); - drop(guard); - - let current_max_header_stable_lsn = - self.max_header_stable_lsn.load(Acquire); - - self.sa_stabilize(current_max_header_stable_lsn) - } - - // It's possible that IO buffers are written out of order! - // So we need to use this to keep track of them, and only - // increment self.stable. If we didn't do this, then we would - // accidentally decrement self.stable sometimes, or bump stable - // above an offset that corresponds to a buffer that hasn't actually - // been written yet! It's OK to use a mutex here because it is pretty - // fast, compared to the other operations on shared state. - fn mark_interval(&self, whence: Lsn, len: usize) { - debug!("mark_interval({}, {})", whence, len); - assert!( - len > 0, - "mark_interval called with an empty length at {}", - whence - ); - let mut intervals = self.intervals.lock(); - - let interval = (whence, whence + len as Lsn - 1); - - let updated = intervals.mark_fsync(interval); - - if let Some(new_stable_lsn) = updated { - trace!("mark_interval new highest lsn {}", new_stable_lsn); - self.stable_lsn.store(new_stable_lsn, SeqCst); - - #[cfg(feature = "event_log")] - { - // We add 1 because we want it to stay monotonic with recovery - // LSN, which deals with the next LSN after the last stable one. - // We need to do this while intervals is held otherwise it - // may race with another thread that stabilizes something - // lower. - self.config.event_log.stabilized_lsn(new_stable_lsn + 1); - } - - // having held the mutex makes this linearized - // with the notify below. - drop(intervals); - } - let _notified = self.interval_updated.notify_all(); - } - - pub(in crate::pagecache) fn current_iobuf(&self) -> Arc { - // we bump up the ref count, and forget the arc to retain a +1. - // If we didn't forget it, it would then go back down again, - // even though we just created a new reference to it, leading - // to double-frees. 
- let arc = unsafe { Arc::from_raw(self.iobuf.load(Acquire)) }; - #[allow(clippy::mem_forget)] - std::mem::forget(arc.clone()); - arc - } - - pub(crate) fn set_global_error(&self, e: Error) { - self.config.set_global_error(e); - - // wake up any waiting threads - // so they don't stall forever - let intervals = self.intervals.lock(); - - // having held the mutex makes this linearized - // with the notify below. - drop(intervals); - - let _notified = self.interval_updated.notify_all(); - } -} - -pub(crate) fn roll_iobuf(iobufs: &Arc) -> Result { - let iobuf = iobufs.current_iobuf(); - let header = iobuf.get_header(); - if header::is_sealed(header) { - trace!("skipping roll_iobuf due to already-sealed header"); - return Ok(0); - } - if header::offset(header) == 0 { - trace!("skipping roll_iobuf due to empty segment"); - } else { - trace!("sealing ioubuf from roll_iobuf"); - maybe_seal_and_write_iobuf(iobufs, &iobuf, header, false)?; - } - - Ok(header::offset(header)) -} - -/// Blocks until the specified log sequence number has -/// been made stable on disk. Returns the number of -/// bytes written. Suitable as a full consistency -/// barrier. -pub(in crate::pagecache) fn make_stable( - iobufs: &Arc, - lsn: Lsn, -) -> Result { - make_stable_inner(iobufs, lsn, false) -} - -/// Blocks until the specified log sequence number -/// has been written to disk. it's assumed that -/// log messages are always written contiguously -/// due to the way reservations manage io buffer -/// tenancy. this is only suitable for use -/// before trying to read a message from the log, -/// so that the system can avoid a full barrier -/// if the desired item has already been made -/// durable. -pub(in crate::pagecache) fn make_durable( - iobufs: &Arc, - lsn: Lsn, -) -> Result { - make_stable_inner(iobufs, lsn, true) -} - -pub(in crate::pagecache) fn make_stable_inner( - iobufs: &Arc, - lsn: Lsn, - partial_durability: bool, -) -> Result { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.make_stable); - - // NB before we write the 0th byte of the file, stable is -1 - let first_stable = iobufs.stable(); - if first_stable >= lsn { - return Ok(0); - } - - let mut stable = first_stable; - - while stable < lsn { - if let Err(e) = iobufs.config.global_error() { - error!("bailing out of stabilization code due to detected IO error: {:?}", e); - let intervals = iobufs.intervals.lock(); - - // having held the mutex makes this linearized - // with the notify below. - drop(intervals); - - let _notified = iobufs.interval_updated.notify_all(); - return Err(e); - } - - let iobuf = iobufs.current_iobuf(); - let header = iobuf.get_header(); - if header::offset(header) == 0 - || header::is_sealed(header) - || iobuf.lsn > lsn - { - // nothing to write, don't bother sealing - // current IO buffer. - } else { - maybe_seal_and_write_iobuf(iobufs, &iobuf, header, false)?; - stable = iobufs.stable(); - // NB we have to continue here to possibly clear - // the next io buffer, which may have dirty - // data we need to flush (and maybe no other - // thread is still alive to do so) - continue; - } - - // block until another thread updates the stable lsn - let mut intervals = iobufs.intervals.lock(); - - // check global error again now that we are holding a mutex - if let Err(e) = iobufs.config.global_error() { - // having held the mutex makes this linearized - // with the notify below. 
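// Sketch of the refcount trick in current_iobuf above: the live buffer is stored
// as a raw pointer that originally came from Arc::into_raw, and a reader rebuilds
// an Arc, clones it to gain its own +1, then forgets the rebuilt Arc so the count
// owned by the pointer is unchanged. Illustrative only; the real code additionally
// relies on epoch-based reclamation so that a concurrent swap cannot free the
// allocation while a reader is mid-load.
use std::sync::atomic::{AtomicPtr, Ordering::{Acquire, SeqCst}};
use std::sync::Arc;

struct Current<T> {
    ptr: AtomicPtr<T>,
}

impl<T> Current<T> {
    fn new(value: Arc<T>) -> Self {
        Current { ptr: AtomicPtr::new(Arc::into_raw(value) as *mut T) }
    }

    fn load(&self) -> Arc<T> {
        let raw = self.ptr.load(Acquire);
        // rebuild the Arc that the pointer logically owns...
        let arc = unsafe { Arc::from_raw(raw) };
        // ...take our own reference...
        let cloned = arc.clone();
        // ...and forget the rebuilt Arc so its count is not double-dropped
        std::mem::forget(arc);
        cloned
    }
}

impl<T> Drop for Current<T> {
    fn drop(&mut self) {
        let raw = self.ptr.swap(std::ptr::null_mut(), SeqCst);
        if !raw.is_null() {
            // reclaim the +1 held by the pointer itself
            drop(unsafe { Arc::from_raw(raw) });
        }
    }
}

fn main() {
    let current = Current::new(Arc::new(42_u64));
    let a = current.load();
    let b = current.load();
    assert_eq!((*a, *b), (42, 42));
    drop(current);
    assert_eq!(Arc::strong_count(&a), 2); // only the two loaded handles remain
}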
- drop(intervals); - - let _notified = iobufs.interval_updated.notify_all(); - return Err(e); - } - - stable = iobufs.stable(); - - if partial_durability { - if intervals.stable_lsn > lsn { - return Ok(assert_usize(stable - first_stable)); - } - - for (low, high) in &intervals.fsynced_ranges { - if *low <= lsn && *high > lsn { - return Ok(assert_usize(stable - first_stable)); - } - } - } - - if stable < lsn { - trace!("waiting on cond var for make_stable({})", lsn); - - #[cfg(not(feature = "event_log"))] - { - // wait forever when running in prod - iobufs.interval_updated.wait(&mut intervals); - } - - #[cfg(feature = "event_log")] - { - // while testing, panic if we take too long to stabilize - let timeout = iobufs.interval_updated.wait_for( - &mut intervals, - std::time::Duration::from_secs(5), - ); - if timeout.timed_out() { - fn tn() -> String { - std::thread::current() - .name() - .unwrap_or("unknown") - .to_owned() - } - panic!( - "{} failed to make_stable after 30 seconds. \ - waiting to stabilize lsn {}, current stable {} \ - intervals: {:?}", - tn(), - lsn, - iobufs.stable(), - intervals - ); - } - } - } else { - debug!("make_stable({}) returning", lsn); - break; - } - } - - Ok(assert_usize(stable - first_stable)) -} - -/// Called by users who wish to force the current buffer -/// to flush some pending writes. Returns the number -/// of bytes written during this call. -pub(in crate::pagecache) fn flush(iobufs: &Arc) -> Result { - let _cc = concurrency_control::read(); - let max_reserved_lsn = iobufs.max_reserved_lsn.load(Acquire); - make_stable(iobufs, max_reserved_lsn) -} - -/// Attempt to seal the current IO buffer, possibly -/// writing it to disk if there are no other writers -/// operating on it. -pub(in crate::pagecache) fn maybe_seal_and_write_iobuf( - iobufs: &Arc, - iobuf: &Arc, - header: Header, - from_reserve: bool, -) -> Result<()> { - if header::is_sealed(header) { - // this buffer is already sealed. nothing to do here. 
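// Sketch of the blocking pattern in make_stable_inner above: writers publish a
// new stable LSN while holding a mutex and then notify a condvar, and waiters
// sleep on that condvar until the frontier reaches their target. Publishing and
// notifying under the same mutex that waiters re-check is what makes the wakeup
// impossible to miss. Simplified and illustrative; the names are assumptions.
use std::sync::{Condvar, Mutex};

struct StableWaiter {
    stable_lsn: Mutex<i64>,
    updated: Condvar,
}

impl StableWaiter {
    fn new(initial: i64) -> Self {
        StableWaiter { stable_lsn: Mutex::new(initial), updated: Condvar::new() }
    }

    // called by the thread that just fsynced everything up to `new_stable`
    fn publish(&self, new_stable: i64) {
        let mut stable = self.stable_lsn.lock().unwrap();
        if new_stable > *stable {
            *stable = new_stable;
        }
        self.updated.notify_all();
    }

    // called by make_stable-style callers; returns once stable >= target
    fn wait_for(&self, target: i64) -> i64 {
        let mut stable = self.stable_lsn.lock().unwrap();
        while *stable < target {
            stable = self.updated.wait(stable).unwrap();
        }
        *stable
    }
}

fn main() {
    let w = std::sync::Arc::new(StableWaiter::new(-1));
    let w2 = w.clone();
    let waiter = std::thread::spawn(move || w2.wait_for(10));
    w.publish(5);
    w.publish(12);
    assert!(waiter.join().unwrap() >= 10);
}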
- return Ok(()); - } - - // NB need to do this before CAS because it can get - // written and reset by another thread afterward - let lid = iobuf.offset; - let lsn = iobuf.lsn; - let capacity = iobuf.capacity; - let segment_size = iobufs.config.segment_size; - - if header::offset(header) > capacity { - // a race happened, nothing we can do - return Ok(()); - } - - let res_len = header::offset(header); - let maxed = from_reserve || capacity - res_len < MAX_MSG_HEADER_LEN; - let sealed = if maxed { - trace!("setting maxed to true for iobuf with lsn {}", lsn); - header::mk_maxed(header::mk_sealed(header)) - } else { - header::mk_sealed(header) - }; - - let worked = iobuf.cas_header(header, sealed).is_ok(); - if !worked { - return Ok(()); - } - - trace!("sealed iobuf with lsn {}", lsn); - - assert!( - capacity + SEG_HEADER_LEN >= res_len, - "res_len of {} higher than buffer capacity {}", - res_len, - capacity - ); - - assert_ne!( - lid, - LogOffset::max_value(), - "sealing something that should never have \ - been claimed (iobuf lsn {})\n{:?}", - lsn, - iobufs - ); - - // open new slot - let mut next_lsn = lsn; - - #[cfg(feature = "metrics")] - let measure_assign_offset = Measure::new(&M.assign_offset); - - let (next_offset, from_tip) = if maxed { - // roll lsn to the next offset - let lsn_idx = lsn / segment_size as Lsn; - next_lsn = (lsn_idx + 1) * segment_size as Lsn; - - // mark unused as clear - debug!( - "rolling to new segment after clearing {}-{}", - lid, - lid + res_len as LogOffset, - ); - - match iobufs.with_sa(|sa| sa.next(next_lsn)) { - Ok(ret) => ret, - Err(e) => { - iobufs.set_global_error(e); - return Err(e); - } - } - } else { - debug!( - "advancing offset within the current segment from {} to {}", - lid, - lid + res_len as LogOffset - ); - next_lsn += res_len as Lsn; - - (lid + res_len as LogOffset, iobuf.from_tip) - }; - - // NB as soon as the "sealed" bit is 0, this allows new threads - // to start writing into this buffer, so do that after it's all - // set up. expect this thread to block until the buffer completes - // its entire life cycle as soon as we do that. - let next_iobuf = if maxed { - let mut next_iobuf = IoBuf { - buf: Arc::new(UnsafeCell::new(AlignedBuf::new(segment_size))), - header: CachePadded::new(AtomicU64::new(0)), - base: 0, - offset: next_offset, - lsn: next_lsn, - from_tip, - capacity: segment_size, - stored_max_stable_lsn: -1, - }; - - next_iobuf.store_segment_header(sealed, next_lsn, iobufs.stable()); - - next_iobuf - } else { - let new_cap = capacity - res_len; - assert_ne!(new_cap, 0); - let last_salt = header::salt(sealed); - let new_salt = header::bump_salt(last_salt); - - IoBuf { - // reuse the previous io buffer - buf: iobuf.buf.clone(), - header: CachePadded::new(AtomicU64::new(new_salt)), - base: iobuf.base + res_len, - offset: next_offset, - lsn: next_lsn, - from_tip, - capacity: new_cap, - stored_max_stable_lsn: -1, - } - }; - - // we acquire this mutex to guarantee that any threads that - // are going to wait on the condition variable will observe - // the change. - debug_delay(); - let intervals = iobufs.intervals.lock(); - let old_ptr = iobufs - .iobuf - .swap(Arc::into_raw(Arc::new(next_iobuf)) as *mut IoBuf, SeqCst); - - let old_arc = unsafe { Arc::from_raw(old_ptr) }; - - pin().defer(move || drop(old_arc)); - - // having held the mutex makes this linearized - // with the notify below. 
- drop(intervals); - - let _notified = iobufs.interval_updated.notify_all(); - - #[cfg(feature = "metrics")] - drop(measure_assign_offset); - - // if writers is 0, it's our responsibility to write the buffer. - if header::n_writers(sealed) == 0 { - iobufs.config.global_error()?; - trace!( - "asynchronously writing iobuf with lsn {} to log from maybe_seal", - lsn - ); - let iobufs2 = iobufs.clone(); - let iobuf2 = iobuf.clone(); - let _result = threadpool::write_to_log(iobuf2, iobufs2); - - #[cfg(feature = "event_log")] - _result.wait(); - - Ok(()) - } else { - trace!( - "currently {} other writers, so we will let one of them write \ - the iobuf with lsn {} to disk", - header::n_writers(sealed), - lsn - ); - Ok(()) - } -} - -impl Debug for IoBufs { - fn fmt( - &self, - formatter: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - formatter.write_fmt(format_args!("IoBufs {{ buf: {:?} }}", self.iobuf)) - } -} - -impl Debug for IoBuf { - fn fmt( - &self, - formatter: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - let header = self.get_header(); - formatter.write_fmt(format_args!( - "\n\tIoBuf {{ lid: {}, n_writers: {}, offset: \ - {}, sealed: {} }}", - self.offset, - header::n_writers(header), - header::offset(header), - header::is_sealed(header) - )) - } -} diff --git a/src/pagecache/iterator.rs b/src/pagecache/iterator.rs deleted file mode 100644 index d580c65c2..000000000 --- a/src/pagecache/iterator.rs +++ /dev/null @@ -1,513 +0,0 @@ -use std::{collections::BTreeMap, io}; - -use super::{ - pread_exact_or_eof, read_message, read_segment_header, BasedBuf, DiskPtr, - LogKind, LogOffset, LogRead, Lsn, SegmentHeader, SegmentNumber, - MAX_MSG_HEADER_LEN, SEG_HEADER_LEN, -}; -use crate::*; - -#[derive(Debug)] -pub struct LogIter { - pub config: RunningConfig, - pub segments: BTreeMap, - pub segment_base: Option, - pub max_lsn: Option, - pub cur_lsn: Option, - pub last_stage: bool, -} - -impl Iterator for LogIter { - type Item = (LogKind, PageId, Lsn, DiskPtr); - - fn next(&mut self) -> Option { - // If segment is None, get next on segment_iter, panic - // if we can't read something we expect to be able to, - // return None if there are no more remaining segments. - loop { - let remaining_seg_too_small_for_msg = !valid_entry_offset( - LogOffset::try_from(self.cur_lsn.unwrap_or(0)).unwrap(), - self.config.segment_size, - ); - - if remaining_seg_too_small_for_msg { - // clearing this also communicates to code in - // the snapshot generation logic that there was - // no more available space for a message in the - // last read segment - self.segment_base = None; - } - - if self.segment_base.is_none() { - if let Err(e) = self.read_segment() { - debug!("unable to load new segment: {:?}", e); - return None; - } - } - - let lsn = self.cur_lsn.unwrap(); - - // self.segment_base is `Some` now. - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.read_segment_message); - - // NB this inequality must be greater than or equal to the - // max_lsn. max_lsn may be set to the beginning of the first - // corrupt message encountered in the previous sweep of recovery. 
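// Sketch of the recovery-cutoff decisions made around this point in
// LogIter::next: a corrupted message ends recovery at the tear, and a batch
// manifest that points past the maximum recoverable LSN ends it just before the
// batch, so a partially-written batch is never half-applied. Heavily simplified
// (records carry fake lengths, no real decoding); all names are assumptions.
enum Record {
    Inline { len: i64 },             // ordinary message, advances the cursor
    BatchManifest { last_lsn: i64 }, // batch is only valid if last_lsn is recoverable
    Corrupted,                       // a tear: nothing after this can be trusted
}

// returns the LSN up to which recovery may apply messages
fn recoverable_prefix(records: &[Record], start_lsn: i64, max_recoverable: i64) -> i64 {
    let mut cursor = start_lsn;
    for record in records {
        match record {
            Record::Inline { len } => cursor += *len,
            Record::BatchManifest { last_lsn } if *last_lsn > max_recoverable => {
                // the tail of this batch never hit disk: cut recovery short here
                return cursor;
            }
            Record::BatchManifest { .. } => cursor += 1,
            Record::Corrupted => return cursor,
        }
    }
    cursor
}

fn main() {
    use Record::*;
    // a fully durable batch is replayed...
    assert_eq!(
        recoverable_prefix(&[Inline { len: 10 }, BatchManifest { last_lsn: 15 }, Inline { len: 10 }], 0, 100),
        21
    );
    // ...but a batch whose manifest points past the stable tail is dropped whole
    assert_eq!(
        recoverable_prefix(&[Inline { len: 10 }, BatchManifest { last_lsn: 150 }, Inline { len: 10 }], 0, 100),
        10
    );
}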
- if let Some(max_lsn) = self.max_lsn { - if let Some(cur_lsn) = self.cur_lsn { - if cur_lsn > max_lsn { - // all done - debug!("hit max_lsn {} in iterator, stopping", max_lsn); - return None; - } - } - } - - let segment_base = &self.segment_base.as_ref().unwrap(); - - let lid = segment_base.offset - + LogOffset::try_from(lsn % self.config.segment_size as Lsn) - .unwrap(); - - let expected_segment_number = SegmentNumber( - u64::try_from(lsn).unwrap() - / u64::try_from(self.config.segment_size).unwrap(), - ); - - match read_message( - &**segment_base, - lid, - expected_segment_number, - &self.config, - ) { - Ok(LogRead::Heap(header, _buf, heap_id, inline_len)) => { - trace!("read heap item in LogIter::next"); - self.cur_lsn = Some(lsn + Lsn::from(inline_len)); - - return Some(( - LogKind::from(header.kind), - header.pid, - lsn, - DiskPtr::new_heap_item(lid, heap_id), - )); - } - Ok(LogRead::Inline(header, _buf, inline_len)) => { - trace!( - "read inline flush with header {:?} in LogIter::next", - header, - ); - self.cur_lsn = Some(lsn + Lsn::from(inline_len)); - - return Some(( - LogKind::from(header.kind), - header.pid, - lsn, - DiskPtr::Inline(lid), - )); - } - Ok(LogRead::BatchManifest(last_lsn_in_batch, inline_len)) => { - if let Some(max_lsn) = self.max_lsn { - if last_lsn_in_batch > max_lsn { - debug!( - "cutting recovery short due to torn batch. \ - required stable lsn: {} actual max possible lsn: {}", - last_lsn_in_batch, - self.max_lsn.unwrap() - ); - return None; - } - } - self.cur_lsn = Some(lsn + Lsn::from(inline_len)); - continue; - } - Ok(LogRead::Canceled(inline_len)) => { - trace!("read zeroed in LogIter::next"); - self.cur_lsn = Some(lsn + Lsn::from(inline_len)); - } - Ok(LogRead::Corrupted) => { - trace!( - "read corrupted msg in LogIter::next as lid {} lsn {}", - lid, - lsn - ); - if self.last_stage { - // this happens when the second half of a freed segment - // is overwritten before its segment header. it's fine - // to just treat it like a cap - // because any already applied - // state can be assumed to be replaced later on by - // the stabilized state that came afterwards. - let _taken = self.segment_base.take().unwrap(); - - continue; - } else { - // found a tear - return None; - } - } - Ok(LogRead::Cap(_segment_number)) => { - trace!("read cap in LogIter::next"); - let _taken = self.segment_base.take().unwrap(); - - continue; - } - Ok(LogRead::DanglingHeap(_, heap_id, inline_len)) => { - debug!( - "encountered dangling heap \ - pointer at lsn {} heap_id {:?}", - lsn, heap_id - ); - self.cur_lsn = Some(lsn + Lsn::from(inline_len)); - continue; - } - Err(e) => { - debug!( - "failed to read log message at lid {} \ - with expected lsn {} during iteration: {}", - lid, lsn, e - ); - return None; - } - } - } - } -} - -impl LogIter { - /// read a segment of log messages. Only call after - /// pausing segment rewriting on the segment accountant! 
- fn read_segment(&mut self) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.segment_read); - if self.segments.is_empty() { - return Err(io::Error::new( - io::ErrorKind::Other, - "no segments remaining to iterate over", - ) - .into()); - } - - let first_ref = self.segments.iter().next().unwrap(); - let (lsn, offset) = (*first_ref.0, *first_ref.1); - - if let Some(max_lsn) = self.max_lsn { - if lsn > max_lsn { - return Err(io::Error::new( - io::ErrorKind::Other, - "next segment is above our configured max_lsn", - ) - .into()); - } - } - - assert!( - lsn + (self.config.segment_size as Lsn) - >= self.cur_lsn.unwrap_or(0), - "caller is responsible for providing segments \ - that contain the initial cur_lsn value or higher" - ); - - trace!( - "LogIter::read_segment lsn: {:?} cur_lsn: {:?}", - lsn, - self.cur_lsn - ); - // we add segment_len to this check because we may be getting the - // initial segment that is a bit behind where we left off before. - assert!( - lsn + self.config.segment_size as Lsn >= self.cur_lsn.unwrap_or(0) - ); - let f = &self.config.file; - let segment_header = read_segment_header(f, offset)?; - if offset % self.config.segment_size as LogOffset != 0 { - debug!("segment offset not divisible by segment length"); - return Err(Error::corruption(None)); - } - if segment_header.lsn % self.config.segment_size as Lsn != 0 { - debug!( - "expected a segment header lsn that is divisible \ - by the segment_size ({}) instead it was {}", - self.config.segment_size, segment_header.lsn - ); - return Err(Error::corruption(None)); - } - - if segment_header.lsn != lsn { - // this page was torn, nothing to read - debug!( - "segment header lsn ({}) != expected lsn ({})", - segment_header.lsn, lsn - ); - return Err(io::Error::new( - io::ErrorKind::Other, - "encountered torn segment", - ) - .into()); - } - - trace!("read segment header {:?}", segment_header); - - let mut buf = vec![0; self.config.segment_size]; - let size = pread_exact_or_eof(f, &mut buf, offset)?; - - trace!("setting stored segment buffer length to {} after read", size); - buf.truncate(size); - - self.cur_lsn = Some(segment_header.lsn + SEG_HEADER_LEN as Lsn); - - self.segment_base = Some(BasedBuf { buf, offset }); - - // NB this should only happen after we've successfully read - // the header, because we want to zero the segment if we - // fail to read that, and we use the remaining segment - // list to perform zeroing off of. - self.segments.remove(&lsn); - - Ok(()) - } -} - -const fn valid_entry_offset(lid: LogOffset, segment_len: usize) -> bool { - let seg_start = lid / segment_len as LogOffset * segment_len as LogOffset; - - let max_lid = - seg_start + segment_len as LogOffset - MAX_MSG_HEADER_LEN as LogOffset; - - let min_lid = seg_start + SEG_HEADER_LEN as LogOffset; - - lid >= min_lid && lid <= max_lid -} - -// Scan the log file if we don't know of any Lsn offsets yet, -// and recover the order of segments, and the highest Lsn. 
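// Sketch of the recovery scan implemented just below: read the header of every
// segment-sized slot in the file, keep the ones that pass their checks, order
// them by the LSN recorded in the header, and then walk the tail in segment_size
// steps to find the first gap (a torn segment ends the contiguous tail).
// Simplified and illustrative; real headers also carry CRCs and max_stable_lsn,
// and real segments are far larger than the toy constant used here.
use std::collections::BTreeMap;

const SEGMENT_SIZE: i64 = 512;

// (file offset, header lsn, header passed validation)
fn build_ordering(scanned: &[(u64, i64, bool)], min_lsn: i64) -> BTreeMap<i64, u64> {
    scanned
        .iter()
        .filter(|(_, lsn, ok)| *ok && *lsn >= min_lsn)
        .map(|(offset, lsn, _)| (*lsn, *offset))
        .collect()
}

// walk upward from `start_lsn` while each expected segment is present
fn contiguous_tail_end(ordering: &BTreeMap<i64, u64>, start_lsn: i64) -> i64 {
    let mut expected = start_lsn;
    while ordering.contains_key(&expected) {
        expected += SEGMENT_SIZE;
    }
    expected // base LSN of the first missing segment: recovery stops before it
}

fn main() {
    // segments were physically written at offsets 0, 512 and 1024, but the one
    // holding LSN 1024 is torn, so its header fails validation
    let scanned = [(0_u64, 0_i64, true), (512, 512, true), (1024, 1024, false)];
    let ordering = build_ordering(&scanned, 0);
    assert_eq!(ordering.len(), 2);
    assert_eq!(contiguous_tail_end(&ordering, 0), 1024);
}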
-fn scan_segment_headers_and_tail( - min: Lsn, - config: &RunningConfig, -) -> Result<(BTreeMap, Lsn)> { - fn fetch( - idx: u64, - min: Lsn, - config: &RunningConfig, - ) -> Option<(LogOffset, SegmentHeader)> { - let segment_len = u64::try_from(config.segment_size).unwrap(); - let base_lid = idx * segment_len; - let segment = read_segment_header(&config.file, base_lid).ok()?; - trace!( - "SA scanned header at lid {} during startup: {:?}", - base_lid, - segment - ); - if segment.ok && segment.lsn >= min { - assert_ne!(segment.lsn, Lsn::max_value()); - Some((base_lid, segment)) - } else { - trace!( - "not using segment at lid {}, ok: {} lsn: {} min lsn: {}", - base_lid, - segment.ok, - segment.lsn, - min - ); - None - } - } - - let segment_len = LogOffset::try_from(config.segment_size).unwrap(); - - let f = &config.file; - let file_len = f.metadata()?.len(); - let segments = (file_len / segment_len) - + if file_len % segment_len - < LogOffset::try_from(SEG_HEADER_LEN).unwrap() - { - 0 - } else { - 1 - }; - - trace!( - "file len: {} segment len {} segments: {}", - file_len, - segment_len, - segments - ); - - // scatter - let header_promises: Vec<_> = (0..segments) - .map({ - // let config = config.clone(); - move |idx| { - threadpool::spawn({ - let config2 = config.clone(); - move || fetch(idx, min, &config2) - }) - } - }) - .collect(); - - // gather - let mut headers: Vec<(LogOffset, SegmentHeader)> = vec![]; - for promise in header_promises { - let read_attempt = - promise.wait().expect("thread pool should not crash"); - - if let Some(completed_result) = read_attempt { - headers.push(completed_result); - } - } - - // find max stable LSN recorded in segment headers - let mut ordering = BTreeMap::new(); - let mut max_header_stable_lsn = min; - - for (lid, header) in headers { - max_header_stable_lsn = - std::cmp::max(header.max_stable_lsn, max_header_stable_lsn); - - if let Some(old) = ordering.insert(header.lsn, lid) { - assert_eq!( - old, lid, - "duplicate segment LSN {} detected at both {} and {}, \ - one should have been zeroed out during recovery", - header.lsn, old, lid - ); - } - } - - debug!( - "ordering before clearing tears: {:?}, \ - max_header_stable_lsn: {}", - ordering, max_header_stable_lsn - ); - - // Check that the segments above max_header_stable_lsn - // properly link their previous segment pointers. - let end_of_last_contiguous_message_in_unstable_tail = - check_contiguity_in_unstable_tail( - max_header_stable_lsn, - &ordering, - config, - ); - - Ok((ordering, end_of_last_contiguous_message_in_unstable_tail)) -} - -// This ensures that the last <# io buffers> segments on -// disk connect via their previous segment pointers in -// the header. This is important because we expect that -// the last <# io buffers> segments will join up, and we -// never reuse buffers within this safety range. -fn check_contiguity_in_unstable_tail( - max_header_stable_lsn: Lsn, - ordering: &BTreeMap, - config: &RunningConfig, -) -> Lsn { - let segment_size = config.segment_size as Lsn; - - // -1..(2 * segment_size) - 1 => 0 - // otherwise the floor of the buffer - let lowest_lsn_in_tail: Lsn = - std::cmp::max(0, (max_header_stable_lsn / segment_size) * segment_size); - - let mut expected_present = lowest_lsn_in_tail; - let mut missing_item_in_tail = None; - - let logical_tail = ordering - .range(lowest_lsn_in_tail..) 
- .map(|(lsn, lid)| (*lsn, *lid)) - .take_while(|(lsn, _lid)| { - let matches = expected_present == *lsn; - if !matches { - debug!( - "failed to find expected segment \ - at lsn {}, tear detected", - expected_present - ); - missing_item_in_tail = Some(expected_present); - } - expected_present += segment_size; - matches - }) - .collect(); - - debug!( - "in clean_tail_tears, found missing item in tail: {:?} \ - and we'll scan segments {:?} above lowest lsn {}", - missing_item_in_tail, logical_tail, lowest_lsn_in_tail - ); - - let mut iter = LogIter { - config: config.clone(), - segments: logical_tail, - segment_base: None, - max_lsn: missing_item_in_tail, - cur_lsn: None, - last_stage: false, - }; - - // run the iterator to completion - for _ in &mut iter {} - - // `cur_lsn` is set to the beginning - // of the next message - let end_of_last_message = iter.cur_lsn.unwrap_or(0) - 1; - - debug!( - "filtering out segments after detected tear at (lsn, lid) {:?}", - end_of_last_message, - ); - - end_of_last_message -} - -/// Returns a log iterator, the max stable lsn, -/// and a set of segments that can be -/// zeroed after the new snapshot is written, -/// but no sooner, otherwise it is not crash-safe. -pub fn raw_segment_iter_from( - lsn: Lsn, - config: &RunningConfig, -) -> Result { - let segment_len = config.segment_size as Lsn; - let normalized_lsn = lsn / segment_len * segment_len; - - let (ordering, end_of_last_msg) = - scan_segment_headers_and_tail(normalized_lsn, config)?; - - // find the last stable tip, to properly handle batch manifests. - let tip_segment_iter: BTreeMap<_, _> = ordering - .iter() - .next_back() - .map(|(a, b)| (*a, *b)) - .into_iter() - .collect(); - - trace!( - "trying to find the max stable tip for \ - bounding batch manifests with segment iter {:?} \ - of segments >= first_tip {}", - tip_segment_iter, - end_of_last_msg, - ); - - trace!( - "generated iterator over segments {:?} with lsn >= {}", - ordering, - normalized_lsn, - ); - - let segments = ordering - .into_iter() - .filter(move |&(l, _)| l >= normalized_lsn) - .collect(); - - Ok(LogIter { - config: config.clone(), - max_lsn: Some(end_of_last_msg), - cur_lsn: None, - segment_base: None, - segments, - last_stage: true, - }) -} diff --git a/src/pagecache/logger.rs b/src/pagecache/logger.rs deleted file mode 100644 index 223295c17..000000000 --- a/src/pagecache/logger.rs +++ /dev/null @@ -1,866 +0,0 @@ -use std::fs::File; - -use super::{ - arr_to_lsn, arr_to_u32, assert_usize, header, iobuf, lsn_to_arr, - pread_exact, pread_exact_or_eof, roll_iobuf, u32_to_arr, Arc, BasedBuf, - DiskPtr, HeapId, IoBuf, IoBufs, LogKind, LogOffset, Lsn, MessageKind, - Reservation, Serialize, Snapshot, BATCH_MANIFEST_PID, COUNTER_PID, - MAX_MSG_HEADER_LEN, META_PID, SEG_HEADER_LEN, -}; - -use crate::*; - -/// A sequential store which allows users to create -/// reservations placed at known log offsets, used -/// for writing persistent data structures that need -/// to know where to find persisted bits in the future. -#[derive(Debug)] -pub struct Log { - /// iobufs is the underlying lock-free IO write buffer. - pub(crate) iobufs: Arc, - pub(crate) config: RunningConfig, -} - -impl Log { - /// Start the log, open or create the configured file, - /// and optionally start the periodic buffer flush thread. 
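// Sketch of the LSN/offset arithmetic that the iterator code and
// raw_segment_iter_from above lean on: an LSN's segment number is
// lsn / segment_size, its position inside the segment is lsn % segment_size,
// and "normalizing" an LSN rounds it down to its segment's base. The constant
// below is only an example; the real segment size comes from the configuration.
const SEGMENT_SIZE: u64 = 8 * 1024 * 1024;

fn segment_number(lsn: u64) -> u64 {
    lsn / SEGMENT_SIZE
}

fn offset_in_segment(lsn: u64) -> u64 {
    lsn % SEGMENT_SIZE
}

fn normalize_to_segment_base(lsn: u64) -> u64 {
    lsn / SEGMENT_SIZE * SEGMENT_SIZE
}

fn main() {
    let lsn = 3 * SEGMENT_SIZE + 4096;
    assert_eq!(segment_number(lsn), 3);
    assert_eq!(offset_in_segment(lsn), 4096);
    assert_eq!(normalize_to_segment_base(lsn), 3 * SEGMENT_SIZE);
    // the next segment begins one whole segment past the normalized base
    assert_eq!(normalize_to_segment_base(lsn) + SEGMENT_SIZE, 4 * SEGMENT_SIZE);
}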
- pub fn start(config: RunningConfig, snapshot: &Snapshot) -> Result { - let iobufs = Arc::new(IoBufs::start(config.clone(), snapshot)?); - - Ok(Self { iobufs, config }) - } - - /// Flushes any pending IO buffers to disk to ensure durability. - /// Returns the number of bytes written during this call. - pub fn flush(&self) -> Result { - iobuf::flush(&self.iobufs) - } - - /// Return an iterator over the log, starting with - /// a specified offset. - pub fn iter_from(&self, lsn: Lsn) -> super::LogIter { - self.iobufs.iter_from(lsn) - } - - pub(crate) fn roll_iobuf(&self) -> Result { - roll_iobuf(&self.iobufs) - } - - /// read a buffer from the disk - pub fn read(&self, pid: PageId, lsn: Lsn, ptr: DiskPtr) -> Result { - trace!("reading log lsn {} ptr {}", lsn, ptr); - - let expected_segment_number = SegmentNumber( - u64::try_from(lsn).unwrap() - / u64::try_from(self.config.segment_size).unwrap(), - ); - - iobuf::make_durable(&self.iobufs, lsn)?; - - if ptr.is_inline() { - let f = &self.config.file; - read_message( - &**f, - ptr.lid().unwrap(), - expected_segment_number, - &self.config, - ) - } else { - // we short-circuit the inline read - // here because it might not still - // exist in the inline log. - let heap_id = ptr.heap_id().unwrap(); - self.config.heap.read(heap_id).map(|(kind, buf)| { - let header = MessageHeader { - kind, - pid, - segment_number: expected_segment_number, - crc32: 0, - len: 0, - }; - LogRead::Heap(header, buf, heap_id, 0) - }) - } - } - - /// returns the current stable offset written to disk - pub fn stable_offset(&self) -> Lsn { - self.iobufs.stable() - } - - /// blocks until the specified log sequence number has - /// been made stable on disk. Returns the number of - /// bytes written during this call. this is appropriate - /// as a full consistency-barrier for all data written - /// up until this point. - pub fn make_stable(&self, lsn: Lsn) -> Result { - iobuf::make_stable(&self.iobufs, lsn) - } - - /// Reserve a replacement buffer for a previously written - /// heap write. This allows the tiny pointer in the log - /// to be migrated to a new segment without copying the - /// massive slab in the heap that the pointer references. - pub(super) fn rewrite_heap_pointer( - &self, - pid: PageId, - heap_pointer: HeapId, - guard: &Guard, - ) -> Result> { - let ret = self.reserve_inner( - LogKind::Replace, - pid, - &heap_pointer, - Some(heap_pointer), - guard, - ); - - if let Err(e) = &ret { - self.iobufs.set_global_error(*e); - } - - ret - } - - /// Tries to claim a reservation for writing a buffer to a - /// particular location in stable storge, which may either be - /// completed or aborted later. Useful for maintaining - /// linearizability across CAS operations that may need to - /// persist part of their operation. 
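// Sketch of the reserve / complete-or-abort discipline described above (and used
// by the heap Reservation earlier in this diff): a reservation hands out a slot,
// and if it is dropped without being completed the slot automatically returns to
// the free list. Simplified to a Mutex<Vec<u32>> free list; the names and types
// here are assumptions, not the deleted API.
use std::sync::{Arc, Mutex};

type FreeList = Arc<Mutex<Vec<u32>>>;

struct SlotReservation {
    slot: u32,
    completed: bool,
    free_list: FreeList,
}

impl SlotReservation {
    fn complete(mut self) -> u32 {
        // pretend the payload was durably written here
        self.completed = true;
        self.slot
    }
}

impl Drop for SlotReservation {
    fn drop(&mut self) {
        if !self.completed {
            // aborted (or failed mid-write): recycle the slot
            self.free_list.lock().unwrap().push(self.slot);
        }
    }
}

fn reserve(free_list: &FreeList, next_tip: &mut u32) -> SlotReservation {
    let slot = free_list.lock().unwrap().pop().unwrap_or_else(|| {
        // no freed slot available: extend the tip of the file
        let tip = *next_tip;
        *next_tip += 1;
        tip
    });
    SlotReservation { slot, completed: false, free_list: free_list.clone() }
}

fn main() {
    let free_list: FreeList = Arc::new(Mutex::new(vec![]));
    let mut tip = 0;
    let r1 = reserve(&free_list, &mut tip);
    assert_eq!(r1.slot, 0);
    drop(r1); // never completed: slot 0 returns to the free list
    let r2 = reserve(&free_list, &mut tip);
    assert_eq!(r2.complete(), 0); // the recycled slot, consumed by completion
    assert!(free_list.lock().unwrap().is_empty());
}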
- pub fn reserve( - &self, - log_kind: LogKind, - pid: PageId, - item: &T, - guard: &Guard, - ) -> Result> { - let ret = self.reserve_inner(log_kind, pid, item, None, guard); - - if let Err(e) = &ret { - self.iobufs.set_global_error(*e); - } - - ret - } - - fn reserve_inner( - &self, - log_kind: LogKind, - pid: PageId, - item: &T, - heap_rewrite: Option, - _: &Guard, - ) -> Result> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.reserve_lat); - - let serialized_len = item.serialized_size(); - let max_buf_len = - u64::try_from(MAX_MSG_HEADER_LEN).unwrap() + serialized_len; - - #[cfg(feature = "metrics")] - M.reserve_sz.measure(max_buf_len); - - let max_buf_size = usize::try_from(super::heap::MIN_SZ * 15 / 16) - .unwrap() - .min(self.config.segment_size - SEG_HEADER_LEN); - - let over_heap_threshold = - max_buf_len > u64::try_from(max_buf_size).unwrap(); - - assert!(!(over_heap_threshold && heap_rewrite.is_some())); - - let mut printed = false; - macro_rules! trace_once { - ($($msg:expr),*) => { - if !printed { - trace!($($msg),*); - printed = true; - } - }; - } - - let backoff = Backoff::new(); - - let kind = match ( - pid, - log_kind, - over_heap_threshold || heap_rewrite.is_some(), - ) { - (COUNTER_PID, LogKind::Replace, false) => MessageKind::Counter, - (META_PID, LogKind::Replace, true) => MessageKind::HeapMeta, - (META_PID, LogKind::Replace, false) => MessageKind::InlineMeta, - (BATCH_MANIFEST_PID, LogKind::Skip, false) => { - MessageKind::BatchManifest - } - (_, LogKind::Free, false) => MessageKind::Free, - (_, LogKind::Replace, true) => MessageKind::HeapNode, - (_, LogKind::Replace, false) => MessageKind::InlineNode, - (_, LogKind::Link, true) => MessageKind::HeapLink, - (_, LogKind::Link, false) => MessageKind::InlineLink, - other => unreachable!( - "unexpected combination of PageId, \ - LogKind, and heap status: {:?}", - other - ), - }; - - #[cfg(feature = "metrics")] - match kind { - MessageKind::HeapLink | MessageKind::HeapNode => { - M.bytes_written_heap_item.fetch_add( - usize::try_from(serialized_len).unwrap(), - Relaxed, - ); - M.bytes_written_heap_ptr.fetch_add(16, Relaxed); - } - MessageKind::InlineNode => { - M.bytes_written_replace.fetch_add( - usize::try_from(serialized_len).unwrap(), - Relaxed, - ); - } - MessageKind::InlineLink => { - M.bytes_written_link.fetch_add( - usize::try_from(serialized_len).unwrap(), - Relaxed, - ); - } - _ => { - M.bytes_written_other.fetch_add( - usize::try_from(serialized_len).unwrap(), - Relaxed, - ); - } - } - - loop { - #[cfg(feature = "metrics")] - M.log_reservation_attempted(); - - // don't continue if the system - // has encountered an issue. - if let Err(e) = self.config.global_error() { - let intervals = self.iobufs.intervals.lock(); - - // having held the mutex makes this linearized - // with the notify below. - drop(intervals); - - let _notified = self.iobufs.interval_updated.notify_all(); - return Err(e); - } - - // load current header value - let iobuf = self.iobufs.current_iobuf(); - let header = iobuf.get_header(); - let buf_offset = header::offset(header); - let reservation_lsn = - iobuf.lsn + Lsn::try_from(buf_offset).unwrap(); - - // skip if already sealed - if header::is_sealed(header) { - // already sealed, start over and hope cur - // has already been bumped by sealer. - trace_once!("io buffer already sealed, spinning"); - - backoff.spin(); - - continue; - } - - // figure out how big the header + buf will be. - // this is variable because of varints used - // in the header. 
- let message_header = MessageHeader { - crc32: 0, - kind, - segment_number: SegmentNumber( - u64::try_from(iobuf.lsn).unwrap() - / u64::try_from(self.config.segment_size).unwrap(), - ), - pid, - len: if over_heap_threshold { - // a HeapId is always 16 bytes - 16 - } else { - serialized_len - }, - }; - - let inline_buf_len = if over_heap_threshold { - usize::try_from( - // a HeapId is always 16 bytes - message_header.serialized_size() + 16, - ) - .unwrap() - } else { - usize::try_from( - message_header.serialized_size() + serialized_len, - ) - .unwrap() - }; - - trace!( - "reserving buf of len {} for pid {} with kind {:?}", - inline_buf_len, - pid, - kind - ); - - // try to claim space - let prospective_size = buf_offset + inline_buf_len; - // we don't reserve anything if we're within the last - // MAX_MSG_HEADER_LEN bytes of the buffer. during - // recovery, we assume that nothing can begin here, - // because headers are dynamically sized. - let red_zone = iobuf.capacity - buf_offset < MAX_MSG_HEADER_LEN; - let would_overflow = red_zone || prospective_size > iobuf.capacity; - if would_overflow { - // This buffer is too full to accept our write! - // Try to seal the buffer, and maybe write it if - // there are zero writers. - trace_once!("io buffer too full, spinning"); - iobuf::maybe_seal_and_write_iobuf( - &self.iobufs, - &iobuf, - header, - true, - )?; - backoff.spin(); - continue; - } - - // attempt to claim by incrementing an unsealed header - let bumped_offset = header::bump_offset(header, inline_buf_len); - - // check for maxed out IO buffer writers - if header::n_writers(bumped_offset) == header::MAX_WRITERS { - trace_once!( - "spinning because our buffer has {} writers already", - header::MAX_WRITERS - ); - backoff.spin(); - continue; - } - - let claimed = header::incr_writers(bumped_offset); - - if iobuf.cas_header(header, claimed).is_err() { - // CAS failed, start over - trace_once!("CAS failed while claiming buffer slot, spinning"); - backoff.spin(); - continue; - } - - let log_offset = iobuf.offset; - - // if we're giving out a reservation, - // the writer count should be positive - assert_ne!(header::n_writers(claimed), 0); - - // should never have claimed a sealed buffer - assert!(!header::is_sealed(claimed)); - - // MAX is used to signify unreadiness of - // the underlying IO buffer, and if it's - // still set here, the buffer counters - // used to choose this IO buffer - // were incremented in a racy way. 
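// An illustrative packing of the 64-bit IoBuf header manipulated in the claim
// loop above (offset, writer count, sealed flag; the salt and maxed bits are
// omitted). The real bit layout lives in the deleted `header` module and is not
// shown in this diff, so the widths and positions below are assumptions chosen
// only to demonstrate the idea: writers claim space by CAS-ing a bumped offset
// and incremented writer count into the header, and a sealed header stops claims.
type Header = u64;

const OFFSET_BITS: u32 = 32;
const WRITER_BITS: u32 = 7;
const OFFSET_MASK: u64 = (1 << OFFSET_BITS) - 1;
const WRITER_MASK: u64 = ((1 << WRITER_BITS) - 1) << OFFSET_BITS;
const SEALED_BIT: u64 = 1 << (OFFSET_BITS + WRITER_BITS);
const MAX_WRITERS: u64 = (1 << WRITER_BITS) - 1;

fn offset(h: Header) -> usize { (h & OFFSET_MASK) as usize }
fn n_writers(h: Header) -> u64 { (h & WRITER_MASK) >> OFFSET_BITS }
fn is_sealed(h: Header) -> bool { h & SEALED_BIT != 0 }
fn mk_sealed(h: Header) -> Header { h | SEALED_BIT }

fn bump_offset(h: Header, by: usize) -> Header {
    assert!(offset(h) + by <= OFFSET_MASK as usize);
    h + by as u64
}

fn incr_writers(h: Header) -> Header {
    assert_ne!(n_writers(h), MAX_WRITERS);
    h + (1 << OFFSET_BITS)
}

fn decr_writers(h: Header) -> Header {
    assert_ne!(n_writers(h), 0);
    h - (1 << OFFSET_BITS)
}

fn main() {
    use std::sync::atomic::{AtomicU64, Ordering::SeqCst};
    let header = AtomicU64::new(0);

    // a writer claims 100 bytes the way the reservation loop does: read, bump,
    // increment writers, then CAS; a failed CAS means "start over"
    let old = header.load(SeqCst);
    let claimed = incr_writers(bump_offset(old, 100));
    assert!(header.compare_exchange(old, claimed, SeqCst, SeqCst).is_ok());
    assert_eq!(offset(header.load(SeqCst)), 100);
    assert_eq!(n_writers(header.load(SeqCst)), 1);

    // sealing flips a bit; later claim attempts see it and retry on a new buffer
    let sealed = mk_sealed(header.load(SeqCst));
    assert!(is_sealed(sealed));
    assert_eq!(n_writers(decr_writers(sealed)), 0);
}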
- assert_ne!( - log_offset, - LogOffset::max_value(), - "fucked up on iobuf with lsn {}\n{:?}", - reservation_lsn, - self - ); - - let destination = iobuf.get_mut_range(buf_offset, inline_buf_len); - let reservation_lid = log_offset + buf_offset as LogOffset; - - trace!( - "reserved {} bytes at lsn {} lid {}", - inline_buf_len, - reservation_lsn, - reservation_lid, - ); - - self.iobufs - .max_reserved_lsn - .fetch_max(reservation_lsn + inline_buf_len as Lsn - 1, SeqCst); - - let (heap_reservation, heap_id_opt) = if over_heap_threshold { - let heap_reservation = self - .config - .heap - .reserve(serialized_len + 13, reservation_lsn); - let heap_id = heap_reservation.heap_id; - (Some(heap_reservation), Some(heap_id)) - } else { - (None, None) - }; - - self.iobufs.encapsulate( - item, - message_header, - destination, - heap_reservation, - )?; - - #[cfg(feature = "metrics")] - M.log_reservation_success(); - - let pointer = if let Some(heap_id) = heap_id_opt { - DiskPtr::new_heap_item(reservation_lid, heap_id) - } else if let Some(heap_id) = heap_rewrite { - DiskPtr::new_heap_item(reservation_lid, heap_id) - } else { - DiskPtr::new_inline(reservation_lid) - }; - - return Ok(Reservation { - iobuf, - log: self, - buf: destination, - flushed: false, - lsn: reservation_lsn, - pointer, - is_heap_item_rewrite: heap_rewrite.is_some(), - header_len: usize::try_from(message_header.serialized_size()) - .unwrap(), - }); - } - } - - /// Called by Reservation on termination (completion or abort). - /// Handles departure from shared state, and possibly writing - /// the buffer to stable storage if necessary. - pub(super) fn exit_reservation(&self, iobuf: &Arc) -> Result<()> { - let mut header = iobuf.get_header(); - - // Decrement writer count, retrying until successful. - loop { - let new_hv = header::decr_writers(header); - if let Err(current) = iobuf.cas_header(header, new_hv) { - // we failed to decr, retry - header = current; - } else { - // success - header = new_hv; - break; - } - } - - // Succeeded in decrementing writers, if we decremented writn - // to 0 and it's sealed then we should write it to storage. - if header::n_writers(header) == 0 && header::is_sealed(header) { - if let Err(e) = self.config.global_error() { - let intervals = self.iobufs.intervals.lock(); - - // having held the mutex makes this linearized - // with the notify below. - drop(intervals); - - let _notified = self.iobufs.interval_updated.notify_all(); - return Err(e); - } - - let lsn = iobuf.lsn; - trace!( - "asynchronously writing iobuf with lsn {} \ - to log from exit_reservation", - lsn - ); - let iobufs2 = self.iobufs.clone(); - let iobuf2 = iobuf.clone(); - threadpool::write_to_log(iobuf2, iobufs2); - - Ok(()) - } else { - Ok(()) - } - } -} - -impl Drop for Log { - fn drop(&mut self) { - // don't do any more IO if we're crashing - if self.config.global_error().is_err() { - return; - } - - if let Err(e) = iobuf::flush(&self.iobufs) { - error!("failed to flush from IoBufs::drop: {}", e); - } - - if !self.config.temporary { - self.config.file.sync_all().unwrap(); - } - - debug!("IoBufs dropped"); - } -} - -/// All log messages are prepended with this header -#[derive(Debug, Copy, Clone, PartialEq, Eq)] -pub struct MessageHeader { - pub crc32: u32, - pub kind: MessageKind, - pub segment_number: SegmentNumber, - pub pid: PageId, - pub len: u64, -} - -/// A number representing a segment number. 
-#[derive(Copy, Clone, PartialEq, Eq, Debug)] -#[repr(transparent)] -pub struct SegmentNumber(pub u64); - -impl std::ops::Deref for SegmentNumber { - type Target = u64; - - fn deref(&self) -> &u64 { - &self.0 - } -} - -/// A segment's header contains the new base LSN and a reference -/// to the previous log segment. -#[derive(Debug, Copy, Clone)] -pub struct SegmentHeader { - pub lsn: Lsn, - pub max_stable_lsn: Lsn, - pub ok: bool, -} - -/// The result of a read of a log message -#[derive(Debug)] -pub enum LogRead { - /// Successful read, entirely on-log - Inline(MessageHeader, Vec, u32), - /// Successful read, spilled to a slot in the heap - Heap(MessageHeader, Vec, HeapId, u32), - /// A cancelled message was encountered - Canceled(u32), - /// A padding message used to show that a segment was filled - Cap(SegmentNumber), - /// This log message was not readable due to corruption - Corrupted, - /// This heap slot has been replaced - DanglingHeap(MessageHeader, HeapId, u32), - /// This data may only be read if at least this future location is stable - BatchManifest(Lsn, u32), -} - -impl LogRead { - /// Return true if we read a successful Inline or Heap value. - pub const fn is_successful(&self) -> bool { - matches!(self, LogRead::Inline(..) | LogRead::Heap(..)) - } - - /// Return the underlying data read from a log read, if successful. - pub fn into_data(self) -> Option> { - match self { - LogRead::Heap(_, buf, _, _) | LogRead::Inline(_, buf, _) => { - Some(buf) - } - _ => None, - } - } -} - -impl From<[u8; SEG_HEADER_LEN]> for SegmentHeader { - fn from(buf: [u8; SEG_HEADER_LEN]) -> Self { - #[allow(unsafe_code)] - unsafe { - let crc32_header = - arr_to_u32(buf.get_unchecked(0..4)) ^ 0xFFFF_FFFF; - - let xor_lsn = arr_to_lsn(buf.get_unchecked(4..12)); - let lsn = xor_lsn ^ 0x7FFF_FFFF_FFFF_FFFF; - - let xor_max_stable_lsn = arr_to_lsn(buf.get_unchecked(12..20)); - let max_stable_lsn = xor_max_stable_lsn ^ 0x7FFF_FFFF_FFFF_FFFF; - - let crc32_tested = crc32(&buf[4..20]); - - let ok = crc32_tested == crc32_header; - - if !ok { - debug!( - "segment with lsn {} had computed crc {}, \ - but stored crc {}", - lsn, crc32_tested, crc32_header - ); - } - - Self { lsn, max_stable_lsn, ok } - } - } -} - -impl From for [u8; SEG_HEADER_LEN] { - fn from(header: SegmentHeader) -> [u8; SEG_HEADER_LEN] { - let mut buf = [0; SEG_HEADER_LEN]; - - let xor_lsn = header.lsn ^ 0x7FFF_FFFF_FFFF_FFFF; - let lsn_arr = lsn_to_arr(xor_lsn); - - let xor_max_stable_lsn = header.max_stable_lsn ^ 0x7FFF_FFFF_FFFF_FFFF; - let highest_stable_lsn_arr = lsn_to_arr(xor_max_stable_lsn); - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - lsn_arr.as_ptr(), - buf.as_mut_ptr().add(4), - std::mem::size_of::(), - ); - std::ptr::copy_nonoverlapping( - highest_stable_lsn_arr.as_ptr(), - buf.as_mut_ptr().add(12), - std::mem::size_of::(), - ); - } - - let crc32 = u32_to_arr(crc32(&buf[4..20]) ^ 0xFFFF_FFFF); - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - crc32.as_ptr(), - buf.as_mut_ptr(), - std::mem::size_of::(), - ); - } - - buf - } -} - -pub(crate) fn read_segment_header( - file: &File, - lid: LogOffset, -) -> Result { - trace!("reading segment header at {}", lid); - - let mut seg_header_buf = [0; SEG_HEADER_LEN]; - pread_exact(file, &mut seg_header_buf, lid)?; - let segment_header = SegmentHeader::from(seg_header_buf); - - if segment_header.lsn < Lsn::try_from(lid).unwrap() { - debug!( - "segment had lsn {} but we expected something \ - greater, as the base lid is {}", - 
segment_header.lsn, lid - ); - } - - Ok(segment_header) -} - -pub(crate) trait ReadAt { - fn pread_exact(&self, dst: &mut [u8], at: u64) -> Result<()>; - - fn pread_exact_or_eof(&self, dst: &mut [u8], at: u64) -> Result; -} - -impl ReadAt for File { - fn pread_exact(&self, dst: &mut [u8], at: u64) -> Result<()> { - pread_exact(self, dst, at) - } - - fn pread_exact_or_eof(&self, dst: &mut [u8], at: u64) -> Result { - pread_exact_or_eof(self, dst, at) - } -} - -impl ReadAt for BasedBuf { - fn pread_exact(&self, dst: &mut [u8], mut at: u64) -> Result<()> { - if at < self.offset - || u64::try_from(dst.len()).unwrap() + at - > u64::try_from(self.buf.len()).unwrap() + self.offset - { - return Err(std::io::Error::new( - std::io::ErrorKind::UnexpectedEof, - "failed to fill buffer", - ) - .into()); - } - at -= self.offset; - let at_usize = usize::try_from(at).unwrap(); - let to_usize = at_usize + dst.len(); - dst.copy_from_slice(self.buf[at_usize..to_usize].as_ref()); - Ok(()) - } - - fn pread_exact_or_eof(&self, dst: &mut [u8], mut at: u64) -> Result { - if at < self.offset - || u64::try_from(self.buf.len()).unwrap() < at - self.offset - { - return Err(std::io::Error::new( - std::io::ErrorKind::UnexpectedEof, - "failed to fill buffer", - ) - .into()); - } - at -= self.offset; - - let at_usize = usize::try_from(at).unwrap(); - - let len = std::cmp::min(dst.len(), self.buf.len() - at_usize); - - let start = at_usize; - let end = start + len; - dst[..len].copy_from_slice(self.buf[start..end].as_ref()); - Ok(len) - } -} - -/// read a buffer from the disk -pub(crate) fn read_message( - file: &R, - lid: LogOffset, - expected_segment_number: SegmentNumber, - config: &RunningConfig, -) -> Result { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.read); - let segment_len = config.segment_size; - let seg_start = lid / segment_len as LogOffset * segment_len as LogOffset; - trace!("reading message from segment: {} at lid: {}", seg_start, lid); - assert!(seg_start + SEG_HEADER_LEN as LogOffset <= lid); - assert!( - (seg_start + segment_len as LogOffset) - lid - >= MAX_MSG_HEADER_LEN as LogOffset, - "tried to read a message from the red zone" - ); - - let msg_header_buf = &mut [0; 128]; - let _read_bytes = file.pread_exact_or_eof(msg_header_buf, lid)?; - let header_cursor = &mut msg_header_buf.as_ref(); - let len_before = header_cursor.len(); - let header = MessageHeader::deserialize(header_cursor)?; - let len_after = header_cursor.len(); - let message_offset = len_before - len_after; - trace!( - "read message header at lid {} with header length {}: {:?}", - lid, - message_offset, - header - ); - - let ceiling = seg_start + segment_len as LogOffset; - - assert!(lid + message_offset as LogOffset <= ceiling); - - let max_possible_len = - assert_usize(ceiling - lid - message_offset as LogOffset); - - if header.len > max_possible_len as u64 { - trace!( - "read a corrupted message with impossibly long length {:?}", - header - ); - return Ok(LogRead::Corrupted); - } - - let header_len = usize::try_from(header.len).unwrap(); - - if header.kind == MessageKind::Corrupted { - trace!( - "read a corrupted message with Corrupted MessageKind: {:?}", - header - ); - return Ok(LogRead::Corrupted); - } - - // perform crc check on everything that isn't Corrupted - let mut buf = vec![0; header_len]; - - if header_len > len_after { - // we have to read more data from disk - file.pread_exact(&mut buf, lid + message_offset as LogOffset)?; - } else { - // we already read this data in the initial read - 
buf.copy_from_slice(header_cursor[..header_len].as_ref()); - } - - let crc32 = calculate_message_crc32( - msg_header_buf[..message_offset].as_ref(), - &buf, - ); - - if crc32 != header.crc32 { - trace!( - "read a message with a bad checksum with header {:?} msg len: {} expected: {} actual: {}", - header, - header_len, - header.crc32, - crc32 - ); - return Ok(LogRead::Corrupted); - } - - let inline_len = u32::try_from(message_offset).unwrap() - + u32::try_from(header.len).unwrap(); - - if header.segment_number != expected_segment_number { - debug!( - "header {:?} does not contain expected segment_number {:?}", - header, expected_segment_number - ); - return Ok(LogRead::Corrupted); - } - - match header.kind { - MessageKind::Canceled => { - trace!("read failed of len {}", header.len); - Ok(LogRead::Canceled(inline_len)) - } - MessageKind::Cap => { - trace!("read pad in segment number {:?}", header.segment_number); - Ok(LogRead::Cap(header.segment_number)) - } - MessageKind::HeapLink - | MessageKind::HeapNode - | MessageKind::HeapMeta => { - assert_eq!(buf.len(), 16); - let heap_id = HeapId::deserialize(&mut &buf[..]).unwrap(); - - match config.heap.read(heap_id) { - Ok((kind, buf2)) => { - assert_eq!(header.kind, kind); - trace!( - "read a successful heap message for heap {:?} in segment number {:?}", - heap_id, - header.segment_number, - ); - - Ok(LogRead::Heap(header, buf2, heap_id, inline_len)) - } - Err(e) => { - debug!("failed to read heap: {:?}", e); - Ok(LogRead::DanglingHeap(header, heap_id, inline_len)) - } - } - } - MessageKind::InlineLink - | MessageKind::InlineNode - | MessageKind::InlineMeta - | MessageKind::Free - | MessageKind::Counter => { - trace!("read a successful inline message"); - Ok(LogRead::Inline(header, buf, inline_len)) - } - MessageKind::BatchManifest => { - assert_eq!(buf.len(), std::mem::size_of::()); - let max_lsn = arr_to_lsn(&buf); - Ok(LogRead::BatchManifest(max_lsn, inline_len)) - } - MessageKind::Corrupted => unreachable!( - "corrupted should have been handled \ - before reading message length above" - ), - } -} diff --git a/src/pagecache/mod.rs b/src/pagecache/mod.rs deleted file mode 100644 index 7b3b962bc..000000000 --- a/src/pagecache/mod.rs +++ /dev/null @@ -1,2109 +0,0 @@ -//! `pagecache` is a lock-free pagecache and log for building high-performance -//! databases. 
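// The corruption gate in read_message above hinges on a checksum over the
// frame: the stored crc32 must match a checksum recomputed from the header
// bytes plus the body. A standalone sketch, with the crc32fast crate standing
// in for sled's internal crc32 helpers (whose exact hashed byte ranges may
// differ):
fn frame_crc(header_bytes: &[u8], body: &[u8]) -> u32 {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(header_bytes);
    hasher.update(body);
    hasher.finalize()
}

fn frame_is_intact(stored_crc: u32, header_bytes: &[u8], body: &[u8]) -> bool {
    frame_crc(header_bytes, body) == stored_crc
}

#[test]
fn torn_frames_are_rejected() {
    let header = [8u8, 0, 42]; // arbitrary example header bytes
    let crc = frame_crc(&header, b"hello");
    assert!(frame_is_intact(crc, &header, b"hello"));
    assert!(!frame_is_intact(crc, &header, b"hellp"));
}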
-#![allow(unsafe_code)] - -pub mod constants; -pub mod logger; - -mod disk_pointer; -mod header; -mod heap; -pub(crate) mod iobuf; -mod iterator; -mod pagetable; -#[cfg(any(all(not(unix), not(windows)), miri))] -mod parallel_io_polyfill; -#[cfg(all(unix, not(miri)))] -mod parallel_io_unix; -#[cfg(all(windows, not(miri)))] -mod parallel_io_windows; -mod reservation; -mod segment; -mod snapshot; - -use std::{fmt, ops::Deref}; - -use crate::*; - -#[cfg(any(all(not(unix), not(windows)), miri))] -use parallel_io_polyfill::{pread_exact, pread_exact_or_eof, pwrite_all}; - -#[cfg(all(unix, not(miri)))] -use parallel_io_unix::{pread_exact, pread_exact_or_eof, pwrite_all}; - -#[cfg(all(windows, not(miri)))] -use parallel_io_windows::{pread_exact, pread_exact_or_eof, pwrite_all}; - -use self::{ - constants::{ - BATCH_MANIFEST_PID, COUNTER_PID, META_PID, - PAGE_CONSOLIDATION_THRESHOLD, SEGMENT_CLEANUP_THRESHOLD, - }, - header::Header, - iobuf::{roll_iobuf, IoBuf, IoBufs}, - iterator::{raw_segment_iter_from, LogIter}, - pagetable::PageTable, - segment::{SegmentAccountant, SegmentCleaner, SegmentOp}, -}; - -pub(crate) use self::{ - heap::{Heap, HeapId}, - logger::{ - read_message, read_segment_header, MessageHeader, SegmentHeader, - SegmentNumber, - }, - reservation::Reservation, - snapshot::{read_snapshot_or_default, PageState, Snapshot}, -}; - -pub use self::{ - constants::{MAX_MSG_HEADER_LEN, MAX_SPACE_AMPLIFICATION, SEG_HEADER_LEN}, - disk_pointer::DiskPtr, - logger::{Log, LogRead}, -}; - -/// A file offset in the database log. -pub type LogOffset = u64; - -/// The logical sequence number of an item in the database log. -pub type Lsn = i64; - -/// A page identifier. -pub type PageId = u64; - -/// Uses a non-varint `Lsn` to mark offsets. -#[derive(Default, Clone, Copy, Ord, PartialOrd, Eq, PartialEq, Debug)] -#[repr(transparent)] -pub struct BatchManifest(pub Lsn); - -/// A buffer with an associated offset. Useful for -/// batching many reads over a file segment. -#[derive(Debug)] -pub struct BasedBuf { - pub buf: Vec, - pub offset: LogOffset, -} - -/// A byte used to disambiguate log message types -#[derive(Clone, Copy, PartialEq, Eq, Debug)] -#[repr(u8)] -pub enum MessageKind { - /// The EVIL_BYTE is written as a canary to help - /// detect torn writes. - Corrupted = 0, - /// Indicates that the following buffer corresponds - /// to a reservation for an in-memory operation that - /// failed to complete. It should be skipped during - /// recovery. - Canceled = 1, - /// Indicates that the following buffer is used - /// as padding to fill out the rest of the segment - /// before sealing it. - Cap = 2, - /// Indicates that the following buffer contains - /// an Lsn for the last write in an atomic writebatch. - BatchManifest = 3, - /// Indicates that this page was freed from the pagetable. - Free = 4, - /// Indicates that the last persisted ID was at least - /// this high. 
- Counter = 5, - /// The meta page, stored inline - InlineMeta = 6, - /// The meta page, stored heaply - HeapMeta = 7, - /// A consolidated page replacement, stored inline - InlineNode = 8, - /// A consolidated page replacement, stored heaply - HeapNode = 9, - /// A partial page update, stored inline - InlineLink = 10, - /// A partial page update, stored heaply - HeapLink = 11, -} - -impl MessageKind { - pub(in crate::pagecache) const fn into(self) -> u8 { - self as u8 - } -} - -impl From for MessageKind { - fn from(byte: u8) -> Self { - use MessageKind::*; - match byte { - 0 => Corrupted, - 1 => Canceled, - 2 => Cap, - 3 => BatchManifest, - 4 => Free, - 5 => Counter, - 6 => InlineMeta, - 7 => HeapMeta, - 8 => InlineNode, - 9 => HeapNode, - 10 => InlineLink, - 11 => HeapLink, - other => { - debug!("encountered unexpected message kind byte {}", other); - Corrupted - } - } - } -} - -/// The high-level types of stored information -/// about pages and their mutations -#[derive(Clone, Copy, Debug, PartialEq, Eq)] -pub enum LogKind { - /// Persisted data containing a page replacement - Replace, - /// Persisted immutable update - Link, - /// Freeing of a page - Free, - /// Some state indicating this should be skipped - Skip, - /// Unexpected corruption - Corrupted, -} - -const fn log_kind_from_update(update: &Update) -> LogKind { - match update { - Update::Free => LogKind::Free, - Update::Link(..) => LogKind::Link, - Update::Node(..) | Update::Counter(..) | Update::Meta(..) => { - LogKind::Replace - } - } -} - -impl From for LogKind { - fn from(kind: MessageKind) -> Self { - match kind { - MessageKind::Free => LogKind::Free, - MessageKind::InlineNode - | MessageKind::Counter - | MessageKind::HeapNode - | MessageKind::InlineMeta - | MessageKind::HeapMeta => LogKind::Replace, - MessageKind::InlineLink | MessageKind::HeapLink => LogKind::Link, - MessageKind::Canceled - | MessageKind::Cap - | MessageKind::BatchManifest => LogKind::Skip, - other => { - debug!("encountered unexpected message kind byte {:?}", other); - LogKind::Corrupted - } - } - } -} - -fn assert_usize(from: T) -> usize -where - usize: TryFrom, -{ - usize::try_from(from).expect("lost data cast while converting to usize") -} - -use std::convert::{TryFrom, TryInto}; - -pub(in crate::pagecache) const fn lsn_to_arr(number: Lsn) -> [u8; 8] { - number.to_le_bytes() -} - -#[inline] -pub(in crate::pagecache) fn arr_to_lsn(arr: &[u8]) -> Lsn { - Lsn::from_le_bytes(arr.try_into().unwrap()) -} - -pub(in crate::pagecache) const fn u64_to_arr(number: u64) -> [u8; 8] { - number.to_le_bytes() -} - -#[inline] -pub(crate) fn arr_to_u32(arr: &[u8]) -> u32 { - u32::from_le_bytes(arr.try_into().unwrap()) -} - -pub(crate) const fn u32_to_arr(number: u32) -> [u8; 4] { - number.to_le_bytes() -} - -#[derive(Debug, Clone, Copy)] -pub struct NodeView<'g>(pub(crate) PageView<'g>); - -impl<'g> Deref for NodeView<'g> { - type Target = Node; - fn deref(&self) -> &Node { - self.0.as_node() - } -} - -#[derive(Debug, Clone, Copy)] -pub struct MetaView<'g>(PageView<'g>); - -impl<'g> Deref for MetaView<'g> { - type Target = Meta; - fn deref(&self) -> &Meta { - self.0.as_meta() - } -} - -#[derive(Debug, Clone, Copy)] -pub struct PageView<'g> { - pub(in crate::pagecache) read: Shared<'g, Page>, - pub(in crate::pagecache) entry: &'g Atomic, -} - -impl<'g> Deref for PageView<'g> { - type Target = Page; - - fn deref(&self) -> &Page { - unsafe { self.read.deref() } - } -} - -#[derive(Clone, Copy, Debug, PartialEq, Eq)] -pub struct CacheInfo { - pub ts: u64, - pub lsn: Lsn, - 
pub pointer: DiskPtr, -} - -#[cfg(test)] -impl quickcheck::Arbitrary for CacheInfo { - fn arbitrary(g: &mut G) -> CacheInfo { - use rand::Rng; - - CacheInfo { ts: g.gen(), lsn: g.gen(), pointer: DiskPtr::arbitrary(g) } - } -} - -/// Update denotes a state or a change in a sequence of updates -/// of which a page consists. -#[derive(Clone, Debug)] -#[cfg_attr(feature = "testing", derive(PartialEq))] -pub(in crate::pagecache) enum Update { - Link(Link), - Node(Node), - Free, - Counter(u64), - Meta(Meta), -} - -impl Update { - fn as_node(&self) -> &Node { - match self { - Update::Node(node) => node, - other => panic!("called as_node on non-Node: {:?}", other), - } - } - - fn as_node_mut(&mut self) -> &mut Node { - match self { - Update::Node(node) => node, - other => panic!("called as_node_mut on non-Node: {:?}", other), - } - } - - fn as_link(&self) -> &Link { - match self { - Update::Link(link) => link, - other => panic!("called as_link on non-Link: {:?}", other), - } - } - - fn as_meta(&self) -> &Meta { - if let Update::Meta(meta) = self { - meta - } else { - panic!("called as_meta on {:?}", self) - } - } - - fn as_counter(&self) -> u64 { - if let Update::Counter(counter) = self { - *counter - } else { - panic!("called as_counter on {:?}", self) - } - } -} - -/// Ensures that any operations that are written to disk between the -/// creation of this guard and its destruction will be recovered -/// atomically. When this guard is dropped, it marks in an earlier -/// reservation where the stable tip must be in order to perform -/// recovery. If this is beyond where the system successfully -/// wrote before crashing, then the recovery will stop immediately -/// before any of the atomic batch can be partially recovered. -/// -/// Must call `seal_batch` to complete the atomic batch operation. -/// -/// If this is dropped without calling `seal_batch`, the complete -/// recovery effect will not occur. -#[derive(Debug)] -pub struct RecoveryGuard<'a> { - batch_res: Reservation<'a>, -} - -impl<'a> RecoveryGuard<'a> { - /// Writes the last LSN for a batch into an earlier - /// reservation, releasing it. - pub(crate) fn seal_batch(self) -> Result<()> { - let max_reserved = - self.batch_res.log.iobufs.max_reserved_lsn.load(Acquire); - self.batch_res.mark_writebatch(max_reserved).map(|_| ()) - } -} - -/// A page consists of a sequence of state transformations -/// with associated storage parameters like disk pos, lsn, time. 
-#[derive(Debug, Clone)] -pub struct Page { - update: Option, - cache_infos: Vec, -} - -impl Page { - pub(in crate::pagecache) fn rss(&self) -> Option { - match &self.update { - Some(Update::Node(ref node)) => Some(node.rss()), - _ => None, - } - } - - fn to_page_state(&self) -> PageState { - let base = &self.cache_infos[0]; - if self.is_free() { - PageState::Free(base.lsn, base.pointer) - } else { - let mut frags: Vec<(Lsn, DiskPtr)> = vec![]; - - for cache_info in self.cache_infos.iter().skip(1) { - frags.push((cache_info.lsn, cache_info.pointer)); - } - - PageState::Present { base: (base.lsn, base.pointer), frags } - } - } - - fn as_node(&self) -> &Node { - self.update.as_ref().unwrap().as_node() - } - - fn as_meta(&self) -> &Meta { - self.update.as_ref().unwrap().as_meta() - } - - fn as_counter(&self) -> u64 { - self.update.as_ref().unwrap().as_counter() - } - - const fn is_free(&self) -> bool { - matches!(self.update, Some(Update::Free)) - } - - fn last_lsn(&self) -> Lsn { - self.cache_infos.last().map(|ci| ci.lsn).unwrap() - } - - fn ts(&self) -> u64 { - self.cache_infos.last().map_or(0, |ci| ci.ts) - } - - fn lone_heap_item(&self) -> Option { - if self.cache_infos.len() == 1 - && self.cache_infos[0].pointer.is_heap_item() - { - Some(self.cache_infos[0]) - } else { - None - } - } -} - -/// A lock-free pagecache which supports linkmented pages -/// for dramatically improving write throughput. -#[derive(Clone)] -pub struct PageCache(Arc); - -impl Deref for PageCache { - type Target = PageCacheInner; - - fn deref(&self) -> &PageCacheInner { - &self.0 - } -} - -pub struct PageCacheInner { - was_recovered: bool, - pub(crate) config: RunningConfig, - inner: PageTable, - next_pid_to_allocate: Mutex, - // needs to be a sub-Arc because we separate - // it for async modification in an EBR guard - free: Arc>>, - #[doc(hidden)] - pub log: Log, - lru: Lru, - - idgen: AtomicU64, - idgen_persists: AtomicU64, - idgen_persist_mu: Mutex<()>, - - // fuzzy snapshot-related items - snapshot_min_lsn: AtomicLsn, - links: AtomicU64, - snapshot_lock: Mutex<()>, -} - -impl Debug for PageCache { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - f.write_str(&*format!( - "PageCache {{ max: {:?} free: {:?} }}\n", - *self.next_pid_to_allocate.lock(), - self.free - )) - } -} - -#[cfg(feature = "event_log")] -impl Drop for PageCacheInner { - fn drop(&mut self) { - trace!("dropping pagecache"); - - // we can't as easily assert recovery - // invariants across failpoints for now - if self.log.iobufs.config.global_error().is_ok() { - let mut pages_before_restart = Map::default(); - - let guard = pin(); - - self.config - .event_log - .meta_before_restart(self.get_meta(&guard).deref().clone()); - - for pid in 0..*self.next_pid_to_allocate.lock() { - let pte = self.inner.get(pid, &guard); - let pointers = - pte.cache_infos.iter().map(|ci| ci.pointer).collect(); - pages_before_restart.insert(pid, pointers); - } - - self.config.event_log.pages_before_restart(pages_before_restart); - } - - trace!("pagecache dropped"); - } -} - -impl PageCache { - /// Instantiate a new `PageCache`. - pub(crate) fn start(config: RunningConfig) -> Result { - trace!("starting pagecache"); - - config.reset_global_error(); - - // try to pull any existing snapshot off disk, and - // apply any new data to it to "catch-up" the - // snapshot before loading it. 
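// The snapshot pulled in below stores one PageState per page, shaped like the
// output of Page::to_page_state above: the first on-disk location is the
// base, and every later location is a delta ("frag") that recovery reapplies
// in order. A distilled standalone version of that conversion, with u64
// standing in for DiskPtr:
#[derive(Debug, PartialEq)]
enum PageStateSketch {
    Free(i64, u64),
    Present { base: (i64, u64), frags: Vec<(i64, u64)> },
}

fn to_page_state_sketch(cache_infos: &[(i64, u64)], is_free: bool) -> PageStateSketch {
    let base = cache_infos[0];
    if is_free {
        PageStateSketch::Free(base.0, base.1)
    } else {
        PageStateSketch::Present { base, frags: cache_infos[1..].to_vec() }
    }
}

#[test]
fn base_plus_frags() {
    let infos = [(10, 100), (11, 200), (12, 300)];
    assert_eq!(
        to_page_state_sketch(&infos, false),
        PageStateSketch::Present { base: (10, 100), frags: vec![(11, 200), (12, 300)] }
    );
}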
- let snapshot = read_snapshot_or_default(&config)?; - - config.heap.gc_unknown_items(&snapshot); - - #[cfg(feature = "testing")] - { - // these checks are in place to catch non-idempotent - // recovery which could trigger feedback loops and - // emergent behavior. - trace!( - "\n\n~~~~ regenerating snapshot for idempotency test ~~~~\n" - ); - - let paused_faults = crate::fail::pause_faults(); - - let snapshot2 = read_snapshot_or_default(&config) - .expect("second read snapshot"); - - crate::fail::restore_faults(paused_faults); - - assert_eq!( - snapshot.active_segment, snapshot2.active_segment, - "snapshot active_segment diverged across recoveries.\n\n \ - first: {:?}\n\n - second: {:?}\n\n", - snapshot, snapshot2 - ); - assert_eq!( - snapshot.stable_lsn, snapshot2.stable_lsn, - "snapshot stable_lsn diverged across recoveries.\n\n \ - first: {:?}\n\n - second: {:?}\n\n", - snapshot, snapshot2 - ); - for (pid, (p1, p2)) in - snapshot.pt.iter().zip(snapshot2.pt.iter()).enumerate() - { - assert_eq!( - p1, p2, - "snapshot pid {} diverged across recoveries.\n\n \ - first: {:?}\n\n - second: {:?}\n\n", - pid, p1, p2 - ); - } - assert_eq!( - snapshot.pt.len(), - snapshot2.pt.len(), - "snapshots number of pages diverged across recoveries.\n\n \ - first: {:?}\n\n - second: {:?}\n\n", - snapshot.pt, - snapshot2.pt - ); - assert_eq!( - snapshot, snapshot2, - "snapshots diverged across recoveries.\n\n \ - first: {:?}\n\n - second: {:?}\n\n", - snapshot, snapshot2 - ); - } - - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.start_pagecache); - - let cache_capacity = config.cache_capacity; - let lru = Lru::new(cache_capacity); - - let mut pc = PageCacheInner { - was_recovered: false, - config: config.clone(), - free: Arc::new(Mutex::new(FastSet8::default())), - idgen: AtomicU64::new(0), - idgen_persist_mu: Mutex::new(()), - idgen_persists: AtomicU64::new(0), - inner: PageTable::default(), - log: Log::start(config, &snapshot)?, - lru, - next_pid_to_allocate: Mutex::new(0), - snapshot_min_lsn: AtomicLsn::new(snapshot.stable_lsn.unwrap_or(0)), - links: AtomicU64::new(0), - snapshot_lock: Mutex::new(()), - }; - - // now we read it back in - pc.load_snapshot(&snapshot)?; - - #[cfg(feature = "event_log")] - { - // NB this must be before idgen/meta are initialized - // because they may cas_page on initial page-in. 
- let guard = pin(); - - let mut pages_after_restart = Map::default(); - - for pid in 0..*pc.next_pid_to_allocate.lock() { - let pte = pc.inner.get(pid, &guard); - let pointers = - pte.cache_infos.iter().map(|ci| ci.pointer).collect(); - pages_after_restart.insert(pid, pointers); - } - - pc.config.event_log.pages_after_restart(pages_after_restart); - } - - let mut was_recovered = true; - - let guard = pin(); - if !pc.inner.contains_pid(META_PID, &guard) { - // set up meta - was_recovered = false; - - let meta_update = Update::Meta(Meta::default()); - - let (meta_id, _) = pc.allocate_inner(meta_update, &guard)?; - - assert_eq!( - meta_id, META_PID, - "we expect the meta page to have pid {}, but it had pid {} instead", - META_PID, meta_id, - ); - } - - if !pc.inner.contains_pid(COUNTER_PID, &guard) { - // set up idgen - was_recovered = false; - - let counter_update = Update::Counter(0); - - let (counter_id, _) = pc.allocate_inner(counter_update, &guard)?; - - assert_eq!( - counter_id, COUNTER_PID, - "we expect the counter to have pid {}, but it had pid {} instead", - COUNTER_PID, counter_id, - ); - } - - let (idgen_key, counter) = pc.get_idgen(&guard); - let idgen_recovery = if was_recovered { - counter + (2 * pc.config.idgen_persist_interval) - } else { - 0 - }; - let idgen_persists = counter / pc.config.idgen_persist_interval - * pc.config.idgen_persist_interval; - - pc.idgen.store(idgen_recovery, Release); - pc.idgen_persists.store(idgen_persists, Release); - - if was_recovered { - // advance pc.idgen_persists and the counter page by one - // interval, so that when generate_id() is next called, it - // will advance them further by another interval, and wait for - // this update to be durable before returning the first ID. - let necessary_persists = - (counter / pc.config.idgen_persist_interval + 1) - * pc.config.idgen_persist_interval; - let counter_update = Update::Counter(necessary_persists); - let old = pc.idgen_persists.swap(necessary_persists, Release); - assert_eq!(old, idgen_persists); - // CAS should never fail because the PageCache is still being constructed. - pc.cas_page(COUNTER_PID, idgen_key, counter_update, false, &guard)? - .unwrap(); - } else { - drop(guard); - // persist the meta and idgen pages now, so that we don't hand - // out id 0 again if we crash and recover - pc.flush()?; - } - - pc.was_recovered = was_recovered; - - #[cfg(feature = "event_log")] - { - let guard2 = pin(); - - pc.config - .event_log - .meta_after_restart(pc.get_meta(&guard2).deref().clone()); - } - - trace!("pagecache started"); - - Ok(PageCache(Arc::new(pc))) - } - - /// Try to atomically add a `PageLink` to the page. - /// Returns `Ok(new_key)` if the operation was successful. Returns - /// `Err(None)` if the page no longer exists. Returns - /// `Err(Some(actual_key))` if the atomic link fails. - pub(crate) fn link<'g>( - &self, - pid: PageId, - mut old: PageView<'g>, - new: Link, - guard: &'g Guard, - ) -> Result> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.link_page); - - trace!("linking pid {} node {:?} with {:?}", pid, old.as_node(), new); - - // A failure injector that fails links randomly - // during test to ensure interleaving coverage. - #[cfg(any(test, feature = "lock_free_delays"))] - { - use std::cell::RefCell; - use std::time::{SystemTime, UNIX_EPOCH}; - - thread_local! 
{ - pub static COUNT: RefCell = RefCell::new(1); - } - - let time_now = - SystemTime::now().duration_since(UNIX_EPOCH).unwrap(); - - #[allow(clippy::cast_possible_truncation)] - let fail_seed = std::cmp::max(3, time_now.as_nanos() as u32 % 128); - - let inject_failure = COUNT.with(|c| { - let mut cr = c.borrow_mut(); - *cr += 1; - *cr % fail_seed == 0 - }); - - if inject_failure { - debug!( - "injecting a randomized failure in the link of pid {}", - pid - ); - if let Some(current_pointer) = self.get(pid, guard)? { - return Ok(Err(Some((current_pointer.0, new)))); - } else { - return Ok(Err(None)); - } - } - } - - let node = old.as_node().apply(&new); - - // see if we should short-circuit replace - if old.cache_infos.len() >= PAGE_CONSOLIDATION_THRESHOLD { - log::trace!("skipping link, replacing pid {} with {:?}", pid, node); - let short_circuit = self.replace(pid, old, &node, guard)?; - return Ok(short_circuit.map_err(|a| a.map(|b| (b.0, new)))); - } - - log::trace!( - "applying link of {:?} to pid {:?} resulted in node {:?}", - new, - pid, - node - ); - - let mut new_page = Some(Owned::new(Page { - update: Some(Update::Node(node)), - cache_infos: Vec::default(), - })); - - loop { - // TODO handle replacement on threshold here instead - - let log_reservation = - self.log.reserve(LogKind::Link, pid, &new, guard)?; - let lsn = log_reservation.lsn; - let pointer = log_reservation.pointer; - - // NB the setting of the timestamp is quite - // correctness-critical! We use the ts to - // ensure that fundamentally new data causes - // high-level link and replace operations - // to fail when the data in the pagecache - // actually changes. When we just rewrite - // the page for the purposes of moving it - // to a new location on disk, however, we - // don't want to cause threads that are - // basing the correctness of their new - // writes on the unchanged state to fail. - // Here, we bump it by 1, to signal that - // the underlying state is fundamentally - // changing. - let ts = old.ts() + 1; - - let cache_info = CacheInfo { ts, lsn, pointer }; - - let mut new_cache_infos = - Vec::with_capacity(old.cache_infos.len() + 1); - new_cache_infos.extend_from_slice(&old.cache_infos); - new_cache_infos.push(cache_info); - - let mut page_ptr = new_page.take().unwrap(); - page_ptr.cache_infos = new_cache_infos; - - debug_delay(); - let result = - old.entry.compare_and_set(old.read, page_ptr, SeqCst, guard); - - match result { - Ok(new_shared) => { - trace!("link of pid {} succeeded", pid); - - unsafe { - guard.defer_destroy(old.read); - } - - assert_ne!(old.last_lsn(), 0); - - self.log.iobufs.sa_mark_link(pid, cache_info, guard); - - // NB complete must happen AFTER calls to SA, because - // when the iobuf's n_writers hits 0, we may transition - // the segment to inactive, resulting in a race otherwise. - // FIXME can result in deadlock if a node that holds SA - // is waiting to acquire a new reservation blocked by this? 
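// The randomized failure injector above drives interleaving coverage with a
// thread-local call counter that fires every N-th operation. A standalone
// sketch of that pattern; the fixed period below is arbitrary, unlike the
// time-derived seed used above:
use std::cell::Cell;

thread_local! {
    static CALLS: Cell<u64> = Cell::new(0);
}

fn should_inject_failure(every: u64) -> bool {
    CALLS.with(|c| {
        let n = c.get() + 1;
        c.set(n);
        n % every == 0
    })
}

#[test]
fn fires_periodically() {
    let fired = (0..10).filter(|_| should_inject_failure(5)).count();
    assert_eq!(fired, 2);
}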
- log_reservation.complete()?; - - // possibly evict an item now that our cache has grown - if let Some(rss) = unsafe { new_shared.deref().rss() } { - self.lru_access(pid, rss, guard)?; - } - - old.read = new_shared; - - let link_count = self.links.fetch_add(1, Relaxed); - - if link_count > 0 - && link_count % self.config.snapshot_after_ops == 0 - { - let s2: PageCache = self.clone(); - threadpool::take_fuzzy_snapshot(s2); - } - - return Ok(Ok(old)); - } - Err(cas_error) => { - log_reservation.abort()?; - let actual = cas_error.current; - let actual_ts = unsafe { actual.deref().ts() }; - if actual_ts == old.ts() { - trace!( - "link of pid {} failed due to movement, retrying", - pid - ); - new_page = Some(cas_error.new); - - old.read = actual; - } else { - trace!("link of pid {} failed due to new update", pid); - let mut page_view = old; - page_view.read = actual; - return Ok(Err(Some((page_view, new)))); - } - } - } - } - } - - pub(crate) fn take_fuzzy_snapshot(&self) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.fuzzy_snapshot); - let lock = self.snapshot_lock.try_lock(); - if lock.is_none() { - log::debug!( - "skipping snapshot because the snapshot lock is already claimed" - ); - return Ok(()); - } - let stable_lsn_before: Lsn = self.log.stable_offset(); - - // This is how we determine the number of the pages we will snapshot. - let pid_bound = { - let mu = self.next_pid_to_allocate.lock(); - *mu - }; - - let pid_bound_usize = assert_usize(pid_bound); - - let mut page_states = Vec::::with_capacity(pid_bound_usize); - let mut guard = pin(); - for pid in 0..pid_bound { - if pid % 64 == 0 { - drop(guard); - guard = pin(); - } - 'inner: loop { - let pg_view = self.inner.get(pid, &guard); - - match *pg_view.cache_infos { - [_single_cache_info] => { - let page_state = pg_view.to_page_state(); - page_states.push(page_state); - break 'inner; - } - [_first_of_several, ..] => { - // If a page has multiple disk locations, - // rewrite it to a single one before storing - // a single 8-byte pointer to its cold location. - if let Err(e) = self.rewrite_page(pid, None, &guard) { - log::error!( - "aborting fuzzy snapshot attempt after \ - failing to rewrite pid {}: {:?}", - pid, - e - ); - return Err(e); - } - continue 'inner; - } - [] => { - // there is a benign race with the thread - // that is allocating this page. the allocating - // thread has not yet written the new page to disk, - // and it does not yet have any storage tracking - // information. - std::thread::yield_now(); - } - } - - // break out of this loop if the overall system - // has halted - self.log.iobufs.config.global_error()?; - } - } - drop(guard); - - let max_reserved_lsn_after: Lsn = - self.log.iobufs.max_reserved_lsn.load(Acquire); - - let snapshot = Snapshot { - version: 0, - stable_lsn: Some(stable_lsn_before), - active_segment: None, - pt: page_states, - }; - - self.log.make_stable(max_reserved_lsn_after)?; - - snapshot::write_snapshot(&self.config, &snapshot)?; - - // NB: this must only happen after writing the snapshot to disk - self.snapshot_min_lsn.fetch_max(stable_lsn_before, SeqCst); - - // explicitly drop this to make it clear that it needs to - // be held for the duration of the snapshot operation. - drop(lock); - - Ok(()) - } -} - -impl PageCacheInner { - /// Flushes any pending IO buffers to disk to ensure durability. - /// Returns the number of bytes written during this call. 
- pub(crate) fn flush(&self) -> Result { - self.log.flush() - } - - /// Create a new page, trying to reuse old freed pages if possible - /// to maximize underlying `PageTable` pointer density. Returns - /// the page ID and its pointer for use in future atomic `replace` - /// and `link` operations. - pub(crate) fn allocate<'g>( - &self, - new: Node, - guard: &'g Guard, - ) -> Result<(PageId, PageView<'g>)> { - self.allocate_inner(Update::Node(new), guard) - } - - fn allocate_inner<'g>( - &self, - new: Update, - guard: &'g Guard, - ) -> Result<(PageId, PageView<'g>)> { - let mut allocation_serializer; - - let free_opt = { - let mut free = self.free.lock(); - if let Some(pid) = free.iter().copied().next() { - free.remove(&pid); - Some(pid) - } else { - None - } - }; - - let (pid, page_view) = if let Some(pid) = free_opt { - trace!("re-allocating pid {}", pid); - - let page_view = self.inner.get(pid, guard); - assert!( - page_view.is_free(), - "failed to re-allocate pid {} which \ - contained unexpected state {:?}", - pid, - page_view, - ); - (pid, page_view) - } else { - // we need to hold the allocation mutex because - // we have to maintain the invariant that our - // recoverable allocated pages will be contiguous. - // If we did not hold this mutex, it would be - // possible (especially under high thread counts) - // to persist pages non-monotonically to disk, - // which would break our recovery invariants. - // While we could just remove that invariant, - // because it is overly-strict, it allows us - // to flag corruption and bugs during testing - // much more easily. - allocation_serializer = self.next_pid_to_allocate.lock(); - let pid = *allocation_serializer; - *allocation_serializer += 1; - - trace!("allocating pid {} for the first time", pid); - - let new_page = Page { update: None, cache_infos: Vec::default() }; - - let page_view = self.inner.insert(pid, new_page, guard); - - (pid, page_view) - }; - - let new_pointer = self - .cas_page(pid, page_view, new, false, guard)? - .unwrap_or_else(|e| { - panic!( - "should always be able to install \ - a new page during allocation, but \ - failed for pid {}: {:?}", - pid, e - ) - }); - - Ok((pid, new_pointer)) - } - - /// Attempt to opportunistically rewrite data from a Draining - /// segment of the file to help with space amplification. - /// Returns Ok(true) if we had the opportunity to attempt to - /// move a page. Returns Ok(false) if there were no pages - /// to GC. Returns an Err if we encountered an IO problem - /// while performing this GC. - #[cfg(not(miri))] - pub(crate) fn attempt_gc(&self) -> Result { - let guard = pin(); - let cc = concurrency_control::read(); - let to_clean = self.log.iobufs.segment_cleaner.pop(); - let ret = if let Some((pid_to_clean, segment_to_clean)) = to_clean { - self.rewrite_page(pid_to_clean, Some(segment_to_clean), &guard) - .map(|_| true) - } else { - Ok(false) - }; - drop(cc); - guard.flush(); - ret - } - - /// Initiate an atomic sequence of writes to the - /// underlying log. Returns a `RecoveryGuard` which, - /// when dropped, will record the current max reserved - /// LSN into an earlier log reservation. During recovery, - /// when we hit this early atomic LSN marker, if the - /// specified LSN is beyond the contiguous tip of the log, - /// we immediately halt recovery, preventing the recovery - /// of partial transactions or write batches. 
This is - /// a relatively low-level primitive that can be used - /// to facilitate transactions and write batches when - /// combined with a concurrency control system in another - /// component. - pub(crate) fn pin_log(&self, guard: &Guard) -> Result> { - // HACK: we are rolling the io buffer before AND - // after taking out the reservation pin to avoid - // a deadlock where the batch reservation causes - // writes to fail to flush to disk. in the future, - // this may be addressed in a nicer way by representing - // transactions with a begin and end message, rather - // than a single beginning message that needs to - // be held until we know the final batch LSN. - self.log.roll_iobuf()?; - - let batch_res = self.log.reserve( - LogKind::Skip, - BATCH_MANIFEST_PID, - &BatchManifest::default(), - guard, - )?; - - iobuf::maybe_seal_and_write_iobuf( - &self.log.iobufs, - &batch_res.iobuf, - batch_res.iobuf.get_header(), - false, - )?; - - Ok(RecoveryGuard { batch_res }) - } - - /// Free a particular page. - pub(crate) fn free<'g>( - &self, - pid: PageId, - old: PageView<'g>, - guard: &'g Guard, - ) -> Result> { - trace!("attempting to free pid {}", pid); - - if pid <= COUNTER_PID || pid == BATCH_MANIFEST_PID { - panic!("tried to free pid {}", pid); - } - - let new_pointer = - self.cas_page(pid, old, Update::Free, false, guard)?; - - if new_pointer.is_ok() { - let free_mu = self.free.clone(); - guard.defer(move || { - let mut free = free_mu.lock(); - assert!(free.insert(pid), "pid {} was double-freed", pid); - }); - } - - Ok(new_pointer.map_err(|o| o.map(|(pointer, _)| (pointer, ())))) - } - - /// Node an existing page with a different set of `PageLink`s. - /// Returns `Ok(new_key)` if the operation was successful. Returns - /// `Err(None)` if the page no longer exists. Returns - /// `Err(Some(actual_key))` if the atomic swap fails. - pub(crate) fn replace<'g>( - &self, - pid: PageId, - old: PageView<'g>, - new_unmerged: &Node, - guard: &'g Guard, - ) -> Result> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.replace_page); - // `Node::clone` implicitly consolidates the node overlay - let new = new_unmerged.clone(); - - trace!("replacing pid {} with {:?}", pid, new); - - // A failure injector that fails replace calls randomly - // during test to ensure interleaving coverage. - #[cfg(any(test, feature = "lock_free_delays"))] - { - use std::cell::RefCell; - use std::time::{SystemTime, UNIX_EPOCH}; - - thread_local! { - pub static COUNT: RefCell = RefCell::new(1); - } - - let time_now = - SystemTime::now().duration_since(UNIX_EPOCH).unwrap(); - - #[allow(clippy::cast_possible_truncation)] - let fail_seed = std::cmp::max(3, time_now.as_nanos() as u32 % 128); - - let inject_failure = COUNT.with(|c| { - let mut cr = c.borrow_mut(); - *cr += 1; - *cr % fail_seed == 0 - }); - - if inject_failure { - debug!( - "injecting a randomized failure in the replace of pid {}", - pid - ); - if let Some(current_pointer) = self.get(pid, guard)? 
{ - return Ok(Err(Some((current_pointer.0, new)))); - } else { - return Ok(Err(None)); - } - } - } - - let result = - self.cas_page(pid, old, Update::Node(new), false, guard)?; - - if let Some((pid_to_clean, segment_to_clean)) = - self.log.iobufs.segment_cleaner.pop() - { - self.rewrite_page(pid_to_clean, Some(segment_to_clean), guard)?; - } - - Ok(result.map_err(|fail| { - let (pointer, shared) = fail.unwrap(); - if let Update::Node(rejected_new) = shared { - Some((pointer, rejected_new)) - } else { - unreachable!(); - } - })) - } - - // rewrite a page so we can reuse the segment that it is - // (at least partially) located in. This happens when a - // segment has had enough resident page replacements moved - // away to trigger the `segment_cleanup_threshold`. - fn rewrite_page( - &self, - pid: PageId, - segment_to_purge_opt: Option, - guard: &Guard, - ) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.rewrite_page); - - trace!("rewriting pid {}", pid); - - loop { - let page_view = self.inner.get(pid, guard); - - if let Some(segment_to_purge) = segment_to_purge_opt { - let purge_segment_id = - segment_to_purge / self.config.segment_size as u64; - - let already_moved = !unsafe { page_view.read.deref() } - .cache_infos - .iter() - .any(|ce| { - if let Some(lid) = ce.pointer.lid() { - lid / self.config.segment_size as u64 - == purge_segment_id - } else { - // the item has been relocated off-log to - // a slot in the heap. - true - } - }); - - if already_moved { - return Ok(()); - } - } - - // if the page only has a single heap pointer in the log, rewrite - // the pointer in a new segment without touching the big heap - // slot that the pointer references. - if let Some(lone_cache_info) = page_view.lone_heap_item() { - trace!("rewriting pointer to heap item with pid {}", pid); - - let snapshot_min_lsn = self.snapshot_min_lsn.load(Acquire); - let original_lsn = lone_cache_info.pointer.original_lsn(); - let skip_log = original_lsn < snapshot_min_lsn; - - let heap_id = lone_cache_info.pointer.heap_id().unwrap(); - - let (log_reservation_opt, cache_info) = if skip_log { - trace!( - "allowing heap pointer for pid {} with original lsn of {} \ - to be forgotten from the log, as it is contained in the \ - snapshot which has a minimum lsn of {}", - pid, - original_lsn, - snapshot_min_lsn - ); - - let cache_info = CacheInfo { - pointer: DiskPtr::Heap(None, heap_id), - ..lone_cache_info - }; - - (None, cache_info) - } else { - let log_reservation = - self.log.rewrite_heap_pointer(pid, heap_id, guard)?; - - let cache_info = CacheInfo { - ts: page_view.ts(), - lsn: log_reservation.lsn, - pointer: log_reservation.pointer, - }; - - (Some(log_reservation), cache_info) - }; - - let new_page = Owned::new(Page { - update: page_view.update.clone(), - cache_infos: vec![cache_info], - }); - - debug_delay(); - let result = page_view.entry.compare_and_set( - page_view.read, - new_page, - SeqCst, - guard, - ); - - if let Ok(new_shared) = result { - unsafe { - guard.defer_destroy(page_view.read); - } - - self.log.iobufs.sa_mark_replace( - pid, - &page_view.cache_infos, - cache_info, - guard, - )?; - - // NB complete must happen AFTER calls to SA, because - // when the iobuf's n_writers hits 0, we may transition - // the segment to inactive, resulting in a race otherwise. 
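// The already_moved test earlier in rewrite_page reduces to integer division:
// a log offset still lives in the segment being purged iff dividing it by the
// segment size yields that segment's id. A standalone sketch with an
// arbitrary segment size:
fn segment_id(lid: u64, segment_size: u64) -> u64 {
    lid / segment_size
}

fn still_resident(offsets: &[u64], purge_base: u64, segment_size: u64) -> bool {
    let purge_id = segment_id(purge_base, segment_size);
    offsets.iter().any(|&lid| segment_id(lid, segment_size) == purge_id)
}

#[test]
fn segment_membership() {
    let seg = 512 * 1024;
    assert!(still_resident(&[seg + 10], seg, seg));
    assert!(!still_resident(&[3 * seg + 10], seg, seg));
}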
- if let Some(log_reservation) = log_reservation_opt { - log_reservation.complete()?; - } - - // only call accessed & page_out if called from - // something other than page_out itself - if segment_to_purge_opt.is_some() { - // possibly evict an item now that our cache has grown - if let Some(rss) = unsafe { new_shared.deref().rss() } { - self.lru_access(pid, rss, guard)?; - } - } - - trace!("rewriting pid {} succeeded", pid); - - return Ok(()); - } else { - if let Some(log_reservation) = log_reservation_opt { - log_reservation.abort()?; - } - - trace!("rewriting pid {} failed", pid); - } - } else { - trace!("rewriting page with pid {}", pid); - - // page-in whole page with a get - let (key, update): (_, Update) = if pid == META_PID { - let meta_view = self.get_meta(guard); - (meta_view.0, Update::Meta(meta_view.deref().clone())) - } else if pid == COUNTER_PID { - let (key, counter) = self.get_idgen(guard); - (key, Update::Counter(counter)) - } else if let Some(node_view) = self.get(pid, guard)? { - let mut node = node_view.deref().clone(); - node.increment_rewrite_generations(); - (node_view.0, Update::Node(node)) - } else { - let page_view_retry = self.inner.get(pid, guard); - - if page_view_retry.is_free() { - (page_view, Update::Free) - } else { - debug!( - "when rewriting pid {} \ - we encountered a rewritten \ - node with a link {:?} that \ - we previously witnessed a Free \ - for (PageCache::get returned None), \ - assuming we can just return now since \ - the Free was replace'd", - pid, page_view.update - ); - return Ok(()); - } - }; - - let res = self.cas_page(pid, key, update, true, guard).map( - |res| { - trace!( - "rewriting pid {} success: {}", - pid, - res.is_ok() - ); - res - }, - )?; - if res.is_ok() { - return Ok(()); - } - } - } - } - - fn lru_access(&self, pid: PageId, size: u64, guard: &Guard) -> Result<()> { - let to_evict = - self.lru.accessed(pid, usize::try_from(size).unwrap(), guard); - trace!("accessed pid {} -> paging out pids {:?}", pid, to_evict); - if !to_evict.is_empty() { - self.page_out(to_evict, guard)?; - } - Ok(()) - } - - /// Traverses all files and calculates their total physical - /// size, then traverses all pages and calculates their - /// total logical size, then divides the physical size - /// by the logical size. - #[allow(clippy::cast_precision_loss)] - #[allow(clippy::float_arithmetic)] - #[doc(hidden)] - pub(crate) fn space_amplification(&self) -> Result { - let on_disk_bytes = self.size_on_disk()? as f64; - let logical_size = (self.logical_size_of_all_tree_pages()? 
- + self.config.segment_size as u64) - as f64; - - Ok(on_disk_bytes / logical_size) - } - - pub(crate) fn size_on_disk(&self) -> Result { - let mut size = self.config.file.metadata()?.len(); - - let base_path = self.config.get_path().join("heap"); - let heap_dir = base_path.parent().expect( - "should be able to determine the parent for the heap directory", - ); - let heap_files = std::fs::read_dir(heap_dir)?; - - for slab_file_res in heap_files { - let slab_file = if let Ok(bf) = slab_file_res { - bf - } else { - continue; - }; - - // it's possible the heap item was removed lazily - // in the background and no longer exists - #[cfg(not(miri))] - { - size += slab_file.metadata().map(|m| m.len()).unwrap_or(0); - } - - // workaround to avoid missing `dirfd` shim - #[cfg(miri)] - { - size += std::fs::metadata(slab_file.path()) - .map(|m| m.len()) - .unwrap_or(0); - } - } - - Ok(size) - } - - fn logical_size_of_all_tree_pages(&self) -> Result { - let guard = pin(); - let min_pid = COUNTER_PID + 1; - let next_pid_to_allocate = *self.next_pid_to_allocate.lock(); - - let mut ret = 0; - for pid in min_pid..next_pid_to_allocate { - if let Some(node_cell) = self.get(pid, &guard)? { - ret += node_cell.rss(); - } - } - Ok(ret) - } - - fn cas_page<'g>( - &self, - pid: PageId, - mut old: PageView<'g>, - update: Update, - is_rewrite: bool, - guard: &'g Guard, - ) -> Result> { - trace!( - "cas_page called on pid {} to {:?} with old ts {:?}", - pid, - update, - old.ts() - ); - - let log_kind = log_kind_from_update(&update); - trace!("cas_page on pid {} has log kind: {:?}", pid, log_kind); - - let mut new_page = Some(Owned::new(Page { - update: Some(update), - cache_infos: Vec::default(), - })); - - loop { - let mut page_ptr = new_page.take().unwrap(); - let log_reservation = match page_ptr.update.as_ref().unwrap() { - Update::Counter(ref c) => { - self.log.reserve(log_kind, pid, c, guard)? - } - Update::Meta(ref m) => { - self.log.reserve(log_kind, pid, m, guard)? - } - Update::Free => self.log.reserve(log_kind, pid, &(), guard)?, - Update::Node(ref node) => { - self.log.reserve(log_kind, pid, node, guard)? - } - other => { - panic!("non-replacement used in cas_page: {:?}", other) - } - }; - let lsn = log_reservation.lsn; - let new_pointer = log_reservation.pointer; - - // NB the setting of the timestamp is quite - // correctness-critical! We use the ts to - // ensure that fundamentally new data causes - // high-level link and replace operations - // to fail when the data in the pagecache - // actually changes. When we just rewrite - // the page for the purposes of moving it - // to a new location on disk, however, we - // don't want to cause threads that are - // basing the correctness of their new - // writes on the unchanged state to fail. - // Here, we only bump it up by 1 if the - // update represents a fundamental change - // that SHOULD cause CAS failures. 
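// The timestamp rule described above, in isolation: after losing a CAS, the
// caller compares its expected ts with the winner's ts. An unchanged ts means
// the page was merely rewritten/relocated (safe to retry with the same
// intent); a bumped ts means the contents fundamentally changed, so the
// operation must fail upward. A minimal sketch:
enum CasFollowUp {
    Retry,
    FailUpward,
}

fn after_cas_failure(expected_ts: u64, actual_ts: u64) -> CasFollowUp {
    if actual_ts == expected_ts {
        CasFollowUp::Retry
    } else {
        CasFollowUp::FailUpward
    }
}

#[test]
fn rewrite_keeps_ts_but_real_updates_bump_it() {
    assert!(matches!(after_cas_failure(7, 7), CasFollowUp::Retry));
    assert!(matches!(after_cas_failure(7, 8), CasFollowUp::FailUpward));
}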
- let ts = if is_rewrite { old.ts() } else { old.ts() + 1 }; - - let cache_info = CacheInfo { ts, lsn, pointer: new_pointer }; - - page_ptr.cache_infos = vec![cache_info]; - - debug_delay(); - let result = - old.entry.compare_and_set(old.read, page_ptr, SeqCst, guard); - - match result { - Ok(new_shared) => { - unsafe { - guard.defer_destroy(old.read); - } - - trace!("cas_page succeeded on pid {}", pid); - self.log.iobufs.sa_mark_replace( - pid, - &old.cache_infos, - cache_info, - guard, - )?; - - // NB complete must happen AFTER calls to SA, because - // when the iobuf's n_writers hits 0, we may transition - // the segment to inactive, resulting in a race otherwise. - let _pointer = log_reservation.complete()?; - - // possibly evict an item now that our cache has grown - if let Some(rss) = unsafe { new_shared.deref().rss() } { - self.lru_access(pid, rss, guard)?; - } - - return Ok(Ok(PageView { - read: new_shared, - entry: old.entry, - })); - } - Err(cas_error) => { - trace!("cas_page failed on pid {}", pid); - let _pointer = log_reservation.abort()?; - - let current: Shared<'_, _> = cas_error.current; - let actual_ts = unsafe { current.deref().ts() }; - - let mut returned_update: Owned<_> = cas_error.new; - - if is_rewrite || actual_ts != old.ts() { - return Ok(Err(Some(( - PageView { read: current, entry: old.entry }, - returned_update.update.take().unwrap(), - )))); - } - trace!( - "retrying CAS on pid {} with same ts of {}", - pid, - old.ts() - ); - old.read = current; - new_page = Some(returned_update); - } - } // match cas result - } // loop - } - - /// Retrieve the current meta page - pub(crate) fn get_meta<'g>(&self, guard: &'g Guard) -> MetaView<'g> { - trace!("getting page iter for META"); - - let page_view = self.inner.get(META_PID, guard); - - if page_view.update.is_none() { - panic!( - "{:?}", - Error::ReportableBug( - "failed to retrieve META page \ - which should always be present" - ) - ) - } - - MetaView(page_view) - } - - /// Retrieve the current persisted IDGEN value - pub(in crate::pagecache) fn get_idgen<'g>( - &self, - guard: &'g Guard, - ) -> (PageView<'g>, u64) { - trace!("getting page iter for idgen"); - - let page_view = self.inner.get(COUNTER_PID, guard); - - if page_view.update.is_none() { - panic!( - "{:?}", - Error::ReportableBug( - "failed to retrieve counter page \ - which should always be present" - ) - ) - } - - let counter = page_view.as_counter(); - (page_view, counter) - } - - /// Try to retrieve a page by its logical ID. - pub(crate) fn get<'g>( - &self, - pid: PageId, - guard: &'g Guard, - ) -> Result>> { - trace!("getting page iterator for pid {}", pid); - #[cfg(feature = "metrics")] - let _measure_get_page = Measure::new(&M.get_page); - - if pid <= COUNTER_PID || pid == BATCH_MANIFEST_PID { - panic!( - "tried to do normal pagecache get on priviledged pid {}", - pid - ); - } - - let mut last_attempted_cache_info = None; - let mut last_err = None; - let mut page_view; - - let mut updates: Vec = loop { - // we loop here because if the page we want to - // pull is moved, we want to retry. 
but if we - // get a corruption and then - page_view = self.inner.get(pid, guard); - - if page_view.is_free() { - return Ok(None); - } - - if page_view.update.is_some() { - // possibly evict an item now that our cache has grown - if let Some(rss) = page_view.rss() { - self.lru_access(pid, rss, guard)?; - } - return Ok(Some(NodeView(page_view))); - } - - trace!( - "pulling pid {} view {:?} deref {:?}", - pid, - page_view, - page_view.deref() - ); - if page_view.cache_infos.first() - == last_attempted_cache_info.as_ref() - { - return Err(last_err.unwrap()); - } else { - last_attempted_cache_info = - page_view.cache_infos.first().copied(); - } - - #[cfg(feature = "metrics")] - let _measure_pull = Measure::new(&M.pull); - - // need to page-in - let updates_result: Result> = page_view - .cache_infos - .iter() - .map(|ci| self.pull(pid, ci.lsn, ci.pointer)) - .collect(); - - last_err = if let Ok(updates) = updates_result { - break updates; - } else { - Some(updates_result.unwrap_err()) - }; - }; - - let (base_slice, links) = updates.split_at_mut(1); - - let base: &mut Node = base_slice[0].as_node_mut(); - - for link_update in links { - let link: &Link = link_update.as_link(); - *base = base.apply(link); - } - - updates.truncate(1); - let base_owned = updates.pop().unwrap(); - - let page = Owned::new(Page { - update: Some(base_owned), - cache_infos: page_view.cache_infos.clone(), - }); - - debug_delay(); - let result = page_view.entry.compare_and_set( - page_view.read, - page, - SeqCst, - guard, - ); - - if let Ok(new_shared) = result { - trace!("fix-up for pid {} succeeded", pid); - - unsafe { - guard.defer_destroy(page_view.read); - } - - // possibly evict an item now that our cache has grown - if let Some(rss) = unsafe { new_shared.deref().rss() } { - self.lru_access(pid, rss, guard)?; - } - - let mut page_view2 = page_view; - page_view2.read = new_shared; - - Ok(Some(NodeView(page_view2))) - } else { - trace!("fix-up for pid {} failed", pid); - - self.get(pid, guard) - } - } - - /// Returns `true` if the database was - /// recovered from a previous process. - /// Note that database state is only - /// guaranteed to be present up to the - /// last call to `flush`! Otherwise state - /// is synced to disk periodically if the - /// `sync_every_ms` configuration option - /// is set to `Some(number_of_ms_between_syncs)` - /// or if the IO buffer gets filled to - /// capacity before being rotated. - pub const fn was_recovered(&self) -> bool { - self.was_recovered - } - - /// Generate a monotonic ID. Not guaranteed to be - /// contiguous. Written to disk every `idgen_persist_interval` - /// operations, followed by a blocking flush. During recovery, we - /// take the last recovered generated ID and add 2x - /// the `idgen_persist_interval` to it. While persisting, if the - /// previous persisted counter wasn't synced to disk yet, we will do - /// a blocking flush to fsync the latest counter, ensuring - /// that we will never give out the same counter twice. 
- pub(crate) fn generate_id_inner(&self) -> Result { - let ret = self.idgen.fetch_add(1, Release); - - trace!("generating ID {}", ret); - - let interval = self.config.idgen_persist_interval; - let necessary_persists = ret / interval * interval; - let mut persisted = self.idgen_persists.load(Acquire); - - while persisted < necessary_persists { - let _mu = self.idgen_persist_mu.lock(); - persisted = self.idgen_persists.load(Acquire); - if persisted < necessary_persists { - // it's our responsibility to persist up to our ID - trace!( - "persisting ID gen, as persist count {} \ - is below necessary persists {}", - persisted, - necessary_persists - ); - let guard = pin(); - let (key, current) = self.get_idgen(&guard); - - assert_eq!(current, persisted); - - let counter_update = Update::Counter(necessary_persists); - - let old = self.idgen_persists.swap(necessary_persists, Release); - assert_eq!(old, persisted); - - if self - .cas_page(COUNTER_PID, key, counter_update, false, &guard)? - .is_err() - { - // CAS failed - continue; - } - - // during recovery we add 2x the interval. we only - // need to block if the last one wasn't stable yet. - // we only call make_durable instead of make_stable - // because we took out the initial reservation - // outside of a writebatch (guaranteed by using the reader - // concurrency control) and it's possible we - // could cyclically wait if the reservation for - // a replacement happened inside a writebatch. - iobuf::make_durable(&self.log.iobufs, key.last_lsn())?; - } - } - - Ok(ret) - } - - /// Look up a `PageId` for a given identifier in the `Meta` - /// mapping. This is pretty cheap, but in some cases - /// you may prefer to maintain your own atomic references - /// to collection roots instead of relying on this. See - /// sled's `Tree` root tracking for an example of - /// avoiding this in a lock-free way that handles - /// various race conditions. - pub(crate) fn meta_pid_for_name( - &self, - name: &[u8], - guard: &Guard, - ) -> Result { - let m = self.get_meta(guard); - if let Some(root) = m.get_root(name) { - Ok(root) - } else { - Err(Error::CollectionNotFound) - } - } - - /// Compare-and-swap the `Meta` mapping for a given - /// identifier. 
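- /// Illustrative call shape (a sketch, with hypothetical names): installing
- /// a root for a brand new tree looks roughly like
- /// `pc.cas_root_in_meta(b"my_tree", None, Some(new_root_pid), &guard)?`.
- /// On success this returns `Ok(Ok(()))`; if another thread raced us, it
- /// returns `Ok(Err(actual))` carrying the root that is currently installed.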
- pub(crate) fn cas_root_in_meta<'g>( - &self, - name: &[u8], - old_opt: Option, - new_opt: Option, - guard: &'g Guard, - ) -> Result>> { - loop { - let meta_view = self.get_meta(guard); - - let actual = meta_view.get_root(name); - if actual != old_opt { - return Ok(Err(actual)); - } - - let mut new_meta = meta_view.deref().clone(); - if let Some(new) = new_opt { - new_meta.set_root(name.into(), new); - } else { - new_meta.del_root(name); - } - - let new_meta_link = Update::Meta(new_meta); - - let res = self.cas_page( - META_PID, - meta_view.0, - new_meta_link, - false, - guard, - )?; - - match res { - Ok(_worked) => return Ok(Ok(())), - Err(Some((_current_pointer, _rejected))) => {} - Err(None) => { - return Err(Error::ReportableBug( - "replacing the META page has failed because \ - the pagecache does not think it currently exists.", - )); - } - } - } - } - - fn page_out(&self, to_evict: Vec, guard: &Guard) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.page_out); - for pid in to_evict { - assert_ne!(pid, BATCH_MANIFEST_PID); - - if pid <= COUNTER_PID { - // should not page these suckas out - continue; - } - - 'pid: loop { - let page_view = self.inner.get(pid, guard); - if page_view.is_free() { - // don't page-out Freed suckas - break; - } - - if page_view.cache_infos.len() > 1 { - // compress pages on page-out - self.rewrite_page(pid, None, guard)?; - continue 'pid; - } - - let new_page = Owned::new(Page { - update: None, - cache_infos: page_view.cache_infos.clone(), - }); - - debug_delay(); - if page_view - .entry - .compare_and_set(page_view.read, new_page, SeqCst, guard) - .is_ok() - { - unsafe { - guard.defer_destroy(page_view.read); - } - - break; - } - // keep looping until we page this sucka out - } - } - Ok(()) - } - - fn pull(&self, pid: PageId, lsn: Lsn, pointer: DiskPtr) -> Result { - use MessageKind::*; - - trace!("pulling pid {} lsn {} pointer {} from disk", pid, lsn, pointer); - - let expected_segment_number: SegmentNumber = SegmentNumber( - u64::try_from(lsn).unwrap() - / u64::try_from(self.config.segment_size).unwrap(), - ); - - iobuf::make_durable(&self.log.iobufs, lsn)?; - - let (header, bytes) = match self.log.read(pid, lsn, pointer) { - Ok(LogRead::Inline(header, buf, _len)) => { - assert_eq!( - header.pid, pid, - "expected pid {} on pull of pointer {}, \ - but got {} instead", - pid, pointer, header.pid - ); - assert_eq!( - header.segment_number, expected_segment_number, - "expected segment number {:?} on pull of pointer {}, \ - but got segment number {:?} instead", - expected_segment_number, pointer, header.segment_number - ); - Ok((header, buf)) - } - Ok(LogRead::Heap(header, buf, _heap_id, _inline_len)) => { - assert_eq!( - header.pid, pid, - "expected pid {} on pull of pointer {}, \ - but got {} instead", - pid, pointer, header.pid - ); - assert_eq!( - header.segment_number, expected_segment_number, - "expected segment number {:?} on pull of pointer {}, \ - but got segment number {:?} instead", - expected_segment_number, pointer, header.segment_number - ); - - Ok((header, buf)) - } - Ok(other) => { - debug!("read unexpected page: {:?}", other); - Err(Error::corruption(Some(pointer))) - } - Err(e) => { - debug!("failed to read page: {:?}", e); - Err(e) - } - }?; - - // We create this &mut &[u8] to assist the `Serializer` - // implementation that incrementally consumes bytes - // without taking ownership of them. 
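- // Concretely (illustrative): the `deserialize` call selected below
- // reads from the front of the shared slice and advances it past the
- // bytes it consumed, so the borrowed `bytes` never need to be copied.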
- let buf = &mut bytes.as_slice(); - - let update_res = { - #[cfg(feature = "metrics")] - let _deserialize_latency = Measure::new(&M.deserialize); - - match header.kind { - Counter => u64::deserialize(buf).map(Update::Counter), - HeapMeta | InlineMeta => { - Meta::deserialize(buf).map(Update::Meta) - } - HeapLink | InlineLink => { - Link::deserialize(buf).map(Update::Link) - } - HeapNode | InlineNode => { - Node::deserialize(buf).map(Update::Node) - } - Free => Ok(Update::Free), - Corrupted | Canceled | Cap | BatchManifest => { - panic!("unexpected pull: {:?}", header.kind) - } - } - }; - - let update = update_res.expect("failed to deserialize data"); - - // TODO this feels racy, test it better? - if let Update::Free = update { - error!("non-link/replace found in pull of pid {}", pid); - Err(Error::ReportableBug( - "non-link/replace found in pull of page fragments", - )) - } else { - Ok(update) - } - } - - fn load_snapshot(&mut self, snapshot: &Snapshot) -> Result<()> { - let next_pid_to_allocate = snapshot.pt.len() as PageId; - - self.next_pid_to_allocate = Mutex::new(next_pid_to_allocate); - - debug!("load_snapshot loading pages from 0..{}", next_pid_to_allocate); - for pid in 0..next_pid_to_allocate { - let state = if let Some(state) = - snapshot.pt.get(usize::try_from(pid).unwrap()) - { - state - } else { - panic!( - "load_snapshot pid {} not found, despite being below the max pid {}", - pid, next_pid_to_allocate - ); - }; - - trace!("load_snapshot pid {} {:?}", pid, state); - - let mut cache_infos = Vec::default(); - - let guard = pin(); - - match *state { - PageState::Present { base, ref frags } => { - cache_infos.push(CacheInfo { - lsn: base.0, - pointer: base.1, - ts: 0, - }); - for (lsn, pointer) in frags { - let cache_info = - CacheInfo { lsn: *lsn, pointer: *pointer, ts: 0 }; - - cache_infos.push(cache_info); - } - } - PageState::Free(lsn, pointer) => { - // blow away any existing state - trace!("load_snapshot freeing pid {}", pid); - let cache_info = CacheInfo { lsn, pointer, ts: 0 }; - cache_infos.push(cache_info); - assert!(self.free.lock().insert(pid)); - } - _ => panic!("tried to load a {:?}", state), - } - - // Set up new page - trace!("installing page for pid {}", pid); - - let update = if pid == META_PID || pid == COUNTER_PID { - let update = - self.pull(pid, cache_infos[0].lsn, cache_infos[0].pointer)?; - Some(update) - } else if state.is_free() { - Some(Update::Free) - } else { - None - }; - let page = Page { update, cache_infos }; - - self.inner.insert(pid, page, &guard); - } - - Ok(()) - } -} diff --git a/src/pagecache/pagetable.rs b/src/pagecache/pagetable.rs deleted file mode 100644 index 0954d5865..000000000 --- a/src/pagecache/pagetable.rs +++ /dev/null @@ -1,250 +0,0 @@ -//! A simple wait-free, grow-only pagetable, assumes a dense keyspace. 
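-//!
-//! Rough shape (see `split_fanout` below): a `PageId` is split into a high
-//! part that indexes the single root `Node1` and a low `NODE2_FAN_FACTOR`-bit
-//! part that indexes a lazily allocated `Node2`, i.e. approximately
-//! `(id >> NODE2_FAN_FACTOR, id & FAN_MASK)`.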
-#![allow(unsafe_code)] - -use std::{ - alloc::{alloc_zeroed, Layout}, - convert::TryFrom, - mem::{align_of, size_of}, - sync::atomic::Ordering::{Acquire, Relaxed, Release}, -}; - -use crate::{ - debug_delay, - ebr::{pin, Atomic, Guard, Owned, Shared}, - pagecache::{constants::MAX_PID_BITS, Page, PageView}, -}; - -#[cfg(feature = "metrics")] -use crate::{Measure, M}; - -#[allow(unused)] -#[doc(hidden)] -pub const PAGETABLE_NODE_SZ: usize = size_of::(); - -const NODE2_FAN_FACTOR: usize = 18; -const NODE1_FAN_OUT: usize = 1 << (MAX_PID_BITS - NODE2_FAN_FACTOR); -const NODE2_FAN_OUT: usize = 1 << NODE2_FAN_FACTOR; -const FAN_MASK: u64 = (NODE2_FAN_OUT - 1) as u64; - -pub type PageId = u64; - -struct Node1 { - children: [Atomic; NODE1_FAN_OUT], -} - -struct Node2 { - children: [Atomic; NODE2_FAN_OUT], -} - -impl Node1 { - fn new() -> Owned { - let size = size_of::(); - let align = align_of::(); - - unsafe { - let layout = Layout::from_size_align_unchecked(size, align); - - #[allow(clippy::cast_ptr_alignment)] - let ptr = alloc_zeroed(layout) as *mut Self; - - Owned::from_raw(ptr) - } - } -} - -impl Node2 { - fn new() -> Owned { - let size = size_of::(); - let align = align_of::(); - - unsafe { - let layout = Layout::from_size_align_unchecked(size, align); - - #[allow(clippy::cast_ptr_alignment)] - let ptr = alloc_zeroed(layout) as *mut Self; - - Owned::from_raw(ptr) - } - } -} - -impl Drop for Node1 { - fn drop(&mut self) { - drop_iter(self.children.iter()); - } -} - -impl Drop for Node2 { - fn drop(&mut self) { - drop_iter(self.children.iter()); - } -} - -fn drop_iter(iter: core::slice::Iter<'_, Atomic>) { - let guard = pin(); - for child in iter { - let shared_child = child.load(Relaxed, &guard); - if shared_child.is_null() { - // this does not leak because the PageTable is - // assumed to be dense. - break; - } - unsafe { - drop(shared_child.into_owned()); - } - } -} - -/// A simple lock-free radix tree. -pub struct PageTable { - head: Atomic, -} - -impl Default for PageTable { - fn default() -> Self { - let head = Node1::new(); - Self { head: Atomic::from(head) } - } -} - -impl PageTable { - /// # Panics - /// - /// will panic if the item is not null already, - /// which represents a serious failure to - /// properly handle lifecycles of pages in the - /// using system. - pub(crate) fn insert<'g>( - &self, - pid: PageId, - item: Page, - guard: &'g Guard, - ) -> PageView<'g> { - debug_delay(); - let tip = self.traverse(pid, guard); - - let shared = Owned::new(item).into_shared(guard); - let old = tip.swap(shared, Release, guard); - assert!(old.is_null()); - - PageView { read: shared, entry: tip } - } - - /// Try to get a value from the tree. - /// - /// # Panics - /// - /// Panics if the page has never been allocated. 
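- /// Illustrative call shape (sketch only): `let view = table.get(pid, &guard);`
- /// yields a `PageView` whose `read` field is the current `Page` and whose
- /// `entry` field is the slot that a later compare-and-set can use to
- /// install a replacement.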
- pub(crate) fn get<'g>( - &self, - pid: PageId, - guard: &'g Guard, - ) -> PageView<'g> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.get_pagetable); - debug_delay(); - let tip = self.traverse(pid, guard); - - debug_delay(); - let res = tip.load(Acquire, guard); - - assert!(!res.is_null(), "tried to get pid {}", pid); - - PageView { read: res, entry: tip } - } - - pub(crate) fn contains_pid(&self, pid: PageId, guard: &Guard) -> bool { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.get_pagetable); - debug_delay(); - let tip = self.traverse(pid, guard); - - debug_delay(); - let res = tip.load(Acquire, guard); - - !res.is_null() - } - - fn traverse<'g>(&self, k: PageId, guard: &'g Guard) -> &'g Atomic { - let (l1k, l2k) = split_fanout(k); - - debug_delay(); - let head = self.head.load(Acquire, guard); - - debug_delay(); - let l1 = unsafe { &head.deref().children }; - - debug_delay(); - let mut l2_ptr = l1[l1k].load(Acquire, guard); - - if l2_ptr.is_null() { - let next_child = Node2::new(); - - debug_delay(); - let ret = l1[l1k].compare_and_set( - Shared::null(), - next_child, - Release, - guard, - ); - - l2_ptr = match ret { - Ok(l2) => l2, - Err(returned) => { - drop(returned.new); - returned.current - } - }; - } - - debug_delay(); - let l2 = unsafe { &l2_ptr.deref().children }; - - &l2[l2k] - } -} - -#[inline] -fn split_fanout(id: PageId) -> (usize, usize) { - // right shift 32 on 32-bit pointer systems panics - #[cfg(target_pointer_width = "64")] - assert!( - id <= 1 << MAX_PID_BITS, - "trying to access key of {}, which is \ - higher than 2 ^ {}", - id, - MAX_PID_BITS, - ); - - let left = id >> NODE2_FAN_FACTOR; - let right = id & FAN_MASK; - - (safe_usize(left), safe_usize(right)) -} - -#[inline] -fn safe_usize(value: PageId) -> usize { - usize::try_from(value).unwrap() -} - -impl Drop for PageTable { - fn drop(&mut self) { - let guard = pin(); - let head = self.head.load(Relaxed, &guard); - unsafe { - drop(head.into_owned()); - } - } -} - -#[test] -fn fanout_functionality() { - assert_eq!( - split_fanout(0b11_1111_1111_1111_1111), - (0, 0b11_1111_1111_1111_1111) - ); - assert_eq!( - split_fanout(0b111_1111_1111_1111_1111), - (0b1, 0b11_1111_1111_1111_1111) - ); -} diff --git a/src/pagecache/parallel_io_polyfill.rs b/src/pagecache/parallel_io_polyfill.rs deleted file mode 100644 index 80601f7f7..000000000 --- a/src/pagecache/parallel_io_polyfill.rs +++ /dev/null @@ -1,103 +0,0 @@ -use std::fs::File; -use std::io::{self, Read, Seek, Write}; - -use parking_lot::Mutex; - -use super::{LogOffset, Result}; - -fn init_mu() -> Mutex<()> { - Mutex::new(()) -} - -type MutexInit = fn() -> Mutex<()>; - -static GLOBAL_FILE_LOCK: crate::Lazy, MutexInit> = - crate::Lazy::new(init_mu); - -pub(crate) fn pread_exact_or_eof( - file: &File, - mut buf: &mut [u8], - offset: LogOffset, -) -> Result { - let _lock = GLOBAL_FILE_LOCK.lock(); - - let mut f = file.try_clone()?; - - let _ = f.seek(io::SeekFrom::Start(offset))?; - - let mut total = 0; - while !buf.is_empty() { - match f.read(buf) { - Ok(0) => break, - Ok(n) => { - total += n; - let tmp = buf; - buf = &mut tmp[n..]; - } - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - Ok(total) -} - -pub(crate) fn pread_exact( - file: &File, - mut buf: &mut [u8], - offset: LogOffset, -) -> Result<()> { - let _lock = GLOBAL_FILE_LOCK.lock(); - - let mut f = file.try_clone()?; - - let _ = f.seek(io::SeekFrom::Start(offset))?; - - while !buf.is_empty() { - match f.read(buf) { - 
Ok(0) => break, - Ok(n) => { - let tmp = buf; - buf = &mut tmp[n..]; - } - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - if !buf.is_empty() { - Err(io::Error::new( - io::ErrorKind::UnexpectedEof, - "failed to fill whole buffer", - ) - .into()) - } else { - Ok(()) - } -} - -pub(crate) fn pwrite_all( - file: &File, - mut buf: &[u8], - offset: LogOffset, -) -> Result<()> { - let _lock = GLOBAL_FILE_LOCK.lock(); - - let mut f = file.try_clone()?; - - let _ = f.seek(io::SeekFrom::Start(offset))?; - - while !buf.is_empty() { - match f.write(buf) { - Ok(0) => { - return Err(io::Error::new( - io::ErrorKind::WriteZero, - "failed to write whole buffer", - ) - .into()); - } - Ok(n) => buf = &buf[n..], - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - Ok(()) -} diff --git a/src/pagecache/parallel_io_unix.rs b/src/pagecache/parallel_io_unix.rs deleted file mode 100644 index 8bdf78337..000000000 --- a/src/pagecache/parallel_io_unix.rs +++ /dev/null @@ -1,58 +0,0 @@ -use std::convert::TryFrom; -use std::fs::File; -use std::io; -use std::os::unix::fs::FileExt; - -use super::{LogOffset, Result}; - -pub(crate) fn pread_exact_or_eof( - file: &File, - mut buf: &mut [u8], - offset: LogOffset, -) -> Result { - let mut total = 0_usize; - while !buf.is_empty() { - match file.read_at(buf, offset + u64::try_from(total).unwrap()) { - Ok(0) => break, - Ok(n) => { - total += n; - let tmp = buf; - buf = &mut tmp[n..]; - } - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - Ok(total) -} - -pub(crate) fn pread_exact( - file: &File, - buf: &mut [u8], - offset: LogOffset, -) -> Result<()> { - file.read_exact_at(buf, offset).map_err(From::from) -} - -pub(crate) fn pwrite_all( - file: &File, - buf: &[u8], - offset: LogOffset, -) -> Result<()> { - #[cfg(feature = "failpoints")] - { - crate::debug_delay(); - if crate::fail::is_active("pwrite") { - return Err(crate::Error::FailPoint); - } - - if crate::fail::is_active("pwrite partial") && !buf.is_empty() { - // TODO perturb the length more - let len = buf.len() / 2; - - file.write_all_at(&buf[..len], offset)?; - return Err(crate::Error::FailPoint); - } - }; - file.write_all_at(buf, offset).map_err(From::from) -} diff --git a/src/pagecache/parallel_io_windows.rs b/src/pagecache/parallel_io_windows.rs deleted file mode 100644 index b072c8bd9..000000000 --- a/src/pagecache/parallel_io_windows.rs +++ /dev/null @@ -1,98 +0,0 @@ -use std::convert::TryFrom; -use std::fs::File; -use std::io; -use std::os::windows::fs::FileExt; - -use super::{LogOffset, Result}; - -fn seek_read_exact( - file: &mut F, - mut buf: &mut [u8], - mut offset: u64, -) -> Result<()> { - while !buf.is_empty() { - match file.seek_read(buf, offset) { - Ok(0) => break, - Ok(n) => { - let tmp = buf; - buf = &mut tmp[n..]; - offset += n as u64; - } - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - if !buf.is_empty() { - Err(io::Error::new( - io::ErrorKind::UnexpectedEof, - "failed to fill whole buffer", - ) - .into()) - } else { - Ok(()) - } -} - -fn seek_write_all( - file: &mut F, - mut buf: &[u8], - mut offset: u64, -) -> Result<()> { - while !buf.is_empty() { - match file.seek_write(buf, offset) { - Ok(0) => { - return Err(io::Error::new( - io::ErrorKind::WriteZero, - "failed to write whole buffer", - ) - .into()); - } - Ok(n) => { - buf = &buf[n..]; - offset += n as u64; - } - Err(ref e) if 
e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - Ok(()) -} - -pub(crate) fn pread_exact_or_eof( - file: &File, - mut buf: &mut [u8], - offset: LogOffset, -) -> Result { - let mut total = 0_usize; - while !buf.is_empty() { - match file.seek_read(buf, offset + u64::try_from(total).unwrap()) { - Ok(0) => break, - Ok(n) => { - total += n; - let tmp = buf; - buf = &mut tmp[n..]; - } - Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {} - Err(e) => return Err(e.into()), - } - } - Ok(total) -} - -pub(crate) fn pread_exact( - file: &File, - buf: &mut [u8], - offset: LogOffset, -) -> Result<()> { - let mut f = file.try_clone()?; - seek_read_exact(&mut f, buf, offset).map_err(From::from) -} - -pub(crate) fn pwrite_all( - file: &File, - buf: &[u8], - offset: LogOffset, -) -> Result<()> { - let mut f = file.try_clone()?; - seek_write_all(&mut f, buf, offset).map_err(From::from) -} diff --git a/src/pagecache/reservation.rs b/src/pagecache/reservation.rs deleted file mode 100644 index 1b6280b2f..000000000 --- a/src/pagecache/reservation.rs +++ /dev/null @@ -1,145 +0,0 @@ -use crate::{pagecache::*, *}; - -/// A pending log reservation which can be aborted or completed. -/// NB the holder should quickly call `complete` or `abort` as -/// taking too long to decide will cause the underlying IO -/// buffer to become blocked. -#[derive(Debug)] -pub struct Reservation<'a> { - pub(super) log: &'a Log, - pub(super) iobuf: Arc, - pub(super) buf: &'a mut [u8], - pub(super) flushed: bool, - pub pointer: DiskPtr, - pub lsn: Lsn, - pub(super) is_heap_item_rewrite: bool, - pub(super) header_len: usize, -} - -impl<'a> Drop for Reservation<'a> { - fn drop(&mut self) { - // We auto-abort if the user never uses a reservation. - if !self.flushed { - if let Err(e) = self.flush(false) { - self.log.iobufs.set_global_error(e); - } - } - } -} - -impl<'a> Reservation<'a> { - /// Cancel the reservation, placing a failed flush on disk, returning - /// the (cancelled) log sequence number and file offset. - pub fn abort(mut self) -> Result<(Lsn, DiskPtr)> { - if self.pointer.is_heap_item() && !self.is_heap_item_rewrite { - // we can instantly free this heap item because its pointer - // is assumed to have failed to have been installed into - // the pagetable, so we can assume nobody is operating - // on it. - - trace!( - "removing heap item for aborted reservation at lsn {}", - self.pointer - ); - - self.log.config.heap.free(self.pointer.heap_id().unwrap()); - } - - self.flush(false) - } - - /// Complete the reservation, placing the buffer on disk. returns - /// the log sequence number of the write, and the file offset. - pub fn complete(mut self) -> Result<(Lsn, DiskPtr)> { - self.flush(true) - } - - /// Refills the reservation buffer with new data. - /// Must supply a buffer of an identical length - /// as the one initially provided. - /// - /// # Panics - /// - /// Will panic if the reservation is not the correct - /// size to hold a serialized Lsn. - #[doc(hidden)] - pub fn mark_writebatch(self, peg_lsn: Lsn) -> Result<(Lsn, DiskPtr)> { - trace!( - "writing batch required stable lsn {} into \ - BatchManifest at lid {:?} peg_lsn {}", - peg_lsn, - self.pointer.lid(), - self.lsn - ); - - if self.lsn == peg_lsn { - // this can happen because high-level tree updates - // may result in no work happening. 
- self.abort() - } else { - self.buf[4] = MessageKind::BatchManifest.into(); - - let buf = lsn_to_arr(peg_lsn); - - let dst = &mut self.buf[self.header_len..]; - - dst.copy_from_slice(&buf); - - let mut intervals = self.log.iobufs.intervals.lock(); - intervals.mark_batch((self.lsn, peg_lsn)); - drop(intervals); - - self.complete() - } - } - - fn flush(&mut self, valid: bool) -> Result<(Lsn, DiskPtr)> { - if self.flushed { - panic!("flushing already-flushed reservation!"); - } - - self.flushed = true; - - if !valid { - self.buf[4] = MessageKind::Canceled.into(); - - // zero the message contents to prevent UB - #[allow(unsafe_code)] - unsafe { - std::ptr::write_bytes( - self.buf[self.header_len..].as_mut_ptr(), - 0, - self.buf.len() - self.header_len, - ) - } - } - - // zero the crc bytes to prevent UB - #[allow(unsafe_code)] - unsafe { - std::ptr::write_bytes( - self.buf[..].as_mut_ptr(), - 0, - std::mem::size_of::(), - ) - } - - let crc32 = calculate_message_crc32( - self.buf[..self.header_len].as_ref(), - &self.buf[self.header_len..], - ); - let crc32_arr = u32_to_arr(crc32); - - #[allow(unsafe_code)] - unsafe { - std::ptr::copy_nonoverlapping( - crc32_arr.as_ptr(), - self.buf.as_mut_ptr(), - std::mem::size_of::(), - ); - } - self.log.exit_reservation(&self.iobuf)?; - - Ok((self.lsn, self.pointer)) - } -} diff --git a/src/pagecache/segment.rs b/src/pagecache/segment.rs deleted file mode 100644 index fbf301ae4..000000000 --- a/src/pagecache/segment.rs +++ /dev/null @@ -1,1169 +0,0 @@ -//! The `SegmentAccountant` is an allocator for equally- -//! sized chunks of the underlying storage file (segments). -//! -//! It must maintain these critical safety properties: -//! -//! A. We must not overwrite existing segments when they -//! contain the most-recent stable state for a page. -//! B. We must not overwrite existing segments when active -//! threads may have references to `LogOffset`'s that point -//! into those segments. -//! -//! To complicate matters, the `PageCache` only knows -//! when it has put a page into an IO buffer, but it -//! doesn't keep track of when that IO buffer is -//! stabilized (until write coalescing is implemented). -//! -//! To address these safety concerns, we rely on -//! these techniques: -//! -//! 1. We delay the reuse of any existing segment -//! by ensuring that we never deactivate a -//! segment until all data written into it, as -//! well as all data written to earlier segments, -//! has been written to disk and fsynced. -//! 2. we use a `epoch::Guard::defer()` from -//! `IoBufs::write_to_log` that guarantees -//! that we defer all segment deactivation -//! until all threads are finished that -//! may have witnessed pointers into a segment -//! that will be marked for reuse in the future. -//! -//! Another concern that arises due to the fact that -//! IO buffers may be written out-of-order is the -//! correct recovery of segments. If there is data -//! loss in recently written segments, we must be -//! careful to preserve linearizability in the log. -//! To do this, we must detect "torn segments" that -//! were not able to be fully written before a crash -//! happened. -//! -//! But what if we wrote a later segment before we -//! were able to write its immediate predecessor segment, -//! and then a crash happened? We must preserve -//! linearizability, so we must not recover the later -//! segment when its predecessor was lost in the crash. -//! -//! 3. This case is solved again by having a concept of -//! 
an "unstable tail" of segments that, during recovery, -//! must appear consecutively among the recovered -//! segments with the highest LSN numbers. We -//! prevent reuse of segments while they remain in -//! this "unstable tail" by only allowing them to be -//! reallocated after another later segment has written -//! a "stable consecutive lsn" into its own header -//! that is higher than ours. - -#![allow(unused_results)] - -use std::{collections::BTreeSet, mem}; - -use super::PageState; - -use crate::pagecache::*; -use crate::*; - -/// A operation that can be applied asynchronously. -#[derive(Debug)] -pub(crate) enum SegmentOp { - Link { - pid: PageId, - cache_info: CacheInfo, - }, - Replace { - pid: PageId, - old_cache_infos: Vec, - new_cache_info: CacheInfo, - }, -} - -/// The segment accountant keeps track of the logical blocks -/// of storage. It scans through all segments quickly during -/// recovery and attempts to locate torn segments. -#[derive(Debug)] -pub(crate) struct SegmentAccountant { - // static or one-time set - config: RunningConfig, - - // TODO these should be sharded to improve performance - segments: Vec, - - // TODO put behind a single mutex - free: BTreeSet, - tip: LogOffset, - max_stabilized_lsn: Lsn, - segment_cleaner: SegmentCleaner, - ordering: BTreeMap, - async_truncations: BTreeMap>>, -} - -#[derive(Debug, Clone, Default)] -pub(crate) struct SegmentCleaner { - inner: Arc>>>, -} - -impl SegmentCleaner { - pub(crate) fn pop(&self) -> Option<(PageId, LogOffset)> { - let mut inner = self.inner.lock(); - let offset = { - let (offset, pids) = inner.iter_mut().next()?; - if !pids.is_empty() { - let pid = pids.iter().next().copied().unwrap(); - pids.remove(&pid); - return Some((pid, *offset)); - } - *offset - }; - inner.remove(&offset); - None - } - - fn add_pids(&self, offset: LogOffset, pids: BTreeSet) { - let mut inner = self.inner.lock(); - let prev = inner.insert(offset, pids); - assert!(prev.is_none()); - } - - fn remove_pids(&self, offset: LogOffset) { - let mut inner = self.inner.lock(); - inner.remove(&offset); - } -} - -#[cfg(feature = "metrics")] -impl Drop for SegmentAccountant { - fn drop(&mut self) { - for segment in &self.segments { - let segment_utilization = match segment { - Segment::Free(_) | Segment::Draining(_) => 0, - Segment::Active(Active { pids, .. }) - | Segment::Inactive(Inactive { pids, .. }) => pids.len(), - }; - M.segment_utilization_shutdown.measure(segment_utilization as u64); - } - } -} - -#[derive(Debug, Default)] -struct Free { - previous_lsn: Option, -} - -#[derive(Debug, Default)] -struct Active { - lsn: Lsn, - deferred_replaced_pids: BTreeSet, - pids: BTreeSet, - latest_replacement_lsn: Lsn, - can_free_upon_deactivation: FastSet8, - deferred_heap_removals: FastSet8, -} - -#[derive(Debug, Clone, Default)] -struct Inactive { - lsn: Lsn, - pids: BTreeSet, - max_pids: usize, - replaced_pids: usize, - latest_replacement_lsn: Lsn, -} - -#[derive(Debug, Clone, Copy, Default)] -struct Draining { - lsn: Lsn, - max_pids: usize, - replaced_pids: usize, - latest_replacement_lsn: Lsn, -} - -/// A `Segment` holds the bookkeeping information for -/// a contiguous block of the disk. It may contain many -/// fragments from different pages. Over time, we track -/// when segments become reusable and allow them to be -/// overwritten for new data. 
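- /// The lifecycle suggested by the transition methods below is roughly
- /// `Free -> Active -> Inactive -> Draining -> Free`.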
-#[derive(Debug)] -enum Segment { - /// the segment is marked for reuse, should never receive - /// new pids, - Free(Free), - - /// the segment is being written to or actively recovered, and - /// will have pages assigned to it - Active(Active), - - /// the segment is no longer being written to or recovered, and - /// will have pages marked as relocated from it - Inactive(Inactive), - - /// the segment is having its resident pages relocated before - /// becoming free - Draining(Draining), -} - -impl Default for Segment { - fn default() -> Self { - Segment::Free(Free { previous_lsn: None }) - } -} - -impl Segment { - const fn is_free(&self) -> bool { - matches!(self, Segment::Free(_)) - } - - const fn is_active(&self) -> bool { - matches!(self, Segment::Active { .. }) - } - - const fn is_inactive(&self) -> bool { - matches!(self, Segment::Inactive { .. }) - } - - fn free_to_active(&mut self, new_lsn: Lsn) { - trace!("setting Segment to Active with new lsn {:?}", new_lsn,); - assert!(self.is_free()); - - *self = Segment::Active(Active { - lsn: new_lsn, - deferred_replaced_pids: BTreeSet::default(), - pids: BTreeSet::default(), - latest_replacement_lsn: 0, - can_free_upon_deactivation: FastSet8::default(), - deferred_heap_removals: FastSet8::default(), - }) - } - - /// Transitions a segment to being in the `Inactive` state. - /// Returns the set of page replacements that happened - /// while this Segment was Active - fn active_to_inactive( - &mut self, - lsn: Lsn, - config: &RunningConfig, - ) -> FastSet8 { - trace!("setting Segment with lsn {:?} to Inactive", self.lsn()); - - let (inactive, ret) = if let Segment::Active(active) = self { - assert!(lsn >= active.lsn); - - // now we can push any deferred heap removals to the removed set - for heap_id in &active.deferred_heap_removals { - trace!( - "removing heap_id {:?} while transitioning \ - segment lsn {:?} to Inactive", - heap_id, - active.lsn, - ); - config.heap.free(*heap_id); - } - - let max_pids = active.pids.len(); - - let mut pids = std::mem::take(&mut active.pids); - - for deferred_replaced_pid in &active.deferred_replaced_pids { - assert!(pids.remove(deferred_replaced_pid)); - } - - let inactive = Segment::Inactive(Inactive { - lsn: active.lsn, - max_pids, - replaced_pids: active.deferred_replaced_pids.len(), - pids, - latest_replacement_lsn: active.latest_replacement_lsn, - }); - - let can_free = mem::take(&mut active.can_free_upon_deactivation); - - (inactive, can_free) - } else { - panic!("called active_to_inactive on {:?}", self); - }; - - *self = inactive; - ret - } - - fn inactive_to_draining(&mut self, lsn: Lsn) -> BTreeSet { - trace!("setting Segment with lsn {:?} to Draining", self.lsn()); - - if let Segment::Inactive(inactive) = self { - assert!(lsn >= inactive.lsn); - let ret = mem::take(&mut inactive.pids); - *self = Segment::Draining(Draining { - lsn: inactive.lsn, - max_pids: inactive.max_pids, - replaced_pids: inactive.replaced_pids, - latest_replacement_lsn: inactive.latest_replacement_lsn, - }); - ret - } else { - panic!("called inactive_to_draining on {:?}", self); - } - } - - fn defer_free_lsn(&mut self, lsn: Lsn) { - if let Segment::Active(active) = self { - active.can_free_upon_deactivation.insert(lsn); - } else { - panic!("called defer_free_lsn on segment {:?}", self); - } - } - - fn draining_to_free(&mut self, lsn: Lsn) -> Lsn { - trace!("setting Segment with lsn {:?} to Free", self.lsn()); - - if let Segment::Draining(draining) = self { - let old_lsn = draining.lsn; - assert!(lsn >= old_lsn); - let 
replacement_lsn = draining.latest_replacement_lsn; - *self = Segment::Free(Free { previous_lsn: Some(old_lsn) }); - replacement_lsn - } else { - panic!("called draining_to_free on {:?}", self); - } - } - - fn recovery_ensure_initialized(&mut self, lsn: Lsn) { - if self.is_free() { - trace!("(snapshot) recovering segment with base lsn {}", lsn); - self.free_to_active(lsn); - } - } - - const fn lsn(&self) -> Lsn { - match self { - Segment::Active(Active { lsn, .. }) - | Segment::Inactive(Inactive { lsn, .. }) - | Segment::Draining(Draining { lsn, .. }) => *lsn, - Segment::Free(_) => panic!("called lsn on Segment::Free"), - } - } - - /// Add a pid to the Segment. The caller must provide - /// the Segment's LSN. - fn insert_pid(&mut self, pid: PageId, lsn: Lsn) { - trace!( - "inserting pid {} to segment lsn {:?} from segment {:?}", - pid, - self.lsn(), - self - ); - // if this breaks, maybe we didn't implement the transition - // logic right in write_to_log, and maybe a thread is - // using the SA to add pids AFTER their calls to - // res.complete() worked. - if let Segment::Active(active) = self { - assert_eq!( - lsn, active.lsn, - "insert_pid specified lsn {} for pid {} in segment {:?}", - lsn, pid, active - ); - active.pids.insert(pid); - } else { - panic!("called insert_pid on {:?}", self); - } - } - - fn remove_pid(&mut self, pid: PageId, replacement_lsn: Lsn) { - trace!( - "removing pid {} from segment lsn {:?} from segment {:?}", - pid, - self.lsn(), - self - ); - match self { - Segment::Active(active) => { - assert!(active.lsn <= replacement_lsn); - if replacement_lsn != active.lsn { - active.deferred_replaced_pids.insert(pid); - } - if replacement_lsn > active.latest_replacement_lsn { - active.latest_replacement_lsn = replacement_lsn; - } - } - Segment::Inactive(Inactive { - pids, - lsn, - latest_replacement_lsn, - replaced_pids, - .. - }) => { - assert!(*lsn <= replacement_lsn); - if replacement_lsn != *lsn { - pids.remove(&pid); - *replaced_pids += 1; - } - if replacement_lsn > *latest_replacement_lsn { - *latest_replacement_lsn = replacement_lsn; - } - } - Segment::Draining(Draining { - lsn, - latest_replacement_lsn, - replaced_pids, - .. - }) => { - assert!(*lsn <= replacement_lsn); - if replacement_lsn != *lsn { - *replaced_pids += 1; - } - if replacement_lsn > *latest_replacement_lsn { - *latest_replacement_lsn = replacement_lsn; - } - } - Segment::Free(_) => { - panic!("called remove pid {} on Segment::Free", pid) - } - } - } - - fn remove_heap_item(&mut self, heap_id: HeapId, config: &RunningConfig) { - match self { - Segment::Active(active) => { - // we have received a removal before - // transferring this segment to Inactive, so - // we defer this pid's removal until the transfer. - active.deferred_heap_removals.insert(heap_id); - } - Segment::Inactive(_) | Segment::Draining(_) => { - trace!( - "directly removing heap_id {:?} that was referred-to \ - in a segment that has already been marked as Inactive \ - or Draining.", - heap_id, - ); - config.heap.free(heap_id); - } - Segment::Free(_) => { - panic!("remove_heap_item called on a Free Segment") - } - } - } - - const fn can_free(&self) -> bool { - if let Segment::Draining(draining) = self { - draining.replaced_pids == draining.max_pids - } else { - false - } - } -} - -impl SegmentAccountant { - /// Create a new `SegmentAccountant` from previously recovered segments. 
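- /// Roughly: rebuild per-segment bookkeeping from the snapshot's page table,
- /// then eagerly stabilize up to the recovered stable lsn so the first
- /// flushing thread does not have to deactivate a large backlog of segments.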
- pub(super) fn start( - config: RunningConfig, - snapshot: &Snapshot, - segment_cleaner: SegmentCleaner, - ) -> Result { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.start_segment_accountant); - let mut ret = Self { - config, - segments: vec![], - free: BTreeSet::default(), - tip: 0, - max_stabilized_lsn: -1, - segment_cleaner, - ordering: BTreeMap::default(), - async_truncations: BTreeMap::default(), - }; - - ret.initialize_from_snapshot(snapshot)?; - - if let Some(max_free) = ret.free.iter().max() { - assert!( - ret.tip > *max_free, - "expected recovered tip {} to \ - be above max item in recovered \ - free list {:?}", - ret.tip, - ret.free - ); - } - - debug!( - "SA starting with tip {} stable {} free {:?}", - ret.tip, ret.max_stabilized_lsn, ret.free, - ); - - #[cfg(feature = "metrics")] - for segment in &ret.segments { - let segment_utilization = match segment { - Segment::Free(_) | Segment::Draining(_) => 0, - Segment::Active(Active { pids, .. }) - | Segment::Inactive(Inactive { pids, .. }) => pids.len(), - }; - #[cfg(feature = "metrics")] - M.segment_utilization_startup.measure(segment_utilization as u64); - } - - if let Some(stable_lsn) = snapshot.stable_lsn { - // stabilize things now so that we don't force - // the first stabilizing thread to need - // to cope with a huge amount of segments. - ret.stabilize( - stable_lsn - Lsn::try_from(ret.config.segment_size).unwrap(), - true, - )?; - } - - Ok(ret) - } - - fn initial_segments(&self, snapshot: &Snapshot) -> Result> { - let segment_size = self.config.segment_size; - let file_len = self.config.file.metadata()?.len(); - let number_of_segments = - usize::try_from(file_len / segment_size as u64).unwrap() - + if file_len % segment_size as u64 == 0 { 0 } else { 1 }; - - // generate segments from snapshot lids - let mut segments = vec![]; - segments.resize_with(number_of_segments, Segment::default); - - // sometimes the current segment is still empty, after only - // recovering the segment header but no valid messages yet - if let Some(tip_lid) = snapshot.active_segment { - let tip_idx = - usize::try_from(tip_lid / segment_size as LogOffset).unwrap(); - if tip_idx == number_of_segments { - segments.push(Segment::default()); - } - if segments.len() <= tip_idx { - error!( - "failed to properly initialize segments, suspected disk corruption" - ); - return Err(Error::corruption(None)); - } - trace!( - "setting segment for tip_lid {} to stable_lsn {}", - tip_lid, - self.config.normalize(snapshot.stable_lsn.unwrap_or(0)) - ); - segments[tip_idx].recovery_ensure_initialized( - self.config.normalize(snapshot.stable_lsn.unwrap_or(0)), - ); - } - - let mut add = |pid, lsn: Lsn, lid_opt: Option| { - let lid = if let Some(lid) = lid_opt { - lid - } else { - trace!( - "skipping segment GC for pid {} with a heap \ - ptr already in the snapshot", - pid - ); - return; - }; - let idx = assert_usize(lid / segment_size as LogOffset); - trace!( - "adding lsn: {} lid: {} for pid {} to segment {} \ - during SA recovery", - lsn, - lid, - pid, - idx - ); - let segment_lsn = self.config.normalize(lsn); - segments[idx].recovery_ensure_initialized(segment_lsn); - segments[idx].insert_pid(pid, segment_lsn); - }; - - for (pid, state) in snapshot.pt.iter().enumerate() { - match state { - PageState::Present { base, frags } => { - add(pid as PageId, base.0, base.1.lid()); - for (lsn, ptr) in frags { - add(pid as PageId, *lsn, ptr.lid()); - } - } - PageState::Free(lsn, ptr) => { - add(pid as PageId, *lsn, ptr.lid()); - } - _ => panic!("tried to 
recover pagestate from a {:?}", state), - } - } - - Ok(segments) - } - - fn initialize_from_snapshot(&mut self, snapshot: &Snapshot) -> Result<()> { - let segment_size = self.config.segment_size; - let segments = self.initial_segments(snapshot)?; - - self.segments = segments; - - let mut to_free = vec![]; - let mut maybe_clean = vec![]; - - let currently_active_segment = snapshot - .active_segment - .map(|tl| usize::try_from(tl / segment_size as LogOffset).unwrap()); - - for (idx, segment) in self.segments.iter_mut().enumerate() { - let segment_base = idx as LogOffset * segment_size as LogOffset; - - if segment_base >= self.tip { - // set tip above the beginning of any - self.tip = segment_base + segment_size as LogOffset; - trace!( - "raised self.tip to {} during SA initialization", - self.tip - ); - } - - if segment.is_free() { - // this segment was not used in the recovered - // snapshot, so we can assume it is free - to_free.push(segment_base); - continue; - } - - let segment_lsn = segment.lsn(); - - if let Some(tip_idx) = currently_active_segment { - if tip_idx != idx { - maybe_clean.push((idx, segment_lsn)); - } - } - } - - for segment_base in to_free { - self.free_segment(segment_base)?; - io_fail!(self.config, "zero garbage segment SA"); - pwrite_all( - &self.config.file, - &*vec![MessageKind::Corrupted.into(); self.config.segment_size], - segment_base, - )?; - } - - // we want to complete all truncations because - // they could cause calls to `next` to block. - for (_, promise) in self.async_truncations.split_off(&0) { - promise.wait().expect("threadpool should not crash")?; - } - - for (idx, segment_lsn) in maybe_clean { - self.possibly_clean_or_free_segment(idx, segment_lsn)?; - } - - trace!("initialized self.segments to {:?}", self.segments); - - self.ordering = self - .segments - .iter() - .enumerate() - .filter_map(|(id, s)| { - if s.is_free() { - None - } else { - Some((s.lsn(), id as LogOffset * segment_size as LogOffset)) - } - }) - .collect(); - - trace!("initialized self.ordering to {:?}", self.ordering); - - Ok(()) - } - - fn free_segment(&mut self, lid: LogOffset) -> Result<()> { - debug!("freeing segment {}", lid); - trace!("free list before free {:?}", self.free); - self.segment_cleaner.remove_pids(lid); - - let idx = self.segment_id(lid); - assert!( - self.tip > lid, - "freed a segment at {} above our current file tip {}, \ - please report this bug!", - lid, - self.tip, - ); - assert!(self.segments[idx].is_free()); - assert!(!self.free.contains(&lid), "double-free of a segment occurred"); - - self.free.insert(lid); - - // remove the old ordering from our list - if let Segment::Free(Free { previous_lsn: Some(last_lsn) }) = - self.segments[idx] - { - trace!( - "removing segment {} with lsn {} from ordering", - lid, - last_lsn - ); - self.ordering.remove(&last_lsn); - } - - // we want to avoid aggressive truncation because it can cause - // blocking if we allocate a segment that was just truncated. - let laziness_factor = 1; - - // truncate if possible - while self.tip != 0 && self.free.len() > laziness_factor { - let last_segment = self.tip - self.config.segment_size as LogOffset; - if self.free.contains(&last_segment) { - self.free.remove(&last_segment); - self.truncate(last_segment)?; - } else { - break; - } - } - - Ok(()) - } - - /// Asynchronously apply a GC-related operation. Used in a flat-combining - /// style that allows callers to avoid blocking while sending these - /// messages to this module. 
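- /// Illustrative call (sketch, assuming a caller-built op):
- /// `sa.apply_op(&SegmentOp::Link { pid, cache_info })?` forwards to
- /// `mark_link`, and a `SegmentOp::Replace { .. }` op forwards to
- /// `mark_replace`.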
- pub(super) fn apply_op(&mut self, op: &SegmentOp) -> Result<()> { - use SegmentOp::*; - match op { - Link { pid, cache_info } => self.mark_link(*pid, *cache_info), - Replace { pid, old_cache_infos, new_cache_info } => { - self.mark_replace(*pid, old_cache_infos, *new_cache_info)? - } - } - Ok(()) - } - - /// Called by the `PageCache` when a page has been rewritten completely. - /// We mark all of the old segments that contained the previous state - /// from the page, and if the old segments are empty or clear enough to - /// begin accelerated cleaning we mark them as so. - pub(super) fn mark_replace( - &mut self, - pid: PageId, - old_cache_infos: &[CacheInfo], - new_cache_info: CacheInfo, - ) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.accountant_mark_replace); - - if !new_cache_info.pointer.heap_pointer_merged_into_snapshot() { - self.mark_link(pid, new_cache_info); - } - - let lsn = self.config.normalize(new_cache_info.lsn); - - trace!( - "mark_replace pid {} from cache infos {:?} to cache info {:?} with lsn {}", - pid, - old_cache_infos, - new_cache_info, - lsn - ); - - let new_idx_opt = - new_cache_info.pointer.lid().map(|lid| self.segment_id(lid)); - - // Do we need to schedule any heap cleanups? - // Not if we just moved the pointer without changing - // the underlying heap, as is the case with a single heap - // item with nothing else. - let schedule_rm_heap_item = !(old_cache_infos.len() == 1 - && old_cache_infos[0].pointer.is_heap_item() - && new_cache_info.pointer.is_heap_item() - && old_cache_infos[0].pointer.heap_id() - == new_cache_info.pointer.heap_id()); - - // we use this as a 0-allocation state machine to accumulate - // how much data has been freed from each segment - let mut replaced_segment = None; - - for old_cache_info in old_cache_infos { - let old_ptr = &old_cache_info.pointer; - let old_lid = if let Some(old_lid) = old_ptr.lid() { - old_lid - } else { - // the frag had been migrated to the heap store fully - continue; - }; - - if schedule_rm_heap_item && old_ptr.is_heap_item() { - trace!( - "queueing heap item removal for {} in our own segment", - old_ptr - ); - if let Some(new_idx) = new_idx_opt { - self.segments[new_idx].remove_heap_item( - old_ptr.heap_id().unwrap(), - &self.config, - ); - } else { - // this was migrated off-log and is present and stabilized - // in the snapshot. - self.config.heap.free(old_ptr.heap_id().unwrap()); - } - } - - let old_idx = self.segment_id(old_lid); - - match replaced_segment { - Some(last_idx) if last_idx == old_idx => { - // skip this because we've already removed it - // from the segment - } - _ => { - self.segments[old_idx].remove_pid(pid, lsn); - self.possibly_clean_or_free_segment(old_idx, lsn)?; - replaced_segment = Some(old_idx); - } - } - } - - Ok(()) - } - - /// Called from `PageCache` when some state has been added - /// to a logical page at a particular offset. We ensure the - /// page is present in the segment's page set. 
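- /// The provided `cache_info.lsn` is rounded down to its segment-aligned
- /// lsn, and an assertion verifies that the target segment has not been
- /// reused for a different lsn in the meantime.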
- pub(super) fn mark_link(&mut self, pid: PageId, cache_info: CacheInfo) { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.accountant_mark_link); - - trace!("mark_link pid {} at cache info {:?}", pid, cache_info); - let lid = if let Some(lid) = cache_info.pointer.lid() { - lid - } else { - // item has been migrated off-log to the heap store - return; - }; - - let idx = self.segment_id(lid); - - let segment = &mut self.segments[idx]; - - let segment_lsn = cache_info.lsn / self.config.segment_size as Lsn - * self.config.segment_size as Lsn; - - // a race happened, and our Lsn does not apply anymore - assert_eq!( - segment.lsn(), - segment_lsn, - "segment somehow got reused by the time a link was \ - marked on it. expected lsn: {} actual: {}", - segment_lsn, - segment.lsn() - ); - - segment.insert_pid(pid, segment_lsn); - } - - fn possibly_clean_or_free_segment( - &mut self, - idx: usize, - lsn: Lsn, - ) -> Result<()> { - let segment_start = (idx * self.config.segment_size) as LogOffset; - - if let Segment::Inactive(inactive) = &mut self.segments[idx] { - let live_pct = (inactive.max_pids - inactive.replaced_pids) * 50 - / (inactive.max_pids + 1); - - let can_drain = live_pct <= SEGMENT_CLEANUP_THRESHOLD; - - if can_drain { - // can be cleaned - trace!( - "SA inserting {} into to_clean from possibly_clean_or_free_segment", - segment_start - ); - let to_clean = self.segments[idx].inactive_to_draining(lsn); - self.segment_cleaner.add_pids(segment_start, to_clean); - } - } - - let segment_lsn = self.segments[idx].lsn(); - - if self.segments[idx].can_free() { - // can be reused immediately - let replacement_lsn = self.segments[idx].draining_to_free(lsn); - - if self.ordering.contains_key(&replacement_lsn) { - let replacement_lid = self.ordering[&replacement_lsn]; - let replacement_idx = usize::try_from( - replacement_lid / self.config.segment_size as u64, - ) - .unwrap(); - - if self.segments[replacement_idx].is_active() { - trace!( - "deferring free of segment {} in possibly_clean_or_free_segment", - segment_start - ); - self.segments[replacement_idx].defer_free_lsn(segment_lsn); - } else { - assert!(replacement_lsn <= self.max_stabilized_lsn); - self.free_segment(segment_start)?; - } - } else { - // replacement segment has already been freed, so we can - // go right to freeing this one too - self.free_segment(segment_start)?; - } - } - - Ok(()) - } - - pub(super) fn stabilize( - &mut self, - stable_lsn: Lsn, - in_startup: bool, - ) -> Result<()> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.accountant_stabilize); - - let segment_size = self.config.segment_size as Lsn; - let lsn = ((stable_lsn / segment_size) - 1) * segment_size; - trace!( - "stabilize({}), normalized: {}, last: {}", - stable_lsn, - lsn, - self.max_stabilized_lsn - ); - if self.max_stabilized_lsn >= lsn { - trace!( - "expected stabilization lsn {} \ - to be greater than the previous value of {}", - lsn, - self.max_stabilized_lsn - ); - return Ok(()); - } - - let bounds = ( - std::ops::Bound::Excluded(self.max_stabilized_lsn), - std::ops::Bound::Included(lsn), - ); - - let can_deactivate = self - .ordering - .range(bounds) - .map(|(segment_lsn, _lid)| *segment_lsn) - .collect::>(); - - // auto-tune collection in cases - // where we experience a blow-up, - // similar to the collection - // logic in the ebr module. 
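- // For example (illustrative): outside of startup, 100 deactivatable
- // segments are capped at `max(32, 100 / 16) = 32` per call, while
- // 1_024 of them are capped at `max(32, 1_024 / 16) = 64`.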
- let bound = if in_startup { - usize::MAX - } else { - 32.max(can_deactivate.len() / 16) - }; - - for segment_lsn in can_deactivate.into_iter().take(bound) { - self.deactivate_segment(segment_lsn)?; - assert!(self.max_stabilized_lsn < segment_lsn); - self.max_stabilized_lsn = segment_lsn; - } - - // if we have a lot of free segments in our whole file, - // let's start relocating the current tip to boil it down - let free_segs = self.segments.iter().filter(|s| s.is_free()).count(); - let inactive_segs = - self.segments.iter().filter(|s| s.is_inactive()).count(); - let free_ratio = (free_segs * 100) / (1 + free_segs + inactive_segs); - - if free_ratio >= SEGMENT_CLEANUP_THRESHOLD && inactive_segs > 5 { - if let Some(last_index) = - self.segments.iter().rposition(Segment::is_inactive) - { - let segment_start = - (last_index * self.config.segment_size) as LogOffset; - - let to_clean = - self.segments[last_index].inactive_to_draining(lsn); - self.segment_cleaner.add_pids(segment_start, to_clean); - } - } - - Ok(()) - } - - /// Called after the trailer of a segment has been written to disk, - /// indicating that no more pids will be added to a segment. Moves - /// the segment into the Inactive state. - /// - /// # Panics - /// The provided lsn and lid must exactly match the existing segment. - fn deactivate_segment(&mut self, lsn: Lsn) -> Result<()> { - let lid = self.ordering[&lsn]; - let idx = self.segment_id(lid); - - trace!( - "deactivating segment with lid {} lsn {}: {:?}", - lid, - lsn, - self.segments[idx] - ); - - let freeable_segments = if self.segments[idx].is_active() { - self.segments[idx].active_to_inactive(lsn, &self.config) - } else { - Default::default() - }; - - for segment_lsn in freeable_segments { - let segment_start = self.ordering[&segment_lsn]; - assert_ne!(segment_start, lid); - self.free_segment(segment_start)?; - } - - self.possibly_clean_or_free_segment(idx, lsn)?; - - Ok(()) - } - - fn bump_tip(&mut self) -> Result { - let lid = self.tip; - - let truncations = self.async_truncations.split_off(&lid); - - for (_at, truncation) in truncations { - match truncation.wait().unwrap() { - Ok(()) => {} - Err(error) => { - error!("failed to shrink file: {:?}", error); - return Err(error); - } - } - } - - self.tip += self.config.segment_size as LogOffset; - - trace!("advancing file tip from {} to {}", lid, self.tip); - - Ok(lid) - } - - /// Returns the next offset to write a new segment in, as well - /// as whether the corresponding segment must be persisted using - /// fsync due to having been allocated from the file's tip, rather - /// than `sync_file_range` as is normal. - pub(super) fn next(&mut self, lsn: Lsn) -> Result<(LogOffset, bool)> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.accountant_next); - - assert_eq!( - lsn % self.config.segment_size as Lsn, - 0, - "unaligned Lsn provided to next!" 
- ); - - trace!("evaluating free list {:?} in SA::next", &self.free); - - // pop free or add to end - let safe = self.free.iter().next().copied(); - - let (lid, from_tip) = if let Some(next) = safe { - self.free.remove(&next); - (next, false) - } else { - (self.bump_tip()?, true) - }; - - // pin lsn to this segment - let idx = self.segment_id(lid); - - self.segments[idx].free_to_active(lsn); - - self.ordering.insert(lsn, lid); - - debug!( - "segment accountant returning offset: {} for lsn {} \ - on deck: {:?}", - lid, lsn, self.free, - ); - - assert!( - lsn >= Lsn::try_from(lid).unwrap(), - "lsn {} should always be greater than or equal to lid {}", - lsn, - lid - ); - - Ok((lid, from_tip)) - } - - /// Returns an iterator over a snapshot of current segment - /// log sequence numbers and their corresponding file offsets. - pub(super) fn segment_snapshot_iter_from( - &mut self, - lsn: Lsn, - ) -> BTreeMap { - assert!( - !self.ordering.is_empty(), - "expected ordering to have been initialized already" - ); - - let normalized_lsn = self.config.normalize(lsn); - - trace!( - "generated iterator over {:?} where lsn >= {}", - self.ordering, - normalized_lsn - ); - - self.ordering - .iter() - .filter_map(move |(l, r)| { - if *l >= normalized_lsn { - Some((*l, *r)) - } else { - None - } - }) - .collect() - } - - // truncate the file to the desired length - fn truncate(&mut self, at: LogOffset) -> Result<()> { - trace!("asynchronously truncating file to length {}", at); - - assert_eq!( - at % self.config.segment_size as LogOffset, - 0, - "new length must be io-buf-len aligned" - ); - - self.tip = at; - - assert!(!self.free.contains(&at), "double-free of a segment occurred"); - - let config = self.config.clone(); - - io_fail!(&config, "file truncation"); - let promise = threadpool::truncate(config, at); - - if self.async_truncations.insert(at, promise).is_some() { - panic!( - "somehow segment {} was truncated before \ - the previous truncation completed", - at - ); - } - - Ok(()) - } - - fn segment_id(&mut self, lid: LogOffset) -> usize { - let idx = assert_usize(lid / self.config.segment_size as LogOffset); - - // TODO never resize like this, make it a single - // responsibility when the tip is bumped / truncated. - if self.segments.len() < idx + 1 { - self.segments.resize_with(idx + 1, Segment::default); - } - - idx - } -} diff --git a/src/pagecache/snapshot.rs b/src/pagecache/snapshot.rs deleted file mode 100644 index 919caffbe..000000000 --- a/src/pagecache/snapshot.rs +++ /dev/null @@ -1,600 +0,0 @@ -use crate::*; - -use super::{ - arr_to_u32, pwrite_all, raw_segment_iter_from, u32_to_arr, u64_to_arr, - BasedBuf, DiskPtr, HeapId, LogIter, LogKind, LogOffset, Lsn, MessageKind, -}; - -/// A snapshot of the state required to quickly restart -/// the `PageCache` and `SegmentAccountant`. -#[derive(PartialEq, Debug, Default)] -#[cfg_attr(test, derive(Clone))] -pub struct Snapshot { - /// The version of the snapshot format - pub version: u8, - /// The last read message lsn - pub stable_lsn: Option, - /// The last read message lid - pub active_segment: Option, - /// the mapping from pages to (lsn, lid) - pub pt: Vec, -} - -#[derive(Clone, Debug, PartialEq, Eq)] -pub enum PageState { - /// Present signifies a page that has some data. - /// - /// It has two parts. The base and the fragments. - /// `base` is separated to guarantee that it will - /// always have at least one because it is - /// correct by construction. - /// The third element in each tuple is the on-log - /// size for the corresponding write. 
If things - /// are pretty large, they spill into the heaps - /// directory, but still get a small pointer that - /// gets written into the log. The sizes are used - /// for the garbage collection statistics on - /// segments. The lsn and the DiskPtr can be used - /// for actually reading the item off the disk, - /// and the size tells us how much storage it uses - /// on the disk. - Present { - base: (Lsn, DiskPtr), - frags: Vec<(Lsn, DiskPtr)>, - }, - - /// This is a free page. - Free(Lsn, DiskPtr), - Uninitialized, -} - -impl PageState { - fn push(&mut self, item: (Lsn, DiskPtr)) { - match *self { - PageState::Present { base, ref mut frags } => { - if frags.last().map_or(base.0, |f| f.0) < item.0 { - frags.push(item) - } else { - debug!( - "skipping merging item {:?} into \ - existing PageState::Present({:?})", - item, frags - ); - } - } - _ => panic!("pushed frags to {:?}", self), - } - } - - pub(crate) const fn is_free(&self) -> bool { - matches!(self, PageState::Free(_, _)) - } - - #[cfg(feature = "testing")] - fn offsets(&self) -> Vec> { - match *self { - PageState::Present { base, ref frags } => { - let mut offsets = vec![base.1.lid()]; - for (_, ptr) in frags { - offsets.push(ptr.lid()); - } - offsets - } - PageState::Free(_, ptr) => vec![ptr.lid()], - PageState::Uninitialized => { - panic!("called offsets on Uninitialized") - } - } - } - - pub(crate) fn heap_ids(&self) -> Vec { - let mut ret = vec![]; - - match *self { - PageState::Present { base, ref frags } => { - if let Some(heap_id) = base.1.heap_id() { - ret.push(heap_id); - } - for (_, ptr) in frags { - if let Some(heap_id) = ptr.heap_id() { - ret.push(heap_id); - } - } - } - PageState::Free(_, ptr) => { - if let Some(heap_id) = ptr.heap_id() { - ret.push(heap_id); - } - } - PageState::Uninitialized => { - panic!("called heap_ids on Uninitialized") - } - } - - ret - } -} - -impl Snapshot { - pub fn recovered_coords( - &self, - segment_size: usize, - ) -> (Option, Option) { - if self.stable_lsn.is_none() { - return (None, None); - } - - let stable_lsn = self.stable_lsn.unwrap(); - - if let Some(base_offset) = self.active_segment { - let progress = stable_lsn % segment_size as Lsn; - let offset = base_offset + LogOffset::try_from(progress).unwrap(); - - (Some(offset), Some(stable_lsn)) - } else { - let lsn_idx = stable_lsn / segment_size as Lsn - + if stable_lsn % segment_size as Lsn == 0 { 0 } else { 1 }; - let next_lsn = lsn_idx * segment_size as Lsn; - (None, Some(next_lsn)) - } - } - - fn apply( - &mut self, - log_kind: LogKind, - pid: PageId, - lsn: Lsn, - disk_ptr: DiskPtr, - ) -> Result<()> { - trace!( - "trying to deserialize buf for pid {} ptr {} lsn {}", - pid, - disk_ptr, - lsn - ); - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.snapshot_apply); - - let pushed = if self.pt.len() <= usize::try_from(pid).unwrap() { - self.pt.resize( - usize::try_from(pid + 1).unwrap(), - PageState::Uninitialized, - ); - true - } else { - false - }; - - match log_kind { - LogKind::Replace => { - trace!( - "compact of pid {} at ptr {} lsn {}", - pid, - disk_ptr, - lsn, - ); - - let pid_usize = usize::try_from(pid).unwrap(); - - self.pt[pid_usize] = - PageState::Present { base: (lsn, disk_ptr), frags: vec![] }; - } - LogKind::Link => { - // Because we rewrite pages over time, we may have relocated - // a page's initial Compact to a later segment. We should skip - // over pages here unless we've encountered a Compact for them. - if let Some(lids @ PageState::Present { .. 
}) = - self.pt.get_mut(usize::try_from(pid).unwrap()) - { - trace!( - "append of pid {} at lid {} lsn {}", - pid, - disk_ptr, - lsn, - ); - - lids.push((lsn, disk_ptr)); - } else { - trace!( - "skipping dangling append of pid {} at lid {} lsn {}", - pid, - disk_ptr, - lsn, - ); - if pushed { - let old = self.pt.pop().unwrap(); - if old != PageState::Uninitialized { - error!( - "expected previous page state to be uninitialized" - ); - return Err(Error::corruption(None)); - } - } - } - } - LogKind::Free => { - trace!("free of pid {} at ptr {} lsn {}", pid, disk_ptr, lsn); - self.pt[usize::try_from(pid).unwrap()] = - PageState::Free(lsn, disk_ptr); - } - LogKind::Corrupted | LogKind::Skip => { - error!( - "unexpected messagekind in snapshot application for pid {}: {:?}", - pid, log_kind - ); - return Err(Error::corruption(None)); - } - } - - Ok(()) - } - - fn filter_inner_heap_ids(&mut self) { - for page in &mut self.pt { - match page { - PageState::Free(_lsn, ref mut ptr) => { - ptr.forget_heap_log_coordinates() - } - PageState::Present { ref mut base, ref mut frags } => { - base.1.forget_heap_log_coordinates(); - for (_, ref mut ptr) in frags { - ptr.forget_heap_log_coordinates(); - } - } - PageState::Uninitialized => { - unreachable!() - } - } - } - } -} - -fn advance_snapshot( - mut iter: LogIter, - mut snapshot: Snapshot, - config: &RunningConfig, -) -> Result { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.advance_snapshot); - - trace!("building on top of old snapshot: {:?}", snapshot); - - let old_stable_lsn = snapshot.stable_lsn; - - for (log_kind, pid, lsn, ptr) in &mut iter { - trace!( - "in advance_snapshot looking at item with pid {} lsn {} ptr {}", - pid, - lsn, - ptr - ); - - if lsn < snapshot.stable_lsn.unwrap_or(-1) { - // don't process already-processed Lsn's. stable_lsn is for the last - // item ALREADY INCLUDED lsn in the snapshot. - trace!( - "continuing in advance_snapshot, lsn {} ptr {} stable_lsn {:?}", - lsn, - ptr, - snapshot.stable_lsn - ); - continue; - } - - snapshot.apply(log_kind, pid, lsn, ptr)?; - } - - // `snapshot.tip_lid` can be set based on 4 possibilities for the tip of the - // log: - // 1. an empty DB - tip set to None, causing a fresh segment to be - // allocated on initialization - // 2. the recovered tip is at the end of a - // segment with less space left than would fit MAX_MSG_HEADER_LEN - - // tip set to None, causing a fresh segment to be allocated on - // initialization, as in #1 above - // 3. the recovered tip is in the middle of a segment - both set to the end - // of the last valid message, causing the system to be initialized to - // that point without allocating a new segment - // 4. the recovered tip is at the beginning of a new segment, but without - // any valid messages in it yet. treat as #3 above, but also take care - // in te SA initialization to properly initialize any segment tracking - // state despite not having any pages currently residing there. 
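        // Illustrative sketch (not from the original sources) of the four
        // cases enumerated in the comment above. The real logic below works
        // on Lsn/LogOffset types and also consults iter.segment_base; this
        // hypothetical helper only shows the branch structure, using plain
        // u64 values and a stand-in header size.
        fn classify_recovered_tip(
            segment_size: u64,
            max_msg_header_len: u64,
            iterated_lsn: Option<u64>,
        ) -> &'static str {
            match iterated_lsn {
                // case 1: nothing recovered, allocate a fresh segment
                None => "fresh segment",
                // case 2: not enough room left in the current segment for
                // another message header, so roll over to a fresh segment
                Some(lsn)
                    if lsn % segment_size + max_msg_header_len
                        >= segment_size =>
                {
                    "roll over to a fresh segment"
                }
                // cases 3 and 4: resume right after the last valid message
                Some(_) => "resume mid-segment",
            }
        }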
- - let no_recovery_progress = iter.cur_lsn.is_none() - || iter.cur_lsn.unwrap() <= snapshot.stable_lsn.unwrap_or(0); - let db_is_empty = no_recovery_progress && snapshot.stable_lsn.is_none(); - - #[cfg(feature = "testing")] - let mut shred_point = None; - - if db_is_empty { - trace!("db is empty, returning default snapshot"); - if snapshot != Snapshot::default() { - error!("expected snapshot to be Snapshot::default"); - return Err(Error::corruption(None)); - } - } else if iter.cur_lsn.is_none() { - trace!( - "no recovery progress happened since the last snapshot \ - was generated, returning the previous one" - ); - } else { - let iterated_lsn = iter.cur_lsn.unwrap(); - - let segment_progress: Lsn = iterated_lsn % (config.segment_size as Lsn); - - // progress should never be below the SEG_HEADER_LEN if the segment_base - // is set. progress can only be 0 if we've maxed out the - // previous segment, unsetting the iterator segment_base in the - // process. - let monotonic = segment_progress >= SEG_HEADER_LEN as Lsn - || (segment_progress == 0 && iter.segment_base.is_none()); - if !monotonic { - error!( - "expected segment progress {} to be above SEG_HEADER_LEN or == 0, cur_lsn: {}", - segment_progress, iterated_lsn, - ); - return Err(Error::corruption(None)); - } - - let (stable_lsn, active_segment) = if segment_progress - + MAX_MSG_HEADER_LEN as Lsn - >= config.segment_size as Lsn - { - let bumped = - config.normalize(iterated_lsn) + config.segment_size as Lsn; - trace!("bumping snapshot.stable_lsn to {}", bumped); - (bumped, None) - } else { - if let Some(BasedBuf { offset, .. }) = iter.segment_base { - // either situation 3 or situation 4. we need to zero the - // tail of the segment after the recovered tip - let shred_len = config.segment_size - - usize::try_from(segment_progress).unwrap() - - 1; - let shred_zone = vec![MessageKind::Corrupted.into(); shred_len]; - let shred_base = - offset + LogOffset::try_from(segment_progress).unwrap(); - - #[cfg(feature = "testing")] - { - shred_point = Some(shred_base); - } - - debug!( - "zeroing the end of the recovered segment at lsn {} between lids {} and {}", - config.normalize(iterated_lsn), - shred_base, - shred_base + shred_len as LogOffset - ); - pwrite_all(&config.file, &shred_zone, shred_base)?; - config.file.sync_all()?; - } - (iterated_lsn, iter.segment_base.map(|bb| bb.offset)) - }; - - if stable_lsn < snapshot.stable_lsn.unwrap_or(0) { - error!( - "unexpected corruption encountered in storage snapshot file. 
\ - stable lsn {} should be >= snapshot.stable_lsn {}", - stable_lsn, - snapshot.stable_lsn.unwrap_or(0), - ); - return Err(Error::corruption(None)); - } - - snapshot.stable_lsn = Some(stable_lsn); - snapshot.active_segment = active_segment; - snapshot.filter_inner_heap_ids(); - }; - - trace!("generated snapshot: {:?}", snapshot); - - if snapshot.stable_lsn < old_stable_lsn { - error!("unexpected corruption encountered in storage snapshot file"); - return Err(Error::corruption(None)); - } - - if snapshot.stable_lsn > old_stable_lsn { - write_snapshot(config, &snapshot)?; - } - - #[cfg(feature = "testing")] - let reverse_segments = { - let shred_base = shred_point.unwrap_or(LogOffset::max_value()); - let mut reverse_segments = Map::new(); - for (pid, page) in snapshot.pt.iter().enumerate() { - let offsets = page.offsets(); - for offset_option in offsets { - let offset = if let Some(offset) = offset_option { - offset - } else { - continue; - }; - let segment = config.normalize(offset); - if segment == config.normalize(shred_base) { - assert!( - offset < shred_base, - "we shredded the location for pid {} - with locations {:?} - by zeroing the file tip after lid {}", - pid, - page, - shred_base - ); - } - let entry = - reverse_segments.entry(segment).or_insert_with(Set::new); - entry.insert((pid, offset)); - } - } - reverse_segments - }; - - for (lsn, to_zero) in &iter.segments { - debug!("zeroing torn segment at lsn {} lid {}", lsn, to_zero); - - #[cfg(feature = "testing")] - { - if let Some(pids) = reverse_segments.get(to_zero) { - assert!( - pids.is_empty(), - "expected segment that we're zeroing at lid {} \ - lsn {} \ - to contain no pages, but it contained pids {:?}", - to_zero, - lsn, - pids - ); - } - } - - // NB we intentionally corrupt this header to prevent any segment - // from being allocated which would duplicate its LSN, messing - // up recovery in the future. - io_fail!(config, "segment initial free zero"); - pwrite_all( - &config.file, - &*vec![MessageKind::Corrupted.into(); config.segment_size], - *to_zero, - )?; - if !config.temporary { - config.file.sync_all()?; - } - } - - #[cfg(feature = "event_log")] - config.event_log.recovered_lsn(snapshot.stable_lsn.unwrap_or(0)); - - Ok(snapshot) -} - -/// Read a `Snapshot` or generate a default, then advance it to -/// the tip of the data file, if present. -pub fn read_snapshot_or_default(config: &RunningConfig) -> Result { - // NB we want to error out if the read snapshot was corrupted. - // We only use a default Snapshot when there is no snapshot found. - let last_snap = read_snapshot(config)?.unwrap_or_default(); - - let log_iter = - raw_segment_iter_from(last_snap.stable_lsn.unwrap_or(0), config)?; - - let res = advance_snapshot(log_iter, last_snap, config)?; - - Ok(res) -} - -/// Read a `Snapshot` from disk. -/// Returns an error if the read snapshot was corrupted. -/// Returns `Ok(Some(snapshot))` if there was nothing written. 
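// Illustrative sketch of the snapshot file framing that read_snapshot and
// write_snapshot below depend on (these helper names are hypothetical, and
// the little-endian encoding here is an assumption of the sketch): the
// serialized snapshot payload is followed by its length as 8 bytes and a
// CRC32 of the payload as 4 bytes, so the trailer always occupies the final
// 12 bytes of the file.
fn frame_snapshot(payload: &[u8], crc32_of: impl Fn(&[u8]) -> u32) -> Vec<u8> {
    let mut framed = payload.to_vec();
    framed.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    framed.extend_from_slice(&crc32_of(payload).to_le_bytes());
    framed
}

fn unframe_snapshot<'a>(
    buf: &'a [u8],
    crc32_of: impl Fn(&[u8]) -> u32,
) -> Option<&'a [u8]> {
    if buf.len() <= 12 {
        // too short to even hold the length + checksum trailer
        return None;
    }
    let (payload, trailer) = buf.split_at(buf.len() - 12);
    let expected =
        u32::from_le_bytes([trailer[8], trailer[9], trailer[10], trailer[11]]);
    if crc32_of(payload) == expected {
        Some(payload)
    } else {
        None
    }
}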
-fn read_snapshot(config: &RunningConfig) -> Result> { - let mut candidates = config.get_snapshot_files()?; - if candidates.is_empty() { - debug!("no previous snapshot found"); - return Ok(None); - } - - candidates.sort(); - let path = candidates.pop().unwrap(); - - let mut f = std::fs::OpenOptions::new().read(true).open(&path)?; - - let mut buf = vec![]; - let _read = f.read_to_end(&mut buf)?; - let len = buf.len(); - if len <= 12 { - warn!("empty/corrupt snapshot file found at path: {:?}", path); - return Err(Error::corruption(None)); - } - - let mut len_expected_bytes = [0; 8]; - len_expected_bytes.copy_from_slice(&buf[len - 12..len - 4]); - - let mut crc_expected_bytes = [0; 4]; - crc_expected_bytes.copy_from_slice(&buf[len - 4..]); - - let _ = buf.split_off(len - 12); - let crc_expected: u32 = arr_to_u32(&crc_expected_bytes); - - let crc_actual = crc32(&buf); - - if crc_expected != crc_actual { - warn!( - "corrupt snapshot file found, crc does not match expected. \ - path: {:?}", - path - ); - return Err(Error::corruption(None)); - } - - Snapshot::deserialize(&mut buf.as_slice()).map(Some) -} - -pub(in crate::pagecache) fn write_snapshot( - config: &RunningConfig, - snapshot: &Snapshot, -) -> Result<()> { - trace!("writing snapshot {:?}", snapshot); - - let bytes = snapshot.serialize(); - - let crc32: [u8; 4] = u32_to_arr(crc32(&bytes)); - let len_bytes: [u8; 8] = u64_to_arr(bytes.len() as u64); - - let path_1_suffix = - format!("snap.{:016X}.generating", snapshot.stable_lsn.unwrap_or(0)); - - let mut path_1 = config.get_path(); - path_1.push(path_1_suffix); - - let path_2_suffix = - format!("snap.{:016X}", snapshot.stable_lsn.unwrap_or(0)); - - let mut path_2 = config.get_path(); - path_2.push(path_2_suffix); - - let parent = path_1.parent().unwrap(); - std::fs::create_dir_all(parent)?; - let mut f = - std::fs::OpenOptions::new().write(true).create(true).open(&path_1)?; - - // write the snapshot bytes, followed by a crc64 checksum at the end - io_fail!(config, "snap write"); - f.write_all(&*bytes)?; - io_fail!(config, "snap write len"); - f.write_all(&len_bytes)?; - io_fail!(config, "snap write crc"); - f.write_all(&crc32)?; - io_fail!(config, "snap write post"); - f.sync_all()?; - - trace!("wrote snapshot to {}", path_1.to_string_lossy()); - - io_fail!(config, "snap write mv"); - std::fs::rename(&path_1, &path_2)?; - io_fail!(config, "snap write dir fsync"); - maybe_fsync_directory(config.get_path())?; - io_fail!(config, "snap write mv post"); - - trace!("renamed snapshot to {}", path_2.to_string_lossy()); - - // clean up any old snapshots - let candidates = config.get_snapshot_files()?; - for path in candidates { - let path_str = path.file_name().unwrap().to_str().unwrap(); - if !path_2.to_string_lossy().ends_with(path_str) { - debug!("removing old snapshot file {:?}", path); - - io_fail!(config, "snap write rm old"); - - if let Err(e) = std::fs::remove_file(&path) { - // TODO should this just be a try return? - warn!( - "failed to remove old snapshot file, maybe snapshot race? {}", - e - ); - } - } - } - Ok(()) -} diff --git a/src/result.rs b/src/result.rs deleted file mode 100644 index d3527025b..000000000 --- a/src/result.rs +++ /dev/null @@ -1,163 +0,0 @@ -use std::{ - cmp::PartialEq, - error::Error as StdError, - fmt::{self, Display}, - io, -}; - -use crate::pagecache::{DiskPtr, PageView}; - -/// The top-level result type for dealing with -/// fallible operations. 
The errors tend to -/// be fail-stop, and nested results are used -/// in cases where the outer fail-stop error can -/// have try `?` used on it, exposing the inner -/// operation that is expected to fail under -/// normal operation. The philosophy behind this -/// is detailed [on the sled blog](https://sled.rs/errors). -pub type Result = std::result::Result; - -/// A compare and swap result. If the CAS is successful, -/// the new `PagePtr` will be returned as `Ok`. Otherwise, -/// the `Err` will contain a tuple of the current `PagePtr` -/// and the old value that could not be set atomically. -pub(crate) type CasResult<'a, R> = - std::result::Result, Option<(PageView<'a>, R)>>; - -/// An Error type encapsulating various issues that may come up -/// in the operation of a `Db`. -#[derive(Debug, Clone, Copy)] -pub enum Error { - /// The underlying collection no longer exists. - CollectionNotFound, - /// The system has been used in an unsupported way. - Unsupported(&'static str), - /// An unexpected bug has happened. Please open an issue on github! - ReportableBug(&'static str), - /// A read or write error has happened when interacting with the file - /// system. - Io(io::ErrorKind, &'static str), - /// Corruption has been detected in the storage file. - Corruption { - /// The file location that corrupted data was found at. - at: Option, - }, - // a failpoint has been triggered for testing purposes - #[doc(hidden)] - #[cfg(feature = "failpoints")] - FailPoint, -} - -impl Error { - pub(crate) const fn corruption(at: Option) -> Error { - Error::Corruption { at } - } -} - -impl Eq for Error {} - -impl PartialEq for Error { - fn eq(&self, other: &Self) -> bool { - use self::Error::*; - - match *self { - CollectionNotFound => matches!(other, CollectionNotFound), - Unsupported(ref l) => { - if let Unsupported(ref r) = *other { - l == r - } else { - false - } - } - ReportableBug(ref l) => { - if let ReportableBug(ref r) = *other { - l == r - } else { - false - } - } - #[cfg(feature = "failpoints")] - FailPoint => { - matches!(other, FailPoint) - } - Corruption { at: l, .. } => { - if let Corruption { at: r, .. } = *other { - l == r - } else { - false - } - } - Io(_, _) => false, - } - } -} - -impl From for Error { - #[inline] - fn from(io_error: io::Error) -> Self { - Error::Io(io_error.kind(), "io error") - } -} - -impl From for io::Error { - fn from(error: Error) -> io::Error { - use self::Error::*; - use std::io::ErrorKind; - match error { - Io(kind, reason) => io::Error::new(kind, reason), - CollectionNotFound => io::Error::new( - ErrorKind::NotFound, - "collection not found" - ), - Unsupported(why) => io::Error::new( - ErrorKind::InvalidInput, - format!("operation not supported: {:?}", why), - ), - ReportableBug(what) => io::Error::new( - ErrorKind::Other, - format!( - "unexpected bug! please report this bug at : {:?}", - what - ), - ), - Corruption { .. } => io::Error::new( - ErrorKind::InvalidData, - format!("corruption encountered: {:?}", error), - ), - #[cfg(feature = "failpoints")] - FailPoint => io::Error::new(ErrorKind::Other, "failpoint"), - } - } -} - -impl StdError for Error {} - -impl Display for Error { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - use self::Error::*; - - match *self { - CollectionNotFound => { - write!(f, "Collection does not exist") - } - Unsupported(ref e) => write!(f, "Unsupported: {}", e), - ReportableBug(ref e) => write!( - f, - "Unexpected bug has happened: {}. 
\ - PLEASE REPORT THIS BUG!", - e - ), - #[cfg(feature = "failpoints")] - FailPoint => write!(f, "Fail point has been triggered."), - Io(ref kind, ref reason) => { - write!(f, "IO error: ({:?}, {})", kind, reason) - } - Corruption { at } => { - write!(f, "Read corrupted data at file offset {:?}", at) - } - } - } -} diff --git a/src/serialization.rs b/src/serialization.rs deleted file mode 100644 index 7118ab5d8..000000000 --- a/src/serialization.rs +++ /dev/null @@ -1,905 +0,0 @@ -#![allow(clippy::mut_mut)] -use std::{ - convert::{TryFrom, TryInto}, - iter::FromIterator, - marker::PhantomData, - num::NonZeroU64, -}; - -use crate::{ - pagecache::{ - BatchManifest, HeapId, MessageHeader, PageState, SegmentNumber, - Snapshot, - }, - varint, DiskPtr, Error, IVec, Link, Meta, Node, Result, -}; - -/// Items that may be serialized and deserialized -pub trait Serialize: Sized { - /// Returns the buffer size required to hold - /// the serialized bytes for this item. - fn serialized_size(&self) -> u64; - - /// Serializees the item without allocating. - /// - /// # Panics - /// - /// Panics if the buffer is not large enough. - fn serialize_into(&self, buf: &mut &mut [u8]); - - /// Attempts to deserialize this type from some bytes. - fn deserialize(buf: &mut &[u8]) -> Result; - - /// Returns owned serialized bytes. - fn serialize(&self) -> Vec { - let sz = self.serialized_size(); - let mut buf = vec![0; usize::try_from(sz).unwrap()]; - self.serialize_into(&mut buf.as_mut_slice()); - buf - } -} - -// Moves a reference to mutable bytes forward, -// sidestepping Rust's limitations in reasoning -// about lifetimes. -// -// ☑ Checked with Miri by Tyler on 2019-12-12 -#[allow(unsafe_code)] -fn scoot(buf: &mut &mut [u8], amount: usize) { - assert!(buf.len() >= amount); - let len = buf.len(); - let ptr = buf.as_mut_ptr(); - let new_len = len - amount; - - unsafe { - let new_ptr = ptr.add(amount); - *buf = std::slice::from_raw_parts_mut(new_ptr, new_len); - } -} - -impl Serialize for BatchManifest { - fn serialized_size(&self) -> u64 { - 8 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - buf[..8].copy_from_slice(&self.0.to_le_bytes()); - scoot(buf, 8); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.len() < 8 { - return Err(Error::corruption(None)); - } - - let array = buf[..8].try_into().unwrap(); - *buf = &buf[8..]; - Ok(BatchManifest(i64::from_le_bytes(array))) - } -} - -impl Serialize for () { - fn serialized_size(&self) -> u64 { - 0 - } - - fn serialize_into(&self, _: &mut &mut [u8]) {} - - fn deserialize(_: &mut &[u8]) -> Result<()> { - Ok(()) - } -} - -impl Serialize for MessageHeader { - fn serialized_size(&self) -> u64 { - 4 + 1 - + self.len.serialized_size() - + self.segment_number.serialized_size() - + self.pid.serialized_size() - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - self.crc32.serialize_into(buf); - (self.kind as u8).serialize_into(buf); - self.len.serialize_into(buf); - self.segment_number.serialize_into(buf); - self.pid.serialize_into(buf); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - Ok(MessageHeader { - crc32: u32::deserialize(buf)?, - kind: u8::deserialize(buf)?.into(), - len: u64::deserialize(buf)?, - segment_number: SegmentNumber(u64::deserialize(buf)?), - pid: u64::deserialize(buf)?, - }) - } -} - -impl Serialize for IVec { - fn serialized_size(&self) -> u64 { - let len = self.len() as u64; - len + len.serialized_size() - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - (self.len() as u64).serialize_into(buf); - 
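        // the varint length written just above is followed by the raw bytes
        // themselves; `deserialize` reads the length back first and then
        // slices exactly that many bytes off the front of the buffer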
buf[..self.len()].copy_from_slice(self.as_ref()); - scoot(buf, self.len()); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - let k_len = usize::try_from(u64::deserialize(buf)?) - .expect("should never store items that rust can't natively index"); - let ret = &buf[..k_len]; - *buf = &buf[k_len..]; - Ok(ret.into()) - } -} - -impl Serialize for u64 { - fn serialized_size(&self) -> u64 { - varint::size(*self) as u64 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - let sz = varint::serialize_into(*self, buf); - - scoot(buf, sz); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let (res, scoot) = varint::deserialize(buf)?; - *buf = &buf[scoot..]; - Ok(res) - } -} - -struct NonVarU64(u64); - -impl Serialize for NonVarU64 { - fn serialized_size(&self) -> u64 { - 8 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - buf[..8].copy_from_slice(&self.0.to_le_bytes()); - scoot(buf, 8); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.len() < 8 { - return Err(Error::corruption(None)); - } - - let array = buf[..8].try_into().unwrap(); - *buf = &buf[8..]; - Ok(NonVarU64(u64::from_le_bytes(array))) - } -} - -impl Serialize for i64 { - fn serialized_size(&self) -> u64 { - 8 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - buf[..8].copy_from_slice(&self.to_le_bytes()); - scoot(buf, 8); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.len() < 8 { - return Err(Error::corruption(None)); - } - - let array = buf[..8].try_into().unwrap(); - *buf = &buf[8..]; - Ok(i64::from_le_bytes(array)) - } -} - -impl Serialize for u32 { - fn serialized_size(&self) -> u64 { - 4 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - buf[..4].copy_from_slice(&self.to_le_bytes()); - scoot(buf, 4); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.len() < 4 { - return Err(Error::corruption(None)); - } - - let array = buf[..4].try_into().unwrap(); - *buf = &buf[4..]; - Ok(u32::from_le_bytes(array)) - } -} - -impl Serialize for bool { - fn serialized_size(&self) -> u64 { - 1 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - let byte = u8::from(*self); - buf[0] = byte; - scoot(buf, 1); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let value = buf[0] != 0; - *buf = &buf[1..]; - Ok(value) - } -} - -impl Serialize for u8 { - fn serialized_size(&self) -> u64 { - 1 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - buf[0] = *self; - scoot(buf, 1); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let value = buf[0]; - *buf = &buf[1..]; - Ok(value) - } -} - -impl Serialize for HeapId { - fn serialized_size(&self) -> u64 { - 16 - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - NonVarU64(self.location).serialize_into(buf); - self.original_lsn.serialize_into(buf); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - Ok(HeapId { - location: NonVarU64::deserialize(buf)?.0, - original_lsn: i64::deserialize(buf)?, - }) - } -} - -impl Serialize for Meta { - fn serialized_size(&self) -> u64 { - let len_sz: u64 = (self.inner.len() as u64).serialized_size(); - let items_sz: u64 = self - .inner - .iter() - .map(|(k, v)| { - (k.len() as u64).serialized_size() - + k.len() as u64 - + v.serialized_size() - }) - .sum(); - - len_sz + items_sz - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - (self.inner.len() as u64).serialize_into(buf); - 
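        // after the count written above, each (key, value) pair is emitted in
        // iteration order; `deserialize` reads the count back and then pulls
        // exactly that many pairs out via `deserialize_bounded_sequence`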
serialize_2tuple_sequence(self.inner.iter(), buf); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - let len = u64::deserialize(buf)?; - let meta = Meta { inner: deserialize_bounded_sequence(buf, len)? }; - Ok(meta) - } -} - -impl Serialize for Link { - fn serialized_size(&self) -> u64 { - match self { - Link::Set(key, value) => { - 1 + (key.len() as u64).serialized_size() - + (value.len() as u64).serialized_size() - + u64::try_from(key.len()).unwrap() - + u64::try_from(value.len()).unwrap() - } - Link::Del(key) => { - 1 + (key.len() as u64).serialized_size() - + u64::try_from(key.len()).unwrap() - } - Link::ParentMergeIntention(a) => 1 + a.serialized_size(), - Link::ParentMergeConfirm | Link::ChildMergeCap => 1, - } - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - match self { - Link::Set(key, value) => { - 0_u8.serialize_into(buf); - key.serialize_into(buf); - value.serialize_into(buf); - } - Link::Del(key) => { - 1_u8.serialize_into(buf); - key.serialize_into(buf); - } - Link::ParentMergeIntention(pid) => { - 2_u8.serialize_into(buf); - pid.serialize_into(buf); - } - Link::ParentMergeConfirm => { - 3_u8.serialize_into(buf); - } - Link::ChildMergeCap => { - 4_u8.serialize_into(buf); - } - } - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let discriminant = buf[0]; - *buf = &buf[1..]; - Ok(match discriminant { - 0 => Link::Set(IVec::deserialize(buf)?, IVec::deserialize(buf)?), - 1 => Link::Del(IVec::deserialize(buf)?), - 2 => Link::ParentMergeIntention(u64::deserialize(buf)?), - 3 => Link::ParentMergeConfirm, - 4 => Link::ChildMergeCap, - _ => return Err(Error::corruption(None)), - }) - } -} - -fn shift_u64_opt(value: &Option) -> u64 { - value.map(|s| s + 1).unwrap_or(0) -} - -impl Serialize for Option { - fn serialized_size(&self) -> u64 { - shift_u64_opt(self).serialized_size() - } - fn serialize_into(&self, buf: &mut &mut [u8]) { - shift_u64_opt(self).serialize_into(buf) - } - fn deserialize(buf: &mut &[u8]) -> Result { - let shifted = u64::deserialize(buf)?; - let unshifted = if shifted == 0 { None } else { Some(shifted - 1) }; - Ok(unshifted) - } -} - -impl Serialize for Option { - fn serialized_size(&self) -> u64 { - (self.map(NonZeroU64::get).unwrap_or(0)).serialized_size() - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - (self.map(NonZeroU64::get).unwrap_or(0)).serialize_into(buf) - } - - fn deserialize(buf: &mut &[u8]) -> Result { - let underlying = u64::deserialize(buf)?; - Ok(if underlying == 0 { - None - } else { - Some(NonZeroU64::new(underlying).unwrap()) - }) - } -} - -impl Serialize for Option { - fn serialized_size(&self) -> u64 { - shift_i64_opt(self).serialized_size() - } - fn serialize_into(&self, buf: &mut &mut [u8]) { - shift_i64_opt(self).serialize_into(buf) - } - fn deserialize(buf: &mut &[u8]) -> Result { - Ok(unshift_i64_opt(i64::deserialize(buf)?)) - } -} - -fn shift_i64_opt(value_opt: &Option) -> i64 { - if let Some(value) = value_opt { - if *value < 0 { - *value - } else { - value + 1 - } - } else { - 0 - } -} - -const fn unshift_i64_opt(value: i64) -> Option { - if value == 0 { - return None; - } - let subtract = value > 0; - Some(value - subtract as i64) -} - -impl Serialize for Snapshot { - fn serialized_size(&self) -> u64 { - self.version.serialized_size() - + self.stable_lsn.serialized_size() - + self.active_segment.serialized_size() - + (self.pt.len() as u64).serialized_size() - + self.pt.iter().map(Serialize::serialized_size).sum::() - } - - fn 
serialize_into(&self, buf: &mut &mut [u8]) { - self.version.serialize_into(buf); - self.stable_lsn.serialize_into(buf); - self.active_segment.serialize_into(buf); - (self.pt.len() as u64).serialize_into(buf); - for page_state in &self.pt { - page_state.serialize_into(buf); - } - } - - fn deserialize(buf: &mut &[u8]) -> Result { - Ok(Snapshot { - version: Serialize::deserialize(buf)?, - stable_lsn: Serialize::deserialize(buf)?, - active_segment: Serialize::deserialize(buf)?, - pt: { - let len = u64::deserialize(buf)?; - deserialize_bounded_sequence(buf, len)? - }, - }) - } -} - -impl Serialize for Node { - fn serialized_size(&self) -> u64 { - let size = self.rss(); - size.serialized_size() + size - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - assert!(self.overlay.is_empty()); - self.rss().serialize_into(buf); - buf[..self.len()].copy_from_slice(self.as_ref()); - scoot(buf, self.len()); - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let len = usize::try_from(u64::deserialize(buf)?).unwrap(); - - #[allow(unsafe_code)] - let sst = unsafe { Node::from_raw(&buf[..len]) }; - - *buf = &buf[len..]; - Ok(sst) - } -} - -impl Serialize for DiskPtr { - fn serialized_size(&self) -> u64 { - match self { - DiskPtr::Inline(a) => 1 + a.serialized_size(), - DiskPtr::Heap(a, b) => { - 1 + a.serialized_size() + b.serialized_size() - } - } - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - match self { - DiskPtr::Inline(log_offset) => { - 0_u8.serialize_into(buf); - log_offset.serialize_into(buf); - } - DiskPtr::Heap(log_offset, heap_id) => { - 1_u8.serialize_into(buf); - log_offset.serialize_into(buf); - heap_id.serialize_into(buf); - } - } - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.len() < 2 { - return Err(Error::corruption(None)); - } - let discriminant = buf[0]; - *buf = &buf[1..]; - Ok(match discriminant { - 0 => DiskPtr::Inline(u64::deserialize(buf)?), - 1 => DiskPtr::Heap( - Serialize::deserialize(buf)?, - HeapId::deserialize(buf)?, - ), - _ => return Err(Error::corruption(None)), - }) - } -} - -impl Serialize for PageState { - fn serialized_size(&self) -> u64 { - match self { - PageState::Free(a, disk_ptr) => { - 0_u64.serialized_size() - + a.serialized_size() - + disk_ptr.serialized_size() - } - PageState::Present { base, frags } => { - (1 + frags.len() as u64).serialized_size() - + base.serialized_size() - + frags.iter().map(Serialize::serialized_size).sum::() - } - _ => panic!("tried to serialize {:?}", self), - } - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - match self { - PageState::Free(lsn, disk_ptr) => { - 0_u64.serialize_into(buf); - lsn.serialize_into(buf); - disk_ptr.serialize_into(buf); - } - PageState::Present { base, frags } => { - (1 + frags.len() as u64).serialize_into(buf); - base.serialize_into(buf); - serialize_2tuple_ref_sequence(frags.iter(), buf); - } - _ => panic!("tried to serialize {:?}", self), - } - } - - fn deserialize(buf: &mut &[u8]) -> Result { - if buf.is_empty() { - return Err(Error::corruption(None)); - } - let len = u64::deserialize(buf)?; - Ok(match len { - 0 => PageState::Free( - i64::deserialize(buf)?, - DiskPtr::deserialize(buf)?, - ), - _ => PageState::Present { - base: Serialize::deserialize(buf)?, - frags: deserialize_bounded_sequence(buf, len - 1)?, - }, - }) - } -} - -impl Serialize for (A, B) { - fn serialized_size(&self) -> u64 { - self.0.serialized_size() + self.1.serialized_size() - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - 
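        // tuples carry no framing of their own: the elements are simply
        // concatenated, and `deserialize` decodes an A followed by a B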
self.0.serialize_into(buf); - self.1.serialize_into(buf); - } - - fn deserialize(buf: &mut &[u8]) -> Result<(A, B)> { - let a = A::deserialize(buf)?; - let b = B::deserialize(buf)?; - Ok((a, b)) - } -} - -impl Serialize for (A, B, C) { - fn serialized_size(&self) -> u64 { - self.0.serialized_size() - + self.1.serialized_size() - + self.2.serialized_size() - } - - fn serialize_into(&self, buf: &mut &mut [u8]) { - self.0.serialize_into(buf); - self.1.serialize_into(buf); - self.2.serialize_into(buf); - } - - fn deserialize(buf: &mut &[u8]) -> Result<(A, B, C)> { - let a = A::deserialize(buf)?; - let b = B::deserialize(buf)?; - let c = C::deserialize(buf)?; - Ok((a, b, c)) - } -} - -fn serialize_2tuple_sequence<'a, XS, A, B>(xs: XS, buf: &mut &mut [u8]) -where - XS: Iterator, - A: Serialize + 'a, - B: Serialize + 'a, -{ - for item in xs { - item.0.serialize_into(buf); - item.1.serialize_into(buf); - } -} - -fn serialize_2tuple_ref_sequence<'a, XS, A, B>(xs: XS, buf: &mut &mut [u8]) -where - XS: Iterator, - A: Serialize + 'a, - B: Serialize + 'a, -{ - for item in xs { - item.0.serialize_into(buf); - item.1.serialize_into(buf); - } -} - -struct ConsumeSequence<'a, 'b, T> { - buf: &'a mut &'b [u8], - _t: PhantomData, - bound: u64, -} - -fn deserialize_bounded_sequence( - buf: &mut &[u8], - bound: u64, -) -> Result -where - R: FromIterator, -{ - let iter = ConsumeSequence { buf, _t: PhantomData, bound }; - iter.collect() -} - -impl<'a, 'b, T> Iterator for ConsumeSequence<'a, 'b, T> -where - T: Serialize, -{ - type Item = Result; - - fn next(&mut self) -> Option> { - if self.bound == 0 || self.buf.is_empty() { - return None; - } - let item_res = T::deserialize(self.buf); - self.bound -= 1; - if item_res.is_err() { - self.bound = 0; - } - Some(item_res) - } -} - -#[cfg(test)] -mod qc { - use quickcheck::{Arbitrary, Gen}; - use rand::Rng; - - use super::*; - use crate::pagecache::MessageKind; - - impl Arbitrary for MessageHeader { - fn arbitrary(g: &mut G) -> MessageHeader { - MessageHeader { - crc32: g.gen(), - len: g.gen(), - kind: MessageKind::arbitrary(g), - segment_number: SegmentNumber(SpreadU64::arbitrary(g).0), - pid: g.gen(), - } - } - } - - impl Arbitrary for HeapId { - fn arbitrary(g: &mut G) -> HeapId { - HeapId { - location: SpreadU64::arbitrary(g).0, - original_lsn: SpreadI64::arbitrary(g).0, - } - } - } - - impl Arbitrary for MessageKind { - fn arbitrary(g: &mut G) -> MessageKind { - g.gen_range(0, 12).into() - } - } - - impl Arbitrary for Link { - fn arbitrary(g: &mut G) -> Link { - let discriminant = g.gen_range(0, 5); - match discriminant { - 0 => Link::Set(IVec::arbitrary(g), IVec::arbitrary(g)), - 1 => Link::Del(IVec::arbitrary(g)), - 2 => Link::ParentMergeIntention(u64::arbitrary(g)), - 3 => Link::ParentMergeConfirm, - 4 => Link::ChildMergeCap, - _ => unreachable!("invalid choice"), - } - } - } - - impl Arbitrary for IVec { - fn arbitrary(g: &mut G) -> IVec { - let v: Vec = Arbitrary::arbitrary(g); - v.into() - } - - fn shrink(&self) -> Box> { - let v: Vec = self.to_vec(); - Box::new(v.shrink().map(IVec::from)) - } - } - - impl Arbitrary for Meta { - fn arbitrary(g: &mut G) -> Meta { - Meta { inner: Arbitrary::arbitrary(g) } - } - - fn shrink(&self) -> Box> { - Box::new(self.inner.shrink().map(|inner| Meta { inner })) - } - } - - impl Arbitrary for DiskPtr { - fn arbitrary(g: &mut G) -> DiskPtr { - if g.gen() { - DiskPtr::Inline(g.gen()) - } else { - DiskPtr::Heap(g.gen(), HeapId::arbitrary(g)) - } - } - } - - impl Arbitrary for PageState { - fn arbitrary(g: &mut G) -> PageState 
{ - if g.gen() { - // don't generate 255 because we add 1 to this - // number in PageState::serialize_into to account - // for the base fragment - let n = g.gen_range(0, 255); - - let base = (g.gen(), DiskPtr::arbitrary(g)); - let frags = - (0..n).map(|_| (g.gen(), DiskPtr::arbitrary(g))).collect(); - PageState::Present { base, frags } - } else { - PageState::Free(g.gen(), DiskPtr::arbitrary(g)) - } - } - } - - impl Arbitrary for Snapshot { - fn arbitrary(g: &mut G) -> Snapshot { - Snapshot { - version: g.gen(), - stable_lsn: g.gen(), - active_segment: g.gen(), - pt: Arbitrary::arbitrary(g), - } - } - } - - #[derive(Debug, Clone)] - struct SpreadI64(i64); - - impl Arbitrary for SpreadI64 { - fn arbitrary(g: &mut G) -> SpreadI64 { - let uniform = g.gen::(); - let shift = g.gen_range(0, 64); - SpreadI64(uniform >> shift) - } - } - - #[derive(Debug, Clone)] - struct SpreadU64(u64); - - impl Arbitrary for SpreadU64 { - fn arbitrary(g: &mut G) -> SpreadU64 { - let uniform = g.gen::(); - let shift = g.gen_range(0, 64); - SpreadU64(uniform >> shift) - } - } - - fn prop_serialize(item: &T) -> bool - where - T: Serialize + PartialEq + Clone + std::fmt::Debug, - { - let mut buf = vec![0; usize::try_from(item.serialized_size()).unwrap()]; - let buf_ref = &mut buf.as_mut_slice(); - item.serialize_into(buf_ref); - assert_eq!( - buf_ref.len(), - 0, - "round-trip failed to consume produced bytes" - ); - assert_eq!(buf.len() as u64, item.serialized_size()); - let deserialized = T::deserialize(&mut buf.as_slice()).unwrap(); - if *item == deserialized { - true - } else { - eprintln!( - "\nround-trip serialization failed. original:\n\n{:?}\n\n \ - deserialized(serialized(original)):\n\n{:?}\n", - item, deserialized - ); - false - } - } - - quickcheck::quickcheck! { - #[cfg_attr(miri, ignore)] - fn bool(item: bool) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn u8(item: u8) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn i64(item: SpreadI64) -> bool { - prop_serialize(&item.0) - } - - #[cfg_attr(miri, ignore)] - fn u64(item: SpreadU64) -> bool { - prop_serialize(&item.0) - } - - #[cfg_attr(miri, ignore)] - fn disk_ptr(item: DiskPtr) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn page_state(item: PageState) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn meta(item: Meta) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn snapshot(item: Snapshot) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn node(item: Node) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn link(item: Link) -> bool { - prop_serialize(&item) - } - - #[cfg_attr(miri, ignore)] - fn msg_header(item: MessageHeader) -> bool { - prop_serialize(&item) - } - } -} diff --git a/src/stack.rs b/src/stack.rs deleted file mode 100644 index 89ccb1cac..000000000 --- a/src/stack.rs +++ /dev/null @@ -1,271 +0,0 @@ -#![allow(unsafe_code)] - -use std::{ - fmt::{self, Debug}, - ops::Deref, - sync::atomic::Ordering::{Acquire, Release}, -}; - -use crate::{ - debug_delay, - ebr::{pin, Atomic, Guard, Owned, Shared}, -}; - -/// A node in the lock-free `Stack`. 
-#[derive(Debug)] -pub struct Node { - pub(crate) inner: T, - pub(crate) next: Atomic>, -} - -impl Drop for Node { - fn drop(&mut self) { - unsafe { - let guard = pin(); - let mut cursor = self.next.load(Acquire, &guard); - - while !cursor.is_null() { - // we carefully unset the next pointer here to avoid - // a stack overflow when freeing long lists. - let node = cursor.into_owned(); - cursor = node.next.swap(Shared::null(), Acquire, &guard); - drop(node); - } - } - } -} - -/// A simple lock-free stack, with the ability to atomically -/// append or entirely swap-out entries. -pub struct Stack { - head: Atomic>, -} - -impl Default for Stack { - fn default() -> Self { - Self { head: Atomic::null() } - } -} - -impl Drop for Stack { - fn drop(&mut self) { - unsafe { - let guard = pin(); - let curr = self.head.load(Acquire, &guard); - if !curr.as_raw().is_null() { - drop(curr.into_owned()); - } - } - } -} - -impl Debug for Stack -where - T: Clone + Debug + Send + 'static + Sync, -{ - fn fmt( - &self, - formatter: &mut fmt::Formatter<'_>, - ) -> Result<(), fmt::Error> { - let guard = pin(); - let head = self.head(&guard); - let iter = Iter::from_ptr(head, &guard); - - formatter.write_str("Stack [")?; - let mut written = false; - for node in iter { - if written { - formatter.write_str(", ")?; - } - formatter.write_str(&*format!("({:?}) ", &node))?; - node.fmt(formatter)?; - written = true; - } - formatter.write_str("]")?; - Ok(()) - } -} - -impl Deref for Node { - type Target = T; - fn deref(&self) -> &T { - &self.inner - } -} - -impl Stack { - /// Add an item to the stack, spinning until successful. - pub(crate) fn push(&self, inner: T, guard: &Guard) { - debug_delay(); - let node_owned = Owned::new(Node { inner, next: Atomic::null() }); - - unsafe { - let node_shared = node_owned.into_shared(guard); - - loop { - let head = self.head(guard); - node_shared.deref().next.store(head, Release); - if self - .head - .compare_and_set(head, node_shared, Release, guard) - .is_ok() - { - return; - } - } - } - } - - /// Clears the stack and returns all items - pub(crate) fn take_iter<'a>( - &self, - guard: &'a Guard, - ) -> impl Iterator { - debug_delay(); - let node = self.head.swap(Shared::null(), Release, guard); - - let iter = Iter { inner: node, guard }; - - if !node.is_null() { - unsafe { - guard.defer_destroy(node); - } - } - - iter - } - - /// Pop the next item off the stack. Returns None if nothing is there. - pub(crate) fn pop(&self, guard: &Guard) -> Option { - use std::ptr; - use std::sync::atomic::Ordering::SeqCst; - debug_delay(); - let mut head = self.head(guard); - loop { - match unsafe { head.as_ref() } { - Some(h) => { - let next = h.next.load(Acquire, guard); - match self.head.compare_and_set(head, next, Release, guard) - { - Ok(_) => unsafe { - // we unset the next pointer before destruction - // to avoid double-frees. - h.next.store(Shared::default(), SeqCst); - guard.defer_destroy(head); - return Some(ptr::read(&h.inner)); - }, - Err(actual) => head = actual.current, - } - } - None => return None, - } - } - } - - /// Returns the current head pointer of the stack, which can - /// later be used as the key for cas and cap operations. - pub(crate) fn head<'g>(&self, guard: &'g Guard) -> Shared<'g, Node> { - self.head.load(Acquire, guard) - } -} - -/// An iterator over nodes in a lock-free stack. 
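// Illustrative, single-threaded analogue (hypothetical types, no epoch
// reclamation) of the `take_iter` method above: the whole chain is detached
// from the stack in one step and then drained by the caller.
struct ToyStack<T> {
    head: Option<Box<ToyNode<T>>>,
}

struct ToyNode<T> {
    inner: T,
    next: Option<Box<ToyNode<T>>>,
}

impl<T> ToyStack<T> {
    fn take_iter(&mut self) -> impl Iterator<Item = T> {
        // detach the entire list at once, leaving an empty stack behind
        let mut cursor = self.head.take();
        std::iter::from_fn(move || {
            let node = *cursor.take()?;
            cursor = node.next;
            Some(node.inner)
        })
    }
}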
-pub struct Iter<'a, T> -where - T: Send + 'static + Sync, -{ - inner: Shared<'a, Node>, - guard: &'a Guard, -} - -impl<'a, T> Iter<'a, T> -where - T: 'a + Send + 'static + Sync, -{ - /// Creates a `Iter` from a pointer to one. - pub(crate) fn from_ptr<'b>( - ptr: Shared<'b, Node>, - guard: &'b Guard, - ) -> Iter<'b, T> { - Iter { inner: ptr, guard } - } -} - -impl<'a, T> Iterator for Iter<'a, T> -where - T: Send + 'static + Sync, -{ - type Item = &'a T; - - fn next(&mut self) -> Option { - debug_delay(); - if self.inner.is_null() { - None - } else { - unsafe { - let ret = &self.inner.deref().inner; - self.inner = self.inner.deref().next.load(Acquire, self.guard); - Some(ret) - } - } - } - - fn size_hint(&self) -> (usize, Option) { - let mut size = 0; - let mut cursor = self.inner; - - while !cursor.is_null() { - unsafe { - cursor = cursor.deref().next.load(Acquire, self.guard); - } - size += 1; - } - - (size, Some(size)) - } -} - -#[test] -#[cfg(not(miri))] // can't create threads -fn basic_functionality() { - use crate::pin; - use crate::CachePadded; - use std::sync::Arc; - use std::thread; - - let guard = pin(); - let ll = Arc::new(Stack::default()); - assert_eq!(ll.pop(&guard), None); - ll.push(CachePadded::new(1), &guard); - let ll2 = Arc::clone(&ll); - let t = thread::spawn(move || { - let guard = pin(); - ll2.push(CachePadded::new(2), &guard); - ll2.push(CachePadded::new(3), &guard); - ll2.push(CachePadded::new(4), &guard); - guard.flush(); - }); - t.join().unwrap(); - ll.push(CachePadded::new(5), &guard); - assert_eq!(ll.pop(&guard), Some(CachePadded::new(5))); - assert_eq!(ll.pop(&guard), Some(CachePadded::new(4))); - let ll3 = Arc::clone(&ll); - let t = thread::spawn(move || { - let guard = pin(); - assert_eq!(ll3.pop(&guard), Some(CachePadded::new(3))); - assert_eq!(ll3.pop(&guard), Some(CachePadded::new(2))); - guard.flush(); - }); - t.join().unwrap(); - assert_eq!(ll.pop(&guard), Some(CachePadded::new(1))); - let ll4 = Arc::clone(&ll); - let t = thread::spawn(move || { - let guard = pin(); - assert_eq!(ll4.pop(&guard), None); - guard.flush(); - }); - t.join().unwrap(); - drop(ll); - guard.flush(); - drop(guard); -} diff --git a/src/subscriber.rs b/src/subscriber.rs deleted file mode 100644 index 3a8a25def..000000000 --- a/src/subscriber.rs +++ /dev/null @@ -1,355 +0,0 @@ -use std::{ - future::Future, - pin::Pin, - sync::{ - atomic::{AtomicBool, Ordering::Relaxed}, - mpsc::{sync_channel, Receiver, SyncSender, TryRecvError}, - }, - task::{Context, Poll, Waker}, - time::{Duration, Instant}, -}; - -use crate::*; - -static ID_GEN: AtomicUsize = AtomicUsize::new(0); - -/// An event that happened to a key that a subscriber is interested in. -#[derive(Debug, Clone)] -pub struct Event { - /// A map of batches for each tree written to in a transaction, - /// only one of which will be the one subscribed to. 
- pub(crate) batches: Arc<[(Tree, Batch)]>, -} - -impl Event { - pub(crate) fn single_update( - tree: Tree, - key: IVec, - value: Option, - ) -> Event { - Event::single_batch( - tree, - Batch { writes: vec![(key, value)].into_iter().collect() }, - ) - } - - pub(crate) fn single_batch(tree: Tree, batch: Batch) -> Event { - Event::from_batches(vec![(tree, batch)]) - } - - pub(crate) fn from_batches(batches: Vec<(Tree, Batch)>) -> Event { - Event { batches: Arc::from(batches.into_boxed_slice()) } - } - - /// Iterate over each Tree, key, and optional value in this `Event` - pub fn iter<'a>( - &'a self, - ) -> Box)>> - { - self.into_iter() - } -} - -impl<'a> IntoIterator for &'a Event { - type Item = (&'a Tree, &'a IVec, &'a Option); - type IntoIter = Box>; - - fn into_iter(self) -> Self::IntoIter { - Box::new(self.batches.iter().flat_map(|(ref tree, ref batch)| { - batch.writes.iter().map(move |(k, v_opt)| (tree, k, v_opt)) - })) - } -} - -type Senders = Map, SyncSender>>)>; - -/// A subscriber listening on a specified prefix -/// -/// `Subscriber` implements both `Iterator` -/// and `Future>` -/// -/// # Examples -/// -/// Synchronous, blocking subscriber: -/// ``` -/// # fn main() -> Result<(), Box> { -/// use sled::{Config, Event}; -/// let config = Config::new().temporary(true); -/// -/// let tree = config.open()?; -/// -/// // watch all events by subscribing to the empty prefix -/// let mut subscriber = tree.watch_prefix(vec![]); -/// -/// let tree_2 = tree.clone(); -/// let thread = std::thread::spawn(move || { -/// tree.insert(vec![0], vec![1]) -/// }); -/// -/// // `Subscription` implements `Iterator` -/// for event in subscriber.take(1) { -/// // Events occur due to single key operations, -/// // batches, or transactions. The tree is included -/// // so that you may perform a new transaction or -/// // operation in response to the event. -/// for (tree, key, value_opt) in &event { -/// if let Some(value) = value_opt { -/// // key `key` was set to value `value` -/// } else { -/// // key `key` was removed -/// } -/// } -/// } -/// -/// # thread.join().unwrap(); -/// # Ok(()) -/// # } -/// ``` -/// Aynchronous, non-blocking subscriber: -/// -/// `Subscription` implements `Future>`. -/// -/// `while let Some(event) = (&mut subscriber).await { /* use it */ }` -pub struct Subscriber { - id: usize, - rx: Receiver>>, - existing: Option>>, - home: Arc>, -} - -impl Drop for Subscriber { - fn drop(&mut self) { - let mut w_senders = self.home.write(); - w_senders.remove(&self.id); - } -} - -impl Subscriber { - /// Attempts to wait for a value on this `Subscriber`, returning - /// an error if no event arrives within the provided `Duration` - /// or if the backing `Db` shuts down. - pub fn next_timeout( - &mut self, - mut timeout: Duration, - ) -> std::result::Result { - loop { - let before_first_receive = Instant::now(); - let mut future_rx = if let Some(future_rx) = self.existing.take() { - future_rx - } else { - self.rx.recv_timeout(timeout)? 
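                // the Duration passed to next_timeout is a total budget: the
                // time spent in this first blocking receive is subtracted
                // below before the second wait on the event itself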
- }; - timeout = timeout - .checked_sub(before_first_receive.elapsed()) - .unwrap_or_default(); - - let before_second_receive = Instant::now(); - match future_rx.wait_timeout(timeout) { - Ok(Some(event)) => return Ok(event), - Ok(None) => (), - Err(timeout_error) => { - self.existing = Some(future_rx); - return Err(timeout_error); - } - } - timeout = timeout - .checked_sub(before_second_receive.elapsed()) - .unwrap_or_default(); - } - } -} - -impl Future for Subscriber { - type Output = Option; - - fn poll( - mut self: Pin<&mut Self>, - cx: &mut Context<'_>, - ) -> Poll { - loop { - let mut future_rx = if let Some(future_rx) = self.existing.take() { - future_rx - } else { - match self.rx.try_recv() { - Ok(future_rx) => future_rx, - Err(TryRecvError::Empty) => break, - Err(TryRecvError::Disconnected) => { - return Poll::Ready(None) - } - } - }; - - match Future::poll(Pin::new(&mut future_rx), cx) { - Poll::Ready(Some(event)) => return Poll::Ready(event), - Poll::Ready(None) => continue, - Poll::Pending => { - self.existing = Some(future_rx); - return Poll::Pending; - } - } - } - let mut home = self.home.write(); - let entry = home.get_mut(&self.id).unwrap(); - entry.0 = Some(cx.waker().clone()); - Poll::Pending - } -} - -impl Iterator for Subscriber { - type Item = Event; - - fn next(&mut self) -> Option { - loop { - let future_rx = self.rx.recv().ok()?; - match future_rx.wait() { - Some(Some(event)) => return Some(event), - Some(None) => return None, - None => continue, - } - } - } -} - -#[derive(Debug, Default)] -pub(crate) struct Subscribers { - watched: RwLock, Arc>>>, - ever_used: AtomicBool, -} - -impl Drop for Subscribers { - fn drop(&mut self) { - let watched = self.watched.read(); - - for senders_mu in watched.values() { - let senders = std::mem::take(&mut *senders_mu.write()); - for (_, (waker_opt, sender)) in senders { - drop(sender); - if let Some(waker) = waker_opt { - waker.wake(); - } - } - } - } -} - -impl Subscribers { - pub(crate) fn register(&self, prefix: &[u8]) -> Subscriber { - self.ever_used.store(true, Relaxed); - let r_mu = { - let r_mu = self.watched.read(); - if r_mu.contains_key(prefix) { - r_mu - } else { - drop(r_mu); - let mut w_mu = self.watched.write(); - if !w_mu.contains_key(prefix) { - let old = w_mu.insert( - prefix.to_vec(), - Arc::new(RwLock::new(Map::default())), - ); - assert!(old.is_none()); - } - drop(w_mu); - self.watched.read() - } - }; - - let (tx, rx) = sync_channel(1024); - - let arc_senders = &r_mu[prefix]; - let mut w_senders = arc_senders.write(); - - let id = ID_GEN.fetch_add(1, Relaxed); - - w_senders.insert(id, (None, tx)); - - Subscriber { id, rx, existing: None, home: arc_senders.clone() } - } - - pub(crate) fn reserve_batch( - &self, - batch: &Batch, - ) -> Option { - if !self.ever_used.load(Relaxed) { - return None; - } - - let r_mu = self.watched.read(); - - let mut skip_indices = std::collections::HashSet::new(); - let mut subscribers = vec![]; - - for key in batch.writes.keys() { - for (idx, (prefix, subs_rwl)) in r_mu.iter().enumerate() { - if key.starts_with(prefix) && !skip_indices.contains(&idx) { - skip_indices.insert(idx); - let subs = subs_rwl.read(); - - for (_id, (waker, sender)) in subs.iter() { - let (tx, rx) = OneShot::pair(); - if let Err(err) = sender.try_send(rx) { - error!("send error: {:?}", err); - continue; - } - subscribers.push((waker.clone(), tx)); - } - } - } - } - - if subscribers.is_empty() { - None - } else { - Some(ReservedBroadcast { subscribers }) - } - } - - pub(crate) fn reserve>( - &self, - key: R, 
- ) -> Option { - if !self.ever_used.load(Relaxed) { - return None; - } - - let r_mu = self.watched.read(); - let prefixes = r_mu.iter().filter(|(k, _)| key.as_ref().starts_with(k)); - - let mut subscribers = vec![]; - - for (_, subs_rwl) in prefixes { - let subs = subs_rwl.read(); - - for (_id, (waker, sender)) in subs.iter() { - let (tx, rx) = OneShot::pair(); - if sender.send(rx).is_err() { - continue; - } - subscribers.push((waker.clone(), tx)); - } - } - - if subscribers.is_empty() { - None - } else { - Some(ReservedBroadcast { subscribers }) - } - } -} - -pub(crate) struct ReservedBroadcast { - subscribers: Vec<(Option, OneShotFiller>)>, -} - -impl ReservedBroadcast { - pub fn complete(self, event: &Event) { - let iter = self.subscribers.into_iter(); - - for (waker_opt, tx) in iter { - tx.fill(Some(event.clone())); - if let Some(waker) = waker_opt { - waker.wake(); - } - } - } -} diff --git a/src/sys_limits.rs b/src/sys_limits.rs deleted file mode 100644 index 1582062be..000000000 --- a/src/sys_limits.rs +++ /dev/null @@ -1,122 +0,0 @@ -#![allow(unsafe_code)] - -#[cfg(any(target_os = "linux", target_os = "macos"))] -use std::io; -#[cfg(any(target_os = "linux"))] -use {std::fs::File, std::io::Read}; - -use std::convert::TryFrom; - -/// See the Kernel's documentation for more information about this subsystem, -/// found at: [Documentation/cgroup-v1/memory.txt](https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt) -/// -/// If there's no memory limit specified on the container this may return -/// 0x7FFFFFFFFFFFF000 (2^63-1 rounded down to 4k which is a common page size). -/// So we know we are not running in a memory restricted environment. -#[cfg(target_os = "linux")] -fn get_cgroup_memory_limit() -> io::Result { - File::open("/sys/fs/cgroup/memory/memory.limit_in_bytes") - .and_then(read_u64_from) -} - -#[cfg(target_os = "linux")] -fn read_u64_from(mut file: File) -> io::Result { - let mut s = String::new(); - file.read_to_string(&mut s).and_then(|_| { - s.trim() - .parse() - .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e)) - }) -} - -/// Returns the maximum size of total available memory of the process, in bytes. -/// If this limit is exceeded, the malloc() and mmap() functions shall fail with -/// errno set to [ENOMEM]. -#[cfg(any(target_os = "linux", target_os = "macos"))] -fn get_rlimit_as() -> io::Result { - let mut limit = std::mem::MaybeUninit::::uninit(); - - let ret = unsafe { libc::getrlimit(libc::RLIMIT_AS, limit.as_mut_ptr()) }; - - if ret == 0 { - Ok(unsafe { limit.assume_init() }) - } else { - Err(io::Error::last_os_error()) - } -} - -#[cfg(any(target_os = "linux", target_os = "macos"))] -pub fn get_available_memory() -> io::Result { - let pages = unsafe { libc::sysconf(libc::_SC_PHYS_PAGES) }; - if pages == -1 { - return Err(io::Error::last_os_error()); - } - - let page_size = unsafe { libc::sysconf(libc::_SC_PAGE_SIZE) }; - if page_size == -1 { - return Err(io::Error::last_os_error()); - } - - Ok(usize::try_from(pages).unwrap() * usize::try_from(page_size).unwrap()) -} - -#[cfg(miri)] -pub fn get_memory_limit() -> Option { - None -} - -#[cfg(not(miri))] -pub fn get_memory_limit() -> Option { - let mut max: u64 = 0; - - #[cfg(target_os = "linux")] - { - if let Ok(mem) = get_cgroup_memory_limit() { - max = mem; - } - - // If there's no memory limit specified on the container this - // actually returns 0x7FFFFFFFFFFFF000 (2^63-1 rounded down to - // 4k which is a common page size). So we know we are not - // running in a memory restricted environment. 
- // src: https://github.com/dotnet/coreclr/blob/master/src/pal/src/misc/cgroup.cpp#L385-L428 - if max > 0x7FFF_FFFF_0000_0000 { - return None; - } - } - - #[allow(clippy::useless_conversion)] - #[cfg(any(target_os = "linux", target_os = "macos"))] - { - if let Ok(rlim) = get_rlimit_as() { - let rlim_cur = u64::try_from(rlim.rlim_cur).unwrap(); - if max == 0 || rlim_cur < max { - max = rlim_cur; - } - } - - if let Ok(available) = get_available_memory() { - if max == 0 || (available as u64) < max { - max = available as u64; - } - } - } - - #[cfg(miri)] - { - // Miri has a significant memory consumption overhead. During a small - // test run, a memory amplification of ~35x was observed. Certain - // memory overheads may increase asymptotically with longer test runs, - // such as the interpreter's dead_alloc_map. Memory overhead is - // dominated by stacked borrows tags; the asymptotic behavior of this - // overhead needs further investigation. - max /= 40; - } - - let ret = usize::try_from(max).expect("cache limit not representable"); - if ret == 0 { - None - } else { - Some(ret) - } -} diff --git a/src/threadpool.rs b/src/threadpool.rs deleted file mode 100644 index 9b1f8240e..000000000 --- a/src/threadpool.rs +++ /dev/null @@ -1,289 +0,0 @@ -//! A simple adaptive threadpool that returns a oneshot future. - -use std::sync::Arc; - -use crate::{OneShot, Result}; - -#[cfg(not(miri))] -mod queue { - use std::{ - cell::RefCell, - collections::VecDeque, - sync::{ - atomic::{AtomicBool, AtomicUsize, Ordering}, - Once, - }, - time::{Duration, Instant}, - }; - - use parking_lot::{Condvar, Mutex}; - - use crate::{debug_delay, Lazy, OneShot}; - - thread_local! { - static WORKER: RefCell = RefCell::new(false); - } - - fn is_worker() -> bool { - WORKER.with(|w| *w.borrow()) - } - - pub(super) static BLOCKING_QUEUE: Lazy Queue> = - Lazy::new(Default::default); - pub(super) static IO_QUEUE: Lazy Queue> = - Lazy::new(Default::default); - pub(super) static SNAPSHOT_QUEUE: Lazy Queue> = - Lazy::new(Default::default); - pub(super) static TRUNCATE_QUEUE: Lazy Queue> = - Lazy::new(Default::default); - - type Work = Box; - - pub(super) fn spawn_to(work: F, queue: &'static Queue) -> OneShot - where - F: FnOnce() -> R + Send + 'static, - R: Send + 'static + Sized, - { - static START_THREADS: Once = Once::new(); - - START_THREADS.call_once(|| { - std::thread::Builder::new() - .name("sled-io-thread".into()) - .spawn(|| IO_QUEUE.perform_work(true, false)) - .expect("failed to spawn critical IO thread"); - - std::thread::Builder::new() - .name("sled-blocking-thread".into()) - .spawn(|| BLOCKING_QUEUE.perform_work(true, false)) - .expect("failed to spawn critical blocking thread"); - - std::thread::Builder::new() - .name("sled-snapshot-thread".into()) - .spawn(|| SNAPSHOT_QUEUE.perform_work(true, false)) - .expect("failed to spawn critical snapshot thread"); - - std::thread::Builder::new() - .name("sled-truncate-thread".into()) - .spawn(|| TRUNCATE_QUEUE.perform_work(false, false)) - .expect("failed to spawn critical truncation thread"); - }); - - let (promise_filler, promise) = OneShot::pair(); - let task = move || { - promise_filler.fill((work)()); - }; - - if is_worker() { - // NB this could prevent deadlocks because - // if a threadpool thread spawns work into - // the threadpool's queue, which it later - // blocks on the completion of, it would be - // possible for threadpool threads to block - // forever on the completion of work that - // exists in the queue but will never be - // scheduled. 
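            // running the closure inline on this worker thread (rather than
            // re-queueing it) keeps forward progress guaranteed even if the
            // caller immediately blocks on the returned promise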
- task(); - } else { - queue.send(Box::new(task)); - } - - promise - } - - #[derive(Default)] - pub(super) struct Queue { - cv: Condvar, - mu: Mutex>, - temporary_threads: AtomicUsize, - spawning: AtomicBool, - } - - #[allow(unsafe_code)] - unsafe impl Send for Queue {} - - impl Queue { - fn recv_timeout(&self, duration: Duration) -> Option<(Work, usize)> { - let mut queue = self.mu.lock(); - - let cutoff = Instant::now() + duration; - - while queue.is_empty() { - let res = self.cv.wait_until(&mut queue, cutoff); - if res.timed_out() { - break; - } - } - - queue.pop_front().map(|w| (w, queue.len())) - } - - fn send(&self, work: Work) -> usize { - let mut queue = self.mu.lock(); - queue.push_back(work); - - let len = queue.len(); - - // having held the mutex makes this linearized - // with the notify below. - drop(queue); - - self.cv.notify_all(); - - len - } - - fn perform_work(&'static self, elastic: bool, temporary: bool) { - const MAX_TEMPORARY_THREADS: usize = 16; - - WORKER.with(|w| *w.borrow_mut() = true); - - self.spawning.store(false, Ordering::SeqCst); - - let wait_limit = Duration::from_millis(100); - - let mut unemployed_loops = 0; - while !temporary || unemployed_loops < 50 { - // take on a bit of GC labor - let guard = crate::pin(); - guard.flush(); - drop(guard); - - debug_delay(); - let task_opt = self.recv_timeout(wait_limit); - - if let Some((task, outstanding_work)) = task_opt { - // execute the work sent to this thread - (task)(); - - // spin up some help if we're falling behind - - let temporary_threads = - self.temporary_threads.load(Ordering::Acquire); - - if elastic - && outstanding_work > 5 - && temporary_threads < MAX_TEMPORARY_THREADS - && self - .spawning - .compare_exchange( - false, - true, - Ordering::SeqCst, - Ordering::SeqCst, - ) - .is_ok() - { - self.temporary_threads.fetch_add(1, Ordering::SeqCst); - let spawn_res = std::thread::Builder::new() - .name("sled-temporary-thread".into()) - .spawn(move || self.perform_work(false, true)); - if let Err(e) = spawn_res { - log::error!( - "failed to spin-up temporary work thread: {:?}", - e - ); - self.temporary_threads - .fetch_sub(1, Ordering::SeqCst); - } - } - - unemployed_loops = 0; - } else { - unemployed_loops += 1; - } - } - - assert!(temporary); - self.temporary_threads.fetch_sub(1, Ordering::SeqCst); - } - } -} - -/// Spawn a function on the threadpool. -pub fn spawn(work: F) -> OneShot -where - F: FnOnce() -> R + Send + 'static, - R: Send + 'static + Sized, -{ - spawn_to(work, &queue::BLOCKING_QUEUE) -} - -#[cfg(miri)] -mod queue { - /// This is the polyfill that just executes things synchronously. - use crate::{OneShot, Result}; - - pub(super) fn spawn_to(work: F, _: &()) -> OneShot - where - F: FnOnce() -> R + Send + 'static, - R: Send + 'static, - { - // Polyfill for platforms other than those we explicitly trust to - // perform threaded work on. Just execute a task without involving threads. 
- let (promise_filler, promise) = OneShot::pair(); - promise_filler.fill((work)()); - promise - } - - pub(super) const IO_QUEUE: () = (); - pub(super) const BLOCKING_QUEUE: () = (); - pub(super) const SNAPSHOT_QUEUE: () = (); - pub(super) const TRUNCATE_QUEUE: () = (); -} - -use queue::spawn_to; - -pub fn truncate(config: crate::RunningConfig, at: u64) -> OneShot> { - spawn_to( - move || { - log::debug!("truncating file to length {}", at); - let ret: Result<()> = config - .file - .set_len(at) - .and_then(|_| config.file.sync_all()) - .map_err(Into::into); - - if let Err(e) = &ret { - config.set_global_error(*e); - } - - ret - }, - &queue::TRUNCATE_QUEUE, - ) -} - -pub fn take_fuzzy_snapshot(pc: crate::pagecache::PageCache) -> OneShot<()> { - spawn_to( - move || { - if let Err(e) = pc.take_fuzzy_snapshot() { - log::error!("failed to write snapshot: {:?}", e); - pc.log.iobufs.set_global_error(e); - } - }, - &queue::SNAPSHOT_QUEUE, - ) -} - -pub(crate) fn write_to_log( - iobuf: Arc, - iobufs: Arc, -) -> OneShot<()> { - spawn_to( - move || { - let lsn = iobuf.lsn; - if let Err(e) = iobufs.write_to_log(iobuf) { - log::error!( - "hit error while writing iobuf with lsn {}: {:?}", - lsn, - e - ); - - // store error before notifying so that waiting threads will see - // it - iobufs.set_global_error(e); - } - }, - &queue::IO_QUEUE, - ) -} diff --git a/src/transaction.rs b/src/transaction.rs deleted file mode 100644 index a2bd6e69f..000000000 --- a/src/transaction.rs +++ /dev/null @@ -1,661 +0,0 @@ -//! Fully serializable (ACID) multi-`Tree` transactions. -//! -//! sled transactions are **optimistic** which means that -//! they may re-run in cases where conflicts are detected. -//! Do not perform IO or interact with state outside -//! of the closure unless it is idempotent, because -//! it may re-run several times. -//! -//! # Examples -//! ``` -//! # use sled::{transaction::TransactionResult, Config}; -//! # fn main() -> TransactionResult<()> { -//! -//! let config = Config::new().temporary(true); -//! let db1 = config.open().unwrap(); -//! let db = db1.open_tree(b"a").unwrap(); -//! -//! // Use write-only transactions as a writebatch: -//! db.transaction(|db| { -//! db.insert(b"k1", b"cats")?; -//! db.insert(b"k2", b"dogs")?; -//! Ok(()) -//! })?; -//! -//! // Atomically swap two items: -//! db.transaction(|db| { -//! let v1_option = db.remove(b"k1")?; -//! let v1 = v1_option.unwrap(); -//! let v2_option = db.remove(b"k2")?; -//! let v2 = v2_option.unwrap(); -//! -//! db.insert(b"k1", v2)?; -//! db.insert(b"k2", v1)?; -//! -//! Ok(()) -//! })?; -//! -//! assert_eq!(&db.get(b"k1")?.unwrap(), b"dogs"); -//! assert_eq!(&db.get(b"k2")?.unwrap(), b"cats"); -//! # Ok(()) -//! # } -//! ``` -//! -//! Transactions also work on tuples of `Tree`s, -//! preserving serializable ACID semantics! -//! In this example, we treat two trees like a -//! work queue, atomically apply updates to -//! data and move them from the unprocessed `Tree` -//! to the processed `Tree`. -//! -//! ``` -//! # use sled::{transaction::{TransactionResult, Transactional}, Config}; -//! # fn main() -> TransactionResult<()> { -//! -//! let config = Config::new().temporary(true); -//! let db = config.open().unwrap(); -//! -//! let unprocessed = db.open_tree(b"unprocessed items").unwrap(); -//! let processed = db.open_tree(b"processed items").unwrap(); -//! -//! // An update somehow gets into the tree, which we -//! // later trigger the atomic processing of. -//! unprocessed.insert(b"k3", b"ligers").unwrap(); -//! -//! 
// Atomically process the new item and move it -//! // between `Tree`s. -//! (&unprocessed, &processed) -//! .transaction(|(unprocessed, processed)| { -//! let unprocessed_item = unprocessed.remove(b"k3")?.unwrap(); -//! let mut processed_item = b"yappin' ".to_vec(); -//! processed_item.extend_from_slice(&unprocessed_item); -//! processed.insert(b"k3", processed_item)?; -//! Ok(()) -//! })?; -//! -//! assert_eq!(unprocessed.get(b"k3").unwrap(), None); -//! assert_eq!(&processed.get(b"k3").unwrap().unwrap(), b"yappin' ligers"); -//! # Ok(()) -//! # } -//! ``` -#![allow(clippy::module_name_repetitions)] -use std::{cell::RefCell, fmt, rc::Rc}; - -use crate::{ - concurrency_control, pin, Batch, Error, Event, Guard, IVec, Map, Protector, - Result, Tree, -}; - -/// A transaction that will -/// be applied atomically to the -/// Tree. -#[derive(Clone)] -pub struct TransactionalTree { - pub(super) tree: Tree, - pub(super) writes: Rc>, - pub(super) read_cache: Rc>>>, - pub(super) flush_on_commit: Rc>, -} - -/// An error type that is returned from the closure -/// passed to the `transaction` method. -#[derive(Debug, Clone, Copy, PartialEq, Eq)] -pub enum UnabortableTransactionError { - /// An internal conflict has occurred and the `transaction` method will - /// retry the passed-in closure until it succeeds. This should never be - /// returned directly from the user's closure, as it will create an - /// infinite loop that never returns. This is why it is hidden. - Conflict, - /// A serious underlying storage issue has occurred that requires - /// attention from an operator or a remediating system, such as - /// corruption. - Storage(Error), -} - -impl fmt::Display for UnabortableTransactionError { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - use UnabortableTransactionError::*; - match self { - Conflict => write!(f, "Conflict during transaction"), - Storage(e) => e.fmt(f), - } - } -} - -impl std::error::Error for UnabortableTransactionError { - fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { - match self { - UnabortableTransactionError::Storage(ref e) => Some(e), - _ => None, - } - } -} - -pub(crate) type UnabortableTransactionResult = - std::result::Result; - -impl From for UnabortableTransactionError { - fn from(error: Error) -> Self { - UnabortableTransactionError::Storage(error) - } -} - -impl From for ConflictableTransactionError { - fn from(error: UnabortableTransactionError) -> Self { - match error { - UnabortableTransactionError::Conflict => { - ConflictableTransactionError::Conflict - } - UnabortableTransactionError::Storage(error2) => { - ConflictableTransactionError::Storage(error2) - } - } - } -} - -/// An error type that is returned from the closure -/// passed to the `transaction` method. -#[derive(Debug, Clone, PartialEq, Eq)] -pub enum ConflictableTransactionError { - /// A user-provided error type that indicates the transaction should abort. - /// This is passed into the return value of `transaction` as a direct Err - /// instance, rather than forcing users to interact with this enum - /// directly. - Abort(T), - #[doc(hidden)] - /// An internal conflict has occurred and the `transaction` method will - /// retry the passed-in closure until it succeeds. This should never be - /// returned directly from the user's closure, as it will create an - /// infinite loop that never returns. This is why it is hidden. 
- Conflict, - /// A serious underlying storage issue has occurred that requires - /// attention from an operator or a remediating system, such as - /// corruption. - Storage(Error), -} - -impl fmt::Display for ConflictableTransactionError { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - use ConflictableTransactionError::*; - match self { - Abort(e) => e.fmt(f), - Conflict => write!(f, "Conflict during transaction"), - Storage(e) => e.fmt(f), - } - } -} - -impl std::error::Error - for ConflictableTransactionError -{ - fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { - match self { - ConflictableTransactionError::Storage(ref e) => Some(e), - _ => None, - } - } -} - -/// An error type that is returned from the closure -/// passed to the `transaction` method. -#[derive(Debug, Clone, PartialEq, Eq)] -pub enum TransactionError { - /// A user-provided error type that indicates the transaction should abort. - /// This is passed into the return value of `transaction` as a direct Err - /// instance, rather than forcing users to interact with this enum - /// directly. - Abort(T), - /// A serious underlying storage issue has occurred that requires - /// attention from an operator or a remediating system, such as - /// corruption. - Storage(Error), -} - -impl fmt::Display for TransactionError { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - use TransactionError::*; - match self { - Abort(e) => e.fmt(f), - Storage(e) => e.fmt(f), - } - } -} - -impl std::error::Error for TransactionError { - fn source(&self) -> Option<&(dyn std::error::Error + 'static)> { - match self { - TransactionError::Storage(ref e) => Some(e), - _ => None, - } - } -} - -/// A transaction-related `Result` which is used for transparently handling -/// concurrency-related conflicts when running transaction closures. -pub type ConflictableTransactionResult = - std::result::Result>; - -impl From for ConflictableTransactionError { - fn from(error: Error) -> Self { - ConflictableTransactionError::Storage(error) - } -} - -/// A transaction-related `Result` which is used for returning the -/// final result of a transaction after potentially running the provided -/// closure several times due to underlying conflicts. -pub type TransactionResult = - std::result::Result>; - -impl From for TransactionError { - fn from(error: Error) -> Self { - TransactionError::Storage(error) - } -} - -impl TransactionalTree { - /// Set a key to a new value - pub fn insert( - &self, - key: K, - value: V, - ) -> UnabortableTransactionResult> - where - K: AsRef<[u8]> + Into, - V: Into, - { - let old = self.get(key.as_ref())?; - let mut writes = self.writes.borrow_mut(); - writes.insert(key, value.into()); - Ok(old) - } - - /// Remove a key - pub fn remove( - &self, - key: K, - ) -> UnabortableTransactionResult> - where - K: AsRef<[u8]> + Into, - { - let old = self.get(key.as_ref()); - let mut writes = self.writes.borrow_mut(); - writes.remove(key); - old - } - - /// Get the value associated with a key - pub fn get>( - &self, - key: K, - ) -> UnabortableTransactionResult> { - let writes = self.writes.borrow(); - if let Some(first_try) = writes.get(key.as_ref()) { - return Ok(first_try.cloned()); - } - let mut reads = self.read_cache.borrow_mut(); - if let Some(second_try) = reads.get(key.as_ref()) { - return Ok(second_try.clone()); - } - - // not found in a cache, need to hit the backing db - let mut guard = pin(); - let get = loop { - if let Ok(get) = self.tree.get_inner(key.as_ref(), &mut guard)? 
{ - break get; - } - }; - let last = reads.insert(key.as_ref().into(), get.clone()); - assert!(last.is_none()); - - Ok(get) - } - - /// Atomically apply multiple inserts and removals. - pub fn apply_batch( - &self, - batch: &Batch, - ) -> UnabortableTransactionResult<()> { - for (k, v_opt) in &batch.writes { - if let Some(v) = v_opt { - let _old = self.insert(k, v)?; - } else { - let _old = self.remove(k)?; - } - } - Ok(()) - } - - /// Flush the database before returning from the transaction. - pub fn flush(&self) { - *self.flush_on_commit.borrow_mut() = true; - } - - /// Generate a monotonic ID. Not guaranteed to be - /// contiguous or idempotent, can produce different values in the - /// same transaction in case of conflicts. - /// Written to disk every `idgen_persist_interval` - /// operations, followed by a blocking flush. During recovery, we - /// take the last recovered generated ID and add 2x - /// the `idgen_persist_interval` to it. While persisting, if the - /// previous persisted counter wasn't synced to disk yet, we will do - /// a blocking flush to fsync the latest counter, ensuring - /// that we will never give out the same counter twice. - pub fn generate_id(&self) -> Result { - self.tree.context.pagecache.generate_id_inner() - } - - fn unstage(&self) { - unimplemented!() - } - - const fn validate(&self) -> bool { - true - } - - fn commit(&self, event: Event) -> Result<()> { - let writes = std::mem::take(&mut *self.writes.borrow_mut()); - let mut guard = pin(); - self.tree.apply_batch_inner(writes, Some(event), &mut guard) - } - - fn from_tree(tree: &Tree) -> Self { - Self { - tree: tree.clone(), - writes: Default::default(), - read_cache: Default::default(), - flush_on_commit: Default::default(), - } - } -} - -/// A type which allows for pluggable transactional capabilities -pub struct TransactionalTrees { - inner: Vec, -} - -impl TransactionalTrees { - fn stage(&self) -> Protector<'_> { - concurrency_control::write() - } - - fn unstage(&self) { - for tree in &self.inner { - tree.unstage(); - } - } - - fn validate(&self) -> bool { - for tree in &self.inner { - if !tree.validate() { - return false; - } - } - true - } - - fn commit(&self, guard: &Guard) -> Result<()> { - let peg = self.inner[0].tree.context.pin_log(guard)?; - - let batches = self - .inner - .iter() - .map(|tree| (tree.tree.clone(), tree.writes.borrow().clone())) - .collect(); - - let event = Event::from_batches(batches); - - for tree in &self.inner { - tree.commit(event.clone())?; - } - - // when the peg drops, it ensures all updates - // written to the log since its creation are - // recovered atomically - peg.seal_batch() - } - - fn flush_if_configured(&self) -> Result<()> { - let mut should_flush = None; - - for tree in &self.inner { - if *tree.flush_on_commit.borrow() { - should_flush = Some(tree); - break; - } - } - - if let Some(tree) = should_flush { - tree.tree.flush()?; - } - Ok(()) - } -} - -/// A simple constructor for `Err(TransactionError::Abort(_))` -pub const fn abort(t: T) -> ConflictableTransactionResult { - Err(ConflictableTransactionError::Abort(t)) -} - -/// A type that may be transacted on in sled transactions. -pub trait Transactional { - /// An internal reference to an internal proxy type that - /// mediates transactional reads and writes. - type View; - - /// An internal function for creating a top-level - /// transactional structure. 
- fn make_overlay(&self) -> Result; - - /// An internal function for viewing the transactional - /// subcomponents based on the top-level transactional - /// structure. - fn view_overlay(overlay: &TransactionalTrees) -> Self::View; - - /// Runs a transaction, possibly retrying the passed-in closure if - /// a concurrent conflict is detected that would cause a violation - /// of serializability. This is the only trait method that - /// you're most likely to use directly. - fn transaction(&self, f: F) -> TransactionResult - where - F: Fn(&Self::View) -> ConflictableTransactionResult, - { - loop { - let tt = self.make_overlay()?; - let view = Self::view_overlay(&tt); - - // NB locks must exist until this function returns. - let locks = tt.stage(); - let ret = f(&view); - if !tt.validate() { - tt.unstage(); - continue; - } - match ret { - Ok(r) => { - let guard = pin(); - tt.commit(&guard)?; - drop(locks); - tt.flush_if_configured()?; - return Ok(r); - } - Err(ConflictableTransactionError::Abort(e)) => { - return Err(TransactionError::Abort(e)); - } - Err(ConflictableTransactionError::Conflict) => continue, - Err(ConflictableTransactionError::Storage(other)) => { - return Err(TransactionError::Storage(other)); - } - } - } - } -} - -impl Transactional for &Tree { - type View = TransactionalTree; - - fn make_overlay(&self) -> Result { - Ok(TransactionalTrees { - inner: vec![TransactionalTree::from_tree(self)], - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - overlay.inner[0].clone() - } -} - -impl Transactional for &&Tree { - type View = TransactionalTree; - - fn make_overlay(&self) -> Result { - Ok(TransactionalTrees { - inner: vec![TransactionalTree::from_tree(*self)], - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - overlay.inner[0].clone() - } -} - -impl Transactional for Tree { - type View = TransactionalTree; - - fn make_overlay(&self) -> Result { - Ok(TransactionalTrees { - inner: vec![TransactionalTree::from_tree(self)], - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - overlay.inner[0].clone() - } -} - -impl Transactional for [Tree] { - type View = Vec; - - fn make_overlay(&self) -> Result { - let same_db = self.windows(2).all(|w| { - let path_1 = w[0].context.get_path(); - let path_2 = w[1].context.get_path(); - path_1 == path_2 - }); - if !same_db { - return Err(Error::Unsupported( - "cannot use trees from multiple \ - databases in the same transaction", - )); - } - - Ok(TransactionalTrees { - inner: self.iter().map(TransactionalTree::from_tree).collect(), - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - overlay.inner.clone() - } -} - -impl Transactional for [&Tree] { - type View = Vec; - - fn make_overlay(&self) -> Result { - let same_db = self.windows(2).all(|w| { - let path_1 = w[0].context.get_path(); - let path_2 = w[1].context.get_path(); - path_1 == path_2 - }); - if !same_db { - return Err(Error::Unsupported( - "cannot use trees from multiple \ - databases in the same transaction", - )); - } - - Ok(TransactionalTrees { - inner: self - .iter() - .map(|&t| TransactionalTree::from_tree(t)) - .collect(), - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - overlay.inner.clone() - } -} - -macro_rules! 
repeat_type { - ($t:ty, ($literal:literal)) => { - ($t,) - }; - ($t:ty, ($($literals:literal),+)) => { - repeat_type!(IMPL $t, (), ($($literals),*)) - }; - (IMPL $t:ty, (), ($first:literal, $($rest:literal),*)) => { - repeat_type!(IMPL $t, ($t), ($($rest),*)) - }; - (IMPL $t:ty, ($($partial:tt),*), ($first:literal, $($rest:literal),*)) => { - repeat_type!(IMPL $t, ($t, $($partial),*), ($($rest),*)) - }; - (IMPL $t:ty, ($($partial:tt),*), ($last:literal)) => { - ($($partial),*, $t) - }; -} - -macro_rules! impl_transactional_tuple_trees { - ($($indices:tt),+) => { - impl Transactional for repeat_type!(&Tree, ($($indices),+)) { - type View = repeat_type!(TransactionalTree, ($($indices),+)); - - fn make_overlay(&self) -> Result { - let paths = vec![ - $( - self.$indices.context.get_path(), - )+ - ]; - if !paths.windows(2).all(|w| { - w[0] == w[1] - }) { - return Err(Error::Unsupported( - "cannot use trees from multiple databases in the same transaction".into(), - )); - } - - Ok(TransactionalTrees { - inner: vec![ - $( - TransactionalTree::from_tree(self.$indices) - ),+ - ], - }) - } - - fn view_overlay(overlay: &TransactionalTrees) -> Self::View { - ( - $( - overlay.inner[$indices].clone() - ),+, - ) - } - } - }; -} - -impl_transactional_tuple_trees!(0); -impl_transactional_tuple_trees!(0, 1); -impl_transactional_tuple_trees!(0, 1, 2); -impl_transactional_tuple_trees!(0, 1, 2, 3); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8, 9); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12); -impl_transactional_tuple_trees!(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13); diff --git a/src/tree.rs b/src/tree.rs index 69718eae9..f0a3474c6 100644 --- a/src/tree.rs +++ b/src/tree.rs @@ -1,461 +1,666 @@ -use std::{ - borrow::Cow, - fmt::{self, Debug}, - num::NonZeroU64, - ops::{self, Deref, RangeBounds}, - sync::atomic::Ordering::SeqCst, +use std::collections::{BTreeMap, VecDeque}; +use std::fmt; +use std::hint; +use std::io; +use std::mem::{self, ManuallyDrop}; +use std::ops; +use std::ops::Bound; +use std::ops::RangeBounds; +use std::sync::atomic::Ordering; +use std::sync::Arc; +use std::time::Instant; + +use concurrent_map::Minimum; +use fault_injection::annotate; +use inline_array::InlineArray; +use parking_lot::{ + lock_api::{ArcRwLockReadGuard, ArcRwLockWriteGuard}, + RawRwLock, }; -use parking_lot::RwLock; +use crate::*; -use crate::{atomic_shim::AtomicU64, pagecache::NodeView, *}; +#[derive(Clone)] +pub struct Tree { + collection_id: CollectionId, + cache: ObjectCache, + pub(crate) index: Index, + _shutdown_dropper: Arc>, +} -#[derive(Debug, Clone)] -pub(crate) struct View<'g> { - pub node_view: NodeView<'g>, - pub pid: PageId, +impl Drop for Tree { + fn drop(&mut self) { + if self.cache.config.flush_every_ms.is_none() { + if let Err(e) = self.flush() { + log::error!("failed to flush Db on Drop: {e:?}"); + } + } else { + // otherwise, it is expected that the flusher thread will + // flush while shutting down the final Db/Tree instance + } + } } -impl<'g> Deref for View<'g> { - type Target = Node; +impl fmt::Debug for Tree { + fn fmt(&self, w: &mut 
fmt::Formatter<'_>) -> fmt::Result { + let alternate = w.alternate(); + + let mut debug_struct = w.debug_struct(&format!("Db<{}>", LEAF_FANOUT)); - fn deref(&self) -> &Node { - &*self.node_view + if alternate { + debug_struct + .field("global_error", &self.check_error()) + .field( + "data", + &format!("{:?}", self.iter().collect::>()), + ) + .finish() + } else { + debug_struct.field("global_error", &self.check_error()).finish() + } } } -impl IntoIterator for &'_ Tree { - type Item = Result<(IVec, IVec)>; - type IntoIter = Iter; +#[must_use] +struct LeafReadGuard<'a, const LEAF_FANOUT: usize = 1024> { + leaf_read: + ManuallyDrop>>, + low_key: InlineArray, + inner: &'a Tree, + object_id: ObjectId, + external_cache_access_and_eviction: bool, +} + +impl<'a, const LEAF_FANOUT: usize> Drop for LeafReadGuard<'a, LEAF_FANOUT> { + fn drop(&mut self) { + let size = self.leaf_read.leaf.as_ref().unwrap().in_memory_size; + // we must drop our mutex before calling mark_access_and_evict + unsafe { + ManuallyDrop::drop(&mut self.leaf_read); + } + if self.external_cache_access_and_eviction { + return; + } + + let current_epoch = self.inner.cache.current_flush_epoch(); - fn into_iter(self) -> Iter { - self.iter() + if let Err(e) = self.inner.cache.mark_access_and_evict( + self.object_id, + size, + current_epoch, + ) { + self.inner.set_error(&e); + log::error!( + "io error while paging out dirty data: {:?} \ + for guard of leaf with low key {:?}", + e, + self.low_key + ); + } } } -const fn out_of_bounds(numba: usize) -> bool { - numba > MAX_BLOB +struct LeafWriteGuard<'a, const LEAF_FANOUT: usize = 1024> { + leaf_write: + ManuallyDrop>>, + flush_epoch_guard: FlushEpochGuard<'a>, + low_key: InlineArray, + inner: &'a Tree, + node: Object, + external_cache_access_and_eviction: bool, } -#[cold] -const fn bounds_error() -> Result<()> { - Err(Error::Unsupported( - "Keys and values are limited to \ - 128gb on 64-bit platforms and - 512mb on 32-bit platforms." - )) -} +impl<'a, const LEAF_FANOUT: usize> LeafWriteGuard<'a, LEAF_FANOUT> { + fn epoch(&self) -> FlushEpoch { + self.flush_epoch_guard.epoch() + } -/// A flash-sympathetic persistent lock-free B+ tree. -/// -/// A `Tree` represents a single logical keyspace / namespace / bucket. -/// -/// Separate `Trees` may be opened to separate concerns using -/// `Db::open_tree`. -/// -/// `Db` implements `Deref` such that a `Db` acts -/// like the "default" `Tree`. This is the only `Tree` that cannot -/// be deleted via `Db::drop_tree`. -/// -/// When a `Db` or `Tree` is dropped, `flush` is called to attempt -/// to flush all buffered writes to disk. -/// -/// # Examples -/// -/// ``` -/// # fn main() -> Result<(), Box> { -/// use sled::IVec; -/// -/// # let _ = std::fs::remove_dir_all("db"); -/// let db: sled::Db = sled::open("db")?; -/// db.insert(b"yo!", b"v1".to_vec()); -/// assert_eq!(db.get(b"yo!"), Ok(Some(IVec::from(b"v1")))); -/// -/// // Atomic compare-and-swap. -/// db.compare_and_swap( -/// b"yo!", // key -/// Some(b"v1"), // old value, None for not present -/// Some(b"v2"), // new value, None for delete -/// )?; -/// -/// // Iterates over key-value pairs, starting at the given key. 
-/// let scan_key: &[u8] = b"a non-present key before yo!"; -/// let mut iter = db.range(scan_key..); -/// assert_eq!( -/// iter.next().unwrap(), -/// Ok((IVec::from(b"yo!"), IVec::from(b"v2"))) -/// ); -/// assert_eq!(iter.next(), None); -/// -/// db.remove(b"yo!"); -/// assert_eq!(db.get(b"yo!"), Ok(None)); -/// -/// let other_tree: sled::Tree = db.open_tree(b"cool db facts")?; -/// other_tree.insert( -/// b"k1", -/// &b"a Db acts like a Tree due to implementing Deref"[..] -/// )?; -/// # let _ = std::fs::remove_dir_all("db"); -/// # Ok(()) } -/// ``` -#[derive(Clone)] -#[doc(alias = "keyspace")] -#[doc(alias = "bucket")] -#[doc(alias = "table")] -pub struct Tree(pub(crate) Arc); - -#[allow(clippy::module_name_repetitions)] -pub struct TreeInner { - pub(crate) tree_id: IVec, - pub(crate) context: Context, - pub(crate) subscribers: Subscribers, - pub(crate) root: AtomicU64, - pub(crate) merge_operator: RwLock>>, + fn handle_cache_access_and_eviction_externally( + mut self, + ) -> (ObjectId, usize) { + self.external_cache_access_and_eviction = true; + ( + self.node.object_id, + self.leaf_write.leaf.as_ref().unwrap().in_memory_size, + ) + } } -impl Drop for TreeInner { +impl<'a, const LEAF_FANOUT: usize> Drop for LeafWriteGuard<'a, LEAF_FANOUT> { fn drop(&mut self) { - // Flush the underlying system in a loop until we - // have flushed all dirty data. - loop { - match self.context.pagecache.flush() { - Ok(0) => return, - Ok(_) => continue, - Err(e) => { - error!("failed to flush data to disk: {:?}", e); - return; - } - } + let size = self.leaf_write.leaf.as_ref().unwrap().in_memory_size; + + // we must drop our mutex before calling mark_access_and_evict + unsafe { + ManuallyDrop::drop(&mut self.leaf_write); + } + if self.external_cache_access_and_eviction { + return; + } + + if let Err(e) = self.inner.cache.mark_access_and_evict( + self.node.object_id, + size, + self.epoch(), + ) { + self.inner.set_error(&e); + log::error!("io error while paging out dirty data: {:?}", e); } } } -impl Deref for Tree { - type Target = TreeInner; +impl Tree { + pub(crate) fn new( + collection_id: CollectionId, + cache: ObjectCache, + index: Index, + _shutdown_dropper: Arc>, + ) -> Tree { + Tree { collection_id, cache, index, _shutdown_dropper } + } - fn deref(&self) -> &TreeInner { - &self.0 + // This is only pub for an extra assertion during testing. + #[doc(hidden)] + pub fn check_error(&self) -> io::Result<()> { + self.cache.check_error() } -} -impl Tree { - /// Insert a key to a new value, returning the last value if it - /// was set. + fn set_error(&self, error: &io::Error) { + self.cache.set_error(error) + } + + pub fn storage_stats(&self) -> Stats { + Stats { cache: self.cache.stats() } + } + + /// Synchronously flushes all dirty IO buffers and calls + /// fsync. If this succeeds, it is guaranteed that all + /// previous writes will be recovered if the system + /// crashes. Returns the number of bytes flushed during + /// this call. /// - /// # Examples + /// Flushing can take quite a lot of time, and you should + /// measure the performance impact of using it on + /// realistic sustained workloads running on realistic + /// hardware. 
/// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// assert_eq!(db.insert(&[1, 2, 3], vec![0]), Ok(None)); - /// assert_eq!(db.insert(&[1, 2, 3], vec![1]), Ok(Some(sled::IVec::from(&[0])))); - /// # Ok(()) } - /// ``` - #[doc(alias = "set")] - pub fn insert(&self, key: K, value: V) -> Result> - where - K: AsRef<[u8]>, - V: Into, - { - let value_ivec = value.into(); - let mut guard = pin(); - let _cc = concurrency_control::read(); - loop { - trace!("setting key {:?}", key.as_ref()); - if let Ok(res) = self.insert_inner( - key.as_ref(), - Some(value_ivec.clone()), - false, - &mut guard, - )? { - return Ok(res); - } - } + /// This is called automatically on drop of the last open Db + /// instance. + pub fn flush(&self) -> io::Result { + self.cache.flush() } - pub(crate) fn insert_inner( + pub(crate) fn page_in( &self, key: &[u8], - value: Option, - is_transactional: bool, - guard: &mut Guard, - ) -> Result>> { - #[cfg(feature = "metrics")] - let _measure = if value.is_some() { - Measure::new(&M.tree_set) - } else { - Measure::new(&M.tree_del) - }; + flush_epoch: FlushEpoch, + ) -> io::Result<( + InlineArray, + ArcRwLockWriteGuard>, + Object, + )> { + let before_read_io = Instant::now(); + + let mut read_loops: u64 = 0; + loop { + let _heap_pin = self.cache.heap_object_id_pin(); - if out_of_bounds(key.len()) { - bounds_error()?; - } + let (low_key, node) = self.index.get_lte(key).unwrap(); + if node.collection_id != self.collection_id { + log::trace!("retry due to mismatched collection id in page_in"); - let View { node_view, pid, .. } = - self.view_for_key(key.as_ref(), guard)?; + hint::spin_loop(); - let mut subscriber_reservation = if is_transactional { - None - } else { - Some(self.subscribers.reserve(key)) - }; + continue; + } - let (encoded_key, last_value) = node_view.node_kv_pair(key.as_ref()); - let last_value_ivec = last_value.map(IVec::from); + let mut write = node.inner.write_arc(); + if write.leaf.is_none() { + self.cache + .read_stats + .cache_misses + .fetch_add(1, Ordering::Relaxed); + + let leaf_bytes = + if let Some(read_res) = self.cache.read(node.object_id) { + match read_res { + Ok(buf) => buf, + Err(e) => return Err(annotate!(e)), + } + } else { + // this particular object ID is not present + read_loops += 1; + // TODO change this assertion + debug_assert!( + read_loops < 10_000_000_000, + "search key: {:?} node key: {:?} object id: {:?}", + key, + low_key, + node.object_id + ); - if value == last_value_ivec { - // NB: always broadcast event - if let Some(Some(res)) = subscriber_reservation.take() { - let event = subscriber::Event::single_update( - self.clone(), - key.as_ref().into(), - value, - ); + hint::spin_loop(); - res.complete(&event); + continue; + }; + + let read_io_latency_us = + u64::try_from(before_read_io.elapsed().as_micros()) + .unwrap(); + self.cache + .read_stats + .max_read_io_latency_us + .fetch_max(read_io_latency_us, Ordering::Relaxed); + self.cache + .read_stats + .sum_read_io_latency_us + .fetch_add(read_io_latency_us, Ordering::Relaxed); + + let before_deserialization = Instant::now(); + + let leaf: Box> = + Leaf::deserialize(&leaf_bytes).unwrap(); + + if leaf.lo != low_key { + // TODO determine why this rare situation occurs and better + // understand whether it is really benign. 
+ log::trace!("mismatch between object key and leaf low"); + hint::spin_loop(); + continue; + } + + let deserialization_latency_us = + u64::try_from(before_deserialization.elapsed().as_micros()) + .unwrap(); + self.cache + .read_stats + .max_deserialization_latency_us + .fetch_max(deserialization_latency_us, Ordering::Relaxed); + self.cache + .read_stats + .sum_deserialization_latency_us + .fetch_add(deserialization_latency_us, Ordering::Relaxed); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + node.object_id, + FlushEpoch::MIN, + event_verifier::State::CleanPagedIn, + concat!(file!(), ':', line!(), ":page-in"), + ); + } + + write.leaf = Some(leaf); + } else { + self.cache + .read_stats + .cache_hits + .fetch_add(1, Ordering::Relaxed); } + let leaf = write.leaf.as_mut().unwrap(); - // short-circuit a no-op set or delete - return Ok(Ok(last_value_ivec)); - } + if leaf.deleted.is_some() { + log::trace!("retry due to deleted node in page_in"); + drop(write); + + hint::spin_loop(); - let frag = if let Some(value_ivec) = value.clone() { - if out_of_bounds(value_ivec.len()) { - bounds_error()?; + continue; } - Link::Set(encoded_key, value_ivec) - } else { - Link::Del(encoded_key) - }; - let link = - self.context.pagecache.link(pid, node_view.0, frag, guard)?; + if &*leaf.lo > key { + let size = leaf.in_memory_size; + drop(write); + log::trace!("key undershoot in page_in"); + self.cache.mark_access_and_evict( + node.object_id, + size, + flush_epoch, + )?; - if link.is_ok() { - // success - if let Some(Some(res)) = subscriber_reservation.take() { - let event = subscriber::Event::single_update( - self.clone(), - key.as_ref().into(), - value, - ); + hint::spin_loop(); - res.complete(&event); + continue; } - Ok(Ok(last_value_ivec)) - } else { - #[cfg(feature = "metrics")] - M.tree_looped(); - Ok(Err(Conflict)) + if let Some(ref hi) = leaf.hi { + if &**hi <= key { + let size = leaf.in_memory_size; + drop(write); + log::trace!("key overshoot in page_in"); + self.cache.mark_access_and_evict( + node.object_id, + size, + flush_epoch, + )?; + + hint::spin_loop(); + + continue; + } + } + return Ok((low_key, write, node)); } } - /// Perform a multi-key serializable transaction. - /// - /// sled transactions are **optimistic** which means that - /// they may re-run in cases where conflicts are detected. - /// Do not perform IO or interact with state outside - /// of the closure unless it is idempotent, because - /// it may re-run several times. 
- /// - /// # Examples - /// - /// ``` - /// # use sled::{transaction::TransactionResult, Config}; - /// # fn main() -> TransactionResult<()> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// // Use write-only transactions as a writebatch: - /// db.transaction(|tx_db| { - /// tx_db.insert(b"k1", b"cats")?; - /// tx_db.insert(b"k2", b"dogs")?; - /// Ok(()) - /// })?; - /// - /// // Atomically swap two items: - /// db.transaction(|tx_db| { - /// let v1_option = tx_db.remove(b"k1")?; - /// let v1 = v1_option.unwrap(); - /// let v2_option = tx_db.remove(b"k2")?; - /// let v2 = v2_option.unwrap(); - /// - /// tx_db.insert(b"k1", v2)?; - /// tx_db.insert(b"k2", v1)?; - /// - /// Ok(()) - /// })?; - /// - /// assert_eq!(&db.get(b"k1")?.unwrap(), b"dogs"); - /// assert_eq!(&db.get(b"k2")?.unwrap(), b"cats"); - /// # Ok(()) - /// # } - /// ``` - /// - /// A transaction may return information from - /// an intentionally-cancelled transaction by using - /// the abort function inside the closure in - /// combination with the try operator. - /// - /// ``` - /// use sled::{transaction::{abort, TransactionError, TransactionResult}, Config}; - /// - /// #[derive(Debug, PartialEq)] - /// struct MyBullshitError; - /// - /// fn main() -> TransactionResult<(), MyBullshitError> { - /// let config = Config::new().temporary(true); - /// let db = config.open()?; - /// - /// // Use write-only transactions as a writebatch: - /// let res = db.transaction(|tx_db| { - /// tx_db.insert(b"k1", b"cats")?; - /// tx_db.insert(b"k2", b"dogs")?; - /// // aborting will cause all writes to roll-back. - /// if true { - /// abort(MyBullshitError)?; - /// } - /// Ok(42) - /// }).unwrap_err(); - /// - /// assert_eq!(res, TransactionError::Abort(MyBullshitError)); - /// assert_eq!(db.get(b"k1")?, None); - /// assert_eq!(db.get(b"k2")?, None); - /// - /// Ok(()) - /// } - /// ``` - /// - /// - /// Transactions also work on tuples of `Tree`s, - /// preserving serializable ACID semantics! - /// In this example, we treat two trees like a - /// work queue, atomically apply updates to - /// data and move them from the unprocessed `Tree` - /// to the processed `Tree`. - /// - /// ``` - /// # use sled::transaction::TransactionResult; - /// # fn main() -> TransactionResult<()> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::Transactional; - /// - /// let unprocessed = db.open_tree(b"unprocessed items")?; - /// let processed = db.open_tree(b"processed items")?; - /// - /// // An update somehow gets into the tree, which we - /// // later trigger the atomic processing of. - /// unprocessed.insert(b"k3", b"ligers")?; - /// - /// // Atomically process the new item and move it - /// // between `Tree`s. 
- /// (&unprocessed, &processed) - /// .transaction(|(tx_unprocessed, tx_processed)| { - /// let unprocessed_item = tx_unprocessed.remove(b"k3")?.unwrap(); - /// let mut processed_item = b"yappin' ".to_vec(); - /// processed_item.extend_from_slice(&unprocessed_item); - /// tx_processed.insert(b"k3", processed_item)?; - /// Ok(()) - /// })?; - /// - /// assert_eq!(unprocessed.get(b"k3").unwrap(), None); - /// assert_eq!(&processed.get(b"k3").unwrap().unwrap(), b"yappin' ligers"); - /// # Ok(()) } - /// ``` - pub fn transaction( - &self, - f: F, - ) -> transaction::TransactionResult - where - F: Fn( - &transaction::TransactionalTree, - ) -> transaction::ConflictableTransactionResult, - { - Transactional::transaction(&self, f) - } + // NB: must never be called without having already added the empty leaf + // operations to a normal flush epoch. This function acquires the lock + // for the left sibling so that the empty leaf's hi key can be given + // to the left sibling, but for this to happen atomically, the act of + // moving left must "happen" in the same flush epoch. By "pushing" the + // merge left potentially into a future flush epoch, any deletions that the + // leaf had applied that may have been a part of a previous batch would also + // be pushed into the future flush epoch, which would break the crash + // atomicity of the batch if the updates were not flushed in the same epoch + // as the rest of the batch. So, this is why we potentially separate the + // flush of the left merge from the flush of the operations that caused + // the leaf to empty in the first place. + fn merge_leaf_into_left_sibling<'a>( + &'a self, + mut leaf_guard: LeafWriteGuard<'a, LEAF_FANOUT>, + ) -> io::Result<()> { + let mut predecessor_guard = self.predecessor_leaf_mut(&leaf_guard)?; + + assert!(predecessor_guard.epoch() >= leaf_guard.epoch()); + + let merge_epoch = leaf_guard.epoch().max(predecessor_guard.epoch()); + + let leaf = leaf_guard.leaf_write.leaf.as_mut().unwrap(); + let predecessor = predecessor_guard.leaf_write.leaf.as_mut().unwrap(); + + assert!(leaf.deleted.is_none()); + assert!(leaf.data.is_empty()); + assert!(predecessor.deleted.is_none()); + assert_eq!(predecessor.hi.as_ref(), Some(&leaf.lo)); + + log::trace!( + "deleting empty node id {} with low key {:?} and high key {:?}", + leaf_guard.node.object_id.0, + leaf.lo, + leaf.hi + ); - /// Create a new batched update that can be - /// atomically applied. - /// - /// It is possible to apply a `Batch` in a transaction - /// as well, which is the way you can apply a `Batch` - /// to multiple `Tree`s atomically. - /// - /// # Examples - /// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// db.insert("key_0", "val_0")?; - /// - /// let mut batch = sled::Batch::default(); - /// batch.insert("key_a", "val_a"); - /// batch.insert("key_b", "val_b"); - /// batch.insert("key_c", "val_c"); - /// batch.remove("key_0"); - /// - /// db.apply_batch(batch)?; - /// // key_0 no longer exists, and key_a, key_b, and key_c - /// // now do exist. 
- /// # Ok(()) } - /// ``` - pub fn apply_batch(&self, batch: Batch) -> Result<()> { - let _cc = concurrency_control::write(); - let mut guard = pin(); - self.apply_batch_inner(batch, None, &mut guard) + predecessor.hi = leaf.hi.clone(); + predecessor.set_dirty_epoch(merge_epoch); + + leaf.deleted = Some(merge_epoch); + + leaf_guard + .inner + .cache + .tree_leaves_merged + .fetch_add(1, Ordering::Relaxed); + + self.index.remove(&leaf_guard.low_key).unwrap(); + self.cache.object_id_index.remove(&leaf_guard.node.object_id).unwrap(); + + // NB: these updates must "happen" atomically in the same flush epoch + self.cache.install_dirty( + merge_epoch, + leaf_guard.node.object_id, + Dirty::MergedAndDeleted { + object_id: leaf_guard.node.object_id, + collection_id: self.collection_id, + }, + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + leaf_guard.node.object_id, + merge_epoch, + event_verifier::State::Unallocated, + concat!(file!(), ':', line!(), ":merged"), + ); + } + + self.cache.install_dirty( + merge_epoch, + predecessor_guard.node.object_id, + Dirty::NotYetSerialized { + low_key: predecessor.lo.clone(), + node: predecessor_guard.node.clone(), + collection_id: self.collection_id, + }, + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + predecessor_guard.node.object_id, + merge_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":merged-into"), + ); + } + + let (p_object_id, p_sz) = + predecessor_guard.handle_cache_access_and_eviction_externally(); + let (s_object_id, s_sz) = + leaf_guard.handle_cache_access_and_eviction_externally(); + + self.cache.mark_access_and_evict(p_object_id, p_sz, merge_epoch)?; + + // TODO maybe we don't need to do this, since we're removing it + self.cache.mark_access_and_evict(s_object_id, s_sz, merge_epoch)?; + + Ok(()) } - pub(crate) fn apply_batch_inner( - &self, - batch: Batch, - transaction_batch_opt: Option, - guard: &mut Guard, - ) -> Result<()> { - let peg_opt = if transaction_batch_opt.is_none() { - Some(self.context.pin_log(guard)?) - } else { - None - }; + fn predecessor_leaf_mut<'a>( + &'a self, + successor: &LeafWriteGuard<'a, LEAF_FANOUT>, + ) -> io::Result> { + assert_ne!(successor.low_key, &*InlineArray::MIN); - trace!("applying batch {:?}", batch); + loop { + let search_key = self + .index + .range::(..&successor.low_key) + .next_back() + .unwrap() + .0; - let mut subscriber_reservation = self.subscribers.reserve_batch(&batch); + assert!(search_key < successor.low_key); - for (k, v_opt) in &batch.writes { - loop { - if self - .insert_inner( - k, - v_opt.clone(), - transaction_batch_opt.is_some(), - guard, - )? - .is_ok() - { - break; - } + // TODO this can probably deadlock + let node = self.leaf_for_key_mut(&search_key)?; + + let leaf = node.leaf_write.leaf.as_ref().unwrap(); + + assert!(leaf.lo < successor.low_key); + + let leaf_hi = leaf.hi.as_ref().expect("we hold the successor mutex, so the predecessor should have a hi key"); + + if leaf_hi > &successor.low_key { + let still_in_index = self.index.get(&successor.low_key); + panic!( + "somehow, predecessor high key of {:?} \ + is greater than successor low key of {:?}. 
current index presence: {:?} \n \ + predecessor: {:?} \n successor: {:?}", + leaf_hi, successor.low_key, still_in_index, leaf, successor.leaf_write.leaf.as_ref().unwrap(), + ); } + if leaf_hi != &successor.low_key { + continue; + } + return Ok(node); } + } - if let Some(res) = subscriber_reservation.take() { - if let Some(transaction_batch) = transaction_batch_opt { - res.complete(&transaction_batch); + fn leaf_for_key<'a>( + &'a self, + key: &[u8], + ) -> io::Result> { + loop { + let (low_key, node) = self.index.get_lte(key).unwrap(); + let mut read = node.inner.read_arc(); + + if read.leaf.is_none() { + drop(read); + let (read_low_key, write, _node) = + self.page_in(key, self.cache.current_flush_epoch())?; + assert!(&*read_low_key <= key); + read = ArcRwLockWriteGuard::downgrade(write); } else { - res.complete(&Event::single_batch(self.clone(), batch)); + self.cache + .read_stats + .cache_hits + .fetch_add(1, Ordering::Relaxed); + } + + let leaf_guard = LeafReadGuard { + leaf_read: ManuallyDrop::new(read), + inner: self, + low_key, + object_id: node.object_id, + external_cache_access_and_eviction: false, + }; + + let leaf = leaf_guard.leaf_read.leaf.as_ref().unwrap(); + + if leaf.deleted.is_some() { + log::trace!("retry due to deleted node in leaf_for_key"); + drop(leaf_guard); + hint::spin_loop(); + continue; } + if &*leaf.lo > key { + log::trace!("key undershoot in leaf_for_key"); + drop(leaf_guard); + hint::spin_loop(); + continue; + } + if let Some(ref hi) = leaf.hi { + if &**hi <= key { + log::trace!("key overshoot on leaf_for_key"); + // cache maintenance occurs in Drop for LeafReadGuard + drop(leaf_guard); + hint::spin_loop(); + continue; + } + } + + if leaf.lo != node.low_key { + // TODO determine why this rare situation occurs and better + // understand whether it is really benign. + log::trace!("mismatch between object key and leaf low"); + // cache maintenance occurs in Drop for LeafReadGuard + drop(leaf_guard); + hint::spin_loop(); + continue; + } + + return Ok(leaf_guard); } + } - if let Some(peg) = peg_opt { - // when the peg drops, it ensures all updates - // written to the log since its creation are - // recovered atomically - peg.seal_batch() - } else { - Ok(()) + fn leaf_for_key_mut<'a>( + &'a self, + key: &[u8], + ) -> io::Result> { + let reader_epoch = self.cache.current_flush_epoch(); + + let (low_key, mut write, node) = self.page_in(key, reader_epoch)?; + + // by checking into an epoch after acquiring the node mutex, we + // avoid inversions where progress may be observed to go backwards. 
+ let flush_epoch_guard = self.cache.check_into_flush_epoch(); + + let leaf = write.leaf.as_mut().unwrap(); + + // NB: these invariants should be enforced in page_in + assert!(leaf.deleted.is_none()); + assert!(&*leaf.lo <= key); + if let Some(ref hi) = leaf.hi { + assert!( + &**hi > key, + "while retrieving the leaf for key {:?} \ + we pulled a leaf with hi key of {:?}", + key, + hi + ); + } + + if let Some(old_dirty_epoch) = leaf.dirty_flush_epoch { + if old_dirty_epoch != flush_epoch_guard.epoch() { + log::trace!( + "cooperatively flushing {:?} with dirty {:?} after checking into {:?}", + node.object_id, + old_dirty_epoch, + flush_epoch_guard.epoch() + ); + + assert!(old_dirty_epoch < flush_epoch_guard.epoch()); + + // cooperatively serialize and put into dirty + leaf.max_unflushed_epoch = leaf.dirty_flush_epoch.take(); + leaf.page_out_on_flush.take(); + log::trace!( + "starting cooperatively serializing {:?} for {:?} because we want to use it in {:?}", + node.object_id, old_dirty_epoch, flush_epoch_guard.epoch(), + ); + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + node.object_id, + old_dirty_epoch, + event_verifier::State::CooperativelySerialized, + concat!( + file!(), + ':', + line!(), + ":cooperative-serialize" + ), + ); + } + + // be extra-explicit about serialized bytes + let leaf_ref: &Leaf = &*leaf; + + let serialized = leaf_ref + .serialize(self.cache.config.zstd_compression_level); + + log::trace!( + "D adding node {} to dirty {:?}", + node.object_id.0, + old_dirty_epoch + ); + + assert_eq!(node.low_key, leaf.lo); + + self.cache.install_dirty( + old_dirty_epoch, + node.object_id, + Dirty::CooperativelySerialized { + object_id: node.object_id, + collection_id: self.collection_id, + low_key: leaf.lo.clone(), + mutation_count: leaf.mutation_count, + data: Arc::new(serialized), + }, + ); + log::trace!( + "finished cooperatively serializing {:?} for {:?} because we want to use it in {:?}", + node.object_id, old_dirty_epoch, flush_epoch_guard.epoch(), + ); + + assert!( + old_dirty_epoch < flush_epoch_guard.epoch(), + "flush epochs somehow became unlinked" + ); + } } + + Ok(LeafWriteGuard { + flush_epoch_guard, + leaf_write: ManuallyDrop::new(write), + inner: self, + low_key, + node, + external_cache_access_and_eviction: false, + }) } /// Retrieve a value from the `Tree` if it exists. @@ -464,83 +669,159 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(&[0], vec![0])?; - /// assert_eq!(db.get(&[0]), Ok(Some(sled::IVec::from(vec![0])))); - /// assert_eq!(db.get(&[1]), Ok(None)); + /// assert_eq!(db.get(&[0]).unwrap(), Some(sled::InlineArray::from(vec![0]))); + /// assert!(db.get(&[1]).unwrap().is_none()); /// # Ok(()) } /// ``` - pub fn get>(&self, key: K) -> Result> { - let mut guard = pin(); - let _cc = concurrency_control::read(); - loop { - if let Ok(get) = self.get_inner(key.as_ref(), &mut guard)? 
{ - return Ok(get); - } + pub fn get>( + &self, + key: K, + ) -> io::Result> { + self.check_error()?; + + let key_ref = key.as_ref(); + + let leaf_guard = self.leaf_for_key(key_ref)?; + + let leaf = leaf_guard.leaf_read.leaf.as_ref().unwrap(); + + if let Some(ref hi) = leaf.hi { + assert!(&**hi > key_ref); } + + Ok(leaf.get(key_ref).cloned()) } - /// Pass the result of getting a key's value to a closure - /// without making a new allocation. This effectively - /// "pushes" your provided code to the data without ever copying - /// the data, rather than "pulling" a copy of the data to whatever code - /// is calling `get`. + /// Insert a key to a new value, returning the last value if it + /// was set. /// /// # Examples /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// db.insert(&[0], vec![0])?; - /// db.get_zero_copy(&[0], |value_opt| { - /// assert_eq!( - /// value_opt, - /// Some(&[0][..]) - /// ) - /// }); - /// db.get_zero_copy(&[1], |value_opt| assert!(value_opt.is_none())); + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; + /// assert_eq!(db.insert(&[1, 2, 3], vec![0]).unwrap(), None); + /// assert_eq!(db.insert(&[1, 2, 3], vec![1]).unwrap(), Some(sled::InlineArray::from(&[0]))); /// # Ok(()) } /// ``` - pub fn get_zero_copy, B, F: FnOnce(Option<&[u8]>) -> B>( + #[doc(alias = "set")] + #[doc(alias = "put")] + pub fn insert( &self, key: K, - f: F, - ) -> Result { - let guard = pin(); - let _cc = concurrency_control::read(); + value: V, + ) -> io::Result> + where + K: AsRef<[u8]>, + V: Into, + { + self.check_error()?; - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_get); + let key_ref = key.as_ref(); - trace!("getting key {:?}", key.as_ref()); + let value_ivec = value.into(); + let mut leaf_guard = self.leaf_for_key_mut(key_ref)?; + let new_epoch = leaf_guard.epoch(); - let View { node_view, .. 
} = self.view_for_key(key.as_ref(), &guard)?; + let leaf = leaf_guard.leaf_write.leaf.as_mut().unwrap(); - let pair = node_view.node_kv_pair(key.as_ref()); + let ret = leaf.insert(key_ref.into(), value_ivec.clone()); - let ret = f(pair.1); + let old_size = + ret.as_ref().map(|v| key_ref.len() + v.len()).unwrap_or(0); + let new_size = key_ref.len() + value_ivec.len(); - Ok(ret) - } + if new_size > old_size { + leaf.in_memory_size += new_size - old_size; + } else { + leaf.in_memory_size = + leaf.in_memory_size.saturating_sub(old_size - new_size); + } - pub(crate) fn get_inner( - &self, - key: &[u8], - guard: &mut Guard, - ) -> Result>> { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_get); + let split = + leaf.split_if_full(new_epoch, &self.cache, self.collection_id); + if split.is_some() || Some(value_ivec) != ret { + leaf.mutation_count += 1; + leaf.set_dirty_epoch(new_epoch); + log::trace!( + "F adding node {} to dirty {:?}", + leaf_guard.node.object_id.0, + new_epoch + ); + + self.cache.install_dirty( + new_epoch, + leaf_guard.node.object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + node: leaf_guard.node.clone(), + low_key: leaf_guard.low_key.clone(), + }, + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + leaf_guard.node.object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":insert"), + ); + } + } + if let Some((split_key, rhs_node)) = split { + assert_eq!(leaf.hi.as_ref().unwrap(), &split_key); + log::trace!( + "G adding new from split {:?} to dirty {:?}", + rhs_node.object_id, + new_epoch + ); + + assert_ne!(rhs_node.object_id, leaf_guard.node.object_id); + assert!(!split_key.is_empty()); + + let rhs_object_id = rhs_node.object_id; + + self.cache.install_dirty( + new_epoch, + rhs_object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + node: rhs_node.clone(), + low_key: split_key.clone(), + }, + ); - trace!("getting key {:?}", key); + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + rhs_object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":insert-split"), + ); + } - let View { node_view, .. } = self.view_for_key(key.as_ref(), guard)?; + // NB only make the new node reachable via the index after + // we marked it as dirty, as from this point on, any other + // thread may cooperatively deserialize it and maybe conflict + // with that previous NotYetSerialized marker. + self.cache + .object_id_index + .insert(rhs_node.object_id, rhs_node.clone()); + let prev = self.index.insert(split_key, rhs_node); + assert!(prev.is_none()); + } - let pair = node_view.node_kv_pair(key.as_ref()); - let val = pair.1.map(IVec::from); + // this is for clarity, that leaf_guard is held while + // inserting into dirty with its guarded epoch + drop(leaf_guard); - Ok(Ok(val)) + Ok(ret) } /// Delete a value, returning the old value if it existed. 
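As an aside on the `insert` hunk above (the following sketch is illustrative and not part of the diff): the leaf's `in_memory_size` estimate is adjusted by the delta between the replaced entry and the new one, and it saturates at zero when the new entry is smaller than the recorded estimate. A minimal restatement of that rule, with a hypothetical function name:

```rust
// Illustrative restatement of the size accounting in `Tree::insert` above;
// not part of the sled API. The estimate grows by the delta when the new
// entry is larger and shrinks with saturation when it is smaller.
fn adjusted_in_memory_size(
    current_size: usize,
    old_entry_size: usize, // key.len() + old value length, or 0 if absent
    new_entry_size: usize, // key.len() + new value length
) -> usize {
    if new_entry_size > old_entry_size {
        current_size + (new_entry_size - old_entry_size)
    } else {
        current_size.saturating_sub(old_entry_size - new_entry_size)
    }
}

fn main() {
    // replacing a 10-byte entry with a 4-byte entry shrinks the estimate
    assert_eq!(adjusted_in_memory_size(100, 10, 4), 94);
    // saturating_sub keeps a stale estimate from underflowing
    assert_eq!(adjusted_in_memory_size(3, 10, 4), 0);
}
```

The `saturating_sub` avoids an underflow panic in the case where the recorded estimate is already smaller than the computed size of the entry being replaced.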
@@ -549,60 +830,108 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(&[1], vec![1]); - /// assert_eq!(db.remove(&[1]), Ok(Some(sled::IVec::from(vec![1])))); - /// assert_eq!(db.remove(&[1]), Ok(None)); + /// assert_eq!(db.remove(&[1]).unwrap(), Some(sled::InlineArray::from(vec![1]))); + /// assert!(db.remove(&[1]).unwrap().is_none()); /// # Ok(()) } /// ``` #[doc(alias = "delete")] #[doc(alias = "del")] - pub fn remove>(&self, key: K) -> Result> { - let mut guard = pin(); - let _cc = concurrency_control::read(); - loop { - trace!("removing key {:?}", key.as_ref()); + pub fn remove>( + &self, + key: K, + ) -> io::Result> { + self.check_error()?; + + let key_ref = key.as_ref(); + + let mut leaf_guard = self.leaf_for_key_mut(key_ref)?; + + let new_epoch = leaf_guard.epoch(); + + let leaf = leaf_guard.leaf_write.leaf.as_mut().unwrap(); + + assert!(leaf.deleted.is_none()); + + let ret = leaf.remove(key_ref); + + if ret.is_some() { + leaf.mutation_count += 1; - if let Ok(res) = - self.insert_inner(key.as_ref(), None, false, &mut guard)? + leaf.set_dirty_epoch(new_epoch); + + log::trace!( + "H adding node {} to dirty {:?}", + leaf_guard.node.object_id.0, + new_epoch + ); + + self.cache.install_dirty( + new_epoch, + leaf_guard.node.object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + low_key: leaf_guard.low_key.clone(), + node: leaf_guard.node.clone(), + }, + ); + + #[cfg(feature = "for-internal-testing-only")] { - return Ok(res); + self.cache.event_verifier.mark( + leaf_guard.node.object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":remove"), + ); + } + + if cfg!(not(feature = "monotonic-behavior")) { + if leaf.data.is_empty() + && leaf_guard.low_key != InlineArray::MIN + { + assert_ne!(leaf_guard.node.low_key, InlineArray::MIN); + self.merge_leaf_into_left_sibling(leaf_guard)?; + } } } - } + Ok(ret) + } /// Compare and swap. Capable of unique creation, conditional modification, /// or deletion. If old is `None`, this will only set the value if it /// doesn't exist yet. If new is `None`, will delete the value if old is /// correct. If both old and new are `Some`, will modify the value if /// old is correct. /// - /// It returns `Ok(Ok(()))` if operation finishes successfully. + /// It returns `Ok(Ok(CompareAndSwapSuccess { new_value, previous_value }))` if operation finishes successfully. /// /// If it fails it returns: - /// - `Ok(Err(CompareAndSwapError(current, proposed)))` if operation - /// failed to setup a new value. `CompareAndSwapError` contains + /// - `Ok(Err(CompareAndSwapError{ current, proposed }))` if no IO + /// error was encountered but the operation + /// failed to specify the correct current value. `CompareAndSwapError` contains /// current and proposed values. - /// - `Err(Error::Unsupported)` if the database is opened in read-only - /// mode. + /// - `Err(io::Error)` if there was a high-level IO problem that prevented + /// the operation from logically progressing. This is usually fatal and + /// will prevent future requests from functioning, and requires the + /// administrator to fix the system issue before restarting. 
/// /// # Examples /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// // unique creation - /// assert_eq!( - /// db.compare_and_swap(&[1], None as Option<&[u8]>, Some(&[10])), - /// Ok(Ok(())) + /// assert!( + /// db.compare_and_swap(&[1], None as Option<&[u8]>, Some(&[10])).unwrap().is_ok(), /// ); /// /// // conditional modification - /// assert_eq!( - /// db.compare_and_swap(&[1], Some(&[10]), Some(&[20])), - /// Ok(Ok(())) + /// assert!( + /// db.compare_and_swap(&[1], Some(&[10]), Some(&[20])).unwrap().is_ok(), /// ); /// /// // failed conditional modification -- the current value is returned in @@ -615,14 +944,12 @@ impl Tree { /// assert_eq!(actual_value.current.map(|ivec| ivec.to_vec()), Some(vec![20])); /// /// // conditional deletion - /// assert_eq!( - /// db.compare_and_swap(&[1], Some(&[20]), None as Option<&[u8]>), - /// Ok(Ok(())) + /// assert!( + /// db.compare_and_swap(&[1], Some(&[20]), None as Option<&[u8]>).unwrap().is_ok(), /// ); - /// assert_eq!(db.get(&[1]), Ok(None)); + /// assert!(db.get(&[1]).unwrap().is_none()); /// # Ok(()) } /// ``` - #[allow(clippy::needless_pass_by_value)] #[doc(alias = "cas")] #[doc(alias = "tas")] #[doc(alias = "test_and_swap")] @@ -636,72 +963,124 @@ impl Tree { where K: AsRef<[u8]>, OV: AsRef<[u8]>, - NV: Into, + NV: Into, { - trace!("cas'ing key {:?}", key.as_ref()); - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_cas); + self.check_error()?; - let guard = pin(); - let _cc = concurrency_control::read(); + let key_ref = key.as_ref(); - let new2 = new.map(Into::into); + let mut leaf_guard = self.leaf_for_key_mut(key_ref)?; + let new_epoch = leaf_guard.epoch(); - // we need to retry caps until old != cur, since just because - // cap fails it doesn't mean our value was changed. - loop { - let View { pid, node_view, .. } = - self.view_for_key(key.as_ref(), &guard)?; - - let (encoded_key, current_value) = - node_view.node_kv_pair(key.as_ref()); - let matches = match (old.as_ref(), ¤t_value) { - (None, None) => true, - (Some(o), Some(c)) => o.as_ref() == &**c, - _ => false, - }; + let proposed: Option = new.map(Into::into); - if !matches { - return Ok(Err(CompareAndSwapError { - current: current_value.map(IVec::from), - proposed: new2, - })); - } + let leaf = leaf_guard.leaf_write.leaf.as_mut().unwrap(); - if current_value == new2.as_ref().map(AsRef::as_ref) { - // short-circuit no-op write. this is still correct - // because we verified that the input matches, so - // doing the work has the same semantic effect as not - // doing it in this case. 
- return Ok(Ok(())); - } + let current = leaf.get(key_ref).cloned(); - let mut subscriber_reservation = self.subscribers.reserve(&key); + let previous_matches = match (old, ¤t) { + (None, None) => true, + (Some(conditional), Some(current)) + if conditional.as_ref() == current.as_ref() => + { + true + } + _ => false, + }; - let frag = if let Some(ref new3) = new2 { - Link::Set(encoded_key, new3.clone()) + let ret = if previous_matches { + if let Some(ref new_value) = proposed { + leaf.insert(key_ref.into(), new_value.clone()) } else { - Link::Del(encoded_key) + leaf.remove(key_ref) }; - let link = - self.context.pagecache.link(pid, node_view.0, frag, &guard)?; - - if link.is_ok() { - if let Some(res) = subscriber_reservation.take() { - let event = subscriber::Event::single_update( - self.clone(), - key.as_ref().into(), - new2, - ); - res.complete(&event); - } + Ok(CompareAndSwapSuccess { + new_value: proposed, + previous_value: current, + }) + } else { + Err(CompareAndSwapError { current, proposed }) + }; + + let split = + leaf.split_if_full(new_epoch, &self.cache, self.collection_id); + let split_happened = split.is_some(); + if split_happened || ret.is_ok() { + leaf.mutation_count += 1; + + leaf.set_dirty_epoch(new_epoch); + log::trace!( + "A adding node {} to dirty {:?}", + leaf_guard.node.object_id.0, + new_epoch + ); + + self.cache.install_dirty( + new_epoch, + leaf_guard.node.object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + node: leaf_guard.node.clone(), + low_key: leaf_guard.low_key.clone(), + }, + ); - return Ok(Ok(())); + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + leaf_guard.node.object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":cas"), + ); } - #[cfg(feature = "metrics")] - M.tree_looped(); } + if let Some((split_key, rhs_node)) = split { + log::trace!( + "B adding new from split {:?} to dirty {:?}", + rhs_node.object_id, + new_epoch + ); + self.cache.install_dirty( + new_epoch, + rhs_node.object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + node: rhs_node.clone(), + low_key: split_key.clone(), + }, + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + rhs_node.object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), "cas-split"), + ); + } + + // NB only make the new node reachable via the index after + // we marked it as dirty, as from this point on, any other + // thread may cooperatively deserialize it and maybe conflict + // with that previous NotYetSerialized marker. + self.cache + .object_id_index + .insert(rhs_node.object_id, rhs_node.clone()); + let prev = self.index.insert(split_key, rhs_node); + assert!(prev.is_none()); + } + + if cfg!(not(feature = "monotonic-behavior")) { + if leaf.data.is_empty() && leaf_guard.low_key != InlineArray::MIN { + assert!(!split_happened); + self.merge_leaf_into_left_sibling(leaf_guard)?; + } + } + + Ok(ret) } /// Fetch the value, apply a function to it and return the result. 
@@ -715,14 +1094,13 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// use sled::{Config, Error, IVec}; - /// use std::convert::TryInto; + /// use sled::{Config, InlineArray}; /// - /// let config = Config::new().temporary(true); - /// let db = config.open()?; + /// let config = Config::tmp().unwrap(); + /// let db: sled::Db<1024> = config.open()?; /// - /// fn u64_to_ivec(number: u64) -> IVec { - /// IVec::from(number.to_be_bytes().to_vec()) + /// fn u64_to_ivec(number: u64) -> InlineArray { + /// InlineArray::from(number.to_be_bytes().to_vec()) /// } /// /// let zero = u64_to_ivec(0); @@ -743,21 +1121,21 @@ impl Tree { /// Some(number.to_be_bytes().to_vec()) /// } /// - /// assert_eq!(db.update_and_fetch("counter", increment), Ok(Some(zero))); - /// assert_eq!(db.update_and_fetch("counter", increment), Ok(Some(one))); - /// assert_eq!(db.update_and_fetch("counter", increment), Ok(Some(two))); - /// assert_eq!(db.update_and_fetch("counter", increment), Ok(Some(three))); + /// assert_eq!(db.update_and_fetch("counter", increment).unwrap(), Some(zero)); + /// assert_eq!(db.update_and_fetch("counter", increment).unwrap(), Some(one)); + /// assert_eq!(db.update_and_fetch("counter", increment).unwrap(), Some(two)); + /// assert_eq!(db.update_and_fetch("counter", increment).unwrap(), Some(three)); /// # Ok(()) } /// ``` pub fn update_and_fetch( &self, key: K, mut f: F, - ) -> Result> + ) -> io::Result> where K: AsRef<[u8]>, F: FnMut(Option<&[u8]>) -> Option, - V: Into, + V: Into, { let key_ref = key.as_ref(); let mut current = self.get(key_ref)?; @@ -765,12 +1143,12 @@ impl Tree { loop { let tmp = current.as_ref().map(AsRef::as_ref); let next = f(tmp).map(Into::into); - match self.compare_and_swap::<_, _, IVec>( + match self.compare_and_swap::<_, _, InlineArray>( key_ref, tmp, next.clone(), )? { - Ok(()) => return Ok(next), + Ok(_) => return Ok(next), Err(CompareAndSwapError { current: cur, .. 
}) => { current = cur; } @@ -789,14 +1167,13 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// use sled::{Config, Error, IVec}; - /// use std::convert::TryInto; + /// use sled::{Config, InlineArray}; /// - /// let config = Config::new().temporary(true); - /// let db = config.open()?; + /// let config = Config::tmp().unwrap(); + /// let db: sled::Db<1024> = config.open()?; /// - /// fn u64_to_ivec(number: u64) -> IVec { - /// IVec::from(number.to_be_bytes().to_vec()) + /// fn u64_to_ivec(number: u64) -> InlineArray { + /// InlineArray::from(number.to_be_bytes().to_vec()) /// } /// /// let zero = u64_to_ivec(0); @@ -816,21 +1193,21 @@ impl Tree { /// Some(number.to_be_bytes().to_vec()) /// } /// - /// assert_eq!(db.fetch_and_update("counter", increment), Ok(None)); - /// assert_eq!(db.fetch_and_update("counter", increment), Ok(Some(zero))); - /// assert_eq!(db.fetch_and_update("counter", increment), Ok(Some(one))); - /// assert_eq!(db.fetch_and_update("counter", increment), Ok(Some(two))); + /// assert_eq!(db.fetch_and_update("counter", increment).unwrap(), None); + /// assert_eq!(db.fetch_and_update("counter", increment).unwrap(), Some(zero)); + /// assert_eq!(db.fetch_and_update("counter", increment).unwrap(), Some(one)); + /// assert_eq!(db.fetch_and_update("counter", increment).unwrap(), Some(two)); /// # Ok(()) } /// ``` pub fn fetch_and_update( &self, key: K, mut f: F, - ) -> Result> + ) -> io::Result> where K: AsRef<[u8]>, F: FnMut(Option<&[u8]>) -> Option, - V: Into, + V: Into, { let key_ref = key.as_ref(); let mut current = self.get(key_ref)?; @@ -839,7 +1216,7 @@ impl Tree { let tmp = current.as_ref().map(AsRef::as_ref); let next = f(tmp); match self.compare_and_swap(key_ref, tmp, next)? { - Ok(()) => return Ok(current), + Ok(_) => return Ok(current), Err(CompareAndSwapError { current: cur, .. }) => { current = cur; } @@ -847,110 +1224,285 @@ impl Tree { } } - /// Subscribe to `Event`s that happen to keys that have - /// the specified prefix. Events for particular keys are - /// guaranteed to be witnessed in the same order by all - /// threads, but threads may witness different interleavings - /// of `Event`s across different keys. If subscribers don't - /// keep up with new writes, they will cause new writes - /// to block. There is a buffer of 1024 items per - /// `Subscriber`. This can be used to build reactive - /// and replicated systems. - /// - /// `Subscriber` implements both `Iterator` - /// and `Future>` - /// - /// # Examples - /// - /// Synchronous, blocking subscriber: - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// // watch all events by subscribing to the empty prefix - /// let mut subscriber = db.watch_prefix(vec![]); - /// - /// let tree_2 = db.clone(); - /// let thread = std::thread::spawn(move || { - /// db.insert(vec![0], vec![1]) - /// }); - /// - /// // `Subscription` implements `Iterator` - /// for event in subscriber.take(1) { - /// // Events occur due to single key operations, - /// // batches, or transactions. The tree is included - /// // so that you may perform a new transaction or - /// // operation in response to the event. 
- /// for (tree, key, value_opt) in &event { - /// if let Some(value) = value_opt { - /// // key `key` was set to value `value` - /// } else { - /// // key `key` was removed - /// } - /// } - /// } + pub fn iter(&self) -> Iter { + Iter { + prefetched: VecDeque::new(), + prefetched_back: VecDeque::new(), + next_fetch: Some(InlineArray::MIN), + next_back_last_lo: None, + next_calls: 0, + next_back_calls: 0, + inner: self.clone(), + bounds: (Bound::Unbounded, Bound::Unbounded), + } + } + + pub fn range(&self, range: R) -> Iter + where + K: AsRef<[u8]>, + R: RangeBounds, + { + let start: Bound = + map_bound(range.start_bound(), |b| InlineArray::from(b.as_ref())); + let end: Bound = + map_bound(range.end_bound(), |b| InlineArray::from(b.as_ref())); + + let next_fetch = Some(match &start { + Bound::Included(b) | Bound::Excluded(b) => b.clone(), + Bound::Unbounded => InlineArray::MIN, + }); + + Iter { + prefetched: VecDeque::new(), + prefetched_back: VecDeque::new(), + next_fetch, + next_back_last_lo: None, + next_calls: 0, + next_back_calls: 0, + inner: self.clone(), + bounds: (start, end), + } + } + + /// Create a new batched update that is applied + /// atomically. Readers will atomically see all updates + /// at an atomic instant, and if the database crashes, + /// either 0% or 100% of the full batch will be recovered, + /// but never a partial batch. If a `flush` operation succeeds + /// after this, it is guaranteed that 100% of the batch will be + /// visible, unless later concurrent updates changed the values + /// before the flush. + /// + /// # Examples /// - /// # thread.join().unwrap(); - /// # Ok(()) } /// ``` - /// Asynchronous, non-blocking subscriber: + /// # fn main() -> Result<(), Box> { + /// # let _ = std::fs::remove_dir_all("batch_doctest"); + /// # let db: sled::Db<1024> = sled::open("batch_doctest")?; + /// db.insert("key_0", "val_0")?; /// - /// `Subscription` implements `Future>`. + /// let mut batch = sled::Batch::default(); + /// batch.insert("key_a", "val_a"); + /// batch.insert("key_b", "val_b"); + /// batch.insert("key_c", "val_c"); + /// batch.remove("key_0"); /// + /// db.apply_batch(batch)?; + /// // key_0 no longer exists, and key_a, key_b, and key_c + /// // now do exist. 
+ /// # let _ = std::fs::remove_dir_all("batch_doctest"); + /// # Ok(()) } /// ``` - /// # async fn foo() { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open().unwrap(); - /// # let mut subscriber = db.watch_prefix(vec![]); - /// while let Some(event) = (&mut subscriber).await { - /// /* use it */ - /// } - /// # } - /// ``` - pub fn watch_prefix>(&self, prefix: P) -> Subscriber { - self.subscribers.register(prefix.as_ref()) - } + pub fn apply_batch(&self, batch: Batch) -> io::Result<()> { + // NB: we rely on lexicographic lock acquisition + // by iterating over the batch's BTreeMap to avoid + // deadlocks during 2PL + let mut acquired_locks: BTreeMap< + InlineArray, + ( + ArcRwLockWriteGuard>, + Object, + ), + > = BTreeMap::new(); + + // Phase 1: lock acquisition + let mut last: Option<( + InlineArray, + ArcRwLockWriteGuard>, + Object, + )> = None; + + for key in batch.writes.keys() { + if let Some((_lo, w, _id)) = &last { + let leaf = w.leaf.as_ref().unwrap(); + assert!(&leaf.lo <= key); + if let Some(hi) = &leaf.hi { + if hi <= key { + let (lo, w, n) = last.take().unwrap(); + acquired_locks.insert(lo, (w, n)); + } + } + } + if last.is_none() { + // TODO evaluate whether this is correct, as page_in performs + // cache maintenance internally if it over/undershoots due to + // concurrent modifications. + last = + Some(self.page_in(key, self.cache.current_flush_epoch())?); + } + } - /// Synchronously flushes all dirty IO buffers and calls - /// fsync. If this succeeds, it is guaranteed that all - /// previous writes will be recovered if the system - /// crashes. Returns the number of bytes flushed during - /// this call. - /// - /// Flushing can take quite a lot of time, and you should - /// measure the performance impact of using it on - /// realistic sustained workloads running on realistic - /// hardware. - /// - /// This is called automatically on drop. - pub fn flush(&self) -> Result { - self.context.pagecache.flush() - } + if let Some((lo, w, id)) = last.take() { + acquired_locks.insert(lo, (w, id)); + } - /// Asynchronously flushes all dirty IO buffers - /// and calls fsync. If this succeeds, it is - /// guaranteed that all previous writes will - /// be recovered if the system crashes. Returns - /// the number of bytes flushed during this call. - /// - /// Flushing can take quite a lot of time, and you - /// should measure the performance impact of - /// using it on realistic sustained workloads - /// running on realistic hardware. - // this clippy check is mis-firing on async code. - #[allow(clippy::used_underscore_binding)] - #[allow(clippy::shadow_same)] - pub async fn flush_async(&self) -> Result { - let pagecache = self.context.pagecache.clone(); - if let Some(result) = threadpool::spawn(move || pagecache.flush()).await - { - result - } else { - Err(Error::ReportableBug( - "threadpool failed to complete \ - action before shutdown" - )) + // NB: add the flush epoch at the end of the lock acquisition + // process when all locks have been acquired, to avoid situations + // where a leaf is already dirty with an epoch "from the future". + let flush_epoch_guard = self.cache.check_into_flush_epoch(); + let new_epoch = flush_epoch_guard.epoch(); + + // Flush any leaves that are dirty from a previous flush epoch + // before performing operations. 
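The `NB` comment at the top of `apply_batch` above is doing real work: because `Batch::writes` is a `BTreeMap`, iterating its keys hands out leaf locks in ascending key order, and when every writer acquires locks in the same global order, no cycle of waiters (and therefore no deadlock) can form during this two-phase-locking step. A standalone sketch of that ordering rule, using plain `std::sync::Mutex`es and illustrative names rather than sled's leaf locks:

```rust
use std::collections::BTreeMap;
use std::sync::{Mutex, MutexGuard};

// Hypothetical per-key locks standing in for sled's per-leaf write locks.
type Locks = BTreeMap<String, Mutex<Vec<u8>>>;

// Acquire every lock a batch touches in ascending key order. Because all
// writers order their keys the same way (a BTreeMap yields them sorted for
// free), two batches can never each hold a lock the other is waiting on,
// so the circular wait that a deadlock requires cannot form.
fn lock_in_order<'a>(
    locks: &'a Locks,
    keys: &mut Vec<&str>,
) -> Vec<MutexGuard<'a, Vec<u8>>> {
    keys.sort_unstable();
    keys.dedup();
    keys.iter()
        .filter_map(|k| locks.get(*k)) // ignore unknown keys in this sketch
        .map(|m| m.lock().unwrap())
        .collect()
}

fn main() {
    let mut locks = Locks::new();
    locks.insert("a".into(), Mutex::new(vec![]));
    locks.insert("b".into(), Mutex::new(vec![]));

    // Both hypothetical "batches" end up locking "a" before "b",
    // regardless of the order their keys were supplied in.
    let _guards = lock_in_order(&locks, &mut vec!["b", "a"]);
}
```

In `apply_batch` itself the ordering comes for free from the `BTreeMap`, and the extra wrinkle the hunk above handles is that several consecutive batch keys may land in the same leaf, so a new leaf is only paged in once the previous leaf's `hi` bound has been passed.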
+ for (write, node) in acquired_locks.values_mut() { + let leaf = write.leaf.as_mut().unwrap(); + if let Some(old_flush_epoch) = leaf.dirty_flush_epoch { + if old_flush_epoch == new_epoch { + // no need to cooperatively flush + continue; + } + + assert!(old_flush_epoch < new_epoch); + + log::trace!( + "cooperatively flushing {:?} with dirty {:?} after checking into {:?}", + node.object_id, + old_flush_epoch, + new_epoch + ); + + // cooperatively serialize and put into dirty + let old_dirty_epoch = leaf.dirty_flush_epoch.take().unwrap(); + leaf.max_unflushed_epoch = Some(old_dirty_epoch); + + // be extra-explicit about serialized bytes + let leaf_ref: &Leaf = &*leaf; + + let serialized = leaf_ref + .serialize(self.cache.config.zstd_compression_level); + + log::trace!( + "C adding node {} to dirty epoch {:?}", + node.object_id.0, + old_dirty_epoch + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + node.object_id, + old_dirty_epoch, + event_verifier::State::CooperativelySerialized, + concat!( + file!(), + ':', + line!(), + ":batch-cooperative-serialization" + ), + ); + } + + assert_eq!(node.low_key, leaf.lo); + self.cache.install_dirty( + old_dirty_epoch, + node.object_id, + Dirty::CooperativelySerialized { + object_id: node.object_id, + collection_id: self.collection_id, + mutation_count: leaf_ref.mutation_count, + low_key: leaf.lo.clone(), + data: Arc::new(serialized), + }, + ); + + assert!( + old_flush_epoch < flush_epoch_guard.epoch(), + "flush epochs somehow became unlinked" + ); + } + } + + let mut splits: Vec<(InlineArray, Object)> = vec![]; + let mut merges: BTreeMap> = + BTreeMap::new(); + + // Insert and split when full + for (key, value_opt) in batch.writes { + let range = ..=&key; + let (lo, (ref mut w, object)) = acquired_locks + .range_mut::(range) + .next_back() + .unwrap(); + let leaf = w.leaf.as_mut().unwrap(); + + assert_eq!(lo, &leaf.lo); + assert!(leaf.lo <= key); + if let Some(hi) = &leaf.hi { + assert!(hi > &key); + } + + if let Some(value) = value_opt { + leaf.insert(key, value); + merges.remove(lo); + + merges.remove(&leaf.lo); + + if let Some((split_key, rhs_node)) = leaf.split_if_full( + new_epoch, + &self.cache, + self.collection_id, + ) { + let write = rhs_node.inner.write_arc(); + assert!(write.leaf.is_some()); + + splits.push((split_key.clone(), rhs_node.clone())); + acquired_locks.insert(split_key, (write, rhs_node)); + } + } else { + leaf.remove(&key); + + if leaf.is_empty() { + assert_eq!(leaf.lo, lo); + merges.insert(leaf.lo.clone(), object.clone()); + } + } + } + + // Make splits globally visible + for (split_key, rhs_node) in splits { + self.cache + .object_id_index + .insert(rhs_node.object_id, rhs_node.clone()); + self.index.insert(split_key, rhs_node); + } + + // Add all written leaves to dirty and prepare to mark cache accesses + let mut cache_accesses = Vec::with_capacity(acquired_locks.len()); + for (low_key, (write, node)) in &mut acquired_locks { + let leaf = write.leaf.as_mut().unwrap(); + leaf.set_dirty_epoch(new_epoch); + leaf.mutation_count += 1; + cache_accesses.push((node.object_id, leaf.in_memory_size)); + self.cache.install_dirty( + new_epoch, + node.object_id, + Dirty::NotYetSerialized { + collection_id: self.collection_id, + node: node.clone(), + low_key: low_key.clone(), + }, + ); + + #[cfg(feature = "for-internal-testing-only")] + { + self.cache.event_verifier.mark( + node.object_id, + new_epoch, + event_verifier::State::Dirty, + concat!(file!(), ':', line!(), ":apply-batch"), + ); + } } + + // 
Drop locks + drop(acquired_locks); + + // Perform cache maintenance + for (object_id, size) in cache_accesses { + self.cache.mark_access_and_evict(object_id, size, new_epoch)?; + } + + Ok(()) } /// Returns `true` if the `Tree` contains a value for @@ -960,14 +1512,14 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(&[0], vec![0])?; /// assert!(db.contains_key(&[0])?); /// assert!(!db.contains_key(&[1])?); /// # Ok(()) } /// ``` - pub fn contains_key>(&self, key: K) -> Result { + pub fn contains_key>(&self, key: K) -> io::Result { self.get(key).map(|v| v.is_some()) } @@ -985,40 +1537,41 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// use sled::IVec; - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// use sled::InlineArray; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// for i in 0..10 { /// db.insert(&[i], vec![i]) /// .expect("should write successfully"); /// } /// - /// assert_eq!(db.get_lt(&[]), Ok(None)); - /// assert_eq!(db.get_lt(&[0]), Ok(None)); + /// assert!(db.get_lt(&[]).unwrap().is_none()); + /// assert!(db.get_lt(&[0]).unwrap().is_none()); /// assert_eq!( - /// db.get_lt(&[1]), - /// Ok(Some((IVec::from(&[0]), IVec::from(&[0])))) + /// db.get_lt(&[1]).unwrap(), + /// Some((InlineArray::from(&[0]), InlineArray::from(&[0]))) /// ); /// assert_eq!( - /// db.get_lt(&[9]), - /// Ok(Some((IVec::from(&[8]), IVec::from(&[8])))) + /// db.get_lt(&[9]).unwrap(), + /// Some((InlineArray::from(&[8]), InlineArray::from(&[8]))) /// ); /// assert_eq!( - /// db.get_lt(&[10]), - /// Ok(Some((IVec::from(&[9]), IVec::from(&[9])))) + /// db.get_lt(&[10]).unwrap(), + /// Some((InlineArray::from(&[9]), InlineArray::from(&[9]))) /// ); /// assert_eq!( - /// db.get_lt(&[255]), - /// Ok(Some((IVec::from(&[9]), IVec::from(&[9])))) + /// db.get_lt(&[255]).unwrap(), + /// Some((InlineArray::from(&[9]), InlineArray::from(&[9]))) /// ); /// # Ok(()) } /// ``` - pub fn get_lt(&self, key: K) -> Result> + pub fn get_lt( + &self, + key: K, + ) -> io::Result> where K: AsRef<[u8]>, { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_get); self.range(..key).next_back().transpose() } @@ -1036,343 +1589,50 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// use sled::IVec; - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// use sled::InlineArray; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// for i in 0..10 { /// db.insert(&[i], vec![i])?; /// } /// /// assert_eq!( - /// db.get_gt(&[]), - /// Ok(Some((IVec::from(&[0]), IVec::from(&[0])))) + /// db.get_gt(&[]).unwrap(), + /// Some((InlineArray::from(&[0]), InlineArray::from(&[0]))) /// ); /// assert_eq!( - /// db.get_gt(&[0]), - /// Ok(Some((IVec::from(&[1]), IVec::from(&[1])))) + /// db.get_gt(&[0]).unwrap(), + /// Some((InlineArray::from(&[1]), InlineArray::from(&[1]))) /// ); /// assert_eq!( - /// db.get_gt(&[1]), - /// Ok(Some((IVec::from(&[2]), IVec::from(&[2])))) + /// db.get_gt(&[1]).unwrap(), + /// Some((InlineArray::from(&[2]), InlineArray::from(&[2]))) /// ); /// assert_eq!( - /// db.get_gt(&[8]), - /// Ok(Some((IVec::from(&[9]), IVec::from(&[9])))) + /// db.get_gt(&[8]).unwrap(), + /// 
Some((InlineArray::from(&[9]), InlineArray::from(&[9]))) /// ); - /// assert_eq!(db.get_gt(&[9]), Ok(None)); + /// assert!(db.get_gt(&[9]).unwrap().is_none()); /// /// db.insert(500u16.to_be_bytes(), vec![10]); /// assert_eq!( - /// db.get_gt(&499u16.to_be_bytes()), - /// Ok(Some((IVec::from(&500u16.to_be_bytes()), IVec::from(&[10])))) + /// db.get_gt(&499u16.to_be_bytes()).unwrap(), + /// Some((InlineArray::from(&500u16.to_be_bytes()), InlineArray::from(&[10]))) /// ); /// # Ok(()) } /// ``` - pub fn get_gt(&self, key: K) -> Result> + pub fn get_gt( + &self, + key: K, + ) -> io::Result> where K: AsRef<[u8]>, { - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_get); self.range((ops::Bound::Excluded(key), ops::Bound::Unbounded)) .next() .transpose() } - /// Merge state directly into a given key's value using the - /// configured merge operator. This allows state to be written - /// into a value directly, without any read-modify-write steps. - /// Merge operators can be used to implement arbitrary data - /// structures. - /// - /// Calling `merge` will return an `Unsupported` error if it - /// is called without first setting a merge operator function. - /// - /// Merge operators are shared by all instances of a particular - /// `Tree`. Different merge operators may be set on different - /// `Tree`s. - /// - /// # Examples - /// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::IVec; - /// - /// fn concatenate_merge( - /// _key: &[u8], // the key being merged - /// old_value: Option<&[u8]>, // the previous value, if one existed - /// merged_bytes: &[u8] // the new bytes being merged in - /// ) -> Option> { // set the new value, return None to delete - /// let mut ret = old_value - /// .map(|ov| ov.to_vec()) - /// .unwrap_or_else(|| vec![]); - /// - /// ret.extend_from_slice(merged_bytes); - /// - /// Some(ret) - /// } - /// - /// db.set_merge_operator(concatenate_merge); - /// - /// let k = b"k1"; - /// - /// db.insert(k, vec![0]); - /// db.merge(k, vec![1]); - /// db.merge(k, vec![2]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![0, 1, 2])))); - /// - /// // Replace previously merged data. The merge function will not be called. - /// db.insert(k, vec![3]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![3])))); - /// - /// // Merges on non-present values will cause the merge function to be called - /// // with `old_value == None`. If the merge function returns something (which it - /// // does, in this case) a new value will be inserted. - /// db.remove(k); - /// db.merge(k, vec![4]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![4])))); - /// # Ok(()) } - /// ``` - pub fn merge(&self, key: K, value: V) -> Result> - where - K: AsRef<[u8]>, - V: AsRef<[u8]>, - { - let _cc = concurrency_control::read(); - loop { - if let Ok(merge) = self.merge_inner(key.as_ref(), value.as_ref())? 
{ - return Ok(merge); - } - } - } - - pub(crate) fn merge_inner( - &self, - key: &[u8], - value: &[u8], - ) -> Result>> { - trace!("merging key {:?}", key); - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_merge); - - let merge_operator_opt = self.merge_operator.read(); - - if merge_operator_opt.is_none() { - return Err(Error::Unsupported( - "must set a merge operator on this Tree \ - before calling merge by calling \ - Tree::set_merge_operator" - )); - } - - let merge_operator = merge_operator_opt.as_ref().unwrap(); - - loop { - let guard = pin(); - let View { pid, node_view, .. } = - self.view_for_key(key.as_ref(), &guard)?; - - let (encoded_key, current_value) = - node_view.node_kv_pair(key.as_ref()); - let tmp = current_value.as_ref().map(AsRef::as_ref); - let new_opt = merge_operator(key, tmp, value).map(IVec::from); - - if new_opt.as_ref().map(AsRef::as_ref) == current_value { - // short-circuit no-op write - return Ok(Ok(new_opt)); - } - - let mut subscriber_reservation = self.subscribers.reserve(key); - - let frag = if let Some(ref new) = new_opt { - Link::Set(encoded_key, new.clone()) - } else { - Link::Del(encoded_key) - }; - let link = - self.context.pagecache.link(pid, node_view.0, frag, &guard)?; - - if link.is_ok() { - if let Some(res) = subscriber_reservation.take() { - let event = subscriber::Event::single_update( - self.clone(), - key.as_ref().into(), - new_opt.clone(), - ); - - res.complete(&event); - } - - return Ok(Ok(new_opt)); - } - #[cfg(feature = "metrics")] - M.tree_looped(); - } - } - - /// Sets a merge operator for use with the `merge` function. - /// - /// Merge state directly into a given key's value using the - /// configured merge operator. This allows state to be written - /// into a value directly, without any read-modify-write steps. - /// Merge operators can be used to implement arbitrary data - /// structures. - /// - /// # Panics - /// - /// Calling `merge` will panic if no merge operator has been - /// configured. - /// - /// # Examples - /// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::IVec; - /// - /// fn concatenate_merge( - /// _key: &[u8], // the key being merged - /// old_value: Option<&[u8]>, // the previous value, if one existed - /// merged_bytes: &[u8] // the new bytes being merged in - /// ) -> Option> { // set the new value, return None to delete - /// let mut ret = old_value - /// .map(|ov| ov.to_vec()) - /// .unwrap_or_else(|| vec![]); - /// - /// ret.extend_from_slice(merged_bytes); - /// - /// Some(ret) - /// } - /// - /// db.set_merge_operator(concatenate_merge); - /// - /// let k = b"k1"; - /// - /// db.insert(k, vec![0]); - /// db.merge(k, vec![1]); - /// db.merge(k, vec![2]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![0, 1, 2])))); - /// - /// // Replace previously merged data. The merge function will not be called. - /// db.insert(k, vec![3]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![3])))); - /// - /// // Merges on non-present values will cause the merge function to be called - /// // with `old_value == None`. If the merge function returns something (which it - /// // does, in this case) a new value will be inserted. 
- /// db.remove(k); - /// db.merge(k, vec![4]); - /// assert_eq!(db.get(k), Ok(Some(IVec::from(vec![4])))); - /// # Ok(()) } - /// ``` - pub fn set_merge_operator( - &self, - merge_operator: impl MergeOperator + 'static, - ) { - let mut mo_write = self.merge_operator.write(); - *mo_write = Some(Box::new(merge_operator)); - } - - /// Create a double-ended iterator over the tuples of keys and - /// values in this tree. - /// - /// # Examples - /// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::IVec; - /// db.insert(&[1], vec![10]); - /// db.insert(&[2], vec![20]); - /// db.insert(&[3], vec![30]); - /// let mut iter = db.iter(); - /// assert_eq!( - /// iter.next().unwrap(), - /// Ok((IVec::from(&[1]), IVec::from(&[10]))) - /// ); - /// assert_eq!( - /// iter.next().unwrap(), - /// Ok((IVec::from(&[2]), IVec::from(&[20]))) - /// ); - /// assert_eq!( - /// iter.next().unwrap(), - /// Ok((IVec::from(&[3]), IVec::from(&[30]))) - /// ); - /// assert_eq!(iter.next(), None); - /// # Ok(()) } - /// ``` - pub fn iter(&self) -> Iter { - self.range::, _>(..) - } - - /// Create a double-ended iterator over tuples of keys and values, - /// where the keys fall within the specified range. - /// - /// # Examples - /// - /// ``` - /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::IVec; - /// db.insert(&[0], vec![0])?; - /// db.insert(&[1], vec![10])?; - /// db.insert(&[2], vec![20])?; - /// db.insert(&[3], vec![30])?; - /// db.insert(&[4], vec![40])?; - /// db.insert(&[5], vec![50])?; - /// - /// let start: &[u8] = &[2]; - /// let end: &[u8] = &[4]; - /// let mut r = db.range(start..end); - /// assert_eq!(r.next().unwrap(), Ok((IVec::from(&[2]), IVec::from(&[20])))); - /// assert_eq!(r.next().unwrap(), Ok((IVec::from(&[3]), IVec::from(&[30])))); - /// assert_eq!(r.next(), None); - /// - /// let mut r = db.range(start..end).rev(); - /// assert_eq!(r.next().unwrap(), Ok((IVec::from(&[3]), IVec::from(&[30])))); - /// assert_eq!(r.next().unwrap(), Ok((IVec::from(&[2]), IVec::from(&[20])))); - /// assert_eq!(r.next(), None); - /// # Ok(()) } - /// ``` - pub fn range(&self, range: R) -> Iter - where - K: AsRef<[u8]>, - R: RangeBounds, - { - let lo = match range.start_bound() { - ops::Bound::Included(start) => { - ops::Bound::Included(IVec::from(start.as_ref())) - } - ops::Bound::Excluded(start) => { - ops::Bound::Excluded(IVec::from(start.as_ref())) - } - ops::Bound::Unbounded => ops::Bound::Included(IVec::from(&[])), - }; - - let hi = match range.end_bound() { - ops::Bound::Included(end) => { - ops::Bound::Included(IVec::from(end.as_ref())) - } - ops::Bound::Excluded(end) => { - ops::Bound::Excluded(IVec::from(end.as_ref())) - } - ops::Bound::Unbounded => ops::Bound::Unbounded, - }; - - Iter { - tree: self.clone(), - hi, - lo, - cached_node: None, - going_forward: true, - } - } - /// Create an iterator over tuples of keys and values /// where all keys start with the given prefix. 
/// @@ -1380,9 +1640,9 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; - /// use sled::IVec; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; + /// use sled::InlineArray; /// db.insert(&[0, 0, 0], vec![0, 0, 0])?; /// db.insert(&[0, 0, 1], vec![0, 0, 1])?; /// db.insert(&[0, 0, 2], vec![0, 0, 2])?; @@ -1393,25 +1653,25 @@ impl Tree { /// let prefix: &[u8] = &[0, 0]; /// let mut r = db.scan_prefix(prefix); /// assert_eq!( - /// r.next(), - /// Some(Ok((IVec::from(&[0, 0, 0]), IVec::from(&[0, 0, 0])))) + /// r.next().unwrap().unwrap(), + /// (InlineArray::from(&[0, 0, 0]), InlineArray::from(&[0, 0, 0])) /// ); /// assert_eq!( - /// r.next(), - /// Some(Ok((IVec::from(&[0, 0, 1]), IVec::from(&[0, 0, 1])))) + /// r.next().unwrap().unwrap(), + /// (InlineArray::from(&[0, 0, 1]), InlineArray::from(&[0, 0, 1])) /// ); /// assert_eq!( - /// r.next(), - /// Some(Ok((IVec::from(&[0, 0, 2]), IVec::from(&[0, 0, 2])))) + /// r.next().unwrap().unwrap(), + /// (InlineArray::from(&[0, 0, 2]), InlineArray::from(&[0, 0, 2])) /// ); /// assert_eq!( - /// r.next(), - /// Some(Ok((IVec::from(&[0, 0, 3]), IVec::from(&[0, 0, 3])))) + /// r.next().unwrap().unwrap(), + /// (InlineArray::from(&[0, 0, 3]), InlineArray::from(&[0, 0, 3])) /// ); - /// assert_eq!(r.next(), None); + /// assert!(r.next().is_none()); /// # Ok(()) } /// ``` - pub fn scan_prefix

<P>
(&self, prefix: P) -> Iter + pub fn scan_prefix<'a, P>(&'a self, prefix: P) -> Iter where P: AsRef<[u8]>, { @@ -1419,7 +1679,7 @@ impl Tree { let mut upper = prefix_ref.to_vec(); while let Some(last) = upper.pop() { - if last < u8::max_value() { + if last < u8::MAX { upper.push(last + 1); return self.range(prefix_ref..&upper); } @@ -1430,13 +1690,13 @@ impl Tree { /// Returns the first key and value in the `Tree`, or /// `None` if the `Tree` is empty. - pub fn first(&self) -> Result> { + pub fn first(&self) -> io::Result> { self.iter().next().transpose() } /// Returns the last key and value in the `Tree`, or /// `None` if the `Tree` is empty. - pub fn last(&self) -> Result> { + pub fn last(&self) -> io::Result> { self.iter().next_back().transpose() } @@ -1446,8 +1706,8 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(&[0], vec![0])?; /// db.insert(&[1], vec![10])?; /// db.insert(&[2], vec![20])?; @@ -1455,16 +1715,16 @@ impl Tree { /// db.insert(&[4], vec![40])?; /// db.insert(&[5], vec![50])?; /// - /// assert_eq!(&db.pop_max()?.unwrap().0, &[5]); - /// assert_eq!(&db.pop_max()?.unwrap().0, &[4]); - /// assert_eq!(&db.pop_max()?.unwrap().0, &[3]); - /// assert_eq!(&db.pop_max()?.unwrap().0, &[2]); - /// assert_eq!(&db.pop_max()?.unwrap().0, &[1]); - /// assert_eq!(&db.pop_max()?.unwrap().0, &[0]); - /// assert_eq!(db.pop_max()?, None); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[5]); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[4]); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[3]); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[2]); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[1]); + /// assert_eq!(&db.pop_last()?.unwrap().0, &[0]); + /// assert_eq!(db.pop_last()?, None); /// # Ok(()) } /// ``` - pub fn pop_max(&self) -> Result> { + pub fn pop_last(&self) -> io::Result> { loop { if let Some(first_res) = self.iter().next_back() { let first = first_res?; @@ -1476,13 +1736,80 @@ impl Tree { )? .is_ok() { - trace!("pop_max removed item {:?}", first); + log::trace!("pop_last removed item {:?}", first); return Ok(Some(first)); } // try again } else { - trace!("pop_max removed nothing from empty tree"); + log::trace!("pop_last removed nothing from empty tree"); + return Ok(None); + } + } + } + + /// Pops the last kv pair in the provided range, or returns `Ok(None)` if nothing + /// exists within that range. + /// + /// # Panics + /// + /// This will panic if the provided range's end_bound() == Bound::Excluded(K::MIN). 
+ /// + /// # Examples + /// + /// ``` + /// # fn main() -> Result<(), Box> { + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; + /// + /// let data = vec![ + /// (b"key 1", b"value 1"), + /// (b"key 2", b"value 2"), + /// (b"key 3", b"value 3") + /// ]; + /// + /// for (k, v) in data { + /// db.insert(k, v).unwrap(); + /// } + /// + /// let r1 = db.pop_last_in_range(b"key 1".as_ref()..=b"key 3").unwrap(); + /// assert_eq!(Some((b"key 3".into(), b"value 3".into())), r1); + /// + /// let r2 = db.pop_last_in_range(b"key 1".as_ref()..b"key 3").unwrap(); + /// assert_eq!(Some((b"key 2".into(), b"value 2".into())), r2); + /// + /// let r3 = db.pop_last_in_range(b"key 4".as_ref()..).unwrap(); + /// assert!(r3.is_none()); + /// + /// let r4 = db.pop_last_in_range(b"key 2".as_ref()..=b"key 3").unwrap(); + /// assert!(r4.is_none()); + /// + /// let r5 = db.pop_last_in_range(b"key 0".as_ref()..=b"key 3").unwrap(); + /// assert_eq!(Some((b"key 1".into(), b"value 1".into())), r5); + /// + /// let r6 = db.pop_last_in_range(b"key 0".as_ref()..=b"key 3").unwrap(); + /// assert!(r6.is_none()); + /// # Ok (()) } + /// ``` + pub fn pop_last_in_range( + &self, + range: R, + ) -> io::Result> + where + K: AsRef<[u8]>, + R: Clone + RangeBounds, + { + loop { + let mut r = self.range(range.clone()); + let (k, v) = if let Some(kv_res) = r.next_back() { + kv_res? + } else { return Ok(None); + }; + if self + .compare_and_swap(&k, Some(&v), None as Option)? + .is_ok() + { + return Ok(Some((k, v))); } } } @@ -1493,8 +1820,8 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(&[0], vec![0])?; /// db.insert(&[1], vec![10])?; /// db.insert(&[2], vec![20])?; @@ -1502,16 +1829,16 @@ impl Tree { /// db.insert(&[4], vec![40])?; /// db.insert(&[5], vec![50])?; /// - /// assert_eq!(&db.pop_min()?.unwrap().0, &[0]); - /// assert_eq!(&db.pop_min()?.unwrap().0, &[1]); - /// assert_eq!(&db.pop_min()?.unwrap().0, &[2]); - /// assert_eq!(&db.pop_min()?.unwrap().0, &[3]); - /// assert_eq!(&db.pop_min()?.unwrap().0, &[4]); - /// assert_eq!(&db.pop_min()?.unwrap().0, &[5]); - /// assert_eq!(db.pop_min()?, None); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[0]); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[1]); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[2]); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[3]); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[4]); + /// assert_eq!(&db.pop_first()?.unwrap().0, &[5]); + /// assert_eq!(db.pop_first()?, None); /// # Ok(()) } /// ``` - pub fn pop_min(&self) -> Result> { + pub fn pop_first(&self) -> io::Result> { loop { if let Some(first_res) = self.iter().next() { let first = first_res?; @@ -1523,13 +1850,75 @@ impl Tree { )? .is_ok() { - trace!("pop_min removed item {:?}", first); + log::trace!("pop_first removed item {:?}", first); return Ok(Some(first)); } // try again } else { - trace!("pop_min removed nothing from empty tree"); + log::trace!("pop_first removed nothing from empty tree"); + return Ok(None); + } + } + } + + /// Pops the first kv pair in the provided range, or returns `Ok(None)` if nothing + /// exists within that range. + /// + /// # Panics + /// + /// This will panic if the provided range's end_bound() == Bound::Excluded(K::MIN). 
+ /// + /// # Examples + /// + /// ``` + /// # fn main() -> Result<(), Box> { + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; + /// + /// let data = vec![ + /// (b"key 1", b"value 1"), + /// (b"key 2", b"value 2"), + /// (b"key 3", b"value 3") + /// ]; + /// + /// for (k, v) in data { + /// db.insert(k, v).unwrap(); + /// } + /// + /// let r1 = db.pop_first_in_range("key 1".as_ref()..="key 3").unwrap(); + /// assert_eq!(Some((b"key 1".into(), b"value 1".into())), r1); + /// + /// let r2 = db.pop_first_in_range("key 1".as_ref().."key 3").unwrap(); + /// assert_eq!(Some((b"key 2".into(), b"value 2".into())), r2); + /// + /// let r3_res: std::io::Result> = db.range(b"key 4".as_ref()..).collect(); + /// let r3: Vec<_> = r3_res.unwrap(); + /// assert!(r3.is_empty()); + /// + /// let r4 = db.pop_first_in_range("key 2".as_ref()..="key 3").unwrap(); + /// assert_eq!(Some((b"key 3".into(), b"value 3".into())), r4); + /// # Ok (()) } + /// ``` + pub fn pop_first_in_range( + &self, + range: R, + ) -> io::Result> + where + K: AsRef<[u8]>, + R: Clone + RangeBounds, + { + loop { + let mut r = self.range(range.clone()); + let (k, v) = if let Some(kv_res) = r.next() { + kv_res? + } else { return Ok(None); + }; + if self + .compare_and_swap(&k, Some(&v), None as Option)? + .is_ok() + { + return Ok(Some((k, v))); } } } @@ -1542,8 +1931,8 @@ impl Tree { /// /// ``` /// # fn main() -> Result<(), Box> { - /// # let config = sled::Config::new().temporary(true); - /// # let db = config.open()?; + /// # let config = sled::Config::tmp().unwrap(); + /// # let db: sled::Db<1024> = config.open()?; /// db.insert(b"a", vec![0]); /// db.insert(b"b", vec![1]); /// assert_eq!(db.len(), 2); @@ -1554,14 +1943,24 @@ impl Tree { } /// Returns `true` if the `Tree` contains no elements. - pub fn is_empty(&self) -> bool { - self.iter().next().is_none() + /// + /// This is O(1), as we only need to see if an iterator + /// returns anything for the first call to `next()`. + pub fn is_empty(&self) -> io::Result { + if let Some(res) = self.iter().next() { + res?; + Ok(false) + } else { + Ok(true) + } } /// Clears the `Tree`, removing all values. /// /// Note that this is not atomic. - pub fn clear(&self) -> Result<()> { + /// + /// Beware: performs a full O(n) scan under the hood. + pub fn clear(&self) -> io::Result<()> { for k in self.iter().keys() { let key = k?; let _old = self.remove(key)?; @@ -1569,965 +1968,356 @@ impl Tree { Ok(()) } - /// Returns the name of the tree. - pub fn name(&self) -> IVec { - self.tree_id.clone() - } - /// Returns the CRC32 of all keys and values /// in this Tree. /// /// This is O(N) and locks the underlying tree /// for the duration of the entire scan. 
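Because the checksum folds every key and value into a single CRC32, it doubles as a cheap way to compare the full contents of two trees, for example after replaying the same batch on another instance; equal CRCs strongly suggest, but do not prove, identical contents. A minimal sketch, with the `1024` leaf-size parameter and the helper name chosen only for illustration:

```rust
use sled::{Config, Db};

// Compare the entire contents of two databases by checksum.
fn contents_match(a: &Db<1024>, b: &Db<1024>) -> std::io::Result<bool> {
    Ok(a.checksum()? == b.checksum()?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let a: Db<1024> = Config::tmp().unwrap().open()?;
    let b: Db<1024> = Config::tmp().unwrap().open()?;

    a.insert(b"k", b"v")?;
    b.insert(b"k", b"v")?;

    assert!(contents_match(&a, &b)?);
    Ok(())
}
```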
- pub fn checksum(&self) -> Result { + pub fn checksum(&self) -> io::Result { let mut hasher = crc32fast::Hasher::new(); - let mut iter = self.iter(); - while let Some(kv_res) = iter.next_inner() { + for kv_res in self.iter() { let (k, v) = kv_res?; hasher.update(&k); hasher.update(&v); } Ok(hasher.finalize()) } +} - fn split_node<'g>( - &self, - view: &View<'g>, - parent_view_opt: &Option>, - root_pid: PageId, - guard: &'g Guard, - ) -> Result<()> { - trace!("splitting node with pid {}", view.pid); - // split node - let (mut lhs, rhs) = view.deref().split(); - let rhs_lo = rhs.lo().to_vec(); - - // install right side - let (rhs_pid, rhs_ptr) = self.context.pagecache.allocate(rhs, guard)?; - - // replace node, pointing next to installed right - lhs.set_next(Some(NonZeroU64::new(rhs_pid).unwrap())); - let replace_res = self.context.pagecache.replace( - view.pid, - view.node_view.0, - &lhs, - guard, - )?; - #[cfg(feature = "metrics")] - M.tree_child_split_attempt(); - if replace_res.is_err() { - // if we failed, don't follow through with the - // parent split or root hoist. - let _new_stack = self - .context - .pagecache - .free(rhs_pid, rhs_ptr, guard)? - .expect("could not free allocated page"); - return Ok(()); - } - #[cfg(feature = "metrics")] - M.tree_child_split_success(); - - // either install parent split or hoist root - if let Some(parent_view) = parent_view_opt { - #[cfg(feature = "metrics")] - M.tree_parent_split_attempt(); - let split_applied = parent_view.parent_split(&rhs_lo, rhs_pid); - - if split_applied.is_none() { - // due to deep races, it's possible for the - // parent to already have a node for this lo key. - // if this is the case, we can skip the parent split - // because it's probably going to fail anyway. - return Ok(()); - } +#[allow(unused)] +pub struct Iter { + inner: Tree, + bounds: (Bound, Bound), + next_calls: usize, + next_back_calls: usize, + next_fetch: Option, + next_back_last_lo: Option, + prefetched: VecDeque<(InlineArray, InlineArray)>, + prefetched_back: VecDeque<(InlineArray, InlineArray)>, +} - let parent = split_applied.unwrap(); - - let replace_res2 = self.context.pagecache.replace( - parent_view.pid, - parent_view.node_view.0, - &parent, - guard, - )?; - trace!( - "parent_split at {:?} child pid {} \ - parent pid {} success: {}", - rhs_lo, - rhs_pid, - parent_view.pid, - replace_res2.is_ok() - ); +impl Iterator for Iter { + type Item = io::Result<(InlineArray, InlineArray)>; - #[cfg(feature = "metrics")] - if replace_res2.is_ok() { - M.tree_parent_split_success(); + fn next(&mut self) -> Option { + self.next_calls += 1; + while self.prefetched.is_empty() { + let search_key = if let Some(last) = &self.next_fetch { + last.clone() } else { - // Parent splits are an optimization - // so we don't need to care if we - // failed. 
- } - } else { - let _ = self.root_hoist(root_pid, rhs_pid, &rhs_lo, guard)?; - } - - Ok(()) - } - - fn root_hoist( - &self, - from: PageId, - to: PageId, - at: &[u8], - guard: &Guard, - ) -> Result { - #[cfg(feature = "metrics")] - M.tree_root_split_attempt(); - // hoist new root, pointing to lhs & rhs - - let new_root = Node::new_hoisted_root(from, at, to); - - let (new_root_pid, new_root_ptr) = - self.context.pagecache.allocate(new_root, guard)?; - debug!("allocated pid {} in root_hoist", new_root_pid); - - debug_delay(); - - let cas = self.context.pagecache.cas_root_in_meta( - &self.tree_id, - Some(from), - Some(new_root_pid), - guard, - )?; - if cas.is_ok() { - debug!("root hoist from {} to {} successful", from, new_root_pid); - #[cfg(feature = "metrics")] - M.tree_root_split_success(); - - // we spin in a cas loop because it's possible - // 2 threads are at this point, and we don't want - // to cause roots to diverge between meta and - // our version. - while self - .root - .compare_exchange(from, new_root_pid, SeqCst, SeqCst) - .is_err() - { - // `hint::spin_loop` requires Rust 1.49. - #[allow(deprecated)] - std::sync::atomic::spin_loop_hint(); - } + return None; + }; - Ok(true) - } else { - debug!( - "root hoist from {} to {} failed: {:?}", - from, new_root_pid, cas - ); - let _new_stack = self - .context - .pagecache - .free(new_root_pid, new_root_ptr, guard)? - .expect("could not free allocated page"); + let node = match self.inner.leaf_for_key(&search_key) { + Ok(n) => n, + Err(e) => return Some(Err(e)), + }; - Ok(false) - } - } + let leaf = node.leaf_read.leaf.as_ref().unwrap(); - pub(crate) fn view_for_pid<'g>( - &self, - pid: PageId, - guard: &'g Guard, - ) -> Result>> { - loop { - let node_view_opt = self.context.pagecache.get(pid, guard)?; - if let Some(node_view) = &node_view_opt { - let view = View { node_view: *node_view, pid }; - if view.merging_child.is_some() { - self.merge_node( - &view, - view.merging_child.unwrap().get(), - guard, - )?; - } else { - return Ok(Some(view)); + if let Some(leaf_hi) = &leaf.hi { + if leaf_hi <= &search_key { + // concurrent merge, retry + log::trace!("undershot in interator, retrying search"); + continue; } - } else { - return Ok(None); } - } - } - - // Returns the traversal path, completing any observed - // partially complete splits or merges along the way. - // - // We intentionally leave the cyclometric complexity - // high because attempts to split it up have made - // the inherent complexity of the operation more - // challenging to understand. - #[allow(clippy::cognitive_complexity)] - pub(crate) fn view_for_key<'g, K>( - &self, - key: K, - guard: &'g Guard, - ) -> Result> - where - K: AsRef<[u8]>, - { - #[cfg(any(test, feature = "lock_free_delays"))] - const MAX_LOOPS: usize = usize::max_value(); - - #[cfg(not(any(test, feature = "lock_free_delays")))] - const MAX_LOOPS: usize = 1_000_000; - - #[cfg(feature = "metrics")] - let _measure = Measure::new(&M.tree_traverse); - - let mut cursor = self.root.load(Acquire); - let mut root_pid = cursor; - let mut parent_view_opt = None; - let mut unsplit_parent_opt = None; - let mut took_leftmost_branch = false; - - // only merge or split nodes a few times - let mut smo_budget = 3_u8; - - #[cfg(feature = "testing")] - let mut path = vec![]; - - for i in 0.. { - macro_rules! 
retry { - () => { - trace!( - "retrying at line {} when cursor was {}", - line!(), - cursor - ); - if i > MAX_LOOPS { - break; - } - smo_budget = smo_budget.saturating_sub(1); - cursor = self.root.load(Acquire); - root_pid = cursor; - parent_view_opt = None; - unsplit_parent_opt = None; - took_leftmost_branch = false; - - #[cfg(feature = "testing")] - path.clear(); - continue; - }; + if leaf.lo > search_key { + // concurrent successor split, retry + log::trace!("overshot in interator, retrying search"); + continue; } - if cursor == u64::max_value() { - // this collection has been explicitly removed - return Err(Error::CollectionNotFound); + for (k, v) in leaf.data.iter() { + if self.bounds.contains(k) && &search_key <= k { + self.prefetched.push_back((k.clone(), v.clone())); + } } - let node_opt = self.view_for_pid(cursor, guard)?; - - let view = if let Some(view) = node_opt { - // merging_child should be handled in view_for_pid. - assert!( - view.merging_child.is_none(), - "view_for_pid somehow returned a view for a \ - node with a merging_child without handling it." - ); - view - } else { - retry!(); - }; - - #[cfg(feature = "testing")] - path.push((cursor, view.clone())); + self.next_fetch = leaf.hi.clone(); + } - if view.merging { - // we missed the parent merge intention due to a benign race, - // so go around again and try to help out if necessary - retry!(); - } + self.prefetched.pop_front().map(Ok) + } +} - let overshot = key.as_ref() < view.lo(); - let undershot = if let Some(hi) = view.hi() { - key.as_ref() >= hi +impl DoubleEndedIterator for Iter { + fn next_back(&mut self) -> Option { + self.next_back_calls += 1; + while self.prefetched_back.is_empty() { + let search_key: InlineArray = if let Some(last) = + &self.next_back_last_lo + { + if !self.bounds.contains(last) || last == &InlineArray::MIN { + return None; + } + self.inner + .index + .range::(..last) + .next_back() + .unwrap() + .0 } else { - false - }; - - if overshot { - // merge interfered, reload root and retry - log::trace!( - "overshot searching for {:?} on node {:?}", - key.as_ref(), - view.deref() - ); - retry!(); - } - - if smo_budget > 0 && view.should_split() { - self.split_node(&view, &parent_view_opt, root_pid, guard)?; - retry!(); - } - - if undershot { - // half-complete split detect & completion - let right_sibling = view - .next - .expect( - "if our hi bound is not Inf (inity), \ - we should have a right sibling", - ) - .get(); - trace!( - "seeking right on undershot node, from {} to {}", - cursor, - right_sibling - ); - cursor = right_sibling; - if unsplit_parent_opt.is_none() && parent_view_opt.is_some() { - unsplit_parent_opt = parent_view_opt.clone(); - } else if parent_view_opt.is_none() && view.lo().is_empty() { - assert!(unsplit_parent_opt.is_none()); - assert_eq!(view.pid, root_pid); - // we have found a partially-split root - if self.root_hoist( - root_pid, - view.next.unwrap().get(), - view.hi().unwrap(), - guard, - )? 
{ - #[cfg(feature = "metrics")] - M.tree_root_split_success(); - retry!(); + match &self.bounds.1 { + Bound::Included(k) => k.clone(), + Bound::Excluded(k) if k == &InlineArray::MIN => { + InlineArray::MIN } + Bound::Excluded(k) => self.inner.index.get_lt(k).unwrap().0, + Bound::Unbounded => self.inner.index.last().unwrap().0, } + }; - continue; - } else if let Some(unsplit_parent) = unsplit_parent_opt.take() { - // we have found the proper page for - // our cooperative parent split - trace!( - "trying to apply split of child with \ - lo key of {:?} to parent pid {}", - view.lo(), - unsplit_parent.pid - ); - let split_applied = - unsplit_parent.parent_split(view.lo(), cursor); - - if split_applied.is_none() { - // Due to deep races, it's possible for the - // parent to already have a node for this lo key. - // if this is the case, we can skip the parent split - // because it's probably going to fail anyway. - // - // If a test is failing because of retrying in a - // loop here, this has happened often histically - // due to the Node::index_next_node method - // returning a child that is off-by-one to the - // left, always causing an undershoot. - log::trace!( - "failed to apply parent split of \ - ({:?}, {}) to parent node {:?}", - view.lo(), - cursor, - unsplit_parent - ); - retry!(); - } + let node = match self.inner.leaf_for_key(&search_key) { + Ok(n) => n, + Err(e) => return Some(Err(e)), + }; - let parent: Node = split_applied.unwrap(); + let leaf = node.leaf_read.leaf.as_ref().unwrap(); - #[cfg(feature = "metrics")] - M.tree_parent_split_attempt(); - let replace = self.context.pagecache.replace( - unsplit_parent.pid, - unsplit_parent.node_view.0, - &parent, - guard, - )?; - if replace.is_ok() { - #[cfg(feature = "metrics")] - M.tree_parent_split_success(); - } + if leaf.lo > search_key { + // concurrent successor split, retry + log::trace!("overshot in reverse interator, retrying search"); + continue; } - // detect whether a node is mergeable, and begin - // the merge process. - // NB we can never begin merging a node that is - // the leftmost child of an index, because it - // would be merged into a different index, which - // would add considerable complexity to this already - // fairly complex implementation. 
- if smo_budget > 0 - && !took_leftmost_branch - && parent_view_opt.is_some() - && view.should_merge() - { - let parent = parent_view_opt.as_mut().unwrap(); - assert!(parent.merging_child.is_none()); - if parent.can_merge_child(cursor) { - let frag = Link::ParentMergeIntention(cursor); - - let link = self.context.pagecache.link( - parent.pid, - parent.node_view.0, - frag, - guard, - )?; - - if let Ok(new_parent_ptr) = link { - parent.node_view = NodeView(new_parent_ptr); - self.merge_node(parent, cursor, guard)?; - retry!(); + // determine if we undershot our target due to concurrent modifications + let undershot = + match (&leaf.hi, &self.next_back_last_lo, &self.bounds.1) { + (Some(leaf_hi), Some(last_lo), _) => leaf_hi < last_lo, + (Some(_leaf_hi), None, Bound::Unbounded) => true, + (Some(leaf_hi), None, Bound::Included(bound_key)) => { + leaf_hi <= bound_key } - } - } + (Some(leaf_hi), None, Bound::Excluded(bound_key)) => { + leaf_hi < bound_key + } + (None, _, _) => false, + }; - if view.is_index { - let next = view.index_next_node(key.as_ref()); + if undershot { log::trace!( - "found next {} from node {:?}", - next.1, - view.deref() + "undershoot detected in reverse iterator with \ + (leaf_hi, next_back_last_lo, self.bounds.1) being {:?}", + (&leaf.hi, &self.next_back_last_lo, &self.bounds.1) ); - took_leftmost_branch = next.0; - parent_view_opt = Some(view); - cursor = next.1; - } else { - assert!(!overshot && !undershot); - return Ok(view); + continue; } - } - #[cfg(feature = "testing")] - { - log::error!( - "failed to traverse tree while looking for key {:?}", - key.as_ref() - ); - log::error!("took path:"); - for (pid, view) in path { - log::error!("pid: {} node: {:?}\n\n", pid, view.deref()); + for (k, v) in leaf.data.iter() { + if self.bounds.contains(k) { + let beneath_last_lo = + if let Some(last_lo) = &self.next_back_last_lo { + k < last_lo + } else { + true + }; + if beneath_last_lo { + self.prefetched_back.push_back((k.clone(), v.clone())); + } + } } + self.next_back_last_lo = Some(leaf.lo.clone()); } - panic!( - "cannot find pid {} in view_for_key, looking for key {:?} in tree", - cursor, - key.as_ref(), - ); + self.prefetched_back.pop_back().map(Ok) } +} - fn cap_merging_child<'g>( - &'g self, - child_pid: PageId, - guard: &'g Guard, - ) -> Result>> { - // Get the child node and try to install a `MergeCap` frag. - // In case we succeed, we break, otherwise we try from the start. - loop { - let mut child_view = if let Some(child_view) = - self.view_for_pid(child_pid, guard)? 
- { - child_view - } else { - // the child was already freed, meaning - // somebody completed this whole loop already - return Ok(None); - }; - - if child_view.merging { - trace!("child pid {} already merging", child_pid); - return Ok(Some(child_view)); - } - - let install_frag = self.context.pagecache.link( - child_pid, - child_view.node_view.0, - Link::ChildMergeCap, - guard, - )?; - match install_frag { - Ok(new_ptr) => { - trace!("child pid {} merge capped", child_pid); - child_view.node_view = NodeView(new_ptr); - return Ok(Some(child_view)); - } - Err(Some((_, _))) => { - trace!( - "child pid {} merge cap failed, retrying", - child_pid - ); - continue; - } - Err(None) => { - trace!("child pid {} already freed", child_pid); - return Ok(None); - } - } - } +impl Iter { + pub fn keys( + self, + ) -> impl DoubleEndedIterator> { + self.into_iter().map(|kv_res| kv_res.map(|(k, _v)| k)) } - fn install_parent_merge<'g>( - &self, - parent_view_ref: &View<'g>, - child_pid: PageId, - guard: &'g Guard, - ) -> Result { - let mut parent_view = Cow::Borrowed(parent_view_ref); - loop { - let linked = self.context.pagecache.link( - parent_view.pid, - parent_view.node_view.0, - Link::ParentMergeConfirm, - guard, - )?; - match linked { - Ok(_) => { - trace!( - "ParentMergeConfirm succeeded on parent pid {}, \ - now freeing child pid {}", - parent_view.pid, - child_pid - ); - return Ok(true); - } - Err(None) => { - trace!( - "ParentMergeConfirm \ - failed on (now freed) parent pid {}", - parent_view.pid - ); - return Ok(false); - } - Err(_) => { - let new_parent_view = if let Some(new_parent_view) = - self.view_for_pid(parent_view.pid, guard)? - { - trace!( - "failed to confirm merge \ - on parent pid {}, trying again", - parent_view.pid - ); - new_parent_view - } else { - trace!( - "failed to confirm merge \ - on parent pid {}, which was freed", - parent_view.pid - ); - return Ok(false); - }; - - if new_parent_view.merging_child.map(NonZeroU64::get) - != Some(child_pid) - { - trace!( - "someone else must have already \ - completed the merge, and now the \ - merging child for parent pid {} is {:?}", - new_parent_view.pid, - new_parent_view.merging_child - ); - return Ok(false); - } - - parent_view = Cow::Owned(new_parent_view); - } - } - } + pub fn values( + self, + ) -> impl DoubleEndedIterator> { + self.into_iter().map(|kv_res| kv_res.map(|(_k, v)| v)) } +} - pub(crate) fn merge_node<'g>( - &self, - parent_view: &View<'g>, - child_pid: PageId, - guard: &'g Guard, - ) -> Result<()> { - trace!( - "merging child pid {} of parent pid {}", - child_pid, - parent_view.pid - ); - - let child_view = if let Some(merging_child) = - self.cap_merging_child(child_pid, guard)? - { - merging_child - } else { - return Ok(()); - }; - - assert!(parent_view.is_index); - let child_index = parent_view - .iter_index_pids() - .position(|pid| pid == child_pid) - .unwrap(); +impl IntoIterator for &Tree { + type Item = io::Result<(InlineArray, InlineArray)>; + type IntoIter = Iter; - assert_ne!( - child_index, 0, - "merging child must not be the \ - leftmost child of its parent" - ); + fn into_iter(self) -> Self::IntoIter { + self.iter() + } +} - let mut merge_index = child_index - 1; +/// A batch of updates that will +/// be applied atomically to the +/// Tree. 
+/// +/// # Examples +/// +/// ``` +/// # fn main() -> Result<(), Box> { +/// use sled::{Batch, open}; +/// +/// # let _ = std::fs::remove_dir_all("batch_db_2"); +/// let db: sled::Db<1024> = open("batch_db_2")?; +/// db.insert("key_0", "val_0")?; +/// +/// let mut batch = Batch::default(); +/// batch.insert("key_a", "val_a"); +/// batch.insert("key_b", "val_b"); +/// batch.insert("key_c", "val_c"); +/// batch.remove("key_0"); +/// +/// db.apply_batch(batch)?; +/// // key_0 no longer exists, and key_a, key_b, and key_c +/// // now do exist. +/// # let _ = std::fs::remove_dir_all("batch_db_2"); +/// # Ok(()) } +/// ``` +#[derive(Debug, Default, Clone, PartialEq, Eq)] +pub struct Batch { + pub(crate) writes: + std::collections::BTreeMap>, +} - // we assume caller only merges when - // the node to be merged is not the - // leftmost child. - let mut cursor_pid = - parent_view.iter_index_pids().nth(merge_index).unwrap(); +impl Batch { + /// Set a key to a new value + pub fn insert(&mut self, key: K, value: V) + where + K: Into, + V: Into, + { + self.writes.insert(key.into(), Some(value.into())); + } - // searching for the left sibling to merge the target page into - loop { - // The only way this child could have been freed is if the original - // merge has already been handled. Only in that case can this child - // have been freed - trace!( - "cursor_pid is {} while looking for left sibling", - cursor_pid - ); - let cursor_view = if let Some(cursor_view) = - self.view_for_pid(cursor_pid, guard)? - { - cursor_view - } else { - trace!( - "couldn't retrieve frags for freed \ - (possibly outdated) prospective left \ - sibling with pid {}", - cursor_pid - ); + /// Remove a key + pub fn remove(&mut self, key: K) + where + K: Into, + { + self.writes.insert(key.into(), None); + } - if merge_index == 0 { - trace!( - "failed to find any left sibling for \ - merging pid {}, which means this merge \ - must have already completed.", - child_pid - ); - return Ok(()); - } + /// Get a value if it is present in the `Batch`. + /// `Some(None)` means it's present as a deletion. 
+ pub fn get>(&self, k: K) -> Option> { + let inner = self.writes.get(k.as_ref())?; + Some(inner.as_ref()) + } +} - merge_index -= 1; - cursor_pid = - parent_view.iter_index_pids().nth(merge_index).unwrap(); +impl Leaf { + pub fn serialize(&self, zstd_compression_level: i32) -> Vec { + let mut ret = vec![]; - continue; - }; + let mut zstd_enc = + zstd::stream::Encoder::new(&mut ret, zstd_compression_level) + .unwrap(); - // This means that `cursor_node` is the node we want to replace - if cursor_view.next.map(NonZeroU64::get) == Some(child_pid) { - trace!( - "found left sibling pid {} points to merging node pid {}", - cursor_view.pid, - child_pid - ); - let cursor_node = cursor_view.node_view; - - let replacement = cursor_node.receive_merge(&child_view); - let replace = self.context.pagecache.replace( - cursor_pid, - cursor_node.0, - &replacement, - guard, - )?; - match replace { - Ok(_) => { - trace!( - "merged node pid {} into left sibling pid {}", - child_pid, - cursor_pid - ); - break; - } - Err(None) => { - trace!( - "failed to merge pid {} into \ - pid {} since pid {} doesn't exist anymore", - child_pid, - cursor_pid, - cursor_pid - ); - return Ok(()); - } - Err(_) => { - trace!( - "failed to merge pid {} into \ - pid {} due to cas failure", - child_pid, - cursor_pid - ); - continue; - } - } - } else if cursor_view.hi() >= Some(child_view.lo()) { - // we overshot the node being merged, - trace!( - "cursor pid {} has hi key {:?}, which is \ - >= merging child pid {}'s lo key of {:?}, breaking", - cursor_pid, - cursor_view.hi(), - child_pid, - child_view.lo() - ); - break; - } else { - // In case we didn't find the child, we get the next cursor node - if let Some(next) = cursor_view.next { - trace!( - "traversing from cursor pid {} to right sibling pid {}", - cursor_pid, - next - ); - cursor_pid = next.get(); - } else { - trace!( - "hit the right side of the tree without finding \ - a left sibling for merging child pid {}", - child_pid - ); - break; - } - } - } + bincode::serialize_into(&mut zstd_enc, self).unwrap(); - trace!( - "trying to install parent merge \ - confirmation of merged child pid {} for parent pid {}", - child_pid, - parent_view.pid - ); + zstd_enc.finish().unwrap(); - let should_continue = - self.install_parent_merge(parent_view, child_pid, guard)?; + ret + } - if !should_continue { - return Ok(()); - } + fn deserialize(buf: &[u8]) -> io::Result>> { + let zstd_decoded = zstd::stream::decode_all(buf).unwrap(); + let mut leaf: Box> = + bincode::deserialize(&zstd_decoded).unwrap(); - match self.context.pagecache.free( - child_pid, - child_view.node_view.0, - guard, - )? 
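// Illustrative sketch (not part of the patch hunks above/below): the three
// possible return shapes of the `Batch::get` accessor added above, spelled
// out. `Some(Some(_))` is a pending insert, `Some(None)` is a pending
// removal, and `None` means the batch does not touch that key at all. Only
// APIs shown in this diff are used.
fn batch_get_example() {
    let mut batch = sled::Batch::default();
    batch.insert("present", "value");
    batch.remove("deleted");

    assert!(matches!(batch.get("present"), Some(Some(_))));
    assert!(matches!(batch.get("deleted"), Some(None)));
    assert!(batch.get("untouched").is_none());
}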
{ - Ok(_) => { - // we freed it - trace!("freed merged pid {}", child_pid); - } - Err(None) => { - // someone else freed it - trace!("someone else freed merged pid {}", child_pid); - } - Err(Some(_)) => { - trace!( - "someone was able to reuse freed merged pid {}", - child_pid - ); - // it was reused somehow after we - // observed it as in the merging state - panic!( - "somehow the merging child was reused \ - before all threads that witnessed its previous \ - merge have left their epoch" - ) - } - } + // use decompressed buffer length as a cheap proxy for in-memory size for now + leaf.in_memory_size = zstd_decoded.len(); - trace!("finished with merge of pid {}", child_pid); - Ok(()) + Ok(leaf) } - #[doc(hidden)] - pub fn verify_integrity(&self) -> Result<()> { - // verification happens in attempt_fmt - self.attempt_fmt()?; - Ok(()) + fn set_in_memory_size(&mut self) { + self.in_memory_size = mem::size_of::>() + + self.hi.as_ref().map(|h| h.len()).unwrap_or(0) + + self.lo.len() + + self.data.iter().map(|(k, v)| k.len() + v.len()).sum::(); } - // format and verify tree integrity - fn attempt_fmt(&self) -> Result> { - let mut f = String::new(); - let guard = pin(); - - let mut pid = self.root.load(Acquire); - if pid == 0 { - panic!("somehow tree root was 0"); - } - let mut left_most = pid; - let mut level = 0; - let mut expected_pids = FastSet8::default(); - let mut referenced_pids = FastSet8::default(); - let mut loop_detector = FastSet8::default(); - - expected_pids.insert(pid); - - f.push_str("\tlevel 0:\n"); - - loop { - let get_res = self.view_for_pid(pid, &guard); - let node = match get_res { - Ok(Some(ref view)) => { - expected_pids.remove(&pid); - if loop_detector.contains(&pid) { - if cfg!(feature = "testing") { - panic!( - "detected a loop while iterating over the Tree. \ - pid {} was encountered multiple times", - pid - ); - } else { - error!( - "detected a loop while iterating over the Tree. 
\ - pid {} was encountered multiple times", - pid - ); - } - } else { - loop_detector.insert(pid); - } - - view.deref() - } - Ok(None) => { - if cfg!(feature = "testing") { - error!( - "Tree::fmt failed to read node pid {} \ - that has been freed", - pid, - ); - return Ok(None); - } else { - error!( - "Tree::fmt failed to read node pid {} \ - that has been freed", - pid, - ); - } - break; - } - Err(e) => { - error!( - "hit error while trying to pull pid {}: {:?}", - pid, e - ); - return Err(e); - } + fn split_if_full( + &mut self, + new_epoch: FlushEpoch, + allocator: &ObjectCache, + collection_id: CollectionId, + ) -> Option<(InlineArray, Object)> { + if self.data.is_full() { + // split + let split_offset = if self.lo.is_empty() { + // split left-most shard almost at the beginning for + // optimizing downward-growing workloads + 1 + } else if self.hi.is_none() { + // split right-most shard almost at the end for + // optimizing upward-growing workloads + self.data.len() - 2 + } else { + self.data.len() / 2 }; - f.push_str(&format!("\t\t{}: {:?}\n", pid, node)); - - if node.is_index { - for child_pid in node.iter_index_pids() { - referenced_pids.insert(child_pid); - } - } + let data = self.data.split_off(split_offset); - if let Some(next_pid) = node.next { - pid = next_pid.get(); - } else { - // we've traversed our level, time to bump down - let left_get_opt = self.view_for_pid(left_most, &guard)?; - let left_node = if let Some(ref view) = left_get_opt { - view - } else { - panic!( - "pagecache returned non-base node: {:?}", - left_get_opt - ) - }; + let left_max = &self.data.last().unwrap().0; + let right_min = &data.first().unwrap().0; - if left_node.is_index { - if let Some(next_pid) = left_node.iter_index_pids().next() { - pid = next_pid; - left_most = next_pid; - log::trace!("set left_most to {}", next_pid); - level += 1; - f.push_str(&format!("\n\tlevel {}:\n", level)); - assert!( - expected_pids.is_empty(), - "expected pids {:?} but never \ - saw them on this level. tree so far: {}", - expected_pids, - f - ); - std::mem::swap( - &mut expected_pids, - &mut referenced_pids, - ); - } else { - panic!("trying to debug print empty index node"); - } - } else { - // we've reached the end of our tree, all leafs are on - // the lowest level. - break; - } - } - } + // suffix truncation attempts to shrink the split key + // so that shorter keys bubble up into the index + let splitpoint_length = right_min + .iter() + .zip(left_max.iter()) + .take_while(|(a, b)| a == b) + .count() + + 1; - Ok(Some(f)) - } -} + let split_key = InlineArray::from(&right_min[..splitpoint_length]); -impl Debug for Tree { - fn fmt( - &self, - f: &mut fmt::Formatter<'_>, - ) -> std::result::Result<(), fmt::Error> { - f.write_str("Tree: \n\t")?; - self.context.pagecache.fmt(f)?; - - if let Some(fmt) = self.attempt_fmt().map_err(|_| std::fmt::Error)? { - f.write_str(&fmt)?; - return Ok(()); - } + let rhs_id = allocator.allocate_object_id(new_epoch); - if cfg!(feature = "testing") { - panic!( - "failed to fmt Tree due to expected page disappearing part-way through" - ); - } else { - log::error!( - "failed to fmt Tree due to expected page disappearing part-way through" + log::trace!( + "split leaf {:?} at split key: {:?} into new {:?} at {:?}", + self.lo, + split_key, + rhs_id, + new_epoch, ); - Ok(()) - } - } -} -/// Compare and swap result. -/// -/// It returns `Ok(Ok(()))` if operation finishes successfully and -/// - `Ok(Err(CompareAndSwapError(current, proposed)))` if operation failed -/// to setup a new value. 
`CompareAndSwapError` contains current and -/// proposed values. -/// - `Err(Error::Unsupported)` if the database is opened in read-only mode. -/// otherwise. -pub type CompareAndSwapResult = - Result>; - -impl From for CompareAndSwapResult { - fn from(error: Error) -> Self { - Err(error) - } -} + let mut rhs = Leaf { + dirty_flush_epoch: Some(new_epoch), + hi: self.hi.clone(), + lo: split_key.clone(), + prefix_length: 0, + in_memory_size: 0, + data, + mutation_count: 0, + page_out_on_flush: None, + deleted: None, + max_unflushed_epoch: None, + }; + rhs.set_in_memory_size(); + + self.hi = Some(split_key.clone()); + self.set_in_memory_size(); + + assert_eq!(self.hi.as_ref().unwrap(), &split_key); + assert_eq!(rhs.lo, &split_key); + + let rhs_node = Object { + object_id: rhs_id, + collection_id, + low_key: split_key.clone(), + inner: Arc::new(RwLock::new(CacheBox { + leaf: Some(Box::new(rhs)).into(), + logged_index: BTreeMap::default(), + })), + }; -/// Compare and swap error. -#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)] -pub struct CompareAndSwapError { - /// The current value which caused your CAS to fail. - pub current: Option, - /// Returned value that was proposed unsuccessfully. - pub proposed: Option, -} + return Some((split_key, rhs_node)); + } -impl fmt::Display for CompareAndSwapError { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - write!(f, "Compare and swap conflict") + None } } - -impl std::error::Error for CompareAndSwapError {} diff --git a/src/varint.rs b/src/varint.rs deleted file mode 100644 index 9ca535b39..000000000 --- a/src/varint.rs +++ /dev/null @@ -1,99 +0,0 @@ -use std::convert::TryFrom; - -/// Returns the number of bytes that this varint will need -pub const fn size(int: u64) -> usize { - if int <= 240 { - 1 - } else if int <= 2287 { - 2 - } else if int <= 67823 { - 3 - } else if int <= 0x00FF_FFFF { - 4 - } else if int <= 0xFFFF_FFFF { - 5 - } else if int <= 0x00FF_FFFF_FFFF { - 6 - } else if int <= 0xFFFF_FFFF_FFFF { - 7 - } else if int <= 0x00FF_FFFF_FFFF_FFFF { - 8 - } else { - 9 - } -} - -/// Returns how many bytes the varint consumed while serializing -pub fn serialize_into(int: u64, buf: &mut [u8]) -> usize { - if int <= 240 { - buf[0] = u8::try_from(int).unwrap(); - 1 - } else if int <= 2287 { - buf[0] = u8::try_from((int - 240) / 256 + 241).unwrap(); - buf[1] = u8::try_from((int - 240) % 256).unwrap(); - 2 - } else if int <= 67823 { - buf[0] = 249; - buf[1] = u8::try_from((int - 2288) / 256).unwrap(); - buf[2] = u8::try_from((int - 2288) % 256).unwrap(); - 3 - } else if int <= 0x00FF_FFFF { - buf[0] = 250; - let bytes = int.to_le_bytes(); - buf[1..4].copy_from_slice(&bytes[..3]); - 4 - } else if int <= 0xFFFF_FFFF { - buf[0] = 251; - let bytes = int.to_le_bytes(); - buf[1..5].copy_from_slice(&bytes[..4]); - 5 - } else if int <= 0x00FF_FFFF_FFFF { - buf[0] = 252; - let bytes = int.to_le_bytes(); - buf[1..6].copy_from_slice(&bytes[..5]); - 6 - } else if int <= 0xFFFF_FFFF_FFFF { - buf[0] = 253; - let bytes = int.to_le_bytes(); - buf[1..7].copy_from_slice(&bytes[..6]); - 7 - } else if int <= 0x00FF_FFFF_FFFF_FFFF { - buf[0] = 254; - let bytes = int.to_le_bytes(); - buf[1..8].copy_from_slice(&bytes[..7]); - 8 - } else { - buf[0] = 255; - let bytes = int.to_le_bytes(); - buf[1..9].copy_from_slice(&bytes[..8]); - 9 - } -} - -/// Returns the deserialized varint, along with how many bytes -/// were taken up by the varint. 
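// Illustrative sketch (not part of the patch hunks above/below): a worked
// example of the suffix truncation performed in `split_if_full` above. The
// split key only needs to be one byte longer than the longest shared prefix
// of the last key staying on the left and the first key moving to the right,
// so short separator keys bubble up into the index instead of full user keys.
fn suffix_truncation_example() {
    let left_max: &[u8] = b"apple"; // last key remaining in the left leaf
    let right_min: &[u8] = b"apricot"; // first key moving to the right leaf

    // same computation as in split_if_full
    let splitpoint_length = right_min
        .iter()
        .zip(left_max.iter())
        .take_while(|(a, b)| a == b)
        .count()
        + 1;

    // "ap" is shared, so "apr" is the shortest separator satisfying
    // left_max < "apr" <= right_min.
    assert_eq!(&right_min[..splitpoint_length], b"apr");
}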
-pub fn deserialize(buf: &[u8]) -> crate::Result<(u64, usize)> { - if buf.is_empty() { - return Err(crate::Error::corruption(None)); - } - let res = match buf[0] { - 0..=240 => (u64::from(buf[0]), 1), - 241..=248 => { - let varint = - 240 + 256 * (u64::from(buf[0]) - 241) + u64::from(buf[1]); - (varint, 2) - } - 249 => { - let varint = 2288 + 256 * u64::from(buf[1]) + u64::from(buf[2]); - (varint, 3) - } - other => { - let sz = other as usize - 247; - let mut aligned = [0; 8]; - aligned[..sz].copy_from_slice(&buf[1..=sz]); - let varint = u64::from_le_bytes(aligned); - (varint, sz + 1) - } - }; - Ok(res) -} diff --git a/tests/00_regression.rs b/tests/00_regression.rs new file mode 100644 index 000000000..76afb475f --- /dev/null +++ b/tests/00_regression.rs @@ -0,0 +1,1640 @@ +mod common; +mod tree; + +use std::alloc::{Layout, System}; + +use tree::{prop_tree_matches_btreemap, Key, Op::*}; + +#[global_allocator] +static ALLOCATOR: ShredAllocator = ShredAllocator; + +#[derive(Default, Debug, Clone, Copy)] +struct ShredAllocator; + +unsafe impl std::alloc::GlobalAlloc for ShredAllocator { + unsafe fn alloc(&self, layout: Layout) -> *mut u8 { + assert!(layout.size() < 1_000_000_000); + let ret = System.alloc(layout); + assert_ne!(ret, std::ptr::null_mut()); + std::ptr::write_bytes(ret, 0xa1, layout.size()); + ret + } + + unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { + std::ptr::write_bytes(ptr, 0xde, layout.size()); + System.dealloc(ptr, layout) + } +} + +#[allow(dead_code)] +const INTENSITY: usize = 10; + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_00() { + // postmortem: + prop_tree_matches_btreemap(vec![Restart], false, 0, 256); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_01() { + // postmortem: + // this was a bug in the snapshot recovery, where + // it led to max_id dropping by 1 after a restart. + // postmortem 2: + // we were stalling here because we had a new log with stable of + // SEG_HEADER_LEN, but when we iterated over it to create a new + // snapshot (snapshot every 1 set in Config), we iterated up until + // that offset. make_stable requires our stable offset to be >= + // the provided one, to deal with 0. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![32]), 9), + Set(Key(vec![195]), 13), + Restart, + Set(Key(vec![164]), 147), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_02() { + // postmortem: + // this was a bug in the way that the `Materializer` + // was fed data, possibly out of order, if recover + // in the pagecache had to run over log entries + // that were later run through the same `Materializer` + // then the second time (triggered by a snapshot) + // would not pick up on the importance of seeing + // the new root set. + // portmortem 2: when refactoring iterators, failed + // to account for node.hi being empty on the infinity + // shard + prop_tree_matches_btreemap( + vec![ + Restart, + Set(Key(vec![215]), 121), + Restart, + Set(Key(vec![216]), 203), + Scan(Key(vec![210]), 4), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_03() { + // postmortem: the tree was not persisting and recovering root hoists + // postmortem 2: when refactoring the log storage, we failed to restart + // log writing in the proper location. 
+ prop_tree_matches_btreemap( + vec![ + Set(Key(vec![113]), 204), + Set(Key(vec![119]), 205), + Set(Key(vec![166]), 88), + Set(Key(vec![23]), 44), + Restart, + Set(Key(vec![226]), 192), + Set(Key(vec![189]), 186), + Restart, + Scan(Key(vec![198]), 11), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_04() { + // postmortem: pagecache was failing to replace the LogId list + // when it encountered a new Update::Compact. + // postmortem 2: after refactoring log storage, we were not properly + // setting the log tip, and the beginning got clobbered after writing + // after a restart. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![158]), 31), + Set(Key(vec![111]), 134), + Set(Key(vec![230]), 187), + Set(Key(vec![169]), 58), + Set(Key(vec![131]), 10), + Set(Key(vec![108]), 246), + Set(Key(vec![127]), 155), + Restart, + Set(Key(vec![59]), 119), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_05() { + // postmortem: during recovery, the segment accountant was failing to + // properly set the file's tip. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![231]), 107), + Set(Key(vec![251]), 42), + Set(Key(vec![80]), 81), + Set(Key(vec![178]), 130), + Set(Key(vec![150]), 232), + Restart, + Set(Key(vec![98]), 78), + Set(Key(vec![0]), 45), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_06() { + // postmortem: after reusing segments, we were failing to checksum reads + // performed while iterating over rewritten segment buffers, and using + // former garbage data. fix: use the crc that's there for catching torn + // writes with high probability, AND zero out buffers. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![162]), 8), + Set(Key(vec![59]), 192), + Set(Key(vec![238]), 83), + Set(Key(vec![151]), 231), + Restart, + Set(Key(vec![30]), 206), + Set(Key(vec![150]), 146), + Set(Key(vec![18]), 34), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_07() { + // postmortem: the segment accountant was not fully recovered, and thought + // that it could reuse a particular segment that wasn't actually empty + // yet. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![135]), 22), + Set(Key(vec![41]), 36), + Set(Key(vec![101]), 31), + Set(Key(vec![111]), 35), + Restart, + Set(Key(vec![47]), 36), + Set(Key(vec![79]), 114), + Set(Key(vec![64]), 9), + Scan(Key(vec![196]), 25), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_08() { + // postmortem: failed to properly recover the state in the segment + // accountant that tracked the previously issued segment. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![145]), 151), + Set(Key(vec![155]), 148), + Set(Key(vec![131]), 170), + Set(Key(vec![163]), 60), + Set(Key(vec![225]), 126), + Restart, + Set(Key(vec![64]), 237), + Set(Key(vec![102]), 205), + Restart, + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_09() { + // postmortem 1: was failing to load existing snapshots on initialization. + // would encounter uninitialized segments at the log tip and overwrite + // the first segment (indexed by LSN of 0) in the segment accountant + // ordering, skipping over important updates. + // + // postmortem 2: page size tracking was inconsistent in SA. completely + // removed exact size tracking, and went back to simpler pure-page + // tenancy model. 
+ prop_tree_matches_btreemap( + vec![ + Set(Key(vec![189]), 36), + Set(Key(vec![254]), 194), + Set(Key(vec![132]), 50), + Set(Key(vec![91]), 221), + Set(Key(vec![126]), 6), + Set(Key(vec![199]), 183), + Set(Key(vec![71]), 125), + Scan(Key(vec![67]), 16), + Set(Key(vec![190]), 16), + Restart, + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_10() { + // postmortem: after reusing a segment, but not completely writing a + // segment, we were hitting an old LSN and violating an assert, rather + // than just ending. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![152]), 163), + Set(Key(vec![105]), 191), + Set(Key(vec![207]), 217), + Set(Key(vec![128]), 19), + Set(Key(vec![106]), 22), + Scan(Key(vec![20]), 24), + Set(Key(vec![14]), 150), + Set(Key(vec![80]), 43), + Set(Key(vec![174]), 134), + Set(Key(vec![20]), 150), + Set(Key(vec![13]), 171), + Restart, + Scan(Key(vec![240]), 25), + Scan(Key(vec![77]), 37), + Set(Key(vec![153]), 232), + Del(Key(vec![2])), + Set(Key(vec![227]), 169), + Get(Key(vec![232])), + Cas(Key(vec![247]), 151, 70), + Set(Key(vec![78]), 52), + Get(Key(vec![16])), + Del(Key(vec![78])), + Cas(Key(vec![201]), 93, 196), + Set(Key(vec![172]), 84), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_11() { + // postmortem: a stall was happening because LSNs and LogIds were being + // conflated in calls to make_stable. A higher LogId than any LSN was + // being created, then passed in. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![38]), 148), + Set(Key(vec![176]), 175), + Set(Key(vec![82]), 88), + Set(Key(vec![164]), 85), + Set(Key(vec![139]), 74), + Set(Key(vec![73]), 23), + Cas(Key(vec![34]), 67, 151), + Set(Key(vec![115]), 133), + Set(Key(vec![249]), 138), + Restart, + Set(Key(vec![243]), 6), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_12() { + // postmortem: was not checking that a log entry's LSN matches its position + // as part of detecting tears / partial rewrites. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![118]), 156), + Set(Key(vec![8]), 63), + Set(Key(vec![165]), 110), + Set(Key(vec![219]), 108), + Set(Key(vec![91]), 61), + Set(Key(vec![18]), 98), + Scan(Key(vec![73]), 6), + Set(Key(vec![240]), 108), + Cas(Key(vec![71]), 28, 189), + Del(Key(vec![199])), + Restart, + Set(Key(vec![30]), 140), + Scan(Key(vec![118]), 13), + Get(Key(vec![180])), + Cas(Key(vec![115]), 151, 116), + Restart, + Set(Key(vec![31]), 95), + Cas(Key(vec![79]), 153, 225), + Set(Key(vec![34]), 161), + Get(Key(vec![213])), + Set(Key(vec![237]), 215), + Del(Key(vec![52])), + Set(Key(vec![56]), 78), + Scan(Key(vec![141]), 2), + Cas(Key(vec![228]), 114, 170), + Get(Key(vec![231])), + Get(Key(vec![223])), + Del(Key(vec![167])), + Restart, + Scan(Key(vec![240]), 31), + Del(Key(vec![54])), + Del(Key(vec![2])), + Set(Key(vec![117]), 165), + Set(Key(vec![223]), 50), + Scan(Key(vec![69]), 4), + Get(Key(vec![156])), + Set(Key(vec![214]), 72), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_13() { + // postmortem: failed root hoists were being improperly recovered before the + // following free was done on their page, but we treated the written node as + // if it were a successful completed root hoist. 
+ prop_tree_matches_btreemap( + vec![ + Set(Key(vec![42]), 10), + Set(Key(vec![137]), 220), + Set(Key(vec![183]), 129), + Set(Key(vec![91]), 145), + Set(Key(vec![126]), 26), + Set(Key(vec![255]), 67), + Set(Key(vec![69]), 18), + Restart, + Set(Key(vec![24]), 92), + Set(Key(vec![193]), 17), + Set(Key(vec![3]), 143), + Cas(Key(vec![50]), 13, 84), + Restart, + Set(Key(vec![191]), 116), + Restart, + Del(Key(vec![165])), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_14() { + // postmortem: after adding prefix compression, we were not + // handling re-inserts and deletions properly + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![107]), 234), + Set(Key(vec![7]), 245), + Set(Key(vec![40]), 77), + Set(Key(vec![171]), 244), + Set(Key(vec![173]), 16), + Set(Key(vec![171]), 176), + Scan(Key(vec![93]), 33), + ], + true, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_15() { + // postmortem: was not sorting keys properly when binary searching for them + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![102]), 165), + Set(Key(vec![91]), 191), + Set(Key(vec![141]), 228), + Set(Key(vec![188]), 124), + Del(Key(vec![141])), + Scan(Key(vec![101]), 26), + ], + true, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_16() { + // postmortem: the test merge function was not properly adding numbers. + prop_tree_matches_btreemap( + vec![Merge(Key(vec![247]), 162), Scan(Key(vec![209]), 31)], + false, + 0, +256 + ); +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_17() { + // postmortem: we were creating a copy of a node leaf during iteration + // before accidentally putting it into a PinnedValue, despite the + // fact that it was not actually part of the node's actual memory! + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![194, 215, 103, 0, 138, 11, 248, 131]), 70), + Scan(Key(vec![]), 30), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_18() { + // postmortem: when implementing get_gt and get_lt, there were some + // issues with getting order comparisons correct. 
+ prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 19), + Set(Key(vec![78]), 98), + Set(Key(vec![255]), 224), + Set(Key(vec![]), 131), + Get(Key(vec![255])), + GetGt(Key(vec![89])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_19() { + // postmortem: we were not seeking properly to the next node + // when we hit a half-split child and were using get_lt + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 138), + Set(Key(vec![68]), 113), + Set(Key(vec![155]), 73), + Set(Key(vec![50]), 220), + Set(Key(vec![]), 247), + GetLt(Key(vec![100])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_20() { + // postmortem: we were not seeking forward during get_gt + // if path_for_key reached a leaf that didn't include + // a key for our + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 10), + Set(Key(vec![56]), 42), + Set(Key(vec![138]), 27), + Set(Key(vec![155]), 73), + Set(Key(vec![]), 251), + GetGt(Key(vec![94])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_21() { + // postmortem: more split woes while implementing get_lt + // postmortem 2: failed to properly account for node hi key + // being empty in the view predecessor function + // postmortem 3: when rewriting Iter, failed to account for + // direction of iteration + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![176]), 163), + Set(Key(vec![]), 229), + Set(Key(vec![169]), 121), + Set(Key(vec![]), 58), + GetLt(Key(vec![176])), + ], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_22() { + // postmortem: inclusivity wasn't being properly flipped off after + // the first result during iteration + // postmortem 2: failed to properly check bounds while iterating + prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 155), + Merge(Key(vec![56]), 251), + Scan(Key(vec![]), 2), + ], + false, + 0, + 256, + ); +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_23() { + // postmortem: when rewriting CRC handling code, mis-sized the blob crc + prop_tree_matches_btreemap( + vec![Set(Key(vec![6; 5120]), 92), Restart, Scan(Key(vec![]), 35)], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_24() { + // postmortem: get_gt diverged with the Iter impl + prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 193), + Del(Key(vec![])), + Del(Key(vec![])), + Set(Key(vec![]), 55), + Set(Key(vec![]), 212), + Merge(Key(vec![]), 236), + Del(Key(vec![])), + Set(Key(vec![]), 192), + Del(Key(vec![])), + Set(Key(vec![94]), 115), + Merge(Key(vec![62]), 34), + GetGt(Key(vec![])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_25() { + // postmortem: was not accounting for merges when traversing + // the frag chain and a Del was encountered + prop_tree_matches_btreemap( + vec![Del(Key(vec![])), Merge(Key(vec![]), 84), Get(Key(vec![]))], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_26() { + // postmortem: + prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 194), + Merge(Key(vec![62]), 114), + Merge(Key(vec![80]), 202), + Merge(Key(vec![]), 169), + Set(Key(vec![]), 197), + Del(Key(vec![])), + Del(Key(vec![])), + Set(Key(vec![]), 215), + Set(Key(vec![]), 164), + Merge(Key(vec![]), 150), + GetGt(Key(vec![])), + GetLt(Key(vec![80])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_27() { + // postmortem: was not accounting for the fact that deletions 
reduce the + // chances of being able to split successfully. + prop_tree_matches_btreemap( + vec![ + Del(Key(vec![])), + Merge( + Key(vec![ + 74, 117, 68, 37, 89, 16, 84, 130, 133, 78, 74, 59, 44, 109, + 34, 5, 36, 74, 131, 100, 79, 86, 87, 107, 87, 27, 1, 85, + 53, 112, 89, 75, 67, 78, 58, 121, 0, 105, 8, 117, 79, 40, + 94, 123, 83, 72, 78, 23, 23, 35, 50, 77, 59, 75, 54, 92, + 89, 12, 27, 48, 64, 21, 42, 97, 45, 28, 122, 13, 4, 32, 51, + 25, 26, 18, 65, 12, 54, 104, 106, 80, 75, 91, 111, 9, 5, + 130, 43, 40, 3, 72, 0, 58, 92, 64, 112, 97, 75, 130, 11, + 135, 19, 107, 40, 17, 25, 49, 48, 119, 82, 54, 35, 113, 91, + 68, 12, 118, 123, 62, 108, 88, 67, 43, 33, 119, 132, 124, + 1, 62, 133, 110, 25, 62, 129, 117, 117, 107, 123, 94, 127, + 80, 0, 116, 101, 9, 9, 54, 134, 70, 66, 79, 50, 124, 115, + 85, 42, 120, 24, 15, 81, 100, 72, 71, 40, 58, 22, 6, 34, + 54, 69, 110, 18, 74, 111, 80, 52, 90, 44, 4, 29, 84, 95, + 21, 25, 10, 10, 60, 18, 78, 23, 21, 114, 92, 96, 17, 127, + 53, 86, 2, 60, 104, 8, 132, 44, 115, 6, 25, 80, 46, 12, 20, + 44, 67, 136, 127, 50, 55, 70, 41, 90, 16, 10, 44, 32, 24, + 106, 13, 104, + ]), + 219, + ), + Merge(Key(vec![]), 71), + Del(Key(vec![])), + Set(Key(vec![0]), 146), + Merge(Key(vec![13]), 155), + Merge(Key(vec![]), 14), + Del(Key(vec![])), + Set(Key(vec![]), 150), + Set( + Key(vec![ + 13, 8, 3, 6, 9, 14, 3, 13, 7, 12, 13, 7, 13, 13, 1, 13, 5, + 4, 3, 2, 6, 16, 17, 10, 0, 16, 12, 0, 16, 1, 0, 15, 15, 4, + 1, 6, 9, 9, 11, 16, 7, 6, 10, 1, 11, 10, 4, 9, 9, 14, 4, + 12, 16, 10, 15, 2, 1, 8, 4, + ]), + 247, + ), + Del(Key(vec![154])), + Del(Key(vec![])), + Del(Key(vec![ + 0, 24, 24, 31, 40, 23, 10, 30, 16, 41, 30, 23, 14, 25, 21, 19, + 18, 7, 17, 41, 11, 5, 14, 42, 11, 22, 4, 8, 4, 38, 33, 31, 3, + 30, 40, 22, 40, 39, 5, 40, 1, 41, 11, 26, 25, 33, 12, 38, 4, + 35, 30, 42, 19, 26, 23, 22, 39, 18, 29, 4, 1, 24, 14, 38, 0, + 36, 27, 11, 27, 34, 16, 15, 38, 0, 20, 37, 22, 31, 12, 26, 16, + 4, 22, 25, 4, 34, 4, 33, 37, 28, 18, 4, 41, 15, 8, 16, 27, 3, + 20, 26, 40, 31, 15, 15, 17, 15, 5, 13, 22, 37, 7, 13, 35, 14, + 6, 28, 21, 26, 13, 35, 1, 10, 8, 34, 23, 27, 29, 8, 14, 42, 36, + 31, 34, 12, 31, 24, 5, 8, 11, 36, 29, 24, 38, 8, 12, 18, 22, + 36, 21, 28, 11, 24, 0, 41, 37, 39, 42, 25, 13, 41, 27, 8, 24, + 22, 30, 17, 2, 4, 20, 33, 5, 24, 33, 6, 29, 5, 0, 17, 9, 20, + 26, 15, 23, 22, 16, 23, 16, 1, 20, 0, 28, 16, 34, 30, 19, 5, + 36, 40, 28, 6, 39, + ])), + Merge(Key(vec![]), 50), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_28() { + // postmortem: + prop_tree_matches_btreemap( + vec![ + Del(Key(vec![])), + Set(Key(vec![]), 65), + Del(Key(vec![])), + Del(Key(vec![])), + Merge(Key(vec![]), 50), + Merge(Key(vec![]), 2), + Del(Key(vec![197])), + Merge(Key(vec![5]), 146), + Set(Key(vec![222]), 224), + Merge(Key(vec![149]), 60), + Scan(Key(vec![178]), 18), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_29() { + // postmortem: tree merge and split thresholds caused an infinite + // loop while performing updates + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 142), + Merge( + Key(vec![ + 45, 47, 6, 67, 16, 12, 62, 35, 69, 80, 49, 61, 29, 82, 9, + 47, 25, 78, 47, 64, 29, 74, 45, 0, 37, 44, 21, 82, 55, 44, + 31, 60, 86, 18, 45, 67, 55, 21, 35, 46, 25, 51, 5, 32, 33, + 36, 1, 81, 28, 28, 79, 76, 80, 89, 80, 62, 8, 85, 50, 15, + 4, 11, 76, 72, 73, 47, 30, 50, 85, 67, 84, 13, 82, 84, 78, + 70, 42, 83, 8, 7, 50, 77, 85, 37, 47, 82, 86, 46, 30, 27, + 5, 39, 70, 26, 59, 16, 6, 34, 56, 40, 40, 67, 16, 61, 
63, + 56, 64, 31, 15, 81, 84, 19, 61, 66, 3, 7, 40, 56, 13, 40, + 64, 50, 88, 47, 88, 50, 63, 65, 79, 62, 1, 44, 59, 27, 12, + 60, 3, 36, 89, 45, 18, 4, 68, 48, 61, 30, 48, 26, 84, 49, + 3, 74, 51, 53, 30, 57, 50, 35, 74, 59, 30, 73, 19, 30, 82, + 78, 3, 5, 62, 17, 48, 29, 67, 52, 45, 61, 74, 52, 29, 61, + 63, 11, 89, 76, 34, 8, 50, 75, 42, 12, 5, 55, 0, 59, 44, + 68, 26, 76, 37, 50, 53, 73, 53, 76, 57, 40, 30, 52, 0, 41, + 21, 8, 79, 79, 38, 37, 50, 56, 43, 9, 85, 21, 60, 64, 13, + 54, 60, 83, 1, 2, 37, 75, 42, 0, 83, 81, 80, 87, 12, 15, + 75, 55, 41, 59, 9, 80, 66, 27, 65, 26, 48, 29, 37, 38, 9, + 76, 31, 39, 35, 22, 73, 59, 28, 33, 35, 63, 78, 17, 22, 82, + 12, 60, 49, 26, 54, 19, 60, 29, 39, 37, 10, 50, 12, 19, 29, + 1, 74, 12, 5, 38, 49, 41, 19, 88, 3, 27, 77, 81, 72, 42, + 71, 86, 82, 11, 79, 40, 35, 26, 35, 64, 4, 33, 87, 31, 84, + 81, 74, 31, 49, 0, 29, 73, 14, 55, 78, 21, 23, 20, 83, 48, + 89, 88, 62, 64, 73, 7, 20, 70, 81, 64, 3, 79, 38, 75, 13, + 40, 29, 82, 40, 14, 66, 56, 54, 52, 37, 14, 67, 8, 37, 1, + 5, 73, 14, 35, 63, 48, 46, 22, 84, 71, 2, 60, 63, 88, 14, + 15, 69, 88, 2, 43, 57, 43, 52, 18, 78, 75, 75, 74, 13, 35, + 50, 35, 17, 13, 64, 82, 55, 32, 14, 57, 35, 77, 65, 22, 40, + 27, 39, 80, 23, 20, 41, 50, 48, 22, 84, 37, 59, 45, 64, 10, + 3, 69, 56, 24, 4, 25, 76, 65, 47, 52, 64, 88, 3, 23, 37, + 16, 56, 69, 71, 27, 87, 65, 74, 23, 82, 41, 60, 78, 75, 22, + 51, 15, 57, 80, 46, 73, 7, 1, 36, 64, 0, 56, 83, 74, 62, + 73, 81, 68, 71, 63, 31, 5, 23, 11, 15, 39, 2, 10, 23, 18, + 74, 3, 43, 25, 68, 54, 11, 21, 14, 58, 10, 73, 0, 66, 28, + 73, 25, 40, 55, 56, 33, 81, 67, 43, 35, 65, 38, 21, 48, 81, + 4, 77, 68, 51, 38, 36, 49, 43, 33, 51, 28, 43, 60, 71, 78, + 48, 49, 76, 21, 0, 72, 0, 32, 78, 12, 87, 5, 80, 62, 40, + 85, 26, 70, 58, 56, 78, 7, 53, 30, 16, 22, 12, 23, 37, 83, + 45, 33, 41, 83, 78, 87, 44, 0, 65, 51, 3, 8, 72, 38, 14, + 24, 64, 77, 45, 5, 1, 7, 27, 82, 7, 6, 70, 25, 67, 22, 8, + 30, 76, 41, 11, 14, 1, 65, 85, 60, 80, 0, 30, 31, 79, 43, + 89, 33, 84, 22, 7, 67, 45, 39, 74, 75, 12, 61, 19, 71, 66, + 83, 57, 38, 45, 21, 18, 37, 54, 36, 14, 54, 63, 81, 12, 7, + 10, 39, 16, 40, 10, 7, 81, 45, 12, 22, 20, 29, 85, 40, 41, + 72, 79, 58, 50, 41, 59, 64, 41, 32, 56, 35, 8, 60, 17, 14, + 89, 17, 7, 48, 6, 35, 9, 34, 54, 6, 44, 87, 76, 50, 1, 67, + 70, 15, 8, 4, 45, 67, 86, 32, 69, 3, 88, 85, 72, 66, 21, + 89, 11, 77, 1, 50, 75, 56, 41, 74, 6, 4, 51, 65, 39, 50, + 45, 56, 3, 19, 80, 86, 55, 48, 81, 17, 3, 89, 7, 9, 63, 58, + 80, 39, 34, 85, 55, 71, 41, 55, 8, 63, 38, 51, 47, 49, 83, + 2, 73, 22, 39, 18, 45, 77, 56, 80, 54, 13, 23, 81, 54, 15, + 48, 57, 83, 71, 41, 32, 64, 1, 9, 46, 27, 16, 21, 7, 28, + 55, 17, 71, 68, 17, 74, 46, 38, 84, 3, 12, 71, 63, 16, 23, + 48, 12, 29, 28, 5, 21, 61, 14, 77, 66, 62, 57, 18, 30, 63, + 14, 41, 37, 30, 73, 16, 12, 74, 8, 82, 67, 53, 10, 5, 37, + 36, 39, 52, 37, 72, 76, 21, 35, 40, 42, 55, 47, 50, 41, 19, + 40, 86, 26, 54, 23, 74, 46, 66, 59, 80, 26, 81, 61, 80, 88, + 55, 40, 30, 45, 7, 46, 21, 3, 20, 46, 63, 18, 9, 34, 67, 9, + 19, 52, 53, 29, 69, 78, 65, 39, 71, 40, 38, 57, 80, 27, 34, + 30, 27, 55, 8, 65, 31, 37, 33, 25, 39, 46, 9, 83, 6, 27, + 28, 61, 9, 21, 58, 21, 10, 69, 24, 5, 31, 32, 44, 26, 84, + 73, 73, 9, 64, 26, 21, 85, 12, 39, 81, 38, 49, 24, 35, 3, + 88, 15, 15, 76, 64, 70, 9, 30, 51, 26, 16, 70, 60, 15, 7, + 54, 36, 32, 9, 10, 18, 66, 19, 25, 77, 46, 51, 51, 14, 41, + 56, 65, 41, 87, 26, 10, 2, 73, 2, 71, 26, 56, 10, 68, 15, + 53, 10, 43, 15, 22, 45, 2, 15, 16, 69, 80, 83, 18, 22, 70, + 77, 52, 48, 24, 17, 40, 56, 22, 17, 3, 
36, 46, 37, 41, 22, + 0, 41, 45, 14, 15, 73, 18, 42, 34, 5, 87, 6, 2, 7, 58, 3, + 86, 87, 7, 79, 88, 33, 30, 48, 3, 66, 27, 34, 58, 48, 71, + 40, 1, 46, 84, 32, 63, 79, 0, 21, 71, 1, 59, 39, 77, 51, + 14, 20, 58, 83, 19, 0, 2, 2, 57, 73, 79, 42, 59, 33, 50, + 15, 11, 48, 25, 14, 39, 36, 88, 71, 28, 45, 15, 59, 39, 60, + 78, 18, 18, 45, 50, 29, 66, 86, 5, 76, 85, 55, 17, 28, 8, + 39, 75, 33, 9, 73, 71, 59, 56, 57, 86, 6, 75, 26, 43, 68, + 34, 82, 88, 76, 17, 86, 63, 2, 38, 63, 13, 44, 8, 25, 0, + 63, 54, 73, 52, 3, 72, + ]), + 9, + ), + Set(Key(vec![]), 35), + Set( + Key(vec![ + 165, 64, 99, 55, 152, 102, 148, 35, 59, 10, 198, 191, 71, + 129, 170, 155, 7, 106, 171, 93, 126, + ]), + 212, + ), + Del(Key(vec![])), + Merge(Key(vec![]), 177), + Merge( + Key(vec![ + 20, 55, 154, 104, 10, 68, 64, 3, 31, 78, 232, 227, 169, + 161, 13, 50, 16, 239, 87, 0, 9, 85, 248, 32, 156, 106, 11, + 18, 57, 13, 177, 36, 69, 176, 101, 92, 119, 38, 218, 26, 4, + 154, 185, 135, 75, 167, 101, 107, 206, 76, 153, 213, 70, + 52, 205, 95, 55, 116, 242, 68, 77, 90, 249, 142, 93, 135, + 118, 127, 116, 121, 235, 183, 215, 2, 118, 193, 146, 185, + 4, 129, 167, 164, 178, 105, 149, 47, 73, 121, 95, 23, 216, + 153, 23, 108, 141, 190, 250, 121, 98, 229, 33, 106, 89, + 117, 122, 145, 47, 242, 81, 88, 141, 38, 177, 170, 167, 56, + 24, 196, 61, 97, 83, 91, 202, 181, 75, 112, 3, 169, 61, 17, + 100, 81, 111, 178, 122, 176, 95, 185, 169, 146, 239, 40, + 168, 32, 170, 34, 172, 89, 59, 188, 170, 186, 61, 7, 177, + 230, 130, 155, 208, 171, 82, 153, 20, 72, 74, 111, 147, + 178, 164, 157, 71, 114, 216, 40, 85, 91, 20, 145, 149, 95, + 36, 114, 24, 129, 144, 229, 14, 133, 77, 92, 139, 167, 48, + 18, 178, 4, 15, 171, 171, 88, 74, 104, 157, 2, 121, 13, + 141, 6, 107, 118, 228, 147, 152, 28, 206, 128, 102, 150, 1, + 129, 84, 171, 119, 110, 198, 72, 100, 166, 153, 98, 66, + 128, 79, 41, 126, + ]), + 103, + ), + Del(Key(vec![])), + Merge( + Key(vec![ + 117, 48, 90, 153, 149, 191, 229, 73, 3, 6, 73, 52, 73, 186, + 42, 53, 94, 17, 61, 11, 153, 118, 219, 188, 184, 89, 13, + 124, 138, 40, 238, 9, 46, 45, 38, 115, 153, 106, 166, 56, + 134, 206, 140, 57, 95, 244, 27, 135, 43, 13, 143, 137, 56, + 122, 243, 205, 52, 116, 130, 35, 80, 167, 58, 93, + ]), + 8, + ), + Set(Key(vec![145]), 43), + GetLt(Key(vec![229])), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_30() { + // postmortem: + prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 241), + Set(Key(vec![20]), 146), + Merge( + Key(vec![ + 60, 38, 29, 57, 35, 71, 15, 46, 7, 27, 76, 84, 27, 25, 90, + 30, 37, 63, 11, 24, 27, 28, 94, 93, 82, 68, 69, 61, 46, 86, + 11, 86, 63, 34, 90, 71, 92, 87, 38, 48, 40, 78, 9, 37, 26, + 36, 60, 4, 2, 38, 32, 73, 86, 43, 52, 79, 11, 43, 59, 21, + 60, 40, 80, 94, 69, 44, 4, 73, 59, 16, 16, 22, 88, 41, 13, + 21, 91, 33, 49, 91, 20, 79, 23, 61, 53, 63, 58, 62, 49, 10, + 71, 72, 27, 55, 53, 39, 91, 82, 86, 38, 41, 1, 54, 3, 77, + 15, 93, 31, 49, 29, 82, 7, 17, 58, 42, 12, 49, 67, 62, 46, + 20, 27, 61, 32, 58, 9, 17, 19, 28, 44, 41, 34, 94, 11, 50, + 73, 1, 50, 48, 8, 88, 33, 40, 51, 15, 35, 2, 36, 37, 30, + 37, 83, 71, 91, 32, 0, 69, 28, 64, 30, 72, 63, 39, 7, 89, + 0, 21, 51, 92, 80, 13, 57, 7, 53, 94, 26, 2, 63, 18, 23, + 89, 34, 83, 55, 32, 75, 81, 27, 11, 5, 63, 0, 75, 12, 39, + 9, 13, 20, 25, 57, 94, 75, 59, 46, 84, 80, 61, 24, 31, 7, + 68, 93, 12, 94, 6, 94, 27, 33, 81, 19, 3, 78, 3, 14, 22, + 36, 49, 61, 51, 79, 43, 35, 58, 54, 65, 72, 36, 87, 3, 3, + 25, 75, 82, 58, 75, 76, 29, 89, 1, 16, 64, 63, 85, 0, 47, + ]), + 11, + ), 
+ Merge(Key(vec![25]), 245), + Merge(Key(vec![119]), 152), + Scan(Key(vec![]), 31), + ], + false, + 0, + 256, + ); +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_31() { + // postmortem: + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![1]), 212), + Set(Key(vec![12]), 174), + Set(Key(vec![]), 182), + Set( + Key(vec![ + 12, 55, 46, 38, 40, 34, 44, 32, 19, 15, 28, 49, 35, 40, 55, + 35, 61, 9, 62, 18, 3, 58, + ]), + 86, + ), + Scan(Key(vec![]), -18), + ], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_32() { + // postmortem: the MAX_IVEC that predecessor used in reverse + // iteration was setting the first byte to 0 even though we + // no longer perform per-key prefix encoding. + prop_tree_matches_btreemap( + vec![Set(Key(vec![57]), 141), Scan(Key(vec![]), -40)], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_33() { + // postmortem: the split point was being incorrectly + // calculated when using the simplified prefix technique. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 91), + Set(Key(vec![1]), 216), + Set(Key(vec![85, 25]), 78), + Set(Key(vec![85]), 43), + GetLt(Key(vec![])), + ], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_34() { + // postmortem: a safety check was too aggressive when + // finding predecessors using the new simplified prefix + // encoding technique. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![9, 212]), 100), + Set(Key(vec![9]), 63), + Set(Key(vec![5]), 100), + Merge(Key(vec![]), 16), + Set(Key(vec![9, 70]), 188), + Scan(Key(vec![]), -40), + ], + false, + 0, + 256, + ); +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_35() { + // postmortem: prefix lengths were being incorrectly + // handled on splits. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![207]), 29), + Set(Key(vec![192]), 218), + Set(Key(vec![121]), 167), + Set(Key(vec![189]), 40), + Set(Key(vec![85]), 197), + Set(Key(vec![185]), 58), + Set(Key(vec![84]), 97), + Set(Key(vec![23]), 34), + Set(Key(vec![47]), 162), + Set(Key(vec![39]), 92), + Set(Key(vec![46]), 173), + Set(Key(vec![33]), 202), + Set(Key(vec![8]), 113), + Set(Key(vec![17]), 228), + Set(Key(vec![8, 49]), 217), + Set(Key(vec![6]), 192), + Set(Key(vec![5]), 47), + Set(Key(vec![]), 5), + Set(Key(vec![0]), 103), + Set(Key(vec![1]), 230), + Set(Key(vec![0, 229]), 117), + Set(Key(vec![]), 112), + ], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_36() { + // postmortem: suffix truncation caused + // regions to be permanently inaccessible + // when applied to split points on index + // nodes. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![152]), 65), + Set(Key(vec![]), 227), + Set(Key(vec![101]), 23), + Merge(Key(vec![254]), 97), + Set(Key(vec![254, 5]), 207), + Scan(Key(vec![]), -30), + ], + false, + 0, + 256, + ); +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_37() { + // postmortem: suffix truncation was so + // aggressive that it would cut into + // the prefix in the lo key sometimes. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 82), + Set(Key(vec![2, 0]), 40), + Set(Key(vec![2, 0, 0]), 49), + Set(Key(vec![1]), 187), + Scan(Key(vec![]), 33), + ], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_38() { + // postmortem: Free pages were not being initialized in the + // pagecache properly. 
+ for _ in 0..10 { + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![193]), 73), + Merge(Key(vec![117]), 216), + Set(Key(vec![221]), 176), + GetLt(Key(vec![123])), + Restart, + ], + false, + 0, + 256, + ); + } +} +*/ + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_39() { + // postmortem: + for _ in 0..100 { + prop_tree_matches_btreemap( + vec![ + Set( + Key(vec![ + 67, 48, 34, 254, 61, 189, 196, 127, 26, 185, 244, 63, + 60, 63, 246, 194, 243, 177, 218, 210, 153, 126, 124, + 47, 160, 242, 157, 2, 51, 34, 88, 41, 44, 65, 58, 211, + 245, 74, 192, 101, 222, 68, 196, 250, 127, 231, 102, + 177, 246, 105, 190, 144, 113, 148, 71, 72, 149, 246, + 38, 95, 106, 42, 83, 65, 84, 73, 148, 34, 95, 88, 57, + 232, 219, 227, 74, 14, 5, 124, 106, 57, 244, 50, 81, + 93, 145, 111, 40, 190, 127, 227, 17, 242, 165, 194, + 171, 60, 6, 255, 176, 143, 131, 164, 217, 18, 123, 19, + 246, 183, 29, 0, 6, 39, 175, 57, 134, 166, 231, 47, + 254, 158, 163, 178, 78, 240, 108, 157, 72, 135, 34, + 236, 103, 192, 109, 31, 2, 72, 128, 242, 4, 113, 109, + 224, 120, 61, 169, 226, 131, 210, 33, 181, 91, 91, 197, + 223, 127, 26, 94, 158, 55, 57, 3, 184, 15, 30, 2, 222, + 39, 29, 12, 42, 14, 166, 176, 28, 13, 246, 11, 186, 8, + 247, 113, 253, 102, 227, 68, 111, 227, 238, 54, 150, + 11, 57, 155, 4, 75, 179, 17, 172, 42, 22, 199, 44, 242, + 211, 0, 39, 243, 221, 114, 86, 145, 22, 226, 108, 32, + 248, 42, 49, 191, 112, 1, 69, 101, 112, 251, 243, 252, + 83, 140, 132, 165, + ]), + 250, + ), + Del(Key(vec![ + 11, 77, 168, 37, 181, 169, 239, 146, 240, 211, 7, 115, 197, + 119, 46, 80, 240, 92, 221, 108, 208, 247, 221, 129, 108, + 13, 36, 21, 93, 11, 243, 103, 188, 39, 126, 77, 29, 32, + 206, 175, 199, 245, 71, 96, 221, 7, 68, 64, 45, 78, 68, + 193, 73, 13, 60, 13, 28, 167, 147, 7, 90, 11, 206, 44, 84, + 243, 3, 77, 122, 87, 7, 125, 184, 6, 178, 59, + ])), + Merge(Key(vec![176]), 123), + Restart, + Merge( + Key(vec![ + 93, 43, 181, 76, 63, 247, 227, 15, 17, 239, 9, 252, + 181, 53, 65, 74, 22, 18, 71, 64, 115, 58, 110, 30, 13, + 177, 31, 47, 124, 14, 0, 157, 200, 194, 92, 215, 21, + 36, 239, 204, 18, 88, 216, 149, 18, 208, 187, 188, 32, + 76, 35, 12, 142, 157, 38, 186, 245, 63, 2, 230, 13, 79, + 160, 86, 32, 170, 239, 151, 25, 180, 170, 201, 22, 211, + 238, 208, 24, 139, 5, 44, 38, 48, 243, 38, 249, 36, 43, + 200, 52, 244, 166, 0, 29, 114, 10, 18, 253, 253, 130, + 223, 37, 8, 109, 228, 0, 122, 192, 16, 68, 231, 37, + 230, 249, 180, 214, 101, 17, + ]), + 176, + ), + Set( + Key(vec![ + 153, 217, 142, 179, 255, 74, 1, 20, 254, 1, 38, 28, 66, + 244, 81, 101, 210, 58, 18, 107, 12, 116, 74, 188, 95, + 56, 248, 9, 204, 128, 24, 239, 143, 83, 83, 213, 17, + 32, 135, 73, 217, 8, 241, 44, 57, 131, 107, 139, 122, + 32, 194, 225, 136, 148, 227, 196, 196, 121, 97, 81, 74, + ]), + 42, + ), + Set(Key(vec![]), 160), + GetLt(Key(vec![ + 244, 145, 243, 120, 149, 64, 125, 161, 98, 205, 205, 107, + 191, 119, 83, 42, 92, 119, 25, 198, 47, 123, 26, 224, 190, + 98, 144, 238, 74, 36, 76, 186, 226, 153, 69, 217, 109, 214, + 201, 104, 148, 107, 132, 219, 37, 109, 98, 172, 70, 160, + 177, 115, 194, 80, 76, 60, 148, 176, 191, 84, 109, 35, 51, + 107, 157, 11, 233, 126, 71, 183, 215, 116, 72, 235, 218, + 171, 233, 181, 53, 253, 104, 231, 138, 166, 40, + ])), + Set( + Key(vec![ + 37, 160, 29, 162, 43, 212, 2, 100, 236, 24, 2, 82, 58, + 38, 81, 137, 89, 55, 164, 83, + ]), + 64, + ), + Get(Key(vec![ + 15, 53, 101, 33, 156, 199, 212, 82, 2, 64, 136, 70, 235, + 72, 170, 188, 180, 200, 109, 231, 6, 13, 30, 70, 4, 132, + 133, 101, 82, 187, 78, 241, 157, 49, 156, 3, 17, 
167, 216, + 209, 7, 174, 112, 186, 170, 189, 85, 99, 119, 52, 39, 38, + 151, 108, 203, 42, 63, 255, 216, 234, 34, 2, 80, 168, 122, + 70, 20, 11, 220, 106, 49, 110, 165, 170, 149, 163, + ])), + GetLt(Key(vec![])), + Merge(Key(vec![136]), 135), + Cas(Key(vec![177]), 159, 209), + Cas(Key(vec![101]), 143, 240), + Set(Key(vec![226, 62, 34, 63, 172, 96, 162]), 43), + Merge( + Key(vec![ + 48, 182, 144, 255, 137, 100, 2, 139, 69, 111, 159, 133, + 234, 147, 118, 231, 155, 74, 73, 98, 58, 36, 35, 21, + 50, 42, 71, 25, 200, 5, 4, 198, 158, 41, 88, 75, 153, + 254, 248, 213, 0, 89, 43, 160, 58, 206, 88, 107, 57, + 208, 119, 34, 80, 166, 112, 13, 241, 46, 172, 115, 179, + 42, 59, 200, 225, 125, 65, 18, 173, 77, 27, 129, 228, + 68, 53, 175, 61, 230, 27, 136, 131, 171, 64, 79, 125, + 149, 52, 80, + ]), + 105, + ), + Merge( + Key(vec![ + 126, 109, 165, 43, 2, 82, 97, 81, 59, 78, 243, 142, 37, + 105, 109, 178, 25, 73, 50, 103, 107, 129, 213, 193, + 158, 16, 63, 108, 160, 204, 78, 83, 2, 43, 66, 2, 18, + 11, 147, 47, 106, 106, 141, 82, 65, 101, 99, 171, 178, + 68, 106, 7, 190, 159, 105, 132, 155, 240, 155, 95, 66, + 254, 239, 202, 168, 26, 207, 213, 116, 215, 141, 77, 7, + 245, 174, 144, 39, 28, + ]), + 122, + ), + Del(Key(vec![ + 13, 152, 171, 90, 130, 131, 232, 51, 173, 103, 255, 225, + 156, 192, 146, 141, 94, 84, 39, 171, 152, 114, 133, 20, + 125, 68, 57, 27, 33, 175, 37, 164, 40, + ])), + Scan(Key(vec![]), -34), + Set(Key(vec![]), 85), + Merge(Key(vec![112]), 104), + Restart, + Restart, + Del(Key(vec![237])), + Set( + Key(vec![ + 53, 79, 71, 234, 187, 78, 206, 117, 48, 84, 162, 101, + 132, 137, 43, 144, 234, 23, 116, 13, 28, 184, 174, 241, + 181, 201, 131, 156, 7, 103, 135, 17, 168, 249, 7, 120, + 74, 8, 192, 134, 109, 54, 175, 130, 145, 206, 185, 49, + 144, 133, 226, 244, 42, 126, 176, 232, 96, 56, 70, 56, + 159, 127, 35, 39, 185, 114, 182, 41, 50, 93, 61, + ]), + 144, + ), + Merge( + Key(vec![ + 10, 58, 6, 62, 17, 15, 26, 29, 79, 34, 77, 12, 93, 65, + 87, 71, 19, 57, 25, 40, 53, 73, 57, 2, 81, 49, 67, 62, + 78, 14, 34, 70, 86, 49, 86, 84, 16, 33, 24, 7, 87, 49, + 58, 50, 13, 14, 35, 46, 7, 39, 76, 51, 21, 76, 9, 53, + 45, 21, 71, 48, 16, 73, 68, 1, 63, 34, 12, 42, 11, 85, + 79, 19, 11, 77, 90, 0, 62, 56, 37, 33, 10, 69, 20, 64, + 15, 51, 64, 90, 69, 15, 7, 41, 53, 71, 52, 21, 45, 45, + 49, 3, 59, 15, 90, 7, 12, 62, 30, 81, + ]), + 131, + ), + Get(Key(vec![ + 79, 28, 48, 41, 5, 70, 54, 56, 36, 32, 59, 15, 26, 42, 61, + 23, 53, 6, 71, 44, 61, 65, 4, 17, 23, 15, 65, 64, 46, 66, + 27, 63, 51, 44, 35, 1, 8, 70, 7, 1, 13, 10, 40, 6, 36, 64, + 68, 52, 8, 0, 46, 53, 48, 32, 9, 52, 69, 41, 8, 57, 27, 31, + 79, 27, 12, 70, 72, 33, 6, 22, 47, 37, 11, 38, 32, 7, 31, + 37, 45, 23, 74, 22, 46, 1, 3, 74, 72, 56, 52, 65, 78, 28, + 5, 68, 30, 36, 5, 43, 7, 2, 48, 75, 16, 53, 31, 40, 9, 3, + 49, 71, 70, 20, 24, 6, 23, 76, 49, 21, 12, 60, 54, 43, 7, + 79, 74, 62, 53, 20, 46, 11, 74, 29, 31, 43, 20, 27, 22, 22, + 15, 59, 12, 21, 61, 11, 8, 28, 5, 78, 70, 22, 11, 36, 62, + 56, 44, 49, 25, 39, 37, 24, 72, 65, 67, 22, 48, 16, 50, 5, + 10, 13, 36, 65, 29, 3, 26, 74, 15, 73, 78, 36, 14, 36, 30, + 42, 19, 73, 65, 75, 2, 25, 1, 32, 38, 43, 58, 19, 37, 37, + 48, 23, 72, 77, 34, 24, 1, 4, 42, 11, 68, 54, 23, 34, 0, + 48, 20, 20, 23, 61, 65, 72, 64, 24, 63, 3, 21, 48, 63, 57, + 40, 36, 46, 48, 8, 20, 62, 7, 69, 35, 79, 38, 45, 74, 7, + 16, 48, 59, 56, 31, 13, 13, + ])), + Del(Key(vec![176, 58, 119])), + Get(Key(vec![241])), + Get(Key(vec![160])), + Cas(Key(vec![]), 166, 235), + Set( + Key(vec![ + 64, 83, 151, 149, 100, 93, 5, 18, 
91, 58, 84, 156, 127, + 108, 99, 168, 54, 51, 169, 185, 174, 101, 178, 148, 28, + 91, 25, 138, 14, 133, 170, 97, 138, 180, 157, 131, 174, + 22, 91, 108, 59, 165, 52, 28, 17, 175, 44, 95, 112, 38, + 141, 46, 124, 49, 116, 55, 39, 109, 73, 181, 104, 86, + 81, 150, 95, 149, 69, 110, 110, 102, 22, 62, 180, 60, + 87, 127, 127, 136, 12, 139, 109, 165, 34, 181, 158, + 156, 102, 38, 6, 149, 183, 69, 129, 98, 161, 175, 82, + 51, 47, 93, 136, 16, 118, 65, 152, 139, 8, 30, 10, 100, + 47, 13, 47, 179, 87, 19, 109, 78, 116, 20, 111, 89, 28, + 0, 86, 39, 139, 7, 111, 40, 145, 155, 107, 45, 36, 90, + 143, 154, 135, 36, 13, 98, 61, 150, 65, 128, 16, 52, + 100, 128, 11, 5, 49, 143, 56, 78, 48, 62, 86, 50, 86, + 41, 153, 53, 139, 89, 164, 33, 136, 83, 182, 53, 132, + 144, 177, 105, 104, 55, 9, 174, 30, 65, 76, 33, 163, + 172, 80, 169, 175, 54, 165, 173, 109, 24, 70, 25, 158, + 135, 76, 130, 76, 9, 56, 20, 13, 133, 33, 168, 160, + 153, 43, 80, 58, 56, 171, 28, 97, 122, 162, 32, 164, + 11, 112, 177, 63, 47, 25, 0, 66, 87, 169, 118, 173, 27, + 154, 79, 72, 107, 140, 126, 150, 60, 174, 184, 111, + 155, 22, 32, 185, 149, 95, 60, 146, 165, 103, 34, 131, + 91, 92, 85, 6, 102, 172, 131, 178, 141, 76, 84, 121, + 49, 19, 66, 127, 45, 23, 159, 33, 138, 47, 36, 106, 39, + 83, 164, 83, 16, 126, 126, 118, 84, 171, + ]), + 143, + ), + Scan(Key(vec![165]), -26), + Get(Key(vec![])), + Del(Key(vec![])), + Set( + Key(vec![ + 197, 224, 20, 219, 111, 246, 70, 138, 190, 237, 9, 202, + 187, 160, 47, 10, 231, 14, 2, 131, 30, 202, 95, 48, 44, + 21, 192, 155, 172, 51, 101, 155, 73, 5, 22, 140, 137, + 11, 37, 79, 79, 92, 25, 107, 82, 145, 39, 45, 155, 136, + 242, 8, 43, 71, 28, 70, 94, 79, 151, 20, 144, 53, 100, + 196, 74, 140, 27, 224, 59, 1, 143, 136, 132, 85, 114, + 166, 103, 242, 156, 183, 168, 148, 2, 33, 29, 201, 7, + 96, 13, 33, 102, 172, 21, 96, 27, 1, 86, 149, 150, 119, + 208, 118, 148, 51, 143, 54, 245, 89, 216, 145, 145, 72, + 105, 51, 19, 14, 15, 18, 34, 16, 101, 172, 133, 32, + 173, 106, 157, 15, 48, 194, 27, 55, 204, 110, 145, 99, + 9, 37, 195, 206, 13, 246, 161, 100, 222, 235, 184, 12, + 64, 103, 50, 158, 242, 163, 198, 61, 224, 130, 226, + 187, 158, 175, 135, 54, 110, 33, 9, 59, 127, 135, 47, + 204, 109, 105, 0, 161, 48, 247, 140, 101, 141, 81, 157, + 80, 135, 228, 102, 44, 74, 53, 121, 116, 17, 56, 26, + 112, + ]), + 22, + ), + Set(Key(vec![110]), 222), + Set(Key(vec![94]), 5), + GetGt(Key(vec![ + 181, 161, 96, 186, 128, 24, 232, 74, 149, 3, 129, 98, 220, + 25, 111, 111, 163, 244, 229, 137, 159, 137, 13, 12, 97, + 150, 6, 88, 76, 77, 31, 36, 57, 54, 82, 85, 119, 250, 187, + 163, 132, 73, 194, 129, 149, 176, 62, 118, 166, 50, 200, + 28, 158, 184, 28, 139, 74, 87, 144, 87, 1, 73, 37, 46, 226, + 91, 102, 13, 67, 195, 64, 189, 90, 190, 163, 216, 171, 22, + 69, 234, 57, 134, 96, 198, 179, 115, 43, 160, 104, 252, + 105, 192, 91, 211, 176, 171, 252, 236, 202, 158, 250, 186, + 134, 154, 82, 17, 113, 175, 13, 125, 185, 101, 38, 236, + 155, 30, 110, 11, 33, 198, 114, 184, 84, 91, 67, 125, 55, + 188, 124, 242, 89, 124, 69, 18, 26, 137, 34, 33, 201, 58, + 252, 134, 33, 131, 126, 136, 168, 20, 32, 237, 10, 57, 158, + 149, 102, 62, 10, 98, 106, 10, 93, 78, 240, 205, 38, 186, + 97, 104, 204, 14, 34, 100, 179, 161, 135, 136, 194, 99, + ])), + Merge(Key(vec![95]), 253), + GetLt(Key(vec![99])), + Merge(Key(vec![]), 124), + Get(Key(vec![61])), + Restart, + ], + false, + 0, + 256, + ); + } +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_40() { + // postmortem: deletions of non-existant keys were + // being persisted despite 
being unneccessary. + prop_tree_matches_btreemap( + vec![Del(Key(vec![99; 111222333]))], + false, + 0, + 256, + ); +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_41() { + // postmortem: indexing of values during + // iteration was incorrect. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![]), 131), + Set(Key(vec![17; 1]), 214), + Set(Key(vec![4; 1]), 202), + Set(Key(vec![24; 1]), 79), + Set(Key(vec![26; 1]), 235), + Scan(Key(vec![]), 19), + ], + false, + 0, + 256, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_42() { + // postmortem: during refactoring, accidentally + // messed up the index selection for merge destinations. + for _ in 0..100 { + prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 112), + Set(Key(vec![110; 1]), 153), + Set(Key(vec![15; 1]), 100), + Del(Key(vec![110; 1])), + GetLt(Key(vec![148; 1])), + ], + false, + 0, + 256, + ); + } +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_43() { + // postmortem: when changing the PageState to always + // include a base node, we did not account for this + // in the tag + size compressed value. This was not + // caught by the quickcheck tests because PageState's + // Arbitrary implementation would ensure that at least + // one frag was present, which was the invariant before + // the base was extracted away from the vec of frags. + prop_tree_matches_btreemap( + vec![ + Set(Key(vec![241; 1]), 199), + Set(Key(vec![]), 198), + Set(Key(vec![72; 108]), 175), + GetLt(Key(vec![])), + Restart, + Restart, + ], + false, + 0, + 288, + ); +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_44() { + // postmortem: off-by-one bug related to LSN recovery + // where 1 was added to the index when the recovered + // LSN was actually divisible by the segment size + assert!(prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![]), 97), + Merge(Key(vec![]), 41), + Merge(Key(vec![]), 241), + Set(Key(vec![21; 1]), 24), + Del(Key(vec![])), + Set(Key(vec![]), 145), + Set(Key(vec![151; 1]), 187), + Get(Key(vec![])), + Restart, + Set(Key(vec![]), 151), + Restart, + ], + false, + 0, + 256 + )) +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_45() { + // postmortem: recovery was not properly accounting for + // the possibility of a segment to be maxed out, similar + // to bug 44. + for _ in 0..10 { + assert!(prop_tree_matches_btreemap( + vec![ + Merge(Key(vec![206; 77]), 225), + Set(Key(vec![88; 190]), 40), + Set(Key(vec![162; 1]), 213), + Merge(Key(vec![186; 1]), 175), + Set(Key(vec![105; 16]), 111), + Cas(Key(vec![]), 75, 252), + Restart + ], + false, + true, + 0, + 210 + )) + } +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_46() { + // postmortem: while implementing the heap slab, decompression + // was failing to account for the fact that the slab allocator + // will always write to the end of the slab to be compatible + // with O_DIRECT. + for _ in 0..1 { + assert!(prop_tree_matches_btreemap(vec![Restart], false, 0, 256)) + } +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_47() { + // postmortem: + assert!(prop_tree_matches_btreemap( + vec![Set(Key(vec![88; 1]), 40), Restart, Get(Key(vec![88; 1]))], + false, + 0, + 256 + )) +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_48() { + // postmortem: node value buffer calculations were failing to + // account for potential padding added to avoid buffer overreads + // while looking up offsets. 
+ assert!(prop_tree_matches_btreemap( + vec![ + Set(Key(vec![23; 1]), 78), + Set(Key(vec![120; 1]), 223), + Set(Key(vec![123; 1]), 235), + Set(Key(vec![60; 1]), 234), + Set(Key(vec![]), 71), + Del(Key(vec![120; 1])), + Scan(Key(vec![]), -9) + ], + false, + 0, + 256 + )) +} + +/* +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_49() { + // postmortem: was incorrectly calculating the child offset while searching + // for a node with omitted keys, where the distance == the stride, and + // as a result we went into an infinite loop trying to apply a parent + // split that was already present + assert!(prop_tree_matches_btreemap( + vec![ + Set(Key(vec![39; 1]), 245), + Set(Key(vec![108; 1]), 96), + Set(Key(vec![147; 1]), 44), + Set(Key(vec![102; 1]), 2), + Merge(Key(vec![22; 1]), 160), + Set(Key(vec![36; 1]), 1), + Set(Key(vec![65; 1]), 213), + Set(Key(vec![]), 221), + Set(Key(vec![84; 1]), 20), + Merge(Key(vec![229; 1]), 61), + Set(Key(vec![156; 1]), 69), + Merge(Key(vec![252; 1]), 85), + Set(Key(vec![36; 2]), 57), + Set(Key(vec![245; 1]), 143), + Set(Key(vec![59; 1]), 209), + GetGt(Key(vec![136; 1])), + Set(Key(vec![40; 1]), 96), + GetGt(Key(vec![59; 2])) + ], + false, + false, + 0, + 0 + )) +} +*/ + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_50() { + // postmortem: node value buffer calculations were failing to + // account for potential padding added to avoid buffer overreads + // while looking up offsets. + assert!(prop_tree_matches_btreemap( + vec![ + Set(Key(vec![1; 1]), 44), + Set(Key(vec![52; 1]), 108), + Set(Key(vec![80; 1]), 177), + Set(Key(vec![225; 1]), 59), + Set(Key(vec![246; 1]), 34), + Set(Key(vec![51; 1]), 233), + Set(Key(vec![]), 88), + GetLt(Key(vec![1; 1])) + ], + false, + 0, + 0 + )) +} + +#[test] +#[cfg_attr(miri, ignore)] +fn tree_bug_51() { + // postmortem: + prop_tree_matches_btreemap( + vec![Set(Key(vec![]), 135), Restart, Scan(Key(vec![]), -38)], + false, + 0, + 0, + ); +} diff --git a/tests/common/mod.rs b/tests/common/mod.rs index 8fbe943dc..1b4640df8 100644 --- a/tests/common/mod.rs +++ b/tests/common/mod.rs @@ -1,17 +1,9 @@ -#[cfg(not(any(feature = "testing", feature = "light_testing")))] -compile_error!( - "please run tests using the \"testing\" feature, \ - which enables additional checks at runtime and \ - causes more race conditions to jump out by \ - inserting delays in concurrent code." -); - // the memshred feature causes all allocated and deallocated // memory to be set to a specific non-zero value of 0xa1 for // uninitialized allocations and 0xde for deallocated memory, // in the hope that it will cause memory errors to surface // more quickly. 
-#[cfg(feature = "memshred")] +#[cfg(feature = "testing-shred-allocator")] mod alloc { use std::alloc::{Layout, System}; @@ -43,9 +35,6 @@ pub fn setup_logger() { std::thread::current().name().unwrap_or("unknown").to_owned() } - #[cfg(feature = "pretty_backtrace")] - color_backtrace::install(); - let mut builder = env_logger::Builder::new(); builder .format(|buf, record| { diff --git a/tests/concurrent_batch_atomicity.rs b/tests/concurrent_batch_atomicity.rs new file mode 100644 index 000000000..d286360a2 --- /dev/null +++ b/tests/concurrent_batch_atomicity.rs @@ -0,0 +1,76 @@ +use std::sync::{Arc, Barrier}; +use std::thread; + +use sled::{Config, Db as SledDb}; + +const CONCURRENCY: usize = 32; +const N_KEYS: usize = 1024; + +type Db = SledDb<8>; + +fn batch_writer(db: Db, barrier: Arc, thread_number: usize) { + barrier.wait(); + let mut batch = sled::Batch::default(); + for key_number in 0_u128..N_KEYS as _ { + // LE is intentionally a little scrambled + batch.insert(&key_number.to_le_bytes(), &thread_number.to_le_bytes()); + } + + db.apply_batch(batch).unwrap(); +} + +#[test] +fn concurrent_batch_atomicity() { + let db: Db = Config { + path: "concurrent_batch_atomicity".into(), + ..Default::default() + } + .open() + .unwrap(); + + let mut threads = vec![]; + + let flusher_barrier = Arc::new(Barrier::new(CONCURRENCY)); + for tn in 0..CONCURRENCY { + let db = db.clone(); + let barrier = flusher_barrier.clone(); + let thread = thread::Builder::new() + .name(format!("t(thread: {} flusher)", tn)) + .spawn(move || { + db.flush().unwrap(); + barrier.wait(); + }) + .expect("should be able to spawn thread"); + threads.push(thread); + } + + let barrier = Arc::new(Barrier::new(CONCURRENCY + 1)); + for thread_number in 0..CONCURRENCY { + let db = db.clone(); + let barrier = barrier.clone(); + let jh = + thread::spawn(move || batch_writer(db, barrier, thread_number)); + threads.push(jh); + } + + barrier.wait(); + let before = std::time::Instant::now(); + + for thread in threads.into_iter() { + thread.join().unwrap(); + } + + println!("writers took {:?}", before.elapsed()); + + let mut expected_v = None; + + for key_number in 0_u128..N_KEYS as _ { + let actual_v = db.get(&key_number.to_le_bytes()).unwrap().unwrap(); + if expected_v.is_none() { + expected_v = Some(actual_v.clone()); + } + assert_eq!(Some(actual_v), expected_v); + } + + let _ = std::fs::remove_dir_all("concurrent_batch_atomicity"); +} diff --git a/tests/crash_tests/crash_batches.rs b/tests/crash_tests/crash_batches.rs new file mode 100644 index 000000000..b37aad191 --- /dev/null +++ b/tests/crash_tests/crash_batches.rs @@ -0,0 +1,121 @@ +use std::thread; + +use rand::Rng; + +use super::*; + +const CACHE_SIZE: usize = 1024 * 1024; +const BATCH_SIZE: u32 = 8; +const SEGMENT_SIZE: usize = 1024; + +/// Verifies that the keys in the tree are correctly recovered (i.e., equal). +/// Panics if they are incorrect. 
+fn verify_batches(tree: &Db) -> u32 { + let mut iter = tree.iter(); + let first_value = match iter.next() { + Some(Ok((_k, v))) => slice_to_u32(&*v), + Some(Err(e)) => panic!("{:?}", e), + None => return 0, + }; + + // we now expect all items in the batch to be present and to have the same value + + for key in 0..BATCH_SIZE { + let res = tree.get(u32_to_vec(key)); + let option = res.unwrap(); + let v = match option { + Some(v) => v, + None => panic!( + "expected key {} to have a value, instead it was missing in db with keys: {}", + key, + tree_to_string(&tree) + ), + }; + let value = slice_to_u32(&*v); + // FIXME BUG 1 count 2 + // assertion `left == right` failed: expected key 0 to have value 62003, instead it had value 62375 in db with keys: + // {0:62003, 1:62003, 2:62003, 3:62003, 4:62003, 5:62003, 6:62003, 7:62003, + // Human: iterating shows correct value, but first get did not + // + // expected key 1 to have value 1, instead it had value 29469 in db with keys: + // {0:1, 1:29469, 2:29469, 3:29469, 4:29469, 5:29469, 6:29469, 7:29469, + // Human: 0 didn't get included in later syncs + // + // expected key 0 to have value 59485, instead it had value 59484 in db with keys: + // {0:59485, 1:59485, 2:59485, 3:59485, 4:59485, 5:59485, 6:59485, 7:59485, + // Human: had key N during first check, then N + 1 in iteration + assert_eq!( + first_value, value, + "expected key {} to have value {}, instead it had value {}. second get: {:?}. db iter: {}. third get: {:?}", + key, first_value, value, + slice_to_u32(&*tree.get(u32_to_vec(key)).unwrap().unwrap()), + tree_to_string(&tree), + slice_to_u32(&*tree.get(u32_to_vec(key)).unwrap().unwrap()), + ); + } + + first_value +} + +fn run_batches_inner(db: Db) { + fn do_batch(i: u32, db: &Db) { + let mut rng = rand::thread_rng(); + let base_value = u32_to_vec(i); + + let mut batch = sled::Batch::default(); + if rng.gen_bool(0.1) { + for key in 0..BATCH_SIZE { + batch.remove(u32_to_vec(key)); + } + } else { + for key in 0..BATCH_SIZE { + let mut value = base_value.clone(); + let additional_len = rng.gen_range(0..SEGMENT_SIZE / 3); + value.append(&mut vec![0u8; additional_len]); + + batch.insert(u32_to_vec(key), value); + } + } + db.apply_batch(batch).unwrap(); + } + + let mut i = verify_batches(&db); + i += 1; + do_batch(i, &db); + + loop { + i += 1; + do_batch(i, &db); + } +} + +pub fn run_crash_batches() { + let crash_during_initialization = rand::thread_rng().gen_ratio(1, 10); + + if crash_during_initialization { + spawn_killah(); + } + + let path = std::path::Path::new(CRASH_DIR).join(BATCHES_DIR); + let config = Config::new() + .cache_capacity_bytes(CACHE_SIZE) + .flush_every_ms(Some(1)) + .path(path); + + let db = config.open().expect("couldn't open batch db"); + let db2 = db.clone(); + + let t1 = thread::spawn(|| run_batches_inner(db)); + let t2 = thread::spawn(move || loop { + db2.flush().unwrap(); + }); // run_batches_inner(db2)); + + if !crash_during_initialization { + spawn_killah(); + } + + if let Err(e) = t1.join().and_then(|_| t2.join()) { + println!("worker thread failed: {:?}", e); + std::process::exit(15); + } +} diff --git a/tests/crash_tests/crash_heap.rs b/tests/crash_tests/crash_heap.rs new file mode 100644 index 000000000..b3d66a581 --- /dev/null +++ b/tests/crash_tests/crash_heap.rs @@ -0,0 +1,17 @@ +use super::*; + +const FANOUT: usize = 3; + +pub fn run_crash_heap() { + let path = std::path::Path::new(CRASH_DIR).join(HEAP_DIR); + let config = Config::new().path(path); + + let HeapRecovery { heap, recovered_nodes, was_recovered } 
= + Heap::recover(FANOUT, &config).unwrap(); + + // validate + + spawn_killah(); + + loop {} +} diff --git a/tests/crash_tests/crash_iter.rs b/tests/crash_tests/crash_iter.rs new file mode 100644 index 000000000..51f416987 --- /dev/null +++ b/tests/crash_tests/crash_iter.rs @@ -0,0 +1,183 @@ +use std::sync::{Arc, Barrier}; +use std::thread; + +use super::*; + +const CACHE_SIZE: usize = 256; + +pub fn run_crash_iter() { + const N_FORWARD: usize = 50; + const N_REVERSE: usize = 50; + + let path = std::path::Path::new(CRASH_DIR).join(ITER_DIR); + let config = Config::new() + .cache_capacity_bytes(CACHE_SIZE) + .path(path) + .flush_every_ms(Some(1)); + + let db: Db = config.open().expect("couldn't open iter db"); + let t = db.open_tree(b"crash_iter_test").unwrap(); + + thread::Builder::new() + .name("crash_iter_flusher".to_string()) + .spawn({ + let t = t.clone(); + move || loop { + t.flush().unwrap(); + } + }) + .unwrap(); + + const INDELIBLE: [&[u8]; 16] = [ + &[0u8], + &[1u8], + &[2u8], + &[3u8], + &[4u8], + &[5u8], + &[6u8], + &[7u8], + &[8u8], + &[9u8], + &[10u8], + &[11u8], + &[12u8], + &[13u8], + &[14u8], + &[15u8], + ]; + + for item in &INDELIBLE { + t.insert(*item, *item).unwrap(); + } + + let barrier = Arc::new(Barrier::new(N_FORWARD + N_REVERSE + 2)); + let mut threads = vec![]; + + for i in 0..N_FORWARD { + let t = thread::Builder::new() + .name(format!("forward({})", i)) + .spawn({ + let t = t.clone(); + let barrier = barrier.clone(); + move || { + barrier.wait(); + loop { + let expected = INDELIBLE.iter(); + let mut keys = t.iter().keys(); + + for expect in expected { + loop { + let k = keys.next().unwrap().unwrap(); + assert!( + &*k <= *expect, + "witnessed key is {:?} but we expected \ + one <= {:?}, so we overshot due to a \ + concurrent modification", + k, + expect, + ); + if &*k == *expect { + break; + } + } + } + } + } + }) + .unwrap(); + threads.push(t); + } + + for i in 0..N_REVERSE { + let t = thread::Builder::new() + .name(format!("reverse({})", i)) + .spawn({ + let t = t.clone(); + let barrier = barrier.clone(); + move || { + barrier.wait(); + loop { + let expected = INDELIBLE.iter().rev(); + let mut keys = t.iter().keys().rev(); + + for expect in expected { + loop { + if let Some(Ok(k)) = keys.next() { + assert!( + &*k >= *expect, + "witnessed key is {:?} but we expected \ + one >= {:?}, so we overshot due to a \ + concurrent modification\n{:?}", + k, + expect, + t, + ); + if &*k == *expect { + break; + } + } else { + panic!("undershot key on tree: \n{:?}", t); + } + } + } + } + } + }) + .unwrap(); + + threads.push(t); + } + + let inserter = thread::Builder::new() + .name("inserter".into()) + .spawn({ + let t = t.clone(); + let barrier = barrier.clone(); + move || { + barrier.wait(); + + loop { + for i in 0..(16 * 16 * 8) { + let major = i / (16 * 8); + let minor = i % 16; + + let mut base = INDELIBLE[major].to_vec(); + base.push(minor as u8); + t.insert(base.clone(), base.clone()).unwrap(); + } + } + } + }) + .unwrap(); + + threads.push(inserter); + + let deleter = thread::Builder::new() + .name("deleter".into()) + .spawn({ + move || { + barrier.wait(); + + loop { + for i in 0..(16 * 16 * 8) { + let major = i / (16 * 8); + let minor = i % 16; + + let mut base = INDELIBLE[major].to_vec(); + base.push(minor as u8); + t.remove(&base).unwrap(); + } + } + } + }) + .unwrap(); + + spawn_killah(); + + threads.push(deleter); + + for thread in threads.into_iter() { + thread.join().expect("thread should not have crashed"); + } +} diff --git 
a/tests/crash_tests/crash_metadata_store.rs b/tests/crash_tests/crash_metadata_store.rs new file mode 100644 index 000000000..dcb647f5f --- /dev/null +++ b/tests/crash_tests/crash_metadata_store.rs @@ -0,0 +1,12 @@ +use super::*; + +pub fn run_crash_metadata_store() { + let (metadata_store, recovered) = + MetadataStore::recover(&HEAP_DIR).unwrap(); + + // validate + + spawn_killah(); + + loop {} +} diff --git a/tests/crash_tests/crash_object_cache.rs b/tests/crash_tests/crash_object_cache.rs new file mode 100644 index 000000000..278e1a787 --- /dev/null +++ b/tests/crash_tests/crash_object_cache.rs @@ -0,0 +1,17 @@ +use super::*; + +const FANOUT: usize = 3; + +pub fn run_crash_object_cache() { + let path = std::path::Path::new(CRASH_DIR).join(OBJECT_CACHE_DIR); + let config = Config::new().flush_every_ms(Some(1)).path(path); + + let (oc, collections, was_recovered): (ObjectCache, _, bool) = + ObjectCache::recover(&config).unwrap(); + + // validate + + spawn_killah(); + + loop {} +} diff --git a/tests/crash_tests/crash_sequential_writes.rs b/tests/crash_tests/crash_sequential_writes.rs new file mode 100644 index 000000000..2008b9f3d --- /dev/null +++ b/tests/crash_tests/crash_sequential_writes.rs @@ -0,0 +1,128 @@ +use std::thread; + +use super::*; + +const CACHE_SIZE: usize = 1024 * 1024; +const CYCLE: usize = 256; +const SEGMENT_SIZE: usize = 1024; + +/// Verifies that the keys in the tree are correctly recovered. +/// Panics if they are incorrect. +/// Returns the key that should be resumed at, and the current cycle value. +fn verify(tree: &Db) -> (u32, u32) { + // key 0 should always be the highest value, as that's where we increment + // at some point, it might go down by one + // it should never return, or go down again after that + let mut iter = tree.iter(); + let highest = match iter.next() { + Some(Ok((_k, v))) => slice_to_u32(&*v), + Some(Err(e)) => panic!("{:?}", e), + None => return (0, 0), + }; + + let highest_vec = u32_to_vec(highest); + + // find how far we got + let mut contiguous: u32 = 0; + let mut lowest_with_high_value = 0; + + for res in iter { + let (k, v) = res.unwrap(); + if v[..4] == highest_vec[..4] { + contiguous += 1; + } else { + let expected = if highest == 0 { + CYCLE as u32 - 1 + } else { + (highest - 1) % CYCLE as u32 + }; + let actual = slice_to_u32(&*v); + // FIXME BUG 2 + // thread '' panicked at tests/test_crash_recovery.rs:159:13: + // assertion `left == right` failed + // left: 139 + // right: 136 + assert_eq!( + expected, + actual, + "tree failed assertion with iterated values: {}, k: {:?} v: {:?} expected: {} highest: {}", + tree_to_string(&tree), + k, + v, + expected, + highest + ); + lowest_with_high_value = actual; + break; + } + } + + // ensure nothing changes after this point + let low_beginning = u32_to_vec(contiguous + 1); + + for res in tree.range(&*low_beginning..) 
{ + let (k, v): (sled::InlineArray, _) = res.unwrap(); + assert_eq!( + slice_to_u32(&*v), + lowest_with_high_value, + "expected key {} to have value {}, instead it had value {} in db: {:?}", + slice_to_u32(&*k), + lowest_with_high_value, + slice_to_u32(&*v), + tree + ); + } + + (contiguous, highest) +} + +fn run_inner(config: Config) { + let crash_during_initialization = rand::thread_rng().gen_bool(0.1); + + if crash_during_initialization { + spawn_killah(); + } + + let tree = config.open().expect("couldn't open db"); + + if !crash_during_initialization { + spawn_killah(); + } + + let (key, highest) = verify(&tree); + + let mut hu = ((highest as usize) * CYCLE) + key as usize; + assert_eq!(hu % CYCLE, key as usize); + assert_eq!(hu / CYCLE, highest as usize); + + loop { + let key = u32_to_vec((hu % CYCLE) as u32); + + //dbg!(hu, hu % CYCLE); + + let mut value = u32_to_vec((hu / CYCLE) as u32); + let additional_len = rand::thread_rng().gen_range(0..SEGMENT_SIZE / 3); + value.append(&mut vec![0u8; additional_len]); + + tree.insert(&key, value).unwrap(); + + hu += 1; + + if hu / CYCLE >= CYCLE { + hu = 0; + } + } +} + +pub fn run_crash_sequential_writes() { + let path = std::path::Path::new(CRASH_DIR).join(SEQUENTIAL_WRITES_DIR); + let config = Config::new() + .cache_capacity_bytes(CACHE_SIZE) + .flush_every_ms(Some(1)) + .path(path); + + if let Err(e) = thread::spawn(|| run_inner(config)).join() { + println!("worker thread failed: {:?}", e); + std::process::exit(15); + } +} diff --git a/tests/crash_tests/crash_tx.rs b/tests/crash_tests/crash_tx.rs new file mode 100644 index 000000000..8a4aabe11 --- /dev/null +++ b/tests/crash_tests/crash_tx.rs @@ -0,0 +1,98 @@ +use super::*; + +const CACHE_SIZE: usize = 1024 * 1024; + +pub fn run_crash_tx() { + let config = Config::new() + .cache_capacity_bytes(CACHE_SIZE) + .flush_every_ms(Some(1)) + .path(TX_DIR); + + let _db: Db = config.open().unwrap(); + + spawn_killah(); + + loop {} + + /* + db.insert(b"k1", b"cats").unwrap(); + db.insert(b"k2", b"dogs").unwrap(); + db.insert(b"id", &0_u64.to_le_bytes()).unwrap(); + + let mut threads = vec![]; + + const N_WRITERS: usize = 50; + const N_READERS: usize = 5; + + let barrier = Arc::new(Barrier::new(N_WRITERS + N_READERS)); + + for _ in 0..N_WRITERS { + let db = db.clone(); + let barrier = barrier.clone(); + let thread = std::thread::spawn(move || { + barrier.wait(); + loop { + db.transaction::<_, _, ()>(|db| { + let v1 = db.remove(b"k1").unwrap().unwrap(); + let v2 = db.remove(b"k2").unwrap().unwrap(); + + db.insert(b"id", &db.generate_id().unwrap().to_le_bytes()) + .unwrap(); + + db.insert(b"k1", v2).unwrap(); + db.insert(b"k2", v1).unwrap(); + Ok(()) + }) + .unwrap(); + } + }); + threads.push(thread); + } + + for _ in 0..N_READERS { + let db = db.clone(); + let barrier = barrier.clone(); + let thread = std::thread::spawn(move || { + barrier.wait(); + let mut last_id = 0; + loop { + let read_id = db + .transaction::<_, _, ()>(|db| { + let v1 = db.get(b"k1").unwrap().unwrap(); + let v2 = db.get(b"k2").unwrap().unwrap(); + let id = u64::from_le_bytes( + TryFrom::try_from( + &*db.get(b"id").unwrap().unwrap(), + ) + .unwrap(), + ); + + let mut results = vec![v1, v2]; + results.sort(); + + assert_eq!( + [&results[0], &results[1]], + [b"cats", b"dogs"] + ); + + Ok(id) + }) + .unwrap(); + assert!(read_id >= last_id); + last_id = read_id; + } + }); + threads.push(thread); + } + + spawn_killah(); + + for thread in threads.into_iter() { + thread.join().expect("threads should not crash"); + } + + let v1 = 
db.get(b"k1").unwrap().unwrap(); + let v2 = db.get(b"k2").unwrap().unwrap(); + assert_eq!([v1, v2], [b"cats", b"dogs"]); + */ +} diff --git a/tests/crash_tests/mod.rs b/tests/crash_tests/mod.rs new file mode 100644 index 000000000..6dd7683ec --- /dev/null +++ b/tests/crash_tests/mod.rs @@ -0,0 +1,71 @@ +use std::mem::size_of; +use std::process::exit; +use std::thread; +use std::time::Duration; + +use rand::Rng; + +use sled::{ + Config, Db as SledDb, Heap, HeapRecovery, MetadataStore, ObjectCache, +}; + +mod crash_batches; +mod crash_heap; +mod crash_iter; +mod crash_metadata_store; +mod crash_object_cache; +mod crash_sequential_writes; +mod crash_tx; + +pub use crash_batches::run_crash_batches; +pub use crash_heap::run_crash_heap; +pub use crash_iter::run_crash_iter; +pub use crash_metadata_store::run_crash_metadata_store; +pub use crash_object_cache::run_crash_object_cache; +pub use crash_sequential_writes::run_crash_sequential_writes; +pub use crash_tx::run_crash_tx; + +type Db = SledDb<8>; + +// test names, also used as dir names +pub const SEQUENTIAL_WRITES_DIR: &str = "crash_sequential_writes"; +pub const BATCHES_DIR: &str = "crash_batches"; +pub const ITER_DIR: &str = "crash_iter"; +pub const TX_DIR: &str = "crash_tx"; +pub const METADATA_STORE_DIR: &str = "crash_metadata_store"; +pub const HEAP_DIR: &str = "crash_heap"; +pub const OBJECT_CACHE_DIR: &str = "crash_object_cache"; + +const CRASH_DIR: &str = "crash_test_files"; + +fn spawn_killah() { + thread::spawn(|| { + let runtime = rand::thread_rng().gen_range(0..60_000); + thread::sleep(Duration::from_micros(runtime)); + exit(9); + }); +} + +fn u32_to_vec(u: u32) -> Vec { + let buf: [u8; size_of::()] = u.to_be_bytes(); + buf.to_vec() +} + +fn slice_to_u32(b: &[u8]) -> u32 { + let mut buf = [0u8; size_of::()]; + buf.copy_from_slice(&b[..size_of::()]); + + u32::from_be_bytes(buf) +} + +fn tree_to_string(tree: &Db) -> String { + let mut ret = String::from("{"); + for kv_res in tree.iter() { + let (k, v) = kv_res.unwrap(); + let k_s = slice_to_u32(&k); + let v_s = slice_to_u32(&v); + ret.push_str(&format!("{}:{}, ", k_s, v_s)); + } + ret.push_str("}"); + ret +} diff --git a/tests/test_crash_recovery.rs b/tests/test_crash_recovery.rs index d0cce0cfc..99d358cdb 100644 --- a/tests/test_crash_recovery.rs +++ b/tests/test_crash_recovery.rs @@ -1,40 +1,49 @@ mod common; +mod crash_tests; -use std::convert::TryFrom; +use std::alloc::{Layout, System}; use std::env::{self, VarError}; -use std::mem::size_of; -use std::process::{exit, Child, Command, ExitStatus}; -use std::sync::{Arc, Barrier}; +use std::process::Command; use std::thread; -use std::time::Duration; - -use rand::Rng; - -use sled::Config; use common::cleanup; const TEST_ENV_VAR: &str = "SLED_CRASH_TEST"; const N_TESTS: usize = 100; -const CYCLE: usize = 256; -const BATCH_SIZE: u32 = 8; -const SEGMENT_SIZE: usize = 1024; - -// test names, also used as dir names -const RECOVERY_DIR: &str = "crash_recovery"; -const BATCHES_DIR: &str = "crash_batches"; -const ITER_DIR: &str = "crash_iter"; -const TX_DIR: &str = "crash_tx"; -const TESTS: [(&str, fn()); 4] = [ - (RECOVERY_DIR, crash_recovery), - (BATCHES_DIR, crash_batches), - (ITER_DIR, concurrent_crash_iter), - (TX_DIR, concurrent_crash_transactions), +const TESTS: [&str; 7] = [ + crash_tests::SEQUENTIAL_WRITES_DIR, + crash_tests::BATCHES_DIR, + crash_tests::ITER_DIR, + crash_tests::TX_DIR, + crash_tests::METADATA_STORE_DIR, + crash_tests::HEAP_DIR, + crash_tests::OBJECT_CACHE_DIR, ]; const CRASH_CHANCE: u32 = 250; +#[global_allocator] 
+static ALLOCATOR: ShredAllocator = ShredAllocator; + +#[derive(Default, Debug, Clone, Copy)] +struct ShredAllocator; + +unsafe impl std::alloc::GlobalAlloc for ShredAllocator { + unsafe fn alloc(&self, layout: Layout) -> *mut u8 { + assert!(layout.size() < 1_000_000_000); + let ret = System.alloc(layout); + assert_ne!(ret, std::ptr::null_mut()); + std::ptr::write_bytes(ret, 0xa1, layout.size()); + ret + } + + unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) { + std::ptr::write_bytes(ptr, 0xde, layout.size()); + System.dealloc(ptr, layout) + } +} + fn main() { // Don't actually run this harness=false test under miri, as it requires // spawning and killing child processes. @@ -46,11 +55,11 @@ fn main() { match env::var(TEST_ENV_VAR) { Err(VarError::NotPresent) => { - let filtered: Vec<(&'static str, fn())> = + let filtered: Vec<&'static str> = if let Some(filter) = std::env::args().nth(1) { TESTS .iter() - .filter(|(name, _)| name.contains(&filter)) + .filter(|name| name.contains(&filter)) .cloned() .collect() } else { @@ -67,9 +76,10 @@ fn main() { ); let mut tests = vec![]; - for (test_name, test_fn) in filtered.into_iter() { + for test_name in filtered.into_iter() { let test = thread::spawn(move || { - let res = std::panic::catch_unwind(test_fn); + let res = + std::panic::catch_unwind(|| supervisor(test_name)); println!( "test {} ... {}", test_name, @@ -77,11 +87,11 @@ fn main() { ); res.unwrap(); }); - tests.push(test); + tests.push((test_name, test)); } - for test in tests.into_iter() { - test.join().unwrap(); + for (test_name, test) in tests.into_iter() { + test.join().expect(test_name); } println!(); @@ -93,580 +103,66 @@ fn main() { println!(); } - Ok(ref s) if s == RECOVERY_DIR => run_crash_recovery(), - Ok(ref s) if s == BATCHES_DIR => run_crash_batches(), - Ok(ref s) if s == ITER_DIR => run_crash_iter(), - Ok(ref s) if s == TX_DIR => run_crash_tx(), - - Ok(_) | Err(_) => panic!("invalid crash test case"), - } -} - -/// Verifies that the keys in the tree are correctly recovered. -/// Panics if they are incorrect. -/// Returns the key that should be resumed at, and the current cycle value. -fn verify(tree: &sled::Tree) -> (u32, u32) { - // key 0 should always be the highest value, as that's where we increment - // at some point, it might go down by one - // it should never return, or go down again after that - let mut iter = tree.iter(); - let highest = match iter.next() { - Some(Ok((_k, v))) => slice_to_u32(&*v), - Some(Err(e)) => panic!("{:?}", e), - None => return (0, 0), - }; - - let highest_vec = u32_to_vec(highest); - - // find how far we got - let mut contiguous: u32 = 0; - let mut lowest = 0; - for res in iter { - let (_k, v) = res.unwrap(); - if v[..4] == highest_vec[..4] { - contiguous += 1; - } else { - let expected = if highest == 0 { - CYCLE as u32 - 1 - } else { - (highest - 1) % CYCLE as u32 - }; - let actual = slice_to_u32(&*v); - assert_eq!(expected, actual); - lowest = actual; - break; + Ok(ref s) if s == crash_tests::SEQUENTIAL_WRITES_DIR => { + crash_tests::run_crash_sequential_writes() } - } - - // ensure nothing changes after this point - let low_beginning = u32_to_vec(contiguous + 1); - - for res in tree.range(&*low_beginning..) 
{ - let (k, v): (sled::IVec, _) = res.unwrap(); - assert_eq!( - slice_to_u32(&*v), - lowest, - "expected key {} to have value {}, instead it had value {} in db: {:?}", - slice_to_u32(&*k), - lowest, - slice_to_u32(&*v), - tree - ); - } - - tree.verify_integrity().unwrap(); - - (contiguous, highest) -} - -fn u32_to_vec(u: u32) -> Vec { - let buf: [u8; size_of::()] = u.to_be_bytes(); - buf.to_vec() -} - -fn slice_to_u32(b: &[u8]) -> u32 { - let mut buf = [0u8; size_of::()]; - buf.copy_from_slice(&b[..size_of::()]); - - u32::from_be_bytes(buf) -} - -fn spawn_killah() { - thread::spawn(|| { - let runtime = rand::thread_rng().gen_range(0, 60); - thread::sleep(Duration::from_millis(runtime)); - exit(9); - }); -} - -fn run_inner(config: Config) { - let crash_during_initialization = rand::thread_rng().gen_bool(0.1); - - if crash_during_initialization { - spawn_killah(); - } - - let tree = config.open().unwrap(); - - if !crash_during_initialization { - spawn_killah(); - } - - let (key, highest) = verify(&tree); - - let mut hu = ((highest as usize) * CYCLE) + key as usize; - assert_eq!(hu % CYCLE, key as usize); - assert_eq!(hu / CYCLE, highest as usize); - - loop { - hu += 1; - - if hu / CYCLE >= CYCLE { - hu = 0; + Ok(ref s) if s == crash_tests::BATCHES_DIR => { + crash_tests::run_crash_batches() } - - let key = u32_to_vec((hu % CYCLE) as u32); - - let mut value = u32_to_vec((hu / CYCLE) as u32); - let additional_len = rand::thread_rng().gen_range(0, SEGMENT_SIZE / 3); - value.append(&mut vec![0u8; additional_len]); - - tree.insert(&key, value).unwrap(); - } -} - -/// Verifies that the keys in the tree are correctly recovered (i.e., equal). -/// Panics if they are incorrect. -fn verify_batches(tree: &sled::Tree) -> u32 { - let mut iter = tree.iter(); - let first_value = match iter.next() { - Some(Ok((_k, v))) => slice_to_u32(&*v), - Some(Err(e)) => panic!("{:?}", e), - None => return 0, - }; - for key in 0..BATCH_SIZE { - let res = tree.get(u32_to_vec(key)); - let option = res.unwrap(); - let v = match option { - Some(v) => v, - None => panic!( - "expected key {} to have a value, instead it was missing in db: {:?}", - key, tree - ), - }; - let value = slice_to_u32(&*v); - assert_eq!( - first_value, value, - "expected key {} to have value {}, instead it had value {} in db: {:?}", - key, first_value, value, tree - ); - } - - tree.verify_integrity().unwrap(); - - first_value -} - -fn run_batches_inner(db: sled::Db) { - fn do_batch(i: u32, db: &sled::Db) { - let mut rng = rand::thread_rng(); - let base_value = u32_to_vec(i); - - let mut batch = sled::Batch::default(); - if rng.gen_bool(0.1) { - for key in 0..BATCH_SIZE { - batch.remove(u32_to_vec(key)); - } - } else { - for key in 0..BATCH_SIZE { - let mut value = base_value.clone(); - let additional_len = rng.gen_range(0, SEGMENT_SIZE / 3); - value.append(&mut vec![0u8; additional_len]); - - batch.insert(u32_to_vec(key), value); - } + Ok(ref s) if s == crash_tests::ITER_DIR => { + crash_tests::run_crash_iter() } - db.apply_batch(batch).unwrap(); - } - - let mut i = verify_batches(&db); - i += 1; - do_batch(i, &db); - - loop { - i += 1; - do_batch(i, &db); - } -} - -fn run_crash_recovery() { - let config = Config::new() - .cache_capacity(128 * 1024 * 1024) - .flush_every_ms(Some(1)) - .path(RECOVERY_DIR) - .segment_size(SEGMENT_SIZE); - - if let Err(e) = thread::spawn(|| run_inner(config)).join() { - println!("worker thread failed: {:?}", e); - std::process::exit(15); - } -} - -fn run_crash_batches() { - let crash_during_initialization = 
rand::thread_rng().gen_ratio(1, 10); - - if crash_during_initialization { - spawn_killah(); - } - - let config = Config::new() - .cache_capacity(128 * 1024 * 1024) - .flush_every_ms(Some(1)) - .path(BATCHES_DIR) - .segment_size(SEGMENT_SIZE); - - let db = config.open().unwrap(); - // let db2 = db.clone(); - - let t1 = thread::spawn(|| run_batches_inner(db)); - let t2 = thread::spawn(|| {}); // run_batches_inner(db2)); - - if !crash_during_initialization { - spawn_killah(); - } - - if let Err(e) = t1.join().and_then(|_| t2.join()) { - println!("worker thread failed: {:?}", e); - std::process::exit(15); + Ok(ref s) if s == crash_tests::TX_DIR => crash_tests::run_crash_tx(), + Ok(ref s) if s == crash_tests::METADATA_STORE_DIR => { + crash_tests::run_crash_metadata_store() + } + Ok(ref s) if s == crash_tests::HEAP_DIR => { + crash_tests::run_crash_heap() + } + Ok(ref s) if s == crash_tests::OBJECT_CACHE_DIR => { + crash_tests::run_crash_object_cache() + } + Ok(other) => panic!("invalid crash test case: {other}"), + Err(e) => panic!("env var {TEST_ENV_VAR} unable to be read: {e:?}"), } } -fn run_child_process(test_name: &str) -> Child { +fn run_child_process(dir: &str) { let bin = env::current_exe().expect("could not get test binary path"); - env::set_var(TEST_ENV_VAR, test_name); + env::set_var(TEST_ENV_VAR, dir); - Command::new(bin) - .env(TEST_ENV_VAR, test_name) + let status_res = Command::new(bin) + .env(TEST_ENV_VAR, dir) .env("SLED_CRASH_CHANCE", CRASH_CHANCE.to_string()) .spawn() .unwrap_or_else(|_| { - panic!("could not spawn child process for {} test", test_name) + panic!("could not spawn child process for {} test", dir) }) -} - -fn handle_child_exit_status(dir: &str, status: ExitStatus) { - let code = status.code(); - - if code.is_none() || code.unwrap() != 9 { - cleanup(dir); - panic!("{} test child exited abnormally", dir); - } -} + .wait(); -fn handle_child_wait_err(dir: &str, e: std::io::Error) { - cleanup(dir); - - panic!("error waiting for {} test child: {}", dir, e); -} + match status_res { + Ok(status) => { + let code = status.code(); -fn crash_recovery() { - let dir = RECOVERY_DIR; - cleanup(dir); - - for _ in 0..N_TESTS { - let mut child = run_child_process(dir); - - child - .wait() - .map(|status| handle_child_exit_status(dir, status)) - .map_err(|e| handle_child_wait_err(dir, e)) - .unwrap(); - } - - cleanup(dir); -} - -fn crash_batches() { - let dir = BATCHES_DIR; - cleanup(dir); - - for _ in 0..N_TESTS { - let mut child = run_child_process(dir); - - child - .wait() - .map(|status| handle_child_exit_status(dir, status)) - .map_err(|e| handle_child_wait_err(dir, e)) - .unwrap(); - } - - cleanup(dir); -} - -fn concurrent_crash_iter() { - let dir = ITER_DIR; - cleanup(dir); - - for _ in 0..N_TESTS { - let mut child = run_child_process(dir); - - child - .wait() - .map(|status| handle_child_exit_status(dir, status)) - .map_err(|e| handle_child_wait_err(dir, e)) - .unwrap(); + if code.is_none() || code.unwrap() != 9 { + cleanup(dir); + panic!("{} test child exited abnormally", dir); + } + } + Err(e) => { + cleanup(dir); + panic!("error waiting for {} test child: {}", dir, e); + } } - - cleanup(dir); } -fn concurrent_crash_transactions() { - let dir = TX_DIR; +fn supervisor(dir: &str) { cleanup(dir); for _ in 0..N_TESTS { - let mut child = run_child_process(dir); - - child - .wait() - .map(|status| handle_child_exit_status(dir, status)) - .map_err(|e| handle_child_wait_err(dir, e)) - .unwrap(); + run_child_process(dir); } cleanup(dir); } - -fn run_crash_iter() { - 
common::setup_logger(); - - const N_FORWARD: usize = 50; - const N_REVERSE: usize = 50; - - let config = Config::new().path(ITER_DIR).flush_every_ms(Some(1)); - - let t = config.open().unwrap(); - t.verify_integrity().unwrap(); - - const INDELIBLE: [&[u8]; 16] = [ - &[0u8], - &[1u8], - &[2u8], - &[3u8], - &[4u8], - &[5u8], - &[6u8], - &[7u8], - &[8u8], - &[9u8], - &[10u8], - &[11u8], - &[12u8], - &[13u8], - &[14u8], - &[15u8], - ]; - - for item in &INDELIBLE { - t.insert(*item, *item).unwrap(); - } - - let barrier = Arc::new(Barrier::new(N_FORWARD + N_REVERSE + 2)); - let mut threads = vec![]; - - for i in 0..N_FORWARD { - let t = thread::Builder::new() - .name(format!("forward({})", i)) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { - barrier.wait(); - loop { - let expected = INDELIBLE.iter(); - let mut keys = t.iter().keys(); - - for expect in expected { - loop { - let k = keys.next().unwrap().unwrap(); - assert!( - &*k <= *expect, - "witnessed key is {:?} but we expected \ - one <= {:?}, so we overshot due to a \ - concurrent modification", - k, - expect, - ); - if &*k == *expect { - break; - } - } - } - } - } - }) - .unwrap(); - threads.push(t); - } - - for i in 0..N_REVERSE { - let t = thread::Builder::new() - .name(format!("reverse({})", i)) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { - barrier.wait(); - loop { - let expected = INDELIBLE.iter().rev(); - let mut keys = t.iter().keys().rev(); - - for expect in expected { - loop { - if let Some(Ok(k)) = keys.next() { - assert!( - &*k >= *expect, - "witnessed key is {:?} but we expected \ - one >= {:?}, so we overshot due to a \ - concurrent modification\n{:?}", - k, - expect, - *t, - ); - if &*k == *expect { - break; - } - } else { - panic!("undershot key on tree: \n{:?}", *t); - } - } - } - } - } - }) - .unwrap(); - - threads.push(t); - } - - let inserter = thread::Builder::new() - .name("inserter".into()) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { - barrier.wait(); - - loop { - for i in 0..(16 * 16 * 8) { - let major = i / (16 * 8); - let minor = i % 16; - - let mut base = INDELIBLE[major].to_vec(); - base.push(minor as u8); - t.insert(base.clone(), base.clone()).unwrap(); - } - } - } - }) - .unwrap(); - - threads.push(inserter); - - let deleter = thread::Builder::new() - .name("deleter".into()) - .spawn({ - move || { - barrier.wait(); - - loop { - for i in 0..(16 * 16 * 8) { - let major = i / (16 * 8); - let minor = i % 16; - - let mut base = INDELIBLE[major].to_vec(); - base.push(minor as u8); - t.remove(&base).unwrap(); - } - } - } - }) - .unwrap(); - - spawn_killah(); - - threads.push(deleter); - - for thread in threads.into_iter() { - thread.join().expect("thread should not have crashed"); - } -} - -fn run_crash_tx() { - common::setup_logger(); - - let config = Config::new().flush_every_ms(Some(1)).path(TX_DIR); - let db = config.open().unwrap(); - db.verify_integrity().unwrap(); - - db.insert(b"k1", b"cats").unwrap(); - db.insert(b"k2", b"dogs").unwrap(); - db.insert(b"id", &0_u64.to_le_bytes()).unwrap(); - - let mut threads = vec![]; - - const N_WRITERS: usize = 50; - const N_READERS: usize = 5; - - let barrier = Arc::new(Barrier::new(N_WRITERS + N_READERS)); - - for _ in 0..N_WRITERS { - let db = db.clone(); - let barrier = barrier.clone(); - let thread = std::thread::spawn(move || { - barrier.wait(); - loop { - db.transaction::<_, _, ()>(|db| { - let v1 = db.remove(b"k1").unwrap().unwrap(); - let v2 = 
db.remove(b"k2").unwrap().unwrap(); - - db.insert(b"id", &db.generate_id().unwrap().to_le_bytes()) - .unwrap(); - - db.insert(b"k1", v2).unwrap(); - db.insert(b"k2", v1).unwrap(); - Ok(()) - }) - .unwrap(); - } - }); - threads.push(thread); - } - - for _ in 0..N_READERS { - let db = db.clone(); - let barrier = barrier.clone(); - let thread = std::thread::spawn(move || { - barrier.wait(); - let mut last_id = 0; - loop { - let read_id = db - .transaction::<_, _, ()>(|db| { - let v1 = db.get(b"k1").unwrap().unwrap(); - let v2 = db.get(b"k2").unwrap().unwrap(); - let id = u64::from_le_bytes( - TryFrom::try_from( - &*db.get(b"id").unwrap().unwrap(), - ) - .unwrap(), - ); - - let mut results = vec![v1, v2]; - results.sort(); - - assert_eq!( - [&results[0], &results[1]], - [b"cats", b"dogs"] - ); - - Ok(id) - }) - .unwrap(); - assert!(read_id >= last_id); - last_id = read_id; - } - }); - threads.push(thread); - } - - spawn_killah(); - - for thread in threads.into_iter() { - thread.join().expect("threads should not crash"); - } - - let v1 = db.get(b"k1").unwrap().unwrap(); - let v2 = db.get(b"k2").unwrap().unwrap(); - assert_eq!([v1, v2], [b"cats", b"dogs"]); -} diff --git a/tests/test_log.rs b/tests/test_log.rs deleted file mode 100644 index 6ebd288c1..000000000 --- a/tests/test_log.rs +++ /dev/null @@ -1,382 +0,0 @@ -mod common; - -use rand::{thread_rng, Rng}; - -use sled::*; - -use sled::{ - pin, BatchManifest, Log, LogKind, LogRead, Lsn, PageId, SEG_HEADER_LEN, -}; - -const PID: PageId = 4; -const REPLACE: LogKind = LogKind::Replace; - -#[test] -#[ignore = "adding 50 causes the flush to never return"] -fn log_writebatch() -> crate::Result<()> { - common::setup_logger(); - let config = Config::new().temporary(true); - let db = config.open()?; - let log = &db.context.pagecache.log; - - let guard = pin(); - - log.reserve(REPLACE, PID, &IVec::from(b"1"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"2"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"3"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"4"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"5"), &guard)?.complete()?; - - // simulate a torn batch by - // writing an LSN higher than - // is possible to recover into - // a batch manifest before - // some writes. 
- let batch_res = - log.reserve(REPLACE, PID, &BatchManifest::default(), &guard)?; - log.reserve(REPLACE, PID, &IVec::from(b"6"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"7"), &guard)?.complete()?; - log.reserve(REPLACE, PID, &IVec::from(b"8"), &guard)?.complete()?; - let last_res = log.reserve(REPLACE, PID, &IVec::from(b"9"), &guard)?; - let last_res_lsn = last_res.lsn; - last_res.complete()?; - batch_res.mark_writebatch(last_res_lsn + 50)?; - log.reserve(REPLACE, PID, &IVec::from(b"10"), &guard)?.complete()?; - - let mut iter = log.iter_from(0); - - assert!(iter.next().is_some()); - assert!(iter.next().is_some()); - assert!(iter.next().is_some()); - assert!(iter.next().is_some()); - assert!(iter.next().is_some()); - assert_eq!(iter.next(), None); - - Ok(()) -} - -#[test] -fn more_log_reservations_than_buffers() -> Result<()> { - let config = Config::new().temporary(true).segment_size(256); - - let db = config.open()?; - let log = &db.context.pagecache.log; - let mut reservations = vec![]; - - let total_seg_overhead = SEG_HEADER_LEN; - let big_msg_overhead = MAX_MSG_HEADER_LEN + total_seg_overhead; - let big_msg_sz = config.segment_size - big_msg_overhead; - - for _ in 0..256 * 30 { - let guard = pin(); - reservations.push( - log.reserve(REPLACE, PID, &IVec::from(vec![0; big_msg_sz]), &guard) - .unwrap(), - ) - } - for res in reservations.into_iter().rev() { - // abort in reverse order - res.abort()?; - } - - log.flush()?; - Ok(()) -} - -#[test] -fn non_contiguous_log_flush() -> Result<()> { - let config = Config::new().temporary(true).segment_size(1024); - - let db = config.open()?; - let log = &db.context.pagecache.log; - - let seg_overhead = SEG_HEADER_LEN; - let buf_len = config.segment_size - (MAX_MSG_HEADER_LEN + seg_overhead); - - let guard = pin(); - let res1 = log - .reserve(REPLACE, PID, &IVec::from(vec![0; buf_len]), &guard) - .unwrap(); - let res2 = log - .reserve(REPLACE, PID, &IVec::from(vec![0; buf_len]), &guard) - .unwrap(); - let lsn = res2.lsn; - res2.abort()?; - res1.abort()?; - log.make_stable(lsn)?; - Ok(()) -} - -#[test] -#[cfg(not(miri))] // can't create threads -fn concurrent_logging() -> Result<()> { - use std::thread; - - common::setup_logger(); - for _ in 0..10 { - let config = Config::new() - .temporary(true) - .flush_every_ms(Some(50)) - .segment_size(256); - - let db = config.open()?; - - let db2 = db.clone(); - let db3 = db.clone(); - let db4 = db.clone(); - let db5 = db.clone(); - let db6 = db.clone(); - - let seg_overhead = SEG_HEADER_LEN; - let buf_len = config.segment_size - (MAX_MSG_HEADER_LEN + seg_overhead); - - let t1 = thread::Builder::new() - .name("c1".to_string()) - .spawn(move || { - let log = &db.context.pagecache.log; - for i in 0..1_000 { - let buf = IVec::from(vec![1; i % buf_len]); - let guard = pin(); - log.reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - } - }) - .unwrap(); - - let t2 = thread::Builder::new() - .name("c2".to_string()) - .spawn(move || { - let log = &db2.context.pagecache.log; - for i in 0..1_000 { - let buf = IVec::from(vec![2; i % buf_len]); - let guard = pin(); - log.reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - } - }) - .unwrap(); - - let t3 = thread::Builder::new() - .name("c3".to_string()) - .spawn(move || { - let log = &db3.context.pagecache.log; - for i in 0..1_000 { - let buf = IVec::from(vec![3; i % buf_len]); - let guard = pin(); - log.reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - } - }) - .unwrap(); 
- - let t4 = thread::Builder::new() - .name("c4".to_string()) - .spawn(move || { - let log = &db4.context.pagecache.log; - for i in 0..1_000 { - let buf = IVec::from(vec![4; i % buf_len]); - let guard = pin(); - log.reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - } - }) - .unwrap(); - let t5 = thread::Builder::new() - .name("c5".to_string()) - .spawn(move || { - let log = &db5.context.pagecache.log; - for i in 0..1_000 { - let guard = pin(); - let buf = IVec::from(vec![5; i % buf_len]); - log.reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - } - }) - .unwrap(); - - let t6 = thread::Builder::new() - .name("c6".to_string()) - .spawn(move || { - let log = &db6.context.pagecache.log; - for i in 0..1_000 { - let buf = IVec::from(vec![6; i % buf_len]); - let guard = pin(); - let (lsn, _lid) = log - .reserve(REPLACE, PID, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - log.make_stable(lsn).unwrap(); - } - }) - .unwrap(); - - t6.join().unwrap(); - t5.join().unwrap(); - t4.join().unwrap(); - t3.join().unwrap(); - t2.join().unwrap(); - t1.join().unwrap(); - } - - Ok(()) -} - -fn write(log: &Log) { - let data_bytes = IVec::from(b"yoyoyoyo"); - let guard = pin(); - let (lsn, ptr) = log - .reserve(REPLACE, PID, &data_bytes, &guard) - .unwrap() - .complete() - .unwrap(); - let read_buf = log.read(PID, lsn, ptr).unwrap().into_data().unwrap(); - assert_eq!( - *read_buf, - *data_bytes.serialize(), - "after writing data, it should be readable" - ); -} - -fn abort(log: &Log) { - let guard = pin(); - let res = log.reserve(REPLACE, PID, &IVec::from(&[0; 5]), &guard).unwrap(); - let (lsn, ptr) = res.abort().unwrap(); - match log.read(PID, lsn, ptr) { - Ok(LogRead::Canceled(_)) => {} - other => { - panic!( - "expected to successfully read \ - aborted log message, instead read {:?}", - other - ); - } - } -} - -#[test] -fn log_aborts() { - common::setup_logger(); - let config = Config::new().temporary(true); - let db = config.open().unwrap(); - let log = &db.context.pagecache.log; - write(log); - abort(log); - write(log); - abort(log); - write(log); - abort(log); -} - -#[test] -#[cfg_attr(any(target_os = "fuchsia", miri), ignore)] -fn log_chunky_iterator() { - common::setup_logger(); - let config = - Config::new().flush_every_ms(None).temporary(true).segment_size(256); - - let db = config.open().unwrap(); - let log = &db.context.pagecache.log; - - let mut reference = vec![]; - - let max_valid_size = - config.segment_size - (MAX_MSG_HEADER_LEN + SEG_HEADER_LEN); - - for i in PID..1000 { - let len = thread_rng().gen_range(0, max_valid_size * 2); - let item = thread_rng().gen::(); - let buf = IVec::from(vec![item; len]); - let abort = thread_rng().gen::(); - - let pid = 10000 + i; - - let guard = pin(); - - if abort { - let res = log - .reserve(REPLACE, pid, &buf, &guard) - .expect("should be able to reserve"); - res.abort().unwrap(); - } else { - let res = log - .reserve(REPLACE, pid, &buf, &guard) - .expect("should be able to write reservation"); - let ptr = res.pointer; - let (lsn, _) = res.complete().unwrap(); - reference.push((REPLACE, pid, lsn, ptr)); - } - } - - for (_, pid, lsn, ptr) in reference.into_iter() { - assert!(log.read(pid, lsn, ptr).is_ok()); - } -} - -#[test] -fn multi_segment_log_iteration() -> Result<()> { - common::setup_logger(); - // ensure segments are being linked - // ensure trailers are valid - let config = - Config::new().temporary(true).segment_size(512).flush_every_ms(None); - - // this guard prevents any segments from 
being freed - let _guard = pin(); - - let total_seg_overhead = SEG_HEADER_LEN; - let big_msg_overhead = MAX_MSG_HEADER_LEN + total_seg_overhead; - let big_msg_sz = (config.segment_size - big_msg_overhead) / 64; - - let db = config.open().unwrap(); - let log = &db.context.pagecache.log; - - let mut expected_pids = std::collections::HashSet::new(); - - for i in 4..1000 { - let buf = IVec::from(vec![i as u8; big_msg_sz * i]); - let guard = pin(); - log.reserve(REPLACE, i as PageId, &buf, &guard) - .unwrap() - .complete() - .unwrap(); - - expected_pids.insert(i as PageId); - } - - db.flush().unwrap(); - - // start iterating just past the first segment header - let iter = log.iter_from(SEG_HEADER_LEN as Lsn); - - for (_, pid, _, _) in iter { - if pid <= 3 { - // this page is for the meta page, counter page, or the default - // tree's leaf or index nodes - continue; - } - assert!( - expected_pids.remove(&pid), - "read pid {} while iterating, but this was no longer expected", - pid - ); - } - - assert!( - expected_pids.is_empty(), - "expected to read pids {:?} but never saw them while iterating", - expected_pids - ); - - Ok(()) -} diff --git a/tests/test_space_leaks.rs b/tests/test_space_leaks.rs index 214c37b4b..5a6a6c520 100644 --- a/tests/test_space_leaks.rs +++ b/tests/test_space_leaks.rs @@ -1,15 +1,14 @@ +use std::io; + mod common; #[test] #[cfg_attr(miri, ignore)] -fn size_leak() -> sled::Result<()> { +fn size_leak() -> io::Result<()> { common::setup_logger(); - let tree = sled::Config::new() - .temporary(true) - .segment_size(2048) - .flush_every_ms(None) - .open()?; + let tree: sled::Db<1024> = + sled::Config::tmp()?.flush_every_ms(None).open()?; for _ in 0..10_000 { tree.insert(b"", b"")?; diff --git a/tests/test_tree.rs b/tests/test_tree.rs index b171dfe99..efaf57655 100644 --- a/tests/test_tree.rs +++ b/tests/test_tree.rs @@ -2,38 +2,41 @@ mod common; mod tree; use std::{ + io, sync::{ - atomic::{AtomicUsize, Ordering::SeqCst}, + atomic::{AtomicBool, AtomicUsize, Ordering::SeqCst}, Arc, Barrier, }, - time::Duration, }; #[allow(unused_imports)] use log::{debug, warn}; -use quickcheck::{QuickCheck, StdGen}; +use quickcheck::{Gen, QuickCheck}; -use sled::Transactional; -use sled::{transaction::*, *}; +// use sled::Transactional; +// use sled::transaction::*; +use sled::{Config, Db as SledDb, InlineArray}; + +type Db = SledDb<3>; use tree::{ - prop_tree_matches_btreemap, Key, - Op::{self, *}, + prop_tree_matches_btreemap, + Op::{self}, }; -const N_THREADS: usize = 10; -const N_PER_THREAD: usize = 100; +const N_THREADS: usize = 32; +const N_PER_THREAD: usize = 10_000; const N: usize = N_THREADS * N_PER_THREAD; // NB N should be multiple of N_THREADS const SPACE: usize = N; #[allow(dead_code)] const INTENSITY: usize = 10; -fn kv(i: usize) -> Vec { +fn kv(i: usize) -> InlineArray { let i = i % SPACE; let k = [(i >> 16) as u8, (i >> 8) as u8, i as u8]; - k.to_vec() + (&k).into() } #[test] @@ -41,7 +44,7 @@ fn kv(i: usize) -> Vec { fn monotonic_inserts() { common::setup_logger(); - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(None).open().unwrap(); for len in [1_usize, 16, 32, 1024].iter() { for i in 0_usize..*len { @@ -86,7 +89,7 @@ fn fixed_stride_inserts() { // this is intended to test the fixed stride key omission optimization common::setup_logger(); - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(None).open().unwrap(); let 
mut expected = std::collections::HashSet::new(); for k in 0..4096_u16 { @@ -127,7 +130,7 @@ fn fixed_stride_inserts() { let count = db.iter().rev().count(); assert_eq!(count, 0); assert_eq!(db.len(), 0); - assert!(db.is_empty()); + assert!(db.is_empty().unwrap()); } #[test] @@ -135,7 +138,7 @@ fn fixed_stride_inserts() { fn sequential_inserts() { common::setup_logger(); - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(None).open().unwrap(); for len in [1, 16, 32, u16::MAX].iter() { for i in 0..*len { @@ -155,7 +158,7 @@ fn sequential_inserts() { fn reverse_inserts() { common::setup_logger(); - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(None).open().unwrap(); for len in [1, 16, 32, u16::MAX].iter() { for i in 0..*len { @@ -179,12 +182,7 @@ fn very_large_reverse_tree_iterator() { let mut b = vec![255; 1024 * 1024]; b.push(1); - let db = Config::new() - .temporary(true) - .flush_every_ms(Some(1)) - .segment_size(256) - .open() - .unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(Some(1)).open().unwrap(); db.insert(a, "").unwrap(); db.insert(b, "").unwrap(); @@ -207,33 +205,120 @@ fn varied_compression_ratios() { buf }; - let tree = sled::Config::default() - .use_compression(true) - .path("compression_db_test") - .open() - .unwrap(); + let tree: Db = + Config::default().path("compression_db_test").open().unwrap(); tree.insert(b"low entropy", &low_entropy[..]).unwrap(); tree.insert(b"high entropy", &high_entropy[..]).unwrap(); println!("reloading database..."); drop(tree); - let tree = sled::Config::default() - .use_compression(true) - .path("compression_db_test") - .open() - .unwrap(); + let tree: Db = + Config::default().path("compression_db_test").open().unwrap(); drop(tree); let _ = std::fs::remove_dir_all("compression_db_test"); } +#[test] +fn test_pop_first() -> io::Result<()> { + let config = sled::Config::tmp().unwrap(); + let db: sled::Db<4> = config.open()?; + db.insert(&[0], vec![0])?; + db.insert(&[1], vec![10])?; + db.insert(&[2], vec![20])?; + db.insert(&[3], vec![30])?; + db.insert(&[4], vec![40])?; + db.insert(&[5], vec![50])?; + + assert_eq!(&db.pop_first()?.unwrap().0, &[0]); + assert_eq!(&db.pop_first()?.unwrap().0, &[1]); + assert_eq!(&db.pop_first()?.unwrap().0, &[2]); + assert_eq!(&db.pop_first()?.unwrap().0, &[3]); + assert_eq!(&db.pop_first()?.unwrap().0, &[4]); + assert_eq!(&db.pop_first()?.unwrap().0, &[5]); + assert_eq!(db.pop_first()?, None); + /* + */ + Ok(()) +} + +#[test] +fn test_pop_last_in_range() -> io::Result<()> { + let config = sled::Config::tmp().unwrap(); + let db: sled::Db<4> = config.open()?; + + let data = vec![ + (b"key 1", b"value 1"), + (b"key 2", b"value 2"), + (b"key 3", b"value 3"), + ]; + + for (k, v) in data { + db.insert(k, v).unwrap(); + } + + let r1 = db.pop_last_in_range(b"key 1".as_ref()..=b"key 3").unwrap(); + assert_eq!(Some((b"key 3".into(), b"value 3".into())), r1); + + let r2 = db.pop_last_in_range(b"key 1".as_ref()..b"key 3").unwrap(); + assert_eq!(Some((b"key 2".into(), b"value 2".into())), r2); + + let r3 = db.pop_last_in_range(b"key 4".as_ref()..).unwrap(); + assert!(r3.is_none()); + + let r4 = db.pop_last_in_range(b"key 2".as_ref()..=b"key 3").unwrap(); + assert!(r4.is_none()); + + let r5 = db.pop_last_in_range(b"key 0".as_ref()..=b"key 3").unwrap(); + assert_eq!(Some((b"key 1".into(), b"value 1".into())), r5); + + let r6 = db.pop_last_in_range(b"key 
0".as_ref()..=b"key 3").unwrap(); + assert!(r6.is_none()); + Ok(()) +} + +#[test] +fn test_interleaved_gets_sets() { + common::setup_logger(); + let db: Db = + Config::tmp().unwrap().cache_capacity_bytes(1024).open().unwrap(); + + let done = Arc::new(AtomicBool::new(false)); + + std::thread::scope(|scope| { + let db_2 = db.clone(); + let done = &done; + scope.spawn(move || { + for v in 0..500_000_u32 { + db_2.insert(v.to_be_bytes(), &[42u8; 4096][..]) + .expect("failed to insert"); + if v % 10_000 == 0 { + log::trace!("WRITING: {}", v); + db_2.flush().unwrap(); + } + } + done.store(true, SeqCst); + }); + scope.spawn(move || { + while !done.load(SeqCst) { + for v in (0..500_000_u32).rev() { + db.get(v.to_be_bytes()).expect("Fatal error?"); + if v % 10_000 == 0 { + log::trace!("READING: {}", v) + } + } + } + }); + }); +} + #[test] #[cfg(not(miri))] // can't create threads -fn concurrent_tree_pops() -> sled::Result<()> { +fn concurrent_tree_pops() -> std::io::Result<()> { use std::thread; - let db = sled::Config::new().temporary(true).open()?; + let db: Db = Config::tmp().unwrap().open()?; // Insert values 0..5 for x in 0u32..5 { @@ -246,10 +331,10 @@ fn concurrent_tree_pops() -> sled::Result<()> { let barrier = Arc::new(Barrier::new(5)); for _ in 0..5 { let barrier = barrier.clone(); - let db = db.clone(); + let db: Db = db.clone(); threads.push(thread::spawn(move || { barrier.wait(); - db.pop_min().unwrap().unwrap(); + db.pop_first().unwrap().unwrap(); })); } @@ -258,7 +343,7 @@ fn concurrent_tree_pops() -> sled::Result<()> { } assert!( - db.is_empty(), + db.is_empty().unwrap(), "elements left in database: {:?}", db.iter().collect::>() ); @@ -276,24 +361,43 @@ fn concurrent_tree_ops() { for i in 0..INTENSITY { debug!("beginning test {}", i); - let config = Config::new() - .temporary(true) + let config = Config::tmp() + .unwrap() .flush_every_ms(Some(1)) - .segment_size(256); + .cache_capacity_bytes(1024); macro_rules! par { ($t:ident, $f:expr) => { let mut threads = vec![]; + + let flusher_barrier = Arc::new(Barrier::new(N_THREADS)); + for tn in 0..N_THREADS { + let tree = $t.clone(); + let barrier = flusher_barrier.clone(); + let thread = thread::Builder::new() + .name(format!("t(thread: {} flusher)", tn)) + .spawn(move || { + tree.flush().unwrap(); + barrier.wait(); + }) + .expect("should be able to spawn thread"); + threads.push(thread); + } + + let barrier = Arc::new(Barrier::new(N_THREADS)); + for tn in 0..N_THREADS { let tree = $t.clone(); + let barrier = barrier.clone(); let thread = thread::Builder::new() .name(format!("t(thread: {} test: {})", tn, i)) .spawn(move || { + barrier.wait(); for i in (tn * N_PER_THREAD)..((tn + 1) * N_PER_THREAD) { let k = kv(i); - $f(&*tree, k); + $f(&tree, k); } }) .expect("should be able to spawn thread"); @@ -308,9 +412,9 @@ fn concurrent_tree_ops() { } debug!("========== initial sets test {} ==========", i); - let t = Arc::new(config.open().unwrap()); - par! {t, |tree: &Tree, k: Vec| { - assert_eq!(tree.get(&*k), Ok(None)); + let t: Db = config.open().unwrap(); + par! 
{t, move |tree: &Db, k: InlineArray| { + assert_eq!(tree.get(&*k).unwrap(), None); tree.insert(&k, k.clone()).expect("we should write successfully"); assert_eq!(tree.get(&*k).unwrap(), Some(k.clone().into()), "failed to read key {:?} that we just wrote from tree {:?}", @@ -327,8 +431,7 @@ fn concurrent_tree_ops() { } drop(t); - let t = - Arc::new(config.open().expect("should be able to restart Tree")); + let t: Db = config.open().expect("should be able to restart Db"); let n_scanned = t.iter().count(); if n_scanned != N { @@ -340,62 +443,60 @@ fn concurrent_tree_ops() { } debug!("========== reading sets in test {} ==========", i); - par! {t, |tree: &Tree, k: Vec| { + par! {t, move |tree: &Db, k: InlineArray| { if let Some(v) = tree.get(&*k).unwrap() { if v != k { panic!("expected key {:?} not found", k); } } else { - panic!("could not read key {:?}, which we \ - just wrote to tree {:?}", k, tree); + panic!( + "could not read key {:?}, which we \ + just wrote to tree {:?}", k, tree + ); } }}; drop(t); - let t = - Arc::new(config.open().expect("should be able to restart Tree")); + let t: Db = config.open().expect("should be able to restart Db"); debug!("========== CAS test in test {} ==========", i); - par! {t, |tree: &Tree, k: Vec| { + par! {t, move |tree: &Db, k: InlineArray| { let k1 = k.clone(); let mut k2 = k; - k2.reverse(); + k2.make_mut().reverse(); tree.compare_and_swap(&k1, Some(&*k1), Some(k2)).unwrap().unwrap(); }}; drop(t); - let t = - Arc::new(config.open().expect("should be able to restart Tree")); + let t: Db = config.open().expect("should be able to restart Db"); - par! {t, |tree: &Tree, k: Vec| { + par! {t, move |tree: &Db, k: InlineArray| { let k1 = k.clone(); let mut k2 = k; - k2.reverse(); - assert_eq!(tree.get(&*k1).unwrap().unwrap().to_vec(), k2); + k2.make_mut().reverse(); + assert_eq!(tree.get(&*k1).unwrap().unwrap(), k2); }}; drop(t); - let t = - Arc::new(config.open().expect("should be able to restart Tree")); + let t: Db = config.open().expect("should be able to restart Db"); debug!("========== deleting in test {} ==========", i); - par! {t, |tree: &Tree, k: Vec| { - tree.remove(&*k).unwrap(); + par! {t, move |tree: &Db, k: InlineArray| { + tree.remove(&*k).unwrap().unwrap(); }}; drop(t); - let t = - Arc::new(config.open().expect("should be able to restart Tree")); + let t: Db = config.open().expect("should be able to restart Db"); - par! {t, |tree: &Tree, k: Vec| { - assert_eq!(tree.get(&*k), Ok(None)); + par! {t, move |tree: &Db, k: InlineArray| { + assert_eq!(tree.get(&*k).unwrap(), None); }}; } } #[test] #[cfg(not(miri))] // can't create threads -fn concurrent_tree_iter() -> Result<()> { +fn concurrent_tree_iter() -> io::Result<()> { use std::sync::Barrier; use std::thread; @@ -403,11 +504,12 @@ fn concurrent_tree_iter() -> Result<()> { const N_FORWARD: usize = INTENSITY; const N_REVERSE: usize = INTENSITY; + const N_INSERT: usize = INTENSITY; + const N_DELETE: usize = INTENSITY; + const N_FLUSHERS: usize = N_THREADS; - let config = Config::new().temporary(true).flush_every_ms(Some(1)); - - let t = config.open().unwrap(); - + // items that are expected to always be present at their expected + // order, regardless of other inserts or deletes. 
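+    // (The forward reader threads below assert that every witnessed key is <= the
+    // expected INDELIBLE entry, and the reverse readers assert the >= mirror of that
+    // invariant, so an overshoot caused by a concurrent insert or delete panics.)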
const INDELIBLE: [&[u8]; 16] = [ &[0u8], &[1u8], @@ -427,104 +529,124 @@ fn concurrent_tree_iter() -> Result<()> { &[15u8], ]; + let config = Config::tmp() + .unwrap() + .cache_capacity_bytes(1024 * 1024 * 1024) + .flush_every_ms(Some(1)); + + let t: Db = config.open().unwrap(); + + let mut threads: Vec>> = vec![]; + + for tn in 0..N_FLUSHERS { + let tree = t.clone(); + let thread = thread::Builder::new() + .name(format!("t(thread: {} flusher)", tn)) + .spawn(move || { + tree.flush().unwrap(); + Ok(()) + }) + .expect("should be able to spawn thread"); + threads.push(thread); + } + for item in &INDELIBLE { t.insert(item, item.to_vec())?; } - let barrier = Arc::new(Barrier::new(N_FORWARD + N_REVERSE + 2)); + let barrier = + Arc::new(Barrier::new(N_FORWARD + N_REVERSE + N_INSERT + N_DELETE)); - let mut threads: Vec>> = vec![]; static I: AtomicUsize = AtomicUsize::new(0); for i in 0..N_FORWARD { - let t = thread::Builder::new() + let t: Db = t.clone(); + let barrier = barrier.clone(); + + let thread = thread::Builder::new() .name(format!("forward({})", i)) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { - I.fetch_add(1, SeqCst); - barrier.wait(); - for _ in 0..100 { - let expected = INDELIBLE.iter(); - let mut keys = t.iter().keys(); - - for expect in expected { - loop { - let k = keys.next().unwrap()?; - assert!( - &*k <= *expect, - "witnessed key is {:?} but we expected \ - one <= {:?}, so we overshot due to a \ - concurrent modification", - k, - expect, - ); - if &*k == *expect { - break; - } + .spawn(move || { + I.fetch_add(1, SeqCst); + barrier.wait(); + for _ in 0..1024 { + let expected = INDELIBLE.iter(); + let mut keys = t.iter().keys(); + + for expect in expected { + loop { + let k = keys.next().unwrap()?; + assert!( + &*k <= *expect, + "witnessed key is {:?} but we expected \ + one <= {:?}, so we overshot due to a \ + concurrent modification", + k, + expect, + ); + if &*k == *expect { + break; } } } - I.fetch_sub(1, SeqCst); - - Ok(()) } + I.fetch_sub(1, SeqCst); + + Ok(()) }) .unwrap(); - threads.push(t); + threads.push(thread); } for i in 0..N_REVERSE { - let t = thread::Builder::new() + let t: Db = t.clone(); + let barrier = barrier.clone(); + + let thread = thread::Builder::new() .name(format!("reverse({})", i)) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { - I.fetch_add(1, SeqCst); - barrier.wait(); - for _ in 0..100 { - let expected = INDELIBLE.iter().rev(); - let mut keys = t.iter().keys().rev(); - - for expect in expected { - loop { - if let Some(Ok(k)) = keys.next() { - assert!( - &*k >= *expect, - "witnessed key is {:?} but we expected \ - one >= {:?}, so we overshot due to a \ - concurrent modification\n{:?}", - k, - expect, - *t, - ); - if &*k == *expect { - break; - } - } else { - panic!("undershot key on tree: \n{:?}", *t); + .spawn(move || { + I.fetch_add(1, SeqCst); + barrier.wait(); + for _ in 0..1024 { + let expected = INDELIBLE.iter().rev(); + let mut keys = t.iter().keys().rev(); + + for expect in expected { + loop { + if let Some(Ok(k)) = keys.next() { + assert!( + &*k >= *expect, + "witnessed key is {:?} but we expected \ + one >= {:?}, so we overshot due to a \ + concurrent modification\n{:?}", + k, + expect, + t, + ); + if &*k == *expect { + break; } + } else { + panic!("undershot key on tree: \n{:?}", t); } } } - I.fetch_sub(1, SeqCst); - - Ok(()) } + I.fetch_sub(1, SeqCst); + + Ok(()) }) .unwrap(); - threads.push(t); + threads.push(thread); } - let inserter = thread::Builder::new() - 
.name("inserter".into()) - .spawn({ - let t = t.clone(); - let barrier = barrier.clone(); - move || { + for i in 0..N_INSERT { + let t: Db = t.clone(); + let barrier = barrier.clone(); + + let thread = thread::Builder::new() + .name(format!("insert({})", i)) + .spawn(move || { barrier.wait(); while I.load(SeqCst) != 0 { @@ -539,16 +661,19 @@ fn concurrent_tree_iter() -> Result<()> { } Ok(()) - } - }) - .unwrap(); + }) + .unwrap(); - threads.push(inserter); + threads.push(thread); + } - let deleter = thread::Builder::new() - .name("deleter".into()) - .spawn({ - move || { + for i in 0..N_DELETE { + let t: Db = t.clone(); + let barrier = barrier.clone(); + + let thread = thread::Builder::new() + .name(format!("deleter({})", i)) + .spawn(move || { barrier.wait(); while I.load(SeqCst) != 0 { @@ -563,19 +688,24 @@ fn concurrent_tree_iter() -> Result<()> { } Ok(()) - } - }) - .unwrap(); + }) + .unwrap(); - threads.push(deleter); + threads.push(thread); + } for thread in threads.into_iter() { thread.join().expect("thread should not have crashed")?; } + t.check_error().expect("Db should have no set error"); + + dbg!(t.stats()); + Ok(()) } +/* #[test] #[cfg(not(miri))] // can't create threads fn concurrent_tree_transactions() -> TransactionResult<()> { @@ -586,8 +716,7 @@ fn concurrent_tree_transactions() -> TransactionResult<()> { let config = Config::new() .temporary(true) .flush_every_ms(Some(1)) - .use_compression(true); - let db = config.open().unwrap(); + let db: Db = config.open().unwrap(); db.insert(b"k1", b"cats").unwrap(); db.insert(b"k2", b"dogs").unwrap(); @@ -602,7 +731,7 @@ fn concurrent_tree_transactions() -> TransactionResult<()> { let barrier = Arc::new(Barrier::new(N_WRITERS + N_READERS + N_SUBSCRIBERS)); for _ in 0..N_WRITERS { - let db = db.clone(); + let db: Db = db.clone(); let barrier = barrier.clone(); let thread = std::thread::spawn(move || { barrier.wait(); @@ -623,7 +752,7 @@ fn concurrent_tree_transactions() -> TransactionResult<()> { } for _ in 0..N_READERS { - let db = db.clone(); + let db: Db = db.clone(); let barrier = barrier.clone(); let thread = std::thread::spawn(move || { barrier.wait(); @@ -646,7 +775,7 @@ fn concurrent_tree_transactions() -> TransactionResult<()> { } for _ in 0..N_SUBSCRIBERS { - let db = db.clone(); + let db: Db = db.clone(); let barrier = barrier.clone(); let thread = std::thread::spawn(move || { barrier.wait(); @@ -674,8 +803,8 @@ fn concurrent_tree_transactions() -> TransactionResult<()> { #[test] fn tree_flush_in_transaction() { - let config = sled::Config::new().temporary(true); - let db = config.open().unwrap(); + let config = sled::Config::tmp().unwrap(); + let db: Db = config.open().unwrap(); let tree = db.open_tree(b"a").unwrap(); tree.transaction::<_, _, sled::transaction::TransactionError>(|tree| { @@ -692,9 +821,9 @@ fn incorrect_multiple_db_transactions() -> TransactionResult<()> { common::setup_logger(); let db1 = - Config::new().temporary(true).flush_every_ms(Some(1)).open().unwrap(); + Config::tmp().unwrap().flush_every_ms(Some(1)).open().unwrap(); let db2 = - Config::new().temporary(true).flush_every_ms(Some(1)).open().unwrap(); + Config::tmp().unwrap().flush_every_ms(Some(1)).open().unwrap(); let result: TransactionResult<()> = (&*db1, &*db2).transaction::<_, ()>(|_| Ok(())); @@ -708,8 +837,8 @@ fn incorrect_multiple_db_transactions() -> TransactionResult<()> { fn many_tree_transactions() -> TransactionResult<()> { common::setup_logger(); - let config = Config::new().temporary(true).flush_every_ms(Some(1)); - let db = 
Arc::new(config.open().unwrap()); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); + let db: Db = Arc::new(config.open().unwrap()); let t1 = db.open_tree(b"1")?; let t2 = db.open_tree(b"2")?; let t3 = db.open_tree(b"3")?; @@ -731,8 +860,8 @@ fn many_tree_transactions() -> TransactionResult<()> { fn batch_outside_of_transaction() -> TransactionResult<()> { common::setup_logger(); - let config = Config::new().temporary(true).flush_every_ms(Some(1)); - let db = config.open().unwrap(); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); + let db: Db = config.open().unwrap(); let t1 = db.open_tree(b"1")?; @@ -749,6 +878,7 @@ fn batch_outside_of_transaction() -> TransactionResult<()> { assert_eq!(t1.get(b"k2")?, Some(b"v2".into())); Ok(()) } +*/ #[test] fn tree_subdir() { @@ -762,7 +892,7 @@ fn tree_subdir() { let config = Config::new().path(&path); - let t = config.open().unwrap(); + let t: Db = config.open().unwrap(); t.insert(&[1], vec![1]).unwrap(); @@ -770,7 +900,7 @@ fn tree_subdir() { let config = Config::new().path(&path); - let t = config.open().unwrap(); + let t: Db = config.open().unwrap(); let res = t.get(&*vec![1]); @@ -784,8 +914,8 @@ fn tree_subdir() { #[test] #[cfg_attr(miri, ignore)] fn tree_small_keys_iterator() { - let config = Config::new().temporary(true).flush_every_ms(Some(1)); - let t = config.open().unwrap(); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); + let t: Db = config.open().unwrap(); for i in 0..N_PER_THREAD { let k = kv(i); t.insert(&k, k.clone()).unwrap(); @@ -818,7 +948,7 @@ fn tree_small_keys_iterator() { let mut tree_scan = t.range(&*last_key..); let r3 = tree_scan.next().unwrap().unwrap(); assert_eq!((r3.0.as_ref(), &*r3.1), (last_key.as_ref(), &*last_key)); - assert_eq!(tree_scan.next(), None); + assert!(tree_scan.next().is_none()); } #[test] @@ -827,14 +957,14 @@ fn tree_big_keys_iterator() { fn kv(i: usize) -> Vec { let k = [(i >> 16) as u8, (i >> 8) as u8, i as u8]; - let mut base = vec![0; u8::max_value() as usize]; + let mut base = vec![0; u8::MAX as usize]; base.extend_from_slice(&k); base } - let config = Config::new().temporary(true).flush_every_ms(Some(1)); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); - let t = config.open().unwrap(); + let t: Db = config.open().unwrap(); for i in 0..N_PER_THREAD { let k = kv(i); t.insert(&k, k.clone()).unwrap(); @@ -867,14 +997,15 @@ fn tree_big_keys_iterator() { let mut tree_scan = t.range(&*last_key..); let r3 = tree_scan.next().unwrap().unwrap(); assert_eq!((r3.0.as_ref(), &*r3.1), (last_key.as_ref(), &*last_key)); - assert_eq!(tree_scan.next(), None); + assert!(tree_scan.next().is_none()); } +/* #[test] -fn tree_subscribers_and_keyspaces() -> Result<()> { - let config = Config::new().temporary(true).flush_every_ms(Some(1)); +fn tree_subscribers_and_keyspaces() -> io::Result<()> { + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); - let db = config.open().unwrap(); + let db: Db = config.open().unwrap(); let t1 = db.open_tree(b"1")?; let mut s1 = t1.watch_prefix(b""); @@ -888,15 +1019,11 @@ fn tree_subscribers_and_keyspaces() -> Result<()> { assert_eq!(s1.next().unwrap().iter().next().unwrap().1, b"t1_a"); assert_eq!(s2.next().unwrap().iter().next().unwrap().1, b"t2_a"); - let guard = pin(); - guard.flush(); - drop(guard); - drop(db); drop(t1); drop(t2); - let db = config.open().unwrap(); + let db: Db = config.open().unwrap(); let t1 = db.open_tree(b"1")?; let mut s1 = t1.watch_prefix(b""); @@ -914,15 +1041,11 @@ fn 
tree_subscribers_and_keyspaces() -> Result<()> { assert_eq!(s1.next().unwrap().iter().next().unwrap().1, b"t1_b"); assert_eq!(s2.next().unwrap().iter().next().unwrap().1, b"t2_b"); - let guard = pin(); - guard.flush(); - drop(guard); - drop(db); drop(t1); drop(t2); - let db = config.open().unwrap(); + let db: Db = config.open().unwrap(); let t1 = db.open_tree(b"1")?; let t2 = db.open_tree(b"2")?; @@ -946,7 +1069,7 @@ fn tree_subscribers_and_keyspaces() -> Result<()> { drop(t1); drop(t2); - let db = config.open().unwrap(); + let db: Db = config.open().unwrap(); let t1 = db.open_tree(b"1")?; let t2 = db.open_tree(b"2")?; @@ -957,13 +1080,14 @@ fn tree_subscribers_and_keyspaces() -> Result<()> { Ok(()) } +*/ #[test] fn tree_range() { common::setup_logger(); - let config = Config::new().temporary(true).flush_every_ms(Some(1)); - let t = config.open().unwrap(); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); + let t: sled::Db<7> = config.open().unwrap(); t.insert(b"0", vec![0]).unwrap(); t.insert(b"1", vec![10]).unwrap(); @@ -977,14 +1101,14 @@ fn tree_range() { let mut r = t.range(start..end); assert_eq!(r.next().unwrap().unwrap().0, b"2"); assert_eq!(r.next().unwrap().unwrap().0, b"3"); - assert_eq!(r.next(), None); + assert!(r.next().is_none()); let start = b"2".to_vec(); let end = b"4".to_vec(); let mut r = t.range(start..end).rev(); assert_eq!(r.next().unwrap().unwrap().0, b"3"); assert_eq!(r.next().unwrap().unwrap().0, b"2"); - assert_eq!(r.next(), None); + assert!(r.next().is_none()); let start = b"2".to_vec(); let mut r = t.range(start..); @@ -992,7 +1116,7 @@ fn tree_range() { assert_eq!(r.next().unwrap().unwrap().0, b"3"); assert_eq!(r.next().unwrap().unwrap().0, b"4"); assert_eq!(r.next().unwrap().unwrap().0, b"5"); - assert_eq!(r.next(), None); + assert!(r.next().is_none()); let start = b"2".to_vec(); let mut r = t.range(..=start).rev(); @@ -1004,7 +1128,7 @@ fn tree_range() { ); assert_eq!(r.next().unwrap().unwrap().0, b"1"); assert_eq!(r.next().unwrap().unwrap().0, b"0"); - assert_eq!(r.next(), None); + assert!(r.next().is_none()); } #[test] @@ -1012,19 +1136,16 @@ fn tree_range() { fn recover_tree() { common::setup_logger(); - let config = Config::new() - .temporary(true) - .flush_every_ms(Some(1)) - .segment_size(4096); + let config = Config::tmp().unwrap().flush_every_ms(Some(1)); - let t = config.open().unwrap(); + let t: sled::Db<7> = config.open().unwrap(); for i in 0..N_PER_THREAD { let k = kv(i); t.insert(&k, k.clone()).unwrap(); } drop(t); - let t = config.open().unwrap(); + let t: sled::Db<7> = config.open().unwrap(); for i in 0..N_PER_THREAD { let k = kv(i as usize); assert_eq!(t.get(&*k).unwrap().unwrap(), k); @@ -1032,15 +1153,109 @@ fn recover_tree() { } drop(t); - let t = config.open().unwrap(); + println!("---------------- recovering a (hopefully) empty db ----------------------"); + + let t: sled::Db<7> = config.open().unwrap(); for i in 0..N_PER_THREAD { let k = kv(i as usize); - assert_eq!(t.get(&*k), Ok(None)); + assert!( + t.get(&*k).unwrap().is_none(), + "expected key {:?} to have been deleted", + i + ); } } #[test] -fn create_tree() { +#[cfg_attr(miri, ignore)] +fn tree_gc() { + const FANOUT: usize = 7; + + common::setup_logger(); + + let config = Config::tmp().unwrap().flush_every_ms(None); + + let t: sled::Db = config.open().unwrap(); + + for i in 0..N { + let k = kv(i); + t.insert(&k, k.clone()).unwrap(); + } + + for _ in 0..100 { + t.flush().unwrap(); + } + + let size_on_disk_after_inserts = t.size_on_disk().unwrap(); + + for i in 0..N { + 
let k = kv(i); + t.insert(&k, k.clone()).unwrap(); + } + + for _ in 0..100 { + t.flush().unwrap(); + } + + let size_on_disk_after_rewrites = t.size_on_disk().unwrap(); + + for i in 0..N { + let k = kv(i); + assert_eq!(t.get(&*k).unwrap(), Some(k.clone().into()), "{k:?}"); + t.remove(&*k).unwrap(); + } + + for _ in 0..100 { + t.flush().unwrap(); + } + + let size_on_disk_after_deletes = t.size_on_disk().unwrap(); + + t.check_error().expect("Db should have no set error"); + + let stats = t.stats(); + + dbg!(stats); + + assert!( + stats.cache.heap.allocator.objects_allocated >= (N / FANOUT) as u64, + "{stats:?}" + ); + assert!( + stats.cache.heap.allocator.objects_freed + >= (stats.cache.heap.allocator.objects_allocated / 2) as u64, + "{stats:?}" + ); + assert!( + stats.cache.heap.allocator.heap_slots_allocated >= (N / FANOUT) as u64, + "{stats:?}" + ); + assert!( + stats.cache.heap.allocator.heap_slots_freed + >= (stats.cache.heap.allocator.heap_slots_allocated / 2) as u64, + "{stats:?}" + ); + + let expected_max_size = size_on_disk_after_inserts / 15; + assert!( + size_on_disk_after_deletes <= expected_max_size, + "expected file truncation to take size under {expected_max_size} \ + but it was {size_on_disk_after_deletes}" + ); + // TODO assert!(stats.cache.heap.truncated_file_bytes > 0); + + println!( + "after writing {N} items and removing them, disk size went \ + from {}kb after inserts to {}kb after rewriting to {}kb after deletes", + size_on_disk_after_inserts / 1024, + size_on_disk_after_rewrites / 1024, + size_on_disk_after_deletes / 1024, + ); +} + +/* +#[test] +fn create_exclusive() { common::setup_logger(); let path = "create_exclusive_db"; @@ -1055,33 +1270,34 @@ fn create_tree() { config.open().unwrap_err(); std::fs::remove_dir_all(path).unwrap(); } +*/ #[test] fn contains_tree() { - let db = Config::new().temporary(true).flush_every_ms(None).open().unwrap(); + let db: Db = Config::tmp().unwrap().flush_every_ms(None).open().unwrap(); let tree_one = db.open_tree("tree 1").unwrap(); let tree_two = db.open_tree("tree 2").unwrap(); drop(tree_one); drop(tree_two); - assert_eq!(false, db.contains_tree("tree 3")); - assert_eq!(true, db.contains_tree("tree 1")); - assert_eq!(true, db.contains_tree("tree 2")); + assert_eq!(false, db.contains_tree("tree 3").unwrap()); + assert_eq!(true, db.contains_tree("tree 1").unwrap()); + assert_eq!(true, db.contains_tree("tree 2").unwrap()); assert!(db.drop_tree("tree 1").unwrap()); - assert_eq!(false, db.contains_tree("tree 1")); + assert_eq!(false, db.contains_tree("tree 1").unwrap()); } #[test] #[cfg_attr(miri, ignore)] -fn tree_import_export() -> Result<()> { +fn tree_import_export() -> io::Result<()> { common::setup_logger(); - let config_1 = Config::new().temporary(true); - let config_2 = Config::new().temporary(true); + let config_1 = Config::tmp().unwrap(); + let config_2 = Config::tmp().unwrap(); - let db = config_1.open()?; + let db: Db = config_1.open()?; for db_id in 0..N_THREADS { let tree_id = format!("tree_{}", db_id); let tree = db.open_tree(tree_id.as_bytes())?; @@ -1095,8 +1311,8 @@ fn tree_import_export() -> Result<()> { drop(db); - let exporter = config_1.open()?; - let importer = config_2.open()?; + let exporter: Db = config_1.open()?; + let importer: Db = config_2.open()?; let export = exporter.export(); importer.import(export); @@ -1105,7 +1321,7 @@ fn tree_import_export() -> Result<()> { drop(config_1); drop(importer); - let db = config_2.open()?; + let db: Db = config_2.open()?; let checksum_b = db.checksum().unwrap(); 
assert_eq!(checksum_a, checksum_b); @@ -1125,14 +1341,14 @@ fn tree_import_export() -> Result<()> { drop(db); - let db = config_2.open()?; + let db: Db = config_2.open()?; for db_id in 0..N_THREADS { let tree_id = format!("tree_{}", db_id); let tree = db.open_tree(tree_id.as_bytes())?; for i in 0..N_THREADS { let k = kv(i as usize); - assert_eq!(tree.get(&*k), Ok(None)); + assert_eq!(tree.get(&*k).unwrap(), None); } } @@ -1148,1612 +1364,10 @@ fn quickcheck_tree_matches_btreemap() { let n_tests = if cfg!(windows) { 25 } else { 100 }; QuickCheck::new() - .gen(StdGen::new(rand::thread_rng(), 1000)) + .gen(Gen::new(100)) .tests(n_tests) .max_tests(n_tests * 10) .quickcheck( - prop_tree_matches_btreemap - as fn(Vec, bool, bool, u8, u8) -> bool, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_00() { - // postmortem: - prop_tree_matches_btreemap(vec![Restart], false, false, 0, 0); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_01() { - // postmortem: - // this was a bug in the snapshot recovery, where - // it led to max_id dropping by 1 after a restart. - // postmortem 2: - // we were stalling here because we had a new log with stable of - // SEG_HEADER_LEN, but when we iterated over it to create a new - // snapshot (snapshot every 1 set in Config), we iterated up until - // that offset. make_stable requires our stable offset to be >= - // the provided one, to deal with 0. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![32]), 9), - Set(Key(vec![195]), 13), - Restart, - Set(Key(vec![164]), 147), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_02() { - // postmortem: - // this was a bug in the way that the `Materializer` - // was fed data, possibly out of order, if recover - // in the pagecache had to run over log entries - // that were later run through the same `Materializer` - // then the second time (triggered by a snapshot) - // would not pick up on the importance of seeing - // the new root set. - // portmortem 2: when refactoring iterators, failed - // to account for node.hi being empty on the infinity - // shard - prop_tree_matches_btreemap( - vec![ - Restart, - Set(Key(vec![215]), 121), - Restart, - Set(Key(vec![216]), 203), - Scan(Key(vec![210]), 4), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_03() { - // postmortem: the tree was not persisting and recovering root hoists - // postmortem 2: when refactoring the log storage, we failed to restart - // log writing in the proper location. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![113]), 204), - Set(Key(vec![119]), 205), - Set(Key(vec![166]), 88), - Set(Key(vec![23]), 44), - Restart, - Set(Key(vec![226]), 192), - Set(Key(vec![189]), 186), - Restart, - Scan(Key(vec![198]), 11), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_04() { - // postmortem: pagecache was failing to replace the LogId list - // when it encountered a new Update::Compact. - // postmortem 2: after refactoring log storage, we were not properly - // setting the log tip, and the beginning got clobbered after writing - // after a restart. 
- prop_tree_matches_btreemap( - vec![ - Set(Key(vec![158]), 31), - Set(Key(vec![111]), 134), - Set(Key(vec![230]), 187), - Set(Key(vec![169]), 58), - Set(Key(vec![131]), 10), - Set(Key(vec![108]), 246), - Set(Key(vec![127]), 155), - Restart, - Set(Key(vec![59]), 119), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_05() { - // postmortem: during recovery, the segment accountant was failing to - // properly set the file's tip. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![231]), 107), - Set(Key(vec![251]), 42), - Set(Key(vec![80]), 81), - Set(Key(vec![178]), 130), - Set(Key(vec![150]), 232), - Restart, - Set(Key(vec![98]), 78), - Set(Key(vec![0]), 45), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_06() { - // postmortem: after reusing segments, we were failing to checksum reads - // performed while iterating over rewritten segment buffers, and using - // former garbage data. fix: use the crc that's there for catching torn - // writes with high probability, AND zero out buffers. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![162]), 8), - Set(Key(vec![59]), 192), - Set(Key(vec![238]), 83), - Set(Key(vec![151]), 231), - Restart, - Set(Key(vec![30]), 206), - Set(Key(vec![150]), 146), - Set(Key(vec![18]), 34), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_07() { - // postmortem: the segment accountant was not fully recovered, and thought - // that it could reuse a particular segment that wasn't actually empty - // yet. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![135]), 22), - Set(Key(vec![41]), 36), - Set(Key(vec![101]), 31), - Set(Key(vec![111]), 35), - Restart, - Set(Key(vec![47]), 36), - Set(Key(vec![79]), 114), - Set(Key(vec![64]), 9), - Scan(Key(vec![196]), 25), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_08() { - // postmortem: failed to properly recover the state in the segment - // accountant that tracked the previously issued segment. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![145]), 151), - Set(Key(vec![155]), 148), - Set(Key(vec![131]), 170), - Set(Key(vec![163]), 60), - Set(Key(vec![225]), 126), - Restart, - Set(Key(vec![64]), 237), - Set(Key(vec![102]), 205), - Restart, - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_09() { - // postmortem 1: was failing to load existing snapshots on initialization. - // would encounter uninitialized segments at the log tip and overwrite - // the first segment (indexed by LSN of 0) in the segment accountant - // ordering, skipping over important updates. - // - // postmortem 2: page size tracking was inconsistent in SA. completely - // removed exact size tracking, and went back to simpler pure-page - // tenancy model. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![189]), 36), - Set(Key(vec![254]), 194), - Set(Key(vec![132]), 50), - Set(Key(vec![91]), 221), - Set(Key(vec![126]), 6), - Set(Key(vec![199]), 183), - Set(Key(vec![71]), 125), - Scan(Key(vec![67]), 16), - Set(Key(vec![190]), 16), - Restart, - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_10() { - // postmortem: after reusing a segment, but not completely writing a - // segment, we were hitting an old LSN and violating an assert, rather - // than just ending. 
- prop_tree_matches_btreemap( - vec![ - Set(Key(vec![152]), 163), - Set(Key(vec![105]), 191), - Set(Key(vec![207]), 217), - Set(Key(vec![128]), 19), - Set(Key(vec![106]), 22), - Scan(Key(vec![20]), 24), - Set(Key(vec![14]), 150), - Set(Key(vec![80]), 43), - Set(Key(vec![174]), 134), - Set(Key(vec![20]), 150), - Set(Key(vec![13]), 171), - Restart, - Scan(Key(vec![240]), 25), - Scan(Key(vec![77]), 37), - Set(Key(vec![153]), 232), - Del(Key(vec![2])), - Set(Key(vec![227]), 169), - Get(Key(vec![232])), - Cas(Key(vec![247]), 151, 70), - Set(Key(vec![78]), 52), - Get(Key(vec![16])), - Del(Key(vec![78])), - Cas(Key(vec![201]), 93, 196), - Set(Key(vec![172]), 84), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_11() { - // postmortem: a stall was happening because LSNs and LogIds were being - // conflated in calls to make_stable. A higher LogId than any LSN was - // being created, then passed in. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![38]), 148), - Set(Key(vec![176]), 175), - Set(Key(vec![82]), 88), - Set(Key(vec![164]), 85), - Set(Key(vec![139]), 74), - Set(Key(vec![73]), 23), - Cas(Key(vec![34]), 67, 151), - Set(Key(vec![115]), 133), - Set(Key(vec![249]), 138), - Restart, - Set(Key(vec![243]), 6), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_12() { - // postmortem: was not checking that a log entry's LSN matches its position - // as part of detecting tears / partial rewrites. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![118]), 156), - Set(Key(vec![8]), 63), - Set(Key(vec![165]), 110), - Set(Key(vec![219]), 108), - Set(Key(vec![91]), 61), - Set(Key(vec![18]), 98), - Scan(Key(vec![73]), 6), - Set(Key(vec![240]), 108), - Cas(Key(vec![71]), 28, 189), - Del(Key(vec![199])), - Restart, - Set(Key(vec![30]), 140), - Scan(Key(vec![118]), 13), - Get(Key(vec![180])), - Cas(Key(vec![115]), 151, 116), - Restart, - Set(Key(vec![31]), 95), - Cas(Key(vec![79]), 153, 225), - Set(Key(vec![34]), 161), - Get(Key(vec![213])), - Set(Key(vec![237]), 215), - Del(Key(vec![52])), - Set(Key(vec![56]), 78), - Scan(Key(vec![141]), 2), - Cas(Key(vec![228]), 114, 170), - Get(Key(vec![231])), - Get(Key(vec![223])), - Del(Key(vec![167])), - Restart, - Scan(Key(vec![240]), 31), - Del(Key(vec![54])), - Del(Key(vec![2])), - Set(Key(vec![117]), 165), - Set(Key(vec![223]), 50), - Scan(Key(vec![69]), 4), - Get(Key(vec![156])), - Set(Key(vec![214]), 72), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_13() { - // postmortem: failed root hoists were being improperly recovered before the - // following free was done on their page, but we treated the written node as - // if it were a successful completed root hoist. 
- prop_tree_matches_btreemap( - vec![ - Set(Key(vec![42]), 10), - Set(Key(vec![137]), 220), - Set(Key(vec![183]), 129), - Set(Key(vec![91]), 145), - Set(Key(vec![126]), 26), - Set(Key(vec![255]), 67), - Set(Key(vec![69]), 18), - Restart, - Set(Key(vec![24]), 92), - Set(Key(vec![193]), 17), - Set(Key(vec![3]), 143), - Cas(Key(vec![50]), 13, 84), - Restart, - Set(Key(vec![191]), 116), - Restart, - Del(Key(vec![165])), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_14() { - // postmortem: after adding prefix compression, we were not - // handling re-inserts and deletions properly - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![107]), 234), - Set(Key(vec![7]), 245), - Set(Key(vec![40]), 77), - Set(Key(vec![171]), 244), - Set(Key(vec![173]), 16), - Set(Key(vec![171]), 176), - Scan(Key(vec![93]), 33), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_15() { - // postmortem: was not sorting keys properly when binary searching for them - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![102]), 165), - Set(Key(vec![91]), 191), - Set(Key(vec![141]), 228), - Set(Key(vec![188]), 124), - Del(Key(vec![141])), - Scan(Key(vec![101]), 26), - ], - true, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_16() { - // postmortem: the test merge function was not properly adding numbers. - prop_tree_matches_btreemap( - vec![Merge(Key(vec![247]), 162), Scan(Key(vec![209]), 31)], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_17() { - // postmortem: we were creating a copy of a node leaf during iteration - // before accidentally putting it into a PinnedValue, despite the - // fact that it was not actually part of the node's actual memory! - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![194, 215, 103, 0, 138, 11, 248, 131]), 70), - Scan(Key(vec![]), 30), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_18() { - // postmortem: when implementing get_gt and get_lt, there were some - // issues with getting order comparisons correct. 
- prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 19), - Set(Key(vec![78]), 98), - Set(Key(vec![255]), 224), - Set(Key(vec![]), 131), - Get(Key(vec![255])), - GetGt(Key(vec![89])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_19() { - // postmortem: we were not seeking properly to the next node - // when we hit a half-split child and were using get_lt - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 138), - Set(Key(vec![68]), 113), - Set(Key(vec![155]), 73), - Set(Key(vec![50]), 220), - Set(Key(vec![]), 247), - GetLt(Key(vec![100])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_20() { - // postmortem: we were not seeking forward during get_gt - // if path_for_key reached a leaf that didn't include - // a key for our - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 10), - Set(Key(vec![56]), 42), - Set(Key(vec![138]), 27), - Set(Key(vec![155]), 73), - Set(Key(vec![]), 251), - GetGt(Key(vec![94])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_21() { - // postmortem: more split woes while implementing get_lt - // postmortem 2: failed to properly account for node hi key - // being empty in the view predecessor function - // postmortem 3: when rewriting Iter, failed to account for - // direction of iteration - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![176]), 163), - Set(Key(vec![]), 229), - Set(Key(vec![169]), 121), - Set(Key(vec![]), 58), - GetLt(Key(vec![176])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_22() { - // postmortem: inclusivity wasn't being properly flipped off after - // the first result during iteration - // postmortem 2: failed to properly check bounds while iterating - prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 155), - Merge(Key(vec![56]), 251), - Scan(Key(vec![]), 2), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_23() { - // postmortem: when rewriting CRC handling code, mis-sized the blob crc - prop_tree_matches_btreemap( - vec![Set(Key(vec![6; 5120]), 92), Restart, Scan(Key(vec![]), 35)], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_24() { - // postmortem: get_gt diverged with the Iter impl - prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 193), - Del(Key(vec![])), - Del(Key(vec![])), - Set(Key(vec![]), 55), - Set(Key(vec![]), 212), - Merge(Key(vec![]), 236), - Del(Key(vec![])), - Set(Key(vec![]), 192), - Del(Key(vec![])), - Set(Key(vec![94]), 115), - Merge(Key(vec![62]), 34), - GetGt(Key(vec![])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_25() { - // postmortem: was not accounting for merges when traversing - // the frag chain and a Del was encountered - prop_tree_matches_btreemap( - vec![Del(Key(vec![])), Merge(Key(vec![]), 84), Get(Key(vec![]))], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_26() { - // postmortem: - prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 194), - Merge(Key(vec![62]), 114), - Merge(Key(vec![80]), 202), - Merge(Key(vec![]), 169), - Set(Key(vec![]), 197), - Del(Key(vec![])), - Del(Key(vec![])), - Set(Key(vec![]), 215), - Set(Key(vec![]), 164), - Merge(Key(vec![]), 150), - GetGt(Key(vec![])), - GetLt(Key(vec![80])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_27() { - // 
postmortem: was not accounting for the fact that deletions reduce the - // chances of being able to split successfully. - prop_tree_matches_btreemap( - vec![ - Del(Key(vec![])), - Merge( - Key(vec![ - 74, 117, 68, 37, 89, 16, 84, 130, 133, 78, 74, 59, 44, 109, - 34, 5, 36, 74, 131, 100, 79, 86, 87, 107, 87, 27, 1, 85, - 53, 112, 89, 75, 67, 78, 58, 121, 0, 105, 8, 117, 79, 40, - 94, 123, 83, 72, 78, 23, 23, 35, 50, 77, 59, 75, 54, 92, - 89, 12, 27, 48, 64, 21, 42, 97, 45, 28, 122, 13, 4, 32, 51, - 25, 26, 18, 65, 12, 54, 104, 106, 80, 75, 91, 111, 9, 5, - 130, 43, 40, 3, 72, 0, 58, 92, 64, 112, 97, 75, 130, 11, - 135, 19, 107, 40, 17, 25, 49, 48, 119, 82, 54, 35, 113, 91, - 68, 12, 118, 123, 62, 108, 88, 67, 43, 33, 119, 132, 124, - 1, 62, 133, 110, 25, 62, 129, 117, 117, 107, 123, 94, 127, - 80, 0, 116, 101, 9, 9, 54, 134, 70, 66, 79, 50, 124, 115, - 85, 42, 120, 24, 15, 81, 100, 72, 71, 40, 58, 22, 6, 34, - 54, 69, 110, 18, 74, 111, 80, 52, 90, 44, 4, 29, 84, 95, - 21, 25, 10, 10, 60, 18, 78, 23, 21, 114, 92, 96, 17, 127, - 53, 86, 2, 60, 104, 8, 132, 44, 115, 6, 25, 80, 46, 12, 20, - 44, 67, 136, 127, 50, 55, 70, 41, 90, 16, 10, 44, 32, 24, - 106, 13, 104, - ]), - 219, - ), - Merge(Key(vec![]), 71), - Del(Key(vec![])), - Set(Key(vec![0]), 146), - Merge(Key(vec![13]), 155), - Merge(Key(vec![]), 14), - Del(Key(vec![])), - Set(Key(vec![]), 150), - Set( - Key(vec![ - 13, 8, 3, 6, 9, 14, 3, 13, 7, 12, 13, 7, 13, 13, 1, 13, 5, - 4, 3, 2, 6, 16, 17, 10, 0, 16, 12, 0, 16, 1, 0, 15, 15, 4, - 1, 6, 9, 9, 11, 16, 7, 6, 10, 1, 11, 10, 4, 9, 9, 14, 4, - 12, 16, 10, 15, 2, 1, 8, 4, - ]), - 247, - ), - Del(Key(vec![154])), - Del(Key(vec![])), - Del(Key(vec![ - 0, 24, 24, 31, 40, 23, 10, 30, 16, 41, 30, 23, 14, 25, 21, 19, - 18, 7, 17, 41, 11, 5, 14, 42, 11, 22, 4, 8, 4, 38, 33, 31, 3, - 30, 40, 22, 40, 39, 5, 40, 1, 41, 11, 26, 25, 33, 12, 38, 4, - 35, 30, 42, 19, 26, 23, 22, 39, 18, 29, 4, 1, 24, 14, 38, 0, - 36, 27, 11, 27, 34, 16, 15, 38, 0, 20, 37, 22, 31, 12, 26, 16, - 4, 22, 25, 4, 34, 4, 33, 37, 28, 18, 4, 41, 15, 8, 16, 27, 3, - 20, 26, 40, 31, 15, 15, 17, 15, 5, 13, 22, 37, 7, 13, 35, 14, - 6, 28, 21, 26, 13, 35, 1, 10, 8, 34, 23, 27, 29, 8, 14, 42, 36, - 31, 34, 12, 31, 24, 5, 8, 11, 36, 29, 24, 38, 8, 12, 18, 22, - 36, 21, 28, 11, 24, 0, 41, 37, 39, 42, 25, 13, 41, 27, 8, 24, - 22, 30, 17, 2, 4, 20, 33, 5, 24, 33, 6, 29, 5, 0, 17, 9, 20, - 26, 15, 23, 22, 16, 23, 16, 1, 20, 0, 28, 16, 34, 30, 19, 5, - 36, 40, 28, 6, 39, - ])), - Merge(Key(vec![]), 50), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_28() { - // postmortem: - prop_tree_matches_btreemap( - vec![ - Del(Key(vec![])), - Set(Key(vec![]), 65), - Del(Key(vec![])), - Del(Key(vec![])), - Merge(Key(vec![]), 50), - Merge(Key(vec![]), 2), - Del(Key(vec![197])), - Merge(Key(vec![5]), 146), - Set(Key(vec![222]), 224), - Merge(Key(vec![149]), 60), - Scan(Key(vec![178]), 18), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_29() { - // postmortem: tree merge and split thresholds caused an infinite - // loop while performing updates - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 142), - Merge( - Key(vec![ - 45, 47, 6, 67, 16, 12, 62, 35, 69, 80, 49, 61, 29, 82, 9, - 47, 25, 78, 47, 64, 29, 74, 45, 0, 37, 44, 21, 82, 55, 44, - 31, 60, 86, 18, 45, 67, 55, 21, 35, 46, 25, 51, 5, 32, 33, - 36, 1, 81, 28, 28, 79, 76, 80, 89, 80, 62, 8, 85, 50, 15, - 4, 11, 76, 72, 73, 47, 30, 50, 85, 67, 84, 13, 82, 84, 78, - 70, 42, 83, 8, 7, 50, 77, 85, 37, 47, 82, 
86, 46, 30, 27, - 5, 39, 70, 26, 59, 16, 6, 34, 56, 40, 40, 67, 16, 61, 63, - 56, 64, 31, 15, 81, 84, 19, 61, 66, 3, 7, 40, 56, 13, 40, - 64, 50, 88, 47, 88, 50, 63, 65, 79, 62, 1, 44, 59, 27, 12, - 60, 3, 36, 89, 45, 18, 4, 68, 48, 61, 30, 48, 26, 84, 49, - 3, 74, 51, 53, 30, 57, 50, 35, 74, 59, 30, 73, 19, 30, 82, - 78, 3, 5, 62, 17, 48, 29, 67, 52, 45, 61, 74, 52, 29, 61, - 63, 11, 89, 76, 34, 8, 50, 75, 42, 12, 5, 55, 0, 59, 44, - 68, 26, 76, 37, 50, 53, 73, 53, 76, 57, 40, 30, 52, 0, 41, - 21, 8, 79, 79, 38, 37, 50, 56, 43, 9, 85, 21, 60, 64, 13, - 54, 60, 83, 1, 2, 37, 75, 42, 0, 83, 81, 80, 87, 12, 15, - 75, 55, 41, 59, 9, 80, 66, 27, 65, 26, 48, 29, 37, 38, 9, - 76, 31, 39, 35, 22, 73, 59, 28, 33, 35, 63, 78, 17, 22, 82, - 12, 60, 49, 26, 54, 19, 60, 29, 39, 37, 10, 50, 12, 19, 29, - 1, 74, 12, 5, 38, 49, 41, 19, 88, 3, 27, 77, 81, 72, 42, - 71, 86, 82, 11, 79, 40, 35, 26, 35, 64, 4, 33, 87, 31, 84, - 81, 74, 31, 49, 0, 29, 73, 14, 55, 78, 21, 23, 20, 83, 48, - 89, 88, 62, 64, 73, 7, 20, 70, 81, 64, 3, 79, 38, 75, 13, - 40, 29, 82, 40, 14, 66, 56, 54, 52, 37, 14, 67, 8, 37, 1, - 5, 73, 14, 35, 63, 48, 46, 22, 84, 71, 2, 60, 63, 88, 14, - 15, 69, 88, 2, 43, 57, 43, 52, 18, 78, 75, 75, 74, 13, 35, - 50, 35, 17, 13, 64, 82, 55, 32, 14, 57, 35, 77, 65, 22, 40, - 27, 39, 80, 23, 20, 41, 50, 48, 22, 84, 37, 59, 45, 64, 10, - 3, 69, 56, 24, 4, 25, 76, 65, 47, 52, 64, 88, 3, 23, 37, - 16, 56, 69, 71, 27, 87, 65, 74, 23, 82, 41, 60, 78, 75, 22, - 51, 15, 57, 80, 46, 73, 7, 1, 36, 64, 0, 56, 83, 74, 62, - 73, 81, 68, 71, 63, 31, 5, 23, 11, 15, 39, 2, 10, 23, 18, - 74, 3, 43, 25, 68, 54, 11, 21, 14, 58, 10, 73, 0, 66, 28, - 73, 25, 40, 55, 56, 33, 81, 67, 43, 35, 65, 38, 21, 48, 81, - 4, 77, 68, 51, 38, 36, 49, 43, 33, 51, 28, 43, 60, 71, 78, - 48, 49, 76, 21, 0, 72, 0, 32, 78, 12, 87, 5, 80, 62, 40, - 85, 26, 70, 58, 56, 78, 7, 53, 30, 16, 22, 12, 23, 37, 83, - 45, 33, 41, 83, 78, 87, 44, 0, 65, 51, 3, 8, 72, 38, 14, - 24, 64, 77, 45, 5, 1, 7, 27, 82, 7, 6, 70, 25, 67, 22, 8, - 30, 76, 41, 11, 14, 1, 65, 85, 60, 80, 0, 30, 31, 79, 43, - 89, 33, 84, 22, 7, 67, 45, 39, 74, 75, 12, 61, 19, 71, 66, - 83, 57, 38, 45, 21, 18, 37, 54, 36, 14, 54, 63, 81, 12, 7, - 10, 39, 16, 40, 10, 7, 81, 45, 12, 22, 20, 29, 85, 40, 41, - 72, 79, 58, 50, 41, 59, 64, 41, 32, 56, 35, 8, 60, 17, 14, - 89, 17, 7, 48, 6, 35, 9, 34, 54, 6, 44, 87, 76, 50, 1, 67, - 70, 15, 8, 4, 45, 67, 86, 32, 69, 3, 88, 85, 72, 66, 21, - 89, 11, 77, 1, 50, 75, 56, 41, 74, 6, 4, 51, 65, 39, 50, - 45, 56, 3, 19, 80, 86, 55, 48, 81, 17, 3, 89, 7, 9, 63, 58, - 80, 39, 34, 85, 55, 71, 41, 55, 8, 63, 38, 51, 47, 49, 83, - 2, 73, 22, 39, 18, 45, 77, 56, 80, 54, 13, 23, 81, 54, 15, - 48, 57, 83, 71, 41, 32, 64, 1, 9, 46, 27, 16, 21, 7, 28, - 55, 17, 71, 68, 17, 74, 46, 38, 84, 3, 12, 71, 63, 16, 23, - 48, 12, 29, 28, 5, 21, 61, 14, 77, 66, 62, 57, 18, 30, 63, - 14, 41, 37, 30, 73, 16, 12, 74, 8, 82, 67, 53, 10, 5, 37, - 36, 39, 52, 37, 72, 76, 21, 35, 40, 42, 55, 47, 50, 41, 19, - 40, 86, 26, 54, 23, 74, 46, 66, 59, 80, 26, 81, 61, 80, 88, - 55, 40, 30, 45, 7, 46, 21, 3, 20, 46, 63, 18, 9, 34, 67, 9, - 19, 52, 53, 29, 69, 78, 65, 39, 71, 40, 38, 57, 80, 27, 34, - 30, 27, 55, 8, 65, 31, 37, 33, 25, 39, 46, 9, 83, 6, 27, - 28, 61, 9, 21, 58, 21, 10, 69, 24, 5, 31, 32, 44, 26, 84, - 73, 73, 9, 64, 26, 21, 85, 12, 39, 81, 38, 49, 24, 35, 3, - 88, 15, 15, 76, 64, 70, 9, 30, 51, 26, 16, 70, 60, 15, 7, - 54, 36, 32, 9, 10, 18, 66, 19, 25, 77, 46, 51, 51, 14, 41, - 56, 65, 41, 87, 26, 10, 2, 73, 2, 71, 26, 56, 10, 68, 15, - 53, 10, 43, 15, 22, 45, 2, 
15, 16, 69, 80, 83, 18, 22, 70, - 77, 52, 48, 24, 17, 40, 56, 22, 17, 3, 36, 46, 37, 41, 22, - 0, 41, 45, 14, 15, 73, 18, 42, 34, 5, 87, 6, 2, 7, 58, 3, - 86, 87, 7, 79, 88, 33, 30, 48, 3, 66, 27, 34, 58, 48, 71, - 40, 1, 46, 84, 32, 63, 79, 0, 21, 71, 1, 59, 39, 77, 51, - 14, 20, 58, 83, 19, 0, 2, 2, 57, 73, 79, 42, 59, 33, 50, - 15, 11, 48, 25, 14, 39, 36, 88, 71, 28, 45, 15, 59, 39, 60, - 78, 18, 18, 45, 50, 29, 66, 86, 5, 76, 85, 55, 17, 28, 8, - 39, 75, 33, 9, 73, 71, 59, 56, 57, 86, 6, 75, 26, 43, 68, - 34, 82, 88, 76, 17, 86, 63, 2, 38, 63, 13, 44, 8, 25, 0, - 63, 54, 73, 52, 3, 72, - ]), - 9, - ), - Set(Key(vec![]), 35), - Set( - Key(vec![ - 165, 64, 99, 55, 152, 102, 148, 35, 59, 10, 198, 191, 71, - 129, 170, 155, 7, 106, 171, 93, 126, - ]), - 212, - ), - Del(Key(vec![])), - Merge(Key(vec![]), 177), - Merge( - Key(vec![ - 20, 55, 154, 104, 10, 68, 64, 3, 31, 78, 232, 227, 169, - 161, 13, 50, 16, 239, 87, 0, 9, 85, 248, 32, 156, 106, 11, - 18, 57, 13, 177, 36, 69, 176, 101, 92, 119, 38, 218, 26, 4, - 154, 185, 135, 75, 167, 101, 107, 206, 76, 153, 213, 70, - 52, 205, 95, 55, 116, 242, 68, 77, 90, 249, 142, 93, 135, - 118, 127, 116, 121, 235, 183, 215, 2, 118, 193, 146, 185, - 4, 129, 167, 164, 178, 105, 149, 47, 73, 121, 95, 23, 216, - 153, 23, 108, 141, 190, 250, 121, 98, 229, 33, 106, 89, - 117, 122, 145, 47, 242, 81, 88, 141, 38, 177, 170, 167, 56, - 24, 196, 61, 97, 83, 91, 202, 181, 75, 112, 3, 169, 61, 17, - 100, 81, 111, 178, 122, 176, 95, 185, 169, 146, 239, 40, - 168, 32, 170, 34, 172, 89, 59, 188, 170, 186, 61, 7, 177, - 230, 130, 155, 208, 171, 82, 153, 20, 72, 74, 111, 147, - 178, 164, 157, 71, 114, 216, 40, 85, 91, 20, 145, 149, 95, - 36, 114, 24, 129, 144, 229, 14, 133, 77, 92, 139, 167, 48, - 18, 178, 4, 15, 171, 171, 88, 74, 104, 157, 2, 121, 13, - 141, 6, 107, 118, 228, 147, 152, 28, 206, 128, 102, 150, 1, - 129, 84, 171, 119, 110, 198, 72, 100, 166, 153, 98, 66, - 128, 79, 41, 126, - ]), - 103, - ), - Del(Key(vec![])), - Merge( - Key(vec![ - 117, 48, 90, 153, 149, 191, 229, 73, 3, 6, 73, 52, 73, 186, - 42, 53, 94, 17, 61, 11, 153, 118, 219, 188, 184, 89, 13, - 124, 138, 40, 238, 9, 46, 45, 38, 115, 153, 106, 166, 56, - 134, 206, 140, 57, 95, 244, 27, 135, 43, 13, 143, 137, 56, - 122, 243, 205, 52, 116, 130, 35, 80, 167, 58, 93, - ]), - 8, - ), - Set(Key(vec![145]), 43), - GetLt(Key(vec![229])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_30() { - // postmortem: - prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 241), - Set(Key(vec![20]), 146), - Merge( - Key(vec![ - 60, 38, 29, 57, 35, 71, 15, 46, 7, 27, 76, 84, 27, 25, 90, - 30, 37, 63, 11, 24, 27, 28, 94, 93, 82, 68, 69, 61, 46, 86, - 11, 86, 63, 34, 90, 71, 92, 87, 38, 48, 40, 78, 9, 37, 26, - 36, 60, 4, 2, 38, 32, 73, 86, 43, 52, 79, 11, 43, 59, 21, - 60, 40, 80, 94, 69, 44, 4, 73, 59, 16, 16, 22, 88, 41, 13, - 21, 91, 33, 49, 91, 20, 79, 23, 61, 53, 63, 58, 62, 49, 10, - 71, 72, 27, 55, 53, 39, 91, 82, 86, 38, 41, 1, 54, 3, 77, - 15, 93, 31, 49, 29, 82, 7, 17, 58, 42, 12, 49, 67, 62, 46, - 20, 27, 61, 32, 58, 9, 17, 19, 28, 44, 41, 34, 94, 11, 50, - 73, 1, 50, 48, 8, 88, 33, 40, 51, 15, 35, 2, 36, 37, 30, - 37, 83, 71, 91, 32, 0, 69, 28, 64, 30, 72, 63, 39, 7, 89, - 0, 21, 51, 92, 80, 13, 57, 7, 53, 94, 26, 2, 63, 18, 23, - 89, 34, 83, 55, 32, 75, 81, 27, 11, 5, 63, 0, 75, 12, 39, - 9, 13, 20, 25, 57, 94, 75, 59, 46, 84, 80, 61, 24, 31, 7, - 68, 93, 12, 94, 6, 94, 27, 33, 81, 19, 3, 78, 3, 14, 22, - 36, 49, 61, 51, 79, 43, 35, 58, 54, 65, 72, 36, 87, 3, 
3, - 25, 75, 82, 58, 75, 76, 29, 89, 1, 16, 64, 63, 85, 0, 47, - ]), - 11, - ), - Merge(Key(vec![25]), 245), - Merge(Key(vec![119]), 152), - Scan(Key(vec![]), 31), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_31() { - // postmortem: - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![1]), 212), - Set(Key(vec![12]), 174), - Set(Key(vec![]), 182), - Set( - Key(vec![ - 12, 55, 46, 38, 40, 34, 44, 32, 19, 15, 28, 49, 35, 40, 55, - 35, 61, 9, 62, 18, 3, 58, - ]), - 86, - ), - Scan(Key(vec![]), -18), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_32() { - // postmortem: the MAX_IVEC that predecessor used in reverse - // iteration was setting the first byte to 0 even though we - // no longer perform per-key prefix encoding. - prop_tree_matches_btreemap( - vec![Set(Key(vec![57]), 141), Scan(Key(vec![]), -40)], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_33() { - // postmortem: the split point was being incorrectly - // calculated when using the simplified prefix technique. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 91), - Set(Key(vec![1]), 216), - Set(Key(vec![85, 25]), 78), - Set(Key(vec![85]), 43), - GetLt(Key(vec![])), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_34() { - // postmortem: a safety check was too aggressive when - // finding predecessors using the new simplified prefix - // encoding technique. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![9, 212]), 100), - Set(Key(vec![9]), 63), - Set(Key(vec![5]), 100), - Merge(Key(vec![]), 16), - Set(Key(vec![9, 70]), 188), - Scan(Key(vec![]), -40), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_35() { - // postmortem: prefix lengths were being incorrectly - // handled on splits. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![207]), 29), - Set(Key(vec![192]), 218), - Set(Key(vec![121]), 167), - Set(Key(vec![189]), 40), - Set(Key(vec![85]), 197), - Set(Key(vec![185]), 58), - Set(Key(vec![84]), 97), - Set(Key(vec![23]), 34), - Set(Key(vec![47]), 162), - Set(Key(vec![39]), 92), - Set(Key(vec![46]), 173), - Set(Key(vec![33]), 202), - Set(Key(vec![8]), 113), - Set(Key(vec![17]), 228), - Set(Key(vec![8, 49]), 217), - Set(Key(vec![6]), 192), - Set(Key(vec![5]), 47), - Set(Key(vec![]), 5), - Set(Key(vec![0]), 103), - Set(Key(vec![1]), 230), - Set(Key(vec![0, 229]), 117), - Set(Key(vec![]), 112), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_36() { - // postmortem: suffix truncation caused - // regions to be permanently inaccessible - // when applied to split points on index - // nodes. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![152]), 65), - Set(Key(vec![]), 227), - Set(Key(vec![101]), 23), - Merge(Key(vec![254]), 97), - Set(Key(vec![254, 5]), 207), - Scan(Key(vec![]), -30), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_37() { - // postmortem: suffix truncation was so - // aggressive that it would cut into - // the prefix in the lo key sometimes. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 82), - Set(Key(vec![2, 0]), 40), - Set(Key(vec![2, 0, 0]), 49), - Set(Key(vec![1]), 187), - Scan(Key(vec![]), 33), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_38() { - // postmortem: Free pages were not being initialized in the - // pagecache properly. 
- for _ in 0..10 { - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![193]), 73), - Merge(Key(vec![117]), 216), - Set(Key(vec![221]), 176), - GetLt(Key(vec![123])), - Restart, - ], - false, - false, - 0, - 0, + prop_tree_matches_btreemap as fn(Vec, bool, i32, usize) -> bool, ); - } -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_39() { - // postmortem: - for _ in 0..100 { - prop_tree_matches_btreemap( - vec![ - Set( - Key(vec![ - 67, 48, 34, 254, 61, 189, 196, 127, 26, 185, 244, 63, - 60, 63, 246, 194, 243, 177, 218, 210, 153, 126, 124, - 47, 160, 242, 157, 2, 51, 34, 88, 41, 44, 65, 58, 211, - 245, 74, 192, 101, 222, 68, 196, 250, 127, 231, 102, - 177, 246, 105, 190, 144, 113, 148, 71, 72, 149, 246, - 38, 95, 106, 42, 83, 65, 84, 73, 148, 34, 95, 88, 57, - 232, 219, 227, 74, 14, 5, 124, 106, 57, 244, 50, 81, - 93, 145, 111, 40, 190, 127, 227, 17, 242, 165, 194, - 171, 60, 6, 255, 176, 143, 131, 164, 217, 18, 123, 19, - 246, 183, 29, 0, 6, 39, 175, 57, 134, 166, 231, 47, - 254, 158, 163, 178, 78, 240, 108, 157, 72, 135, 34, - 236, 103, 192, 109, 31, 2, 72, 128, 242, 4, 113, 109, - 224, 120, 61, 169, 226, 131, 210, 33, 181, 91, 91, 197, - 223, 127, 26, 94, 158, 55, 57, 3, 184, 15, 30, 2, 222, - 39, 29, 12, 42, 14, 166, 176, 28, 13, 246, 11, 186, 8, - 247, 113, 253, 102, 227, 68, 111, 227, 238, 54, 150, - 11, 57, 155, 4, 75, 179, 17, 172, 42, 22, 199, 44, 242, - 211, 0, 39, 243, 221, 114, 86, 145, 22, 226, 108, 32, - 248, 42, 49, 191, 112, 1, 69, 101, 112, 251, 243, 252, - 83, 140, 132, 165, - ]), - 250, - ), - Del(Key(vec![ - 11, 77, 168, 37, 181, 169, 239, 146, 240, 211, 7, 115, 197, - 119, 46, 80, 240, 92, 221, 108, 208, 247, 221, 129, 108, - 13, 36, 21, 93, 11, 243, 103, 188, 39, 126, 77, 29, 32, - 206, 175, 199, 245, 71, 96, 221, 7, 68, 64, 45, 78, 68, - 193, 73, 13, 60, 13, 28, 167, 147, 7, 90, 11, 206, 44, 84, - 243, 3, 77, 122, 87, 7, 125, 184, 6, 178, 59, - ])), - Merge(Key(vec![176]), 123), - Restart, - Merge( - Key(vec![ - 93, 43, 181, 76, 63, 247, 227, 15, 17, 239, 9, 252, - 181, 53, 65, 74, 22, 18, 71, 64, 115, 58, 110, 30, 13, - 177, 31, 47, 124, 14, 0, 157, 200, 194, 92, 215, 21, - 36, 239, 204, 18, 88, 216, 149, 18, 208, 187, 188, 32, - 76, 35, 12, 142, 157, 38, 186, 245, 63, 2, 230, 13, 79, - 160, 86, 32, 170, 239, 151, 25, 180, 170, 201, 22, 211, - 238, 208, 24, 139, 5, 44, 38, 48, 243, 38, 249, 36, 43, - 200, 52, 244, 166, 0, 29, 114, 10, 18, 253, 253, 130, - 223, 37, 8, 109, 228, 0, 122, 192, 16, 68, 231, 37, - 230, 249, 180, 214, 101, 17, - ]), - 176, - ), - Set( - Key(vec![ - 153, 217, 142, 179, 255, 74, 1, 20, 254, 1, 38, 28, 66, - 244, 81, 101, 210, 58, 18, 107, 12, 116, 74, 188, 95, - 56, 248, 9, 204, 128, 24, 239, 143, 83, 83, 213, 17, - 32, 135, 73, 217, 8, 241, 44, 57, 131, 107, 139, 122, - 32, 194, 225, 136, 148, 227, 196, 196, 121, 97, 81, 74, - ]), - 42, - ), - Set(Key(vec![]), 160), - GetLt(Key(vec![ - 244, 145, 243, 120, 149, 64, 125, 161, 98, 205, 205, 107, - 191, 119, 83, 42, 92, 119, 25, 198, 47, 123, 26, 224, 190, - 98, 144, 238, 74, 36, 76, 186, 226, 153, 69, 217, 109, 214, - 201, 104, 148, 107, 132, 219, 37, 109, 98, 172, 70, 160, - 177, 115, 194, 80, 76, 60, 148, 176, 191, 84, 109, 35, 51, - 107, 157, 11, 233, 126, 71, 183, 215, 116, 72, 235, 218, - 171, 233, 181, 53, 253, 104, 231, 138, 166, 40, - ])), - Set( - Key(vec![ - 37, 160, 29, 162, 43, 212, 2, 100, 236, 24, 2, 82, 58, - 38, 81, 137, 89, 55, 164, 83, - ]), - 64, - ), - Get(Key(vec![ - 15, 53, 101, 33, 156, 199, 212, 82, 2, 64, 136, 70, 235, - 72, 170, 188, 180, 200, 109, 231, 6, 13, 
30, 70, 4, 132, - 133, 101, 82, 187, 78, 241, 157, 49, 156, 3, 17, 167, 216, - 209, 7, 174, 112, 186, 170, 189, 85, 99, 119, 52, 39, 38, - 151, 108, 203, 42, 63, 255, 216, 234, 34, 2, 80, 168, 122, - 70, 20, 11, 220, 106, 49, 110, 165, 170, 149, 163, - ])), - GetLt(Key(vec![])), - Merge(Key(vec![136]), 135), - Cas(Key(vec![177]), 159, 209), - Cas(Key(vec![101]), 143, 240), - Set(Key(vec![226, 62, 34, 63, 172, 96, 162]), 43), - Merge( - Key(vec![ - 48, 182, 144, 255, 137, 100, 2, 139, 69, 111, 159, 133, - 234, 147, 118, 231, 155, 74, 73, 98, 58, 36, 35, 21, - 50, 42, 71, 25, 200, 5, 4, 198, 158, 41, 88, 75, 153, - 254, 248, 213, 0, 89, 43, 160, 58, 206, 88, 107, 57, - 208, 119, 34, 80, 166, 112, 13, 241, 46, 172, 115, 179, - 42, 59, 200, 225, 125, 65, 18, 173, 77, 27, 129, 228, - 68, 53, 175, 61, 230, 27, 136, 131, 171, 64, 79, 125, - 149, 52, 80, - ]), - 105, - ), - Merge( - Key(vec![ - 126, 109, 165, 43, 2, 82, 97, 81, 59, 78, 243, 142, 37, - 105, 109, 178, 25, 73, 50, 103, 107, 129, 213, 193, - 158, 16, 63, 108, 160, 204, 78, 83, 2, 43, 66, 2, 18, - 11, 147, 47, 106, 106, 141, 82, 65, 101, 99, 171, 178, - 68, 106, 7, 190, 159, 105, 132, 155, 240, 155, 95, 66, - 254, 239, 202, 168, 26, 207, 213, 116, 215, 141, 77, 7, - 245, 174, 144, 39, 28, - ]), - 122, - ), - Del(Key(vec![ - 13, 152, 171, 90, 130, 131, 232, 51, 173, 103, 255, 225, - 156, 192, 146, 141, 94, 84, 39, 171, 152, 114, 133, 20, - 125, 68, 57, 27, 33, 175, 37, 164, 40, - ])), - Scan(Key(vec![]), -34), - Set(Key(vec![]), 85), - Merge(Key(vec![112]), 104), - Restart, - Restart, - Del(Key(vec![237])), - Set( - Key(vec![ - 53, 79, 71, 234, 187, 78, 206, 117, 48, 84, 162, 101, - 132, 137, 43, 144, 234, 23, 116, 13, 28, 184, 174, 241, - 181, 201, 131, 156, 7, 103, 135, 17, 168, 249, 7, 120, - 74, 8, 192, 134, 109, 54, 175, 130, 145, 206, 185, 49, - 144, 133, 226, 244, 42, 126, 176, 232, 96, 56, 70, 56, - 159, 127, 35, 39, 185, 114, 182, 41, 50, 93, 61, - ]), - 144, - ), - Merge( - Key(vec![ - 10, 58, 6, 62, 17, 15, 26, 29, 79, 34, 77, 12, 93, 65, - 87, 71, 19, 57, 25, 40, 53, 73, 57, 2, 81, 49, 67, 62, - 78, 14, 34, 70, 86, 49, 86, 84, 16, 33, 24, 7, 87, 49, - 58, 50, 13, 14, 35, 46, 7, 39, 76, 51, 21, 76, 9, 53, - 45, 21, 71, 48, 16, 73, 68, 1, 63, 34, 12, 42, 11, 85, - 79, 19, 11, 77, 90, 0, 62, 56, 37, 33, 10, 69, 20, 64, - 15, 51, 64, 90, 69, 15, 7, 41, 53, 71, 52, 21, 45, 45, - 49, 3, 59, 15, 90, 7, 12, 62, 30, 81, - ]), - 131, - ), - Get(Key(vec![ - 79, 28, 48, 41, 5, 70, 54, 56, 36, 32, 59, 15, 26, 42, 61, - 23, 53, 6, 71, 44, 61, 65, 4, 17, 23, 15, 65, 64, 46, 66, - 27, 63, 51, 44, 35, 1, 8, 70, 7, 1, 13, 10, 40, 6, 36, 64, - 68, 52, 8, 0, 46, 53, 48, 32, 9, 52, 69, 41, 8, 57, 27, 31, - 79, 27, 12, 70, 72, 33, 6, 22, 47, 37, 11, 38, 32, 7, 31, - 37, 45, 23, 74, 22, 46, 1, 3, 74, 72, 56, 52, 65, 78, 28, - 5, 68, 30, 36, 5, 43, 7, 2, 48, 75, 16, 53, 31, 40, 9, 3, - 49, 71, 70, 20, 24, 6, 23, 76, 49, 21, 12, 60, 54, 43, 7, - 79, 74, 62, 53, 20, 46, 11, 74, 29, 31, 43, 20, 27, 22, 22, - 15, 59, 12, 21, 61, 11, 8, 28, 5, 78, 70, 22, 11, 36, 62, - 56, 44, 49, 25, 39, 37, 24, 72, 65, 67, 22, 48, 16, 50, 5, - 10, 13, 36, 65, 29, 3, 26, 74, 15, 73, 78, 36, 14, 36, 30, - 42, 19, 73, 65, 75, 2, 25, 1, 32, 38, 43, 58, 19, 37, 37, - 48, 23, 72, 77, 34, 24, 1, 4, 42, 11, 68, 54, 23, 34, 0, - 48, 20, 20, 23, 61, 65, 72, 64, 24, 63, 3, 21, 48, 63, 57, - 40, 36, 46, 48, 8, 20, 62, 7, 69, 35, 79, 38, 45, 74, 7, - 16, 48, 59, 56, 31, 13, 13, - ])), - Del(Key(vec![176, 58, 119])), - Get(Key(vec![241])), - Get(Key(vec![160])), - Cas(Key(vec![]), 
166, 235), - Set( - Key(vec![ - 64, 83, 151, 149, 100, 93, 5, 18, 91, 58, 84, 156, 127, - 108, 99, 168, 54, 51, 169, 185, 174, 101, 178, 148, 28, - 91, 25, 138, 14, 133, 170, 97, 138, 180, 157, 131, 174, - 22, 91, 108, 59, 165, 52, 28, 17, 175, 44, 95, 112, 38, - 141, 46, 124, 49, 116, 55, 39, 109, 73, 181, 104, 86, - 81, 150, 95, 149, 69, 110, 110, 102, 22, 62, 180, 60, - 87, 127, 127, 136, 12, 139, 109, 165, 34, 181, 158, - 156, 102, 38, 6, 149, 183, 69, 129, 98, 161, 175, 82, - 51, 47, 93, 136, 16, 118, 65, 152, 139, 8, 30, 10, 100, - 47, 13, 47, 179, 87, 19, 109, 78, 116, 20, 111, 89, 28, - 0, 86, 39, 139, 7, 111, 40, 145, 155, 107, 45, 36, 90, - 143, 154, 135, 36, 13, 98, 61, 150, 65, 128, 16, 52, - 100, 128, 11, 5, 49, 143, 56, 78, 48, 62, 86, 50, 86, - 41, 153, 53, 139, 89, 164, 33, 136, 83, 182, 53, 132, - 144, 177, 105, 104, 55, 9, 174, 30, 65, 76, 33, 163, - 172, 80, 169, 175, 54, 165, 173, 109, 24, 70, 25, 158, - 135, 76, 130, 76, 9, 56, 20, 13, 133, 33, 168, 160, - 153, 43, 80, 58, 56, 171, 28, 97, 122, 162, 32, 164, - 11, 112, 177, 63, 47, 25, 0, 66, 87, 169, 118, 173, 27, - 154, 79, 72, 107, 140, 126, 150, 60, 174, 184, 111, - 155, 22, 32, 185, 149, 95, 60, 146, 165, 103, 34, 131, - 91, 92, 85, 6, 102, 172, 131, 178, 141, 76, 84, 121, - 49, 19, 66, 127, 45, 23, 159, 33, 138, 47, 36, 106, 39, - 83, 164, 83, 16, 126, 126, 118, 84, 171, - ]), - 143, - ), - Scan(Key(vec![165]), -26), - Get(Key(vec![])), - Del(Key(vec![])), - Set( - Key(vec![ - 197, 224, 20, 219, 111, 246, 70, 138, 190, 237, 9, 202, - 187, 160, 47, 10, 231, 14, 2, 131, 30, 202, 95, 48, 44, - 21, 192, 155, 172, 51, 101, 155, 73, 5, 22, 140, 137, - 11, 37, 79, 79, 92, 25, 107, 82, 145, 39, 45, 155, 136, - 242, 8, 43, 71, 28, 70, 94, 79, 151, 20, 144, 53, 100, - 196, 74, 140, 27, 224, 59, 1, 143, 136, 132, 85, 114, - 166, 103, 242, 156, 183, 168, 148, 2, 33, 29, 201, 7, - 96, 13, 33, 102, 172, 21, 96, 27, 1, 86, 149, 150, 119, - 208, 118, 148, 51, 143, 54, 245, 89, 216, 145, 145, 72, - 105, 51, 19, 14, 15, 18, 34, 16, 101, 172, 133, 32, - 173, 106, 157, 15, 48, 194, 27, 55, 204, 110, 145, 99, - 9, 37, 195, 206, 13, 246, 161, 100, 222, 235, 184, 12, - 64, 103, 50, 158, 242, 163, 198, 61, 224, 130, 226, - 187, 158, 175, 135, 54, 110, 33, 9, 59, 127, 135, 47, - 204, 109, 105, 0, 161, 48, 247, 140, 101, 141, 81, 157, - 80, 135, 228, 102, 44, 74, 53, 121, 116, 17, 56, 26, - 112, - ]), - 22, - ), - Set(Key(vec![110]), 222), - Set(Key(vec![94]), 5), - GetGt(Key(vec![ - 181, 161, 96, 186, 128, 24, 232, 74, 149, 3, 129, 98, 220, - 25, 111, 111, 163, 244, 229, 137, 159, 137, 13, 12, 97, - 150, 6, 88, 76, 77, 31, 36, 57, 54, 82, 85, 119, 250, 187, - 163, 132, 73, 194, 129, 149, 176, 62, 118, 166, 50, 200, - 28, 158, 184, 28, 139, 74, 87, 144, 87, 1, 73, 37, 46, 226, - 91, 102, 13, 67, 195, 64, 189, 90, 190, 163, 216, 171, 22, - 69, 234, 57, 134, 96, 198, 179, 115, 43, 160, 104, 252, - 105, 192, 91, 211, 176, 171, 252, 236, 202, 158, 250, 186, - 134, 154, 82, 17, 113, 175, 13, 125, 185, 101, 38, 236, - 155, 30, 110, 11, 33, 198, 114, 184, 84, 91, 67, 125, 55, - 188, 124, 242, 89, 124, 69, 18, 26, 137, 34, 33, 201, 58, - 252, 134, 33, 131, 126, 136, 168, 20, 32, 237, 10, 57, 158, - 149, 102, 62, 10, 98, 106, 10, 93, 78, 240, 205, 38, 186, - 97, 104, 204, 14, 34, 100, 179, 161, 135, 136, 194, 99, - ])), - Merge(Key(vec![95]), 253), - GetLt(Key(vec![99])), - Merge(Key(vec![]), 124), - Get(Key(vec![61])), - Restart, - ], - false, - false, - 0, - 0, - ); - } -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_40() { - // 
postmortem: deletions of non-existant keys were - // being persisted despite being unneccessary. - prop_tree_matches_btreemap( - vec![Del(Key(vec![99; 111222333]))], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_41() { - // postmortem: indexing of values during - // iteration was incorrect. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![]), 131), - Set(Key(vec![17; 1]), 214), - Set(Key(vec![4; 1]), 202), - Set(Key(vec![24; 1]), 79), - Set(Key(vec![26; 1]), 235), - Scan(Key(vec![]), 19), - ], - false, - false, - 0, - 0, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_42() { - // postmortem: during refactoring, accidentally - // messed up the index selection for merge destinations. - for _ in 0..100 { - prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 112), - Set(Key(vec![110; 1]), 153), - Set(Key(vec![15; 1]), 100), - Del(Key(vec![110; 1])), - GetLt(Key(vec![148; 1])), - ], - false, - false, - 0, - 0, - ); - } -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_43() { - // postmortem: when changing the PageState to always - // include a base node, we did not account for this - // in the tag + size compressed value. This was not - // caught by the quickcheck tests because PageState's - // Arbitrary implementation would ensure that at least - // one frag was present, which was the invariant before - // the base was extracted away from the vec of frags. - prop_tree_matches_btreemap( - vec![ - Set(Key(vec![241; 1]), 199), - Set(Key(vec![]), 198), - Set(Key(vec![72; 108]), 175), - GetLt(Key(vec![])), - Restart, - Restart, - ], - false, - false, - 0, - 52, - ); -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_44() { - // postmortem: off-by-one bug related to LSN recovery - // where 1 was added to the index when the recovered - // LSN was actually divisible by the segment size - assert!(prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![]), 97), - Merge(Key(vec![]), 41), - Merge(Key(vec![]), 241), - Set(Key(vec![21; 1]), 24), - Del(Key(vec![])), - Set(Key(vec![]), 145), - Set(Key(vec![151; 1]), 187), - Get(Key(vec![])), - Restart, - Set(Key(vec![]), 151), - Restart, - ], - false, - false, - 0, - 0, - )) -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_45() { - // postmortem: recovery was not properly accounting for - // the possibility of a segment to be maxed out, similar - // to bug 44. - for _ in 0..10 { - assert!(prop_tree_matches_btreemap( - vec![ - Merge(Key(vec![206; 77]), 225), - Set(Key(vec![88; 190]), 40), - Set(Key(vec![162; 1]), 213), - Merge(Key(vec![186; 1]), 175), - Set(Key(vec![105; 16]), 111), - Cas(Key(vec![]), 75, 252), - Restart - ], - false, - true, - 0, - 210 - )) - } -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_46() { - // postmortem: while implementing the heap slab, decompression - // was failing to account for the fact that the slab allocator - // will always write to the end of the slab to be compatible - // with O_DIRECT. - for _ in 0..1 { - assert!(prop_tree_matches_btreemap(vec![Restart], false, true, 0, 0)) - } -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_47() { - // postmortem: - assert!(prop_tree_matches_btreemap( - vec![Set(Key(vec![88; 1]), 40), Restart, Get(Key(vec![88; 1]))], - false, - false, - 0, - 0 - )) -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_48() { - // postmortem: node value buffer calculations were failing to - // account for potential padding added to avoid buffer overreads - // while looking up offsets. 
- assert!(prop_tree_matches_btreemap( - vec![ - Set(Key(vec![23; 1]), 78), - Set(Key(vec![120; 1]), 223), - Set(Key(vec![123; 1]), 235), - Set(Key(vec![60; 1]), 234), - Set(Key(vec![]), 71), - Del(Key(vec![120; 1])), - Scan(Key(vec![]), -9) - ], - false, - false, - 0, - 0 - )) -} - -#[test] -#[cfg_attr(miri, ignore)] -fn tree_bug_49() { - // postmortem: was incorrectly calculating the child offset while searching - // for a node with omitted keys, where the distance == the stride, and - // as a result we went into an infinite loop trying to apply a parent - // split that was already present - assert!(prop_tree_matches_btreemap( - vec![ - Set(Key(vec![39; 1]), 245), - Set(Key(vec![108; 1]), 96), - Set(Key(vec![147; 1]), 44), - Set(Key(vec![102; 1]), 2), - Merge(Key(vec![22; 1]), 160), - Set(Key(vec![36; 1]), 1), - Set(Key(vec![65; 1]), 213), - Set(Key(vec![]), 221), - Set(Key(vec![84; 1]), 20), - Merge(Key(vec![229; 1]), 61), - Set(Key(vec![156; 1]), 69), - Merge(Key(vec![252; 1]), 85), - Set(Key(vec![36; 2]), 57), - Set(Key(vec![245; 1]), 143), - Set(Key(vec![59; 1]), 209), - GetGt(Key(vec![136; 1])), - Set(Key(vec![40; 1]), 96), - GetGt(Key(vec![59; 2])) - ], - false, - false, - 0, - 0 - )) } diff --git a/tests/test_tree_failpoints.rs b/tests/test_tree_failpoints.rs index 849c3e99d..575a3e3b1 100644 --- a/tests/test_tree_failpoints.rs +++ b/tests/test_tree_failpoints.rs @@ -485,9 +485,10 @@ fn run_tree_crashes_nicely(ops: Vec, flusher: bool) -> bool { if reference_entry.versions.len() > 1 && reference_entry.crash_epoch == crash_counter { - let last = std::mem::take(&mut reference_entry.versions) - .pop() - .unwrap(); + let last = + std::mem::take(&mut reference_entry.versions) + .pop() + .unwrap(); reference_entry.versions.push(last); } } @@ -1005,7 +1006,7 @@ fn failpoints_bug_11() { #[cfg_attr(miri, ignore)] fn failpoints_bug_12() { // postmortem 1: we were not sorting the recovery state, which - // led to divergent state across recoveries. TODO wut + // led to divergent state across recoveries. 
assert!(prop_tree_crashes_nicely( vec![ Set, diff --git a/tests/tree/mod.rs b/tests/tree/mod.rs index 026d7de55..ec56946c7 100644 --- a/tests/tree/mod.rs +++ b/tests/tree/mod.rs @@ -1,10 +1,11 @@ use std::{collections::BTreeMap, convert::TryInto, fmt, panic}; -use quickcheck::{Arbitrary, Gen, RngCore}; -use rand::{rngs::StdRng, Rng, SeedableRng}; +use quickcheck::{Arbitrary, Gen}; use rand_distr::{Distribution, Gamma}; -use sled::*; +use sled::{Config, Db as SledDb, InlineArray}; + +type Db = SledDb<3>; #[derive(Clone, Ord, PartialOrd, Eq, PartialEq)] pub struct Key(pub Vec); @@ -24,70 +25,11 @@ impl fmt::Debug for Key { } } -struct SledGen { - r: StdRng, - size: usize, -} - -impl Gen for SledGen { - fn size(&self) -> usize { - self.size - } -} - -impl RngCore for SledGen { - fn next_u32(&mut self) -> u32 { - self.r.gen::() - } - - fn next_u64(&mut self) -> u64 { - self.r.gen::() - } - - fn fill_bytes(&mut self, dest: &mut [u8]) { - self.r.fill_bytes(dest) - } - - fn try_fill_bytes( - &mut self, - dest: &mut [u8], - ) -> std::result::Result<(), rand::Error> { - self.r.try_fill_bytes(dest) - } -} - -pub fn fuzz_then_shrink(buf: &[u8]) { - let use_compression = !cfg!(feature = "no_zstd") - && !cfg!(miri) - && buf.first().unwrap_or(&0) % 2 == 0; - - let ops: Vec = buf - .chunks(2) - .map(|chunk| { - let mut seed = [0_u8; 32]; - seed[0..chunk.len()].copy_from_slice(chunk); - let rng: StdRng = SeedableRng::from_seed(seed); - let mut sled_rng = SledGen { r: rng, size: 2 }; - - Op::arbitrary(&mut sled_rng) - }) - .collect(); - - let cache_bits = *buf.get(1).unwrap_or(&0); - let segment_size_bits = *buf.get(2).unwrap_or(&0); - - match panic::catch_unwind(move || { - prop_tree_matches_btreemap( - ops, - false, - use_compression, - cache_bits, - segment_size_bits, - ) - }) { - Ok(_) => {} - Err(_e) => panic!("TODO"), - } +fn range(g: &mut Gen, min_inclusive: usize, max_exclusive: usize) -> usize { + assert!(max_exclusive > min_inclusive); + let range = max_exclusive - min_inclusive; + let generated = usize::arbitrary(g) % range; + min_inclusive + generated } impl Arbitrary for Key { @@ -95,24 +37,24 @@ impl Arbitrary for Key { #![allow(clippy::cast_precision_loss)] #![allow(clippy::cast_sign_loss)] - fn arbitrary(g: &mut G) -> Self { - if g.gen::() { + fn arbitrary(g: &mut Gen) -> Self { + if bool::arbitrary(g) { let gs = g.size(); let gamma = Gamma::new(0.3, gs as f64).unwrap(); let v = gamma.sample(&mut rand::thread_rng()); let len = if v > 3000.0 { 10000 } else { (v % 300.) as usize }; - let space = g.gen_range(0, gs) + 1; + let space = range(g, 0, gs) + 1; - let inner = (0..len).map(|_| g.gen_range(0, space) as u8).collect(); + let inner = (0..len).map(|_| range(g, 0, space) as u8).collect(); Self(inner) } else { - let len = g.gen_range(0, 2); + let len = range(g, 0, 2); let mut inner = vec![]; for _ in 0..len { - inner.push(g.gen::()); + inner.push(u8::arbitrary(g)); } Self(inner) @@ -134,7 +76,7 @@ impl Arbitrary for Key { #[derive(Debug, Clone)] pub enum Op { Set(Key, u8), - Merge(Key, u8), + // Merge(Key, u8), Get(Key), GetLt(Key), GetGt(Key), @@ -144,25 +86,25 @@ pub enum Op { Restart, } -use self::Op::{Cas, Del, Get, GetGt, GetLt, Merge, Restart, Scan, Set}; +use self::Op::*; impl Arbitrary for Op { - fn arbitrary(g: &mut G) -> Self { - if g.gen_bool(1. / 10.) 
{ + fn arbitrary(g: &mut Gen) -> Self { + if range(g, 0, 10) == 0 { return Restart; } - let choice = g.gen_range(0, 8); + let choice = range(g, 0, 7); match choice { - 0 => Set(Key::arbitrary(g), g.gen::()), - 1 => Merge(Key::arbitrary(g), g.gen::()), - 2 => Get(Key::arbitrary(g)), - 3 => GetLt(Key::arbitrary(g)), - 4 => GetGt(Key::arbitrary(g)), - 5 => Del(Key::arbitrary(g)), - 6 => Cas(Key::arbitrary(g), g.gen::(), g.gen::()), - 7 => Scan(Key::arbitrary(g), g.gen_range(-40, 40)), + 0 => Set(Key::arbitrary(g), u8::arbitrary(g)), + 1 => Get(Key::arbitrary(g)), + 2 => GetLt(Key::arbitrary(g)), + 3 => GetGt(Key::arbitrary(g)), + 4 => Del(Key::arbitrary(g)), + 5 => Cas(Key::arbitrary(g), u8::arbitrary(g), u8::arbitrary(g)), + 6 => Scan(Key::arbitrary(g), range(g, 0, 80) as isize - 40), + //7 => Merge(Key::arbitrary(g), u8::arbitrary(g)), _ => panic!("impossible choice"), } } @@ -170,10 +112,12 @@ impl Arbitrary for Op { fn shrink(&self) -> Box> { match *self { Set(ref k, v) => Box::new(k.shrink().map(move |sk| Set(sk, v))), + /* Merge(ref k, v) => Box::new( k.shrink() .flat_map(move |k| vec![Set(k.clone(), v), Merge(k, v)]), ), + */ Get(ref k) => Box::new(k.shrink().map(Get)), GetLt(ref k) => Box::new(k.shrink().map(GetLt)), GetGt(ref k) => Box::new(k.shrink().map(GetGt)), @@ -196,6 +140,7 @@ fn u16_to_bytes(u: u16) -> Vec { u.to_be_bytes().to_vec() } +/* // just adds up values as if they were u16's fn merge_operator( _k: &[u8], @@ -208,20 +153,19 @@ fn merge_operator( let ret = u16_to_bytes(new_n); Some(ret) } +*/ pub fn prop_tree_matches_btreemap( ops: Vec, flusher: bool, - use_compression: bool, - cache_bits: u8, - segment_size_bits: u8, + compression_level: i32, + cache_size: usize, ) -> bool { if let Err(e) = prop_tree_matches_btreemap_inner( ops, flusher, - use_compression, - cache_bits, - segment_size_bits, + compression_level, + cache_size, ) { eprintln!("hit error while running quickcheck on tree: {:?}", e); false @@ -233,26 +177,20 @@ pub fn prop_tree_matches_btreemap( fn prop_tree_matches_btreemap_inner( ops: Vec, flusher: bool, - use_compression: bool, - cache_bits: u8, - segment_size_bits: u8, -) -> Result<()> { + compression: i32, + cache_size: usize, +) -> std::io::Result<()> { use self::*; super::common::setup_logger(); - let use_compression = cfg!(feature = "compression") && use_compression; - - let config = Config::new() - .temporary(true) - .use_compression(use_compression) + let config = Config::tmp()? 
+ .zstd_compression_level(compression) .flush_every_ms(if flusher { Some(1) } else { None }) - .cache_capacity(256 * (1 << (cache_bits as usize % 16))) - .idgen_persist_interval(1) - .segment_size(256 * (1 << (segment_size_bits as usize % 16))); + .cache_capacity_bytes(cache_size); - let mut tree = config.open().unwrap(); - tree.set_merge_operator(merge_operator); + let mut tree: Db = config.open().unwrap(); + //tree.set_merge_operator(merge_operator); let mut reference: BTreeMap = BTreeMap::new(); @@ -270,11 +208,13 @@ fn prop_tree_matches_btreemap_inner( tree ); } + /* Merge(k, v) => { tree.merge(&k.0, vec![v]).unwrap(); let entry = reference.entry(k).or_insert(0_u16); *entry += u16::from(v); } + */ Get(k) => { let res1 = tree.get(&*k.0).unwrap().map(|v| bytes_to_u16(&*v)); let res2 = reference.get(&k).cloned(); @@ -286,7 +226,7 @@ fn prop_tree_matches_btreemap_inner( .iter() .rev() .find(|(key, _)| **key < k) - .map(|(k, _v)| IVec::from(&*k.0)); + .map(|(k, _v)| InlineArray::from(&*k.0)); assert_eq!( res1, res2, "get_lt({:?}) should have returned {:?} \ @@ -300,7 +240,7 @@ fn prop_tree_matches_btreemap_inner( let res2 = reference .iter() .find(|(key, _)| **key > k) - .map(|(k, _v)| IVec::from(&*k.0)); + .map(|(k, _v)| InlineArray::from(&*k.0)); assert_eq!( res1, res2, "get_gt({:?}) expected {:?} in tree {:?}", @@ -337,7 +277,9 @@ fn prop_tree_matches_btreemap_inner( .map(|(rk, rv)| (rk.0.clone(), *rv)); for r in ref_iter { - let tree_next = tree_iter.next().unwrap(); + let tree_next = tree_iter + .next() + .expect("iterator incorrectly stopped early"); let lhs = (tree_next.0, &*tree_next.1); let rhs = (r.0.clone(), &*u16_to_bytes(r.1)); assert_eq!( @@ -380,38 +322,16 @@ fn prop_tree_matches_btreemap_inner( Restart => { drop(tree); tree = config.open().unwrap(); - tree.set_merge_operator(merge_operator); + //tree.set_merge_operator(merge_operator); } } - if let Err(e) = config.global_error() { + if let Err(e) = tree.check_error() { eprintln!("quickcheck test encountered error: {:?}", e); return Err(e); } } - let space_amplification = tree - .space_amplification() - .expect("should be able to read files and pages"); - - assert!( - space_amplification < MAX_SPACE_AMPLIFICATION, - "space amplification was measured to be {}, \ - which is higher than the maximum of {}", - space_amplification, - MAX_SPACE_AMPLIFICATION - ); - - drop(tree); - config.global_error() -} + let _ = std::fs::remove_dir_all(config.path); -#[test] -fn fuzz_test() { - let seed = [0; 32]; - fuzz_then_shrink(&seed); - let seed = [0; 31]; - fuzz_then_shrink(&seed); - let seed = [0; 33]; - fuzz_then_shrink(&seed); - fuzz_then_shrink(&[]); + tree.check_error() } diff --git a/tsan_suppressions.txt b/tsan_suppressions.txt index b15c0dd5e..008766b0c 100644 --- a/tsan_suppressions.txt +++ b/tsan_suppressions.txt @@ -6,11 +6,6 @@ # Read more about how to use this file at: # https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions -# We ignore this because collect() calls functionality that relies -# on atomic::fence for correctness, which doesn't get picked up by TSAN -# as of Feb 1 2018 / rust 1.23. -race:sled::ebr::internal::Global::collect - # Arc::drop is not properly detected by TSAN due to the use # of a raw atomic Acquire fence after the strong-count # atomic subtraction with a Release fence in the Drop impl.
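
The migrated harness in `tests/tree/mod.rs` doubles as a compact tour of the sled 1.0 configuration surface. The sketch below condenses that setup path into one place, assuming only the calls that appear in this diff (`Config::tmp`, `zstd_compression_level`, `flush_every_ms`, `cache_capacity_bytes`, the `Db<FANOUT>` alias, and `check_error`); the helper name `open_test_tree` and the concrete values (fanout 3, compression level 3, 1 MiB cache) are illustrative assumptions, not part of the patch.

```rust
// Minimal sketch of the migrated test setup; API names are taken from this
// diff, while the concrete settings and the helper name are illustrative.
use sled::{Config, Db as SledDb};

// The test module pins a tiny leaf fanout so splits and merges happen early.
type Db = SledDb<3>;

fn open_test_tree(flusher: bool) -> std::io::Result<Db> {
    let config = Config::tmp()? // temporary directory, replacing Config::new().temporary(true)
        .zstd_compression_level(3) // replaces the old use_compression(bool) toggle
        .flush_every_ms(if flusher { Some(1) } else { None })
        .cache_capacity_bytes(1024 * 1024); // replaces the cache_bits shift arithmetic

    // open() is still fallible; the harness unwraps it as before.
    let tree: Db = config.open().unwrap();

    // check_error() replaces config.global_error() for surfacing background
    // IO failures during a property-test run.
    tree.check_error()?;
    Ok(tree)
}
```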
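
The deleted `tree_bug_*` regression tests above call the old five-argument harness, `prop_tree_matches_btreemap(ops, flusher, use_compression, cache_bits, segment_size_bits)`; its replacement takes four arguments, with an explicit zstd level and a cache size in bytes. A hypothetical regression test written against the new signature might look like the sketch below, assuming the `Op`, `Key`, and harness items from `tests/tree/mod.rs` are in scope; the ops, name, and settings are invented for illustration.

```rust
// Hypothetical regression-test shape under the new harness signature from
// this diff: (ops, flusher, compression_level, cache_size).
#[test]
#[cfg_attr(miri, ignore)]
fn tree_bug_example() {
    assert!(prop_tree_matches_btreemap(
        vec![Set(Key(vec![1]), 7), Restart, Get(Key(vec![1]))],
        false,       // no background flusher thread
        3,           // zstd compression level
        1024 * 1024, // cache capacity in bytes
    ));
}
```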