feat: capture stream metadata in database and use for IOD #375
base: feat/aes-130-data-migrations
Conversation
Overall looking good
@@ -0,0 +1,9 @@
-- Add up migration script here
CREATE TABLE IF NOT EXISTS "ceramic_one_event_header" ( |
How does this table get populated for existing dbs?
I'm going to tackle that part today 😅. I'm planning to follow the kubo approach: add a CLI flag to skip the prompt, and otherwise ask whether the migrations should be applied (or exit). It will require reading every event, parsing it, and storing the headers, so it can't (easily) be done via a SQL statement. I'm open to other approaches... I was thinking I'd create a table like the one below and then query/update it at start.
create table ceramic_one_data_migrations (
    version text unique not null,
    started_at timestamp not null default current_timestamp,
    completed_at timestamp
);
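At start-up, that table could be queried and updated roughly like this (a sketch only; the `event-header-v1` version label is made up):

```sql
-- Has this migration already completed? (no row, or a NULL completed_at,
-- means it still needs to run)
SELECT completed_at FROM ceramic_one_data_migrations WHERE version = 'event-header-v1';

-- Record the start, run the backfill, then mark it done.
INSERT INTO ceramic_one_data_migrations (version) VALUES ('event-header-v1');
UPDATE ceramic_one_data_migrations
    SET completed_at = current_timestamp
    WHERE version = 'event-header-v1';
```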
Haven't finished reviewing `ordering_task.rs` or `service.rs` yet, but there are a lot of comments already that I want to get out, and my brain is starting to turn mushy reading all this code, so I think this is all I can get through today. Will plan to revisit tomorrow.
-- Add up migration script here
CREATE TABLE IF NOT EXISTS "ceramic_one_stream" (
    "cid" BLOB NOT NULL, -- init event cid
    "sep" TEXT NOT NULL,
what about controller? Seems like we'll definitely want that for event validation.
yeah, I wanted to discuss controller and whether we want to fully normalize it out. It's easy to capture for just the init event here, but we may want to include it for every event as well. I wasn't sure if we wanted many-to-many or just a single one. Having it normalized for every event makes it possible to do "flat recon" things, but it would still be a decent amount of work to implement.
I think for now we should assume that there is only ever a single controller, and that the controller for a stream never changes. Both of those are true today, and if either were to ever change it would require work across the entire stack, so I think requiring a database migration if/when that were to happen would be okay. For now let's keep it simple.
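Under that assumption, the schema change could be as small as one column on the stream table; a sketch only, not the actual migration:

```sql
-- Sketch: a single controller per stream, captured from the init event
-- header. Multi-controller streams would require their own migration.
ALTER TABLE ceramic_one_stream ADD COLUMN "controller" TEXT NOT NULL DEFAULT '';
```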
"stream_cid" BLOB NOT NULL, -- id field in header. can't have FK because stream may not exist until we discover it but should reference ceramic_one_stream(cid) | ||
"prev" BLOB, -- prev event cid (can't have FK because node may not know about prev event) | ||
PRIMARY KEY(cid), | ||
FOREIGN KEY(cid) REFERENCES ceramic_one_event(cid) |
why have this metadata in a separate table at all instead of directly in the `ceramic_one_event` table?
I didn't want to make that table any wider, since we do large scans and have to load the entire row off disk even if we only read some of the columns. Since these types are pretty small it probably doesn't matter, but I kind of expected `event_metadata` (probably going with that, since `event_header` is silly) to keep growing, and I didn't want to end up with a 25-column event table.
If I could redefine these tables (wow, that was quick), I'd rename `event` to `event_hash`, make `event_header`/`event_metadata` the event table, and put `delivered` and all the new data in there.
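For illustration, that restructure might look roughly like the following (hypothetical, not a planned migration):

```sql
-- Hypothetical: a narrow table for the large scans...
CREATE TABLE "ceramic_one_event_hash" (
    "cid" BLOB NOT NULL,
    PRIMARY KEY(cid)
);

-- ...and a wider event table holding delivered plus the header/metadata.
CREATE TABLE "ceramic_one_event" (
    "cid" BLOB NOT NULL,
    "stream_cid" BLOB NOT NULL,
    "prev" BLOB,
    "delivered" INTEGER, -- moved in from the old event table
    PRIMARY KEY(cid),
    FOREIGN KEY(cid) REFERENCES ceramic_one_event_hash(cid)
);
```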
okay, fair enough. I was also just talking to @smrz2001 about the fact that for self-anchoring we'll probably want to track whether we learned of an event via the API or via Recon. That might be something we want to add to this metadata table as well.
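If that lands, it could be one more column on the metadata table (illustrative only; the column name is an assumption):

```sql
-- Hypothetical: record how the node learned of the event ('api' or 'recon').
ALTER TABLE ceramic_one_event_metadata ADD COLUMN "source" TEXT;
```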
service/src/event/ordering_task.rs
deliverable.push_back(start_with);
self.remove_by_event_cid(&start_with);
let mut tip = start_with;
// technically, could be in deliverable set as well if the stream is forking?
we should definitely make sure we have tests that cover the case of stream forks and multiple histories each with gaps that get filled in at different times.
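Not the service's actual test, but the scenario is easy to model: an event is deliverable only once its `prev` is delivered, so a fork with gaps needs repeated passes as the gaps fill in. A self-contained toy (all names illustrative):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Toy in-order delivery: an event is deliverable once its `prev` has been
/// delivered (init events have none). A fork means two events share a prev,
/// and each branch must make progress independently.
fn deliverable_order(prevs: &HashMap<&str, Option<&str>>) -> Vec<String> {
    let mut delivered: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    let mut pending: VecDeque<&str> = prevs.keys().copied().collect();
    let mut stalled = 0;
    while let Some(e) = pending.pop_front() {
        let ready = match prevs[e] {
            None => true,                     // init event
            Some(p) => delivered.contains(p), // prev already delivered?
        };
        if ready {
            delivered.insert(e);
            out.push(e.to_string());
            stalled = 0;
        } else {
            pending.push_back(e);
            stalled += 1;
            if stalled >= pending.len() {
                break; // a gap: everything left is undeliverable for now
            }
        }
    }
    out
}

fn main() {
    // A fork: `a1` and `b1` both build on `init`, then each branch extends.
    let prevs = HashMap::from([
        ("init", None),
        ("a1", Some("init")),
        ("b1", Some("init")), // the fork
        ("a2", Some("a1")),
        ("b2", Some("b1")),
    ]);
    let order = deliverable_order(&prevs);
    assert_eq!(order[0], "init"); // init always delivers first
    assert_eq!(order.len(), 5);   // both branches fully deliver
    println!("{order:?}");
}
```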
Thanks for the thorough review Spencer! I pretty much agree with everything you mentioned. Here's my summary of what I think I should do:
- fix the bugs 😄
- rename `ceramic_one_event_header` to `ceramic_one_event_metadata` (or possibly just put the data in the event table, but I'd rather not, as I wrote in that comment); the rename itself is sketched after this list
- remove `commit` from the codebase and just refer to events
- review and rename/consolidate types. A number of them were specific to queries (for serialization, or for grouping a few CIDs and useful data without relying on a tuple) just for a bit of sanity, but it could be a lot cleaner. Most of them aren't public outside the module and may not even be necessary.
  - this includes relying on something from the event crate and possibly sharing CAR file parsing with the API server
- clean up comments. Some were just missed cruft/typos; some were me reasoning about why something was correct/okay, but they don't add any value and are a bit confusing now.
And there's possibly more I missed but I'll go over each one as I change things tomorrow.
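For the table rename called out in the list above, the rename itself is trivial in SQLite; the part that needs the data migration is backfilling metadata for pre-existing events:

```sql
ALTER TABLE ceramic_one_event_header RENAME TO ceramic_one_event_metadata;
```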
store/src/sql/access/event.rs
unreachable!("Init events should have been filtered out")
}
EventHeader::Data { cid, prev, .. } | EventHeader::Time { cid, prev, .. } => {
    // check for prev in this set and fallback to database
Hmm, I think you're right. Good catch! I should probably write a better test to make sure 😄. This would fail if we somehow got 3 writes from the API for a single stream out of order. For recon-discovered events, we would miss marking them immediately (but wouldn't error); the stream would get sent to the task, which would find them, but it shouldn't need to.
store/src/sql/access/event.rs
events: &'a [EventInsertable],
pool: &SqlitePool,
require_history: bool,
) -> Result<Vec<(bool, &'a EventInsertable)>> {
yeah, I 100% agree with this. I discussed it a bit with Nathaniel and we want to make the event type from the event crate part of the public interface, but some of that comes out of event validation and the migrations stuff he's working on right now.
I made a lot of types to wrap specific behavior (like a group of CIDs rather than a tuple) or a specific database query result but it merits more thought and clean up.
store/src/sql/access/event.rs
let new_key = Self::insert_key(&mut tx, &item.order_key).await?;
for (idx, (deliverable, item)) in to_add.iter().enumerate() {
    let new_key = Self::insert_key(&mut tx, &item.order_key, *deliverable).await?;
    let candiadate = CandidateEvent::new(item.cid(), item.stream_cid());
thanks :). I renamed this locally in the migration to populate the new tables and will push it tonight even though it's not quite ready (i.e. working)
};

/// Access to the stream and related tables. Generally querying events as a stream.
pub struct CeramicOneStream {}
Yeah, that makes sense. These access structs were intended to group higher-level functionality rather than just single table/entity writes, but the names are pretty much copied straight from the tables, which is a misnomer for sure.
    }
} else {
    undelivered.push(candiadate);
}
if new_key {
    for block in item.body.blocks.iter() {
        CeramicOneBlock::insert(&mut tx, block.multihash.inner(), &block.bytes).await?;
everything is wrapped in a single (large) transaction so it gets rolled back together
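For context, the shape of that with sqlx is roughly the following; a sketch only, with illustrative statements standing in for the real `insert_key`/`CeramicOneBlock::insert`:

```rust
use sqlx::SqlitePool;

/// Sketch: insert an event key and all of its blocks atomically. If any
/// statement fails, the transaction is dropped without commit and SQLite
/// rolls everything back together.
async fn insert_event(
    pool: &SqlitePool,
    order_key: &[u8],
    blocks: &[(Vec<u8>, Vec<u8>)], // (multihash, bytes) pairs
) -> sqlx::Result<()> {
    let mut tx = pool.begin().await?;
    sqlx::query("INSERT INTO ceramic_one_event (order_key) VALUES (?)")
        .bind(order_key)
        .execute(&mut *tx)
        .await?;
    for (multihash, bytes) in blocks {
        sqlx::query("INSERT INTO ceramic_one_block (multihash, bytes) VALUES (?, ?)")
            .bind(multihash.as_slice())
            .bind(bytes.as_slice())
            .execute(&mut *tx)
            .await?;
    }
    tx.commit().await?; // nothing is visible to readers until this succeeds
    Ok(())
}
```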
IPLD impls `Debug`, `PartialEq`, and `Eq`, so unless we allow true floats I think this is okay.
- populated when discovering events
- rewrote IOD using it (been running `recon_lots_of_streams` in a loop and it keeps passing)
- created crate-specific InsertResult structs (api, recon, service, store). If any api batch write fails because of something in the service (i.e. no prev), it won't fail other writes in the batch
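The per-crate result shape might look something like this (purely illustrative; the field names are assumptions):

```rust
/// Sketch of a batch insert result: accepted items succeed even when other
/// items in the same batch are rejected (e.g. for a missing prev).
pub struct InsertResult {
    pub accepted: Vec<Vec<u8>>,           // CIDs (as bytes) that were stored
    pub rejected: Vec<(Vec<u8>, String)>, // CID plus a reason, e.g. "missing prev"
}
```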
We now store stream metadata in the database so we can find all events for a stream. This simplifies the in-order delivery behavior. We do make more database queries now, but we use less memory keeping track of events. This could be more efficient, but for now I went with simple and hopefully good enough, especially while streams are short. If streams start getting longer, this may not be a very good approach.
We run a task to update deliverable status for streams. It gets notified when a recon-discovered event is inserted and is currently deliverable, and it tries to sort out any other pending pieces and update them. It also includes a start-up operation to load streams and mark their events deliverable. Finally, a data migration runs on daemon start to populate the new tables.
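A minimal sketch of that notification flow, assuming a tokio mpsc channel and hypothetical types (the real task also handles the start-up scan):

```rust
use tokio::sync::mpsc;

/// Hypothetical notification: a deliverable event was just inserted.
struct DeliverableEvent {
    stream_cid: Vec<u8>, // stand-in for the stream's init event CID
}

/// Ordering task: for each notification, look up undelivered events in the
/// same stream whose prev chain is now complete and mark them deliverable.
async fn ordering_task(mut rx: mpsc::Receiver<DeliverableEvent>) {
    while let Some(ev) = rx.recv().await {
        // Query undelivered events for ev.stream_cid, walk prev pointers,
        // and update deliverable status (elided in this sketch).
        let _ = ev.stream_cid;
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1024);
    tokio::spawn(ordering_task(rx));
    // After inserting a deliverable recon event, notify the task:
    tx.send(DeliverableEvent { stream_cid: vec![0x01] })
        .await
        .expect("ordering task stopped");
}
```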
This should fix the test in js-ceramic. It requires #381 and #382.