How to defrag? #306
Some loose thoughts (I'll need to spend more time on it). uint64 is one option. It would be good if we didn't need to use uint64 for the arrays in jitdb, since that takes up double the space; but then again it might be good if we didn't have to rewrite the whole index file on changes, only what has changed. As for delete and compaction, it might be possible to do something like keeping track of the latest point on the log where everything before it is deleted. Then you could store that as a starting point or "deleted offset", and once you recalibrate seqs related to this you could set it back to 0.
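A minimal sketch of that bookkeeping (names like `deletedOffset` and `markDeletedUpTo` are made up for illustration, not an existing API):

```js
// Sketch only: a single "deleted offset" below which every record is considered gone.
let deletedOffset = 0 // byte offset into the log; everything before it is deleted

function markDeletedUpTo(offset) {
  // called whenever a whole prefix of the log has been deleted
  if (offset > deletedOffset) deletedOffset = offset
}

function isLive(recordOffset) {
  return recordOffset >= deletedOffset
}

function onRecalibrated() {
  // after seqs/offsets have been recalibrated (e.g. a real compaction), reset the marker
  deletedOffset = 0
}
```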
This is smart! But would it ever work? I mean, how can we guarantee that the first X messages would eventually be deleted? It might be that those messages are authored by the root metafeed, or are contact messages, and in that case they wouldn't ever be deleted.
Good point. Another idea would be to have multiple databases: one for things that are semi-stable and one for "fast moving" stuff. The second one, with most of the churn, might even be migrated to a fresh database once in a while to keep it small. Imagine a sliding window that can grow to 2x and then migrates to a new database at ½x or 1x. Anyway, just throwing ideas out there :)
That's a pretty good idea, to have multiple databases. It reminds me of a couple of other ideas from the distant past:
Your idea also gave me another idea: to have a throw-away temporary log (and indexes), so that there would be three:
The reason for the third, temp db, is that you could browse the profiles of acquaintances and strangers and just replicate them on the spot (first you replicate just their profiles, which is a tiny amount of data, then you can choose to scroll their posts, either on the temp db or on the other persisted DBs). It would reduce the storage burden on the other two DB types, and would support the browsing use case whenever you are currently connected to other peers. I think the real challenge would be to weave together all these three in the same
migrating to u64 is the simplest, but it has the downside of doubling the size of things that are already u32... varint means you can't just jump to a byte offset and know you have a valid record. but what about this: current u32 data stays. with a varint, a similar pattern is used but a single bit indicates there is another byte, so to randomly seek into a series of varints in memory, you'd need to scan backwards to the previous end byte (leading 0)... then take the subsequent zero or more bytes that start with 1, until one with a leading zero.
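Here is a sketch of that backward scan over protobuf-style varints (continuation bit set on every byte except the last one of each varint), assuming the buffer holds nothing but concatenated varints; the function names are invented for illustration:

```js
// Find the start of the varint containing byte position `pos`, by scanning
// backwards until we hit the end byte (high bit 0) of the previous varint.
function findVarintStart(buf, pos) {
  let i = pos - 1
  while (i >= 0 && (buf[i] & 0x80) !== 0) i--
  return i + 1
}

// Decode the varint starting at `start` (little-endian groups of 7 bits,
// high bit set means "another byte follows").
function readVarint(buf, start) {
  let value = 0
  let shift = 0
  let i = start
  while (true) {
    const byte = buf[i++]
    value += (byte & 0x7f) * 2 ** shift // 2 ** shift instead of << to stay safe past 32 bits
    if ((byte & 0x80) === 0) break
    shift += 7
  }
  return { value, nextOffset: i }
}
```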
how does jitdb use arrays? maybe there can be something like a page level where it indicates whether it's using u64 or u32 for that page? did it use bitfields in the end? if so, is there a thing that maps sequential indexes back to byte offsets? sequential indexes are better for bitfields, but of course you need to get the actual byte offset somehow. but you only need to switch to using u64 at some point, and it always grows, so there is a point where that change happens.
if you really want to compact the database, a simple way to do that would be to recopy the log into a new log, minus the deleted stuff, rebuild the indexes, then drop the old log. this is basically how CouchDB does it. since scanning the log and building indexes is fast, it wouldn't take that long, and you could do it slowly in the background so it wasn't noticeable.
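A rough sketch of that copy-compaction, assuming deleted (zero-filled) records come out of the scan as null; the four helpers are hypothetical and passed in, since this is not the real async-append-only-log API:

```js
// CouchDB-style compaction sketch. Hypothetical helpers:
//   scanLog(path)        -> async iterator of records (null for deleted ones)
//   appendTo(path, rec)  -> append one record to the new log
//   rebuildIndexes(path) -> rebuild all indexes against the new log
//   swapLogs(oldP, newP) -> atomically replace the old log with the new one
async function compactByCopy({ scanLog, appendTo, rebuildIndexes, swapLogs }, oldLogPath, newLogPath) {
  for await (const record of scanLog(oldLogPath)) {
    if (record === null) continue // deleted (zero-filled) record, leave it behind
    await appendTo(newLogPath, record)
  }
  await rebuildIndexes(newLogPath) // scanning + indexing is fast, can run in the background
  await swapLogs(oldLogPath, newLogPath) // drop the old log once the new one is ready
}
```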
I had another idea today that improves my above suggestion from 2N down to max N space, but 2N index memory. Instead of writing a separate second log, superimpose the compacted log on top of the old one, keeping track of the min uncompacted sequence number. When you do a query from the old indexes, drop results with offsets smaller than the uncompacted sequence, and mix with results from the new indexes. Merge index responses by sequence numbers, so numbers smaller than the uncompacted sequence are from the compacted section and the numbers greater are from the uncompacted log. I'm not sure this optimization is really necessary, but it might not be that hard to pull off once you have the first part.
That's great! I was already getting used to the idea that we have to compromise with the 2N limit for logs, so getting it down to N is much better. Indexes at 2N are fine, because they are typically much smaller than the log (example from production: the log is 960MB, the indexes are 262MB in total). UPDATE: wait, why would indexes be at 2N? Couldn't we wait for the superimposition writes to finish and then clear all indexes and rewrite them? Or were you suggesting that the compacted section of the log would also have its own indexes being written simultaneously? What if instead of dropping results from the uncompacted section, we drop results from the compacted section? I mean, the old indexes will already be coherent with the uncompacted section, plus most queries are for recent content anyway (thus likely to be in the latter section of the log). By the way, I still believe in having an in-memory ephemeral database, a good use case being to fetch the profile details of a feed you block. You don't want to actually "replicate" that feed (using your storage and forwarding it to others), you just want to know what the name of the blocked person is, and maybe one of the peers currently connected has that information. (PS: the assumption here is that once you block someone, we'll delete their records from the log.)
I agree it's a pretty simple change. On the other hand, I kind of like the idea of SSB always being for "small" use cases, and by "small" I'm exaggerating here, because 4.3GB is enough for roughly 20 years of content from thousands of people. That should be enough for most cases. Use cases that fall outside of that would be archival and "Corporate Big Data Or Surveillance Capitalism".
Yes, jitdb uses bitfields. Not sure if I understood how to map your terminology to our terminology, but it goes like this: we have bitvectors where the Nth bit is about the Nth record on the log, and then we have an index that maps those sequence numbers back to byte offsets on the log. jitdb also has some other esoteric indexes, like "prefix indexes", which allow us to store (e.g.) the first 4 bytes of a msg.key for all messages; this is very often "unique enough" that we don't need the full msg.key to guarantee uniqueness. Of course we do a double check for false positives in case it's not unique. And we also have "prefix maps", which is basically a huge object that looks like this:
{
  '%Lihv': [seq1, seq8, seq185, seq280],
  '%W1iI': [seq6, seq59, seq127, seq305, seq407, seq558],
  '%rMcC': [seq3, seq67, seq111, seq189, seq198]
}
We use this (e.g.) to find all messages that point to a certain root message, in threads.
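For illustration, a sketch of how such a prefix map can be used, including the double check against false positives; the `getRecordBySeq` helper and the exact prefix length are assumptions, not jitdb's actual API:

```js
// Find all messages whose content.root points at `rootMsgKey`, using the
// prefix map shown above, then drop false positives by checking the full key.
async function msgsByRoot(prefixMap, rootMsgKey, getRecordBySeq) {
  const prefix = rootMsgKey.slice(0, 5) // e.g. '%Lihv'
  const candidateSeqs = prefixMap[prefix] || []
  const results = []
  for (const seq of candidateSeqs) {
    const msg = await getRecordBySeq(seq)
    if (msg.value.content.root === rootMsgKey) results.push(msg) // double check
  }
  return results
}
```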
is the prefix map a flumeview-reduce? one of my main ideas with flumeview-vectors was to create exactly this data structure, but on bytes, so you didn't have to load everything. another idea: if the
I was thinking that you'd perform the compaction process gracefully in the background, without interrupting usability. So you'd keep the old indexes, but gradually build a new set too. That way you can do it slowly without making the phone hot! If you were prepared to make the user look at a progress bar for a while, then you could indeed have max-N-size indexes. Maybe you could perform a similar in-place update to the indexes, but there are lots of different kinds of indexes, so I wouldn't assume so in general.
this doesn't make sense. the messages in the compacted section now have different offsets, from filling the gaps created by deleted messages, so the old (uncompacted) indexes do not point to the start of messages in the compacted section. What I meant is: in the old (uncompacted) indexes, drop results that point into the compacted section. You'd still keep results from the uncompacted section that point to after where compaction is up to.
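A sketch of that filtering while answering a query mid-compaction; `compactedUpTo` is a hypothetical byte offset marking how far compaction has progressed, and the result shape is invented for illustration:

```js
// Old indexes: drop anything pointing into the already-compacted section,
// because those records have moved. New indexes are built against the
// compacted section only, so they can be taken as-is.
function mergeQueryResults(oldIndexResults, newIndexResults, compactedUpTo) {
  const untouchedTail = oldIndexResults.filter((r) => r.offset >= compactedUpTo)
  return newIndexResults.concat(untouchedTail)
}
```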
It's pretty brutal at the moment. It's just a big JSON object that's stringified / parsed, and the whole object is kept in memory. |
@arj03 @dominictarr After some weeks have passed, have you had any new thoughts about database compaction? I intend to start working on it this week, and now would be the best time to make comments about the big picture, before code is changed. My plan is to initially support only one non-controversial use case: given that you have blocked a peer, you can "delete" (overwrite with zero bytes) their messages from the log, and then we would like to compact the database afterwards. So far I won't do anything related to deleting portions of a friend's feed, as that would require extra caution.
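A sketch of that flow, where ssb.db.deleteFeed is the ssb-db2 call mentioned later in this thread, and `compactLog` is just a placeholder for the compaction API being discussed here (it doesn't exist yet):

```js
// "Delete" (zero out) every record of each blocked feed, one feed at a time,
// then compact the log and let reindexing happen afterwards.
function deleteBlockedAndCompact(ssb, blockedFeedIds, compactLog, cb) {
  function next(i) {
    if (i >= blockedFeedIds.length) {
      return compactLog(cb) // placeholder: compaction + reindexing goes here
    }
    ssb.db.deleteFeed(blockedFeedIds[i], (err) => {
      if (err) return cb(err)
      next(i + 1) // sequentially, one deleteFeed at a time
    })
  }
  next(0)
}
```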
Some benchmark numbers that I want to keep somewhere, so here:
Are you removing half of the messages? At random? And for production, does the "time to compact" include reindexing the indexes?
Oh yeah, forgot to add any context to this. It's running in a test suite, and I remove half of the messages (every odd-numbered message is deleted), so not at random. "Time to compact" does not include reindexing time. The
Okay, I thought "prod" was where you ran this on a copy of your Manyverse desktop database :)
I ran it for real on my Manyverse desktop db, like this:
Results:
Reindexing took 12min, but I suspect it's not functionally correct; I have a suspicion that it was rebuilding leveldb indexes too soon, or concurrently. I have to check if the code is correct. So there's still hope that reindexing would drop down to 8min or less.
@arj03 I'm working on a module that deletes feeds that are blocked, and I got some stats to share with you. In Manyverse, I put a script which looks at ssb.friends.hops and deletes any feed that has a negative distance. (Sequentially! One deleteFeed at a time). Then, it compacts the db and rebuilds indexes.
Before
During
[Distribution chart: how much time it took for each feed to be deleted; the x-axis is each feed, the y-axis is time-to-delete in milliseconds.]
After
Notice that the log size went down 70 MB, which is pretty cool. What's puzzling me is that it seems
That is strange. Did you save your log.bipf from before compaction? It could be interesting to do a full reindex on that database and see if the same new message appears. It should. And if it does, then there must have been something wrong with the old indexes. Do you have a more detailed list of the index sizes?
Also, not sure if you have updated bipf yet; it should provide a nice speed boost for indexing (around 10% faster).
I'll give these numbers another serious run after the bugs in ssb-db2 are fixed, so we can rule out at least some weirdness.
It was using bipf 1.6.5, which is the version with your fast-varint, but before we added markIdempotent etc.
I ran the deletion (of ~1100 blocked feeds) and compaction again, on my production database, and I noticed a couple of funny things. In the table below, Before shows sizes of things (in bytes) before deletion and compaction, and After is after compaction plus the reindexing induced by compaction. And then, I deleted
I think there is some bug somewhere, because
Strange. Those numbers are all over the place. I don't remember: is there a test where you use a fixture to generate a database with, say, 10 users, delete 5 of them, and check that all indexes return the correct number of results?
That might indeed be a good test. At the micro scale there are tests (for 1 author, I guess?) and they pass. The problem seems to happen somewhere at medium or large scale, no idea why.
One thing I suspected was that I wasn't "flushing deletes" after calling all those
Here are stats for running on ssb-fixtures:
Before
After
All the jitdb indexes should be lower or the same size. I think for level there might be some compaction that hasn't kicked in. So the best would be to run queries on the results to see if things look correct.
Ok, I hacked ssb-db2 code in Manyverse to force
log.bipf: same
jitdb: same
Now it looks much better. Doesn't seem like there is a problem. I also wrote an ssb-db2 test (which I could make a PR for) and couldn't notice any problem.
Turns out that
More numbers after using the latest jitdb with a single log.stream:
(In both of these cases, the log.stream was updating/creating many indexes at once.) I have one more suspicion to check tomorrow: live jitdb queries are interfering with something during post-compaction indexing. They seem to kick in only after initial indexing is done. UPDATE: actually, the numbers vary drastically. Here's a second run:
But the total indexing time (jitdb PLUS level indexes) is still consistently larger for the post-compaction one. If you want to see the full log:
Incredibly good news from the jitdb refactor (to use only one log.stream). I ran the following on my production Manyverse desktop (
Before: 6m00s
That's a 9x speedup! 🔥
Wow, that is really great news. Made my day to read this 🌞 |
Just to make sure this good news is for real, I also deleted both
Before: 9m23s (this is consistent with what people have been telling me, that Manyverse takes "only" 10min to build indexes)
After: 2m42s
That's 563s / 162s = 3.5x speedup.
I've been thinking about getting back to implementing partial replication (finishing ssbc/ssb-replication-scheduler#5), and putting it in Manyverse.
TL;DR: how can we keep the sizes of the log + indexes somewhat constant/stable as M new messages come in and M older messages get overwritten by zero bytes?
Deleting messages from the log and defragmenting will be an important part of that, because initially the use of partial replication would add N index messages to the log, and N could be significant (it's at least the number of votes/posts/contacts/abouts), which could bring the log size from the current ~1GB to 2GB or more. Defragging is important because if we just overwrite the "deleted" records with zero bytes, this still doesn't change the total size of the log, and doesn't change the offsets in the indexes, and there's a limit of ~4.3GB for the max offset that indexes can point to (due to uint32).
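For reference, the back-of-the-envelope arithmetic behind those ceilings (nothing assumed beyond the integer widths themselves):

$$
2^{32}\ \text{bytes} = 4{,}294{,}967{,}296\ \text{bytes} \approx 4.29\ \text{GB},
\qquad
2^{64}\ \text{bytes} \approx 1.8 \times 10^{19}\ \text{bytes} \approx 16\ \text{EiB}
$$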
I thought about using sparse files for the log, or compressing the log on writes and decompressing on reads, but those would still not fix the problem, which is about the uint32 limit for offsets in the indexes, not about the size of the log itself.
So we would need to defrag the log itself, which also means that all indexes have to be rebuilt to match the new offset of each record. Deciding how often to do this defrag+reindex would be tricky, and the waiting experience would be pretty bad for users.
One idea is to upgrade the offsets from uint32 to uint64, which would increase the "lifespan" of the log by a factor of 2^32 (about 4 billion). Assuming it takes 1 year to get 1GB worth of messages on the average user's log, we would have billions of years, so uint64 is more than enough.
Perhaps if offset variables are updated to uint64, and if we use compression so that the log file itself can shrink the space taken by those zero bytes, then this problem might be "solved". But I'm not sure about that, because index sizes could still keep growing.
Ideally (just to make sure our system works at scale), we should assume that an insane amount of new messages comes in via replication, say +1GB on the log every month (this is much more than the previously mentioned +1GB/year, but the "lifespan" of the log with uint64 would still be ridiculous, on the order of a billion years), and that an equal amount of older messages gets "deleted" (overwritten by zero bytes), and then figure out how to keep the user's disk usage somewhat stable. I mean, it's acceptable if the disk usage has a sawtooth pattern (like garbage collection sawtooth charts), as long as an upper bound is never crossed, i.e. no "disk" leaks happen.