Improve certstore corruption resilience #755

rvagg · 2024-11-26T04:42:12Z

I've been running corrupt F3 Lotus nodes for who-knows-how-long and only realised today when I started poking at the F3 APIs and it kept on telling me it wasn't running.

I eventually debugged my way to this problem:

{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:308","msg":"resuming F3 internals"}
{"level":"error","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:184","msg":"failed to reconfigure GPBFT","error":"failed to open certstore: getting latest power table: failed to find expected power table for instance 0: failed to unmarshal power table at instance 0: unmarshaling (*t)[i]: unmarshaling t.Power: cbor input for fil big int was not a byte string (6)"}
{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:190","msg":"F3 is starting","initialDelay":0,"hasPendingManifest":false,"NetworkName":"calibrationnet","BootstrapEpoch":2081674,"Finality":900,"InitialPowerTable":"bafy2bzaceab236vmmb3n4q4tkvua2n4dphcbzzxerxuey3mot4g3cov5j3r2c","CommitteeLookback":10}

F3#Start calls startInternal, which fails in openCertstore. I assume this is because I had an earlier version of F3 running on both my calibnet and mainnet nodes, and the certificate format has since changed slightly, so the stored data won't unmarshal. F3 then falls into the manifest-update loop and doesn't proceed. It's been running like this for weeks (months?): not activating, just sitting. My logs show nothing beyond the "F3 is starting" line above.

I think this is because startInternal needs to reach newRunner for things to really start ticking, but it returns the error from openCertstore first, so it can't even get a new manifest from the network? I'm hazy on what happens here, but it can't progress.

I fixed it by adding cs.DeleteAll to the error cases in here:

go-f3/certstore/certstore.go

Lines 264 to 268 in c2f99cf

    if b, err := cs.ds.Get(ctx, cs.keyForPowerTable(instance)); err != nil {
        return nil, fmt.Errorf("failed to load power table at instance %d: %w", instance, err)
    } else if err := powerTable.UnmarshalCBOR(bytes.NewReader(b)); err != nil {
        return nil, fmt.Errorf("failed to unmarshal power table at instance %d: %w", instance, err)
    }

I then restarted the node after the first error. It's not the most elegant solution; it would be nice if F3 recognised that the certstore was corrupt, or written by an incompatible version, logged an error, and then carried on and recovered by starting from scratch.
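The recovery behaviour asked for here can be sketched roughly as follows. This is a minimal illustration, not go-f3 code: `store`, `powerEntry`, `loadPowerTable` and `openOrReset` are hypothetical names, an in-memory map stands in for the datastore, and `encoding/json` stands in for the CBOR encoding so the example stays self-contained. On a decode failure it logs the error, wipes the store (the role played by cs.DeleteAll in the workaround above) and reseeds from the initial power table instead of returning the error.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the certstore's datastore.
type store map[string][]byte

// powerEntry is a simplified, hypothetical power table entry.
type powerEntry struct {
	ID    uint64 `json:"id"`
	Power string `json:"power"`
}

var errNotFound = errors.New("not found")

// loadPowerTable decodes the power table stored under key.
func loadPowerTable(s store, key string) ([]powerEntry, error) {
	b, ok := s[key]
	if !ok {
		return nil, fmt.Errorf("failed to load power table: %w", errNotFound)
	}
	var pt []powerEntry
	if err := json.Unmarshal(b, &pt); err != nil {
		return nil, fmt.Errorf("failed to unmarshal power table: %w", err)
	}
	return pt, nil
}

// openOrReset returns the stored power table if it decodes cleanly.
// If the stored bytes cannot be decoded, it logs the error, deletes every
// key in the store and reseeds it with the initial table. The boolean
// result reports whether such a reset happened.
func openOrReset(s store, key string, initial []powerEntry) ([]powerEntry, bool, error) {
	reset := false
	pt, err := loadPowerTable(s, key)
	switch {
	case err == nil:
		return pt, false, nil
	case errors.Is(err, errNotFound):
		// Nothing stored yet: seed below without wiping anything.
	default:
		fmt.Printf("certstore unreadable, starting from scratch: %v\n", err)
		for k := range s {
			delete(s, k)
		}
		reset = true
	}
	b, err := json.Marshal(initial)
	if err != nil {
		return nil, reset, err
	}
	s[key] = b
	return initial, reset, nil
}

func main() {
	// The stored entry encodes power as a number where a string is expected,
	// mimicking data written in an older, incompatible format.
	s := store{"/powertable/0": []byte(`[{"id":1,"power":6}]`)}
	pt, reset, err := openOrReset(s, "/powertable/0", []powerEntry{{ID: 1, Power: "10"}})
	fmt.Println(pt, reset, err)
}
```

Wiping on every decode error is a blunt choice; a real implementation would probably want to distinguish a known format change from genuine corruption before discarding finality certificates.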

The fact that I have this problem on both nodes suggests that it may have come from an RC or some other non-master release (maybe; I can't be sure what versions I've run on both of these). So there's a nonzero chance there are more people out there with nodes in this state, and I'm not sure what we could even tell them about fixing it other than to delete their entire datastore.


rvagg · 2024-11-28T08:48:21Z

Kuba pointed out that the network name should stop this from happening again: this should only have occurred on calibnet, and mainnet's name was bumped after the cert format was changed.
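To illustrate why a network name bump would protect against this, here is a small sketch. It assumes, as the comment above implies, that stored F3 state is scoped by network name; the `namespaced` type, the `forNetwork` helper, the `/f3/` prefix and the bumped name `calibrationnet2` are all hypothetical and only serve the example.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the node's datastore.
type store map[string][]byte

// namespaced is a hypothetical view of the store that prefixes every key
// with a network-specific namespace.
type namespaced struct {
	prefix string
	ds     store
}

// forNetwork returns a view of ds scoped to the given network name.
func forNetwork(ds store, networkName string) namespaced {
	return namespaced{prefix: "/f3/" + networkName, ds: ds}
}

// Put writes v under key inside this network's namespace.
func (n namespaced) Put(key string, v []byte) { n.ds[n.prefix+key] = v }

// Get reads key from this network's namespace only.
func (n namespaced) Get(key string) ([]byte, bool) {
	v, ok := n.ds[n.prefix+key]
	return v, ok
}

func main() {
	ds := store{}

	// State written under the old network name, in the old format.
	old := forNetwork(ds, "calibrationnet")
	old.Put("/powertable/0", []byte("old-format bytes"))

	// After the name is bumped, the same key resolves to a different
	// namespace, so the stale bytes are never read or decoded.
	cur := forNetwork(ds, "calibrationnet2")
	_, found := cur.Get("/powertable/0")
	fmt.Println("stale power table visible after bump:", found)
}
```

Under this assumption the old data is left orphaned in the datastore rather than cleaned up, but it can no longer break startup.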

github-project-automation bot added this to F3 Nov 26, 2024

github-project-automation bot moved this to Todo in F3 Nov 26, 2024

rvagg closed this as completed Nov 28, 2024

github-project-automation bot moved this from Todo to Done in F3 Nov 28, 2024
