Improve certstore corruption resilience #755

Closed
rvagg opened this issue Nov 26, 2024 · 1 comment

Comments

@rvagg
Member

rvagg commented Nov 26, 2024

I've been running corrupt F3 Lotus nodes for who-knows-how-long and only realised today when I started poking at the F3 APIs and it kept on telling me it wasn't running.

I eventually debugged my way to this problem:

{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:308","msg":"resuming F3 internals"}
{"level":"error","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:184","msg":"failed to reconfigure GPBFT","error":"failed to open certstore: getting latest power table: failed to find expected power table for instance 0: failed to unmarshal power table at instance 0: unmarshaling (*t)[i]: unmarshaling t.Power: cbor input for fil big int was not a byte string (6)"}
{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:190","msg":"F3 is starting","initialDelay":0,"hasPendingManifest":false,"NetworkName":"calibrationnet","BootstrapEpoch":2081674,"Finality":900,"InitialPowerTable":"bafy2bzaceab236vmmb3n4q4tkvua2n4dphcbzzxerxuey3mot4g3cov5j3r2c","CommitteeLookback":10}

F3#Start calls startInternal, which fails in openCertstore because (I assume) I had an earlier version of F3 running on both my calibnet and mainnet nodes, and the certificate format has changed slightly since then, so the stored data won't unmarshal. It then falls into the manifest-update loop and never proceeds. It's been running like this for weeks (months?): not activating, just sitting. My logs show nothing beyond the "F3 is starting" line above.

I think this is because startInternal needs to get to newRunner for things to really start ticking, but it fails and returns the error at openCertstore so then it can't even get a new manifest from the network? I'm hazy on what happens here, but it can't progress.

I fixed it by adding cs.DeleteAll to the error cases in here:

if b, err := cs.ds.Get(ctx, cs.keyForPowerTable(instance)); err != nil {
	return nil, fmt.Errorf("failed to load power table at instance %d: %w", instance, err)
} else if err := powerTable.UnmarshalCBOR(bytes.NewReader(b)); err != nil {
	return nil, fmt.Errorf("failed to unmarshal power table at instance %d: %w", instance, err)
}

And then restarting the node after the first error. Not the most elegant solution but it would be nice if it recognised that it was corrupt, or a bad version, logged an error but continued on and recovered by starting from scratch.
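To make the "recognise corruption, log, and start from scratch" idea concrete, here is a rough, self-contained sketch. It is not go-f3 code: memStore, decodePower, loadPower, openOrReset and the key layout are all hypothetical stand-ins (a valid record here is just a decimal integer rather than a CBOR power table), and only decode failures trigger the wipe, so that ordinary read errors still surface.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// memStore is a hypothetical stand-in for the certstore's datastore.
type memStore map[string][]byte

// errCorrupt marks data that was read but could not be decoded.
var errCorrupt = errors.New("corrupt power table")

// key builds an illustrative (made-up) datastore key for an instance.
func key(instance uint64) string {
	return "/powertables/" + strconv.FormatUint(instance, 10)
}

// decodePower stands in for powerTable.UnmarshalCBOR.
func decodePower(b []byte) (int, error) {
	n, err := strconv.Atoi(string(b))
	if err != nil {
		return 0, fmt.Errorf("%w: %v", errCorrupt, err)
	}
	return n, nil
}

// loadPower reads and decodes the record stored for an instance.
func loadPower(s memStore, instance uint64) (int, error) {
	b, ok := s[key(instance)]
	if !ok {
		return 0, fmt.Errorf("failed to load power table at instance %d: not found", instance)
	}
	p, err := decodePower(b)
	if err != nil {
		return 0, fmt.Errorf("failed to unmarshal power table at instance %d: %w", instance, err)
	}
	return p, nil
}

// openOrReset tries to open the store. If the stored data cannot be decoded,
// it logs the problem, deletes everything, re-seeds instance 0 from the
// initial power table, and retries once. Other errors are returned unchanged.
func openOrReset(s memStore, initial int) (power int, reset bool, err error) {
	p, err := loadPower(s, 0)
	if err == nil {
		return p, false, nil
	}
	if !errors.Is(err, errCorrupt) {
		return 0, false, err
	}
	fmt.Printf("certstore unreadable, starting from scratch: %v\n", err)
	for k := range s {
		delete(s, k)
	}
	s[key(0)] = []byte(strconv.Itoa(initial))
	p, err = loadPower(s, 0)
	return p, true, err
}

func main() {
	s := memStore{key(0): []byte("not-a-number"), key(1): []byte("7")}
	p, reset, err := openOrReset(s, 42)
	fmt.Println(p, reset, err)
}
```

The point of doing the reset inside the open path is that startup can carry on in the same run, instead of needing a patched binary and a restart as I did.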

The fact that I have this problem on both nodes suggests that it may have come from an RC or some other non-master release (maybe; I can't be sure which versions I've run on both of these). So there's a nonzero chance that there are more people out there whose nodes have this problem, and I'm not sure what we could even tell them about fixing it other than to delete their entire datastore.

@rvagg
Member Author

rvagg commented Nov 28, 2024

Kuba pointed out that the network name should stop this from happening again: this should only have occurred on calibnet, and mainnet's name was bumped after the cert format was changed.
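As I understand the reasoning, it relies on stored data being scoped by network name, so a bumped name starts from an empty keyspace and never reads old-format records. A toy illustration of that idea follows; the nsKey helper, the key layout and the "calibrationnet2" name are made up for the example and are not taken from go-f3.

```go
package main

import "fmt"

// nsKey builds a datastore key scoped to a network name.
// The layout is hypothetical and only illustrates namespacing.
func nsKey(networkName string, instance uint64) string {
	return fmt.Sprintf("/f3/%s/powertables/%d", networkName, instance)
}

func main() {
	ds := map[string][]byte{}

	// A record written under the old network name, in the old format.
	ds[nsKey("calibrationnet", 0)] = []byte("old-format")

	// After a (hypothetical) name bump, lookups use a different prefix,
	// so the stale record is never found or decoded.
	_, found := ds[nsKey("calibrationnet2", 0)]
	fmt.Println(found) // prints "false"
}
```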

@rvagg rvagg closed this as completed Nov 28, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in F3 Nov 28, 2024