I've been running corrupt F3 Lotus nodes for who-knows-how-long and only realised today when I started poking at the F3 APIs and it kept on telling me it wasn't running.
I eventually debugged my way to this problem:
{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:308","msg":"resuming F3 internals"}
{"level":"error","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:184","msg":"failed to reconfigure GPBFT","error":"failed to open certstore: getting latest power table: failed to find expected power table for instance 0: failed to unmarshal power table at instance 0: unmarshaling (*t)[i]: unmarshaling t.Power: cbor input for fil big int was not a byte string (6)"}
{"level":"info","ts":"2024-11-26T13:41:32.593+1100","logger":"f3","caller":"[email protected]/f3.go:190","msg":"F3 is starting","initialDelay":0,"hasPendingManifest":false,"NetworkName":"calibrationnet","BootstrapEpoch":2081674,"Finality":900,"InitialPowerTable":"bafy2bzaceab236vmmb3n4q4tkvua2n4dphcbzzxerxuey3mot4g3cov5j3r2c","CommitteeLookback":10}
F3#Start calls startInternal, which fails on openCertstore because (I assume) I had an earlier version of F3 running on both my calibnet and mainnet nodes, and the certificate format has since changed slightly, so the stored data won't unmarshal. It then falls into the manifest-update loop and doesn't proceed. It's been running for weeks (months?) like this, not activating, just sitting. My logs show nothing beyond the "F3 is starting" line above.
I think this is because startInternal needs to get to newRunner for things to really start ticking, but it fails and returns the error at openCertstore so then it can't even get a new manifest from the network? I'm hazy on what happens here, but it can't progress.
I fixed it by adding cs.DeleteAll to the error cases in go-f3/certstore/certstore.go (lines 264 to 268 in c2f99cf):
return nil, fmt.Errorf("failed to unmarshal power table at instance %d: %w", instance, err)
}
And then I restarted the node after the first error. It's not the most elegant solution, but it would be nice if F3 recognised that the store was corrupt or from a bad version, logged an error, and then recovered by starting from scratch.
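To make the "log an error, wipe, and start from scratch" idea concrete, here is a minimal sketch in Go. It does not use the go-f3 API: `memStore`, `openStore`, `openOrReset` and `errCorrupt` are all hypothetical names, and the "is this a CBOR byte string" check is only a toy stand-in for the real power table unmarshalling.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errCorrupt is a hypothetical sentinel for data the current format cannot decode.
var errCorrupt = errors.New("certstore: undecodable data")

// memStore is an in-memory stand-in for the on-disk datastore (assumption).
type memStore map[string][]byte

// DeleteAll removes every entry, mirroring the wipe described above.
func (s memStore) DeleteAll() {
	for k := range s {
		delete(s, k)
	}
}

// openStore mimics opening the store: it fails when the stored power table
// does not start with a CBOR byte string header (major type 2). This is a toy
// check, not the real decoder.
func openStore(s memStore) error {
	raw, ok := s["powertable/0"]
	if !ok {
		return nil // empty store: nothing to decode, start from scratch
	}
	if len(raw) == 0 || raw[0]>>5 != 2 {
		return fmt.Errorf("failed to unmarshal power table at instance 0: %w", errCorrupt)
	}
	return nil
}

// openOrReset tries to open the store. If the contents are undecodable it
// logs the error, wipes the store and opens it again. Any other error is
// returned untouched. The boolean reports whether a wipe happened.
func openOrReset(s memStore) (bool, error) {
	err := openStore(s)
	if err == nil {
		return false, nil
	}
	if !errors.Is(err, errCorrupt) {
		return false, err
	}
	log.Printf("certstore unreadable, wiping and starting from scratch: %v", err)
	s.DeleteAll()
	return true, openStore(s)
}
```

Wiping only on a decode error, rather than on any failure to open, is a deliberate choice in this sketch: a transient I/O error should not throw away a perfectly good store.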
The fact that I have this problem on both nodes suggests that this may have come from an RC or some other non-master release (maybe, I can't be sure what versions I've run on both of these). So there's a nonzero chance there are more people out there with nodes with this problem, and I'm not sure what we could even tell them about fixing it other than to delete their entire datastore.
Kuba pointed out that the network name should stop this from happening again: this should only have occurred on calibnet, and mainnet's name was bumped after the cert format was changed.