State distribution #292

Merged
merged 20 commits into main from state-distribution on Oct 12, 2024

Conversation

Contributor
@hosie commented Oct 10, 2024

Adds reliable messaging between nodes for the distribution of state, as per the handshake described at https://github.com/kaleido-io/paladin/blob/engine-docs/architecture/distributed-transaction-management.md#distribution-of-private-state-data

Also takes steps towards improving the threading model and error handling for the co-ordinator / privateTransactionManager orchestrator. (TODO: sort out consistent naming for these things.)

TODO:

  • refactor: move stateDistributer out of privatetxnmgr and into internal/statedistributer
  • clean up the usage of the identityLocator in transport messages. We are currently abusing the who part of who@where for internal component routing
  • use the flush writer to store the state on the receiving node (currently a new DB transaction is created for each one)
  • unit test coverage
    • some of the new code does not have unit tests, but is covered by the new component test, so overall coverage has not dropped below the threshold
  • manual exploratory test with the zeto domain. As far as I know this should work, but I don't know what I don't know
    • had to fix an issue where VerifierType was missing on some data structures
    • given that zeto does not require any signatures, the PostAssembly.Signatures array was never being initialised and failed the validation here https://github.com/kaleido-io/paladin/blob/09d9081f68185c5d9585472c35ce6f8ae5fa2d20/core/go/internal/domainmgr/private_smart_contract.go#L299
    • had to use the debugger and add a bunch of debug logging because we don't record the error reasons for failed transactions, or for transactions that are currently in a retry loop (and we sometimes don't even realise that we should be in a retry loop). The next priority is to improve this error handling and reporting; not sure whether it will make it into this PR before it gets merged.
  • add a white-box test for reliability - e.g. similar to the component test, but rather than use the real gRPC transport, use a fake transport that simulates network unreliability
  • restart recovery testing - test that message delivery is reliable across a node restart

The last 2 bullets may be deferred until we have implemented all of the other reliable message exchanges (endorsement requests, delegation to a remote coordinator, dispatch to a remote submitter, etc.).

I think there is potential for some generic code to be teased out of this and reused for the other cross-node message exchanges where we need reliable delivery, but for now I'll just look to get the code into a sensible structure as a step towards that.

One particular decision point that I could do with @peterbroadhurst's review on is that I added an optional parameter to StateManager.GetSchema where the caller can choose to pass a database transaction. Previously, if this was called during a DB transaction (e.g. WriteReceivedStates) then it always created a new transaction, which flat out doesn't work on SQLite and would be wasteful of DB resources on Postgres. I did consider calling GetSchema before calling WriteReceivedStates to force the schema to be cached, but wasn't completely comfortable that would always be safe.
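
A minimal sketch of the shape I mean is below (illustrative only: the function, the schemaRow type and the column names are placeholders rather than the real StateManager API; it assumes gorm, which the persistence layer already uses, plus the standard context package):

// Sketch only: an optional dbTX parameter lets the caller supply an in-flight
// transaction instead of forcing a new one.
type schemaRow struct {
    ID         string `gorm:"column:id"`
    DomainName string `gorm:"column:domain_name"`
    Definition string `gorm:"column:definition"`
}

func getSchema(ctx context.Context, db *gorm.DB, dbTX *gorm.DB, domainName, schemaID string) (*schemaRow, error) {
    conn := dbTX
    if conn == nil {
        // no caller-supplied transaction: run as a standalone query
        conn = db
    }
    var row schemaRow
    err := conn.WithContext(ctx).
        Table("schemas").
        Where("domain_name = ? AND id = ?", domainName, schemaID).
        First(&row).Error
    if err != nil {
        return nil, err
    }
    return &row, nil
}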

Signed-off-by: John Hosie <[email protected]>
@hosie marked this pull request as ready for review October 12, 2024 08:40
}
}

_, err := rsw.stateManager.WriteReceivedStates(ctx, tx, values[0].DomainName, stateUpserts)
Contributor Author

We are using the DomainName from the first value in the array, but it applies to all stateUpserts, so we are making an assumption that they are all on the same domain. This is a safe assumption while WriteKey returns DomainName https://github.com/kaleido-io/paladin/pull/292/files#diff-81f7ddfed27a2e83ea5d097162f6fbcc56a85dce3836841563a3dbd3ec353e0dR51

However, I feel that this could be a brittle assumption in future, and I wonder whether we want to find some way to codify this assertion.

Contributor

This is a safe assumption while WriteKey returns DomainName

Not sure I understand why this helps. It definitely means that all writes for the same domain go to the same writer, but it doesn't give us any assurance that all writes in the batch are for the same domain.

I think the logic in this function needs to group the writes in a map[string][]*components.StateUpsertOutsideContext by domain, and then call rsw.stateManager.WriteReceivedStates multiple times (once per domain).

Or we update WriteReceivedStates to take the domain in each record.

Happy with either, and happy to help with that (but after I make the task switch)
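
As an illustrative sketch of the grouping option (the per-value field names DomainName and StateUpsert are assumptions about the batch value type, not the actual code):

// Sketch: group the batch by domain, then call WriteReceivedStates once per
// domain, all inside the same flush-writer transaction.
byDomain := make(map[string][]*components.StateUpsertOutsideContext)
for _, v := range values {
    byDomain[v.DomainName] = append(byDomain[v.DomainName], v.StateUpsert) // field name assumed
}
for domainName, upserts := range byDomain {
    if _, err := rsw.stateManager.WriteReceivedStates(ctx, tx, domainName, upserts); err != nil {
        return nil, err // propagate however the surrounding batch handler does today
    }
}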

Contributor Author

Ah, OK. Worse than I thought then. I'll fix this. I have a similar situation in my next PR too, so I'll get it right there off the bat.

Contributor Author

Will probably copy the pattern proven out here: 647d234

nodeID: nodeID,
}
sd.acknowledgementWriter = NewAcknowledgementWriter(ctx, sd.persistence, &conf.AcknowledgementWriter)
sd.receivedStateWriter = NewReceivedStateWriter(ctx, stateManager, persistence, &conf.ReceivedStateWriter)
Contributor Author

Was in two minds here whether to combine these into a single writer and therefore a single pool of flush workers. Decided to keep them separate in the interest of simpler code (albeit more LOC) and to allow finer-grained tuning through config. Since making this decision I have become less confident that it is the correct one, and I think I might come back to this and fold them into one. But interested in review comments to sway my thinking.

Contributor

😄 - I'd indeed made comments earlier in the PR review on this. Merging into one seems good to me, but I also don't think it's urgent.

Maybe it goes along with resolving the different-domain-per-state issue discussed above

Contributor Author

In the PR I am currently working on for error handling, I am taking the single-writer approach in privatetxmgr (for persisting dispatches to public and for finalizing reverts to TxMgr), so I will work out an idiomatic code structure for that and maybe copy the pattern here later.

Contributor

@peterbroadhurst left a comment

This is really great @hosie - having an eventually-consistent state distribution model is an approach I really like.

  • Sender records that it's trying to get states to the receiver
  • Sender records persistently, but lazily, that it's got the acknowledgement
  • On restart (and, I think, periodically going forwards) the sender tries to re-send states

This feels both performant, and resilient.

Some comments as I went through, for you/I to factor in at an appropriate point. Nothing at all that stops this going into main though, so merging.
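
As a rough, illustrative sketch of the sender-side loop that model implies (the names and fields here are invented for illustration, not the types in this PR; assumes the standard context and time packages):

// Sketch of the eventually-consistent sender described above: durable record
// first, best-effort send, lazy ack persistence, re-send of anything un-acked
// on startup and then periodically.
type pendingDistribution struct {
    ID      string // unique distribution ID, echoed back in the acknowledgement
    StateID string
    Party   string // identity we are distributing the state to
}

type senderSketch struct {
    resendInterval time.Duration
    acks           chan string // distribution IDs acknowledged by receivers
}

func (s *senderSketch) run(
    ctx context.Context,
    send func(pendingDistribution) error,     // fire-and-forget transport send
    loadUnacked func() []pendingDistribution, // read the durable record of un-acked sends
    persistAck func(distributionID string),   // lazily record the ack (flush writer in the PR)
) {
    // First pass runs immediately, so anything left un-acked before a restart
    // is re-sent without waiting for the first tick.
    for _, d := range loadUnacked() {
        _ = send(d)
    }
    ticker := time.NewTicker(s.resendInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case id := <-s.acks:
            persistAck(id)
        case <-ticker.C:
            for _, d := range loadUnacked() {
                _ = send(d) // best effort; it stays recorded until acknowledged
            }
        }
    }
}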

}

type StateDistributerConfig struct {
AcknowledgementWriter FlushWriterConfig `json:"acknowledgementWriter"`
Contributor

Conversation for another day - but I wonder how valuable it is to have lots of individual ones of these, with individual thread pools, vs. just a single one.

If we go for lots - I think there's a bit of work I should do to have the go-routines spin up on demand rather than sitting around indefinitely as they do today.
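
A rough sketch of what on-demand spin-up could look like (illustrative only, not the current flushwriter code; assumes the standard context and sync packages):

// Sketch: start the worker goroutine lazily on first use instead of at
// construction time. Idle shutdown is omitted but would follow the same shape.
type lazyWriter struct {
    startOnce sync.Once
    work      chan func()
}

func (w *lazyWriter) submit(ctx context.Context, op func()) {
    w.startOnce.Do(func() {
        w.work = make(chan func(), 50)
        go func() {
            for queued := range w.work {
                queued() // the real writer would batch these and flush in one DB transaction
            }
        }()
    })
    select {
    case w.work <- op:
    case <-ctx.Done():
    }
}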

@@ -233,6 +238,10 @@ func newInstanceForComponentTesting(t *testing.T, domainRegistryAddress *tktypes
},
}

//uncomment for debugging
//i.conf.DB.SQLite.DSN = "./sql." + i.name + ".db"
Contributor

This pokes a bit on why PostgreSQL isn't supported in some of these tests.

  • Being able to switch to postgres is extremely helpful for debugging - I do this a lot for SQL debugging in the tests that support it and run with both in the build
  • In the build we should run these more complete component tests with PSQL as well as SQLite; or if we can only do one, it should be PSQL, as that's the one that's most important

Contributor Author

I had assumed SQLite in-memory was preferred for speed of test execution. I guess we could use the t.Short flag to control that, and maybe have a gradle testShort task for developers to run if they want a quick test run that doesn't interrupt their dev flow too much. There are some other tests that I'd probably just skip altogether in short mode.
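
As a sketch of that switch (function names and DSNs here are placeholders; it just uses the standard testing package):

// Sketch: choose the database per test mode.
func testDSN() string {
    if testing.Short() {
        return ":memory:" // fast developer loop on in-memory SQLite
    }
    // full / CI run: PostgreSQL, the configuration that matters most
    return "postgres://postgres:password@localhost:5432/postgres?sslmode=disable"
}

func TestSlowComponentFlow(t *testing.T) {
    if testing.Short() {
        t.Skip("skipped in -short mode")
    }
    // ... full component test against the DSN from testDSN() ...
}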

@@ -10,5 +10,26 @@ CREATE TABLE dispatches (

CREATE UNIQUE INDEX dispatches_public_private ON dispatches("public_transaction_address","public_transaction_nonce","private_transaction_id");

CREATE TABLE state_distributions (
Contributor

I think we're going to want timestamps in these tables, if we're taking the overhead of using the DB as a log (and resilient queue) of what we have sent / received.

We'll be very likely to want to query this using a suitable JSON/RPC a lot in debugging scenarios... in fact adding that query is likely something I'll do quite soon after starting using this.

@@ -244,8 +244,8 @@ func (ir *identityResolver) handleResolveVerifierRequest(ctx context.Context, me
err = ir.transportManager.Send(ctx, &components.TransportMessage{
MessageType: "ResolveVerifierResponse",
CorrelationID: requestID,
ReplyTo: tktypes.PrivateIdentityLocator(fmt.Sprintf("%s@%s", IDENTITY_RESOLVER_DESTINATION, ir.nodeID)),
Destination: replyTo,
Component: IDENTITY_RESOLVER_DESTINATION,
Contributor

Really pleased to see this worked through 👍

MsgComponentIdentityResolverStartError = ffe("PD010030", "Error starting identity resolver")
MsgComponentAdditionalMgrInitError = ffe("PD010031", "Error initializing %s manager")
MsgComponentAdditionalMgrStartError = ffe("PD010032", "Error initializing %s manager")
MsgPrivateTxManagerInvalidEventMissingField = ffe("PD010033", "Invalid event: missing field %s")
Contributor

Noting wrong prefix on this error.

I have a TODO to sort out prefixes (public TX mgr in particular is not following convention) so happy to pick that up.

}

var stateDistributions []StateDistributionPersisted
err = sd.persistence.DB().Table("state_distributions").
Contributor

This query needs pagination, and probably also some consideration of fairness across the parties we're distributing to.

It's feasible, and probably likely, that we'll have situations where a party we're looking to share states with for transactions becomes unreachable for a long period. We build up a lot of cruft for them, which might never get resolved, but we also need to make sure we're distributing states efficiently to other parties.

Simplest is probably to run a state distribution worker per peer.
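
As an illustrative sketch of a per-peer, keyset-paginated read (the column names here are placeholders, not the real migration; assumes gorm as used above):

// Sketch: page through un-acknowledged distributions for one peer at a time,
// so a long-unreachable party cannot starve distribution to the others.
func loadPageForPeer(db *gorm.DB, peer string, afterID string, pageSize int) ([]StateDistributionPersisted, error) {
    var page []StateDistributionPersisted
    err := db.Table("state_distributions").
        Where("target_node = ?", peer).   // placeholder column
        Where("acknowledged = ?", false). // placeholder column
        Where("id > ?", afterID).         // keyset pagination on the primary key
        Order("id").
        Limit(pageSize).
        Find(&page).Error
    return page, err
}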

Contributor Author

This is a great point. I don't want to forget about this, so I have opened issue #299

}

func (sd *stateDistributer) sendStateAcknowledgement(ctx context.Context, domainName string, contractAddress string, stateId string, receivingParty string, distributingNode string, distributionID string) error {
log.L(ctx).Debugf("stateDistributer:sendStateAcknowledgement %s %s %s %s %s %s", domainName, contractAddress, stateId, receivingParty, distributingNode, distributionID)
Contributor

in logs it's super helpful to have domainName=%s in these key multi-input log lines
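
For example, the Debugf line above could become:

log.L(ctx).Debugf("stateDistributer:sendStateAcknowledgement domainName=%s contractAddress=%s stateId=%s receivingParty=%s distributingNode=%s distributionID=%s",
    domainName, contractAddress, stateId, receivingParty, distributingNode, distributionID)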

@peterbroadhurst merged commit 5bf1c7d into main Oct 12, 2024
3 checks passed
@peterbroadhurst deleted the state-distribution branch October 12, 2024 12:10