Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(recovery): add crash recovery implementation #3491

Draft
wants to merge 10 commits into
base: satp-dev
Choose a base branch
from

Conversation

Yogesh01000100
Copy link
Contributor

@Yogesh01000100 Yogesh01000100 commented Aug 20, 2024

This PR addresses issue #3114

Changes

  • Crash Recovery: Updated crash_recovery.proto and related ts files; added core recovery logic (created functions not yet implemented).
  • Knex Config: Improved knexfile.ts and knexfile-remote.ts; added Docker Compose for production.

@RafaelAPB
Copy link
Contributor

I will review this PR

@RafaelAPB
Copy link
Contributor

@Yogesh01000100 please rebase with satp-dev (should not have conflicts)

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 0de9744 to 4c0124d Compare August 23, 2024 17:19
@Yogesh01000100 Yogesh01000100 changed the title feat: add crash recovery and knex config for production feat(recovery): add crash recovery implementation Aug 25, 2024
@RafaelAPB
Copy link
Contributor

@Yogesh01000100 please include documentation and tests, and update the description, as discussed.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from ce9a179 to 24b8eaf Compare September 8, 2024 07:44
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 24b8eaf to 728e7cb Compare September 16, 2024 18:56
@RafaelAPB
Copy link
Contributor

@Yogesh01000100 could you please squash the commits and rebase with latest version of satp-dev, prior to merge?

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 1a55673 to 21ad772 Compare September 17, 2024 10:11
Copy link
Contributor

@RafaelAPB RafaelAPB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks very good, but there are some changes to be done prior to merging.
Summarizing my comments:

  1. Add other authors to the commit
  2. Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
  3. Implement RollbackState (for example, should state how many more steps are to be rolled-back, at any moment; what was rolledback already; estimated time to completion, etc)
  4. Please add tests that support the new feature
  5. Please add comprehensive documentation on this feature. Example: The readme of SATP should have a section on how to run the docker compose with several examples of configurations.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 49e1135 to fb703b4 Compare October 16, 2024 19:57
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from fb703b4 to b30ccb5 Compare November 3, 2024 19:16
Copy link
Contributor

@LordKubaya LordKubaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review how sessionData is being used, and take a look at the Stage 3 question.
Please document the new code as well. The rest is being documented in this PR:
https://github.com/hyperledger/cacti/pull/3619


private async checkCrash(session: SATPSession): Promise<CrashStatus> {
const fnTag = `${this.className}#checkCrash()`;
const sessionData = session.hasClientSessionData()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this prioritize client session?
In this Implementation the gateway can be a client and a server at the same time. So, when this is the case we may not be deteting some crashes.

public async checkAndResolveCrash(session: SATPSession): Promise<void> {
const fnTag = `${this.className}#checkAndResolveCrash()`;

const sessionData = session.hasClientSessionData()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

public async handleRecovery(session: SATPSession): Promise<boolean> {
const fnTag = `${this.className}#handleRecovery()`;

const sessionData = session.hasClientSessionData()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

throw new Error(`${fnTag}, session data is not correctly initialized`);
}
const sessionData = session.hasClientSessionData()
? session.getClientSessionData()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too


this.log.info(`${fnTag} Asset Id: ${assetId} amount: ${amount}`);

await bridgeManager.burnAsset(assetId, Number(amount));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Stage 3 rollback, the rollback is only feasible if it occurs before the asset is minted on the receiver chain. Once minting happens, if the gateway encounters an issue and fails to assign the minted amount to the recipient, a rollback can no longer be initiated. Is this the reason why a rollback isn’t considered in such cases?

Wouldn't it make sense, then, for the minted amount on the receiver chain to be burned and re-minted on the source chain too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client: true,
});

const sessionData = mockSession.hasClientSessionData()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to consider this carefully. If sessionData remains as it is, we must handle it with care and clearly differentiate between the client and server sides of the gateway. I designed the sessionData this way to ensure that a gateway can act as both a client and server to itself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect(result).toBe(true);
});

/*it("intitiate rollback function test", async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason for this to be commented out?

Yogesh01000100 and others added 6 commits November 21, 2024 11:37
Signed-off-by: Yogesh01000100 <[email protected]>

chore(satp-hermes): improve DB management

Signed-off-by: Rafael Belchior <[email protected]>

chore(satp-hermes): crash recovery architecture

Signed-off-by: Rafael Belchior <[email protected]>

fix(recovery): enhance crash recovery and rollback implementation

Signed-off-by: Yogesh01000100 <[email protected]>

refactor(recovery): consolidate logic and improve SATP message handling

Signed-off-by: Yogesh01000100 <[email protected]>

feat(recovery): add rollback implementations

Signed-off-by: Yogesh01000100 <[email protected]>

fix: correct return types and inits

Signed-off-by: Yogesh01000100 <[email protected]>
Co-authored-by: Rafael Belchior <[email protected]>
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from d14f178 to 4eef528 Compare November 26, 2024 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants