Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mithril] script updates #1785

Merged
merged 10 commits into from
Aug 10, 2024
Merged

[Mithril] script updates #1785

merged 10 commits into from
Aug 10, 2024

Conversation

TrevorBenson
Copy link
Collaborator

@TrevorBenson TrevorBenson commented Jul 26, 2024

Description

  1. Changes to the nginx mithril relay LB and squid mithril relay
    • Nginx mithril relay LB:
      • Uses stream protocol
      • proxy_connect_timeout of 10 seconds
      • max_fails set to 1
      • fail_timeout set to 10 seconds per relay (20 seconds for 2 relay configurations, 30 for 3, etc.)
    • Squid mithril relay:
      • Change to how ACL's and allow rulesare written for multiple block producers
      • Includes rules for non block producers (other relays, local relay IP, simplifies testing from other than block producer) etc.
  2. Enhances the verify_signer_registered function
    • Continues to check if mithril-signer has registered in the current epoch.
    • If the signer has registered for the current epoch it will also check if the signer registered two epochs earlier and reports if the signer is valid to sign certificates in the current epoch or not.
  3. Adds a new MITHRIL_HOME variable to the scripts/cnode-helper-scripts/env file.
    • Defaults to the current path of ${CNODE_HOME}/mithril.
    • Provides the SPO a way to set a unique path for where mithril environment and mithril data-stores will be located.
  4. Changes to a status code for checks on mithril minimum versions. Sourced from past conversation w/ @Scitz0
  5. Adds functions semantic_version_check and check_mithril_upgrade_safe
    • GitHub Actions workflow now uses check_mithril_upgrade_safe against the repo node-latest.txt to determine if changing mithril-latest.xt is acceptable.
  • Not yet implemented in guild-deploy.sh, but the check_mithril_upgrade_safe should support both CI Actions and production node/signer environments. Open for implementation discussions.

Where to start?

I'd suggest review of each commit vs. reviewing the files changed. The commit for Refactor mithril functions into mithril library. will be quite large. There is a general section of functions used by multiple scripts, and then functions are sorted into blocks by the script that uses them, client, relay or signer. Each section is alphabetically sorted.

Motivation and context

  1. Nginx relay/lb
  • Issues with current nginx config reported in Koios support channel.
  • Simplified testing of squid proxy when additional IPs are provided.
  • Offline relays were adding 60 second delays until nginx timeouts occured and LB moved onto the next relay.
  • Tuned the max fails to 1 and fail_timeout 10 seconds * # of relays.
  • Temporarily prevents re-using an offline relay if many requests get sent through the same LB
  1. Reduce confusion for new signer users who were not aware signing had a 2 epoch delay.
  2. Flexibility for SPO's to choose their own directories.
  3. Past discussions with @Scitz0
  4. Past discussions with @rdlrt

How has this been tested?

  1. Nginx relay/squid

    • Direct curl requests on the block producer:
      curl -4 --proxy http://127.0.0.1:3132 https://aggregator.release-mainnet.api.mithril.network/aggregator

    • Enabling tcpdump on port 3132 for all (squid) mithril relays, disabling the first relay in the mithril_relays upstream definition of the (nginx) mithril relay lb. Then timing the request while watching traffic from each mithril relay:
      time curl -4 --proxy http://127.0.0.1:3132 https://aggregator.release-mainnet.api.mithril.network/aggregator.

    • The first request takes 10.0x seconds, to succeed, subsequent requests take sub 1 second and traffic shows round robin over the remaining online mithril relays. If round robin reaches the offline mithril relay before the fail_timout has expired it gets skipped and the traffic is again sent to the first online (second in the list) upstream mithril relay.

  2. Manually

  3. Test virtual machines altering path to /opt/cardano/mithril

  4. Testing locally and in forked repository github actions

  5. Testing locally and in forked repository github actions
    image

@TrevorBenson TrevorBenson requested review from rdlrt and Scitz0 July 26, 2024 15:04
@TrevorBenson
Copy link
Collaborator Author

@gmoratorio This PR's for you.

@TrevorBenson
Copy link
Collaborator Author

@gmoratorio I found stream to work in the container at docker hub, which appears to be running nginx v1.25. However I also ran a different test where I found nginx v1.20.1 does not support stream nginx: [emerg] unknown directive "stream" in /etc/nginx/nginx.conf:7.

I am investigating various ubuntu versions now to see if/when stream support shows up.

@TrevorBenson
Copy link
Collaborator Author

  • hub.docker.com - nginx/1.25.4 - includes stream support
  • Ubuntu 20.04 - nginx/1.18.0 - no stream support
  • Ubuntu 24.04 - nginx/1.24.0 - no stream support

I did have success however adding the nginx stable repository to ubuntu 20.04 and updating to nginx 1.26. This guide describes the process.

@gmoratorio
Copy link
Contributor

Thanks @TrevorBenson , updating to nginx 1.26 with your instructions solved the stream issue in the nginx.conf. One strange thing I did notice after upgrading using those instructions, though, is when I tried to see the mithril service logs with sudo journalctl -u cnode-mithril-signer.service -f I started getting this error in the logs

Jul 26 21:24:54 bp.us-central1-a.c.stake-pool-428321.internal systemd[1]: /etc/systemd/system/cnode-mithril-signer.service:19: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.

sure enough, removing these two entries in the cnode-mithril-signer.service script

StandardOutput=syslog
StandardError=syslog

Resolves the logging error. But definitely something to look out for.

@TrevorBenson
Copy link
Collaborator Author

TrevorBenson commented Jul 26, 2024

Thanks @TrevorBenson , updating to nginx 1.26 with your instructions solved the stream issue in the nginx.conf. One strange thing I did notice after upgrading using those instructions, though, is when I tried to see the mithril service logs with sudo journalctl -u cnode-mithril-signer.service -f I started getting this error in the logs

Jul 26 21:24:54 bp.us-central1-a.c.stake-pool-428321.internal systemd[1]: /etc/systemd/system/cnode-mithril-signer.service:19: Standard output type syslog is obsolete, automatically updating to journal. Please update your unit file, and consider removing the setting altogether.

sure enough, removing these two entries in the cnode-mithril-signer.service script

StandardOutput=syslog
StandardError=syslog

Resolves the logging error. But definitely something to look out for.

For tracking purposes @gmoratorio and I spoke in DM. I suspect this might come from libsystemd0/noble-updates 255.4-1ubuntu8.2 amd64 [upgradable from: 255.4-1ubuntu8.1] which appears to have been released a couple of weeks ago.

@TrevorBenson TrevorBenson marked this pull request as draft July 30, 2024 05:07
@TrevorBenson
Copy link
Collaborator Author

Converted from ready to draft.

Now that I have confirmed that the nginx stable repository did not cause the syslog errors observed by @gmoratorio I feel it is likely acceptable to enable this repository to get a minimized nginx load balancer configuration using stream proxy module.

Also, now that #1789 has merged I'm going to review all mithril scripts and mithril.library for a small refactor on user input etc.

I'll move this PR back to ready for review later this week once complete.

@TrevorBenson TrevorBenson force-pushed the revised-mithril-relay-lb-with-timeouts branch 3 times, most recently from bb54843 to ccf4e33 Compare August 7, 2024 22:15
@TrevorBenson TrevorBenson force-pushed the revised-mithril-relay-lb-with-timeouts branch from ccf4e33 to f951c7d Compare August 8, 2024 00:00
@TrevorBenson TrevorBenson force-pushed the revised-mithril-relay-lb-with-timeouts branch from f951c7d to f75a2f9 Compare August 8, 2024 00:09
@TrevorBenson TrevorBenson marked this pull request as ready for review August 8, 2024 00:13
@TrevorBenson TrevorBenson changed the title Changes to the nginx mithril relay LB and squid mithril relay [Mithril] script updates Aug 8, 2024
@TrevorBenson TrevorBenson force-pushed the revised-mithril-relay-lb-with-timeouts branch from ab34c90 to 1ef3670 Compare August 8, 2024 15:30
@TrevorBenson TrevorBenson requested a review from rdlrt August 8, 2024 15:30
@TrevorBenson TrevorBenson requested a review from rdlrt August 9, 2024 05:26
@rdlrt rdlrt merged commit 1b0b6b0 into alpha Aug 10, 2024
1 of 3 checks passed
@rdlrt rdlrt deleted the revised-mithril-relay-lb-with-timeouts branch August 10, 2024 01:15
rdlrt added a commit that referenced this pull request Aug 13, 2024
## Description
1. Changes to the nginx mithril relay LB and squid mithril relay
   * Nginx mithril relay LB:
     * Uses stream protocol
     * proxy_connect_timeout of 10 seconds
     * max_fails set to 1
* fail_timeout set to 10 seconds per relay (20 seconds for 2 relay
configurations, 30 for 3, etc.)
   * Squid mithril relay:
* Change to how ACL's and allow rulesare written for multiple block
producers
* Includes rules for non block producers (other relays, local relay IP,
simplifies testing from other than block producer) etc.
2. Enhances the verify_signer_registered function
* Continues to check if mithril-signer has registered in the current
epoch.
* If the signer has registered for the current epoch it will also check
if the signer registered two epochs earlier and reports if the signer is
valid to sign certificates in the current epoch or not.
3. Adds a new MITHRIL_HOME variable to the
`scripts/cnode-helper-scripts/env` file.
   * Defaults to the current path of `${CNODE_HOME}/mithril`.
* Provides the SPO a way to set a unique path for where mithril
environment and mithril data-stores will be located.
4. Changes to a status code for checks on mithril minimum versions.
Sourced from past conversation w/ @Scitz0
5. Adds functions **semantic_version_check** and
**check_mithril_upgrade_safe**
* GitHub Actions workflow now uses **check_mithril_upgrade_safe**
against the repo `node-latest.txt` to determine if changing
`mithril-latest.xt` is acceptable.
* Not yet implemented in `guild-deploy.sh`, but the
**check_mithril_upgrade_safe** should support both CI Actions and
production node/signer environments. Open for implementation
discussions.

## Where to start?
I'd suggest review of each commit vs. reviewing the files changed. The
commit for **Refactor mithril functions into mithril library.** will be
quite large. There is a general section of functions used by multiple
scripts, and then functions are sorted into blocks by the script that
uses them, client, relay or signer. Each section is alphabetically
sorted.

## Motivation and context
1. Nginx relay/lb
 * Issues with current nginx config reported in Koios support channel.
 * Simplified testing of squid proxy when additional IPs are provided.
* Offline relays were adding 60 second delays until nginx timeouts
occured and LB moved onto the next relay.
 * Tuned the max fails to 1 and fail_timeout 10 seconds * # of relays.
* Temporarily prevents re-using an offline relay if many requests get
sent through the same LB
2. Reduce confusion for new signer users who were not aware signing had
a 2 epoch delay.
3. Flexibility for SPO's to choose their own directories.
4. Past discussions with @Scitz0
5. Past discussions with @rdlrt

## How has this been tested?
1. Nginx relay/squid
   * Direct curl requests on the block producer:
`curl -4 --proxy http://127.0.0.1:3132
https://aggregator.release-mainnet.api.mithril.network/aggregator`

* Enabling tcpdump on port 3132 for all (squid) mithril relays,
disabling the first relay in the mithril_relays upstream definition of
the (nginx) mithril relay lb. Then timing the request while watching
traffic from each mithril relay:
`time curl -4 --proxy http://127.0.0.1:3132
https://aggregator.release-mainnet.api.mithril.network/aggregator`.

* The first request takes 10.0x seconds, to succeed, subsequent requests
take sub 1 second and traffic shows round robin over the remaining
online mithril relays. If round robin reaches the offline mithril relay
before the fail_timout has expired it gets skipped and the traffic is
again sent to the first online (second in the list) upstream mithril
relay.

2. Manually
3. Test virtual machines altering path to `/opt/cardano/mithril`
4. Testing locally and in forked repository github actions
5. Testing locally and in forked repository github actions

![image](https://github.com/user-attachments/assets/3838f427-1f38-4808-8188-eb517def0153)

---------

Co-authored-by: RdLrT <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants