Equinix Move: Rebuild SmartOS Hosts #3731

Open
6 tasks done
ryanaslett opened this issue May 17, 2024 · 48 comments

@ryanaslett
Contributor

ryanaslett commented May 17, 2024

This is a sub-step of #3597.

The two release and four testing SmartOS hosts need to be replaced.
The current ones are running SmartOS 18 and SmartOS 20, which are both EOL.

In discussions with @bahamat, it was determined that the likely targets for SmartOS should change to SmartOS 21 and SmartOS 23.

I plan to provision the following hosts:

  • release-mnx-smartos21-x64-1
  • release-mnx-smartos23-x64-1
  • test-mnx-smartos21-x64-1
  • test-mnx-smartos21-x64-2
  • test-mnx-smartos23-x64-1
  • test-mnx-smartos23-x64-2

Once those are up, I'll connect them to Jenkins and we can run some preliminary tests to make sure they work and are accepting jobs.

There's a non-zero chance that something has broken between SmartOS 20 and SmartOS 23, so somebody with SmartOS/Node.js knowledge will likely need to manage the anticipated failures.

@targos
Member

targos commented May 18, 2024

Are release hosts needed? We stopped building SmartOS a while ago.

@ryanaslett
Contributor Author

I've provisioned the SmartOS test hosts and got Ansible to run through end to end: #3737
That being said, I don't yet have Jenkins admin access to fully add these machines, nor do I really know the next steps to verify that tests can actually run on them.

I've held off on provisioning the release hosts because @targos makes a good point that we probably do not need to provision machines that are not used.

@ryanaslett
Contributor Author

Discovered the supporting info that we do not need release nodes for SmartOS: #2168

@ryanaslett
Contributor Author

I have added the four SmartOS nodes to Jenkins and re-ran Ansible with their Jenkins secrets set; however, I do not yet have infra-level access, so I cannot change the firewall settings to allow these nodes to connect to Jenkins and verify that they are functioning properly.

@targos
Member

targos commented May 30, 2024

I added the IPs to the firewall rules. The 4 machines are connected to Jenkins now.

@ryanaslett
Contributor Author

Great. I added them with the smartos21-64 and smartos23-64 labels, and altered the node-test-commit-smartos job to attempt to run on them. https://ci.nodejs.org/job/node-test-commit-smartos/

Both smartos21 and smartos23 are failing during the make step, but this is where I don't think I can be of much assistance in getting these jobs to build successfully on SmartOS: I'm unfamiliar with the build/compilation process, and I'm unfamiliar with SmartOS in general.

Example failures:

https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos23-64/54688/console
https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos21-64/54688/console

Is there a point of contact we can ask in the SmartOS community to help get these jobs to build?

@targos
Member

targos commented May 31, 2024

@nodejs/platform-smartos PTAL.

@richardlau
Member

richardlau commented May 31, 2024

Example failures:

https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos23-64/54688/console

21:55:02 ../deps/v8/src/base/platform/platform-posix.cc:81:16: error: conflicting declaration of C function 'int madvise(caddr_t, std::size_t, int)'
21:55:02    81 | extern "C" int madvise(caddr_t, size_t, int);
21:55:02       |                ^~~~~~~
21:55:02 In file included from ../deps/v8/src/base/platform/platform-posix.cc:20:
21:55:02 /usr/include/sys/mman.h:268:12: note: previous declaration 'int madvise(void*, std::size_t, int)'
21:55:02   268 | extern int madvise(void *, size_t, int);
21:55:02       |            ^~~~~~~
21:55:02 make[2]: *** [tools/v8_gypfiles/v8_libbase.target.mk:209: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos23-64/out/Release/obj.target/v8_libbase/deps/v8/src/base/platform/platform-posix.o] Error 1
21:55:02 rm 7a898a1853eba0e19e78f94553f35c8ae3ef2670.intermediate

This looks like #3108 (comment), which was more or less a machine configuration issue with the system header files.

@ryanaslett
Contributor Author

It appears that SmartOS updated those header files:

illumos/illumos-gate@df5cd01#diff-e7ee1077e2440c045ad16995cdc8ecba3b955d39cab24eaa5f022fe737a234eaL38

So if I understand correctly, V8 checks whether it is compiling on Solaris and accounts for the atypical header declarations; in the meantime, SmartOS/illumos has updated those headers, causing V8 to no longer compile.

This also seems like only the first issue we'll encounter trying to get V8 to compile on modern SmartOS. It will likely take somebody dedicated time to get SmartOS building again, and it is upstream V8, not Node.js itself, that needs patching.

I'm unclear how to proceed here. I don't imagine we'll be able to get this resolved before we need to remove the smartos18/smartos20 hosts, so it appears SmartOS tests are going to be broken for the foreseeable future.

@bahamat

bahamat commented May 31, 2024

@ryanaslett I think we have patches for this. We can get @jperkin to take a look, but he’s currently out on vacation so it won’t be right away.

@jperkin

jperkin commented Jun 3, 2024

Yeh, we've been running with this patch for the nodejs pkgsrc builds for the last few years:

NetBSD/pkgsrc@527b29c

The problem is that you can't just update the prototypes, as that would break builds on older platforms, and there's no way to test which is available using the preprocessor.

I'd recommend applying a similar patch to just remove the prototypes (it's unclear why they were added in the first place but they certainly aren't required, at least on modern illumos or Solaris platforms) and probably the posix_madvise() chunk too.
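
For anyone reproducing this locally before such a patch lands, the clash is easy to confirm; a rough sketch (paths relative to the Node.js source tree, line numbers drift between releases):

$ grep -n 'int madvise' deps/v8/src/base/platform/platform-posix.cc
$ grep -n 'int madvise' /usr/include/sys/mman.h

As described above, the fix amounts to deleting the local prototypes and relying on the declarations in sys/mman.h instead.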

@mhdawson
Member

mhdawson commented Jun 5, 2024

@nodejs/build to facilitate this we have:

  1. created this copy of the smartos job, https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/, which currently pulls additional patches from PR #3753 (Jenkins: Add smartos stub patches and patch script) and runs on SmartOS 21 and 23

  2. I'm going to give @bahamat the ability to run that job unless there are any objections by tomorrow. EDIT: also added @jperkin, as mentioned below

The path forward would be:

  1. Somebody from the SmartOS community (@bahamat to identify) works to get the patch to apply. Once the patch applies and the job passes, we can update the PR and land it.
  2. Once the build PR is landed, we can move the changes from the test job into the regular SmartOS job. They include the lines that pull/apply the patch script and the change to run on SmartOS 21 and 23 instead of the older versions.
  3. If 2) cannot be completed by June 14, we will need to remove SmartOS testing from the PR testing job until it is.

@bahamat we'd also need you to test with the current Node.js versions which include 18, 20 and 22. The patch will either need to apply to all of them, or different patches will need to be applied for each version by the script. We can also configure the job to only run on a subset of those versions (for example latest) if that is what makes sense for the smartos community.
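
For illustration, that per-version patch step could look roughly like the sketch below; the file layout, patch names, and version probe here are assumptions for illustration, not the actual contents of #3753:

#!/bin/sh
# Hypothetical pre-build step: apply a SmartOS patch matching the Node.js major version.
major=$(awk '/#define NODE_MAJOR_VERSION/ {print $3}' src/node_version.h)
patch_file="smartos-patches/v${major}.patch"
if [ -f "$patch_file" ]; then
  patch -p1 < "$patch_file"
else
  echo "no SmartOS patch for Node.js ${major}; building unpatched"
fi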

@bahamat

bahamat commented Jun 6, 2024

@mhdawson The best person to handle that from my side is @jperkin.

@mhdawson
Member

mhdawson commented Jun 6, 2024

@bahamat, you and @jperkin should now be able to run that job.

@ryanaslett
Contributor Author

@bahamat / @jperkin I've invited you to collaborate on my build fork that contains the smartos patches. You should be able to adjust the patch, and then run the job.

@richardlau
Member

I noticed that even the newer SmartOS hosts are running Java 11 (according to https://ci.nodejs.org/computer/). Jenkins LTS is expecting to drop support for Java 11 in October, and will need Java 17 or later.

@bahamat

bahamat commented Sep 24, 2024

This is something that we've been dealing with in our Jenkins as well. OpenJDK is available in zone images 22.4.0 and later. Here's my recommendation:

  • Move build hosts to platform image 20210826T002459Z
  • Move oldest zone image to 22.4.0 (21.4.x will be EOL in 3 months)
  • Move newest zone image to 23.4.0 (or wait until January and move to 24.4.0, once it's available, or do 23.4.0 now and move to 24.4.0 in January)

@sxa
Member

sxa commented Oct 8, 2024

@bahamat Can you confirm if 22.4.0 has a build of OpenJDK17 and not just 11? I know there are third party patches to OpenJDK which will allow it to build and run, but the primary OpenJDK codebase no longer supports building JDK11 from it (I'm guessing there's a chance that SmartOS already incorporated such patches to build 11 though ;-) )

@jperkin

jperkin commented Oct 8, 2024

@sxa Yes, 22.4 includes openjdk17-17.0.5. 23.4 includes openjdk17-17.0.11 and openjdk21-21.0.3, and the upcoming 24.4 will include whatever the latest -ga tags are of each.

@bahamat

bahamat commented Oct 8, 2024

@bahamat Can you confirm if 22.4.0 has a build of OpenJDK17 and not just 11?

Yep. We’re already using it in our Jenkins agent instances.

See here: https://smartos.org/packages/set/2022Q4-x86_64/search/Openjdk
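
For reference, on a 22.4.0 zone image this should just be a package install; a sketch (package name taken from the search page above, not verified against the CI images):

$ pkgin search openjdk
$ pkgin -y install openjdk17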

@richardlau
Member

There was an ask in nodejs/node#55239 (comment) to use a newer Python than 3.8.

@bahamat

bahamat commented Oct 9, 2024

The 22.4.0 image has 3.9 through 3.11, so we're good there.

https://smartos.org/packages/set/2022Q4-x86_64/search/python3

@ryanaslett
Contributor Author

Status update:

I worked with @bahamat today to see if we can zero in on the configuration/provisioning discrepancies and find out why the preliminary host that I built is taking 3.5 hours to compile Node.js.

It appeared CPU-bound, so I re-provisioned on a larger instance with double the cores, and that build is currently running (https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/nodes=smartos23-x64/12/consoleFull).

It doesn't seem to have helped performance much, given how long it's taken to reach the same point in the build compared to the existing smartos20 machines at Equinix.

Given that it's an entirely new machine, my guess is that some of this may be due to the ccache not being populated yet.

Once the build that is running completes, I'll re-run it in the hope that a warmed cache will help things along.
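
A quick way to check that theory after the run finishes is to look at the cache statistics; a sketch (the size limit is illustrative):

$ ccache -s     # hit/miss statistics; near-zero hits on a fresh host would explain the slow first build
$ ccache -M 20G # make sure the cache is large enough to hold a full Node.js build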

@bahamat

bahamat commented Dec 14, 2024

Additional update on this: we discovered that caching played a huge role in the build times. Once the cache was properly primed, the new instance builds Node.js in about 38 minutes. So this unblocks things, and we'll be able to get the rest of the infra stood up pretty soon.

@ryanaslett
Contributor Author

I have provisioned one additional SmartOS 23 host and two additional SmartOS 22 hosts.

All hosts had an initial test run (each of which took around 11 hours) and a subsequent run that took less than an hour (ranging from 33 to 56 minutes): see builds 15/16/17/18 at https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/

I went ahead and modified https://ci.nodejs.org/job/node-test-commit-smartos/ to use the smartos22-x64 and smartos23-x64 labels, and removed the 18/20 machines from the rotation.

I will be monitoring for anomalies and failures over the weekend.

@bahamat

bahamat commented Dec 14, 2024

Awesome, thanks @ryanaslett!!

@ryanaslett
Contributor Author

PR for infra here: #3980

@ryanaslett
Contributor Author

ryanaslett commented Dec 14, 2024

TODO:

  • Add MNX.io credentials to the secrets repo
  • Add a readme to the build repo for SmartOS host provisioning documentation (I have this in a Google Keep document; it just needs to be transferred somewhere accessible to the whole team)
  • Verify that any other jobs linked to smartos18/20 (like libuv) are able to run on the smartos22/23 machines; once they do, remove the smartos 18/20 machines from the rotation and decommission them at Equinix
  • Update the JDK on the instances

@ryanaslett
Contributor Author

I'm struggling to figure something out with Ansible and it's frustrating. Somehow, when I provision smartos23-x64-5, the Jenkins agent on smartos23-x64-4 gets restarted with the secret key from x64-5. The same thing happens when I provision smartos22-x64-2: it resets the agent on smartos22-x64-1 to point at -2, with -2's secret key.

I run Ansible with -vvv and can see every SSH connection it makes, yet it is not connecting to the wrong host. Is there some sort of Jenkins mechanism that is configuring those agents? What is going on?

@ryanaslett
Contributor Author

Checked on libuv: autoconf needed to be installed on the SmartOS images, and then the jobs run (albeit with one test error: https://ci.nodejs.org/job/libuv-test-commit-smartos/2211/ ... I think that is an actual error rather than an environment configuration issue, but somebody who knows libuv will need to determine that. In any case, the error is present on SmartOS 18/20 as well).
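
(For the provisioning notes, assuming pkgin on those images, the missing piece was just something like:)

$ pkgin -y install autoconf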

@richardlau
Member

@ryanaslett / @bahamat FYI, the SmartOS build with Node.js 18 started failing over the weekend.

On SmartOS 23 it's failing to compile,
e.g. https://ci.nodejs.org/job/node-test-commit-smartos/58181/nodes=smartos23-x64/console

13:10:53 dtrace: failed to compile script src/v8ustack.d: line 264: failed to resolve V8DBG_SMITAG: Unknown variable name
13:10:54 make[2]: *** [node_dtrace_ustack.target.mk:26: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos23-x64/out/Release/obj.target/libnode/src/node_dtrace_ustack.o] Error 1

@bahamat

bahamat commented Dec 16, 2024

@richardlau Any idea when the last known working build was? Or how I might go about finding it? This seems like an unusual place for it to break on the oldest LTS branch.

Looks like this might be a change in GNU binutils. We're still looking into this, but we think we've seen something like this before.

@jperkin

jperkin commented Dec 16, 2024

The SmartOS 23 build hosts use

binutils-2.41       GNU binary utilities

whereas older hosts have binutils-2.39. This appears to be a binutils regression, where v8dbg_SmiTag is no longer in the objdump output used by genv8constants.py.

binutils-2.39:

$ objdump -z -D ./out/Release/libv8_base_without_compiler.a | grep SmiTag
0000000000002b70 <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterE>:
    2b87:       e9 00 00 00 00          jmp    2b8c <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterE+0x1c>
0000000000002b90 <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_>:
    2ba9:       e8 00 00 00 00          call   2bae <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_+0x1e>
    2bc8:       e9 00 00 00 00          jmp    2bcd <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_+0x3d>
00000000000004a0 <v8dbg_SmiTagMask>:
00000000000001a0 <v8dbg_SmiTag>:

binutils-2.41:

$ objdump -z -D ./out/Release/libv8_base_without_compiler.a | grep SmiTag
0000000000002b70 <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterE>:
    2b87:       e9 00 00 00 00          jmp    2b8c <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterE+0x1c>
0000000000002b90 <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_>:
    2ba9:       e8 00 00 00 00          call   2bae <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_+0x1e>
    2bc8:       e9 00 00 00 00          jmp    2bcd <_ZN2v88internal14TurboAssembler6SmiTagENS0_8RegisterES2_+0x3d>
00000000000004a0 <v8dbg_SmiTagMask>:

I'll see if there's any way we can work around this.
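
(A quick sanity check, in case it is useful: the symbol itself should still be present in the archive, and only the objdump -D listing drops it; nm can confirm that, assuming the same archive path as above:)

$ nm ./out/Release/libv8_base_without_compiler.a | grep v8dbg_SmiTag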

@ryanaslett
Contributor Author

The SmartOS 23 build hosts use

binutils-2.41       GNU binary utilities

whereas older hosts have binutils-2.39. This appears to be a binutils regression, where v8dbg_SmiTag is no longer in the objdump output used by genv8constants.py.

binutils-2.39:

....

I'll see if there's any way we can work around this.

Thank you. I never would have found this. Let me know what I should/can adjust on the smartos23 hosts.

@ryanaslett
Contributor Author

So the Node.js 24 builds still seem relatively quick, but the Node.js 18/20/22 builds all took in the neighborhood of 8 hours (though those were all fresh runs). We happened to distribute the jobs evenly between the matched pairs of instances such that no instance has run Node.js 18/20/22 more than once, so I'm not sure that is sufficient to populate the ccache for us.

We'll see on tomorrow's daily runs.

@jperkin

jperkin commented Dec 16, 2024

So it looks like we've entirely replicated Dan's previous debugging of this issue in https://sourceware.org/bugzilla/show_bug.cgi?id=32211

I can confirm that adding CXXFLAGS+= -fno-zero-initialized-in-bss in my test build fixes this for me. Is it easy enough for you to add this to the build environment or does this require a node patch?
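
For illustration, one way to add it in the build environment rather than as a Node.js patch is to append the flag to CXXFLAGS before building; a rough sketch for a manual run (the Jenkins jobs may wire this in differently):

$ export CXXFLAGS="${CXXFLAGS:+$CXXFLAGS }-fno-zero-initialized-in-bss"   # append, don't overwrite
$ ./configure
$ gmake -j8   # GNU make is installed as gmake on SmartOS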

@ryanaslett
Contributor Author

It depends on the scope/impact.

The configuration where we run into this issue is Node.js 18 on smartos23.

Will that flag impact any other Node.js versions? I.e., can it be set at the smartos23 host level, or does it need to be set only when Node.js 18 is being tested? (Also, I may be wrong, but I'm pretty sure there's almost no chance that Node.js 18 would be patched for this.)

@ryanaslett
Contributor Author

Also, would it be okay to set that flag on the SmartOS 22 hosts as well (the ones with binutils-2.39)?

@bahamat

bahamat commented Dec 17, 2024

Also, would it be okay to set that flag on the SmartOS 22 hosts as well (the ones with binutils-2.39)?

Yes, this can be applied to all builds.

@ryanaslett
Contributor Author

I'll set it at the Jenkins level as an injected env variable for SmartOS builds.

@ryanaslett
Contributor Author

Ah, re-reading this, I realize that it's merely a matter of adding that to the end of the CXXFLAGS variable. I may have to re-evaluate how we do this.

@ryanaslett
Contributor Author


Adjusted
I would kick off a daily build to verify that this works, but the build queue is currently tremendously clogged: we've had the SmartOS testing machines up for one whole day, and the ccaches got invalidated immediately by the work in nodejs/node#56275. That kind of thing is expected; I just wish it didn't happen less than 24 hours after getting these machines up.

It does underscore an issue with these machines: the tremendous discrepancy between cached build times (reasonable, under an hour) and fresh builds (8-11 hours, depending).

@aduh95
Contributor

aduh95 commented Dec 18, 2024

All SmartOS builds are failing with g++: error: unrecognized command-line option '-fno-zero-initialized-in-bss -fno-zero-initialized-in-bss' now

@aduh95
Contributor

aduh95 commented Dec 18, 2024

@ryanaslett any chance you had time to look at the compilation issue mentioned in my previous comment? SmartOS builds are failing on all PRs right now, which is frustrating.

@bahamat

bahamat commented Dec 18, 2024

Somehow that option got in there twice, and I think that’s what’s causing this.

@ryanaslett
Contributor Author

@aduh95 That was an error in how Jenkins was injecting those env variables. I had remedied that, and builds were running with the adjusted fix.

However, the builds were in the midst of a long-running ccache rebuild as a result of nodejs/node#56275 and the related work, and you cancelled those running builds before they finished.

Builds are passing now (https://ci.nodejs.org/job/node-test-commit-smartos/58244/), but tonight the daily build that we were trying to fix (see #3731 (comment)) will again take a very long time to get its ccaches warmed up and back to performing normally.

Please do not abort these long-running SmartOS builds. They need to complete for us to get to a state where we have performant CI runners.

@aduh95
Contributor

aduh95 commented Dec 19, 2024

I will keep aborting those as long as I need to get the release proposal ready on time, sorry not sorry. Getting those long running jobs through is something that needs to happen eventually I agree, but it seems like a low priority in comparison to not slowing down the day-to-day activities of the project (i.e. landing PRs) and the time sensitive ones (i.e. releasing Node.js). I'm hopeful the caches will populate over the weekend when the CI is less busy.

Anyway, thanks for fixing the builds!

@ryanaslett
Contributor Author

The replacement SmartOS nodes have been successfully running builds for a good while now. I plan on decommissioning the old SmartOS nodes and removing them from Jenkins.
