Equinix Move: Rebuild SmartOS Hosts #3731
Are release hosts needed? We stopped building SmartOS a while ago.
I've provisioned the SmartOS test hosts and got Ansible to run through end to end: #3737. I've held off on provisioning the release hosts because @targos makes a good point that we probably do not need to provision machines that are not used.
Discovered the supporting info that we do not need release nodes for SmartOS: #2168
I have added the four SmartOS nodes to Jenkins and re-ran Ansible with their Jenkins secrets set. However, I do not yet have infra-level access, so I cannot change the firewall settings to allow these nodes to connect to Jenkins to verify that they are functioning properly.
I added the IPs to the firewall rules. The 4 machines are connected to Jenkins now.
Great. I added them with the smartos21-64 and smartos23-64 labels and altered the node-test-commit-smartos job to attempt to run on them: https://ci.nodejs.org/job/node-test-commit-smartos/ Both smartos21 and smartos23 are failing during the make step, but this is where I don't think I can be of much assistance in getting these jobs to build successfully on SmartOS: I'm unfamiliar with the build/compilation process, and I'm unfamiliar with SmartOS in general. Example failure: https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos23-64/54688/console Is there a point of contact in the SmartOS community we can ask to help get these jobs to build?
@nodejs/platform-smartos PTAL.
21:55:02 ../deps/v8/src/base/platform/platform-posix.cc:81:16: error: conflicting declaration of C function 'int madvise(caddr_t, std::size_t, int)'
21:55:02 81 | extern "C" int madvise(caddr_t, size_t, int);
21:55:02 | ^~~~~~~
21:55:02 In file included from ../deps/v8/src/base/platform/platform-posix.cc:20:
21:55:02 /usr/include/sys/mman.h:268:12: note: previous declaration 'int madvise(void*, std::size_t, int)'
21:55:02 268 | extern int madvise(void *, size_t, int);
21:55:02 | ^~~~~~~
21:55:02 make[2]: *** [tools/v8_gypfiles/v8_libbase.target.mk:209: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos23-64/out/Release/obj.target/v8_libbase/deps/v8/src/base/platform/platform-posix.o] Error 1
21:55:02 rm 7a898a1853eba0e19e78f94553f35c8ae3ef2670.intermediate
Looks like #3108 (comment), which was sort of a machine-configuration issue with the system header files.
It appears that SmartOS updated those header files. If I understand correctly, V8 checks whether it is compiling on Solaris and accounts for an atypical header file structure; in the meantime, SmartOS/illumos has updated that structure, causing V8 to no longer compile. This also seems like only the first issue we'll encounter trying to get V8 to compile on modern SmartOS, and getting SmartOS building again will likely take somebody some dedicated time. Also, it is upstream V8, not Node.js itself, that needs patching. I'm unclear how to proceed here. I don't imagine we'll be able to get this resolved before we need to remove the smartos18/smartos20 hosts, so it appears SmartOS tests are going to be broken for the foreseeable future.
@ryanaslett I think we have patches for this. We can get @jperkin to take a look, but he's currently out on vacation so it won't be right away.
Yeah, we've been running with this patch for the Node.js pkgsrc builds for the last few years. The problem is that you can't just update the prototypes, as that would break builds on older platforms, and there's no way to test which is available using the preprocessor. I'd recommend applying a similar patch to just remove the prototypes (it's unclear why they were added in the first place, but they certainly aren't required, at least on modern illumos or Solaris platforms) and probably the
@nodejs/build to facilitate this we have:
The path forward would be:
@bahamat we'd also need you to test with the current Node.js versions, which include 18, 20, and 22. The patch will either need to apply to all of them, or different patches will need to be applied for each version by the script. We can also configure the job to only run on a subset of those versions (for example, latest) if that makes sense for the SmartOS community.
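If different patches per release line turn out to be necessary, the selection step a script could perform might be sketched roughly like this. The patch file names and the per-line mapping are hypothetical, for illustration only; the actual release scripts are not shown in this thread.

```shell
# Hypothetical per-release-line patch selection. The file names below are
# placeholders, not real patches from the nodejs/build repository.
select_patch() {
  case "$1" in
    18) echo "v8-smartos-node18.patch" ;;
    20) echo "v8-smartos-node20.patch" ;;
    22) echo "v8-smartos-node22.patch" ;;
    *)  echo "none" ;;  # newer lines assumed to build unpatched
  esac
}

select_patch 18   # -> v8-smartos-node18.patch
select_patch 24   # -> none
```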
I noticed that even the newer SmartOS hosts are running Java 11 (according to https://ci.nodejs.org/computer/). Jenkins LTS is expecting to drop support for Java 11 in October, and will need Java 17 or later. |
This is something we've been dealing with in our Jenkins as well. OpenJDK is available in zone images 22.4.0 and later. Here's my recommendation:
@bahamat Can you confirm if 22.4.0 has a build of OpenJDK17 and not just 11? I know there are third party patches to OpenJDK which will allow it to build and run, but the primary OpenJDK codebase no longer supports building JDK11 from it (I'm guessing there's a chance that SmartOS already incorporated such patches to build 11 though ;-) ) |
@sxa Yes, 22.4 includes
Yep. We’re already using it in our Jenkins agent instances. See here: https://smartos.org/packages/set/2022Q4-x86_64/search/Openjdk |
There was an ask in nodejs/node#55239 (comment) to use newer Python than 3.8. |
The 22.4.0 image has 3.9 through 3.11, so we're good there. https://smartos.org/packages/set/2022Q4-x86_64/search/python3 |
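A provisioning script could run a quick sanity check that the interpreter on PATH meets that floor before kicking off a build. The 3.9 minimum is taken from the linked discussion; the helper itself is an illustrative sketch, not part of the actual Ansible roles.

```shell
# Sketch: confirm python3 is at least 3.9 before building.
py_ok() {
  python3 -c 'import sys; raise SystemExit(0 if sys.version_info[:2] >= (3, 9) else 1)'
}

if py_ok; then
  echo "python3 is new enough"
else
  echo "python3 is too old" >&2
  exit 1
fi
```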
Status update: I worked with @bahamat today to see if we can zero in on the configuration/provisioning discrepancies and find out why the preliminary host that I built is taking 3.5 hours to compile Node. It appeared CPU-bound, so I re-provisioned on a larger instance with double the cores, and that build is currently processing (https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/nodes=smartos23-x64/12/consoleFull). It doesn't seem to have helped performance much, given how long it's taken to get to the same point in the build compared to the existing smartos20 machines at Equinix. Given that it's an entirely new machine, my guess is some of it may be due to ccache not existing yet. Once the running build completes, I'll re-run it in the hope that a warmed cache will help things along.
Additional update on this: we discovered that caching played a huge role in the build times. Once the cache was properly primed, the new instance builds Node in about 38 minutes. So this unblocks things, and we'll be able to get the rest of the infra stood up pretty soon.
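The effect is the usual compiler-cache pattern: the first build pays the full cost to populate the cache, and repeats are served cheaply. As a toy illustration only (a stand-in for the idea, not how ccache itself is configured on these hosts):

```shell
# Content-addressed cache sketch: first call is a "miss" (expensive),
# repeat calls are "hits" served from the cache directory.
CACHE_DIR="$(mktemp -d)"

cached_compile() {
  # Key the cache on a checksum of the input name.
  key=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  if [ -f "$CACHE_DIR/$key" ]; then
    echo "hit: $(cat "$CACHE_DIR/$key")"
  else
    result="object-for-$1"            # stand-in for the expensive compile
    printf '%s' "$result" > "$CACHE_DIR/$key"
    echo "miss: $result"
  fi
}

cached_compile main.c   # first build: cache miss
cached_compile main.c   # rebuild: served from cache
```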
I have provisioned one additional SmartOS 23 host and two additional SmartOS 22 hosts. All hosts had an initial test run (each of which took around 11 hours) and a subsequent run, all of which took less than an hour (ranging from 33 to 56 minutes). See builds 15/16/17/18: https://ci.nodejs.org/job/node-test-commit-smartos-test-ryan/ I went ahead and modified https://ci.nodejs.org/job/node-test-commit-smartos/ to use the smartos22-x64 and smartos23-x64 labels, and removed the 18/20 machines from the rotation. I will be monitoring for anomalies and failures over the weekend.
Awesome, thanks @ryanaslett!!
PR for infra here: #3980
TODO:
I'm struggling to figure something out with Ansible, and it's frustrating. Somehow, when I provision smartos23-x64-5, the Jenkins agent on smartos23-x64-4 gets restarted with the secret key from x64-5. The same thing happens when I provision smartos22-x64-2: it resets the agent on smartos22-x64-1 to point at 2, with 2's secret key. I run Ansible with -vvv and can see every SSH connection it makes, yet it is not connecting to the wrong host. Is there some sort of Jenkins mechanism that is configuring those agents? What is going on?
Checked on libuv: needed to install autoconf on the SmartOS images, and then the tests run (albeit with one test failure: https://ci.nodejs.org/job/libuv-test-commit-smartos/2211/ ... I think that is an actual error and not an environment-configuration issue; need somebody who knows libuv to determine. In any case, that error is present on smartos 18/20 as well).
@ryanaslett / @bahamat FYI the smartos build with Node.js 18 has started failing over the weekend.
On SmartOS 23 it's failing to compile:
13:10:53 dtrace: failed to compile script src/v8ustack.d: line 264: failed to resolve V8DBG_SMITAG: Unknown variable name
13:10:54 make[2]: *** [node_dtrace_ustack.target.mk:26: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos23-x64/out/Release/obj.target/libnode/src/node_dtrace_ustack.o] Error 1
@richardlau Any idea when the last known working build was? Or how I might go about finding it?
Looks like this might be a change in GNU binutils. We're still looking into this, but we think we've seen something like this before.
The SmartOS 23 build hosts use binutils-2.41, whereas older hosts have binutils-2.39.
I'll see if there's any way we can work around this.
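For auditing which hosts are on the affected toolchain, a small version-comparison helper could be used. The 2.41 cutoff comes from this thread; the helper itself is an illustrative sketch, not an existing script.

```shell
# Sketch: flag hosts whose binutils is at or above the version where the
# behaviour change appeared (2.41, per this thread).
at_least() {
  # true if version $1 >= version $2, using version-aware sort
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

for v in 2.39 2.41; do
  if at_least "$v" 2.41; then
    echo "binutils-$v: affected"
  else
    echo "binutils-$v: not affected"
  fi
done
```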
Thank you. I never would have found this. Let me know what I should/can adjust on the smartos23 hosts.
So the Node 24 builds seem to still be relatively quick, but the Node 18/20/22 builds all took in the neighborhood of 8 hours (though those were all fresh runs). We just happened to distribute the jobs evenly between the matched pairs of instances, such that no instance has run Node 18/20/22 more than once, so I'm not sure whether that is sufficient to populate the ccache for us. We'll see on tomorrow's daily runs.
So it looks like we've entirely replicated Dan's previous debugging of this issue in https://sourceware.org/bugzilla/show_bug.cgi?id=32211. I can confirm that adding
It depends on the scope/impact. The configuration where we run into this issue is Node.js 18 on smartos23. Will that flag impact any other Node versions? I.e., can it be set at the smartos23 host level, or does it need to be set only when Node 18 is being tested? (Also, I may be wrong, but I'm pretty sure there's almost no chance that Node would patch Node 18 for this.)
Also, would it be okay to set that flag on the smartos22 hosts as well (the ones with binutils-2.39)?
Yes, this can be applied to all builds.
Ah, re-reading this I realize that it's merely adding that to the end of the CXXFLAGS variable. I may have to re-evaluate how we do this.
Adjusted. It does underscore an issue with these machines: the tremendous discrepancy between ccache'd build times (reasonable, under an hour) and fresh builds (8-11 hours, depending).
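One defensive way to append a build flag to CXXFLAGS without ever duplicating it might be sketched like this. `-fexample` is a placeholder, not the actual flag used on the hosts, and the helper is illustrative rather than what Jenkins injects.

```shell
# Append a flag to a flags string only if it is not already present,
# avoiding the duplicated-option problem.
append_once() {  # usage: append_once "$FLAGS" "$NEW" -> echoes the result
  case " $1 " in
    *" $2 "*) echo "$1" ;;       # already there: unchanged
    *)        echo "$1 $2" ;;    # not there: append
  esac
}

CXXFLAGS="-O2"
CXXFLAGS=$(append_once "$CXXFLAGS" "-fexample")
CXXFLAGS=$(append_once "$CXXFLAGS" "-fexample")  # second call is a no-op
echo "$CXXFLAGS"
```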
All SmartOS builds are failing with
@ryanaslett any chance you had time to look at the compilation issue mentioned in my previous comment? SmartOS builds are failing on all PRs right now, which is frustrating.
Somehow that option got in there twice, and I think that's what's causing this.
@aduh95 That was an error in how Jenkins was injecting those environment variables. I had remedied that, and builds were running with the adjusted fix; however, the builds were in the midst of a long-running "rebuild their ccaches" phase as a result of nodejs/node#56275 and the related work, and you cancelled those running builds before they finished. Builds are passing now (https://ci.nodejs.org/job/node-test-commit-smartos/58244/), but tonight the daily build that we were trying to fix (see #3731 (comment)) will again take a very long time to get its ccaches warmed up and back to performing normally. Please do not abort these long-running SmartOS builds. They need to complete to get us to a state where we have performant CI runners.
I will keep aborting those as long as I need to get the release proposal ready on time, sorry not sorry. Getting those long-running jobs through is something that needs to happen eventually, I agree, but it seems like a low priority in comparison to not slowing down the day-to-day activities of the project (i.e. landing PRs) and the time-sensitive ones (i.e. releasing Node.js). I'm hopeful the caches will populate over the weekend when the CI is less busy. Anyway, thanks for fixing the builds!
The replacement SmartOS nodes have been successfully running builds for a good while now. I plan on decommissioning the old SmartOS nodes and removing them from Jenkins.
This is a sub-step of #3597.
The two release and four testing SmartOS hosts need to be replaced.
The current ones are running SmartOS 18 and SmartOS 20, which are both EOL.
In discussions with @bahamat, it was determined that the likely targets should change to SmartOS 21 and SmartOS 23.
I plan to provision the following hosts:
Once those are up, I'll connect them to Jenkins and we can run some preliminary tests to make sure they work and are accepting jobs.
There's a non-zero chance that something has broken between smartos20 and smartos23, so somebody with SmartOS/Node knowledge is likely going to have to manage anticipated failures.