Nexus VM gets wedged? #4074
Comments
Update: This is reproducible in the current HEAD of this repository, so I'd like to redefine this as a bug, not a question.
Hey @TonyWildish-BH, what version of Nexus do you have deployed?
I pulled the HEAD a week ago, so it's whatever's there; I haven't touched the Nexus code. We did a separate test, pulling this repo yesterday, and that shows the same problem. That's why I'm thinking it's not anything I've done, since that second test had no modifications whatsoever w.r.t. this repo.
Sure. What version is the Nexus template?
3.0.0
Thanks. I'll take a look, see if I can reproduce.
Hi @tim-allen-ck, did you get a chance to look at this? What I have found since is that the Windows VMs don't seem to provoke the problem, though the Linux VMs definitely do. Probably because they have so much more to update than the Windows VMs.
Hi @tim-allen-ck, I understand it is a busy season - and wondered if this has been looked into?
@akolensky what troubleshooting steps have you tried? It's not something we've seen elsewhere.
All I've managed to deduce so far is that it seems to be related to the Linux VMs doing a mass update. The load average in the Nexus container goes over 40 and it stops responding completely, which isn't surprising at that load average. Rebooting the Nexus VM clears the issue, but a 'restart' in the portal doesn't work because the VM doesn't respond to it; you have to 'stop' and 'start', which usually takes a very long time. The problem is repeatable, but not guaranteed. With a fresh install of Nexus, it wedges about ⅔ of the time, on one of the first 2 or 3 Linux VMs, often the first. It's certainly not rare.
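To put the "load average over 40" figure in context, here is a minimal Python sketch that normalises a load average by core count. It is an illustration only: the function names and the 1.5-per-core threshold are assumptions made for this example, not anything taken from AzureTRE or Nexus.

```python
import os
from typing import Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Parse the 1, 5 and 15 minute fields of a /proc/loadavg-style line."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_core(load: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load / cores


def is_overloaded(load: float, cores: int, threshold: float = 1.5) -> bool:
    """True when load per core exceeds the threshold (1.5 is an arbitrary assumption)."""
    return load_per_core(load, cores) > threshold


def local_load_report() -> str:
    """Summarise the current host's 1-minute load (os.getloadavg is Unix-only)."""
    one, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return f"load={one:.2f} cores={cores} per_core={load_per_core(one, cores):.2f}"
```

Running `local_load_report()` inside the Nexus VM while a Linux workspace VM is updating would show how far the load exceeds the available cores, assuming the session is still responsive enough to run it.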
Have you added some custom repositories? We've got instances running elsewhere and Nexus has been working without issue for long periods. So something must be different in your instance. Have you tried using a larger VM? It might be that the container needs some resource limits so as to leave the host some resources.
Marcus, this is in fresh installations, predominantly. It looks like a first-time cache-filling problem where the requests are not throttled and the server gets overloaded. After rebooting, it tends to behave itself, but still spits the dummy every now and then. This happens in a fresh installation, with unmodified code, as stated. No custom anything. We see it in the pure MS code base, and also in our own, where we have not touched anything relating to Nexus or to any of the core resources. This is repeatable: three different people using three different setups have seen it, including one outside Barts. It's not our environment. I did try using a larger VM (64 GB x 8 cores), but that didn't help. Restricting the container isn't likely to help much, though it might let the host OS kill and restart it, at best. If the container is spawning > 40 threads, all bets are off; that's too many. My best guess is that the server needs throttling, which means either Nexus or Java VM configuration. Do you know if Tim tried to reproduce it?
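For readers unfamiliar with the term, the kind of throttling suggested above can be sketched generically in Python. This is not Nexus or JVM configuration: `ConcurrencyLimiter`, `fake_fetch` and the limit of 4 are hypothetical names and values chosen purely to show the idea of capping how many requests are serviced at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Allow at most max_in_flight calls to run at the same time."""

    def __init__(self, max_in_flight: int) -> None:
        self._sem = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0  # highest number of simultaneous calls observed

    def run(self, fn, *args):
        with self._sem:  # blocks while max_in_flight calls are already running
            with self._lock:
                self._in_flight += 1
                self.peak = max(self.peak, self._in_flight)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._in_flight -= 1


def fake_fetch(package: str) -> str:
    """Stand-in for an upstream package download (hypothetical)."""
    time.sleep(0.01)
    return f"fetched {package}"


def demo(requests: int = 40, max_in_flight: int = 4):
    """Fire many requests at once and report the peak concurrency reached."""
    limiter = ConcurrencyLimiter(max_in_flight)
    packages = [f"pkg-{i}" for i in range(requests)]
    with ThreadPoolExecutor(max_workers=requests) as pool:
        results = list(pool.map(lambda p: limiter.run(fake_fetch, p), packages))
    return limiter.peak, results
```

With 40 simultaneous callers and a limit of 4, the excess requests wait rather than all competing for CPU at once, which is the behaviour the comment above is asking for from the server side.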
Hi @TonyWildish-BH, I've not been able to reproduce it. Was it only 1 or 2 VMs you'd deployed when you found the issue?
Hi @tim-allen-ck, I've been able to reproduce it on the first Linux VM I boot in a new SDE. It happens about 50% of the time in that situation, more or less.
What's the exact SKU you are using for the VM? What additional software is installed? In the Terraform I can see it's a B-series VM. If you are using the default, it might not be appropriate for your needs given the nature of burstable CPU; I suggest you try a different SKU. It would be useful if the SKU was a parameter. Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?
This has all happened with a completely unmodified installation from the HEAD of this repository. A fresh checkout of the code, with nothing changed. Not the Nexus VM, not the Linux template I'm trying to boot from it. Nothing. I set my
@akolensky are you able to help @TonyWildish-BH answer my question above? Thanks.
Hi Marcus,
SKU is
As stated, I'm seeing this error on multiple installations. One is our own, with custom VMs that have nearly all the packages installed; the other is the unmodified Microsoft codebase, commit hash c3e4c8d. That uses a cloud-init script to update the vanilla OS which comes with the TRE. I see the issue in both these environments, so this is not an issue of customisation on our side.
That's the image SKU rather than the VM SKU. The VM SKU will be a letter followed by number(s). My thinking is that you have something different going on on the VM. Antivirus, maybe? That, in conjunction with the VM scripts, is causing all the credits to be used on the B-series Nexus VM. In addition, as per https://microsoft.github.io/AzureTRE/latest/tre-templates/user-resources/guacamole-linux-vm/, I suggest you use VM images in production.
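The difference between an image SKU and a VM SKU (size) can be shown with a small Python sketch. It assumes the common Azure size naming pattern of `Standard_` followed by family letter(s) and a number, for example `Standard_B2ms`. The helper names and the sample image SKU string are hypothetical, and the pattern is a simplification rather than a full parser for every Azure size name.

```python
import re
from typing import Optional

# Simplified pattern: tier, underscore, family letter(s), then a number.
_SIZE_PATTERN = re.compile(r"^(?:Standard|Basic)_([A-Z]+)(\d+)")


def vm_family(size_name: str) -> Optional[str]:
    """Return the family letters of a VM size name, or None if it does not look like one."""
    match = _SIZE_PATTERN.match(size_name)
    return match.group(1) if match else None


def is_burstable(size_name: str) -> bool:
    """True for B-series sizes, which accumulate and spend CPU credits."""
    return vm_family(size_name) == "B"


if __name__ == "__main__":
    # "22_04-lts-gen2" is an illustrative image SKU, not one taken from this thread.
    for name in ("Standard_B2ms", "Standard_D4s_v3", "22_04-lts-gen2"):
        print(name, vm_family(name), is_burstable(name))
```

A string that returns `None` from `vm_family` is most likely an image SKU or some other identifier rather than a VM size.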
Where do I find the VM SKU? Whatever is happening on the VM is whatever happens out of the box, because we haven't modified it in any way at all. There is no customisation of the Nexus VM. We haven't changed anything there. We haven't installed anything extra. Nothing. I'm aware of that recommendation, and we will indeed be using our own VM images, but I need this bug fixed before we can consider going into production.
Description
I'm trying to merge the current AzureTRE into my own repository to get the latest changes. The merge went smoothly, no conflicts, and now I'm testing it.
The issue I see is that the Nexus VM gets wedged after a while. I'm able to create one or two VMs, either Windows or Linux, and they work, booting to completion. However, if I deploy more VMs, they eventually get stuck, with Nexus failing to respond.
Restarting the Nexus VM clears things up for a while, but the problem recurs just a short while later, when I deploy more VMs.
I'm able to connect to the Nexus VM in the Azure portal, via the bastion, but when the problem happens, that session gets wedged too. It's a whole-VM phenomenon.
I haven't changed anything relating to any shared services in my TRE, and in particular, I haven't touched Nexus at all; the configuration there is exactly as-is in this repo. So while I can't rule out that it's something I've done, I'm wondering if anyone else has seen this, or anything like it?
Any suggestions of what to look for would be greatly appreciated.