Nexus VM gets wedged? #4074

TonyWildish-BH · 2024-08-20T14:09:09Z

Description

I'm trying to merge the current AzureTRE into my own repository to get the latest changes. The merge went smoothly, no conflicts, and now I'm testing it.

The issue I see is that the Nexus VM gets wedged after a while. I'm able to create one or two VMs, either Windows or Linux, and they work, booting to completion. However, if I deploy more VMs, they eventually get stuck, with Nexus failing to respond.

Restarting the Nexus VM clears things up for a while, but the problem recurs just a short while later, when I deploy more VMs.

I'm able to connect to the Nexus VM in the Azure portal, via the bastion, but when the problem happens, that session gets wedged too. It's a whole-VM phenomenon.

I haven't changed anything relating to any shared services in my TRE, and in particular, I haven't touched Nexus at all; the configuration there is exactly as it is in this repo. So while I can't rule out that it's something I've done, I'm wondering if anyone else has seen this, or anything like it?

Any suggestions of what to look for would be greatly appreciated.

TonyWildish-BH · 2024-08-21T15:25:48Z

Update: This is reproducible in the current HEAD of this repository, so I'd like to redefine this as a bug, not a question.

tim-allen-ck · 2024-08-21T15:36:44Z

Hey @TonyWildish-BH what version of Nexus do you have deployed?

TonyWildish-BH · 2024-08-21T15:50:57Z

I pulled the HEAD a week ago, it's whatever's there, I haven't touched the nexus code. We did a separate test, pulling this repo yesterday, and that shows the same problem. That's why I'm thinking it's not anything I've done, since that second test had no modifications whatsoever w.r.t. this repo.

tim-allen-ck · 2024-08-21T16:06:02Z

Sure. What version is the nexus template?

TonyWildish-BH · 2024-08-21T16:08:18Z

3.0.0

tim-allen-ck · 2024-08-21T17:54:18Z

Thanks. I'll take a look, see if I can reproduce.

TonyWildish-BH · 2024-08-30T10:20:20Z

hi @tim-allen-ck, did you get a chance to look at this?

What I have found since is that the Windows VMs seem not to provoke the problem, though the Linux VMs definitely do. Probably because they have so much more to update than the Windows VMs.

akolensky · 2024-09-18T12:48:04Z

Hi @tim-allen-ck, I understand it is a busy season, and wondered if this has been looked into?

marrobi · 2024-09-18T14:01:44Z

@akolensky what troubleshooting steps have you tried? It's not something we've seen elsewhere.

TonyWildish-BH · 2024-09-18T15:57:03Z

All I've managed to deduce so far is that it seems to be related to the Linux VMs doing a mass update. The load average in the nexus container goes over 40, and it stops responding, completely - which isn't surprising at that load average.
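The load-average symptom described above can be checked from a shell on the Nexus VM with a few lines of Python. This is a minimal sketch, not part of AzureTRE: the helper name `is_overloaded` and the threshold factor of 2.0 are assumptions chosen for illustration, and `os.getloadavg()` is only available on Unix-like hosts.

```python
import os


def is_overloaded(load1: float, cpus: int, factor: float = 2.0) -> bool:
    """Return True when the 1-minute load average exceeds factor * CPU count.

    The factor of 2.0 is an illustrative threshold, not a Nexus or Azure default.
    """
    return load1 > factor * max(cpus, 1)


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    state = "OVERLOADED" if is_overloaded(load1, cpus) else "ok"
    print(f"load1={load1:.2f} load5={load5:.2f} load15={load15:.2f} cpus={cpus} -> {state}")
```

By this rule of thumb, a load average of 40 counts as overloaded on any host with fewer than 20 cores.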

Rebooting the nexus VM clears the issue, but a 'restart' in the portal doesn't work, because the VM doesn't respond to it; you have to 'stop' and 'start' it, which usually takes a very long time.

The problem is repeatable, but not guaranteed. With a fresh install of nexus, it wedges about ⅔ of the time, on one of the first 2 or 3 Linux VMs - often the first. It's certainly not rare.

marrobi · 2024-09-19T05:20:21Z

Have you added some custom repositories?

We've got instances running elsewhere, and Nexus has been working without issue for long periods. So something must be different in your instance.

Have you tried using a larger VM?

It might be that the container needs some resource limits, so as to leave the host some resources.

TonyWildish-BH · 2024-09-19T08:48:13Z

Marcus, this is in fresh installations, predominantly. It looks like a first-time cache-filling problem where the requests are not throttled, and the server gets overloaded. After rebooting, it tends to behave itself, but still spits the dummy every now and then.

This happens in a virgin installation, with unmodified code, as stated. No custom anything. We see it in the pure MS code base, and also in our own, where we have not touched anything relating to nexus, or to any of the core resources.

This is repeatable, three different people using three different setups have seen it, including one outside Barts. It's not our environment.

I did try using a larger VM (64 GB x 8 cores), that didn't help.

Restricting the container isn't likely to help much, though it might let the host OS kill and restart it, at best. If the container is spawning > 40 threads, all bets are off, that's too many. My best guess is that the server needs throttling, which means either Nexus or Java VM configuration.

Do you know if Tim tried to reproduce it?

tim-allen-ck · 2024-09-19T10:55:06Z

Hi @TonyWildish-BH I've not been able to reproduce it. Was it only 1 or 2 VMs you'd deployed when you'd found the issue?

TonyWildish-BH · 2024-09-19T10:59:07Z

hi @tim-allen-ck, I've been able to reproduce it on the first Linux VM I boot in a new SDE. It happens about 50% of the time in that situation, more or less.

marrobi · 2024-09-19T23:27:08Z

What's the exact SKU you are using for the VM? What additional software is installed?

In the Terraform I can see it's a B-series VM. If you are using the default, it may not be appropriate for your needs given the nature of burstable CPU; I suggest you try a different SKU.
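The concern about burstable CPU can be made concrete with a little arithmetic. The sketch below assumes the usual credit model for burstable VMs, in which one credit equals one vCPU running at 100% for one minute; the function name and all the numbers in the usage note are hypothetical, not values read from the Nexus VM or from the AzureTRE Terraform.

```python
from typing import Optional


def minutes_until_credits_exhausted(
    balance: float, accrual_per_hour: float, busy_vcpus: float
) -> Optional[float]:
    """Estimate minutes until a burstable VM's CPU credit balance hits zero.

    Assumes one credit equals one vCPU running at 100% for one minute.
    Returns None when usage does not outpace accrual (balance never drains).
    """
    # Credits spent per minute minus credits earned per minute.
    drain_per_minute = busy_vcpus - accrual_per_hour / 60.0
    if drain_per_minute <= 0:
        return None
    return balance / drain_per_minute
```

For example, with a hypothetical balance of 60 credits, an accrual of 24 credits per hour and 2 vCPUs fully busy, the balance drains at 1.6 credits per minute and lasts about 37.5 minutes, after which the VM is held to its baseline performance.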

It would be useful if the SKU was a parameter.

Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

TonyWildish-BH · 2024-09-20T09:16:57Z

This has all happened with a completely unmodified installation from the HEAD of this repository. A fresh checkout of the code, with nothing changed. Not the Nexus VM, not the Linux template I'm trying to boot from it. Nothing.

I set my config.yaml at the top level and install, from scratch, following the instructions. I create a Linux VM, and with high probability, Nexus wedges.

marrobi · 2024-09-20T13:50:56Z

> What's the exact SKU you are using for the VM? What additional software is installed?
>
> In the Terraform I can see it's a B-series VM. If you are using the default, it may not be appropriate for your needs given the nature of burstable CPU; I suggest you try a different SKU.
>
> It would be useful if the SKU was a parameter.
>
> Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

@akolensky are you able to help @TonyWildish-BH answer my question above? Thanks.

TonyWildish-BH · 2024-09-20T14:38:20Z

Hi Marcus,

> What's the exact SKU you are using for the VM? What additional software is installed?

SKU is 22_04-lts-gen2.
As stated, there is no additional software installed. None.

> Also, are you using VM images with packages preinstalled, as recommended, or are you installing them using a startup script on the VM?

As stated, I'm seeing this error on multiple installations. One is our own, with custom VMs that have nearly all the packages installed, the other is the unmodified Microsoft codebase, commit hash c3e4c8d. That uses a cloud-init script to update the vanilla OS which comes with the TRE.

I see the issue in both these environments, therefore, this is not an issue of customisation from our side.

marrobi · 2024-09-20T14:54:40Z

That's the image sku rather than VM SKU. The VM SKU will be a letter followed by number(s).
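The difference between the two kinds of SKU can be shown with a rough pattern check. This is a heuristic sketch only: the regular expression and the helper name `looks_like_vm_size` are assumptions made for illustration, and real Azure size names have more variants than this pattern covers.

```python
import re

# Rough shape of a VM size name: optional tier prefix, family letter(s),
# a number, optional lowercase feature letters, optional version suffix.
_VM_SIZE = re.compile(r"(?:Standard_|Basic_)?[A-Z]{1,3}\d+[a-z]*(?:_v\d+)?")


def looks_like_vm_size(name: str) -> bool:
    """Return True if the whole string resembles a VM size such as Standard_B2ms."""
    return _VM_SIZE.fullmatch(name) is not None
```

Under this heuristic, `Standard_B2ms` and `Standard_D4s_v3` are accepted as VM sizes, while the image SKU `22_04-lts-gen2` quoted earlier in the thread is rejected.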

My thinking is that you have something different going on on the VM. Antivirus, maybe? That, in conjunction with the VM scripts, is causing all the CPU credits to be used on the B-series Nexus VM.

In addition, as per https://microsoft.github.io/AzureTRE/latest/tre-templates/user-resources/guacamole-linux-vm/, I suggest you use VM images in production.

TonyWildish-BH · 2024-09-20T15:15:14Z

Where do I find the VM SKU?

Whatever is happening on the VM is whatever happens out of the box, because we haven't modified it in any way at all. There is no customisation of the Nexus VM. We haven't changed anything there. We haven't installed anything extra. Nothing.

I'm aware of that recommendation, and we will indeed be using our own VM images, but I need this bug fixed before we can consider going into production.

TonyWildish-BH added the "question" label (Further information is requested) on Aug 20, 2024

TonyWildish-BH mentioned this issue Aug 22, 2024

Upstream refresh Barts-Life-Science/AzureTRE#142

Merged
