Nexus VM gets wedged? #4074

Open
TonyWildish-BH opened this issue Aug 20, 2024 · 20 comments
Labels
question Further information is requested

Comments

@TonyWildish-BH

Description

I'm trying to merge the current AzureTRE into my own repository to get the latest changes. The merge went smoothly, no conflicts, and now I'm testing it.

The issue I see is that the Nexus VM gets wedged after a while. I'm able to create one or two VMs, either Windows or Linux, and they work, booting to completion. However, if I deploy more VMs, they eventually get stuck, with Nexus failing to respond.

Restarting the Nexus VM clears things up for a while, but the problem recurs just a short while later, when I deploy more VMs.

I'm able to connect to the Nexus VM in the Azure portal via Bastion, but when the problem happens, that session gets wedged too. It's a whole-VM phenomenon.

I haven't changed anything relating to any shared services in my TRE. In particular, I haven't touched Nexus at all; the configuration there is exactly as it is in this repo. So while I can't rule out that it's something I've done, I'm wondering if anyone else has seen this, or anything like it?

Any suggestions of what to look for would be greatly appreciated.

TonyWildish-BH added the question (Further information is requested) label on Aug 20, 2024
@TonyWildish-BH
Author

Update: This is reproducible in the current HEAD of this repository, so I'd like to redefine this as a bug, not a question.

@tim-allen-ck
Collaborator

Hey @TonyWildish-BH what version of Nexus do you have deployed?

@TonyWildish-BH
Author

I pulled HEAD a week ago, so it's whatever is there; I haven't touched the Nexus code. We did a separate test, pulling this repo yesterday, and that shows the same problem. That's why I'm thinking it's not anything I've done, since that second test had no modifications whatsoever w.r.t. this repo.

@tim-allen-ck
Collaborator

Sure. What version is the nexus template?

@TonyWildish-BH
Author

3.0.0

@tim-allen-ck
Collaborator

Thanks. I'll take a look, see if I can reproduce.

@TonyWildish-BH
Author

Hi @tim-allen-ck, did you get a chance to look at this?

What I have found since is that the Windows VMs seem not to provoke the problem, though the Linux VMs definitely do. Probably because they have so much more to update than the Windows VMs.

@akolensky

Hi @tim-allen-ck, I understand it is a busy season and wondered if this has been looked into?

@marrobi
Member

marrobi commented Sep 18, 2024

@akolensky what troubleshooting steps have you tried? It's not something we've seen elsewhere.

@TonyWildish-BH
Author

All I've managed to deduce so far is that it seems to be related to the Linux VMs doing a mass update. The load average in the Nexus container goes over 40 and it stops responding completely, which isn't surprising at that load average.

Rebooting the Nexus VM clears the issue, but a 'restart' in the portal doesn't work because the VM doesn't respond to it. You have to 'stop' and then 'start' it, which usually takes a very long time.

The problem is repeatable, but not guaranteed. With a fresh install of nexus, it wedges about ⅔ of the time, on one of the first 2 or 3 Linux VMs - often the first. It's certainly not rare.
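
For anyone trying to confirm the load-average observation above on their own Nexus VM, here is a minimal sketch that compares the 1-minute load average with the number of cores. The threshold of 1.0 per core is an assumption chosen for illustration, not a value from this repository, and on Linux the load average also counts tasks blocked on I/O, so a high figure does not by itself prove CPU starvation.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Return the 1-minute load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def is_overloaded(load_1min: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the load per core exceeds the threshold (an assumed value)."""
    return load_per_core(load_1min, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one_min, _five_min, _fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load {one_min:.2f} on {cores} cores "
          f"({load_per_core(one_min, cores):.2f} per core)")
    print("overloaded" if is_overloaded(one_min, cores) else "ok")
```

With the figures quoted in this thread, a load of 40 on an 8-core VM works out to 5 per core, well past the point where the machine can keep up.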

@marrobi
Member

marrobi commented Sep 19, 2024

Have you added some custom repositories?

We've got instances running elsewhere and Nexus has been working without issue for long periods. So something must be different in your instance.

Have you tried using a larger VM?

It might be that the container needs some resource limits so as to leave the host some resources.
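
As a rough illustration of the resource-limit suggestion above, this sketch builds a `docker update` command that caps a running container below the host's totals. The container name `nexus` and the amounts reserved for the host are hypothetical, and it assumes Nexus runs as a Docker container on the VM. `--cpus`, `--memory` and `--memory-swap` are existing `docker update` options; the sketch only prints the command and does not run it.

```python
import shlex


def docker_update_command(container: str, host_cpus: int, host_memory_gb: int,
                          reserve_cpus: float = 1.0,
                          reserve_memory_gb: int = 2) -> list[str]:
    """Build a `docker update` command that leaves headroom for the host.

    The reservation defaults are illustrative assumptions, not tuned values.
    """
    cpus = host_cpus - reserve_cpus
    memory_gb = host_memory_gb - reserve_memory_gb
    if cpus <= 0 or memory_gb <= 0:
        raise ValueError("reservation leaves nothing for the container")
    return [
        "docker", "update",
        "--cpus", f"{cpus:g}",
        "--memory", f"{memory_gb}g",
        # Set the swap limit equal to the memory limit so the container
        # cannot use swap on top of its memory allowance.
        "--memory-swap", f"{memory_gb}g",
        container,
    ]


if __name__ == "__main__":
    # "nexus" is a hypothetical container name; check `docker ps` for the real one.
    print(shlex.join(docker_update_command("nexus", host_cpus=8, host_memory_gb=64)))
```

Limits like these would not stop Nexus itself from being overloaded, but they might keep the host and the Bastion session responsive while it is.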

@TonyWildish-BH
Author

Marcus, this is in fresh installations, predominantly. It looks like a first-time cache-filling problem where the requests are not throttled, and the server gets overloaded. After rebooting, it tends to behave itself, but still spits the dummy every now and then.

This happens in a virgin installation, with unmodified code, as stated. No custom anything. We see it in the pure MS code base, and also in our own, where we have not touched anything relating to nexus, or to any of the core resources.

This is repeatable: three different people using three different setups have seen it, including one outside Barts. It's not our environment.

I did try using a larger VM (64 GB, 8 cores), but that didn't help.

Restricting the container isn't likely to help much, though at best it might let the host OS kill and restart it. If the container is spawning more than 40 threads, all bets are off; that's too many. My best guess is that the server needs throttling, which means either Nexus or Java VM configuration.

Do you know if Tim tried to reproduce it?

@tim-allen-ck
Collaborator

Hi @TonyWildish-BH I've not been able to reproduce it. Was it only 1 or 2 VMs you'd deployed when you'd found the issue?

@TonyWildish-BH
Author

Hi @tim-allen-ck, I've been able to reproduce it on the first Linux VM I boot in a new SDE. It happens about 50% of the time in that situation, more or less.

@marrobi
Member

marrobi commented Sep 19, 2024

What's the exact SKU you are using for the VM? What additional software is installed?

In the Terraform I can see it's a B-series VM. If you are using the default, it might not be appropriate for your needs given the nature of burstable CPU, so I suggest you try a different SKU.

It would be useful if the SKU was a parameter.

Also, are you using VM images with packages preinstalled as recommended, or are you installing them using a startup script on the VM?
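
To make the burstable-CPU concern above concrete, here is a simplified sketch of the credit model, assuming that one credit corresponds to one vCPU running at 100% for one minute. The baseline and balance figures in the example are made up for illustration and do not describe any particular SKU or the TRE default.

```python
from typing import Optional


def minutes_until_credits_exhausted(banked_credits: float,
                                    baseline_vcpus: float,
                                    used_vcpus: float) -> Optional[float]:
    """Estimate how long a burstable VM can sustain a given CPU usage.

    baseline_vcpus: CPU the VM may use indefinitely, in vCPUs (e.g. 0.5).
    used_vcpus:     actual sustained usage, in vCPUs (e.g. 2.0 = two busy cores).
    Returns None when usage is at or below the baseline, since the balance
    then never runs out.
    """
    if banked_credits < 0 or baseline_vcpus < 0 or used_vcpus < 0:
        raise ValueError("inputs must be non-negative")
    net_drain_per_minute = used_vcpus - baseline_vcpus
    if net_drain_per_minute <= 0:
        return None
    return banked_credits / net_drain_per_minute


if __name__ == "__main__":
    # Hypothetical numbers: 60 banked credits, 0.5 vCPU baseline, 2 vCPUs busy.
    print(minutes_until_credits_exhausted(60.0, 0.5, 2.0))
```

On a freshly deployed VM the banked balance may be small, so a burst of package downloads could plausibly drain it quickly. Whether that is what is happening here is not established in this thread.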

@TonyWildish-BH
Author

This has all happened with a completely unmodified installation from the HEAD of this repository. A fresh checkout of the code, with nothing changed. Not the Nexus VM, not the Linux template I'm trying to boot from it. Nothing.

I set my config.yaml at the top level and install, from scratch, following the instructions. I create a Linux VM, and with high probability, Nexus wedges.

@marrobi
Member

marrobi commented Sep 20, 2024

> What's the exact SKU you are using for the VM? What additional software is installed?
>
> In the Terraform I can see it's a B-series VM. If you are using the default, it might not be appropriate for your needs given the nature of burstable CPU, so I suggest you try a different SKU.
>
> It would be useful if the SKU was a parameter.
>
> Also, are you using VM images with packages preinstalled as recommended, or are you installing them using a startup script on the VM?

@akolensky are you able to help @TonyWildish-BH answer my question above? Thanks.

@TonyWildish-BH
Author

Hi Marcus,

> What's the exact SKU you are using for the VM? What additional software is installed?

SKU is 22_04-lts-gen2.
As stated, there is no additional software installed. None.

> Also, are you using VM images with packages preinstalled as recommended, or are you installing them using a startup script on the VM?

As stated, I'm seeing this error on multiple installations. One is our own, with custom VMs that have nearly all the packages installed, the other is the unmodified Microsoft codebase, commit hash c3e4c8d. That uses a cloud-init script to update the vanilla OS which comes with the TRE.

I see the issue in both of these environments, so this is not an issue of customisation on our side.

@marrobi
Member

marrobi commented Sep 20, 2024

That's the image SKU rather than the VM SKU. The VM SKU will be a letter followed by number(s).

My thinking is that you have something different going on on the VM. Antivirus, maybe? That, in conjunction with the VM scripts, is causing all the credits to be used on the B-series Nexus VM.

In addition as per https://microsoft.github.io/AzureTRE/latest/tre-templates/user-resources/guacamole-linux-vm/ I suggest you use VM images in production.
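
The image SKU versus VM SKU distinction above can be checked mechanically. The sketch below uses a heuristic regular expression for common Azure VM size names such as `Standard_B2ms` or `Standard_D4s_v5`. The pattern is an assumption that covers the usual "family letter(s) followed by a number" shape and will miss some less common size names.

```python
import re
from typing import Optional, Tuple

# Heuristic only: tier prefix, family letter(s), a number, optional feature
# letters, optional version suffix. Not an exhaustive grammar of Azure sizes.
_VM_SIZE = re.compile(r"^(?:Standard|Basic)_([A-Z]+)(\d+)([a-z]*)(?:_v\d+)?$")


def parse_vm_size(name: str) -> Optional[Tuple[str, int]]:
    """Return (family, number) for a VM size name, or None if it is not one."""
    match = _VM_SIZE.match(name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_burstable(name: str) -> bool:
    """True when the name parses as a B-series (burstable) VM size."""
    parsed = parse_vm_size(name)
    return parsed is not None and parsed[0] == "B"


if __name__ == "__main__":
    for candidate in ("22_04-lts-gen2", "Standard_B2ms", "Standard_D4s_v5"):
        print(candidate, parse_vm_size(candidate), is_burstable(candidate))
```

The value `22_04-lts-gen2` quoted earlier does not parse as a VM size, which matches the point that it is an image SKU. The actual size of a deployed VM can be read with the Azure CLI, for example `az vm show --resource-group <rg> --name <vm> --query hardwareProfile.vmSize`, where the resource group and VM name are placeholders.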

@TonyWildish-BH
Author

Where do I find the VM SKU?

Whatever is happening on the VM is whatever happens out of the box, because we haven't modified it in any way at all. There is no customisation of the Nexus VM. We haven't changed anything there. We haven't installed anything extra. Nothing.

I'm aware of that recommendation, and we will indeed be using our own VM images, but I need this bug fixed before we can consider going into production.
