VM is not initializing on ARM64 #956
/cc @zhlhahaha Is this something you are able to help with?
Hi @jaredcash
@zhlhahaha unfortunately, increasing the memory did not help. It still appears that the VM will not initialize as the guest-console-log container logs are blank and I am unable to access it:
guest-console-log:
Note: I also tried creating the VM with 1G of memory and got the same results. I attached the describe output of both the virt-launcher pod and the new VM object with 256M in case that is helpful.
It is interesting that there are no failure logs in virt-launcher.log or the console log. This usually indicates that the VM may be encountering a boot failure during the bootloader stage. This could be caused by incorrect UEFI firmware, a corrupted VM disk, or a mismatch in the CPU architecture of the VM disk. I will investigate this further in my local environment.
The
I can successfully boot the VM based on the image in my local env.
@aburdenthehand If @jaredcash confirms it is an image issue, we can update the document https://kubevirt.io/labs/kubernetes/lab1.
I don't see any of the labs specifying the cirros disk image. Am I missing something?
@zhlhahaha unfortunately, the VM is still not initializing with the ARM-specific image and higher allocated memory. The VM still shows as running, but again I'm unable to access it and there are no console logs:
I'm unsure if this is somehow related to my hardware, even though the
The doc does not specify the cirros disk image. However, the command in the VM configuration file vm.yaml contains the disk image information, and it uses the x86-only cirros image.
OK, I will raise an issue after I solve @jaredcash's problem.
Would you mind collecting the following information?
Hello @zhlhahaha, here is the requested information:
Please see the following output for the qemu processes on the worker node:
I have attached all the container logs of the virt-launcher pod after adding more verbose logging.
I do have virtctl installed on the manager and I have attempted to console into the VM. I do not get an error message, but I just get a blank display. Even if I hit any key, it will not progress any further until I
Please let me know if anything additional is needed.
There is no error in virt-launcher.log and the qemu process starts successfully. Let's try a Fedora image. Would you mind using the following config to start it?
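For reference, a hypothetical stand-in for the config (not the one from the thread): a minimal Fedora VM using an upstream containerdisk. The image name and memory size are assumptions.

```sh
# Sketch: create a minimal Fedora VM backed by a containerdisk.
# quay.io/containerdisks/fedora and 1024M are assumptions, not
# values taken from this issue.
kubectl apply -f - <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: fedora-test
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 1024M
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/containerdisks/fedora:latest
EOF
```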
With the Fedora image, the VM is still not initializing. Even though vmi shows the VM as running, virtctl console is still blank. I am personally not noticing any outliers in the logs, but I have attached them for further inspection.
As a note, I also tested the Fedora VM with 2048M and got the same results. The logs I provided are from the 1024M VM.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /lifecycle stale
Unfortunately, this issue persists. @zhlhahaha and/or KubeVirt team, have you had some time to review my previous message?
Sorry, I missed your message. I suspect the UEFI boot failed. Would you mind providing more information?
@zhlhahaha it seems there is an issue with
I gathered the CPU information for all my nodes via the
Additionally, I deployed a new Cirros VM using the image suggested here (#956 (comment)) and gathered the
Hi @jaredcash
Apologies for my misunderstanding @zhlhahaha. I have attached the dumpxml of the cirros VM I created with pure virsh.
Thanks! I didn’t notice any differences between the successfully booted Cirros VM and the KubeVirt one. Would you mind double-checking if the successfully booted Cirros VM is starting on the server with the Cortex-A55 CPU? Initially, I suspected a difference in the Generic Interrupt Controller (GIC) versions between the Cortex-A55 and Cortex-A72 CPUs, but they appear to use the same GIC version. Now, it seems the UEFI firmware may be the only possible cause. Would you be able to replace /usr/share/AAVMF/AAVMF_CODE.fd in the virt-launcher with the one from the host?
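One hedged way to attempt that swap on the live pod (pod name and namespace are placeholders; as discussed below, the change does not survive a pod restart):

```sh
# Copy the host firmware over the one in the virt-launcher pod's
# "compute" container, then restart the VM (not the pod).
kubectl cp /usr/share/AAVMF/AAVMF_CODE.fd \
  default/virt-launcher-testvm-abcde:/usr/share/AAVMF/AAVMF_CODE.fd \
  -c compute
```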
@zhlhahaha I redeployed the cirros test VM I created with KubeVirt to the same node (node4) to ensure it is using the same CPU (Cortex-A55).
Note: I did attempt to become the root user, but it asks for a password:
As a workaround, I used
After giving it some time, the VM was still not initializing. In an attempt to get it to work, I changed the ownership of
Unfortunately, the VM still did not initialize. I restarted the pod to see if that would help, but it did not.
I have also attached fresh logs of the virt-launcher pod for reference.
The AAVMF_CODE file is the UEFI boot firmware used during VM startup. After replacing this file, a VM reboot is necessary for the changes to take effect. Additionally, if you restart the pod, it will revert to the original virt-launcher image where the AAVMF_CODE file hasn't been replaced. To make this change effective, you’ll need to replace the AAVMF_CODE.fd file in the virt-launcher image itself rather than in individual pods, then use this updated virt-launcher image to start the VM. @andreabolognani Do you have any suggestion?
There might be a way to inject files into the pod before the VM starts, for example using the sidecar hook. I'm not too familiar with these facilities, so I might be wrong about it. Rebuilding the virt-launcher image would be significantly more work.

My suggestion would be to try and figure out a way to change

```xml
<loader readonly='yes' secure='no' type='pflash'>/usr/share/AAVMF/AAVMF_CODE.fd</loader>
```

in the domain XML to

```xml
<loader readonly='yes' secure='no' type='pflash'>/usr/share/AAVMF/AAVMF_CODE.verbose.fd</loader>
```

The verbose build of AAVMF would hopefully produce at least some output pointing us in the right direction. Again, I'm not sure what facilities, if any, KubeVirt provides to inject this kind of change. The sidecar hook might be the one.
Yes, the sidecar is a good suggestion! It can run a custom script before VM initialization. Here is a guide: https://kubevirt.io/user-guide/user_workloads/hook-sidecar/
Based on this example, something like

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config-map
data:
  my_script.sh: |
    #!/bin/sh
    # onDefineDomain hook: the fourth argument carries the domain XML
    tempFile=`mktemp --dry-run`
    echo $4 > $tempFile
    # point the loader at the verbose AAVMF build
    sed -i "s|AAVMF_CODE.fd|AAVMF_CODE.verbose.fd|" $tempFile
    cat $tempFile
```

(completely untested) should do the trick.
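For reference, wiring such a ConfigMap to a VM goes through an annotation on the VMI, per the guide linked above. A hedged sketch (assumes KubeVirt's Sidecar feature gate is enabled; names and the disk image are placeholders, the latter borrowed from later in this issue):

```sh
# Sketch: a VMI that runs my_script.sh as its onDefineDomain hook.
# The annotation must be present at creation time so the hook
# sidecar is spawned alongside the virt-launcher pod.
kubectl apply -f - <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-verbose-firmware
  annotations:
    hooks.kubevirt.io/hookSidecars: '[{"args": ["--version", "v1alpha2"],
      "configMap": {"name": "my-config-map", "key": "my_script.sh",
      "hookPath": "/usr/bin/onDefineDomain"}}]'
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
    resources:
      requests:
        memory: 1024M
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/kubevirt/cirros-container-disk-demo:v1.2.2-arm64
EOF
```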
Hello @zhlhahaha @andreabolognani, following the suggestions above, I was able to create a VM with the loader pointed at AAVMF_CODE.verbose.fd. I have attached all container logs of the virt-launcher and the dumpxml of my VM. I could not use the sidecar hook functionality to get the host's local UEFI boot firmware into the VM (if that is even possible) as a test. I am still troubleshooting (of course, I will welcome any suggestions if we want to go down that route), but I wanted to provide you both with the current data in the meantime.
@jaredcash the XML configuration looks good, it's clearly pointing at the verbose AAVMF build now. I don't see any guest output in the log, though admittedly I'm not entirely sure it's supposed to be there in the first place. Do you still get absolutely zero output on the VM's serial console?
@andreabolognani yes, unfortunately, I am still getting zero output from the VM's serial console. For reference:
I assume you're making sure to connect to the console the moment it becomes available, so no output is lost because of a delay. Well, I'm truly out of ideas at this point. The VM configuration looks good, and even if the guest image were completely busted you should still get some output out of the verbose AAVMF build. Since the pod at least remains up, maybe you can poke around inside it to try and get a better understanding. Maybe run
That should produce a lot of output.
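For instance, a firmware-only smoke test along these lines (a sketch, not the exact command from the thread; the QEMU binary path inside virt-launcher is an assumption):

```sh
# Boot nothing but the firmware, with serial output on stdio. A
# healthy verbose AAVMF build should print early boot messages here.
/usr/libexec/qemu-kvm \
  -machine virt -cpu host -enable-kvm -m 512M \
  -bios /usr/share/AAVMF/AAVMF_CODE.verbose.fd \
  -nographic
```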
@andreabolognani from within the pod, the
Interestingly, I am getting no output from the
I was playing around with the command, but I am not seeing an option for more verbose output.
Looks like the issue is probably at the QEMU level. Hopefully @zhlhahaha can help you debug this further, because I'm entirely out of my depth at this point :)
Thanks @andreabolognani. Currently I have no idea why it happens. Maybe we can make it simpler: @jaredcash, would you mind trying
See if there is any output. It works fine on my local Arm64 server and I can see plenty of output:
I also tried to use the original
If there is still no output, you can try replacing the AAVMF_CODE.fd with the host one.
@zhlhahaha I am not getting any output using either
I will try to replace the
You can do it easily in this docker environment:
Then see if there is any output.
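One hedged way to set up that environment (image tag and in-container paths are assumptions):

```sh
# Run the virt-launcher image with KVM access and the host firmware
# mounted read-only, then repeat the firmware-only smoke test using
# the host's AAVMF_CODE.fd instead of the bundled one.
docker run --rm -it --device /dev/kvm \
  -v /usr/share/AAVMF:/host-aavmf:ro \
  quay.io/kubevirt/virt-launcher:v1.2.1 \
  /usr/libexec/qemu-kvm -machine virt -cpu host -enable-kvm -m 512M \
    -bios /host-aavmf/AAVMF_CODE.fd -nographic
```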
BTW, @jaredcash,
@zhlhahaha apologies, that was my misunderstanding. I copied the host's local UEFI boot firmware to the docker container, and I am getting the same output from the container as I do when running the
Also, here is the command used to start the VM using pure virsh:
@jaredcash You can get output when using the host AAVMF_CODE.fd, so the issue may be caused by a UEFI firmware problem.
I found only one place where an overall image registry can be configured: kubevirt/types.go#L1980. However, it does not support setting the repository specifically for the virt-launcher image. To use a new virt-launcher image, the process is somewhat tricky:
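A rough, untested sketch of what the rebuild step might look like (registry.example.com is a placeholder; the KubeVirt deployment must then be pointed at the new registry, per the caveat above):

```sh
# Bake the host firmware into a custom virt-launcher image and push
# it somewhere the cluster can pull from.
cp /usr/share/AAVMF/AAVMF_CODE.fd .
cat > Dockerfile <<'EOF'
FROM quay.io/kubevirt/virt-launcher:v1.2.1
COPY AAVMF_CODE.fd /usr/share/AAVMF/AAVMF_CODE.fd
EOF
docker build -t registry.example.com/kubevirt/virt-launcher:v1.2.1 .
docker push registry.example.com/kubevirt/virt-launcher:v1.2.1
```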
@zhlhahaha, it looks like the UEFI firmware was the issue. After applying this workaround, the VM was accessible. As a note, my cluster is using KubeVirt v1.2.1, so those were the images I pushed to my repo.
Console/SSH verification:
KubeVirt 1.2.1 comes with edk2 2023.05 while the recently released 1.4.0 comes with edk2 2024.05. It would be interesting to know whether the latest release works out of the box on your machine, which would mean we can chalk it down to some edk2 issue that's been addressed in the meantime, or not.

Do I understand correctly that you've copied the host's edk2 build into the container? That's very interesting. Ubuntu 22.04 comes with a version of edk2 that's even older (2022.02), so it's somewhat surprising to me that it apparently works better.
Based on @andreabolognani's suggestion, you can give it a quick check by running:
to see if there is any output.
Hey @andreabolognani,
Yep, that is correct (specifically with the
@zhlhahaha I tried your suggestion of running a container with virt-launcher v1.4.0, but unfortunately I ran into the same situation of getting no output:
@jaredcash that's very interesting! The fact that a firmware image consistently either works or doesn't work across two very different QEMU builds (Ubuntu 22.04 vs recent Fedora releases) certainly seems to point in the direction of edk2 being the root cause.

I'd like to narrow things down further. This will be a bit of work for you; hopefully you don't mind too much. You've been extremely cooperative so far, which I appreciate a lot :)

The idea is to try an old Fedora build, roughly matching the upstream version of the one in Ubuntu 22.04. If that works, then the issue was likely introduced upstream; if it doesn't, it might be downstream-specific. I think it's fine to run the tests using the host's QEMU, since as we've seen from earlier tries that doesn't seem to be the determining factor. So please download
You'll need to install the
This is what the contents of the newly-created directory should look like:
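A hedged sketch of one way to produce that directory without installing the package (assumes rpm2cpio and cpio are available; the in-package firmware path follows current Fedora packaging):

```sh
# Unpack the edk2-aarch64 RPM into a scratch directory; no root or
# package installation required.
mkdir edk2-fedora && cd edk2-fedora
rpm2cpio ../edk2-aarch64-*.rpm | cpio -idmv
# The firmware images should now be under:
ls usr/share/edk2/aarch64/
```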
If everything looks good, run the test again, this time pointing to the firmware image you've just extracted from the package:
Similarly, it would be interesting to test the latest Ubuntu build. The process is similar: download
This is what the contents should look like this time around:
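A similar hedged sketch for the Debian package (dpkg-deb ships with dpkg itself; the in-package path follows current Ubuntu packaging):

```sh
# Unpack the .deb without installing it; Ubuntu ships the firmware
# inside the package as usr/share/qemu-efi-aarch64/QEMU_EFI.fd.
dpkg-deb -x qemu-efi-aarch64_*.deb edk2-ubuntu/
ls edk2-ubuntu/usr/share/qemu-efi-aarch64/
```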
Run the test one last time:
Phew! Thanks in advance :)
@andreabolognani no problem! I also appreciate all the help both you and @zhlhahaha have provided! Testing your theory, the Fedora build that roughly matches Ubuntu 22.04 worked:
Interestingly enough, the latest Ubuntu build did not work:
@jaredcash thanks. It's increasingly looking like an upstream issue indeed! I think the first thing to try at this point would be a more recent package, specifically
There's a small chance that the issue has already been found and fixed upstream. Maybe we're just lucky like that. Assuming we're not, it would be extremely helpful to narrow things down further. These are all the builds made between edk2 2022.02, which based on the test above we know works, and edk2 2023.05, which is the version included in KubeVirt 1.2.1 and we know doesn't:
If you could take them all for a spin and report back the results, we'd then be able to pass the information on to the edk2 maintainer for further analysis.
@andreabolognani the following are the results of the test.
Fedora builds that did not work:
@jaredcash thanks a lot, that's very useful.

@kraxel can you please take a look at this? The tl;dr is that the reporter is having trouble running VMs on their machine and we've tracked the problem down to edk2, as changing just that component while leaving everything else untouched makes all the difference between a successful run and an unsuccessful one. We have further narrowed it down to an upstream issue rather than a downstream one, since Fedora and Ubuntu builds of the same edk2 release present identical behavior.

Last working version: edk2 2022.08
Hardware: Turing RK1 compute module (aarch64)
Reproducer:

Full details above, of course :)
Where is the variable store flash?
@kraxel we didn't set up one for the (simplified) reproducer. Do you think that could make a difference? Note that the failure was originally reported against KubeVirt, which will take care of setting up both pflash devices, so the absence of the NVRAM part can't be the determining factor.
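For completeness, a minimal sketch of what a valid two-pflash setup looks like (Debian/Ubuntu AAVMF paths assumed; the VARS file must be a writable private copy):

```sh
# Read-only CODE flash plus a per-VM writable copy of the variable
# store, the setup KubeVirt/libvirt would normally provide.
cp /usr/share/AAVMF/AAVMF_VARS.fd /tmp/test-vars.fd
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -m 512M -nographic \
  -drive if=pflash,format=raw,readonly=on,file=/usr/share/AAVMF/AAVMF_CODE.fd \
  -drive if=pflash,format=raw,file=/tmp/test-vars.fd
```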
It's an invalid configuration and will most likely not boot up. It'll probably be a different failure mode though, i.e. fail late enough that at least some messages show up on the serial line.

Any results with the latest Fedora build?

One change in late 2022 was that the armvirt builds turn on paging very early, which allows memory attributes etc. to be set up properly. But some buggy aarch64 cores tripped over that, so edk2 got the CAVIUM_ERRATUM_27456 config option (which enables a workaround). IIRC that landed in the first 2023 release, and the Fedora builds have that turned on. Possibly there are more errata dragons lurking though.

The "Turing RK1 compute module" is the only hardware affected, it seems, is that correct?

@ardbiesheuvel ^^^
Interestingly, that change did arrive between 22.08 and 22.11, so it might indeed be implicated here.
Which kernel version is the host using? One thing that would be instructive is to check whether single stepping through the first 50 instructions or so is sufficient to get things running. If you run qemu with
to connect, and then use
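A hedged reconstruction of that experiment (the exact flags were not preserved above; -S and -s are standard QEMU options for freezing the CPU at startup and exposing a gdb stub on tcp::1234):

```sh
# Start QEMU frozen (-S) with a gdb stub (-s is shorthand for
# -gdb tcp::1234), booting only the firmware under test.
qemu-system-aarch64 -machine virt -cpu host -enable-kvm -m 512M \
  -bios AAVMF_CODE.fd -nographic -S -s &

# In another terminal: attach, single-step the first ~50 instructions,
# then let the firmware run and watch for serial output.
gdb-multiarch -ex 'target remote localhost:1234' \
              -ex 'stepi 50' \
              -ex 'continue'
```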
Yeah, that was the idea: working builds produce at least some output, while non-working ones are completely silent. We could probably try again with NVRAM if we think it could realistically make a difference.
The most recent that was tested so far is
@jaredcash can you please try again with the very latest build,
It's the only one we know about, yes.
Very interesting indeed :)
Hey @andreabolognani, the latest Fedora build is also not working; for reference:
As an additional note, the following is the kernel version of all the physical nodes in the environment:
This is a known KVM issue that was fixed in commit torvalds/linux@406504c. The fix was backported to v5.10.164; the hosts here run 5.10.160-rockchip, which predates the backport, so their KVM still carries the bug.
What happened:
I am unable to access a newly deployed VM. The output of `kubectl get vmi` shows that the VM is running and ready, but I believe it is not fully initializing, as I am unable to access the VM via `virtctl console` / `virtctl ssh` and there are no guest console logs from the virt-launcher pod. As a note, I deployed the Kubernetes cluster using K0s. All nodes in the cluster are passing qemu validation:

Kubevirt components:

What you expected to happen:
Deploy a working VM using KubeVirt.

How to reproduce it (as minimally and precisely as possible):
Deploy a cluster with `k0sctl` (https://docs.k0sproject.io/v1.30.0+k0s.0/k0sctl-install/) on a Turing RK1 compute module. Note: I am using Calico with vxlan as my CNI, but this also failed with the same issue using kube-router (the default CNI with K0s).

Additional context:
My server is using ARM64 architecture and the hardware is Turing RK1 compute modules (https://turingpi.com/product/turing-rk1/). I have been able to successfully deploy a Cirros VM using `virsh` with the cirros-0.5.2-aarch64 image. I have attempted to use an aarch64 image for my KubeVirt VM, but that also failed to initialize (I used image quay.io/kubevirt/cirros-container-disk-demo:v1.2.2-arm64).

I have been interested in using KubeVirt, but I have been running into this same issue when using different Kubernetes deployments (KIND and Minikube). All tests have been done on a Turing Pi RK1 cluster (single node and multi-node).

I have attached the logs from the virt-launcher pod (all containers) and my kubevirt CR object.

Environment:
- KubeVirt version (`virtctl version`): v1.2.1
- Kubernetes version (`kubectl version`): v1.30.0+k0s
- Kernel (`uname -a`): 5.10.160-rockchip aarch64 GNU/Linux

kubevirt-cr-yaml.txt
virt-launcher-logs.txt