
Network throughput lost after each topology hop #2092

Open
micush opened this issue Aug 5, 2022 · 24 comments

micush commented Aug 5, 2022

Hi,

I'm currently running GNS3 v2.2.33.1 on Ubuntu 22.04.

If I have a back-to-back directly connected topology like this:

host1 <-> host2

Using iperf3 between the two hosts I can achieve throughput close to 1Gbps, which is fine.

However, if I have a back-to-back-to-back topology like this:

host1 <-> host2 <-> host3

Using iperf3 between the two outermost hosts, host1 and host3, with host2 as transit, I can achieve throughput of around 500Mbps, which is roughly half.

Additionally, if I have a back-to-back-to-back-to-back topology like this:

host1 <-> host2 <-> host3 <-> host4

Using iperf3 between the two outermost hosts, host1 and host4, with host2 and host3 as transit, I can achieve throughput of around 250Mbps, which is roughly half again.

If I continue in this manner I can get down into the single digits for throughput between hosts, which is obviously not good.
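
For reference, the tests were simple iperf3 runs along these lines (the addresses and options here are just an example, not the exact commands used):

# on the far-end host (e.g. host3), run the iperf3 server
iperf3 -s

# on host1, run the client towards it; 10.0.0.3 stands in for the far-end host's address
iperf3 -c 10.0.0.3 -t 30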

This performance degradation has existed for many years over many different versions of GNS3. It doesn't matter what type of host I use: Windows, Linux, IOS, IOS XR, etc. They all experience the same throughput degradation the more hosts they traverse.

This is obviously some sort of issue with the underlying virtualization layer. Can somebody confirm this behavior and perhaps suggest a fix? I do realize this is a virtualized topology and I don't expect full throughput between devices, however, a 50% throughput hit for each traversed device is a bit excessive.

Any suggestions are welcome.

Thanks much.

grossmj added this to the 2.2.34 milestone Aug 5, 2022
josephmhiggins commented Aug 11, 2022

On Windows 10 with VMware Workstation 16.2.1, on my 12-year-old computer, worth $0, I ran the gns3-internal-topology iperf tests for:

  1. (Test 1) Ubuntu 22 to Ubuntu 22 (to establish a GNS3, not ubridge-only, baseline)
  2. (Test 2) Ubuntu 22 to FRR (frrouting/frr:latest, where latest as of today is 2022-08-11) to Ubuntu 22

I see no 50% bandwidth drop after the first hop. Since I saw no 50% bandwidth drop, I believe there was no reason for me to proceed.
I am only getting involved because I am learning ubridge.exe.
I see a 280M to 218M drop after the first hop, and I can easily explain that away after I relearn C, ubridge.exe, and Python (I do not know how ubridge is called).

The “actual” bandwidth numbers can, I believe, be ignored because my computer is very old.
The only thing I am concentrating on is the drop between locally connected (Test 1) and 1 hop away (Test 2).
I see NOTHING wrong, and it is what I would expect.

This post should include custom-made reports from Windows 10, and I will not say how I created them, for multiple reasons.
This post should include custom-made charts from Microsoft Excel, and I will not say how I created them, for multiple reasons.

Disclaimer:
The post may not include those items if GitHub blocks me from uploading them.
The post may get hosed up by some special characters, so I had to alter some text to remove things like tildes, pound signs, Microsoft Word page breaks, and basically any non-alphanumeric characters.
The reader is trusted to know how to configure zebra routing on FRR (I only use zebra routing), plus the IP addresses and static routing on the Ubuntu 22 devices.
The Ubuntu 22 nodes are drawn from an Ubuntu 22 master image and only have SSH and iperf on them.
The SSH network is only used so I can SSH into the device and enter text commands at the CLI.

The textual numbers for the tests are:

Test 1:
Iperf server Ubu22-iperf-ssh-102: iperf -s 192.168.21.102
iperf: ignoring extra argument -- 192.168.21.102

Server listening on TCP port 5001
TCP window size: 128 KByte (default)

Iperf client Ubu22-iperf-ssh-101: iperf -c 192.168.21.102 -i 60 -t 180

Client connecting to 192.168.21.102, TCP port 5001
TCP window size: 85.0 KByte (default)

local 192.168.21.101 port 52832 connected with 192.168.21.102 port 5001
Interval               Transfer     Bandwidth
0.0000-60.0000 sec     2.01 GBytes  288 Mbits/sec
60.0000-120.0000 sec   2.00 GBytes  286 Mbits/sec
120.0000-180.0000 sec  2.01 GBytes  288 Mbits/sec
0.0000-180.0386 sec    6.02 GBytes  287 Mbits/sec

Test 2:
Iperf server Ubu22-iperf-ssh-104: iperf -s 192.168.54.104
iperf: ignoring extra argument -- 192.168.54.104

Server listening on TCP port 5001
TCP window size: 128 KByte (default)

Iperf client Ubu22-iperf-ssh-103: iperf -c 192.168.54.104 -i 60 -t 180

Client connecting to 192.168.54.104, TCP port 5001
TCP window size: 85.0 KByte (default)

local 192.168.53.103 port 40768 connected with 192.168.54.104 port 5001
Interval               Transfer     Bandwidth
0.0000-60.0000 sec     1.48 GBytes  212 Mbits/sec
60.0000-120.0000 sec   1.51 GBytes  216 Mbits/sec
120.0000-180.0000 sec  1.54 GBytes  221 Mbits/sec
0.0000-180.0776 sec    4.53 GBytes  216 Mbits/sec

Oh, the .gns3 file had to be renamed from ubu-frr-ubu.gns3 to ubu-frr-ubu.txt.

If the reader wants to load my ubu-frr-ubu project, the reader is trusted to know how to replace my Ubuntu 22 ID with their own – I think it is the template or node ID in the ubu-frr-ubu.txt.

The project is really 2 projects, but I chose to make it 1 project because my confidence level was extremely high that there is no problem with GNS3.

The Windows reports I just learned how to make yesterday.
The Microsoft Excel charts I just learned how to make today.
Oh, my total RAM on my host computer is 32GB, but that is an absolute number as opposed to the other graphs, which are percentages. That was a tough call there.

Give this 24 hours and I will put in all the stuff I may have forgotten to upload and fix all the special characters that might be hidden.
ubu-frr-ubu.txt
iperf-Ubuntu-Frr-Ubuntu
iperf-Ubuntu-Ubuntu

Oh, I will upload the raw CSV files, but I am not uploading the normalized "CSV" files (one is an .xlsx, because Excel deletes formulas if you save a file as a CSV).

01-ram-cpu-privilege-raw.csv
02-ram-cpu-privilege-raw.csv

No one should mess around with the raw CSV files because they are very time-consuming to understand and convert to a ram-cpu-priviliged-normalized.xlsx.

Oh, the topology diagram is here: [topology image]

Oh, during the tests, even moving the mouse might generate an interrupt, so I had to walk away and close down just about everything. I did better at closing things down on the 2nd test. And I run Windows, and Windows is always doing crazy stuff on its own.

@josephmhiggins

Oh, I chose FRR because micush pointed another GNS3 community member to FRR.
I never used FRR before today.


micush commented Aug 11, 2022

Hi,

Thanks for the reply. My issue was never with the first hop. It's after that when it becomes an issue.

[image attachment]


micush commented Aug 11, 2022

Just an FYI. Running these same KVM VM images in Proxmox and not in GNS3 yields about 20Gbps to all hops in my setup.


josephmhiggins commented Aug 11, 2022

I only saw a 23% drop after the first hop on Windows with FRR as the middle node.
I am comfortable with that drop on Windows with VMware Workstation as my hypervisor on my old machine.

Traditionally what you do is assign "host 1" as a client and "host 3" as a server, and then you graph out the throughput in Wireshark to determine whether it is a "network" problem or a "server" problem. So GNS3 AND your host2 would be the "network".

So, it can be very complicated if your host2 is paged in and out of memory.

I did not include a graph of memory being paged out or any of the other thousands of SNMP items, because it is very complicated and is above the GNS3 user expert level.

It gets very complicated because the ubridge hypervisor can be used as a baseline for a system... but I looked at the GNS3 ubridge documentation and:

  1. I do not think it supports more than 2 nodes if you run ubridge.exe standalone. In other words, that is the GNS3 developers' responsibility to handle, because I do not know how to do that.
  2. I run Windows, and I did not like the ubridge documentation requesting that I install various programs to run ubridge as a standalone hypervisor on my Windows system. I am not loading anything on my system that could damage GNS3.

I do not know what kind of vms your host1, host2, host3 are. I do not know your cpu usage. I do not know your operating system. I do not know how constrained your RAM is.

My ubu-frr-ubu.txt is concrete.

You originally asked for "Can somebody confirm this behavior" and I have

....denied that behavior on Windows 10 Pro.

I have done all that I can do.


micush commented Aug 11, 2022

Thanks for the input. All images are KVM/QEMU images and can be copied to and run on a Proxmox host as well. When doing so there is no such performance degradation, and throughput is about 20Gbps for all attached hosts. There is obviously an issue somewhere, whether that is with ubridge or GNS3 or whatever. However, if Proxmox running KVM/QEMU can forward packets at 20Gbps between images and GNS3 cannot when using the same technology, there is an issue there somewhere.

My concern isn't running it on Windows with VMware or anything else. My GNS3 host is Ubuntu 22.04 running QEMU 6.2; I see the issue there. My Proxmox host is version 7.2, also running QEMU 6.2. The same VM image was copied from my GNS3 environment into my Proxmox environment, connectivity is set up the same in both environments, and the GNS3 environment is not even close to the Proxmox environment's performance with the same VM images, also running QEMU 6.2.


grossmj commented Aug 12, 2022

I will need to investigate this more. I think the bottleneck is definitely uBridge here; this small program has no specific optimizations (e.g. GNS3/ubridge#2) and runs in userland. It basically copies packets from one socket to another within a thread representing one unidirectional connection.

One major benefit of uBridge is that it doesn't need special rights to create new connections (it uses UDP tunnels) and it can run on multiple platforms. Now that we are dropping Windows support for the GNS3 server, starting with version 3.0, we plan to replace uBridge with Linux bridges, which are much faster and will allow us to implement some exciting features like advanced filtering. I bet this is what Proxmox uses in their backend.
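
For illustration, a minimal sketch of what wiring two VM tap interfaces through a kernel Linux bridge could look like (br-lab, tap101 and tap102 are placeholder names, not necessarily what GNS3 3.x will actually use):

# create a kernel bridge and attach both VM tap interfaces to it
ip link add br-lab type bridge
ip link set br-lab up
ip link set tap101 master br-lab
ip link set tap102 master br-lab
ip link set tap101 up
ip link set tap102 up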


micush commented Aug 12, 2022

Hi,

Thanks for the reply, much appreciated. My testing on Proxmox was indeed done with Linux bridges. I've also tested with OVS bridges, with very similar results. Both technologies allow me to forward packets between any hosts at about 20Gbps, whereas with ubridge (I assume, as I just created the topology with the GNS3 GUI on my local host) the best I got was 1Gbps between directly connected hosts, and it only goes down from there the more hops you add between hosts.
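
For reference, the OVS side of such a test boils down to roughly the following (bridge and port names are placeholders, not my exact configuration):

# create an OVS bridge and attach two VM tap interfaces to it
ovs-vsctl add-br ovsbr0
ovs-vsctl add-port ovsbr0 tap101
ovs-vsctl add-port ovsbr0 tap102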

Regards

grossmj modified the milestones: 2.2.34, 3.1 Aug 12, 2022
grossmj self-assigned this Aug 12, 2022
@spikefishjohn

@grossmj FYI - I've been playing around with FRR and VXLAN, and I think I've found that Linux bridge doesn't have full acceleration compared to OVS bridges. I haven't switched over to OVS bridge yet. I'll try to get something together on that so I can get a better idea of how much overhead Linux bridge has vs OVS bridge.

Ean Towne also made a spreadsheet showing the effects of different NIC drivers.

https://docs.google.com/spreadsheets/d/1lEY1P6xTwDePtR0eqRhtxUlrZ72qgIZ5nnrRvnjWeMQ/edit?usp=sharing


micush commented Aug 13, 2022

For simplicity's sake, I tried to come up with a better way to represent this throughput loss.

This is the topology I used: [topology image]

These are the numbers I got during iperf3 testing: [iperf3 results image]

All hosts are KVM VMs running Ubuntu 22.04 with 8GB RAM and 4 vCPUs each.

Visually I hope this makes better sense.

@josephmhiggins

Yeah, it makes better sense.

Oooops, I thought gns3server.exe was calling ubridge.exe.

So, my %processor in my graph is not extremely important.
Um, it could be important if I had lots of programs open and Windows 10 'Antimalware Service Executable' was running.
Windows 10 'Antimalware Service Executable' can slow down my system dramatically (I just changed it to only run when the system is idle).

What is extremely important is GNS3 VM CPU usage.
When I am running the test my GNS3 VM is at 100%.
So, I can tweak my VMware Workstation and give it more resources.

But...I am not touching anything in my GNS3 VM.

@josephmhiggins

I tweaked my VMware Workstation and consequently my throughput improved by 35% and my overall CPU usage on my host went up by 300%.

The graphs I created would have to be done with PowerShell, as far as I know.
I think it would take me about 32 hours to figure out how to create a graph via a PowerShell script.
That would be time that I do not have.

Furthermore, as a rule, I do not help people tweak their host machines.

As for you showing a 50% drop on your Ubuntu host while I am showing a 23% drop on my 12-year-old Windows computer with the GNS3 VM on VMware Workstation, there's nothing I can do about that.


micush commented Aug 13, 2022

@josephmhiggins, I'm glad you are not seeing an issue on your Windows host running VMware. However, your environment is not comparable to what I have outlined above and does not fit the use case. Run the same tests on a Linux host (server) running QEMU 6.2 for a closer comparison.

As @grossmj pointed out, starting with GNS3 v3 Windows will not be supported on the server side anymore and ubridge will (eventually) be replaced with Linux bridges, which will hopefully alleviate the issue I am seeing. At least in my testing on Proxmox with Linux bridges (and OVS bridges as well) it points to this conclusion. I'll wait patiently for the change to be made and retest when appropriate.

Thanks for the discussion all, much appreciated.


josephmhiggins commented Aug 13, 2022

Unless I am mistaken, GNS3 v3 will not support Windows without the GNS3 VM, but will support Windows with the GNS3 VM. My GNS3 VM is Ubuntu 20.04.

Edit: Let me clarify. With all the VMs running on the GNS3 VM on Windows, the GNS3 VM's Ubuntu 20.04 is directly comparable to an Ubuntu 20.04 host. (An Ubuntu 22.04 host running KVM is, strictly by the book, considered a type-2 hypervisor like the GNS3 VM, and it is the same as VMware Workstation. People get very ticklish about differentiating type-1 and type-2 hypervisors. To turn Ubuntu KVM into a true type-1 hypervisor a person would have to do something, but I forget what.) In other words, the GNS3 VM on Windows adds an unknown amount of overhead, but I do not think it is that much.

@spikefishjohn

Using Debian 11.4, virtio NICs, and Ethernet switches in between, I get the following.

host1 -> host1 20GB
host1 -> host2 200M
host1 -> host3 98M
host1 -> host4 69M

That first drop is.. alarming.

[image attachments]


grossmj commented Aug 14, 2022

@spikefishjohn Ethernet switches are simulated by Dynamips, so they may reduce the throughput even more.

Just for info, I remember making a document about network throughput in GNS3 a long time ago, probably in 2015; however, I hadn't tested multiple topology hops.

Network Throughput in GNS3.pdf


grossmj commented Aug 14, 2022

@micush

I'll wait patiently for the change to be made and retest when appropriate.

Thanks again for opening this issue. Using Linux bridges with VXLAN (for the overlay network across multiple compute hosts) is our current preferred solution to fix it, similar to what is described in these pages: https://programmer.help/blogs/practice-vxlan-under-linux.html and https://vincent.bernat.ch/en/blog/2017-vxlan-linux
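
In the spirit of those articles, a minimal iproute2 sketch of that approach looks roughly like this (the VNI, addresses and interface names are placeholders, not a committed GNS3 design):

# VXLAN tunnel endpoint towards another compute host
ip link add vxlan100 type vxlan id 100 dstport 4789 local 10.0.0.1 remote 10.0.0.2 dev eth0
# local bridge that the VM tap interfaces and the VXLAN interface both join
ip link add br100 type bridge
ip link set vxlan100 master br100
ip link set tap0 master br100
ip link set vxlan100 up
ip link set br100 up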


micush commented Aug 14, 2022

@spikefishjohn

That first drop is.. alarming.

Yeah it is, to the point where I was like WTF?!?! :)

Like @grossmj said, if I take my VMs and move them out of GNS3 and into Proxmox and use Linux bridges for the connectivity between them the issue disappears entirely. Full speed to all hosts no matter how deep the hop count is.

Hopefully this gets changed in 3.x. Right now it's pretty painful when you're 5 or 6 hops deep from one host to another and the app/feature/thing you are trying to test is performing terribly for no apparent reason.

Again, thanks everybody. GNS3 is a great product. I've enjoyed it a lot over the years. This fix will make it even better.

@spikefishjohn

@grossmj Don't know if this helps, but I have 3 GNS3 servers on which I'm already using FRR to make an EVPN L2 network. I'm using Ansible for the deployment; this way, if I add a bridge it's added to all servers. I will say Netplan is a big problem. I've had a lot of issues where I needed to create network configs using systemd instead. For example, VXLAN isn't supported in Netplan.
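
As an illustration, the systemd-networkd way of defining a VXLAN interface and attaching it to a bridge looks roughly like this (names, VNI and addresses below are placeholders, not my actual configs; key names can vary slightly between systemd versions):

# /etc/systemd/network/vxlan100.netdev
[NetDev]
Name=vxlan100
Kind=vxlan

[VXLAN]
VNI=100
Local=10.0.0.1
Remote=10.0.0.2
DestinationPort=4789

# /etc/systemd/network/vxlan100.network -- enslave the VXLAN interface to an existing bridge
[Match]
Name=vxlan100

[Network]
Bridge=br0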

I based my lab more or less on this.

https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

Only I don't have a route reflector. I'm also peering my VXLAN interfaces from a loopback interface, using OSPF to advertise the loopback interfaces and then MP-BGP peering across the loopback IPs. VXLAN is a bit odd since it doesn't have any loop prevention aside from split horizon. On my first attempt I made a massive loop and the very first multicast packet would loop forever. That's when I moved each host to use a single VXLAN peer. Note I also don't have a 100Gb switch, which is why I have a triangle between the servers.

[topology image]

OSPF, BGP, VXLAN, and bridge configs are all pushed from Ansible.


micush commented Aug 15, 2022

@spikefishjohn, I've found Netplan to be a big problem as well, so I remove it from all my hosts and replace it with ifupdown-ng. Ifupdown-ng makes VXLAN simple. Give it a try. It's like the original ifupdown, but with all the "advanced" networking support built right in, in a way that is compatible with the original ifupdown. One caveat though: install ifupdown-ng first, and then uninstall Netplan. Ask me how I know. :)


kefins commented Aug 25, 2022

Using Linux bridges with VXLAN (for the overlay network across multiple compute hosts) is our current preferred solution to fix it.

@grossmj This solution seems much like the one in OpenStack Neutron. But I think we can make it a little simpler: for VMs on the same physical server, the link can be achieved with VLANs on a Linux bridge, and for VMs on different physical servers, the link can be achieved with VXLAN connected to a Linux bridge.
I think the performance of this solution will be much better than ubridge, but the capture function on the link seems a little more complex than with ubridge.
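
For the same-server case, a VLAN-filtering Linux bridge would look roughly like this (interface names and VLAN IDs are placeholders, just to illustrate the idea):

# one bridge with VLAN filtering; each VM-to-VM link becomes its own VLAN
ip link add br0 type bridge vlan_filtering 1
ip link set br0 up
ip link set tap-vm1 master br0
ip link set tap-vm2 master br0
bridge vlan add dev tap-vm1 vid 10 pvid untagged
bridge vlan add dev tap-vm2 vid 10 pvid untagged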

@josephmhiggins

Yeah, the icon for my VMCI driver went yellow in Windows 10 Device Manager with error code 31. I looked up this thing called VMCI, and VMware reported, in "VMCI Socket Performance" 15 years ago, throughput of 29Gbps from Unix VM to Unix VM and 6Gbps from Windows VM to Windows VM. But that was 15 years ago, and it depends on TCP message size, etc. Ubridge is userland and there is a tremendous penalty for that, and it depends on the model of the CPU and the processor speed, etc. And all this stuff has to work on different operating systems, etc. Off the top of my head, with no data to back it up, inter-VM communication inside the GNS3 VM has to be at least 50Gbps for Unix-to-Unix because those tests were 15 years ago - but that does not include any hops.

I only have a 1Gbps pipe, so 50Gbps does nothing for me.

@josephmhiggins

And be careful: many GNS3 users run a GNS3 server on a laptop.
High bandwidth between GNS3 VMs will drive the CPU temperature through the roof and can possibly destroy a laptop.

grossmj modified the milestones: 3.1, 3.2 Jan 31, 2023

elico commented Jan 20, 2024

And be careful: many GNS3 users run a GNS3 server on a laptop. High bandwidth between GNS3 VMs will drive the CPU temperature through the roof and can possibly destroy a laptop.

Depends on the quality of the laptop...
For me, on an 8th-gen i5 with 32 GB RAM, it works fine for now, and even if it heats up, it still works fine without any hiccups.
