Unable to use private networks with subnet size != 16 #434

Open
ValentinVoigt opened this issue Sep 3, 2024 · 10 comments · May be fixed by #438

Comments

@ValentinVoigt

I have an existing network at Hetzner, to which I need to add a new cluster. I use the existing_network_name feature for that, as per the docs:

networking:
  private_network:
    enabled: true
    subnet: 172.16.0.0/12
    existing_network_name: "heo"

Unfortunately, since what I think was this commit, I am unable to add a new cluster. The relevant excerpt from the logs is:

[Instance prod-cx22-master2] Waiting for private network IP in subnet 172.16.0.0/12 to be available... (Attempt 1/30)
[Instance prod-cx22-master2] Waiting for private network IP in subnet 172.16.0.0/12 to be available... (Attempt 2/30)
[Instance prod-cx22-master2] Waiting for private network IP in subnet 172.16.0.0/12 to be available... (Attempt 3/30)
^C

After some digging, it looks like the following is happening: the master_install_script.sh script takes my 172.16.0.0/12 subnet, removes the /12, and removes the last .0, making it "172.16.0.". Because my subnet is actually a /12, the server's new IP address is 172.18.X.Y (which is indeed part of the original /12), so a simple grep for that prefix will not match. This would be true for every subnet that is not a /16.

The same probably applies to worker_install_script.sh as well, although I haven't tried.
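
To make the mismatch concrete, here is a minimal sketch of the prefix match as described above. It is a hypothetical reconstruction based only on that description, not the actual contents of master_install_script.sh; the function names subnet_prefix and prefix_matches are invented for illustration.

```sh
#!/bin/sh
# Hypothetical reconstruction of the prefix match described in this issue.
# It is NOT copied from master_install_script.sh.

subnet_prefix() {
    # "172.16.0.0/12" -> "172.16.0."
    net=${1%/*}                   # drop the "/12" suffix
    printf '%s.\n' "${net%.0}"    # drop the trailing ".0", keep a trailing dot
}

prefix_matches() {
    # $1 = subnet in CIDR notation, $2 = IPv4 address to test
    prefix=$(subnet_prefix "$1")
    case "$2" in
        "$prefix"*) return 0 ;;   # address starts with the textual prefix
        *)          return 1 ;;
    esac
}

# 172.18.0.3 is inside 172.16.0.0/12, but the textual prefix does not match it.
if prefix_matches 172.16.0.0/12 172.18.0.3; then
    echo "prefix match"
else
    echo "no prefix match"
fi
```

A textual prefix like this only works when the address happens to share the leading octets of the network address, which is why a /12 with an address in 172.18.X.Y keeps waiting until the attempts run out.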

I think there should be a hint in the docs, if this is not going to be supported anymore.

Adapting the code to support different subnet sizes might be very difficult. There is some awk magic around, or one could use some inline Python. But maybe there's a better way than using grep on the output of ip.

I don't know the code's intention well enough to argue about that, but I may have a suggestion. Hetzner writes that their private network MTU is always 1450, while the public interface has an MTU of 1500. Instead of looking for a matching IP address, maybe the following could be an alternative?

ip -4 addr show | egrep -q "mtu 1450.+state UP"
@vitobotta
Owner

Hello, I appreciate the recommendation to check the MTU, as it should always be effective. However, when I executed the command you suggested on my servers running Ubuntu 24.04, it didn't produce any output. The alternative command ip -4 addr show | awk '/mtu 1450/ && /state UP/ {getline; print $2}' does work, but in my situation, it returns two values. This is because I am using Cilium, which also has an MTU of 1450. Therefore, we need a more reliable method to obtain a single result for the actual private network interface.

@ValentinVoigt
Author

ValentinVoigt commented Sep 3, 2024

It doesn't show any output because I used the -q switch (because master_install_script.sh does that as well). Remove the -q or execute echo $? at the end. Is it really a problem that there will be two lines matching when using that regex? When Cilium is running, we can assume that k3s is already installed and we therefore don't need to wait for the interface to come up? Just guessing here, honestly.
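
To illustrate the point about -q and the exit status, here is a small sketch. The helper names and the sample text are made up for illustration; the sample is not captured from a real server, and the second 1450 entry only stands in for a CNI interface such as the Cilium one mentioned above.

```sh
#!/bin/sh
# Both helpers read `ip -4 addr show`-style text on stdin.

has_mtu_1450_up() {
    # -q prints nothing; the result is only visible in the exit status.
    grep -Eq 'mtu 1450.+state UP'
}

count_mtu_1450_up() {
    # -c prints the number of matching lines instead of the lines themselves.
    grep -Ec 'mtu 1450.+state UP'
}

# Illustrative sample only (hypothetical interface names and flags).
sample='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
3: enp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
4: cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000'

if printf '%s\n' "$sample" | has_mtu_1450_up; then
    echo "at least one UP interface with MTU 1450"
fi
```

On a live system the equivalent would be ip -4 addr show | has_mtu_1450_up; echo $?. For a wait loop, the exit status is all that matters, so several matching lines do no harm; they only become a problem when exactly one interface name has to be extracted.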

I can offer some inline Python code, as I personally think this would be too complex for bash alone. But I don't know if this is a good solution. What do you think?

#!/bin/bash

SUBNET="172.16.0.0/12"

python3 - <<EOF "$SUBNET"
import netifaces, ipaddress, sys

network = ipaddress.ip_network(sys.argv[1])

for iface in netifaces.interfaces():
    ips = netifaces.ifaddresses(iface)
    if netifaces.AF_INET not in ips:
        continue
    for obj in ips[netifaces.AF_INET]:
        ip = ipaddress.ip_address(obj['addr'])
        if ip in network:
            sys.exit(0)

sys.exit(1)
EOF

if [ $? -eq 0 ]; then
	echo "ip in $SUBNET exists"
else
	echo "ip in $SUBNET does not exist"
fi
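
For environments where Python or netifaces is not available, the same membership test can be sketched in plain POSIX sh using integer arithmetic. This is only a sketch under assumptions: the function names are invented, input is assumed to be well-formed IPv4 without leading zeros, and the shell is assumed to have at least 64-bit arithmetic (as dash and bash do).

```sh
#!/bin/sh

ip_to_int() {
    # Convert a dotted-quad IPv4 address to a 32-bit integer.
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

ip_in_subnet() {
    # $1 = IPv4 address, $2 = subnet in CIDR notation (e.g. 172.16.0.0/12)
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    addr_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "$net")
    [ $(( addr_int & mask )) -eq $(( net_int & mask )) ]
}

any_ip_in_subnet() {
    # $1 = subnet; reads one IPv4 address per line on stdin.
    while read -r addr; do
        if ip_in_subnet "$addr" "$1"; then
            return 0
        fi
    done
    return 1
}
```

A possible way to feed it on a live system, assuming the address is the fourth field of ip -o -4 addr show output: ip -o -4 addr show | awk '{print $4}' | cut -d/ -f1 | any_ip_in_subnet "$SUBNET".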

@vitobotta
Owner

I still wasn't getting any results from your command, but I found a solution that correctly identifies the interface (just one):

ip -o link show \
  | awk -F': ' '$2 !~ /cilium|br|flannel|docker|veth/ {print $2}' \
  | xargs -I {} bash -c "ethtool {} &>/dev/null && echo {}" \
  | while read -r iface; do
      mtu=$(ip link show "$iface" | awk '/mtu/ {print $5}')
      if [ "$mtu" -eq 1450 ]; then echo "$iface"; fi
    done

This method is sufficient for now since we're only using flannel and cilium.

We can't depend on Python or other tools because the worker installation script runs during the Cloud Init process with just sh when initializing autoscaled nodes (for static nodes, the script runs in regular bash), which imposes some limitations.

The command above works perfectly with the sh shell.
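
As a variation on the same idea that avoids bash and ethtool entirely, the MTU can also be read from sysfs. This is a sketch under assumptions, not the project's implementation: it relies on the Linux layout /sys/class/net/<iface>/mtu, it keeps the same name exclusions as the command above, it drops the ethtool probe, and the function name find_ifaces_with_mtu and its optional directory argument are invented for illustration.

```sh
#!/bin/sh

find_ifaces_with_mtu() {
    # $1 = wanted MTU, $2 = sysfs network directory (defaults to /sys/class/net)
    want=$1
    dir=${2:-/sys/class/net}
    for path in "$dir"/*; do
        # Skip entries that are not interfaces (no readable mtu file).
        [ -r "$path/mtu" ] || continue
        name=${path##*/}
        # Same exclusions as the awk filter: skip CNI, bridge and container interfaces.
        case "$name" in
            *cilium*|*br*|*flannel*|*docker*|*veth*) continue ;;
        esac
        if [ "$(cat "$path/mtu")" -eq "$want" ]; then
            echo "$name"
        fi
    done
}
```

Calling find_ifaces_with_mtu 1450 prints the names of the remaining interfaces whose MTU is 1450. The second argument exists only so the function can be pointed at a test directory.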

Could you possibly make a PR considering this? If not, I'll handle it, but I'm currently swamped with work.

ValentinVoigt pushed a commit to ValentinVoigt/hetzner-k3s that referenced this issue Sep 4, 2024
@ValentinVoigt ValentinVoigt linked a pull request Sep 4, 2024 that will close this issue
@holooloo

Is this related to my problem? (screenshot attached)

networking:
  ssh:
    port: 22
    use_agent: false # set to true if your key has a passphrase
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api: # this will firewall port 6443 on the nodes; it will NOT firewall the API load balancer
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled : true
    subnet: 10.0.0.0/16
    existing_network_name: "infra"
  cni:
    enabled: true
    encryption: false
    mode: flannel

@ValentinVoigt
Author

Your subnet size seems to be exactly 16, so... no?

@vitobotta
Owner

Are you sure 10.0.0.0/16 is the correct subnet for the network "infra"?

@holooloo

Sorry, it was my fault: subnet != network.

@vitobotta
Owner

@ValentinVoigt Hi, I guess we can close this issue since you opened a PR for the same problem? Unfortunately, I haven't had a chance to test it yet.

@ValentinVoigt
Author

I personally only close issues once they're fixed in a new release, but that's your decision.

@vitobotta
Owner

It was just to do some cleanup, since we also have the PR, but that's OK. We can keep this open for now.
