InfiniBand setup
Changes on al01 are not yet persistent. After a reboot you may need to:
- check that the subnet manager is running
- assign the IP address by hand:
sudo ifconfig ibs2f1 10.10.1.100 netmask 255.255.0.0
(I have not yet put this in the netplan config.)
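On systems where ifconfig is not installed, the same address can usually be assigned with ip from iproute2, which takes a prefix length instead of a dotted netmask. The sketch below is an illustration only: netmask_to_prefix is a hypothetical helper name, and it assumes a contiguous netmask.

```sh
# Hypothetical helper: converts a dotted netmask to a prefix length
# by counting the set bits in each octet (assumes a contiguous mask).
netmask_to_prefix() {
    awk -v mask="$1" 'BEGIN {
        n = split(mask, o, ".")
        bits = 0
        for (i = 1; i <= n; i++)
            for (v = o[i] + 0; v > 0; v = int(v / 2))
                bits += v % 2
        print bits
    }'
}

# Usage (same address as the ifconfig command above):
#   sudo ip addr add 10.10.1.100/"$(netmask_to_prefix 255.255.0.0)" dev ibs2f1
#   sudo ip link set ibs2f1 up
```

Like the ifconfig command, this does not survive a reboot.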
Each IB network needs a subnet manager. It typically runs on the switch, but it can run on any machine; in our case it runs on al01.
When the links are up but no subnet manager is running, ibstat shows the port as physically up (Physical state: LinkUp) but stuck in State: Initializing:
atr@node1:/home/atr$ ibstat
CA 'mlx5_0'
CA type: MT41682
Number of ports: 1
Firmware version: 18.26.1040
Hardware version: 0
Node GUID: 0x1c34da030072bbf6
System image GUID: 0x1c34da030072bbf6
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x1c34da030072bbf6
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT41682
Number of ports: 1
Firmware version: 18.26.1040
Hardware version: 0
Node GUID: 0x1c34da030072bbf7
System image GUID: 0x1c34da030072bbf6
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 100
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x1c34da030072bbf7
Link layer: InfiniBand
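The "LinkUp but Initializing" condition in the output above can also be picked out mechanically. The following is a minimal sketch, not part of the original setup: ports_waiting_for_sm is a hypothetical helper name, and it assumes the ibstat field names appear exactly as in the output above.

```sh
# Hypothetical helper: reads ibstat output on stdin and prints every port
# whose link is physically up but still waiting for a subnet manager.
ports_waiting_for_sm() {
    awk '
        # "CA <name>" starts a new adapter; skip the "CA type:" line.
        $1 == "CA" && $2 != "type:" { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # "Port <n>:" starts a new port; "Port GUID:" does not match.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            if (state == "Initializing" && $3 == "LinkUp")
                print ca " port " port
        }
    '
}

# Usage: ibstat | ports_waiting_for_sm
```

For the output above this prints mlx5_1 port 1. It prints nothing once the port has become Active.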
After I started the subnet manager on node0, ibstat on node1 shows the port as Active:
atr@node1:/home/atr$ ibstat
CA 'mlx5_0'
CA type: MT41682
Number of ports: 1
Firmware version: 18.26.1040
Hardware version: 0
Node GUID: 0x1c34da030072bbf6
System image GUID: 0x1c34da030072bbf6
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x1c34da030072bbf6
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT41682
Number of ports: 1
Firmware version: 18.26.1040
Hardware version: 0
Node GUID: 0x1c34da030072bbf7
System image GUID: 0x1c34da030072bbf6
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x1c34da030072bbf7
Link layer: InfiniBand
At this point the interface can be configured like any Ethernet network.
The subnet manager status on al01:
atr@al01:/home/atr$ sudo service opensmd status
● opensmd.service - LSB: Manage OpenSM
Loaded: loaded (/etc/init.d/opensmd; generated)
Active: active (running) since Tue 2021-04-06 08:18:45 UTC; 1s ago
Docs: man:systemd-sysv-generator(8)
Process: 58592 ExecStart=/etc/init.d/opensmd start (code=exited, status=0/SUCCESS)
Tasks: 92 (limit: 309035)
Memory: 14.5M
CGroup: /system.slice/opensmd.service
└─58609 /usr/sbin/opensm --daemon --pidfile /var/run/opensm.pid
Apr 06 08:18:45 al01 systemd[1]: Starting LSB: Manage OpenSM...
Apr 06 08:18:45 al01 opensmd[58592]: Starting opensm: * done
Apr 06 08:18:45 al01 OpenSM[58609]: /var/log/opensm.log log file opened
Apr 06 08:18:45 al01 OpenSM[58609]: OpenSM 5.7.2.MLNX20201014.9378048
Apr 06 08:18:45 al01 systemd[1]: Started LSB: Manage OpenSM.
Apr 06 08:18:45 al01 OpenSM[58609]: Entering DISCOVERING state
Apr 06 08:18:45 al01 OpenSM[58609]: Entering MASTER state
OpenSM writes its log file to /var/log/opensm.log.
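Because the subnet manager has to be re-checked after a reboot, it can help to pull its last reported state out of the journal lines shown above. This is a rough sketch: sm_last_state is a hypothetical helper name, and it assumes the messages keep the "Entering ... state" wording seen in the status output and that they are available through journalctl for the opensmd unit.

```sh
# Hypothetical helper: reads log lines on stdin and prints the state named
# in the last "Entering <STATE> state" message, or nothing if none is found.
sm_last_state() {
    awk '
        /Entering [A-Z]+ state/ {
            for (i = 1; i < NF; i++)
                if ($i == "Entering") state = $(i + 1)
        }
        END { if (state != "") print state }
    '
}

# Usage: journalctl -u opensmd | sm_last_state
```

With the log lines above, the last such message is "Entering MASTER state", so the helper prints MASTER.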
To check which driver the NIC is using:
atr@al01:/home/atr$ ethtool -i ibs2f1
driver: mlx5_core[ib_ipoib]
version: 4.9-2.2.4
firmware-version: 18.28.2006 (MT_0000000244)
expansion-rom-version:
bus-info: 0000:86:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
atr@al01:/home/atr$
Make sure the ib_ipoib module is loaded.
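One way to check for the module is to look it up in the kernel's module list. A minimal sketch, where has_module is a hypothetical helper name and the input is expected in the format of /proc/modules or lsmod (module name in the first column):

```sh
# Hypothetical helper: succeeds if the module named in $1 appears in the
# first column of the module list read from stdin.
has_module() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage: load the module only when it is missing.
#   has_module ib_ipoib < /proc/modules || sudo modprobe ib_ipoib
```

The modprobe call loads the module for the running system only; loading it at every boot needs a separate, distribution-specific setting.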
Edit /etc/netplan/00-installer-config.yaml as follows:
# This is the network config written by 'subiquity'
network:
  ethernets:
    eno1:
      addresses:
      - 192.168.1.103/16
      gateway4: 192.168.1.100
      nameservers:
        addresses:
        - 1.1.1.1
        - 1.1.1.1
        - 8.8.8.8
        search: []
    ibs2f1:
      dhcp4: no
      addresses:
        - 10.10.1.103/16
      gateway4: 192.168.1.100
      nameservers:
        addresses: [1.1.1.1, 8.8.8.8]
        search: []
  version: 2
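A gateway is normally expected to be reachable on the interface's own subnet. As a quick sanity check of the addresses in the config above, here is a small sketch. ip_in_cidr is a hypothetical helper, not a netplan feature, and it handles IPv4 only.

```sh
# Hypothetical helper: succeeds if the IPv4 address in $1 lies inside the
# network given in $2 in CIDR form (for example 10.10.1.103/16).
ip_in_cidr() {
    awk -v ip="$1" -v cidr="$2" '
        # Convert a dotted quad to a single integer.
        function to_int(s,   o) {
            split(s, o, ".")
            return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
        }
        BEGIN {
            split(cidr, p, "/")
            block = 2 ^ (32 - p[2])   # number of addresses in the network
            exit !(int(to_int(ip) / block) == int(to_int(p[1]) / block))
        }'
}

# Usage:
#   ip_in_cidr 192.168.1.100 192.168.1.103/16 && echo "on eno1 subnet"
#   ip_in_cidr 192.168.1.100 10.10.1.103/16  || echo "not on ibs2f1 subnet"
```

With the values above, 192.168.1.100 lies inside the eno1 network but not inside 10.10.0.0/16, so the gateway4 entry under ibs2f1 may deserve a second look. Whether it is intentional is not stated here.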
netplan examples:
- https://netplan.io/examples/
- How to Configure Static IP Address on Ubuntu 20.04, https://linuxize.com/post/how-to-configure-static-ip-address-on-ubuntu-20-04/
[atr@node1 ~]$ lspci | grep -i Mellanox
86:00.0 Infiniband controller: Mellanox Technologies MT416842 BlueField integrated ConnectX-5 network controller
86:00.1 Infiniband controller: Mellanox Technologies MT416842 BlueField integrated ConnectX-5 network controller
86:00.2 DMA controller: Mellanox Technologies MT416842 BlueField SoC management interfac
[atr@node1 ~]$
The BlueField hardware installation guide is available here: https://docs.mellanox.com/display/bluefieldsniceth/Hardware+Installation
Installing MOFED: https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
April 6th 2021: mlx4-installation-log on node 3
- IP over InfiniBand (IPoIB), https://docs.mellanox.com/pages/viewpage.action?pageId=25138271
- Configuring IPoIB on Linux, https://docs.oracle.com/cd/E19436-01/820-3522-10/ch4-linux.html
- Subnet Manager, https://docs.mellanox.com/display/MLNXOSv381000/Subnet+Manager
- https://wiki.archlinux.org/index.php/InfiniBand
- Start the Subnet Manager With the opensmd Daemon, https://docs.oracle.com/cd/E19632-01/835-0783-03/z400000b1835922.html