This is the artifact used to evaluate NetClone, as described in the paper "NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs" in ACM SIGCOMM 2023. NetClone is an in-network dynamic request cloning mechanism for microsecond-scale workloads. With NetClone, the switch dynamically duplicates a request to two idle servers and returns only the faster response to the client. This gives the client low tail latency even when there is unexpected latency variability across servers.
This repository contains the following code segments:
- Switch data plane code
- Switch control plane code
- Client and server applications with synthetic RPC workloads.
- To run experiments using the artifact, at least 3 nodes (1 client and 2 servers) are required. However, we recommend using more nodes because the benefit may be limited in a small cluster.
- Nodes should be equipped with an Nvidia ConnectX-5 NIC or a similar NIC that supports Nvidia VMA for kernel-bypass networking. Experiments can still be run without VMA-capable NICs, but latency may increase and throughput may decrease because the applications then rely on the legacy kernel network stack.
- A programmable switch with Intel Tofino1 ASIC is needed.
The minimal working example of our artifact was tested on:
- 3 nodes (1 client and 2 servers) with single-port Nvidia 100GbE MCX515A-CCAT ConnectX-5 NIC
- APS BF6064XT switch with Intel Tofino1 ASIC
Our artifact was tested on the following software environments:
Clients and servers:
- Ubuntu 20.04 LTS with Linux kernel 5.15.
- Mellanox OFED NIC drivers 5.8-1.2.1 LTS.
- gcc 9.4.0.
- VMA libvma 9.4.0.
We also tested our artifact on:
- Ubuntu 22.04 LTS with Linux kernel 6.5.0.
- Mellanox OFED NIC drivers 23.10-1.1.9 LTS.
- gcc 11.4.0.
- VMA libvma 9.8.40 LTS.
Switch:
- Ubuntu 20.04 LTS with Linux kernel 5.4.
- Python 3.8.10.
- Intel P4 Studio SDE 9.7.0 and BSP 9.7.0.
- Place `client.c`, `server.c`, `header.h`, and `Makefile` in the home directory (we used `/home/netclone` in the paper).
- Configure cluster-related details in `header.h`, such as IP and MAC addresses. Note that the IP configuration matters in this artifact: each node should have a linearly increasing IP address. For example, we use 10.0.1.101 for node1, 10.0.1.102 for node2, and 10.0.1.103 for node3. This is because the server program automatically assigns the server ID based on the last digit of the IP address. The relevant lines in `header.h` are listed below (an illustrative sketch follows this list):
  - Line 4: `char *interface` // Interface name.
  - Line 5: `NUM_CLI` // Number of clients. Needed to assign server IDs automatically; also needed for LAEDGE.
  - Line 6: `NUM_SRV_LAEDGE` // Number of servers (including the LAEDGE coordinator node). Needed for LAEDGE. The minimum is 3 (1 coordinator and 2 servers).
  - Line 7: `char* src_ip` // Client IP addresses. Needed for LAEDGE.
  - Line 8: `char* dst_ip` // Server IP addresses. Needed for LAEDGE, NoClone, and C-Clone.
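For reference, here is a minimal, hypothetical sketch of what those fields could look like for the 3-node example above. The exact types, values, and line positions must follow the actual `header.h` in this repository, and the server-ID formula below is only an illustration of the last-digit rule (the real mapping lives in `server.c`):

```c
/* Hypothetical sketch only -- mirror the real header.h; this is not a drop-in replacement. */
#include <stdio.h>
#include <string.h>

char *interface = "ens1f0";                              /* Line 4: NIC interface name (example)        */
#define NUM_CLI 1                                        /* Line 5: number of clients                    */
#define NUM_SRV_LAEDGE 3                                 /* Line 6: servers incl. the LAEDGE coordinator */
char *src_ip[NUM_CLI] = { "10.0.1.101" };                /* Line 7: client IP addresses                  */
char *dst_ip[]        = { "10.0.1.102", "10.0.1.103" };  /* Line 8: server IP addresses                  */

/* Illustration of the last-digit rule: 10.0.1.102 -> digit 2 -> server 1 when NUM_CLI = 1. */
static int server_id_from_ip(const char *ip)
{
    return (ip[strlen(ip) - 1] - '0') - NUM_CLI;
}

int main(void)
{
    printf("interface=%s, servers (incl. coordinator)=%d\n", interface, NUM_SRV_LAEDGE);
    for (int i = 0; i < 2; i++)
        printf("%s -> server %d\n", dst_ip[i], server_id_from_ip(dst_ip[i]));
    return 0;
}
```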
- Compile `client.c`, `server.c`, and `header.h` using `make`.
- Place `controller.py` and `netclone.p4` in the SDE directory.
- Configure cluster-related information in `netclone.p4`:
  - Line 2: `RECIRC_PORT` // Recirculation port number. 452 is the recirculation port for pipeline 3 in our APS BF6064XT. Check your switch spec and set it correctly.
- Configure cluster-related information in `controller.py`. This includes IP and MAC addresses and port-related information:
  - Line 2: `RECIRC_PORT` // Recirculation port number.
  - Line 3: `ip_list` // IP addresses of the nodes.
  - Line 8: `port_list` // Port numbers of the nodes.
  - Line 13: `mac_list` // MAC addresses of the nodes.
- Compile `netclone.p4` using the P4 compiler (we used `p4build.sh` provided by Intel). You can compile it manually with the following commands:

  cmake ${SDE}/p4studio -DCMAKE_INSTALL_PREFIX=${SDE_INSTALL} -DCMAKE_MODULE_PATH=${SDE}/cmake -DP4_NAME=netclone -DP4_PATH=${SDE}/netclone.p4
  make
  make install

  `${SDE}` and `${SDE_INSTALL}` are the paths to the SDE and its install directory. In our testbed, SDE = `/home/admin/bf-sde-9.7.0` and SDE_INSTALL = `/home/admin/bf-sde-9.7.0/install`.
- If done well, you should see the following output:
-- P4_LANG: p4-16
   P4C: /home/admin/bf-sde-9.7.0/install/bin/bf-p4c
   P4C-GEN_BRFT-CONF: /home/admin/bf-sde-9.7.0/install/bin/p4c-gen-bfrt-conf
   P4C-MANIFEST-CONFIG: /home/admin/bf-sde-9.7.0/install/bin/p4c-manifest-config
-- P4_NAME: netclone
-- P4_PATH: /home/admin/bf-sde-9.7.0/netclone.p4
-- Configuring done
-- Generating done
-- Build files have been written to: /home/admin/bf-sde-9.7.0
[ 0%] Built target bf-p4c
[ 0%] Built target driver
[100%] Generating netclone/tofino/bf-rt.json
/home/admin/bf-sde-9.7.0/netclone.p4(385): [--Wwarn=shadow] warning: 'srv2' shadows 'srv2'
    action get_srvID_action(bit<32> srv1, bit<32> srv2){
                                                  ^^^^
/home/admin/bf-sde-9.7.0/netclone.p4(117)
    Register<bit<32>,_>(32,0) srv2;
                              ^^^^
/home/admin/bf-sde-9.7.0/netclone.p4(129): [--Wwarn=uninitialized_out_param] warning: out parameter 'ig_md' may be uninitialized when 'SwitchIngressParser' terminates
    out metadata_t ig_md,
                   ^^^^^
/home/admin/bf-sde-9.7.0/netclone.p4(126)
    parser SwitchIngressParser(
           ^^^^^^^^^^^^^^^^^^^
[100%] Built target netclone-tofino
[100%] Built target netclone
[ 0%] Built target bf-p4c
[ 0%] Built target driver
[100%] Built target netclone-tofino
[100%] Built target netclone
Install the project...
-- Install configuration: "RelWithDebInfo"
-- Up-to-date: /home/admin/bf-sde-9.7.0/install/share/p4/targets/tofino
-- Installing: /home/admin/bf-sde-9.7.0/install/share/p4/targets/tofino/netclone.conf
-- Up-to-date: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone
-- Installing: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/events.json
-- Up-to-date: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/pipe
-- Installing: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/pipe/context.json
-- Installing: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/pipe/tofino.bin
-- Installing: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/bf-rt.json
-- Installing: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/source.json
- Open three terminals for the switch control plane. We need them for 1) starting the switch program, 2) port configuration, and 3) rule configuration by the controller.
- In terminal 1, run the NetClone program using `run_switchd.sh -p netclone` in the SDE directory. `run_switchd.sh` is included in the SDE by default.
- The output should look like this:
Using SDE /home/admin/bf-sde-9.7.0
Using SDE_INSTALL /home/admin/bf-sde-9.7.0/install
Setting up DMA Memory Pool
Using TARGET_CONFIG_FILE /home/admin/bf-sde-9.7.0/install/share/p4/targets/tofino/netclone.conf
Using PATH /home/admin/bf-sde-9.7.0/install/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/admin/bf-sde-9.7.0/install/bin
Using LD_LIBRARY_PATH /usr/local/lib:/home/admin/bf-sde-9.7.0/install/lib::/home/admin/bf-sde-9.7.0/install/lib
bf_sysfs_fname /sys/class/bf/bf0/device/dev_add
Install dir: /home/admin/bf-sde-9.7.0/install (0x56432030abd0)
bf_switchd: system services initialized
bf_switchd: loading conf_file /home/admin/bf-sde-9.7.0/install/share/p4/targets/tofino/netclone.conf...
bf_switchd: processing device configuration...
Configuration for dev_id 0
Family : tofino
pci_sysfs_str : /sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0
pci_domain : 0
pci_bus : 5
pci_fn : 0
pci_dev : 0
pci_int_mode : 1
sbus_master_fw: /home/admin/bf-sde-9.7.0/install/
pcie_fw : /home/admin/bf-sde-9.7.0/install/
serdes_fw : /home/admin/bf-sde-9.7.0/install/
sds_fw_path : /home/admin/bf-sde-9.7.0/install/share/tofino_sds_fw/avago/firmware
microp_fw_path:
bf_switchd: processing P4 configuration...
P4 profile for dev_id 0
num P4 programs 1
p4_name: netclone
p4_pipeline_name: pipe
libpd:
libpdthrift:
context: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/pipe/context.json
config: /home/admin/bf-sde-9.7.0/install/share/tofinopd/netclone/pipe/tofino.bin
Pipes in scope [0 1 2 3 ]
diag:
accton diag:
Agent[0]: /home/admin/bf-sde-9.7.0/install/lib/libpltfm_mgr.so
non_default_port_ppgs: 0
SAI default initialize: 1
bf_switchd: library /home/admin/bf-sde-9.7.0/install/lib/libpltfm_mgr.so loaded
bf_switchd: agent[0] initialized
Health monitor started
Operational mode set to ASIC
Initialized the device types using platforms infra API
ASIC detected at PCI /sys/class/bf/bf0/device
ASIC pci device id is 16
Starting PD-API RPC server on port 9090
bf_switchd: drivers initialized
Setting core_pll_ctrl0=cd44cbfe
bf_switchd: dev_id 0 initialized
bf_switchd: initialized 1 devices
Adding Thrift service for bf-platforms to server
bf_switchd: thrift initialized for agent : 0
bf_switchd: spawning cli server thread
bf_switchd: spawning driver shell
bf_switchd: server started - listening on port 9999
bfruntime gRPC server started on 0.0.0.0:50052
********************************************
* WARNING: Authorised Access Only *
********************************************
bfshell> Starting UCLI from bf-shell
- In terminal 2, configure the ports manually or via `run_bfshell.sh`. It is recommended to configure the ports to 100Gbps.
  - After starting the switch program, run `./run_bfshell.sh` and type `ucli` and then `pm`.
  - You can create and enable ports with commands like `port-add #/- 100G NONE` and `port-enb #/-`. It is recommended to turn off auto-negotiation using `an-set -/- 2`. This part requires some knowledge of Intel Tofino port management. You can find more information in the switch manual or on Intel's websites.
- In terminal 3, run the controller using `python3 controller.py 3 2 0` in the SDE directory for the minimal working example.
- The output should be as follows:
root@tofino:/home/admin/bf-sde-9.7.0# python3 controller.py 3 2 0
Binding with p4_name netclone
Binding with p4_name netclone successful!!
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
Received netclone on GetForwarding on client 0, device 0
root@tofino:/home/admin/bf-sde-9.7.0#
- Open a terminal for each node. For example, we open 3 terminals for 3 nodes (1 client and 2 servers).
- Make sure your ARP table and IP configuration are correct. The provided switch code does not handle the network setup of the hosts, so you should configure host networking manually. Also, please double-check that the cluster-related information in the code is configured correctly.
  - You can set an ARP rule using `arp -s IP_ADDRESS MAC_ADDRESS`. For example, type `arp -s 10.0.1.101 0c:42:a1:2f:12:e6` on nodes 2 and 3 for node 1.
- Make sure the nodes can communicate with each other using tools like `ping`, e.g., `ping 10.0.1.101` from the other nodes.
- Configure VMA-related settings such as socket buffers and hugepages. The following commands must be executed on all nodes:
sysctl -w net.core.rmem_max=104857600 && sysctl -w net.core.rmem_default=104857600
echo 2000000000 > /proc/sys/kernel/shmmax
echo 2048 > /proc/sys/vm/nr_hugepages
ulimit -l unlimited
- Turn on the server program on the server nodes with the following command:

  LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./server NUM_WORKERS PROTOCOL_ID DIST

  - `NUM_WORKERS`: The number of worker threads.
  - `PROTOCOL_ID`: The ID of the protocol to use. 0 is the baseline (no cloning, NoCLONE), 1 is C-Clone (CLICLONE in the code), 2 is LAEDGE, and 3 is NetClone.
  - `DIST`: The distribution of the RPC workload. For example, 0 is exponential (25us), 1 is bimodal (25us, 250us), etc. Check the details in the code (lines 205~208); an illustrative sketch of such samplers follows the example server output below.

  To evaluate LAEDGE, one node should be the coordinator. To run the coordinator, set `PROTOCOL_ID` to 99. Also note that the minimum number of nodes for LAEDGE is 4 (1 client, 1 coordinator, and 2 servers):

  LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./server 1 99 0

  For our minimal working example, use the following command:

  LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./server 1 3 0
If done well, the output should be as follows.
root@node2:/home/netclone# LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./server 1 3 0
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.7.2-1 Release built on Nov 14 2022 17:03:52
VMA INFO: Cmd Line: ./server 1 3 0
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.1.2.1:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Thread mode Multi mutex lock [VMA_THREAD_MODE]
VMA INFO: ---------------------------------------------------------------------------
Server 1 is running
Server Index in Switch is 0.
The dispatcher is running
Tx/Rx Worker 1 is running with Socket 19
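The `DIST` values select synthetic service-time distributions. The snippet below is a self-contained sketch, not the code from `server.c`, of what exponential and bimodal samplers of this kind look like; the real `exp_dist()`/`bimodal_dist()` and their parameters are defined in `server.c`.

```c
/* Sketch of synthetic service-time samplers in nanoseconds.
 * Illustrative only: the artifact's exp_dist()/bimodal_dist() live in server.c. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

/* Exponential service time with the given mean (e.g., 25000 ns for DIST=0). */
static uint64_t exp_dist(double mean_ns)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0); /* uniform in (0,1) */
    return (uint64_t)(-mean_ns * log(u));
}

/* Bimodal service time: 'percent'% short requests, the rest long
 * (e.g., 90% at 25us and 10% at 250us for DIST=1). */
static uint64_t bimodal_dist(int percent, uint64_t small_ns, uint64_t large_ns)
{
    return (rand() % 100 < percent) ? small_ns : large_ns;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("exp=%lu ns, bimodal=%lu ns\n",
               (unsigned long)exp_dist(25000.0),
               (unsigned long)bimodal_dist(90, 25000, 250000));
    return 0;
}
```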
- Turn on the client program on the client node with the following command:

  Usage: ./client NUM_SRV PROTOCOL_ID DIST TIME_EXP TARGET_QPS

  - `NUM_SRV`: The number of server nodes.
  - `PROTOCOL_ID`: The ID of the protocol to use. Same as on the server side.
  - `DIST`: Same as on the server side, but here it is only used for naming the log file.
  - `TIME_EXP`: The experiment duration in seconds. Set this to more than 20 because there is a warm-up effect in the early phase of the experiment. For a functionality check, a short time like 5 seconds is okay.
  - `TARGET_QPS`: The target throughput (= Tx throughput). This should be large enough (we recommend larger than 5000) because there are accuracy issues when computing the inter-arrival time from a very low value; for example, if you set it below 2000, the client does not send requests. (See the illustrative sketch after the example client output below.)

  For our minimal working example, use the following command:

  LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./client 2 3 0 20 20000

  The output should look like this:
root@node1:/home/netclone# LD_PRELOAD=libvma.so VMA_THREAD_MODE=2 ./client 2 3 0 20 20000
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 9.7.2-1 Release built on Nov 14 2022 17:03:52
VMA INFO: Cmd Line: ./client 2 3 0 20 20000
VMA INFO: OFED Version: MLNX_OFED_LINUX-5.8-1.1.2.1:
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: Thread mode Multi mutex lock [VMA_THREAD_MODE]
VMA INFO: ---------------------------------------------------------------------------
Client 1 is running
Rx Worker 0 is running with Socket 19
Tx Worker 0 is running with Socket 19
Tx Worker 0 done with 400000 reqs, Tx throughput: 19303 RPS
Rx Worker 0 finished with 0 redundant replies
Total time: 20.790212 seconds
Total received pkts: 400000
Rx Throughput: 19239 RPS
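On the client side, `TARGET_QPS` effectively becomes an inter-arrival gap between requests, which is where the low-rate accuracy issue comes from. The snippet below is only a hypothetical illustration of that computation, not the artifact's client code:

```c
/* Illustration only: how a target request rate maps to an inter-arrival gap.
 * Not the artifact's client code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t target_qps = 20000;                   /* as in the minimal working example      */
    uint64_t gap_ns = 1000000000ULL / target_qps;  /* 50,000 ns between consecutive requests */
    printf("target %lu RPS -> one request every %lu ns\n",
           (unsigned long)target_qps, (unsigned long)gap_ns);
    /* At very low target rates the gap grows very large and the pacing logic may no
     * longer behave as expected (the artifact stops sending below roughly 2000 RPS),
     * so keep TARGET_QPS at 5000 or higher. */
    return 0;
}
```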
- When the experiment finishes, the client reports Tx/Rx throughput, the experiment time, and other related information. Request latencies in microseconds are logged to a text file. The last line of the log contains the total experiment time rather than a latency sample, so be careful when you analyze the log (a post-processing sketch follows below). A sample log file can be found at `log/log-3-1-0-2-15-1-0-5-2000.txt` in this repository (the sample log was produced without VMA).
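For example, a post-processing sketch along the following lines could be used, assuming one latency value (in microseconds) per line with the total experiment time on the final line; adjust it if the actual log format differs:

```c
/* Sketch: post-processing a latency log.
 * Assumes one latency value (microseconds) per line, with the final line holding the
 * total experiment time, so the last line is excluded from the percentile computation. */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s LOGFILE\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    size_t n = 0, cap = 1 << 20;
    double *v = malloc(cap * sizeof *v), x;
    while (fscanf(fp, "%lf", &x) == 1) {
        if (n == cap) v = realloc(v, (cap *= 2) * sizeof *v);
        v[n++] = x;
    }
    fclose(fp);
    if (n < 2) { fprintf(stderr, "not enough samples\n"); return 1; }

    double total_time = v[--n];          /* last line: total experiment time, not a sample */
    qsort(v, n, sizeof *v, cmp);
    printf("samples: %zu, total time (last line): %f\n", n, total_time);
    printf("median: %.1f us, p99: %.1f us, p99.9: %.1f us\n",
           v[n / 2], v[(size_t)(n * 0.99)], v[(size_t)(n * 0.999)]);
    free(v);
    return 0;
}
```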
At lines 306~318 of `server.c`, there is a dummy RPC work loop. Here, the value `0.197` determines the accuracy of the runtime.
/* Do dummy RPC work*/
uint64_t i = 0;
if (rand() / (double) RAND_MAX < probability) run_ns = run_ns * multiple;
do {
asm volatile ("nop");
i++;
} while (i / 0.197 < (double) run_ns);
These lines are based on the RackSched artifact: https://github.com/netx-repo/RackSched/blob/master/server_code/shinjuku/dp/core/worker.c (see lines 121~125).
uint64_t i = 0;
do {
asm volatile ("nop");
i++;
} while ( i / 0.58 < req->runNs);
In our testbed, when we used `0.58` as RackSched does, the actual runtime did not match what we targeted. For example, when we set `run_ns` to 25000 (25us), the work lasted for about 30us. So we tuned the value and found that `0.197` is correct for our testbed.
*Update: We found that 0.197 is the value for Linux kernel 5.15 and 0.64 for Linux kernel 6.5.0. Please check the actual runtime in your own testbed environment.
The runtime accuracy can be checked by fixing the runtime and adding timestamps as follows:
if (n > 0) {
    if (DIST == 0) run_ns = 25000; // fixed runtime
    else if (DIST == 1) run_ns = bimodal_dist(90, small, large);
    else if (DIST == 2) run_ns = exp_dist(medium);
    else if (DIST == 3) run_ns = bimodal_dist(90, medium, vlarge);
    if (rand() / (double) RAND_MAX < probability) run_ns = run_ns * multiple;
    run_ns = 10000; // To find the correct value, we override it with a static runtime of 10us.
    uint64_t mmm = get_cur_ns(); // timestamp before the dummy work
    /* Do dummy RPC work */
    uint64_t i = 0;
    do {
        asm volatile ("nop");
        i++;
    } while (i / 0.197 < (double) run_ns);
    printf("%lu\n", (get_cur_ns() - mmm)); // print the elapsed time to check runtime accuracy
    // ... (rest of the request handling)
}
If your logged latency looks unexpected (too high or too low), check the runtime accuracy and tune the value.
Please cite this work if you refer to or use any part of this artifact in your research.
BibTeX:
@inproceedings{netclone,
  author    = {Gyuyeong Kim},
  title     = {NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs},
  booktitle = {Proc. of ACM SIGCOMM},
  year      = {2023},
  address   = {New York, NY, USA},
  month     = sep,
  publisher = {Association for Computing Machinery},
  numpages  = {13},
  pages     = {195--207},
}