George's INC Reading List
- NetAgg: Using Middleboxes for Application-specific On-path Aggregation in Data Centres - L Mai et al. - CoNEXT '14 [Aggregation]
- Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction - RL Graham et al. - COMHPC '16 [Aggregation]
- In-Network Computation is a Dumb Idea Whose Time Has Come - Sapio et al. - HotNets '17 [Aggregation]
- Accelerating Distributed Reinforcement Learning with In-Switch Computing - Li et al. - ISCA '19 [Aggregation]
- SwitchAgg: A Further Step Towards In-Network Computation - Yang et al. - FPGA '19 [Aggregation]
- Scaling Distributed Machine Learning with In-Network Aggregation - Sapio et al. - NSDI '20 [Aggregation][ML][Training]
- NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration - Liu et al. - arXiv preprint [Aggregation][ML][Training]
- ATP: In-network Aggregation for Multi-tenant Learning - Lao et al. - NSDI '21 [Aggregation][ML][Training]
- Offloading Online MapReduce tasks with Stateful Programmable Data Planes - V Bruschi et al. - ICIN '20 [Aggregation]
- An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives - Klenk et al. - ISCA '20 [Aggregation]
- PANAMA: In-network Aggregation for Shared Machine Learning Clusters - Gebara et al. - MLSys '21 [Aggregation][ML]
- Programmable Switch as a Parallel Computing Device - Chen et al. - arXiv '18
- The Case For In-Network Computing On Demand - Tokusashi, Yuta, et al. - EuroSys '19
- LaKe: The Power of In-Network Computing - Y Tokusashi, H Matsutani, N Zilberman - ReConFig '18
- Can the Network be the AI Accelerator? - D Sanvito, G Siracusano, R Bifulco - NetCompute '18 [ML]
- ZipLine: in-network compression at line speed - Vaucher et al. - CoNEXT '20 [Compression]
NetAgg: Using Middleboxes for Application-specific On-path Aggregation in Data Centres - L Mai et al. - CoNEXT '14
!!: Old paper. Potentially the first paper on on-path aggregation.
Middlebox compute servers ("agg boxes") connected to switches execute aggregation functions. Multiple agg boxes form an aggregation tree. Traffic is transparently intercepted and redirected to them.
Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction - RL Graham et al. - COMHPC '16
INC implementation of MPI and OpenSHMEM reduction and barrier collectives on Mellanox switches. Specifically:
- MPI_Barrier(), MPI_Reduce(), MPI_Reduce_scatter(), MPI_Allreduce(), MPI_Scan(), MPI_Exscan()
- shmem_barrier_all(), shmem_barrier(), shmem_dt_op_to_all()
The implementation is split between the HPC-X library, which crafts SHArP packets, and the Mellanox switches, which understand the SHArP protocol.
In-Network Computation is a Dumb Idea Whose Time Has Come - Sapio et al. - HotNets '17 [website] [repo]
Switches form an aggregation tree. 3-tier arch: workers send data to the master, the network performs the aggregation (on the switches), and finally the master receives the result. Targets MapReduce-like workloads, specifically the step where a reducer has to receive data from the mappers. Multiple reducers -> multiple aggregation trees.
!!: Limited in the number of entries. Max key-length is assumed.
Implementation details:
- Switch:
    action process_entry_1() { process_entry(); drop(); }
    // ...
    action process_entry_10() { process_entry(); process_entry_9(); }
    // And exact match on num_entries field
- Host:
    Ether(...) / IP(...) / UDP(...) / PREAMBLE(...) / ENTRY(key="a", value=1) / ENTRY(key="b", value=2) / ENTRY(key="c", value=3) / ENTRY(key="d", value=4)
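A minimal Python sketch of the per-switch key-value aggregation described above, to make the two limitations concrete. This is not the authors' P4 code: `ToyAggSwitch`, `max_entries` and `max_key_len` are hypothetical names, and passing unaggregated pairs through when the table is full is an assumption about how the limit is handled.

```python
class ToyAggSwitch:
    """Toy model of one switch on an aggregation tree (illustrative only)."""

    def __init__(self, max_entries=4, max_key_len=16):
        self.max_entries = max_entries  # limited number of table entries
        self.max_key_len = max_key_len  # assumed maximum key length
        self.table = {}                 # key -> running sum
        self.spilled = []               # pairs that did not fit in the table

    def receive(self, entries):
        """Process one packet: a list of (key, value) entries."""
        for key, value in entries:
            if len(key) > self.max_key_len:
                raise ValueError(f"key {key!r} exceeds max_key_len")
            if key in self.table:
                self.table[key] += value          # aggregate in place
            elif len(self.table) < self.max_entries:
                self.table[key] = value           # claim a free entry
            else:
                # Assumption: no room left, so the pair is passed along
                # unaggregated for the next hop (or the reducer) to handle.
                self.spilled.append((key, value))

    def flush(self):
        """Emit aggregated pairs plus spilled ones, then reset the state."""
        out = sorted(self.table.items()) + self.spilled
        self.table, self.spilled = {}, []
        return out


sw = ToyAggSwitch(max_entries=2)
sw.receive([("a", 1), ("b", 2)])   # packet from mapper 1
sw.receive([("a", 3), ("c", 4)])   # packet from mapper 2; "c" does not fit
result = sw.flush()
```

With `max_entries=2`, keys "a" and "b" are aggregated on the switch while "c" is forwarded as-is, which is why the entry limit matters for how much traffic is actually reduced.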
Accelerating Distributed Reinforcement Learning with In-Switch Computing - Li et al. - ISCA '19
FPGA switch with an aggregation accelerator + custom protocol and packet format.
The switch arbiter forwards tagged packets to the accelerator and untagged packets to the forwarding module. Accelerator output is treated as an (untagged) ingress queue packet and subsequently sent to the forwarding module.
SwitchAgg: A Further Step Towards In-Network Computation - Yang et al. - FPGA '19
Yet another switch arch for in-net aggregation (FPGA). Payload consists of key-value pairs.
Multiple processing engines on the switch aggregate values of the same key. Each processing engine is dedicated to fixed-length KV-pair processing.
The bulk of the paper describes the switch arch, so I haven't read it in too much detail (for now).
Scaling Distributed Machine Learning with In-Network Aggregation - Sapio et al. - NSDI '20 [website][pres][repo]
Similar principles to [Sapio'17]. But instead of key-value aggregation, it performs AllReduce with the help of the switches. Switch-host co-design.
N workers each hold an array with their own model updates (all of the same size). Each array is split into s slots of k elements each. Workers send packets with a slot id and data. When the switch has aggregated N (sub)vectors in a slot, it multicasts a packet to all workers. The workers then update their model for the next iteration.
Quantization at the end-host.
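The slot mechanism above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SwitchML's implementation: `ToySlotSwitch`, `quantize`, `dequantize` and the fixed-point `scale` are hypothetical, packet loss and slot reuse are ignored, and the last chunk is zero-padded to k elements.

```python
def quantize(values, scale):
    """Host side: float -> integer (assumed fixed-point scheme)."""
    return [round(v * scale) for v in values]


def dequantize(values, scale):
    """Host side: integer -> float."""
    return [v / scale for v in values]


class ToySlotSwitch:
    """Integer-only aggregation over s slots of k elements each."""

    def __init__(self, num_workers, num_slots, k):
        self.num_workers = num_workers
        self.slots = [[0] * k for _ in range(num_slots)]
        self.counts = [0] * num_slots

    def receive(self, slot_id, chunk):
        """Add one worker's chunk; return the sum once all N have arrived."""
        slot = self.slots[slot_id]
        for i, v in enumerate(chunk):
            slot[i] += v
        self.counts[slot_id] += 1
        if self.counts[slot_id] == self.num_workers:
            return list(slot)   # stands in for the multicast to all workers
        return None             # still waiting for other workers


def allreduce(updates, k, scale=1000):
    """Sum equally sized per-worker update arrays through the toy switch."""
    n, size = len(updates), len(updates[0])
    s = -(-size // k)                       # number of slots, ceil(size / k)
    switch = ToySlotSwitch(n, s, k)
    total = []
    for slot_id in range(s):
        result = None
        for worker in updates:
            chunk = quantize(worker[slot_id * k:(slot_id + 1) * k], scale)
            chunk += [0] * (k - len(chunk))  # pad the final chunk
            result = switch.receive(slot_id, chunk)
        total.extend(dequantize(result, scale))
    return total[:size]


summed = allreduce([[0.5, 1.0, 1.5], [0.25, 0.5, 0.75]], k=2)
```

The switch only ever adds integers, which is why the quantization step sits on the end-host.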
NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration - Liu et al. - arXiv preprint
AllReduce with in-network aggregation implemented over RoCE.
1 GPU/Worker: Operation is very similar to SwitchML. Workers send chunks of gradients to a network device, the device performs the aggregation and broadcasts the (partial) update. However, unlike SwitchML, the aggregation device is an FPGA attached to the switch instead of the switch itself.
n GPUs/Worker:
- Local GPUs do scatter-reduce -> Each GPU has a partial result on a chunk of the worker's updates.
- GPUs with the same rank across machines perform ring-AllReduce (multiple rings). E.g., H1:[g0,g1,g2], H2:[g3,g4,g5], H3:[g6,g7,g8] form rings (g0,g3,g6), (g1,g4,g7), and (g2,g5,g8). Aggregation for those AllReduce operations happens on the switch.
- After the previous step all workers have the entire update, however different (local) GPUs have different parts of it. In the final step all local GPUs perform an All-Gather to obtain the missing parts.
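The three steps for the multi-GPU case can be traced with a small Python sketch. It only models who ends up holding what; the function names are hypothetical, the cross-host step is collapsed into a plain sum (standing in for the in-network aggregation), and the vector length is assumed to be a multiple of the number of GPUs per host.

```python
def rank_rings(hosts):
    """Group GPUs that share a local rank across hosts into rings."""
    return [tuple(ring) for ring in zip(*hosts)]


def hierarchical_allreduce(grads):
    """grads[h][g] is the gradient vector held by GPU g on host h."""
    num_hosts, gpus = len(grads), len(grads[0])
    chunk = len(grads[0][0]) // gpus

    # Step 1: local scatter-reduce. GPU g on each host ends up with the
    # host-local sum of chunk g.
    partial = [
        [
            [sum(gpu[i] for gpu in host)
             for i in range(g * chunk, (g + 1) * chunk)]
            for g in range(gpus)
        ]
        for host in grads
    ]

    # Step 2: AllReduce among GPUs of the same rank across hosts
    # (one ring per rank; the sum stands in for the network aggregation).
    reduced = [
        [sum(partial[h][g][i] for h in range(num_hosts))
         for i in range(chunk)]
        for g in range(gpus)
    ]

    # Step 3: local all-gather. Every GPU collects all chunks.
    full = [x for g in range(gpus) for x in reduced[g]]
    return [[list(full) for _ in range(gpus)] for _ in range(num_hosts)]


rings = rank_rings([["g0", "g1", "g2"], ["g3", "g4", "g5"], ["g6", "g7", "g8"]])
out = hierarchical_allreduce([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

`rings` reproduces the (g0,g3,g6), (g1,g4,g7), (g2,g5,g8) grouping from the example above, and every GPU in `out` holds the full sum.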
ATP: In-network Aggregation for Multi-tenant Learning - Lao et al. - NSDI '21 [website][repo]
Multiple in-network aggregations, from different training jobs.
Switch has a number of aggregation slots. Aggregations for the same part of the tensor fall under the same slot. Aggregation data for a slot must fit a packet.
Collisions: forward the packet to the next hop towards the parameter server.
Aggregation at the ToR switch of the workers, or at the ToR switch of the PS.
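A rough Python sketch of the slot and collision behaviour noted above. Everything here is illustrative: `ToyAtpSwitch` and `slot_index` are hypothetical, the hash is a made-up stand-in, a single number stands in for a packet-sized tensor fragment, and freeing the slot as soon as the last worker arrives is a simplifying assumption.

```python
def slot_index(job, chunk, num_slots):
    """Hypothetical hash mapping a (job, tensor chunk) pair to a slot."""
    return (job * 31 + chunk) % num_slots


class ToyAtpSwitch:
    """Shared pool of aggregation slots used by several training jobs."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def receive(self, job, chunk, value, num_workers):
        idx = slot_index(job, chunk, len(self.slots))
        slot = self.slots[idx]
        if slot is None:
            # Free slot: claim it for this (job, chunk).
            slot = {"owner": (job, chunk), "total": 0, "seen": 0}
            self.slots[idx] = slot
        if slot["owner"] != (job, chunk):
            # Collision: the slot belongs to another aggregation, so the
            # packet continues towards the parameter server unaggregated.
            return ("forward_to_ps", value)
        slot["total"] += value
        slot["seen"] += 1
        if slot["seen"] == num_workers:
            self.slots[idx] = None          # assumption: freed immediately
            return ("complete", slot["total"])
        return ("aggregated", None)


sw = ToyAtpSwitch(num_slots=1)
events = [
    sw.receive(job=0, chunk=0, value=1, num_workers=2),
    sw.receive(job=1, chunk=0, value=5, num_workers=2),  # collides with job 0
    sw.receive(job=0, chunk=0, value=2, num_workers=2),
]
```

With a single slot, job 1's packet collides and falls through to the PS, while job 0's aggregation completes on the switch.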
Offloading Online MapReduce tasks with Stateful Programmable Data Planes - V Bruschi et al. - ICIN '20
Maps MapReduce operations to the FlowBlaze architecture.
FlowBlaze is an architecture for stateful packet processing. Only skimmed for now because I need to learn more about FlowBlaze and the paper is not very descriptive on that part. Will revisit.
An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives - Klenk et al. - ISCA '20
In-network aggregation for PGAS GPU systems
Have only skimmed. Need to carefully read as the paper has many details on the PGAS side of things, and how loads/stores are translated into network ops.
PANAMA: In-network Aggregation for Shared Machine Learning Clusters - Gebara et al. - MLSys '21
The paper argues that INC does not benefit the aggregation itself that much. The bigger gain is that in-net aggregation reduces data-parallel traffic, freeing up BW for other traffic.
PSwitch -> Traditional Switch + Aggregation Accelerator (FPGA)
The system handles congestion and load balancing.
Traffic load same as SwitchML. ML job completion slightly worse than SwitchML.
Programmable Switch as a Parallel Computing Device - Chen et al. - arXiv '18 [repo]
Consider the set of switches as a parallel computing device. Identify the costs.
Word-count on switch. Server sends out (a) 1 packet per item [CPU cost] or (b) k items and the switch forwards k packets (k - 1 recirculations) [Throughput cost].
Reducer(s) in later switch(es). Assumes items of same size.
Results show case (b) is more efficient. This case is like the Map phase of MapReduce.
p4mr: program -> [p4mr cmpl. (topo, constraints)] -> p4 programs for each switch.
The repository does not include the compiler.
The Case For In-Network Computing On Demand - Tokusashi, Yuta, et al. - EuroSys '19
Applications: KVS, Paxos and DNS. Platform: NetFPGA SUME.
The paper argues that INC actually IS power efficient. It also suggests treating network hardware as just another compute resource that can be scheduled. The idea is to do things like enabling NIC-based Paxos under high traffic and falling back to SW Paxos under low traffic. Another example is the DNS application, where a packet classifier decides whether to serve the DNS query on the device or to forward it (in which case the device acts as a normal NIC/switch). When forwarded, the query is handled by software DNS.
LaKe: The Power of In-Network Computing - Y Tokusashi, H Matsutani, N Zilberman - ReConFig '18
Two levels of caching: L1 in on-chip FPGA memory, L2 in FPGA DRAM. The CPU application is queried if both miss. 5 PEs are needed on the FPGA to saturate the 10Gbps line rate (13M queries/sec). 10x latency and 24x power efficiency improvement over SW Memcached.
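The lookup path (L1, then L2, then the CPU application) can be sketched as below. This is a minimal illustration, not LaKe's design: `TwoLevelCache` is a hypothetical name, the FIFO eviction and the fill-on-miss / promote-on-L2-hit policies are assumptions, and `host_lookup` stands in for the software Memcached on the CPU.

```python
from collections import OrderedDict


class TwoLevelCache:
    """L1 (on-chip memory) in front of L2 (DRAM), backed by the host app."""

    def __init__(self, l1_size, l2_size, host_lookup):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.host_lookup = host_lookup   # called only when both levels miss

    @staticmethod
    def _fill(cache, size, key, value):
        """Insert a key and evict the oldest entries beyond capacity."""
        cache[key] = value
        cache.move_to_end(key)
        while len(cache) > size:
            cache.popitem(last=False)

    def get(self, key):
        """Return (value, level_that_served_the_query)."""
        if key in self.l1:
            return self.l1[key], "L1"
        if key in self.l2:
            value = self.l2[key]
            self._fill(self.l1, self.l1_size, key, value)  # promote to L1
            return value, "L2"
        value = self.host_lookup(key)                      # both missed
        self._fill(self.l2, self.l2_size, key, value)
        self._fill(self.l1, self.l1_size, key, value)
        return value, "host"


store = {"a": 1, "b": 2}
cache = TwoLevelCache(l1_size=1, l2_size=2, host_lookup=store.__getitem__)
trace = [cache.get("a"), cache.get("a"), cache.get("b"), cache.get("a")]
```

In the trace, the first query for "a" goes to the host, the second hits L1, and after "b" evicts it from the one-entry L1, the last query for "a" is served from L2.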
Can the Network be the AI Accelerator? - D Sanvito, G Siracusano, R Bifulco - NetCompute '18 [ML]
Paper focuses on NN inference as an INC candidate due to its low latency requirements.
The main problem in latency-sensitive inference is the time it takes to move data to an accelerator (GPU, TPU, etc.). If the accelerator is within a network device no data-movement is required.
Implementation based on N2Net (BNN -> P4). Extend to handle NIC.
ZipLine: in-network compression at line speed - Vaucher et al. - CoNEXT '20
Compress/decompress at line rate on the switch. Saves energy/time on the end-host.
- When Should The Network Be The Computer? DRK Ports, J Nelson - HotOS’19
When Should The Network Be The Computer? DRK Ports, J Nelson - HotOS’19
Suggested Principles:
- Offloading Primitives, Reusable Primitives, Preserve fate sharing, Keep state out of the network, Minimal interference