Skip to content

George's INC Reading List

George Karlos edited this page May 12, 2021 · 14 revisions

Contents


Applications and Use Cases


NetAgg: Using Middleboxes for Application-specific On-path Aggregation in Data CentresL Mai et al - CoNEXT '14

!!: Old paper. Potentially the first on on-path aggregation.

Main idea: Middlebox compute servers connected to switches execute aggregation functions. Multiple agg boxes form aggregation tree. Traffic transparently intercepted and redirected to them.

Workflow:

  1. Client submits request on master node
  2. Master node sends sub-requests to worker nodes (map)
  3. Shim layer on worker nodes redirect data to first agg-box hop (instead of master) (reduce)
  4. Agg-box performs aggregation function and sends to next agg-box hop (reduce)
  5. The final agg-box sends result to master

Obvious advantage(s):

  • Multitenancy. Since an aggregator is just another server, multiple applications can run on it.
  • Turing completeness. Agg-boxes (unlike switches) are not constrained on the operations they can perform (e.g. no loops, no floats).
  • No memory limitations

Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data ReductionRL Graham et al. - COMHPC '16

Main Idea: INC-implementation of MPI and OpenSHMEM reduction and barrier collectives on Mellanox switches. Specifically:

  • MPI_Barrier(), MPI_Reduce(), MPI_Reduce_scatter(), MPI_AllReduce(), MPI_Scan(), MPI_Exscan()
  • shmem_barrier_all(), shmem_barrier(), shmem_dt_op_to_all()

Builds overlay reduction tree and aggregates data going through it.

Implementation is basically in the HPC-X library to craft SHArP packets and in the Mellanox switches to understand the SHArP protocol. It is proprietary so i have only skimmed through the paper. Perhaps the protocol description (packet formats etc.) is something usefull i can revisit later on.

One important point is that, unlike the P4-switch aggregations below, the SHARP switch only performs the aggregation only when the entire data has been received by the switch.

!!: A comparison between SHARP and one of the P4 aggregations would be really cool. Although nobody did it AFAIK ---> Need Mellanox switches $$$$


In-Network Computation is a Dumb Idea Whose Time Has ComeSapio et. al. - HotNets '17 [website] [repo]

Main Idea: Form aggregation tree out of switches. 3-tier arch: workers send data to the master, the network performs the aggregation (on the switches), and finally the master receives the result.

The paper focuses on MapReduce-like workloads. Specifically the step where a reducer has to receive data from mappers. There may be multiple reducers -> multiple aggregation trees.

Data is Key-Value pairs and a packet's payload is just a list of them.

Each switch has S aggregation slots (2 S-length arrays for keys & values) and a spillover buffer.
Switch updates Entries[Hash(K)] = f(Entries[Hash(K)], V) for each K,V-pair in the payload. If there is a collision the pair goes to the spillover buffer. When that fills, a packet is crafted and sent to the parent in the tree.
Switches also know how many children to expect data from. When all children have finished transmitting, the switch sends its K,V pair to its parent.

!!: This kind of aggregation is only applicable when f is both commutative and associative.

!!: Is limited in the number of entries that can be stored on the switch (i think the actual implementation stores only 10!) and potentially wastes space. A max key-length is assumed.

?: What happens when all children finished (KVs must be flushed) but the spillover buffer is non-empty? According to the algorithm the spillover buffer is flushed only when it is full. This algorithm description allows for data to be lost (and thus incorrect results).

Implementation details:

  • Hardcoded aggegation steps:
     action process_entry_1() { process_entry(); drop(); }
     // ...
     action process_entry_9() { ... }
     action process_entry_10() { process_entry(); process_entry_9(); }
     
     // And exact match on num_entries field
    
  • Worker code just does:
     Ether(dst=MACS[sys.argv[2]]) / IP(dst="10.0.1.10") / UDP(dport=10000) / PREAMBLE(number_of_entries=4,tree_id=7) / ENTRY(key='a',value=1) / ENTRY(key="b",value=2) / ENTRY(key="c",value=3) / ENTRY(key="d",value=4) 

Accelerating Distributed Reinforcement Learning with In-Switch ComputingLi et. al. - ISCA '19

Main Idea: FPGA Switch with an aggregation accelerator + custom protocol and packet format.

Switch arbiter forwards tagged packets to the accelerator and non-tagged to the forwarding module. Accelerator output treated as ingress queue packet (untagged) and subsequently sent to the forwarding module.


SwitchAgg: A Further Step Towards In-Network ComputationYang et. al - FPGA '19

Main Idea: Yet another switch arch for in-net aggregation (FPGA)

Key-Value pairs payload.

Multiple processing engines on the switch, aggregate values of the same key. Each processing engine is dedicated to fixed-length KV-pair processing.

Bulk of the paper describes the switch arch so i haven't read it in too much detail (for now).


Scaling Distributed Machine Learning with In-Network AggregationSapio et. al. - NSDI '20 [website][pres][repo]

Main Idea: Similar principles to [Sapio'17]. But unlike Key-Value aggregations, perform AllReduce with the help of the switches. Switch-Host co-design.

N workers, each hold an array with their own model updates (same size). Each array is split in s slots, each of k elements. Workers send packets with slot id and data. When the switch has aggregated N (sub)vectors in a slot it multicasts a packet to all workers. Then workers update their model for the next iteration.

The entire parameter vector may not fit on the switch -> process updates in windows of k * s elements. Once a worker receives an update (from the switch) for slot i starting at offset o, it then sends slot (i + 1) at o + k * s to the switch. This works because a worker receiving a slot update means that all workers have already sent slot i on the switch in order for it to multicast the result.

The above is the gist of the algorithm. The real version also deals with packet loss and is therefore more complicated. Also requires more memory from the switch.

End-hosts use quantization because the switch does not support FP. The scaling factor is set such that the max aggregated (quantized) value is still representable as a 32-bit int w.o overflows. This requires knowledge of the max value for a slot across workers. It is realized by having workers attach their the max gradient in block (j+1) to the packet for block j (for the same slot i). Then when the the switch performs the aggregation, it also finds the max and attaches it to the broadcasted packet. Now workers have a value to scale with.

?: What happens when j = 0?

!!: The number of in-flight aggregations (happening in parallel) is the number of slots s. This value is thus critical to performance and must be judiciously tuned. For instance it has to be a large fraction of the badwidth-delay product of the workers, and also take into account k, which determines the packet size. On the other hand both s,k influence switch memory. So there may be friction/tradeoffs.

!!: The algorithm looks like it is missing some steps. Maybe the desc. is intentionally simplified. In short, workers must wait for all updates in the first window (idx 0 to (k*s)). That is to avoid reusing slots untill all workers have contributed. However the worker algorithm does not reflect this (more details in the notes)

!!: The system targets a single rack with the aggregation happening in the ToR switch.

!!: It is not clear if there is any room to overlap computation with communication (of the model update) at the workers

Overall performance gains come from traffic reduction, compared to model updates by:

  1. parameter-server aggregations
  2. ring-AllReduce

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training AccelerationLiu et. al. - arXiv preprint

Main Idea: AllReduce with in-network aggregation implemented over RoCE

1 GPU/Worker: Operation is very similar to SwitchML. Workers send chunk of gradients to a network device, the device performs the aggregation and broadcasts the (partial) update. However, unlike SwitchML, the aggegation device is an FPGA attached to the switch instead of the switch itself.

n GPUs/Worker:

  1. Local GPUs do scatter-reduce -> Each GPU has a partial result on a chunk of the worker's updates.
  2. GPUs with the same rank across machines perform ring-AllReduce (multiple rings). E.g H1:[g0,g1,g2] H2:[g3,g4,g5] H3:[g6,g7,g8] form rings (g0,g3,g6), (g1,g4,g7), and (g2,g5,g8). Aggregation for those AllReduce operations happens on the switch.
  3. After the previous step all workers have the entire update, however different (local) GPUs have different parts of it. In the final step all local GPUs perform an All-Gather to obtain the missing parts.

Implementation of NetReduce outperforms SwitchML (not by a lot). Authors state:

The larger performance gain of NetReduce than SwitchML mainly comes from two points: processing full-length Ethernet frame by using FGPA and offloading the network stack processing to RDMA NIC.

I don't understand the first point. AFAIK Ethernet frames are up to 1.5KB which (i think?) is well within the capabilities of a switch ASIC, in which case i don't see how the FPGA can be faster than an ASIC. Not sure about jumbo frames though.

?: Is this even about frame size being too large for the switch, or something about the SwitchML algorithm that doesn't allow to saturate the MTU?


Offloading Online MapReduce tasks with Stateful Programmable Data PlanesV Bruschi et. al - ICIN '20

Main Idea: Map MapReduce operations to the FlowBlaze architecture

FlowBlaze is an architecture for stateful packet processing. Only skimmed for now because i need to learn more about FlowBlaze and the paper is not very descriptive on that part. Will revisit.


An In-Network Architecture for Accelerating Shared-Memory Multiprocessor CollectivesKlenk et. al. - ISCA '20

Main Idea: In-network aggregation for PGAS GPU systems

Have only skimmed. Need to carefully read as the paper has many details on the PGAS side of things, and how loads/stores are translated into network ops.


ATP: In-network Aggregation for Multi-tenant LearningLao et. al. - NSDI '21 [website][repo]

Main Idea: Multiple in-network aggregations, from different training jobs

Switch has a number of aggregation slots.

Aggregations for the same part of the tensor fall under the same slot. Aggregation data for a slow must fit a packet.

There will be collisions. When that happens the switch simply forwards the packet to the next hop towards the parameter server.

Aggregation happens either at the ToR switch of the workers, at the ToR switch of the PS, and as a last resort at the PS itself. (This two-level aggregation allows for an easy way to avoid duplicate aggregations)

?: Are more levels doable/make sense?
The paper claims (citing the Lyra paper) that in a DC setting only the ToR switches are programmable, so (rather conveniently) the 2-layer agg. is the best match.

  1. In-Network aggregation cannot work with end to end protocols because switches drop packets after aggregation, which is interpreted as packet loss by those protocols. --> Switch - End-Host net stack co-design.

!!: I do not understand the following:

Note that two gradients from different workers that will be aggregated never meet at switches in ring all-reduce architecture [58]. To the best of our knowledge, any in-network aggregation, as well as ATP, can not apply to ring all-reduce architecture.


In-Network Aggregation for Shared Machine Learning ClustersGebara et. al. - MLSys '21

Main Idea: Support in-net agg. flows alongside non-agg. flows in a DC environment.

In-net agg. principles are the same as the other papers. The extra contribution is the support for both aggregation and non-aggregation traffic in the same system, and the incorporation of load-balancing, congestion control and fairness.

System:

  • PSwitch -> Traditional Switch + Aggregation Accelerator (FPGA)
  • No assumptions on worker placement (can span multiple racks).
  • ECN-based congestion control for aggregation packets
  • LB: ECMP not suitable for aggregation flows that are typically large (up to GBs for exchanged data) -> Multiple aggregation trees per training job -> Spread traffic accross multiple paths
  • CG: Many-to-Many comm. --> Point-to-Point mechanisms not applicable + Switch needs to store aggregation packets untill packets from all workers have arrived --> Sending rate of workers must match
    • Aggregation result packets treated by workers as an ACK to increase window size
    • ECN bit set --> adjust sending rate
    • Cap sending window -> Agg. accelerators update aggregation packet using cwnd_cap field

New Aggregation Job:

  1. Controller initializes PSwitches in the spanning tree
    1. Unique IP multicast for each tree
  2. Controller configures workers to use in-net agg. + notifies them of the IP multicast address.

Implementation:

  • Comm. lib replacing NCCL and MPI.
  • Custom packet format (L4)

Paper argues that in-network aggregation impact on training time is limited because gradient aggregation is only a small fraction of the training task, so it argues that the real motivation is to reduce the volume of traffic generated by data-parallel gradient exchange. This is somewhat is contrast to the other aggregation papers argueing that dist. training is becoming network/comm. bound.

Paper gives examples of tasks that do not directly benefit from In-net agg. but indirectly through the resource reduction caused by aggregating in the network in tasks that do benefit from it (even though not by a lot).

TODO: Figure out why the tasks mentioned in the paper do not benefit from in-net agg. Maybe there are opportunities for other kinds of INC there?

!!: Given that this paper is the newest of all it would be good if it included comparisson with at least SwitchML -> It doesn't!


The Case For In-Network Computing On DemandTokusashi, Yuta, et al - EuroSys '19

Applications: KVS, Paxos and DNS. Platform: NetFPGA SUME

The paper argues that INC actually IS power efficient. Also, it suggests treating network hardware as just another compute resource that can be scheduled. The idea is to do things like enabling NIC-based PAXOS on high-traffic, and fallover to SW PAXOS on low trafic, etc. Another example is in the DNS application, where a packet classifier decides whether to serve the DNS query on the device or to forward it (in which case the device acts as a normal NIC/Switch). When forwarded, the query is handled by software DNS.

?: Is this on-demand approach generic-enough or is it just these 3 application that fit well to it?


LaKe: The Power of In-Network ComputingY Tokusashi, H Matsutani, N Zilberman - ReConFig '18

Main idea: Two levels of caching. L1 - on-chip FPGA memory, L2 - FPGA DRAM. Query CPU application if both miss. Need 5 PE on the FPGA to saturate 10Gbps line rate (13M queries / sec). 10x latency and 24x power effiency over SW Memcached.


Can the Network be the AI Accelerator?D Sanvito, G Siracusano, R Bifulco - NetCompute '18 [ML]

Paper focuses on NN inference as an INC candidate due to its low latency requirements.

The main problem in latency-sensitive inference is the time it takes to move data to an accelerator (GPU, TPU, etc.). If the accelerator is within a network device no data-movement is required. They identify that splitting the NN processing (which is the only overhead for in-network accelerators) does not impose big overhead.

BaNaNa Split:

  • Quantized models -> Fit on the device + simpler ops (supported by current net devices)
  • Split before FC layers
  • First set of layers on the CPU
  • The rest Quantized
  • At runtime NN req handled first by CPU. Intermediate res encapsulated in packets and set to the NIC. The NIC executes quantized layers.

Implementation based on N2Net (BNN -> P4). Extend to handle NIC.


ZipLine: in-network compression at line speedVaucher et. al. - CoNEXT '20

Main Idea: Compress/Decompress at line-rate on the switch. Saves energy/time on the end-host

TODO: Finish reading




Surveys-Visions


When Should The Network Be The Computer? DRK Ports, J Nelson - HotOS’19

Suggested Principles:

  • Offloading Primitives
  • Make primitives reusable
    IETF draft on data plane programmability
  • Preserve fate sharing - Recover from switch failure, unless switch is single point of failure (and would have anyway made the system unavailable)
  • Keep state out of the network - In the event of failures, application level protocol can deal with recovery
  • Minimal interference - New primitives should minimize impact with network policies and reliability

Open Challenges:

  • Scale and decentralization - Assumption that a single switch handles all traffic is increasingly unrealistic.
  • In-device parallelism - Inter-pipeline shared state
  • Logical vs. Wire messages - Bridge the semantic gap between applications not operating in packets (e.g. TCP streams) and the UDP messages that most in-network systems use to communicate.
  • Encryption - Traffic is increasingly encrypted
  • Multitenancy - Contention from multiple applications running on the same device. Currently devices run a single monolithic programs. OS-like abstractions (virtualization, resources mngmt, protected domains) are needed.
  • Isolation - To allow users to deploy their own programs
  • Interoperability - CS providers do not use single-vendor hardware. Need languages that can target multiple architectures (Program synthesis might be a solution)

Clone this wiki locally