# Economic DataLoader: Multiprocessing Pipeline Design Suggestion

**Authors:**
* @yoadbs

## **Summary**
A new dataloader multiprocessing pipeline design is suggested. This pipeline splits the task of batch generation across two types of workers:\
item-generating workers (which call `dataset.__getitem__`) and batch-generating workers (which call `collate_fn`).
This pipeline is designed to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT).

## **Motivation**
In several applications, such as video processing and 3D graphics, a single input batch of a PyTorch model may require a large amount of RAM.

In the current dataloader multiprocessing pipeline design, workers prepare batches in parallel and send them into shared memory through a queue.
In practice, roughly _num_workers_ prepared batches are stored in shared memory at any given time, starting shortly after the epoch begins.
At most, (_num_workers_ * _prefetch_factor_) prepared batches may be stored in shared memory simultaneously.
The main process operates in parallel with the workers, extracting one batch after another from shared memory and feeding it into the model for training/validation/test.

Simultaneously storing about _num_workers_ batches in shared memory imposes a limit on _num_workers_:\
_num_workers_ < (_total_available_ram_in_bytes_ / _batch_size_in_bytes_) \
This limitation can bottleneck training TPT, because _num_workers_ cannot be increased beyond the server's RAM budget.
Alternatively, increasing _num_workers_ requires a server with more available RAM, which increases server cost.
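For example, on a server with 64 GB of available shared memory and batches that occupy 4 GB each, _num_workers_ is effectively capped at about 16, no matter how many CPU cores are available.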

A new dataloader multiprocessing pipeline design is suggested. In this pipeline, there are two types of workers:
item-generating workers (which call `dataset.__getitem__`) and batch-generating workers (which call `collate_fn`).
This design allows all the workers together to process at most _prefetch_factor_ batches simultaneously.
Decoupling the number of in-flight batches from _num_workers_ allows _num_workers_ to be increased without any significant increase in shared memory consumption.
As in the current implementation, the workers continuously generate items throughout the epoch and are not expected to enter an idle state, so no TPT reduction is expected.

A smaller additional advantage is that in the proposed implementation the first batch of each epoch is generated by multiple workers, whereas in the current implementation it is generated by a single worker.
Hence, each epoch can potentially start faster.

The new flow introduces only minor modifications to the dataloader interface, making the transition almost transparent to the user.

## **Proposed Implementation**

### **Definitions**

| symbol | description |
|-----------------------|:---------------------------------------------------------------------------------------------------------------------------|
| _index_queue_         | A queue used to send item indices and metadata from the main process to an item_worker. There is a separate queue for each item_worker |
| _item_queue_          | A queue used to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker |
| _worker_result_queue_ | A queue used to send prepared batches from batch_workers to the main process |
| _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) |
| _item_idx_in_batch_ | Item serial index in batch |
| _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) |
| _item_index_ | Item's dataset index, as in `dataset.__getitem__(index=item_index)` |
| _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} |
| _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} |
| _batch_size_          | Batch size (may be smaller for the last batch in an epoch) |
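The per-item metadata implied by these definitions can be pictured as a small record. Below is a minimal sketch; the record name and field layout are illustrative, not part of the proposal's interface.

```python
from typing import NamedTuple

class ItemTask(NamedTuple):
    """Metadata attached to each item sent from the main process to an item_worker."""
    item_idx_in_batch: int  # position of the item inside its batch
    batch_idx: int          # serial index of the batch within the epoch
    item_index: int         # dataset index passed to dataset.__getitem__
    iw_idx: int             # item_worker assigned to generate this item
    bw_idx: int             # batch_worker assigned to collate this item's batch
    batch_size: int         # size of this item's batch (smaller for the last batch)
```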

### **High Level Description**

In the current multiprocessing pipeline, a single level of workers is used.
The main process sends _prefetch_factor_ batches to each worker via its _index_queue_.
Each worker prepares one batch at a time and sends it back to the main process via _worker_result_queue_.
After a batch is retrieved by the main process, another batch is sent.

In the suggested pipeline, there are two levels of workers:
* Item_worker - Generates one item at a time (by running `dataset.__getitem__`) and sends it to a designated batch_worker via an _item_queue_
  * An item_worker is similar to a worker in the current design, but it receives and sends one item at a time (rather than one batch at a time)
* Batch_worker - Retrieves items from its _item_queue_, prepares batches by running `collate_fn`, and sends them back to the main process via _worker_result_queue_

Current design dataflow: main_process -> workers -> main_process

Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process
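As a rough illustration of the wiring this dataflow implies, the sketch below uses plain `multiprocessing` queues; the worker counts are arbitrary, and the actual implementation would live inside the `torch.utils.data` internals.

```python
import multiprocessing as mp

num_workers = 8        # item_workers
num_batch_workers = 2  # batch_workers

# One index_queue per item_worker (main process -> item_worker),
# one item_queue per batch_worker (item_workers -> batch_worker),
# and a single worker_result_queue (batch_workers -> main process).
index_queues = [mp.Queue() for _ in range(num_workers)]
item_queues = [mp.Queue() for _ in range(num_batch_workers)]
worker_result_queue = mp.Queue()
```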

#### **Main Process Flow**
* Retrieve and store prepared batches from _worker_result_queue_
* Track the number of items currently being processed (the workload) of each worker. Decrease the workload counters of the relevant batch_worker, and of each relevant item_worker, when a batch is retrieved
* Send batches of items for preparation to the index_queues, one batch at a time (see the sketch after this list):
  * Each item should carry the following metadata: (_item_idx_in_batch_, _batch_idx_, _item_index_, _iw_idx_, _bw_idx_, _batch_size_)
  * A possibly different item_worker may be assigned to each item; select _iw_idx_ as the item_worker with the minimal workload
  * The same batch_worker must be assigned to all items of a batch; select _bw_idx_ as the batch_worker with the minimal workload
  * Ensure that the sum of item_worker workloads never exceeds (_prefetch_factor_ * _batch_size_); stop sending batches once this limit is reached
  * Increase the workload counters of the relevant batch_worker, and of each relevant item_worker, when sending a batch of items
* Once the next required batch (by _batch_idx_) is available, return it to the caller
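A condensed sketch of this scheduling step is shown below. It assumes the `ItemTask` record and the queues from the earlier sketches, a `batches` iterator yielding `(batch_idx, item_indices)` pairs, and per-worker workload lists maintained by the main process; all of these names are illustrative.

```python
def try_send_next_batch(batches, index_queues, iw_workload, bw_workload,
                        prefetch_factor, batch_size):
    """Dispatch one batch of item tasks if the in-flight item limit allows it."""
    # Stop sending once all item_workers together hold prefetch_factor * batch_size items.
    if sum(iw_workload) >= prefetch_factor * batch_size:
        return False
    batch_idx, item_indices = next(batches)
    # All items of a batch go to the batch_worker with the smallest workload.
    bw_idx = min(range(len(bw_workload)), key=bw_workload.__getitem__)
    bw_workload[bw_idx] += len(item_indices)
    for item_idx_in_batch, item_index in enumerate(item_indices):
        # Each item goes to the currently least-loaded item_worker.
        iw_idx = min(range(len(iw_workload)), key=iw_workload.__getitem__)
        iw_workload[iw_idx] += 1
        task = ItemTask(item_idx_in_batch, batch_idx, item_index,
                        iw_idx, bw_idx, len(item_indices))
        index_queues[iw_idx].put(task)
    return True
```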

#### **Item_worker Flow**
* Get item metadata from _index_queue_
* Generate the item by running `dataset.__getitem__(item_index)`
* Send the item to the appropriate _item_queue_ (according to the item's _bw_idx_), as sketched below
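A minimal sketch of this loop, under the same illustrative names as above; a `None` sentinel is assumed here as the shutdown signal.

```python
def item_worker_loop(index_queue, item_queues, dataset):
    """Illustrative item_worker: fetch one item task at a time and forward the result."""
    while True:
        task = index_queue.get()
        if task is None:  # assumed shutdown sentinel
            break
        item = dataset[task.item_index]             # dataset.__getitem__
        item_queues[task.bw_idx].put((task, item))  # route to the assigned batch_worker
```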

#### **Batch_worker Flow**
* Get items one at a time from _item_queue_ and group them into batches according to their metadata (_batch_idx_, _item_idx_in_batch_, and _batch_size_)
* Once all items of a given batch have been received, run `collate_fn` and send the prepared batch to _worker_result_queue_ (see the sketch below)
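A minimal sketch of the collection logic, again under illustrative names and with a `None` shutdown sentinel.

```python
def batch_worker_loop(item_queue, worker_result_queue, collate_fn):
    """Illustrative batch_worker: regroup items into batches and collate them."""
    pending = {}  # batch_idx -> list of (item_idx_in_batch, item) received so far
    while True:
        msg = item_queue.get()
        if msg is None:  # assumed shutdown sentinel
            break
        task, item = msg
        pending.setdefault(task.batch_idx, []).append((task.item_idx_in_batch, item))
        if len(pending[task.batch_idx]) == task.batch_size:
            # All items of this batch have arrived: restore their order and collate.
            pairs = sorted(pending.pop(task.batch_idx), key=lambda p: p[0])
            batch = collate_fn([item for _, item in pairs])
            worker_result_queue.put((task.batch_idx, batch))
```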

#### **New Parameters**
The following dataloader input parameters are modified or added (a hypothetical usage example follows the table):

| name | description |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _num_workers_ (modified) | Number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) |
| _prefetch_factor_ (modified) | Number of batches simultaneously sent for processing <u>by all workers</u> (2 by default) |
| _num_batch_workers_ (new) | Number of batch_workers (defaults to _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ |
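As a hypothetical usage example under the proposed interface (`num_batch_workers` does not exist in today's `torch.utils.data.DataLoader`, and the dataset here is a stand-in):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 64, 64))  # stand-in dataset

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=16,       # item_workers; can grow without multiplying shared-memory usage
    prefetch_factor=2,    # at most 2 batches in flight across all workers
    num_batch_workers=2,  # proposed parameter: processes dedicated to collate_fn
)
```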

## **Metrics**
With comparable configurations, the suggested flow should require significantly less shared memory while preserving TPT. \
To monitor shared memory usage, run the following in a terminal on the Linux server: \
`watch -n0.1 df -h` \
and review the "Used" column of the `/dev/shm` row.

## **Drawbacks**
* An additional layer of batch_workers is required, somewhat increasing flow complexity
* CPU usage is somewhat higher in the suggested flow, due to the additional _num_batch_workers_ processes
* The user should be aware that if `collate_fn` is very slow and becomes a bottleneck, increasing _prefetch_factor_ should be considered


## **How We Teach This**
* Update the DataLoader documentation to include a description of the suggested pipeline
* Add or update the descriptions of the new and modified parameters