WIP: Support basic data parallel #366

Open

shendiaomo wants to merge 8 commits into develop

Conversation

shendiaomo (Collaborator)

No description provided.

codecov bot commented Oct 26, 2020

Codecov Report

Merging #366 into develop will decrease coverage by 0.19%.
The diff coverage is 50.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #366      +/-   ##
===========================================
- Coverage    87.77%   87.57%   -0.20%     
===========================================
  Files           33       34       +1     
  Lines         1505     1513       +8     
===========================================
+ Hits          1321     1325       +4     
- Misses         121      125       +4     
  Partials        63       63              
Impacted Files              Coverage Δ
nn/parallel/parallel.go     50.00% <50.00%> (ø)

// 1. Scatter the input to the given devices,
// 2. Replicate (deep clone) the model on each device,
// 3. Evaluate each module with its input on its device,
// 4. Gather the outputs of each replica into a single output tensor, located on the `outputDevice`.
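
For context, here is a minimal, self-contained Go sketch of the scatter → replicate → parallel-apply → gather flow that the doc comment above describes. The `Module` interface and the `scatter`, `replicate`, `parallelApply`, and `gather` helpers below are illustrative stand-ins operating on plain slices, not gotorch's actual tensor and device API.

```go
package main

import (
	"fmt"
	"sync"
)

// Module is a stand-in for a gotorch module: it maps an input batch to an
// output and can be deep-cloned so that replicas do not share state.
type Module interface {
	Forward(input []float32) []float32
	Clone() Module
}

// scatter splits one batch into n roughly equal chunks, one per device.
func scatter(input []float32, n int) [][]float32 {
	chunks := make([][]float32, n)
	size := (len(input) + n - 1) / n
	for i := range chunks {
		lo, hi := i*size, (i+1)*size
		if lo > len(input) {
			lo = len(input)
		}
		if hi > len(input) {
			hi = len(input)
		}
		chunks[i] = input[lo:hi]
	}
	return chunks
}

// replicate deep-clones the module once per device.
func replicate(m Module, n int) []Module {
	replicas := make([]Module, n)
	for i := range replicas {
		replicas[i] = m.Clone()
	}
	return replicas
}

// parallelApply evaluates each replica on its own chunk concurrently.
func parallelApply(replicas []Module, chunks [][]float32) [][]float32 {
	outputs := make([][]float32, len(replicas))
	var wg sync.WaitGroup
	for i := range replicas {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			outputs[i] = replicas[i].Forward(chunks[i])
		}(i)
	}
	wg.Wait()
	return outputs
}

// gather concatenates the per-device outputs back into a single output.
func gather(outputs [][]float32) []float32 {
	var out []float32
	for _, o := range outputs {
		out = append(out, o...)
	}
	return out
}

// scale is a toy Module that multiplies every element by a constant factor.
type scale struct{ factor float32 }

func (s *scale) Forward(in []float32) []float32 {
	out := make([]float32, len(in))
	for i, v := range in {
		out[i] = v * s.factor
	}
	return out
}

func (s *scale) Clone() Module { c := *s; return &c }

func main() {
	input := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	const devices = 2
	chunks := scatter(input, devices)                  // 1. scatter the input
	replicas := replicate(&scale{factor: 10}, devices) // 2. replicate the model
	outputs := parallelApply(replicas, chunks)         // 3. evaluate each replica
	fmt.Println(gather(outputs))                       // 4. gather the outputs
}
```
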
Collaborator

There are two approaches to data parallelism for multi-GPU training:

  • Single-Process Multi-GPU
  • Per Process Per GPU

PyTorch DistributedDataParallel has shown that Per Process Per GPU is more efficient. Its documentation warns:

"Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES."

Collaborator

So, scatter → parallel apply → gather is not the suggested approach. Instead, we launch one training process per device, and each training process does data loading, forward, backward, allreduce, and parameter update on its own.
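
A conceptual Go sketch of that per-process-per-device pattern, assuming nothing about gotorch's API: in a real setup each worker would be a separate OS process bound to one GPU and the allreduce would go through NCCL or MPI, but here goroutines and a channel-based averager stand in so the loop structure (load shard, forward, backward, allreduce gradients, apply the same update) is runnable as-is.

```go
package main

import (
	"fmt"
	"sync"
)

// allReduce collects one gradient per worker each step, averages them, and
// hands the average back to every worker. A toy stand-in for NCCL/MPI allreduce.
func allReduce(in, out []chan float64) {
	for {
		sum := 0.0
		for _, c := range in {
			sum += <-c
		}
		avg := sum / float64(len(in))
		for _, c := range out {
			c <- avg
		}
	}
}

// worker plays the role of one training process bound to one device.
// Toy objective: minimize (w - x)^2 over the worker's own data shard.
func worker(rank int, shard []float64, sendGrad chan<- float64,
	recvAvg <-chan float64, wg *sync.WaitGroup) {
	defer wg.Done()
	w := 0.0 // every replica starts from the same parameters
	const lr = 0.1
	for _, x := range shard { // 1. each worker loads its own data shard
		grad := 2 * (w - x) // 2. forward + backward on the local batch
		sendGrad <- grad    // 3. allreduce averages gradients across workers
		avg := <-recvAvg
		w -= lr * avg // 4. every replica applies the same averaged update
	}
	fmt.Printf("worker %d finished with w = %.3f\n", rank, w)
}

func main() {
	shards := [][]float64{{1, 1, 1, 1}, {3, 3, 3, 3}} // one shard per "device"
	world := len(shards)
	in := make([]chan float64, world)
	out := make([]chan float64, world)
	for i := 0; i < world; i++ {
		in[i] = make(chan float64)
		out[i] = make(chan float64)
	}
	go allReduce(in, out)

	var wg sync.WaitGroup
	for rank, shard := range shards {
		wg.Add(1)
		go worker(rank, shard, in[rank], out[rank], &wg)
	}
	wg.Wait() // both workers print the same w: the replicas stay in sync
}
```
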
