
We made a toolkit that can parallelize almost all the Hugging Face models. But we have some questions! #12772

Closed
hyunwoongko opened this issue Jul 18, 2021 · 29 comments

Comments

@hyunwoongko
Contributor

hyunwoongko commented Jul 18, 2021

We recently developed an open-source library called Parallelformers (https://github.com/tunib-ai/parallelformers) and have a few questions, so we are writing this issue.

Q. Our logo pays homage to the Hugging Face logo. It is not exactly the same CI; it is based on a Unicode emoji. Will that be a problem?

Q. What do you think about collaboration? We can contribute model parallelization for all models in Hugging Face Transformers.


The following is what I posted on Reddit to promote our open-source project.

Hello, I am writing to inform you about the release of Parallelformers (https://github.com/tunib-ai/parallelformers), a model parallelization library from TUNiB. Parallelformers is a toolkit that supports inference parallelism for 68 models in Hugging Face Transformers with one line of code.

Previously, DeepSpeed-Inference was the main parallelization toolkit for model inference, but it had several drawbacks:

(1) Due to its process flow, it was impossible to deploy models to a web server.

(2) It lacked integration with Hugging Face Transformers, which has become the de facto standard toolkit for natural language processing. (DeepSpeed-Inference only supports 3 models.)

(3) Since parallelization starts with the model already on GPU, all parameters of the model had to fit on the GPU before parallelization.

Parallelformers solves a number of these problems in DeepSpeed-Inference. Using this toolkit internally, we were able to easily deploy a large model to our web server, reducing deployment cost by 3-5x. More detailed information and source code can be found on GitHub. Thanks!

@LysandreJik
Member

Hello @hyunwoongko, thanks a lot for sharing; this is a really cool project! No problem at all regarding the image homage (really cool logo, by the way!)

I'm pinging @stas00 who has led the efforts of model parallelization and DeepSpeed integration on our side and would probably be interested. Also pinging @sgugger as he has done some similar work.

@stas00
Contributor

stas00 commented Jul 19, 2021

Thank you for implementing and sharing your project, @hyunwoongko,

I haven't had a chance to study your project closely yet, but is it correct that you implemented tensor parallelism from Megatron?
In other words, this outstanding feature (#10321), but for inference only?

(There are many types of model parallelism and it is much easier to understand things when the generic MP term is not used, but an explicit type is described. Here is my initial attempt to map out the distinctions https://huggingface.co/transformers/master/parallelism.html)

@hyunwoongko
Contributor Author

@stas00
I'll write a blog post about the architecture of our tools soon and share it.

@stas00
Contributor

stas00 commented Jul 21, 2021

Oh, but why have you deleted all the detailed comments you posted earlier? I was looking forward to studying those and now they are all gone. I'm puzzled.

My plan was to do a feasibility study and then see if we can integrate your work into HF Transformers. I'm just too busy with other projects to respond quickly at the moment.

@hyunwoongko
Contributor Author

The comments were written too hastily and were too long, so I decided it would be more helpful for your understanding to organize the material neatly and accurately in a blog post rather than explain it in issue comments. (I was planning to publish the post soon, probably within the week.)

@stas00
Contributor

stas00 commented Jul 21, 2021

In the end I was able to cheat, since GitHub emailed me all the comments ;) So I have just read the comments you deleted.

It wasn't long at all; on the contrary, I'd say it could use more detail in places. Some images were great and some weren't super clear, so adding some words would help.

And I really appreciate that you want to merge this into HF Transformers! That would be an amazing contribution!

So bottom line: besides the leaner launcher, the core of your project is tensor parallelism built upon Megatron-LM, correct? This is exactly what I was planning to work on when I had free time, so your timing is perfect.

Let's discuss the training side of it. I think to most users of HF transformers that would be the most important application of Tensor parallelism. So in the deleted note you mentioned that DDP-support needs to be integrated to make it work in training. That's the MPU part, right? And we probably should think about Pipeline too while building the MPU, while not implementing it just yet.

Also, do you think it'd be a good idea to invite @RezaYazdaniAminabadi into this process, so that we can gradually combine your project's flexibility with DeepSpeed's CUDA kernel speedups where possible, i.e. work together with the DeepSpeed project? That's of course if Reza is interested and his superiors support the effort. We have already discussed with DeepSpeed starting to deploy some of their kernels in Transformers (but haven't done anything yet).

How do you propose we work on integrating this? Perhaps pick a few models first and work on a PR that integrates those, then handle the other models in subsequent PRs? Probably leaving the optional launcher out at first and considering it next?

On a personal note: we are about to launch the first training of the Big Science project https://github.com/bigscience-workshop/ so my availability depends on that. If all goes well when we launch, I will have more time; if not, please bear with me. I will do my best to support this integration process at least a bit at a time.

If you have any questions or concerns please don't hesitate to ask. I will try to address those.

@hyunwoongko
Contributor Author

I have sent my thoughts about collaboration to your email ([email protected])!

@stas00
Contributor

stas00 commented Jul 21, 2021

Thank you for emailing me your notes, @hyunwoongko

We need to discuss it here and not in private, since this is not my personal project. Therefore please re-paste all or just the parts that you feel are open to the public and we will continue the discussion here.

@hyunwoongko
Contributor Author

hyunwoongko commented Jul 21, 2021

Okay. First of all, I'm very happy to have your positive comments. Here are my thoughts.

  1. The basic architecture of Parallelformers is similar to that of DeepSpeed. Now that the integration of Hugging Face Transformers and DeepSpeed is in progress, I think it would be best if we all could cooperate together. A day ago I received such a proposal (to work with the DeepSpeed team) from @RezaYazdaniAminabadi, a Microsoft engineer.

  2. I should probably work on the implementation of model parallelization through tensor slicing. Currently Parallelformers supports almost all models in Hugging Face without fused CUDA kernels. If we work together on this project, which I really hope to do, I plan to find a way to integrate the current mechanism with the fused CUDA kernels. I believe if this works out, we can obtain both the speed of the fused CUDA kernels and the scalability of Parallelformers.

  3. I also think training parallelization is an important issue. In my opinion, it is necessary to consider integrating tensor MP with DP and DDP. I hope that ultimately all models in Transformers will support 3D parallelization through ZeRO + pipeline with tensor MP.

  4. Currently it is not possible to deploy a model on a web server with DeepSpeed, which I think is a critical issue. Obviously, Parallelformers started to tackle it, but I'm open to any cooperation to find a better solution.

@stas00
Contributor

stas00 commented Jul 21, 2021

Everything you shared sounds good to me, @hyunwoongko.

With regards to 3D parallelism: currently the main obstacle in HF Transformers to supporting pipeline parallelism (PP) is the presence of multiple optional features that prevent the model from being convertible to nn.Sequential, which is the prerequisite for implementing PP. Though the SageMaker docs claim that they are able to use PP without a model being converted to nn.Sequential. So it's possible that to get to PP we may have to make alternative versions stripped of the optional features. But we can discuss this when we are done with TP (tensor parallelism).

I posted this earlier, could you please address this?

Let's discuss the training side of it. I think to most users of HF transformers that would be the most important application of Tensor parallelism. So in the deleted note you mentioned that DDP-support needs to be integrated to make it work in training. That's the MPU part, right? And we probably should think about Pipeline too while building the MPU, while not implementing it just yet.

Practically, since you understand your code best, let's discuss how to approach integrating it.

Also let me add a reference to your project at https://huggingface.co/transformers/master/parallelism.html#tensor-parallelism

@hyunwoongko
Contributor Author

hyunwoongko commented Jul 22, 2021

With regards to 3D parallelism: currently the main obstacle in HF Transformers to supporting pipeline parallelism (PP) is the presence of multiple optional features that prevent the model from being convertible to nn.Sequential, which is the prerequisite for implementing PP. Though the SageMaker docs claim that they are able to use PP without a model being converted to nn.Sequential. So it's possible that to get to PP we may have to make alternative versions stripped of the optional features. But we can discuss this when we are done with TP (tensor parallelism).

I totally agree with your opinion. An interesting thing is that a former colleague of mine was the first to implement PP on PyTorch (torchgpipe). He implemented it in a way that uses nn.Sequential. So, if possible, I'll ask him for advice.

One thing I'm considering is utilizing nn.ModuleList in PP. Currently, most of the Transformers models are implemented with nn.ModuleList, and I think it would be good to use it for PP. The fact that SageMaker can parallelize Hugging Face's models easily means there's something we haven't figured out yet. I hope that in the future we will continue to work together to find such a scalable way.

Let's discuss the training side of it. I think to most users of HF transformers that would be the most important application of Tensor parallelism. So in the deleted note you mentioned that DDP-support needs to be integrated to make it work in training. That's the MPU part, right? And we probably should think about Pipeline too while building the MPU, while not implementing it just yet.

Yes, we need to implement the training side of it. However, it seems a little difficult to use NVIDIA's MPU implementation in Transformers. My idea is to leverage the mechanism of Parallelformers again: to reuse as much of the existing Transformers code as possible. When I implemented Parallelformers, I was able to parallelize the forward pass of most models by changing only a few nn.Linear layers while keeping the existing Transformers code, and I think this can be applied to the backward pass as well. However, combining this with the fused CUDA kernels on the DeepSpeed side could be quite difficult. I think the forward pass is fine, but backward is hard, because backward is not implemented in their tensor MP kernel.
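As an aside, the equivalence that makes this work can be shown in a few lines of NumPy. This is a toy sketch, not the Parallelformers implementation: slicing a weight matrix column-wise and concatenating the partial outputs, or slicing it row-wise and summing the partial results (the sum is what an all-reduce computes across GPUs), reproduces the unsliced linear layer's output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # (batch, in_features)
W = rng.standard_normal((8, 6))   # (in_features, out_features)
full = x @ W                      # the unsliced nn.Linear result

# Column parallelism: each "GPU" owns half of the output columns;
# partial outputs are concatenated along the feature dimension.
W0, W1 = W[:, :3], W[:, 3:]
col_parallel = np.concatenate([x @ W0, x @ W1], axis=1)

# Row parallelism: each "GPU" owns half of the input rows (and sees
# the matching slice of x); the partial results are summed, which is
# exactly the operation an all-reduce performs across GPUs.
row_parallel = x[:, :4] @ W[:4, :] + x[:, 4:] @ W[4:, :]

assert np.allclose(full, col_parallel)
assert np.allclose(full, row_parallel)
```

Because both decompositions are exact, only the few nn.Linear modules need to be swapped; the rest of the model code runs unchanged.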

Combining with DP and DDP probably requires minor changes to the existing torch implementation. As you know, with DP and DDP, the same model parameters are broadcast to all GPUs, and each piece of data is sent to a different GPU.

e.g.

  • if bsz=16, n_gpus=2
  • gpu0=batch 0-7
  • gpu1=batch 8-15

This needs to be partitioned. If Tensor MP size is 2, we should create two partitions.

e.g.

  • mp_group1=gpu 0, 1
  • mp_group2=gpu 2, 3

And I think that the data should be split by each partition, not by each GPU.

e.g.

  • if bsz=16, n_gpus=4, mp_size=2
  • mp_group1(gpu0,1)=batch 0-7
  • mp_group2(gpu2,3)=batch 8-15
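To make the grouping above concrete, here is a minimal pure-Python sketch (no torch.distributed; the helper names are made up for illustration) that maps each GPU rank to its tensor-MP group and assigns batch shards per group rather than per GPU:

```python
def build_groups(n_gpus, mp_size):
    """Partition ranks into tensor-MP groups of size mp_size.

    With n_gpus=4, mp_size=2 this yields [[0, 1], [2, 3]]: the ranks
    inside one group hold slices of the same model replica.
    """
    assert n_gpus % mp_size == 0
    return [list(range(i, i + mp_size)) for i in range(0, n_gpus, mp_size)]

def shard_batch(batch, groups):
    """Split the batch across MP groups, not across individual GPUs.

    Every rank in a group receives the same shard, because together
    they compute one forward pass over that shard.
    """
    per = len(batch) // len(groups)
    return {rank: batch[g * per:(g + 1) * per]
            for g, group in enumerate(groups)
            for rank in group}

groups = build_groups(n_gpus=4, mp_size=2)   # [[0, 1], [2, 3]]
shards = shard_batch(list(range(16)), groups)
# gpu0 and gpu1 both see batch 0-7; gpu2 and gpu3 both see batch 8-15
```

In a real setup, each list of ranks would become a torch.distributed process group, and the DP dimension would only ever see one rank per group.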

@hyunwoongko
Contributor Author

hyunwoongko commented Jul 22, 2021

I wrote this with a little help from a translator. If anything is unclear, please tell me :)


Here is a first draft of the collaboration plan. Please feel free to comment. Everyone involved in the collaboration will be able to modify this plan depending on the circumstances.

Step 1. DeepSpeed and TUNiB collaborate to move the Parallelformers tensor MP

The layer replacement uses the scalable method of Parallelformers: instead of swapping out the entire transformer layer, it replaces just a few linear layers with a sliced linear layer or a sliced all-reduce linear layer. Since DeepSpeed's tensor MP replaced the entire transformer layer, it could not reflect the specific mechanism of each model.

Firstly, I will implement this method and open a PR to DeepSpeed. (This is what the DeepSpeed team wants me to do; refer to here.) Ultimately, it would be a good idea to archive Parallelformers after most of its mechanisms are moved into DeepSpeed. It's a pity that our toolkit will be archived, but I think user accessibility is much more important, because I want more people to easily use large models. Parallelformers is less accessible compared to HF Transformers and MS DeepSpeed.

Step 2. DeepSpeed and TUNiB collaborate on the fused CUDA kernel

However, it is quite challenging to combine this with the CUDA kernel in the training process. In my opinion, it would not be difficult to implement the forward pass; the problem is the backward pass. There is currently no backward pass implementation in the tensor MP kernel in DeepSpeed: because tensor MP is currently provided for inference only, the DeepSpeed team didn't need to implement one. Unfortunately, since I do not understand CUDA code at a high level, it would be difficult for me to write the CUDA backward code myself.

Therefore, collaboration with DeepSpeed is needed for this part. It would be nice if we could work with DeepSpeed and discuss a backward implementation of the DeepSpeed tensor MP kernel. If this is impossible, it may be difficult to use the CUDA kernel during training.

Step 3. Hugging Face and TUNiB collaborate on Transformers

In this step, we will add the tensor MP kernel newly implemented by DeepSpeed and TUNiB into Hugging Face. I think it will be similar to the Policy I implemented in Parallelformers.

There are two ways to add this on the Hugging Face side.

  1. Like modeling_MODEL.py and tokenization_MODEL.py in each model directory, we can create a parallel_MODEL.py for the parallelization policy.

  2. Alternatively, it is also worth considering utilizing config.json. However, this could be a fairly large amount of work, because every config.json file uploaded to the Hub would need to be changed.
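As an illustration of option 1, a parallel_MODEL.py file could expose a small declarative policy object. Everything below (the class name, attribute names, and layer paths) is hypothetical and only mirrors the idea of the per-model Policy classes in Parallelformers:

```python
from dataclasses import dataclass, field

@dataclass
class ParallelPolicy:
    """Hypothetical per-model policy: which sub-layers to slice and how."""
    # weights split along the output dimension (column parallel)
    column_parallel: list = field(default_factory=list)
    # weights split along the input dimension, followed by an all-reduce
    row_parallel: list = field(default_factory=list)
    # attributes that must be divided by the MP size (e.g. head count)
    scaled_attributes: list = field(default_factory=list)

# Illustrative policy for a GPT-2-style block (layer paths are examples).
gpt2_policy = ParallelPolicy(
    column_parallel=["attn.c_attn", "mlp.c_fc"],
    row_parallel=["attn.c_proj", "mlp.c_proj"],
    scaled_attributes=["num_heads"],
)
```

A parallelization engine would then walk the model, look up each named sub-module, and swap it for the corresponding sliced layer.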

Step 4. Hugging Face and TUNiB collaborate on DP, DDP, and PP

Once tensor MP is done, we will be able to proceed with combining it with DP and DDP. At the same time, it would be good to consider implementing PP using nn.ModuleList. In my opinion, the existing PP based on nn.Sequential is not suitable for HF Transformers. I will ask a former colleague for his opinion on a PP implementation based on nn.ModuleList.

@stas00
Contributor

stas00 commented Jul 22, 2021

We probably should discuss PP elsewhere and focus in this thread on what's already working in your project. So I will give a brief overview only:

I totally agree with your opinion. An interesting thing is that a former colleague of mine was the first to implement PP on PyTorch (torchgpipe). He implemented it in a way that uses nn.Sequential. So, if possible, I'll ask him for advice.

Great!

One thing I'm considering is utilizing nn.ModuleList in PP. Currently, most of the Transformers models are implemented with nn.ModuleList, and I think it would be good to use it for PP. The fact that SageMaker can parallelize Hugging Face's models easily means there's something we haven't figured out yet. I hope that in the future we will continue to work together to find such a scalable way.

The 3 frameworks that currently provide PP as an API, as far as I know, are FairScale, DeepSpeed, and recent versions of PyTorch; these all require nn.Sequential. So unless we implement a custom PP, nn.ModuleList won't do. Moreover, you have other modules before and after the block list with very different inputs/outputs.

Actually, the main complication in the current models is the inputs/outputs. PP requires simple tensor variables that can be sliced at the batch dimension. HF models have a gazillion variables that aren't tensors and thus can't be sliced. Some variables are tuples of tuples and are used as aggregates.

If you'd like to see the hoops I had to jump through to make it work for T5, please see:

Note that over the spring PyTorch developed a much more user-friendly PP API, which now allows passing non-tensor variables; this should make things much easier.

Most likely we will have to make stripped versions of the current models which support only the features that PP can accommodate.
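To illustrate why a flat sequence of stages matters here, a minimal pure-Python stand-in (no torch; the toy layers are made up) that partitions an ordered layer list, the shape an nn.ModuleList would have to be flattened into, into pipeline stages and pushes micro-batches through them:

```python
def partition(layers, n_stages):
    """Split an ordered layer list into contiguous pipeline stages."""
    per = (len(layers) + n_stages - 1) // n_stages
    return [layers[i:i + per] for i in range(0, len(layers), per)]

def run_pipeline(stages, micro_batches):
    """Push each micro-batch through every stage in order.

    A real PP engine overlaps stages across devices; this sketch only
    shows that execution must follow a flat stage order with a single
    tensor handed between stages, which is why the frameworks require
    nn.Sequential.
    """
    outputs = []
    for mb in micro_batches:
        for stage in stages:
            for layer in stage:
                mb = layer(mb)
        outputs.append(mb)
    return outputs

layers = [lambda x, k=k: x + k for k in range(4)]   # toy "transformer blocks"
stages = partition(layers, n_stages=2)              # 2 stages of 2 layers each
outs = run_pipeline(stages, micro_batches=[0, 10])  # each batch gains 0+1+2+3
```

The single `mb` handed between stages is exactly the constraint HF models break: their blocks exchange tuples, masks, and cache structures, not one sliceable tensor.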

@stas00
Contributor

stas00 commented Jul 22, 2021

Yes, we need to implement the training side of it. However, it seems a little difficult to use NVIDIA's MPU implementation in Transformers.

I wasn't referring to a specific MPU implementation; DeepSpeed has one too. It's basically the manager of all dimensions of parallelism. The only reason I mentioned it is so that we consider the future PP dimension as we develop the manager.

My idea is to leverage the mechanism of Parallelformers again: to reuse as much of the existing Transformers code as possible. When I implemented Parallelformers, I was able to parallelize the forward pass of most models by changing only a few nn.Linear layers while keeping the existing Transformers code, and I think this can be applied to the backward pass as well. However, combining this with the fused CUDA kernels on the DeepSpeed side could be quite difficult. I think the forward pass is fine, but backward is hard, because backward is not implemented in their tensor MP kernel.

Then we start with just that.

Combining with DP and DDP probably requires minor changes to the existing torch implementation. As you know, with DP and DDP, the same model parameters are broadcast to all GPUs, and each piece of data is sent to a different GPU.

e.g.

* if bsz=16, n_gpus=2

* gpu0=batch 0-7

* gpu1=batch 8-15

This needs to be partitioned. If Tensor MP size is 2, we should create two partitions.

e.g.

* mp_group1=gpu 0, 1

* mp_group2=gpu 2, 3

And I think that the data should be split by each partition, not by each GPU.

e.g.

* if bsz=16, n_gpus=4, mp_size=2

* mp_group1(gpu0,1)=batch 0-7

* mp_group2(gpu2,3)=batch 8-15

Yes, that's the whole point of the MPU. DP doesn't even need to know about TP: it just sees gpu0 and gpu2 and has no idea there are more GPUs in the pipe. Each parallel dimension typically hides its existence from the other dimensions, which keeps things simple.

@stas00
Contributor

stas00 commented Jul 22, 2021

Your collaboration plan is very clear, @hyunwoongko.

Thank you for your willingness to share your work for the good of all! It's true that being part of a "bigger pie" will make your work accessible to a lot more users.

Wrt step 2: you know that DeepSpeed has a full TP implementation, just not in CUDA kernels; perhaps this can be utilized instead for the backward pass?

Otherwise please ping or tag me when you need my input here or on the Deepspeed github.

Looking forward to this inspiring collaboration, @hyunwoongko

@hyunwoongko
Contributor Author

hyunwoongko commented Jul 22, 2021

First of all, we need to discuss this collaborative process with @RezaYazdaniAminabadi.
Can we discuss it here? I'm curious about your opinion.

@hyunwoongko
Contributor Author

Wrt step2, you know that Deepspeed has a full TP implementation, except not in CUDA kernels - perhaps this can be utilized instead for backward?

I'll review the code soon. Thank you.

@hyunwoongko
Contributor Author

https://tunib.notion.site/TECH-2021-07-26-Parallelformers-Journey-to-deploying-big-models-32b19a599c38497abaad2a98727f6dc8

Here is the English version of the blog post!

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 21, 2021

@stas00 Sorry for the delay on this work. We are also building a public large-scale model covering Asian languages. I've been very busy these days, so I haven't had much time to contribute to Hugging Face. I will work on it as soon as possible.

@huggingface deleted a comment from github-actions bot Aug 21, 2021
@stas00
Contributor

stas00 commented Aug 24, 2021

Also pinging @siddk whose team also has been working on improving transformers to support TP https://github.com/stanford-crfm/mistral.

For context: while your team was on summer break, @hyunwoongko implemented Parallelformers and we started discussing how to integrate their work, while planning integration of DeepSpeed CUDA kernels for TP.

So now that your team is getting back let's discuss how to best collaborate.

@siddk
Contributor

siddk commented Aug 24, 2021

Oh this is awesome, thanks @stas00 and nice to meet you @hyunwoongko. Let me get up to speed on this thread, but this looks like amazing work!

@hyunwoongko
Contributor Author

@siddk Hello. Could you please explain so I can get the context? :)

@hyunwoongko
Contributor Author

I will resume this work this weekend. Since my company is very busy now, most of the open-source work will probably be done on weekends. I will be working on DeepSpeed this week. I had an offline meeting with them and we are discussing how to combine our work. (Integration with Hugging Face Transformers probably won't happen soon, because that is steps 3 and 4.)

@jaketae
Contributor

jaketae commented Aug 24, 2021

It's really cool to see this collaboration in the pipeline! I'm not affiliated with any of the frameworks/organizations here at stake, but I do come from HF BigScience side of things where I've briefly discussed things with @stas00. If there's grunt work or anything else that has to be done, I'd be more than happy to contribute in ways that I can.

@hyunwoongko
Contributor Author

hyunwoongko commented Aug 24, 2021

@jaketae I already know you from the KoClip project. Nice to meet you. Your work would be of great help. :)

@stas00 Currently, we need to talk more with the DeepSpeed team. I will first integrate the Parallelformers features into DeepSpeed. However, what DeepSpeed and Transformers currently want is slightly different, so we need to reconcile that.

  1. DeepSpeed wants to improve DeepSpeed-Inference; they may not be considering training features.
  2. Transformers wants to improve training features with 3D parallelization. (And as we said before, we have to consider a Megatron-LM-style MPU if we implement training features with PP and DP. The problem is I don't know if it's okay for me to implement this in DeepSpeed.)

@stas00
Contributor

stas00 commented Aug 25, 2021

As I commented in another issue:

HF transformers wants both training and inference. It's just that we have a lot more users using the library for training. So there is definitely not misalignment between the two.

Remember that Deepspeed already has PP, so they are just missing TP and inference.

HF Transformers doesn't have those yet, hence the difference.

(thanks to @hyunwoongko for correcting me that DS doesn't have TP)

@stas00
Contributor

stas00 commented Aug 25, 2021

@siddk Hello. Could you please explain so I can get the context? :)

https://twitter.com/siddkaramcheti/status/1430195543301492744

@stas00
Contributor

stas00 commented Aug 25, 2021

If there's grunt work or anything else that has to be done, I'd be more than happy to contribute in ways that I can.

@jaketae, the idea is to first pick one model and port it to TP and later PP. Then we will have to replicate this for all models (or at least models that will support this), so there will be a ton of work for quite a few people to contribute.

@hyunwoongko
Contributor Author

I will close this issue. Let's discuss in #13690.
