Precomputation of conditions and latents #129
Conversation
Actually, I think I have an idea which I would like to consider before this is merged. I am on the move currently but will get the comments in by this evening.
Sounds good, LMK when you're free!
Not for this PR but, usually, for large-scale training projects, any kind of precomputation is treated as "data processing" and is separated from trainer classes. So, in this case, as a (power) user, I would prefer:
But for users to get started, the current way of handling things is fine. If we do implement the functional APIs, trainer classes could implement a […]. I also don't like […].
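To make the separation concrete, here is a rough sketch of trainer-independent, functional precomputation; the function names and signatures are assumptions for illustration, not the project's actual API.

```python
# Hypothetical sketch of functional precomputation APIs that live outside the
# trainer; names and signatures are assumptions, not the project's actual API.
import torch


@torch.no_grad()
def precompute_conditions(prompts, tokenizer, text_encoder, device="cuda"):
    """Encode all prompts once and return CPU tensors that can be cached to disk."""
    embeddings = []
    for prompt in prompts:
        inputs = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}
        embeddings.append(text_encoder(**inputs).last_hidden_state.cpu())
    return embeddings


@torch.no_grad()
def precompute_latents(videos, vae, device="cuda"):
    """Encode all videos to latents once; the trainer only ever sees the latents."""
    return [vae.encode(video.unsqueeze(0).to(device)).latent_dist.sample().cpu() for video in videos]
```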
I respectfully disagree. I may not have the experience you have, but our initial goals were aimed at making video model training more accessible, not targeting large-scale training. Adding extra steps, and more things users have to follow instructions for just to get it to work, is a big no-go IMO. Yes, we can eventually try to be more ambitious, but for now we should focus on streamlining things as much as possible, supporting all things LoRA (normal LoRA, control LoRA, turbo distillation LoRAs, etc.), and making sure our potato friends can run it. From some feedback I gathered, people who currently use this project want two things:
Of course, this is not a UI, so the button part is not to be considered, and is a battle for someone else to fight. But opt-in precomputation like we do here, from just two text files, looks ideal to me (for now). We can reconsider in the future as you mention. You and I know about these things and can work our way through all of this, but I want you to think from the perspective of a general user with no programming knowledge, who is just experimenting from a creator background, has read some tutorials online, and just wants to use the sensible defaults; they will find all of this to be nonsense. Just from the history of diffusion inference over the past two years, I believe there are a lot of lessons to be learnt about what got most popular among users who actually use these kinds of tools daily. Large-scale training is quite a ways away from what we have. Basic data parallelism like we have will never scale IMO, so that is what we should tackle first before designing anything else, if that's what we want to prioritize.
Awesome, so this is where we are at currently. We just need to "API"-fy it a bit more so that custom methods/callbacks can be passed in, or so that what we have currently can be re-used more easily.
Utilizing accelerators based on compute patterns of the precomputation stage affect training...? I don't quite follow what you mean. Maybe we can hop into DMs and I can try to understand?
I would like to distinguish between two things here:
TRL implements what I would call "algorithm-specific" training. It works with different models because many things are common and transformers provides some great abstractions, but also because it is mostly just a single model and not many moving components. The same does not apply to diffusion models: we have discussed time and again how many moving components are involved, and we will most definitely require model-specific implementations because different models process things differently. But OTOH, most underlying training algorithms are independent of the modeling architecture itself. For example, with a decent abstraction like what we have now, LoRA training works for both LTX and Hunyuan without any LoRA stuff interleaving with the single-file implementations we have for model loading, condition and latent computation, etc. I believe the same would be the case for full finetuning, distillation, and pretty much everything else (we can break away into a better abstraction when we get to it, but let's not prematurely design with that in mind because it will only slow us down). I would prefer doing the following:
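Roughly, and purely as an illustration (every name below is hypothetical, not the project's actual API), the split could look like this:

```python
# A rough, hypothetical illustration of an algorithm-level trainer that stays
# model-agnostic; every name here is illustrative, not the project's actual API.
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ModelSpec:
    """The per-model single file only needs to expose these callables."""
    load_components: Callable[[], Dict[str, Any]]   # tokenizer, text encoder, VAE, transformer
    prepare_conditions: Callable                    # prompts -> embeddings
    prepare_latents: Callable                       # videos -> latents
    forward: Callable                               # (transformer, latents, conditions, timesteps) -> prediction


class LoRATrainer:
    """LoRA-specific logic written once; reused across LTX, Hunyuan, Cog, Mochi, ..."""

    def __init__(self, spec: ModelSpec):
        self.spec = spec
        self.components = spec.load_components()
        # LoRA-specific setup (adding adapters, freezing base weights) would go
        # here, without ever touching the model-specific single files.

    def training_step(self, batch, timesteps):
        conditions = self.spec.prepare_conditions(batch["prompts"])
        latents = self.spec.prepare_latents(batch["videos"])
        pred = self.spec.forward(self.components["transformer"], latents, conditions, timesteps)
        # Loss and optimizer logic also live at the algorithm level.
        return pred
```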
With that in mind, we can have multiple entry points, but they should not be at the "model" level and should instead be at the "algorithm" level. The model to be trained and all the underlying configuration management should happen within a single file such as […]. These are my thoughts on how we should proceed to maximize our speed of integrating new models and techniques that we want to support. Premature refactoring might bring us abstractions that look good now, but it might slow us down and cause further refactoring in the future. LMK what you think so we can proceed accordingly. If you think we absolutely need a better design before moving forward, I'm happy to assist and will not make any further PRs until we get that sorted; if not, I can take up CogVideoX next and maybe you could do Mochi? I will address your concerns shortly and push the relevant changes. Thanks for the reviews!
I didn't say "affect". I am saying it gives us a chance to better utilize accelerators in the way we want to. Accelerator distribution for precomputation isn't necessarily the same as during training. Below is an example:
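One hypothetical arrangement of this kind (the rank split and function signatures are assumptions for illustration, not necessarily what was originally proposed):

```python
# Hypothetical illustration only: during precomputation, the accelerators can be
# split between the text encoder and the VAE, which differs from the plain
# data-parallel layout used during training. Names/signatures are assumptions.
import torch.distributed as dist


def run_precomputation(rank, world_size, dataset, encode_conditions_fn, encode_latents_fn):
    half = world_size // 2
    if rank < half:
        # First half of the ranks encode captions with the text encoder.
        shard = dataset.shard(num_shards=half, index=rank)
        encode_conditions_fn(shard)
    else:
        # The other half encode videos into latents with the VAE, in parallel
        # with the caption encoding above.
        shard = dataset.shard(num_shards=world_size - half, index=rank - half)
        encode_latents_fn(shard)
    dist.barrier()  # wait for all ranks before training starts with its usual layout
```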
(both happening in parallel)

This (the way the accelerators are being utilized) is not what we do during training. But it's okay to not proceed in the direction I mentioned. I won't absolutely prefer having a single […], though.
What is okay with me is:
What is not okay with me:
Preparing conditions and latents is a simple enough thing IME that it won't require an implementation for each kind of algorithm and can generally be re-used. I would also not be okay with each trainer class having to implement its own N different "algorithm" implementations. Basically, I don't want LTXTrainer to have its own implementation for all the different mentioned tasks unless absolutely necessary. Doing so will only create bugs and make it harder to maintain. If it were rather called something like […]. What would be more likeable to me is […].
So, this means you're not okay with the direction taken in #130, right?
Very minimal differences between […].
Okay, I feel like despite my efforts, it is extremely hard to get my point across. Please make an effort not to pick two of the simplest examples and extrapolate them to the entirety of the message. Do we not agree that LoRA is a different training technique than full finetuning? I am trying to discuss training algorithms, not how similar or dissimilar they are. My point is, we should have trainers specific to algorithms, not models. It should be model-independent as long as the model provides all the helpful utility functions required for training. Let's not tie the two together; this inevitably leads to bugs, and I've seen my fair share with the diffusers scripts that do it model-specifically. That approach is hard to use and unstable, but it serves its purpose in that it allows for hacking and exploration. Here are the directions I'm not okay with in #130:
To reiterate, the models should not tie into the actual training algorithm in any way. In all the trainers I've looked at, those that do this have very complicated code (but partially also because they build atop the LDM codebase). SimpleTuner is a good example that does not tie the loss to the model itself, and to me it is the most successful training project built on Diffusers because it "just works" OOTB and you don't have to think much to get it to work. I would take inspiration from there and follow some of its design patterns, but improve wherever possible and customize to our broader set of goals.
It's okay to have disagreements and I can see some of your points. So, I will roll with this direction. Making changes later isn't a big deal. The reason I wanted to separate dataset prep was that not all models have the same components we need during precomputation (different text encoders, different masking, etc.), so unifying that under a single method becomes very messy. And then for caption dropout, different models do it differently, as mentioned the other day. It's easy to miss this detail, but it sometimes plays a crucial role. For the loss, Mochi does it differently than the other flow models we have. TL;DR: let's get this PR merged. I will work on Cog next as explained over DM.
This, I agree with. What I want is for the utility functions to be part of the model-specific implementations, not the loading and dataloader creation itself. That should be a separate, decoupled component in itself. We leverage the utility functions with conditionals in the trainer (just throwing out ideas):
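For example, a sketch along these lines; the module and function names are made up for illustration and do not refer to real files:

```python
# Just a sketch of "conditionals in the trainer" dispatching to model-specific
# utilities; module and function names are assumptions, not real files.
def get_model_utilities(model_name: str):
    if model_name == "ltx_video":
        from ltx_video import prepare_conditions, prepare_latents        # per-model single file
    elif model_name == "hunyuan_video":
        from hunyuan_video import prepare_conditions, prepare_latents    # per-model single file
    else:
        raise ValueError(f"Unsupported model: {model_name}")
    return prepare_conditions, prepare_latents


class Trainer:
    def __init__(self, model_name: str, dataloader):
        # Dataset loading and dataloader creation live in a separate component;
        # the trainer only consumes the dataloader and the model utilities.
        self.prepare_conditions, self.prepare_latents = get_model_utilities(model_name)
        self.dataloader = dataloader

    def precompute(self):
        for batch in self.dataloader:
            yield self.prepare_conditions(batch["prompts"]), self.prepare_latents(batch["videos"])
```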
Again, this is a trainer-specific detail and should not be tied into the model utilities themselves. The trainer will invoke the preparation methods. If it's an "empty"-ing dropout, the null prompt will be passed to the preparation method. If it's a "tensor zero"-ing dropout, it can be done post-preparation. We already do this. This should be configurable via CLI args to enable exploration of what works, not hardcoded for certain models. We can provide sensible guidelines on what to do per model.
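A minimal sketch of the two dropout styles described here (the style names, helpers, and probability are placeholders, not the actual CLI surface):

```python
# Minimal sketch of configurable caption dropout; style names, helper names,
# and the default probability are assumptions, not the project's actual CLI.
import random
import torch


def maybe_drop_caption(prompt: str, style: str = "empty", p: float = 0.1) -> str:
    # "Empty"-ing dropout: replace the caption with a null prompt *before*
    # preparation, so the text encoder produces its own null embedding.
    if style == "empty" and random.random() < p:
        return ""
    return prompt


def maybe_zero_embedding(embedding: torch.Tensor, style: str = "empty", p: float = 0.1) -> torch.Tensor:
    # "Tensor zero"-ing dropout: zero out the already-prepared embedding
    # *after* preparation instead of re-encoding a null prompt.
    if style == "zero" and random.random() < p:
        return torch.zeros_like(embedding)
    return embedding
```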
It's one case among the thousands of flow-matching models. It's not a different flow-matching algorithm as such. It's just a mistake that occurred in the original codebase, as I like to believe. Instead of the usual convention (timestep 1000 = noise, timestep 0 = clean), they have a model trained with the inverse objective.
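Concretely, the difference is only in how the timestep maps to the noise level in the flow-matching interpolation; a sketch of the two conventions (not Mochi's actual code):

```python
# Sketch of the two timestep conventions for the flow-matching interpolation
# x_t = (1 - sigma) * x_0 + sigma * noise. Illustrative only.
def sigma_standard(t: int, num_train_timesteps: int = 1000) -> float:
    # Usual convention: t = 1000 -> pure noise, t = 0 -> clean sample.
    return t / num_train_timesteps


def sigma_inverted(t: int, num_train_timesteps: int = 1000) -> float:
    # Inverted objective described above: t = 1000 -> clean sample, t = 0 -> pure noise.
    return 1.0 - t / num_train_timesteps
```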
Thank you. I realize I might have spoken a bit spiritedly and strongly about the way I expect things to move forward. I think it is good to have these discussions in the early stages rather than realizing this later. Working my way through the reviews and will push shortly.
Just so you know, it can be model-specific :D If a model was pre-trained using a particular caption-dropping method, that will have to be accounted for. So we can't eliminate the possibility, and it's something a user should be mindful about. Hence, by default, we shouldn't do any caption dropping. I guess that is fair?
It's very rarely the case that an empty/null prompt is not used. Maybe some of the recent models, and for those cases, when the trainer object is initialized, I would like a warning to be raised rather than disabling the possibility of doing anything else altogether.
Just like I said above, let's get to it when we get to it instead of thinking about all this now. Having pointless debates only slows us down. None of the models currently requires us to diverge from the sensible defaults.
Co-Authored-By: Sayak Paul <[email protected]>
Okay. I was trying to be thorough about the issues that can silently creep in during training, but it seems like we don't have to be worried about them that much for now. Let's not call that a pointless debate.
Just some minor nits. Otherwise, looks good.
I believe I've addressed all reviews. Proceeding with merge. Follow-ups:
Works for me, thanks for chalking that out. I can also support training with bnb-quantized checkpoints similar to https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization (something that has worked like wonders in the LLM world and also for Flux) ;)
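For reference, a rough sketch of what that could look like, assuming a recent diffusers release with bitsandbytes model-level quantization; the model id, target modules, and hyperparameters are placeholders, not tested settings:

```python
# Rough sketch in the spirit of the linked flux_lora_quantization example.
# Model id, target modules, and hyperparameters are placeholders.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from peft import LoraConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
# Attach LoRA adapters; only these parameters are updated during training.
transformer.add_adapter(
    LoraConfig(r=16, lora_alpha=16, target_modules=["to_q", "to_k", "to_v", "to_out.0"])
)
```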
Will add: