I can run the script successfully as explained in the repository, e.g., creating a quantized model and then running it with generate.py. However, the actual issue arises when I try to integrate it into our application. Our main goal is to reduce latency, so we don't want the compilation (torch.compile) to happen on every request.
Thus, I want to move the compilation into the initialization stage of the application so that it runs only once. Is this possible? I ask because even when I run the generate function separately after compilation, the first call always takes a long time, while later inferences run at a good speed.
For example, here is what I am trying to do:
First, I run the main function so that compilation happens:
snippet1
Then, I run only the generate function with a new prompt that comes from the user on each request:
snippet2
My approach is to call the main function during application initialization so that compilation runs once. Then I call the generate function (snippet2) inside my inference function, which is invoked on every request from the frontend; a rough sketch of this setup follows below.
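For concreteness, here is a simplified sketch of what I mean (`load_quantized_model`, `generate`, and the dummy vocab size are placeholders for my own code, not the repository's exact API):

```python
import torch

# --- application initialization (runs once at startup) ---
model = load_quantized_model("model_int8.pth")  # placeholder for my loading code
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# Warm-up: run one dummy generation so that compilation is triggered here,
# not on the first user request.
warmup_tokens = torch.randint(0, 32000, (1, 128), device="cuda")
with torch.no_grad():
    generate(model, warmup_tokens, max_new_tokens=8)

# --- per-request inference (called for every frontend request) ---
def handle_request(prompt_tokens: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return generate(model, prompt_tokens, max_new_tokens=200)
```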
However, the ISSUE is that when I do this, the first call to snippet2 is very slow (~7 tokens/sec); afterward it runs at a good speed (~90 tokens/sec). I don't understand why the first call is so slow, given that no compilation should be happening at that point.
Also, is there any other way to cache the compilation results so that we can reuse them for inference seamlessly?
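From what I understand of the Inductor docs, there are environment variables for persisting compiled artifacts across process restarts, something along these lines (variable names as I understand them; please correct me if this is not the intended mechanism):

```python
import os

# Assumed mechanism: enable Inductor's FX graph cache and point it at a
# persistent directory, so a restarted process can reuse compiled artifacts.
# These must be set before the first torch.compile call.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/var/cache/torchinductor"

import torch
# ... then compile and warm up the model as in the sketch above
```

But even with this, would the first call in a fresh process still be slow because of things like CUDA graph capture or autotuning that happen in-process?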
Please help me understand this issue.