
Reducing Latency in Application with Torch Compilation: Initialization and Inference Optimization #127

daniyal214 opened this issue Mar 8, 2024 · 0 comments

I can run the script successfully as explained in the repository, i.e. create a quantized model and then run it with generate.py. The actual issue arises when I try to integrate it into our application. Our main goal is to reduce latency, so we don't want the compilation (torch.compile) to happen on every request.

I therefore want to keep the compilation step in the initialization stage of the application so that it runs only once. Is that possible? Even if I run the generate function separately after compilation, the first call always takes a long time, and only the later inferences run at a good speed.

For example, what I am trying to do is:

First, I run the main function so that the compilation happens there:

snippet1

model_r, encoded_r, callback_r, tokenizer_r, model_size_r, prof_r = main(
    prompt="Hello, my name is",
    interactive=False,
    num_samples=1,
    max_new_tokens=128,
    top_k=200,
    temperature=0.8,
    checkpoint_path=Path("------model_int8.pth"),
    compile=True,
    compile_prefill=False,
    profile=None,
    draft_checkpoint_path=None,
    speculate_k=5,
    device="cuda",
)

Then, I run only the generate function with a new prompt, which will come from the user on each request:
snippet2

prompt_2 = "Machine Learning is"
encoded_r_2 = encode_tokens(tokenizer_r, prompt_2, bos=True, device='cuda')
prompt_length_r_2 = encoded_r_2.size(0)
prompt_length_r_2


with prof_r:
    y, metrics = generate(
        model_r,
        encoded_r_2,
        512,
        draft_model=None,
        speculate_k=5,
        interactive=False,
        callback=callback_r,
        temperature=0.9,
        top_k=200,
    )

My approach is to call the main function during application initialization so that compilation runs there, and then call the generate function (snippet2) inside my inference function, which is invoked on every request from the frontend. A sketch of the split I have in mind is shown below.
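(This is only a sketch: `init_model`, `run_inference`, and `app_state` are placeholder names I made up for illustration; `main`, `generate`, and `encode_tokens` are the same functions used in the snippets above, assuming generate.py is importable as a module and that `main()` has been modified to return the model, tokenizer, and callback as in snippet1.)

```python
# Sketch only: init_model / run_inference / app_state are placeholder names.
# Assumes generate.py is importable and main() returns the objects shown in snippet1.
from pathlib import Path

from generate import main, generate, encode_tokens

app_state = {}


def init_model():
    # Runs once at application start-up: loads the int8 checkpoint and pays the
    # torch.compile cost here instead of on a user request.
    model, encoded, callback, tokenizer, model_size, prof = main(
        prompt="Hello, my name is",
        interactive=False,
        num_samples=1,
        max_new_tokens=128,
        top_k=200,
        temperature=0.8,
        checkpoint_path=Path("------model_int8.pth"),
        compile=True,
        compile_prefill=False,
        profile=None,
        draft_checkpoint_path=None,
        speculate_k=5,
        device="cuda",
    )
    app_state.update(model=model, tokenizer=tokenizer, callback=callback)


def run_inference(prompt: str):
    # Called on every request from the frontend; no compilation should happen here.
    encoded = encode_tokens(app_state["tokenizer"], prompt, bos=True, device="cuda")
    return generate(
        app_state["model"],
        encoded,
        512,
        draft_model=None,
        speculate_k=5,
        interactive=False,
        callback=app_state["callback"],
        temperature=0.9,
        top_k=200,
    )
```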

However, the ISSUE is that when I do this, the first time snippet2 is called it is very slow (about 7 tokens/sec); afterwards it runs at a good speed (about 90 tokens/sec). I don't understand why the first call is so slow, given that no compilation should be happening at that point.
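The only workaround I can think of is to absorb that slow first call during start-up by running one throwaway generate() right after main() returns. A minimal sketch of what I mean (warmup_prompt is just a placeholder, and the objects from snippet1 are assumed to be in scope):

```python
# Workaround sketch (my own idea, not from the repo): run one throwaway
# generate() during initialization so the slow first call happens at start-up,
# not on the first user request. Assumes model_r, tokenizer_r, callback_r from
# snippet1 are in scope.
import torch

warmup_prompt = "warm up"  # placeholder text, the content doesn't matter
warmup_encoded = encode_tokens(tokenizer_r, warmup_prompt, bos=True, device="cuda")
_ = generate(
    model_r,
    warmup_encoded,
    512,  # same max_new_tokens budget as the real requests
    draft_model=None,
    speculate_k=5,
    interactive=False,
    callback=callback_r,
    temperature=0.9,
    top_k=200,
)
torch.cuda.synchronize()  # make sure the warm-up has actually finished
```

I'm not sure this fully solves it if the slowdown is tied to the new prompt length rather than to the first post-compile call, but it would at least move the cost out of the first user request.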

And is there any other way to cache the compilation results so that the application can simply load them and run inference seamlessly?
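For reference, the only caching mechanism I'm aware of is Inductor's experimental on-disk cache, which (depending on the PyTorch version) can be enabled through environment variables before torch is imported; I'm not sure it covers everything torch.compile does on the first call:

```python
import os

# Assumption: these Inductor cache settings exist in recent PyTorch versions
# (they are experimental and may change between releases). They must be set
# before torch is imported / before the first torch.compile call.
os.environ.setdefault("TORCHINDUCTOR_FX_GRAPH_CACHE", "1")               # persist compiled FX graphs to disk
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/inductor_cache")  # location of the on-disk cache

import torch  # noqa: E402  (imported after the cache env vars are set)
```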

Please help me understand this issue.
