The inference results are inconsistent with Huggingface. #30
Comments
I also found that the inference results are inconsistent.
Yes, the results can sometimes be inconsistent. This comes down to floating point error, and it is normal -- hf with flash-attn and hf without flash-attn can also produce different outputs sometimes. If you use float32, lade's results should be exactly the same as hf's.
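A quick way to check that claim, as a minimal sketch (the checkpoint name is a placeholder, not from this thread): load the model in float32 before running the comparison, e.g.:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; only torch_dtype matters here. In float32 the
# rounding error is small enough that greedy outputs from lade and plain
# hf should match token for token, per the comment above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float32
).cuda()
```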
Thanks for your reply. I figured this out. With parallel decoding, due to the loss of precision, there is indeed no guarantee that the result will exactly match that of single-step decoding.
This is floating point error. Although the logical flow is the same, the computations that happen on the GPU are different (i.e., lade computes several tokens per step while hf computes only one token per step).
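For concreteness, a minimal sketch of the underlying effect: floating point addition is not associative, so a kernel that accumulates the same values in a different order (several tokens at once versus one at a time) can produce a different rounded result.

```python
import torch

a = torch.tensor(1.0, dtype=torch.float16)
b = torch.tensor(2.0 ** -11, dtype=torch.float16)  # half an fp16 ulp at 1.0

# Same operands, different grouping -> different fp16 results.
left = (a + b) + b   # each tiny addend rounds away: stays 1.0
right = a + (b + b)  # tiny addends combine first: reaches the next fp16 value

print(left.item(), right.item())  # 1.0  1.0009765625
```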
And I do not think hf fp16's output is the 'correct' one; the hf fp32/lade fp32 outputs should be the 'correct' ones. Sometimes lade fp16's output aligns with the fp32 output while hf fp16's output is inconsistent with its fp32 output.
I see. It is impossible for either lade or hf to guarantee that FP16 output always matches FP32 output. Initially, I assumed that lade and hf would always produce consistent output at the same precision.
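A tiny, deterministic illustration of why FP16 output cannot always track FP32 output (the values are made up for the example): once the gap between the top two logits is smaller than FP16 can resolve, greedy decoding picks a different token, and the sequences diverge from there.

```python
import torch

# Toy logits whose top-2 gap (0.003) is below fp16's resolution near 10.0
# (one fp16 step there is ~0.0078), so casting erases the ordering.
logits32 = torch.tensor([10.000, 10.003, 0.0])
logits16 = logits32.to(torch.float16)  # both leaders round to exactly 10.0

print(torch.argmax(logits32).item())  # 1 -- fp32 still resolves the gap
print(torch.argmax(logits16).item())  # 0 -- tie; the first index wins
```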
Hello! Thanks for this new parallel decoding algorithm.
While using minimal.py to compare the performance of LookaheadDecoding and Huggingface, I found that LookaheadDecoding's output on some test cases was not consistent with Huggingface's. Here I share my test code and environment, which is modified from minimal.py.
Environment:
Code:
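(The author's actual script is not preserved in this thread. Below is a minimal sketch of such a comparison, modeled on minimal.py and its lade entry points, lade.augment_all() and lade.config_lade(); the checkpoint, prompt, and generation settings are illustrative assumptions.)

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

if int(os.environ.get("USE_LADE", 0)):
    import lade
    lade.augment_all()  # patch HF generate() with lookahead decoding
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16  # fp16, where the divergence shows up
).cuda()

prompt = "Tell me a story about a robot."  # assumed test case
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding so both runs are deterministic and directly comparable.
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```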
You can use it like minimal.py. When I run the test, the output of Huggingface is:
And the output of LookaheadDecoding is: