The inference results are inconsistent with Huggingface. #30

Open
cyfwry opened this issue Dec 6, 2023 · 6 comments

Comments


cyfwry commented Dec 6, 2023

Hello! Thanks for this new parallel decoding algorithm.
While using minimal.py to compare the performance of LookaheadDecoding and Huggingface, I found that for some test cases the output is not consistent with Huggingface's. Below I share my environment and test code, which are modified from minimal.py.

Environment:

- GPU: A100-80G
- CUDA: 11.8
- driver: 470.103.01
- Python: 3.9.16
- PyTorch: 1.13.0
- transformers: 4.34.0

Code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import os

# Only load Lookahead Decoding when LOAD_LADE=1 is set in the environment
if int(os.environ.get("LOAD_LADE", 0)):
    import lade
    lade.augment_all()
    lade.config_lade(LEVEL=7, WINDOW_SIZE=20, GUESS_SET_SIZE=20, DEBUG=1)

assert torch.cuda.is_available()

torch_device = "cuda"
model_name = "TinyLlama-1.1B-Chat-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device)
model.tokenizer = tokenizer

data = "How to write a shell script to get a program to restart itself on crash"
model_inputs = tokenizer(data, return_tensors='pt').to(torch_device)

# Greedy decoding (do_sample=False), so both runs should be deterministic and comparable
greedy_output = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```
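A sketch of the invocation, mirroring how minimal.py is run (the USE_LADE flag is taken from the repository README and may be redundant for this snippet, which only checks LOAD_LADE; substitute the modified script's filename):

```bash
# Huggingface baseline: lade is never imported
python minimal.py

# with Lookahead Decoding enabled
LOAD_LADE=1 USE_LADE=1 python minimal.py
```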

You can use it just like minimal.py, as sketched above. When I test it, the Huggingface output is:

```
How to write a shell script to get a program to restart itself on crash?
How do I write a shell script to restart a program when it crashes?
I have a program that I want to automatically restart when it crashes. I want it to be a simple script that just starts the program and then waits for it to finish and then starts the program again.
I've tried using kill -9 <pid> but it doesn't work. Any ideas?
Here's a simple script that should work:
#!/bin/bash

# Start the program
./program

# Wait for it to finish
while true; do
  # Check if the program is running
  pid=$(ps ax | grep "$PROGRAM_NAME" | awk '{print $2}')
  if [ -z "$pid" ]; then
    # Program is not running, start it
    ./program
  fi

  # Check if the program has finished
  sleep 10
done

This script uses the ps ax command to list the processes running the program and the sleep 10 to wait for 10 seconds before starting it again.
I hope this helps! Let me know if you have any questions.

A: You can use the
```

And the output of LookaheadDecoding is:

```
How to write a shell script to get a program to restart itself on crash?
How do I write a shell script to restart a program when it crashes?
I have a program that I want to automatically restart when it crashes. I want it to be a simple script that just starts the program and then waits for it to finish and then starts the program again.
I've tried using kill -9 <pid> but it doesn't work. Any ideas?
Here's a simple script that should work:
#!/bin/bash

# Start the program
./program

# Wait for it to finish
while true; do
  # Check if the program is running
  pid=$(ps -p $USER -o pid= --no-headers | awk '{print $1}')
  if [ -z "$pid" ]; then
    echo "Program not running, waiting..."
    sleep 10
  else
    # Start the program again
    ./program
  fi
done

This script uses ps to check if the program is running and, if it is, it waits for it to finish. If it's not running, it starts the program again.
I hope this helps! Let me know if you have any questions.
```

LMX-xin commented Dec 6, 2023

I also found that the inference results are inconsistent.
The input is "who are you " (note the trailing space after "you") with max_new_tokens=20. The inference results with N, W, G = 3, 2, 2 are inconsistent with the results with N, W, G = 4, 4, 4.
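(Presumably N, W, and G correspond to the LEVEL, WINDOW_SIZE, and GUESS_SET_SIZE arguments of config_lade from the snippet above, i.e. the two runs would be configured roughly as:)

```python
# assumed mapping: N -> LEVEL, W -> WINDOW_SIZE, G -> GUESS_SET_SIZE
lade.config_lade(LEVEL=3, WINDOW_SIZE=2, GUESS_SET_SIZE=2, DEBUG=1)  # first run
lade.config_lade(LEVEL=4, WINDOW_SIZE=4, GUESS_SET_SIZE=4, DEBUG=1)  # second run
```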

Viol2000 (Collaborator) commented Dec 6, 2023

Yes, sometimes the results can be inconsistent. We attribute this to floating-point error, and it is normal: hf with flash-attn and hf without flash-attn can also produce different outputs sometimes. If you use float32, lade's results should be exactly the same as hf's.
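(For anyone who wants to verify this, a minimal change to the script above should suffice; note that float32 roughly doubles GPU memory use:)

```python
# load in full precision; lade and hf should then match token-for-token
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32, device_map=torch_device
)
```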

cyfwry (Author) commented Dec 6, 2023

> Yes, sometimes the results can be inconsistent. We attribute this to floating-point error, and it is normal: hf with flash-attn and hf without flash-attn can also produce different outputs sometimes. If you use float32, lade's results should be exactly the same as hf's.

Thanks for your reply; that clears it up. With parallel decoding, because of the loss of precision, there is indeed no guarantee that the result will be exactly the same as that of single-step decoding.

Viol2000 (Collaborator) commented Dec 6, 2023

This is floating-point error. Although the logical flows are the same, the computations that happen on the GPU are different (i.e., lade computes several tokens per step while hf computes only one token per step). Different floating-point tensor computations can produce very similar outputs, but there will always be small differences.
Even a slight difference accumulates and eventually turns into inconsistent output, which explains why the inconsistency tends to appear when the output is relatively long.
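(A minimal illustration of the underlying effect, independent of lade or hf: summing the same fp16 values in two different orders typically yields slightly different results, because floating-point addition is not associative.)

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096).to(torch.float16)

# accumulate the same values in two different orders, entirely in fp16
fwd = torch.zeros((), dtype=torch.float16)
for v in x:
    fwd = fwd + v

rev = torch.zeros((), dtype=torch.float16)
for v in x.flip(0):
    rev = rev + v

print(fwd.item(), rev.item())  # typically differ in the last bits
```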

Viol2000 (Collaborator) commented Dec 6, 2023

And I do not think hf's fp16 output is the 'correct' one; the hf fp32/lade fp32 outputs should be treated as the reference. Sometimes lade's fp16 output aligns with the fp32 output, while hf's fp16 output is inconsistent with its own fp32 output.
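(One way to check this empirically, reusing the generation code from the first post; model_fp16 and model_fp32 are hypothetical names for the same checkpoint loaded at the two precisions:)

```python
def first_divergence(a, b):
    # index of the first differing token id, or None if the sequences match
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None

out_fp32 = model_fp32.generate(**model_inputs, max_new_tokens=256, do_sample=False)
out_fp16 = model_fp16.generate(**model_inputs, max_new_tokens=256, do_sample=False)
print(first_divergence(out_fp32[0].tolist(), out_fp16[0].tolist()))
```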

cyfwry (Author) commented Dec 7, 2023

> And I do not think hf's fp16 output is the 'correct' one; the hf fp32/lade fp32 outputs should be treated as the reference. Sometimes lade's fp16 output aligns with the fp32 output, while hf's fp16 output is inconsistent with its own fp32 output.

I agree; it is impossible for either lade or hf to always keep its fp16 output consistent with its fp32 output. I had initially assumed that lade and hf would always produce identical output at the same precision.
