Hi, I've built a chatbot using Llama2 on a machine equipped with four GPUs, each with 16GB of memory. However, it appears that only 'cuda:0' is currently being utilized. Consequently, we are experiencing high latency, approximately 60 seconds per question. I'm wondering if Tensor Parallel can help us leverage the other CUDA devices. I've attempted the following:
import torch
import tensor_parallel as tp
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", local_files_only=True,
    low_cpu_mem_usage=True, torch_dtype=torch.float16, load_in_4bit=True,
)
# shard the model across two of the four GPUs
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
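For context, generation afterwards looks roughly like this (a simplified sketch; the prompt and generation settings below are placeholders, not my actual values):

# simplified inference sketch -- placeholder prompt and max_new_tokens
inputs = tokenizer("What is tensor parallelism?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))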
Please let me know if you have any suggestions or advice. Thanks in advance!