
Error during NeuronX compilation: "boost::filesystem::copy_file: No data available" when copying required binary files #1063

Open
Mr-Rakshit opened this issue Dec 10, 2024 · 0 comments


Mr-Rakshit commented Dec 10, 2024

I am encountering an error while compiling a model with neuronx-cc for the trn1 target. Compilation fails with the following error message:

2024-12-10 14:54:16.000320: 44130 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/model.MODULE_10942762915402687297+725887e1.hlo_module.pb', '--output', '/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/model.MODULE_10942762915402687297+725887e1.neff', '--model-type', 'transformer', '--distribution-strategy=llm-training', '--enable-mixed-precision-accumulation', '--verbose=35']:
2024-12-10T14:54:16Z Warning: Non-output memory location with no reader: {xla__all_gather_all-gather.10274.1127}@SB<0,0>(1x2)#Internal DebugInfo: <xla__all_gather_all||UNDEF||[1, 1, 1]>
[NLA001] Unhandled exception with message: boost::filesystem::copy_file: No data available [system:61]: "/fsx/training_jobs/trainium_nemo_llama_7b/dw576tCpmLoyF/myenv/lib/python3.10/site-packages/neuronxcc/pwp/pwp_bin_trainium/reciprocal_sqrt_and_small_bkt.bin", "/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/neuronxcc-_cnm4hh4/sgLnk/sg00/reciprocal_sqrt_and_small_bkt.bin"

The error occurs when neuronx-cc attempts to copy the binary file reciprocal_sqrt_and_small_bkt.bin from its source location (/fsx/training_jobs/trainium_nemo_llama_7b/dw576tCpmLoyF/myenv/lib/python3.10/site-packages/neuronxcc/pwp/pwp_bin_trainium/) to the temporary compilation directory (/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/neuronxcc-_cnm4hh4/sgLnk/sg00/).
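
The `[system:61]` in the message is an OS error code; on Linux, errno 61 is ENODATA ("No data available"). To check whether the same failure can be reproduced outside the compiler, a minimal sketch like the one below tries to copy every `.bin` file from a source directory into a scratch directory and reports any OS error per file. The function names `try_copy` and `check_bin_files` are hypothetical helpers written for this illustration and are not part of neuronx-cc. The sketch only shows whether a plain file copy fails in the same way; it does not identify the root cause.

```python
import errno
import shutil
import tempfile
from pathlib import Path


def try_copy(src: Path, dst_dir: Path) -> str:
    """Copy src into dst_dir; return 'ok' or a short description of the OS error."""
    try:
        shutil.copyfile(src, dst_dir / src.name)
    except OSError as exc:
        # Map the numeric errno to its symbolic name (e.g. ENOENT, ENODATA).
        name = errno.errorcode.get(exc.errno, "unknown")
        return f"errno {exc.errno} ({name}): {exc.strerror}"
    return "ok"


def check_bin_files(src_dir: Path) -> dict[str, str]:
    """Try to copy every *.bin file in src_dir into a temporary scratch directory."""
    results = {}
    with tempfile.TemporaryDirectory() as scratch:
        for src in sorted(src_dir.glob("*.bin")):
            results[src.name] = try_copy(src, Path(scratch))
    return results
```

For example, calling `check_bin_files(Path(".../site-packages/neuronxcc/pwp/pwp_bin_trainium"))` on the compute node, with the path taken from the log above, would show whether `reciprocal_sqrt_and_small_bkt.bin` can be read from /fsx independently of neuronx-cc.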

Steps to Reproduce:

  1. Compile the model (https://github.com/aws-neuron/neuronx-nemo-megatron/blob/main/nemo/examples/nlp/language_modeling/llama_7b.sh) with neuronx-cc for the trn1 target, launching the job through Slurm.
  2. Compilation starts, but the error above is triggered when neuronx-cc copies reciprocal_sqrt_and_small_bkt.bin.