Compiling a model with neuronx-cc for the trn1 target fails with the following error:
2024-12-10 14:54:16.000320: 44130 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/model.MODULE_10942762915402687297+725887e1.hlo_module.pb', '--output', '/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/model.MODULE_10942762915402687297+725887e1.neff', '--model-type', 'transformer', '--distribution-strategy=llm-training', '--enable-mixed-precision-accumulation', '--verbose=35']: 2024-12-10T14:54:16Z Warning: Non-output memory location with no reader: {xla__all_gather_all-gather.10274.1127}@SB<0,0>(1x2)#Internal DebugInfo: <xla__all_gather_all||UNDEF||[1, 1, 1]> [NLA001] Unhandled exception with message: boost::filesystem::copy_file: No data available [system:61]: "/fsx/training_jobs/trainium_nemo_llama_7b/dw576tCpmLoyF/myenv/lib/python3.10/site-packages/neuronxcc/pwp/pwp_bin_trainium/reciprocal_sqrt_and_small_bkt.bin", "/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/neuronxcc-_cnm4hh4/sgLnk/sg00/reciprocal_sqrt_and_small_bkt.bin"
The error occurs when neuronx-cc attempts to copy the binary file reciprocal_sqrt_and_small_bkt.bin from its source location (/fsx/training_jobs/trainium_nemo_llama_7b/dw576tCpmLoyF/myenv/lib/python3.10/site-packages/neuronxcc/pwp/pwp_bin_trainium/) to the temporary compilation directory (/tmp/root/neuroncc_compile_workdir/73591201-9a34-426f-8e51-11fb62c71eb8/neuronxcc-_cnm4hh4/sgLnk/sg00/).
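One way to narrow this down is to attempt the same copy outside the compiler. The `system:61` in the message is the OS error number, which on Linux is ENODATA ("No data available"), so the failure may come from reading the source file rather than from neuronx-cc itself. The sketch below is not part of neuronx-cc: `try_copy` is a hypothetical helper that copies a file into a fresh temporary directory and reports the errno if the copy fails. Python's `shutil.copyfile` may not use the same system calls as `boost::filesystem::copy_file`, so a successful copy here does not rule out a filesystem problem.

```python
import errno
import os
import shutil
import tempfile


def try_copy(src, dst_dir=None):
    """Copy src into dst_dir (a fresh temp dir if None).

    Returns (ok, detail), where detail names the errno on failure.
    Hypothetical diagnostic helper, not a neuronx-cc API.
    """
    dst_dir = dst_dir or tempfile.mkdtemp(prefix="copy_check_")
    dst = os.path.join(dst_dir, os.path.basename(src))
    try:
        shutil.copyfile(src, dst)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. 61 -> ENODATA on Linux.
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return False, f"{name} ({exc.errno}): {exc.strerror}"
    return True, f"copied {os.path.getsize(dst)} bytes to {dst}"


if __name__ == "__main__":
    import sys

    # Pass one or more file paths, e.g. the .bin file named in the error.
    for path in sys.argv[1:]:
        print(path, *try_copy(path))
```

Running the script with the path to reciprocal_sqrt_and_small_bkt.bin from the error message as its argument would show whether the file can be read and copied from the same environment. If it also fails with ENODATA, the problem likely sits in the filesystem or the installed package rather than in the compilation itself.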
Steps to Reproduce: