graphstorm images utilisation in SageMaker TuningJobs #1072
Comments
Hi @milianru, I think this is actually a shortcoming of our SageMaker training output implementation. If you're willing to contribute a fix, I can shepherd the PR and try to get it merged for the next release.
Hi, @milianru, for the place where GraphStorm stores the model artifacts, we use `/tmp/gsgnn_model/`. You are free to adjust the training entry script if it works :).
Hi @thvasilo,
Hi @classicsong,
Hi, @milianru
Another workaround is that you can change sagemaker_train.py to customize the S3 path used for the model artifacts, e.g. by appending the SageMaker job name. The SageMaker job name can be found from the train_env (https://github.com/awslabs/graphstorm/blob/main/python/graphstorm/sagemaker/sagemaker_train.py#L178).
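For illustration, a minimal sketch of that kind of adjustment (the helper name is hypothetical, and the exact way sagemaker_train.py obtains the job name may differ; here it is read from the SM_TRAINING_ENV variable that SageMaker sets):

```python
import json
import os


def per_job_artifact_path(model_artifact_s3: str) -> str:
    """Append the SageMaker training job name to the shared S3 prefix,
    so each tuning trial writes its model artifacts to a distinct location."""
    # SageMaker exposes the training configuration as JSON in SM_TRAINING_ENV,
    # including the job name referenced by train_env in sagemaker_train.py.
    train_env = json.loads(os.environ["SM_TRAINING_ENV"])
    job_name = train_env["job_name"]
    return model_artifact_s3.rstrip("/") + "/" + job_name
```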
Hello,
I was facing some issues with colliding model artifacts when using the GraphStorm training image in a TuningJob.
I was wondering whether the GraphStorm Docker images are in general intended for use in SageMaker TuningJobs. And if so, is there an example of how to orchestrate the model artifacts of each training job?
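For context, the setup was roughly the following (heavily simplified sketch; the image URI, role, metric definition, and the tuned hyperparameter are placeholders, not the exact values we used):

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# GraphStorm SageMaker training image, run through the standard PyTorch estimator.
estimator = PyTorch(
    entry_point="train_entry.py",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/graphstorm:sagemaker-gpu",  # placeholder
    role="<sagemaker-execution-role-arn>",  # placeholder
    instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    hyperparameters={
        # Every trial receives the same value here, which is why the artifacts
        # collide when the container uploads to this prefix directly.
        "model-artifact-s3": "s3://my-bucket/gs-models/",
    },
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation_score",  # placeholder metric
    hyperparameter_ranges={"lr": ContinuousParameter(1e-4, 1e-2)},  # illustrative range
    metric_definitions=[{"Name": "validation_score", "Regex": "val score: ([0-9.]+)"}],
    max_jobs=8,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/partitioned-graph/"})  # placeholder input channel
```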
When we constructed a `HyperparameterTuner` with a `PyTorch` estimator using the training images, the result was at first unexpected: all training jobs wrote their artifacts to `--model-artifact-s3`, without building up separate S3 prefixes for each training run, as I knew it from other training images.

After diving a bit into the code and the AWS documentation, I found that SageMaker moves everything stored under `/opt/ml/model` to the output prefix specified by the estimator object. In contrast to this common interface, the GraphStorm container implements its own logic controlled by `--model-artifact-s3`, which works perfectly fine for a single training job but results in model artifact collisions for a TuningJob.

We found a way to work around the issue by creating a custom training entry script, which basically extends the regular `train_entry.py` with a final copy of the training artifacts to `/opt/ml/model`. Having the model artifacts stored there on the container allowed the TuningJob to orchestrate all training artifacts successfully.

I was wondering whether the model storage in `graphstorm/sagemaker/sagemaker_train.py` under `/tmp/gsgnn_model/` is preferred over `/opt/ml/model` for a different reason (here)? If this is the case, would it be possible to define a new argument for the local storage directory or something similar?

In case this is something that you don't have on your agenda, but you see the purpose, I could also assist with a PR.
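For reference, a simplified sketch of the wrapper entry script we used (argument forwarding and paths may differ in detail; `/tmp/gsgnn_model/` is the local directory mentioned above):

```python
# custom_train_entry.py -- simplified sketch, not the exact script.
import shutil
import subprocess
import sys
from pathlib import Path

LOCAL_MODEL_DIR = Path("/tmp/gsgnn_model")  # where GraphStorm stores checkpoints locally
SM_MODEL_DIR = Path("/opt/ml/model")        # uploaded by SageMaker to the per-job output prefix


def main():
    # Run the regular GraphStorm entry point first, forwarding all CLI arguments.
    subprocess.run([sys.executable, "train_entry.py", *sys.argv[1:]], check=True)

    # Final copy: expose the artifacts through SageMaker's standard model directory,
    # so every tuning trial ends up under its own S3 output prefix.
    if LOCAL_MODEL_DIR.exists():
        shutil.copytree(LOCAL_MODEL_DIR, SM_MODEL_DIR, dirs_exist_ok=True)


if __name__ == "__main__":
    main()
```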