
graphstorm images utilisation in SageMaker TuningJobs #1072

Open
milianru opened this issue Oct 21, 2024 · 6 comments
Labels: help wanted (Extra attention is needed), sagemaker

@milianru

Hello,

I was facing some issues with colliding model artifacts when using the graphstorm training image in a TuningJob.

I was wondering whether the graphstorm Docker images are, in general, intended for use in SageMaker TuningJobs. And if so, is there an example of how to orchestrate the model artifacts of each training job?

When we constructed a HyperparameterTuner with a PyTorch estimator using the training images, the result was at first unexpected: all training jobs wrote their artifacts to --model-artifact-s3 without building up separate S3 prefixes for each training run, as I knew it from other training images.
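For context, roughly the kind of setup I mean (a minimal sketch; the image URI, role, metric definition, and the hyperparameter names and ranges below are illustrative placeholders, not our exact configuration):

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# PyTorch estimator wrapping the GraphStorm SageMaker training image.
estimator = PyTorch(
    entry_point="train_entry.py",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/graphstorm:sagemaker-gpu",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    hyperparameters={
        "graph-data-s3": "s3://my-bucket/graph-data/",
        "yaml-s3": "s3://my-bucket/train.yaml",
        # Every training job started by the tuner sees the same value here,
        # which is why the model artifacts collided.
        "model-artifact-s3": "s3://my-bucket/models/",
    },
)

# Tuner launching several training jobs from the estimator above.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation-accuracy",
    metric_definitions=[{"Name": "validation-accuracy",
                         "Regex": "Validation accuracy: ([0-9\\.]+)"}],
    hyperparameter_ranges={"lr": ContinuousParameter(1e-4, 1e-2)},
    max_jobs=8,
    max_parallel_jobs=2,
)
tuner.fit()
```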

After diving a bit into the code and the AWS documentation, I found that SageMaker moves everything stored under /opt/ml/model to the output prefix specified by the estimator object. In contrast to this common interface, the graphstorm container implements the logic controlled by --model-artifact-s3, which works perfectly fine for a single training job but results in model artifact collisions for a TuningJob.

We found a way to work around the issue by creating a custom training entry script, which basically extends the regular train_entry.py with a final copy of the training artifacts to /opt/ml/model. Having the model artifacts stored there on the container allowed the TuningJob to orchestrate all training artifacts successfully.
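Roughly, the custom entry script looks like this (a minimal sketch, assuming the trained model ends up under /tmp/gsgnn_model/ as in sagemaker_train.py; the way the regular entry point is invoked is simplified):

```python
# custom_train_entry.py -- sketch of the workaround described above.
import shutil
import subprocess
import sys

# Run the regular GraphStorm SageMaker training entry point first,
# forwarding all arguments that SageMaker passes in.
subprocess.run([sys.executable, "train_entry.py"] + sys.argv[1:], check=True)

# Copy the local model artifacts to /opt/ml/model so SageMaker uploads them
# to the per-training-job output prefix, which the TuningJob then tracks.
shutil.copytree("/tmp/gsgnn_model/", "/opt/ml/model/", dirs_exist_ok=True)
```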

I was wondering whether the model storage in graphstorm/sagemaker/sagemaker_train.py under /tmp/gsgnn_model/ is preferred over /opt/ml/model for a different reason (here)? If this is the case, would it be possible to define a new argument for the local storage directory or similar?

In case this is something you don't have on your agenda but you see the purpose of, I could also assist with a PR.

@classicsong added the help wanted (Extra attention is needed) label on Oct 23, 2024
@thvasilo
Contributor

Hi @milianru, I think this is actually a shortcoming of our SageMaker training output implementation.

If you're willing to contribute a fix I can shepherd the PR and try to get it merged for the next release.

@classicsong
Contributor

classicsong commented Oct 23, 2024

Hi @milianru, regarding the place where GraphStorm stores the model artifacts: we use /tmp/gsgnn_model/ because GraphStorm uploads the model to an S3 bucket (defined by --model-artifact-s3) by itself. GraphStorm does not rely on SageMaker to upload the model artifact (as sometimes the artifact can be very large).

You are free to adjust the training entry script if it works :).
Another way is to see whether HyperparameterTuner can automatically generate arguments for --model-artifact-s3.

@milianru
Author

Hi @thvasilo,
thanks for your response. I would have liked to contribute, but I am not allowed to participate in open source using my company's cloud resources. Even though the change would be small, I cannot execute an end-to-end test run.

@milianru
Author

Hi @classicsong,
The idea of modifying --model-artifact-s3 in the hyperparameter tuning was also something that I had in mind while looking for a good workaround for us.
However, I did not directly find documentation on how this can be achieved.
But maybe it is worth spending some time researching options (e.g. counting ParameterRanges up or similar).

@classicsong
Contributor

classicsong commented Nov 13, 2024

> Hi @thvasilo, thanks for your response. I would have liked to contribute, but I am not allowed to participate in open source using my company's cloud resources. Even though the change would be small, I cannot execute an end-to-end test run.

Hi @milianru,
Does your company have an AWS SA (Solutions Architect) to support your work? The SA may be able to help you find a SageMaker expert.
The SA can also help create a ticket for the GraphStorm team.

@classicsong
Contributor

> Hi @classicsong, the idea of modifying --model-artifact-s3 in the hyperparameter tuning was also something that I had in mind while looking for a good workaround for us. However, I did not directly find documentation on how this can be achieved. But maybe it is worth spending some time researching options (e.g. counting ParameterRanges up or similar).

Another workaround is to change sagemaker_train.py to customize the S3 path at
https://github.com/awslabs/graphstorm/blob/main/python/graphstorm/sagemaker/sagemaker_train.py#L232.
You can add the SageMaker job name to the S3 path.

The SageMaker job name can be found in the train_env (https://github.com/awslabs/graphstorm/blob/main/python/graphstorm/sagemaker/sagemaker_train.py#L178).
Here is the documentation of the training environment parameters:
https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_training_env. There is a field called "job_name".
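For example, something along these lines (a rough sketch, not the actual code in sagemaker_train.py; the helper name per_job_model_artifact_s3 and the way --model-artifact-s3 is passed in are just for illustration):

```python
import json
import os

def per_job_model_artifact_s3(model_artifact_s3: str) -> str:
    """Append the SageMaker training job name to the --model-artifact-s3 prefix,
    so that every training job in a TuningJob uploads to its own S3 location."""
    # SM_TRAINING_ENV is a JSON blob provided by the SageMaker training toolkit.
    train_env = json.loads(os.environ["SM_TRAINING_ENV"])
    job_name = train_env["job_name"]  # unique per training job inside a TuningJob
    return model_artifact_s3.rstrip("/") + "/" + job_name
```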
