"Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation" and have encountered an issue during training. #15

Open
jaxonXu98 opened this issue Jan 6, 2025 · 0 comments

  1. Memory Error

When running the following command:

python train_rlgames.py --task=BlockAssemblyOrient --num_envs=1024

I encountered a memory error, with the following traceback:

Traceback (most recent call last):
File "train_rlgames.py", line 102, in <module>
runner.run(vargs)
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
self.run_train(args)
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
agent.train()
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1162, in train
self.obs = self.env_reset()
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 470, in env_reset
obs = self.vec_env.reset()
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/vec_task_rlgames.py", line 183, in reset
self.task.step(actions)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/base_task.py", line 135, in step
self.pre_physics_step(actions)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1712, in pre_physics_step
self.reset_idx(env_ids, goal_env_ids)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1607, in reset_idx
self.post_reset(env_ids, hand_indices, object_indices, rand_floats)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1664, in post_reset
pos_err = self.segmentation_target_init_pos - self.rigid_body_states[:, self.hand_base_rigid_body_index, 0:3]
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I am currently using a single NVIDIA 4090 GPU. Could you please let me know how many GPUs (and which model) you used in your experiments? This will help me determine if the issue is related to hardware limitations.
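
A note on reading the traceback above: CUDA kernels are launched asynchronously, so the Python line where an illegal memory access is reported is not necessarily the operation that caused it. The minimal sketch below forces synchronous launches so that the traceback points closer to the failing call. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; placing this at the top of the training script is an assumption, not something taken from the repository.

```python
import os

# Must run before torch / isaacgym initialise CUDA, for example at the very top
# of the training script (this placement is an assumption).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# With synchronous launches each kernel finishes before Python continues, so an
# illegal memory access is raised at the call that triggered it. Expect slower runs.
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

This only changes where the error surfaces; it does not by itself say whether the cause is GPU memory size or an indexing problem.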

When I reduce num_envs to 64 and run the following command:

python train_rlgames.py --task=BlockAssemblyOrient --num_envs=64

I encounter another issue, with the following traceback:

Traceback (most recent call last):
File "train_rlgames.py", line 102, in <module>
runner.run(vargs)
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
self.run_train(args)
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
agent.train()
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1173, in train
step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1037, in train_epoch
batch_dict = self.play_steps()
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 636, in play_steps
self.obs, rewards, self.dones, infos = self.env_step(res_dict['actions'])
File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 458, in env_step
obs, rewards, dones, infos = self.vec_env.step(actions)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/vec_task_rlgames.py", line 168, in step
self.task.step(actions_tensor)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/base_task.py", line 135, in step
self.pre_physics_step(actions)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1712, in pre_physics_step
self.reset_idx(env_ids, goal_env_ids)
File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1467, in reset_idx
self.saved_searching_ternimal_state = self.root_state_tensor.clone()[self.lego_indices.view(-1), :].view(self.num_envs, 108, 13)
RuntimeError: shape '[64, 108, 13]' is invalid for input of size 109824

This error seems related to a shape mismatch after reducing num_envs to 64.
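
The numbers in the error message can be checked by hand. The small sketch below assumes the trailing dimension of 13 is correct, since it matches the last dimension requested in the view call. Under that assumption, an input of 109824 elements across 64 environments corresponds to 132 rows per environment rather than the 108 that the view expects. Whether that 132 reflects the size of lego_indices is an assumption to verify in the task code.

```python
# Values copied from the error message and the failing view() call.
input_size = 109824     # elements actually selected by the indexing
num_envs = 64           # --num_envs passed on the command line
expected_rows = 108     # hard-coded in .view(self.num_envs, 108, 13)
state_dim = 13          # last dimension requested by the view

# Size the view() call would need, and rows per environment actually present.
expected_size = num_envs * expected_rows * state_dim
rows_per_env, remainder = divmod(input_size, num_envs * state_dim)

print(expected_size)            # 89856
print(rows_per_env, remainder)  # 132 0
```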

  2. PyTorch Version

Could you also confirm which version of PyTorch you used for this project? I want to make sure I am using a matching version to avoid compatibility issues.
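
For comparing environments, a short sketch that prints the local PyTorch and CUDA details may help. It relies only on standard torch attributes; the helper name describe_torch is hypothetical, and it reports torch as None if PyTorch cannot be imported.

```python
def describe_torch():
    """Collect PyTorch / CUDA details that are useful to include in a bug report."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # PyTorch is not installed in this environment

    info = {
        "torch": torch.__version__,
        # CUDA version the installed torch was built against (None for CPU builds).
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_torch().items():
    print(f"{key}: {value}")
```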

  3. Program Stopping After Running main_rlgames("BlockAssemblySearch", 128)

When I run the following command:

python scripts/bi-optimization.py --task=BlockAssembly

The program executes only the first line:

search_policy_path = main_rlgames("BlockAssemblySearch", 128)

However, the subsequent lines do not run:

orient_policy_path = main_rlgames("BlockAssemblyOrient", 512)
grasp_sim_policy_path = main_rlgames("BlockAssemblyGraspSim", 512)
insert_sim_policy_path = main_rlgames("BlockAssemblyInsertSim", 512)
main_rlgames("BlockAssemblyInsertSim", 512, use_t_value=True, policy_path=insert_sim_policy_path)
transition_value_trainer("BlockAssemblyInsertSim", rollout=10000)
main_rlgames("BlockAssemblyGraspSim", 512, use_t_value=True, policy_path=grasp_sim_policy_path)
transition_value_trainer("BlockAssemblyGraspSim", rollout=10000)
main_rlgames("BlockAssemblyOrient", 128, use_t_value=True, policy_path=orient_policy_path)
transition_value_trainer("BlockAssemblyOrient", rollout=10000)

If I comment out the line search_policy_path = main_rlgames("BlockAssemblySearch", 128) after it has run, so that the script starts from orient_policy_path = main_rlgames("BlockAssemblyOrient", 512), I still encounter a memory error.
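
One way to narrow this down is to run each stage in its own Python process, so that GPU state left over from one stage cannot affect the next. The sketch below is only that: it assumes each stage can be launched through train_rlgames.py with the --task and --num_envs flags shown earlier, which may not match what main_rlgames does internally. The helper names build_stage_command and run_stages are hypothetical.

```python
import subprocess
import sys

# (task, num_envs) pairs taken from the first four calls in scripts/bi-optimization.py.
STAGES = [
    ("BlockAssemblySearch", 128),
    ("BlockAssemblyOrient", 512),
    ("BlockAssemblyGraspSim", 512),
    ("BlockAssemblyInsertSim", 512),
]


def build_stage_command(task, num_envs):
    """Build the command line for one stage (hypothetical helper)."""
    return [sys.executable, "train_rlgames.py", f"--task={task}", f"--num_envs={num_envs}"]


def run_stages(stages, dry_run=True):
    """Run each stage in a fresh process; with dry_run=True only print the commands."""
    for task, num_envs in stages:
        cmd = build_stage_command(task, num_envs)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the chain as soon as one stage exits with an error.
            subprocess.run(cmd, check=True)


run_stages(STAGES)  # dry run: prints the four commands without launching them
```

Calling run_stages(STAGES, dry_run=False) would launch the stages one after another; if a stage fails in isolation, the problem is in that stage rather than in the chaining.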
