TypeError: Object of type Tensor is not JSON serializable for finetune_hf by Lora #937
Closed
BourneKing started this conversation in General
Replies: 2 comments
-
Hoping an expert can advise, thank you very much!!!
-
LoRA fine-tuning fails on a single NVIDIA GeForce RTX 3090
/ChatGLM3-main/finetune_demo$ nvidia-smi
Thu Mar 7 16:10:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 On | N/A |
| 0% 49C P8 39W / 350W | 626MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:04:00.0 Off | N/A |
| 0% 47C P8 23W / 350W | 10MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
Samples from train.json after conversion:
{"conversations": [{"role": "user", "content": "类型#裙版型#宽松版型#显瘦颜色#黑色图案#撞色裙型#直筒裙裙款式#拼接"}, {"role": "assistant", "content": "采用简洁大体的黑色格调,宽松舒适的裙子内里,配上落肩的袖子拼接,不惧夏日的炎热,穿出清凉舒适。用时尚的英文字母,加上撞色的红白搭配,呈现大气时尚的款式。直筒的裙子轮廓,前短后长的长度裁剪,上身拉长宝宝的身体比例,挑高显瘦。"}]}
{"conversations": [{"role": "user", "content": "类型#上衣颜色#黑色颜色#紫色风格#性感图案#字母图案#文字图案#线条图案#刺绣衣样式#卫衣衣长#短款衣袖型#落肩袖衣款式#连帽"}, {"role": "assistant", "content": "卫衣的短款长度设计能够适当地露出腰线,打造出纤瘦的身材十分性感。衣身的字母刺绣图案有着小巧的样式,黑色的绣线在紫色的衣身上显得很出挑显眼。落肩袖的设计柔化了肩部的线条衬托得人很温柔可爱。紫色的颜色彰显出优雅的气质也不失年轻活力感。连帽的设计让卫衣更加丰满造型感很足,长长的帽绳直到腰际处,有着延长衣身的效果显得身材。"}]}
{"conversations": [{"role": "user", "content": "类型#上衣颜色#黑白风格#简约风格#休闲图案#条纹衣样式#风衣*衣样式#外套"}, {"role": "assistant", "content": "设计师以条纹作为风衣外套的主要设计元素,以简约点缀了外套,带来大气休闲的视觉效果。因为采用的是黑白的经典色,所以有着颇为出色的耐看性与百搭性,可以帮助我们更好的驾驭日常的穿着,而且不容易让人觉得它过时。"}]}
LoRA fine-tuning command, run on Ubuntu 20.04:
CUDA_VISIBLE_DEVICES=1 /home/yons/miniconda3/envs/apple/bin/python finetune_hf.py data/AdvertiseGen_fix /home/yons/llms/models/chatglm3-6b configs/lora.yaml no
It then fails with:
TypeError: Object of type Tensor is not JSON serializable
I have been stuck on this for two days; any guidance would be greatly appreciated, thank you!!!
The full run output and error message are below:
~/llms/ChatGLM3-main/finetune_demo$ CUDA_VISIBLE_DEVICES=1 /home/yons/miniconda3/envs/apple/bin/python finetune_hf.py data/AdvertiseGen_fix /home/yons/llms/models/chatglm3-6b configs/lora.yaml no
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 7/7 [00:03<00:00, 1.91it/s]
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model
--> model has 1.949696M params
train_dataset: Dataset({
features: ['input_ids', 'labels'],
num_rows: 114599
})
val_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
test_dataset: Dataset({
features: ['input_ids', 'output_ids'],
num_rows: 1070
})
max_steps is given, it will override any value given in num_train_epochs
[2024-03-07 15:58:43,521] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-07 15:58:43,642] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.13.4, git-hash=unknown, git-branch=unknown
[2024-03-07 15:58:43,642] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-03-07 15:58:43,642] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-03-07 15:58:44,171] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.2.136, master_port=29500
[2024-03-07 15:58:44,171] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-07 15:58:59,824] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-03-07 15:58:59,824] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-03-07 15:58:59,825] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-03-07 15:58:59,826] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-03-07 15:58:59,826] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-03-07 15:58:59,826] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-03-07 15:58:59,826] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-03-07 15:58:59,919] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states
[2024-03-07 15:58:59,919] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.67 GB CA 11.68 GB Max_CA 12 GB
[2024-03-07 15:58:59,919] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:58:59,980] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states
[2024-03-07 15:58:59,981] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.68 GB CA 11.7 GB Max_CA 12 GB
[2024-03-07 15:58:59,981] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:58:59,981] [INFO] [stage_1_and_2.py:539:__init__] optimizer state initialized
[2024-03-07 15:59:00,035] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer
[2024-03-07 15:59:00,035] [INFO] [utils.py:801:see_memory_usage] MA 11.67 GB Max_MA 11.67 GB CA 11.7 GB Max_CA 12 GB
[2024-03-07 15:59:00,035] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 7.78 GB, percent = 6.2%
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-03-07 15:59:00,036] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[(0.9, 0.999)]
[2024-03-07 15:59:00,036] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] amp_enabled .................. False
[2024-03-07 15:59:00,036] [INFO] [config.py:1000:print] amp_params ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] bfloat16_enabled ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f4ac821c7c0>
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] communication_data_type ...... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dataloader_drop_last ......... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] disable_allgather ............ False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dump_state ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] elasticity_enabled ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_auto_cast ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_enabled ................. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] global_rank .................. 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] grad_accum_dtype ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] graph_harvesting ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 65536
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] load_universal_checkpoint .... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] loss_scale ................... 0
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] memory_breakdown ............. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] mics_shard_size .............. -1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_name ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] optimizer_params ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pld_enabled .................. False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] pld_params ................... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] prescale_gradients ........... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] scheduler_name ............... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] scheduler_params ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] sparse_attention ............. None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] steps_per_print .............. inf
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] train_batch_size ............. 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] use_data_before_expert_parallel_ False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] use_node_local_storage ....... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] wall_clock_breakdown ......... False
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] weight_quantization_config ... None
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] world_size ................... 1
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_allow_untested_optimizer True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_enabled ................. True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True
[2024-03-07 15:59:00,037] [INFO] [config.py:1000:print] zero_optimization_stage ...... 2
[2024-03-07 15:59:00,037] [INFO] [config.py:986:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": false
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5.000000e+08,
"contiguous_gradients": true
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 1,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false,
"zero_allow_untested_optimizer": true
}
***** Running training *****
Num examples = 114,599
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 1
Gradient Accumulation steps = 1
Total optimization steps = 3,000
Number of trainable parameters = 1,949,696
{'loss': 4.4402, 'grad_norm': tensor(4.1844, device='cuda:0'), 'learning_rate': 4.9833333333333336e-05, 'epoch': 0.0}
{'loss': 4.9043, 'grad_norm': tensor(3.6091, device='cuda:0'), 'learning_rate': 4.966666666666667e-05, 'epoch': 0.0}
{'loss': 4.6645, 'grad_norm': tensor(4.6015, device='cuda:0'), 'learning_rate': 4.9500000000000004e-05, 'epoch': 0.0}
{'loss': 4.7369, 'grad_norm': tensor(5.0354, device='cuda:0'), 'learning_rate': 4.933333333333334e-05, 'epoch': 0.0}
{'loss': 4.267, 'grad_norm': tensor(4.8521, device='cuda:0'), 'learning_rate': 4.9166666666666665e-05, 'epoch': 0.0}
{'loss': 4.1836, 'grad_norm': tensor(6.1400, device='cuda:0'), 'learning_rate': 4.9e-05, 'epoch': 0.0}
{'loss': 3.7965, 'grad_norm': tensor(6.3060, device='cuda:0'), 'learning_rate': 4.883333333333334e-05, 'epoch': 0.0}
{'loss': 3.7979, 'grad_norm': tensor(6.0029, device='cuda:0'), 'learning_rate': 4.866666666666667e-05, 'epoch': 0.0}
{'loss': 3.51, 'grad_norm': tensor(5.0960, device='cuda:0'), 'learning_rate': 4.85e-05, 'epoch': 0.0}
{'loss': 3.9912, 'grad_norm': tensor(5.5196, device='cuda:0'), 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.0}
{'loss': 3.7437, 'grad_norm': tensor(5.1386, device='cuda:0'), 'learning_rate': 4.8166666666666674e-05, 'epoch': 0.0}
{'loss': 4.0131, 'grad_norm': tensor(5.2996, device='cuda:0'), 'learning_rate': 4.8e-05, 'epoch': 0.0}
{'loss': 3.8314, 'grad_norm': tensor(5.4587, device='cuda:0'), 'learning_rate': 4.7833333333333335e-05, 'epoch': 0.0}
{'loss': 3.6627, 'grad_norm': tensor(7.8379, device='cuda:0'), 'learning_rate': 4.766666666666667e-05, 'epoch': 0.0}
{'loss': 3.4525, 'grad_norm': tensor(6.6903, device='cuda:0'), 'learning_rate': 4.75e-05, 'epoch': 0.0}
{'loss': 3.6549, 'grad_norm': tensor(6.1589, device='cuda:0'), 'learning_rate': 4.7333333333333336e-05, 'epoch': 0.0}
{'loss': 3.6318, 'grad_norm': tensor(7.3151, device='cuda:0'), 'learning_rate': 4.716666666666667e-05, 'epoch': 0.0}
{'loss': 3.9439, 'grad_norm': tensor(6.4505, device='cuda:0'), 'learning_rate': 4.7e-05, 'epoch': 0.0}
{'loss': 3.7131, 'grad_norm': tensor(6.2490, device='cuda:0'), 'learning_rate': 4.683333333333334e-05, 'epoch': 0.0}
{'loss': 3.6848, 'grad_norm': tensor(6.6936, device='cuda:0'), 'learning_rate': 4.666666666666667e-05, 'epoch': 0.0}
{'loss': 3.3516, 'grad_norm': tensor(6.8842, device='cuda:0'), 'learning_rate': 4.6500000000000005e-05, 'epoch': 0.0}
{'loss': 3.7281, 'grad_norm': tensor(7.0181, device='cuda:0'), 'learning_rate': 4.633333333333333e-05, 'epoch': 0.0}
{'loss': 3.5209, 'grad_norm': tensor(6.9573, device='cuda:0'), 'learning_rate': 4.6166666666666666e-05, 'epoch': 0.0}
{'loss': 3.7479, 'grad_norm': tensor(7.1273, device='cuda:0'), 'learning_rate': 4.600000000000001e-05, 'epoch': 0.0}
{'loss': 3.5268, 'grad_norm': tensor(6.9141, device='cuda:0'), 'learning_rate': 4.5833333333333334e-05, 'epoch': 0.0}
{'loss': 3.5688, 'grad_norm': tensor(9.0669, device='cuda:0'), 'learning_rate': 4.566666666666667e-05, 'epoch': 0.0}
{'loss': 3.5719, 'grad_norm': tensor(8.0154, device='cuda:0'), 'learning_rate': 4.55e-05, 'epoch': 0.0}
{'loss': 3.5658, 'grad_norm': tensor(8.4898, device='cuda:0'), 'learning_rate': 4.5333333333333335e-05, 'epoch': 0.0}
{'loss': 3.54, 'grad_norm': tensor(7.5420, device='cuda:0'), 'learning_rate': 4.516666666666667e-05, 'epoch': 0.0}
{'loss': 3.6279, 'grad_norm': tensor(7.9869, device='cuda:0'), 'learning_rate': 4.5e-05, 'epoch': 0.0}
{'loss': 3.6281, 'grad_norm': tensor(7.8166, device='cuda:0'), 'learning_rate': 4.483333333333333e-05, 'epoch': 0.0}
{'loss': 3.4217, 'grad_norm': tensor(7.2795, device='cuda:0'), 'learning_rate': 4.466666666666667e-05, 'epoch': 0.0}
{'loss': 3.3732, 'grad_norm': tensor(8.6140, device='cuda:0'), 'learning_rate': 4.4500000000000004e-05, 'epoch': 0.0}
{'loss': 3.5418, 'grad_norm': tensor(10.2145, device='cuda:0'), 'learning_rate': 4.433333333333334e-05, 'epoch': 0.0}
{'loss': 3.4326, 'grad_norm': tensor(7.4907, device='cuda:0'), 'learning_rate': 4.4166666666666665e-05, 'epoch': 0.0}
{'loss': 3.4521, 'grad_norm': tensor(7.3476, device='cuda:0'), 'learning_rate': 4.4000000000000006e-05, 'epoch': 0.0}
{'loss': 3.5385, 'grad_norm': tensor(7.8114, device='cuda:0'), 'learning_rate': 4.383333333333334e-05, 'epoch': 0.0}
{'loss': 3.5908, 'grad_norm': tensor(9.1731, device='cuda:0'), 'learning_rate': 4.3666666666666666e-05, 'epoch': 0.0}
{'loss': 3.5932, 'grad_norm': tensor(7.8424, device='cuda:0'), 'learning_rate': 4.35e-05, 'epoch': 0.0}
{'loss': 3.3729, 'grad_norm': tensor(7.9566, device='cuda:0'), 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.0}
{'loss': 3.2678, 'grad_norm': tensor(8.0636, device='cuda:0'), 'learning_rate': 4.316666666666667e-05, 'epoch': 0.0}
{'loss': 3.6715, 'grad_norm': tensor(11.0745, device='cuda:0'), 'learning_rate': 4.3e-05, 'epoch': 0.0}
{'loss': 3.5162, 'grad_norm': tensor(9.7909, device='cuda:0'), 'learning_rate': 4.2833333333333335e-05, 'epoch': 0.0}
{'loss': 3.2836, 'grad_norm': tensor(8.5034, device='cuda:0'), 'learning_rate': 4.266666666666667e-05, 'epoch': 0.0}
{'loss': 3.7068, 'grad_norm': tensor(9.4987, device='cuda:0'), 'learning_rate': 4.25e-05, 'epoch': 0.0}
{'loss': 3.7994, 'grad_norm': tensor(9.6808, device='cuda:0'), 'learning_rate': 4.233333333333334e-05, 'epoch': 0.0}
{'loss': 3.8311, 'grad_norm': tensor(13.3884, device='cuda:0'), 'learning_rate': 4.216666666666667e-05, 'epoch': 0.0}
{'loss': 3.2943, 'grad_norm': tensor(10.4600, device='cuda:0'), 'learning_rate': 4.2e-05, 'epoch': 0.0}
{'loss': 3.3723, 'grad_norm': tensor(7.9136, device='cuda:0'), 'learning_rate': 4.183333333333334e-05, 'epoch': 0.0}
{'loss': 3.4182, 'grad_norm': tensor(11.4667, device='cuda:0'), 'learning_rate': 4.166666666666667e-05, 'epoch': 0.0}
17%|█████████████▋ | 500/3000 [00:45<03:55, 10.62it/s]
***** Running Evaluation *****
Num examples = 50
Batch size = 16
17%|█████████████▋ | 500/3000 [00:58<03:55, 10.62it/s]
Building prefix dict from the default dictionary ...
█████████████████████████████████████████| 4/4 [00:11<00:00, 2.89s/it]
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.260 seconds.
Prefix dict has been built successfully.
{'eval_rouge-1': 30.055642000000002, 'eval_rouge-2': 6.419442, 'eval_rouge-l': 24.02483, 'eval_bleu-4': 0.030992641634256496, 'eval_runtime': 19.3324, 'eval_samples_per_second': 2.586, 'eval_steps_per_second': 0.207, 'epoch': 0.0}
17%|█████████████▋ | 500/3000 [01:05<03:55, 10.62it/s]
Saving model checkpoint to ./output/tmp-checkpoint-500
tokenizer config file saved in ./output/tmp-checkpoint-500/tokenizer_config.json
Special tokens file saved in ./output/tmp-checkpoint-500/special_tokens_map.json
[2024-03-07 16:00:20,630] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is about to be saved!
/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[2024-03-07 16:00:20,634] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt
[2024-03-07 16:00:20,634] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt...
[2024-03-07 16:00:20,659] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/tmp-checkpoint-500/global_step500/mp_rank_00_model_states.pt.
[2024-03-07 16:00:20,659] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-07 16:00:20,704] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-07 16:00:20,705] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved ./output/tmp-checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-07 16:00:20,705] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step500 is ready now!
Traceback (most recent call last):
File "/home/yons/llms/ChatGLM3-main/finetune_demo/finetune_hf.py", line 587, in
app()
File "/home/yons/llms/ChatGLM3-main/finetune_demo/finetune_hf.py", line 543, in main
trainer.train()
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer.py", line 2525, in _save_checkpoint
self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
File "/home/yons/miniconda3/envs/apple/lib/python3.10/site-packages/transformers/trainer_callback.py", line 113, in save_to_json
json_string = json.dumps(dataclasses.asdict(self), indent=2, sort_keys=True) + "\n"
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/init.py", line 238, in dumps
**kw).encode(obj)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 201, in encode
chunks = list(chunks)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 431, in _iterencode
yield from _iterencode_dict(o, _current_indent_level)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 325, in _iterencode_list
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 405, in _iterencode_dict
yield from chunks
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 438, in _iterencode
o = _default(o)
File "/home/yons/miniconda3/envs/apple/lib/python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Tensor is not JSON serializable
17%|█████████████▋ | 500/3000 [01:20<06:43, 6.19it/s]
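
For context on where the serialization breaks: the grad_norm values in the training log above are torch.Tensor objects, and the traceback ends in TrainerState.save_to_json, which passes the whole trainer state through json.dumps. Below is a minimal sketch, not the project's official fix, that reproduces the error and shows one possible local workaround: casting tensors to plain Python numbers before anything is written to JSON (the to_serializable helper is illustrative, not part of finetune_hf.py).

```python
import json

import torch

# Reproduce the failure: json.dumps cannot encode a torch.Tensor,
# which is exactly what the grad_norm entries in the log contain.
log_entry = {"loss": 4.4402, "grad_norm": torch.tensor(4.1844), "epoch": 0.0}
try:
    json.dumps(log_entry)
except TypeError as err:
    print(err)  # Object of type Tensor is not JSON serializable

def to_serializable(obj):
    """Recursively replace tensors with plain floats/lists so json.dumps succeeds."""
    if isinstance(obj, torch.Tensor):
        return obj.item() if obj.numel() == 1 else obj.tolist()
    if isinstance(obj, dict):
        return {k: to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_serializable(v) for v in obj]
    return obj

# Now the same log entry serializes cleanly.
print(json.dumps(to_serializable(log_entry)))
```

Upgrading transformers may also make the error go away, since more recent releases log grad_norm as a plain float rather than a tensor; that is worth checking against the version installed in the apple environment.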