Can the transformers version in the dependencies be changed? I suspect the error below is a dependency problem #524

Open
3 tasks done
baiyi-os opened this issue Dec 26, 2024 · 0 comments
Labels
environment related to third-party dependency, DJ-pypi, DJ-docker, etc. question Further information is requested

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code from the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

Traceback (most recent call last):
  File "/data/xinqiuqing/data-juicer/data_juicer/ops/base_op.py", line 60, in wrapper
    return method(samples, *args, **kwargs)
  File "/data/xinqiuqing/data-juicer/data_juicer/ops/mapper/generate_qa_from_text_mapper.py", line 113, in process_batched
    model, _ = get_model(self.model_key, rank, self.use_cuda())
  File "/data/xinqiuqing/data-juicer/data_juicer/utils/model_utils.py", line 807, in get_model
    MODEL_ZOO[model_key] = model_key(device=device)
  File "/data/xinqiuqing/data-juicer/data_juicer/utils/model_utils.py", line 377, in prepare_huggingface_model
    pipe = transformers.pipeline(task=pipe_task,
  File "/root/miniconda3/envs/xinqiuqing/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 1178, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/root/miniconda3/envs/xinqiuqing/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 106, in __init__
    if self.prefix is not None:
AttributeError: 'TextGenerationPipeline' object has no attribute 'prefix'
generate_qa_from_text_mapper_process: 100%|##########| 129/129 [01:34<00:00, 1.36 examples/s]
2024-12-26 16:54:19 | INFO | data_juicer.core.data:226 - [7/7] OP [generate_qa_from_text_mapper] Done in 95.495s. Left 0 samples.
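
Since the question is whether the transformers version is to blame, it may help to record which version is actually installed in the environment that produced the traceback. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of Data-Juicer:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not present in the current environment.
        return None


# Print the version seen by the interpreter that runs this script.
print("transformers:", installed_version("transformers") or "not installed")
```

Running this inside the same conda environment (`xinqiuqing` in the paths above) gives a version number that can be attached to the report.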

Additional

This is my YAML configuration file:

project_name: 'test01'
dataset_path: 'data/test/'
export_path: 'outputs/test/test.jsonl'
export_shard_size: 0
export_in_parallel: false
np: 4
suffixes: ['.txt']

process:
  # Split the text into chunks
  - text_chunk_mapper:
      max_len: 1000
      split_pattern: '\n\n'
      overlap_len: 200
      tokenizer: 'qwen2.5-72b-instruct'
      trust_remote_code: True
  # Remove links, e.g. those starting with http or ftp
  - clean_links_mapper:
  # Remove HTML tags and return the plain text of all nodes
  - clean_html_mapper:
  # Remove the bibliography of TeX documents
  - remove_bibliography_mapper:
  # Remove TeX document headers, e.g. titles and section numbers/names
  - remove_header_mapper:
      drop_no_head: true
  # Remove repeated sentences within a sample
  - remove_repeat_sentences_mapper:
      lowercase: false
      ignore_special_character: true
      min_repeat_sentence_length: 2
  # Generate question-answer pairs from the text
  - generate_qa_from_text_mapper:
      hf_model: 'impira/layoutlm-document-qa'
      output_pattern: null
      enable_vllm: false
      model_params: {}
      sampling_params: {}
      mem_required: '10GB'
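
To separate a dependency problem from a model/task mismatch, the failing call could be tried outside Data-Juicer. The sketch below is only an assumption-laden reproduction aid: the traceback shows a `TextGenerationPipeline` being built, so the task is assumed to be `"text-generation"`, while `impira/layoutlm-document-qa` is, judging by its name, a document question answering model. The helper names `try_build_pipeline` and `reproduce` are hypothetical.

```python
def try_build_pipeline(factory, task, model):
    """Call `factory(task=..., model=...)` and return a short error string, or None on success."""
    try:
        factory(task=task, model=model)
    except Exception as exc:  # broad on purpose: we only want to report what failed
        return f"{type(exc).__name__}: {exc}"
    return None


def reproduce():
    """Try the model from the config with the assumed task; needs transformers and network access."""
    import transformers

    error = try_build_pipeline(
        transformers.pipeline,
        task="text-generation",  # assumption inferred from the traceback
        model="impira/layoutlm-document-qa",
    )
    print(error or "pipeline built without error")
```

Calling `reproduce()` by hand in the affected environment shows whether the error appears without any Data-Juicer code involved. If it does, trying a text-generation model in `hf_model` may be worth doing before changing the transformers version.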
@baiyi-os baiyi-os added the question Further information is requested label Dec 26, 2024
@yxdyc yxdyc added the environment related to third-party dependency, DJ-pypi, DJ-docker, etc. label Dec 26, 2024