Daliy-Dialogue

A daily dialogue context generator trained with Bloom and GPT

Chinese introduction (中文简介)

Brief introduction

DailyDialog is a high-quality multi-turn dialog dataset. This project provides several dialogue context generators trained with Bloom and GPT, in English and Chinese.

HuggingFace demonstration

Model demonstration

| Name | HuggingFace Model link | HuggingFace Space link | Language |
|---|---|---|---|
| Bloom English Daliy Dialogue Generator 🦅🌸 | https://huggingface.co/svjack/bloom-daliy-dialogue-english | https://huggingface.co/spaces/svjack/bloom-daliy-dialogue-english | English |
| Bloom Chinese Daliy Dialogue Generator 🐰🌸 | https://huggingface.co/svjack/bloom-daliy-dialogue | https://huggingface.co/spaces/svjack/bloom-daliy-dialogue-chinese | Chinese |
| GPT Chinese Daliy Dialogue Generator 🐰 | https://huggingface.co/svjack/gpt-daliy-dialogue | https://huggingface.co/spaces/svjack/gpt-daliy-dialogue-chinese | Chinese |
| Bloom Chinese Dialogue Generator 🐰🌸 | https://huggingface.co/svjack/bloom-dialogue | https://huggingface.co/spaces/svjack/bloom-dialogue-chinese | Chinese |
| GPT Chinese Dialogue Generator 🐰 | https://huggingface.co/svjack/gpt-dialogue | https://huggingface.co/spaces/svjack/gpt-dialogue-chinese | Chinese |

Datasets generated by the above models

| Name | HuggingFace Dataset link | HuggingFace Space link | Language |
|---|---|---|---|
| English Daliy Dialogue model generated samples 🦅🌸 | https://huggingface.co/datasets/svjack/bloom-dialogue-generate-ds-en | https://huggingface.co/spaces/svjack/bloom-dialogue-english-sample-search | English |
| Chinese Dialogue model generated samples 🐰🌸 | https://huggingface.co/datasets/svjack/bloom-dialogue-generate-ds-zh | https://huggingface.co/spaces/svjack/bloom-gpt-dialogue-chinese-sample-search | Chinese |

Installation and Instructions

Refer to HuggingFace Model cards.

Installation

pip install -r requirements.txt

Instructions

Tip: try decreasing max_length; the output may then be more closely related to the text you provide.

  • 1 Bloom English Daliy Dialogue Generator 🦅🌸:
from predict import *
from transformers import BloomTokenizerFast, BloomForCausalLM

model_path = "svjack/bloom-daliy-dialogue-english"
tokenizer = BloomTokenizerFast.from_pretrained(model_path)
model = BloomForCausalLM.from_pretrained(model_path)

obj = Obj(model, tokenizer)
obj.predict("This dog is fierce,", max_length=128)[0].split("\n-----\n")

will output:

['This dog is fierce, nowadays. ',
  " What's his name? ",
  ' His name is Bingo. ',
  ' What kind of dog is he? ',
  " We're not sure because the neighbour gave him to us after they moved away from here. ",
  " Well, he sure likes to chew my father's shoes when he likes to scratch the couch. ",
  ' Is he well behaved? ',
  ' Yeah, he likes to scratch the couch but he likes to scratch the couch regularly. ',
  " He likes to scratch the couch, doesn't he? ",
  ' Yes, he likes to scratch the couch']
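As the `split("\n-----\n")` call above shows, the generated string separates dialogue turns with a `\n-----\n` delimiter. The following is a minimal sketch of a helper that cleans up the turns and labels them with alternating speakers. The function name `to_turns` and the two-speaker alternation are assumptions made for illustration; they are not part of `predict`, and the model output itself carries no speaker labels.

```python
SEPARATOR = "\n-----\n"  # turn delimiter used in the split() call above


def to_turns(raw, speakers=("A", "B")):
    """Split raw generator output into (speaker, utterance) pairs.

    Assumes turns simply alternate between the given speakers, which is
    an illustrative assumption rather than something the model guarantees.
    """
    utterances = [part.strip() for part in raw.split(SEPARATOR)]
    utterances = [u for u in utterances if u]  # drop empty fragments
    return [(speakers[i % len(speakers)], u) for i, u in enumerate(utterances)]


raw = "This dog is fierce, nowadays. \n-----\n What's his name? \n-----\n His name is Bingo. "
for speaker, utterance in to_turns(raw):
    print(f"{speaker}: {utterance}")
```

Stripping each turn removes the leading and trailing spaces visible in the raw output list above.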

  • 2 Bloom Chinese Daliy Dialogue Generator 🐰🌸:
from predict import *
from transformers import BloomTokenizerFast, BloomForCausalLM

model_path = "svjack/bloom-daliy-dialogue"
tokenizer = BloomTokenizerFast.from_pretrained(model_path)
model = BloomForCausalLM.from_pretrained(model_path)

obj = Obj(model, tokenizer)
obj.predict("这只狗很凶,", max_length=128)[0].split("\n-----\n")

will output:

['这只狗很凶, 它是邻居的宠物。',
 '那是一只小狗。你知道,它不是一个很好的宠物。但正是因为它的行动。',
 '这倒是真的。我在它们还是小狗的时候就养了它们。我每个月只能负担300磅左右的费用。这就像一个大狗,就像电视上的专业厨师一样!',
 '绝对的 据说他们每天要供应3000只狗呢',
 '不可思议啊 对了,盘子里的这些东西是什么?',
 '哦,我的盘子,镂空的芝麻包,以及黄油和面粉。我将告诉你如何做一个搅拌']

  • 3 GPT Chinese Daliy Dialogue Generator 🐰:
import re

from predict import *
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

model_path = "svjack/gpt-daliy-dialogue"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)

obj = Obj(model, tokenizer)
text = obj.predict("这只狗很凶,", max_length=128)[0]
# Split on sentence-ending punctuation, keeping each delimiter as its own piece,
# then rejoin every (sentence, delimiter) pair and drop the spaces between tokens.
pieces = re.split(r"([。.??])", text)
["".join(pair).replace(" ", "") for pair in batch_as_list(pieces, 2)]

will output:

['这只狗很凶,你怎么知道的?',
 '我当然知道了,因为它是如此的凶。',
 '那是什么样的狗?',
 '他有一只小白猫哦,是吗?',
 '他还没有被抓到捕捕令呢我不相信他会抓到捕捕令他肯定在第二天早些时候就被抓到了好吧,我可以告诉他,但他必须在第二天早上给你打电话谢谢你对他的']
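The post-processing step above relies on `re.split` with a capturing group, which keeps each delimiter as its own element so that it can be re-attached to the sentence before it. Below is a minimal, self-contained sketch of that idea. The `batch_as_list` defined here is a hypothetical stand-in (assumed to group consecutive items into fixed-size chunks), not the implementation shipped in `predict`, and the final empty-string filter is an addition for illustration.

```python
import re


def batch_as_list(items, size):
    # Hypothetical stand-in for the helper of the same name used above:
    # groups consecutive items into lists of `size` (the last may be shorter).
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_sentences(text):
    # The capturing group makes re.split return the delimiters too, so the
    # pieces alternate: sentence, delimiter, sentence, delimiter, remainder.
    pieces = re.split(r"([。.??])", text)
    joined = ["".join(pair).replace(" ", "") for pair in batch_as_list(pieces, 2)]
    return [s for s in joined if s]  # drop the empty tail left by a final delimiter


print(split_sentences("那 是 什么 样 的 狗 ? 他 很 凶 。 没有 标点 的 长 句子"))
```

Text that contains no sentence-ending punctuation stays in a single piece, which is how a long unsegmented item can appear at the end of the list.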

The last item in the GPT output above is too long because it is not well segmented.
With the help of the Context Reconstructor from svjack/GLM-Open-Dialogue, we can try to fix this problem.

from reconstructor import *

y = [
    '这只狗很凶,你怎么知道的?',
    '我当然知道了,因为它是如此的凶。',
    '那是什么样的狗?',
    '他有一只小白猫哦,是吗?',
    '他还没有被抓到捕捕令呢我不相信他会抓到捕捕令他肯定在第二天早些时候就被抓到了好吧,我可以告诉他,但他必须在第二天早上给你打电话谢谢你对他的',
]
predict_split(y)

will output:

['这只狗很凶,你怎么知道的?',
 '我当然知道了,因为它是如此的凶。',
 '那是什么样的狗?',
 '他有一只小白猫',
 '哦,是吗?',
 '他还没有被捕捉到捕捕令呢',
 '我不相信他会抓到它',
 '捕捕令?他肯定在第二天早些时候就被抓到了',
 '好吧,我可以告诉他,但他必须在第二天早上给你打电话。',
 '谢谢你对他的到来']

More Info and Discussion

As the examples above suggest, Bloom may perform better than GPT at context segmentation.

More information about the Context Reconstructor and the topic of open dialogue context generation can be found at https://github.com/svjack/GLM-Open-Dialogue.

Contact

svjack - https://huggingface.co/svjack - [email protected] - [email protected]

Project Link: https://github.com/svjack/Daliy-Dialogue

Acknowledgements