🏆 Level 1 Project :: STS(Semantic Text Similarity)

📜 Abstract

This was the NLP Level 1 basic project competition of Boostcamp AI Tech (5th cohort), run in a competition format similar to Dacon and Kaggle. The topic was Semantic Text Similarity (STS), an N21 (N-to-1) natural language processing task that quantifies how semantically similar two sentences are. Our goal was for every team member to work through the entire AI modeling process end to end, from data preprocessing to model hyperparameter tuning.


🎖️ Project Leader Board

public_1st private_2nd

  • 🥇 Public Leader Board

public_leader_board

  • 🥈 Private Leader Board

private_leader_board


🧑🏻‍💻 Team Introduction & Members

Team name : 윤슬 [ NLP Team 11 ]

👨🏼‍💻 Members

| 강민재 | 김주원 | 김태민 | 신혁준 | 윤상원 |
|---|---|---|---|---|
| Github | Github | Github | Github | Github |

🧑🏻‍🔧 Members' Role

Since this was the first NLP project for most team members, we did not divide the work by strict criteria. Instead, to build broad insight, we collaborated with the goal of having everyone experience the process end to end, from data preprocessing to model tuning. Each member was assigned models for hyperparameter tuning and implemented their own ideas for data preprocessing and augmentation, with the work split so that no two members' topics overlapped.

| Name | Role |
|---|---|
| 강민재 | Model tuning (electra-kor-base, koelectra-base-v3-discriminator); data augmentation (back translation / switching sentence pair / random character insertion and deletion); data preprocessing experiments (label integerization and noise addition); ensemble experiments (output averaging, use of standard deviation); EDA (character-count-based data distribution analysis) |
| 김태민 | Wrote the Hugging Face-based baseline code; searched for and assigned models suited to the task; led the model experiments; data preprocessing experiments (Random Token Masking, Label Random Noise, Fill Random Token Mask, Source Tagging); custom loss experiments (Binary Cross Entropy + Focal Loss); model tuning (xlm-roberta-large, electra-kor-base); model ensemble |
| 김주원 | Model tuning (kobigbird-bert-base, electra-kor-base); EDA (label distribution analysis); proposed EDA-based data augmentation ideas; data augmentation (Easy Data Augmentation SR); team collaboration process management (Github team management + team Notion page management); custom loss experiments (RMSE) |
| 윤상원 | Model tuning (koelectra-base-finetuned-nsmc, KR-ELECTRA-discriminator); data augmentation (label rescaling, simple duplication, word-order inversion, under sampling + swap sentence + copied sentence + uniform distribution + random noise); model ensemble |
| 신혁준 | Model tuning (KR-ELECTRA-discriminator, mdeberta-v3-base-kor-further); data augmentation (spelling-correction augmentation, EDA (Easy Data Augmentation) SR (Synonym Replacement) with part-of-speech selection (nouns, particles) + swap sentence + copied sentence, Data Distribution); data preprocessing experiments (spelling correction) |

🖥️ Project Introduction

| Item | Details |
|---|---|
| Project topic | Semantic Text Similarity (STS) : an NLP task that judges how similar two texts are |
| Implementation | 1. Built an AI model that measures the similarity of two sentences on a scale from 0 to 5, using Hugging Face pretrained models and the STS dataset<br>2. Carried out data preprocessing, augmentation, and hyperparameter tuning to reach a high score (close to 1) on the leaderboard metric, the Pearson Correlation Coefficient (PCC) |
| Development environment | • GPU : 5 Tesla V100 servers (RAM 32 GB) / randomly assigned K80, T4, and P100 (RAM 52 GB) / local GeForce RTX 4090 (RAM 24 GB), 2 local RTX 3060 Ti 8 GB machines (RAM 8 GB)<br>• Development tools : PyCharm, Jupyter notebook, VS Code [SSH connection to the server], Colab Pro+, wandb |
| Collaboration environment | • Github Repository : sharing and version control of the baseline code, running experiments through the issue page<br>• Notion : role assignment, idea brainstorming, and competition meeting notes on the STS project page<br>• SLACK, Zoom : real-time in-person/remote meetings |
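
The leaderboard metric named above, the Pearson correlation coefficient, can be written out in a few lines. The sketch below uses only the Python standard library; the function name `pearson` and the sample labels and predictions are made up for illustration and are not taken from the project code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold labels and model predictions on the 0-5 scale.
labels = [0.0, 1.2, 2.5, 3.8, 5.0]
preds = [0.3, 1.0, 2.9, 3.5, 4.6]
print(round(pearson(labels, preds), 4))
```

Because the coefficient is unchanged by shifting or positively rescaling the predictions, it rewards getting the linear ordering of sentence pairs right rather than matching the absolute label values.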

📁 Project Structure

🗂️ Directory Structure

  • Training data path: ./data
  • Paths of the parameters obtained by further fine-tuning public pretrained models
    • ./save_folder/kykim/checkpoint-7960
    • ./save_folder/snunlp/checkpoint-31824
    • ./save_folder/xlm_roberta_large/checkpoint-7960
  • Main training code: ./train.py
  • Training dataset path: ./data/aug_train.csv
  • Main test code: ./infer.py
  • Test dataset path: ./data/test.csv

📄 Code Structure

Data augmentation is run before training to shorten training time.

  • Get Augmentation Data : augmentation.py
  • Train : train.py
  • Predict : infer.py
  • Ensemble : esnb.py
  • Final submission file : ./esnb/esnb.csv
📦level1_semantictextsimilarity-nlp-11
 ┣ .gitignore
 ┣ config_yaml
 ┃ ┣ kykim.yaml
 ┃ ┣ snunlp.yaml
 ┃ ┣ test.yaml
 ┃ ┗ xlm_roberta_large.yaml
 ┣ data
 ┃ ┣ train.csv
 ┃ ┣ aug_train.csv
 ┃ ┣ dev.csv
 ┃ ┗ test.csv
 ┣ wordnet
 ┃ ┗ wordnet.pickle
 ┣ save_folder
 ┃ ┣ kykim
 ┃ ┃ ┗ checkpoint-7960
 ┃ ┣ snunlp
 ┃ ┃ ┗ checkpoint-31824
 ┃ ┗ xlm_roberta_large
 ┃   ┗ checkpoint-7960
 ┣ esnb
 ┃ ┗ esnb.csv
 ┣ output
 ┃ ┣ xlm_roberta_large.csv
 ┃ ┣ kykim.csv
 ┃ ┗ snunlp.csv
 ┣ Readme.md
 ┣ augmentation.py
 ┣ dataloader.py
 ┣ esnb.py
 ┣ infer.py
 ┣ train.py
 ┗ utils.py

📝 Project Ground Rule

To support teamwork, we set project Ground Rules so that the project would run smoothly, and we set brief day-by-day goals so that collaboration could proceed according to plan.

-a. Experiment Ground Rule : Before starting an experiment, open a Github issue titled in the form "[score score value (-- if none)] model name, data = data type, preprocessing type, data augmentation type", then start the experiment.

-b. Commit Ground Rule : Run git commit & push once per experiment. Write the commit so that it includes the code changes, the score, and the related issue, and push it to your personal branch.

-c. Submission Ground Rule: Each member is allotted 2 submissions per day. If you want to make additional submissions, ask in the SLACK group chat whether anyone has no submission planned for that day or has submissions left; if there is room, the additional submission is allowed.
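
The issue-title convention in rule -a can be made concrete with a small helper. This is only an illustrative sketch: the function `issue_title`, its parameter names, the four-decimal score format, and the example values are assumptions, not part of the repository.

```python
def issue_title(model, data, preprocessing, augmentation, score=None):
    """Build an experiment issue title in the style of Ground Rule -a.

    `score` is None while the experiment has no result yet, which the rule
    writes as "--". The four-decimal formatting is an assumption.
    """
    score_text = "--" if score is None else f"{score:.4f}"
    return f"[score {score_text}] {model}, data = {data}, {preprocessing}, {augmentation}"


# Hypothetical experiment, titled before and after a score is available.
print(issue_title("kykim/electra-kor-base", "aug_train.csv", "spelling correction", "swap sentence"))
print(issue_title("kykim/electra-kor-base", "aug_train.csv", "spelling correction", "swap sentence", score=0.9312))
```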


🗓️ Project Procedure

  • (Days 1~3): Completed the lectures and special missions for the NLP basics competition & set the collaboration Ground Rules
  • (Days 3~4): Completed the STS baseline code & EDA (Exploratory Data Analysis) & held idea meetings on preprocessing/augmentation
  • (Days 5~14): Implemented and ran experiments on ideas such as preprocessing, data augmentation, and hyperparameter tuning

*Below is the Gantt chart of our project progress.

road_map


⚙️ Architecture

| Category | Details |
|---|---|
| Models | kykim/electra-kor-base, snunlp/KR-ELECTRA-discriminator, xlm-roberta-large + HuggingFace Transformers Trainer |
| Data | • v1 : dataset in which label imbalance was reduced by applying the swap sentence and copied sentence techniques<br>• v2 : dataset augmented with the Synonym Replacement technique using the KorEDA Wordnet |
| Validation strategy | • First, compare the Pearson correlation coefficient at the evaluation stage<br>• Second, submit models whose performance is close to the existing SOTA model and check the public score |
| Ensemble method | • Ensemble by collecting the outputs of the three models above and averaging them |
| Model evaluation and improvement | By analyzing the tokenization results and adjusting max_length, we cut model training time roughly in half. Various augmentation and preprocessing techniques addressed the label imbalance problem, which prevented overfitting and substantially improved performance. In addition, using the HuggingFace Trainer and wandb let us manage the many hyperparameters more conveniently and efficiently. |
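
The v1 data row names two augmentation techniques without defining them. The sketch below shows one plausible reading, not the repository's augmentation.py: it assumes "swap sentence" means exchanging the two sentences while keeping the label, and "copied sentence" means pairing a sentence with itself at the top label 5.0. The column names `sentence_1`, `sentence_2`, and `label`, and the sample rows, are also assumptions.

```python
import random


def swap_sentence(rows):
    """Return new rows with the two sentences exchanged and the label kept."""
    return [
        {"sentence_1": r["sentence_2"], "sentence_2": r["sentence_1"], "label": r["label"]}
        for r in rows
    ]


def copied_sentence(rows, n, seed=0):
    """Pair up to n sampled sentences with themselves and label them 5.0."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    picked = rng.sample(rows, k=min(n, len(rows)))
    return [
        {"sentence_1": r["sentence_1"], "sentence_2": r["sentence_1"], "label": 5.0}
        for r in picked
    ]


# Hypothetical training rows.
train = [
    {"sentence_1": "The movie was great.", "sentence_2": "I enjoyed the film.", "label": 4.2},
    {"sentence_1": "It is raining today.", "sentence_2": "The stock market fell.", "label": 0.0},
]
augmented = train + swap_sentence(train) + copied_sentence(train, n=1)
print(len(augmented))
```

Under this reading, swapping is safe because similarity is symmetric, and copied pairs add examples at the high end of the label range, which is how such techniques can reduce label imbalance.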

💻 Getting Started

⚠️ How To install Requirements

# Install the required libraries
# version 0.5
pip install git+https://github.com/haven-jeon/PyKoSpacing.git
# version 1.1
pip install git+https://github.com/jungin500/py-hanspell
pip install -r requirements.txt
sudo apt install default-jdk

⌨️ How To Train

# Data augmentation
python3 augmentation.py
# Run train.py : train the model
# Change model_name to kykim/electra-kor-base, snunlp/KR-ELECTRA-discriminator, and xlm-roberta-large in turn and train each one
python3 train.py # model_name = model_list[0]
python3 train.py # model_name = model_list[1]
python3 train.py # model_name = model_list[2]

⌨️ How To Infer output.csv

# Run infer.py : load the trained model and run inference using sample_submission
python3 infer.py # model_name = model_list[0]
python3 infer.py # model_name = model_list[1]
python3 infer.py # model_name = model_list[2]
python3 esnb.py
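
The final esnb.py step combines the three per-model prediction files by averaging. As a rough illustration of that idea, and not the repository's actual esnb.py, the sketch below averages predictions row by row. The `id` and `target` column names and the in-memory sample data are assumptions standing in for files such as output/kykim.csv.

```python
import csv
import io


def read_targets(csv_file):
    """Read (id, target) pairs from a CSV file object with 'id' and 'target' columns."""
    return [(row["id"], float(row["target"])) for row in csv.DictReader(csv_file)]


def average_ensemble(prediction_sets):
    """Average per-row predictions from several models, matching rows by position."""
    ids = [row_id for row_id, _ in prediction_sets[0]]
    for preds in prediction_sets[1:]:
        if [row_id for row_id, _ in preds] != ids:
            raise ValueError("prediction files must list the same ids in the same order")
    averaged = []
    for i, row_id in enumerate(ids):
        mean = sum(preds[i][1] for preds in prediction_sets) / len(prediction_sets)
        averaged.append((row_id, mean))
    return averaged


# In-memory stand-ins for the three per-model output files (made-up values).
kykim = io.StringIO("id,target\na,1.0\nb,4.0\n")
snunlp = io.StringIO("id,target\na,2.0\nb,5.0\n")
xlm = io.StringIO("id,target\na,3.0\nb,4.5\n")

ensemble = average_ensemble([read_targets(f) for f in (kykim, snunlp, xlm)])
print(ensemble)
```

The id check guards against silently averaging misaligned rows when the files were produced from differently ordered test sets.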