[score 0.894] monologg/koelectra-base-finetuned-nsmc / Preprocessing data #27

Open
SangwonYoon opened this issue Apr 13, 2023 · 0 comments
Labels
Preprocessing Data Preprocessing

Comments

@SangwonYoon
Contributor

SangwonYoon commented Apr 13, 2023

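import re

from soynlp.normalizer import repeat_normalize

def preprocess_text(self, text):
    # create Korean tokenizer using soynlp library
    # tokenizer = RegexTokenizer()
    # Normalize repeated characters (runs are shortened to at most 2 repeats)
    text = repeat_normalize(text, num_repeats=2)
    # Remove stopwords
    # text = ' '.join([token for token in text.split() if token not in stopwords])
    # Convert uppercase to lowercase
    text = text.lower()
    # Replace "" with "사람" ("person")
    # NOTE: the pattern is empty as posted, so it matches at every position and
    # inserts '사람' between all characters; the intended token is not shown here.
    text = re.sub('', '사람', text)
    # Remove every character except Hangul syllables, lowercase English letters, and whitespace
    text = re.sub(r'[^가-힣a-z\s]', '', text)
    # Split the text into tokens, e.g. "안녕하세요" -> "안녕", "하", "세요"
    # tokens = tokenizer.tokenize(text)
    # Extract stems
    # tokens = [self.stemmer.morphs(token)[0] for token in text.split()]
    # join tokens back into sentence
    # text = ' '.join(tokens)
    return text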
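
For anyone who wants to try the active steps without installing soynlp, here is a minimal stdlib-only sketch. `collapse_repeats` and `preprocess_text_sketch` are hypothetical names, and `collapse_repeats` is only a rough stand-in for `repeat_normalize`, not its actual implementation. The replacement step with the empty pattern is left out of the sketch on purpose; the last line of the demo shows what an empty pattern does.

```python
import re


def collapse_repeats(text: str, num_repeats: int = 2) -> str:
    """Hypothetical stand-in for soynlp's repeat_normalize (not its real code).

    Shortens any run of one repeated character to at most `num_repeats` copies.
    """
    pattern = r'(.)\1{%d,}' % num_repeats
    return re.sub(pattern, lambda m: m.group(1) * num_repeats, text)


def preprocess_text_sketch(text: str) -> str:
    """Stdlib-only approximation of the active steps in preprocess_text."""
    text = collapse_repeats(text, num_repeats=2)
    text = text.lower()
    # Keep only Hangul syllables, lowercase ASCII letters, and whitespace.
    return re.sub(r'[^가-힣a-z\s]', '', text)


if __name__ == "__main__":
    print(repr(preprocess_text_sketch("Good!!! ㅋㅋㅋㅋ 재밌어요")))
    # An empty pattern matches at every position, so the replacement is
    # inserted between all characters instead of replacing one token.
    print(re.sub('', '사람', 'ab'))
```

One thing the sketch makes visible: standalone jamo such as ㅋ fall outside the `가-힣` range, so the final filter removes them entirely and can leave consecutive spaces behind. If that matters downstream, collapsing whitespace afterwards may be worth considering.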

@SangwonYoon SangwonYoon added the Preprocessing Data Preprocessing label Apr 13, 2023
@SangwonYoon SangwonYoon changed the title from "[--] monologg/koelectra-base-finetuned-nsmc / Preprocessing data" to "[score 0.894] monologg/koelectra-base-finetuned-nsmc / Preprocessing data" Apr 13, 2023
@SangwonYoon SangwonYoon self-assigned this Apr 13, 2023