[score 0.894] monologg/koelectra-base-finetuned-nsmc / Preprocessing data #27

SangwonYoon · 2023-04-13T17:31:04Z

import re

from soynlp.normalizer import repeat_normalize


def preprocess_text(self, text):
    # Create a Korean tokenizer using the soynlp library
    # tokenizer = RegexTokenizer()
    # Normalize characters repeated two or more times
    text = repeat_normalize(text, num_repeats=2)
    # Remove stopwords
    # text = ' '.join([token for token in text.split() if token not in stopwords])
    # Convert uppercase to lowercase
    text = text.lower()
    # Replace "" with "사람" ("person")
    # NOTE: the pattern is empty as posted; an empty pattern matches at every position
    text = re.sub('', '사람', text)
    # Remove every character except Hangul syllables, English letters, and whitespace
    text = re.sub(r'[^가-힣a-z\s]', '', text)
    # Split the text into tokens, e.g. "안녕하세요" -> "안녕", "하", "세요"
    # tokens = tokenizer.tokenize(text)
    # Stemming
    # tokens = [self.stemmer.morphs(token)[0] for token in text.split()]
    # Join tokens back into a sentence
    # text = ' '.join(tokens)
    return text
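
The lowercasing and character-filtering steps in the function above can be tried in isolation. The following is a minimal standalone sketch using only the standard library; the function name `filter_korean_english` and the sample sentence are illustrative assumptions, not part of the posted code. It also shows how `re.sub` behaves when given an empty pattern, which is how the "사람" replacement line reads as posted.

```python
import re


def filter_korean_english(text: str) -> str:
    """Lowercase the text, then keep only Hangul syllables, a-z, and whitespace.

    Hypothetical helper for illustration; it mirrors two steps of the posted
    preprocess_text and leaves out the soynlp repeat normalization.
    """
    text = text.lower()
    return re.sub(r"[^가-힣a-z\s]", "", text)


# Punctuation and digits are dropped; Hangul, lowercase letters, and spaces remain.
print(filter_korean_english("이 영화 Really 좋아요!!! 10점"))

# An empty pattern matches at every position, so the replacement string is
# inserted before each character and once more at the end of the string.
print(re.sub("", "사람", "ab"))
```

If the "사람" replacement is meant to target a specific placeholder token, that token would need to be written out as the pattern; the snippet only demonstrates the behavior of the empty pattern.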

SangwonYoon added the Preprocessing label Apr 13, 2023

SangwonYoon added a commit that referenced this issue Apr 13, 2023

monologg/koelectra-base-finetuned-nsmc / Preprocessing data #27

00f8361

SangwonYoon changed the title ~~[--] monologg/koelectra-base-finetuned-nsmc / Preprocessing data~~ [score 0.894] monologg/koelectra-base-finetuned-nsmc / Preprocessing data Apr 13, 2023

SangwonYoon self-assigned this Apr 13, 2023
