help! Jieba user-defined dictionary function doesn't work at all! #72

Wonpio · 2021-10-20T07:58:16Z

Hello, I am preparing to do Chinese text mining with jiebaR on a Korean-language Windows OS.
My computing environment, as reported by library(jiebaR); sessionInfo(), is as follows.

library(jiebaR); sessionInfo();
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_1.0.7 tidytext_0.3.2 stringr_1.4.0 jiebaR_0.11 jiebaRD_0.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 rstudioapi_0.13 magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44
[6] R6_2.5.1 rlang_0.4.11 fansi_0.5.0 tools_4.1.1 grid_4.1.1
[11] utf8_1.2.2 cli_3.0.1 DBI_1.1.1 janeaustenr_0.1.5 ellipsis_0.3.2
[16] assertthat_0.2.1 tibble_3.1.3 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.3-4
[21] purrr_0.3.4 SnowballC_0.7.0 tokenizers_0.2.1 vctrs_0.3.8 glue_1.4.2
[26] stringi_1.7.4 compiler_4.1.1 pillar_1.6.2 generics_0.1.0 pkgconfig_2.0.3

I set up the word segmentation process as follows. The result, however, is disappointing: the user-defined dictionary has no effect.

library(dplyr); library(tidytext); library(jiebaR)

# Read the source text as UTF-8
bri_text <- readLines("BRIVA_revised3.txt", encoding = "UTF-8")

# Split into sentences and number them
bri_stnc <- bri_text %>% as_tibble() %>%
  unnest_tokens(input = value, output = sentence, token = "sentences") %>%
  mutate(sentence_id = row_number())

# Segment each sentence with jiebaR (user dictionary supplied), then put one word per row
seg <- worker(bylines = TRUE, user = "C:/Users/user/Documents/R/win-library/4.1/jiebaRD/dict/user.dict.utf8")
bri_df <- bri_stnc %>%
  mutate(text = sapply(segment(sentence, seg), paste, collapse = " ")) %>%
  unnest_tokens(word, text)
bri_df

# A tibble: 4,175 x 3

sentence sentence_id word

1 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 2000
2 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 多年
3 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 前
4 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+4E9A>
5 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+6B27>
6 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+9646>上
7 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勤<U+52B3>
8 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勇敢
9 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 的
10 2000多年前，<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>，探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 探索

# ... with 4,165 more rows
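
One detail in the output above may be worth checking. The sentence column contains code points such as <U+2F24>, <U+2F08> and <U+2F42>, which belong to the Kangxi Radicals block, and <U+2EA0>, which belongs to the CJK Radicals Supplement block. These look the same as ordinary ideographs (U+2F24 resembles U+5927, for example) but are different characters, so a dictionary entry typed with ordinary ideographs would not match them byte for byte. Whether that explains the behaviour reported here is an assumption, not something confirmed in this thread. The sketch below is a hypothetical base-R helper (`find_radical_chars` is not part of jiebaR) that lists such code points in a string:

```r
# Hypothetical helper, not part of jiebaR: report code points that fall in the
# CJK Radicals Supplement (U+2E80-U+2EFF) or Kangxi Radicals (U+2F00-U+2FDF) blocks.
find_radical_chars <- function(x) {
  # utf8ToInt() needs a single string, so collapse a vector of lines first
  cps <- utf8ToInt(enc2utf8(paste(x, collapse = "")))
  hits <- cps[cps >= 0x2E80 & cps <= 0x2FDF]
  sprintf("U+%04X", unique(hits))
}

# Sample built from four code points shown in the output above:
# <U+4E9A><U+6B27><U+2F24><U+9646>
sample_text <- intToUtf8(c(0x4E9A, 0x6B27, 0x2F24, 0x9646))
print(find_radical_chars(sample_text))
```

Running the same helper on `bri_text` and on the lines of the user dictionary file would show whether the two use the same code points for the words in question.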

The problem is that the result is identical with and without the user-defined dictionary: the tibble is 4,175 x 3 either way. By contrast, I have confirmed that stopwords.dict works as expected. I have no idea what the problem might be.
For reference, I attach a screen capture of the "user.dict.utf8" file below.
Thanks for any advice!
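
A way to narrow this down is to call jiebaR directly, outside the tidytext pipeline, and compare the tokens produced with and without a user dictionary. The following is a minimal sketch under stated assumptions: the one-word dictionary and the sample sentence are made up for illustration (they are not the contents of the user.dict.utf8 file above), and the dictionary is written to a temporary file. Only `worker()` and `segment()` from jiebaR are used.

```r
library(jiebaR)

# Illustrative word and dictionary file; neither comes from the original post.
word <- "\u4e9a\u6b27\u5927\u9646"
dict_path <- tempfile(fileext = ".utf8")

# Write the entry as raw UTF-8 bytes so a non-UTF-8 Windows locale cannot re-encode it.
con <- file(dict_path, open = "wb")
writeLines(enc2utf8(word), con, sep = "\n", useBytes = TRUE)
close(con)

plain  <- worker()                  # packaged default dictionaries only
custom <- worker(user = dict_path)  # same, but with the one-word user dictionary

# Sample sentence that starts with the dictionary word
sentence <- paste0(word, "\u4e0a\u52e4\u52b3\u52c7\u6562\u7684\u4eba\u6c11")
plain_tokens  <- segment(sentence, plain)
custom_tokens <- segment(sentence, custom)

print(plain_tokens)
print(custom_tokens)
```

If the dictionary word comes out as a single token here but not in `bri_df`, the later `unnest_tokens(word, text)` step is worth checking: its default word tokenizer may re-split the space-joined text on its own. In that case, splitting only on the spaces, for example with `unnest_tokens(word, text, token = stringr::str_split, pattern = " ")`, might preserve the jiebaR tokens.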
