
help! Jieba user-defined dictionary function doesn't work at all! #72

Open
Wonpio opened this issue Oct 20, 2021 · 0 comments
Wonpio commented Oct 20, 2021

Hello, I am preparing to do Chinese text mining with jiebaR, in Korea, on a Korean-language Windows OS.
Below is my computing environment, as reported by library(jiebaR); sessionInfo().

library(jiebaR); sessionInfo();
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_1.0.7 tidytext_0.3.2 stringr_1.4.0 jiebaR_0.11 jiebaRD_0.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 rstudioapi_0.13 magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44
[6] R6_2.5.1 rlang_0.4.11 fansi_0.5.0 tools_4.1.1 grid_4.1.1
[11] utf8_1.2.2 cli_3.0.1 DBI_1.1.1 janeaustenr_0.1.5 ellipsis_0.3.2
[16] assertthat_0.2.1 tibble_3.1.3 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.3-4
[21] purrr_0.3.4 SnowballC_0.7.0 tokenizers_0.2.1 vctrs_0.3.8 glue_1.4.2
[26] stringi_1.7.4 compiler_4.1.1 pillar_1.6.2 generics_0.1.0 pkgconfig_2.0.3

I managed to set up the word segmentation process as follows. The result, however, is disappointing: the user-defined dictionary has no effect.

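library(jiebaR); library(dplyr); library(tidytext)
bri_text <- readLines("BRIVA_revised3.txt", encoding = "UTF-8")

# Split the text into sentences and number them.
bri_stnc <- bri_text %>% as_tibble() %>% unnest_tokens(input = value, output = sentence, token = "sentences")
bri_stnc <- bri_stnc %>% mutate(sentence_id = row_number())

# Segment each sentence with a worker that loads the user dictionary, join the tokens with spaces, then split into one word per row.
seg <- worker(bylines = TRUE, user = "C:/Users/user/Documents/R/win-library/4.1/jiebaRD/dict/user.dict.utf8")
bri_df <- bri_stnc %>% mutate(text = sapply(segment(sentence, seg), paste, collapse = " ")) %>% unnest_tokens(word, text)
bri_df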

A tibble: 4,175 x 3

sentence sentence_id word

1 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 2000
2 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 多年
3 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 前
4 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+4E9A>
5 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+6B27>
6 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+9646>上
7 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勤<U+52B3>
8 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勇敢
9 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 的
10 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 探索

... with 4,165 more rows.
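
One thing visible in the printed output above: several escaped code points, such as <U+2F24>, <U+2F08> and <U+2F42>, belong to the Kangxi Radicals block rather than to the ordinary CJK ideographs, which often happens with text extracted from PDFs. If the entries in user.dict.utf8 are typed with ordinary ideographs (an assumption; the dictionary content is only shown in the screen capture), they cannot match such text. Below is a minimal sketch of normalizing the text with stringi before segmentation, using a sample string built from code points in the output. The explicit replacement for U+2EA0 assumes that 民 (U+6C11) is the intended character.

```r
library(stringi)

# Sample built from code points that appear in the printed output:
# U+2F24 (Kangxi radical "big"), U+9646, U+2F08 (Kangxi radical "man"),
# U+2EA0 (CJK radical "civilian").
x <- "\u2f24\u9646 \u2f08\u2ea0"

# NFKC normalization maps Kangxi radicals (U+2F00..U+2FD5) to the
# corresponding unified ideographs.
y <- stri_trans_nfkc(x)

# U+2EA0 has no compatibility mapping, so it is replaced explicitly.
# Assumption: U+6C11 is the character that was meant.
y <- stri_replace_all_fixed(y, "\u2ea0", "\u6c11")

# Show the result as escaped code points, independent of the console locale.
stri_escape_unicode(y)
```

The same two calls could be applied to bri_text right after readLines(), before it is split into sentences.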

The problem is that the result is exactly the same with and without the user-defined dictionary: the tibble is still 4,175 x 3 even when user.dict is supplied. By contrast, I checked that stopwords.dict works well. I have no idea what the problem might be.
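
One way to narrow this down, offered as a sketch rather than a confirmed diagnosis, is to call segment() on a single string with and without the user argument and compare the results before anything is passed to unnest_tokens(), which by default applies its own word tokenizer to the joined text. The dictionary entry and test string below are hypothetical placeholders, written as \u escapes so they do not depend on the console locale; dict_path is a temporary file, not the real user.dict.utf8.

```r
library(jiebaR)

# Hypothetical one-entry user dictionary, written as UTF-8 bytes to a temp file.
entry <- enc2utf8("\u4e00\u5e26\u4e00\u8def\u5021\u8bae")
dict_path <- tempfile(fileext = ".utf8")
con <- file(dict_path, open = "wb")
writeLines(entry, con, useBytes = TRUE)
close(con)

# Hypothetical test string containing the dictionary entry.
s <- enc2utf8("\u5171\u5efa\u4e00\u5e26\u4e00\u8def\u5021\u8bae")

# One worker with the default dictionaries only, one with the user dictionary.
seg_default <- worker()
seg_user <- worker(user = dict_path)

res_default <- segment(s, seg_default)
res_user <- segment(s, seg_user)

# FALSE here means the user dictionary changed the segmentation.
identical(res_default, res_user)
```

If the two results differ here but not in the full pipeline, the later unnest_tokens(word, text) step is a likely suspect; passing token = "regex", pattern = " " to it should split on the inserted spaces only.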
For reference, I attach a screen capture of the "user.dict.utf8" file below.
Thanks for any advice!
screen capture_briva_user dict
