Hello, I am preparing to do Chinese text mining with jiebaR, working in Korea on a Korean-language Windows OS.
Below is my computing environment, as reported by library(jiebaR); sessionInfo().
library(jiebaR); sessionInfo();
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949
[4] LC_NUMERIC=C LC_TIME=Korean_Korea.949
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.7 tidytext_0.3.2 stringr_1.4.0 jiebaR_0.11 jiebaRD_0.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 rstudioapi_0.13 magrittr_2.0.1 tidyselect_1.1.1 lattice_0.20-44
[6] R6_2.5.1 rlang_0.4.11 fansi_0.5.0 tools_4.1.1 grid_4.1.1
[11] utf8_1.2.2 cli_3.0.1 DBI_1.1.1 janeaustenr_0.1.5 ellipsis_0.3.2
[16] assertthat_0.2.1 tibble_3.1.3 lifecycle_1.0.0 crayon_1.4.1 Matrix_1.3-4
[21] purrr_0.3.4 SnowballC_0.7.0 tokenizers_0.2.1 vctrs_0.3.8 glue_1.4.2
[26] stringi_1.7.4 compiler_4.1.1 pillar_1.6.2 generics_0.1.0 pkgconfig_2.0.3
I managed to set up the word segmentation process as follows. The result, however, is disappointing because the user-defined dictionary doesn't seem to take effect.
bri_text <- readLines("BRIVA_revised3.txt", encoding = "UTF-8")
A tibble: 4,175 x 3
sentence sentence_id word
1 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 2000
2 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 多年
3 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 前
4 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+4E9A>
5 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+6B27>
6 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 <U+9646>上
7 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勤<U+52B3>
8 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 勇敢
9 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 的
10 2000多年前,<U+4E9A><U+6B27><U+2F24><U+9646>上勤<U+52B3>勇敢的<U+2F08><U+2EA0>,探索出多<U+6761><U+8FDE>接<U+4E9A><U+6B27><U+2FAE><U+2F0F><U+2F24><U+2F42>明的<U+8D38>易和~ 1 探索
... with 4,165 more rows.
The problem is that the output is identical with and without the user-defined dictionary: the tibble stays at 4,175 x 3 even when user.dict is supplied. By the way, I confirmed that stopwords.dict works as expected. I have no idea what the problem could be.
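The with/without comparison described above can be sketched as a small reproducible check. This is only a minimal sketch, not the script used for the results in this post: the temporary dictionary file, the single entry in it, and the sample sentence are all assumptions made for illustration, and the strings are written with \u escapes so that they do not depend on the script's file encoding under a Korean locale.

```r
library(jiebaR)

# Hypothetical one-entry user dictionary (one word per line, UTF-8).
# "\u4e9a\u6b27\u5927\u9646" is a four-character example word; replace it
# with entries from the real dictionary file.
user_dict <- tempfile(fileext = ".utf8")
writeLines("\u4e9a\u6b27\u5927\u9646", user_dict, useBytes = TRUE)

# One worker with the package defaults, one pointing at the custom dictionary.
seg_default <- worker()
seg_user    <- worker(user = user_dict)

# Hypothetical sample sentence containing the dictionary word.
sample_text <- "2000\u591a\u5e74\u524d\uff0c\u4e9a\u6b27\u5927\u9646\u4e0a"

res_default <- segment(sample_text, seg_default)
res_user    <- segment(sample_text, seg_user)

# TRUE means the user dictionary made no difference for this sentence.
same <- identical(res_default, res_user)
print(res_default)
print(res_user)
print(same)
```

Running the same check on one sentence taken directly from the input file, rather than on a typed-in string, shows whether the behaviour depends on the dictionary file or on the text being segmented.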
For reference, I attach a screen capture of the "use.dict.utf8" file below.
Thanks for any advice!