Sentence Splitting issue #123

lucadaniello · 2024-02-16T09:49:06Z

Hi! I would like to ask you something about the splitting of text into sentences during the annotation phase.

I thought sentences were split at the dot that ends them, but that is not always the case. Sometimes the split happens at a ":" or before a term in uppercase.

I would like to ask:

What is the rule for sentence splitting?
Is it possible to set the separator? For instance, split a sentence only when a dot is found.

I’m using the udpipe package in R.
Below is an example text where I find that sentences are separated by an uppercase term:

library(udpipe)
model <- udpipe_download_model(language = "english")
txt <- c("No previous
study has investigated the influence of governance and organizational AHCs configurations
on the productivity and scientific impact of AHCs.")
df <- udpipe(txt, object = udpipe_load_model(model$file_model))

Thank you!!

jwijffels · 2024-02-18T07:08:47Z

Sentence splitting is done by a statistical classification model trained on CoNLL-U data from Universal Dependencies. For each character in the text, it predicts whether a new sentence starts at that character, given the surrounding context, so splits are not tied to a fixed separator such as a dot.
If you want to split in another way, you can use udpipe::strsplit.data.frame or strsplit from base R to define your own hardcoded sentence-splitting criteria.
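
As a rough illustration of the base R route, here is a minimal sketch that splits only after a dot followed by whitespace. The helper split_on_dot and the doc_id naming are made up for this example and are not part of udpipe; the regular expression is one possible rule, not a recommended one.

```r
# Sample text from the question; the line breaks are part of the string.
txt <- "No previous
study has investigated the influence of governance and organizational AHCs configurations
on the productivity and scientific impact of AHCs."

# Hypothetical helper (not part of udpipe): split only after a dot
# that is followed by whitespace.
split_on_dot <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)                      # collapse line breaks into spaces
  parts <- strsplit(x, "(?<=\\.)\\s+", perl = TRUE)[[1]]  # split after ". "
  parts <- trimws(parts)
  parts[nzchar(parts)]                                    # drop empty pieces
}

sentences <- split_on_dot(txt)

# One row per hand-made sentence; the doc_id scheme is an assumption.
sentence_df <- data.frame(
  doc_id = paste0("doc1_s", seq_along(sentences)),
  text = sentences,
  stringsAsFactors = FALSE
)
print(sentence_df)
```

With this rule the sample text stays in one piece, because its only dot is at the very end. A data.frame with doc_id and text columns can be passed to udpipe() in place of a character vector, but the model may still split inside a piece, and a plain dot rule will also split after abbreviations such as "Dr.".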
