Sentence Splitting issue #123

Open
lucadaniello opened this issue Feb 16, 2024 · 1 comment

Comments

@lucadaniello

Hi! I have a question about how text is split into sentences during annotation.

I assumed that sentences were split at the dot that ends them, but that is not always the case. Sometimes the split happens at a ":" or at a term in uppercase.

I would like to ask:

  1. What is the rule for sentence splitting?
  2. Is it possible to set the separator? For instance, split a sentence only when a dot is found.

I’m using the udpipe package in R.
Below is an example text where I find that sentences are separated by an uppercase term:

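library(udpipe)
# Download the English model; the path to the model file is in model$file_model
model <- udpipe_download_model(language = "english")
txt <- c("No previous
study has investigated the influence of governance and organizational AHCs configurations
on the productivity and scientific impact of AHCs.")
# Load the model and annotate the text (tokenisation, tagging, parsing)
df <- udpipe(txt, object = udpipe_load_model(model$file_model))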

Thank you!!

@jwijffels
Contributor

Sentence splitting is based on a statistical classification model trained on CoNLL-U data from Universal Dependencies. For each character in the text, it predicts whether a new sentence starts at that character, given the surrounding context. That is why a split can occur at places other than a dot.
If you want to split another way, you could use udpipe::strsplit.data.frame or strsplit from base R to define your own hardcoded sentence splitting criteria.
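
As a rough illustration of the base R route, here is a minimal sketch that splits only where a dot is followed by whitespace. It is not part of udpipe. The second sentence in `txt` is added purely for illustration, and the names `clean`, `sentences` and `sentence_df` are hypothetical. The line breaks inside the original string are collapsed first, on the assumption that they are not meant as sentence boundaries.

```r
# Example text from the question, plus one extra sentence added for illustration
txt <- "No previous
study has investigated the influence of governance and organizational AHCs configurations
on the productivity and scientific impact of AHCs. A second sentence follows here."

# Collapse line breaks and repeated whitespace into single spaces
clean <- gsub("\\s+", " ", txt)

# Split on whitespace that directly follows a dot (the dot stays with its sentence)
sentences <- strsplit(clean, "(?<=\\.)\\s+", perl = TRUE)[[1]]
sentences <- trimws(sentences)
sentences <- sentences[nzchar(sentences)]

# One row per sentence, with a hypothetical id column
sentence_df <- data.frame(
  doc_id = paste0("sent", seq_along(sentences)),
  text = sentences,
  stringsAsFactors = FALSE
)
print(sentence_df)
```

A rule like this also splits after abbreviations such as "e.g." or "Fig.", so the pattern would probably need tuning for real text. Each row of `sentence_df` could then be annotated as its own document, although the model may still decide to split within a row.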
