Byte offsets for original input bytes to allow non-destructive tokenization #143

ELind77 · 2022-03-15T23:07:04Z

Hi all,

Is there a way to use blingfire to get the byte offsets of tokens or sentence boundaries in the original input bytes, rather than in the constructed output byte array? I'd like to store my text content and my token boundaries non-destructively. That way I can pre-compute sentence breaks and tokens offline and slice my content at runtime, without having to store both the original content (which I may want to process further) and the output of blingfire.

-- Eric
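
To make the goal concrete, here is a minimal sketch of the workflow I have in mind. It does not call blingfire at all: the `tokens` list is a hand-written stand-in for tokenizer output, and `align_tokens_to_bytes` is a hypothetical helper that recovers offsets by searching the original bytes. It assumes every token appears verbatim and in order in the original text, which may not hold if the tokenizer normalizes characters.

```python
from typing import List, Tuple


def align_tokens_to_bytes(original: bytes, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets into `original` for each token, in order.

    Hypothetical helper: assumes each token occurs verbatim in `original`
    after the end of the previous token.
    """
    spans = []
    cursor = 0
    for tok in tokens:
        needle = tok.encode("utf-8")
        start = original.find(needle, cursor)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after byte {cursor}")
        end = start + len(needle)
        spans.append((start, end))
        cursor = end
    return spans


# Offline: keep only the original bytes plus the offsets.
original = "Héllo,  world! Bye.".encode("utf-8")
tokens = ["Héllo", ",", "world", "!", "Bye", "."]  # stand-in for tokenizer output
spans = align_tokens_to_bytes(original, tokens)

# Runtime: slice the untouched original bytes with the stored offsets.
recovered = [original[s:e].decode("utf-8") for s, e in spans]
```

With offsets like these I would only need to store `original` and `spans`. Getting the offsets directly from the tokenizer would avoid the re-alignment step and the verbatim-match assumption.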
