Commit

bugfix count total sentences
guipenedo committed Oct 16, 2023
1 parent 2be346e commit c5f0d0a
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datatrove/pipeline/dedup/sentence_dedup.py
@@ -220,7 +220,7 @@ def filter(self, doc: Document, du_lines: set = None):
         filtered_sentences = [sent for idx, sent in enumerate(sentences) if not du_lines or idx not in du_lines]
         if len(filtered_sentences) < len(sentences):
             self.stat_update("removed_sentences", len(sentences) - len(filtered_sentences))
-        self.stat_update()
+        self.stat_update("original_sentences", len(sentences))
         doc.content = " ".join(filtered_sentences).strip()
         if len(word_tokenize(doc.content)) > self.min_doc_words:
             return True
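
The effect of the change can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `SentenceFilterSketch`, its `stats` counter, and its `stat_update(key, value)` signature are hypothetical stand-ins, not datatrove's classes or API, and a plain whitespace split stands in for `word_tokenize`. The sketch also assumes the total is recorded for every document, not only for documents that lost sentences.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple


class SentenceFilterSketch:
    """Hypothetical stand-in for the filter step shown in the diff."""

    def __init__(self, min_doc_words: int = 3) -> None:
        self.min_doc_words = min_doc_words
        self.stats: Counter = Counter()

    def stat_update(self, key: str, value: int = 1) -> None:
        # Hypothetical counter: add `value` to the statistic named `key`.
        self.stats[key] += value

    def filter(
        self, sentences: List[str], du_lines: Optional[Set[int]] = None
    ) -> Tuple[bool, str]:
        # Keep every sentence whose index is not flagged as a duplicate.
        filtered = [
            sent
            for idx, sent in enumerate(sentences)
            if not du_lines or idx not in du_lines
        ]
        if len(filtered) < len(sentences):
            self.stat_update("removed_sentences", len(sentences) - len(filtered))
        # Record the total sentence count under a named key (assumed to run
        # for every document).
        self.stat_update("original_sentences", len(sentences))
        content = " ".join(filtered).strip()
        # Whitespace split used here in place of word_tokenize.
        return len(content.split()) > self.min_doc_words, content


sketch = SentenceFilterSketch(min_doc_words=3)
keep, text = sketch.filter(["A b c.", "Dup here.", "D e f."], du_lines={1})
print(keep, text, dict(sketch.stats))
```

With both counters populated, the share of removed sentences can be derived as `removed_sentences / original_sentences`.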