bugfix: fix character repetition method
zhijianma committed Oct 20, 2023
1 parent 063c3d4 commit 57efa05
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions data_juicer/ops/filter/character_repetition_filter.py
@@ -59,10 +59,10 @@ def compute_stats(self, sample):

         freq_char_ngrams = sorted(list(freq_char_ngrams.values()),
                                   reverse=True)
-        rep_more_than_one = len([el for el in freq_char_ngrams if el > 1])
+        num_no_rep_char_ngrams = len([el for el in freq_char_ngrams if el == 1])
         num_rep_char_ngrams = min(
             int(np.sqrt(len(freq_char_ngrams))),
-            len(freq_char_ngrams) - rep_more_than_one,
+            len(freq_char_ngrams) - num_no_rep_char_ngrams,
         )
         sample[Fields.stats][StatsKeys.char_rep_ratio] = (sum(
             freq_char_ngrams[:num_rep_char_ngrams]) / sum(freq_char_ngrams)) \
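The effect of the change can be seen in a small standalone sketch. It is not the data_juicer implementation: the function name `char_rep_ratio`, the `fixed` switch, and the sliding-window n-gram extraction are assumptions made for illustration, and `math.sqrt` stands in for `np.sqrt`. Only the `min(...)` bound and the final ratio follow the hunk above.

```python
import math
from collections import Counter


def char_rep_ratio(text, n=2, fixed=True):
    """Hypothetical helper mirroring the ratio computed in the hunk above."""
    # Assumed n-gram extraction: every window of n consecutive characters.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    freqs = sorted(Counter(ngrams).values(), reverse=True)
    if not freqs:
        return 0.0

    if fixed:
        # After the commit: subtract the n-grams that occur exactly once,
        # leaving the number of distinct n-grams that actually repeat.
        excluded = len([el for el in freqs if el == 1])
    else:
        # Before the commit: subtracted the repeated n-grams instead,
        # leaving the number of n-grams that never repeat.
        excluded = len([el for el in freqs if el > 1])

    num_rep = min(int(math.sqrt(len(freqs))), len(freqs) - excluded)
    return sum(freqs[:num_rep]) / sum(freqs)


print(char_rep_ratio("abcdef"), char_rep_ratio("abcdef", fixed=False))
print(char_rep_ratio("aaaaaa"), char_rep_ratio("aaaaaa", fixed=False))
```

With `n=2`, the string "abcdef" has five distinct bigrams and none repeats, so the fixed bound is 0 and the ratio is 0.0, while the old bound gave 0.4. For "aaaaaa" the single bigram "aa" occurs five times, so the fixed version returns 1.0 and the old one returned 0.0.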
