Merge pull request #1966 from jqnatividad/1962-frequency
`frequency`: fix unique identifiers column detection
jqnatividad authored Jul 13, 2024
2 parents 65136f4 + 6d97805 commit c85b043
Showing 3 changed files with 385 additions and 10 deletions.
301 changes: 301 additions & 0 deletions resources/test/data1962.csv
@@ -0,0 +1,301 @@
year
2001
2002
2002
2003
2003
2003
2004
2004
2004
2004
2005
2005
2005
2005
2005
2006
2006
2006
2006
2006
2006
2007
2007
2007
2007
2007
2007
2007
2008
2008
2008
2008
2008
2008
2008
2008
2009
2009
2009
2009
2009
2009
2009
2009
2009
2010
2010
2010
2010
2010
2010
2010
2010
2010
2010
2011
2011
2011
2011
2011
2011
2011
2011
2011
2011
2011
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2012
2013
2013
2013
2013
2013
2013
2013
2013
2013
2013
2013
2013
2013
2014
2014
2014
2014
2014
2014
2014
2014
2014
2014
2014
2014
2014
2014
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2015
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2016
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2017
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2018
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2019
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2022
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2023
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
2024
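The fixture above follows a simple pattern: year 2000 + k appears k times for k = 1 to 24, giving 300 data rows with 24 distinct values and no column-wide uniqueness. A minimal sketch that rebuilds the rows and tallies them, assuming that pattern (the helper names `fixture_rows` and `frequency` are hypothetical and not part of qsv):

```rust
use std::collections::HashMap;

/// Rebuilds the data rows of the fixture: year 2000 + k appears k times, k = 1..=24.
/// (Hypothetical helper for illustration only.)
fn fixture_rows() -> Vec<u32> {
    (1..=24u32)
        .flat_map(|k| std::iter::repeat(2000 + k).take(k as usize))
        .collect()
}

/// Tallies how often each year occurs.
fn frequency(rows: &[u32]) -> HashMap<u32, u64> {
    let mut table = HashMap::new();
    for &year in rows {
        *table.entry(year).or_insert(0u64) += 1;
    }
    table
}

fn main() {
    let rows = fixture_rows();
    let table = frequency(&rows);
    let max_count = table.values().copied().max().unwrap_or(0);
    // prints "300 rows, 24 distinct years, max count 24"
    println!(
        "{} rows, {} distinct years, max count {}",
        rows.len(),
        table.len(),
        max_count
    );
}
```

Because the largest count is 24 rather than 1, a correct check should not classify this column as a unique identifier column.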
24 changes: 14 additions & 10 deletions src/cmd/frequency.rs
@@ -242,12 +242,20 @@ impl Args {
         let unique_counts_len = counts.len();
         if self.flag_lmt_threshold == 0 || self.flag_lmt_threshold >= unique_counts_len {
             // check if the column has all unique values
-            // by checking if counts length is equal to ftable length
+            // do this by looking at the counts vec
+            // and see if it has a count of 1, indicating all unique values
+            let all_unique = counts[if self.flag_asc {
+                unique_counts_len - 1
+            } else {
+                0
+            }]
+            .1 == 1;
+
             let abs_limit = self.flag_limit.unsigned_abs();
-            let unique_limited = if self.flag_limit > 0
+            let unique_limited = if all_unique
+                && self.flag_limit > 0
                 && self.flag_unq_limit != abs_limit
                 && self.flag_unq_limit > 0
-                && unique_counts_len == ftab.len()
             {
                 counts.truncate(self.flag_unq_limit);
                 true
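The new check in this hunk relies on `counts` being sorted by count, so the largest count sits at index 0 (descending) or at the last index (ascending); if that largest count is 1, every value occurs exactly once. A minimal standalone sketch of the idea, assuming `counts` holds `(value, count)` pairs (the `Count` alias and the free function `all_unique` are hypothetical, and unlike the indexing in the diff this version returns `false` for an empty table instead of panicking):

```rust
/// Hypothetical stand-in for the (value, count) pairs held in `counts`.
type Count = (Vec<u8>, u64);

/// Returns true when every value occurs exactly once.
/// Assumes `counts` is sorted by count: ascending when `asc` is true,
/// descending otherwise, so the largest count sits at one end.
fn all_unique(counts: &[Count], asc: bool) -> bool {
    let largest = if asc { counts.last() } else { counts.first() };
    // An empty table is treated as "not all unique" in this sketch.
    largest.map_or(false, |(_, n)| *n == 1)
}

fn main() {
    // Descending order: the most frequent value comes first.
    let years: Vec<Count> = vec![(b"2024".to_vec(), 24), (b"2001".to_vec(), 1)];
    let ids: Vec<Count> = vec![(b"a".to_vec(), 1), (b"b".to_vec(), 1)];

    assert!(!all_unique(&years, false)); // largest count is 24
    assert!(all_unique(&ids, false)); // largest count is 1
    println!("ok");
}
```

Checking only the largest count is enough because no count can be below 1: if the maximum is 1, all counts are 1.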
@@ -435,13 +443,9 @@ impl Args {
             if self.flag_no_trim {
                 // case-sensitive, don't trim whitespace
                 for (i, field) in nsel.select(row_buffer.into_iter()).enumerate() {
-                    field_buffer = {
-                        if let Ok(s) = simdutf8::basic::from_utf8(field) {
-                            s.as_bytes().to_vec()
-                        } else {
-                            field.to_vec()
-                        }
-                    };
+                    // no need to convert to string and back to bytes for a "case-sensitive"
+                    // comparison we can just use the field directly
+                    field_buffer = field.to_vec();
 
                     // safety: we do get_unchecked_mut on freq_tables for the same reason above
                     if !field_buffer.is_empty() {
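The simplification in this hunk rests on the fact that validating bytes as UTF-8 and then calling `as_bytes()` returns the same bytes, so both branches of the removed code produced `field.to_vec()`. A small sketch that checks this equivalence, using `std::str::from_utf8` as a stand-in for `simdutf8::basic::from_utf8` so it runs without extra crates (the function names `old_copy` and `new_copy` are hypothetical):

```rust
/// Mirrors the removed logic, with std's validator swapped in for simdutf8.
fn old_copy(field: &[u8]) -> Vec<u8> {
    if let Ok(s) = std::str::from_utf8(field) {
        // Valid UTF-8: the string view exposes the very same bytes.
        s.as_bytes().to_vec()
    } else {
        // Invalid UTF-8: the raw bytes are copied unchanged.
        field.to_vec()
    }
}

/// Mirrors the replacement: copy the field directly.
fn new_copy(field: &[u8]) -> Vec<u8> {
    field.to_vec()
}

fn main() {
    let valid = b" Caf\xc3\xa9 "; // valid UTF-8, whitespace preserved
    let invalid = b"\xff\xfe"; // not valid UTF-8

    assert_eq!(old_copy(valid), new_copy(valid));
    assert_eq!(old_copy(invalid), new_copy(invalid));
    println!("ok");
}
```

Dropping the validation step changes no output in this sketch; it only removes a UTF-8 scan per field.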