
Get nerd-sniped and do some micro-optimizations. #41

Open
wants to merge 11 commits into master
Conversation

@adam-azarchs
Member

Use std::unique_ptr in more places.

Replace C-style arrays with C++ std::array.

Use defaulted constructors/destructors where appropriate.

Make it easier for the compiler to vectorize comparisons in the inner
loop of stitchAlignToTranscript.

Muck around a little with compiler flags.

We're linking statically, and this lets the compiler be better at
dead-code elimination.
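
To illustrate the flavor of these changes for readers outside the codebase, here is a minimal before/after sketch; the class and member names are made up for illustration and are not taken from the STAR sources.

#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>

// Before (roughly): a raw owning pointer plus a C-style array, with a
// hand-written destructor to release the buffer:
//
//   class AlignScratch {
//       uint32_t* scores;   // allocated with new[] in the constructor
//       char bases[4];
//   public:
//       ~AlignScratch() { delete[] scores; }
//   };

// After: std::unique_ptr owns the buffer, std::array replaces the C-style
// array, and the destructor can be defaulted.
class AlignScratch {
public:
    explicit AlignScratch(std::size_t n)
        : scores(std::make_unique<uint32_t[]>(n)) {}
    ~AlignScratch() = default;

private:
    std::unique_ptr<uint32_t[]> scores;
    std::array<char, 4> bases{};
};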
src/lib.rs (Outdated)
Comment on lines 268 to 274
let (chr, pos, cigar, cigar_ops) = if record.len() > 0 {
    let rec = &record[0];
    (rec.tid().to_string(), rec.pos().to_string(), format!("{}", rec.cigar()), rec.cigar().len().to_string())
} else {
    ("NA".to_string(), "NA".to_string(), "NA".to_string(), "NA".to_string())
};
println!("{:?},{:?},{:?},{},{},{},{}", std::str::from_utf8(&read).unwrap(), new_now.duration_since(now), record.len(), chr, pos, cigar, cigar_ops);
@adam-azarchs
Member Author


Lots of unnecessary memcpy going on here.

Suggested change
- let (chr, pos, cigar, cigar_ops) = if record.len() > 0 {
-     let rec = &record[0];
-     (rec.tid().to_string(), rec.pos().to_string(), format!("{}", rec.cigar()), rec.cigar().len().to_string())
- } else {
-     ("NA".to_string(), "NA".to_string(), "NA".to_string(), "NA".to_string())
- };
- println!("{:?},{:?},{:?},{},{},{},{}", std::str::from_utf8(&read).unwrap(), new_now.duration_since(now), record.len(), chr, pos, cigar, cigar_ops);
+ if record.len() > 0 {
+     let rec = &record[0];
+     println!("{:?},{:?},{:?},{},{},{},{}", std::str::from_utf8(&read).unwrap(), new_now.duration_since(now), record.len(), rec.tid(), rec.pos(), rec.cigar(), rec.cigar().len())
+ } else {
+     println!("{:?},{:?},{:?},NA,NA,NA,NA", std::str::from_utf8(&read).unwrap(), new_now.duration_since(now), record.len())
+ };

Though I guess it doesn't really matter if this is just temporary benchmarking code.

But, maybe this should be an actual benchmark?
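
If it does become a real benchmark, a minimal timing harness could look something like the sketch below; alignRead is a hypothetical stand-in for whatever alignment entry point gets measured, not a function that exists in this codebase.

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the real alignment entry point.
bool alignRead(const std::string& seq);

// Time each alignment call individually so per-read variation stays visible,
// then report median and tail latencies in microseconds.
void benchmarkReads(const std::vector<std::string>& reads) {
    using clock = std::chrono::steady_clock;
    std::vector<double> micros;
    micros.reserve(reads.size());
    for (const auto& read : reads) {
        const auto start = clock::now();
        alignRead(read);
        const auto stop = clock::now();
        micros.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    if (micros.empty()) return;
    std::sort(micros.begin(), micros.end());
    std::printf("n=%zu median=%.1fus p99=%.1fus\n", micros.size(),
                micros[micros.size() / 2],
                micros[(micros.size() * 99) / 100]);
}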

@evolvedmicrobe
Contributor

Hey @adam-azarchs, yeah, agreed it doesn't matter since this is just quick benchmarking code. For context, we're spending a bunch of time in ALIGN_AND_COUNT and I have zero faith that it couldn't be made a lot faster, so I was taking a look at the STAR code today.

In general, we take about 280 microseconds to align a read, for a throughput of roughly 3.5K reads per second, which means that for a dataset of ~380M reads we spend almost 30 core-hours getting our alignments out. Obviously, in the cloud that's nothing, but I suspect we could do much better for standalone users and consume less energy.

I benchmarked this branch again and, as before, found that although it's better for readability (and I kind of want to merge it just for that), it didn't move the needle on speed. It was actually ~4% slower on average, but that difference wasn't statistically significant (not shown), so it may have been a little faster or a little slower; either way, it wasn't a game changer.

The plot below shows how long it took to align each of the same 10K reads on master and on your branch: there's a lot of reproducible variation between reads, but not much between branches.

There are a lot of heuristics at work in this code, but stepping through it, it looks like we should have much higher throughput than we do, so I'm looking into that a bit more.

[Plot: per-read alignment time for the same 10K reads on master vs. this branch]

@adam-azarchs
Member Author

I think I know why there may have been a performance regression: the loop vectorizer was having trouble figuring out that comparing two std::array<char, 4> values is just a comparison of two 32-bit values. I added a template specialization to make it easier for the compiler to get that right.
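
A minimal sketch of what that kind of specialization can look like (illustrative only; the actual change lives in the STAR C++ sources and uses different names):

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Generic element-wise comparison.
template <typename T, std::size_t N>
bool arrays_equal(const std::array<T, N>& a, const std::array<T, N>& b) {
    for (std::size_t i = 0; i < N; ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Specialization for the hot case: four chars fit in one 32-bit word, so the
// comparison becomes a single integer compare instead of a byte-wise loop,
// which is easier for the vectorizer to reason about.
template <>
inline bool arrays_equal<char, 4>(const std::array<char, 4>& a,
                                  const std::array<char, 4>& b) {
    std::uint32_t ai, bi;
    std::memcpy(&ai, a.data(), sizeof(ai));  // well-defined, unlike a pointer cast
    std::memcpy(&bi, b.data(), sizeof(bi));
    return ai == bi;
}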

@adam-azarchs
Member Author

I'm wondering whether #53 will make any difference to what we see here.

@evolvedmicrobe
Contributor

I think at the next cellranger-dev meeting we should discuss getting a benchmark you can iterate with. The CPU utilization of ALIGN_AND_COUNT remains quite low, and I think it might be best to divide the algorithmic and implementation optimizations among folks.

@adam-azarchs
Member Author

Low utilization means we're I/O bound, and there's definitely low-hanging fruit to be found there. To start with, I think we need to give the mmap strategy another try, this time with a less buggy implementation.
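
For reference, a minimal sketch of a read-only mmap of a large index file, assuming a POSIX system; the struct and its usage are mine, not the earlier buggy implementation or the actual code paths here.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// RAII wrapper around a read-only memory mapping of a file.
struct MappedFile {
    const char* data = nullptr;
    std::size_t size = 0;

    explicit MappedFile(const std::string& path) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed: " + path);
        struct stat st;
        if (fstat(fd, &st) != 0) {
            close(fd);
            throw std::runtime_error("fstat failed: " + path);
        }
        size = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        madvise(p, size, MADV_SEQUENTIAL);  // hint only; safe to ignore errors
        data = static_cast<const char*>(p);
    }

    ~MappedFile() {
        if (data) munmap(const_cast<char*>(data), size);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;
};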
