State of Data Cache Optimizations #178

Open

andreybokhanko opened this issue Jun 24, 2021 · 5 comments

andreybokhanko commented Jun 24, 2021

Hi BOLT Team,

First of all, thank you again for creating such a wonderful tool! -- we (Huawei) are getting very real performance gains thanks to BOLT. Also, as you probably noticed, we are filling the remaining gaps and contributing our changes back upstream. I hope that the rate of our contributions will only increase in the near future.

The next frontier for us is data cache optimizations. Apparently, BOLT already has some support (the options starting with -reorder-data), but it looks like it's not as mature as the code cache optimizations.

Could you please elaborate a bit more on the current state of the data cache-directed optimizations? Specifically:

  • What performance gains are you getting from them? (The original paper (https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) only dissects gains from the code cache-directed ones.)
  • Have you tried to estimate the potential performance gains? If so, can you share these estimates?
  • Are there any active developments in this area?
  • I suppose the existing optimizations deal with statically allocated objects only. Have you explored optimizing dynamically allocated objects as well? -- for example, by using some dynamic library that can do the magic based on previously collected profile data.
  • If there is already a good summary of the state of things in this area, please kindly point me to it.

We may start some development in this direction, and obviously we don't want to simply repeat what you've already covered.

Thank you!

Yours,
Andrey
===
Advanced Software Technology Lab
Huawei

maksfb commented Jun 24, 2021

Hi Andrey,

Really happy that you see the performance gains from using BOLT and even happier that you are willing to improve BOLT with your current and upcoming contributions!

Data cache optimizations have been on our radar for quite some time. We started with a jump table placement optimization (-jump-tables={move|split|aggressive}) and then added a more general read-only static data reordering optimization. The latter feature didn't become mainstream, as we realized that we need more info from the linker/compiler for it to become reliable and efficient.
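
For reference, a rough sketch of how these options can be combined on the command line (perf.fdata is a placeholder profile name, and the exact flag spellings may differ between BOLT versions):

```
# Sketch only: flag spellings and defaults vary across BOLT versions.
llvm-bolt app -o app.bolt -data=perf.fdata \
    -jump-tables=move \
    -reorder-data=.rodata    # argument format for -reorder-data is assumed
```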

The gains from static data reordering depend heavily on the application. We haven't measured them on any open-source benchmark, hence the lack of data in our publications. We did see a decrease in D$ misses with jump table splitting for some of our workloads.

For a real-world estimate, you will have to look at D$/TLB miss rates and their distribution in the address space. With special counters (perf record -d ...), you should see how many misses are happening in statically allocated areas vs. dynamically allocated ones and the stack.
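
For example, something along these lines should give a first-order breakdown (the event names are CPU/kernel specific, and ./app is a placeholder for your workload):

```
# Overall D$/TLB miss rates (event names vary by CPU and kernel):
perf stat -e L1-dcache-load-misses,dTLB-load-misses ./app

# Sample memory accesses with their data addresses; perf mem wraps
# `perf record -d` with suitable memory events:
perf mem record ./app
perf mem report    # inspect where in the address space the misses land
```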

In the near future, we'll be looking into static data reordering in both LLD and BOLT. We will need additional support from the linker for the feature to be usable in BOLT. And compiling with -fdata-sections will very likely be a prerequisite.

For making dynamic memory allocations cache-friendly, you'll need more than just BOLT. E.g., the HALO paper (https://dl.acm.org/doi/10.1145/3368826.3377914) uses BOLT and additionally has to modify heap allocators. The following RFC from Google uses a sanitizer-based technique for profiling heap allocations: https://groups.google.com/g/llvm-dev/c/0PN-rBV9WAs/m/MfF8OmJIAQAJ.

Cheers,
Maksim

andreybokhanko commented Jun 25, 2021

Hi Maksim,

Thank you for the quick and informative reply!

> Data cache optimizations have been on our radar for quite some time. We started with a jump table placement optimization (-jump-tables={move|split|aggressive}) and then added a more general read-only static data reordering optimization. The latter feature didn't become mainstream, as we realized that we need more info from the linker/compiler for it to become reliable and efficient.

> The gains from static data reordering depend heavily on the application. We haven't measured them on any open-source benchmark, hence the lack of data in our publications. We did see a decrease in D$ misses with jump table splitting for some of our workloads.

That's encouraging to know!

The biggest question mark for us is whether this direction is worth investing our limited resources in at all.

> For a real-world estimate, you will have to look at D$/TLB miss rates and their distribution in the address space. With special counters (perf record -d ...), you should see how many misses are happening in statically allocated areas vs. dynamically allocated ones and the stack.

Makes total sense. Will do.

> In the near future, we'll be looking into static data reordering in both LLD and BOLT. We will need additional support from the linker for the feature to be usable in BOLT. And compiling with -fdata-sections will very likely be a prerequisite.

Cool! Please keep us in the loop; we might contribute -- at least with testing on a different corpus of applications, and most likely with code as well.

> For making dynamic memory allocations cache-friendly, you'll need more than just BOLT. E.g., the HALO paper (https://dl.acm.org/doi/10.1145/3368826.3377914) uses BOLT and additionally has to modify heap allocators. The following RFC from Google uses a sanitizer-based technique for profiling heap allocations: https://groups.google.com/g/llvm-dev/c/0PN-rBV9WAs/m/MfF8OmJIAQAJ.

Thanks for these pointers!

Yours,
Andrey

andreybokhanko commented Jul 5, 2021

Hi Maksim,

One more thing, if you don't mind.

> Data cache optimizations have been on our radar for quite some time. We started with a jump table placement optimization (-jump-tables={move|split|aggressive}) and then added a more general read-only static data reordering optimization. The latter feature didn't become mainstream, as we realized that we need more info from the linker/compiler for it to become reliable and efficient.

Could you please elaborate a bit on what exactly you need from the linker/compiler?

My naive understanding is that BOLT can move statically allocated data as it pleases, no?

Yours,
Andrey

maksfb commented Jul 7, 2021

Hi Andrey,

> Could you please elaborate a bit on what exactly you need from the linker/compiler?
> My naive understanding is that BOLT can move statically allocated data as it pleases, no?

In the compiler, we would like to enable -fdata-sections, which forces it to put every data object in its own section. As for the linker, we would like it to start preserving the original section information.

Suppose you have a local object that is referenced with some offset, DataObject + 0x100. Once the local object's symbol is removed by the linker/compiler, we are left with a reference to the containing section and a new offset, e.g. .rodata + 0x3000. The object located at this address may have nothing to do with the original DataObject, and once we move/re-arrange data objects, the new reference would be completely off. Hence the requirements above. I haven't looked into whether we could use DWARF info to track data references, but I'm hesitant to rely on debug info correctness even if we could.
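
To make this concrete, here is a hypothetical way to observe the difference on an object file (foo.c and the object names are made up):

```
# Without -fdata-sections, local read-only objects are merged into a
# single .rodata, and references may survive only as section-relative
# relocations (e.g. against .rodata + 0x3000):
clang -O2 -c foo.c
readelf -r foo.o | grep rodata

# With -fdata-sections, every object lands in its own section (e.g.
# .rodata.DataObject), so a reference stays attributable to its object:
clang -O2 -fdata-sections -c foo.c
readelf -S foo.o
```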

Cheers,
Maksim

andreybokhanko commented

Hi Maksim,

Thanks! -- it's clear now.

> I haven't looked into whether we could use DWARF info to track data references, but I'm hesitant to rely on debug info correctness even if we could.

I agree that relying on debug info is a non-starter. If anything, I doubt all the optimizations preserve it fully correctly.

Yours,
Andrey
