
Manage basic block with adaptive replacement cache #105

Closed
jserv opened this issue Jan 5, 2023 · 8 comments

@jserv
Contributor

jserv commented Jan 5, 2023

Current basic block management consumes a significant amount of memory; frequent map allocation and release leads to unnecessary waste. Adaptive Replacement Cache (ARC) is a page replacement algorithm with better performance than least recently used (LRU): it keeps track of both frequently used and recently used pages, plus a recent eviction history for each. Better memory usage and hit rates can be achieved once the translated blocks are handled by ARC.

Quote from ZFS Caching:

  • Combines the best of LRU and LFU, plus some novel tricks
  • The cache size (c) is partitioned (p) into two sections
  • At the start, p = ½ c, first half of the cache is LRU, second LFU
  • In addition to the two caches, there is a “ghost” list for each
  • Each time an item is evicted from either cache, its key (but not its data) moves to the ghost list for that cache
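The adaptation between the two halves can be sketched in C. This is a minimal, hypothetical illustration of the rule ARC uses to move the partition target `p` on ghost-list hits; the names and types are illustrative, not taken from any actual implementation:

```c
#include <stddef.h>

/* Sketch of ARC's adaptation step: the target size p of the recency
 * partition (T1) within a cache of capacity c is nudged whenever a lookup
 * hits one of the ghost lists (B1 shadows T1, B2 shadows T2). */
typedef struct {
    size_t c;  /* total cache capacity            */
    size_t p;  /* target size of the T1 partition */
    size_t b1; /* current length of ghost list B1 */
    size_t b2; /* current length of ghost list B2 */
} arc_state_t;

/* Hit in B1: a recently evicted "recent" item was wanted again, so grow p.
 * A B1 hit implies b1 >= 1, so the division is safe. */
static void arc_adapt_b1_hit(arc_state_t *s)
{
    size_t delta = (s->b1 >= s->b2) ? 1 : s->b2 / s->b1;
    s->p = (s->p + delta > s->c) ? s->c : s->p + delta;
}

/* Hit in B2: a recently evicted "frequent" item was wanted again, so shrink p. */
static void arc_adapt_b2_hit(arc_state_t *s)
{
    size_t delta = (s->b2 >= s->b1) ? 1 : s->b1 / s->b2;
    s->p = (delta > s->p) ? 0 : s->p - delta;
}
```

On replacement, entries then leave T1 when its length exceeds `p`, and leave T2 otherwise, so the cache continuously re-balances between recency and frequency.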

Expected output:

  1. Create cache.[ch], which consists of a clean API. See map.[ch] for reference.
  2. Implement adaptive replacement cache in cache.[ch] along with configurable capacity.
  3. Integrate ARC in basic block management.
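One possible shape for the cache.[ch] interface, modeled loosely on the map.[ch] convention. All names here are assumptions, not the actual rv32emu API; the backing store is a tiny linear-scan LRU used only as a stand-in so the interface is runnable, while the real cache.[ch] would implement ARC behind the same calls:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache API sketch; the eviction policy here is plain LRU,
 * standing in for the eventual ARC implementation. */
typedef struct {
    uint32_t key;   /* e.g. basic-block PC           */
    void *value;    /* e.g. pointer to translated block */
    uint64_t stamp; /* last-use tick for LRU          */
    int used;
} cache_entry_t;

typedef struct {
    cache_entry_t *slots;
    size_t capacity; /* configurable capacity (expected output, item 2) */
    uint64_t tick;
} cache_t;

cache_t *cache_create(size_t capacity)
{
    cache_t *c = malloc(sizeof(cache_t));
    c->slots = calloc(capacity, sizeof(cache_entry_t));
    c->capacity = capacity;
    c->tick = 0;
    return c;
}

void *cache_get(cache_t *c, uint32_t key)
{
    for (size_t i = 0; i < c->capacity; i++)
        if (c->slots[i].used && c->slots[i].key == key) {
            c->slots[i].stamp = ++c->tick;
            return c->slots[i].value;
        }
    return NULL;
}

/* Insert key/value; evicts the least recently used entry when full.
 * Returns the evicted value (or NULL) so the caller can free the block. */
void *cache_put(cache_t *c, uint32_t key, void *value)
{
    size_t victim = 0;
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < c->capacity; i++) {
        if (!c->slots[i].used) {
            victim = i;
            break;
        }
        if (c->slots[i].stamp < oldest) {
            oldest = c->slots[i].stamp;
            victim = i;
        }
    }
    void *evicted = c->slots[victim].used ? c->slots[victim].value : NULL;
    c->slots[victim] = (cache_entry_t){key, value, ++c->tick, 1};
    return evicted;
}

void cache_free(cache_t *c)
{
    free(c->slots);
    free(c);
}
```

Returning the evicted value from `cache_put` keeps ownership with the caller, which matters once blocks come from a memory pool rather than malloc.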

Reference:

@qwe661234
Collaborator

To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation. We suppose the issue is that the overhead of cache operations is too high.

  • CoreMark

| Model | Compiler | a304446 | ARC | Speedup |
| --- | --- | --- | --- | --- |
| Core i7-8700 | clang-15 | 971.951 | 895.248 | -8.6% |
| Core i7-8700 | gcc-12 | 963.336 | 884.865 | -8.9% |
| eMAG 8180 | clang-15 | 335.396 | 300.218 | -11.7% |
| eMAG 8180 | gcc-12 | 332.561 | 305.406 | -8.9% |

@jserv
Contributor Author

jserv commented Feb 9, 2023

> To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation.

You should check memory usage for each RISC-V emulation-related process. Considering that CoreMark has fixed allocations, it might not be an appropriate goal. Choose some programs from the "tests" directory, such as nqueens. Here, we care about memory efficiency rather than iterations per second.

@qwe661234
Collaborator

qwe661234 commented Feb 16, 2023

> > To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation.
>
> You should check memory usage for each RISC-V emulation-related process. Considering that CoreMark has fixed allocations, it might not be an appropriate goal. Choose some programs from the "tests" directory, such as nqueens. Here, we care about memory efficiency rather than iterations per second.

After examining the number of blocks created among the test cases, CoreMark created the most, so I continue to use CoreMark as the test target. The following data, generated by Valgrind, shows a memory usage analysis while running CoreMark. The difference in total heap usage between commit 72bbb63 and the commit with ARC is approximately 20%.

| Commit | allocate/free times | Total heap usage (bytes) |
| --- | --- | --- |
| 72bbb63 | 3459 | 42,876,901 |
| commit with ARC | 2902 | 35,945,813 |

@sysprog21 deleted a comment from qwe661234 Feb 16, 2023
@jserv
Contributor Author

jserv commented Feb 16, 2023

Furthermore, let's shrink memory utilization by removing blocks that contain too few instructions.
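A hedged sketch of this idea in C: translated blocks whose instruction count falls below a threshold are released (to be re-translated on demand) instead of occupying cache and pool space. The types and the threshold are illustrative assumptions, not taken from rv32emu:

```c
#include <stddef.h>

#define MIN_BLOCK_INSNS 4 /* assumed tuning knob, not an rv32emu constant */

typedef struct {
    size_t n_insn; /* number of IR instructions in the block */
    int live;      /* still resident in the cache            */
} block_t;

/* Release blocks with fewer than MIN_BLOCK_INSNS instructions.
 * Returns how many blocks were released. */
static size_t prune_short_blocks(block_t *blocks, size_t n)
{
    size_t freed = 0;
    for (size_t i = 0; i < n; i++)
        if (blocks[i].live && blocks[i].n_insn < MIN_BLOCK_INSNS) {
            blocks[i].live = 0; /* real code would return it to the pool */
            freed++;
        }
    return freed;
}
```

The trade-off is re-translation cost for the pruned blocks versus the memory they would otherwise pin, so the threshold would need tuning against real workloads.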

@qwe661234
Collaborator

qwe661234 commented Feb 25, 2023

To manage blocks efficiently and lower memory usage, I implemented two mechanisms.

  1. Add an adaptive replacement cache and introduce a memory pool to manage blocks and IR.
  2. Merge predicted blocks.
    We find that many blocks contain only a small number of instructions, so we increase the number of instructions in a block by merging blocks with their predicted successors.
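The merge step above can be sketched as follows. This is a minimal illustration under assumed types (ints stand in for real IR instructions); field names are hypothetical, not the rv32emu structures:

```c
#include <stddef.h>
#include <string.h>

#define MAX_IR 64 /* assumed per-block IR capacity for the sketch */

typedef struct block {
    int ir[MAX_IR];        /* IR instructions of the block        */
    size_t n_ir;
    struct block *predict; /* predicted successor block, if known */
} block_t;

/* Append the predicted successor's IR sequence to this block so one longer
 * block is executed. Returns 1 on success, 0 if there is no prediction or
 * the merged sequence would not fit. */
static int merge_predict_block(block_t *b)
{
    block_t *succ = b->predict;
    if (!succ || b->n_ir + succ->n_ir > MAX_IR)
        return 0;
    memcpy(&b->ir[b->n_ir], succ->ir, succ->n_ir * sizeof(int));
    b->n_ir += succ->n_ir;
    b->predict = succ->predict; /* chain any further prediction */
    return 1;
}
```

Chaining `predict` after a merge lets repeated calls keep growing the block along the predicted path until the capacity check stops them.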

The following data compares the performance and memory usage of the baseline and these two mechanisms. The benchmark tool CoreMark generated the performance data, and Valgrind provided the memory usage analysis while running CoreMark.

  • Baseline

| Model | Compiler | Iterations/sec | Memory usage (bytes) |
| --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 993.958209 | 43,172,197 |
| Core i7-8700 | clang-15 | 971.2947286 | 42,778,469 |
| eMAG 8180 | gcc-12 | 325.248602 | 43,172,160 |
| eMAG 8180 | clang-15 | 331.404806 | 43,245,984 |

  • Merge predict block, commit 033d99d

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 1072.082956 | 71,221,477 | +7.9% | -64% |
| Core i7-8700 | clang-15 | 1051.587435 | 71,508,421 | +8.3% | -67% |
| eMAG 8180 | gcc-12 | 350.570987 | 71,508,384 | +7.8% | -66% |
| eMAG 8180 | clang-15 | 354.28749 | 71,549,376 | +6.9% | -65% |

  • ARC + block & IR memory pool, commit ac21bc9

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 889.465393 | 805,685 | -11.7% | +5258% |
| Core i7-8700 | clang-15 | 877.8968544 | 805,685 | -10.6% | +5210% |
| eMAG 8180 | gcc-12 | 318.221017 | 805,648 | -2.2% | +5259% |
| eMAG 8180 | clang-15 | 322.088447 | 805,648 | -2.9% | +5268% |

  • ARC + block & IR memory pool + merge predict block, commit 03e2e96

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 993.958209 | 805,685 | -1.7% | +5258% |
| Core i7-8700 | clang-15 | 971.2947286 | 805,685 | -7.2% | +5210% |
| eMAG 8180 | gcc-12 | 325.248602 | 805,648 | -2.3% | +5259% |
| eMAG 8180 | clang-15 | 331.404806 | 805,648 | -3.3% | +5268% |

The statistics demonstrate that there is just a 1–7% performance reduction while we save more than 52× the memory usage (42 MB down to 805 KB) if we combine these two mechanisms.

@jserv
Contributor Author

jserv commented Feb 25, 2023

> The statistics demonstrate that there is just a 1–7% performance reduction while we save more than 52× the memory usage (42 MB down to 805 KB) if we combine these two mechanisms.

It looks promising. Let's concentrate on this approach. Do evaluate #94 for a deterministic memory pool implementation.

@jserv
Contributor Author

jserv commented Feb 26, 2023

Once we refine the implementation for the combination of code block caching and the memory pool, we can move on to #104 for EBB exploration.

@jserv
Contributor Author

jserv commented Apr 20, 2023

Considered completed.

@jserv jserv closed this as completed Apr 20, 2023