
Manage basic block with adaptive replacement cache #105

Closed
jserv opened this issue Jan 5, 2023 · 8 comments

@jserv
Contributor

jserv commented Jan 5, 2023

Current basic block management consumes a significant amount of memory; frequent map allocation and release leads to unnecessary waste. Adaptive Replacement Cache (ARC) is a page replacement algorithm with better performance than least recently used (LRU): it keeps track of both frequently used and recently used pages, plus a recent eviction history for each. Better memory usage and hit rates can be achieved once the translated blocks are handled by ARC.

Quote from ZFS Caching:

  • Combines the best of LRU and LFU, plus some novel tricks
  • The cache size (c) is partitioned (p) into two sections
  • At the start, p = ½ c, first half of the cache is LRU, second LFU
  • In addition to the two caches, there is a “ghost” list for each
  • Each time an item is evicted from either cache, its key (but not its data) moves to the ghost list for that cache
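The adaptation between the two halves can be sketched in C. This is a minimal, hypothetical illustration of the rule ARC uses to move the partition target `p` on ghost-list hits; the names and types are illustrative, not taken from any actual implementation:

```c
#include <stddef.h>

/* Sketch of ARC's adaptation step: the target size p of the recency
 * partition (T1) within a cache of capacity c is nudged whenever a lookup
 * hits one of the ghost lists (B1 shadows T1, B2 shadows T2). */
typedef struct {
    size_t c;  /* total cache capacity            */
    size_t p;  /* target size of the T1 partition */
    size_t b1; /* current length of ghost list B1 */
    size_t b2; /* current length of ghost list B2 */
} arc_state_t;

/* Hit in B1: a recently evicted "recent" item was wanted again, so grow p.
 * A B1 hit implies b1 >= 1, so the division is safe. */
static void arc_adapt_b1_hit(arc_state_t *s)
{
    size_t delta = (s->b1 >= s->b2) ? 1 : s->b2 / s->b1;
    s->p = (s->p + delta > s->c) ? s->c : s->p + delta;
}

/* Hit in B2: a recently evicted "frequent" item was wanted again, so shrink p. */
static void arc_adapt_b2_hit(arc_state_t *s)
{
    size_t delta = (s->b2 >= s->b1) ? 1 : s->b1 / s->b2;
    s->p = (delta > s->p) ? 0 : s->p - delta;
}
```

On replacement, entries then leave T1 when its length exceeds `p`, and leave T2 otherwise, so the cache continuously re-balances between recency and frequency.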

Expected output:

  1. Create cache.[ch], which consists of a clean API. See map.[ch] for reference.
  2. Implement adaptive replacement cache in cache.[ch] along with configurable capacity.
  3. Integrate ARC in basic block management.
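One possible shape for the cache.[ch] interface, modeled loosely on the map.[ch] convention. All names here are assumptions, not the actual rv32emu API; the backing store is a tiny linear-scan LRU used only as a stand-in so the interface is runnable, while the real cache.[ch] would implement ARC behind the same calls:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache API sketch; the eviction policy here is plain LRU,
 * standing in for the eventual ARC implementation. */
typedef struct {
    uint32_t key;   /* e.g. basic-block PC           */
    void *value;    /* e.g. pointer to translated block */
    uint64_t stamp; /* last-use tick for LRU          */
    int used;
} cache_entry_t;

typedef struct {
    cache_entry_t *slots;
    size_t capacity; /* configurable capacity (expected output, item 2) */
    uint64_t tick;
} cache_t;

cache_t *cache_create(size_t capacity)
{
    cache_t *c = malloc(sizeof(cache_t));
    c->slots = calloc(capacity, sizeof(cache_entry_t));
    c->capacity = capacity;
    c->tick = 0;
    return c;
}

void *cache_get(cache_t *c, uint32_t key)
{
    for (size_t i = 0; i < c->capacity; i++)
        if (c->slots[i].used && c->slots[i].key == key) {
            c->slots[i].stamp = ++c->tick;
            return c->slots[i].value;
        }
    return NULL;
}

/* Insert key/value; evicts the least recently used entry when full.
 * Returns the evicted value (or NULL) so the caller can free the block. */
void *cache_put(cache_t *c, uint32_t key, void *value)
{
    size_t victim = 0;
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < c->capacity; i++) {
        if (!c->slots[i].used) {
            victim = i;
            break;
        }
        if (c->slots[i].stamp < oldest) {
            oldest = c->slots[i].stamp;
            victim = i;
        }
    }
    void *evicted = c->slots[victim].used ? c->slots[victim].value : NULL;
    c->slots[victim] = (cache_entry_t){key, value, ++c->tick, 1};
    return evicted;
}

void cache_free(cache_t *c)
{
    free(c->slots);
    free(c);
}
```

Returning the evicted value from `cache_put` keeps ownership with the caller, which matters once blocks come from a memory pool rather than malloc.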

Reference:

@qwe661234
Collaborator

To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation. We suppose the issue is that the overhead of cache operations is too high.

  • CoreMark

| Model | Compiler | a304446 | ARC | Speedup |
| --- | --- | --- | --- | --- |
| Core i7-8700 | clang-15 | 971.951 | 895.248 | -8.6% |
| Core i7-8700 | gcc-12 | 963.336 | 884.865 | -8.9% |
| eMAG 8180 | clang-15 | 335.396 | 300.218 | -11.7% |
| eMAG 8180 | gcc-12 | 332.561 | 305.406 | -8.9% |

@jserv
Contributor Author

jserv commented Feb 9, 2023

> To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation.

You should check memory usage for each RISC-V emulation-related process. Considering that CoreMark has fixed allocations, it might not be an appropriate goal. Choose some programs from the "tests" directory, such as nqueens. Here, we care about memory efficiency rather than iterations per second.

@qwe661234
Collaborator

qwe661234 commented Feb 16, 2023

> > To manage basic blocks, this commit implements the Adaptive Replacement Cache (ARC) and basic profiling tools to show cache information. However, even though the cache hit rate when running CoreMark is above 99%, the performance is worse than the previous implementation.
>
> You should check memory usage for each RISC-V emulation-related process. Considering that CoreMark has fixed allocations, it might not be an appropriate goal. Choose some programs from the "tests" directory, such as nqueens. Here, we care about memory efficiency rather than iterations per second.

After examining the number of blocks created among the test cases, CoreMark created the most, so I continue to use CoreMark as the test target. The following data, generated by Valgrind, shows a memory usage analysis while running CoreMark. The difference in total heap usage between commit 72bbb63 and the commit with ARC is approximately 20%.

| Commit | allocate/free times | Total heap usage (bytes) |
| --- | --- | --- |
| 72bbb63 | 3459 | 42,876,901 |
| commit with ARC | 2902 | 35,945,813 |

@sysprog21 deleted a comment from qwe661234 Feb 16, 2023
@jserv
Contributor Author

jserv commented Feb 16, 2023

Furthermore, let's shrink memory utilization by removing blocks that contain too few instructions.
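A hedged sketch of this idea in C: translated blocks whose instruction count falls below a threshold are released (to be re-translated on demand) instead of occupying cache and pool space. The types and the threshold are illustrative assumptions, not taken from rv32emu:

```c
#include <stddef.h>

#define MIN_BLOCK_INSNS 4 /* assumed tuning knob, not an rv32emu constant */

typedef struct {
    size_t n_insn; /* number of IR instructions in the block */
    int live;      /* still resident in the cache            */
} block_t;

/* Release blocks with fewer than MIN_BLOCK_INSNS instructions.
 * Returns how many blocks were released. */
static size_t prune_short_blocks(block_t *blocks, size_t n)
{
    size_t freed = 0;
    for (size_t i = 0; i < n; i++)
        if (blocks[i].live && blocks[i].n_insn < MIN_BLOCK_INSNS) {
            blocks[i].live = 0; /* real code would return it to the pool */
            freed++;
        }
    return freed;
}
```

The trade-off is re-translation cost for the pruned blocks versus the memory they would otherwise pin, so the threshold would need tuning against real workloads.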

@qwe661234
Collaborator

qwe661234 commented Feb 25, 2023

To manage blocks efficiently and lower memory usage, I implemented two mechanisms.

  1. Add an adaptive replacement cache and introduce a memory pool to manage blocks and IR.
  2. Merge predicted blocks.
    We find that many blocks contain only a small number of instructions, so we increase the number of instructions in a block by merging blocks with their predicted successors.
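The merge step above can be sketched as follows. This is a minimal illustration under assumed types (ints stand in for real IR instructions); field names are hypothetical, not the rv32emu structures:

```c
#include <stddef.h>
#include <string.h>

#define MAX_IR 64 /* assumed per-block IR capacity for the sketch */

typedef struct block {
    int ir[MAX_IR];        /* IR instructions of the block        */
    size_t n_ir;
    struct block *predict; /* predicted successor block, if known */
} block_t;

/* Append the predicted successor's IR sequence to this block so one longer
 * block is executed. Returns 1 on success, 0 if there is no prediction or
 * the merged sequence would not fit. */
static int merge_predict_block(block_t *b)
{
    block_t *succ = b->predict;
    if (!succ || b->n_ir + succ->n_ir > MAX_IR)
        return 0;
    memcpy(&b->ir[b->n_ir], succ->ir, succ->n_ir * sizeof(int));
    b->n_ir += succ->n_ir;
    b->predict = succ->predict; /* chain any further prediction */
    return 1;
}
```

Chaining `predict` after a merge lets repeated calls keep growing the block along the predicted path until the capacity check stops them.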

The following data compares the performance and memory usage of the baseline and these two mechanisms. The benchmark tool CoreMark generated the performance data, and Valgrind provided the memory usage analysis while running CoreMark.

  • Baseline

| Model | Compiler | Iterations/sec | Memory usage (bytes) |
| --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 993.958209 | 43,172,197 |
| Core i7-8700 | clang-15 | 971.2947286 | 42,778,469 |
| eMAG 8180 | gcc-12 | 325.248602 | 43,172,160 |
| eMAG 8180 | clang-15 | 331.404806 | 43,245,984 |

  • Merge predict block, commit 033d99d

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 1072.082956 | 71,221,477 | +7.9% | -64% |
| Core i7-8700 | clang-15 | 1051.587435 | 71,508,421 | +8.3% | -67% |
| eMAG 8180 | gcc-12 | 350.570987 | 71,508,384 | +7.8% | -66% |
| eMAG 8180 | clang-15 | 354.28749 | 71,549,376 | +6.9% | -65% |

  • ARC + block & IR memory pool, commit ac21bc9

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 889.465393 | 805,685 | -11.7% | +5258% |
| Core i7-8700 | clang-15 | 877.8968544 | 805,685 | -10.6% | +5210% |
| eMAG 8180 | gcc-12 | 318.221017 | 805,648 | -2.2% | +5259% |
| eMAG 8180 | clang-15 | 322.088447 | 805,648 | -2.9% | +5268% |

  • ARC + block & IR memory pool + merge predict block, commit 03e2e96

| Model | Compiler | Iterations/sec | Memory usage (bytes) | Speedup | Memory saved |
| --- | --- | --- | --- | --- | --- |
| Core i7-8700 | gcc-12 | 993.958209 | 805,685 | -1.7% | +5258% |
| Core i7-8700 | clang-15 | 971.2947286 | 805,685 | -7.2% | +5210% |
| eMAG 8180 | gcc-12 | 325.248602 | 805,648 | -2.3% | +5259% |
| eMAG 8180 | clang-15 | 331.404806 | 805,648 | -3.3% | +5268% |

The statistics demonstrate that there is just a 1–7% performance reduction while we save more than 52× the memory usage (42 MB down to 805 KB) if we combine these two mechanisms.

@jserv
Contributor Author

jserv commented Feb 25, 2023

> The statistics demonstrate that there is just a 1–7% performance reduction while we save more than 52× the memory usage (42 MB down to 805 KB) if we combine these two mechanisms.

It looks promising. Let's concentrate on this approach. Do evaluate #94 for a deterministic memory pool implementation.

@jserv
Contributor Author

jserv commented Feb 26, 2023

Once we refine the implementation for the combination of code block caching and the memory pool, we can move on to #104 for EBB exploration.

@jserv
Contributor Author

jserv commented Apr 20, 2023

Considered completed.

@jserv jserv closed this as completed Apr 20, 2023