HMSDK v2.0 Performance Results

Important

Please read Performance Results instead because the current document is deprecated and no longer valid.

We assessed the effectiveness of HMSDK 2.0, particularly under high memory pressure. HMSDK 2.0 aims to reduce the performance slowdown caused by misplaced hot/cold data and limited DRAM capacity. Please note that this evaluation can also be found in this LKML cover letter.

1. Environments

Our experimental setup consists of a NUMA node with local DRAM and a NUMA node with CXL/PCIe-attached DRAM, referred to below simply as a CXL memory node:

CPU: 64 cores (including SMT)
node0: local DRAM, 512GB with a CPU socket (fast tier)
node1: disabled
node2: CXL DRAM, 96GB, no CPU attached (slow tier)

2. Workload

The evaluation was done using redis, a widely used in-memory database, and YCSB, a tool for generating memory access patterns. We used zipfian (requestdistribution=zipfian) and latest (requestdistribution=latest) distributions based on one of the core workloads of YCSB, workloadc. To make memory usage higher and execution time longer, we set recordcount=30000000 and operationcount=5000000.
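The property overrides named above can be collected in one place. The following is a minimal sketch, not the authors' script: the helper name `ycsb_property_args` is hypothetical, and only the three properties stated in the text are set, with everything else left to the defaults of workloadc.

```python
from typing import Dict, List

# Values stated in the text; all other properties come from workloadc itself.
BASE_OVERRIDES: Dict[str, str] = {
    "recordcount": "30000000",
    "operationcount": "5000000",
}


def ycsb_property_args(distribution: str) -> List[str]:
    """Return `-p key=value` arguments that override workloadc for one run."""
    if distribution not in ("zipfian", "latest"):
        raise ValueError(f"unexpected distribution: {distribution}")
    overrides = dict(BASE_OVERRIDES, requestdistribution=distribution)
    args: List[str] = []
    for key, value in overrides.items():
        args += ["-p", f"{key}={value}"]
    return args
```

Assuming a standard YCSB checkout, these arguments would be appended to a command such as `bin/ycsb run redis -P workloads/workloadc`. The exact command line used in the evaluation is not given in this document.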

We assumed that datacenters hold a substantial amount of cold memory, as the TMO and TPP papers report. Therefore, we pre-allocate cold memory externally using mmap and memset before launching redis-server, to simulate a real-world scenario where large amounts of cold data and memory pressure exist.
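The cold memory pre-allocation can be sketched as follows. This is an illustration under assumptions, not the tool used in the evaluation: the function name `allocate_cold_memory`, the fill byte, and the 1 MiB write granularity are all hypothetical choices. It maps private anonymous memory (the mmap part) and writes to every byte so that the pages are physically backed (the memset part).

```python
import mmap

CHUNK = 1 << 20  # write granularity (assumption: 1 MiB keeps the loop cheap)


def allocate_cold_memory(size_bytes: int, fill: int = 0x01) -> mmap.mmap:
    """Map private anonymous memory and touch all of it, like mmap + memset."""
    region = mmap.mmap(
        -1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    )
    pattern = bytes([fill]) * CHUNK
    for offset in range(0, size_bytes, CHUNK):
        end = min(offset + CHUNK, size_bytes)
        # Writing forces the kernel to back these pages with physical memory.
        region[offset:end] = pattern[: end - offset]
    return region
```

The caller has to keep the returned object alive and then sleep, because the memory is released once the mapping is closed or the process exits. The document does not say how the allocation was placed on the DRAM node; one option (an assumption) is to run the script under `numactl --membind=0`.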

The evaluation sequence is as follows.

1. Turn on HMSDK 2.0 with the DAMOS_DEMOTE action for the DRAM node and the DAMOS_PROMOTE action for the CXL node. It demotes cold pages on the DRAM node and promotes hot pages on the CXL node at regular intervals.
2. Allocate a huge block of cold memory on the fast tier DRAM node by calling mmap and memset, then put the process to sleep so that the fast tier has insufficient memory for redis-server.
3. Launch redis-server and load the prebaked snapshot image, dump.rdb. The redis-server consumes 52GB of anon pages and 33GB of file pages, but because of the cold memory allocated in step 2, it cannot place all of its memory on the fast tier DRAM node, so the remainder is allocated on the slow tier CXL node. The DRAM:CXL ratio depends on the size of the pre-allocated cold memory.
4. Run YCSB to generate a zipfian or latest distribution of memory accesses to redis-server, then measure its execution time when it completes.
5. Repeat step 4 over 50 times to measure the average execution time for each run.
6. Increase the cold memory size, then go back to step 2.

Repeating the same test set multiple times does not show much difference.
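The repeat-and-average step in the sequence above can be sketched as a small timing harness. The name `average_execution_time` and the `run_once` callable are hypothetical; in the real evaluation `run_once` would stand for one complete YCSB run against redis-server, and the default of 50 repeats is only an assumption based on the count mentioned in the text.

```python
import statistics
import time
from typing import Callable, List


def average_execution_time(
    run_once: Callable[[], None], repeats: int = 50
) -> float:
    """Time `run_once` `repeats` times and return the mean wall-clock seconds."""
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```

Averaging over many runs smooths out run-to-run noise, which matters here because the differences being compared are only a few percent of the execution time.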

3. Results

All the result values are normalized to the execution time of DRAM-only. The DRAM-only execution time is the ideal result without being affected by the performance gap between DRAM and CXL. Each test result is based on the execution environment as follows.

DRAM-only : redis-server uses only local DRAM memory.
CXL-only : redis-server uses only CXL memory.
default : default memory policy (MPOL_DEFAULT) with NUMA balancing disabled.
HMSDK 2.0 : DAMON enabled with DAMOS_DEMOTE for DRAM nodes and DAMOS_PROMOTE for CXL nodes.

YCSB zipfian distribution, read-only workload, under memory pressure from cold memory on node0 (512GB of local DRAM)

=============+================================================+=========
             |       cold memory occupied by mmap and memset  |
             |   0G  440G  450G  460G  470G  480G  490G  500G |
=============+================================================+=========
Execution time normalized to DRAM-only values                 | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
CXL-only     | 1.21     -     -     -     -     -     -     - | 1.21
default      |    -  1.09  1.10  1.13  1.15  1.18  1.21  1.21 | 1.15
HMSDK 2.0    |    -  1.02  1.04  1.05  1.04  1.05  1.05  1.06 | 1.04
=============+================================================+=========
CXL usage of redis-server in GB                               | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
default      |    -  19.4  26.1  32.3  38.5  44.7  50.5  50.3 | 37.4
HMSDK 2.0    |    -   0.1   1.6   5.2   8.0   9.1  11.8  13.6 |  7.1
=============+================================================+=========

The above result shows that the "default" execution time goes up as the cold memory size increases from 440G to 500G: the more cold memory is allocated, the more CXL memory the target redis workload has to use, which increases the execution time.

However, HMSDK 2.0 shows less slowdown because the DAMOS_DEMOTE action on the DRAM node proactively demotes the pre-allocated cold memory to the CXL node, and the DRAM freed this way increases the chance that hot or warm pages of redis-server are allocated on the fast DRAM node. Moreover, the DAMOS_PROMOTE action on the CXL node actively promotes hot pages of redis-server to the DRAM node.
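The GEOMEAN and AVERAGE columns of the zipfian table can be recomputed from the per-column values. The sketch below copies the "default" and "HMSDK 2.0" rows from that table. The helper names are illustrative, and the inputs are the rounded values shown in the table rather than the raw measurements.

```python
import math
from typing import Dict, List, Sequence

# Rows copied from the zipfian table (cold memory 440G to 500G).
ZIPFIAN_TIME: Dict[str, List[float]] = {
    "default": [1.09, 1.10, 1.13, 1.15, 1.18, 1.21, 1.21],
    "HMSDK 2.0": [1.02, 1.04, 1.05, 1.04, 1.05, 1.05, 1.06],
}
ZIPFIAN_CXL_GB: Dict[str, List[float]] = {
    "default": [19.4, 26.1, 32.3, 38.5, 44.7, 50.5, 50.3],
    "HMSDK 2.0": [0.1, 1.6, 5.2, 8.0, 9.1, 11.8, 13.6],
}


def geomean(values: Sequence[float]) -> float:
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def average(values: Sequence[float]) -> float:
    """Arithmetic mean."""
    return sum(values) / len(values)
```

Rounded to the table's precision, `geomean` gives 1.15 for "default" and 1.04 for "HMSDK 2.0", and `average` gives 37.4 GB and 7.1 GB respectively, matching the last column. The geometric mean is the usual choice for summarizing normalized ratios, while CXL usage is an absolute quantity and is averaged arithmetically.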

YCSB latest distribution, read-only workload, under memory pressure from cold memory on node0 (512GB of local DRAM)

=============+================================================+=========
             |       cold memory occupied by mmap and memset  |
             |   0G  440G  450G  460G  470G  480G  490G  500G |
=============+================================================+=========
Execution time normalized to DRAM-only values                 | GEOMEAN
-------------+------------------------------------------------+---------
DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
default      |    -  1.16  1.15  1.17  1.18  1.16  1.18  1.15 | 1.17
HMSDK 2.0    |    -  1.04  1.04  1.05  1.05  1.06  1.05  1.06 | 1.05
=============+================================================+=========
CXL usage of redis-server in GB                               | AVERAGE
-------------+------------------------------------------------+---------
DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
default      |    -  19.3  26.1  32.2  38.5  44.6  50.5  50.6 | 37.4
HMSDK 2.0    |    -   1.3   3.8   7.0   4.1   9.4  12.5  16.7 |  7.8
=============+================================================+=========

The above result for the latest distribution workload shows a similar trend. A similar evaluation was done on another machine with 256GB of local DRAM and 96GB of CXL memory, with the cold memory size varied from about 190 GB to 240 GB. There, the performance slowdown is reduced from 20 ~ 24% for "default" to 5 ~ 7% for HMSDK 2.0.

4. Summary

Our evaluation shows that the memory management of HMSDK 2.0 reduces the performance slowdown from 15 ~ 17% under the "default" memory policy to 4 ~ 5% when the system runs with high memory pressure on its fast tier DRAM nodes. Thus, the DAMOS_DEMOTE and DAMOS_PROMOTE actions can make 2-tier memory systems run more efficiently under high memory pressure.
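The slowdown percentages quoted in the summary follow directly from the GEOMEAN columns of the two result tables, since a normalized execution time of 1.15 means 15% slower than DRAM-only. A minimal sketch of that conversion, with a hypothetical helper name:

```python
from typing import Dict

# GEOMEAN column values copied from the two result tables.
GEOMEAN: Dict[str, Dict[str, float]] = {
    "zipfian": {"default": 1.15, "HMSDK 2.0": 1.04},
    "latest": {"default": 1.17, "HMSDK 2.0": 1.05},
}


def slowdown_percent(normalized_time: float) -> int:
    """Slowdown over DRAM-only, in whole percent, for a normalized time."""
    return round((normalized_time - 1.0) * 100.0)
```

Applied to the table values, this gives 15% and 17% for "default" and 4% and 5% for HMSDK 2.0, which are the ranges stated above.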
