User Guide
Important
Please read HMSDK Overview instead because the current document is deprecated and no longer valid after hmsdk-v2.0.
This document describes how to install and use HMSDK.
HMSDK consists of four main parts: the cemalloc library, the Linux kernel, numactl, and tools.
This section describes the HMSDK components.
- cemalloc: heterogeneous memory allocator
  - cemalloc (CXL-Expansion malloc) contains hooking functions for the standard memory allocators as well as our custom memory allocators.
- linux: Linux kernel containing HMSDK support
  - It contains patches for HMSDK's new memory policy, named Bandwidth-aware Interleaving, and for the new system calls.
- numactl: numactl for Bandwidth-aware Interleaving
  - It offers the interface for using HMSDK's new memory policy.
- tools: tools for HMSDK
  - It contains a calculation tool (`bwactl`) for setting the optimal memory interleaving ratio.
To use HMSDK, you need at least two NUMA nodes on your system. If you use heterogeneous memory such as a CXL memory device, the system must expose it as a NUMA node. This chapter describes how to use HMSDK in various NUMA node configurations. HMSDK has only been tested on x86-64 systems.
If the system has CXL memory devices, make sure that they are detected as NUMA nodes. For example, the following numactl output shows that node 2 is a separate NUMA node without any CPUs.
$ numactl --hardware
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 0 size: 31956 MB
node 0 free: 29028 MB
node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 32242 MB
node 1 free: 31032 MB
node 2 cpus:
node 2 size: 96761 MB
node 2 free: 96659 MB
node distances:
node 0 1 2
0: 10 21 14
1: 21 10 24
2: 14 24 10
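The check described above can be automated. The following is a minimal sketch, not part of HMSDK, that scans `numactl --hardware` output for nodes whose `cpus:` line is empty. The function name `cpuless_nodes` and the shortened sample text are illustrative assumptions.

```python
import re


def cpuless_nodes(numactl_output: str) -> list[int]:
    """Return the IDs of NUMA nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_output.splitlines():
        match = re.match(r"node (\d+) cpus:(.*)$", line.strip())
        # An empty remainder after 'cpus:' means the node has memory only.
        if match and not match.group(2).strip():
            nodes.append(int(match.group(1)))
    return nodes


# Shortened version of the output shown above (CPU lists truncated).
SAMPLE = """\
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 31956 MB
node 1 cpus: 28 29 30 31
node 1 size: 32242 MB
node 2 cpus:
node 2 size: 96761 MB
"""

print(cpuless_nodes(SAMPLE))  # → [2]
```

On a live system, the text can be obtained with `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`. A CPU-less node is only a candidate; whether it is actually CXL memory still has to be confirmed against the hardware.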
If the system does not have CXL memory devices but has multiple NUMA nodes, you can emulate the use of HMSDK by treating one of those nodes as a memory device.
If your system has a single NUMA node, you can emulate multiple nodes by creating fake NUMA nodes. For more information, please read link.
You can download the HMSDK repository from GitHub. Make sure you pass `--recursive`, since HMSDK includes additional repositories as submodules.
$ git clone --recursive https://github.com/SKhynix/hmsdk.git
$ cd hmsdk
This includes downloading the entire Linux git history, so cloning with `--shallow-submodules` will significantly reduce the download time.
Please read this link for the general Linux kernel build.
Since HMSDK includes new memory policies in its Linux kernel, an additional build configuration, `CONFIG_INTERLEAVE_WEIGHT`, has to be enabled.
The kernel build and installation can be done as follows.
$ cd hmsdk/linux
$ cp /boot/config-$(uname -r) .config
$ make menuconfig
$ echo 'CONFIG_INTERLEAVE_WEIGHT=y' >> .config
$ make -j$(nproc)
$ sudo make modules_install
$ sudo make install
`CONFIG_INTERLEAVE_WEIGHT` can also be enabled in menuconfig under "Memory Management options" -> "Enable interleave weight policy".
After rebooting into the new kernel, you can use `uname -r` to verify that it has been installed correctly.
$ uname -r
6.1.0-hmsdk+
cemalloc requires:
- build-essential (c++11)
- cmake (>=3.14)
- python3
- autoconf (for jemalloc build)
HMSDK uses the CMake build system, but a helper script, `build.py`, is provided for easier build and installation. You can run one of the following build commands based on your requirements.
$ cd cemalloc
# build jemalloc and cemalloc for basic support.
$ ./build.py # implies --mode=build --build_type=release --build_target=malloc
or
# build cemalloc_python as well as jemalloc and cemalloc.
$ ./build.py --build_target=cemalloc_python
or
# build cemalloc_java as well as jemalloc and cemalloc.
# make sure JAVA_HOME is already set.
$ ./build.py --build_target=cemalloc_java
or
# build all the packages available.
# make sure JAVA_HOME is already set.
$ ./build.py --build_target=all
- `--build_target=malloc`: It builds cemalloc.
- `--build_target=cemalloc_python`: It builds the Python package of cemalloc.
  - Please run `python3 -m pip install cemalloc_package/cemalloc-1.1-py3-none-any.whl` when the build is done.
- `--build_target=cemalloc_java`: It builds the Java package of cemalloc.
  - Please make sure the `JAVA_HOME` environment variable is set to the system JDK directory.
If the build is done with `--build_target=all`, the `cemalloc_package/` directory looks like this:
$ ls cemalloc_package/
cemalloc-1.1-py3-none-any.whl cemalloc.jar include libcemallocjava.so libcemalloc.so libcemalloc.so.1.1
For more information about the build script and build options:
$ ./build.py -h
Once the HMSDK Linux kernel is installed, the customized numactl that supports the new system calls can be built. Please read INSTALL.md for details.
$ cd numactl
$ ./autogen.sh
$ ./configure
$ make
# make install
HMSDK offers efficient ways to allocate memory on a system with heterogeneous memory. You can use `libcemalloc` or our customized `numactl`. This page covers what you have to prepare before using HMSDK, how to use it, which features are available, and some simple examples.
- Before running your code, you must set environment variables according to your desired allocation mode.
Environment Variable | Description | Valid Value |
---|---|---|
`CE_MODE` | Sets the mode of cemalloc. | `CE_EXPLICIT`, `CE_IMPLICIT` (the Python and Java sections below use `CE_EXPLICIT_INDICATOR`) |
`CE_CXL_NODE` | Target CXL node number. Used if `CE_ALLOC` is `CE_ALLOC_CXL`. | an integer within the NUMA node range |
`CE_ALLOC` | Sets the allocation mode (read below for more information). | `CE_ALLOC_CXL`, `CE_ALLOC_BWAWARE`, `CE_ALLOC_USERDEFINED` |
`CE_INTERLEAVE_NODE` | Target interleave nodes and their weights. Used if `CE_ALLOC` is `CE_ALLOC_USERDEFINED`. | comma-delimited list of node weights in the format `node_id*weight` (e.g., `"0*2,1*1,2*1"`) |
- allocation mode (`CE_ALLOC`):
  - `CE_ALLOC_CXL`: Use `CE_CXL_NODE` memory as your default memory device. Heap memory requested by functions such as `malloc` will be allocated on `CE_CXL_NODE` memory.
  - `CE_ALLOC_BWAWARE`: Use the Bandwidth-aware Interleaving policy. Heap memory requested by functions such as `malloc` is interleaved between the memory nodes specified by `/sys/kernel/mm/interleave_weight/node/node*/interleave_weight`. You can find more information about the sysfs interface here. To see how to set the memory nodes and ratio, read how to use bwactl.
  - `CE_ALLOC_USERDEFINED`: Heap memory requested by functions such as `malloc` is allocated only to `CE_INTERLEAVE_NODE` memory. `CE_INTERLEAVE_NODE` contains comma-separated `"node*weight"` information.
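To make the `node_id*weight` format concrete, here is a minimal sketch of a parser for strings such as `"0*2,1*1,2*1"`. It is not HMSDK code: the function name `parse_node_weights` is hypothetical, and the rules that weights must be positive and that a node may appear only once are assumptions made for this illustration.

```python
def parse_node_weights(spec: str) -> dict[int, int]:
    """Parse a 'node_id*weight' list such as '0*2,1*1,2*1' into {node: weight}."""
    weights: dict[int, int] = {}
    for item in spec.split(","):
        node_text, sep, weight_text = item.strip().partition("*")
        if not sep:
            raise ValueError(f"missing '*' in {item!r}")
        node, weight = int(node_text), int(weight_text)
        # Assumption: node IDs are non-negative and weights are positive.
        if node < 0 or weight <= 0:
            raise ValueError(f"invalid node or weight in {item!r}")
        # Assumption: listing the same node twice is treated as an error.
        if node in weights:
            raise ValueError(f"node {node} listed more than once")
        weights[node] = weight
    return weights


print(parse_node_weights("0*2,1*1,2*1"))  # → {0: 2, 1: 1, 2: 1}
```

Validating the string before exporting it helps catch typos early, for example a `-` typed in place of `*`.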
- Using `LD_PRELOAD` with other environment variables allows you to use `libcemalloc.so` without any modification of your source code. `libcemalloc.so` will override the standard `malloc` and use heterogeneous memory according to how you set the environment variables.
- Example
$ export CE_MODE=CE_IMPLICIT
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ LD_PRELOAD=/path/to/libcemalloc.so ./your_program
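When a program is launched from a script rather than a shell, the same variables can be assembled programmatically. The sketch below is an illustration only: the helper `cemalloc_env` is hypothetical, and treating `CE_CXL_NODE` and `CE_INTERLEAVE_NODE` as mandatory for their respective modes is an assumption drawn from the table above.

```python
import os

VALID_MODES = {"CE_IMPLICIT", "CE_EXPLICIT", "CE_EXPLICIT_INDICATOR"}
VALID_ALLOCS = {"CE_ALLOC_CXL", "CE_ALLOC_BWAWARE", "CE_ALLOC_USERDEFINED"}


def cemalloc_env(mode, alloc, cxl_node=None, interleave_node=None,
                 preload=None, base=None):
    """Build an environment dict for running a program with cemalloc."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown CE_MODE: {mode}")
    if alloc not in VALID_ALLOCS:
        raise ValueError(f"unknown CE_ALLOC: {alloc}")
    # Start from the caller's base environment, or the current one.
    env = dict(os.environ if base is None else base)
    env["CE_MODE"] = mode
    env["CE_ALLOC"] = alloc
    if alloc == "CE_ALLOC_CXL":
        if cxl_node is None:
            raise ValueError("CE_ALLOC_CXL needs a CXL node number")
        env["CE_CXL_NODE"] = str(cxl_node)
    elif alloc == "CE_ALLOC_USERDEFINED":
        if not interleave_node:
            raise ValueError("CE_ALLOC_USERDEFINED needs node*weight pairs")
        env["CE_INTERLEAVE_NODE"] = interleave_node
    if preload:
        env["LD_PRELOAD"] = preload
    return env


env = cemalloc_env("CE_IMPLICIT", "CE_ALLOC_CXL", cxl_node=2,
                   preload="/path/to/libcemalloc.so", base={})
print(env)
```

The resulting dictionary can be passed to `subprocess.run(["./your_program"], env=env)`, which mirrors the shell example above.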
Function / Operator | Description |
---|---|
`mmap` | map files or devices into memory |
`malloc` | allocate dynamic memory |
`calloc` | allocate dynamic memory and set it to zero |
`realloc` | reallocate the given area of memory |
`posix_memalign` | allocate aligned memory |
`memalign` | allocate aligned memory |
`valloc` | has the same effect as `malloc`, except that the allocated memory is aligned to a multiple of the value returned by `sysconf(_SC_PAGESIZE)` |
`aligned_alloc` | allocate uninitialized storage whose alignment is specified by a parameter |
`free` | free dynamic memory |
`malloc_usable_size` | obtain the size of a block of memory allocated from the heap |
`new` | allocate dynamic memory |
`delete` | free dynamic memory |
When you decide to change your source code, you can optimize the use of HMSDK by adopting the explicit APIs of `libcemalloc.so`. It offers various language options (C / C++, Python, Java).
Function / Operator | Description |
---|---|
`cxl_mmap` | map files or devices into memory |
`cxl_malloc` | allocate dynamic memory |
`cxl_calloc` | allocate dynamic memory and set it to zero |
`cxl_realloc` | reallocate the given area of memory |
`cxl_posix_memalign` | allocate aligned memory |
`cxl_memalign` | allocate aligned memory |
`cxl_valloc` | has the same effect as `malloc`, except that the allocated memory is aligned to a multiple of the value returned by `sysconf(_SC_PAGESIZE)` |
`cxl_aligned_alloc` | allocate uninitialized storage whose alignment is specified by a parameter |
`cxl_free` | free dynamic memory |
`cxl_malloc_usable_size` | obtain the size of a block of memory allocated from the heap |
By replacing `malloc` with `cxl_malloc`, you can utilize heterogeneous memory effectively. `malloc` operates the same as usual, requesting memory from your local memory. You can change the behavior of the explicit APIs (`cxl_malloc`, `cxl_calloc`, `cxl_free`, ...) by setting environment variables.
$ gcc -o explicit_example explicit_example.cc -lcemalloc
$ export CE_MODE=CE_EXPLICIT
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ ./explicit_example
// code example (explicit_example.cc)
#include <stdlib.h>
#include <cemalloc.h>

int main(void) {
    // memory requested from local memory
    char* local = (char*)malloc(sizeof(char));
    // memory requested from CE_CXL_NODE
    char* cxl = (char*)cxl_malloc(sizeof(char));

    // release each block with the matching free function
    free(local);
    cxl_free(cxl);
    return 0;
}
Unlike C or C++, in which programmers can explicitly request memory, Python does not expose its memory allocation requests directly. For these cases, we offer simple indicator-like methods. Call `SetCxlMemory()` or `SetHostMemory()` before creating objects. The setting stays in effect until you call one of them again (for the current thread only).
Function | Description |
---|---|
`SetHostMemory` | use host memory for allocation |
`SetCxlMemory` | use cxl memory for allocation |
`GetMemoryMode` | get the current memory mode (host or cxl) |
`CE_MODE` should always be set to `CE_EXPLICIT_INDICATOR`. For information regarding the other environment variables, read above.
$ export LIBCEMALLOC_DIR=/path/where/libcemalloc/exists # or you can put libcemalloc.so in a directory that LD_LIBRARY_PATH points to
$ export CE_MODE=CE_EXPLICIT_INDICATOR
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ LD_PRELOAD=/path/to/libcemalloc.so python3 example.py
# example.py
import cemalloc
import numpy as np
cemalloc.SetCxlMemory()
arr = np.zeros(1024) # arr will be allocated on CE_CXL_NODE
cemalloc.SetHostMemory()
arr2 = np.zeros(1024) # arr2 will be allocated on host memory
Like Python, and unlike C/C++, Java hides its memory allocation requests, so cemalloc is used in Java the same way as in Python. Call `SetCxlMemory()` or `SetHostMemory()` before you create objects. The setting stays in effect until you call one of them again (for the current thread only).
Function | Description |
---|---|
`SetHostMemory` | enable allocation to host memory |
`SetCxlMemory` | enable allocation to cxl memory |
`GetMemoryMode` | get the current memory mode (host or cxl) |
`CE_MODE` should always be set to `CE_EXPLICIT_INDICATOR`. For information regarding the other environment variables, read above.
import cemalloc.Cemalloc;

class Example {
    public static void main(String[] args) {
        int arr_length = 1024 * 1024 * 1024;
        int[] arr_host = null;
        int[] arr_cxl = null;
        Cemalloc test = new Cemalloc();

        test.SetCxlMemory();
        // Compare strings with equals(); == only checks reference identity.
        assert "CXL".equals(test.GetMemoryMode().name());
        arr_cxl = new int[arr_length]; // arr_cxl will be allocated on CXL memory

        test.SetHostMemory();
        assert "HOST".equals(test.GetMemoryMode().name());
        arr_host = new int[arr_length]; // arr_host will be allocated on Host memory
    }
}
Then, you need to set some environment variables that determine cemalloc's allocation attributes and the Java environment. You can set the environment variables below with `source env.sh` (example/explicit_api/java/env.sh).
$ export CEMALLOC_PATH=/path/to/cemalloc_package
$ export LD_LIBRARY_PATH=${CEMALLOC_PATH}
# It should be set to CE_EXPLICIT_INDICATOR for Java application.
$ export CE_MODE=CE_EXPLICIT_INDICATOR
# CXL memory node number.
$ export CE_CXL_NODE=2
# It should be set to one of CE_ALLOC_CXL, CE_ALLOC_USERDEFINED, or CE_ALLOC_BWAWARE.
$ export CE_ALLOC=CE_ALLOC_CXL
$ export JAVA_HOME=/path/to/java
$ export CLASSPATH=/path/to/cemalloc.jar
$ make
$ LD_PRELOAD=$CEMALLOC_PATH/libcemalloc.so java Example
Please note that HMSDK support for Python and Java is still experimental.
With numactl, you can run your program with our new memory policy by using the `--interleave-weight` option.
`--interleave-weight(--w)`: either the string "bwa" or a comma-delimited list of node weights in the format "node_id*weight".
- Example
$ numactl --interleave-weight=bwa ./your_program
$ numactl --interleave-weight="0*2,1*1" ./your_program
When `--interleave-weight=bwa` is given, bandwidth-aware interleaving is applied. Otherwise, you can set your own interleaving weights in the "node_id*weight" format. Run `numactl -h` or `man numactl` for more information.
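To build intuition for what a weight list such as `"0*2,1*1"` means, the sketch below distributes pages across nodes in a plain weighted round-robin order. This is an illustrative model only: the guide does not describe the order in which the HMSDK kernel places pages, so the rotation used here is an assumption, and `weighted_interleave` is a hypothetical name.

```python
from collections import Counter
from itertools import cycle, islice


def weighted_interleave(weights: dict[int, int], num_pages: int) -> list[int]:
    """Assign pages to nodes in a weighted round-robin order."""
    # One round visits each node `weight` times, e.g. {0: 2, 1: 1} -> 0, 0, 1.
    one_round = [node for node, weight in weights.items() for _ in range(weight)]
    return list(islice(cycle(one_round), num_pages))


placement = weighted_interleave({0: 2, 1: 1}, 9)
print(placement)           # → [0, 0, 1, 0, 0, 1, 0, 0, 1]
print(Counter(placement))  # → Counter({0: 6, 1: 3})
```

With weights 2:1, node 0 receives twice as many pages as node 1, which is the proportion the weight list expresses.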
HMSDK introduces two new system calls for the interleave-weight memory allocation policy: `set_mempolicy_node_weight` and `mrange_node_weight`. These system calls are currently available only on x86-64 systems. For more information, read set_mempolicy_node_weight.md and mrange_node_weight.md.
The HMSDK Linux kernel provides a way to set the weight information through sysfs at `/sys/kernel/mm/interleave_weight/node/node*/interleave_weight`.
The sysfs structure is designed as follows.
- `enabled`: state of interleave_weight (1: enable, 0: disable)
- `possible`: node list with one or more cpus
- `interleave_weight`: weight value of each node
$ tree /sys/kernel/mm/interleave_weight/
/sys/kernel/mm/interleave_weight/
├── enabled
├── possible
└── node
    ├── node0
    │   └── interleave_weight
    └── node1
        └── interleave_weight
Once you set `enabled` to `1`, you are ready to use interleave-weight. We highly recommend using bwactl to set the optimal value for `interleave_weight`.
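The current settings can be inspected by reading the files in this tree. The following sketch is not part of HMSDK. It assumes only the directory layout shown above and returns each `interleave_weight` file as raw text, because the guide does not specify the file's exact format. The demo runs against a temporary copy of the layout so that it also works on machines without the HMSDK kernel; the sample file contents are placeholders modeled on the bwactl output shown in this guide.

```python
import tempfile
from pathlib import Path


def read_interleave_weights(root="/sys/kernel/mm/interleave_weight"):
    """Return (enabled, {node_name: raw_weight_text}) from the sysfs tree."""
    base = Path(root)
    enabled = (base / "enabled").read_text().strip() == "1"
    weights = {}
    node_dirs = (base / "node").glob("node[0-9]*")
    # Sort numerically so node10 comes after node2.
    for node_dir in sorted(node_dirs, key=lambda p: int(p.name[4:])):
        weights[node_dir.name] = (node_dir / "interleave_weight").read_text().strip()
    return enabled, weights


# Demo on a throwaway directory that mimics the layout (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp)
    (fake_root / "enabled").write_text("1\n")
    for name, value in {"node0": "0*2,2*1", "node1": "1*3,3*1"}.items():
        (fake_root / "node" / name).mkdir(parents=True)
        (fake_root / "node" / name / "interleave_weight").write_text(value + "\n")
    enabled, weights = read_interleave_weights(fake_root)

print(enabled, weights)  # → True {'node0': '0*2,2*1', 'node1': '1*3,3*1'}
```

On a system running the HMSDK kernel, calling `read_interleave_weights()` with no argument reads the real sysfs path.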
HMSDK provides the `bwactl.py` script to set the optimal memory interleaving ratio for the new memory policy, interleave-weight. This script measures the system's peak memory bandwidth for all NUMA nodes using Intel MLC (Memory Latency Checker). Then, it calculates the optimal memory interleaving ratio for each node based on the measured bandwidth. The result of running this utility is used in HMSDK's "Bandwidth-aware Interleaving" policy.
$ sudo tools/bwactl.py
Run `bwactl.py` as the root user to get the bandwidth ratio for each NUMA node. The resulting bandwidth ratio is applied to `/sys/kernel/mm/interleave_weight/node/node*/interleave_weight`.
We adopt Intel MLC to measure memory bandwidth, and it will be automatically
installed when this script is run for the first time.
An example execution is shown below.
# Please make sure lstopo is installed in the system.
$ sudo tools/bwactl.py
Bandwidth ratio for all NUMA nodes
node0: 0*2,2*1
node1: 1*3,3*1
Bandwidth ratio is successfully updated at
/sys/kernel/mm/interleave_weight/node/node0/interleave_weight
/sys/kernel/mm/interleave_weight/node/node1/interleave_weight
It updates the optimal interleaving ratio (in this case 2:1 for `node0` and `node2`, and 3:1 for `node1` and `node3`) for every node with CPUs, covering the memory nodes directly attached to it. The result may vary with the system configuration.
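The guide does not spell out how bwactl.py turns measured bandwidths into a ratio. As an illustration of the idea only, the sketch below approximates relative bandwidths with small integer weights. The function name, the `max_weight` bound, the search strategy, and the bandwidth figures are all assumptions made for this example, not measurements or HMSDK code.

```python
from functools import reduce
from math import gcd


def bandwidth_ratio(bandwidths: dict[int, float], max_weight: int = 10) -> dict[int, int]:
    """Approximate relative bandwidths with small integer weights."""
    peak = max(bandwidths.values())
    best = None
    for scale in range(1, max_weight + 1):
        # The fastest node gets weight `scale`; every node gets at least 1.
        weights = {n: max(1, round(bw / peak * scale)) for n, bw in bandwidths.items()}
        # Total deviation between the integer ratio and the measured ratio.
        error = sum(abs(weights[n] / scale - bw / peak) for n, bw in bandwidths.items())
        if best is None or error < best[0]:
            best = (error, weights)
    divisor = reduce(gcd, best[1].values())
    return {n: w // divisor for n, w in best[1].items()}


# Hypothetical peak bandwidths in GB/s for a DRAM node (0) and a CXL node (2).
ratio = bandwidth_ratio({0: 200.0, 2: 100.0})
print(",".join(f"{node}*{weight}" for node, weight in ratio.items()))  # → 0*2,2*1
```

Keeping the weights small is a design choice in this sketch; a larger `max_weight` tracks the measured ratio more closely at the cost of longer interleaving rounds.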
When using the Bandwidth-aware Interleaving policy, the Linux kernel reads and applies the value written to `/sys/.../interleave_weight` when allocating memory.
The Bandwidth-aware Interleaving policy tries to utilize the memory bandwidth of the memory nodes in the directly linked package, not across a processor interconnect such as UPI. For this reason, `bwactl.py` reads the system topology by running the lstopo command and parses the result to get the package layout.
For example, the above `bwactl.py` execution result was based on a system with the following topology.
Machine
  Package P#0
    NUMANode P#0 # DRAM
    NUMANode P#2 # CXL
  Package P#1
    NUMANode P#1 # DRAM
    NUMANode P#3 # CXL
Package 0 has NUMA nodes 0 and 2, and package 1 has NUMA nodes 1 and 3, so interleaving is expected to be applied among the nodes inside the same package.
If you want to change the topology manually for some reason, `bwactl.py` also provides the `--topology` option.
The argument should be a two-level nested set of nodes as follows.
<topology> := [<packages>]
<packages> := [<node_ids>] | [<node_ids>] "," <packages>
<node_ids> := <node_id> | <node_id> "," <node_ids>
The argument looks like a two-dimensional Python list in which every inner list has at least one element. For example, the above topology can be represented as `[[0,2],[1,3]]`, and you can run the following command.
$ sudo ./bwactl.py --topology "[[0,2],[1,3]]"
Bandwidth ratio for all NUMA nodes
node0: 0*7,2*5
node1: 1*1,3*1
Another topology example can be used as follows.
$ sudo ./bwactl.py --topology "[[0],[1,2,3]]"
Bandwidth ratio for all NUMA nodes
node0: 0*1
node1: 1*5,2*8,3*5
The above topology corresponds to the following lstopo-like result.
Machine
  Package P#0
    NUMANode P#0 # DRAM
  Package P#1
    NUMANode P#1 # DRAM
    NUMANode P#2 # CXL
    NUMANode P#3 # CXL
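Because the `--topology` argument has the shape of a nested Python list, it can be checked before being passed to the script. The sketch below is not how bwactl.py itself parses the option; `parse_topology` is a hypothetical helper, and rejecting a node that appears in more than one package is an assumption added for this example.

```python
import ast


def parse_topology(text: str) -> list[list[int]]:
    """Parse a topology string such as '[[0,2],[1,3]]' into a list of packages."""
    topology = ast.literal_eval(text)  # safely evaluates Python literals only
    if not isinstance(topology, list) or not topology:
        raise ValueError("topology must be a non-empty list of packages")
    seen = set()
    for package in topology:
        # Each package must be a list with at least one node ID.
        if not isinstance(package, list) or not package:
            raise ValueError("each package must be a non-empty list of node IDs")
        for node in package:
            if isinstance(node, bool) or not isinstance(node, int) or node < 0:
                raise ValueError(f"invalid node ID: {node!r}")
            # Assumption: a node belongs to exactly one package.
            if node in seen:
                raise ValueError(f"node {node} appears in more than one package")
            seen.add(node)
    return topology


print(parse_topology("[[0,2],[1,3]]"))  # → [[0, 2], [1, 3]]
print(parse_topology("[[0],[1,2,3]]"))  # → [[0], [1, 2, 3]]
```

An input with an empty inner list, such as `"[[0],[]]"`, raises `ValueError`, matching the rule that every package lists at least one node.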