User Guide

Important

Please read HMSDK Overview instead because the current document is deprecated and no longer valid.

This document describes how to install and use HMSDK.

1. Installation

HMSDK consists of four main parts: the cemalloc library, the Linux kernel, numactl, and tools.

1.1. HMSDK Components

This section describes the HMSDK components.

cemalloc: heterogeneous memory allocator

cemalloc (CXL-Expansion malloc) contains hooking functions of the memory allocators and our custom memory allocators.

linux: linux kernel containing HMSDK support

It contains patches that add HMSDK's new memory policy, named Bandwidth-aware Interleaving, and the new system calls related to it.

numactl: numactl for Bandwidth-aware Interleaving

It offers the interface to use HMSDK's new memory policy.

tools: Tools for HMSDK

It contains a calculation tool (bwactl) for setting the optimal memory interleaving ratio.

1.2. System Requirements

To use HMSDK, you need at least two NUMA nodes on your system. If you use heterogeneous memory such as a CXL memory device, it must be exposed as a NUMA node. This section describes how to use HMSDK in various NUMA node configurations. HMSDK has only been tested on x86-64 systems.

If the system has CXL memory devices, make sure that they are detected as NUMA nodes. For example, the following numactl output shows that node 2 is a separate NUMA node that has no CPUs.

  $ numactl --hardware
    available: 3 nodes (0-2)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
    node 0 size: 31956 MB
    node 0 free: 29028 MB
    node 1 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
    node 1 size: 32242 MB
    node 1 free: 31032 MB
    node 2 cpus:
    node 2 size: 96761 MB
    node 2 free: 96659 MB
    node distances:
    node   0   1   2
      0:  10  21  14
      1:  21  10  24
      2:  14  24  10
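
The check described above can also be scripted. The following is a minimal sketch, not part of HMSDK: it assumes the `numactl --hardware` output format shown above, and the helper name `cpuless_nodes` and the shortened `SAMPLE` text are made up for illustration.

```python
import re

# Shortened copy of the output format shown above (illustrative values only).
SAMPLE = """\
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3
node 0 size: 31956 MB
node 1 cpus: 28 29 30 31
node 1 size: 32242 MB
node 2 cpus:
node 2 size: 96761 MB
"""


def cpuless_nodes(numactl_output):
    """Return the IDs of NUMA nodes whose 'cpus:' line lists no CPUs."""
    nodes = []
    for line in numactl_output.splitlines():
        match = re.match(r"\s*node (\d+) cpus:(.*)$", line)
        if match and not match.group(2).strip():
            nodes.append(int(match.group(1)))
    return nodes


print(cpuless_nodes(SAMPLE))  # [2]
```

On a real system, the text could be obtained with `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`. A node without CPUs is only a candidate: it is up to you to confirm that it is the CXL memory device.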

If the system does not have CXL memory devices but has multiple NUMA nodes, you can emulate the use of HMSDK by treating one of those nodes as a memory device.

If your system has a single NUMA node, you can emulate multiple nodes by creating fake NUMA nodes. For more information, please read link.

1.3. Installation

You can download the HMSDK repository from GitHub. Make sure you pass --recursive, since HMSDK includes additional repositories as submodules.

$ git clone --recursive https://github.com/SKhynix/hmsdk.git
$ cd hmsdk

This downloads the entire Linux git history, so cloning with --shallow-submodules significantly reduces the download time.

Building kernel

Please read this link for the general Linux kernel build.

Since HMSDK includes new memory policies in its linux kernel, an additional build configuration, CONFIG_INTERLEAVE_WEIGHT, has to be enabled.

The kernel build and installation can be done as follows.

$ cd hmsdk/linux
$ cp /boot/config-$(uname -r) .config
$ make menuconfig
$ echo 'CONFIG_INTERLEAVE_WEIGHT=y' >> .config
$ make -j$(nproc)
$ sudo make modules_install
$ sudo make install

CONFIG_INTERLEAVE_WEIGHT can also be enabled in menuconfig under "Memory Management options" -> "Enable interleave weight policy".

After rebooting into the new kernel, you can use uname -r to verify that it has been installed correctly.

$ uname -r
6.1.0-hmsdk+

Building cemalloc

cemalloc requires:

build-essential (c++11)
cmake (>=3.14)
python3
autoconf (for jemalloc build)

HMSDK uses the CMake build system, but a helper script, build.py, is provided for easier build and installation. You can run one of the following build commands based on your requirements.

$ cd cemalloc

# build jemalloc and cemalloc for basic support.
$ ./build.py    # implies --mode=build --build_type=release --build_target=malloc
      or
# build cemalloc_python as well as jemalloc and cemalloc.
$ ./build.py --build_target=cemalloc_python
      or
# build cemalloc_java as well as jemalloc and cemalloc.
# make sure JAVA_HOME is already set.
$ ./build.py --build_target=cemalloc_java
      or
# build all the packages available.
# make sure JAVA_HOME is already set.
$ ./build.py --build_target=all

--build_target=malloc: It builds cemalloc.
--build_target=cemalloc_python: It builds the Python package of cemalloc.
- Please run python3 -m pip install cemalloc_package/cemalloc-1.1-py3-none-any.whl when the build is done.
--build_target=cemalloc_java: It builds the Java package of cemalloc.
- Please make sure the JAVA_HOME environment variable is set to the system JDK directory.

If the build is done with --build_target=all, the cemalloc_package/ directory looks like this:

$ ls cemalloc_package/
cemalloc-1.1-py3-none-any.whl  cemalloc.jar  include  libcemallocjava.so  libcemalloc.so  libcemalloc.so.1.1

For more information about the build script and build options:

$ ./build.py -h

Building numactl

Once the HMSDK Linux kernel is installed, the customized numactl that supports the new system calls can be built. Please read INSTALL.md for details.

$ cd numactl
$ ./autogen.sh
$ ./configure
$ make
# make install

2. How to Use

HMSDK offers efficient ways to allocate memory on a system with heterogeneous memory. You can use libcemalloc or our customized numactl. This chapter covers what you have to prepare before using HMSDK, how to use it, which features are available, and simple examples.

2.1. Environment Variables

Before running your code, you must set environment variables according to your desired allocation mode.

Environment Variable	Description	Valid Value
CE_MODE	To set the mode of cemalloc.	`CE_EXPLICIT`, `CE_IMPLICIT`, `CE_EXPLICIT_INDICATOR` (Python and Java)
CE_CXL_NODE	Target CXL node number. Used if `CE_ALLOC` is `CE_ALLOC_CXL`.	an integer within the NUMA node range
CE_ALLOC	To set the allocation mode (read below for more information).	`CE_ALLOC_CXL`, `CE_ALLOC_BWAWARE`, `CE_ALLOC_USERDEFINED`
CE_INTERLEAVE_NODE	Target interleave node and weight information. Used if `CE_ALLOC` is `CE_ALLOC_USERDEFINED`.	comma-delimited list of node weights in the format `node_id*weight` (e.g., `"0*2,1*1,2*1"`)

allocation mode (CE_ALLOC):
- CE_ALLOC_CXL: Use CE_CXL_NODE memory as your default memory device. Heap memory requested by functions such as malloc will be allocated on CE_CXL_NODE memory.
- CE_ALLOC_BWAWARE: Use the Bandwidth-aware Interleaving policy. Heap memory requested by functions such as malloc is interleaved between the memory nodes specified by /sys/kernel/mm/interleave_weight/node/node*/interleave_weight. For more information about the sysfs interface, see section 2.5. To set the memory nodes and ratio, see section 3 on bwactl.
- CE_ALLOC_USERDEFINED: Heap memory requested by functions such as malloc is allocated only on the nodes listed in CE_INTERLEAVE_NODE, which contains comma-separated "node_id*weight" entries.
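
To make the "node_id*weight" format concrete, here is a minimal sketch of how such a string could be parsed and checked before it is exported as CE_INTERLEAVE_NODE. The function name `parse_node_weights` is hypothetical, and the validation rules are an assumption for illustration. cemalloc's own parser may accept or reject different inputs.

```python
def parse_node_weights(spec):
    """Parse 'node_id*weight' pairs such as "0*2,1*1,2*1" into {node_id: weight}."""
    weights = {}
    for item in spec.split(","):
        node_id, sep, weight = item.strip().partition("*")
        # Assumption: both sides of '*' are plain non-negative integers.
        if not sep or not node_id.isdigit() or not weight.isdigit():
            raise ValueError(f"expected node_id*weight, got {item!r}")
        weights[int(node_id)] = int(weight)
    return weights


print(parse_node_weights("0*2,1*1,2*1"))  # {0: 2, 1: 1, 2: 1}
```

A string that has lost its asterisks, such as "02,11,21", is rejected with a ValueError, which makes this kind of check useful before launching a long-running job.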

2.2. HMSDK User API

2.2.1. Implicit API

Using LD_PRELOAD with other environment variables allows you to use libcemalloc.so without any modification of your source code. libcemalloc.so will override standard malloc and use heterogeneous memory according to how you set the environment variables.

Example

$ export CE_MODE=CE_IMPLICIT
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ LD_PRELOAD=/path/to/libcemalloc.so ./your_program
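
If you start programs from a script, the same settings can be passed to a single child process instead of being exported into the whole shell session. This is a minimal sketch using only the environment variables documented above. The helper names `implicit_cxl_env` and `run_with_cemalloc` are made up for illustration, and `/path/to/libcemalloc.so` is a placeholder as in the shell example.

```python
import os
import subprocess


def implicit_cxl_env(libcemalloc_path, cxl_node, base_env=None):
    """Return a copy of the environment configured for CE_IMPLICIT + CE_ALLOC_CXL."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CE_MODE": "CE_IMPLICIT",
        "CE_ALLOC": "CE_ALLOC_CXL",
        "CE_CXL_NODE": str(cxl_node),
        "LD_PRELOAD": libcemalloc_path,
    })
    return env


def run_with_cemalloc(command, libcemalloc_path, cxl_node):
    """Run `command` (a list of arguments) with libcemalloc.so preloaded."""
    env = implicit_cxl_env(libcemalloc_path, cxl_node)
    return subprocess.run(command, env=env, check=True)


if __name__ == "__main__":
    env = implicit_cxl_env("/path/to/libcemalloc.so", 2, base_env={})
    print(env["CE_MODE"], env["CE_ALLOC"], env["CE_CXL_NODE"])
```

Because LD_PRELOAD is set only in the child's environment, the launcher itself and any other tools it starts keep using the standard allocator.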

List of Overrides (C / C++)

Function / Operator	Description
`mmap`	map files or devices into memory
`malloc`	allocate dynamic memory
`calloc`	allocate dynamic memory and the memory is set to zero
`realloc`	reallocate the given area of memory
`posix_memalign`	allocate aligned memory
`memalign`	allocate aligned memory
`valloc`	have the same effect as `malloc`, except that the allocated memory will be aligned to a multiple of the value returned by `sysconf(_SC_PAGESIZE)`
`aligned_alloc`	allocate uninitialized storage whose alignment is specified by parameter
`free`	free dynamic memory
`malloc_usable_size`	obtain size of block of memory allocated from heap
`new`	allocate dynamic memory
`delete`	free dynamic memory

2.2.2. Explicit API

If you are willing to change your source code, you can optimize the use of HMSDK by adopting the explicit APIs of libcemalloc.so. They are available in several languages (C / C++, Python, Java).

C / C++

API List

Function / Operator	Description
`cxl_mmap`	map files or devices into memory
`cxl_malloc`	allocate dynamic memory
`cxl_calloc`	allocate dynamic memory and the memory is set to zero
`cxl_realloc`	reallocate the given area of memory
`cxl_posix_memalign`	allocate aligned memory
`cxl_memalign`	allocate aligned memory
`cxl_valloc`	have the same effect as `malloc`, except that the allocated memory will be aligned to a multiple of the value returned by `sysconf(_SC_PAGESIZE)`
`cxl_aligned_alloc`	allocate uninitialized storage whose alignment is specified by parameter
`cxl_free`	free dynamic memory
`cxl_malloc_usable_size`	obtain size of block of memory allocated from heap

By replacing malloc with cxl_malloc, you can utilize heterogeneous memory effectively. malloc operates the same as usual, requesting memory from your local memories. You can change the behavior of explicit APIs (cxl_malloc, cxl_calloc, cxl_free, ...) by setting environment variables.

Example (example/explicit_api/cpp/explicit_example.cc)

$ gcc -o explicit_example explicit_example.cc -lcemalloc
$ export CE_MODE=CE_EXPLICIT
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ ./explicit_example

// code example (explicit_example.cc)
#include <cstdlib>     // malloc, free
#include <cemalloc.h>  // cxl_malloc, cxl_free
...
char* local;
char* cxl;

local = (char*)malloc(sizeof(char));    // memory requested from local memory
cxl = (char*)cxl_malloc(sizeof(char));  // memory requested from CE_CXL_NODE
...
free(local);
cxl_free(cxl);
...

Python

Unlike C and C++, where programmers request memory explicitly, Python does not expose its memory allocation requests directly. For such cases, we offer simple indicator-like methods. Call SetCxlMemory() or SetHostMemory() before creating objects. The setting stays in effect, for the current thread only, until you call one of them again.

API List (Python)

Function	Description
`SetHostMemory`	use host memory for allocation
`SetCxlMemory`	use CXL memory for allocation
`GetMemoryMode`	get the current memory mode (host or CXL)

CE_MODE should always be set to CE_EXPLICIT_INDICATOR. For information regarding other environment variables, read above.

Example (example/explicit_api/python/example.py)

$ export LIBCEMALLOC_DIR=/path/where/libcemalloc/exists # or put libcemalloc.so in a directory that LD_LIBRARY_PATH points to
$ export CE_MODE=CE_EXPLICIT_INDICATOR
$ export CE_CXL_NODE=2
$ export CE_ALLOC=CE_ALLOC_CXL
$ LD_PRELOAD=/path/to/libcemalloc.so python3 example.py

# example.py
import cemalloc
import numpy as np

cemalloc.SetCxlMemory()
arr = np.zeros(1024)       # arr will be allocated on CE_CXL_NODE

cemalloc.SetHostMemory()
arr2 = np.zeros(1024)      # arr2 will be allocated on host memory
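
Because the selected mode stays in effect until the next call, it is easy to leave a thread in CXL mode by accident, for example when an exception skips the SetHostMemory() call. One way to guard against this is a small context manager. This is only a sketch: `cxl_allocation` is a hypothetical helper, not part of the cemalloc package, and it assumes the thread was in host mode before the block. The setter functions are passed in as arguments, so the demonstration below runs with stand-in callables and without cemalloc installed.

```python
from contextlib import contextmanager


@contextmanager
def cxl_allocation(set_cxl, set_host):
    """Select CXL memory inside the `with` block, then switch back to host memory."""
    set_cxl()
    try:
        yield
    finally:
        set_host()  # runs even if the block raises


# Demonstration with stand-in setters that only record the call order.
calls = []
with cxl_allocation(lambda: calls.append("cxl"), lambda: calls.append("host")):
    calls.append("allocate")
print(calls)  # ['cxl', 'allocate', 'host']
```

With the real package, the call would look like `with cxl_allocation(cemalloc.SetCxlMemory, cemalloc.SetHostMemory): arr = np.zeros(1024)`.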

Java

Like Python, and unlike C / C++, Java hides its memory allocation requests, so cemalloc is used in Java the same way as in Python. Call SetCxlMemory() or SetHostMemory() before you create objects. The setting stays in effect, for the current thread only, until you call one of them again.

API List (Java)

Function	Description
`SetHostMemory`	enable allocation to host memory
`SetCxlMemory`	enable allocation to CXL memory
`GetMemoryMode`	get the current memory mode (host or CXL)

CE_MODE should always be set to CE_EXPLICIT_INDICATOR. For information regarding other environment variables, read above.

Example (example/explicit_api/java/Example.java)

import cemalloc.Cemalloc;

class Example {
  public static void main(String[] args)
  {
    int arr_length = 1024 * 1024 * 1024;
    int[] arr_host = null;
    int[] arr_cxl = null;

    Cemalloc test = new Cemalloc();

    test.SetCxlMemory();
    assert "CXL".equals(test.GetMemoryMode().name());   // compare string contents, not references
    arr_cxl = new int[arr_length];      // arr_cxl will be allocated on CXL memory

    test.SetHostMemory();
    assert "HOST".equals(test.GetMemoryMode().name());
    arr_host = new int[arr_length];     // arr_host will be allocated on host memory
  }
}

Then, you need to set some environment variables that determine cemalloc's allocation attributes and the Java environment. You can set the environment variables below by running source env.sh (example/explicit_api/java/env.sh).

For cemalloc library

$ export CEMALLOC_PATH=/path/to/cemalloc_package
$ export LD_LIBRARY_PATH=${CEMALLOC_PATH}

For cemalloc's allocation attribute

# It should be set to CE_EXPLICIT_INDICATOR for Java application.
$ export CE_MODE=CE_EXPLICIT_INDICATOR

# CXL memory node number.
$ export CE_CXL_NODE=2

# It should be set to one of CE_ALLOC_CXL, CE_ALLOC_USERDEFINED, or CE_ALLOC_BWAWARE.
$ export CE_ALLOC=CE_ALLOC_CXL

For Java environment

$ export JAVA_HOME=/path/to/java
$ export CLASSPATH=/path/to/cemalloc.jar

Build & Run the Java application

$ make
$ LD_PRELOAD=$CEMALLOC_PATH/libcemalloc.so java Example

Please note that HMSDK support for Python and Java is still experimental.

2.3. numactl

With numactl, you can run your program under our new memory policy by using the --interleave-weight option.

--interleave-weight (--w): either the string "bwa" or a comma-delimited list of node weights in the format "node_id*weight".

Example

$ numactl --interleave-weight=bwa ./your_program
$ numactl --interleave-weight="0*2,1*1" ./your_program

With --interleave-weight=bwa, Bandwidth-aware Interleaving is applied. Otherwise, you can set your own interleaving weights in "node_id*weight" format. Run numactl -h or man numactl for more information.
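
To give an intuition for what a weight string such as "0*2,1*1" means, the sketch below assigns pages to nodes in simple weighted round-robin order. This is an illustration of the resulting proportions only. The order in which the HMSDK kernel actually places pages is not described in this guide, and `weighted_interleave` is a made-up name.

```python
from itertools import cycle, islice


def weighted_interleave(weights, num_pages):
    """Return one node ID per page, visiting each node `weight` times per round."""
    one_round = [node for node, weight in weights.items() for _ in range(weight)]
    return list(islice(cycle(one_round), num_pages))


placement = weighted_interleave({0: 2, 1: 1}, 6)
print(placement)  # [0, 0, 1, 0, 0, 1]
```

With weights 2 and 1, node 0 receives two pages for every page placed on node 1, which is the 2:1 ratio expressed by "0*2,1*1".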

2.4. System Calls

HMSDK introduces two new system calls for the interleave-weight memory allocation policy: set_mempolicy_node_weight and mrange_node_weight. These system calls are currently available only on x86-64 systems. For more information, read set_mempolicy_node_weight.md and mrange_node_weight.md.

2.5. Sysfs Interface

The HMSDK Linux kernel provides a way to set the weight information through sysfs at /sys/kernel/mm/interleave_weight/node/node*/interleave_weight.

The sysfs structure is designed as follows.

enabled: state of interleave_weight (1: enable, 0: disable)
possible: list of nodes that have one or more CPUs
interleave_weight: weight value of each node

  $ tree /sys/kernel/mm/interleave_weight/
  /sys/kernel/mm/interleave_weight/
  ├── enabled
  ├── possible
  └── node
      ├── node0
      │   └── interleave_weight
      └── node1
          └── interleave_weight

Once you set enabled to 1, you are ready to use interleave-weight. We highly recommend using bwactl to set the optimal value for interleave_weight.
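
The current state of this sysfs tree can be inspected with a few lines of Python. The following is a minimal sketch that relies only on the directory layout shown above. `read_interleave_weights` is a hypothetical helper, and because the exact text format of each interleave_weight file is not specified here, the values are returned as raw strings.

```python
from pathlib import Path

SYSFS_ROOT = Path("/sys/kernel/mm/interleave_weight")


def read_interleave_weights(root=SYSFS_ROOT):
    """Return (enabled, {node_id: raw weight string}) read from the sysfs tree."""
    root = Path(root)
    enabled = (root / "enabled").read_text().strip() == "1"
    weights = {}
    for node_dir in sorted((root / "node").glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        weights[node_id] = (node_dir / "interleave_weight").read_text().strip()
    return enabled, weights


# Only meaningful on a machine running the HMSDK kernel.
if SYSFS_ROOT.is_dir():
    print(read_interleave_weights())
```

The `root` parameter exists so that the function can be pointed at a copy of the tree, which is convenient for testing on machines without the HMSDK kernel.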

3. Tools (bwactl)

HMSDK provides the bwactl.py script to set the optimal memory interleaving ratio for the new memory policy, interleave-weight. The script measures the system's peak memory bandwidth for all NUMA nodes using Intel MLC (Memory Latency Checker). Then, it calculates the optimal memory interleaving ratio for each node from the measured bandwidth. The result of running this utility is used by HMSDK's "Bandwidth-aware Interleaving" policy.
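
The idea of turning measured bandwidth into an interleaving ratio can be sketched as follows. This is not the algorithm used by bwactl.py, which this guide does not describe. It is one plausible approach, shown with made-up bandwidth numbers: scale each node's bandwidth relative to the fastest node, round to a small integer, and reduce by the greatest common divisor. The name `bandwidth_to_weights` and the `max_weight` parameter are assumptions.

```python
from functools import reduce
from math import gcd


def bandwidth_to_weights(bandwidth, max_weight=10):
    """Map {node_id: bandwidth} to small integer weights with a similar ratio."""
    peak = max(bandwidth.values())
    # Scale so that the fastest node gets max_weight and no node drops below 1.
    scaled = {node: max(1, round(bw / peak * max_weight))
              for node, bw in bandwidth.items()}
    divisor = reduce(gcd, scaled.values())
    return {node: weight // divisor for node, weight in scaled.items()}


# Hypothetical measurements in GB/s for a DRAM node (0) and a CXL node (2).
print(bandwidth_to_weights({0: 200.0, 2: 100.0}))  # {0: 2, 2: 1}
```

A node with twice the bandwidth receives twice the weight, so pages are spread in proportion to what each node can deliver.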

To Set Bandwidth-aware Interleaving Ratio

$ sudo tools/bwactl.py

Run bwactl.py as the root user to measure the bandwidth ratio for each NUMA node. The resulting ratio is written to /sys/kernel/mm/interleave_weight/node/node*/interleave_weight. Intel MLC is used to measure memory bandwidth, and it is installed automatically the first time the script runs.

An example execution follows.

# Please make sure lstopo is installed in the system.

$ sudo tools/bwactl.py
Bandwidth ratio for all NUMA nodes
node0: 0*2,2*1
node1: 1*3,3*1

Bandwidth ratio is successfully updated at
  /sys/kernel/mm/interleave_weight/node/node0/interleave_weight
  /sys/kernel/mm/interleave_weight/node/node1/interleave_weight

For every node that has CPUs, it writes the optimal interleaving ratio across the memory nodes directly attached to that node's package (in this case 2:1 for node0 and node2, and 3:1 for node1 and node3). The result may vary with the system configuration.

When using the Bandwidth-aware Interleaving policy, the Linux kernel reads and applies the value written to /sys/.../interleave_weight when allocating memory.

Topology

The Bandwidth-aware Interleaving policy tries to use the memory bandwidth of the memory nodes within the same directly linked package, not across a processor interconnect such as UPI. For this reason, bwactl.py reads the system topology by running the lstopo command and parses the result to get the package layout.

For example, the above bwactl.py execution result was based on the system with the topology as follows.

  Machine
    Package P#0
      NUMANode P#0    # DRAM
      NUMANode P#2    # CXL
    Package P#1
      NUMANode P#1    # DRAM
      NUMANode P#3    # CXL

Since package 0 has NUMA nodes 0 and 2, and package 1 has NUMA nodes 1 and 3, interleaving is applied among the nodes inside the same package.

If you want to override the detected topology, bwactl.py also lets you set it manually with the --topology option. The argument must be a two-level nested list of nodes, as follows.

  <topology> := [<packages>]
  <packages> := [<node_ids>] | [<node_ids>] "," <packages>
  <node_ids> := <node_id> | <node_id> "," <node_ids>

The argument looks like a two-dimensional Python list in which every inner list has at least one element. For example, the above topology can be represented as [[0,2],[1,3]], and you can run the following command.

  $ sudo ./bwactl.py --topology "[[0,2],[1,3]]"
  Bandwidth ratio for all NUMA nodes
  node0: 0*7,2*5
  node1: 1*1,3*1

Another topology example can be used as follows.

  $ sudo ./bwactl.py --topology "[[0],[1,2,3]]"
  Bandwidth ratio for all NUMA nodes
  node0: 0*1
  node1: 1*5,2*8,3*5

The above topology corresponds to the following lstopo-like layout.

  Machine
    Package P#0
      NUMANode P#0    # DRAM
    Package P#1
      NUMANode P#1    # DRAM
      NUMANode P#2    # CXL
      NUMANode P#3    # CXL
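
The --topology grammar described above can be checked before invoking bwactl.py. The sketch below is an illustration under stated assumptions, not code from bwactl.py: `parse_topology` and `render` are hypothetical helpers, and the argument is read with the standard json module because the grammar happens to be valid JSON.

```python
import json


def parse_topology(arg):
    """Parse a --topology style string such as "[[0,2],[1,3]]" into a list of packages."""
    packages = json.loads(arg)  # raises ValueError (JSONDecodeError) on malformed text
    if not isinstance(packages, list) or not packages:
        raise ValueError("topology must be a non-empty list of packages")
    for package in packages:
        if not isinstance(package, list) or not package:
            raise ValueError("each package must be a non-empty list of node IDs")
        for node_id in package:
            if isinstance(node_id, bool) or not isinstance(node_id, int) or node_id < 0:
                raise ValueError(f"invalid node ID: {node_id!r}")
    return packages


def render(packages):
    """Render the packages as an lstopo-like layout."""
    lines = ["Machine"]
    for index, package in enumerate(packages):
        lines.append(f"  Package P#{index}")
        lines.extend(f"    NUMANode P#{node_id}" for node_id in package)
    return "\n".join(lines)


print(render(parse_topology("[[0],[1,2,3]]")))
```

Running it on "[[0],[1,2,3]]" prints the same package layout as the lstopo-like listing above, without the DRAM and CXL annotations, which the topology string does not carry.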
