
String vs Buffer memory usage + performance #4506

Open
2 tasks done
znewsham opened this issue Nov 21, 2024 · 7 comments
Comments

@znewsham

Node.js Version

v18-v22

NPM Version

v10.8.2

Operating System

Linux zacknewsham-xps 6.8.0-48-generic #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

buffer, string_decoder, v8

Description

I'm trying to understand why my application has a high baseline memory usage. While investigating, I discovered something I can't explain: strings seem to cost >10x more memory per character than the equivalent buffer. Some overhead is expected (~2x, given the UTF-16 nature of JS strings), but not on this scale.

A secondary question is why setting up a String->String map is so much slower (8x) than setting up a map that takes the same string and converts it to a buffer before storing.

Below is a minimal reproduction; the commented-out lines in test() let you toggle between the string->string map and the string->buffer map.

I run it with --expose-gc just to get a valid heap snapshot at the end. The total string size stored is (17 + 1000) * 100,000 bytes of characters, so the absolute minimal memory usage would be around 100 MB (a trivial C++ implementation of the same takes 114 MB).

When running with the string->string map, the memory cost is around 3.2 GB and the setup time (to populate the map) is ~11 s; when running with the string->buffer map, the memory cost is 280 MB and the setup time is ~1.3 s. The difference in the reported "time" is completely explicable (the cost of parsing the buffer on each access).

Minimal Reproduction

import { setTimeout } from "timers/promises";

const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// random alpha numeric strings of a specific length
function makeid(length) {
  let result = '';
  const charactersLength = characters.length;
  let counter = 0;
  while (counter < length) {
    result += characters.charAt(Math.floor(Math.random() * charactersLength));
    counter += 1;
  }
  return result;
}

// test setup - 100,000 keys 17 chars long, 1mn iterations, values are 1000 chars long
const keyCount = 100000;
const iterations = 1_000_000;
const keyLength = 17;
const valueLength = 1000;
const keys = new Array(keyCount).fill(0).map(() => makeid(keyLength));

function testMap(map) {
  const startSetup = performance.now();
  keys.forEach(key => map.set(key, makeid(valueLength)));
  const endSetup = performance.now();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const key = keys[Math.floor(Math.random() * keys.length)];
    const value = map.get(key);

    // v8 optimisation busting - without this the loop is 4x faster due to optimising out the get call
    globalThis.value = value;
  }
  const end = performance.now();
  return { time: end - start, setup: endSetup - startSetup };
}

// a naive implementation that keeps the API the same but converts values into buffers
class ConvertToBufferMap extends Map {
  set(key, value) {
    super.set(key, Buffer.from(value, "utf-8"));
  }
  get(key) {
    return super.get(key)?.toString("utf-8");
  }
}


async function test() {
  // const map = new Map();
  // console.log("map", testMap(map));
  const bufferMap = new ConvertToBufferMap();
  console.log("bufferMap", testMap(bufferMap));
  gc();
  console.log(process.memoryUsage().rss / 1024 / 1024);

  // pause to go get a heap snapshot or whatever
  await setTimeout(100000);
}

test();

Output

bufferMap { time: 705.9530600000003, setup: 1303.258812 }
Memory usage:  279.30078125

map { time: 83.8109829999994, setup: 10450.127824000001 }
Memory usage:  3195.6953125

Before You Submit

  • I have looked for issues that already exist before submitting this
  • My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
@znewsham
Author

As is so often the case, 10 minutes after asking the question I figured it out :( (at least partially). The problem is the test setup: it looks like makeid retains memory. I was aware that it would allocate a lot of memory during the incremental string building, but I thought the call to gc would clear it up; evidently not. Additionally, it seems that any conversion (e.g., to a buffer) releases that accumulated memory.

If I change makeid to use an array of pre-allocated length plus a join at the end, the setup performance and memory usage both drop below those of the buffer version. I ran valgrind on it and it didn't show any lost bytes, so the memory does get cleaned up, just not by the gc call.
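For reference, a sketch of that fix (an illustrative rewrite of makeid, not the exact code used): the characters go into a pre-sized array and a single join produces one flat string, so no chain of intermediate concatenation results is retained.

```javascript
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// build into a pre-allocated array, then join once at the end
function makeidJoin(length) {
  const chars = new Array(length);
  for (let i = 0; i < length; i++) {
    chars[i] = characters.charAt(Math.floor(Math.random() * characters.length));
  }
  // join allocates one contiguous string; the array becomes garbage immediately
  return chars.join('');
}
```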

@preveen-stack
Contributor

@nodejs/performance PTAL

@H4ad
Member

H4ad commented Dec 4, 2024

Have a read of nodejs/performance#173 to understand more about the cost of creating a new buffer.

You can use the strategy implemented in mongodb/js-bson#703 to help reduce memory usage if you really need to use buffers. To summarize: create one large buffer and track an offset; instead of creating a new buffer every time, take slices of that main buffer until it is full, then discard it and create a new one. This is most efficient when the length of the stored text is well known, but it works for varying lengths too.
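A minimal sketch of that pooling strategy, adapted to the map from the original post (the class name and pool size are illustrative, not from js-bson): each stored value becomes a subarray view into a shared backing buffer, so per-entry allocation overhead disappears.

```javascript
// One large backing buffer is filled left to right; each value is a subarray
// view into it. When the pool is full, it is replaced; live slices keep the
// old pool alive until they become unreachable.
class PooledBufferMap extends Map {
  constructor(poolSize = 1024 * 1024) {
    super();
    this.poolSize = poolSize;
    this.pool = Buffer.allocUnsafe(poolSize);
    this.offset = 0;
  }
  set(key, value) {
    const byteLength = Buffer.byteLength(value, 'utf-8');
    if (this.offset + byteLength > this.pool.length) {
      // pool exhausted: start a new one (grow it if a single value is too big)
      this.pool = Buffer.allocUnsafe(Math.max(this.poolSize, byteLength));
      this.offset = 0;
    }
    const written = this.pool.write(value, this.offset, 'utf-8');
    const slice = this.pool.subarray(this.offset, this.offset + written);
    this.offset += written;
    return super.set(key, slice);
  }
  get(key) {
    return super.get(key)?.toString('utf-8');
  }
}
```

One caveat of this approach: a single long-lived slice pins its entire backing pool in memory, so it suits maps whose entries share a lifetime.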

@lemire
Member

lemire commented Dec 4, 2024

@H4ad's answer is really excellent, and I want to stress this part of it...

instead of creating a new buffer every time, take slices of that main buffer until it is full, then discard it and create a new one

It is a fundamental performance pattern.

@znewsham
Author

znewsham commented Dec 4, 2024

Thanks for the answers. I think things have gotten a bit muddled (or possibly I'm misunderstanding something): my original question was why buffer performance was substantially better than string performance, in both setup time and memory; the increased time of converting the stored buffer to a string on each access was expected. (Outside of this example I would of course not be converting buffers to strings on every map access!)

I discovered the answer to my original question shortly after posting (as I noted in my follow-up): the memory used while building the string was not freed when the function ended, and was retained along with the string when it was stored in the map. In the buffer case, converting the string to a buffer released all the memory used to build it.

This memory behaviour (and frankly the execution time) still seems strange to me, but at least it has an explanation (even if an odd one), and it means that in most real-world situations the memory increase would not be observed. I suspect that if I dug into the string implementation I'd find that each string concatenation retains a reference to both operands rather than allocating, copying, and releasing the originals.

@lemire
Member

lemire commented Dec 4, 2024

@znewsham

I suspect if I dig into the string implementation I'd see that each string concatenation retains a reference to both strings as opposed to allocating, copying and releasing the originals.

You can flatten the string. The following usage pattern might work: s.match(/./). There might be other similar tricks.
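A sketch of that usage pattern. The assumption here (based on the comment above) is that repeated concatenation may leave V8 holding a tree of "cons strings", and that forcing a regexp match makes V8 flatten the result into one contiguous string; the flattening itself is internal to V8 and not directly observable from plain JS, so this only shows where the trick would go.

```javascript
// build a string incrementally, as makeid in the original post does
let s = '';
for (let i = 0; i < 1000; i++) {
  s += 'x';
}

// side effect only: the match result is discarded, but the regexp engine
// needs a flat string, which (per the suggestion above) flattens s in place
s.match(/./);
```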

@H4ad
Member

H4ad commented Dec 4, 2024

@znewsham

I think things have gotten a bit muddled (or possibly I'm not understanding something) - my original question was why buffer performance was substantially better than string performance

My bad, I totally misread your question and your data. The buffer consumes less memory because Node.js implements by default the strategy I explained earlier; you can read the implementation at:

https://github.com/nodejs/node/blob/c4aa34aa4dc5accb100be07b4aceded83eeb3956/lib/buffer.js#L447-L472

bson uses its allocateUnsafe helper, which does not use the buffer pool.

Edit:

You can learn more about the behavior of strings in V8 by reading https://iliazeus.lol/articles/js-string-optimizations-en/
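To make the pooling behavior linked above concrete, here is a small sketch (not from the thread) contrasting pooled and unpooled allocation. Per the Node.js docs, Buffer.allocUnsafe serves allocations smaller than half of Buffer.poolSize (8192 bytes by default) from a shared pre-allocated pool, while Buffer.allocUnsafeSlow always allocates fresh backing memory.

```javascript
const pooled = Buffer.allocUnsafe(64);       // small: likely a slice of the shared pool
const unpooled = Buffer.allocUnsafeSlow(64); // always its own backing ArrayBuffer

console.log(Buffer.poolSize);            // 8192 by default
console.log(pooled.buffer.byteLength);   // typically the whole pool, not 64
console.log(unpooled.buffer.byteLength); // 64
```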
