
String vs Buffer memory usage + performance #4506

Open
2 tasks done
znewsham opened this issue Nov 21, 2024 · 7 comments
Comments

@znewsham

Node.js Version

v18-v22

NPM Version

v10.8.2

Operating System

Linux zacknewsham-xps 6.8.0-48-generic #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

buffer, string_decoder, v8

Description

I'm trying to understand why my application has a high baseline memory usage. While investigating, I discovered something I can't explain: strings seem to cost >10x more memory per character than the equivalent buffer. Some overhead is expected (~2x, given the UTF-16 nature of JS strings), but not on this scale.

A secondary question is why setting up a String->String map is so much slower (8x) than setting up a map that takes the same string and converts it to a buffer before storing.

Below is a minimal reproduction; the commented-out lines in test() let you toggle between the string->string map and the string->buffer map.

I run it with --expose-gc just to get a valid heap snapshot at the end. The total string size stored is (17 + 1000) * 100,000 bytes of characters, so the absolute minimal memory usage would be around 100 MB (a trivial C++ implementation of the same takes 114 MB).

When running with the string->string map, the memory cost is around 3.2 GB and the setup time (to populate the map) is ~11 s; when running with the string->buffer map, the memory cost is 280 MB and the setup time is ~1.3 s. The difference in the reported "time" is completely explicable (the cost of parsing the buffer on each access).

Minimal Reproduction

import { setTimeout } from "timers/promises";

const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// random alpha numeric strings of a specific length
function makeid(length) {
  let result = '';
  const charactersLength = characters.length;
  let counter = 0;
  while (counter < length) {
    result += characters.charAt(Math.floor(Math.random() * charactersLength));
    counter += 1;
  }
  return result;
}

// test setup - 100,000 keys 17 chars long, 1mn iterations, values are 1000 chars long
const keyCount = 100000;
const iterations = 1_000_000;
const keyLength = 17;
const valueLength = 1000;
const keys = new Array(keyCount).fill(0).map(() => makeid(keyLength));

function testMap(map) {
  const startSetup = performance.now();
  keys.forEach(key => map.set(key, makeid(valueLength)));
  const endSetup = performance.now();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const key = keys[Math.floor(Math.random() * keys.length)];
    const value = map.get(key);

    // v8 optimisation busting - without this the loop is 4x faster due to optimising out the get call
    globalThis.value = value;
  }
  const end = performance.now();
  return { time: end - start, setup: endSetup - startSetup };
}

// a naive implementation that keeps the API the same but converts values into buffers
class ConvertToBufferMap extends Map {
  set(key, value) {
    super.set(key, Buffer.from(value, "utf-8"));
  }
  get(key) {
    return super.get(key)?.toString("utf-8");
  }
}


async function test() {
  // const map = new Map();
  // console.log("map", testMap(map));
  const bufferMap = new ConvertToBufferMap();
  console.log("bufferMap", testMap(bufferMap));
  gc();
  console.log(process.memoryUsage().rss / 1024 / 1024);

  // pause to go get a heap snapshot or whatever
  await setTimeout(100000);
}

test();

Output

bufferMap { time: 705.9530600000003, setup: 1303.258812 }
Memory usage:  279.30078125

map { time: 83.8109829999994, setup: 10450.127824000001 }
Memory usage:  3195.6953125

Before You Submit

  • I have looked for issues that already exist before submitting this
  • My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
@znewsham
Author

As is so often the case, 10 minutes after asking the question I figured it out :( (at least partially). The problem is the test setup: it looks like makeid retains memory. I was aware that it would allocate a lot of memory during the incremental string building, but I thought the call to gc would clear it up; evidently not. Additionally, it seems that any conversion (e.g., to a buffer) releases that accumulated memory.

If I change makeid to use an array of pre-allocated length plus a join at the end, the setup performance and memory usage both drop below those of the buffer version. I ran valgrind on it and it didn't show any lost bytes, so the memory does get cleaned up, just not by the gc call.
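For reference, a sketch of that fix (an illustrative rewrite of makeid, not the exact code used): the characters go into a pre-sized array and a single join produces one flat string, so no chain of intermediate concatenation results is retained.

```javascript
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// build into a pre-allocated array, then join once at the end
function makeidJoin(length) {
  const chars = new Array(length);
  for (let i = 0; i < length; i++) {
    chars[i] = characters.charAt(Math.floor(Math.random() * characters.length));
  }
  // join allocates one contiguous string; the array becomes garbage immediately
  return chars.join('');
}
```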

@preveen-stack
Contributor

@nodejs/performance PTAL

@H4ad
Member

H4ad commented Dec 4, 2024

Have a read of nodejs/performance#173 to understand more about the cost of creating a new buffer.

You can use the strategy implemented in mongodb/js-bson#703 to help reduce memory usage if you really need to use buffers. To summarize: create one large buffer and track an offset; instead of creating a new buffer every time, take slices of that main buffer until it is full, then discard it and create a new one. This is most efficient when the length of the stored text is well known, but it works for varying lengths too.
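A minimal sketch of that pooling strategy, adapted to the map from the original post (the class name and pool size are illustrative, not from js-bson): each stored value becomes a subarray view into a shared backing buffer, so per-entry allocation overhead disappears.

```javascript
// One large backing buffer is filled left to right; each value is a subarray
// view into it. When the pool is full, it is replaced; live slices keep the
// old pool alive until they become unreachable.
class PooledBufferMap extends Map {
  constructor(poolSize = 1024 * 1024) {
    super();
    this.poolSize = poolSize;
    this.pool = Buffer.allocUnsafe(poolSize);
    this.offset = 0;
  }
  set(key, value) {
    const byteLength = Buffer.byteLength(value, 'utf-8');
    if (this.offset + byteLength > this.pool.length) {
      // pool exhausted: start a new one (grow it if a single value is too big)
      this.pool = Buffer.allocUnsafe(Math.max(this.poolSize, byteLength));
      this.offset = 0;
    }
    const written = this.pool.write(value, this.offset, 'utf-8');
    const slice = this.pool.subarray(this.offset, this.offset + written);
    this.offset += written;
    return super.set(key, slice);
  }
  get(key) {
    return super.get(key)?.toString('utf-8');
  }
}
```

One caveat of this approach: a single long-lived slice pins its entire backing pool in memory, so it suits maps whose entries share a lifetime.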

@lemire
Member

lemire commented Dec 4, 2024

@H4ad's answer is really excellent, and I want to stress this part of it...

instead of creating a new buffer every time, take slices of that main buffer until it is full, then discard it and create a new one

It is a fundamental performance pattern.

@znewsham
Author

znewsham commented Dec 4, 2024

Thanks for the answers. I think things have gotten a bit muddled (or possibly I'm misunderstanding something): my original question was why buffer performance was substantially better than string performance, in both setup time and memory; the increased time of converting the stored buffer to a string on each access was expected. (Outside of this example I would of course not be converting buffers to strings on every map access!)

I discovered the answer to my original question shortly after posting (as I noted in my follow-up): the memory used while building the string was not freed when the function ended, and was retained along with the string when it was stored in the map. In the buffer case, converting the string to a buffer released all the memory used to build it.

This memory behaviour (and frankly the execution time) still seems strange to me, but at least it has an explanation (even if an odd one), and it means that in most real-world situations the memory increase would not be observed. I suspect that if I dug into the string implementation I'd find that each string concatenation retains a reference to both operands rather than allocating, copying, and releasing the originals.

@lemire
Member

lemire commented Dec 4, 2024

@znewsham

I suspect if I dig into the string implementation I'd see that each string concatenation retains a reference to both strings as opposed to allocating, copying and releasing the originals.

You can flatten the string. The following usage pattern might work: s.match(/./). There might be other similar tricks.
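A sketch of that usage pattern. The assumption here (based on the comment above) is that repeated concatenation may leave V8 holding a tree of "cons strings", and that forcing a regexp match makes V8 flatten the result into one contiguous string; the flattening itself is internal to V8 and not directly observable from plain JS, so this only shows where the trick would go.

```javascript
// build a string incrementally, as makeid in the original post does
let s = '';
for (let i = 0; i < 1000; i++) {
  s += 'x';
}

// side effect only: the match result is discarded, but the regexp engine
// needs a flat string, which (per the suggestion above) flattens s in place
s.match(/./);
```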

@H4ad
Member

H4ad commented Dec 4, 2024

@znewsham

I think things have gotten a bit muddled (or possibly I'm not understanding something) - my original question was why buffer performance was substantially better than string performance

My bad, I totally misread your question and your data. The buffer consumes less memory because Node.js implements by default the strategy I explained earlier; you can read the implementation at:

https://github.com/nodejs/node/blob/c4aa34aa4dc5accb100be07b4aceded83eeb3956/lib/buffer.js#L447-L472

bson uses its allocateUnsafe helper, which does not use the buffer pool.

Edit:

You can learn more about the behavior of strings in V8 by reading https://iliazeus.lol/articles/js-string-optimizations-en/
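To make the pooling behavior linked above concrete, here is a small sketch (not from the thread) contrasting pooled and unpooled allocation. Per the Node.js docs, Buffer.allocUnsafe serves allocations smaller than half of Buffer.poolSize (8192 bytes by default) from a shared pre-allocated pool, while Buffer.allocUnsafeSlow always allocates fresh backing memory.

```javascript
const pooled = Buffer.allocUnsafe(64);       // small: likely a slice of the shared pool
const unpooled = Buffer.allocUnsafeSlow(64); // always its own backing ArrayBuffer

console.log(Buffer.poolSize);            // 8192 by default
console.log(pooled.buffer.byteLength);   // typically the whole pool, not 64
console.log(unpooled.buffer.byteLength); // 64
```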
