rss memory increasing when writing data, and caused node Out Of Memory. #1487

Closed
IAmFQQ opened this issue Feb 20, 2024 · 13 comments

@IAmFQQ

IAmFQQ commented Feb 20, 2024

What is the bug?
Node OOM when writing data to a k-NN index.

RSS memory monitoring
image

How can one reproduce the bug?
KNN field mappings

{
          "type": "knn_vector",
          "doc_values": false,
          "dimension": 96,
          "method": {
            "engine": "nmslib",
            "space_type": "cosinesimil",
            "name": "hnsw",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }

index settings

        "refresh_interval": "-1",
        "number_of_shards": "20",
        "translog": {
          "flush_threshold_size": "30gb",
          "sync_interval": "300s",
          "durability": "async"
        }

one index

health status index                                     uuid                   pri    rep           docs.count  docs.deleted store.size pri.store.size
green  open   my_index                     myindex_uuid            20      0            144574385            0           444.5gb        444.5gb

cluster settings

indices.memory.index_buffer_size: "50%"

The index and cluster settings above are meant to reduce the number of segments generated. Fewer segments shorten the force merge time and also benefit k-NN search performance.
We arrived at these settings through L&P testing.
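
For readers who want to reproduce the setup, the mapping and index settings above can be assembled into a single create-index request body. This is only a sketch: the field name `my_vector` is hypothetical (the issue does not name the field), `"knn": True` is an assumption about how the index was created, and `number_of_replicas` is taken from the `rep` column of the index listing. `indices.memory.index_buffer_size` is a node-level setting and is therefore not part of this body.

```python
import json

# Mapping for the vector field, copied from the issue.
knn_field = {
    "type": "knn_vector",
    "doc_values": False,
    "dimension": 96,
    "method": {
        "engine": "nmslib",
        "space_type": "cosinesimil",
        "name": "hnsw",
        "parameters": {"ef_construction": 128, "m": 24},
    },
}

index_body = {
    "settings": {
        "index": {
            "knn": True,  # assumption: not shown in the issue
            "refresh_interval": "-1",
            "number_of_shards": "20",
            "number_of_replicas": "0",  # from the "rep" column above
            "translog": {
                "flush_threshold_size": "30gb",
                "sync_interval": "300s",
                "durability": "async",
            },
        }
    },
    # "my_vector" is a hypothetical field name used for illustration.
    "mappings": {"properties": {"my_vector": knn_field}},
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```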

What is the expected behavior?

We roll over our index daily and delete old ones, keeping 3 indexes.
But the RSS memory (from top or pmap) keeps increasing day by day, and eventually the node goes down with OOM.

This is the monitoring of the RSS memory, which matches the output of the top command.

image

With the above index settings, a flush (fsync and commit) happens in the background every 300s or when the index buffer (index_buffer_size) is full. After writing the data, we submit a force merge request.

From the RSS memory trend, we can see that RSS grows gradually each time we write to the translog. During force merge there is some decrease, but not much; the overall trend is upward.

On 2-18, the 3rd day, when we pushed data to the new index (writing to the translog), the node crashed with OOM.
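
The RSS trend described above can also be sampled directly on the node, independent of the monitoring stack. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is readable; the helper names are illustrative and not part of any OpenSearch tooling.

```python
import re


def parse_vmrss_kb(status_text: str) -> int:
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found in status text")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current resident set size of a process in kB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())
```

Subtracting the configured Java heap (30G here) from the sampled RSS gives a rough view of how much memory is held outside the heap, which is the part that grows in this report.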

The Java heap was not overused. This is the monitoring of the Java heap:
image

What is your host/environment?

  • OS: linux
  • Version 5.15.0-26-generic
  • Plugins knn 2.7
  • Java Heap 30G
  • K8S Pod Memory 64G
~$ uname -a
Linux my_node_host_name 5.15.0-26-generic #26 SMP Fri Sep 8 10:12:43 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
~$ curl localhost:9200
{
  "name" : "my_cluster_node_name",
  "cluster_name" : "my_cluster_uuid",
  "cluster_uuid" : "my_cluster",
  "version" : {
    "build_flavor" : "default",
    "number" : "7.10.2",
    "build_type" : "docker",
    "build_hash" : "my_cluster_build_hash",
    "build_date" : "my_cluster_build_date",
    "build_snapshot" : false,
    "lucene_version" : "9.5.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
~$ pmap -x $(ps -aux | awk 'NR==4 {print $2}') | sort -k2 -n


.........
.........
00007f76f4000000    1296    1296    1296 rw---   [ anon ]
00007f7914000000    1320    1316    1316 rw---   [ anon ]
00007f7ba08c0000    1324       0       0 r--s- _b.nvd
00007f7be0000000    1328    1304    1304 rw---   [ anon ]
00007f7bc4000000    1348    1320    1320 rw---   [ anon ]
00007f76f0000000    1356    1352    1352 rw---   [ anon ]
00007f7700000000    1356    1352    1352 rw---   [ anon ]
00007f7710000000    1356    1356    1356 rw---   [ anon ]
00007f7728000000    1372    1368    1368 rw---   [ anon ]
00007f7bd8000000    1376    1372    1372 rw---   [ anon ]
00007f79cffe5000    1388     812       0 r-x-- libopensearchknn_nmslib.so
00007f7920000000    1408    1408    1408 rw---   [ anon ]
00007f7b3c000000    1432    1372    1372 rw---   [ anon ]
00007f7b60000000    1444    1392    1392 rw---   [ anon ]
00007f7474000000    1504    1504    1504 rw---   [ anon ]
00007f7d0909b000    1504    1364       0 r-x-- libc-2.31.so
00007f7470000000    1524    1524    1524 rw---   [ anon ]
00007f7c64000000    1532    1532    1532 rw---   [ anon ]
00007f7c38800000    1536    1536    1536 rw---   [ anon ]
00007f7bdc000000    1644    1468    1468 rw---   [ anon ]
00007f7b54000000    1716    1680    1680 rw---   [ anon ]
00007f78dfe4e000    1736       0       0 -----   [ anon ]
00007f79f7e4d000    1740       0       0 -----   [ anon ]
00007f771c000000    1744    1720    1720 rw---   [ anon ]
00007f7b90000000    1904    1864    1864 rw---   [ anon ]
00007f7c3a200000    2048    2048    2048 rw---   [ anon ]
00007f7c3adfb000    2048       0       0 ----- jna17713769312024914914.tmp (deleted)
00007f7c3b700000    2048    2048    2048 rw---   [ anon ]
00007f7c3b900000    2048    2048    2048 rw---   [ anon ]
00007f79d02ff000    2052    1032    1032 rw---   [ anon ]
00007f7b48000000    2052    1892    1892 rw---   [ anon ]
00007f7454000000    2164    2164    2164 rw---   [ anon ]
00007f79bfdd1000    2236       0       0 -----   [ anon ]
00007f7d07d34000    2376    2376       0 r---- libjvm.so
00007f7cf09ab000    2496    2496    2496 rwx--   [ anon ]
00007f7c3b303000    2548    1624    1624 rw---   [ anon ]
00007f7d08ca3000    2608     820       0 r---- libjvm.so
0000000801980000    3200    3200    3200 rw---   [ anon ]
00007f7cf0c1b000    3208       0       0 -----   [ anon ]
00007f776c000000    3464    3452    3452 rw---   [ anon ]
00007f7c44000000    3556    3556    3556 rw---   [ anon ]
00007f7786c80000    3584       0       0 -----   [ anon ]
00007f7c0b800000    3584    3584    3584 rw---   [ anon ]
00007f7b43c79000    3612       0       0 r--s- _b_Lucene90_0.tip
00007f7b4320d000    3616      84       0 r--s- _b_Lucene90_0.tip
00007f7ba1863000    3620       0       0 r--s- _b_Lucene90_0.tip
00007f7b404b2000    3712       0       0 r--s- _b_Lucene90_0.tip
00007f7c3a600000    3712    3712    3712 rw---   [ anon ]
00007f7786703000    4084    3188    3188 rw---   [ anon ]
00007f7c0bc00000    4096    4096    4096 rw---   [ anon ]
00007f7ca1000000    4096    4096    4096 rw---   [ anon ]
0000000800000000    4428    4428    4396 rw--- classes.jsa
00007f747c000000    4616    4616    4616 rw---   [ anon ]
00007f7ca0b03000    5108    4184    4184 rw---   [ anon ]
00007f7c39c00000    5120    5120    5120 rw---   [ anon ]
00007f7ba038a000    5336       0       0 r--s- _b.kdd
00007f72d0000000    5472    5320    5320 rw---   [ anon ]
00007f79d0600000    6144    6144    6144 rw---   [ anon ]
00007f7c0b200000    6144    6144    6144 rw---   [ anon ]
00007f7c38a00000    6144    6144    6144 rw---   [ anon ]
00007f7720000000    7000    7000    7000 rw---   [ anon ]
00007f7b43595000    7056       0       0 r--s- _b.nvd
00007f7b42b27000    7064     112       0 r--s- _b.nvd
00007f7ba117c000    7068       0       0 r--s- _b.nvd
00007f79c0000000    7184      12      12 rw---   [ anon ]
00007f78d0000000    7196      12      12 rw---   [ anon ]
00007f79e8000000    7196      12      12 rw---   [ anon ]
00007f79d0ecb000    7228       0       0 r--s- _b.nvd
00007f72d4000000    7340    7116    7116 rw---   [ anon ]
0000000800453000    7688    7688       0 r---- classes.jsa
00007f7768000000    8064    7764    7764 rw---   [ anon ]
00007f7c09703000    8180    7260    7260 rw---   [ anon ]
00007f7c0a000000    8192    8192    8192 rw---   [ anon ]
00007f7c0a800000    8192    8192    8192 rw---   [ anon ]
00007f7c38000000    8192    8192    8192 rw---   [ anon ]
0000000801080000    9152    9152    9152 rw---   [ anon ]
00007f7ba2f03000    9204    8284    8284 rw---   [ anon ]
00007f7c39303000    9204    8280    8280 rw---   [ anon ]
00007f7c04000000    9600    9528    9528 rw---   [ anon ]
00007f7478000000    9876    9876    9876 rw---   [ anon ]
00007f7c50000000   11812   11808   11808 rw---   [ anon ]
00007f7d06c9e000   12504    6296    6296 rw---   [ anon ]
00007f7c98000000   13408   13408   13408 rw---   [ anon ]
00007f7d07f86000   13428   11624       0 r-x-- libjvm.so
00007f7c18000000   13960   13960   13960 rw---   [ anon ]
00007f7c00000000   15268    9984    9984 rw---   [ anon ]
00000007ff082000   15864   15864   15864 rw---   [ anon ]
00007f7480000000   17440   15360   15360 rw---   [ anon ]
00007f7c6c000000   17600   17600   17600 rw---   [ anon ]
00007f7450000000   18148   17788   17788 rw---   [ anon ]
00007f78b8000000   22316   22316   22316 rw---   [ anon ]
00007f7b58000000   22424   16600   16600 rw---   [ anon ]
00007f7c2c000000   26224   26224   26224 rw---   [ anon ]
00007f7180000000   26916   26916   26916 rw---   [ anon ]
00007f7c34000000   27960   22248   22248 rw---   [ anon ]
00007f79d3205000   28132       0       0 r--s- _b_Lucene90_0.doc
00007f79ee431000   28476       0       0 r--s- _b.kdd
00007f7949fb9000   28480       0       0 r--s- _b.kdd
00007f7a08fa6000   28496     104       0 r--s- _b.kdd
00007f79d15da000   28844       0       0 r--s- _b_Lucene90_0.tim
00007f7535409000   29168       0       0 r--s- _b.kdd
00007f7ce9474000   32256   32256   32256 rwx--   [ anon ]
00007f7cf0f3d000   35392   35392   35392 rwx--   [ anon ]
00007f7c35b4e000   37576       0       0 -----   [ anon ]
00007f7181a49000   38620       0       0 -----   [ anon ]
00007f7c2d99c000   39312       0       0 -----   [ anon ]
00007f7d0408f000   43048     804     804 rw---   [ anon ]
00007f7b595e6000   43112       0       0 -----   [ anon ]
00007f78b95cb000   43220       0       0 -----   [ anon ]
00007f74511b9000   47388       0       0 -----   [ anon ]
00007f7c6d130000   47936       0       0 -----   [ anon ]
00007f7481108000   48096       0       0 -----   [ anon ]
00007f7c00ee9000   50268       0       0 -----   [ anon ]
00007f7c18da2000   51576       0       0 -----   [ anon ]
00007f7c98d18000   52128       0       0 -----   [ anon ]
00007f7c50b89000   53724       0       0 -----   [ anon ]
00007f74789a5000   55660       0       0 -----   [ anon ]
00007f7c04960000   55936       0       0 -----   [ anon ]
00007f77687e0000   57472       0       0 -----   [ anon ]
00007f72d472b000   58196       0       0 -----   [ anon ]
00007f78d0707000   58340       0       0 -----   [ anon ]
00007f79e8707000   58340       0       0 -----   [ anon ]
00007f79c0704000   58352       0       0 -----   [ anon ]
00007f77206d6000   58536       0       0 -----   [ anon ]
00007f72d0558000   60064       0       0 -----   [ anon ]
00007f747c482000   60920       0       0 -----   [ anon ]
00007f7c44379000   61980       0       0 -----   [ anon ]
00007f776c362000   62072       0       0 -----   [ anon ]
00007f79bc000000   63300   63300   63300 rw---   [ anon ]
00007f745421d000   63372       0       0 -----   [ anon ]
00007f7b48201000   63484       0       0 -----   [ anon ]
00007f7b901dc000   63632       0       0 -----   [ anon ]
00007f771c1b4000   63792       0       0 -----   [ anon ]
00007f7b541ad000   63820       0       0 -----   [ anon ]
00007f7bdc19b000   63892       0       0 -----   [ anon ]
00007f7c6417f000   64004       0       0 -----   [ anon ]
00007f747017d000   64012       0       0 -----   [ anon ]
00007f7474178000   64032       0       0 -----   [ anon ]
00007f7b60169000   64092       0       0 -----   [ anon ]
00007f7b3c166000   64104       0       0 -----   [ anon ]
00007f7920160000   64128       0       0 -----   [ anon ]
00007f7bd8158000   64160       0       0 -----   [ anon ]
00007f7728157000   64164       0       0 -----   [ anon ]
00007f76f0153000   64180       0       0 -----   [ anon ]
00007f7700153000   64180       0       0 -----   [ anon ]
00007f7710153000   64180       0       0 -----   [ anon ]
00007f7bc4151000   64188       0       0 -----   [ anon ]
00007f7be014c000   64208       0       0 -----   [ anon ]
00007f791414a000   64216       0       0 -----   [ anon ]
00007f76f4144000   64240       0       0 -----   [ anon ]
00007f7934140000   64256       0       0 -----   [ anon ]
00007f7bc013c000   64272       0       0 -----   [ anon ]
00007f7bd413b000   64276       0       0 -----   [ anon ]
00007f791c13a000   64280       0       0 -----   [ anon ]
00007f7bac138000   64288       0       0 -----   [ anon ]
00007f7928136000   64296       0       0 -----   [ anon ]
00007f7a04136000   64296       0       0 -----   [ anon ]
00007f7bb8130000   64320       0       0 -----   [ anon ]
00007f793c129000   64348       0       0 -----   [ anon ]
00007f792c124000   64368       0       0 -----   [ anon ]
00007f7bc8123000   64372       0       0 -----   [ anon ]
00007f7bb4122000   64376       0       0 -----   [ anon ]
00007f7bbc11f000   64388       0       0 -----   [ anon ]
00007f746811a000   64408       0       0 -----   [ anon ]
00007f7938117000   64420       0       0 -----   [ anon ]
00007f746c0f9000   64540       0       0 -----   [ anon ]
00007f74580ee000   64584       0       0 -----   [ anon ]
00007f7c880e2000   64632       0       0 -----   [ anon ]
00007f74640e0000   64640       0       0 -----   [ anon ]
00007f79180e0000   64640       0       0 -----   [ anon ]
00007f7be40e0000   64640       0       0 -----   [ anon ]
00007f74600d4000   64688       0       0 -----   [ anon ]
00007f7bd00cf000   64708       0       0 -----   [ anon ]
00007f7be80cf000   64708       0       0 -----   [ anon ]
00007f7b840c8000   64736       0       0 -----   [ anon ]
00007f7c800c5000   64748       0       0 -----   [ anon ]
00007f7c7c0c4000   64752       0       0 -----   [ anon ]
00007f7c8c0bb000   64788       0       0 -----   [ anon ]
00007f7bf80b9000   64796       0       0 -----   [ anon ]
00007f7c900b6000   64808       0       0 -----   [ anon ]
00007f7c9c0b5000   64812       0       0 -----   [ anon ]
00007f7c780a5000   64876       0       0 -----   [ anon ]
00007f7b9c09f000   64900       0       0 -----   [ anon ]
00007f7bcc09e000   64904       0       0 -----   [ anon ]
00007f7c5409d000   64908       0       0 -----   [ anon ]
00007f7c84098000   64928       0       0 -----   [ anon ]
00007f7c74087000   64996       0       0 -----   [ anon ]
00007f7b98084000   65008       0       0 -----   [ anon ]
00007f7ba407e000   65032       0       0 -----   [ anon ]
00007f7ba8073000   65076       0       0 -----   [ anon ]
00007f7bb0073000   65076       0       0 -----   [ anon ]
00007f7b94056000   65192       0       0 -----   [ anon ]
00007f7b5004c000   65232       0       0 -----   [ anon ]
00007f7bf0042000   65272       0       0 -----   [ anon ]
00007f78f0000000   65300   65300   65300 rw---   [ anon ]
00007f7b44022000   65400       0       0 -----   [ anon ]
00007f7b80022000   65400       0       0 -----   [ anon ]
00007f7c10022000   65400       0       0 -----   [ anon ]
00007f76f8021000   65404       0       0 -----   [ anon ]
00007f7708021000   65404       0       0 -----   [ anon ]
00007f770c021000   65404       0       0 -----   [ anon ]
00007f7714021000   65404       0       0 -----   [ anon ]
00007f7718021000   65404       0       0 -----   [ anon ]
00007f7724021000   65404       0       0 -----   [ anon ]
00007f7760021000   65404       0       0 -----   [ anon ]
00007f7774021000   65404       0       0 -----   [ anon ]
00007f7b5c021000   65404       0       0 -----   [ anon ]
00007f7b64021000   65404       0       0 -----   [ anon ]
00007f7b70021000   65404       0       0 -----   [ anon ]
00007f7b74021000   65404       0       0 -----   [ anon ]
00007f7b78021000   65404       0       0 -----   [ anon ]
00007f7b7c021000   65404       0       0 -----   [ anon ]
00007f7b88021000   65404       0       0 -----   [ anon ]
00007f7b8c021000   65404       0       0 -----   [ anon ]
00007f7bec021000   65404       0       0 -----   [ anon ]
00007f7bf4021000   65404       0       0 -----   [ anon ]
00007f7bfc021000   65404       0       0 -----   [ anon ]
00007f7c0c021000   65404       0       0 -----   [ anon ]
00007f7c14021000   65404       0       0 -----   [ anon ]
00007f7c1c021000   65404       0       0 -----   [ anon ]
00007f7c20021000   65404       0       0 -----   [ anon ]
00007f7c24021000   65404       0       0 -----   [ anon ]
00007f7c28021000   65404       0       0 -----   [ anon ]
00007f7c30021000   65404       0       0 -----   [ anon ]
00007f7c3c021000   65404       0       0 -----   [ anon ]
00007f7c40021000   65404       0       0 -----   [ anon ]
00007f7c48021000   65404       0       0 -----   [ anon ]
00007f7c58021000   65404       0       0 -----   [ anon ]
00007f7c5c021000   65404       0       0 -----   [ anon ]
00007f7c60021000   65404       0       0 -----   [ anon ]
00007f7c68021000   65404       0       0 -----   [ anon ]
00007f7c70021000   65404       0       0 -----   [ anon ]
00007f7c94021000   65404       0       0 -----   [ anon ]
00007f72b0000000   65524   65524   65524 rw---   [ anon ]
00007f72cc000000   65524   65524   65524 rw---   [ anon ]
00007f72d8000000   65524   65524   65524 rw---   [ anon ]
00007f7318000000   65524   65524   65524 rw---   [ anon ]
00007f731c000000   65524   65524   65524 rw---   [ anon ]
00007f7320000000   65524   65524   65524 rw---   [ anon ]
00007f73b0000000   65524   65524   65524 rw---   [ anon ]
00007f73d4000000   65524   65524   65524 rw---   [ anon ]
00007f78c0000000   65524   65524   65524 rw---   [ anon ]
00007f6e54000000   65528   65528   65528 rw---   [ anon ]
00007f6fb4000000   65528   65528   65528 rw---   [ anon ]
00007f6fb8000000   65528   65528   65528 rw---   [ anon ]
00007f6fc4000000   65528   65528   65528 rw---   [ anon ]
00007f71c4000000   65528   65528   65528 rw---   [ anon ]
00007f71d4000000   65528   65528   65528 rw---   [ anon ]
00007f7200000000   65528   65528   65528 rw---   [ anon ]
00007f72a8000000   65528   65528   65528 rw---   [ anon ]
00007f72c0000000   65528   65528   65528 rw---   [ anon ]
00007f72c4000000   65528   65528   65528 rw---   [ anon ]
00007f72e4000000   65528   65528   65528 rw---   [ anon ]
00007f72e8000000   65528   65528   65528 rw---   [ anon ]
00007f732c000000   65528   65528   65528 rw---   [ anon ]
00007f733c000000   65528   65528   65528 rw---   [ anon ]
00007f73f4000000   65528   65528   65528 rw---   [ anon ]
00007f73f8000000   65528   65528   65528 rw---   [ anon ]
00007f78c8000000   65528   65528   65528 rw---   [ anon ]
00007f78f8000000   65528   65528   65528 rw---   [ anon ]
00007f790c000000   65528   65528   65528 rw---   [ anon ]
00007f7910000000   65528   65528   65528 rw---   [ anon ]
00007f795c000000   65528   65528   65528 rw---   [ anon ]
00007f796c000000   65528   65528   65528 rw---   [ anon ]
00007f6fbc000000   65532   65532   65532 rw---   [ anon ]
00007f6fc8000000   65532   65532   65532 rw---   [ anon ]
00007f6fd4000000   65532   65532   65532 rw---   [ anon ]
00007f7004000000   65532   65532   65532 rw---   [ anon ]
00007f71ac000000   65532   65532   65532 rw---   [ anon ]
00007f71b0000000   65532   65532   65532 rw---   [ anon ]
00007f71cc000000   65532   65532   65532 rw---   [ anon ]
00007f71f8000000   65532   65532   65532 rw---   [ anon ]
00007f72b4000000   65532   65532   65532 rw---   [ anon ]
00007f72c8000000   65532   65532   65532 rw---   [ anon ]
00007f72e0000000   65532   65532   65532 rw---   [ anon ]
00007f7300000000   65532   65532   65532 rw---   [ anon ]
00007f7310000000   65532   65532   65532 rw---   [ anon ]
00007f7328000000   65532   65532   65532 rw---   [ anon ]
00007f7330000000   65532   65532   65532 rw---   [ anon ]
00007f7338000000   65532   65532   65532 rw---   [ anon ]
00007f7368000000   65532   65532   65532 rw---   [ anon ]
00007f73dc000000   65532   65532   65532 rw---   [ anon ]
00007f73e0000000   65532   65532   65532 rw---   [ anon ]
00007f6fc0000000   65536   65536   65536 rw---   [ anon ]
00007f6fd8000000   65536   65536   65536 rw---   [ anon ]
00007f7008000000   65536   65536   65536 rw---   [ anon ]
00007f700c000000   65536   65536   65536 rw---   [ anon ]
00007f70a0000000   65536   65536   65536 rw---   [ anon ]
00007f71c8000000   65536   65536   65536 rw---   [ anon ]
00007f71d0000000   65536   65536   65536 rw---   [ anon ]
00007f71d8000000   65536   65536   65536 rw---   [ anon ]
00007f71fc000000   65536   65536   65536 rw---   [ anon ]
00007f7298000000   65536   65536   65536 rw---   [ anon ]
00007f72ac000000   65536   65536   65536 rw---   [ anon ]
00007f72dc000000   65536   65536   65536 rw---   [ anon ]
00007f72ec000000   65536   65536   65536 rw---   [ anon ]
00007f7304000000   65536   65536   65536 rw---   [ anon ]
00007f7314000000   65536   65536   65536 rw---   [ anon ]
00007f7324000000   65536   65536   65536 rw---   [ anon ]
00007f7334000000   65536   65536   65536 rw---   [ anon ]
00007f736c000000   65536   65536   65536 rw---   [ anon ]
00007f7390000000   65536   65536   65536 rw---   [ anon ]
00007f7394000000   65536   65536   65536 rw---   [ anon ]
00007f73ac000000   65536   65536   65536 rw---   [ anon ]
00007f73b4000000   65536   65536   65536 rw---   [ anon ]
00007f73c0000000   65536   65536   65536 rw---   [ anon ]
00007f73d8000000   65536   65536   65536 rw---   [ anon ]
00007f740c000000   65536   65536   65536 rw---   [ anon ]
00007f741c000000   65536   65536   65536 rw---   [ anon ]
00007f760c000000   65536   65536   65536 rw---   [ anon ]
00007f76fc000000   65536   65536   65536 rw---   [ anon ]
00007f7704000000   65536   65532   65532 rw---   [ anon ]
00007f7764000000   65536   65536   65536 rw---   [ anon ]
00007f7780000000   65536   65536   65536 rw---   [ anon ]
00007f7794000000   65536   65536   65536 rw---   [ anon ]
00007f77a4000000   65536   65536   65536 rw---   [ anon ]
00007f788c000000   65536   65536   65536 rw---   [ anon ]
00007f78bc000000   65536   65536   65536 rw---   [ anon ]
00007f78c4000000   65536   65536   65536 rw---   [ anon ]
00007f78e0000000   65536   65536   65536 rw---   [ anon ]
00007f78e8000000   65536   65536   65536 rw---   [ anon ]
00007f78f4000000   65536   65536   65536 rw---   [ anon ]
00007f78fc000000   65536   65536   65536 rw---   [ anon ]
00007f7924000000   65536   65536   65536 rw---   [ anon ]
00007f7930000000   65536   65536   65536 rw---   [ anon ]
00007f7944000000   65536   65536   65536 rw---   [ anon ]
00007f7954000000   65536   65536   65536 rw---   [ anon ]
00007f7960000000   65536   65536   65536 rw---   [ anon ]
00007f7970000000   65536   65536   65536 rw---   [ anon ]
00007f797c000000   65536   65536   65536 rw---   [ anon ]
00007f79c4000000   65536   65536   65536 rw---   [ anon ]
00007f79dc000000   65536   65536   65536 rw---   [ anon ]
00007f79f8000000   65536   65536   65536 rw---   [ anon ]
00007f7b4c000000   65536   65536   65536 rw---   [ anon ]
00007f7b68000000   65536   65536   65536 rw---   [ anon ]
00007f7b6c000000   65536   65528   65528 rw---   [ anon ]
00007f7c4c000000   65536   65508   65508 rw---   [ anon ]
00007f7d00000000   65536   55028   55028 rw---   [ anon ]
00007f7cf31cd000   84636       0       0 -----   [ anon ]
00007f7ceb3f4000   87772       0       0 -----   [ anon ]
00007f79e11ca000  112856  112856  112856 rw---   [ anon ]
00007f79fd192000  113080  113080  113080 rw---   [ anon ]
00007f7ce1c74000  122880  122880  122880 rw---   [ anon ]
00007f7286840000  126444       0       0 r--s- _b_Lucene90_0.tim
00007f7cf8474000  126512    3248       0 r--s- modules
00007f79c82ed000  126588     108       0 r--s- _b_Lucene90_0.tim
00007f7149089000  126664       0       0 r--s- _b_Lucene90_0.tim
00007f7595a27000  128156     140       0 r--s- _a.cfs
00007f79f0000000  129332  129332  129332 rw---   [ anon ]
00007f78d8000000  129336  129336  129336 rw---   [ anon ]
00007f7778000000  130392  130392  130392 rw---   [ anon ]
00007f71e4000000  130924  130924  130924 rw---   [ anon ]
00007f6fe4000000  131060  131060  131060 rw---   [ anon ]
00007f73ec000000  131060  131060  131060 rw---   [ anon ]
00007f778c000000  131060  131060  131060 rw---   [ anon ]
00007f6fa4000000  131064  131064  131064 rw---   [ anon ]
00007f6fdc000000  131064  131064  131064 rw---   [ anon ]
00007f6ffc000000  131064  131064  131064 rw---   [ anon ]
00007f72f8000000  131064  131064  131064 rw---   [ anon ]
00007f7350000000  131064  131064  131064 rw---   [ anon ]
00007f73b8000000  131064  131064  131064 rw---   [ anon ]
00007f73c4000000  131064  131064  131064 rw---   [ anon ]
00007f6f94000000  131068  131068  131068 rw---   [ anon ]
00007f71bc000000  131068  131068  131068 rw---   [ anon ]
00007f72f0000000  131068  131068  131068 rw---   [ anon ]
00007f7358000000  131068  131068  131068 rw---   [ anon ]
00007f73a4000000  131068  131068  131068 rw---   [ anon ]
00007f7404000000  131068  131068  131068 rw---   [ anon ]
00007f779c000000  131068  131068  131068 rw---   [ anon ]
00007f7964000000  131068  131068  131068 rw---   [ anon ]
00007f6f8c000000  131072  131072  131072 rw---   [ anon ]
00007f6f9c000000  131072  131072  131072 rw---   [ anon ]
00007f6fac000000  131072  131072  131072 rw---   [ anon ]
00007f6fcc000000  131072  131072  131072 rw---   [ anon ]
00007f6fec000000  131072  131072  131072 rw---   [ anon ]
00007f6ff4000000  131072  131072  131072 rw---   [ anon ]
00007f7188000000  131072  131072  131072 rw---   [ anon ]
00007f7190000000  131072   65668   65668 rw---   [ anon ]
00007f719c000000  131072  131072  131072 rw---   [ anon ]
00007f71a4000000  131072  131072  131072 rw---   [ anon ]
00007f71b4000000  131072  131072  131072 rw---   [ anon ]
00007f71dc000000  131072  131072  131072 rw---   [ anon ]
00007f71f0000000  131072  131072  131072 rw---   [ anon ]
00007f72a0000000  131072  131072  131072 rw---   [ anon ]
00007f72b8000000  131072  131072  131072 rw---   [ anon ]
00007f7308000000  131072  131072  131072 rw---   [ anon ]
00007f7340000000  131072  131072  131072 rw---   [ anon ]
00007f7348000000  131072  131072  131072 rw---   [ anon ]
00007f7360000000  131072  131072  131072 rw---   [ anon ]
00007f7370000000  131072  131072  131072 rw---   [ anon ]
00007f7378000000  131072  131072  131072 rw---   [ anon ]
00007f7380000000  131072  131072  131072 rw---   [ anon ]
00007f7388000000  131072  131072  131072 rw---   [ anon ]
00007f739c000000  131072  131072  131072 rw---   [ anon ]
00007f73cc000000  131072  131072  131072 rw---   [ anon ]
00007f73e4000000  131072  131072  131072 rw---   [ anon ]
00007f73fc000000  131072  131072  131072 rw---   [ anon ]
00007f7414000000  131072  131072  131072 rw---   [ anon ]
00007f7898000000  131072  131072  131072 rw---   [ anon ]
00007f78a0000000  131072  131072  131072 rw---   [ anon ]
00007f78a8000000  131072  131072  131072 rw---   [ anon ]
00007f78b0000000  131072  131072  131072 rw---   [ anon ]
00007f794c000000  131072  131072  131072 rw---   [ anon ]
00007f7974000000  131072  131072  131072 rw---   [ anon ]
00007f7537085000  131552       0       0 r--s- _b_Lucene90_0.tim
00007f759d74e000  144732      68       0 r--s- _8.cfs
00007f75820ec000  146812     112       0 r--s- _9.cfs
00007f7578b42000  153256     104       0 r--s- _3.cfs
00007f728e3bb000  160020       0       0 r--s- _b_Lucene90_0.doc
00007f7176392000  160184     124       0 r--s- _b_Lucene90_0.doc
00007f7a0ab7a000  160308       0       0 r--s- _b_Lucene90_0.doc
00007f753f0fd000  163408       0       0 r--s- _b_Lucene90_0.doc
00007f758b04b000  173936     112       0 r--s- _7.cfs
00007f75b1a90000  185760      12       0 r--s- _5.cfs
00007f75a64a5000  186284      44       0 r--s- _4.cfs
00007f7900000000  196608  196608  196608 rw---   [ anon ]
00007f75dada7000  211436      64       0 r--s- _6.cfs
00007f7549091000  339620      44       0 r--s- _2.cfs
00007f755dc3a000  441376      64       0 r--s- _0.cfs
00007f75bcff8000  489148      76       0 r--s- _1.cfs
00007f7ca1f73000  492548  491536  491536 rw---   [ anon ]
00007f75e7c22000  527768       0       0 r--s- _b_Lucene90_0.dvd
00007f7cc0074000  552960  552960  552960 rw---   [ anon ]
00007f7a14807000  647140       0       0 r--s- _0.cfs
00007f6e28527000  715620       0       0 r--s- _b_Lucene90_0.dvd
00007f6ee01ac000  719184       8       0 r--s- _b_Lucene90_0.dvd
00007f6de7fb5000  721196       0       0 r--s- _b_Lucene90_0.dvd
00007f7485781000  782880       0       0 r--s- _b_Lucene90_0.dvd
00007f772fea1000  787836  787836  787836 rw---   [ anon ]
0000000801ca0000 1031552       0       0 -----   [ anon ]
00007f6f0c000000 1048576       0       0 r--s- _b_Lucene90_0.dvd
00007f6f4c000000 1048576      44       0 r--s- _b_Lucene90_0.dvd
00007f70c9089000 1048576       0       0 r--s- _b_Lucene90_0.dvd
00007f7109089000 1048576      44       0 r--s- _b_Lucene90_0.dvd
00007f7206840000 1048576       0       0 r--s- _b_Lucene90_0.dvd
00007f7246840000 1048576      44       0 r--s- _b_Lucene90_0.dvd
00007f74b5409000 1048576       0       0 r--s- _b_Lucene90_0.dvd
00007f74f5409000 1048576       0       0 r--s- _b_Lucene90_0.dvd
00007f7a3c000000 1048576       0       0 r--s- _0.cfs
00007f7a7c000000 1048576       0       0 r--s- _0.cfs
00007f7abc000000 1048576       0       0 r--s- _0.cfs
00007f7afc000000 1048576      44       0 r--s- _0.cfs
00007f6be6500000 8415956 8415956 8415956 rw---   [ anon ]
0000000080000000 31439872 31439872 31439872 rw---   [ anon ]

There are many 64MB and 128MB memory blocks.
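
To quantify this observation instead of reading it off the listing, the `pmap -x` rows can be tallied with a short script. The sketch below assumes the column layout shown above (address, Kbytes, RSS, Dirty, mode, mapping). The 10% tolerance used to decide that a block is "about" 64MB or 128MB is an arbitrary choice for illustration. Anonymous regions of at most 64MB are consistent with glibc malloc arenas, whose heaps are capped at 64MB on 64-bit Linux, but the tally alone does not prove that.

```python
from collections import Counter

KB_PER_MB = 1024


def summarize_pmap(lines, tolerance=0.1):
    """Sum RSS (kB) per mapping and count anon regions near 64MB / 128MB."""
    rss_by_mapping = Counter()
    anon_blocks = Counter()
    for line in lines:
        parts = line.split(None, 5)
        # Skip headers, separators and the totals row.
        if len(parts) < 6 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        mapping = parts[5].strip()
        rss_kb = int(parts[2])
        rss_by_mapping[mapping] += rss_kb
        if mapping == "[ anon ]":
            for size_mb in (64, 128):
                full = size_mb * KB_PER_MB
                if full * (1 - tolerance) <= rss_kb <= full:
                    anon_blocks[size_mb] += 1
    return rss_by_mapping, anon_blocks


# A few rows copied from the listing above.
sample_rows = [
    "00007f7c60021000   65404       0       0 -----   [ anon ]",
    "00007f72b0000000   65524   65524   65524 rw---   [ anon ]",
    "00007f6f8c000000  131072  131072  131072 rw---   [ anon ]",
    "00007f75bcff8000  489148      76       0 r--s- _1.cfs",
]
rss_by_mapping, anon_blocks = summarize_pmap(sample_rows)
```

Running it on the full output (for example by passing `sys.stdin` as `lines`) shows how much of the resident memory sits in anonymous regions rather than in memory-mapped Lucene files.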

We thought it might be the same as #772.

So we took nmslib_wrapper.cpp from 2.8 and used it to replace the same file in 2.7
(see https://github.com/opensearch-project/k-NN/blame/f11f1f1d4ad0de76b05517b57bcc87e0a6788031/jni/src/nmslib_wrapper.cpp#L129),
then rebuilt the OS & KNN image and deployed it.

The issue was not resolved; the nodes still went OOM.

Finally, we replaced glibc's ptmalloc with tcmalloc. The OOM is gone, but the force merge time of the index increased from 1h to 4h, which is not acceptable.
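
When swapping allocators, it can be worth confirming that the replacement library is actually mapped into the running process. The sketch below scans the text of `/proc/<pid>/maps` for allocator library names. The library paths in the sample are assumptions for illustration; real paths depend on the image and distribution.

```python
def loaded_allocators(maps_text: str) -> set:
    """Return which known allocator libraries appear in /proc/<pid>/maps text."""
    found = set()
    for line in maps_text.splitlines():
        for name in ("jemalloc", "tcmalloc"):
            if name in line:
                found.add(name)
    return found


def allocators_for_pid(pid: int) -> set:
    """Check a live process on Linux; an empty set suggests the default allocator."""
    with open(f"/proc/{pid}/maps") as f:
        return loaded_allocators(f.read())


# Hypothetical maps excerpt; the library path is an assumption.
sample_maps = (
    "7f7d0909b000-7f7d09213000 r-xp 00000000 08:01 1234 /lib/x86_64-linux-gnu/libc-2.31.so\n"
    "7f7d0a000000-7f7d0a050000 r-xp 00000000 08:01 5678 /usr/lib/x86_64-linux-gnu/libjemalloc.so.2\n"
)
```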

Our questions:

  1. Is the issue we encountered similar to #772?
  2. If so, what is the plan to improve it in the future?
IAmFQQ added the bug and untriaged labels Feb 20, 2024
@jmazanec15
Member

jmazanec15 commented Feb 20, 2024

@IAmFQQ Did you try jemalloc instead of tcmalloc? We have seen jemalloc significantly improve memory fragmentation. We haven't benchmarked its impact on force merge time yet, but if possible, it would be good to get a data point.

Additionally, you could try faiss hnsw. Its implementation makes fewer, smaller allocations.

Related issues:

  1. [BUG] opensearch process exits during addition of documents with knn_vector field #772
  2. Add support to use jemalloc as memory allocator for supported platforms OpenSearch#9496

@IAmFQQ
Author

IAmFQQ commented Feb 21, 2024

Thanks @jmazanec15 for your suggestions, very much appreciated.

We will run the tests and get back to you next week.

Test cases:

  1. faiss hnsw using ptmalloc
  2. nmslib hnsw using jemalloc.

@jmazanec15
Member

Sounds good. Yes, please let us know.

@jmazanec15
Member

Hey @IAmFQQ were you able to test?

@IAmFQQ
Author

IAmFQQ commented Mar 6, 2024

Hi @jmazanec15
Last week we ran into some trouble building the latest image.

We will have the test report by the end of this week and will share it with you.

@IAmFQQ
Author

IAmFQQ commented Mar 19, 2024

hi @jmazanec15
Sorry for the late response.

We tested nmslib hnsw with jemalloc. The force merge time did not drop much; it was about the same as with tcmalloc.

To use faiss hnsw, we have to change the space_type from cosinesimil to l2, which we have not tested yet. It requires a reassessment of the scores, and our upstream application needs to evaluate it.

I will keep you posted.
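
On the cosinesimil to l2 change mentioned above: if vectors are normalized to unit length before indexing and querying (an assumption; the issue does not say whether they are), squared L2 distance and cosine similarity are related by ||a - b||^2 = 2 - 2 cos(a, b). The ranking of neighbors is then the same under both, even though the numeric scores differ, which is why the upstream application would still need to re-evaluate any score thresholds. A small sketch with made-up vectors:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up example vectors, not taken from the issue.
query = [1.0, 2.0, 3.0]
docs = {"a": [2.0, 4.0, 6.5], "b": [-1.0, 0.5, 2.0], "c": [3.0, 1.0, 0.0]}

q_unit = normalize(query)

# Rank by descending cosine similarity on the raw vectors.
by_cosine = sorted(docs, key=lambda k: -cosine_similarity(query, docs[k]))
# Rank by ascending squared L2 distance on the normalized vectors.
by_l2 = sorted(docs, key=lambda k: squared_l2(q_unit, normalize(docs[k])))

print(by_cosine == by_l2)
```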

@jmazanec15
Member

@IAmFQQ thanks

We tested the nmslib hnsw using jemalloc, the forcemerge time did not drop too much, the same as we using tcmalloc.

I'm trying to understand why this might be the case. Any ideas? I'll try to run an experiment locally.

@vamshin
Member

vamshin commented Mar 19, 2024

@IAmFQQ thanks

We tested the nmslib hnsw using jemalloc, the forcemerge time did not drop too much, the same as we using tcmalloc.

Im trying to understand why this might be the case. Any ideas? Ill try to run an experiment locally.

I am also very curious to understand how this could impact forcemerge time, that is, how the strategy for reclaiming free space affects the graph build time. Is it because we are running in a constrained environment and there are pauses to free up space? Anything interesting you noticed in the logs would be helpful.

@IAmFQQ
Author

IAmFQQ commented Mar 28, 2024

@vamshin @jmazanec15
We made a mistake. The forcemerge time did not increase; it actually decreased.
A node restart interrupts the forcemerge running on some nodes, so our automation tool checks the number of segments and decides, based on that count, whether the forcemerge needs to be submitted again.

However, our regular statistics tool only counts the last submission, which led to the measurement error.
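
One reading of the measurement error described here is sketched below: if a force merge is resubmitted after each interruption, timing only the final submission underreports the total time spent. The record shape, function names, and numbers are hypothetical and only illustrate the difference between the two ways of counting.

```python
from dataclasses import dataclass


@dataclass
class ForceMergeRun:
    """One force merge submission (hypothetical record shape)."""
    start_s: float  # submission start, in seconds
    end_s: float    # completion or interruption time, in seconds


def last_run_duration(runs):
    """What a tool that only looks at the final submission would report."""
    return runs[-1].end_s - runs[-1].start_s


def total_duration(runs):
    """Time spent across every submission, including interrupted ones."""
    return sum(r.end_s - r.start_s for r in runs)


# Illustrative numbers only, not measurements from this issue.
runs = [
    ForceMergeRun(0, 1800),     # interrupted by a node restart
    ForceMergeRun(2000, 3000),  # interrupted again
    ForceMergeRun(3100, 5000),  # ran to completion
]

print(last_run_duration(runs), total_duration(runs))
```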

Thanks for your help @jmazanec15 .

So the solution to avoid OOM is to use tcmalloc/jemalloc.

@jmazanec15
Copy link
Member

@IAmFQQ Oh good to hear. Did you notice any diff between tcmalloc and jemalloc?

jmazanec15 removed the bug label Mar 28, 2024
navneet1v changed the title from "[BUG] rss memory increasing when writing data, and caused node Out Of Memory." to "rss memory increasing when writing data, and caused node Out Of Memory." Mar 28, 2024
@jmazanec15
Member

Also, I ran some micro-benchmarks based on https://github.com/jmazanec15/k-NN-1/blob/micro-benchmarks/micro-benchmarks/src/main/java/org/opensearch/knn/BuildNativeIndexBenchmarks.java and found that, for 250K vectors, jemalloc makes building graphs a little bit faster.

# Default malloc
Benchmark                                           (dimension)  (efConstruction)  (engine)  (indexThreadQty)  (m)   (spaceType)  Mode  Cnt    Score   Error  Units
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100    nmslib                 4   16  innerproduct    ss    4   67.803 ± 1.715   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100    nmslib                 4   16            l2    ss    4   70.823 ± 1.086   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100    nmslib                 4   16  innerproduct    ss    4  176.794 ± 0.491   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100    nmslib                 4   16            l2    ss    4  169.535 ± 3.413   s/op

# Jemalloc
Benchmark                                           (dimension)  (efConstruction)  (engine)  (indexThreadQty)  (m)   (spaceType)  Mode  Cnt    Score   Error  Units
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100    nmslib                 4   16  innerproduct    ss    4   62.201 ± 0.763   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100    nmslib                 4   16            l2    ss    4   66.505 ± 0.935   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100    nmslib                 4   16  innerproduct    ss    4  170.340 ± 2.244   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100    nmslib                 4   16            l2    ss    4  164.709 ± 5.022   s/op

Checking to see if same applies for faiss.

@jmazanec15
Member

Faiss

About the same

# Default malloc
Benchmark                                           (dimension)  (efConstruction)  (engine)  (indexThreadQty)  (m)   (spaceType)  Mode  Cnt    Score   Error  Units
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100     faiss                 4   16  innerproduct    ss    4   36.799 ± 2.609   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100     faiss                 4   16            l2    ss    4   38.620 ± 1.216   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100     faiss                 4   16  innerproduct    ss    4  128.657 ± 2.237   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100     faiss                 4   16            l2    ss    4  111.069 ± 1.493   s/op

# Jemalloc
Benchmark                                           (dimension)  (efConstruction)  (engine)  (indexThreadQty)  (m)   (spaceType)  Mode  Cnt    Score   Error  Units
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100     faiss                 4   16  innerproduct    ss    4   36.818 ± 1.558   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 128               100     faiss                 4   16            l2    ss    4   38.968 ± 0.672   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100     faiss                 4   16  innerproduct    ss    4  126.917 ± 1.734   s/op
BuildNativeIndexBenchmarks.buildNativeIndex                 512               100     faiss                 4   16            l2    ss    4  109.311 ± 1.692   s/op
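
To put "a little bit faster" and "about the same" into numbers, the relative change per configuration can be computed from the scores in the benchmark tables. The sketch below copies the s/op scores from those tables; the dictionary layout and the function name are just one way to organize the comparison, and it ignores the reported error margins.

```python
# Scores in s/op copied from the benchmark tables,
# keyed by (engine, dimension, spaceType).
default_malloc = {
    ("nmslib", 128, "innerproduct"): 67.803,
    ("nmslib", 128, "l2"): 70.823,
    ("nmslib", 512, "innerproduct"): 176.794,
    ("nmslib", 512, "l2"): 169.535,
    ("faiss", 128, "innerproduct"): 36.799,
    ("faiss", 128, "l2"): 38.620,
    ("faiss", 512, "innerproduct"): 128.657,
    ("faiss", 512, "l2"): 111.069,
}
jemalloc = {
    ("nmslib", 128, "innerproduct"): 62.201,
    ("nmslib", 128, "l2"): 66.505,
    ("nmslib", 512, "innerproduct"): 170.340,
    ("nmslib", 512, "l2"): 164.709,
    ("faiss", 128, "innerproduct"): 36.818,
    ("faiss", 128, "l2"): 38.968,
    ("faiss", 512, "innerproduct"): 126.917,
    ("faiss", 512, "l2"): 109.311,
}


def relative_change_pct(baseline, candidate):
    """Percent change of candidate versus baseline; negative means faster."""
    return (candidate - baseline) / baseline * 100.0


for key, baseline in default_malloc.items():
    engine, dimension, space_type = key
    change = relative_change_pct(baseline, jemalloc[key])
    print(f"{engine:>6} dim={dimension:<3} {space_type:<12} {change:+.1f}%")
```

On these numbers, the nmslib builds come out roughly 3% to 8% faster with jemalloc, while the faiss results stay within about 2% either way, which is comparable to the reported error margins.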



@jmazanec15
Member

Closing due to no activity.
