
High resource usage over time (memory leak?) #3446

Open
TruncatedDinoSour opened this issue Nov 28, 2024 · 28 comments

Comments

@TruncatedDinoSour

Background information

  • Dendrite version or git SHA: 0.13.8+79b87c7
  • SQLite3 or Postgres?: Postgres
  • Running in Docker?: No
  • go version: go version go1.23.0 linux/amd64
  • Client used (if applicable): SchildiChat, Element, Hydrogen, or Cinny are what most people on the HS use, I believe

Description

  • What is the problem: Dendrite's RAM usage grows steadily the longer it runs. This makes it not only a RAM hog but also a CPU hog, since I run zRAM (maybe that's also related to Dendrite using the CPU? I don't know). Regardless, every week or so I have to restart Dendrite because it keeps eating more and more RAM, degrading overall server performance.
  • Who is affected: The server.
  • How is this bug manifesting: As Dendrite runs long-term, it slowly consumes more and more RAM and/or swap.
  • When did this first appear: I can't recall. I don't remember needing to restart Dendrite before 1.18, I think.

Steps to reproduce

  • Run dendrite
  • Use it for a week or so
  • Watch the resource, mainly RAM, usage grow over time
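One low-effort way to watch that growth between restarts (an editorial sketch, not from the thread; it assumes the systemd unit is named dendrite.service, as shown later in this thread):

```shell
# Print a timestamped reading of the service's cgroup memory usage.
# systemd reports MemoryCurrent in bytes ("[not set]" if accounting is off).
log_mem() {
    printf '%s %s\n' "$(date -Is)" \
        "$(systemctl show -p MemoryCurrent --value "$1")"
}

# Sample once a minute until interrupted, e.g.:
#   while sleep 60; do log_mem dendrite.service; done >> dendrite-mem.log
```

Logging like this gives a growth curve to attach to a report instead of a single snapshot.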

This is its resource usage only two days after its most recent restart, and it only creeps up over time until I restart it:

[screenshot]

It's weird.

@TruncatedDinoSour (Author)

Had to restart it again. Could only last 3 days.

@TruncatedDinoSour (Author)

The s7evink/fetch-auth-events branch fixed it.

@TruncatedDinoSour (Author) commented Dec 9, 2024

Never mind, lol. It was fine for about 6 days and now it's bad again.

@neilalexander (Contributor)

No idea if zRAM is a setup that should be supported but without a memory profile it will be difficult to tell what’s going on.

https://element-hq.github.io/dendrite/development/profiling

@TruncatedDinoSour (Author) commented Dec 11, 2024

No idea if zRAM is a setup that should be supported but without a memory profile it will be difficult to tell what's going on.

It is zRAM, yeah.
But is profiling a good choice over, like, a week? Wouldn't the report be huge?
And even so, wouldn't it severely impact performance for the week? Is there a way to check this with minimal disruption?

edit:

[screenshot]

it's only growing :')

actually, since it's clearly growing a lot over a single day, I'm down to set up profiling tomorrow and report the day after :) I'll do that

@neilalexander (Contributor) commented Dec 11, 2024

You just need a single memory profile captured when the memory usage is high. That should contain enough info and the files are small.

Having profiling enabled has next to no runtime cost so it’s fine to have it switched on for a long time.
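As a sketch of what that capture can look like (editorial, not from the thread; it assumes Dendrite was started with its pprof listener enabled via the PPROFLISTEN environment variable described in the profiling docs linked above, and the address below is an example):

```shell
# Example pprof listen address; must match what Dendrite was started with
# (assumption: PPROFLISTEN=localhost:65432 per the profiling docs).
PPROF_ADDR="localhost:65432"
HEAP_URL="http://$PPROF_ADDR/debug/pprof/heap"

# Capture a single heap snapshot while memory usage is high:
#   curl -s "$HEAP_URL" -o heap
# Then inspect it locally with the Go toolchain's interactive web UI:
#   go tool pprof -http=:8081 heap
echo "capture with: curl -s $HEAP_URL -o heap"
```

The snapshot file is small, so it is practical to capture one each time usage looks high and compare them.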

@TruncatedDinoSour (Author) commented Dec 11, 2024

You just need a single memory profile captured when the memory usage is high. That should contain enough info and the files are small.

Having profiling enabled has next to no runtime cost so it’s fine to have it switched on for a long time.

Oh nice, okay then, I'll enable the profiler tomorrow, since for today I consider myself done and want the rest of the day/evening off xD

I'll send out a memory profile capture in 1-3 days in this thread :)

@TruncatedDinoSour (Author) commented Dec 11, 2024

OK, since it was just one environment variable, I enabled it now. I thought it might be more complicated, but no. I'll post the profile when the resource usage is high :)

@TruncatedDinoSour (Author) commented Dec 14, 2024

@neilalexander okay, I got the heap report. I don't know how safe it is to share heap reports; I've heard it's generally acceptable, but for now I'll stick to web UI screenshots.

It is at 2.5 GB of RAM usage at the moment according to systemd, or was at the time I took the report; it has dropped to 2.2 GB now (??):

● dendrite.service - dendrite
     Loaded: loaded (/etc/systemd/system/dendrite.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-12-12 21:13:13 UTC; 1 day 16h ago
   Main PID: 271785 (dendrite)
      Tasks: 20 (limit: 19144)
     Memory: 2.2G
        CPU: 2h 17min 46.905s
     CGroup: /system.slice/dendrite.service
             └─271785 /home/matrix/go/bin/dendrite --config dendrite.yaml

I had to restart Dendrite 2 days ago since it was refusing to write to the PostgreSQL database and both sending and receiving messages were slow :)

If you need the actual report in full, let me know.

[screenshot]

phony.run is somehow at 64%

[screenshot]

[screenshot]

something about SQL databases; I'm using Postgres, if it matters

[screenshot]

graph:

[screenshot]

[screenshot]

[screenshot]

I've now disabled pprof and restarted Dendrite. If you need more info, I can provide it, but for now I think this will be okay.

@TruncatedDinoSour (Author)

IIRC @jjj333-p has experienced a similar issue with Dendrite and may have extra input?

@neilalexander (Contributor)

Looks like it's struggling to process events. Can you try switching your instance to #3447 and see if it improves after a couple hours?

@TruncatedDinoSour (Author) commented Dec 14, 2024

Looks like it's struggling to process events. Can you try switching your instance to #3447 and see if it improves after a couple hours?

The PR or the s7evink/fetch-auth-events branch? I'm already on the branch, if that's what you're asking; I can try the PR, I guess.

edit: oh, the PR is literally a merge request to merge that branch into main.

edit 1: and yes, the branch drastically improves performance, though not the resource bug. It might have helped it partially at least, since it's not as brutal anymore.

@jjj333-p (Contributor)

I face this same issue. I can say that the fetch-auth-events PR did help a lot, but it's still not fixed. I also notice that after about a week of uptime it will be near OOM on my system (the full 4 GB of RAM + zRAM), and then I reboot and both the CPU and RAM usage are way down (10-15% average CPU, only 1-2 GB of RAM used). I also have all the caching I can disabled in the config. I don't know how helpful I can be beyond that, sorry.

@jjj333-p (Contributor) commented Dec 14, 2024

[screenshot]
Dendrite itself (after 2 days up) is using 1.8 GB of RAM, and the Postgres instance dedicated to just it and the sliding sync proxy is using another half a gig or more, depending on how busy it is.

Here it is after restarting the Dendrite service and sending a message in a large room:
[screenshot]

@jjj333-p (Contributor)

I'll also note that when rooms like the abandoned Genshin Impact room start backfilling in (thanks, t2bot, for never working correctly), it does max out the CPU and raise RAM somewhat, but not to that level. It's a gradual, over-time thing.

[screenshot]

@neilalexander (Contributor)

@TruncatedDinoSour Can you attach the full profile?

@TruncatedDinoSour (Author)

@TruncatedDinoSour Can you attach the full profile?

sure

heap.gz

@TruncatedDinoSour (Author)

@TruncatedDinoSour Can you attach the full profile?

sure

heap.gz

FTR, if it wasn't clear, this is gzipped xD, as in:

gzip -d heap.gz for the actual profile

I had to gzip it because GitHub complained about the raw file.

@TruncatedDinoSour (Author)

ISTG, I can't even run it for 3 days without it crying in agony, ugh.

@neilalexander (Contributor)

I saw it was gzipped, yeah. It looks a lot like the profile is just showing the server crunching some extremely complex room state. Are you a member of any very large or complex rooms?

@TruncatedDinoSour (Author)

I saw it was gzipped yeah, it looks a lot like the profile is just showing the server is crunching some extremely complex room state. Are you a member of any very large or complex rooms?

No, and I don't think anyone on the HS really is.

@neilalexander (Contributor)

Any interesting log entries happening at the time? Particularly level=error?

@TruncatedDinoSour (Author)

Any interesting log entries happening at the time? Particularly level=error?

Not that I can see in recent logs; some media stuff, but other than that, nothing of concern, I believe.

@TruncatedDinoSour (Author)

Any interesting log entries happening at the time? Particularly level=error?

not that i see from recent logs, some media stuff but other than that nothing worth of concern i believe

Turns out one user was a member of the matrix.org HQ room; maybe that's why.

@neilalexander (Contributor)

Try using the admin API evacuateRoom to force everyone out of that room and see if things calm down. (Afterwards you could try purging it too, just in case the room state is corrupt from former auth errors.)
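For reference, a hedged sketch of what those calls can look like against Dendrite's admin API (the homeserver URL, access token, and room ID below are placeholders; check your Dendrite version's admin API docs for the exact paths):

```shell
# Placeholder values; substitute your homeserver, an admin user's access
# token, and the room ID in question.
HS="https://localhost:8448"
TOKEN="syt_example_admin_token"
ROOM_ID='!abcdef:matrix.org'

# The room ID must be URL-encoded ('!' -> %21, ':' -> %3A):
ENCODED=$(printf '%s' "$ROOM_ID" | sed -e 's/!/%21/' -e 's/:/%3A/g')

# Kick all local users out of the room:
#   curl -X POST -H "Authorization: Bearer $TOKEN" \
#       "$HS/_dendrite/admin/evacuateRoom/$ENCODED"
# Then, optionally, purge it entirely:
#   curl -X POST -H "Authorization: Bearer $TOKEN" \
#       "$HS/_dendrite/admin/purgeRoom/$ENCODED"
echo "$ENCODED"
```

Both calls can take a while on rooms with large state, so don't be alarmed if they appear to hang.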

@TruncatedDinoSour (Author) commented Dec 16, 2024

Try using the admin API evacuateRoom to force everyone out of that room and see if things calm down. (Afterwards you could try purging it too, just in case the room state is corrupt from former auth errors.)

Don't worry, I'm running both evacuate and purge right now.

It's taking some time, but it's working.

Regardless, the resource usage has only been a problem recently, while that account has been there since the start of the homeserver, so I don't know.

Maybe it'll help? We'll see, I guess :D

@bones-was-here

If you "need" to host huge rooms on a system without much RAM you might benefit from disabling the in-memory cache:

global:
  cache:
    max_age: 0s

The cost is somewhat more work for postgres.

@TruncatedDinoSour (Author)

If you "need" to host huge rooms on a system without much RAM you might benefit from disabling the in-memory cache:

global:
  cache:
    max_age: 0s

The cost is somewhat more work for postgres.

I did, a while ago already.
