Strange increase in CPU-usage over time #565
Comments
How long does it take before you start to notice the problem? I can try running a profiler for a long time and see where it spends the most time.
Also, can you verify the postgrest version you're running? (Send it a request and inspect the
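A quick way to check this (a minimal sketch, assuming PostgREST is listening on its default port 3000 and you can reach it directly, bypassing nginx, which may otherwise override the header):

```bash
# query postgrest directly and look at the response headers;
# the running version should show up in one of them (e.g. Server)
curl -sI http://localhost:3000/ | grep -i server
```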
As a test I've started instances of all 0.3.x versions of postgrest on a Debian server and will watch them to see which versions start to consume CPU.
Thanks. I was a little busy earlier, so my answer was delayed. I tried to get the version from the response header, but since I have nginx in front I only found the nginx version. One of the two servers was installed a few weeks ago from master on GitHub; the other is maybe two months old. Both behave the same. I think this issue should show up after a week, but it is a slope, so it gets worse the longer it runs. I will keep an eye on mine and we can discuss the numbers when it starts to show. I have only experienced this once per server so far, but I see no reason it shouldn't come back.
The question is whether the increase is gradual or abrupt. If gradual, then it's some type of leak. If abrupt, then it's a specific request that causes the issue, and in that case you need to run the exact schema/requests to replicate it.
It was very gradual, but I don't see it yet either.
Now suddenly things start to happen. The server is not idling all the time; it is in a testing phase before going into production, so we are filling it with data and running tests on it. But it is not under constant load, so something is wrong with this gradual increase in CPU. As you see, it lets go of memory, but CPU just keeps increasing. You can also see from I/O that there has been more or less no load for the last 24 h. The graphs show only postgrest; the other processes are not increasing in CPU.
@nicklasaven join gitter chat (let's see if strace reveals anything interesting)
The only place I can think of where PostgREST could get into an infinite loop is this section. So the first order of business is to check on the PostgreSQL side whether there are 40001 errors happening.
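A minimal sketch of how one might check for that on the database side, assuming the default Debian log location (/var/log/postgresql/) and that the server logs either the SQLSTATE code or the standard serialization-failure message:

```bash
# look for serialization failures (SQLSTATE 40001) in the PostgreSQL logs
sudo grep -rniE '40001|could not serialize access' /var/log/postgresql/

# or ask the running cluster where it is actually logging to
sudo -u postgres psql -c "SHOW log_directory;" -c "SHOW log_filename;"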
Do your system utilization logs and database logs both go back to when you started postgrest? You could try to visually find the point in time where the CPU started climbing, then look at the db logs from around that time. Maybe there will be a telltale event. Also, the CPU on my server is still holding steady, but I haven't sent the postgrest servers any requests. I can try sending various types of requests to experiment:
I could begin at step 1, then wait a day for any CPU growth. Then try 2, etc.
Sorry for the delay. I am under some pressure and short of both daytime and evening time, but I will get back later today or tomorrow and investigate more. I checked yesterday and it is still increasing, but it doesn't affect the server yet as there is only limited usage on it.
How do I use strace here? I find no error 40001 in the logs.
From what I know you can attach to a running process by pid, but you need to look up exactly how to do that.
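A minimal sketch of attaching strace to a running PostgREST process (the pgrep lookup is an assumption; use whatever pid your process manager reports):

```bash
# find the postgrest pid(s)
pgrep -a postgrest

# attach to one pid, follow forked/cloned children, timestamp each syscall,
# and write the trace to a file; stop with Ctrl-C after a few minutes
sudo strace -f -tt -T -p <postgrest-pid> -o postgrest.strace
```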
Can confirm this: under heavy load for ~2 weeks (sustained 100+ req/s), postgrest very gradually grew from about 3% CPU to 40% before I killed it. I can try to dig through my metrics later today.
Gradual is good, I think. Next time don't kill it until you get an strace log.
Ok, now I have collected with strace for a few minutes. But it seems like I'm missing what it is doing between leaving the wait and getting back to waiting again. CPU for both is approx 60% now.
@nicklasaven you need to strace again,
I will start by looking at 3831 and 6722 since they seem to have used a lot of CPU. I guess the time is total CPU usage.
They both only gave results like this:
So I guess they have children too. Or they are just not being used right now, of course. Hmm, more digging needed.
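One way to see whether those pids have child threads (a sketch using standard Linux tooling, nothing PostgREST-specific):

```bash
# list all threads (LWPs) belonging to a pid, with per-thread CPU usage and time
ps -L -o pid,lwp,pcpu,time,comm -p <postgrest-pid>

# or simply count the kernel task entries for that pid
ls /proc/<postgrest-pid>/task | wc -l
```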
Ok, here is something: it seems like sched_yield() is involved.
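To quantify that, a per-syscall summary might help (a sketch; with -c, strace counts calls and time per syscall instead of logging each one):

```bash
# attach and count syscalls; stop with Ctrl-C after a minute or so
# and strace prints a summary table on exit
sudo strace -c -f -p <postgrest-pid>
```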
I would be interested to see a flame graph of where the CPU is spending time. Try this:

```bash
# install perf and debug symbols for libc
sudo apt-get update
sudo apt-get install linux-tools-common libc6-dbg

# check and note whether your virtual server provides hardware events
perf list

# observe postgrest for a minute, sampling the stack at 99 hertz
sudo perf record -F 99 -p <postgrest-pid> -g --call-graph dwarf -- sleep 60

# create visualization of perf dump
git clone https://github.com/brendangregg/FlameGraph.git
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > postgrest.svg
```

If your postgrest binary has sufficient debugging symbols then we should get a good idea of how it is spending time.
The problem is that my 2 instances spawn another 20-30 processes. Is it possible to collect them all in one go? I post 2 strace logs here: postgrest_all.txt is all those processes in one file before restarting the server; postgrest_all_after.txt is all processes after postgrest, and the whole server, was restarted. They are quite small when compressed but approx 400 MB each when decompressed. Be aware.
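One way to trace all the postgrest processes in one go (a sketch; strace accepts multiple -p options, and -ff splits the output into one file per traced process):

```bash
# build a "-p PID -p PID ..." list from every running postgrest process
# and trace them all, writing per-pid files named postgrest.strace.<pid>
sudo strace -f -ff -tt -T $(pgrep postgrest | sed 's/^/-p /') -o postgrest.strace
```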
Rather than
Random hunch is that this is related to threads getting spawned and then not dying. The futex calls in your logs look like threads continually asking to be put to sleep. Maybe an asynchronous exception messes something up, possibly in the hasql library. Once nikita-volkov/hasql#47 is fixed I can disable the GHC
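If the hunch about threads not dying is right, one cheap way to test it (a sketch, assuming a single postgrest process; adjust the pgrep pattern to your setup) is to watch the OS-thread count of the process over time:

```bash
# print the number of OS threads of the oldest postgrest process every 10 minutes;
# a steadily growing count would support the "threads never die" theory
watch -n 600 'ls /proc/$(pgrep -o postgrest)/task | wc -l'
```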
I have been running PostgREST for a week (on and off, since this computer sleeps a lot); it is now up to peaks of 69% CPU and 108 MB of RAM, for close to 8h30 of execution time. I have not had a great experience with
However, I could not get any report. Parsing the
I tried to move the
I am now considering either or both of:
Maybe if you upload the event log to S3 we could run the event analyzer on an EC2 instance with a large amount of RAM:
@eric-brechemier how about checking the RTS thing?
I do not think that RAM is the problem here (I have 16GB).
That's also on the table.
The problem was not the size of the file, but the fact that writing was still in progress, apparently;
It produced the three report files:
which I compressed to:
So in postgrest.totals.txt I see the first line :)
Ready to try my suggestion :)? (Don't run it on the custom compilation with profiling, run it on 3.1.1.)
Sure :)
OK.
Now running
The difference is beyond compare: the PostgREST server is now running below 1% of CPU, with less than 10MB of RAM.
HA! :) I feel a new release coming up.
What a relief. 🌞 I'm sure there's a way to bake RTS settings into either the cabal file or the stack.yaml. I'll make a PR with the change; when the PR is merged I'll close this issue. Then it's time to make a release! We have a lot of other fixes and features queued up as well. I wonder if this calls for filing an issue in Warp or updating https://ghc.haskell.org/trac/ghc/ticket/4322
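For reference, a minimal sketch of how RTS defaults can be baked into a cabal executable stanza via GHC's -with-rtsopts option (the specific flags shown, -N and -I2, are assumptions drawn from the discussion above, not necessarily the values the PR ended up using):

```
executable postgrest
  -- compile with the threaded runtime, allow +RTS overrides at run time,
  -- and default the idle GC interval to 2 seconds
  ghc-options: -threaded -rtsopts "-with-rtsopts=-N -I2"
```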
I think it's worth opening an issue and providing links to this thread and to the link above. I've searched their GitHub repo and there is no mention of high CPU usage or custom +RTS parameters.
I confirm that after 5 days, the
OK. I just restarted the server with
Thanks a lot!
Yes.
Thanks for the debugging help everyone.
Following this thread from a foreign constellation: isn't it safer to use
@PierreR these particular numbers are still unsatisfying to me. Why 1.5 rather than 0.31? And does the magic number vary based on server CPU speed? It seems to me that the lower the number, the more likely the CPU leak will happen, and the higher the number, the more likely that postgrest could temporarily run out of memory. The CPU leak has been observed, but I don't think we've observed a problem with memory, so it's probably safer to err on the side of a higher magic number.
@begriffs as per the bug report I mentioned earlier, this is what is happening: if your application has a timer signal that executes some Haskell code every 1s, say, then you'll get an idle GC after each one. Setting the idle GC to a longer period than your timer signal (e.g. 2s) will prevent this happening. Warp has a timer that runs at 1s intervals to generate the string timestamp for the log output; it's an optimisation thing. Setting -I to anything above 1 sec should fix this, but 2s I think is ok.
Another solution could be this: I think the 1s timer is created by the logger middleware, so if that is removed then it's probable 0.3 will also work. I think in a production setup the logging should be done by nginx and postgrest should not log at all, except start/stop ... maybe in the future this can be an option/flag not to log.
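A minimal sketch of what that looks like at run time, assuming the postgrest binary was compiled with -rtsopts so RTS flags can be passed on the command line (the arguments before +RTS are a placeholder for however you normally start postgrest):

```bash
# run postgrest with the GHC idle-GC interval raised from the default 0.3s to 2s
postgrest <your-usual-arguments> +RTS -I2 -RTS
```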
@eric-brechemier so increasing the idle GC interval also made the memory creep you saw disappear? Why would that be, I wonder?
Yes.
@ruslantalpa found out that it was related to this GHC bug: |
@eric-brechemier do you find the information in that GHC bug report satisfying? It's still not clear to me why idle GC would manifest as a space leak. Maybe related to finalizers...? This has just cropped up in another codebase I'm working on, so I'm wondering whether to comment on that GHC 4322 bug, open a new one, etc. Wondering if you have any further information. We found last time that setting
We have a server running on Debian with PostgreSQL 9.5 and the latest (I think) postgrest.
The server is more or less just idling, since it is a production server not yet in production.
Over time I have seen that the CPU usage is increasing. After some weeks it is using more than 50% of 400% in total (4 cores). I can see on the graphs that the usage increases very slowly, day by day.
I have 2 instances of postgrest running on the server.
I see the same thing on our test server, but that one is under some load during this test phase. But it also uses a lot of CPU when it is idling.
When I restart the postgrest instances things go back to normal.