
Strange increase in CPU-usage over time #565

Closed
nicklasaven opened this issue Apr 18, 2016 · 79 comments

Comments

@nicklasaven

We have a server running on Debian with PostgreSQL 9.5 and the latest (I think) PostgREST.

The server is more or less just idling, since it is a production server that is not yet in production.

Over time I have seen that the CPU usage is increasing. After some weeks it is using more than 50% of the 400% total (4 cores). I can see on the graphs that the usage is increasing very slowly, day by day.

I have 2 instances of postgrest running on the server.

I see the same thing on our test server, but that one is under some load during this test phase. Even when it is idling, though, it uses a lot of CPU.

When I restart the postgrest instances things go back to normal.

@begriffs
Member

How long does it take before the problem starts to show? I can try running a profiler for a long time and see where it spends the most time.

@begriffs
Member

Also can you verify the postgrest version you're running? (Send it a request and inspect the Server header in the response)

@begriffs
Member

As a test I've started instances of all 0.3.x versions of postgrest on a Debian server and will watch them to see which versions start to consume CPU.

@nicklasaven
Author

Thanks

I was a little busy earlier, so my answer was delayed. I tried to get the version from the response header, but since I have nginx in front I only found the nginx version. One of the two servers was installed a few weeks ago from master on GitHub; the other is maybe two months old. Both behave the same.

I think this issue should show after a week, but it is a slope, so it gets worse the longer it runs.

I will keep an eye on mine and we can discuss the numbers when it starts to show.

I have only experienced this once per server so far, but I see no reason it shouldn't come back.

@begriffs
Member

After three days of running all 0.3 versions at once the CPU is staying flat on my Debian 8.3 server.

[Screenshot: CPU usage graph, 2016-04-21 9:08 AM]

Maybe the problem happens only after postgrest has received a few requests. I'll let it go another day and then try sending some requests.

@ruslantalpa
Contributor

The question is whether the increase is gradual or abrupt. If gradual, then it's some type of leak. If abrupt, then it's a specific request that causes the issue, and in that case you need to run the exact schema/requests to replicate it.

@nicklasaven
Author

It was very gradual, but I don't see it yet either.

@nicklasaven
Author

nicklasaven commented Apr 24, 2016

Now suddenly things start to happen. The server is not idling all the time, but is in a testing phase before going into production, so we are filling it with data and running tests on it. But it is not under constant load, so something is wrong with this gradual increase in CPU.

As you can see, it let go of memory, but CPU just keeps increasing. You can also see from I/O that there has been more or less no load for the last 24 h. The graphs show only postgrest; the other processes are not showing the same CPU increase.

[Screenshot: Linode monitoring graphs]

@ruslantalpa
Contributor

@nicklasaven join gitter chat (let's see if strace reveals anything interesting)

@ruslantalpa
Contributor

The only place I can think of where PostgREST could get into an infinite loop is this section:
https://github.com/nikita-volkov/hasql-transaction/blob/master/library/Hasql/Transaction.hs#L69-L73
It seems that for a particular error, the transaction library will just try to run the transaction again.
Also note that there is a commit from just a few days ago related to this particular section of the code, so maybe just upgrading the transaction dependency will fix this; nikita can probably give an opinion on whether this CPU usage might be caused by that particular code section.
The error is this one:
https://github.com/nikita-volkov/postgresql-error-codes/blob/master/library/PostgreSQL/ErrorCodes.hs#L388

So the first order of business is to check on the PostgreSQL side whether 40001 errors are happening.
The second thing to try is to attach to the postgrest process with strace and log all system calls for a while; this might give us an indication of what is happening.
The third thing to try (after we have the strace log) is to just stop postgresql (the db) and see what happens, then start it back up after a minute or so to see whether that has an impact on the CPU usage (but this is the last thing to try; we don't want to eliminate the problem and wait another week for it to come up).
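
For the first check, here is a minimal sketch (the PostgreSQL log path and the log contents are assumptions for a stock Debian install; adjust them to your setup):

# count serialization failures (SQLSTATE 40001) reported in the PostgreSQL log
sudo grep -c -E "40001|could not serialize access" /var/log/postgresql/postgresql-9.5-main.log

# or, from inside the database, look at the rollback counters per database
sudo -u postgres psql -c "SELECT datname, xact_rollback FROM pg_stat_database;"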

@begriffs
Member

Do your system utilization logs and database logs both go back to when you started postgrest? You could try to visually find the point in time where CPU started climbing, then look at the db logs from around that time. Maybe there will be a telltale event.

Also the CPU on my server is still holding steady but I haven't sent the postgrest servers any requests. I can try sending various types of requests to experiment:

  1. A read-only request that causes no errors (maybe GET /)
  2. A read-only request that causes an error (GET /fakeroute)
  3. Insert a row
  4. Insert a row causing a constraint violation

I could begin at step 1 then wait a day for any CPU growth. Then try 2, etc.
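
For reference, a rough sketch of what those requests could look like with curl; the host, port, and the "items" table with its columns are made up for illustration:

# 1. read-only request with no error
curl http://localhost:3000/
# 2. read-only request that triggers an error
curl http://localhost:3000/fakeroute
# 3. insert a row (hypothetical "items" table)
curl -X POST http://localhost:3000/items -H "Content-Type: application/json" -d '{"name": "test"}'
# 4. insert a row that violates a constraint (e.g. reuse an existing primary key)
curl -X POST http://localhost:3000/items -H "Content-Type: application/json" -d '{"id": 1, "name": "duplicate"}'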

@nicklasaven
Author

Sorry for the delay. I am under some pressure and I am short of both daytime and evening time.

But I will get back later today or tomorrow and investigate more.

I checked yesterday and it is still increasing, but it doesn't affect the server yet, as there is only limited usage on it.

@nicklasaven
Author

How do I use strace here?
Can I hook the tool into the running process?

I find no error 40001 in the logs.

@ruslantalpa
Contributor

From what I know you can attach to a running process by PID, but you will need to look up exactly how to do that.
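
A minimal sketch of attaching to a running process (the PID and output filename are placeholders):

# find the postgrest process id
pgrep -l postgrest

# attach to it, timestamp each system call, and write the log to a file;
# detach with Ctrl-C after a few minutes
sudo strace -tt -p <postgrest-pid> -o postgrest_strace.txt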

@fortytw2

Can confirm this: under heavy load for ~2 weeks (sustained 100+ req/s), postgrest very gradually grew from about 3% CPU to 40% before I killed it. I can try to dig through my metrics later today.

@ruslantalpa
Contributor

Gradual is good, I think. Next time don't kill it until you get an strace log.

@nicklasaven
Author

OK, now I have collected strace output for a few minutes.
There are two instances, and they both seem to behave the same.

But it seems to me that I am missing what it is doing between leaving the wait and getting back to waiting again.

CPU for both is approx 60% now.

postgrest_3541.txt
postgrest_3538.txt

@nicklasaven
Author

These are the graphs for the last 24 hours.

The strace was only a few minutes.

[Screenshot: Linode graphs for the last 24 hours]

@ruslantalpa
Contributor

@nicklasaven you need to strace again; you only traced the parent process.
Read this https://meenakshi02.wordpress.com/2011/02/02/strace-hanging-at-futex/ and resubmit the strace.
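
A sketch of how the child threads could be covered as well, assuming <pid> is the parent postgrest process:

# list the OS threads (tasks) belonging to the parent process
ls /proc/<pid>/task
# or view them with ps
ps -T -p <pid>

# trace the parent and, with a reasonably recent strace, all of its threads,
# following any newly created ones (-f), into one combined timestamped log
sudo strace -f -tt -p <pid> -o postgrest_all_threads.txt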

@nicklasaven
Author

Ok

Some interesting info just from finding the child processes:

On 21 April it started 18 child processes on one instance and 19 on the other.

[Screenshot: process list showing the child processes]

@nicklasaven
Author

I will start by looking at 3831 and 6722, since they seem to have used a lot of CPU. I guess the time shown is total CPU usage.

@nicklasaven
Author

nicklasaven commented Apr 28, 2016

They both only gave results like this:
11:13:18 futex(0x7f008000091c, FUTEX_WAIT_PRIVATE, 1000143, NULL <detached ...>

So I guess they have children too

Or they are just not used right now of course

hmm, more digging needed

@nicklasaven
Author

OK, here is something:

It seems like sched_yield() is involved.

postgrest_7844.txt

@begriffs
Member

I would be interested to see a flame graph of where the CPU is spending time. Try this:

# install perf and debug symbols for libc
sudo apt-get update
sudo apt-get install linux-tools-common libc6-dbg

# check and note whether your virtual server provides hardware events
perf list

# observe postgrest for a minute, sampling the stack at 99 hertz
sudo perf record -F 99 -p <postgrest-pid> -g --call-graph dwarf -- sleep 60

# create visualization of perf dump
git clone https://github.com/brendangregg/FlameGraph.git
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > postgrest.svg

If your postgrest binary has sufficient debugging symbols then we should get a good idea of how it is spending time.

@nicklasaven
Author

The problem is that my 2 instances spawn another 20-30 processes.

Is it possible to collect them all in one go?

I am posting 2 strace logs here. postgrest_all.txt contains all those processes in one file, before restarting the server.

postgrest_all_after.txt contains all processes after postgrest, and the whole server, were restarted.

They are quite small when compressed but approx 400 MB each when decompressed. Be aware.

postgrest_all.txt.zip
postgrest_all_after.txt.zip

@begriffs
Member

Rather than -p <pid> try -u <user> to watch all processes spawned by a given user.
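
For example, a sketch assuming the postgrest processes run under a dedicated user named postgrest:

# sample every process owned by that user instead of a single PID
sudo perf record -F 99 -u postgrest -g --call-graph dwarf -- sleep 60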

@begriffs
Member

Random hunch is that this is related to threads getting spawned and then not dying. The futex calls in your logs look like threads continually asking to be put to sleep. Maybe an asynchronous exception messes something up, possibly in the hasql library.

Once nikita-volkov/hasql#47 is fixed I can disable the GHC -threaded flag and go back to lightweight green Haskell threads rather than system threads.

@eric-brechemier
Contributor

I have been running PostgREST for a week (on and off, since this computer sleeps a lot); it is now up to peaks of 69% CPU and 108 MB of RAM, for close to 8h30 of execution time.

I have not had a great experience with ghc-events-analyze so far. It failed to install with either stack or cabal, which both complain about missing dependencies. I installed it from source instead and compiled it with stack.

However, I could not get any report. Parsing the postgrest.eventlog file, which is now larger than 1GB, fails with the following error:

ghc-events-analyze: Data.Binary.Get.runGet at position 1109390685: not enough bytes

I tried to rename the postgrest.eventlog file and create an empty postgrest.eventlog, but the file handle stays on the renamed file, which is the only one that keeps being updated.

I am now considering either or both of:

  1. trying a different version of ghc-events-analyze, perhaps a particular tag rather than master
  2. restarting PostgREST and running ghc-events-analyze sooner, on a much smaller file
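
For anyone reproducing this, a rough sketch of the eventlog workflow, assuming a postgrest binary built with the GHC -eventlog and -rtsopts flags (connection arguments elided):

# run postgrest with the RTS eventlog enabled; this writes postgrest.eventlog
# into the working directory
postgrest <connection-args> +RTS -l -RTS

# after stopping the server, generate the totals/timed reports from the log
ghc-events-analyze postgrest.eventlog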

@begriffs
Member

Maybe if you upload the event log to S3 we could run the event analyzer on an EC2 instance with a large amount of RAM:

Model        vCPU   Mem (GiB)   SSD Storage (GB)
r3.large     2      15.25       1 x 32
r3.xlarge    4      30.5        1 x 80
r3.2xlarge   8      61          1 x 160
r3.4xlarge   16     122         1 x 320
r3.8xlarge   32     244         2 x 320

@ruslantalpa
Contributor

@eric-brechemier how about checking the RTS thing?

@eric-brechemier
Contributor

Maybe if you upload the event log to S3 we could run the event analyzer on an EC2 instance with a large amount of RAM

I do not think that RAM is the problem here (I have 16GB).

@eric-brechemier
Contributor

@eric-brechemier how about checking the RTS thing?

That's also on the table.

@eric-brechemier
Contributor

The problem was not the size of the file but, apparently, the fact that writing was still in progress.
I stopped the PostgREST instance and ran ghc-events-analyze successfully on the 1.1GB file:

-rw-r--r--   1 eric  staff   1.1G May 18 08:38 postgrest.eventlog

It produced the three report files:

-rw-r--r--  1 eric  staff   103M May 18 08:44 postgrest.timed.svg
-rw-r--r--  1 eric  staff    27M May 18 08:44 postgrest.timed.txt
-rw-r--r--  1 eric  staff   683K May 18 08:40 postgrest.totals.txt

which I compressed to:

-rw-r--r--  1 eric  wheel   4.9M May 18 11:37 postgrest.timed.svg.gz
-rw-r--r--  1 eric  wheel   110K May 18 11:37 postgrest.timed.txt.gz
-rw-r--r--  1 eric  wheel    87K May 18 11:37 postgrest.totals.txt.gz

@ruslantalpa
Contributor

So in postgrest.totals.txt I see this on the first line :)

GC                           4251002483327ns   4251.002s

USER EVENTS (user events are corrected for GC)
webrequest                     35588768192ns     35.589s
TOTAL                          35588768192ns     35.589s

THREAD EVENTS
TOTAL                          29216272209ns     29.216s

Ready to try my suggestion :)? (Don't run it on the custom compilation with profiling; run it on 0.3.1.1.)

@eric-brechemier
Contributor

Sure :)

don't run it on the custom compilation with profiling, run it on 3.1.1

OK.

@eric-brechemier
Contributor

Now running stack exec postgrest -- +RTS -I2 -RTS (...) off of v0.3.1.1.

@eric-brechemier
Contributor

The difference is beyond compare: PostgREST server is now running below 1% of CPU, with less than 10MB of RAM.

@ruslantalpa
Contributor

ruslantalpa commented May 20, 2016

HA! :) I feel a new release coming up.
PS: there is a way to hardcode the -I2 param at compile time so that users do not have to specify it manually.
@eric-brechemier can you also please experiment with -I1.5?
I think it just needs to be a bit longer than 1 second, since I think Warp has a timer doing something every second, so any value above 1s should work (this might take some experimenting, especially under heavier load).

@begriffs
Member

What a relief. 🌞 I'm sure there's a way to bake RTS settings either into the cabal file or the stack.yaml. I'll make a PR with the change. When the PR is merged I'll close this issue.
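
A sketch of how the flag could be baked in; the eventual PR would presumably put something like -with-rtsopts into the project's ghc-options, but as a quick experiment under stack (assuming a stack-based build):

# bake the idle GC interval into the binary at build time, so users do not
# need to pass +RTS flags themselves
stack build --ghc-options='-with-rtsopts=-I2'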

Then it's time to make a release! We have a lot of other fixes and features queued up as well.

I wonder if this calls for filing an issue in Warp or updating https://ghc.haskell.org/trac/ghc/ticket/4322

@ruslantalpa
Contributor

I think it's worth opening an issue and providing links to this thread and to the link above. I've searched their GitHub repo and there is no mention of high CPU usage or custom +RTS parameters.

@eric-brechemier
Contributor

I confirm that after 5 days, the -I2 instance still runs below 1% of CPU and within 10MB of RAM.

@eric-brechemier can you also please experiment with -I1.5

OK. I just restarted the server with -I1.5.

@nicklasaven
Author

Thanks a lot!
Very good job!

@eric-brechemier
Contributor

Yes.
Note: The -I1.5 instance behaves in the same way as -I2.

@begriffs
Member

Thanks for the debugging help everyone.

@PierreR
Contributor

PierreR commented May 26, 2016

Following this thread from a foreign constellation: isn't it safer to use 1.5 instead of 2, knowing the default is 0.3?

@begriffs
Member

@PierreR these particular numbers are still unsatisfying to me. Why 1.5 rather than 0.31? And does the magic number vary based on server CPU speed? It seems to me that the lower the number, the more likely the CPU leak is to happen, and the higher the number, the more likely postgrest could temporarily run out of memory. The CPU leak has been observed, but I don't think we've observed a problem with memory, so it's probably safer to err on the side of a higher magic number.

@ruslantalpa
Contributor

@begriffs as per the bug report I mentioned earlier, this is what is happening:

if your application has a timer signal that executes some Haskell code every 1s, say, then you'll get an idle GC after each one. Setting the idle GC to a longer period than your timer signal (e.g. 2s) will prevent this happening.

Warp has a timer that runs at 1s intervals to generate the string timestamp for the log output; it's an optimisation. Setting -I to anything above 1 second should fix this, but 2s I think is fine.
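
A quick way to see the effect, sketched under the assumption that the binary accepts RTS flags (built with -rtsopts) and with connection arguments elided:

# default idle GC interval (-I0.3): an idle GC can fire roughly every 0.3s
postgrest <connection-args> +RTS -s -RTS
# idle GC interval raised above Warp's 1-second timer
postgrest <connection-args> +RTS -I2 -s -RTS
# -s prints a GC summary on exit, so the number of collections under each
# setting can be compared after stopping the server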

Another possible solution: I think the 1s timer is created by the logger middleware, and if that is removed then the default 0.3 will probably also work. I think in a production setup the logging should be done by nginx, and postgrest should not log at all except start/stop ... maybe in the future this can be an option/flag not to log.

@jberryman

@eric-brechemier so increasing the idle GC interval also made the memory creep you saw disappear? Why would that be, I wonder?

@eric-brechemier
Contributor

so increasing the idle GC interval also made the memory creep you saw disappear?

Yes.

Why would that be I wonder?

@ruslantalpa found out that it was related to this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/4322

@jberryman

jberryman commented Jul 22, 2019

@eric-brechemier do you find the information in that GHC bug report satisfying? It's still not clear to me why idle GC would manifest as a space leak. Maybe related to finalizers...?

This has just cropped up in another codebase I'm working on, so I'm wondering whether to comment on that GHC 4322 bug, open a new one, etc. Wondering if you have any further information. We found last time that setting -I2 seemed to fix things. We did have jobs on timers, FWIW.
