
Strange increase in CPU-usage over time #565

Closed
nicklasaven opened this issue Apr 18, 2016 · 79 comments

Comments

@nicklasaven

We have a server running on Debian with PostgreSQL 9.5 and the latest (I think) PostgREST.

The server is more or less just idling, since it is a production server that is not yet in production.

Over time I have seen that the CPU usage is increasing. After some weeks it is using more than 50% of the 400% total (4 cores). I can see on the graphs that the usage is increasing very slowly, day by day.

I have 2 instances of postgrest running on the server.

I see the same thing on our test server, but that one is under some load during this test phase. Even when it is idling, though, it uses a lot of CPU.

When I restart the postgrest instances things go back to normal.

@begriffs
Member

How long does it take before the problem starts to show? I can try running a profiler for a long time and see where it spends the most time.

@begriffs
Member

Also can you verify the postgrest version you're running? (Send it a request and inspect the Server header in the response)

@begriffs
Member

As a test I've started instances of all 0.3.x versions of postgrest on a Debian server and will watch them to see which versions start to consume CPU.

@nicklasaven
Author

Thanks

I was a little busy earlier, so my answer was delayed. I tried to get the version from the response header, but since I have nginx in front I only found the nginx version. One of the two servers was installed a few weeks ago from master on GitHub; the other is maybe two months old. Both behave the same.

I think this issue should show after a week, but it is a slope, so it gets worse the longer it runs.

I will keep an eye on mine and we can discuss the numbers when it starts to show.

I have only experienced this once per server so far, but I see no reason it shouldn't come back.

@begriffs
Member

After three days of running all 0.3 versions at once the CPU is staying flat on my Debian 8.3 server.

[Screenshot: CPU usage graph, 2016-04-21 9:08 AM]

Maybe the problem happens only after postgrest has received a few requests. I'll let it go another day and then try sending some requests.

@ruslantalpa
Contributor

The question is whether the increase is gradual or abrupt. If gradual, then it's some type of leak. If abrupt, then it's a specific request that causes the issue, and in that case you need to run the exact schema/requests to replicate it.

@nicklasaven
Author

It was very gradual, but I don't see it yet either.

@nicklasaven
Author

nicklasaven commented Apr 24, 2016

Now suddenly things start to happen. The server is not idling all the time, but is in a testing phase before going into production, so we are filling it with data and running tests on it. But it is not under constant load, so something is wrong with this gradual increase in CPU.

As you can see, it let go of memory, but CPU just keeps increasing. You can also see from I/O that there has been more or less no load for the last 24 h. The graphs show only postgrest; the other processes are not showing the same CPU increase.

[Screenshot: Linode monitoring graphs]

@ruslantalpa
Contributor

@nicklasaven join gitter chat (let's see if strace reveals anything interesting)

@ruslantalpa
Contributor

The only place I can think of where PostgREST could get into an infinite loop is this section:
https://github.com/nikita-volkov/hasql-transaction/blob/master/library/Hasql/Transaction.hs#L69-L73
It seems that for a particular error, the transaction library will just try to run the transaction again.
Also note that there is a commit from just a few days ago related to this particular section of the code, so maybe just upgrading the transaction dependency will fix this; nikita can probably give an opinion on whether this CPU usage might be caused by that particular code section.
The error is this one:
https://github.com/nikita-volkov/postgresql-error-codes/blob/master/library/PostgreSQL/ErrorCodes.hs#L388

So the first order of business is to check on the PostgreSQL side whether 40001 errors are happening.
The second thing to try is to attach to the postgrest process with strace and log all system calls for a while; this might give us an indication of what is happening.
The third thing to try (after we have the strace log) is to just stop postgresql (the db) and see what happens, then start it back up after a minute or so to see whether that has an impact on the CPU usage (but this is the last thing to try; we don't want to eliminate the problem and wait another week for it to come up).
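
For the first check, here is a minimal sketch (the PostgreSQL log path and the log contents are assumptions for a stock Debian install; adjust them to your setup):

# count serialization failures (SQLSTATE 40001) reported in the PostgreSQL log
sudo grep -c -E "40001|could not serialize access" /var/log/postgresql/postgresql-9.5-main.log

# or, from inside the database, look at the rollback counters per database
sudo -u postgres psql -c "SELECT datname, xact_rollback FROM pg_stat_database;"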

@begriffs
Member

Do your system utilization logs and database logs both go back to when you started postgrest? You could try to visually find the point in time where CPU started climbing, then look at the db logs from around that time. Maybe there will be a telltale event.

Also the CPU on my server is still holding steady but I haven't sent the postgrest servers any requests. I can try sending various types of requests to experiment:

  1. A read-only request that causes no errors (maybe GET /)
  2. A read-only request that causes an error (GET /fakeroute)
  3. Insert a row
  4. Insert a row causing a constraint violation

I could begin at step 1 then wait a day for any CPU growth. Then try 2, etc.
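
For reference, a rough sketch of what those requests could look like with curl; the host, port, and the "items" table with its columns are made up for illustration:

# 1. read-only request with no error
curl http://localhost:3000/
# 2. read-only request that triggers an error
curl http://localhost:3000/fakeroute
# 3. insert a row (hypothetical "items" table)
curl -X POST http://localhost:3000/items -H "Content-Type: application/json" -d '{"name": "test"}'
# 4. insert a row that violates a constraint (e.g. reuse an existing primary key)
curl -X POST http://localhost:3000/items -H "Content-Type: application/json" -d '{"id": 1, "name": "duplicate"}'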

@nicklasaven
Author

Sorry for the delay. I am under some pressure and I am short of both daytime and evening time.

But I will get back later today or tomorrow and investigate more.

I checked yesterday and it is still increasing, but it doesn't affect the server yet, as there is only limited usage on it.

@nicklasaven
Author

How do I use strace here?
Can I hook the tool into the running process?

I find no error 40001 in the logs.

@ruslantalpa
Contributor

From what I know you can attach to a running process by PID, but you will need to look up exactly how to do that.
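
A minimal sketch of attaching to a running process (the PID and output filename are placeholders):

# find the postgrest process id
pgrep -l postgrest

# attach to it, timestamp each system call, and write the log to a file;
# detach with Ctrl-C after a few minutes
sudo strace -tt -p <postgrest-pid> -o postgrest_strace.txt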

@fortytw2

Can confirm this: under heavy load for ~2 weeks (sustained 100+ req/s), postgrest very gradually grew from about 3% CPU to 40% before I killed it. I can try to dig through my metrics later today.

@ruslantalpa
Contributor

Gradual is good, I think. Next time don't kill it until you get an strace log.

@nicklasaven
Author

OK, now I have collected strace output for a few minutes.
There are two instances, and they both seem to behave the same.

But it seems to me that I am missing what it is doing between leaving the wait and getting back to waiting again.

CPU for both is approx 60% now.

postgrest_3541.txt
postgrest_3538.txt

@nicklasaven
Author

These are the graphs for the last 24 hours.

The strace was only a few minutes.

[Screenshot: Linode graphs for the last 24 hours]

@ruslantalpa
Contributor

@nicklasaven you need to strace again; you only traced the parent process.
Read this https://meenakshi02.wordpress.com/2011/02/02/strace-hanging-at-futex/ and resubmit the strace.
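
A sketch of how the child threads could be covered as well, assuming <pid> is the parent postgrest process:

# list the OS threads (tasks) belonging to the parent process
ls /proc/<pid>/task
# or view them with ps
ps -T -p <pid>

# trace the parent and, with a reasonably recent strace, all of its threads,
# following any newly created ones (-f), into one combined timestamped log
sudo strace -f -tt -p <pid> -o postgrest_all_threads.txt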

@nicklasaven
Author

Ok

Some interesting info just from finding the child processes:

On 21 April it started 18 child processes on one instance and 19 on the other.

[Screenshot: process list showing the child processes]

@nicklasaven
Author

I will start by looking at 3831 and 6722, since they seem to have used a lot of CPU. I guess the time shown is total CPU usage.

@nicklasaven
Author

nicklasaven commented Apr 28, 2016

They both only gave results like this:
11:13:18 futex(0x7f008000091c, FUTEX_WAIT_PRIVATE, 1000143, NULL <detached ...>

So I guess they have children too

Or they are just not used right now of course

hmm, more digging needed

@nicklasaven
Author

OK, here is something:

It seems like sched_yield() is involved.

postgrest_7844.txt

@begriffs
Member

I would be interested to see a flame graph of where the CPU is spending time. Try this:

# install perf and debug symbols for libc
sudo apt-get update
sudo apt-get install linux-tools-common libc6-dbg

# check and note whether your virtual server provides hardware events
perf list

# observe postgrest for a minute, sampling the stack at 99 hertz
sudo perf record -F 99 -p <postgrest-pid> -g --call-graph dwarf -- sleep 60

# create visualization of perf dump
git clone https://github.com/brendangregg/FlameGraph.git
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > postgrest.svg

If your postgrest binary has sufficient debugging symbols then we should get a good idea of how it is spending time.

@nicklasaven
Author

The problem is that my 2 instances spawn another 20-30 processes.

Is it possible to collect them all in one go?

I am posting 2 strace logs here. postgrest_all.txt contains all those processes in one file, before restarting the server.

postgrest_all_after.txt contains all processes after postgrest, and the whole server, were restarted.

They are quite small when compressed but approx 400 MB each when decompressed. Be aware.

postgrest_all.txt.zip
postgrest_all_after.txt.zip

@begriffs
Member

Rather than -p <pid> try -u <user> to watch all processes spawned by a given user.
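
For example, a sketch assuming the postgrest processes run under a dedicated user named postgrest:

# sample every process owned by that user instead of a single PID
sudo perf record -F 99 -u postgrest -g --call-graph dwarf -- sleep 60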

@begriffs
Member

Random hunch is that this is related to threads getting spawned and then not dying. The futex calls in your logs look like threads continually asking to be put to sleep. Maybe an asynchronous exception messes something up, possibly in the hasql library.

Once nikita-volkov/hasql#47 is fixed I can disable the GHC -threaded flag and go back to lightweight green Haskell threads rather than system threads.

@eric-brechemier
Contributor

I have been running PostgREST for a week (on and off, since this computer sleeps a lot); it is now up to peaks of 69% CPU and 108 MB of RAM, for close to 8h30 of execution time.

I have not had a great experience with ghc-events-analyze so far. It failed to install with either stack or cabal, which both complain about missing dependencies. I installed it from source instead and compiled it with stack.

However, I could not get any report. Parsing the postgrest.eventlog file, which is now larger than 1GB, fails with the following error:

ghc-events-analyze: Data.Binary.Get.runGet at position 1109390685: not enough bytes

I tried to rename the postgrest.eventlog file and create an empty postgrest.eventlog, but the file handle stays on the renamed file, which is the only one that keeps being updated.

I am now considering either or both of:

  1. trying a different version of ghc-events-analyze, perhaps a particular tag rather than master
  2. restarting PostgREST and running ghc-events-analyze sooner, on a much smaller file
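
For anyone reproducing this, a rough sketch of the eventlog workflow, assuming a postgrest binary built with the GHC -eventlog and -rtsopts flags (connection arguments elided):

# run postgrest with the RTS eventlog enabled; this writes postgrest.eventlog
# into the working directory
postgrest <connection-args> +RTS -l -RTS

# after stopping the server, generate the totals/timed reports from the log
ghc-events-analyze postgrest.eventlog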

@begriffs
Member

Maybe if you upload the event log to S3 we could run the event analyzer on an EC2 instance with a large amount of RAM:

Model        vCPU   Mem (GiB)   SSD Storage (GB)
r3.large     2      15.25       1 x 32
r3.xlarge    4      30.5        1 x 80
r3.2xlarge   8      61          1 x 160
r3.4xlarge   16     122         1 x 320
r3.8xlarge   32     244         2 x 320

@ruslantalpa
Contributor

@eric-brechemier how about checking the RTS thing?

@eric-brechemier
Contributor

Maybe if you upload the event log to S3 we could run the event analyzer on an EC2 instance with a large amount of RAM

I do not think that RAM is the problem here (I have 16GB).

@eric-brechemier
Contributor

@eric-brechemier how about checking the RTS thing?

That's also on the table.

@eric-brechemier
Contributor

The problem was not the size of the file but, apparently, the fact that writing was still in progress.
I stopped the PostgREST instance and ran ghc-events-analyze successfully on the 1.1GB file:

-rw-r--r--   1 eric  staff   1.1G May 18 08:38 postgrest.eventlog

It produced the three report files:

-rw-r--r--  1 eric  staff   103M May 18 08:44 postgrest.timed.svg
-rw-r--r--  1 eric  staff    27M May 18 08:44 postgrest.timed.txt
-rw-r--r--  1 eric  staff   683K May 18 08:40 postgrest.totals.txt

which I compressed to:

-rw-r--r--  1 eric  wheel   4.9M May 18 11:37 postgrest.timed.svg.gz
-rw-r--r--  1 eric  wheel   110K May 18 11:37 postgrest.timed.txt.gz
-rw-r--r--  1 eric  wheel    87K May 18 11:37 postgrest.totals.txt.gz

@ruslantalpa
Contributor

So in postgrest.totals.txt I see this on the first line :)

GC                           4251002483327ns   4251.002s

USER EVENTS (user events are corrected for GC)
webrequest                     35588768192ns     35.589s
TOTAL                          35588768192ns     35.589s

THREAD EVENTS
TOTAL                          29216272209ns     29.216s

Ready to try my suggestion :)? (Don't run it on the custom compilation with profiling; run it on 0.3.1.1.)

@eric-brechemier
Contributor

Sure :)

don't run it on the custom compilation with profiling, run it on 3.1.1

OK.

@eric-brechemier
Contributor

Now running stack exec postgrest -- +RTS -I2 -RTS (...) off of v0.3.1.1.

@eric-brechemier
Contributor

The difference is beyond compare: PostgREST server is now running below 1% of CPU, with less than 10MB of RAM.

@ruslantalpa
Contributor

ruslantalpa commented May 20, 2016

HA! :) I feel a new release coming up.
PS: there is a way to hardcode the -I2 param at compile time so that users do not have to specify it manually.
@eric-brechemier can you also please experiment with -I1.5?
I think it just needs to be a bit longer than 1 second, since I think Warp has a timer doing something every second, so any value above 1s should work (this might take some experimenting, especially under heavier load).

@begriffs
Member

What a relief. 🌞 I'm sure there's a way to bake RTS settings either into the cabal file or the stack.yaml. I'll make a PR with the change. When the PR is merged I'll close this issue.
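
A sketch of how the flag could be baked in; the eventual PR would presumably put something like -with-rtsopts into the project's ghc-options, but as a quick experiment under stack (assuming a stack-based build):

# bake the idle GC interval into the binary at build time, so users do not
# need to pass +RTS flags themselves
stack build --ghc-options='-with-rtsopts=-I2'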

Then it's time to make a release! We have a lot of other fixes and features queued up as well.

I wonder if this calls for filing an issue in Warp or updating https://ghc.haskell.org/trac/ghc/ticket/4322

@ruslantalpa
Contributor

I think it's worth opening an issue and providing links to this thread and to the link above. I've searched their GitHub repo and there is no mention of high CPU usage or custom +RTS parameters.

@eric-brechemier
Contributor

I confirm that after 5 days, the -I2 instance still runs below 1% of CPU and within 10MB of RAM.

@eric-brechemier can you also please experiment with -I1.5

OK. I just restarted the server with -I1.5.

@nicklasaven
Author

Thanks a lot!
Very good job!

@eric-brechemier
Contributor

Yes.
Note: The -I1.5 instance behaves in the same way as -I2.

@begriffs
Member

Thanks for the debugging help everyone.

@PierreR
Contributor

PierreR commented May 26, 2016

Following this thread from a foreign constellation: isn't it safer to use 1.5 instead of 2, knowing the default is 0.3?

@begriffs
Member

@PierreR these particular numbers are still unsatisfying to me. Why 1.5 rather than 0.31? And does the magic number vary based on server CPU speed? It seems to me that the lower the number, the more likely the CPU leak is to happen, and the higher the number, the more likely postgrest could temporarily run out of memory. The CPU leak has been observed, but I don't think we've observed a problem with memory, so it's probably safer to err on the side of a higher magic number.

@ruslantalpa
Contributor

@begriffs as per the bug report I mentioned earlier, this is what is happening:

if your application has a timer signal that executes some Haskell code every 1s, say, then you'll get an idle GC after each one. Setting the idle GC to a longer period than your timer signal (e.g. 2s) will prevent this happening.

Warp has a timer that runs at 1s intervals to generate the string timestamp for the log output; it's an optimisation. Setting -I to anything above 1 second should fix this, but 2s I think is fine.
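
A quick way to see the effect, sketched under the assumption that the binary accepts RTS flags (built with -rtsopts) and with connection arguments elided:

# default idle GC interval (-I0.3): an idle GC can fire roughly every 0.3s
postgrest <connection-args> +RTS -s -RTS
# idle GC interval raised above Warp's 1-second timer
postgrest <connection-args> +RTS -I2 -s -RTS
# -s prints a GC summary on exit, so the number of collections under each
# setting can be compared after stopping the server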

Another possible solution: I think the 1s timer is created by the logger middleware, and if that is removed then the default 0.3 will probably also work. I think in a production setup the logging should be done by nginx, and postgrest should not log at all except start/stop ... maybe in the future this can be an option/flag not to log.

@jberryman

@eric-brechemier so increasing the idle GC interval also made the memory creep you saw disappear? Why would that be, I wonder?

@eric-brechemier
Contributor

so increasing the idle GC interval also made the memory creep you saw disappear?

Yes.

Why would that be I wonder?

@ruslantalpa found out that it was related to this GHC bug: https://ghc.haskell.org/trac/ghc/ticket/4322

@jberryman

jberryman commented Jul 22, 2019

@eric-brechemier do you find the information in that GHC bug report satisfying? It's still not clear to me why idle GC would manifest as a space leak. Maybe related to finalizers...?

This has just cropped up in another codebase I'm working on, so I'm wondering whether to comment on that GHC 4322 bug, open a new one, etc. Wondering if you have any further information. We found last time that setting -I2 seemed to fix things. We did have jobs on timers, FWIW.
