Usual input load around 7000 msg/s - output works intermittently #1839

Closed
thebox-dev opened this issue Feb 18, 2016 · 10 comments
@thebox-dev

Problem description

We have the following setup in AWS:
3 instances started from the official Graylog AMI, version 1.3.3.

  • instance_1 is configured to run the Graylog server + MongoDB - type: "c4.4xlarge"
  • instance_2 is configured to run Elasticsearch only - type: "r3.xlarge"
  • instance_3 is configured to run the Graylog web interface - type: "c4.large"

During the night, the usual throughput on the input is around 6000 msg/s and pretty constant.

Steps to reproduce the problem

We noticed the issue only yesterday, two weeks after we installed the system. During the day, under higher load of around 8000 msg/s, the journal utilization increases steadily and messages are not consumed at a constant rate: there are long pauses in the outputs, roughly 20-25 seconds of complete stops at 0 msg/s, followed by output rates of 12,000-15,000 msg/s for about 10 seconds, followed again by a complete stop for another 20-25 seconds. The behavior has been stable ever since we noticed it. At some point during the night, when traffic returns to the usual values, the behavior is similar but journal utilization drops much lower, down to 2-3%.

We increased the Java heap size on the instances and the number of file descriptors, and we checked CPU and memory without finding anything out of the ordinary. Network connectivity looks sharp, and disk I/O is good since we are only using SSDs (2 TB, ~6000 IOPS on the Elasticsearch instance). Everything looks right to me, but those long pauses of over 20 seconds, with CPU down to 2-3%, drive me nuts.
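
For reference, a minimal sketch of the kind of checks described above, run on the Graylog server instance. The process name, paths, and sampling intervals are assumptions, not taken from this thread; adjust them to the actual deployment.

    # Find the graylog-server JVM (process name is an assumption; adjust as needed)
    GL_PID=$(pgrep -f graylog-server | head -n 1)

    # Effective file-descriptor limit and current usage for that process
    grep 'Max open files' /proc/$GL_PID/limits
    ls /proc/$GL_PID/fd | wc -l

    # Heap occupancy and GC activity, sampled every 5 seconds (requires the JDK's jstat)
    jstat -gcutil $GL_PID 5000

    # Disk utilization, sampled every 5 seconds (run this on the Elasticsearch instance too)
    iostat -x 5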

Please let me know if it sounds familiar to you and help me troubleshoot this issue. Many thanks!

Environment

  • Graylog Version: 1.3.3
  • Elasticsearch Version: 1.7.4
  • MongoDB Version: v3.0.6
  • Operating System: Ubuntu 14.04.1 LTS
@the-Pretender

Noticed similar behaviour on our testing system. Since this was under heavy load on both disk (not SSD) and CPU while testing EPS, I initially thought disk problems were causing the hiccups. But since others are experiencing the same thing without those signs, maybe there is an underlying issue. In my scenario I was testing parsing of Checkpoint logs at 8k msg/s, which worked fine for ~20 minutes before the journal started filling up; I then turned the rate down by 1k msg/s at a time until it stabilized at 5k.

@kroepke
Member

kroepke commented Feb 19, 2016

Have you enabled the garbage collection log for Elasticsearch? This sounds like a GC problem (especially because of the intermittent pauses).
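
For anyone wanting to try this, a minimal sketch of enabling GC logging for Elasticsearch 1.x, assuming the Debian/Ubuntu package layout (/etc/default/elasticsearch and ES_JAVA_OPTS); the log path and flag selection are assumptions, not taken from this thread:

    # /etc/default/elasticsearch
    # Append HotSpot GC-logging flags; restart Elasticsearch afterwards.
    ES_JAVA_OPTS="-Xloggc:/var/log/elasticsearch/gc.log \
      -XX:+PrintGCDetails \
      -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime"

Long "Total time for which application threads were stopped" entries that line up with the output stalls would point at GC pauses.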

@thebox-dev
Author

Apparently it's not related to Elasticsearch. It looks like the processing buffers on the GL server instance were full, so I added some more processors in the config and also doubled the ring size (see the config sketch below). But the thing that actually fixed the issue was removing the 14 default extractors that came with GL. Now GL runs fine with the mentioned load and the journal stays at 1-2% at most.
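
For context, the buffer settings mentioned above live in the Graylog server configuration file; a sketch with illustrative values only, since the exact numbers used are not stated in this thread (note that ring_size must be a power of two):

    # /etc/graylog/server/server.conf (path may differ on the AMI)
    processbuffer_processors = 10      # threads running message processing / extractors
    outputbuffer_processors = 5        # threads writing processed messages to Elasticsearch
    ring_size = 131072                 # size of the ring buffers; must be a power of 2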

@thebox-dev
Author

I have one question, though: can the web-interface timeout parameter take values higher than 60s? I get these errors in the logs when doing heavy searches, even though I explicitly increased the timeout above 60s:

2016-02-18 21:55:59,947 - [ERROR] - from org.graylog2.restclient.lib.ApiClient in play-akka.actor.default-dispatcher-84
REST call failed
java.util.concurrent.TimeoutException: Read timeout to /172.31.7.18:12900 of 60000 ms
        at com.ning.http.client.providers.netty.request.timeout.TimeoutTimerTask.expire(TimeoutTimerTask.java:47) ~[com.ning.async-http-client-1.9.31.jar:na]
        at com.ning.http.client.providers.netty.request.timeout.ReadTimeoutTimerTask.run(ReadTimeoutTimerTask.java:57) ~[com.ning.async-http-client-1.9.31.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:556) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:632) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:369) ~[io.netty.netty-3.10.4.Final.jar:na]
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[io.netty.netty-3.10.4.Final.jar:na]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66]

@joschi
Contributor

joschi commented Mar 2, 2016

@thebox-dev
Author

Actually, I found out that it can be adjusted:

graylog-labs/graylog2-web-interface#1679

But I don't seem to be able to adjust it to values higher than 60s.

@joschi
Contributor

joschi commented Mar 2, 2016

Actually, I found out that it can be adjusted

No, you didn't. That specific timeout is hard-coded. Feel free to check out the source code I linked to in my last reply.

@thebox-dev
Author

I see... thanks for clarifying.

Is there a way to get around this currently? The network communication seems fine between all three components and the GL server instance seems OK on resources, but the ES instance goes to 99% CPU when doing searches for specific log entries. I think the GL server waits on ES to provide the search results (I adjusted the elasticsearch_timeout to 3 minutes), and the web-interface instance simply times out when the wait is longer than 1 minute.
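
One way to see what Elasticsearch is actually busy with while such a search runs, as a diagnostic sketch (the host name is a placeholder, and the thread count is arbitrary):

    # Top threads by CPU on each ES node, sampled while the slow search is running
    curl -s 'http://<es-host>:9200/_nodes/hot_threads?threads=3'

    # Per-node thread-pool activity, queue sizes, and rejections (including search)
    curl -s 'http://<es-host>:9200/_cat/thread_pool?v'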

@joschi
Contributor

joschi commented Mar 2, 2016

Is there a way to get around this currently?

Not without forking Graylog and making that timeout user-configurable.

@jalogisch
Contributor

@thebox-dev please update to the latest version. Your issue should be fixed.
