Usual input load around 7000 msg/s - output works intermittently #1839
Comments
Noticed similar behaviour on our testing system. Since it was under heavy disk (not SSD) and CPU load while testing EPS, I initially thought disk problems were causing the hiccups. But since others are experiencing the same without those symptoms, maybe there is an underlying issue. In my scenario, I was testing parsing of Check Point logs at 8k msg/s, which worked fine for ~20 minutes before the journal started filling up; I then turned the rate down by 1k msg/s at a time until it stabilized at 5k.
Have you enabled the garbage collection log for Elasticsearch? This sounds like a GC problem (especially because of the intermittent pauses).
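In case it helps, a minimal sketch of how GC logging could be enabled for an Elasticsearch node of that generation (Java 7/8 flag syntax; ES_JAVA_OPTS and the log path are assumptions that depend on how your Elasticsearch is installed and started):

```sh
# Sketch: enable GC logging for the Elasticsearch JVM (Java 7/8 flag syntax).
# ES_JAVA_OPTS and the log path are assumptions -- adjust for your install.
export ES_JAVA_OPTS="-Xloggc:/var/log/elasticsearch/gc.log \
 -XX:+PrintGCDetails \
 -XX:+PrintGCDateStamps \
 -XX:+PrintGCApplicationStoppedTime"
```

After restarting Elasticsearch, long application-stopped intervals that line up with the output pauses would point to GC as the culprit.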
Not related to Elasticsearch, apparently. It looks like the processing buffers on the GL server instance were full, so I added more buffer processors in the config and also doubled the ring size. But the thing that actually fixed the issue was removing the 14 default extractors that came with GL. Now GL runs fine with the mentioned load and the journal stays at 1-2% at most.
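For reference, a sketch of the server.conf knobs involved (the file location, defaults, and the values shown are illustrative assumptions and differ between Graylog versions and the official AMI):

```properties
# Sketch of the relevant buffer settings in the Graylog server.conf
# (illustrative values only -- defaults and file location vary by version/AMI).
processbuffer_processors = 10    # number of parallel message processors
outputbuffer_processors = 6      # number of parallel output writers
ring_size = 131072               # disruptor ring buffer size, must be a power of two
processor_wait_strategy = blocking
```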
I have one question though: can the web-interface timeout parameter take values higher than 60s? I get errors in the logs when doing heavy searches, even though I explicitly increased the timeout above 60s.
@thebox-dev Unfortunately this specific timeout is hard-coded in Graylog 1.x. Related source code: https://github.com/Graylog2/graylog2-server/blob/1.3.3/graylog2-rest-client/src/main/java/org/graylog2/restclient/models/UniversalSearch.java#L111-L115
Actually, I found out that it can be adjusted: graylog-labs/graylog2-web-interface#1679. But it doesn't seem to accept values higher than 60s.
No, you didn't. That specific timeout is hard-coded. Feel free to check the source code I linked in my last reply.
I see... thanks for clarifying. Is there a way to work around this currently? Network communication between all three components seems fine and the GL server instance looks OK on resources, but the ES instance goes to 99% CPU when searching for specific log entries. I think the GL server waits on ES to return the search results (I adjusted the elasticsearch_timeout to 3 minutes) and the web-interface instance simply times out when the wait exceeds 1 minute.
Not without forking Graylog and making that timeout user-configurable.
@thebox-dev Please update to the latest version. Your issue should be fixed.
Problem description
We have the following setup in AWS:
3 x graylog server instances started from the official graylog AMI version 1.3.3.
During the night, the usual throughput on the input is around 6000 msg/s and pretty constant.
Steps to reproduce the problem
We noticed the issue only yesterday, two weeks after we installed the system. During the day, under higher load (around 8000 msg/s), the journal utilization increases steadily while messages are not consumed at a constant rate: there are long pauses on the output side, roughly 20-25 seconds of complete stops at 0 msg/s, then output rates of 12,000-15,000 msg/s for about 10 seconds, followed by another complete stop for 20-25 seconds. The behavior has been consistent ever since we noticed it. During the night, when traffic returns to the usual values, the behavior is similar but the journal utilization drops much lower, down to 2-3%.
We increased the Java heap size and the file descriptor limit on the instances, and checked CPU and memory without seeing anything out of the ordinary. Network connectivity looks fine, and disk I/O is good since we are only using SSDs (2 TB, ~6000 IOPS on the Elasticsearch instance). Everything looks right to me, but those long pauses of over 20 seconds, with CPU down to 2-3%, drive me nuts.
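One way to watch the journal and buffer utilization numbers mentioned above from the command line (a sketch only; the /system/journal and /system/buffers endpoint paths and the default REST port 12900 are assumptions based on the Graylog 1.x API, so verify them against your version):

```sh
# Sketch: poll journal and buffer utilization on one graylog-server node.
# Endpoint paths, port 12900, and credentials are assumptions -- verify first.
curl -s -u admin:password http://127.0.0.1:12900/system/journal
curl -s -u admin:password http://127.0.0.1:12900/system/buffers
```

Correlating journal growth with the output rate over time should make the stop/burst pattern easier to pin down.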
Please let me know if it sounds familiar to you and help me troubleshoot this issue. Many thanks!
Environment