
Docker daemon for ERDDAP hosted on AWS keeps crashing #69

Open
MathewBiddle opened this issue Apr 17, 2024 · 22 comments

Comments

@MathewBiddle
Contributor

When running this erddap-gold-standard deployment on AWS, the Docker daemon for the stack crashes every few weeks.

$ docker-compose restart
ERROR: Couldn't connect to Docker daemon at http+docker://localhost - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.

It's a simple fix to get it up and running again using:

$ sudo systemctl start docker
$ docker-compose restart

I'm curious if other folks have experienced this before with an ERDDAP deployed using Docker on AWS?

I've discussed this with @patrick-tripp, and the current workaround would be to set up a cron job that checks the URL and restarts Docker if it fails; a rough sketch is below.
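
Something along these lines (the script name, status URL, and schedule here are placeholders, not something we've actually deployed):

$ cat /usr/local/bin/check_erddap.sh
#!/bin/bash
# Hypothetical health check: if the ERDDAP status page stops responding,
# start the Docker daemon and restart the compose stack.
# Assumes this user can run systemctl/docker without a password prompt.
URL="http://localhost/erddap/status.html"
if ! curl -sf --max-time 30 "$URL" > /dev/null; then
    sudo systemctl start docker
    cd /usr/local/erddap-gold-standard && docker-compose restart
fi

$ crontab -l
*/10 * * * * /usr/local/bin/check_erddap.sh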

cc: @mwengren, @ocefpaf, @patrick-tripp.

@MathewBiddle
Contributor Author

Maybe live restore?

https://docs.docker.com/config/containers/live-restore/

@MathewBiddle
Contributor Author

Okay, testing live-restore:

$ more /etc/docker/daemon.json
{
        "live-restore": true
}
$ sudo systemctl start docker
$ docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED      STATUS         PORTS                                                                            NAMES
ec3b94b319fe   axiom/docker-erddap:2.23-jdk17-openjdk   "/entrypoint.sh cata…"   7 days ago   Up 5 seconds   0.0.0.0:80->8080/tcp, :::80->8080/tcp, 0.0.0.0:443->8443/tcp, :::443->8443/tcp   erddap_gold_standard

I will check back in a few weeks to see if this fixes the issue. Luckily we have plenty of checks hitting this server, so we will know quickly when it breaks.

@MathewBiddle
Contributor Author

To confirm the change was accepted:

$ docker info | grep Live
 Live Restore Enabled: true

@srstsavage
Contributor

Do you have access to the docker daemon logs? Also what are the docker and kernel versions?

@MathewBiddle
Contributor Author

Do you have access to the docker daemon logs?

I have access to /var/log, which has a few messages files. I think those are the logs as documented here.
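
For reference, since this box uses systemd, the daemon logs can also be pulled with journalctl, e.g.:

$ sudo journalctl -u docker.service --since "1 week ago"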

Also what are the docker and kernel versions?

$ docker --version
Docker version 20.10.25, build b82b9f3
$ uname -sr
Linux 5.10.210-201.852.amzn2.x86_64

@MathewBiddle
Contributor Author

Live Restore seems to be working. From status:

Current time is 2024-05-06T15:44:10+00:00
Startup was at  2024-04-17T13:24:27+00:00

I'll keep this open until 2 months have passed without the daemon crashing.

@MathewBiddle
Contributor Author

Boo... looks like it crashed again.

$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Restarted with:

/usr/local/erddap-gold-standard$ sudo systemctl start docker
/usr/local/erddap-gold-standard$ docker-compose restart
Restarting erddap_gold_standard ... done
/usr/local/erddap-gold-standard$ docker info | grep Live
 Live Restore Enabled: true

@ocefpaf
Member

ocefpaf commented Jun 11, 2024

Boo... looks like it crashed again.

Same frequency as before, sooner, or later? We need to inspect the logs here to see if we can understand what is going on.

@MathewBiddle
Contributor Author

much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

@ocefpaf
Member

ocefpaf commented Jun 11, 2024

much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

Well, maybe that is a (small) win. I never looked into ERDDAP logs; we should probably ask the experts (Ben, Chris, Shane) for help here.

@MathewBiddle
Contributor Author

And it went down again, after ~1.5 months.

@MathewBiddle
Contributor Author

And down again - that was at least a month.

@ocefpaf
Member

ocefpaf commented Aug 23, 2024

We should check the logs and try to investigate further into what may be going on.

@srstsavage
Contributor

Do you have metrics on memory usage, number of open files, etc. on this instance through time?

@MathewBiddle
Contributor Author

Ummm, I'm going to say no. Is this something I could use https://github.com/callumrollo/erddaplogs for?

@srstsavage
Contributor

There may be useful hints in the ERDDAP logs (max memory usage, etc.), but I was referring more to host-level metrics on memory, number of open files, system load, etc. Typically this is collected by an agent running on the host that sends its metrics somewhere for analysis.

Probably the easiest, since it's the AWS solution:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html

Others:
https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/
https://www.netdata.cloud/
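
For the CloudWatch route, the setup is roughly the following (from memory, so please double-check against the docs above; memory and open-file metrics are only collected once the agent is configured for them):

$ sudo yum install -y amazon-cloudwatch-agent
$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
$ sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
      -a fetch-config -m ec2 -s \
      -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json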

@srstsavage
Contributor

See also @rmendels comments here:

ERDDAP/erddap#185 (comment)

[...] monitor the usage of the following:

  • heap space use
  • metaspace use
  • total java memory use (and the total memory available)
  • swap space use (and total swap space available)
  • number of threads running under java (I find the number given by say visualvm is not as good as that given by btop)

A detailed time series isn't as important as likely maximum values of each and some idea of how much they fluctuate. Particularly for total memory use, you need to have the ERDDAP completely loaded and running for a bit to get a feel for the total java memory and number of threads
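
If a full monitoring stack feels like overkill, even a small cron'd snapshot script on the host would capture most of the quantities above over time (the script name and log path below are made up):

#!/bin/bash
# Hypothetical /usr/local/bin/erddap_metrics.sh: append a timestamped snapshot
# of memory/swap usage plus the Java process's thread count and memory size.
LOG=/var/log/erddap-metrics.log
{
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  free -m                          # total/used/free/cache memory and swap, in MB
  ps -C java -o nlwp=,rss=,vsz=    # java thread count, resident and virtual size (KB)
} >> "$LOG"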

@rmendels

We may be closer to understanding how to prevent this (unfortunately there is no one simple answer). It has to do both with Java's new memory model, in that a lot of non-heap memory gets used if a lot of child threads are started (often 5GB-10GB more), and with how the OS behaves, since it can (and will) start caching all of the file requests. We were seeing this on our system, but do not at present, mostly because we added more memory, and it turns out you need a lot. Since then our memory use has stayed pretty constant. And yes, in order to give any advice we need the metrics above, as well as, if possible, the cache memory usage.

@benjwadams

@MathewBiddle, if running on a Linux system, check the journalctl logs -- you may see log entries from the OOM killer killing Docker or the Java process, which could be a case of the ERDDAP memory issue I've reported and that @srstsavage has also posted about in ERDDAP/erddap#185.
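
For example, something along these lines should surface OOM killer activity, if the journal for that period was retained:

$ sudo journalctl -k | grep -i -E "out of memory|oom.?kill"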

@benjwadams

Also check uptime on the box. Memory exhaustion can in some cases lead to the system seizing up entirely, requiring a restart. If you don't have the Docker daemon enabled on system startup and you exhaust memory, the daemon would not come back up after a reboot. If you have sar on your system, it can report historical memory usage over time without setting up other tools or CloudWatch (see the example below). Unfortunately, last time I checked, systemd doesn't keep journal logs from before the last startup without extra configuration, so the journalctl suggestion above may not work if the box was restarted.
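
For example, with the sysstat package installed and its collector enabled:

$ sar -r                        # memory usage samples from today
$ sar -r -f /var/log/sa/sa12    # samples from an earlier day (file name varies by date)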

However, if memory exhaustion is what's going on, it would be much simpler to reproduce with the datasets you have than with some larger systems such as the IOOS Glider DAC.

@rmendels

@benjwadams @srstsavage I have twice requested the information @srstsavage points to. Without that information we can't be of much help. It turns out that for a heavily used ERDDAP with a lot of files, or a lot of aggregated files, the required memory can be quite high (not heap, but total memory). There are some settings that can help, but without that information to see what is filling up, and how, there is not much more we can do. As I said, our ERDDAP is now running with no memory problems, but it needs a lot of memory, and swapping must be avoided.

@MathewBiddle
Contributor Author

And it's down again.

Looking at the ERDDAP email logs leading up to the point at which it crashed (2024-10-12), I don't see a memory issue happening:

$ grep "OS info" emailLog2024-10*
emailLog2024-10-01.txt:OS info: totalCPULoad=0.071428165 processCPULoad=0.0012237673 totalMemory=7834MB freeMemory=624MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-02.txt:OS info: totalCPULoad=0.06577647 processCPULoad=0.0012330246 totalMemory=7834MB freeMemory=622MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-03.txt:OS info: totalCPULoad=0.06445922 processCPULoad=0.0011969458 totalMemory=7834MB freeMemory=649MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-04.txt:OS info: totalCPULoad=0.0764798 processCPULoad=0.0013039889 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-05.txt:OS info: totalCPULoad=0.06200041 processCPULoad=0.0013341357 totalMemory=7834MB freeMemory=532MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-06.txt:OS info: totalCPULoad=0.17571415 processCPULoad=0.0012315342 totalMemory=7834MB freeMemory=592MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-07.txt:OS info: totalCPULoad=0.059477385 processCPULoad=0.0012016778 totalMemory=7834MB freeMemory=588MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-08.txt:OS info: totalCPULoad=0.069818884 processCPULoad=0.0012112252 totalMemory=7834MB freeMemory=538MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-09.txt:OS info: totalCPULoad=0.06342337 processCPULoad=0.0011966674 totalMemory=7834MB freeMemory=590MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-10.txt:OS info: totalCPULoad=0.07082052 processCPULoad=0.0012226121 totalMemory=7834MB freeMemory=545MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-11.txt:OS info: totalCPULoad=0.06763653 processCPULoad=0.0011976592 totalMemory=7834MB freeMemory=560MB totalSwapSpace=0MB freeSwapSpace=0MB
emailLog2024-10-12.txt:OS info: totalCPULoad=0.06363476 processCPULoad=0.0012607691 totalMemory=7834MB freeMemory=700MB totalSwapSpace=0MB freeSwapSpace=0MB

I'm looking at journalctl logs but not seeing anything for OOMKiller.
