Skip to content
This repository has been archived by the owner on Jun 9, 2024. It is now read-only.

Timeout blocking sync #97

Open
nward opened this issue Oct 23, 2021 · 1 comment
Open

Timeout blocking sync #97

nward opened this issue Oct 23, 2021 · 1 comment

Comments

@nward
Copy link

nward commented Oct 23, 2021

Hi,

We're running octorpki on a couple of servers, and have found they are both getting hung up on a particular repo with network timeouts, and it stops processing of future requests.

We've got.. some sort of network issue reaching this repo, as it works from other places. For some reason our requests just time out every now and then. Regardless I would want octorpki to gracefully handle that sort of thing - there seems to be a lot of half baked rpki repos so I guess we want to be pretty robust dealing with those.

Below are some logs and relevant processes running. Notable is the rsync zombie process, I wonder if there's something odd going on here where rsync is forking unexpectedly so rsync timeout doesn't work like we'd hope, or something.

If I kill the rsync process (not the zombie) it moves on.

We're on centos 7.9, and I've included the rsync and octorpki versions below.

It looks like we should be upgrading to 1.3.0 anyway, so I'll get to doing that, but I don't see any changes which would be related to this issue.

Oct 20 11:21:31 ote-rpki-1 octorpki: time="2021-10-20T11:21:31+13:00" level=info msg="RRDP: Downloading root notification https://ca.rg.net/rrdp/notify.xml"
Oct 20 11:21:31 ote-rpki-1 octorpki: time="2021-10-20T11:21:31+13:00" level=info msg="RRDP sync https://ca.rg.net/rrdp/notify.xml"
Oct 20 11:21:33 ote-rpki-1 octorpki: time="2021-10-20T11:21:33+13:00" level=error msg="Error when processing https://ca.rg.net/rrdp/notify.xml (for rsync://ca.rg.net/rpki): error fetching for repo rrdp:https://ca.rg.net/rrdp/notify.xml (rsync:rsync://ca.rg.net/rpki): status is 404. Will add to rsync."
Oct 20 11:22:18 ote-rpki-1 octorpki: time="2021-10-20T11:22:18+13:00" level=info msg="RRDP: Downloading root notification https://cc.rg.net/rrdp/notify.xml"
Oct 20 11:22:18 ote-rpki-1 octorpki: time="2021-10-20T11:22:18+13:00" level=info msg="RRDP sync https://cc.rg.net/rrdp/notify.xml"
Oct 20 11:22:29 ote-rpki-1 octorpki: time="2021-10-20T11:22:29+13:00" level=error msg="Error when processing https://cc.rg.net/rrdp/notify.xml (for rsync://cc.rg.net/rpki): error fetching for repo rrdp:https://cc.rg.net/rrdp/notify.xml (rsync:rsync://cc.rg.net/rpki): Get \"https://cc.rg.net/rrdp/notify.xml\": net/http: TLS handshake timeout. Will add to rsync."
Oct 20 11:22:48 ote-rpki-1 octorpki: time="2021-10-20T11:22:48+13:00" level=info msg="RRDP: Downloading root notification https://cb.rg.net/rrdp/notify.xml"
Oct 20 11:22:48 ote-rpki-1 octorpki: time="2021-10-20T11:22:48+13:00" level=info msg="RRDP sync https://cb.rg.net/rrdp/notify.xml"
Oct 20 11:23:18 ote-rpki-1 octorpki: time="2021-10-20T11:23:18+13:00" level=error msg="Error when processing https://cb.rg.net/rrdp/notify.xml (for rsync://cb.rg.net/rpki): error fetching for repo rrdp:https://cb.rg.net/rrdp/notify.xml (rsync:rsync://cb.rg.net/rpki): Get \"https://cb.rg.net/rrdp/notify.xml\": dial tcp 147.28.0.46:443: i/o timeout. Will add to rsync."
Oct 20 11:23:39 ote-rpki-1 octorpki: time="2021-10-20T11:23:39+13:00" level=info msg="Rsync sync rsync://cb.rg.net/rpki"
Oct 20 11:25:47 ote-rpki-1 octorpki: time="2021-10-20T11:25:47+13:00" level=info msg="Rsync sync rsync://cc.rg.net/rpki"
Oct 20 11:25:47 ote-rpki-1 octorpki: time="2021-10-20T11:25:47+13:00" level=error msg="Error when processing cache/cb.rg.net/rpki (for {0xc00010e4d0}): exit status 10. Will add to rsync."
Oct 20 11:25:47 ote-rpki-1 octorpki: time="2021-10-20T11:25:47+13:00" level=error msg="rsync error: error in socket IO (code 10) at clientserver.c(125) [Receiver=3.1.2]"
Oct 20 11:25:47 ote-rpki-1 octorpki: time="2021-10-20T11:25:47+13:00" level=error msg="rsync: failed to connect to cb.rg.net (2001:418:1::46): Network is unreachable (101)"
Oct 20 11:25:47 ote-rpki-1 octorpki: time="2021-10-20T11:25:47+13:00" level=error msg="rsync: failed to connect to cb.rg.net (147.28.0.46): Connection timed out (110)"
octorpki 24723  0.0 52.3 3667600 2031876 ?     Ssl  Oct20   3:36 /usr/bin/octorpki -cache cache/ -cors.origins * -http.addr :8080 -http.cache -http.info /infos -http.metrics /metrics -loglevel info -manifest.use -mode server -output.roa output.json -output.sign -output.sign.key private.pem -output.sign.validity 1h -output.wait -refresh 20m -rrdp -rrdp.failover -rrdp.file rrdp.json -rsync.bin /usr/bin/rsync -rsync.timeout 20m -tal.root tals/afrnic.tal,tals/apnic.tal,tals/arin.tal,tals/lacnic.tal,tals/ripe.tal
octorpki  2068  0.0  0.0      0     0 ?        Z    Oct20   0:00  \_ [rsync] <defunct>
octorpki  2072  0.0  0.0 122504   336 ?        S    Oct20   0:00 /usr/bin/rsync -var rsync://cc.rg.net/rpki cache/cc.rg.net/rpki
[nward@ote-rpki-1 ~]$ rsync --version
rsync  version 3.1.2  protocol version 31
Copyright (C) 1996-2015 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, prealloc

rsync comes with ABSOLUTELY NO WARRANTY.  This is free software, and you
are welcome to redistribute it under certain conditions.  See the GNU
General Public Licence for details.
[nward@ote-rpki-1 ~]$ octorpki --version
OctoRPKI v1.2.2 (2020-12-09T20:04:56+0000)
@nward
Copy link
Author

nward commented Oct 23, 2021

OK - looks like sigkill kills the rsync parent. Here's the processes after restarting:

octorpki 32026 30.3 18.7 4278880 3055808 ?     Ssl  16:20   2:19 /usr/bin/octorpki -cache cache/ -cors.origins * -http.addr :8080 -http.cache -http.info /infos -http.metrics /metrics -loglevel info -manifest.use -mode server -output.roa output.json -output.sign -output.sign.key private.pem -output.sign.validity 1h -output.wait -refresh 20m -rrdp -rrdp.failover -rrdp.file rrdp.json -rsync.bin /usr/bin/rsync -rsync.timeout 20m -tal.root tals/afrnic.tal,tals/apnic.tal,tals/arin.tal,tals/lacnic.tal,tals/ripe.tal
octorpki 10478  0.0  0.0 122244  1600 ?        S    16:27   0:00  \_ /usr/bin/rsync -var rsync://cb.rg.net/rpki cache/cb.rg.net/rpki
octorpki 10584  0.0  0.0 122504   464 ?        S    16:27   0:00      \_ /usr/bin/rsync -var rsync://cb.rg.net/rpki cache/cb.rg.net/rpki

Then after 20 mins:

octorpki 32026 12.1  8.0 4278880 1303092 ?     Ssl  16:20   3:15 /usr/bin/octorpki -cache cache/ -cors.origins * -http.addr :8080 -http.cache -http.info /infos -http.metrics /metrics -loglevel info -manifest.use -mode server -output.roa output.json -output.sign -output.sign.key private.pem -output.sign.validity 1h -output.wait -refresh 20m -rrdp -rrdp.failover -rrdp.file rrdp.json -rsync.bin /usr/bin/rsync -rsync.timeout 20m -tal.root tals/afrnic.tal,tals/apnic.tal,tals/arin.tal,tals/lacnic.tal,tals/ripe.tal
octorpki 10478  0.0  0.0      0     0 ?        Z    16:27   0:00  \_ [rsync] <defunct>
octorpki 10584  0.0  0.0 122504   464 ?        S    16:27   0:00 /usr/bin/rsync -var rsync://cb.rg.net/rpki cache/cb.rg.net/rpki

Watching strace for those 3 procs I see 10478 get killed:

[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 32026] epoll_pwait(4, [{EPOLLIN, {u32=14002224, u64=14002224}}], 128, 46, NULL, 3) = 1
[pid 32026] read(5, "\0", 16)           = 1
[pid 32026] epoll_pwait(4, [], 128, 0, NULL, 21870373) = 0
[pid 32026] epoll_pwait(4, [], 128, 20, NULL, 3) = 0
[pid 32026] epoll_pwait(4, [], 128, 0, NULL, 21870373) = 0
[pid 32026] epoll_pwait(4, [], 128, 1, NULL, 3) = 0
[pid 32026] futex(0xc000040d50, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 32026] futex(0xc000040950, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 32026] madvise(0xc040b0a000, 8192, MADV_DONTNEED) = 0
[pid 32026] write(6, "\0", 1)           = 1
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
[pid 10478] <... select resumed> <unfinished ...>) = ?
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 10478] +++ killed by SIGKILL +++
[pid 32026] <... futex resumed>)        = 0
[pid 32026] futex(0xd2c810, FUTEX_WAIT_PRIVATE, 0, NULL) = 0

If I kill 10584 with SIGTERM, things get moving again (then it tries rsyncing cc.rg.net again and fails 👎)

This is reproducible on a test machine, so am happy to do debugging and such. For prod I think I'll need a script to kill off these rg.net rsyncs..

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant