Spawn delay while waiting for DNS update #279
Can you share the logs of
Thanks! Do you also have the logs of the
Hi, I'm facing a similar issue of very slow spawning (~40-60 seconds) on SwarmSpawner as well; I have logs from the hub and proxy and from the jupy-foo container.
FWIW, I'm using Keycloak for OAuth authentication, but that seems to perform the authentication fairly quickly.
From the logs, the key events show that everything goes smoothly: from the time login finishes to the newly spawned server first handling a request from your browser is 3 seconds. But then, somehow, the oauth handshake between the hub and the single-user server takes over 30 seconds. The hub->server oauth sequence is essentially a redirect to the single-user server's oauth callback, followed by the server's token-validation request back to the hub.

It appears to be the user server's handling of the oauth callback that's taking ~all of the time; the token validation request doesn't occur until near the end of that window. Seems like there must be a bug there. Since the delay is suspiciously close to 30 seconds, I suspect a problem in DNS resolution (cluster DNS in particular). Can you try timing the lookup in a notebook after it's launched (something like the sketch below) and share how long it takes? Also, does JUPYTERHUB_API_URL use a hostname or an IP? It's my hope that it will take ~30 seconds, since that would suggest that cluster DNS is the path we want to follow.
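For example, a quick check along these lines from a notebook cell would do. It's only a sketch: it times resolving the hub API hostname (read from JUPYTERHUB_API_URL in the notebook's environment), which is one plausible place for a 30-second stall.

```python
# Time how long the single-user server's environment needs to resolve the
# hub's hostname. JUPYTERHUB_API_URL is set by JupyterHub in the notebook's
# environment; the rest of this cell is just an illustrative check.
import os
import socket
import time
from urllib.parse import urlparse

api_url = os.environ["JUPYTERHUB_API_URL"]
host = urlparse(api_url).hostname
print("hub API host:", host)

start = time.perf_counter()
addr = socket.gethostbyname(host)
print(f"resolved to {addr} in {time.perf_counter() - start:.2f}s")
```

If JUPYTERHUB_API_URL already contains a bare IP, the lookup will of course return instantly, which is itself useful information.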
@minrk does this also apply to my issue?
@minrk Right on!
So what do you think should be used for cluster DNS? Any specific configuration requirements?
Another thing I noticed while troubleshooting this issue: @statiksof, it might not be ideal, but see if you can simply use a single-node swarm. I can attest that it works without the long spawning times.
My issue is resolved (at least for now). After cleaning up the resources used by the hub and proxy containers, proxying and spawning are working perfectly fine (more info on why I was having the issue: jupyterhub/configurable-http-proxy#185). I'm using docker swarm's own DNS, and it's fast enough now: wall time is 9 ms instead of 27 seconds! Thanks for the help. @statiksof, what environment are you running your spawner in? Virtual? Metal? And what specs? Maybe I can try and help narrow the problem down.
Thanks @Mohitsharma44. Actually, I am not using SwarmSpawner, only DockerSpawner. My JupyterHub runs in a CentOS 7 virtual machine. I also tried to run JupyterHub under systemd instead of in a container, but I have the same problem.
Hmm... I quickly spawned a minimal CentOS 7 VM on VirtualBox with 2 cores and 2 GB RAM, installed everything there, spawned a container using DockerSpawner, and didn't notice any hiccups.
Can you provide a guideline on how to set up a single-node swarm?
Sure, just run `docker swarm init`, which will initialize swarm mode on the Docker engine, and then `docker-compose up` with the same files as in the swarm example: https://github.com/jupyterhub/dockerspawner/tree/master/examples/swarm
Thanks! I will try this and see.
I'm running into a very similar issue: on docker swarm, the notebook sometimes fails to spawn, giving a 500 error after 30 seconds, and I see quite a significant number of these failed spawns. I managed to get rid of the 500 errors, making the spawn slow but working, by setting:

```python
c.Spawner.http_timeout = 99
```

On the docker side I see the notebook being spawned without any issues; the jupyterhub container, however, cannot connect to it. Looking inside the jupyterhub container, I see connections hanging in a SYN_SENT state:

```
root@jupyterhub:/srv/jupyterhub# netstat -atn | grep 8888
tcp 0 1 10.0.47.114:38248 10.0.47.131:8888 SYN_SENT
```

The problem here is caused by a seemingly invalid DNS resolution:

```
root@jupyterhub:/srv/jupyterhub# host jupyter-testaccount
jupyter-testaccount has address 10.0.47.133
```

Now I'm trying to get some understanding of what is going on under the hood:

```
root@jupyterhub:/srv/jupyterhub# while true; do host jupyter-testaccount; netstat -atn | grep 8888; sleep 0.05; done | tee output.log
```

Run the above before spawning and you get:

```
Host jupyter-testaccount not found: 3(NXDOMAIN)
```

The netstat command either doesn't output anything, or shows old connections in CLOSE_WAIT or LAST_ACK state. These are relevant, as they give some information about your previous container. Now spawn a container. There are two possible outcomes. Either the new name resolves and the hub connects:

```
jupyter-testaccount has address 10.0.47.135
tcp 1 1 10.0.47.114:51638 10.0.47.133:8888 LAST_ACK
tcp 0 0 10.0.47.114:38024 10.0.47.135:8888 ESTABLISHED
```

Or it goes wrong. First this:

```
Host jupyter-testaccount not found: 3(NXDOMAIN)
tcp 0 1 10.0.47.114:38224 10.0.47.135:8888 SYN_SENT
tcp 1 1 10.0.47.114:38024 10.0.47.135:8888 LAST_ACK
```

Half a second (or so) later this:

```
jupyter-testaccount has address 10.0.47.137
tcp 0 1 10.0.47.114:38224 10.0.47.135:8888 SYN_SENT
tcp 1 1 10.0.47.114:38024 10.0.47.135:8888 LAST_ACK
```

So it seems there is some race condition between docker swarm spawning the service and jupyterhub trying to connect to it. Tcpdumping the traffic to the DNS server, you see over 350 failed DNS requests before answers come through. A failed one looks like this:

```
17:26:04.958598 IP (tos 0x0, ttl 64, id 39566, offset 0, flags [DF], proto UDP (17), length 161)
    127.0.0.11.53 > 127.0.0.1.44212: [bad udp cksum 0xfeaa -> 0x3507!] 42293 NXDomain q: A? jupyter-testaccount. 0/1/0 ns: . [1h41m39s] SOA a.root-servers.net. nstld.verisign-grs.com. 2018121200 1800 900 604800 86400 (133)
```

And finally success:

```
17:26:05.033451 IP (tos 0x0, ttl 64, id 39590, offset 0, flags [DF], proto UDP (17), length 142)
    127.0.0.11.53 > 127.0.0.1.54346: [bad udp cksum 0xfe97 -> 0xebfa!] 62151 q: A? jupyter-testaccount. 1/0/0 jupyter-testaccount. [10m] A 10.0.47.185 (114)
```

It's during the time of the failed requests that jupyterhub tries to connect to the old IP address; once DNS starts succeeding, it connects to the new one. I don't know if the same symptoms apply to the other people running into slow spawning notebooks, but it might be worth looking into. We managed to work around this by adjusting get_ip_and_port() in swarmspawner.py, adding a loop that keeps retrying the lookup.
I was having this same issue.
Hmmm, I don't think I can? I create the overlay network as part of the stack, similar to how it is described here. That means all notebook containers, plus the jupyterhub container, live in that network. Do you handle that differently? Besides, deleting the network may prevent something inside jupyterhub from connecting to the old IP address, but I think it's less intrusive if that same something inside jupyterhub simply doesn't connect at all when the container name doesn't resolve, or, alternatively, gets the IP address from the docker API.
I was vague in my previous message; let me clarify. I am assuming you are testing things, which means you are stopping and starting the hub container (and consequently the notebook containers). The hub, when spawning notebooks, relies on that overlay network, so when you are stopping the hub, make sure to also delete the overlay network that you are using. The best way would be to create a compose file and let it handle this for you.
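If the network is created by hand rather than managed by compose, a cleanup step along these lines is one option. This is only a sketch using the docker Python SDK, and the network name "jupyterhub" is an assumption (use whatever your stack names the overlay network):

```python
# Remove the stale overlay network after stopping the hub, so the next run
# starts with fresh DNS entries. The name "jupyterhub" is illustrative.
import docker

client = docker.from_env()
for network in client.networks.list(names=["jupyterhub"]):
    print(f"removing overlay network {network.name} ({network.short_id})")
    network.remove()
```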
Ah thanks. Well, the situation is a bit different: we are running it at the moment as a proof of concept for about 20 users or so. The problem is not when I stop/start the hub; it never occurs when the hub is freshly restarted and I spawn a notebook container. The problem begins the second time I launch a notebook container (with indeed the same name as the previous one). That's when the race condition shows up: sometimes it works, sometimes it doesn't, and the way that translates into observed behaviour is what I wrote above, i.e. a connection in the SYN_SENT state to the old IP address. Expected behaviour is that when docker DNS doesn't return anything, the hub retries a few times rather than falling back on an old value.

```python
@gen.coroutine
def get_ip_and_port(self):
    if self.use_internal_ip:
        ip = self.service_name
        port = self.port

        import socket
        import time

        # Give docker swarm's DNS a moment, then poll until the service name
        # actually resolves instead of connecting to a stale address.
        time.sleep(3)
        for attempt in range(30):
            try:
                time.sleep(1)
                ip = socket.gethostbyname(ip)
                self.log.info(
                    "Jupyter environment '%s' ip address is: %s",
                    self.service_name, ip,
                )
                break
            except socket.gaierror:
                self.log.info(
                    "Jupyter environment '%s' is still unknown, retrying %s..",
                    self.service_name, str(attempt),
                )
                continue
        else:
            # The loop never broke, i.e. the name never resolved.
            self.log.error(
                "Jupyter environment '%s' is still unknown, please check docker logs.",
                self.service_name,
            )
    return ip, port
```

I hope this clarifies the situation.
A similar (asyncio) wait for DNS resolution when
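For reference, a non-blocking take on the same retry could look roughly like this. It's only a sketch, not dockerspawner's actual API: the helper name, retry count, and delay are made up, and it resolves through the event loop so the hub isn't stalled by the time.sleep() calls in the snippet above:

```python
# Sketch of an asyncio-based wait for docker swarm DNS. wait_for_dns and its
# defaults are illustrative; a spawner would await this from get_ip_and_port().
import asyncio
import socket

async def wait_for_dns(name, port, log, retries=30, delay=1.0):
    loop = asyncio.get_running_loop()
    for attempt in range(retries):
        try:
            # Resolve through the event loop so other hub work keeps running.
            info = await loop.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
            ip = info[0][4][0]
            log.info("Service '%s' resolved to %s", name, ip)
            return ip
        except socket.gaierror:
            log.info("Service '%s' not resolvable yet (attempt %d)", name, attempt)
            await asyncio.sleep(delay)
    raise TimeoutError(f"{name} did not resolve after {retries} attempts")
```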
Hi,
sorry to insist on this, but I didn't get any answer for it.
I'm using the latest JupyterHub and DockerSpawner versions.
Spawning is too slow (~30 seconds and sometimes more). Have any of you experienced this before?