The localhost socket connection that failed to connect to the R worker used port 11562 #11

Open
hinling-blisspoint opened this issue Apr 12, 2023 · 3 comments


@hinling-blisspoint

Thanks for your Helm chart and documentation. I was able to follow the instructions to install the RStudio server and workers without any problem. However, when I ran plan(cluster, manual = TRUE, quiet = TRUE), I got the following error after the 120-second timeout:

Error in socketConnection(localhostHostname, port = port, server = TRUE,  : 
  Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘future-scheduler-76594f99f4-db96z’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpBQzoyE/worker.rank=2.parallelly.parent=337.15179347101.pid")), silent = TRUE)' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential.
 * Failed to kill local

I did get the correct number of workers when I tried nbrOfWorkers(). I am pretty new to RStudio so maybe I am missing something obvious?

@paciorek
Owner

Hmm, the worker pods should be starting R and trying to connect to the scheduler pod (the commands for this are in github.com/paciorek/future-helm-chart/templates/future-worker-deployment.yaml). The manual=TRUE argument should tell plan that the worker processes are already running and that the master process shouldn't be launching the workers.

So it is odd that the output says "Worker launch call", which seems to indicate that the master process is trying to start the workers despite the use of manual=TRUE.

If you try running

plan(cluster, manual=TRUE, verbose=TRUE)

and let me know what you get, I might be able to make a suggestion.

@hinling-blisspoint
Author

Thanks for the reply! Here's what I got when running with verbose=TRUE:

> library(future)
> plan(cluster, manual=TRUE, verbose=TRUE)
[19:24:04.574] [local output] makeClusterPSOCK() ...
[19:24:04.673] [local output] Workers: [n = 1] ‘localhost’
[19:24:04.676] [local output] Base port: 11562
[19:24:04.676] [local output] Getting setup options for 1 cluster nodes ...
[19:24:04.676] [local output]  - Node 1 of 1 ...
[19:24:04.677] [local output] localMachine=TRUE => revtunnel=FALSE

[19:24:04.678] Testing if worker's PID can be inferred: ‘'/usr/local/lib/R/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'file.exists("/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")'’
[19:24:04.910] - Possible to infer worker's PID: TRUE
[19:24:04.911] [local output] Rscript port: 11562

[19:24:04.911] [local output] Getting setup options for 1 cluster nodes ... done
[19:24:04.911] [local output] Creating node 1 of 1 ...
[19:24:04.911] [local output] - setting up node
[19:24:04.912] [local output] - attempt #1 of 3
----------------------------------------------------------------------
Manually, start worker #1 on local machine ‘localhost’ with:

  '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

[19:24:04.913] [local output] Waiting for worker #1 on ‘localhost’ to connect back
Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘future-scheduler-65984456d4-9xxrt’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential.
 * Failed to kill local worker because it's PID is could not be identified.
 * Troubleshooting suggestions:
   - Suggestion #1: Set 'outfile=NULL' to see output from worker.
   - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for .

[19:26:09.427] [local output] - waiting 15 seconds before trying again
[19:26:24.427] [local output] - attempt #2 of 3
----------------------------------------------------------------------
Manually, start worker #1 on local machine ‘localhost’ with:

  '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

[19:26:24.428] [local output] Waiting for worker #1 on ‘localhost’ to connect back
Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘future-scheduler-65984456d4-9xxrt’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential.
 * Failed to kill local worker because it's PID is could not be identified.
 * Troubleshooting suggestions:
   - Suggestion #1: Set 'outfile=NULL' to see output from worker.
   - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for .

[19:28:28.938] [local output] - waiting 15 seconds before trying again
[19:28:43.938] [local output] - attempt #3 of 3
----------------------------------------------------------------------
Manually, start worker #1 on local machine ‘localhost’ with:

  '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

[19:28:43.939] [local output] Waiting for worker #1 on ‘localhost’ to connect back
[19:30:48.452] [local output]   Failed 3 attempts with 15 seconds delay
Error in socketConnection(localhostHostname, port = port, server = TRUE,  : 
  Failed to launch and connect to R worker on local machine ‘localhost’ from local machine ‘future-scheduler-65984456d4-9xxrt’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11562 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/local/lib/R/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/tmp/RtmpdujxyT/worker.rank=1.parallelly.parent=420.1a46a215007.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11562 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRA

As you mentioned, it looks like it's trying to start a worker (see the log line "[19:24:04.911] [local output] Creating node 1 of 1 ..."), but the worker is definitely already running:

k get pods
NAME                                READY   STATUS    RESTARTS   AGE
future-scheduler-65984456d4-9xxrt   1/1     Running   0          15h
future-worker-867984946d-bp5lb      1/1     Running   0          18h

Here's the worker log:

k exec -it future-worker-867984946d-bp5lb -- cat /tmp/rworker.log
starting worker pid=29 on future-scheduler:11562 at 01:19:26.708
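One way to cross-check the two sides is to pull the PID, host, and port out of the worker log line and compare them with the port the scheduler reports (11562 in the output above). The following is only a sketch: parse_worker_log is a hypothetical helper name, and it assumes the log format is exactly the one shown in the line above.

```shell
# Hypothetical helper (not part of the helm chart): reads worker log text on
# stdin and prints "pid host port" for every "starting worker" line.
# Assumes the format: starting worker pid=<pid> on <host>:<port> at <time>
parse_worker_log() {
  sed -n 's/^starting worker pid=\([0-9][0-9]*\) on \([^: ]*\):\([0-9][0-9]*\) at .*$/\1 \2 \3/p'
}
```

The output of the cat command above could be piped into parse_worker_log, so the host and port the worker is targeting can be compared directly with the MASTER and PORT values in the scheduler's output.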

@paciorek
Owner

Hmm, it does look like the worker pod is running. It should have a running R process that tries to connect to the scheduler pod on port 11562.

You could try connecting to a shell in the worker pod to see if an R process is running in the pod.

k exec -it future-worker-867984946d-bp5lb -- /bin/bash

and then run ps or top, and look for the process with ID 29, e.g.,

ps aux | grep 29
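Since grep 29 also matches any other line that happens to contain "29", a narrower check may be easier to read. The sketch below assumes a Linux container with /proc mounted; pid_cmdline is a hypothetical helper name, not something the chart provides.

```shell
# Hypothetical helper: succeed and print the command line if the PID is alive.
pid_cmdline() {
  pid="$1"
  # kill -0 sends no signal; it only tests whether the process can be signalled.
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "PID $pid is not running (or not visible to this user)" >&2
    return 1
  fi
  if [ -r "/proc/$pid/cmdline" ]; then
    # Arguments in /proc/<pid>/cmdline are NUL-separated; show them with spaces.
    tr '\000' ' ' < "/proc/$pid/cmdline"
    echo
  else
    echo "PID $pid is running"
  fi
}
```

Inside the worker pod, pid_cmdline 29 would print the command line of process 29 if it is still alive, which should make it clear whether it is the R worker.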

You could also try to manually start the R worker process when you are in a shell on the worker pod to see if it gives you an error.

root@future-worker-867984946d-bp5lb:/# Rscript -e 'setup_kube()' && Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()'  MASTER=future-scheduler PORT=11562 OUT=/tmp/newlogfile TIMEOUT=2592000 XDR=TRUE &

Then in your RStudio session (the scheduler process) you could try running plan() again and see what happens.

Given that your pods have been running for so many hours, there might be some timeout issue, so you may want to start from scratch by running helm uninstall and then redoing the helm install. (Or even perhaps restarting your Kubernetes cluster.)
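The start-from-scratch route might look roughly like the sketch below. The release name and chart location are not given in this thread, so both are passed in as arguments and must match whatever was used in the original helm install. The run and reinstall helpers are hypothetical, and by default the commands are only printed (DRY_RUN defaults to 1) so nothing is removed by accident.

```shell
# Print each command; execute it only when DRY_RUN is set to something other than 1.
run() {
  echo "+ $*"
  [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Usage: reinstall <release-name> <chart-path>
# Both arguments are assumptions: use the values from your original install.
reinstall() {
  release="$1"
  chart="$2"
  run helm uninstall "$release"
  run helm install "$release" "$chart"
  # List the pods to confirm the new scheduler and workers are starting.
  run kubectl get pods
}
```

Calling reinstall with the release name and chart path shows the three commands; setting DRY_RUN=0 in front of the call would actually run them.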
