<p>In this post, I'll demonstrate that you can easily use the <strong><a href="https://cran.r-project.org/package=future">future</a></strong> package in R on a cluster of machines running in the cloud, specifically on a Kubernetes cluster.</p>
<p>This allows you to easily do parallel computing in R in the cloud. One advantage of working in the cloud is the ability to easily scale the number and type of (virtual) machines across which you run your parallel computation.</p>
<h2 id="why-use-kubernetes-to-start-a-cluster-in-the-cloud">Why use Kubernetes to start a cluster in the cloud?</h2>
<p>Kubernetes is a platform for managing containers. You can think of the containers as lightweight Linux machines on which you can do your computation. By using the Kubernetes service of a cloud provider such as Google Cloud Platform (GCP) or Amazon Web Services (AWS), you can easily start up a cluster of (virtual) machines.</p>
<p>There have been (and are) approaches to starting up a cluster of machines on AWS easily from the command line on your laptop. Some tools that are no longer actively maintained are <a href="http://star.mit.edu/cluster">StarCluster</a> and <a href="https://cfncluster.readthedocs.io/en/latest">CfnCluster</a>. And there is now something called <a href="https://docs.aws.amazon.com/parallelcluster/latest/ug/getting_started.html">AWS ParallelCluster</a>. But doing it via Kubernetes allows you to build upon an industry standard platform that can be used on various cloud providers. A similar effort (which I heavily borrowed from in developing the setup described here) allows one to run a <a href="https://docs.dask.org/en/latest/setup/kubernetes-helm.html">Python Dask cluster</a> accessed via a Jupyter notebook.</p>
<p>Many of the cloud providers have Kubernetes services (and it's also possible you'd have access to a Kubernetes service running at your institution or company). In particular, I've experimented with <a href="https://cloud.google.com/kubernetes-engine">Google Kubernetes Engine (GKE)</a> and <a href="https://aws.amazon.com/eks">Amazon's Elastic Kubernetes Service (EKS)</a>. This post will demonstrate setting up your cluster using Google's GKE, but see my GitHub <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository for details on doing it on Amazon's EKS. Note that while I've gotten things to work on EKS, there have been <a href="https://github.com/paciorek/future-kubernetes#AWS-troubleshooting">various headaches</a> that I haven't encountered on GKE.</p>
<p>I'm not a Kubernetes expert, nor a GCP or AWS expert (that might explain the headaches I just mentioned), but one upside is that hopefully I'll go through all the details at a level that someone who is not an expert can follow. In fact, part of my goal in setting this up has been to learn more about Kubernetes, which I've done, but note that there's <em>a lot</em> to it.</p>
<p>More details about the setup, including how it was developed and troubleshooting tips, can be found in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p>
<h2 id="how-it-works-briefly">How it works (briefly)</h2>
<p>The diagram in Figure 1 outlines the pieces of the setup.</p>
<figure>
<img src="k8s.png" alt="Overview of using future on a Kubernetes cluster" width="700"/>
<figcaption style="font-style: italic;">
Figure 1. Overview of using future on a Kubernetes cluster
</figcaption>
</figure>
<p>Work on a Kubernetes cluster is divided amongst <em>pods</em>, which carry out the components of your work and can communicate with each other. A pod is basically a Linux container. (Strictly speaking a pod can contain multiple containers and shared resources for those containers, but for our purposes, it's simplest just to think of a pod as being a Linux container.) The pods run on the nodes in the Kubernetes cluster, where each Kubernetes node runs on a compute instance of the cloud provider. These instances are themselves virtual machines running on the cloud provider's actual hardware. (I.e., somewhere out there, behind all the layers of abstraction, there are actual real computers running on endless aisles of computer racks in some windowless warehouse!) One of the nice things about Kubernetes is that if a pod dies, Kubernetes will automatically restart it.</p>
<p>The basic steps are:</p>
<ol style="list-style-type: decimal">
<li>Start your Kubernetes cluster on the cloud provider's Kubernetes service</li>
<li>Start the pods using Helm, the Kubernetes package manager</li>
<li>Connect to the RStudio Server session running on the cluster from your browser</li>
<li>Run your future-based computation</li>
<li>Terminate the Kubernetes cluster</li>
</ol>
<p>We use the Kubernetes package manager, Helm, to run the pods of interest:</p>
<ul>
<li>one (scheduler) pod for a main process that runs RStudio Server and communicates with the workers</li>
<li>multiple (worker) pods, each with one R worker process to act as the workers managed by the <strong>future</strong> package</li>
</ul>
<p>Helm manages the pods and related <em>services</em>. An example of a service is to open a port on the scheduler pod so the R worker processes can connect to that port, allowing the RStudio Server process on the scheduler pod to communicate with the worker R processes. I have a <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> that does this; it borrows heavily from the <a href="https://github.com/dask/helm-chart">Dask Helm chart</a> for the Dask package for Python.</p>
<p>Each pod runs a Docker container. I use my own <a href="https://github.com/paciorek/future-kubernetes-docker">Docker container</a> that layers a bit on top of the <a href="https://rocker-project.org">Rocker</a> container that contains R and RStudio Server.</p>
<h2 id="step-1-start-the-kubernetes-cluster">Step 1: Start the Kubernetes cluster</h2>
<p>Here I assume you have already installed:</p>
<ul>
<li>the command line interface to Google Cloud,</li>
<li>the <code>kubectl</code> interface for interacting with Kubernetes, and</li>
<li><code>helm</code> for installing Helm charts (i.e., Kubernetes packages).</li>
</ul>
<p>Installation details can be found in the <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p>
<p>First we'll start our cluster (the first part of Step 1 in Figure 1):</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">gcloud</span> container clusters create \
--machine-type n1-standard-1 \
--num-nodes 4 \
--zone us-west1-a \
--cluster-version latest \
my-cluster</code></pre></div>
<p>I've asked for four virtual machines (nodes), using the basic (and cheap) <code>n1-standard-1</code> instance type (which has a single CPU per virtual machine) from Google Cloud Platform.</p>
<p>You'll want the total number of cores across the virtual machines to equal the number of R workers that you want to start and that you specify in the Helm chart (as discussed below). Here we ask for four one-CPU nodes, and our Helm chart starts four workers, so all is well. See the <a href="#modifications">Modifications section</a> below on how to start up a different number of workers.</p>
<p>Since the RStudio Server process that you interact with wouldn't generally be doing heavy computation at the same time as the workers, it's OK that the RStudio scheduler pod and a worker pod end up using the same virtual machine.</p>
<h2 id="step-2-install-the-helm-chart-to-set-up-your-pods">Step 2: Install the Helm chart to set up your pods</h2>
<p>Next we need to get our pods going by installing the Helm chart (i.e., package) on the cluster; the installed chart is called a <em>release</em>. As discussed above, the Helm chart tells Kubernetes what pods to start and how they are configured.</p>
<p>First we need to give our account permissions to perform administrative actions:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">kubectl</span> create clusterrolebinding cluster-admin-binding \
--clusterrole=cluster-admin</code></pre></div>
<p>Now let's install the release. This code assumes the use of Helm version 3 or greater (for older versions <a href="https://github.com/paciorek/future-kubernetes">see my full instructions</a>).</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="fu">git</span> clone https://github.com/paciorek/future-helm-chart # download the materials
<span class="fu">tar</span> -czf future-helm.tgz -C future-helm-chart . # create a zipped archive (tarball) <span class="ex">that</span> <span class="kw">`</span><span class="ex">helm</span> install<span class="kw">`</span> needs
<span class="ex">helm</span> install --wait test ./future-helm.tgz # install (start the pods)</code></pre></div>
<p>You'll need to name your release; I've used 'test' above.</p>
<p>The <code>--wait</code> flag tells helm to wait until all the pods have started. Once that happens, you'll see a message about the release and how to connect to the RStudio interface, which we'll discuss further in the next section.</p>
<p>We can check that the pods are running:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">kubectl</span> get pods</code></pre></div>
<p>You should see something like this (the alphanumeric characters at the ends of the names will differ in your case):</p>
<pre><code>NAME                                READY   STATUS    RESTARTS   AGE
future-scheduler-6476fd9c44-mvmz6   1/1     Running   0          116s
future-worker-54db85cb7b-47qsd      1/1     Running   0          115s
future-worker-54db85cb7b-4xf4x      1/1     Running   0          115s
future-worker-54db85cb7b-rj6bj      1/1     Running   0          116s
future-worker-54db85cb7b-wvp4n      1/1     Running   0          115s</code></pre>
<p>As expected, we have one scheduler and four workers.</p>
<h2 id="step-3-connect-to-rstudio-server-running-in-the-cluster">Step 3: Connect to RStudio Server running in the cluster</h2>
<p>Next we'll connect to the RStudio instance running via RStudio Server on our main (scheduler) pod, using the browser on our laptop (Step 3 in Figure 1).</p>
<p>After installing the Helm chart, you should have seen a printout with some instructions on how to do this. First you need to connect a port on your laptop to the RStudio port on the main pod (running of course in the cloud):</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="bu">export</span> <span class="va">RSTUDIO_SERVER_IP=</span><span class="st">"127.0.0.1"</span>
<span class="bu">export</span> <span class="va">RSTUDIO_SERVER_PORT=</span>8787
<span class="ex">kubectl</span> port-forward --namespace default svc/future-scheduler <span class="va">$RSTUDIO_SERVER_PORT</span>:8787 <span class="kw">&</span></code></pre></div>
<p>You can now connect from your browser to the RStudio Server instance by going to the URL: <a href="http://127.0.0.1:8787" class="uri">http://127.0.0.1:8787</a>.</p>
<p>Enter <code>rstudio</code> as the username and <code>future</code> as the password to log in to RStudio.</p>
<p>What's happening is that port 8787 on your laptop is forwarding to the port on the main pod on which RStudio Server is listening (which is also port 8787). So you can just act as if RStudio Server is accessible directly on your laptop.</p>
<p>One nice thing about this is that there is no public IP address for someone to maliciously use to connect to your cluster. Instead the access is handled securely entirely through <code>kubectl</code> running on your laptop. However, it also means that you couldn't easily share your cluster with a collaborator. For details on configuring things so there is a public IP, please see <a href="https://github.com/paciorek/future-kubernetes#connecting-to-the-rstudio-instance-when-starting-the-cluster-from-a-remote-machine">my repository</a>.</p>
<p>Note that there is nothing magical about running your computation via RStudio. You could <a href="#connect-to-a-pod">connect to the main pod</a> and simply run R in it and then use the <strong>future</strong> package.</p>
<h2 id="step-4-run-your-future-based-parallel-r-code">Step 4: Run your future-based parallel R code</h2>
<p>Now we'll start up our future cluster and run our computation (Step 4 in Figure 1):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(future)
<span class="kw">plan</span>(cluster, <span class="dt">manual =</span> <span class="ot">TRUE</span>, <span class="dt">quiet =</span> <span class="ot">TRUE</span>)</code></pre></div>
<p>The key thing is that we set <code>manual = TRUE</code> above. This ensures that the functions from the <strong>future</strong> package don't try to start R processes on the workers, as those R processes have already been started by Kubernetes and are waiting to connect to the main (RStudio Server) process.</p>
<p>Note that we don't need to say how many future workers we want. This is because the Helm chart sets an environment variable in the scheduler pod's <code>Renviron</code> file based on the number of worker pod replicas. Since that variable is used by the <strong>future</strong> package (via <code>parallelly::availableCores()</code>) as the default number of future workers, this ensures that there are only as many future workers as you have worker pods. However, if you modify the number of worker pods after installing the Helm chart, you may need to set the <code>workers</code> argument to <code>plan()</code> manually. (And note that if you were to specify more future workers than R worker processes (i.e., pods), you would get an error, and if you were to specify fewer, you wouldn't be using all the resources that you are paying for.)</p>
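<p>As a quick sanity check (a sketch, assuming the default chart configuration), you can confirm how many workers the <strong>future</strong> package will use, and override the count if you have changed the number of worker pod replicas:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## how many future workers will be used by default; this should match the
## number of worker pods (4 in the setup above)
parallelly::availableCores()

## if you later change the number of worker pod replicas (say, to 8), you can
## set the number of workers explicitly
plan(cluster, workers = 8, manual = TRUE, quiet = TRUE)</code></pre></div>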
<p>Now we can use the various tools in the <strong>future</strong> package as we would if on our own machine or working on a Linux cluster.</p>
<p>Let's run our parallelized operations. I'm going to do the world's least interesting computation: calculating the mean of many (10 million) random numbers, forty separate times in parallel. Not interesting, but presumably if you're reading this you have your own interesting computation in mind and hopefully know how to do it using future's tools such as <strong><a href="https://cran.r-project.org/package=future.apply">future.apply</a></strong> and <strong><a href="https://cran.r-project.org/package=foreach">foreach</a></strong> with <strong><a href="https://cran.r-project.org/package=doFuture">doFuture</a></strong>.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(future.apply)
output <-<span class="st"> </span><span class="kw">future_sapply</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">40</span>, <span class="cf">function</span>(i) <span class="kw">mean</span>(<span class="kw">rnorm</span>(<span class="fl">1e7</span>)), <span class="dt">future.seed =</span> <span class="ot">TRUE</span>)</code></pre></div>
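<p>If you prefer <strong>foreach</strong>, the same computation can be expressed via <strong>doFuture</strong>; this is a minimal sketch, assuming those packages have been installed on the scheduler pod (see the <a href="#modifications">Modifications section</a>):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(doFuture)
registerDoFuture()  ## make %dopar% dispatch to the future workers set up above
library(foreach)
output <- foreach(i = 1:40, .combine = c,
                  .options.future = list(seed = TRUE)) %dopar% {
  mean(rnorm(1e7))
}</code></pre></div>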
<p>Note that all of this assumes you're working interactively, but you can always reconnect to the RStudio Server instance after closing the browser, and any long-running code should continue running even if you close the browser.</p>
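<p>For long-running work where you expect to close the browser, one option (a sketch, not specific to this setup) is to launch the futures explicitly and collect the results after you reconnect:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## launch the futures; the R session on the scheduler pod keeps running them
## even if you close the browser
fs <- lapply(1:40, function(i) future(mean(rnorm(1e7)), seed = TRUE))
## ... later, after reconnecting to the RStudio session ...
all(sapply(fs, resolved))  ## non-blocking check of whether the futures are done
res <- sapply(fs, value)   ## gather the results (blocks until they are ready)</code></pre></div>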
<p>Figure 2 shows a screenshot of the RStudio interface.</p>
<figure>
<img src="rstudio.png" alt="RStudio interface, demonstrating use of future commands" width="700"/>
<figcaption style="font-style: italic;">
Figure 2. Screenshot of the RStudio interface
</figcaption>
</figure>
<h3 id="working-with-files">Working with files</h3>
<p>Note that <code>/home/rstudio</code> will be your default working directory in RStudio and the RStudio Server process will be running as the user <code>rstudio</code>.</p>
<p>You can use <code>/tmp</code> and <code>/home/rstudio</code> for files, both within RStudio and within code running on the workers, but note that files (even in <code>/home/rstudio</code>) are not shared between workers nor between the workers and the RStudio Server pod.</p>
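<p>A small sketch to illustrate this: each worker writes to its own local filesystem, and those files are not visible from the RStudio (scheduler) pod, so results should generally be passed back as return values:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(future.apply)
## each worker writes a file to its own local temporary directory
paths <- future_sapply(1:4, function(i) {
  f <- file.path(tempdir(), paste0("result_", i, ".rds"))
  saveRDS(mean(rnorm(1e6)), f)
  f
}, future.seed = TRUE)
## the files live on the worker pods, not on the scheduler pod
file.exists(paths)  ## expected to be FALSE on the RStudio (scheduler) pod</code></pre></div>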
<p>To make data available to your RStudio process or get output data back to your laptop, you can use <code>kubectl cp</code> to copy files between your laptop and the RStudio Server pod. Here's an example of copying to/from <code>/home/rstudio</code>:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="co">## create a variable with the name of the scheduler pod</span>
<span class="bu">export</span> <span class="va">SCHEDULER=$(</span><span class="ex">kubectl</span> get pod --namespace default -o jsonpath=<span class="st">'{.items[?(@.metadata.labels.component=="scheduler")].metadata.name}'</span><span class="va">)</span>
<span class="co">## copy a file to the scheduler pod</span>
<span class="ex">kubectl</span> cp my_laptop_file <span class="va">${SCHEDULER}</span>:home/rstudio/
<span class="co">## copy a file from the scheduler pod</span>
<span class="ex">kubectl</span> cp <span class="va">${SCHEDULER}</span>:home/rstudio/my_output_file .</code></pre></div>
<p>Of course you can also interact with the web from your RStudio process, so you could download data to the RStudio process from the internet.</p>
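<p>For example (a sketch with a hypothetical URL), you could fetch a data file directly into the RStudio pod's home directory rather than copying it over with <code>kubectl cp</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">## hypothetical URL, for illustration only
download.file("https://www.example.com/some_data.csv",
              destfile = "/home/rstudio/some_data.csv")
dat <- read.csv("/home/rstudio/some_data.csv")</code></pre></div>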
<h2 id="step-5-cleaning-up">Step 5: Cleaning up</h2>
<p>Make sure to shut down your Kubernetes cluster, so you don't keep getting charged.</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">gcloud</span> container clusters delete my-cluster --zone=us-west1-a</code></pre></div>
<h2 id="modifications">Modifications</h2>
<p>You can modify the Helm chart in advance, before installing it. For example, you might want to install other R packages for use in your parallel code or change the number of workers.</p>
<p>To add additional R packages, go into the <code>future-helm-chart</code> directory (which you created using the directions above in Step 2) and edit the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file. Simply modify the lines that look like this:</p>
<div class="sourceCode"><pre class="sourceCode yaml"><code class="sourceCode yaml"> <span class="fu">env:</span>
<span class="co"># - name: EXTRA_R_PACKAGES</span>
<span class="co"># value: data.table</span></code></pre></div>
<p>by removing the "#" comment characters and putting the R packages you want installed in place of <code>data.table</code>, with the names of the packages separated by spaces, e.g.,</p>
<div class="sourceCode"><pre class="sourceCode yaml"><code class="sourceCode yaml"> <span class="fu">env:</span>
<span class="kw">-</span> <span class="fu">name:</span><span class="at"> EXTRA_R_PACKAGES</span>
<span class="fu">value:</span><span class="at"> foreach doFuture</span></code></pre></div>
<p>In many cases you may want these packages installed on both the scheduler pod (where RStudio Server runs) and on the workers. If so, make sure to modify the lines above in both the <code>scheduler</code> and <code>worker</code> stanzas.</p>
<p>To modify the number of workers, modify the <code>replicas</code> line in the <code>worker</code> stanza of the <a href="https://github.com/paciorek/future-helm-chart/blob/master/values.yaml">values.yaml</a> file.</p>
<p>Then rebuild the Helm chart:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="bu">cd</span> future-helm-chart ## ensure you are in the directory containing <span class="kw">`</span><span class="ex">values.yaml</span><span class="kw">`</span>
<span class="fu">tar</span> -czf ../future-helm.tgz .</code></pre></div>
<p>and install as done previously.</p>
<p>Note that increasing the number of workers in this way would probably only make sense if you also modify the number of virtual machines you start your Kubernetes cluster with, so that the total number of cores across the cloud provider's compute instances matches the number of worker replicas.</p>
<p>You may also be able to modify a running cluster. For example you could use <code>gcloud container clusters resize</code>. I haven't experimented with this.</p>
<p>To make modifications when your Helm chart is already installed (i.e., your release is running), one simple option is to reinstall the Helm chart, as discussed below. You may also need to kill the <code>port-forward</code> process discussed in Step 3.</p>
<p>For some changes, you can also update a running release without uninstalling it by "patching" the running release or scaling resources. I won't go into details here.</p>
<h2 id="troubleshooting">Troubleshooting</h2>
<p>Things can definitely go wrong in getting all the pods to start up and communicate with each other. Here are some suggestions for monitoring what is going on and troubleshooting.</p>
<p>First, you can use <code>kubectl</code> to check that the pods are running:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">kubectl</span> get pods</code></pre></div>
<h3 id="connect-to-a-pod">Connect to a pod</h3>
<p>To connect to a pod (which lets you check on installed software, see what the pod is doing, and do other troubleshooting), you can do the following:</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="bu">export</span> <span class="va">SCHEDULER=$(</span><span class="ex">kubectl</span> get pod --namespace default -o jsonpath=<span class="st">'{.items[?(@.metadata.labels.component=="scheduler")].metadata.name}'</span><span class="va">)</span>
<span class="bu">export</span> <span class="va">WORKERS=$(</span><span class="ex">kubectl</span> get pod --namespace default -o jsonpath=<span class="st">'{.items[?(@.metadata.labels.component=="worker")].metadata.name}'</span><span class="va">)</span>
<span class="co">## access the scheduler pod:</span>
<span class="ex">kubectl</span> exec -it <span class="va">${SCHEDULER}</span> -- /bin/bash
<span class="co">## access a worker pod:</span>
<span class="bu">echo</span> <span class="va">$WORKERS</span>
<span class="ex">kubectl</span> exec -it <span class="op"><</span>insert_name_of_a_worker<span class="op">></span> -- /bin/bash</code></pre></div>
<p>Alternatively, just determine the name of the pod with <code>kubectl get pods</code> and then run the <code>kubectl exec -it ...</code> invocation above.</p>
<p>Note that once you are in a pod, you can install software in the usual fashion of a Linux machine (in this case using <code>apt</code> commands such as <code>apt-get install</code>).</p>
<h3 id="connect-to-a-virtual-machine">Connect to a virtual machine</h3>
<p>To connect directly to an underlying virtual machine, first determine the name of the VM and then use the <code>gcloud</code> tools to connect to it.</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">kubectl</span> get nodes
<span class="co">## now, connect to one of the nodes, 'gke-my-cluster-default-pool-8b490768-2q9v' in this case:</span>
<span class="ex">gcloud</span> compute ssh gke-my-cluster-default-pool-8b490768-2q9v --zone us-west1-a</code></pre></div>
<h3 id="check-your-running-code">Check your running code</h3>
<p>To check that your code is actually running in parallel, you can run the following test and verify that the result contains the names of distinct worker pods.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(future.apply)
<span class="kw">future_sapply</span>(<span class="kw">seq_len</span>(<span class="kw">nbrOfWorkers</span>()), <span class="cf">function</span>(i) <span class="kw">Sys.info</span>()[[<span class="st">"nodename"</span>]])</code></pre></div>
<p>You should see something like this:</p>
<pre><code>[1] future-worker-54db85cb7b-47qsd future-worker-54db85cb7b-4xf4x
[3] future-worker-54db85cb7b-rj6bj future-worker-54db85cb7b-wvp4n</code></pre>
<p>One can also connect to the pods or to the underlying virtual nodes (as discussed above) and run Unix commands such as <code>top</code> and <code>free</code> to understand CPU and memory usage.</p>
<h3 id="reinstall-the-helm-release">Reinstall the Helm release</h3>
<p>You can reinstall your release (i.e., restart the pods without restarting the whole Kubernetes cluster):</p>
<div class="sourceCode"><pre class="sourceCode sh"><code class="sourceCode bash"><span class="ex">helm</span> uninstall test
<span class="ex">helm</span> install --wait test ./future-helm.tgz </code></pre></div>
<p>Note that you may need to restart the entire Kubernetes cluster if you're having difficulties that reinstalling the release doesn't fix.</p>
<h2 id="how-does-it-work">How does it work?</h2>
<p>I've provided many of the details of how it works in my <a href="https://github.com/paciorek/future-kubernetes">future-kubernetes</a> repository.</p>
<p>The key pieces are:</p>
<ol style="list-style-type: decimal">
<li>The <a href="https://github.com/paciorek/future-helm-chart">Helm chart</a> with the instructions for how to start the pods and any associated services.</li>
<li>The <a href="https://github.com/paciorek/future-kubernetes-docker">Rocker-based Docker container(s)</a> that the pods run.</li>
</ol>
<p>That's all there is to it ... plus <a href="https://github.com/paciorek/future-kubernetes">these instructions</a>.</p>
<p>Briefly:</p>
<ol style="list-style-type: decimal">
<li>Based on the Helm chart, Kubernetes starts up the 'main' or 'scheduler' pod running RStudio Server and multiple worker pods, each running an R process. All of the pods run the Rocker-based Docker container.</li>
<li>The RStudio Server main process and the workers use socket connections (via the R function <code>socketConnection()</code>) to communicate:
<ul>
<li>the worker processes start R processes that are instructed to regularly make a socket connection using a particular port on the main scheduler pod</li>
<li>when you run <code>future::plan()</code> (which calls <code>makeClusterPSOCK()</code>) in RStudio, the RStudio Server process attempts to make socket connections to the workers using that same port (see the sketch after this list)</li>
</ul></li>
<li>Once the socket connections are established, command of the RStudio session returns to you and you can run your future-based parallel R code.</li>
</ol>
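<p>For intuition, here is a rough sketch of the scheduler side of step 2. This is not the exact code the Helm chart or the <strong>future</strong> package runs, just an illustration of setting up a manual PSOCK cluster whose worker processes are already running:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(parallelly)
cl <- makeClusterPSOCK(
  workers = availableCores(),  ## one connection expected per worker pod
  manual = TRUE,               ## the worker R processes are already running; don't launch them
  quiet = TRUE
)
future::plan(future::cluster, workers = cl)</code></pre></div>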
<p>One thing I haven't had time to work through is how to easily scale the number of workers after the Kubernetes cluster is running and the Helm chart installed, or even how to auto-scale -- starting up workers as needed based on the number of workers requested via <code>plan()</code>.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>If you're interested in extending or improving this or collaborating in some fashion, please feel free to get in touch with me via the <a href="https://github.com/paciorek/future-kubernetes/issues">'future-kubernetes' issue tracker</a> or by email.</p>
<p>And if you're interested in using R with Kubernetes, note that RStudio provides an integration of RStudio Server Pro with Kubernetes that should allow one to run future-based workflows in parallel.</p>
<p>/Chris</p>
<h2 id="links">Links</h2>
<ul>
<li>future-kubernetes repository:
<ul>
<li>GitHub page: https://github.com/paciorek/future-kubernetes</li>
</ul></li>
<li>future-kubernetes Helm chart:
<ul>
<li>GitHub page: https://github.com/paciorek/future-helm-chart</li>
</ul></li>
<li>future-kubernetes Docker container:
<ul>
<li>GitHub page: https://github.com/paciorek/future-kubernetes-docker</li>
</ul></li>
<li>future package:
<ul>
<li>CRAN page: https://cran.r-project.org/package=future</li>
<li>GitHub page: https://github.com/HenrikBengtsson/future</li>
</ul></li>
</ul>