Test kubernetes in CI #3482

Draft: wants to merge 76 commits into base: master

Commits (76)
8e3d7cf
Added options to set annotations and a service account in the Kuberne…
shishichen Jun 7, 2024
45269ed
Correct punctuation in debug message. hack out tests that won't fail-…
benclifford Jun 7, 2024
7ceec7a
Fix a couple of docstrings
benclifford Jun 7, 2024
0cdece2
a bit of name sanitization for default pod names
benclifford Jun 7, 2024
87d3454
fiddle with markings to deal with no shared fs and no staging
benclifford Jun 8, 2024
954bad7
add config file i've been using
benclifford Jun 10, 2024
d3e3828
Merge remote-tracking branch 'shishichen/add-k8s-pod-options'
benclifford Jun 10, 2024
17a00dd
add the dockerfile i've been using
benclifford Jun 10, 2024
62e0e36
beginning of kubernetes-in-CI
benclifford Jun 10, 2024
9086e19
push docker image? upgrade ubuntu
benclifford Jun 10, 2024
26869e6
fiddle with default name
benclifford Jun 10, 2024
16c0a49
Add kubernetes, needed for submitting from inside a cluster
benclifford Jun 10, 2024
b03615a
Add more bits for running everything in a kubernetes cluster
benclifford Jun 10, 2024
e122d19
fix syntax error in github workflow definition
benclifford Jun 10, 2024
ee14f6e
Tighten timeout, add some debugging info at the end
benclifford Jun 10, 2024
120cf78
Correct pod name from my test
benclifford Jun 10, 2024
5c55fe6
try to stop Job from recreating pod on failure, but instead abort fast
benclifford Jun 10, 2024
b56dfd9
Randomise test order to see if a test failure is specific to a partic…
benclifford Jun 10, 2024
dd0f66c
Merge branch 'master' into benc-k8s-kind-ci
benclifford Jun 10, 2024
f8f5a27
Add some memory logging
benclifford Jun 10, 2024
f4a7300
Allocate more memory to workers
benclifford Jun 10, 2024
ffdb021
Add a staging_required marker that apparently wasn't breaking things …
benclifford Jun 10, 2024
21711e9
messing with backoff limits and restart policy
benclifford Jun 10, 2024
fb1733e
remove apparently invalid restart policy
benclifford Jun 10, 2024
73b3e1d
Flush out some more staging_required tests (by setting storage_access…
benclifford Jun 10, 2024
31bc958
Switch the Kubernetes client call to read_namespaced_pod_status() to …
shishichen Jun 12, 2024
8b39024
Fixed Kubernetes worker container launch command to remove trailing s…
shishichen Jun 13, 2024
4538763
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Jun 14, 2024
5f43aeb
Merge remote-tracking branch 'shishichen/fix-k8s-launch-cmd' into ben…
benclifford Jun 14, 2024
299de99
Merge remote-tracking branch 'shishichen/swap-k8s-pod-status' into be…
benclifford Jun 14, 2024
68e3a5d
Merge branch 'master' into benc-k8s-kind-ci
benclifford Jun 18, 2024
fe3c55e
Merge branch 'master' into benc-k8s-kind-ci
benclifford Jun 24, 2024
064b833
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Jul 2, 2024
9c6a04e
Merge remote-tracking branch 'origin/benc-k8s-kind-ci' into benc-k8s-…
benclifford Jul 2, 2024
ba5f047
Merge branch 'master' into benc-k8s-kind-ci
benclifford Jul 2, 2024
69fbf03
Merge branch 'master' into benc-k8s-kind-ci
benclifford Jul 7, 2024
75b7c02
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Jul 31, 2024
780fbb0
Merge remote-tracking branch 'origin/benc-k8s-kind-ci' into benc-k8s-…
benclifford Jul 31, 2024
2324744
Merge branch 'master' into benc-k8s-kind-ci
benclifford Aug 5, 2024
b75a3ae
function data in temp
colinthomas-z80 Aug 19, 2024
2c18d6c
use getpass for username
colinthomas-z80 Aug 19, 2024
c201ec1
use tempfile module
colinthomas-z80 Aug 20, 2024
9f6b037
flake etc
colinthomas-z80 Aug 20, 2024
5ec7cdb
Merge branch 'master' into tmp_function_data
benclifford Aug 21, 2024
c3f6d45
Merge branch 'master' into benc-k8s-kind-ci
benclifford Aug 22, 2024
edf870f
Merge branch 'master' into benc-k8s-kind-ci
benclifford Aug 26, 2024
811b8e5
Merge branch 'master' into benc-k8s-kind-ci
benclifford Sep 3, 2024
7347f64
Merge remote-tracking branch 'refs/remotes/origin/master' into benc-k…
benclifford Sep 4, 2024
cd7229f
Merge branch 'master' into tmp_function_data
benclifford Sep 4, 2024
08f8ce9
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Sep 5, 2024
5967f01
Merge remote-tracking branch 'refs/remotes/origin/benc-k8s-kind-ci' i…
benclifford Sep 5, 2024
4938dbf
Build cctools and run a probably-broken taskvine vs kubernetes test c…
benclifford Sep 5, 2024
98d7693
fix repr in taskvine
benclifford Sep 5, 2024
dfc94a8
install cloudpickle explicitly for taskvine
benclifford Sep 5, 2024
47378f3
Add more time onto job timeout, because more is happening in job with…
benclifford Sep 5, 2024
43af8ef
revert to 180s test time
benclifford Sep 5, 2024
6a32f0f
Log more to the console, kubernetes style
benclifford Sep 5, 2024
d4fab6a
Note a (documentation?) bug in taskvine address selection
benclifford Sep 5, 2024
21dcae6
force hostname based address config, in line with comment in previous…
benclifford Sep 5, 2024
e1cce03
now we're starting taskvine test successfully, give it time to complete
benclifford Sep 5, 2024
4d4b4ba
Make taskvine shutdown scale-in more like htex shutdown scale-in
benclifford Sep 5, 2024
2e42e5c
enable staging_required tests in taskvine, because taskvine might be …
benclifford Sep 5, 2024
3ba7e12
Output timestamps in kubernetes log to help diagnose hangs
benclifford Sep 5, 2024
084d797
failed to get non-staging tests working, made a note in comments
benclifford Sep 5, 2024
f34f2b8
correct duplicated 'and' in pytest -k option
benclifford Sep 5, 2024
60a8611
Merge remote-tracking branch 'colinthomas-z80/tmp_function_data' into…
benclifford Sep 6, 2024
1f09e5c
Add utils to sanitize strings for DNS compliance
rjmello Oct 14, 2024
83278c2
Ensure k8s pod names/labels are RFC 1123 compliant
rjmello Oct 15, 2024
86ade32
Use hex value for k8s job ID instead of pod name
rjmello Oct 15, 2024
0c4d541
Add tests for KubernetesProvider submit
rjmello Oct 17, 2024
08693ab
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Oct 21, 2024
415f780
Merge remote-tracking branch 'origin/rjmello-kube-pod-names' into ben…
benclifford Oct 21, 2024
c78defa
Fix some bad merge
benclifford Oct 21, 2024
54ea143
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Oct 21, 2024
535289f
Merge remote-tracking branch 'origin/master' into benc-k8s-kind-ci
benclifford Oct 31, 2024
fd26ddd
Merge branch 'master' into benc-k8s-kind-ci
benclifford Nov 1, 2024
53 changes: 53 additions & 0 deletions .github/workflows/ci-k8s.yaml
@@ -0,0 +1,53 @@
name: Parsl

on:
  pull_request:
    types:
      - opened
      - synchronize

jobs:
  k8s-kind-suite:
    runs-on: ubuntu-24.04
    timeout-minutes: 60

    steps:
      - uses: actions/checkout@master

      - name: Create k8s Kind Cluster
        uses: helm/kind-action@v1
        with:
          # kind tooling uses this name by default, but kind-action uses
          # a different default name
          cluster_name: kind

      - name: Build docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          tags: parsl:ci

      - name: Push docker image into kubernetes cluster
        run: |
          kind load docker-image parsl:ci

      - name: set liberal permissions
        run: |
          kubectl create clusterrolebinding serviceaccounts-cluster-admin --clusterrole=cluster-admin --group=system:serviceaccounts

      - name: launch pytest Job
        run: |
          free -h
          kubectl create -f ./pytest-task.yaml

      - name: wait for pytest Job
        run: |
          kubectl wait --timeout=600s --for=condition=Complete Job pytest

      - name: report some info
        if: ${{ always() }}
        run: |
          free -h
          kubectl describe pods
          kubectl describe jobs
          kubectl logs --timestamps Job/pytest
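
For context, the same sequence can be driven by hand; a minimal sketch, assuming Docker, kind and kubectl are installed locally and commands are run from the repository root (cluster name, image tag and Job name are the ones used in the workflow above):

kind create cluster --name kind                 # same default cluster name the workflow requests
docker build -t parsl:ci .                      # build the image from this PR's Dockerfile
kind load docker-image parsl:ci                 # copy the image onto the kind node(s)
kubectl create clusterrolebinding serviceaccounts-cluster-admin \
    --clusterrole=cluster-admin --group=system:serviceaccounts
kubectl create -f ./pytest-task.yaml            # launch the pytest Job
kubectl wait --timeout=600s --for=condition=Complete job/pytest
kubectl logs --timestamps job/pytest            # test output once the Job finishes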
40 changes: 40 additions & 0 deletions Dockerfile
@@ -0,0 +1,40 @@
FROM debian:trixie

RUN apt-get update && apt-get upgrade -y

RUN apt-get update && apt-get install -y sudo openssh-server

RUN apt-get update && apt-get install -y curl less vim

# git is needed for parsl to figure out its own repo-specific
# version string
RUN apt-get update && apt-get install -y git

# useful stuff to have around
RUN apt-get update && apt-get install -y procps

# for building documentation
RUN apt-get update && apt-get install -y pandoc

# for monitoring visualization
RUN apt-get update && apt-get install -y graphviz wget

# for commandline access to monitoring database
RUN apt-get update && apt-get install -y sqlite3

RUN apt-get update && apt-get install -y python3.12 python3.12-dev
RUN apt-get update && apt-get install -y python3.12-venv
Comment on lines +25 to +26 (Member):
I suggest we make the Python version configurable.

E.g.,

Suggested change (replacing the two python3.12 install lines above):

ARG PYTHON_VERSION="3.12"
RUN apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev
RUN apt-get install -y python${PYTHON_VERSION}-venv

Reply (Collaborator, PR author):
This is a bit weird (something to do with how Python is packaged in trixie?). Normally in Debian there is a single OS-level python3 available (which changes when there's a new code-named release), and it seems unusual that trixie happens to have two. Debian certainly isn't traditionally set up to let you choose a Python version from the OS.

There are a couple of things that could happen: i) always use the OS-level default python3, or ii) use something like Conda to provide a much richer Python environment. Some dependencies, like ndcctools, are recommended to be installed with conda anyway, so maybe that's the way to go here. I don't think there's any particular reason to stick with the OS-level Python, as this is "an image where Parsl works" rather than "an image that looks like a particular Debian version".


RUN apt-get update && apt-get install -y gcc build-essential make pkg-config mpich

RUN python3.12 -m venv /venv

ADD . /parsl
WORKDIR /
RUN git clone https://github.com/cooperative-computing-lab/cctools
WORKDIR /cctools
RUN . /venv/bin/activate && apt-get install -y swig && ./configure --prefix=/ && make && make install

WORKDIR /parsl
RUN . /venv/bin/activate && pip3 install '.[kubernetes]' cloudpickle -r test-requirements.txt

26 changes: 26 additions & 0 deletions htex_k8s_kind.py
@@ -0,0 +1,26 @@
from parsl.channels import LocalChannel
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import KubernetesProvider


def fresh_config():
    return Config(
        executors=[
            HighThroughputExecutor(
                label="executorname",
                storage_access=[],
                worker_debug=True,
                cores_per_worker=1,
                encrypted=False,  # needs certificate fs to be mounted in same place...
                provider=KubernetesProvider(
                    worker_init=". /venv/bin/activate",
                    # pod_name="override-pod-name",  # can't use default name because of dots, without own bugfix
                    image="parsl:ci",
                    max_mem="2048Gi"  # was getting OOM-killing of workers with default... maybe this will help.
                ),
            )
        ],
        strategy='none',
    )
19 changes: 14 additions & 5 deletions parsl/executors/taskvine/executor.py
@@ -110,6 +110,8 @@ def __init__(self,
                 storage_access: Optional[List[Staging]] = None):

        # Set worker launch option for this executor
        # This is to make repr work - otherwise it raises an attribute error
        self.worker_launch_method = worker_launch_method
        if worker_launch_method == 'factory' or worker_launch_method == 'manual':
            provider = None

@@ -582,11 +584,18 @@ def shutdown(self, *args, **kwargs):
        logger.debug("TaskVine shutdown started")
        self._should_stop.set()

        # Remove the workers that are still going
        kill_ids = [self.blocks_to_job_id[block] for block in self.blocks_to_job_id.keys()]
        if self.provider:
            logger.debug("Cancelling blocks")
            self.provider.cancel(kill_ids)
        # BENC: removed this bit because the scaling code does this,
        # and the kubernetes provider fails trying to scale in blocks
        # that have already been deleted by the scaling code shutdown.
        # This code assumes it can enumerate all blocks by using the
        # blocks_to_job_id structure, but there's a not-very-strong
        # principle that: you should not enumerate blocks_to_job_id
        # (or the job to block map) and you should not call cancel on
        # already cancelled blocks.
        # kill_ids = [self.blocks_to_job_id[block] for block in self.blocks_to_job_id.keys()]
        # if self.provider:
        #     logger.debug("Cancelling blocks")
        #     self.provider.cancel(kill_ids)

        # Join all processes before exiting
        logger.debug("Joining on submit process")
1 change: 1 addition & 0 deletions parsl/executors/taskvine/manager_config.py
@@ -25,6 +25,7 @@ class TaskVineManagerConfig:
    address: Optional[str]
        Address of the local machine.
        If None, socket.gethostname() will be used to determine the address.
        XXXX ^ if None, it looks like get_any_address() is used instead, and in my kubernetes setup that chooses 127.0.0.1

    project_name: Optional[str]
        If given, TaskVine will periodically report its status and performance
1 change: 1 addition & 0 deletions parsl/providers/kubernetes/kube.py
@@ -189,6 +189,7 @@ def submit(self, cmd_string: str, tasks_per_node: int, job_name: str = "parsl.ku
                                           worker_init=self.worker_init)

        logger.debug("Pod name: %s", pod_name)

        self._create_pod(image=self.image,
                         pod_name=pod_name,
                         job_id=job_id,
15 changes: 15 additions & 0 deletions pytest-task.yaml
@@ -0,0 +1,15 @@
apiVersion: batch/v1
kind: Job
metadata:
  name: pytest
spec:
  activeDeadlineSeconds: 600
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pytest
          image: parsl:ci
          command: ["bash", "runme.sh"]

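If the Job fails or exceeds its deadline, a few standard kubectl commands help narrow things down; a sketch, relying on the job-name label that the Job controller attaches to the pods it creates:

kubectl get pods -l job-name=pytest        # list the pod(s) the Job created
kubectl describe job pytest                # check backoffLimit / activeDeadlineSeconds status
kubectl logs -f --timestamps job/pytest    # follow the test run live
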
15 changes: 15 additions & 0 deletions runme.sh
@@ -0,0 +1,15 @@
#!/bin/bash -e

source /venv/bin/activate

pytest parsl/tests/ --config ./htex_k8s_kind.py -k 'not issue3328 and not staging_required and not shared_fs' -x --random-order


# I tried letting staging_required tests run here but they do not pass -- a bit confused
# about this comment in taskvine:
#
#     # Absolute paths are assumed to be in shared filesystem, and thus
#     # not staged by taskvine.
#
# which I guess is saying that something assumes a shared filesystem is present even when
# the taskvine config defaults to shared_fs=False?


PYTHONPATH=/usr/lib/python3.12/site-packages/ pytest parsl/tests/ --config ./taskvine_k8s_kind.py -k 'not issue3328 and not staging_required and not shared_fs' -x --random-order --log-cli-level=DEBUG
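
# When a failure only shows up under --random-order, one way to chase it is to replay the
# same shuffle; a sketch, assuming the pytest-random-order plugin that provides the flag
# above also accepts a seed (123456 is a placeholder for a seed reported in a failing run):
#
#   pytest parsl/tests/ --config ./htex_k8s_kind.py \
#       -k 'not issue3328 and not staging_required and not shared_fs' \
#       -x --random-order --random-order-seed=123456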
19 changes: 19 additions & 0 deletions taskvine_k8s_kind.py
@@ -0,0 +1,19 @@
from parsl.channels import LocalChannel
from parsl.config import Config
from parsl.launchers import SimpleLauncher
from parsl.providers import KubernetesProvider
from parsl.addresses import address_by_hostname

from parsl.executors.taskvine import TaskVineExecutor, TaskVineManagerConfig


def fresh_config():
    return Config(
        executors=[
            TaskVineExecutor(
                manager_config=TaskVineManagerConfig(address=address_by_hostname(), port=9000),
                worker_launch_method='provider',
                provider=KubernetesProvider(
                    worker_init=". /venv/bin/activate",
                    # pod_name="override-pod-name",  # can't use default name because of dots, without own bugfix
                    image="parsl:ci",
                    max_mem="2048Gi"  # was getting OOM-killing of workers with default... maybe this will help.
                ),
            )
        ]
    )