[Draft] temporary changes, do not merge #51

Closed
wants to merge 64 commits into from

Conversation

henrirosten
Collaborator

No description provided.

flokli added 30 commits January 3, 2024 12:31
This is used by the nix_build.sh script used to build images with
terraform.

Signed-off-by: Florian Klink <[email protected]>
This introduces a terraform module that can be used to nix-build and
upload VM images to Azure.

nix-build.sh originates from https://cs.tvl.fyi/depot/-/blob/ops/terraform/deploy-nixos/nixos-eval.sh,
which is why it inherits its copyright from there.

Signed-off-by: Florian Klink <[email protected]>
This groups together some common resources needed to create a VM. We
might introduce more flexibility at a later point.

Signed-off-by: Florian Klink <[email protected]>
We can just include azure-config.nix from nixpkgs. It pulls in
azure-common.nix, which contains all necessary kernel config / udev rules.

It also defines a `config.system.azureImage` attribute, which builds
a vhd that we can import into Azure using the `azurerm-nix-vm-image`
terraform module.

These images can be referred to from source_image_id in Terraform
(using azurerm-linux-vm for example), which allows booting the desired
machine config out of the box, without a two-stage deploy.

Signed-off-by: Florian Klink <[email protected]>
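
As an illustration of the import described in the commit above, a minimal sketch could look like the following. It assumes the machine configuration is an ordinary NixOS module; the file layout is an assumption and is not taken from the PR.

```nix
{ modulesPath, ... }:
{
  # azure-config.nix ships with nixpkgs and itself imports azure-common.nix,
  # so a single import is enough here.
  imports = [ "${modulesPath}/virtualisation/azure-config.nix" ];
}
```
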
This allows injecting custom userdata to the VM at instance creation
time, which we can use to provision some config (like SSH pubkey config)
that's not part of the NixOS image.

Signed-off-by: Florian Klink <[email protected]>
azure-common.nix already sets
services.openssh.settings.{PermitRootLogin,ClientAliveInterval},
so we need to decide what wins.

To keep the intended behaviour, we want to mkForce PermitRootLogin to
"no" (azure-common.nix sets "prohibit-password"), and set the
ClientAliveInterval with mkDefault - bumping that timeout probably makes
sense for azure, and we don't want the setting in this file to take
priority.

Signed-off-by: Florian Klink <[email protected]>
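
The priority handling described in the commit above can be sketched roughly as follows. The interval value is a placeholder chosen for illustration; only the use of mkForce and mkDefault comes from the commit message.

```nix
{ lib, ... }:
{
  services.openssh.settings = {
    # mkForce wins over the plain "prohibit-password" assignment
    # made by azure-common.nix.
    PermitRootLogin = lib.mkForce "no";
    # mkDefault loses against a plain assignment, so the value from
    # azure-common.nix takes priority. 60 is a placeholder.
    ClientAliveInterval = lib.mkDefault 60;
  };
}
```
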
This file contains all ssh public keys used by real humans.
It's parsed from Terraform to inject into instance metadata.

Signed-off-by: Florian Klink <[email protected]>
This builds the jenkins-master Nix image, turns it into a bootable Azure
image, and then boots an instance with the image.

Signed-off-by: Florian Klink <[email protected]>
That way, the VM survives reboots - the non-networkd configuration seems
to be quite brittle.

Signed-off-by: Florian Klink <[email protected]>
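
Assuming the commit above switches the machine to systemd-networkd (the commit title is not visible on this page), the NixOS side of that change might be as small as this sketch:

```nix
{
  # Let systemd-networkd manage the network interfaces.
  # That this is the option the PR sets is an assumption.
  networking.useNetworkd = true;
}
```
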
Ideally, we'd keep systemd-resolved disabled too, but the way nixpkgs
configures cloud-init prevents it from picking up DNS settings from
elsewhere.

Signed-off-by: Florian Klink <[email protected]>
Move the azure-specific config snippet into its own file, so we can
import it from multiple configuration.nix files.

azure-common.nix is already used for the existing machine
configurations, and as we don't want to break these, it's using this
transient name.

Signed-off-by: Florian Klink <[email protected]>
This gives each VM a system-assigned identity, and exposes the principal
ID as a module output, so that access to certain resources can be granted
to it.

Signed-off-by: Florian Klink <[email protected]>
This exposes a read-only HTTP webserver for the contents in the storage
container.
`rclone serve http` takes care of exposing the storage container over
HTTP.

We disallow listing (by only allowing access to certain paths), and
expose it over HTTP(S) with auto-ssl via caddy.

This will work with whatever domain we route to it, so it's not part of
the configuration.

Signed-off-by: Florian Klink <[email protected]>
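
The setup described in the commit above could be sketched along these lines. The unit name, port, rclone remote, domain and the list of allowed paths are all assumptions made for illustration; only `rclone serve http` and caddy come from the commit message.

```nix
{ pkgs, ... }:
{
  # Hypothetical unit exposing the storage container read-only on localhost.
  # "example-remote:binary-cache" is a placeholder rclone remote.
  systemd.services.rclone-http = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig.ExecStart =
      "${pkgs.rclone}/bin/rclone serve http --read-only "
      + "--addr 127.0.0.1:8080 example-remote:binary-cache";
  };

  services.caddy = {
    enable = true;
    # Placeholder domain; the commit deliberately keeps the domain
    # out of the configuration.
    virtualHosts."cache.example.org".extraConfig = ''
      # Proxy only the paths a Nix binary cache client requests
      # (an assumed list), so listings are never reachable.
      @cache path /nix-cache-info /*.narinfo /nar/*
      handle @cache {
        reverse_proxy 127.0.0.1:8080
      }
      handle {
        respond 404
      }
    '';
  };
}
```
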
This works around NixOS/nixpkgs#272532, we
can revert this once NixOS/nixpkgs#272617 has
landed here.

Signed-off-by: Florian Klink <[email protected]>
We don't want to blindly issue certs for all domains, but make this configurable.

This should be config coming from the environment, via cloud-init.

Signed-off-by: Florian Klink <[email protected]>
Define this for each machine outside the VM, and describe everything in
a single security group.
Attaching multiple security groups caused confusing duplicate errors;
this might be a bug in the Terraform Azure provider.

Signed-off-by: Florian Klink <[email protected]>
This adds filesystem-related tools to the $PATH of cloud-init, so it
can format disks with its disk_setup module (and the fs_setup config key).

This will be used to format data volumes attached to VMs.

Signed-off-by: Florian Klink <[email protected]>
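
One way to extend the $PATH of cloud-init as described in the commit above is sketched below. Which unit and which packages the PR actually touches is an assumption.

```nix
{ pkgs, ... }:
{
  # Tools that the disk_setup / fs_setup steps shell out to, for example
  # mkfs.ext4 (e2fsprogs) and sfdisk, lsblk, blkid (util-linux).
  systemd.services.cloud-init.path = [
    pkgs.e2fsprogs
    pkgs.util-linux
  ];
}
```
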
We need to use cloud-init to format and mount data volumes in Azure; we
can't use systemd for it.

Due to
hashicorp/terraform-provider-azurerm#6117,
disks in Azure get attached late at boot, so any dev-disk-by-….device
units created via systemd-fstab-generator might not exist yet at the
time the graph for multi-user.target is created, causing systemd to fail
starting downstream services due to a missing dependency.

Once the volume is attached, the .device unit pops up via udev, and then
a manual restart of services depending on data disks would work, but
it's messy.

Letting cloud-init take care of data disk mounting (and formatting) is
the right choice; that way systemd doesn't need to do any dependency
tracking of it.

Signed-off-by: Florian Klink <[email protected]>
This adds the ghafbinarycache storage account, and a binary-cache-v1
storage container inside of it.

It's used to serve artifacts from (via the binary-cache VM), and Nix
build artifacts are also uploaded to it.

Signed-off-by: Florian Klink <[email protected]>
This deploys the VM defined at binary-cache.

Attaching the data disks is still a bit messy (requires one reboot, or
manual reverse proxy restart).
Fixing this requires some more debugging.

Signed-off-by: Florian Klink <[email protected]>
The service-binary-cache module is all the specific hosts need.

Signed-off-by: Florian Klink <[email protected]>
Otherwise, cloud-init.service might still be running while we start up
services expecting the mount to happen.

Signed-off-by: Florian Klink <[email protected]>
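
The ordering described in the commit above might be expressed as in this sketch. "caddy" stands in for whichever service expects the mount; the commit message does not name the affected units.

```nix
{
  systemd.services.caddy = {
    # Start only once cloud-init.service has finished,
    # so the data volume is already mounted.
    after = [ "cloud-init.service" ];
    requires = [ "cloud-init.service" ];
  };
}
```
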
Configure the domain and storage account name with cloud-init.

This allows keeping the same NixOS image across multiple deployments of
this image, serving another bucket at another domain.

Also, switch to listening on port 443 only; caddy can use the
TLS-ALPN-01 challenge just fine.

Signed-off-by: Florian Klink <[email protected]>
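
To illustrate how one image can serve different domains, as described in the commit above, here is a sketch that reads the domain from caddy's environment. The variable name SITE_ADDRESS, the environment file path and the upstream address are assumptions; the file is presumed to be written by cloud-init.

```nix
{ pkgs, ... }:
{
  services.caddy = {
    enable = true;
    configFile = pkgs.writeText "Caddyfile" ''
      # {$SITE_ADDRESS} is substituted by caddy from its environment
      # when the config is loaded.
      https://{$SITE_ADDRESS} {
        reverse_proxy 127.0.0.1:8080
      }
    '';
  };

  # Hypothetical file containing a line like SITE_ADDRESS=cache.example.org.
  systemd.services.caddy.serviceConfig.EnvironmentFile = "/var/lib/caddy/caddy.env";
}
```
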
This should use tls-alpn-01 on port 443 just fine.

Signed-off-by: Florian Klink <[email protected]>
Apparently canonical/cloud-init#4673 and more
hacks are not needed, we can simply ramp up the timeout that systemd is
willing to wait for the .device unit to appear.

Signed-off-by: Florian Klink <[email protected]>
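
A longer wait for the .device unit, as described in the commit above, can be requested per mount with the `x-systemd.device-timeout=` option. The following is a sketch under assumptions: the mount point, device path and timeout value are placeholders, and whether the PR uses exactly this option is not visible on this page.

```nix
{
  fileSystems."/var/lib/caddy" = {
    # Hypothetical device path for the late-attached data disk.
    device = "/dev/disk/by-label/data";
    fsType = "ext4";
    # Wait up to five minutes for the device to show up before failing.
    options = [ "x-systemd.device-timeout=5min" ];
  };
}
```
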
flokli and others added 27 commits January 3, 2024 12:31
This adds an additional "remote-build" ssh user.

The Jenkins controller will use this as user to do remote Nix builds.

Signed-off-by: Florian Klink <[email protected]>
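
A user like the one described in the commit above could be declared as in this sketch. The public key is a placeholder, and marking the user as trusted is an assumption about what remote builds need here.

```nix
{
  users.users.remote-build = {
    isNormalUser = true;
    # Placeholder key; the real public key is not shown on this page.
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAA...placeholder remote-build"
    ];
  };

  # Allows this user to send store paths to the builder's Nix daemon.
  nix.settings.trusted-users = [ "remote-build" ];
}
```
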
Signed-off-by: Florian Klink <[email protected]>
This adds an allocate_public_ip boolean variable (defaulting to false),
and will only create a public IP if it's set to true.

Signed-off-by: Florian Klink <[email protected]>
This deploys two builders in a new subnet.

Signed-off-by: Florian Klink <[email protected]>
This creates an azure key vault and adds the private key as a secret
into there, then grants the jenkins-controller VM access to read that
secret.

Signed-off-by: Florian Klink <[email protected]>
Use the common group, instead of the current client object id.

Signed-off-by: Florian Klink <[email protected]>
This adds a fetch-build-ssh-key systemd service that fetches the ssh
private key into /etc/secrets/remote-build-ssh-key (owned by root),
and orders itself before nix-daemon.

Signed-off-by: Florian Klink <[email protected]>
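
The service described in the commit above might look roughly like this sketch. The vault name, the secret name and the use of the Azure CLI with the VM's managed identity are assumptions; only the unit name, the target path and the ordering come from the commit message.

```nix
{ pkgs, ... }:
{
  systemd.services.fetch-build-ssh-key = {
    wantedBy = [ "multi-user.target" ];
    # The key must be in place before the Nix daemon can use it.
    before = [ "nix-daemon.service" ];
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    path = [ pkgs.azure-cli ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };
    script = ''
      # Keep the key readable by root only.
      umask 077
      mkdir -p /etc/secrets
      # Authenticate with the VM's managed identity.
      az login --identity
      # "example-vault" and the secret name are placeholders.
      az keyvault secret show \
        --vault-name example-vault \
        --name remote-build-ssh-key \
        --query value --output tsv \
        > /etc/secrets/remote-build-ssh-key
    '';
  };
}
```
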
Render /etc/nix/machines with terraform. In the future, we might want to
autodiscover this, or better, have agents register with the controller,
rather than having to recreate the VM whenever the list of builders is
changed.

Signed-off-by: Florian Klink <[email protected]>
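
The commit above renders /etc/nix/machines from Terraform. For comparison only, the same kind of entry can be written with the NixOS `nix.buildMachines` option, which is a different technique from the one in the PR. The IP address and job count below are placeholders.

```nix
{
  nix.distributedBuilds = true;
  nix.buildMachines = [
    {
      hostName = "10.0.4.4"; # placeholder builder address
      sshUser = "remote-build";
      sshKey = "/etc/secrets/remote-build-ssh-key";
      system = "x86_64-linux";
      maxJobs = 8; # placeholder
    }
  ];
}
```

Because this list is fixed at build time, it shares the limitation the commit mentions: changing the set of builders means rebuilding the configuration.
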
This creates a Nix signing key, and uses terraform-provider-secret to
hold it in the terraform state.

It's then uploaded into an Azure key vault.

The jenkins-controller VM has access to it, and puts it at
/etc/secrets/nix-signing-key.

A post-build-hook is configured, uploading every build to the binary
cache bucket, with the signature.

Signed-off-by: Florian Klink <[email protected]>
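
A signing key plus post-build-hook, as described in the commit above, could be wired up as in the following sketch. The destination URL is a placeholder for the real cache endpoint, and uploading with `nix copy` is an assumption about how the PR does it.

```nix
{ config, pkgs, ... }:
{
  nix.settings = {
    # Locally built paths are signed with this key.
    secret-key-files = [ "/etc/secrets/nix-signing-key" ];

    post-build-hook = pkgs.writeShellScript "upload-to-cache" ''
      set -eu
      # Nix passes the freshly built store paths in $OUT_PATHS.
      # The target URL below is a placeholder.
      exec ${config.nix.package}/bin/nix \
        --extra-experimental-features nix-command \
        copy --to "http://localhost:8080" $OUT_PATHS
    '';
  };
}
```
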
There's no need for any user to ssh into builders, this can be dropped.

Signed-off-by: Florian Klink <[email protected]>
This consumes a list of IPs to ssh-keyscan once, on startup.

In the future, we might want to add support for dynamic discovery, as
well as for additional (longer-lived) static hosts.

Signed-off-by: Florian Klink <[email protected]>
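
A one-shot scan at startup, as described in the commit above, might be sketched like this. The unit name, the input file with one IP per line, and the known_hosts location are hypothetical.

```nix
{ pkgs, ... }:
{
  systemd.services.populate-known-hosts = {
    wantedBy = [ "multi-user.target" ];
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    path = [ pkgs.openssh ];
    serviceConfig.Type = "oneshot";
    script = ''
      umask 077
      mkdir -p /root/.ssh
      # -f reads the hosts to scan from a file, one per line.
      ssh-keyscan -f /var/lib/builder-ips > /root/.ssh/known_hosts
    '';
  };
}
```
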
Prevent the repo and nixpkgs linter from fighting each other about
formatting.

Signed-off-by: Florian Klink <[email protected]>
This describes the current concepts and components in this PR with more
prose.

It also describes some of the known issues / compromises.
@henrirosten henrirosten deleted the copy-of-azure-images branch January 31, 2024 05:10
2 participants