Move cookiecutter Tofu to new site environment #751

Merged
merged 7 commits into from
Aug 7, 2025
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ Run the following from the repository root to activate the venv:
Use the `cookiecutter` template to create a new environment to hold your configuration:

cd environments
cookiecutter skeleton
cookiecutter ../cookiecutter

and follow the prompts to complete the environment name and description.
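As a rough sketch of that interaction (the exact prompt names come from the template's `cookiecutter.json`, so treat these as illustrative):

```
$ cd environments
$ cookiecutter ../cookiecutter
environment [myenv]: staging
description [...]: Staging environment for the mysite cluster
```

This creates `environments/staging/` containing the templated `ansible.cfg` and `tofu/main.tf` shown later in this diff.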

5 changes: 1 addition & 4 deletions ansible/roles/alertmanager/README.md
@@ -11,12 +11,9 @@ Note that:
- No Grafana dashboard for alerts is currently provided.

Alertmanager is enabled by default on the `control` node in the
[everything](../../../environments/common/layouts/everything) template which
`cookiecutter` uses for a new environment's `inventory/groups` file.
`site` environment's `inventory/groups` file.

In general usage may only require:
- Adding the `control` node into the `alertmanager` group in `environments/site/groups`
if upgrading an existing environment.
- Enabling the Slack integration (see section below).
- Possibly setting `alertmanager_web_external_url`.
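As a sketch of the first point above (the exact path to your site groups file may differ), this amounts to an inventory fragment like:

```ini
# environments/site/inventory/groups -- illustrative fragment
[alertmanager:children]
control
```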

2 changes: 1 addition & 1 deletion ansible/roles/block_devices/README.md
@@ -11,7 +11,7 @@ This is a convenience wrapper around the ansible modules:

To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.

**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/tofu/control.userdata.tpl`.
**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/site/tofu/control.userdata.tpl`.

[^1]: See `environments/common/inventory/group_vars/builder/defaults.yml`

4 changes: 2 additions & 2 deletions ansible/roles/freeipa/README.md
@@ -7,7 +7,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s

## Usage
- Add hosts to the `freeipa_client` group and run (at a minimum) the `ansible/iam.yml` playbook.
- Host names must match the domain name. By default (using the skeleton OpenTofu) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are OpenTofu variables.
- Host names must match the domain name. By default (using the site OpenTofu) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are OpenTofu variables.
- Hosts discover the FreeIPA server FQDN (and their own domain) from DNS records. If DNS servers are not set from DHCP, use the `resolv_conf` role to configure this. For example when using the in-appliance FreeIPA development server:

```ini
@@ -28,7 +28,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s
- For production use with an external FreeIPA server, a random one-time password (OTP) must be generated when adding hosts to FreeIPA (e.g. using `ipa host-add --random ...`). This password should be set as a hostvar `freeipa_host_password`. Initial host enrolment will use this OTP to enrol the host. After this it becomes irrelevant so it does not need to be committed to git. This approach means the appliance does not require the FreeIPA administrator password.
- For development use with the in-appliance FreeIPA server, `freeipa_host_password` will be automatically generated in memory.
- The `control` host must define `appliances_state_dir` (on persistent storage). This is used to back-up keytabs to allow FreeIPA clients to automatically re-enrol after e.g. reimaging. Note that:
- This is implemented when using the skeleton OpenTofu; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
- This is implemented when using the site OpenTofu; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
- Nodes are not re-enrolled by a [Slurm-driven reimage](../../collections/ansible_collections/stackhpc/slurm_openstack_tools/roles/rebuild/README.md) (as that does not run this role).
- If both a backed-up keytab and `freeipa_host_password` exist, the former is used.
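To illustrate the production OTP workflow described above (hostname and file layout are hypothetical):

```
# On the FreeIPA server: add the host and generate a one-time password
ipa host-add compute-0.mycluster.internal --random
# The output includes a line such as "Random password: a1b2c3..."
```

The reported password is then set as the `freeipa_host_password` hostvar for that node (e.g. under `environments/site/inventory/host_vars/`); after the initial enrolment it can be discarded.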

19 changes: 19 additions & 0 deletions cookiecutter/{{cookiecutter.environment}}/ansible.cfg
@@ -0,0 +1,19 @@
[defaults]
any_errors_fatal = True
stdout_callback = debug
stderr_callback = debug
gathering = smart
forks = 30
host_key_checking = False
inventory = ../common/inventory,../site/inventory,inventory
collections_path = ../../ansible/collections
roles_path = ../../ansible/roles
filter_plugins = ../../ansible/filter_plugins

[ssh_connection]
ssh_args = -o ServerAliveInterval=10 -o ControlMaster=auto -o ControlPath=~/.ssh/%r@%h-%p -o ControlPersist=240s -o PreferredAuthentications=publickey -o UserKnownHostsFile=/dev/null
pipelining = True

[inventory]
# Fail when any inventory source cannot be parsed.
any_unparsed_is_failed = True
21 changes: 21 additions & 0 deletions cookiecutter/{{cookiecutter.environment}}/tofu/main.tf
@@ -0,0 +1,21 @@
variable "environment_root" {
    type = string
    description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
    source = "../../site/tofu/"
    environment_root = var.environment_root

    # Environment specific variables
    # Note that some of the variables below may need to be moved to the site environment
    # defaults, e.g. cluster_networks should be in site if your staging and prod
    # environments use the same networks
    cluster_name =
    cluster_image_id =
    control_node_flavor =
    cluster_networks =
    key_pair =
    login =
    compute =
}
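To illustrate how the placeholders above might be completed for a child environment, here is a hypothetical staging configuration; all values (image ID, flavors, network and node names) are invented, and the exact structure of `cluster_networks`, `login` and `compute` should be checked against `environments/site/tofu/variables.tf`:

```
variable "environment_root" {
    type = string
    description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
    source           = "../../site/tofu/"
    environment_root = var.environment_root

    cluster_name        = "staging"
    cluster_image_id    = "30b2e962-xxxx-xxxx-xxxx-example"   # OpenStack image UUID
    control_node_flavor = "general.v1.small"
    key_pair            = "slurm-deploy-key"

    cluster_networks = [
        {
            network = "staging-net"
            subnet  = "staging-subnet"
        }
    ]

    login = {
        login = {
            nodes  = ["login-0"]
            flavor = "general.v1.small"
        }
    }

    compute = {
        general = {
            nodes  = ["general-0", "general-1"]
            flavor = "general.v1.medium"
        }
    }
}
```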
2 changes: 1 addition & 1 deletion docs/adding-functionality.md
@@ -3,7 +3,7 @@
Please contact us for specific advice, but this generally involves:
- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/site/inventory/groups`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.
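For a hypothetical `foo` role, the group additions described above might look like this (sketch only; the enabling group's members depend on the role):

```ini
# environments/common/inventory/groups -- empty group so the role is always defined
[foo]

# environments/site/inventory/groups -- non-empty example enabling the feature
[foo:children]
compute
```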
3 changes: 1 addition & 2 deletions docs/alerting.md
@@ -21,8 +21,7 @@ must be configured to generate notifications.
## Enabling alertmanager

1. Ensure both the `prometheus` and `alertmanager` servers are deployed on the
control node - for new environments the `cookiecutter` tool will have done
this:
control node - these are deployed by default in the site environment's groups:

```ini
# environments/site/groups:
3 changes: 1 addition & 2 deletions docs/experimental/isolated-clusters.md
@@ -6,8 +6,7 @@ access from all nodes, possibly via a [proxy](../../ansible/roles/proxy/).
However many features (as defined by Ansible inventory groups/roles) will work
if the cluster network(s) provide no outbound access. Currently this includes
all "default" features, i.e. roles/groups which are enabled either in the
`common` environment or in the `environments/$ENV/inventory/groups` file
created by cookiecutter for a new environment.
`common` or `site` environments.

The full list of features and whether they are functional on such an "isolated"
network is shown in the table below. Note that:
2 changes: 1 addition & 1 deletion docs/monitoring-and-logging.md
@@ -227,7 +227,7 @@ The `prometheus` group determines the placement of the prometheus service. Load

### Access

Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node, prometheus would then be accessible from:
Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/site/inventory/groups`, this will be set to the slurm `control` node, and Prometheus will then be accessible from:

> http://<control_node_ip>:9090

4 changes: 2 additions & 2 deletions docs/persistent-state.md
@@ -9,11 +9,11 @@ At present this will affect the following:
- Grafana data
- OpenDistro/elasticsearch data

If using the `environments/common/layout/everything` Ansible groups template (which is the default for a new cookiecutter-produced environment) then these services will all be on the `control` node and hence only this node requires persistent storage.
If using the upstream defaults in the `site` environment's `inventory/groups` file then these services will all be on the `control` node and hence only this node requires persistent storage.

Note that if `appliances_state_dir` is defined, the path it gives must exist and should be owned by root. Directories will be created within this with appropriate permissions for each item of state defined above. Additionally, the systemd units for the services listed above will be modified to require `appliances_state_dir` to be mounted before service start (via the `systemd` role).

A new cookiecutter-produced environment supports persistent state in the default OpenTofu (see `environments/skeleton/{{cookiecutter.environment}}/tofu/`) by:
The `site` environment supports persistent state in the default OpenTofu (see `environments/site/tofu/`) by:

- Defining a volume with a default size of 150GB - this can be controlled by the OpenTofu variable `state_volume_size`.
- Attaching it to the control node.
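As a sketch of overriding that default from a child environment's module block (the argument name is taken from the bullet above; the surrounding block is the one in `environments/$ENV/tofu/main.tf`):

```
module "cluster" {
    source           = "../../site/tofu/"
    environment_root = var.environment_root
    # ... other environment-specific arguments ...

    state_volume_size = 300 # GB, instead of the default 150
}
```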
57 changes: 16 additions & 41 deletions docs/production.md
@@ -7,25 +7,15 @@ production-ready deployments.
- Get it agreed up front what the cluster names will be. Changing this later
requires instance deletion/recreation.

- At least three environments should be created:
- `site`: site-specific base environment
- At least two environments should be created using cookiecutter, which will derive from the `site` base environment:
- `production`: production environment
- `staging`: staging environment

A `dev` environment should also be created if considered required, or this
can be left until later.

These can all be produced using the cookicutter instructions, but the
`production` and `staging` environments will need their
`environments/$ENV/ansible.cfg` file modifying so that they point to the
`site` environment:

```ini
inventory = ../common/inventory,../site/inventory,inventory
```

In general only the `site` environment will need an `inventory/groups` file -
this is templated out by cookiecutter and should be modified as required to
In general only the `inventory/groups` file in the `site` environment is needed -
it can be modified as required to
enable features for all environments at the site.

- To avoid divergence of configuration all possible overrides for group/role
@@ -42,34 +32,10 @@ and referenced from the `site` and `production` environments, e.g.:
import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml"
```

- OpenTofu configurations should be defined in the `site` environment and used
as a module from the other environments. This can be done with the
cookie-cutter generated configurations:
- Delete the *contents* of the cookie-cutter generated `tofu/` directories
from the `production` and `staging` environments.
- Create a `main.tf` in those directories which uses `site/tofu/` as a
[module](https://opentofu.org/docs/language/modules/), e.g. :

```
...
variable "environment_root" {
type = string
description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
source = "../../site/tofu/"
environment_root = var.environment_root

cluster_name = "foo"
...
}
```

Note that:
- When setting OpenTofu configurations:

- Environment-specific variables (`cluster_name`) should be hardcoded
into the cluster module block.
as arguments into the cluster module block at `environments/$ENV/tofu/main.tf`.
- Environment-independent variables (e.g. maybe `cluster_net` if the
same is used for staging and production) should be set as *defaults*
in `environments/site/tofu/variables.tf`, and then don't need to
@@ -87,7 +53,7 @@ and referenced from the `site` and `production` environments, e.g.:
instances) it may be necessary to configure or proxy `chronyd` via an
environment hook.

- By default, the cookiecutter-provided OpenTofu configuration provisions two
- By default, the site OpenTofu configuration provisions two
volumes and attaches them to the control node:
- "$cluster_name-home" for NFS-shared home directories
- "$cluster_name-state" for monitoring and Slurm data
@@ -143,13 +109,22 @@ and referenced from the `site` and `production` environments, e.g.:
- Configure Open OnDemand - see [specific documentation](openondemand.md) which
notes specific variables required.

- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`
- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`.
Replace the `hpctests_user` in `environments/$ENV/inventory/group_vars/all/hpctests.yml` with
an appropriately configured user.

- Consider whether having (read-only) access to Grafana without login is OK. If not, remove `grafana_auth_anonymous` in `environments/$ENV/inventory/group_vars/all/grafana.yml`

- If floating IPs are required for login nodes, create these in OpenStack and add the IPs into
the OpenTofu `login` definition.
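A hypothetical example, assuming the `login` group definition accepts a list of pre-created floating IPs (check the `login` variable in `environments/site/tofu/variables.tf` for the attribute name the module actually expects):

```
login = {
    login = {
        nodes         = ["login-0"]
        flavor        = "general.v1.small"
        fip_addresses = ["192.0.2.10"] # floating IP created in OpenStack beforehand
    }
}
```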

- Consider enabling topology aware scheduling. This is currently only supported if your cluster does not include any baremetal nodes. This can be enabled by:
1. Creating Availability Zones in your OpenStack project for each physical rack
2. Setting the `availability_zone` fields of compute groups in your OpenTofu configuration
3. Adding the `compute` group as a child of `topology` in `environments/$ENV/inventory/groups`
4. (Optional) If you are aware of the physical topology of switches above the rack-level, override `topology_above_rack_topology` in your group vars
(see [topology docs](../ansible/roles/topology/README.md) for more detail)
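Step 3 above is a one-line inventory change, for example:

```ini
# environments/$ENV/inventory/groups -- illustrative fragment
[topology:children]
compute
```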

- Consider whether mapping of baremetal nodes to ironic nodes is required. See
[PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485).

14 changes: 10 additions & 4 deletions docs/upgrades.md
@@ -41,6 +41,12 @@ All other commands should be run on the Ansible deploy host.
prompts. Generally merge conflicts should only exist where functionality which was added
for your site (not in a hook) has subsequently been merged upstream.

Note that if upgrading from a release prior to v2.3, you will likely have merge conflicts
with existing site OpenTofu configurations in `environments/site/tofu`. Generally:
- Changes to `default` values in `environments/site/tofu/variables.tf` should be rejected.
- All other changes to the OpenTofu configuration should be accepted, unless they overwrite
site-specific additional resources.
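One way to apply that guidance while resolving the merge (a sketch only; if upstream changed more than defaults in `variables.tf`, resolve those hunks individually instead):

```
# keep your side, i.e. reject upstream changes to defaults:
git checkout --ours -- environments/site/tofu/variables.tf
# take the upstream side of another conflicted tofu file:
git checkout --theirs -- environments/site/tofu/<file>.tf
git add environments/site/tofu/
```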

1. Push this branch and create a PR:

git push
@@ -50,10 +56,10 @@ All other commands should be run on the Ansible deploy host.
site-specific configuration. In general changes to existing functionality will aim to be
backward compatible. Alteration of site-specific configuration will usually only be
necessary to use new functionality or where functionality has been upstreamed as above.
Note that the `environments/common/layouts/everything` file contains all possible
groups which can be used to enable features; diff this against your e.g.
`environments/site/inventory/groups` file to see new features which you may
wish to enable in the latter file.
Note that the upstream `environments/site/inventory/groups` file contains all possible
groups which can be used to enable features. This will be updated when pulling changes
from the StackHPC repo, and any new groups should be enabled/disabled as required for
your site.
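To see which groups changed in that file during the upgrade, something like the following works (substitute the release you upgraded from):

```
git diff v2.2.0..HEAD -- environments/site/inventory/groups
```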

Make changes as necessary.

2 changes: 1 addition & 1 deletion environments/.caas/inventory/group_vars/all/nfs.yml
@@ -5,7 +5,7 @@ caas_nfs_home:
nfs_enable:
server: "{{ inventory_hostname in groups['control'] }}"
clients: "{{ inventory_hostname in groups['cluster'] }}"
nfs_export: "/exports/home" # assumes skeleton TF is being used
nfs_export: "/exports/home" # assumes default site TF is being used
nfs_client_mnt_point: "/home"

nfs_configurations: "{{ caas_nfs_home if not cluster_home_manila_share | bool else [] }}"
2 changes: 1 addition & 1 deletion environments/.stackhpc/ansible.cfg
@@ -6,7 +6,7 @@ callbacks_enabled = ansible.posix.profile_tasks
gathering = smart
forks = 30
host_key_checking = False
inventory = ../common/inventory,inventory
inventory = ../common/inventory,../site/inventory,inventory
collections_path = ../../ansible/collections
roles_path = ../../ansible/roles
filter_plugins = ../../ansible/filter_plugins
1 change: 0 additions & 1 deletion environments/.stackhpc/inventory/everything

This file was deleted.

4 changes: 2 additions & 2 deletions environments/.stackhpc/tofu/main.tf
@@ -1,4 +1,4 @@
# This terraform configuration uses the "skeleton" terraform, so that is checked by CI.
# This terraform configuration uses the site terraform, so that is checked by CI.

terraform {
required_version = ">= 0.14"
@@ -59,7 +59,7 @@ data "openstack_images_image_v2" "cluster" {
}

module "cluster" {
source = "../../skeleton/{{cookiecutter.environment}}/tofu/"
source = "../../site/tofu/"

cluster_name = var.cluster_name
cluster_networks = var.cluster_networks
11 changes: 7 additions & 4 deletions environments/README.md
@@ -33,17 +33,20 @@ for usage instructions for that component.

Shared configuration for all environments. This is not
intended to be used as a standalone environment, hence the README does *not* detail
how to provision the infrastructure.
how to provision the infrastructure. This environment should not be edited, except as part of upstreaming new features or bug fixes.

### skeleton
## site

Provides the base configuration for all subsequent `cookiecutter` created environments,
including OpenTofu configurations for infrastructure. In general, most local customisations should be made by adding to this environment.

Skeleton directory that is used as a template to create a new environemnt.

## Defining an environment

To define an environment using cookiecutter:

cookiecutter skeleton
cd environments
cookiecutter ../cookiecutter

This will present you with a series of questions which you must answer.
Once you have answered all questions, a new environment directory will
@@ -3,7 +3,7 @@

firewalld_configs_default:
# A list of dicts defining firewalld rules.
# Using the "everything" template firewalld is deployed on the login node to enable fail2ban.
# Using the default site `groups` file, firewalld is deployed on the login node to enable fail2ban.
# However by default we rely on openstack security groups so make firewalld permissive.
# Each dict contains:
# name: An arbitrary name or description
2 changes: 1 addition & 1 deletion environments/common/inventory/group_vars/all/nfs.yml
@@ -18,7 +18,7 @@ nfs_configuration_home_volume: # volume-backed home directories
# Don't mount share on control node:
clients: "{{ inventory_hostname in groups['cluster'] and inventory_hostname not in groups['control'] }}"
nfs_server: "{{ nfs_server_default }}"
nfs_export: "/exports/home" # assumes skeleton TF is being used
nfs_export: "/exports/home" # assumes default site TF is being used
nfs_client_mnt_point: "/home"
# prevent tunnelling and setuid binaries:
# NB: this is stackhpc.nfs role defaults but are set here to prevent being
6 changes: 0 additions & 6 deletions environments/common/layouts/README.md

This file was deleted.

8 changes: 0 additions & 8 deletions environments/common/layouts/minimal

This file was deleted.
