Move cookiecutter Tofu to new site environment #751

Merged
merged 7 commits into from
Aug 7, 2025
2 changes: 1 addition & 1 deletion README.md
@@ -61,7 +61,7 @@ Run the following from the repository root to activate the venv:
Use the `cookiecutter` template to create a new environment to hold your configuration:

cd environments
cookiecutter skeleton
cookiecutter ../cookiecutter

and follow the prompts to complete the environment name and description.
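As a rough sketch of that interaction (the exact prompt names come from the template's `cookiecutter.json`, so treat these as illustrative):

```
$ cd environments
$ cookiecutter ../cookiecutter
environment [myenv]: staging
description [...]: Staging environment for the mysite cluster
```

This creates `environments/staging/` containing the templated `ansible.cfg` and `tofu/main.tf` shown later in this diff.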

5 changes: 1 addition & 4 deletions ansible/roles/alertmanager/README.md
@@ -11,12 +11,9 @@ Note that:
- No Grafana dashboard for alerts is currently provided.

Alertmanager is enabled by default on the `control` node in the
[everything](../../../environments/common/layouts/everything) template which
`cookiecutter` uses for a new environment's `inventory/groups` file.
`site` environment's `inventory/groups` file.

In general usage may only require:
- Adding the `control` node into the `alertmanager` group in `environments/site/groups`
if upgrading an existing environment.
- Enabling the Slack integration (see section below).
- Possibly setting `alertmanager_web_external_url`.
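As a sketch of the first point above (the exact path to your site groups file may differ), this amounts to an inventory fragment like:

```ini
# environments/site/inventory/groups -- illustrative fragment
[alertmanager:children]
control
```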

2 changes: 1 addition & 1 deletion ansible/roles/block_devices/README.md
@@ -11,7 +11,7 @@ This is a convenience wrapper around the ansible modules:

To avoid issues with device names changing after e.g. reboots, devices are identified by serial number and mounted by filesystem UUID.

**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/skeleton/{{cookiecutter.environment}}/tofu/control.userdata.tpl`.
**NB:** This role is ignored[^1] during Packer builds as block devices will not be attached to the Packer build VMs. This role is therefore deprecated and it is suggested that `cloud-init` is used instead. See e.g. `environments/site/tofu/control.userdata.tpl`.

[^1]: See `environments/common/inventory/group_vars/builder/defaults.yml`

4 changes: 2 additions & 2 deletions ansible/roles/freeipa/README.md
@@ -7,7 +7,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s

## Usage
- Add hosts to the `freeipa_client` group and run (at a minimum) the `ansible/iam.yml` playbook.
- Host names must match the domain name. By default (using the skeleton OpenTofu) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are OpenTofu variables.
- Host names must match the domain name. By default (using the site OpenTofu) hostnames are of the form `nodename.cluster_name.cluster_domain_suffix` where `cluster_name` and `cluster_domain_suffix` are OpenTofu variables.
- Hosts discover the FreeIPA server FQDN (and their own domain) from DNS records. If DNS servers are not set from DHCP, use the `resolv_conf` role to configure this. For example when using the in-appliance FreeIPA development server:

```ini
@@ -28,7 +28,7 @@ Support FreeIPA in the appliance. In production use it is expected the FreeIPA s
- For production use with an external FreeIPA server, a random one-time password (OTP) must be generated when adding hosts to FreeIPA (e.g. using `ipa host-add --random ...`). This password should be set as a hostvar `freeipa_host_password`. Initial host enrolment will use this OTP to enrol the host. After this it becomes irrelevant so it does not need to be committed to git. This approach means the appliance does not require the FreeIPA administrator password.
- For development use with the in-appliance FreeIPA server, `freeipa_host_password` will be automatically generated in memory.
- The `control` host must define `appliances_state_dir` (on persistent storage). This is used to back-up keytabs to allow FreeIPA clients to automatically re-enrol after e.g. reimaging. Note that:
- This is implemented when using the skeleton OpenTofu; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
- This is implemented when using the site OpenTofu; on the control node `appliances_state_dir` defaults to `/var/lib/state` which is mounted from a volume.
- Nodes are not re-enrolled by a [Slurm-driven reimage](../../collections/ansible_collections/stackhpc/slurm_openstack_tools/roles/rebuild/README.md) (as that does not run this role).
- If both a backed-up keytab and `freeipa_host_password` exist, the former is used.
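To illustrate the production OTP workflow described above (hostname and file layout are hypothetical):

```
# On the FreeIPA server: add the host and generate a one-time password
ipa host-add compute-0.mycluster.internal --random
# The output includes a line such as "Random password: a1b2c3..."
```

The reported password is then set as the `freeipa_host_password` hostvar for that node (e.g. under `environments/site/inventory/host_vars/`); after the initial enrolment it can be discarded.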

19 changes: 19 additions & 0 deletions cookiecutter/{{cookiecutter.environment}}/ansible.cfg
@@ -0,0 +1,19 @@
[defaults]
any_errors_fatal = True
stdout_callback = debug
stderr_callback = debug
gathering = smart
forks = 30
host_key_checking = False
inventory = ../common/inventory,../site/inventory,inventory
collections_path = ../../ansible/collections
roles_path = ../../ansible/roles
filter_plugins = ../../ansible/filter_plugins

[ssh_connection]
ssh_args = -o ServerAliveInterval=10 -o ControlMaster=auto -o ControlPath=~/.ssh/%r@%h-%p -o ControlPersist=240s -o PreferredAuthentications=publickey -o UserKnownHostsFile=/dev/null
pipelining = True

[inventory]
# Fail when any inventory source cannot be parsed.
any_unparsed_is_failed = True
21 changes: 21 additions & 0 deletions cookiecutter/{{cookiecutter.environment}}/tofu/main.tf
@@ -0,0 +1,21 @@
variable "environment_root" {
    type = string
    description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
    source = "../../site/tofu/"
    environment_root = var.environment_root

    # Environment specific variables
    # Note that some of the variables below may need to be moved to the site environment
    # defaults, e.g. cluster_networks should be in site if your staging and prod
    # environments use the same networks
    cluster_name =
    cluster_image_id =
    control_node_flavor =
    cluster_networks =
    key_pair =
    login =
    compute =
}
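To illustrate how the placeholders above might be completed for a child environment, here is a hypothetical staging configuration; all values (image ID, flavors, network and node names) are invented, and the exact structure of `cluster_networks`, `login` and `compute` should be checked against `environments/site/tofu/variables.tf`:

```
variable "environment_root" {
    type = string
    description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
    source           = "../../site/tofu/"
    environment_root = var.environment_root

    cluster_name        = "staging"
    cluster_image_id    = "30b2e962-xxxx-xxxx-xxxx-example"   # OpenStack image UUID
    control_node_flavor = "general.v1.small"
    key_pair            = "slurm-deploy-key"

    cluster_networks = [
        {
            network = "staging-net"
            subnet  = "staging-subnet"
        }
    ]

    login = {
        login = {
            nodes  = ["login-0"]
            flavor = "general.v1.small"
        }
    }

    compute = {
        general = {
            nodes  = ["general-0", "general-1"]
            flavor = "general.v1.medium"
        }
    }
}
```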
2 changes: 1 addition & 1 deletion docs/adding-functionality.md
@@ -3,7 +3,7 @@
Please contact us for specific advice, but this generally involves:
- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/site/inventory/groups`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.
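For a hypothetical `foo` role, the group additions described above might look like this (sketch only; the enabling group's members depend on the role):

```ini
# environments/common/inventory/groups -- empty group so the role is always defined
[foo]

# environments/site/inventory/groups -- non-empty example enabling the feature
[foo:children]
compute
```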
3 changes: 1 addition & 2 deletions docs/alerting.md
@@ -21,8 +21,7 @@ must be configured to generate notifications.
## Enabling alertmanager

1. Ensure both the `prometheus` and `alertmanager` servers are deployed on the
control node - for new environments the `cookiecutter` tool will have done
this:
control node - these are deployed by default in the site environment's groups:

```ini
# environments/site/groups:
3 changes: 1 addition & 2 deletions docs/experimental/isolated-clusters.md
@@ -6,8 +6,7 @@ access from all nodes, possibly via a [proxy](../../ansible/roles/proxy/).
However many features (as defined by Ansible inventory groups/roles) will work
if the cluster network(s) provide no outbound access. Currently this includes
all "default" features, i.e. roles/groups which are enabled either in the
`common` environment or in the `environments/$ENV/inventory/groups` file
created by cookiecutter for a new environment.
`common` or `site` environments.

The full list of features and whether they are functional on such an "isolated"
network is shown in the table below. Note that:
2 changes: 1 addition & 1 deletion docs/monitoring-and-logging.md
@@ -227,7 +227,7 @@ The `prometheus` group determines the placement of the prometheus service. Load

### Access

Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node, prometheus would then be accessible from:
Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/site/inventory/groups`, this will be set to the slurm `control` node, and Prometheus will then be accessible from:

> http://<control_node_ip>:9090

4 changes: 2 additions & 2 deletions docs/persistent-state.md
@@ -9,11 +9,11 @@ At present this will affect the following:
- Grafana data
- OpenDistro/elasticsearch data

If using the `environments/common/layout/everything` Ansible groups template (which is the default for a new cookiecutter-produced environment) then these services will all be on the `control` node and hence only this node requires persistent storage.
If using the upstream defaults in the `site` environment's `inventory/groups` file then these services will all be on the `control` node and hence only this node requires persistent storage.

Note that if `appliances_state_dir` is defined, the path it gives must exist and should be owned by root. Directories will be created within this with appropriate permissions for each item of state defined above. Additionally, the systemd units for the services listed above will be modified to require `appliances_state_dir` to be mounted before service start (via the `systemd` role).

A new cookiecutter-produced environment supports persistent state in the default OpenTofu (see `environments/skeleton/{{cookiecutter.environment}}/tofu/`) by:
The `site` environment supports persistent state in the default OpenTofu (see `environments/site/tofu/`) by:

- Defining a volume with a default size of 150GB - this can be controlled by the OpenTofu variable `state_volume_size`.
- Attaching it to the control node.
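As a sketch of overriding that default from a child environment's module block (the argument name is taken from the bullet above; the surrounding block is the one in `environments/$ENV/tofu/main.tf`):

```
module "cluster" {
    source           = "../../site/tofu/"
    environment_root = var.environment_root
    # ... other environment-specific arguments ...

    state_volume_size = 300 # GB, instead of the default 150
}
```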
57 changes: 16 additions & 41 deletions docs/production.md
@@ -7,25 +7,15 @@ production-ready deployments.
- Get it agreed up front what the cluster names will be. Changing this later
requires instance deletion/recreation.

- At least three environments should be created:
- `site`: site-specific base environment
- At least two environments should be created using cookiecutter, which will derive from the `site` base environment:
- `production`: production environment
- `staging`: staging environment

A `dev` environment should also be created if considered required, or this
can be left until later.

These can all be produced using the cookicutter instructions, but the
`production` and `staging` environments will need their
`environments/$ENV/ansible.cfg` file modifying so that they point to the
`site` environment:

```ini
inventory = ../common/inventory,../site/inventory,inventory
```

In general only the `site` environment will need an `inventory/groups` file -
this is templated out by cookiecutter and should be modified as required to
In general only the `inventory/groups` file in the `site` environment is needed -
it can be modified as required to
enable features for all environments at the site.

- To avoid divergence of configuration all possible overrides for group/role
@@ -42,34 +32,10 @@ and referenced from the `site` and `production` environments, e.g.:
import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml"
```

- OpenTofu configurations should be defined in the `site` environment and used
as a module from the other environments. This can be done with the
cookie-cutter generated configurations:
- Delete the *contents* of the cookie-cutter generated `tofu/` directories
from the `production` and `staging` environments.
- Create a `main.tf` in those directories which uses `site/tofu/` as a
[module](https://opentofu.org/docs/language/modules/), e.g. :

```
...
variable "environment_root" {
type = string
description = "Path to environment root, automatically set by activate script"
}

module "cluster" {
source = "../../site/tofu/"
environment_root = var.environment_root

cluster_name = "foo"
...
}
```

Note that:
- When setting OpenTofu configurations:

- Environment-specific variables (`cluster_name`) should be hardcoded
into the cluster module block.
as arguments into the cluster module block at `environments/$ENV/tofu/main.tf`.
- Environment-independent variables (e.g. maybe `cluster_net` if the
same is used for staging and production) should be set as *defaults*
in `environments/site/tofu/variables.tf`, and then don't need to
@@ -87,7 +53,7 @@ and referenced from the `site` and `production` environments, e.g.:
instances) it may be necessary to configure or proxy `chronyd` via an
environment hook.

- By default, the cookiecutter-provided OpenTofu configuration provisions two
- By default, the site OpenTofu configuration provisions two
volumes and attaches them to the control node:
- "$cluster_name-home" for NFS-shared home directories
- "$cluster_name-state" for monitoring and Slurm data
@@ -143,13 +109,22 @@ and referenced from the `site` and `production` environments, e.g.:
- Configure Open OnDemand - see [specific documentation](openondemand.md) which
notes specific variables required.

- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`
- Remove the `demo_user` user from `environments/$ENV/inventory/group_vars/all/basic_users.yml`.
Replace the `hpctests_user` in `environments/$ENV/inventory/group_vars/all/hpctests.yml` with
an appropriately configured user.

- Consider whether having (read-only) access to Grafana without login is OK. If not, remove `grafana_auth_anonymous` in `environments/$ENV/inventory/group_vars/all/grafana.yml`

- If floating IPs are required for login nodes, create these in OpenStack and add the IPs into
the OpenTofu `login` definition.
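A hypothetical example, assuming the `login` group definition accepts a list of pre-created floating IPs (check the `login` variable in `environments/site/tofu/variables.tf` for the attribute name the module actually expects):

```
login = {
    login = {
        nodes         = ["login-0"]
        flavor        = "general.v1.small"
        fip_addresses = ["192.0.2.10"] # floating IP created in OpenStack beforehand
    }
}
```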

- Consider enabling topology aware scheduling. This is currently only supported if your cluster does not include any baremetal nodes. This can be enabled by:
1. Creating Availability Zones in your OpenStack project for each physical rack
2. Setting the `availability_zone` fields of compute groups in your OpenTofu configuration
3. Adding the `compute` group as a child of `topology` in `environments/$ENV/inventory/groups`
4. (Optional) If you are aware of the physical topology of switches above the rack-level, override `topology_above_rack_topology` in your group vars
(see [topology docs](../ansible/roles/topology/README.md) for more detail)
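Step 3 above is a one-line inventory change, for example:

```ini
# environments/$ENV/inventory/groups -- illustrative fragment
[topology:children]
compute
```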

- Consider whether mapping of baremetal nodes to ironic nodes is required. See
[PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485).

14 changes: 10 additions & 4 deletions docs/upgrades.md
@@ -41,6 +41,12 @@ All other commands should be run on the Ansible deploy host.
prompts. Generally merge conflicts should only exist where functionality which was added
for your site (not in a hook) has subsequently been merged upstream.

Note that if upgrading from a release prior to v2.3, you will likely have merge conflicts
with existing site OpenTofu configurations in `environments/site/tofu`. Generally:
- Changes to `default` values in `environments/site/tofu/variables.tf` should be rejected.
- All other changes to the OpenTofu configuration should be accepted, unless they overwrite
site-specific additional resources.
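One way to apply that guidance while resolving the merge (a sketch only; if upstream changed more than defaults in `variables.tf`, resolve those hunks individually instead):

```
# keep your side, i.e. reject upstream changes to defaults:
git checkout --ours -- environments/site/tofu/variables.tf
# take the upstream side of another conflicted tofu file:
git checkout --theirs -- environments/site/tofu/<file>.tf
git add environments/site/tofu/
```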

1. Push this branch and create a PR:

git push
@@ -50,10 +56,10 @@ All other commands should be run on the Ansible deploy host.
site-specific configuration. In general changes to existing functionality will aim to be
backward compatible. Alteration of site-specific configuration will usually only be
necessary to use new functionality or where functionality has been upstreamed as above.
Note that the `environments/common/layouts/everything` file contains all possible
groups which can be used to enable features; diff this against your e.g.
`environments/site/inventory/groups` file to see new features which you may
wish to enable in the latter file.
Note that the upstream `environments/site/inventory/groups` file contains all possible
groups which can be used to enable features. This will be updated when pulling changes
from the StackHPC repo, and any new groups should be enabled/disabled as required for
your site.
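To see which groups changed in that file during the upgrade, something like the following works (substitute the release you upgraded from):

```
git diff v2.2.0..HEAD -- environments/site/inventory/groups
```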

Make changes as necessary.

2 changes: 1 addition & 1 deletion environments/.caas/inventory/group_vars/all/nfs.yml
@@ -5,7 +5,7 @@ caas_nfs_home:
nfs_enable:
server: "{{ inventory_hostname in groups['control'] }}"
clients: "{{ inventory_hostname in groups['cluster'] }}"
nfs_export: "/exports/home" # assumes skeleton TF is being used
nfs_export: "/exports/home" # assumes default site TF is being used
nfs_client_mnt_point: "/home"

nfs_configurations: "{{ caas_nfs_home if not cluster_home_manila_share | bool else [] }}"
2 changes: 1 addition & 1 deletion environments/.stackhpc/ansible.cfg
@@ -6,7 +6,7 @@ callbacks_enabled = ansible.posix.profile_tasks
gathering = smart
forks = 30
host_key_checking = False
inventory = ../common/inventory,inventory
inventory = ../common/inventory,../site/inventory,inventory
collections_path = ../../ansible/collections
roles_path = ../../ansible/roles
filter_plugins = ../../ansible/filter_plugins
1 change: 0 additions & 1 deletion environments/.stackhpc/inventory/everything

This file was deleted.

4 changes: 2 additions & 2 deletions environments/.stackhpc/tofu/main.tf
@@ -1,4 +1,4 @@
# This terraform configuration uses the "skeleton" terraform, so that is checked by CI.
# This terraform configuration uses the site terraform, so that is checked by CI.

terraform {
required_version = ">= 0.14"
@@ -59,7 +59,7 @@ data "openstack_images_image_v2" "cluster" {
}

module "cluster" {
source = "../../skeleton/{{cookiecutter.environment}}/tofu/"
source = "../../site/tofu/"

cluster_name = var.cluster_name
cluster_networks = var.cluster_networks
11 changes: 7 additions & 4 deletions environments/README.md
@@ -33,17 +33,20 @@ for usage instructions for that component.

Shared configuration for all environments. This is not
intended to be used as a standalone environment, hence the README does *not* detail
how to provision the infrastructure.
how to provision the infrastructure. This environment should not be edited, except as part of upstreaming new features or bug fixes.

### skeleton
## site

Provides the base configuration for all subsequent `cookiecutter` created environments,
including OpenTofu configurations for infrastructure. In general, most local customisations should be made by adding to this environment.

Skeleton directory that is used as a template to create a new environemnt.

## Defining an environment

To define an environment using cookiecutter:

cookiecutter skeleton
cd environments
cookiecutter ../cookiecutter

This will present you with a series of questions which you must answer.
Once you have answered all questions, a new environment directory will
@@ -3,7 +3,7 @@

firewalld_configs_default:
# A list of dicts defining firewalld rules.
# Using the "everything" template firewalld is deployed on the login node to enable fail2ban.
# Using the default site `groups` file, firewalld is deployed on the login node to enable fail2ban.
# However by default we rely on openstack security groups so make firewalld permissive.
# Each dict contains:
# name: An arbitrary name or description
2 changes: 1 addition & 1 deletion environments/common/inventory/group_vars/all/nfs.yml
@@ -18,7 +18,7 @@ nfs_configuration_home_volume: # volume-backed home directories
# Don't mount share on control node:
clients: "{{ inventory_hostname in groups['cluster'] and inventory_hostname not in groups['control'] }}"
nfs_server: "{{ nfs_server_default }}"
nfs_export: "/exports/home" # assumes skeleton TF is being used
nfs_export: "/exports/home" # assumes default site TF is being used
nfs_client_mnt_point: "/home"
# prevent tunnelling and setuid binaries:
# NB: this is stackhpc.nfs role defaults but are set here to prevent being
6 changes: 0 additions & 6 deletions environments/common/layouts/README.md

This file was deleted.

8 changes: 0 additions & 8 deletions environments/common/layouts/minimal

This file was deleted.
