Support additional nodegroups #704

Merged
merged 18 commits into from
Jun 25, 2025

Conversation

@sjpb (Collaborator) commented Jun 11, 2025

  • Adds opentofu variable additional_nodegroups to support defining non-Slurm nodes in the cluster. E.g.:

    additional_nodegroups = {
        squid = {
            nodes = ["squid-0"]
            flavor = var.other_node_flavor
        }
    }

    Nodes are automatically added to an inventory group of the same name as the node group.
    Slurm-controlled rebuild and compute-init are not supported, as these nodes will not be running slurmd.
    Security groups default to those from the opentofu variable nonlogin_security_groups, but may be overridden.

  • Also adds compute nodes into an inventory group with the same name as the node group, in addition to the existing ${cluster_name}_${group_name} inventory group required for the stackhpc.openhpc role's partition configuration. This simplifies multi-environment configuration.
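As an illustration (hypothetical hostnames, assuming `cluster_name = "mycluster"`, a compute group named `standard`, and the `squid` additional nodegroup from the example above), the resulting Ansible inventory groups would look roughly like:

```ini
[squid]                 # additional nodegroup: inventory group of the same name
squid-0

[standard]              # compute group under its plain name (added by this PR)
mycluster-standard-0
mycluster-standard-1

[mycluster_standard]    # existing group used for stackhpc.openhpc partitions
mycluster-standard-0
mycluster-standard-1
```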

Base automatically changed from control-ip-addresses to main June 13, 2025 14:30
@sjpb sjpb marked this pull request as ready for review June 13, 2025 14:35
@sjpb sjpb requested a review from a team as a code owner June 13, 2025 14:35
bertiethorpe previously approved these changes Jun 17, 2025

@bertiethorpe (Member) left a comment

LGTM, once CI is passing. I agree that login and compute should be suffixed with "_nodegroups", but we can live with it. It would be good to test this in CI at some point too.

@sjpb (Collaborator, Author) commented Jun 18, 2025

I should have said: this has been tested locally using the CI environment, so it does actually work. I'm a bit loath to add it to CI as we already struggle with resources.

@sjpb (Collaborator, Author) commented Jun 25, 2025

First attempt above - RL9 (only) failed with:

# Configure cluster at latest release
TASK [Run sinfo] ***************************************************************
...
FAILED - RETRYING: [slurmci-RL9-2607-login-0]: Run sinfo (1 retries left).
fatal: [slurmci-RL9-2607-login-0]: FAILED! => {
    "attempts": 200,
    "changed": false,
    "cmd": "sinfo --noheader --format=\"%N %P %a %l %D %t\" | sort",
    "delta": "0:00:00.013515",
    "end": "2025-06-24 16:50:16.721362",
    "rc": 0,
    "start": "2025-06-24 16:50:16.707847"
}

STDOUT:

 extra up 60-00:00:00 0 n/a
slurmci-RL9-2607-compute-[0-1] standard* up 60-00:00:00 2 drain
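As an illustrative sketch only (not part of the PR, and a hypothetical helper), the sinfo output above can be parsed to show why the check failed: the `extra` partition reports 0 nodes, and the compute nodes are drained.

```python
# Illustrative sketch: parse output from
#   sinfo --noheader --format="%N %P %a %l %D %t"
# and report unhealthy lines, using the CI log above as sample data.
LOG = """\
 extra up 60-00:00:00 0 n/a
slurmci-RL9-2607-compute-[0-1] standard* up 60-00:00:00 2 drain"""

def problems(sinfo_output: str) -> list[str]:
    """Return one human-readable issue per unhealthy sinfo line."""
    issues = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) == 5:
            # Nodelist column (%N) is empty: the partition has no nodes.
            partition, _avail, _limit, count, _state = fields
            issues.append(f"partition {partition!r} has {count} nodes")
        else:
            nodes, partition, _avail, _limit, _count, state = fields
            if state not in ("idle", "alloc", "mix"):
                issues.append(f"{nodes} in {partition!r} are {state!r}")
    return issues

for issue in problems(LOG):
    print(issue)
# partition 'extra' has 0 nodes
# slurmci-RL9-2607-compute-[0-1] in 'standard*' are 'drain'
```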

@sjpb sjpb marked this pull request as draft June 25, 2025 11:26
@sjpb sjpb marked this pull request as ready for review June 25, 2025 12:18
@bertiethorpe bertiethorpe self-requested a review June 25, 2025 12:33
@sjpb sjpb merged commit 9525f2c into main Jun 25, 2025
2 checks passed
@sjpb sjpb deleted the feat/additional-nodes branch June 25, 2025 13:42