Support additional nodegroups #704

Merged
merged 18 commits into from
Jun 25, 2025

Conversation

@sjpb (Collaborator) commented Jun 11, 2025

  • Adds opentofu variable additional_nodegroups to support defining non-Slurm nodes in the cluster. E.g.:

    additional_nodegroups = {
        squid = {
            nodes = ["squid-0"]
            flavor = var.other_node_flavor
        }
    }

    Nodes are automatically added to an inventory group of the same name as the node group.
    Slurm-controlled rebuild and compute-init are not supported, as these nodes will not be running slurmd.
    Security groups default to those from the opentofu variable nonlogin_security_groups, but may be overridden.

  • Also adds compute nodes into an inventory group with the same name as the node group, in addition to the existing ${cluster_name}_${group_name} inventory group required for the stackhpc.openhpc role's partition configuration. This simplifies multi-environment configuration.
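As an illustration (hypothetical hostnames, assuming `cluster_name = "mycluster"`, a compute group named `standard`, and the `squid` additional nodegroup from the example above), the resulting Ansible inventory groups would look roughly like:

```ini
[squid]                 # additional nodegroup: inventory group of the same name
squid-0

[standard]              # compute group under its plain name (added by this PR)
mycluster-standard-0
mycluster-standard-1

[mycluster_standard]    # existing group used for stackhpc.openhpc partitions
mycluster-standard-0
mycluster-standard-1
```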

Base automatically changed from control-ip-addresses to main June 13, 2025 14:30
@sjpb sjpb marked this pull request as ready for review June 13, 2025 14:35
@sjpb sjpb requested a review from a team as a code owner June 13, 2025 14:35
bertiethorpe previously approved these changes Jun 17, 2025

@bertiethorpe (Member) left a comment

LGTM, once CI is passing. I agree that login and compute should be suffixed with "_nodegroups", but we can live with it. It would be good to test this in CI at some point too.

@sjpb (Collaborator, Author) commented Jun 18, 2025

I should have said: this has been tested locally using the CI environment, so it does actually work. I'm a bit loath to add it to CI as we already struggle with resources.

@sjpb (Collaborator, Author) commented Jun 25, 2025

First attempt above - RL9 (only) failed with:

# Configure cluster at latest release
TASK [Run sinfo] ***************************************************************
...
FAILED - RETRYING: [slurmci-RL9-2607-login-0]: Run sinfo (1 retries left).
fatal: [slurmci-RL9-2607-login-0]: FAILED! => {
    "attempts": 200,
    "changed": false,
    "cmd": "sinfo --noheader --format=\"%N %P %a %l %D %t\" | sort",
    "delta": "0:00:00.013515",
    "end": "2025-06-24 16:50:16.721362",
    "rc": 0,
    "start": "2025-06-24 16:50:16.707847"
}

STDOUT:

 extra up 60-00:00:00 0 n/a
slurmci-RL9-2607-compute-[0-1] standard* up 60-00:00:00 2 drain
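As an illustrative sketch only (not part of the PR, and a hypothetical helper), the sinfo output above can be parsed to show why the check failed: the `extra` partition reports 0 nodes, and the compute nodes are drained.

```python
# Illustrative sketch: parse output from
#   sinfo --noheader --format="%N %P %a %l %D %t"
# and report unhealthy lines, using the CI log above as sample data.
LOG = """\
 extra up 60-00:00:00 0 n/a
slurmci-RL9-2607-compute-[0-1] standard* up 60-00:00:00 2 drain"""

def problems(sinfo_output: str) -> list[str]:
    """Return one human-readable issue per unhealthy sinfo line."""
    issues = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) == 5:
            # Nodelist column (%N) is empty: the partition has no nodes.
            partition, _avail, _limit, count, _state = fields
            issues.append(f"partition {partition!r} has {count} nodes")
        else:
            nodes, partition, _avail, _limit, _count, state = fields
            if state not in ("idle", "alloc", "mix"):
                issues.append(f"{nodes} in {partition!r} are {state!r}")
    return issues

for issue in problems(LOG):
    print(issue)
# partition 'extra' has 0 nodes
# slurmci-RL9-2607-compute-[0-1] in 'standard*' are 'drain'
```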

@sjpb sjpb marked this pull request as draft June 25, 2025 11:26
@sjpb sjpb marked this pull request as ready for review June 25, 2025 12:18
@bertiethorpe bertiethorpe self-requested a review June 25, 2025 12:33
@sjpb sjpb merged commit 9525f2c into main Jun 25, 2025
2 checks passed
@sjpb sjpb deleted the feat/additional-nodes branch June 25, 2025 13:42