[RFC] Enhance Stack Handling #1220
Conversation
Hi @FloThinksPi, Thank you for drafting RFC-0040! It provides a comprehensive overview of enhancements to stack handling in Cloud Foundry, and I appreciate the effort you’ve put into addressing these critical issues. However, I feel the RFC could benefit from being split into individual, logically scoped RFCs to improve focus and facilitate discussion. For example:
Each topic is significant and detailed enough to warrant its own RFC. Splitting them would allow contributors to dive deeper into specific areas, streamline discussions, and prioritize implementation efforts more effectively.

On a higher level, I'm also worried about handing responsibility for stacks to app developers. One of the great features of CF is that platform operators can fix CVEs in, for example, OpenSSL with a single bosh deploy. I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin-only feature (using the cf CLI), ideally using a blobstore-first approach so it works in airgapped environments.

Last but not least, I'm wondering about the rollout mechanisms for handling custom stack updates: given these changes are not orchestrated through BOSH, how do we ensure we don't overwhelm Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.
Thanks for the feedback @rkoster. I can split it into multiple RFCs, however I think the big picture for which the RFCs are made would then be lost, as only the sum of the 3 RFCs/proposals together brings improvements for the problem. I would rather accept or dismiss individual proposals as the result of the review, document in the proposals which ones are accepted and which are not, and then accept or dismiss the RFC as a whole. Maybe we can open multiple https://github.com/cloudfoundry/community/discussions, one for each proposal, and link them here - would that help to structure the review process better?

E.g. for the proposals: the last proposal is just a "note for future reference", not an actual proposal to commit to, but something to pull out in case it is desired in some months/years as an extension scenario. I'll answer your other questions in the discussions, just to try that out as well :)
I'm not sure that additional discussion threads help. The RFC README clearly states that the discussion shall happen in the PR. I share @rkoster's opinion that this RFC is rather big and deserves a split. "Improve logical stack management in CF API" and "Bring your own stack" are proposals that bring value without each other and can be implemented independently.

"Provide a stack with every Ubuntu LTS" is in my eyes not a good RFC candidate. The idea is clear and has been brought up in the past, but in the end it is a question of resources and commitment. Concrete RFCs for the next stack, like rfc-0039-noble-based-cflinuxfs5, are more helpful as they indicate commitment by the author and make clear that work will really start.

The RFC README doesn't say anything about the scope/size of an RFC and we have RFCs of all sizes (from smaller process-related RFCs like introducing the reviewer role up to the long-running CF API v2 removal and manifest v2). I would strive for not-too-big RFCs so that the corresponding implementation issues can get closed one day.
@rkoster @stephanme alright, split into #1251 and dropped the stack release proposal entirely.
@rkoster, to answer your comment:
We could extend the feature flag to have 3 modes -
I also thought intensively about that. In the end we actually already allowed this, to some degree, when we introduced "bring your own buildpack". If you run your own buildpack - or even a system one - you have to regularly restage anyway to consume patches to your buildpack or language libs, in case your application does not do dependency version pinning at all, or pins only direct dependencies but not transitive ones - patched log4j libs being an example everyone might know of. Even though CF gives you some things automatically, in my experience users didn't know that they sometimes have to restage to consume new libs, as they (naively) thought the system takes care of it. For example, take the Python buildpack - you only get a CVE patched in the Python interpreter if you restage! The same goes for the Go compiler, Ruby or the Java JRE.

What may thus be desirable when allowing freedom like custom buildpacks or custom stacks is to explain the boundary conditions very well and make them transparent to the user. This can maybe be added to the RFC/a new one, so that we make this more transparent than today, either in the API with a special flag (e.g.
Deliberately, the RFC only supports custom stack + custom buildpack together, so "bring your own everything". Firstly, the operator has to allow its users to use this feature via the feature flag. Secondly, a user has to willingly move away from the system buildpack and system stack - this cannot happen accidentally. So, if documented better, I think an operator could be enabled to allow a user who deliberately wants to care for his own buildpack and stack to do so. As an operator of a foundation, compliance can still be validated programmatically with a scan over the CF API, in case the operator is responsible for the apps on the foundation - or the operator can decide not to enable that feature at all if there are concerns.
Since we run at large scale with the docker feature flag enabled on our foundations, we have gained quite some experience already. From a Diego point of view, the custom stack proposal in CF is nothing else than a Docker app. Diego is - as outlined in the RFC - unaware of lifecycles; that's a CF API concept. It just knows LRPs and Tasks whose base layer is either preloaded on the disk or pulled from a container registry, see https://github.com/cloudfoundry/bbs/blob/main/docs/031-defining-lrps.md. Maybe I missed something here, but that means we can derive the behaviour of custom stacks for Diego from the already known behaviour of Docker apps within CF, since from a Diego point of view they are identical. Thinking about that now - maybe it makes sense for the feature flag for custom stacks to be only activatable when
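To illustrate what Diego actually sees, here is a minimal, non-authoritative sketch (struct, field and value names are illustrative, not copied from the BBS models) of the only difference between a preloaded-stack app and a registry-backed app at the LRP level:

```go
package main

import "fmt"

// Illustrative only: an LRP in Diego carries a root filesystem reference.
// Whether that reference points at a preloaded stack on the cell's disk or at
// a container registry is the only difference Diego sees - it has no notion
// of CF API lifecycles (buildpack vs. docker vs. a future "custom stack").
type desiredLRP struct {
	ProcessGUID string
	RootFs      string
}

func main() {
	buildpackApp := desiredLRP{ProcessGUID: "app-1", RootFs: "preloaded:cflinuxfs4"}
	dockerApp := desiredLRP{ProcessGUID: "app-2", RootFs: "docker:///myorg/myimage#latest"}
	// A "bring your own stack" app would look like the docker case to Diego.
	fmt.Println(buildpackApp, dockerApp)
}
```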
Please follow the RFC draft creation process and change the name of the file to rfc-draft-enhance-stack-handling.md
because our automation generates and assigns the RFC number when it is accepted and merged.
> (Long Running Process (LRP)/Task) start will fail because the runtime did not find the preloaded stack in its local filesystem. One thus is not able to stop shipping/delivering the outdated, insecure stack anymore without causing downtimes to all apps still using it.
This argument is not wrong but rather weak in my eyes. If you have to stop the delivery of an insecure old stack for formal/regulatory reasons, then this also includes its usage in app containers, doesn't it?
There might be regulatory peculiarities that put different obligations towards the platform provider (delivering the stack) and users (using the stack in their apps). IANAL, but in the end the CVEs become visible/exploitable via the apps, not via the stack itself.
If the argumentation goes in the direction of "as platform provider we are not responsible for app CVEs and we don't care" (as e.g. for Docker apps but also for the app coding), then I don't see a formal reason why an old and insecure stack has to be removed from the platform.
That said, I'm in favour of removing old stacks and communicating clear deprecation and removal timelines (and then also executing on those timelines).
> stop the delivery of an insecure old stack for formal/regulatory reasons, then this also includes its usage in app containers, doesn't it?
Not really, I think; it depends on who runs the CF foundation and who is the user. In case both are the same company, that is true. In case the foundation operators are a different company and the user bought a CF organization, then the latter - the usage itself - is the customer's obligation, similar to:
- which Docker apps he uses, and whether he updates them regularly or not
- whether he uses a custom buildpack or not, and whether he updates it or not
- etc.
> I don't see a formal reason why an old and insecure stack has to be removed from the platform.
If this first company ships a CF distribution - like VMware Tanzu or the former Pivotal Cloud Foundry - to a customer, this deliverable is required to not include any parts that don't receive security updates and are out of maintenance, because otherwise the first company is contractually liable for anything it ships and for keeping it up to date. So the providing company may have a strong interest in removing an outdated stack from its shipment - without breaking customers' applications already running on the stack that is removed from the shipment. Unfortunately, removing a stack from cf-deployment currently equals removing it from a foundation, since BOSH deploys the stack and places it on the VMs.
So these are all regulatory requirements, not technical ones, but they can also be a big benefit for some CF community members, depending on their CF usage.
> meaningful error messages in case the conditions to create an app or stage an app are not met. Error messages SHOULD refer to the date times and include since when a stack is deprecated, when it got locked or disabled, to enrich the response to the CF user.
Too much text (and confusing MAY/CAN/MUST), and it is still not clear whether stack usage shall be controlled by a stack state, by stack timestamps, or by both.
I would go for two additional fields per stack:
- `state`: active, deprecated, locked, disabled - explicitly set by operators
- `info`: additional info for the stack that can be used in logs and CLI output to communicate timelines
Added
Stack lifecycle management is controlled exclusively by a set of timestamps, not by an explicit "state" field. The system determines the current state of a stack (active, deprecated, locked, or disabled) by comparing the current server time to these timestamps. This approach ensures that stack usability is automatically and unambiguously defined by time-based transitions.
- State transitions (deprecated_at → locked_at → disabled_at → removed_at) are derived from the configured timestamps.
- A validation check SHOULD be performed when setting the timestamps to ensure that they are in chronological order (deprecated_at → locked_at → disabled_at → removed_at).
- There is no explicit `state` field; the state is always computed at runtime from the timestamps.
- For additional information, the operator MAY reuse the existing `description` field in the stacks table.
- In case a timestamp is not set, the stack is considered to remain in the previous state indefinitely. This SHOULD also be reflected in error and log messages.
This model guarantees that stack usage is consistently enforced based on time, and all state transitions are predictable and transparent.
It also provides an interface for an operator in which no externally timed API call for a stack state change is required.
To make it more explicit and clear: that part is also open to change. It was just an idea to solve this with timestamps instead of state fields that would additionally need to be pushed from one state to another - since the logging and error messages need the timestamps anyway, which makes a state field obsolete.
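For illustration, a minimal sketch of the runtime state derivation described above - it assumes the timestamp names from the RFC text (deprecated_at, locked_at, disabled_at) and is not the PoC implementation linked elsewhere in this thread:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch only: derive the effective stack state from the RFC's proposed
// timestamps. A nil timestamp means the stack never leaves the previous state.
type stack struct {
	Name                               string
	DeprecatedAt, LockedAt, DisabledAt *time.Time
}

func (s stack) stateAt(now time.Time) string {
	after := func(t *time.Time) bool { return t != nil && now.After(*t) }
	switch {
	case after(s.DisabledAt):
		return "disabled" // restaging existing apps fails as well
	case after(s.LockedAt):
		return "locked" // no new apps; existing apps keep running
	case after(s.DeprecatedAt):
		return "deprecated" // staging logs a prominent warning
	default:
		return "active"
	}
}

func main() {
	locked := time.Now().UTC().Add(-24 * time.Hour)
	s := stack{Name: "cflinuxfs3", LockedAt: &locked}
	// A separate validation (not shown) could ensure the timestamps are chronological.
	fmt.Println(s.Name, "is", s.stateAt(time.Now().UTC())) // cflinuxfs3 is locked
}
```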
Thanks for the clarification. Let's see what others think about the automatic, time-based stack deprecation.
Having state implemented implicitly by timestamps makes testing and debugging harder in my opinion. Also, there could be CF foundations deployed all over the world. Yes, we will use UTC, but people using different CF API clients have to translate the time and compute a state out of it in case they want to present a friendly message to CF users. I would prefer to have an explicit state, but if we can't agree on this we could have a vote about it.
For reference, I spent a few minutes today on a draft adoption in cloud_controller_ng: https://github.com/cloudfoundry/cloud_controller_ng/pull/4475/files. It is not ready yet and likely has failing tests etc., but it gives a quick, rough overview of what is about to change code-wise with this RFC.
> This model guarantees that stack usage is consistently enforced based on time, and all state transitions are predictable and transparent. It also provides an interface for an operator in which no externally timed API call for a stack state change is required.
>
> The CF API SHOULD use the provided timestamps to influence the logging and error messages in the following way:
CF API (stacks endpoint) should probably return the current state in addition to the timestamps so that clients don't have to re-calculate the stack state (and they don't know the server time, at least not for sure).
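A hedged sketch of what such a stacks response could look like from a client's perspective, assuming the proposed timestamp fields plus a server-computed `state` field were added to the stack resource (the field names are assumptions, not an agreed API):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Assumed client-side view of a stack resource if this RFC were adopted: the
// server returns both the raw timestamps and the state it computed from its
// own clock, so clients never have to guess the server time.
type stackResource struct {
	Name         string     `json:"name"`
	Description  string     `json:"description"`
	State        string     `json:"state"` // computed server-side: active|deprecated|locked|disabled
	DeprecatedAt *time.Time `json:"deprecated_at"`
	LockedAt     *time.Time `json:"locked_at"`
	DisabledAt   *time.Time `json:"disabled_at"`
}

func main() {
	body := []byte(`{"name":"cflinuxfs3","description":"please migrate to cflinuxfs4",
		"state":"deprecated","deprecated_at":"2025-03-05T21:30:00Z",
		"locked_at":"2025-04-05T21:30:00Z","disabled_at":null}`)
	var s stackResource
	if err := json.Unmarshal(body, &s); err != nil {
		panic(err)
	}
	fmt.Printf("%s is %s since %s\n", s.Name, s.State, s.DeprecatedAt.Format(time.RFC3339))
}
```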
> - It MAY add a log line into the Staging/Restaging logs of apps with a locked stack. It SHOULD produce the deprecation warning, optionally with color support, to underline the importance of a deprecated stack.
> - It MAY add the time since when a stack is deprecated/locked/disabled to the Staging/Restaging logs.
> - It MAY add the time when future state transitions will happen to the Staging/Restaging logs, i.e. when it is going to be locked/disabled.
> - The `description` field MAY be used in addition in Staging/Restaging logs to enable CF admins to include a custom message in the app's staging logs.
CF CLI should also print out the current stack state and the deprecate/lock/disable_at timestamps.
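To make the client-side rendering concrete, here is a small sketch (a hypothetical helper, not actual cf CLI code) of how a CLI or the staging log could combine state, timestamps and the `description` field into one friendly message:

```go
package main

import (
	"fmt"
	"time"
)

// Sketch of a client-side renderer: turn the stack state, the upcoming
// timestamps and the operator-provided description into a single warning line.
func stackWarning(name, state, description string, lockedAt, disabledAt *time.Time) string {
	msg := fmt.Sprintf("Stack %q is %s.", name, state)
	if lockedAt != nil {
		msg += fmt.Sprintf(" New apps will be rejected after %s.", lockedAt.Format(time.RFC1123))
	}
	if disabledAt != nil {
		msg += fmt.Sprintf(" Restaging will fail after %s.", disabledAt.Format(time.RFC1123))
	}
	if description != "" {
		msg += " " + description
	}
	return msg
}

func main() {
	locked := time.Date(2025, 4, 5, 21, 30, 0, 0, time.UTC)
	fmt.Println(stackWarning("cflinuxfs3", "deprecated", "Please migrate to cflinuxfs4.", &locked, nil))
}
```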
(Resolved, outdated review thread on toc/rfc/rfc-0040-enhance-stack-handling/current_stack_usage.png)
> - disable_at -> Timestamp after which restaging existing apps with that stack fails
>
> Stack lifecycle management is controlled exclusively by a set of timestamps, not by an explicit "state" field. The system determines the current state of a stack (active, deprecated, locked, or disabled) by comparing the current server time to these timestamps. This approach ensures that stack usability is automatically and unambiguously defined by time-based transitions. Additional information can already be added via the `description` field.
I'm not sure this level of automation/complexity is needed for something relatively rare (stack deprecations). A simpler implementation would be operator-configurable `state` and `deprecation_message` fields (or equivalent).
Operators could implement their own automated deprecation flows on top of the CF APIs, of course; CF just wouldn't be opinionated about how to do it.
I'm open to also having just different states - I just thought it's a lot easier for an operator to utilize this without the need to write a lot of glue-code logic outside of CF that automates the calls to the CF API at the right time, especially when operating a lot of landscapes.
The implementation with timestamps is actually quite simple - I quickly did a PoC of how this RFC could be implemented: https://github.com/cloudfoundry/cloud_controller_ng/pull/4475/files#diff-58985a23dbdd10b651e8a688c84727b4be283d134d15d5ad1cbbb7a7d793317fR46-R58
Most of it is tests.
I also thought that even when implementing own workflows, one can just set the timestamp to the current time minus a minute and gain the same behaviour as flags, if desired. So external workflows are still fully possible, although I cannot think of many differences from the built-in workflow, since an operator can also leave out timestamps for states he doesn't want, e.g. not set locked_at and go straight to disabled_at at a certain date.
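As an illustration of that "same behaviour as flags" point, a hedged sketch of such an external workflow - it assumes the proposed `locked_at` field were accepted on the existing `PATCH /v3/stacks/:guid` endpoint, which is not the case today:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Hypothetical external workflow: emulate a "lock now" flag by patching the
// proposed locked_at field to a timestamp slightly in the past. The endpoint
// body shape is an assumption based on this RFC, not the current CF API.
func lockStackNow(apiURL, stackGUID, token string) error {
	body := fmt.Sprintf(`{"locked_at": %q}`, time.Now().UTC().Add(-time.Minute).Format(time.RFC3339))
	req, err := http.NewRequest(http.MethodPatch, apiURL+"/v3/stacks/"+stackGUID, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder values; a real run needs an API URL, stack GUID and OAuth token.
	_ = lockStackNow("https://api.example.com", "stack-guid", "oauth-token")
}
```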
Just pulling in Beyhan's concern about timestamps too:
> Having state implemented implicitly by timestamps makes testing and debugging harder in my opinion. Also, there could be CF foundations deployed all over the world. Yes, we will use UTC, but people using different CF API clients have to translate the time and compute a state out of it in case they want to present a friendly message to CF users. I would prefer to have an explicit state, but if we can't agree on this we could have a vote about it.
So we could either leave the dates out of all error messages and just say it's locked - but I guess that is also not user-friendly. Or we could use a freetext field like the stack description for a deprecation message that announces the timeframes, e.g. "This stack is deprecated since 05/03/2025 21:30 UTC, it will be locked after 05/04/2025 21:30 UTC and will be disabled after 05/05/2025 21:30 UTC".
However, with a freetext field you cannot pass proper date information programmatically if you want to let the consumer know in a user-friendly way when what happens. I think even if we print UTC times we are a lot better off than before, be it in a freetext message or programmatically via the timestamps in the stacks table for each state.
So I'm in favor of timestamps and programmatic messages (enhanced with customizable freetext): having_timestamp > timestamp_in_freetext > state > state_in_freetext.
- timestamp > state, because it is easier for an operator to handle and in any case carries more information that can be displayed to the consumer.
- programmatic_info > freetext, because it reflects the true system state and a mismatch between system behaviour and the stack description freetext cannot happen.
But I'm also fine if we all decide to use a state-based approach and change that. Maybe we vote on that specific point with emojis on the comments?
Option 1: Timestamps with automatic messages/errors that are extended by a freetext field.
Option 2: Timestamps with deprecation/error messages from a freetext field.
Option 3: States with automatic messages/errors that are extended by a freetext field.
Option 4: States with deprecation/error messages from a freetext field.
> programmed client side by creating new CF Applications. However all existing apps using a locked stack SHOULD continue to run.
>
> - Mark a stack as disabled -> prevent using the stack for any app
I wonder if we would get much value from having both "locked" and "disabled", especially since "locked" will de-facto break some blue-green deployments. It'd be simpler to understand (and implement) if we reduced it down to two functional states: locked/unlocked.
We brought in this intermediary state to have an (optional) step in between. An operator can also decide to move straight from deprecated to disabled by not setting a timestamp for locked. That being said, the locked state is important for very, very large foundations where the disabled state would create too much disturbance and support load at once. With the gradual exposure of a locked state, one can do this in at least 2 steps. Optimally I would have liked to propose this as a per-organisation setting, however that is a much larger change in the CC and in the user experience, so I'd like to first bring this in on the same logical level that exists today - globally, since stacks are only global assets. After that, and after gathering more experience, one may come forward with additional optimizations to the workflow/UX/process like:
- Individual stack states per org
- A stack visibility mapping, like with services, which steers which org/space can use which stack
- etc.
But I think it's more valuable to first improve something at the global stack level before going into an org-scoped world. So the locked state, in short, exists to turn one big step in the deprecation process into 2 smaller ones, to reduce/distribute the effects on large consumer bases.
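A short sketch of the proposed semantics behind the two steps (not Cloud Controller code; state names and error strings are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Sketch of the proposed enforcement semantics:
//   deprecated - everything still works, but staging logs a warning
//   locked     - staging *new* apps on the stack is rejected,
//                existing apps keep running and can still be restaged
//   disabled   - restaging existing apps is rejected too; running apps keep running
func checkStackUsage(state string, isNewApp bool) error {
	switch state {
	case "locked":
		if isNewApp {
			return errors.New("stack is locked: new apps cannot use it")
		}
	case "disabled":
		return errors.New("stack is disabled: staging and restaging are rejected")
	}
	return nil // active or deprecated: allowed (deprecated only warns)
}

func main() {
	fmt.Println(checkStackUsage("locked", true))    // error for new apps
	fmt.Println(checkStackUsage("locked", false))   // <nil> - restaging existing apps still works
	fmt.Println(checkStackUsage("disabled", false)) // error even for restaging
}
```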
> - Mark a stack as deprecated -> staging and start process logs a prominent warning. This brings better awareness than release notes.
>
> - Mark a stack as locked -> prevent using the stack for new apps
Do you have a sense (or even data 😄) on what % of apps the `locked` state would catch vs changing the default stack? My impression is that most app devs don't explicitly specify the stack when pushing new apps, so those apps should use the default stack, but maybe that's not true everywhere.
I cannot share concrete numbers, but from a representative sample (an extremely large total app count): a few months after switching the default stack, only a low single-digit percentage of apps was using the new default stack. Extrapolating that, removing an old stack would take in the range of multiple decades, if I remember correctly.
We thus actively moved applications from one stack to another: changing the stack, restaging with zero downtime, checking if the app comes up again and, if not, reverting the stack. This moved the majority of apps to a new stack successfully, but even after multiple years a quite substantial number of leftover apps remains on FS3.
I see this from the angle of a PaaS provider, where we have limited control over the workloads but need tools to gain influence on the workloads themselves and to create incentives to use certain usage patterns rather than others. These tools would also be of value for a small company hosting an instance just for themselves, or via an on-premises offering by a vendor. That's why this RFC goes hand in hand with #1251, as only combined do they offer a better user and operator experience.
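For illustration, the move-restage-verify-revert workflow described above as a rough sketch with hypothetical helper functions (none of these exist as-is in the cf CLI or CF API client libraries):

```go
package main

import "log"

// Hypothetical helpers representing the workflow described above; a real
// implementation would go through the CF API (update the app's lifecycle
// stack, trigger a zero-downtime restage, poll app health).
func setStack(appGUID, stack string) error     { return nil }
func restageZeroDowntime(appGUID string) error { return nil }
func appHealthy(appGUID string) bool           { return true }

// migrateApps moves each app to newStack and reverts any app that does not
// come up healthy afterwards, mirroring the operator-driven migration above.
func migrateApps(appGUIDs []string, oldStack, newStack string) {
	for _, app := range appGUIDs {
		if err := setStack(app, newStack); err != nil {
			log.Printf("%s: could not set stack: %v", app, err)
			continue
		}
		if err := restageZeroDowntime(app); err != nil || !appHealthy(app) {
			log.Printf("%s: not healthy on %s, reverting to %s", app, newStack, oldStack)
			_ = setStack(app, oldStack)
			_ = restageZeroDowntime(app)
		}
	}
}

func main() {
	migrateApps([]string{"app-guid-1", "app-guid-2"}, "cflinuxfs3", "cflinuxfs4")
}
```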
Related RFC-0041