
KEP-127 (UserNS): allow customizing subids length #5020

Open · wants to merge 1 commit into base: master

Conversation

AkihiroSuda (Member):

  • One-line PR description: KEP-127 (UserNS): allow customizing subids length
  • Other comments:

The number of subuids and subgids for each pod is hard-coded to 65536, regardless of the total ID count specified in `/etc/subuid` and `/etc/subgid`: https://github.com/kubernetes/kubernetes/blob/v1.32.0/pkg/kubelet/userns/userns_manager.go#L211-L228

This is not enough for some images.
Nested containerization needs a huge number of subids too.

Signed-off-by: Akihiro Suda <[email protected]>
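
For illustration, here is a minimal Go sketch, not the actual kubelet source, of the hard-coded allocation the description points at (`rangeStart` is an assumed example value standing in for wherever the kubelet's `/etc/subuid` range begins):

```go
package main

import "fmt"

// Minimal sketch of the behavior described above, NOT the actual kubelet
// source: every pod gets a fixed 65536-ID mapping, allocated as consecutive
// blocks, no matter how many IDs /etc/subuid grants in total.
const idsPerPod = 65536

// rangeStart is an assumed example value for where the kubelet's subid
// range begins on the host.
const rangeStart = 65536

// mappingForPod returns the host-side start and length of the subid block
// assigned to the pod with the given allocation index.
func mappingForPod(podIndex uint32) (hostID, length uint32) {
	return rangeStart + podIndex*idsPerPod, idsPerPod
}

func main() {
	for i := uint32(0); i < 3; i++ {
		host, length := mappingForPod(i)
		fmt.Printf("pod %d: hostID=%d length=%d\n", i, host, length)
	}
}
```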
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jan 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 5, 2025
Comment on lines +338 to +339
The mapping length (multiple of 65536) will be customizable via a new
`KubeletConfiguration` property `subidsPerPod`.
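
As a hypothetical sketch of what the quoted KEP text proposes: the `subidsPerPod` name comes from the KEP itself, but the Go type, placement, and default below are assumptions for illustration only.

```go
// Package config sketches the proposed KubeletConfiguration knob.
package config

type KubeletConfiguration struct {
	// ... existing fields elided ...

	// SubidsPerPod is the number of subordinate user and group IDs mapped
	// into each pod's user namespace. Per the KEP text, it must be a
	// multiple of 65536; 65536 is assumed here as the default when unset.
	SubidsPerPod uint32 `json:"subidsPerPod,omitempty"`
}
```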
Contributor:

I'd gotten the impression we might want to make the mapping size configurable on a per-Pod basis.

Contributor:

(what if you have a particular Pod that assigns a (POSIX) ID to each user, and you have 42000000 users, but all your other Pods only need 65000 UIDs?)

Contributor:

I think it's possible but not a common case IMO, and the implementation of adding a pod API field would be much more complex than adding a kubelet configuration field. I'm not sure the maintenance burden is worth it

Contributor:

So long as we're not accidentally tying ourselves into not being able to extend the Pod API in the future. If we are tying ourselves, let's make sure we'd never want the option.

AkihiroSuda (author):

What about introducing a Pod security context property like `securityContext.userNS.staticMappingWithUsername: "foo"`?
This would run `getsubids foo` to obtain the subID range and assign the entire range to the Pod.
(So this is different from `getsubids kubelet`, which returns the total range for the 110 pods.)

Multiple pods may use the same range, at their own risk.
This allows assigning an extremely large subID range: $2^{32}-65536$ at maximum.
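
For concreteness, a hypothetical Go API sketch of this idea; every name below (`UserNS`, `UserNSOptions`, `StaticMappingWithUsername`) is a placeholder taken from this comment, not an agreed design.

```go
// Package api sketches the pod-level field floated in the comment above.
package api

type PodSecurityContext struct {
	// ... existing fields elided ...

	UserNS *UserNSOptions `json:"userNS,omitempty"`
}

type UserNSOptions struct {
	// StaticMappingWithUsername names a host user; the kubelet would run
	// `getsubids <name>` and map that user's entire subid range into the
	// pod. Multiple pods may share the range, at their own risk.
	StaticMappingWithUsername string `json:"staticMappingWithUsername,omitempty"`
}
```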

Contributor:

Is the max ID range inside of a container flexible? As in: could we have a kubelet field that toggles a dynamic range, and have the runtime interpret the range in the image?

AkihiroSuda (author):

Flexible. A container may use a UID that is not present in `/etc/passwd` in the image, so a runtime cannot "interpret the range in the image".

It should still be possible to have OCI image annotations declare the range of needed UIDs.

Member:

How do we prevent such a field from being abused? An image could claim all the available IDs and prevent other pods from being created.

Contributor:

Admission-time checks are where I'd start; also ResourceQuota and LimitRange specifically.

AkihiroSuda (author):

> prevent other pods from being created

No; `securityContext.userNS.staticMappingWithUsername: "foo"` allows ID conflicts (so pods cannot starve each other of IDs) and requires explicit configuration of the securityContext (so an image alone cannot claim it).

This should still probably be prohibited by the Restricted Pod Security Standard.

@rata (Member) commented Jan 21, 2025

@AkihiroSuda can you please elaborate on what is needed for the use case?

We have several options; none is perfect with the info we have, but with more info on the use case we might be able to make a better decision.

For example, one option proposed here is to use bigger ranges for all pods. That might work or not, depending on how big the ranges need to be (it may be that we can't run the number of maxPods configured for the node if the ranges are very big). Another option is to use the pod.spec as you suggested, but we need to think about abuses, as @giuseppe was mentioning.

Can you share more details on the use case, so we can see what might be the best way to tackle it?
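
To make the sizing concern concrete, a back-of-the-envelope sketch: the per-pod lengths below are arbitrary example multiples of 65536, and 110 is the kubelet's default maxPods mentioned earlier in this thread.

```go
package main

import "fmt"

// Back-of-the-envelope sizing for the tradeoff above: the subid range
// granted to the kubelet must cover maxPods pods of the chosen per-pod
// length.
func main() {
	const maxPods = 110
	for _, perPod := range []uint64{65536, 2 * 65536, 64 * 65536} {
		fmt.Printf("per-pod length %8d -> kubelet needs %11d subids for %d pods\n",
			perPod, perPod*maxPods, maxPods)
	}
}
```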

@AkihiroSuda (author):

> use case

e.g., nesting containers (Docker, Kubernetes, BuildKit, whatever) inside Kubernetes without full privileges

@rata (Member) commented Jan 23, 2025

@AkihiroSuda cool! And how many IDs are needed for each?

  • Kubernetes: Currently, for nesting Kubernetes inside Kubernetes, you need 65536*2 inside the pod (for one level of nesting). Or maybe we can do it with less? I don't think so, but I'll need to double-check.
  • Docker: I guess we don't need more, right? I think 16 bits of UID space is enough for Docker?
  • BuildKit: I don't know; maybe just 65536*2 is enough?

So, let's see what the needs for each of those are and think of ways to support them. With this partial info (correct me if I'm wrong), I understand that all of them will work fine if we support a multiple of 65536 as the length. That simplifies a lot of things, so I'd like to keep it.

I'm not sure whether a kubelet config or pod.spec is the right place to choose this. Nor am I sure what granularity we want to expose for these pods with "wide mappings".

My thinking is:

  • If we need a pod.spec field, let's add a different feature gate for that, so it can progress at its own rhythm.
  • If we need a kubelet config, I think we don't need a feature gate for that.

What do you think?

My gut feeling is that `subidsPerPod` as a kubelet config can get the job done here. It's hard for me to see whether it will fall short in the future, though, so more opinions are very welcome :)

@haircommander (Contributor):

I still feel kubelet config is sufficient personally

@sftim (Contributor) commented Jan 23, 2025

> I still feel kubelet config is sufficient personally

Is there a way to do that (e.g., a config field named `defaultSubidsPerPod`) whilst leaving the door open for varying the count per Pod one day? Moving forward is good; open doors even better.

If you've got one Pod that wants to use UID 999999999, it's a shame if you also have to give at least that many UIDs to every other Pod. It's painful even if you dedicate a couple of nodes to that component and run the other nodes with 65536.

@rata (Member) commented Jan 24, 2025

@haircommander You are right. Seeing the use cases and sleeping on it last night, I agree: that should be more than enough for now, and the door is open if down the road we need to add a pod.spec field.

@rata (Member) left a comment:

LGTM.

@giuseppe what do you think?

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AkihiroSuda, rata
Once this PR has been reviewed and has the lgtm label, please assign mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander (Contributor):

/lgtm

FYI to reviewers: I was also hoping to move this KEP to on-by-default beta in 1.33.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025