DRA: attach devices to nodes #5007

Open
4 tasks
pohly opened this issue Dec 19, 2024 · 6 comments
Labels
sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@pohly
Contributor

pohly commented Dec 19, 2024

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 19, 2024
@pohly
Contributor Author

pohly commented Dec 19, 2024

/assign @KobayashiD27

As discussed in kubernetes/kubernetes#124042 (comment).

/sig scheduling
/wg device-management

@k8s-ci-robot
Contributor

@pohly: GitHub didn't allow me to assign the following users: KobayashiD27.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 19, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Dec 19, 2024
@KobayashiD27

Thank you for creating the issue. I will post a draft KEP as soon as possible.

@KobayashiD27

@pohly

To facilitate discussion of the KEP, we would like to share the design of the composable controller that we are considering as a component that uses the fabric-oriented scheduler function. Sharing it should help us settle on the best implementation of the scheduler function. We would also like to verify that the controller design matches the DRA design.

Background

Our controller's goal is to use fabric devices efficiently. We therefore prefer to allocate devices directly connected to the node over attached fabric devices (in order of preference: node-local devices > attached fabric devices > pre-attached fabric devices).
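As a rough illustration of the preference order in the Background paragraph, the sketch below ranks candidate devices by locality. The `Locality` enum, the `Candidate` type, the `rank` helper, and the device and pool names are all hypothetical and exist only for this example; the order of the three categories is copied as written above, and this is not the scheduler's actual scoring logic.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    """Lower value means more preferred; order copied from the Background text."""
    NODE_LOCAL = 0
    ATTACHED_FABRIC = 1
    PRE_ATTACHED_FABRIC = 2


@dataclass(frozen=True)
class Candidate:
    name: str           # hypothetical device name
    pool: str           # hypothetical pool name
    locality: Locality  # assumed to be known to the caller


def rank(candidates):
    """Sort candidates by preference, breaking ties by device name."""
    return sorted(candidates, key=lambda c: (c.locality, c.name))


candidates = [
    Candidate("fabric-gpu-a", "fabric-pool", Locality.ATTACHED_FABRIC),
    Candidate("local-gpu", "node-pool", Locality.NODE_LOCAL),
    Candidate("fabric-gpu-b", "fabric-pool", Locality.PRE_ATTACHED_FABRIC),
]
ranked_names = [c.name for c in rank(candidates)]
print(ranked_names)  # ['local-gpu', 'fabric-gpu-a', 'fabric-gpu-b']
```

How a device's locality would be derived from published ResourceSlices is left open here; the sketch takes it as an input.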

Design Overview

This design aims to efficiently utilize fabric devices, prioritizing node-local devices to improve performance. The composable controller manages fabric devices that can be attached and detached. Therefore, it publishes a list of fabric devices as ResourceSlices.

The structure we are considering is as follows:

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device1
  ...
  - name: device2
  ...

The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as ResourceSlices.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...

Here, when the scheduler selects the fabric device device1, it waits during PreBind for the device to be attached. The composable controller detects the pending attachment by checking a flag on the ResourceClaim and then performs the attachment. After a successful attachment, the composable controller updates the flag on the ResourceClaim.
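The handshake described above can be sketched as a small simulation. Everything here is an assumption made for illustration: the flag values `AttachmentPending` and `Attached`, the `Claim` and `ComposableController` types, and the `pre_bind` helper are hypothetical, and the controller is called synchronously from the wait loop only to keep the example self-contained (in a cluster it would run asynchronously and react to ResourceClaim updates).

```python
from dataclasses import dataclass

# Hypothetical flag values; the real field name and values are not defined in this thread.
PENDING = "AttachmentPending"
ATTACHED = "Attached"


@dataclass
class Claim:
    name: str
    device: str
    node: str
    flag: str = PENDING


class ComposableController:
    def __init__(self, attach):
        # attach is a callable(device, node) -> bool standing in for the fabric API.
        self._attach = attach

    def reconcile(self, claim):
        """Attach the device if the claim is still pending, then flip the flag."""
        if claim.flag != PENDING:
            return
        if self._attach(claim.device, claim.node):
            claim.flag = ATTACHED


def pre_bind(claim, controller, max_rounds=3):
    """Scheduler side: wait (here, poll) until the flag flips or give up."""
    for _ in range(max_rounds):
        if claim.flag == ATTACHED:
            return True
        controller.reconcile(claim)  # simulation only; asynchronous in practice
    return claim.flag == ATTACHED


attempts = []


def fake_attach(device, node):
    attempts.append((device, node))
    return len(attempts) >= 2  # first attempt fails, second succeeds


claim = Claim(name="claim-a", device="device1", node="Node1")
ok = pre_bind(claim, ComposableController(fake_attach))
print(ok, claim.flag, len(attempts))  # True Attached 2
```

If the attachment never succeeds within `max_rounds`, `pre_bind` returns `False` and the flag stays pending, which is where a real implementation would have to decide between retrying and unreserving the Pod.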

We are considering the following two methods for handling ResourceSlices once the attachment completes. We would like to hear your opinions on these two composable controller proposals and on their feasibility.

Proposal 1: The composable controller publishes ResourceSlices with NodeName set within the pool

Multiple ResourceSlices are published with the same pool name. One indicates the devices included in the fabric, and the other indicates the devices attached to the node.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...
---
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device1
  ...

If the vendor's plugin responds to hotplug, device1 will appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This may cause device duplication issues between ResourceSlices. To prevent multiple ResourceSlices from publishing duplicate devices, we plan to define a deny list and standardize it with DRA.
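To make the duplication concrete, here is a minimal sketch that models the three Proposal 1 slices as plain dictionaries, detects devices published under more than one pool, and filters the vendor slice with a deny list. The dictionary layout, the `find_duplicates` and `apply_deny_list` helpers, and the idea that the vendor plugin is the component that consults the deny list are assumptions for illustration; the thread does not specify where the deny list lives or how it is evaluated.

```python
from collections import defaultdict


def find_duplicates(slices):
    """Map device name -> sorted pool names for devices published in more than one pool."""
    pools_by_device = defaultdict(set)
    for s in slices:
        for device in s["devices"]:
            pools_by_device[device].add(s["pool"])
    return {d: sorted(p) for d, p in pools_by_device.items() if len(p) > 1}


def apply_deny_list(slice_, denied):
    """Return a copy of the slice without the denied devices."""
    kept = [d for d in slice_["devices"] if d not in denied]
    return {**slice_, "devices": kept}


# Simplified stand-ins for the ResourceSlices shown in Proposal 1.
fabric_slice = {"pool": "composable-device", "nodeSelector": "fabric1", "devices": ["device2"]}
attached_slice = {"pool": "composable-device", "nodeName": "Node1", "devices": ["device1"]}
vendor_slice = {"pool": "Node1", "nodeName": "Node1", "devices": ["device3", "device1"]}

duplicates = find_duplicates([fabric_slice, attached_slice, vendor_slice])
print(duplicates)  # {'device1': ['Node1', 'composable-device']}

# Assumption: the deny list names devices the vendor plugin must not publish.
filtered_vendor = apply_deny_list(vendor_slice, denied=set(duplicates))
print(filtered_vendor["devices"])  # ['device3']
```

After filtering, rerunning `find_duplicates` on the three slices returns an empty mapping, so each device is published by exactly one pool.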

Advantages

  • Neither the scheduler nor the composable controller needs to change the allocationResult.
  • Can distinguish attached fabric devices and maintain prioritization.

Disadvantages

Proposal 2: Attached devices are published by the vendor's plugin

In this case, devices are removed from the composable-device pool.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...

If the vendor's plugin responds to hotplug, device1 will appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This breaks the linkage between ResourceClaim and ResourceSlice. Therefore, it is necessary to modify the AllocationResult of the ResourceClaim.
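The modification described here can be pictured as re-pointing the allocated device from the fabric pool to the node's pool. The sketch below uses a simplified `AllocatedDevice` record with driver, pool, and device fields, and a hypothetical `relink` helper; which component would perform this rewrite, and how it would be written back to the ResourceClaim, is exactly the open question and is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AllocatedDevice:
    """Simplified stand-in for one entry of a ResourceClaim's allocation result."""
    driver: str
    pool: str
    device: str


def relink(allocation, vendor_slice):
    """Re-point the allocation to the vendor pool once that slice lists the device.

    Returns the allocation unchanged if the driver differs or the device has
    not appeared in the vendor slice yet.
    """
    same_driver = vendor_slice["driver"] == allocation.driver
    if same_driver and allocation.device in vendor_slice["devices"]:
        return replace(allocation, pool=vendor_slice["pool"])
    return allocation


allocation = AllocatedDevice(driver="gpu.nvidia.com", pool="composable-device", device="device1")

before_hotplug = {"driver": "gpu.nvidia.com", "pool": "Node1", "devices": ["device3"]}
after_hotplug = {"driver": "gpu.nvidia.com", "pool": "Node1", "devices": ["device3", "device1"]}

unchanged = relink(allocation, before_hotplug)
relinked = relink(allocation, after_hotplug)
print(unchanged.pool, relinked.pool)  # composable-device Node1
```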

Advantages

  • Simplifies device management.
  • Centralizes management as the vendor's plugin directly publishes devices.
  • No need for mechanisms to prevent device duplication (e.g., deny list).

Disadvantages

  • Cannot distinguish attached fabric devices, making prioritization difficult.
  • Requires modification of the linkage between ResourceClaim and ResourceSlice (expected to be done by the scheduler or the DRA controller; which is more appropriate?).
  • Until the linkage is fixed, the device being used may be published as a ResourceSlice and reserved by other Pods.
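The last disadvantage is a timing window, which a toy availability check makes visible. The `looks_free` helper and the set of `(pool, device)` pairs are hypothetical simplifications of how an allocator might decide whether a device is already in use; they are not the scheduler's real bookkeeping.

```python
def looks_free(pool, device, allocated):
    """Naive availability check keyed on (pool, device)."""
    return (pool, device) not in allocated


# The claim was allocated while device1 was still published in the fabric pool.
allocated = {("composable-device", "device1")}

# After hotplug, the vendor plugin republishes device1 under pool Node1.
# With the stale link, the device looks free although a Pod is using it.
stale = looks_free("Node1", "device1", allocated)

# Once the allocation is re-pointed to Node1, the same check reports it as taken.
allocated = {("Node1", "device1")}
fixed = looks_free("Node1", "device1", allocated)

print(stale, fixed)  # True False
```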

We would appreciate your feedback and insights on these proposals to ensure the optimal implementation of the scheduler function and alignment with the DRA design.

@pohly
Contributor Author

pohly commented Dec 19, 2024

Let's keep the discussion in this issue shorter. You can now put all of this, including the alternatives, into the KEP document.

@KobayashiD27

@pohly
Could you please link this PR #5012 to "KEP update PR"?

Projects
Status: 🏗 In progress
Status: Needs Triage
Development

No branches or pull requests

3 participants