some questions in ibgda #113

Open
Thunderbrook opened this issue Apr 8, 2025 · 2 comments
Comments

@Thunderbrook

Thunderbrook commented Apr 8, 2025

Hi, I have some questions about the following code in ibgda:

void ibgda_submit_requests(nvshmemi_ibgda_device_qp_t *qp, uint64_t base_wqe_idx,
                           uint32_t num_wqes, int message_idx = 0) {
    nvshmemi_ibgda_device_qp_management_t *mvars = &qp->mvars;
    uint64_t new_wqe_idx = base_wqe_idx + num_wqes;

    // WQE writes must be finished first
    __threadfence();    // (1)

    // Wait for prior WQE slots to be filled first
    auto *ready_idx = reinterpret_cast<unsigned long long int*>(&mvars->tx_wq.ready_head);
    while (atomicCAS(ready_idx, base_wqe_idx, new_wqe_idx) != base_wqe_idx);     // (2)

    // Always post, not in batch
    constexpr int kNumRequestInBatch = 4;
    if (kAlwaysDoPostSend or (message_idx + 1) % kNumRequestInBatch == 0)
        ibgda_post_send(qp, new_wqe_idx);
}

(1) As I understand it, the __threadfence() here ensures that the WQE writes and the doorbell (DB) write are not reordered. From the NIC's point of view, should __threadfence_system() be used instead?
(2) As I understand it, every thread calling atomicCAS passes different compare and swap values, so is an atomic operation really necessary here?

Thanks

@sphish
Collaborator

sphish commented Apr 8, 2025

The ibgda_submit_requests code here is a simplified version of the NVSHMEM implementation, and I am also not entirely clear on the synchronization semantics required when GPUs and NICs interact. However, NVSHMEM implements it this way, and I believe that, as an internal NVIDIA team, its developers have more insight into these details and can ensure that this approach is safe. The following is just my personal understanding:

(1) If we treat the NIC as another GPU device, the __threadfence() here is indeed not strong enough to guarantee that the written WQE is visible to other GPUs. However, I think there may be special mechanisms when the NIC reads WQEs, such as always bypassing the cache, in which case __threadfence() would be sufficient.

(2) I think you are right: the atomic operation is not necessary. We can give it a try.
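
To make the reasoning in (2) concrete, here is a minimal host-side sketch in plain C++ (std::thread and std::atomic, not device code and not taken from NVSHMEM or DeepEP). It models ready_head as an atomic counter and lets each thread publish a disjoint range [base, base + num) either with a CAS loop, as in the snippet above, or with a spin on a load followed by a plain store. All names (ModelQueue, publish_with_cas, publish_with_store, submit, run_model) are hypothetical. The sketch only illustrates the ordering argument under the C++ memory model; it says nothing about what the NIC observes, which is question (1), and whether a load/store pair is sufficient on the GPU is an assumption that would need to be checked there.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical host-side model of the ready_head hand-off.
struct ModelQueue {
    std::atomic<uint64_t> ready_head{0};  // stands in for mvars->tx_wq.ready_head
    std::vector<uint32_t> wqe;            // 0 = empty slot, 1 = filled slot
    explicit ModelQueue(std::size_t n) : wqe(n, 0) {}
};

// Variant A: mirrors the CAS loop in the quoted snippet.
inline void publish_with_cas(ModelQueue &q, uint64_t base, uint64_t next) {
    uint64_t expected = base;
    while (!q.ready_head.compare_exchange_weak(expected, next,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
        expected = base;  // a failed compare_exchange overwrites `expected`
        std::this_thread::yield();
    }
}

// Variant B: wait until it is our turn, then store. Because the ranges are
// disjoint and ready_head only grows, at most one thread can ever observe
// ready_head == base, so no other thread writes between the load and the store.
inline void publish_with_store(ModelQueue &q, uint64_t base, uint64_t next) {
    while (q.ready_head.load(std::memory_order_acquire) != base) {
        std::this_thread::yield();
    }
    q.ready_head.store(next, std::memory_order_release);
}

// Fills this thread's slots, publishes them, then checks the invariant
// "every slot below ready_head is filled". Returns false on a violation.
inline bool submit(ModelQueue &q, uint64_t base, uint32_t num, bool use_cas) {
    for (uint64_t i = base; i < base + num; ++i) q.wqe[i] = 1;  // "write WQEs"
    // Rough host analogue of the fence marked (1) in the snippet.
    std::atomic_thread_fence(std::memory_order_release);
    if (use_cas) publish_with_cas(q, base, base + num);
    else         publish_with_store(q, base, base + num);
    for (uint64_t i = 0; i < base + num; ++i)
        if (q.wqe[i] != 1) return false;
    return true;
}

struct ModelResult { uint64_t final_head; uint32_t violations; };

inline ModelResult run_model(bool use_cas, uint32_t num_threads, uint32_t wqes_per_thread) {
    ModelQueue q(static_cast<std::size_t>(num_threads) * wqes_per_thread);
    std::atomic<uint32_t> violations{0};
    std::vector<std::thread> workers;
    // Launch in reverse so later ranges usually have to wait for earlier ones.
    for (uint32_t t = num_threads; t-- > 0;) {
        workers.emplace_back([&q, &violations, t, wqes_per_thread, use_cas] {
            const uint64_t base = static_cast<uint64_t>(t) * wqes_per_thread;
            if (!submit(q, base, wqes_per_thread, use_cas)) violations.fetch_add(1);
        });
    }
    for (auto &w : workers) w.join();
    return {q.ready_head.load(), violations.load()};
}
```

Calling run_model(true, 8, 4) exercises the CAS variant and run_model(false, 8, 4) the load/store variant; in both cases each thread can only advance ready_head once every earlier range has been published. The CAS form folds the wait and the update into one operation, which is likely why it reads naturally in device code even where a separate load and store would also work.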

@Thunderbrook
Author

Got it, thanks a lot for the reply!
