low latency mode bandwidth #148
Maybe you can look at the bandwidth monitoring of the physical RNIC and see if all the network cards are used.
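A minimal sketch of one way to do that monitoring, not from this issue: poll the per-port RDMA traffic counters the kernel exposes in sysfs and print per-NIC rates, so you can see whether every physical RNIC carries traffic during the test. The device names (mlx5_0..mlx5_7) and port number are assumptions; adjust for your setup.

```python
# Sketch (assumptions: RNICs named mlx5_0..mlx5_7, port 1).
# port_xmit_data / port_rcv_data are counted in units of 4 bytes.
import time

DEVICES = [f"mlx5_{i}" for i in range(8)]  # assumed RNIC names
PORT = 1
INTERVAL = 1.0  # seconds between samples

def read_counter(dev, name):
    path = f"/sys/class/infiniband/{dev}/ports/{PORT}/counters/{name}"
    with open(path) as f:
        return int(f.read())

while True:
    before = {d: (read_counter(d, "port_xmit_data"), read_counter(d, "port_rcv_data"))
              for d in DEVICES}
    time.sleep(INTERVAL)
    for d in DEVICES:
        tx0, rx0 = before[d]
        tx1, rx1 = read_counter(d, "port_xmit_data"), read_counter(d, "port_rcv_data")
        # counters are in 4-byte units; convert the delta to Gbps
        tx_gbps = (tx1 - tx0) * 4 * 8 / INTERVAL / 1e9
        rx_gbps = (rx1 - rx0) * 4 * 8 / INTERVAL / 1e9
        print(f"{d}: TX {tx_gbps:6.1f} Gbps  RX {rx_gbps:6.1f} Gbps")
```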
From the normal-kernel result, it seems you have 8x400 Gbps RDMA NICs, but LL mode seems to have significant congestion. Is this an IB network or RoCE?
We are using RoCE. Do you have any suggestions in this regard, such as tools or benchmark scripts?
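One common point-to-point baseline, not specific to this issue, is the perftest suite (ib_write_bw / ib_read_bw). A minimal sketch that drives one ib_write_bw client per local RNIC is below; the peer hostname, device names, and port numbering are assumptions, and the matching ib_write_bw servers must already be listening on the other H20 node.

```python
# Sketch (assumptions: peer hostname, mlx5_0..mlx5_7 device names, one server
# per device already running on the peer starting at BASE_PORT).
import subprocess

PEER = "peer-hostname"                      # assumed peer H20 node
DEVICES = [f"mlx5_{i}" for i in range(8)]   # assumed RNIC names
BASE_PORT = 18515                           # one TCP port per device pair

procs = []
for i, dev in enumerate(DEVICES):
    cmd = [
        "ib_write_bw",
        "-d", dev,                # RDMA device to test
        "-p", str(BASE_PORT + i), # distinct port per device pair
        "-s", "65536",            # message size in bytes
        "-D", "10",               # run for 10 seconds
        "--report_gbits",         # report bandwidth in Gbit/s
        PEER,
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```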
Modify the run test file:
From your previous result on the normal kernel, I assume you have 8x400 Gbps RoCE NICs, for a total BW of 3.2 Tbps. What is the topology of these two H20 nodes: are all ports on the same switch? How is ECN configured? You may try to tune your network and the congestion control (CC) on the NICs.
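As a starting point for the ECN check, a minimal read-only sketch (not from this issue) that reports the per-priority RoCE ECN enable bits the mlx5 driver exposes under sysfs, so you can confirm the notification-point (np) and reaction-point (rp) sides are enabled for the priority your RoCE traffic uses. The sysfs layout and interface names are assumptions and may differ across driver/firmware versions.

```python
# Sketch (assumption: mlx5-style sysfs layout /sys/class/net/<iface>/ecn/...).
import glob
import os

for iface_ecn in sorted(glob.glob("/sys/class/net/*/ecn")):
    iface = iface_ecn.split("/")[4]
    for side in ("roce_np", "roce_rp"):
        states = []
        for prio in range(8):
            path = os.path.join(iface_ecn, side, "enable", str(prio))
            try:
                with open(path) as f:
                    states.append(f.read().strip())
            except OSError:
                states.append("?")
        print(f"{iface} {side} enable per priority 0-7: {' '.join(states)}")
```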
In an H20 2-node setup, why is the bandwidth of test_internode much higher than that of test_low_latency?
test_internode.py log, number of tokens is 4096:
[tuning] Best combine: SMs 24, NVL chunk 4, RDMA chunk 32: 62.15 GB/s (RDMA), 205.04 GB/s (NVL)
test_low_latency.py log, number of tokens is 128:
[rank 2] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=83610.61 us, min_t=924.03 us, max_t=377962.68 us
[rank 4] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=83610.93 us, min_t=644.26 us, max_t=377964.57 us
[rank 7] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=83611.20 us, min_t=920.80 us, max_t=377974.52 us
[rank 1] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=83612.06 us, min_t=778.56 us, max_t=377976.93 us
[rank 5] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=84203.44 us, min_t=785.44 us, max_t=395137.12 us
[rank 6] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=84464.68 us, min_t=774.43 us, max_t=402746.15 us
[rank 3] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=86244.62 us, min_t=665.34 us, max_t=454342.25 us
[rank 0] Dispatch + combine bandwidth: 0.26 GB/s, avg_t=86347.94 us, min_t=703.68 us, max_t=413629.70 us
[rank 4] Dispatch bandwidth: 0.10 GB/s, avg_t=73857.00 us | Combine bandwidth: 1.24 GB/s, avg_t=11727.00 us
[rank 7] Dispatch bandwidth: 0.11 GB/s, avg_t=68931.00 us | Combine bandwidth: 0.88 GB/s, avg_t=16481.00 us
[rank 0] Dispatch bandwidth: 0.13 GB/s, avg_t=57557.00 us | Combine bandwidth: 0.50 GB/s, avg_t=29269.00 us
[rank 2] Dispatch bandwidth: 0.10 GB/s, avg_t=74740.00 us | Combine bandwidth: 1.36 GB/s, avg_t=10686.00 us
[rank 5] Dispatch bandwidth: 0.10 GB/s, avg_t=76297.00 us | Combine bandwidth: 1.59 GB/s, avg_t=9131.00 us
[rank 1] Dispatch bandwidth: 0.16 GB/s, avg_t=48329.00 us | Combine bandwidth: 0.39 GB/s, avg_t=37488.00 us
[rank 6] Dispatch bandwidth: 0.12 GB/s, avg_t=63882.00 us | Combine bandwidth: 0.57 GB/s, avg_t=25385.00 us
[rank 3] Dispatch bandwidth: 0.10 GB/s, avg_t=73351.00 us | Combine bandwidth: 1.07 GB/s, avg_t=13643.00 us
[rank 7] Dispatch send/recv time: 1088.61 us | Combine send/recv time: 1319.52 us
[rank 4] Dispatch send/recv time: 229.82 us | Combine send/recv time: 333.26 us
[rank 3] Dispatch send/recv time: 1100.09 us | Combine send/recv time: 1346.97 us
[rank 6] Dispatch send/recv time: 308.76 us | Combine send/recv time: 381.72 us
[rank 2] Dispatch send/recv time: 343.36 us | Combine send/recv time: 445.31 us
[rank 1] Dispatch send/recv time: 728.29 us | Combine send/recv time: 924.38 us
[rank 5] Dispatch send/recv time: 822.57 us | Combine send/recv time: 1049.78 us
[rank 0] Dispatch send/recv time: 29.04 us | Combine send/recv time: 29.63 us
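As a rough, hedged back-of-the-envelope check (the hidden size of 7168 and BF16 payload are assumptions, not taken from the logs above): at 128 tokens the per-rank dispatch payload is only a couple of MB, which would take tens of microseconds at 400 Gbps line rate, so averages around 80,000 us point to stalls or congestion rather than a link-bandwidth limit.

```python
# Rough arithmetic sketch (assumptions: hidden = 7168, BF16 = 2 bytes per element;
# neither value comes from the logs above).
num_tokens = 128
hidden = 7168          # assumed hidden dimension
bytes_per_elem = 2     # assumed BF16 payload

payload_bytes = num_tokens * hidden * bytes_per_elem
link_gbps = 400                                   # single 400 Gbps RoCE port
wire_time_us = payload_bytes * 8 / (link_gbps * 1e9) * 1e6

print(f"payload ~= {payload_bytes / 1e6:.2f} MB")          # ~1.84 MB
print(f"wire time at {link_gbps} Gbps ~= {wire_time_us:.1f} us")  # ~37 us
# Compare with the ~80,000 us averages in the logs: the gap suggests the
# low-latency runs are stalling, not bandwidth-bound.
```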