InfiniBand Overview
Introduction
InfiniBand is an open-standard network communications protocol developed by the InfiniBand Trade Association (IBTA) - www.infinibandta.org
For a deep dive into InfiniBand concepts, see the Mellanox InfiniBand FAQ and the Introduction to InfiniBand Whitepaper. Note that these documents are from around 2014, so some throughput specifications may be outdated, but the core concepts remain relevant.
Prominent members of the IBTA are as follows:
- Nvidia
- Intel
- IBM
- Oracle
- HPE
InfiniBand is a high-throughput, low-latency networking specification used to interconnect servers, switches, storage, and embedded systems.
It is heavily used in artificial intelligence and data science because it supports very high bandwidth with low latency and strong scalability and flexibility.
Bandwidth Details
InfiniBand originally offered 10Gb/s starting in 2002 and has grown to 1600Gb/s as of 2025. InfiniBand has always been non-blocking and bidirectional (full-duplex).
- 10Gb/s SDR (2002)
- 40Gb/s QDR (2008)
- 56Gb/s FDR (2011)
- 100Gb/s EDR (2015)
- 200Gb/s HDR (2018)
- 400Gb/s NDR (2021)
- 800Gb/s XDR (2023)
- 1600Gb/s GDR (2025)
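The headline rates above are aggregate rates for 4-lane ports; the usable data rate also depends on the line encoding (8b/10b for the early generations, 64b/66b from FDR onward). A minimal Python sketch of that arithmetic — the per-lane rates and encodings used here are the commonly published values, shown for illustration:

```python
# Effective data rate of a 4x InfiniBand port, accounting for line encoding.
# Per-lane signaling rates (Gb/s) are the commonly published figures.
GENERATIONS = {
    # name: (per-lane signaling Gb/s, data bits, encoded bits)
    "SDR": (2.5, 8, 10),       # 8b/10b encoding (20% overhead)
    "QDR": (10.0, 8, 10),      # 8b/10b encoding
    "FDR": (14.0625, 64, 66),  # 64b/66b encoding (~3% overhead)
    "EDR": (25.78125, 64, 66),
}

def effective_rate(gen: str, lanes: int = 4) -> float:
    """Usable data rate in Gb/s for an aggregated port."""
    signal, data_bits, raw_bits = GENERATIONS[gen]
    return lanes * signal * data_bits / raw_bits

print(round(effective_rate("QDR"), 1))  # 32.0 -> usable rate of a "40Gb/s" QDR port
print(round(effective_rate("EDR"), 1))  # 100.0 -> 64b/66b is nearly lossless
```

This is why a "40Gb/s" QDR port delivers roughly 32Gb/s of payload bandwidth, while EDR and later generations deliver close to their marketed rates.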
Nvidia, for example, currently ships NDR 400Gb/s fabrics and will continue to increase speeds as newer specifications are released by the IBTA.
Port Structure
InfiniBand port bandwidth is achieved by aggregating multiple physical lanes. A port can support up to 12 physical lanes but is typically implemented with 4.
For example, an HDR port contains 4 physical lanes, each with a bidirectional bandwidth of 50Gb/s (200Gb/s total).
graph LR
%%{init: {'theme': 'base', 'themeVariables': { 'edgeLabelBackground': '#ffffff'}}}%%
A[HDR Port<br/>200Gb/s Total] -->|Lane 1<br/>50Gb/s| L1[Physical Lane 1]
A -->|Lane 2<br/>50Gb/s| L2[Physical Lane 2]
A -->|Lane 3<br/>50Gb/s| L3[Physical Lane 3]
A -->|Lane 4<br/>50Gb/s| L4[Physical Lane 4]
style A fill:#e0f2fe,stroke:#0369a1,stroke-width:2px,color:#000000
style L1 fill:#d1fae5,stroke:#10b981,color:#000000
style L2 fill:#d1fae5,stroke:#10b981,color:#000000
style L3 fill:#d1fae5,stroke:#10b981,color:#000000
style L4 fill:#d1fae5,stroke:#10b981,color:#000000
Diagram: An HDR port aggregates four physical lanes, each capable of 50Gb/s in both directions, totaling 200Gb/s full-duplex bandwidth.
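The lane arithmetic above generalizes to other port widths. A small sketch — the 1x/4x/8x/12x widths are the ones the specification defines, and the 12x HDR figure is shown only as a hypothetical:

```python
def port_bandwidth(per_lane_gbps: float, lanes: int = 4) -> float:
    """Aggregate per-direction bandwidth of an InfiniBand port."""
    if lanes not in (1, 4, 8, 12):  # port widths defined by the IB spec
        raise ValueError(f"unsupported lane count: {lanes}")
    return per_lane_gbps * lanes

print(port_bandwidth(50.0))      # HDR 4x -> 200.0 Gb/s
print(port_bandwidth(50.0, 12))  # hypothetical 12x HDR -> 600.0 Gb/s
```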
InfiniBand components
- HCA (Host Channel Adapter) - InfiniBand network adapter installed in a server.
- Switch - InfiniBand switch that moves packets within an IB subnet.
- Router - InfiniBand router that moves packets between IB subnets.
- Gateway - Device that bridges an IB fabric to an Ethernet network, enabling IB hosts to communicate with Ethernet hosts.
- Subnet Manager - Software that discovers and manages InfiniBand Nodes and Links within a subnet.
Low Latency Overview
Latencies down to roughly 1000 nanoseconds can be achieved through IB-specific acceleration and offloading mechanisms such as RDMA (Remote Direct Memory Access). RDMA allows applications to bypass the OS kernel entirely and access memory directly: the IB HCA moves data between application-owned memory buffers, removing the need for the kernel to manage the transfer. With GPUDirect RDMA, data can even be transferred directly between GPU memories, managed entirely by the HCA.
flowchart TB
%%{init: {'theme': 'base', 'themeVariables': { 'edgeLabelBackground': '#ffffff'}}}%%
subgraph H1[Host 1]
direction TB
H1APP[Application]
H1K[Kernel]
H1HCA[HCA]
H1APP -.->|Traditional| H1K -.-> H1HCA
end
subgraph H2[Host 2]
direction TB
H2APP[Application]
H2K[Kernel]
H2HCA[HCA]
H2HCA -.-> H2K -.->|Traditional| H2APP
end
H1APP ==>|RDMA Bypass| H1HCA
H1HCA <-->|IB Fabric| H2HCA
H2HCA ==>|RDMA Bypass| H2APP
style H1APP fill:#f7fee7,stroke:#65a30d,color:#000000
style H1K fill:#fee2e2,stroke:#db2777,stroke-dasharray: 5,color:#000000
style H1HCA fill:#dbeafe,stroke:#2563eb,color:#000000
style H2APP fill:#f7fee7,stroke:#65a30d,color:#000000
style H2K fill:#fee2e2,stroke:#db2777,stroke-dasharray: 5,color:#000000
style H2HCA fill:#dbeafe,stroke:#2563eb,color:#000000
Diagram: The traditional path (dotted lines) routes data through the OS kernel. With RDMA (thick arrows), the application bypasses the kernel entirely and communicates directly with the HCA.
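To make the contrast concrete, here is a toy sketch of the stages each path traverses (the stage names are illustrative, not an exact driver-level trace):

```python
# Toy model: stages data passes through on a traditional socket path
# versus an RDMA kernel-bypass path. Fewer stages means fewer copies,
# fewer context switches, and lower latency.
TRADITIONAL_PATH = [
    "app buffer -> kernel socket buffer",  # copy + syscall
    "kernel -> NIC",                       # driver involvement
    "wire",
    "NIC -> kernel socket buffer",
    "kernel -> app buffer",                # copy + process wakeup
]
RDMA_PATH = [
    "app buffer -> HCA",  # HCA DMAs directly from registered memory
    "wire",
    "HCA -> app buffer",  # HCA DMAs directly into registered memory
]

print(len(TRADITIONAL_PATH), "stages vs", len(RDMA_PATH), "stages")
```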
Subnet Manager
An InfiniBand fabric requires a running Subnet Manager: software that discovers, configures, and manages the fabric.
The Subnet Manager is responsible for the following:
- Node and Link Discovery.
- Local identifier (LID) assignments (similar to MAC addresses in Ethernet).
- Routing table calculations and deployments.
- Configuring node and port parameters such as the QoS policy.
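The first three responsibilities can be sketched in miniature — a toy Subnet Manager (illustrative only; the fabric topology, LID scheme, and routing algorithm here are simplified assumptions) that assigns LIDs to discovered nodes and computes a shortest-path forwarding table per switch:

```python
from collections import deque

# Tiny hypothetical subnet: two switches, three HCAs (adjacency list).
FABRIC = {
    "sw1": ["hca1", "hca2", "sw2"],
    "sw2": ["sw1", "hca3"],
    "hca1": ["sw1"], "hca2": ["sw1"], "hca3": ["sw2"],
}

def assign_lids(fabric):
    """Give every discovered node a unique local identifier (LID)."""
    return {node: lid for lid, node in enumerate(sorted(fabric), start=1)}

def next_hops(fabric, src):
    """BFS from src: for each destination, the first hop to forward to."""
    table, seen = {}, {src}
    queue = deque((nbr, nbr) for nbr in fabric[src])
    while queue:
        node, first_hop = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        table[node] = first_hop
        queue.extend((nbr, first_hop) for nbr in fabric[node])
    return table

print(assign_lids(FABRIC)["hca1"])       # 1
print(next_hops(FABRIC, "sw1")["hca3"])  # sw2 -> hca3 is reached via sw2
```

A real Subnet Manager such as OpenSM does this over the fabric's management datagrams, with far more sophisticated routing engines.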
Scalability
A single InfiniBand subnet can scale up to roughly 48k nodes, and fabrics can grow beyond this limit by connecting multiple subnets with an InfiniBand router.
flowchart TB
%%{init: {'theme': 'base', 'themeVariables': { 'edgeLabelBackground': '#ffffff'}}}%%
subgraph S1 [Subnet 1]
direction TB
N1[Node 1]
N2[Node 2]
N3[... up to 48k Nodes ...]
SM1[Subnet Manager]
end
subgraph S2 [Subnet 2]
direction TB
N4[Node 1]
N5[Node 2]
N6[... up to 48k Nodes ...]
SM2[Subnet Manager]
end
Router[InfiniBand Router]
S1 <--> Router <--> S2
style S1 fill:#f0fdf4,stroke:#16a34a,color:#000000
style S2 fill:#f0fdf4,stroke:#16a34a,color:#000000
style Router fill:#fef9c3,stroke:#ca8a04,stroke-width:2px,color:#000000
style N1 fill:#ffffff,stroke:#374151,color:#000000
style N2 fill:#ffffff,stroke:#374151,color:#000000
style N3 fill:#ffffff,stroke:#374151,color:#000000
style SM1 fill:#ffffff,stroke:#374151,color:#000000
style N4 fill:#ffffff,stroke:#374151,color:#000000
style N5 fill:#ffffff,stroke:#374151,color:#000000
style N6 fill:#ffffff,stroke:#374151,color:#000000
style SM2 fill:#ffffff,stroke:#374151,color:#000000
Diagram: Scaling beyond the 48k node subnet limit by connecting multiple subnets via an InfiniBand Router.
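The 48k figure falls out of the LID address space: LIDs are 16-bit values, and only part of the range is available for unicast addressing. A quick check of the arithmetic:

```python
# LIDs are 16-bit. The unicast range is 0x0001-0xBFFF: 0x0000 is
# reserved, and 0xC000-0xFFFE is the multicast range.
UNICAST_FIRST, UNICAST_LAST = 0x0001, 0xBFFF
unicast_lids = UNICAST_LAST - UNICAST_FIRST + 1
print(unicast_lids)  # 49151, commonly rounded to "48k"
```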
Adaptive Routing
Adaptive Routing is enabled on all Nvidia (Mellanox) IB switches and offers several capabilities.
Link Failure Recovery
The Subnet Manager computes the routing table and pushes it out to the IB switches. A link failure can cause up to 5 seconds of downtime before the Subnet Manager recalculates a new routing topology. Nvidia IB switches can handle such a failure almost immediately through Adaptive Routing, reducing recovery time to about 1ms. This feature is called SHIELD (Self-Healing Interconnect Enhancement for Intelligent Datacenters) and is also referred to as Fast Link Fault Recovery (FLFR) in Nvidia documentation.
For more details, see How To Configure Adaptive Routing and Self-Healing Networking.
Load Balancing
Nvidia switches support dynamic load balancing, which can achieve better fabric utilization than static ECMP routing. It is provided by the Adaptive Routing feature enabled on the switches and is managed centrally by the Adaptive Routing Manager, analogous to the Subnet Manager.
QoS is implemented across the fabric by defining I/O channels at the HCA level and Virtual Lanes at the link level, and is managed centrally by the Subnet Manager.
For more details, see the Nvidia QoS Documentation.
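As a contrast between the two strategies, here is a toy sketch (illustrative only, not Nvidia's actual algorithm): static ECMP hashes a flow onto one egress port regardless of congestion, while an adaptive scheme picks the least-loaded valid egress port.

```python
PORTS = ["p1", "p2", "p3", "p4"]

def ecmp_port(flow_id: int) -> str:
    """Static ECMP: the hash alone decides, ignoring congestion."""
    return PORTS[hash(flow_id) % len(PORTS)]

def adaptive_port(load: dict) -> str:
    """Adaptive: pick the egress port with the smallest queue occupancy."""
    return min(PORTS, key=lambda p: load[p])

load = {"p1": 90, "p2": 10, "p3": 55, "p4": 70}
print(adaptive_port(load))  # p2 -> the congestion-aware choice
```

An elephant flow that ECMP pins to a busy port stays there; the adaptive scheme can steer traffic around the hot spot.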
SHARP - Scalable Hierarchical Aggregation and Reduction Protocol
SHARP's primary function is to offload collective operations, removing the need for host CPUs and GPUs to send the same data multiple times. A host can send its data once and the switch handles the aggregation and replication, similar to multicast in Ethernet.
For more details, see the Nvidia SHARP Documentation.
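A back-of-the-envelope comparison shows why this matters at scale (a simplified counting model, not SHARP's actual protocol): reducing one value across n hosts by naive all-to-all exchange costs each host n-1 sends, while with in-network aggregation each host injects its data once and the switch tree does the rest.

```python
def sends_all_to_all(n: int) -> int:
    """Host-side sends if every host exchanges with every other host."""
    return n * (n - 1)

def sends_with_offload(n: int) -> int:
    """Host-side sends with switch-based aggregation: one per host."""
    return n

print(sends_all_to_all(8), "vs", sends_with_offload(8))  # 56 vs 8
```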