IB Diagnostics
InfiniBand provides a rich set of diagnostic tools for troubleshooting at both the host level and across the entire fabric.
Host-Level Tools
These tools inspect the local node's configuration and connectivity.
| Command | Description |
|---|---|
| `ofed_info` | Check the DOCA/OFED driver version |
| `lspci` | Check the type and version of installed HCAs |
| `ibstat` | Display the link status of a node in the IB fabric |
| `ibportstate <lid> <port>` | Display the link status of a specific port on a node |
| `ibroute <lid>` | Display the forwarding table of the switch with a specific LID |
| `ibv_devices` | List InfiniBand devices (HCAs) |
| `ibv_devinfo` | Display detailed information about InfiniBand devices (HCAs) |
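
As a rough illustration of how `ibstat` output can feed a quick health check, the sketch below prints every port whose logical state is not `Active`. It assumes the usual `ibstat` layout (a `CA '<name>'` line, then indented `Port <n>:` and `State:` lines); the function name `inactive_ports` is hypothetical, and the text is read from stdin so the filter can be tried without InfiniBand hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list ports whose State is not "Active".
# Assumes ibstat-style text on stdin (CA '<name>' / Port <n>: / State: <value>).
inactive_ports() {
  awk -v q="'" '
    # "CA mlx5_0" header: remember the adapter name without its quotes
    /^CA /                { name = $2; gsub(q, "", name); ca = name }
    # "Port 1:" header: remember the port number without the colon
    /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
    # "State: <value>" line: report anything other than Active
    /^[ \t]*State:/       { if ($2 != "Active") print ca " port " port ": " $2 }
  '
}
```

On a live node this would typically be used as `ibstat | inactive_ports`; no output means every local port reported `Active`.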
Fabric-Level Tools
These tools operate across the fabric and are used for broader discovery and troubleshooting.
| Command | Description |
|---|---|
| `ibswitches` | Identify all switches in the IB fabric |
| `ibhosts` | Identify all HCAs in the IB fabric |
| `ibnodes` | Identify all nodes in the IB fabric |
| `ibnetdiscover` | Display node-to-node connectivity |
| `iblinkinfo` | List all nodes and connectivity information |
| `sminfo` | Identify the master Subnet Manager |
| `ibping` | Ping-pong test over IB to validate connectivity between hosts |
| `ibtracert <src-lid> <dst-lid>` | Display the route between two nodes |
| `ibdiagnet` | Comprehensive fabric health diagnostics |
| `ib_write_lat` | Measure RDMA Write latency between two nodes |
| `ib_read_lat` | Measure RDMA Read latency between two nodes |
| `ib_write_bw` | Measure RDMA Write bandwidth between two nodes |
| `ib_read_bw` | Measure RDMA Read bandwidth between two nodes |
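
To show how one of these tools might be scripted, the sketch below extracts the Subnet Manager LID and state name from `sminfo` output. It assumes the single-line format commonly printed by `sminfo` (`sminfo: sm lid <n> sm guid <guid>, activity count <n> priority <n> state <n> <NAME>`); the function name `sm_summary` is hypothetical, and the input is read from stdin so the parsing can be tried offline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the SM LID and state from sminfo-style output.
# Assumes a line such as:
#   sminfo: sm lid 1 sm guid 0x..., activity count 10 priority 0 state 3 SMINFO_MASTER
sm_summary() {
  awk '
    /sm lid/ {
      for (i = 1; i < NF; i++) {
        if ($i == "lid")   lid = $(i + 1)     # value right after "lid"
        if ($i == "state") state = $(i + 2)   # skip the numeric code, keep the name
      }
      print "lid=" lid " state=" state
    }
  '
}
```

A typical invocation would be `sminfo | sm_summary`; a state other than `SMINFO_MASTER` suggests the queried SM is not the master.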
ibdiagnet Utility
ibdiagnet is the primary troubleshooting tool for fabric discovery, error detection, and general diagnostics. It is part of the ibutils2 package included in DOCA/OFED and UFM software packages.
It works by scanning the fabric using directed-route packets and extracting information about fabric connectivity and devices.
Checks Performed
ibdiagnet performs the following checks:
- Fabric Discovery — Sweeps the IB fabric and collects information from switches, HCAs, routers, aggregation nodes, and gateways.
- Duplicated GUIDs — Reports duplicated node and port GUIDs.
- Duplicated Node Descriptions — Warns about duplicated node descriptions for switches and HCAs.
- LIDs Check — Validates LID assignment and checks for duplicated LIDs.
- Links in Init/Unresponsive States — Reports links in INIT logical state and unresponsive devices, including the direct route to reach them.
- Counters Fetch — Fetches various counters from IB devices including standard/extended port counters, diagnostic counters, and physical counters.
- Error Counters Checks — Checks error counters crossing thresholds between counter snapshots.
- Routing Fetch and Checks — Validates switch routing tables and checks for credit-loop free routing.
- Link Width and Speed Checks — Verifies that fabric links are operating at maximum supported speed and width.
- Topology Matching — Compares the live topology with a previously stored one.
- Partition Checks — Dumps and validates HCA and switch partition tables.
- BER Test — Reports links with high Bit Error Rates.
Default Checks (No Flags)
Running ibdiagnet without any flags performs:
- Fabric Discovery
- Duplicated GUIDs Check
- Duplicated Node Description Check
- LID Check
- Links Check
- Subnet Managers Check
- Port Counters Snapshot/Checks (1 second period)
- Nodes Information Check (uniform firmware versions)
- Speed/Width Check
- Alias GUIDs
- Dump Virtualization Information
- Partition Keys Checks
- Dump Temperature Sensing
- Create Network Dump file (ibnetdiscover format)
Basic Usage
Run ibdiagnet with no arguments to perform the default checks:

```shell
ibdiagnet
```

If there are multiple HCAs, ibdiagnet runs on the first active interface. To select a specific HCA and port:

```shell
ibdiagnet -i <hca-name> -p <port-num>
```

Check the version:

```shell
ibdiagnet --version
```

Output Format
The standard output of ibdiagnet groups results by check, separated by dashes:

```text
--------------------------------
Discovery
--------------------------------
Lids Check
--------------------------------
Links Check
--------------------------------
Subnet Manager
--------------------------------
```

Output Files
ibdiagnet writes detailed results to several files in `/var/tmp/ibdiagnet2/` by default. The output path can be changed with `-o` or `--output_path`.
| File | Contents |
|---|---|
| `ibdiagnet2.log` | Log file for the ibdiagnet run |
| `ibdiagnet2.lst` | Fabric links in LST format |
| `ibdiagnet2.net_dump` | Fabric link dump including split cable mapping and FEC info |
| `ibdiagnet2.sm` | Subnet managers (list of all SMs, state, priority) |
| `ibdiagnet2.pm` | IB spec compliant port counters |
| `ibdiagnet2.fdbs` | Unicast Forwarding Tables |
| `ibdiagnet2.ar` | Adaptive routing tables |
| `ibdiagnet2.nodes_info` | Nodes information |
| `ibdiagnet2.pkey` | Pkey tables (all partitions and member host GUIDs) |
| `ibdiagnet2.slvl` | SLVL tables of fabric switches |
| `ibdiagnet2.ibnetdiscover` | Discovered network in ibnetdiscover format |
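
After a run, it can be handy to confirm which of these files were actually written and to get a rough count of reported problems. The sketch below is one way to do that; the function names are hypothetical, the file list is taken from the table above, and the `-E-` / `-W-` line prefixes for errors and warnings are an assumption about the log format that should be checked against your ibdiagnet version. Some files may be absent simply because the corresponding check did not run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the expected output files missing from a directory.
# Defaults to the documented output path /var/tmp/ibdiagnet2.
missing_ibdiagnet_files() {
  local dir="${1:-/var/tmp/ibdiagnet2}"
  local f
  for f in log lst net_dump sm pm fdbs ar nodes_info pkey slvl ibnetdiscover; do
    [ -e "$dir/ibdiagnet2.$f" ] || echo "ibdiagnet2.$f"
  done
}

# Hypothetical helper: count lines that start with -E- (errors) or -W- (warnings).
# The prefixes are an assumption about how ibdiagnet tags its messages.
count_findings() {
  local log="$1"
  printf 'errors=%s warnings=%s\n' \
    "$(grep -c -e '^-E-' "$log")" \
    "$(grep -c -e '^-W-' "$log")"
}
```

For example, `missing_ibdiagnet_files` with no argument checks the default path, and `count_findings /var/tmp/ibdiagnet2/ibdiagnet2.log` summarizes the log from the latest run.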
Useful Options
Set the pause, in seconds, between the two port-counter samples that ibdiagnet compares (default is 1 second):

```shell
ibdiagnet --pm_pause_time 60
```

Generate a topology file:

```shell
ibdiagnet -w /var/tmp/ibdiagnet2/topology
```

Further Reading
- ibdiagnet User Manual: official NVIDIA documentation for the `ibdiagnet` utility.