Fabric Monitoring & HA

Ensuring the stability and availability of an InfiniBand fabric requires understanding how the Subnet Manager (SM) handles redundancy, failover, and continuous monitoring.

Subnet Manager High Availability

A fabric needs at least one Subnet Manager to function, but a single SM is a single point of failure. It is therefore recommended to run at least two SMs: one Master and one or more Standbys.

Master SM: The active instance managing the fabric.
Standby SM: A passive instance that monitors the Master and is ready to take over.

Master SM Election

When multiple SMs are present, an election process determines the Master.

Priority: Each SM is assigned a 4-bit priority value (0-15).
- 0: Lowest priority (default).
- 15: Highest priority.
GUID Tie-Breaker: If multiple SMs have the same highest priority, the SM with the lowest GUID is elected Master.
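
The election rule can be sketched in a few lines of Python. This is an illustrative model only, not how any SM implements it; the SubnetManager class, the elect_master function, and the GUID values are hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    guid: int      # port GUID of the SM (example values, not real hardware)
    priority: int  # 4-bit value: 0 is lowest, 15 is highest


def elect_master(sms):
    """Return the winning SM: highest priority first, then lowest GUID."""
    if not sms:
        raise ValueError("no subnet managers present")
    for sm in sms:
        if not 0 <= sm.priority <= 15:
            raise ValueError(f"priority out of range: {sm.priority}")
    # Negating the priority lets min() prefer higher priorities,
    # while the GUID stays ascending so the lowest GUID wins a tie.
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


candidates = [
    SubnetManager(guid=0x0002C90300A1B2C3, priority=10),
    SubnetManager(guid=0x0002C90300A1B2C0, priority=10),
    SubnetManager(guid=0x0002C90300000001, priority=5),
]
master = elect_master(candidates)
print(hex(master.guid), master.priority)
```

Two candidates share priority 10 here, so the one with the lower GUID is selected, even though the priority-5 SM has the lowest GUID overall.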

SMInfo Attribute

The SMInfo attribute acts as a heartbeat and information exchange mechanism between SMs.

Used during subnet discovery and polling.
Contains: SM Port GUID, Priority, and State (Master/Standby).

Failover and Handover

Failover Process

If the Master SM fails or becomes disconnected:

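A Standby SM detects the failure when the Master stops answering its SMInfo polls (missed heartbeats).
The Standby with the highest priority promotes itself to Master. If several Standbys share that priority, the one with the lowest GUID wins.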
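
The two steps above can be modeled roughly as follows. This is a simplified sketch under stated assumptions: the Standby class, the record_poll and promote helpers, and the threshold of 4 missed polls are all hypothetical and chosen for illustration; a real SM's polling interval and retry count are configuration dependent.

```python
from dataclasses import dataclass


@dataclass
class Standby:
    guid: int
    priority: int
    missed_polls: int = 0
    state: str = "STANDBY"

    def record_poll(self, master_replied, max_missed_polls=4):
        """Track one SMInfo poll; return True once the Master is presumed dead.

        max_missed_polls is an illustrative threshold, not a documented default.
        """
        if master_replied:
            self.missed_polls = 0
            return False
        self.missed_polls += 1
        return self.missed_polls >= max_missed_polls


def promote(standbys):
    """Promote the best Standby: highest priority, then lowest GUID."""
    winner = min(standbys, key=lambda s: (-s.priority, s.guid))
    winner.state = "MASTER"
    return winner


standbys = [Standby(guid=0x20, priority=8), Standby(guid=0x10, priority=8)]
poll_results = [True, True, False, False, False, False]  # Master stops replying

new_master = None
for replied in poll_results:
    # Each Standby polls the Master independently and keeps its own counter.
    verdicts = [s.record_poll(replied) for s in standbys]
    if all(verdicts):
        new_master = promote(standbys)
        break

print(new_master)
```

In this run the Master misses four consecutive polls, after which the Standby with GUID 0x10 promotes itself because both Standbys have the same priority.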

Impact:

Existing Sessions: Generally not impacted.
New Sessions: Must wait until the new Master is elected and the fabric is stable.
LIDs: Usually do not change. The new Master attempts to retrieve the GUID-to-LID database from the old Master. If unavailable, it may trigger a new discovery and assignment phase.

Double Failover Scenario

A “double failover” occurs when a failed Master comes back online with a higher priority than the current Master, causing another handover.

Prevention: To avoid unnecessary handovers, configure master_sm_priority. When a Standby promotes itself, it raises its priority to this value (for example 15, the highest), so the old Master, which likely has a lower priority, does not immediately take back control when it returns.
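
A small model makes the effect of master_sm_priority easier to see. The function below is a hypothetical illustration, not SM code: it compares only priorities and deliberately leaves out the GUID tie-breaker.

```python
def master_after_return(returning_priority, current_priority, master_sm_priority=None):
    """Return which SM is Master after a failed Master rejoins the fabric.

    master_sm_priority models the option named above: when set, the
    promoted Standby advertises this priority for as long as it is Master.
    """
    if master_sm_priority is None:
        effective = current_priority
    else:
        effective = master_sm_priority
    # The returning SM takes over only if it outranks the current Master.
    # (GUID tie-breaking is omitted to keep the sketch short.)
    if returning_priority > effective:
        return "returning SM"
    return "current Master"


# Old Master had priority 10, promoted Standby has priority 5.
print(master_after_return(10, 5))                         # second handover
print(master_after_return(10, 5, master_sm_priority=15))  # no handover
```

Without the raised priority the returning SM outranks the promoted Standby and a second handover follows; with it, the promoted SM keeps the Master role.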

Fabric Sweeps

The Subnet Manager continuously monitors the fabric using “sweeps”.

Light Sweep

Frequency: Periodic (every 10 seconds by default).
Purpose: Checks for status changes without disrupting the fabric.
Detects:
- Port status changes.
- New SM detected.
- Standby SM priority change.
Outcome: If any significant change is detected, it triggers a Heavy Sweep.
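
As a rough sketch, a Light Sweep can be thought of as comparing the current fabric snapshot against the previous one. The snapshot layout and the light_sweep function below are hypothetical and exist only to illustrate the three kinds of change listed above.

```python
def light_sweep(previous, current):
    """Compare two fabric snapshots and list reasons for a Heavy Sweep.

    Each snapshot is a dict with hypothetical keys:
      "ports": {(node_guid, port_num): state_string}
      "sms":   {sm_guid: priority}
    An empty result means no Heavy Sweep is needed.
    """
    reasons = []
    if previous["ports"] != current["ports"]:
        reasons.append("port status change")
    if set(current["sms"]) - set(previous["sms"]):
        reasons.append("new SM detected")
    for guid, priority in current["sms"].items():
        if guid in previous["sms"] and previous["sms"][guid] != priority:
            reasons.append("SM priority change")
            break
    return reasons


before = {
    "ports": {(0xA, 1): "ACTIVE", (0xB, 1): "ACTIVE"},
    "sms": {0x1: 10, 0x2: 5},
}
after = {
    "ports": {(0xA, 1): "ACTIVE", (0xB, 1): "DOWN"},
    "sms": {0x1: 10, 0x2: 7, 0x3: 0},
}

reasons = light_sweep(before, after)
if reasons:
    print("Heavy Sweep needed:", ", ".join(reasons))
```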

Heavy Sweep

Trigger: A Light Sweep that finds changes, or an InfiniBand Trap (e.g., a switch reporting a port state change).
Process:
- Full fabric discovery (rediscover topology).
- New LIDs assigned (only if needed, e.g., for new hosts).
- Switch Linear Forwarding Tables (LFTs) are recalculated and reprogrammed.
Impact:
- Traffic on affected routes may see a short disruption or added latency while the topology is recalculated.
- Host or Leaf switch failures typically trigger a Heavy Sweep.
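
To illustrate what recalculating the forwarding tables involves, the sketch below rebuilds a destination-LID to output-port table for each switch using a plain shortest-hop breadth-first search. It is a stand-in for a real routing engine, not a description of one; the topology layout, the node names, the LID values, and the build_lfts function are all assumptions made for the example.

```python
from collections import deque


def build_lfts(links, host_lids):
    """Compute per-switch forwarding tables (destination LID -> output port).

    links:     {switch: {port_num: neighbor}}, neighbor is a switch or host name
    host_lids: {host_name: lid}
    """
    # Reverse index: for each node, the (switch, port) pairs that lead to it.
    reach = {}
    for switch, ports in links.items():
        for port, neighbor in ports.items():
            reach.setdefault(neighbor, []).append((switch, port))

    lfts = {switch: {} for switch in links}
    for host, lid in host_lids.items():
        # BFS outward from the destination host through the switches.
        queue = deque([host])
        seen = {host}
        while queue:
            node = queue.popleft()
            for switch, port in reach.get(node, []):
                if switch not in seen:
                    seen.add(switch)
                    lfts[switch][lid] = port  # first visit is a shortest-hop route
                    queue.append(switch)
    return lfts


links = {
    "leaf1": {1: "hostA", 2: "spine"},
    "leaf2": {1: "hostB", 2: "spine"},
    "spine": {1: "leaf1", 2: "leaf2"},
}
host_lids = {"hostA": 11, "hostB": 12}
lfts = build_lfts(links, host_lids)

# After leaf2 fails, rediscovery yields a smaller topology and new tables.
links_after_failure = {
    "leaf1": {1: "hostA", 2: "spine"},
    "spine": {1: "leaf1"},
}
lfts_after_failure = build_lfts(links_after_failure, host_lids)
print(lfts)
print(lfts_after_failure)
```

Once leaf2 is gone, no switch holds a route to LID 12 any more, which is why traffic on the affected routes is disturbed while the rest of the fabric keeps its entries.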

Monitoring Utilities

The infiniband-diags package provides tools to monitor SM status. (The perftest package contains bandwidth and latency benchmarks, not SM monitoring tools.)

sminfo: Displays the Master SM’s LID, GUID, Priority, and State.
smpquery: Queries subnet management attributes (such as node description, node info, or port info) of a node in the fabric.
- Example: smpquery nd 12 (Get Node Description of the node with LID 12).
saquery: Queries the Subnet Administration database.
- Example: saquery -s (List the ports that advertise SM capability, which covers the Master and any Standbys).
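
For scripted monitoring, the output of sminfo can be parsed into fields. The sketch below assumes a one-line output format modeled on typical sminfo output; the sample line is made up, and the exact format should be checked against the installed infiniband-diags version. The parse_sminfo function and the regular expression are hypothetical helpers.

```python
import re

# Assumed layout of the sminfo output line; verify against your installation.
SMINFO_RE = re.compile(
    r"sm lid (?P<lid>\d+) sm guid (?P<guid>0x[0-9a-fA-F]+), "
    r"activity count (?P<activity>\d+) priority (?P<priority>\d+) "
    r"state (?P<state_num>\d+) (?P<state>\S+)"
)


def parse_sminfo(line):
    """Extract LID, GUID, priority, and state from one sminfo output line."""
    match = SMINFO_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized sminfo output: {line!r}")
    return {
        "lid": int(match["lid"]),
        "guid": int(match["guid"], 16),
        "activity": int(match["activity"]),
        "priority": int(match["priority"]),
        "state": match["state"],
    }


# Made-up sample line used in place of real command output.
sample = (
    "sminfo: sm lid 1 sm guid 0x2c90300a1b2c0, "
    "activity count 4821 priority 14 state 3 SMINFO_MASTER"
)
info = parse_sminfo(sample)
print(info["lid"], hex(info["guid"]), info["priority"], info["state"])
```

In practice the line would come from running the tool, for example with subprocess.run(["sminfo"], capture_output=True, text=True), and a monitoring script could raise an alert when the reported GUID or state changes between runs.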