Fabric Monitoring & HA

Ensuring the stability and availability of an InfiniBand fabric requires understanding how the Subnet Manager (SM) handles redundancy, failover, and continuous monitoring.

A fabric needs only one Subnet Manager to function, but a lone SM is a single point of failure. It is therefore recommended to run at least two SMs (one Master, one Standby).

  • Master SM: The active instance managing the fabric.
  • Standby SM: A passive instance that monitors the Master and is ready to take over. A fabric may have more than one.

When multiple SMs are present, an election process determines the Master.

  1. Priority: Each SM is assigned a 4-bit priority value (0-15).
    • 0: Lowest priority (default).
    • 15: Highest priority.
  2. GUID Tie-Breaker: If multiple SMs have the same highest priority, the SM with the lowest GUID is elected Master.
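The election rule above can be sketched in a few lines of Python. This is an illustrative model only, not code from any SM implementation; the `SubnetManager` class, the `elect_master` function, and the example GUIDs are hypothetical names and values chosen for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    name: str      # hypothetical label, used only for readability
    priority: int  # 4-bit value: 0 (lowest) to 15 (highest)
    guid: int      # port GUID, compared numerically


def elect_master(sms):
    """Return the SM with the highest priority; ties go to the lowest GUID."""
    for sm in sms:
        if not 0 <= sm.priority <= 15:
            raise ValueError(f"priority out of range for {sm.name}: {sm.priority}")
    # Negate the priority so that a higher priority sorts first,
    # then fall back to the GUID in ascending order.
    return min(sms, key=lambda sm: (-sm.priority, sm.guid))


candidates = [
    SubnetManager("sm-a", priority=10, guid=0x0002C90300A1B2C3),
    SubnetManager("sm-b", priority=15, guid=0x0002C90300A1B2D4),
    SubnetManager("sm-c", priority=15, guid=0x0002C90300A1B2C4),
]
print(elect_master(candidates).name)  # sm-c
```

Here sm-b and sm-c share the highest priority, so the lower GUID decides the election in favor of sm-c.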

The SMInfo attribute acts as a heartbeat and information exchange mechanism between SMs.

  • Used during subnet discovery and polling.
  • Contains: SM Port GUID, Priority, and State (Master/Standby).

If the Master SM fails or becomes disconnected:

  1. A Standby SM detects the failure (via missing heartbeats).
  2. The Standby with the highest priority (ties broken by the lowest GUID) promotes itself to Master.
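As a rough illustration of the detection step, the sketch below models a Standby that counts consecutive unanswered polls of the Master. The `StandbyMonitor` class, the `max_missed` threshold, and the `better_standby_exists` flag are assumptions made for this example; real SMs use their own timers and retry counts.

```python
class StandbyMonitor:
    """Tracks consecutive missed polls of the Master (illustrative model)."""

    def __init__(self, max_missed=4):
        self.max_missed = max_missed  # assumed retry budget, not a spec value
        self.missed = 0
        self.state = "STANDBY"

    def record_poll(self, master_responded, better_standby_exists=False):
        """Update the state after one poll and return the resulting state."""
        if self.state == "MASTER":
            return self.state
        if master_responded:
            self.missed = 0  # any answer resets the failure counter
            return self.state
        self.missed += 1
        # Promote only once the Master is considered gone and no other
        # Standby outranks this one (higher priority or lower GUID).
        if self.missed >= self.max_missed and not better_standby_exists:
            self.state = "MASTER"
        return self.state


monitor = StandbyMonitor(max_missed=3)
for responded in (True, False, False, False):
    state = monitor.record_poll(responded)
print(state)  # MASTER
```

Requiring several consecutive misses rather than one keeps a single dropped packet from triggering a handover.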

Impact:

  • Existing Sessions: Generally not impacted.
  • New Sessions: Must wait until the new Master is elected and the fabric is stable.
  • LIDs: Usually do not change. The new Master attempts to retrieve the GUID-to-LID database from the old Master. If unavailable, it may trigger a new discovery and assignment phase.

A “double failover” occurs when a failed Master comes back online with a higher priority than the current Master, causing another handover.

Prevention: To avoid unnecessary handovers, configure master_sm_priority. When a Standby promotes itself, it raises its own priority to that value (for example 15, the highest), so the old Master, which likely has a lower priority, does not immediately take back control when it returns.
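The effect of raising the priority on promotion can be seen in a small simulation. This is a toy model under stated assumptions: two SMs named sm-a and sm-b, made-up priorities and GUIDs, and a hypothetical `simulate` helper that re-runs the election whenever the set of SMs changes.

```python
def elect(sms):
    """sms maps name -> (priority, guid); highest priority wins, then lowest GUID."""
    return min(sms, key=lambda name: (-sms[name][0], sms[name][1]))


def simulate(raise_on_promotion):
    """Return the sequence of Masters across a failure and recovery of sm-a."""
    sms = {"sm-a": (10, 0x1), "sm-b": (5, 0x2)}
    history = [elect(sms)]       # sm-a starts as Master
    failed = sms.pop("sm-a")     # sm-a fails
    master = elect(sms)          # sm-b promotes itself
    if raise_on_promotion:
        # Model of the prevention mechanism: the new Master bumps itself to 15.
        sms[master] = (15, sms[master][1])
    history.append(master)
    sms["sm-a"] = failed         # sm-a returns with its original priority
    history.append(elect(sms))
    return history


print(simulate(raise_on_promotion=False))  # ['sm-a', 'sm-b', 'sm-a']
print(simulate(raise_on_promotion=True))   # ['sm-a', 'sm-b', 'sm-b']
```

Without the priority bump the fabric goes through two handovers; with it, sm-b stays Master after sm-a returns.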

The Subnet Manager continuously monitors the fabric using “sweeps”, which come in two kinds: light and heavy.

Light Sweep:

  • Frequency: Periodic (default every 10 seconds).
  • Purpose: Checks for status changes without disrupting the fabric.
  • Triggers:
    • Port status changes.
    • New SM detected.
    • Standby SM priority change.
  • Outcome: If any significant change is detected, it triggers a Heavy Sweep.

Heavy Sweep:

  • Trigger: Started by a Light Sweep that finds changes, or by an InfiniBand Trap (e.g., a switch detecting a port state change).
  • Process:
    • Full fabric discovery (rediscover topology).
    • New LIDs assigned (only if needed, e.g., for new hosts).
    • Switch Linear Forwarding Tables (LFTs) are recalculated and reprogrammed.
  • Impact:
    • Traffic on affected routes may experience a short disruption/latency while the topology is recalculated.
    • Host or Leaf switch failures typically trigger a Heavy Sweep.
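The relationship between the two sweep types can be sketched as a comparison of fabric snapshots. The snapshot layout (a dict with "ports" and "sms" keys) and the function names `light_sweep` and `run_sweep` are assumptions for this example and do not reflect how any particular SM stores its state.

```python
def light_sweep(previous, current):
    """Compare two snapshots and list the reasons a heavy sweep is needed."""
    reasons = []
    if previous["ports"] != current["ports"]:
        reasons.append("port status change")
    prev_sms, cur_sms = previous["sms"], current["sms"]
    if any(guid not in prev_sms for guid in cur_sms):
        reasons.append("new SM detected")
    if any(guid in prev_sms and prev_sms[guid] != prio
           for guid, prio in cur_sms.items()):
        reasons.append("standby SM priority change")
    return reasons


def run_sweep(previous, current, trap_received=False):
    """Return ('heavy', reasons) when escalation is needed, else ('light', [])."""
    reasons = light_sweep(previous, current)
    if trap_received:
        reasons.append("trap received")
    return ("heavy" if reasons else "light", reasons)


# "ports" maps a port label to its state; "sms" maps an SM GUID to its priority.
before = {"ports": {"leaf1/7": "ACTIVE"}, "sms": {0x2: 5}}
after = {"ports": {"leaf1/7": "DOWN"}, "sms": {0x2: 5}}
print(run_sweep(before, before))  # ('light', [])
print(run_sweep(before, after))   # ('heavy', ['port status change'])
```

The light sweep only compares cheap state, so the expensive rediscovery and LFT reprogramming run only when a reason is recorded.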

The infiniband-diags package provides tools to monitor SM status.

  • sminfo: Displays the Master SM’s LID, GUID, Priority, and State.
  • smpquery: Queries subnet management attributes (such as the node description or port info) of a node or port.
    • Example: smpquery nd 12 (Get Node Description of the node with LID 12).
  • saquery: Queries the Subnet Administration database.
    • Example: saquery -s (List all active SMs, including Master and Standbys).
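For scripted monitoring, the output of sminfo can be parsed into fields. The sketch below assumes the single-line format shown in `sample`, which is an illustrative line with made-up values rather than captured output; the exact format may differ between infiniband-diags versions, so the pattern should be checked against the installed tool. The `parse_sminfo` function is a hypothetical helper.

```python
import re

# Assumed layout of the sminfo output line; verify against your version.
SMINFO_RE = re.compile(
    r"sm lid (?P<lid>\d+) sm guid (?P<guid>0x[0-9a-fA-F]+), "
    r"activity count (?P<activity>\d+) priority (?P<priority>\d+) "
    r"state (?P<state_num>\d+) (?P<state>\S+)"
)


def parse_sminfo(line):
    """Extract LID, GUID, priority, and state from one sminfo output line."""
    match = SMINFO_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized sminfo output: {line!r}")
    return {
        "lid": int(match["lid"]),
        "guid": int(match["guid"], 16),
        "priority": int(match["priority"]),
        "state": match["state"],
    }


# Illustrative line with made-up values, not real captured output.
sample = ("sminfo: sm lid 1 sm guid 0x2c9030009b2c1, "
          "activity count 4711 priority 15 state 3 SMINFO_MASTER")
info = parse_sminfo(sample)
print(info["lid"], hex(info["guid"]), info["priority"], info["state"])
# 1 0x2c9030009b2c1 15 SMINFO_MASTER
```

A monitoring script could feed this function the text returned by running sminfo and raise an alert when the reported GUID or priority changes unexpectedly.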