Subnet Manager
The Subnet Manager (SM) is the brain of an InfiniBand network. It is a centralized software entity responsible for discovering, configuring, and managing the fabric. Without an active SM, an InfiniBand network cannot function—even for a simple point-to-point connection between two servers.
Key Responsibilities
The Subnet Manager performs several critical tasks to bring the network up and keep it running efficiently:
- Network Discovery (Sweep)
  - The SM scans the network to discover all active devices (switches, HCAs, routers).
  - It identifies the topology, including how switches and nodes are interconnected.
- Addressing (LID Assignment)
  - Assigns a unique Local Identifier (LID) to every end port on the subnet (each HCA port and each switch's management port 0).
  - LIDs are 16-bit addresses used for local routing within the subnet (similar to MAC addresses in Ethernet, but assigned dynamically).
- Routing Calculation & Deployment
  - Calculates the most efficient paths for traffic between all pairs of nodes based on the topology.
  - Supports various routing algorithms (e.g., MinHop, Up/Down, Fat-Tree, Torus-2QoS).
  - Programs the Linear Forwarding Tables (LFTs) on every switch, telling each switch which output port to use for each destination LID.
- Traffic Isolation (Partitioning)
  - Manages Partitions using Partition Keys (P_Keys).
  - Ensures that nodes can only communicate with other nodes that share the same P_Key (similar to VLANs in Ethernet).
- Quality of Service (QoS)
  - Configures Service Levels (SLs) and maps them to Virtual Lanes (VLs).
  - Sets up arbitration on switches to prioritize critical traffic (e.g., low-latency MPI traffic over bulk storage traffic).
- Fault Management
  - Continuously monitors the fabric for changes (link failures, new devices).
  - Receives traps (asynchronous notifications from devices about events such as link state changes) and triggers a Heavy Sweep to re-discover and re-route the fabric if necessary.
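The addressing and routing responsibilities above can be illustrated with a small sketch. It is not how OpenSM is implemented: the two-switch fabric, the node names, and the helper functions are all hypothetical, and the routing is a plain fewest-hops search with a lowest-port tie-break. It only shows the idea of giving each node a LID and then filling a per-switch table that maps destination LID to output port.

```python
from collections import deque

# Hypothetical two-switch fabric: switch name -> {port number: neighbor name}.
SWITCH_PORTS = {
    "sw1": {1: "hca_a", 2: "hca_b", 3: "sw2"},
    "sw2": {1: "hca_c", 3: "sw1"},
}


def assign_lids(switch_ports, first_lid=1):
    """Give every discovered node one unicast LID, in sorted name order."""
    nodes = set(switch_ports)
    for ports in switch_ports.values():
        nodes.update(ports.values())
    return {name: first_lid + i for i, name in enumerate(sorted(nodes))}


def hops_from(dest, switch_ports):
    """Breadth-first search outward from dest; returns hop counts per node."""
    adjacency = {}
    for switch, ports in switch_ports.items():
        for neighbor in ports.values():
            adjacency.setdefault(switch, set()).add(neighbor)
            adjacency.setdefault(neighbor, set()).add(switch)
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        # Only switches forward traffic, so never route through another HCA.
        if node != dest and node not in switch_ports:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def build_lfts(switch_ports, lids):
    """Build {switch: {destination LID: output port}} using fewest hops."""
    lfts = {switch: {} for switch in switch_ports}
    for dest, dest_lid in lids.items():
        dist = hops_from(dest, switch_ports)
        for switch, ports in switch_ports.items():
            if switch == dest:
                lfts[switch][dest_lid] = 0  # port 0: the switch itself
                continue
            # Pick the port whose neighbor is closest to the destination;
            # ties go to the lowest port number.
            lfts[switch][dest_lid] = min(
                ports, key=lambda p: (dist.get(ports[p], float("inf")), p)
            )
    return lfts


lids = assign_lids(SWITCH_PORTS)
lfts = build_lfts(SWITCH_PORTS, lids)
for switch in sorted(lfts):
    print(switch, lfts[switch])
```

Each inner dictionary plays the role of one switch's LFT. A real SM also balances load across equal-cost ports and must avoid credit loops, which this sketch ignores.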
Architecture
- Subnet Management Agent (SMA): Every InfiniBand device (switch or HCA) runs a small agent called the SMA. The centralized SM communicates with these agents via Subnet Management Packets (SMPs) to get status updates and push configurations.
- Master vs. Standby:
  - A subnet can have multiple SMs running for redundancy, but only one is the Master SM.
  - All others are Standby SMs. They monitor the Master and synchronize their state.
  - If the Master fails, an election occurs, and a Standby promotes itself to Master.
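The election step can be sketched as a simple comparison. The rule assumed here is the one commonly described for InfiniBand: the SM with the highest priority (0 to 15) wins, and a tie is broken in favor of the lowest port GUID. The `SMInfo` class, the SM names, and the GUID values are hypothetical, and the real protocol also involves SM states and handover messages that are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMInfo:
    """Hypothetical view of one Subnet Manager taking part in an election."""
    name: str
    priority: int   # 0 (lowest) to 15 (highest)
    port_guid: int  # GUID of the port the SM runs on


def elect_master(candidates):
    """Return the SM that should be Master: highest priority, then lowest GUID."""
    if not candidates:
        raise ValueError("no Subnet Manager is running on this subnet")
    return min(candidates, key=lambda sm: (-sm.priority, sm.port_guid))


sms = [
    SMInfo("switch-embedded", priority=14, port_guid=0x0002C90300A1B2C3),
    SMInfo("server-a", priority=14, port_guid=0x0002C90300000010),
    SMInfo("server-b", priority=5, port_guid=0x0002C90300000001),
]

master = elect_master(sms)
print("Master:", master.name)

# If the Master disappears, the same rule picks the next one.
survivors = [sm for sm in sms if sm != master]
print("New Master:", elect_master(survivors).name)
```

This is why setting explicit priorities matters when several SMs run: without them, the outcome depends on GUID ordering alone.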
Light vs. Heavy Sweeps
The Subnet Manager continuously monitors the fabric using two types of “sweeps” to ensure the topology is up-to-date.
Light Sweep
A Light Sweep is a low-impact check performed periodically to detect status changes without disrupting the fabric.
- Frequency: Occurs automatically at a set interval (default is every 10 seconds).
- What it does:
  - Queries the status of ports and nodes.
  - Checks for changes in SM priority or the presence of new SMs.
- Outcome: If the Light Sweep detects any significant change (e.g., a port that was down is now active), it immediately triggers a Heavy Sweep.
Heavy Sweep
A Heavy Sweep is a comprehensive discovery and configuration process. It is more resource-intensive and can momentarily impact fabric traffic.
- When it happens:
  - Triggered by a Light Sweep finding a change.
  - Triggered by a Trap: If a switch or HCA reports a critical event (like a link going up or down), it sends a trap to the SM, causing an immediate Heavy Sweep.
  - Manual Trigger: Can be forced by an administrator (e.g., by restarting OpenSM or sending the opensm process a SIGHUP).
- What it does:
  - Full Discovery: Rediscovers the entire network topology.
  - LID Assignment: Assigns new LIDs to any newly discovered devices.
  - Routing Calculation: Recalculates the routing tables (LFTs) for the entire fabric to handle the new topology or route around failures.
  - Reprogramming: Pushes the new forwarding tables to all switches.
- Impact: During the reprogramming phase, traffic on affected routes may experience a brief pause or added latency.
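The relationship between the two sweep types can be summarized in a few lines. This is a minimal sketch of the decision logic only, assuming the SM keeps a snapshot of port states between sweeps. The port names, state strings, and function names are hypothetical and do not correspond to OpenSM internals.

```python
def light_sweep(previous, current):
    """Return the set of ports whose state differs between two snapshots."""
    all_ports = previous.keys() | current.keys()
    return {port for port in all_ports if previous.get(port) != current.get(port)}


def sweep_once(previous, current, trap_received=False):
    """Decide which sweep is needed for this interval.

    Returns ("heavy", changed_ports) when a change or a trap is seen,
    otherwise ("light", empty set).
    """
    changed = light_sweep(previous, current)
    if changed or trap_received:
        return "heavy", changed
    return "light", changed


before = {"sw1/1": "ACTIVE", "sw1/2": "DOWN"}
after = {"sw1/1": "ACTIVE", "sw1/2": "ACTIVE"}

print(sweep_once(before, before))                      # nothing changed
print(sweep_once(before, after))                       # a port came up
print(sweep_once(before, before, trap_received=True))  # a trap arrived
```

A trap shortcuts the wait for the next interval, which is why link events are usually handled faster than the light-sweep period alone would suggest.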
Common Routing Algorithms
The SM can use different algorithms depending on the network topology:
- MinHop: Finds the path with the fewest number of hops. Good for irregular topologies but can cause congestion.
- Up/Down: Prevents routing loops in irregular networks by enforcing a hierarchy (root nodes vs. leaf nodes).
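- Fat-Tree: Optimized for Fat-Tree topologies (common in HPC). On a well-formed, non-oversubscribed fat-tree it aims for contention-free routing (typically for shift-permutation traffic patterns) while using the full bisection bandwidth.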
- Dimensional Order Routing (DOR): Used for Grid/Mesh/Torus topologies (e.g., Hypercube).
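The Up/Down rule is easy to state in code: once a path has taken a hop "down" (away from the roots), it may not go "up" again. The sketch below checks that property for a given path. The switch names and the `rank` table (distance from the root switches) are hypothetical, and hops between equal-rank switches are simply treated as "down", whereas a real routing engine needs an explicit tie-break for them.

```python
def is_updown_legal(path, rank):
    """Return True if the path never goes up after it has gone down.

    path: list of switch names in traversal order.
    rank: dict of switch name -> distance from the root level (0 = root).
    """
    gone_down = False
    for here, nxt in zip(path, path[1:]):
        going_up = rank[nxt] < rank[here]
        if going_up and gone_down:
            return False  # down followed by up: forbidden turn
        if not going_up:
            gone_down = True
    return True


# Hypothetical two-level fabric: two spines (roots) and two leaves.
RANK = {"spine1": 0, "spine2": 0, "leaf1": 1, "leaf2": 1}

print(is_updown_legal(["leaf1", "spine1", "leaf2"], RANK))   # up then down
print(is_updown_legal(["spine1", "leaf1", "spine2"], RANK))  # down then up
```

Forbidding the down-then-up turn is what removes cyclic dependencies between links, at the cost of ruling out some shortest paths in irregular fabrics.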
Deployment Options
The Subnet Manager can run in three locations. The choice should be based on fabric scale, feature requirements, and licensing budget.
| Deployment | Scale | Adaptive Routing | Dragonfly+ | License Required |
|---|---|---|---|---|
| Managed Switch (Embedded SM) | Up to 2,048 nodes | No | No | No |
| Server (OpenSM) | Medium–Large | Yes | Yes | No |
| UFM | Medium–Large | Yes | Yes | Yes (per device) |
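As a rough illustration, the table can be read as a filter over requirements. The sketch below encodes only what the table states. The function name and parameters are hypothetical, and `max_nodes=None` means the table gives no numeric limit, not that the option scales without bound.

```python
# Rows of the deployment table, encoded as data.
DEPLOYMENTS = [
    {"name": "Managed Switch (Embedded SM)", "max_nodes": 2048,
     "adaptive_routing": False, "dragonfly_plus": False, "license_required": False},
    {"name": "Server (OpenSM)", "max_nodes": None,
     "adaptive_routing": True, "dragonfly_plus": True, "license_required": False},
    {"name": "UFM", "max_nodes": None,
     "adaptive_routing": True, "dragonfly_plus": True, "license_required": True},
]


def candidate_deployments(nodes, need_adaptive_routing=False,
                          need_dragonfly_plus=False, allow_license=True):
    """Return the names of deployment options that satisfy the requirements."""
    matches = []
    for option in DEPLOYMENTS:
        if option["max_nodes"] is not None and nodes > option["max_nodes"]:
            continue
        if need_adaptive_routing and not option["adaptive_routing"]:
            continue
        if need_dragonfly_plus and not option["dragonfly_plus"]:
            continue
        if option["license_required"] and not allow_license:
            continue
        matches.append(option["name"])
    return matches


print(candidate_deployments(500))
print(candidate_deployments(500, need_adaptive_routing=True, allow_license=False))
print(candidate_deployments(4000))
```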
Option 1: Embedded SM on a Managed Switch
Nvidia Managed InfiniBand switches run the MLNX-OS operating system, which includes an embedded Subnet Manager. This is the simplest option for smaller fabrics.
- Scale: Up to 2,048 nodes.
- Limitations: Does not support Adaptive Routing or the Dragonfly+ routing engine.
- Management: Managed switches support both in-band (via IB fabric) and out-of-band (via an RJ45 Ethernet management port with a separate IP address) management. Unmanaged switches only support in-band management (no CPU or management capability).
Enabling the Embedded SM
SSH into the managed switch and use the CLI (Cisco-like shell):

    enable
    configure terminal

    # Check SM status (should be disabled by default)
    show ib sm

    # Enable the SM
    ib sm

    # Verify it is running
    show ib sm

    # Save the configuration
    end
    wr mem

Setting SM Priority
It is recommended to set an explicit priority (especially when running multiple SMs):

    enable
    configure terminal

    ib sm sm-priority 14

    # Verify
    show ib sm sm-priority

Configuring the Routing Engine
    # List available routing engines
    ib sm routing-engines ?

    # Set the routing engine (e.g., UpDn)
    ib sm routing-engines updn

Use ib sm ? to see all available SM configuration options.
Option 2: OpenSM on a Server
For medium-to-large fabrics, OpenSM can be run on a server with the Nvidia DOCA-OFED stack installed.
- Scale: Defaults support up to ~200 nodes; can be tuned for larger fabrics.
- Features: Supports Adaptive Routing and the Dragonfly+ routing engine.
- License: None required.
Running OpenSM
OpenSM should be run as a daemon (system service).

    # Run OpenSM with defaults
    opensm

    # View help
    opensm -h

Logging
OpenSM logs to two locations:
- /var/log/messages: General major events.
- /var/log/opensm.log: Detailed information and errors. All errors should be treated as indicators of fabric health.
Configuration
The OpenSM configuration file is typically stored at /etc/opensm/opensm.conf.
To generate a default configuration file:
    opensm -c /etc/opensm/opensm.conf

Configuring the Routing Engine
Edit the configuration file to set the routing engine:

    routing_engine ar_updn

Multiple engines can be specified as fallbacks (space-separated). OpenSM will try each in order, eventually falling back to Min-Hop if all others fail:

    routing_engine ar_updn updn

As of OpenSM 5.10 (November 2021), the default routing engine was changed from Min-Hop to UpDn with Adaptive Routing (ar_updn).
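The fallback behavior described above can be sketched as follows. This is an illustration of the "try each in order, then Min-Hop" idea only. The function names are hypothetical, the `can_route` callback stands in for whatever makes an engine fail on a given fabric, and `"minhop"` is used here as a label for the final fallback.

```python
def parse_routing_engines(config_line):
    """Extract the ordered engine list from a 'routing_engine ...' line."""
    key, *engines = config_line.split()
    if key != "routing_engine":
        raise ValueError("not a routing_engine line: " + config_line)
    return engines


def select_routing_engine(engines, can_route):
    """Return the first engine that can route the fabric, else Min-Hop."""
    for name in engines:
        if can_route(name):
            return name
    return "minhop"


engines = parse_routing_engines("routing_engine ar_updn updn")
print(engines)

# Pretend ar_updn cannot be used on this fabric but updn can.
print(select_routing_engine(engines, can_route=lambda name: name == "updn"))

# Pretend every configured engine fails.
print(select_routing_engine(engines, can_route=lambda name: False))
```

Listing a simpler engine after the preferred one keeps the fabric routable with a known algorithm if the first choice cannot handle the discovered topology.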
After making changes, restart the service:
    systemctl restart opensmd

Option 3: Nvidia UFM
Unified Fabric Manager (UFM) is a licensed Nvidia product for medium-to-large scale fabrics. It provides a WebUI-based platform for comprehensive fabric management.
- Core: Uses OpenSM under the hood but adds enhanced diagnostics, telemetry, and automation.
- Deployment: Can run as a daemon, Docker container, or on a dedicated Nvidia hardware appliance.
- License: Per managed device.
UFM Platforms
| Platform | Description |
|---|---|
| UFM Telemetry | Network telemetry, application workload usage, and system configuration. |
| UFM Enterprise | Adds automated network discovery/provisioning, traffic monitoring, congestion detection, job scheduler integration (Slurm, LSF), and cloud manager integration (OpenStack, Azure, VMware). |
| UFM Cyber-AI | Adds preventive maintenance and cybersecurity analytics for reducing supercomputing operational costs. |
Configuring SM via UFM
- Navigate to Main Navigation > Settings.
- Select the Subnet Manager tab.
- Configure settings under the relevant sub-tabs.
To configure a routing engine:
Settings > Network Management > Routing Engine