Unified Fabric Manager (UFM)

NVIDIA Unified Fabric Manager (UFM) is a licensed platform for managing InfiniBand fabrics at scale. It runs OpenSM under the hood and layers a WebUI on top for fabric management, monitoring, telemetry, and troubleshooting.

The UFM dashboard provides an at-a-glance view of fabric health and traffic.

  • Traffic Map — Visualizes traffic and congestion across fabric tiers:
    • Tier 1: Server to Leaf
    • Tier 2: Leaf to Spine
    • Tier 3: Spine to Leaf
    • Tier 4: Leaf to Server
    • Displays min/max/avg values per connection.
  • Inventory Display — Shows warnings, alarms, and firmware versions for all devices.
  • Top 5 Charts — Servers by bandwidth, switches by bandwidth, top congested servers, top congested switches, and top alarmed switches. Click any device for more detail.
  • Recent Activities — Lists warnings, alarms, and events. Clicking an event navigates to the Events/Alerts tab.
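
The per-tier min/max/avg values and the Top 5 rankings are simple aggregations. The sketch below shows one way to compute them from bandwidth samples. `LinkSample`, its fields, and the device names are hypothetical and chosen for illustration only; they do not reflect UFM's data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LinkSample:
    """Hypothetical record: one bandwidth reading for one link."""
    tier: int      # 1..4, matching the Traffic Map tiers
    device: str    # illustrative device name
    gbps: float    # observed bandwidth in Gb/s


def tier_stats(samples):
    """Return {tier: (min, max, avg)} over all samples in each tier."""
    by_tier = defaultdict(list)
    for s in samples:
        by_tier[s.tier].append(s.gbps)
    return {t: (min(v), max(v), mean(v)) for t, v in by_tier.items()}


def top_n_by_bandwidth(samples, n=5):
    """Return the n devices with the highest summed bandwidth."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.device] += s.gbps
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Made-up sample data for demonstration.
samples = [
    LinkSample(1, "server-01", 80.0),
    LinkSample(1, "server-02", 40.0),
    LinkSample(2, "leaf-01", 120.0),
    LinkSample(2, "leaf-01", 60.0),
]
```

Calling `tier_stats(samples)` gives one (min, max, avg) tuple per tier, and `top_n_by_bandwidth(samples)` returns up to five (device, total) pairs, highest first.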

The Managed Elements tab provides views into devices, ports, virtual ports, cables, groups, P_Keys, and HCAs.

  • Devices Tab — Lists all fabric devices with details on firmware, partition keys, cabled interfaces, and more. Supports filtering and right-click actions:
    • Mark as unhealthy (isolate for maintenance)
    • Set node description
    • Reset firmware (unmanaged switches)
    • Upgrade firmware
  • Groups — Create groups of nodes for batch operations (e.g., firmware upgrades or isolation across multiple devices at once).
  • Ports — Search and discover port-level issues across the fabric.
  • Cables — View cable serial numbers, port numbers, and cable types.
  • Virtual Ports — Used in multi-tenant clusters to view virtual connections.

UFM collects alarms, logs, and reports to monitor fabric health.

Events are categorized as Info, Warning, Minor, or Critical. In the UFM settings you can configure how events are handled — log to file, ship logs externally, or send SNMP traps. Thresholds can be set for event count and TTL so that small, transient errors don’t trigger unnecessary alarms.
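
To illustrate how a count and TTL threshold can suppress transient errors, here is a minimal sketch that fires an alarm only when a given number of events arrive within the TTL window. This is one plausible sliding-window interpretation; the class name `EventThreshold` and its parameters are assumptions, and UFM's actual evaluation logic may differ.

```python
from collections import deque


class EventThreshold:
    """Fire an alarm only if `count` events arrive within `ttl_seconds`."""

    def __init__(self, count, ttl_seconds):
        self.count = count
        self.ttl_seconds = ttl_seconds
        self._timestamps = deque()

    def record(self, now):
        """Record one event at time `now` (seconds); return True if the alarm fires."""
        self._timestamps.append(now)
        # Drop events older than the TTL window. The event just added
        # always survives, so the deque is never empty here.
        while now - self._timestamps[0] > self.ttl_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.count


threshold = EventThreshold(count=3, ttl_seconds=60)
results = [threshold.record(t) for t in (0, 100, 110, 120)]
```

In this run the isolated event at t=0 expires before the later burst, so only the third event of the burst (t=120) reaches the threshold.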

In the Devices tab, click a specific device to view all of its alerts, such as interface errors and hardware failures (for example, a failed power supply).

| Tab | Purpose |
| --- | --- |
| UFM Health | Is UFM itself healthy? |
| UFM Logs | Event logs, SM logs, and UFM logs with filtering |
| Snapshot | Take configuration snapshots for restore or support cases |
| Fabric Health | Run fabric health tests (ibdiagnet-based) |
| Daily Reports | Automated daily reports on fabric health and traffic |
| Topology Compare | Compare topology changes and view topology files |
| Fabric Validation | Run validation tests across the fabric |
| ibdiagnet | Run ibdiagnet commands with custom settings from the UI |
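
Conceptually, a topology comparison is a set difference between two lists of links. The sketch below assumes each link is a pair of (device, port) endpoints; this representation and the names used are hypothetical and are not UFM's topology file format.

```python
def normalize(link):
    """Treat a link as undirected: (A, B) equals (B, A)."""
    a, b = link
    return (a, b) if a <= b else (b, a)


def compare_topologies(baseline, current):
    """Return (added, removed) links between two topology snapshots."""
    base = {normalize(link) for link in baseline}
    cur = {normalize(link) for link in current}
    return sorted(cur - base), sorted(base - cur)


# Made-up snapshots: one link is unchanged (listed in reverse order),
# and one cable moved from leaf-01 port 2 to leaf-01 port 3.
baseline = [
    (("leaf-01", 1), ("spine-01", 1)),
    (("leaf-01", 2), ("spine-02", 1)),
]
current = [
    (("spine-01", 1), ("leaf-01", 1)),
    (("leaf-01", 3), ("spine-02", 1)),
]
added, removed = compare_topologies(baseline, current)
```

Normalizing endpoint order first keeps a link from being reported as changed just because its two ends were listed in the opposite order.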

UFM Telemetry collects metrics from cables, ports, and switches and streams them from a single source.

  • Telemetry Tab — Explore metrics, build charts, and view pre-defined dashboards.
  • Network Map — Enable link analysis and counter collection to troubleshoot network saturation. Specify metrics like port TX/RX data rate to isolate problems.

A typical troubleshooting workflow:

  • Start with the Traffic Map dashboard and the Top-N charts to identify hotspots.
  • Use the Network Map with specific counters (such as port TX/RX data rate) to drill down.
  • Use Fabric Health Reports under System Health to collect information on existing issues.
  • Create Port Groups to group related ports and monitor them together in the Traffic Map.
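
When drilling down on port TX/RX data rate, it helps to know how a rate is derived from raw counters. The sketch below assumes two samples of an InfiniBand data counter (PortXmitData or PortRcvData), which by InfiniBand convention count in units of four octets. The function name and arguments are illustrative; whether UFM exposes raw or already normalized values is not stated here, so check the units before applying this.

```python
def data_rate_gbps(prev_count, curr_count, interval_seconds):
    """Convert two data-counter samples into an average rate in Gb/s.

    Assumes the counter increments once per 4 octets (32 bits).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_words = curr_count - prev_count
    if delta_words < 0:
        # A reset or wrap happened between samples; the delta is unusable.
        raise ValueError("counter went backwards (reset or wrap)")
    bits = delta_words * 4 * 8  # 4 octets per count, 8 bits per octet
    return bits / interval_seconds / 1e9


# Made-up samples taken one second apart.
rate = data_rate_gbps(prev_count=0, curr_count=312_500_000, interval_seconds=1.0)
```

Comparing the resulting rate against the link's nominal speed gives a rough utilization figure for spotting saturated ports.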