Infiniband healthchecking

I was double-checking the health of my InfiniBand setup.

A really useful command is ‘ibdiagnet’, available on Exadata, Exalogic and SPARC SuperCluster. It has several command-line options; here I am asking for the simplest test, with 100 packets sent over each link.

# ibdiagnet -c 100

It gives a summary table at the end of the run showing whether any problems were encountered during the execution.

----------------------------------------------------------------
-I- Stages Status Report:
STAGE                                    Errors Warnings
Bad GUIDs/LIDs Check                     0      0
Link State Active Check                  0      0
General Devices Info Report              0      0
Performance Counters Report              0      1
Partitions Check                         0      0
IPoIB Subnets Check                      0      1

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

So I have two warning areas in my report, which I’ll investigate separately.
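On a bigger fabric I would rather not eyeball that table, so here is a minimal sketch that prints only the stages with a non-zero error or warning count. It assumes the report layout shown above (stage name followed by two numeric columns); flagged_stages is a name I made up, not part of ibdiagnet.

```shell
# Print stages from an ibdiagnet "Stages Status Report" whose error or
# warning count is non-zero. Reads the report text on stdin.
# flagged_stages is a hypothetical helper, not an ibdiagnet feature.
flagged_stages() {
    awk '
        /Stages Status Report/ { in_report = 1; next }
        in_report && /^STAGE/  { next }   # skip the column header
        # A data row ends in two integer columns: errors, then warnings.
        in_report && NF >= 3 && $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ {
            if ($(NF-1) > 0 || $NF > 0) {
                name = $0
                sub(/[ \t]+[0-9]+[ \t]+[0-9]+[ \t]*$/, "", name)
                printf "%s: errors=%s warnings=%s\n", name, $(NF-1), $NF
            }
        }
    '
}
```

Assuming the summary is written to standard output as it was for me, it could be used as: ibdiagnet -c 100 | flagged_stages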

IPoIB Subnets Check

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

This warning looks pretty scary, but it is just saying that the multicast group was created at 4x SDR (10Gb) rather than QDR speed (40Gb), even though all the nodes are QDR. OpenSM defaults to a 10Gb group rate for multicast groups.

Performance Counters Report

If you look in the log file /tmp/ibdiagnet.log and search for -W-, you will find the port(s) with the problem.

-V- PM Port=9 lid=0x0075 guid=0x002128e8adaba0a0 dev=48438 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 159 80
-W- lid=0x0075 guid=0x002128e8adaba0a0 dev=48438 Port=9
Performance Monitor counter     : Value
link_error_recovery_counter     : 0xff (overflow)
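The search described above can be done with grep. This is only a sketch: show_warnings is a hypothetical helper name, and the two lines of trailing context are chosen to match the excerpt above, where the counter name and value follow the -W- line.

```shell
# Show each warning line in an ibdiagnet log plus the two lines after it.
# show_warnings is a hypothetical helper; the default path is the log file
# named at the end of the ibdiagnet summary.
show_warnings() {
    log_file="${1:-/tmp/ibdiagnet.log}"
    # -e stops grep from reading a pattern that contains "-W-" as an option.
    # -A 2 (supported by GNU and BSD grep, not strict POSIX) prints two
    # lines of trailing context after each match.
    grep -A 2 -e '^-W-' "$log_file"
}
```

Called with no argument it reads /tmp/ibdiagnet.log; pass another path to check a saved copy.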

Now to look for the actual device that matches this.

If you look in the file /tmp/ibdiagnet.lst, it gives more information about the port with the problem.

{ SW Ports:24 SystemGUID:002128e8adaba0a3 NodeGUID:002128e8adaba0a0 PortGUID:002128e8adaba0a0 VenID:000002C9 DevID:BD36 Rev:000000A0 {SUN DCS 36P QDR sscasw-ib2.blah.com} LID:0075 PN:09 } { CA Ports:02 SystemGUID:0021280001ef508d NodeGUID:0021280001ef508a PortGUID:0021280001ef508b VenID:000002C9 DevID:673C Rev:000000B0 {MT25408 ConnectX Mellanox Technologies} LID:0035 PN:01 } PHY=4x LOG=ACT SPD=10

So this says LID 0x0075, port 9 on switch sscasw-ib2 is the one with the problem.

Log in to the switch and check out this port.
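Rather than paging through the .lst file, the matching link can be pulled out with grep. This is a sketch under one assumption: the LID and PN fields are zero-padded upper-case hex, which is how they look in the line above (the 36-port switch is listed as Ports:24, and 0x24 is 36). lst_link_for is a name I invented for illustration.

```shell
# Print the .lst line(s) describing the link on a given LID and port.
# Usage: lst_link_for LID PORT [lst_file]
#   LID  with a 0x prefix, as in the ibdiagnet log (e.g. 0x0075)
#   PORT in decimal without a leading zero (e.g. 9)
# lst_link_for is a hypothetical helper. It assumes LID/PN are stored as
# zero-padded upper-case hex, which is only inferred from the sample line.
lst_link_for() {
    lid_hex=$(printf '%04X' "$1")     # 0x0075 -> 0075
    port_hex=$(printf '%02X' "$2")    # 9      -> 09
    lst_file="${3:-/tmp/ibdiagnet.lst}"
    # -F matches the text literally, so the brace needs no escaping.
    grep -F -e "LID:${lid_hex} PN:${port_hex} }" "$lst_file"
}
```

For the port above that would be: lst_link_for 0x0075 9. The switch name is inside the braces on the matching line. perfquery prints the same LID in decimal (Lid 117); printf '%d\n' 0x0075 does that conversion.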

[root@sscasw-ib2 opensm]# perfquery 0x075 9
# Port counters: Lid 117 port 9
PortSelect:......................9
CounterSelect:...................0x1b01
SymbolErrors:....................0
LinkRecovers:....................256
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................4294967295
RcvData:.........................4294967295
XmtPkts:.........................160825707
RcvPkts:.........................218355355

You can also check this out (look for amber lights!) through the management BUI on the switch. You can also use the BUI to work out which physical port on the switch matches this LID/port combo (13A in this case).

I reseated the cable in port 13A and cleared the error counters.

[root@sscasw-ib2 opensm]# ibclearcounters

Now I’m monitoring the status of the port – if the error count increases again I will replace the cable.
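To save checking by hand, something like the following could watch the counter. It is a sketch only: counter_value and check_link_recovers are hypothetical helper names, and it assumes the perfquery output keeps the "Name:....value" layout shown earlier.

```shell
# Pull one counter out of perfquery-style output ("Name:.....value") on stdin.
# counter_value and check_link_recovers are hypothetical helpers.
counter_value() {
    # Split each line on runs of ':' and '.', so $1 is the counter name
    # and $2 is its value.
    awk -v name="$1" -F'[:.]+' '$1 == name { print $2 }'
}

# Read perfquery output on stdin and compare LinkRecovers to a threshold.
# Returns 0 if at or below the threshold, 1 if above, 2 if not found.
check_link_recovers() {
    threshold="${1:-0}"
    value=$(counter_value LinkRecovers)
    if [ -z "$value" ]; then
        echo "LinkRecovers not found in input" >&2
        return 2
    fi
    if [ "$value" -gt "$threshold" ]; then
        echo "LinkRecovers=$value exceeds $threshold"
        return 1
    fi
    echo "LinkRecovers=$value ok"
}
```

On the switch this could be run as: perfquery 0x075 9 | check_link_recovers 0. A non-zero exit status then means the link has recovered again since the counters were cleared.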
