Determining root cause for a SCSI Reservation Conflict Issue

Previous deep dive posts have dealt with performance issues or faulty hardware. This week Nathan Small takes us through the determination of root cause for a SCSI Reservation Conflict issue:
History of issue:
Customer performed a firmware upgrade to their IBM SVC's. Upon completing the firmware update and bringing them back online, command failures and SCSI reservation conflicts were observed across all hosts. Issue was resolved by sending LUN reset (vmkfstools -L lunreset /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxxx) to the affected LUNs. Root cause analysis requested.
We need to start by ensuring that all hosts saw the issue the same way, and that the firmware update sequence logically makes sense. Let's start by looking at the firmware update sequence, and to determine when it began and finished. We have to locate when the paths to one target dropped:
2012-04-29T13:18:19.684Z cpu10:4768)<3> rport-5:0-4: blocked FC remote port time out: saving binding
2012-04-29T13:18:19.685Z cpu2:4794)<3>lpfc820 0000:04:00.0: 0:(0):0203 Devloss timeout on WWPN 50:05:07:68:01:30:59:fd NPort x3f010d Data: x0 x7 x0
2012-04-29T13:18:19.713Z cpu12:4759)<3> rport-6:0-4: blocked FC remote port time out: saving binding
2012-04-29T13:18:19.714Z cpu7:4795)<3>lpfc820 0000:04:00.1: 1:(0):0203 Devloss timeout on WWPN 50:05:07:68:01:10:59:fd NPort x5d0010 Data: x0 x7 x0
From the above we can see there were 2 paths dropped, 1 per HBA. Infact, these Emulex device loss messages even give you the WWPN of the IBM SVC targets:
WWPN 50:05:07:68:01:30:59:fd
WWPN 50:05:07:68:01:10:59:fd
For our example, we will refer to LUN 42 when comparing WWPNs of targets and tracking path failures:
naa.60050768018900725000000000000810 : IBM Fibre Channel Disk (naa.60050768018900725000000000000810)
vmhba1:C0:T2:L42 LUN:42 state:active fc Adapter: WWNN: 20:00:00:00:c9:a2:26:87 WWPN: 10:00:00:00:c9:a2:26:87 Target: WWNN: 50:05:07:68:01:00:59:fd WWPN: 50:05:07:68:01:10:59:fd
vmhba1:C0:T1:L42 LUN:42 state:active fc Adapter: WWNN: 20:00:00:00:c9:a2:26:87 WWPN: 10:00:00:00:c9:a2:26:87 Target: WWNN: 50:05:07:68:01:00:59:fc WWPN: 50:05:07:68:01:20:59:fc
vmhba0:C0:T3:L42 LUN:42 state:active fc Adapter: WWNN: 20:00:00:00:c9:a2:26:86 WWPN: 10:00:00:00:c9:a2:26:86 Target: WWNN: 50:05:07:68:01:00:59:fc WWPN: 50:05:07:68:01:40:59:fc
vmhba0:C0:T2:L42 LUN:42 state:active fc Adapter: WWNN: 20:00:00:00:c9:a2:26:86 WWPN: 10:00:00:00:c9:a2:26:86 Target: WWNN: 50:05:07:68:01:00:59:fd WWPN: 50:05:07:68:01:30:59:fd
Based on the Device Loss messages from the Emulex (lpfc820) driver, we can see that the first and last paths listed were the ones that went down. This means that the other two paths to this LUN can still be used for access.
A non-disruptive upgrade sequence involves taking one array controller offline, apply firmware update, bring controller online, and then apply the same sequence to the other controller. Each of these sequences requires the connected ESX hosts to failover paths (MRU/FIXED) or remove those dead paths from the working path list (RR aka Round Robin). In this environment the PSP (Path Selection Policy) used is FIXED so we see failover events:
2012-04-29T13:18:19.745Z cpu10:6334)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4124011a3a00) to dev "naa.60050768018900725000000000000810" on path "vmhba0:C0:T2:L42" Failed: H:0×1 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.Act:FAILOVER
2012-04-29T13:18:19.745Z cpu10:6334)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60050768018900725000000000000810": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
The above messages shows us a Host status of '1' (H:0×1), which translates to NO_CONNECT, and is a valid failover condition. For more information on NMP host statuses, See KB 1029039. Next we will see actual failover sequence:
2012-04-29T13:18:19.747Z cpu9:4278)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from vmhba0:C0:T2:L42 to vmhba1:C0:T1:L42 for device "naa.60050768018900725000000000000810".
2012-04-29T13:18:19.747Z cpu2:4786)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60050768018900725000000000000810" - issuing command 0x4124011a3a00
2012-04-29T13:18:19.748Z cpu6:7233)NMP: nmpCompleteRetryForPath:321: Retry world recovered device "naa.60050768018900725000000000000810"
When you see "Retry world recovered device" or "Retry world restored device", this means that the path has switched and commands are completing on the new path. The firmware update is applied, controller is rebooted, and then brought back online.
16 minutes later we observe LUN resets for all paths/LUN on the controller that was brought down for the firmware update
2012-04-29T13:34:16.269Z cpu1:4144)ScsiCore: 1455: Power-on Reset occurred on vmhba1:C0:T2:L42
2012-04-29T13:34:16.270Z cpu1:4144)ScsiCore: 1455: Power-on Reset occurred on vmhba0:C0:T2:L42
This is the result of the SVC coming back online after the firmware upgrade. There are no subsequent messages showing that the ESX host has failed back to the original paths, however the reason for this is because there were no preferred paths set by the customer. For more information on this behavior, see KB Preferred path behaves like MRU if preferred paths are not specified in fixed path policy (1009414).
Now it is time to apply the firmware update to the second IBM SVC:
2012-04-29T14:21:23.073Z cpu20:4778)<3> rport-5:0-5: blocked FC remote port time out: saving binding
2012-04-29T14:21:23.074Z cpu20:4794)<3>lpfc820 0000:04:00.0: 0:(0):0203 Devloss timeout on WWPN 50:05:07:68:01:40:59:fc NPort x3f010b Data: x0 x7 x0
2012-04-29T14:21:23.143Z cpu2:4762)<3> rport-6:0-3: blocked FC remote port time out: saving binding
2012-04-29T14:21:23.144Z cpu18:4795)<3>lpfc820 0000:04:00.1: 1:(0):0203 Devloss timeout on WWPN 50:05:07:68:01:20:59:fc NPort x5d000f Data: x0 x7 x0
This time we see the other IBM SVC device disappear:
WWPN 50:05:07:68:01:40:59:fc
WWPN 50:05:07:68:01:20:59:fc
Just as before, we see the host status of "H:0×1" or NO_CONNECT, which requires us to failover:
2012-04-29T14:21:23.153Z cpu2:7233)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412400dd5080) to dev "naa.60050768018900725000000000000810" on path "vmhba1:C0:T1:L42" Failed: H:0×1 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.Act:FAILOVER
2012-04-29T14:21:23.153Z cpu2:7233)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60050768018900725000000000000810": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Then we see failover is successful:
2012-04-29T14:21:23.154Z cpu9:4283)vmw_psp_fixed: psp_fixedSelectPathToActivateInt:479: Changing active path from vmhba1:C0:T1:L42 to vmhba1:C0:T2:L42 for device "naa.60050768018900725000000000000810".
2012-04-29T14:21:23.154Z cpu6:4786)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60050768018900725000000000000810" - issuing command 0x412400dd5080
2012-04-29T14:21:27.640Z cpu2:4132)WARNING: NMP: nmpCompleteRetryForPath:429:Retry world restored device "naa.60050768018900725000000000000810" - no more commands to retry
Later we observe more LUN resets, indicating that the IBM SVC is back online:
2012-04-29T14:37:29.305Z cpu17:4759)ScsiCore: 1455: Power-on Reset occurred on vmhba1:C0:T1:L42
2012-04-29T14:37:29.306Z cpu23:4759)ScsiCore: 1455: Power-on Reset occurred on vmhba0:C0:T3:L42
At this point it looks like a clean and simple firmware upgrade to both controllers has completed, however shortly after the last controller came back online, we see a flood of SCSI commands to LUNs begin to fail with various host statuses:
2012-04-29T14:37:46.430Z cpu5:4101)ScsiDeviceIO: 2305: Cmd(0x412400d7ce40) 0x2a, CmdSN 0x104bbb to dev "naa.60050768018900725000000000000672" failed H:0xb D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:37:46.430Z cpu5:4755)ScsiDeviceIO: 2305: Cmd(0x412401f5e3c0) 0x2a, CmdSN 0x2c47eb to dev "naa.600507680189007250000000000007d7" failed H:0xb D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:37:48.427Z cpu1:8559)ScsiDeviceIO: 2305: Cmd(0x412400d7ce40) 0x2a, CmdSN 0x40b3c1 to dev "naa.6005076801890072500000000000080d" failed H:0xb D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:37:48.427Z cpu3:4099)ScsiDeviceIO: 2305: Cmd(0x412401d2e5c0) 0x2a, CmdSN 0x54f766 to dev "naa.6005076801890072500000000000065e" failed H:0×5 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:37:48.497Z cpu13:4109)ScsiDeviceIO: 2288: Cmd(0x412401f94840) 0x2a, CmdSN 0×80000015 to dev "naa.6005076801890072500000000000065e" failed H:0×8 D:0×0 P:0×0
This is where all hosts started to observe an inability to complete SCSI commands to the IBM SVC. All hosts are littered with the above failure messages, and with Emulex verbose logging we might be able to determine why different host errors statuses were returned, but that isn't important right now. This behavior eventually leads to the following:
2012-04-29T14:38:17.866Z cpu17:134520)ScsiDeviceIO: 2305: Cmd(0x412440df52c0) 0×16, CmdSN 0×207323 to dev "naa.6005076801890072500000000000065e" failed H:0×2 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:38:17.866Z cpu17:134520)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L17, reservation state on device naa.6005076801890072500000000000065e is unknown.
2012-04-29T14:38:18.804Z cpu4:4100)ScsiDeviceIO: 2305: Cmd(0x412401f5e3c0) 0×16, CmdSN 0×207324 to dev "naa.6005076801890072500000000000061e" failed H:0×2 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-04-29T14:38:18.804Z cpu4:4100)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L7, reservation state on device naa.6005076801890072500000000000061e is unknown.
What you are looking at here is a command failure for SCSI command 0×16, which is a SCSI-2 RESERVE command. Not only is this a command failure, but we specifically state that the reservation state of a LUN is “unknown”. This message is displayed when we attempt to issue a SCSI-2 RESERVE and the command does not complete. When we attempt to send an abort for that command and the abort does not complete either, we legitimately do not know what the reservation state of the LUN is at that point. This will typically only happen if there are severe problems trying to communicate with the device, whether the problem is fabric related or the array itself. The fact that this is an IBM SVC will add a layer of complexity to the analysis since there are arrays behind it.
Next we see Reservation errors:
2012-04-29T14:38:28.606Z cpu14:4133)WARNING: FS3: 1835: Reservation error: Timeout
2012-04-29T14:38:28.807Z cpu22:4131)WARNING: FS3: 1835: Reservation error: IO was aborted
Then eventually we see the SATP for the IBM SVC report SCSI Reservation Conflicts for LUNs 1, 2, and 14:
2012-04-29T14:39:53.082Z cpu7:4277)VMW_SATP_SVC: satp_svc_UpdatePath:213: Failed to update path "vmhba0:C0:T3:L1" state. Status=SCSI reservation conflict
2012-04-29T14:39:53.082Z cpu17:4278)VMW_SATP_SVC: satp_svc_UpdatePath:213: Failed to update path "vmhba1:C0:T1:L2" state. Status=SCSI reservation conflict
2012-04-29T14:43:27.452Z cpu13:4277)VMW_SATP_SVC: satp_svc_UpdatePath:213: Failed to update path "vmhba1:C0:T1:L14" state. Status=SCSI reservation conflict
This needs to be our area of focus. The issue was finally resolved by issuing a LUN reset to the affected LUNs:
2012-04-30T02:26:32.197Z cpu23:2689589)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.6005076801890072500000000000083b. This will clear any SCSI-2 reservations on the device.
2012-04-30T02:26:32.204Z cpu23:2689589)<3>lpfc820 0000:04:00.1: 1:(0):0713 SCSI layer issued Device Reset (1, 14) reset status x2002 flush status x2002
2012-04-30T02:26:32.204Z cpu23:2689589)Resv: 618: Executed out-of-band lun reset on naa.6005076801890072500000000000083b
2012-04-30T02:26:32.873Z cpu6:2689590)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.60050768018900725000000000000672. This will clear any SCSI-2 reservations on the device.
2012-04-30T02:26:32.879Z cpu6:2689590)<3>lpfc820 0000:04:00.0: 0:(0):0713 SCSI layer issued Device Reset (2, 2) reset status x2002 flush status x2002
2012-04-30T02:26:32.879Z cpu6:2689590)Resv: 618: Executed out-of-band lun reset on naa.60050768018900725000000000000672
2012-04-30T02:29:28.765Z cpu18:1055548)WARNING: NMP: nmpDeviceTaskMgmt:2210:Attempt to issue lun reset on device naa.6005076801890072500000000000095f. This will clear any SCSI-2 reservations on the device.
2012-04-30T02:29:28.773Z cpu18:1055548)<3>lpfc820 0000:04:00.1: 1:(0):0713 SCSI layer issued Device Reset (2, 22) reset status x2002 flush status x2002
2012-04-30T02:29:28.773Z cpu18:1055548)Resv: 618: Executed out-of-band lun reset on naa.6005076801890072500000000000095f
Why did a SCSI-2 RESERVE get left on a LUN requiring a manual LUN reset be issued to the device to clear it? Well, as stated above, we observed a host state that the reservation state of a LUN was unknown because of the following sequence:
1. Issue SCSI command 0×16 (SCSI RESERVE)
2. Command fails (timeout)
3. Issue abort for the SCSI command 0×16
4. Command may fail
5. Reservation state of device is unknown
When we look at the other hosts during that time period, they also had difficulty completing SCSI commands, specifically SCSI-2 RESERVE resulting in the other hosts not knowing the reservation state of LUNs:
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:02.029Z cpu5:4101)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L17, reservation state on device naa.6005076801890072500000000000065e is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:02.029Z cpu21:4117)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L16, reservation state on device naa.6005076801890072500000000000065d is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:02.029Z cpu21:4117)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L39, reservation state on device naa.6005076801890072500000000000080d is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:42.028Z cpu17:4113)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L18, reservation state on device naa.6005076801890072500000000000066f is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:42.028Z cpu3:4099)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L2, reservation state on device naa.60050768018900725000000000000672 is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:38:42.029Z cpu17:4113)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L15, reservation state on device naa.600507680189007250000000000007d7 is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:39:22.028Z cpu10:550864)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L39, reservation state on device naa.6005076801890072500000000000080d is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:39:22.028Z cpu18:7351)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L14, reservation state on device naa.6005076801890072500000000000083b is unknown.
esx-esxhost1.local-2012-04-30-14.41/var/run/log/vmkernel.1:2012-04-29T14:39:22.028Z cpu10:550864)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L17, reservation state on device naa.6005076801890072500000000000065e is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:17.866Z cpu17:134520)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L17, reservation state on device naa.6005076801890072500000000000065e is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:18.804Z cpu4:4100)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T0:L7, reservation state on device naa.6005076801890072500000000000061e is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:18.866Z cpu23:2134223)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L39, reservation state on device naa.6005076801890072500000000000080d is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:57.866Z cpu15:2134223)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L3, reservation state on device naa.60050768018900725000000000000721 is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:58.866Z cpu2:1781073)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L18, reservation state on device naa.6005076801890072500000000000066f is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:38:58.866Z cpu23:1675016)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L15, reservation state on device naa.600507680189007250000000000007d7 is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:39:37.866Z cpu17:2154857)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L1, reservation state on device naa.600507680189007250000000000006d7 is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:39:38.866Z cpu20:1675016)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L17, reservation state on device naa.6005076801890072500000000000065e is unknown.
esx-esxhost2.local-2012-04-30-14.52/var/run/log/vmkernel.1:2012-04-29T14:39:38.866Z cpu20:1675016)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T2:L2, reservation state on device naa.60050768018900725000000000000672 is unknown.
esx-esxhost3.local-2012-04-30-09.49/var/run/log/vmkernel.log:2012-04-29T09:40:01.001Z cpu8:4104)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T0:L14, reservation state on device naa.6005076801890072500000000000083b is unknown.
esx-esxhost3.local-2012-04-30-09.49/var/run/log/vmkernel.log:2012-04-29T09:40:41.000Z cpu0:4120)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba0:C0:T0:L14, reservation state on device naa.6005076801890072500000000000083b is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:00.633Z cpu17:4113)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T3:L7, reservation state on device naa.6005076801890072500000000000061e is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:01.633Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L3, reservation state on device naa.60050768018900725000000000000721 is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:01.633Z cpu15:14989)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L15, reservation state on device naa.600507680189007250000000000007d7 is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:40.633Z cpu3:4099)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L1, reservation state on device naa.600507680189007250000000000006d7 is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:41.633Z cpu5:4101)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L14, reservation state on device naa.6005076801890072500000000000083b is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:39:41.633Z cpu16:15845)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T3:L7, reservation state on device naa.6005076801890072500000000000061e is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:40:20.634Z cpu15:4174)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L3, reservation state on device naa.60050768018900725000000000000721 is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:40:21.634Z cpu17:4113)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L15, reservation state on device naa.600507680189007250000000000007d7 is unknown.
esx-esxhost4.local-2012-04-30-14.37/var/run/log/vmkernel.1:2012-04-29T14:40:21.634Z cpu17:4113)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba1:C0:T1:L1, reservation state on device naa.600507680189007250000000000006d7 is unknown.
This shows us that more than LUN 1, 2, and 14 reported the reservation state unknown errors, though LUNs 1, 2, and 14 were the only LUNs that had a SCSI-2 RESERVE left on the volume. The reason why one LUN had the reserve left on it while another did not is purely due to whatever operation a host was doing when it attempted to grab the reserve. Determining why this was the case for one LUN over another is not as important as why the hosts started to have difficulty completing SCSI commands to the IBM SVC following the reboot of the last controller after a firmware update.
These unknown reservation states occurred for a 2 minute window between 2012-04-29T14:38:02.029Z and 2012-04-29T14:40:21.634Z. Since this is UTC time, this translates to 9:38am to 9:40am, which the time period in which the last SVC reboot occurred and then brought its paths back online (9:38am) to the moment the SVC had fully logged back into its targets (9:40am). This can be illustrated with the following entries from the IBM SVC logs:
Error Log Entry 960
Node Identifier : SVCN4
Object Type : node
Object ID : 13
Copy ID :
Sequence Number : 30815
Root Sequence Number : 30815
First Error Timestamp : Sun Apr 29 09:38:13 2012
: Epoch + 1335710293
Last Error Timestamp : Sun Apr 29 09:38:13 2012
: Epoch + 1335710293
Error Count : 1
Error ID : 987102 : Node coldstarted
Error Code :
Status Flag : SNMP trap raised
Type Flag : INFORMATION
Error Log Entry 6438
Node Identifier : SVCN3
Object Type : cluster
Object ID : 0
Copy ID :
Sequence Number : 0
Root Sequence Number : 0
First Error Timestamp : Sun Apr 29 09:38:18 2012
: Epoch + 1335710298
Last Error Timestamp : Sun Apr 29 09:38:18 2012
: Epoch + 1335710298
Error Count : 1
Error ID : 990415 : Software Install completed
Error Code :
Status Flag :
Type Flag : CONFIGURATION EVENT
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Error Log Entry 962
Node Identifier : SVCN4
Object Type : cluster
Object ID : 0
Copy ID :
Sequence Number : 30816
Root Sequence Number : 30813
First Error Timestamp : Sun Apr 29 09:40:43 2012
: Epoch + 1335710443
Last Error Timestamp : Sun Apr 29 09:40:43 2012
: Epoch + 1335710443
Error Count : 1
Error ID : 981003 : FC discovery occurred, configuration changes complete
Error Code :
Status Flag : SNMP trap raised
Type Flag : INFORMATION
01 02 00 00 04 1E 00 00 04 04 00 00 06 00 06 00
06 00 00 00 50 01 00 00 60 00 60 00 60 00 30 00
00 00 00 00 00 00 00 00 68 FE 01 00 66 00 00 00
32 01 00 00 F0 FF FF FF 0B 00 00 00 66 00 00 00
66 00 66 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Error Log Entry 420
Node Identifier : SVCN3
Object Type : cluster
Object ID : 0
Copy ID :
Sequence Number : 30817
Root Sequence Number : 30813
First Error Timestamp : Sun Apr 29 09:40:43 2012
: Epoch + 1335710443
Last Error Timestamp : Mon Apr 30 01:00:12 2012
: Epoch + 1335765612
Error Count : 2
Error ID : 981004 : FC discovery occurred, no configuration changes were detected
Error Code :
Status Flag : SNMP trap raised
Type Flag : INFORMATION
01 01 00 00 1C 1E 00 00 04 04 00 00 06 00 06 00
00 00 00 00 50 01 00 00 60 00 60 00 60 00 30 00
00 00 00 00 00 00 00 00 68 FE 01 00 00 00 00 00
98 01 00 00 F0 FF FF FF 0F 00 00 00 66 00 00 00
00 00 66 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
The collective ESX logs alone clearly show that the SVC was not completing all SCSI commands from the hosts within the 5 second timeout window during the timeframe of 9:38 and 9:40, but some of those commands did complete, particularly a few SCSI-2 reserves that the ESX hosts lost track of due to the timeout events. Also, based on the ESX hosts logs and the SVC logs, we would postulate that the SVC advertised its paths as online and active before it had fully logged into the arrays behind it. This means that commands could have been queued up on the SVC side while waiting for login to complete and then flush those commands through, which may have explained the behavior we observed.
While the SCSI-2 Reserve was a symptom of the issue, this scenario as well as many other scenarios regarding SCSI Reservation Conflicts can be avoid by utilizing the hardware assisted locking feature of VAAI, also known as ATS (Atomic Test and Set). This type of locking will not prevent another host from accessing the volume. This feature was introduced in vSphere 4.1 and improved in vSphere 5.0, and requires a storage array that supports VAAI.
For more information on VAAI and ATS, please see KB vStorage APIs for Array Integration FAQ (1021976).
Dropped Storage Paths Across all Hosts

Today, Nathan Small brings us another log deep-dive, this time for a problem with storage paths being dropped randomly across all hosts. Jumping right in…
First we will want to observe when the FC ports from the array timeout across all hosts, mostly to see if this events occurred simultaneously. The logs were collect via remote syslog (best practice when using ESXi 4.x hosts):
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:57.346 cpu7:4445)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:57.346 cpu5:4450)<3> rport-3:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:57.346 cpu5:4450)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:57.346 cpu3:4463)<3> rport-3:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:39.231 cpu15:4440)<3> rport-4:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu14:4447)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding
We can see above that all paths for a given HBA are dropping at the same time, however it affects hosts at different times. Since this issue had occurred multiple times in the past for the customer, we enabled verbose Qlogic driver logging, which would show when the HBA logs back into the targets. The default logging for the Qlogic driver will only show when a device has disappeared. FC ports restored:
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.639 cpu5:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.764 cpu5:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.901 cpu6:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.957 cpu10:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T15:31:50-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:55.834 cpu4:4719)scsi(4:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:22:15.849 cpu7:4719)scsi(4:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:22:15.925 cpu7:4719)scsi(4:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T15:32:11-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:22:16.813 cpu10:4719)scsi(4:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:58.151 cpu19:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.201 cpu19:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:19.057 cpu15:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:40.071 cpu17:4719)scsi(4:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:40.117 cpu16:4719)scsi(4:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:40.195 cpu16:4719)scsi(4:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T16:02:06-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:40.989 cpu13:4719)scsi(4:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T16:07:39-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:10.470 cpu20:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.630 cpu12:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.969 cpu18:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.781 cpu9:4718)scsi(4:0) rport_add: 5006016146e06b58 190600
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.844 cpu6:4718)scsi(4:1) rport_add: 5006016846e06b58 190700
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.929 cpu5:4718)scsi(4:2) rport_add: 5006016846e06b40 d30700
<181>1 2012-04-21T16:10:29-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:05.013 cpu4:4718)scsi(4:3) rport_add: 5006016146e06b40 d30900
Let's take a closer look at the ESX105 host outage:
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.346 cpu22:4118)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.346 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.349 cpu22:4118)<6>qla2xxx 0000:04:00.6: scsi(3:2:66) Port down status: port-state=0×3
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.349 cpu22:4118)<6>qla2xxx 0000:04:00.6: scsi(3:2:66): FCP command status: 0×28-0×0 (0×20000) portid=b10800 oxid=0×806 ser=0x1728f030 cdb=120000 len=0xff rsp_info=0×0 resid=0×0 fw_resid=0×0
<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:17.577 cpu22:4118)qla24xx_report_id_acquisition:format 1: scsi(3): VP[0] enabled - status 0 - with port id 45000f
<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:17.587 cpu22:4118)scsi(3): Asynchronous PORT UPDATE.
<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:17.587 cpu22:4118)<6>scsi(3): Port database changed 0001 0007 1700.
<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:17.976 cpu14:4717)scsi(3): qla2x00_reset_marker()
<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:17.976 cpu14:4717)scsi(3): qla2x00_loop_resync()
<181>1 2012-04-21T15:35:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:18.574 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.
<181>1 2012-04-21T15:35:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:18.876 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.
<181>1 2012-04-21T15:35:48-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:19.188 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
<181>1 2012-04-21T15:35:56-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:27.605 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
<181>1 2012-04-21T15:35:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:29.602 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.
Physical connection comes back:
<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:57.558 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:57.558 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.201 cpu19:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): LOOP READY
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): qla2x00_loop_resync - end
Second drop occurs:
<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:19.685 cpu22:16844)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:19.685 cpu22:16844)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-0 - port retry count: 0 remaining
<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-1 - port retry count: 0 remaining
<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-2 - port retry count: 0 remaining
<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-3 - port retry count: 0 remaining
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu14:4447)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Device is visible 2 seconds later:
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.759 cpu22:4118)scsi(3): DCBX Completed — 0000 0000 0000
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.917 cpu22:4118)qla24xx_report_id_acquisition:format 1: scsi(3): VP[0] enabled - status 0 - with port id 45000f
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.925 cpu3:4099)scsi(3): Loop Down - aborting the queues before time expire
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.925 cpu17:4717)scsi(3): qla2x00_reset_marker()
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.926 cpu22:4118)scsi(3): Asynchronous PORT UPDATE.
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.926 cpu22:4118)<6>scsi(3): Port database changed 0002 0007 1700.
<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:49.967 cpu22:16606)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/1700.
<181>1 2012-04-21T16:07:19-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:50.926 cpu21:4717)scsi(3): qla2x00_loop_resync()
<181>1 2012-04-21T16:07:21-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:52.205 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.
Then connection is established again:
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:09.892 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:09.892 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:09.893 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored ffff/0006/0000.
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:10.018 cpu14:4717)scsi(3): F/W Ready - OK
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): LOOP READY
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): qla2x00_loop_resync - end
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.947 cpu18:4717)scsi(3): qla2x00_port_login()
-
Let's compare with another host:
<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:17.726 cpu2:4098)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
Port timeout:
<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-0 - port retry count: 0 remaining
<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-2 - port retry count: 0 remaining
<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-3 - port retry count: 0 remaining
<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-4 - port retry count: 0 remaining
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding
Loop up 10 seconds later:
<181>1 2012-04-21T15:17:11-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:47.988 cpu2:4098)scsi(3): Asynchronous PORT UPDATE.
<181>1 2012-04-21T15:17:11-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:47.988 cpu2:4098)<6>scsi(3): Port database changed 0000 0007 1700.
<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:48.799 cpu11:4717)scsi(3): qla2x00_reset_marker()
<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:48.799 cpu11:4717)scsi(3): qla2x00_loop_resync()
<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:48.971 cpu2:3615303)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
<181>1 2012-04-21T15:17:14-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:50.363 cpu2:14604)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
<181>1 2012-04-21T15:17:22-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:58.986 cpu2:14606)scsi(3): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
<181>1 2012-04-21T15:17:23-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:59.971 cpu2:4098)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.
<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:27.962 cpu2:4098)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
-
If we look at this from just a LOOP perspective:
<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:17.726 cpu2:4098)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:27.962 cpu2:4098)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.764 cpu2:4717)scsi(3): LOOP READY
LOOP detected down at 15:16:41 and then detected as up again 70 seconds later, however the LIP (Loop Initialization) did not complete until 25 seconds after that. Now let us look at the next host:
<181>1 2012-04-21T15:30:59-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:04.984 cpu21:4117)scsi(4): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:30:59-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:04.984 cpu21:4117)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:31:49-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:55.278 cpu21:4117)scsi(4): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T15:31:49-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:21:55.278 cpu21:4117)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel - - - vmkernel: 24:00:22:15.926 cpu7:4719)scsi(4): LOOP READY
LOOP detected down at 15:30:59, which isn't the same time as the first host. The loop is up again 50 seconds later with the LIP finishing 20 seconds later. Onto the next host:
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.346 cpu22:4118)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:21:47.346 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:57.558 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:22:57.558 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): LOOP READY
LOOP detected down at 15:35:16, yet again not the same as either host. The loop is up again 70 seconds later with the LIP finishing 20 seconds later. Next host:
<181>1 2012-04-21T16:00:55-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:29.229 cpu6:4102)scsi(4): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:00:55-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:38:29.229 cpu6:4102)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:01:45-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:19.519 cpu6:14454)scsi(4): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T16:01:45-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:19.519 cpu6:14454)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel - - - vmkernel: 23:20:39:40.195 cpu15:4719)scsi(4): LOOP READY
LOOP detected down at 16:00:55, still different from every other host. The loop is up again 50 seconds later with the LIP finishing 25 seconds later. Onto the last outage:
<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:19.685 cpu22:16844)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:53:19.685 cpu22:16844)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:09.892 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).
<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:09.892 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel - - - vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): LOOP READY
LOOP detected down at 16:06:48, LOOP up 50 seconds later with the LIP finishing in only 3 seconds.
Let's take a closer look at ESX006, since it lost paths to both HBAs, though at different times:
All paths to the first HBA drop:
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding
<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding
All paths to the second HBA drop:
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding
Why did this host observe path failures to the second HBA? No other host has issues with both HBAs.
Let's breakdown when the paths came back:
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.639 cpu5:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.764 cpu5:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.901 cpu6:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500
<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:52.957 cpu10:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.781 cpu9:4718)scsi(4:0) rport_add: 5006016146e06b58 190600
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.844 cpu6:4718)scsi(4:1) rport_add: 5006016846e06b58 190700
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:04.929 cpu5:4718)scsi(4:2) rport_add: 5006016846e06b40 d30700
<181>1 2012-04-21T16:10:29-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:05.013 cpu4:4718)scsi(4:3) rport_add: 5006016146e06b40 d30900
Interestingly enough, we see another type of device drop its connection around the same time, but the connection is restored at the EXACT same time:
<14>1 2012-04-21T16:10:06-04:00 ESX006.local vobd - - - vobd: Apr 21 16:10:06.896: 2065962831343us: [vob.net.vmnic.linkstate.down] vmnic vmnic3 linkstate down.
<14>1 2012-04-21T16:10:27-04:00 ESX006.local vobd - - - vobd: Apr 21 16:10:27.936: 2065983869689us: [vob.net.vmnic.linkstate.up] vmnic vmnic3 linkstate up.
Is it just a coincidence that the problem suddenly spread to a vmnic instead of staying isolated to vmhbas? No, it is not. The HBAs and NICs in these hosts are the same physical device:
# cat /proc/scsi/qla2xxx/3
QLogic PCI-Express Single Port 10Gb CNA for QLE8240:
FC Firmware version 4.09.34 (804), Driver version 841.k1.34.1-1vmw
Host Device Name vmhba2
This is a CNA device which is capable of being not only an FCoE HBA but an H/W iSCSI HBA as well:
# esxcfg-nics -l
Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic2 0000:04:00.00 qlcnic Up 10000Mbps Full 00:0e:1e:05:0a:e8 1500 QLogic Corp QLogic QLE8240 10GbE Single Channel CNA
vmnic3 0000:06:00.00 qlcnic Up 10000Mbps Full 00:0e:1e:05:0c:90 1500 QLogic Corp QLogic QLE8240 10GbE Single Channel CNA
Let's compare the PCI device outputs for the vmhba:
<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel - - - vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.
<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).
<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:03.966 cpu11:4718)<6>qla2xxx 0000:06:00.6: LOOP UP detected (10 Gbps).
This tells us that vmnic2 and vmhba2 are the same physical card (04:00.6 = bus 04, slot 00, function 6), while vmnic3 and vmhba3 are also the same card (06:00.6 = bus 06, slot 00, function 6).
Now, why was there no LOOP DOWN event for vmhba3 on this host? We clearly see a LOOP UP event, so what gives? Well, to answer that question we need to investigate the log messages surrounding the 'rport' messages, which indicate that the array controller port timed out:
Rport messages (again):
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding
Therefore the logging frame of reference we need is at the 2012-04-21T16:10:17 time period:
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.026 cpu0:4454)<6>qlcnic 0000:06:00.0: Loading firmware from flash
This is the Qlogic CNA driver loading the firmware from flash for the same PCI device that we see a LOOP UP on. Let's grep for "qlcnic" for more information. The qlcnic is the Qlogic CNA driver:
<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:42.441 cpu11:4450)<6>qlcnic 0000:06:00.0: Changed the adapter dev_state to NEED_RESET.
<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:42.441 cpu11:4450)<6>qlcnic 0000:06:00.0: Adapter not in operational state, resetting adapter.
<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:42.441 cpu17:4445)<6>qlcnic 0000:06:00.0: Disabled bus mastering.
<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel - - - vmkernel: <6>qlcnic 0000:06:00.0: PEG_HALT_STATUS1: 0×40006700, PEG_HALT_STATUS2: 0×662240,
<181>1 2012-04-21T16:10:08-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:44.445 cpu10:4454)<6>qlcnic 0000:06:00.0: Adapter reset acked by reset owner.
<181>1 2012-04-21T16:10:08-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:44.445 cpu10:4454)<6>qlcnic 0000:06:00.0: Changed adapter state to DEV_INITIALIZING.
<181>1 2012-04-21T16:10:11-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:47.520 cpu10:4454)<6>qlcnic 0000:06:00.0: Successfully collected firmware minidump.
<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:52:53.026 cpu0:4454)<6>qlcnic 0000:06:00.0: Loading firmware from flash
<181>1 2012-04-21T16:10:25-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:01.806 cpu9:4454)<6>qlcnic 0000:06:00.0: Adapter state set to DEV_READY.
<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:02.183 cpu8:4454)<6>qlcnic 0000:06:00.0: Dual XGb SFP+ LP, Board S/N: SFP_P3P, Chip rev: 0×54
<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:02.183 cpu8:4454)<6>qlcnic 0000:06:00.0: Firmware v4.9.34 [legacy]
<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:02.184 cpu8:4454)<6>qlcnic 0000:06:00.0: Re enabled bus mastering.
<181>1 2012-04-21T16:10:27-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:03.456 cpu16:4112)<6>qlcnic 0000:06:00.0: vmnic3 NIC Link is up
<181>1 2012-04-21T16:10:32-04:00 ESX006.local vmkernel - - - vmkernel: 23:21:53:08.871 cpu11:4145)<3>qlcnic_netq_ops: OP 16 not supported
Typically this behavior is observed with faulty hardware, but it isn’t impossible for this to be the result of a firmware bug or incompatibility with the switch they are connected with. When reviewing past incidences for this behavior in the logs, we can see that this same event occurred on other hosts:
<181>1 2012-04-07T06:07:32-04:00 ESX001.local vmkernel - - - vmkernel: 8:13:30:24.940 cpu16:4447)<6>qlcnic 0000:04:00.0: Changed the adapter dev_state to NEED_RESET.
<181>1 2012-04-07T06:07:32-04:00 ESX001.local vmkernel - - - vmkernel: 8:13:30:24.940 cpu16:4447)<6>qlcnic 0000:04:00.0: Adapter not in operational state, resetting adapter.
<181>1 2012-04-02T16:20:13-04:00 ESX106.local vmkernel - - - vmkernel: 40:22:32:23.462 cpu2:9060083)<6>qlcnic 0000:06:00.0: Changed the adapter dev_state to NEED_RESET.
<181>1 2012-04-02T16:20:13-04:00 ESX106.local vmkernel - - - vmkernel: 40:22:32:23.464 cpu2:9060083)<6>qlcnic 0000:06:00.0: Adapter not in operational state, resetting adapter.
Since we aren't positive why this is occurring, it is recommended to run the latest driver from QLogic to rule out old driver/firmware issues. It is important for us to confirm the issue still occurs on the latest firmware and driver, particularly if we need to involve the engineering team from the HBA vendor. This customer is currently running Driver version 841.k1.34.1-1vmw and 911.k1.1-19vmw is available on the QLogic site as of today, April 26th 2012.
Usually as a best practice we recommend running the latest firmware, as firmware updates, just like driver updates, contain bug fixes. We say usually since new code branches of firmware can sometimes come with a new set of bugs, so waiting until you a more mature release of that new code branch. A new code branch is define when you go from something like a v1.67 release to a v2.00 release. Obviously there are exceptions to the rule, but this generally applies.
Given the fact that multiple hosts are seeing this behavior at random times, it isn't impossible for there to be a bad batch of cards. When this is suspected to be the case, reviewing the serial numbers of the cards may reveal if they were from the same batch. In the case of this customer, the serial numbers of the cards were pretty close to being sequential. The HBA vendor confirmed that there may have been a very, very tiny batch of cards with bad firmware, so the affected HBAs were RMA'ed and replaced.
Multiple Storage Path Failures and Network Drops

Here's our fourth log deep-dive from Nathan Small. If you missed the first one, here it is.
History of the issue
A customer experienced multiple storage path failures and network drops in their environment.
There are quite of few instances of the array controller ports timing out to the adapter. When this occurs, it is observed on multiple hosts at the exact same time. A quick way to determine this from ESX hosts would be to grep for ‘rport’ in the vmkernel log: (ie: cat /var/log/vmkernel |grep rport):
Apr 6 19:07:20 esx1 vmkernel: 11:09:46:03.193 cpu5:4298)<3> rport-3:0-10: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx1 vmkernel: 11:09:46:03.193 cpu5:4298)<3> rport-3:0-11: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx3 vmkernel: 10:09:10:17.967 cpu20:4290)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx3 vmkernel: 10:09:10:17.967 cpu5:4292)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx4 vmkernel: 10:07:51:54.795 cpu15:4300)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx4 vmkernel: 10:07:51:54.795 cpu15:4300)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx8 vmkernel: 9:09:41:37.103 cpu0:4300)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:07:20 esx8 vmkernel: 9:09:41:37.103 cpu12:4294)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:07:22 esx2 vmkernel: 10:10:35:21.466 cpu20:4301)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:07:22 esx2 vmkernel: 10:10:35:21.466 cpu20:4301)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx1 vmkernel: 11:09:57:57.228 cpu4:4300)<3> rport-2:0-13: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx1 vmkernel: 11:09:57:57.228 cpu6:4295)<3> rport-2:0-12: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx2 vmkernel: 10:10:47:15.993 cpu19:4292)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx2 vmkernel: 10:10:47:15.993 cpu19:4292)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx8 vmkernel: 9:09:53:31.612 cpu2:4296)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr 6 19:19:14 esx8 vmkernel: 9:09:53:31.612 cpu2:4296)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr 6 19:19:15 esx3 vmkernel: 10:09:22:12.469 cpu10:4291)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr 6 19:19:15 esx3 vmkernel: 10:09:22:12.469 cpu10:4291)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr 6 19:19:15 esx4 vmkernel: 10:08:03:49.296 cpu17:4280)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr 6 19:19:15 esx4 vmkernel: 10:08:03:49.296 cpu17:4280)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx1 vmkernel: 11:10:07:39.282 cpu3:4292)<3> rport-3:0-10: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx1 vmkernel: 11:10:07:39.282 cpu5:4288)<3> rport-3:0-11: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx2 vmkernel: 10:10:56:57.549 cpu19:4281)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx2 vmkernel: 10:10:56:57.549 cpu19:4281)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx3 vmkernel: 10:09:31:53.981 cpu9:4291)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx3 vmkernel: 10:09:31:53.981 cpu9:4291)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx4 vmkernel: 10:08:13:30.816 cpu19:4295)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx4 vmkernel: 10:08:13:30.816 cpu2:4296)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx8 vmkernel: 9:10:03:13.149 cpu21:4280)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr 6 19:28:56 esx8 vmkernel: 9:10:03:13.149 cpu5:4289)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Note: The ‘rport’ messages are only observed when an array controller port times out, not when it is visible again.
The message of “blocked FC remote port time out” literally translates to the array controller port connection timed out, so we mark the paths as dead but save the “binding” information in case the device ever comes back on the fabric, in which case the connection is reestablished and the paths change state from “dead” to “on”. This path transition can also be observed under /var/log/messages.
The above behavior is very concerning since we can see it affects multiple hosts simultaneously. An inspection into the underlying server hardware reveals that these are HP blades, which may explain some of this behavior since they can share the same infrastructure:
From dmidecode:
System Information
Manufacturer: HP
Product Name: ProLiant BL460c G7
BIOS Information
Vendor: HP
Version: I27
Release Date: 05/05/2011
Since this customer is complaining about network drops as well, we need to take a closer look at the messages that surround the ‘rport’ messages. Let's start with the April 6th @ 19:07 outage that affected all hosts:
# cat vm-support-esx*/var/log/vmkernel*|grep "Apr 6 19:0"|grep -i link
Apr 6 19:02:20 esx1 vmkernel: 11:09:41:02.701 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0×100 tag 0xa
Apr 6 19:07:10 esx1 vmkernel: 11:09:45:53.191 cpu8:4104)vmnic7 : 06:00.1 Link Down
Apr 6 19:07:10 esx1 vmkernel: 11:09:45:53.191 cpu3:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xd received Data: xd x20 x800100 x0 x0
Apr 6 19:07:21 esx2 vmkernel: 10:10:30:21.179 cpu20:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0×100 tag 0×4
Apr 6 19:07:22 esx2 vmkernel: 10:10:35:11.465 cpu20:4116)vmnic7 : 06:00.1 Link Down
Apr 6 19:07:22 esx2 vmkernel: 10:10:35:11.465 cpu12:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x7 received Data: x7 x20 x800100 x0 x0
Apr 6 19:07:10 esx3 vmkernel: 10:09:10:07.966 cpu4:5117)vmnic7 : 06:00.1 Link Down
Apr 6 19:07:10 esx3 vmkernel: 10:09:10:07.966 cpu5:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x3 received Data: x3 x20 x800110 x0 x0
Apr 6 19:07:10 esx4 vmkernel: 10:07:51:44.794 cpu18:4114)vmnic7 : 06:00.1 Link Down
Apr 6 19:07:10 esx4 vmkernel: 10:07:51:44.794 cpu19:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x3 received Data: x3 x20 x800110 x0 x0
Apr 6 19:07:13 esx8 vmkernel: 9:09:36:36.881 cpu8:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0×100 tag 0×2
Apr 6 19:07:14 esx8 vmkernel: 9:09:41:27.103 cpu2:5449)vmnic7 : 06:00.1 Link Down
Apr 6 19:07:14 esx8 vmkernel: 9:09:41:27.103 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x5 received Data: x5 x20 x800100 x0 x0
We can see that not only are “lpfc” devices dropping, but vmnic7 is dropping too. Also, these devices are dropping at the exact same time, and across multiple hosts at the same time as well. The acronym ‘lpfc’ translates to Light Pulse Fiber Channel, which is an Emulex device. At this point we can assume that since both devices drop simultaneously, they are physically connected to the same device, perhaps a switch that supports both Ethernet and Fiber Channel/Fiber Channel over Ethernet.
Let’s have a look at the next outage to see if the same behavior was observed:
# cat vm-support-esx*/var/log/vmkernel*|grep "Apr 6 19:28"|grep -i link
Apr 6 19:28:46 esx1 vmkernel: 11:10:07:29.281 cpu4:8170)vmnic7 : 06:00.1 Link Down
Apr 6 19:28:46 esx1 vmkernel: 11:10:07:29.281 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x14 received Data: x14 x20 x800110 x0 x0
Apr 6 19:28:46 esx2 vmkernel: 10:10:56:47.548 cpu20:4546)vmnic7 : 06:00.1 Link Down
Apr 6 19:28:46 esx2 vmkernel: 10:10:56:47.548 cpu14:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xe received Data: xe x20 x800110 x0 x0
Apr 6 19:28:46 esx3 vmkernel: 10:09:31:43.981 cpu10:5119)vmnic7 : 06:00.1 Link Down
Apr 6 19:28:46 esx3 vmkernel: 10:09:31:43.981 cpu3:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
Apr 6 19:28:46 esx4 vmkernel: 10:08:13:20.816 cpu2:4098)vmnic7 : 06:00.1 Link Down
Apr 6 19:28:46 esx4 vmkernel: 10:08:13:20.816 cpu13:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
Apr 6 19:28:46 esx8 vmkernel: 9:10:03:03.148 cpu6:4102)vmnic7 : 06:00.1 Link Down
Apr 6 19:28:46 esx8 vmkernel: 9:10:03:03.148 cpu7:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xc received Data: xc x20 x800110 x0 x0
In every outage, the same devices were affect: "lpfc820 0000:06:00.3" and "vmnic7 : 06:00.1" across all hosts simultaneously. The numbering in these devices literally translate to bus:slot:function, so the FC HBA is bus 6, Slot 0, and function 3 while the vmnic is bus 6, slot 0, function 1. The only possible way to have an FC device and NIC device share the same bus is if the PCI card hosting these devices is a CNA card, which a review of the driver proc node /proc/scsi/lpfc820 shows that this HBA is a CNA (Converge Network Adapter), which is capable of multiple functions:
# cat /proc/scsi/lpfc820/3
Emulex OneConnect FCoE SCSI 8.2.1.90.28
HP NC553m 2P FlexFabric Adptr on PCI bus 0000:06 device 02 irq 49 port 1 Logical Link Speed: 4000 Mbps
BoardNum: 2
ESX Adapter: vmhba3
Function 0 and 2 of this PCI device are being used, though function 0 is another NIC and function 2 is another HBA.
In the end, the root cause for this issue wasn’t a faulty switch but instead a faulty HP Virtual Connect Module (VCM). The reason why the other NIC and HBA weren’t affected by these outages was due to the fact that they were connected to a different VCM. Once the VCM was replaced, the FC path and network drops stopped.
Social Media and the My VMware Launch

Aside from writing this blog my duties include monitoring social media and engaging with our customers for things that involve VMware Shared Services. That includes the conversations going on around the launch of My VMware.
The launch is going really well considering the sheer size of this portal. I heard someone say My VMware is one of the biggest B2B sites in the world!
Social Media can work for you, or work against you. It can be like trying to tame a wild Mustang. Things spread like wildfire. I saw one tweet show up today about a funny browser issue when trying to log in to My VMware.
- Trouble logging into MyVMware - It happens Video: http://t.co/iZ3CmF3V
I knew instantly what the problem was in this case. He needed to clear his browser cache, but suddenly I had to work at lightning speed to get ahead of this. Someone retweeted. Somebody else posted the video on YouTube! This had the makings of going viral, and fast!
Everywhere I followed, I commented with the fix. I tweeted to the original poster to try the fix. I'd need to wait for his reply. I needed to reply to every tweet now talking about this quirk.
For a while, I thought it could run faster than I could, but then the original poster answered!
- If your MyVMware button moves around (see video), just clear your cache & works like a charm! Thanks @VMwareCares http://t.co/HheqyUJi
- Worked great. I've updated my post to include your quick response and the fix! Thanks! You guys are doing a great job!
YES!! Now social was working for me. The answer was being spread as quickly as the problem had. Not only was the whole thing blowing over and our customers having a good chuckle over it, but in the meantime, I got our Knowledge Base Team writing up an official KB article, which is already published. I have to brag - we are a truly incredible team.
See: My VMware log in button on vmware.com moves when you try to click it 2018745
Host goes unresponsive in vCenter and SSH to the host fails

Here's our third log deep-dive from Nathan Small. If you missed the first one, here it is. Funny how all of these problems point to a KB article that solves the issue!
History of issue
A host went unresponsive last night in vCenter and SSH fails to the host. The host was rebooted to bring it back online, and now root cause is requested.
To determine how long the server was actually unresponsive, we need to check /var/log/vmksummary. When we do this, we see the server was not logging for 3 days, which could point to a system hang:
Mar 25 04:01:17 esx04 logger: (1332666076) hb: vmk loaded, 6169963.27, 6169603.352, 43, 398348, 398348, 1260, vpxa-125984, vmware-h-101788, sfcbd-19084
Mar 25 05:01:16 esx04 logger: (1332669676) hb: vmk loaded, 6173563.07, 6173202.948, 43, 398348, 398348, 1260, vpxa-126004, vmware-h-101788, sfcbd-19044
Mar 28 10:15:20 esx04 vmkhalt: (1332947720) Starting system…
Mar 28 10:15:52 esx04 logger: (1332947752) loaded VMkernel
NOTE: If the server was gracefully shutdown or rebooted, you would see a message stating this the line before "Starting system…"
What it is interesting to note is that while the host was observed to be unresponsive in vCenter only the previous night, the system hasnt been logging for over 3 days.
At this point we should gather more information around the physical server:
Server Info gathered from 'dmidecode':
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R900
BIOS Information
Vendor: Dell Inc.
Version: 1.1.15
Release Date: 04/15/2010
Next, it is important to determine ESX version:
# vmware -v
VMware ESX 4.0.0 build-398348
This ESX version is prior to 4.1 and therefore the IRQ remapping issue would not apply. This means the host stopped logging due to hardware, but specifically how or why isn't known yet. We know the host stopped logging for 3 day, so we should start with the device that communicates with the local disks, where /var/log is stored.
Since this host is a Dell server, it uses the PERC (PowerEdge RAID Controller), which is a rebranded LSI controller. The controller contains a log that is maintained on the hardware itself and can be extracted with the 'lsi_log' command that is available on ESX hosts.
When we run this command to pull the logs off the controller we can see a rather alarming event:
Event Sequence Number : 9658
Timestamp : 3/25/2012 ; 10:19:9
Event code : 15
Locale : Boot/Shutdown event
Class : Dead (visible after next controller boot)
Description of the event : Fatal firmware error: Line 156 in ../../raid/1078int.c
Argument Type Value: 18
Argument Type : MR_EVT_ARGS_STR
String : Line 156 in ../../raid/1078int.c
When searching our Knowledge Base for the search string "Fatal firmware error", we find only one article, however it is a perfect match! (KB 2011987 — http://kb.vmware.com/kb/2011987). This customer will need to update the firmware of their PERC controller. To prevent outages from other hosts with the same hardware configuration, we advised that they do the same for any Dell PERC or IBM ServeRAID controllers, as these are rebranded LSI controllers.
Storage Performance and Timeout Issues

Here’s our second log deep-dive from Nathan Small. If you missed the first one, here it is.
History of the issue
General storage performance issue while experiencing timeouts and SCSI reservation issues (Timeout/conflict) on ESX 4.1 (VMware View) and ESX 5.0 environments. Need to determine why these events are being seen and how we can improve performance.
The first step when troubleshooting this type of issue is to take a frame of reference that is visible across all hosts. Here is the issue observed on one of the ESX 4.1 host:
ESXi 4.1:
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)ScsiDeviceIO: 1688: Command 0×16 to device “naa.60060480000190100554533032324644″ failed H:0×5 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device “naa.60060480000190100554533032324644″ state in doubt; requested fast path state update…
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu14:4134)WARNING: FS3: 7120: Reservation error: Timeout
When reviewing the logs from the ESX 5.0 host, we see the issue observed at the same time:
ESXi 5.0:
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0×16 (0x4124003d4740, 0) to dev “naa.60060480000190100554533033364445″ on path “vmhba2:C0:T3:L23″ Failed: H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0. Act:EVAL
2012-03-28T17:12:56.481Z cpu11:4107)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “naa.60060480000190100554533033364445″ state in doubt; requested fast path state update…
2012-03-28T17:12:56.481Z cpu11:4107)ScsiDeviceIO: 2322: Cmd(0x4124003d4740) 0×16, CmdSN 0×23102 from world 0 to dev “naa.60060480000190100554533033364445″ failed H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-03-28T17:12:56.481Z cpu1:4132)WARNING: FS3: 1835: Reservation error: Timeout
These logs show that the ESX storage stack is not aware of what the SCSI reservation state of the device is. This will happen when a SCSI RESERVE command is sent, doesn’t not complete to the target due to timeout, and then abort sent for the SCSI reserve doesn’t complete either due to timeout. This will usually happen for 1 of 2 reasons:
1. Fabric issues where frames are dropped
2. Array or array controller is overloaded
There are other messages occurring in the logs that also indicate an array performance issue:
2012-03-26T23:00:30.950Z cpu8:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW02′ (held for 1457 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-27T23:01:15.491Z cpu4:4133)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW04′ (held for 2640 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T17:12:59.456Z cpu15:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 2967 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T20:05:19.372Z cpu0:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 19196 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
See KB 1025299 for more information on this message: http://kb.vmware.com/kb/1025299
To simply state that this is a performance issue that should be investigated further by the storage team or the array vendor may be acceptable for some, however let’s try to dig into this further and provide more details if that is the direction we decide to go. To start, let’s take a closer look at the LUNs reporting the issue in both environments:
ESX 4.1:
Error message:
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
LUN in question: naa.60060480000190100554533037394644
# esxcfg-scsidevs -m |grep naa.60060480000190100554533037394644
naa.60060480000190100554533037394644:1 /vmfs/devices/disks/naa.60060480000190100554533037394644:1 4d433199-f78406e2-ccc3-1cc1de093452 0 Lun_56
# esxcfg-mpath -b|grep -A4 naa.60060480000190100554533032324644
naa.60060480000190100554533032324644 : EMC Fibre Channel Disk (naa.60060480000190100554533032324644)
vmhba0:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0a:24 Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
vmhba1:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0b:24 Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
From this output we can see the WWPN (World Wide Port Name) of the connected array ports, which are 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82.
ESX 5.0:
Error message:
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
LUN in question: naa.60060480000190100554533033364445
# esxcfg-scsidevs -m |grep naa.60060480000190100554533033364445
naa.60060480000190100554533033364445:1 /vmfs/devices/disks/naa.60060480000190100554533033364445:1 4e70c147-a779fe32-8747-0025b5010a6a 0 VMW07
# esxcfg-mpath -b| grep -A4 naa.60060480000190100554533033364445
naa.60060480000190100554533033364445 : EMC Fibre Channel Disk (naa.60060480000190100554533033364445)
vmhba2:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0a:0d Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
vmhba3:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0b:0d Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
We can see from the esxcfg-mpath output that the LUN having the same issue in this cluster is connected to the same WWPN: 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82
Given the fact that there are two separate environments connected to the same array controller (FA port, in this case), it is highly likely that the array controller is overloaded. The next step would be to collect performance information from the array side from this point forward so that when another event occurs the storage team or storage array vendor can review the health of the array at that timeframe to ensure that it can keep up with the I/O being generated.
In this particular case, the storage team and EMC became involved and confirmed that the FA port was overloaded. To distribute the load, and to ensure View best practices are followed, the VMware View environment was moved onto a different, dedicated FA on the array.
Hosts Unresponsive after Firmware Upgrade of IBM SVC

We have an exciting new series starting here at The Support Insider. Our senior support engineers sometimes need to deep dive into customer issues, digging into logs some of you might not even know exist. We thought we'd share some examples of these with you. You see the problem presented, the game plan unfold, and how our top escalation engineers investigate tough problems.
We begin today with Nathan Small. Nathan is a Senior Escalation Engineer on the storage team in Global Support Services, and has been employed with VMware since May 2005. He has written and designed many storage technical presentations that have been delivered at various VMware sponsored events including VMUGs and VMworld conferences. He is recognized in the worldwide virtualization community as a storage and infrastructure guru.
History of the issue: This customer performed a firmware upgrade of their IBM SVC however after doing so all hosts in 1 cluster went to an unresponsive state in vCenter, and it was not possible to SSH to the servers. It was still possible to ping the affected hosts. The servers were rebooted to resolve the issue. A host in a different cluster also connected to the IBM SVC did not experience this same issue.
Log review:
The host vmris01-sc stopped logging after 5am. The /var/log/vmksummary file updates at the first minute of every hour stating the uptime of the host, build version of the vmkernel/service console, and the memory usage of the top 3 process:
Mar 1 04:01:13 vmris01-sc logger: (1330592473) hb: vmk loaded, 31076125.88, 31076117.975, 2, 208167, 208167, 136, sfcbd-153328, ftPerl-76692, vmware-h-73592
Mar 1 05:01:14 vmris01-sc logger: (1330596074) hb: vmk loaded, 31079726.01, 31079718.111, 2, 208167, 208167, 136, sfcbd-153336, ftPerl-76736, vmware-h-73188
Mar 1 08:20:13 vmris01-sc vmkhalt: (1330608013) Starting system…
Mar 1 08:20:25 vmris01-sc logger: (1330608025) loaded VMkernel
Since there was no update at 6:01, this means the server stopped logging at this point. This means one of two possibilities: 1. the server hung (due to hardware), or 2. the server was no longer able to write to /var/log. It is highly unlikely that the server hung since multiple hosts were affected simultaneously, so the only possibility is that these hosts were not able to write to /var/log.
This host observed an event beginning at 4:13am in the form of commands failing to complete to the target:
Mar 1 04:13:39 vmris01-sc vmkernel: 359:16:27:43.715 cpu14:5443)VMK_PCI: 739: device 065:00.0 capType 16 capIndex 76
Mar 1 04:29:34 vmris01-sc vmkernel: 359:16:43:38.593 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0a2f00) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×2 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Then we see the remote ports (array controller ports) timeout, specifically the second target (‘1’) on both HBAs:
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.167 cpu15:4303)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.253 cpu25:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
This causes a host status of H:0×1, which requires us to failover:
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0df680) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×1 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60050768019102668000000000000328": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Then failover actually occurs (from target 1 to target 0):
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu24:4268)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L5 to vmhba1:C0:T0:L5 for device "naa.60050768019102668000000000000335".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu26:8602)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L1 to vmhba1:C0:T0:L1 for device "naa.60050768019102668000000000000328".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000335" - issuing command 0x41000a15d380
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000328" - issuing command 0x41000a0df680
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L2 to vmhba1:C0:T0:L2 for device "naa.6005076801910266800000000000032b".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu12:8604)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L0 to vmhba1:C0:T0:L0 for device "naa.6005076801910266800000000000031c".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6005076801910266800000000000031c" - issuing command 0x41000a1ec240
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L4 to vmhba1:C0:T0:L4 for device "naa.60050768019102668000000000000334".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000335"
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L6 to vmhba1:C0:T0:L6 for device "naa.60050768019102668000000000000336".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4135)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000328"
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L7 to vmhba1:C0:T0:L7 for device "naa.60050768019102668000000000000337".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L8 to vmhba1:C0:T0:L8 for device "naa.6005076801910266800000000000033a".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L9 to vmhba1:C0:T0:L9 for device "naa.6005076801910266800000000000033b".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.638 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.6005076801910266800000000000031c"
The statement of “Retry world recovered device” indicates that failover completed successfully.
After failover,Everything is running fine until 70 minutes later when we see the first target (‘0’) timeout to both HBAs:
Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.805 cpu0:4300)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.929 cpu30:4301)<3> rport-3:0-0: blocked FC remote port time out: saving binding
We then failover back from target 0 to target 1:
Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.228 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T0:L10 to vmhba0:C0:T1:L10 for device "naa.60050768019102668000000000000348".
Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.229 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T0:L11 to vmhba1:C0:T1:L11 for device "naa.6005076801910266800000000000035b".
This behavior is consistent with a firmware upgrade of an IBM SVC or any array type that supports non-disruptive upgrading of the controllers.
What is odd is that 20 minutes later we see more failover events, target 1 back to target 0, right before the server stopped logging to the vmkernel log:
Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.379 cpu38:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L10 to vmhba1:C0:T0:L10 for device "naa.60050768019102668000000000000348".
Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.381 cpu12:4303)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T1:L11 to vmhba0:C0:T0:L11 for device "naa.6005076801910266800000000000035b".
Mar 1 08:20:22 vmris01-sc vmkernel: pol 1
Mar 1 08:20:22 vmris01-sc vmkernel: 0:00:00:03.637 cpu0:4096)Chipset: 497: bus pci, id 4
We can see the timestamp jumps from 359 days to 0 days, which is essentially the hard reboot of this host.
It is odd that the logging suddenly stopped, however we would speculate that logging stopped because the ESX host was not able to communicate with its boot device. The only other possibility would be that /var/log ran out of space on all 3 host, which is highly unlikely. When reviewing the esxcfg-mpath output for this host, there is no local RAID controller, which means that these are likely servers that boot-from-SAN. A quick check of /etc/vmware/esx.conf reveals the following:
/boot/cosvmdk = "/vmfs/volumes/4d0f879c-ccb0a33e-91f7-842b2b5c4f54/esxconsole-4d0f86da-ab5c-79a0-83ee-842b2b5c4f54/esxconsole.vmdk"
# cat tmp/4d0f879c-ccb0a33e-91f7-842b2b5c4f54.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1 (26)
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21807235072 (20797 blocks) avail
UUID: 4d0f879c-ccb0a33e-91f7-842b2b5c4f54
Partitions spanned (on "lvm"):
naa.6005076801910266800000000000031c:5
This server is booting off the 5th partition of naa.6005076801910266800000000000031c
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
The output above confirms the host is boot-from-SAN.
Let's review the server hardware (from dmidecode or smbios):
Manufacturer: Dell Inc.
Product Name: PowerEdge R910
BIOS Information
Vendor: Dell Inc.
Version: 1.2.0
Release Date: 06/22/2010
And the ESX version:
VMware ESX 4.0.0 build-208167
This BIOS version is very old which means it may have been susceptible to the IRQ remapping issue (KB 1030265 — http://kb.vmware.com/kb/1030265), however the ESX version is prior to 4.1 when this feature was introduced. This means that the reason the hosts stopped logging was more than likely related to the fact that they boot-from-SAN and all paths to the array were lost around 6am.
Let's review another host:
Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)NMP: nmp_CompleteCommandForPath: Command 0×12 (0x41000a1a3e00) to NMP device "mpx.vmhba2:C0:T0:L0" failed on physical path "vmhba2:C0:T0:L0" H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)ScsiDeviceIO: 747: Command 0×12 to device "mpx.vmhba2:C0:T0:L0" failed H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar 1 08:17:56 vmris02-sc vmkernel: 497: bus pci, id 3
Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:A (0) 01-03 trg 2 pol 1
Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:B (1) 01-14 trg 2 pol 1
This host stopped logging at 6:04 instead of 6:00, which is a difference of 4 minutes. Let's review when the first remote ports disappeared to help correlate:
Mar 1 04:33:09 vmris02-sc vmkernel: 435:13:35:12.922 cpu33:4307)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar 1 04:33:10 vmris02-sc vmkernel: 435:13:35:13.143 cpu24:4294)<3> rport-2:0-0: blocked FC remote port time out: saving binding
This is also off by 4 minutes, which means we can likely conclude that they actually occurred at the exact same time on both hosts, but these hosts are not all setup for NTP.
Finally, let’s review the last host to verify these hosts observed the issue identically. This host observed the same behavior, though the server time seems to be off by 9 minutes (lack of NTP):
Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.554 cpu18:4313)<3> rport-3:0-1: blocked FC remote port time out: saving binding
Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.826 cpu12:4307)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 05:48:24 vmris03-sc vmkernel: 75:20:02:46.594 cpu6:4304)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar 1 05:48:27 vmris03-sc vmkernel: 75:20:02:46.640 cpu18:4309)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Needless to say, all hosts observed the issue at the exact same time, when taking into account the time drift is a direct result of a lack of NTP running. Root cause for why storage paths disappeared again at 6am is not known since the ESX hosts had nowhere to write logs, which suggests that all available paths to the storage were simultaneously lost at that moment.
The other host that logs were upload for was a host that did not experience the same outage as hosts in another cluster:
Mar 1 04:42:38 vmhost16-sc vmkernel: 167:15:49:25.581 cpu14:4295)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.397 cpu12:4313)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.815 cpu12:4298)<3> rport-3:0-0: blocked FC remote port time out: saving binding
We can see that only 3 out of 4 remote ports went offline compared to other hosts, however a grep for rport shows the port did go down 2.5 hours later:
Mar 1 07:18:23 vmhost16-sc vmkernel: 167:18:25:13.641 cpu12:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
*Note: With default HBA driver logging it will only report when a remote disappears, not when it is visible again.
This host logs well past 6am, which likely means this host does not boot-from-SAN:
/boot/cosvmdk = "/vmfs/volumes/4d2b18aa-56df78bf-9bc2-842b2b5c502c/esxconsole-4d2b174e-97a3-3540-1b89-842b2b5c502c/esxconsole.vmdk"
# cat /tmp/4d2b18aa-56df78bf-9bc2-842b2b5c502c.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21595422720 (20595 blocks) avail
UUID: 4d2b18aa-56df78bf-9bc2-842b2b5c502c
Partitions spanned (on "lvm"):
naa.600507680191026680000000000001a6:5
naa.600507680191026680000000000001a6 : IBM Fibre Channel Disk (naa.600507680191026680000000000001a6)
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
As we can see, this is actually a boot-from-SAN configuration, so why were 3 hosts affected but not this one?
Going back to the fact that this last host didn’t lose all paths to the storage (only 3 out of 4), we will focus on the fact that it eventually lost the last path 2.5 hours later, after which the host complained constantly about SCSI Reservation conflicts:
Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.646 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T0:L9" state. Status=SCSI reservation conflict
Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.652 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T1:L9" state. Status=SCSI reservation conflict
Also, the qlogic driver is aborting commands every 5 minutes exactly, which is the frequency of the disk.patheval code (300 seconds be default):
Mar 1 08:28:33 vmhost16-sc vmkernel: 167:19:35:23.641 cpu3:20943)<6>qla2xxx 0000:41:00.0: scsi(3:1:72): Abort command issued — 1 32bf3c06 2002.
Mar 1 08:33:23 vmhost16-sc vmkernel: 167:19:40:13.637 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:67): Abort command issued — 1 32c01950 2002.
Mar 1 08:38:23 vmhost16-sc vmkernel: 167:19:45:13.641 cpu12:20943)<6>qla2xxx 0000:41:00.0: scsi(3:3:0): Abort command issued — 1 32c0d62a 2002.
Mar 1 08:43:23 vmhost16-sc vmkernel: 167:19:50:13.639 cpu9:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:74): Abort command issued — 1 32c18b9f 2002.
Mar 1 08:53:43 vmhost16-sc vmkernel: 167:20:00:33.641 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:1:66): Abort command issued — 1 32c31fbf 2002.
Mar 1 08:58:23 vmhost16-sc vmkernel: 167:20:05:13.644 cpu4:20944)<6>qla2xxx 0000:41:00.0: scsi(3:3:69): Abort command issued — 1 32c3938d 2002.
Eventually we see paths deleted for various LUNs:
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.923 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=20
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=22
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=23
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=25
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=26
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=27
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=30
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=35
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=38
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.536 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=39
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.494 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=40
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.496 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=44
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.497 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=45
Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.208 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=18
Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.209 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=19
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.350 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=51
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=52
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=53
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.700 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=0
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=1
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=2
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=3
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.702 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=4
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=8
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=9
This means the connection to all LUNs were removed for vmhba1 to a particular array. There are still paths from vmhba1 to the other array:
eui.00173800123a001a : IBM Fibre Channel Disk (eui.00173800123a001a)
vmhba1:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:92
vmhba1:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:82
vmhba1:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:72
vmhba1:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:52
vmhba1:C0:T1:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:42
vmhba0:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:80
vmhba0:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:50
vmhba0:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:40
vmhba0:C0:T6:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:90
vmhba0:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:81
We can see all LUNs from the IBM array with WWNN 50:05:07:68:01:00:4c:c5 and 50:05:07:68:01:00:4b:00 has only 2 paths:
naa.600507680191026680000000000002ac : IBM Fibre Channel Disk (naa.600507680191026680000000000002ac)
vmhba0:C0:T1:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000027c : IBM Fibre Channel Disk (naa.6005076801910266800000000000027c)
vmhba0:C0:T1:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000033f : IBM Fibre Channel Disk (naa.6005076801910266800000000000033f)
vmhba0:C0:T1:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000027d : IBM Fibre Channel Disk (naa.6005076801910266800000000000027d)
vmhba0:C0:T1:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
Based on this evidence, this could be a zoning issue or simply LUN presentation problem (perhaps that HBA is not logged into the target when it should be?). Regardless of whether or not zoning changes were implemented as well, the focus should remain that all hosts in one cluster stopped logging to /var/log while a host in another cluster did not.
Let's get back to the original issue, which is why did 3 servers behave differently than 1 server. To do that, let's compare logged-in targets:
Affected host (vmris01) boot LUN targets:
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
The HBAs were connected to the following:
50:05:07:68:01:10:4c:c5
50:05:07:68:01:10:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:30:4b:00
The host that remained online the entire time is connected to the following ports:
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
These are different ports that the host that stopped logging and required rebooting. Now let's compare with the other 2 hosts that had to be rebooted (vmris02 and vmris03):
vmris02:
50:05:07:68:01:20:4b:00
50:05:07:68:01:20:4c:c5
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
vmris03:
50:05:07:68:01:10:4c:c5
50:05:07:68:01:30:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:10:4b:00
While vmris03 is logged into the same physical ports and vmris01, vmris02 is logged into the same ports as the host that remained online. The host that remained online is only communicating on 2 of the 4 ports however, so do they identically match?
vmhost2: (from /proc/scsi/qla2xxx/)
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
FC Port Information:
scsi-qla1-port-0=5005076801004cc5:5005076801204cc5:83031e:1;
vmris02: (from /proc/scsi/qla2xxx/)
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
FC Port Information:
scsi-qla1-port-0=5005076801004b00:5005076801204b00:83031b:81;
scsi-qla1-port-1=5005076801004cc5:5005076801204cc5:83031e:82;
Even though vmhost is missing one of the paths, we can see that the targets are the same. Since the hosts were logged into the same targets but had different results when it comes to logged out ports and being able to stay online, this may have to do with the amount of I/O that was in-flight from each host at the time. A host that was not issuing hardly any I/O at the time may have survived the outage because it did not issue any I/O, whereas the service console of the ESX hosts that were doing I/O at the time had a timeout on the local disk, resulting in the EXT3 file systems remounting as read-only. This would have resulted in an inability to write to /var/log. It would also cause some services to stop functioning, such as SSH, while other stacks in memory continue to operate (responds to ICMP ping requests).
The host that remained online (vmhost2) was only running 2 VMs at the time the logs were collected:
# cat /proc/vmware/vm/*/names
vmid=4371 pid=-1 cfgFile="/vmfs/volumes/4d1a3ce4-d8b96856-0231-842b2b5c4f58/uhmc-riscxr/uhmc-riscxr.vmx" uuid="42 38 00 37 fc 6e 04 38-a4 cb a9 22 98 f3 ce 9d" displayName="uhmc-riscxr"
vmid=4376 pid=-1 cfgFile="/vmfs/volumes/4d2f4d6a-a751b8d0-1109-842b2b5c4f58/uhmc-risrpt/uhmc-risrpt.vmx" uuid="42 38 18 73 93 c6 d2 89-c5 87 e8 9e 24 5e 34 94" displayName="uhmc-risrpt"
A review of the vmkernel logs for that host do not show any VMs being powered on/off before, during, or after the event, so it is safe to assume these were the only VMs running on this host. We can only speculate what these VMs are designed for, however I would speculate that they are low I/O VMs and thus there were no or very few I/O's issued at the time the event occurred. We would also surmise that hosts vmris01, vmris02, and vmris03 had well more than 2 VMs running and they were all doing I/O at the time the event occurred. This theory explains why vmhost02 remained online while other hosts did not. As far as why all paths were lost at the same moment, which forcing the service consoles EXT partitions to remount as read-only, this would be possible if the paths were unavailable for longer than the disk timeout period, which is around 30 seconds. The question then becomes, what occurred on the SVC at ~6am? Were both SVC's hard rebooted after the firmware was successfully applied?
We'd love to hear your comments! Use the box below.
How our Users Contribute to our Collective Wisdom

We had another example today of our customers helping us to correct a small peice of information that had potential for wreacking havoc with one's ESX/ESXi host.
At 11:55 this morning I received a Tweet from colleague John Troyer, alerting me that more than one customer he knew of had been hit with a PSOD (purple screen of death) while attempting to apply an upgrade to their ESX/ESXi host. Apparently, our Compatibility Matrix had an error in it for which versions the update to version 5.0 could apply.
A blog post on Wahl Network here talked more about the problem. I sprung into action and tapped my team of Knowledge Base writers for some insight. By 12:02 I had my answer. KB article 2009586 was written to speak to the issue and it turns out our Compatability Matrix was already being updated to indicate the correct versions after someone internally noticed the oversight.
The main message for this problem is-
Update your host to ESX/ESXi 4.1 Update 1 before adding it to vCenter Server 5.0
~+~
By 12:06, the originating blog had an update posted.
Isn't it incredible how quickly information can move when all the right bits are in place? If you ever spot something like this, Tweet me up at @VMwareKB, and we'll get it fixed lickity split!
What are we Missing?

This morning while going about my duties monitoring various social media I picked up a Tweet from a customer indicating he had just gotten past a particularly nagging problem during one of his upgrades. What caught my eye being a representative of the Knowledge Management team here at VMware was the fact that he cited a blog post (not one of ours) as having the fix he needed, and as well, pointing out that his problem was not found in our Knowledgebase.
This was enough for me to pounce on the opportunity to fill a gap in our arsenal of KB articles. I have a whole team of subject matter experts at my disposal whose sole task is to author new KB articles. We also have an elaborate workflow process wrapped around every aspect of what we do. In a matter of minutes I had a work item in the hopper to be processed by one of our technical experts, to be then sent through a team of technical writers for final edits, and then on to final publishing in our Knowledgebase.
I Tweeted back to this customer that we were taking action too. One more opportunity to (hopefully) exceed expectations!
Most of our KB content comes from our support engineers manning our phones in our call centers, but a significant number also come from this customer driven discovery process. It's a win-win for us.
If you ever need a fix for something we have not covered in the KB, we invite you to simply ask for it!
Visit Knowledge Base Content Request
vCenter Database Pre-Upgrade Checker Released

In the past, upgrading versions of VirtualCenter or vCenter Server sometimes meant a time consuming call into Tech Support because of an issue here or there related to the SQL databases that support these products.
For everyone dealing with upgrades we've released a Pre-Upgrade Checker that you can run on your current vCenter Server database to reveal problems that could prevent the upgrade or impact the performance of your database after the upgrade.
The Pre-Upgrade Checker can be used for the following upgrades:
- On a VirtualCenter 2.5 Update 6 or later database before you upgrade to vCenter Server 4.x
- On a vCenter Server 4.0.x database before you upgrade to vCenter Server 4.1 or 5.0
- On a vCenter Server 4.1.x database before you upgrade to vCenter Server 5.0
We have written KB article 2004286 - Using the VMware vCenter Database Pre-Upgrade Checker, which explains how to use the VMware vCenter Server Database Pre-Upgrade Checker version 1.0.
A recommended read.
Go Que vs The World