New Portuguese KB Articles
Yesterday we featured new Spanish articles. Today we have our list of brand new translated content for our Portuguese speaking customers. We've hand picked this list from our best content. Enjoy!
New Spanish Knowledgebase Content
Here is nice list of brand new translated content for all of our Spanish speaking customers. We've hand picked this list from our best content. Enjoy!
My VMware Update
My VMware had a successful first week with customers across the globe logging in and using all functionalities of the site, including downloading products and evaluations, submitting Support Requests, generating license keys, creating folders and more. Early in the week, customers experienced some issues with access and stability of the environment due to high-volume traffic. We have resolved these issues and are committed to providing an exceptional experience.
Please join us for an upcoming webinar training series to learn about the new features and functionalities of My VMware. The webinar series will help customers understand the basics of logging in and navigating the site, account management, managing license keys and Support Requests, and downloading and evaluating products. We encourage you to register today!
Please continue to send us feedback! We’ve already made a lot of progress and the input from customers is helping us target specific areas of concern and guiding our improvement roadmap.
Follow us on Twitter @vmwarecares for updates and helpful links.
If you have an issue and need to contact support:
Web: https://my.vmware.com/group/vmware/get-help
Phone: https://my.vmware.com/group/vmware/contact-vmware
Social Media and the My VMware Launch
Aside from writing this blog my duties include monitoring social media and engaging with our customers for things that involve VMware Shared Services. That includes the conversations going on around the launch of My VMware.
The launch is going really well considering the sheer size of this portal. I heard someone say My VMware is one of the biggest B2B sites in the world!
Social Media can work for you, or work against you. It can be like trying to tame a wild Mustang. Things spread like wildfire. I saw one tweet show up today about a funny browser issue when trying to log in to My VMware.
- Trouble logging into MyVMware - It happens Video: http://t.co/iZ3CmF3V
I knew instantly what the problem was in this case. He needed to clear his browser cache, but suddenly I had to work at lightning speed to get ahead of this. Someone retweeted. Somebody else posted the video on YouTube! This had the makings of going viral, and fast!
Everywhere I followed, I commented with the fix. I tweeted to the original poster to try the fix. I'd need to wait for his reply. I needed to reply to every tweet now talking about this quirk.
For a while, I thought it could run faster than I could, but then the original poster answered!
- If your MyVMware button moves around (see video), just clear your cache & works like a charm! Thanks @VMwareCares http://t.co/HheqyUJi
- Worked great. I've updated my post to include your quick response and the fix! Thanks! You guys are doing a great job!
YES!! Now social was working for me. The answer was being spread as quickly as the problem had. Not only was the whole thing blowing over and our customers having a good chuckle over it, but in the meantime, I got our Knowledge Base Team writing up an official KB article, which is already published. I have to brag - we are a truly incredible team.
See: My VMware log in button on vmware.com moves when you try to click it 2018745
Host goes unresponsive in vCenter and SSH to the host fails
Here's our third log deep-dive from Nathan Small. If you missed the first one, here it is. Funny how all of these problems point to a KB article that solves the issue!
History of issue
A host went unresponsive last night in vCenter and SSH fails to the host. The host was rebooted to bring it back online, and now root cause is requested.
To determine how long the server was actually unresponsive, we need to check /var/log/vmksummary. When we do this, we see the server was not logging for 3 days, which could point to a system hang:
Mar 25 04:01:17 esx04 logger: (1332666076) hb: vmk loaded, 6169963.27, 6169603.352, 43, 398348, 398348, 1260, vpxa-125984, vmware-h-101788, sfcbd-19084
Mar 25 05:01:16 esx04 logger: (1332669676) hb: vmk loaded, 6173563.07, 6173202.948, 43, 398348, 398348, 1260, vpxa-126004, vmware-h-101788, sfcbd-19044
Mar 28 10:15:20 esx04 vmkhalt: (1332947720) Starting system…
Mar 28 10:15:52 esx04 logger: (1332947752) loaded VMkernel
NOTE: If the server was gracefully shutdown or rebooted, you would see a message stating this the line before "Starting system…"
What it is interesting to note is that while the host was observed to be unresponsive in vCenter only the previous night, the system hasnt been logging for over 3 days.
At this point we should gather more information around the physical server:
Server Info gathered from 'dmidecode':
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R900
BIOS Information
Vendor: Dell Inc.
Version: 1.1.15
Release Date: 04/15/2010
Next, it is important to determine ESX version:
# vmware -v
VMware ESX 4.0.0 build-398348
This ESX version is prior to 4.1 and therefore the IRQ remapping issue would not apply. This means the host stopped logging due to hardware, but specifically how or why isn't known yet. We know the host stopped logging for 3 day, so we should start with the device that communicates with the local disks, where /var/log is stored.
Since this host is a Dell server, it uses the PERC (PowerEdge RAID Controller), which is a rebranded LSI controller. The controller contains a log that is maintained on the hardware itself and can be extracted with the 'lsi_log' command that is available on ESX hosts.
When we run this command to pull the logs off the controller we can see a rather alarming event:
Event Sequence Number : 9658
Timestamp : 3/25/2012 ; 10:19:9
Event code : 15
Locale : Boot/Shutdown event
Class : Dead (visible after next controller boot)
Description of the event : Fatal firmware error: Line 156 in ../../raid/1078int.c
Argument Type Value: 18
Argument Type : MR_EVT_ARGS_STR
String : Line 156 in ../../raid/1078int.c
When searching our Knowledge Base for the search string "Fatal firmware error", we find only one article, however it is a perfect match! (KB 2011987 — http://kb.vmware.com/kb/2011987). This customer will need to update the firmware of their PERC controller. To prevent outages from other hosts with the same hardware configuration, we advised that they do the same for any Dell PERC or IBM ServeRAID controllers, as these are rebranded LSI controllers.
My VMware is Easy
The all new My VMware portal is now live and ten of thousands of you are logging in to check it out. We've gotten lots of feedback about how elegant and easy it is to use, and that's great, but in some cases a few of you have been a little frustrated with some facets of the site. It turns out that these just needed some explanation. So yes, My VMware is easy, once you're armed with Knowledge!
With that in mind, here are the Knowledebase articles we're seeing requested the most.
- My VMware FAQ (2014350)
- Users and permissions in My VMware (2006977)
- Filing a Support Request in My VMware (2006985)
- Viewing, editing, and copying user permissions in My VMware (2006978)
- Viewing and updating Support Requests in My VMware (2007048)
- Super Users and Procurement Contacts in My VMware (2011011)
- Creating a My VMware profile (2007005)
- Folder management in My VMware (2011222)
- Deactivating your My VMware profile (2013926)
- Viewing license keys in My VMware (2006831)
- Adding and removing users in My VMware (2016235)
- My VMware account administration tasks that require VMware support (2006987)
Storage Performance and Timeout Issues
Here’s our second log deep-dive from Nathan Small. If you missed the first one, here it is.
History of the issue
General storage performance issue while experiencing timeouts and SCSI reservation issues (Timeout/conflict) on ESX 4.1 (VMware View) and ESX 5.0 environments. Need to determine why these events are being seen and how we can improve performance.
The first step when troubleshooting this type of issue is to take a frame of reference that is visible across all hosts. Here is the issue observed on one of the ESX 4.1 host:
ESXi 4.1:
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)ScsiDeviceIO: 1688: Command 0×16 to device “naa.60060480000190100554533032324644″ failed H:0×5 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device “naa.60060480000190100554533032324644″ state in doubt; requested fast path state update…
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu14:4134)WARNING: FS3: 7120: Reservation error: Timeout
When reviewing the logs from the ESX 5.0 host, we see the issue observed at the same time:
ESXi 5.0:
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0×16 (0x4124003d4740, 0) to dev “naa.60060480000190100554533033364445″ on path “vmhba2:C0:T3:L23″ Failed: H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0. Act:EVAL
2012-03-28T17:12:56.481Z cpu11:4107)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “naa.60060480000190100554533033364445″ state in doubt; requested fast path state update…
2012-03-28T17:12:56.481Z cpu11:4107)ScsiDeviceIO: 2322: Cmd(0x4124003d4740) 0×16, CmdSN 0×23102 from world 0 to dev “naa.60060480000190100554533033364445″ failed H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-03-28T17:12:56.481Z cpu1:4132)WARNING: FS3: 1835: Reservation error: Timeout
These logs show that the ESX storage stack is not aware of what the SCSI reservation state of the device is. This will happen when a SCSI RESERVE command is sent, doesn’t not complete to the target due to timeout, and then abort sent for the SCSI reserve doesn’t complete either due to timeout. This will usually happen for 1 of 2 reasons:
1. Fabric issues where frames are dropped
2. Array or array controller is overloaded
There are other messages occurring in the logs that also indicate an array performance issue:
2012-03-26T23:00:30.950Z cpu8:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW02′ (held for 1457 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-27T23:01:15.491Z cpu4:4133)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW04′ (held for 2640 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T17:12:59.456Z cpu15:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 2967 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T20:05:19.372Z cpu0:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 19196 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
See KB 1025299 for more information on this message: http://kb.vmware.com/kb/1025299
To simply state that this is a performance issue that should be investigated further by the storage team or the array vendor may be acceptable for some, however let’s try to dig into this further and provide more details if that is the direction we decide to go. To start, let’s take a closer look at the LUNs reporting the issue in both environments:
ESX 4.1:
Error message:
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
LUN in question: naa.60060480000190100554533037394644
# esxcfg-scsidevs -m |grep naa.60060480000190100554533037394644
naa.60060480000190100554533037394644:1 /vmfs/devices/disks/naa.60060480000190100554533037394644:1 4d433199-f78406e2-ccc3-1cc1de093452 0 Lun_56
# esxcfg-mpath -b|grep -A4 naa.60060480000190100554533032324644
naa.60060480000190100554533032324644 : EMC Fibre Channel Disk (naa.60060480000190100554533032324644)
vmhba0:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0a:24 Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
vmhba1:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0b:24 Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
From this output we can see the WWPN (World Wide Port Name) of the connected array ports, which are 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82.
ESX 5.0:
Error message:
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
LUN in question: naa.60060480000190100554533033364445
# esxcfg-scsidevs -m |grep naa.60060480000190100554533033364445
naa.60060480000190100554533033364445:1 /vmfs/devices/disks/naa.60060480000190100554533033364445:1 4e70c147-a779fe32-8747-0025b5010a6a 0 VMW07
# esxcfg-mpath -b| grep -A4 naa.60060480000190100554533033364445
naa.60060480000190100554533033364445 : EMC Fibre Channel Disk (naa.60060480000190100554533033364445)
vmhba2:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0a:0d Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
vmhba3:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0b:0d Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
We can see from the esxcfg-mpath output that the LUN having the same issue in this cluster is connected to the same WWPN: 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82
Given the fact that there are two separate environments connected to the same array controller (FA port, in this case), it is highly likely that the array controller is overloaded. The next step would be to collect performance information from the array side from this point forward so that when another event occurs the storage team or storage array vendor can review the health of the array at that timeframe to ensure that it can keep up with the I/O being generated.
In this particular case, the storage team and EMC became involved and confirmed that the FA port was overloaded. To distribute the load, and to ensure View best practices are followed, the VMware View environment was moved onto a different, dedicated FA on the array.
Hosts Unresponsive after Firmware Upgrade of IBM SVC
We have an exciting new series starting here at The Support Insider. Our senior support engineers sometimes need to deep dive into customer issues, digging into logs some of you might not even know exist. We thought we'd share some examples of these with you. You see the problem presented, the game plan unfold, and how our top escalation engineers investigate tough problems.
We begin today with Nathan Small. Nathan is a Senior Escalation Engineer on the storage team in Global Support Services, and has been employed with VMware since May 2005. He has written and designed many storage technical presentations that have been delivered at various VMware sponsored events including VMUGs and VMworld conferences. He is recognized in the worldwide virtualization community as a storage and infrastructure guru.
History of the issue: This customer performed a firmware upgrade of their IBM SVC however after doing so all hosts in 1 cluster went to an unresponsive state in vCenter, and it was not possible to SSH to the servers. It was still possible to ping the affected hosts. The servers were rebooted to resolve the issue. A host in a different cluster also connected to the IBM SVC did not experience this same issue.
Log review:
The host vmris01-sc stopped logging after 5am. The /var/log/vmksummary file updates at the first minute of every hour stating the uptime of the host, build version of the vmkernel/service console, and the memory usage of the top 3 process:
Mar 1 04:01:13 vmris01-sc logger: (1330592473) hb: vmk loaded, 31076125.88, 31076117.975, 2, 208167, 208167, 136, sfcbd-153328, ftPerl-76692, vmware-h-73592
Mar 1 05:01:14 vmris01-sc logger: (1330596074) hb: vmk loaded, 31079726.01, 31079718.111, 2, 208167, 208167, 136, sfcbd-153336, ftPerl-76736, vmware-h-73188
Mar 1 08:20:13 vmris01-sc vmkhalt: (1330608013) Starting system…
Mar 1 08:20:25 vmris01-sc logger: (1330608025) loaded VMkernel
Since there was no update at 6:01, this means the server stopped logging at this point. This means one of two possibilities: 1. the server hung (due to hardware), or 2. the server was no longer able to write to /var/log. It is highly unlikely that the server hung since multiple hosts were affected simultaneously, so the only possibility is that these hosts were not able to write to /var/log.
This host observed an event beginning at 4:13am in the form of commands failing to complete to the target:
Mar 1 04:13:39 vmris01-sc vmkernel: 359:16:27:43.715 cpu14:5443)VMK_PCI: 739: device 065:00.0 capType 16 capIndex 76
Mar 1 04:29:34 vmris01-sc vmkernel: 359:16:43:38.593 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0a2f00) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×2 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Then we see the remote ports (array controller ports) timeout, specifically the second target (‘1’) on both HBAs:
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.167 cpu15:4303)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.253 cpu25:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
This causes a host status of H:0×1, which requires us to failover:
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0df680) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×1 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60050768019102668000000000000328": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Then failover actually occurs (from target 1 to target 0):
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu24:4268)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L5 to vmhba1:C0:T0:L5 for device "naa.60050768019102668000000000000335".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu26:8602)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L1 to vmhba1:C0:T0:L1 for device "naa.60050768019102668000000000000328".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000335" - issuing command 0x41000a15d380
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000328" - issuing command 0x41000a0df680
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L2 to vmhba1:C0:T0:L2 for device "naa.6005076801910266800000000000032b".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu12:8604)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L0 to vmhba1:C0:T0:L0 for device "naa.6005076801910266800000000000031c".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6005076801910266800000000000031c" - issuing command 0x41000a1ec240
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L4 to vmhba1:C0:T0:L4 for device "naa.60050768019102668000000000000334".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000335"
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L6 to vmhba1:C0:T0:L6 for device "naa.60050768019102668000000000000336".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4135)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000328"
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L7 to vmhba1:C0:T0:L7 for device "naa.60050768019102668000000000000337".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L8 to vmhba1:C0:T0:L8 for device "naa.6005076801910266800000000000033a".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L9 to vmhba1:C0:T0:L9 for device "naa.6005076801910266800000000000033b".
Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.638 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.6005076801910266800000000000031c"
The statement of “Retry world recovered device” indicates that failover completed successfully.
After failover,Everything is running fine until 70 minutes later when we see the first target (‘0’) timeout to both HBAs:
Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.805 cpu0:4300)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.929 cpu30:4301)<3> rport-3:0-0: blocked FC remote port time out: saving binding
We then failover back from target 0 to target 1:
Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.228 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T0:L10 to vmhba0:C0:T1:L10 for device "naa.60050768019102668000000000000348".
Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.229 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T0:L11 to vmhba1:C0:T1:L11 for device "naa.6005076801910266800000000000035b".
This behavior is consistent with a firmware upgrade of an IBM SVC or any array type that supports non-disruptive upgrading of the controllers.
What is odd is that 20 minutes later we see more failover events, target 1 back to target 0, right before the server stopped logging to the vmkernel log:
Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.379 cpu38:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L10 to vmhba1:C0:T0:L10 for device "naa.60050768019102668000000000000348".
Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.381 cpu12:4303)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T1:L11 to vmhba0:C0:T0:L11 for device "naa.6005076801910266800000000000035b".
Mar 1 08:20:22 vmris01-sc vmkernel: pol 1
Mar 1 08:20:22 vmris01-sc vmkernel: 0:00:00:03.637 cpu0:4096)Chipset: 497: bus pci, id 4
We can see the timestamp jumps from 359 days to 0 days, which is essentially the hard reboot of this host.
It is odd that the logging suddenly stopped, however we would speculate that logging stopped because the ESX host was not able to communicate with its boot device. The only other possibility would be that /var/log ran out of space on all 3 host, which is highly unlikely. When reviewing the esxcfg-mpath output for this host, there is no local RAID controller, which means that these are likely servers that boot-from-SAN. A quick check of /etc/vmware/esx.conf reveals the following:
/boot/cosvmdk = "/vmfs/volumes/4d0f879c-ccb0a33e-91f7-842b2b5c4f54/esxconsole-4d0f86da-ab5c-79a0-83ee-842b2b5c4f54/esxconsole.vmdk"
# cat tmp/4d0f879c-ccb0a33e-91f7-842b2b5c4f54.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1 (26)
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21807235072 (20797 blocks) avail
UUID: 4d0f879c-ccb0a33e-91f7-842b2b5c4f54
Partitions spanned (on "lvm"):
naa.6005076801910266800000000000031c:5
This server is booting off the 5th partition of naa.6005076801910266800000000000031c
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
The output above confirms the host is boot-from-SAN.
Let's review the server hardware (from dmidecode or smbios):
Manufacturer: Dell Inc.
Product Name: PowerEdge R910
BIOS Information
Vendor: Dell Inc.
Version: 1.2.0
Release Date: 06/22/2010
And the ESX version:
VMware ESX 4.0.0 build-208167
This BIOS version is very old which means it may have been susceptible to the IRQ remapping issue (KB 1030265 — http://kb.vmware.com/kb/1030265), however the ESX version is prior to 4.1 when this feature was introduced. This means that the reason the hosts stopped logging was more than likely related to the fact that they boot-from-SAN and all paths to the array were lost around 6am.
Let's review another host:
Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)NMP: nmp_CompleteCommandForPath: Command 0×12 (0x41000a1a3e00) to NMP device "mpx.vmhba2:C0:T0:L0" failed on physical path "vmhba2:C0:T0:L0" H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)ScsiDeviceIO: 747: Command 0×12 to device "mpx.vmhba2:C0:T0:L0" failed H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar 1 08:17:56 vmris02-sc vmkernel: 497: bus pci, id 3
Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:A (0) 01-03 trg 2 pol 1
Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:B (1) 01-14 trg 2 pol 1
This host stopped logging at 6:04 instead of 6:00, which is a difference of 4 minutes. Let's review when the first remote ports disappeared to help correlate:
Mar 1 04:33:09 vmris02-sc vmkernel: 435:13:35:12.922 cpu33:4307)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar 1 04:33:10 vmris02-sc vmkernel: 435:13:35:13.143 cpu24:4294)<3> rport-2:0-0: blocked FC remote port time out: saving binding
This is also off by 4 minutes, which means we can likely conclude that they actually occurred at the exact same time on both hosts, but these hosts are not all setup for NTP.
Finally, let’s review the last host to verify these hosts observed the issue identically. This host observed the same behavior, though the server time seems to be off by 9 minutes (lack of NTP):
Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.554 cpu18:4313)<3> rport-3:0-1: blocked FC remote port time out: saving binding
Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.826 cpu12:4307)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 05:48:24 vmris03-sc vmkernel: 75:20:02:46.594 cpu6:4304)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar 1 05:48:27 vmris03-sc vmkernel: 75:20:02:46.640 cpu18:4309)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Needless to say, all hosts observed the issue at the exact same time, when taking into account the time drift is a direct result of a lack of NTP running. Root cause for why storage paths disappeared again at 6am is not known since the ESX hosts had nowhere to write logs, which suggests that all available paths to the storage were simultaneously lost at that moment.
The other host that logs were upload for was a host that did not experience the same outage as hosts in another cluster:
Mar 1 04:42:38 vmhost16-sc vmkernel: 167:15:49:25.581 cpu14:4295)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.397 cpu12:4313)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.815 cpu12:4298)<3> rport-3:0-0: blocked FC remote port time out: saving binding
We can see that only 3 out of 4 remote ports went offline compared to other hosts, however a grep for rport shows the port did go down 2.5 hours later:
Mar 1 07:18:23 vmhost16-sc vmkernel: 167:18:25:13.641 cpu12:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
*Note: With default HBA driver logging it will only report when a remote disappears, not when it is visible again.
This host logs well past 6am, which likely means this host does not boot-from-SAN:
/boot/cosvmdk = "/vmfs/volumes/4d2b18aa-56df78bf-9bc2-842b2b5c502c/esxconsole-4d2b174e-97a3-3540-1b89-842b2b5c502c/esxconsole.vmdk"
# cat /tmp/4d2b18aa-56df78bf-9bc2-842b2b5c502c.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21595422720 (20595 blocks) avail
UUID: 4d2b18aa-56df78bf-9bc2-842b2b5c502c
Partitions spanned (on "lvm"):
naa.600507680191026680000000000001a6:5
naa.600507680191026680000000000001a6 : IBM Fibre Channel Disk (naa.600507680191026680000000000001a6)
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
As we can see, this is actually a boot-from-SAN configuration, so why were 3 hosts affected but not this one?
Going back to the fact that this last host didn’t lose all paths to the storage (only 3 out of 4), we will focus on the fact that it eventually lost the last path 2.5 hours later, after which the host complained constantly about SCSI Reservation conflicts:
Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.646 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T0:L9" state. Status=SCSI reservation conflict
Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.652 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T1:L9" state. Status=SCSI reservation conflict
Also, the qlogic driver is aborting commands every 5 minutes exactly, which is the frequency of the disk.patheval code (300 seconds be default):
Mar 1 08:28:33 vmhost16-sc vmkernel: 167:19:35:23.641 cpu3:20943)<6>qla2xxx 0000:41:00.0: scsi(3:1:72): Abort command issued — 1 32bf3c06 2002.
Mar 1 08:33:23 vmhost16-sc vmkernel: 167:19:40:13.637 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:67): Abort command issued — 1 32c01950 2002.
Mar 1 08:38:23 vmhost16-sc vmkernel: 167:19:45:13.641 cpu12:20943)<6>qla2xxx 0000:41:00.0: scsi(3:3:0): Abort command issued — 1 32c0d62a 2002.
Mar 1 08:43:23 vmhost16-sc vmkernel: 167:19:50:13.639 cpu9:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:74): Abort command issued — 1 32c18b9f 2002.
Mar 1 08:53:43 vmhost16-sc vmkernel: 167:20:00:33.641 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:1:66): Abort command issued — 1 32c31fbf 2002.
Mar 1 08:58:23 vmhost16-sc vmkernel: 167:20:05:13.644 cpu4:20944)<6>qla2xxx 0000:41:00.0: scsi(3:3:69): Abort command issued — 1 32c3938d 2002.
Eventually we see paths deleted for various LUNs:
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.923 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=20
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=22
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=23
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=25
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=26
Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=27
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=30
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=35
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=38
Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.536 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=39
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.494 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=40
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.496 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=44
Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.497 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=45
Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.208 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=18
Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.209 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=19
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.350 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=51
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=52
Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=53
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.700 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=0
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=1
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=2
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=3
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.702 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=4
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=8
Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=9
This means the connection to all LUNs were removed for vmhba1 to a particular array. There are still paths from vmhba1 to the other array:
eui.00173800123a001a : IBM Fibre Channel Disk (eui.00173800123a001a)
vmhba1:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:92
vmhba1:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:82
vmhba1:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:72
vmhba1:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:52
vmhba1:C0:T1:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:42
vmhba0:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:80
vmhba0:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:50
vmhba0:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:40
vmhba0:C0:T6:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:90
vmhba0:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:81
We can see all LUNs from the IBM array with WWNN 50:05:07:68:01:00:4c:c5 and 50:05:07:68:01:00:4b:00 has only 2 paths:
naa.600507680191026680000000000002ac : IBM Fibre Channel Disk (naa.600507680191026680000000000002ac)
vmhba0:C0:T1:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000027c : IBM Fibre Channel Disk (naa.6005076801910266800000000000027c)
vmhba0:C0:T1:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000033f : IBM Fibre Channel Disk (naa.6005076801910266800000000000033f)
vmhba0:C0:T1:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
naa.6005076801910266800000000000027d : IBM Fibre Channel Disk (naa.6005076801910266800000000000027d)
vmhba0:C0:T1:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
vmhba0:C0:T0:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
Based on this evidence, this could be a zoning issue or simply LUN presentation problem (perhaps that HBA is not logged into the target when it should be?). Regardless of whether or not zoning changes were implemented as well, the focus should remain that all hosts in one cluster stopped logging to /var/log while a host in another cluster did not.
Let's get back to the original issue, which is why did 3 servers behave differently than 1 server. To do that, let's compare logged-in targets:
Affected host (vmris01) boot LUN targets:
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
The HBAs were connected to the following:
50:05:07:68:01:10:4c:c5
50:05:07:68:01:10:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:30:4b:00
The host that remained online the entire time is connected to the following ports:
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
These are different ports that the host that stopped logging and required rebooting. Now let's compare with the other 2 hosts that had to be rebooted (vmris02 and vmris03):
vmris02:
50:05:07:68:01:20:4b:00
50:05:07:68:01:20:4c:c5
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
vmris03:
50:05:07:68:01:10:4c:c5
50:05:07:68:01:30:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:10:4b:00
While vmris03 is logged into the same physical ports and vmris01, vmris02 is logged into the same ports as the host that remained online. The host that remained online is only communicating on 2 of the 4 ports however, so do they identically match?
vmhost2: (from /proc/scsi/qla2xxx/)
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
FC Port Information:
scsi-qla1-port-0=5005076801004cc5:5005076801204cc5:83031e:1;
vmris02: (from /proc/scsi/qla2xxx/)
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
FC Port Information:
scsi-qla1-port-0=5005076801004b00:5005076801204b00:83031b:81;
scsi-qla1-port-1=5005076801004cc5:5005076801204cc5:83031e:82;
Even though vmhost is missing one of the paths, we can see that the targets are the same. Since the hosts were logged into the same targets but had different results when it comes to logged out ports and being able to stay online, this may have to do with the amount of I/O that was in-flight from each host at the time. A host that was not issuing hardly any I/O at the time may have survived the outage because it did not issue any I/O, whereas the service console of the ESX hosts that were doing I/O at the time had a timeout on the local disk, resulting in the EXT3 file systems remounting as read-only. This would have resulted in an inability to write to /var/log. It would also cause some services to stop functioning, such as SSH, while other stacks in memory continue to operate (responds to ICMP ping requests).
The host that remained online (vmhost2) was only running 2 VMs at the time the logs were collected:
# cat /proc/vmware/vm/*/names
vmid=4371 pid=-1 cfgFile="/vmfs/volumes/4d1a3ce4-d8b96856-0231-842b2b5c4f58/uhmc-riscxr/uhmc-riscxr.vmx" uuid="42 38 00 37 fc 6e 04 38-a4 cb a9 22 98 f3 ce 9d" displayName="uhmc-riscxr"
vmid=4376 pid=-1 cfgFile="/vmfs/volumes/4d2f4d6a-a751b8d0-1109-842b2b5c4f58/uhmc-risrpt/uhmc-risrpt.vmx" uuid="42 38 18 73 93 c6 d2 89-c5 87 e8 9e 24 5e 34 94" displayName="uhmc-risrpt"
A review of the vmkernel logs for that host do not show any VMs being powered on/off before, during, or after the event, so it is safe to assume these were the only VMs running on this host. We can only speculate what these VMs are designed for, however I would speculate that they are low I/O VMs and thus there were no or very few I/O's issued at the time the event occurred. We would also surmise that hosts vmris01, vmris02, and vmris03 had well more than 2 VMs running and they were all doing I/O at the time the event occurred. This theory explains why vmhost02 remained online while other hosts did not. As far as why all paths were lost at the same moment, which forcing the service consoles EXT partitions to remount as read-only, this would be possible if the paths were unavailable for longer than the disk timeout period, which is around 30 seconds. The question then becomes, what occurred on the SVC at ~6am? Were both SVC's hard rebooted after the firmware was successfully applied?
We'd love to hear your comments! Use the box below.
Old and New Forums in VMTN: Server Virtualization & Cloud
Many of you are very familiar with our Community Forums. They are amongst the most active in the industry. Look at some of these number though. 13+ thousand discussions around VMware vCenter Server. 21+ thousand discussions in the ESX 3.5 forum. You may not have heard of some of these too. The vCloud Client for iPad is only a month old.
See any communities below that you might be missing?
What You’re Subscribing To
It's interesting sometimes to look over the myriad of stats that Google gives us for the Knowledgebase. One of the items they provide us is a list of Subscriber Stats.
Subscriber stats display the number of Google users who have subscribed to our feeds using any Google product (such as Reader, iGoogle, or Orkut). Our Knowledgebase is kind of unique in that every single article published in it has its own RSS feed. Why is that you ask? It's our way of keeping users current when changes occur to articles. Since our KB articles are the official voice of VMware on technical issues, you want to be sure you're up to date with the latest information on a given problem. (For more info on how-to subscribe to a KB article, follow this link.)
Today, our most subscribed to KB article is 2006859 Windows 8 operating system does not boot or install on ESXi or ESX. Seems a lot of you are anxious to try out Windows 8 so you can support it virtualized. We can certainly understand that.
We're going to be studying our stats more closely going forward as it tells us a lot about what users are interested in.
As a reminder, if you ever have an idea for a KB article you'd like to see written, but isn't yet, you can drop your suggestion right here!
Go Que vs The World