Site Stats


Visits today: 5

Total Visits: 6462, since: December 6, 2011

Archives

Richard Blythe

New Portuguese KB Articles

Yesterday we featured new Spanish articles. Today we have our list of brand new translated content for our Portuguese speaking customers. We've hand picked this list from our best content. Enjoy!

Portuguese KB Article Title English Version
2018857 Falha na conversão com o erro: Não foi possível para o conversor criar um instantâneo VSS do volume de origem 1019690
2017708 2017708Trabalhar com instantâneos 1009402
2017322 Configurar a velocidade e o duplex de um adaptador de rede do host ESX/ESXi 1004089
2017328 Suporte do Microsoft Cluster Service (MSCS) para o ESX/ESXi 1004617
2017335 Registrar ou adicionar uma máquina virtual ao inventário 1006160
2018853 O dispositivo USB host não funciona no VMware Workstation ou máquina virtual Player 1016809
2017336 Compreender os tipos de rede em produtos hospedados 1006480
2018871 Instalar as ferramentas VMware em um convidado do Mac OS 1032440
2017691 Usar esxtop para identificar problemas de desempenho do armazenamento 1008205
2014528 Comparação entre o VMware ESXi/ESX 4.1 e o ESXi 5.0 2005377
2018931 Testar a conectividade da rede com o comando ping 1003486
2016700 Prácticas recomendadas para actualizar a ESXi 5.0 1009039
2018943 Aumentar o tamanho de uma partição de disco 1004071
2018933 Reiniciar os agentes de gerenciamento em um servidor ESX ou ESXi 1003490
2018938 Configurar uma VLAN em um grupo de portas 1003825
2018973 A opção Raw Device Mapping está desativada 1017704
2018876 Alta utilização de CPU em máquinas virtuais do Windows 1077
2018925 Configurar manualmente uma máquina virtual para usar o driver do adaptador de rede e1000 1003020
2018956 Atualizar uma máquina virtual para a versão de hardware mais recente 1010675
2018941 Fazer uma nova verificação do armazenamento 1003988
2018944 Solucionar problemas no setor de inicialização em sistemas operacionais baseados no Windows NT 1006556
2018878 Dicas para reparar o arquivo VMDK em um sistema convidado travado do Workstation ou GSX Server em um host do Windows 2016
2018937 Configuração de VLAN em switch virtual, switch físico e máquinas virtuais 1003806
2018955 Amostra de configuração da porta do switch para VLAN e TRUNK MODE 1006628
2018965 Ocorre falha na instalação das ferramentas VMware quando o recurso Easy Install está em andamento 1017687
2019031 Práticas recomendadas para instantâneos de máquinas virtuais no ambiente do VMware 1025279
2018979 Instalar o ESXi Installable em uma unidade USB ou cartão SD 1020655
2018879 Sistemas de arquivos baseados em Linux tornam-se somente leitura 51306
2019029 Migrar um banco de dados existente do vCenter Server para a versão 4.1 usando o recurso Data Migration Tool 1021635
2018940 Abrir um prompt de comando ou de shell 1003892

New Spanish Knowledgebase Content

Here is nice list of brand new translated content for all of our Spanish speaking customers. We've hand picked this list from our best content.  Enjoy!

Spanish KB Article Title English Version
2016356 Mover o copiar una máquina virtual dentro de un entorno de Vmware 1000936
2016685 Preguntas frecuentes sobre compatibilidad y licenciamiento en vSphere 5.0 1010839
2016640 Usar el controlador de disco VMware SCSI para sistemas operativos huésped Window 1006956
2016670 Licencias para VMware Workstation 1015481
2016341 Configuración de las conexiones de red desde la línea de comandos de la consola de servicio ESX 1000258
2016683 Entender y resolver problemas de las nuevas funciones en vSphere 5.0 1015000
2016325 Requisitos de compatibilidad de CPU de VMotion para procesadores Intel 1991
2016702 Configurar Network Time Protocol (NTP) en servidores ESX/ESXi 4.1 y ESXi 5.0 usando el cliente vSphere 1339
2016239 Mover o copiar discos virtuales en un entorno VMware 900
2016673 Utilizar el modo de soporte técnico en ESXi 4.1 y ESXi 5.0 1017910
2016648 Virtualizar controladores de dominio existentes 1006996
2016232 Resumen de VMware Tools 340
2016657 Solucionar problemas de fallas de conexión a la red 1008367
2016494 Introducir controladores de dispositivo SCSI en Windows cuando falla al iniciarse luego de convertirlo con VMware Converter 1005208
2016303 Ajustar la zona horaria del host de ESX 1436
2016668 Instrucciones generales de instalación de VMware Tools 1014294
2016603 Las mejores prácticas de medición de tiempo para huéspedes Linux 1006427
2016660 Solucionar problemas de funcionamiento de discos I/O alojados 1008885
2016344 La configuración de Windows XP no puede encontrar unidades de disco duro durante la instalación 1000863
2016664 Construir de forma manual módulos de kernel de VMware Tools en un sistema huésped Ubuntu 1008973
2016367 Prueba de conectividad de puertos con Telnet 1003487
2016674 Configurar VMware vCenter Server para enviar alarmas cuando las máquinas virtuales se están ejecutando desde instantáneas 1018029
2016450 Verificar la integridad del archivo de la máquina virtual ESX/ESXi 1003743
2016361 Cambiar le nombre a un disco de máquina virtual (VMDK) a través de vSphere Management Assistant (vMA) o vSphere CLI (vCLI) 1002491
2016682 Solucionar un error de MSI durante la instalación de un producto o herramienta de VMware 1031302
2016688 Consideraciones y requisitos de instalación previa de la actualización vSphere 5.0 1022137

My VMware Update

My VMware had a successful first week with customers across the globe logging in and using all functionalities of the site, including downloading products and evaluations, submitting Support Requests, generating license keys, creating folders and more. Early in the week, customers experienced some issues with access and stability of the environment due to high-volume traffic. We have resolved these issues and are committed to providing an exceptional experience.

Please join us for an upcoming webinar training series to learn about the new features and functionalities of My VMware. The webinar series will help customers understand the basics of logging in and navigating the site, account management, managing license keys and Support Requests, and downloading and evaluating products. We encourage you to register today!

Please continue to send us feedback! We’ve already made a lot of progress and the input from customers is helping us target specific areas of concern and guiding our improvement roadmap.

Follow us on Twitter @vmwarecares for updates and helpful links.

If you have an issue and need to contact support:

Web: https://my.vmware.com/group/vmware/get-help

Phone: https://my.vmware.com/group/vmware/contact-vmware

Social Media and the My VMware Launch

Aside from writing this blog my duties include monitoring social media and engaging with our customers for things that involve VMware Shared Services. That includes the conversations going on around the launch of My VMware.

The launch is going really well considering the sheer size of this portal. I heard someone say My VMware is one of the biggest B2B sites in the world!

Social Media can work for you, or work against you. It can be like trying to tame a wild Mustang. Things spread like wildfire. I saw one tweet show up today about a funny browser issue when trying to log in to My VMware.

I knew instantly what the problem was in this case. He needed to clear his browser cache, but suddenly I had to work at lightning speed to get ahead of this. Someone retweeted. Somebody else posted the video on YouTube! This had the makings of going viral, and fast!

Everywhere I followed, I commented with the fix. I tweeted to the original poster to try the fix. I'd need to wait for his reply. I needed to reply to every tweet now talking about this quirk.

For a while, I thought it could run faster than I could, but then the original poster answered!

  • If your MyVMware button moves around (see video), just clear your cache & works like a charm! Thanks @VMwareCares http://t.co/HheqyUJi
  • Worked great. I've updated my post to include your quick response and the fix! Thanks! You guys are doing a great job!

YES!! Now social was working for me. The answer was being spread as quickly as the problem had. Not only was the whole thing blowing over and our customers having a good chuckle over it, but in the meantime, I got our Knowledge Base Team writing up an official KB article, which is already published. I have to brag - we are a truly incredible team.

See: My VMware log in button on vmware.com moves when you try to click it 2018745

Host goes unresponsive in vCenter and SSH to the host fails

Here's our third log deep-dive from Nathan Small. If you missed the first one, here it is. Funny how all of these problems point to a KB article that solves the issue!

History of issue

A host went unresponsive last night in vCenter and SSH fails to the host. The host was rebooted to bring it back online, and now root cause is requested.
 
To determine how long the server was actually unresponsive, we need to check /var/log/vmksummary. When we do this, we see the server was not logging for 3 days, which could point to a system hang:
 
Mar 25 04:01:17 esx04 logger: (1332666076) hb: vmk loaded, 6169963.27, 6169603.352, 43, 398348, 398348, 1260, vpxa-125984, vmware-h-101788, sfcbd-19084
Mar 25 05:01:16 esx04 logger: (1332669676) hb: vmk loaded, 6173563.07, 6173202.948, 43, 398348, 398348, 1260, vpxa-126004, vmware-h-101788, sfcbd-19044
Mar 28 10:15:20 esx04 vmkhalt: (1332947720) Starting system…
Mar 28 10:15:52 esx04 logger: (1332947752) loaded VMkernel
 
NOTE: If the server was gracefully shutdown or rebooted, you would see a message stating this the line before "Starting system…"
 
What it is interesting to note is that while the host was observed to be unresponsive in vCenter only the previous night, the system hasnt been logging for over 3 days.
 
At this point we should gather more information around the physical server:
 
Server Info gathered from 'dmidecode':
 
System Information
        Manufacturer: Dell Inc.
        Product Name: PowerEdge R900
 
BIOS Information
        Vendor: Dell Inc.
        Version: 1.1.15
        Release Date: 04/15/2010
 
Next, it is important to determine ESX version:
 
# vmware -v
VMware ESX 4.0.0 build-398348
 
This ESX version is prior to 4.1 and therefore the IRQ remapping issue would not apply. This means the host stopped logging due to hardware, but specifically how or why isn't known yet. We know the host stopped logging for 3 day, so we should start with the device that communicates with the local disks, where /var/log is stored.
 
Since this host is a Dell server, it uses the PERC (PowerEdge RAID Controller), which is a rebranded LSI controller. The controller contains a log that is maintained on the hardware itself and can be extracted with the 'lsi_log' command that is available on ESX hosts.
 
When we run this command to pull the logs off the controller we can see a rather alarming event:
 
Event Sequence Number : 9658
Timestamp : 3/25/2012 ; 10:19:9
Event code : 15
Locale : Boot/Shutdown event
Class : Dead (visible after next controller boot)
Description of the event : Fatal firmware error: Line 156 in ../../raid/1078int.c
 
Argument Type Value: 18
Argument Type : MR_EVT_ARGS_STR
        String : Line 156 in ../../raid/1078int.c
 
When searching our Knowledge Base for the search string "Fatal firmware error", we find only one article, however it is a perfect match! (KB 2011987 — http://kb.vmware.com/kb/2011987). This customer will need to update the firmware of their PERC controller. To prevent outages from other hosts with the same hardware configuration, we advised that they do the same for any Dell PERC or IBM ServeRAID controllers, as these are rebranded LSI controllers.

My VMware is Easy

The all new My VMware portal is now live and ten of thousands of you are logging in to check it out. We've gotten lots of feedback about how elegant and easy it is to use, and that's great, but in some cases a few of you have been a little frustrated with some facets of the site. It turns out that these just needed some explanation. So yes, My VMware is easy, once you're armed with Knowledge!

With that in mind, here are the Knowledebase articles we're seeing requested the most.

Storage Performance and Timeout Issues

Here’s our second log deep-dive from Nathan Small. If you missed the first one, here it is.

History of the issue

General storage performance issue while experiencing timeouts and SCSI reservation issues (Timeout/conflict) on ESX 4.1 (VMware View) and ESX 5.0 environments. Need to determine why these events are being seen and how we can improve performance.
 
The first step when troubleshooting this type of issue is to take a frame of reference that is visible across all hosts. Here is the issue observed on one of the ESX 4.1 host:
 
ESXi 4.1:
 
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)ScsiDeviceIO: 1688: Command 0×16 to device “naa.60060480000190100554533032324644″ failed H:0×5 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device “naa.60060480000190100554533032324644″ state in doubt; requested fast path state update…
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu14:4134)WARNING: FS3: 7120: Reservation error: Timeout
 
When reviewing the logs from the ESX 5.0 host, we see the issue observed at the same time:
 
ESXi 5.0:
 
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0×16 (0x4124003d4740, 0) to dev “naa.60060480000190100554533033364445″ on path “vmhba2:C0:T3:L23″ Failed: H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0. Act:EVAL
2012-03-28T17:12:56.481Z cpu11:4107)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “naa.60060480000190100554533033364445″ state in doubt; requested fast path state update…
2012-03-28T17:12:56.481Z cpu11:4107)ScsiDeviceIO: 2322: Cmd(0x4124003d4740) 0×16, CmdSN 0×23102 from world 0 to dev “naa.60060480000190100554533033364445″ failed H:0×8 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
2012-03-28T17:12:56.481Z cpu1:4132)WARNING: FS3: 1835: Reservation error: Timeout
 
These logs show that the ESX storage stack is not aware of what the SCSI reservation state of the device is. This will happen when a SCSI RESERVE command is sent, doesn’t not complete to the target due to timeout, and then abort sent for the SCSI reserve doesn’t complete either due to timeout. This will usually happen for 1 of 2 reasons:
 
1. Fabric issues where frames are dropped
2. Array or array controller is overloaded
 
There are other messages occurring in the logs that also indicate an array performance issue:
 
2012-03-26T23:00:30.950Z cpu8:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW02′ (held for 1457 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-27T23:01:15.491Z cpu4:4133)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW04′ (held for 2640 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T17:12:59.456Z cpu15:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 2967 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T20:05:19.372Z cpu0:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07′ (held for 19196 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
 
See KB 1025299 for more information on this message: http://kb.vmware.com/kb/1025299
 
To simply state that this is a performance issue that should be investigated further by the storage team or the array vendor may be acceptable for some, however let’s try to dig into this further and provide more details if that is the direction we decide to go. To start, let’s take a closer look at the LUNs reporting the issue in both environments:
 
ESX 4.1:
 
Error message:
 
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
 
LUN in question: naa.60060480000190100554533037394644
 
# esxcfg-scsidevs -m |grep naa.60060480000190100554533037394644
 
naa.60060480000190100554533037394644:1                           /vmfs/devices/disks/naa.60060480000190100554533037394644:1 4d433199-f78406e2-ccc3-1cc1de093452  0  Lun_56
 
# esxcfg-mpath -b|grep -A4 naa.60060480000190100554533032324644
 
naa.60060480000190100554533032324644 : EMC Fibre Channel Disk (naa.60060480000190100554533032324644)
   vmhba0:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0a:24  Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
   vmhba1:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0b:24  Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
 
From this output we can see the WWPN (World Wide Port Name) of the connected array ports, which are 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82.
 
ESX 5.0:
 
Error message:
 
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
 
LUN in question: naa.60060480000190100554533033364445
 
# esxcfg-scsidevs -m |grep naa.60060480000190100554533033364445
 
naa.60060480000190100554533033364445:1                           /vmfs/devices/disks/naa.60060480000190100554533033364445:1 4e70c147-a779fe32-8747-0025b5010a6a  0  VMW07
 
# esxcfg-mpath -b| grep -A4 naa.60060480000190100554533033364445
 
naa.60060480000190100554533033364445 : EMC Fibre Channel Disk (naa.60060480000190100554533033364445)
   vmhba2:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0a:0d  Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
   vmhba3:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0b:0d  Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
 
We can see from the esxcfg-mpath output that the LUN having the same issue in this cluster is connected to the same WWPN: 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82
 
Given the fact that there are two separate environments connected to the same array controller (FA port, in this case), it is highly likely that the array controller is overloaded. The next step would be to collect performance information from the array side from this point forward so that when another event occurs the storage team or storage array vendor can review the health of the array at that timeframe to ensure that it can keep up with the I/O being generated.
 
In this particular case, the storage team and EMC became involved and confirmed that the FA port was overloaded. To distribute the load, and to ensure View best practices are followed, the VMware View environment was moved onto a different, dedicated FA on the array.

Hosts Unresponsive after Firmware Upgrade of IBM SVC

We have an exciting new series starting here at The Support Insider. Our senior support engineers sometimes need to deep dive into customer issues, digging into logs some of you might not even know exist. We thought we'd share some examples of these with you. You see the problem presented, the game plan unfold, and how our top escalation engineers investigate tough problems.

Nathan SmallWe begin today with Nathan Small. Nathan is a Senior Escalation Engineer on the storage team in Global Support Services, and has been employed with VMware since May 2005. He has written and designed many storage technical presentations that have been delivered at various VMware sponsored events including VMUGs and VMworld conferences. He is recognized in the worldwide virtualization community as a storage and infrastructure guru.

 

History of the issue: This customer performed a firmware upgrade of their IBM SVC however after doing so all hosts in 1 cluster went to an unresponsive state in vCenter, and it was not possible to SSH to the servers. It was still possible to ping the affected hosts. The servers were rebooted to resolve the issue. A host in a different cluster also connected to the IBM SVC did not experience this same issue.
 
Log review:
 
The host vmris01-sc stopped logging after 5am. The /var/log/vmksummary file updates at the first minute of every hour stating the uptime of the host, build version of the vmkernel/service console, and the memory usage of the top 3 process:
 
Mar  1 04:01:13 vmris01-sc logger: (1330592473) hb: vmk loaded, 31076125.88, 31076117.975, 2, 208167, 208167, 136, sfcbd-153328, ftPerl-76692, vmware-h-73592
Mar  1 05:01:14 vmris01-sc logger: (1330596074) hb: vmk loaded, 31079726.01, 31079718.111, 2, 208167, 208167, 136, sfcbd-153336, ftPerl-76736, vmware-h-73188
Mar  1 08:20:13 vmris01-sc vmkhalt: (1330608013) Starting system…
Mar  1 08:20:25 vmris01-sc logger: (1330608025) loaded VMkernel
 
Since there was no update at 6:01, this means the server stopped logging at this point. This means one of two possibilities: 1. the server hung (due to hardware), or 2. the server was no longer able to write to /var/log. It is highly unlikely that the server hung since multiple hosts were affected simultaneously, so the only possibility is that these hosts were not able to write to /var/log.
 
This host observed an event beginning at 4:13am in the form of commands failing to complete to the target:
 
Mar  1 04:13:39 vmris01-sc vmkernel: 359:16:27:43.715 cpu14:5443)VMK_PCI: 739: device 065:00.0 capType 16 capIndex 76
Mar  1 04:29:34 vmris01-sc vmkernel: 359:16:43:38.593 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0a2f00) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×2 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
 
Then we see the remote ports (array controller ports) timeout, specifically the second target (‘1’) on both HBAs:
 
Mar  1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.167 cpu15:4303)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar  1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.253 cpu25:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
 
This causes a host status of H:0×1, which requires us to failover:
 
Mar  1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0df680) to NMP device "naa.60050768019102668000000000000328" failed on physical path "vmhba0:C0:T1:L1" H:0×1 D:0×0 P:0×0 Possible sense data: 0×0 0×0 0×0.
Mar  1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60050768019102668000000000000328": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
 
Then failover actually occurs (from target 1 to target 0):
 
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu24:4268)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L5 to vmhba1:C0:T0:L5 for device "naa.60050768019102668000000000000335".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu26:8602)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L1 to vmhba1:C0:T0:L1 for device "naa.60050768019102668000000000000328".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000335" - issuing command 0x41000a15d380
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.60050768019102668000000000000328" - issuing command 0x41000a0df680
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L2 to vmhba1:C0:T0:L2 for device "naa.6005076801910266800000000000032b".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu12:8604)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L0 to vmhba1:C0:T0:L0 for device "naa.6005076801910266800000000000031c".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6005076801910266800000000000031c" - issuing command 0x41000a1ec240
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L4 to vmhba1:C0:T0:L4 for device "naa.60050768019102668000000000000334".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000335"
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L6 to vmhba1:C0:T0:L6 for device "naa.60050768019102668000000000000336".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4135)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.60050768019102668000000000000328"
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L7 to vmhba1:C0:T0:L7 for device "naa.60050768019102668000000000000337".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L8 to vmhba1:C0:T0:L8 for device "naa.6005076801910266800000000000033a".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L9 to vmhba1:C0:T0:L9 for device "naa.6005076801910266800000000000033b".
Mar  1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.638 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device "naa.6005076801910266800000000000031c"
 
The statement of “Retry world recovered device” indicates that failover completed successfully.
 
After failover,Everything is running fine until 70 minutes later when we see the first target (‘0’) timeout to both HBAs:
 
Mar  1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.805 cpu0:4300)<3> rport-2:0-0: blocked FC remote port time out: saving binding
Mar  1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.929 cpu30:4301)<3> rport-3:0-0: blocked FC remote port time out: saving binding
 
We then failover back from target 0 to target 1:
 
Mar  1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.228 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T0:L10 to vmhba0:C0:T1:L10 for device "naa.60050768019102668000000000000348".
Mar  1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.229 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T0:L11 to vmhba1:C0:T1:L11 for device "naa.6005076801910266800000000000035b".
 
This behavior is consistent with a firmware upgrade of an IBM SVC or any array type that supports non-disruptive upgrading of the controllers.
 
What is odd is that 20 minutes later we see more failover events, target 1 back to target 0, right before the server stopped logging to the vmkernel log:
 
Mar  1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.379 cpu38:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L10 to vmhba1:C0:T0:L10 for device "naa.60050768019102668000000000000348".
Mar  1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.381 cpu12:4303)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T1:L11 to vmhba0:C0:T0:L11 for device "naa.6005076801910266800000000000035b".
Mar  1 08:20:22 vmris01-sc vmkernel: pol 1
Mar  1 08:20:22 vmris01-sc vmkernel: 0:00:00:03.637 cpu0:4096)Chipset: 497: bus pci, id 4
 
We can see the timestamp jumps from 359 days to 0 days, which is essentially the hard reboot of this host.
 
It is odd that the logging suddenly stopped, however we would speculate that logging stopped because the ESX host was not able to communicate with its boot device. The only other possibility would be that /var/log ran out of space on all 3 host, which is highly unlikely. When reviewing the esxcfg-mpath output for this host, there is no local RAID controller, which means that these are likely servers that boot-from-SAN. A quick check of /etc/vmware/esx.conf reveals the following:
 
/boot/cosvmdk = "/vmfs/volumes/4d0f879c-ccb0a33e-91f7-842b2b5c4f54/esxconsole-4d0f86da-ab5c-79a0-83ee-842b2b5c4f54/esxconsole.vmdk"
 
# cat tmp/4d0f879c-ccb0a33e-91f7-842b2b5c4f54.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1 (26)
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21807235072 (20797 blocks) avail
UUID: 4d0f879c-ccb0a33e-91f7-842b2b5c4f54
Partitions spanned (on "lvm"):
        naa.6005076801910266800000000000031c:5
 
This server is booting off the 5th partition of naa.6005076801910266800000000000031c
 
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
   vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
   vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
   vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
   vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
 
The output above confirms the host is boot-from-SAN.
 
Let's review the server hardware (from dmidecode or smbios):
 
        Manufacturer: Dell Inc.
        Product Name: PowerEdge R910
 
BIOS Information
        Vendor: Dell Inc.
        Version: 1.2.0
        Release Date: 06/22/2010
 
And the ESX version:
 
VMware ESX 4.0.0 build-208167
 
This BIOS version is very old which means it may have been susceptible to the IRQ remapping issue (KB 1030265 — http://kb.vmware.com/kb/1030265), however the ESX version is prior to 4.1 when this feature was introduced. This means that the reason the hosts stopped logging was more than likely related to the fact that they boot-from-SAN and all paths to the array were lost around 6am.
 
Let's review another host:
 
Mar  1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)NMP: nmp_CompleteCommandForPath: Command 0×12 (0x41000a1a3e00) to NMP device "mpx.vmhba2:C0:T0:L0" failed on physical path "vmhba2:C0:T0:L0" H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar  1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)ScsiDeviceIO: 747: Command 0×12 to device "mpx.vmhba2:C0:T0:L0" failed H:0×0 D:0×2 P:0×0 Valid sense data: 0×5 0×24 0×0.
Mar  1 08:17:56 vmris02-sc vmkernel: 497: bus pci, id 3
Mar  1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:A (0) 01-03 trg 2 pol 1
Mar  1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:B (1) 01-14 trg 2 pol 1
 
This host stopped logging at 6:04 instead of 6:00, which is a difference of 4 minutes. Let's review when the first remote ports disappeared to help correlate:
 
Mar  1 04:33:09 vmris02-sc vmkernel: 435:13:35:12.922 cpu33:4307)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar  1 04:33:10 vmris02-sc vmkernel: 435:13:35:13.143 cpu24:4294)<3> rport-2:0-0: blocked FC remote port time out: saving binding
 
This is also off by 4 minutes, which means we can likely conclude that they actually occurred at the exact same time on both hosts, but these hosts are not all setup for NTP.
 
Finally, let’s review the last host to verify these hosts observed the issue identically. This host observed the same behavior, though the server time seems to be off by 9 minutes (lack of NTP):
 
Mar  1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.554 cpu18:4313)<3> rport-3:0-1: blocked FC remote port time out: saving binding
Mar  1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.826 cpu12:4307)<3> rport-2:0-1: blocked FC remote port time out: saving binding
 
Mar  1 05:48:24 vmris03-sc vmkernel: 75:20:02:46.594 cpu6:4304)<3> rport-3:0-0: blocked FC remote port time out: saving binding
Mar  1 05:48:27 vmris03-sc vmkernel: 75:20:02:46.640 cpu18:4309)<3> rport-2:0-0: blocked FC remote port time out: saving binding
 
Needless to say, all hosts observed the issue at the exact same time, when taking into account the time drift is a direct result of a lack of NTP running. Root cause for why storage paths disappeared again at 6am is not known since the ESX hosts had nowhere to write logs, which suggests that all available paths to the storage were simultaneously lost at that moment.
 
The other host that logs were upload for was a host that did not experience the same outage as hosts in another cluster:
 
Mar  1 04:42:38 vmhost16-sc vmkernel: 167:15:49:25.581 cpu14:4295)<3> rport-2:0-0: blocked FC remote port time out: saving binding
 
Mar  1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.397 cpu12:4313)<3> rport-2:0-1: blocked FC remote port time out: saving binding
Mar  1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.815 cpu12:4298)<3> rport-3:0-0: blocked FC remote port time out: saving binding
 
We can see that only 3 out of 4 remote ports went offline compared to other hosts, however a grep for rport shows the port did go down 2.5 hours later:
 
Mar  1 07:18:23 vmhost16-sc vmkernel: 167:18:25:13.641 cpu12:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding
 
*Note: With default HBA driver logging it will only report when a remote disappears, not when it is visible again.
 
This host logs well past 6am, which likely means this host does not boot-from-SAN:
 
/boot/cosvmdk = "/vmfs/volumes/4d2b18aa-56df78bf-9bc2-842b2b5c502c/esxconsole-4d2b174e-97a3-3540-1b89-842b2b5c502c/esxconsole.vmdk"
 
# cat /tmp/4d2b18aa-56df78bf-9bc2-842b2b5c502c.vpa.txt
VMFS-3.33 file system spanning 1 partitions.
File system label (if any): Storage1
Mode: public
Capacity 30870077440 (29440 file blocks * 1048576), 21595422720 (20595 blocks) avail
UUID: 4d2b18aa-56df78bf-9bc2-842b2b5c502c
Partitions spanned (on "lvm"):
        naa.600507680191026680000000000001a6:5
 
naa.600507680191026680000000000001a6 : IBM Fibre Channel Disk (naa.600507680191026680000000000001a6)
   vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
   vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
 
As we can see, this is actually a boot-from-SAN configuration, so why were 3 hosts affected but not this one?
 
Going back to the fact that this last host didn’t lose all paths to the storage (only 3 out of 4), we will focus on the fact that it eventually lost the last path 2.5 hours later, after which the host complained constantly about SCSI Reservation conflicts:
 
Mar  1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.646 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T0:L9" state. Status=SCSI reservation conflict
Mar  1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.652 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba0:C0:T1:L9" state. Status=SCSI reservation conflict
 
Also, the qlogic driver is aborting commands every 5 minutes exactly, which is the frequency of the disk.patheval code (300 seconds be default):
 
Mar  1 08:28:33 vmhost16-sc vmkernel: 167:19:35:23.641 cpu3:20943)<6>qla2xxx 0000:41:00.0: scsi(3:1:72): Abort command issued — 1 32bf3c06 2002.
Mar  1 08:33:23 vmhost16-sc vmkernel: 167:19:40:13.637 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:67): Abort command issued — 1 32c01950 2002.
Mar  1 08:38:23 vmhost16-sc vmkernel: 167:19:45:13.641 cpu12:20943)<6>qla2xxx 0000:41:00.0: scsi(3:3:0): Abort command issued — 1 32c0d62a 2002.
Mar  1 08:43:23 vmhost16-sc vmkernel: 167:19:50:13.639 cpu9:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:74): Abort command issued — 1 32c18b9f 2002.
Mar  1 08:53:43 vmhost16-sc vmkernel: 167:20:00:33.641 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:1:66): Abort command issued — 1 32c31fbf 2002.
Mar  1 08:58:23 vmhost16-sc vmkernel: 167:20:05:13.644 cpu4:20944)<6>qla2xxx 0000:41:00.0: scsi(3:3:69): Abort command issued — 1 32c3938d 2002.
 
Eventually we see paths deleted for various LUNs:
 
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.923 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=20
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=22
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=23
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=25
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=26
Mar  1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=27
Mar  1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=30
Mar  1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=35
Mar  1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=38
Mar  1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.536 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=39
Mar  1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.494 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=40
Mar  1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.496 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=44
Mar  1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.497 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=45
Mar  1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.208 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=18
Mar  1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.209 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=19
Mar  1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.350 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=51
Mar  1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=52
Mar  1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=53
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.700 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=0
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=1
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=2
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=3
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.702 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=4
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=8
Mar  1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=9
 
This means the connection to all LUNs were removed for vmhba1 to a particular array. There are still paths from vmhba1 to the other array:
 
eui.00173800123a001a : IBM Fibre Channel Disk (eui.00173800123a001a)
   vmhba1:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:92
   vmhba1:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:82
   vmhba1:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:72
   vmhba1:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:52
   vmhba1:C0:T1:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:42
   vmhba0:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:80
   vmhba0:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:50
   vmhba0:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:40
   vmhba0:C0:T6:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:90
   vmhba0:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:81
 
We can see all LUNs from the IBM array with WWNN 50:05:07:68:01:00:4c:c5 and 50:05:07:68:01:00:4b:00 has only 2 paths:
 
naa.600507680191026680000000000002ac : IBM Fibre Channel Disk (naa.600507680191026680000000000002ac)
   vmhba0:C0:T1:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
   vmhba0:C0:T0:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
 
naa.6005076801910266800000000000027c : IBM Fibre Channel Disk (naa.6005076801910266800000000000027c)
   vmhba0:C0:T1:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
   vmhba0:C0:T0:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
 
naa.6005076801910266800000000000033f : IBM Fibre Channel Disk (naa.6005076801910266800000000000033f)
   vmhba0:C0:T1:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
   vmhba0:C0:T0:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
 
naa.6005076801910266800000000000027d : IBM Fibre Channel Disk (naa.6005076801910266800000000000027d)
   vmhba0:C0:T1:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5
   vmhba0:C0:T0:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00
 
Based on this evidence, this could be a zoning issue or simply LUN presentation problem (perhaps that HBA is not logged into the target when it should be?). Regardless of whether or not zoning changes were implemented as well, the focus should remain that all hosts in one cluster stopped logging to /var/log while a host in another cluster did not.
 
Let's get back to the original issue, which is why did 3 servers behave differently than 1 server. To do that, let's compare logged-in targets:
 
Affected host (vmris01) boot LUN targets:
 
naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)
   vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5
   vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00
   vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90  Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5
   vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90  Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00
 
The HBAs were connected to the following:
 
50:05:07:68:01:10:4c:c5
50:05:07:68:01:10:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:30:4b:00
 
The host that remained online the entire time is connected to the following ports:
 
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
 
These are different ports that the host that stopped logging and required rebooting. Now let's compare with the other 2 hosts that had to be rebooted (vmris02 and vmris03):
 
vmris02:
 
50:05:07:68:01:20:4b:00
50:05:07:68:01:20:4c:c5
50:05:07:68:01:40:4b:00
50:05:07:68:01:40:4c:c5
 
vmris03:
 
50:05:07:68:01:10:4c:c5
50:05:07:68:01:30:4b:00
50:05:07:68:01:30:4c:c5
50:05:07:68:01:10:4b:00
 
While vmris03 is logged into the same physical ports and vmris01, vmris02 is logged into the same ports as the host that remained online. The host that remained online is only communicating on 2 of the 4 ports however, so do they identically match?
 
vmhost2: (from /proc/scsi/qla2xxx/)
 
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
 
FC Port Information:
scsi-qla1-port-0=5005076801004cc5:5005076801204cc5:83031e:1;
 
vmris02: (from /proc/scsi/qla2xxx/)
 
FC Port Information:
scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;
scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;
 
FC Port Information:
scsi-qla1-port-0=5005076801004b00:5005076801204b00:83031b:81;
scsi-qla1-port-1=5005076801004cc5:5005076801204cc5:83031e:82;
 
Even though vmhost is missing one of the paths, we can see that the targets are the same. Since the hosts were logged into the same targets but had different results when it comes to logged out ports and being able to stay online, this may have to do with the amount of I/O that was in-flight from each host at the time. A host that was not issuing hardly any I/O at the time may have survived the outage because it did not issue any I/O, whereas the service console of the ESX hosts that were doing I/O at the time had a timeout on the local disk, resulting in the EXT3 file systems remounting as read-only. This would have resulted in an inability to write to /var/log. It would also cause some services to stop functioning, such as SSH, while other stacks in memory continue to operate (responds to ICMP ping requests).
 
The host that remained online (vmhost2) was only running 2 VMs at the time the logs were collected:
 
# cat /proc/vmware/vm/*/names
vmid=4371   pid=-1     cfgFile="/vmfs/volumes/4d1a3ce4-d8b96856-0231-842b2b5c4f58/uhmc-riscxr/uhmc-riscxr.vmx"  uuid="42 38 00 37 fc 6e 04 38-a4 cb a9 22 98 f3 ce 9d"  displayName="uhmc-riscxr"
vmid=4376   pid=-1     cfgFile="/vmfs/volumes/4d2f4d6a-a751b8d0-1109-842b2b5c4f58/uhmc-risrpt/uhmc-risrpt.vmx"  uuid="42 38 18 73 93 c6 d2 89-c5 87 e8 9e 24 5e 34 94"  displayName="uhmc-risrpt"
 
A review of the vmkernel logs for that host do not show any VMs being powered on/off before, during, or after the event, so it is safe to assume these were the only VMs running on this host. We can only speculate what these VMs are designed for, however I would speculate that they are low I/O VMs and thus there were no or very few I/O's issued at the time the event occurred. We would also surmise that hosts vmris01, vmris02, and vmris03 had well more than 2 VMs running and they were all doing I/O at the time the event occurred. This theory explains why vmhost02 remained online while other hosts did not. As far as why all paths were lost at the same moment, which forcing the service consoles EXT partitions to remount as read-only, this would be possible if the paths were unavailable for longer than the disk timeout period, which is around 30 seconds. The question then becomes, what occurred on the SVC at ~6am? Were both SVC's hard rebooted after the firmware was successfully applied?

We'd love to hear your comments! Use the box below.

Old and New Forums in VMTN: Server Virtualization & Cloud

Many of you are very familiar with our Community Forums. They are amongst the most active in the industry. Look at some of these number though. 13+ thousand discussions around VMware vCenter Server. 21+ thousand discussions in the ESX 3.5 forum. You may not have heard of some of these too. The vCloud Client for iPad is only a month old.

See any communities below that you might be missing?

Community Discussions Documents
VMware vSphere™ 0
vSphere Upgrade & Install 3422 20
QuickStart Series 49 3
VMware ESXi 5 2970 10
VMware ESX™ 4 8280 36
VMware ESXi™ 4 11755 31
Backup & Recovery 2085 14
Automation Tools 237 4
VMware vSphere Hypervisor 386 0
VMware vSphere Client for iPad 53 0
VMware Compliance Checker for vSphere 12
vSphere Storage Appliance 71 7
VMware vCenter™ 21 0
VMware vCenter™ Server 13405 36
VMware vCenter™ Server Heartbeat™ 167 0
AppSpeed 102 8
CapacityIQ 135 1
VMware vCenter Chargeback 158 5
Converter 5827 33
Lab Manager 2399 14
Lifecycle Manager 166 3
Orchestrator 619 29
Site Recovery Manager 1404 2
Stage Manager 22 0
Update Manager 727 1
VMware vCenter QuickStart Series 6 10
VMware vCenter Configuration Manager: VCM 78 0
vCenter Application Discovery Manager: ADM 15 0
VMware vCenter Operations Manager 318 1
VMware vCenter Protect - ITScripts 3 27
VMware Infrastructure Navigator 5 0
VMware vCloud Director 783 6
vCloud Usage Meter 50 17
vCloud Client for iPad 4 0
vCloud Request Manager 33 2
VMware vShield 237 2
VMware Server™ 4 0
VMware Server 2 3592 25
VMware Server 1 10249 44
VMware Infrastructure™ 13 28
VI: VMware ESXi™ 3.5 8303 33
VI: VMware ESX™ 3.5 21970 86
VI: VMware ESX® 3.0 6765 34
ESX 2.x & VirtualCenter 1.x 1306 4
Virtual Appliance Marketplace 0 0
VMware Studio 321 14
Virtual Appliances 2736 6
VMware vSphere™ Storage 2947 18
VMware vSphere™ vNetwork 1621 10
Security and Compliance 726 9
Performance & VMmark 903 50
VMware VMmark™ 195 4
VMware I/O Analyzer 17 2
Availability: HA & FT 727 6
vMotion & Resource Management 595 5

What You’re Subscribing To

It's interesting sometimes to look over the myriad of stats that Google gives us for the Knowledgebase. One of the items they provide us is a list of Subscriber Stats.

Subscriber stats display the number of Google users who have subscribed to our feeds using any Google product (such as Reader, iGoogle, or Orkut). Our Knowledgebase is kind of unique in that every single article published in it has its own RSS feed. Why is that you ask? It's our way of keeping users current when changes occur to articles. Since our KB articles are the official voice of VMware on technical issues, you want to be sure you're up to date with the latest information on a given problem. (For more info on how-to subscribe to a KB article, follow this link.)

Today, our most subscribed to KB article is 2006859 Windows 8 operating system does not boot or install on ESXi or ESX. Seems a lot of you are anxious to try out Windows 8 so you can support it virtualized. We can certainly understand that.

We're going to be studying our stats more closely going forward as it tells us a lot about what users are interested in.

As a reminder, if you ever have an idea for a KB article you'd like to see written, but isn't yet, you can drop your suggestion right here!

Go Que Newsroom

Categories

FInd us on Facebook

Contact us

Your message was successfully sent.
Thank You!