Have you ever run into unresponsive ESXi hosts?

Well, I did this month, and it gave me a lot of headaches these past weeks.

I want to share my findings so others can learn from them, solve the problem faster, and avoid having to restart their hosts and all their VMs.

So let me tell you a little bit about the environment. The customer had an IBM BladeCenter H chassis with three HS23 blades, each with 48 GB of RAM. All blades connect over Fibre Channel to an IBM DS3524, and for replication we chose an iSCSI QNAP system to keep it cheap. At the time the hosts became unresponsive, vCenter and the ESXi hosts were running version 5.5 U1.

It started shortly before Christmas, on 04/12/2014 to be exact. All the ESXi hosts became unresponsive, but my VMs kept running. The management network also kept responding to ping and SSH connections, yet strangely enough I could not connect with the vSphere Client. There was no PSOD and the hardware behaved normally.

When you search for an unresponsive ESXi host, the first thing you get is this KB article:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1019082
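
For reference, the usual first steps from that KB are restarting the management agents over SSH (or from the DCUI). Roughly, and at your own risk:

    # restart hostd and vpxa, the agents the vSphere Client / vCenter talk to
    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart
    # or restart all management agents in one go
    services.sh restart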

None of the steps worked, so the first time I just decided to restart the ESXi hosts in my cluster and restart all the VMs, hoping it was a fluke, a one-time error. After a week the problem came back, so I decided to upgrade everything:

  • upgrade IBM BladeCenter H chassis firmware
  • upgrade IBM HS23 blade firmware
  • upgrade IBM DS3524 firmware
  • upgrade Brocade fiber switch firmware (see my video on YouTube, ‘Brocade fiber switch firmware upgrade’)
  • upgrade BNT switch firmware
  • upgrade vCenter to 5.5 U2
  • upgrade ESXi hosts to 5.5 U2 (since IBM didn’t have a customized image at the time of writing, I used the standard VMware image instead; see the sketch after this list)
  • upgrade the QNAP replication storage firmware
  • upgrade Veeam Backup and Replication 7.00x to 8.0.194
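
Since there was no IBM customized image yet, I used the plain VMware one. If you upgrade from the command line instead of Update Manager, it looks roughly like this; the image profile name for 5.5 U2 below is from memory, so verify it against the depot first:

    # allow the host to reach the VMware online depot
    esxcli network firewall ruleset set -e true -r httpClient
    # put the host in maintenance mode
    esxcli system maintenanceMode set -e true
    # upgrade to the 5.5 U2 image profile (profile name is an assumption, check the depot)
    esxcli software profile update -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml -p ESXi-5.5.0-20140902001-standard
    # reboot to finish
    reboot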

I thought that should do it, but no such luck.

Since the customer had decided not to take out a support contract, I posted my problem on the VMware Community. I got some good tips, so I want to thank everyone who responded to my post. I suggest you take a look at these links from other threads where ESXi hosts became unresponsive:
Exhausting inodes + Disconnected Host
Re: Free INODES and % free RAMDISK
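
Those threads are about the host running out of inodes or ramdisk space. If you want to check that yourself over SSH, something like this (just a sketch):

    # free inodes on the root filesystem
    stat -f /
    # ramdisk / tardisk usage
    vdf -h
    # per-ramdisk usage, including inode counts
    esxcli system visorfs ramdisk list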

Here is my original post.

So how did I solve it? Almost by accident, actually. I was working in my home lab and needed to take a backup before testing something. I started the backup, and the next morning I saw that my hosts there were also unresponsive while all the VMs were still running. I figured that couldn't be a coincidence.

My home lab also runs iSCSI datastores for backup, so I rebooted that storage first because I didn't need it right away. To my surprise, my ESXi hosts came back online!! YEEEES. Finally. So the problem is the QNAP iSCSI storage. I checked the firmware version on the QNAP systems and it was on 1.4.2, the latest one, but on 27/01/2015 there was an update for the 1.4.2 firmware, and since then no ESXi hosts have gone offline. I asked QNAP support if they know about this problem and whether the new firmware will solve it; so far I haven't heard back from them, but if they do I'll edit this post.
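
If you want to confirm which datastores and sessions sit behind the software iSCSI adapter before you bounce the storage, these are handy (adapter names like vmhba33 vary per host):

    # find the software iSCSI adapter (often vmhba33)
    esxcli iscsi adapter list
    # active sessions to the iSCSI target(s)
    esxcli iscsi session list
    # map datastores to their backing devices
    esxcli storage vmfs extent list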

FritzBrause also confirmed from my logs that it was iSCSI, so I didn't have to look any further.

Conclusion

  • The QNAP iSCSI datastore made my ESXi hosts become unresponsive. Look in your logs for entries like this (there is a small grep sketch after this list):
    2015-01-29T13:36:00.706Z cpu1:33451)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba33:CH:0 T:0 L:0 : Task mgmt "Abort Task" with itt=0x6cac7 (refITT=0x68fcd) timed out
  • Update the firmware of your iSCSI storage, or ask your vendor's support what to do. In my case it was QNAP, but other vendors could have the same issue.
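
A quick way to look for these iSCSI task-management timeouts on a host over SSH (just a grep sketch):

    # search the vmkernel log for iSCSI abort/timeout warnings
    grep -i "iscsivmk_TaskMgmtIssue" /var/log/vmkernel.log
    # or the broader iSCSI warnings
    grep -i "WARNING: iscsi_vmk" /var/log/vmkernel.log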