Host profile removing vmk0 management port

Until recently I had very little exposure to host profiles, and I have since had the pleasure of getting better acquainted with the feature. There is a well-documented “bug” with host profiles and Auto Deploy where the vMotion kernel port takes over vmk0 and the host disconnects from vCenter (e.g. here & here). My problem was slightly different, however, and the many blog posts I found did not solve it for me.

Here is what went down.

I was upgrading a host from 5.1 to 5.5. The host is deployed via Auto Deploy, so I pointed it at the 5.5 image and rebooted it, all good. I made some slight tweaks to the config, updated the answer file and applied the profile. On the subsequent reboot the host came back up with part of its new config and then I watched vCenter apply the remaining settings. Next I noticed that the task was attempting to reconnect to the host. Flicking over to the IMM showed that the management interface had now been re-configured with the vMotion kernel port IP address, and soon enough vCenter marked the host as disconnected.

Some head scratching ensued and many hours went by as I tried to get to the source of the problem, from furious googling to validating the host profiles against other clusters and a number of pointless reboots. I found myself in a bit of a “chicken or the egg” scenario: the host profile prevented me from modifying the management network IP, and I couldn’t manage the host from vCenter :(. I soon discovered that there was a brief window in which the host was connected to vCenter and I could remove the host profile from it. After two reboots I was back to a host without the host profile attached, which I could manage again.

I decided it was time to configure the host manually and create a new host profile. It all went swimmingly and my host was now compliant, woo hoo! Happy days… until I went to re-build the next host in the cluster. As I went to apply the profile, the summary page stated something along the lines of “Remove vmk0 from vSwitch 0”. Damn, same issue!

The same headaches followed and I wasted another couple of hours. After some deliberating with colleagues I figured I would create the VMkernel ports manually and then try applying the profile. This time around there were no messages about removing vmk0, so I felt a slight hint of confidence and rebooted the host. The host came back compliant and configured as per the host profile, finally!

So in conclusion, create all your VMkernel ports prior to applying the host profile.
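For illustration, here is a minimal sketch of pre-creating a VMkernel port from the ESXi shell before attaching the profile. It assumes a standard vSwitch; the port group name, vmk number and IP addressing are hypothetical placeholders, not values from my environment. By default the script only prints the esxcli commands so you can review them; set RUN=1 to actually execute them on the host.

```shell
#!/bin/sh
# Sketch: pre-create a VMkernel port before applying the host profile.
# All names and addresses passed in below are hypothetical placeholders.
# Dry run by default: commands are printed. Set RUN=1 to execute them.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

create_vmk() {
    # $1 = vSwitch, $2 = port group, $3 = vmk interface, $4 = IPv4, $5 = netmask
    run esxcli network vswitch standard portgroup add --portgroup-name="$2" --vswitch-name="$1"
    run esxcli network ip interface add --interface-name="$3" --portgroup-name="$2"
    run esxcli network ip interface ipv4 set --interface-name="$3" --ipv4="$4" --netmask="$5" --type=static
}

create_vmk vSwitch0 vMotion vmk1 192.168.10.11 255.255.255.0
```

Repeat the create_vmk call for each VMkernel port the profile defines, then apply the profile and check that the summary no longer mentions removing vmk0.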

This may be common knowledge to some, but it wasn’t to me and I didn’t see it called out anywhere through my searches, although I may have missed it.

Thanks for reading!


Disconnected from Host. Agent is out of date and needs a manual upgrade

I recently performed an upgrade of vCenter from 5.0 to 5.5 and had an issue with just one of my hosts connecting to vCenter.

Agent out of date
First I tried to reconnect it manually, with no success. I restarted the management agents from the iLO (not sure why I didn’t try going via SSH first as I normally would) and tried to connect again, as this had worked for other issues I came across in the past. Still nothing.

OK, let’s uninstall the FDM client manually and try re-connecting the host (KB1003714):

# Copy the FDM (HA agent) uninstaller to /tmp, make it executable and run it
cp /opt/vmware/uninstallers/VMware-fdm-uninstall.sh /tmp
chmod +x /tmp/VMware-fdm-uninstall.sh
/tmp/VMware-fdm-uninstall.sh

This time upon connecting the host I was prompted for the root credentials and proceeded through the wizard; enter next error:

Cannot contact the specified host (host1.lab.local) 
The host may not be available on the network, a network configuration 
problem may exist, or the management services on this host may not be
responding

Here, I did a sanity check and confirmed that DNS resolution was working and that I could ping the host, all good. Hrmm…

Next I took a look at fdm.log and vpxa.log, which were not giving me much to go on. So I went over to the Events tab in the C# client to have another look at the error and noticed I was also getting an incorrect username and password error, dafaq?

So I decided to ssh to the host to confirm that I wasn’t mistyping the password and then realized that I couldn’t initiate a session:

"Network Error: Connection Refused"

I checked lockdown mode was disabled, SSH service was running and the host firewall rules were okay. Hrmm, why on earth is SSH refusing my connection?

Some quick google-fu came up with the following KB article (KB1039095).
Back to the iLO I went and sure enough, the inetd.conf file was blank.
I copied the contents of the configuration file from another host in the cluster, then restarted the SSH daemon along with the management agents.
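If you want to check for this condition quickly, here is a small sketch that reports whether an inetd.conf-style file is empty (or missing) or lacks an ssh entry. The default path /var/run/inetd.conf is an assumption about where ESXi 5.x keeps the file; pass a different path as the first argument if yours differs. The restart commands in the trailing comments are the ones I would expect to use on the ESXi console, so treat them as a hint rather than gospel.

```shell
#!/bin/sh
# Sketch: report whether an inetd.conf-style file is empty/missing or has no ssh entry.
# The default path below is an assumption; pass another path as $1 if needed.

check_inetd() {
    conf="$1"
    if [ ! -s "$conf" ]; then
        # File is missing or zero bytes long
        echo "EMPTY"
    elif grep -q '^ssh[[:space:]]' "$conf"; then
        echo "OK"
    else
        echo "NO_SSH"
    fi
}

status=$(check_inetd "${1:-/var/run/inetd.conf}")
echo "inetd.conf status: $status"

# If the status is not OK, restore the file from a healthy host, then restart
# SSH and the management agents from the console, for example:
#   /etc/init.d/SSH restart
#   /sbin/services.sh restart
```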

Here is my inetd.conf in case you don’t have another host to copy from:

# Internet server configuration database

# Remote shell access

ssh      stream   tcp   nowait   root   /usr/lib/vmware/openssh/bin/sshd       sshd ++swap,group=host/vim/vimuser/terminal/ssh -i
ssh      stream   tcp6  nowait   root   /usr/lib/vmware/openssh/bin/sshd       sshd ++swap,group=host/vim/vimuser/terminal/ssh -i

# VMware authentication daemon
authd   stream    tcp   nowait   root   /sbin/authd           authd
authd   stream    tcp6  nowait   root   /sbin/authd           authd

I could now ssh to the host, great! I then tried connecting the host again and voila, it was able to connect back into the cluster.

vCenter needs to copy the installation files over to the host in order to install the FDM agent, and it was unable to do so because SSH was busted.

Here is another useful article I used during my troubleshooting (KB2004429).