ESXi + RC2 = weekly reset of host machine?



  • Hi,

    Two weeks ago I updated my RC1 (dated April 14th) to RC2 (May 25th snapshot).
    Since then I've had 2 instances where the VM became unresponsive (unable to ping any of its IPs: WAN, LAN, or VLAN).
    With the vSphere client I am able to log in but unable to power off/reset/power on any virtual machine (pfSense is the only one running).

    On the ESXi console (i.e. the keyboard directly attached to the machine) I tried to reboot the ESXi host machine … this does not work (it seems to be waiting for something).

    The only option I have is to cold boot the ESXi host machine to get things back online!

    I have no logs and no errors; everything seems to get lost with the cold boot.

    For now I've reverted back to RC1.

    I didn't think it was possible for a VM to "kill" a host machine, but apparently it is...


  • Rebel Alliance Developer Netgate

    It would be very unusual indeed for a VM to kill the host in that way.

    You could try redirecting your logs to an external syslog server to see if there are any errors logged at or near the time of the connectivity loss.

    You might also consider swapping out NICs in the host machine to eliminate hardware problems.
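
The external syslog suggestion above can be tried without dedicated software. What follows is only a minimal sketch, assuming a spare machine on the same network that can run Python and that the host and the pfSense VM are pointed at it for remote logging. The file name `esxi-remote.log` and the helper names `format_record`, `SyslogHandler`, and `serve` are made up for illustration; none of them come from pfSense or ESXi.

```python
import socketserver
from datetime import datetime, timezone

# Hypothetical output file; pick any path on the receiving machine.
LOG_PATH = "esxi-remote.log"


def format_record(sender_ip: str, payload: bytes, received_at: datetime) -> str:
    """Turn one raw syslog datagram into a single timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    return f"{received_at.isoformat()} {sender_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    """Appends every received UDP datagram to LOG_PATH."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        payload, _sock = self.request
        line = format_record(
            self.client_address[0], payload, datetime.now(timezone.utc)
        )
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(line + "\n")


def serve(host: str = "0.0.0.0", port: int = 514) -> None:
    """Listen for syslog datagrams until interrupted (blocks forever)."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` starts the listener. Binding to UDP port 514 usually needs root or administrator rights; a higher port also works if the sender can be configured to use it. Because the lines are written on a separate machine, they survive a cold boot of the host.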



  • It's a brand new Dell R310 running pfSense on top of ESXi 4.1.

    I put an exact duplicate of that machine in a different school today; I will check later this week whether similar issues occur.

    Also, the problem described above is the result of an update from RC1 to the May 25th snapshot … perhaps even a clean install would fix it. The problem is that high school students and teachers get hysterical when I unplug their internet to try something ;)

    Have there been any driver changes (e1000 VM NIC) from RC1 to RC2?


  • Banned

    Set up another virtual machine next to the one that has problems. Do a fresh install and configure it the same way, but with different IP addresses so you can test alongside the production environment…



  • @heper:

    On the ESXi console (i.e. the keyboard directly attached to the machine) I tried to reboot the ESXi host machine … this does not work (it seems to be waiting for something).

    The only option I have is to cold boot the ESXi host machine to get things back online!

    I have no logs and no errors; everything seems to get lost with the cold boot.

    For now I've reverted back to RC1.

    I didn't think it was possible for a VM to "kill" a host machine, but apparently it is...

    You have a problem in your host, then. No VM can kill the host (short of an ESX bug, but that's highly unlikely).



  • http://fcaltdit.wordpress.com/2011/03/07/powering-off-a-virtual-machine-on-an-esxi-host/

    It seems that "unresponsive" VMs can cause ESXi to malfunction, as in not being able to power them off or on.
    kill -9 seems to be the only option in that case.

    Perhaps that is what is happening on my server.


  • Banned

    Which version of ESXi?



  • 4.1, the latest build I could find on their official website ….

    Perhaps I should try the custom builds by Dell.


  • Banned

    Have you installed VMware Tools on the pfSense VM?



  • @Supermule:

    Have you installed VMware Tools on the pfSense VM?

    No.


  • Banned

    Thx :) Which type of virtual machine did you choose?



  • I guess a regular 64-bit FreeBSD setup, using the e1000 NICs.

    So far the old RC1 doesn't seem to cause any issues, so I'm still guessing something went wrong with the auto-update from RC1 to RC2.


  • Banned

    Reinstall as 32-bit FreeBSD and see if that improves the situation. That's the best advice I can give based on the information.



  • I'd double-check the BIOS on that Dell to ensure you're running the latest suggested version. Make sure you also check the HCL on vmware.com for your version of ESXi (http://tinyurl.com/3pp2ynf). I've seen similar problems when there is a bad RAID card involved.

    Sean



  • It ran fine for around 2 months … then "suddenly" it started.

    But I doubt it is a hardware issue.

    The reason I doubt this is that the crashes always happen on a Saturday (today is the 3rd time in a row). Also, it does not seem to be related to the update to RC2, as today the old RC1 VM crashed.

    I'm starting to think a colleague or I must have installed some package that is coded to do something on a Saturday that causes the VM to go unresponsive.

    What packages could do this?
    AFAIK we have the following installed:
    squid + squidguard (they ran for 2 months without issues)
    lightsquid (been installed for a couple of weeks without issues)
    ntop (probably the most recently installed package and the most likely suspect?)
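
Once outage times are written down, the "always on a Saturday" pattern can be checked by tallying them per weekday. The following is just a sketch: the function name `outages_by_weekday` is hypothetical, and the timestamps in the usage note are placeholders, not the actual crash times, which were not recorded in this thread.

```python
from collections import Counter
from datetime import datetime

# Fixed English names, indexed like datetime.weekday() (Monday is 0).
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")


def outages_by_weekday(timestamps):
    """Count ISO 8601 timestamps (e.g. '2011-06-11T10:15:00') per weekday."""
    return Counter(
        DAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps
    )
```

For example, `outages_by_weekday(["2011-06-04T09:30:00", "2011-06-11T10:15:00"])` counts two placeholder entries under Saturday. If every real outage lands on the same day, a weekly scheduled job (in a package or on the host) becomes a more plausible suspect than random hardware failure.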



  • If your host is still unresponsive, you're chasing the wrong problem. If the host doesn't behave, it's an ESX issue completely unrelated to the guest(s).


  • Banned

    I had the issue with a Windows 2008 machine on an early vSphere snapshot. So not entirely….



  • I went over there today; it's a national holiday, so nobody is around needing internet. I reinstalled ESXi and did a clean install of pfSense using the June 12 snapshot.

    I used the Dell custom ESXi (it might have more up-to-date drivers for Dell hardware).
    pfSense was installed with the minimal packages required (only squid and squidguard were added).

    Also @cmb:

    I tried installing squidguard before squid was installed… This install did not complete, giving an error about a missing file (some squid config file).
    I manually installed squid, then squidguard reinstalled without issues.

