Bug report - pfsense on ESXi 5 freeze

AshleyRBlack

Chris Buechler @cbuechler
@AshleyRBlack post info like hypervisor, config overview, anything else that might be pertinent, and i'll check it out and reply.

Chris Buechler @cbuechler
@AshleyRBlack it's not Windows, bad band aid and prob won't fix. twitter too short to troubleshoot, I'll reply @ forum or mailing list post

OK, the problem: both at home and in the lab VM at work, pfSense just freezes after a few days, including the VM console.
At work, the boss set up a cron job a few weeks ago to restart it every night, which I have copied at home for now.

I will give details of my home one, as I don't have access to the work lab one yet.

*** Welcome to pfSense 2.0.1-RELEASE-pfSense (amd64) on pfSense ***

WAN (wan) -> em1 -> 86.x.x.x (DHCP)
LAN (lan) -> em0 -> x.x.x.x
WAN2 (opt1) -> em2 -> 87.x.x.x
…
...
Enter an option: 8

[2.0.1-RELEASE][admin@pfSense.localdomain]/root(1): uname -a
FreeBSD pfSense.localdomain 8.1-RELEASE-p6 FreeBSD 8.1-RELEASE-p6 #0: Mon Dec 12 18:15:35 EST 2011 root@FreeBSD_8.0_pfSense_2.0-AMD64.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 amd64
[2.0.1-RELEASE][admin@pfSense.localdomain]/root(2):

and it is running on VMware ESXi version 5.0.0, build 469512

and I can provide anything else needed…

heper

is it the pfsense that freezes or are you unable to restart/start other VMs on the same machine ? Are you able to reboot the host machine ? Is the host machine a dell R310 ?

AshleyRBlack

@heper:

is it the pfsense that freezes or are you unable to restart/start other VMs on the same machine ? Are you able to reboot the host machine ? Is the host machine a dell R310 ?

OK, all the other VMs carry on working. Only pfSense freezes.

I have not rebooted the ESXi host, as there is no need; it's running fine. So are the rest of my VMs, including a Juniper SA SSL VPN, Ubuntu, and an XP VM for emergency access.

The host is an HP ProLiant ML115 G5 with an added Intel dual GigE card.

Does that help at all? Do you need some log files or something?

heper

Is there any version of the VM tools installed?

Also, did you check this on the wiki:

Certain intel igb cards, especially multi-port cards, can very easily/quickly exhaust mbufs and cause panics, especially on amd64. The following tweaks should help:

In /boot/loader.conf.local - Add the following (or create the file if it does not exist):

kern.ipc.nmbclusters="131072"
hw.igb.num_queues=1

That will increase the amount of network memory buffers, and make the card use one queue instead of multiple queues, to reduce the strain on the system.

The same settings can also apply to em(4) cards, just use "em" in place of "igb" in the setting(s) above.

I'm not sure if this is relevant on ESXi, though.
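
For anyone who wants to script the loader.conf.local change from the wiki quote above, here is a minimal sketch. The helper name set_loader_tunable is made up for illustration and is not part of pfSense or FreeBSD; it simply rewrites a plain KEY=VALUE file so that each key appears once. The tunables and values are the ones quoted above, and whether they help under ESXi is not established here.

```shell
#!/bin/sh
# Hypothetical helper (not a pfSense/FreeBSD tool): set KEY=VALUE in a
# loader.conf-style file, replacing any earlier line for the same key.
# Usage: set_loader_tunable FILE KEY VALUE
set_loader_tunable() {
    file=$1
    key=$2
    value=$3
    touch "$file" || return 1    # create the file if it does not exist
    tmp=$(mktemp) || return 1
    # Keep every line whose key (the text before the first "=") differs from KEY.
    awk -F= -v k="$key" '$1 != k' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"         # overwrite in place, keeping file permissions
    rm -f "$tmp"
}

# On the firewall (as root) this would be called as, for example:
#   set_loader_tunable /boot/loader.conf.local kern.ipc.nmbclusters '"131072"'
#   set_loader_tunable /boot/loader.conf.local hw.igb.num_queues 1
```

Settings in loader.conf.local are only read at boot, so a reboot is needed before they take effect.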

Personally, I've only had issues on a Dell R310 with ESXi 4.1, where running a pfSense VM would put the hypervisor into a semi-unresponsive state (I could enter the console, but any action such as reboot or shutdown would fail). This was solved by updating to ESXi 5.0 and might have been related to the cheap, basic hardware RAID card inside.

cmb

I suspect this is because of a timecounter issue some people see on occasion, where the system clock stops. The console is generally still responsive in those cases, but many services stop functioning because they're time-dependent. Try running:

sysctl kern.timecounter.hardware=i8254

and see if it happens again. Once you know it's something you want applied permanently, add it under System>Advanced.

The couple of other times we've seen this, the console was responsive, and running the above immediately brought everything back to life as it fixed the system clock. Why it affects so few people I'm not sure. ESX is by far the most widely used hypervisor: our production firewalls run in ESX, as do those of many of our customers and other users in the community, and only an extremely small percentage see this.

Definitely interested in whether that fixes it for you.
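
Before switching, it can help to check which timecounters the kernel actually offers. On FreeBSD, sysctl kern.timecounter.choice lists them with a quality score in parentheses, and kern.timecounter.hardware shows the one currently in use. A small sketch follows; has_timecounter is a made-up helper name, and the sample list is only an illustration, not output from the system in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if NAME appears in a timecounter list of the
# form printed by `sysctl -n kern.timecounter.choice`,
# e.g. "TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000)".
# Usage: has_timecounter NAME "LIST"
has_timecounter() {
    name=$1
    for entry in $2; do
        # Strip the "(quality)" suffix: "i8254(0)" -> "i8254"
        if [ "${entry%%\(*}" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

# Illustrative list only; on the firewall use:
#   choices=$(sysctl -n kern.timecounter.choice)
choices="TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000)"

if has_timecounter i8254 "$choices"; then
    echo "i8254 is available"
    # sysctl kern.timecounter.hardware=i8254   # run this on the firewall itself
fi
```

The name is compared exactly, so asking for "ACPI" will not match an "ACPI-safe" entry.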

AshleyRBlack

In /boot/loader.conf.local - Add the following (or create the file if it does not exist):

kern.ipc.nmbclusters="131072"
hw.igb.num_queues=1

I have added this, so we will see what happens.

sysctl kern.timecounter.hardware=i8254

and see if it happens again. Once you know it's something you want applied permanently, add it under System>Advanced.

Well, the console is always unresponsive. Should I try this preemptively?

Thanks

biggsy

Hope cmb's suggestion works but

… running on VMware ESXi Version 5.0.0 build 469512

There's nothing in the list of bug fixes that hints it might help you, but there is a 5.0 Update 1 (build 623860).

… 2.0.1-RELEASE-pfSense (amd64) ...

Is it practical for you to build a 32-bit pfSense VM, restore your config and see if it suffers from the same problem?
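
To see which build a host is on, vmware -v in the ESXi shell prints a line such as "VMware ESXi 5.0.0 build-469512". A rough sketch for comparing that against the Update 1 build number mentioned above; esxi_build is a made-up helper name, and the sample string is assembled from the version and build quoted in this thread rather than captured from the host.

```shell
#!/bin/sh
# Hypothetical helper: print the build number from a `vmware -v` style line.
# Prints nothing if the line contains no "build-<digits>" part.
# Usage: esxi_build "VMware ESXi 5.0.0 build-469512"
esxi_build() {
    printf '%s\n' "$1" | sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Sample string (assumed format); on the host use: version_line=$(vmware -v)
version_line="VMware ESXi 5.0.0 build-469512"
update1_build=623860   # 5.0 Update 1, as given in the post above

build=$(esxi_build "$version_line")
if [ -n "$build" ] && [ "$build" -lt "$update1_build" ]; then
    echo "build $build is older than 5.0 Update 1 ($update1_build)"
fi
```

This only reports the build level; as noted above, nothing in the Update 1 fix list suggests it addresses the freeze.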

AshleyRBlack

Is it practical for you to build a 32-bit pfSense VM, restore your config and see if it suffers from the same problem?

Yes, this could be done. It would actually be quite easy, and if it all goes Pete Tong, I can just spin up the original version.

I could build it today and switch over tonight. The only problem is that I have two weeks' holiday from Friday, so monitoring becomes a problem…

EDIT: Just checked, and the lab was installed with the 32-bit version, and it experiences the same problem. (Same ESX host and version as well.)

biggsy

Also, have you seen this:

http://forums.freebsd.org/showthread.php?t=31929

AshleyRBlack

@biggsy:

Also, have you seen this:

http://forums.freebsd.org/showthread.php?t=31929

I have now… erm… it kind of leaves very few options other than just going physical.

EDIT: I set sysctl kern.timecounter.hardware=ACPI-safe as per the FreeBSD forum post and added it to my "System Tunables". We'll see what happens now.

cmb

Looks like this quote in particular from VMware in the above linked FreeBSD forum thread has the answer:

" just wanted to get in touch with you to let you know that I've reviewed the logs and information you have provided. I've sent the details on to our Engineering team - it appears other customers are experiencing this issue and a case was only opened with Engineering last week regarding this issue. The same workaround you found (manually force the guest OS to use the ACPI-safe source) appears to be working for other customers as well.

We are in the process of drafting a KB article for this issue while Engineering work on a fix."

AshleyRBlack

@cmb:

Looks like this quote in particular from VMware in the above linked FreeBSD forum thread has the answer:

" just wanted to get in touch with you to let you know that I've reviewed the logs and information you have provided. I've sent the details on to our Engineering team - it appears other customers are experiencing this issue and a case was only opened with Engineering last week regarding this issue. The same workaround you found (manually force the guest OS to use the ACPI-safe source) appears to be working for other customers as well.

We are in the process of drafting a KB article for this issue while Engineering work on a fix."

Thanks very much for all the help on here and on Twitter. So far so good: the web UI and console seem to be more responsive, and there has been nothing like a slowdown or freeze. But I guess I won't know for sure until I get back in 2 1/2 weeks.

I will add an update then and confirm that this has worked.

AshleyRBlack

After a power cut, the uptime is now:

uptime 14 days, 15:29

I think this means that the change is a fix/workaround. Before, it would fail within a week and need to be restarted.