Extremely strange network issue with virtualised pfsense
-
Hi all,
First post here, be gentle :)
I have a customer with a Dell Poweredge T110 running ESXi 4.1.0 - upon which there are 2 VMs: SBS2008 and pfsense
They were running well for ages with just the one SBS2008 VM, but the Draytek router I had was giving them bad traffic shaping for their Grandstream VoIP handsets so I added a single Intel NIC and a pfsense VM.
Pfsense was built in a VM with two E1000 NICs specified, and a second virtual switch was created (called WAN) with the Intel NIC attached to it. The LAN leg of pfsense therefore shared the inbuilt Broadcom NIC with the SBS2008 VM.
Everything worked fine for about a week, and then I had a report from the customer that 'nobody can use the network'. I thought the pfsense VM might have died at that point, but I was able to use remote desktop from my house and get into it. Since the server couldn't see the clients or the Grandstream handsets, and vice versa, then I diagnosed a broken switch and drove over to replace it.
With it replaced, everything worked again.
A couple more days passed, and I went to their site to talk to the manager on another matter. Just as I turned up, they said "glad you've arrived, we're having problems with the internet". This turned out to be sporadic lack of DNS resolution for some sites from desktop PCs (although they would work fine from the server). I therefore thought "something's wrong with pfsense" and rebooted the VM. I thought this had fixed things, but some lack of resolution and slow network performance continued. However, the customer then had a perfectly good VoIP conversation with a customer via his desk phone, through the PBX (sat in the SBS2008 box) and onto the internet.
The symptoms went away shortly afterwards, so I left site. I was rung up an hour later saying "nobody can get to the network", so by now fearing the worst I grabbed a bunch of things (switches, an intel dual NIC card, etc) and rushed back to site.
When I came back to site I found that the only person who could see anything was the manager, whose PC wasn't piggybacked off a Grandstream desk phone; everyone else was dead on the network. I plugged my laptop in to the same switch as the server, and couldn't even PING the SBS server on the LAN. Even changing switch ports and LAN cables was fruitless.
I therefore decided to insert the Intel dual NIC card and split the functions. The Intel card now does both WAN and LAN functions for pfsense, and the SBS VM is back on the inbuilt Broadcom NIC, and it appeared to work better for the rest of Friday (yesterday).
I'm therefore currently at a loss as to why I couldn't even PING the SBS server from the LAN when the trouble occurred.
My current choice of culprits are:
- that the Broadcom NIC ran out of resources running 2 VMs at once with pfsense
- that the Netgear switches have some sort of ARP (or other) issue with 2 VMs running on the same switch port which hit a DoS trigger on them and blocked traffic
- that there's a bug in VMware 4.1.0 in regard to networking of some sort that would make this happen
Does anyone else have any suggestions as to why it would affect things this badly? I don't have any issues with running pfsense virtualised on VMware at my house on an HP Microserver with dual Intel NIC, and it's been up for ages. They're both on the same version (indeed the same ISO built both), and the version details are:
2.0.1-RELEASE (i386)
built on Mon Dec 12 17:53:52 EST 2011
FreeBSD 8.1-RELEASE-p6
I can post any more info I can get, but you'll need to tell me how to get it if it's Linux (as i'm a Windows guy mainly)
Cheers,
Mike.
-
Another update….
Customer rang this morning to say that it's gone again. I therefore moved everything off the Broadcom NIC and got the customer to unplug it - still no joy, so that rules out the Broadcom.
-
…if it's Linux (as i'm a Windows guy mainly)
You didn't happen to tell ESXi that pfSense was Linux when you built the VM?
It would probably be better to run ESXi 5.1 than 4.1 now.
-
No, just checked (was going to kick myself if I had), but it's set to FreeBSD 64bit (identical settings and the identical ISO I used to build the pfsense VM on my HP Microserver, which i'm typing this behind).
The only difference my HP Microserver ESXi has is that it's running 4.1.0 build 502767 versus build 348481 on the Dell.
-
Hold on….......
Does "2.0.1 release (i386)" mean it's a 32bit ISO in a 64bit VM??
This might explain something..... (although it wouldn't explain why the HP Microserver at my home is working fine).
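For anyone wanting to check the same thing on their own VM, here is a minimal sketch of the comparison. It assumes the VM's .vmx file carries a `guestOS` line whose value ends in `-64` for 64-bit guest types (for example `freebsd-64` versus `freebsd`), and that the pfsense version line shows `(i386)` for 32-bit and `(amd64)` for 64-bit images. The function names are made up for illustration.

```python
import re


def guest_is_64bit(vmx_text):
    """Return True if the guestOS entry in the .vmx text names a 64-bit guest type.

    Assumption: 64-bit guest types end in "-64" (e.g. "freebsd-64").
    """
    match = re.search(r'^\s*guestOS\s*=\s*"([^"]+)"', vmx_text, re.MULTILINE)
    if match is None:
        raise ValueError("no guestOS entry found in .vmx text")
    return match.group(1).endswith("-64")


def image_is_64bit(version_line):
    """Return True if the pfsense version line reports an amd64 build."""
    match = re.search(r"\((i386|amd64)\)", version_line)
    if match is None:
        raise ValueError("no architecture tag found in version line")
    return match.group(1) == "amd64"


def arch_mismatch(vmx_text, version_line):
    """Return True when the VM guest type and the installed image disagree on bitness."""
    return guest_is_64bit(vmx_text) != image_is_64bit(version_line)
```

Called with the text of the .vmx file and the version line from the pfsense dashboard, `arch_mismatch` returns True for a 32-bit image sitting in a 64-bit guest type, which is the combination suspected here.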
-
348481 is 4.1U1 and the 502767 is 4.1U2
So different ESXi :)
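If you have several hosts to compare, the lookup can be sketched as below. The table holds only the two build numbers quoted in this thread; any other build comes back as "unknown" rather than being guessed. The names are made up for illustration.

```python
# Build numbers and update levels exactly as quoted in this thread.
ESXI_41_BUILDS = {
    348481: "4.1 U1",
    502767: "4.1 U2",
}


def update_level(build):
    """Map an ESXi build number to its update level, or "unknown" if not in the table."""
    return ESXI_41_BUILDS.get(build, "unknown")


def same_update_level(build_a, build_b):
    """Return True only when both builds are known and map to the same update level."""
    level_a = update_level(build_a)
    level_b = update_level(build_b)
    return level_a != "unknown" and level_a == level_b
```

For the Dell and the Microserver, `same_update_level(348481, 502767)` returns False, so the two hosts are not on the same ESXi update.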
-
…although it wouldn't explain why the HP Microserver at my home is working fine
Maybe your customer's machine is under a bit more load than yours and some situation just doesn't get handled properly.
Not sure about that 32-bit image running in a 64-bit VM thing but it's gotta be suspect.
It's probably easier to go straight to 5.1 than it is to update 4.1u1 to u2 or u3.
-
5.1 has some issues relating to FreeBSD and I would suggest 4.1U3 until 5.1U1 is released.
-
5.1 has some issues relating to FreeBSD and I would suggest 4.1U3 until 5.1U1 is released.
What sort of issues? I thought I'd checked pretty thoroughly for any reported problems before going to 5.1 about a week ago.
-
Been running my pfsense on 5.1 since the day it released and not issue one.. So what are these issues you're going on about?
-
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2032803
Can't find the rest right now…
-
Well, whichever it is i'll be doing it in a week or so's time. I've just built an identical pfsense Microserver here in the office so that we can try it out for a week, then at least I know there's likely to be a solution for my customer (despite me needing to migrate VMs across whilst I upgrade the server).
-
Are you using E1000 NIC's in both boxes?
-
"Enhanced VMXNET adapter cannot obtain a DHCP IP address with FreeBSD 8.2"
Ok not using vmxnet, nor are we using 8.2 of freebsd ;)
So that would explain why this has been a non-issue for me.. And I would never have used vmxnet2 in the first place; the article states it works with vmxnet3. But I'm using e1000: there was a benchmark done a few threads back and they sure didn't notice any real performance increase with vmxnet3, and I had some issues accessing an IPsec VPN on the outside of the pfsense box from clients inside when using them. Went back to e1000 and no issues.
-
I have used pfsense on a dell r710 before and it worked … I was not firewalling though ... just a soft router. I also have 128GB of memory and 4 CPUs, so I was not hurting for resources. Even though I had upwards of 30 VMs running at one time. props to pfsense.
-
Well, whichever it is i'll be doing it in a week or so's time.
Please post again and let us know how it goes.
-
Are you using E1000 NIC's in both boxes?
Yes, I have no need to go to VMXNET-based NICs, and one nagging doubt I had was that there may have been E1000-based code in FreeBSD which didn't translate well to the Broadcom NIC, but I must say it was incredibly weird to find that other VMs on the server were affected by the pfsense VM.
-
Well, whichever it is i'll be doing it in a week or so's time.
Please post again and let us know how it goes.
Will do. However, so far so good with the Microserver running a 32bit VM of a 32bit OS, and i'm sure my web developers will test pfsense a hell of a lot more than a training company do :)
-
Just to close off on this, I have rebuilt the 32-bit PFSense in a 32-bit VM container, and it's been stable for a week now. I think that must have been the issue. Glad to have spotted that or it would have driven me round the bend!