pfSense VM experiences massive packet loss when running off UPS/inverter
Acquired and set up a new server for work over the weekend: Xeon processor, Supermicro X10SLM-F board with integrated Intel i210 and i217 Gigabit Ethernet, 32GB ECC and an additional Supermicro AOC-12 dual-port PCI-E Intel Gigabit Ethernet adapter.
Server is currently running ESXi 5.5 U2 with latest patches and latest BIOS. Currently running a few virtual machines incl. domain controller, PABX, ESET RA appliance and pfSense 2.2.3 x64.
WAN and LAN interfaces on pfSense use the E1000 driver and are each assigned to one of the ports on the additional PCI-E Intel Gigabit Ethernet adapter; the rest of the VMs are assigned to the integrated i210 and i217 adapters using the VMXNET3 driver.
Server is connected to an APC 1500 VA UPS which in turn is fed by a 3 kVA pure sine wave inverter with 4 x 102Ah deep cycle batteries (load shedding is frequent here in South Africa).
I am facing a problem in that during load shedding, when running off the inverter, the WAN & LAN interfaces of pfSense start to experience crazy packet loss and the pfSense web interface becomes unresponsive. SSH to the pfSense VM is impossible. Yet as soon as utility power comes back online, all returns to normal. However, the pfSense VM is the only one with an issue; none of the other VMs have any problems (I can ping them, SSH into them, browse the various web interfaces, etc. just fine).
Furthermore, absolutely no errors/kernel panics/etc appear in the system logs on pfSense, interface shows 2 days of constant uptime, etc.
I thought it might be a troublesome network card, so swapped it out. Issue persists. Disabled all ASPM options in the BIOS, rebooted and yanked the power supply to the inverter. Issue still persists.
Has anyone ever experienced something similar, or have any suggestions? I'm going to try swapping pfSense to the VMXNET3 drivers this coming weekend to see if it makes any difference.
I think this is not really a pfSense-related issue, but more one related to the APC UPS and the switching between real mains power and the APC UPS.
Perhaps, but then surely everything should be affected?
I mean, during load shedding I can log in to the domain using my laptop, make a VoIP call using a SIP client to another extension, my backups to USB disk using BazaarVCB run perfectly, I can browse the file server (that is running off another APC UPS and inverter)… the ONLY thing that has an issue is the pfSense VM, everything else works perfectly?
I would do a diagnostic. Either have another machine in the same subnets with Wireshark or use pfSense's built-in packet capture and see what is happening, and check whether the errors are tx/rx errors or something else. Low-level diagnostics are hard to do but are clear as water once you do see it.
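To make the "tx/rx errors or what" check concrete, here is a rough Python sketch that compares two snapshots of interface counters, one taken on utility power and one taken while on the inverter. It assumes the snapshots are text saved from `netstat -i` on the pfSense box and that the header row uses the FreeBSD column names `Name`, `Ipkts`, `Ierrs`, `Opkts` and `Oerrs`; the sample text, MAC addresses and numbers below are made up for illustration, so adjust to whatever your system actually prints.

```python
def parse_link_counters(netstat_text):
    """Return {interface: counters} from saved `netstat -i` style text.

    Assumption: the first non-empty line is a header, and the per-link rows
    have a `<Link#N>` entry in the third column with one value per header column.
    """
    lines = [ln for ln in netstat_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    counters = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip address rows and anything that does not line up with the header.
        if len(fields) != len(header) or not fields[2].startswith("<Link#"):
            continue
        row = dict(zip(header, fields))
        counters[row["Name"]] = {
            key: int(row[key]) for key in ("Ipkts", "Ierrs", "Opkts", "Oerrs")
        }
    return counters


def counter_deltas(before, after):
    """Per-interface difference between two parsed snapshots."""
    return {
        name: {key: after[name][key] - before[name][key] for key in after[name]}
        for name in after
        if name in before
    }


# Illustrative snapshots only (not real output from the server in this thread).
ON_MAINS = """Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:00:00:00:00:01     1000     0     0      900     0     0
em1    1500 <Link#2>      00:00:00:00:00:02     2000     0     0     1800     0     0
"""
ON_INVERTER = """Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:00:00:00:00:01     1500    40     0     1300     0     0
em1    1500 <Link#2>      00:00:00:00:00:02     2600     0     0     2300     0     0
"""

deltas = counter_deltas(parse_link_counters(ON_MAINS), parse_link_counters(ON_INVERTER))
for name, delta in sorted(deltas.items()):
    print(name, delta)
```

If `Ierrs` or `Oerrs` climbs only while on the inverter, that points at the physical layer (NIC, cable, power, earthing). If the error counters stay flat while packets still go missing, the loss is more likely happening somewhere else.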
That VM has no idea whether or not it's running on line power. Either the server itself is doing something screwy when it's on UPS power, or being on UPS power triggers some kind of general network problem, or maybe causes something to go nuts and start spewing a ton of traffic that's putting the firewall VM under a ton of load.
Raiker's suggestion is good, packet capture, see what's actually happening at the time. Check other things on the system like the RRD graphs for utilization, throughput, pps, etc. at the time.
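Alongside the capture, it can help to log packet loss with timestamps so you can line it up against the exact moment the power switches over. Below is a minimal Python sketch of that idea. It assumes a Unix-style `ping` that accepts `-c` and prints a summary line containing "% packet loss" (Linux, FreeBSD and macOS do; Windows words it differently). The function names and the addresses in the usage note are hypothetical placeholders, not anything from the poster's network.

```python
import re
import subprocess
import time
from datetime import datetime

# Matches "40% packet loss" as well as "0.0% packet loss".
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss_percent(ping_output):
    """Return the loss percentage from ping's summary, or None if not found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def probe(host, count=5):
    """Send `count` echo requests to `host` and return the reported loss."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_loss_percent(result.stdout)


def format_row(timestamp, losses):
    """One log line: timestamp followed by host=loss pairs."""
    cells = []
    for host, loss in losses.items():
        shown = "n/a" if loss is None else "{:.0f}%".format(loss)
        cells.append("{}={}".format(host, shown))
    return timestamp.isoformat(timespec="seconds") + "  " + "  ".join(cells)


def monitor(hosts, interval=10, rounds=None):
    """Probe every host each `interval` seconds; run forever if rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        losses = {host: probe(host) for host in hosts}
        print(format_row(datetime.now(), losses), flush=True)
        done += 1
        time.sleep(interval)
```

Run it from a machine on the LAN with something like `monitor(["192.168.1.1", "192.168.1.10"])`, where the first address stands in for the pfSense LAN interface and the second for one of the VMs that keeps working. Comparing the two columns across a power cut shows whether the loss really is confined to the firewall VM.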
Have you tried using one of the other known working NICs? Maybe the issue is related to something physical like which slot the NIC is in. I've seen issues where all slots below a certain slot don't work correctly under certain circumstances.
Cheers for the advice everyone!
I've armed myself with a replacement switch, replacement flyleads, replacement network card, replacement UPS, updated EEPROM flash from Supermicro, Wireshark and a spare HP server (going to install ESXi on the HP tomorrow, restore a copy of the pfSense VM to the HP server, pop the Supermicro NIC in the HP and see if I can replicate the issue).
Fingers crossed I find the cause of this, will update the post over the weekend.
Lower-end APCs are notorious for putting out a really chunky square wave.
Try stringing APCs like the SmartUPS 1000 in series: by the time you get to the third one, the output 'power' is useless, as it's been mangled so badly.
Try running a small electric motor off an APC 1000, you can hear it chunking away, hating the waveform.
As you say, your upstream inverter is a nice true sine wave. That's what you want your gear running on.
It could be that your onboard NICs are behaving very differently to your PCI NICs with respect to bad power. Different rails on the power supply perhaps.
The other distinct possibility is earth potential differences while on UPS. Some floating earth difference is drifting across some of your Ethernet cables, and smashing your packets. Just a tiny leak or float on 230 V is a big deal to 5 V Ethernet. Shielded Ethernet can make the problem worse; you're better off with unshielded twisted pair (UTP).
Make sure all your gear is earthed properly.
I think an oscilloscope is going to tell you a lot more than wireshark.