2.3.1-p1 Unstable on Hyper-V (packet loss)

pciccone

We run a very stable 2.2.6 release on Hyper-V. When we upgraded to 2.3.0 it would freeze hard periodically. We understand that was fixed in 2.3.1. Great! We upgraded to 2.3.1-p1 but now see extreme packet loss that comes in spurts on all interfaces (it comes, then goes away for a few minutes). The loss is significant enough to break RDP or SSH sessions across the firewall. We can, however, keep a clean ping running from the WAN to the pfSense appliance itself. We quickly had to roll back to the 2.2.6 snapshot, as the firewall was in production use.

I am opening this forum post in case others have this problem; maybe we can find a pattern or cause.

Our pfSense environment uses a lot of features. We do VLAN pass-through/trunking via a single NIC on Hyper-V, and we use IPsec, OpenVPN and Snort.
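
To compare notes on the "comes in spurts" pattern, it may help to measure the bursts rather than describe them. The following is a minimal sketch, assuming Linux-style ping output where each reply line contains "bytes from" and "icmp_seq=N"; other ping implementations format their lines differently. The function names and the sample output (including the 192.0.2.10 address) are illustrative only and do not come from anyone's capture in this thread.

```python
import re

# Match reply lines only, so error lines that also carry icmp_seq are ignored.
REPLY_RE = re.compile(r"bytes from .*icmp_seq=(\d+)")


def received_sequences(ping_output):
    """Return the icmp_seq numbers of all replies found in ping output text."""
    return [int(m.group(1)) for m in REPLY_RE.finditer(ping_output)]


def loss_bursts(received, min_length=2):
    """Return (first_missing, last_missing) for each run of consecutive
    missing sequence numbers that is at least min_length long."""
    bursts = []
    for prev, cur in zip(received, received[1:]):
        missing = cur - prev - 1
        if missing >= min_length:
            bursts.append((prev + 1, cur - 1))
    return bursts


# Hypothetical sample: replies 3-6 and 9 never arrived.
ping_sample = """\
64 bytes from 192.0.2.10: icmp_seq=1 ttl=64 time=0.41 ms
64 bytes from 192.0.2.10: icmp_seq=2 ttl=64 time=0.39 ms
64 bytes from 192.0.2.10: icmp_seq=7 ttl=64 time=0.52 ms
64 bytes from 192.0.2.10: icmp_seq=8 ttl=64 time=0.40 ms
64 bytes from 192.0.2.10: icmp_seq=10 ttl=64 time=0.47 ms
"""

if __name__ == "__main__":
    print(loss_bursts(received_sequences(ping_sample)))
```

Running a long ping through the firewall (not just to it) on both 2.2.6 and 2.3.x and feeding the output to loss_bursts would give burst positions and lengths that can be compared between posters.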

InAr

We switched from pfSense 2.2.6 to 2.3.1-p1 on Hyper-V, hoping the apinger/dpinger change would solve some problems we had when switching gateways in groups.

With 2.3.1-p1 we are seeing the same extreme packet loss now.

Our pfSense environment is quite simple: 2 Gateway groups, 1 OpenVPN connection.
It has 2 dedicated NICs (1 for each WAN connection) and 1 NIC shared with other VMs for LAN.
1 gateway group for OpenVPN traffic, the other gateway group with inverted tiers for internet access.

It looks like if I put some traffic on one of the WAN connections, both tend to get a lot of packet loss, and remote desktop access via the OpenVPN connection (and the other connection too) gets very laggy.

I'll roll back to 2.2.6 today and check if this fixes the problem. Maybe I just didn't notice it under 2.2.6.

pciccone

Do let me know if you become stable on your 2.2.6 rollback. I have a feeling things will go back to normal. Also check (at least for testing) that VMQ is disabled on Hyper-V for all vNICs on the pfSense VM. With hardware from some manufacturers, VMQ will cause packet loss. This was a known issue for us on certain DELL servers that use Broadcom (I think?) chipsets.

InAr

We have/had VMQ disabled on all interfaces with 2.2.6 and with 2.3.1-p1.

With 2.3.1-p1 the packet loss started every morning when users began to log in to the MS remote desktop server via the OpenVPN connection, or whenever a bit more traffic occurred.

Now, back on 2.2.6, everything seems to work fine.
This morning the connection remained stable, without any packet loss, during the usual RDP login time.
And putting some heavy load on the WAN connections isn't causing any packet loss.

pciccone

I just found another case of this yesterday where I had to revert the upgrade. Completely different network, WAN, building, server, etc. Exactly the same behavior we are describing. Reverting to 2.2.6 again resolved the problem.

Please let this forum post serve as a warning to Hyper-V users. Do not upgrade to 2.3.x until this serious issue can be diagnosed and resolved. Stay on 2.2.6, which appears to be extremely stable on Hyper-V.

Phil

JasonJoel

Well, I have been on 2.3 since its release on Hyper-V 2012 R2. And everything has worked perfectly.

So the issue certainly is not universal. It could depend on the packages installed and the VM configuration, I suppose.

pciccone

May I ask, is your traffic substantial? We did not notice it at our first upgrade location as traffic was casual. We just had some drops but no one noticed until we ramped up traffic.

Phil

JasonJoel

Substantial is all relative, of course.

I would call mine not substantial though. The link is 300 Mbit down, 20 Mbit up.

I regularly do 250 Mbit down sustained, but only for short times (10-20 minutes), and my total number of simultaneous users is low (50 maybe).

The pfSense box is also doing inter-VLAN routing, but again, only ~50 nodes.

cmb

Those who are having issues, what Windows version?

It's certainly not a universal problem with Hyper-V, but from the sounds of it there must be something to it in some edge case.

pciccone

Both of my cases are Hyper-V on Windows 2012 R2. They are both managed under System Center 2012 (SCVMM). They both use DELL hardware. One is using NIC trunking, but the other is not. Both have IPsec tunnels. One of my locations is a branch office; I can clone the 2.2.6 VM and upgrade the clone to do parallel testing if you want to look at this further. The other unit is in a data center handling very critical traffic. But if we find the cause on one, then no doubt the fix will apply to us globally.

Phil C

InAr

My case is Hyper-V on Windows 2012 R2 (Datacenter), using HP hardware (ProLiant ML350 G6).

1xNIC "HP NC382T PCIe DP" (2 Ports - 1.Port NIC Team#1 Hyper-V Host, 2.Port NIC Team#2 Hyper-V VMs)
1xNIC "HP NC326i PCIe Dual Port" (2 Ports - 1.Port NIC Team#1 Hyper-V Host, 2.Port NIC Team#2 Hyper-V VMs)
1xNIC "Intel(R) PRO/1000 PT" (2 Ports - 1. Port = WAN1, 2.Port = OPT1)

The pfSense VM uses Team#2 for its LAN interface, Intel Port 1 for WAN1, Intel Port 2 for OPT1.

VMQ is disabled on all VMs/interfaces.

tsolp2001

Same problem here after upgrading to 2.3.1.

Running Server 2012 (not R2) with 3 network cards.

Watching video streams is a mess: playback keeps getting interrupted, and remote sessions break too.

Updating to 2.3.1-p5 made no change.

tsolp2001

No movement here. Tried some dev releases, no change so far.

Is there a way to get back to 2.2.6?
I didn't find the download. I have a 2.2.4 image; can it be upgraded to 2.2.6 and not to the latest release?
Can I restore a 2.3.1 backup to 2.2.6?

Thx for your support

headhunter_unit23

I had the same issues with pci-passthrough on esxi 5.1 and a DUAL NIC Intel PCI-E card (82575EB); awful latency and packet loss.

I removed the pci-passthrough, added the NICs to a virtual switch and used virtual nics instead and everything is back to normal.

Had the same issue with Hyper-V Server 2012 R2 on a Supermicro with 2x 10GB onboard NICs and thought it was a port negotiation problem. Switched to virtual NICs and the problem was gone.

But it might not be related to pci-passthrough for all of you.

Are you guys using pci-passthrough?

kapara

@tsolp2001:

No movement here. Tried some dev releases, no change so far.

Is there a way to get back to 2.2.6?
I didn't find the download. I have a 2.2.4 image; can it be upgraded to 2.2.6 and not to the latest release?
Can I restore a 2.3.1 backup to 2.2.6?

Thx for your support

You can update to or reinstall 2.2.6 and restore your config. I ran into this problem when I tried to upgrade from 2.2.2 to 2.2.6 and could not find the update, as 2.3.1 was the only one available. So I updated to 2.3.1 and the firewall would not even boot. They must have made some major changes, as I used to always be able to upgrade versions. I also do not think they tested in Hyper-V to check compatibility.

Luckily I took a snapshot before upgrading, so I was able to restore.

2.2.6 update: https://atxfiles.pfsense.org/mirror/updates/old/pfSense-Full-Update-2.2.6-RELEASE-amd64.tgz

2.2.6 full: https://portal.pfsense.org/firmware/2.2.6/

cmb

@kapara:

I also do not think they tested in Hyper-V to check compatibility.

Not true or even close to it. We fully verified Hyper-V and Azure. Microsoft themselves even tested 2.3 as well to approve it for Azure.

If it didn't boot, it's probably because of the drive type change from old versions that made the fstab invalid, so it needed updating.

kapara

@CMB

Sorry, I may not have been clear. I meant that the upgrade process may not have been tested. If it was, is there any documentation that explains what needs to be done when upgrading from 2.2.x to 2.3 in Hyper-V so that you do not get the mount error?

cmb

https://doc.pfsense.org/index.php/Upgrade_Guide#Disk_Driver_Changes

You should be fine just running ufslabels.sh prior to the upgrade. Otherwise, manually specify the appropriate drive at the mountroot prompt, e.g. ufs:/dev/da0s1a, replacing da0 as needed.
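
As a pre-upgrade check, the fstab problem described above can be spotted by looking for entries that name a disk node directly instead of a label. This is a minimal sketch, not part of pfSense: the function name is made up, the sample fstab contents (including the ad0 device name and the pfsense0/swap0 labels) are hypothetical, and it only reads text, it does not modify anything. The assumption is that entries under /dev/ufs/, /dev/label/ or /dev/gpt/ survive a disk driver change, while raw device names may not.

```python
# Prefixes assumed to be label-based and therefore independent of the disk driver.
LABEL_PREFIXES = ("/dev/ufs/", "/dev/label/", "/dev/gpt/")


def raw_device_entries(fstab_text):
    """Return the device fields in fstab text that point at a raw disk node."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        device = line.split()[0]
        if device.startswith("/dev/") and not device.startswith(LABEL_PREFIXES):
            flagged.append(device)
    return flagged


# Hypothetical fstab from an older install that uses raw device names.
fstab_old = """\
# Device        Mountpoint  FStype  Options  Dump  Pass#
/dev/ad0s1a     /           ufs     rw       1     1
/dev/ad0s1b     none        swap    sw       0     0
"""

# Hypothetical fstab that already uses labels.
fstab_labeled = """\
/dev/ufs/pfsense0   /     ufs   rw  1  1
/dev/label/swap0    none  swap  sw  0  0
"""

if __name__ == "__main__":
    print(raw_device_entries(fstab_old))
    print(raw_device_entries(fstab_labeled))
```

If the first call returns anything for your own /etc/fstab, that suggests running ufslabels.sh before upgrading, as described in the Upgrade Guide linked above.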

kapara

Thank you! ;D

Enrica_CH

I have the same situation with pfSense 2.3.2 on KVM (Proxmox PVE) with virtio NIC drivers. I use two WANs with routing groups. Both have significant packet loss. One of these WAN interfaces sometimes switches to offline and stays in that state. I have a second pfSense on an APU board with CARP that has the same issue.

I use the following services:

Dual WAN with three routing groups
OpenVPN
CARP
Captive Portal
Free Radius
Watch Dog

I did some investigation and found the following behaviours that differ from 2.2.6:

In syslog I find "check_reload_status" with "reloading filter". This interrupts the traffic and causes packet loss. This reload is absolutely unnecessary.
Every few minutes there is a process "xinetd" with "readjusting service 6969-udp", even if TFTP-Proxy isn't activated. This service doesn't stop.
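
To check whether the filter reloads line up with the loss, the log can be tallied per minute and compared with the times sessions dropped. This is a minimal sketch, assuming BSD-style syslog lines of the form "Mon DD HH:MM:SS [host] process: message"; the exact line format on a given install may differ, and the sample lines, the hostname "fw" and the PID are hypothetical.

```python
import re
from collections import Counter

# Capture "Mon DD HH:MM" so that events are grouped per minute.
# The hostname field is treated as optional.
RELOAD_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+(?:\S+\s+)?"
    r"check_reload_status:.*reloading filter",
    re.IGNORECASE,
)


def reloads_per_minute(log_text):
    """Count check_reload_status filter reloads per minute of log time."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RELOAD_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt.
log_sample = """\
Aug 12 10:15:01 fw check_reload_status: Reloading filter
Aug 12 10:15:40 fw check_reload_status: Reloading filter
Aug 12 10:16:05 fw xinetd[1234]: readjusting service 6969-udp
Aug 12 10:19:12 fw check_reload_status: Reloading filter
"""

if __name__ == "__main__":
    for minute, count in sorted(reloads_per_minute(log_sample).items()):
        print(minute, count)
```

If the minutes with reloads match the minutes with loss, that supports the reload theory; if loss also occurs in minutes with no reload, something else is involved.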

I tried switching off "Flush all states when a gateway goes down" to avoid states being killed when an interface is briefly marked offline. But if the interface doesn't come up again, users are cut off from internet access, because the switch from tier 1 to tier 2 happens but the existing states aren't killed.

So it's really unusable, and I also have to go back to 2.2.6 for the moment. But how can we find out whether 2.3.x will be OK?