XenServer 7 and pfSense 2.3 - Packet loss only under heavy load



  • We have spent countless hours on this, and for now we have moved away from pfSense on XenServer until we can get this resolved.

    Scenario:

    • pfSense 2.3 (latest) deployed on a XenServer 7 pool. Fresh install, not an upgrade.
    • Xen Utils installed per a forum post (reported as the tools from 6.2).
    • TX offloading disabled at the XenServer level, per several threads on this forum explaining why this needs to be done.
    • Single interface provided to pfSense, using VLAN trunking into the pfSense guest (we have more VLANs than the maximum number of adapters allows).
    • 8 vCPUs, with the highest priority assigned to the VM.
    • MBUFs increased to 1 million.
    • All XenServer 7 physical servers connected by a converged 10GbE LACP backbone. Standard 1500 MTU.
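    For reference, the offload and MBUF steps above are usually applied like this; a hedged sketch only, where the VIF UUID is a placeholder you would look up with `xe vif-list`:

    ```shell
    # On the XenServer host: disable TX checksum offload on the pfSense VIF.
    # FreeBSD's netfront driver is known to misbehave with it enabled.
    # <vif-uuid> is a placeholder; find it with `xe vif-list vm-name-label=<name>`.
    xe vif-param-set uuid=<vif-uuid> other-config:ethtool-tx=off

    # On pfSense (FreeBSD 10.3): raise the mbuf cluster limit to 1 million,
    # either via System > Advanced > System Tunables or a loader file.
    echo 'kern.ipc.nmbclusters="1000000"' >> /boot/loader.conf.local
    ```

    The VM needs to be restarted (and pfSense rebooted) for both changes to take effect.
    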

    All of this works perfectly… that is, until we start passing large amounts of traffic. At that point, pfSense starts dropping packets on all interfaces, causing the whole network to become unstable. As soon as the traffic is backed off, everything returns to normal.

    Steps to reproduce the problem: we connect to a Windows 2012 R2 server on one subnet and start copying a large multi-gigabyte test file to a server in a different subnet, with pings running between several servers in different subnets to watch for the problem. Big hint: the problem does not start immediately; the traffic has to be sustained for about a minute before packets start dropping. Additionally, packets continue to drop for a few seconds after the load is stopped before things return to normal. The copy runs at a sustained ~70 MB/s (560 Mbps) until the packet loss starts about a minute in; the copy then suffers as packets are lost until it's either cancelled or ultimately completes.
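    A more controllable way to generate the same sustained cross-subnet load would be iperf3 with a ping running in parallel; a sketch, where server-b and server-c are placeholder hosts on different subnets:

    ```shell
    # On server-b: start a listener.
    iperf3 -s

    # On server-a: ~3 minutes of sustained traffic through the firewall,
    # long enough to cross the ~1 minute threshold described above.
    iperf3 -c server-b -t 180

    # In parallel, from a third machine: watch for gaps in icmp_seq
    # numbers or timeouts toward a host on yet another subnet.
    ping server-c
    ```

    This takes the Windows SMB stack out of the picture and makes the onset of loss easier to time.
    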

    The pfSense dashboard reports CPU utilization at 10% to 20% during the copy, so the pfSense VM does not appear to be under stress.

    As of now we have our virtualized VMs powered off and are running on our appliance solution again until we can resolve this. We can move between the two solutions (physical vs. virtualized) fairly easily, but only during off-peak hours, for additional testing.

    I cannot find anyone who has reported this problem. Then again, it's possible no one is trying to run pfSense on XenServer (especially v7) with very high sustained traffic loads. I am unsure whether this is specific to our environment or a larger issue no one has discovered yet. We would sure like to get to the bottom of it, as we want to move away from our standard appliances when we can.

    Phil



  • Try testing with emulated (HVM) network interfaces; it might be related to the paravirtualized drivers. Another option would be direct NIC hardware passthrough.
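    If going the passthrough route, XenServer exposes it through the VM's other-config key; a hedged sketch, where the VM UUID and PCI address are placeholders for your own values:

    ```shell
    # Pass a physical NIC through to the pfSense VM by its PCI address
    # (domain:bus:device.function). The VM must be powered off when this
    # is set, and the device becomes unavailable to dom0.
    # <vm-uuid> and 0000:04:00.0 are placeholders.
    xe vm-param-set uuid=<vm-uuid> other-config:pci=0/0000:04:00.0
    ```

    This bypasses the Xen network datapath entirely, which makes it a useful test for isolating whether the loss is in the virtual networking layer.
    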



  • I can confirm that pfSense has serious problems under XenServer 7.

    We had a LAN party some weeks ago and pfSense was used as the gateway for about 280 machines.
    We first ran it in PVHVM mode, which worked quite well, but it has a huge problem/bug: with the Xen PV devices, you can't enable limiters.
    So we used a config tweak to fall back to emulated Realtek NICs.
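    For anyone looking for that tweak: one way to force pfSense 2.3 (FreeBSD 10.3) off the PV NICs is a loader tunable on the guest side; a hedged sketch, assuming shell access to the pfSense VM:

    ```shell
    # Disable the paravirtualized network frontend so the emulated
    # (Realtek re(4)) NICs provided by QEMU appear instead of xn(4).
    # A reboot is required, and interfaces must then be reassigned
    # (xn0 -> re0, etc.) from the pfSense console.
    cat >> /boot/loader.conf.local <<'EOF'
    hw.xen.disable_pv_nics="1"
    EOF
    ```
    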

    After doing so, the VM load went to about 6 and the VM became unusably slow.

    We installed it on bare metal, which we luckily managed to organize, and it ran like a charm: a load of 0.1 - 0.2 on a dual quad-core Xeon with 48 GB DDR3.
    RAM was barely used and neither were the CPU cores, but I do wonder why it ran so smoothly, without any lag, on that machine compared to the barely responsive VM on XenServer.

    We needed more pfSense VMs as gateways/VPN tunnels and they also ran totally smoothly.

    Regards

    • Nagilum


  • @Nagilum:

    I can confirm that pfSense has serious problems under XenServer 7.

    We had a LAN party some weeks ago and pfSense was used as the gateway for about 280 machines.
    We first ran it in PVHVM mode, which worked quite well, but it has a huge problem/bug: with the Xen PV devices, you can't enable limiters.
    So we used a config tweak to fall back to emulated Realtek NICs.

    After doing so, the VM load went to about 6 and the VM became unusably slow.

    We installed it on bare metal, which we luckily managed to organize, and it ran like a charm: a load of 0.1 - 0.2 on a dual quad-core Xeon with 48 GB DDR3.
    RAM was barely used and neither were the CPU cores, but I do wonder why it ran so smoothly, without any lag, on that machine compared to the barely responsive VM on XenServer.

    We needed more pfSense VMs as gateways/VPN tunnels and they also ran totally smoothly.

    Regards

    • Nagilum

    So it does work with PV, but not without.



  • That's our experience from our last LAN, and it caused us a lot of trouble, because you only have about two days and every hour that stuff is not working hurts… :/
    That issue cost us several hours.

    But for whatever reason, the PV devices are not shown as limiter-capable, and we needed traffic shaping.
    For performance reasons, we didn't want to use the captive portal, which can also do limiting.



  • @Nagilum:

    That's our experience from our last LAN, and it caused us a lot of trouble, because you only have about two days and every hour that stuff is not working hurts… :/
    That issue cost us several hours.

    But for whatever reason, the PV devices are not shown as limiter-capable, and we needed traffic shaping.
    For performance reasons, we didn't want to use the captive portal, which can also do limiting.

    They are limiter-capable here.