Proxmox 5.1 and hanging pfsense



  • Hello!

    I have a problem with pfSense hanging. Everything works for a few days and then suddenly "puff":
    traffic stops and pfSense hangs.

    My environment:
    pfSense - Current Base System 2.4.3
    pfSense runs as a virtual machine on Proxmox 5.1 (4 CPU / 6GB RAM)

    System log with "boot" message:

    Apr 11 20:33:17	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 11 20:33:25	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 19 08:49:39	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 19 08:49:47	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 07:16:18	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 07:16:28	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 07:29:32	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 07:29:40	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 10:13:56	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 10:14:04	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 10:20:32	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 10:20:39	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 11:27:33	syslogd		kernel boot file is /boot/kernel/kernel
    Apr 2 11:27:40	syslogd		kernel boot file is /boot/kernel/kernel
    

    At the moment of the hang, I do not see any errors in the logs.
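
    One way to line the hangs up with the traffic graph is to count boots per day from the "kernel boot file" lines. This is only a minimal sketch: it assumes every boot logs the line twice within a few seconds (as in the excerpt above), the sample is pasted from this post, and the function name boots_per_day is made up for illustration.

```shell
#!/bin/sh
# Sample pasted from the log excerpt above (tabs replaced by spaces).
boot_log='Apr 11 20:33:17 syslogd kernel boot file is /boot/kernel/kernel
Apr 11 20:33:25 syslogd kernel boot file is /boot/kernel/kernel
Apr 19 08:49:39 syslogd kernel boot file is /boot/kernel/kernel
Apr 19 08:49:47 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 07:16:18 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 07:16:28 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 07:29:32 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 07:29:40 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 10:13:56 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 10:14:04 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 10:20:32 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 10:20:39 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 11:27:33 syslogd kernel boot file is /boot/kernel/kernel
Apr 2 11:27:40 syslogd kernel boot file is /boot/kernel/kernel'

# Count boots per day. A line less than 60 s after the previous one on the
# same day is treated as the duplicate entry of the same boot (assumption).
boots_per_day() {
    printf '%s\n' "$1" | awk '
        /kernel boot file is/ {
            split($3, t, ":")
            secs = t[1] * 3600 + t[2] * 60 + t[3]
            day = $1 " " $2
            if (day != last_day || secs - last_secs > 60) {
                if (!(day in count)) order[++k] = day
                count[day]++
            }
            last_day = day
            last_secs = secs
        }
        END { for (i = 1; i <= k; i++) print order[i] ": " count[order[i]] " boot(s)" }'
}

boots_per_day "$boot_log"
```

    For the excerpt above this reports one boot each on Apr 11 and Apr 19 and five on Apr 2, which can then be compared with what the VM was doing at those times.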

    Example traffic graph
    https://s3.amazonaws.com/uploads.hipchat.com/505334/3263799/nYVWBv38CHJUKF8/upload.png

    Do you have any idea what I should check, or any suggestions about my problem?
    What should I do to solve it?



  • I reported this to Proxmox, and they say that no bugs or problems like this have been reported.
    Where should I look for the cause of the pfSense hang?

    Maybe someone has an idea?
    I noticed one thing when reading https://doc.pfsense.org/index.php/Virtualizing_pfSense_on_Proxmox
    In my config I use the E1000 network model instead of VirtIO.
    Is this a mistake? Can it cause pfSense to hang?
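
    To see which NIC model each virtual interface uses, the VM config on the Proxmox host can be inspected. This is a rough sketch: the config excerpt, MAC addresses and the helper name nic_models are made up, and on a real host the text would come from `qm config <vmid>`.

```shell
#!/bin/sh
# Hypothetical excerpt of `qm config <vmid>` output (values are made up).
vm_config='cores: 4
memory: 6144
net0: e1000=02:00:00:00:00:01,bridge=vmbr0
net1: e1000=02:00:00:00:00:02,bridge=vmbr1'

# Print "<netN> <model>" for every virtual NIC found in the config text.
nic_models() {
    printf '%s\n' "$1" |
        sed -n 's/^\(net[0-9][0-9]*\): *\([^=,]*\)=.*/\1 \2/p'
}

nic_models "$vm_config"

# Switching one NIC to VirtIO would then look roughly like this (not run here):
#   qm set <vmid> --net0 virtio=02:00:00:00:00:01,bridge=vmbr0
```

    Keep in mind that after such a change pfSense sees the NIC under a different driver (vtnet instead of em), so the interfaces most likely have to be reassigned on the console.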



  • I've been experiencing your issue on some of our systems.  We have several Proxmox virtual host servers, and I run pfSense guests on them, but we're starting to have a lot of trouble exactly as you describe. At my office my pfSense server runs great for a few days (typically less than 3 weeks), and suddenly POOF, it's not passing packets anymore and I have to reboot it...

    I recently did some more extensive testing while setting up two new sites, each on a DL360 Gen9 server running Proxmox (KVM).  I installed pfSense (current) on both. We tried VirtIO, Intel, and Realtek NICs.  We found that Realtek was faster than Intel but not as fast as VirtIO, and even VirtIO was very SLOW: ~25 Mbps at best...

    In each test - suddenly we would lose connectivity - and we had to reboot pfsense to get going again.

    We also found that selecting "NO offload hardware checksum" did result in better speeds but we still had the loss of connectivity - often within just a few minutes under heavy testing.

    The other curious thing was that although we have 200 Mbps Internet, we typically only saw a best of 25Mbps or occasionally spikes beyond that - with the virtio driver, and often 3Mbps or 11 Mbps as a best with the other NIC drivers.

    In the end, I switched these two systems over to Untangle, and poof: right out of the box, 200 Mbps and stable since November 2017, no reboots needed yet.

    I'm bummed since I know pfSense a lot better, and I'm sure there IS a resolution - but I haven't found it yet...

    Note that I currently still have 2 sites running pfSense on Proxmox, and one of them stopped passing packets yesterday and again today. A quick reboot got it going again both times, but I can't have that! I watched it "break" today while I was downloading a bunch of large files...
    The other site where I'm running pfSense on Proxmox has lost connectivity about once a month for the past 3 months... Since I was already troubleshooting this issue, I have just been rebooting their pfSense pre-emptively about every week, but I have to get this fixed, and soon.
    I temporarily switched one site over to a small physical PC, and it's as solid and stable as I'm used to seeing from pfSense...

    One post I saw seemed to indicate that turning off the hardware checksum in the GUI did NOT actually turn it off in the system - but I doubt that since I observed a speed difference with it off - but maybe that was luck?

    Anyway - You're not alone - this seems to be an issue - but I don't have the answer for you
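
    On the doubt about whether the GUI checkbox really turns checksum offload off: one way to check is to look at the options line that ifconfig prints for the interface inside pfSense. This is a minimal sketch under assumptions: the sample output, its hex mask and the helper name active_offloads are illustrative, and the real text would come from something like `ifconfig vtnet0`.

```shell
#!/bin/sh
# Hypothetical `ifconfig vtnet0` excerpt; the hex mask is illustrative and ignored.
ifconfig_out='vtnet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=6c07bb<RXCSUM,TXCSUM,VLAN_MTU,TSO4,LRO,LINKSTATE>'

# Report which offload capabilities are still listed as active.
active_offloads() {
    caps=$(printf '%s\n' "$1" |
        sed -n 's/^[[:space:]]*options=[0-9a-f]*<\(.*\)>.*/\1/p' | head -n 1)
    for flag in RXCSUM TXCSUM TSO4 TSO6 LRO; do
        case ",$caps," in
            *",$flag,"*) echo "$flag still enabled" ;;
        esac
    done
}

active_offloads "$ifconfig_out"

# If flags remain, they could be cleared by hand for a test (not run here):
#   ifconfig vtnet0 -rxcsum -txcsum -tso -lro
```

    If nothing is printed after ticking the boxes in the GUI, the offloads are really off for that interface and the speed difference was probably not luck.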



  • FWIW I thought I should list my configs:

    Proxmox VE4.3-1/e7cdc165

    bonded Vswitch config with vlan tags for internal and external networks using built-in server NICS on DL360 g9 and also one Poweredge 720.

    running currently updated pfsense as of last week.

    Here's a copy of my interfaces file on proxmox.  All seems to work well except pfSense.  I have a dozen vm guests - Ubuntu, Windows server 2008, 2012, and Untangle.

    auto lo
    iface lo inet loopback

    iface eth0 inet manual
    iface eth1 inet manual
    iface eth2 inet manual
    iface eth3 inet manual
    iface eth4 inet manual

    auto bond0
    iface bond0 inet manual
            slaves eth1 eth2
            bond_miimon 100
            bond_mode 802.3ad
            bond_xmit_hash_policy layer2

    auto vmbr0
    iface vmbr0 inet static
            address 10.something
            netmask 255.255.255.0
            gateway 10.something else
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0

    auto vmbr1
    iface vmbr1 inet manual
            bridge_ports bond0
            bridge_stp off
            bridge_fd 0
            bridge_vlan_aware yes

    auto vmbr2



  • Thanks for the answer.

    I wrote to Proxmox, and they currently have no reported problems with pfSense.
    I also have other machines running on Proxmox, and only pfSense has problems. It looks like the problem is in pfSense itself …



  • About a week ago I upgraded a customer's Proxmox cluster from 4.4 to version 5.2, and since I was on site anyway, I also upgraded the pfSense VM from 2.3 to 2.4.3-RELEASE-p1 afterwards.
    Then the trouble began. One out of 7 pfSense interfaces suddenly stops forwarding traffic and its subnet is cut off. All virtual NICs are E1000. Some of them are plain bridged interfaces, some use the VLAN awareness of the bridge on the Proxmox layer. VLANs are not activated inside pfSense. The problematic interface is OPT1; the underlying Proxmox interface is a plain bridge (vmbr1) whose physical interface eth2 is connected directly to the switch on an untagged 1 Gbit port.

    After long hours of analysing and testing I found that I only need to disable the interface, apply, and re-enable it to get it working again. But the problem recurs after a varying amount of time.
    Unfortunately I had also set up a new Cisco switch and changed some network settings to use VLANs on the Proxmox layer, so I was chasing the problem on the switch, bonding, Proxmox bridge and pfSense layers. I concluded that the problem can only be in pfSense, because other VMs on the same Proxmox/Linux bridge can communicate, and the VLANs that run over the bond work fine inside and outside of pfSense. The interface that is having trouble is physically connected, without bonding, from the Cisco switch to Proxmox eth2 (vmbr1), without any VLAN tagging on either layer. On the Linux side it uses the bnx2 driver, but as I said, the VMs connected to vmbr1 can always talk to each other, even across different Proxmox hosts, so switch and bridge communication is OK. I also disabled the pf packet filter, to no avail.
    The funny thing is that if I ping the "dead" pfSense interface's IP (opt1_address) from another subnet, it responds. If I ping a host in the dead subnet from the pfSense GUI, I get 100% packet loss, and hosts in the dead subnet cannot ping the firewall either. I have no clue what is happening here.
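
    Until the root cause is found, the disable/re-enable workaround could be scripted as a stopgap. This is a hedged sketch only: the interface name, the probe address and the function name check_and_bounce are placeholders, and it is an assumption that a plain ifconfig down/up clears the stall the same way the GUI disable/apply/enable sequence does.

```shell
#!/bin/sh
# Placeholders: adjust to the stalled interface and a host in its subnet.
IFACE=em1
TARGET=192.0.2.10
PROBE="ping -c 3 $TARGET"   # exit status 0 means the subnet is reachable
RUN=echo                    # dry run: print commands; set RUN= to execute them

# Probe the subnet from the firewall and bounce the interface if it is dead.
check_and_bounce() {
    if $PROBE >/dev/null 2>&1; then
        echo "ok: $TARGET reachable via $IFACE"
    else
        echo "stalled: bouncing $IFACE"
        $RUN ifconfig "$IFACE" down
        $RUN ifconfig "$IFACE" up
    fi
}
```

    Called from cron every minute or so, check_and_bounce would at least shorten the outages, and the "stalled" lines give timestamps to compare against traffic peaks.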



  • Update: now the OPT2 interface (em driver) has also "frozen"; it has the same pfSense config as OPT1. On the Proxmox layer it is slightly different (vmbr2 bridge on eth4 with the e1000e driver). It seems that this always happens when there is "heavy" traffic on the interface (moving GBs of data to and from the NAS/file server). I still had my old/second pfSense VM with the same firewall config (also already updated to 2.4.3) residing on another Proxmox node with different physical hardware. I changed all 7 of its VM interfaces to VirtIO NICs, then shut down the problematic instance and started the VirtIO one.
    Since then I have had no outage for 24 hours, but the bandwidth going through the firewall seems very bad. I have to analyze further, but don't have the time at the moment...
    Could the packet filter be misbehaving and stop working or block traffic under certain conditions?



  • @bolek2000 I use Proxmox, pfSense and VirtIO and I have no problems with hanging etc.

    Probably you forgot to disable hardware offload for VirtIO, that will cause terrible performance.



  • Thanks for the suggestion, but I have had the offloading features disabled in the GUI under Advanced -> Networking from the beginning (also before the upgrade). I'm not sure what the sysctl values should look like, so I'll post them:

    net.inet.tcp.tso: 1
    hw.hn.tso_maxlen: 65535
    hw.vtnet.tso_disable: 0
    dev.vtnet.0.txq0.tso: 0
    dev.vtnet.0.tx_tso_offloaded: 0
    dev.vtnet.0.tx_tso_not_tcp: 0
    dev.vtnet.0.tx_tso_bad_ethtype: 0
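
    A small sketch for reading such a dump. The interpretations in the notes are my reading of the FreeBSD names (net.inet.tcp.tso as the stack-wide switch, hw.vtnet.tso_disable as a driver tunable, tx_tso_offloaded as a counter) and are not confirmed anywhere in this thread; the helper names are made up.

```shell
#!/bin/sh
# Values pasted from the post above.
sysctl_dump='net.inet.tcp.tso: 1
hw.hn.tso_maxlen: 65535
hw.vtnet.tso_disable: 0
dev.vtnet.0.txq0.tso: 0
dev.vtnet.0.tx_tso_offloaded: 0
dev.vtnet.0.tx_tso_not_tcp: 0
dev.vtnet.0.tx_tso_bad_ethtype: 0'

# Look up one OID in "name: value" text.
sysctl_value() {  # $1 = dump, $2 = OID
    printf '%s\n' "$1" | awk -F': ' -v oid="$2" '$1 == oid { print $2 }'
}

# Print the three most telling values with a tentative interpretation.
summarize_tso() {  # $1 = dump
    for oid in net.inet.tcp.tso hw.vtnet.tso_disable dev.vtnet.0.tx_tso_offloaded; do
        val=$(sysctl_value "$1" "$oid")
        case "$oid=$val" in
            net.inet.tcp.tso=1)             note="TCP stack is allowed to use TSO" ;;
            net.inet.tcp.tso=0)             note="TSO disabled in the TCP stack" ;;
            hw.vtnet.tso_disable=0)         note="vtnet tunable does not force TSO off" ;;
            hw.vtnet.tso_disable=1)         note="vtnet tunable forces TSO off" ;;
            dev.vtnet.0.tx_tso_offloaded=0) note="no frames counted as TSO-offloaded" ;;
            *)                              note="no interpretation in this sketch" ;;
        esac
        echo "$oid=$val  # $note"
    done
}

summarize_tso "$sysctl_dump"
```

    Read this way, the zero counters would suggest that no TSO is actually happening on vtnet0 even though the global switch is still 1, but that conclusion rests on the assumptions above.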

    Luckily, after some more testing I get a transfer rate of around 30 MB/s through the firewall when copying data from VM to VM (Samba on RAID 6 with VirtIO SCSI), compared with 60 MB/s without the firewall between machines on raw hardware with a Gbit network (without the virtualization/Samba layers). The problem reported yesterday by the customer was some other issue, I guess.

    After changing to VirtIO the network is still stable, so for me the issue is RESOLVED. So be careful if you upgrade your VM to 2.4.3 with E1000 NICs. I also noticed that CPU usage on the problematic "E1000 VM" was peaking at 90% (3 vCPUs), while the VirtIO VM now has a maximum of 30% (4 vCPUs) under comparable load (copying large amounts of data around, with the normal Internet traffic and so on in the background...).



  • There are some other posts here about this too. It seems that turning off GRO might be a solution if you can't change from E1000 to VirtIO.
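
    For anyone who wants to try that on the Proxmox host, here is a sketch that prints the lines one might add to a bridge stanza in /etc/network/interfaces. The port and bridge names (eth2, vmbr1) are only examples, the helper name is made up, and whether GRO is actually the culprit is an assumption.

```shell
#!/bin/sh
# Print interfaces-file lines that switch GRO off on a port and its bridge.
gro_off_lines() {  # $1 = physical port, $2 = bridge
    printf '        pre-up ethtool -K %s gro off\n' "$1"
    printf '        post-up ethtool -K %s gro off\n' "$2"
}

gro_off_lines eth2 vmbr1

# For a quick test without editing the file, the same setting can be applied
# to the running host (not run here):
#   ethtool -K eth2 gro off
```

    The printed lines belong inside the `iface vmbr1 ...` block, next to the bridge_ports line.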



  • @muppet actually - ...

    @muppet said in Proxmox 5.1 and hanging pfsense:

    @bolek2000 I use Proxmox, pfSense and VirtIO and I have no problems with hanging etc.

    Probably you forgot to disable hardware offload for VirtIO, that will cause terrible performance.

    Possibly you forgot to read my earlier post, where I already stated that disabling hardware offload for VirtIO helped performance a little but did NOT resolve the freezing problem...

    I am glad it's not happening to you - but it is well documented as an issue with [Probably] FreeBSD over KVM, and not specifically pfSense and proxmox.

    I believe the issue may be related to bridging or bonding multiple NICs and/or vlan tagging? Would you post your interfaces file (sanitized) to compare with mine? (posted above already)



  • root@orbit:~# lspci
    00:00.0 Host bridge: Intel Corporation Broadwell-U Host Bridge -OPI (rev 09)
    00:02.0 VGA compatible controller: Intel Corporation HD Graphics 6000 (rev 09)
    00:03.0 Audio device: Intel Corporation Broadwell-U Audio Controller (rev 09)
    00:14.0 USB controller: Intel Corporation Wildcat Point-LP USB xHCI Controller (rev 03)
    00:16.0 Communication controller: Intel Corporation Wildcat Point-LP MEI Controller #1 (rev 03)
    00:1b.0 Audio device: Intel Corporation Wildcat Point-LP High Definition Audio Controller (rev 03)
    00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #1 (rev e3)
    00:1c.1 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #2 (rev e3)
    00:1c.2 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3)
    00:1c.4 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #5 (rev e3)
    00:1d.0 USB controller: Intel Corporation Wildcat Point-LP USB EHCI Controller (rev 03)
    00:1f.0 ISA bridge: Intel Corporation Wildcat Point-LP LPC Controller (rev 03)
    00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03)
    00:1f.3 SMBus: Intel Corporation Wildcat Point-LP SMBus Controller (rev 03)
    01:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
    02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
    03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
    04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
    
    
    auto lo
    iface lo inet loopback
    
    iface enp2s0 inet manual
    #LAN Interface
    
    iface enp1s0 inet manual
    #WAN Interface
    
    iface enp3s0 inet manual
    
    iface enp4s0 inet manual
    
    auto vmbr0
    iface vmbr0 inet static
            address  192.168.0.254
            netmask  255.255.255.0
            gateway  192.168.0.1
            bridge_ports enp2s0
            bridge_stp off
            bridge_fd 0
            pre-up ip link set enp2s0 mtu 9000
            pre-up ethtool -G enp2s0 rx 1024 tx 1024
            pre-up ethtool -K enp2s0 tx off gso off
            post-up ethtool -K vmbr0 tx off gso off
    #LAN Interface Bridge
    
    auto vmbr1
    iface vmbr1 inet manual
            bridge_ports enp1s0
            bridge_stp off
            bridge_fd 0
            pre-up ip link set enp1s0 mtu 9000
            pre-up ethtool -G enp1s0 rx 1024 tx 1024
            pre-up ethtool -K enp1s0 tx off gso off
            post-up ethtool -K vmbr1 tx off gso off
    #WAN Interface Bridge
    
    root@orbit:~# modinfo igb
    filename:       /lib/modules/4.15.18-2-pve/kernel/drivers/net/ethernet/intel/igb/igb.ko
    version:        5.3.5.18
    license:        GPL
    description:    Intel(R) Gigabit Ethernet Linux Driver
    author:         Intel Corporation, <e1000-devel@lists.sourceforge.net>
    

    Hope this helps.



  • @muppet

    Yes, that gives me a number of things to try...

    You're using a couple of parameters which are different from mine - I will experiment... The other differences are that I am bonding multiple nics, and I am also using vlan_aware directive...

    Thank you!

    I'll do a little more testing and post results.



  • I was on vacation for a few days YAY... so I'm back on this finally

    Note that I am not setting tx off and gso off in the interfaces file, but I am setting them in the pfSense GUI; that was the difference between your interfaces file and mine. However, when I show my NIC settings via ethtool on the Proxmox host OS, it already reports gso as off (and 0 tx messages) for both the interface and the bridge, so I'm not very hopeful that setting it in the interfaces file will change anything. I will still put these settings in my interfaces file; they will take effect the next time I reboot the host, probably overnight in the next few days, since I don't want to reboot all my guests right now.

    Sincerely thanks for your kind responses.
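
    To compare the host-side state before and after such a change, the output of `ethtool -k <iface>` can be checked per feature. This is a minimal sketch: the sample text and the interface name are hypothetical, and feature_state is a made-up helper.

```shell
#!/bin/sh
# Hypothetical `ethtool -k eth1` excerpt (real output has many more lines).
ethtool_k='Features for eth1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: off
generic-receive-offload: on'

# Print the state ("on", "off", possibly with "[fixed]") of one feature.
feature_state() {  # $1 = ethtool -k text, $2 = feature name
    printf '%s\n' "$1" | awk -F': *' -v f="$2" '
        { name = $1; sub(/^[ \t]+/, "", name) }
        name == f { print $2; exit }'
}

for feat in tx-checksumming generic-segmentation-offload generic-receive-offload; do
    echo "$feat: $(feature_state "$ethtool_k" "$feat")"
done
```

    As far as I know, `ethtool -K <iface> tx off gso off` also takes effect immediately on the running host, so the settings could be tried without a reboot; the interfaces file only makes them persistent.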