Virtual pfSense under XEN - no luck with 2.2



  • Hi guys,
    I just tried to upgrade from 2.1.5 (which had been running fully stable for a number of months) to 2.2 and I am facing a few issues. I have looked here, but haven't found anything comparable to the problems I am seeing. My setup is as follows:

    XEN system using VT-d capable hardware, pfSense running as an HVM domU, all based on Gentoo Linux
    devices being passed through: 1 Realtek Ethernet card (WAN) for connecting to the ADSL modem and two Atheros WiFi cards (OPT1 and OPT2 - both currently disabled)
    Connection between dom0 and the pfSense domU is through a bridged setup (the XEN recommended way) providing a paravirtualized xn0 device to pfSense for the LAN interface which is assigned a static IP address on the local network within pfSense

    Internet connection is handled through PPTP from the Realtek (rl0) to the ADSL modem (Speedtouch) which then provides the static IP directly to the PPTP(rl0) interface. Building up the connection works, although negotiation seems to be slower and there are a few up/down cycles after bootup before it stabilizes. At the moment, however, that's the least of my worries, but I plan to dig into this later as well.

    Finally the passed through Realtek (rl0 interface) acts as the OPT3 interface handling the connection to the ADSL modem (10.0.0.138) in order to be able to get to the web interface of the Speedtouch ADSL modem.

    What I can do / what works:

    • ping works from domU to pfSense and any other domUs as well as all other (real) systems on the LAN

    • ping works to any responsive host in the Internet

    • ping works to the ADSL Modem and the ADSL modem is able to ping the endpoint on the rl0 card (10.0.0.2)

    • NTP synchronization works from the modem to pfSense

    • http(s) works from both dom0, other domUs and any PC in the local network to pfSense

    • http works to the Speedtouch modem from any PC in the LAN

    • http(s) works from any PC in the local network to Internet hosts

    • ssh works from any PC in the local network to any other host, including Internet hosts

    • ssh works from dom0 and domUs to pfSense and other domUs

    • DNS and reverse DNS work from all systems (dom0, domUs, PCs in the local network)

    What I can't do / what doesn't work:

    • http connections from the dom0 server and any other domU to any Internet host

    • http connection from the dom0 server and any other domU to the Speedtouch modem

    • rsync (outgoing) connections to any Internet server from both dom0 and other domUs

    • ssh (outgoing) from dom0 or domUs to any system on the Internet

    I have tried three different upgrade paths to rule out errors during the upgrade process and all three show the same behaviour described above:

    • Fresh install to an empty disk, then manually duplicating settings from the old 2.1.5 system (Unbound is working here as well)

    • Upgrade from 2.1.5 using the upgrade .tgz file (from the web interface)

    • Fresh install to a new disk, then importing the saved configuration settings from 2.1.5

    The old 2.1.5 version is still available as an image (this totals four images with the three on 2.2 all failing) and still works flawlessly, but with 2.2 I seem to be out of luck. To me it seems that there's some blockage going on as soon as the TCP protocol (as opposed to UDP and ICMP) is involved, but only when working from the XEN system (i.e. dom0 or domUs), not from any PC in the local network.

    I'd very much appreciate any help in diagnosing and resolving this issue. I am more than happy to provide any information requested. To do this I can easily switch between the old working 2.1.5 version and any of the new 2.2 versions by simply shutting down the running virtual machine and starting the requested other virtual machine.

    TIA Atom2





  • Derelict,
    many thanks for your quick reply. I did have a look at your link, but I am not sure whether that's related to my problem:

    First of all I am using XEN (the Open Source project) and not XEN Server (a commercial product)
    Secondly I am not having any issues with speed but rather no connections at all. In other words, my speed is zero for TCP connections like rsync, http and ssh. That is not confined to traffic between virtual machines: it affects any traffic to the Internet that originates from one of the domUs or the dom0 behind the firewall. Traffic between the domUs and between domUs and dom0 works as expected. Furthermore all works as expected with UDP.

    If I look at netstat -an on the issuing system, the connection (tried with http to port 80 and ssh to port 22, both to hosts on the Internet) just hangs in the "SYN sent" state until it times out. So it appears there's simply no answer, or the SYN never makes it to the other system. Ping to the same host, however, works.
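
    The check above can be sketched as a small helper. This is only a minimal sketch: the function name `count_syn_sent` is made up for illustration, and it assumes the Linux `netstat -an` layout where the connection state is the last column of each tcp line.

    ```shell
    # Count TCP sockets stuck in SYN_SENT.
    # Reads `netstat -an` output on stdin (Linux column layout assumed).
    count_syn_sent() {
        awk '$1 ~ /^tcp/ && $NF == "SYN_SENT" { n++ } END { print n + 0 }'
    }

    # Usage on the dom0 or a domU while a connection attempt is hanging:
    #   netstat -an | count_syn_sent
    ```

    A count that stays above zero for the duration of an outgoing http or ssh attempt matches the behaviour described above.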

    I'll nevertheless try to figure out if there are any such settings in XEN as well. Thanks again,

    Atom2



  • @johnkeates:

    On the domU, disable all tx offloading. I'm having the same issue with 2.2. Something is wrong with FreeBSD 10 and NATing domU traffic while tx is on. ICMP will work, but TCP and UDP are dead. You can confirm by tcpdumping on the bridge or on the LAN interface on pfsense and grepping for "incorrect".

    Thanks for the tip, that confirms what I thought about some broken checksum behavior observed. I don't use Xen much and haven't tried 2.2 on it yet. That's something we should get reported upstream to FreeBSD. If you have a good description of the issue, if you can open a FreeBSD PR it'd be appreciated. Or post to freebsd-net list maybe.
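
    For anyone wanting to repeat the check from the quote, here is a minimal sketch. The helper name `count_bad_checksums` is hypothetical; it relies on tcpdump's verbose output marking bad checksums as `cksum 0x.... (incorrect -> 0x....)`. The interface name xn0 is the LAN interface from this thread and will differ elsewhere.

    ```shell
    # Count packets that tcpdump flags with an incorrect checksum.
    # Reads verbose tcpdump output (-v or -vv) on stdin.
    count_bad_checksums() {
        # grep -c still prints 0 when nothing matches; "|| true" only hides
        # its non-zero exit status in that case.
        grep -c 'cksum 0x[0-9a-f]* (incorrect' || true
    }

    # Usage on the pfSense LAN interface (run as root, stops after 50 packets):
    #   tcpdump -n -vv -i xn0 -c 50 tcp | count_bad_checksums
    ```

    A non-zero count while a domU tries to open a TCP connection would point at the offloading problem described in the quote.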



  • johnkeates,
    thanks for sharing your findings. I did my own research over the last couple of days and independently found a solution before reading your post. As it was late at night and I wanted to do some further tests, I decided to withhold it to see whether the solution was only temporary. I'll explain what I found out and also try to give my reasoning for why I think it works that way:

    • I can confirm your findings that any TCP traffic from a domU or the dom0 arriving at the pfSense domU through the XEN virtual netback/netfront vif interface xn0 displays an incorrect checksum. I also figured this out through tcpdump within the pfSense domU listening on the xn0 interface.

    • It appeared to me that the firewall system (probably the pf packet filter) dropped all traffic with an incorrect checksum. This assumption of mine is something that somebody more knowledgeable from the pfSense team needs to confirm.

    • The reason why those packets arrive without a proper checksum lies within the paravirtualized XEN vif interfaces: they use shared memory between dom0 and the domUs (or between different domUs). As this is considered to be safe (there's no transfer over a possibly faulty wire), the checksum calculation functionality is advertised by the driver, but in essence is just a null function.

    • This means that a normal domU / the dom0 neither calculates the checksum when sending packets nor checks it when receiving traffic through a paravirtualized vif interface. The pfSense system / the pf packet filter, however, seemed to care about those seemingly incorrect checksum values and decided to drop the packets.

    • Given the above I came to the conclusion that changing anything at the pfSense domU level would not make any sense, as there was no option to ignore the faulty checksum on received packets and accept them despite that error. In particular, the "Disable hardware checksum offload" option under System/Advanced/Networking can stay unchecked (i.e. checksum calculation remains offloaded to the NIC). This offloading (in terms of XEN) is only relevant for traffic sent from the pfSense firewall to any domU / the dom0 through the paravirtualized vif xn0, and none of these recipients cares about the checksum of packets arriving on a paravirtualized vif interface.

    • The only issue remaining was to ensure that traffic arriving at the pfSense firewall through the xn0 interface had correct checksums. In the standard bridged setup this clearly could only be sorted in the dom0 backend vif connected to the frontend for the pfSense domU.

    • Once I arrived at that conclusion I simply disabled the tx offload function for the backend connected to the pfSense domU within the dom0 machine:

      ```
      /usr/sbin/ethtool --offload "$ifname" tx off
      ```

      After a day of testing, this seems to have sorted my connection issues: All communication is back and everything seems to work again on that front. Furthermore it is only a small change which can easily be incorporated into the setup-script for the bridged setup for the vif to the pfSense domU through parameters in the xl configuration file for the pfSense domU.
      
      Also there's no need to do this for any (other) domU; it's only required for this single pfSense backend vif in dom0.
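
      As a rough sketch of how the command could be wrapped for use from such a setup script: the function name `disable_tx_offload` and the `ETHTOOL` override are assumptions made for illustration, and how the backend interface name reaches the script depends on the XEN hotplug setup in use. Backend vifs in dom0 are normally named like vif5.0 (domain ID, then device ID); the concrete name used below is hypothetical.

      ```shell
      # Disable TX checksum offload on one backend vif in dom0.
      # ETHTOOL can be overridden (for example with "echo") for a dry run.
      ETHTOOL="${ETHTOOL:-/usr/sbin/ethtool}"

      disable_tx_offload() {
          ifname="$1"
          if [ -z "$ifname" ]; then
              echo "usage: disable_tx_offload <backend-vif>" >&2
              return 1
          fi
          # Same command as above, applied to the given backend interface.
          "$ETHTOOL" --offload "$ifname" tx off
      }

      # Usage (as root in dom0, hypothetical interface name):
      #   disable_tx_offload vif5.0
      ```

      Setting `ETHTOOL=echo` before calling the function prints the command arguments instead of changing anything, which is handy for checking that the right interface name is being passed.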
      
      **johnkeates:**
      Re-reading your suggestion I am slightly confused:

      @johnkeates:
      
      > On the domU, disable all tx offloading.
      > [snip]
      > On the guests: sudo ethtool -K eth0 tx off
      
      According to my understanding that would solve the issue for the specific domU where you have executed that command and which then tries to communicate over TCP with your pfSense firewall, but I guess any TCP communication from the dom0 (and all other domUs) to or via the firewall would still fail. Can you confirm this?
      Also you said UDP failed as well in your setup. I can't confirm this; DNS resolution has always worked for me.
      
      **cmb:**
      @cmb:
      
      > That's something we should get reported upstream to FreeBSD. If you have a good description of the issue, if you can open a FreeBSD PR it'd be appreciated. Or post to freebsd-net list maybe.
      
      Given my thought process and the conclusions I have arrived at, I am not sure whether that's really an issue with FreeBSD. I do have a few other FreeBSD systems (on 10.0; all of those without pf) and communication is flawless. I guess this is more an issue with pf (the FreeBSD packet filter) and/or probably the way pfSense handles checksum errors - though I might be wrong here.
      
      Regards Atom2
      
      P.S. In terms of systems and versions, I use Gentoo (kernel 3.17.7) for my dom0 and most of my domUs; the XEN version is 4.3.3; there are two FreeBSD 10.0 systems (providing ZFS storage through Samba 4.1) and pfSense 2.2 (FreeBSD 10.1) for the firewall.
      All of this is on a single hardware box.


  • @johnkeates:

    One thing that I found but didn't find an explanation for was someone at Citrix telling me that changing any VIF or TAP parameters (so on the dom0 side) is a really bad idea.

    The only reason I could think of that this is considered bad is that dom0, in a bridged setup, is instrumental in allowing domUs to communicate. So messing with this might be considered dangerous as, if done improperly, it results in communication breakdown.

    @johnkeates:

    To be safe I therefore only suggested changes on the netfront/domU side since that was not sold as 'bad'

    In my view that solution is suboptimal, as it means the checksum is calculated for every transfer, even from one domU to another domU not involving pfSense, which is a waste of CPU cycles.
    Also, setting this only within the domUs would still leave the dom0 unable to communicate with pfSense (and subsequently the outside world behind the firewall). If you require full dom0 communication I guess there's no way to avoid following the "really bad idea" of changing the backend, which then is sufficient for all domUs.



  • @johnkeates:

    Anyway, for now, disabling tx/rx offloading on pfSense's VIF/TAP is pretty much 'the fix'.

    In my setup disabling TX offloading alone was sufficient and also consistent with my reasoning: Only packets sent from/via dom0 to pfSense (i.e. the TX side) need to have a correct checksum. RX offloading - whatever that does - is only relevant for the dom0/domU receiving packets from pfSense, and that has never been an issue because the checksum of any packet received on the vif interface is ignored anyway.
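
    To see which of the two directions is actually switched off on a backend vif, the output of `ethtool -k` can be inspected. A minimal sketch, assuming the usual `feature: state` lines that ethtool prints; the helper name `offload_state` and the interface name vif5.0 are made up for illustration.

    ```shell
    # Print the state ("on" or "off") of one offload feature.
    # Reads `ethtool -k <iface>` output on stdin; $1 is the feature name.
    offload_state() {
        feature="$1"
        awk -v f="$feature:" '$1 == f { print $2; exit }'
    }

    # Usage in dom0 (hypothetical interface name):
    #   ethtool -k vif5.0 | offload_state tx-checksumming
    #   ethtool -k vif5.0 | offload_state rx-checksumming
    ```

    With the change described here one would expect tx-checksumming to report off on the pfSense backend vif while rx-checksumming is left untouched.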

    @johnkeates:

    Further investigations regarding pf or any other part after the interface on the pfSense domU might be useful to determine the source of the dropped packets and if it's configurable to stop dropping them.

    Unfortunately I have no idea what's going on inside pf or pfSense, so that's for somebody else to comment …

    Regards Atom2