Encrypted traffic randomly fails after upgrading from 2.4.3_p1 to 2.4.4



  • Hello,

    I have an issue, and googling around made me believe I'm the only one affected...

    The setup: The pfSense box is there to share the internet and serve VLANs in a multi-office building. We have about 15 tenants, each on their own VLAN, with the default VLAN (pfSense's LAN) being the guest network.

    2.4.3_p1: Everything works well, and the config has been stable for the last 2 years. No extra modules, except some I never used plus the OpenVPN exporter. Your basic router, plus an unusual number of identical VLANs on it.

    Upgraded to 2.4.4 as soon as it was available, just because I love all the good work put in by the dev team (seriously, amazing piece of software guys!).

    2.4.4: HTTPS sites randomly fail. The most affected sites are Google's, as they are all encrypted and often accessed.
    When it works, it's noticeably slower than usual (a couple seconds instead of instantaneous-ish).
    Chrome & Firefox report SSL_INTERFERENCE and CONNECTION_RESET after ~30 sec of trying various things.
    Disabling TLSv1.3 in Chrome did not help.
    Affected sites work half the time, blocked half the time. Intermittent and not constant.
    Slack and other messaging apps are also affected.
    Incoming services are affected too: OpenVPN and Windows Server SSL VPN connections drop very quickly, and hosted HTTPS sites drop.
    Packet capture reveals a lot of connection resets, forcing the encryption handshake to renegotiate [I believe that's what I saw, I'm no expert here].
    Connecting straight into the modem works perfectly.
    All the VLANs are affected.
    One of the VLANs has its own DNS server (Windows domain, I think it's a resolver), and it is affected like the rest.
    This makes the internet present, but extremely unreliable. Just unreliable enough to make it a huge pain. A bit more, and people would have gone home and not bothered staying at the office.

    Did a fresh install of 2.4.4: basic WAN/LAN config works well.
    Restored the config, and it all goes to hell again.

    Did a fresh install of 2.4.3 (by the way, I know Netgate has a policy of not hosting old versions, but I had to download it from some shady website... There should be a provision for regression testing, imho).
    All is good, internet works.
    Restored the config saved from 2.4.4, and everything works gracefully.
    Upgraded to 2.4.4 again, and the problematic behaviour came back.

    Reinstalled 2.4.3, restored config, all is good, called it a night, went here to post this ;)

    Soooooo.... I went through all the configs, trying to catch something unusual. I tried to set all the default values I could. Nothing seemed to work.

    I have a packet capture and the config.xml, if someone's willing to take a look.
    I'm just completely puzzled, pretty sure it's a real bug, unsure if I did something wrong.

    Thanks!


  • Netgate Administrator

    So you only see this with https websites?

    You don't see packet loss otherwise?

    You don't see dnslookup errors?

    What sort of WAN connection do you have?
    One possible issue that changed in 2.4.4 is that if you have any DHCP option set on the WAN (if it uses DHCP), it will now correctly obey whatever MTU the DHCP server sends it. Previously it ignored that, which was wrong, but it seems some providers/devices are sending bogus small values, ~500, which are now causing a problem.

    The other thing to check is the default gateway. The new default gateway group setup can get confused in 2.4.4.

    Both those are patched in 2.4.5 snapshots.

    Steve



  • Hi Stephen,

    ==> So you only see this with https websites?
    I noticed it only with HTTPS traffic, but did not seriously test unencrypted traffic. It might be both. These days, most things are encrypted...

    ==>You don't see packet loss otherwise?
    I see lots of TCP Retransmission, RST, Duplicate ACK.

    ==>You don't see dnslookup errors?
    Again, maybe I just sampled in a sweet spot, but NSLOOKUP looked fine.

    ==>What sort of WAN connection do you have?
    Cable modem, DHCP autoconfigured.

    =====
    One possible issue that changed in 2.4.4 is that if you have any DHCP option set on the WAN (if it uses DHCP), it will now correctly obey whatever MTU the DHCP server sends it. Previously it ignored that, which was wrong, but it seems some providers/devices are sending bogus small values, ~500, which are now causing a problem.

    I must admit I played a lot with those settings a while ago. Checked back today: everything is at its default (including MTU) except:

    • Advanced configuration: ticked
    • Reject leases from: 192.168.100.1
    • Protocol timing:
      • Timeout: 60
      • Retry: 15
      • Select timeout: 0
      • Reboot: [blank]
      • Backoff cutoff: [blank]
      • Initial interval: 1
    • Presets: pfSense Default

    ==>The other thing to check is the default gateway. The new default gateway group setup can get confused in 2.4.4.
    I have only one gateway configured, and no gateway groups.
    I noticed the "Default Gateway" mark is unticked.

    I also noticed, in Interfaces->Assignments->PPP, I have two left-over PPPoE connections configured from previous internet providers. They are not in use, but they show up there.

    ==>Both those are patched in 2.4.5 snapshots.
    Thank you devs! I will give it a spin (when I have time to).


  • Netgate Administrator

    I would definitely check the actual interface MTU. If you have any options set there, it removes the default dhclient settings and adds the custom values, and that removes supersede interface-mtu 0, which otherwise prevents the bad MTU value from being applied.
    https://redmine.pfsense.org/issues/8507
    You can add that back in the option modifiers field in 2.4.4 if you need any other advanced options.
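
    As a rough illustration of checking the actual interface MTU, here is a minimal Python sketch that pulls MTU values out of ifconfig-style text and flags anything below an expected value. The sample text, the interface names (em0, em1) and the helper names are made up for the example; it assumes the FreeBSD-style header line where "mtu N" follows "flags=", and real output would need to be pasted in or captured on the firewall.

```python
import re

# Matches the header line of an ifconfig interface block, e.g.
# "em0: flags=8843<UP,...> metric 0 mtu 1500" (format assumed, see lead-in).
_IFACE_RE = re.compile(r"^(\S+): flags=.*\bmtu (\d+)", re.MULTILINE)


def interface_mtus(ifconfig_output):
    """Return {interface_name: mtu} parsed from ifconfig-style text."""
    return {name: int(mtu) for name, mtu in _IFACE_RE.findall(ifconfig_output)}


def suspicious_mtus(ifconfig_output, expected=1500):
    """Return the interfaces whose MTU is below the expected value."""
    mtus = interface_mtus(ifconfig_output)
    return {name: mtu for name, mtu in mtus.items() if mtu < expected}


# Hypothetical sample: em0 stands in for a WAN that accepted a bogus small MTU.
SAMPLE = """\
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 576
\tether 00:00:00:00:00:01
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tether 00:00:00:00:00:02
"""

if __name__ == "__main__":
    for name, mtu in suspicious_mtus(SAMPLE).items():
        print(f"{name}: MTU {mtu} is below the expected 1500")
```

    The default of 1500 is only the usual Ethernet value; a WAN that legitimately runs a smaller MTU (PPPoE, for example) would need a different threshold.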

    I would definitely set a default gateway if none is set there.

    Steve



  • @Jay2 we're experiencing a very similar issue on two devices running 2.3.5, and have been struggling with it for about a year. We host a single public-facing web server, so we have a pretty simple NAT/ACL config. Our problem, like yours, is not only intermittent but even more wonky. With two users connected to the same LAN and accessing the SSL site via the firewall, one will get it and the other won't. Likewise, it may work fine from the internal network (via NAT reflection) but fail on a mobile device. In every case, packet capture at the web server shows successful two-way communication, but also failed TLS handshakes and spurious retransmissions from the clients. The same packet capture also contains successful TLS handshakes and HTTPS sessions.

    We've made physical changes, and even redeployed the server in an attempt to rule out the pfSense firewall, but nothing changes this fact: when a TLS handshake fails, concurrent packet captures on the web server show the SYN/SYN-ACK/ACK succeeding, then the 129-byte "change cipher spec, encrypted handshake message" packet leaving the web server, but never arriving at the firewall. We've even mirrored a switch port to confirm that the message "hits the wire", but the capture interface on the pfSense box does not receive (or discards) the message.

    It's possible, with NAT configured, that the packet reaches pfSense and doesn't traverse interfaces, but it's hard to say since you can't run concurrent captures on both the LAN and WAN interfaces.

    I see other users with similar SSL handshake problems, but they all seem to be in the context of user or point to point VPNs, or even accessing the GUI itself.

    We need to fix this, but we're more in favor of a Cisco ASA than a $1k Netgate support contract for this issue.



  • Hello dentarthurdent,

    I'm unsure if our issues are the same.
    I had mostly "dropped" encrypted traffic, resulting in duplicate resends and timeouts... not much in the way of "change cipher" messages.
    Also, we went through 2.3.5 without issues; my problem only manifested on 2.4.4.

    I have not tried the suggestions from Stephen yet, as I'm waiting for 2.4.5 to be officially released, just in case I actually hit a real bug that is patched in 2.4.5 (as was previously suggested).

    As your problem seems linked to a specific server offering a specific service (mine was with everybody using any internet at all!), I would suggest moving the server behind some different router (like a cheap home D-Link) just to prove the thing works on its own, and then capture both sides of both setups (D-Link vs pfSense) and come back with detailed logs (on your own thread, please).

    I understand your issue is translating into personal pain, as is always the case. Try to have fun along the way!


  • Netgate Administrator

    Almost certainly not the same issue.

    But small packets passing (initial handshake) while big packets fail (TLS exchange) does sound like it could be MTU.
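
    To put rough numbers on that, here is a small Python sketch of the arithmetic. It assumes plain IPv4 and TCP headers of 20 bytes each with no options, and the function names are made up for the example. It only shows how much TCP payload fits in one unfragmented packet at a given MTU, not how any particular firewall handles oversized packets.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one packet at the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_in_one_packet(payload_bytes, mtu):
    """True if a segment with this payload fits without fragmentation."""
    return payload_bytes <= max_tcp_payload(mtu)


if __name__ == "__main__":
    for mtu in (1500, 500):
        print(f"MTU {mtu}: up to {max_tcp_payload(mtu)} payload bytes per packet")
    # A bare SYN carries no payload, so it fits at either MTU.
    print(fits_in_one_packet(0, 500))
    # A 1400-byte chunk of TLS handshake data fits at 1500 but not at 500.
    print(fits_in_one_packet(1400, 1500), fits_in_one_packet(1400, 500))
```

    If the endpoints size their segments for a 1500-byte path while something in between enforces a much smaller MTU, and fragmentation or path MTU discovery is not working, this is the pattern you would expect: the handshake's small packets get through and the large ones do not.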

    Why are you running 2.3.5 though? Unless you have a very good reason not to, you should upgrade before doing anything else.

    Steve