Encrypted traffic randomly fails after upgrading from 2.4.3_p1 to 2.4.4



  • Hello,

    I have an issue, and googling around made me believe I'm the only one affected...

    The setup: The pfSense box is there to share the internet connection and serve VLANs in a multi-office building. We have about 15 tenants, each on their own VLAN, with the default VLAN (pfSense's LAN) being the guest network.

    2.4.3_p1: Everything works well, and the config has been stable for the last 2 years. No extra packages, except some I never used plus the OpenVPN exporter. Your basic router, plus an unusual number of identical VLANs on it.

    Upgraded to 2.4.4 as soon as it was available, just because I love all the good work put in by the dev team (seriously, amazing piece of software guys!).

    2.4.4: HTTPS sites randomly fail. Most affected sites are Google's, as they are all encrypted and often accessed.
    When it works, it's noticeably slower than usual (a couple of seconds instead of instantaneous-ish).
    Chrome & Firefox report SSL_INTERFERENCE and CONNECTION_RESET after ~30 sec of trying various things.
    Disabling TLSv1.3 in Chrome did not help.
    Affected sites work half the time, blocked half the time. Intermittent and not constant.
    Slack and other messaging apps are also affected.
    Incoming services are affected too: OpenVPN and Windows Server SSL VPN connections drop very quickly, and connections to HTTPS hosted sites drop.
    Packet capture reveals a lot of connections reset, forcing encryption handshake to renegotiate [I believe that's what I saw, I'm no expert here].
    Connecting straight into the modem works perfectly.
    All the VLANs are affected.
    One of the VLANs has its own DNS server (Windows domain, I think it's a resolver), and they are affected like the rest.
    This makes the internet present, but extremely unreliable. Just unreliable enough to make it a huge pain. A bit more, and people would have gone home and not bothered staying at the office.
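For anyone wanting to put a number on the "works half the time" behaviour, here is a minimal sketch that repeats a TLS handshake against one host and reports the failure rate. It is not part of the original troubleshooting; the function names, the default of 20 attempts, and the 10 second timeout are all assumptions chosen for illustration.

```python
import socket
import ssl
import time


def try_handshake(host, port=443, timeout=10.0):
    """Attempt one TCP connect plus TLS handshake.

    Returns (ok, seconds, error_name). error_name is None on success.
    """
    start = time.monotonic()
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake happens inside wrap_socket().
            with ctx.wrap_socket(sock, server_hostname=host):
                pass
        return True, time.monotonic() - start, None
    except OSError as exc:  # covers resets, timeouts and ssl.SSLError
        return False, time.monotonic() - start, type(exc).__name__


def summarize(results):
    """Reduce a list of (ok, seconds, error_name) tuples to simple totals."""
    total = len(results)
    failed = sum(1 for ok, _, _ in results if not ok)
    slowest = max((secs for _, secs, _ in results), default=0.0)
    return {
        "total": total,
        "failed": failed,
        "failure_rate": failed / total if total else 0.0,
        "slowest_s": slowest,
    }


def probe(host, attempts=20, pause=1.0):
    """Run several handshakes in a row (hypothetical defaults) and summarize."""
    results = []
    for _ in range(attempts):
        results.append(try_handshake(host))
        time.sleep(pause)
    return summarize(results)
```

Calling `probe("www.google.com")` from a client behind the firewall, and again from a machine plugged straight into the modem, would give two failure rates to compare between 2.4.3 and 2.4.4.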

    Done a fresh install of 2.4.4: basic WAN/LAN config works well.
    Restore config, all goes to hell again.

    Done a fresh install of 2.4.3 (by the way, I know Netgate has a policy to not host old versions, but I had to download it from some shady website... There should be a provision for regression testing imho).
    All is good, internet works.
    Restore config saved from the 2.4.4, everything works gracefully.
    Upgrade to 2.4.4 again, problematic behaviour happens again.

    Reinstalled 2.4.3, restored config, all is good, called it a night, went here to post this ;)

    Soooooo.... I went through all the configs, trying to catch something unusual. I tried to set all the default values I could. Nothing seemed to work.

    I have a packet capture and the config.xml, if someone's willing to take a look.
    I'm just completely puzzled, pretty sure it's a real bug, unsure if I did something wrong.

    Thanks!


  • Netgate Administrator

    So you only see this with https websites?

    You don't see packet loss otherwise?

    You don't see dnslookup errors?

    What sort of WAN connection do you have?
    One possible issue that changed in 2.4.4 is that if you have any DHCP option set on the WAN (if it uses DHCP) it will now correctly obey whatever MTU the DHCP server sends it. Previously it ignored that, which was wrong, but it seems some providers/devices are sending bogus small values, ~500, which are now causing a problem.

    The other thing to check is the default gateway. The new default gateway group setup can get confused in 2.4.4.

    Both those are patched in 2.4.5 snapshots.

    Steve



  • Hi Stephen,

    ==> So you only see this with https websites?
    I noticed it only with HTTPS traffic, but did not seriously try for unencrypted. It might be both. These days, most things are encrypted...

    ==>You don't see packet loss otherwise?
    I see lots of TCP Retransmission, RST, Duplicate ACK.
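To turn "lots of RST" into counts, here is a rough sketch that tallies resets per sending endpoint from the text output of tcpdump (for example `tcpdump -nn -r capture.pcap`). It assumes the usual `IP src > dst: Flags [...]` line layout, the function name is made up for illustration, and it does not try to detect retransmissions or duplicate ACKs, which tcpdump does not label the way Wireshark does.

```python
import re
from collections import Counter

# Matches lines such as:
#   12:00:00.5 IP 172.217.0.4.443 > 10.0.10.5.51000: Flags [R.], seq 10, ...
TCP_LINE_RE = re.compile(r"IP6? (\S+) > (\S+): Flags \[([A-Za-z.]+)\]")


def summarize_resets(lines):
    """Count TCP packets and RSTs, grouping the RSTs by sending endpoint."""
    tcp_packets = 0
    resets_by_source = Counter()
    for line in lines:
        match = TCP_LINE_RE.search(line)
        if match is None:
            continue  # not a TCP line (ARP, UDP, ICMP, ...)
        tcp_packets += 1
        source, _dest, flags = match.groups()
        if "R" in flags:  # "R" or "R." both mean the RST bit is set
            resets_by_source[source] += 1
    return {
        "tcp_packets": tcp_packets,
        "resets": sum(resets_by_source.values()),
        "resets_by_source": dict(resets_by_source),
    }
```

Feeding it the lines of a saved text dump shows whether the resets mostly come from the remote servers, the clients, or one particular address, which helps narrow down where connections are being torn down.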

    ==>You don't see dnslookup errors?
    Again, maybe I just sampled in a sweet spot, but NSLOOKUP looked fine.

    ==>What sort of WAN connection do you have?
    Cable modem, DHCP autoconfigured.

    =====
    One possible issue that changed in 2.4.4 is that if you have any DHCP option set on the WAN (if it uses DHCP) it will now correctly obey whatever MTU the DHCP server sends it. Previously it ignored that, which was wrong, but it seems some providers/devices are sending bogus small values, ~500, which are now causing a problem.

    I must admit I played a lot with those settings a while ago. Checked back today: All is by default (including MTU) except:

    • Advanced configuration Ticked
    • Reject leases from: 192.168.100.1
    • Protocol timing:
      • Timeout: 60
      • Retry: 15
      • Select timeout: 0
      • Reboot: [blank]
      • Backoff cutoff: [blank]
      • Initial interval: 1
    • Presets: pfSense Default

    ==>The other thing to check is the default gateway. The new default gateway group setup can get confused in 2.4.4.
    I have only one gateway configured, and no gateway groups.
    I noticed the "Default Gateway" mark is unticked.

    I also noticed, in Interfaces->Assignments->PPP, I have two left-over PPPoE connections configured from previous internet providers. They are not in use, but they show up there.

    ==>Both those are patched in 2.4.5 snapshots.
    Thank you devs! I will give it a spin (when I have time to).


  • Netgate Administrator

    I would definitely check the actual interface MTU. If you have any options set there, it removes the default dhclient settings and adds the custom values instead. That removes "supersede interface-mtu 0", which otherwise prevents the bad MTU value from being applied.
    https://redmine.pfsense.org/issues/8507
    You can add that back in the option modifiers field in 2.4.4 if you need any other advanced options.
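As a quick way to check the actual interface MTU, here is a small sketch that parses saved `ifconfig` output and lists interfaces whose MTU looks suspiciously small. The function names are hypothetical, and the 1280 threshold is only an assumption (it is the IPv6 minimum MTU, well above the ~500 values mentioned above); adjust it to taste.

```python
import re

# Matches interface header lines such as:
#   em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
IFACE_RE = re.compile(r"^(\S+): flags=.*\bmtu (\d+)", re.MULTILINE)


def parse_mtus(ifconfig_output):
    """Return {interface_name: mtu} for every interface header line found."""
    return {name: int(mtu) for name, mtu in IFACE_RE.findall(ifconfig_output)}


def suspicious_mtus(ifconfig_output, threshold=1280):
    """Return only the interfaces whose MTU is below the assumed threshold."""
    return {
        name: mtu
        for name, mtu in parse_mtus(ifconfig_output).items()
        if mtu < threshold
    }
```

If the WAN interface shows up in the result of `suspicious_mtus()` after a DHCP renew, that points at the DHCP-supplied MTU rather than at TLS itself.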

    I would definitely set a default gateway if none is set there.

    Steve