pfSense WAN connection hangs after about a minute



  • Greetings everyone

    I'm a new user to the forums, but a long time pfSense user. Recently though, I've run into a strange issue that I can't figure out.

    I recently purchased a Protectli Vault to replace an ancient desktop (also running pfSense as a VM) as my gateway router. I had no trouble setting up pfSense, but I've run into this issue where my WAN connection (500/100 FTTD connection using PPPoE) initially comes up okay, but just freezes after a minute or so. The PPPoE connection still shows as up, but no traffic flows after that initial minute or so, not even pings from pfSense to the WAN. If I manually cycle the WAN connection then it runs for another minute before freezing again and I have no idea why.

    I've tried all the performance tuning and troubleshooting tips found on https://docs.netgate.com/pfsense/en/latest/hardware/tune.html without success. Thinking it must be some kind of incompatibility between the 210AT NIC and Telco box, I returned my existing vault and purchased a 6 port vault with the older Intel 82583V chips. Same problem with the new vault.

    I also tried other router software with mixed success. OPNSense has the same problem as pfSense. Sophos XG works great (so the issue isn't pure hardware), but doesn't support UPnP. ClearOS doesn't die, but only runs at about 2/3rds line speed (it typically shows 250-300Mb/s where pfSense and Sophos are showing the full 500Mb/s).

    I know it's not pure hardware because Sophos XG and ClearOS don't have the stall problem, but it's not pure software because my old box still works like a champ. Any thoughts on what might be causing this odd issue? I'd really like to stick with pfSense as my firewall if I can.


  • Netgate Administrator

    Check the PPP logs, are you seeing anything showing when it's not passing traffic.

    Run a pcap on the PPPoE WAN are you seeing anything? Gateway pings even?

    Run a pcap on the PPPoE parent NIC (assign it if you have to) do you see anything there?

    Anything in the system log?

    What is the telco box? It doesn't show the Ethernet link go down I assume?

    If it's exactly 60s each time that seems like a timeout somewhere, probably in PPPoE and should be logged.

    Steve



  • It's not exactly 60 seconds that it goes down. It seems to be dependent on amount of traffic. Trying a speedtest will kill it almost immediately. The page will load, but then trying to connect to a server fails. Also, if I leave it alone for a few minutes it'll sometimes accept a little more traffic, but dies within seconds.

    For the NIC, I can't see the telco NIC as it's inside the telco side of my fiber box but the NIC on the pfSense side shows as connected and the WAN connection shows as up.

    System and PPP logs don't show any errors that I recognize. The only entry in the system log after startup is my kicking the NIC into promiscuous mode for the pcaps and the last entries in the PPP log are the connection coming up. No reconnection attempt or link down messages there.

    For the pcaps, I'm not the best at reading these, but I do see packets in both of them.

    The both the WAN and em0 port (the NIC underlying the PPP WAN) show motly DNS protocol packets from the WAN IP, but only a handful of responses. There are also a few TCP ACKed unseen segment packets. I'm not sure what those mean.


  • Netgate Administrator

    You see the gateway monitoring pings though? And replies?

    Seeing ACKs to traffic you didn't see might be some sort of ARP issue.

    We'll need to see the logs really I think.

    Steve



  • That I can do. I've attached the PCAP files for both the WAN PPP port and em0 (the underlying WAN NIC).

    PCAP.zip


  • Netgate Administrator

    Hmm, so those were taken after it stopped responding?

    It looks like your WAN IP is in the carrier grade NAT range, is that expected?

    There are no replies to any pings shown in either cap even to the WAN gateway IP. Those succeed before for the 1min before it stops responding though?

    There is some reply traffic though. It looks as though something is exhausted somewhere in the route, like a state table maybe. If the local telco device is NATing I could believe that but it's very unlikely given PPPoE and CGN.

    Bringing down the PPPoE session and reconnecting it allows traffic to flow for a further minute?
    Do you get a different WAN IP every time it connects?

    Steve



  • Both of the pcaps are during the period that it isn't responding. And yes, sadly, they are using Carrier grade NAT IPs as my 'public' address. :(

    Pings were working during the first few seconds to minute that the connection stays up.

    Bringing down and reconnecting the PPPoE session does allow traffic to briefly flow again and I'll double-check, but I believe the IP does not change when I reconnect.


  • LAYER 8 Global Moderator

    Just a theory, more a guess - but if your gateway stops answering pings.. Then yeah your gateway will go down..

    That might explain why your XG and etc. doesn't fall down - if they are not monitoring via ping your connection? Or if they do it less often, maybe some sort of rate limiting being done?

    Turn off monitoring and just set the gateway to always be up... Does that allow traffic to work for more than your 1 minute?


  • Netgate Administrator

    pfSense will still send traffic via that gateway as it's the default even if it's marked down. But obviously the gateway itself will have stopped responding for that to happen. It could indeed be upstream responding badly to the gateway pings.
    As suggested edit the gateway in System > Routing > Gateways and disable monitoring for it.

    If that does prevent it going down you can adjust the monitoring settings there to ping a lot less frequently.

    Steve



  • I can't test this yet but will as soon as I'm done with work for the day. Thanks for the suggestion.



  • I tried disabling gateway monitoring and it didn't affect anything. Before disabling it, dpinger does show errors in the log. Example:

    send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 64.37.18.5 bind_addr 100.64.24.89 identifier "WAN_PPPOE "

    Also, I was wrong about the IP staying the same when I disconnect and reconnect the WAN port. It does increment by one each time I cycle.

    And I confirmed I could ping the WAN IP, i.e. 100.64.24.89 in my above example, even when no traffic was flowing any further.

    Another oddity was keeping a ping to 8.8.8.8 going from just after I reconnected the WAN. It kept working even when the interface seemed to die, but pings to any other public IP were failing. That seems very odd to me.

    I keep thinking this has to be something up stream, but if that were the case then the same hardware should fail with other OSs, or my current pfSense box shouldn't work if it's the software.


  • Netgate Administrator

    @fulkren said in pfSense WAN connection hangs after about a minute:

    send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 64.37.18.5 bind_addr 100.64.24.89 identifier "WAN_PPPOE "

    That is not an error, it's what dpinger logs when it starts showing what IPs it's running on and what values it's using.

    If it is exhausting something, like a state table, in the ISP router then I would expect existing states to stay open and continue to pass traffic but new states to fail. That would normally include the gateway monitoring pings though.

    You have any other hardware device you can test pfSense on?

    Steve


  • LAYER 8 Global Moderator

    @stephenw10 said in pfSense WAN connection hangs after about a minute:

    That would normally include the gateway monitoring pings though.

    Not always the case.. We just had this discussion awhile back about icmp and states.. While we all know icmp is actually a stateless protocol - most firewalls track it, via an entry in the state table.. Pf does this..

    So pinging especially twice a second would most likely keep such tracking active and allowed.

    But its possible whatever is upstream isn't tracking it at all? And if it was having state issues, maybe icmp doesn't hit whatever issue that is be it exhaustion or whatever.

    To me It really comes down to this - if you see pfsense sending traffic upstream, and you get no answer - then the problem is upstream. If pfsense was malforming whatever its sending, why would it not be doing that for the whole time - and it wouldn't even work for a minute?

    Your saying you can only ever get a connection for exactly 1 minute, and then it fails? exactly 1 minute, or it just happens often? But it far more likely that its something upstream if your seeing traffic sent upstream with no reply.



  • It's not exactly one minute. The time it takes to stop responding typically varies from a few seconds to about a minute and appears to depend on how much traffic I load on the interface. If I just run one ping to a single address it'll stay up for a while, but much more than that and it goes down.

    As for other hardware, that's the part that confuses me most. My current pfSense box is a VM on an old EVGA i7 mobo with two onboard gigabit NICs (Realteks no less) and it's working perfectly. Same version of pfSense and exactly the same config as is failing on the Protectli Vault I purchased to replace it. The only config differences I made were testing each adjustment recommended in Netgate's troubleshooting and tuning document for Intel NICs.

    I originally thought it was a hardware compatibility issue between my Telco box and the Intel 210AT NIC chipset on the Vault and swapped it out for a Vault with the older Intel 82583V chipset. Unfortunately I run into the exact same behavior on the new Vault (what I'm currently testing with). And it can't be purely hardware because running Sophos XG or ClearOS on the Vault does work (although ClearOS seems to have a lot of overhead and is unable to saturate the 500Mb/s line).

    So the problem isn't purely software or purely hardware and appears to be the combination of the software and the hardware that causes the issue. Maybe the combination of the Intel based NICS and the driver(s) that pfSense uses?


  • LAYER 8 Global Moderator

    if traffic continues to be sent upstream, and response stop to be seen. That screams upstream problem.. Do a sniff while your connecting so you collect all traffic being sent upstream and downstream... See if you can see any changes in the traffic being sent.. Wireshark for example would show errors and any packets that are malformed, etc.


  • Netgate Administrator

    @johnpoz said in pfSense WAN connection hangs after about a minute:

    But its possible whatever is upstream isn't tracking it at all? And if it was having state issues, maybe icmp doesn't hit whatever issue that is be it exhaustion or whatever.

    Yeah, that's a good point. Or maybe it requires a new state for each ping and therefore breaks when it stops being able to open new states. Which seems like what we're seeing here.

    I could believe the pings themselves are exhausting it except disabling monitoring did not appear to help.

    Steve


  • LAYER 8 Global Moderator

    A sniff of the traffic should show us if new states are what are being denied, or if we are loosing responses to existing traffic, etc.

    I am just having a hard time coming up with any sort of issue where pfsense would continue to send traffic, but malform it or change it in some way that the upstream didn't like.. Have never seen such thing ever..

    Sure some device have hard time talking to each other - but this is normally seen in just basic negotiation of speed and duplex, etc. Not working fine for X amount of time, and then start having issues with just some traffic.



  • Sorry for the delay in responding. I did another packet capture on em0 (the NIC underlying the PPP connection). This time I started the packet capture with the WAN cable unplugged, plugged it in, and recorded until it died (about 35 seconds later). I don't see any malformed packets (that I recognize as such) but I do see some spurious re-transmissions as well as a small number of RST packets.

    I've attached the PCAP (as much as the forum allows, about the first 8k packets out of 22k total). Anything stand out to you?

    If not, then I'm down to calling the Telco and seeing if they can offer any advice. Though I expect their response to be essentially "use our provided router or get stuffed". Regarding which, their provided router is just an ancient D-Link DIR-825 with stock software. Nothing exciting there.em0-2020-10-21-1.zip


  • LAYER 8 Global Moderator

    I don't see how that sniff is from before you connected.. It starts with a flood of DNS..

    Why would you be sending a syn (tcp) to 53??

    Also - clearly your missing start of conversations... This one for example

    start.png

    There is no syn here, you see sending http get, never getting a response.. So retran, then finally says screw it and sends fin.. Get a answer finally, then client side says screw it and sends RST..

    But you never see the start of this conversation.

    So how exactly did you start this sniff before you wan was even up? Make sure you clear all states if your going to do that again.. But default I believe is to flush all states when connection goes down, have you adjusted that setting?

    Game is on - so that is my comments for now.. Will look later after the game, or maybe halftime



  • I haven't adjusted the settings for flushing states. And just to confirm I haven't accidentally borked anything up I have done a reset on the box settings and then re-configured for my local subnet, PPPoE connection, and DNS Resolver.

    For the Sniff, my first attempt was to just unplug the WAN cable (LAN still plugged in), start the capture on WAN in promiscuous mode and then plug it back in. That just got me that flood of DNS packets as you saw. I also tried clearing states and then starting a pcap and, as quickly as I could, running "/usr/local/sbin/pfSctl -c 'interface reload wan'" from the shell but that seems to be giving me similar results to what I already posted.

    Is there a better way for me to get good data during PPP startup?


  • LAYER 8 Global Moderator

    When you pull the cable on wan.. Clear all your states.. then start you sniff, then plug the cable back in.. We should then see everything, your pppoe connection, your dhcp on your wan, etc. And since all states are cleared, nothing will be allowed through the firewall to even go out the wan without sending a new syn to start a conversation.

    We can then see what conversations are failing, what are working.

    Also make sure you run it the whole time until your saying nothing is working.. You said it was like 30 some seconds that it worked, but that sniff is only 11 seconds.

    Why I don't get is why so much tcp dns traffic... These all seem to be root servers, but your talking to them via tcp.. Is UDP blocked??

    53tcp.png

    This seems nuts... in less than 1ms you sent 870 something dns queries - not one of them I see from quick look got answered over UDP..

    dns.png

    That doesn't seem normal... Most of your dns should be UDP.. and only stuff that is LARGE answers should have to do TCP..


  • Netgate Administrator

    Some aliases with fqdns being resolved?

    Though those all look like typical stuff you might have behind pfSense.

    You have Unbound configured in resolving mode? I could just about imagine something at your ISP is blocking you based on those DNS requests. The ISPs router will just be using the DNS servers supplied by the ISP, that would be a difference when using pfSense.
    You could try switching Unbound to forwarding mode and checking 'Allow DNS server list to be overridden by DHCP/PPP on WAN' in System > General Setup.

    Steve


  • LAYER 8 Global Moderator

    @stephenw10 said in pfSense WAN connection hangs after about a minute:

    I could just about imagine something at your ISP is blocking you based on those DNS requests.

    Yeah I agree something is blocking dns, tries udp - not working so tries tcp..

    I didn't notice anything bad in the queries - just a freaking shit ton of them.. 870 queries in 1 ms?? How many clients do you have behind pfsense?



  • I'll follow your suggestions on getting a new PCAP. I'm getting a small switch in place with just my client laptop and pfSense connected to it to minimize extraneous traffic for the PCAP.

    As for clients, this is just a home network so 6-8 PCs, a couple of appliances, and a few mobile devices.

    I was running DNS to Cloudflare (1.1.1.1 and 1.0.0.1), have the DNS Forwarder disabled, and the DNS Resolver enabled. Apparently I unchecked 'enable forwarding mode' by mistake somewhere in my testing yesterday. I'll fix that and change to using the default IP specified DNS servers for my next test (I've tested it before and it didn't help, but at this point I'm eliminating as many variables as I can). DNS Resolver will have Network Interfaces set to LAN and Localhost, with Outgoing Network Interface set to WAN.

    I'll get a new PCAP completed and uploaded as soon as I can.



  • Success! My one laptop test worked. pfSense is up and running. After testing for 10-15 minutes with just one laptop I put it back on the main switch and so far so good.

    The only thing I couldn't do was get a good pcap of the WAN link coming up. As suggested, I rebooted pfSense, cleared the States, re-logged into the GUI and then started a promiscuous pcap with full details up to 10,000 packets on the WAN interface with the WAN cable unplugged. I then plugged the cable in and worked with it until it stopped (or in this case, didn't stop).

    When I stopped the pcap after a few minutes, I only had three packets listed. I've pasted the three packets in below, but why only three IPv6 packets? I should be seeing a lot more than this shouldn't I? Or does plugging the cable in somehow disrupt the capture?

    No. Time Source Destination Protocol Length Info
    1 0.000000 :: ff02::16 ICMPv6 100 Multicast Listener Report Message v2

    Frame 1: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
    Null/Loopback
    Internet Protocol Version 6, Src: ::, Dst: ff02::16
    Internet Control Message Protocol v6

    No. Time Source Destination Protocol Length Info
    2 0.152920 :: ff02::1:ff1e:22f0 ICMPv6 76 Neighbor Solicitation for fe80::2e0:67ff:fe1e:22f0

    Frame 2: 76 bytes on wire (608 bits), 76 bytes captured (608 bits)
    Null/Loopback
    Internet Protocol Version 6, Src: ::, Dst: ff02::1:ff1e:22f0
    Internet Control Message Protocol v6

    No. Time Source Destination Protocol Length Info
    3 0.636174 :: ff02::16 ICMPv6 80 Multicast Listener Report Message v2

    Frame 3: 80 bytes on wire (640 bits), 80 bytes captured (640 bits)
    Null/Loopback
    Internet Protocol Version 6, Src: ::, Dst: ff02::16
    Internet Control Message Protocol v6


  • Netgate Administrator

    So after switching back to DNS forwarding mode and using Clouflare for upstream you are no longer being blocked after a few minutes?
    Are you using DNS over TLS for that?

    Steve



  • That's correct except for the Cloudflare part. Right now it's just using the ISP DNS provided by the PPPoE connection. I'll try Cloudflare and DNS over TLS again this evening and see if that affects anything.


  • LAYER 8 Global Moderator

    So it was never that your connection was down, its your resolution of dns was not work.. Which is why somestuff worked and other stuff didn't



  • Not entirely. While DNS was definitely part of the issue, the WAN connection did appear to stop returning packets. For example, pings from a box on the LAN to a public IP like 8.8.8.8 would start timing out; as would pings to any other public IPs. I would initially get DNS resolution and traffic, but after a short period (15-60 seconds) both DNS resolution and ICMP responses would stop.



  • I re-enabled Cloudflare DNS+TLS and it's still stable. I'm beginning to think it was that Enable Forwarding Mode checkbox that caused the issue (will test this later when the rest of the household isn't online). I wonder if all the root traffic was convincing my ISP that I was hosting a DNS server and they blocked my connection. I'll report back if I find out anything further, but I'm happy that the Vault is up and working like it should be.

    Thank you both for taking the time to look at my issue and for providing both solid suggestions and feedback. It really helped to have some experienced eyes on the problem.


  • LAYER 8 Global Moderator

    @fulkren said in pfSense WAN connection hangs after about a minute:

    I'm beginning to think it was that Enable Forwarding Mode checkbox that caused the issue

    Huh?? When you forward you do not talk to roots..



  • Sorry, I should have been more specific. I was postulating that accidentally having Enable Forwarding Mode disabled (unchecked) was causing the issue.


  • LAYER 8 Global Moderator

    I would check with your ISP - I personally would be livid and looking for another ISP if my isp was interfering in my dns traffic like that.


  • Netgate Administrator

    Yeah, that should not happen. But that is what I was postulating above.

    Steve



  • @johnpoz said in pfSense WAN connection hangs after about a minute:

    I would check with your ISP - I personally would be livid and looking for another ISP if my isp was interfering in my dns traffic like that.

    I love when you speak from your heart. Usually that's the truth.
    In fact ISP use your DNS queries to make a profile of you, it can be use for you (adds) or against you (if you sue ISP). On top of that they reduced costs by participate on Google Global Cache or GGC program for Youtube, Netflix, and so on.
    Just block their GGC cache and you will get a faster speed.


Log in to reply