pfSense WAN connection hangs after about a minute

fulkren

I can't test this yet but will as soon as I'm done with work for the day. Thanks for the suggestion.

fulkren

I tried disabling gateway monitoring and it didn't affect anything. Before disabling it, dpinger does show errors in the log. Example:

send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 64.37.18.5 bind_addr 100.64.24.89 identifier "WAN_PPPOE "

Also, I was wrong about the IP staying the same when I disconnect and reconnect the WAN port. It does increment by one each time I cycle.

And I confirmed I could ping the WAN IP, i.e. 100.64.24.89 in my above example, even when no traffic was flowing any further.

Another oddity was keeping a ping to 8.8.8.8 going from just after I reconnected the WAN. It kept working even when the interface seemed to die, but pings to any other public IP were failing. That seems very odd to me.

I keep thinking this has to be something up stream, but if that were the case then the same hardware should fail with other OSs, or my current pfSense box shouldn't work if it's the software.

stephenw10

@fulkren said in pfSense WAN connection hangs after about a minute:

send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 64.37.18.5 bind_addr 100.64.24.89 identifier "WAN_PPPOE "

That is not an error, it's what dpinger logs when it starts showing what IPs it's running on and what values it's using.

If it is exhausting something, like a state table, in the ISP router then I would expect existing states to stay open and continue to pass traffic but new states to fail. That would normally include the gateway monitoring pings though.

You have any other hardware device you can test pfSense on?

Steve

johnpoz

@stephenw10 said in pfSense WAN connection hangs after about a minute:

That would normally include the gateway monitoring pings though.

Not always the case.. We just had this discussion awhile back about icmp and states.. While we all know icmp is actually a stateless protocol - most firewalls track it, via an entry in the state table.. Pf does this..

So pinging especially twice a second would most likely keep such tracking active and allowed.

But its possible whatever is upstream isn't tracking it at all? And if it was having state issues, maybe icmp doesn't hit whatever issue that is be it exhaustion or whatever.

To me It really comes down to this - if you see pfsense sending traffic upstream, and you get no answer - then the problem is upstream. If pfsense was malforming whatever its sending, why would it not be doing that for the whole time - and it wouldn't even work for a minute?

Your saying you can only ever get a connection for exactly 1 minute, and then it fails? exactly 1 minute, or it just happens often? But it far more likely that its something upstream if your seeing traffic sent upstream with no reply.

fulkren

It's not exactly one minute. The time it takes to stop responding typically varies from a few seconds to about a minute and appears to depend on how much traffic I load on the interface. If I just run one ping to a single address it'll stay up for a while, but much more than that and it goes down.

As for other hardware, that's the part that confuses me most. My current pfSense box is a VM on an old EVGA i7 mobo with two onboard gigabit NICs (Realteks no less) and it's working perfectly. Same version of pfSense and exactly the same config as is failing on the Protectli Vault I purchased to replace it. The only config differences I made were testing each adjustment recommended in Netgate's troubleshooting and tuning document for Intel NICs.

I originally thought it was a hardware compatibility issue between my Telco box and the Intel 210AT NIC chipset on the Vault and swapped it out for a Vault with the older Intel 82583V chipset. Unfortunately I run into the exact same behavior on the new Vault (what I'm currently testing with). And it can't be purely hardware because running Sophos XG or ClearOS on the Vault does work (although ClearOS seems to have a lot of overhead and is unable to saturate the 500Mb/s line).

So the problem isn't purely software or purely hardware and appears to be the combination of the software and the hardware that causes the issue. Maybe the combination of the Intel based NICS and the driver(s) that pfSense uses?

johnpoz

if traffic continues to be sent upstream, and response stop to be seen. That screams upstream problem.. Do a sniff while your connecting so you collect all traffic being sent upstream and downstream... See if you can see any changes in the traffic being sent.. Wireshark for example would show errors and any packets that are malformed, etc.

stephenw10

@johnpoz said in pfSense WAN connection hangs after about a minute:

But its possible whatever is upstream isn't tracking it at all? And if it was having state issues, maybe icmp doesn't hit whatever issue that is be it exhaustion or whatever.

Yeah, that's a good point. Or maybe it requires a new state for each ping and therefore breaks when it stops being able to open new states. Which seems like what we're seeing here.

I could believe the pings themselves are exhausting it except disabling monitoring did not appear to help.

Steve

johnpoz

A sniff of the traffic should show us if new states are what are being denied, or if we are loosing responses to existing traffic, etc.

I am just having a hard time coming up with any sort of issue where pfsense would continue to send traffic, but malform it or change it in some way that the upstream didn't like.. Have never seen such thing ever..

Sure some device have hard time talking to each other - but this is normally seen in just basic negotiation of speed and duplex, etc. Not working fine for X amount of time, and then start having issues with just some traffic.

fulkren

Sorry for the delay in responding. I did another packet capture on em0 (the NIC underlying the PPP connection). This time I started the packet capture with the WAN cable unplugged, plugged it in, and recorded until it died (about 35 seconds later). I don't see any malformed packets (that I recognize as such) but I do see some spurious re-transmissions as well as a small number of RST packets.

I've attached the PCAP (as much as the forum allows, about the first 8k packets out of 22k total). Anything stand out to you?

If not, then I'm down to calling the Telco and seeing if they can offer any advice. Though I expect their response to be essentially "use our provided router or get stuffed". Regarding which, their provided router is just an ancient D-Link DIR-825 with stock software. Nothing exciting there.em0-2020-10-21-1.zip

johnpoz

I don't see how that sniff is from before you connected.. It starts with a flood of DNS..

Why would you be sending a syn (tcp) to 53??

Also - clearly your missing start of conversations... This one for example

There is no syn here, you see sending http get, never getting a response.. So retran, then finally says screw it and sends fin.. Get a answer finally, then client side says screw it and sends RST..

But you never see the start of this conversation.

So how exactly did you start this sniff before you wan was even up? Make sure you clear all states if your going to do that again.. But default I believe is to flush all states when connection goes down, have you adjusted that setting?

Game is on - so that is my comments for now.. Will look later after the game, or maybe halftime

fulkren

I haven't adjusted the settings for flushing states. And just to confirm I haven't accidentally borked anything up I have done a reset on the box settings and then re-configured for my local subnet, PPPoE connection, and DNS Resolver.

For the Sniff, my first attempt was to just unplug the WAN cable (LAN still plugged in), start the capture on WAN in promiscuous mode and then plug it back in. That just got me that flood of DNS packets as you saw. I also tried clearing states and then starting a pcap and, as quickly as I could, running "/usr/local/sbin/pfSctl -c 'interface reload wan'" from the shell but that seems to be giving me similar results to what I already posted.

Is there a better way for me to get good data during PPP startup?

johnpoz

When you pull the cable on wan.. Clear all your states.. then start you sniff, then plug the cable back in.. We should then see everything, your pppoe connection, your dhcp on your wan, etc. And since all states are cleared, nothing will be allowed through the firewall to even go out the wan without sending a new syn to start a conversation.

We can then see what conversations are failing, what are working.

Also make sure you run it the whole time until your saying nothing is working.. You said it was like 30 some seconds that it worked, but that sniff is only 11 seconds.

Why I don't get is why so much tcp dns traffic... These all seem to be root servers, but your talking to them via tcp.. Is UDP blocked??

This seems nuts... in less than 1ms you sent 870 something dns queries - not one of them I see from quick look got answered over UDP..

That doesn't seem normal... Most of your dns should be UDP.. and only stuff that is LARGE answers should have to do TCP..

stephenw10

Some aliases with fqdns being resolved?

Though those all look like typical stuff you might have behind pfSense.

You have Unbound configured in resolving mode? I could just about imagine something at your ISP is blocking you based on those DNS requests. The ISPs router will just be using the DNS servers supplied by the ISP, that would be a difference when using pfSense.
You could try switching Unbound to forwarding mode and checking 'Allow DNS server list to be overridden by DHCP/PPP on WAN' in System > General Setup.

Steve

johnpoz

@stephenw10 said in pfSense WAN connection hangs after about a minute:

I could just about imagine something at your ISP is blocking you based on those DNS requests.

Yeah I agree something is blocking dns, tries udp - not working so tries tcp..

I didn't notice anything bad in the queries - just a freaking shit ton of them.. 870 queries in 1 ms?? How many clients do you have behind pfsense?

fulkren

I'll follow your suggestions on getting a new PCAP. I'm getting a small switch in place with just my client laptop and pfSense connected to it to minimize extraneous traffic for the PCAP.

As for clients, this is just a home network so 6-8 PCs, a couple of appliances, and a few mobile devices.

I was running DNS to Cloudflare (1.1.1.1 and 1.0.0.1), have the DNS Forwarder disabled, and the DNS Resolver enabled. Apparently I unchecked 'enable forwarding mode' by mistake somewhere in my testing yesterday. I'll fix that and change to using the default IP specified DNS servers for my next test (I've tested it before and it didn't help, but at this point I'm eliminating as many variables as I can). DNS Resolver will have Network Interfaces set to LAN and Localhost, with Outgoing Network Interface set to WAN.

I'll get a new PCAP completed and uploaded as soon as I can.

fulkren

Success! My one laptop test worked. pfSense is up and running. After testing for 10-15 minutes with just one laptop I put it back on the main switch and so far so good.

The only thing I couldn't do was get a good pcap of the WAN link coming up. As suggested, I rebooted pfSense, cleared the States, re-logged into the GUI and then started a promiscuous pcap with full details up to 10,000 packets on the WAN interface with the WAN cable unplugged. I then plugged the cable in and worked with it until it stopped (or in this case, didn't stop).

When I stopped the pcap after a few minutes, I only had three packets listed. I've pasted the three packets in below, but why only three IPv6 packets? I should be seeing a lot more than this shouldn't I? Or does plugging the cable in somehow disrupt the capture?

No. Time Source Destination Protocol Length Info
1 0.000000 :: ff02::16 ICMPv6 100 Multicast Listener Report Message v2

Frame 1: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Null/Loopback
Internet Protocol Version 6, Src: ::, Dst: ff02::16
Internet Control Message Protocol v6

No. Time Source Destination Protocol Length Info
2 0.152920 :: ff02::1:ff1e:22f0 ICMPv6 76 Neighbor Solicitation for fe80::2e0:67ff:fe1e:22f0

Frame 2: 76 bytes on wire (608 bits), 76 bytes captured (608 bits)
Null/Loopback
Internet Protocol Version 6, Src: ::, Dst: ff02::1:ff1e:22f0
Internet Control Message Protocol v6

No. Time Source Destination Protocol Length Info
3 0.636174 :: ff02::16 ICMPv6 80 Multicast Listener Report Message v2

Frame 3: 80 bytes on wire (640 bits), 80 bytes captured (640 bits)
Null/Loopback
Internet Protocol Version 6, Src: ::, Dst: ff02::16
Internet Control Message Protocol v6

stephenw10

So after switching back to DNS forwarding mode and using Clouflare for upstream you are no longer being blocked after a few minutes?
Are you using DNS over TLS for that?

Steve

fulkren

That's correct except for the Cloudflare part. Right now it's just using the ISP DNS provided by the PPPoE connection. I'll try Cloudflare and DNS over TLS again this evening and see if that affects anything.

johnpoz

So it was never that your connection was down, its your resolution of dns was not work.. Which is why somestuff worked and other stuff didn't

fulkren

Not entirely. While DNS was definitely part of the issue, the WAN connection did appear to stop returning packets. For example, pings from a box on the LAN to a public IP like 8.8.8.8 would start timing out; as would pings to any other public IPs. I would initially get DNS resolution and traffic, but after a short period (15-60 seconds) both DNS resolution and ICMP responses would stop.