Not getting a DHCP WAN IP Address on netgate hardware.

Austin 0

@Austin-0 Looks like I spoke too soon. It worked for 5-10 minutes or so, and then I got 100% packet loss according to the gateway monitor. I rebooted, and the same thing happened: it worked for 5-10 minutes and then dropped all of the packets. Below are the logs from after the reboot. As you can see, it came back up from the reboot at 16:38, and dpinger sent the alarm at 16:45.

System Logs 2023-09-10.png

stephenw10

For the 4100 specifically?

Austin 0

@stephenw10 Yes

stephenw10

@Austin-0 said in Not getting a DHCP WAN IP Address on netgate hardware.:

Looks like I spoke too soon

Did the switch lose link? pfSense only shows that the pings to 1.1.1.1 started to fail.

Austin 0

@stephenw10 The switch did not lose the link. 1.1.1.1 is what I have gateway monitoring set to. However, I can confirm that all internet access was lost at that time, not just to 1.1.1.1.

stephenw10

Even to the actual gateway? Is it still in the ARP table?

Austin 0

@stephenw10 I ran a tracert from one of the computers at the time of the failure and it only got as far as the pfSense box, so yes, even the connection to the gateway was down. As far as the ARP table goes, I have unfortunately left the building, and I won't be back until at least Friday. I will test it again asap and look at the ARP table this time.

stephenw10

The ISP gateway may not appear in a traceroute. If you've tried it before and it did appear, then of course it still should.

Austin 0

@stephenw10 Okay, I have confirmed that the ISP gateway does appear in tracerts normally. Also, the ISP gateway stays in the ARP table.

stephenw10

Hmm, hard to say then. pfSense still has an IP on the WAN I assume? But it cannot ping the WAN gateway even though it's in the ARP table?

Do you see the pings in a pcap on WAN?

"Works for a few minutes, then stops" sure seems like it could be an ARP issue.
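
For the pcap question above, a minimal sketch of a capture limited to the gateway monitor pings, in case the GUI packet capture is not convenient. It assumes a /bin/sh session on the firewall; the helper name capture_monitor_pings and the interface name passed to it are placeholders, not anything from this thread.

```shell
#!/bin/sh
# Sketch only: capture_monitor_pings is a hypothetical helper name.
# Usage: capture_monitor_pings <wan-interface> <monitor-ip> [packet-count]
capture_monitor_pings() {
    wan_if="$1"        # WAN interface name (assumption: look this up first)
    monitor_ip="$2"    # address the gateway monitor pings, 1.1.1.1 in this thread
    count="${3:-20}"   # stop after this many matching packets (default 20)

    # -n: no name resolution, -e: print MAC addresses, -c: packet limit.
    # The filter keeps only ICMP to or from the monitor address, so echo
    # replies would appear next to the echo requests if any came back.
    tcpdump -n -e -c "$count" -i "$wan_if" icmp and host "$monitor_ip"
}
```

Something like `capture_monitor_pings ix0 1.1.1.1` would then show whether only echo requests leave the WAN or whether replies arrive as well; ix0 is only an example interface name.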

Austin 0

@stephenw10 Here is the packet capture. I have replaced the public IP with xxx.xxx.xxx.xxx. You can see the ICMP requests that dpinger is making to 1.1.1.1. I noticed a lot of these are getting flagged for bad checksum, but I am not quite sure what to do about that.

10:45:51.905405 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 77: (tos 0x0, ttl 127, id 38590, offset 0, flags [none], proto UDP (17), length 63, bad cksum 0 (->861d)!)
    xxx.xxx.xxx.xxx.35343 > 1.1.1.1.53: [udp sum ok] 49570+ A? forum.netgate.com. (35)
10:45:51.905451 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 77: (tos 0x0, ttl 127, id 38591, offset 0, flags [none], proto UDP (17), length 63, bad cksum 0 (->861c)!)
    xxx.xxx.xxx.xxx.42560 > 1.1.1.1.53: [udp sum ok] 51250+ Type65? forum.netgate.com. (35)
10:45:51.921752 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 83: (tos 0x0, ttl 127, id 38592, offset 0, flags [none], proto UDP (17), length 69, bad cksum 0 (->8615)!)
    xxx.xxx.xxx.xxx.60130 > 1.1.1.1.53: [udp sum ok] 19119+ A? signaler-pa.youtube.com. (41)
10:45:51.921848 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 83: (tos 0x0, ttl 127, id 38593, offset 0, flags [none], proto UDP (17), length 69, bad cksum 0 (->8614)!)
    xxx.xxx.xxx.xxx.19205 > 1.1.1.1.53: [udp sum ok] 21591+ Type65? signaler-pa.youtube.com. (41)
10:45:51.934131 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 54: (tos 0x0, ttl 127, id 21954, offset 0, flags [DF], proto TCP (6), length 40, bad cksum 0 (->c8e1)!)
    xxx.xxx.xxx.xxx.8358 > 52.226.139.121.443: Flags [R.], cksum 0xb229 (correct), seq 3206783296, ack 1699559528, win 0, length 0
10:45:52.200962 60:22:32:46:45:0d > 01:80:c2:00:00:00, 802.3, length 39: LLC, dsap STP (0x42) Individual, ssap STP (0x42) Command, ctrl 0x03: STP 802.1w, Rapid STP, Flags [Learn, Forward, Agreement], bridge-id 8000.60:22:32:46:45:0c.8010, length 43
	message-age 0.00s, max-age 20.00s, hello-time 2.00s, forwarding-delay 15.00s
	root-id 8000.60:22:32:46:45:0c, root-pathcost 0, port-role Designated
10:45:52.216361 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 43: (tos 0x0, ttl 64, id 63816, offset 0, flags [none], proto ICMP (1), length 29, bad cksum 0 (->62c5)!)
    xxx.xxx.xxx.xxx > 1.1.1.1: ICMP echo request, id 47797, seq 1577, length 9
10:45:52.240408 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 86: (tos 0x0, ttl 64, id 15133, offset 0, flags [none], proto UDP (17), length 72, bad cksum 0 (->12a8)!)
    xxx.xxx.xxx.xxx.6424 > 8.8.8.8.53: [bad udp cksum 0x2d26 -> 0xc857!] 13895+ PTR? 8.179.243.104.in-addr.arpa. (44)
10:45:52.297247 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 55: (tos 0x0, ttl 127, id 13743, offset 0, flags [DF], proto TCP (6), length 41, bad cksum 0 (->d40b)!)
    xxx.xxx.xxx.xxx.50716 > 172.64.41.3.443: Flags [.], cksum 0x25e3 (correct), seq 110458962:110458963, ack 572074752, win 1028, length 1
10:45:52.720360 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 43: (tos 0x0, ttl 64, id 23499, offset 0, flags [none], proto ICMP (1), length 29, bad cksum 0 (->43)!)
    xxx.xxx.xxx.xxx > 1.1.1.1: ICMP echo request, id 47797, seq 1578, length 9
10:45:52.835580 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 63, id 57232, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 0 (->4b70)!)
    xxx.xxx.xxx.xxx.25868 > 3.95.234.235.30011: Flags [.], cksum 0x1094 (correct), seq 707216205:707217653, ack 148236916, win 166, options [nop,nop,TS val 35204921 ecr 94619178], length 1448
10:45:52.835654 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 63, id 57233, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 0 (->4b6f)!)
    xxx.xxx.xxx.xxx.25868 > 3.95.234.235.30011: Flags [.], cksum 0x7d8e (correct), seq 1448:2896, ack 1, win 166, options [nop,nop,TS val 35204921 ecr 94619178], length 1448
10:45:52.835665 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 63, id 57234, offset 0, flags [DF], proto TCP (6), length 1500, bad cksum 0 (->4b6e)!)
    xxx.xxx.xxx.xxx.25868 > 3.95.234.235.30011: Flags [.], cksum 0x77e6 (correct), seq 2896:4344, ack 1, win 166, options [nop,nop,TS val 35204921 ecr 94619178], length 1448
10:45:52.835779 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 1461: (tos 0x0, ttl 63, id 57235, offset 0, flags [DF], proto TCP (6), length 1447, bad cksum 0 (->4ba2)!)
    xxx.xxx.xxx.xxx.25868 > 3.95.234.235.30011: Flags [P.], cksum 0x5acb (correct), seq 4344:5739, ack 1, win 166, options [nop,nop,TS val 35204921 ecr 94619178], length 1395
10:45:52.871216 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 127, id 47650, offset 0, flags [none], proto UDP (17), length 56, bad cksum 0 (->54b2)!)
    xxx.xxx.xxx.xxx.7567 > 8.8.8.8.53: [udp sum ok] 57083+ A? dns.google. (28)
10:45:52.871224 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 127, id 38594, offset 0, flags [none], proto UDP (17), length 56, bad cksum 0 (->8620)!)
    xxx.xxx.xxx.xxx.40601 > 1.1.1.1.53: [udp sum ok] 57083+ A? dns.google. (28)
10:45:52.918662 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 127, id 47651, offset 0, flags [none], proto UDP (17), length 56, bad cksum 0 (->54b1)!)
    xxx.xxx.xxx.xxx.31362 > 8.8.8.8.53: [udp sum ok] 54725+ A? dns.google. (28)
10:45:52.918707 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 70: (tos 0x0, ttl 127, id 47652, offset 0, flags [none], proto UDP (17), length 56, bad cksum 0 (->54b0)!)
    xxx.xxx.xxx.xxx.8219 > 8.8.8.8.53: [udp sum ok] 16179+ Type65? dns.google. (28)
10:45:52.919544 90:ec:77:34:73:8e > 78:ba:f9:30:82:33, ethertype IPv4 (0x0800), length 77: (tos 0x0, ttl 127, id 47653, offset 0, flags [none], proto UDP (17), length 63, bad cksum 0 (->54a8)!)

stephenw10

Mmm, nothing coming back from the gateway at all though.

The checksum errors are because hardware checksum off-loading is enabled. That's not a problem, but you can disable it in System > Advanced > Networking.
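
With off-loading enabled, the capture sees outbound packets before the NIC has filled in the checksum, which is why they show up as "bad cksum 0". A minimal sketch for checking whether off-loading is currently active on an interface, assuming a /bin/sh session; show_offload_flags is a hypothetical helper name and the interface name is whatever your WAN uses.

```shell
#!/bin/sh
# Sketch only: show_offload_flags is a hypothetical helper name.
# Usage: show_offload_flags <interface>
show_offload_flags() {
    wan_if="$1"
    # FreeBSD ifconfig prints the enabled capabilities on an "options=" line.
    # RXCSUM / TXCSUM in that list mean checksum off-loading is active.
    ifconfig "$wan_if" | grep 'options='
}
```

If RXCSUM and TXCSUM are listed, the bad checksum flags on outbound packets in a local capture are expected and can be ignored.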

Austin 0

@stephenw10 Yeah nothing comes back. It is weird.

stephenw10

If you can install the arping pkg, you can try arping the gateway:

[23.09-DEVELOPMENT][admin@4100-3.stevew.lan]/root: pkg install arping
Updating pfSense-core repository catalogue...
Fetching meta.conf:   0%
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
Fetching meta.conf:   0%
pfSense repository is up to date.
All repositories are up to date.
The following 2 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
	arping: 2.21_1 [pfSense]
	libnet: 1.2,1 [pfSense]

Number of packages to be installed: 2

118 KiB to be downloaded.

Proceed with this action? [y/N]: y
[1/2] Fetching libnet-1.2,1.pkg: 100%   92 KiB  94.1kB/s    00:01    
[2/2] Fetching arping-2.21_1.pkg: 100%   26 KiB  26.5kB/s    00:01    
Checking integrity... done (0 conflicting)
[1/2] Installing libnet-1.2,1...
[1/2] Extracting libnet-1.2,1: 100%
[2/2] Installing arping-2.21_1...
[2/2] Extracting arping-2.21_1: 100%
[23.09-DEVELOPMENT][admin@4100-3.stevew.lan]/root: rehash

Then:

[23.09-DEVELOPMENT][admin@4100-3.stevew.lan]/root: arping -c 3 172.21.16.1
ARPING 172.21.16.1
60 bytes from 00:08:a2:0c:c9:91 (172.21.16.1): index=0 time=767.357 usec
60 bytes from 00:08:a2:0c:c9:91 (172.21.16.1): index=1 time=661.690 usec
60 bytes from 00:08:a2:0c:c9:91 (172.21.16.1): index=2 time=682.343 usec

--- 172.21.16.1 statistics ---
3 packets transmitted, 3 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.662/0.704/0.767/0.046 ms

If the gateway doesn't respond even to ARP, there must be something low level disconnected somehow.

The ARP entry in the table will only expire after ~15mins, so it may appear to be there still even if the gateway is not responding at all.

JonathanLee

What about the MTU settings? Does that matter with ONT modems? A duplex mismatch could also occur. Is the connection set to auto or full duplex on the WAN? I think it's a duplex mismatch, since it corrects with a switch in between: the switch could be set to auto-negotiation while the firewall is somehow set to half duplex.

https://docs.netgate.com/pfsense/en/latest/troubleshooting/low-throughput.html

stephenw10

There appear to be two issues here, at least. Firstly, the ONT seems to be set to 100M fixed, which means the interfaces on the 4100 cannot link to it directly.

Secondly, the ISP gateway stops responding after some time. That's unlikely to be an MTU issue because the pings are tiny, as are the DHCP requests.

We have seen something similar to this previously. A misbehaving ISP gateway stopped responding when its ARP entry expired instead of sending an ARP request to renew it. IIRC we worked around it by setting the pfSense ARP expiry time low so that it sends an ARP request before the gateway expires its entry. By default it's 20mins:

[23.09-DEVELOPMENT][admin@4100-3.stevew.lan]/root: sysctl net.link.ether.inet.max_age
net.link.ether.inet.max_age: 1200

Try setting that to 5mins and see if that allows it to continue:

[23.09-DEVELOPMENT][admin@4100-3.stevew.lan]/root: sysctl net.link.ether.inet.max_age=300
net.link.ether.inet.max_age: 1200 -> 300

If that works you can add it as a system tunable.

Running an arping against the gateway would probably also renew the remote ARP entry.

Both are hacks that shouldn't be required!
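
The arping variant of the workaround could be sketched roughly as below. This is only an illustration, assuming /bin/sh and the arping pkg installed as shown earlier; arp_keepalive is a hypothetical helper name, and the interval is a guess since the gateway's own ARP timeout is not known.

```shell
#!/bin/sh
# Sketch only: arp_keepalive is a hypothetical helper name.
# Usage: arp_keepalive <gateway-ip> <interval-seconds> <rounds>
arp_keepalive() {
    gateway="$1"     # WAN gateway IP
    interval="$2"    # seconds to wait between ARP requests
    rounds="$3"      # number of requests to send, so the loop terminates
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Assumption: arping exits non-zero when no reply is received.
        if ! arping -c 1 "$gateway" >/dev/null 2>&1; then
            echo "no ARP reply from $gateway"
        fi
        i=$((i + 1))
        # Sleep between rounds, but not after the last one.
        if [ "$i" -lt "$rounds" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `arp_keepalive <gateway-ip> 240 15` would send one ARP request every four minutes for roughly an hour and print a line each time the gateway fails to answer, which also shows when it stops responding.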

Austin 0

@stephenw10 Thank you for your time on this. I will not have physical access to the device until Friday or Saturday. I will try it again and let you know what happens asap.

Austin 0

@stephenw10 This was the result of ARPing the gateway's MAC

stephenw10

I assume that's after it stops responding? Does that ARPing work initially?

Did you try setting a lower max_age value?