Fiber ISP, ONT Calix, and loss of WAN routes at the time of original DHCP lease expiration (possibly related to ARP weirdness?)
-
I have a weird case of regular network outages while trying out a new fiber internet service.
Here is the problem:
I get the connection established, and it seems to work great.
At about the sixth hour after the first DHCP lease is established, or rather just before that point (5 seconds or so, it seems), there is a DHCP request broadcast (not unicast as usual), and there is a loss of route to the pinger target (I set the monitor IP to 8.8.8.8). Since the ISP requested that I try the connection with a computer directly connected to the ONT, I also used a Linux laptop to look at the connection.
Here is what I observed:
ARP strangeness
I noticed the following log entries:
Dec 17 09:43:58 kernel arp: <MAC of ONT device> is using my IP address <DHCP leased WAN IP address> on igb0!
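To line these kernel messages up against a packet trace, it can help to pull them out of the system log with their timestamps. The following is only a minimal sketch: the function name `find_arp_conflicts` is made up for illustration, the regular expression assumes the log line layout quoted above, and the MAC and IP in the sample are documentation placeholders, not the real values.

```python
import re

# Pattern for the kernel message quoted above. The syslog prefix layout
# ("Dec 17 09:43:58 kernel ...") is assumed from that single excerpt.
ARP_CONFLICT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+.*?arp:\s+"
    r"(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+is using my IP address\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+on\s+(?P<iface>\w+)!"
)


def find_arp_conflicts(lines):
    """Return (timestamp, mac, ip, interface) for each ARP conflict line."""
    hits = []
    for line in lines:
        match = ARP_CONFLICT.search(line)
        if match:
            hits.append(
                (match["stamp"], match["mac"], match["ip"], match["iface"])
            )
    return hits


# Hypothetical sample lines: placeholder MAC and IP, not values from the trace.
sample = [
    "Dec 17 09:43:58 kernel arp: 00:00:5e:00:53:01 is using my IP address 192.0.2.10 on igb0!",
    "Dec 17 09:44:02 kernel some unrelated message",
]
for hit in find_arp_conflicts(sample):
    print(hit)
```

Feeding it the lines of the real system log instead of `sample` would give a list of conflict times to compare with the ARP bursts in the trace.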
A network trace indeed shows the ONT MAC issuing ARP queries on behalf of the IP address that was assigned to the WAN:
In the above screenshot, SuperMic_f6:b3:7d is the ONT, SuperMic_f4:a4:22 is the pfSense WAN, and Nokia_93:cc:59 is the gateway.
As the ISP requested that I connect a computer directly to the ONT, I also did a network trace from a Linux laptop. That laptop connection did not show entries for ARP requests from the ONT.
Since the pfSense connection was active (even though the test ran overnight) while the laptop was just sitting there, I suspected that there might be an issue with how often entries in the ARP table expire. On Linux the expiry defaults to 25-30 seconds, while the pfSense default is 1200 seconds. I set the tunable net.link.ether.inet.max_age to 25 seconds, but the issue with the errant ARP requests persists.
I also noticed that these inappropriate(?) ARP requests (in groups of 3, 1 second apart) seem to occur at a decreasing interval, where they "culminate" in a lost route. The sequence in the next screenshot shows them occurring (in offsets from the start of the trace, which was started just before the WAN interface was brought up): 10663 s / 16090 s / 18803 s / 20159 s / 20837 s / 21177 s / 21346 s / 21430 s / 21473 s and at that point the problem starts:
When I look at the full trace around that time point, I can see that the sequence is:
- ping request from the pfSense monitor (successful)
- a DHCP request broadcast, from the ONT (?!)
- an errant ARP request from the ONT
- answered by pfSense as a gratuitous request notice
- followed by a combined ICMP message "Destination unreachable, port unreachable" and a DHCP ACK to the broadcast request. The packet is sent from the ONT MAC address to the gateway MAC address, again with the ONT impersonating the IP address of the pfSense WAN
- after that comes the first unsuccessful ping request from the pfSense monitor; there is no reply, and the subsequent packets are just unsuccessful DNS requests and transmissions.
The screenshot of that is here:
I also managed to catch the failed state and ran ping from the pfSense shell. The requests to 8.8.8.8 went unanswered, while the requests to the gateway's IP address came back as "ping: sendto: Host is down"
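The "decreasing interval" of the ARP bursts listed earlier can be made concrete with a little arithmetic. This is just a sketch over the offsets quoted in this post; the variable names are arbitrary.

```python
# Offsets (seconds from trace start) of the ARP bursts, copied from the post.
offsets = [10663, 16090, 18803, 20159, 20837, 21177, 21346, 21430, 21473]

# Gap between each burst and the one before it.
gaps = [later - earlier for earlier, later in zip(offsets, offsets[1:])]

# Ratio of each gap to the previous gap; about 0.5 means the interval halves.
ratios = [round(b / a, 2) for a, b in zip(gaps, gaps[1:])]

# If the gaps keep halving, the remaining gaps add up to roughly the last gap
# (geometric series), so the bursts converge on about this offset.
estimated_limit = offsets[-1] + gaps[-1]

print(gaps)
print(ratios)
print(estimated_limit)
```

The gaps come out as 5427, 2713, 1356, 678, 340, 169, 84 and 43 seconds, so each interval is about half of the previous one, and the series converges at roughly 21516 s into the trace, close to the six-hour mark (21600 s). That looks similar to the retry timing RFC 2131 describes for a DHCP client that gets no answer while renewing or rebinding (wait half of the remaining time, then try again), although whether that is what actually drives the ARP bursts here is only a guess.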
Given the strange DHCP broadcast request shown above, I started looking at that as well.
DHCP Strangeness
One of the things that I noticed was that the strange DHCP request was broadcast, and not unicast. An earlier thread in this forum mentioned that sometimes ONT units are misconfigured and can't properly forward DHCP requests. I added the suggested
supersede dhcp-server-identifier 255.255.255.255
to no effect - everything was the same. Another thread on the internet mentioned that the ONT may need a much shorter lease renewal, so I also added
supersede dhcp-lease-time 3600
The DHCP traffic from the same trace can be seen here:
I also looked at the DHCP traffic on the Linux laptop. And lo and behold, the same pattern of a DHCP request broadcast from 0.0.0.0 followed by a DHCP ACK, and the combined ICMP (host unreachable) / DHCP ACK packet:
I added a script to ping 8.8.8.8 every 5 seconds, and that script detected the connectivity problem on the Linux laptop, but only with one type of USB3 dongle as the network device. The same laptop, connected to a Thunderbolt 2 docking station, still showed the same pattern in the DHCP connections, but without degradation of the service - pings to 8.8.8.8 were successful immediately following the ICMP/DHCP (host unreachable) combo packet.
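For anyone wanting to reproduce the laptop test, a ping monitor along these lines would do. This is not the script that was actually used, just a minimal sketch: the names `ping_once` and `monitor` are made up, and the `-c`/`-W` flags assume the Linux iputils version of ping (the FreeBSD ping on pfSense interprets its timeout options differently).

```python
import subprocess
import time
from datetime import datetime


def ping_once(target="8.8.8.8", timeout_s=2):
    """Send one ICMP echo; True if answered. Flags assume Linux iputils ping."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(target="8.8.8.8", interval_s=5, rounds=None, probe=ping_once):
    """Probe every interval_s seconds and log whenever reachability flips.

    rounds=None keeps going until interrupted; a number stops after that
    many probes and returns the recorded (timestamp, ok) state changes.
    """
    changes = []
    previous = None
    count = 0
    while rounds is None or count < rounds:
        ok = probe(target)
        if ok != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {target} {'reachable' if ok else 'UNREACHABLE'}")
            changes.append((stamp, ok))
            previous = ok
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval_s)
    return changes
```

Calling `monitor()` with the defaults pings 8.8.8.8 every 5 seconds until interrupted and prints a timestamped line only when the state changes, which makes it easy to match an outage against the DHCP and ARP packets in a trace.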
FWIW, on the pfSense, the pattern happens with both Kea and ISC DHCP
What seems to be the problem?
I am beyond the boundaries of my meager networking knowledge, and I have no clue where to go from here.
The changes I have made to the pfSense setup are:
- Disabled Gateway Monitoring Action
- Set 8.8.8.8 to be used for monitoring
- Set net.link.ether.inet.max_age to 25 seconds
- Set the DHCP Client section on the WAN to advanced and
  - set BSD Default configuration
  - set supersede dhcp-server-identifier 255.255.255.255 and supersede dhcp-lease-time 3600
There are several obvious questions, about which I have no clue:
- Why is the DHCP pattern present in the Linux trace, but with the MAC address of the Linux network device and not that of the ONT? Is it possible that in Linux the network stack just can't fathom that another device would "speak for" the network device, and assumes that it is always the local device as long as the IP address is the one assigned to it?
- Why is the network disruption in Linux of much shorter duration compared to pfSense (and practically non-existent when using the Thunderbolt docking station)?
Finally the hardware on the pfSense side is a Netgate C2758 with an Intel i345 network device.
Any help or suggestions would be greatly appreciated.
-
@vassil_netgate said in Fiber ISP, ONT Calix, and loss of WAN routes at the time of original DHCP lease expiration (possibly related to ARP weirdness?):
with an Intel i345 network device
You mean i350? A 4 port 1G expansion card?
Since it seems to be in some way NIC dependent I would try reassigning WAN to one of the other NICs available if you can.
That does seem like some truly weird behaviour!
You might also try a different pfSense device with completely different NICs if you can. Even if there's nothing connected to it other than WAN. Perhaps on the laptop for example.
Steve
-
@stephenw10 Yes, I mis-typed the card model, it is I354.
pciconf -lv | grep -A1 -B3 network
igb0@pci0:0:20:0: class=0x020000 rev=0x03 hdr=0x00 vendor=0x8086 device=0x1f41 subvendor=0x15d9 subdevice=0x1f41
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection I354'
    class      = network
    subclass   = ethernet
Is there a VM image that I can use to run pfSense on the laptop? And is there any documentation on how to set up the networking in such a case?
-
There's no VM image specifically but you can use the 2.7.2 ISO image to install in a VM: https://www.pfsense.org/download/
If you have any real hardware available I would try that first but if not a VM is still a good test.