pfSense WAN keeps dropping out every 10 mins



  • Hello, I have a problem with a firewall where the WAN connection keeps dropping every 10 or so minutes.

    The setup is as follows:
    Internet -> TalkTalk -> Openreach modem -> pfSense -> LAN (network).

    For the last 3 or more years, the pfSense box has connected to the Openreach modem via Ethernet cable and obtained its address from the TalkTalk network using DHCP. This is a UK FTTC connection. This worked perfectly until recently.

    I have swapped out the old pfSense box and replaced it with new hardware, a Supermicro 5018D-FN4T (the problem also happened on the old pfSense box).
    It is also running the newest pfSense version, 2.4.4, from a clean install with no config import.

    However, the WAN drops out about every 10 min, but does not hang pfSense and reconnects after a further 5 to 10 mins.

    I have attached part of the System-General log below.

    The log gives the error "kernel arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0" (The xx.xx.xx.xx is, I think, their WAN gateway IP on the TalkTalk network)

    I have read that adding "supersede interface-mtu 0" to the Option modifiers of the WAN interface fixes this problem.
    However, adding this has not fixed the problem. In fact I still see the arpresolve errors in the logs.

    I have also replaced the WAN network cable, the Openreach modem, and the phone line cable between the modem and the phone socket. So I have replaced everything from the phone socket into my network, and it still keeps dropping the WAN connection.

    I am completely at a loss and would greatly appreciate any help or ideas to try, that anyone can think of.
    Many thanks in advance.

    Nov 4 15:52:28	kernel		arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0
    Nov 4 15:52:28	kernel		arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0
    Nov 4 15:52:29	kernel		arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0
    Nov 4 15:52:29	kernel		arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0
    Nov 4 15:52:30	kernel		arpresolve: can't allocate llinfo for xx.xx.xx.xx on igb0
    Nov 4 15:52:30	check_reload_status		rc.newwanip starting igb0
    Nov 4 15:52:31	php-fpm	8772	/rc.newwanip: rc.newwanip: Info: starting on igb0.
    Nov 4 15:52:31	php-fpm	8772	/rc.newwanip: rc.newwanip: on (IP address: xx.xx.xx.xx) (interface: WAN[wan]) (real interface: igb0).
    Nov 4 15:52:31	dhcpleases		/etc/hosts changed size from original!
    Nov 4 15:52:34	php-fpm	8772	/rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through xx.xx.xx.xx
    Nov 4 15:52:34	php-fpm	8772	/rc.newwanip: Gateway, none 'available' for inet6, use the first one configured. ''
    Nov 4 15:52:37	dhcpleases		/etc/hosts changed size from original!
    Nov 4 15:52:37	dhcpleases		Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 4 15:52:37	dhcpleases		kqueue error: unkown
    Nov 4 15:52:39	php-fpm	8772	/rc.newwanip: Resyncing OpenVPN instances for interface WAN.
    Nov 4 15:52:39	php-fpm	8772	/rc.newwanip: Creating rrd update script
    Nov 4 15:52:42	php-fpm	8772	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - xx.xx.xx.xx -> xx.xx.xx.xx - Restarting packages.
    Nov 4 15:52:42	check_reload_status		Starting packages
    Nov 4 15:52:43	php-fpm	72176	/rc.start_packages: Restarting/Starting all packages.
    Nov 4 15:52:43	php-fpm	72176	[pfBlockerNG] Starting cron process.
    Nov 4 15:58:39	php-fpm	62120	/index.php: Successful login for user 'xxxx' from: 10.0.0.205 (Local Database)
    Nov 4 16:00:04	php		[pfBlockerNG] Starting cron process.
    Nov 4 16:00:35	php		[pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
    Nov 4 16:00:58	check_reload_status		Syncing firewall
    Nov 4 16:01:01	dhcpleases		/etc/hosts changed size from original!
    Nov 4 16:01:02	dhcpleases		Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
    Nov 4 16:01:02	dhcpleases		kqueue error: unkown
    Nov 4 16:01:04	dhcpleases		kqueue error: unkown
    Nov 4 16:06:08	rc.gateway_alarm	71635	>>> Gateway alarm: WAN_DHCP (Addr:8.8.8.8 Alarm:1 RTT:20.506ms RTTsd:.246ms Loss:21%)
    Nov 4 16:06:08	check_reload_status		updating dyndns WAN_DHCP
    Nov 4 16:06:08	check_reload_status		Restarting ipsec tunnels
    Nov 4 16:06:08	check_reload_status		Restarting OpenVPN tunnels/interfaces
    Nov 4 16:06:08	check_reload_status		Reloading filter
    Nov 4 16:06:09	php-fpm	77247	/rc.openvpn: Gateway, none 'available' for inet, use the first one configured. 'WAN_DHCP'
    Nov 4 16:06:09	php-fpm	77247	/rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
    Nov 4 16:06:09	php-fpm	77247	/rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_DHCP.
    

  • Netgate

    What is happening when those ARP resolve messages start? You showed the end, what about the beginning? Is the MTU showing anything strange on the interface when it is not working?

    dpinger is trying to ping the gateway address but it cannot because it is not receiving an ARP response for it on WAN. Then it miraculously does for some reason.

    If it were me I'd packet capture for ARP on WAN and see what is happening. I'd just set interface WAN, protocol ARP, and a packet count of 100000 or 1000000 and let it run. Then get the times of the start and end of the "can't allocate llinfo" logs and see what's happening there in Wireshark.
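
    As a rough aid for the last step, the start and end of each burst of "can't allocate llinfo" messages can be pulled out of a pasted copy of the System log. The sketch below is only an illustration: the function name `arp_error_bursts`, the 60-second gap used to split bursts, and the default year (syslog lines carry none) are all assumptions, not anything pfSense provides.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "Nov 4 15:52:28" part of a pfSense System log line.
STAMP = re.compile(r"^\s*(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")
MARKER = "can't allocate llinfo"


def arp_error_bursts(log_text, year=2018, max_gap=timedelta(seconds=60)):
    """Return a list of (first, last) datetimes, one pair per burst of errors.

    `year` is an assumption because the log lines do not include one.
    Lines are sorted by time first, so oldest-first and newest-first
    log views both work.
    """
    stamps = []
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        m = STAMP.match(line)
        if not m:
            continue
        month, day, clock = m.groups()
        stamps.append(
            datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        )

    bursts = []
    for when in sorted(stamps):
        if bursts and when - bursts[-1][1] <= max_gap:
            # Close enough to the previous error: extend the current burst.
            bursts[-1] = (bursts[-1][0], when)
        else:
            bursts.append((when, when))
    return bursts
```

    The resulting time ranges can then be used to jump to the matching packets in the capture.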



  • @derelict There isn't anything in the log before the arpresolve.
    There is a minute or so of nothing in the logs, before the arpresolve messages.

    I will run the packet capture, as you suggested, and see if anything jumps out to indicate the problem.
    I will post back here when I know more.

    Many thanks for the help.



  • So I have run a packet capture of the ARP packets during one of the outages.
    Not many packets were actually captured and I can't see anything obvious.
    I have masked my IP, and my MAC address is being spoofed as that of the original TT router, as I was testing to see if that had anything to do with it, as it does with other ISPs.

    18:05:42.642541 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.160.1 tell 62.64.167.xx, length 28
    18:05:42.650730 28:8a:1c:ed:5c:53 > 04:c0:6f:3c:f0:88, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 62.64.160.1 is-at 28:8a:1c:ed:5c:53, length 46
    18:08:27.713590 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.167.xx tell 62.64.167.xx, length 28
    18:08:27.734947 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.160.1 tell 62.64.167.xx, length 28
    18:08:27.740788 28:8a:1c:ed:5c:53 > 04:c0:6f:3c:f0:88, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 62.64.160.1 is-at 28:8a:1c:ed:5c:53, length 46
    18:13:38.449603 ac:1f:6b:4b:db:44 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 92.1.32.1 tell 92.1.39.xx, length 46
    18:19:15.670626 ac:1f:6b:4b:db:44 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 92.1.32.1 tell 92.1.39.xx, length 46
    18:21:10.750994 ac:1f:6b:4b:db:44 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 92.23.240.1 tell 92.23.246.xx, length 46
    18:28:23.847441 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.160.1 tell 62.64.167.xx, length 28
    18:28:23.855219 28:8a:1c:ed:5c:53 > 04:c0:6f:3c:f0:88, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 62.64.160.1 is-at 28:8a:1c:ed:5c:53, length 46
    18:31:06.363902 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.167.xx tell 62.64.167.xx, length 28
    18:31:06.534937 04:c0:6f:3c:f0:88 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 62.64.160.1 tell 62.64.167.xx, length 28
    18:31:06.540914 28:8a:1c:ed:5c:53 > 04:c0:6f:3c:f0:88, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 62.64.160.1 is-at 28:8a:1c:ed:5c:53, length 46
    
    

  • Netgate

    You will have to explain the relationship between those numbering schemes and your WAN. One (62.64.160.1) is being answered. The other (92.1.32.1) is not.

    Please just download the pcap and upload it to the link I sent in chat. I can't tell anything with all that obfuscation.
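
    For anyone who wants to check the answered/unanswered split on their own text capture, a small sketch like the following tallies ARP requests and replies per target address. It assumes the tcpdump line format shown in the post above; the names `arp_summary` and `unanswered` are made up for this example, and "unanswered" simply means no reply at all was seen for that address.

```python
import re
from collections import defaultdict

# Patterns for tcpdump text lines such as:
#   "... Request who-has 62.64.160.1 tell 62.64.167.xx, length 28"
#   "... Reply 62.64.160.1 is-at 28:8a:1c:ed:5c:53, length 46"
REQUEST = re.compile(r"Request who-has ([^\s,]+) tell ([^\s,]+)")
REPLY = re.compile(r"Reply ([^\s,]+) is-at ([0-9a-fA-F:]+)")


def arp_summary(capture_text):
    """Count ARP requests and replies per target IP address."""
    counts = defaultdict(lambda: {"requests": 0, "replies": 0})
    for line in capture_text.splitlines():
        req = REQUEST.search(line)
        if req:
            target, sender = req.groups()
            if target == sender:
                # Request for the sender's own address (announcement or
                # probe); no reply is expected, so it is not counted.
                continue
            counts[target]["requests"] += 1
            continue
        rep = REPLY.search(line)
        if rep:
            counts[rep.group(1)]["replies"] += 1
    return dict(counts)


def unanswered(summary):
    """Return the target IPs that never received a reply."""
    return sorted(ip for ip, c in summary.items() if c["replies"] == 0)
```

    Run on the thirteen capture lines posted above, this should report four requests and four replies for 62.64.160.1, and no replies for 92.1.32.1 or 92.23.240.1.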


  • Netgate

    Are you saying you have the same MAC address spoofed in the upstream router and pfSense WAN?



  • @derelict I have the MAC set in pfSense, to the MAC that is of the TalkTalk router that they supplied when I purchased the FTTC package from TalkTalk.
    The MAC setting is irrelevant: whether it is spoofed or not, the problem is the same.


  • Netgate

    And how about a similar capture of DHCP traffic? I'd love to see a capture with both ARP and DHCP, but you can't do that in the web GUI.

    Have you called TalkTalk? What do they have to say?



  • @derelict I will get a capture of the DHCP traffic also for you.

    I haven't talked to TalkTalk yet, as it will be a real pain trying to get someone useful to talk to.
    However, I put their supplied router back in for a little while and it seemed stable, though I will need to test with it for longer.


  • Netgate

    They have an extremely short DHCP lease time of 15 minutes.

    When pfSense decides it is time to renew at about half that, it tries every few seconds and there is no response.

    The lease eventually expires and is removed from the pfSense interface. You can then no longer ping their gateway address because the WAN is now unnumbered.

    pfSense dhclient then sends a request to the broadcast and gets no answer.

    pfSense dhclient then reverts to DHCPDISCOVER and, after several seconds, finally gets an answer.

    This looks like it results in roughly 90 seconds of having no interface address on WAN in the capture you sent.

    Is your modem in pure bridge mode or whatever you guys call it over there?

    In my opinion the DHCP server should be honoring the renewal requests and responding with an ACK. It looks to pfSense like the DHCP server disappeared.
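
    To put numbers on the sequence above: with the default timers from RFC 2131, a client starts renewing (T1) at 50% of the lease and falls back to broadcast rebinding (T2) at 87.5%. The sketch below just does that arithmetic for a 15-minute lease. The function name `dhcp_timers` is made up for the example, and it assumes the server does not override T1 or T2 with its own values, which the capture may or may not bear out.

```python
def dhcp_timers(lease_seconds):
    """Default RFC 2131 client timers, in seconds from lease acquisition.

    Assumes the server sends no explicit renewal (T1) or rebinding (T2)
    time, so the defaults of 50% and 87.5% of the lease apply.
    """
    return {
        "renew_T1": lease_seconds * 0.5,     # unicast DHCPREQUEST to the server
        "rebind_T2": lease_seconds * 0.875,  # broadcast DHCPREQUEST
        "expire": float(lease_seconds),      # address removed from the interface
    }


if __name__ == "__main__":
    for name, seconds in dhcp_timers(15 * 60).items():
        print(f"{name}: {seconds:.1f} s")
```

    With a 900-second lease that is renewal at 450 s, rebinding at 787.5 s and expiry at 900 s, so a server that never answers leaves only a few minutes between the first renewal attempt and the WAN losing its address.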


  • Netgate

    It looks to me like your WAN is doing exactly what is described here:

    https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol#Reliability

    The failure is in the lack of response from upstream (either the modem itself or the ISP).



  • Thank you for digging into the problem.
    Yes I believe my modem is set to pure bridge mode. I have not changed anything on the modem. Also I have tested with a modem that I know works on another TalkTalk line.

    I have started a conversation with TalkTalk on this problem and will see where I get with it.

    If anyone happens to know of a way to resolve this problem, I would love to know.


  • Netgate Administrator

    I don't think it will make any difference here, but just to confirm: is the modem you have an actual Openreach device? Is it the Huawei or the ECI 'modem'?
    As far as I know there is no difference between them in practical terms, and they are not configured any differently for TalkTalk.

    Steve