WAN connection drops every 10 minutes, AT&T Fiber Modem & pfSense VIP

ttmcmurry

Hello,

I'm having a problem identical to another user in this post.

UPDATED POST

The original goal was a "poor man's failover" (a cable swap) for when the internet goes down while I'm away from home or on a business trip and can't troubleshoot it myself: simply bypass pfSense and plug directly into the AT&T modem, with little or no effect on static or DHCP clients. The problem stems from how I set up the LAN side. I configured pfSense's LAN interface at .1, then created a VIP at .254, which matches the AT&T modem's LAN IP even though the modem itself sits outside pfSense's LAN. I assumed that because pfSense was attached to the AT&T modem in DMZ+ (IP Passthrough) mode, this would have no implications for pfSense. I was wrong.

What I didn't realize is how AT&T's modem handles DHCP in DMZ+ mode. Consider the DHCP lease handed to pfSense's WAN port by the AT&T modem:

lease {
  interface "hn0";
  fixed-address 71.136.153.110;
  next-server 192.168.1.254;
  option subnet-mask 255.255.252.0;
  option time-offset -14400;
  option routers 71.136.152.1;
  option domain-name-servers 192.168.1.254;
  option domain-name "attlocal.net";
  option dhcp-lease-time 600;
  option dhcp-message-type 5;
  option dhcp-server-identifier 192.168.1.254;
  option dhcp-renewal-time 300;
  option dhcp-rebinding-time 525;
  option dhcp-client-identifier 1:0:15:5d:2:b3:4;
  renew 0 2020/5/3 20:22:09;
  rebind 0 2020/5/3 20:25:54;
  expire 0 2020/5/3 20:27:09;
}

The AT&T modem's DHCP lease names 192.168.1.254 as the DHCP server (dhcp-server-identifier), and that same address existed in pfSense as a VIP in the directly connected LAN subnet. The WAN was ultimately asking the LAN DHCP server for its renewal, but was blocked from it by bogon blocking. The lease would expire, and a DHCP broadcast would then retrieve a WAN lease again. Removing the VIP at .254 fixed the issue, and the internet has worked correctly ever since.
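The overlap described above can be checked mechanically. Below is a minimal Python sketch, not anything pfSense itself runs: it pulls the dhcp-server-identifier out of a dhclient lease block and compares it against the addresses the firewall owns locally. The `LOCAL_ADDRESSES` list and both function names are hypothetical, and the assumption that the LAN interface is 192.168.1.1 is inferred from ".1" plus the modem's 192.168.1.254.

```python
import ipaddress
import re

# Trimmed copy of the lease shown above (only the fields this check needs).
LEASE_TEXT = """
lease {
  interface "hn0";
  fixed-address 71.136.153.110;
  option routers 71.136.152.1;
  option dhcp-server-identifier 192.168.1.254;
}
"""

# Assumption: addresses pfSense owns on the LAN side (interface IP plus the VIP).
LOCAL_ADDRESSES = ["192.168.1.1", "192.168.1.254"]


def server_identifier(lease_text):
    """Return the dhcp-server-identifier of a dhclient lease block, or None."""
    match = re.search(r"option dhcp-server-identifier\s+([0-9.]+);", lease_text)
    return ipaddress.ip_address(match.group(1)) if match else None


def conflicts_with_local(server, local_addresses):
    """True when the DHCP server's address is also configured on this host."""
    return any(server == ipaddress.ip_address(addr) for addr in local_addresses)


server = server_identifier(LEASE_TEXT)
if server is not None and conflicts_with_local(server, LOCAL_ADDRESSES):
    print(f"{server} is both the WAN DHCP server and a local address; "
          "unicast renewals are unlikely to reach the modem")
```

With the VIP removed from `LOCAL_ADDRESSES`, the check no longer reports a conflict, which matches what happened once the VIP was deleted.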

Does dhclient have a 'request DHCP only via broadcast' option? Said another way, can it override the next-server / dhcp-server-identifier options and just use a broadcast?

If AT&T's 600-second DHCP lease weren't so aggressive, I might never have noticed: with a lease of 86400 seconds (1 day), it would have been a once-a-day blip.

Original post below, from when I thought it was possibly an interaction with Hyper-V.
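The arithmetic behind the 10-minute cycle can be worked through with the lease values above. This is a rough Python sketch under two assumptions: the lease was acquired at 20:17:09 (inferred by subtracting the 300-second renewal time from the `renew` timestamp), and every lease runs all the way to expiry because unicast renewal fails. The function names are made up for illustration.

```python
from datetime import datetime, timedelta


def lease_timeline(acquired, lease_time, renewal_time, rebinding_time):
    """Return the renew (T1), rebind (T2) and expiry times of a DHCP lease."""
    return {
        "renew": acquired + timedelta(seconds=renewal_time),
        "rebind": acquired + timedelta(seconds=rebinding_time),
        "expire": acquired + timedelta(seconds=lease_time),
    }


def expiries_per_day(lease_time):
    """How many times a lease of this length expires in 24 hours if never renewed."""
    return 86400 // lease_time


# Assumption: acquisition time inferred from the lease's renew timestamp minus 300 s.
acquired = datetime(2020, 5, 3, 20, 17, 9)
timeline = lease_timeline(acquired, lease_time=600, renewal_time=300, rebinding_time=525)

for name, when in timeline.items():
    print(name, when.strftime("%H:%M:%S"))

print(expiries_per_day(600), "expiries per day with a 600 s lease")
print(expiries_per_day(86400), "expiry per day with an 86400 s lease")
```

The computed renew, rebind and expire times line up with the timestamps in the lease, and the same failure that shows up 144 times a day at 600 seconds would show up once a day at 86400.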

Before troubleshooting anything else, I rebooted the ONT, modem, and Hyper-V host; it made no difference either way. To get the basics out of the way:

(1) AT&T fiber ISP. Their modem/gateway is correctly configured in passthrough mode (DMZ+). The logs on the AT&T modem reveal nothing to raise suspicion. In the system log I can see the WAN interface allocated to pfSense showing "UP" every 10 minutes. No other logs correlate to the 10-minute interval, except the system log. The modem doesn't appear to be at fault. No disruptions in service.

(2) Hyper-V (Win10) is completely up to date. pfSense is the only running VM. Host is running on a 4th Gen 4c/8t Xeon E3v3, 32GB RAM, Intel I-350QP NIC, SSD. The NIC ports are cabled as: #1 AT&T modem, #2 LAN switch, #3 4G LTE/TMO Cradlepoint, #4 NAS. There are 3 virtual switches, (1) AT&T, (2) Inside, (3) LTE, each with only 1 vNIC, connected to the VM. Each virtual switch has MAC spoofing permitted and no filters or hardware assist enabled. System Sleep and Screen Sleep are disabled.

(3) pfSense VM is 2.4.5. This is a new, from-scratch deployment that is 2 days old. All NIC hardware acceleration is disabled (Tunables). Kernel PTI & Meltdown disabled. There are 3 interfaces: WAN/ATT, LAN/LAN, OPT1/TMO. There are no defined firewall rules aside from what is automatically generated at pfSense's install. I'm in Hybrid AON mode, with 1 Static Port mapping for an Xbox; all else is auto-generated. States have been capped at 15000 to respect AT&T's 15460 state limit (I don't get close to this limit). Snort is installed and only enabled on the LAN interface. Two gateway groups are defined (IPv4/IPv6) with a total of 4 defined gateways. The gateway groups are set to prefer ATT over TMO. Appropriate Ping/Packet Loss settings have been applied for Fiber (lower tolerance) & LTE (higher tolerance). Each gateway pings a different IP address (one of four OpenDNS IPs). The ping response is 100% all the time, except when the WAN interface goes down every 10 minutes. Both AT&T & LTE/TMO are operating correctly with IPv4 and IPv6. LAN is tracking AT&T's interface for IPv6. The LAN IPv4 interface is .1 and I've defined a .254 VIP so I don't have to re-IP my statically configured devices. There is no protocol-based routing, no VPN.

There are two types of log sequences: one where ARP complains and one where it doesn't. When ARP complains, it's about AT&T's IPv4 next-hop router, 71.136.152.1. Even though the pfSense logs indicate a WAN IP change, that is not the case: it has remained 71.136.153.110 for several weeks now.

In Hyper-V, I tried enabling/disabling network hardware offload. No difference.
In pfSense, I originally kept the default hardware offload, then turned it off. No difference.
In pfSense, I've tried with and without Spectre/Meltdown mitigations enabled. No difference.
Rebooting all associated hardware & pfSense. No difference.

I'm unsure what might be happening. The regularity of 10 minutes should make it easier to correlate logs with events, but I'm unable to find the pattern. For example, the routing logs indicate dpinger can't reach a target, and that's true: the WAN goes down, all routes are pulled, then it comes back online within a few ping cycles. dpinger isn't the cause; it's a symptom of whatever is causing the WAN to cycle offline/online.

Log 1 - No ARP

May 3 17:01:06	php-fpm	37183	/rc.start_packages: Restarting/Starting all packages.
May 3 17:01:05	check_reload_status		Starting packages
May 3 17:01:05	php-fpm	340	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 71.136.153.110 -> 71.136.153.110 - Restarting packages.
May 3 17:01:03	php-fpm	340	/rc.newwanip: Creating rrd update script
May 3 17:01:03	php-fpm	340	/rc.newwanip: Resyncing OpenVPN instances for interface ATT.
May 3 17:01:00	dhcpleases		kqueue error: unknown
May 3 17:01:00	dhcpleases		Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
May 3 17:01:00	dhcpleases		/etc/hosts changed size from original!
May 3 17:00:54	php-fpm	340	/rc.newwanip: dpinger: timeout while retrieving status for gateway ATT_DHCP
May 3 17:00:51	php-fpm	340	/rc.newwanip: dpinger: timeout while retrieving status for gateway ATT_DHCP6
May 3 17:00:47	php-fpm	340	/rc.newwanip: Removing static route for monitor 2620:119:35::35 and adding a new route through fe80::2c30:44ff:fe29:803f%hn2
May 3 17:00:47	php-fpm	340	/rc.newwanip: Removing static route for monitor 208.67.220.220 and adding a new route through 100.78.192.246
May 3 17:00:47	php-fpm	340	/rc.newwanip: Removing static route for monitor 2620:119:53::53 and adding a new route through fe80::de7f:a4ff:fe8d:16c1%hn0
May 3 17:00:47	php-fpm	340	/rc.newwanip: Removing static route for monitor 208.67.222.222 and adding a new route through 71.136.152.1
May 3 17:00:40	dhcpleases		/etc/hosts changed size from original!
May 3 17:00:40	php-fpm	340	/rc.newwanip: rc.newwanip: on (IP address: 71.136.153.110) (interface: ATT[wan]) (real interface: hn0).
May 3 17:00:40	php-fpm	340	/rc.newwanip: rc.newwanip: Info: starting on hn0.

Log 2 - Kernel/ARP error

May 3 16:40:58	php-fpm	23764	/rc.start_packages: Restarting/Starting all packages.
May 3 16:40:57	check_reload_status		Starting packages
May 3 16:40:57	php-fpm	37183	/rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 71.136.153.110 -> 71.136.153.110 - Restarting packages.
May 3 16:40:55	php-fpm	37183	/rc.newwanip: Creating rrd update script
May 3 16:40:55	php-fpm	37183	/rc.newwanip: Resyncing OpenVPN instances for interface ATT.
May 3 16:40:53	dhcpleases		kqueue error: unknown
May 3 16:40:52	dhcpleases		Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) does not exist, No such process.
May 3 16:40:52	dhcpleases		/etc/hosts changed size from original!
May 3 16:40:50	php-fpm	37183	/rc.newwanip: Default gateway setting Interface TMO_DHCP Gateway as default.
May 3 16:40:47	php-fpm	37183	/rc.newwanip: dpinger: timeout while retrieving status for gateway ATT_DHCP
May 3 16:40:43	php-fpm	37183	/rc.newwanip: dpinger: timeout while retrieving status for gateway ATT_DHCP6
May 3 16:40:40	php-fpm	37183	/rc.newwanip: Removing static route for monitor 2620:119:35::35 and adding a new route through fe80::2c30:44ff:fe29:803f%hn2
May 3 16:40:40	php-fpm	37183	/rc.newwanip: Removing static route for monitor 208.67.220.220 and adding a new route through 100.78.192.246
May 3 16:40:40	php-fpm	37183	/rc.newwanip: Removing static route for monitor 2620:119:53::53 and adding a new route through fe80::de7f:a4ff:fe8d:16c1%hn0
May 3 16:40:40	php-fpm	37183	/rc.newwanip: Removing static route for monitor 208.67.222.222 and adding a new route through 71.136.152.1
May 3 16:40:33	dhcpleases		/etc/hosts changed size from original!
May 3 16:40:33	php-fpm	37183	/rc.newwanip: rc.newwanip: on (IP address: 71.136.153.110) (interface: ATT[wan]) (real interface: hn0).
May 3 16:40:33	php-fpm	37183	/rc.newwanip: rc.newwanip: Info: starting on hn0.
May 3 16:40:32	check_reload_status		rc.newwanip starting hn0
May 3 16:40:31	kernel		arpresolve: can't allocate llinfo for 71.136.152.1 on hn0
^^^ Above may repeat one or more times
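
One way to look for the pattern is to pull the timestamps of the "rc.newwanip: Info: starting" lines out of the system log and measure the gaps between them. The Python sketch below does that for the two excerpts above. The function names are hypothetical, and the year is an assumption, since syslog lines don't carry one (2020 is taken from the lease timestamps).

```python
from datetime import datetime

# The two "starting on hn0" lines from Log 1 and Log 2 above.
SAMPLE_LOG = (
    "May 3 17:00:40\tphp-fpm\t340\t/rc.newwanip: rc.newwanip: Info: starting on hn0.\n"
    "May 3 16:40:33\tphp-fpm\t37183\t/rc.newwanip: rc.newwanip: Info: starting on hn0.\n"
)


def wan_restart_times(log_text, year=2020):
    """Return sorted timestamps of every 'rc.newwanip ... starting' log line."""
    times = []
    for line in log_text.splitlines():
        if "rc.newwanip: Info: starting on" not in line:
            continue
        # The first three whitespace-separated fields are month, day and time.
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{stamp} {year}", "%b %d %H:%M:%S %Y"))
    return sorted(times)


def intervals_seconds(times):
    """Seconds between each pair of consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


print(intervals_seconds(wan_restart_times(SAMPLE_LOG)))
```

The two excerpts are not consecutive events, so the gap here is about 20 minutes, or two cycles. Run against a full system log, gaps clustering near 600 seconds would point at the DHCP lease time rather than at the hypervisor or the NIC.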

provels

Have you tried running a constant ping to an external site?
Pretty obvious, but you have disabled all power saving on the server...?

ttmcmurry

@provels - Thanks for this. Of course I had to type out a litany before discovering that the problem was one I had created without realizing it.

TL;DR: the AT&T WAN DHCP lease specified the IP to renew from. That IP also existed on pfSense's LAN side as a VIP, so the WAN couldn't reach it for renewal. At expiry, a new DHCP broadcast would occur and everything would be good for another 10 minutes.

I removed the VIP and everything went back to normal. I've updated the original post to be clearer now that I've figured it out.