WAN interfaces fail to return after power outage



  • Hi all,

    I have 2 WAN interfaces:

    igb0: WAN (DHCP) fixed wireless NBN connection
    igb1: WWAN (DHCP) 4G cellular connection via ethernet from a Netgear Nighthawk M1 MR1100

    When the power goes off I have noticed that both WAN connections remain "pending". They have received an IP address but there's no action. I checked the System Logs and under the General tab I find this:

    Sep 29 12:19:00 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:05 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:05 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:06 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:10 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:11 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:12 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0
    Sep 29 12:19:15 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0

    etc etc

    If I disable one of the WAN interfaces and then enable it, the interface comes back online. So whenever the power goes out, I need to remember to log into pfSense, disable both WAN interfaces, re-enable them, and then we're back online.

    There doesn't seem to be any info in the System Log about igb1 WWAN.

    Running 2.4.3-RELEASE-p1 (amd64)

    Any ideas what's going on here?
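
    For anyone wanting to quantify how often the kernel message above repeats, here is a minimal Python sketch. It assumes the log has been exported one entry per line as "timestamp process message" (that layout is an assumption, not something pfSense guarantees), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the kernel message quoted in the post. The single-line layout
# ("Mon DD HH:MM:SS kernel message") is an assumed export format.
ARP_FAIL = re.compile(
    r"arpresolve: can't allocate llinfo for (?P<gateway>\S+) on (?P<iface>\S+)"
)


def tally_arp_failures(lines):
    """Count arpresolve failures per (interface, gateway) pair."""
    counts = Counter()
    for line in lines:
        match = ARP_FAIL.search(line)
        if match:
            counts[(match["iface"], match["gateway"])] += 1
    return counts


sample = [
    "Sep 29 12:19:00 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0",
    "Sep 29 12:19:05 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0",
    "Sep 29 12:19:05 kernel arpresolve: can't allocate llinfo for 100.65.48.1 on igb0",
]

for (iface, gateway), count in tally_arp_failures(sample).items():
    print(f"{iface}: {count} failures resolving {gateway}")
```

    A tally like this also makes it obvious whether a second interface (igb1 here) ever logs the same error.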


  • Netgate Administrator

    Is that the correct gateway address for your WAN connection?

    If not you might be getting a lease from the local modem until it detects the link is back up. In which case you can set DHCP to refuse leases from that device.

    Steve



  • Thanks Stephenw10. Yes, that's the correct gateway address for our WAN; it gets handed to us via DHCP from the ISP. There is no modem involved.


  • Netgate Administrator

    Hmm, well it appears it's not responding to ARP requests for some reason.

    Are you spoofing your MAC address maybe?

    When it's in that state try running ifconfig -v igb0

    And if possible run a packet capture on that interface to see what it's actually sending and what's coming back (if anything).

    Steve
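
    Once a capture exists, a rough way to see which ARP requests never got an answer is sketched below in Python. It assumes the capture was saved as text in the usual tcpdump style ("ARP, Request who-has X tell Y" and "ARP, Reply X is-at MAC"). The sample lines, the client address 100.65.48.57 and the 192.0.2.x hosts are invented for illustration and are not taken from this thread.

```python
import re
from collections import Counter

# tcpdump-style text lines; the exact layout is an assumption about how
# the capture was exported.
REQUEST = re.compile(r"ARP, Request who-has (\d+\.\d+\.\d+\.\d+)")
REPLY = re.compile(r"ARP, Reply (\d+\.\d+\.\d+\.\d+) is-at")


def unanswered_targets(lines):
    """Return {target_ip: request_count} for targets that never replied."""
    requested = Counter()
    answered = set()
    for line in lines:
        request = REQUEST.search(line)
        if request:
            requested[request.group(1)] += 1
            continue
        reply = REPLY.search(line)
        if reply:
            answered.add(reply.group(1))
    return {ip: n for ip, n in requested.items() if ip not in answered}


# Hypothetical capture: the gateway never answers, another host does.
sample = [
    "12:19:00.000100 ARP, Request who-has 100.65.48.1 tell 100.65.48.57, length 28",
    "12:19:05.000100 ARP, Request who-has 100.65.48.1 tell 100.65.48.57, length 28",
    "12:19:06.000100 ARP, Request who-has 192.0.2.1 tell 192.0.2.10, length 28",
    "12:19:06.000300 ARP, Reply 192.0.2.1 is-at 00:00:5e:00:53:01, length 28",
]

print(unanswered_targets(sample))
```

    If the gateway shows up here with a growing count, the requests are leaving the interface and nothing is coming back. If there are no requests at all, the problem is on the sending side.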



  • Thanks Steve.

    I'm not spoofing my MAC address.

    What exactly is it that's not responding to ARP requests?

    I'll do some testing and try to replicate it so I can do a capture. Thanks again.


  • Netgate Administrator

    It's whatever is upstream. If you connect directly and pull a DHCP lease it's whatever that is giving you as a gateway, 100.65.48.1.

    Steve



  • Seems that way, or might it be a problem with pfSense, because it was happening with both WANs, which are 2 completely different services? And if it were a problem upstream, it wouldn't go away from just unplugging and replugging the cable, right?



  • @stephenw10

    Please see my post about a similar issue, and another thread I responded to. I think I know what is happening here, and it needs to be documented and/or added as a default setting in new installs.

    It seems that the dhclient code has been changed to respect the "interface-mtu" option that is being issued via DHCP by some CMTS equipment.

    In the cases I have seen, the MTU is being set to 576, rather than being left at the pfSense default of 1500.

    In addition, the interface-mtu option issued via DHCP takes precedence over an MTU explicitly set by the user. To override the bad interface-mtu being set via DHCP, dhclient must be instructed to ignore the option. This is done by setting supersede interface-mtu 0 in the "Option modifiers" section of "Lease Requirements and Requests".

    When the MTU is set too small, it seems that DHCP renewals are failing. The failures coincide with errors like:
    arpresolve: can't allocate llinfo for $GATEWAY on $INTERFACE

    I think this is biting a lot of users on the upgrade to 2.4.4-RELEASE, and the symptoms are quirky. If left alone, the interface may stay up, but certain websites will become inaccessible due to packet fragmentation.

    PRIOR POSTS ON SAME SUBJECT:
    https://forum.netgate.com/topic/136089/solved-and-revised-2-4-4-release-arpresolve-can-t-allocate-llinfo-for-gateway-on-interface0-dhcp-mtu-576

    https://forum.netgate.com/topic/136253/frequent-internet-loss-need-help-figuring-out-where-and-why-maybe-pfsense-modem-isp-or-all-3


  • Netgate Administrator

    Wow, nice catch. And what are they doing sending 576?! 😕

    That does seem like it could be related at least. I'll be watching with anticipation...

    Steve


  • Netgate Administrator

    Actually that should have been resolved already in https://redmine.pfsense.org/issues/8507

    If the supersede workaround does work for you, then that fix has somehow missed your install.

    Steve



  • From what I have read, I see references to the 576 MTU related to dialup connections. This might be an old fall-back that is being exposed only because dhclient now respects the option interface-mtu value being sent by the DHCP server. The value shows up in the issued lease. The changes in dhclient upstream are now exposing this.

    This is worth exploring in connection with reports of WAN interface disconnections, unpredictable website connectivity, and may affect things like name resolution. When combined with the "IP Do-Not-Fragment compatibility" option in System/Advanced/Firewall&NAT, the small MTU breaks connectivity with some websites. I saw problems with the iHeart Radio website and streams, and with loading newyorker.com. Please propagate this up the chain. My earlier post has links to the issues as they are discussed in the FreeBSD development system.

    https://forum.netgate.com/topic/136089/solved-and-revised-2-4-4-release-arpresolve-can-t-allocate-llinfo-for-gateway-on-interface0-dhcp-mtu-576
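
    To put numbers on why a 576-byte MTU hurts, here is a small worked calculation in Python. It assumes plain IPv4 and TCP headers of 20 bytes each with no options. The function names are made up for illustration.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def tcp_mss(mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER - TCP_HEADER


def fragment_count(packet_size, mtu):
    """Number of IPv4 fragments needed for a packet of packet_size bytes."""
    if packet_size <= mtu:
        return 1
    payload = packet_size - IP_HEADER
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(tcp_mss(1500))               # 1460
print(tcp_mss(576))                # 536
print(fragment_count(1500, 576))   # 3
```

    So a full-size 1500-byte packet has to be split into three pieces to cross a 576-byte link. If the packet has the Don't Fragment bit set, it cannot be split at all, and it is dropped instead.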



  • @stephenw10

    The fix discussed in Redmine doesn't seem to have made it into 2.4.4-RELEASE.

    Rather, with the patched version of dhclient now in the FreeBSD tree, the fix is that the user must set "supersede interface-mtu 0" to ignore the option 26 information. Dhclient still requests option 26 from the DHCP server; the patch only makes it possible to supersede option 26 as issued with the lease.



  • I have opened a Redmine account, and posted in the relevant thread.


  • Netgate Administrator

    I agree, it's certainly worth exploring. It could explain a number of threads here.

    The value supersede interface-mtu 0 should be in the dhclient conf files in /var/etc by default. It is on everything I've just checked. If some connections are still seeing a 576 MTU then there must be some combination of factors that prevent it being added. If that is the case we need to find out what they are and stop that happening.

    Steve
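
    A quick way to check whether the override actually made it into a given config is sketched below in Python. The two sample configs are illustrative only and are not copied from a real pfSense box. On the firewall you would feed in the contents of the dhclient conf files under /var/etc mentioned above; their exact file names are not given in this thread.

```python
import re

# The statement the thread says should be present by default.
OVERRIDE = re.compile(r"\bsupersede\s+interface-mtu\s+0\b")


def has_mtu_override(conf_text):
    """True if a non-comment line contains 'supersede interface-mtu 0'."""
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0]  # ignore anything after a '#'
        if OVERRIDE.search(line):
            return True
    return False


# Illustrative config snippets (hypothetical, not real pfSense output).
WITH_OVERRIDE = """
interface "igb0" {
    timeout 60;
    supersede interface-mtu 0;
}
"""

WITHOUT_OVERRIDE = """
interface "igb0" {
    timeout 60;
}
"""

print(has_mtu_override(WITH_OVERRIDE))     # True
print(has_mtu_override(WITHOUT_OVERRIDE))  # False
```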



  • I have a sneaking suspicion that a prior manual setting of MTU on the interface may be interfering with the setting of supersede interface-mtu 0 in dhclient.conf on upgrade. I know that I have previously hard set the MTU to 1500 on a number of boxes as a matter of course. In this instance, the hard set MTU will not be respected if supersede interface-mtu 0 is not making it into dhclient.conf.


  • Netgate Administrator

    I think you're right. Working on something now....



  • Jim Pingle, the developer working on this, has entered a new diff. Apparently, checking the advanced options checkbox, saving and applying the config with no other changes entered, and then upgrading to 2.4.4-RELEASE is enough to disrupt the fix the developers had put in place for the option 26 interface-mtu bug introduced by the new dhclient.


  • Rebel Alliance Developer Netgate

    I added a new note with a workaround to the Upgrade Guide: https://www.netgate.com/docs/pfsense/install/upgrade-guide.html#upgrading-from-versions-older-than-pfsense-2-4-4

    A patch is available that can be added with the System Patches package.

    The fix is discussed on https://redmine.pfsense.org/issues/8507



  • Thank you! This is wonderful.



  • I think this bug also applies to fresh installs using a restored config, not just on in-place upgrades. That is the case for the system I encountered this on.


  • Rebel Alliance Developer Netgate

    @bfeitell said in WAN interfaces fail to return after power outage:

    I think this bug also applies to fresh installs using a restored config, not just on in-place upgrades. That is the case for the system I encountered this on.

    Since it is a setting in the configuration and not a problem on the filesystem, that is correct. If you restore a config with advanced or custom options set there, it would fail this way.



  • It is an insidious bug. I triggered the DHCP renewal problems by saving and applying on the WAN with or without changes. Unless triggered by the user, it will lurk until the next DHCP renewal fails, and that may not happen for 30 minutes or more. Looking through recent forum posts, I suspect this bug is in play whenever a user notices arpresolve: can't allocate llinfo for $GATEWAY on $INTERFACE errors.



  • Okay so it went a bit over my head there. Can someone please break it down for me? Where are we up to with this one? Is there a patch? Or is a configuration change needed?


  • Netgate Administrator

    First try just adding the following to the Option modifiers field in the advanced section of the DHCP setup on WAN. Check the 'Advanced Configuration' box to see that field if it's not already checked.
    supersede interface-mtu 0

    If that works then you can try the patch instead. That would be a helpful test for us.

    Steve



  • Hey Steve, okay I tried it (see screenshots) but it didn't change anything. The system log didn't report any errors this time though.

    [Attached screenshots: Screenshot_20181008-080516.png, Screenshot_20181008-080614.png, Screenshot_20181008-080428.png]

    I didn't have the WWAN connected this time so that's why it's not showing up. To get the WAN connection going again after a power outage, I need to either unplug and replug the Ethernet cable, disable and re-enable the interface, or make some change in the WAN interface and save.


  • Netgate Administrator

    Hmm, no 'arpresolve' errors though?

    Did you ever try running ifconfig -av during the working and non-working states to compare them?

    Steve
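
    If eyeballing two ifconfig captures is error-prone, a small Python sketch like the one below can pull out the MTU and link status per interface and report only what changed. It assumes FreeBSD-style output ("name: flags=... mtu N" header lines and indented "status:" lines). The two sample captures are invented to show the comparison and are not from this thread.

```python
import re

# Interface header line, e.g. "igb0: flags=8843<...> metric 0 mtu 1500"
HEADER = re.compile(r"^(?P<name>[\w.]+): flags=\S+.*?\bmtu (?P<mtu>\d+)")
STATUS = re.compile(r"^\s+status: (?P<status>.+?)\s*$")


def summarize(ifconfig_text):
    """Map interface name -> {'mtu': int, 'status': str or None}."""
    summary = {}
    current = None
    for line in ifconfig_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header["name"]
            summary[current] = {"mtu": int(header["mtu"]), "status": None}
            continue
        status = STATUS.match(line)
        if status and current:
            summary[current]["status"] = status["status"]
    return summary


def changed_fields(working_text, broken_text):
    """Return {iface: {field: (working_value, broken_value)}} for differences."""
    before, after = summarize(working_text), summarize(broken_text)
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        diff = {
            field: (before[name][field], after[name][field])
            for field in before[name]
            if before[name][field] != after[name][field]
        }
        if diff:
            changes[name] = diff
    return changes


# Hypothetical captures: same interface, different MTU.
WORKING = "\n".join([
    "igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500",
    "\tstatus: active",
])
BROKEN = WORKING.replace("mtu 1500", "mtu 576")

print(changed_fields(WORKING, BROKEN))
```

    An empty result means the MTU and status are identical in both states, which would point away from the interface-mtu theory.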



  • Did you see the above images? One of them is of the syslog while it was happening after putting in the string into the Option Modifiers field.

    Yes I did, but it's on my other computer, I'll paste it here later.


  • Netgate Administrator

    Yes I see those. I don't see any arpresolve errors in there but I thought you may have seen some that aren't in that shot. That only shows 10s worth of logs.

    Steve



  • yeah that's the last few entries from it booting, then there's nothing else after that.

    Here's the output from ifconfig -v but according to www.diffchecker.com there's no difference in the output whether it's working properly or not:

    Petes-MBP:~ Peter$ ifconfig -v
    lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384 index 1
     	eflags=12000000<ECN_DISABLE,SENDLIST>
     	options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
     	inet 127.0.0.1 netmask 0xff000000 
     	inet6 ::1 prefixlen 128 
     	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
     	nd6 options=201<PERFORMNUD,DAD>
     	link quality: 100 (good)
     	state availability: 0 (true)
     	timestamp: disabled
     	qosmarking enabled: no mode: none
    gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280 index 2
     	eflags=1000000<ECN_ENABLE>
     	state availability: 0 (true)
     	qosmarking enabled: no mode: none
    stf0: flags=0<> mtu 1280 index 3
     	eflags=1000000<ECN_ENABLE>
     	state availability: 0 (true)
     	qosmarking enabled: no mode: none
    XHC20: flags=0<> mtu 0 index 4
     	eflags=41000000<ECN_ENABLE,FASTLN_ON>
     	state availability: 0 (true)
     	qosmarking enabled: yes mode: none
    en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500 index 5
     	eflags=41200880<TXSTART,ARPLL,NOACKPRI,ECN_ENABLE,FASTLN_ON>
     	ether a0:99:9b:14:37:55 
     	inet6 fe80::42d:fd6:9a6d:c5ec%en0 prefixlen 64 secured scopeid 0x5 
     	inet 10.20.63.133 netmask 0xffffff00 broadcast 10.20.63.255
     	nd6 options=201<PERFORMNUD,DAD>
     	media: autoselect
     	status: active
     	type: Wi-Fi
     	link quality: 100 (good)
     	state availability: 0 (true)
     	scheduler: FQ_CODEL (driver managed)
     	link rate: 53.95 Mbps
     	qosmarking enabled: yes mode: none
    p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304 index 6
     	eflags=41000080<TXSTART,ECN_ENABLE,FASTLN_ON>
     	ether 02:99:9b:14:37:55 
     	media: autoselect
     	status: inactive
     	type: Wi-Fi
     	state availability: 0 (true)
     	scheduler: FQ_CODEL (driver managed)
     	link rate: 10.00 Mbps
     	qosmarking enabled: yes mode: none
    awdl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1484 index 7
     	eflags=413e0080<TXSTART,LOCALNET_PRIVATE,ND6ALT,RESTRICTED_RECV,AWDL,NOACKPRI,ECN_ENABLE,FASTLN_ON>
     	ether 0a:60:dd:0c:a4:0f 
     	inet6 fe80::860:ddff:fe0c:a40f%awdl0 prefixlen 64 scopeid 0x7 
     	nd6 options=201<PERFORMNUD,DAD>
     	media: autoselect
     	status: active
     	type: Wi-Fi
     	state availability: 0 (true)
     	scheduler: FQ_CODEL (driver managed)
     	link rate: 10.00 Mbps
     	qosmarking enabled: yes mode: none
    en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 index 8
     	eflags=41000080<TXSTART,ECN_ENABLE,FASTLN_ON>
     	options=60<TSO4,TSO6>
     	ether 6a:00:00:9f:55:50 
     	media: autoselect <full-duplex>
     	status: inactive
     	type: Ethernet
     	state availability: 0 (true)
     	scheduler: FQ_CODEL 
     	qosmarking enabled: yes mode: none
    en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 index 9
     	eflags=41000080<TXSTART,ECN_ENABLE,FASTLN_ON>
     	options=60<TSO4,TSO6>
     	ether 6a:00:00:9f:55:51 
     	media: autoselect <full-duplex>
     	status: inactive
     	type: Ethernet
     	state availability: 0 (true)
     	scheduler: FQ_CODEL 
     	qosmarking enabled: yes mode: none
    bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500 index 10
     	eflags=41000000<ECN_ENABLE,FASTLN_ON>
     	options=63<RXCSUM,TXCSUM,TSO4,TSO6>
     	ether 6a:00:00:9f:55:50 
     	Configuration:
     		id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
     		maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
     		root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
     		ipfilter disabled flags 0x2
     	member: en1 flags=3<LEARNING,DISCOVER>
     	        ifmaxaddr 0 port 8 priority 0 path cost 0
     	        hostfilter 0 hw: 0:0:0:0:0:0 ip: 0.0.0.0
     	member: en2 flags=3<LEARNING,DISCOVER>
     	        ifmaxaddr 0 port 9 priority 0 path cost 0
     	        hostfilter 0 hw: 0:0:0:0:0:0 ip: 0.0.0.0
     	media: <unknown type>
     	status: inactive
     	state availability: 0 (true)
     	qosmarking enabled: yes mode: none
    utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000 index 11
     	eflags=1002080<TXSTART,NOAUTOIPV6LL,ECN_ENABLE>
     	inet6 fe80::aae2:a7ca:ad8a:540%utun0 prefixlen 64 scopeid 0xb 
     	nd6 options=201<PERFORMNUD,DAD>
     	agent domain:ids501 type:clientchannel flags:0xc3 desc:"IDSNexusAgent ids501 : clientchannel"
     	state availability: 0 (true)
     	scheduler: FQ_CODEL 
     	qosmarking enabled: no mode: none
    utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380 index 12
     	eflags=1002080<TXSTART,NOAUTOIPV6LL,ECN_ENABLE>
     	inet6 fe80::a4b9:a2b2:95c:b086%utun1 prefixlen 64 scopeid 0xc 
     	inet6 fdb7:98e5:bc83:2490:a4b9:a2b2:95c:b086 prefixlen 64 
     	nd6 options=201<PERFORMNUD,DAD>
     	state availability: 0 (true)
     	scheduler: FQ_CODEL 
     	qosmarking enabled: no mode: none
    Petes-MBP:~ Peter$
    

  • Rebel Alliance Developer Netgate

    @peter_richardson said in WAN interfaces fail to return after power outage:

    yeah that's the last few entries from it booting, then there's nothing else after that.

    Here's the output from ifconfig -v but according to www.diffchecker.com there's no difference in the output whether it's working properly or not:

    That appears to be from your Mac, not pfSense. Try that command on pfSense when it works and when it doesn't.



  • @jimp sorry, that was a silly mistake! I'll test again later today and report back.



  • @jimp said in WAN interfaces fail to return after power outage:

    Try that command on pfSense when it works and when it doesn't.

    Hey guys, sorry for the late reply, I have finally had a spare minute to do some testing. Please see attached.

    EDIT: I CAN'T REPLY BECAUSE I GET THIS ERROR:

    ERROR
    Post content was flagged as spam by Akismet.com


  • Netgate Administrator

    What are you trying to post? The ifconfig output?

    It will probably allow you to post a screenshot of it if it's still blocking you.

    Steve



  • Yes, 2 of them, but that would make 4 images plus the other image, and then you can't copy and paste text for analysis. Can we please fix this issue? Why would it mark my post as spam when I am a registered user, replying to a thread that I created, from the same public IP as always? And why not also tell me why it has been marked as spam so I can try to fix whatever its issue is?





  • anyone?


  • Netgate Administrator

    I up-voted enough of your posts that you should not get hit by Akismet again.

    Do you see anything logged showing it getting the 0.0.0.0 address? It looks like the WAN doesn't come back up because it sees that as a valid address. If something is giving that to it, you can exclude a particular DHCP server as a source.

    Steve



  • Thanks Stephen!

    So you're saying that it appears that pfSense is giving the WWAN interface the address of 0.0.0.0?


  • Netgate Administrator

    No, more likely it's something upstream giving it that before the connection comes up. Cable modems do that with monotonous regularity. 🙄 But I'm not sure what sort of hand-off you have. I know NBN can be several different things.
    If that is the case though you should see something logged in the DHCP logs as the dhclient pulls the bad address.

    Steve
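
    To hunt for that in the DHCP logs, something like the Python sketch below could help. It looks for dhclient's "DHCPACK from <server>" and "bound to <address>" messages and reports any lease for a suspicious address along with the server that handed it out. The sample log lines, PID and server address are invented for illustration, and the one-line log layout is an assumption.

```python
import re

ACK = re.compile(r"DHCPACK from (?P<server>\d+\.\d+\.\d+\.\d+)")
BOUND = re.compile(r"bound to (?P<addr>\d+\.\d+\.\d+\.\d+)")


def suspicious_leases(lines, bad_addresses=("0.0.0.0",)):
    """Return [(leased_address, dhcp_server)] for leases in bad_addresses."""
    findings = []
    last_server = None
    for line in lines:
        ack = ACK.search(line)
        if ack:
            last_server = ack["server"]
        bound = BOUND.search(line)
        if bound and bound["addr"] in bad_addresses:
            findings.append((bound["addr"], last_server))
    return findings


# Hypothetical log excerpt (addresses from the documentation range).
sample = [
    "Oct  8 08:04:10 dhclient[12345]: DHCPACK from 192.0.2.1",
    "Oct  8 08:04:10 dhclient[12345]: bound to 0.0.0.0 -- renewal in 30 seconds.",
    "Oct  8 08:05:40 dhclient[12345]: DHCPACK from 192.0.2.254",
    "Oct  8 08:05:40 dhclient[12345]: bound to 192.0.2.57 -- renewal in 1800 seconds.",
]

print(suspicious_leases(sample))
```

    The server address it reports is the one you would then exclude as a lease source on the WAN.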



  • I see. No, the WWAN isn't NBN, the WWAN is 4G with a Telstra SIM card, using a Netgear Nighthawk M1.

    So changing the WWAN interface from DHCP to static, with an IP address in the M1's usual DHCP range (i.e. 192.168.0.50), should fix this. But alas, it just shows on the dashboard as "gateway down", so it looks like this isn't the issue. In fact, once I set the IP to static, it didn't work at all. That makes me wonder what is going on: previously, any change to the WWAN connection would bring it back online, and a static address shouldn't prevent it from working under normal conditions, so long as it is within the M1's normal DHCP range. 😕
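
    One sanity check worth doing on a static setup is whether the address, mask and gateway actually agree with each other. Here is a minimal Python sketch using the standard ipaddress module. The /24 prefix and the 192.168.0.1 gateway are assumptions for the example; the thread does not state the M1's real mask or gateway.

```python
import ipaddress


def static_config_problems(address, prefix_len, gateway):
    """Return a list of human-readable problems with a static IPv4 setup."""
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    gw = ipaddress.ip_address(gateway)
    problems = []
    if gw == iface.ip:
        problems.append("gateway is the same as the interface address")
    elif gw not in iface.network:
        problems.append(f"gateway {gw} is outside {iface.network}")
    return problems


# 192.168.0.50 is from the post; the prefix and gateway are assumed values.
print(static_config_problems("192.168.0.50", 24, "192.168.0.1"))   # []
print(static_config_problems("192.168.0.50", 32, "192.168.0.1"))
print(static_config_problems("192.168.0.50", 24, "192.168.1.1"))
```

    An empty list only means the numbers are self-consistent. A static interface still needs its upstream gateway defined before traffic can go anywhere.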