Rapid DOWN-UP of WAN PHY + IP change causes complete loss of WAN sometimes



  • This has been plaguing me for a long time and now with 2.4.0 stable it's been easier to try to tease out what might be going wrong.

    The setup:
    LAN <–--> SG-1000 <----> HFC Cable Modem
    NOTE: I have zero control over the cable modem as it is controlled by the infrastructure provider (NBN Australia).

    The problem is that Internet connectivity will sometimes completely disappear and not come back unless the SG-1000 is power cycled (have not confirmed yet whether it can be accessed via web console when it is in this state).

    From studying what happens in multiple logs it seems that:

    1. ISP sets WAN IP via DHCP with 150 second renew time.

    2. Seemingly randomly (2-3 times a day or once a week) WAN IP is changed.

    3. A moment before the IP is changed the cable modem will in under one second force the link to pfsense DOWN then UP.

    4. From this point on, it's all bets are off. Either:
      a) ~ 50% of the time IP will update and WAN connection stays
      b) ~ 50% of the time WAN dies. Internet cannot be accessed from the LAN and also the unit cannot be pinged from the Internet.

    In the case where Internet access dies, it seems that the new IP sticks, but some other odd things happen. Combined logs:

    
    07:35:35 dhclient: RENEW
    07:35:35 dhclient: Creating resolv.conf
    07:38:05 dhclient: RENEW
    07:38:05 dhclient: Creating resolv.conf
    07:40:35 dhclient: RENEW
    07:40:35 dhclient: Creating resolv.conf
    07:43:05 dhclient: RENEW
    07:43:05 dhclient: Creating resolv.conf
    07:45:35 dhclient: RENEW
    07:45:35 dhclient: Creating resolv.conf
    07:48:05 dhclient: RENEW
    07:48:05 dhclient: Creating resolv.conf
    07:49:35 kernel: cpsw0: link state changed to DOWN
    07:49:35 kernel: cpsw0: link state changed to UP
    07:49:35 check_reload_status: Linkup starting cpsw0
    07:49:35 check_reload_status: Linkup starting cpsw0
    07:49:36 php-fpm[88066]: /rc.linkup: DEVD Ethernet detached event for wan
    07:49:36 dhclient[9971]: connection closed
    07:49:36 dhclient[9971]: exiting.
    07:49:37 miniupnpd[58525]: ioctl(s, SIOCGIFADDR, ...): Can't assign requested address
    07:49:37 miniupnpd[58525]: Failed to get IP for interface cpsw0
    07:49:37 miniupnpd[58525]: SendNATPMPPublicAddressChangeNotification: cannot get public IP address, stopping
    07:49:37 check_reload_status: Reloading filter
    07:49:37 php-fpm[13091]: /rc.linkup: DEVD Ethernet attached event for wan
    07:49:37 php-fpm[13091]: /rc.linkup: HOTPLUG: Configuring interface wan
    07:49:38 miniupnpd[58525]: SendNATPMPPublicAddressChangeNotification: sendto(s_udp=10): No route to host
    07:49:38 dhclient: PREINIT
    07:49:38 dhclient[59642]: DHCPREQUEST on cpsw0 to 255.255.255.255 port 67
    07:49:38 dhclient[59642]: DHCPNAK from ##ISP GATEWAY##
    07:49:38 dhclient[59642]: DHCPDISCOVER on cpsw0 to 255.255.255.255 port 67 interval 1
    07:49:39 miniupnpd[58525]: ioctl(s, SIOCGIFADDR, ...): Can't assign requested address
    07:49:39 miniupnpd[58525]: Failed to get IP for interface cpsw0
    07:49:39 miniupnpd[58525]: SendNATPMPPublicAddressChangeNotification: cannot get public IP address, stopping
    07:49:39 miniupnpd[58525]: SendNATPMPPublicAddressChangeNotification: sendto(s_udp=10): No route to host
    07:49:39 dhclient[59642]: DHCPDISCOVER on cpsw0 to 255.255.255.255 port 67 interval 2
    07:49:39 dhclient[59642]: DHCPOFFER from ##ISP GATEWAY##
    07:49:39 dhclient: ARPSEND
    07:49:39 dhclient[59642]: DHCPOFFER from ##ISP GATEWAY##
    07:49:39 dhclient[59642]: DHCPOFFER already seen.
    07:49:39 dhclient[59642]: DHCPOFFER from ##ISP GATEWAY 2##
    07:49:39 dhclient: ARPSEND
    07:49:39 dhclient: ARPCHECK
    07:49:39 dhclient[59642]: DHCPREQUEST on cpsw0 to 255.255.255.255 port 67
    07:49:39 dhclient[59642]: DHCPACK from ##ISP GATEWAY##
    07:49:39 dhclient: BOUND
    07:49:39 dhclient: Starting add_new_address()
    07:49:39 dhclient: ifconfig cpsw0 inet ##WAN IP## netmask 255.255.0.0 broadcast ##WAN BROADCAST##
    07:49:39 dhclient: New IP Address (cpsw0): ##WAN IP##
    07:49:39 dhclient: New Subnet Mask (cpsw0): 255.255.0.0
    07:49:39 dhclient: New Broadcast Address (cpsw0): ##WAN BROADCAST##
    07:49:39 dhclient: New Routers (cpsw0): ##ISP GATEWAY##
    07:49:39 dhclient: Adding new routes to interface: cpsw0
    07:49:40 dnsmasq[97316]: exiting on receipt of SIGTERM
    07:49:40 dhclient: /sbin/route add default ##ISP GATEWAY##
    07:49:40 dhclient: Creating resolv.conf
    07:49:40 dhclient[59642]: bound to ##WAN IP## -- renewal in 150 seconds.
    07:49:40 check_reload_status: rc.newwanip starting cpsw0
    07:49:40 php-fpm[13091]: /rc.linkup: ROUTING: setting default route to ##ISP GATEWAY##
    07:49:40 check_reload_status: Restarting ipsec tunnels
    07:49:41 dhcpleases: /etc/hosts changed size from original!
    07:49:41 dhcpleases: Sending HUP signal to dns daemon(97316)
    07:49:41 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:41 kernel: arpresolve: can't allocate llinfo for ##ISP GATEWAY 2## on cpsw0
    07:49:41 php-fpm[88066]: /rc.newwanip: rc.newwanip: Info: starting on cpsw0.
    07:49:41 kernel: arpresolve: can't allocate llinfo for ##ISP GATEWAY 2## on cpsw0
    07:49:41 php-fpm[88066]: /rc.newwanip: rc.newwanip: on (IP address: ##WAN IP##) (interface: WAN[wan]) (real interface: cpsw0).
    07:49:41 dhcpleases: /etc/hosts changed size from original!
    07:49:41 php-fpm[88066]: /rc.newwanip: IP Address has changed, killing states on former IP Address 106.71.88.7.
    07:49:41 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:42 dhcpleases: /etc/hosts changed size from original!
    07:49:42 dhcpleases: Sending HUP signal to dns daemon(97316)
    07:49:42 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:42 dhcpleases: kqueue error: unkown
    07:49:42 dhcpleases: Sending HUP signal to dns daemon(71274)
    07:49:42 dnsmasq[71274]: started, version 2.78 cachesize 10000
    07:49:42 dnsmasq[71274]: compile time options: IPv6 GNU-getopt no-DBus i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect no-inotify
    07:49:42 dnsmasq[71274]: reading /etc/resolv.conf
    07:49:42 dnsmasq[71274]: using nameserver ##ISP DNS###53
    07:49:42 dnsmasq[71274]: using nameserver ##ISP DNS 2###53
    07:49:42 dnsmasq[71274]: read /etc/hosts - 2 addresses
    07:49:42 dnsmasq[71274]: read /etc/hosts - 4 addresses
    07:49:42 dhcpleases: /etc/hosts changed size from original!
    07:49:42 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:42 dhcpleases: kqueue error: unkown
    07:49:53 rc.gateway_alarm[85709]: >>> Gateway alarm: WAN_DHCP (Addr:##ISP GATEWAY 2## Alarm:1 RTT:16620ms RTTsd:2417ms Loss:22%)
    07:49:53 check_reload_status: updating dyndns WAN_DHCP
    07:49:53 check_reload_status: Restarting ipsec tunnels
    07:49:53 check_reload_status: Restarting OpenVPN tunnels/interfaces
    07:49:53 check_reload_status: Reloading filter
    
    

    After all that the IP lease is renew via DHCP although the behaviour is a bit different:

    
    07:52:09 dhclient: Creating resolv.conf
    07:52:09 dhclient[66767]: bound to ##WAN IP## -- renewal in 150 seconds.
    07:54:39 dhclient[66767]: DHCPREQUEST on cpsw0 to ##ISP GATEWAY## port 67
    07:54:40 dhclient[66767]: DHCPACK from ##ISP GATEWAY##
    
    

    This is a remote site so it is hard to debug, but I thought of trying to maybe recreate the issue by:

    1. Unplugging then replugging the WAN cable
    2. Power cycling the cable modem with a 1 minute off time

    I've found several issues and threads which are similar but the solution either didn't work for me or it's been patched. The fact that it is 50/50 makes me wonder if this is some race condition in the rc.linkup and/or rc.newwanip scripts which pop up in the logs.

    Any ideas, thoughts on what it might be or other ways I can dig up information would be greatly appreciated.

    Thanks

    EDIT 1:

    • I get the same behaviour with unbound instead of dnsmasq
    • Internet does not go down with the ISP supplied crappy router

    EDIT 2:
    Currently running 2.4.0-RELEASE (arm)



  • This is the system log of the fail event but also up until the unit was power cycled.  I noticed further down shortly before the power cycle several of these messages:

    08:26:06 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:26:15 kernel: cpsw1: link state changed to DOWN
    08:26:15 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:26:27 kernel: cpsw1: link state changed to UP

    where cpsw1 is the LAN

    Also:
    08:31:37 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:37 kernel: cpsw0: link state changed to DOWN
    08:31:37 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:40 kernel: cpsw0: link state changed to UP
    08:31:40 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed

    where cpsw0 is the WAN.

    Didn't really take note of it earlier because I was so focused on the messages at the failure time, but maybe they mean something. There's not a lot of information on the net about these messages.

    Full system.log from start of failure until power cycle:

    
    07:49:35 kernel: cpsw0: link state changed to DOWN
    07:49:35 kernel: cpsw0: link state changed to UP
    07:49:35 check_reload_status: Linkup starting cpsw0
    07:49:35 check_reload_status: Linkup starting cpsw0
    07:49:36 php-fpm[88066]: /rc.linkup: DEVD Ethernet detached event for wan
    07:49:37 check_reload_status: Reloading filter
    07:49:37 php-fpm[13091]: /rc.linkup: DEVD Ethernet attached event for wan
    07:49:37 php-fpm[13091]: /rc.linkup: HOTPLUG: Configuring interface wan
    07:49:40 check_reload_status: rc.newwanip starting cpsw0
    07:49:40 php-fpm[13091]: /rc.linkup: ROUTING: setting default route to ###ISP GATEWAY 1###
    07:49:40 check_reload_status: Restarting ipsec tunnels
    07:49:41 kernel: arpresolve: can't allocate llinfo for ###ISP GATEWAY 2### on cpsw0
    07:49:41 php-fpm[88066]: /rc.newwanip: rc.newwanip: Info: starting on cpsw0.
    07:49:41 kernel: arpresolve: can't allocate llinfo for ###ISP GATEWAY 2### on cpsw0
    07:49:41 php-fpm[88066]: /rc.newwanip: rc.newwanip: on (IP address: ###NEW WAN IP###) (interface: WAN[wan]) (real interface: cpsw0).
    07:49:41 dhcpleases: /etc/hosts changed size from original!
    07:49:41 php-fpm[88066]: /rc.newwanip: IP Address has changed, killing states on former IP Address 106.71.88.7.
    07:49:41 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:42 dhcpleases: /etc/hosts changed size from original!
    07:49:42 dhcpleases: Could not deliver signal HUP to process because its pidfile (/var/run/dnsmasq.pid) does not exist, No such process.
    07:49:42 dhcpleases: kqueue error: unkown
    07:49:53 rc.gateway_alarm[85709]: >>> Gateway alarm: WAN_DHCP (Addr:###ISP GATEWAY 2### Alarm:1 RTT:16620ms RTTsd:2417ms Loss:22%)
    07:49:53 check_reload_status: updating dyndns WAN_DHCP
    07:49:53 check_reload_status: Restarting ipsec tunnels
    07:49:53 check_reload_status: Restarting OpenVPN tunnels/interfaces
    07:49:53 check_reload_status: Reloading filter
    07:51:36 sshd[82454]: Timeout, client not responding.
    07:59:46 check_reload_status: Linkup starting cpsw1
    07:59:46 kernel: cpsw1: link state changed to DOWN
    08:00:06 kernel: cpsw1: link state changed to UP
    08:00:06 check_reload_status: Linkup starting cpsw1
    08:00:15 check_reload_status: Linkup starting cpsw1
    08:00:15 kernel: cpsw1: link state changed to DOWN
    08:00:28 check_reload_status: Linkup starting cpsw1
    08:00:28 kernel: cpsw1: link state changed to UP
    08:25:43 kernel: cpsw1: link state changed to DOWN
    08:25:43 check_reload_status: Linkup starting cpsw1
    08:26:06 kernel: cpsw1: link state changed to UP
    08:26:06 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:26:15 kernel: cpsw1: link state changed to DOWN
    08:26:15 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:26:27 kernel: cpsw1: link state changed to UP
    08:26:27 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:30:51 kernel: cpsw1: link state changed to DOWN
    08:30:51 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:30:55 kernel: cpsw0: link state changed to DOWN
    08:30:55 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:14 kernel: cpsw1: link state changed to UP
    08:31:14 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:20 kernel: cpsw1: link state changed to DOWN
    08:31:20 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:25 kernel: cpsw0: link state changed to UP
    08:31:25 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:37 kernel: cpsw1: link state changed to UP
    08:31:37 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:37 kernel: cpsw0: link state changed to DOWN
    08:31:37 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:40 kernel: cpsw0: link state changed to UP
    08:31:40 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:31:43 kernel: cpsw0: link state changed to DOWN
    08:31:43 kernel: sonewconn: pcb 0xc3abe840: pru_attach() failed
    08:33:08 syslogd: kernel boot file is /boot/kernel/kernel
    08:33:08 kernel: Copyright (c) 1992-2017 The FreeBSD Project.
    08:33:08 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    
    

    [



  • Does your wan Access come back if you power cycle your cable modem without rebooting PFSense?  I have, what may be, a similar issue but I can restore internet by rebooting only
    The cable modem. It started the day after I upgraded to 2.4 and unfortunately it is still randomly happening after 2.4.1 - I’m considering rolling back for a while until the dust settles more on 2.4 🙁


Log in to reply