PfSense 2.1, how to debug droped LAN connection



  • Hi,

    I've been running pfSense 2.1  snapshots for some time now on an x86 system.  I was running a snapshot from around March 10th with about 2 months of up time with no issues.

    I did an update on Friday, 10-May and have had the system drop the LAN connection very, very frequently,  sometimes several times in an hour.  But had at least a day of stability.  I switched the WAN and LAN interfaces (one was on the motherboard and one was a PCI card) and it was still the LAN side that would get dropped.

    The PC is usually headless, but I stuck a monitor on it and would see the message "sk0 (or sk1 if I switched which one was the LAN interface) 2 link states coalesced" multiple times.

    The status line when sshing or looking at the console that lists the WAN and LAN and respective Addresses would show the LAN information blank after the " -> " characters.  Before I would reboot and the system would be back to normal for a little while.  But I found that if I just unplugged and re-plugged the the cable to the LAN interface, the status for the LAN port would be back to the static IP I had assigned and everything would be fine for while.  I'd lose the connection again later and could repeat the process of unplugging and replugging in the cable to get back online.

    the WAN side seems stable regardless of which physical interface it's mapped to.

    I changed nothing other than updating to the latest snapshot on the 10th and have been updating every day, but, if anything, the frequency of the LAN side going down has gotten worse.  sometimes as little as 30 minutes before it goes down.  Looking at the load on the machine when I'm at the console and the LAN side has disconnected, I don't see any sort of load on the machine.

    I've been pruning down my settings and rules and packages and have yet to see any difference.

    I'm at a loss as to how to capture more information about what is happening to try to get to root cause.  I swapped out the switch on the LAN side and that also had no impact.  I don't know if there's anything external that can contribute to this behavior.

    I'd love to be able to try reverting back to a march snapshot to see if that light switches the behavior – are there archives of past snapshots that far back?

    Could someone suggest something I can do to try to gather more information to help diagnose things.  I've seen some posts about issues on the WAN side sort of along the same lines, but for me the WAN side has seemed to be stable (it's a cable modem that gets a dynamic address).

    Thank you for all the thoughts and suggestions.

    --
    Jay



  • @synapxe:

    I'm at a loss as to how to capture more information about what is happening to try to get to root cause.

    An extract from the pfSense system log (see Status -> System Logs) would help show what pfSense was reporting. Include a little more than you think is relevant (say 3 or 4 entries before the start of the entries related to the trouble).

    Please post the output of pfSense shell commands:```
    /etc/rc.banner
    netstat -i

    
    Have you tried replacing the cable connected to the pfSense LAN interface?


  • @wallabybob:

    @synapxe:

    I'm at a loss as to how to capture more information about what is happening to try to get to root cause.

    An extract from the pfSense system log (see Status -> System Logs) would help show what pfSense was reporting. Include a little more than you think is relevant (say 3 or 4 entries before the start of the entries related to the trouble).

    Here is the contents from Status -> System Logs.  The last event happened right around 20:49

    
    May 19 21:05:49	check_reload_status: Reloading filter
    May 19 21:05:48	dhcpleases: Could not deliver signal HUP to process because its pidfile does not exist, No such process.
    May 19 21:05:47	check_reload_status: Syncing firewall
    May 19 21:01:46	login: login on ttyv0 as root
    May 19 21:01:46	login: login on ttyv0 as root
    May 19 21:01:44	check_reload_status: updating dyndns lan
    May 19 21:01:40	php: : HOTPLUG: Configuring interface lan
    May 19 21:01:40	php: : DEVD Ethernet attached event for lan
    May 19 21:01:38	kernel: sk1: link state changed to UP
    May 19 21:01:38	check_reload_status: Linkup starting sk1
    May 19 21:01:36	php: : DEVD Ethernet detached event for lan
    May 19 21:01:34	kernel: sk1: link state changed to DOWN
    May 19 21:01:34	check_reload_status: Linkup starting sk1
    May 19 20:55:54	check_reload_status: updating dyndns lan
    May 19 20:55:54	php: : The command '/usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd -cf /etc/dhcpd.conf -pf /var/run/dhcpd.pid sk1' returned exit code '1', the output was 'Internet Systems Consortium DHCP Server 4.2.4-P2 Copyright 2004-2012 Internet Systems Consortium. All rights reserved. For info, please visit https://www.isc.org/software/dhcp/ Wrote 0 deleted host decls to leases file. Wrote 0 new dynamic host decls to leases file. Wrote 0 leases to leases file. No subnet declaration for sk1 (no IPv4 addresses). ** Ignoring requests on sk1\. If this is not what you want, please write a subnet declaration in your dhcpd.conf file for the network segment to which interface sk1 is attached. ** Not configured to listen on any interfaces! If you did not get this software from ftp.isc.org, please get the latest from ftp.isc.org and install that before requesting help. If you did get this software from ftp.isc.org and have not yet read the README, please read it before requestin
    May 19 20:55:53	dhcpleases: Could not deliver signal HUP to process because its pidfile does not exist, No such process.
    May 19 20:55:50	php: : DEVD Ethernet detached event for lan
    May 19 20:55:50	php: : HOTPLUG: Configuring interface lan
    May 19 20:55:50	php: : DEVD Ethernet attached event for lan
    May 19 20:55:47	check_reload_status: Linkup starting sk1
    May 19 20:55:47	kernel: sk1: link state changed to UP
    May 19 20:55:47	kernel: sk1: link state changed to DOWN
    May 19 20:55:47	check_reload_status: Linkup starting sk1
    May 19 20:55:34	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 19 20:55:34	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 19 20:49:08	check_reload_status: updating dyndns lan
    May 19 20:49:07	dhcpleases: Could not deliver signal HUP to process because its pidfile does not exist, No such process.
    May 19 20:49:03	php: : HOTPLUG: Configuring interface lan
    May 19 20:49:03	php: : DEVD Ethernet attached event for lan
    May 19 20:49:01	kernel: sk1: link state changed to UP
    May 19 20:49:01	kernel: sk1: 2 link states coalesced
    May 19 20:49:01	check_reload_status: Linkup starting sk1
    --> lost connection around here.
    May 19 20:38:57	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 19 20:38:57	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 19 20:31:13	check_reload_status: updating dyndns lan
    May 19 20:31:11	dhcpleases: Could not deliver signal HUP to process because its pidfile does not exist, No such process.
    May 19 20:31:08	php: : HOTPLUG: Configuring interface lan
    May 19 20:31:08	php: : DEVD Ethernet attached event for lan
    May 19 20:31:06	kernel: sk1: link state changed to UP
    May 19 20:31:06	kernel: sk1: 2 link states coalesced
    May 19 20:31:06	check_reload_status: Linkup starting sk1
    May 19 20:13:08	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 19 20:13:08	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    
    

    I hadn't seen the messages about the PID file before – if there's something wrong with the filesystem and it can't update the file, could that do something strange like this?

    @wallabybob:

    Please post the output of pfSense shell commands:```
    /etc/rc.banner
    netstat -i

    Here is the contents of "/etc/rc.banner":

    
    *** Welcome to pfSense 2.1-BETA1-pfSense (i386) on gw ***
    
     WAN1 (wan)      -> sk0        -> v4/DHCP4: 72.197.240.138/22
     LAN (lan)       -> sk1        -> 
    
    

    while the LAN (sk1:) was down, I grabbed the contents of "netstat -i":

    
    Name               Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
    sk0               1500 <link#1>00:11:95:c0:ba:fc  5971251     0     0  2876406     0     0
    sk0               1500 fe80::211:95f fe80::211:95ff:fe        0     -     -        1     -     -
    sk0               1500 72.197.240.0/ ip72-197-240-138.    87609     -     -    48301     -     -
    fwe0*             1500 <link#2>02:30:1b:00:a6:05        0     0     0        0     0     0
    fwip0*            1500 <link#3>00:30:1b:b7:00:00:       0     0     0        0     0     0
    sk1               1500 <link#4>00:30:1b:b7:a5:a1  2890507     2     0  5575839     0     0
    sk1               1500 fe80::230:1bf fe80::230:1bff:fe        0     -     -        0     -     -
    enc0*             1536 <link#5>0     0     0        0     0     0
    pflog0*          33192 <link#6>0     0     0      314     0     0
    lo0              16384 <link#7>114     0     0      114     0     0
    lo0              16384 your-net      localhost              114     -     -      114     -     -
    lo0              16384 ::1           ::1                      0     -     -        0     -     -
    lo0              16384 fe80::1%lo0   fe80::1                  0     -     -        0     -     -
    pfsync0*          1460 <link#8>0     0     0        0     0     0
    pptpd0*           1500 <link#9>0     0     0        0     0     0
    pptpd1*           1500 <link#10>0     0     0        0     0     0
    pptpd2*           1500 <link#11>0     0     0        0     0     0
    pptpd3*           1500 <link#12>0     0     0        0     0     0
    pptpd4*           1500 <link#13>0     0     0        0     0     0
    pptpd5*           1500 <link#14>0     0     0        0     0     0
    pptpd6*           1500 <link#15>0     0     0        0     0     0
    pptpd7*           1500 <link#16>0     0     0        0     0     0
    pptpd8*           1500 <link#17>0     0     0        0     0     0
    pptpd9*           1500 <link#18>0     0     0        0     0     0</link#18></link#17></link#16></link#15></link#14></link#13></link#12></link#11></link#10></link#9></link#8></link#7></link#6></link#5></link#4></link#3></link#2></link#1> 
    

    Normally the LAN (sk1) has a static IP of 192.168.1.250

    I did remember I had turned on the "Do not forward private reverse lookups" for the DNS forwarder.  I'm turning that off now.

    Have you tried replacing the cable connected to the pfSense LAN interface?

    Yes, I tried a couple of different cables.  I've also cycled power on the other switches up stream from the pfsense box on the LAN interface.  The first switch up stream was swapped out too and the behavior didn't seem to change.

    the output of a "dmesg" command shows:

    
    sk1: link state changed to UP
    sk1: 2 link states coalesced
    sk1: link state changed to UP
    sk1: link state changed to DOWN
    sk1: link state changed to UP
    sk1: link state changed to DOWN
    sk1: link state changed to UP
    sk1: 2 link states coalesced
    sk1: link state changed to UP
    sk1: 2 link states coalesced
    sk1: link state changed to UP
    sk1: 2 link states coalesced
    sk1: link state changed to UP
    sk1: link state changed to DOWN
    sk1: link state changed to UP
    
    

    I'm noticed in looking the /var/log/system.log had a number of those messages from dhcpleases: "Could not deliver signal HUP to process because its pidfile does not exist, No such file or directory."

    I looked and the file seems to exist (this is after the sk1 was brought back up).  It's the only file in the /var/run directory with execute permissions though:
    "-rwxr-xr–  1 root  wheel        6 May 19 21:05 dhcpleases.pid"

    thank you very much for your thoughts with this

    --
    Jay



  • @synapxe:

    Here is the contents from Status -> System Logs.  The last event happened right around 20:49

    
    May 19 20:55:54	check_reload_status: updating dyndns lan
    
    May 19 20:49:08	check_reload_status: updating dyndns lan
    
    May 19 20:31:13	check_reload_status: updating dyndns lan
    
    

    You have configured Dynamic DNS to monitor your LAN?



  • @wallabybob:

    @synapxe:

    Here is the contents from Status -> System Logs.  The last event happened right around 20:49

    
    May 19 20:55:54	check_reload_status: updating dyndns lan
    
    May 19 20:49:08	check_reload_status: updating dyndns lan
    
    May 19 20:31:13	check_reload_status: updating dyndns lan
    
    

    You have configured Dynamic DNS to monitor your LAN?

    Not that the GUI shows – it lists my WAN1 for all 4 entries

    In my config.xml file, I see:

     <dyndns><type>dyndns</type>
                            <username>XXXX</username>
                            <password>XXXX</password>
                            <host>XXXX</host>
                            <mx><enable><interface>wan</interface>
                            <zoneid><ttl><updateurl><resultmatch><requestif>wan</requestif>
                            <descr><force><id>0</id></force></descr></resultmatch></updateurl></ttl></zoneid></enable></mx></dyndns> 
    
    

    so I think that it's using the right interface, but you're right that log message looks odd




  • The other thing that is odd is the LAN interface, supposedly configured with static IPv4 address, is sometimes losing that address. Did you reboot the system after interchanging the role of sk0 and sk1? If not, please reboot and see if things fail in the same way. If they do fail in the same sort of way please post the output of shell command```
    ps ax



  • *** Welcome to pfSense 2.1-BETA1-pfSense (i386) on gw ***

    WAN1 (wan)      -> sk0        -> v4/DHCP4: 72.197.240.138/22
    LAN (lan)      -> sk1        ->

    Now that I read this thread, I did have a case a few days ago when a system had lost its static LAN IP. The banner output on the console was just like the above. I ended up rebooting to get it back.
    Maybe my change to /etc/rc.linkup at https://github.com/pfsense/pfsense/commit/8e47f55b6ec58fa287b9cfb0815aa47ff9d4b069 has caused a problem with hotplug events on LAN. The change does make interface_configure do more stuff when a physical interface comes up again - that will be causing the "updating dyndns lan" messages (even though it will find no dyndns entries using LAN). It would be worth reversing the "true" to "false" from that GitHub diff and seeing if it helps you. Then we can look at what is special about a LAN hotplug event that needs handling better in interface_configure.

    On the physical side, why are the hotplug events happening "randomly" in the first place? Is a switch on the LAN side having short power cycling, or???



  • @wallabybob:

    The other thing that is odd is the LAN interface, supposedly configured with static IPv4 address, is sometimes losing that address. Did you reboot the system after interchanging the role of sk0 and sk1? If not, please reboot and see if things fail in the same way. If they do fail in the same sort of way please post the output of shell command```
    ps ax

    Hi,  Yes, a power cycle was done after switching sk0 and sk1.  After I saw no apparent effect, I switched them back to the original configuration and did another power cycle.

    I report back when I've got a failure and have captured the output of the ps command.

    One strange thing is that sometimes, the issue happens almost the instant I connect through the web interface.  I could have been browsing other sites or connecting through skype, or something else was using the network,  then I connect to the web interface (https) and it hangs and then the LAN side is down.  I can either cycle power, do a (ifconfig sk1 down; sleep; ifconfig sk1 up) to recover it or just unplug and re-plug the cable.  Very strange that sometimes it's right when I connect.  It's happened more than a single time.  Seems too much of a coincidence.

    Since I need remote access and can't always be available on site to fix things, I've added a 5 minute cron job to ping one of the LAN side servers.  If the ping times out, then the script does the ifconfig sk1 down; sleep 5; ifconfig sk1 up to recover and then, at least the network isn't down longer than about 5 minutes :-)

    Thanks again for your assistance.


    Jay



  • @phil.davis:

    *** Welcome to pfSense 2.1-BETA1-pfSense (i386) on gw ***

    WAN1 (wan)      -> sk0        -> v4/DHCP4: 72.197.240.138/22
    LAN (lan)       -> sk1        ->

    Now that I read this thread, I did have a case a few days ago when a system had lost its static LAN IP. The banner output on the console was just like the above. I ended up rebooting to get it back.
    Maybe my change to /etc/rc.linkup at https://github.com/pfsense/pfsense/commit/8e47f55b6ec58fa287b9cfb0815aa47ff9d4b069 has caused a problem with hotplug events on LAN. The change does make interface_configure do more stuff when a physical interface comes up again - that will be causing the "updating dyndns lan" messages (even though it will find no dyndns entries using LAN). It would be worth reversing the "true" to "false" from that GitHub diff and seeing if it helps you. Then we can look at what is special about a LAN hotplug event that needs handling better in interface_configure.

    On the physical side, why are the hotplug events happening "randomly" in the first place? Is a switch on the LAN side having short power cycling, or???

    Hi and thank you for your information.

    I tried swapping out the switch that's connected to the LAN port as well as the cable and didn't see any difference.  It does seem to be pretty random, but as I said in response to wallbybob's message, I have noticed that the issue can happen right when I connect to the web interface.  Everything seems fine up until then; when I click a link to the pfsense https interface, things hang.  This is not the only time that the hang happens, but it's happened this way on more than one occasion.

    I did look at the /etc/rc.linkup script and changed it to have the previous "interface_configure($iface, false, true);" statement as I saw on the git diff.

    This change seemed to be working great and it was up for 24 hours (better than anything I've had all week).  I had been opening the web interface fine.  I tried to connect (to the web interface) some time later, after I was browsing and connecting through the machine just fine.  As soon as I tried to connect, the link went down.  So I don't know if I was just lucky  that it had been up for so long with this change or not.

    Here is the system log for the failure (the last entry is when I logged in from the console to reset the link)

    
    May 21 17:51:34	login: login on ttyv0 as root
    May 21 17:50:30	php: : DEVD Ethernet detached event for lan
    May 21 17:50:29	php: : HOTPLUG: Configuring interface lan
    May 21 17:50:29	php: : DEVD Ethernet attached event for lan
    May 21 17:50:27	check_reload_status: Linkup starting sk1
    May 21 17:50:27	kernel: sk1: link state changed to UP
    May 21 17:50:27	kernel: sk1: link state changed to DOWN
    May 21 17:50:27	check_reload_status: Linkup starting sk1
    May 21 17:50:26	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    May 21 17:50:26	php: /index.php: Successful login for user 'admin' from: 192.168.1.100
    
    

    Seems like a second after I authenticated, the link was set to down.

    Very strange.

    Thank you again for your suggestions and thoughts.


    Jay



  • Please post the output of pfSense shell commands```
    ifconfig ; pciconf -l -v

    
    PERHAPS your LAN interface is running at 1Gbps and there isn't quite enough PCI bus bandwidth for both interfaces, the NIC reports an error and the driver cycles the link to recover.
    
    I wonder what would happen if you forced the LAN interface to 100Mbps.
    
    Perhaps enabling flow control on the link would help but the FreeBSD man page gives no clue on how to enable flow control or read its state.

Locked