Several problems (incl. pfSense self-reboot) during failback after a long time period
-
Sounds like hardware issues.
Set a remote syslog server so that you can capture debugging information before the crash.
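For anyone without a syslog server at hand, a throwaway collector can be sketched in a few lines of Python. This is only an illustration, not something pfSense ships: the file name `pfsense-remote.log` and the function names are made up, and it assumes the firewall is set to send plain UDP syslog to this machine on port 514 (binding to a port below 1024 usually needs root).

```python
import socketserver
from datetime import datetime

LOG_PATH = "pfsense-remote.log"  # hypothetical file name


def format_record(sender_ip: str, payload: bytes, now: datetime) -> str:
    """Turn one raw syslog datagram into a single log line."""
    text = payload.decode("utf-8", errors="replace").strip()
    return f"{now.isoformat(timespec='seconds')} {sender_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        line = format_record(self.client_address[0], data, datetime.now())
        # Open, append and close per message so each line is handed
        # to the OS immediately instead of sitting in a buffer.
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def serve(host: str = "0.0.0.0", port: int = 514) -> None:
    """Listen for UDP syslog datagrams until interrupted."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Calling `serve()` on the collecting machine starts the listener. Because the file lives on a different box, the last lines before a crash are still there afterwards.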
-
…
pfSense computer reboot (it did it itself).
...
Was the reboot related to some special action, e.g. unplugging a cable? Also, is it reproducible, like "I cut the red wire, it blows up"?
Any way you can test this on different hardware or different NICs?
-
Sounds like hardware issues.
Set a remote syslog server so that you can capture debugging information before the crash.
This is what I found on the remote syslog server (I log everything except firewall events); it almost matches the
timestamps I noted while writing:
Feb 6 18:02:44 slbd[93497]: ICMP poll succeeded for 195.67.199.40, marking service UP
Feb 6 18:02:46 slbd[93497]: ICMP poll succeeded for 195.67.199.41, marking service UP
Feb 6 18:02:48 slbd[93497]: ICMP poll succeeded for 195.54.122.198, marking service UP
Feb 6 18:02:50 slbd[93497]: ICMP poll succeeded for 195.54.122.199, marking service UP
Feb 6 18:02:52 slbd[93497]: ICMP poll succeeded for 195.54.122.200, marking service UP
Feb 6 18:02:54 slbd[93497]: ICMP poll succeeded for 81.26.227.3, marking service UP
Feb 6 18:03:08 check_reload_status: rc.newwanip starting
Feb 6 18:03:09 php: : Informational: rc.newwanip is starting.
Feb 6 18:03:09 login: login on ttyv0 as root
Feb 6 18:03:09 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:03:09 dhclient: New Subnet Mask (xl1): 255.255.255.0
Feb 6 18:03:09 dhclient: New Broadcast Address (xl1): 81.236.139.255
Feb 6 18:03:09 dhclient: New Routers (xl1): 81.236.139.1
Feb 6 18:03:10 dhclient[88039]: connection closed
Feb 6 18:03:10 dhclient[88039]: connection closed
Feb 6 18:03:10 dhclient[88039]: exiting.
Feb 6 18:03:10 dhclient[88039]: exiting.
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:03:11 dhclient: New Subnet Mask (xl1): 255.255.255.0
Feb 6 18:03:11 dhclient: New Broadcast Address (xl1): 81.236.139.255
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!
Feb 6 18:04:30 dhclient[208]: DHCPREQUEST on xl1 to 255.255.255.255 port 67
Feb 6 18:04:30 dhclient[208]: DHCPACK from 81.236.128.1
Feb 6 18:04:31 dhclient[208]: bound to 81.236.139.6 -- renewal in 600 seconds.
Feb 6 18:04:31 dhclient[288]: DHCPREQUEST on xl0 to 255.255.255.255 port 67
Feb 6 18:04:31 sshd[297]: Server listening on :: port 22.
Feb 6 18:04:31 sshd[297]: Server listening on 0.0.0.0 port 22.
Feb 6 18:04:31 dhclient[288]: DHCPACK from 172.21.248.191
Feb 6 18:04:32 dhclient[288]: bound to 213.114.138.12 -- renewal in 1800 seconds.
Feb 6 18:04:35 dnsmasq[579]: started, version 2.36 cachesize 150
The reboot appears to take place between:
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!
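A window like this can also be located automatically by looking for the longest silence between consecutive syslog lines. A minimal Python sketch, assuming the classic `Mon D HH:MM:SS` timestamp prefix shown above; the function names are made up for illustration, and since these timestamps carry no year, one is assumed:

```python
from datetime import datetime


def parse_ts(line: str, year: int = 2007) -> datetime:
    """Parse the 'Feb 6 18:03:11' prefix of a classic syslog line.

    The year is an assumption; the log format does not include it.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the longest silence."""
    best = (0.0, None, None)
    for prev, cur in zip(lines, lines[1:]):
        gap = (parse_ts(cur) - parse_ts(prev)).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# A few lines copied from the log above.
sample = [
    "Feb 6 18:03:10 dhclient[88039]: exiting.",
    "Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6",
    "Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!",
    "Feb 6 18:04:31 sshd[297]: Server listening on 0.0.0.0 port 22.",
]
gap, before, after = largest_gap(sample)
print(f"{gap:.0f} s of silence between:\n  {before}\n  {after}")
```

A long silence alone does not prove a reboot, but together with services such as sshd and dnsmasq starting right afterwards it is a strong hint.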
I will reproduce the issue today and stand in front of the pfSense computer's monitor
to see whether it prints anything right before it reboots.
-
…
pfSense computer reboot (it did it itself).
...
Was the reboot related to some special action, e.g. unplugging a cable? Also, is it reproducible, like "I cut the red wire, it blows up"?
Any way you can test this on different hardware or different NICs?
Yes, it happens when I reconnect cables (not any NIC cable) to the primary ISP port.
The ISP port is 2 switches away from the pfSense computer.
During my failover/failback tests, I never disconnect any cable connected to the pfSense NICs.
I prefer to test failover/failback without breaking the physical link between the OS and a switch, so that
I can be sure the failover/failback feature does not work merely by checking the link status of the NICs.
And I can reproduce it if I run the long test. It is not reproducible with the short test (about 3 minutes).
The short test does not create a strange route table, nor does it put identical gateways on both the WAN and OPT1 NICs,
probably because dhclient does not try to renew the IP addresses during this short period.
I have even run the same long test on a different computer but, sorry to say, not with different WAN NICs; I had to move the
WAN NICs over. The result was the same: the computer rebooted.
I will try to capture the screen of the pfSense computer later today to see if it prints anything right before the reboot.
-
Just an update:
I will try it on a third computer. Every single piece of hardware will be
different from the first and second computers, including the NICs, CPU, motherboard, memory, hard drive, etc.
The only hardware that will remain the same is the monitor and keyboard.
The only thing I'm worried about is the PCI NICs.
The onboard NIC is a gigabit Marvell, but the PCI NICs will be SMC and D-Link, both new and never used (about 6 and 4 years old).
The problem should not be the hardware (at least not faulty hardware), because both the first and second computers can run pfSense and
still cope with 130-150 Mbps (full duplex) or 98 Mbps out, day in and day out, without
the pfSense computer rebooting on its own.
-
If you can, get some genuine Intel cards. Every NIC you mentioned is crappy on FreeBSD. Honest.
-
I know ;D.
That's why I don't use them.
It will only be for this test, when trying to reproduce the problem, to see if the problem spans
different hardware.
Otherwise I use Intel/3Com onboard and some nice old 3C905B on the real gateway box.
-
I was unable to test pfSense on a third computer under the same conditions as the first and second, because it
told me that it could not use the hard drive (or something similar). The hard drive was of the same type
as the one used in the first and second computers. I did get a lot of UDMA errors during the FreeBSD bootup.
So I scrapped the idea of testing on a third computer.
But I still wanted to capture what the monitor showed right before the reboot
when reproducing the issue, so I started the second computer and went through it step by
step, but it did not reboot this time. The difference was that I started reproducing after only
about 3 minutes of runtime, and I was done after a little over an hour.
The route tables and WAN gateway information were still badly wrong, so traffic did not work,
because DNS resolution went out the wrong interface (I use two of the ping monitors as DNS servers) and
the WAN gateway was the same as OPT1's.
A manual reboot fixed the problem :).
I will let the computer run for a couple of days before I try to recreate the reboot issue again.
-
It happened again yesterday :-[,
and I missed it, because I was not at home when my primary WAN came back.
This time it was not any sort of testing on my part. This was
real downtime on the WAN, and OPT1 took care of the traffic.
At 02/13/2007 3:15 PM my WAN went down.
At 02/14/2007 11:02 AM my WAN was back.
I noticed when I got home that my pfSense uptime was only a couple of hours
instead of several days, so I checked the remote syslog, and this is what I got:
Feb 14 11:02:24 dhclient[14276]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 9
Feb 14 11:02:33 dhclient[14276]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 5
Feb 14 11:02:38 dhclient[14276]: No DHCPOFFERS received.
Feb 14 11:02:38 dhclient[14276]: No working leases in persistent database - sleeping.
Feb 14 11:02:38 login: login on ttyv0 as root
Feb 14 11:02:51 dhclient[15458]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 6
Feb 14 11:02:57 dhclient[15458]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 14
Feb 14 11:02:57 dhclient[15458]: DHCPOFFER from 81.236.128.1
Feb 14 11:04:09 dhclient[208]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 3
Feb 14 11:04:09 dhclient[208]: DHCPOFFER from 81.236.128.1
Feb 14 11:04:10 sshd[257]: Server listening on :: port 22.
Feb 14 11:04:10 sshd[257]: Server listening on 0.0.0.0 port 22.
Feb 14 11:04:11 dhclient[208]: DHCPREQUEST on xl1 to 255.255.255.255 port 67
Feb 14 11:04:11 dhclient[208]: DHCPACK from 81.236.128.1
I take it that the reboot happened right after 11:02:57.
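The process IDs in the log point the same way: dhclient goes from PID 15458 to PID 208, and a low PID appearing right after a high one suggests a fresh boot. A rough Python sketch of that heuristic, assuming the `name[pid]:` tag format seen in the log above. The function name and the drop threshold are arbitrary choices for illustration, and PIDs can also wrap around on a long-running system, so a hit is a hint rather than proof.

```python
import re

# Matches "Feb 14 11:02:57 dhclient[15458]:" and captures name and PID.
TAG = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+?)\[(\d+)\]:")


def pid_drops(lines, min_drop=1000):
    """Yield (previous_line, line) where a program's PID falls sharply."""
    last = {}  # program name -> (pid, line)
    for line in lines:
        m = TAG.match(line)
        if not m:
            continue  # lines without a [pid] tag, e.g. "login: ..."
        name, pid = m.group(1), int(m.group(2))
        if name in last and last[name][0] - pid >= min_drop:
            yield last[name][1], line
        last[name] = (pid, line)


# A few lines copied from the log above.
sample = [
    "Feb 14 11:02:38 dhclient[14276]: No DHCPOFFERS received.",
    "Feb 14 11:02:57 dhclient[15458]: DHCPOFFER from 81.236.128.1",
    "Feb 14 11:04:09 dhclient[208]: DHCPOFFER from 81.236.128.1",
    "Feb 14 11:04:11 dhclient[208]: DHCPACK from 81.236.128.1",
]
for before, after in pid_drops(sample):
    print("possible reboot between:", before, "|", after)
```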
For the record, no cables to any NIC on the pfSense computer have been unplugged during this
time. Nor has the link status changed on
any NIC on the pfSense computer, because they are all connected to switches
here at home.
With this new information, I can revise my earlier statement about this issue.
There is no need for both WANs to stop working; it is enough that the primary
WAN stops working for several hours and then returns to trigger this issue.
And I cannot get it out of my head: over a 16 h period the WAN NIC carried
(according to the RRD graphs) an average of 37.15 Mb/s total and a maximum of
94.54 Mb/s without any issue, so I am almost sure this is a sign that
the NIC is in good condition and not broken in any way. I use 3C905B for the WANs and
an Intel onboard for LAN. I do plan to get 2 INTEL DESKTOP ADAPTER INTEL PRO 1000 GT-SINGLE LP,
but right now I don't have the cash to do so, and I'm still just testing pfSense because of its
failover feature.
I will change one thing on the pfSense computer: I will tell it to log locally as well. Maybe
the local log can tell me something more.
-
The local log will not survive a reboot.