Several problems (incl. pfSense self-reboot) during failback after a long time period

Veni

I noticed an issue with my WatchGuard gateway box when I lost the primary WAN and, later on, the secondary WAN.
That little red box shut down udhcpc, which made it impossible for it to return to the secondary WAN (the secondary WAN came back first).
With udhcpc shut down, it could not ping the monitoring IP, because the box itself had no IP address. A reboot fixed the problem.
After this issue with the WG box, I decided to check whether pfSense could face the same problem.

My pfSense IF connections:
WAN = Primary WAN
OPT1 = Secondary WAN
Both are DHCP-based.

My first test results came back flawless. The tests were:
1. Blocking access through WAN, thus forcing a switch to OPT1. No problem.
2. Blocking access through OPT1, leaving no internet access at all.
3. Restoring OPT1. Access restored and working.
4. Restoring WAN. Access restored and traffic running through WAN. (I did notice that some existing traffic kept running through OPT1. I could not force it over to WAN, even by shutting down the server software that held the connection. The only ways to get the software to use the WAN IF were to break the OPT1 connection for a couple of seconds and, during the break, restart the server software that keeps persistent TCP connections, or to reset the states.)

Then I wanted to try a longer test (approx. 2 hours) to see whether pfSense shuts down the DHCP client (that is, whether it behaves in the same strange way as the WG box).
I will go into details (I tried to reproduce the issue, and it was easy to reproduce and therefore easy to document), but the four
biggest issues were:

1. pfSense computer reboot (it did it itself).
2. Depending on what happens before the reboot (I think it's right before the reboot), the configuration file can become corrupt or missing (I don't know which, but the data is gone).
3. Route tables a little bit wrong.
4. Gateway information on xl1 is the same as on xl0.

I tried to reproduce the issue inside a VMware environment, but everything worked in that environment.
I don't know if that is because it is a simulated internal network,
but I have no problem reproducing the issue on real hardware that runs live.

Okay, this is what I have for anyone who has the time and hardware to confirm the issue.
I'm running one load balancer pool that is for failover purposes only. I will give more details upon request to anyone who wants them.
pfSense 1.0.1-SNAPSHOT-02-02-2007 built on Sat Feb 3 20:14:47 EST 2007.

I will post timestamps to show how long I wait (or when I notice a change) between steps.

1. 16:19 Everything running fine. Disconnected WAN two switches away.
2. 16:20 Load balancer status confirms WAN is offline, and www.myip.se confirms OPT1 is in use.
3. 16:27 WAN has by now lost its IP address and gateway (600-second renewal period) according to the IF status.
4. 16:34 Disconnected my ADSL modem (OPT1) from the telephone line, so the physical link to pfSense stays up.
5. 16:34 Load balancer status now shows OPT1 as offline.
6. 17:44 Checking again after 70 minutes, I notice that OPT1 has lost its IP address and gateway (my ADSL provider uses an 1800-second renewal period). dhclient is still running, which is better than the WG box, which kills the udhcpc process instead.
7. 17:52 Reconnected the phone line to the ADSL modem.
8. 17:54 Something strange happens on the load balancer status page: every single monitor IP shows up as Online (only OPT1 should be Online), and it stays that way. The route table confirms that all of the failover monitoring IPs use the OPT1 gateway, which makes it possible to ping the WAN monitor IPs.
9. 17:59 Tried to use a web browser on a client computer. No luck, not even when browsing to http://64.233.183.104 (Google).
10. 18:00 Reconnected WAN.
11. 18:01 Route tables still show the wrong gateway for the monitoring IPs.
12. 18:03 pfSense web GUI stops responding.
13. 18:05 I walked over to the monitor connected to the pfSense computer, and it shows the FreeBSD loader waiting for a keystroke (in my case F1). I don't touch anything and let it do its job. I can't find any reason for the reboot in the remote syslog.
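
The route check in steps 8 and 11 (which gateway each monitor IP is routed through) can be sketched in Python as below. This is only an illustration under several assumptions: the input is `netstat -rn`-style text, the sample table is made up, 192.0.2.1 is a placeholder for the OPT1 gateway (its real address is not given in this thread), 81.236.139.1 is the router address dhclient reports for xl1 in the syslog excerpt later in the thread, and the mapping of monitor IPs to WAN or OPT1 is a guess. The function names are hypothetical.

```python
import ipaddress

def parse_host_routes(netstat_text):
    """Map each destination that is a plain IP address to its gateway.

    Header lines and entries such as 'default' are skipped because
    their first column does not parse as an IP address.
    """
    routes = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            dest = ipaddress.ip_address(fields[0])
        except ValueError:
            continue
        routes[str(dest)] = fields[1]
    return routes

def misrouted(routes, expected):
    """Return monitor IPs whose gateway differs from the expected one.

    A monitor IP with no host route at all is also reported.
    """
    return sorted(ip for ip, gw in expected.items() if routes.get(ip) != gw)

# Made-up sample resembling the broken state: everything uses one gateway.
SAMPLE_NETSTAT = """\
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            192.0.2.1          UGS         0     1234    xl0
195.67.199.40      192.0.2.1          UGHS        0       12    xl0
195.67.199.41      192.0.2.1          UGHS        0       12    xl0
195.54.122.198     192.0.2.1          UGHS        0       12    xl0
"""

# Assumed mapping: 195.67.199.x monitors belong to WAN, 195.54.122.x to OPT1.
EXPECTED = {
    "195.67.199.40": "81.236.139.1",
    "195.67.199.41": "81.236.139.1",
    "195.54.122.198": "192.0.2.1",
}

if __name__ == "__main__":
    for ip in misrouted(parse_host_routes(SAMPLE_NETSTAT), EXPECTED):
        print("monitor IP routed through unexpected gateway:", ip)
```

Feeding it the real route table text and the real monitor-to-gateway mapping would give a quick yes/no on whether the state from 6.jpg and 9.jpg is present.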

I have some pictures of the route tables and IF status, plus some of the
monitor when the configuration file is probably broken (after one of these reboots).

1.jpg=Normal route table.
2.jpg=Normal IF status.
6.jpg=Something wrong with the route table. 195.67.199.40, 41 and 42 point to the wrong gateway (step 8 above).
7.jpg=Something wrong with the gateway data. Look at the OPT1 and WAN gateways (step 8 above); they are the same.
8.jpg=When all connections are back (step 11 above). The picture is just for reference to show that the data is correct.
9.jpg=Route table when all connections are back. 195.67.199.40, 41 and 42 still point to the wrong gateway (step 11 above).

IMG_0542.jpg and IMG_0547.jpg show the monitor after one of these reboots occurred.
Notice the IP addresses in IMG_0548.jpg as well as the local host name (just a dot).
The problem with the configuration does not happen every time I try to reproduce this issue.
Resetting to factory defaults and reconfiguring fixes the IMG_0548.jpg problem.

sullrich

Sounds like hardware issues.

Set up a remote syslog server so that you can capture debugging information before the crash.

hoba

@Veni:

…
pfSense computer reboot(it did it itself).
...

Was the reboot related to some special action, e.g. unplugging a cable? Also is it reproducible like "I cut the red wire it blows up?"
Any way you can test this on different hardware or different NICs?

Veni

@sullrich:

Sounds like hardware issues.

Set up a remote syslog server so that you can capture debugging information before the crash.

This is what I found on the remote syslog server (I log everything except firewall events) that roughly matches the
timestamps I wrote down:

Feb 6 18:02:44 slbd[93497]: ICMP poll succeeded for 195.67.199.40, marking service UP
Feb 6 18:02:46 slbd[93497]: ICMP poll succeeded for 195.67.199.41, marking service UP
Feb 6 18:02:48 slbd[93497]: ICMP poll succeeded for 195.54.122.198, marking service UP
Feb 6 18:02:50 slbd[93497]: ICMP poll succeeded for 195.54.122.199, marking service UP
Feb 6 18:02:52 slbd[93497]: ICMP poll succeeded for 195.54.122.200, marking service UP
Feb 6 18:02:54 slbd[93497]: ICMP poll succeeded for 81.26.227.3, marking service UP
Feb 6 18:03:08 check_reload_status: rc.newwanip starting
Feb 6 18:03:09 php: : Informational: rc.newwanip is starting.
Feb 6 18:03:09 login: login on ttyv0 as root
Feb 6 18:03:09 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:03:09 dhclient: New Subnet Mask (xl1): 255.255.255.0
Feb 6 18:03:09 dhclient: New Broadcast Address (xl1): 81.236.139.255
Feb 6 18:03:09 dhclient: New Routers (xl1): 81.236.139.1
Feb 6 18:03:10 dhclient[88039]: connection closed
Feb 6 18:03:10 dhclient[88039]: connection closed
Feb 6 18:03:10 dhclient[88039]: exiting.
Feb 6 18:03:10 dhclient[88039]: exiting.
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:03:11 dhclient: New Subnet Mask (xl1): 255.255.255.0
Feb 6 18:03:11 dhclient: New Broadcast Address (xl1): 81.236.139.255
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!
Feb 6 18:04:30 dhclient[208]: DHCPREQUEST on xl1 to 255.255.255.255 port 67
Feb 6 18:04:30 dhclient[208]: DHCPACK from 81.236.128.1
Feb 6 18:04:31 dhclient[208]: bound to 81.236.139.6 -- renewal in 600 seconds.
Feb 6 18:04:31 dhclient[288]: DHCPREQUEST on xl0 to 255.255.255.255 port 67
Feb 6 18:04:31 sshd[297]: Server listening on :: port 22.
Feb 6 18:04:31 sshd[297]: Server listening on 0.0.0.0 port 22.
Feb 6 18:04:31 dhclient[288]: DHCPACK from 172.21.248.191
Feb 6 18:04:32 dhclient[288]: bound to 213.114.138.12 -- renewal in 1800 seconds.
Feb 6 18:04:35 dnsmasq[579]: started, version 2.36 cachesize 150

The reboot part looks like it takes place between:

Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6

Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!
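
For anyone who wants to locate such a gap in a longer log, here is a minimal Python sketch. It assumes the syslog line format shown above and the year 2007 (syslog lines carry no year). The function names `largest_gap` and `pid_resets` are made up for illustration. Note that a process ID that drops sharply (dhclient going from 88039 to 208) is only a hint that the machine rebooted, not proof.

```python
import re
from datetime import datetime

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

STAMP_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s")
PID_RE = re.compile(r"\s([\w.-]+)\[(\d+)\]:")

def parse_stamp(line, year=2007):
    """Return the timestamp of a syslog line, or None if it has none."""
    m = STAMP_RE.match(line)
    if not m or m.group(1) not in MONTHS:
        return None
    return datetime(year, MONTHS[m.group(1)], int(m.group(2)),
                    int(m.group(3)), int(m.group(4)), int(m.group(5)))

def largest_gap(lines, year=2007):
    """Return (seconds, line_before, line_after) for the longest silence.

    Returns None when fewer than two lines carry a timestamp.
    """
    stamped = [(parse_stamp(l, year), l.strip()) for l in lines]
    stamped = [(t, l) for t, l in stamped if t is not None]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

def pid_resets(lines):
    """Return (process, old_pid, new_pid) wherever a process's PID drops."""
    last = {}
    resets = []
    for line in lines:
        m = PID_RE.search(line)
        if not m:
            continue
        name, pid = m.group(1), int(m.group(2))
        if name in last and pid < last[name]:
            resets.append((name, last[name], pid))
        last[name] = pid
    return resets

# A few lines copied from the log above.
SAMPLE = """\
Feb 6 18:03:10 dhclient[88039]: exiting.
Feb 6 18:03:11 dhclient: New IP Address (xl1): 81.236.139.6
Feb 6 18:04:30 dhclient[208]: Corrupt lease file - possible data loss!
Feb 6 18:04:31 sshd[297]: Server listening on 0.0.0.0 port 22.
""".splitlines()

if __name__ == "__main__":
    print(largest_gap(SAMPLE))
    print(pid_resets(SAMPLE))
```

The longest silence in a log is not always a reboot (a quiet box also logs nothing), so the two checks are best read together.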

I will go ahead today and reproduce the issue, and I will stand in front
of the pfSense monitor to see if it prints anything right before it reboots.

Veni

@hoba:

@Veni:

…
pfSense computer reboot(it did it itself).
...

Was the reboot related to some special action, e.g. unplugging a cable? Also is it reproducible like "I cut the red wire it blows up?"
Any way you can test this on different hardware or different NICs?

Yes, it happens when reconnecting cables (not any NIC cable) to the primary ISP port.
The ISP port is two switches away from the pfSense computer.
During my failover/failback tests, I never disconnect any cable connected to the pfSense NICs.
I prefer to test failover/failback without breaking the physical link between the OS and a switch, so that
I verify the failover/failback feature does not simply work by checking the link status of the NICs.

And I can reproduce it if I run the long test. It's not reproducible with the short test (about 3 minutes).

Running the short test does not create a strange route table, nor does it put identical gateways on both the WAN and OPT1 NICs,
probably because dhclient does not try to renew the IP addresses during this short period.

I have even run the same long test on a different computer but, sorry to say, not with different WAN NICs. I had to move the
WAN NICs over. The result was the same: the computer rebooted.

I will try to capture the pfSense computer's screen later today to see if it prints anything right before the reboot.

Veni

Just an update:

I will give it a go on a third computer. Every single piece of hardware will be
different from the first and second computers, including the NICs, CPU, motherboard, memory, hard drive, etc.
The only hardware that will remain the same is the monitor and keyboard.

The only thing I'm worried about is the PCI NICs.
The onboard NIC is a gigabit Marvell, but the PCI NICs will be SMC and D-Link, both new and never used (about 6 and 4 years old).

The problem should not be with the hardware (at least not faulty hardware), because both the first and second computers can run pfSense and
cope with 130-150 Mbps (full duplex) or 98 Mbps out, day in and day out, without
the pfSense computer rebooting on its own.

sullrich

If you can, get some genuine Intel cards. Every nic you mentioned is crappy on FreeBSD. Honest.

Veni

I know ;D.
That's why I don't use them.

It will only be for this test, to see whether the problem spans
different hardware.

Otherwise I use Intel/3Com onboard and some nice old 3C905B on the real gateway box.

Veni

I was unable to test pfSense on a third computer under the same conditions as the first and second, because it
told me that it could not use the hard drive (or something similar). The hard drive was of the same type
as the one used in the first and second computers. I also got a lot of UDMA errors during the FreeBSD boot.
So I scrapped the idea of testing it on a third computer.

But I still wanted to capture what the monitor shows right before the reboot
when reproducing the issue, so I started the second computer and went through the steps one
by one, but it did not reboot this time. The difference was that I started the reproduction after
only about 3 minutes of uptime, and I was done after a little over an hour.

The route tables and the WAN gateway information were still really wrong, so traffic did not work:
DNS resolution went out the wrong IF (I use two of the ping monitors as DNS servers), and
the WAN gateway was the same as OPT1's.

Manual reboot fixed the problem :).
I will let the computer run for a couple of days before I try to recreate the reboot issue again.

Veni

It happened again yesterday :-[,
and I missed it because I was not at home when my primary WAN came back.

This time it was not any kind of testing on my part. This was
real downtime on the WAN, and OPT1 took care of the traffic.

At 02/13/2007 3:15 PM my WAN was gone.
At 02/14/2007 11:02 AM my WAN was back.

When I got home, I noticed that my pfSense uptime was only a couple of hours
instead of several days, so I checked the remote syslog, and this is what I got:

Feb 14 11:02:24 dhclient[14276]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 9
Feb 14 11:02:33 dhclient[14276]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 5
Feb 14 11:02:38 dhclient[14276]: No DHCPOFFERS received.
Feb 14 11:02:38 dhclient[14276]: No working leases in persistent database - sleeping.
Feb 14 11:02:38 login: login on ttyv0 as root
Feb 14 11:02:51 dhclient[15458]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 6
Feb 14 11:02:57 dhclient[15458]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 14
Feb 14 11:02:57 dhclient[15458]: DHCPOFFER from 81.236.128.1
Feb 14 11:04:09 dhclient[208]: DHCPDISCOVER on xl1 to 255.255.255.255 port 67 interval 3
Feb 14 11:04:09 dhclient[208]: DHCPOFFER from 81.236.128.1
Feb 14 11:04:10 sshd[257]: Server listening on :: port 22.
Feb 14 11:04:10 sshd[257]: Server listening on 0.0.0.0 port 22.
Feb 14 11:04:11 dhclient[208]: DHCPREQUEST on xl1 to 255.255.255.255 port 67
Feb 14 11:04:11 dhclient[208]: DHCPACK from 81.236.128.1

I take it that the reboot happened right after 11:02:57.
For the record, no cables to any NIC on the pfSense computer have been unplugged during this
time. Nor has the link status changed on
any NIC on the pfSense computer, because they are all connected to switches
here at home.

With this new information, I can revise my earlier statement about this issue.
There is no need for both WANs to stop working; it's enough that the primary
WAN stops working for several hours and then returns to trigger this issue.

And I cannot get this out of my head: over a 16-hour period, the WAN NIC carries
(according to the RRD graphs) an average of 37.15 Mb/s total and a maximum of
94.54 Mb/s without any issue, so I am almost sure that
the NIC is in good condition and not broken in any way. I use 3C905B for the WANs and
an Intel onboard for the LAN. I do plan to get 2 INTEL DESKTOP ADAPTER INTEL PRO 1000 GT-SINGLE LP,
but right now I don't have the cash to do so, and I'm still just testing pfSense because of its
failover feature.

I will change one thing on the pfSense computer: I will tell it to log locally as well. Maybe
the local log can tell me something more.

jeroen234

The local log will not survive a reboot.