Hanging/Crashing every few hours
-
I recently updated pfSense to the current release (after a few years of not doing so), after which I experienced a pile of issues, so I wiped the drive and started fresh. Now pfSense crashes or hangs every few hours. I notice it in the ping time from any machine on the LAN to the pfSense box: the ping time worsens until there is no response at all, followed by "host is down". So far only a power cycle recovers pfSense. Any ideas on where or what to look for?
-
What hardware?
-
Hardware is as follows :
Intel(R) Atom(TM) CPU D525 @ 1.80GHz
4 CPUs: 1 package(s) x 2 core(s) x 2 HTT threads
1 x Transcend SO-DIMM DDR3 1600 Memory 2GB
1 x Emphase Industrial - S1 SATA Flash Module 4 GB
1 x Jetway 3x 1Gb Realtek LAN Module
The hardware has run pfSense 2.1 for the past two years or so; I recently updated to the most current version.
-
Are you getting any watchdog timeouts on the console screen?
Can you share system log files before and after the crash/hang?
Can you share any custom sysctl (system tunables) or loader.conf.local modifications? I noticed when configuring my 2.3.2 pfSense setup that many of the FreeBSD tweaks on the web are wrong for recent versions of FreeBSD.
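If it helps with pulling out the "before and after" part of the logs, here is a minimal sketch (not a pfSense tool) that scans syslog-style lines and reports where logging went quiet for a long time. It assumes the log has already been exported to plain text on another machine (on pfSense versions that keep circular logs, that may mean dumping it with `clog` first), that each line starts with a `Mon DD HH:MM:SS` timestamp like the ones pfSense writes, and that the year has to be supplied by hand because the timestamp does not carry one. The names `parse_ts`, `find_gaps`, and the 30-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' syslog timestamp; return None if absent."""
    try:
        # The first 15 characters hold the timestamp, e.g. 'Jul 26 00:46:16'.
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Yield (last_seen, next_seen) pairs where the log was silent longer than threshold."""
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev is not None and ts - prev > threshold:
            yield prev, ts
        prev = ts


if __name__ == "__main__":
    # Hypothetical sample lines; replace with the contents of an exported log file.
    sample = [
        "Jul 26 00:46:16 pfSense dpinger: example entry",
        "Jul 26 00:50:00 pfSense dhcpd: example entry",
        "Jul 26 09:00:00 pfSense kernel: example entry",
    ]
    for last_seen, next_seen in find_gaps(sample, year=2016):
        print(f"silent from {last_seen} until {next_seen}")
```

The last entry before each gap is roughly when the box stopped logging, so the lines just before it are the ones worth posting.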
-
I have not touched sysctl.conf or loader.conf.local, and here they are :
sysctl.conf :
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``sysctl'' to adjust kernel values. ``man 5 sysctl.conf'' for details.
#
# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0
loader.conf.local :
kern.cam.boot_delay=10000
I see no console messages; the system just grinds to a halt and becomes unresponsive. Let me see if I can get the system logs from before and after a crash. It may be tricky, as I am using cron to auto-reboot every hour as a duct tape/bubblegum workaround.
Thank you
-
That would be good. Does the machine lock up at the console or does the NIC just fail?
I have a feeling your Realtek NIC is experiencing a watchdog timeout, or something similar.
-
does this help at all :
gateways.log:Jul 26 00:46:16 pfSense dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr XX.XX.XX.XX bind_addr YY.YY.YY.YY identifier "GW_WAN "
There are also a lot of these:
dhcpd.log:Jul 27 12:49:32 pfSense dhcpd: DHCPREQUEST for 192.168.2.29 from b0:a7:37:cb:ca:73 via re0: unknown lease 192.168.2.29.
-
No, the gateways.log messages are just dpinger (the gateway monitor) telling you that you lost your WAN connection.
The dhcpd.log issue is also not the cause of this. Are you losing WAN, LAN, or both when this issue occurs?
-
LAN stays up, WAN goes down, and pinging / communicating with pfsense is lost
-
Can you access the console next time the WAN goes down? I am pretty sure you are getting watchdog timeouts on your WAN ethernet adapter. What type of Realtek adapter are you using? How much traffic are you pushing through your WAN when the interface fails?
-
When the WAN goes down the box hangs; neither the webConfigurator nor SSH responds.
The system just crashed in between auto-reboots, and looking at system.log, the last entry was at midnight last night, which does not seem correct.
As another clue, the system is up right now and email/web sites are responding, but pings result in immediate timeouts.
-
Will get the NIC details shortly.
-
This is probably due to a bad Realtek driver. Can you turn off the auto reboot? Otherwise, there is no point debugging this
-
I am away from the office (it's a small, small company) for the next week, which is why the auto-reboot is on (the web server and email server need to be kept up). When I get back I can turn it off and reboot when need be.
-
No worries. I've been in your position before - have a nice evening.
-
Ok I am back in the office and found the following message reported twice in the console when a crash happened :
re0: discard frame w/o leading ethernet header ( len 4294967292 pkt len 4294967292 )
does that help?
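One thing worth noting about that number: 4294967292 is what -4 looks like when printed as an unsigned 32-bit value. The sketch below only checks that arithmetic. The idea that the driver computed a length of 0 and then subtracted the 4-byte Ethernet CRC is an assumption on my part, not something confirmed anywhere in this thread, and the names `as_signed32` and `wrap_unsigned32` are made up for illustration.

```python
import struct


def as_signed32(value):
    """Reinterpret an unsigned 32-bit integer as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def wrap_unsigned32(value):
    """Reduce an integer modulo 2**32, as unsigned 32-bit arithmetic would."""
    return value % 2**32


if __name__ == "__main__":
    reported_len = 4294967292  # value from the re0 console message
    print(hex(reported_len))
    print(as_signed32(reported_len))
    # Assumption: a frame length of 0 minus a 4-byte CRC, in unsigned arithmetic.
    print(wrap_unsigned32(0 - 4) == reported_len)
```

If that reading is right, the NIC handed the driver a frame with a nonsensical length, which would point at the hardware or driver rather than at the pfSense configuration.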
-
I'm going to follow this post intensely, as I have a very similar problem.
-
It's definitely not good, but it does not always cause a crash or hang.
Do you have polling enabled?
Does reverting back to 2.1 solve the problem?
-
I do not have device polling enabled, is it worth turning on?
-
Where can I find older versions to try a reversion?