Hanging/Crashing every few hours

kryngle

I recently updated pfSense to the current release (after a few years of not doing so) and ran into a pile of issues, so I wiped the drive and started fresh. Now pfSense crashes or hangs every few hours. I notice it in the ping time from any machine on the LAN to the pfSense box: the ping time worsens until there is no response at all, followed by "host is down". So far only a power cycle recovers pfSense. Any ideas on where or what to look for?

cmb

What hardware?

kryngle

Hardware is as follows :

Intel(R) Atom(TM) CPU D525 @ 1.80GHz
4 CPUs: 1 package(s) x 2 core(s) x 2 HTT threads

1 x Transcend SO-DIMM DDR3 1600 Memory 2GB

1 x Emphase Industrial - S1 SATA Flash Module 4 GB

1 x Jetway 3x 1Gb Realtek LAN Module

The hardware has run pfSense 2.1 for the past two years or so. I recently updated to the most current version.

Paint

Are you getting any watchdog timeouts on the console screen?

Can you share system log files before and after the crash/hang?

Can you share any custom sysctl (system tunables) or loader.conf.local modifications? I noticed when configuring my 2.3.2 pfSense setup that many of the FreeBSD tweaks on the web are wrong for recent versions of FreeBSD.
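
If it helps with sifting the logs, here is a rough Python sketch that keeps only the lines where a Realtek re(4) interface reports a watchdog timeout or a discarded frame. The function name `nic_errors` and the pattern are my own for illustration, and I am assuming the log has already been dumped to plain text (on pfSense releases that use circular logs, something like `clog /var/log/system.log > system.txt`).

```python
import re
import sys

# Matches kernel messages from a re(4) interface (re0, re1, ...) that mention
# a watchdog timeout or a discarded frame. The pattern is an assumption about
# what is worth looking for, not an exhaustive list of driver errors.
NIC_PATTERN = re.compile(r"\bre\d+: .*(watchdog timeout|discard frame)", re.IGNORECASE)


def nic_errors(lines):
    """Return the log lines that look like re(4) driver trouble."""
    return [line.rstrip("\n") for line in lines if NIC_PATTERN.search(line)]


def main():
    # Reads plain-text log lines from standard input and prints the matches.
    for hit in nic_errors(sys.stdin):
        print(hit)
```

Calling `main()` with the dumped log piped in prints only the suspicious lines, along with the timestamps already present in the log. Ordinary dhcpd lines that merely mention `re0` are skipped.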

kryngle

I have not touched sysctl.conf or loader.conf.local. Here they are:

sysctl.conf :

# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0

loader.conf.local :

kern.cam.boot_delay=10000

I see no console messages; the system just grinds to a halt and becomes unresponsive. Let me see if I can get the system logs from before and after a crash. It may be tricky, as I am using cron to auto-reboot every hour as a duct tape/bubblegum workaround.

Thank you

Paint

That would be good. Does the machine lock up at the console or does the NIC just fail?

I have a feeling your Realtek NIC is experiencing a watchdog timeout, or something similar.

kryngle

Does this help at all?

gateways.log:Jul 26 00:46:16 pfSense dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr XX.XX.XX.XX bind_addr YY.YY.YY.YY identifier "GW_WAN "

There are also a lot of these:

dhcpd.log:Jul 27 12:49:32 pfSense dhcpd: DHCPREQUEST for 192.168.2.29 from b0:a7:37:cb:ca:73 via re0: unknown lease 192.168.2.29.

Paint

No, the first gateways.log messages are just dpinger (the gateway monitor) telling you that you lost your WAN connection.

The dhcpd.log entries are not the cause of this either. Are you losing WAN, LAN, or both when this issue occurs?

kryngle

LAN stays up, WAN goes down, and pinging or otherwise communicating with pfSense is lost.

Paint

Can you access the console next time the WAN goes down? I am pretty sure you are getting watchdog timeouts on your WAN ethernet adapter. What type of Realtek adapter are you using? How much traffic are you pushing through your WAN when the interface fails?

kryngle

When the WAN goes down the box is hanging; neither the webConfigurator nor SSH responds.

The system just crashed in between auto-reboots, and looking at system.log the last entry was at midnight last night, which does not seem correct.

As another clue, the system is up right now and email and web sites are responding, but pings result in immediate timeouts.

kryngle

I will get the NIC details shortly.

Paint

This is probably due to a bad Realtek driver. Can you turn off the auto-reboot? Otherwise, there is no point in debugging this.

kryngle

I am away from the office (it's a small, small company) for the next week, which is why the auto-reboot is on: the web server and email server need to be kept up. When I get back I can turn it off and reboot when need be.

Paint

No worries. I've been in your position before. Have a nice evening.

kryngle

Ok I am back in the office and found the following message reported twice in the console when a crash happened :

re0: discard frame w/o leading ethernet header ( len 4294967292 pkt len 4294967292 )

does that help?

cowburner

I'm going to follow this post intensely, as I have a very similar problem.

w0w

@kryngle:

Ok I am back in the office and found the following message reported twice in the console when a crash happened :

re0: discard frame w/o leading ethernet header ( len 4294967292 pkt len 4294967292 )

does that help?

It's definitely not good, but it doesn't always cause a crash or hang.
Do you have polling enabled?
Does reverting to 2.1 solve the problem?
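
As a side note on the number itself: 4294967292 is almost certainly not a real frame size. It is 2^32 - 4, which is what -4 looks like when printed as an unsigned 32-bit value. The small Python sketch below just does that arithmetic; the helper name `as_signed32` is mine. One plausible reading, and this is only my guess rather than anything the log confirms, is that a zero-length frame had the 4-byte Ethernet CRC subtracted from it.

```python
# Length printed twice in the re0 console message.
LEN_FROM_CONSOLE = 4294967292


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


print(hex(LEN_FROM_CONSOLE))          # 0xfffffffc
print(as_signed32(LEN_FROM_CONSOLE))  # -4
```

Either way, a negative length reaching the network stack points at the driver or the NIC handing up a bogus frame rather than at the pfSense configuration.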

kryngle

I do not have device polling enabled. Is it worth turning on?

kryngle

Where can I find older versions to try a reversion?