pfSense hangs up every day - bosses are getting shouty
-
The first thing I would do is manually switch the CARP pair. If this is a hardware issue, that will either solve it or prove it's something shared by both boxes.
Steve
-
Update:
I haven't had a hang on my master firewall since Saturday morning, which is better than it has been for several weeks. I decided to wait on doing any updates etc so I could investigate a little more.
Going down the hardware failure route, I thought I ought to map out exactly what hardware is in the box. I inherited these firewalls after someone left, so wasn't involved in the spec'ing of them.
What I have found, and wasn't expecting, is that they are running on SSDs. dmesg shows the following:
ad4: 76319MB <INTEL SSDSA2CW080G3 4PC10362> at ata2-master UDMA100 SATA 3Gb/s
That has given me a new path to go down, especially after reading kejianshi's earlier comment about SSD hardware error recovery hanging their system.
My problem is that I'm not that familiar with BSD. What tools are available to me on the pfSense installation that would help me diagnose a faulty or failing SSD? I have found that sysctl is installed, but I don't know what I should be looking for.
Any suggestions gratefully accepted.
-
Those are some nice disks not known to fail prematurely. How long have they been in service?
Check the SMART status in the Diagnostics: menu.
You seem either reluctant to switch the master and backup or you already did that and I haven't realised. ;)
Steve
Edit: This thread might help: https://forum.pfsense.org/index.php/topic,66067.0.html
-
The disk has been in production use for about 18 months.
I haven't switched the master/backup yet, nor upgraded either firewall to the latest software. Trying to find the appropriate time …
-
Could it be a memory issue?
Just a note:
My pfSense (Alix 2D13) gets very sluggish when my ISP has had problems (which means the link on the WAN cable has been interrupted).
I have to unplug the WAN cable and plug it back in to get it running again.
-
Another update:
Still running fine since the weekend. I got the SMART info from the diagnostics page. It seems to suggest the disk is OK, but maybe there are some numbers in there that indicate an issue I can't see. As just pointed out, RAM might also be an issue. It seems strange that the firewall can freeze and then carry on some minutes later as though nothing has happened, but I don't know FreeBSD well enough to know whether that is unusual behaviour.
smartctl 6.0 2012-10-10 r3643 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Intel 320 Series SSDs
Device Model:     INTEL SSDSA2CW080G3
Serial Number:    BTPR210202P5080BGN
LU WWN Device Id: 5 001517 972e41902
Firmware Version: 4PC10362
User Capacity:    80,026,361,856 bytes [80.0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Dec 4 12:02:18 2013 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    1) seconds.
Offline data collection
capabilities:                    (0x75) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
Conveyance self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       13812
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       13
199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       41866
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       682
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       828574
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       41866
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       284

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
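
For anyone who finds the attribute table in smartctl output hard to read by eye, here is a rough Python sketch that picks out rows worth a second look. It is meant to be run on a workstation against saved output, not on the firewall itself. The `parse_attributes` and `suspicious` helpers and the `ERROR_COUNTERS` list are made up for this illustration (the choice of which counters matter is an assumption, not anything official from smartmontools), and the sample rows are copied from the output in this thread.

```python
import re

# Sample rows copied from the smartctl output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
"""

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# WHEN_FAILED, then a raw value that starts with digits.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Assumption for this sketch: a nonzero raw value on any of these
# error counters deserves a closer look.
ERROR_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Program_Fail_Count",
    "Erase_Fail_Count",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",
}


def parse_attributes(text):
    """Return one dict per attribute row found in smartctl output."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, blank lines, other sections
        ident, name, value, worst, thresh, _type, _upd, when_failed, raw = m.groups()
        rows.append({
            "id": int(ident),
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "when_failed": when_failed,
            "raw": int(raw),
        })
    return rows


def suspicious(rows):
    """Return (name, reason) pairs for rows that look unhealthy."""
    flagged = []
    for r in rows:
        if r["when_failed"] != "-":
            flagged.append((r["name"], "WHEN_FAILED is " + r["when_failed"]))
        elif r["thresh"] > 0 and r["value"] <= r["thresh"]:
            # A normalized value at or below its threshold counts as failed.
            flagged.append((r["name"], "value at or below threshold"))
        elif r["name"] in ERROR_COUNTERS and r["raw"] > 0:
            flagged.append((r["name"], "nonzero error counter"))
    return flagged


if __name__ == "__main__":
    for name, reason in suspicious(parse_attributes(SAMPLE)):
        print(name, "-", reason)
```

With the rows from this thread nothing is flagged, which agrees with the PASSED self-assessment.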
-
Hmm, yes, that looks OK. Media wearout is still at 0% despite having written about 1.3TB in 575 days, which is what I'd expect to see from a quality Intel SSD.
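
Those two figures can be checked against the raw SMART values. A quick sketch of the arithmetic, assuming the Host_Writes_32MiB counter really is in units of 32 MiB as its name suggests (the variable names are just for illustration):

```python
# Raw values taken from the smartctl output posted in this thread.
HOST_WRITES_32MIB = 41866   # attribute 225 / 241
POWER_ON_HOURS = 13812      # attribute 9

# Assumption: each count of Host_Writes_32MiB is 32 MiB written by the host.
bytes_written = HOST_WRITES_32MIB * 32 * 1024**2

tib_written = bytes_written / 1024**4   # binary terabytes (TiB)
tb_written = bytes_written / 1e12       # decimal terabytes (TB)
days_powered = POWER_ON_HOURS / 24
gb_per_day = bytes_written / 1e9 / days_powered

if __name__ == "__main__":
    print(f"written: {tib_written:.2f} TiB ({tb_written:.2f} TB)")
    print(f"powered on: {days_powered:.1f} days")
    print(f"average: {gb_per_day:.1f} GB/day")
```

That works out to roughly 1.3 TiB (about 1.4 TB in decimal units) over about 575 days of power-on time, which is a light write load for this drive.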
Something else then. Bad RAM almost always results in a complete failure rather than a delay.
Something you could try if you have the patience/luck is to run top in a console and catch what process is using the cpu time when it stalls.
Steve
-
Have you switched over to the secondary box yet? If not, you really need to do that to see if the problem goes away. Excluding VPN traffic, this is an online action and is accomplished with a single button click.
-
Yep. Though I fully understand why you might be hesitant to try it in the middle of a work day when the box has an undiagnosed issue. ;)
Steve
-
Sure, but if the thing is really breaking every single day anyway, I'm honestly confused as to why he hasn't just turned it off at a failure point. Either the backup box will work or it won't. Better to find out now than later when the first box flakes out permanently.