Netgate Discussion Forum

    Pfsense hangs up every day - bosses are getting shouty

    General pfSense Questions
    17 Posts 5 Posters 3.4k Views
      saxmad

      Hi,

      I've used a pair of pfSense firewalls for several years and they protect our network extremely well.  They are configured as master/backup with CARP failover etc.

      Over the last couple of weeks, we have been experiencing hangs on our network, both incoming and outgoing.  It tends to last for a few minutes and then everything goes back to working again.

      I have gone through the troubleshooting routine of checking for issues with our ISP, checking all other hardware in our rack for faults/failures etc, but everything comes back clean.  That leaves me with the pfsense firewalls.

      Today I set up remote syslog collection so I can interrogate the firewall logs more easily - as the hang occurs at least once every day, I shouldn't have to wait that long for it to happen again.

      Any clues/ideas about how I can see what is happening on the firewall during the hanging period, what to look for in the logs, and what the possible causes of the hangs might be?

      Version 2.1-BETA0 (amd64)
      built on Thu Nov 8 06:41:07 EST 2012

      CPU/memory usage ~ 10%
      Disk usage ~ 1%

      Thanks
      Gary

        stephenw10 Netgate Administrator

        Any reason why you're running 2.1 beta and not release?
        Do the carp pair fail over when this happens? If you manually swap them does that change anything?
        Is there anything in the logs when this fault occurs?

        Are you running a full install from a HD? A failing hard drive can sometimes appear to stall, but not for a few minutes. SSDs can sometimes stop responding while they are moving data for wear leveling and that can take a while. I would expect some errors in the logs though if the drive stopped responding for any significant time.

        Do you see the outage in the wan quality RRD graphs?

        Steve

          saxmad

          Steve - thanks for a very helpful reply.

          I'm running 2.1 beta because these are 24x7 production firewalls and I was working on the maxim of "if it ain't broke, don't fix it".  Had a few unfortunate experiences with bad upgrades over the years, so only do it when I really have to on equipment such as the pfsense boxes.

          I am not 100% sure but I think that the CARP pairs do not fail over correctly, as our web sites are still unavailable throughout the hanging period.  I have successfully managed to manually force a failover recently, so I'm pretty confident that they are working under normal circumstances.

          I haven't found anything in the logs that might be helpful during the hanging period.  I now have all syslog messages from both firewalls being collected on a local server so I can get much easier access to the logs.

          This is a full install on a HD - the spinning variety.

          Looking at the RRD graphs for wan quality during the last hang, there is no noticeable change in stats.  I did find it interesting that the figure for the master firewall was around 77ms whilst the figure for the backup firewall was 2.6ms.  Quite a difference.  The wanv6 figures for both firewalls were around 3ms.  Not sure what that is telling me about the master connection.

          I had a clean night last night with no hangs, so I'm hoping that it may happen again today whilst I'm in the office.  Not hoping, but you get my drift.

          I'll post back any further findings.

          Cheers,
          Gary
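One way to use the centrally collected syslog is to look for silent periods: if a box really stalls, it may stop emitting messages for the duration. The following is only a rough sketch in Python, meant to run on the log server rather than on the firewall. It assumes classic BSD-style syslog lines that begin with a timestamp such as `Dec  4 12:02:18`; the function name `find_gaps`, the two-minute threshold, and the sample lines are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(lines, year, threshold=timedelta(minutes=2)):
    """Return (previous, current) timestamp pairs further apart than threshold."""
    stamps = []
    for line in lines:
        # Classic syslog timestamps carry no year, so the caller supplies one.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        stamps.append(stamp)
    stamps.sort()
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]


# Hypothetical sample lines, not taken from a real firewall.
sample = [
    "Dec  4 12:00:01 fw1 example: message one",
    "Dec  4 12:00:40 fw1 example: message two",
    "Dec  4 12:07:15 fw1 example: message three",
]

for before, after in find_gaps(sample, 2013):
    print(f"no messages between {before} and {after} ({after - before})")
```

A gap alone does not prove a hang, since a quiet firewall may simply have nothing to log, but lining the gaps up with the times users reported outages could narrow down where to look.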

            jasonlitka

            A CARP pair can be upgraded or rebooted in the middle of the day without downtime.  I do it all the time.  The only thing that should glitch out would be any VPN tunnels.

            I recently had one of my boxes in a CARP pair that would periodically lock up but still be alive enough to not trigger a failover.  I don't know what it was because I replaced the boxes.  I suspect a failing disk though.

            I can break anything.

              stephenw10 Netgate Administrator

              So, just to be clear, you tried manually swapping the CARP pair and there was no difference in the stalling behaviour? Did the longer ping time switch from the master to the backup? What is apinger using as a monitor IP? 77 vs 2.6ms is quite a discrepancy but it could be explained by the fact that the master box is carrying traffic on that link whereas the backup is not. The link would be fairly congested though.

              Are you using incoming or outgoing load balancing?

              Steve

                kejianshi

                Strange that they would quit working or freeze up.  A CARP pair should just keep working if one of them screws up.  You would think.
                Especially if nothing has changed with them.

                The only time I've ever gotten a "freeze" followed by it suddenly unfreezing on its own is when my SSDs are doing a hardware error recovery.

                  saxmad

                  @stephenw10: In the recent past, before the hanging incidents started, I had manually forced failover from master to backup successfully, so I am pretty confident that under normal circumstances, failover is working as expected.  I am not seeing successful failover during the hanging.  I have not checked the ping time on the gateways during a failover, but will do so as soon as it happens again.  apinger is using the same IP address as the monitor IP on both firewalls.  My load balancing is incoming, requests are shared out between two identical local servers on my LAN.

                  I had another extended episode of hanging over the weekend, but I was not online to monitor unfortunately.  There is nothing in the log files of either firewall to suggest anything untoward is going on.

                  I couldn't find any BSD system log to look for failing disks etc other than dmesg and system.log.  Are there any others I could look at?

                  I think my next step is to upgrade both firewalls to the latest release of software (currently running 2.1-BETA0 (amd64)).  I'm aiming to do this over the next couple of days.

                    stephenw10 Netgate Administrator

                    The first thing I would do is manually switch the CARP pair. If this is a hardware issue that will solve it or prove it's something shared by both boxes.

                    Steve

                      saxmad

                      Update:

                      I haven't had a hang on my master firewall since Saturday morning, which is better than it has been for several weeks.  I decided to wait on doing any updates etc so I could investigate a little more.

                      Going down the hardware failure route, I thought I ought to map out exactly what hardware is in the box.  I inherited these firewalls after someone left, so wasn't involved in the spec'ing of them.

                      What I have found, and wasn't expecting, is that they are running on SSDs.  dmesg shows the following:
                      ad4: 76319MB <INTEL SSDSA2CW080G3 4PC10362> at ata2-master UDMA100 SATA 3Gb/s

                      That has given me a new path to go down, especially after reading kejianshi's comment previously about SSD hardware error recovery hanging their system.

                      My problem is that I'm not that familiar with BSD.  What tools are available to me on the pfsense installation that would help me diagnose a faulty/failing SSD?  I have found sysctl is installed, but I don't know what things I should be looking for.

                      Any suggestions gratefully accepted.

                        stephenw10 Netgate Administrator

                        Those are some nice disks not known to fail prematurely. How long have they been in service?
                        Check the SMART status in the Diagnostics: menu.

                        You seem either reluctant to switch the master and backup or you already did that and I haven't realised.  ;)

                        Steve

                        Edit: This thread might help: https://forum.pfsense.org/index.php/topic,66067.0.html

                          saxmad

                          The disk has been in production usage for about 18 months.

                          I haven't switched the master/backup yet, nor upgraded either firewall to the latest software.  Trying to find the appropriate time …

                            kilko

                            Could it be a memory issue?

                            Just a note:
                            my pfSense (Alix 2D13) gets very sluggish when my ISP has had problems (which means there has been an interruption on the WAN cable).
                            I have to remove the WAN cable and insert it again to get it up and running.

                              saxmad

                              Another update:

                              Still running fine since the weekend.  I got the SMART info from the diagnostics page.  It seems to suggest the disk is OK, but maybe there are some numbers there that indicate an issue that I can't see.  As just pointed out, RAM might also be an issue.  Strange that it can freeze the firewall and then carry on some minutes later as though nothing has happened - I don't know FreeBSD well enough to know whether that is unusual behaviour.

                              
                              smartctl 6.0 2012-10-10 r3643 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)
                              Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
                              
                              === START OF INFORMATION SECTION ===
                              Model Family:     Intel 320 Series SSDs
                              Device Model:     INTEL SSDSA2CW080G3
                              Serial Number:    BTPR210202P5080BGN
                              LU WWN Device Id: 5 001517 972e41902
                              Firmware Version: 4PC10362
                              User Capacity:    80,026,361,856 bytes [80.0 GB]
                              Sector Size:      512 bytes logical/physical
                              Rotation Rate:    Solid State Device
                              Device is:        In smartctl database [for details use: -P show]
                              ATA Version is:   ATA8-ACS T13/1699-D revision 4
                              SATA Version is:  SATA 2.6, 3.0 Gb/s
                              Local Time is:    Wed Dec  4 12:02:18 2013 GMT
                              SMART support is: Available - device has SMART capability.
                              SMART support is: Enabled
                              
                              === START OF READ SMART DATA SECTION ===
                              SMART overall-health self-assessment test result: PASSED
                              
                              General SMART Values:
                              Offline data collection status:  (0x00)	Offline data collection activity
                              					was never started.
                              					Auto Offline Data Collection: Disabled.
                              Self-test execution status:      (   0)	The previous self-test routine completed
                              					without error or no self-test has ever 
                              					been run.
                              Total time to complete Offline 
                              data collection: 		(    1) seconds.
                              Offline data collection
                              capabilities: 			 (0x75) SMART execute Offline immediate.
                              					No Auto Offline data collection support.
                              					Abort Offline collection upon new
                              					command.
                              					No Offline surface scan supported.
                              					Self-test supported.
                              					Conveyance Self-test supported.
                              					Selective Self-test supported.
                              SMART capabilities:            (0x0003)	Saves SMART data before entering
                              					power-saving mode.
                              					Supports SMART auto save timer.
                              Error logging capability:        (0x01)	Error logging supported.
                              					General Purpose Logging supported.
                              Short self-test routine 
                              recommended polling time: 	 (   1) minutes.
                              Extended self-test routine
                              recommended polling time: 	 (   1) minutes.
                              Conveyance self-test routine
                              recommended polling time: 	 (   1) minutes.
                              SCT capabilities: 	       (0x003d)	SCT Status supported.
                              					SCT Error Recovery Control supported.
                              					SCT Feature Control supported.
                              					SCT Data Table supported.
                              
                              SMART Attributes Data Structure revision number: 5
                              Vendor Specific SMART Attributes with Thresholds:
                              ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
                                3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
                                4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
                                5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
                                9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       13812
                               12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
                              170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
                              171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
                              172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
                              183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
                              184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
                              187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
                              192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       13
                              199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
                              225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       41866
                              226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       682
                              227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       0
                              228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       828574
                              232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
                              233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
                              241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       41866
                              242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       284
                              
                              SMART Error Log Version: 1
                              No Errors Logged
                              
                              SMART Self-test log structure revision number 1
                              No self-tests have been logged.  [To run self-tests, use: smartctl -t]
                              
                              SMART Selective self-test log data structure revision number 0
                              Note: revision number not 1 implies that no selective self-test has ever been run
                               SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
                                  1        0        0  Not_testing
                                  2        0        0  Not_testing
                                  3        0        0  Not_testing
                                  4        0        0  Not_testing
                                  5        0        0  Not_testing
                              Selective self-test flags (0x0):
                                After scanning selected spans, do NOT read-scan remainder of disk.
                              If Selective self-test is pending on power-up, resume after 0 minute delay.
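For anyone unsure which numbers in an attribute table like the one above deserve attention, here is a rough Python sketch that flags two things: a normalized VALUE at or below a non-zero THRESH, and a non-zero RAW_VALUE on a handful of error counters. The choice of counters in `ERROR_COUNTERS` is an assumption made for this example, not a vendor recommendation, and the "bad" row at the end is invented to show what a hit looks like.

```python
import re

# Attribute IDs treated as error counters here (an assumption for this sketch).
ERROR_COUNTERS = {5, 171, 172, 183, 184, 187, 199}

# ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def suspicious_attributes(smart_text):
    """Return (id, name, reason) tuples for rows that look worth a second look."""
    flagged = []
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # not an attribute row
        attr_id, name = int(match.group(1)), match.group(2)
        value, _worst, thresh, raw = (int(match.group(i)) for i in range(3, 7))
        if thresh > 0 and value <= thresh:
            flagged.append((attr_id, name, "normalized value at or below threshold"))
        elif attr_id in ERROR_COUNTERS and raw > 0:
            flagged.append((attr_id, name, f"raw error count {raw}"))
    return flagged


# Rows copied from the smartctl output in this post.
posted_rows = """
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
"""

# Invented row, only to demonstrate a flagged attribute.
invented_bad_row = (
    "187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       12"
)

print(suspicious_attributes(posted_rows))
print(suspicious_attributes(invented_bad_row))
```

Run against the rows posted here it flags nothing, which matches the PASSED self-assessment; a clean table does not rule out firmware stalls, though.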
                              
                              
                                stephenw10 Netgate Administrator

                                Hmm, yes looks OK. Media wearout still at 0% despite having written 1.3TB in 575 days. Which is what I'd expect to see from a quality Intel SSD.
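For reference, those two figures can be reproduced from the raw SMART values in the output posted above (Host_Writes_32MiB and Power_On_Hours). This is just the arithmetic written out as a small Python sketch; it assumes the raw host-writes value counts units of 32 MiB, as the attribute name suggests.

```python
host_writes_32mib = 41866  # raw value of Host_Writes_32MiB in the posted output
power_on_hours = 13812     # raw value of Power_On_Hours in the posted output

bytes_written = host_writes_32mib * 32 * 1024**2
days_powered = power_on_hours / 24

print(f"{bytes_written / 1024**4:.2f} TiB ({bytes_written / 1000**4:.2f} TB) written")
print(f"{days_powered:.1f} days powered on")
print(f"{bytes_written / 1024**3 / days_powered:.2f} GiB written per day on average")
```

That works out to about 1.28 TiB (1.40 TB) over 575.5 days, so the "1.3TB in 575 days" above is the binary-unit figure rounded.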

                                Something else then. Bad RAM almost always results in a complete failure rather than a delay.

                                Something you could try if you have the patience/luck is to run top in a console and catch what process is using the cpu time when it stalls.
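If watching top live is impractical, one alternative is to save its text output periodically and inspect the snapshot nearest the stall afterwards. The sketch below is Python intended to run wherever the saved output ends up, not on the firewall itself. It assumes a FreeBSD-style process table with a `PID ... WCPU COMMAND` header; the function name `busiest_processes` and the sample snapshot (including its process names) are made up for illustration.

```python
def busiest_processes(top_output, limit=3, ignore_prefix="idle"):
    """Return (cpu_percent, pid, command) tuples, busiest first."""
    lines = top_output.splitlines()
    header_idx = next(
        (i for i, line in enumerate(lines) if line.split()[:1] == ["PID"]), None
    )
    if header_idx is None:
        return []  # no process table found in this snapshot
    columns = lines[header_idx].split()
    cpu_col = columns.index("WCPU") if "WCPU" in columns else columns.index("CPU")
    rows = []
    for line in lines[header_idx + 1:]:
        # COMMAND is the last column and may contain spaces, so cap the split.
        parts = line.split(None, len(columns) - 1)
        if len(parts) < len(columns):
            continue
        try:
            cpu = float(parts[cpu_col].rstrip("%"))
        except ValueError:
            continue
        if parts[-1].startswith(ignore_prefix):
            continue  # the idle process would otherwise always rank first
        rows.append((cpu, parts[0], parts[-1]))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]


# Invented snapshot in the general shape of FreeBSD top output.
SAMPLE = """\
last pid: 12345;  load averages:  0.10,  0.08,  0.05
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
   11 root        2 171 ki31     0K    32K RUN     0  99.9H 97.50% idle
  321 root        1  76    0  1000K   800K select  1   0:42 85.00% example_daemon
  654 root        1  44    0  2000K  1500K kqread  0   0:03  1.20% other_proc
"""

for cpu, pid, command in busiest_processes(SAMPLE):
    print(f"{cpu:5.1f}%  pid {pid}  {command}")
```

The column layout differs between top versions, which is why the sketch looks up the CPU column from the header instead of hard-coding a position.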

                                Steve

                                  jasonlitka

                                  Have you switched over to the secondary box yet?  If not, you really need to do that to see if the problem goes away.  Excluding VPN traffic, this is an online action and is accomplished with a single button click.

                                  I can break anything.

                                    stephenw10 Netgate Administrator

                                    Yep. Though I fully understand why you might be hesitant to try it in the middle of a work day when the box has an undiagnosed issue.  ;)

                                    Steve

                                      jasonlitka

                                      @stephenw10:

                                      Yep. Though I fully understand why you might be hesitant to try it in the middle of a work day when the box has an undiagnosed issue.  ;)

                                      Steve

                                      Sure, but if the thing is really breaking every single day anyway, I'm honestly confused as to why he hasn't just turned it off at a failure point.  Either the backup box will work or it won't.  Better to find out now than later when the first box flakes out permanently.

                                      I can break anything.

                                      Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.