Netgate Discussion Forum

Pfsense hangs every two weeks!

Hardware
27 Posts 6 Posters 7.0k Views
  • F
    fneto
    last edited by Jul 10, 2013, 3:33 PM

    Hi all!

    I'm having a problem with a pfSense installation that stops working every 2 weeks. I have searched through the installation and the log files, and the only thing I think may be the root of the problem is an IRQ storm, although there are no IRQ storm messages in the system logs.

    Below is the data I'm getting from pfSense right now:

    last pid: 39621;  load averages:  0.42,  0.21,  0.14  up 2+20:07:29    12:27:54
    194 processes: 6 running, 167 sleeping, 21 waiting

    Mem: 371M Active, 1337M Inact, 202M Wired, 56K Cache, 112M Buf, 1252M Free
    Swap: 8192M Total, 8192M Free

    PID USERNAME PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
      11 root    171 ki31    0K    32K CPU3    3  67.4H 100.00% {idle: cpu3}
      11 root    171 ki31    0K    32K RUN    2  67.3H 97.66% {idle: cpu2}
      11 root    171 ki31    0K    32K CPU1    1  67.4H 94.97% {idle: cpu1}
      12 root    -64    -    0K  176K CPU0    0  37.4H 60.25% {irq18: atapci0+}
      11 root    171 ki31    0K    32K RUN    0  29.2H 35.06% {idle: cpu0}
    38589 proxy    51    0 50824K 34776K select  3  48:51 12.50% squid
      12 root    -28    -    0K  176K WAIT    0  62:00  6.98% {swi5: +}
    5906 proxy    45    0 59676K 20600K sbwait  2  0:39  2.29% squidGuard
      12 root    -68    -    0K  176K WAIT    0  6:51  0.88% {irq17: re0 ehci1}
      12 root    -32    -    0K  176K WAIT    0  5:55  0.00% {swi4: clock}
      14 root    -16    -    0K    8K -      1  5:44  0.00% yarrow
      12 root    -68    -    0K  176K WAIT    0  3:40  0.00% {irq16: re1 ehci0}
      243 root      76  20  3408K  1184K kqread  2  2:49  0.00% check_reload_status
    33706 root      44    0  6080K  6104K select  3  2:30  0.00% ntpd
    36169 root      44    0 12008K  6012K select  1  1:52  0.00% nmbd
    47036 root      44    0 18036K 11008K select  2  1:35  0.00% winbindd
    12702 root      44    0  4956K  2556K select  1  1:18  0.00% syslogd
    37681 root      44    0 17012K  9688K select  3  0:59  0.00% winbindd

    $ vmstat -i
    interrupt                          total      rate
    irq16: re1 ehci0                33542771        136
    irq17: re0 ehci1                65189415        265
    irq18: atapci0+                428362821      1746
    cpu0: timer                    490546107      2000
    cpu1: timer                    490545694      1999
    cpu3: timer                    490545694      1999
    cpu2: timer                    490545692      1999
    Total                        2489278194      10149

    As you can see, this is very strange: I have only a SATA disk and an idle CD-ROM drive plugged into the controller.

    The machine is a Dell OptiPlex PC with a Core i3 processor and 4 GB of RAM.

    Everything I have read points to this as the root cause, so I'd like to know whether there is a system tunable in FreeBSD/pfSense that I can set to stop this crazy behavior. I read that one test is to change the controller mode to native, AHCI, or IDE to see if that fixes the problem, but this is a production machine and we can't reboot it during work hours. That's why I'm looking for other alternatives before I start rebooting the server.

    Thanks!

    • J
      jimp Rebel Alliance Developer Netgate
      last edited by Jul 10, 2013, 8:23 PM

      The + means that there is something else sharing that IRQ with your storage controller. Look through /var/log/dmesg.boot and see if anything else mentions irq 18  (grep 'irq 18' /var/log/dmesg.boot).

      Watch vmstat -i a bit to see what the interrupts are doing.

      Run top -SH from the shell and press "m" to switch to I/O mode and see if anything stands out there.
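
One way to "watch vmstat -i" is to compare two snapshots, since the rate column that vmstat -i prints is an average since boot and can hide a burst. The sketch below is only an illustration: the helper names irq_total and irq_rate, the snapshot file paths, and the 10-second interval are assumptions, not anything posted in this thread.

```shell
#!/bin/sh
# irq_total FILE IRQ
# Print the cumulative "total" count for IRQ (e.g. irq18) from a saved
# `vmstat -i` snapshot. Device names can contain spaces ("re0 ehci1"),
# so the total is read as the second-to-last field.
irq_total() {
    awk -v irq="$2:" '$1 == irq { print $(NF-1) }' "$1"
}

# irq_rate BEFORE AFTER SECONDS IRQ
# Print interrupts per second for IRQ between two snapshots taken
# SECONDS apart (integer division).
irq_rate() {
    before=$(irq_total "$1" "$4")
    after=$(irq_total "$2" "$4")
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "IRQ $4 not found in snapshots" >&2
        return 1
    fi
    echo $(( (after - before) / $3 ))
}

# Example on the firewall (hypothetical file names):
#   vmstat -i > /tmp/irq.before; sleep 10; vmstat -i > /tmp/irq.after
#   irq_rate /tmp/irq.before /tmp/irq.after 10 irq18
```

If the interval rate for irq18 is far above the since-boot average while the box is misbehaving, that would point at a burst on the shared storage IRQ rather than a steady load.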

      • F
        fneto
        last edited by Jul 10, 2013, 8:29 PM

        The grep command shows:

        pcib3: <ACPI PCI-PCI bridge> irq 18 at device 28.2 on pci0
        atapci0: <Intel ATA controller> port 0x5110-0x5117,0x5100-0x5103,0x50f0-0x50f7,0x50e0-0x50e3,0x50d0-0x50df,0x50c0-0x50cf irq 18 at device 31.2 on pci0
        atapci1: <Intel ATA controller> port 0x50b0-0x50b7,0x50a0-0x50a3,0x5090-0x5097,0x5080-0x5083,0x5070-0x507f,0x5060-0x506f irq 18 at device 31.5 on pci0

        and top -SH didn't show me anything unusual, as you can see:

        last pid: 58807;  load averages:  0.10,  0.07,  0.06    up 3+01:09:04  17:29:29
        196 processes: 6 running, 169 sleeping, 21 waiting
        CPU:  1.0% user,  0.0% nice,  1.2% system, 15.1% interrupt, 82.7% idle
        Mem: 375M Active, 1547M Inact, 200M Wired, 56K Cache, 112M Buf, 1041M Free
        Swap: 8192M Total, 8192M Free

        PID USERNAME  VCSW  IVCSW  READ  WRITE  FAULT  TOTAL PERCENT COMMAND
          11 root        48 128158      0      0      0      0  0.00% {idle: cpu1}
          11 root        48 128158      0      0      0      0  0.00% {idle: cpu3}
          11 root        48 128158      0      0      0      0  0.00% {idle: cpu2}
          12 root    136977  7914      0      0      0      0  0.00% {irq18: atapci0+}
          11 root        48 128158      0      0      0      0  0.00% {idle: cpu0}
          12 root    136977  7914      0      0      0      0  0.00% {swi5: +}
        38589 proxy      295    74      0      0      0      0  0.00% squid
          12 root    136977  7914      0      0      0      0  0.00% {irq17: re0 ehci1}
          12 root    136977  7914      0      0      0      0  0.00% {swi4: clock}
          14 root          7      0      0      0      0      0  0.00% yarrow
          12 root    136977  7914      0      0      0      0  0.00% {irq16: re1 ehci0}
        33706 root        103      0      0      0      0      0  0.00% ntpd
          243 root          0      0      0      0      0      0  0.00% check_reload_status
        47036 root        23      3      0      0      0      0  0.00% winbindd
        36169 root        13      0      0      0      0      0  0.00% nmbd
        12702 root          2      0      0      0      0      0  0.00% syslogd
        37681 root        10      0      0      0      0      0  0.00% winbindd
            0 root          0      0      0      0      0      0  0.00% {swapper}
        12049 root          7      0      0      0      0      0  0.00% logger
        11891 root          2      0      0      0      0      0  0.00% tcpdump
          12 root    136977  7914      0      0      0      0  0.00% {swi1: netisr 0}
          22 root          1      0      0      0      0      0  0.00% syncer
          12 root    136977  7914      0      0      0      0  0.00% {swi4: clock}
            8 root          1      0      0      0      0      0  0.00% pfpurge
        54408 proxy        20      0      0      0      0      0  0.00% squidGuard
        37113 root          0      0      0      0      0      0  0.00% sh

        • J
          jimp Rebel Alliance Developer Netgate
          last edited by Jul 10, 2013, 8:39 PM

          It was very idle in that output compared to the last one.

          The fact that it's all storage controllers on that IRQ means it would almost have to be a disk or optical drive on there causing it.

          Are there any BIOS options for IRQs or PnP or similar?

          • W
            wallabybob
            last edited by Jul 10, 2013, 11:49 PM

            @fneto:

            I'm having a problem with a pfSense installation that stops working every 2 weeks.

            What build of pfSense?

            Please provide more details of what you mean by "stops working". Does it shut down by itself? Stop forwarding packets? Stop responding to console keypresses?

            • F
              fneto
              last edited by Jul 11, 2013, 12:21 AM

              First, thanks for your help and support!

              Below are the details that I think are relevant to finding the solution to this problem.

              The hardware:
              Dell Optiplex 390
              Core i3-2120
              4 GB of RAM
              1 DVD-RW
              1 SATA 500GB Hard drive
              2 x Realtek 8111E Gigabit (1 onboard on IRQ 16 and 1 add-in PCI Express card on IRQ 17)
              Only one VGA monitor and the Dell keyboard are plugged into this machine.

              As I said, this is a production machine and it is very difficult for me to reboot it and check all the BIOS options, but I really don't remember any IRQ or PnP options on this machine. I only remember that when I was installing, I could install only in one specific SATA mode, but I don't remember now which mode it was.

              What I mean by "stops working" is that every 15 days exactly the machine stops forwarding packets on the network. One time, when nobody was there to reboot the server, it started working again after 10 minutes; the other times it happened, we ran to the machine and restarted it from the console by choosing "Reboot Server".

              So the problem doesn't freeze the console, but it stops packet forwarding.

              If you need more information, feel free to ask!

              Thanks!

              • W
                wallabybob
                last edited by Jul 11, 2013, 12:45 AM

                What build of pfSense are you using? (See the version string on the home page for the box.)

                @fneto:

                What I mean by "stops working" is that every 15 days exactly the machine stops forwarding packets on the network. One time, when nobody was there to reboot the server, it started working again after 10 minutes; the other times it happened, we ran to the machine and restarted it from the console by choosing "Reboot Server".

                Before restarting the computer it would be good to get the output of the shell command:

                netstat -m

                • F
                  fneto
                  last edited by Jul 11, 2013, 5:47 PM

                  Sorry for the delay. The build I'm using on this server is: 2.0.2-RELEASE (i386) built on Fri Dec 7 16:30:38 EST 2012

                  Below is the output of the command you suggested, but the server has been running for only 4 days now.

                  The one time the server came back without a reboot, I analysed the RRD graphs and saw a short network outage, as you can see in the image attached to this post. The blue circle shows the time at which the server failed.

                  $ netstat -m
                  518/2557/3075 mbufs in use (current/cache/total)
                  4/1408/1412/131072 mbuf clusters in use (current/cache/total/max)
                  3/893 mbuf+clusters out of packet secondary zone in use (current/cache)
                  1/215/216/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
                  512/593/1105/6400 9k jumbo clusters in use (current/cache/total/max)
                  0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
                  4749K/9652K/14401K bytes allocated to network (current/cache/total)
                  0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
                  0/0/0 requests for jumbo clusters denied (4k/9k/16k)
                  0/10/6656 sfbufs in use (current/peak/max)
                  0 requests for sfbufs denied
                  0 requests for sfbufs delayed
                  0 requests for I/O initiated by sendfile
                  0 calls to protocol drain routines

                  graph.png

                  • W
                    wallabybob
                    last edited by Jul 12, 2013, 9:03 AM

                    A single report from netstat is not sufficient to establish a trend. A single snapshot at the time of the "hang" would be useful to see if mbuf usage contributes to the hang.
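
To build the trend that a single report cannot give, the netstat -m output can be reduced to a few numbers and logged at intervals, so that the last entries before a hang are still on disk. This is a rough sketch under assumptions: the helper name mbuf_summary, the log path, and the 5-minute interval are made up for illustration.

```shell
#!/bin/sh
# mbuf_summary
# Read `netstat -m` output on stdin and print four numbers:
#   clusters_current clusters_total clusters_max denied_requests
# taken from the "mbuf clusters in use" and "requests for mbufs denied" lines.
mbuf_summary() {
    awk '
        / mbuf clusters in use/      { split($1, c, "/") }
        / requests for mbufs denied/ { split($1, d, "/") }
        END { print c[1], c[3], c[4], d[1] + d[2] + d[3] }
    '
}

# Example logging loop on the firewall (hypothetical path and interval):
#   while :; do
#       echo "$(date '+%Y-%m-%d %H:%M:%S') $(netstat -m | mbuf_summary)" >> /root/mbuf.log
#       sleep 300
#   done
```

A total that keeps climbing toward the max, or a denied count above zero shortly before the outage, would support the mbuf theory; flat numbers would argue against it.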

                    The System -> Processor RRD graph shows number of processes. Is this graph flat or does it increase up to the time of the hangs then drop significantly on the reboots? (PERHAPS you are running low on free memory because something is starting new processes which aren't terminated.)

                    • K
                      kejianshi
                      last edited by Jul 12, 2013, 11:49 AM

                      Have you tried 2.0.3, or is the install time too much downtime?

                      • F
                        fneto
                        last edited by Jul 12, 2013, 2:03 PM

                        Unfortunately I don't have the processor and memory graphs from that day, but I have attached the processor and memory graphs from the last few days; maybe they can help.

                        The server was turned off because of major maintenance on the building's electrical power. The memory usage looks strange to me, but I'd like to hear your opinion!

                        About the upgrade to the latest version, 2.0.3: we haven't done it yet because I work about 120 miles from the main building and this PC is running a manually compiled and installed Realtek 8111E driver. We are afraid that after the update the system will lose the network driver (stored in /boot and loaded from loader.conf) and we won't be able to bring the server back up.

                        So we need to schedule a visit there to do the upgrade and, if necessary, manually install the network driver again!

                        memory1.png
                        memory2.png
                        memory3.png
                        processor.png

                        • K
                          kejianshi
                          last edited by Jul 12, 2013, 2:13 PM

                          Have them reboot every 3 days in the dead of night then if you don't get it worked out.

                          However, it looks like something one of mine was doing.  MBUFS and CPU usage climbing and climbing.

                          I reinstalled, made the changes recommended for the mbufs and for the specific NICs I have, and the issue never returned.

                          But that doesn't sound like an option for you, so I'd recommend reboots as a cron job.

                          • K
                            kejianshi
                            last edited by Jul 12, 2013, 2:18 PM

                            Are you running squid?

                            Never mind.  I see it.

                            What are your memory cache settings?

                            • F
                              fneto
                              last edited by Jul 12, 2013, 2:23 PM

                              Hi kejianshi, we are running Squid and squidGuard on this server. Which mbuf parameters should I verify/change on the server?

                              Currently I have only this in System Tunables: kern.ipc.nmbclusters="131072"

                              Thanks!

                              • K
                                kejianshi
                                last edited by Jul 12, 2013, 2:25 PM

                                Squid cache settings please?

                                • F
                                  fneto
                                  last edited by Jul 12, 2013, 2:37 PM

                                  The Squid settings are attached.

                                  squid1.png
                                  squid2.png

                                  • F
                                    fneto
                                    last edited by Jul 12, 2013, 2:37 PM

                                    The Squid settings are attached.

                                    squid3.png
                                    squid4.png

                                    • K
                                      kejianshi
                                      last edited by Jul 12, 2013, 2:43 PM

                                      Squid doesn't seem OK to me. It seems to me there is far too much HD cache given his RAM.

                                      • K
                                        kejianshi
                                        last edited by Jul 12, 2013, 2:44 PM

                                        How much Ram does this box have?

                                        • K
                                          kejianshi
                                          last edited by Jul 12, 2013, 2:52 PM Jul 12, 2013, 2:49 PM

                                          I'll put it this way. I have several times your RAM with basically the same size cache stipulated, and I'll hit 35% in a couple of days of running, 40% sometimes. Mine used to crash daily until I reduced my disk cache and mem cache. Indexing 40 GB of drive can take upwards of 2 GB of RAM or more if the cache is full of lots of little things.
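
As a back-of-the-envelope check of that figure, here is a small sketch of the arithmetic. It assumes the commonly quoted Squid rule of thumb of roughly 10 MB of index RAM per GB of disk cache at a mean object size of about 13 KB, and it further assumes that the index grows in proportion to the number of cached objects. The function name squid_index_mb is hypothetical, and the result is an estimate, not a measurement from this box.

```shell
#!/bin/sh
# squid_index_mb DISK_CACHE_GB [AVG_OBJECT_KB]
# Rough index RAM in MB for a Squid disk cache.
# Assumption: ~10 MB per GB of disk cache at a 13 KB mean object size,
# scaled up proportionally when the mean object is smaller.
squid_index_mb() {
    disk_gb=$1
    avg_kb=${2:-13}
    echo $(( disk_gb * 10 * 13 / avg_kb ))
}

# 40 GB of disk cache at the assumed 13 KB mean object size:
#   squid_index_mb 40      -> 400
# The same cache filled with small objects averaging 2 KB:
#   squid_index_mb 40 2    -> 2600
```

Under these assumptions a 40 GB cache needs a few hundred MB for the index with typical objects, and can pass 2 GB when the cache is dominated by very small objects, which is in line with the experience described above. That index memory comes on top of the memory cache setting and the rest of the system.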
