Netgate Discussion Forum

    Possible CRON Issue, Routing Dies @ Top Of Each Hour

    • stephenw10 (Netgate Administrator)

      Check the system log when it happens. I'd expect to see something if the firewall stops passing traffic.

      Do you see the outage on multiple connected clients? At the same time?

      Steve

      • House Of Cards

        @stephenw10 I'll have to re-enable writing system logs to disk. I had read that log rotation could cause some restarted services, so I turned it off to rule it out, and because it's better for the SSD.

        The fact is, I looked through all the logs when I was keeping them, but never saw anything that gave me, with my limited knowledge, a clue that something was wrong. For example, there wouldn't even be any entries anywhere during the few minutes it was down, or right before/after. Like it had nothing going on. On the dashboard widget I do notice, if it's open at the time, that my states jump up when it's happening and then drop slowly back down. But if the dashboard isn't open I can't even get to the GUI. The GUI is very unresponsive when the system hangs. The dashboard will even time out if I'm already in...

        It's as though the whole system hangs, and the states reload. Even though no feed updates or anything are occurring.

        And the firewall doesn't stop passing traffic. If I'm streaming something from my media server, for example, playback continues uninterrupted. But if I exit the stream and go back to the server, things won't load or update, so I can't play something new. It's like the states aren't there and new connections aren't happening, or are happening really slowly.

        And yes, it happens on everything when it happens, whether it's an internal connection, or something over the WAN...

        • stephenw10 (Netgate Administrator)

          It could be something else on the network flooding it and then getting blocked. Perhaps a loop that gets created somehow.

          If the states spike, it's not something blocking traffic. Connections are arriving at the firewall and opening states.

          It could be hosts attempting to open connections multiple times when they fail. If those are opening states on WAN it also could be something upstream failing to pass traffic.

          • House Of Cards

            @stephenw10 I may have jumped the gun on states being the culprit.

            I've been watching them closely at the top of the hour, and they don't actually jump up there when the system becomes unresponsive. They do jump once in a while when I look at the history, but it's not coinciding with the unresponsiveness. I must have seen it happen as a coincidence, and thought that was happening each time.

            I run 2500-4000 states at normal operation. Once in a while, I see that jump to 15000, but I haven't investigated what is causing that. Could be a reload of some IP list or something.

            What I did notice is that if I watch the hardware at the top of each hour, the hard drive goes active and thrashes around writing for 1-2 full minutes. It's at that point when everything hangs. Watching the GUI doesn't show anything abnormal as far as CPU usage, or drastic memory changes... But it's reading/writing like crazy when the unresponsiveness happens.

            The logs are pointless in the GUI. They don't show a thing going on to tell me what is happening when this occurs. Only "warning" I saw anywhere was that it took 4+ seconds to write the data to disk from vnstatd (Traffic Totals). But it wasn't at the time of the unresponsiveness.

            So I'm lost... I'm going to upgrade to 24.03 which just came out today and see if it changes. In the meantime, any idea how to check the disk usage that's happening? Any way to see what is doing the disk thrashing?

            Thanks!

            • stephenw10 (Netgate Administrator)

              Try running: top -HaSP -m io

              Like:

              last pid: 83078;  load averages:  0.23,  0.27,  0.25                                                                     up 0+07:03:24  23:21:55
              290 threads:   5 running, 259 sleeping, 26 waiting
              CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
              CPU 1:  0.0% user,  0.0% nice,  6.2% system,  0.0% interrupt, 93.8% idle
              CPU 2:  0.8% user,  0.0% nice,  0.4% system,  0.0% interrupt, 98.8% idle
              CPU 3:  0.0% user,  0.0% nice,  0.7% system,  0.0% interrupt, 99.3% idle
              Mem: 33M Active, 195M Inact, 384M Wired, 7238M Free
              ARC: 108M Total, 28M MFU, 76M MRU, 412K Anon, 702K Header, 2812K Other
                   81M Compressed, 182M Uncompressed, 2.24:1 Ratio
              Swap: 1024M Total, 1024M Free
              
                PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
                 11 root          12     36      0      0      0      0   0.00% [idle{idle: cpu0}]
                 11 root          17    139      0      0      0      0   0.00% [idle{idle: cpu3}]
                 11 root         156     20      0      0      0      0   0.00% [idle{idle: cpu2}]
                 11 root          20     25      0      0      0      0   0.00% [idle{idle: cpu1}]
                  0 root           2      0      0      0      0      0   0.00% [kernel{e6000sw0 taskq}]
                 12 root         120      0      0      0      0      0   0.00% [intr{swi0: uart uart}]
              76247 root         122      0      0      0      0      0   0.00% top -HaSP -m io
                  7 root          19      0      0      0      0      0   0.00% [pf purge]
              64885 root          20      0      0      0      0      0   0.00% /usr/sbin/powerd -b hadp -a hadp -n hadp
                  0 root           4      0      0      0      0      0   0.00% [kernel{if_config_tqg_0}]
              32001 root          14      0      0      0      0      0   0.00% /bin/sh /root/7100_fan.sh
                  2 root          22      0      0      0      0      0   0.00% [clock{clock (0)}]
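
If the output is saved to a file, the rows with actual disk activity can be pulled out and ranked. A minimal sketch, not something from pfSense itself: `rank_io` is a hypothetical helper name, and it assumes the column layout shown above (PID, USERNAME, VCSW, IVCSW, READ, WRITE, FAULT, TOTAL, PERCENT, COMMAND).

```shell
#!/bin/sh
# rank_io (hypothetical helper): read saved `top -m io` text on stdin and
# print TOTAL, READ, WRITE and COMMAND for every row with non-zero TOTAL,
# busiest first. Assumes the 10-column layout from the sample output.
rank_io() {
    awk '
        # The process table starts after the header row beginning with PID.
        $1 == "PID" { in_table = 1; next }
        # A new display starts with the "last pid:" summary line.
        $1 == "last" { in_table = 0 }
        # Keep rows with a numeric PID and a non-zero TOTAL column.
        in_table && $1 ~ /^[0-9]+$/ && NF >= 10 && $8 + 0 > 0 {
            cmd = $10
            for (i = 11; i <= NF; i++) cmd = cmd " " $i
            printf "%d\t%d\t%d\t%s\n", $8, $5, $6, cmd
        }
    ' | sort -rn
}
```

Usage would be something like `rank_io < /tmp/top_io.txt`, where the file path is only an example. FreeBSD's top(1) documents `-b` (batch mode) and `-d count` (number of displays) for capturing output non-interactively; whether a capture taken that way shows useful rows on a given system is not something verified here.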
              
              • House Of Cards

                @stephenw10 Does this need to be done using SSH or something? I don't have that set up, and if I put this into the command line tool of the GUI, I only get this...

                last pid: 73517;  load averages:  0.77,  0.87,  1.05  up 0+13:51:11    12:33:22
                324 threads:   4 running, 303 sleeping, 17 waiting
                CPU 0: 10.9% user,  0.8% nice,  3.2% system,  0.2% interrupt, 84.8% idle
                CPU 1:  9.2% user,  0.8% nice,  1.9% system,  0.1% interrupt, 88.0% idle
                Mem: 646M Active, 2845M Inact, 1498M Laundry, 1407M Wired, 434M Buf, 1716M Free
                Swap: 3881M Total, 1805M Used, 2077M Free, 46% Inuse
                
                

                There isn't anything else...

                • stephenw10 (Netgate Administrator)

                  Yes, that's interactive, so it can't be run from the GUI command prompt. The Diagnostics > System Activity page shows the top output, but in CPU mode, not IO.

                  • House Of Cards

                    @stephenw10 I may have found the culprit...

                    Screenshot_20240424_150033.png

                    I shut down NTOPNG, and the last time I remembered to check at the top of the hour, I didn't see any sluggishness. I'll watch with it disabled and see if that was the issue.

                    Is NTOPNG known to do things like this, or is this a bug? I can't be certain, but I started noticing these periods of things not loading around the same timeframe I did the switch from CE to Plus. This was never an issue for years like this... And I'm not sure if this would be an issue with NTOPNG or PFSENSE. But if I can confirm this was the cause of the loss in connectivity, I'm happy to submit whatever you guys need to determine why this would happen. It could be affecting others.

                    On the other hand, my system is an older machine. Maybe I'm just asking too much of it? This is a home environment, and the information in NTOPNG is fun to look at, but I don't really need it running... My hardware should be more than capable for a modest firewall with limited workload though.

                    I'll report back if this was the issue...

                    • stephenw10 (Netgate Administrator)

                      Hmm, ntopng does generate a lot of logging and can use significant CPU but I've never heard of it stopping routing like that.

                      Is it a very slow disk?

                      • House Of Cards

                        @stephenw10 No, not a slow disk... It's a SATA III SSD... It was a budget build, but more than adequate for what I'm doing.

                        https://www.newegg.com/kingston-a400-120gb/p/N82E16820242399?Item=N82E16820242399

                        This does seem to have stopped the issue, so I'm going to wipe/reinstall NTOPNG, and leave it disabled. I can enable it if I want to do some troubleshooting, but in a home environment, it doesn't justify the wear and tear on the SSD to run all the time.

                        If I can do anything to help troubleshoot why it would kill the browsing altogether, let me know. I appreciate the insight into figuring this out. 👍

                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.